Title: | Tools for 'iNZight' |
---|---|
Description: | Provides a collection of wrapper functions for common variable and dataset manipulation workflows primarily used by 'iNZight', a graphical user interface providing easy exploration and visualisation of data for students of statistics, available in both desktop and online versions. Additionally, many of the functions return the 'tidyverse' code used to obtain the result in an effort to bridge the gap between GUI and coding. |
Authors: | Tom Elliott [aut, cre], Daniel Barnett [aut], Yiwen He [aut], Zhaoming Su [aut], Lushi Cai [ctb], Akshay Gupta [ctb], Owen Jin [ctb], Christoph Knopf [ctb] |
Maintainer: | Tom Elliott <[email protected]> |
License: | GPL-3 |
Version: | 2.0.1 |
Built: | 2024-11-11 05:32:04 UTC |
Source: | https://github.com/inzightvit/inzighttools |
When creating new variables or modifying the data set, we often add a suffix to distinguish the new name from the original one. However, if the same action is performed twice (for example, filtering a data set), the suffix is duplicated (data.filtered.filtered). This function avoids this by adding the suffix if it doesn't exist, and otherwise appending a counter (data.filtered2).
add_suffix(name, suffix)
name |
a character vector containing (original) names |
suffix |
the suffix to add, a length-one character vector |
character vector of names with suffix appended
add_suffix("data", "filtered")
add_suffix(c("data.filtered", "data.filtered.reshaped"), "filtered")
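The rule described above (append the suffix if it is absent, otherwise bump a counter) can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: `add_suffix_sketch` is a hypothetical name, the suffix is assumed to contain only letters and digits, and only a suffix at the very end of the name is detected, so names with the suffix in the middle may be handled differently by the real function.

```r
# Hypothetical sketch of the suffix rule; not the package's own code.
add_suffix_sketch <- function(name, suffix) {
  stopifnot(is.character(name), is.character(suffix), length(suffix) == 1L)
  # Match ".suffix" at the end of the name, optionally followed by a counter.
  pattern <- paste0("\\.", suffix, "([0-9]*)$")
  vapply(name, function(x) {
    m <- regmatches(x, regexec(pattern, x))[[1]]
    if (length(m) == 0L) {
      # Suffix not present yet: simply append it.
      return(paste(x, suffix, sep = "."))
    }
    # Suffix present: start counting at 2, or increment the existing counter.
    count <- if (nzchar(m[2])) as.integer(m[2]) + 1L else 2L
    sub(pattern, paste0(".", suffix, count), x)
  }, character(1), USE.NAMES = FALSE)
}

add_suffix_sketch("data", "filtered")
add_suffix_sketch("data.filtered", "filtered")
add_suffix_sketch("data.filtered2", "filtered")
```

Keeping the counter inside the same regular expression means one pattern covers both the "first repeat" and the "later repeat" cases.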
Summarizes non-categorical variables in a dataframe by grouping them based on specified categorical variables and returns the aggregated result along with the tidyverse code used to generate it.
aggregate_data(
  data,
  group_vars,
  summaries,
  vars = NULL,
  names = NULL,
  quantiles = c(0.25, 0.75)
)

aggregate_dt(
  data,
  dt,
  dt_comp,
  group_vars = NULL,
  summaries,
  vars = NULL,
  names = NULL,
  quantiles = c(0.25, 0.75)
)
data |
A dataframe or survey design object to be aggregated. |
group_vars |
A character vector specifying the variables in |
summaries |
An unnamed character vector or named list of summary functions to calculate for each group. If unnamed, the vector elements should be names of variables in the dataset for which summary statistics need to be calculated. If named, the names should correspond to the summary functions (e.g., "mean", "sd", "iqr") to be applied to each variable. |
vars |
(Optional) A character vector specifying the names of variables
in the dataset for which summary statistics need to be calculated.
This argument is ignored if |
names |
(Optional) A character vector or named list providing name templates for the newly created variables. See details for more information. |
quantiles |
(Optional) A numeric vector specifying the desired quantiles (e.g., c(0.25, 0.5, 0.75)). See details for more information. |
dt |
A character string representing the name of the date-time variable in the dataset. |
dt_comp |
A character string specifying the component of the date-time to use for grouping. |
The aggregate_data() function accepts any R function that returns a single-value summary (e.g., mean, var, sd, sum, IQR). By default, new variables are named {var}_{fun}, where {var} is the variable name and {fun} is the summary function used. The user can provide custom names using the names argument, either as a vector of the same length as vars, or as a named list where the names correspond to summary functions (e.g., "mean" or "sd").
The special summary "missing" can be included, which counts the number of missing values in the variable. The default name for this summary is {var}_missing.
If quantiles are requested, the function calculates the specified quantiles (e.g., 25th, 50th, 75th percentiles), creating new variables for each quantile. To customize the names of these variables, use {p} as a placeholder in the names argument, where {p} represents the quantile value. For example, using names = "Q{p}_{var}" will create variables like "Q0.25_Sepal.Length" for the 25th percentile.
An aggregated dataframe containing the summary statistics for each group, along with the tidyverse code used for the aggregation.
aggregate_dt(): Aggregate data by dates and times
Tom Elliott, Owen Jin, Zhaoming Su
Zhaoming Su
aggregated <- aggregate_data(iris,
  group_vars = c("Species"),
  summaries = c("mean", "sd", "iqr")
)
code(aggregated)
head(aggregated)
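The default {var}_{fun} naming described for aggregate_data() resembles what dplyr's `across()` produces with a glue-style `.names` template. The sketch below uses dplyr directly to show that naming pattern; it is an illustration of the idea and is not claimed to be the code that aggregate_data() generates.

```r
library(dplyr)

# Group by Species and summarise two variables with two functions.
# The .names template plays the role of the {var}_{fun} pattern.
summary_tbl <- iris %>%
  group_by(Species) %>%
  summarise(
    across(
      c(Sepal.Length, Sepal.Width),
      list(mean = mean, sd = sd),
      .names = "{.col}_{.fn}"
    ),
    .groups = "drop"
  )

names(summary_tbl)
```

Each new column name combines the source variable and the summary function, so the origin of every value stays visible in the output.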
Append rows to a dataset
append_rows(data, new_data, when_added = FALSE)
data |
The original dataset to which new rows will be appended. |
new_data |
The dataset containing the new rows. |
when_added |
Logical; indicates whether a |
A dataset with new rows appended below the original data.
Yiwen He, Zhaoming Su
Used to grab code from a data.frame generated by this package.
code(data)
data |
dataset you want to extract the code from |
This is simply a helper function to grab the contents of the 'code' attribute contained in the data object.
The code used to generate the data.frame, if available (else NULL)
Tom Elliott
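Since code() is described as a helper that reads the 'code' attribute of the data object, the behaviour can be sketched with base R's `attr()`. The function name `get_code_sketch` is hypothetical and the attribute is set by hand here purely for illustration.

```r
# Hypothetical stand-in for code(): read the 'code' attribute, NULL if absent.
get_code_sketch <- function(data) attr(data, "code", exact = TRUE)

df <- head(iris, 3)
attr(df, "code") <- "head(iris, 3)"

get_code_sketch(df)    # the stored string
get_code_sketch(iris)  # NULL, because iris carries no 'code' attribute
```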
Collapse values in a categorical variable into one defined level
collapse_cat(data, var, levels, new_level, name = NULL)
data |
a dataframe to collapse |
var |
a string of the name of the categorical variable to collapse |
levels |
a character vector of the levels to be collapsed |
new_level |
a string for the new level |
name |
a name for the new variable |
the original dataframe containing a new column of the collapsed variable with tidyverse code attached
Zhaoming Su
collapsed <- collapse_cat(iris,
  var = "Species",
  c("versicolor", "virginica"),
  new_level = "V"
)
cat(code(collapsed))
tail(collapsed)
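For comparison, collapsing levels can also be done in base R by assigning the same label to several levels of a factor. This sketch only illustrates the operation itself; it does not attach any code attribute and is not the tidyverse code that collapse_cat() returns.

```r
species <- iris$Species
to_merge <- c("versicolor", "virginica")

# Assigning one label to several levels merges them into a single level.
collapsed_species <- species
levels(collapsed_species)[levels(collapsed_species) %in% to_merge] <- "V"

levels(collapsed_species)
table(collapsed_species)
```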
Combines chosen variables of any class by concatenating them into one factor variable, and returns the result along with the tidyverse code used to generate it.
combine_vars(
  data,
  vars,
  sep = ":",
  name = NULL,
  keep_empty = FALSE,
  keep_na = TRUE
)
data |
a dataframe with the columns to be combined |
vars |
a character vector of the variables to be combined |
sep |
a character string to separate the levels |
name |
a name for the new variable |
keep_empty |
logical, if |
keep_na |
logical, if |
original dataframe containing new columns of the new categorical variable with tidyverse code attached
Owen Jin, Zhaoming Su
combined <- combine_vars(warpbreaks, vars = c("wool", "tension"), sep = "_")
cat(code(combined))
head(combined)
Convert specified variables into factors
convert_to_cat(data, vars, names = NULL)
data |
a dataframe with the categorical column to convert |
vars |
a character vector of column names to convert |
names |
a character vector of names for the created variables |
original dataframe containing new columns of the converted variables with tidyverse code attached
Zhaoming Su
converted <- convert_to_cat(iris, vars = c("Petal.Width"))
cat(code(converted))
head(converted)
Convert variables to dates
convert_to_date(data, vars, ord = NULL, names = NULL)
data |
a dataframe with the variables to convert |
vars |
a character vector of column names to convert |
ord |
a character vector of date-time formats |
names |
a character vector of names for the created variables |
original dataframe containing new columns of the converted variables with tidyverse code attached
Zhaoming Su
Convert variables to date-time
convert_to_datetime(data, vars, ord = NULL, names = NULL, tz = "")
data |
a dataframe with the variables to convert |
vars |
a character vector of column names to convert |
ord |
a character vector of date-time formats |
names |
a character vector of names for the created variables |
tz |
a time zone name (default: local time zone). See
|
original dataframe containing new columns of the converted variables with tidyverse code attached
Zhaoming Su
Convert a given string to a valid R variable name, converting spaces to underscores (_) instead of dots.
create_varname(x)
x |
a string to convert |
a string, which is also a valid variable name
Tom Elliott
create_varname("a new variable")
create_varname("8d4-2q5")
Creates new variables using valid R expressions and returns the result along with the tidyverse code used to generate it.
create_vars(data, vars = ".new_var", vars_expr = NULL)
data |
a dataframe to which to add new variables |
vars |
a character vector of the new variable names |
vars_expr |
a character vector of valid R expressions which can generate vectors of values |
original dataframe containing the new columns created from vars_expr with tidyverse code attached
Zhaoming Su
created <- create_vars(
  data = iris,
  vars = "Sepal.Length_less_Sepal.Width",
  "Sepal.Length - Sepal.Width"
)
cat(code(created))
head(created)
Delete variables from a dataset
delete_vars(data, vars = NULL)
data |
dataset |
vars |
variable names to delete |
dataset without chosen variables
Zhaoming Su
This function extracts a specific date component from a date-time variable in a dataframe.
extract_dt_comp(data, var, comp, name = NULL)
data |
The dataframe containing the date-time variable. |
var |
The name of the date-time variable from which to extract the component. |
comp |
The date component wanted from the variable. See
|
name |
The name of the new column to store the extracted date component. |
A dataframe with the new date component column.
Zhaoming Su
This function has been replaced by 'extract_dt_comp' and will be removed in the next release.
extract_part(.data, varname, part, name)
.data |
dataframe |
varname |
name of the variable |
part |
part of the variable wanted |
name |
name of the new column |
see 'extract_dt_comp'
Filter
Filter inzdf
## S3 method for class 'inzdf_db'
filter(.data, ..., table = NULL, .preserve = FALSE)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< |
table |
name of the table to use, defaults to first in list |
.preserve |
ignored |
This function filters a dataframe or survey design object by keeping only the rows where a specified categorical variable matches one of the given levels. The resulting filtered dataframe is returned, along with the tidyverse code used to generate it.
filter_cat(data, var, levels)
data |
A dataframe or survey design object to be filtered. |
var |
The name of the column in |
levels |
A character vector of levels in |
A filtered dataframe with the tidyverse code attached.
Owen Jin, Zhaoming Su
filtered <- filter_cat(iris,
  var = "Species",
  levels = c("versicolor", "virginica")
)
cat(code(filtered))
head(filtered)
This function filters a dataframe or survey design object by applying a specified boolean condition to one of its numeric variables. The resulting filtered dataframe is returned, along with the tidyverse code used to generate it.
filter_num(data, var, op = c("<=", "<", ">=", ">", "==", "!="), num)
data |
A dataframe or survey design object to be filtered. |
var |
The name of the column in |
op |
A logical operator to apply for the filtering condition. Valid options are: "<=", "<", ">=", ">", "==", or "!=". |
num |
The numeric value for which the specified |
A filtered dataframe with the tidyverse code attached.
Owen Jin, Tom Elliott, Zhaoming Su
filtered <- filter_num(iris, var = "Sepal.Length", op = "<=", num = 5)
cat(code(filtered))
head(filtered)

library(survey)
data(api)
svy <- svydesign(~ dnum + snum,
  weights = ~pw,
  fpc = ~ fpc1 + fpc2,
  data = apiclus2
)
svy_filtered <- filter_num(svy, var = "api00", op = "<", num = 700)
cat(code(svy_filtered))
Fit a survey design to an object
fitDesign(svydes, dataset.name)
svydes |
a design |
dataset.name |
a dataset name |
a survey object
Tom Elliott
Wrapper function for 'lm', 'glm', and 'svyglm'.
fitModel(
  y,
  x,
  data,
  family = "gaussian",
  link = switch(family,
    gaussian = "gaussian",
    binomial = "logit",
    poisson = "log",
    negbin = "log"
  ),
  design = "simple",
  svydes = NA,
  surv_params = NULL,
  ...
)
y |
character string representing the response, |
x |
character string of the explanatory variables, |
data |
name of the object containing the data. |
family |
gaussian, binomial, poisson (so far, no others will be added) |
link |
the link function to use |
design |
data design specification. one of 'simple', 'survey' or 'experiment' |
svydes |
a vector of arguments to be passed to the svydesign function, excluding data (defined above) |
surv_params |
a vector containing arguments for |
... |
further arguments to be passed to lm, glm, svyglm, such as offset, etc. |
A model call formula (using lm, glm, or svyglm)
Tom Elliott
This function creates categorical intervals from a numeric variable in the given dataset.
form_class_intervals(
  data,
  variable,
  method = c("equal", "width", "count", "manual"),
  n_intervals = 4L,
  interval_width,
  format = "(a,b]",
  range = NULL,
  format_lowest = ifelse(isinteger, "< a", "<= a"),
  format_highest = "> b",
  break_points = NULL,
  name = sprintf("%s.f", variable)
)
data |
A dataset or a survey object. |
variable |
The name of the numeric variable to convert into intervals. |
method |
The method used to create intervals:
|
n_intervals |
For methods 'equal' and 'count', this specifies the number of intervals to create. |
interval_width |
For method 'width', this sets the width of the intervals. |
format |
The format for interval labels; use 'a' and 'b' to represent the min/max of each interval, respectively. |
range |
The range of the data; use this to adjust the labels (e.g., for continuous data, set this to the floor/ceiling of the min/max of the data to get prettier intervals). If range does not cover the range of the data, values outside will be placed into 'less than a' and 'greater than b' categories. |
format_lowest |
Label format for values lower than the min of range. |
format_highest |
Label format for values higher than the max of range. |
break_points |
For method 'manual', specify breakpoints here as a numeric vector. |
name |
The name of the new variable in the resulting data set. |
A dataframe with an additional column containing categorical class intervals.
Tom Elliott, Zhaoming Su
form_class_intervals(iris, "Sepal.Length", "equal", 5L)
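The idea behind the 'equal' and 'width' methods can be illustrated with base R's `cut()`, whose default labels also use the "(a,b]" form. This is only an analogy under the assumption that 'equal' means intervals of equal width; the breakpoints and labels chosen by form_class_intervals() may differ.

```r
x <- iris$Sepal.Length

# 'equal'-style: five intervals of equal width spanning the data range.
equal_width <- cut(x, breaks = 5)

# 'width'-style: intervals of fixed width 1 over a rounded-out range,
# similar in spirit to setting range to the floor/ceiling of the data.
fixed_width <- cut(x, breaks = seq(floor(min(x)), ceiling(max(x)), by = 1))

table(equal_width)
table(fixed_width)
```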
This object allows the data to be either a standard R data.frame or a connection to a database.
inzdf(x, name, ...)

## S3 method for class 'tbl_df'
inzdf(x, name, ...)

## S3 method for class 'data.frame'
inzdf(x, name, ...)

## S3 method for class 'SQLiteConnection'
inzdf(
  x,
  name = deparse(substitute(x)),
  schema = NULL,
  var_attrs = list(),
  dictionary = NULL,
  keep_con = FALSE,
  ...
)
x |
a data.frame or db connection |
name |
the name of the data |
... |
additional arguments passed to methods |
schema |
a list specifying the schema of the database (used for linking) |
var_attrs |
nested list of variables attributes for each table > variable |
dictionary |
an inzdict object |
keep_con |
if 'TRUE' data will remain in DB (use for very large data) |
TODO: It is possible to specify a linking structure between multiple datasets, and when variables are selected the dataset will be linked 'on-the-fly'. This, when used with databases, will significantly reduce the size of data in memory.
an inzdf object
This function checks if a variable is a factor.
is_cat(x)
x |
the variable to check |
logical, TRUE if the variable is a factor
Tom Elliott
This function checks if a variable is a date, time, or datetime.
is_dt(x)
x |
the variable to check |
logical, TRUE if the variable is a datetime
Tom Elliott
This function checks if a variable is numeric, or could be considered one. For example, dates and times can be treated as numeric, so they return TRUE.
is_num(x)
x |
the variable to check |
logical, TRUE if the variable is numeric
Tom Elliott
Checks if the complete file was read or not.
is_preview(df)
df |
data to check |
logical
Check if object is a survey object (either standard or replicate design)
is_survey(x)
x |
object to be tested |
logical
Tom Elliott
Check if object is a survey object (created by svydesign())
is_svydesign(x)
x |
object to be tested |
logical
Tom Elliott
Check if object is a replicate survey object (created by svrepdesign())
is_svyrep(x)
x |
object to be tested |
logical
Tom Elliott
Join data with another dataset
join_data(
  data_l,
  data_r,
  by = NULL,
  how = c("inner", "left", "right", "full", "anti", "semi"),
  suffix_l = ".x",
  suffix_r = ".y"
)
data_l |
original data |
data_r |
imported dataset |
by |
a character vector of variables to join by |
how |
the method used to join the datasets |
suffix_l |
suffix for the original dataset (ignored for filter-joins) |
suffix_r |
suffix for the imported dataset (ignored for filter-joins) |
joined dataset
Zhaoming Su
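The note that suffixes are ignored for filter-joins follows from how the underlying dplyr joins behave. The sketch below calls dplyr directly on two small made-up data frames to contrast a mutate-join with a filter-join; it is an illustration of the distinction and not a reproduction of the code join_data() generates.

```r
library(dplyr)

# Made-up example data with a clashing non-key column 'value'.
left <- data.frame(id = c(1, 2, 3), value = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), value = c("x", "y", "z"))

# Mutate-join: columns from both tables are kept, so the clashing
# 'value' columns are disambiguated with the suffixes.
inner <- inner_join(left, right, by = "id", suffix = c(".x", ".y"))

# Filter-join: only rows of 'left' are returned, so no suffix is needed.
anti <- anti_join(left, right, by = "id")

names(inner)
names(anti)
```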
code, mutate-joins, filter-joins
Import linked data into an inzdf object
load_linked(
  x,
  schema,
  con,
  name = ifelse(missing(con), deparse(substitute(x)), deparse(substitute(con))),
  keep_con = FALSE,
  progress = FALSE,
  ...
)
x |
a linked specification file or vector of data set paths |
schema |
a list describing the schema/relationships between the files |
con |
a database connection to load the linked data into |
name |
the name of the data set collection |
keep_con |
if |
progress |
either |
... |
additional arguments passed to data reading function |
an inzdf object
Load object(s) from an Rdata file
load_rda(file)
file |
path to an rdata file |
list of data frames, plus code
Tom Elliott
Helper function to create new variable names that are unique given a set of existing names (in a data set, for example). If a variable name already exists, a number will be appended.
make_names(new, existing = character())
new |
a vector of proposed new variable names |
existing |
a vector of existing variable names |
a vector of unique variable names
Tom Elliott
make_names(c("var_x", "var_y"), c("var_x", "var_z"))
Turn <NA> in categorical variables into "(Missing)"; numeric variables will be converted to categorical variables in which numeric values are recorded as "(Observed)" and NA as "(Missing)".
missing_to_cat(data, vars, names = NULL)
data |
a dataframe with the columns whose missing values are to be converted into categories |
vars |
a character vector of the variables in |
names |
a character vector of names for the new variables |
original dataframe containing new columns of the converted variables for the missing values with tidyverse code attached
Zhaoming Su
missing <- missing_to_cat(iris, vars = c("Species", "Sepal.Length"))
cat(code(missing))
head(missing)
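Because iris contains no missing values, the two conversions are easier to see on small vectors that do. The base R sketch below mirrors the described behaviour on made-up data; it is not the tidyverse code produced by missing_to_cat().

```r
# Numeric variable: values become "(Observed)", NA becomes "(Missing)".
num <- c(1.5, NA, 3.2)
num_cat <- factor(ifelse(is.na(num), "(Missing)", "(Observed)"))

# Categorical variable: existing levels are kept, NA becomes "(Missing)".
cat_var <- factor(c("a", NA, "b"))
cat_miss <- cat_var
levels(cat_miss) <- c(levels(cat_miss), "(Missing)")
cat_miss[is.na(cat_miss)] <- "(Missing)"

num_cat
cat_miss
```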
Opens a new graphics device
newdevice(width = 7, height = 7, ...)
width |
the width (in inches) of the new device |
height |
the height (in inches) of the new device |
... |
additional arguments passed to the new device function |
Depending on the system, different devices are better. The windows device works fine (for now), so we only attempt to speed up any other devices that we're going to be using. We speed them up by getting rid of buffering.
Tom Elliott
Anti value matching
x %notin% table
x |
vector of values to be matched |
table |
vector of values to match against |
A logical vector of same length as 'x', indicating if each element does not exist in the table.
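Anti value matching amounts to negating `%in%`. A minimal sketch, using the hypothetical operator name `%notin_sketch%` so that it does not clash with the package's own operator:

```r
# Hypothetical re-creation of the operator: negate base R's %in%.
`%notin_sketch%` <- function(x, table) !(x %in% table)

result <- c("a", "b", "z") %notin_sketch% c("a", "b", "c")
result
```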
NULL or operator
a %||% b
a |
an object, potentially NULL |
b |
an object |
a if a is not NULL, otherwise b
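The return rule stated above can be written as a one-line function. The operator name `%or_sketch%` is hypothetical and is used only to avoid masking the real operator:

```r
# Hypothetical re-creation: return a unless it is NULL, otherwise b.
`%or_sketch%` <- function(a, b) if (is.null(a)) b else a

fallback <- NULL %or_sketch% "default"
kept <- "value" %or_sketch% "default"
```

This pattern is commonly used to supply defaults for optional arguments that are NULL when not provided.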
Tidy-printing of the code attached to an object
print_code(x, ...)
x |
a dataframe with code attached |
... |
additional arguments passed to tidy_all_code() |
Called for side-effect of printing code to the console.
iris_agg <- aggregate_data(iris, group_vars = "Species", summaries = "mean")
print_code(iris_agg)
Takes a specified number of groups of observations with fixed group size by sampling without replacement, and returns the result along with the tidyverse code used to generate it.
random_sample(data, n, sample_size)
data |
a dataframe to sample from |
n |
the number of groups to generate |
sample_size |
the size of each group specified in |
a dataframe containing the random samples with tidyverse code attached
Owen Jin, Zhaoming Su
rs <- random_sample(iris, n = 5, sample_size = 3)
cat(code(rs))
head(rs)
Ranks the values of numeric variables, for example in descending order, and returns the result along with the tidyverse code used to generate it. See row_number and percent_rank.
rank_vars(data, vars, rank_type = c("min", "dense", "percent"))
data |
a dataframe with the variables to rank |
vars |
a character vector of numeric variables in |
rank_type |
either |
the original dataframe containing new columns with the ranks of the variables in vars with tidyverse code attached
Zhaoming Su
ranked <- rank_vars(iris, vars = c("Sepal.Length", "Petal.Length"))
cat(code(ranked))
head(ranked)
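The three rank_type options share their names with dplyr's ranking helpers, so the difference between them can be seen by calling those helpers on a short vector with a tie. That rank_vars() maps "min", "dense" and "percent" onto exactly these functions is an assumption here, and the ordering direction used by the package is not shown.

```r
library(dplyr)

x <- c(10, 20, 20, 30)

ranks <- data.frame(
  x = x,
  min_rank = min_rank(x),         # ties share the lowest rank, then a gap
  dense_rank = dense_rank(x),     # ties share a rank, no gap afterwards
  percent_rank = percent_rank(x)  # min_rank rescaled to the range [0, 1]
)

ranks
```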
This function reads a data dictionary from a file and attaches it to a dataset. The attached data dictionary provides utility functions that can be used by other methods, such as plots, to automatically create axes and more.
read_dictionary(
  file,
  name = "name",
  type = "type",
  title = "title",
  description = "description",
  units = "units",
  codes = "codes",
  values = "values",
  level_separator = "|",
  ...
)

## S3 method for class 'dictionary'
print(x, kable = FALSE, include_other = TRUE, ...)

## S3 method for class 'dictionary'
x[i, ...]

apply_dictionary(data, dict)

has_dictionary(data)

get_dictionary(data)
file |
The path to the file containing the data dictionary. |
name |
The name of the column containing the variable name. |
type |
The name of the column containing the variable type. |
title |
The name of the column containing a short, human-readable title for the variable. If blank, the variable name will be used instead. |
description |
The name of the column containing the variable description. |
units |
The name of the column containing units (for numeric variables only). |
codes |
The name of the column containing factor codes (for categorical variables only). |
values |
The name of the column containing factor values corresponding to the codes. These should be in the same order as the codes. |
level_separator |
The separator used to separate levels in |
... |
Additional arguments, passed to |
x |
A |
kable |
If |
include_other |
If |
i |
Subset index. |
data |
A dataset (dataframe, tibble). |
dict |
A dictionary (created using |
The dataset with the attached data dictionary.
This function will read a CSV file with iNZight metadata in the header. This allows plain text CSV files to be supplied with additional comments that describe the structure of the data to make import and data handling easier.
read_meta(file, preview = FALSE, column_types, ...)
file |
the plain text file with metadata |
preview |
logical, if |
column_types |
optional column types |
... |
more arguments |
The main example is to define factor levels for an integer variable in large data sets.
a data frame
Tom Elliott
The text can also be the value '"clipboard"' which will use 'readr::clipboard()'.
read_text(txt, delim = "\t", ...)
txt |
character string |
delim |
the delimiter to use, passed to 'readr::read_delim()' |
... |
additional arguments passed to 'readr::read_delim()' |
data.frame
Tom Elliott
This function filters a dataframe or a survey design object by removing specified rows based on the provided row numbers. The resulting filtered dataframe is returned, along with the tidyverse code used to generate it.
remove_rows(data, rows)
data |
A dataframe or a survey design object to be filtered. |
rows |
A numeric vector of row numbers to be sliced off. |
A filtered dataframe with the tidyverse code attached.
Owen Jin, Zhaoming Su
data <- remove_rows(iris, rows = c(1, 4, 5))
cat(code(data))
head(data)
Renames the levels of a categorical variable and returns the result along with the tidyverse code used to generate it.
rename_levels(data, var, tobe_asis, name = NULL)
data |
a dataframe with the column to be renamed |
var |
a character of the categorical variable to rename |
tobe_asis |
a named list of the old level names assigned to the new level names, i.e. list('new level names' = 'old level names') |
name |
a name for the new variable |
original dataframe containing a new column of the renamed categorical variable with tidyverse code attached
Zhaoming Su
renamed <- rename_levels(iris,
  var = "Species",
  tobe_asis = list(set = "setosa", ver = "versicolor")
)
cat(code(renamed))
head(renamed)
Rename columns of a dataset with desired names
rename_vars(data, tobe_asis)
data |
a dataframe with columns to rename |
tobe_asis |
a named list of the old column names assigned to the new column names, i.e. list('new column names' = 'old column names') |
original dataframe with the chosen columns renamed and tidyverse code attached
Zhaoming Su
renamed <- rename_vars(iris, list(
  sepal_length = "Sepal.Length",
  sepal_width = "Sepal.Width",
  petal_length = "Petal.Length",
  petal_width = "Petal.Width"
))
cat(code(renamed))
head(renamed)
Reorder the levels of a categorical variable either manually or automatically
reorder_levels(
  data,
  var,
  new_levels = NULL,
  auto = c("freq", "order", "seq"),
  name = NULL
)
data |
a dataframe to reorder |
var |
a categorical variable to reorder |
new_levels |
a character vector of the new factor order;
overrides |
auto |
only meaningful if |
name |
name for the new variable |
original dataframe containing a new column of the reordered categorical variable with tidyverse code attached
Zhaoming Su
reordered <- reorder_levels(iris,
  var = "Species",
  new_levels = c("versicolor", "virginica", "setosa")
)
cat(code(reordered))
head(reordered)

reordered <- reorder_levels(iris,
  var = "Species",
  auto = "freq"
)
cat(code(reordered))
head(reordered)
Reshaping dataset from wide to long or from long to wide
reshape_data(
  data,
  data_to = c("long", "wide"),
  cols,
  names_to = "name",
  values_to = "value",
  names_from = "name",
  values_from = "value"
)
data |
a dataset to reshape |
data_to |
whether the target dataset is |
cols |
columns to gather together (for wide to long) |
names_to |
name for new column containing old names (for wide to long) |
values_to |
name for new column containing old values (for wide to long) |
names_from |
column to spread out (for long to wide) |
values_from |
values to be put in the spread columns (for long to wide) |
reshaped dataset
Zhaoming Su
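The argument names of reshape_data() match those of tidyr's `pivot_longer()` and `pivot_wider()`, so the two directions can be illustrated with tidyr directly on a small made-up data frame. Whether reshape_data() generates exactly these calls is an assumption; the sketch only shows what each argument controls.

```r
library(tidyr)

# Made-up wide data: one row per id, one column per measurement.
wide <- data.frame(id = 1:2, x = c(10, 20), y = c(30, 40))

# Wide to long: gather columns x and y into name/value pairs.
long <- pivot_longer(wide,
  cols = c(x, y),
  names_to = "name",
  values_to = "value"
)

# Long to wide: spread the name column back out into separate columns.
back <- pivot_wider(long, names_from = "name", values_from = "value")

long
back
```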
Save an object with, optionally, a (valid) name
save_rda(data, file, name)
data |
the data frame to save |
file |
where to save it |
name |
optional, the name the data will have in the rda file |
logical, should be TRUE, along with code for the save
Tom Elliott
Select a (reordered) subset of variables from a dataset.
select_vars(data, keep)
data |
the dataset |
keep |
vector of variable names to keep |
a data frame with tidyverse code attribute
Tom Elliott, Zhaoming Su
select_vars(iris, c("Sepal.Length", "Species", "Sepal.Width"))
Separate columns
separate_var(data, var, by, names, into = c("cols", "rows"))
data |
dataset |
var |
name of variable to be separated |
by |
a string as delimiter between values (separate by delimiter) or integer(s) as number of characters to split by (separate by position), the length of |
names |
for |
into |
whether to split into new rows or columns |
Separated dataset
Zhaoming Su
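The difference between separating into columns and separating into rows can be shown with a base R sketch. It does not call separate_var(); the dataset, the "_" delimiter, and the new column names are assumptions chosen for illustration.

```r
# Hypothetical dataset with a delimited variable
df <- data.frame(
    id = 1:3,
    code = c("a_1", "b_2", "c_3"),
    stringsAsFactors = FALSE
)
pieces <- strsplit(df$code, "_", fixed = TRUE)

# Into columns: each piece becomes its own variable
parts <- do.call(rbind, pieces)
as_cols <- data.frame(id = df$id, letter = parts[, 1], number = parts[, 2])

# Into rows: each piece becomes its own observation
as_rows <- data.frame(
    id = rep(df$id, times = lengths(pieces)),
    code = unlist(pieces)
)

as_cols
as_rows
```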
Useful when reading an Excel file to quickly check what other sheets are available.
sheets(x)
x |
a dataframe, presumably returned by |
vector of sheet names, or NULL if the file was not an Excel workbook
Tom Elliott
cas_file <- system.file("extdata/cas500.xls", package = "iNZightTools")
cas <- smart_read(cas_file)
sheets(cas)
A simple function that imports a file without the user needing to specify information about the file type (see Details for more). The smart_read() function uses the file's extension to determine the appropriate function to read the data. Additionally, characters are converted to factors by default, mostly for compatibility with iNZight (https://inzight.nz).
smart_read( file, ext = tools::file_ext(file), preview = FALSE, column_types = NULL, ... )
file |
the file path to read |
ext |
file extension, namely "csv" or "txt" |
preview |
logical, if |
column_types |
vector of column types (see ?readr::read_csv) |
... |
additional parameters passed to read_* functions |
Currently, smart_read() understands the following file types:
delimited (.csv, .txt)
Excel (.xls, .xlsx)
SPSS (.sav)
Stata (.dta)
SAS (.sas7bdat, .xpt)
R data (.rds)
JSON (.json)
A dataframe with some additional attributes:
name is the name of the file
code contains the 'tidyverse' code used to read the data
sheets contains names of sheets if 'file' is an Excel file (can be retrieved using the sheets() helper function)
smart_read() will detect the delimiter used in the file if delimiter = NULL (the default) is passed in. If this does not work, you can override this argument:
smart_read('path/to/file', delimiter = '+')
Tom Elliott
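Choosing a reader from the file extension can be sketched in base R as below. The helper read_by_ext() is hypothetical, covers only three of the listed file types, and uses base readers instead of whatever functions smart_read() calls internally; it is meant only to show the dispatch idea and the conversion of characters to factors.

```r
# Hypothetical helper: pick a base R reader from the file extension
read_by_ext <- function(file, ext = tools::file_ext(file)) {
    switch(tolower(ext),
        csv = utils::read.csv(file, stringsAsFactors = TRUE),
        txt = utils::read.delim(file, stringsAsFactors = TRUE),
        rds = readRDS(file),
        stop("Unsupported file extension: ", ext)
    )
}

# Round trip through a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)
dat <- read_by_ext(tmp)
str(dat)
```

Passing ext explicitly lets a caller read a file whose extension does not reflect its contents.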
Sorts a dataframe by one or more variables, and returns the result along with tidyverse code used to generate it.
sort_vars(data, vars, asc = rep(TRUE, length(vars)))
data |
a dataframe to sort |
vars |
a character vector of variable names to sort by |
asc |
logical, length of 1 or same length as |
data with tidyverse code attached
Owen Jin, Zhaoming Su
sorted <- sort_vars(iris,
    vars = c("Sepal.Width", "Sepal.Length"),
    asc = c(TRUE, FALSE)
)
cat(code(sorted))
head(sorted)
Centre then divide by the standard deviation of the values in a numeric variable
standardize_vars(data, vars, names = NULL)
data |
a dataframe with the columns to standardize |
vars |
a character vector of the numeric variables in |
names |
names for the created variables |
the original dataframe containing new columns of the standardized variables with tidyverse code attached
Zhaoming Su
standardized <- standardize_vars(iris, vars = c("Sepal.Width", "Petal.Width"))
cat(code(standardized))
head(standardized)
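For an ordinary (non-survey) data frame, standardizing a numeric variable is usually understood as subtracting its mean and dividing by its sample standard deviation. The base R sketch below shows that computation on one column; it is an assumption about the arithmetic involved, not the package's own code.

```r
x <- iris$Sepal.Width

# Centre, then rescale by the sample standard deviation
z <- (x - mean(x)) / sd(x)

# scale() performs the same centring and rescaling
z_scale <- as.numeric(scale(x))

all.equal(z, z_scale)
c(mean = mean(z), sd = sd(z))
```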
Calculates the interquartile range from complex survey data. A wrapper for taking differences of svyquantile at 0.25 and 0.75 quantiles, and meant to be called from within summarize (see srvyr package).
survey_IQR(x, na.rm = TRUE)
x |
A variable or expression |
na.rm |
logical, if |
a vector of interquartile ranges
Tom Elliott
library(survey)
library(srvyr)
data(api)
dstrata <- apistrat %>%
    as_survey(strata = stype, weights = pw)
dstrata %>%
    summarise(api99_iqr = survey_IQR(api99))
Tidy code with correct indents and limit the code to a specified width
tidy_all_code(x, width = 80, indent = 4, outfile, incl_library = TRUE)
x |
character string or file name of the file containing messy code |
width |
the width of a line |
indent |
how many spaces for one indent |
outfile |
the file name of the file containing formatted code |
incl_library |
logical, if true, the output code will contain library name |
formatted code, optionally written to 'outfile'
Tom Elliott, Lushi Cai
Transform the values of numeric variables by applying a mathematical function
transform_vars(data, vars, fn, names = NULL)
data |
a dataframe with the variables to transform |
vars |
a character vector of the numeric variables in |
fn |
the name (a string) of a valid R function |
names |
the names of the new variables |
the original dataframe containing the new columns of the transformed variable with tidyverse code attached
Zhaoming Su
transformed <- transform_vars(iris,
    vars = "Petal.Length",
    fn = "log"
)
cat(code(transformed))
head(transformed)
Generates the more detailed text required for the details section in iNZValidateWin.
validation_details(cf, v, var, id.var, df)
cf |
Confrontation object from |
v |
Validator that generated |
var |
Rule name to give details about |
id.var |
Variable name denoting a unique identifier for each observation |
df |
The dataset that was confronted |
A character vector giving each line of the summary detail text
Daniel Barnett
Generates a summary of a confrontation which gives basic information about each validation rule tested.
validation_summary(cf)
cf |
Confrontation object from |
A data.frame with number of tests performed, number of passes, number of failures, and failure percentage for each validation rule.
Daniel Barnett
Get variable type name
vartype(x)
x |
vector to be examined |
character vector of the variable's type
Tom Elliott
Get all variable types from data object
vartypes(x)
x |
data object (data.frame or inzdf) |
a named vector of variable types
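The shape of the result, a character vector of types named by variable, can be illustrated with a base R sketch. The helper classify_column() and its labels ("numeric", "categorical", "other") are hypothetical; the labels that vartype() and vartypes() actually return may differ.

```r
# Hypothetical classifier; the labels are illustrative, not the package's
classify_column <- function(x) {
    if (is.numeric(x)) {
        "numeric"
    } else if (is.factor(x) || is.character(x)) {
        "categorical"
    } else {
        "other"
    }
}

# Apply to every column; vapply() keeps the column names
types <- vapply(iris, classify_column, character(1))
types
```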