Title: | Data and Variable Transformation Functions |
---|---|
Description: | Collection of miscellaneous utility functions, supporting data transformation tasks like recoding, dichotomizing or grouping variables, setting and replacing missing values. The data transformation functions also support labelled data, and all integrate seamlessly into a 'tidyverse'-workflow. |
Authors: | Daniel Lüdecke [aut, cre] , Iago Giné-Vázquez [ctb], Alexander Bartel [ctb] |
Maintainer: | Daniel Lüdecke <[email protected]> |
License: | GPL-3 |
Version: | 2.8.10 |
Built: | 2025-01-09 07:26:46 UTC |
Source: | CRAN |
Purpose of this package
Collection of miscellaneous utility functions, supporting data transformation tasks like recoding, dichotomizing or grouping variables, setting and replacing missing values. The data transformation functions also support labelled data, and all integrate seamlessly into a 'tidyverse'-workflow.
Design philosophy - consistent API
The design of this package follows, where appropriate, the tidyverse-approach, with the first argument of a function always being the data (either a data frame or vector), followed by variable names that should be processed by the function. If no variables are specified as argument, the function applies to the complete data that was indicated as first function argument.
There are two types of function designs:
Functions like rec() or dicho(), which transform or recode variables, typically return the complete data frame that was given as first argument, additionally including the transformed and recoded variables specified in the ...-ellipses argument. The variables usually get a suffix, so original variables are preserved in the data.
Functions like to_factor() or to_label(), which convert variables into other types or add additional information like variable or value labels as attribute, also typically return the complete data frame that was given as first argument. However, the variables specified in the ...-ellipses argument are converted ("overwritten"), all other variables remain unchanged. Hence, these functions do not return any new, additional variables.
Daniel Lüdecke [email protected]
%nin% is the complement to %in%. It checks which values in x do not match (hence, are not in) values in y.
x %nin% y
x |
Vector with values to be matched. |
y |
Vector with values to be matched against. |
See 'Details' in match.
A logical vector, indicating if a match was not located for each element of x, thus the values are TRUE or FALSE and never NA.
c("a", "B", "c") %in% letters
c("a", "B", "c") %nin% letters
c(1, 2, 3, 4) %in% c(3, 4, 5, 6)
c(1, 2, 3, 4) %nin% c(3, 4, 5, 6)
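The "never NA" guarantee described above can be illustrated in base R. The following is a minimal sketch, not the package's own source: the operator name `%not_in%` is hypothetical and is built on `match()`, which the text points to for details.

```r
# Hypothetical operator, defined only for illustration (not part of sjmisc).
# match() returns 0 (via nomatch = 0) for every element of x absent from y,
# so the comparison yields TRUE/FALSE and never NA.
`%not_in%` <- function(x, y) {
  match(x, y, nomatch = 0) == 0
}

res_chr <- c("a", "B", "c") %not_in% letters
res_num <- c(1, 2, 3, 4) %not_in% c(3, 4, 5, 6)
res_na  <- c(1, NA) %not_in% c(2, 3)   # a missing value simply counts as "not in"

print(res_chr)
print(res_num)
print(res_na)
```

Building on `match()` rather than on a direct comparison with `==` is what keeps missing values from propagating into the result.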
add_columns() combines two or more data frames, but unlike cbind or dplyr::bind_cols(), this function binds data as last columns of a data frame (i.e., behind columns specified in ...). This can be useful in a "pipe"-workflow, where a data frame returned by a previous function should be appended at the end of another data frame that is processed in add_columns().
replace_columns() replaces all columns in data with identically named columns in ..., and adds remaining (non-duplicated) columns from ... to data.
add_id() simply adds an ID-column to the data frame, with values from 1 to nrow(data) or, for grouped data frames, values from 1 to group size. See 'Examples'.
add_columns(data, ..., replace = TRUE)
replace_columns(data, ..., add.unique = TRUE)
add_id(data, var = "ID")
data |
A data frame. For |
... |
More data frames to combine, resp. more data frames with columns
that should replace columns in |
replace |
Logical, if |
add.unique |
Logical, if |
var |
Name of the new ID-variable. |
For add_columns(), a data frame where columns of data are appended after the columns passed via the ...-ellipses argument.
For replace_columns(), a data frame where columns in data will be replaced by identically named columns in ..., and remaining columns from ... will be appended to data (if add.unique = TRUE).
For add_id(), a new column with ID numbers. This column is always the first column in the returned data frame.
For add_columns(), by default, columns in data with names identical to columns in one of the data frames in ... will be dropped (i.e. variables with identical names in ... will replace existing variables in data). Use replace = FALSE to keep all columns. Identical column names will then be renamed, to ensure unique column names (which happens by default when using dplyr::bind_cols()). When replacing columns, replaced columns are not added to the end of the data frame. Rather, the original order of columns will be preserved.
data(efc)
d1 <- efc[, 1:3]
d2 <- efc[, 4:6]

if (require("dplyr") && require("sjlabelled")) {
  head(bind_cols(d1, d2))
  add_columns(d1, d2) %>% head()

  d1 <- efc[, 1:3]
  d2 <- efc[, 2:6]

  add_columns(d1, d2, replace = TRUE) %>% head()
  add_columns(d1, d2, replace = FALSE) %>% head()

  # use case: we take the original data frame, select specific
  # variables and do some transformations or recodings
  # (standardization in this example) and add the new, transformed
  # variables *to the end* of the original data frame
  efc %>%
    select(e17age, c160age) %>%
    std() %>%
    add_columns(efc) %>%
    head()

  # new variables with same name will overwrite old variables
  # in "efc". order of columns is not changed.
  efc %>%
    select(e16sex, e42dep) %>%
    to_factor() %>%
    add_columns(efc) %>%
    head()

  # keep both old and new variables, automatically
  # rename variables with identical name
  efc %>%
    select(e16sex, e42dep) %>%
    to_factor() %>%
    add_columns(efc, replace = FALSE) %>%
    head()

  # create sample data frames
  d1 <- efc[, 1:10]
  d2 <- efc[, 2:3]
  d3 <- efc[, 7:8]
  d4 <- efc[, 10:12]

  # show original
  head(d1)

  library(sjlabelled)
  # slightly change variables, to see effect
  d2 <- as_label(d2)
  d3 <- as_label(d3)

  # replace duplicated columns, append remaining
  replace_columns(d1, d2, d3, d4) %>% head()

  # replace duplicated columns, omit remaining
  replace_columns(d1, d2, d3, d4, add.unique = FALSE) %>% head()

  # add ID to dataset
  library(dplyr)
  data(mtcars)
  add_id(mtcars)

  mtcars %>%
    group_by(gear) %>%
    add_id() %>%
    arrange(gear, ID) %>%
    print(n = 100)
}
Merges (full join) data frames and preserves value and variable labels.
add_rows(..., id = NULL)
merge_df(..., id = NULL)
... |
Two or more data frames to be merged. |
id |
Optional name for ID column that will be created to indicate the source data frames for appended rows. |
This function works like dplyr::bind_rows(), but preserves variable and value label attributes. add_rows() row-binds all data frames in ..., even if these have different numbers of columns. Non-matching columns will be column-bound and filled with NA-values for rows in those data frames that do not have this column.
Value and variable labels are preserved. If matching columns have different value label attributes, attributes from first data frame will be used.
merge_df() is an alias for add_rows().
A full joined data frame.
library(dplyr)
data(efc)
x1 <- efc %>% select(1:5) %>% slice(1:10)
x2 <- efc %>% select(3:7) %>% slice(11:20)

mydf <- add_rows(x1, x2)
mydf
str(mydf)

## Not run: 
library(sjPlot)
view_df(mydf)
## End(Not run)

x3 <- efc %>% select(5:9) %>% slice(21:30)
x4 <- efc %>% select(11:14) %>% slice(31:40)

mydf <- add_rows(x1, x2, x3, x4, id = "subsets")
mydf
str(mydf)
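The NA-filling behaviour for non-matching columns can be sketched in base R. This is only an illustration under simplifying assumptions: the helper `bind_rows_fill()` is a hypothetical name, it is not how add_rows() is implemented, and it makes no attempt to preserve label attributes.

```r
# Hypothetical helper (not part of sjmisc): row-bind data frames that have
# different columns, filling columns a data frame lacks with NA.
bind_rows_fill <- function(...) {
  dfs <- list(...)
  # union of all column names, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  filled <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) {
      d[[col]] <- NA   # recycled to nrow(d)
    }
    d[, all_cols, drop = FALSE]
  })
  do.call(rbind, filled)
}

x1 <- data.frame(a = 1:2, b = c("u", "v"))
x2 <- data.frame(b = c("w", "x"), c = c(TRUE, FALSE))

merged <- bind_rows_fill(x1, x2)
print(merged)
```

Rows that come from `x2` have NA in column `a`, and rows that come from `x1` have NA in column `c`, which mirrors the "full join" description above.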
add_variables() adds a new column to a data frame, while add_case() adds a new row to a data frame. These are convenient functions to add columns or rows not only at the end of a data frame, but at any column or row position. Furthermore, they allow easy integration into a pipe-workflow.
add_variables(data, ..., .after = Inf, .before = NULL)
add_case(data, ..., .after = Inf, .before = NULL)
data |
A data frame. |
... |
One or more named vectors that indicate the variables or values,
which will be added as new column or row to |
.after , .before
|
Numerical index of row or column, where after or before
the new variable or case should be added. If |
data, including the new variables or cases from the ...-ellipses argument.
For add_case(), if variable does not exist, a new variable is created and existing cases for this new variable get the value NA. See 'Examples'.
d <- data.frame(
  a = c(1, 2, 3),
  b = c("a", "b", "c"),
  c = c(10, 20, 30),
  stringsAsFactors = FALSE
)

add_case(d, b = "d")
add_case(d, b = "d", a = 5, .before = 1)

# adding a new case for a new variable
add_case(d, e = "new case")

add_variables(d, new = 5)
add_variables(d, new = c(4, 4, 4), new2 = c(5, 5, 5), .after = "b")
Check if all values in a vector are NA.
all_na(x)
x |
A vector or data frame. |
Logical, TRUE if x has only NA values, FALSE if x has at least one non-missing value.
x <- c(NA, NA, NA)
y <- c(1, NA, NA)

all_na(x)
all_na(y)
all_na(data.frame(x, y))
all_na(list(x, y))
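The check itself reduces to a one-liner in base R. The sketch below uses a hypothetical helper `all_missing()`. The choice to return one logical value per column for data frames is an assumption of this sketch, and the package function may shape its result differently.

```r
# Hypothetical helper (not part of sjmisc).
all_missing <- function(x) {
  if (is.data.frame(x)) {
    # assumption: report one TRUE/FALSE per column
    vapply(x, function(col) all(is.na(col)), logical(1))
  } else {
    all(is.na(x))
  }
}

x <- c(NA, NA, NA)
y <- c(1, NA, NA)

print(all_missing(x))
print(all_missing(y))
print(all_missing(data.frame(x, y)))
```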
big_mark() formats large numbers with big marks, while prcn() converts a numeric scalar between 0 and 1 into a character vector, representing the percentage-value.
big_mark(x, big.mark = ",", ...)
prcn(x)
x |
A vector or data frame. All numeric inputs (including numeric character vectors) will be prettified. For |
big.mark |
Character, used as mark between every three digits before the decimal point. |
... |
Other arguments passed down to the |
For big_mark(), a prettified x as character, with big marks. For prcn(), a character vector with a percentage number.
# simple big mark
big_mark(1234567)

# big marks for several values at once, mixed numeric and character
big_mark(c(1234567, "55443322"))

# pre-defined width of character output
big_mark(c(1234567, 55443322), width = 15)

# convert numbers into percentage, as character
prcn(0.2389)
prcn(c(0.2143887, 0.55443, 0.12345))

dat <- data.frame(
  a = c(.321, .121, .64543),
  b = c("a", "b", "c"),
  c = c(.435, .54352, .234432)
)
prcn(dat)
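For comparison, similar character output can be produced with base R's `format()` and `sprintf()`. This is a rough sketch only. It does not claim to reproduce the exact rounding or padding that big_mark() and prcn() apply, and the two-decimal percentage format is an assumption made for the illustration.

```r
# Base R only; an illustration, not the package's implementation.
large <- c(1234567, 55443322)

# thousands separator; trim = TRUE drops the padding to a common width
with_marks <- format(large, big.mark = ",", trim = TRUE)

# proportions between 0 and 1 rendered as percentage strings
proportions <- c(0.2389, 0.5)
as_percent <- sprintf("%.2f%%", 100 * proportions)

print(with_marks)
print(as_percent)
```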
This method counts tagged NA values (see tagged_na) in a vector and prints a frequency table of counted tagged NAs.
count_na(x, ...)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
A data frame with counted tagged NA values.
if (require("haven")) {
  x <- labelled(
    x = c(1:3, tagged_na("a", "c", "z"),
          4:1, tagged_na("a", "a", "c"),
          1:3, tagged_na("z", "c", "c"),
          1:4, tagged_na("a", "c", "z")),
    labels = c("Agreement" = 1, "Disagreement" = 4,
               "First" = tagged_na("c"), "Refused" = tagged_na("a"),
               "Not home" = tagged_na("z"))
  )
  count_na(x)

  y <- labelled(
    x = c(1:3, tagged_na("e", "d", "f"),
          4:1, tagged_na("f", "f", "d"),
          1:3, tagged_na("f", "d", "d"),
          1:4, tagged_na("f", "d", "f")),
    labels = c("Agreement" = 1, "Disagreement" = 4,
               "An E" = tagged_na("e"), "A D" = tagged_na("d"),
               "The eff" = tagged_na("f"))
  )

  # create data frame
  dat <- data.frame(x, y)

  # possible count()-function calls
  count_na(dat)
  count_na(dat$x)
  count_na(dat, x)
  count_na(dat, x, y)
}
de_mean() computes group- and de-meaned versions of a variable that can be used in regression analysis to model the between- and within-subject effect.
de_mean(x, ..., grp, append = TRUE, suffix.dm = "_dm", suffix.gm = "_gm")
x |
A data frame. |
... |
Names of variables that should be group- and de-meaned. |
grp |
Quoted or unquoted name of the variable that indicates the group- or cluster-ID. |
append |
Logical, if |
suffix.dm , suffix.gm
|
String value, will be appended to the names of the
group-meaned and de-meaned variables of |
de_mean() is intended to create group- and de-meaned variables for complex random-effect-within-between models (see Bell et al. 2018), where group-effects (random effects) and fixed effects correlate (see Bafumi and Gelman 2006). This violation of one of the Gauss-Markov-assumptions can happen, for instance, when analysing panel data. To control for correlating predictors and group effects, it is recommended to include the group-meaned and de-meaned version of time-varying covariates in the model. By this, one can fit complex multilevel models for panel data, including time-varying, time-invariant predictors and random effects. This approach is superior to simple fixed-effects models, which lack information of variation in the group-effects or between-subject effects.
A description of how to translate the formulas described in Bell et al. 2018 into R using lmer() from lme4 or glmmTMB() from glmmTMB can be found here: for lmer() and for glmmTMB().
For append = TRUE, x including the group-/de-meaned variables as new columns is returned; if append = FALSE, only the group-/de-meaned variables will be returned.
Bafumi J, Gelman A. 2006. Fitting Multilevel Models When Predictors and Group Effects Correlate. In. Philadelphia, PA: Annual meeting of the American Political Science Association.
Bell A, Fairbrother M, Jones K. 2018. Fixed and Random Effects Models: Making an Informed Choice. Quality & Quantity. doi:10.1007/s11135-018-0802-x
data(efc)
efc$ID <- sample(1:4, nrow(efc), replace = TRUE) # fake-ID
de_mean(efc, c12hour, barthtot, grp = ID, append = FALSE)
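The arithmetic behind group-meaning and de-meaning is small enough to spell out in base R. The sketch below uses a made-up toy panel and `ave()`. The column names merely mirror the default suffixes "_gm" and "_dm", and the code is not taken from the package source.

```r
# Toy panel data, invented for illustration: two subjects, three waves each.
panel <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  hours = c(10, 20, 30, 40, 50, 60)
)

# group mean per subject: carries the between-subject variation
panel$hours_gm <- ave(panel$hours, panel$id, FUN = mean)

# deviation from the subject's own mean: carries the within-subject variation
panel$hours_dm <- panel$hours - panel$hours_gm

print(panel)
```

In a within-between model, both derived columns would typically enter the fixed part together, alongside a random intercept for the grouping variable.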
This function prints basic descriptive statistics, including variable labels.
descr(
  x,
  ...,
  max.length = NULL,
  weights = NULL,
  show = "all",
  out = c("txt", "viewer", "browser"),
  encoding = "UTF-8",
  file = NULL
)
x |
A vector or a data frame. May also be a grouped data frame (see 'Note' and 'Examples'). |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
max.length |
Numeric, indicating the maximum length of variable labels
in the output. If variable names are longer than |
weights |
Bare name, or name as string, of a variable in |
show |
Character vector, indicating which information (columns) that describe
the data should be returned. May be one or more of |
out |
Character vector, indicating whether the results should be printed
to console ( |
encoding |
Character vector, indicating the charset encoding used
for variable and value labels. Default is |
file |
Destination file, if the output should be saved as file.
Only used when |
A data frame with basic descriptive statistics.
data may also be a grouped data frame (see group_by) with up to two grouping variables. Descriptive tables are then created for each subgroup.
data(efc)
descr(efc, e17age, c160age)

efc$weights <- abs(rnorm(nrow(efc), 1, .3))
descr(efc, c12hour, barthtot, weights = weights)

library(dplyr)
efc %>% select(e42dep, e15relat, c172code) %>% descr()

# show just a few elements
efc %>% select(e42dep, e15relat, c172code) %>% descr(show = "short")

# with grouped data frames
efc %>%
  group_by(e16sex) %>%
  select(e16sex, e42dep, e15relat, c172code) %>%
  descr()

# you can select variables also inside 'descr()'
efc %>%
  group_by(e16sex, c172code) %>%
  descr(e16sex, c172code, e17age, c160age)

# or even use select-helpers
descr(efc, contains("cop"), max.length = 20)
Dichotomizes variables into dummy variables (0/1). Dichotomization is either done by median, mean or a specific value (see dich.by).
dicho_if() is a scoped variant of dicho(), where recoding will be applied only to those variables that match the logical condition of predicate.
dicho(
  x,
  ...,
  dich.by = "median",
  as.num = FALSE,
  var.label = NULL,
  val.labels = NULL,
  append = TRUE,
  suffix = "_d"
)

dicho_if(
  x,
  predicate,
  dich.by = "median",
  as.num = FALSE,
  var.label = NULL,
  val.labels = NULL,
  append = TRUE,
  suffix = "_d"
)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
dich.by |
Indicates the split criterion where a variable is dichotomized. Must be one of the following values (may be abbreviated):
|
as.num |
Logical, if |
var.label |
Optional string, to set variable label attribute for the
returned variable (see vignette Labelled Data and the sjlabelled-Package).
If |
val.labels |
Optional character vector (of length two), to set value label
attributes of dichotomized variable (see |
append |
Logical, if |
suffix |
Indicates which suffix will be added to each dummy variable.
Use |
predicate |
A predicate function to be applied to the columns. The
variables for which |
dicho() also works on grouped data frames (see group_by). In this case, dichotomization is applied to the subsets of variables in x. See 'Examples'.
x, dichotomized. If x is a data frame, for append = TRUE, x including the dichotomized variables as new columns is returned; if append = FALSE, only the dichotomized variables will be returned. If append = TRUE and suffix = "", recoded variables will replace (overwrite) existing variables.
Variable label attributes are preserved (unless changed via var.label-argument).
data(efc)
summary(efc$c12hour)
# split at median
table(dicho(efc$c12hour))
# split at mean
table(dicho(efc$c12hour, dich.by = "mean"))
# split between value lowest to 30, and above 30
table(dicho(efc$c12hour, dich.by = 30))

# sample data frame, values from 1-4
head(efc[, 6:10])

# dichotomized values (1 to 2 = 0, 3 to 4 = 1)
library(dplyr)
efc %>%
  select(6:10) %>%
  dicho(dich.by = 2) %>%
  head()

# dichotomize several variables in a data frame
dicho(efc, c12hour, e17age, c160age, append = FALSE)

# dichotomize and set labels
frq(dicho(
  efc, e42dep,
  var.label = "Dependency (dichotomized)",
  val.labels = c("lower", "higher"),
  append = FALSE
))

# works also with grouped data frames
mtcars %>%
  dicho(disp, append = FALSE) %>%
  table()

mtcars %>%
  group_by(cyl) %>%
  dicho(disp, append = FALSE) %>%
  table()

# dichotomizing grouped data frames leads to different
# results for a dichotomized variable, because the split
# value is different for each group.
# compare:
mtcars %>%
  group_by(cyl) %>%
  summarise(median = median(disp))

median(mtcars$disp)

# dichotomize only variables with more than 10 unique values
p <- function(x) dplyr::n_distinct(x) > 10
dicho_if(efc, predicate = p, append = FALSE)
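A median split, and the reason grouped data give different results, can be sketched in base R. The helper `median_split()` is a hypothetical name. It assumes that values strictly above the split become 1 and all others 0, which matches the "1 to 2 = 0, 3 to 4 = 1" comment in the examples for dich.by = 2, and it ignores labels entirely.

```r
# Hypothetical helper (not part of sjmisc).
# Assumption: values strictly above the split value are coded 1, others 0.
median_split <- function(x) {
  cutoff <- median(x, na.rm = TRUE)
  as.integer(x > cutoff)
}

values <- c(1, 2, 3, 4, 5, 6, NA)
split_values <- median_split(values)   # the median of 1..6 is 3.5
print(split_values)

# One overall median versus one median per group of cylinders:
overall <- median_split(mtcars$disp)
by_cyl  <- ave(mtcars$disp, mtcars$cyl, FUN = median_split)
print(table(overall, by_cyl))
```

Because each group has its own median, a car can fall below the overall split and still be coded 1 within its cylinder group.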
An SPSS sample data set, imported with the read_spss function.
# Attach EFC-data
data(efc)

# Show structure
str(efc)

# show first rows
head(efc)
These functions check which rows or columns of a data frame contain only missing values, i.e. which observations or variables are completely missing, and either 1) return their indices; or 2) remove them from the data frame.
empty_cols(x)
empty_rows(x)
remove_empty_cols(x)
remove_empty_rows(x)
x |
A data frame. |
For empty_cols and empty_rows, a numeric (named) vector with row or column indices of those variables that completely have missing values.
For remove_empty_cols and remove_empty_rows, a data frame with "empty" columns or rows removed.
tmp <- data.frame(
  a = c(1, 2, 3, NA, 5),
  b = c(1, NA, 3, NA, 5),
  c = c(NA, NA, NA, NA, NA),
  d = c(1, NA, 3, NA, 5)
)

tmp

empty_cols(tmp)
empty_rows(tmp)

remove_empty_cols(tmp)
remove_empty_rows(tmp)
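Both steps, locating and removing completely missing rows or columns, can be expressed with `rowSums()` and `colSums()` in base R. This is a minimal sketch using the same toy data as the examples. The variable names are made up for illustration, and the code is not the package's implementation.

```r
tmp <- data.frame(
  a = c(1, 2, 3, NA, 5),
  b = c(1, NA, 3, NA, 5),
  c = c(NA, NA, NA, NA, NA),
  d = c(1, NA, 3, NA, 5)
)

# a column or row is "empty" when it holds zero non-missing values
empty_col_idx <- which(colSums(!is.na(tmp)) == 0)
empty_row_idx <- which(rowSums(!is.na(tmp)) == 0)

# keep everything that is not empty; setdiff() stays safe when nothing is empty
keep_rows <- setdiff(seq_len(nrow(tmp)), empty_row_idx)
keep_cols <- setdiff(seq_len(ncol(tmp)), empty_col_idx)
cleaned <- tmp[keep_rows, keep_cols, drop = FALSE]

print(empty_col_idx)
print(empty_row_idx)
print(cleaned)
```

Using `setdiff()` instead of negative indexing avoids the pitfall that `tmp[-integer(0), ]` would drop every row when no empty rows exist.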
This function finds variables in a data frame whose variable names or variable (and value) label attributes match a specific pattern. Regular expressions are supported for the pattern.
find_var(
  data,
  pattern,
  ignore.case = TRUE,
  search = c("name_label", "name_value", "label_value", "name", "label", "value", "all"),
  out = c("table", "df", "index"),
  fuzzy = FALSE,
  regex = FALSE
)

find_in_data(
  data,
  pattern,
  ignore.case = TRUE,
  search = c("name_label", "name_value", "label_value", "name", "label", "value", "all"),
  out = c("table", "df", "index"),
  fuzzy = FALSE,
  regex = FALSE
)
data |
A data frame. |
pattern |
Character string to be matched in |
ignore.case |
Logical, whether matching should be case sensitive or not.
|
search |
Character string, indicating where
|
out |
Output (return) format of the search results. May be abbreviated and must be one of:
|
fuzzy |
Logical, if |
regex |
Logical, if |
This function searches for pattern in data's column names and - for labelled data - in all variable and value labels of data's variables (see get_label for details on variable labels and labelled data). Regular expressions are supported as well, by simply using pattern = stringr::regex(...) or regex = TRUE.
By default (i.e. out = "table"), returns a data frame with three columns: column number, variable name and variable label. If out = "index", returns a named vector with column indices of matching variables (variable names are used as names-attribute); if out = "df", returns the matching variables as data frame.
data(efc)

# find variables with "cop" in variable name
find_var(efc, "cop")

# return data frame with matching variables
find_var(efc, "cop", out = "df")

# or return column numbers
find_var(efc, "cop", out = "index")

# find variables with "dependency" in names and variable labels
library(sjlabelled)
find_var(efc, "dependency")
get_label(efc$e42dep)

# find variables with "level" in names and value labels
res <- find_var(efc, "level", search = "name_value", out = "df")
res
get_labels(res, attr.only = FALSE)

# use sjPlot::view_df() to view results
## Not run: 
library(sjPlot)
view_df(res)
## End(Not run)
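The name-matching part of such a search, with the result shaped like the out = "index" return value (a named vector of column indices), can be approximated with `grep()` in base R. This sketch searches column names only and deliberately leaves out variable and value labels. It uses the built-in iris data rather than efc so that it runs without the package.

```r
# Base R approximation of a case-insensitive search in column names.
pattern <- "petal"
idx <- grep(pattern, names(iris), ignore.case = TRUE)

# use the variable names as names-attribute of the index vector
names(idx) <- names(iris)[idx]
print(idx)

# the matching variables as data frame
matches <- iris[, idx, drop = FALSE]
print(head(matches))
```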
This function creates a labelled flat table or flat proportional (marginal) table.
flat_table(
  data,
  ...,
  margin = c("counts", "cell", "row", "col"),
  digits = 2,
  show.values = FALSE,
  weights = NULL
)
data |
A data frame. May also be a grouped data frame (see 'Note' and 'Examples'). |
... |
One or more variables of |
margin |
Specify the table margin that should be computed for proportional
tables. By default, counts are printed. Use |
digits |
Numeric; for proportional tables, |
show.values |
Logical, if |
weights |
Bare name, or name as string, of a variable in |
An object of class ftable.
data may also be a grouped data frame (see group_by) with up to two grouping variables. Cross tables are then created for each subgroup.
frq for simple frequency table of labelled vectors.
data(efc)

# flat table with counts
flat_table(efc, e42dep, c172code, e16sex)

# flat table with proportions
flat_table(efc, e42dep, c172code, e16sex, margin = "row")

# flat table from grouped data frame. You need to select
# the grouping variables and at least two more variables for
# cross tabulation.
library(dplyr)
efc %>%
  group_by(e16sex) %>%
  select(e16sex, c172code, e42dep) %>%
  flat_table()

efc %>%
  group_by(e16sex, e42dep) %>%
  select(e16sex, e42dep, c172code, n4pstu) %>%
  flat_table()

# now it gets weird...
efc %>%
  group_by(e16sex, e42dep) %>%
  select(e16sex, e42dep, c172code, n4pstu, c161sex) %>%
  flat_table()
This function returns a frequency table of labelled vectors, as data frame.
frq(
  x,
  ...,
  sort.frq = c("none", "asc", "desc"),
  weights = NULL,
  auto.grp = NULL,
  show.strings = TRUE,
  show.na = TRUE,
  grp.strings = NULL,
  min.frq = 0,
  out = c("txt", "viewer", "browser"),
  title = NULL,
  encoding = "UTF-8",
  file = NULL
)
x |
A vector or a data frame. May also be a grouped data frame (see 'Note' and 'Examples'). |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
sort.frq |
Determines whether categories should be sorted
according to their frequencies or not. Default is |
weights |
Bare name, or name as string, of a variable in |
auto.grp |
Numeric value, indicating the minimum amount of unique
values in a variable, at which automatic grouping into smaller units
is done (see |
show.strings |
Logical, if |
show.na |
Logical, or |
grp.strings |
Numeric, if not |
min.frq |
Numeric, indicating the minimum frequency for which a
value will be shown in the output (except for the missing values, prevailing
|
out |
Character vector, indicating whether the results should be printed
to console ( |
title |
String, will be used as alternative title to the variable
label. If |
encoding |
Character vector, indicating the charset encoding used
for variable and value labels. Default is |
file |
Destination file, if the output should be saved as file.
Only used when |
The ...-argument not only accepts variable names or expressions from select-helpers. You can also use logical conditions, math operations, or combinations of variables to produce "crosstables". See 'Examples' for more details.
A list of data frames with values, value labels, frequencies, raw, valid and cumulative percentages of x.
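To make the percentage columns concrete, the following base-R sketch computes raw, valid and cumulative percentages for a small vector by hand. It does not call frq(); the vector and the column names (val, frq, raw_prc, valid_prc, cum_prc) are chosen for illustration only and may not match the package's output.

```r
# Illustrative vector with two missing values (not taken from the package).
x <- c(1, 2, 2, 3, 3, 3, NA, NA)

valid_counts <- as.vector(table(x))   # table() drops NA by default
n_missing <- sum(is.na(x))

# Raw percentages use all observations as denominator,
# valid percentages only the non-missing ones.
frq_all <- c(valid_counts, n_missing)
valid_share <- 100 * valid_counts / sum(valid_counts)

freq_table <- data.frame(
  val = c(sort(unique(x[!is.na(x)])), NA),
  frq = frq_all,
  raw_prc = 100 * frq_all / length(x),
  valid_prc = c(valid_share, NA),
  cum_prc = c(cumsum(valid_share), NA)
)
print(freq_table)
```

The missing-value row contributes to the raw percentages only, which is why the valid and cumulative columns are left as NA for it in this sketch.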
x may also be a grouped data frame (see group_by) with up to two grouping variables. Frequency tables are then created for each subgroup.
The print()-method adds a table header with information on the variable label, variable type, total and valid N, and mean and standard deviation. Mean and SD are always printed, even for categorical variables (factors) or character vectors. In this case, values are coerced into a numeric vector to calculate the summary statistics.
To print tables in markdown or HTML format, use print_md() or print_html().
See also flat_table for labelled (proportional) tables.
# simple vector
data(efc)
frq(efc$e42dep)

# with grouped data frames, in a pipe
library(dplyr)
efc %>%
  group_by(e16sex, c172code) %>%
  frq(e42dep)

# show only categories with a minimal amount of frequencies
frq(mtcars$gear)
frq(mtcars$gear, min.frq = 10)
frq(mtcars$gear, min.frq = 15)

# with select-helpers: all variables from the COPE-Index
# (which all have a "cop" in their name)
frq(efc, contains("cop"))

# all variables from column "c161sex" to column "c175empl"
frq(efc, c161sex:c175empl)

# for non-labelled data, variable name is printed,
# and "label" column is removed from output
data(iris)
frq(iris, Species)

# also works on grouped data frames
efc %>%
  group_by(c172code) %>%
  frq(is.na(nur_pst))

# group variables with large range and with weights
efc$weights <- abs(rnorm(n = nrow(efc), mean = 1, sd = .5))
frq(efc, c160age, auto.grp = 5, weights = weights)

# different weight options
frq(efc, c172code, weights = weights)
frq(efc, c172code, weights = "weights")
frq(efc, c172code, weights = efc$weights)
frq(efc$c172code, weights = efc$weights)

# group string values
dummy <- efc[1:50, 3, drop = FALSE]
dummy$words <- sample(
  c("Hello", "Helo", "Hole", "Apple", "Ape",
    "New", "Old", "System", "Systemic"),
  size = nrow(dummy),
  replace = TRUE
)

frq(dummy)
frq(dummy, grp.strings = 2)

#### other expressions than variables

# logical conditions
frq(mtcars, cyl == 6)

frq(efc, is.na(nur_pst), contains("cop"))

iris %>%
  frq(starts_with("Petal"), Sepal.Length > 5)

# computation of variables "on the fly"
frq(mtcars, (gear + carb) / cyl)

# crosstables
set.seed(123)
d <- data.frame(
  var_x = sample(letters[1:3], size = 30, replace = TRUE),
  var_y = sample(1:2, size = 30, replace = TRUE),
  var_z = sample(LETTERS[8:10], size = 30, replace = TRUE)
)
table(d$var_x, d$var_z)
frq(d, paste0(var_x, var_z))
frq(d, paste0(var_x, var_y, var_z))
This function groups elements of a string vector (character or string variable) according to the elements' distance ('similarity'). The more similar two string elements are, the higher the chance that they are combined into a group.
group_str(
  strings,
  precision = 2,
  strict = FALSE,
  trim.whitespace = TRUE,
  remove.empty = TRUE,
  verbose = FALSE,
  maxdist
)
strings |
Character vector with string elements. |
precision |
Maximum distance ("precision") between two string elements, which is allowed to treat them as similar or equal. Smaller values mean less tolerance in matching. |
strict |
Logical; if |
trim.whitespace |
Logical; if |
remove.empty |
Logical; if |
verbose |
Logical; if |
maxdist |
Deprecated. Please use |
A character vector where similar string elements (values) are recoded into a new, single value. The return value is of same length as strings, i.e. grouped elements appear multiple times, so the count for each grouped string is still available (see 'Examples').
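The idea of a maximum distance between two strings can be explored with base R alone. The sketch below uses utils::adist() (generalized Levenshtein distance) and a hypothetical threshold named precision; it is an assumption that this metric resembles the one used by group_str(), and the sketch only lists close pairs rather than forming groups.

```r
strings <- c("Hello", "Helo", "Hole", "Apple", "Ape",
             "New", "Old", "System", "Systemic")

# Pairwise generalized Levenshtein distances (base R, package utils).
d <- utils::adist(strings)
dimnames(d) <- list(strings, strings)

# Hypothetical threshold, playing the role of the precision argument.
precision <- 2

# Keep each unordered pair once (upper triangle) if it is close enough.
idx <- which(d <= precision & upper.tri(d), arr.ind = TRUE)
close_pairs <- data.frame(
  first = strings[idx[, "row"]],
  second = strings[idx[, "col"]],
  distance = d[idx]
)
print(close_pairs)
```

Raising the threshold admits more pairs, which mirrors the statement that smaller values mean less tolerance in matching.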
oldstring <- c("Hello", "Helo", "Hole", "Apple",
               "Ape", "New", "Old", "System", "Systemic")
newstring <- group_str(oldstring)

# see result
newstring

# count for each group
table(newstring)

# print table to compare original and grouped string
frq(oldstring)
frq(newstring)

# larger groups
newstring <- group_str(oldstring, precision = 3)
frq(oldstring)
frq(newstring)

# be more strict with matching pairs
newstring <- group_str(oldstring, precision = 3, strict = TRUE)
frq(oldstring)
frq(newstring)
Recode numeric variables into equal ranged, grouped factors, i.e. a variable is cut into a smaller number of groups, where each group has the same value range. group_labels() creates the related value labels. group_var_if() and group_labels_if() are scoped variants of group_var() and group_labels(), where grouping will be applied only to those variables that match the logical condition of predicate.
group_var(
  x,
  ...,
  size = 5,
  as.num = TRUE,
  right.interval = FALSE,
  n = 30,
  append = TRUE,
  suffix = "_gr"
)

group_var_if(
  x,
  predicate,
  size = 5,
  as.num = TRUE,
  right.interval = FALSE,
  n = 30,
  append = TRUE,
  suffix = "_gr"
)

group_labels(x, ..., size = 5, right.interval = FALSE, n = 30)

group_labels_if(x, predicate, size = 5, right.interval = FALSE, n = 30)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
size |
Numeric; group-size, i.e. the range for grouping. By default,
for each 5 categories of |
as.num |
Logical, if |
right.interval |
Logical; if |
n |
Sets the maximum number of groups that are defined when auto-grouping is on
( |
append |
Logical, if |
suffix |
Indicates which suffix will be added to each dummy variable.
Use |
predicate |
A predicate function to be applied to the columns. The
variables for which |
If size is set to a specific value, the variable is recoded into several groups, where each group has a maximum range of size. Hence, the number of groups differs depending on the range of x.
If size = "auto", the variable is recoded into a maximum of n groups. Hence, independent of the range of x, always the same number of groups is created, so the range within each group differs (depending on x's range).
right.interval determines which boundary values to include when grouping is done. If TRUE, grouping starts with the lower bound of size. For example, having a variable ranging from 50 to 80, groups cover the ranges from 50-54, 55-59, 60-64 etc. If FALSE (default), grouping starts with the upper bound of size. In this case, groups cover the ranges from 46-50, 51-55, 56-60, 61-65 etc. Note: This will cover a range from 46-50 as first group, even if values from 46 to 49 are not present. See 'Examples'.
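The two boundary conventions described above can be reproduced with base R's cut(). This is only a sketch of the interval arithmetic for values from 50 to 80 and a group size of 5; it does not call group_var(), the names lower_start and upper_start are made up for illustration, and how right.interval maps onto these two variants is deliberately not asserted here.

```r
x <- 50:80   # illustrative values ranging from 50 to 80
size <- 5

# Left-closed intervals [50,55), [55,60), ...: groups 50-54, 55-59, ...
lower_start <- cut(x, breaks = seq(50, 85, by = size), right = FALSE)

# Right-closed intervals (45,50], (50,55], ...: groups 46-50, 51-55, ...
upper_start <- cut(x, breaks = seq(45, 80, by = size), right = TRUE)

# Observed minimum and maximum within each group
print(tapply(x, lower_start, range))
print(tapply(x, upper_start, range))
```

In the right-closed variant the first group nominally spans 46-50 although only the value 50 is present, which is the situation the note above refers to.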
If you want to split a variable into a certain number of equal sized groups (instead of having groups where values have all the same range), use the split_var function!
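The difference between groups of equal value range and groups of equal size can be sketched in base R. The example below contrasts cut() with equal-width breaks against cut() with quantile-based breaks; it is an illustration of the two concepts only, and split_var() may compute its groups differently.

```r
# Skewed illustrative data: most values are small, a few are large.
x <- c(1, 2, 3, 4, 5, 6, 50, 60, 100)

# Equal value range: three intervals of the same width.
equal_range <- cut(x, breaks = 3)

# Equal group size: breaks at the terciles of x.
tercile_breaks <- quantile(x, probs = seq(0, 1, length.out = 4))
equal_size <- cut(x, breaks = tercile_breaks, include.lowest = TRUE)

print(table(equal_range))   # group sizes differ
print(table(equal_size))    # group sizes are balanced
```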
group_var() also works on grouped data frames (see group_by). In this case, grouping is applied to the subsets of variables in x. See 'Examples'.
For group_var(), a grouped variable, either as numeric or as factor (see parameter as.num). If x is a data frame, only the grouped variables will be returned.
For group_labels(), a string vector or a list of string vectors containing labels based on the grouped categories of x, formatted as "from lower bound to upper bound", e.g. "10-19" "20-29" "30-39" etc. See 'Examples'.
Variable label attributes (see, for instance, set_label) are preserved. Usually you should use the same values for size and right.interval in group_labels() as used in the group_var function if you want matching labels for the related recoded variable.
See also split_var to split variables into equal sized groups, group_str for grouping string vectors or rec_pattern and rec for another convenient way of recoding variables into smaller groups.
age <- abs(round(rnorm(100, 65, 20)))
age.grp <- group_var(age, size = 10)
hist(age)
hist(age.grp)

age.grpvar <- group_labels(age, size = 10)
table(age.grp)
print(age.grpvar)

# histogram with EUROFAMCARE sample dataset
# variable not grouped
library(sjlabelled)
data(efc)
hist(efc$e17age, main = get_label(efc$e17age))

# bar plot with EUROFAMCARE sample dataset
# grouped variable
ageGrp <- group_var(efc$e17age)
ageGrpLab <- group_labels(efc$e17age)
barplot(table(ageGrp), main = get_label(efc$e17age), names.arg = ageGrpLab)

# within a pipe-chain
library(dplyr)
efc %>%
  select(e17age, c12hour, c160age) %>%
  group_var(size = 20)

# create vector with values from 50 to 80
dummy <- round(runif(200, 50, 80))
# labels with grouping starting at lower bound
group_labels(dummy)
# labels with grouping starting at upper bound
group_labels(dummy, right.interval = TRUE)

# works also with grouped data frames
mtcars %>%
  group_var(disp, size = 4, append = FALSE) %>%
  table()

mtcars %>%
  group_by(cyl) %>%
  group_var(disp, size = 4, append = FALSE) %>%
  table()
This function checks if variables or observations in a data frame have NA, NaN or Inf values.
has_na(x, ..., by = c("col", "row"), out = c("table", "df", "index"))

incomplete_cases(x, ...)

complete_cases(x, ...)

complete_vars(x, ...)

incomplete_vars(x, ...)
x |
A data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
by |
Whether to check column- or row-wise for missing and infinite values.
If |
out |
Output (return) format of the results. May be abbreviated. |
If x is a vector, returns TRUE if x has any missing or infinite values. If x is a data frame, returns TRUE for each variable (if by = "col") or observation (if by = "row") that has any missing or infinite values. If out = "table", results are returned as data frame, with column number, variable name and label, and a logical vector indicating if a variable has missing values or not. However, it's printed in colors, with green rows indicating that a variable has no missings, while red rows indicate the presence of missings or infinite values. If out = "index", a named vector is returned.
complete_cases() and incomplete_cases() are convenient shortcuts for has_na(by = "row", out = "index"), where the first only returns case-id's for all complete cases, and the latter only for non-complete cases. complete_vars() and incomplete_vars() are convenient shortcuts for has_na(by = "col", out = "index"), and again only return those column-id's for variables which are (in-)complete.
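The column-wise and row-wise checks described above can be sketched in base R. The data frame d and the helper is_bad() below are hypothetical and only illustrate the logic; they are not the package's implementation, and the return formats of has_na() are not reproduced.

```r
# Small illustrative data frame with one NA and one Inf.
d <- data.frame(
  a = c(1, NA, 3),
  b = c(1, 2, Inf),
  c = c(1, 2, 3)
)

# is.na() is TRUE for both NA and NaN; is.infinite() catches Inf and -Inf.
is_bad <- function(v) is.na(v) | is.infinite(v)

# Column-wise: does a variable contain any flagged value?
by_col <- vapply(d, function(v) any(is_bad(v)), logical(1))

# Row-wise: does an observation contain any flagged value?
by_row <- Reduce(`|`, lapply(d, is_bad))

# Row indices, analogous in spirit to incomplete and complete cases.
incomplete_rows <- which(by_row)
complete_rows <- which(!by_row)

print(by_col)
print(incomplete_rows)
print(complete_rows)
```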
data(efc)
has_na(efc$e42dep)
has_na(efc, e42dep, tot_sc_e, c161sex)
has_na(efc)

has_na(efc, e42dep, tot_sc_e, c161sex, out = "index")
has_na(efc, out = "df")

has_na(efc, by = "row")
has_na(efc, e42dep, tot_sc_e, c161sex, by = "row", out = "index")
has_na(efc, by = "row", out = "df")

complete_cases(efc, e42dep, tot_sc_e, c161sex)
incomplete_cases(efc, e42dep, tot_sc_e, c161sex)
complete_vars(efc, e42dep, tot_sc_e, c161sex)
incomplete_vars(efc, e42dep, tot_sc_e, c161sex)
These functions check whether two factors are (fully) crossed or nested, i.e. if each level of one factor occurs in combination with each level of the other factor (is_crossed()) or, respectively, if each category of the first factor co-occurs with only one category of the other (is_nested()). is_cross_classified() checks if one factor level occurs in some, but not all levels of another factor.
is_crossed(f1, f2)

is_nested(f1, f2)

is_cross_classified(f1, f2)
f1 |
Numeric vector or |
f2 |
Numeric vector or |
Logical. For is_crossed(), TRUE if factors are (fully) crossed, FALSE otherwise. For is_nested(), TRUE if factors are nested, FALSE otherwise. For is_cross_classified(), TRUE, if one factor level occurs in some, but not all levels of another factor.
If factors are nested, a message is displayed to tell whether f1 is nested within f2 or vice versa.
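The definitions of crossed and nested factors can be read directly off a contingency table. The helper crossing_summary() below is a hypothetical base-R sketch of those definitions, not the package's implementation.

```r
crossing_summary <- function(f1, f2) {
  # TRUE where a combination of levels occurs at least once
  occurs <- table(f1, f2) > 0
  list(
    # every level of f1 meets every level of f2
    fully_crossed = all(occurs),
    # every level of f1 occurs in exactly one level of f2
    f1_nested_in_f2 = all(rowSums(occurs) == 1),
    # every level of f2 occurs in exactly one level of f1
    f2_nested_in_f1 = all(colSums(occurs) == 1)
  )
}

# fully crossed example
crossed <- crossing_summary(
  c(1, 4, 3, 2, 3, 2, 1, 4),
  c(1, 1, 1, 2, 2, 1, 2, 2)
)

# nested example: each level of the first factor sits in one level of the second
nested <- crossing_summary(
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

str(crossed)
str(nested)
```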
Grace, K. The Difference Between Crossed and Nested Factors. (web)
# crossed factors, each category of
# x appears in each category of y
x <- c(1,4,3,2,3,2,1,4)
y <- c(1,1,1,2,2,1,2,2)
# show distribution
table(x, y)
# check if crossed
is_crossed(x, y)

# not crossed factors
x <- c(1,4,3,2,3,2,1,4)
y <- c(1,1,1,2,1,1,2,2)
# show distribution
table(x, y)
# check if crossed
is_crossed(x, y)

# nested factors, each category of
# x appears in one category of y
x <- c(1,2,3,4,5,6,7,8,9)
y <- c(1,1,1,2,2,2,3,3,3)
# show distribution
table(x, y)
# check if nested
is_nested(x, y)
is_nested(y, x)

# not nested factors
x <- c(1,2,3,4,5,6,7,8,9,1,2)
y <- c(1,1,1,2,2,2,3,3,3,2,3)
# show distribution
table(x, y)
# check if nested
is_nested(x, y)
is_nested(y, x)

# also not fully crossed
is_crossed(x, y)

# but partially crossed
is_cross_classified(x, y)
This function checks whether a string or character vector (of length 1), a list or any vector (numeric, atomic) is empty or not.
is_empty(x, first.only = TRUE, all.na.empty = TRUE)
x |
String, character vector, list, data.frame or numeric vector or factor. |
first.only |
Logical, if |
all.na.empty |
Logical, if |
Logical, TRUE if x is a character vector or string and is empty, TRUE if x is a vector or list and of length 0, FALSE otherwise.
NULL- or NA-values are also considered as "empty" (see 'Examples') and will return TRUE, unless all.na.empty==FALSE.
is_empty("test")
is_empty("")
is_empty(NA)
is_empty(NULL)

# string is not empty
is_empty(" ")

# however, this trimmed string is
is_empty(trim(" "))

# numeric vector
x <- 1
is_empty(x)
x <- x[-1]
is_empty(x)

# check multiple elements of character vectors
is_empty(c("", "a"))
is_empty(c("", "a"), first.only = FALSE)

# empty data frame
d <- data.frame()
is_empty(d)

# empty list
is_empty(list(NULL))

# NA vector
x <- rep(NA,5)
is_empty(x)
is_empty(x, all.na.empty = FALSE)
Checks whether x is an even or odd number. Only accepts numeric vectors.
is_even(x)

is_odd(x)
x |
Numeric vector or single numeric value, or a data frame or list with such vectors. |
is_even() returns TRUE for each even value of x, FALSE for odd values. is_odd() returns TRUE for each odd value of x and FALSE for even values.
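The element-wise result can be reproduced with the modulo operator in base R. This is a sketch of the underlying arithmetic only and is not claimed to be how is_even() and is_odd() are implemented.

```r
x <- 1:4

# Remainder after division by 2 is 0 for even and 1 for odd integers.
even <- x %% 2 == 0
odd <- x %% 2 == 1

print(even)
print(odd)
```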
is_even(4)
is_even(5)
is_even(1:4)

is_odd(4)
is_odd(5)
is_odd(1:4)
is_float() checks whether an input vector or value is a numeric non-integer (double), depending on fractional parts of the value(s). is_whole() does the opposite and checks whether an input vector is a whole number (without fractional parts).
is_float(x)

is_whole(x)
x |
A value, vector or data frame. |
For is_float(), TRUE if x is a floating value (non-integer double), FALSE otherwise (also returns FALSE for character vectors and factors). For is_whole(), TRUE if x is a vector with whole numbers only, FALSE otherwise (returns TRUE for character vectors and factors).
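The distinction between storage type and fractional part can be sketched in base R. The helper has_fraction() is hypothetical and only mirrors the description of is_float() for vectors; it makes no claim about the package's actual code or its handling of data frames.

```r
# TRUE only for numeric input with at least one non-zero fractional part.
has_fraction <- function(x) {
  is.numeric(x) && any(x %% 1 != 0, na.rm = TRUE)
}

is.double(4)          # storage type is double ...
has_fraction(4)       # ... but the value has no fractional part
has_fraction(4.2)
has_fraction("4.2")   # character input is never treated as floating
```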
data(mtcars)
data(iris)

is.double(4)
is_float(4)
is_float(4.2)
is_float(iris)

is_whole(4)
is_whole(4.2)
is_whole(mtcars)
is_num_fac() checks whether a factor has only numeric or any non-numeric factor levels, while is_num_chr() checks whether a character vector has only numeric strings.
is_num_fac(x)

is_num_chr(x)
x |
A factor for |
Logical, TRUE if factor has numeric factor levels only, or if character vector has numeric strings only, FALSE otherwise.
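One way to test whether strings are numeric is to attempt a conversion and look for failures. The helper all_numeric_strings() below is a hypothetical base-R sketch of that idea, applied to factor levels and to a character vector; it is not the package's implementation.

```r
# TRUE if every non-missing string converts to a number.
all_numeric_strings <- function(values) {
  values <- values[!is.na(values)]
  !anyNA(suppressWarnings(as.numeric(values)))
}

f1 <- factor(c(NA, 1, 3, NA, 2, 4))
f2 <- factor(c(NA, "C", 1, 3, "A", NA, 2, 4))

all_numeric_strings(levels(f1))     # numeric levels only
all_numeric_strings(levels(f2))     # levels "A" and "C" fail to convert
all_numeric_strings(c("2", "1"))
all_numeric_strings(c("a", "1"))
```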
# numeric factor levels
f1 <- factor(c(NA, 1, 3, NA, 2, 4))
is_num_fac(f1)

# not completely numeric factor levels
f2 <- factor(c(NA, "C", 1, 3, "A", NA, 2, 4))
is_num_fac(f2)

# not completely numeric factor levels
f3 <- factor(c("Justus", "Bob", "Peter"))
is_num_fac(f3)

is_num_chr(c("a", "1"))
is_num_chr(c("2", "1"))
This function merges multiple imputed data frames from mice::mids()-objects into a single data frame by computing the mean or selecting the most likely imputed value.
merge_imputations(
  dat,
  imp,
  ori = NULL,
  summary = c("none", "dens", "hist", "sd"),
  filter = NULL
)
dat |
The data frame that was imputed and used as argument in the
|
imp |
The |
ori |
Optional, if |
summary |
After merging multiple imputed data,
|
filter |
A character vector with variable names that should be plotted. All non-defined variables will not be shown in the plot. |
This method merges multiple imputations of variables into a single variable by computing the (rounded) mean of all imputed values of missing values. By this, each missing value is replaced by those values that have been imputed the most times.
imp must be a mids-object, which is returned by the mice()-function of the mice-package. merge_imputations() then creates a data frame for each imputed variable, by combining all imputations (as returned by the complete-function) of each variable, and computing the row means of this data frame. The mean value is then rounded for integer values (and not for numerical values with fractional part), which corresponds to the most frequent imputed value (mode) for a missing value. Missings in the original variable are replaced by the most frequent imputed value.
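The row-mean step can be sketched with a small, made-up matrix of imputations. The values below are hypothetical (they do not come from mice), and the sketch only shows the arithmetic of rounding row means and comparing them with the row-wise mode for this particular example.

```r
# Five hypothetical imputations (columns) for three originally missing
# values (rows) of an integer-coded variable.
imputations <- cbind(
  imp1 = c(1, 3, 2),
  imp2 = c(1, 3, 2),
  imp3 = c(2, 3, 1),
  imp4 = c(1, 2, 2),
  imp5 = c(1, 3, 2)
)

# Rounded mean across imputations, one merged value per row.
merged <- round(rowMeans(imputations))

# Most frequent imputed value per row, for comparison.
row_mode <- apply(imputations, 1, function(r) {
  as.numeric(names(which.max(table(r))))
})

print(merged)
print(row_mode)
```

In this example both summaries agree for every row; with more dispersed imputations the rounded mean and the mode need not coincide.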
A data frame with (merged) imputed variables; or ori with appended imputed variables, if ori was specified. If summary is included, returns a list with the data frame data with (merged) imputed variables and some other summary information, including the plot as ggplot-object.
Typically, further analyses are conducted on pooled results of multiple imputed data sets (see pool), however, sometimes (in social sciences) it is also feasible to compute the mean or mode of multiple imputed variables (see Burns et al. 2011).
Burns RA, Butterworth P, Kiely KM, Bielak AAM, Luszcz MA, Mitchell P, et al. 2011. Multiple imputation was an efficient method for harmonizing the Mini-Mental State Examination with missing item-level data. Journal of Clinical Epidemiology;64:787-93 doi:10.1016/j.jclinepi.2010.10.011
if (require("mice")) {
  imp <- mice(nhanes)

  # return data frame with imputed variables
  merge_imputations(nhanes, imp)

  # append imputed variables to original data frame
  merge_imputations(nhanes, imp, nhanes)

  # show summary of quality of merging imputations
  merge_imputations(nhanes, imp, summary = "dens", filter = c("chl", "hyp"))
}
move_columns() moves one or more columns in a data frame to another position.
move_columns(data, ..., .before, .after)
data |
A data frame. |
... |
Unquoted names or character vector with names of variables that
should be moved to another position. You may also use functions like
|
.before |
Optional, column name or numeric index of the position where
|
.after |
Optional, column name or numeric index of the position where
|
data, with resorted columns.
If neither .before nor .after are specified, the column is moved to the end of the data frame by default. .before and .after are evaluated in a non-standard fashion, so you need quasi-quotation when the value for .before or .after is a vector with the target-column value. See 'Examples'.
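For comparison, moving a column with standard evaluation in base R needs no quasi-quotation, because the target name is an ordinary character value. The helper move_after() below is hypothetical and is not part of the package; it only illustrates the reordering itself.

```r
# Move `column` directly behind `after`; both are column names as strings.
move_after <- function(data, column, after) {
  stopifnot(column %in% names(data), after %in% names(data), column != after)
  remaining <- setdiff(names(data), column)
  position <- match(after, remaining)
  data[, append(remaining, column, after = position), drop = FALSE]
}

# A variable holding the target name can be passed as is.
target <- "Petal.Width"
moved <- move_after(iris, "Sepal.Width", target)
names(moved)
```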
## Not run: 
data(iris)

iris %>%
  move_columns(Sepal.Width, .after = "Species") %>%
  head()

iris %>%
  move_columns(Sepal.Width, .before = Sepal.Length) %>%
  head()

iris %>%
  move_columns(Species, .before = 1) %>%
  head()

iris %>%
  move_columns("Species", "Petal.Length", .after = 1) %>%
  head()

library(dplyr)
iris %>%
  move_columns(contains("Width"), .after = "Species") %>%
  head()

## End(Not run)

# using quasi-quotation
target <- "Petal.Width"
# does not work, column is moved to the end
iris %>%
  move_columns(Sepal.Width, .after = target) %>%
  head()

# using !! works
iris %>%
  move_columns(Sepal.Width, .after = !!target) %>%
  head()
This function converts numeric variables into factors, and uses associated value labels as factor levels.
numeric_to_factor(x, n = 4)
x |
A data frame. |
n |
Numeric, indicating the maximum amount of unique values in |
If x is a labelled vector, associated value labels will be used as level. Else, the numeric vector is simply coerced using as.factor().
x, with numeric values with a maximum of n unique values being converted to factors.
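The rule of converting only numeric columns with at most n unique values can be sketched in base R. The data frame and column names below are made up, value labels are ignored, and the sketch is not the package's implementation.

```r
# Illustrative data: one column with few, one with many unique values.
d <- data.frame(
  few = c(1, 2, 1, 2, 3),
  many = c(1.5, 2.5, 3.5, 4.5, 5.5)
)
n <- 4   # maximum number of unique values for conversion

# Flag numeric columns whose number of unique non-missing values is <= n.
few_unique <- vapply(
  d,
  function(v) is.numeric(v) && length(unique(v[!is.na(v)])) <= n,
  logical(1)
)

# Coerce only the flagged columns.
d[few_unique] <- lapply(d[few_unique], as.factor)
str(d)
```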
library(dplyr)
data(efc)
efc %>%
  select(e42dep, e16sex, c12hour, c160age, c172code) %>%
  numeric_to_factor()
rec() recodes values of variables, where variable selection is based on variable names or column position, or on select helpers (see documentation on ...). rec_if() is a scoped variant of rec(), where recoding will be applied only to those variables that match the logical condition of predicate.
rec(
  x,
  ...,
  rec,
  as.num = TRUE,
  var.label = NULL,
  val.labels = NULL,
  append = TRUE,
  suffix = "_r",
  to.factor = !as.num
)

rec_if(
  x,
  predicate,
  rec,
  as.num = TRUE,
  var.label = NULL,
  val.labels = NULL,
  append = TRUE,
  suffix = "_r",
  to.factor = !as.num
)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
rec |
String with recode pairs of old and new values. See 'Details'
for examples. |
as.num |
Logical, if |
var.label |
Optional string, to set variable label attribute for the
returned variable (see vignette Labelled Data and the sjlabelled-Package).
If |
val.labels |
Optional character vector, to set value label attributes
of recoded variable (see vignette Labelled Data and the sjlabelled-Package).
If |
append |
Logical, if |
suffix |
String value, will be appended to variable (column) names of
If |
to.factor |
Logical, alias for |
predicate |
A predicate function to be applied to the columns. The
variables for which |
The rec string has the following syntax:
each recode pair has to be separated by a ;, e.g. rec = "1=1; 2=4; 3=2; 4=3"
multiple old values that should be recoded into a new single value may be separated with comma, e.g. "1,2=1; 3,4=2"
a value range is indicated by a colon, e.g. "1:4=1; 5:8=2" (recodes all values from 1 to 4 into 1, and from 5 to 8 into 2)
for double vectors (with fractional part), all values within the specified range are recoded; e.g. 1:2.5=1;2.6:3=2 recodes 1 to 2.5 into 1 and 2.6 to 3 into 2, but 2.55 would not be recoded (since it's not included in any of the specified ranges)
"min" and "max": minimum and maximum values are indicated by min (or lo) and max (or hi), e.g. "min:4=1; 5:max=2" (recodes all values from minimum values of x to 4 into 1, and from 5 to maximum values of x into 2)
"else": all other values, which have not been specified yet, are indicated by else, e.g. "3=1; 1=2; else=3" (recodes 3 into 1, 1 into 2 and all other values into 3)
"copy": the "else"-token can be combined with copy, indicating that all remaining, not yet recoded values should stay the same (are copied from the original value), e.g. "3=1; 1=2; else=copy" (recodes 3 into 1, 1 into 2 and all other values like 2, 4 or 5 etc. will not be recoded, but copied, see 'Examples')
NA's: NA values are allowed both as old and new value, e.g. "NA=1; 3:5=NA" (recodes all NA into 1, and all values from 3 to 5 into NA in the new variable)
"rev": "rev" is a special token that reverses the value order (see 'Examples')
value labels for new values can be assigned inside the recode pattern by writing the value label in square brackets after defining the new value in a recode pair, e.g. "15:30=1 [young aged]; 31:55=2 [middle aged]; 56:max=3 [old aged]". See 'Examples'.
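The core of the recode syntax above (pairs separated by ;, comma lists, colon ranges, else and copy) can be mimicked by a few lines of base R. The function recode_sketch() is hypothetical and deliberately limited to numeric input: it does not support min/max, NA, "rev" or value labels, and it is not how rec() is implemented.

```r
recode_sketch <- function(x, rec) {
  pairs <- trimws(strsplit(rec, ";", fixed = TRUE)[[1]])
  out <- rep(NA_real_, length(x))    # non-matching values stay NA
  done <- rep(FALSE, length(x))      # tracks already recoded elements
  for (p in pairs) {
    parts <- trimws(strsplit(p, "=", fixed = TRUE)[[1]])
    old <- parts[1]
    new <- parts[2]
    if (old == "else") {
      # everything not recoded so far (missing values are left alone here)
      hit <- !done & !is.na(x)
    } else if (grepl(":", old, fixed = TRUE)) {
      # value range, both bounds inclusive
      bounds <- as.numeric(strsplit(old, ":", fixed = TRUE)[[1]])
      hit <- !done & !is.na(x) & x >= bounds[1] & x <= bounds[2]
    } else {
      # single value or comma-separated list of values
      values <- as.numeric(strsplit(old, ",", fixed = TRUE)[[1]])
      hit <- !done & x %in% values
    }
    out[hit] <- if (new == "copy") x[hit] else as.numeric(new)
    done <- done | hit
  }
  out
}

recode_sketch(c(1, 2, 3, 4, 5), "1,2=1; 3,4=2")
recode_sketch(1:8, "1:4=1; 5:8=2")
recode_sketch(c(1, 2, 3, 4, 5), "3=1; 1=2; else=copy")
recode_sketch(c(1, 2.5, 2.55, 3), "1:2.5=1;2.6:3=2")
```

The last call shows the gap between ranges: 2.55 falls into neither range and is therefore left as NA in this sketch.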
x with recoded categories. If x is a data frame, for append = TRUE, x including the recoded variables as new columns is returned; if append = FALSE, only the recoded variables will be returned. If append = TRUE and suffix = "", recoded variables will replace (overwrite) existing variables.
Please note the following behaviours of the function:
the "else"-token should always be the last argument in the rec-string.
Non-matching values will be set to NA, unless captured by the "else"-token.
Tagged NA values (see tagged_na) and their value labels will be preserved when copying NA values to the recoded vector with "else=copy".
Variable label attributes (see, for instance, get_label) are preserved (unless changed via var.label-argument), however, value label attributes are removed (except for "rev", where present value labels will be automatically reversed as well). Use val.labels-argument to add labels for recoded values.
If x is a data frame, all variables should have the same categories resp. value range (else, see second bullet, NAs are produced).
See also set_na for setting NA values, replace_na to replace NA's with specific value, recode_to for re-shifting value ranges and ref_lvl to change the reference level of (numeric) factors.
data(efc)
table(efc$e42dep, useNA = "always")

# replace NA with 5
table(rec(efc$e42dep, rec = "1=1;2=2;3=3;4=4;NA=5"), useNA = "always")

# recode 1 to 2 into 1 and 3 to 4 into 2
table(rec(efc$e42dep, rec = "1,2=1; 3,4=2"), useNA = "always")

# keep value labels. variable label is automatically preserved
library(dplyr)
efc %>%
  select(e42dep) %>%
  rec(rec = "1,2=1; 3,4=2",
      val.labels = c("low dependency", "high dependency")) %>%
  frq()

# works with mutate
efc %>%
  select(e42dep, e17age) %>%
  mutate(dependency_rev = rec(e42dep, rec = "rev")) %>%
  head()

# recode 1 to 3 into 1 and 4 into 2
table(rec(efc$e42dep, rec = "min:3=1; 4=2"), useNA = "always")

# recode 2 to 1 and all others into 2
table(rec(efc$e42dep, rec = "2=1; else=2"), useNA = "always")

# reverse value order
table(rec(efc$e42dep, rec = "rev"), useNA = "always")

# recode only selected values, copy remaining
table(efc$e15relat)
table(rec(efc$e15relat, rec = "1,2,4=1; else=copy"))

# recode variables with same category in a data frame
head(efc[, 6:9])
head(rec(efc[, 6:9], rec = "1=10;2=20;3=30;4=40"))

# recode multiple variables and set value labels via recode-syntax
dummy <- rec(
  efc, c160age, e17age,
  rec = "15:30=1 [young]; 31:55=2 [middle]; 56:max=3 [old]",
  append = FALSE
)
frq(dummy)

# recode variables with same value-range
lapply(
  rec(
    efc, c82cop1, c83cop2, c84cop3,
    rec = "1,2=1; NA=9; else=copy",
    append = FALSE
  ),
  table,
  useNA = "always"
)

# recode character vector
dummy <- c("M", "F", "F", "X")
rec(dummy, rec = "M=Male; F=Female; X=Refused")

# recode numeric to character
rec(efc$e42dep, rec = "1=first;2=2nd;3=third;else=hi") %>% head()

# recode non-numeric factors
data(iris)
table(rec(iris, Species, rec = "setosa=huhu; else=copy", append = FALSE))

# recode floating points
table(rec(
  iris, Sepal.Length,
  rec = "lo:5=1;5.01:6.5=2;6.501:max=3",
  append = FALSE
))

# preserve tagged NAs
if (require("haven")) {
  x <- labelled(c(1:3, tagged_na("a", "c", "z"), 4:1),
                c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
                  "Refused" = tagged_na("a"), "Not home" = tagged_na("z")))
  # get current value labels
  x
  # recode 2 into 5; Values of tagged NAs are preserved
  rec(x, rec = "2=5;else=copy")
}

# use select-helpers from dplyr-package
out <- rec(
  efc, contains("cop"), c161sex:c175empl,
  rec = "0,1=0; else=1",
  append = FALSE
)
head(out)

# recode only variables that have a value range from 1-4
p <- function(x) min(x, na.rm = TRUE) > 0 && max(x, na.rm = TRUE) < 5
out <- rec_if(efc, predicate = p, rec = "1:3=1;4=2;else=copy")
head(out)
Convenience function to create a recode pattern for the rec function, which recodes (numeric) vectors into smaller groups.
rec_pattern(from, to, width = 5, other = NULL)
from |
Minimum value that should be recoded. |
to |
Maximum value that should be recoded. |
width |
Numeric, indicating the range of each group. |
other |
String token, indicating how to deal with all other values
that have not been captured by the recode pattern. See 'Details'
on the |
A list with two values:
pattern
string pattern that can be used as rec
argument for the rec
-function.
labels
the associated value labels that can be used with set_labels
.
group_var
for recoding variables into smaller groups, and
group_labels
to create the associated value labels.
rp <- rec_pattern(1, 100)
rp

# sample data, inspect age of carers
data(efc)
table(efc$c160age, exclude = NULL)
table(rec(efc$c160age, rec = rp$pattern), exclude = NULL)

# recode carers age into groups of width 5
x <- rec(
  efc$c160age,
  rec = rp$pattern,
  val.labels = rp$labels
)

# watch result
frq(x)
Recodes (or "renumbers") the categories of variables into new
category values, beginning with the lowest value specified by lowest
.
Useful when recoding dummy variables with 1/2 values to 0/1 values, or
recoding scales from 1-4 to 0-3 etc.
recode_to_if()
is a scoped variant of recode_to()
, where
recoding will be applied only to those variables that match the
logical condition of predicate
.
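The renumbering described above (1/2 to 0/1, or 1-4 to 0-3) can be pictured as a constant shift. The following base R sketch is an illustration only: `shift_to_lowest()` is a hypothetical helper, not a package function, and it assumes that recoding amounts to moving the smallest observed value to `lowest`. It does not cover the `highest` argument or label handling.

```r
# Hypothetical helper (not part of the package): shift all values so that
# the smallest observed value becomes `lowest`
shift_to_lowest <- function(x, lowest = 0) {
  x - min(x, na.rm = TRUE) + lowest
}

# scale from 1-4 becomes 0-3, missing values stay missing
shifted <- shift_to_lowest(c(1, 4, 2, 3, NA))
shifted

# dummy variable with 1/2 values becomes 0/1
dummy01 <- shift_to_lowest(c(1, 2, 2, 1))
dummy01

# values starting at 11 are renumbered to start at 1
from_one <- shift_to_lowest(c(11, 15, 13), lowest = 1)
from_one
```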
recode_to(x, ..., lowest = 0, highest = -1, append = TRUE, suffix = "_r0")

recode_to_if(
  x,
  predicate,
  lowest = 0,
  highest = -1,
  append = TRUE,
  suffix = "_r0"
)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
lowest |
Indicating the lowest category value for recoding. Default is 0, so the new variable starts with value 0. |
highest |
If specified and greater than |
append |
Logical, if |
suffix |
Indicates which suffix will be added to each dummy variable.
Use |
predicate |
A predicate function to be applied to the columns. The
variables for which |
x
with recoded category values, where lowest
indicates
the lowest value; If x
is a data frame, for append = TRUE
,
x
including the recoded variables as new columns is returned; if
append = FALSE
, only the recoded variables will be returned. If
append = TRUE
and suffix = ""
, recoded variables will replace
(overwrite) existing variables.
Value and variable label attributes are preserved.
rec
for general recoding of variables and set_na
for setting NA
values.
# recode 1-4 to 0-3
dummy <- sample(1:4, 10, replace = TRUE)
recode_to(dummy)

# recode 3-6 to 0-3
# note that numeric type is returned
dummy <- as.factor(3:6)
recode_to(dummy)

# lowest value starting with 1
dummy <- sample(11:15, 10, replace = TRUE)
recode_to(dummy, lowest = 1)

# lowest value starting with 1, highest with 3
# all others set to NA
dummy <- sample(11:15, 10, replace = TRUE)
recode_to(dummy, lowest = 1, highest = 3)

# recode multiple variables at once
data(efc)
recode_to(efc, c82cop1, c83cop2, c84cop3, append = FALSE)

library(dplyr)
efc %>%
  select(c82cop1, c83cop2, c84cop3) %>%
  mutate(
    c82new = recode_to(c83cop2, lowest = 5),
    c83new = recode_to(c84cop3, lowest = 3)
  ) %>%
  head()
Changes the reference level of a (numeric) factor.
ref_lvl(x, ..., lvl = NULL)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
lvl |
Either numeric, indicating the new reference level, or a string,
indicating the value label from the new reference level. If |
Unlike relevel, this function behaves differently for factors with numeric factor levels or for labelled data, i.e. factors with value labels for the values. ref_lvl() changes the reference level by recoding the factor's values using the rec function.
Hence, all values from lowest up to the reference level indicated by
lvl
are recoded, with lvl
starting as lowest factor value.
For factors with non-numeric factor levels, the function simply returns
relevel(x, ref = lvl)
. See 'Examples'.
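For the non-numeric case, the base R call that the text names can be tried directly. This sketch uses only `stats::relevel()` and the `iris` data; the variable name `releveled` is chosen here for illustration.

```r
# Base R behaviour that ref_lvl() falls back to for non-numeric factor levels
data(iris)
levels(iris$Species)

# make "versicolor" the reference (first) level
releveled <- relevel(iris$Species, ref = "versicolor")
levels(releveled)

# only the level order changes, the observed values stay the same
all(as.character(releveled) == as.character(iris$Species))
```

The reference level moves to the front while the remaining levels keep their relative order.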
x
with new reference level. If x
is a data frame, the complete data frame x
will be returned,
where variables specified in ...
will be re-leveled;
if ...
is not specified, applies to all variables in the
data frame.
to_factor
to convert numeric vectors into factors;
rec
to recode variables.
data(efc)
x <- to_factor(efc$e42dep)
str(x)
frq(x)

# see column "val" in frq()-output, which indicates
# how values/labels were recoded after using ref_lvl()
x <- ref_lvl(x, lvl = 3)
str(x)
frq(x)

library(dplyr)
dat <- efc %>%
  select(c82cop1, c83cop2, c84cop3) %>%
  to_factor()

frq(dat)
ref_lvl(dat, c82cop1, c83cop2, lvl = 2) %>% frq()

# compare numeric and string value for "lvl"-argument
x <- to_factor(efc$e42dep)
frq(x)
ref_lvl(x, lvl = 2) %>% frq()
ref_lvl(x, lvl = "slightly dependent") %>% frq()

# factors with non-numeric factor levels
data(iris)
levels(iris$Species)
levels(ref_lvl(iris$Species, lvl = 3))
levels(ref_lvl(iris$Species, lvl = "versicolor"))
This function removes variables from a data frame, and is intended for use within a pipe-workflow. remove_cols() is an alias for remove_var().
remove_var(x, ...)

remove_cols(x, ...)
x |
A vector or data frame. |
... |
Character vector with variable names, or unquoted names
of variables that should be removed from the data frame.
You may also use functions like |
x
, with variables specified in ...
removed.
mtcars %>% remove_var("disp", "cyl")
mtcars %>% remove_var(c("wt", "vs"))
mtcars %>% remove_var(drat:am)
This function replaces (tagged) NA's of a variable, data frame
or list of variables with value
.
replace_na(x, ..., value, na.label = NULL, tagged.na = NULL)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
value |
Value that will replace the |
na.label |
Optional character vector, used to label the former NA-value
(i.e. adding a |
tagged.na |
Optional single character, specifies a |
While regular NA values can only be completely replaced with a single value, tagged_na makes it possible to differentiate between different qualitative values of NAs. Tagged NAs work exactly like regular R missing values except that they store one additional byte of information: a tag, which is usually a letter ("a" to "z") or character number ("0" to "9"). This makes it possible to replace only specific NA values, while other NA values are preserved.
x
, where NA
's are replaced with value
. If x
is a data frame, the complete data frame x
will be returned,
with replaced NA's for variables specified in ...
;
if ...
is not specified, applies to all variables in the
data frame.
Value and variable label attributes are preserved.
set_na
for setting NA
values, rec
for general recoding of variables and recode_to
for re-shifting value ranges.
library(sjlabelled)
data(efc)
table(efc$e42dep, useNA = "always")
table(replace_na(efc$e42dep, value = 99), useNA = "always")

# the original labels
get_labels(replace_na(efc$e42dep, value = 99))

# NA becomes "99", and is labelled as "former NA"
get_labels(
  replace_na(efc$e42dep, value = 99, na.label = "former NA"),
  values = "p"
)

dummy <- data.frame(
  v1 = efc$c82cop1,
  v2 = efc$c83cop2,
  v3 = efc$c84cop3
)

# show original distribution
lapply(dummy, table, useNA = "always")

# show variables, NA's replaced with 99
lapply(replace_na(dummy, v2, v3, value = 99), table, useNA = "always")

if (require("haven")) {
  x <- labelled(c(1:3, tagged_na("a", "c", "z"), 4:1),
                c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
                  "Refused" = tagged_na("a"), "Not home" = tagged_na("z")))
  # get current NA values
  x
  get_na(x)

  # replace only the NA, which is tagged as NA(c)
  replace_na(x, value = 2, tagged.na = "c")
  get_na(replace_na(x, value = 2, tagged.na = "c"))

  table(x)
  table(replace_na(x, value = 2, tagged.na = "c"))

  # tagged NA also works for non-labelled class
  # init vector
  x <- c(1, 2, 3, 4)
  # set values 2 and 3 as tagged NA
  x <- set_na(x, na = c(2, 3), as.tag = TRUE)
  # see result
  x
  # now replace only NA tagged with 2 with value 5
  replace_na(x, value = 5, tagged.na = "2")
}
reshape_longer()
reshapes one or more columns from
wide into long format.
reshape_longer(
  x,
  columns = colnames(x),
  names.to = "key",
  values.to = "value",
  labels = NULL,
  numeric.timevar = FALSE,
  id = ".id"
)
x |
A data frame. |
columns |
Names of variables (as character vector), or column index of variables, that should be reshaped. If multiple column groups should be reshaped, use a list of vectors (see 'Examples'). |
names.to |
Character vector with name(s) of key column(s) to create in output. Either one name per column group that should be gathered, or a single string. In the latter case, this name will be used as key column, and only one key column is created. |
values.to |
Character vector with names of value columns (variable names) to create in output. Must be of same length as number of column groups that should be gathered. See 'Examples'. |
labels |
Character vector of same length as |
numeric.timevar |
Logical, if |
id |
Name of ID-variable. |
A reshaped data frame.
# Reshape one column group into long format
mydat <- data.frame(
  age = c(20, 30, 40),
  sex = c("Female", "Male", "Male"),
  score_t1 = c(30, 35, 32),
  score_t2 = c(33, 34, 37),
  score_t3 = c(36, 35, 38)
)

reshape_longer(
  mydat,
  columns = c("score_t1", "score_t2", "score_t3"),
  names.to = "time",
  values.to = "score"
)

# Reshape multiple column groups into long format
mydat <- data.frame(
  age = c(20, 30, 40),
  sex = c("Female", "Male", "Male"),
  score_t1 = c(30, 35, 32),
  score_t2 = c(33, 34, 37),
  score_t3 = c(36, 35, 38),
  speed_t1 = c(2, 3, 1),
  speed_t2 = c(3, 4, 5),
  speed_t3 = c(1, 8, 6)
)

reshape_longer(
  mydat,
  columns = list(
    c("score_t1", "score_t2", "score_t3"),
    c("speed_t1", "speed_t2", "speed_t3")
  ),
  names.to = "time",
  values.to = c("score", "speed")
)

# or ...
reshape_longer(
  mydat,
  list(3:5, 6:8),
  names.to = "time",
  values.to = c("score", "speed")
)

# gather multiple columns, label columns
x <- reshape_longer(
  mydat,
  list(3:5, 6:8),
  names.to = "time",
  values.to = c("score", "speed"),
  labels = c("Test Score", "Time needed to finish")
)

library(sjlabelled)
str(x$score)
get_label(x$speed)
This function rotates a data frame, i.e. columns become rows and vice versa.
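For a purely numeric data frame, the rotation can be pictured with base R's `t()`. This is a rough analogue, not the package's implementation, and it does not reproduce the `rn` and `cn` arguments; the variable name `rotated` is chosen for illustration.

```r
# Base R analogue of rotating a numeric data frame: transpose, then convert back
x <- mtcars[1:3, 1:4]
rotated <- as.data.frame(t(x))

# former columns (mpg, cyl, disp, hp) are now rows,
# former rows (the three cars) are now columns
rotated
dim(rotated)
rownames(rotated)
```

Note that `t()` first converts the data frame to a matrix, so columns of mixed type would all be coerced to character in this sketch.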
rotate_df(x, rn = NULL, cn = FALSE)
x |
A data frame. |
rn |
Character vector (optional). If not |
cn |
Logical (optional), if |
A (rotated) data frame.
x <- mtcars[1:3, 1:4]
rotate_df(x)
rotate_df(x, rn = "property")

# use values in 1. column as column name
rotate_df(x, cn = TRUE)
rotate_df(x, rn = "property", cn = TRUE)

# also works on list-results
library(purrr)

dat <- mtcars[1:3, 1:4]
tmp <- purrr::map(dat, function(x) {
  sdev <- stats::sd(x, na.rm = TRUE)
  ulsdev <- mean(x, na.rm = TRUE) + c(-sdev, sdev)
  names(ulsdev) <- c("lower_sd", "upper_sd")
  ulsdev
})
tmp
as.data.frame(tmp)
rotate_df(tmp)

tmp <- purrr::map_df(dat, function(x) {
  sdev <- stats::sd(x, na.rm = TRUE)
  ulsdev <- mean(x, na.rm = TRUE) + c(-sdev, sdev)
  names(ulsdev) <- c("lower_sd", "upper_sd")
  ulsdev
})
tmp
rotate_df(tmp)
round_num()
rounds numeric variables in a data frame
that also contains non-numeric variables. Non-numeric variables are
ignored.
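What "non-numeric variables are ignored" means can be sketched in base R. This is an illustration of the idea under the assumption that each numeric column is simply passed through `round()`; it is not the package's code, and the names `rounded` and `is_num` are hypothetical.

```r
# Base R sketch: round only the numeric columns, leave the others untouched
data(iris)
rounded <- iris
is_num <- vapply(rounded, is.numeric, logical(1))
rounded[is_num] <- lapply(rounded[is_num], round, digits = 0)

# the factor column Species is unchanged, the measurements are rounded
head(rounded)
```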
round_num(x, digits = 0)
x |
A vector or data frame. |
digits |
Numeric, number of decimals to round to. |
x
with all numeric variables rounded.
data(iris)
round_num(iris)
row_count()
mimics base R's rowSums()
, with sums
for a specific value indicated by count
. Hence, it is equivalent
to rowSums(x == count, na.rm = TRUE)
. However, this function
is designed to work nicely within a pipe-workflow and allows select-helpers
for selecting variables and the return value is always a data frame
(with one variable).
col_count()
does the same for columns. The return value is
a data frame with one row (the column counts) and the same number
of columns as x
.
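The base R equivalence mentioned above can be checked directly. The sketch below uses only `rowSums()`, `colSums()` and `is.na()`; the data frame is the one from the examples, and the result names are chosen for illustration. Counting missing values needs `is.na()` in base R, because comparing with `== NA` never yields TRUE.

```r
dat <- data.frame(
  c1 = c(1, 2, 3, 1, 3, NA),
  c2 = c(3, 2, 1, 2, NA, 3),
  c3 = c(1, 1, 2, 1, 3, NA),
  c4 = c(1, 1, 3, 2, 1, 2)
)

# how often does the value 1 appear in each row / each column?
ones_per_row <- rowSums(dat == 1, na.rm = TRUE)
ones_per_col <- colSums(dat == 1, na.rm = TRUE)
ones_per_row
ones_per_col

# counting missing values: use is.na() instead of a comparison with NA
na_per_row <- rowSums(is.na(dat))
na_per_row
```

Unlike these base R calls, which return named numeric vectors, row_count() and col_count() return data frames, as described above.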
row_count(x, ..., count, var = "rowcount", append = TRUE)

col_count(x, ..., count, var = "colcount", append = TRUE)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
count |
The value for which the row or column sum should be computed. May
be a numeric value, a character string (for factors or character vectors),
|
var |
Name of the new variable with the row or column counts. |
append |
Logical, if |
For row_count()
, a data frame with one variable: the sum of count
appearing in each row of x
; for col_count()
, a data frame with
one row and the same number of variables as in x
: each variable
holds the sum of count
appearing in each variable of x
.
If append = TRUE
, x
including this variable will be returned.
dat <- data.frame(
  c1 = c(1, 2, 3, 1, 3, NA),
  c2 = c(3, 2, 1, 2, NA, 3),
  c3 = c(1, 1, 2, 1, 3, NA),
  c4 = c(1, 1, 3, 2, 1, 2)
)

row_count(dat, count = 1, append = FALSE)
row_count(dat, count = NA, append = FALSE)
row_count(dat, c1:c3, count = 2, append = TRUE)

col_count(dat, count = 1, append = FALSE)
col_count(dat, count = NA, append = FALSE)
col_count(dat, c1:c3, count = 2, append = TRUE)
row_sums()
and row_means()
compute row sums or means
for at least n
valid values per row. The functions are designed
to work nicely within a pipe-workflow and allow select-helpers
for selecting variables.
row_sums(x, ...)

## Default S3 method:
row_sums(x, ..., n, var = "rowsums", append = TRUE)

## S3 method for class 'mids'
row_sums(x, ..., var = "rowsums", append = TRUE)

row_means(x, ...)

total_mean(x, ...)

## Default S3 method:
row_means(x, ..., n, var = "rowmeans", append = TRUE)

## S3 method for class 'mids'
row_means(x, ..., var = "rowmeans", append = TRUE)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
n |
May either be
If a row's sum of valid (i.e. non- |
var |
Name of the new variable with the row sums or means. |
append |
Logical, if |
n must be a numeric value from 0 to ncol(x). If a row in x has at least n non-missing values, the row mean or sum is returned. If n is a non-integer value from 0 to 1, n is considered to indicate the proportion of necessary non-missing values per row. E.g., if n = .75, a row must have at least ncol(x) * n non-missing values for the row mean or sum to be calculated. See 'Examples'.
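The rule for n can be sketched in base R. `row_means_min_valid()` is a hypothetical helper written for illustration, not the package's implementation; in particular, it compares the number of valid values against ncol(x) * n without rounding, which is an assumption, since the text above does not say how fractional thresholds are treated.

```r
# Hypothetical helper: row means only for rows with enough non-missing values
row_means_min_valid <- function(x, n) {
  valid <- rowSums(!is.na(x))
  # a non-integer n between 0 and 1 is read as a proportion of columns
  threshold <- if (n > 0 && n < 1) ncol(x) * n else n
  out <- rowMeans(x, na.rm = TRUE)
  out[valid < threshold] <- NA
  unname(out)
}

dat <- data.frame(
  c1 = c(1, 2, NA, 4),
  c2 = c(NA, 2, NA, 5),
  c3 = c(NA, 4, NA, NA),
  c4 = c(2, 3, 7, 8),
  c5 = c(1, 7, 5, 3)
)

# the rows have 3, 5, 2 and 4 valid values
at_least_4 <- row_means_min_valid(dat, n = 4)    # rows 1 and 3 become NA
at_least_half <- row_means_min_valid(dat, n = .5) # threshold 2.5: row 3 becomes NA
at_least_4
at_least_half
```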
For row_sums()
, a data frame with a new variable: the row sums from
x
; for row_means()
, a data frame with a new variable: the row
means from x
. If append = FALSE
, only the new variable
with row sums or row means is returned. total_mean()
returns
the mean of all values from all specified columns in a data frame.
data(efc)
efc %>% row_sums(c82cop1:c90cop9, n = 3, append = FALSE)

library(dplyr)
row_sums(efc, contains("cop"), n = 2, append = FALSE)

dat <- data.frame(
  c1 = c(1,2,NA,4),
  c2 = c(NA,2,NA,5),
  c3 = c(NA,4,NA,NA),
  c4 = c(2,3,7,8),
  c5 = c(1,7,5,3)
)
dat

row_means(dat, n = 4)
row_sums(dat, n = 4)

row_means(dat, c1:c4, n = 4)

# at least 40% non-missing
row_means(dat, c1:c4, n = .4)
row_sums(dat, c1:c4, n = .4)

# total mean of all values in the data frame
total_mean(dat)

# create sum-score of COPE-Index, and append to data
efc %>%
  select(c82cop1:c90cop9) %>%
  row_sums(n = 1)

# if data frame has only one column, this column is returned
row_sums(dat[, 1, drop = FALSE], n = 0)
seq_col(x)
is a convenient wrapper for seq_len(ncol(x))
,
while seq_row(x)
is a convenient wrapper for seq_len(nrow(x))
.
seq_col(x)

seq_row(x)
x |
A data frame. |
A numeric sequence from 1 to number of columns or rows.
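The two base R expressions that these wrappers stand for can be tried on their own. The sketch below uses only `seq_len()`, `ncol()` and `nrow()`; the variable names are chosen for illustration.

```r
data(iris)

# what seq_col(iris) and seq_row(iris) are described as wrapping
cols <- seq_len(ncol(iris))
rows <- seq_len(nrow(iris))
cols
length(rows)

# seq_len() is also safe for empty input: a data frame without columns
# gives an empty sequence, whereas 1:ncol(x) would give c(1, 0)
no_cols <- iris[, 0]
empty <- seq_len(ncol(no_cols))
length(empty)
```

This is why a `seq_len()`-based sequence is a common choice for loop indices over columns or rows.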
data(iris)
seq_col(iris)
seq_row(iris)
set_na_if() is a scoped variant of set_na, where values will be replaced with NA's only for those variables that match the logical condition of predicate.
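How a predicate scopes the replacement can be sketched in base R. This is an illustration of the idea only, not the package's implementation: the predicate is evaluated once per column, and values are set to NA only in the columns where it returns TRUE. The small data frame and the name `selected` are made up for the example.

```r
dummy <- data.frame(
  var1 = c(1, 8, 3),
  var2 = c(9, 2, 10),
  var3 = c(1, 6, 2)
)

# predicate: TRUE for columns whose maximum exceeds 7
p <- function(x) max(x, na.rm = TRUE) > 7
selected <- vapply(dummy, p, logical(1))
selected

# set 8 and 9 to NA, but only in the selected columns
dummy[selected] <- lapply(dummy[selected], function(col) {
  col[col %in% 8:9] <- NA
  col
})
dummy
```

Here var3 never matches the predicate, so it is returned unchanged.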
set_na_if(x, predicate, na, drop.levels = TRUE, as.tag = FALSE)
x |
A vector or data frame. |
predicate |
A predicate function to be applied to the columns. The
variables for which |
na |
Numeric vector with values that should be replaced with NA values,
or a character vector if values of factors or character vectors should be
replaced. For labelled vectors, may also be the name of a value label. In
this case, the associated values for the value labels in each vector
will be replaced with |
drop.levels |
Logical, if |
as.tag |
Logical, if |
x
, with all values in na
being replaced by NA
.
If x
is a data frame, the complete data frame x
will
be returned, with NA's set for variables specified in ...
;
if ...
is not specified, applies to all variables in the
data frame.
replace_na
to replace NA
's with specific
values, rec
for general recoding of variables and
recode_to
for re-shifting value ranges. See
get_na
to get values of missing values in
labelled vectors.
dummy <- data.frame(var1 = sample(1:8, 100, replace = TRUE),
                    var2 = sample(1:10, 100, replace = TRUE),
                    var3 = sample(1:6, 100, replace = TRUE))

p <- function(x) max(x, na.rm = TRUE) > 7
tmp <- set_na_if(dummy, predicate = p, na = 8:9)
head(tmp)
This function shortens strings that are longer than max.length
chars, without cropping words.
shorten_string(s, max.length = NULL, abbr = "...")
s |
A string. |
max.length |
Maximum length of chars for the string. |
abbr |
String that will be used as suffix, if |
If the string length defined in max.length
happens to be inside
a word, this word is removed from the returned string (see 'Examples'), so
the returned string has a maximum length of max.length
, but
might be shorter.
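The word-boundary rule can be sketched in base R. `shorten_at_word()` is a hypothetical helper for illustration and not the package's implementation: it keeps as many whole words as fit into max.length and then appends abbr. Whether the real function counts abbr toward max.length is not stated above; this sketch does not count it.

```r
# Hypothetical helper: cut a string at the last word boundary within max.length
shorten_at_word <- function(s, max.length, abbr = "...") {
  if (nchar(s) <= max.length) return(s)
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  # length of the string after each word, counting the separating spaces
  len <- cumsum(nchar(words)) + seq_along(words) - 1
  # if even the first word is too long, nothing is kept and only abbr remains
  keep <- words[len <= max.length]
  paste0(paste(keep, collapse = " "), abbr)
}

s <- "This can be considered as very long string!"

unchanged <- shorten_at_word(s, 60)  # shorter than max.length, returned as is
cut_20 <- shorten_at_word(s, 20)     # position 20 is inside "considered"
cut_22 <- shorten_at_word(s, 22)     # "considered" ends exactly at position 22
cut_20
cut_22
```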
A shortened string.
s <- "This can be considered as very long string!"

# string is shorter than max.length, so returned as is
shorten_string(s, 60)

# string is shortened to as many words that result in
# a string of maximum 20 chars
shorten_string(s, 20)

# string including "considered" is exactly of length 22 chars
shorten_string(s, 22)
Recode numeric variables into equal sized groups, i.e. a
variable is cut into a smaller number of groups at specific cut points.
split_var_if()
is a scoped variant of split_var()
, where
transformation will be applied only to those variables that match the
logical condition of predicate
.
split_var(
  x,
  ...,
  n,
  as.num = FALSE,
  val.labels = NULL,
  var.label = NULL,
  inclusive = FALSE,
  append = TRUE,
  suffix = "_g"
)

split_var_if(
  x,
  predicate,
  n,
  as.num = FALSE,
  val.labels = NULL,
  var.label = NULL,
  inclusive = FALSE,
  append = TRUE,
  suffix = "_g"
)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
n |
The new number of groups that |
as.num |
Logical, if |
val.labels |
Optional character vector, to set value label attributes
of recoded variable (see vignette Labelled Data and the sjlabelled-Package).
If |
var.label |
Optional string, to set variable label attribute for the
returned variable (see vignette Labelled Data and the sjlabelled-Package).
If |
inclusive |
Logical; if |
append |
Logical, if |
suffix |
Indicates which suffix will be added to each dummy variable.
Use |
predicate |
A predicate function to be applied to the columns. The
variables for which |
split_var() splits a variable into equal sized groups, where the number of groups depends on the n-argument. Thus, this function cuts a variable into groups at the specified quantiles.
By contrast, group_var recodes a variable into groups, where groups have the same value range (e.g., from 1-5, 6-10, 11-15 etc.).
split_var() also works on grouped data frames (see group_by). In this case, splitting is applied to the subsets of variables in x. See 'Examples'.
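The difference between quantile-based and range-based grouping can be illustrated with a minimal base R sketch. It does not use split_var() or group_var() and is not their implementation; quantile() and cut() merely stand in for the two strategies, and the data vector is made up for illustration.

```r
# Made-up, right-skewed example data
x <- c(1, 2, 3, 4, 5, 6, 10, 20, 30)
n <- 3

# Quantile-based cut points: groups hold (roughly) the same number of cases
cut_points <- quantile(x, probs = seq(0, 1, length.out = n + 1))
equal_size <- cut(x, breaks = cut_points, include.lowest = TRUE, labels = FALSE)

# Equal-width intervals: groups cover the same value range
equal_range <- cut(x, breaks = n, labels = FALSE)

table(equal_size)
table(equal_range)
```

With skewed data, the quantile-based groups stay balanced, while the equal-width groups concentrate most cases in the lowest interval.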
A grouped variable with equal sized groups. If x is a data frame, for append = TRUE, x including the grouped variables as new columns is returned; if append = FALSE, only the grouped variables will be returned. If append = TRUE and suffix = "", recoded variables will replace (overwrite) existing variables.
In case a vector has only a small number of unique values, splitting into equal sized groups may fail. In this case, use the inclusive-argument to shift a value at the cut point into the lower, preceding group to get equal sized groups. See 'Examples'.
group_var to group variables into equal ranged groups, or rec to recode variables.
data(efc)
# non-grouped
table(efc$neg_c_7)

# split into 3 groups
table(split_var(efc$neg_c_7, n = 3))

# split multiple variables into 3 groups
split_var(efc, neg_c_7, pos_v_4, e17age, n = 3, append = FALSE)
frq(split_var(efc, neg_c_7, pos_v_4, e17age, n = 3, append = FALSE))

# original
table(efc$e42dep)

# two groups, non-inclusive cut-point
# vector split leads to unequal group sizes
table(split_var(efc$e42dep, n = 2))

# two groups, inclusive cut-point
# group sizes are equal
table(split_var(efc$e42dep, n = 2, inclusive = TRUE))

# Unlike dplyr's ntile(), split_var() never splits a value
# into two different categories, i.e. you always get a clean
# separation of original categories
library(dplyr)

x <- dplyr::ntile(efc$neg_c_7, n = 3)
table(efc$neg_c_7, x)

x <- split_var(efc$neg_c_7, n = 3)
table(efc$neg_c_7, x)

# works also with grouped data frames
mtcars %>%
  split_var(disp, n = 3, append = FALSE) %>%
  table()

mtcars %>%
  group_by(cyl) %>%
  split_var(disp, n = 3, append = FALSE) %>%
  table()
This function extracts coefficients (and standard errors and p-values) of fitted model objects from (nested) data frames, which are saved in a list-variable, and spreads the coefficients into new columns.
spread_coef(data, model.column, model.term, se, p.val, append = TRUE)
data |
A (nested) data frame with a list-variable that contains fitted model objects (see 'Details'). |
model.column |
Name or index of the list-variable that contains the fitted model objects. |
model.term |
Optional, name of a model term. If specified, only this model term (including p-value) will be extracted from each model and added as new column. |
se |
Logical, if |
p.val |
Logical, if |
append |
Logical, if |
This function requires a (nested) data frame (e.g. created by the nest-function of the tidyr-package), where several fitted models are saved in a list-variable (see 'Examples'). Since the models stored in such a list-variable are typically fitted with an identical formula, all models have the same dependent and independent variables and differ only in their subsets of data. The function then extracts all coefficients from each model and saves each estimate in a new column. The result is a data frame, where each row is a model with each model's coefficients in its own column.
A data frame with columns for each coefficient of the models that are stored in the list-variable of data; or, if model.term is given, a data frame with the term's estimate. If se = TRUE or p.val = TRUE, the returned data frame also contains columns for the coefficients' standard error and p-value. If append = TRUE, the columns are appended to data, i.e. data is also returned.
if (require("dplyr") && require("tidyr") && require("purrr")) {
  data(efc)

  # create nested data frame, grouped by dependency (e42dep)
  # and fit linear model for each group. These models are
  # stored in the list variable "models".
  model.data <- efc %>%
    filter(!is.na(e42dep)) %>%
    group_by(e42dep) %>%
    nest() %>%
    mutate(
      models = map(data, ~lm(neg_c_7 ~ c12hour + c172code, data = .x))
    )

  # spread coefficients, so we can easily access and compare the
  # coefficients over all models. arguments `se` and `p.val` default
  # to `FALSE`, when `model.term` is not specified
  spread_coef(model.data, models)
  spread_coef(model.data, models, se = TRUE)

  # select only specific model term. `se` and `p.val` default to `TRUE`
  spread_coef(model.data, models, c12hour)

  # spread_coef can be used directly within a pipe-chain
  efc %>%
    filter(!is.na(e42dep)) %>%
    group_by(e42dep) %>%
    nest() %>%
    mutate(
      models = map(data, ~lm(neg_c_7 ~ c12hour + c172code, data = .x))
    ) %>%
    spread_coef(models)
}
std() computes a z-transformation (standardized and centered) on the input. center() centers the input. std_if() and center_if() are scoped variants of std() and center(), where transformation will be applied only to those variables that match the logical condition of predicate.
std(
  x,
  ...,
  robust = c("sd", "2sd", "gmd", "mad"),
  include.fac = FALSE,
  append = TRUE,
  suffix = "_z"
)

std_if(
  x,
  predicate,
  robust = c("sd", "2sd", "gmd", "mad"),
  include.fac = FALSE,
  append = TRUE,
  suffix = "_z"
)

center(x, ..., include.fac = FALSE, append = TRUE, suffix = "_c")

center_if(x, predicate, include.fac = FALSE, append = TRUE, suffix = "_c")
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
robust |
Character vector, indicating the method applied when
standardizing variables with |
include.fac |
Logical, if |
append |
Logical, if |
suffix |
Indicates which suffix will be added to each dummy variable.
Use |
predicate |
A predicate function to be applied to the columns. The
variables for which |
std() and center() also work on grouped data frames (see group_by). In this case, standardization or centering is applied to the subsets of variables in x. See 'Examples'.
For more complicated models with many predictors, Gelman and Hill (2007) suggest leaving binary inputs as is and standardizing only continuous predictors by dividing by two standard deviations. This ensures a rough comparability in the coefficients.
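The transformations described here can be written out in base R as a rough sketch. This is not the package's code: the data are made up, and it is an assumption that robust = "2sd" corresponds to dividing by two standard deviations as in the cited Gelman approach. The grouped variant uses ave() as a stand-in for standardizing within the subsets of a grouped data frame.

```r
# Made-up example data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

centered <- x - mean(x)                  # centering only
z        <- (x - mean(x)) / sd(x)        # z-transformation
z_2sd    <- (x - mean(x)) / (2 * sd(x))  # division by two standard deviations

# Group-wise z-transformation: each value is standardized
# with the mean and sd of its own group
g <- c("a", "a", "a", "a", "b", "b", "b", "b")
z_by_group <- ave(x, g, FUN = function(v) (v - mean(v)) / sd(v))

round(cbind(x, centered, z, z_2sd, z_by_group), 3)
```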
If x is a vector, returns a vector with standardized or centered variables. If x is a data frame, for append = TRUE, x including the transformed variables as new columns is returned; if append = FALSE, only the transformed variables will be returned. If append = TRUE and suffix = "", recoded variables will replace (overwrite) existing variables.
std() and center() only return a vector, if x is a vector. If x is a data frame and only one variable is specified in the ...-ellipses argument, both functions do return a data frame (see 'Examples').
Gelman A (2008) Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine 27: 2865-2873. http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf
Gelman A, Hill J (2007) Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, Cambridge University Press: 55-57
data(efc)
std(efc$c160age) %>% head()
std(efc, e17age, c160age, append = FALSE) %>% head()

center(efc$c160age) %>% head()
center(efc, e17age, c160age, append = FALSE) %>% head()

# NOTE!
std(efc$e17age) # returns a vector
std(efc, e17age) # returns a data frame

# with quasi-quotation
x <- "e17age"
center(efc, !!x, append = FALSE) %>% head()

# works with mutate()
library(dplyr)
efc %>%
  select(e17age, neg_c_7) %>%
  mutate(age_std = std(e17age), burden = center(neg_c_7)) %>%
  head()

# works also with grouped data frames
mtcars %>% std(disp)

# compare new column "disp_z" w/ output above
mtcars %>%
  group_by(cyl) %>%
  std(disp)

data(iris)
# also standardize factors
std(iris, include.fac = TRUE, append = FALSE)
# don't standardize factors
std(iris, include.fac = FALSE, append = FALSE)

# standardize only variables with more than 10 unique values
p <- function(x) dplyr::n_distinct(x) > 10
std_if(efc, predicate = p, append = FALSE)
This function checks whether a string or character vector x contains the string pattern. By default, this function is case sensitive.
str_contains(x, pattern, ignore.case = FALSE, logic = NULL, switch = FALSE)
x |
Character string where matches are sought. May also be a character vector of length > 1 (see 'Examples'). |
pattern |
Character string to be matched in |
ignore.case |
Logical, whether matching should be case sensitive or not. |
logic |
Indicates whether a logical combination of multiple search pattern should be made.
|
switch |
Logical, if |
This function iterates all elements in pattern and checks for each of these elements whether it is found in any element of x, i.e. which elements of pattern are found in the vector x.
Technically, it iterates pattern and calls grep(pattern[i], x, fixed = TRUE) for each element of pattern. If switch = TRUE, it iterates pattern and calls grep(x, pattern[i], fixed = TRUE) for each element of pattern, i.e. x is searched for within each element of pattern. Hence, in the latter case (if switch = TRUE), x must be of length 1.
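The default search direction (each element of pattern is looked up in x) can be sketched in base R as follows. This is an illustration rather than the function's source; the mapping of the logic options to any(), all() and !any() is an assumption based on the wording of the examples ("any pattern", "all patterns", "no patterns").

```r
x <- "abc"
pattern <- c("a", "b", "e")

# One fixed (non-regex) search per pattern element;
# TRUE if that element occurs in any element of x
found <- vapply(
  pattern,
  function(p) length(grep(pattern = p, x = x, fixed = TRUE)) > 0,
  logical(1),
  USE.NAMES = FALSE
)

found        # one result per pattern element
any(found)   # assumed equivalent of logic = "or"
all(found)   # assumed equivalent of logic = "and"
!any(found)  # assumed equivalent of logic = "not"
```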
TRUE if x contains pattern.
str_contains("hello", "hel")
str_contains("hello", "hal")

str_contains("hello", "Hel")
str_contains("hello", "Hel", ignore.case = TRUE)

# which patterns are in "abc"?
str_contains("abc", c("a", "b", "e"))

# is pattern in any element of 'x'?
str_contains(c("def", "abc", "xyz"), "abc")
# is "abcde" in any element of 'x'?
str_contains(c("def", "abc", "xyz"), "abcde") # no...
# is "abc" in any of pattern?
str_contains("abc", c("defg", "abcde", "xyz12"), switch = TRUE)

str_contains(c("def", "abcde", "xyz"), c("abc", "123"))

# any pattern in "abc"?
str_contains("abc", c("a", "b", "e"), logic = "or")

# all patterns in "abc"?
str_contains("abc", c("a", "b", "e"), logic = "and")
str_contains("abc", c("a", "b"), logic = "and")

# no patterns in "abc"?
str_contains("abc", c("a", "b", "e"), logic = "not")
str_contains("abc", c("d", "e", "f"), logic = "not")
This function finds the element indices of partial matching or similar strings in a character vector. Can be used to find exact or slightly mistyped elements in a string vector.
str_find(string, pattern, precision = 2, partial = 0, verbose = FALSE)
string |
Character vector with string elements. |
pattern |
String that should be matched against the elements of |
precision |
Maximum distance ("precision") between two string elements, which is allowed to treat them as similar or equal. Smaller values mean less tolerance in matching. |
partial |
Activates similar matching (close distance strings) for parts (substrings)
of the
Default value is 0. See 'Details' for more information. |
verbose |
Logical; if |
Computation Details
Fuzzy string matching is based on regular expressions, in particular grep(pattern = "(<pattern>){~<precision>}", x = string). This means that precision indicates the number of chars inside pattern that may differ in string to consider it as "matching". The higher precision is, the more tolerant the search is (i.e. yielding more possible matches). Furthermore, the higher the value for partial is, the more matches may be found.
Partial Distance Matching
For partial = 1, a substring of length(pattern) is extracted from string, starting at position 0 in string until the end of string is reached. Each substring is matched against pattern, and results with a maximum distance of precision are considered as "matching". If partial = 2, the range of the extracted substring is increased by 2, i.e. the extracted substring is two chars longer and so on.
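The sliding-window idea can be sketched in base R. This is not the package's implementation: it swaps the regular-expression approach for Levenshtein distances from utils::adist(), uses R's 1-based positions, and window_distances() is a hypothetical helper introduced only for this illustration.

```r
# Hypothetical helper (not part of the package): edit distance between
# `pattern` and every window of the same length taken from `string`
window_distances <- function(string, pattern) {
  n <- nchar(pattern)
  starts <- seq_len(max(nchar(string) - n + 1, 1))
  windows <- substring(string, starts, starts + n - 1)
  # utils::adist() returns a matrix of Levenshtein distances
  as.vector(utils::adist(tolower(windows), tolower(pattern)))
}

d <- window_distances("We are Sex Pistols!", "postils")
precision <- 2

# A window counts as "matching" if its distance does not exceed precision
any(d <= precision)
```

Here the window "Pistols" differs from "postils" by two substitutions, so it is within a precision of 2.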
A numeric vector with index position of elements in string that partially match or are similar to pattern. Returns -1 if no match was found.
This function does not return the position of a matching string inside another string, but the element's index of the string vector, where a (partial) match with pattern was found. Thus, searching for "abc" in a string "this is abc" will not return 9 (the start position of the substring), but 1 (the element index, which is always 1 if string only has one element).
string <- c("Hello", "Helo", "Hole", "Apple", "Ape", "New", "Old", "System", "Systemic")
str_find(string, "hel") # partial match
str_find(string, "stem") # partial match
str_find(string, "R") # no match
str_find(string, "saste") # similarity to "System"

# finds two indices, because partial matching now
# also applies to "Systemic"
str_find(string, "sytsme", partial = 1)

# finds partial matching of similarity
str_find("We are Sex Pistols!", "postils")
str_start() finds the beginning position of pattern in each element of x, while str_end() finds the stopping position of pattern in each element of x.
str_start(x, pattern, ignore.case = TRUE, regex = FALSE)

str_end(x, pattern, ignore.case = TRUE, regex = FALSE)
x |
A character vector. |
pattern |
Character string to be matched in |
ignore.case |
Logical, whether matching should be case sensitive or not.
|
regex |
Logical, if |
A numeric vector with index of start/end position(s) of pattern found in x, or -1, if pattern was not found in x.
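How start and end positions relate can be sketched with base R's gregexpr(). This is only an illustration, not the package's code. Unlike the default ignore.case = TRUE above, the fixed search used here is case sensitive.

```r
x <- "I like to move it, move it"

# gregexpr() returns the start position of every match,
# with the match lengths stored as an attribute
m <- gregexpr("move", x, fixed = TRUE)[[1]]

start_pos <- as.integer(m)
end_pos <- start_pos + attr(m, "match.length") - 1L

start_pos
end_pos
```

If the pattern does not occur, gregexpr() itself reports -1 as position, which mirrors the return value described above.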
path <- "this/is/my/fileofinterest.csv"
str_start(path, "/")

path <- "this//is//my//fileofinterest.csv"
str_start(path, "//")
str_end(path, "//")

x <- c("my_friend_likes me", "your_friend likes_you")
str_start(x, "_")

# pattern "likes" starts at position 11 in first, and
# position 13 in second string
str_start(x, "likes")

# pattern "likes" ends at position 15 in first, and
# position 17 in second string
str_end(x, "likes")

x <- c("I like to move it, move it", "You like to move it")
str_start(x, "move")
str_end(x, "move")

x <- c("test1234testagain")
str_start(x, "\\d+4")
str_start(x, "\\d+4", regex = TRUE)
str_end(x, "\\d+4", regex = TRUE)
This function "cleans" values of a character vector or levels of a factor by removing space and punctuation characters.
tidy_values(x, ...)

clean_values(x, ...)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
x, with "cleaned" values or levels.
f1 <- sprintf("Char %s", sample(LETTERS[1:5], size = 10, replace = TRUE))
f2 <- as.factor(sprintf("F / %s", sample(letters[1:5], size = 10, replace = TRUE)))
f3 <- sample(1:5, size = 10, replace = TRUE)

x <- data.frame(f1, f2, f3, stringsAsFactors = FALSE)

clean_values(f1)
clean_values(f2)
clean_values(x)
This function splits categorical or numeric vectors with more than two categories into 0/1-coded dummy variables.
to_dummy(x, ..., var.name = "name", suffix = c("numeric", "label"))
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
var.name |
Indicates how the new dummy variables are named. Use
|
suffix |
Indicates which suffix will be added to each dummy variable.
Use |
A data frame with dummy variables for each category of x. The dummy coded variables are of type atomic.
NA values will be copied from x, so each dummy variable has the same number of NA's at the same position as x.
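The handling of missing values can be illustrated with a small base R sketch of 0/1 dummy coding. It is not the package's implementation, and the column names ("x_1", "x_2", ...) are hypothetical and need not match the naming scheme used by to_dummy().

```r
x <- c(1, 2, NA, 3, 2)

# sort() drops NA, so only the observed categories remain
categories <- sort(unique(x))

# One 0/1 vector per category; comparing NA yields NA,
# so missing values keep their position in every dummy
dummies <- lapply(categories, function(v) as.numeric(x == v))
names(dummies) <- paste0("x_", categories)
dummies <- as.data.frame(dummies)

dummies
```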
data(efc)
head(to_dummy(efc$e42dep))

# add value label as suffix to new variable name
head(to_dummy(efc$e42dep, suffix = "label"))

# use "dummy" as new variable name
head(to_dummy(efc$e42dep, var.name = "dummy"))

# create multiple dummies, append to data frame
to_dummy(efc, c172code, e42dep)

# pipe-workflow
library(dplyr)
efc %>%
  select(e42dep, e16sex, c172code) %>%
  to_dummy()
This function converts wide data into long format. It allows multiple key-value pairs to be transformed from wide to long format in one single step.
to_long(data, keys, values, ..., labels = NULL, recode.key = FALSE)
data |
A |
keys |
Character vector with name(s) of key column(s) to create in output. Either one key value per column group that should be gathered, or a single string. In the latter case, this name will be used as key column, and only one key column is created. See 'Examples'. |
values |
Character vector with names of value columns (variable names) to create in output. Must be of same length as number of column groups that should be gathered. See 'Examples'. |
... |
Specification of columns that should be gathered. Must be one character vector with variable names per column group, or a numeric vector with column indices indicating those columns that should be gathered. See 'Examples'. |
labels |
Character vector of same length as |
recode.key |
Logical, if |
This function reshapes data from wide to long format and can gather multiple column groups at once. Value and variable labels for non-gathered variables are preserved. Attributes from gathered variables, such as information about the variable labels, are lost during reshaping. Hence, the newly created variables from gathered columns don't have any variable label attributes. In such cases, use the labels argument to restore variable label attributes.
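For comparison, gathering two column groups in one step can be sketched with base R's stats::reshape(). This only illustrates the reshaping idea and is not how to_long() is implemented; the sample data are made up, and the "id" column is created by reshape() itself.

```r
# Made-up wide data with two column groups (score_*, speed_*)
mydat <- data.frame(
  age = c(20, 30, 40),
  score_t1 = c(30, 35, 32),
  score_t2 = c(33, 34, 37),
  speed_t1 = c(2, 3, 1),
  speed_t2 = c(3, 4, 5)
)

# Each element of `varying` is one column group; `v.names` gives the
# name of the value column that each group is gathered into
long <- reshape(
  mydat,
  direction = "long",
  varying = list(c("score_t1", "score_t2"), c("speed_t1", "speed_t2")),
  v.names = c("score", "speed"),
  timevar = "time",
  times = 1:2,
  idvar = "id"
)

long
```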
# create sample
mydat <- data.frame(age = c(20, 30, 40),
                    sex = c("Female", "Male", "Male"),
                    score_t1 = c(30, 35, 32),
                    score_t2 = c(33, 34, 37),
                    score_t3 = c(36, 35, 38),
                    speed_t1 = c(2, 3, 1),
                    speed_t2 = c(3, 4, 5),
                    speed_t3 = c(1, 8, 6))

# gather multiple columns. both score and speed are gathered.
to_long(
  data = mydat,
  keys = "time",
  values = c("score", "speed"),
  c("score_t1", "score_t2", "score_t3"),
  c("speed_t1", "speed_t2", "speed_t3")
)

# alternative syntax, using "reshape_longer()"
reshape_longer(
  mydat,
  columns = list(
    c("score_t1", "score_t2", "score_t3"),
    c("speed_t1", "speed_t2", "speed_t3")
  ),
  names.to = "time",
  values.to = c("score", "speed")
)

# or ...
reshape_longer(
  mydat,
  list(3:5, 6:8),
  names.to = "time",
  values.to = c("score", "speed")
)

# gather multiple columns, use numeric key-value
to_long(
  data = mydat,
  keys = "time",
  values = c("score", "speed"),
  c("score_t1", "score_t2", "score_t3"),
  c("speed_t1", "speed_t2", "speed_t3"),
  recode.key = TRUE
)

# gather multiple columns by column names and column indices
to_long(
  data = mydat,
  keys = "time",
  values = c("score", "speed"),
  c("score_t1", "score_t2", "score_t3"),
  6:8,
  recode.key = TRUE
)

# gather multiple columns, use separate key-columns
# for each value-vector
to_long(
  data = mydat,
  keys = c("time_score", "time_speed"),
  values = c("score", "speed"),
  c("score_t1", "score_t2", "score_t3"),
  c("speed_t1", "speed_t2", "speed_t3")
)

# gather multiple columns, label columns
mydat <- to_long(
  data = mydat,
  keys = "time",
  values = c("score", "speed"),
  c("score_t1", "score_t2", "score_t3"),
  c("speed_t1", "speed_t2", "speed_t3"),
  labels = c("Test Score", "Time needed to finish")
)

library(sjlabelled)
str(mydat$score)
get_label(mydat$speed)
This function converts (replaces) factor levels with the related factor level index number, thus the factor is converted to a numeric variable. to_value() and to_numeric() are aliases.
to_value(x, ..., start.at = NULL, keep.labels = TRUE, use.labels = FALSE)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for
further processing. Required, if |
start.at |
Starting index, i.e. the lowest numeric value of the variable's
value range. By default, this argument is |
keep.labels |
Logical, if |
use.labels |
Logical, if |
A numeric variable with values ranging either from start.at to start.at + length of factor levels, or to the corresponding factor levels (if these were numeric). If x is a data frame, the complete data frame x will be returned, where variables specified in ... are coerced to numeric; if ... is not specified, applies to all variables in the data frame.
This function is kept for backwards-compatibility. It is preferred to use as_numeric.
library(sjlabelled)
data(efc)
test <- as_label(efc$e42dep)
table(test)
table(to_value(test))

# Find more examples at '?sjlabelled::as_numeric'
Trims leading and trailing whitespace from strings or character vectors.
trim(x)
x |
Character vector or string, or a list or data frame with such vectors. Function is vectorized, i.e. vector may have a length greater than 1. See 'Examples'. |
Trimmed x, i.e. with leading and trailing spaces removed.
trim("white space at end ")
trim(" white space at start and end ")
trim(c(" string1 ", " string2", "string 3 "))

tmp <- data.frame(a = c(" string1 ", " string2", "string 3 "),
                  b = c(" strong one ", " string two", " third string "),
                  c = c(" str1 ", " str2", "str3 "))
tmp
trim(tmp)
This function returns the "typical" value of a variable.
typical_value(x, fun = "mean", weights = NULL, ...)
x |
A variable. |
fun |
Character vector, naming the function to be applied to
|
weights |
Name of variable in |
... |
Further arguments, passed down to |
By default, for numeric variables, typical_value() returns the mean value of x (unless changed with the fun-argument).
For factors, the reference level is returned or the most common value (if fun = "mode"), unless fun is a named vector. If fun is a named vector, specify the function for integer, numeric and categorical variables as element names, e.g. fun = c(integer = "median", factor = "mean"). In this case, factors are converted to numeric values (using to_value) and the related function is applied. You may abbreviate the names, e.g. fun = c(i = "median", f = "mean"). See also 'Examples'.
For character vectors the most common value (mode) is returned.
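The default rules described above can be condensed into a simplified base R sketch. typical_sketch() is a hypothetical helper written for this illustration only; it does not cover the fun-argument or named vectors and is not the package's implementation.

```r
# Hypothetical helper: mean (or weighted mean) for numeric input,
# reference level for factors, most common value for character vectors
typical_sketch <- function(x, weights = NULL) {
  if (is.numeric(x)) {
    if (is.null(weights)) {
      mean(x, na.rm = TRUE)
    } else {
      stats::weighted.mean(x, w = weights, na.rm = TRUE)
    }
  } else if (is.factor(x)) {
    levels(x)[1]
  } else {
    tab <- table(x)
    names(tab)[which.max(tab)]
  }
}

x <- c(3.7, 3.3, 3.5, 2.8)
wt <- c(5, 5, 4, 1) / 15

typical_sketch(x)
typical_sketch(x, weights = wt)
typical_sketch(factor(c("b", "a", "b")))
typical_sketch(c("a", "b", "b", "c"))
```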
The "typical" value of x.
data(iris)
typical_value(iris$Sepal.Length)

library(purrr)
map(iris, ~ typical_value(.x))

# example from ?stats::weighted.mean
wt <- c(5, 5, 4, 1) / 15
x <- c(3.7, 3.3, 3.5, 2.8)

typical_value(x, fun = "weighted.mean")
typical_value(x, fun = "weighted.mean", weights = wt)

# for factors, return either reference level or mode value
set.seed(123)
x <- sample(iris$Species, size = 30, replace = TRUE)
typical_value(x)
typical_value(x, fun = "mode")

# for factors, use a named vector to apply other functions than "mode"
map(iris, ~ typical_value(.x, fun = c(n = "median", f = "mean")))
This function renames variables in a data frame, i.e. it renames the columns of the data frame.
var_rename(x, ..., verbose = TRUE)

rename_variables(x, ..., verbose = TRUE)

rename_columns(x, ..., verbose = TRUE)
x |
A data frame. |
... |
A named vector, or pairs of named vectors, where the name (lhs) equals the column name that should be renamed, and the value (rhs) is the new column name. |
verbose |
Logical, if |
x, with new column names for those variables specified in ....
dummy <- data.frame(
  a = sample(1:4, 10, replace = TRUE),
  b = sample(1:4, 10, replace = TRUE),
  c = sample(1:4, 10, replace = TRUE)
)

rename_variables(dummy, a = "first.col", c = "3rd.col")

# using quasi-quotation
library(rlang)
v1 <- "first.col"
v2 <- "3rd.col"
rename_variables(dummy, a = !!v1, c = !!v2)

x1 <- "a"
x2 <- "b"
rename_variables(dummy, !!x1 := !!v1, !!x2 := !!v2)

# using a named vector
new_names <- c(a = "first.col", c = "3rd.col")
rename_variables(dummy, new_names)
This function returns the type of a variable as character. It is similar to pillar::type_sum(); however, the return value is not truncated, and var_type() works on data frames and within pipe-chains.
var_type(x, ..., abbr = FALSE)
x |
A vector or data frame. |
... |
Optional, unquoted names of variables that should be selected for further processing. Required, if |
abbr |
Logical, if |
The variable type of x, as character.
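For comparison, a rough base R counterpart can be built from class(). The helper `type_sketch()` below is hypothetical and is not how var_type() is implemented; the type names that var_type() reports may differ from the class names shown here.

```r
# Hypothetical helper, not part of the package: report the first class
# of a vector, or of every column when given a data frame.
type_sketch <- function(x) {
  if (is.data.frame(x)) {
    # One type string per column, named by column
    vapply(x, function(v) class(v)[1], character(1))
  } else {
    class(x)[1]
  }
}

type_sketch(1)
type_sketch(1L)
type_sketch("a")
iris_types <- type_sketch(iris)
iris_types
```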
data(efc)

var_type(1)
var_type(1L)
var_type("a")

var_type(efc$e42dep)
var_type(to_factor(efc$e42dep))

library(dplyr)
var_type(efc, contains("cop"))
Insert line breaks in long character strings. Useful if you want to word-wrap labels or titles for plots or tables.
word_wrap(labels, wrap, linesep = NULL)
labels |
Label(s) as character string, where a line break should be inserted. Several strings may be passed as vector (see 'Examples'). |
wrap |
Maximum number of characters per line (i.e. line length). If
|
linesep |
By default, this argument is |
New label(s) with line breaks inserted at every wrap's position.
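The general idea of wrapping labels can be sketched in base R with strwrap(). The helper `wrap_sketch()` is hypothetical and not the package's implementation; note that strwrap() keeps lines strictly shorter than `width`, so its break positions may differ from those of word_wrap().

```r
# Hypothetical helper, not part of the package: wrap each label with
# base R's strwrap() and join the pieces with a line separator.
wrap_sketch <- function(labels, wrap, linesep = "\n") {
  vapply(
    labels,
    function(s) paste(strwrap(s, width = wrap), collapse = linesep),
    character(1),
    USE.NAMES = FALSE
  )
}

labels <- c("A very long string", "And another even longer string!")
wrapped <- wrap_sketch(labels, 10)

# cat() renders the inserted line breaks
cat(wrapped, sep = "\n\n")
```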
word_wrap(c("A very long string", "And another even longer string!"), 10)

message(word_wrap("Much too long string for just one line!", 15))
Replaces all infinite (Inf and -Inf) or NaN values with regular NA.
zap_inf(x, ...)
x |
A vector or a data frame. |
... |
Optional, unquoted names of variables that should be selected for further processing. Required, if |
x, where all Inf, -Inf and NaN are converted to NA.
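The conversion described above can be approximated in base R with is.finite(), which is FALSE for Inf, -Inf, NaN and NA alike. The helper `zap_sketch()` is hypothetical and not the package's implementation; unlike zap_inf(), it takes no variable selection and always processes every numeric column of a data frame.

```r
# Hypothetical helper, not part of the package: set all non-finite
# numeric values (Inf, -Inf, NaN) to NA; existing NA values stay NA.
zap_sketch <- function(x) {
  if (is.data.frame(x)) {
    # Apply column-wise while keeping the data frame structure
    x[] <- lapply(x, zap_sketch)
    return(x)
  }
  if (is.numeric(x)) {
    x[!is.finite(x)] <- NA
  }
  x
}

x <- c(1, 2, NA, 3, NaN, 4, NA, 5, Inf, -Inf, 6, 7)
zapped <- zap_sketch(x)
zapped

df <- data.frame(a = c(1, Inf, NaN), b = c("x", "y", "z"))
zapped_df <- zap_sketch(df)
zapped_df
```

Non-numeric columns such as `b` are returned unchanged.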
x <- c(1, 2, NA, 3, NaN, 4, NA, 5, Inf, -Inf, 6, 7)
zap_inf(x)

data(efc)
# produce some NA and NaN values
efc$e42dep[1] <- NaN
efc$e42dep[2] <- NA
efc$c12hour[1] <- NaN
efc$c12hour[2] <- NA
efc$e17age[2] <- NaN
efc$e17age[1] <- NA

# only zap NaN for c12hour
zap_inf(efc$c12hour)

# only zap NaN for c12hour and e17age, not for e42dep,
# but return complete data frame
zap_inf(efc, c12hour, e17age)

# zap NaN for complete data frame
zap_inf(efc)