Title: | Semi-Automatic Preprocessing of Messy Data with Change Tracking for Dataset Cleaning |
---|---|
Description: | Tools for assessing data quality, performing exploratory analysis, and semi-automatic preprocessing of messy data with change tracking for integral dataset cleaning. |
Authors: | David Hervas Marin [aut, cre] |
Maintainer: | David Hervas Marin <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.9.45 |
Built: | 2024-12-05 16:36:41 UTC |
Source: | CRAN |
'<=' operator where NA values return FALSE
x %<=NA% y
x |
Vector for the left side of the operator |
y |
A scalar or a vector of the same length as x for the right side of the operator |
A logical vector of the same length as x
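The NA handling described above can be sketched in base R. The helper name `le_na` is hypothetical and is not the package's own code; it only illustrates a `<=` comparison that yields FALSE wherever the ordinary result would be NA.

```r
# Hypothetical helper (not part of the package): '<=' that returns FALSE for NA
le_na <- function(x, y) {
  res <- x <= y        # ordinary comparison; NA wherever x or y is NA
  !is.na(res) & res    # FALSE & NA is FALSE, so NA results become FALSE
}

x <- c(1, NA, 5)
x <= 3        # TRUE NA FALSE
le_na(x, 3)   # TRUE FALSE FALSE
```

A result without NA values can be used directly for subsetting, for example `x[le_na(x, 3)]`, without introducing NA elements.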
'<' operator where NA values return FALSE
x %<NA% y
x |
Vector for the left side of the operator |
y |
A scalar or a vector of the same length as x for the right side of the operator |
A logical vector of the same length as x
'>=' operator where NA values return FALSE
x %>=NA% y
x |
Vector for the left side of the operator |
y |
A scalar or a vector of the same length as x for the right side of the operator |
A logical vector of the same length as x
'>' operator where NA values return FALSE
x %>NA% y
x |
Vector for the left side of the operator |
y |
A scalar or a vector of the same length as x for the right side of the operator |
A logical vector of the same length as x
Operator equivalent to x >= lower.value & x <= upper.value
x %between% y
x |
Vector for the left side of the operator |
y |
A vector of length two with the lower and upper values of the interval |
A logical vector of the same length as x
Operator equivalent to x >= lower.value & x <= upper.value & !is.na(x)
x %betweenNA% y
x |
Vector for the left side of the operator |
y |
A vector of length two with the lower and upper values of the interval |
A logical vector of the same length as x
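The difference between the two interval operators can be illustrated with a base R sketch of the stated equivalence. The function name `between_na` is hypothetical; it simply spells out `x >= lower.value & x <= upper.value & !is.na(x)`.

```r
# Hypothetical helper (not part of the package): interval check that maps NA to FALSE
between_na <- function(x, y) {
  stopifnot(length(y) == 2)            # y holds the lower and upper limits
  x >= y[1] & x <= y[2] & !is.na(x)    # NA & FALSE is FALSE, so NA becomes FALSE
}

x <- c(0, 2, NA, 10)
x >= 1 & x <= 5          # FALSE TRUE NA FALSE (plain interval check keeps NA)
between_na(x, c(1, 5))   # FALSE TRUE FALSE FALSE
```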
Returns the least repeated value
antimoda(x)
x |
A categorical variable |
The anti-mode (least repeated value)
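As a rough illustration of the idea, the least repeated value can be obtained from a frequency table in base R. The name `least_common` is hypothetical, and how the package breaks ties or treats NA values is not stated here; this sketch returns the first value with the smallest count and ignores NA.

```r
# Hypothetical helper (not part of the package): least frequent value of a vector
least_common <- function(x) {
  counts <- table(x)                 # frequency of each distinct value (NA dropped)
  names(counts)[which.min(counts)]   # first value with the smallest count
}

least_common(c("a", "b", "b", "c", "c", "c"))  # "a"
```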
Checks for bivariate outliers in a data.frame
bivariate_outliers(x, threshold_r = 10, threshold_b = 1.5)
x |
A data.frame object |
threshold_r |
Threshold for the case of two continuous variables |
threshold_b |
Threshold for the case of one continuous and one categorical variable |
A data frame with all the observations considered as bivariate outliers
bivariate_outliers(iris)
Returns different data quality details of a numeric or categorical variable
check_quality( x, id = 1:length(x), plot = TRUE, numeric = NULL, k = 5, n = ifelse(is.numeric(x) | ttrue(numeric) | class(x) %in% "Date", 5, 2), output = FALSE, ... )
x |
A variable from a data.frame |
id |
ID column to reference the found extreme values |
plot |
If the variable is numeric, should a boxplot be drawn? |
numeric |
If set to TRUE, forces the variable to be considered numeric |
k |
Number of different numeric values in a variable to be considered as numeric |
n |
Number of extreme values to extract |
output |
Format of the output. If TRUE, the output is optimized for exporting as CSV |
... |
further arguments passed to boxplot() |
A list of a data.frame with information about data quality of the variable
check_quality(airquality$Ozone) #For one variable
lapply(airquality, check_quality) #For a data.frame
lapply(airquality, check_quality, output=TRUE) #For a data.frame, one row per variable
Displays associations between variables in a data.frame in a heatmap with clustering
cluster_var(x, margins = c(8, 1))
x |
A data.frame |
margins |
Margins for the plot |
A heatmap with the variable associations
cluster_var(iris)
cluster_var(mtcars)
Creates a detailed summary of the data
descriptive(x, z = 3, ignore.na = TRUE, by = NULL, print = TRUE)
x |
A data.frame |
z |
Number of decimal places |
ignore.na |
If TRUE, NA values are not counted when calculating relative frequencies |
by |
Factor variable defining groups for the summary |
print |
Should results be printed? |
Summary of the data
descriptive(iris)
descriptive(iris, by="Species")
Returns the nth lowest and highest values from a vector
extreme_values(x, n = 5, id = NULL)
x |
A vector |
n |
Number of extreme values to return |
id |
ID column to reference the found extreme values |
A matrix with the lowest and highest values from a vector
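No example accompanies this entry, so here is a base R sketch of the underlying idea: sort the vector and take both ends. The name `lowest_highest` is hypothetical and the list layout is an assumption; the package itself is documented as returning a matrix.

```r
# Hypothetical helper (not part of the package): n lowest and n highest values
lowest_highest <- function(x, n = 5) {
  s <- sort(x)                       # sort() drops NA values by default
  list(lowest  = head(s, n),         # n smallest values, ascending
       highest = rev(tail(s, n)))    # n largest values, descending
}

lowest_highest(airquality$Ozone, n = 3)
```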
Searches a data.frame for a specific character string and replaces it with another one
f_replace( x, string, replacement, complete = TRUE, select = 1:ncol(x), track = TRUE )
x |
A data.frame |
string |
A character string to search in the data.frame |
replacement |
A character string to replace the old string (can be NA) |
complete |
If TRUE, search for complete strings only. If FALSE, search also for partial strings. |
select |
Numeric vector with the positions (all by default) to be affected by the function |
track |
Track changes? |
iris2 <- f_replace(iris, "setosa", "ensata")
track_changes(iris2)
Tries to automatically fix all problems in the data.frame
fix_all(x, select = 1:ncol(x), track = TRUE)
x |
A data.frame |
select |
Numeric vector with the positions (all by default) to be affected by the function |
track |
Track changes? |
Fixes concatenated values in a variable
fix_concat(x, varname, sep = ", |; | ", track = TRUE)
x |
A data.frame |
varname |
Variable name |
sep |
Separator for the different values |
track |
Track changes? |
mydata <- data.frame(concat=c("a", "b", "a b" , "a b, c", "a; c"), numeric = c(1, 2, 3, 4, 5))
fix_concat(mydata, "concat")
Fixes dates. Dates can be recorded in numerous formats depending on the country, the traditions and the field of knowledge. fix_dates tries to detect all possible date formats and transforms all of them into the ISO standard favored by R (yyyy-mm-dd).
fix_dates( x, max.NA = 0.8, min.obs = nrow(x) * 0.05, use.probs = TRUE, select = 1:ncol(x), track = TRUE, parallel = TRUE )
x |
A data.frame |
max.NA |
Maximum allowed proportion of NA values created by coercion. If the
coercion to date creates more NA values than those specified in |
min.obs |
Minimum number of non-NA observations allowed per variable. If the variable
has fewer non-NA observations, then it will be ignored by |
use.probs |
When there are multiple date formats in the same column, there can
be ambiguities. For example, 04-06-2015 can be interpreted as 2015-06-04 or as 2015-04-06.
If |
select |
Numeric vector with the positions (all by default) to be affected by the function |
track |
Track changes? |
parallel |
Should the computations be performed in parallel? Set up strategy first with future::plan() |
mydata<-data.frame(Dates1=c("25/06/1983", "25-08/2014", "2001/11/01", "2008-10-01"), Dates2=c("01/01/85", "04/04/1982", "07/12-2016", "September 24, 2020"), Numeric1=rnorm(4))
fix_dates(mydata)
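To make the idea of mixed-format date parsing concrete, the following base R sketch tries a short list of candidate formats in order and keeps the first one that parses each value. The function name `parse_mixed_dates` and the three formats are assumptions chosen for illustration; this is far simpler than the detection described above and resolves ambiguous dates purely by format order, not by the frequency-based approach of `use.probs`.

```r
# Hypothetical helper (not part of the package): try several date formats in turn
parse_mixed_dates <- function(x, formats = c("%d/%m/%Y", "%Y-%m-%d", "%Y/%m/%d")) {
  out <- as.Date(rep(NA_character_, length(x)))    # start with all dates missing
  for (fmt in formats) {
    todo <- is.na(out)                             # values not parsed yet
    out[todo] <- as.Date(x[todo], format = fmt)    # NA when the format does not match
  }
  out
}

parse_mixed_dates(c("25/06/1983", "2001/11/01", "2008-10-01"))
# "1983-06-25" "2001-11-01" "2008-10-01"
```

The order of `formats` matters in this sketch: an earlier format always wins when more than one would match.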
Fixes factors imported as numerics. It is common in some fields to encode factor variables as integers. This function detects such variables and transforms them into factors. When drop=TRUE (the default), it also detects multiple versions of the same level caused by different capitalization, whitespace or non-ASCII characters.
fix_factors(x, k = 5, select = 1:ncol(x), drop = TRUE, track = TRUE)
x |
A data.frame |
k |
Maximum number of different numeric values to be converted to factor |
select |
Numeric vector with the positions (all by default) to be affected by the function |
drop |
Drop similar levels? |
track |
Keep track of changes? |
# mtcars data has all variables encoded as numeric, even the factor variables.
descriptive(mtcars)
# After using fix_factors, factor variables are recognized as such.
descriptive(fix_factors(mtcars))
Fixes levels of a factor
fix_levels( data, factor_name, method = "dl", levels = NULL, plot = FALSE, k = ifelse(!is.null(levels), length(levels), 2), track = TRUE, ... )
data |
data.frame with the factor to fix |
factor_name |
Name of the factor to fix (as character) |
method |
Method from stringdist package to estimate distances |
levels |
Optional vector with the levels names. If "auto", levels are assigned based on frequency |
plot |
Optional: Plot cluster dendrogram? |
k |
Number of levels for clustering |
track |
Keep track of changes? |
... |
Further parameters passed to stringdist::stringdistmatrix function |
mydata <- data.frame(factor1=factor(c("Control", "Treatment", "Tretament", "Tratment", "treatment", "teatment", "contrl", "cntrol", "CONTol", "not available", "na")))
fix_levels(mydata, "factor1", k=4, plot=TRUE) #Choose k to select matching levels
fix_levels(mydata, "factor1", levels=c("Control", "Treatment"), k=4)
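The clustering idea behind level fixing can be sketched in base R. This is an assumption-laden illustration, not the package's method: it swaps in `utils::adist()` (Levenshtein distance) for the stringdist package and uses `hclust()` with `cutree()` to group misspelled labels into `k` clusters.

```r
# Sketch only: group similar labels by edit distance (adist is used here
# in place of stringdist, which the package itself relies on)
lv <- c("Control", "Treatment", "Tretament", "Tratment", "treatment",
        "teatment", "contrl", "cntrol", "CONTol")

d <- adist(tolower(lv))                        # pairwise Levenshtein distances
groups <- cutree(hclust(as.dist(d)), k = 2)    # cut the dendrogram into 2 clusters

split(lv, groups)                              # labels belonging to each cluster
```

Each cluster can then be mapped to a single canonical label, for instance its most frequent member.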
Fixes miscoded missing values
fix_NA( x, na.strings = c("^$", "^ $", "^\\?$", "^-$", "^\\.$", "^NaN$", "^NULL$", "^N/A$"), track = TRUE, parallel = TRUE )
x |
A data.frame |
na.strings |
Strings to be considered NA |
track |
Track changes? |
parallel |
Should the computations be performed in parallel? Set up strategy first with future::plan() |
mydata <- data.frame(prueba = c("", NA, "A", 4, " ", "?", "-", "+"), casa = c("", 1, 2, 3, 4, " ", 6, 7))
fix_NA(mydata)
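The effect of the default `na.strings` patterns on a single character vector can be sketched with base R regular expressions. The helper name `recode_na` is hypothetical; it omits the change tracking and parallel processing that the function above provides.

```r
# Hypothetical helper (not part of the package): recode miscoded missing values
na_patterns <- c("^$", "^ $", "^\\?$", "^-$", "^\\.$", "^NaN$", "^NULL$", "^N/A$")

recode_na <- function(x, patterns = na_patterns) {
  x <- as.character(x)
  hit <- grepl(paste(patterns, collapse = "|"), x)   # TRUE where any pattern matches
  x[hit] <- NA
  x
}

recode_na(c("", NA, "A", "4", " ", "?", "-", "+"))
# NA NA "A" "4" NA NA NA "+"
```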
Fixes numeric data. In many cases, numeric data are not recognized by R because there are data inconsistencies (wrong decimal separator, whitespace, typos, thousands separator, etc.). fix_numerics detects and corrects these variables, making them numeric again.
fix_numerics( x, k = 8, max.NA = 0.2, select = 1:ncol(x), track = TRUE, parallel = TRUE )
x |
A data.frame |
k |
Minimum number of different values a variable has to have to be considered numerical |
max.NA |
Maximum allowed proportion of NA values created by coercion. If the
coercion to numeric creates more NA values than those specified in |
select |
Numeric vector with the positions (all by default) to be affected by the function |
track |
Keep track of changes? |
parallel |
Should the computations be performed in parallel? Set up strategy first with future::plan() |
mydata<-data.frame(Numeric1=c(7.8, 9.2, "5.4e+2", 3.3, "6,8", "3..3"), Numeric2=c(3.1, 1.2, "3.4s", "48,500.04 $", 7, "$ 6.4"))
descriptive(mydata)
descriptive(fix_numerics(mydata, k=5))
Reshapes a data frame from wide to long format
forge(data, affixes, force.fixed = NULL, var.name = "time")
data |
data.frame |
affixes |
Affixes for repeated measures |
force.fixed |
Variables with matching affix to be excluded |
var.name |
Name for the newly created variable (repetitions) |
#Data frame in wide format
df1 <- data.frame(id = 1:4, age = c(20, 30, 30, 35), score1 = c(2,2,3,4), score2 = c(2,1,3,1), score3 = c(1,1,0,1))
df1
#Data frame in long format
forge(df1, affixes= c("1", "2", "3"))
#Data frame in wide format with two repeated measured variables
df2 <- data.frame(df1, var1 = c(15, 20, 16, 19), var3 = c(12, 15, 15, 17))
df2
#Missing times are filled with NAs
forge(df2, affixes = c("1", "2", "3"))
#Use of parameter force.fixed
df3 <- df2[, -7]
df3
forge(df3, affixes=c("1", "2", "3"))
forge(df3, affixes=c("1", "2", "3"), force.fixed = c("var1"))
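For comparison, the same wide-to-long transformation of `df1` can be written with base R's `stats::reshape()`. This sketch shows what the reshaping amounts to, not how `forge` works internally; the column names `score` and `time` are chosen to mirror the example above.

```r
# Base R equivalent (sketch) of reshaping df1 from wide to long format
df1 <- data.frame(id = 1:4, age = c(20, 30, 30, 35),
                  score1 = c(2, 2, 3, 4), score2 = c(2, 1, 3, 1), score3 = c(1, 1, 0, 1))

long <- reshape(df1, direction = "long",
                varying = c("score1", "score2", "score3"),  # repeated-measure columns
                v.names = "score",                          # name of the stacked variable
                timevar = "time", times = 1:3,              # repetition index
                idvar = "id")
long
```

Here `reshape()` needs every repeated column listed explicitly, whereas `forge` is driven by the `affixes` argument.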
Function to format dates
fxd(d, use.probs = TRUE)
d |
A character vector |
use.probs |
Solve ambiguities by similarity to the most frequent formats |
Returns Goodman and Kruskal's tau measure of association between two categorical variables
GK_assoc(x, y)
x |
A categorical variable |
y |
A categorical variable |
Goodman and Kruskal's tau
data(infert)
GK_assoc(infert$education, infert$case)
GK_assoc(infert$case, infert$education) #Not the same
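The asymmetry noted in the example follows from the definition of Goodman and Kruskal's tau. A base R sketch of the textbook formula is given below; the name `gk_tau` is hypothetical, and the convention that the first argument is the predictor is an assumption of this sketch, since the direction used by `GK_assoc` is not stated here.

```r
# Sketch of Goodman and Kruskal's tau for predicting y from x (textbook formula)
gk_tau <- function(x, y) {
  p  <- prop.table(table(x, y))   # joint proportions p_ij
  px <- rowSums(p)                # marginal proportions of x
  py <- colSums(p)                # marginal proportions of y
  explained <- sum(p^2 / px)      # sum over i, j of p_ij^2 / p_i.
  baseline  <- sum(py^2)          # sum over j of p_.j^2
  (explained - baseline) / (1 - baseline)
}

gk_tau(infert$education, infert$case)
gk_tau(infert$case, infert$education)   # generally a different value
```

The measure is 1 when x determines y exactly and 0 when the two variables are independent in the sample.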
Loads all libraries used in scripts inside the selected path
good2go(path = getwd(), info = TRUE, load = TRUE)
path |
Path where the scripts are located |
info |
List the libraries found? |
load |
Should the libraries found be loaded? |
Creates an improved boxplot with individual data points
ipboxplot(formula, boxwex = 0.6, ...)
formula |
Formula for the boxplot |
boxwex |
Width of the boxes |
... |
further arguments passed to beeswarm() |
ipboxplot(Sepal.Length ~ Species, data=iris)
ipboxplot(mpg ~ gear, data=mtcars)
Changes factor variables to character
kill.factors(dat, k = 10)
dat |
A data.frame |
k |
Maximum number of levels for factors |
d <- data.frame(Letters=letters[1:20], Nums=1:20)
d$Letters
d <- kill.factors(d)
d$Letters
Calculates kurtosis of a numeric variable
kurtosis(x)
x |
A numeric variable |
kurtosis value
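The exact estimator is not specified in this entry. As a point of reference, a moment-based kurtosis can be sketched in base R as follows; the name `kurtosis_moment` is hypothetical, and whether the package reports raw or excess kurtosis (raw minus 3) is an open assumption.

```r
# Sketch: moment-based (raw) kurtosis, m4 / m2^2
kurtosis_moment <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)     # second central moment
  m4 <- mean(d^4)     # fourth central moment
  m4 / m2^2           # equals 3 for a normal distribution
}

kurtosis_moment(c(1, 2, 3, 4, 5))   # 1.7
```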
Tracks manual fixes performed on a variable in a data.frame
manual_fix(data, variable, subset, newvalues = NULL)
data |
A data.frame |
variable |
A character string with the name of the variable to be fixed |
subset |
A logical expression for selecting the cases to be fixed |
newvalues |
New value or values that will take the cases selected by |
iris2 <- manual_fix(iris, "Petal.Length", Petal.Length < 1.2, 0)
track_changes(iris2)
Checks if each value from a vector might be numeric
may.numeric(x)
x |
A vector |
A logical vector
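A strict version of this check can be sketched in base R by attempting the coercion and testing for NA. The name `could_be_numeric` is hypothetical; the package's own rule may be more lenient (for instance with values such as "6,8"), so this is only a lower bound on what might count as numeric.

```r
# Hypothetical helper (not part of the package): TRUE where as.numeric() succeeds
could_be_numeric <- function(x) {
  !is.na(suppressWarnings(as.numeric(as.character(x))))
}

could_be_numeric(c("3.1", "5.4e+2", "3.4s", "6,8", NA))
# TRUE TRUE FALSE FALSE FALSE
```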
Creates a heatmap-like plot for exploring the data
mine.plot( x, fun = is.na, spacing = 5, sort = F, show.x = TRUE, show.y = TRUE, ... )
x |
A data.frame |
fun |
A function that evaluates a vector and returns a logical vector |
spacing |
Numerical separation between lines at the y-axis |
sort |
If TRUE, variables are sorted according to their results |
show.x |
Should the x-axis be plotted? |
show.y |
Should the y-axis be plotted? |
... |
further arguments passed to order() |
mine.plot(airquality) #Displays missing data
mine.plot(airquality, fun=outliers) #Shows extreme values
Returns the most repeated value
moda(x)
x |
A categorical variable |
The mode
Estimates the number of modes
moda_cont(x)
x |
A numeric variable |
Estimated number of modes.
Modification of the tapply function to use with data.frames. Consider using aggregate()
mtapply(x, group, fun)
x |
A data.frame |
group |
Grouping variable |
fun |
Function to apply by group |
mtapply(mtcars, mtcars$gear, mean)
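Since the description suggests `aggregate()` as an alternative, the same group-wise means can be computed in base R as shown below. The object name `by_gear` is only illustrative.

```r
# Base R alternative: mean of every other column of mtcars by number of gears
by_gear <- aggregate(. ~ gear, data = mtcars, FUN = mean)
by_gear
```

The result has one row per value of `gear` and one column per summarized variable.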
Modified version of the mtcars dataset with different types of errors in the data. The dataset has 13 variables and 32 observations.
mtcars_messy
A data frame with 32 observations and 13 variables
datasets package
Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
descriptive(mtcars_messy)
Finds positions for substitution of characters in Distribution column
nearest(x, to = seq(0, 1, length.out = 30))
x |
A numeric value between 0 and 1 |
to |
Range of reference values |
The nearest position to the input value
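Finding the nearest position in a reference grid can be sketched in one line of base R. The name `nearest_pos` is hypothetical, and returning the first position in case of ties is an assumption of this sketch.

```r
# Hypothetical helper (not part of the package): index of the closest reference value
nearest_pos <- function(x, to = seq(0, 1, length.out = 30)) {
  which.min(abs(to - x))   # first position with the smallest absolute difference
}

nearest_pos(0)     # 1
nearest_pos(1)     # 30
```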
Changes names of a data frame to ease work with them
nice_names(x, select = 1:ncol(x), tolower = TRUE, track = TRUE)
x |
A data.frame |
select |
Numeric vector with the positions (all by default) to be affected by the function |
tolower |
Set all names to lower case? |
track |
Track changes? |
The input data.frame x with the fixed names
d <- data.frame('Variable 1'=NA, '% Response'=NA, ' Variable 3'=NA,check.names=FALSE)
names(d)
names(nice_names(d))
If possible, coerces values from a vector to numeric
numeros(x)
x |
A vector |
A numeric vector
Function for detecting outliers based on the boxplot method
outliers(x, threshold = 1.5)
x |
A vector |
threshold |
Threshold (as multiple of the IQR) to consider an observation as outlier |
outliers(iris$Petal.Length)
outliers(airquality$Ozone)
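The boxplot rule referred to above flags values lying more than `threshold` times the IQR beyond the quartiles. A base R sketch is given below; the name `is_outlier` is hypothetical, and the use of `quantile()` with its default method is an assumption (`boxplot()` itself uses hinges, which can differ slightly).

```r
# Hypothetical helper (not part of the package): boxplot-rule outlier flags
is_outlier <- function(x, threshold = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)   # first and third quartiles
  iqr <- q[2] - q[1]
  unname(x < q[1] - threshold * iqr | x > q[2] + threshold * iqr)
}

is_outlier(c(1, 2, 3, 4, 100))   # FALSE FALSE FALSE FALSE TRUE
```

Missing values in `x` stay NA in this sketch, so they are neither flagged nor cleared.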
Takes a peek into a data.frame returning a concise visualization about it
peek(x, n = 10, which = 1:ncol(x))
x |
A data.frame |
n |
Number of rows to include in output |
which |
Columns to include in output |
peek(iris)
Returns the proportion for the most repeated value
prop_may(x, ignore.na = TRUE)
x |
A categorical variable |
ignore.na |
Should NA values be ignored for computing proportions? |
A proportion
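The role of `ignore.na` can be illustrated with a base R sketch built on `table()`. The name `prop_most_common` is hypothetical; the assumption here is that ignoring NA values removes them from both the numerator and the denominator.

```r
# Hypothetical helper (not part of the package): share of the most frequent value
prop_most_common <- function(x, ignore.na = TRUE) {
  counts <- table(x, useNA = if (ignore.na) "no" else "ifany")
  max(counts) / sum(counts)
}

x <- c("a", "a", "b", NA)
prop_most_common(x)                      # 2 of 3 non-missing values
prop_most_common(x, ignore.na = FALSE)   # 2 of 4 values
```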
Returns the proportion for the least repeated value
prop_min(x, ignore.na = TRUE)
x |
A categorical variable |
ignore.na |
Should NA values be ignored for computing proportions? |
A proportion
Removes empty rows or columns from data.frames
remove_empty(x, remove_rows = TRUE, remove_cols = TRUE, track = TRUE)
x |
A data.frame |
remove_rows |
Remove empty rows? |
remove_cols |
Remove empty columns? |
track |
Track changes? |
mydata <- data.frame(a = c(NA, NA, NA, NA, NA), b = c(1, NA, 3, 4, 5), c=c(NA, NA, NA, NA, NA), d=c(4, NA, 5, 6, 3))
remove_empty(mydata)
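The same clean-up can be approximated in base R by counting non-missing cells per row and per column. The name `drop_empty` is hypothetical; this sketch treats only NA as empty and does not record changes for later tracking.

```r
# Hypothetical helper (not part of the package): drop all-NA rows and columns
drop_empty <- function(x) {
  keep_rows <- rowSums(!is.na(x)) > 0   # rows with at least one observed value
  keep_cols <- colSums(!is.na(x)) > 0   # columns with at least one observed value
  x[keep_rows, keep_cols, drop = FALSE]
}

mydata <- data.frame(a = c(NA, NA, NA, NA, NA), b = c(1, NA, 3, 4, 5),
                     c = c(NA, NA, NA, NA, NA), d = c(4, NA, 5, 6, 3))
drop_empty(mydata)   # keeps columns b and d and drops the second row
```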
Restores original values after using a fix function
restore_changes(tracking)
tracking |
A data.frame generated by track_changes() function |
mydata<-data.frame(Dates1=c("25/06/1983", "25-08/2014", "2001/11/01", "2008-10-01"), Dates2=c("01/01/85", "04/04/1982", "07/12-2016", NA), Numeric1=rnorm(4))
mydata <- fix_dates(mydata)
mydata
tracking <- track_changes(mydata)
mydata_r <- restore_changes(tracking)
mydata_r
Scales data to the 0-1 range
scale_01(x)
x |
A numeric variable |
Scaled data
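Rescaling to the 0-1 range is usually done with min-max scaling, which the following base R sketch shows. The name `rescale01` is hypothetical, and the handling of NA values (ignored when computing the range) is an assumption.

```r
# Hypothetical helper (not part of the package): min-max scaling to [0, 1]
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)         # minimum and maximum of the observed values
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(2, 4, 6))   # 0.0 0.5 1.0
```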
Searches for strings in R script files
search_scripts(string, path = getwd(), recursive = TRUE)
string |
Character string to search |
path |
Character vector with the path name |
recursive |
Logical. Should the search be recursive into subdirectories? |
A list with each element being one of the files containing the search string
Calculates skewness of a numeric variable
skewness(x)
x |
A numeric variable |
skewness value
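The estimator used is not specified in this entry. A common moment-based definition is sketched below in base R for reference; the name `skewness_moment` is hypothetical and the package may apply a different small-sample correction.

```r
# Sketch: moment-based skewness, m3 / m2^(3/2)
skewness_moment <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)      # second central moment
  m3 <- mean(d^3)      # third central moment
  m3 / m2^(3 / 2)      # 0 for symmetric data, positive for a long right tail
}

skewness_moment(c(1, 2, 3, 4, 5))   # 0
skewness_moment(c(1, 1, 1, 5))      # positive
```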
Function to transform text into dates
text_date(date, format = "%d/%Y %b")
date |
A date |
format |
Format of the date |
Gets a data.frame with all the changes performed by the different fix functions
track_changes(x, subset)
x |
A data.frame |
subset |
Logical expression for subsetting the data.frame with the changes |
mydata<-data.frame(Dates1=c("25/06/1983", "25-08/2014", "2001/11/01", "2008-10-01"), Dates2=c("01/01/85", "04/04/1982", "07/12-2016", NA), Numeric1=rnorm(4))
mydata <- fix_dates(mydata)
mydata
track_changes(mydata)
Makes possible vectorized logical comparisons against NULL and NA values
ttrue(x)
x |
A logical vector |
A logical vector
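The motivation is that comparisons involving NULL or NA do not yield a usable TRUE/FALSE. The base R sketch below illustrates one way to obtain that behavior; the name `true_only` is hypothetical and the package's actual handling may differ in detail.

```r
# Hypothetical helper (not part of the package): treat NULL and NA as FALSE
true_only <- function(x) {
  if (is.null(x)) return(FALSE)   # NULL has length zero; map it to a single FALSE
  !is.na(x) & x                   # NA becomes FALSE, other values are kept
}

FALSE | NULL               # logical(0): unusable inside ifelse() or if()
FALSE | true_only(NULL)    # FALSE
true_only(c(TRUE, NA, FALSE))   # TRUE FALSE FALSE
```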
Reshapes a data frame from long to wide format
unforge(data, origin, variables, prefix = origin)
data |
data.frame |
origin |
Character vector with variable names in data containing the values to be assigned to the different new variables |
variables |
Variable in data containing the variable names to be created |
prefix |
Vector with prefixes for the new variable names |
#Data frame in wide format
df1 <- data.frame(id = 1:4, age = c(20, 30, 30, 35), score1 = c(2,2,3,4), score2 = c(2,1,3,1), score3 = c(1,1,0,1))
df1
#Data frame in long format
df2 <- forge(df1, affixes= c("1", "2", "3"))
df2
#Data frame in wide format again
df3 <- unforge(df2, "score", "time", prefix="score")
Function to track_changes
v_df_changes(x, y)
x |
Original data.frame |
y |
New data.frame |
Returns information regarding the different objects in global environment
workspace(table = FALSE)
table |
If TRUE a table with the frequencies of each type of object is given |
A list of object names by class or a table with frequencies if table = TRUE
df1 <- data.frame(x=rnorm(10), y=rnorm(10, 1, 2))
df2 <- data.frame(x=rnorm(20), y=rnorm(20, 1, 2))
workspace(table=TRUE) #Frequency table of the different object classes
workspace() #All objects in the global environment separated by class
Applies a function over all objects of a specific class in the global environment
workspace_sapply(object_class, action = "summary")
object_class |
Class of the objects where the function is to be applied |
action |
Name of the function to apply |
Results of the function
df1 <- data.frame(x=rnorm(10), y=rnorm(10, 1, 2))
df2 <- data.frame(x=rnorm(20), y=rnorm(20, 1, 2))
workspace_sapply("data.frame", "summary") #Gives a summary of each data.frame
Calculates different scores to measure how extreme the different data points are
xscores(x, type = "z")
x |
A vector |
type |
'z' calculates standard normal scores, 'z-out' calculates standard normal scores excluding each data point when computing the mean and the standard deviation, 't' calculates t scores, 'chisq' calculates chi-squared scores, 'tukey' calculates scores based on the boxplot method, 'mad' calculates scores using median and mad instead of mean and sd. |
xscores(iris$Sepal.Length, type="z-out")
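Two of the score types listed above can be sketched directly in base R. The helper names `z_score` and `mad_score` are hypothetical; the 'mad' variant assumes the default scaling constant of `stats::mad()`, which may or may not match the package.

```r
# Sketches of two score types (not the package's code)
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)       # standard normal scores
}
mad_score <- function(x) {
  (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)    # robust analogue of z
}

head(z_score(iris$Sepal.Length))
head(mad_score(iris$Sepal.Length))
```

The robust version is less affected by the extreme values it is meant to detect, because the median and MAD change little when a few points are far from the rest.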