Title: | Monotonic Increasing |
---|---|
Description: | Various imputation methods are utilized in this package, where one can flag and impute non-monotonic data that is outside of a prespecified range. |
Authors: | Melyssa Minto, Michele Josey, and ClarLynda Williams-DeVane |
Maintainer: | Michele Josey <[email protected]> |
License: | GPL-3 |
Version: | 1.1 |
Built: | 2024-12-17 06:53:42 UTC |
Source: | CRAN |
The MonoInc package in R cleans data so that erroneous values have less influence on statistical results. Given a prespecified range, MonoInc determines whether an observation is “unusual” and then replaces the value if the user chooses. MonoInc imputes each participant's data individually, so the number of time points need not be the same across participants. MonoInc also removes duplicate rows.
Package: | MonoInc |
Type: | Package |
Version: | 1.1 |
Date: | 2016-05-19 |
License: | GPL-3 |
Melyssa Minto, Michele Josey
Maintainer: Michele Josey [email protected]
CDC growth chart of heights of female children aged 0 to 120 months.
data("data.r")
A data frame with 121 observations on the following 3 variables.
Age
a numeric vector
Per_5
a numeric vector
Per_95
a numeric vector
Range data needed for the simulated data.
http://www.cdc.gov/growthcharts/clinical_charts.htm
data(data.r)

## plot Range boundary lines
tol <- 3
plot(data.r$Age, data.r$Per_5, type="l", lty=2, col=2)
lines(data.r$Age, data.r$Per_95, type="l", lty=2, col=2)
lines(data.r$Age, data.r$Per_5 - tol, type="l", lty=2, col=4)
lines(data.r$Age, data.r$Per_95 + tol, type="l", lty=2, col=4)
Chart of measurements of children aged 0 to 120 months
data("decData.r")
A data frame with 121 observations on the following 3 variables.
Age
a numeric vector
L.bound
a numeric vector
U.bound
a numeric vector
Range data needed for the simulated decreasing data.
data(decData.r)

## plot Range boundary lines
tol <- 3
plot(decData.r[,1], decData.r[,2], type="l", lty=2, col=2)
lines(decData.r[,1], decData.r[,3], type="l", lty=2, col=2)
lines(decData.r[,1], decData.r[,2] - tol, type="l", lty=2, col=4)
lines(decData.r[,1], decData.r[,3] + tol, type="l", lty=2, col=4)
This function flags data that is outside the prespecified range and that is not monotonic.
mono.flag(data, id.col, x.col, y.col, min, max, data.r = NULL, tol = 0, direction)
data | a data.frame or matrix of measurement data |
id.col | column where the ids are stored |
x.col | column where the x values, or time variable, are stored |
y.col | column where the y values, or measurements, are stored |
min | lowest acceptable value for a measurement; does not have to be a value that occurs in y.col |
max | highest acceptable value for a measurement; does not have to be a value that occurs in y.col |
data.r | prespecified range for y values; must have three columns: 1 - must match values in x.col, 2 - lower range values, 3 - upper range values |
tol | tolerance; how far outside of the range (data.r) is acceptable; same units as the data in y.col |
direction | the direction of the function; a choice between increasing 'inc' and decreasing 'dec' |
The data range (data.r) does not need to have the same number of rows as data; it only needs to include the same time increments that appear in x.col.
Returns the data matrix with two additional columns. "Decreasing" is a logical vector that is TRUE if the observation decreases, or causes the ID to be non-monotonic. "Outside.Range" is a logical vector that returns TRUE if the observation is outside of the data.r +/- tol range. Any duplicate rows are removed.
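The two flags described above can be illustrated with a minimal base-R sketch. This is not the package's implementation: `flag_sketch` and the `bounds` data frame are hypothetical names, and the rule used for "Decreasing" (a value lower than the largest earlier value for the same id, with rows already sorted by time) is an assumption made for illustration.

```r
# Hypothetical helper, NOT mono.flag(): a base-R sketch of the two flags.
flag_sketch <- function(id, x, y, bounds, tol = 0) {
  # bounds: data frame with columns x, lower, upper (one row per time point)
  idx <- match(x, bounds$x)
  outside <- y < bounds$lower[idx] - tol | y > bounds$upper[idx] + tol
  # Assumed rule: within each id (rows sorted by x), flag a value that is
  # lower than the largest value seen earlier for that id.
  decreasing <- ave(y, id, FUN = function(v) {
    previous_max <- cummax(c(-Inf, v))[seq_along(v)]
    as.numeric(v < previous_max)
  }) == 1
  data.frame(id, x, y, Decreasing = decreasing, Outside.Range = outside)
}

# Illustrative range and measurements (made-up values)
bounds <- data.frame(x = 0:3,
                     lower = c(45, 50, 53, 56),
                     upper = c(55, 60, 63, 66))
obs <- flag_sketch(id = c(1, 1, 1, 1, 2, 2),
                   x  = c(0, 1, 2, 3, 0, 1),
                   y  = c(50, 55, 53, 90, 48, 52),
                   bounds = bounds, tol = 3)
print(obs)
```

In this toy input the third observation of id 1 drops below an earlier value, and the fourth lies above the upper bound plus the tolerance.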
Michele Josey, Melyssa Minto
data(simulated_data)
simulated_data <- simulated_data[1:1000,]
data(data.r)

## run mono.flag function
test <- mono.flag(simulated_data, 1, 2, 3, 30, 175, data.r=data.r, direction='inc')
head(test)
This function reports the proportion of entries that fall inside of the prespecified range.
mono.range(data, data.r, tol, xr.col, x.col, y.col)
data | a data.frame or matrix of measurement data |
data.r | range for y values; must have three columns: 1 - must match values in x.col, 2 - lower range values, 3 - upper range values |
tol | tolerance; how far outside of the range (data.r) is acceptable; same units as the data in y.col |
xr.col | column of data.r where the x values, or time variable, are stored |
x.col | column of data where the x values, or time variable, are stored |
y.col | column of data where the y values, or measurements, are stored |
Returns the proportion of y values that fall inside the prespecified range
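As a rough illustration of what such a proportion means, the base-R sketch below matches each observation to its time point in a range table and counts how many measurements fall within the bounds widened by the tolerance. `prop_inside_sketch` and `bounds` are hypothetical names, and treating the boundaries as inclusive is an assumption; the package's own calculation may differ.

```r
# Hypothetical helper, NOT mono.range(): proportion of y inside [lower - tol, upper + tol].
prop_inside_sketch <- function(x, y, bounds, tol = 0) {
  # Look up the range row whose time value equals each observation's time
  idx <- match(x, bounds$x)
  # Assumption: boundaries are inclusive
  inside <- y >= bounds$lower[idx] - tol & y <= bounds$upper[idx] + tol
  mean(inside, na.rm = TRUE)
}

# Illustrative range and measurements (made-up values)
bounds <- data.frame(x = 0:2, lower = c(45, 50, 53), upper = c(55, 60, 63))
p <- prop_inside_sketch(x = c(0, 1, 2, 2),
                        y = c(50, 70, 52, 64),
                        bounds = bounds, tol = 1)
print(p)
```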
Michele Josey, Melyssa Minto
data(simulated_data)
data(data.r)
mono.range(simulated_data, data.r, tol=4, xr.col=1, x.col=2, y.col=3)
Combines many of the functions in the MonoInc package. Given a data range, weights, and imputation methods of choice, MonoInc will impute flagged values using either one or a combination of two imputation methods. It can also perform all single imputation methods for comparison.
MonoInc(data, id.col, x.col, y.col, data.r = NULL, tol = 0, direction = "inc", w1 = 0.5, min, max, impType1 = "nn", impType2 = "reg", sum = FALSE)
data | a data.frame or matrix of measurement data |
id.col | column where the ids are stored |
x.col | column where the x values, or time variable, are stored |
y.col | column where the y values, or measurements, are stored |
data.r | range for y values; must have three columns: 1 - must match values in x.col, 2 - lower range values, 3 - upper range values |
tol | tolerance; how far outside of the range (data.r) is acceptable; same units as the data in y.col |
direction | the direction of the function; a choice between increasing 'inc' and decreasing 'dec' |
w1 | weight of imputation type 1 (impType1); default is 0.50 |
min | lowest acceptable value for a measurement; does not have to be a value that occurs in y.col |
max | highest acceptable value for a measurement; does not have to be a value that occurs in y.col |
impType1 | imputation method 1, a choice between Nearest Neighbor "nn", Regression "reg", Fractional Regression "fr", Last Observation Carried Forward "locf", or Last & Next "ln"; default is "nn" |
impType2 | imputation method 2; default is "reg" |
sum | if TRUE, the function returns a matrix of all imputation methods in the package |
If two imputation methods are chosen, MonoInc takes a weighted average of the values imputed by each method. The user must choose one or two imputation methods, or set sum=TRUE for a comparison. If there are not enough values available to impute a missing or erroneous value, MonoInc returns NA. Advice: Do NOT overwrite the original data using this function! Use parallel processing if it is available on your device.
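The weighted average of two imputation methods can be sketched in a few lines of base R. This is only an illustration: `combine_sketch` is a hypothetical helper, the input vectors are made-up numbers, and giving the second method the weight `1 - w1` is an assumption, since only `w1` is documented here.

```r
# Hypothetical helper, NOT part of MonoInc: combine two sets of imputed values.
combine_sketch <- function(imp1, imp2, w1 = 0.5) {
  stopifnot(w1 >= 0, w1 <= 1)
  # Assumption: the second method receives the remaining weight 1 - w1
  w1 * imp1 + (1 - w1) * imp2
}

# Made-up imputed values from two methods for three flagged observations;
# the first method could not impute the third value
nn_values <- c(60, 72, NA)
fr_values <- c(64, 70, 81)
combined <- combine_sketch(nn_values, fr_values, w1 = 0.3)
print(combined)
```

In this sketch an NA from either method propagates to the combined value, which mirrors the documented behavior of returning NA when a value cannot be imputed.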
Returns the data matrix with additional columns for the selected imputation method. If sum=TRUE, it returns a column for each single imputation method. In the summary output only, the Y column contains NA wherever an observation was flagged and imputed. Duplicate rows are removed.
Michele Josey, Melyssa Minto
data(simulated_data)
simulated_data <- simulated_data[1:1000,]
data(data.r)
library(sitar)

## Run MonoInc
sum <- MonoInc(simulated_data, 1,2,3, data.r,5,direction='inc', w1=0.3, min=30, max=175, impType1=NULL, impType2=NULL, sum=TRUE)
head(sum)

test <- MonoInc(simulated_data, 1,2,3, data.r,5,direction='inc', w1=0.3, min=30, max=175, impType1="nn", impType2="fr")
head(test)

## plot longitudinal height for each id
mplot(x=X, y=Nn.Fr, data=test)
tol <- 5
lines(data.r[,1], data.r[,2]-tol, col=2, lty=2)
lines(data.r[,1], data.r[,3]+tol, col=2, lty=2)
This function checks the monotonicity of a single vector, or of a matrix or data.frame that contains multiple IDs.
monotonic(data, id.col=NULL, y.col=NULL, direction)
data | a data.frame, matrix, or vector of measurement data |
id.col | column where the ids are stored; default is NULL |
y.col | column where the y values, or measurements, are stored; default is NULL |
direction | the direction of the function; a choice between increasing 'inc' and decreasing 'dec' |
If the user supplies a vector, the function returns TRUE or FALSE according to whether the vector is monotonic in the chosen direction; it returns NA if the vector has missing values. If the user supplies a matrix or data frame, the function returns a matrix with two columns. The first column is the id. The second column is 1 (TRUE) if the data for that id are monotonic in the chosen direction, 0 (FALSE) if they are not, or NA if the y column has missing values for that id.
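The return behavior described above can be mimicked with a short base-R sketch built on `diff()`. `is_monotonic_sketch` is a hypothetical helper rather than the package's `monotonic()`, and treating ties as monotonic (non-strict comparison) is an assumption.

```r
# Hypothetical helper, NOT monotonic(): check one vector in a given direction.
is_monotonic_sketch <- function(v, direction = c("inc", "dec")) {
  direction <- match.arg(direction)
  if (anyNA(v)) return(NA)        # missing values give NA, as described above
  d <- diff(v)
  # Assumption: ties count as monotonic (non-strict comparison)
  if (direction == "inc") all(d >= 0) else all(d <= 0)
}

# Single vector
single <- is_monotonic_sketch(c(1, 2, 3, 5, 7, 8), direction = "inc")

# One result per id, using tapply() to split the measurements
id <- c(1, 1, 1, 2, 2, 2, 3, 3)
y  <- c(1, 2, 3, 5, 4, 6, 2, NA)
by_id <- tapply(y, id, is_monotonic_sketch, direction = "inc")
print(by_id)
```

Here id 1 is monotonic increasing, id 2 is not, and id 3 yields NA because it contains a missing value.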
Michele Josey, Melyssa Minto
data(simulated_data)

## Run monotonic
test <- monotonic(simulated_data, 1,3, direction='inc')

## look at the number of ids that are non-monotonic
table(as.logical(test[,2]))

## to ignore NA values
x <- c(1,2,3,5,NA,7,8)
monotonic(na.omit(x), direction='inc')
This data was simulated to be monotonically decreasing. There are 500 individuals, with a random number of data points. Each individual has a two-level random effect (intercept and slope), a common intercept, and a random error term. The ages range from 0 to 10 years, which is given in months.
data("simDEC_data")
A data frame with 5505 observations on the following 3 variables.
id
a numeric vector of the identification number of each individual
age
a numeric vector of the age in months
y
a numeric vector of measurements
http://blog.stata.com/2014/07/18/how-to-simulate-multilevellongitudinal-data/
data(simDEC_data)
library(sitar)
mplot(x=age, y=y, id=id, data=simDEC_data, col=id, main="Individual Measurement Curves")
This data was simulated to imitate height growth of female children in electronic medical records. There are 500 individuals, with a random number of data points. Based on the CDC growth curve, each individual has a two-level random effect (intercept and slope), a common intercept, and a random error term. The ages range from 0 to 10 years, which is given in months.
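A data set with this general structure (random intercept and slope per individual, a common intercept, and a random error term) could be generated along the following lines. This is a hedged sketch only: the linear growth curve, all parameter values, and the names `n_id` and `sim` are assumptions chosen for illustration and are not the settings used to build the packaged data.

```r
set.seed(1)                                  # make the sketch reproducible
n_id <- 5                                    # small number of individuals for illustration

sim <- do.call(rbind, lapply(seq_len(n_id), function(i) {
  n_obs <- sample(3:8, 1)                    # random number of data points
  age   <- sort(sample(0:120, n_obs))        # ages in months, 0 to 10 years
  u0    <- rnorm(1, sd = 2)                  # individual random intercept
  u1    <- rnorm(1, sd = 0.02)               # individual random slope
  # Common intercept 50 and common slope 0.7 are arbitrary, assumed values
  y     <- 50 + u0 + (0.7 + u1) * age + rnorm(n_obs, sd = 1)
  data.frame(id = i, age = age, y = y)
}))

head(sim)
```

Because of the random error term, individual curves produced this way are not guaranteed to be monotonic, which is the kind of data the flagging and imputation functions are meant to handle.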
data("simulated_data")
A data frame with 5673 observations on the following 3 variables.
nestid
a numeric vector of the identification number of each individual
age
a numeric vector of the age in months
height
a numeric vector of the height in centimeters
http://blog.stata.com/2014/07/18/how-to-simulate-multilevellongitudinal-data/
data(simulated_data)
library(sitar)

## plot each individual growth curve
mplot(x=age, y=height, id=nestid, data=simulated_data, col=nestid, main="Growth Curves")