Title: | Generates Data Sets for Class Demonstrations |
---|---|
Description: | For educational purposes to demonstrate the importance of multiple regression. The genset function generates a data set from an initial data set to have the same summary statistics (mean, median, and standard deviation) but opposing regression results. |
Authors: | Lori Murray [aut, cre], John Wilson [aut, cre] |
Maintainer: | Lori Murray <[email protected]> |
License: | GPL-2 |
Version: | 0.1.0 |
Built: | 2024-10-31 20:45:40 UTC |
Source: | CRAN |
Generate data sets to demonstrate the importance of multiple regression. 'genset' generates a data set from an initial data set to have the same summary statistics (mean, median, and standard deviation) but opposing regression results. The initial data set must have one response variable (continuous) and two predictor variables (both continuous, or one continuous and one categorical with 2 levels) that are statistically significant in a linear regression model.
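Before calling genset, it can help to confirm that the initial data set meets this requirement. The following is a minimal sketch in base R; the helper name `predictors_significant` and the choice of `mtcars` columns are illustrative assumptions and are not part of the package.

```r
# Hypothetical helper (not part of genset): returns TRUE when every
# predictor in a fitted linear model has a p-value below alpha.
predictors_significant <- function(fit, alpha = 0.05) {
  coefs <- summary(fit)$coefficients
  # Drop the intercept row and keep only the p-value column
  p_values <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)"]
  all(p_values < alpha)
}

# Illustrative initial data set: mpg explained by horsepower and weight
set1 <- data.frame(y = mtcars$mpg, x1 = mtcars$hp, x2 = mtcars$wt)
fit1 <- lm(y ~ x1 + x2, data = set1)

# TRUE means both predictors are significant at the 0.05 level
predictors_significant(fit1)
```

If the check returns FALSE, the data set does not satisfy the precondition stated above, and a different choice of variables may be needed.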
genset(y, x1, x2, method, option, n, decrease, output)
y |
a vector containing the response variable (continuous) |
x1 |
a vector containing the first predictor variable (continuous) |
x2 |
a vector containing the second predictor variable (continuous or
categorical with 2 levels). If variable is categorical
then argument is |
method |
the method |
option |
the variable(s) that will not be
statistically significant in the new data set
( |
n |
maximum number of iterations |
decrease |
decreases the significance level when |
output |
print each iteration when |
The summary statistics are within a (predetermined) tolerance level and, when rounded, will be the same as those of the original data set. We use the standard convention of 0.05 as the significance level. The default number of iterations is n=2000. Fewer than 2000 iterations may or may not be sufficient, depending on the initial data set.
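The "same when rounded" criterion can be checked directly. The following is a minimal sketch, assuming a hypothetical helper `same_rounded_stats`. A random permutation of one column stands in for a generated data set here, because reordering values leaves the mean, median, and standard deviation unchanged. It illustrates the check only and makes no claim about how genset produces its output.

```r
# The three summary statistics named in the description
summary_stats <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x))
}

# Hypothetical helper (not part of genset): TRUE when every column of
# `a` and `b` has the same rounded mean, median, and standard deviation.
same_rounded_stats <- function(a, b, digits = 1) {
  stats_a <- round(sapply(a, summary_stats), digits)
  stats_b <- round(sapply(b, summary_stats), digits)
  identical(dim(stats_a), dim(stats_b)) && all(stats_a == stats_b)
}

set1 <- data.frame(y = mtcars$mpg, x1 = mtcars$hp, x2 = mtcars$wt)

# Stand-in for a generated data set: shuffle the values of x1 only
set.seed(101)
stand_in <- set1
stand_in$x1 <- sample(stand_in$x1)

# Compare the rounded summary statistics of the two data sets
same_rounded_stats(set1, stand_in)
```

The same helper could be applied to the data frame returned by genset in place of `stand_in`, with `digits` chosen to match the rounding used in class.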
Returns an object of class "data.frame" containing the generated data set: (in order) the response variable, first predictor variable and second predictor variable.
Lori Murray & John Wilson
Murray, L. and Wilson, J. (2020). The Need for Regression: Generating Multiple Data Sets with Identical Summary Statistics but Differing Conclusions. Decision Sciences Journal of Innovative Education. Accepted for publication.
    library(genset)

    ## Choose variables of interest
    y <- mtcars$mpg
    x1 <- mtcars$hp
    x2 <- mtcars$wt

    ## Create a data frame
    set1 <- data.frame(y, x1, x2)

    ## Check summary statistics
    multi.fun <- function(x) {
      c(mean = mean(x), median = median(x), sd = sd(x))
    }
    round(multi.fun(set1$y), 0)
    round(multi.fun(set1$x1), 1)
    round(multi.fun(set1$x2), 1)

    ## Fit linear regression model to verify that both
    ## regressors are statistically significant (p-value < 0.05)
    summary(lm(y ~ x1 + x2, data = set1))

    ## Set seed to reproduce the same data set
    set.seed(101)
    set2 <- genset(y, x1, x2, method = 1, option = "x1", n = 1000)

    ## Verify summary statistics match set 1
    round(multi.fun(set2$y), 0)
    round(multi.fun(set2$x1), 1)
    round(multi.fun(set2$x2), 1)

    ## Fit linear regression model to verify that x1 is
    ## not statistically significant (p-value > 0.05)
    summary(lm(y ~ x1 + x2, data = set2))