This package was developed for educational purposes to demonstrate
the importance of multiple regression. genset generates a
data set from an initial data set to have the same summary statistics
(mean, median, and standard deviation) but opposing regression results.
The initial data set should have one response variable (continuous) and
two predictor variables (both continuous, or one continuous and one
categorical with 2 levels) that are statistically significant in a linear
regression model such as Y = Xβ + ϵ.
Use the following call if your data set consists of 2 predictor variables (both continuous):
genset(y=y, x1=x1, x2=x2, method=1, option="x1", n=n)
Use the following call if your data set consists of 2 predictor variables (1 continuous and 1 categorical with 2 levels):
genset(y=y, x1=x1, x2=factor(x2), method=1, option="x1", n=n)
y: a vector containing the response variable (continuous).
x1: a vector containing the first predictor variable (continuous).
x2: a vector containing the second predictor variable (continuous or categorical with 2 levels). If the variable is categorical, pass it as factor(x2).
method: the method, 1 or 2, to be used to generate the data set. 1 (default) rearranges the values within each variable, and 2 is a perturbation method that makes subtle changes to the values of the variables.
option: the variable(s) that should no longer be statistically significant in the generated data set ("x1", "x2" or "both").
n: the number of iterations.
The summary statistics are within a (predetermined) tolerance level,
and when rounded will be the same as those of the original data set. We
use the standard convention of 0.05 as the significance level. The
default number of iterations is n=2000. Fewer than 2000 iterations may
or may not be sufficient, depending on the initial data set.
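Rearranging values (method 1) can leave the summary statistics untouched because the mean, median, and standard deviation of a vector do not depend on the order of its values. The following is a minimal base-R sketch of that idea only, not genset's actual search procedure; the helper name summary_stats and the seed are illustrative assumptions.

```r
# Illustrative helper (not part of genset): mean, median, and sd of a vector.
summary_stats <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x))
}

set.seed(1)                  # arbitrary seed, used only for reproducibility
hp <- mtcars$hp              # mtcars is a built-in R data set
hp_shuffled <- sample(hp)    # same values in a rearranged order

# The summary statistics agree because they ignore the order of the values
all.equal(summary_stats(hp), summary_stats(hp_shuffled))

# The relationship with the response, however, generally changes
cor(mtcars$mpg, hp)
cor(mtcars$mpg, hp_shuffled)
```

A random shuffle like this will usually weaken the relationship with the response; finding an arrangement that gives a specific regression outcome is the part genset is meant to handle.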
Returns an object of class “data.frame” containing the generated data set: (in order) the response variable, first predictor variable and second predictor variable.
Load the genset library:
library(genset)
We will use the built-in data set mtcars to illustrate how to generate
a new data set. Details about the data set can be found by typing
?mtcars. We set the variable mpg as the response variable y, and hp and
wt as the two continuous predictor variables (x1 and x2). Then we
combine the variables into a data frame called set1.
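The setup described above can be written as follows. This is a sketch that only assumes the variable names used in the rest of this example (y, x1, x2, set1).

```r
# mtcars ships with R (datasets package), so no extra packages are needed
y  <- mtcars$mpg   # response: miles per gallon
x1 <- mtcars$hp    # first predictor: gross horsepower
x2 <- mtcars$wt    # second predictor: weight (1000 lbs)

# Combine the three variables into one data frame
set1 <- data.frame(y, x1, x2)
head(set1)
```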
We check the summary statistics (mean, median, and standard
deviation) for the response variable and two predictor variables using
the round() function. We round each statistic to the number of decimal
places used to record that variable (1 for mpg, 0 for hp, and 3 for wt).
The helper function multi.fun() is defined for convenience.
multi.fun <- function(x) {
c(mean = mean(x), media=median(x), sd=sd(x))
}
round(multi.fun(set1$y), 1)
#> mean media sd
#> 20.1 19.2 6.0
round(multi.fun(set1$x1), 0)
#> mean media sd
#> 147 123 69
round(multi.fun(set1$x2), 3)
#> mean media sd
#> 3.217 3.325 0.978
We fit a linear model to the data set using the function lm() and
check that both predictor variables are statistically significant
(p-value < 0.05).
summary(lm(y ~ x1 + x2, data=set1))
#>
#> Call:
#> lm(formula = y ~ x1 + x2, data = set1)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.941 -1.600 -0.182 1.050 5.854
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
#> x1 -0.03177 0.00903 -3.519 0.00145 **
#> x2 -3.87783 0.63273 -6.129 1.12e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.593 on 29 degrees of freedom
#> Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
#> F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
We set the function arguments of genset() to generate a new data set
(set2) in which the first predictor variable, hp, is no longer
statistically significant, using method 2. We use the function
set.seed() so that the data set can be reproduced.
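A call matching this description might look like the sketch below. The argument names follow the usage shown earlier for genset(); the seed value 101 is an arbitrary assumption, so the generated values (and whether the default number of iterations is enough) may differ from the output shown in this example.

```r
# Inputs as in set1: mpg as the response, hp and wt as predictors
y  <- mtcars$mpg
x1 <- mtcars$hp
x2 <- mtcars$wt

# Assumes the genset package is installed; the call is skipped otherwise
if (requireNamespace("genset", quietly = TRUE)) {
  set.seed(101)  # arbitrary seed, chosen only for illustration
  set2 <- genset::genset(y = y, x1 = x1, x2 = x2,
                         method = 2, option = "x1", n = 2000)
  head(set2)
}
```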
Check that the summary statistics for Set 2 are the same as those for Set 1 above.
round(multi.fun(set2$y), 1)
#> mean media sd
#> 20.1 19.2 6.0
round(multi.fun(set2$x1), 0)
#> mean media sd
#> 147 123 69
round(multi.fun(set2$x2), 3)
#> mean media sd
#> 3.217 3.325 0.978
Fit a linear model to Set 2 and check that the first predictor
variable, hp, is no longer statistically significant.
summary(lm(y ~ x1 + x2, data=set2))
#>
#> Call:
#> lm(formula = y ~ x1 + x2, data = set2)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -9.0219 -4.0385 -0.2858 3.8869 11.7938
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 14.34968 3.75108 3.825 0.000641 ***
#> x1 -0.01931 0.01499 -1.289 0.207666
#> x2 2.66500 1.05010 2.538 0.016785 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5.59 on 29 degrees of freedom
#> Multiple R-squared: 0.1951, Adjusted R-squared: 0.1396
#> F-statistic: 3.515 on 2 and 29 DF, p-value: 0.04297
This time we will use a categorical predictor variable, the engine
type vs, where 0 is V-shaped and 1 is straight. We will use the same
response variable mpg and the continuous predictor variable wt, which is
assigned to x1; the categorical (factor) variable is assigned to x2.
Combine the three variables in a data frame called set3.
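One way to build this data frame is sketched below, using the same naming convention as before (y, x1, x2).

```r
y  <- mtcars$mpg   # response: miles per gallon
x1 <- mtcars$wt    # continuous predictor: weight (1000 lbs)
x2 <- mtcars$vs    # categorical predictor: 0 = V-shaped, 1 = straight

# Combine the three variables into one data frame
set3 <- data.frame(y, x1, x2)

# Number of cars at each level of the categorical variable
table(set3$x2)
```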
Since we have a categorical predictor variable, we need to subset the
data. Then we can check the summary statistics (mean, median, and
standard deviation) for the response variable and predictor variable in
terms of the categorical variable (i.e., the marginal distributions for
vs). We round each statistic to the number of decimal places used to
record that variable.
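The subsetting step can be sketched with base R's subset(). The names v.shape and straight match those used in the summary calls that follow; set3 is rebuilt here only so the snippet runs on its own.

```r
# Rebuild set3 so that this snippet is self-contained
set3 <- data.frame(y = mtcars$mpg, x1 = mtcars$wt, x2 = mtcars$vs)

# One data frame per level of the categorical variable
v.shape  <- subset(set3, x2 == 0)   # V-shaped engines
straight <- subset(set3, x2 == 1)   # straight engines

nrow(v.shape)
nrow(straight)
```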
multi.fun <- function(x) {
c(mean = mean(x), media=median(x), sd=sd(x))
}
round(multi.fun(v.shape$y), 1)
#> mean media sd
#> 16.6 15.7 3.9
round(multi.fun(v.shape$x1), 3)
#> mean media sd
#> 3.689 3.570 0.904
round(multi.fun(straight$y), 1)
#> mean media sd
#> 24.6 22.8 5.4
round(multi.fun(straight$x1), 3)
#> mean media sd
#> 2.611 2.622 0.715
We fit a linear model to the data set using the function lm() and
check that both predictor variables are statistically significant.
summary(lm(y ~ x1 + factor(x2), data=set3))
#>
#> Call:
#> lm(formula = y ~ x1 + factor(x2), data = set3)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.7071 -2.4415 -0.3129 1.4319 6.0156
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 33.0042 2.3554 14.012 1.92e-14 ***
#> x1 -4.4428 0.6134 -7.243 5.63e-08 ***
#> factor(x2)1 3.1544 1.1907 2.649 0.0129 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.78 on 29 degrees of freedom
#> Multiple R-squared: 0.801, Adjusted R-squared: 0.7873
#> F-statistic: 58.36 on 2 and 29 DF, p-value: 6.818e-11
We set the function arguments of genset() to generate a new data set
(set4) in which the second predictor variable, vs, is no longer
statistically significant, using method 2. We use the function
set.seed() so that the data set can be reproduced. Note that factor(x2)
must be passed as the x2 argument when the variable is categorical.
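A sketch of this call, followed by the same per-level split used for Set 3, is given below. The seed value is an arbitrary assumption, and the column names y, x1, and x2 in the returned data frame are inferred from how set4 is used in the model fit later, so treat both as illustrative.

```r
# Inputs as in set3: mpg as the response, wt and vs as predictors
y  <- mtcars$mpg
x1 <- mtcars$wt
x2 <- mtcars$vs

# Assumes the genset package is installed; the call is skipped otherwise
if (requireNamespace("genset", quietly = TRUE)) {
  set.seed(101)  # arbitrary seed, chosen only for illustration
  set4 <- genset::genset(y = y, x1 = x1, x2 = factor(x2),
                         method = 2, option = "x2", n = 2000)

  # Split the generated data by engine type for the per-level summaries
  v.shape  <- subset(set4, x2 == 0)
  straight <- subset(set4, x2 == 1)
}
```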
Check that the summary statistics for the marginal distributions of Set 4 match those of Set 3 above to within the tolerance level.
multi.fun <- function(x) {
c(mean = mean(x), media=median(x), sd=sd(x))
}
round(multi.fun(v.shape$y), 1)
#> mean media sd
#> 16.7 15.7 4.3
round(multi.fun(v.shape$x1), 3)
#> mean media sd
#> 3.757 3.570 0.842
round(multi.fun(straight$y), 1)
#> mean media sd
#> 24.8 22.8 5.8
round(multi.fun(straight$x1), 3)
#> mean media sd
#> 2.577 2.622 0.653
Fit a linear model to Set 4 and check that the second predictor
variable, vs, is no longer statistically significant.
summary(lm(y ~ x1 + factor(x2), data=set4))
#>
#> Call:
#> lm(formula = y ~ x1 + factor(x2), data = set4)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.5841 -2.0828 -0.1872 1.5084 8.3522
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 35.2778 3.0511 11.562 2.22e-12 ***
#> x1 -4.9520 0.7854 -6.305 6.92e-07 ***
#> factor(x2)1 2.2566 1.4950 1.509 0.142
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.292 on 29 degrees of freedom
#> Multiple R-squared: 0.7509, Adjusted R-squared: 0.7337
#> F-statistic: 43.71 on 2 and 29 DF, p-value: 1.769e-09
Murray, L. and Wilson, J. (2020). The Need for Regression: Generating Multiple Data Sets with Identical Summary Statistics but Differing Conclusions. Decision Sciences Journal of Innovative Education. Accepted for publication.