Introduction

This package was developed for educational purposes to demonstrate the importance of multiple regression. From an initial data set, genset generates a new data set that has the same summary statistics (mean, median, and standard deviation) but opposing regression results. The initial data set must have one continuous response variable and two predictor variables (either both continuous, or one continuous and one categorical with 2 levels) that are statistically significant in a linear regression model such as Y = Xβ + ϵ.
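
To make the model notation concrete, the following sketch fits Y = Xβ + ϵ with two predictors on the built-in mtcars data and recovers the same coefficients from the normal equations. It uses base R only; the names X and beta_hat are introduced here for illustration and are not part of genset.

```r
# Response vector and design matrix (intercept column plus two predictors).
y <- mtcars$mpg
X <- model.matrix(~ hp + wt, data = mtcars)

# Least-squares estimate from the normal equations: (X'X) beta = X'y.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# The same estimate as computed by lm().
fit <- lm(mpg ~ hp + wt, data = mtcars)
cbind(normal_equations = beta_hat[, 1], lm = coef(fit))
```

The two columns should agree up to floating-point error, which is all that the matrix form of the model is saying.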

Functions

Use the following function call if your data set consists of 2 predictor variables (both continuous):

genset(y=y, x1=x1, x2=x2, method=1, option="x1", n=n)

Use the following function call if your data set consists of 2 predictor variables (1 continuous and 1 categorical with 2 levels):

genset(y=y, x1=x1, x2=factor(x2), method=1, option="x1", n=n)

Arguments

y response variable (continuous).
x1 first predictor variable (continuous).
x2 second predictor variable (continuous or categorical with 2 levels). If the variable is categorical, pass it as factor(x2).
method the method (1 or 2) used to generate the data set. 1 (default) rearranges the values within each variable, and 2 is a perturbation method that makes subtle changes to the values of the variables.
option the variable(s) that will not be statistically significant in the new data set ("x1", "x2", or "both").
n the number of iterations.
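
The description of method 1 rests on a simple fact: rearranging the values of a variable leaves its mean, median, and standard deviation unchanged while breaking its pairing with the other variables. The sketch below illustrates that idea with base R's sample(). It is not the package's actual algorithm, and stats() and x1_shuffled are names introduced here.

```r
# Helper returning the three summary statistics of a numeric vector.
stats <- function(v) c(mean = mean(v), median = median(v), sd = sd(v))

set.seed(1)
x1 <- mtcars$hp
x1_shuffled <- sample(x1)  # same values, new order

# Summary statistics are unchanged (up to floating-point error).
all.equal(stats(x1), stats(x1_shuffled))

# The relationship with the response is generally not preserved.
cor(mtcars$mpg, x1)
cor(mtcars$mpg, x1_shuffled)
```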

Details

The summary statistics of the generated data set are within a predetermined tolerance of the originals and, when rounded, equal those of the original data set. Statistical significance is judged at the conventional 0.05 level. The default number of iterations is n=2000; fewer iterations may or may not be sufficient, depending on the initial data set.
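
How the acceptance criteria and the iteration limit interact can be sketched as a simple search loop. This is a hypothetical illustration under stated assumptions, not the code inside genset(): it only rearranges x1, and the names same_when_rounded() and find_candidate() are invented for this example.

```r
stats <- function(v) c(mean = mean(v), median = median(v), sd = sd(v))

# TRUE when two vectors have the same summary statistics after rounding.
same_when_rounded <- function(a, b, digits) {
  isTRUE(all.equal(round(stats(a), digits), round(stats(b), digits)))
}

# Hypothetical search: try up to n rearrangements of x1 and stop at the first
# one where x1 is no longer significant at level alpha but x2 still is.
find_candidate <- function(y, x1, x2, n = 2000, alpha = 0.05, digits = 0) {
  for (i in seq_len(n)) {
    x1_new <- sample(x1)
    p <- summary(lm(y ~ x1_new + x2))$coefficients[, "Pr(>|t|)"]
    if (same_when_rounded(x1, x1_new, digits) &&
        p[["x1_new"]] >= alpha && p[["x2"]] < alpha) {
      return(list(iterations = i, x1 = x1_new, p_values = p))
    }
  }
  NULL  # no acceptable candidate within n iterations
}

set.seed(123)
res <- find_candidate(mtcars$mpg, mtcars$hp, mtcars$wt)
```

A NULL result means the iteration budget ran out, which is the situation in which a larger n would be worth trying.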

Example 1: Two Continuous Predictor Variables

Load the genset library:

library(genset)

We will use the built-in data set mtcars to illustrate how to generate a new data set. Details about the data set can be found by typing ?mtcars. We set the variable mpg as the response variable y, and hp and wt as the two continuous predictor variables (x1 and x2). Then we combine the variables into a data frame called set1.

y <- mtcars$mpg
x1 <- mtcars$hp
x2 <- mtcars$wt

set1 <- data.frame(y, x1, x2)

We check the summary statistics (mean, median, and standard deviation) for the response variable and the two predictor variables, rounding each statistic with round() to the precision at which that variable is recorded. The helper function multi.fun() is defined for convenience.

multi.fun <- function(x) {
  c(mean = mean(x), media=median(x), sd=sd(x))
}
round(multi.fun(set1$y), 1)
#>  mean media    sd 
#>  20.1  19.2   6.0
round(multi.fun(set1$x1), 0)
#>  mean media    sd 
#>   147   123    69
round(multi.fun(set1$x2), 3)
#>  mean media    sd 
#> 3.217 3.325 0.978
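
As a more compact alternative, the same helper can be applied to every column at once with sapply(). This is a minimal sketch using base R only; all_stats is a name introduced here, and a single rounding precision is used for all columns for brevity.

```r
multi.fun <- function(x) {
  c(mean = mean(x), media = median(x), sd = sd(x))
}

set1 <- data.frame(y = mtcars$mpg, x1 = mtcars$hp, x2 = mtcars$wt)

# One column of statistics per variable, one row per statistic.
all_stats <- sapply(set1, multi.fun)
round(all_stats, 3)
```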

We fit a linear model to the data set using the function lm() and check to see that both predictor variables are statistically significant (p-value < 0.05).

summary(lm(y ~ x1 + x2, data=set1))
#> 
#> Call:
#> lm(formula = y ~ x1 + x2, data = set1)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -3.941 -1.600 -0.182  1.050  5.854 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
#> x1          -0.03177    0.00903  -3.519  0.00145 ** 
#> x2          -3.87783    0.63273  -6.129 1.12e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.593 on 29 degrees of freedom
#> Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
#> F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
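
Rather than reading the p-values off the printed summary, they can be extracted from the coefficient table and compared against 0.05 directly. This is a small base R sketch for the two-predictor model; the name p_values is introduced here.

```r
set1 <- data.frame(y = mtcars$mpg, x1 = mtcars$hp, x2 = mtcars$wt)
fit <- lm(y ~ x1 + x2, data = set1)

# The coefficient table is a matrix; the last column holds the p-values.
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]

# TRUE for each predictor that is significant at the 0.05 level.
p_values[c("x1", "x2")] < 0.05
```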

We set the arguments of genset() to generate a new data set (set2) in which the first predictor variable, hp, is no longer statistically significant, using method 2. We call set.seed() first so that the data set can be reproduced.

set.seed(123)
set2 <- genset(y, x1, x2, method=2, option="x1")

Check how closely the summary statistics for Set 2 match those of Set 1 above.

round(multi.fun(set2$y), 1)
#>  mean media    sd 
#>  20.2  19.1   6.4
round(multi.fun(set2$x1), 0)
#>  mean media    sd 
#>   148   124    72
round(multi.fun(set2$x2), 3)
#>  mean media    sd 
#> 3.209 3.325 1.029
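
The comparison can also be made numerically by taking the difference of the two sets of statistics. The sketch below assumes both data frames have the same numeric columns; compare_stats() is a hypothetical helper, and stand_in is a shuffled copy of set1 used only so the example runs without genset (replace it with the generated data set in practice).

```r
multi.fun <- function(x) {
  c(mean = mean(x), media = median(x), sd = sd(x))
}

# Absolute differences in summary statistics, column by column.
compare_stats <- function(a, b) {
  abs(sapply(a, multi.fun) - sapply(b, multi.fun))
}

set1 <- data.frame(y = mtcars$mpg, x1 = mtcars$hp, x2 = mtcars$wt)

# Stand-in for a generated data set (an assumption for this sketch).
set.seed(123)
stand_in <- transform(set1, x1 = sample(x1))

round(compare_stats(set1, stand_in), 3)
```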

Fit a linear model to Set 2 and check to see that the first predictor variable hp is no longer statistically significant.

summary(lm(y ~ x1 + x2, data=set2))
#> 
#> Call:
#> lm(formula = y ~ x1 + x2, data = set2)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -9.1999 -3.8138 -0.2526  3.1098 12.7743 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 32.63578    3.17123  10.291 3.43e-11 ***
#> x1          -0.01825    0.01421  -1.284  0.20933    
#> x2          -3.04537    0.99538  -3.060  0.00474 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 5.22 on 29 degrees of freedom
#> Multiple R-squared:  0.3687, Adjusted R-squared:  0.3251 
#> F-statistic: 8.468 on 2 and 29 DF,  p-value: 0.001269
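
An equivalent way to read the result is through confidence intervals: a coefficient is not significant at the 0.05 level exactly when its 95% confidence interval contains zero. The sketch below checks this with confint() on a stand-in data set (a shuffled copy of set1, used here only so the example runs without genset).

```r
set1 <- data.frame(y = mtcars$mpg, x1 = mtcars$hp, x2 = mtcars$wt)

# Stand-in for a generated data set (an assumption for this sketch).
set.seed(123)
stand_in <- transform(set1, x1 = sample(x1))

fit <- lm(y ~ x1 + x2, data = stand_in)
ci <- confint(fit, level = 0.95)
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]

# A coefficient's interval contains zero when its bounds straddle zero.
contains_zero <- ci[, 1] <= 0 & ci[, 2] >= 0
data.frame(ci, p_value = p_values, contains_zero)
```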

We can compare the plots for data Set 1 and 2:

plot(set1)
plot(set2)

Example 2: Two Predictor Variables (1 Continuous, 1 Categorical)

This time we will use a categorical predictor variable, the engine shape vs, where 0 is V-shaped and 1 is straight. We keep mpg as the response variable y and use wt as the continuous predictor x1; the categorical (factor) variable vs is assigned to x2. We combine the three variables in a data frame called set3.

y <- mtcars$mpg
x1 <- mtcars$wt
x2 <- mtcars$vs

set3 <- data.frame(y, x1, x2)

Since we have a categorical predictor variable, we need to subset the data. Then we can check the summary statistics (mean, median, and standard deviation) for the response variable and the continuous predictor variable within each level of the categorical variable (i.e., the distributions conditional on vs). We round each statistic to the precision at which that variable is recorded.

v.shape <- subset(set3, x2==0)
straight <- subset(set3, x2==1)

multi.fun <- function(x) {
  c(mean = mean(x), media=median(x), sd=sd(x))
}
round(multi.fun(v.shape$y), 1)
#>  mean media    sd 
#>  16.6  15.7   3.9
round(multi.fun(v.shape$x1), 3)
#>  mean media    sd 
#> 3.689 3.570 0.904
round(multi.fun(straight$y), 1)
#>  mean media    sd 
#>  24.6  22.8   5.4
round(multi.fun(straight$x1), 3)
#>  mean media    sd 
#> 2.611 2.622 0.715
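
The per-level statistics can also be computed in one step by splitting the data frame on the categorical variable. This is a base R sketch; by_level is a name introduced here, and the list elements "0" and "1" correspond to V-shaped and straight engines.

```r
multi.fun <- function(x) {
  c(mean = mean(x), media = median(x), sd = sd(x))
}

set3 <- data.frame(y = mtcars$mpg, x1 = mtcars$wt, x2 = mtcars$vs)

# Split the numeric columns by level of x2, then summarize each piece.
by_level <- lapply(split(set3[c("y", "x1")], set3$x2),
                   function(d) sapply(d, multi.fun))
lapply(by_level, round, 3)
```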

We fit a linear model to the data set using the function lm() and check to see that both predictor variables are statistically significant.

summary(lm(y ~ x1 + factor(x2), data=set3))
#> 
#> Call:
#> lm(formula = y ~ x1 + factor(x2), data = set3)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.7071 -2.4415 -0.3129  1.4319  6.0156 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  33.0042     2.3554  14.012 1.92e-14 ***
#> x1           -4.4428     0.6134  -7.243 5.63e-08 ***
#> factor(x2)1   3.1544     1.1907   2.649   0.0129 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.78 on 29 degrees of freedom
#> Multiple R-squared:  0.801,  Adjusted R-squared:  0.7873 
#> F-statistic: 58.36 on 2 and 29 DF,  p-value: 6.818e-11
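
The coefficient labelled factor(x2)1 is the estimated difference in mpg between straight (1) and V-shaped (0) engines at the same weight. The sketch below shows the indicator column that lm() builds for the factor. Because vs is already coded 0/1, treating it as numeric gives the same estimates; that equivalence holds only for this 0/1 coding.

```r
set3 <- data.frame(y = mtcars$mpg, x1 = mtcars$wt, x2 = mtcars$vs)

# Design matrix: the factor becomes a 0/1 indicator for level "1".
X <- model.matrix(~ x1 + factor(x2), data = set3)
head(X)

# The indicator column matches the original 0/1 values.
all(X[, "factor(x2)1"] == set3$x2)

# Same coefficient estimates with the factor or the numeric 0/1 variable.
fit_factor <- lm(y ~ x1 + factor(x2), data = set3)
fit_numeric <- lm(y ~ x1 + x2, data = set3)
all.equal(unname(coef(fit_factor)), unname(coef(fit_numeric)))
```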

We set the arguments of genset() to generate a new data set (set4) in which the second predictor variable, vs, is no longer statistically significant, using method 2. We call set.seed() first so that the data set can be reproduced. Note that factor(x2) must be passed as the third argument when the variable is categorical.

set.seed(123)
set4 <- genset(y, x1, factor(x2), method=2, option="x2")

Check how closely the summary statistics within each level of vs for Set 4 match those of Set 3 above.

v.shape <- subset(set4, x2==0)
straight <- subset(set4, x2==1)

multi.fun <- function(x) {
  c(mean = mean(x), media=median(x), sd=sd(x))
}
round(multi.fun(v.shape$y), 1)
#>  mean media    sd 
#>  16.7  15.7   4.3
round(multi.fun(v.shape$x1), 3)
#>  mean media    sd 
#> 3.757 3.570 0.842
round(multi.fun(straight$y), 1)
#>  mean media    sd 
#>  24.8  22.8   5.8
round(multi.fun(straight$x1), 3)
#>  mean media    sd 
#> 2.577 2.622 0.653

Fit a linear model to Set 4 and check to see that the second predictor variable vs is no longer statistically significant.

summary(lm(y ~ x1 + factor(x2), data=set4))
#> 
#> Call:
#> lm(formula = y ~ x1 + factor(x2), data = set4)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.5841 -2.0828 -0.1872  1.5084  8.3522 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  35.2778     3.0511  11.562 2.22e-12 ***
#> x1           -4.9520     0.7854  -6.305 6.92e-07 ***
#> factor(x2)1   2.2566     1.4950   1.509    0.142    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.292 on 29 degrees of freedom
#> Multiple R-squared:  0.7509, Adjusted R-squared:  0.7337 
#> F-statistic: 43.71 on 2 and 29 DF,  p-value: 1.769e-09

We can compare the plots for data Set 3 and 4:

plot(set3)
plot(set4)

References

L. Murray & J. Wilson.