Title: | Weighting and Weighted Statistics |
---|---|
Description: | Provides a variety of functions for producing simple weighted statistics, such as weighted Pearson's correlations, partial correlations, Chi-Squared statistics, histograms, and t-tests. Also now includes some software for quickly recoding survey data and plotting estimates from interaction terms in regressions (and multiply imputed regressions) both with and without weights. NOTE: Weighted partial correlation calculations pulled to address a bug. |
Authors: | Josh Pasek [aut, cre], with some assistance from Alex Tahk and some code modified from R-core; Additional contributions by Gene Culter and Marcus Schwemmle. |
Maintainer: | Josh Pasek <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.4 |
Built: | 2024-11-18 06:51:49 UTC |
Source: | CRAN |
A dataset containing demographic data from the 2004 American National Election Studies. The data include 5 variables: "female" (A Logical Variable Indicating Sex), "age" (Numerically Coded, Ranging From 18 to a Topcode of 90), "educats" (5 Education Categories corresponding to 1-Less than A High School Degree, 2-High School Gradutate, 3-Some College, 4-College Graduate, 5-Post College Education), "racecats" (6 Racial Categories), and "married" (A Logical Variable Indicating the Respondent's Marital Status, with one point of missing data). Dataset is designed show how production of survey weights works in practice.
data(anes04)
data(anes04)
The format is: chr "anes04"
http://www.electionstudies.org
dummify
creates a matrix with columns signifying separate dummy variables for each level of a factor. The column names are the former levels of the factor.
dummify(x, show.na=FALSE, keep.na=FALSE)
dummify(x, show.na=FALSE, keep.na=FALSE)
x |
|
show.na |
If |
keep.na |
If |
dummify
returns a matrix with a number of rows equal to the length of x
and a number of columns equal to the number of levels of x
.
Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com).
data("anes04") anes04$agecats <- cut(anes04$age, c(17, 25,35,45,55,65, 99)) levels(anes04$agecats) <- c("age1824", "age2534", "age3544", "age4554", "age5564", "age6599") agedums <- dummify(anes04$agecats) table(anes04$agecats) summary(agedums)
data("anes04") anes04$agecats <- cut(anes04$age, c(17, 25,35,45,55,65, 99)) levels(anes04$agecats) <- c("age1824", "age2534", "age3544", "age4554", "age5564", "age6599") agedums <- dummify(anes04$agecats) table(anes04$agecats) summary(agedums)
nalevs
takes as an input any vector and recodes it to range from 0 to 1, to treat specified levels as missing, to treat specified levels as 0, 1, .5, or the mean (weighted or unweighted) of the levels present after coding.
nalevs(x, naset=NULL, setmid=NULL, set1=NULL, set0=NULL, setmean=NULL, weight=NULL)
nalevs(x, naset=NULL, setmid=NULL, set1=NULL, set0=NULL, setmean=NULL, weight=NULL)
x |
A vector to be recoded to range from 0 to 1. |
naset |
A vector of values of |
setmid |
A vector of values of |
set1 |
A vector of values of |
set0 |
A vector of values of |
setmean |
A vector of values of |
weight |
A vector of weights for |
A vector of length equal to that of x
of class numeric
.
Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com).
data(anes04) summary(anes04$age) summary(nalevs(anes04$age)) table(anes04$educcats) table(nalevs(anes04$educcats, naset=c(2, 4)))
data(anes04) summary(anes04$age) summary(nalevs(anes04$age)) table(anes04$educcats) table(nalevs(anes04$educcats, naset=c(2, 4)))
plotwtdinteraction
produces a plot from a regression object to
illustrate a two- or three-way interaction for a prototypical individual
holding constant all other variables (or other counterfactuals,
depending on type). Prototypical individual is identified as the mean (numeric), median (ordinal), and/or modal (factors and logical variables) values for all measures. Standard errors are illustrated with polygons by default.
findwtdinteraction
generates a table of point estimates from a regression object to illustrate a two- or three-way interaction for a prototypical individual holding constant all other variables. Prototypical individual is identified as the mean (numeric), median (ordinal), and/or modal (factors and logical variables) values for all measures. Standard errors are illustrated with polygons by default.
plotinteractpreds
plots an object from findwtdinteraction
.
These functions are known to be compatible with lm
,
glm
, as well as multiply imputed lm
and
glm
data generated with the mice
package. They are also compatible with gam
and
bam
regressions from the mgcv package under default.
ordinal regressions (polr) and multinomial regressions (multinom) do not currently support standard errors, additional methods are still being added.
*Note, this set of functions is still in beta, please let me know if you run into any bugs when using it.*
**Important: If you are using a regression output from a multiply imputed dataset with a continuous variable as an interacting term, you should always specify the levels (acrosslevs, bylevs, or atlevs) for the variable, as imputations can change the set of levels that are available and thus can make the point estimates across imputed datasets incompatible with one-another.**
plotwtdinteraction(x, across, by=NULL, at=NULL, acrosslevs=NULL, bylevs=NULL, atlevs=NULL, weight=NULL, dvname=NULL, acclevnames=NULL, bylevnames=NULL, atlevnames=NULL, stdzacross=FALSE, stdzby=FALSE, stdzat=FALSE, limitlevs=20, type="response", seplot=TRUE, ylim=NULL, main=NULL, xlab=NULL, ylab=NULL, legend=TRUE, placement="bottomright", lwd=3, add=FALSE, addby = TRUE, addat=FALSE, mfrow=NULL, linecol=NULL, secol=NULL, showbynamelegend=FALSE, showatnamelegend=FALSE, showoutnamelegend = FALSE, lty=NULL, density=30, startangle=45, approach="prototypical", data=NULL, nsim=100, ...) findwtdinteraction(x, across, by=NULL, at=NULL, acrosslevs=NULL, bylevs=NULL, atlevs=NULL, weight=NULL, dvname=NULL, acclevnames=NULL, bylevnames=NULL, atlevnames=NULL, stdzacross=FALSE, stdzby=FALSE, stdzat=FALSE, limitlevs=20, type="response", approach="prototypical", data=NULL, nsim=100) plotinteractpreds(out, seplot=TRUE, ylim=NULL, main=NULL, xlab=NULL, ylab=NULL, legend=TRUE, placement="bottomright", lwd=3, add=FALSE, addby = TRUE, addat=FALSE, mfrow=NULL, linecol=NULL, secol=NULL, showbynamelegend=FALSE, showatnamelegend=FALSE, showoutnamelegend = FALSE, lty=NULL, density=30, startangle=45, ...)
plotwtdinteraction(x, across, by=NULL, at=NULL, acrosslevs=NULL, bylevs=NULL, atlevs=NULL, weight=NULL, dvname=NULL, acclevnames=NULL, bylevnames=NULL, atlevnames=NULL, stdzacross=FALSE, stdzby=FALSE, stdzat=FALSE, limitlevs=20, type="response", seplot=TRUE, ylim=NULL, main=NULL, xlab=NULL, ylab=NULL, legend=TRUE, placement="bottomright", lwd=3, add=FALSE, addby = TRUE, addat=FALSE, mfrow=NULL, linecol=NULL, secol=NULL, showbynamelegend=FALSE, showatnamelegend=FALSE, showoutnamelegend = FALSE, lty=NULL, density=30, startangle=45, approach="prototypical", data=NULL, nsim=100, ...) findwtdinteraction(x, across, by=NULL, at=NULL, acrosslevs=NULL, bylevs=NULL, atlevs=NULL, weight=NULL, dvname=NULL, acclevnames=NULL, bylevnames=NULL, atlevnames=NULL, stdzacross=FALSE, stdzby=FALSE, stdzat=FALSE, limitlevs=20, type="response", approach="prototypical", data=NULL, nsim=100) plotinteractpreds(out, seplot=TRUE, ylim=NULL, main=NULL, xlab=NULL, ylab=NULL, legend=TRUE, placement="bottomright", lwd=3, add=FALSE, addby = TRUE, addat=FALSE, mfrow=NULL, linecol=NULL, secol=NULL, showbynamelegend=FALSE, showatnamelegend=FALSE, showoutnamelegend = FALSE, lty=NULL, density=30, startangle=45, ...)
x |
|
out |
|
across |
|
by |
|
at |
|
acrosslevs |
|
bylevs |
|
atlevs |
|
weight |
|
dvname |
|
acclevnames |
|
bylevnames |
|
atlevnames |
|
stdzacross |
|
stdzby |
|
stdzat |
|
limitlevs |
|
type |
|
seplot |
|
ylim |
|
main |
|
xlab |
|
ylab |
|
legend |
|
placement |
|
lwd |
|
add |
|
addby |
|
addat |
|
mfrow |
|
linecol |
|
secol |
|
showbynamelegend |
|
showatnamelegend |
|
showoutnamelegend |
|
lty |
|
density |
|
startangle |
|
approach |
|
data |
|
nsim |
|
... |
|
A table or figure illustrating the predicted values of the dependent variable across levels of the independent variables for a prototypical respondent.
Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com).
Rounds numbers to text and drops leading zeros in the process.
rd(x, digits=2, add=TRUE, max=(digits+3))
rd(x, digits=2, add=TRUE, max=(digits+3))
x |
A vector of values to be rounded (must be numeric). |
digits |
The number of digits to round to (must be an integer). |
add |
An optional dichotomous indicator for whether additional digits should be added if no numbers appear in pre-set digit level. |
max |
Maximum number of digits to be shown if |
A vector of length equal to that of x
of class character
.
Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com).
rd(seq(0, 1, by=.1))
rd(seq(0, 1, by=.1))
Recodes p values to stars for use in tables.
starmaker(x, p.levels=c(.001, .01, .05, .1), symbols=c("***", "**", "*", "+"))
starmaker(x, p.levels=c(.001, .01, .05, .1), symbols=c("***", "**", "*", "+"))
x |
A vector of p values to be turned into stars (must be numeric). |
p.levels |
A vector of the maximum p value for each symbol used (p<p.level). |
symbols |
A vector of the symbols to be displayed for each p value. |
A vector of length equal to that of x
of class character
.
Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com).
starmaker(seq(0, .15, by=.01)) cbind(p=seq(0, .15, by=.01), star=starmaker(seq(0, .15, by=.01)))
starmaker(seq(0, .15, by=.01)) cbind(p=seq(0, .15, by=.01), star=starmaker(seq(0, .15, by=.01)))
stdz
produces a standardized copy of any input variable. It can also standardize a weighted variable to produce a copy of the original variable standardized around its weighted mean and variance.
stdz(x, weight=NULL)
stdz(x, weight=NULL)
x |
|
weight |
|
A vector of length equal to x with a (weighted) mean of zero and a (weighted) standard deviation of 1.
Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com).
test <- c(1,1,1,1,1,1,2,2,2,3,3,3,4,4) weight <- c(.5,.5,.5,.5,.5,1,1,1,1,2,2,2,2,2) summary(stdz(test)) summary(stdz(test, weight)) wtd.mean(stdz(test, weight), weight) wtd.var(stdz(test, weight), weight)
test <- c(1,1,1,1,1,1,2,2,2,3,3,3,4,4) weight <- c(.5,.5,.5,.5,.5,1,1,1,1,2,2,2,2,2) summary(stdz(test)) summary(stdz(test, weight)) wtd.mean(stdz(test, weight), weight) wtd.var(stdz(test, weight), weight)
wpct
produces a weighted table of the proportion of data in each category for any variable. This is simply a weighted frequency table divided by its sum.
wpct(x, weight=NULL, na.rm=TRUE, ...)
wpct(x, weight=NULL, na.rm=TRUE, ...)
x |
|
weight |
|
na.rm |
If |
... |
|
A table object of length equal to the number of separate values of x
.
Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com).
test <- c(1,1,1,1,1,1,2,2,2,3,3,3,4,4) weight <- c(.5,.5,.5,.5,.5,1,1,1,1,2,2,2,2,2) wpct(test) wpct(test, weight)
test <- c(1,1,1,1,1,1,2,2,2,3,3,3,4,4) weight <- c(.5,.5,.5,.5,.5,1,1,1,1,2,2,2,2,2) wpct(test) wpct(test, weight)
wtd.chi.sq
produces weighted chi-squared tests for two- and three-variable contingency tables. Decomposes parts of three-variable contingency tables as well. Note that weights run with the default parameters here treat the weights as an estimate of the precision of the information. A prior version of this software was set to default to mean1=FALSE
.
wtd.chi.sq(var1, var2, var3=NULL, weight=NULL, na.rm=TRUE, drop.missing.levels=TRUE, mean1=TRUE)
wtd.chi.sq(var1, var2, var3=NULL, weight=NULL, na.rm=TRUE, drop.missing.levels=TRUE, mean1=TRUE)
var1 |
|
var2 |
|
var3 |
|
weight |
|
na.rm |
|
drop.missing.levels |
|
mean1 |
|
A two-way chi-squared produces a vector including a single chi-squared value, degrees of freedom measure, and p-value for each analysis.
A three-way chi-squared produces a matrix with a single chi-squared value, degrees of freedom measure, and p-value for each of seven analyses. These include: (1) the values using a three-way contingency table, (2) the values for a two-way contingency table with each pair of variables, and (3) assessments for whether the relations between each pair of variables are significantly different across levels of the third variable.
Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com).
var1 <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3) var2 <- c(1,1,2,2,3,3,1,1,2,2,3,3,1,1,2) var3 <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3) weight <- c(.5,.5,.5,.5,.5,1,1,1,1,1,2,2,2,2,2) wtd.chi.sq(var1, var2) wtd.chi.sq(var1, var2, weight=weight) wtd.chi.sq(var1, var2, var3) wtd.chi.sq(var1, var2, var3, weight=weight)
var1 <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3) var2 <- c(1,1,2,2,3,3,1,1,2,2,3,3,1,1,2) var3 <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3) weight <- c(.5,.5,.5,.5,.5,1,1,1,1,1,2,2,2,2,2) wtd.chi.sq(var1, var2) wtd.chi.sq(var1, var2, weight=weight) wtd.chi.sq(var1, var2, var3) wtd.chi.sq(var1, var2, var3, weight=weight)
wtd.cors
function.
wtd.cor
produces a Pearsons correlation coefficient comparing two variables or matrices. Note that weights run with the default parameters here treat the weights as an estimate of the precision of the information. For survey data, users should run this code with bootstrapped standard errors bootse=TRUE
, which are robust to heteroskadesticity, although these will vary slightly each time the weights are run. A prior version of this software was set to default to mean1=FALSE
and bootse=FALSE
.
wtd.cor(x, y=NULL, weight=NULL, mean1=TRUE, collapse=TRUE, bootse=FALSE, bootp=FALSE, bootn=1000)
wtd.cor(x, y=NULL, weight=NULL, mean1=TRUE, collapse=TRUE, bootse=FALSE, bootp=FALSE, bootn=1000)
x |
|
y |
|
weight |
|
mean1 |
|
collapse |
|
bootse |
|
bootp |
|
bootn |
|
A list with matrices for the estimated correlation coefficient, the standard error on that correlation coefficient, the t-value for that correlation coefficient, and the p value for the significance of the correlation. If the list can be simplified, simplification will be done.
Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com).
wtd.cors
stdz
wtd.t.test
wtd.chi.sq
test <- c(1,1,1,1,1,1,2,2,2,3,3,3,4,4) t2 <- rev(test) weight <- c(.5,.5,.5,.5,.5,1,1,1,1,2,2,2,2,2) wtd.cor(test, t2) wtd.cor(test, t2, weight) wtd.cor(test, t2, weight, bootse=TRUE)
test <- c(1,1,1,1,1,1,2,2,2,3,3,3,4,4) t2 <- rev(test) weight <- c(.5,.5,.5,.5,.5,1,1,1,1,2,2,2,2,2) wtd.cor(test, t2) wtd.cor(test, t2, weight) wtd.cor(test, t2, weight, bootse=TRUE)
wtd.cors
produces a Pearsons correlation coefficient comparing two variables or matrices.
wtd.cors(x, y=NULL, weight=NULL)
wtd.cors(x, y=NULL, weight=NULL)
x |
|
y |
|
weight |
|
A matrix of the estimated correlation coefficients.
Marcus Schwemmle at GfK programmed the C code, R wrapper by Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com).
wtd.cor
stdz
wtd.t.test
wtd.chi.sq
test <- c(1,1,1,1,1,1,2,2,2,3,3,3,4,4) t2 <- rev(test) weight <- c(.5,.5,.5,.5,.5,1,1,1,1,2,2,2,2,2) wtd.cors(test, t2) wtd.cors(test, t2, weight)
test <- c(1,1,1,1,1,1,2,2,2,3,3,3,4,4) t2 <- rev(test) weight <- c(.5,.5,.5,.5,.5,1,1,1,1,2,2,2,2,2) wtd.cors(test, t2) wtd.cors(test, t2, weight)
Produces weighted histograms by adding a "weight" option to the his.default function from the graphics package (Copyright R-core). The code here was copied from that function and modified slightly to allow for weighted histograms as well as unweighted histograms. The generic function hist computes a histogram of the given data values. If plot=TRUE, the resulting object of class "histogram" is plotted by plot.histogram, before it is returned.
wtd.hist(x, breaks = "Sturges", freq = NULL, probability = !freq, include.lowest = TRUE, right = TRUE, density = NULL, angle = 45, col = NULL, border = NULL, main = paste("Histogram of" , xname), xlim = range(breaks), ylim = NULL, xlab = xname, ylab, axes = TRUE, plot = TRUE, labels = FALSE, nclass = NULL, weight = NULL, ...)
wtd.hist(x, breaks = "Sturges", freq = NULL, probability = !freq, include.lowest = TRUE, right = TRUE, density = NULL, angle = 45, col = NULL, border = NULL, main = paste("Histogram of" , xname), xlim = range(breaks), ylim = NULL, xlab = xname, ylab, axes = TRUE, plot = TRUE, labels = FALSE, nclass = NULL, weight = NULL, ...)
x |
a vector of values for which the histogram is desired. |
breaks |
one of:
In the last three cases the number is a suggestion only. |
freq |
logical; if |
probability |
an alias for |
include.lowest |
logical; if |
right |
logical; if |
density |
the density of shading lines, in lines per inch.
The default value of |
angle |
the slope of shading lines, given as an angle in degrees (counter-clockwise). |
col |
a colour to be used to fill the bars.
The default of |
border |
the color of the border around the bars. The default is to use the standard foreground color. |
main , xlab , ylab
|
these arguments to |
xlim , ylim
|
the range of x and y values with sensible defaults.
Note that |
axes |
logical. If |
plot |
logical. If |
labels |
logical or character. Additionally draw labels on top
of bars, if not |
nclass |
numeric (integer). For S(-PLUS) compatibility only,
|
weight |
numeric. Defines a set of weights to produce a weighted histogram. Will default to 1 for each case if no other weight is defined. |
... |
further arguments and graphical parameters passed to
|
The definition of histogram differs by source (with
country-specific biases). R's default with equi-spaced breaks (also
the default) is to plot the (weighted) counts in the cells defined by
breaks
. Thus the height of a rectangle is proportional to
the (weighted) number of points falling into the cell, as is the area
provided the breaks are equally-spaced.
The default with non-equi-spaced breaks is to give a plot of area one, in which the area of the rectangles is the fraction of the data points falling in the cells.
If right = TRUE
(default), the histogram cells are intervals
of the form (a, b]
, i.e., they include their right-hand endpoint,
but not their left one, with the exception of the first cell when
include.lowest
is TRUE
.
For right = FALSE
, the intervals are of the form [a, b)
,
and include.lowest
means ‘include highest’.
The default for breaks
is "Sturges"
: see
nclass.Sturges
. Other names for which algorithms
are supplied are "Scott"
and "FD"
/
"Freedman-Diaconis"
(with corresponding functions
nclass.scott
and nclass.FD
).
Case is ignored and partial matching is used.
Alternatively, a function can be supplied which
will compute the intended number of breaks as a function of x
.
an object of class "histogram"
which is a list with components:
breaks |
the |
counts |
|
density |
values for each bin such that the area under the histogram totals 1.
|
intensities |
same as |
mids |
the |
xname |
a character string with the actual |
equidist |
logical, indicating if the distances between
|
Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com) was responsible for the updates to the hist function necessary to implement weighted counts. The hist.default code from the graphics package on which the current function was based was written by R-core. All modifications are noted in code and the copyright for all original code remains with R-core.
var1 <- c(1:100) wgt <- var1/mean(var1) par(mfrow=c(2, 2)) wtd.hist(var1) wtd.hist(var1, weight=wgt) wtd.hist(var1, weight=var1)
var1 <- c(1:100) wgt <- var1/mean(var1) par(mfrow=c(2, 2)) wtd.hist(var1) wtd.hist(var1, weight=wgt) wtd.hist(var1, weight=var1)
wtd.t.test
produces either one- or two-sample t-tests comparing weighted data streams to one another. Note that weights run with the default parameters here treat the weights as an estimate of the precision of the information. For survey data, users should run this code with bootstrapped standard errors bootse=TRUE
, which are robust to heteroskadesticity, although these will vary slightly each time the weights are run. A prior version of this software was set to default to mean1=FALSE
and bootse=FALSE
.
wtd.t.test(x, y=0, weight=NULL, weighty=NULL, samedata=TRUE, alternative="two.tailed", mean1=TRUE, bootse=FALSE, bootp=FALSE, bootn=1000, drops="pairwise")
wtd.t.test(x, y=0, weight=NULL, weighty=NULL, samedata=TRUE, alternative="two.tailed", mean1=TRUE, bootse=FALSE, bootp=FALSE, bootn=1000, drops="pairwise")
x |
|
y |
|
weight |
|
weighty |
|
samedata |
|
alternative |
|
mean1 |
|
bootse |
|
bootp |
|
bootn |
|
drops |
|
A list element with an identifier for the test; coefficients for the t value, degrees of freedom, and p value of the t-test; and additional statistics of potential interest.
Josh Pasek, Assistant Professor of Communication Studies at the University of Michigan (www.joshpasek.com). Gene Culter added code for a one-tailed version of the test.
test <- c(1,1,1,1,1,1,2,2,2,3,3,3,4,4) t2 <- rev(test)+1 weight <- c(.5,.5,.5,.5,.5,1,1,1,1,2,2,2,2,2) wtd.t.test(test, t2) wtd.t.test(test, t2, weight) wtd.t.test(test, t2, weight, bootse=TRUE)
test <- c(1,1,1,1,1,1,2,2,2,3,3,3,4,4) t2 <- rev(test)+1 weight <- c(.5,.5,.5,.5,.5,1,1,1,1,2,2,2,2,2) wtd.t.test(test, t2) wtd.t.test(test, t2, weight) wtd.t.test(test, t2, weight, bootse=TRUE)