Title: | Methods and Examples from Introduction to the Theory of Observational Studies |
Description: | Supplements for a book, "iTOS" = "Introduction to the Theory of Observational Studies." Data sets are 'aHDL' from Rosenbaum (2023a) <doi:10.1111/biom.13558> and 'bingeM' from Rosenbaum (2023b) <doi:10.1111/biom.13921>. The function makematch() uses two-criteria matching from Zhang et al. (2023) <doi:10.1080/01621459.2021.1981337> to create the matched data 'bingeM' from 'binge'. The makematch() function also implements optimal matching (Rosenbaum (1989) <doi:10.2307/2290079>) and matching with fine or near-fine balance (Rosenbaum et al. (2007) <doi:10.1198/016214506000001059> and Yang et al (2012) <doi:10.1111/j.1541-0420.2011.01691.x>). The book makes use of two other R packages, 'weightedRank' and 'tightenBlock'. |
Authors: | Paul R. Rosenbaum [aut, cre] |
Maintainer: | Paul R. Rosenbaum <[email protected]> |
License: | GPL-2 |
Version: | 1.0.3 |
Built: | 2024-12-05 06:44:22 UTC |
Source: | CRAN |
The DESCRIPTION file:
Package: | iTOS |
Type: | Package |
Title: | Methods and Examples from Introduction to the Theory of Observational Studies |
Version: | 1.0.3 |
Authors@R: | person(given = c("Paul", "R."), family = "Rosenbaum", role = c("aut", "cre"), email = "[email protected]") |
Author: | Paul R. Rosenbaum [aut, cre] |
Maintainer: | Paul R. Rosenbaum <[email protected]> |
Description: | Supplements for a book, "iTOS" = "Introduction to the Theory of Observational Studies." Data sets are 'aHDL' from Rosenbaum (2023a) <doi:10.1111/biom.13558> and 'bingeM' from Rosenbaum (2023b) <doi:10.1111/biom.13921>. The function makematch() uses two-criteria matching from Zhang et al. (2023) <doi:10.1080/01621459.2021.1981337> to create the matched data 'bingeM' from 'binge'. The makematch() function also implements optimal matching (Rosenbaum (1989) <doi:10.2307/2290079>) and matching with fine or near-fine balance (Rosenbaum et al. (2007) <doi:10.1198/016214506000001059> and Yang et al (2012) <doi:10.1111/j.1541-0420.2011.01691.x>). The book makes use of two other R packages, 'weightedRank' and 'tightenBlock'. |
License: | GPL-2 |
Encoding: | UTF-8 |
LazyData: | true |
Imports: | stats, MASS, rcbalance, BiasedUrn, xtable |
Suggests: | weightedRank |
Depends: | R (>= 3.5.0) |
NeedsCompilation: | no |
Packaged: | 2024-09-02 19:37:06 UTC; rosenbap |
Repository: | CRAN |
Date/Publication: | 2024-09-05 16:00:02 UTC |
Index of help topics:
aHDL            Alcohol and HDL Cholesterol
addMahal        Rank-Based Mahalanobis Distance Matrix
addNearExact    Add a Near-exact Penalty to an Existing Distance Matrix
addcaliper      Add a Caliper to an Existing Cost Matrix
addinteger      Add an Integer Penalty to an Existing Distance Matrix
addquantile     Cut a Covariate at Quantiles and Add a Penalty for Different Quantile Categories
amplify         Amplification of sensitivity analysis in observational studies
binge           Binge Drinking and High Blood Pressure
bingeM          Binge Drinking and High Blood Pressure - Matched With Two Control Groups
computep        Computes individual and pairwise treatment assignment probabilities
ev              Computes the null expectation and variance for one stratum
evalBal         Evaluate Covariate Balance in a Matched Sample
evall           Compute expectations and variances for one stratum
gconv           Convolution of Two Probability Generating Functions
iTOS-package    Methods and Examples from Introduction to the Theory of Observational Studies
makematch       Two-Criteria Matching
makenetwork     Make the Network Used for Matching with Two Criteria
noether         Sensitivity Analysis Using Noether's Test for Matched Pairs
startcost       Initialize a Distance Matrix
zeta            zeta function in sensitivity analysis
Paul R. Rosenbaum [aut, cre]
Maintainer: Paul R. Rosenbaum <[email protected]>
Rosenbaum, Paul R. Introduction to the Theory of Observational Studies. Manuscript, 2024.
Rosenbaum, P. R. (1989) <doi:10.2307/2290079> Optimal matching for observational studies. Journal of the American Statistical Association, 84, 1024-1032.
Rosenbaum, Paul R., Richard N. Ross, and Jeffrey H. Silber (2007) <doi:10.1198/016214506000001059> Minimum distance matched sampling with fine balance in an observational study of treatment for ovarian cancer. Journal of the American Statistical Association 102, 75-83.
Rosenbaum, P. R. (2023a) <doi:10.1111/biom.13558> Sensitivity analyses informed by tests for bias in observational studies. Biometrics 79, 475-487.
Rosenbaum, P. R. (2023b) <doi:10.1111/biom.13921> A second evidence factor for a second control group. Biometrics, 79, 3968-3980.
Yang, D., Small, D. S., Silber, J. H. and Rosenbaum, P. R. (2012) <doi:10.1111/j.1541-0420.2011.01691.x> Optimal matching with minimal deviation from fine balance in a study of obesity and surgical outcomes. Biometrics, 68, 628-636.
Zhang, B., D. S. Small, K. B. Lasater, M. McHugh, J. H. Silber, and P. R. Rosenbaum (2023) <doi:10.1080/01621459.2021.1981337> Matching one sample according to two criteria in observational studies. Journal of the American Statistical Association, 118, 1140-1151.
data(binge)
table(binge$AlcGroup)
data(aHDL)
table(aHDL$grp)
For one covariate, adds a caliper to an existing cost matrix.
addcaliper(costmatrix, z, p, caliper = NULL, penalty = 1000, twostep = TRUE)
costmatrix |
An existing cost matrix with sum(z) rows and sum(1-z) columns. The function checks the compatibility of costmatrix, z and p; so, it may stop with an error if these are not of appropriate dimensions. In particular, costmatrix may come from startcost(). |
z |
A vector with z[i]=1 if individual i is treated or z[i]=0 if individual i is control. The rows of costmatrix refer to treated individuals and the columns refer to controls. |
p |
A vector with the same length as z. The vector p is the covariate for which a caliper is needed. |
caliper |
Determines the type and length of the caliper. The caliper becomes a vector cvec with length 2. If is.null(caliper), then the caliper is +/- 0.2 times the standard deviation of p, namely cvec = c(-.2,.2)*sd(p). If caliper is a single number, then the caliper is +/- caliper, or cvec = c(-1,1)*abs(caliper). If caliper is a vector of length 2, then an asymmetric caliper is used, cvec = c(min(caliper),max(caliper)), where min(caliper) must be negative and max(caliper) must be positive. |
penalty |
Let I be the index of the ith treated individual, i=1,...,sum(z), and let J be the index of the jth control, j=1,...,sum(1-z), so that 1 <= I <= length(z) and 1 <= J <= length(z). The penalty added to costmatrix[i,j] is 0 if cvec[1] <= p[I]-p[J] <= cvec[2]; otherwise, penalty is added to costmatrix[i,j]. |
twostep |
If twostep is FALSE, then no further action is taken. If twostep is TRUE, no further action is taken if 2*cvec[1] <= p[I]-p[J] <= 2*cvec[2]; otherwise, costmatrix[i,j] is further increased by adding penalty. In words, the penalty is doubled if p[I]-p[J] falls outside twice the caliper; see the sketch after this argument list. |
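The caliper and twostep rules above can be written out directly. The following is a minimal sketch under those stated rules, not the package's internal code; the name caliperSketch is hypothetical.

# Sketch of the caliper logic described above: penalize treated-control
# pairs whose difference in p falls outside cvec, and add the penalty a
# second time outside 2*cvec when twostep=TRUE.
caliperSketch <- function(costmatrix, z, p, caliper = NULL, penalty = 1000, twostep = TRUE) {
  # Turn the caliper argument into the vector cvec described above
  if (is.null(caliper)) cvec <- c(-0.2, 0.2) * sd(p)
  else if (length(caliper) == 1) cvec <- c(-1, 1) * abs(caliper)
  else cvec <- c(min(caliper), max(caliper))
  dif <- outer(p[z == 1], p[z == 0], "-")   # p[I]-p[J]: treated rows, control columns
  costmatrix <- costmatrix + penalty * ((dif < cvec[1]) | (dif > cvec[2]))
  if (twostep) {
    costmatrix <- costmatrix + penalty * ((dif < 2 * cvec[1]) | (dif > 2 * cvec[2]))
  }
  costmatrix
}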
For discussion of directional calipers, see Yu and Rosenbaum (2019).
A penalized costmatrix.
Paul R. Rosenbaum
Cochran, William G., and Donald B. Rubin. Controlling bias in observational studies: A review. Sankhya: The Indian Journal of Statistics, Series A 1973;35:417-446.
Yu, Ruoqi, and Paul R. Rosenbaum. <doi:10.1111/biom.13098> Directional penalties for optimal matching in observational studies. Biometrics 2019;75:1380-1390.
data(binge)
# Select two treated and three controls from binge
d<-binge[is.element(binge$SEQN,c(109315,109365,109266,109273,109290)),]
z<-1*(d$AlcGroup=="B")
names(z)<-d$SEQN
attach(d)
x<-data.frame(age,female)
detach(d)
rownames(x)<-d$SEQN
dist<-startcost(z)
z
x
dist
# Ten-year age caliper
addcaliper(dist,z,x$age,caliper=10,twostep=FALSE)
# Ten-year age caliper with twostep=TRUE
addcaliper(dist,z,x$age,caliper=10,twostep=TRUE)
# Same ten-year age caliper with twostep=TRUE
addcaliper(dist,z,x$age,caliper=c(-10,10))
# Asymmetric, directional age caliper with twostep=TRUE
addcaliper(dist,z,x$age,caliper=c(-2,10))
# Treated 109315 aged 30 is more than 2 years younger
# than control 109273 aged 36, 30-36<(-2), so
# row 109315 column 109273 is penalized, indeed
# double penalized, as 30-36<2*(-2)
rm(z,x,dist,d)
Takes an integer-valued covariate and adds a penalty proportional to the difference in the integer values, with proportionality constant penalty.
addinteger(costmatrix, z, iscore, penalty = 1000)
costmatrix |
An existing cost matrix with sum(z) rows and sum(1-z) columns. The function checks the compatibility of costmatrix, z and iscore; so, it may stop with an error if these are not of appropriate dimensions. In particular, costmatrix may come from startcost(). |
z |
A vector with z[i]=1 if individual i is treated or z[i]=0 if individual i is control. The rows of costmatrix refer to treated individuals and the columns refer to controls. |
iscore |
A vector of integers with length equal to length(z). |
penalty |
One positive number used to penalize mismatches for iscore. |
If a treated and control individual differ on iscore in absolute value by dif, then the distance between them is increased by adding dif*penalty.
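A minimal sketch of that rule (illustrative, not the package's internal code; the name integerSketch is hypothetical):

# Sketch of the integer penalty: add penalty times the absolute difference
# in iscore between each treated individual (rows) and each control (columns).
integerSketch <- function(costmatrix, z, iscore, penalty = 1000) {
  dif <- abs(outer(iscore[z == 1], iscore[z == 0], "-"))
  costmatrix + penalty * dif
}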
A penalized distance matrix.
Paul R. Rosenbaum
data(binge)
# Select two treated and four controls from binge
d<-binge[is.element(binge$SEQN,c(109315,109365,109266,109273,109290,109332)),]
attach(d)
z<-1*(AlcGroup=="B")
names(z)<-d$SEQN
rbind(z,education)
dist<-startcost(z)
addinteger(dist,z,education,penalty=3)
detach(d)
rm(d,dist,z)
Adds a rank-based Mahalanobis distance to an existing distance matrix.
addMahal(costmatrix, z, X)
costmatrix |
An existing cost matrix with sum(z) rows and sum(1-z) columns. The function checks the compatibility of costmatrix, z and X; so, it may stop with an error if these are not of appropriate dimensions. In particular, costmatrix may come from startcost(). |
z |
A vector with z[i]=1 if individual i is treated or z[i]=0 if individual i is control. The rows of costmatrix refer to treated individuals and the columns refer to controls. |
X |
A matrix with length(z) rows containing covariates. |
The rank-based Mahalanobis distance is defined in section 9.3 of Rosenbaum (2020).
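As a rough sketch of that construction, following the usual recipe in Rosenbaum (2020, Section 9.3) rather than the package's internal code (the name rankMahalSketch is hypothetical), each covariate is replaced by its ranks, the covariance of the ranks is rescaled so that ties do not shrink a covariate's influence, and Mahalanobis distances are then computed between treated and control rows:

# Sketch of a rank-based Mahalanobis distance matrix (treated x control).
rankMahalSketch <- function(z, X) {
  X <- as.matrix(X)
  X <- apply(X, 2, rank)                  # replace each covariate by its ranks
  cv <- cov(X)
  vuntied <- var(1:nrow(X))               # variance of untied ranks
  rat <- sqrt(vuntied / diag(cv))
  cv <- diag(rat) %*% cv %*% diag(rat)    # rescale so ties do not reduce a covariate's weight
  icv <- MASS::ginv(cv)                   # generalized inverse of the rank covariance
  Xt <- X[z == 1, , drop = FALSE]
  Xc <- X[z == 0, , drop = FALSE]
  out <- matrix(NA, nrow(Xt), nrow(Xc))
  for (i in 1:nrow(Xt)) {
    out[i, ] <- mahalanobis(Xc, Xt[i, ], icv, inverted = TRUE)
  }
  out
}

The resulting matrix could then be added to an existing cost matrix, which is what addMahal() does with costmatrix.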
A new distance matrix that is the sum of costmatrix and the rank-based Mahalanobis distances.
Paul R. Rosenbaum
Rosenbaum, P. R. (2020) <doi:10.1007/978-3-030-46405-9> Design of Observational Studies (2nd Edition). New York: Springer.
Rubin, D. B. (1980) <doi:10.2307/2529981> Bias reduction using Mahalanobis-metric matching. Biometrics, 36, 293-298.
data(binge)
# Select two treated and three controls from binge
d<-binge[is.element(binge$SEQN,c(109315,109365,109266,109273,109290)),]
z<-1*(d$AlcGroup=="B")
names(z)<-d$SEQN
attach(d)
x<-cbind(age,female)
detach(d)
rownames(x)<-d$SEQN
dist<-startcost(z)
z
x
dist
dist<-addMahal(dist,z,x)
dist
rm(z,x,dist,d)
Add a Near-exact Penalty to an Existing Distance Matrix.
addNearExact(costmatrix, z, exact, penalty = 1000)
costmatrix |
An existing cost matrix with sum(z) rows and sum(1-z) columns. The function checks the compatibility of costmatrix, z and exact; so, it may stop with an error if these are not of appropriate dimensions. In particular, costmatrix may come from startcost(). |
z |
A vector with z[i]=1 if individual i is treated or z[i]=0 if individual i is control. The rows of costmatrix refer to treated individuals and the columns refer to controls. |
exact |
A vector with the same length as z. Typically, exact takes a small or moderate number of values. |
penalty |
One positive number. |
If the ith treated individual and the jth control have different values of exact, then the distance between them in costmatrix is increased by adding penalty.
A penalized distance matrix.
A sufficiently large penalty will maximize the number of individuals exactly matched for exact. A smaller penalty will tend to increase the number of individuals matched exactly, without prioritizing one covariate over all others.
If the left distance matrix is penalized, it will affect pairing and balance; however, if the right distance matrix is penalized it will affect balance only, as in the near-fine balance technique of Yang et al. (2012).
Adding several near-exact penalties for different covariates on the right distance matrix implements a Hamming distance on the joint distribution of those covariates, as discussed in Zhang et al. (2023).
Near-exact matching for a nominal covariate is discussed and contrasted with exact matching in Sections 10.3 and 10.4 of Rosenbaum (2020). Near-exact matching is always feasible, because it implements a constraint using a penalty. Exact matching may be infeasible, but when feasible it may be used to speed up computations. For an alternative method of speeding computations, see Yu et al. (2020) who identify feasible constraints very quickly prior to matching with those constraints.
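A minimal sketch of the penalty rule stated in the Details above (illustrative, not the package's internal code; the name nearExactSketch is hypothetical):

# Sketch of the near-exact penalty: add penalty wherever the treated
# individual (row) and the control (column) disagree on exact.
nearExactSketch <- function(costmatrix, z, exact, penalty = 1000) {
  mismatch <- outer(exact[z == 1], exact[z == 0], "!=")
  costmatrix + penalty * mismatch
}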
Paul R. Rosenbaum
Rosenbaum, P. R. (2020) <doi:10.1007/978-3-030-46405-9> Design of Observational Studies (2nd Edition). New York: Springer.
Yang, D., Small, D. S., Silber, J. H. and Rosenbaum, P. R. (2012) <doi:10.1111/j.1541-0420.2011.01691.x> Optimal matching with minimal deviation from fine balance in a study of obesity and surgical outcomes. Biometrics, 68, 628-636. (Extension of fine balance useful when fine balance is infeasible. Comes as close as possible to fine balance. Implemented in makematch() by placing a large near-exact penalty on a nominal/integer covariate x1 on the right distance matrix.)
Yu, R., Silber, J. H., Rosenbaum, P. R. (2020) <doi:10.1214/19-STS699> Matching Methods for Observational Studies Derived from Large Administrative Databases. Statistical Science, 35, 338-355.
Zhang, B., D. S. Small, K. B. Lasater, M. McHugh, J. H. Silber, and P. R. Rosenbaum (2023) <doi:10.1080/01621459.2021.1981337> Matching one sample according to two criteria in observational studies. Journal of the American Statistical Association, 118, 1140-1151.
Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H. and Rosenbaum, P. R. (2011) <doi:10.1198/tas.2011.11072> Matching for several sparse nominal variables in a case control study of readmission following surgery. The American Statistician, 65(4), 229-238.
data(binge)
# Select two treated and three controls from binge
d<-binge[is.element(binge$SEQN,c(109315,109365,109266,109273,109290)),]
z<-1*(d$AlcGroup=="B")
names(z)<-d$SEQN
attach(d)
x<-data.frame(age,female)
detach(d)
rownames(x)<-d$SEQN
dist<-startcost(z)
z
x
dist
addNearExact(dist,z,x$female)
addNearExact(dist,z,x$age<40,penalty=10)
# Combine several penalties
dist<-addNearExact(dist,z,x$female)
dist<-addNearExact(dist,z,x$age<40,penalty=10)
dist
dist<-addNearExact(dist,z,x$age<60,penalty=5)
dist
# This distance suggests pairing 109315-109266
# and 109365-109290
rm(z,x,dist,d)
Cut a Covariate at Quantiles and Add a Penalty for Different Quantile Categories
addquantile(costmatrix, z, p, pct = c(0.2, 0.4, 0.6, 0.8), penalty = 1000)
costmatrix |
An existing cost matrix with sum(z) rows and sum(1-z) columns. The function checks the compatibility of costmatrix, z and p; so, it may stop with an error if these are not of appropriate dimensions. In particular, costmatrix may come from startcost(). |
z |
A vector with z[i]=1 if individual i is treated or z[i]=0 if individual i is control. The rows of costmatrix refer to treated individuals and the columns refer to controls. |
p |
A vector of length equal to length(z). Quantiles of p will penalize the distance. |
pct |
A vector of numbers strictly between 0 and 1. These determine the quantiles of p. For instance, c(.25,.5,.75) uses the quartiles of p. |
penalty |
One positive number used as a penalty. |
The vector p is cut at its quantiles defined by pct, and the integer difference in quantile categories is multiplied by penalty and added to the distance matrix. The function is similar to addinteger(), except the integer values are not specified, but rather are deduced from the quantiles.
If you cannot match for the quantile category of p, then addquantile() prefers to match from an adjacent quantile category.
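A minimal sketch of that description (illustrative, not the package's internal code; the name quantileSketch is hypothetical, and it assumes the quantile cutpoints are distinct):

# Sketch of addquantile(): cut p at the quantiles given by pct, convert
# the categories to integer scores, and penalize as addinteger() would.
quantileSketch <- function(costmatrix, z, p, pct = c(0.2, 0.4, 0.6, 0.8), penalty = 1000) {
  cuts <- quantile(p, pct)
  iscore <- as.integer(cut(p, c(-Inf, cuts, Inf)))  # quantile category as an integer
  dif <- abs(outer(iscore[z == 1], iscore[z == 0], "-"))
  costmatrix + penalty * dif
}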
A penalized distance matrix.
Paul R. Rosenbaum
data(binge)
d<-binge[binge$AlcGroup!="N",]
attach(d)
z<-1*(AlcGroup=="B")
names(z)<-SEQN
dist<-startcost(z)
quantile(age,c(1/4,1/2,3/4))
rbind(z,age)[,1:20]
addquantile(dist,z,d$age,pct=c(1/4,1/2,3/4),penalty=5)[1:5,1:7]
detach(d)
rm(z,dist,d)
A small observational study of light daily alcohol consumption and HDL cholesterol – so-called good cholesterol – derived from NHANES 2013-2014 and 2015-2016. There are 406 matched sets of four individuals, making 1624 individuals in total. Sets were matched for age, female and education in five ordered categories.
data("aHDL")
data("aHDL")
A data frame with 1624 observations on the following 11 variables.
nh
NHANES 2013-2014 is 1314, and NHANES 2015-2016 is 1516
SEQN
NHANES ID number
age
Age in years
female
1=female, 0=male
education
1 is <9th grade, 3 is high school, 5 is a BA degree
z
1=light almost daily alcohol, 0=little or no alcohol last year.
grp
Treated group and control groups. Daily=light almost daily alcohol, Never=fewer than 12 drinks during entire life, Rarely=more than 12 drinks in life, but fewer than 12 in the past year, and never had a period of daily binge drinking, PastBinge = a past history of binge drinking on most days, but currently drinks once a week or less. For details, see Rosenbaum (2022a, Appendix).
grpL
Short labels for plotting, formed as the first letters of grp. An ordered factor with levels D < N < R < B.
hdl
HDL cholesterol level mg/dL
mmercury
Methylmercury level ug/L
mset
Matched set indicator, 1, 2, ..., 406. The 1624 observations are in 406 matched sets, each of size 4.
There is a debate about whether light daily alcohol consumption – a single glass of red wine – shortens or lengthens life. LoConte et al. (2018) emphasize that alcohol is a carcinogen. Suh et al. (1992) claim reduced cardiovascular mortality brought about by an increase in high-density lipoprotein (HDL) cholesterol, the so-called good cholesterol. There is ongoing debate about whether there are cardiovascular benefits, and if they exist, whether they are large enough to offset an increased risk of cancer. This example looks at a small corner of the larger debate, namely the effect on HDL cholesterol.
The example contains several attempts to detect unmeasured confounding bias, if present. There is a secondary outcome thought to be unaffected by alcohol consumption, namely methylmercury levels in the blood, likely an indicator of the consumption of fish, not of alcohol; see Pedersen et al. (1994) and WHO (2021). There are also three control groups, all with little present alcohol consumption, but with different uses of alcohol in the past; see the definition of variable grp above.
The appendix to Rosenbaum (2023) describes the data and matching in detail. It is used as an example in Rosenbaum (2022).
The help file for boxplotTT() applies the tail transformation to this example, reproducing a plot from Rosenbaum (2022).
This data set is also included in the tailTransform package. See also the informedSen package which contains a part of this data set.
US National Health and Nutrition Examination Survey (NHANES), 2013-2014 and 2015-2016.
LoConte, N. K., Brewster, A. M., Kaur, J. S., Merrill, J. K., and Alberg, A. J. <doi:10.1200/JCO.2017.76.1155> Alcohol and cancer: a statement of the American Society of Clinical Oncology. Journal of Clinical Oncology 2018;36:83-93.
Pedersen, G. A., Mortensen, G. K. and Larsen, E. H. <doi:10.1080/02652039409374234> Beverages as a source of toxic trace element intake. Food Additives and Contaminants, 1994;11:351–363.
Rosenbaum, P. R. (1987) <doi:10.1214/ss/1177013232> The role of a second control group in an observational study. Statistical Science, 1987;2:292-306.
Rosenbaum, P. R. <doi:10.2307/2531497> The role of known effects in observational studies. Biometrics, 1989;45:557-569.
Rosenbaum, P. R. <doi:10.1214/aos/1176347131> On permutation tests for hidden biases in observational studies. The Annals of Statistics, 1989;17:643-653.
Rosenbaum, P. R. <doi:10.1080/01621459.2013.879261> Weighted M-statistics with superior design sensitivity in matched observational studies with multiple controls. Journal of the American Statistical Association, 2014;109(507):1145-1158
Rosenbaum, P. R. <doi:10.1080/00031305.2022.2063944> (2022). A new transformation of treated-control matched-pair differences for graphical display. American Statistician, 2022;76(4):346-352.
Rosenbaum, P. R. <doi:10.1111/biom.13558> Sensitivity analyses informed by tests for bias in observational studies. Biometrics 2023;79(1):475-487.
Suh, I., Shaten, B. J., Cutler, J. A., and Kuller, L. H. <doi:10.7326/0003-4819-116-11-881> Alcohol use and mortality from coronary heart disease: The role of high-density lipoprotein cholesterol. Annals of Internal Medicine 1992;116:881-887.
World Health Organization (2021). Mercury and Health, <https://www.who.int/news-room/fact-sheets/detail/mercury-and-health>, (Accessed 30 August 2021).
data(aHDL)
table(aHDL$grp,aHDL$grpL) # Short labels for plotting
boxplot(aHDL$age~aHDL$grp,xlab="Group",ylab="Age")
boxplot(aHDL$education~aHDL$grp,xlab="Group",ylab="Education")
table(aHDL$female,aHDL$grpL)
table(aHDL$z,aHDL$grpL)
#
# In iTOS book, for Table 8.1
y<-t(matrix(aHDL$hdl,4,406))
weightedRank::wgtRank(y,gamma=5,phi="wilc")
weightedRank::wgtRank(y,gamma=5,phi="quade")
weightedRank::wgtRank(y,gamma=5,phi="u868")
weightedRank::wgtRank(y,gamma=5,phi="u878")
#
# The sets were also matched for is.na(aHDL$mmercury), for use
# in Rosenbaum (2023). About half of the matched sets
# have values for mmercury. Discussed in Chapter 12 of iTOS.
table(is.na(aHDL$mmercury),aHDL$grp)
# See also the informedSen package for additional analysis
Uses the method in Rosenbaum and Silber (2009) to interpret a value of the sensitivity parameter gamma. Each value of gamma amplifies to a curve (lambda,delta) in a two-dimensional sensitivity analysis, the inference being the same for all points on the curve. That is, a one-dimensional sensitivity analysis in terms of gamma has a two-dimensional interpretation in terms of (lambda,delta).
amplify(gamma, lambda)
gamma |
gamma > 1 is the value of the sensitivity parameter, for instance the parameter in senmv. length(gamma)>1 will generate an error. |
lambda |
lambda is a vector of values > gamma. An error will result unless lambda[i] > gamma > 1 for every i. |
A single value of gamma, say gamma = 2.2 in the example, corresponds to a curve of values of (lambda, delta), including (3, 7), (4, 4.33), (5, 3.57), and (7, 3) in the example. An unobserved covariate that is associated with a lambda = 3 fold increase in the odds of treatment and a delta = 7 fold increase in the odds of a positive pair difference is equivalent to gamma = 2.2.
The curve is gamma = (lambda*delta + 1)/(lambda+delta). Amplify is given one gamma and a vector of lambdas and solves for the vector of deltas. The calculation is elementary.
This interpretation of gamma is developed in detail in Rosenbaum and Silber (2009), and it makes use of Wolfe's (1974) family of semiparametric deformations of an arbitrary symmetric distribution. See also Rosenbaum (2020, Section 3.6). For an elementary discussion, see Rosenbaum (2017, Table 9.1).
Strictly speaking, the amplification describes matched pairs, not matched sets. The senm function views a k-to-1 matched set with k controls matched to one treated individual as a collection of k correlated treated-minus-control matched pair differences; see Rosenbaum (2007). For matched sets, it is natural to think of the amplification as describing any one of the k matched pair differences in a k-to-1 matched set.
The curve has asymptotes that the function amplify does not compute: gamma corresponds with (lambda,delta) = (gamma, Inf) and (Inf, gamma).
A related though distinct idea is developed in Gastwirth et al. (1998). The two approaches agree when the outcome is binary, that is, for McNemar's test.
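The elementary calculation mentioned in the Details can be written out directly. The following sketch (the name amplifySketch is hypothetical, not the package's internal code) solves gamma = (lambda*delta + 1)/(lambda + delta) for delta, which is what amplify() returns:

# Solve gamma = (lambda*delta + 1)/(lambda + delta) for delta,
# requiring every lambda > gamma > 1 as described above.
amplifySketch <- function(gamma, lambda) {
  stopifnot(length(gamma) == 1, gamma > 1, all(lambda > gamma))
  delta <- (gamma * lambda - 1) / (lambda - gamma)
  names(delta) <- lambda
  delta
}
amplifySketch(4, 7)   # delta = 9, agreeing with the example amplify(4,7) below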
Returns a vector of values of delta of length(lambda) with names lambda.
The amplify function is also in the sensitivitymv package where a different example is used.
Paul R. Rosenbaum
Gastwirth, J. L., Krieger, A. M., Rosenbaum, P. R. (1998) <doi:10.1093/biomet/85.4.907> Dual and simultaneous sensitivity analysis for matched pairs. Biometrika, 85, 907-920.
Rosenbaum, P. R. and Silber, J. H. (2009) <doi:10.1198/jasa.2009.tm08470> Amplification of sensitivity analysis in observational studies. Journal of the American Statistical Association, 104, 1398-1405.
Rosenbaum, P. R. (2017) <doi:10.4159/9780674982697> Observation and Experiment: An Introduction to Causal Inference. Cambridge, MA: Harvard University Press. Table 9.1.
Rosenbaum, P. R. (2020) <doi:10.1007/978-3-030-46405-9> Design of Observational Studies (2nd ed.) NY: Springer. Section 3.6.
Wolfe, D. A. (1974) <doi:10.2307/2286025> A characterization of population weighted symmetry and related results. Journal of the American Statistical Association, 69, 819-822.
# Consider a treated-control matched pair as the unit of measure,
# analogous to one meter or one foot. The calculation
# amplify(4,7) says that, in a matched pair, gamma=4
# is the same as a bias that increases the odds of treatment
# 7-fold and increases the odds of a positive matched-pair
# difference in outcomes 9-fold.
amplify(4,7)
# It is also true that, in a matched pair, gamma=4
# is the same as a bias that increases the odds of treatment
# 9-fold and increases the odds of a positive matched-pair
# difference in outcomes 7-fold.
amplify(4,9)
# It is also true that, in a matched pair, gamma=4
# is the same as a bias that increases the odds of treatment
# 5-fold and increases the odds of a positive matched-pair
# difference in outcomes 19-fold.
amplify(4,5)
# The amplify function can produce the entire curve at once:
amplify(4,5:19)
These unmatched data are from NHANES, and they illustrate multivariate matching. The matched version of the data is bingeM, and it was produced by the example in the documentation for the makematch() function.
data("binge")
data("binge")
A data frame with 4627 observations on the following 17 variables.
SEQN
NHANES identification number
age
Age in years
ageC
Age cut into four categories, [20,30), [30,45), [45,60) and [60,Inf).
female
1=female, 0=male
educationf
Education in five categories: <9th grade, 9th-11th grade without a high school degree or equivalent, high school degree or equivalent, some college, at least a BA degree. An ordered factor with levels <9th < 9-11 < HS < SomeCol < >=BA.
education
The previous variable, educationf, as integer scores, 1-5.
bmi
BMI or body mass index. A measure of obesity.
waisthip
Waist-to-hip ratio. A measure of obesity.
vigor
1 if engages in vigorous activity, either in recreation or at work, 0 otherwise
smokenowf
Do you smoke now? An ordered factor with levels Everyday < SomeDays < No.
smokenow
The previous variable, smokenowf, as integer scores.
bpRX
1=currently taking medication to control high blood pressure, 0=other
smokeQuit
1=used to smoke regularly but quit, 0=other. A current smoker and a never smoker both have value 0.
AlcGroup
An ordered factor with levels B < N < P.
bpSystolic
Systolic blood pressure. The average of up to three measurements.
bpDiastolic
Diastolic blood pressure. The average of up to three measurements.
bpCombined
A combined measure of blood pressure.
This data set is intended to illustrate multivariate matching. See the example in the documentation for the makematch() function.
In the examples below, the simple B-N match in Section 4.3 of the iTOS book is constructed. It matches for the propensity score and nothing else.
bpCombined is the sum of two standardized measures, one for systolic blood pressure and one for diastolic blood pressure. In the larger NHANES data set of individuals at least 20 years of age who are not pregnant, the median and the mad (=median absolute deviation from the median) were determined separately for systolic and diastolic blood pressure. The calculation used the median and mad functions in the 'stats' package, so the mad was by default scaled to resemble the standard deviation for a Normal distribution. bpCombined is the sum of two quantities: (systolic blood pressure minus its median) divided by its mad, plus (diastolic blood pressure minus its median) divided by its mad.
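As a hedged sketch of that construction (the vectors sysBP and diaBP below stand in for the larger NHANES variables, so the names are illustrative, not the exact code used to build the data set):

# Illustrative sketch of how bpCombined is formed from systolic and
# diastolic blood pressure, each centered at its median and scaled by
# its mad (mad() is scaled to resemble a Normal standard deviation).
combineBP <- function(sysBP, diaBP) {
  (sysBP - median(sysBP)) / mad(sysBP) + (diaBP - median(diaBP)) / mad(diaBP)
}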
The data are from the US National Health and Nutrition Examination Survey, NHANES 2017-March 2020 Pre-pandemic. The 2017-2020 data were affected by COVID-19 and are not a survey. The complete data are available from the CDC web page. With minor differences, the data were used as an example in Rosenbaum (2023).
Rosenbaum, P. R. (2023) <doi:10.1111/biom.13921> A second evidence factor for a second control group. Biometrics, 79(4), 3968-3980.
Rosenbaum, P.R. and Rubin, D.B. (1985) <doi:10.2307/2683903> Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1):33-38.
US National Health and Nutrition Examination Survey, 2017-2020. Atlanta: US Centers for Disease Control.
data(binge)
attach(binge)
# A simple goal is to match each of the 206 binge drinkers
# to one control from each of the two control groups.
table(AlcGroup)
# The matching ratios are very different, 19-to-1 or 2.4 to 1.
table(AlcGroup)/206
# Before matching, the age distributions are very different.
boxplot(age~AlcGroup)
# NHANES does not ask every question of every person, often
# using either age or sex to determine what will be asked.
# If we match exactly for age categories that determine what
# will be asked, then both members of a matched pair will
# either have an answer or no answer, and this simplifies
# some analyses, while also beginning the work of controlling
# an important covariate.
table(AlcGroup,ageC)
# Age is going to be a problem for the P controls. There
# are 29 bingers B under age 30, but only 19 past bingers P.
table(AlcGroup,smokenowf)
# Smoking is a challenge, too. The 19-to-1 matching ratio
# for N controls drops to about 4 = 364/92 for everyday
# smokers, and from 2.43 to 1.44 = 133/92 for P controls.
# There are too few P controls who smoke SomeDays to
# match exactly with bingers B.
detach(binge)
####################################################
# This example produces the elementary match in
# Section 4.3 of the iTOS book.
# It matches binge drinkers (B) to never binge
# controls (N), matching just for the propensity
# score. The match uses both a caliper and a fine
# balance constraint for the propensity score.
# Generally, one does not just match for the
# propensity score, but this is an example.
# See also the documentation for makematch.
library(iTOS)
data(binge)
# Make the treatment indicator z for B versus N
z<-rep(NA,dim(binge)[1])
z[binge$AlcGroup=="B"]<-1
z[binge$AlcGroup=="N"]<-0
# I find it convenient to write a small function
# to create a match. In that way, small
# changes in the function can improve the
# match by improving covariate balance.
# It also documents the details of the match.
# It also makes it possible to reproduce the match.
matchPropensity<-function(z,ncontrols=1){
  #
  # Some bookkeeping.
  # Select the B versus N part of the binge data
  dt<-binge[!is.na(z),]
  z<-z[!is.na(z)]
  # Sort data, placing treated (B) first,
  # and then ordered by SEQN.
  dt<-dt[order(1-z,dt$SEQN),]
  z<-z[order(1-z,dt$SEQN)]
  rownames(dt)<-dt$SEQN
  names(z)<-dt$SEQN
  # Initialize the distance matrix on
  # the left and right to zero distances
  left<-startcost(z)
  right<-startcost(z)
  # Create the propensity score
  attach(dt)
  propmod<-glm(z~age+female+education+smokenow+smokeQuit+bpRX+
                 bmi+vigor+waisthip,family=binomial)
  p<-propmod$fitted.values
  # The left distance matrix is changed from zero
  # by adding a caliper on the propensity score.
  # The caliper is almost the one used in
  # Rosenbaum and Rubin (1985, American Statistician).
  # If two individuals differ on the propensity score
  # by more than 0.2 times the standard deviation of p,
  # the distance between them is increased from 0 to 10.
  # That was the caliper in Rosenbaum and Rubin (1985).
  # By default, addcaliper doubles the distance (to 20) at
  # 0.4 = 2 x 0.2 times the standard deviation of p.
  # Without this doubling feature, the caliper views
  # everyone who violates the 0.2 caliper as equivalent.
  left<-addcaliper(left,z,p,penalty=10)
  # The right distance matrix now adds a fine balance
  # constraint. The variable (p>0.05)+(p>.1)+(p>.15)+(p>.2)
  # takes integer values from 0 to 4, taking steps up as
  # the propensity score increases.
  # The right distance matrix is 0 if two people fall in
  # the same category, is 1000 if they fall in adjacent
  # categories, 2000 if there is a category
  # between them, and so on. Because 1000 is so much
  # larger than 10, the fine balance constraint takes
  # priority over the caliper.
  right<-addinteger(right,z,(p>0.05)+(p>.1)+(p>.15)+(p>.2))
  # Some more bookkeeping
  detach(dt)
  dt<-cbind(dt,z,p)
  # The big step: Use the distance matrices to make the match
  m<-makematch(dt,left,right,ncontrols=ncontrols)
  # Final bookkeeping
  m$mset<-as.integer(m$mset)
  treated<-m$SEQN[m$z==1]
  treated<-as.vector(t(matrix(treated,length(treated),ncontrols+1)))
  m<-cbind(m,treated)
  list(m=m,dt=dt)
}
# Call the function above to make the match
mProp<-matchPropensity(z)
m<-mProp$m
# Make Table 4.3 in the iTOS book.
t.test(m$age~m$z)$p.value
t.test(m$female~m$z)$p.value
t.test(m$education~m$z)$p.value
t.test(m$bmi~m$z)$p.value
t.test(m$waisthip~m$z)$p.value
t.test(m$vigor~m$z)$p.value
t.test(m$smokenow~m$z)$p.value
t.test(m$smokeQuit~m$z)$p.value
t.test(m$bpRX~m$z)$p.value
t.test(m$p~m$z)$p.value
dt<-mProp$dt
tr<-m$z==1
co<-m$z==0
un<-!is.element(dt$SEQN,m$SEQN)
vnames<-c("age","female","education","bmi","waisthip","vigor",
          "smokenow","smokeQuit","bpRX","p")
o<-matrix(NA,10,3)
pval<-rep(NA,10)
names(pval)<-vnames
rownames(o)<-vnames
colnames(o)<-c("Treated","Control","Unmatched")
for (i in 1:10){
  vname<-vnames[i]
  vm<-as.vector(m[,colnames(m)==vname])
  vdt<-as.vector(dt[,colnames(dt)==vname])
  o[i,1]<-mean(vm[tr])
  o[i,2]<-mean(vm[co])
  o[i,3]<-mean(vdt[un])
  pval[i]<-t.test(vm[tr],vm[co])$p.value
}
library(xtable)
o<-cbind(o,pval)
xtable(o,digits=c(NA,2,2,2,3))
m[5:6,]
The bingeM data set is the matched data set built from the unmatched binge data using the makematch() function. The documentation for the makematch() function builds bingeM from binge.
data("bingeM")
data("bingeM")
A data frame with 618 observations on the following 20 variables.
SEQN
NHANES identification number
age
Age in years
ageC
Age cut into four categories, [20,30), [30,45), [45,60) and [60,Inf).
female
1=female, 0=male
educationf
Education in five categories: <9th grade, 9th-11th grade without a high school degree or equivalent, high school degree or equivalent, some college, at least a BA degree. An ordered factor with levels <9th < 9-11 < HS < SomeCol < >=BA.
education
The previous variable, educationf, as integer scores, 1-5.
bmi
BMI or body mass index. A measure of obesity.
waisthip
Waist-to-hip ratio. A measure of obesity.
vigor
1 if engages in vigorous activity, either in recreation or at work, 0 otherwise
smokenowf
Do you smoke now? An ordered factor with levels Everyday < SomeDays < No.
smokenow
The previous variable, smokenowf, as integer scores.
bpRX
1=currently taking medication to control high blood pressure, 0=other
smokeQuit
1=used to smoke regularly but quit, 0=other. A current smoker and a never smoker both have value 0.
AlcGroup
An ordered factor with levels B < N < P.
bpSystolic
Systolic blood pressure. The average of up to three measurements.
bpDiastolic
Diastolic blood pressure. The average of up to three measurements.
bpCombined
A combined measure of blood pressure.
z
Treatment/control indicator, z[i]=1 if i is in AlcGroup category B and z[i]=0 otherwise.
mset
Indicator of the matched set, 1, 2, ..., 206.
treated
The SEQN for the treated individual in this matched set.
bingeM is the matched data set built from the unmatched binge data using the makematch() function. The documentation for the makematch() function builds bingeM from binge.
bpCombined is the sum of two standardized measures, one for systolic blood pressure and one for diastolic blood pressure. In the larger NHANES data set of individuals at least 20 years of age who are not pregnant, the median and the mad (=median absolute deviation from the median) were determined separately for systolic and diastolic blood pressure. The calculation used the median and mad functions in the 'stats' package, so the mad was by default scaled to resemble the standard deviation for a Normal distribution. bpCombined is the sum of two quantities: (systolic blood pressure minus its median) divided by its mad, plus (diastolic blood pressure minus its median) divided by its mad.
The data are from the US National Health and Nutrition Examination Survey, NHANES 2017-March 2020 Pre-pandemic. The 2017-2020 data were affected by COVID-19 and are not a survey. The complete data are available from the CDC web page. With minor differences, the data were used as an example in Rosenbaum (2023).
Rosenbaum, P. R. (2023) <doi:10.1111/biom.13921> A second evidence factor for a second control group. Biometrics, 79(4), 3968-3980.
US National Health and Nutrition Examination Survey, 2017-2020. Atlanta: US Centers for Disease Control.
data(bingeM)
table(bingeM$AlcGroup)
tapply(bingeM$age,bingeM$AlcGroup,median)
tapply(bingeM$bmi,bingeM$AlcGroup,median)
tapply(bingeM$education,bingeM$AlcGroup,mean)
tapply(bingeM$female,bingeM$AlcGroup,mean)
tapply(bingeM$smokenow,bingeM$AlcGroup,mean)
tapply(bingeM$vigor,bingeM$AlcGroup,mean)
tapply(bingeM$smokeQuit,bingeM$AlcGroup,mean)
boxplot(bingeM$bpCombined~bingeM$AlcGroup,ylab="Combined BP",
        xlab="Alcohol Group",las=1)
y<-t(matrix(bingeM$bpCombined,3,206))
weightedRank::wgtRank(y,gamma=2)
amplify(2,3)
# In iTOS book, comparison of Tables 13.2 and 13.3
weightedRank::ef2C(y,gamma=2,upsilon=2,m1=c(6,6),range=TRUE,scores=1:3)
weightedRank::ef2C(y,gamma=2,upsilon=2.3,m1=c(6,8),range=FALSE,scores=c(1,2,5))
Of limited interest to most users, the computep function plays an internal role in 2-sample and stratified sensitivity analyses. The computep function implements equations (9) and (10), page 496, of Rosenbaum and Krieger (1990). The computep function is from the senstrat package.
computep(bigN, n, m, g)
bigN |
Total sample size in this stratum. |
n |
Treated sample size in this stratum. |
m |
The number of 1's in the vector u of unobserved covariates. Here, u has bigN-m 0's followed by m 1's. |
g |
The sensitivity parameter |
p1 |
Equation (9), page 496, in Rosenbaum and Krieger (1990) evaluated with u[i]=1. |
p0 |
Equation (9), page 496, in Rosenbaum and Krieger (1990) evaluated with u[i]=0. |
p11 |
Equation (10), page 496, in Rosenbaum and Krieger (1990) evaluated with u[i]=1, u[j]=1. |
p10 |
Equation (10), page 496, in Rosenbaum and Krieger (1990) evaluated with u[i]=1, u[j]=0. |
p00 |
Equation (10), page 496, in Rosenbaum and Krieger (1990) evaluated with u[i]=0, u[j]=0. |
The function computep is called by the function ev.
Paul R. Rosenbaum
Rosenbaum, P. R. and Krieger, A. M. (1990) <doi:10.2307/2289789> Sensitivity of two-sample permutation inferences in observational studies. Journal of the American Statistical Association, 85, 493-498.
Rosenbaum, P. R. (2002). Observational Studies (2nd edition). New York: Springer. Section 4.6.
computep(10,5,6,2)
Of limited interest to most users, the ev function plays an internal role in 2-sample and stratified sensitivity analyses. The expectation and variance returned by the ev function are defined in the third paragraph of section 4, page 495, of Rosenbaum and Krieger (1990).
ev(sc, z, m, g, method)
sc |
A vector of scored outcomes for one stratum. For instance, for Wilcoxon's rank sum test, these would be the ranks of the outcomes in the current stratum. |
z |
Treatment indicators, with length(z)=length(sc). Here, z[i]=1 if i is treated and z[i]=0 if i is control. |
m |
The unobserved covariate u has length(z)-m 0's followed by m 1's. |
g |
The sensitivity parameter |
method |
If method="RK" or if method="BU", exact expectations and variances are used in a large sample approximation. Methods "RK" and "BU" should give the same answer, but "RK" uses formulas from Rosenbaum and Krieger (1990), while "BU" obtains exact moments for the extended hypergeometric distribution using the BiasedUrn package and then applies Proposition 20, page 155, section 4.7.4 of Rosenbaum (2002). In contrast, method="LS" does not use exact expectations and variances, but rather uses the large sample approximations in section 4.6.4 of Rosenbaum (2002). Finally, method="AD" uses method="LS" for large strata and method="BU" for smaller strata. |
The function ev() is called by the function evall(). The ev() function is from the senstrat package.
expect |
Null expectation of the test statistic. |
vari |
Null variance of the test statistic. |
Paul R. Rosenbaum
Rosenbaum, P. R. and Krieger, A. M. (1990) <doi:10.2307/2289789> Sensitivity of two-sample permutation inferences in observational studies. Journal of the American Statistical Association, 85, 493-498.
Rosenbaum, P. R. (2002). Observational Studies (2nd edition). New York: Springer. Section 4.6.
Rosenbaum, P. R. (2018) <doi:10.1214/18-AOAS1153> Sensitivity analysis for stratified comparisons in an observational study of the effect of smoking on homocysteine levels. The Annals of Applied Statistics, 12(4), 2312-2334.
ev(1:5,c(0,1,0,1,0),3,2,"RK") ev(1:5,c(0,1,0,1,0),3,2,"BU")
ev(1:5,c(0,1,0,1,0),3,2,"RK") ev(1:5,c(0,1,0,1,0),3,2,"BU")
The covariate balance in a matched sample is compared to the balance that would have been obtained in a completely randomized experiment built from the same people. The existing matched sample is randomized to treatment or control many times, and various measures of covariate balance are computed from the one matched sample and the many randomized experiments. The main elements of the method are from Hansen and Bowers (2008), Pimentel et al. (2015, Table 1), and Yu (2021).
evalBal(z, x, statistic = "s", reps = 1000, trunc = 0.2, nunique = 2, alpha = 0.05)
z |
z is a vector, with z[i]=1 for treated and z[i]=0 for control |
x |
x is a matrix or a data-frame containing covariates with no NAs. An error will result if length(z) does not equal the number of rows of x. |
statistic |
If statistic="t", the default two-sample t-test from the 'stats' package computes a two-sided P-value for each covariate in the matched sample and the many randomized experiments. If statistic="w", then the two-sample Wilcoxon rank sum test is used, as implemented in the 'stats' package. If statistic="s", the two-sample Wilcoxon rank sum test is used for numeric covariates with more than nunique distinct values, or the chi-square test for a two-way table is used for factors and for numeric covariates with at most nunique distinct values. The default value is nunique=2. |
reps |
A positive integer. A total of reps randomized experiments are compared to the one matched sample. |
trunc |
For each simulated randomized experiment, a P-value is computed for each covariate. Also computed is the statistic proposed by Zaykin et al. (2002), defined as the product of those covariate-specific P-values that do not exceed trunc. This truncated product is not a P-value, but it is a statistic. See Details. |
nunique |
See the option statistic="s" above. |
alpha |
For each simulated randomized experiment, a P-value is computed for each covariate. Also computed is the number of these P-values that are less than or equal to alpha. |
Truncated Product: For independent uniform P-values, Zaykin et al. (2002) derive a true P-value from the null distribution of their truncated product of P-values. That null distribution can be computed using the truncatedP() function in the 'sensitivitymv' package; however, it is not used here, because the P-values for dependent covariates are not independent. Rather, the actual randomization distribution of the truncated product is simulated. Taking trunc=1 yields Fisher's statistic, namely the product of all of the P-values.
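A minimal sketch of the truncated product statistic described above (illustrative, not the internal code of evalBal(); the name truncProd is hypothetical):

# Zaykin et al. (2002) truncated product: the product of those P-values
# that do not exceed trunc; it equals 1 when no P-value is <= trunc,
# and taking trunc=1 gives Fisher's product of all P-values.
truncProd <- function(pvals, trunc = 0.2) {
  prod(pvals[pvals <= trunc])
}
truncProd(c(0.03, 0.15, 0.7), trunc = 0.2)   # 0.03 * 0.15 = 0.0045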
test.name |
The name of the test used. |
actual |
For each covariate, the usual two sample P-values comparing the distributions of treated and control groups for each covariate in x. Also, the minimum P-value, the truncated product of P-values, and the number of P-values less than or equal to alpha – none of these quantities is a P-value. |
simBetter |
Comparison of the covariate imbalance in the actual matched sample and the many simulated randomized experiments: for each balance criterion, the number of the reps randomized experiments that were strictly better balanced than the one matched observational study. |
sim |
Details of the simulated randomized experiments. |
Paul R. Rosenbaum
Hansen, B. B., and Bowers, J. (2008) <doi:10.1214/08-STS254> Covariate balance in simple, stratified and clustered comparative studies. Statistical Science, 23, 219-236.
Pimentel, S. D., Kelz, R. R., Silber, J. H. and Rosenbaum, P. R. (2015) <doi:10.1080/01621459.2014.997879> Large, sparse optimal matching with refined covariate balance in an observational study of the health outcomes produced by new surgeons. Journal of the American Statistical Association, 110, 515-527.
Yu, R. (2021) <doi:10.1111/biom.13098> Evaluating and improving a matched comparison of antidepressants and bone density. Biometrics, 77(4), 1276-1288.
Zaykin, D. V., Zhivotovsky, L. A., Westfall, P. H. and Weir, B. S. (2002) <doi:10.1002/gepi.0042> Truncated product method of combining P-values. Genetic Epidemiology, 22, 170-185.
# Evaluate the balance in the bingeM matched sample.
# The more difficult control group, P, will be evaluated.
data(bingeM)
attach(bingeM)
xBP<-data.frame(age,female,education,bmi,waisthip,vigor,smokenow,bpRX,smokeQuit)
xBP<-xBP[bingeM$AlcGroup!="N",]
detach(bingeM)
z<-bingeM$z[bingeM$AlcGroup!="N"]
# In a serious evaluation, take reps=1000 or reps=10000.
# For a quick example, reps is set to reps=100 here.
set.seed(5)
balBP<-evalBal(z,xBP,reps=100)
balBP$test.name
# This says that age is compared using the Wilcoxon two-sample test,
# and female is compared using the chi-square test for a 2x2 table.
# Because the default, nunique=2, was used, education was evaluated
# using Wilcoxon's test; however, changing nunique to 5 would evaluate
# the 5 levels of education using a chi-square test for a 2x5 table.
balBP$actual
# In the matched sample, none of the 9 covariates has a P-value
# of 0.05 or less. The smallest of the 9 P-values is .366, and
# their truncated product is 1, because, by definition, the truncated
# product is 1 if all of the P-values are above trunc.
apply(balBP$sim,2,median)
# In the simulated randomized experiments, the median of the 100
# P-values is close to 1/2 for all covariates.
balBP$simBetter
# Of the 100 simulated randomized experiments, only 3 were better
# balanced than the matched sample in terms of the minimum P-value,
# and none were better balanced in terms of the truncated product
# of P-values.
#
# There were too few controls in the P control group who smoked
# on somedays to match exactly for smokenow. Nonetheless, only
# 13/100 randomized experiments were better balanced for smokenow.
#
# Now compare the binge group B to the combination of the two
# control groups.
attach(bingeM)
x<-data.frame(age,female,education,bmi,waisthip,vigor,smokenow,bpRX,smokeQuit)
detach(bingeM)
set.seed(5)
balAll<-evalBal(bingeM$z,x,reps=100,trunc=1)
balAll$actual
balAll$simBetter
# This time, Fisher's product of all P-values is used, with trunc=1.
# In terms of the minimum P-value and the product of P-values,
# none of the 100 randomized experiments is better balanced than the
# matched sample.
Of limited interest to most users, the evall() function plays an internal role in 2-sample and stratified sensitivity analyses. The expectation and variance returned by the evall() function are defined in the third paragraph of section 4, page 495, of Rosenbaum and Krieger (1990). The function evall() calls the function ev() to determine the expectation and variance of the test statistic for an unobserved covariate u with length(z)-m 0's followed by m 1's, doing this for m=1,...,length(z)-1.
evall(sc, z, g, method)
sc |
A vector of scored outcomes for one stratum. For instance, for Wilcoxon's rank sum test, these would be the ranks of the outcomes in the current stratum. |
z |
Treatment indicators, with length(z)=length(sc). Here, z[i]=1 if i is treated and z[i]=0 if i is control. |
g |
The sensitivity parameter gamma, a number greater than or equal to 1. |
method |
If method="RK" or if method="BU", exact expectations and variances are used in a large sample approximation. Methods "RK" and "BU" should give the same answer, but "RK" uses formulas from Rosenbaum and Krieger (1990), while "BU" obtains exact moments for the extended hypergeometric distribution using the BiasedUrn package and then applies Proposition 20, page 155, section 4.7.4 of Rosenbaum (2002). In contrast, method="LS" does not use exact expectations and variances, but rather uses the large sample approximations in section 4.6.4 of Rosenbaum (2002). Finally, method="AD" uses method="LS" for large strata and method="BU" for smaller strata. |
The evall() function is called by the sen2sample() function and the senstrat() function.
A data.frame with length(z)-1 rows and three columns. The first column, m, gives the number of 1's in the unobserved covariate vector, u. The second column, expect, and the third column, var, give the expectation and variance of the test statistic for this u.
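As a quick, informal check of the claim in the method argument that "RK" and "BU" give the same answer, one might compute both on a tiny artificial stratum; this is a minimal sketch, not an example from the book, and the data are made up.

# Sketch: exact moments from methods "RK" and "BU" should agree
# (method "BU" assumes the BiasedUrn package is installed).
z <- c(0, 0, 0, 1, 1)
sc <- rank(c(3.1, 0.5, 2.2, 4.0, 1.7))
eRK <- evall(sc, z, 2, "RK")
eBU <- evall(sc, z, 2, "BU")
all.equal(eRK, eBU)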
The example is from Table 1, page 497, of Rosenbaum and Krieger (1990). The example is also Table 4.15, page 146, in Rosenbaum (2002). The example refers to Cu cells. The data are originally from Skerfving et al. (1974). The evall function is from the senstrat package.
Paul R. Rosenbaum
Rosenbaum, P. R. and Krieger, A. M. (1990) <doi:10.2307/2289789> Sensitivity of two-sample permutation inferences in observational studies. Journal of the American Statistical Association, 85, 493-498.
Rosenbaum, P. R. (2002). Observational Studies (2nd edition). New York: Springer. Section 4.6.
Rosenbaum, P. R. (2018) <doi:10.1214/18-AOAS1153> Sensitivity analysis for stratified comparisons in an observational study of the effect of smoking on homocysteine levels. The Annals of Applied Statistics, 12(4), 2312-2334.
z<-c(rep(0,16),rep(1,23))
CuCells<-c(2.7, .5, 0, 0, 5, 0, 0, 1.3, 0, 1.8, 0, 0, 1.0, 1.8, 0, 3.1,
           .7, 4.6, 0, 1.7, 5.2, 0, 5, 9.5, 2, 3, 1, 3.5, 2, 5, 5.5, 2,
           3, 4, 0, 2, 2.2, 0, 2)
evall(rank(CuCells),z,2,"RK")
Computes the convolution of two probability generating functions using the convolve function in the stats package. The convolve function uses the fast Fourier transform.
gconv(g1,g2)
g1 |
A probability generating function. A vector g1 for a random variable X taking values 0, 1, 2, ..., length(g1)-1, where g1[i] = Pr(X=i-1). For example, g1 = c(2/3, 1/3) is the generating function of a binary random variable X with Pr(X=0)=2/3 and Pr(X=1)=1/3. The random variable that is 0 with probability 1 has g1=1. |
g2 |
Another probability generating function for a random variable Y. For a fair die, g2 = c(0, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6). |
The probability generating function of X+Y when X and Y are independent.
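A minimal sanity check, not from the book: convolving the generating function of a Bernoulli(1/3) random variable with itself should reproduce the Binomial(2, 1/3) distribution.

# Sketch: gconv() of c(2/3,1/3) with itself versus dbinom()
g <- c(2/3, 1/3)
gconv(g, g)          # probabilities of 0, 1, 2 successes: 4/9, 4/9, 1/9
dbinom(0:2, 2, 1/3)  # the same Binomial(2, 1/3) probabilities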
The gconv function is a slight modification of a similar function in the sensitivity2x2xk package.
Pagano, M. and Tritchler, D. (1983) <doi:10.2307/2288653> On obtaining permutation distributions in polynomial time. Journal of the American Statistical Association, 78, 435-440.
Rosenbaum, P. R. (2020) <doi:10.1007/978-3-030-46405-9> Design of Observational Studies. New York: Springer. Chapter 3 Appendix: Exact Computations for Sensitivity Analysis, page 103.
gconv(c(2/3,1/3),c(2/3,1/3))

gconv(1,c(2/3,1/3))

round(gconv(c(0, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6),
            c(0, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6)),3)
#
# Compute the exact distribution of Quade's treated-control
# statistic for I=3 blocks of size J=3.
#
# Block with range rank = 1
rk1<-c(0,1/3,1/3,1/3)
names(rk1)<-0:3
rk1
#
# Block with range rank = 2
rk2<-c(0,0,1/3,0,1/3,0,1/3)
names(rk2)<-0:6
rk2
#
# Block with range rank = 3
rk3<-c(0,0,0,1/3,0,0,1/3,0,0,1/3)
names(rk3)<-0:9
rk3
#
# Convolution of rk1 and rk2
round(gconv(rk1,rk2),3)
1/(3^2)
#
# Convolution of rk1, rk2 and rk3
round(gconv(gconv(rk1,rk2),rk3),3)
1/(3^3)
Implements the method of Zhang et al. (2023) <doi:10.1080/01621459.2021.1981337>. As special cases, this includes minimum distance (optimal) matching and matching with fine balance, near-fine balance, or refined balance; see Chapter 5 of the iTOS book or the references below.
makematch(dat, costL, costR, ncontrols = 1, controlcosts = NULL,solver='rlemon')
dat |
A data frame. Typically, this is the entire data set. Part of it will be returned as a matched sample with some added variables. |
costL |
The distance matrix on the left side of the network, used for pairing. This matrix would most often be made by adding distances to a zero distance matrix created by startcost(), for instance, using addMahal(). In section 5.4 of the book iTOS or Figure 1 of Zhang et al. (2023), these are the costs on the left treated-control edges. |
costR |
The distance matrix on the right side of the network, used for balancing. This matrix would most often be made by adding distances to a zero distance matrix created by startcost(), for instance, using addNearExact(). If you do not need a right distance matrix, then initialize it to zero using startcost() and do not add additional distances to its initial form. In section 5.4 of the book iTOS or Figure 1 of Zhang et al. (2023), these are the costs on the right control-treated edges. |
ncontrols |
One positive integer, 1 for pair matching, 2 for matching two controls to each treated individual, etc. When ncontrols=2 is feasible, it is often useful to compare the quality of the match obtained with ncontrols=1 and ncontrols=2. In Figure 1 of Zhang et al. (2023), ncontrols determines the total flow that leaves the source and is collected by the sink as ncontrols times the number of treated individuals. |
controlcosts |
An optional vector of costs used to penalize the control-control edges. For instance, one might penalize the use of controls with low propensity scores. This is illustrated in the example for the B-P match. In Figure 1 of Zhang et al. (2023), these are the costs on the central control-control edges. |
solver |
The solver must be either 'rlemon' or 'rrelaxiv'. Both solvers find a minimum cost flow in a network, but rlemon is publicly available while rrelaxiv has an academic license, so rrelaxiv requires a separate installation; see details. |
This function calls the callrelax() function in Samuel Pimentel's package rcbalance. There are two solvers for callrelax, namely rlemon and the rrelaxiv code of Bertsekas and Tseng (1988). The default is rlemon. rrelaxiv may be the better solver, but it has an academic license, so it is not distributed by CRAN and must be obtained separately from https://errickson.net/rrelaxiv/ or <https://github.com/josherrickson/rrelaxiv/>; see the documentation for callrelax(). There is no need to install rrelaxiv: the package works without it.
In principle, two different matches can have the same minimum cost, and in that case the two solvers might produce different but equally good matched samples. This is not a problem, but it can come as a surprise. It is unlikely to happen unless the distance matrix has many tied distances, and tied distances are rare when a distance matrix is built from many covariates using modern matching techniques.
Returns a matched data set. The matched rows of dat are returned with a new variable mset indicating the matched set. The returned file is sorted by mset and z.
Paul R. Rosenbaum
Bertsekas, D. P., Tseng, P. (1988) <doi:10.1007/BF02288322> The Relax codes for linear minimum cost network flow problems. Annals of Operations Research, 13, 125-190.
Bertsekas, D. P. (1990) <doi:10.1287/inte.20.4.133> The auction algorithm for assignment and other network flow problems: A tutorial. Interfaces, 20(4), 133-149.
Bertsekas, D. P., Tseng, P. (1994) <http://web.mit.edu/dimitrib/www/Bertsekas_Tseng_RELAX4_!994.pdf> RELAX-IV: A Faster Version of the RELAX Code for Solving Minimum Cost Flow Problems.
Hansen, B. B. and Klopfer, S. O. (2006) <doi:10.1198/106186006X137047> "Optimal full matching and related designs via network flows". Journal of Computational and Graphical Statistics, 15(3), 609-627. ('optmatch' package)
Hansen, B. B. (2007) <https://www.r-project.org/conferences/useR-2007/program/presentations/hansen.pdf> Flexible, optimal matching for observational studies. R News, 7, 18-24. ('optmatch' package)
Pimentel, S. D., Kelz, R. R., Silber, J. H. and Rosenbaum, P. R. (2015) <doi:10.1080/01621459.2014.997879> Large, sparse optimal matching with refined covariate balance in an observational study of the health outcomes produced by new surgeons. Journal of the American Statistical Association, 110, 515-527. (Introduces an extension of fine balance called refined balance that is implemented in Pimentel's package 'rcbalance'. This can be implemented using makematch() by, say, placing on the right a very large near-exact penalty on a nominal/integer covariate x1, and a still large but smaller penalty on the nominal/integer covariate as.integer(factor(x1):factor(x2)), etc.)
Pimentel, S. D. (2016) "Large, Sparse Optimal Matching with R Package rcbalance" <https://obsstudies.org/large-sparse-optimal-matching-with-r-package-rcbalance/> Observational Studies, 2, 4-23. (Discusses and illustrates the use of Pimentel's 'rcbalance' package.)
Rosenbaum, P. R. and Rubin, D. B. (1985) <doi:10.1080/00031305.1985.10479383> Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39, 33-38. (This paper suggested emphasizing the propensity score in a match, but also attempting to obtain a close match for key covariates using a Mahalanobis distance.)
Rosenbaum, P. R. (1989) <doi:10.1080/01621459.1989.10478868> Optimal matching for observational studies. Journal of the American Statistical Association, 84(408), 1024-1032. (Discusses and illustrates fine balance using minimum cost flow in a network in section 3.2. This is implemented using makematch() by placing a large near-exact penalty on a nominal/integer covariate x1 on the right distance matrix.)
Rosenbaum, P. R., Ross, R. N. and Silber, J. H. (2007) <doi:10.1198/016214506000001059> Minimum distance matched sampling with fine balance in an observational study of treatment for ovarian cancer. Journal of the American Statistical Association, 102, 75-83.
Rosenbaum, P. R. (2020) <doi:10.1007/978-3-030-46405-9> Design of Observational Studies (2nd Edition). New York: Springer.
Yang, D., Small, D. S., Silber, J. H. and Rosenbaum, P. R. (2012) <doi:10.1111/j.1541-0420.2011.01691.x> Optimal matching with minimal deviation from fine balance in a study of obesity and surgical outcomes. Biometrics, 68, 628-636. (Extension of fine balance useful when fine balance is infeasible. Comes as close as possible to fine balance. Implemented in makematch() by placing a large near-exact penalty on a nominal/integer covariate x1 on the right distance matrix.)
Yu, Ruoqi, and P. R. Rosenbaum. <doi:10.1111/biom.13098> Directional penalties for optimal matching in observational studies. Biometrics 75, no. 4 (2019): 1380-1390. (If a covariate is very out-of-balance, we should prefer a mismatch that works against the imbalance to an equally large mismatch that supports the imbalance. This is illustrated with the propensity score in the example below. Because a directional penalty tolerates a big mismatch in the good direction, it makes sense to place the penalty on the right distance matrix.)
Yu, R. (2023) <doi:10.1111/biom.13771> How well can fine balance work for covariate balancing? Biometrics. 79(3), 2346-2356.
Zhang, B., D. S. Small, K. B. Lasater, M. McHugh, J. H. Silber, and P. R. Rosenbaum (2023) <doi:10.1080/01621459.2021.1981337> Matching one sample according to two criteria in observational studies. Journal of the American Statistical Association, 118, 1140-1151. (This is the basic reference for the makematch() function. It generalizes the concepts of fine balance, near-fine balance and refined balance that were developed in other references.)
Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H. and Rosenbaum, P. R. (2011) <doi:10.1198/tas.2011.11072> Matching for several sparse nominal variables in a case control study of readmission following surgery. The American Statistician, 65(4), 229-238. (This paper combines near-exact matching and fine balance for the same nominal covariate. It is implemented in makematch() by placing the same covariate on the left and the right, as with smokenow in the example below.)
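Before the full construction of bingeM in the example below, a minimal sketch of a makematch() call on a tiny subset of binge may help fix ideas. The subset, caliper, and penalties here are illustrative only, echoing the makenetwork() example later in this documentation; they are not the choices used to build bingeM.

# Sketch: pair the two binge drinkers in a tiny subset of 'binge' with
# two of the three candidate controls, pairing on age (left distance)
# while balancing female (right distance).
data(binge)
d <- binge[is.element(binge$SEQN, c(109315, 109365, 109266, 109273, 109290)), ]
z <- 1*(d$AlcGroup == "B")
o <- order(1 - z, d$SEQN)   # treated rows first, as in the full example below
d <- d[o, ]
z <- z[o]
rownames(d) <- d$SEQN
names(z) <- d$SEQN
left <- addcaliper(startcost(z), z, d$age, caliper = 10, penalty = 5)
right <- addNearExact(startcost(z), z, d$female)
m <- makematch(cbind(d, z), left, right, ncontrols = 1)
m[, c("SEQN", "AlcGroup", "age", "female", "mset")]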
# See also the examples for binge in the documentation for the dataset binge.
data(binge)
# matchNever creates the B-N match for the binge data.
# matchPast creates the B-P match for the binge data.
# Each match has its own propensity score, and these
# mean different things in the B-N match and the B-P
# match.  The propensity score is denoted p.
# The Mahalanobis distance is the rank-based method
# described in Rosenbaum (2020, section 9.3).
# A directional caliper from Yu and Rosenbaum (2019)
# favors controls with high propensity scores, p.
#
# The two matches share most features, including:
# 1. A heavy left penalty for mismatching for female and bpRX.
# 2. Left penalties for mismatching for ageC, vigor, smokenow.
# 3. A left Mahalanobis distance for all covariates.
# 4. A heavy right penalty for mismatching smokenow.
# 5. Right penalties for education and smokeQuit.
# 6. A right Mahalanobis distance for just female, age and p.
# 7. An asymmetric, directional caliper on p.
# Because the left distance determines pairing, it emphasizes
# covariates that are out-of-balance and thought to be related
# to blood pressure.  Because the right distance affects balance
# but not pairing, it worries about education and smoking in the
# distant past, as well as the propensity score which again is
# focused on covariate balance.  The asymmetric caliper is
# tolerant of mismatches in the desired direction, so it too
# is placed on the right.  Current smoking, smokenow, is placed
# on both the left and the right: even if we cannot always pair
# for it, perhaps we can balance it.
# In common practice, before examining outcomes, one compares
# several matched designs, fixing imperfections by adjusting
# penalties and other considerations.
#
# The B-P match is more difficult, because the P group is
# smaller than the N group.  The B-P match uses the
# controlcosts feature while the B-N match does not.
# In the B-P match, there are too few young controls
# and too few controls with high propensity scores.
# Therefore, controlcosts penalize the use of controls
# with both low propensity scores (p<.4) and higher
# ages (age>42), thereby minimizing the use of these
# controls, even though the use of some of these
# controls is unavoidable in a pair match.

matchNever<-function(z,ncontrols=1){
  dt<-binge[!is.na(z),]
  z<-z[!is.na(z)]
  dt<-dt[order(1-z,dt$SEQN),]
  z<-z[order(1-z,dt$SEQN)]
  rownames(dt)<-dt$SEQN
  names(z)<-dt$SEQN
  left<-startcost(z)
  right<-startcost(z)
  attach(dt)
  propmod<-glm(z~age+female+education+smokenow+smokeQuit+bpRX+
                 bmi+vigor+waisthip,family=binomial)
  p<-propmod$fitted.values
  left<-addinteger(left,z,as.integer(ageC),penalty=100)
  left<-addNearExact(left,z,female,penalty = 10000)
  left<-addNearExact(left,z,bpRX,penalty = 10000)
  left<-addNearExact(left,z,vigor,penalty=10)
  left<-addinteger(left,z,smokenow,penalty=10)
  left<-addMahal(left,z,cbind(age,bpRX,female,education,smokenow,
                              smokeQuit,bmi,vigor,waisthip))
  right<-addMahal(right,z,cbind(female,age,p))
  right<-addinteger(right,z,education,penalty=10)
  right<-addinteger(right,z,smokenow,penalty=1000)
  right<-addinteger(right,z,smokeQuit,penalty=10)
  right<-addcaliper(right,z,p,caliper=c(-1,.03),penalty=10)
  detach(dt)
  dt<-cbind(dt,z,p)
  m<-makematch(dt,left,right,ncontrols=ncontrols)
  m$mset<-as.integer(m$mset)
  treated<-m$SEQN[m$z==1]
  treated<-as.vector(t(matrix(treated,length(treated),ncontrols+1)))
  m<-cbind(m,treated)
  list(m=m,dt=dt)
}

matchPast<-function(z,ncontrols=1){
  dt<-binge[!is.na(z),]
  z<-z[!is.na(z)]
  dt<-dt[order(1-z,dt$SEQN),]
  z<-z[order(1-z,dt$SEQN)]
  rownames(dt)<-dt$SEQN
  names(z)<-dt$SEQN
  left<-startcost(z)
  right<-startcost(z)
  attach(dt)
  propmod<-glm(z~age+female+education+smokenow+smokeQuit+bpRX+
                 bmi+vigor+waisthip,family=binomial)
  p<-propmod$fitted.values
  left<-addinteger(left,z,as.integer(ageC),penalty=100)
  left<-addNearExact(left,z,female,penalty = 10000)
  left<-addNearExact(left,z,bpRX,penalty = 10000)
  left<-addNearExact(left,z,vigor,penalty=10)
  left<-addinteger(left,z,smokenow,penalty=10)
  left<-addMahal(left,z,cbind(age,bpRX,female,education,smokenow,
                              smokeQuit,bmi,vigor,waisthip))
  right<-addMahal(right,z,cbind(female,age,p))
  right<-addinteger(right,z,education,penalty=10)
  right<-addinteger(right,z,smokenow,penalty=1000)
  right<-addinteger(right,z,smokeQuit,penalty=10)
  right<-addcaliper(right,z,p,caliper=c(-1,.03),penalty=10)
  controlcosts<-((p[z==0]<.4)&(age[z==0]>42))*1000
  detach(dt)
  dt<-cbind(dt,z,p)
  m<-makematch(dt,left,right,ncontrols=ncontrols,controlcosts=controlcosts)
  m$mset<-as.integer(m$mset)
  treated<-m$SEQN[m$z==1]
  treated<-as.vector(t(matrix(treated,length(treated),ncontrols+1)))
  m<-cbind(m,treated)
  list(m=m,dt=dt)
}

z<-rep(NA,dim(binge)[1])
z[binge$AlcGroup=="B"]<-1
z[binge$AlcGroup=="P"]<-0
mPastComplete<-matchPast(z,ncontrols=1)
mPast<-mPastComplete$m
mPastComplete<-mPastComplete$dt
rm(z)

z<-rep(NA,dim(binge)[1])
z[binge$AlcGroup=="B"]<-1
z[binge$AlcGroup=="N"]<-0
mNeverComplete<-matchNever(z,ncontrols=1)
mNever<-mNeverComplete$m
mNeverComplete<-mNeverComplete$dt
rm(z)

bingeM<-rbind(mNever,mPast[mPast$z==0,])
bingeM<-bingeM[order(bingeM$treated,bingeM$AlcGroup,bingeM$SEQN),]
w<-which(colnames(bingeM)=="p")[1]
bingeM<-bingeM[,-w]
rm(binge,matchNever,matchPast,w)

old.par <- par(no.readonly = TRUE)
par(mfrow=c(1,2))
boxplot(mNever$p[mNever$z==1],mNever$p[mNever$z==0],
        mNeverComplete$p[!is.element(mNeverComplete$SEQN,mNever$SEQN)],
        names=c("B","mN","uN"),ylab="Propensity Score",main="Never",
        ylim=c(0,.8))
boxplot(mPast$p[mPast$z==1],mPast$p[mPast$z==0],
        mPastComplete$p[!is.element(mPastComplete$SEQN,mPast$SEQN)],
        names=c("B","mP","uP"),ylab="Propensity Score",main="Past",
        ylim=c(0,.8))
par(mfrow=c(1,2))
boxplot(mNever$age[mNever$z==1],mNever$age[mNever$z==0],
        mNeverComplete$age[!is.element(mNeverComplete$SEQN,mNever$SEQN)],
        names=c("B","mN","uN"),ylab="Age",main="Never",
        ylim=c(20,80))
boxplot(mPast$age[mPast$z==1],mPast$age[mPast$z==0],
        mPastComplete$age[!is.element(mPastComplete$SEQN,mPast$SEQN)],
        names=c("B","mP","uP"),ylab="Age",main="Past",
        ylim=c(20,80))
par(mfrow=c(1,2))
boxplot(mNever$education[mNever$z==1],mNever$education[mNever$z==0],
        mNeverComplete$education[!is.element(mNeverComplete$SEQN,mNever$SEQN)],
        names=c("B","mN","uN"),ylab="Education",main="Never",
        ylim=c(1,5))
boxplot(mPast$education[mPast$z==1],mPast$education[mPast$z==0],
        mPastComplete$education[!is.element(mPastComplete$SEQN,mPast$SEQN)],
        names=c("B","mP","uP"),ylab="Education",main="Past",
        ylim=c(1,5))
par(mfrow=c(1,2))
boxplot(mNever$bmi[mNever$z==1],mNever$bmi[mNever$z==0],
        mNeverComplete$bmi[!is.element(mNeverComplete$SEQN,mNever$SEQN)],
        names=c("B","mN","uN"),ylab="BMI",main="Never",
        ylim=c(14,70))
boxplot(mPast$bmi[mPast$z==1],mPast$bmi[mPast$z==0],
        mPastComplete$bmi[!is.element(mPastComplete$SEQN,mPast$SEQN)],
        names=c("B","mP","uP"),ylab="BMI",main="Past",
        ylim=c(14,70))
par(mfrow=c(1,2))
boxplot(mNever$waisthip[mNever$z==1],mNever$waisthip[mNever$z==0],
        mNeverComplete$waisthip[!is.element(mNeverComplete$SEQN,mNever$SEQN)],
        names=c("B","mN","uN"),ylab="Waist/Hip",main="Never",
        ylim=c(.65,1.25))
boxplot(mPast$waisthip[mPast$z==1],mPast$waisthip[mPast$z==0],
        mPastComplete$waisthip[!is.element(mPastComplete$SEQN,mPast$SEQN)],
        names=c("B","mP","uP"),ylab="Waist/Hip",main="Past",
        ylim=c(.65,1.25))
par(old.par)
This function is of limited interest to most users, and is called by other functions in the package. It makes the network used in the two-criteria matching method of Zhang et al. (2023).
makenetwork(costL, costR, ncontrols = 1, controlcosts = NULL)
costL |
The distance matrix on the left side of the network, used for pairing. |
costR |
The distance matrix on the right side of the network, used for balancing. |
ncontrols |
One positive integer, 1 for pair matching, 2 for matching two controls to each treated individual, etc. |
controlcosts |
An optional vector of costs used to penalize the control-control edges. |
This function creates the network depicted in Figure 1 of Zhang et al. (2023).
A minimum cost flow in this network is found by passing net to callrelax() in the package 'rcbalance'. If you use callrelax(), I strongly suggest you do this with solver set to 'rrelaxiv'. The 'rrelaxiv' package has an academic license; it uses Fortran code from RELAX IV, developed by Bertsekas and Tseng (1988, 1994) based on Bertsekas' (1990) auction algorithm.
idtreated |
Row identifications for treated individuals |
idcontrol |
Row identifications for control individuals |
net |
A network for use with callrelax in the 'rcbalance' package. |
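To see what is returned, one might inspect the three components on a tiny problem; this is a minimal sketch that uses zero cost matrices and the same illustrative subset of binge as the example below.

# Sketch: look at the components returned by makenetwork()
data(binge)
d <- binge[is.element(binge$SEQN, c(109315, 109365, 109266, 109273, 109290)), ]
z <- 1*(d$AlcGroup == "B")
names(z) <- d$SEQN
nw <- makenetwork(startcost(z), startcost(z))
nw$idtreated  # identifications of the treated individuals
nw$idcontrol  # identifications of the controls
str(nw$net)   # the network object that callrelax() expects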
Paul R. Rosenbaum
Bertsekas, D. P., Tseng, P. (1988) <doi:10.1007/BF02288322> The relax codes for linear minimum cost network flow problems. Annals of Operations Research, 13, 125-190.
Bertsekas, D. P. (1990) <doi:10.1287/inte.20.4.133> The auction algorithm for assignment and other network flow problems: A tutorial. Interfaces, 20(4), 133-149.
Bertsekas, D. P., Tseng, P. (1994) <http://web.mit.edu/dimitrib/www/Bertsekas_Tseng_RELAX4_!994.pdf> RELAX-IV: A Faster Version of the RELAX Code for Solving Minimum Cost Flow Problems.
Zhang, B., D. S. Small, K. B. Lasater, M. McHugh, J. H. Silber, and P. R. Rosenbaum (2023) <doi:10.1080/01621459.2021.1981337> Matching one sample according to two criteria in observational studies. Journal of the American Statistical Association, 118, 1140-1151.
data(binge)
# Select two treated and three controls from binge
d<-binge[is.element(binge$SEQN,c(109315,109365,109266,109273,109290)),]
z<-1*(d$AlcGroup=="B")
names(z)<-d$SEQN
attach(d)
x<-data.frame(age,female)
detach(d)
rownames(x)<-d$SEQN
Ldist<-startcost(z)
Ldist<-addcaliper(Ldist,z,x$age,caliper=10,penalty=5)
Rdist<-startcost(z)
Rdist<-addNearExact(Rdist,z,x$female)
makenetwork(Ldist,Rdist)
Computes a sensitivity analysis for treated-minus-control matched pair differences in observational studies.
noether(y, f = 2/3, gamma = 1, alternative = "greater")
y |
A vector of treated-minus-control matched pair differences. |
f |
A nonnegative number strictly less than 1. Suppose that there are I matched pair differences, length(y)=I. Rank the absolute pair differences from 1 to I, with average ranks for ties. Noether's statistic looks at the roughly (1-f)I pair differences with absolute ranks of at least fI and computes the sign test from these pair differences. With f=0, Noether's statistic is the usual sign test statistic. With f=2/3, Noether's statistic focuses on the 1/3 of pairs with the largest absolute pair differences. In his article, Noether suggested f=1/3 for randomized matched pair differences from a Normal distribution, but f=2/3 is better for sensitivity analyses in observational studies. Pair differences that are zero are not counted, but zero differences are uncommon among the pairs used when f=2/3. |
gamma |
A number greater than or equal to 1. gamma is the sensitivity parameter, where gamma=1 for a randomization test, and gamma>1 for a sensitivity analysis. |
alternative |
The possible alternatives are "greater", "less" or "two.sided"; however, "two.sided" is available only for gamma=1. |
Noether's (1973) statistic strengthens the sign test. In a randomized experiment, it increases power. In an observational study, it increases the design sensitivity and the Bahadur efficiency of a sensitivity analysis.
Because the test has a binomial null distribution in both a randomized experiment and an observational study, Noether's test is used in a number of problems in Introduction to the Theory of Observational Studies.
Noether's test is related to methods of Gastwirth (1966), Brown (1981), and Markowski and Hettmansperger (1982). Its properties in an observational study are discussed in Rosenbaum (2012, 2015).
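The role of f can be seen in a tiny made-up example; this sketch is not from the book.

# Sketch: with I = 9 pair differences and f = 2/3, Noether's statistic
# keeps only the pairs whose absolute differences have ranks of at least
# f*I = 6, roughly the (1-f)*I = 3 largest, and applies the sign test to them.
y <- c(0.1, -0.2, 0.3, -0.4, 0.5, 0.6, -0.7, 0.8, 0.9)
noether(y, f = 0, gamma = 1)    # the ordinary sign test using all 9 pairs
noether(y, f = 2/3, gamma = 1)  # the sign test restricted to the largest |y|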
number.pairs |
Number of pairs used by Noether's statistic, roughly (1-f)I. |
positive.pairs |
Number of positive pair differences among used pairs. |
pval |
P-value testing the null hypothesis of no treatment effect. Obtained from the binomial distribution. |
As noted in the Preface to Introduction to the Theory of Observational Studies, Noether's statistic is used in a sequence of Problems that appear in various chapters.
Paul R. Rosenbaum
Brown, B. M. (1981) <doi:10.1093/biomet/68.1.235> Symmetric quantile averages and related estimators. Biometrika, 68(1), 235-242.
Gastwirth, J. L. (1966) <doi:10.1080/01621459.1966.10482185> On robust procedures. Journal of the American Statistical Association, 61(316), 929-948.
Markowski, E. P. and Hettmansperger, T. P. (1982) <doi:10.1080/01621459.1982.10477905> Inference based on simple rank step score statistics for the location model. Journal of the American Statistical Association, 77(380), 901-907.
Noether, G. E. (1973) <doi:10.1080/01621459.1973.10481411> Some simple distribution-free confidence intervals for the center of a symmetric distribution. Journal of the American Statistical Association, 68(343), 716-719.
Rosenbaum, P. R. (2012) <doi:10.1214/11-AOAS508> An exact adaptive test with superior design sensitivity in an observational study of treatments for ovarian cancer. Annals of Applied Statistics, 6, 83-105.
Rosenbaum, P. R. (2015) <doi:10.1080/01621459.2014.960968> Bahadur efficiency of sensitivity analyses in observational studies. Journal of the American Statistical Association, 110(509), 205-217.
set.seed(1)
y<-rnorm(1000)+.5
noether(y,f=0,gamma=3)
noether(y,f=2/3,gamma=3)
Creates a distance matrix of zeros with dimensions compatible with the treatment indicator vector z.
startcost(z)
z |
A vector with z[i]=1 if individual i is treated or z[i]=0 if individual i is control. The rows of the returned cost matrix refer to treated individuals and the columns refer to controls. Although not strictly required, it is best if z has names that are the same as the row names of the data frame dat that will be used in matching. |
A matrix of zeros with sum(z) rows and sum(1-z) columns. If z has names, then they become the row and column names of this matrix.
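As a quick illustration of the dimensions and names, consider a tiny made-up treatment indicator; this sketch is not from the book.

# Sketch: 2 treated and 3 controls give a 2 x 3 matrix of zeros whose
# row and column names are taken from names(z).
z <- c(a = 1, b = 0, c = 1, d = 0, e = 0)
startcost(z)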
Paul R. Rosenbaum
data(binge)
# Select two treated and three controls from binge
d<-binge[is.element(binge$SEQN,c(109315,109365,109266,109273,109290)),]
z<-1*(d$AlcGroup=="B")
names(z)<-d$SEQN
dist<-startcost(z)
dist
rm(z,dist,d)
Of limited interest to most users, the zeta function plays an internal role in 2-sample and stratified sensitivity analyses. The zeta function is equation (8), page 495, in Rosenbaum and Krieger (1990).
zeta(bigN, n, m, g)
bigN |
Total sample size in this stratum. |
n |
Treated sample size in this stratum. |
m |
The number of 1's in the vector u of unobserved covariates. Here, u has bigN-m 0's followed by m 1's. |
g |
The sensitivity parameter gamma, a number greater than or equal to 1. |
The value of the zeta function.
The zeta function is called by computep. The zeta function is from the senstrat package.
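Because the sensitivity analysis considers every possible count m of 1's in the unobserved covariate u (see evall()), it can be instructive to see how zeta varies with m; this is a minimal sketch, not from the book.

# Sketch: zeta for a stratum of size 10 with 5 treated individuals, at
# sensitivity parameter g = 2, evaluated at every m from 1 to 9.
sapply(1:9, function(m) zeta(10, 5, m, 2))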
Paul R. Rosenbaum
Rosenbaum, P. R. and Krieger, A. M. (1990) <doi:10.2307/2289789> Sensitivity of two-sample permutation inferences in observational studies. Journal of the American Statistical Association, 85, 493-498.
Rosenbaum, P. R. (2002). Observational Studies (2nd edition). New York: Springer. Section 4.6.
zeta(10,5,6,2)