Title: | Information Preserving Regression-Based Tools for Statistical Disclosure Control |
---|---|
Description: | Implementation of the methods described in the paper with the above title: Langsrud, Ø. (2019) <doi:10.1007/s11222-018-9848-9>. The package can be used to generate synthetic or hybrid continuous microdata, and the relationship to the original data can be controlled in several ways. A function for replacing suppressed tabular cell frequencies with decimal numbers is included. |
Authors: | Øyvind Langsrud [aut, cre] |
Maintainer: | Øyvind Langsrud <[email protected]> |
License: | Apache License 2.0 | file LICENSE |
Version: | 1.0.0 |
Built: | 2025-03-05 07:10:09 UTC |
Source: | CRAN |
The limit calculated by FindAlpha
is used when alpha =1 cannot be chosen (warning produced).
In output, alpha is attribute.
CalculateCdirect(a, b, epsAlpha = 1e-07, AlphaHandler = warning, alpha = NULL) CalculateC(a, b, ..., viaQR = NULL, returnAlpha = FALSE)
CalculateCdirect(a, b, epsAlpha = 1e-07, AlphaHandler = warning, alpha = NULL) CalculateC(a, b, ..., viaQR = NULL, returnAlpha = FALSE)
a |
matrix E in paper |
b |
matrix Eg in paper |
epsAlpha |
Precision constant for alpha calculation |
AlphaHandler |
Function (warning or stop) to be used when alpha<1 |
alpha |
Possible with alpha as input instead of computing |
... |
Arguments to CalculateCdirect |
viaQR |
When TRUE QR is involved. This may be needed to handle colinear data. When NULL viaQR is set to TRUE if ordinary computations fail. |
returnAlpha |
When TRUE alpha (1 or value below 1) is returned instead of C. Attribute viaQR is included. |
When epsAlpha=NULL calculations are performed directly (alpha=1) and alpha is not attribute.
Calculated C with attributes alpha and viaQR (when CalculateC)
Øyvind Langsrud
x <- 1:10 y <- matrix(rnorm(30) + 1:30, 10, 3) a <- residuals(lm(y ~ x)) b <- residuals(lm(2 * y + matrix(rnorm(30), 10, 3) ~ x)) a1 <- a b1 <- b a1[, 3] <- a[, 1] + a[, 2] b1[, 3] <- b[, 1] + b[, 2] alpha <- FindAlpha(a, b) FindAlphaSimple(a, b) # Same result as above CalculateC(a, b) CalculateCdirect(a, b) # Same result as above without viaQR attribute CalculateCdirect(a, b, alpha = alpha/(1 + 1e-07)) # Same result as above since epsAlpha = 1e-07 CalculateCdirect(a, b, alpha = alpha/2) # OK # CalculateCdirect(a,b, alpha = 2*alpha) # Not OK FindAlpha(a, b1) # FindAlphaSimple(a,b1) # Not working since b1 is collinear CalculateC(a, b1, returnAlpha = TRUE) # Almost same alpha as above (epsAlpha cause difference) FindAlpha(b, a) CalculateC(b, a, returnAlpha = TRUE) # 1 returned (not same as above) CalculateC(b, a) FindAlpha(b1, a) # alpha smaller than epsAlpha is set to 0 in CalculateC CalculateC(b1, a) # When alpha = 0 C is calculated by GenQR insetad of chol
x <- 1:10 y <- matrix(rnorm(30) + 1:30, 10, 3) a <- residuals(lm(y ~ x)) b <- residuals(lm(2 * y + matrix(rnorm(30), 10, 3) ~ x)) a1 <- a b1 <- b a1[, 3] <- a[, 1] + a[, 2] b1[, 3] <- b[, 1] + b[, 2] alpha <- FindAlpha(a, b) FindAlphaSimple(a, b) # Same result as above CalculateC(a, b) CalculateCdirect(a, b) # Same result as above without viaQR attribute CalculateCdirect(a, b, alpha = alpha/(1 + 1e-07)) # Same result as above since epsAlpha = 1e-07 CalculateCdirect(a, b, alpha = alpha/2) # OK # CalculateCdirect(a,b, alpha = 2*alpha) # Not OK FindAlpha(a, b1) # FindAlphaSimple(a,b1) # Not working since b1 is collinear CalculateC(a, b1, returnAlpha = TRUE) # Almost same alpha as above (epsAlpha cause difference) FindAlpha(b, a) CalculateC(b, a, returnAlpha = TRUE) # 1 returned (not same as above) CalculateC(b, a) FindAlpha(b1, a) # alpha smaller than epsAlpha is set to 0 in CalculateC CalculateC(b1, a) # When alpha = 0 C is calculated by GenQR insetad of chol
Function to find the largest alpha that makes equation 10 in the paper solvable.
FindAlpha(a, b, tryViaQR = TRUE) FindAlphaSimple(a, b)
FindAlpha(a, b, tryViaQR = TRUE) FindAlphaSimple(a, b)
a |
matrix E in paper |
b |
matrix Eg in paper |
tryViaQR |
When TRUE QR transformation used (to handle collinearity) when ordinary calculations fail. |
alpha
FindAlphaSimple performs the calculations by a simple/direct method. FindAlpha is made to handle problematic special cases.
Øyvind Langsrud
See examples in the documentation of CalculateC
Matrix X decomposed as Q and R (X=QR) where columns of Q are orthonormal. Ordinary QR or SVD may be used.
GenQR(x, doSVD = FALSE, findR = TRUE, makeunique = findR, tol = 1e-07)
GenQR(x, doSVD = FALSE, findR = TRUE, makeunique = findR, tol = 1e-07)
x |
Matrix to be decomposed |
doSVD |
When TRUE SVD instead of QR |
findR |
When FALSE only Q returned |
makeunique |
When TRUE force uniqueness by positive diagonal elements (QR) or by column sums (SVD) |
tol |
As input to qr or, in the case of svd(), similar as input to MASS::ginv(). |
To handle dependency a usual decomposition of X is PX=QR where P is a permutation matrix. This function returns RP^T as R. When SVD, Q=U and R=SV^T.
List with Q and R or just Q
Øyvind Langsrud
GenQR(matrix(rnorm(15),5,3)) GenQR(matrix(rnorm(15),5,3)[,c(1,2,1,3)]) GenQR(matrix(rnorm(15),5,3)[,c(1,2,1,3)],TRUE)
GenQR(matrix(rnorm(15),5,3)) GenQR(matrix(rnorm(15),5,3)[,c(1,2,1,3)]) GenQR(matrix(rnorm(15),5,3)[,c(1,2,1,3)],TRUE)
Implementation of equation 6 (arbitrary residual data) and equation 7 (residual correlations) in the paper. The alpha limit is calculated (equation 9). The limit is used when alpha =1 cannot be chosen (warning produced). In output, alpha is attribute.
RegSDCadd(y, resCorr = NULL, x = NULL, yStart = NULL, ensureIntercept = TRUE)
RegSDCadd(y, resCorr = NULL, x = NULL, yStart = NULL, ensureIntercept = TRUE)
y |
Matrix of confidential variables |
resCorr |
Required residual correlations (possibly recycled) |
x |
Matrix of non-confidential variables |
yStart |
Arbitrary data whose residuals will be used. Will be calculated from resCorr when NULL. |
ensureIntercept |
Whether to ensure/include a constant term. Non-NULL x is subjected to |
Use epsAlpha=NULL to avoid calculation of alpha. Use of alpha (<1) will produce a warning. Input matrices are subjected to EnsureMatrix
.
Generated version of y with alpha as attribute
Øyvind Langsrud
x <- matrix(1:10, 10, 1) y <- matrix(rnorm(30) + 1:30, 10, 3) yOut <- RegSDCadd(y, c(0.1, 0.2, 0.3), x) # Correlations between residuals as required diag(cor(residuals(lm(y ~ x)), residuals(lm(yOut ~ x)))) # Identical covariance matrices cov(y) - cov(yOut) cov(residuals(lm(y ~ x))) - cov(residuals(lm(yOut ~ x))) # Identical regression results summary(lm(y[, 1] ~ x)) summary(lm(yOut[, 1] ~ x)) # alpha as attribute attr(yOut, "alpha") # With yStart as input and alpha limit in use (warning produced) yOut <- RegSDCadd(y, NULL, x, 2 * y + matrix(rnorm(30), 10, 3)) attr(yOut, "alpha") # Same correlation for all variables RegSDCadd(y, 0.2, x) # But in this case RegSDCcomp is equivalent and faster RegSDCcomp(y, 0.2, x) # Make nearly collinear data y[, 3] <- y[, 1] + y[, 2] + 0.001 * y[, 3] # Not possible to achieve correlations. Small alpha with warning. RegSDCadd(y, c(0.1, 0.2, 0.3), x) # Exact collinear data y[, 3] <- y[, 1] + y[, 2] # Zero alpha with warning RegSDCadd(y, c(0.1, 0.2, 0.3), x)
x <- matrix(1:10, 10, 1) y <- matrix(rnorm(30) + 1:30, 10, 3) yOut <- RegSDCadd(y, c(0.1, 0.2, 0.3), x) # Correlations between residuals as required diag(cor(residuals(lm(y ~ x)), residuals(lm(yOut ~ x)))) # Identical covariance matrices cov(y) - cov(yOut) cov(residuals(lm(y ~ x))) - cov(residuals(lm(yOut ~ x))) # Identical regression results summary(lm(y[, 1] ~ x)) summary(lm(yOut[, 1] ~ x)) # alpha as attribute attr(yOut, "alpha") # With yStart as input and alpha limit in use (warning produced) yOut <- RegSDCadd(y, NULL, x, 2 * y + matrix(rnorm(30), 10, 3)) attr(yOut, "alpha") # Same correlation for all variables RegSDCadd(y, 0.2, x) # But in this case RegSDCcomp is equivalent and faster RegSDCcomp(y, 0.2, x) # Make nearly collinear data y[, 3] <- y[, 1] + y[, 2] + 0.001 * y[, 3] # Not possible to achieve correlations. Small alpha with warning. RegSDCadd(y, c(0.1, 0.2, 0.3), x) # Exact collinear data y[, 3] <- y[, 1] + y[, 2] # Zero alpha with warning RegSDCadd(y, c(0.1, 0.2, 0.3), x)
Implementation of equation 8 in the paper.
RegSDCcomp( y, compCorr = NA, x = NULL, doSVD = FALSE, makeunique = TRUE, ensureIntercept = TRUE )
RegSDCcomp( y, compCorr = NA, x = NULL, doSVD = FALSE, makeunique = TRUE, ensureIntercept = TRUE )
y |
Matrix of confidential variables |
compCorr |
Required component score correlations (possibly recycled) |
x |
Matrix of non-confidential variables |
doSVD |
SVD when TRUE and QR when FALSE |
makeunique |
Parameter to be used in GenQR |
ensureIntercept |
Whether to ensure/include a constant term. Non-NULL x is subjected to |
NA component score correlation means independent random. Input matrices are subjected to EnsureMatrix
.
Generated version of y
Øyvind Langsrud
x <- matrix(1:10, 10, 1) y <- matrix(rnorm(30) + 1:30, 10, 3) # Same as IPSO (RegSDCipso) RegSDCcomp(y, NA, x) # Using QR and SVD yQR <- RegSDCcomp(y, c(0.1, 0.2, NA), x) ySVD <- RegSDCcomp(y, c(0.1, 0.2, NA), x, doSVD = TRUE) # Calculation of residuals r <- residuals(lm(y ~ x)) rQR <- residuals(lm(yQR ~ x)) rSVD <- residuals(lm(ySVD ~ x)) # Correlations for two first components as required diag(cor(GenQR(r)$Q, GenQR(rQR)$Q)) diag(cor(GenQR(r, doSVD = TRUE)$Q, GenQR(rSVD, doSVD = TRUE)$Q)) # Identical covariance matrices cov(yQR) - cov(ySVD) cov(rQR) - cov(rSVD) # Identical regression results summary(lm(y[, 1] ~ x)) summary(lm(yQR[, 1] ~ x)) summary(lm(ySVD[, 1] ~ x))
x <- matrix(1:10, 10, 1) y <- matrix(rnorm(30) + 1:30, 10, 3) # Same as IPSO (RegSDCipso) RegSDCcomp(y, NA, x) # Using QR and SVD yQR <- RegSDCcomp(y, c(0.1, 0.2, NA), x) ySVD <- RegSDCcomp(y, c(0.1, 0.2, NA), x, doSVD = TRUE) # Calculation of residuals r <- residuals(lm(y ~ x)) rQR <- residuals(lm(yQR ~ x)) rSVD <- residuals(lm(ySVD ~ x)) # Correlations for two first components as required diag(cor(GenQR(r)$Q, GenQR(rQR)$Q)) diag(cor(GenQR(r, doSVD = TRUE)$Q, GenQR(rSVD, doSVD = TRUE)$Q)) # Identical covariance matrices cov(yQR) - cov(ySVD) cov(rQR) - cov(rSVD) # Identical regression results summary(lm(y[, 1] ~ x)) summary(lm(yQR[, 1] ~ x)) summary(lm(ySVD[, 1] ~ x))
Function that returns a dataset
RegSDCdata(dataset)
RegSDCdata(dataset)
dataset |
Name of data set within the RegSDC package |
sec7data: Data in section 7 of the paper as a data frame
sec7y: Y in section 7 of the paper as a matrix
sec7x: X in section 7 of the paper as a matrix
sec7z: Z in section 7 of the paper as a matrix
sec7xAll: Xall in section 7 of the paper as a matrix
sec7zAll: Zall in section 7 of the paper as a matrix
sec7zAllSupp: As Zall with suppressed values set to NA
data frame
Øyvind Langsrud
RegSDCdata("sec7data") RegSDCdata("sec7y") RegSDCdata("sec7x") RegSDCdata("sec7z") RegSDCdata("sec7xAll") RegSDCdata("sec7zAll") RegSDCdata("sec7zAllSupp")
RegSDCdata("sec7data") RegSDCdata("sec7y") RegSDCdata("sec7x") RegSDCdata("sec7z") RegSDCdata("sec7xAll") RegSDCdata("sec7zAll") RegSDCdata("sec7zAllSupp")
Implementation of the methodology in section 6 in the paper
RegSDChybrid( y, clusters = NULL, xLocal = NULL, xGlobal = NULL, clusterPieces = NULL, xClusterPieces = NULL, groupedClusters = NULL, xGroupedClusters = NULL, alternative = NULL, alpha = NULL, ySim = NULL, returnParts = FALSE, epsAlpha = 1e-07, makeunique = TRUE, tolerance = sqrt(.Machine$double.eps) )
RegSDChybrid( y, clusters = NULL, xLocal = NULL, xGlobal = NULL, clusterPieces = NULL, xClusterPieces = NULL, groupedClusters = NULL, xGroupedClusters = NULL, alternative = NULL, alpha = NULL, ySim = NULL, returnParts = FALSE, epsAlpha = 1e-07, makeunique = TRUE, tolerance = sqrt(.Machine$double.eps) )
y |
Matrix of confidential variables |
clusters |
Vector of cluster coding |
xLocal |
Matrix of x-variables to be crossed with clusters |
xGlobal |
Matrix of x-variables NOT to be crossed with clusters |
clusterPieces |
Vector of coding of cluster pieces |
xClusterPieces |
Matrix of x-variables to be crossed with cluster pieces |
groupedClusters |
Vector of coding of grouped clusters |
xGroupedClusters |
Matrix of x-variables to be crossed with grouped clusters |
alternative |
One of "" (default), "a", "b" or "c" |
alpha |
Possible to specify parameter used internally by alternative "c" |
ySim |
Possible to specify the internally simulated data manually |
returnParts |
Alternative output six matrices: y1 and y2 (fitted), e3s and e4s (new residuals), e3 and e4 (original residuals) |
epsAlpha |
Precision constant for alpha calculation |
makeunique |
Parameter to be used in GenQR |
tolerance |
Parameter to |
Input matrices are subjected to EnsureMatrix
.
Necessary constant terms (intercept) are automatically included.
That is, a column of ones is not needed in the input matrices.
Generated version of y
Øyvind Langsrud
################################################# # Generate example data for introductory examples ################################################# y <- matrix(rnorm(30) + 1:30, 10, 3) x <- matrix(1:10, 10, 1) # x <- 1:10 is equivalent # Same as RegSDCipso(y) yOut <- RegSDChybrid(y) # With a single cluster both are same as RegSDCipso(y, x) yOut <- RegSDChybrid(y, xLocal = x) yOut <- RegSDChybrid(y, xGlobal = x) # Define two clusters clust <- rep(1:2, each = 5) # MHa and MHb in paper yMHa <- RegSDChybrid(y, clusters = clust, xLocal = x) yMHb <- RegSDChybrid(y, clusterPieces = clust, xLocal = x) # An extended variant of MHb as mentioned in paper paragraph below definition of MHa/MHb yMHbExt <- RegSDChybrid(y, clusterPieces = clust, xClusterPieces = x) # Identical means within clusters aggregate(y, list(clust = clust), mean) aggregate(yMHa, list(clust = clust), mean) aggregate(yMHb, list(clust = clust), mean) aggregate(yMHbExt, list(clust = clust), mean) # Identical global regression results summary(lm(y[, 1] ~ x)) summary(lm(yMHa[, 1] ~ x)) summary(lm(yMHb[, 1] ~ x)) summary(lm(yMHbExt[, 1] ~ x)) # MHa: Identical local regression results summary(lm(y[, 1] ~ x, subset = clust == 1)) summary(lm(yMHa[, 1] ~ x, subset = clust == 1)) # MHb: Different results summary(lm(yMHb[, 1] ~ x, subset = clust == 1)) # MHbExt: Same estimates and different std. errors summary(lm(yMHbExt[, 1] ~ x, subset = clust == 1)) ################################################### # Generate example data for more advanced examples ################################################### x <- matrix((1:90) * (1 + runif(90)), 30, 3) x1 <- x[, 1] x2 <- x[, 2] x3 <- x[, 3] y <- matrix(rnorm(90), 30, 3) + x clust <- paste("c", rep(1:3, each = 10), sep = "") ######## Run main algorithm z0 <- RegSDChybrid(y, clusters = clust, xLocal = x3, xGlobal = cbind(x1, x2)) # Corresponding models by lm lmy <- lm(y ~ clust + x1 + x2 + x3:clust) lm0 <- lm(z0 ~ clust + x1 + x2 + x3:clust) # Preserved regression coef (x3 within clusters) coef(lmy) - coef(lm0) # Preservation of x3 coef locally can also be seen by local regression coef(lm(y ~ x3, subset = clust == "c2")) - coef(lm(z0 ~ x3, subset = clust == "c2")) # Covariance matrix preserved cov(resid(lmy)) - cov(resid(lm0)) # But not preserved within clusters cov(resid(lmy)[clust == "c2", ]) - cov(resid(lm0)[clust == "c2", ]) ######## Modification (a) za <- RegSDChybrid(y, clusters = clust, xLocal = x3, xGlobal = cbind(x1, x2), alternative = "a") lma <- lm(za ~ clust + x1 + x2 + x3:clust) # Now covariance matrices preserved within clusters cov(resid(lmy)[clust == "c2", ]) - cov(resid(lma)[clust == "c2", ]) # If we estimate coef for x1 and x2 within clusters, # they become identical and identical to global estimates coef(lma) coef(lm(za ~ clust + x1:clust + x2:clust + x3:clust)) ######## Modification (c) with automatic calculation of alpha # The result depends on the randomly generated data # When the result is that alpha=1, modification (b) is equivalent zc <- RegSDChybrid(y, clusters = clust, xLocal = x3, xGlobal = cbind(x1, x2), alternative = "c") lmc <- lm(zc ~ clust + x1 + x2 + x3:clust) # Preserved regression coef as above coef(lmy) - coef(lmc) # Again covariance matrices preserved within clusters cov(resid(lmy)[clust == "c2", ]) - cov(resid(lmc)[clust == "c2", ]) # If we estimate coef for x1 and x2 within clusters, # results are different from modification (a) above coef(lmc) coef(lm(zc ~ clust + x1:clust + x2:clust + x3:clust)) #################################################### # Make groups of clusters (d) and cluster pieces (e) #################################################### clustGr <- paste("gr", ceiling(rep(1:3, each = 10)/2 + 0.1), sep = "") clustP <- c("a", "a", rep("b", 28)) ######## Modifications (c), (d) and (e) zGrP <- RegSDChybrid(y, clusters = clust, clusterPieces = clustP, groupedClusters = clustGr, xLocal = x3, xGroupedClusters = x2, xGlobal = x1, alternative = "c") # Corresponding models by lm lmGrP <- lm(zGrP ~ clust:clustP + x1 + x2:clustGr + x3:clust - 1) lmY <- lm(y ~ clust:clustP + x1 + x2:clustGr + x3:clust - 1) # Preserved regression coef coef(lmY) - coef(lmGrP) # Identical means within cluster pieces aggregate(y, list(clust = clust, clustP = clustP), mean) aggregate(zGrP, list(clust = clust, clustP = clustP), mean) # Covariance matrix preserved cov(resid(lmY)) - cov(resid(lmGrP)) # Covariance matrices preserved within clusters cov(resid(lmY)[clust == "c2", ]) - cov(resid(lmGrP)[clust == "c2", ]) # Covariance matrices not preserved within cluster pieces cov(resid(lmY)[clustP == "a", ]) - cov(resid(lmGrP)[clustP == "a", ])
################################################# # Generate example data for introductory examples ################################################# y <- matrix(rnorm(30) + 1:30, 10, 3) x <- matrix(1:10, 10, 1) # x <- 1:10 is equivalent # Same as RegSDCipso(y) yOut <- RegSDChybrid(y) # With a single cluster both are same as RegSDCipso(y, x) yOut <- RegSDChybrid(y, xLocal = x) yOut <- RegSDChybrid(y, xGlobal = x) # Define two clusters clust <- rep(1:2, each = 5) # MHa and MHb in paper yMHa <- RegSDChybrid(y, clusters = clust, xLocal = x) yMHb <- RegSDChybrid(y, clusterPieces = clust, xLocal = x) # An extended variant of MHb as mentioned in paper paragraph below definition of MHa/MHb yMHbExt <- RegSDChybrid(y, clusterPieces = clust, xClusterPieces = x) # Identical means within clusters aggregate(y, list(clust = clust), mean) aggregate(yMHa, list(clust = clust), mean) aggregate(yMHb, list(clust = clust), mean) aggregate(yMHbExt, list(clust = clust), mean) # Identical global regression results summary(lm(y[, 1] ~ x)) summary(lm(yMHa[, 1] ~ x)) summary(lm(yMHb[, 1] ~ x)) summary(lm(yMHbExt[, 1] ~ x)) # MHa: Identical local regression results summary(lm(y[, 1] ~ x, subset = clust == 1)) summary(lm(yMHa[, 1] ~ x, subset = clust == 1)) # MHb: Different results summary(lm(yMHb[, 1] ~ x, subset = clust == 1)) # MHbExt: Same estimates and different std. errors summary(lm(yMHbExt[, 1] ~ x, subset = clust == 1)) ################################################### # Generate example data for more advanced examples ################################################### x <- matrix((1:90) * (1 + runif(90)), 30, 3) x1 <- x[, 1] x2 <- x[, 2] x3 <- x[, 3] y <- matrix(rnorm(90), 30, 3) + x clust <- paste("c", rep(1:3, each = 10), sep = "") ######## Run main algorithm z0 <- RegSDChybrid(y, clusters = clust, xLocal = x3, xGlobal = cbind(x1, x2)) # Corresponding models by lm lmy <- lm(y ~ clust + x1 + x2 + x3:clust) lm0 <- lm(z0 ~ clust + x1 + x2 + x3:clust) # Preserved regression coef (x3 within clusters) coef(lmy) - coef(lm0) # Preservation of x3 coef locally can also be seen by local regression coef(lm(y ~ x3, subset = clust == "c2")) - coef(lm(z0 ~ x3, subset = clust == "c2")) # Covariance matrix preserved cov(resid(lmy)) - cov(resid(lm0)) # But not preserved within clusters cov(resid(lmy)[clust == "c2", ]) - cov(resid(lm0)[clust == "c2", ]) ######## Modification (a) za <- RegSDChybrid(y, clusters = clust, xLocal = x3, xGlobal = cbind(x1, x2), alternative = "a") lma <- lm(za ~ clust + x1 + x2 + x3:clust) # Now covariance matrices preserved within clusters cov(resid(lmy)[clust == "c2", ]) - cov(resid(lma)[clust == "c2", ]) # If we estimate coef for x1 and x2 within clusters, # they become identical and identical to global estimates coef(lma) coef(lm(za ~ clust + x1:clust + x2:clust + x3:clust)) ######## Modification (c) with automatic calculation of alpha # The result depends on the randomly generated data # When the result is that alpha=1, modification (b) is equivalent zc <- RegSDChybrid(y, clusters = clust, xLocal = x3, xGlobal = cbind(x1, x2), alternative = "c") lmc <- lm(zc ~ clust + x1 + x2 + x3:clust) # Preserved regression coef as above coef(lmy) - coef(lmc) # Again covariance matrices preserved within clusters cov(resid(lmy)[clust == "c2", ]) - cov(resid(lmc)[clust == "c2", ]) # If we estimate coef for x1 and x2 within clusters, # results are different from modification (a) above coef(lmc) coef(lm(zc ~ clust + x1:clust + x2:clust + x3:clust)) #################################################### # Make groups of clusters (d) and cluster pieces (e) #################################################### clustGr <- paste("gr", ceiling(rep(1:3, each = 10)/2 + 0.1), sep = "") clustP <- c("a", "a", rep("b", 28)) ######## Modifications (c), (d) and (e) zGrP <- RegSDChybrid(y, clusters = clust, clusterPieces = clustP, groupedClusters = clustGr, xLocal = x3, xGroupedClusters = x2, xGlobal = x1, alternative = "c") # Corresponding models by lm lmGrP <- lm(zGrP ~ clust:clustP + x1 + x2:clustGr + x3:clust - 1) lmY <- lm(y ~ clust:clustP + x1 + x2:clustGr + x3:clust - 1) # Preserved regression coef coef(lmY) - coef(lmGrP) # Identical means within cluster pieces aggregate(y, list(clust = clust, clustP = clustP), mean) aggregate(zGrP, list(clust = clust, clustP = clustP), mean) # Covariance matrix preserved cov(resid(lmY)) - cov(resid(lmGrP)) # Covariance matrices preserved within clusters cov(resid(lmY)[clust == "c2", ]) - cov(resid(lmGrP)[clust == "c2", ]) # Covariance matrices not preserved within cluster pieces cov(resid(lmY)[clustP == "a", ]) - cov(resid(lmGrP)[clustP == "a", ])
Implementation of equation 4 in the paper.
RegSDCipso(y, x = NULL, ensureIntercept = TRUE)
RegSDCipso(y, x = NULL, ensureIntercept = TRUE)
y |
Matrix of confidential variables |
x |
Matrix of non-confidential variables |
ensureIntercept |
Whether to ensure/include a constant term. Non-NULL x is subjected to |
Input matrices are subjected to EnsureMatrix
.
Generated version of y
Øyvind Langsrud
x <- matrix(1:5, 5, 1) y <- matrix(rnorm(15) + 1:15, 5, 3) ySynth <- RegSDCipso(y, x) # Identical regression results summary(lm(y[, 1] ~ x)) summary(lm(ySynth[, 1] ~ x)) # Identical covariance matrices cov(y) - cov(ySynth) cov(residuals(lm(y ~ x))) - cov(residuals(lm(ySynth ~ x)))
x <- matrix(1:5, 5, 1) y <- matrix(rnorm(15) + 1:15, 5, 3) ySynth <- RegSDCipso(y, x) # Identical regression results summary(lm(y[, 1] ~ x)) summary(lm(ySynth[, 1] ~ x)) # Identical covariance matrices cov(y) - cov(ySynth) cov(residuals(lm(y ~ x))) - cov(residuals(lm(ySynth ~ x)))
Implementation of equation 12 in the paper.
RegSDCnew(y, yNew, x = NULL, doSVD = FALSE, ensureIntercept = TRUE)
RegSDCnew(y, yNew, x = NULL, doSVD = FALSE, ensureIntercept = TRUE)
y |
Matrix of confidential variables |
yNew |
Matrix of y-data for new scores |
x |
Matrix of non-confidential variables |
doSVD |
SVD when TRUE and QR when FALSE |
ensureIntercept |
Whether to ensure/include a constant term. Non-NULL x is subjected to |
doSVD has effect on decomposition of y and yNew. Input matrices are subjected to EnsureMatrix
.
Generated version of y
Øyvind Langsrud
x <- matrix(1:5, 5, 1) y <- matrix(rnorm(15) + 1:15, 5, 3) # Same as IPSO (RegSDCipso) RegSDCnew(y, matrix(rnorm(15), 5, 3), x) # Close to y RegSDCnew(y, y + 0.001 * matrix(rnorm(15), 5, 3), x)
x <- matrix(1:5, 5, 1) y <- matrix(rnorm(15) + 1:15, 5, 3) # Same as IPSO (RegSDCipso) RegSDCnew(y, matrix(rnorm(15), 5, 3), x) # Close to y RegSDCnew(y, y + 0.001 * matrix(rnorm(15), 5, 3), x)
Implementation based on equations 11, 12 and 17 in the paper.
RegSDCromm(y, lambda = Inf, x = NULL, doSVD = FALSE, ensureIntercept = TRUE)
RegSDCromm(y, lambda = Inf, x = NULL, doSVD = FALSE, ensureIntercept = TRUE)
y |
Matrix of confidential variables |
lambda |
ROMM parameter |
x |
Matrix of non-confidential variables |
doSVD |
SVD when TRUE and QR when FALSE |
ensureIntercept |
Whether to ensure/include a constant term. Non-NULL x is subjected to |
doSVD has effect on decomposition of y.
The exact behaviour of the method depends on the choice of the decomposition method because of
the sequentially phenomenon mentioned in the paper.
The similarity to the original data will tend to be highest for the first component.
Input matrices are subjected to EnsureMatrix
.
Generated version of y
Øyvind Langsrud
x <- matrix(1:5, 5, 1) y <- matrix(rnorm(15) + 1:15, 5, 3) # Same as IPSO (RegSDCipso) RegSDCromm(y, Inf, x) # Close to IPSO RegSDCromm(y, 100, x) # Close to y RegSDCromm(y, 0.001, x)
x <- matrix(1:5, 5, 1) y <- matrix(rnorm(15) + 1:15, 5, 3) # Same as IPSO (RegSDCipso) RegSDCromm(y, Inf, x) # Close to IPSO RegSDCromm(y, 100, x) # Close to y RegSDCromm(y, 0.001, x)
Assume that frequencies to be published, z
, can be computed from inner
frequencies, y
, via z = t(x) %*% y
,
where x
is a dummy matrix.
Assuming correct suppression, this function will generate safe inner cell frequencies as decimal numbers.
SuppressDec( x, z = NULL, y = NULL, suppressed = NULL, digits = 9, nRep = 1, yDeduct = NULL, resScale = NULL, rmse = NULL, sparseLimit = 500 )
SuppressDec( x, z = NULL, y = NULL, suppressed = NULL, digits = 9, nRep = 1, yDeduct = NULL, resScale = NULL, rmse = NULL, sparseLimit = 500 )
x |
Dummy matrix where the dimensions matches z and/or y input. Sparse matrix (Matrix package) is possible. |
z |
Frequencies to be published. All, only the safe ones or with suppressed as NA. |
y |
Inner cell frequencies (see details). |
suppressed |
Logical vector defining the suppressed elements of z. |
digits |
Output close to whole numbers will be rounded using |
nRep |
Integer, when >1, several y's will be generated. Extra columns in output. |
yDeduct |
Values to be subtracted from y and added back after the calculations. Can be used to perform the modulo method described in the paper (see examples). |
resScale |
Residuals will be scaled by resScale |
rmse |
Desired root mean square error (residual standard error). Will be used when resScale is NULL or cannot be used. |
sparseLimit |
Limit for the number of rows of a reduced x-matrix within the algorithm. When exceeded, a sparse algorithm is used
(see |
This function makes use of ReduceX
and RegSDCipso
.
It is not required that y
consists of cell frequencies. A multivariate y
or z
is also possible.
Then several values are possible as digits
, resScale
and rmse
input.
The inner cell frequencies as decimal numbers
Capital letters, X, Y and Z, are used in the paper.
Øyvind Langsrud
# Same data as in the paper z <- RegSDCdata("sec7z") x <- RegSDCdata("sec7x") y <- RegSDCdata("sec7y") # Now z is t(x) %*% y zAll <- RegSDCdata("sec7zAll") zAllSupp <- RegSDCdata("sec7zAllSupp") xAll <- RegSDCdata("sec7xAll") # When no suppression, output is identical to y SuppressDec(xAll, zAll, y) SuppressDec(xAll, zAll) # y can be seen in z # Similar to Y* in paper (but other random values) SuppressDec(x, z, y) # Residual standard error forced to be 1 SuppressDec(x, z, y, rmse = 1) # Seven ways of obtaining the same output SuppressDec(x, z, rmse = 1) # slower, y must be estimated SuppressDec(x, y = y, rmse = 1) SuppressDec(xAll, zAllSupp, y, rmse = 1) SuppressDec(xAll, zAllSupp, rmse = 1) # slower, y must be estimated SuppressDec(xAll, zAll, y, is.na(zAllSupp), rmse = 1) SuppressDec(xAll, zAll, suppressed = is.na(zAllSupp), rmse = 1) # y seen in z SuppressDec(xAll, y = y, suppressed = is.na(zAllSupp), rmse = 1) # YhatMod4 and YhatMod10 in Table 2 in paper SuppressDec(xAll, zAllSupp, y, yDeduct = 4 * (y%/%4), resScale = 0) SuppressDec(xAll, zAllSupp, y, yDeduct = 10 * (y%/%10), rmse = 0) # As data in Table 3 in paper (but other random values) SuppressDec(xAll, zAllSupp, y, yDeduct = 10 * (y%/%10), resScale = 0.1) # rmse instead of resScale and 5 draws SuppressDec(xAll, zAllSupp, y, yDeduct = 10 * (y%/%10), rmse = 1, nRep = 5)
# Same data as in the paper z <- RegSDCdata("sec7z") x <- RegSDCdata("sec7x") y <- RegSDCdata("sec7y") # Now z is t(x) %*% y zAll <- RegSDCdata("sec7zAll") zAllSupp <- RegSDCdata("sec7zAllSupp") xAll <- RegSDCdata("sec7xAll") # When no suppression, output is identical to y SuppressDec(xAll, zAll, y) SuppressDec(xAll, zAll) # y can be seen in z # Similar to Y* in paper (but other random values) SuppressDec(x, z, y) # Residual standard error forced to be 1 SuppressDec(x, z, y, rmse = 1) # Seven ways of obtaining the same output SuppressDec(x, z, rmse = 1) # slower, y must be estimated SuppressDec(x, y = y, rmse = 1) SuppressDec(xAll, zAllSupp, y, rmse = 1) SuppressDec(xAll, zAllSupp, rmse = 1) # slower, y must be estimated SuppressDec(xAll, zAll, y, is.na(zAllSupp), rmse = 1) SuppressDec(xAll, zAll, suppressed = is.na(zAllSupp), rmse = 1) # y seen in z SuppressDec(xAll, y = y, suppressed = is.na(zAllSupp), rmse = 1) # YhatMod4 and YhatMod10 in Table 2 in paper SuppressDec(xAll, zAllSupp, y, yDeduct = 4 * (y%/%4), resScale = 0) SuppressDec(xAll, zAllSupp, y, yDeduct = 10 * (y%/%10), rmse = 0) # As data in Table 3 in paper (but other random values) SuppressDec(xAll, zAllSupp, y, yDeduct = 10 * (y%/%10), resScale = 0.1) # rmse instead of resScale and 5 draws SuppressDec(xAll, zAllSupp, y, yDeduct = 10 * (y%/%10), rmse = 1, nRep = 5)
Implementation of equation 21 in the paper.
Z2Yhat(z, x, digits = 9)
Z2Yhat(z, x, digits = 9)
z |
Z as a matrix |
x |
X as a matrix |
digits |
When non-NULL, output values close to whole numbers will be rounded using
|
Generalized inverse is computed by ginv
.
In practise, the computations can be speeded up using reduced versions of X and Z. See ReduceX
.
Yhat as a matrix
Øyvind Langsrud
# Same data as in the paper z <- RegSDCdata("sec7z") x <- RegSDCdata("sec7x") Z2Yhat(z, x) # With y known, yHat can be computed in other ways y <- RegSDCdata("sec7y") # Now z is t(x) %*% y fitted(lm(y ~ x - 1)) IpsoExtra(y, x, FALSE, resScale = 0)
# Same data as in the paper z <- RegSDCdata("sec7z") x <- RegSDCdata("sec7x") Z2Yhat(z, x) # With y known, yHat can be computed in other ways y <- RegSDCdata("sec7y") # Now z is t(x) %*% y fitted(lm(y ~ x - 1)) IpsoExtra(y, x, FALSE, resScale = 0)