twoPhaseGAS provides tools for the design and analysis of two-phase genetic association studies (GAS) in which a subset of the full cohort is selected for targeted sequencing (phase 2), while the remaining individuals contribute only GWAS-level data (phase 1).
The package offers:
DataGeneration_TPD().twoPhaseDesign() / twoPhase().twoPhaseSPML(), which implements a semiparametric maximum
likelihood (SPML) estimator using the EM algorithm.library(twoPhaseGAS)
data.table::setDTthreads(1)
set.seed(42)
data <- DataGeneration_TPD(
Beta0 = 2, # intercept
Beta1 = 0.5, # genetic effect (G on Y)
Disp = 1, # error variance for default gaussian family
N = 3000, # phase 1 cohort size
LD.r = 0.75, # LD (r) between causal SNP G and GWAS SNP Z
P_g = 0.20, # MAF of G
P_z = 0.30 # MAF of Z
)
#> Beta0=2 Beta1=0.5
head(data)
#> wait_it Y G1 Z G0 S
#> 1 1 1.314338 0 1 0 1
#> 2 1 1.207286 0 2 0 1
#> 3 1 1.592996 0 0 0 1
#> 4 1 1.351329 1 1 1 1
#> 5 1 3.115760 0 0 0 3
#> 6 1 1.120543 0 0 1 1The simulated data frame contains:
| Column | Description |
|---|---|
Y |
Continuous outcome |
G1 |
Causal sequence variant (absent from phase 1 GWAS) |
Z |
GWAS SNP (available for everyone) |
S |
Stratum variable derived from residuals |
Here we use simple random sampling; twoPhaseDesign() can
be used for optimised designs.
set.seed(1)
n2 <- 500 # phase 2 size
R <- rep(0L, nrow(data))
R[sample(nrow(data), n2)] <- 1L # 1 = selected for phase 2
data[R == 0, c("G1")] <- NA # Make all phase 1 data complement G1 values missing
cat("Phase 1 complement:", sum(R == 0), "individuals\n")
#> Phase 1 complement: 2500 individuals
cat("Phase 2: ", sum(R == 1), "individuals\n")
#> Phase 2: 500 individualsWhen the missing-by-design covariate (miscov) is
absent from formula,
twoPhaseSPML() fits the null model and returns score
statistics.
res_Ho <- twoPhaseSPML(
formula = Y ~ Z,
miscov = ~ G1,
auxvar = ~ Z,
data = data
)
print(res_Ho)
#>
#> Call:
#> twoPhaseSPML(formula = Y ~ Z, miscov = ~G1, auxvar = ~Z, family = gaussian)
#>
#> Coefficients:
#> Estimate
#> (Intercept) 2.016
#> Z 0.290
#>
#> Family: gaussian Link: identity
#> Dispersion: 1.074
#> Log-likelihood: -7382 (EM iterations: 45 )
#>
#> Hypothesis test for missing-by-design covariate(s): G1
#> [Null model - Score statistics]
#> Observed score statistic (Sobs): 26.06
#> Expected score statistic (Sexp): 25.51
#> Degrees of freedom: 1
#> p-value (chi-squared): 3.313e-07The print method (modelled on glm)
displays:
Including miscov in formula switches to the
alternative model. The EM algorithm now estimates the effect of
G1 jointly with the regression parameters, and Wald
statistics are reported.
res_Ha <- twoPhaseSPML(
formula = Y ~ Z + G1,
miscov = ~ G1,
auxvar = ~ Z,
data = data
)
print(res_Ha)
#>
#> Call:
#> twoPhaseSPML(formula = Y ~ Z + G1, miscov = ~G1, auxvar = ~Z, family = gaussian)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.00924 0.02548 78.846 < 2e-16 ***
#> Z -0.09125 0.06188 -1.474 0.14
#> G1 0.61235 0.08688 7.048 1.81e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Family: gaussian Link: identity
#> Dispersion: 1.021
#> Log-likelihood: -7367 (EM iterations: 53 )
#>
#> Hypothesis test for missing-by-design covariate(s): G1
#> [Alternative model - Wald statistics]
#> Observed Wald statistic (Wobs): 49.68
#> Expected Wald statistic (Wexp): 38.87
#> Degrees of freedom: 1
#> p-value (chi-squared): 1.811e-12The returned object is a list of class
"twoPhaseSPML".
# Regression coefficients
res_Ha$theta
#> (Intercept) Z G1
#> 2.00923599 -0.09124575 0.61234907
# Log-likelihood
res_Ha$ll
#> [1] -7367.262
# Estimated joint distribution of G1 and Z
head(res_Ha$qGZ)
#> G1 Z q
#> 1 0 0 0.477127116
#> 2 1 0 0.004206217
#> 3 0 1 0.151060698
#> 4 1 1 0.269605968
#> 5 0 2 0.012126314
#> 6 1 2 0.049821574
# Wald statistic and degrees of freedom
cat("Wobs =", res_Ha$Wobs, " df =", res_Ha$df, "\n")
#> Wobs = 49.67874 df = 1
cat("p-value =", pchisq(res_Ha$Wobs, df = res_Ha$df, lower.tail = FALSE), "\n")
#> p-value = 1.810981e-12start.values for warm startsWhen analysing many variants, passing converged estimates from a
nearby model as start.values can reduce computation
time.
# Use null-model estimates as starting point for the alternative model
res_warm <- twoPhaseSPML(
formula = Y ~ Z + G1,
miscov = ~ G1,
auxvar = ~ Z,
data = data,
start.values = list(betas = res_Ho$theta,
q = res_Ho$qGZ)
)sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] twoPhaseGAS_1.2.5 rmarkdown_2.31
#>
#> loaded via a namespace (and not attached):
#> [1] dfoptim_2023.1.0 cli_3.6.6 knitr_1.51
#> [4] rlang_1.2.0 xfun_0.57 jsonlite_2.0.0
#> [7] data.table_1.18.4 buildtools_1.0.0 bigmemory_4.6.4
#> [10] htmltools_0.5.9 maketools_1.3.2 sys_3.4.3
#> [13] sass_0.4.10 grid_4.6.0 evaluate_1.0.5
#> [16] jquerylib_0.1.4 kofnGA_1.3 MASS_7.3-65
#> [19] fastmap_1.2.0 yaml_2.3.12 lifecycle_1.0.5
#> [22] bigmemory.sri_0.1.8 compiler_4.6.0 Rcpp_1.1.1-1.1
#> [25] enrichwith_0.5.0 lattice_0.22-9 digest_0.6.39
#> [28] nloptr_2.2.1 R6_2.6.1 parallel_4.6.0
#> [31] bslib_0.11.0 Matrix_1.7-5 uuid_1.2-2
#> [34] tools_4.6.0 cachem_1.1.0