Introduction to twoPhaseGAS

Overview

twoPhaseGAS provides tools for the design and analysis of two-phase genetic association studies (GAS) in which a subset of the full cohort is selected for targeted sequencing (phase 2), while the remaining individuals contribute only GWAS-level data (phase 1).

The package offers:

Data simulation via DataGeneration_TPD().
Phase 2 design optimisation via twoPhaseDesign() / twoPhase().
Joint analysis of phase 1 and phase 2 data via twoPhaseSPML(), which implements a semiparametric maximum likelihood (SPML) estimator using the EM algorithm.

1. Simulating two-phase data

library(twoPhaseGAS)

data.table::setDTthreads(1)

set.seed(42)
data <- DataGeneration_TPD(
  Beta0  = 2,    # intercept
  Beta1  = 0.5,  # genetic effect (G on Y)
  Disp = 1,      # error variance for default gaussian family
  N      = 3000, # phase 1 cohort size
  LD.r   = 0.75, # LD (r) between causal SNP G and GWAS SNP Z
  P_g    = 0.20, # MAF of G
  P_z    = 0.30  # MAF of Z
)
#> Beta0=2 Beta1=0.5
head(data)
#>   wait_it        Y G1 Z G0 S
#> 1       1 1.314338  0 1  0 1
#> 2       1 1.207286  0 2  0 1
#> 3       1 1.592996  0 0  0 1
#> 4       1 1.351329  1 1  1 1
#> 5       1 3.115760  0 0  0 3
#> 6       1 1.120543  0 0  1 1

The simulated data frame contains:

Column	Description
`Y`	Continuous outcome
`G1`	Causal sequence variant (absent from phase 1 GWAS)
`Z`	GWAS SNP (available for everyone)
`S`	Stratum variable derived from residuals

2. Selecting a phase 2 subsample

Here we use simple random sampling; twoPhaseDesign() can be used for optimised designs.

set.seed(1)
n2   <- 500                              # phase 2 size
R    <- rep(0L, nrow(data))
R[sample(nrow(data), n2)] <- 1L         # 1 = selected for phase 2

data[R == 0, c("G1")]  <- NA # Make all phase 1 data complement G1 values  missing

cat("Phase 1 complement:", sum(R == 0), "individuals\n")
#> Phase 1 complement: 2500 individuals
cat("Phase 2:           ", sum(R == 1), "individuals\n")
#> Phase 2:            500 individuals

3. Analysis under H₀ (score test)

When the missing-by-design covariate (miscov) is absent from formula, twoPhaseSPML() fits the null model and returns score statistics.

res_Ho <- twoPhaseSPML(
  formula = Y ~ Z,
  miscov  = ~ G1,
  auxvar  = ~ Z,
  data   = data
)

print(res_Ho)
#> 
#> Call:
#> twoPhaseSPML(formula = Y ~ Z, miscov = ~G1, auxvar = ~Z, family = gaussian)
#> 
#> Coefficients:
#>             Estimate
#> (Intercept)    2.016
#> Z              0.290
#> 
#> Family: gaussian  Link: identity 
#> Dispersion: 1.074 
#> Log-likelihood: -7382  (EM iterations: 45 )
#> 
#> Hypothesis test for missing-by-design covariate(s): G1 
#>   [Null model - Score statistics]
#>   Observed score statistic (Sobs): 26.06 
#>   Expected score statistic (Sexp): 25.51 
#>   Degrees of freedom: 1 
#>   p-value (chi-squared): 3.313e-07

The print method (modelled on glm) displays:

the fitted call,
estimated regression coefficients,
family and link,
log-likelihood and number of EM iterations, and
the observed and expected score statistics with a chi-squared p-value.

4. Analysis under Hₐ (Wald test)

Including miscov in formula switches to the alternative model. The EM algorithm now estimates the effect of G1 jointly with the regression parameters, and Wald statistics are reported.

res_Ha <- twoPhaseSPML(
  formula = Y ~ Z + G1,
  miscov  = ~ G1,
  auxvar  = ~ Z,
  data   = data
)

print(res_Ha)
#> 
#> Call:
#> twoPhaseSPML(formula = Y ~ Z + G1, miscov = ~G1, auxvar = ~Z, family = gaussian)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  2.00924    0.02548  78.846  < 2e-16 ***
#> Z           -0.09125    0.06188  -1.474     0.14    
#> G1           0.61235    0.08688   7.048 1.81e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Family: gaussian  Link: identity 
#> Dispersion: 1.021 
#> Log-likelihood: -7367  (EM iterations: 53 )
#> 
#> Hypothesis test for missing-by-design covariate(s): G1 
#>   [Alternative model - Wald statistics]
#>   Observed Wald statistic (Wobs): 49.68 
#>   Expected Wald statistic (Wexp): 38.87 
#>   Degrees of freedom: 1 
#>   p-value (chi-squared): 1.811e-12

5. Accessing results programmatically

The returned object is a list of class "twoPhaseSPML".

# Regression coefficients
res_Ha$theta
#> (Intercept)           Z          G1 
#>  2.00923599 -0.09124575  0.61234907

# Log-likelihood
res_Ha$ll
#> [1] -7367.262

# Estimated joint distribution of G1 and Z
head(res_Ha$qGZ)
#>   G1 Z           q
#> 1  0 0 0.477127116
#> 2  1 0 0.004206217
#> 3  0 1 0.151060698
#> 4  1 1 0.269605968
#> 5  0 2 0.012126314
#> 6  1 2 0.049821574

# Wald statistic and degrees of freedom
cat("Wobs =", res_Ha$Wobs, "  df =", res_Ha$df, "\n")
#> Wobs = 49.67874   df = 1
cat("p-value =", pchisq(res_Ha$Wobs, df = res_Ha$df, lower.tail = FALSE), "\n")
#> p-value = 1.810981e-12

6. Using `start.values` for warm starts

When analysing many variants, passing converged estimates from a nearby model as start.values can reduce computation time.

# Use null-model estimates as starting point for the alternative model
res_warm <- twoPhaseSPML(
  formula     = Y ~ Z + G1,
  miscov      = ~ G1,
  auxvar      = ~ Z,
  data       = data,
  start.values = list(betas = res_Ho$theta,
                      q     = res_Ho$qGZ)
)

Session information

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] twoPhaseGAS_1.2.5 rmarkdown_2.31   
#> 
#> loaded via a namespace (and not attached):
#>  [1] dfoptim_2023.1.0    cli_3.6.6           knitr_1.51         
#>  [4] rlang_1.2.0         xfun_0.57           jsonlite_2.0.0     
#>  [7] data.table_1.18.4   buildtools_1.0.0    bigmemory_4.6.4    
#> [10] htmltools_0.5.9     maketools_1.3.2     sys_3.4.3          
#> [13] sass_0.4.10         grid_4.6.0          evaluate_1.0.5     
#> [16] jquerylib_0.1.4     kofnGA_1.3          MASS_7.3-65        
#> [19] fastmap_1.2.0       yaml_2.3.12         lifecycle_1.0.5    
#> [22] bigmemory.sri_0.1.8 compiler_4.6.0      Rcpp_1.1.1-1.1     
#> [25] enrichwith_0.5.0    lattice_0.22-9      digest_0.6.39      
#> [28] nloptr_2.2.1        R6_2.6.1            parallel_4.6.0     
#> [31] bslib_0.11.0        Matrix_1.7-5        uuid_1.2-2         
#> [34] tools_4.6.0         cachem_1.1.0