---
title: "Generating Pseudo Population"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{generating_pseudo_population}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

A pseudo population dataset is computed using a user-defined causal inference approach (e.g., matching or weighting). A covariate balance test is then performed on the pseudo population dataset. Users can specify covariate balance criteria, activate an adaptive approach, and set the number of attempts used to search for a pseudo population dataset that meets those criteria.

## Usage

Input parameters:

**`w`** A data.frame of the observed continuous exposure, including `id` and `w` columns. \
**`c`** A data frame or matrix of observed baseline covariates, including an `id` column. \
**`ci_appr`** The causal inference approach. Options are "matching" and "weighting". \
**`dist_measure`** Distance measuring function. \
**`scale`** Scale parameter that controls the relative weight attributed to the distance measures of the exposure versus the GPS estimates. \
**`delta_n`** Caliper parameter on the exposure. \
**`covar_bl_method`** Covariate balance method. \
**`covar_bl_trs`** Covariate balance threshold. \
**`max_attempt`** Maximum number of attempts to satisfy covariate balance.

## Technical Details for Matching

The matching algorithm aims to match an observed unit $j$ to each $j'$ at each exposure level $w^{(l)}$.

1) We specify **`delta_n`** ($\delta_n$), a caliper for any exposure level $w$, which defines equally sized bins of the form $[w-\delta_n, w+\delta_n]$. Based on the caliper **`delta_n`**, we define a predetermined set of $L$ exposure levels $\{w^{(1)}=\min(w)+ \delta_n,w^{(2)}=\min(w)+3 \delta_n,...,w^{(L)} = \min(w)+(2L-1) \delta_n\}$, where $L = \lfloor \frac{\max(w)-\min(w)}{2\delta_n} + \frac{1}{2} \rfloor$.
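As a quick check of these formulas, consider illustrative values that are assumptions for this example only (they are not taken from any dataset): $\min(w) = 0$, $\max(w) = 10$, and $\delta_n = 1$. Then
$$
L = \left\lfloor \frac{10 - 0}{2 \times 1} + \frac{1}{2} \right\rfloor = \lfloor 5.5 \rfloor = 5,
\qquad
\{w^{(1)}, \ldots, w^{(5)}\} = \{1, 3, 5, 7, 9\},
$$
so the bins $[0,2], [2,4], [4,6], [6,8], [8,10]$ cover the observed exposure range.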
Each exposure level $w^{(l)}$ is the midpoint of an equally sized bin, $[w^{(l)}-\delta_n, w^{(l)}+\delta_n]$.

2) We implement a nested-loop algorithm, with $l$ in $1,2,\ldots, L$ as the outer loop and $j'$ in $1 ,\ldots,N$ as the inner loop. The algorithm outputs the final product of our design stage, i.e., a matched set with $N\times L$ units. \
**for** $l = 1,2,\ldots, L$ **do** \
&nbsp;&nbsp;Choose **one** exposure level of interest $w^{(l)} \in \{w^{(1)}, w^{(2)}, ..., w^{(L)}\}$. \
&nbsp;&nbsp;**for** $j' = 1 ,\ldots,N$ **do** \
\setlength{\leftskip}{0pt}
&nbsp;&nbsp;2.1 Evaluate the GPS $\hat{e}(w^{(l)}, \mathbf{c}_{j'})$ (for short, $e^{(l)}_{j'}$) at $w^{(l)}$, using the previously fitted GPS model, for each unit $j'$ with observed covariates $\mathbf{c}_{j'}$. \
&nbsp;&nbsp;2.2 Implement the matching to find **an** observed unit, denoted by $j$, that is matched with $j'$ with respect to both the exposure, $w_{j}\approx w^{(l)}$, and the estimated GPS, $\hat{e}(w_j, \mathbf{c}_{j}) \approx e^{(l)}_{j'}$ (under a standardized Euclidean transformation). More specifically, we find $j$ as
$$
j_{gps}(e^{(l)}_{j'},w^{(l)}) = \underset{j:\, w_j \in [w^{(l)}-\delta_n,\, w^{(l)}+\delta_n]}{\arg\min} \ \big\lVert \big( \lambda \hat{e}^{*}(w_j,\mathbf{c}_j),\ (1-\lambda) w^{*}_j \big) - \big( \lambda e_{j'}^{(l)*},\ (1-\lambda) w^{(l)*} \big) \big\rVert,
$$
where **`dist_measure`** ($\lVert \cdot \rVert$) is a pre-specified two-dimensional metric, **`scale`** ($\lambda$) is the scale parameter assigning weights to the two dimensions (i.e., the GPS and the exposure), and $\delta_n$ is the caliper defined in step 1, so that only units $j$ with an observed exposure $w_j \in [w^{(l)}-\delta_n,w^{(l)}+\delta_n]$ can be matched. \
&nbsp;&nbsp;2.3 Impute $Y_{j'}(w^{(l)})$ as: $\hat{Y}_{j'}(w^{(l)})=Y^{obs}_{j_{gps}(e^{(l)}_{j'},w^{(l)})}$.
\
&nbsp;&nbsp;**end for** \
\begin{itemize}
&nbsp;&nbsp;Note: We allow multiple $j'$ (e.g., $j' = 1$ and $j' = 5$) to be matched with the same observed unit $j$ throughout the inner loop $j'$ in $1 ,\ldots,N$ (matching with replacement).
\end{itemize}
**end for**

3) After implementing the matching algorithm, we construct the matched set with $N\times L$ units by combining all $\hat{Y}_{j'}(w^{(l)})$ for $j'=1,\ldots,N$ and for all $w^{(l)} \in \{w^{(1)},w^{(2)},...,w^{(L)}\}$.

## Technical Details for Covariate Balance

We introduce the absolute correlation measure (**`covar_bl_method`** = "absolute") to assess covariate balance for continuous exposures. The absolute correlation between the exposure and each pre-exposure covariate is a global measure and can indicate whether the whole matched set is balanced. This measure builds upon the work of [@austin2019assessing], which examines covariate balance conditions with continuous exposures. We adapt it to the proposed matching framework.

In a balanced pseudo population dataset, the correlations between the exposure and the pre-exposure covariates should be close to zero, that is, $E [\mathbf{c}_{i}^{*} w_{i}^{*} ] \approx \mathbf{0}.$ We calculate the absolute correlation in the pseudo population dataset as
\begin{align*}
\big\lvert \sum_{i=1}^{N\times L} \mathbf{c}_{i}^{*} w_{i}^{*} \big\rvert
\end{align*}

The average absolute correlation is defined as the average of the absolute correlations across all covariates. The balance criterion on the average absolute correlation is:
\begin{align*}
\overline{\big\lvert \sum_{i=1}^{N\times L} \mathbf{c}_{i}^{*} w_{i}^{*} \big\rvert} < \boldsymbol{\epsilon}_1.
\end{align*}

We set a pre-specified threshold **`covar_bl_trs`** ($\boldsymbol{\epsilon}_1$), for example 0.1, on the average absolute correlation as the covariate balance criterion for the pseudo population dataset.

## References