AutoWMM: Automating the WMM on Trees

library(AutoWMM)

Introduction

In population size estimation, methods based on back-calculation (multiplier methods) are a popular approach to estimating the size of a target population which is partially hidden and not directly enumerable from existing data. The basis behind this approach is to use a subpopulation with known count and a estimate of the proportion of the target population belonging to this subgroup to estimate the size of the target population. However, this basic method is not applicable when many subgroup counts are available, in which case evidence could be synthesized to provide a more accurate estimate of the target population size. If subgroups are mutually exclusive, a tree-like structure can be created with the target population at the root node, and children (or grandchildren) of the root representing these subgroups with known counts. In this case, to combine available evidence a weighted sum of estimates from each back-calculated path can be made, which is automated and achieved with this package using the weighted multiplier method (WMM) (Flynn and Gustafson 2024b). Variance-minimizing weights are used to provide an estimate of the root population size on any admissible tree-structured data, as well as options for visual rendering of trees for reporting of results.

The implementation behind these functions is described at length elsewhere (Flynn and Gustafson 2024a), and is based on previously developed methodology (Flynn and Gustafson 2024b). A more extensive application to real-world data can be found in (Flynn, Gustafson, and Irvine 2024).

makeTree()

The makeTree function creates a tree object from any admissible dataframe; for a dataframe to be admissible, it must contain specific column labels. These columns include the following:

  • ‘from’ (string, node label)

  • ‘to’ (string, node label)

  • ‘Estimate’ (+ integer)

  • ‘Total’ (+ integer)

  • ‘Count’ (+ integer)

The root node is the assumed to represent the target population for which estimation is sought. The makeTree function will accept a data.frame object with these columns, and create a data.tree object (from data.tree package) from data.frame; it also checks the dataframe columns and structure to ensure the data.tree object can be used for root node population size estimation.

Further details on each column are as follows:

  • from’ (node label) and ‘to’ (node label), which encode the tree structure. from and to describe the edge for that row of data.

  • Estimate’ (+ integer), ‘Total’ (+ integer), which are used as parameters in a Beta(Estimate+1,Total-Estimate+1) distribution at each branch, or, if ‘Total’ is equal for all branches in a sibling group, then a Dirichlet distributions with parameters Estimate+1 from each child of a sibling group is used. Estimate and Total are assumed to come from surveys of size Total (a sample of the population at node from), and observe Estimate number of those individuals at Total which move to the node described by to. Used for branch probabilities only.

  • Count’ (for terminal nodes with marginal counts). Count column is NA for rows where to nodes are not leaves, and also for all leaves without a marginal count.

A Population (logical) column is not needed, but can be added if Estimate and Total come from population numbers, rather than samples. A Description column (string) is also possible to include if particular descriptions are desired on the visual diagram made by drawTree function.

An example of this package in use can be seen in the following:

## create admissible dataset
treeData <- data.frame("from" = c("Z", "Z", "A", "A"),
                        "to" = c("A", "B", "C", "D"),
                        "Estimate" = c(4, 34, 9, 1),
                        "Total" = c(11, 70, 10, 10),
                        "Count" = c(NA, 500, NA, 50),
                        "Population" = c(FALSE, FALSE, FALSE, FALSE),
                        "Description" = c("First child of the root", "Second child of the root",
                        "First grandchild", "Second grandchild"))

## make tree object using makeTree
tree <- makeTree(treeData)
tree
#>   levelName
#> 1 Z        
#> 2  ¦--A    
#> 3  ¦   ¦--C
#> 4  ¦   °--D
#> 5  °--B

drawTree()

The drawTree function creates a descriptive diagram of a tree object created using makeTree, which allows the user to visualize the tree before it is used for WMM estimation. Prints descriptions or node labels on nodes, and probabilities based on previous surveys to branches. Specifically, node descriptions are given within respective nodes if provided by Description column in dataframe used in makeTree, and branch probabilities calculated using the ratio of data columns Estimate over Total are given along tree branches.

The function has three arguments, the first being the makeTree tree object the user would like the render. The user may also specify an argument probs, which takes the values TRUE/FALSE and determines whether probabilities will be displayed along tree branches. Lastly, the argument desc takes the values TRUE/FALSE and determines whether descriptions will be displayed in each node; for desc=TRUE, a Description column must be included as part of the data.frame used in the makeTree function tree construction.

A use case, using the above tree created by makeTree, is as follows:

## draw tree pre-estimation, with descriptions on nodes (default), and suppressing probabilities on branching
drawTree(tree, probs = FALSE)

wmmTree()

This function is the central functionality of the AutoWMM package, and performs WMM estimation on the tree created with makeTree. It compute an estimate of the size of a target population represented by the root node of tree-structure data. The wmmTree function accepts a tree object made using the makeTree function, and generates an estimate of the root node (the target population) by combining multiple back-calculated path estimates. This function will generate a weighted estimate using variance-minimizing weights, which combines back-calculated estimates of the root via the multiplier method. It will incorporate data from all leaves with both 1) known marginal leaf counts at the terminal point of the root to leaf path, and 2) available branch probability estimates for each segment along the root-to-leaf path. These paths are deemed “informative” (Flynn and Gustafson 2024b).
The wmmTree function uses the following arguments:

  • tree: A tree object created using makeTree function.

  • sample_length: The desired number of estimates of the target (root) node.

  • method: The method of back-calculation. Current version only supports the default multiplier method to produce back-calculated estimates using an internal function, method = "mmEstimate", unless the tree is a special case using single-source sibling data only. In the latter case, the method uses closed forms to generate path specific means and variances and WMM estimate (see single.source argument below).

  • int.type: The type of confidence interval desired. The default, and recommended, interval produced is the central 95% using quantiles (int.type = "quants"}. Setting this argument to var generates a 95% confidence interval based on sample variance, while the cox option for this argument generates the Cox interval for log-normal data.

  • single.source: This logical argument should be set to TRUE only when all sibling groups which have branches that are used in back-calculated paths are fully informed by a single source of data. In this special case, sampling can be bypassed, and closed forms are used to generate the WMM root estimate and it’s uncertainty; at this time, this method provides an approximation only as paths are assumed independent.

Back-calculation from each leaf proceeds as follows. Surveys or literature estimates are assumed to inform Estimate and Total columns of dataframe. These branches are then sampled ‘sample_length’ number of times (i.e., number of runs) using a variety of methods, depending on the sources of data being used to inform sibling branches. For example, two sibling branches informed by a single survey which observed the movement from parent to child node of Estimate individuals in a survey with sample size Total, can be assigned a Beta(Estimate+1, Total-Estimate+1) distribution. For more complex configurations of source knowledge, importance sampling and rejection schemes are also employed to ensure consistency among sibling branch groups and root estimates for a given run. For each leaf with non-empty Count, we back-calculate by multiplying by the sampled inverse probabilities of each branch along the root-to-leaf path.

The function ultimately generates sample_length number of weighted estimates of the root target population. Using functionality of data.tree package on makeTree object also provides:

  • confidence intervals of estimates of the root provided by leaves with marginal counts, as node$uncertainty

  • probability samples can be accessed as node$probability_samples

  • root estimates from terminal nodes with non-empty count as stored as node$targetEst_samples

The output of the function is a list of four entries; WMM root node estimate and corresponding uncertainty, a vector of estimates given on each run, and the weights associated with each path which provides a root node estimate (an informative path). Specifically, these four outputs are as follows:

  • The first is a list with four entries: in the first position is the WMM root node size estimate given by the synthesis across all informative paths. The second entry is the uncertainty associated with the root node size estimate. In the third entry we find a vector of weights which sum to one, with length equal to the number of informative paths and the weight associated with those paths. In the fourth and final position, a vector of length sample_length of estimates of the target (root) population size given by the WMM for each run.

  • The second entry can alternatively be accessed using the Get function through data.tree to access the node attribute uncertainty; it gives the 95% confidence intervals for estimates given by leaves with marginal counts, as well as the root node.

  • The third entry can alternatively be accessed using the Get function through data.tree to access the node attribute probability_samples; it returns vectors of probabilities sampled for all branches leading into each node.

  • The fourth entry can alternatively be accessed using the Get function through data.tree to access the node attribute targetEst_samples; it returns vectors of root estimates calculated for all paths with marginal counts on the respective root-to-leaf path.

The calculation of estimates relies on the internal functions confInts, ko.weights, meanlogEstimates, mmEstimate, root.confInt, rootEsts, logEstimates, and sampleBeta.

The function can be demonstrated using the tree created in the makeTree section above:

## perform root node estimation
## small sample_length was chosen for efficiency across machines
Zhats <- wmmTree(tree, sample_length = 3)
#> using variance-weighted mean with multiplier method sampled path estimates

The user can then print the estimates of the root node generated by each iteration, the weights of each branch, the final estimate of the root node population size calculated using the WMM, and the final rounded estimate of the root:

# print the estimates of the root node generated by the iterations
Zhats$estimates 
#>        [,1]
#> D   1221.79
#> D.1 1303.49
#> D.2 1055.86
# prints the weights of each branch
Zhats$weights 
#>              D         B
#> [1,] 0.4700522 0.5299478
# prints the final estimate of the root node by WMM
Zhats$root 
#>       Z 
#> 1189.15
# prints the final rounded estimate of the root with conf. int.
Zhats$uncertainty 
#>          Z
#> lower 1064
#> upper 1299

The user may also use the data.tree functionality with the makeTree object to obtain the average root estimate with a 95% confidence interval, the samples generated from each path which provided the root estimate, as well as the sampled probabilities for each branch over the iterations:

## show the average root estimate with 95\% confidence interval, as well as
## average estimates with confidence intervals for each node with a marginal
## count
tree$Get('uncertainty')
#>          Z  A  C        D        B
#> lower 1064 NA NA 1574.829 748.6592
#> upper 1299 NA NA 1983.099 942.2195

## show the samples generated from each path which provides root estimates
tree$Get('targetEst_samples')
#> $Z
#> numeric(0)
#> 
#> $A
#> numeric(0)
#> 
#> $C
#> numeric(0)
#> 
#> $D
#>        D        D        D 
#> 1632.140 2003.532 1571.869 
#> 
#> $B
#> [1] 945.0367 890.2601 741.8644

## show the probabilities sampled at each branch leading into the given node
tree$Get('probability_samples')
#>       Z         A         C          D         B
#> [1,] NA 0.4709200 0.9349473 0.06505273 0.5290800
#> [2,] NA 0.4383664 0.9430706 0.05692938 0.5616336
#> [3,] NA 0.3260224 0.9024323 0.09756773 0.6739776

A second example using a slightly larger tree can be seen as follows:

## create 2nd admissible dataset
## this example handles many branch sampling cases, including all siblings informed from different surveys, same survey, and mixed case, as well as some siblings not informed and the rest from different surveys, same survey, and mixed case.
treeData2 <- data.frame("from" = c("Z", "Z", "Z",
                                    "A", "A",
                                    "B", "B", "B",
                                    "C", "C", "C",
                                    "H", "H", "H",
                                    "K", "K", "K"),
                        "to" = c("A", "B", "C",
                                  "D", "E",
                                  "F", "G", "H",
                                  "I", "J", "K",
                                  "L", "M", "N",
                                  "O", "P", "Q"),
                        "Estimate" = c(24, 34, 12,
                                      9, 1,
                                      NA, 19, 1,
                                      NA, 2, 1,
                                      20, 10, 12,
                                      5, 3, NA),
                        "Total" = c(70, 70, 70,
                                    10, 11,
                                    NA, 30, 8,
                                    NA, 12, 12,
                                    40, 40, 40,
                                    10, 10, NA),
                        "Count" = c(NA, NA, NA,
                                    50, NA,
                                    NA, 15, NA,
                                    NA, 10, NA,
                                    NA, NA, 20,
                                    5, 2, NA))

## make tree object using makeTree
tree2 <- makeTree(treeData2)
#> WARNING: No 'Population' column exists. We assume all
#>           values are sample estimates

## perform root node estimation
Zhats <- wmmTree(tree2, sample_length = 3)
#> using variance-weighted mean with multiplier method sampled path estimates
Zhats$estimates # print the estimates of the root node generated by the 15 iterations
#>       [,1]
#> A   862.41
#> A.1 300.94
#> A.2 678.22
Zhats$weights # prints the weights of each branch
#>              D           G         N        J         O         P
#> [1,] 0.1099031 0.001465447 0.3183321 0.220356 0.1963685 0.1535749
Zhats$root # prints the final estimate of the root node by WMM
#>      Z 
#> 560.43
Zhats$uncertainty # prints the final rounded estimate of the root with conf. int.
#>         Z
#> lower 313
#> upper 852

## show the average root estimate with 95\% confidence interval, as well as average estimates with confidence intervals for each node with a marginal count
tree2$Get('uncertainty')
#>         Z  A        D  E  B  F        G  H  L  M         N  C  I        J  K
#> lower 313 NA 160.2849 NA NA NA 39.26677 NA NA NA  555.3146 NA NA 139.3921 NA
#> upper 852 NA 308.3435 NA NA NA 59.23216 NA NA NA 3834.5179 NA NA 980.5035 NA
#>               O         P  Q
#> lower  272.8911  156.2845 NA
#> upper 3139.1687 2029.2968 NA

## show the samples generated from each path which provides root estimates
tree2$Get('targetEst_samples')
#> $Z
#> numeric(0)
#> 
#> $A
#> numeric(0)
#> 
#> $D
#>        A        A        A 
#> 241.3265 156.8698 312.3463 
#> 
#> $E
#> numeric(0)
#> 
#> $B
#> numeric(0)
#> 
#> $F
#> numeric(0)
#> 
#> $G
#>        G        G        G 
#> 59.54404 53.60716 38.62865 
#> 
#> $H
#> numeric(0)
#> 
#> $L
#> numeric(0)
#> 
#> $M
#> numeric(0)
#> 
#> $N
#>         H         H         H 
#> 4212.6003  642.3334  551.0761 
#> 
#> $C
#> numeric(0)
#> 
#> $I
#> numeric(0)
#> 
#> $J
#>         J         J         J 
#> 1054.1682  247.5679  135.2411 
#> 
#> $K
#> numeric(0)
#> 
#> $O
#>         O         O         O 
#>  389.4780  267.8293 3503.6146 
#> 
#> $P
#>         P         P         P 
#>  170.2544  155.5819 2312.0124 
#> 
#> $Q
#> numeric(0)

## show the probabilities sampled at each branch leading into the given node
tree2$Get('probability_samples')
#>       Z         A         D         E         B  F         G          H
#> [1,] NA 0.3918268 0.5287750 0.4712250 0.4340951 NA 0.5803207 0.02907534
#> [2,] NA 0.4254156 0.7492335 0.2507665 0.4081080 NA 0.6856355 0.17813125
#> [3,] NA 0.3241041 0.4939115 0.5060885 0.5037739 NA 0.7708077 0.19164880
#>              L         M         N         C  I          J          K         O
#> [1,] 0.3874104 0.2364318 0.3761578 0.1740781 NA 0.05449367 0.19081342 0.3864864
#> [2,] 0.4417438 0.1299501 0.4283061 0.1664764 NA 0.24263474 0.26304510 0.4263134
#> [3,] 0.4632516 0.1608447 0.3759037 0.1721220 NA 0.42959063 0.01948836 0.4254437
#>              P  Q
#> [1,] 0.3536541 NA
#> [2,] 0.2935541 NA
#> [3,] 0.2578863 NA

countTree()

After WMM estimation has been performed on a makeTree object, the countTree function renders a diagram of the tree, much like drawTree, but showing the root estimate generated using wmmTree in the root node, as well as the marginal leaf counts (data column Count) which contributed to the weighted estimate displayed within the corresponding leaf nodes. The mean of the sampled branch probabilities generated using the wmmTree method are also displayed along each branch.

Functionality can be demonstrated using the trees defined above:

## visualize the tree post-estimation, with final weighted root estimate (rounded) displayed in the root node and marginal counts displayed in their respective leaves.  
## means of sampled probability appear on branches, so note that sum of sibling branches may not equal 1.
countTree(tree)

estTree()

The estTree function is for use after wmmTree has been applied to a makeTree tree object. The function allows the user to visualize the tree with the root size estimate given by wmmTree displayed in the root node, and the root estimate given by each particular path which contributed to the weighted estimate displayed in the corresponding leaf node. It also displays average of probability samples generated using wmmTree method on each branch.

Functionality can again be demonstrated using the trees defined above:

## visualize the tree post-estimation, with final weighted root estimate (rounded) displayed in the root node and path-specific estimates in their respective leaves.
## The means of sampled probability appear on branches, so note that sum of sibling branches may not equal 1
estTree(tree)

References

Flynn, M. J., and P. Gustafson. 2024a. AutoWMM and JAGStree - R Packages for Population Estimation on Relational Tree-Structured Data.” [Manuscript in Preparation] Department of Statistics, University of British Columbia.
———. 2024b. “Leveraging Relational Evidence: Population Size Estimation on Tree-Structured Data with the Weighted Multiplier Method.” [Manuscript in Preparation] Department of Statistics, University of British Columbia.
Flynn, M. J., P. Gustafson, and M. A. Irvine. 2024. “Estimating the Number of Opioid Overdoses in British Columbia Using Relational Evidence with Tree Structure.” [Manuscript in Preparation] Department of Statistics, University of British Columbia.