--- title: "PartialNetwork: An R package for estimating peer effects using partial network information" header-includes: - \usepackage[utf8]{inputenc} - \usepackage[T1]{fontenc} - \usepackage{xcolor} author: "Vincent Boucher and Elys\u00E9e Aristide Houndetoungan" date: "`r Sys.Date()`" output: pdf_document: citation_package: natbib number_sections: yes latex_engine: pdflatex toc: true bibliography: ["References.bib", "Packages.bib"] biblio-style: "apalike" abstract: \noindent This vignette makes it easier to get started with the package **PartialNetwork**. We illustrate the estimators presented in @boucher2020estimating with examples of simulated data. Section [1]{#IV} illustrates the instrumental variable (IV) estimator, which is biased when the network is sampled. We calculate the bias and propose a general consistent estimator using a simulated general method of moments (SGMM) in Section [2]{#SGMM}. Sections [3]{#BAYES}--[5]{#BAYESARD} present the Bayesian estimator that jointly estimates the spatial autoregressive (SAR) model and the network formation model. Finally, we discuss in Section [6]{#SELECT} how to address the problem of selection bias that can arise when missing links are not completely random. link-citations: true urlcolor: blue vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{PartialNetwork package: Examples and Applications} %\VignetteEngine{knitr::rmarkdown} --- \definecolor{shadecolor}{rgb}{0.97, 0.97, 0.95} ```{r setup, include=FALSE} rmarkdown::find_pandoc(version = '2.9.2.1') knitr::opts_chunk$set(echo = TRUE) ``` \newpage # Instrumental variable (IV) procedure {#IV} We provide the function `sim.IV(dnetwork, X, y, replication, power)` where `dnetwork` is the network linking probabilities, `X` is a matrix of covariates, `y` (optional) is the vector of outcome, `replication` (optional, default = 1) is the number of replications, and `power` (optional, default = 1) is the number of powers of the interaction matrix used to generate the instruments. The function outputs a proxy for `Gy` and simulated instruments. ## Model without contextual effects The following code provides an example using a sample of 30 networks of size 50 each. For the sake of the example, we assume that linking probabilities are *known* and drawn from an uniform distribution. We first simulate data. Then, we estimate the linear-in-means model using our IV procedure, using the known linking probabilities to generate approximations of the true network. ```{r IV1, echo = TRUE, eval = TRUE} library(PartialNetwork) set.seed(123) # Number of groups M <- 30 # size of each group N <- rep(50,M) # individual effects beta <- c(2,1,1.5) # endogenous effects alpha <- 0.4 # std-dev errors se <- 1 # network distribution distr <- runif(sum(N*(N-1))) distr <- vec.to.mat(distr, N, normalise = FALSE) # covariates X <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7)) # true network G0 <- sim.network(distr) # normalise G0norm <- norm.network(G0) # simulate dependent variable use an external package y <- CDatanet::simsar(~ X, Glist = G0norm, theta = c(alpha, beta, se)) y <- y$y # generate instruments instr <- sim.IV(distr, X, y, replication = 1, power = 1) GY1c1 <- instr[[1]]$G1y # proxy for Gy (draw 1) GXc1 <- instr[[1]]$G1X[,,1] # proxy for GX (draw 1) GXc2 <- instr[[1]]$G2X[,,1] # proxy for GX (draw 2) # build dataset # keep only instrument constructed using a different draw than the one used to proxy Gy dataset <- as.data.frame(cbind(y, X, GY1c1, GXc1, GXc2)) colnames(dataset) <- c("y","X1","X2","G1y", "G1X1", "G1X2", "G2X1", "G2X2") ``` Once the instruments are generated, the estimation can be performed using standard tools, e.g. the function `ivreg` from the **AER** package. For example, if we use the same draw for the proxy and the instruments, the estimation is "bad". ```{r IV2, echo = TRUE, eval = TRUE, message=FALSE} library(AER) # Same draws out.iv1 <- ivreg(y ~ X1 + X2 + G1y | X1 + X2 + G1X1 + G1X2, data = dataset) summary(out.iv1) ``` If we use different draws for the proxy and the instruments, the estimation is "good". ```{r IV3, echo = TRUE, eval = TRUE, message=FALSE} # Different draws out.iv2 <- ivreg(y ~ X1 + X2 + G1y | X1 + X2 + G2X1 + G2X2, data = dataset) summary(out.iv2) ``` However, as stated by our Proposition 2, this estimator is biased. We can approximate the bias as follows. ```{r IV3bias} ddS <- as.matrix(cbind(1, dataset[,c("X1", "X2", "G1y")])) #\ddot{S} dZ <- as.matrix(cbind(1, dataset[,c("X1", "X2", "G2X1", "G2X2")]))#\dot{Z} dZddS <- crossprod(dZ, ddS)/sum(N) W <- solve(crossprod(dZ)/sum(N)) matM <- solve(crossprod(dZddS, W%*%dZddS), crossprod(dZddS, W)) maxbias <- apply(sapply(1:1000, function(...){ dddGy <- peer.avg(sim.network(distr, normalise = TRUE) , y) abs(matM%*%crossprod(dZ, dddGy - dataset$G1y)/sum(N)) }), 1, max); names(maxbias) <- c("(Intercept)", "X1", "X2", "G1y") {cat("Maximal absolute bias\n"); print(maxbias)} ``` ## Model with contextual effects We now assume that the model includes contextual effects. We first generate data. ```{r IV4, echo = TRUE, eval = TRUE} rm(list = ls()) library(PartialNetwork) set.seed(123) # Number of groups M <- 30 # size of each group N <- rep(50,M) # individual effects beta <- c(2,1,1.5) # contextual effects gamma <- c(5, -3) # endogenous effects alpha <- 0.4 # std-dev errors se <- 1 # network distribution distr <- runif(sum(N*(N-1))) distr <- vec.to.mat(distr, N, normalise = FALSE) # covariates X <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7)) # true network G0 <- sim.network(distr) # normalise G0norm <- norm.network(G0) # GX GX <- peer.avg(G0norm, X) # simulate dependent variable use an external package y <- CDatanet::simsar(~ X + GX, Glist = G0norm, theta = c(alpha, beta, gamma, se)) y <- y$y # generate instruments # we need power = 2 for models with contextual effetcs instr <- sim.IV(distr, X, y, replication = 1, power = 2) GY1c1 <- instr[[1]]$G1y # proxy for Gy (draw 1) GXc1 <- instr[[1]]$G1X[,,1] # proxy for GX (draw 1) GXc2 <- instr[[1]]$G2X[,,1] # proxy for GX (draw 2) GXc2sq <- instr[[1]]$G2X[,,2] # proxy for G^2X (draw 2) # build dataset # keep only instrument constructed using a different draw than the one used to proxy Gy dataset <- as.data.frame(cbind(y, X, GX, GY1c1, GXc1, GXc2, GXc2sq)) colnames(dataset) <- c("y","X1","X2", "GX1", "GX2", "G1y", "G1X1", "G1X2", "G2X1", "G2X2", "G2X1sq", "G2X2sq") ``` As pointed out in the paper, the IV procedure requires $\mathbf{G}\mathbf{X}$ being observed. In additions, when contextual effects are included, we consider the extended model. ```{r IV5, echo = TRUE, eval = TRUE, message=FALSE} # Different draws out.iv2 <- ivreg(y ~ X1 + X2 + GX1 + GX2 + G1X1 + G1X2 + G1y | X1 + X2 + GX1 + GX2 + G1X1 + G1X2 + G2X1 + G2X2 + G2X1sq + G2X2sq, data = dataset) summary(out.iv2) ``` We also compute the maximal absolute bias. ```{r IV5bias} ddS <- as.matrix(cbind(1, dataset[,c("X1", "X2", "GX1", "GX2", "G1X1", "G1X2", "G1y")])) dZ <- as.matrix(cbind(1, dataset[,c("X1", "X2", "GX1", "GX2", "G1X1", "G1X2", "G2X1", "G2X2", "G2X1sq", "G2X2sq")])) dZddS <- crossprod(dZ, ddS)/sum(N) W <- solve(crossprod(dZ)/sum(N)) matM <- solve(crossprod(dZddS, W%*%dZddS), crossprod(dZddS, W)) maxbias <- apply(sapply(1:1000, function(...){ dddGy <- peer.avg(sim.network(distr, normalise = TRUE) , y) abs(matM%*%crossprod(dZ, dddGy - dataset$G1y)/sum(N)) }), 1, max); names(maxbias) <- c("(Intercept)", "X1", "X2", "GX1", "GX2", "G1X1", "G1X2", "G1y") {cat("Maximal absolute bias\n"); print(maxbias)} ``` # Simulated Method of Moments {#SGMM} As shown in the paper (see @boucher2020estimating), our IV estimator is inconsistent. Although the bias is expected to be small, in general, the IV estimator performs less well when the model includes contextual effects. Therefore, we propose a Simulated Method of Moments (SMM) estimator by correcting the bias of the moment function use by the IV estimator. Our SMM estimator is then consistent and also deals with group heterogeneity. ## Models without group heterogeneity We first simulate data. ```{r smm1, echo = TRUE, eval = TRUE} rm(list = ls()) library(PartialNetwork) set.seed(123) # Number of groups M <- 100 # size of each group N <- rep(30,M) # individual effects beta <- c(2, 1, 1.5, 5, -3) # endogenous effects alpha <- 0.4 # std-dev errors se <- 1 # network distribution distr <- runif(sum(N*(N-1))) distr <- vec.to.mat(distr, N, normalise = FALSE) # covariates X <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7)) # true network G0 <- sim.network(distr) # normalise G0norm <- norm.network(G0) # Matrix GX GX <- peer.avg(G0norm, X) # simulate dependent variable use an external package y <- CDatanet::simsar(~ X + GX, Glist = G0norm, theta = c(alpha, beta, se)) Gy <- y$Gy y <- y$y # build dataset dataset <- as.data.frame(cbind(y, X, Gy, GX)) colnames(dataset) <- c("y","X1","X2", "Gy", "GX1", "GX2") ``` The estimation can be performed using the function `smmSAR` (do `?smmSAR` to read the help file of the function). The function allows to specify if `GX` and `Gy` are observed. We provide an example for each case. If `GX` and `Gy` are observed (instruments will be constructed using the network distribution). ```{r Smmload, echo = FALSE, eval = TRUE, message=FALSE} load(url('https://github.com/ahoundetoungan/PartialNetwork/blob/master/datavignettes/smm.rda?raw=true')) ``` ```{r Smm2, echo = TRUE, eval = FALSE, message=FALSE} out.smm1 <- smmSAR(y ~ X1 + X2 | Gy | GX1 + GX2, dnetwork = distr, contextual = T, smm.ctr = list(R = 1, print = F), data = dataset) summary(out.smm1) ``` ```{r Smm2a, echo = FALSE, eval = TRUE, message=FALSE} print(smm$smm1) ``` If `GX` is observed and not `Gy`. ```{r Smm3, echo = TRUE, eval = FALSE, message=FALSE} out.smm2 <- smmSAR(y ~ X1 + X2 || GX1 + GX2, dnetwork = distr, contextual = T, smm.ctr = list(R = 1, print = F), data = dataset) summary(out.smm2) ``` ```{r Smm3a, echo = FALSE, eval = TRUE, message=FALSE} print(smm$smm2) ``` If `Gy` is observed and not `GX`. ```{r Smm4, echo = TRUE, eval = FALSE, message=FALSE} out.smm3 <- smmSAR(y ~ X1 + X2 | Gy, dnetwork = distr, contextual = T, smm.ctr = list(R = 100, print = F), data = dataset) summary(out.smm3) ``` ```{r Smm4a, echo = FALSE, eval = TRUE, message=FALSE} print(smm$smm3) ``` If neither `Gy` nor `GX` are observed. ```{r Smm5, echo = TRUE, eval = FALSE, message=FALSE} out.smm4 <- smmSAR(y ~ X1 + X2, dnetwork = distr, contextual = T, smm.ctr = list(R = 100, print = F), data = dataset) summary(out.smm4) ``` ```{r Smm5a, echo = FALSE, eval = TRUE, message=FALSE} print(smm$smm4) ``` ## Models with group heterogeneity We assume here that every group has an unobserved characteristic which affects the outcome. Let us first simulate the data. ```{r smm6a, echo = TRUE, eval = FALSE} rm(list = ls()) ``` ```{r smm6b, echo = TRUE, eval = TRUE} library(PartialNetwork) set.seed(123) # Number of groups M <- 200 # size of each group N <- rep(30,M) # individual effects beta <- c(1, 1, 1.5, 5, -3) # endogenous effects alpha <- 0.4 # std-dev errors se <- 1 # network distribution distr <- runif(sum(N*(N-1))) distr <- vec.to.mat(distr, N, normalise = FALSE) # covariates X <- cbind(rnorm(sum(N),0,5), rpois(sum(N),7)) # Groups' fixed effects # In order to have groups' heterogeneity correlated to X (fixed effects), # We consider the quantile of X2 at 25% in the group eff <- unlist(lapply(1:M, function(x) rep(quantile(X[(c(0, cumsum(N))[x]+1):(cumsum(N)[x]),2], probs = 0.25), each = N[x]))) print(c("cor(eff, X1)" = cor(eff, X[,1]), "cor(eff, X2)" = cor(eff, X[,2]))) # We can see that eff is correlated to X2. We can confirm that the correlation is # strongly significant. print(c("p.value.cor(eff, X1)" = cor.test(eff, X[,1])$p.value, "p.value.cor(eff, X2)" = cor.test(eff, X[,2])$p.value)) # true network G0 <- sim.network(distr) # normalise G0norm <- norm.network(G0) # Matrix GX GX <- peer.avg(G0norm, X) # simulate dependent variable use an external package y <- CDatanet::simsar(~ -1 + eff + X + GX, Glist = G0norm, theta = c(alpha, beta, se)) Gy <- y$Gy y <- y$y # build dataset dataset <- as.data.frame(cbind(y, X, Gy, GX)) colnames(dataset) <- c("y","X1","X2", "Gy", "GX1", "GX2") ``` The group heterogeneity is correlated to $\mathbf{X}$ and induces bias if we do not control for it. In practice, we do not observe `eff` and we cannot add 200 dummies variables as explanatory variables to the model. We can control for group heterogeneity by taking the difference of each variable with respect to the group average (see @bramoulle2009identification). To do this, we only need to set `fixed.effects = TRUE` in `smmSAR`. We provide examples. If `GX` is observed and not `Gy`. ```{r Smm7, echo = TRUE, eval = FALSE, message=FALSE} out.smmeff1 <- smmSAR(y ~ X1 + X2 || GX1 + GX2, dnetwork = distr, contextual = T, fixed.effects = T, smm.ctr = list(R = 1, print = F), data = dataset) summary(out.smmeff1) ``` ```{r Smm7a, echo = FALSE, eval = TRUE, message=FALSE} print(smm$smme1) ``` If `Gy` is observed and not `GX`. ```{r Smm8, echo = TRUE, eval = FALSE, message=FALSE} out.smmeff2 <- smmSAR(y ~ X1 + X2 | Gy, dnetwork = distr, contextual = T, fixed.effects = T, smm.ctr = list(R = 100, print = F), data = dataset) summary(out.smmeff2) ``` ```{r Smm8a, echo = FALSE, eval = TRUE, message=FALSE} print(smm$smme2) ``` If neither `Gy` nor `GX` are observed. ```{r Smm9, echo = TRUE, eval = FALSE, message=FALSE} out.smmeff3 <- smmSAR(y ~ X1 + X2, dnetwork = distr, contextual = T, fixed.effects = T, smm.ctr = list(R = 100, print = F), data = dataset) summary(out.smmeff3) ``` ```{r Smm9a, echo = FALSE, eval = TRUE, message=FALSE} print(smm$smme3) ``` As we can see, the estimator with fixed effects has larger variance. Thus, we cannot illustrate its consistency using only one simulation. Therefore, we conduct Monte Carlo simulations. We construct the function `fMC` which simulates data and computes the SMM estimator following the same process as described above. ```{r Smm10, echo = TRUE, eval = FALSE, message=FALSE} fMC <- function(...){ # Number of groups M <- 200 # size of each group N <- rep(30,M) # individual effects beta <- c(1, 1, 1.5, 5, -3) # endogenous effects alpha <- 0.4 # std-dev errors se <- 1 # network distribution distr <- runif(sum(N*(N-1))) distr <- vec.to.mat(distr, N, normalise = FALSE) # covariates X <- cbind(rnorm(sum(N),0,5), rpois(sum(N),7)) # Groups' fixed effects # We defined the groups' fixed effect as the quantile at 25% of X2 in the group # This implies that the effects are correlated with X eff <- unlist(lapply(1:M, function(x) rep(quantile(X[(c(0, cumsum(N))[x]+1):(cumsum(N)[x]),2], probs = 0.25), each = N[x]))) # true network G0 <- sim.network(distr) # normalise G0norm <- norm.network(G0) # Matrix GX GX <- peer.avg(G0norm, X) # simulate dependent variable use an external package y <- CDatanet::simsar(~ -1 + eff + X + GX, Glist = G0norm, theta = c(alpha, beta, se)) Gy <- y$Gy y <- y$y # build dataset dataset <- as.data.frame(cbind(y, X, Gy, GX)) colnames(dataset) <- c("y","X1","X2", "Gy", "GX1", "GX2") out.smmeff1 <- smmSAR(y ~ X1 + X2 || GX1 + GX2, dnetwork = distr, contextual = T, fixed.effects = T, smm.ctr = list(R = 1, print = F), data = dataset) out.smmeff2 <- smmSAR(y ~ X1 + X2 | Gy, dnetwork = distr, contextual = T, fixed.effects = T, smm.ctr = list(R = 100, print = F), data = dataset) out.smmeff3 <- smmSAR(y ~ X1 + X2, dnetwork = distr, contextual = T, fixed.effects = T, smm.ctr = list(R = 100, print = F), data = dataset) out <- data.frame("GX.observed" = out.smmeff1$estimates, "Gy.observed" = out.smmeff2$estimates, "None.observed" = out.smmeff3$estimates) out } ``` We perform 250 simulations. ```{r Smm10a, echo = TRUE, eval = FALSE, message=FALSE} smm.Monte.C <- lapply(1:250, fMC) ``` We compute the average of the estimates ```{r Smm10b, echo = TRUE, eval = FALSE, message=FALSE} Reduce('+', smm.Monte.C)/250 ``` ```{r Smm10c, echo = FALSE, eval = TRUE, message=FALSE} print(smm$smmmc) ``` The SMM estimator performs well even when we only have the distribution of the network, $\mathbf{G}\mathbf{X}$ and $\mathbf{G}\mathbf{y}$ are not observed, and the model includes group heterogeneity. ```{r smmsave, echo = FALSE, eval = FALSE} smm <- list(smm1 = out.smm1, smm2 = out.smm2, smm3 = out.smm3, smm4 = out.smm4, smme1 = out.smmeff1, smme2 = out.smmeff2, smme3 = out.smmeff3, smmmc = smm$smmmc) save(smm, file = "~/Dropbox/Papers - In progress/Partial Network/Package/AH/PartialNetwork/datavignettes/smm.rda") ``` ## How to compute the variance when the network distribution is estimated? ```{r Smmlogitload, echo = FALSE, eval = TRUE, message=FALSE} load(url('https://github.com/ahoundetoungan/PartialNetwork/blob/master/datavignettes/smmlogit.rda?raw=true')) ``` In practice, the econometrician does not observed the true distribution of the entire network. They can only have an estimator based on partial network data (see @boucher2020estimating). Because this estimator is used instead of the true distribution, this increases the variance of the SMM estimator. We develop a method to estimate the variance by taking into account the uncertainty related to the estimator of the network distribution. Assume that the network distribution is logistic, ie, \begin{equation}\dfrac{p_{ij}}{1 - p_{ij}} = \exp(\rho_0 + \rho_1|X_{i1} + X_{j1}| + \rho_2|X_{i2} - X_{j2}|)\label{eq:logit}\end{equation} and $\mathbb{P}(a_{ij} = 1|\mathbf{P}) = p_{ij}$, where $\mathbf{A}$ is the adjacency matrix, $a_{ij}$ is the ($i$, $j$)-th entry of $\mathbf{A}$ and $\mathbf{P}$ is the matrix of $p_{ij}$. We simulated data following this assumption. ```{r smmp1, echo = TRUE, eval = FALSE} library(PartialNetwork) rm(list = ls()) set.seed(123) # Number of groups M <- 100 # size of each group N <- rep(30,M) # covariates X <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7)) # network formation model parameter rho <- c(-0.8, 0.2, -0.1) # individual effects beta <- c(2, 1, 1.5, 5, -3) # endogenous effects alpha <- 0.4 # std-dev errors se <- 1 # network tmp <- c(0, cumsum(N)) X1l <- lapply(1:M, function(x) X[c(tmp[x] + 1):tmp[x+1],1]) X2l <- lapply(1:M, function(x) X[c(tmp[x] + 1):tmp[x+1],2]) dist.net <- function(x, y) abs(x - y) X1.mat <- lapply(1:M, function(m) { matrix(kronecker(X1l[[m]], X1l[[m]], FUN = dist.net), N[m])}) X2.mat <- lapply(1:M, function(m) { matrix(kronecker(X2l[[m]], X2l[[m]], FUN = dist.net), N[m])}) Xnet <- as.matrix(cbind("Const" = 1, "dX1" = mat.to.vec(X1.mat), "dX2" = mat.to.vec(X2.mat))) ynet <- Xnet %*% rho ynet <- c(1*((ynet + rlogis(length(ynet))) > 0)) G0 <- vec.to.mat(ynet, N, normalise = FALSE) # normalise G0norm <- norm.network(G0) # Matrix GX GX <- peer.avg(G0norm, X) # simulate dependent variable use an external package y <- CDatanet::simsar(~ X + GX, Glist = G0norm, theta = c(alpha, beta, se)) Gy <- y$Gy y <- y$y # build dataset dataset <- as.data.frame(cbind(y, X, Gy, GX)) colnames(dataset) <- c("y","X1","X2", "Gy", "GX1", "GX2") ``` We do not observed the true distribution of the network. We observe a subset of $\{a_{ij}\}$. Let $\mathcal{A}^{obs} = \{(i,~j), ~ a_{ij} \text{ is observed}\}$ and $\mathcal{A}^{nobs} = \{(i,~j), ~ a_{ij} \text{ is not observed}\}$. We assume that $|\mathcal{A}^{obs}| \to \infty$ as the sample size grows to $\infty$. Therefore, $\rho_0$, $\rho_1$, and $\rho_2$ can be consistently estimated. ```{r smmp2, echo = TRUE, eval = FALSE} nNet <- nrow(Xnet) # network formation model sample size Aobs <- sample(1:nNet, round(0.3*nNet)) # We observed 30% # We can estimate rho using the gml function from the stats package logestim <- glm(ynet[Aobs] ~ -1 + Xnet[Aobs,], family = binomial(link = "logit")) slogestim <- summary(logestim) rho.est <- logestim$coefficients rho.var <- slogestim$cov.unscaled # we also need the covariance of the estimator ``` We assume that the explanatory variables $X_{i1}$ and $X_{i2}$ are observed for any $i$ in the sample. Using the estimator of $\rho_0$, $\rho_1$, and $\rho_2$, we can also compute $\hat{\mathbf{P}}$, a consistent estimator of $\mathbf{P}$, from Equation (\ref{eq:logit}). ```{r smmp3, echo = TRUE, eval = FALSE} d.logit <- lapply(1:M, function(x) { out <- 1/(1 + exp(-rho.est[1] - rho.est[2]*X1.mat[[x]] - rho.est[3]*X2.mat[[x]])) diag(out) <- 0 out}) ``` We can use $\hat{\mathbf{P}}$ for our SMM estimator. We focus on the case where neither $\mathbf{G}\mathbf{X}$ nor $\mathbf{G}\mathbf{y}$ are observed. The same strategy can be used for other cases. ```{r Smmp4, echo = TRUE, eval = FALSE, message=FALSE} smm.logit <- smmSAR(y ~ X1 + X2, dnetwork = d.logit, contextual = T, smm.ctr = list(R = 100, print = F), data = dataset) summary(smm.logit) ``` ```{r Smmp4b, echo = FALSE, eval = TRUE, message=FALSE} print(smmlo$smm1) ``` The variance of the estimator computed above is conditionally on `d.logit`. It does not take into account the uncertainty related to the estimation of `d.logit`. To compute the right variance, we need a simulator from the distribution of `d.logit`. This simulator is a function which samples one network distribution from the distribution of `d.logit`. For instance, for the logit model, we can sample $\boldsymbol{\rho}$ from $N\Big(\hat{\boldsymbol{\rho}}, \mathbb{V}(\hat{\boldsymbol{\rho}})\Big)$ and then compute $\hat{\mathbf{P}}$. Depending on the network formation model, users can compute the adequate simulator. ```{r Smmp5, echo = TRUE, eval = FALSE, message=FALSE} fdist <- function(rho.est, rho.var, M, X1.mat, X2.mat){ rho.est1 <- MASS::mvrnorm(mu = rho.est, Sigma = rho.var) lapply(1:M, function(x) { out <- 1/(1 + exp(-rho.est1[1] - rho.est1[2]*X1.mat[[x]] - rho.est1[3]*X2.mat[[x]])) diag(out) <- 0 out}) } ``` The function can be passed into the argument `.fun` of the `summary` method. We also need to pass the arguments of the simulator as a list into `.args`. ```{r Smmp6a, echo = TRUE, eval = FALSE, message=FALSE} fdist_args <- list(rho.est = rho.est, rho.var = rho.var, M = M, X1.mat = X1.mat, X2.mat = X2.mat) summary(smm.logit, dnetwork = d.logit, data = dataset, .fun = fdist, .args = fdist_args, sim = 500, ncores = 8) # ncores performs simulations in parallel ``` ```{r Smmp6c, echo = FALSE, eval = TRUE, message=FALSE} print(smmlo$smm2) ``` We can notice that the variance is larger when we take into account the uncertainty of the estimation of the logit model. ```{r smmlogitsave, echo = FALSE, eval = FALSE} smm2 <- summary(smm.logit, dnetwork = d.logit, data = dataset, .fun = fdist, .args = fdist_args, sim = 500, ncores = 8) smmlo <- list(smm1 = smm.logit, smm2 = smm2) save(smmlo, file = "~/Dropbox/Papers - In progress/Partial Network/Package/AH/PartialNetwork/datavignettes/smmlogit.rda") ``` # Bayesian estimator without network formation model {#BAYES} The Bayesian estimator is neatly packed in the function `mcmcSAR` (see the help page of the function in the package, using `?mcmcSAR`, for more details on the function). Below, we provide a simple example using simulated data. For the sake of the example, we assume that linking probabilities are *known* and drawn from an uniform distribution. We first simulate data. Then, we estimate the linear-in-means model using our Bayesian estimator. **In the following example (example I-1, output `out.none1`), we assume that the network is entirely observed.** **We first simulate data.** ```{r BayesNone0, echo=FALSE} load(url('https://github.com/ahoundetoungan/PartialNetwork/blob/master/datavignettes/out.none.rda?raw=true')) #save(out.none1, out.none2.1, out.none2.2, out.none3.1, out.none3.2, file = "~/Dropbox/Papers - In progress/Partial Network/Package/AH/PartialNetwork/datavignettes/out.none.rda") ``` ```{r BayesNone1as, eval=FALSE} rm(list = ls()) library(PartialNetwork) set.seed(123) # EXAMPLE I: WITHOUT NETWORK FORMATION MODEL # Number of groups M <- 50 # size of each group N <- rep(30,M) # individual effects beta <- c(2,1,1.5) # contextual effects gamma <- c(5,-3) # endogenous effects alpha <- 0.4 # std-dev errors se <- 1 # network distribution distr <- runif(sum(N*(N-1))) distr <- vec.to.mat(distr, N, normalise = FALSE) # covariates X <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7)) # true network G0 <- sim.network(distr) # normalize G0norm <- norm.network(G0) # GX GX <- peer.avg(G0norm, X) # simulate dependent variable use an external package y <- CDatanet::simsar(~ X + GX, Glist = G0norm, theta = c(alpha, beta, gamma, se)) y <- y$y # dataset dataset <- as.data.frame(cbind(y, X1 = X[,1], X2 = X[,2])) ``` **Once the data are simulated, the estimation can be performed using the function `mcmcSAR`.**^[We ran the program on a processor `Intel(R) Xeon(R) W-1270P CPU @ 3.80GHz`. The execution time in the output depends on the CPU performance.] ```{r BayesNone1ae, eval=FALSE} # Example I-1: When the network is fully observed out.none1 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "all", G0 = G0, data = dataset, iteration = 2e4) summary(out.none1) ``` ```{r BayesNone1b, echo=FALSE} summary(out.none1) ``` **For Example I-2, we assume that only 60% of the links are observed. ** ```{r BayesNone21s, eval=FALSE} # Example I-2: When a part of the network is observed # 60% of the network data is observed G0.obs <- lapply(N, function(x) matrix(rbinom(x^2, 1, 0.6), x)) ``` **Estimation `out.none2.1` assumes that the sampled network is the true one (inconsistent, peer effects are overestimated).** ```{r BayesNone21e, eval=FALSE} # replace the non-observed part of the network by 0 (missing links) G0.start <- lapply(1:M, function(x) G0[[x]]*G0.obs[[x]]) # Use network with missing data as the true network out.none2.1 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "all", G0 = G0.start, data = dataset, iteration = 2e4) summary(out.none2.1) # the peer effets seem overestimated ``` ```{r BayesNone21b, echo=FALSE} summary(out.none2.1) # the peer effets seem overestimated ``` **Estimation `out.none2.2` specifies which links are observed and which ones are not. The true probabilities are used to sample un-observed links (consistent).** ```{r BayesNone22a, eval=FALSE} out.none2.2 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = G0.obs, G0 = G0.start, data = dataset, mlinks = list(dnetwork = distr), iteration = 2e4) summary(out.none2.2) ``` ```{r BayesNone22b, echo=FALSE} summary(out.none2.2) ``` **For Example I-3, we assume that only linking probabilities are known. ** **Estimation `out.none3.1` assumes the researcher uses a draw from that distribution as the true network (inconsistent, peer effects are overestimated).** ```{r BayesNone31as, eval=FALSE} # Example I-3: When only the network distribution is available # Simulate a fictitious network and use as true network G0.tmp <- sim.network(distr) out.none3.1 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "all", G0 = G0.tmp, data = dataset, iteration = 2e4) summary(out.none3.1) # the peer effets seem overestimated ``` ```{r BayesNone31b, echo=FALSE} summary(out.none3.1) # the peer effets seem overestimated ``` **Estimation `out.none3.2` specifies that no link is observed, but that the distribution is known (consistent).** ```{r BayesNone32a, eval=FALSE} out.none3.2 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "none", data = dataset, mlinks = list(dnetwork = distr), iteration = 2e4) summary(out.none3.2) ``` ```{r BayesNone32b, echo=FALSE} summary(out.none3.2) ``` # Bayesian estimator with logit model as network formation model {#BAYESLOG} For this example, we assume that links are generated using a simple logit model. We do not observe the true distribution. **We first simulate data.** ```{r BayesLog0, echo=FALSE} load(url('https://github.com/ahoundetoungan/PartialNetwork/blob/master/datavignettes/out.logi.rda?raw=true')) # save(out.logi2.2, out.logi3.2, slogestim, sout.logi4.1, sout.logi4.2, sout.logi4.3, sout.selb1, sout.selb2, file = "~/Dropbox/Papers - In progress/Partial Network/Package/AH/PartialNetwork/datavignettes/out.logi.rda") ``` ```{r BayesLog1as, eval=FALSE} # EXAMPLE II: NETWORK FORMATION MODEL: LOGIT rm(list = ls()) library(PartialNetwork) set.seed(123) # Number of groups M <- 50 # size of each group N <- rep(30,M) # individual effects beta <- c(2,1,1.5) # contextual effects gamma <- c(5,-3) # endogenous effects alpha <- 0.4 # std-dev errors se <- 2 # parameters of the network formation model rho <- c(-2, -.5, .2) # covariates X <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7)) # compute distance between individuals tmp <- c(0, cumsum(N)) X1l <- lapply(1:M, function(x) X[c(tmp[x] + 1):tmp[x+1],1]) X2l <- lapply(1:M, function(x) X[c(tmp[x] + 1):tmp[x+1],2]) dist.net <- function(x, y) abs(x - y) X1.mat <- lapply(1:M, function(m) { matrix(kronecker(X1l[[m]], X1l[[m]], FUN = dist.net), N[m])}) X2.mat <- lapply(1:M, function(m) { matrix(kronecker(X2l[[m]], X2l[[m]], FUN = dist.net), N[m])}) # true network Xnet <- as.matrix(cbind("Const" = 1, "dX1" = mat.to.vec(X1.mat), "dX2" = mat.to.vec(X2.mat))) ynet <- Xnet %*% rho ynet <- 1*((ynet + rlogis(length(ynet))) > 0) G0 <- vec.to.mat(ynet, N, normalise = FALSE) G0norm <- norm.network(G0) # GX GX <- peer.avg(G0norm, X) # simulate dependent variable use an external package y <- CDatanet::simsar(~ X + GX, Glist = G0norm, theta = c(alpha, beta, gamma, se)) y <- y$y # dataset dataset <- as.data.frame(cbind(y, X1 = X[,1], X2 = X[,2])) ``` For example II-1, we assume that the researcher only **observes 60% of the links**, but **know that the network formation model is logistic**. ```{r BayesLog21as, eval=FALSE} # Example II-1: When a part of the network is observed # 60% of the network data is observed G0.obs <- lapply(N, function(x) matrix(rbinom(x^2, 1, 0.6), x)) # replace the non-observed part of the network by 0 G0.start <- lapply(1:M, function(x) G0[[x]]*G0.obs[[x]]) # Infer the missing links in the network data mlinks <- list(model = "logit", mlinks.formula = ~ dX1 + dX2, mlinks.data = as.data.frame(Xnet)) ``` **Once the data are simulated, the estimation can be performed.** ```{r BayesLog21ae, eval=FALSE} out.logi2.2 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = G0.obs, G0 = G0.start, data = dataset, mlinks = mlinks, iteration = 2e4) summary(out.logi2.2) ``` ```{r BayesLog22b, echo=FALSE} summary(out.logi2.2) ``` ```{r BayesLog22c, fig.height = 3, fig.align = "center"} plot(out.logi2.2, plot.type = "sim", mar = c(3, 2.1, 1, 1)) ``` ```{r BayesLog22cc, fig.height = 3, fig.align = "center"} plot(out.logi2.2, plot.type = "sim", which.parms = "rho", mar = c(3, 2.1, 1, 1)) ``` **Example II-2 disregards the information about observed links (which we used to estimate the logit model) and only uses the asymptotic distribution of the network formation parameters.** ```{r BayesLog31as, eval=FALSE} # Example II-2: When only the network distribution is available # Infer the network data # We only provide estimate of rho and its variance Gvec <- mat.to.vec(G0, ceiled = TRUE) logestim <- glm(Gvec ~ -1 + Xnet, family = binomial(link = "logit")) slogestim <- summary(logestim) estimates <- list("rho" = logestim$coefficients, "var.rho" = slogestim$cov.unscaled, "N" = N) mlinks <- list(model = "logit", mlinks.formula = ~ dX1 + dX2, mlinks.data = as.data.frame(Xnet), estimates = estimates) out.logi3.2 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "none", data = dataset, mlinks = mlinks, iteration = 2e4) summary(out.logi3.2) ``` ```{r BayesLog32b, echo=FALSE} summary(out.logi3.2) ``` ```{r BayesLog32c, fig.height = 3, fig.align = "center"} plot(out.logi3.2, plot.type = "sim", mar = c(3, 2.1, 1, 1)) ``` ```{r BayesLog32cc, fig.height = 3, fig.align = "center"} plot(out.logi3.2, plot.type = "sim", which.parms = "rho", mar = c(3, 2.1, 1, 1)) ``` ### Remarks {-} In the previous example, partial network information is available to estimate `rho` and `var.rho`. These estimates are included in the object `estimates`. Then in the MCMC, we set `G0.obs = "none"`, which means that the entire network will be inferred. In this situation, the posterior distribution of `rho` is strongly linked to the prior given by the practitioner, i.e. a normal distribution of parameters, the initial estimates of `rho` and `var.rho`. Indeed, an additional source of identification of the posterior distribution of `rho` may come from the spatial autoregressive (SAR) model. However, the available information in the observed part of the network is not used to update `rho` since it is already used when computing `estimates`. From a certain point of view, inferring the observed part of the network can be considered inefficient. It is possible to keep fixed the observed entries of the adjacency matrix. As in example II-1, it is sufficient to set `G0.obs = G0.obs` instead of `G0.obs = "none"` in the function `mcmcSAR`. However, the observed part of the network is assumed to be non-stochastic in this case and will not be used to update `rho`. As soon as the practitioner gives an initial estimate of `rho` and `var.rho` in `estimates`, no information from the observed part of the network is used to update `rho`. The initial estimate of `rho` and `var.rho` summarizes the information. We present an example below. ```{r BayesLog21ae3, eval=FALSE} estimates <- list("rho" = logestim$coefficients, "var.rho" = slogestim$cov.unscaled) mlinks <- list(model = "logit", mlinks.formula = ~ dX1 + dX2, mlinks.data = as.data.frame(Xnet), estimates = estimates) out.logi4.1 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = G0.obs, G0 = G0.start, data = dataset, mlinks = mlinks, iteration = 2e4) summary(out.logi4.1) ``` ```{r BayesLog21ae4, eval=TRUE, echo=FALSE} print(sout.logi4.1) ``` One can notice that the network formation model estimate is not too different from the initial estimate of `rho`. ```{r BayesLog21ae1, eval=TRUE} print(slogestim) ``` It is also possible to update `rho` in the MCMC using the observed part of the network. A particular case is Example II-1, where the distribution of `rho` is not given by the practitioner. In this example, the observed part of the network is used to estimate `rho` and `var.rho` (as in Example II-2). These estimates are then used as the prior distribution in the MCMC and `rho` is updated using the available information in the network. If the practitioner wants to define another prior, they can use `prior` instead of `estimate`. This is an example where we assume that the prior distribution of `rho` is the standard normal distribution. ```{r BayesLog21ae5, eval=FALSE} prior <- list("rho" = c(0, 0, 0), "var.rho" = diag(3)) mlinks <- list(model = "logit", mlinks.formula = ~ dX1 + dX2, mlinks.data = as.data.frame(Xnet), prior = prior) out.logi4.2 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = G0.obs, G0 = G0.start, data = dataset, mlinks = mlinks, iteration = 2e4) summary(out.logi4.2) ``` ```{r BayesLog21ae6, echo=FALSE} print(sout.logi4.2) ``` The MCMC performs well although the distribution of `rho` defined in `prior` is not true. This is because the observed part of the network is used to update `rho`. In contrast, the MCMC could be inconsistent if one defines `estimates` as `prior` because `rho` will not be updated using the observed entries in the adjacency matrix, but only using information from the outcome `y`. # Bayesian estimator with latent space model as network formation model {#BAYESARD} ## ARD, @breza2020using We also offer a function of the estimator in @breza2020using. We first simulate data. We then estimate the model's parameters assuming that the researcher only knows ARD. We present two examples, one for which we observe ARD for the entire population (Example 1) and one for which we observe ARD for only 70% of the population (Example 2). ```{r Bayesard0, echo=FALSE} load(url('https://github.com/ahoundetoungan/PartialNetwork/blob/master/datavignettes/out.ard.rda?raw=true')) # save(data.plot1, data.plot2, genz, genv, zi, vk, file = "~/Dropbox/Papers - In progress/Partial Network/Package/AH/PartialNetwork/datavignettes/out.ard.rda") ``` The data is simulated following a procedure similar to the one in @breza2020using. ```{r ard1, eval=FALSE} rm(list = ls()) library(PartialNetwork) set.seed(123) # LATENT SPACE MODEL N <- 500 genzeta <- 1 mu <- -1.35 sigma <- 0.37 K <- 12 # number of traits P <- 3 # Sphere dimension # ARD parameters # Generate z (spherical coordinates) genz <- rvMF(N, rep(0,P)) # Generate nu from a Normal(mu, sigma^2) (The gregariousness) gennu <- rnorm(N, mu, sigma) # compute degrees gend <- N*exp(gennu)*exp(mu+0.5*sigma^2)*exp(logCpvMF(P,0) - logCpvMF(P,genzeta)) # Link probabilities distr <- sim.dnetwork(gennu, gend, genzeta, genz) # Adjacency matrix G <- sim.network(distr) # Generate vk, the trait location genv <- rvMF(K, rep(0, P)) # set fixed some vk distant genv[1,] <- c(1, 0, 0) genv[2,] <- c(0, 1, 0) genv[3,] <- c(0, 0, 1) # eta, the intensity parameter geneta <- abs(rnorm(K, 2, 1)) # Build traits matrix densityatz <- matrix(0, N, K) for(k in 1:K){ densityatz[,k] <- dvMF(genz, genv[k,]*geneta[k]) } trait <- matrix(0, N, K) NK <- floor(runif(K, 0.8, 0.95)*colSums(densityatz)/apply(densityatz, 2, max)) for (k in 1:K) { trait[,k] <- rbinom(N, 1, NK[k]*densityatz[,k]/sum(densityatz[,k])) } # Build ADR ARD <- G %*% trait # generate b genb <- numeric(K) for(k in 1:K){ genb[k] <- sum(G[,trait[,k]==1])/sum(G) } ``` **Example 1: we observe ARD for the entire population** ```{r ard1as, eval=FALSE} # Example1: ARD is observed for the whole population # initialization d0 <- exp(rnorm(N)); b0 <- exp(rnorm(K)); eta0 <- rep(1,K) zeta0 <- 2; z0 <- matrix(rvMF(N, rep(0,P)), N); v0 <- matrix(rvMF(K,rep(0, P)), K) # We should fix some vk and bk vfixcolumn <- 1:5 bfixcolumn <- c(3, 7, 9) b0[bfixcolumn] <- genb[bfixcolumn] v0[vfixcolumn,] <- genv[vfixcolumn,] start <- list("z" = z0, "v" = v0, "d" = d0, "b" = b0, "eta" = eta0, "zeta" = zeta0) ``` **The estimation can be performed using the function `mcmcARD`** ```{r ard1e, eval=FALSE} # MCMC estim.ard1 <- mcmcARD(Y = ARD, traitARD = trait, start = start, fixv = vfixcolumn, consb = bfixcolumn, iteration = 5000) ``` ```{r ardaa1, eval=FALSE} # plot coordinates of individual 123 i <- 123 zi <- estim.ard1$simulations$z[i,,] par(mfrow = c(3, 1), mar = c(2.1, 2.1, 1, 1)) invisible(lapply(1:3, function(x) { plot(zi[x,], type = "l", ylab = "", col = "blue", ylim = c(-1, 1)) abline(h = genz[i, x], col = "red") })) ``` ```{r ardaa1a, echo=FALSE, fig.height = 4, fig.align = "center"} i <- 123 par(mfrow = c(3, 1), mar = c(2.1, 2.1, 1, 1)) invisible(lapply(1:3, function(x) { plot(zi[x,], type = "l", ylab = "", col = "blue", ylim = c(-1, 1)) abline(h = genz[i, x], col = "red") })) ``` ```{r ardaa2, eval=FALSE} # plot coordinates of the trait 8 k <- 8 vk <- estim.ard1$simulations$v[k,,] par(mfrow = c(3, 1), mar = c(2.1, 2.1, 1, 1)) invisible(lapply(1:3, function(x) { plot(vk[x,], type = "l", ylab = "", col = "blue", ylim = c(-1, 1)) abline(h = genv[k, x], col = "red") })) ``` ```{r ardaa2a, echo=FALSE, fig.height = 4, fig.align = "center"} k <- 8 par(mfrow = c(3, 1), mar = c(2.1, 2.1, 1, 1)) invisible(lapply(1:3, function(x) { plot(vk[x,], type = "l", ylab = "", col = "blue", ylim = c(-1, 1)) abline(h = genv[k, x], col = "red") })) ``` ```{r ardb, message=FALSE, eval=FALSE} # plot degree library(ggplot2) data.plot1 <- data.frame(True_degree = gend, Estim_degree = colMeans(tail(estim.ard1$simulations$d, 2500))) ggplot(data = data.plot1, aes(x = True_degree, y = Estim_degree)) + geom_abline(col = "red") + geom_point(col = "blue") ``` ```{r ardbb, message=FALSE, echo=FALSE, fig.height = 3, fig.align = "center"} library(ggplot2) ggplot(data = data.plot1, aes(x = True_degree, y = Estim_degree)) + geom_abline(col = "red") + geom_point(col = "blue") ``` **Example 2: we observe ARD for only 70% of the population** ```{r ard2as, eval=FALSE} # Example2: ARD is observed for 70% population # sample with ARD n <- round(0.7*N) # individual with ARD iselect <- sort(sample(1:N, n, replace = FALSE)) ARDs <- ARD[iselect,] traits <- trait[iselect,] # initialization d0 <- d0[iselect]; z0 <- z0[iselect,] start <- list("z" = z0, "v" = v0, "d" = d0, "b" = b0, "eta" = eta0, "zeta" = zeta0) ``` **The estimation can be performed using the function `mcmcARD`** ```{r ard2ae, eval=FALSE} # MCMC estim.ard2 <- mcmcARD(Y = ARDs, traitARD = traits, start = start, fixv = vfixcolumn, consb = bfixcolumn, iteration = 5000) ``` ```{r ard2b, eval=FALSE} # estimation for non ARD # we need a logical vector indicating if the i-th element has ARD hasARD <- (1:N) %in% iselect # we use the matrix of traits to estimate distance between individuals estim.nard2 <- fit.dnetwork(estim.ard2, X = trait, obsARD = hasARD, m = 1) ``` ```{r arde, eval=FALSE} # estimated degree estd <- estim.nard2$degree data.plot2 <- data.frame(True_degree = gend, Estim_degree = estd, Has_ARD = ifelse(hasARD, "YES", "NO")) ggplot(data = data.plot2, aes(x = True_degree, y = Estim_degree, colour = Has_ARD)) + geom_abline(col = "red") + geom_point() ``` ```{r ardee, message=FALSE, echo=FALSE, fig.height = 3, fig.align = "center"} ggplot(data = data.plot2, aes(x = True_degree, y = Estim_degree, colour = Has_ARD)) + geom_abline(col = "red") + geom_point() ``` ## Estimating peer effects model with ARD Given the predicted probabilities, estimated using the estimator proposed by @breza2020using assuming that ARD are observed for the entire population, we implement our Bayesian estimator assuming that the posterior distribution of the linking probabilities are jointly normally distributed. ```{r Bayesard00, echo=FALSE} load(url('https://github.com/ahoundetoungan/PartialNetwork/blob/master/datavignettes/out.lsp.rda?raw=true')) #save(out.lspa1, out.lspa2, file = "~/Dropbox/Papers - In progress/Partial Network/Package/AH/PartialNetwork/datavignettes/out.lsp.rda") ``` **We first simulate data.** ```{r bayesard1, eval=FALSE} rm(list = ls()) library(PartialNetwork) set.seed(123) M <- 30 N <- rep(60, M) genzeta <- 3 mu <- -1.35 sigma <- 0.37 K <- 12 # number of traits P <- 3 # Sphere dimension # IN THIS LOOP, WE GENERATE DATA FOLLOWING BREZA ET AL. (2020) AND # ESTIMATE THEIR LATENT SPACE MODEL FOR EACH SUB-NETWORK. estimates <- list() list.trait <- list() G0 <- list() for (m in 1:M) { ####################################################################################### ####### SIMULATION STAGE ######## ####################################################################################### # ARD parameters # Generate z (spherical coordinates) genz <- rvMF(N[m], rep(0,P)) # Generate nu from a Normal(mu, sigma^2) (The gregariousness) gennu <- rnorm(N[m],mu,sigma) # compute degrees gend <- N[m]*exp(gennu)*exp(mu+0.5*sigma^2)*exp(logCpvMF(P,0) - logCpvMF(P,genzeta)) # Link probabilities distr <- sim.dnetwork(gennu, gend, genzeta, genz) # Adjacency matrix G <- sim.network(distr) G0[[m]] <- G # Generate vk, the trait location genv <- rvMF(K, rep(0, P)) # set fixed some vk distant genv[1,] <- c(1, 0, 0) genv[2,] <- c(0, 1, 0) genv[3,] <- c(0, 0, 1) # eta, the intensity parameter geneta <-abs(rnorm(K, 2, 1)) # Build traits matrix densityatz <- matrix(0, N[m], K) for(k in 1:K){ densityatz[,k] <- dvMF(genz, genv[k,]*geneta[k]) } trait <- matrix(0, N[m], K) NK <- floor(runif(K, .8, .95)*colSums(densityatz)/apply(densityatz, 2, max)) for (k in 1:K) { trait[,k] <- rbinom(N[m], 1, NK[k]*densityatz[,k]/sum(densityatz[,k])) } list.trait[[m]] <- trait # Build ADR ARD <- G %*% trait # generate b genb <- numeric(K) for(k in 1:K){ genb[k] <- sum(G[,trait[,k]==1])/sum(G) + 1e-8 } ####################################################################################### ####### ESTIMATION STAGE ######## ####################################################################################### # initialization d0 <- gend; b0 <- exp(rnorm(K)); eta0 <- rep(1,K); zeta0 <- genzeta z0 <- matrix(rvMF(N[m], rep(0,P)), N[m]); v0 <- matrix(rvMF(K,rep(0, P)), K) # We should fix some vk and bk vfixcolumn <- 1:5 bfixcolumn <- c(1, 3, 5, 7, 9, 11) b0[bfixcolumn] <- genb[bfixcolumn] v0[vfixcolumn,] <- genv[vfixcolumn,] start <- list("z" = z0, "v" = v0, "d" = d0, "b" = b0, "eta" = eta0, "zeta" = zeta0) estimates[[m]] <- mcmcARD(Y = ARD, traitARD = trait, start = start, fixv = vfixcolumn, consb = bfixcolumn, sim.d = FALSE, sim.zeta = FALSE, iteration = 5000, ctrl.mcmc = list(print = FALSE)) } # SIMULATE DATA FOR THE OUTCOME MODEL # individual effects beta <- c(2,1,1.5) # contextual effects gamma <- c(5,-3) # endogenous effects alpha <- 0.4 # std-dev errors se <- 1 # covariates X <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7)) # Normalise G0 G0norm <- norm.network(G0) # GX GX <- peer.avg(G0norm, X) # simulate dependent variable use an external package y <- CDatanet::simsar(~ X + GX, Glist = G0norm, theta = c(alpha, beta, gamma, se)) y <- y$y # dataset dataset <- as.data.frame(cbind(y, X1 = X[,1], X2 = X[,2])) ``` **Once the data are simulated, the estimation can be performed using the function `mcmcSAR`.** ```{r bayesard2s, eval=FALSE} mlinks <- list(model = "latent space", estimates = estimates) out.lspa1 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "none", data = dataset, mlinks = mlinks, iteration = 2e4) summary(out.lspa1) ``` ```{r Bayesard2b, echo=FALSE} summary(out.lspa1) ``` ```{r Bayesard2c, fig.height = 3, fig.align = "center"} plot(out.lspa1, plot.type = "sim", mar = c(3, 2.1, 1, 1)) ``` ## Estimating peer effects with partial ARD Given the predicted probabilities, estimated using the estimator proposed by @breza2020using assuming that ARD are observed for 70% $\sim$ 100% of the population, we implement our Bayesian estimator assuming that the posterior distribution of the linking probabilities are jointly normally distributed. **We first simulate data.** ```{r bayesard3, eval=FALSE} rm(list = ls()) library(PartialNetwork) set.seed(123) M <- 30 N <- rep(60, M) genzeta <- 3 mu <- -1.35 sigma <- 0.37 K <- 12 P <- 3 # IN THIS LOOP, WE GENERATE DATA FOLLOWING BREZA ET AL. (2020) AND # ESTIMATE THEIR LATENT SPACE MODEL FOR EACH SUB-NETWORK. estimates <- list() list.trait <- list() obARD <- list() G0 <- list() for (m in 1:M) { ####################################################################################### ####### SIMULATION STAGE ######## ####################################################################################### # ARD parameters # Generate z (spherical coordinates) genz <- rvMF(N[m], rep(0,P)) # Generate nu from a Normal(mu, sigma^2) (The gregariousness) gennu <- rnorm(N[m],mu,sigma) # compute degrees gend <- N[m]*exp(gennu)*exp(mu+0.5*sigma^2)*exp(logCpvMF(P,0) - logCpvMF(P,genzeta)) # Link probabilities distr <- sim.dnetwork(gennu, gend, genzeta, genz) # Adjacency matrix G <- sim.network(distr) G0[[m]] <- G # Generate vk, the trait location genv <- rvMF(K, rep(0, P)) # set fixed some vk distant genv[1,] <- c(1, 0, 0) genv[2,] <- c(0, 1, 0) genv[3,] <- c(0, 0, 1) # eta, the intensity parameter geneta <-abs(rnorm(K, 2, 1)) # Build traits matrix densityatz <- matrix(0, N[m], K) for(k in 1:K){ densityatz[,k] <- dvMF(genz, genv[k,]*geneta[k]) } trait <- matrix(0, N[m], K) NK <- floor(runif(K, .8, .95)*colSums(densityatz)/apply(densityatz, 2, max)) for (k in 1:K) { trait[,k] <- rbinom(N[m], 1, NK[k]*densityatz[,k]/sum(densityatz[,k])) } list.trait[[m]] <- trait # Build ADR ARD <- G %*% trait # generate b genb <- numeric(K) for(k in 1:K){ genb[k] <- sum(G[,trait[,k]==1])/sum(G) + 1e-8 } # sample with ARD n <- round(runif(1, .7, 1)*N[m]) # individual with ARD iselect <- sort(sample(1:N[m], n, replace = FALSE)) hasARD <- (1:N[m]) %in% iselect obARD[[m]] <- hasARD ARDs <- ARD[iselect,] traits <- trait[iselect,] ####################################################################################### ####### ESTIMATION STAGE ######## ####################################################################################### # initialization d0 <- gend[iselect]; b0 <- exp(rnorm(K)); eta0 <- rep(1,K); zeta0 <- genzeta z0 <- matrix(rvMF(n, rep(0,P)), n); v0 <- matrix(rvMF(K, rep(0, P)), K) # We should fix some vk and bk vfixcolumn <- 1:5 bfixcolumn <- c(1, 3, 5, 7, 9, 11) b0[bfixcolumn] <- genb[bfixcolumn]; v0[vfixcolumn,] <- genv[vfixcolumn,] start <- list("z" = z0, "v" = v0, "d" = d0, "b" = b0, "eta" = eta0, "zeta" = zeta0) estimates[[m]] <- mcmcARD(Y = ARDs, traitARD = traits, start = start, fixv = vfixcolumn, consb = bfixcolumn, sim.d = FALSE, sim.zeta = FALSE, iteration = 5000, ctrl.mcmc = list(print = FALSE)) } # SIMULATE DATA FOR THE OUTCOME MODEL # individual effects beta <- c(2,1,1.5) # contextual effects gamma <- c(5,-3) # endogenous effects alpha <- 0.4 # std-dev errors se <- 1 # covariates X <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7)) # Normalise G0 G0norm <- norm.network(G0) # GX GX <- peer.avg(G0norm, X) # simulate dependent variable use an external package y <- CDatanet::simsar(~ X + GX, Glist = G0norm, theta = c(alpha, beta, gamma, se)) y <- y$y # dataset dataset <- as.data.frame(cbind(y, X1 = X[,1], X2 = X[,2])) ``` **Once the data are simulated, the estimation can be performed using the function `mcmcSAR`.** ```{r bayesard4s, eval=FALSE} mlinks <- list(model = "latent space", estimates = estimates, mlinks.data = list.trait, obsARD = obARD) out.lspa2 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "none", data = dataset, mlinks = mlinks, iteration = 2e4) summary(out.lspa2) ``` ```{r Bayesard4b, echo=FALSE} summary(out.lspa2) ``` ```{r Bayesard4c, fig.height = 3, fig.align = "center"} plot(out.lspa2, plot.type = "sim", mar = c(3, 2.1, 1, 1)) ``` # The selection bias issue {#SELECT} In many applications, missing links are not completely random. For example, an individual who has no friends is less likely to have missing value on their row in the adjacency matrix than someone who has a lot of links. Even regressing a logit/probit network formation model using the entries of the adjacency matrix that are correctly measured will not yield a consistent estimator due to the selection bias issue. Consider the case of missing links due to unmatched declared friends. This is for example the case of the network data collected by the National Longitudinal Study of Adolescent to Adult Health (Add Health). We know the *true* number of links for each individual. This information is crucial to address the selection bias issue. When an individual has missing links, we could doubt all the "zeros" on their row in the adjacency matrix. To consistently estimate the SAR model, we can simply assume that information from this row is not true, i.e, the row has "zeros" everywhere in the object `G0.obs`. Information from individuals without missing links can be used to estimate `rho` and doubtful rows in the adjacency matrix will be inferred. However, there is a selection bias issue because we do not use rows with missing links to estimate `rho`. These rows are likely to have a lot of `ones`. As a consequence, the network formation model estimating using rows with no unmatched friends will simulate fewer links than the true data-generating process (DGP). A straightforward approach to address the issue is to weight the selected rows that are used to estimate `rho` (see @manski1977estimation). Every selected row must be weighted by the inverse of the probability of having no unmatched links. To do so, the practitioner can include an input `weights` in the list `mlinks` which is an argument of the function `mcmcsar`. The input `weights` must be a vector of dimension the number of "ones" in `G0.obs`. We consider the following example where the number of missing links is randomly chosen between zero and the true number of links. ```{r selb1, echo = TRUE, eval=FALSE} rm(list = ls()) library(PartialNetwork) library(dplyr) set.seed(123) # Number of groups M <- 50 # size of each group N <- rep(30,M) # individual effects beta <- c(2, 1, 1.5) # contextual effects gamma <- c(5,-3) # endogenous effects alpha <- 0.4 # std-dev errors se <- 2 # parameters of the network formation model rho <- c(-0.5, -.5, .4) # covariates X <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7)) # compute distance between individuals tmp <- c(0, cumsum(N)) X1l <- lapply(1:M, function(x) X[c(tmp[x] + 1):tmp[x+1],1]) X2l <- lapply(1:M, function(x) X[c(tmp[x] + 1):tmp[x+1],2]) dist.net <- function(x, y) abs(x - y) X1.mat <- lapply(1:M, function(m) { matrix(kronecker(X1l[[m]], X1l[[m]], FUN = dist.net), N[m])}) X2.mat <- lapply(1:M, function(m) { matrix(kronecker(X2l[[m]], X2l[[m]], FUN = dist.net), N[m])}) # true network Xnet <- as.matrix(cbind("Const" = 1, "dX1" = mat.to.vec(X1.mat), "dX2" = mat.to.vec(X2.mat))) ynet <- Xnet %*% rho ynet <- c(1*((ynet + rlogis(length(ynet))) > 0)) G0 <- vec.to.mat(ynet, N, normalise = FALSE) # number of friends nfriends <- unlist(lapply(G0, function(x) rowSums(x))) # number of missing links nmislink <- sapply(nfriends, function(x) sample(0:x, 1)) ``` We now simulate the observed network by removing links from `G0` depending on the number of unmatched links. We also simulate the outcome `y` using the true network. ```{r selb2, echo = TRUE, eval=FALSE} Gobs <- list(M) # The observed network G0.obs <- list(M) # Which information is true and doubtful for(x in 1:M){ Gx <- G0[[x]] G0.obsx <- matrix(1, N[x], N[x]); diag(G0.obsx) <- 0 csum <- cumsum(c(0, N)) nmis <- nmislink[(csum[x] + 1):csum[x + 1]] for (i in 1:N[x]) { if(nmis[i] > 0){ tmp <- which(c(Gx[i,]) == 1) if(length(which(c(Gx[i,]) == 1)) > 1) { tmp <- sample(which(c(Gx[i,]) == 1), nmis[i]) } Gx[i,tmp] <- 0 G0.obsx[i,] <- 0 } } Gobs[[x]] <- Gx G0.obs[[x]] <- G0.obsx } G0norm <- norm.network(G0) # GX GX <- peer.avg(G0norm, X) # simulate dependent variable use an external package y <- CDatanet::simsar(~ X + GX, Glist = G0norm, theta = c(alpha, beta, gamma, se)) y <- y$y # data set dataset <- as.data.frame(cbind(y, X1 = X[,1], X2 = X[,2])) ``` Assume that we estimate the network formation by selecting the individual with no unmatched links. ```{r, selb3a, eval=FALSE, echo=TRUE} mlinks <- list(model = "logit", mlinks.formula = ~ dX1 + dX2, mlinks.data = as.data.frame(Xnet)) out.selb1 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0 = Gobs, G0.obs = G0.obs, data = dataset, mlinks = mlinks, iteration = 2e4) summary(out.selb1) ``` ```{r, selb3b, echo=FALSE} print(sout.selb1) ``` Although the peer effect estimate appears to be correct, the network formation estimator is likely biased. Indeed the estimator of the intercept is $-2.26$ and is significantly lower than its actual value of -$2$. As a result, the network formation model is likely to simulate fewer links than the true DGP. This is an issue if the practitioner uses the model to simulate policy impact on the outcome `y` or network features, such as centrality. To control for the selection issue, we can weight selected individuals. The weight depends on the framework, especially how missing links occur. In our example, the number of missing links is chosen randomly between zero and the number of declared links. If the individual has $n_i$ declared friends, the probability of having no unmatched links can be estimated by the proportion of the number of individuals who has no unmatched links in the subset of individuals who declare $n_i$. ```{r, selb3c, eval=FALSE, echo=TRUE} G0.obsvec <- as.logical(mat.to.vec(G0.obs)) Gvec <- mat.to.vec(Gobs, ceiled = TRUE)[G0.obsvec] W <- unlist(data.frame(nfriends = nfriends, nmislink = nmislink) %>% group_by(nfriends) %>% summarise(w = length(nmislink)/sum(nmislink == 0)) %>% select(w)) W <- lapply(1:M, function(x){ matrix(rep(W[rowSums(G0[[x]]) + 1], each = N[x]), N[x], byrow = TRUE)}) weights <- mat.to.vec(W)[G0.obsvec] mlinks <- list(model = "logit", mlinks.formula = ~ dX1 + dX2, mlinks.data = as.data.frame(Xnet), weights = weights) out.selb2 <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0 = Gobs, G0.obs = G0.obs, data = dataset, mlinks = mlinks, iteration = 2e4) summary(out.selb2) ``` ```{r, selb3d, echo=FALSE} print(sout.selb2) ``` It is also possible to use selected sample and the weight to estimate `rho` and `var.rho` using the function `glm`. These estimates can be provided to `mlinks` in `estimates`. In this case `rho` will not be updated using the observed part of the network.