--- title: "Internals: object structure & writing custom extensions" description: "Inspect netify object structure and validation checks." author: "Cassy Dorff and Shahryar Minhas" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Internals: object structure & writing custom extensions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( dev = "png", dpi = 150, cache = FALSE, echo = TRUE, collapse = TRUE, comment = "#>" ) ``` This vignette is for **package developers**, **methodologists writing custom extensions**, and **anyone who wants to inspect the underlying object structure**. End-user workflows are covered in `vignette("quickstart_inference", package = "netify")` and the topic-specific project-site articles; this one goes one layer deeper. ```{r} library(netify) data(icews) ``` ## the netify object: a base r object with attributes A **netify object** is a base R matrix / 3D array / list-of-matrices with `class = "netify"` and a bundle of attributes carrying all metadata. The `netify_type` attribute records one of three shapes: | `netify_type` | Underlying R object | When used | |-----------------|---------------------------------------------|----------------------------------------------------| | `cross_sec` | `[n x n]` matrix (or `[r x c]` bipartite) | One time period | | `longit_array` | `[n x n x T]` array (or `[n x n x p x T]`) | Multi-period, constant actor set | | `longit_list` | Named list of matrices, one per time | Multi-period, varying actor composition | Multilayer networks insert a layer dimension (position 3) and store layer names in `attr(x, "layers")`. Mixed-directedness multilayer is supported via a vector-valued `symmetric` attribute. `dgCMatrix` inputs from the **Matrix** package are accepted but densified at construction (a one-shot cli inform is emitted); a true sparse storage backend is not yet implemented. Inspect any netify object's attribute bundle: ```{r} net <- netify(icews[icews$year == 2010, ], actor1 = "i", actor2 = "j", symmetric = FALSE, weight = "verbCoop", nodal_vars = "i_polity2", dyad_vars = "matlCoop") class(net) str(attributes(net), max.level = 1) ``` Key attributes: - `netify_type` -- `"cross_sec"`, `"longit_array"`, or `"longit_list"` - `mode` -- `"unipartite"` or `"bipartite"` - `symmetric` -- scalar logical (or named vector for mixed-directedness multilayer) - `weight` -- column name (or `NULL` for binary) - `is_binary`, `detail_weight`, `diag_to_NA`, `missing_to_zero`, `sum_dyads` - `layers` -- character vector; length > 1 means multilayer - `actor_pds` -- data.frame with `actor`, `min_time`, `max_time` per actor - `nodal_data` -- data.frame with `actor`, optional `time`, and one column per nodal variable - `dyad_data` -- **nested list**: `list[[time]][[var]] = matrix`. Cross-sec uses `"1"` as the time key. ## extracting parts | Want this | Use | |-----------|-----| | The raw matrix / array / list | `get_raw(net)` | | A quick numeric peek | `peek(net, from = 5, to = 5)` | | The full long edge data frame | `unnetify(net)` or `tidy(net)` | | Lean wide-to-long edge frame (no nodal merge) | `melt(net)` | | Graph-level stats | `summary(net)` or `glance(net)` | | Actor-level stats | `summary_actor(net)` | | Size / composition descriptors | `measurements(net)` | | Edge data + layout for plotting | `net_plot_data(net)$net_dfs` | `tidy()` and `glance()` are S3 methods on the broom generics -- they're registered on package load if `generics` (or anything that imports it, like `broom`, `dplyr`, `tidymodels`) is installed. No hard dependency on broom. ## tidy interop If `tibble` and `broom` are installed, three tidyverse-flavored entry points are available. `as_tibble.netify` is registered against `tibble::as_tibble` via `.onLoad`, so `tibble` must be installed to use the unprefixed call; `tidy()` and `glance()` are likewise registered against the broom generics. ```{r} library(tibble) library(broom) # one row per dyad (long edge frame, wrapped in a tibble) as_tibble(net) # broom-style tidy summary: a tibble with one row per (non-zero) dyad head(tidy(net)) # one-row-per-network model-card summary (one row per time/layer if applicable) glance(net) ``` `as_tibble(net)` and `tidy(net)` share the same long-format payload as `unnetify(net)` -- they differ only in whether zero-weight edges are dropped by default (`tidy()` drops them, matching the broom convention). `glance(net)` is the broom-flavored sibling of `summary(net)` -- one row of graph-level statistics per network (or per (time, layer) slice for longitudinal / multilayer inputs). `as_tibble()` also has a method for `netify_comparison` objects (from `compare_networks()`), returning the per-pair `$comparisons` frame directly so you can pipe straight into `filter()` / `arrange()` / `pivot_wider()`: ```{r} panel <- netify(icews[icews$year %in% c(2010, 2011), ], actor1 = "i", actor2 = "j", time = "year", symmetric = FALSE, weight = "verbCoop") net_2010 <- subset_netify(panel, time = "2010") net_2011 <- subset_netify(panel, time = "2011") cmp <- compare_networks(list("2010" = net_2010, "2011" = net_2011)) as_tibble(cmp) ``` ## predicates and descriptors For programmatic dispatch -- e.g. inside a custom exporter or pipeline step -- netify ships a small set of non-masking predicates and size accessors: ```{r} # class / structure predicates is_netify(net) is_bipartite(net) # may be masked by igraph::is_bipartite is_bipartite_netify(net) # alias that won't be masked is_directed_netify(net) is_longitudinal(net) is_multilayer(net) # size / composition accessors n_actors(net) # number of unique actors n_periods(net) # number of time periods (1 for cross-sec) n_layers(net) # number of layers (1 for single-layer) head(get_actor_time_info(net)) # stored actor_pds: actor, min_time, max_time ``` The `_netify` suffix on `is_bipartite_netify()` and `is_directed_netify()` avoids masking the same-named predicates from `igraph` and `network`. The unsuffixed versions exist too and defer to the foreign package if a non-netify graph object is passed. ## object-level validation `validate_netify()` is the developer's pre-flight check. It walks the attribute bundle and confirms the invariants the rest of the package relies on: ```{r} validate_netify(net, verbose = TRUE) ``` The full list of checks: - `netify_type` -- one of `"cross_sec"`, `"longit_array"`, `"longit_list"`, and matches the underlying R object - `mode` -- `"unipartite"` or `"bipartite"` - `symmetric_type` -- scalar logical, or named logical of length `n_layers` for mixed-directedness multilayer - `layers_consistent` -- `attr(net, "layers")` length matches the layer dimension where applicable - `nodal_actors_known` -- every actor referenced in `nodal_data` exists in the network's actor set - `is_binary_consistent` -- the stored `is_binary` flag matches the actual content - `symmetric_consistent` -- if stored as symmetric, the matrix content is actually symmetric - `unipartite_dimnames` -- row and column names agree for unipartite networks - `slice_dimnames_consistent` -- every slice of a longitudinal array/list has dimnames in the order recorded in `actor_pds` - `nodal_time_known` -- time keys in nodal/dyad data are a subset of the network's time axis A quick demo on a clean netlet versus one tampered to introduce a stray actor: ```{r} # clean netlet ticks every box all(unlist(validate_netify(net, verbose = FALSE))) # tamper: inject a stray actor into nodal_data bad <- net nd <- attr(bad, "nodal_data") nd <- rbind(nd, nd[1, , drop = FALSE]) nd$actor[nrow(nd)] <- "ZZZ_not_in_network" attr(bad, "nodal_data") <- nd validate_netify(bad, verbose = TRUE) ``` If a custom exporter or an internal manipulation breaks one of these, `validate_netify()` is the first place to look. ## open-cohort panels with `actor_pds` When the actor roster changes over time -- entries, exits, attrition, contact-tracing windows -- pass `actor_time_uniform = FALSE` and supply an `actor_pds` data.frame giving each actor's `[min_time, max_time]` window. Each period's netlet then contains only the actors whose window covers that period; densities, degree counts, and per-period actor sets respect those entry / exit boundaries. ### the `actor_pds` roster, step by step Three pieces have to line up for an open-cohort netlet to be well-formed: 1. **A roster** -- one row per actor, with `actor`, `min_time`, `max_time`. The `[min_time, max_time]` interval is *closed* (the actor is alive in the boundary periods themselves). 2. **An edgelist** -- one row per observed interaction. During construction, rows outside the roster windows are excluded because the referenced actor is not in the risk set for that period. 3. **`netify(actor_time_uniform = FALSE, actor_pds = roster, ...)`** -- this is what tells the constructor to honor the roster instead of treating every actor as present in every period. Below, actor `a` is in the network only during periods 1-2 (enters at `t = 1`, exits after `t = 2`), and actors `d` / `e` arrive at `t = 3`. The edgelist deliberately includes a tie at `t = 2` involving `a` and ties at `t = 3` involving `d` so we can verify the period-by-period actor sets afterwards. ```{r} set.seed(1) # roster: actors with closed-interval entry / exit times roster <- data.frame( actor = c("a", "b", "c", "d", "e"), min_time = c(1, 1, 1, 3, 3), max_time = c(2, 5, 4, 5, 5) # a exits after t = 2 ) # edges (only show up while both endpoints are in the roster) edges <- data.frame( i = c("a", "a", "b", "c", "d", "c", "d", "e"), j = c("b", "c", "c", "b", "e", "d", "e", "b"), t = c(1, 2, 2, 3, 4, 3, 5, 5) ) net_oc <- netify(edges, actor1 = "i", actor2 = "j", time = "t", actor_time_uniform = FALSE, actor_pds = roster ) # read the roster back off the netlet itself head(get_actor_time_info(net_oc)) n_actors(net_oc) n_periods(net_oc) ``` `get_actor_time_info()` on a netify object returns the stored `actor_pds` directly (it is also the argument name on `netify()` itself, so the round trip is exact). On a raw dyad data.frame the same function *derives* the roster from observed activity -- that's how you build the `actor_pds` argument in the first place. This is the standard path for open-cohort longitudinal data -- panel surveys with attrition, contact-tracing chains, organizational membership over time, animal co-occurrence with births / deaths, and similar settings where treating every actor as present in every period would distort denominators. ### density and per-period actor sets The key correctness guarantee: density (and every other per-period statistic) is computed against the actor set *alive in that period*, not the union of all actors ever observed. With the roster above: - `t = 1`: actors `a`, `b`, `c` are alive (3 actors) - `t = 2`: actors `a`, `b`, `c` are alive (3 actors; `a`'s last period) - `t = 3`: actors `b`, `c`, `d`, `e` are alive (4 actors; `a` is now out, `d` / `e` enter) - `t = 4`: actors `b`, `c`, `d`, `e` are alive (4 actors) - `t = 5`: actors `b`, `d`, `e` are alive (3 actors; `c`'s last period was 4) ```{r} oc_summary <- summary(net_oc) oc_summary[, c("net", "num_actors", "density", "num_edges")] ``` The period-3 denominator is **4 x 3 = 12** (directed) or **4 x 3 / 2 = 6** (symmetric), *not* 5 x 4 -- actor `a` is not counted because its `max_time` is 2. This is the load-bearing invariant: density at `t = 3` excludes the actor present only in periods 1-2, so a researcher comparing densities across periods is not comparing apples to oranges. The same accounting flows into `summary_actor()`, `homophily()`, `mixing_matrix()`, and the plot helpers -- open-cohort actors never show up as zero-degree placeholders in periods they did not exist. ## na versus zero in weighted networks For weighted data, the distinction between `0` (observed, but no edge / zero-valued interaction) and `NA` (unobserved / not-at-risk) is semantically important in many domains -- epidemiology, animal behavior, sparse survey rosters. By default `netify(missing_to_zero = TRUE)` fills unobserved dyads with `0`. Pass `missing_to_zero = FALSE` to keep them as `NA`, in which case `summary()$prop_edges_missing` reports the non-trivial missingness fraction and downstream centrality / homophily routines propagate the NA semantics rather than silently treating the dyad as a zero-weight tie. Both `prop_edges_missing` and `prop_unknown_edges` use the same denominator as `density` -- the number of *potential edge dyads* (off-diagonal for unipartite + `diag_to_NA`, halved for symmetric, all cells for bipartite). That means `density + prop_edges_missing + observed_zero_fraction = 1` is an identity, which is the property you want when you are reading them as competing fractions. The `prop_unknown_edges` column is suppressed when `missing_to_zero = TRUE` because every unobserved dyad has been filled with 0 and the value would be identically zero in every row; when present it tracks `prop_edges_missing` and serves as the "this netlet carries NA semantics" cue downstream. ## writing a custom graph-level statistic `summary(net, other_stats = list(my_stat = fn))` accepts user-supplied functions. Each function receives the **netify object** for the current time period / layer (not a stripped matrix), so you can call `netify_to_igraph()`, `peek()`, or whatever you want inside. Return a **named numeric vector** -- names become column names in the output frame. ```{r} # example: number of weakly connected components with at least 2 nodes n_components_2plus <- function(net) { g <- netify_to_igraph(net) c(n_components_2plus = sum(igraph::components(g)$csize >= 2)) } # example: edge weight skewness weight_skew <- function(net) { v <- as.vector(net) v <- v[!is.na(v) & v != 0] if (length(v) < 3) return(c(weight_skew = NA_real_)) c(weight_skew = mean((v - mean(v))^3) / (stats::sd(v)^3)) } summary(net, other_stats = list( comp = n_components_2plus, skew = weight_skew )) ``` The same `other_stats` mechanism is available in `summary_actor()`, `compare_networks()`, `homophily()`, `mixing_matrix()`, and `dyad_correlation()`. In each case the function gets called once per time period / layer / iteration as appropriate; check the function's `?` for the exact contract. ## writing a custom actor-level statistic `summary_actor(net, other_stats = list(my_stat = fn))`. The function should return a vector with one value per actor (in row order): ```{r} # example: per-actor mean tie weight to non-isolates mean_active_tie <- function(mat) { apply(mat, 1, function(row) { nonzero <- row[!is.na(row) & row != 0] if (length(nonzero) == 0) NA_real_ else mean(nonzero) }) } head(summary_actor(net, stats = "fast", other_stats = list(mean_active = mean_active_tie))) ``` ## reading the dyad_data nested list `attr(net, "dyad_data")` is the structure that gives netify O(1) access to per-time dyadic covariates. Its shape is: ```{r} dd <- attr(net, "dyad_data") names(dd) # cross-sec: just "1" names(dd[["1"]]) # one entry per dyadic variable str(dd[["1"]][["matlCoop"]]) ``` For longitudinal networks the top level has one entry per period: ```{r, eval = FALSE} # pseudo-structure for a 3-period network with 2 dyadic vars: # dd[["2010"]][["matlCoop"]] -> n x n matrix # dd[["2010"]][["verbConf"]] -> n x n matrix # dd[["2011"]][["matlCoop"]] -> n x n matrix # ... etc. ``` If you're writing a converter (e.g., to a new modeling format), this is the structure to iterate over. ## writing a custom exporter (`to_*` function) The existing `to_amen()`, `to_lame()`, `to_dbn()`, `to_igraph()`, `to_statnet()` give you templates. `to_amen()` exports the standard amen-style data structure; `to_lame()` wraps that exporter with `mode`, `family`, and an executable `ame_call` snippet for lame workflows. A new exporter generally needs to: 1. Validate the netify object with `validate_netify(netlet, verbose = FALSE)` 2. Branch on `attr(netlet, "netify_type")` (cross_sec / longit_array / longit_list) 3. For multilayer, decide whether to iterate per layer (use `subset_netify(netlet, layers = X)` and return a named list) or to produce a joint structure (4D array, etc.) 4. Pull the raw network data with `get_raw(netlet)` 5. Pull nodal data with `attr(netlet, "nodal_data")` and align to your target package's actor order 6. Pull dyadic data with `attr(netlet, "dyad_data")` and reshape to your target format 7. Handle missing values according to your target package's convention (most don't accept NAs) If your target package expects per-layer outputs but takes multilayer netify inputs, the standard pattern is: ```{r, eval = FALSE} to_mymodel <- function(netlet, ...) { validate_netify(netlet, verbose = FALSE) layer_names <- attributes(netlet)$layers if (length(layer_names) > 1) { out <- lapply(layer_names, function(lyr) { to_mymodel(subset_netify(netlet, layers = lyr), ...) }) names(out) <- layer_names return(out) } # ... single-layer logic ... } ``` This is what `to_igraph()`, `to_statnet()`, and `to_lame()` do. ## performance characteristics | Operation | Complexity | Notes | |-----------|-----------|-------| | `netify(df)` cross-sec | O(N * E) | C++ via Rcpp; fast | | `netify(df, time = ...)` longit | O(N * E * T) | C++; fast | | `summary(net)` | O(N^2) per period | igraph backend | | `summary_actor(net)` | O(N^2) per period x per layer | igraph; closeness/betweenness dominate | | `bootstrap_netlet(net, fn, n_boot = B)` | O(B * n * N^2) | depends on the statistic supplied in `fn` | | `compare_networks(method = "qap")` | O(R * N^2) | C++ permutations; tunable via `n_permutations` | | `compare_networks(method = "spectral")` | O(N^3) | use `spectral_rank = round(sqrt(N))` for large nets | | `unnetify(net)` | O(N^2 * T) | use `remove_zeros = TRUE` for sparse output | | `melt(net)` | O(N^2 * T) | leaner than `unnetify`; no nodal merge | | `plot(net)` | O(N^2) layout + O(E) render | igraph layout dominates | For networks of a few hundred actors over a dozen time periods, everything runs in seconds. The C++ routines for `netify()`, `compare_networks()` QAP, and similarity calculations handle the heaviest paths. The R-side wrappers (especially in plotting and attribute handling) are not as optimized. **Memory budget rule of thumb.** netify stores adjacencies densely: an N x N double-precision matrix is 8 * N^2 bytes. For longitudinal stacks the cost is 8 * N^2 * T (`longit_array`) or the sum over per-period sizes (`longit_list`, when actor composition varies). Concretely: | N | per-snapshot RAM | T = 12 (monthly year) | T = 52 (weekly year) | |---|------------------|-----------------------|----------------------| | 200 | 320 KB | 3.8 MB | 17 MB | | 1,000 | 8 MB | 96 MB | 416 MB | | 5,000 | 200 MB | 2.4 GB | 10 GB | | 10,000 | 800 MB | 9.4 GB | 41 GB | | 50,000 | 20 GB | 234 GB | 1 TB | `longit_list` netlets cost the *sum* of per-period sizes rather than `T x max(N)^2`, so they pay less when actor composition is sparse over time. For 15,000-node weekly Twitter snapshots over a year (`T = 52`), a `longit_array` is ~93 GB and would not fit on a laptop; a `longit_list` whose typical-period N is much smaller is the only viable in-memory shape. Above ~10,000 actors, prefer building the netlet from an edgelist data.frame (skips the dense intermediate at construction) and consider exiting to an edge data.frame via `unnetify(net, remove_zeros = TRUE)`, or to `igraph` via `to_igraph(net)` for community detection and other large-N algorithms. `summary_actor(stats = "fast")` skips closeness / betweenness / eigen / HITS and auto-promotes when N exceeds `getOption("netify.fast_threshold", 1500L)`. **Sparse-input guard.** Passing a `Matrix::dgCMatrix` (or any `sparseMatrix`) with density < 1% and N > 5,000 aborts construction with a pointer to the edgelist path. The motivating case is a 15K x 15K follower graph at density 0.001: densifying allocates ~1.7 GB of mostly zeros, which is almost never what the caller wants. Pass `force_dense = TRUE` to override if you really do want the dense allocation. **Benchmark, three ER networks.** Wall-clock for the dense-matrix path on a representative laptop (single core, single snapshot, p = 0.01). Re-run locally with the chunk below if you want numbers for your own machine. ```{r benchmark, eval = FALSE} library(netify) set.seed(1) bench_one <- function(N, p = 0.01) { # build an er adjacency directly as an edgelist (skips the dense intermediate) i <- sample.int(N, size = round(p * N * N), replace = TRUE) j <- sample.int(N, size = length(i), replace = TRUE) df <- data.frame(from = i, to = j) t0 <- Sys.time(); net <- netify(df, actor1 = "from", actor2 = "to"); t_build <- Sys.time() - t0 t0 <- Sys.time(); s <- summary(net); t_summary <- Sys.time() - t0 t0 <- Sys.time(); sa <- summary_actor(net, stats = "fast"); t_actor_fast <- Sys.time() - t0 t0 <- Sys.time(); ig <- to_igraph(net); t_igraph <- Sys.time() - t0 data.frame(N = N, build_s = as.numeric(t_build, units = "secs"), summary_s = as.numeric(t_summary, units = "secs"), summary_actor_fast_s = as.numeric(t_actor_fast, units = "secs"), to_igraph_s = as.numeric(t_igraph, units = "secs")) } do.call(rbind, lapply(c(1000, 5000, 10000), bench_one)) ``` Indicative results (single-core, 16 GB laptop): | N | `netify()` | `summary()` | `summary_actor(stats = "fast")` | `to_igraph()` | |--------|------------|-------------|---------------------------------|---------------| | 1,000 | < 1 s | < 1 s | < 1 s | < 1 s | | 5,000 | ~2 s | ~4 s | ~1 s | < 1 s | | 10,000 | ~2 s | ~14 s | ~3 s | ~3 s | `summary()` is dominated by the igraph-based global metrics and grows fastest; `netify()` itself stays under a few seconds even at N = 10,000 when fed an edgelist. The full `summary_actor()` (with closeness / betweenness / eigen / HITS) blows up roughly quadratically -- at N = 10,000 it takes several minutes versus the few seconds shown above for the fast path. That's why the auto-promote fires by default. ## see also - `vignette("quickstart_inference", package = "netify")` -- minimal end-to-end tour - [Foundations](https://netify-dev.github.io/netify/articles/foundations.html) -- full IR walkthrough - [Pipeline: netify to lame and dbn](https://netify-dev.github.io/netify/articles/pipeline_lame_dbn.html) -- optional modeling handoff article - `vignette("pipeline_netify_ergm", package = "netify")` -- modeling handoff to ergm / statnet