Package 'NAIR' reference manual

Title:	Network Analysis of Immune Repertoire
Description:	Pipelines for studying the adaptive immune repertoire of T cells and B cells via network analysis based on receptor sequence similarity. Relate clinical outcomes to immune repertoires based on their network properties, or to particular clusters and clones within a repertoire. Yang et al. (2023) <doi:10.3389/fimmu.2023.1181825>.
Authors:	Brian Neal [aut, cre], Hai Yang [aut], Daniil Matveev [aut], Phi Long Le [aut], Li Zhang [cph, aut]
Maintainer:	Brian Neal
License:	GPL (>= 3)
Version:	1.0.4
Built:	2025-01-27 06:55:07 UTC
Source:	CRAN

NAIR: Network Analysis of Immune Repertoire

Description

To learn about the NAIR package and get started, visit the package website, or browse the package vignettes offline:

browseVignettes(package = "NAIR")

The following vignette is a good place to start:

vignette("NAIR", package = "NAIR")

Author(s)

Brian Neal, Maintainer
Hai Yang
Phi-Long Le
Li Zhang

Partition a Network Graph Into Clusters

Description

Given a list of network objects returned by buildRepSeqNetwork() or generateNetworkObjects(), partitions the network graph into clusters using the specified clustering algorithm, adding a cluster membership variable to the node metadata.

Usage

addClusterMembership(
  net,
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_id",
  overwrite = FALSE,
  verbose = FALSE,
  ...,
  data = deprecated(),
  fun = deprecated()
)

Arguments

`net`	A `list` of network objects conforming to the output of `buildRepSeqNetwork()` or `generateNetworkObjects()`. See details. Alternatively, this argument accepts the network `igraph`, with the node metadata passed to the `data` argument. However, this alternative functionality is deprecated and will eventually be removed.
`cluster_fun`	A character string specifying the clustering algorithm to use. See details.
`cluster_id_name`	A character string specifying the name of the cluster membership variable to be added to the node metadata.
`overwrite`	Logical. Should the variable specified by `cluster_id_name` be overwritten if it already exists?
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
`...`	Named optional arguments to the function specified by `cluster_fun`.
`data`	See `net`.
`fun`	Replaced by `cluster_fun`.

Details

The list net must contain the named elements igraph (of class igraph), adjacency_matrix (a matrix or dgCMatrix encoding edge connections), and node_data (a data.frame containing node metadata), all corresponding to the same network. The lists returned by buildRepSeqNetwork() and generateNetworkObjects() are examples of valid inputs for the net argument.

Alternatively, the igraph may be passed to net and the node metadata to data. However, this alternative functionality is deprecated and will eventually be removed.

A clustering algorithm is used to partition the network graph into clusters (densely-connected subgraphs). Each cluster represents a collection of clones/cells with similar receptor sequences. The method used to partition the graph depends on the choice of clustering algorithm, which is specified using the cluster_fun argument.

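Each available option for cluster_fun refers to an igraph function implementing a particular clustering algorithm. For example, "fast_greedy" (the default), "leiden", and "louvain" refer to cluster_fast_greedy(), cluster_leiden(), and cluster_louvain(), respectively. See the igraph documentation to learn more about the individual clustering algorithms.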

Optional arguments to each clustering algorithm can be specified using the ellipsis (...) argument of addClusterMembership().

Each cluster is assigned a numeric cluster ID. A cluster membership variable, whose name is specified by cluster_id_name, is added to the node metadata, encoding the cluster membership of the node for each row. The cluster membership is encoded as the cluster ID number of the cluster to which the node belongs.
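As a rough illustration of this encoding, the sketch below calls igraph directly rather than addClusterMembership(). The toy graph and the node_data data frame are made up for the example, and the package itself may handle the details differently.

```r
library(igraph)

# Hypothetical toy network: two triangles joined by a single edge (C-D)
g <- graph_from_literal(A-B, A-C, B-C, C-D, D-E, D-F, E-F)

# Hypothetical node metadata, one row per node in the graph's vertex order
node_data <- data.frame(node = V(g)$name)

# Partition the graph with cluster_fast_greedy(), the igraph function
# that the option "fast_greedy" refers to
communities <- cluster_fast_greedy(g)

# Store each node's numeric cluster ID as a new metadata variable
node_data$cluster_id <- as.integer(membership(communities))

node_data
```

Each row of node_data now carries the ID of the cluster containing that node, which is the same shape of result that the cluster membership variable is described as having.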

The overwrite argument controls whether to overwrite pre-existing data. If the variable specified by cluster_id_name is already present in the node metadata, then overwrite must be set to TRUE in order to perform clustering and overwrite the variable with new cluster membership values. Alternatively, by specifying a value for cluster_id_name that is not among the variables in the node metadata, a new cluster membership variable can be created while preserving the old cluster membership variable. In this manner, clustering can be performed multiple times on the same network using different clustering algorithms, without losing the results.

Value

If the variable specified by cluster_id_name is not present in net$node_data, returns a copy of net with this variable added to net$node_data encoding the cluster membership of the network node corresponding to each row. If the variable is already present and overwrite = TRUE, then its values are replaced with the new values for cluster membership.

Additionally, if net contains a list named details, then the following elements will be added to net$details if they do not already exist:

`clusters_in_network`	A named numeric vector of length 1. The first entry's name is the name of the clustering algorithm, and its value is the number of clusters resulting from performing clustering on the network.
`cluster_id_variable`	A named character vector of length 1. The first entry's name is the name of the clustering algorithm, and its value is the name of the corresponding cluster membership variable in the node metadata (i.e., the value of `cluster_id_name`).

If net$details already contains these elements, they are updated according to whether the cluster membership variable specified by cluster_id_name is newly added to net$node_data or already exists and is overwritten. In the former case (the cluster membership variable does not already exist), the length of each vector (clusters_in_network and cluster_id_variable) is increased by 1, with the new information appended as a new named entry to each. In the latter case (the cluster membership variable is overwritten), the new information overwrites the name and value of the last entry of each vector.
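The two update rules can be mimicked with plain named vectors. The sketch below does not call NAIR, and the cluster counts are invented for illustration; only the append-versus-replace pattern is the point.

```r
# State after a first clustering run (the counts are made-up values)
clusters_in_network <- c(fast_greedy = 20)
cluster_id_variable <- c(fast_greedy = "cluster_id")

# Case 1: a new membership variable is added, so each vector grows by one
clusters_in_network <- c(clusters_in_network, louvain = 18)
cluster_id_variable <- c(cluster_id_variable, louvain = "cluster_id_louvain")

# Case 2: an existing membership variable is overwritten, so the name and
# value of the last entry are replaced rather than appended
last <- length(clusters_in_network)
clusters_in_network[last] <- 17
names(clusters_in_network)[last] <- "leiden"
cluster_id_variable[last] <- "cluster_id_louvain"
names(cluster_id_variable)[last] <- "leiden"

clusters_in_network
cluster_id_variable
```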

If overwrite = FALSE and net$node_data already contains a variable with the same name as the value of cluster_id_name, an unaltered copy of net is returned with a message notifying the user.

Under the alternative (deprecated) input format where the node metadata is passed to data and the igraph is passed to net, the node metadata is returned instead of the list of network objects, with the cluster membership variable added or updated as described above.

Author(s)

Brian Neal

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
toy_data <- simulateToyData()

net <- generateNetworkObjects(
  toy_data, "CloneSeq"
)

# Perform cluster analysis,
# add cluster membership to net$node_data
net <- addClusterMembership(net)

net$details$clusters_in_network
net$details$cluster_id_variable

# overwrite values in net$node_data$cluster_id
# with cluster membership values obtained using "cluster_leiden" algorithm
net <- addClusterMembership(
  net,
  cluster_fun = "leiden",
  overwrite = TRUE
)

net$details$clusters_in_network
net$details$cluster_id_variable

# perform clustering using "cluster_louvain" algorithm
# saves cluster membership values to net$node_data$cluster_id_louvain
# (net$node_data$cluster_id retains membership values from "cluster_leiden")
net <- addClusterMembership(
  net,
  cluster_fun = "louvain",
  cluster_id_name = "cluster_id_louvain"
)

net$details$clusters_in_network
net$details$cluster_id_variable

Compute Cluster-Level Network Properties

Description

Given a list of network objects returned by buildRepSeqNetwork() or generateNetworkObjects(), computes cluster-level network properties, performing clustering first if needed. The list of network objects is returned with the cluster properties added as a data frame.

Usage

addClusterStats(
  net,
  cluster_id_name = "cluster_id",
  seq_col = NULL,
  count_col = NULL,
  degree_col = "degree",
  cluster_fun = "fast_greedy",
  overwrite = FALSE,
  verbose = FALSE,
  ...
)

Arguments

`net`	A `list` of network objects conforming to the output of `buildRepSeqNetwork()` or `generateNetworkObjects()`. See details.
`cluster_id_name`	A character string specifying the name of the cluster membership variable in `net$node_data` that identifies the cluster to which each node belongs. If the variable does not exist, it will be added by calling `addClusterMembership()`. If the variable does exist, its values will be used unless `overwrite = TRUE`, in which case its values will be overwritten and the new values used.
`seq_col`	Specifies the column(s) of `net$node_data` containing the receptor sequences upon whose similarity the network is based. Accepts a character or numeric vector of length 1 or 2, containing either column names or column indices. If provided, related cluster-level properties will be computed. The default `NULL` will use the value contained in `net$details$seq_col` if it exists and is valid.
`count_col`	Specifies the column of `net$node_data` containing a measure of abundance (such as clone count or UMI count). Accepts a character string containing the column name or a numeric scalar containing the column index. If provided, related cluster-level properties will be computed.
`degree_col`	Specifies the column of `net$node_data` containing the network degree of each node. Accepts a character string containing the column name. If the column does not exist, it will be added.
`cluster_fun`	A character string specifying the clustering algorithm to use when adding or overwriting the cluster membership variable in `net$node_data` specified by `cluster_id_name`. Passed to `addClusterMembership()`.
`overwrite`	Logical. If `TRUE` and `net` already contains an element named `cluster_data`, it will be overwritten. Similarly, if `overwrite = TRUE` and `net$node_data` contains a variable whose name matches the value of `cluster_id_name`, then its values will be overwritten with new cluster membership values (obtained using `addClusterMembership()` with the specified value of `cluster_fun`), and cluster properties will be computed based on the new values.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
`...`	Named optional arguments to the function specified by `cluster_fun`.

Details

If the network graph has previously been partitioned into clusters using addClusterMembership() and the user wishes to compute network properties for these clusters, the name of the cluster membership variable in net$node_data should be provided to the cluster_id_name argument.

If the value of cluster_id_name is not the name of a variable in net$node_data, then clustering is performed using addClusterMembership() with the specified value of cluster_fun, and the cluster membership values are written to net$node_data using the value of cluster_id_name as the variable name. If overwrite = TRUE, this is done even if this variable already exists.

Value

A modified copy of net, with cluster properties contained in the element cluster_data. This is a data.frame containing one row for each cluster in the network and the following variables:

`cluster_id`	The cluster ID number.
`node_count`	The number of nodes in the cluster.
`mean_seq_length`	The mean sequence length in the cluster. Only present when `length(seq_col) == 1`.
`A_mean_seq_length`	The mean first sequence length in the cluster. Only present when `length(seq_col) == 2`.
`B_mean_seq_length`	The mean second sequence length in the cluster. Only present when `length(seq_col) == 2`.
`mean_degree`	The mean network degree in the cluster.
`max_degree`	The maximum network degree in the cluster.
`seq_w_max_degree`	The receptor sequence possessing the maximum degree within the cluster. Only present when `length(seq_col) == 1`.
`A_seq_w_max_degree`	The first sequence of the node possessing the maximum degree within the cluster. Only present when `length(seq_col) == 2`.
`B_seq_w_max_degree`	The second sequence of the node possessing the maximum degree within the cluster. Only present when `length(seq_col) == 2`.
`agg_count`	The aggregate count among all nodes in the cluster (based on the counts in `count_col`).
`max_count`	The maximum count among all nodes in the cluster (based on the counts in `count_col`).
`seq_w_max_count`	The receptor sequence possessing the maximum count within the cluster. Only present when `length(seq_col) == 1`.
`A_seq_w_max_count`	The first sequence of the node possessing the maximum count within the cluster. Only present when `length(seq_col) == 2`.
`B_seq_w_max_count`	The second sequence of the node possessing the maximum count within the cluster. Only present when `length(seq_col) == 2`.
`diameter_length`	The longest geodesic distance in the cluster, computed as the length of the vector returned by `get_diameter()`.
`assortativity`	The assortativity coefficient of the cluster's graph, based on the degree (minus one) of each node in the cluster (with the degree computed based only upon the nodes within the cluster). Computed using `assortativity_degree()`.
`global_transitivity`	The transitivity (i.e., clustering coefficient) for the cluster's graph, which estimates the probability that adjacent vertices are connected. Computed using `transitivity()` with `type = "global"`.
`edge_density`	The number of edges in the cluster as a fraction of the maximum possible number of edges. Computed using `edge_density()`.
`degree_centrality_index`	The centrality index of the cluster's graph based on within-cluster network degree. Computed as the `centralization` element of the output from `centr_degree()`.
`closeness_centrality_index`	The centrality index of the cluster's graph based on closeness, i.e., distance to other nodes in the cluster. Computed using `centralization()`.
`eigen_centrality_index`	The centrality index of the cluster's graph based on the eigenvector centrality scores, i.e., values of the first eigenvector of the adjacency matrix for the cluster. Computed as the `centralization` element of the output from `centr_eigen()`.
`eigen_centrality_eigenvalue`	The eigenvalue corresponding to the first eigenvector of the adjacency matrix for the cluster. Computed as the `value` element of the output from `eigen_centrality()`.
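Several of the properties above map onto single igraph calls. The sketch below applies those calls to a small made-up graph standing in for one cluster; it is not the package's internal code, and the graph is hypothetical. It also shows that the length of the vector returned by get_diameter() counts the nodes on the longest geodesic, which is one more than the edge count reported by diameter().

```r
library(igraph)

# Hypothetical cluster: a triangle (A, B, C) with one pendant node (D)
cluster_graph <- graph_from_literal(A-B, A-C, B-C, C-D)

# Longest geodesic: diameter() counts edges, whereas the length of
# get_diameter() counts the nodes along that path
diameter(cluster_graph)
length(get_diameter(cluster_graph))

# Fraction of possible edges that are present (4 of 6 here)
edge_density(cluster_graph)

# Global transitivity (clustering coefficient)
transitivity(cluster_graph, type = "global")

# Degree-based assortativity and degree centralization index
assortativity_degree(cluster_graph)
centr_degree(cluster_graph)$centralization
```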

If net$node_data did not previously contain a variable whose name matches the value of cluster_id_name, then this variable will be present and will contain values for cluster membership, obtained through a call to addClusterMembership() using the clustering algorithm specified by cluster_fun.

If net$node_data did previously contain a variable whose name matches the value of cluster_id_name and overwrite = TRUE, then the values of this variable will be overwritten with new values for cluster membership, obtained as above based on cluster_fun.

If net$node_data did not previously contain a variable whose name matches the value of degree_col, then this variable will be present and will contain values for network degree.

Additionally, if net contains a list named details, then the following elements will be added to net$details, or overwritten if they already exist:

`cluster_data_goes_with`	A character string containing the value of `cluster_id_name`. When `net$node_data` contains multiple cluster membership variables (e.g., from applying different clustering methods), `cluster_data_goes_with` allows the user to distinguish which of these variables corresponds to `net$cluster_data`.
`count_col_for_cluster_data`	A character string containing the value of `count_col`. If `net$node_data` contains multiple count variables, this allows the user to distinguish which of these variables corresponds to the count-related properties in `net$cluster_data`, such as `max_count`. If `count_col = NULL`, then the value will be `NA`.

Author(s)

Brian Neal

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
toy_data <- simulateToyData()

net <- generateNetworkObjects(
  toy_data, "CloneSeq"
)

net <- addClusterStats(
  net,
  count_col = "CloneCount"
)

head(net$cluster_data)
net$details

# won't change net since net$cluster_data exists
net <- addClusterStats(
  net,
  count_col = "CloneCount",
  cluster_fun = "leiden",
  verbose = TRUE
)

# overwrites values in net$cluster_data
# and cluster membership values in net$node_data$cluster_id
# with values obtained using "cluster_leiden" algorithm
net <- addClusterStats(
  net,
  count_col = "CloneCount",
  cluster_fun = "leiden",
  overwrite = TRUE
)

net$details

# overwrites existing values in net$cluster_data
# with values obtained using "cluster_louvain" algorithm
# saves cluster membership values to net$node_data$cluster_id_louvain
# (net$node_data$cluster_id retains membership values from "cluster_leiden")
net <- addClusterStats(
  net,
  count_col = "CloneCount",
  cluster_fun = "louvain",
  cluster_id_name = "cluster_id_louvain",
  overwrite = TRUE
)

net$details

# perform clustering using "cluster_fast_greedy" algorithm,
# save cluster membership values to net$node_data$cluster_id_greedy
net <- addClusterMembership(
  net,
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_id_greedy"
)

# compute cluster properties for the clusters from previous step
# overwrites values in net$cluster_data
net <- addClusterStats(
  net,
  cluster_id_name = "cluster_id_greedy",
  overwrite = TRUE
)

net$details

Compute Node-Level Network Properties

Description

Given the node metadata and igraph for a network, computes a specified set of network properties for the network nodes. The node metadata is returned with each property added as a variable.

This function was deprecated in favor of addNodeStats() in NAIR 1.0.1. The new function accepts and returns the entire list of network objects returned by buildRepSeqNetwork() or by generateNetworkObjects(). It can compute cluster membership and add the values to the node metadata. It additionally updates the list element details with further information linking the node-level and cluster-level metadata.

Usage

addNodeNetworkStats(
  data,
  net,
  stats_to_include = chooseNodeStats(),
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_id",
  overwrite = FALSE,
  verbose = FALSE,
  ...
)

Arguments

`data`	A data frame containing the node-level metadata for the network, with each row corresponding to a network node.
`net`	The network `igraph`.
`stats_to_include`	Specifies which network properties to compute. Accepts a vector created using `chooseNodeStats()` or `exclusiveNodeStats()`, or the character string `"all"` to compute all network properties.
`cluster_fun`	A character string specifying the clustering algorithm to use when computing cluster membership. Applicable only when `stats_to_include = "all"` or `stats_to_include["cluster_id"]` is `TRUE`. Passed to `addClusterMembership()`.
`cluster_id_name`	A character string specifying the name of the cluster membership variable to be added to `data`. Applicable only when `stats_to_include = "all"` or `stats_to_include["cluster_id"]` is `TRUE`. Passed to `addClusterMembership()`.
`overwrite`	Logical. If `TRUE` and `data` contains a variable whose name matches the value of `cluster_id_name`, then its values will be overwritten with new cluster membership values (obtained using `addClusterMembership()` with the specified value of `cluster_fun`). Applicable only when `stats_to_include = "all"` or `stats_to_include["cluster_id"]` is `TRUE`.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
`...`	Named optional arguments to the function specified by `cluster_fun`.

Details

Node-level network properties are properties that pertain to each individual node in the network graph.

Some are local properties, meaning that their value for a given node depends only on a subset of the nodes in the network. One example is the network degree of a given node, which represents the number of other nodes that are directly joined to the given node by an edge connection.

Other properties are global properties, meaning that their value for a given node depends on all of the nodes in the network. An example is the authority score of a node, which is computed using the entire graph adjacency matrix (if we denote this matrix by $A$, then the principal eigenvector of $A^T A$ represents the authority scores of the network nodes).
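The contrast between a local and a global property can be sketched in base R with a small made-up adjacency matrix. This is only an illustration of the definitions above, not the computation the package performs; the four-node network is hypothetical.

```r
# Hypothetical undirected network: a triangle (nodes 1, 2, 3)
# plus node 4 attached to node 3
A <- matrix(0, nrow = 4, ncol = 4)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))
A[edges] <- 1
A[edges[, 2:1]] <- 1  # mirror the entries so that A is symmetric

# Local property: the degree of a node uses only that node's row of A
degree <- rowSums(A)

# Global property: authority scores come from the principal eigenvector
# of t(A) %*% A, which involves every entry of A
decomp <- eigen(crossprod(A), symmetric = TRUE)
authority <- abs(decomp$vectors[, 1])
authority <- authority / max(authority)  # rescale so the top score is 1

data.frame(node = 1:4, degree = degree, authority = round(authority, 3))
```

Adding or removing an edge far from a given node leaves that node's degree unchanged but generally shifts its authority score, since the eigenvector depends on the whole matrix.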

See chooseNodeStats() for a list of the available node-level network properties.

Value

A copy of data with an additional column for each new network property computed. See chooseNodeStats() for the network property names, which are used as the column names, except for the cluster membership variable, whose name is the value of cluster_id_name.

Author(s)

Brian Neal

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
toy_data <- simulateToyData()

net <-
  generateNetworkObjects(
    toy_data,
    "CloneSeq"
  )

net$node_data <-
  addNodeNetworkStats(
    net$node_data,
    net$igraph
  )


Compute Node-Level Network Properties

Description

Given a list of network objects returned by buildRepSeqNetwork() or generateNetworkObjects(), computes a specified set of network properties for the network nodes. The list of network objects is returned with each property added as a variable to the node metadata.

Usage

addNodeStats(
  net,
  stats_to_include = chooseNodeStats(),
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_id",
  overwrite = FALSE,
  verbose = FALSE,
  ...
)

Arguments

`net`	A `list` of network objects conforming to the output of `buildRepSeqNetwork()` or `generateNetworkObjects()`. See details.
`stats_to_include`	Specifies which network properties to compute. Accepts a vector created using `chooseNodeStats()` or `exclusiveNodeStats()`, or the character string `"all"` to compute all network properties.
`cluster_fun`	A character string specifying the clustering algorithm to use when computing cluster membership. Applicable only when `stats_to_include = "all"` or `stats_to_include["cluster_id"]` is `TRUE`. Passed to `addClusterMembership()`.
`cluster_id_name`	A character string specifying the name of the cluster membership variable to be added to the node metadata. Applicable only when `stats_to_include = "all"` or `stats_to_include["cluster_id"]` is `TRUE`. Passed to `addClusterMembership()`.
`overwrite`	Logical. If `TRUE` and `net$node_data` contains a variable whose name matches the value of `cluster_id_name`, then its values will be overwritten with new cluster membership values (obtained using `addClusterMembership()` with the specified value of `cluster_fun`). Applicable only when `stats_to_include = "all"` or `stats_to_include["cluster_id"]` is `TRUE`.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
`...`	Named optional arguments to the function specified by `cluster_fun`.

Details

Node-level network properties are properties that pertain to each individual node in the network graph.

See chooseNodeStats() for a list of the available node-level network properties.

Value

A modified copy of net, with net$node_data containing an additional column for each new network property computed. See chooseNodeStats() for the network property names, which are used as the column names, except for the cluster membership variable, whose name is the value of cluster_id_name.

Author(s)

Brian Neal

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
toy_data <- simulateToyData()

net <- generateNetworkObjects(
  toy_data, "CloneSeq"
)

# Add default set of node properties
net <- addNodeStats(net)

# Modify default set of node properties
net <- addNodeStats(
  net,
  stats_to_include =
    chooseNodeStats(
      closeness = TRUE,
      page_rank = FALSE
    )
)

# Add only the specified node properties
net <- addNodeStats(
  net,
  stats_to_include =
    exclusiveNodeStats(
      degree = TRUE,
      transitivity = TRUE
    )
)

# Add all node-level network properties
net <- addNodeStats(
  net,
  stats_to_include = "all"
)


Generate Plots of a Network Graph

Description

Generates one or more ggraph plots of the network graph according to the user specifications.

addPlots() accepts and returns a list of network objects, adding the plots to the existing list contents. If the list already contains plots, the new plots will be created using the same coordinate layout as the existing plots.

generateNetworkGraphPlots() accepts the network igraph and node metadata, and returns a list containing plots.

Usage

addPlots(
  net,
  print_plots = FALSE,
  plot_title = NULL,
  plot_subtitle = "auto",
  color_nodes_by = NULL,
  color_scheme = "default",
  color_legend = "auto",
  color_title = "auto",
  edge_width = 0.1,
  size_nodes_by = 0.5,
  node_size_limits = NULL,
  size_title = "auto",
  verbose = FALSE
)

generateNetworkGraphPlots(
  igraph,
  data,
  print_plots = FALSE,
  plot_title = NULL,
  plot_subtitle = NULL,
  color_nodes_by = NULL,
  color_scheme = "default",
  color_legend = "auto",
  color_title = "auto",
  edge_width = 0.1,
  size_nodes_by = 0.5,
  node_size_limits = NULL,
  size_title = "auto",
  layout = NULL,
  verbose = FALSE
)

Arguments

`net`	A `list` of network objects conforming to the output of `buildRepSeqNetwork()` or `generateNetworkObjects()`. See details.
`igraph`	An `igraph` object containing the network graph to be plotted.
`data`	A data frame containing the node metadata for the network, with each row corresponding to a node.
`print_plots`	A logical scalar; should plots be printed in the `R` plotting window?
`plot_title`	A character string containing the plot title.
`plot_subtitle`	A character string containing the plot subtitle. The default value `"auto"` generates a subtitle describing the settings used to construct the network, including the distance type and distance cutoff.
`color_nodes_by`	A vector specifying one or more node metadata variables used to encode the color of the nodes. One plot is generated for each entry, with each plot coloring the nodes according to the variable in the corresponding entry. This argument accepts a character vector where each entry is a column name of the node metadata. If this argument is `NULL`, generates a single plot with uncolored nodes.
`color_scheme`	A character string specifying the color scale to use for all plots, or a character vector whose length matches that of `color_nodes_by`, with each entry specifying the color scale for the corresponding plot. `"default"` specifies the default `ggplot()` color scale. Other options are one of the viridis color scales (e.g., `"plasma"`, `"A"` or other valid inputs to the `option` argument of `scale_color_viridis()`) or (for discrete variables) a palette from `hcl.pals()` (e.g., `"RdYlGn"`). Each of the viridis color scales can include the suffix `"-1"` to reverse its direction (e.g., `"plasma-1"` or `"A-1"`).
`color_legend`	A logical scalar specifying whether to display the color legend in plots. The default value of `"auto"` shows the color legend if nodes are colored according to a continuous variable or according to a discrete variable with at most 20 distinct values.
`color_title`	A character string specifying the title of the color legend in all plots, or a character vector whose length matches that of `color_nodes_by`, with each entry specifying the title of the color legend in the corresponding plot. Only applicable for plots with colored nodes. The value `"auto"` uses the corresponding value of `color_nodes_by`.
`edge_width`	A numeric scalar specifying the width of the graph edges in the plot. Passed to the `width` argument of `geom_edge_link0()`.
`size_nodes_by`	A numeric scalar specifying the size of the nodes in all plots, or the column name of a node metadata variable used to encode the size of the nodes in all plots. Alternatively, an argument value of `NULL` uses the default `ggraph` size for all nodes. Passed to the size aesthetic mapping of `geom_node_point()`.
`node_size_limits`	A numeric vector of length 2, specifying the minimum and maximum node size. Only applicable if nodes are sized according to a variable. If `node_size_limits = NULL`, the default size scale will be used.
`size_title`	A character string (or `NULL`) specifying the title for the size legend. Only applicable if nodes are sized according to a variable. The value `"auto"` uses the value of `size_nodes_by`.
`layout`	A `matrix` specifying the coordinate layout of the network nodes, with one row for each node in the network and two columns. Each row specifies the x and y coordinates for the corresponding node. If `NULL`, the layout matrix is created using `igraph::layout_components()`. This argument can be used to create plots with the same layout as previously generated plots, or to generate plots with custom layouts.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
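
As a small illustration of the `color_scheme` options for discrete variables, the following sketch checks candidate palette names against `hcl.pals()` using base R only. The helper `is_hcl_palette()` is hypothetical and not part of NAIR.

```r
# Hypothetical helper (not part of NAIR): is `name` a palette known to hcl.pals()?
is_hcl_palette <- function(name) {
  name %in% grDevices::hcl.pals()
}

candidates <- c("RdYlGn", "Set 2", "not-a-palette")
valid <- vapply(candidates, is_hcl_palette, logical(1))
print(valid)

# Preview five colors from a valid palette
print(grDevices::hcl.colors(5, palette = "RdYlGn"))
```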

Details

The arguments color_nodes_by and size_nodes_by accept the names of variables in the node metadata. For addPlots(), this is the data frame node_data contained in the list provided to the net argument. For generateNetworkGraphPlots(), this is the data frame provided to the data argument.

addPlots() adds the generated plots to the list plots contained in the list of network objects provided to net. The plots element is created if it does not already exist. If plots already exist, the new plots will be generated with the same coordinate layout as the existing plots. Each plot is named according to the variable used to color the nodes. If a plot already exists with the same name as one of the new plots, it will be overwritten with the new plot. If the plots list does not already contain an element named graph_layout, it will be added. This element contains the coordinate layout for the plots as a two-column matrix.

When calling generateNetworkGraphPlots(), if one wishes for the plots to be generated with the same coordinate layout as an existing plot, the layout matrix for the existing plot must be passed to the layout argument.
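
A layout matrix has one row per node and two columns (x and y coordinates). The following is a minimal base R sketch of building a custom layout that places the nodes evenly on a circle. The node count `n_nodes` is a placeholder chosen for illustration.

```r
# Hypothetical node count; in practice this must equal the number of
# nodes (rows of the node metadata) in the network being plotted.
n_nodes <- 8

# Place the nodes evenly on the unit circle
angles <- 2 * pi * (seq_len(n_nodes) - 1) / n_nodes
custom_layout <- cbind(x = cos(angles), y = sin(angles))

dim(custom_layout)
```

A matrix of this shape could then be supplied as `layout = custom_layout` when calling generateNetworkGraphPlots().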

The plots can be saved to a PDF file using saveNetworkPlots().

Value

addPlots() returns a modified copy of net with the new plots contained in the element named plots (a list), in addition to any previously existing plots.

generateNetworkGraphPlots() returns a list containing the new plots.

Each plot is an object of class ggraph. Within the list of plots, each plot is named after the variable used to color the nodes. For a plot with uncolored nodes, the name is uniform_color.

The list containing the new plots also contains an element named graph_layout. This is a matrix specifying the coordinate layout of the nodes in the plots. It contains one row for each node in the network and two columns. Each row specifies the x and y coordinates for the corresponding node. This matrix can be used to generate additional plots with the same layout as the plots in the returned list.

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Network Visualization article on package website

Examples

set.seed(42)
toy_data <- simulateToyData()

net <- buildNet(toy_data, "CloneSeq", node_stats = TRUE)

net <- addPlots(
  net,
  color_nodes_by =
    c("SampleID", "transitivity", "coreness"),
  color_scheme =
    c("Set 2", "mako-1", "plasma-1"),
  color_title =
    c("", "Transitivity", "Coreness"),
  size_nodes_by = "degree",
  node_size_limits = c(0.1, 1.5),
  plot_subtitle = NULL,
  print_plots = TRUE
)

Aggregate Counts/Frequencies for Clones With Identical Receptor Sequences

Description

Given bulk Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data with clones indexed by row, returns a data frame containing one row for each unique receptor sequence. Includes the number of clones sharing each sequence, as well as aggregate values for clone count and clone frequency across all clones sharing each sequence. Clones can be grouped according to metadata, in which case aggregation is performed within (but not across) groups.

Usage

aggregateIdenticalClones(
  data,
  clone_col,
  count_col,
  freq_col,
  grouping_cols = NULL,
  verbose = FALSE
)

Arguments

`data`	A data frame containing the bulk AIRR-Seq data, with clones indexed by row.
`clone_col`	Specifies the column of `data` containing the receptor sequences. Accepts a character string containing the column name or a numeric scalar containing the column index.
`count_col`	Specifies the column of `data` containing the clone counts. Accepts a character string containing the column name or a numeric scalar containing the column index.
`freq_col`	Specifies the column of `data` containing the clone frequencies. Accepts a character string containing the column name or a numeric scalar containing the column index.
`grouping_cols`	An optional character vector of column names or numeric vector of column indices, specifying one or more columns of `data` used to assign clones to groups. If provided, aggregation occurs within groups, but not across groups. See details.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.

Details

If grouping_cols is left unspecified, the returned data frame will contain one row for each unique receptor sequence appearing in data.

If one or more columns of data are specified using the grouping_cols argument, then each clone (row) in data is assigned to a group based on its combination of values in these columns. If two clones share the same receptor sequence but belong to different groups, their receptor sequence will appear multiple times in the returned data frame, with one row for each group in which the sequence appears. In each such row, the aggregate clone count, aggregate clone frequency, and number of clones sharing the sequence are reported within the group for that row.
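
The effect of grouping can be mimicked with base R's `aggregate()`. This is only a sketch of the idea, not the package's implementation, and the output column names differ from those returned by aggregateIdenticalClones().

```r
my_data <- data.frame(
  clone_seq = c("ATCG", rep("ACAC", 2), rep("GGGG", 4)),
  clone_count = rep(1, 7),
  clone_freq = rep(1/7, 7),
  time_point = c("t_0", rep(c("t_0", "t_1"), 3))
)

# No grouping: one row per unique sequence
agg_all <- aggregate(
  cbind(clone_count, clone_freq) ~ clone_seq,
  data = my_data, FUN = sum
)

# Grouping by time point: one row per (sequence, time point) pair
agg_by_time <- aggregate(
  cbind(clone_count, clone_freq) ~ clone_seq + time_point,
  data = my_data, FUN = sum
)

# Number of clones (rows) sharing each sequence within each time point
n_by_time <- aggregate(
  clone_count ~ clone_seq + time_point,
  data = my_data, FUN = length
)

print(agg_all)
print(agg_by_time)
```

Without grouping, each sequence appears once. With grouping, a sequence appears once for every group in which it occurs.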

Value

A data frame whose first column contains the receptor sequences and has the same name as the column of data specified by clone_col. One additional column will be present for each column of data that is specified using the grouping_cols argument, with each having the same column name. The remaining columns are as follows:

`AggregatedCloneCount`	The aggregate clone count across all clones (within the same group, if applicable) that share the receptor sequence in that row.
`AggregatedCloneFrequency`	The aggregate clone frequency across all clones (within the same group, if applicable) that share the receptor sequence in that row.
`UniqueCloneCount`	The number of clones (rows) in `data` (within the same group, if applicable) possessing the receptor sequence for the current row.

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

my_data <- data.frame(
  clone_seq = c("ATCG", rep("ACAC", 2), rep("GGGG", 4)),
  clone_count = rep(1, 7),
  clone_freq = rep(1/7, 7),
  time_point = c("t_0", rep(c("t_0", "t_1"), 3)),
  subject_id = c(rep(1, 5), rep(2, 2))
)
my_data

aggregateIdenticalClones(
  my_data,
  "clone_seq",
  "clone_count",
  "clone_freq"
)

# group clones by time point
aggregateIdenticalClones(
  my_data,
  "clone_seq",
  "clone_count",
  "clone_freq",
  grouping_cols = "time_point"
)

# group clones by subject ID
aggregateIdenticalClones(
  my_data,
  "clone_seq",
  "clone_count",
  "clone_freq",
  grouping_cols = "subject_id"
)

# group clones by time point and subject ID
aggregateIdenticalClones(
  my_data,
  "clone_seq",
  "clone_count",
  "clone_freq",
  grouping_cols =
    c("subject_id", "time_point")
)

Build Global Network of Associated TCR/BCR Clusters

Description

Part of the workflow Searching for Associated TCR/BCR Clusters. Intended for use following findAssociatedClones().

Given data containing a neighborhood of similar clones around each associated sequence, combines the data into a global network and performs network analysis and cluster analysis.

Usage

buildAssociatedClusterNetwork(
  file_list,
  input_type = "rds",
  data_symbols = "data", header = TRUE, sep,
  read.args = list(row.names = 1),
  seq_col,
  min_seq_length = NULL,
  drop_matches = NULL,
  drop_isolated_nodes = FALSE,
  node_stats = TRUE,
  stats_to_include =
    chooseNodeStats(cluster_id = TRUE),
  cluster_stats = TRUE,
  color_nodes_by = "GroupID",
  output_name = "AssociatedClusterNetwork",
  verbose = FALSE,
  ...
)

Arguments

`file_list`	A character vector of file paths, or a list containing `connections` and file paths. Each element corresponds to a single file containing the data for a single sample. Passed to `loadDataFromFileList()`.
`input_type`	A character string specifying the file format of the neighborhood data files. Options are `"table"`, `"txt"`, `"tsv"`, `"csv"`, `"rds"` and `"rda"`. Passed to `loadDataFromFileList()`.
`data_symbols`	Used when `input_type = "rda"`. Specifies the name of each neighborhood's data frame within its respective Rdata file. Passed to `loadDataFromFileList()`.
`header`	For values of `input_type` other than `"rds"` and `"rda"`, this argument is used to specify the value of the `header` argument to `read.table()`, `read.csv()`, etc.
`sep`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `sep` argument to `read.table()`, `read.csv()`, etc.
`read.args`	For values of `input_type` other than `"rds"` and `"rda"`, this argument is used to specify values of optional arguments to `read.table()`, `read.csv()`, etc. Accepts a named list of argument values. Values of `header` and `sep` in this list take precedence over values specified via the `header` and `sep` arguments.
`seq_col`	Specifies the column of each neighborhood's data frame containing the TCR/BCR sequences. Accepts a character string containing the column name or a numeric scalar containing the column index.
`min_seq_length`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`drop_matches`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`drop_isolated_nodes`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`node_stats`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`stats_to_include`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`cluster_stats`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`color_nodes_by`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`output_name`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
`...`	Other arguments to `buildRepSeqNetwork()` when constructing the global network.
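
The precedence rule between `header`, `sep` and `read.args` can be pictured as a list merge in which entries of `read.args` win. The sketch below shows one way such a rule could be written in base R. It is an assumption made for illustration, not NAIR's internal code, and the helper `read_with_args()` is hypothetical.

```r
# Hypothetical helper (not NAIR's internal code): values in `read.args`
# override the stand-alone `header` and `sep` values.
read_with_args <- function(file, header = TRUE, sep = "", read.args = list()) {
  args <- utils::modifyList(list(header = header, sep = sep), read.args)
  do.call(utils::read.table, c(list(file = file), args))
}

tmp <- tempfile(fileext = ".csv")
writeLines(c("CloneSeq,CloneCount", "CASSGAYEQYF,3", "CASSHRGTDTQYF,5"), tmp)

# `sep = "\t"` is overridden by the value supplied in `read.args`
dat <- read_with_args(tmp, header = TRUE, sep = "\t",
                      read.args = list(sep = ","))
print(dat)
```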

Details

Each associated sequence's neighborhood contains clones (from all samples) with TCR/BCR sequences similar to the associated sequence. The neighborhoods are assumed to have been previously identified using findAssociatedClones().

The neighborhood data for all associated sequences are used to construct a single global network. Cluster analysis is used to partition the global network into clusters, which are considered as the associated TCR/BCR clusters. Network properties for the nodes and clusters are computed and returned as metadata. A plot of the global network graph is produced, with the nodes colored according to the binary variable of interest.

See the Searching for Associated TCR/BCR Clusters article on the package website for more details.
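
The combining step can be sketched in base R as reading each neighborhood file and stacking the rows. The file names and toy contents below are hypothetical stand-ins for the output of findAssociatedClones(), and the sketch assumes every file stores a data frame with the same columns.

```r
# Write two toy neighborhood files (stand-ins for findAssociatedClones() output)
nbd_dir <- tempfile()
dir.create(nbd_dir)
saveRDS(data.frame(CloneSeq = c("CASSGAYEQYF", "CASSGAYEQYY"),
                   SampleID = c("Sample1", "Sample2")),
        file.path(nbd_dir, "nbd_1.rds"))
saveRDS(data.frame(CloneSeq = "CSVDLGKGNNEQFF", SampleID = "Sample3"),
        file.path(nbd_dir, "nbd_2.rds"))

# Read every file and stack the rows into one data frame
files <- list.files(nbd_dir, full.names = TRUE)
combined <- do.call(rbind, lapply(files, readRDS))
print(combined)
```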

Value

A list of network objects as returned by buildRepSeqNetwork(). The list is returned invisibly. If the input data contains a combined total of fewer than two rows, or if the global network contains no nodes, then the function returns NULL, invisibly, with a warning.

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Searching for Associated TCR/BCR Clusters article on package website

Examples

set.seed(42)

## Simulate 30 samples from two groups (treatment/control) ##
n_control <- n_treatment <- 15
n_samples <- n_control + n_treatment
sample_size <- 30 # (seqs per sample)
base_seqs <- # first five are associated with treatment
  c("CASSGAYEQYF", "CSVDLGKGNNEQFF", "CASSIEGQLSTDTQYF",
    "CASSEEGQLSTDTQYF", "CASSPEGQLSTDTQYF",
    "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF")
# Relative generation probabilities by control/treatment group
pgen_c <- matrix(rep(c(rep(1, 5), rep(30, 3)), times = n_control),
                 nrow = n_control, byrow = TRUE)
pgen_t <- matrix(rep(c(1, 1, rep(1/3, 3), rep(2, 3)), times = n_treatment),
                 nrow = n_treatment, byrow = TRUE)
pgen <- rbind(pgen_c, pgen_t)
simulateToyData(
  samples = n_samples,
  sample_size = sample_size,
  prefix_length = 1,
  prefix_chars = c("", ""),
  prefix_probs = cbind(rep(1, n_samples), rep(0, n_samples)),
  affixes = base_seqs,
  affix_probs = pgen,
  num_edits = 0,
  output_dir = tempdir(),
  no_return = TRUE
)

## Step 1: Find Associated Sequences ##
sample_files <-
  file.path(tempdir(),
            paste0("Sample", 1:n_samples, ".rds")
  )
group_labels <- c(rep("reference", n_control),
                  rep("comparison", n_treatment))
associated_seqs <-
  findAssociatedSeqs(
    file_list = sample_files,
    input_type = "rds",
    group_ids = group_labels,
    seq_col = "CloneSeq",
    min_seq_length = NULL,
    drop_matches = NULL,
    min_sample_membership = 0,
    pval_cutoff = 0.1
  )
head(associated_seqs[, 1:5])

## Step 2: Find Associated Clones ##
dir_step2 <- tempfile()
findAssociatedClones(
  file_list = sample_files,
  input_type = "rds",
  group_ids = group_labels,
  seq_col = "CloneSeq",
  assoc_seqs = associated_seqs$ReceptorSeq,
  min_seq_length = NULL,
  drop_matches = NULL,
  output_dir = dir_step2
)

## Step 3: Global Network of Associated Clusters ##
associated_clusters <-
  buildAssociatedClusterNetwork(
    file_list = list.files(dir_step2,
                           full.names = TRUE
    ),
    seq_col = "CloneSeq",
    size_nodes_by = 1.5,
    print_plots = TRUE
  )

Build Global Network of Public TCR/BCR Clusters

Description

Part of the workflow Searching for Public TCR/BCR Clusters. Intended for use following findPublicClusters().

Given node-level metadata for each sample's filtered clusters, combines the data into a global network and performs network analysis and cluster analysis.

Usage

buildPublicClusterNetwork(

  ## Input ##
  file_list,
  input_type = "rds",
  data_symbols = "ndat",
  header = TRUE, sep,
  read.args = list(row.names = 1),
  seq_col,

  ## Network Settings ##
  drop_isolated_nodes = FALSE,
  node_stats = deprecated(),
  stats_to_include = deprecated(),
  cluster_stats = deprecated(),

  ## Visualization ##
  color_nodes_by = "SampleID",
  color_scheme = "turbo",
  plot_title = "Global Network of Public Clusters",

  ## Output ##
  output_dir = NULL,
  output_name = "PublicClusterNetwork",
  verbose = FALSE,

  ...

)

Arguments

`file_list`	A character vector of file paths, or a list containing `connections` and file paths. Each element corresponds to a single file containing the data for a single sample. Passed to `loadDataFromFileList()`.
`input_type`	A character string specifying the file format of the input files. Options are `"csv"`, `"rds"` and `"rda"`. Passed to `loadDataFromFileList()`.
`data_symbols`	Used when `input_type = "rda"`. Specifies the name of the data frame within each Rdata file. Passed to `loadDataFromFileList()`.
`header`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `header` argument to `read.table()`, `read.csv()`, etc.
`sep`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `sep` argument to `read.table()`, `read.csv()`, etc.
`read.args`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify non-default values of optional arguments to `read.table()`, `read.csv()`, etc. Accepts a named list of argument values. Values of `header` and `sep` in this list take precedence over values specified via the `header` and `sep` arguments.
`seq_col`	Specifies the column in the node-level metadata that contains the TCR/BCR sequences. Accepts a character string containing the column name or a numeric scalar containing the column index.
`drop_isolated_nodes`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`node_stats`	Deprecated. All network properties are automatically computed.
`stats_to_include`	Deprecated. All network properties are automatically computed.
`cluster_stats`	Deprecated. All network properties are automatically computed.
`color_nodes_by`	Passed to `buildRepSeqNetwork()` when constructing the global network. The node-level network properties for the global network (see details) are included among the valid options.
`color_scheme`	Passed to `addPlots()` when constructing the global network.
`plot_title`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`output_dir`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`output_name`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
`...`	Other arguments to `buildRepSeqNetwork()` (including arguments to `addPlots()`) when constructing the global network. Does not include `node_stats`, `stats_to_include`, `cluster_stats` or `cluster_id_name`.

Details

The node-level metadata for the filtered clusters from all samples is combined and the global network is constructed by calling buildNet() with node_stats = TRUE, stats_to_include = "all", cluster_stats = TRUE and cluster_id_name = "ClusterIDPublic".

The computed node-level network properties are renamed to reflect their correspondence to the global network. This is done to distinguish them from the network properties that correspond to the sample-level networks. The names are:

ClusterIDPublic
PublicNetworkDegree
PublicTransitivity
PublicCloseness
PublicCentralityByCloseness
PublicEigenCentrality
PublicCentralityByEigen
PublicBetweenness
PublicCentralityByBetweenness
PublicAuthorityScore
PublicCoreness
PublicPageRank

See the Searching for Public TCR/BCR Clusters article on the package website.
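
The renaming step can be sketched in base R as follows. The original column names on the left-hand side of the map (`degree`, `transitivity`, `coreness`) and their pairing with the public names are assumptions made for this illustration, and only three of the properties are shown.

```r
# Toy node metadata; the original column names on the left of the map
# are assumptions made for this sketch.
node_data <- data.frame(
  CloneSeq = c("CASSGAYEQYF", "CASSHRGTDTQYF"),
  degree = c(1, 1),
  transitivity = c(NA, NA),
  coreness = c(1, 1)
)

rename_map <- c(
  degree = "PublicNetworkDegree",
  transitivity = "PublicTransitivity",
  coreness = "PublicCoreness"
)

# Rename only the columns that are present in the map
hits <- names(node_data) %in% names(rename_map)
names(node_data)[hits] <- unname(rename_map[names(node_data)[hits]])
names(node_data)
```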

Value

A list of network objects as returned by buildRepSeqNetwork().

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Searching for Public TCR/BCR Clusters article on package website

Examples

set.seed(42)

## Simulate 30 samples with a mix of public/private sequences ##
samples <- 30
sample_size <- 30 # (seqs per sample)
base_seqs <- c(
  "CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF", "CASSSVETQYF",
  "CASSPEGQLSTDTQYF", "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF",
  "CASSLTSGYNEQFF", "CASSETGYNEQFF", "CASSLTGGNEQFF", "CASSYLTGYNEQFF",
  "CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSFPWDGYGYTF", "CASTLARQGGELFF",
  "CASTLSRQGGELFF", "CSVELLPTGPLETSYNEQFF", "CSVELLPTGPSETSYNEQFF",
  "CVELLPTGPSETSYNEQFF", "CASLAGGRTQETQYF", "CASRLAGGRTQETQYF",
  "CASSLAGGRTETQYF", "CASSLAGGRTQETQYF", "CASSRLAGGRTQETQYF",
  "CASQYGGGNQPQHF", "CASSLGGGNQPQHF", "CASSNGGGNQPQHF", "CASSYGGGGNQPQHF",
  "CASSYGGGQPQHF", "CASSYKGGNQPQHF", "CASSYTGGGNQPQHF",
  "CAWSSQETQYF", "CASSSPETQYF", "CASSGAYEQYF", "CSVDLGKGNNEQFF")
# Relative generation probabilities
pgen <- cbind(
  stats::toeplitz(0.6^(0:(sample_size - 1))),
  matrix(1, nrow = samples, ncol = length(base_seqs) - samples)
)
simulateToyData(
  samples = samples,
  sample_size = sample_size,
  prefix_length = 1,
  prefix_chars = c("", ""),
  prefix_probs = cbind(rep(1, samples), rep(0, samples)),
  affixes = base_seqs,
  affix_probs = pgen,
  num_edits = 0,
  output_dir = tempdir(),
  no_return = TRUE
)


## 1. Find Public Clusters in Each Sample
sample_files <-
  file.path(tempdir(),
            paste0("Sample", 1:samples, ".rds")
  )
findPublicClusters(
  file_list = sample_files,
  input_type = "rds",
  seq_col = "CloneSeq",
  count_col = "CloneCount",
  min_seq_length = NULL,
  drop_matches = NULL,
  top_n_clusters = 3,
  min_node_count = 5,
  min_clone_count = 15000,
  output_dir = tempdir()
)

## 2. Build Global Network of Public Clusters
public_clusters <-
  buildPublicClusterNetwork(
    file_list =
      list.files(
        file.path(tempdir(), "node_meta_data"),
        full.names = TRUE
      ),
    seq_col = "CloneSeq",
    count_col = "CloneCount",
    plot_title = NULL,
    plot_subtitle = NULL,
    print_plots = TRUE
  )

Build Global Network of Public TCR/BCR Clusters Using Representative Clones

Description

Alternative step in the workflow Searching for Public TCR/BCR Clusters. Intended for use following findPublicClusters() in cases where buildPublicClusterNetwork() cannot be practically used due to the size of the full global network.

Given cluster-level metadata for each sample's filtered clusters, selects a representative TCR/BCR from each cluster, combines the representatives into a global network and performs network analysis and cluster analysis.
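
The idea of a representative clone can be sketched in base R: within each cluster, keep the sequence with the largest clone count. The column names `cluster_id`, `CloneSeq` and `CloneCount` and the toy counts are chosen for illustration only. In the actual workflow the representative sequence is read from the cluster-level metadata rather than recomputed.

```r
clones <- data.frame(
  cluster_id = c(1, 1, 1, 2, 2),
  CloneSeq = c("CASSLTGNEQFF", "CASSLTGGNEQFF", "CASSLNGYNEQFF",
               "CASSYGGGQPQHF", "CASSYKGGNQPQHF"),
  CloneCount = c(10, 40, 5, 7, 3)
)

# Split by cluster, then keep the row with the largest count in each piece
pick_top <- function(d) d[which.max(d$CloneCount), ]
representatives <- do.call(
  rbind,
  lapply(split(clones, clones$cluster_id), pick_top)
)
print(representatives)
```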

Usage

buildPublicClusterNetworkByRepresentative(

  ## Input ##
  file_list,
  input_type = "rds",
  data_symbols = "cdat",
  header, sep, read.args,
  seq_col = "seq_w_max_count",
  count_col = "agg_count",

  ## Network Settings ##
  dist_type = "hamming",
  dist_cutoff = 1,
  cluster_fun = "fast_greedy",

  ## Visualization ##
  plots = TRUE,
  print_plots = FALSE,
  plot_title = "auto",
  plot_subtitle = "auto",
  color_nodes_by = "SampleID",
  color_scheme = "turbo",
  ...,

  ## Output ##
  output_dir = NULL,
  output_type = "rds",
  output_name = "PubClustByRepresentative",
  pdf_width = 12,
  pdf_height = 10,
  verbose = FALSE

)

Arguments

`file_list`	A vector of file paths where each file contains the cluster-level metadata for one sample's filtered clusters. Passed to `loadDataFromFileList()`.
`input_type`	A character string specifying the file format of the input files. Options are `"csv"`, `"rds"` and `"rda"`. Passed to `loadDataFromFileList()`.
`data_symbols`	Used when `input_type = "rda"`. Specifies the name of the data frame within each Rdata file. Passed to `loadDataFromFileList()`.
`header`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `header` argument to `read.table()`, `read.csv()`, etc.
`sep`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `sep` argument to `read.table()`, `read.csv()`, etc.
`read.args`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify non-default values of optional arguments to `read.table()`, `read.csv()`, etc. Accepts a named list of argument values. Values of `header` and `sep` in this list take precedence over values specified via the `header` and `sep` arguments.
`seq_col`	Specifies the column in the cluster-level metadata that contains the representative TCR/BCR sequence for each cluster. Accepts a character string containing the column name or a numeric scalar containing the column index. By default, uses the sequence with the maximum clone count in each cluster.
`count_col`	Specifies the column in the cluster-level metadata that contains the aggregate clone count for each cluster. Accepts a character string containing the column name or a numeric scalar containing the column index.
`dist_type`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`dist_cutoff`	Passed to `buildRepSeqNetwork()` when constructing the global network.
`cluster_fun`	Passed to `buildRepSeqNetwork()` when performing cluster analysis on the global network.
`plots`	Logical. Should plots of the global network graph be produced?
`print_plots`	Logical. If plots of the global network graph are produced, should they be printed to the R plotting window?
`plot_title`	Passed to `addPlots()` when producing plots of the global network graph.
`plot_subtitle`	Passed to `addPlots()` when producing plots of the global network graph.
`color_nodes_by`	Passed to `addPlots()` when producing plots of the global network graph. Valid options include the default `"SampleID"`, as well as node-level properties (see `addNodeNetworkStats`) and sample-level cluster properties (see `getClusterStats`), which correspond to the representative TCRs/BCRs and the original sample-level clusters they represent, respectively.
`color_scheme`	Passed to `addPlots()` when producing plots of the global network graph.
`...`	Other arguments to `addPlots()` when producing plots of the global network graph.
`output_dir`	Passed to `saveNetwork()` after constructing the global network.
`output_type`	Passed to `saveNetwork()` after constructing the global network.
`output_name`	Passed to `saveNetwork()` after constructing the global network.
`pdf_width`	Passed to `saveNetwork()` after constructing the global network. Only applicable if `plots = TRUE`.
`pdf_height`	Passed to `saveNetwork()` after constructing the global network. Only applicable if `plots = TRUE`.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
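
The precedence rule described for `read.args` can be illustrated with base R alone. The following is a minimal sketch, not the package's internal code: the temporary file, its contents and the use of `modifyList()` are assumptions chosen for illustration.

```r
# Write a small semicolon-delimited file to read back (illustrative data).
path <- tempfile(fileext = ".csv")
writeLines(c("seq_w_max_count;agg_count", "CASSSVETQYF;155"), path)

header    <- TRUE
sep       <- ","               # on its own, this would misparse the file
read.args <- list(sep = ";")   # entries here take precedence over `sep`

# Start from the stand-alone arguments, then let read.args override them.
args <- utils::modifyList(list(header = header, sep = sep), read.args)
dat  <- do.call(utils::read.csv, c(list(file = path), args))
print(dat)
```

Because `read.args` is applied last, the file is read with `sep = ";"` even though `sep = ","` was also supplied.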

Details

From each filtered cluster in each sample's network, a representative TCR/BCR is selected. By default, this is the sequence with the greatest clone count in each cluster. The representatives from all clusters and all samples are then used to construct a single global network. Cluster analysis is used to partition this global network into clusters. Network properties for the nodes and clusters are computed and returned as metadata. A plot of the global network graph is produced, with the nodes colored according to sample ID.
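
As a rough illustration of the default choice of representative, the sketch below picks the sequence with the greatest clone count in each cluster of a toy sample and records the cluster's aggregate clone count. It uses base R only. The toy data, the helper `pick_representative()` and the input column names are hypothetical; the output names `seq_w_max_count` and `agg_count` mirror the default values of `seq_col` and `count_col`.

```r
# Toy node-level data for one sample (values and column names are illustrative).
nodes <- data.frame(
  cluster_id = c(1, 1, 1, 2, 2),
  CloneSeq   = c("CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSLTGGNEQFF",
                 "CASSSVETQYF", "CASSSPETQYF"),
  CloneCount = c(120, 450, 30, 80, 75),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarize one cluster by its most abundant sequence
# and its total clone count.
pick_representative <- function(cluster_nodes) {
  top <- which.max(cluster_nodes$CloneCount)
  data.frame(
    cluster_id      = cluster_nodes$cluster_id[1],
    seq_w_max_count = cluster_nodes$CloneSeq[top],
    agg_count       = sum(cluster_nodes$CloneCount),
    stringsAsFactors = FALSE
  )
}

# One row per cluster.
representatives <- do.call(
  rbind,
  lapply(split(nodes, nodes$cluster_id), pick_representative)
)
print(representatives)
```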

Within this network, clusters containing nodes from multiple samples can be considered as the skeletons of the complete public clusters. The filtered cluster data for each sample can then be subset to keep the sample-level clusters whose representative TCR/BCRs belong to the skeletons of the public clusters. After subsetting in this manner, buildPublicClusterNetwork() can be used to construct the global network of complete public clusters.
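
The subsetting step described above might look roughly like the following base R sketch. It assumes a node metadata table with one row per representative; `ClusterIDPublic` is the global cluster membership variable named in this manual, while the toy values and the other column names are illustrative assumptions.

```r
# Toy node metadata for a global network of representatives.
node_data <- data.frame(
  SampleID        = c("Sample1", "Sample2", "Sample3", "Sample1", "Sample1"),
  cluster_id      = c(1, 1, 2, 3, 2),  # sample-level cluster of each representative
  ClusterIDPublic = c(1, 1, 1, 2, 2),  # global cluster of each representative
  stringsAsFactors = FALSE
)

# Count the distinct samples contributing to each global cluster.
samples_per_cluster <- tapply(
  node_data$SampleID, node_data$ClusterIDPublic,
  function(x) length(unique(x))
)

# Global clusters spanning more than one sample are the skeletons.
skeleton_ids <- names(samples_per_cluster)[samples_per_cluster > 1]

# Keep the representatives (and hence sample-level clusters) in a skeleton.
keep <- node_data[as.character(node_data$ClusterIDPublic) %in% skeleton_ids, ]
print(keep[, c("SampleID", "cluster_id")])
```

The retained `SampleID` and `cluster_id` pairs identify which rows of each sample's filtered cluster data to keep before building the full public cluster network.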

See the Searching for Public TCR/BCR Clusters article on the package website.

Value

If the input data contains a combined total of fewer than two rows, or if the global network contains no nodes, then the function returns NULL, invisibly, with a warning. Otherwise, invisibly returns a list of network objects as returned by buildRepSeqNetwork(). The global cluster membership variable in the data frame node_data is named ClusterIDPublic.

The data frame cluster_data includes the following variables that represent properties of the clusters in the global network of representative TCR/BCRs:

`cluster_id`	The global cluster ID number.
`node_count`	The number of global network nodes in the global cluster.
`TotalSampleLevelNodes`	For each representative TCR/BCR in the global cluster, we record the number of nodes in the sample-level cluster for which it is the representative TCR/BCR. We then sum these node counts across all the representative TCR/BCRs in the global cluster.
`TotalCloneCount`	For each representative TCR/BCR in the global cluster, we record the aggregate clone count from all nodes in the sample-level cluster for which it is the representative TCR/BCR. We then sum these aggregate clone counts across all the representative TCR/BCRs in the global cluster.
`MeanOfMeanSeqLength`	For each representative TCR/BCR in the global cluster, we record the mean sequence length over all clones (nodes) in the sample-level cluster for which it is the representative TCR/BCR. We then average these mean sequence lengths over all the representative TCR/BCRs in the global cluster.
`MeanDegreeInPublicNet`	For each representative TCR/BCR in the global cluster, we record the mean network degree over all nodes in the sample-level cluster for which it is the representative TCR/BCR. We then average these mean degree values over all the representative TCR/BCRs in the global cluster.
`MaxDegreeInPublicNet`	For each representative TCR/BCR in the global cluster, we record the maximum network degree across all nodes in the sample-level cluster for which it is the representative TCR/BCR. We then take the maximum of these maximum degree values over all the representative TCR/BCRs in the global cluster.
`SeqWithMaxDegree`	For each representative TCR/BCR in the global cluster, we record the maximum network degree across all nodes in the sample-level cluster for which it is the representative TCR/BCR. We then identify the representative TCR/BCR with the maximum value of these maximum degrees over all the representative TCR/BCRs in the global cluster. The TCR/BCR sequence of the identified representative TCR/BCR is recorded in this variable.
`MaxCloneCount`	For each representative TCR/BCR in the global cluster, we record the maximum clone count across all clones (nodes) in the sample-level cluster for which it is the representative TCR/BCR. We then take the maximum of these maximum clone counts over all the representative TCR/BCRs in the global cluster.
`SampleWithMaxCloneCount`	For each representative TCR/BCR in the global cluster, we record the maximum clone count across all clones (nodes) in the sample-level cluster for which it is the representative TCR/BCR. We then identify the representative TCR/BCR with the maximum value of these maximum clone counts over all the representative TCR/BCRs in the global cluster. The sample to which the identified representative TCR/BCR belongs is recorded in this variable.
`SeqWithMaxCloneCount`	For each representative TCR/BCR in the global cluster, we record the maximum clone count across all clones (nodes) in the sample-level cluster for which it is the representative TCR/BCR. We then identify the representative TCR/BCR with the maximum value of these maximum clone counts over all the representative TCR/BCRs in the global cluster. The TCR/BCR sequence of the identified representative TCR/BCR is recorded in this variable.
`MaxAggCloneCount`	For each representative TCR/BCR in the global cluster, we record the aggregate clone count across all clones (nodes) in the sample-level cluster for which it is the representative TCR/BCR. We then take the maximum of these aggregate clone counts over all the representative TCR/BCRs in the global cluster.
`SampleWithMaxAggCloneCount`	For each representative TCR/BCR in the global cluster, we record the aggregate clone count across all clones (nodes) in the sample-level cluster for which it is the representative TCR/BCR. We then identify the representative TCR/BCR with the maximum value of these aggregate clone counts over all the representative TCR/BCRs in the global cluster. The sample to which the identified representative TCR/BCR belongs is recorded in this variable.
`SeqWithMaxAggCloneCount`	For each representative TCR/BCR in the global cluster, we record the aggregate clone count across all clones (nodes) in the sample-level cluster for which it is the representative TCR/BCR. We then identify the representative TCR/BCR with the maximum value of these aggregate clone counts over all the representative TCR/BCRs in the global cluster. The TCR/BCR sequence of the identified representative TCR/BCR is recorded in this variable.
`DiameterLength`	See `getClusterStats`. Based on edge connections between representative TCR/BCRs in the global cluster.
`Assortativity`	See `getClusterStats`. Based on edge connections between representative TCR/BCRs in the global cluster.
`GlobalTransitivity`	See `getClusterStats`. Based on edge connections between representative TCR/BCRs in the global cluster.
`EdgeDensity`	See `getClusterStats`. Based on edge connections between representative TCR/BCRs in the global cluster.
`DegreeCentralityIndex`	See `getClusterStats`. Based on edge connections between representative TCR/BCRs in the global cluster.
`ClosenessCentralityIndex`	See `getClusterStats`. Based on edge connections between representative TCR/BCRs in the global cluster.
`EigenCentralityIndex`	See `getClusterStats`. Based on edge connections between representative TCR/BCRs in the global cluster.
`EigenCentralityEigenvalue`	See `getClusterStats`. Based on edge connections between representative TCR/BCRs in the global cluster.
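
To make the two-stage aggregation behind several of these variables concrete, here is a small base R sketch under stated assumptions. It is not the package's implementation: the table `reps` and its input columns `node_count`, `mean_seq_length` and `SampleID` are hypothetical stand-ins for the sample-level cluster properties attached to each representative.

```r
# One row per representative TCR/BCR, carrying properties of the
# sample-level cluster it represents (illustrative values).
reps <- data.frame(
  ClusterIDPublic = c(1, 1, 2),
  SampleID        = c("Sample1", "Sample2", "Sample3"),
  node_count      = c(5, 8, 6),
  agg_count       = c(20000, 35000, 16000),
  mean_seq_length = c(13, 14, 12),
  stringsAsFactors = FALSE
)

# Summarize the representatives within each global cluster.
summary_tbl <- do.call(
  rbind,
  lapply(split(reps, reps$ClusterIDPublic), function(d) {
    data.frame(
      cluster_id                 = d$ClusterIDPublic[1],
      node_count                 = nrow(d),
      TotalSampleLevelNodes      = sum(d$node_count),
      TotalCloneCount            = sum(d$agg_count),
      MeanOfMeanSeqLength        = mean(d$mean_seq_length),
      MaxAggCloneCount           = max(d$agg_count),
      SampleWithMaxAggCloneCount = d$SampleID[which.max(d$agg_count)],
      stringsAsFactors = FALSE
    )
  })
)
print(summary_tbl)
```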

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Searching for Public TCR/BCR Clusters article on package website

Examples

set.seed(42)

## Simulate 30 samples with a mix of public/private sequences ##
samples <- 30
sample_size <- 30 # (seqs per sample)
base_seqs <- c(
  "CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF", "CASSSVETQYF",
  "CASSPEGQLSTDTQYF", "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF",
  "CASSLTSGYNEQFF", "CASSETGYNEQFF", "CASSLTGGNEQFF", "CASSYLTGYNEQFF",
  "CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSFPWDGYGYTF", "CASTLARQGGELFF",
  "CASTLSRQGGELFF", "CSVELLPTGPLETSYNEQFF", "CSVELLPTGPSETSYNEQFF",
  "CVELLPTGPSETSYNEQFF", "CASLAGGRTQETQYF", "CASRLAGGRTQETQYF",
  "CASSLAGGRTETQYF", "CASSLAGGRTQETQYF", "CASSRLAGGRTQETQYF",
  "CASQYGGGNQPQHF", "CASSLGGGNQPQHF", "CASSNGGGNQPQHF", "CASSYGGGGNQPQHF",
  "CASSYGGGQPQHF", "CASSYKGGNQPQHF", "CASSYTGGGNQPQHF",
  "CAWSSQETQYF", "CASSSPETQYF", "CASSGAYEQYF", "CSVDLGKGNNEQFF")
# Relative generation probabilities
pgen <- cbind(
  stats::toeplitz(0.6^(0:(sample_size - 1))),
  matrix(1, nrow = samples, ncol = length(base_seqs) - samples)
)
simulateToyData(
  samples = samples,
  sample_size = sample_size,
  prefix_length = 1,
  prefix_chars = c("", ""),
  prefix_probs = cbind(rep(1, samples), rep(0, samples)),
  affixes = base_seqs,
  affix_probs = pgen,
  num_edits = 0,
  output_dir = tempdir(),
  no_return = TRUE
)


## 1. Find Public Clusters in Each Sample
sample_files <-
  file.path(tempdir(),
            paste0("Sample", 1:samples, ".rds")
  )
findPublicClusters(
  file_list = sample_files,
  input_type = "rds",
  seq_col = "CloneSeq",
  count_col = "CloneCount",
  min_seq_length = NULL,
  drop_matches = NULL,
  top_n_clusters = 3,
  min_node_count = 5,
  min_clone_count = 15000,
  output_dir = tempdir()
)

## 2. Build Public Cluster Network by Representative TCR/BCRs
buildPublicClusterNetworkByRepresentative(
  file_list =
    list.files(
      file.path(tempdir(), "cluster_meta_data"),
      full.names = TRUE
    ),
  size_nodes_by = 1,
  print_plots = TRUE
)


Network Analysis of Immune Repertoire

Description

Given Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data, builds the network graph for the immune repertoire based on sequence similarity, computes specified network properties and generates customized visualizations.

buildNet() is identical to buildRepSeqNetwork(), existing as an alias for convenience.

Usage

buildRepSeqNetwork(

  ## Input ##
  data,
  seq_col,
  count_col = NULL,
  subset_cols = NULL,
  min_seq_length = 3,
  drop_matches = NULL,

  ## Network ##
  dist_type = "hamming",
  dist_cutoff = 1,
  drop_isolated_nodes = TRUE,
  node_stats = FALSE,
  stats_to_include = chooseNodeStats(),
  cluster_stats = FALSE,
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_id",

  ## Visualization ##
  plots = TRUE,
  print_plots = FALSE,
  plot_title = "auto",
  plot_subtitle = "auto",
  color_nodes_by = "auto",
  ...,

  ## Output ##
  output_dir = NULL,
  output_type = "rds",
  output_name = "MyRepSeqNetwork",
  pdf_width = 12,
  pdf_height = 10,
  verbose = FALSE

)

# Alias for buildRepSeqNetwork()
buildNet(
  data,
  seq_col,
  count_col = NULL,
  subset_cols = NULL,
  min_seq_length = 3,
  drop_matches = NULL,
  dist_type = "hamming",
  dist_cutoff = 1,
  drop_isolated_nodes = TRUE,
  node_stats = FALSE,
  stats_to_include = chooseNodeStats(),
  cluster_stats = FALSE,
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_id",
  plots = TRUE,
  print_plots = FALSE,
  plot_title = "auto",
  plot_subtitle = "auto",
  color_nodes_by = "auto",
  ...,
  output_dir = NULL,
  output_type = "rds",
  output_name = "MyRepSeqNetwork",
  pdf_width = 12,
  pdf_height = 10,
  verbose = FALSE

)

Arguments

`data`	A data frame containing the AIRR-Seq data, with variables indexed by column and observations (e.g., clones or cells) indexed by row.
`seq_col`	Specifies the column(s) of `data` containing the receptor sequences to be used as the basis of similarity between rows. Accepts a character string containing the column name or a numeric scalar containing the column index. Also accepts a vector of length 2 specifying distinct sequence columns (e.g., alpha chain and beta chain), in which case similarity between rows depends on similarity in both sequence columns (see details).
`count_col`	Optional. Specifies the column of `data` containing a measure of abundance, e.g., clone count or unique molecular identifier (UMI) count. Accepts either the column name or column index. Passed to `addClusterStats()`; only relevant if `cluster_stats = TRUE`.
`subset_cols`	Specifies which columns of the AIRR-Seq data are included in the output. Accepts a vector of column names or a vector of column indices. The default `NULL` includes all columns. The receptor sequence column is always included regardless of this argument's value. Passed to `filterInputData()`.
`min_seq_length`	A numeric scalar, or `NULL`. Observations whose receptor sequences have fewer than `min_seq_length` characters are removed prior to network analysis.
`drop_matches`	Optional. Passed to `filterInputData()`. Accepts a character string containing a regular expression (see `regex`). Checks receptor sequences for a pattern match using `grep()`. Those returning a match are removed prior to network analysis.
`dist_type`	Specifies the function used to quantify the similarity between sequences. The similarity between two sequences determines the pairwise distance between their respective nodes in the network graph, with greater similarity corresponding to shorter distance. Valid options are `"hamming"` (the default), which uses `hamDistBounded()`, and `"levenshtein"`, which uses `levDistBounded()`.
`dist_cutoff`	A nonnegative scalar. Specifies the maximum pairwise distance (based on `dist_type`) for an edge connection to exist between two nodes. Pairs of nodes whose distance is less than or equal to this value will be joined by an edge connection in the network graph. Controls the stringency of the network construction and affects the number and density of edges in the network. A lower cutoff value requires greater similarity between sequences in order for their respective nodes to be joined by an edge connection. A value of `0` requires two sequences to be identical in order for their nodes to be joined by an edge.
`drop_isolated_nodes`	A logical scalar. When `TRUE`, removes each node that is not joined by an edge connection to any other node in the network graph.
`node_stats`	A logical scalar. Specifies whether node-level network properties are computed.
`stats_to_include`	A named logical vector returned by `chooseNodeStats()` or `exclusiveNodeStats()`. Specifies the node-level network properties to compute. Also accepts the value `"all"`. Only relevant if `node_stats = TRUE`.
`cluster_stats`	A logical scalar. Specifies whether to compute cluster-level network properties.
`cluster_fun`	Passed to `addClusterMembership()`. Specifies the clustering algorithm used when cluster analysis is performed. Cluster analysis is performed when `cluster_stats = TRUE` or when `node_stats = TRUE` with the `cluster_id` property enabled via the `stats_to_include` argument.
`cluster_id_name`	Passed to `addClusterMembership()`. Specifies the name of the cluster membership variable added to the node metadata when cluster analysis is performed (see `cluster_fun`).
`plots`	A logical scalar. Specifies whether to generate plots of the network graph.
`print_plots`	A logical scalar. If `plots = TRUE`, specifies whether the plots should be printed to the R plotting window.
`plot_title`	A character string or `NULL`. If `plots = TRUE`, this is the title used for each plot. The default value `"auto"` generates the title based on the value of the `output_name` argument.
`plot_subtitle`	A character string or `NULL`. If `plots = TRUE`, this is the subtitle used for each plot. The default value `"auto"` generates a subtitle based on the values of the `dist_type` and `dist_cutoff` arguments.
`color_nodes_by`	Optional. Specifies a variable to be used as metadata for coloring the nodes in the network graph plot. Accepts a character string. This can be a column name of `data` or (if `node_stats = TRUE`) the name of a computed node-level network property (based on `stats_to_include`). Also accepts a character vector specifying multiple variables, in which case one plot will be generated for each variable. The default value `"auto"` attempts to use one of several potential variables to color the nodes, depending on what is available. A value of `NULL` leaves the nodes uncolored.
`...`	Other named arguments to `addPlots()`.
`output_dir`	A file path specifying the directory for saving the output. The directory will be created if it does not exist. If `NULL`, output will be returned but not saved.
`output_type`	A character string specifying the file format to use when saving the output. The values `"rds"` (the default) and `"rda"` save the returned list as a single file using the rds and rda format, respectively (in the latter case, the list will be named `net` within the rda file). The value `"individual"` saves each element of the returned list as an individual uncompressed file, with data frames saved in csv format. Regardless of the argument value, any plots generated will be saved to a pdf file containing one plot per page.
`output_name`	A character string. All files saved will have file names beginning with this value.
`pdf_width`	Sets the width of each plot when writing to pdf. Passed to `saveNetwork()`.
`pdf_height`	Sets the height of each plot when writing to pdf. Passed to `saveNetwork()`.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
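
The effect of `min_seq_length` and `drop_matches` on the input can be sketched in base R as follows. This is an illustration of the filtering rules described above, not the code used by `filterInputData()`; the sequences and the regular expression are made up for the example.

```r
# Illustrative receptor sequences, some of which should be filtered out.
seqs <- c("CASSLTGNEQFF", "CA", "CASS*VETQYF", "CASSSPETQYF", "CAS_LGGF")

min_seq_length <- 3
drop_matches   <- "[*_]"   # drop sequences containing '*' or '_'

# Remove sequences with fewer than min_seq_length characters.
keep <- nchar(seqs) >= min_seq_length

# Remove sequences matching the regular expression.
keep[grep(drop_matches, seqs)] <- FALSE

filtered <- seqs[keep]
print(filtered)
```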

Details

To construct the immune repertoire network, each TCR/BCR clone (bulk data) or cell (single-cell data) is modeled as a node in the network graph, corresponding to a single row of the AIRR-Seq data. For each node, the corresponding receptor sequence is considered. Both nucleotide and amino acid sequences are supported for this purpose. The receptor sequence is used as the basis of similarity and distance between nodes in the network.

Similarity between sequences is measured using either the Hamming distance or Levenshtein (edit) distance. The similarity determines the pairwise distance between nodes in the network graph. The more similar two sequences are, the shorter the distance between their respective nodes. Two nodes in the graph are joined by an edge if the distance between them is sufficiently small, i.e., if their receptor sequences are sufficiently similar.
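
The edge rule can be sketched in base R for a handful of equal-length sequences. The helper `hamming()` below is a hypothetical stand-in written for this example; it is not `hamDistBounded()` and makes no attempt to handle sequences of unequal length.

```r
# Hypothetical helper: Hamming distance between two equal-length strings.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

seqs <- c("CASSLTGNEQFF", "CASSLTGGEQFF", "CASSLAGGEQFF", "CAWWWWWWWQFF")
n <- length(seqs)

# Pairwise distance matrix.
dist_mat <- matrix(0, n, n, dimnames = list(seqs, seqs))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist_mat[i, j] <- hamming(seqs[i], seqs[j])
  }
}

# Two nodes are joined by an edge when their distance is at most dist_cutoff.
dist_cutoff <- 1
adjacency <- (dist_mat <= dist_cutoff) * 1
diag(adjacency) <- 0   # no self-loops

# Nodes with no edges; these are dropped when drop_isolated_nodes = TRUE.
isolated <- rowSums(adjacency) == 0
print(adjacency)
print(isolated)
```

In this toy network the first three sequences form a chain, while the fourth differs from every other sequence at several positions and is therefore isolated.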

For single-cell data, edge connections between nodes can be based on similarity in both the alpha chain and beta chain sequences. This is done by providing a vector of length 2 to seq_col specifying the two sequence columns in data. The distance between two nodes is then the greater of the two distances between sequences in corresponding chains. Two nodes will be joined by an edge if their alpha chain sequences are sufficiently similar and their beta chain sequences are sufficiently similar.
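
For the two-chain case, the rule "the greater of the two distances" can be sketched as follows. The sequences, the column names `alpha` and `beta`, and the helpers `hamming()` and `pair_distance()` are hypothetical and serve only to illustrate the rule on equal-length sequences.

```r
# Hypothetical helper: Hamming distance between two equal-length strings.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Toy single-cell data with one alpha and one beta chain per cell.
cells <- data.frame(
  alpha = c("CAVRDSNYQLIW", "CAVRDSNYQLIW", "CAVRGSNYQLIW"),
  beta  = c("CASSLTGNEQFF", "CASSLTGGEQFF", "CASSPPPPEQFF"),
  stringsAsFactors = FALSE
)

# Distance between two cells: the greater of the per-chain distances.
pair_distance <- function(i, j) {
  max(
    hamming(cells$alpha[i], cells$alpha[j]),
    hamming(cells$beta[i],  cells$beta[j])
  )
}

dist_cutoff <- 1
d12 <- pair_distance(1, 2)   # identical alpha chains, similar beta chains
d13 <- pair_distance(1, 3)   # similar alpha chains, dissimilar beta chains
print(c(edge_1_2 = d12 <= dist_cutoff, edge_1_3 = d13 <= dist_cutoff))
```

Cells 1 and 3 are not joined even though their alpha chains differ at a single position, because the beta chain distance exceeds the cutoff.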

See the buildRepSeqNetwork package vignette for more details. The vignette can be accessed offline using vignette("buildRepSeqNetwork").

Value

If the constructed network contains no nodes, the function will return NULL, invisibly, with a warning. Otherwise, the function invisibly returns a list containing the following items:

`details`	A list containing information about the network and the settings used during its construction.
`igraph`	An object of class `igraph` containing the list of nodes and edges for the network graph.
`adjacency_matrix`	The network graph adjacency matrix, stored as a sparse matrix of class `dgCMatrix` from the `Matrix` package. See `dgCMatrix-class`.
`node_data`	A data frame containing metadata for the network nodes, where each row corresponds to a node in the network graph. This data frame contains all variables from `data` (unless otherwise specified via `subset_cols`) in addition to the computed node-level network properties if `node_stats = TRUE`. Each row's name is the name of the corresponding row from `data`.
`cluster_data`	A data frame containing network properties for the clusters, where each row corresponds to a cluster in the network graph. Only included if `cluster_stats = TRUE`.
`plots`	A list containing one element for each plot generated as well as an additional element for the matrix that specifies the graph layout. Each plot is an object of class `ggraph`. Only included if `plots = TRUE`.

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

buildRepSeqNetwork vignette

Examples

set.seed(42)
toy_data <- simulateToyData()

# Simple call
network <- buildNet(
  toy_data,
  seq_col = "CloneSeq",
  print_plots = TRUE
)

# Customized:
network <- buildNet(
  toy_data, "CloneSeq",
  dist_type = "levenshtein",
  node_stats = TRUE,
  cluster_stats = TRUE,
  cluster_fun = "louvain",
  cluster_id_name = "cluster_membership",
  count_col = "CloneCount",
  color_nodes_by = c("SampleID", "cluster_membership", "coreness"),
  color_scheme = c("default", "Viridis", "plasma-1"),
  size_nodes_by = "degree",
  node_size_limits = c(0.1, 1.5),
  plot_title = NULL,
  plot_subtitle = NULL,
  print_plots = TRUE,
  verbose = TRUE
)

typeof(network)

names(network)

network$details

head(network$node_data)

head(network$cluster_data)

Specify Node-level Network Properties to Compute

Description

Create a vector specifying node-level network properties to compute. Intended for use with buildRepSeqNetwork() or addNodeStats().

node_stat_settings() is a deprecated equivalent of chooseNodeStats().

Usage

chooseNodeStats(
  degree = TRUE,
  cluster_id = FALSE,
  transitivity = TRUE,
  closeness = FALSE,
  centrality_by_closeness = FALSE,
  eigen_centrality = TRUE,
  centrality_by_eigen = TRUE,
  betweenness = TRUE,
  centrality_by_betweenness = TRUE,
  authority_score = TRUE,
  coreness = TRUE,
  page_rank = TRUE,
  all_stats = FALSE
)

exclusiveNodeStats(
  degree = FALSE,
  cluster_id = FALSE,
  transitivity = FALSE,
  closeness = FALSE,
  centrality_by_closeness = FALSE,
  eigen_centrality = FALSE,
  centrality_by_eigen = FALSE,
  betweenness = FALSE,
  centrality_by_betweenness = FALSE,
  authority_score = FALSE,
  coreness = FALSE,
  page_rank = FALSE
)

Arguments

`degree`	Logical. Whether to compute network degree.
`cluster_id`	Logical. Whether to perform cluster analysis and record the cluster membership of each node. See `addClusterMembership()`.
`transitivity`	Logical. Whether to compute node-level network transitivity using `transitivity()` with `type = "local"`. The local transitivity of a node is the number of triangles connected to the node relative to the number of triples centered on that node.
`closeness`	Logical. Whether to compute network closeness using `closeness()`.
`centrality_by_closeness`	Logical. Whether to compute network centrality by closeness. The values are the entries of the `res` element of the list returned by `centr_clo()`.
`eigen_centrality`	Logical. Whether to compute the eigenvector centrality scores of node network positions. The scores are the entries of the `vector` element of the list returned by `eigen_centrality()` with `weights = NA`. The centrality scores correspond to the values of the first eigenvector of the adjacency matrix for the cluster graph.
`centrality_by_eigen`	Logical. Whether to compute node-level network centrality scores based on eigenvector centrality scores. The scores are the entries of the `vector` element of the list returned by `centr_eigen()`.
`betweenness`	Logical. Whether to compute network betweenness using `betweenness()`.
`centrality_by_betweenness`	Logical. Whether to compute network centrality scores by betweenness. The scores are the entries of the `res` element of the list returned by `centr_betw()`.
`authority_score`	Logical. Whether to compute the authority score using `authority_score()`.
`coreness`	Logical. Whether to compute network coreness using `coreness()`.
`page_rank`	Logical. Whether to compute page rank. The page rank values are the entries of the `vector` element of the list returned by `page_rank()`.
`all_stats`	Logical. If `TRUE`, all other argument values are overridden and set to `TRUE`.

Details

These functions return a vector that can be passed to the stats_to_include argument of addNodeStats() (or buildRepSeqNetwork(), if node_stats = TRUE) in order to specify which node-level network properties to compute.

chooseNodeStats and exclusiveNodeStats each have default argument values suited to a different use case, in order to reduce the number of argument values that must be set manually.

chooseNodeStats has most arguments TRUE by default. It is best suited for including a majority of the available properties. It can be called with all_stats = TRUE to set all values to TRUE.

exclusiveNodeStats has all of its arguments set to FALSE by default. It is best suited for including only a few properties.

Value

A named logical vector with one entry for each of the function's arguments (except for all_stats). Each entry has the same name as the corresponding argument, and its value matches the argument's value.
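
As a rough illustration of the return value described above, the following base R sketch mimics the pattern with a hypothetical helper, `choose_stats_sketch()`, that models only three of the arguments. It is not the package's implementation, and the defaults shown are assumptions chosen for illustration.

```r
# Hypothetical stand-in for the pattern used by chooseNodeStats();
# only three arguments are modelled and the defaults are illustrative.
choose_stats_sketch <- function(degree = TRUE,
                                closeness = FALSE,
                                page_rank = TRUE,
                                all_stats = FALSE) {
  # One named entry per argument, excluding all_stats
  out <- c(degree = degree, closeness = closeness, page_rank = page_rank)
  if (all_stats) {
    out[] <- TRUE  # all_stats overrides every other argument
  }
  out
}

stats <- choose_stats_sketch(closeness = TRUE, page_rank = FALSE)
names(stats)[stats]  # names of the properties that would be computed
```

A vector of this shape is what `stats_to_include` expects: the names identify the properties and the values switch them on or off.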

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
toy_data <- simulateToyData()

net <- generateNetworkObjects(
  toy_data, "CloneSeq"
)

# Add default set of node properties
net <- addNodeStats(net)

# Modify default set of node properties
net <- addNodeStats(
  net,
  stats_to_include =
    chooseNodeStats(
      closeness = TRUE,
      page_rank = FALSE
    )
)

# Add only the specified node properties
net <- addNodeStats(
  net,
  stats_to_include =
    exclusiveNodeStats(
      degree = TRUE,
      transitivity = TRUE
    )
)

# Add all node-level network properties
net <- addNodeStats(
  net,
  stats_to_include = "all"
)

Load and Combine Data From Multiple Samples

Description

Given multiple data frames stored in separate files, loadDataFromFileList() loads and combines them into a single data frame.

combineSamples() has the same default behavior as loadDataFromFileList(), but possesses additional arguments that allow the data frames to be filtered, subsetted and augmented with sample-level variables before being combined.

Usage

loadDataFromFileList(
  file_list,
  input_type,
  data_symbols = NULL,
  header, sep, read.args
)

combineSamples(
  file_list,
  input_type,
  data_symbols = NULL,
  header, sep, read.args,
  seq_col = NULL,
  min_seq_length = NULL,
  drop_matches = NULL,
  subset_cols = NULL,
  sample_ids = NULL,
  subject_ids = NULL,
  group_ids = NULL,
  verbose = FALSE
)

Arguments

`file_list`	A character vector of file paths, or a list containing `connections` and file paths. Each element corresponds to a single file containing the data for a single sample.
`input_type`	A character string specifying the file format of the sample data files. Options are `"rds"`, `"rda"`, `"csv"`, `"csv2"`, `"tsv"`, `"table"`. See details.
`data_symbols`	Used when `input_type = "rda"`. Specifies the name of each sample's data frame within its respective Rdata file. Accepts a character vector of the same length as `file_list`. Alternatively, a single character string can be used if all data frames have the same name.
`header`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `header` argument to `read.table()`, `read.csv()`, etc.
`sep`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `sep` argument to `read.table()`, `read.csv()`, etc.
`read.args`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify non-default values of optional arguments to `read.table()`, `read.csv()`, etc. Accepts a named list of argument values. Values of `header` and `sep` in this list take precedence over values specified via the `header` and `sep` arguments.
`seq_col`	If provided, each sample's data will be filtered based on the values of `min_seq_length` and `drop_matches`. Passed to `filterInputData()` for each sample.
`min_seq_length`	Passed to `filterInputData()` for each sample.
`drop_matches`	Passed to `filterInputData()` for each sample.
`subset_cols`	Passed to `filterInputData()` for each sample.
`sample_ids`	A character or numeric vector of sample IDs, whose length matches that of `file_list`.
`subject_ids`	An optional character or numeric vector of subject IDs, whose length matches that of `file_list`. Used to assign a subject ID to each sample.
`group_ids`	A character or numeric vector of group IDs whose length matches that of `file_list`. Used to assign each sample to a group.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
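
The precedence rule stated for `read.args` can be pictured with base R's `modifyList()`. This is a sketch of the documented behaviour only; the variable names are hypothetical, and the package may merge the arguments differently internally.

```r
# Values supplied through the dedicated header and sep arguments (hypothetical)
from_args <- list(header = TRUE, sep = "\t")
# Values supplied through read.args (hypothetical)
read_args <- list(sep = "@", row.names = 1)

# Entries of read_args replace entries of the same name in from_args;
# entries that appear in only one list are kept unchanged.
effective <- utils::modifyList(from_args, read_args)
effective$sep  # "@"
```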

Details

Each file is assumed to contain the data for a single sample, with observations indexed by row, and with the same columns across samples.

Valid options for input_type (and the corresponding function used to load each file) include:

"rds": readRDS()
"rda": load()
"csv": read.csv()
"csv2": read.csv2()
"tsv": read.delim()
"table": read.table()

If input_type = "rda", the data_symbols argument specifies the name of each data frame within its respective file.

When calling combineSamples(), for each of sample_ids, subject_ids and group_ids that is non-null, a corresponding variable will be added to the combined data frame; these variables are named SampleID, SubjectID and GroupID.
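
The behaviour described above amounts to reading each file and stacking the resulting data frames by row. The sketch below reproduces that idea in base R for `"rds"` files, including a `SampleID` variable like the one `combineSamples()` adds when `sample_ids` is supplied. The data and the loop are illustrative assumptions, not the package's code.

```r
# Two small hypothetical samples with identical columns
s1 <- data.frame(CloneSeq = c("AAAA", "AAGG"), CloneCount = c(10, 5),
                 stringsAsFactors = FALSE)
s2 <- data.frame(CloneSeq = c("GGGG", "AAAA"), CloneCount = c(7, 3),
                 stringsAsFactors = FALSE)
files <- tempfile(c("sample1", "sample2"), fileext = ".rds")
saveRDS(s1, files[1])
saveRDS(s2, files[2])

sample_ids <- c("id01", "id02")

# Load each file, tag its rows with the sample ID, then combine by row
loaded <- lapply(seq_along(files), function(i) {
  df <- readRDS(files[i])
  df$SampleID <- sample_ids[i]
  df
})
combined <- do.call(rbind, loaded)
nrow(combined)  # 4: every row from both files is kept
```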

Value

A data frame containing the combined data rows from all files.

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

# Generate example data
set.seed(42)
samples <- simulateToyData(sample_size = 5)
sample_1 <- subset(samples, SampleID == "Sample1")
sample_2 <- subset(samples, SampleID == "Sample2")

# RDS format
rdsfiles <- tempfile(c("sample1", "sample2"), fileext = ".rds")
saveRDS(sample_1, rdsfiles[1])
saveRDS(sample_2, rdsfiles[2])

loadDataFromFileList(
  rdsfiles,
  input_type = "rds"
)

# With filtering and subsetting
combineSamples(
  rdsfiles,
  input_type = "rds",
  seq_col = "CloneSeq",
  min_seq_length = 13,
  drop_matches = "GGG",
  subset_cols = "CloneSeq",
  sample_ids = c("id01", "id02"),
  verbose = TRUE
)

# RData, different data frame names
rdafiles <- tempfile(c("sample1", "sample2"), fileext = ".rda")
save(sample_1, file = rdafiles[1])
save(sample_2, file = rdafiles[2])
loadDataFromFileList(
  rdafiles,
  input_type = "rda",
  data_symbols = c("sample_1", "sample_2")
)

# RData, same data frame names
df <- sample_1
save(df, file = rdafiles[1])
df <- sample_2
save(df, file = rdafiles[2])
loadDataFromFileList(
  rdafiles,
  input_type = "rda",
  data_symbols = "df"
)

# comma-separated values with header row; row names in first column
csvfiles <- tempfile(c("sample1", "sample2"), fileext = ".csv")
utils::write.csv(sample_1, csvfiles[1], row.names = TRUE)
utils::write.csv(sample_2, csvfiles[2], row.names = TRUE)
loadDataFromFileList(
  csvfiles,
  input_type = "csv",
  read.args = list(row.names = 1)
)

# semicolon-separated values with decimals as commas;
# header row, row names in first column
utils::write.csv2(sample_1, csvfiles[1], row.names = TRUE)
utils::write.csv2(sample_2, csvfiles[2], row.names = TRUE)
loadDataFromFileList(
  csvfiles,
  input_type = "csv2",
  read.args = list(row.names = 1)
)

# tab-separated values with header row and decimals as commas
tsvfiles <- tempfile(c("sample1", "sample2"), fileext = ".tsv")
utils::write.table(sample_1, tsvfiles[1], sep = "\t", dec = ",")
utils::write.table(sample_2, tsvfiles[2], sep = "\t", dec = ",")
loadDataFromFileList(
  tsvfiles,
  input_type = "tsv",
  header = TRUE,
  read.args = list(dec = ",")
)

# space-separated values with header row and NAs encoded as "No Value"
txtfiles <- tempfile(c("sample1", "sample2"), fileext = ".txt")
utils::write.table(sample_1, txtfiles[1], na = "No Value")
utils::write.table(sample_2, txtfiles[2], na = "No Value")
loadDataFromFileList(
  txtfiles,
  input_type = "table",
  read.args = list(
    header = TRUE,
    na.strings = "No Value"
  )
)

# custom value separator and row names in first column
utils::write.table(sample_1, txtfiles[1],
                   sep = "@", row.names = TRUE, col.names = FALSE
)
utils::write.table(sample_2, txtfiles[2],
                   sep = "@", row.names = TRUE, col.names = FALSE
)
loadDataFromFileList(
  txtfiles,
  input_type = "table",
  sep = "@",
  read.args = list(
    row.names = 1,
    col.names = c("rownames",
                  "CloneSeq", "CloneFrequency",
                  "CloneCount", "SampleID"
    )
  )
)

# same as previous example
# (value of sep in read.args overrides value in sep argument)
loadDataFromFileList(
  txtfiles,
  input_type = "table",
  sep = "\t",
  read.args = list(
    sep = "@",
    row.names = 1,
    col.names = c("rownames",
                  "CloneSeq", "CloneFrequency",
                  "CloneCount", "SampleID"
    )
  )
)


Get Coordinate Layout From Graph Plot

Description

Given a ggraph plot, extract the coordinate layout of the graph nodes as a two-column matrix.

Usage

extractLayout(plot)

Arguments

`plot`	An object of class `ggraph`.

Details

Equivalent to as.matrix(plot$data[c("x", "y")]).
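
To show what the equivalent expression returns, the sketch below applies it to a hypothetical stand-in for a plot object: a plain list whose `data` element is a data frame with `x` and `y` columns. A real `ggraph` object carries much more, so this only illustrates the extraction step.

```r
# Hypothetical stand-in: only the `data` element matters for the extraction
mock_plot <- list(
  data = data.frame(
    x = c(0.0, 1.5, -0.7),
    y = c(2.1, 0.3, 1.0),
    name = c("n1", "n2", "n3"),
    stringsAsFactors = FALSE
  )
)

# Keep only the coordinate columns and convert them to a numeric matrix
layout_sketch <- as.matrix(mock_plot$data[c("x", "y")])
dim(layout_sketch)  # 3 rows (one per node), 2 columns (x and y)
```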

Value

A matrix with two columns and one row per network node. Each row contains the Cartesian coordinates of the corresponding node.

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples


set.seed(42)
toy_data <- simulateToyData()
net <- buildRepSeqNetwork(toy_data, "CloneSeq", print_plots = TRUE)

my_layout <- extractLayout(net$plots[[1]])

# same as `graph_layout` element in the plot list
all.equal(my_layout, net$plots$graph_layout, check.attributes = FALSE)

Filter Data Rows and Subset Data Columns

Description

Given a data frame with a column containing receptor sequences, filter data rows by sequence length and sequence content. Keep all data columns or choose which columns to keep.

Usage

filterInputData(
  data,
  seq_col,
  min_seq_length = NULL,
  drop_matches = NULL,
  subset_cols = NULL,
  count_col = deprecated(),
  verbose = FALSE
)

Arguments

`data`	A data frame.
`seq_col`	Specifies the column(s) of `data` containing the receptor sequences. Accepts a character or numeric vector of length 1 or 2, containing either column names or column indices. Each column specified will be coerced to a character vector. Data rows containing a value of `NA` in any of the specified columns will be dropped.
`min_seq_length`	Observations whose receptor sequences have fewer than `min_seq_length` characters are dropped.
`drop_matches`	Accepts a character string containing a regular expression (see `regex`). Checks values in the receptor sequence column for a pattern match using `grep()`. Rows in which a match is found are dropped.
`subset_cols`	Specifies which columns of the AIRR-Seq data are included in the output. Accepts a character vector of column names or a numeric vector of column indices. The default `NULL` includes all columns. The receptor sequence column is always included regardless of this argument's value.
`count_col`	Does nothing.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.

Value

A data frame.
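
The filtering steps described under Arguments can be approximated in base R as follows. The sketch assumes a particular order of steps (drop `NA`, apply the length threshold, apply the pattern, then subset columns); `filterInputData()` itself may differ in detail, and the data are made up.

```r
# Hypothetical input with one short sequence, one NA and one "GGGG" match
raw <- data.frame(
  CloneSeq = c("CASSLGTDTQYF", "CASSGGGGYEQYF", "CAS", NA, "CASSPEGQLSTDTQYF"),
  CloneCount = c(4, 9, 1, 2, 6),
  SampleID = "Sample1",
  stringsAsFactors = FALSE
)

seqs <- as.character(raw$CloneSeq)
keep <- !is.na(seqs)                      # rows with NA sequences are dropped
keep <- keep & nchar(seqs) >= 10          # analogous to min_seq_length = 10
keep <- keep & !grepl("GGGG", seqs)       # analogous to drop_matches = "GGGG"

# Analogous to subset_cols = c("CloneSeq", "CloneCount")
filtered <- raw[keep, c("CloneSeq", "CloneCount"), drop = FALSE]
filtered$CloneSeq  # "CASSLGTDTQYF" "CASSPEGQLSTDTQYF"
```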

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
raw_data <- simulateToyData()

# Remove sequences shorter than 13 characters,
# as well as sequences containing the subsequence "GGGG".
# Keep variables for clone sequence, clone frequency and sample ID
filterInputData(
  raw_data,
  seq_col = "CloneSeq",
  min_seq_length = 13,
  drop_matches = "GGGG",
  subset_cols =
    c("CloneSeq", "CloneFrequency", "SampleID"),
  verbose = TRUE
)

Identify TCR/BCR Clones in a Neighborhood Around Each Associated Sequence

Description

Part of the workflow Searching for Associated TCR/BCR Clusters. Intended for use following findAssociatedSeqs() and prior to buildAssociatedClusterNetwork().

Given multiple samples of bulk Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data and a vector of associated sequences, identifies for each associated sequence a global "neighborhood" comprised of clones with TCR/BCR sequences similar to the associated sequence.

Usage

findAssociatedClones(

  ## Input ##
  file_list, input_type,
  data_symbols = NULL,
  header, sep, read.args,
  sample_ids =
    paste0("Sample", 1:length(file_list)),
  subject_ids = NULL,
  group_ids,
  seq_col,
  assoc_seqs,

  ## Neighborhood Criteria ##
  nbd_radius = 1,
  dist_type = "hamming",
  min_seq_length = 6,
  drop_matches = NULL,

  ## Output ##
  subset_cols = NULL,
  output_dir,
  output_type = "rds",
  verbose = FALSE

)

Arguments

`file_list`	A character vector of file paths, or a list containing `connections` and file paths. Each element corresponds to a single file containing the data for a single sample. Passed to `loadDataFromFileList()`.
`input_type`	A character string specifying the file format of the sample data files. Options are `"rds"`, `"rda"`, `"csv"`, `"csv2"`, `"tsv"` and `"table"`. Passed to `loadDataFromFileList()`.
`data_symbols`	Used when `input_type = "rda"`. Specifies the name of each sample's data frame within its respective Rdata file. Passed to `loadDataFromFileList()`.
`header`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `header` argument to `read.table()`, `read.csv()`, etc.
`sep`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `sep` argument to `read.table()`, `read.csv()`, etc.
`read.args`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify non-default values of optional arguments to `read.table()`, `read.csv()`, etc. Accepts a named list of argument values. Values of `header` and `sep` in this list take precedence over values specified via the `header` and `sep` arguments.
`sample_ids`	A character or numeric vector of sample IDs, whose length matches that of `file_list`. Each entry is assigned as the sample ID to the corresponding entry of `file_list`.
`subject_ids`	An optional character or numeric vector of subject IDs, whose length matches that of `file_list`. Used to assign a subject ID to each sample.
`group_ids`	A character or numeric vector of group IDs whose length matches that of `file_list`. Used to assign each sample to a group. The two groups represent the levels of the binary variable of interest.
`seq_col`	Specifies the column of each sample's data frame containing the TCR/BCR sequences. Accepts a character string containing the column name or a numeric scalar containing the column index.
`assoc_seqs`	A character vector containing the TCR/BCR sequences associated with the binary variable of interest.
`nbd_radius`	The maximum distance (based on `dist_type`) between an associated sequence and other TCR/BCR sequences belonging to its neighborhood. Lower values require sequences to be more similar to an associated sequence in order to belong to its neighborhood.
`dist_type`	Specifies the function used to quantify the similarity between sequences. The similarity between two sequences determines their pairwise distance, with greater similarity corresponding to shorter distance. Valid options are `"hamming"` (the default), which uses `hamDistBounded()`, and `"levenshtein"`, which uses `levDistBounded()`.
`min_seq_length`	Clones with TCR/BCR sequences below this length will be removed. Passed to `filterInputData()` when loading each sample.
`drop_matches`	Passed to `filterInputData()`. Accepts a character string containing a regular expression (see `regex`). Checks TCR/BCR sequences for a pattern match using `grep()`. Those returning a match are dropped. With the default `NULL`, no sequences are dropped based on pattern matching.
`subset_cols`	Controls which columns of the AIRR-Seq data from each sample are included in the output. Accepts a character vector of column names or a numeric vector of column indices. The default `NULL` includes all columns. Passed to `filterInputData()`.
`output_dir`	A file path to a directory for saving the output. A valid output directory is required, since no output is returned in R. The specified directory will be created if it does not already exist.
`output_type`	A character string specifying the file format to use for saving the output. Valid options are `"rds"` (the default), `"rda"`, `"csv"`, `"csv2"`, `"tsv"` and `"table"`. For `"rda"`, data frames are named `data` in the R environment. For the delimited text formats (`"csv"`, `"csv2"`, `"tsv"` and `"table"`), `write.table()` is called with `row.names = TRUE`.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.

Details

For each associated sequence, its neighborhood is defined to include all clones with TCR/BCR sequences that are sufficiently similar to the associated sequence. The arguments dist_type and nbd_radius control how the similarity is measured and the degree of similarity required for neighborhood membership.

For each associated sequence, a data frame is saved to an individual file. The data frame contains one row for each clone in the associated sequence's neighborhood (from all samples). It includes variables for sample ID, group ID and (if provided) subject ID, as well as variables from the AIRR-Seq data.

The files saved by this function are intended for use with buildAssociatedClusterNetwork(). See the Searching for Associated TCR/BCR Clusters article on the package website for more details.
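
The neighborhood definition above can be sketched in base R. The helper `hamming_sketch()` below is hypothetical and is not `hamDistBounded()`; in particular, it counts each extra trailing character of the longer sequence as a mismatch, which is an assumption of this sketch about how unequal lengths are handled.

```r
# Hypothetical helper: position-wise mismatches, with unmatched trailing
# characters of the longer sequence counted as mismatches (an assumption)
hamming_sketch <- function(a, b) {
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  n <- min(length(a_chars), length(b_chars))
  sum(a_chars[seq_len(n)] != b_chars[seq_len(n)]) +
    abs(length(a_chars) - length(b_chars))
}

assoc_seq <- "CASSIEGQLSTDTQYF"
clones <- c("CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF",
            "CASSPEGQLSTDTQYF", "CASSGAYEQYF")
nbd_radius <- 1

# Distance from each clone sequence to the associated sequence
dists <- vapply(clones, hamming_sketch, integer(1), b = assoc_seq)

# Clones within the radius form the neighborhood
neighborhood <- clones[dists <= nbd_radius]
```

With `nbd_radius = 1`, the first three sequences fall inside the neighborhood and the fourth does not. Raising the radius admits less similar sequences.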

Value

Returns TRUE, invisibly. The function is called for its side effects.

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Searching for Associated TCR/BCR Clusters article on package website

Examples

set.seed(42)

## Simulate 30 samples from two groups (treatment/control) ##
n_control <- n_treatment <- 15
n_samples <- n_control + n_treatment
sample_size <- 30 # (seqs per sample)
base_seqs <- # first five are associated with treatment
  c("CASSGAYEQYF", "CSVDLGKGNNEQFF", "CASSIEGQLSTDTQYF",
    "CASSEEGQLSTDTQYF", "CASSPEGQLSTDTQYF",
    "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF")
# Relative generation probabilities by control/treatment group
pgen_c <- matrix(rep(c(rep(1, 5), rep(30, 3)), times = n_control),
                 nrow = n_control, byrow = TRUE)
pgen_t <- matrix(rep(c(1, 1, rep(1/3, 3), rep(2, 3)), times = n_treatment),
                 nrow = n_treatment, byrow = TRUE)
pgen <- rbind(pgen_c, pgen_t)
simulateToyData(
  samples = n_samples,
  sample_size = sample_size,
  prefix_length = 1,
  prefix_chars = c("", ""),
  prefix_probs = cbind(rep(1, n_samples), rep(0, n_samples)),
  affixes = base_seqs,
  affix_probs = pgen,
  num_edits = 0,
  output_dir = tempdir(),
  no_return = TRUE
)

## Step 1: Find Associated Sequences ##
sample_files <-
  file.path(tempdir(),
            paste0("Sample", 1:n_samples, ".rds")
  )
group_labels <- c(rep("reference", n_control),
                  rep("comparison", n_treatment))
associated_seqs <-
  findAssociatedSeqs(
    file_list = sample_files,
    input_type = "rds",
    group_ids = group_labels,
    seq_col = "CloneSeq",
    min_seq_length = NULL,
    drop_matches = NULL,
    min_sample_membership = 0,
    pval_cutoff = 0.1
  )
head(associated_seqs[, 1:5])

## Step 2: Find Associated Clones ##
dir_step2 <- tempfile()
findAssociatedClones(
  file_list = sample_files,
  input_type = "rds",
  group_ids = group_labels,
  seq_col = "CloneSeq",
  assoc_seqs = associated_seqs$ReceptorSeq,
  min_seq_length = NULL,
  drop_matches = NULL,
  output_dir = dir_step2
)

## Step 3: Global Network of Associated Clusters ##
associated_clusters <-
  buildAssociatedClusterNetwork(
    file_list = list.files(dir_step2,
                           full.names = TRUE
    ),
    seq_col = "CloneSeq",
    size_nodes_by = 1.5,
    print_plots = TRUE
  )


Identify TCR/BCR Sequences Associated With a Binary Variable

Description

Part of the workflow Searching for Associated TCR/BCR Clusters.

Given multiple samples of bulk Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data and a binary variable of interest such as a disease condition, treatment or clinical outcome, identify receptor sequences that exhibit a statistically significant difference in frequency between the two levels of the binary variable.

findAssociatedSeqs() is designed for use when each sample is stored in a separate file. findAssociatedSeqs2() is designed for use with a single data frame containing all samples.
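
This section does not state which statistical test produces the P-value that is compared against `pval_cutoff`. As one plausible illustration, the sketch below applies Fisher's exact test to hypothetical counts of samples that contain a given sequence in each group. Treat the choice of test and all of the numbers as assumptions.

```r
# Hypothetical counts: samples per group, and how many contain the sequence
n_group1 <- 15
n_group2 <- 15
present_group1 <- 2
present_group2 <- 11

# 2 x 2 table of presence/absence by group
tab <- matrix(
  c(present_group1, n_group1 - present_group1,
    present_group2, n_group2 - present_group2),
  nrow = 2,
  dimnames = list(c("present", "absent"), c("group1", "group2"))
)

pval <- fisher.test(tab)$p.value
pval < 0.05  # TRUE here, so this sequence would pass a cutoff of 0.05
```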

Usage

findAssociatedSeqs(
  ## Input ##
  file_list,
  input_type,
  data_symbols = NULL,
  header, sep, read.args,
  sample_ids = deprecated(),
  subject_ids = NULL,
  group_ids,
  groups = deprecated(),
  seq_col,
  freq_col = NULL,

  ## Search Criteria ##
  min_seq_length = 7,
  drop_matches = "[*|_]",
  min_sample_membership = 5,
  pval_cutoff = 0.05,

  ## Output ##
  outfile = NULL,
  verbose = FALSE
)


findAssociatedSeqs2(
  ## Input ##
  data,
  seq_col,
  sample_col,
  subject_col = sample_col,
  group_col,
  groups = deprecated(),
  freq_col = NULL,

  ## Search Criteria ##
  min_seq_length = 7,
  drop_matches = "[*|_]",
  min_sample_membership = 5,
  pval_cutoff = 0.05,

  ## Output ##
  outfile = NULL,
  verbose = FALSE
)

Arguments

`file_list`	A character vector of file paths, or a list containing `connections` and file paths. Each element corresponds to a single file containing the data for a single sample. Passed to `loadDataFromFileList()`.
`input_type`	A character string specifying the file format of the sample data files. Options are `"rds"`, `"rda"`, `"csv"`, `"csv2"`, `"tsv"` and `"table"`. Passed to `loadDataFromFileList()`.
`data_symbols`	Used when `input_type = "rda"`. Specifies the name of each sample's data frame within its respective Rdata file. Passed to `loadDataFromFileList()`.
`header`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `header` argument to `read.table()`, `read.csv()`, etc.
`sep`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `sep` argument to `read.table()`, `read.csv()`, etc.
`read.args`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify non-default values of optional arguments to `read.table()`, `read.csv()`, etc. Accepts a named list of argument values. Values of `header` and `sep` in this list take precedence over values specified via the `header` and `sep` arguments.
`sample_ids`	Deprecated. Does nothing.
`subject_ids`	A character or numeric vector of subject IDs, whose length matches that of `file_list`. Only relevant when the binary variable of interest is subject-specific and multiple samples belong to the same subject.
`group_ids`	A character or numeric vector of group IDs containing exactly two unique values and with length matching that of `file_list`. The two groups correspond to the two values of the binary variable of interest.
`groups`	Deprecated. Does nothing.
`seq_col`	Specifies the column of each sample's data frame containing the TCR/BCR sequences. Accepts a character string containing the column name or a numeric scalar containing the column index.
`freq_col`	Optional. Specifies the column of each sample's data frame containing the clone frequency (i.e., clone count divided by the sum of the clone counts across all clones in the sample). Accepts a character string containing the column name or a numeric scalar containing the column index. If this argument is specified, the maximum clone frequency (across all samples) for each associated sequence will be included in the content of the `label` variable of the returned data frame.
`min_seq_length`	Controls the minimum TCR/BCR sequence length considered when searching for associated sequences. Passed to `filterInputData()`.
`drop_matches`	Passed to `filterInputData()`. Accepts a character string containing a regular expression (see `regex`). Checks TCR/BCR sequences for a pattern match using `grep()`. Those returning a match are excluded from consideration as associated sequences. It is recommended to filter out sequences containing special characters that are invalid for use in file names. By default, sequences containing any of the characters `*`, `|` or `_` are dropped.
`min_sample_membership`	Controls the minimum number of samples in which a TCR/BCR sequence must be present in order to be considered when searching for associated sequences. Setting this value to `NULL` bypasses the check.
`pval_cutoff`	Controls the P-value cutoff below which an association is detected by Fisher's exact test (see details).
`outfile`	A file path for saving the output (using `write.csv()`).
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
`data`	A data frame containing the combined AIRR-seq data for all samples.
`sample_col`	The column of `data` containing the sample IDs. Accepts a character string containing the column name or a numeric scalar containing the column index.
`subject_col`	Optional. The column of `data` containing the subject IDs. Accepts a character string containing the column name or a numeric scalar containing the column index. Only relevant when the binary variable of interest is subject-specific and multiple samples belong to the same subject.
`group_col`	The column of `data` containing the group IDs. Accepts a character string containing the column name or a numeric scalar containing the column index. The groups correspond to the two values of the binary variable of interest. Thus there should be exactly two unique values in this column.

Details

The TCR/BCR sequences from all samples are first filtered according to minimum sequence length and sequence content based on the specified values in min_seq_length and drop_matches, respectively. The sequences are further filtered based on sample membership, removing sequences appearing in fewer than min_sample_membership samples.
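The filtering steps described above can be sketched in base R. This is an illustration only, not the package's implementation: the data frame `toy` and its column names are hypothetical, and the internal behavior of `filterInputData()` may differ in detail.

```r
# Hypothetical combined data: one row per (sample, sequence) observation
toy <- data.frame(
  SampleID = c("s1", "s1", "s2", "s2", "s3", "s3", "s3"),
  CloneSeq = c("CASSGAYEQYF", "CASSL*GNTEAFF", "CASSGAYEQYF", "CASSHRGTDTQYF",
               "CASSGAYEQYF", "CASSHRGTDTQYF", "CAT"),
  stringsAsFactors = FALSE
)
min_seq_length <- 7
drop_matches <- "[*|_]"
min_sample_membership <- 3

# 1. Keep sequences that meet the minimum length
keep <- nchar(toy$CloneSeq) >= min_seq_length
# 2. Drop sequences matching the regular expression
keep <- keep & !grepl(drop_matches, toy$CloneSeq)
filtered <- toy[keep, ]

# 3. Count the distinct samples containing each remaining sequence
membership <- tapply(filtered$SampleID, filtered$CloneSeq,
                     function(ids) length(unique(ids)))
is_shared <- as.vector(membership >= min_sample_membership)
candidates <- names(membership)[is_shared]
candidates
```

Only the sequences in `candidates` would go on to be tested for association.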

For each remaining TCR/BCR sequence, a P-value is computed for Fisher's exact test of independence between the binary variable of interest and the presence of the sequence within a repertoire. The samples/subjects are divided into two groups based on the levels of the binary variable. If subject IDs are provided, then the test is based on the number of subjects in each group for whom the sequence appears in one of their samples. Without subject IDs, the test is based on the number of samples possessing the sequence in each group.

Fisher's exact test is performed using fisher.test(). TCR/BCR sequences with a P-value below pval_cutoff are sorted by P-value and returned along with some additional information.
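As a rough illustration of the test for a single sequence, the sketch below builds a 2x2 table of presence/absence by group and passes it to `fisher.test()`. The counts are hypothetical, and the exact way the package assembles its contingency table is an assumption here.

```r
# Hypothetical counts: 15 subjects per group; the sequence is present in
# 9 subjects of the first group and 2 subjects of the second group
n_g0 <- 15
n_g1 <- 15
present_g0 <- 9
present_g1 <- 2

# 2x2 contingency table (filled column by column):
# rows = presence/absence of the sequence, columns = group
tab <- matrix(
  c(present_g0, n_g0 - present_g0,
    present_g1, n_g1 - present_g1),
  nrow = 2,
  dimnames = list(sequence = c("present", "absent"),
                  group = c("g0", "g1"))
)

pval <- fisher.test(tab)$p.value
pval_cutoff <- 0.05

# The sequence is flagged as associated when its P-value is below the cutoff
is_associated <- pval < pval_cutoff
is_associated
```

Repeating this for every candidate sequence and sorting by `pval` gives the general shape of the returned data frame.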

The returned output is intended for use with the findAssociatedClones() function. See the Searching for Associated TCR/BCR Clusters article on the package website.

Value

A data frame containing the TCR/BCR sequences found to be associated with the binary variable using Fisher's exact test (see details). Each row corresponds to a unique TCR/BCR sequence and includes the following variables:

`ReceptorSeq`	The unique receptor sequence.
`fisher_pvalue`	The P-value from Fisher's exact test for independence between the receptor sequence and the binary variable of interest.
`shared_by_n_samples`	The number of samples in which the sequence was observed.
`samples_g0`	Of the samples in which the sequence was observed, the number of samples belonging to the first group.
`samples_g1`	Of the samples in which the sequence was observed, the number of samples belonging to the second group.
`shared_by_n_subjects`	The number of subjects in which the sequence was observed (only present if subject IDs are specified).
`subjects_g0`	Of the subjects in which the sequence was observed, the number of subjects belonging to the first group (only present if subject IDs are specified).
`subjects_g1`	Of the subjects in which the sequence was observed, the number of subjects belonging to the second group (only present if subject IDs are specified).
`max_freq`	The maximum clone frequency across all samples. Only present if `freq_col` is non-null.
`label`	A character string summarizing the above information. Also includes the maximum in-sample clone frequency across all samples, if available.

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Searching for Associated TCR/BCR Clusters article on package website

Examples

set.seed(42)

## Simulate 30 samples from two groups (treatment/control) ##
n_control <- n_treatment <- 15
n_samples <- n_control + n_treatment
sample_size <- 30 # (seqs per sample)
base_seqs <- # first five are associated with treatment
  c("CASSGAYEQYF", "CSVDLGKGNNEQFF", "CASSIEGQLSTDTQYF",
    "CASSEEGQLSTDTQYF", "CASSPEGQLSTDTQYF",
    "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF")
# Relative generation probabilities by control/treatment group
pgen_c <- matrix(rep(c(rep(1, 5), rep(30, 3)), times = n_control),
                 nrow = n_control, byrow = TRUE)
pgen_t <- matrix(rep(c(1, 1, rep(1/3, 3), rep(2, 3)), times = n_treatment),
                 nrow = n_treatment, byrow = TRUE)
pgen <- rbind(pgen_c, pgen_t)
simulateToyData(
  samples = n_samples,
  sample_size = sample_size,
  prefix_length = 1,
  prefix_chars = c("", ""),
  prefix_probs = cbind(rep(1, n_samples), rep(0, n_samples)),
  affixes = base_seqs,
  affix_probs = pgen,
  num_edits = 0,
  output_dir = tempdir(),
  no_return = TRUE
)

## Step 1: Find Associated Sequences ##
sample_files <-
  file.path(tempdir(),
            paste0("Sample", 1:n_samples, ".rds")
  )
group_labels <- c(rep("reference", n_control),
                  rep("comparison", n_treatment))
associated_seqs <-
  findAssociatedSeqs(
    file_list = sample_files,
    input_type = "rds",
    group_ids = group_labels,
    seq_col = "CloneSeq",
    min_seq_length = NULL,
    drop_matches = NULL,
    min_sample_membership = 0,
    pval_cutoff = 0.1
  )
head(associated_seqs[, 1:5])

## Step 2: Find Associated Clones ##
dir_step2 <- tempfile()
findAssociatedClones(
  file_list = sample_files,
  input_type = "rds",
  group_ids = group_labels,
  seq_col = "CloneSeq",
  assoc_seqs = associated_seqs$ReceptorSeq,
  min_seq_length = NULL,
  drop_matches = NULL,
  output_dir = dir_step2
)

## Step 3: Global Network of Associated Clusters ##
associated_clusters <-
  buildAssociatedClusterNetwork(
    file_list = list.files(dir_step2,
                           full.names = TRUE
    ),
    seq_col = "CloneSeq",
    size_nodes_by = 1.5,
    print_plots = TRUE
  )



Find Public Clusters Among RepSeq Samples

Description

Part of the workflow Searching for Public TCR/BCR Clusters.

Given multiple samples of bulk Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data, construct the repertoire network for each sample. Within each sample's network, perform cluster analysis and filter the clusters based on node count and aggregate clone count.

Usage

findPublicClusters(

  ## Input ##
  file_list,
  input_type,
  data_symbols = NULL,
  header, sep, read.args,
  sample_ids =
    paste0("Sample", 1:length(file_list)),
  seq_col,
  count_col = NULL,

  ## Search Criteria ##
  min_seq_length = 3,
  drop_matches = "[*|_]",
  top_n_clusters = 20,
  min_node_count = 10,
  min_clone_count = 100,

  ## Optional Visualization ##
  plots = FALSE,
  print_plots = FALSE,
  plot_title = "auto",
  color_nodes_by = "cluster_id",

  ## Output ##
  output_dir,
  output_type = "rds",

  ## Optional Output ##
  output_dir_unfiltered = NULL,
  output_type_unfiltered = "rds",

  verbose = FALSE,

  ...

)

Arguments

`file_list`	A character vector of file paths, or a list containing `connections` and file paths. Each element corresponds to a single file containing the data for a single sample. Passed to `loadDataFromFileList()`.
`input_type`	A character string specifying the file format of the sample data files. Options are `"table"`, `"txt"`, `"tsv"`, `"csv"`, `"rds"` and `"rda"`. Passed to `loadDataFromFileList()`.
`data_symbols`	Used when `input_type = "rda"`. Specifies the name of each sample's data frame within its respective Rdata file. Passed to `loadDataFromFileList()`.
`header`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `header` argument to `read.table()`, `read.csv()`, etc.
`sep`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `sep` argument to `read.table()`, `read.csv()`, etc.
`read.args`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify non-default values of optional arguments to `read.table()`, `read.csv()`, etc. Accepts a named list of argument values. Values of `header` and `sep` in this list take precedence over values specified via the `header` and `sep` arguments.
`sample_ids`	A character or numeric vector of sample IDs, whose length matches that of `file_list`. The values should be valid for use as filenames and should avoid using the forward slash or backslash characters (`/` or `\`).
`seq_col`	Specifies the column of each sample's data frame containing the TCR/BCR sequences. Accepts a character string containing the column name or a numeric scalar containing the column index.
`count_col`	Specifies the column of each sample's data frame containing the clone count (measure of clonal abundance). Accepts a character string containing the column name or a numeric scalar containing the column index. If `NULL`, the clusters in each sample's network will be selected solely based upon node count.
`min_seq_length`	Passed to `buildRepSeqNetwork()` when constructing the network for each sample.
`drop_matches`	Passed to `buildRepSeqNetwork()` when constructing the network for each sample. Accepts a character string containing a regular expression (see `regex`). Checks TCR/BCR sequences for a pattern match using `grep()`. Those returning a match are dropped. By default, sequences containing any of the characters `*`, `|` or `_` are dropped.
`top_n_clusters`	The number of clusters from each sample to be automatically included among the filtered clusters, based on greatest node count.
`min_node_count`	Clusters with at least this many nodes will be included among the filtered clusters.
`min_clone_count`	Clusters with an aggregate clone count of at least this value will be included among the filtered clusters. A value of `NULL` ignores this criterion and does not select additional clusters based on clone count.
`plots`	Passed to `buildRepSeqNetwork()` when constructing the network for each sample.
`print_plots`	Passed to `buildRepSeqNetwork()` when constructing the network for each sample.
`plot_title`	Passed to `buildRepSeqNetwork()` when constructing the network for each sample.
`color_nodes_by`	Passed to `buildRepSeqNetwork()` when constructing the network for each sample.
`output_dir`	The file path of the directory for saving the output. The directory will be created if it does not already exist.
`output_type`	A character string specifying the file format to use for saving the output. Valid options include `"csv"`, `"rds"` and `"rda"`.
`output_dir_unfiltered`	An optional directory for saving the unfiltered network data for each sample. By default, only the filtered results are saved.
`output_type_unfiltered`	A character string specifying the file format to use for saving the unfiltered network data for each sample. Only applicable if `output_dir_unfiltered` is non-null. Passed to `buildRepSeqNetwork()` when constructing the network for each sample.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
`...`	Other arguments to `buildRepSeqNetwork` when constructing the network for each sample, not including `node_stats`, `stats_to_include`, `cluster_stats`, `cluster_id_name` or `output_name` (see details).
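Read together, the descriptions of `top_n_clusters`, `min_node_count` and `min_clone_count` suggest that a cluster is kept if it satisfies any one of the three criteria. The base R sketch below illustrates that reading on a hypothetical cluster summary; the data frame, its column names, and the tie-handling are assumptions rather than the package's actual code.

```r
# Hypothetical cluster-level summary for one sample
clusters <- data.frame(
  cluster_id  = 1:6,
  node_count  = c(40, 12, 9, 8, 3, 2),
  clone_count = c(900, 150, 80, 60, 500, 10)
)
top_n_clusters <- 2
min_node_count <- 9
min_clone_count <- 100

# Rank clusters by node count, largest first
rank_by_nodes <- rank(-clusters$node_count, ties.method = "first")

# Keep a cluster if it meets any one of the three criteria
keep <- rank_by_nodes <= top_n_clusters |
  clusters$node_count >= min_node_count |
  clusters$clone_count >= min_clone_count

kept_ids <- clusters$cluster_id[keep]
kept_ids
```

Here cluster 5 is retained on clone count alone, even though it has few nodes, while clusters 4 and 6 meet none of the criteria.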

Details

Each sample's network is constructed using an individual call to buildNet() with node_stats = TRUE, stats_to_include = "all", cluster_stats = TRUE and cluster_id_name = "ClusterIDInSample". The node-level properties are renamed to reflect their correspondence to the sample-level network. Specifically, the properties are named:

SampleLevelNetworkDegree
SampleLevelTransitivity
SampleLevelCloseness
SampleLevelCentralityByCloseness
SampleLevelCentralityByEigen
SampleLevelEigenCentrality
SampleLevelBetweenness
SampleLevelCentralityByBetweenness
SampleLevelAuthorityScore
SampleLevelCoreness
SampleLevelPageRank

A variable SampleID is added to both the node-level and cluster-level metadata for each sample.

After the clusters in each sample are filtered, the node-level and cluster-level metadata are saved in the respective subdirectories node_meta_data and cluster_meta_data of the output directory specified by output_dir.

The unfiltered network results for each sample can also be saved by supplying a directory to output_dir_unfiltered, if these results are desired for downstream analysis. Each sample's unfiltered network results will then be saved to its own subdirectory created within this directory.

The files containing the node-level metadata for the filtered clusters can be supplied to buildPublicClusterNetwork() in order to construct a global network of public clusters. If the full global network is too large to practically construct, the files containing the cluster-level metadata for the filtered clusters can be supplied to buildPublicClusterNetworkByRepresentative() to build a global network using only a single representative sequence from each cluster. This allows prominent public clusters to still be identified.

See the Searching for Public TCR/BCR Clusters article on the package website.

Value

Returns TRUE, invisibly.

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Searching for Public TCR/BCR Clusters vignette

Examples

set.seed(42)

## Simulate 30 samples with a mix of public/private sequences ##
samples <- 30
sample_size <- 30 # (seqs per sample)
base_seqs <- c(
  "CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF", "CASSSVETQYF",
  "CASSPEGQLSTDTQYF", "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF",
  "CASSLTSGYNEQFF", "CASSETGYNEQFF", "CASSLTGGNEQFF", "CASSYLTGYNEQFF",
  "CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSFPWDGYGYTF", "CASTLARQGGELFF",
  "CASTLSRQGGELFF", "CSVELLPTGPLETSYNEQFF", "CSVELLPTGPSETSYNEQFF",
  "CVELLPTGPSETSYNEQFF", "CASLAGGRTQETQYF", "CASRLAGGRTQETQYF",
  "CASSLAGGRTETQYF", "CASSLAGGRTQETQYF", "CASSRLAGGRTQETQYF",
  "CASQYGGGNQPQHF", "CASSLGGGNQPQHF", "CASSNGGGNQPQHF", "CASSYGGGGNQPQHF",
  "CASSYGGGQPQHF", "CASSYKGGNQPQHF", "CASSYTGGGNQPQHF",
  "CAWSSQETQYF", "CASSSPETQYF", "CASSGAYEQYF", "CSVDLGKGNNEQFF")
# Relative generation probabilities
pgen <- cbind(
  stats::toeplitz(0.6^(0:(sample_size - 1))),
  matrix(1, nrow = samples, ncol = length(base_seqs) - samples)
)
simulateToyData(
  samples = samples,
  sample_size = sample_size,
  prefix_length = 1,
  prefix_chars = c("", ""),
  prefix_probs = cbind(rep(1, samples), rep(0, samples)),
  affixes = base_seqs,
  affix_probs = pgen,
  num_edits = 0,
  output_dir = tempdir(),
  no_return = TRUE
)

sample_files <-
  file.path(tempdir(),
            paste0("Sample", 1:samples, ".rds")
  )
findPublicClusters(
  file_list = sample_files,
  input_type = "rds",
  seq_col = "CloneSeq",
  count_col = "CloneCount",
  min_seq_length = NULL,
  drop_matches = NULL,
  top_n_clusters = 3,
  min_node_count = 5,
  min_clone_count = 15000,
  output_dir = tempdir()
)




Compute Graph Adjacency Matrix for Immune Repertoire Network

Description

Given a list of receptor sequences, computes the adjacency matrix for the network graph based on sequence similarity.

sparseAdjacencyMatFromSeqs() is a deprecated equivalent of generateAdjacencyMatrix().

Usage

generateAdjacencyMatrix(
  seqs,
  dist_type = "hamming",
  dist_cutoff = 1,
  drop_isolated_nodes = TRUE,
  method = "default",
  verbose = FALSE
)

# Deprecated equivalent:
sparseAdjacencyMatFromSeqs(
  seqs,
  dist_type = "hamming",
  dist_cutoff = 1,
  drop_isolated_nodes = TRUE,
  method = "default",
  verbose = FALSE,
  max_dist = deprecated()
)

Arguments

`seqs`	A character vector containing the receptor sequences.
`dist_type`	Specifies the function used to quantify the similarity between sequences. The similarity between two sequences determines the pairwise distance between their respective nodes in the network graph, with greater similarity corresponding to shorter distance. Valid options are `"hamming"` (the default), which uses `hamDistBounded`, and `"levenshtein"`, which uses `levDistBounded`.
`dist_cutoff`	A nonnegative scalar. Specifies the maximum pairwise distance (based on `dist_type`) for an edge connection to exist between two nodes. Pairs of nodes whose distance is less than or equal to this value will be joined by an edge connection in the network graph. Controls the stringency of the network construction and affects the number and density of edges in the network. A lower cutoff value requires greater similarity between sequences in order for their respective nodes to be joined by an edge connection. A value of `0` requires two sequences to be identical in order for their nodes to be joined by an edge.
`drop_isolated_nodes`	Logical. When `TRUE`, removes each node that is not joined by an edge connection to any other node in the network graph.
`method`	A character string specifying the algorithm to use. Choices are `"default"` and `"pattern"`. `"pattern"` is only valid when `dist_cutoff < 3`, but tends to be faster than `"default"` for sparsely connected networks, at the cost of greater memory usage (can cause crashes for large or densely-connected networks, particularly for `dist_cutoff = 2`). The default algorithm tends to be faster for densely-connected networks or long sequences.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
`max_dist`	Equivalent to `dist_cutoff`.

Details

The adjacency matrix of a graph with $n$ nodes is the symmetric $n \times n$ matrix for which entry $(i,j)$ is equal to 1 if nodes $i$ and $j$ are connected by an edge in the network graph and 0 otherwise.

To construct the graph of the immune repertoire network, each receptor sequence is modeled as a node. The similarity between receptor sequences, as measured using either the Hamming or Levenshtein distance, determines the distance between nodes in the network graph. The more similar two sequences are, the shorter the distance between their respective nodes. Two nodes in the graph are joined by an edge if the distance between them is sufficiently small, i.e., if their receptor sequences are sufficiently similar.
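The construction described above can be sketched in base R for equal-length sequences under the Hamming distance. This is a dense-matrix illustration only: the helper `hamming()` is hypothetical, the diagonal is left at zero, and the sparse matrix returned by the package may follow different conventions (including for its row and column names).

```r
seqs <- c("fee", "fie", "foe", "fum", "foo")
dist_cutoff <- 1

# Hamming distance between two equal-length strings:
# the number of positions at which the characters differ
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Fill the symmetric adjacency matrix one pair of nodes at a time
n <- length(seqs)
adj <- matrix(0L, n, n, dimnames = list(seqs, seqs))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    if (hamming(seqs[i], seqs[j]) <= dist_cutoff) {
      adj[i, j] <- adj[j, i] <- 1L
    }
  }
}

# Drop isolated nodes (those not joined to any other node)
connected <- rowSums(adj) > 0
adj <- adj[connected, connected]
adj
```

In this toy input, `"fum"` differs from every other sequence in two positions, so it is isolated and removed.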

Value

A sparse matrix of class dgCMatrix (see dgCMatrix-class).

If drop_isolated_nodes = TRUE, the row and column names of the matrix indicate which receptor sequences in the seqs vector correspond to each row and column of the matrix. The row and column names can be accessed using dimnames. This returns a list containing two character vectors, one for the row names and one for the column names. The name of the $i$th matrix row is the index of the seqs vector corresponding to the $i$th row and $i$th column of the matrix. The name of the $j$th matrix column is the receptor sequence corresponding to the $j$th row and $j$th column of the matrix.

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

generateAdjacencyMatrix(
  c("fee", "fie", "foe", "fum", "foo")
)

# No edge connections exist based on a Hamming distance of 1
# (returns a 0x0 sparse matrix)
generateAdjacencyMatrix(
  c("foo", "foobar", "fubar", "bar")
)

# Same as the above example, but keeping all nodes
# (returns a 4x4 sparse matrix)
generateAdjacencyMatrix(
  c("foo", "foobar", "fubar", "bar"),
  drop_isolated_nodes = FALSE
)

# Relaxing the edge criteria using a Hamming distance of 2
# (still results in no edge connections)
generateAdjacencyMatrix(
  c("foo", "foobar", "fubar", "bar"),
  dist_cutoff = 2
)

# Using a Levenshtein distance of 2, however,
# does result in edge connections
generateAdjacencyMatrix(
  c("foo", "foobar", "fubar", "bar"),
  dist_type = "levenshtein",
  dist_cutoff = 2
)

# Using a Hamming distance of 3
# also results in (different) edge connections
generateAdjacencyMatrix(
  c("foo", "foobar", "fubar", "bar"),
  dist_cutoff = 3
)

Generate the `igraph` for a Network Adjacency Matrix

Description

Given the adjacency matrix of an undirected graph, returns the corresponding igraph containing the list of nodes and edges.

generateNetworkFromAdjacencyMat() is a deprecated equivalent of generateNetworkGraph().

Usage

generateNetworkGraph(
  adjacency_matrix
)

# Deprecated equivalent:
generateNetworkFromAdjacencyMat(
  adjacency_matrix
)

Arguments

`adjacency_matrix`	A symmetric matrix. Passed to `graph_from_adjacency_matrix()`.

Value

An object of class igraph, containing the list of nodes and edges corresponding to adjacency_matrix.
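To illustrate how a symmetric adjacency matrix corresponds to a list of nodes and edges, the base R sketch below extracts an edge list by hand. It is not how the function works internally (that is delegated to `graph_from_adjacency_matrix()`), and the matrix `adj` is a hypothetical example.

```r
# Hypothetical symmetric adjacency matrix for four nodes
nodes <- c("fee", "fie", "foe", "foo")
adj <- matrix(
  c(0, 1, 1, 0,
    1, 0, 1, 0,
    1, 1, 0, 1,
    0, 0, 1, 0),
  nrow = 4, byrow = TRUE, dimnames = list(nodes, nodes)
)
stopifnot(isSymmetric(adj))

# In an undirected graph each edge appears twice in the matrix,
# so reading only the upper triangle lists every edge once
idx <- which(adj == 1 & upper.tri(adj), arr.ind = TRUE)
edges <- data.frame(
  from = nodes[idx[, "row"]],
  to   = nodes[idx[, "col"]],
  stringsAsFactors = FALSE
)
edges
```

The node list is simply the row (or column) names of the matrix, and `edges` has one row per undirected edge.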

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
toy_data <- simulateToyData(sample_size = 10)

adj_mat <-
  generateAdjacencyMatrix(
    toy_data$CloneSeq
  )

igraph <-
  generateNetworkGraph(
    adj_mat
  )

Generate Basic Output for an Immune Repertoire Network

Description

Given Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data, builds the network graph for the immune repertoire based on sequence similarity.

Usage

generateNetworkObjects(
  data,
  seq_col,
  dist_type = "hamming",
  dist_cutoff = 1,
  drop_isolated_nodes = TRUE,
  verbose = FALSE
)

Arguments

`data`	A data frame containing the AIRR-Seq data, with variables indexed by column and observations (e.g., clones or cells) indexed by row.
`seq_col`	Specifies the column(s) of `data` containing the receptor sequences to be used as the basis of similarity between rows. Accepts a character string containing the column name or a numeric scalar containing the column index. Also accepts a vector of length 2 specifying distinct sequence columns (e.g., alpha chain and beta chain), in which case similarity between rows depends on similarity in both sequence columns (see details).
`dist_type`	Specifies the function used to measure the similarity between sequences. The similarity between two sequences determines the pairwise distance between their respective nodes in the network graph. Valid options are `"hamming"` (the default), which uses `hamDistBounded()`, and `"levenshtein"`, which uses `levDistBounded()`.
`dist_cutoff`	A nonnegative scalar. Specifies the maximum pairwise distance (based on `dist_type`) for an edge connection to exist between two nodes. Pairs of nodes whose distance is less than or equal to this value will be joined by an edge connection in the network graph. Controls the stringency of the network construction and affects the number and density of edges in the network. A lower cutoff value requires greater similarity between sequences in order for their respective nodes to be joined by an edge connection. A value of `0` requires two sequences to be identical in order for their nodes to be joined by an edge.
`drop_isolated_nodes`	A logical scalar. When `TRUE`, removes each node that is not joined by an edge connection to any other node in the network graph.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.

Details

Similarity between sequences is measured using either the Hamming distance or Levenshtein (edit) distance. The similarity determines the pairwise distance between nodes in the network graph. The more similar two sequences are, the shorter the distance between their respective nodes. Two nodes are joined by an edge if their receptor sequences are sufficiently similar, i.e., if the distance between the nodes is sufficiently small.
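The edge rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `hamming_sketch()` is a hypothetical helper that handles equal-length strings only, and the sequences are made up.

```r
# Hypothetical helper (not part of NAIR): plain Hamming distance
# for two strings of equal length.
hamming_sketch <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Made-up receptor sequences, one per node
seqs <- c("CASSL", "CASSF", "CATSF", "GGGGG")
dist_cutoff <- 1

# Pairwise distances between all sequences
n <- length(seqs)
dist_mat <- matrix(0L, n, n, dimnames = list(seqs, seqs))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist_mat[i, j] <- hamming_sketch(seqs[i], seqs[j])
  }
}

# Two distinct nodes are joined by an edge when their distance
# is less than or equal to the cutoff
adj <- (dist_mat <= dist_cutoff) * 1L
diag(adj) <- 0L

# Nodes without any edge are the ones that
# drop_isolated_nodes = TRUE would remove
isolated <- rowSums(adj) == 0
adj_kept <- adj[!isolated, !isolated, drop = FALSE]
```

In this toy case, "CASSL" and "CATSF" are each joined to "CASSF" but not to each other (their distance is 2), and "GGGGG" is isolated.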

See the buildRepSeqNetwork package vignette for more details. The vignette can be accessed offline using vignette("buildRepSeqNetwork").

Value

If the constructed network contains no nodes, the function will return NULL, invisibly, with a warning. Otherwise, the function invisibly returns a list containing the following items:

`igraph`	An object of class `igraph` containing the list of nodes and edges for the network graph.
`adjacency_matrix`	The network graph adjacency matrix, stored as a sparse matrix of class `dgCMatrix` from the `Matrix` package. See `dgCMatrix-class`.
`node_data`	A data frame containing metadata for the network nodes, where each row corresponds to a node in the network graph. This data frame contains all variables from `data` (unless otherwise specified via `subset_cols`) in addition to the computed node-level network properties if `node_stats = TRUE`. Each row's name is the name of the corresponding row from `data`.

Author(s)

Brian Neal

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

buildRepSeqNetwork vignette

Examples

set.seed(42)
toy_data <- simulateToyData()

net <-
  generateNetworkObjects(
    toy_data,
    "CloneSeq"
  )

Compute Cluster-Level Network Properties

Description

Given the node-level metadata and adjacency matrix for a network graph that has been partitioned into clusters, computes network properties for the clusters and returns them in a data frame.

addClusterStats() is preferred to getClusterStats() in most situations.

Usage

getClusterStats(
  data,
  adjacency_matrix,
  seq_col = NULL,
  count_col = NULL,
  cluster_id_col = "cluster_id",
  degree_col = NULL,
  cluster_fun = deprecated(),
  verbose = FALSE
)

Arguments

`data`	A data frame containing the node-level metadata for the network, with each row corresponding to a network node.
`adjacency_matrix`	The adjacency matrix for the network.
`seq_col`	Specifies the column(s) of `data` containing the receptor sequences upon whose similarity the network is based. Accepts a character or numeric vector of length 1 or 2, containing either column names or column indices. If provided, then related cluster-level properties will be computed.
`count_col`	Specifies the column of `data` containing a measure of abundance (such as clone count or UMI count). Accepts a character string containing the column name or a numeric scalar containing the column index. If provided, related cluster-level properties will be computed.
`cluster_id_col`	Specifies the column of `data` containing the cluster membership variable that identifies the cluster to which each node belongs. Accepts a character string containing the column name or a numeric scalar containing the column index.
`degree_col`	Specifies the column of `data` containing the network degree of each node. Accepts a character string containing the column name or a numeric scalar containing the column index. If the column does not exist, the network degree will be computed.
`cluster_fun`	Deprecated. Does nothing.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.

Details

To use getClusterStats(), the network graph must first be partitioned into clusters, which can be done using addClusterMembership(). The name of the cluster membership variable in the node metadata must be provided to the cluster_id_col argument when calling getClusterStats().

Value

A data frame containing one row for each cluster in the network and the following variables:

`cluster_id`	The cluster ID number.
`node_count`	The number of nodes in the cluster.
`mean_seq_length`	The mean sequence length in the cluster. Only present when `length(seq_col) == 1`.
`A_mean_seq_length`	The mean first sequence length in the cluster. Only present when `length(seq_col) == 2`.
`B_mean_seq_length`	The mean second sequence length in the cluster. Only present when `length(seq_col) == 2`.
`mean_degree`	The mean network degree in the cluster.
`max_degree`	The maximum network degree in the cluster.
`seq_w_max_degree`	The receptor sequence possessing the maximum degree within the cluster. Only present when `length(seq_col) == 1`.
`A_seq_w_max_degree`	The first sequence of the node possessing the maximum degree within the cluster. Only present when `length(seq_col) == 2`.
`B_seq_w_max_degree`	The second sequence of the node possessing the maximum degree within the cluster. Only present when `length(seq_col) == 2`.
`agg_count`	The aggregate count among all nodes in the cluster (based on the counts in `count_col`).
`max_count`	The maximum count among all nodes in the cluster (based on the counts in `count_col`).
`seq_w_max_count`	The receptor sequence possessing the maximum count within the cluster. Only present when `length(seq_col) == 1`.
`A_seq_w_max_count`	The first sequence of the node possessing the maximum count within the cluster. Only present when `length(seq_col) == 2`.
`B_seq_w_max_count`	The second sequence of the node possessing the maximum count within the cluster. Only present when `length(seq_col) == 2`.
`diameter_length`	The longest geodesic distance in the cluster, computed as the length of the vector returned by `get_diameter()`.
`assortativity`	The assortativity coefficient of the cluster's graph, based on the degree (minus one) of each node in the cluster (with the degree computed based only upon the nodes within the cluster). Computed using `assortativity_degree()`.
`global_transitivity`	The transitivity (i.e., clustering coefficient) for the cluster's graph, which estimates the probability that adjacent vertices are connected. Computed using `transitivity()` with `type = "global"`.
`edge_density`	The number of edges in the cluster as a fraction of the maximum possible number of edges. Computed using `edge_density()`.
`degree_centrality_index`	The centrality index of the cluster's graph based on within-cluster network degree. Computed as the `centralization` element of the output from `centr_degree()`.
`closeness_centrality_index`	The centrality index of the cluster's graph based on closeness, i.e., distance to other nodes in the cluster. Computed using `centralization()`.
`eigen_centrality_index`	The centrality index of the cluster's graph based on the eigenvector centrality scores, i.e., values of the first eigenvector of the adjacency matrix for the cluster. Computed as the `centralization` element of the output from `centr_eigen()`.
`eigen_centrality_eigenvalue`	The eigenvalue corresponding to the first eigenvector of the adjacency matrix for the cluster. Computed as the `value` element of the output from `eigen_centrality()`.
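
A few of the simpler properties listed above can be reproduced by hand, which may help clarify their meaning. The following base R sketch uses made-up node metadata and a made-up adjacency matrix; it is not how getClusterStats() is implemented, and it covers only the count-based and degree-based variables. Each toy cluster is a separate connected component, so the degree computed over the whole network equals the within-cluster degree.

```r
# Toy node metadata (illustrative values only)
node_data <- data.frame(
  CloneSeq = c("AAAA", "AAAT", "AATT", "GGGG", "GGGC"),
  CloneCount = c(10, 5, 1, 7, 3),
  cluster_id = c(1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Toy adjacency matrix: edges 1-2, 2-3 and 4-5
adj <- matrix(0, 5, 5)
adj[1, 2] <- adj[2, 1] <- 1
adj[2, 3] <- adj[3, 2] <- 1
adj[4, 5] <- adj[5, 4] <- 1

# Network degree of each node = number of edges incident to it
node_data$degree <- rowSums(adj)

# One row per cluster, mirroring some of the variables listed above
cluster_stats <- do.call(rbind, lapply(
  split(node_data, node_data$cluster_id),
  function(cl) {
    data.frame(
      cluster_id = cl$cluster_id[1],
      node_count = nrow(cl),
      mean_seq_length = mean(nchar(cl$CloneSeq)),
      mean_degree = mean(cl$degree),
      max_degree = max(cl$degree),
      seq_w_max_degree = cl$CloneSeq[which.max(cl$degree)],
      agg_count = sum(cl$CloneCount),
      max_count = max(cl$CloneCount),
      seq_w_max_count = cl$CloneSeq[which.max(cl$CloneCount)],
      stringsAsFactors = FALSE
    )
  }
))

cluster_stats
```

Ties are broken here by taking the first node (the behavior of `which.max()`); how the package breaks ties is not stated in this section.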

Author(s)

Brian Neal

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
toy_data <- simulateToyData()

net <-
  generateNetworkObjects(
    toy_data, "CloneSeq"
  )

net <- addClusterMembership(net)

net$cluster_data <-
  getClusterStats(
    net$node_data,
    net$adjacency_matrix,
    seq_col = "CloneSeq",
    count_col = "CloneCount"
  )

Identify Cells or Clones in a Neighborhood Around a Target Sequence

Description

Given Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data and a target receptor sequence that is present within the data, identifies a "neighborhood" consisting of the cells/clones whose receptor sequences are sufficiently similar to the target sequence.

Usage

getNeighborhood(
    data,
    seq_col,
    target_seq,
    dist_type = "hamming",
    max_dist = 1
)

Arguments

`data`	A data frame containing the AIRR-Seq data.
`seq_col`	Specifies the column of `data` containing the receptor sequences. Accepts a character string containing the column name or a numeric scalar containing the column index.
`target_seq`	A character string containing the target receptor sequence. Must be a receptor sequence possessed by one of the clones/cells in the AIRR-Seq data.
`dist_type`	Specifies the function used to quantify the similarity between receptor sequences. The similarity between two sequences determines their pairwise distance, with greater similarity corresponding to shorter distance. Valid options are `"hamming"` (the default), which uses `hamDistBounded()`, and `"levenshtein"`, which uses `levDistBounded()`.
`max_dist`	Determines whether each cell/clone belongs to the neighborhood based on its receptor sequence's distance from the target sequence. The distance is based on the `dist_type` argument. `max_dist` specifies the maximum distance at which a cell/clone belongs to the neighborhood. Lower values require greater similarity between the target sequence and the receptor sequences of cells/clones in its neighborhood.

Value

A data frame containing the rows of data corresponding to the cells/clones in the neighborhood.

If no cell/clone in the AIRR-Seq data possesses the target sequence as its receptor sequence, then a value of NULL is returned.
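
The selection rule can be sketched in base R as follows. This is an illustration only: `hamming_sketch()` and `neighborhood_sketch()` are hypothetical helpers, the distance handles equal-length strings only, and the data are made up.

```r
# Hypothetical stand-in for the distance function
# (Hamming distance, equal-length strings only)
hamming_sketch <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Hypothetical helper following the documented behavior
neighborhood_sketch <- function(data, seq_col, target_seq, max_dist = 1) {
  seqs <- data[[seq_col]]
  # NULL when no row possesses the target sequence
  if (!target_seq %in% seqs) {
    return(NULL)
  }
  dists <- vapply(seqs, hamming_sketch, integer(1), b = target_seq)
  # Keep the rows whose distance from the target is at most max_dist
  data[dists <= max_dist, , drop = FALSE]
}

toy <- data.frame(
  CloneSeq = c("CASSL", "CASSF", "CATSF", "GGGGG"),
  CloneCount = c(4, 2, 9, 1),
  stringsAsFactors = FALSE
)

nbd <- neighborhood_sketch(toy, "CloneSeq", target_seq = "CASSF")
```

Here `nbd` holds the first three rows of `toy`, since the target itself is at distance 0 and "CASSL" and "CATSF" are each at distance 1.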

Author(s)

Brian Neal

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
toy_data <- simulateToyData(sample_size = 500)

# Get neighborhood around first clone sequence
nbd <-
  getNeighborhood(
    toy_data,
    seq_col = "CloneSeq",
    target_seq = "GGGGGGGAATTGG"
  )

head(nbd)

Bounded Computation of Hamming Distance

Description

Computes the Hamming distance between two strings subject to a specified upper bound.

Usage

hamDistBounded(a, b, k)

Arguments

`a`	A character string.
`b`	A character string to be compared to `a`.
`k`	The upper bound on the Hamming distance between `a` and `b`.

Details

For two character strings of equal length, the Hamming distance is the number of positions at which the corresponding characters differ. That is, for each position in one string, the character in that position is checked to see whether it differs from the character in the same position of the other string.

For two character strings of different lengths, the Hamming distance is not defined. However, hamDistBounded() will accommodate strings of different lengths, doing so in a conservative fashion that seeks to yield a meaningful result for the purpose of checking whether two strings are sufficiently similar. If the two strings differ in length, placeholder characters are appended to the shorter string until its length matches that of the longer string. Each appended placeholder character is treated as different from the character in the corresponding position of the longer string. This is effectively the same as truncating the end of the longer string and adding the number of deleted characters to the Hamming distance between the shorter string and the truncated longer string (which is what is actually done in practice, as the computation is faster).

The above method used by hamDistBounded() to accommodate unequal string lengths results in distance values whose meaning may be questionable, depending on context, when the two strings have different lengths. The decision to append placeholder characters to the end of the shorter string (as opposed to prepending them to the beginning) is ad hoc and somewhat arbitrary. In effect, it allows two strings of different lengths to be considered sufficiently similar if the content of the shorter string sufficiently matches the beginning content of the longer string and the difference in string length is not too great.

For comparing sequences of different lengths, the Levenshtein distance (see levDistBounded()) is more appropriate and meaningful than using hamDistBounded(), but comes at the cost of greater computational burden.

Computation is aborted early if the Hamming distance is determined to exceed the specified upper bound. This functionality is designed for cases when distinguishing between values above the upper bound is not meaningful, taking advantage of this fact to reduce the computational burden.

Value

An integer. If the Hamming distance exceeds the specified upper bound k, then a value of -1 is returned. Otherwise, returns the Hamming distance between a and b.
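
The behavior described in the details (length difference counted as mismatches, early abort, and a return value of -1 beyond the bound) can be sketched in base R. `ham_bounded_sketch()` is a hypothetical illustration written from the description above; the package's own routine is compiled code and may differ in its internals.

```r
# Hypothetical base R sketch of a bounded Hamming distance
ham_bounded_sketch <- function(a, b, k) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  shared <- min(length(x), length(y))
  # Each extra character of the longer string counts as one mismatch
  d <- abs(length(x) - length(y))
  if (d > k) {
    return(-1L)
  }
  # Compare the shorter string to the start of the longer one
  for (i in seq_len(shared)) {
    if (x[i] != y[i]) {
      d <- d + 1L
      # Abort as soon as the bound is exceeded
      if (d > k) {
        return(-1L)
      }
    }
  }
  as.integer(d)
}

ham_bounded_sketch("foo", "fee", 3)     # 2
ham_bounded_sketch("foo", "bar", 1)     # -1 (bound exceeded)
ham_bounded_sketch("foo", "foobar", 10) # 3 (length difference only)
ham_bounded_sketch("foo", "barfoo", 10) # 6 (3 mismatches + 3 extra)
```

The last two calls show the asymmetry noted in the details: a shared prefix is rewarded, while a shared suffix is not.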

Note

The computed value may be invalid when the length of either string is close to or greater than the value of INT_MAX in the compiler that was used at build time (typically 2147483647).

Author(s)

Brian Neal

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

# using an upper bound of 3
# (trivial since strings have length 3)
hamDistBounded("foo", "foo", 3)
hamDistBounded("foo", "fee", 3)
hamDistBounded("foo", "fie", 3)
hamDistBounded("foo", "foe", 3)
hamDistBounded("foo", "fum", 3)
hamDistBounded("foo", "bar", 3)

# using an upper bound of 1
# (most distances exceed the upper bound)
hamDistBounded("foo", "fee", 1)
hamDistBounded("foo", "fie", 1)
hamDistBounded("foo", "foe", 1)
hamDistBounded("foo", "fum", 1)
hamDistBounded("foo", "bar", 1)

# comparing strings of nonmatching length
hamDistBounded("foo", "fubar", 10)
hamDistBounded("foo", "foobar", 10)
hamDistBounded("foo", "barfoo", 10)

Label Clusters in a Network Graph Plot

Description

Functions for labeling the clusters in network graph plots with their cluster IDs. The user can specify a cluster-level property by which to rank the clusters, labeling only those clusters above a specified rank.

Usage

labelClusters(
  net,
  plots = NULL,
  top_n_clusters = 20,
  cluster_id_col = "cluster_id",
  criterion = "node_count",
  size = 5,
  color = "black",
  greatest_values = TRUE
)

addClusterLabels(
  plot,
  net,
  top_n_clusters = 20,
  cluster_id_col = "cluster_id",
  criterion = "node_count",
  size = 5,
  color = "black",
  greatest_values = TRUE
)

Arguments

`net`	A `list` of network objects conforming to the output of `buildRepSeqNetwork()` or `generateNetworkObjects()`. See details.
`plots`	Specifies which plots in `net$plots` to annotate. Accepts a character vector of element names or a numeric vector of element position indices. The default `NULL` annotates all plots.
`plot`	A `ggraph` object containing the network graph plot.
`top_n_clusters`	A positive integer specifying the number of clusters to label. Those with the highest rank according to the `criterion` argument will be labeled.
`cluster_id_col`	Specifies the column of `net$node_data` containing the variable for cluster membership. Accepts a character string containing the column name.
`criterion`	Can be used to specify a cluster-level network property by which to rank the clusters. Non-default values are ignored unless `net$cluster_data` exists and corresponds to the cluster membership variable specified by `cluster_id_col`. Accepts a character string containing a column name of `net$cluster_data`. The property must be quantitative for the ranking to be meaningful. By default, clusters are ranked by node count, which is computed based on the cluster membership values if necessary.
`size`	The font size of the cluster ID labels. Passed to the `size` argument of `geom_node_text()`.
`color`	The color of the cluster ID labels. Passed to the `color` argument of `geom_node_text()`.
`greatest_values`	Logical. Controls whether clusters are ranked according to the greatest or least values of the property specified by the `criterion` argument. If `TRUE`, clusters with greater values will be ranked above those with lower values, thereby receiving a higher priority to be labeled.

Details

Value

labelClusters() returns a copy of net with the specified plots annotated.

addClusterLabels() returns an annotated copy of plot.
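
How `top_n_clusters`, `criterion` and `greatest_values` interact can be sketched in base R. `clusters_to_label()` is a hypothetical helper operating on made-up cluster-level data; it only illustrates the ranking idea and says nothing about how the package breaks ties.

```r
# Toy cluster-level data (illustrative values only)
cluster_data <- data.frame(
  cluster_id = 1:5,
  node_count = c(12, 3, 30, 7, 3)
)

# Hypothetical helper: rank clusters by a quantitative property
# and return the IDs of those that would receive a label
clusters_to_label <- function(cluster_data,
                              top_n_clusters = 20,
                              criterion = "node_count",
                              greatest_values = TRUE) {
  ranking <- order(cluster_data[[criterion]], decreasing = greatest_values)
  head(cluster_data$cluster_id[ranking], top_n_clusters)
}

# The two largest clusters by node count
top_two <- clusters_to_label(cluster_data, top_n_clusters = 2)

# The two smallest clusters by node count
bottom_two <- clusters_to_label(
  cluster_data,
  top_n_clusters = 2,
  greatest_values = FALSE
)
```

With these values, `top_two` contains clusters 3 and 1, while `bottom_two` contains the two clusters with a node count of 3.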

Author(s)

Brian Neal

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
toy_data <- simulateToyData()

network <- buildRepSeqNetwork(
  toy_data, "CloneSeq",
  cluster_stats = TRUE,
  color_nodes_by = "cluster_id",
  color_scheme = "turbo",
  color_legend = FALSE,
  plot_title = NULL,
  plot_subtitle = NULL,
  size_nodes_by = 1
)

network <- labelClusters(network)

network$plots$cluster_id

Label Nodes in a Network Graph Plot

Description

Functions for annotating a graph plot to add custom labels to the nodes.

Usage

labelNodes(
  net,
  node_labels,
  plots = NULL,
  size = 5,
  color = "black"
)

addGraphLabels(
  plot,
  node_labels,
  size = 5,
  color = "black"
)

Arguments

`net`	A `list` of network objects conforming to the output of `buildRepSeqNetwork()` or `generateNetworkObjects()`. See details.
`plot`	A `ggraph` object containing the network graph plot.
`node_labels`	A vector containing the node labels, where each entry is the label for a single node. The length should match the number of nodes in the plot.
`plots`	Specifies which plots in `net$plots` to annotate. Accepts a character vector of element names or a numeric vector of element position indices. The default `NULL` annotates all plots.
`size`	The font size of the node labels. Passed to the `size` argument of `geom_node_text()`.
`color`	The color of the node labels. Passed to the `color` argument of `geom_node_text()`.

Details

Labels are added using geom_node_text().

Value

labelNodes() returns a copy of net with the specified plots annotated.

addGraphLabels() returns a ggraph object containing the original plot annotated with the node labels.

Author(s)

Brian Neal

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
toy_data <-
  simulateToyData(
    samples = 1,
    sample_size = 10,
    prefix_length = 1
  )

# Generate network
network <-
  buildNet(
    toy_data,
    seq_col = "CloneSeq",
    plot_title = NULL,
    plot_subtitle = NULL
  )

# Label each node with its receptor sequence
network <- labelNodes(network, "CloneSeq", size = 3)

network$plots[[1]]

Bounded Computation of Levenshtein Distance

Description

Computes the Levenshtein distance between two strings subject to a specified upper bound.

Usage

levDistBounded(a, b, k)

Arguments

`a`	A character string.
`b`	A character string to be compared to `a`.
`k`	An integer specifying the upper bound on the Levenshtein distance between `a` and `b`.

Details

The Levenshtein distance (sometimes referred to as edit distance) between two character strings measures the minimum number of single-character edits (insertions, deletions and substitutions) needed to transform one string into the other.

Compared to the Hamming distance (see hamDistBounded()), the Levenshtein distance is particularly useful for comparing sequences of different lengths, as it can account for insertions and deletions, whereas the Hamming distance only accounts for single-character substitutions. However, the computational burden for the Levenshtein distance can be significantly greater than for the Hamming distance.

Computation is aborted early if the Levenshtein distance is determined to exceed the specified upper bound. This functionality is designed for cases when distinguishing between values above the upper bound is not meaningful, taking advantage of this fact to reduce the computational burden.

Value

An integer. If the Levenshtein distance exceeds the specified upper bound k, then a value of -1 is returned. Otherwise, returns the Levenshtein distance between a and b.
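
For reference, a bounded Levenshtein distance can be sketched in base R with the standard row-by-row dynamic program. `lev_bounded_sketch()` is a hypothetical illustration, not the package's compiled routine. The early abort relies on the fact that the smallest entry of each row can never decrease from one row to the next, so once it exceeds `k` the final distance must exceed `k` as well.

```r
# Hypothetical base R sketch of a bounded Levenshtein distance
lev_bounded_sketch <- function(a, b, k) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # The difference in length is a lower bound on the distance
  if (abs(length(x) - length(y)) > k) {
    return(-1L)
  }
  # prev[j + 1]: distance between the first i - 1 characters of a
  # and the first j characters of b
  prev <- 0:length(y)
  for (i in seq_along(x)) {
    curr <- integer(length(y) + 1L)
    curr[1] <- i
    for (j in seq_along(y)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      curr[j + 1L] <- min(
        prev[j + 1L] + 1L, # delete x[i]
        curr[j] + 1L,      # insert y[j]
        prev[j] + cost     # match or substitute
      )
    }
    # Abort early: no later entry can fall below the row minimum
    if (min(curr) > k) {
      return(-1L)
    }
    prev <- curr
  }
  d <- prev[length(y) + 1L]
  if (d > k) -1L else as.integer(d)
}

lev_bounded_sketch("1234567", "1.23457", 7) # 2
lev_bounded_sketch("foo", "bar", 2)         # -1 (bound exceeded)
```

The unbounded result can be cross-checked against `utils::adist()`, which computes a generalized Levenshtein distance in base R.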

Note

The computed value may be invalid when the length of either string is close to or greater than the value of INT_MAX in the compiler that was used at build time (typically 2147483647).

Author(s)

Brian Neal

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

# equal string lengths,
# character transmutations only
levDistBounded("foo", "bar", 3)
hamDistBounded("foo", "bar", 3) # agrees with Hamming distance

# one insertion, one deletion
levDistBounded("1234567", "1.23457", 7)
hamDistBounded("1234567", "1.23457", 7) # compare to Hamming distance

# same as above, but with a different upper bound
levDistBounded("1234567", "1.23457", 3) # within the bound
hamDistBounded("1234567", "1.23457", 3) # exceeds the bound

# one deletion (last position)
levDistBounded("1234567890", "123456789", 10)
hamDistBounded("1234567890", "123456789", 10)

# note the Hamming distance agrees with the Levenshtein distance
# for the above example, since the deletion occurs in the final
# character position. This is due to how hamDistBounded() handles
# strings of different lengths. In the example below, however...

# one deletion (first position)
levDistBounded("1234567890", "234567890", 10)
hamDistBounded("1234567890", "234567890", 10) # compare to Hamming distance

# one deletion, one transmutation
levDistBounded("foobar", "fubar", 6)
hamDistBounded("foobar", "fubar", 6) # compare to Hamming distance

Plot the Graph of an Immune Repertoire Network

Description

Given the igraph of an immune repertoire network, generates a plot of the network graph according to the user specifications.

Deprecated. Replaced by addPlots().

Usage

plotNetworkGraph(
  igraph,
  plot_title = NULL,
  plot_subtitle = NULL,
  color_nodes_by = NULL,
  color_scheme = "default",
  color_legend = "auto",
  color_title = "auto",
  edge_width = 0.1,
  size_nodes_by = 0.5,
  node_size_limits = NULL,
  size_title = "auto",
  outfile = NULL,
  pdf_width = 12,
  pdf_height = 8
)

Arguments

`igraph`	An object of class `igraph`.
`plot_title`	A character string containing the plot title. Passed to `labs()`.
`plot_subtitle`	A character string containing the plot subtitle. Passed to `labs()`.
`color_nodes_by`	A vector whose length matches the number of nodes in the network. The values are used to encode the color of each node. An argument value of `NULL` (the default) leaves the nodes uncolored. Passed to the color aesthetic mapping of `geom_node_point()`.
`color_scheme`	A character string specifying the color scale used to color the nodes. `"default"` uses default `ggplot()` colors. Other options are one of the viridis color scales (e.g., `"plasma"`, `"A"` or other valid inputs to the `option` argument of `scale_color_viridis()`) or (for discrete variables) a palette from `hcl.pals()` (e.g., `"RdYlGn"`). Each of the viridis color scales can include the suffix `"-1"` to reverse its direction (e.g., `"plasma-1"` or `"A-1"`).
`color_legend`	A logical scalar specifying whether to display the color legend in the plot, or `"auto"` (the default), which shows the color legend if `color_nodes_by` is a continuous variable or a discrete variable with at most 20 distinct values.
`color_title`	A character string (or `NULL`) specifying the title for the color legend. Only applicable if `color_nodes_by` is a vector. If `color_title = "auto"` (the default), the title for the color legend will be the name of the vector provided to `color_nodes_by`.
`edge_width`	A numeric scalar specifying the width of the graph edges in the plot. Passed to the `width` argument of `geom_edge_link0()`.
`size_nodes_by`	A numeric scalar specifying the size of the nodes, or a numeric vector with positive entries that encodes the size of each node (and whose length matches the number of nodes in the network). Alternatively, an argument value of `NULL` uses the default `ggraph()` size for all nodes. Passed to the size aesthetic mapping of `geom_node_point()`.
`size_title`	A character string (or `NULL`) specifying the title for the size legend. Only applicable if `size_nodes_by` is a vector. If `size_title = "auto"` (the default), the title for the size legend will be the name of the vector provided to `size_nodes_by`.
`node_size_limits`	A numeric vector of length 2, specifying the minimum and maximum node size. Only applicable if `size_nodes_by` is a vector. If `node_size_limits = NULL`, the default size scale will be used.
`outfile`	An optional file path for saving the plot as a pdf. If `NULL` (the default), no pdf will be saved.
`pdf_width`	Sets the plot width when writing to pdf. Passed to the `width` argument of `pdf()`.
`pdf_height`	Sets the plot height when writing to pdf. Passed to the `height` argument of `pdf()`.

Value

A ggraph object.

Author(s)

Brian Neal

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Network Visualization article on package website

Examples

set.seed(42)
toy_data <- simulateToyData()

# Generate network for data
net <- buildNet(toy_data, "CloneSeq")

# Plot network graph
net_plot <- plotNetworkGraph(
  net$igraph,
  color_nodes_by =
    net$node_data$SampleID,
  color_title = NULL,
  size_nodes_by =
    net$node_data$CloneCount,
  size_title = "Clone Count",
  node_size_limits = c(0.5, 1.5))

print(net_plot)

Save List of Network Objects

Description

Given a list of network objects such as that returned by buildRepSeqNetwork() or generateNetworkObjects(), saves its contents according to the specified file format scheme.

Usage

saveNetwork(
  net,
  output_dir,
  output_type = "rds",
  output_name = "MyRepSeqNetwork",
  pdf_width = 12,
  pdf_height = 10,
  verbose = FALSE,
  output_filename = deprecated()
)

Arguments

`net`	A list of network objects returned by `buildRepSeqNetwork()` or `generateNetworkObjects()`.
`output_dir`	A file path specifying the directory in which to write the file(s).
`output_type`	A character string specifying the file format scheme to use when writing output to file. Valid options are `"individual"`, `"rds"` and `"rda"`. See details.
`output_name`	A character string. All files saved will have file names beginning with this value.
`pdf_width`	If the list contains plots, this controls the width of each plot when writing to pdf. Passed to the `width` argument of the `pdf` function.
`pdf_height`	If the list contains plots, this controls the height of each plot when writing to pdf. Passed to the `height` argument of the `pdf` function.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
`output_filename`	Deprecated. Equivalent to `output_name`.

Details

The list net must contain the named elements igraph (of class igraph), adjacency_matrix (a matrix or dgCMatrix encoding edge connections), and node_data (a data.frame containing node metadata), all corresponding to the same network. The list returned by buildRepSeqNetwork() and generateNetworkObjects() is an example of a valid input for the net argument.

The additional elements cluster_data (a data.frame) and plots (a list containing objects of class ggraph and possibly one matrix named graph_layout) will also be saved, if present.
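
The requirements on net can be checked before saving. The following is a minimal sketch in base R; has_required_elements() is a hypothetical helper, not a NAIR function, and the mock list only imitates the classes described above rather than being real network output.

```r
# Hypothetical helper (not part of NAIR): checks that a list has the
# elements that the Details section says saveNetwork() requires.
has_required_elements <- function(net) {
  is.list(net) &&
    all(c("igraph", "adjacency_matrix", "node_data") %in% names(net)) &&
    inherits(net$igraph, "igraph") &&
    (is.matrix(net$adjacency_matrix) ||
       inherits(net$adjacency_matrix, "dgCMatrix")) &&
    is.data.frame(net$node_data)
}

# Mock objects standing in for real network output (assumption for the demo)
mock_net <- list(
  igraph = structure(list(), class = "igraph"),
  adjacency_matrix = matrix(c(0, 1, 1, 0), nrow = 2),
  node_data = data.frame(CloneSeq = c("AAT", "AAC"))
)

ok <- has_required_elements(mock_net)

# Dropping adjacency_matrix makes the check fail
missing_ok <- has_required_elements(mock_net[c("igraph", "node_data")])
```

This check covers only the presence and class of each element; it does not verify that the three elements describe the same network.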

By default, the list net is saved to a compressed data file in the RDS format, while any plots present are printed to a single pdf containing one plot per page.

The name of each saved file begins with the value of output_name. When output_type is one of "rds" or "rda", only two files are saved (the rds/rda and the pdf); for each file, output_name is followed by the appropriate file extension.

When output_type = "individual", each element of net is saved as a separate file, where output_name is followed by:

_NodeMetadata.csv for node_data
_ClusterMetadata.csv for cluster_data
_EdgeList.txt for igraph
_AdjacencyMatrix.mtx for adjacency_matrix
_Plots.rda for plots
_GraphLayout.txt for plots$graph_layout
_Details.rds for details

node_data and cluster_data are saved using write.csv(), with row.names being TRUE for node_data and FALSE for cluster_data. The igraph is saved using write_graph() with format = "edgelist". The adjacency matrix is saved using writeMM(). The graph layout is saved using write() with ncolumns = 2.
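
As an illustration of the naming scheme above, the sketch below builds the file paths one would expect for each output_type. It is base R only and does not call saveNetwork() or write any files; the element labels used as vector names are chosen for readability and are an assumption of this sketch.

```r
# Illustration only: file paths implied by the naming scheme described above.
output_dir  <- tempdir()
output_name <- "MyRepSeqNetwork"   # default value of output_name

# output_type = "rds": one data file plus one pdf of plots
rds_files <- file.path(output_dir, paste0(output_name, c(".rds", ".pdf")))

# output_type = "individual": one file per element of the list
suffixes <- c(
  node_data        = "_NodeMetadata.csv",
  cluster_data     = "_ClusterMetadata.csv",
  igraph           = "_EdgeList.txt",
  adjacency_matrix = "_AdjacencyMatrix.mtx",
  plots            = "_Plots.rda",
  graph_layout     = "_GraphLayout.txt",
  details          = "_Details.rds"
)
individual_files <- file.path(output_dir, paste0(output_name, suffixes))
names(individual_files) <- names(suffixes)

basename(individual_files)
```

Files written this way can presumably be read back with the counterparts of the writing functions named above, for example read.csv() for the metadata files, igraph::read_graph() with format = "edgelist" for the edge list, and Matrix::readMM() for the adjacency matrix.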

Value

Returns TRUE if output is saved, otherwise returns FALSE (with a warning if output_dir is non-null and the specified directory does not exist and could not be created).

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
toy_data <- simulateToyData()

net <- buildRepSeqNetwork(
  toy_data,
  seq_col = "CloneSeq",
  node_stats = TRUE,
  cluster_stats = TRUE,
  color_nodes_by = c("transitivity", "SampleID")
)

# save as single RDS file
saveNetwork(
  net,
  output_dir = tempdir(),
  verbose = TRUE
)

# save each element of the list as a separate file
saveNetwork(
  net,
  output_dir = tempdir(),
  output_type = "individual",
  verbose = TRUE
)


Write Plots to a PDF

Description

Given a list of plots, writes all plots to a single pdf file containing one plot per page, and optionally saves the graph layout to a separate text file.

Usage

saveNetworkPlots(
  plotlist,
  outfile,
  pdf_width = 12,
  pdf_height = 10,
  outfile_layout = NULL,
  verbose = FALSE
)

Arguments

`plotlist`	A named list whose elements are of class `ggraph`. May also contain an element named `graph_layout` with the matrix specifying the graph layout.
`outfile`	A `connection` or a character string containing the file path used to save the pdf.
`pdf_width`	Sets the page width. Passed to the `width` argument of `pdf()`.
`pdf_height`	Sets the page height. Passed to the `height` argument of `pdf()`.
`outfile_layout`	An optional `connection` or file path for saving the graph layout. Passed to the `file` argument of `write()`, which is called with `ncolumns = 2`.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.

Value

Returns TRUE, invisibly.

Author(s)

Brian Neal ([email protected])

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
toy_data <- simulateToyData()

net <-
  generateNetworkObjects(
    toy_data,
    "CloneSeq"
  )

net <-
  addPlots(
    net,
    color_nodes_by =
      c("SampleID", "CloneCount"),
    print_plots = TRUE
  )

saveNetworkPlots(
  net$plots,
  outfile =
    file.path(tempdir(), "network.pdf"),
  outfile_layout =
    file.path(tempdir(), "graph_layout.txt")
)

# Load saved graph layout
graph_layout <- matrix(
  scan(file.path(tempdir(), "graph_layout.txt"), quiet = TRUE),
  ncol = 2
)
all.equal(graph_layout, net$plots$graph_layout)
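
The reload step in the example above works because write() and matrix() both use column-major order. The following base R sketch shows the same round trip on a small hand-made matrix; layout_in is a stand-in for a real graph layout and is an assumption of this sketch.

```r
# A small stand-in for a graph layout: one row per node, columns are x and y
layout_in <- matrix(c(0.5, 1.25, -2, 3, 0, 4.75), ncol = 2)
layout_file <- tempfile(fileext = ".txt")

# write() emits the values in column-major order, two per line
write(layout_in, file = layout_file, ncolumns = 2)

# scan() reads the values back in the same order, and matrix() refills
# the result column by column, which restores the original shape
layout_out <- matrix(scan(layout_file, quiet = TRUE), ncol = 2)

same_layout <- isTRUE(all.equal(layout_in, layout_out))
```

Note that a line of the saved file is not an (x, y) pair for one node: it holds two consecutive values from the column-major sequence. Real coordinates are also written with a limited number of significant digits, so a comparison against the in-memory layout may need a looser tolerance than the default.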



Generate Toy AIRR-Seq Data

Description

Generates toy data that can be used to test or demonstrate the behavior of functions in the NAIR package. Created as a lightweight tool for use in tests, examples and vignettes. This function is not intended to simulate realistic data.

Usage

simulateToyData(
  samples = 2,
  chains = 1,
  sample_size = 100,
  prefix_length = 7,
  prefix_chars = c("G", "A", "T", "C"),
  prefix_probs = rbind(
    "sample1" = c(12, 4, 1, 1),
    "sample2" = c(4, 12, 1, 1)),
  affixes = c("AATTGG", "AATCGG", "AATTCG",
              "AATTGC", "AATTG", "AATTC"),
  affix_probs = rbind(
    "sample1" = c(10, 4, 2, 2, 1, 1),
    "sample2" = c(1, 1, 1, 2, 2.5, 2.5)),
  num_edits = 0,
  edit_pos_probs = function(seq_length) {
    stats::dnorm(seq(-4, 4, length.out = seq_length))
  },
  edit_ops = c("insertion", "deletion", "transmutation"),
  edit_probs = c(5, 1, 4),
  new_chars = prefix_chars,
  new_probs = prefix_probs,
  output_dir = NULL,
  no_return = FALSE
)

Arguments

`samples`	The number of distinct samples to include in the data.
`chains`	The number of chains (either 1 or 2) for which to generate receptor sequences.
`sample_size`	The number of observations to generate per sample.
`prefix_length`	The length of the random prefix generated for each observed sequence. Specifically, the number of elements of `prefix_chars` that are sampled with replacement and concatenated to form each prefix.
`prefix_chars`	A character vector containing characters or strings from which to sample when generating the prefix for each observed sequence.
`prefix_probs`	A numeric matrix whose column dimension matches the length of `prefix_chars` and with row dimension matching the value of `samples`. The $i$ th row specifies the relative probability weights assigned to each element of `prefix_chars` when sampling to form the prefix for each sequence in the $i$ th sample.
`affixes`	A character vector containing characters or strings from which to sample when generating the suffix for each observed sequence.
`affix_probs`	A numeric matrix whose column dimension matches the length of `affixes` and with row dimension matching the value of `samples`. The $i$ th row specifies the relative probability weights assigned to each element of `affixes` when sampling to form the suffix for each sequence in the $i$ th sample.
`num_edits`	A nonnegative integer specifying the number of random edit operations to perform on each observed sequence after its initial generation.
`edit_pos_probs`	A function that accepts a nonnegative integer (the character length of a sequence) as its argument and returns a vector of this length containing probability weights. Each time an edit operation is performed on a sequence, the character position at which to perform the operation is randomly determined according to the probabilities given by this function.
`edit_ops`	A character vector specifying the possible operations that can be performed for each edit. The default value includes all valid operations (insertion, deletion, transmutation).
`edit_probs`	A numeric vector of the same length as `edit_ops`, specifying the relative probability weights assigned to each edit operation.
`new_chars`	A character vector containing characters or strings from which to sample when performing an insertion edit operation.
`new_probs`	A numeric matrix whose column dimension matches the length of `new_chars` and with row dimension matching the value of `samples`. The $i$ th row specifies, for the $i$ th sample, the relative probability weights assigned to each element of `new_chars` when performing a transmutation or insertion as a random edit operation.
`output_dir`	An optional character string specifying a file directory to save the generated data. One file will be generated per sample.
`no_return`	A logical flag that can be used to prevent the function from returning the generated data. If `TRUE`, the function will instead return `TRUE` once all processes are complete.
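
To make the role of edit_pos_probs concrete, the sketch below evaluates the default function from the Usage section for a sequence of length 13 and draws one edit position from the resulting weights. The call to sample() is only an illustration of weighted sampling and is not taken from the package's internal code.

```r
# Default position-weight function, copied from the Usage section
edit_pos_probs <- function(seq_length) {
  stats::dnorm(seq(-4, 4, length.out = seq_length))
}

seq_length <- 13
weights <- edit_pos_probs(seq_length)

# The weights are relative; dividing by their sum gives probabilities
probs <- weights / sum(weights)

# The normal density peaks at the middle position of the sequence
most_likely_pos <- which.max(weights)

# Illustrative draw of one edit position (assumed mechanism, not NAIR code)
set.seed(42)
edit_pos <- sample(seq_len(seq_length), size = 1, prob = weights)
```

With this default, edits concentrate near the center of each sequence and rarely touch the first or last characters. A function returning rep(1, seq_length) would instead make every position equally likely.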

Details

Each observed sequence is obtained by separately generating a prefix and suffix according to the specified settings, then joining the two and performing sequential rounds of edit operations randomized according to the user's specifications.
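
The prefix-plus-suffix construction can be sketched in base R as follows. This is a simplified illustration using the "sample1" rows of the default weight matrices, with no edit operations (as with num_edits = 0); make_toy_seq() is a hypothetical helper and does not reproduce the package's implementation or its random number stream.

```r
set.seed(42)

prefix_chars  <- c("G", "A", "T", "C")
prefix_probs  <- c(12, 4, 1, 1)        # "sample1" row of the default matrix
affixes       <- c("AATTGG", "AATCGG", "AATTCG", "AATTGC", "AATTG", "AATTC")
affix_probs   <- c(10, 4, 2, 2, 1, 1)  # "sample1" row of the default matrix
prefix_length <- 7

# Hypothetical helper: builds one sequence as random prefix + random suffix
make_toy_seq <- function() {
  prefix <- paste(
    sample(prefix_chars, size = prefix_length, replace = TRUE,
           prob = prefix_probs),
    collapse = ""
  )
  suffix <- sample(affixes, size = 1, prob = affix_probs)
  paste0(prefix, suffix)
}

toy_seqs <- replicate(5, make_toy_seq())
```

Because the weights are relative, c(12, 4, 1, 1) makes "G" three times as likely as "A" at each prefix position, without the weights needing to sum to 1.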

Count data is generated for each observation; note that this count data is generated independently from the observed sequences and has no relationship to them.

Value

If no_return = FALSE (the default), the function returns a data.frame whose contents depend on the value of the chains argument.

For chains = 1, the data frame contains the following variables:

`CloneSeq`	The "receptor sequence" for each observation.
`CloneFrequency`	The "clone frequency" for each observation (clone count as a proportion of the aggregate clone count within each sample).
`CloneCount`	The "clone count" for each observation.
`SampleID`	The sample ID for each observation.
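
The relationship between CloneCount and CloneFrequency described above can be reproduced on a small hand-built data frame. The values and sample labels below are invented for illustration and are not output of simulateToyData().

```r
# Hand-built toy data (assumed values, not generated by the package)
toy <- data.frame(
  CloneSeq   = c("AAT", "AAC", "AAG", "TTA", "TTC"),
  CloneCount = c(10, 30, 60, 5, 15),
  SampleID   = c("Sample1", "Sample1", "Sample1", "Sample2", "Sample2")
)

# Clone count as a proportion of the total clone count within each sample
toy$CloneFrequency <- ave(
  toy$CloneCount, toy$SampleID,
  FUN = function(x) x / sum(x)
)

# Frequencies sum to 1 within each sample
freq_totals <- tapply(toy$CloneFrequency, toy$SampleID, sum)
```

The same check, applied to the CloneFrequency and SampleID columns of real output, is a quick way to confirm that frequencies were normalized per sample rather than across the whole data set.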

For chains = 2, the data frame contains the following variables:

`AlphaSeq`	The "alpha chain" receptor sequence for each observation.
`BetaSeq`	The "beta chain" receptor sequence for each observation.
`UMIs`	The "unique molecular identifier count" for each observation.
`Count`	The "count" for each observation.
`SampleID`	The sample ID for each observation.

If no_return = TRUE, the function returns TRUE upon completion.

Author(s)

Brian Neal ([email protected])

Examples

set.seed(42)

# Bulk data from two samples
dat1 <- simulateToyData()

# Single-cell data with alpha and beta chain sequences
dat2 <- simulateToyData(chains = 2)

# Write data to file (one file per sample); return TRUE instead of the data
simulateToyData(sample_size = 500,
                num_edits = 10,
                no_return = TRUE,
                output_dir = tempdir())

# Example customization
dat4 <-
  simulateToyData(
    samples = 5,
    sample_size = 50,
    prefix_length = 0,
    prefix_chars = "",
    prefix_probs = matrix(1, nrow = 5),
    affixes = c("CASSLGYEQYF", "CASSLGETQYF",
                "CASSLGTDTQYF", "CASSLGTEAFF",
                "CASSLGGTEAFF", "CAGLGGRDQETQYF",
                "CASSQETQYF", "CASSLTDTQYF",
                "CANYGYTF", "CANTGELFF",
                "CSANYGYTF"),
    affix_probs = matrix(1, ncol = 11, nrow = 5)
  )

## Simulate 30 samples with a mix of public/private sequences ##
samples <- 30
sample_size <- 30 # (seqs per sample)
base_seqs <- c(
  "CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF", "CASSSVETQYF",
  "CASSPEGQLSTDTQYF", "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF",
  "CASSLTSGYNEQFF", "CASSETGYNEQFF", "CASSLTGGNEQFF", "CASSYLTGYNEQFF",
  "CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSFPWDGYGYTF", "CASTLARQGGELFF",
  "CASTLSRQGGELFF", "CSVELLPTGPLETSYNEQFF", "CSVELLPTGPSETSYNEQFF",
  "CVELLPTGPSETSYNEQFF", "CASLAGGRTQETQYF", "CASRLAGGRTQETQYF",
  "CASSLAGGRTETQYF", "CASSLAGGRTQETQYF", "CASSRLAGGRTQETQYF",
  "CASQYGGGNQPQHF", "CASSLGGGNQPQHF", "CASSNGGGNQPQHF", "CASSYGGGGNQPQHF",
  "CASSYGGGQPQHF", "CASSYKGGNQPQHF", "CASSYTGGGNQPQHF",
  "CAWSSQETQYF", "CASSSPETQYF", "CASSGAYEQYF", "CSVDLGKGNNEQFF")
# Relative generation probabilities
pgen <- cbind(
  stats::toeplitz(0.6^(0:(sample_size - 1))),
  matrix(1, nrow = samples, ncol = length(base_seqs) - samples))
dat5 <-
  simulateToyData(
    samples = samples,
    sample_size = sample_size,
    prefix_length = 1,
    prefix_chars = c("", ""),
    prefix_probs = cbind(rep(1, samples), rep(0, samples)),
    affixes = base_seqs,
    affix_probs = pgen,
    num_edits = 0
  )

## Simulate 30 samples from two groups (treatment/control) ##
samples_c <- samples_t <- 15 # Number of samples by control/treatment group
samples <- samples_c + samples_t
sample_size <- 30 # (seqs per sample)
base_seqs <- # first five are associated with treatment
  c("CASSGAYEQYF", "CSVDLGKGNNEQFF", "CASSIEGQLSTDTQYF",
    "CASSEEGQLSTDTQYF", "CASSPEGQLSTDTQYF",
    "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF")
# Relative generation probabilities by control/treatment group
pgen_c <- matrix(rep(c(rep(1, 5), rep(30, 3)), times = samples_c),
                 nrow = samples_c, byrow = TRUE)
pgen_t <- matrix(rep(c(1, 1, rep(1/3, 3), rep(2, 3)), times = samples_t),
                 nrow = samples_t, byrow = TRUE)
pgen <- rbind(pgen_c, pgen_t)
dat6 <-
  simulateToyData(
    samples = samples,
    sample_size = sample_size,
    prefix_length = 1,
    prefix_chars = c("", ""),
    prefix_probs =
      cbind(rep(1, samples), rep(0, samples)),
    affixes = base_seqs,
    affix_probs = pgen,
    num_edits = 0
  )


