Introduction to SifiNet

SifiNet (Single-cell feature identification with Network topology) is an R package of a clustering-independent method for directly identifying feature gene sets. Feature gene set are sets of gene exhibiting unique phenotypes in different cell subpopulations. SifiNet is designed for single cell RNA-seq data, finding sets of feature genes that are differentially expressed in some cell subpopulation. However, it can also find epigenomic feature gene sets when applied to single cell ATAC-seq data. This vignette gives an introduction to SifiNet package.

Installation

The development version of SifiNet can be installed from Github:

devtools::install_github("jichunxie/sifinet")

SifiNet method

SifiNet method consists of two major sections, A and B. After loading the data, in section A we construct a gene co-expression network with genes as nodes and the co-expressions between genes as edges. And in section B, we identify the feature gene sets based on the gene co-expression network.

Create SifiNet object

Suppose we already have a matrix of count data that we wish to find feature gene sets from. Here is a toy example generated with the complexSCsimulation package:

# Load package
library(SiFINeT)

# Load data

data <- readRDS("toy_matrix.rds")
sifi_obj <- create_SiFINeT_object(
  counts = data,
  gene.name = NULL,
  meta.data = NULL,
  data.name = NULL,
  sparse = FALSE,
  rowfeature = TRUE
)

The count matrix should be a gene (row) by cell (column) matrix by default. Otherwise, for a cell (row) by gene (column) matrix, the rowfeature argument need to be set to FALSE. Name of the genes could be specified by the gene.name argument. Meta data of the cells could be input using the meta.data argument. If the input count matrix is sparse (majority of the matrix is 0), we can set the sparse argument to be TRUE, which would potentially reduce the space and time cost of the method. We can also assign a name to the input data set by data.name argument. Multiple count matrix could be saved in a SifiNet object. The name of data set could help to distinguish those inputs.

Section A: Construct the gene co-expression network

# estimate the quantiles
sifi_obj <- quantile_thres(sifi_obj)
sifi_obj <- feature_coexp(sifi_obj)

In SifiNet, we use quantile association to measure the co-expression between genes. So, to construct the gene co-expression network, we first divide the read counts in the matrix into “low” and “high” groups using quantile regression in quantile_thres function. If [email protected] has at least one column, then the columns of [email protected] are used as independent variables. Otherwise, the mean total number of reads in each cell is used as independent variable for the quantile regression. The results are saved in [email protected]. Based on the results of the “low” and “high” group classification, we calculate the co-expression score using feature_coexp function. The results are saved in sifi_obj@coexp.

# build the co-expression network
sifi_obj <- create_network(sifi_obj,
  alpha = 0.05, manual = FALSE,
  least_edge_prop = 0.01
)

Then we select a threshold for co-expression score. Gene pairs with absolute value of co-expression score greater than the threshold would be considered to have an edge between them in the network. This procedure is done by function create_network. By default, an FDR control procedure is applied to select the threshold. Argument alpha is the type I error of the FDR control procedure. After finding the threshold using the FDR control procedure, SifiNet also allows an additional modification of the threshold. In case that the threshold is overly strict, we can set a lower bound for the proportion of edges (total number of edges divided by the total number of different gene pairs) with argument least_edge_prop. Note the least_edge_prop is only used when argument manual is TRUE. The results are saved in sifi_obj@est_ms and sifi_obj@thres.

# filter gene to improve the quality of the co-expression network
sifi_obj <- filter_lowexp(sifi_obj, t1 = 10, t2 = 0.9, t3 = 0.9)

After building the co-expression network, we further perform additional gene filtering steps to improve the quality of the co- expression network. Greater values of arguments t1, t2, and t3 would lead to a less strict filtering, keeping more genes in the network. The results are saved in sifi_obj@kset.

Section B: Calculate gene connectivity and find feature gene sets

# calculate connectivity for each gene
sifi_obj <- cal_connectivity(sifi_obj, m = 10, niter = 100)

In section B, based on the topology structure of gene co-expression network, we identify the feature gene sets. Using function cal_connectivity, we calculate the 1st, 2nd, and 3rd order connectivities for each gene. The connectivities are measures of the topology structure of gene co-expression network. Greater values of argument m and niter would improve the accuracy of 3rd order connectivity, but would also increase the computation time. The results are saved in sifi_obj@conn.

# detect core feature gene sets
sifi_obj <- find_unique_feature(sifi_obj,
  t1 = 5,
  t2 = 0.4,
  t3 = 0.3,
  t1p = 5,
  t2p = 0.7,
  t3p = 0.5,
  resolution = 1,
  min_set_size = 5
)

After calculating the connectivities, we can detect candidate feature genes. And then we identify the core feature genes in each of the feature gene sets among the candidate feature genes. Core feature genes are genes that show unique phenotypes in only one cell subpopulation (not shared with other cell subpopulation). These steps are done by function find_unique_feature. Arguments t1, t2, t3, t1p, t2p, t3p are thresholds for 1st, 2nd, and 3rd order connectivities. Greater input values of those arguments would lead to less number of feature gene sets and feature genes. resolution argument is used for cluster_louvain function in igraph package. We require a feature gene set to have at least min_set_size core feature genes to be detected. The results are saved in sifi_obj@conn2, sifi_obj@fg_id, sifi_obj@uni_fg_id, sifi_obj@uni_cluster, sifi_obj@selected_cluster and sifi_obj@featureset.

# assign shared feature genes to core feature gene sets
sifi_obj <- assign_shared_feature(sifi_obj, min_edge_prop = 0.5)

Then we assign other candidate feature genes (that are not identified as core feature gene) into core feature gene sets. Those genes are allowed to be shared by more than 1 core feature gene sets. The assignment of shared feature gene is based on the proportion of edge between a core feature gene set and the feature gene (number of edges between the feature gene and genes in the core feature gene set divided by number of genes in the core feature gene set). An assignment with greater value of argument min_edge_prop would have less shared feature gene added. The results are saved in sifi_obj@featureset.

# enrich feature gene sets with other genes
sifi_obj <- enrich_feature_set(sifi_obj, min_edge_prop = 0.9)

In the last step, we enrich the feature gene sets (with both core and shared feature gene) with other genes that are not identified previously as feature gene. Those genes are also allowed to be shared by more than 1 core feature gene sets. Similarly, the assignment of enriched feature gene is based on the proportion of edge between a feature gene set and the gene (number of edges between the gene and genes in the feature gene set divided by number of genes in the feature gene set). An assignment with greater value of argument min_edge_prop would have less shared feature gene added. The results are saved in sifi_obj@featureset.

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] SiFINeT_1.13   rmarkdown_2.29
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.6.5      sass_0.4.9         generics_0.1.3     tidyr_1.3.1       
##  [5] lattice_0.22-6     digest_0.6.37      magrittr_2.0.3     evaluate_1.0.3    
##  [9] grid_4.4.2         fastmap_1.2.0      jsonlite_1.8.9     Matrix_1.7-1      
## [13] ggrepel_0.9.6      survival_3.8-3     gridExtra_2.3      purrr_1.0.2       
## [17] viridisLite_0.4.2  scales_1.3.0       tweenr_2.0.3       jquerylib_0.1.4   
## [21] cli_3.6.3          rlang_1.1.4        graphlayouts_1.2.1 polyclip_1.10-7   
## [25] splines_4.4.2      tidygraph_1.3.1    munsell_0.5.1      withr_3.0.2       
## [29] cachem_1.1.0       yaml_2.3.10        tools_4.4.2        MatrixModels_0.5-3
## [33] SparseM_1.84-2     memoise_2.0.1      dplyr_1.1.4        colorspace_2.1-1  
## [37] ggplot2_3.5.1      buildtools_1.0.0   vctrs_0.6.5        R6_2.5.1          
## [41] lifecycle_1.0.4    MASS_7.3-64        ggraph_2.2.1       pkgconfig_2.0.3   
## [45] pillar_1.10.1      bslib_0.8.0        gtable_0.3.6       glue_1.8.0        
## [49] Rcpp_1.0.14        ggforce_0.4.2      xfun_0.50          tibble_3.2.1      
## [53] tidyselect_1.2.1   sys_3.4.3          knitr_1.49         farver_2.1.2      
## [57] htmltools_0.5.8.1  igraph_2.1.3       maketools_1.3.1    compiler_4.4.2    
## [61] quantreg_5.99.1