Title: | Single Cell Feature Identification with Network Topology |
---|---|
Description: | Cluster-independent method based on topology structure of gene co-expression network for identifying feature gene sets, extracting cellular subpopulations, and elucidating intrinsic relationships among these subpopulations. Without prior cell clustering, SifiNet circumvents potential inaccuracies in clustering that may influence subsequent analyses. This method is introduced in Qi Gao, Zhicheng Ji, Liuyang Wang, Kouros Owzar, Qi-Jing Li, Cliburn Chan, Jichun Xie "SifiNet: a robust and accurate method to identify feature gene sets and annotate cells" (2024) <doi:10.1093/nar/gkae307>. |
Authors: | Qi Gao [aut, cre] |
Maintainer: | Qi Gao <[email protected]> |
License: | GPL-3 |
Version: | 1.13 |
Built: | 2025-01-16 21:43:08 UTC |
Source: | CRAN |
cal_coexp This function calculates the coexpression patterns between genes and returns the coexpression matrix.
cal_coexp(X)
cal_coexp(X)
X |
Input binarized cell (row) by gene (column) matrix |
Coexpression matrix
Qi Gao
cal_coexp_sp This function calculates the coexpression patterns between genes in sparse matrix and returns the coexpression matrix.
cal_coexp_sp(X)
cal_coexp_sp(X)
X |
Input binarized cell (row) by gene (column) sparse matrix |
Coexpression matrix
Qi Gao
cal_conn This function calculates the first 3 order connectivities for each gene and returns the list of vectors of connectivities.
cal_conn(data, thres = 3, m = 10L, abso = 1L, niter = 100L)
cal_conn(data, thres = 3, m = 10L, abso = 1L, niter = 100L)
data |
Input gene by gene coexpression matrix |
thres |
Gene pairs with coexpression exceed thres would be assigned an edge between them in the coexpression network |
m |
Sample size used for the calculation of 3rd order connectivities |
abso |
Whether to calculate connectivities in absolute network (TRUE) or positive network (FALSE) |
niter |
Number of sample used for the calculation of 3rd order connectivities |
List of connectivities C1, C2, and C3 ' @export
Qi Gao
The function calculates the 1st, 2nd and 3rd order connectivities for all genes
cal_connectivity(so, m = 10, niter = 100)
cal_connectivity(so, m = 10, niter = 100)
so |
a SiFINeT object |
m |
number of neighbors sampled each time for the calculation of 3rd order connectivity |
niter |
number of samples created for the calculation of 3rd order connectivity |
For gene i, First order connectivity is defined as the number of edges connected to gene i (degree of the gene node i in the network). Second order connectivity is defined as the proportion of edges between the neighbors of gene i, calculated as number of observed edges between the neighbors of gene i divided by the number of possible edges between the neighbors. Third order connectivity is defined as a weighted proportion of edges between neighbors and neighbors of neighbors of gene i. Third order connectivity is calculated as the mean of edge proportions across weighted samples. Each gene is weighted by the number of edges it has with the neighbors of gene i. Then SiFINeT repeatedly samples m genes for niter times. For each sample, the edge proportion (number of observed edges / number of possible edges) is calculated. And the mean edge proportion across the sample is the 3rd order connectivity for gene i.
SiFINeT object with conn (absolute network connectivities) updated.
The function estimates the null distribution of coexpression patterns and generates coexpression network
create_network(so, alpha = 0.05, manual = FALSE, least_edge_prop = 0.01)
create_network(so, alpha = 0.05, manual = FALSE, least_edge_prop = 0.01)
so |
a SiFINeT object |
alpha |
the Type I error rate used for FDR control procedure |
manual |
whether to manually set threshold for edge assignment |
least_edge_prop |
the minimum proportion of edges. Only used when manual = TRUE |
Theoretically the distribution of coexpression patterns would converge to standard Gaussian if either one of the gene pair is not feature gene. However in genomics analysis, empirical null could be much more variable than theoretical null. SiFINeT uses estimated null mean and standard deviation to find the threshold for network edges. An edge is assigned to a pair of gene if the absolute value of coexpression pattern between the 2 genes is greater than the threshold Assuming the distribution to be Gaussian, with the estimated null mean and standard deviation, SiFINeT uses SQUAC to control the false discovery rate (FDR) for coexpression patterns. In case the signal is not strong enough and the coexpression network is too sparse, SiFINeT also accept user-defined lower bound for the least proportion of edges. Usually a coexpression network with edge proportion between 0.5% - 10% would have better performance for the detection of feature gene sets.
SiFINeT object with est_ms (estimated mean and sd) and thres (network edge threshold) updated.
Jiashun Jin and Tony T. Cai. “Estimating the Null and the Proportion of Non-Null Effects in Large-Scale Multiple Comparisons”. In: Journal of the American Statistical Association 102 (478 2004), pp. 495–506. doi: 10.1198/016214507000000167.
Jichun Xie and Ruosha Li. “False discovery rate control for high dimensional networks of quantile associations conditioning on covariates”. In: J R Stat Soc Series B Stat Methodol (2018). doi: 10.1111/rssb.12288.
The function classifies count data based on thresholds defined by quantile regression
create_SiFINeT_object( counts, gene.name = NULL, meta.data = NULL, data.name = NULL, sparse = FALSE, rowfeature = TRUE )
create_SiFINeT_object( counts, gene.name = NULL, meta.data = NULL, data.name = NULL, sparse = FALSE, rowfeature = TRUE )
counts |
count matrix |
gene.name |
name of the features |
meta.data |
data.frame of meta data |
data.name |
name of dataset |
sparse |
whether the count matrix should be analyzed as sparse matrix |
rowfeature |
whether the count matrix is feature (row) by cell (column) |
a SiFINeT object
The function chooses genes that are not found to be feature genes as enriched feature genes and assigns them into unique+shared feature gene sets
enrich_feature_set(so, min_edge_prop = 0.9)
enrich_feature_set(so, min_edge_prop = 0.9)
so |
a SiFINeT object |
min_edge_prop |
minimum proportion of edges between a gene and a unique+shared feature gene set for the new feature to be assigned to the set |
Genes that are not selected as feature genes would be added in the enriched section of the feature gene set if they are connected with more than min_edge_prop of the unique and shared feature genes in each of the feature gene group.
SiFINeT object with enriched feature genes in geneset updated.
EstNull This function is a Rcpp version of Wenguang Sun and Tony T. Cai's EstNull.func R function, estimating null distribution from data. Sun, W., & Cai, T. T. (2007). Oracle and Adaptive Compound Decision Rules for False Discovery Rate Control. Journal of the American Statistical Association, 102(479), 901–912.
EstNull(x, gamma = 0.1)
EstNull(x, gamma = 0.1)
x |
Input vector of all coexpression values |
gamma |
Parameter setting the stopping threshold |
List of mean and std
Qi Gao
The function extract a subnetwork from the co-expression network
extract_subnetwork( so, target_gene_name = NULL, target_gene_id = NULL, positive = TRUE )
extract_subnetwork( so, target_gene_name = NULL, target_gene_id = NULL, positive = TRUE )
so |
a SiFINeT object |
target_gene_name |
the names of the target genes in the output network |
target_gene_id |
the indices of the target genes in the output network, not used when target_gene_name is not Null |
positive |
whether only positive (default) co-expressions or all co-expressions are considered in assigning edges |
an adjacency matrix of the output subnetwork
The function calculates coexpression patterns between genes
feature_coexp(so)
feature_coexp(so)
so |
a SiFINeT object |
The coexpression pattern of a pair of genes is a normalized co-occurrence of high (or equivalently low) expression level of the 2 genes in the classified count matrix. The normalization is based on the estimated quantiles of the low-high separation instead of the quantiles used for quantile regressions. Theoretically, the distribution of coexpression patterns should asymptotically follow standard Gaussian distribution if at least one of the 2 genes is not differentially expressed feature gene.
SiFINeT object with coexp (gene coexpression matrix) updated.
The function filters out genes with low expression rate and high positive coexpression with genes of same expression level
filter_lowexp(so, t1 = 10, t2 = 0.9, t3 = 0.9)
filter_lowexp(so, t1 = 10, t2 = 0.9, t3 = 0.9)
so |
a SiFINeT object |
t1 |
threshold for number of total edges connecting the feature node. Lower t1 leads to stricter filtering. |
t2 |
threshold for the proportion of positive edges. Lower t2 leads to stricter filtering. |
t3 |
threshold for the proportion of edges with features of same expression level. Lower t3 leads to stricter filtering. |
When using only mean expression level as independent variable in quantile regression,
it is observed that genes with low expression level tend to have large positive with genes that have same median expression level.
To reduce the coexpression noise caused by low expression level, it is preferred to filter out genes which have large amount and high proportion of positive coexpressions with genes sharing same median expression level.
SiFINeT object with kset (kept index set) updated.
The function finds the clustered unique feature genes
find_unique_feature( so, t1 = 5, t2 = 0.4, t3 = 0.3, t1p = 5, t2p = 0.7, t3p = 0.5, resolution = 1, min_set_size = 5 )
find_unique_feature( so, t1 = 5, t2 = 0.4, t3 = 0.3, t1p = 5, t2p = 0.7, t3p = 0.5, resolution = 1, min_set_size = 5 )
so |
a SiFINeT object |
t1 |
feature gene selection parameter, lower threshold for 1st order connectivity in absolute network |
t2 |
feature gene selection parameter, lower threshold for 2nd order connectivity in absolute network |
t3 |
feature gene selection parameter, lower threshold for 3rd order connectivity in absolute network |
t1p |
unique feature gene selection parameter, lower threshold for 1st order connectivity in positive sub-network |
t2p |
unique feature gene selection parameter, lower threshold for 2nd order connectivity in positive sub-network |
t3p |
unique feature gene selection parameter, lower threshold for 3rd order connectivity in positive sub-network |
resolution |
resolution for louvain clustering of unique feature genes |
min_set_size |
minimum size for a unique feature gene cluster to be a separate unique feature gene set |
SiFINeT first find genes with high 1st (>= t1), 2nd (>= t2) and 3rd (>= t3) order connectivities in absolute network (conn) to be candidate feature genes. Then a positive sub-network is created where only candidate feature gene nodes (fg_id) and edges with positive coexpression patterns (coexp >= thres) are included. Feature genes genes with high 1st (>= t1p), 2nd (>= t2p) and at least moderate 3rd (>= t3p) order connectivities in positive sub-network (conn2) are chosen to be candidate unique feature genes. Note that when the network is not too sparse, t3p should usually be smaller than t2p for the detection of unique feature genes in transition cell types. The candidate unique feature genes are then separated into groups by louvain clustering (with resolution defined by the resolution parameter), and among them large groups (number of genes greater than min_set_size) are chosen to be unique feature gene sets that represent different cell types.
SiFINeT object with fg_id (candidate feature gene index), uni_fg_id (candidate unique feature gene index), conn2 (connectivities in positive sub-network), uni_cluster (cluster of candidate unique feature genes), selected_cluster (selected unique feature gene clusters), and unique feature genes in featureset updated.
The function plots the topology network of the feature gene sets found by SiFINeT.
geneset_topology( so, weightthres = 0.3, edge_method = 2, node_color = "black", shiftsize = 0.05, boundsize = 0.3, prefix = "", set_name = NULL )
geneset_topology( so, weightthres = 0.3, edge_method = 2, node_color = "black", shiftsize = 0.05, boundsize = 0.3, prefix = "", set_name = NULL )
so |
a SiFINeT object |
weightthres |
edges between nodes (feature gene sets) with weight greater than weightthres would be shown in the plot |
edge_method |
SiFINeT provides 2 methods of calculating edge weight. The number of shared feature genes between feature gene sets would be used when edge_method = 1; while the edge proportion between feature gene sets would be applied if edge_method = 2. |
node_color |
color of nodes. Should have either length 1 or same length as the number of feature gene sets. |
shiftsize |
set the distance between center of label and the corresponding feature gene sets node. |
boundsize |
set the size of the boundary region. |
prefix |
the prefix of the labels |
set_name |
name of the gene sets |
This function visualizes the output feature gene sets of SiFINeT in the form of network. Number of shared feature genes or proportion of edges between feature gene sets could be used to weight the edges. The layout of the nodes is created by create_layout function in ggraph package.
A ggraph (ggplot) object
Thomas Lin Pedersen (2022). ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. R package version 2.0.6. https://CRAN.R-project.org/package=ggraph
norm_FDR_SQAUC The function controls the false discovery rate (FDR) of coexpression patterns using SQAUC method Jichun Xie and Ruosha Li. "False discovery rate control for high dimensional networks of quantile associations conditioning on covariates". In: J R Stat Soc Series B Stat Methodol (2018). doi: 10.1111/rssb.12288.
norm_FDR_SQAUC(value, sam_mean, sam_sd, alpha, n, p)
norm_FDR_SQAUC(value, sam_mean, sam_sd, alpha, n, p)
value |
the vector of coexpression patterns |
sam_mean |
the estimated sample mean |
sam_sd |
the estimated sample sd |
alpha |
the type I error rate |
n |
the number of cells |
p |
the number of genes |
lower bound threshold for genes to be significantly coexpressed
The function classifies count data into binary low (0) - high (1) data, based on whether the count number is greater than a threshold.
quantile_thres(so)
quantile_thres(so)
so |
a SiFINeT object |
The threshold used for classification is defined by quantile regression on each gene using Frisch–Newton interior point method ("fn" option for method variable in quantreg package, rq function). By default if no meta data is provided, the quantile regression would be applied on the mean expression level of each cell. The quantile to be estimated in the quantile regression is set to be the estimated 50% quantile of the non-zero part of the expression level for each gene. If the expression level of a gene is low (with median 0), then the threshold is set to be 0.
SiFINeT object with data.thres (categorized count matrix) updated.
Koenker, R. and S. Portnoy (1997) The Gaussian Hare and the Laplacean Tortoise: Computability of Squared-error vs Absolute Error Estimators, (with discussion). Statistical Science, 12, 279-300.
Roger Koenker (2022). quantreg: Quantile Regression. R package version 5.94. https://CRAN.R-project.org/package=quantreg
The SiFINeT Class
data
a list of cell (row) by gene (column) count matrix, either regular or sparse matrix
sparse
whether the count matrix should be analyzed as sparse matrix
meta.data
matrix of meta data, the number of rows should equal to the number of cells
gene.name
a vector of names of genes with length equal to the number of genes
data.name
name of the dataset
n
number of cells in the dataset
p
number of genes in the dataset
data.thres
binarized count matrix
coexp
matrix of genes coexpression
est_ms
estimated mean and sd of coexpression values
thres
lower bound of coexpression (or absolute value of coexpression) for network edge assignment
q5
50% quantile for each gene
kset
index of kept genes after the filtering step
conn
list of connectivities in absolute network
conn2
list of connectivities in positive sub-network
fg_id
index of the candidate feature genes
uni_fg_id
index of the candidate unique feature genes
uni_cluster
cluster result of the candidate unique feature genes
selected_cluster
selected unique feature gene clusters
featureset
detected set of feature genes