atlasapprox (R interface)

Cell atlases such as Tabula Muris and Tabula Sapiens are multi-organ single-cell omics data sets describing entire organisms. A cell atlas approximation is a lossy, lightweight compression of a cell atlas that can be streamed over the internet.

This project enables biologists, doctors, and data scientists to quickly find answers to questions such as:

  • What is the expression of a specific gene in human lung?
  • What are the marker genes of a specific cell type in mouse pancreas?
  • What fraction of cells (of a specific type) express a gene of interest?

NOTE: These questions can also be asked in R, Python, JavaScript, or in a language-agnostic manner using the REST API (see https://atlasapprox.readthedocs.io).


Installation

To install the R interface of atlasapprox from CRAN, use:

install.packages("atlasapprox")

Usage

To use the package, you must first load it:

library("atlasapprox")

Now you have all atlasapprox functions available.

Available organisms or species

The easiest way to explore atlas approximations is to query a list of available organisms:

organisms <- GetOrganisms()
print(organisms)
##  [1] "a_queenslandica" "a_thaliana"      "c_elegans"       "c_gigas"        
##  [5] "c_hemisphaerica" "d_melanogaster"  "d_rerio"         "f_vesca"        
##  [9] "h_miamia"        "h_sapiens"       "h_vulgaris"      "i_pulchra"      
## [13] "l_minuta"        "m_leidyi"        "m_murinus"       "m_musculus"     
## [17] "n_vectensis"     "o_sativa"        "p_crozieri"      "p_dumerilii"    
## [21] "s_lacustris"     "s_mansoni"       "s_mediterranea"  "s_pistillata"   
## [25] "s_purpuratus"    "t_adhaerens"     "t_aestivum"      "x_laevis"       
## [29] "z_mays"

Organs in a single organism

Once you know which species you are interested in, you can list the organs of that species for which an atlas approximation is available:

human_organs <- GetOrgans(organism = 'h_sapiens')
print(human_organs)
##  [1] "bladder"   "blood"     "colon"     "eye"       "fat"       "gut"      
##  [7] "heart"     "kidney"    "liver"     "lung"      "lymphnode" "mammary"  
## [13] "marrow"    "muscle"    "pancreas"  "prostate"  "salivary"  "skin"     
## [19] "spleen"    "thymus"    "tongue"    "trachea"   "uterus"

Cell types within an organ

The next level of zoom is to query the list of cell types that make up an organ of choice, e.g.:

cell_types <- GetCelltypes(organism = 'h_sapiens', organ = 'Lung')
print(cell_types)
##  [1] "neutrophil"             "basophil"               "monocyte"              
##  [4] "macrophage"             "dendritic"              "B"                     
##  [7] "plasma"                 "T"                      "NK"                    
## [10] "plasmacytoid"           "goblet"                 "AT1"                   
## [13] "AT2"                    "club"                   "ciliated"              
## [16] "basal"                  "serous"                 "mucous"                
## [19] "arterial"               "venous"                 "capillary"             
## [22] "CAP2"                   "lymphatic"              "fibroblast"            
## [25] "alveolar fibroblast"    "smooth muscle"          "vascular smooth muscle"
## [28] "pericyte"               "mesothelial"            "ionocyte"

NOTE: Although cell atlases aim to cover all cell types from a tissue, rare types might be missing because of limited sampling or inaccurate annotation. If you think a cell type is missing from a tissue, please contact fabio DOT zanini AT unsw DOT edu DOT au.
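
The three queries above nest naturally: an organism has organs, and each organ has cell types. As a minimal sketch, the helper below counts the cell types in every organ of one organism. `CountCellTypes` is a hypothetical name and not part of atlasapprox. It takes the two query functions as arguments so that it can be tried offline; the stand-ins used here return placeholder values for illustration only, and it is an assumption that `GetOrgans()` and `GetCelltypes()` return plain character vectors, as their printed output suggests.

```r
# Hypothetical helper (not part of atlasapprox): count cell types per organ.
# get_organs and get_celltypes are expected to behave like GetOrgans() and
# GetCelltypes(), i.e. to return character vectors (an assumption).
CountCellTypes <- function(organism, get_organs, get_celltypes) {
  organs <- get_organs(organism = organism)
  vapply(
    organs,
    function(organ) length(get_celltypes(organism = organism, organ = organ)),
    integer(1)
  )
}

# Offline stand-ins with placeholder values (illustration only, not atlas data).
fake_organs <- function(organism) c("blood", "lung")
fake_celltypes <- function(organism, organ) {
  if (organ == "lung") c("AT1", "AT2", "club") else c("neutrophil", "monocyte")
}

counts <- CountCellTypes("h_sapiens", fake_organs, fake_celltypes)
print(counts)
```

With the package loaded, the same helper could be called as `CountCellTypes("h_sapiens", GetOrgans, GetCelltypes)`. Note that this would send one request per organ.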


Gene expression

If you have some genes you are interested in, you can query their expression across cell types in the organ of choice:

expression <- GetAverage(organism = 'h_sapiens', organ = 'Lung', features = c('PTPRC', 'COL1A1'))
print(expression)
##                               PTPRC       COL1A1
## neutrophil             2.231271e+01  0.014522638
## basophil               2.443684e+00  0.005077871
## monocyte               7.794549e+00  0.003399504
## macrophage             2.801027e+00  0.002812853
## dendritic              4.313318e+00  0.013302779
## B                      3.000779e+00  0.000000000
## plasma                 4.200674e-01  0.009642163
## T                      1.051312e+01  0.009203196
## NK                     1.143152e+01  0.063305810
## plasmacytoid           2.168309e+00  0.000000000
## goblet                 1.898965e-01  0.145349205
## AT1                    9.707276e-02  0.109001435
## AT2                    1.457898e-01  0.058521412
## club                   3.052110e-01  0.071080528
## ciliated               2.264476e-01  0.060997065
## basal                  2.570614e-01  0.064534329
## serous                 3.813045e-01  0.000000000
## mucous                 0.000000e+00  0.116527453
## arterial               1.409595e-01  0.031918123
## venous                 3.115328e-01  0.007172978
## capillary              1.500604e-01  0.004225238
## CAP2                   1.768180e-01  0.022919910
## lymphatic              2.947334e-04  0.000000000
## fibroblast             5.332901e-02 10.089125633
## alveolar fibroblast    1.934833e-01  4.771382809
## smooth muscle          5.999142e-01  2.049613953
## vascular smooth muscle 4.121004e-01  2.203665972
## pericyte               6.380575e-01  0.038223870
## mesothelial            5.869431e-01  1.449272752
## ionocyte               5.413984e-01  0.000000000
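
The printed result has cell types as rows and genes as columns, so ordinary R table operations apply. As a minimal sketch, the code below finds the cell type with the highest average expression for each gene. To stay runnable offline it rebuilds a few rows of the output above by hand; treating the value returned by `GetAverage()` as a numeric table with cell types as row names is an assumption based on the printed output.

```r
# A few rows copied from the output above; in practice this object would be
# the value returned by GetAverage().
expression <- data.frame(
  PTPRC  = c(22.31271, 10.51312, 0.09707276, 0.05332901),
  COL1A1 = c(0.014522638, 0.009203196, 0.109001435, 10.089125633),
  row.names = c("neutrophil", "T", "AT1", "fibroblast")
)

# as.matrix() lets the code below work for both data frames and matrices.
expression_matrix <- as.matrix(expression)

# For each gene (column), pick the cell type (row) with the highest average.
top_index <- apply(expression_matrix, 2, which.max)
top_cell_type <- rownames(expression_matrix)[top_index]
names(top_cell_type) <- colnames(expression_matrix)
print(top_cell_type)
```

In this subset, PTPRC peaks in neutrophils and COL1A1 in fibroblasts, matching the full table above.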

In addition to the average expression level, you can request the fraction of cells within each type that express the gene:

fraction_expressing <- GetFractionDetected(organism = 'h_sapiens', organ = 'Lung', features = c('PTPRC', 'COL1A1'))
print(fraction_expressing)
##                             PTPRC      COL1A1
## neutrophil             0.92528737 0.011494253
## basophil               0.65014577 0.002915452
## monocyte               0.93330902 0.002186589
## macrophage             0.94777960 0.004276316
## dendritic              0.94303799 0.009493670
## B                      0.60919541 0.000000000
## plasma                 0.36567163 0.014925373
## T                      0.93114001 0.003825555
## NK                     0.95454544 0.007575758
## plasmacytoid           0.72222221 0.000000000
## goblet                 0.16710876 0.172413796
## AT1                    0.07109005 0.085308060
## AT2                    0.11589766 0.115569651
## club                   0.10886320 0.047206167
## ciliated               0.11627907 0.069767445
## basal                  0.12568556 0.049360145
## serous                 0.20000000 0.000000000
## mucous                 0.00000000 0.375000000
## arterial               0.05347594 0.010695187
## venous                 0.10236221 0.007874016
## capillary              0.05678023 0.002672011
## CAP2                   0.05975395 0.003514939
## lymphatic              0.02127660 0.000000000
## fibroblast             0.03116883 0.890909076
## alveolar fibroblast    0.07098766 0.595678985
## smooth muscle          0.11206897 0.534482777
## vascular smooth muscle 0.11250000 0.500000000
## pericyte               0.13145539 0.014084507
## mesothelial            0.29411766 0.764705896
## ionocyte               0.21052632 0.000000000
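
The two measures complement each other: an average can be driven by a few highly expressing cells, whereas the fraction shows how widespread expression is within a cell type. As a minimal sketch, the code below keeps the cell types in which more than half of the cells express COL1A1 and shows both measures side by side. The rows are copied by hand from the two outputs above, the 0.5 threshold is arbitrary, and it is an assumption that both results share the same row order, as the printed outputs suggest.

```r
cell_types <- c("fibroblast", "alveolar fibroblast", "mucous", "T")

# Values copied from the GetAverage() and GetFractionDetected() outputs above.
average <- data.frame(
  COL1A1 = c(10.089125633, 4.771382809, 0.116527453, 0.009203196),
  row.names = cell_types
)
fraction <- data.frame(
  COL1A1 = c(0.890909076, 0.595678985, 0.375000000, 0.003825555),
  row.names = cell_types
)

# Cell types in which more than half of the cells express COL1A1.
min_fraction <- 0.5  # arbitrary threshold, chosen for illustration
expressing <- rownames(fraction)[fraction[, "COL1A1"] > min_fraction]
print(expressing)

# Both measures side by side (relies on identical row order in both tables).
summary_table <- data.frame(
  average = average[, "COL1A1"],
  fraction = fraction[, "COL1A1"],
  row.names = rownames(average)
)
print(summary_table)
```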

To get a list of all available features (e.g. genes) for an organism, you can use:

genes <- GetFeatures(organism = 'h_sapiens')
# To show just the first 20 genes
print(head(genes, 20))
##  [1] "A1BG"        "A1BG-AS1"    "A1CF"        "A2M"         "A2M-AS1"    
##  [6] "A2ML1"       "A2ML1-AS1"   "A2ML1-AS2"   "A2MP1"       "A3GALT2"    
## [11] "A4GALT"      "A4GNT"       "AAAS"        "AACS"        "AACSP1"     
## [16] "AADAC"       "AADACL2"     "AADACL2-AS1" "AADACL3"     "AADACL4"
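
A feature list is handy for checking gene names before sending a query. As a minimal sketch, the code below splits a list of genes of interest into names that are present and names that are not, then looks for names that differ only in capitalization. The `genes` vector is copied from the start of the output above, and the `wanted` vector is hypothetical, with two deliberately wrong entries. Whether the API itself is case-sensitive is not stated here; only R's own string comparison is.

```r
# First genes copied from the output above; in practice use GetFeatures().
genes <- c("A1BG", "A1BG-AS1", "A1CF", "A2M", "A2M-AS1", "A2ML1")

# Hypothetical genes of interest; "a2m" and "NOTAGENE" are deliberately wrong.
wanted <- c("A2M", "a2m", "A1CF", "NOTAGENE")

found <- intersect(wanted, genes)
not_found <- setdiff(wanted, genes)
print(found)
print(not_found)

# Case-insensitive lookup to recover names that differ only in capitalization.
recovered <- genes[match(toupper(not_found), toupper(genes))]
recovered <- recovered[!is.na(recovered)]
print(recovered)
```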

Markers

Each cell type expresses specific genes, called markers, that contribute to its unique biological function. To request a list of markers for your cell type of choice:

markers <- GetMarkers(organism = 'h_sapiens', organ = 'Lung', cell_type = 'fibroblast', number = 5)
print(markers)
## [1] "MFAP5"     "PI16"      "RPL10P6"   "EEF1A1P11" "RPL7P9"

NOTE: There are multiple methods to compute marker genes. The current version of the API uses one specific method, but future versions aim to let the user choose which method to use.


Finding cells that highly express a gene

To find out which cell types express your gene of interest the most, across all organs, use:

highest_expressors <- GetHighestMeasurement(organism = 'h_sapiens', feature = 'PTPRC', number = 5)
print(highest_expressors)
##    Cell type    Organ  Average
## 1 neutrophil      fat 32.03100
## 2 neutrophil    blood 23.67732
## 3 neutrophil   spleen 23.54425
## 4 neutrophil prostate 23.49050
## 5 neutrophil  trachea 23.17156
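
One practical detail of this output is that the first column name, `Cell type`, contains a space, so it cannot be accessed with an unquoted `$`. As a minimal sketch, the code below rebuilds the table above by hand and reads that column with `[[ ]]` and with backticks. Treating the result as a data frame with exactly these column names is an assumption based on the printed output.

```r
# Table copied from the output above; check.names = FALSE keeps the space
# in "Cell type" instead of converting it to "Cell.type".
highest_expressors <- data.frame(
  "Cell type" = rep("neutrophil", 5),
  Organ = c("fat", "blood", "spleen", "prostate", "trachea"),
  Average = c(32.03100, 23.67732, 23.54425, 23.49050, 23.17156),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Two equivalent ways to read a column whose name contains a space.
cell_types <- highest_expressors[["Cell type"]]
same_cell_types <- highest_expressors$`Cell type`
print(unique(cell_types))

# Organs in which the top expressing cell type was measured.
top_organs <- highest_expressors$Organ[cell_types == cell_types[1]]
print(top_organs)
```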

Finding similar features

You can also find other features (genes) whose expression patterns are similar to those of a feature of interest. To get a list of similar features for your gene of choice:

similar_genes <- GetSimilarFeatures(organism = 'h_sapiens', organ = 'lung', feature = 'PTPRC', number = 5, method = 'correlation')
print(similar_genes)
##   Similar features  distances
## 1             LCP1 0.01616174
## 2             CD53 0.03778654
## 3              WAS 0.04407340
## 4            HCLS1 0.05033290
## 5         ARHGAP30 0.05041230

NOTE: There are multiple methods to compute feature similarity. The available methods are:

  • correlation (default): Pearson correlation of the fraction of cells expressing each gene
  • cosine: Cosine similarity of the fraction of expressing cells
  • euclidean: Euclidean distance of average expression levels
  • manhattan: Manhattan distance of average expression levels
  • log-euclidean: Euclidean distance after log-transformation (useful for sparse features)
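
To make the difference between these methods concrete, the sketch below computes each quantity in base R for PTPRC and COL1A1, using five lung cell types copied from the outputs above. These are the standard textbook definitions. Whether the server computes exactly these, and how a similarity is turned into the `distances` column, is not stated here; the use of `log1p` for the log-transformation is also an assumption.

```r
# Values for neutrophil, T, AT1, fibroblast, mesothelial (copied from above).
# Fraction of expressing cells:
frac_ptprc  <- c(0.92528737, 0.93114001, 0.07109005, 0.03116883, 0.29411766)
frac_col1a1 <- c(0.011494253, 0.003825555, 0.085308060, 0.890909076, 0.764705896)
# Average expression:
avg_ptprc  <- c(22.31271, 10.51312, 0.09707276, 0.05332901, 0.5869431)
avg_col1a1 <- c(0.014522638, 0.009203196, 0.109001435, 10.089125633, 1.449272752)

# Similarities on the fraction of expressing cells.
pearson <- cor(frac_ptprc, frac_col1a1, method = "pearson")
cosine  <- sum(frac_ptprc * frac_col1a1) /
  sqrt(sum(frac_ptprc^2) * sum(frac_col1a1^2))

# Distances on average expression levels.
euclidean <- sqrt(sum((avg_ptprc - avg_col1a1)^2))
manhattan <- sum(abs(avg_ptprc - avg_col1a1))

# Assumption: log(1 + x) as the log-transformation, which is defined at zero.
log_euclidean <- sqrt(sum((log1p(avg_ptprc) - log1p(avg_col1a1))^2))

print(c(pearson = pearson, cosine = cosine, euclidean = euclidean,
        manhattan = manhattan, log_euclidean = log_euclidean))
```

The two genes mark different lineages, so the Pearson correlation comes out negative. The log-transformation shrinks the influence of the few very large averages, which is why the log-euclidean distance is smaller than the plain Euclidean one.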

Data sources

atlasapprox relies upon available cell atlases kindly released for public use. We are grateful to all their authors for their help and commitment to open science.

To get the data sources in the package, call:

data_sources <- GetDataSources()
print(data_sources)

NOTE: Although the original cell type annotations of these data sets are mostly preserved, a quality check is performed before computing approximations. During this step, some cell types might be filtered out, renamed, or split into multiple subannotations. If you find a problem in the data that suggests a misannotation, please reach out to fabio DOT zanini AT unsw DOT edu DOT au and we will endeavour to fix it.