Cleanet for Mass Cytometry

library(Cleanet)
library(readr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

Thank you for using Cleanet, an unsupervised method for doublet identification and classification. This tutorial will walk you through a standard Cleanet workflow for mass cytometry data. Most of the steps would be the same for flow cytometry data, with the exception of debris depletion, which is more straightforward in that case, because of the scattering channels.

We start by loading some example data. These 10,000 events come from the whole blood of a healthy donor, profiled by CyTOF at the University of Pennsylvania, using the 30-parameter MDIPA panel. There are also two DNA intercalator channels and one channel providing cell type annotations. Channels have been renamed for convenience.

path <- system.file("extdata", "df_mdipa.csv", package="Cleanet")
df_mdipa <- read_csv(path, col_types=cols())
print(df_mdipa)
#> # A tibble: 10,000 × 33
#>      CD45 CD196 CD123  CD19   CD4   CD8a CD11c  CD16 CD45RO  CD45RA CD161 CD194
#>     <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl>
#>  1   74.9  14.4 4.50   1.99  1.36  11.0   16.8  700.  20.0    2.94   2.23 19.7 
#>  2   94.1  25.4 2.19   3.29 27.6    3.88  15.0  878.  86.5   13.2    5.47 10.3 
#>  3   60.7  24.0 2.73   1.93  9.22   9.34  18.2  661.  29.5   12.0    4.05 27.5 
#>  4  648.    0   1.02  12.2  22.5   29.0   37.7  187.   7.67 284.    41.0   5.85
#>  5   30.7  13.3 0.191  6.01  7.94  21.3   16.0  492.  15.8    3.74   2.12  6.78
#>  6  135.   26.3 7.19   3.73  7.24  10.7   38.5  628.  77.6    4.18   0    25.5 
#>  7   46.9  28.1 5.01   0     2.51   3.81  23.9  774.  36.6    0.272  5.15 21.6 
#>  8 1508.   17.6 2.75   5.57  7.00 542.    28.8  157. 120.   226.    68.7  85.8 
#>  9   99.2  31.5 7.65   3.54  5.29  15.6   28.4  753.  26.0    1.40  10.4  14.8 
#> 10   64.7  45.0 7.31   5.79  7.88  25.4   21.8  760.  37.5    3.87   0    11.4 
#> # ℹ 9,990 more rows
#> # ℹ 21 more variables: CD25 <dbl>, CD27 <dbl>, CD57 <dbl>, CD183 <dbl>,
#> #   CD185 <dbl>, CD28 <dbl>, CD38 <dbl>, CD56 <dbl>, TCRgd <dbl>, CD294 <dbl>,
#> #   CD197 <dbl>, CD14 <dbl>, CD3 <dbl>, CD20 <dbl>, CD66b <dbl>,
#> #   `HLA-DR` <dbl>, IgD <dbl>, CD127 <dbl>, DNA1 <dbl>, DNA2 <dbl>, label <chr>

The minimal information necessary to run Cleanet is a data frame and a set of columns to use for doublet detection.

cols <- c("CD45", "CD123", "CD19", "CD11c", "CD16",
          "CD56", "CD294", "CD14", "CD3", "CD20",
          "CD66b", "CD38", "HLA-DR", "CD45RA",
          "DNA1", "DNA2")
cleanet_res <- cleanet(df_mdipa, cols, cofactor=5)
#> Starting Cleanet...
#> Doublet simulation using all 10000 events...
#> Done! Found 699 doublets, 7% of total.

The output is a list containing, alongside other model information, a binary array specifying predictions for all events.

print(table(cleanet_res$status))
#> 
#> Doublet Singlet 
#>     699    9301

The sensitivity value is the fraction of simulated doublets that Cleanet correctly classifies as doublets. It is an internal measure of model confidence.

print(cleanet_res$sensitivity)
#> [1] 0.914

We can verify the predictions on a DNA/CD45 bivariate plot: events predicted to be doublets have higher values for DNA1, as expected.

ggplot(df_mdipa, aes(x=asinh(DNA1/5), y=asinh(CD45/5), color=cleanet_res$status)) +
  geom_point(size=0.2) +
  scale_color_discrete(name="Status") +
  theme_bw()

If there is a lot of debris in the sample, the simulation performed by Cleanet can fail: a singlet plus a debris event fail to add up to a doublet. In our file, debris is only 10% of all events, which is acceptable. In general, results can be improved by flagging (some of) the debris events, so that Cleanet knows to exclude them in the simulation. There is a helper function designed for this, which gives visual feedback.

is_debris <- filter_debris_cytof(df_mdipa, cols)

There is a scalar parameter with values between 0 and 1 that can be tuned to flag fewer or more events as debris. The default value of 0.3 works well for MDIPA, but a different one may be appropriate for your panel.

is_debris <- filter_debris_cytof(df_mdipa, cols, threshold = 0.35)

Including the information about debris in the input can help Cleanet make better predictions. The impact is greater for samples that contain large proportions of debris.

cleanet_res <- cleanet(df_mdipa, cols, cofactor=5, is_debris=is_debris)
#> Starting Cleanet...
#> Doublet simulation using 8566 events, 85.7% of total...
#> Done! Found 548 doublets, 5.5% of total.
print(cleanet_res$sensitivity)
#> [1] 0.994163

ggplot(df_mdipa, aes(x=asinh(DNA1/5), y=asinh(CD45/5), color=cleanet_res$status)) +
  geom_point(size=0.2) +
  scale_color_discrete(name="Status") +
  theme_bw()

The CD294 distribution in the sample shows two groups of CD294 positive cells: basophils to the left, eosinophils to the right. In particular, eosinophils have DNA1 values that are similar to those of doublets, making them challenging to distinguish from doublets in bivariate gating. As a multivariate method, Cleanet has no trouble classifying them as singlets.

ggplot(df_mdipa, aes(x=asinh(DNA1/5), y=asinh(CD45/5), color=asinh(CD294/5))) +
  geom_point(size=0.2) +
  scale_color_gradient(low="black", high="red") +
  theme_bw()

Now assume that you have isolated the singlets and classified them, using your favorite manual or automated method. For our example data, this information is stored in the label column. Here is the breakdown among cell types.

print(table(df_mdipa$label))
#> 
#>     B cell   Basophil     Debris Eosinophil   Monocyte    NK cell Neutrophil 
#>        291         34       1125        117        389        511       5512 
#>     T cell 
#>       2021

Cleanet can use this information to extend the classification of singlets into a classification of doublets (and higher multiplets). The label information for singlets only must be extracted and passed to the classify_doublets function. We will tabulate the output by doublet type.

singlet_clas <- df_mdipa$label[which(cleanet_res$status!="Doublet")]
doublet_clas <- classify_doublets(cleanet_res, singlet_clas)
sort(table(doublet_clas))
#> doublet_clas
#>                         B cell_Monocyte                B cell_Neutrophil_T cell 
#>                                       1                                       1 
#>                       Monocyte_Monocyte          Monocyte_Neutrophil_Neutrophil 
#>                                       1                                       1 
#>                         NK cell_NK cell           NK cell_Neutrophil_Neutrophil 
#>                                       1                                       1 
#> Neutrophil_Neutrophil_Neutrophil_T cell                Neutrophil_T cell_T cell 
#>                                       1                                       1 
#>                       Eosinophil_T cell              Monocyte_Neutrophil_T cell 
#>                                       2                                       2 
#>               NK cell_Neutrophil_T cell                          B cell_NK cell 
#>                                       2                                       3 
#>                           B cell_T cell                     Basophil_Neutrophil 
#>                                       3                                       4 
#>                          NK cell_T cell                   Eosinophil_Neutrophil 
#>                                       7                                       8 
#>                         Monocyte_T cell        Neutrophil_Neutrophil_Neutrophil 
#>                                       8                                       9 
#>            Neutrophil_Neutrophil_T cell                       B cell_Neutrophil 
#>                                      10                                      22 
#>                           T cell_T cell                     Monocyte_Neutrophil 
#>                                      27                                      30 
#>                      NK cell_Neutrophil                       Neutrophil_T cell 
#>                                      41                                     131 
#>                   Neutrophil_Neutrophil 
#>                                     231

Neutrophils account for ~50% of singlets in whole blood. Unsurprisingly, then, they are also involved in most of the multiplets. To generalize this intuition, we can compute expected proportions of doublet types, by multiplying the frequencies of the components. The function compare_doublets_exp_obs does this and returns the proportions of expected and observed doublets as a data frame.

df_exp_obs <- compare_doublets_exp_obs(doublet_clas, singlet_clas, cleanet_res)
#> Joining with `by = join_by(Type)`
arrange(df_exp_obs, -Expected)
#> # A tibble: 45 × 3
#>    Type                  Expected Observed
#>    <chr>                    <dbl>    <dbl>
#>  1 Neutrophil_Neutrophil   0.378   0.422  
#>  2 Neutrophil_T cell       0.274   0.239  
#>  3 NK cell_Neutrophil      0.0721  0.0748 
#>  4 Monocyte_Neutrophil     0.0549  0.0547 
#>  5 T cell_T cell           0.0496  0.0493 
#>  6 B cell_Neutrophil       0.0399  0.0401 
#>  7 NK cell_T cell          0.0261  0.0128 
#>  8 Monocyte_T cell         0.0199  0.0146 
#>  9 Eosinophil_Neutrophil   0.0163  0.0146 
#> 10 B cell_T cell           0.0144  0.00547
#> # ℹ 35 more rows

A quick plot can help us compare expected and observed proportions.

ggplot(df_exp_obs, aes(x=Expected, y=Observed)) +
  geom_point() +
  geom_abline(slope=1, yintercept=0, linetype="dotted") +
  theme_bw()
#> Warning in geom_abline(slope = 1, yintercept = 0, linetype = "dotted"):
#> Ignoring unknown parameters: `yintercept`

In this case, expected and observed proportions match very well. Large deviations from the diagonal could be caused by technical factors, or by biological ones such as interactions between specific cell types.