Given the large number of samples that modern HDC instruments can
generate and the number of cells acquired per sample, we developed a
data projection workflow in cyCONDOR. With this approach, an initial
model is trained that includes a UMAP dimensionality reduction and a
classifier for the cell labels. Any new sample can then be projected
into the pre-trained model. This operation is much faster and makes it
possible to analyse millions of cells in a few minutes.
Loading the data for training
We start by loading the data for the training.
condor_train <- prep_fcd(data_path = "../../../Figure 6 - Data Projection/data_and_envs/fcs_train/",
                         max_cell = 5000,
                         useCSV = FALSE,
                         transformation = "auto_logi",
                         remove_param = c("FSC-H", "SSC-H", "FSC-W", "SSC-W", "Time", "live_dead"),
                         anno_table = "../../../Figure 6 - Data Projection/data_and_envs/metadata_train.csv",
                         filename_col = "filename")
condor_train$anno$cell_anno$group <- "train"
Loading the data for projection
We also load the data that will be projected later.
condor_test <- prep_fcd(data_path = "../../../Figure 6 - Data Projection/data_and_envs/fcs_test/",
                        max_cell = 10000,
                        useCSV = FALSE,
                        transformation = "auto_logi",
                        remove_param = c("FSC-H", "SSC-H", "FSC-W", "SSC-W", "Time", "live_dead"),
                        anno_table = "../../../Figure 6 - Data Projection/data_and_envs/metadata_test.csv",
                        filename_col = "filename")
condor_test$anno$cell_anno$group <- "test"
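Projection is only meaningful if both objects contain the same markers in the same column order. The sketch below shows one way to check this on plain data frames. `check_markers` is a hypothetical helper, not a cyCONDOR function, and the toy tables only stand in for the transformed expression data.

```r
# Hypothetical helper (not part of cyCONDOR): verify that two expression
# tables share the same markers and return the test table in training order.
check_markers <- function(train_expr, test_expr) {
  missing_in_test <- setdiff(colnames(train_expr), colnames(test_expr))
  if (length(missing_in_test) > 0) {
    stop("Markers missing in test data: ",
         paste(missing_in_test, collapse = ", "))
  }
  # Reorder the test columns to match the training data
  test_expr[, colnames(train_expr), drop = FALSE]
}

# Toy data standing in for the expression tables
train_expr <- data.frame(CD3 = rnorm(5), CD4 = rnorm(5), CD8 = rnorm(5))
test_expr  <- data.frame(CD8 = rnorm(5), CD3 = rnorm(5), CD4 = rnorm(5))

aligned <- check_markers(train_expr, test_expr)
colnames(aligned)
```

With the real objects this could be called as `check_markers(condor_train$expr$orig, condor_test$expr$orig)`, assuming the transformed expression is stored in `expr$orig`, which is inferred from the `input_type = "expr"` and `data_slot = "orig"` arguments rather than stated in this vignette.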
UMAP Projection
We now run a UMAP. In this case we set the ret_model argument to TRUE
to keep the UMAP model in the condor object. The UMAP calculation and
the data projection can only be based on the protein expression (expr):
a PCA would be computed independently for the two datasets and would
therefore not give consistent results.
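To illustrate why independently computed PCAs are not comparable, the following sketch uses simulated data and base R's `prcomp()`. It is not cyCONDOR code; the two matrices merely stand in for the expression tables of a training and a test dataset.

```r
set.seed(1)
# Simulated expression matrices with the same four markers
markers <- c("m1", "m2", "m3", "m4")
train <- matrix(rnorm(200 * 4), ncol = 4, dimnames = list(NULL, markers))
test  <- matrix(rnorm(100 * 4), ncol = 4, dimnames = list(NULL, markers))

# PCA fitted separately on each dataset: each fit has its own rotation
pca_train <- prcomp(train, center = TRUE, scale. = FALSE)
pca_test  <- prcomp(test,  center = TRUE, scale. = FALSE)

# The loadings differ, so PC1 of one dataset is not PC1 of the other
loadings_equal <- isTRUE(all.equal(pca_train$rotation, pca_test$rotation))
loadings_equal

# In plain R, consistency requires reusing the training rotation
test_in_train_space <- predict(pca_train, newdata = test)
dim(test_in_train_space)
```

Working directly on the expression values, as done in this vignette, sidesteps the problem because no dataset-specific rotation is involved.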
Run UMAP keeping the model
condor_train <- runUMAP(fcd = condor_train,
                        input_type = "expr",
                        data_slot = "orig",
                        nThreads = 4,
                        ret_model = TRUE)
Add data to the embedding
We can now predict the UMAP coordinates of the test data.
condor_test <- learnUMAP(fcd = condor_test,
                         input_type = "expr",
                         data_slot = "orig",
                         fcd_model = condor_train,
                         nEpochs = 100,
                         nThreads = 4,
                         prefix = "pred")
The predicted UMAP coordinates can be accessed via condor_test$umap$pred_expr_orig.
condor_test$umap$pred_expr_orig[1:5,]
## UMAP1 UMAP2
## ID10.fcs_1 8.721360 0.634307
## ID10.fcs_2 -2.714980 9.288510
## ID10.fcs_3 -5.952371 -3.586285
## ID10.fcs_4 -3.397721 -11.507561
## ID10.fcs_5 -3.282857 -3.098220
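The session information at the end of this vignette lists `uwot`, whose `umap()` and `umap_transform()` functions implement this fit-then-project pattern. The sketch below applies them directly to simulated data to show the mechanism. Whether `runUMAP()` and `learnUMAP()` wrap exactly these calls is an assumption, and the parameter values are illustrative only.

```r
library(uwot)

set.seed(42)
# Simulated stand-ins for the training and test expression matrices
train_expr <- matrix(rnorm(500 * 10), ncol = 10)
test_expr  <- matrix(rnorm(200 * 10), ncol = 10)

# Fit the UMAP and keep the model so that new points can be embedded later
umap_model <- umap(train_expr, n_neighbors = 15, ret_model = TRUE,
                   n_threads = 1)

# Coordinates of the training cells
train_umap <- umap_model$embedding

# Project the test cells into the existing embedding
test_umap <- umap_transform(test_expr, model = umap_model,
                            n_epochs = 100, n_threads = 1)

dim(train_umap)
dim(test_umap)
```

The training embedding is left untouched by the projection step; only the coordinates of the new cells are optimised, which is why projecting is cheaper than recomputing a joint UMAP.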
Train a classifier for the label transfer
To also transfer the labels from the reference to the projected data,
we need to train a cell classifier. We start by clustering the training
data. Both FlowSOM and Phenograph clustering can be used as input for
the kNN cell label classifier. In this vignette we use Phenograph.
condor_train <- runPhenograph(fcd = condor_train,
                              input_type = "expr",
                              data_slot = "orig",
                              k = 150)
## Run Rphenograph starts:
## -Input data of 45000 rows and 28 columns
## -k is set to 150
## Finding nearest neighbors...DONE ~ 47.566 s
## Compute jaccard coefficient between nearest-neighbor sets...
## Presorting knn...
## presorting DONE ~ 2.076 s
## Start jaccard
## DONE ~ 40.344 s
## Build undirected graph from the weighted links...DONE ~ 3.814 s
## Run louvain clustering on the graph ...DONE ~ 23.978 s
## Run Rphenograph DONE, totally takes 115.702s.
## Return a community class
## -Modularity value: 0.846826
## -Number of clusters: 17
We can visualize the Phenograph clustering in a UMAP.
plot_dim_red(fcd = condor_train,
             expr_slot = NULL,
             reduction_method = "umap",
             reduction_slot = "expr_orig",
             cluster_slot = "phenograph_expr_orig_k_150",
             param = "Phenograph",
             title = "Phenograph clustering of the training data set")
Label transfer
Now, we train the classifier on the clustering labels. If you assigned a metacluster label, this can also be used to train the classifier.
Train label transfer kNN classifier
Here, we use the Phenograph clustering labels as an example to train
the classifier. In many cases you will probably want to use the
metacluster labels of an annotated flow cytometry dataset, previously
assigned with metaclustering().
condor_train <- train_transfer_model(fcd = condor_train,
                                     data_slot = "orig",
                                     input_type = "expr",
                                     cluster_slot = "phenograph_expr_orig_k_150",
                                     cluster_var = "Phenograph",
                                     method = "knn",
                                     tuneLength = 5,
                                     trControl = caret::trainControl(method = "cv"))
## Loading required package: ggplot2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:cyCONDOR':
##
## confusionMatrix
condor_train$extras$lt_model$performance_plot
#kNN importance
condor_train$extras$lt_model$features_plot
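The `method`, `tuneLength` and `trControl` arguments above follow the conventions of `caret::train()`. As a rough illustration of what such a classifier does, the sketch below trains a cross-validated kNN model with `caret` on simulated data and predicts labels for new cells. It is independent of cyCONDOR; how `train_transfer_model()` uses `caret` internally is an assumption, and all object names are illustrative.

```r
library(caret)

set.seed(123)
# Simulated training data: two markers, two well-separated clusters
n <- 100
train_expr <- data.frame(
  m1 = c(rnorm(n, mean = 0), rnorm(n, mean = 5)),
  m2 = c(rnorm(n, mean = 0), rnorm(n, mean = 5))
)
train_labels <- factor(rep(c("cluster_1", "cluster_2"), each = n))

# kNN classifier tuned over five candidate values of k with
# 10-fold cross-validation (the caret default for method = "cv")
knn_fit <- train(x = train_expr, y = train_labels,
                 method = "knn",
                 tuneLength = 5,
                 trControl = trainControl(method = "cv"))

knn_fit$bestTune  # selected k

# New cells drawn close to the two cluster centres
new_cells <- data.frame(m1 = c(0.1, 5.2), m2 = c(-0.2, 4.9))
predicted <- predict(knn_fit, newdata = new_cells)
predicted
```

A larger `tuneLength` evaluates more values of k at the cost of a longer training time.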
Predict the labels
Based on the trained classifier, we now predict the cluster labels for the test data set.
condor_test <- predict_labels(fcd = condor_test,
                              data_slot = "orig",
                              input_type = "expr",
                              fcd_model = condor_train,
                              label = "label_pred")
The predicted labels are saved in condor_test$clustering$label_pred.
condor_test$clustering$label_pred[1:5,]
## Description predicted_label
## 1 predicted 7
## 2 predicted 11
## 3 predicted 9
## 4 predicted 12
## 5 predicted 1
Visualize the results
Here we provide some custom code to overlay the results from the train
and test condor objects in a single plot. The results of each dataset
can also be visualized independently with cyCONDOR built-in functions.
Prepare the dataframe
# Training data: UMAP coordinates plus the Phenograph cluster of each cell
train <- as.data.frame(cbind(condor_train$umap$expr_orig,
                             Phenograph = condor_train$clustering$phenograph_expr_orig_k_150[, 1]))
train$type <- "train"
# Test data: projected UMAP coordinates plus the predicted labels
test <- cbind(condor_test$umap$pred_expr_orig,
              condor_test$clustering$label_pred)
test$Description <- NULL
colnames(test) <- c("UMAP1", "UMAP2", "Phenograph")
test$type <- "test"
vis_data <- rbind(train, test)
Overlap UMAP
ggplot(data = vis_data, aes(x = UMAP1, y = UMAP2, color = type, alpha = type, size = type)) +
  geom_point() +
  scale_color_manual(values = c("gray", "#92278F")) +
  scale_alpha_manual(values = c(0.5, 1)) +
  scale_size_manual(values = c(0.1, 0.5)) +
  theme_bw() +
  theme(aspect.ratio = 1, panel.grid = element_blank()) +
  ggtitle("UMAP projected")
ggplot(data = vis_data, aes(x = UMAP1, y = UMAP2, color = Phenograph, alpha = type, size = type)) +
  geom_point() +
  scale_alpha_manual(values = c(0.1, 1)) +
  scale_size_manual(values = c(0.1, 0.5)) +
  theme_bw() +
  theme(aspect.ratio = 1, panel.grid = element_blank()) +
  ggtitle("Predicted cluster") + facet_wrap(~type)
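Because `type` is a character column, ggplot2 orders its values alphabetically (`test` before `train`), and unnamed `values` vectors in the manual scales are matched in that order. The sketch below shows how the mapping can be made explicit with named vectors, using a small simulated data frame in place of `vis_data`. The colour, alpha and size assignments are illustrative choices, not necessarily those intended for the plots above.

```r
library(ggplot2)

set.seed(7)
# Simulated stand-in for vis_data
toy <- data.frame(
  UMAP1 = rnorm(60),
  UMAP2 = rnorm(60),
  type  = rep(c("train", "test"), times = c(40, 20))
)

# Named vectors tie each value to a level regardless of level order
p <- ggplot(toy, aes(x = UMAP1, y = UMAP2,
                     color = type, alpha = type, size = type)) +
  geom_point() +
  scale_color_manual(values = c(train = "gray", test = "#92278F")) +
  scale_alpha_manual(values = c(train = 0.5, test = 1)) +
  scale_size_manual(values = c(train = 0.1, test = 0.5)) +
  theme_bw() +
  theme(aspect.ratio = 1, panel.grid = element_blank())

p
```

Naming the values also keeps the plot stable if `type` is later converted to a factor with a different level order.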
Session Info
info <- sessionInfo()
info
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] caret_6.0-94 lattice_0.22-5 ggplot2_3.4.4 cyCONDOR_0.2.0
##
## loaded via a namespace (and not attached):
## [1] IRanges_2.34.1 Rmisc_1.5.1
## [3] urlchecker_1.0.1 nnet_7.3-19
## [5] CytoNorm_2.0.1 TH.data_1.1-2
## [7] vctrs_0.6.4 digest_0.6.33
## [9] png_0.1-8 shape_1.4.6
## [11] proxy_0.4-27 slingshot_2.8.0
## [13] ggrepel_0.9.4 parallelly_1.36.0
## [15] MASS_7.3-60 pkgdown_2.0.7
## [17] reshape2_1.4.4 httpuv_1.6.12
## [19] foreach_1.5.2 BiocGenerics_0.46.0
## [21] withr_2.5.1 ggrastr_1.0.2
## [23] xfun_0.40 ggpubr_0.6.0
## [25] ellipsis_0.3.2 survival_3.5-7
## [27] memoise_2.0.1 hexbin_1.28.3
## [29] ggbeeswarm_0.7.2 RProtoBufLib_2.12.1
## [31] princurve_2.1.6 profvis_0.3.8
## [33] ggsci_3.0.0 systemfonts_1.0.5
## [35] ragg_1.2.6 zoo_1.8-12
## [37] GlobalOptions_0.1.2 DEoptimR_1.1-3
## [39] Formula_1.2-5 prettyunits_1.2.0
## [41] promises_1.2.1 scatterplot3d_0.3-44
## [43] rstatix_0.7.2 globals_0.16.2
## [45] ps_1.7.5 rstudioapi_0.15.0
## [47] miniUI_0.1.1.1 generics_0.1.3
## [49] ggcyto_1.28.1 base64enc_0.1-3
## [51] processx_3.8.2 curl_5.1.0
## [53] S4Vectors_0.38.2 zlibbioc_1.46.0
## [55] flowWorkspace_4.12.2 polyclip_1.10-6
## [57] randomForest_4.7-1.1 GenomeInfoDbData_1.2.10
## [59] RBGL_1.76.0 ncdfFlow_2.46.0
## [61] RcppEigen_0.3.3.9.4 xtable_1.8-4
## [63] stringr_1.5.0 desc_1.4.2
## [65] doParallel_1.0.17 evaluate_0.22
## [67] S4Arrays_1.0.6 hms_1.1.3
## [69] glmnet_4.1-8 GenomicRanges_1.52.1
## [71] irlba_2.3.5.1 colorspace_2.1-0
## [73] harmony_1.1.0 reticulate_1.34.0
## [75] readxl_1.4.3 magrittr_2.0.3
## [77] lmtest_0.9-40 readr_2.1.4
## [79] Rgraphviz_2.44.0 later_1.3.1
## [81] future.apply_1.11.0 robustbase_0.99-0
## [83] XML_3.99-0.15 cowplot_1.1.1
## [85] matrixStats_1.1.0 RcppAnnoy_0.0.21
## [87] xts_0.13.1 class_7.3-22
## [89] Hmisc_5.1-1 pillar_1.9.0
## [91] nlme_3.1-163 iterators_1.0.14
## [93] compiler_4.3.1 RSpectra_0.16-1
## [95] stringi_1.7.12 gower_1.0.1
## [97] minqa_1.2.6 SummarizedExperiment_1.30.2
## [99] lubridate_1.9.3 devtools_2.4.5
## [101] CytoML_2.12.0 plyr_1.8.9
## [103] crayon_1.5.2 abind_1.4-5
## [105] locfit_1.5-9.8 sp_2.1-1
## [107] sandwich_3.0-2 pcaMethods_1.92.0
## [109] dplyr_1.1.3 codetools_0.2-19
## [111] multcomp_1.4-25 textshaping_0.3.7
## [113] recipes_1.0.8 openssl_2.1.1
## [115] Rphenograph_0.99.1 TTR_0.24.3
## [117] bslib_0.5.1 e1071_1.7-13
## [119] destiny_3.14.0 GetoptLong_1.0.5
## [121] ggplot.multistats_1.0.0 mime_0.12
## [123] splines_4.3.1 circlize_0.4.15
## [125] Rcpp_1.0.11 sparseMatrixStats_1.12.2
## [127] cellranger_1.1.0 knitr_1.44
## [129] utf8_1.2.4 clue_0.3-65
## [131] lme4_1.1-35.1 fs_1.6.3
## [133] listenv_0.9.0 checkmate_2.3.0
## [135] DelayedMatrixStats_1.22.6 pkgbuild_1.4.2
## [137] ggsignif_0.6.4 tibble_3.2.1
## [139] Matrix_1.6-1.1 rpart.plot_3.1.1
## [141] callr_3.7.3 tzdb_0.4.0
## [143] tweenr_2.0.2 pkgconfig_2.0.3
## [145] pheatmap_1.0.12 tools_4.3.1
## [147] cachem_1.0.8 smoother_1.1
## [149] fastmap_1.1.1 rmarkdown_2.25
## [151] scales_1.2.1 grid_4.3.1
## [153] usethis_2.2.2 broom_1.0.5
## [155] sass_0.4.7 graph_1.78.0
## [157] carData_3.0-5 RANN_2.6.1
## [159] rpart_4.1.21 farver_2.1.1
## [161] yaml_2.3.7 MatrixGenerics_1.12.3
## [163] foreign_0.8-85 ggthemes_4.2.4
## [165] cli_3.6.1 purrr_1.0.2
## [167] stats4_4.3.1 lifecycle_1.0.3
## [169] uwot_0.1.16 askpass_1.2.0
## [171] Biobase_2.60.0 mvtnorm_1.2-3
## [173] lava_1.7.3 sessioninfo_1.2.2
## [175] backports_1.4.1 cytolib_2.12.1
## [177] timechange_0.2.0 gtable_0.3.4
## [179] rjson_0.2.21 umap_0.2.10.0
## [181] ggridges_0.5.4 Rphenoannoy_0.1.0
## [183] parallel_4.3.1 pROC_1.18.5
## [185] limma_3.56.2 jsonlite_1.8.7
## [187] edgeR_3.42.4 RcppHNSW_0.5.0
## [189] bitops_1.0-7 Rtsne_0.16
## [191] FlowSOM_2.8.0 ranger_0.16.0
## [193] flowCore_2.12.2 jquerylib_0.1.4
## [195] timeDate_4022.108 shiny_1.7.5.1
## [197] ConsensusClusterPlus_1.64.0 htmltools_0.5.6.1
## [199] diffcyt_1.20.0 glue_1.6.2
## [201] XVector_0.40.0 VIM_6.2.2
## [203] RCurl_1.98-1.13 rprojroot_2.0.3
## [205] gridExtra_2.3 boot_1.3-28.1
## [207] TrajectoryUtils_1.8.0 igraph_1.5.1
## [209] R6_2.5.1 tidyr_1.3.0
## [211] SingleCellExperiment_1.22.0 labeling_0.4.3
## [213] vcd_1.4-11 cluster_2.1.4
## [215] pkgload_1.3.3 GenomeInfoDb_1.36.4
## [217] ipred_0.9-14 nloptr_2.0.3
## [219] DelayedArray_0.26.7 tidyselect_1.2.0
## [221] vipor_0.4.5 htmlTable_2.4.2
## [223] ggforce_0.4.1 CytoDx_1.20.0
## [225] car_3.1-2 future_1.33.0
## [227] ModelMetrics_1.2.2.2 munsell_0.5.0
## [229] laeken_0.5.2 data.table_1.14.8
## [231] htmlwidgets_1.6.2 ComplexHeatmap_2.16.0
## [233] RColorBrewer_1.1-3 rlang_1.1.1
## [235] remotes_2.4.2.1 colorRamps_2.3.1
## [237] ggnewscale_0.4.9 fansi_1.0.5
## [239] hardhat_1.3.0 beeswarm_0.4.0
## [241] prodlim_2023.08.28