Data Projection • cyCONDOR

library(cyCONDOR)

Given the large number of samples that modern HDC instruments can generate, and the number of cells acquired per sample, we developed a data projection workflow in cyCONDOR. With this approach, an initial model is trained that includes a UMAP dimensionality reduction and a classifier of the cell labels. Any new sample can then be projected into the pre-trained model. This operation is much faster than re-running the full analysis and allows millions of cells to be analysed in a few minutes.

Loading the data for training

We start by loading the data for the training.

condor_train <- prep_fcd(data_path = "../../../Figure 6 - Data Projection/data_and_envs/fcs_train/", 
                         max_cell = 5000, 
                         useCSV = FALSE, 
                         transformation = "auto_logi", 
                         remove_param = c("FSC-H", "SSC-H", "FSC-W", "SSC-W", "Time", "live_dead"), 
                         anno_table = "../../../Figure 6 - Data Projection/data_and_envs/metadata_train.csv", 
                         filename_col = "filename")

condor_train$anno$cell_anno$group <- "train"
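Because prep_fcd() links each file to its annotation through filename_col, it can help to confirm beforehand that the annotation table and the data folder list the same files. The following is a minimal sketch in base R on a throwaway folder; the folder, the file names and the toy annotation table are hypothetical and only stand in for the real paths used above.

```r
# Hypothetical toy setup: a temporary folder with three empty placeholder files
fcs_dir <- tempfile("fcs_toy_")
dir.create(fcs_dir)
invisible(file.create(file.path(fcs_dir, c("ID1.fcs", "ID2.fcs", "ID3.fcs"))))

# Hypothetical annotation table; in practice this would come from read.csv()
anno <- data.frame(filename = c("ID1.fcs", "ID2.fcs", "ID4.fcs"),
                   group = c("ctrl", "ctrl", "pat"))

# Files present in the folder
fcs_files <- list.files(fcs_dir, pattern = "\\.fcs$")

# Files without an annotation row, and annotation rows without a file
missing_in_anno <- setdiff(fcs_files, anno$filename)
missing_on_disk <- setdiff(anno$filename, fcs_files)

print(missing_in_anno)
print(missing_on_disk)
```

If both vectors are empty, every file has exactly one matching annotation entry by name.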

Loading the data for projection

We also load the data that will later be projected.

condor_test <- prep_fcd(data_path = "../../../Figure 6 - Data Projection/data_and_envs/fcs_test/",
                        max_cell = 10000,
                        useCSV = FALSE,
                        transformation = "auto_logi",
                        remove_param = c("FSC-H", "SSC-H", "FSC-W", "SSC-W", "Time", "live_dead"),
                        anno_table = "../../../Figure 6 - Data Projection/data_and_envs/metadata_test.csv",
                        filename_col = "filename")

condor_test$anno$cell_anno$group <- "test"

UMAP Projection

We start by running a UMAP on the training data. Here we set the ret_model argument to TRUE to keep the UMAP model in the condor object. The UMAP calculation and the data projection can only be based on the protein expression (expr): a PCA would be computed independently for the two datasets and would therefore not provide consistent results.
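To illustrate why independently computed PCAs are not comparable, here is a small sketch in base R on simulated data. The matrices and marker names are made up for the example and are not cyCONDOR objects. The same test cells get different principal component coordinates depending on which dataset the PCA was fitted on.

```r
set.seed(1)

# Toy "expression" matrices standing in for training and test cells
markers <- paste0("marker", 1:5)
train_expr <- matrix(rnorm(200 * 5), ncol = 5, dimnames = list(NULL, markers))
test_expr  <- matrix(rnorm(100 * 5), ncol = 5, dimnames = list(NULL, markers))

# PCA fitted separately on each dataset
pca_train <- prcomp(train_expr, center = TRUE, scale. = FALSE)
pca_test  <- prcomp(test_expr,  center = TRUE, scale. = FALSE)

# Coordinates of the test cells in their own PCA
own <- pca_test$x

# Coordinates of the same test cells in the PCA fitted on the training data
via_train <- predict(pca_train, newdata = test_expr)

# Largest absolute difference between the two sets of coordinates
# (absolute values are compared to ignore arbitrary sign flips)
print(max(abs(abs(own) - abs(via_train))))
```

A model trained on the principal components of one dataset therefore cannot be applied directly to principal components computed separately on another dataset.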

Run UMAP keeping the model

condor_train <- runUMAP(fcd = condor_train, 
                        input_type = "expr", 
                        data_slot = "orig", 
                        nThreads = 4,
                        ret_model = TRUE)

Add data to the embedding

We can now predict the UMAP coordinates of the test data.

condor_test <- learnUMAP(fcd = condor_test,
                         input_type = "expr",
                         data_slot = "orig",
                         fcd_model = condor_train,
                         nEpochs = 100,
                         nThreads = 4,
                         prefix = "pred")

The predicted UMAP coordinates can be accessed via condor_test$umap$pred_expr_orig.

condor_test$umap$pred_expr_orig[1:5,]

##                UMAP1      UMAP2
## ID10.fcs_1  8.721360   0.634307
## ID10.fcs_2 -2.714980   9.288510
## ID10.fcs_3 -5.952371  -3.586285
## ID10.fcs_4 -3.397721 -11.507561
## ID10.fcs_5 -3.282857  -3.098220
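The general idea of keeping a UMAP model and embedding new cells into it can be sketched with the uwot package, which is listed in the session info. That runUMAP() and learnUMAP() work this way internally is an assumption; the example uses simulated matrices and only shows the underlying technique.

```r
library(uwot)

set.seed(42)

# Simulated expression matrices (cells x markers); not real cytometry data
train_expr <- matrix(rnorm(300 * 6), ncol = 6)
test_expr  <- matrix(rnorm(50 * 6), ncol = 6)

# Fit the UMAP on the training cells and keep the model
umap_model <- umap(train_expr, n_neighbors = 15, ret_model = TRUE, n_threads = 1)

# Embed the new cells into the existing UMAP space
test_umap <- umap_transform(test_expr, model = umap_model,
                            n_epochs = 100, n_threads = 1)
colnames(test_umap) <- c("UMAP1", "UMAP2")

print(dim(umap_model$embedding))
print(dim(test_umap))
```

The training embedding is left unchanged by the projection; only the coordinates of the new cells are computed.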

Train a classifier for the label transfer

To also transfer the labels from the reference to the projected data, we need to train a cell classifier. We start by clustering the training data. Both FlowSOM and Phenograph clustering can be used as input for the kNN cell label classifier. In this vignette we use Phenograph.

condor_train <- runPhenograph(fcd = condor_train, 
                              input_type = "expr", 
                              data_slot = "orig", 
                              k = 150)

## Run Rphenograph starts:
##   -Input data of 45000 rows and 28 columns
##   -k is set to 150

##   Finding nearest neighbors...DONE ~ 48.264 s
##   Compute jaccard coefficient between nearest-neighbor sets...

## Presorting knn...

## presorting DONE ~ 2.074 s
##   Start jaccard
## DONE ~ 39.566 s
##   Build undirected graph from the weighted links...DONE ~ 3.853 s
##   Run louvain clustering on the graph ...DONE ~ 27.729 s

## Run Rphenograph DONE, totally takes 119.412s.

##   Return a community class
##   -Modularity value: 0.846826 
##   -Number of clusters: 17

We can visualize the Phenograph clustering in a UMAP.

plot_dim_red(fcd = condor_train,
             expr_slot = NULL,
             reduction_method = "umap", 
             reduction_slot = "expr_orig", 
             cluster_slot = "phenograph_expr_orig_k_150",
             param = "Phenograph",
             title = "Phenograph clustering of the training data set")

Label transfer

Now, we train the classifier on the clustering labels. If you assigned a metacluster label, this can also be used to train the classifier.

Train label transfer kNN classifier

Here, we use the Phenograph clustering labels as an example to train the classifier. In many cases you will probably want to use the metacluster labels of an annotated flow cytometry data set that were previously assigned using metaclustering().

condor_train <- train_transfer_model(fcd = condor_train, 
                                     data_slot = "orig", 
                                     input_type = "expr", 
                                     cluster_slot = "phenograph_expr_orig_k_150",
                                     cluster_var = "Phenograph",
                                     method = "knn", 
                                     tuneLength = 5, 
                                     trControl = caret::trainControl(method = "cv"))

## Loading required package: ggplot2

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:cyCONDOR':
## 
##     confusionMatrix

condor_train$extras$lt_model$performance_plot

#kNN importance
condor_train$extras$lt_model$features_plot
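The method, tuneLength and trControl arguments follow the conventions of caret::train(), and caret is attached when the function runs. As a rough illustration of what such a cross-validated kNN classifier looks like, here is a sketch that calls caret directly on the built-in iris data. Whether train_transfer_model() makes exactly this call is an assumption, and iris only stands in for marker expression with cluster labels.

```r
library(caret)

set.seed(123)

# iris stands in for a labelled reference: 4 numeric features, 3 classes
data(iris)
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
ref_cells <- iris[idx, ]
new_cells <- iris[-idx, ]

# kNN classifier tuned over 5 candidate values of k with cross-validation
knn_fit <- train(x = ref_cells[, 1:4],
                 y = ref_cells$Species,
                 method = "knn",
                 tuneLength = 5,
                 trControl = trainControl(method = "cv"))

# Cross-validated accuracy for each candidate k and the selected value
print(knn_fit$results[, c("k", "Accuracy")])
print(knn_fit$bestTune)

# Transfer the labels to cells that were not used for training
pred <- predict(knn_fit, newdata = new_cells[, 1:4])
print(table(predicted = pred, truth = new_cells$Species))
```

The cross-validation results play the same role as the performance plot above: they show how the accuracy changes with the number of neighbours.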

Predict the labels

Based on the trained classifier, we now predict the cluster labels for the test data set.

condor_test <- predict_labels(fcd = condor_test, 
                              data_slot = "orig", 
                              input_type = "expr", 
                              fcd_model = condor_train, 
                              label = "label_pred")

The predicted labels are saved in condor_test$clustering$label_pred.

condor_test$clustering$label_pred[1:5,]

##   Description predicted_label
## 1   predicted               7
## 2   predicted              11
## 3   predicted               9
## 4   predicted              12
## 5   predicted               1
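A quick way to inspect the outcome is to count how many cells were assigned to each predicted cluster. The sketch below uses a small made-up data frame with the same two columns as the output above; with real data, condor_test$clustering$label_pred would take its place.

```r
# Toy stand-in for condor_test$clustering$label_pred (values are made up)
label_pred <- data.frame(Description = "predicted",
                         predicted_label = factor(c(7, 11, 9, 12, 1, 7, 7, 9)))

# Number and percentage of cells per predicted cluster
counts <- table(label_pred$predicted_label)
percent <- round(100 * prop.table(counts), 1)

summary_tab <- data.frame(cluster = names(counts),
                          n_cells = as.vector(counts),
                          percent = as.vector(percent))
print(summary_tab)
```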

Visualize the results

Here we provide some custom code to overlay the results from the train and test condor objects in a single plot. The results of each dataset can also be visualized independently with cyCONDOR built-in functions.

Prepare the dataframe

train <- as.data.frame(cbind(condor_train$umap$expr_orig, Phenograph = condor_train$clustering$phenograph_expr_orig_k_150[, 1]))

train$type <- "train"

test <- cbind(condor_test$umap$pred_expr_orig, 
              condor_test$clustering$label_pred)

test$Description <- NULL

colnames(test) <- c("UMAP1", "UMAP2", "Phenograph")

test$type <- "test"

vis_data <- rbind(train, test)
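When two tables are stacked with rbind(), the cluster column can end up with mixed types, for example numeric in one table and factor in the other. The sketch below, on made-up toy tables, shows one way to check the column names and convert the cluster column to a factor with numerically ordered levels, so that clusters are treated as discrete groups when plotting. The object names are hypothetical and this step is not part of the cyCONDOR workflow.

```r
# Toy tables mimicking the structure of the train and test data frames
toy_train <- data.frame(UMAP1 = c(0.1, 0.4, -1.2),
                        UMAP2 = c(1.0, -0.3, 0.8),
                        Phenograph = c(1, 2, 10),
                        type = "train")
toy_test <- data.frame(UMAP1 = c(0.2, -1.0),
                       UMAP2 = c(0.9, 0.7),
                       Phenograph = factor(c("2", "10")),
                       type = "test")

# Both tables must have the same columns in the same order
stopifnot(identical(colnames(toy_train), colnames(toy_test)))

# Bring the cluster column to a common type before stacking
toy_train$Phenograph <- as.character(toy_train$Phenograph)
toy_test$Phenograph <- as.character(toy_test$Phenograph)
vis_toy <- rbind(toy_train, toy_test)

# Factor with numerically sorted levels (1, 2, 10 rather than 1, 10, 2)
lv <- sort(unique(as.integer(vis_toy$Phenograph)))
vis_toy$Phenograph <- factor(vis_toy$Phenograph, levels = as.character(lv))

print(str(vis_toy))
```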

Overlay UMAP

vis_data$type <- factor(vis_data$type, levels = c("train", "test"))

ggplot(data = vis_data, aes(x = UMAP1, y = UMAP2, color = type, alpha = type, size = type)) +
  geom_point() +
  scale_color_manual(values = c("gray", "#92278F")) +
  scale_alpha_manual(values = c(0.5, 1)) +
  scale_size_manual(values = c(0.1, 0.5)) +
  theme_bw() +
  theme(aspect.ratio = 1, panel.grid = element_blank()) + 
  ggtitle("UMAP projected")

ggplot(data = vis_data, aes(x = UMAP1, y = UMAP2, color = Phenograph, alpha = type, size = type)) +
  geom_point() +
  scale_alpha_manual(values = c(0.1, 1)) +
  scale_size_manual(values = c(0.1, 0.5)) +
  theme_bw() +
  theme(aspect.ratio = 1, panel.grid = element_blank()) +
  ggtitle("Predicted cluster") + facet_wrap(~type)

Session Info

info <- sessionInfo()

info

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] caret_6.0-94   lattice_0.22-5 ggplot2_3.4.4  cyCONDOR_0.2.1
## 
## loaded via a namespace (and not attached):
##   [1] IRanges_2.34.1              Rmisc_1.5.1                
##   [3] urlchecker_1.0.1            nnet_7.3-19                
##   [5] CytoNorm_2.0.1              TH.data_1.1-2              
##   [7] vctrs_0.6.4                 digest_0.6.33              
##   [9] png_0.1-8                   shape_1.4.6                
##  [11] proxy_0.4-27                slingshot_2.8.0            
##  [13] ggrepel_0.9.4               parallelly_1.36.0          
##  [15] MASS_7.3-60                 pkgdown_2.0.7              
##  [17] reshape2_1.4.4              httpuv_1.6.12              
##  [19] foreach_1.5.2               BiocGenerics_0.46.0        
##  [21] withr_2.5.1                 ggrastr_1.0.2              
##  [23] xfun_0.40                   ggpubr_0.6.0               
##  [25] ellipsis_0.3.2              survival_3.5-7             
##  [27] memoise_2.0.1               hexbin_1.28.3              
##  [29] ggbeeswarm_0.7.2            RProtoBufLib_2.12.1        
##  [31] princurve_2.1.6             profvis_0.3.8              
##  [33] ggsci_3.0.0                 systemfonts_1.0.5          
##  [35] ragg_1.2.6                  zoo_1.8-12                 
##  [37] GlobalOptions_0.1.2         DEoptimR_1.1-3             
##  [39] Formula_1.2-5               prettyunits_1.2.0          
##  [41] promises_1.2.1              scatterplot3d_0.3-44       
##  [43] rstatix_0.7.2               globals_0.16.2             
##  [45] ps_1.7.5                    rstudioapi_0.15.0          
##  [47] miniUI_0.1.1.1              generics_0.1.3             
##  [49] ggcyto_1.28.1               base64enc_0.1-3            
##  [51] processx_3.8.2              curl_5.1.0                 
##  [53] S4Vectors_0.38.2            zlibbioc_1.46.0            
##  [55] flowWorkspace_4.12.2        polyclip_1.10-6            
##  [57] randomForest_4.7-1.1        GenomeInfoDbData_1.2.10    
##  [59] RBGL_1.76.0                 ncdfFlow_2.46.0            
##  [61] RcppEigen_0.3.3.9.4         xtable_1.8-4               
##  [63] stringr_1.5.0               desc_1.4.2                 
##  [65] doParallel_1.0.17           evaluate_0.22              
##  [67] S4Arrays_1.0.6              hms_1.1.3                  
##  [69] glmnet_4.1-8                GenomicRanges_1.52.1       
##  [71] irlba_2.3.5.1               colorspace_2.1-0           
##  [73] harmony_1.1.0               reticulate_1.34.0          
##  [75] readxl_1.4.3                magrittr_2.0.3             
##  [77] lmtest_0.9-40               readr_2.1.4                
##  [79] Rgraphviz_2.44.0            later_1.3.1                
##  [81] future.apply_1.11.0         robustbase_0.99-0          
##  [83] XML_3.99-0.15               cowplot_1.1.1              
##  [85] matrixStats_1.1.0           RcppAnnoy_0.0.21           
##  [87] xts_0.13.1                  class_7.3-22               
##  [89] Hmisc_5.1-1                 pillar_1.9.0               
##  [91] nlme_3.1-163                iterators_1.0.14           
##  [93] compiler_4.3.1              RSpectra_0.16-1            
##  [95] stringi_1.7.12              gower_1.0.1                
##  [97] minqa_1.2.6                 SummarizedExperiment_1.30.2
##  [99] lubridate_1.9.3             devtools_2.4.5             
## [101] CytoML_2.12.0               plyr_1.8.9                 
## [103] crayon_1.5.2                abind_1.4-5                
## [105] locfit_1.5-9.8              sp_2.1-1                   
## [107] sandwich_3.0-2              pcaMethods_1.92.0          
## [109] dplyr_1.1.3                 codetools_0.2-19           
## [111] multcomp_1.4-25             textshaping_0.3.7          
## [113] recipes_1.0.8               openssl_2.1.1              
## [115] Rphenograph_0.99.1          TTR_0.24.3                 
## [117] bslib_0.5.1                 e1071_1.7-13               
## [119] destiny_3.14.0              GetoptLong_1.0.5           
## [121] ggplot.multistats_1.0.0     mime_0.12                  
## [123] splines_4.3.1               circlize_0.4.15            
## [125] Rcpp_1.0.11                 sparseMatrixStats_1.12.2   
## [127] cellranger_1.1.0            knitr_1.44                 
## [129] utf8_1.2.4                  clue_0.3-65                
## [131] lme4_1.1-35.1               fs_1.6.3                   
## [133] listenv_0.9.0               checkmate_2.3.0            
## [135] DelayedMatrixStats_1.22.6   pkgbuild_1.4.2             
## [137] ggsignif_0.6.4              tibble_3.2.1               
## [139] Matrix_1.6-1.1              rpart.plot_3.1.1           
## [141] callr_3.7.3                 tzdb_0.4.0                 
## [143] tweenr_2.0.2                pkgconfig_2.0.3            
## [145] pheatmap_1.0.12             tools_4.3.1                
## [147] cachem_1.0.8                smoother_1.1               
## [149] fastmap_1.1.1               rmarkdown_2.25             
## [151] scales_1.2.1                grid_4.3.1                 
## [153] usethis_2.2.2               broom_1.0.5                
## [155] sass_0.4.7                  graph_1.78.0               
## [157] carData_3.0-5               RANN_2.6.1                 
## [159] rpart_4.1.21                farver_2.1.1               
## [161] yaml_2.3.7                  MatrixGenerics_1.12.3      
## [163] foreign_0.8-85              ggthemes_4.2.4             
## [165] cli_3.6.1                   purrr_1.0.2                
## [167] stats4_4.3.1                lifecycle_1.0.3            
## [169] uwot_0.1.16                 askpass_1.2.0              
## [171] Biobase_2.60.0              mvtnorm_1.2-3              
## [173] lava_1.7.3                  sessioninfo_1.2.2          
## [175] backports_1.4.1             cytolib_2.12.1             
## [177] timechange_0.2.0            gtable_0.3.4               
## [179] rjson_0.2.21                umap_0.2.10.0              
## [181] ggridges_0.5.4              Rphenoannoy_0.1.0          
## [183] parallel_4.3.1              pROC_1.18.5                
## [185] limma_3.56.2                jsonlite_1.8.7             
## [187] edgeR_3.42.4                RcppHNSW_0.5.0             
## [189] bitops_1.0-7                Rtsne_0.16                 
## [191] FlowSOM_2.8.0               ranger_0.16.0              
## [193] flowCore_2.12.2             jquerylib_0.1.4            
## [195] timeDate_4022.108           shiny_1.7.5.1              
## [197] ConsensusClusterPlus_1.64.0 htmltools_0.5.6.1          
## [199] diffcyt_1.20.0              glue_1.6.2                 
## [201] XVector_0.40.0              VIM_6.2.2                  
## [203] RCurl_1.98-1.13             rprojroot_2.0.3            
## [205] gridExtra_2.3               boot_1.3-28.1              
## [207] TrajectoryUtils_1.8.0       igraph_1.5.1               
## [209] R6_2.5.1                    tidyr_1.3.0                
## [211] SingleCellExperiment_1.22.0 labeling_0.4.3             
## [213] vcd_1.4-11                  cluster_2.1.4              
## [215] pkgload_1.3.3               GenomeInfoDb_1.36.4        
## [217] ipred_0.9-14                nloptr_2.0.3               
## [219] DelayedArray_0.26.7         tidyselect_1.2.0           
## [221] vipor_0.4.5                 htmlTable_2.4.2            
## [223] ggforce_0.4.1               CytoDx_1.20.0              
## [225] car_3.1-2                   future_1.33.0              
## [227] ModelMetrics_1.2.2.2        munsell_0.5.0              
## [229] laeken_0.5.2                data.table_1.14.8          
## [231] htmlwidgets_1.6.2           ComplexHeatmap_2.16.0      
## [233] RColorBrewer_1.1-3          rlang_1.1.1                
## [235] remotes_2.4.2.1             colorRamps_2.3.1           
## [237] ggnewscale_0.4.9            fansi_1.0.5                
## [239] hardhat_1.3.0               beeswarm_0.4.0             
## [241] prodlim_2023.08.28