In this vignette we exemplify a workflow to correct for a batch effect induced by performing experiments (including cell preparation, staining and recording of flow cytometry data) on different dates. The example data set used in this vignette consists of peripheral blood mononuclear cells (PBMCs) from healthy donors.
1 cyCONDOR dimensionality reduction workflow
This first part describes the general cyCONDOR
data
processing and dimensionality reduction workflow (see
vignettes("Data loading and transformation")
and
vignettes("Dimensionality Reduction")
for more details)
until the closer inspection of the data set and identification of a
technical batch effect.
Loading the data
We start by loading the data.
condor <- prep_fcd(data_path = "../../../Figure 3 - Batch Correction/data/CureDem/all/",
max_cell = 500,
useCSV = FALSE,
transformation = "auto_logi",
remove_param = c("Time", "FSC-H", "SSC-H"),
anno_table = "../../../Figure 3 - Batch Correction/data/CureDem/all.csv",
filename_col = "filename"
)
# set the date as factor for visualization purposes
condor$anno$cell_anno$exp <- as.factor(condor$anno$cell_anno$exp)
Dimensionality Reduction
Next, we perform dimensionality reduction calculating the principal components (PCs) and the UMAP based on the PCs.
Inspection of batch effect
We can now visualize the batch effect in this data set by plotting the UMAP coordinates and coloring the cells by the experiment date.
plot_dim_red(fcd= condor,
reduction_method = "umap",
reduction_slot = "pca_orig",
param = "exp",
title = "Original UMAP")
plot_dim_red(fcd= condor,
reduction_method = "umap",
reduction_slot = "pca_orig",
param = "exp",
title = "Original UMAP",
facet_by_variable = TRUE)
2 Batch correction
Within the cyCONDOR
ecosystem we implemented harmony and CytoNorm for batch
correction. The correction can be applied either to the protein
expression values (fluorescence intensities) or to the principal
components. There is no ‘magic bullet’ for batch correction and each of
these methods needs to be used with care to correct for the batch effect
but not the underlying biological effects. Therefore, each of the here
described methods should always be validated by inspecting the
expression of hallmark markers.
If you use this workflow in your work please consider citing cyCONDOR and harmony or CytoNorm.
2.1 Correct Principal Components
To use the harmony
algorithm for correcting the
principal components, we use the harmonize_PCA()
and define
the batch variable batch_var
, in this example the
experiment date ‘exp’.
condor <- harmonize_PCA(fcd = condor,
batch_var = c("exp"),
data_slot = "orig")
## Transposing data matrix
## Initializing state using k-means centroids initialization
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony 4/10
## Harmony 5/10
## Harmony 6/10
## Harmony 7/10
## Harmony converged after 7 iterations
The harmonized PCs are saved in condor$pca$norm
.
condor$pca$norm[1:10, 1:5]
## PC1
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_1 -1.1366675
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_2 3.6644012
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_3 2.4084209
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_4 0.2216334
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_5 3.7184420
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_6 3.7217632
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_7 2.4474169
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_8 -2.7793225
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_9 1.8133915
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_10 4.0265809
## PC2
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_1 -3.4624610
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_2 1.4990914
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_3 -0.2008533
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_4 -1.0498486
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_5 -0.1274135
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_6 0.7872670
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_7 1.1856915
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_8 1.0500470
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_9 -2.3469980
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_10 0.6143970
## PC3
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_1 -1.6953916
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_2 0.3315178
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_3 0.5562826
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_4 -0.1512992
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_5 -0.1128416
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_6 0.5681361
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_7 -0.7625211
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_8 -0.6801728
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_9 -1.0648298
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_10 0.7935678
## PC4
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_1 1.88612893
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_2 -0.71703164
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_3 2.50713060
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_4 0.81109320
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_5 0.80766039
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_6 0.72652862
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_7 0.60679898
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_8 -0.84180911
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_9 0.21054986
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_10 -0.06402865
## PC5
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_1 0.95925264
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_2 0.27537428
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_3 1.46456636
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_4 -2.28436215
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_5 1.21457968
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_6 -0.35067378
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_7 1.11939274
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_8 0.72285975
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_9 -1.69663626
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_10 0.03348906
Repeat dimensionality reduction
We can now recalculate the UMAP based on the harmonized PCs. For this
we select input_type = "pca"
and
data_slot = "norm"
.
condor <- runUMAP(fcd = condor,
input_type = "pca",
data_slot = "norm",
prefix= NULL)
Unless a prefix has been set, the new UMAP coordinates calculated
based on the harmonized PCs can be accessed via
condor$umap$pca_norm
.
condor$umap$pca_norm[1:5,]
## UMAP1
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_1 -3.707570
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_2 -4.884564
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_3 -5.692166
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_4 -6.726507
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_5 -6.848187
## UMAP2
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_1 -12.651143
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_2 10.020984
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_3 7.914641
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_4 -3.255706
## 20230322_AB_20220406_DH_193223821_193363_E2_Live_cells_comp.fcs_5 7.687574
Visualization of the data integration results
UMAP
For a first inspection of the data integration, we can visualize the results by plotting the new UMAP coordinates and coloring the cells by the experiment date. The UMAP shows no strong separation of the cells by the experiment date anymore. To further inspect the data integration results, the expression of cell type lineage markers should be assessed. This is highly dependent on the markers measured in the flow cytometry data set and therefore not included in this vignette.
plot_dim_red(fcd= condor,
reduction_method = "umap",
reduction_slot = "pca_norm",
param = "exp",
title = "Harmonized UMAP")
plot_dim_red(fcd= condor,
reduction_method = "umap",
reduction_slot = "pca_norm",
param = "exp",
title = "Harmonized UMAP",
facet= TRUE)
LISI score
To further inspect the data integration result, we can calculate the Local Inverse Simpson’s Index (LISI) score. The LISI score can be used to assess the degree of mixing among the cells from the different batches in the UMAP space.
#prepare pre batch correction data set
pre_batch <- cbind(condor$umap$pca_orig, condor$anno$cell_anno)
#calculate LISI score for batch variable 'exp'
res_pre <- compute_lisi(pre_batch[,c(1,2)], pre_batch, c('exp'))
colnames(res_pre) <- "lisi"
#combine LISI score with cell annotation
lisi_mat_pre <- cbind(pre_batch, res_pre)
lisi_mat_pre$type <- "pre"
#prepare ppost batch correction data set
post_batch <- cbind(condor$umap$pca_norm, condor$anno$cell_anno)
#calculate LISI score for batch variable 'exp'
res_post <- compute_lisi(post_batch[,c(1,2)], post_batch, c('exp'))
colnames(res_post) <- "lisi"
#combine LISI score with cell annotation
lisi_mat_post <- cbind(post_batch, res_post)
lisi_mat_post$type <- "post"
#combine pre and post batch matrices
lisi_mat <- rbind(lisi_mat_post, lisi_mat_pre)
lisi_mat$type <- factor(lisi_mat$type, levels = c("pre", "post"))
#visualization
p <- ggplot(data = lisi_mat, aes(y = lisi, x = type, fill = type)) +
geom_jitter_rast(alpha = 0.01, scale =0.5) +
geom_violin(alpha = 0.8) +
scale_fill_manual(values= c("#1C75BC", "#BE1E2D"))+
theme_bw() +
theme(aspect.ratio = 2, panel.grid = element_blank(),
text= element_text(size=16, color= "black")) +
xlab("")+
ylab("LISI score")
p
2.2 Correct fluorescent intensities
As an alternative to correcting the embedding, the fluorescent
intensities can be directly corrected using harmony
or
CytoNorm
. However, direct correction of fluorescence
intensities should be used with caution, if biological differences
between groups or conditions are planned to be compared in the
downstream analysis.
Correct fluorescent intensities with harmony
To use the harmony
algorithm for correcting the
intensities, we use the harmonize_intensities()
and define
the batch variable batch_var
, in this case ‘exp’.
condor <- harmonize_intensities(fcd = condor,
batch_var = c("exp"))
## Transposing data matrix
## Initializing state using k-means centroids initialization
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony converged after 3 iterations
The harmonized intensities are saved in
condor$expr$norm
.
Correct fluorescent intensities with CytoNorm
Another approach for correcting the fluorescence intensities is the
application of the CytoNorm algorithm
within the cyCONDOR
ecosystem. For a detailed description
of CytoNorm
see Van Gassen et al.,
2019.
As a first step, the batch effect is learnt from reference samples provided by the user. The control samples should contain the whole range of expression values of the markers in the panel. For example in the case of a stimulation experiment including unstimulated and stimulated samples as control samples improved estimation of the batch effect (Van Gassen et al., 2019). Ideally the samples used as reference samples were stained and recorded along the other samples within each batch. Depending on your data set, it is also possible to use all samples contained in your data set as reference samples to train the model and learn the differences between the batches as shown in the example below.
Training the model
Here, a model is trained on all samples contained in the data set.
For this we need to provide the name of the batch variable
batch_var
and optionally parameters which should not be
included for the training as well as different parameters for clustering
the cells with FlowSOM (FlowSom_param
). Here, we use all
markers present in the fcd
and perform the training with
the default parameters for FlowSom_param
. The number of
desired FlowSOM metaclusters is defined by nClus
and should
be adjusted according to your data set, see Van Gassen et al., 2019
for more details.
condor <- train_cytonorm(fcd = condor,
batch_var = "exp",
remove_param = NULL,
FlowSOM_param = list(nCells = 5000, xdim = 5, ydim = 5, nClus = 10, scale = FALSE),
seed = 91)
Normalization of data with trained model
Next the trained model saved within your fcd
is used to
normalize all samples present in your fcd
. The fcs files
with the normalized expression values can be saved, if
keep_fcs
is set to TRUE
.
condor <- run_cytonorm(fcd = condor,
files = NULL,
batch_var = "exp",
keep_fcs = FALSE)
## start normalization
## adding normalized expression data to fcd
## removing temporary fcs files
Session Info
info <- sessionInfo()
info
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lisi_1.0 ggrastr_1.0.2 ggplot2_3.4.4 cyCONDOR_0.2.0
##
## loaded via a namespace (and not attached):
## [1] IRanges_2.34.1 Rmisc_1.5.1
## [3] urlchecker_1.0.1 nnet_7.3-19
## [5] CytoNorm_2.0.1 TH.data_1.1-2
## [7] vctrs_0.6.4 digest_0.6.33
## [9] png_0.1-8 shape_1.4.6
## [11] proxy_0.4-27 slingshot_2.8.0
## [13] ggrepel_0.9.4 parallelly_1.36.0
## [15] MASS_7.3-60 pkgdown_2.0.7
## [17] reshape2_1.4.4 httpuv_1.6.12
## [19] foreach_1.5.2 BiocGenerics_0.46.0
## [21] withr_2.5.1 xfun_0.40
## [23] ggpubr_0.6.0 ellipsis_0.3.2
## [25] survival_3.5-7 memoise_2.0.1
## [27] hexbin_1.28.3 ggbeeswarm_0.7.2
## [29] RProtoBufLib_2.12.1 princurve_2.1.6
## [31] profvis_0.3.8 ggsci_3.0.0
## [33] systemfonts_1.0.5 ragg_1.2.6
## [35] zoo_1.8-12 GlobalOptions_0.1.2
## [37] DEoptimR_1.1-3 Formula_1.2-5
## [39] prettyunits_1.2.0 promises_1.2.1
## [41] scatterplot3d_0.3-44 rstatix_0.7.2
## [43] globals_0.16.2 ps_1.7.5
## [45] rstudioapi_0.15.0 miniUI_0.1.1.1
## [47] generics_0.1.3 ggcyto_1.28.1
## [49] base64enc_0.1-3 processx_3.8.2
## [51] curl_5.1.0 S4Vectors_0.38.2
## [53] zlibbioc_1.46.0 flowWorkspace_4.12.2
## [55] polyclip_1.10-6 randomForest_4.7-1.1
## [57] GenomeInfoDbData_1.2.10 RBGL_1.76.0
## [59] ncdfFlow_2.46.0 RcppEigen_0.3.3.9.4
## [61] xtable_1.8-4 stringr_1.5.0
## [63] desc_1.4.2 doParallel_1.0.17
## [65] evaluate_0.22 S4Arrays_1.0.6
## [67] hms_1.1.3 glmnet_4.1-8
## [69] GenomicRanges_1.52.1 irlba_2.3.5.1
## [71] colorspace_2.1-0 harmony_1.1.0
## [73] reticulate_1.34.0 readxl_1.4.3
## [75] magrittr_2.0.3 lmtest_0.9-40
## [77] readr_2.1.4 Rgraphviz_2.44.0
## [79] later_1.3.1 lattice_0.22-5
## [81] future.apply_1.11.0 robustbase_0.99-0
## [83] XML_3.99-0.15 cowplot_1.1.1
## [85] matrixStats_1.1.0 RcppAnnoy_0.0.21
## [87] xts_0.13.1 class_7.3-22
## [89] Hmisc_5.1-1 pillar_1.9.0
## [91] nlme_3.1-163 iterators_1.0.14
## [93] compiler_4.3.1 RSpectra_0.16-1
## [95] stringi_1.7.12 gower_1.0.1
## [97] minqa_1.2.6 SummarizedExperiment_1.30.2
## [99] lubridate_1.9.3 devtools_2.4.5
## [101] CytoML_2.12.0 plyr_1.8.9
## [103] crayon_1.5.2 abind_1.4-5
## [105] locfit_1.5-9.8 sp_2.1-1
## [107] sandwich_3.0-2 pcaMethods_1.92.0
## [109] dplyr_1.1.3 codetools_0.2-19
## [111] multcomp_1.4-25 textshaping_0.3.7
## [113] recipes_1.0.8 openssl_2.1.1
## [115] Rphenograph_0.99.1 TTR_0.24.3
## [117] bslib_0.5.1 e1071_1.7-13
## [119] destiny_3.14.0 GetoptLong_1.0.5
## [121] ggplot.multistats_1.0.0 mime_0.12
## [123] splines_4.3.1 circlize_0.4.15
## [125] Rcpp_1.0.11 sparseMatrixStats_1.12.2
## [127] cellranger_1.1.0 knitr_1.44
## [129] utf8_1.2.4 clue_0.3-65
## [131] lme4_1.1-35.1 fs_1.6.3
## [133] listenv_0.9.0 checkmate_2.3.0
## [135] DelayedMatrixStats_1.22.6 pkgbuild_1.4.2
## [137] ggsignif_0.6.4 tibble_3.2.1
## [139] Matrix_1.6-1.1 rpart.plot_3.1.1
## [141] callr_3.7.3 tzdb_0.4.0
## [143] tweenr_2.0.2 pkgconfig_2.0.3
## [145] pheatmap_1.0.12 tools_4.3.1
## [147] cachem_1.0.8 RhpcBLASctl_0.23-42
## [149] smoother_1.1 fastmap_1.1.1
## [151] rmarkdown_2.25 scales_1.2.1
## [153] grid_4.3.1 usethis_2.2.2
## [155] broom_1.0.5 sass_0.4.7
## [157] graph_1.78.0 carData_3.0-5
## [159] RANN_2.6.1 rpart_4.1.21
## [161] farver_2.1.1 yaml_2.3.7
## [163] MatrixGenerics_1.12.3 foreign_0.8-85
## [165] ggthemes_4.2.4 cli_3.6.1
## [167] purrr_1.0.2 stats4_4.3.1
## [169] lifecycle_1.0.3 uwot_0.1.16
## [171] askpass_1.2.0 caret_6.0-94
## [173] Biobase_2.60.0 mvtnorm_1.2-3
## [175] lava_1.7.3 sessioninfo_1.2.2
## [177] backports_1.4.1 cytolib_2.12.1
## [179] timechange_0.2.0 gtable_0.3.4
## [181] rjson_0.2.21 umap_0.2.10.0
## [183] ggridges_0.5.4 parallel_4.3.1
## [185] pROC_1.18.5 limma_3.56.2
## [187] jsonlite_1.8.7 edgeR_3.42.4
## [189] RcppHNSW_0.5.0 bitops_1.0-7
## [191] Rtsne_0.16 FlowSOM_2.8.0
## [193] ranger_0.16.0 flowCore_2.12.2
## [195] jquerylib_0.1.4 timeDate_4022.108
## [197] shiny_1.7.5.1 ConsensusClusterPlus_1.64.0
## [199] htmltools_0.5.6.1 diffcyt_1.20.0
## [201] glue_1.6.2 XVector_0.40.0
## [203] VIM_6.2.2 RCurl_1.98-1.13
## [205] rprojroot_2.0.3 gridExtra_2.3
## [207] boot_1.3-28.1 TrajectoryUtils_1.8.0
## [209] igraph_1.5.1 R6_2.5.1
## [211] tidyr_1.3.0 SingleCellExperiment_1.22.0
## [213] labeling_0.4.3 vcd_1.4-11
## [215] cluster_2.1.4 pkgload_1.3.3
## [217] GenomeInfoDb_1.36.4 ipred_0.9-14
## [219] nloptr_2.0.3 DelayedArray_0.26.7
## [221] tidyselect_1.2.0 vipor_0.4.5
## [223] htmlTable_2.4.2 ggforce_0.4.1
## [225] CytoDx_1.20.0 car_3.1-2
## [227] future_1.33.0 ModelMetrics_1.2.2.2
## [229] munsell_0.5.0 laeken_0.5.2
## [231] data.table_1.14.8 htmlwidgets_1.6.2
## [233] ComplexHeatmap_2.16.0 RColorBrewer_1.1-3
## [235] rlang_1.1.1 remotes_2.4.2.1
## [237] colorRamps_2.3.1 Cairo_1.6-1
## [239] ggnewscale_0.4.9 fansi_1.0.5
## [241] hardhat_1.3.0 beeswarm_0.4.0
## [243] prodlim_2023.08.28