Dimensionality Reduction
Source:vignettes/Dimensionality_Reduction.Rmd
In this vignette we showcase the cyCONDOR functions for dimensionality reduction. We show how to perform Principal Component Analysis (PCA) and how to calculate Uniform Manifold Approximation and Projection (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE) and Diffusion Map (DM) embeddings.
All functions take the condor object as fcd input and the name of the data_slot to be used for the calculation. The runPCA function always uses the expr slot for the calculation. For non-linear dimensionality reduction (UMAP, tSNE, DM) the user can choose between the expr data and the pca result by setting input_type.
Additionally, the user can state explicitly which markers should be used for the calculation by listing them under markers. The list can either be written out manually or be extracted directly from the condor object using the functions measured_markers and used_markers. By default, all available markers from the condor object are used. If the discard option is set to TRUE, all markers except the ones listed under markers are used for the calculation, which makes it possible to exclude single markers. When using pca as input, it is possible to specify the number of PCs to be used.
By defining a prefix, which gets incorporated into the slot name of the output, each function can be run with different settings and the results are saved under separate names.
The functions return an fcd with an additional data frame corresponding to the chosen dimensionality reduction method, saved in fcd$reduction_method. The name of the output consists of the prefix (if given) and the data_slot; for the non-linear methods the examples in this vignette also include the input_type (e.g. pca_orig).
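The naming scheme can be pictured with a plain nested list. This is only a sketch: the toy fcd list, the simulated expression values and the helper store_reduction are hypothetical stand-ins, not the cyCONDOR implementation. It illustrates why a distinct prefix keeps a second run from overwriting the first.

```r
# Toy stand-in for a condor object: a nested list (assumption for illustration).
set.seed(91)
expr <- matrix(rnorm(40), nrow = 10,
               dimnames = list(NULL, c("CD3", "CD4", "CD8", "CD19")))
fcd <- list(expr = list(orig = expr))

# Hypothetical helper: store a result under <prefix>_<data_slot>,
# or under <data_slot> alone when no prefix is given.
store_reduction <- function(fcd, method, result, data_slot, prefix = NULL) {
  slot_name <- if (is.null(prefix)) data_slot else paste(prefix, data_slot, sep = "_")
  if (is.null(fcd[[method]])) fcd[[method]] <- list()
  fcd[[method]][[slot_name]] <- result
  fcd
}

# Two PCA runs on the same data slot with different marker sets.
pca_all   <- as.data.frame(prcomp(fcd$expr$orig)$x)
pca_tcell <- as.data.frame(prcomp(fcd$expr$orig[, c("CD3", "CD4", "CD8")])$x)

fcd <- store_reduction(fcd, "pca", pca_all, data_slot = "orig")
fcd <- store_reduction(fcd, "pca", pca_tcell, data_slot = "orig", prefix = "Tcell")

# Both results are kept side by side under different names.
names(fcd$pca)
```

Without the prefix, the second call would write to the same name as the first and replace it.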
Marker selection using the markers and discard variables
In all dimensionality reduction functions, the markers used as the basis for the calculation can be specified with a combination of the markers variable and the discard flag. markers takes a vector of marker names that should either be included (positive selection) or excluded (negative selection). To discard the specified markers, set the discard flag to TRUE (negative selection); to keep only the specified markers, leave the discard flag at its default setting (positive selection).
Each marker name should correspond to a column in the expression table. The names can be given manually or extracted from the condor object using the cyCONDOR function used_markers. When performing a marker selection, the user should make sure that a prefix for the output name is set to avoid overwriting a previously calculated matrix.
Marker selection is implemented in all dimensionality reduction functions, but we only demonstrate it for PCA.
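The positive and negative selection logic described above can be sketched in base R. The helper select_markers, the toy expression matrix and its marker names are hypothetical and only mirror the described behaviour of markers and discard; they are not taken from the cyCONDOR source.

```r
# Hypothetical helper mirroring the markers/discard semantics described above.
select_markers <- function(expr, markers = colnames(expr), discard = FALSE) {
  unknown <- setdiff(markers, colnames(expr))
  if (length(unknown) > 0) {
    stop("Markers not found in expression table: ",
         paste(unknown, collapse = ", "))
  }
  # discard = TRUE: drop the listed markers; discard = FALSE: keep only them.
  keep <- if (discard) setdiff(colnames(expr), markers) else markers
  expr[, keep, drop = FALSE]
}

# Toy expression matrix: 10 cells x 6 parameters (placeholder names).
set.seed(91)
expr <- matrix(rnorm(60), nrow = 10,
               dimnames = list(NULL, c("FSC-A", "SSC-A", "CD3", "CD4", "CD8", "CD19")))

# Default: all columns are used.
colnames(select_markers(expr))

# Positive selection: keep only the listed markers.
colnames(select_markers(expr, markers = c("CD3", "CD4", "CD8")))

# Negative selection: use everything except the listed markers.
colnames(select_markers(expr, markers = c("FSC-A", "SSC-A"), discard = TRUE))
```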
Load an example dataset
condor <- readRDS("../.test_files/conodr_example_016.rds")
Principal Component Analysis (PCA)
The calculation of the principal components is based on the prcomp function from the R stats package (https://rdocumentation.org/packages/stats/versions/3.6.2).
condor <- runPCA(fcd = condor,
                 data_slot = "orig",
                 seed = 91)
The output data frame of the PCA can be accessed with condor$pca$orig.
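For readers unfamiliar with prcomp, the following is a minimal base R sketch on simulated data. The matrix, the marker names and the centering and scaling settings are assumptions for illustration; they are not necessarily the settings runPCA uses internally.

```r
set.seed(91)

# Simulated expression matrix: 200 cells x 5 markers (placeholder names).
expr <- matrix(rnorm(200 * 5), nrow = 200,
               dimnames = list(NULL, c("CD3", "CD4", "CD8", "CD19", "CD56")))

# Centering/scaling here are illustrative choices only.
pca <- prcomp(expr, center = TRUE, scale. = FALSE)

# pca$x holds the PC scores: one row per cell, one column per component.
pc_scores <- as.data.frame(pca$x)
dim(pc_scores)

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)
```

The score matrix in pca$x is the kind of per-cell table that a downstream embedding (UMAP, tSNE, DM) can take as input when only the first few PCs are kept.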
As a demonstration, the following code shows a positive and a negative selection with the corresponding setting of the discard flag.
PCA (Positive selection: Specifying the markers to be used as basis for the calculation)
condor <- runPCA(fcd = condor,
                 data_slot = "orig",
                 seed = 91,
                 prefix = "Tcell",
                 markers = c("CD3", "CD4", "CD8"),
                 discard = FALSE)
The output data frame of the PCA with positive marker selection can be accessed with condor$pca$Tcell_orig.
PCA (Negative selection: Excluding specific markers from the calculation)
condor <- runPCA(fcd = condor,
                 data_slot = "orig",
                 seed = 91,
                 prefix = "scatter_exclusion",
                 markers = c("FSC-A", "SSC-A"),
                 discard = TRUE)
The output data frame of the PCA with negative marker selection can be accessed with condor$pca$scatter_exclusion_orig.
UMAP
The calculation of the UMAP is based on the umap function from the uwot package. For more details see: Melville J (2023). “uwot: The Uniform Manifold Approximation and Projection (UMAP) Method for Dimensionality Reduction” https://github.com/jlmelville/uwot.
Besides the key parameters that can be set in the uwot umap function, e.g. the number of items that define a neighborhood around each point (nNeighbors) and the minimum distance between embedded points (min_dist), the runUMAP function implemented in cyCONDOR has additional parameters that can be adjusted. In addition to the selection of markers and an output prefix, the user can specify the number of PCs that should be used for the UMAP calculation (nPC) and has the option to save the UMAP model for future data projection (ret_model).
condor <- runUMAP(fcd = condor,
                  input_type = "pca",
                  data_slot = "orig",
                  seed = 91)
The output data frame of the UMAP coordinates can be accessed with condor$umap$pca_orig.
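To illustrate what the underlying uwot call and a saved model look like, here is a small sketch on simulated data. It calls uwot directly rather than runUMAP; the data, the choice of 4 PCs and the parameter values are assumptions for illustration, and the block only runs the UMAP step if uwot is installed.

```r
set.seed(91)

# Simulated data: 300 cells x 6 markers, reduced to the first 4 PCs
# (analogous in spirit to choosing a number of PCs; values are illustrative).
expr <- matrix(rnorm(300 * 6), nrow = 300)
pcs <- prcomp(expr)$x[, 1:4]

if (requireNamespace("uwot", quietly = TRUE)) {
  # ret_model = TRUE returns a list holding the embedding and the fitted model.
  fit <- uwot::umap(pcs, n_neighbors = 15, min_dist = 0.2, ret_model = TRUE)
  umap_coords <- fit$embedding

  # Project additional cells into the existing embedding with the saved model.
  new_pcs <- pcs[1:10, ] + rnorm(10 * 4, sd = 0.01)
  projected <- uwot::umap_transform(new_pcs, fit)
  dim(projected)
}
```

Keeping the model is what makes later projection possible; without it, new data can only be embedded by recomputing the UMAP from scratch.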
tSNE
The tSNE calculation is based on the Rtsne function from the Rtsne package. The implementation in cyCONDOR allows the user to define the perplexity used in the tSNE calculation. This parameter controls how many nearest neighbors are taken into account when constructing the embedding. As in the UMAP function, the user can select the number of PCs to be used for the calculation. For more details see: Jesse H. Krijthe (2015). “Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation” https://github.com/jkrijthe/Rtsne.
condor <- runtSNE(fcd = condor,
                  input_type = "pca",
                  data_slot = "orig",
                  seed = 91,
                  perplexity = 30)
The output data frame of the tSNE coordinates can be accessed with condor$tSNE$pca_orig.
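As a point of reference, the sketch below calls Rtsne directly on PCA-reduced simulated data. It is not the runtSNE implementation; the data, the number of PCs and the argument values are assumptions, and the tSNE step only runs if Rtsne is installed.

```r
set.seed(91)

# Simulated data: 300 cells x 6 markers, reduced to the first 4 PCs.
expr <- matrix(rnorm(300 * 6), nrow = 300)
pcs <- prcomp(expr)$x[, 1:4]

if (requireNamespace("Rtsne", quietly = TRUE)) {
  # pca = FALSE because the input is already PCA-reduced.
  # Rtsne requires 3 * perplexity < nrow(X) - 1 (here 90 < 299).
  tsne <- Rtsne::Rtsne(pcs, dims = 2, perplexity = 30, pca = FALSE,
                       check_duplicates = FALSE)

  # tsne$Y holds the embedding: one row per cell, one column per dimension.
  tsne_coords <- tsne$Y
  dim(tsne_coords)
}
```

The perplexity constraint in the comment is worth keeping in mind for small datasets, where a value of 30 can be too large for the number of cells.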
Diffusion Map
The calculation of the DM is based on the DiffusionMap function from the destiny package. The number of nearest neighbors to be considered can be specified with k. Here, too, the user can select the number of PCs to be used for the calculation. For more details see: Philipp Angerer et al. (2015). “destiny: diffusion maps for large-scale single-cell data in R.” Helmholtz-Zentrum München. http://bioinformatics.oxfordjournals.org/content/32/8/1241.
condor <- runDM(fcd = subset_fcd(condor, 5000),
                input_type = "pca",
                data_slot = "orig",
                k = 10,
                seed = 91)
The output data frame of the DM can be accessed with condor$diffmap$pca_orig.
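The call above runs the Diffusion Map on subset_fcd(condor, 5000) rather than on the full object. How subset_fcd selects cells is not described in this vignette; the sketch below assumes plain random sampling of rows and uses a hypothetical helper, subsample_cells, on a simulated matrix to show the general idea of reducing the number of cells before a costly calculation.

```r
# Hypothetical helper: draw a reproducible random subset of cells (rows).
# Random sampling is an assumption, not necessarily what subset_fcd does.
subsample_cells <- function(expr, size, seed = 91) {
  set.seed(seed)
  size <- min(size, nrow(expr))               # never ask for more rows than exist
  idx <- sort(sample.int(nrow(expr), size))   # sample without replacement
  expr[idx, , drop = FALSE]
}

# Simulated data: 20000 cells x 4 PCs, with cell identifiers as row names.
set.seed(91)
expr <- matrix(rnorm(20000 * 4), nrow = 20000,
               dimnames = list(paste0("cell_", seq_len(20000)), NULL))

sub <- subsample_cells(expr, 5000)
dim(sub)
```

Keeping the row names lets the reduced result be matched back to the original cells afterwards.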
Session Info
info <- sessionInfo()
info
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cyCONDOR_0.2.1
##
## loaded via a namespace (and not attached):
## [1] IRanges_2.34.1 Rmisc_1.5.1
## [3] urlchecker_1.0.1 nnet_7.3-19
## [5] CytoNorm_2.0.1 TH.data_1.1-2
## [7] vctrs_0.6.4 digest_0.6.33
## [9] png_0.1-8 shape_1.4.6
## [11] proxy_0.4-27 slingshot_2.8.0
## [13] ggrepel_0.9.4 parallelly_1.36.0
## [15] MASS_7.3-60 pkgdown_2.0.7
## [17] reshape2_1.4.4 httpuv_1.6.12
## [19] foreach_1.5.2 BiocGenerics_0.46.0
## [21] withr_2.5.1 ggrastr_1.0.2
## [23] xfun_0.40 ggpubr_0.6.0
## [25] ellipsis_0.3.2 survival_3.5-7
## [27] memoise_2.0.1 hexbin_1.28.3
## [29] ggbeeswarm_0.7.2 RProtoBufLib_2.12.1
## [31] princurve_2.1.6 profvis_0.3.8
## [33] ggsci_3.0.0 systemfonts_1.0.5
## [35] ragg_1.2.6 zoo_1.8-12
## [37] GlobalOptions_0.1.2 DEoptimR_1.1-3
## [39] Formula_1.2-5 prettyunits_1.2.0
## [41] promises_1.2.1 scatterplot3d_0.3-44
## [43] rstatix_0.7.2 globals_0.16.2
## [45] ps_1.7.5 rstudioapi_0.15.0
## [47] miniUI_0.1.1.1 generics_0.1.3
## [49] ggcyto_1.28.1 base64enc_0.1-3
## [51] processx_3.8.2 curl_5.1.0
## [53] S4Vectors_0.38.2 zlibbioc_1.46.0
## [55] flowWorkspace_4.12.2 polyclip_1.10-6
## [57] randomForest_4.7-1.1 GenomeInfoDbData_1.2.10
## [59] RBGL_1.76.0 ncdfFlow_2.46.0
## [61] RcppEigen_0.3.3.9.4 xtable_1.8-4
## [63] stringr_1.5.0 desc_1.4.2
## [65] doParallel_1.0.17 evaluate_0.22
## [67] S4Arrays_1.0.6 hms_1.1.3
## [69] glmnet_4.1-8 GenomicRanges_1.52.1
## [71] irlba_2.3.5.1 colorspace_2.1-0
## [73] harmony_1.1.0 reticulate_1.34.0
## [75] readxl_1.4.3 magrittr_2.0.3
## [77] lmtest_0.9-40 readr_2.1.4
## [79] Rgraphviz_2.44.0 later_1.3.1
## [81] lattice_0.22-5 future.apply_1.11.0
## [83] robustbase_0.99-0 XML_3.99-0.15
## [85] cowplot_1.1.1 matrixStats_1.1.0
## [87] xts_0.13.1 class_7.3-22
## [89] Hmisc_5.1-1 pillar_1.9.0
## [91] nlme_3.1-163 iterators_1.0.14
## [93] compiler_4.3.1 RSpectra_0.16-1
## [95] stringi_1.7.12 gower_1.0.1
## [97] minqa_1.2.6 SummarizedExperiment_1.30.2
## [99] lubridate_1.9.3 devtools_2.4.5
## [101] CytoML_2.12.0 plyr_1.8.9
## [103] crayon_1.5.2 abind_1.4-5
## [105] locfit_1.5-9.8 sp_2.1-1
## [107] sandwich_3.0-2 pcaMethods_1.92.0
## [109] dplyr_1.1.3 codetools_0.2-19
## [111] multcomp_1.4-25 textshaping_0.3.7
## [113] recipes_1.0.8 openssl_2.1.1
## [115] Rphenograph_0.99.1 TTR_0.24.3
## [117] bslib_0.5.1 e1071_1.7-13
## [119] destiny_3.14.0 GetoptLong_1.0.5
## [121] ggplot.multistats_1.0.0 mime_0.12
## [123] splines_4.3.1 circlize_0.4.15
## [125] Rcpp_1.0.11 sparseMatrixStats_1.12.2
## [127] cellranger_1.1.0 knitr_1.44
## [129] utf8_1.2.4 clue_0.3-65
## [131] lme4_1.1-35.1 fs_1.6.3
## [133] listenv_0.9.0 checkmate_2.3.0
## [135] DelayedMatrixStats_1.22.6 pkgbuild_1.4.2
## [137] ggsignif_0.6.4 tibble_3.2.1
## [139] Matrix_1.6-1.1 rpart.plot_3.1.1
## [141] callr_3.7.3 tzdb_0.4.0
## [143] tweenr_2.0.2 pkgconfig_2.0.3
## [145] pheatmap_1.0.12 tools_4.3.1
## [147] cachem_1.0.8 smoother_1.1
## [149] fastmap_1.1.1 rmarkdown_2.25
## [151] scales_1.2.1 grid_4.3.1
## [153] usethis_2.2.2 broom_1.0.5
## [155] sass_0.4.7 graph_1.78.0
## [157] carData_3.0-5 RANN_2.6.1
## [159] rpart_4.1.21 farver_2.1.1
## [161] yaml_2.3.7 MatrixGenerics_1.12.3
## [163] foreign_0.8-85 ggthemes_4.2.4
## [165] cli_3.6.1 purrr_1.0.2
## [167] stats4_4.3.1 lifecycle_1.0.3
## [169] uwot_0.1.16 askpass_1.2.0
## [171] caret_6.0-94 Biobase_2.60.0
## [173] mvtnorm_1.2-3 lava_1.7.3
## [175] sessioninfo_1.2.2 backports_1.4.1
## [177] cytolib_2.12.1 timechange_0.2.0
## [179] gtable_0.3.4 rjson_0.2.21
## [181] umap_0.2.10.0 ggridges_0.5.4
## [183] parallel_4.3.1 pROC_1.18.5
## [185] limma_3.56.2 jsonlite_1.8.7
## [187] edgeR_3.42.4 RcppHNSW_0.5.0
## [189] bitops_1.0-7 ggplot2_3.4.4
## [191] Rtsne_0.16 FlowSOM_2.8.0
## [193] ranger_0.16.0 flowCore_2.12.2
## [195] jquerylib_0.1.4 timeDate_4022.108
## [197] shiny_1.7.5.1 ConsensusClusterPlus_1.64.0
## [199] htmltools_0.5.6.1 diffcyt_1.20.0
## [201] glue_1.6.2 XVector_0.40.0
## [203] VIM_6.2.2 RCurl_1.98-1.13
## [205] rprojroot_2.0.3 gridExtra_2.3
## [207] boot_1.3-28.1 TrajectoryUtils_1.8.0
## [209] igraph_1.5.1 R6_2.5.1
## [211] tidyr_1.3.0 SingleCellExperiment_1.22.0
## [213] vcd_1.4-11 cluster_2.1.4
## [215] pkgload_1.3.3 GenomeInfoDb_1.36.4
## [217] ipred_0.9-14 nloptr_2.0.3
## [219] DelayedArray_0.26.7 tidyselect_1.2.0
## [221] vipor_0.4.5 htmlTable_2.4.2
## [223] ggforce_0.4.1 CytoDx_1.20.0
## [225] car_3.1-2 future_1.33.0
## [227] ModelMetrics_1.2.2.2 munsell_0.5.0
## [229] laeken_0.5.2 data.table_1.14.8
## [231] htmlwidgets_1.6.2 ComplexHeatmap_2.16.0
## [233] RColorBrewer_1.1-3 rlang_1.1.1
## [235] remotes_2.4.2.1 colorRamps_2.3.1
## [237] ggnewscale_0.4.9 fansi_1.0.5
## [239] hardhat_1.3.0 beeswarm_0.4.0
## [241] prodlim_2023.08.28