Machine learning classifier
We developed a set of functions that allow the user to easily train a
classifier for the sample labels. In this vignette we walk through all
the steps required to classify AML (acute myeloid leukemia) and control
samples. The trained model can then be used to predict the label of an
external sample. This workflow is based on the CytoDX package; for
detailed documentation see the original manuscript (Hu et al., 2019,
Bioinformatics) and the cytoDX documentation.
We start the vignette by loading a training dataset in which the
clinical classification of each sample is known; this classification is
used to train a cytoDX model. In cyCONDOR the model is saved within the
condor object and can be used to classify new samples.
If you use this workflow in your work please consider citing cyCONDOR and cytoDX.
Train the cytoDX model
Load the data
We start by importing the training dataset. This is done with the
prep_fcd function, as previously described. In this case the anno_table
also includes the clinical classification of the samples (aml or normal).
condor <- prep_fcd(data_path = "../../../Figure 7 - Clinical Classifier/data_and_envs/CytoDX/train/",
max_cell = 10000000,
transformation = "auto_logi",
remove_param = c("FSC-A","FSC-W","FSC-H","Time"),
anno_table = "../../../Figure 7 - Clinical Classifier/data_and_envs/CytoDX/fcs_info_train.csv",
filename_col = "fcsName",
seed = 91)
Build the classifier model
We now train the cytoDX classifier on the sample label. This step does
not require any other pre-analysis of the dataset. Nevertheless, if you
are not familiar with the data you are using for training, we recommend
an exploratory data analysis first.
# Reorder the factor levels. This is not strictly needed, but the classification always considers the first level as the reference.
condor$anno$cell_anno$Label <- factor(condor$anno$cell_anno$Label,
levels = c("normal", "aml"),
labels = c("1_normal", "2_aml"))
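The effect of this recoding can be checked on a toy vector. The sketch below uses base R only and invented labels; it is an illustration of how factor() orders and renames levels, not part of the cyCONDOR workflow.

```r
# Toy label vector (invented for illustration, not the vignette data)
label <- c("aml", "normal", "aml", "normal", "normal")

# Same recoding as above: "normal" becomes the first (reference) level
label_f <- factor(label,
                  levels = c("normal", "aml"),
                  labels = c("1_normal", "2_aml"))

levels(label_f)       # the reference level is listed first
table(label_f)        # counts per class after recoding
as.integer(label_f)   # underlying integer codes (1 = reference level)
```

The numeric prefixes in the new labels keep the intended order even if the column is later converted to character and back to a factor, since factor() sorts levels alphabetically by default.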
The train_classifier_model function requires the user to define the
input table and a few parameters to be used for training the model. As
some of the parameters are inherited from the cytoDX package, please
refer to the cytoDX documentation for further details.
- fcd: flow cytometry dataset to be used for training the model.
- input_type: data slot to be used for the classification (expr in the call below).
- data_slot: exact name of the data slot to be used (orig in the call below; use the batch-corrected slot if batch correction was performed).
- sample_names: name of the column of the cell annotation containing the sample names.
- classification_variable: name of the column of the cell annotation containing the clinical classification to be used for training the classifier.
- type1: type of first level prediction, parameter inherited from cytoDX; see the cytoDX documentation for details.
- type2: type of second level prediction, parameter inherited from cytoDX; see the cytoDX documentation for details.
- parallelCore: number of cores to be used.
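Before starting a long training run, it can help to confirm that the slots and columns named in these arguments actually exist. The sketch below is a minimal base R check; check_classifier_inputs is a hypothetical helper (not a cyCONDOR function) and toy_fcd is an invented object that only mimics the expr and anno$cell_anno layout used in this vignette.

```r
# Hypothetical helper (NOT part of cyCONDOR): verifies that the slots and
# columns named in the training arguments are present in the object.
check_classifier_inputs <- function(fcd, input_type, data_slot, sample_names) {
  stopifnot(input_type %in% names(fcd),
            data_slot %in% names(fcd[[input_type]]),
            sample_names %in% colnames(fcd$anno$cell_anno))
  # Assumption: the annotation holds exactly one row per cell
  stopifnot(nrow(fcd[[input_type]][[data_slot]]) == nrow(fcd$anno$cell_anno))
  invisible(TRUE)
}

# Toy object mimicking the layout used in this vignette (values are invented)
toy_fcd <- list(
  expr = list(orig = data.frame(CD34 = c(0.1, 2.3, 1.7),
                                CD45 = c(3.2, 2.9, 0.4))),
  anno = list(cell_anno = data.frame(
    expfcs_filename = c("a.fcs", "a.fcs", "b.fcs"),
    Label = c("aml", "aml", "normal")))
)

# Passes silently when everything is in place, stops with an error otherwise
check_classifier_inputs(toy_fcd, "expr", "orig", "expfcs_filename")
```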
condor <- train_classifier_model(fcd = condor,
input_type = "expr",
data_slot = "orig",
sample_names = "expfcs_filename",
classification_variable = condor$anno$cell_anno$Label,
family = "binomial",
type1 = "response",
parallelCore = 1,
reg = FALSE,
seed = 91)
## Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
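The warning above is raised when one class has fewer than 8 observations, which here presumably refers to the number of samples per class rather than the number of cells. A minimal sketch for counting samples per class is shown below; the annotation table is invented and only reuses the column names from this vignette.

```r
# Toy cell annotation (invented): one row per cell, several cells per sample
cell_anno <- data.frame(
  expfcs_filename = rep(c("s1.fcs", "s2.fcs", "s3.fcs"), times = c(3, 2, 4)),
  Label = rep(c("2_aml", "2_aml", "1_normal"), times = c(3, 2, 4))
)

# Collapse to one row per sample, then count samples per class
sample_anno <- unique(cell_anno[, c("expfcs_filename", "Label")])
n_per_class <- table(sample_anno$Label)
n_per_class

# Classes with fewer than 8 samples (the threshold named in the warning)
small_classes <- names(n_per_class)[as.vector(n_per_class) < 8]
small_classes
```

With the real data the same two lines can be applied to condor$anno$cell_anno, assuming the Label and expfcs_filename columns used elsewhere in this vignette.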
Explore the result of model training
We can now explore the results of the cell level and sample level
predictions on the training data. The results are stored together with
the cytoDX model itself in the extras slot of the condor object
(condor$extras$classifier_model).
Cell level prediction result on the training dataset
The cell level result contains the probability of classification to aml
for each cell in the dataset. This table also includes the true label of
each cell.
## sample y1.Truth y.Pred.s0
## 1 sample11.fcs 2_aml 0.5836773
## 2 sample11.fcs 2_aml 0.5299637
## 3 sample11.fcs 2_aml 0.6896542
## 4 sample11.fcs 2_aml 0.4914881
## 5 sample11.fcs 2_aml 0.5393115
## 6 sample11.fcs 2_aml 0.3407959
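Individual cells are often classified with only moderate confidence, so it can be useful to summarise the cell level probabilities per sample. The sketch below re-types the six rows printed above into a small data frame; the 0.5 cut-off is an arbitrary choice for illustration, not a cytoDX setting.

```r
# First six rows of the cell level table shown above, re-typed by hand
cell_pred <- data.frame(
  sample = rep("sample11.fcs", 6),
  y.Pred.s0 = c(0.5836773, 0.5299637, 0.6896542,
                0.4914881, 0.5393115, 0.3407959)
)

# Mean probability per sample
mean_prob <- tapply(cell_pred$y.Pred.s0, cell_pred$sample, mean)

# Fraction of cells per sample with probability above 0.5 (arbitrary cut-off)
frac_above <- tapply(cell_pred$y.Pred.s0 > 0.5, cell_pred$sample, mean)

cell_summary <- data.frame(sample = names(mean_prob),
                           mean_prob = as.vector(mean_prob),
                           frac_above = as.vector(frac_above))
cell_summary
```

On the full table the same code would use condor$extras$classifier_model$train.Data.cell in place of the hand-typed data frame.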
Sample level prediction result on the training dataset
The sample level result contains the probability of classification to
aml for each sample in the dataset. This table also includes the true
label of each sample.
## sample y1.Truth y.Pred.s0
## sample11.fcs sample11.fcs 2_aml 1.000000e+00
## sample12.fcs sample12.fcs 2_aml 1.000000e+00
## sample13.fcs sample13.fcs 2_aml 9.999999e-01
## sample14.fcs sample14.fcs 2_aml 1.000000e+00
## sample15.fcs sample15.fcs 2_aml 1.000000e+00
## sample16.fcs sample16.fcs 1_normal 5.949578e-14
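The sample level probabilities can be converted into predicted labels and compared with the true labels. The sketch below re-types the six rows printed above and assumes that y.Pred.s0 is the probability of the second factor level (2_aml); the 0.5 threshold is an assumption made for illustration.

```r
# Six rows of the sample level table shown above, re-typed by hand
sample_pred <- data.frame(
  sample = paste0("sample", 11:16, ".fcs"),
  y1.Truth = c("2_aml", "2_aml", "2_aml", "2_aml", "2_aml", "1_normal"),
  y.Pred.s0 = c(1, 1, 9.999999e-01, 1, 1, 5.949578e-14)
)

# Assumed decision rule: probability above 0.5 is called "2_aml"
sample_pred$predicted <- ifelse(sample_pred$y.Pred.s0 > 0.5, "2_aml", "1_normal")

# Confusion table and accuracy on these rows
conf_mat <- table(truth = sample_pred$y1.Truth, predicted = sample_pred$predicted)
conf_mat
accuracy <- mean(sample_pred$predicted == sample_pred$y1.Truth)
accuracy
```

Because these rows come from the training data, the agreement measured here describes the fit of the model and is not an estimate of performance on new samples.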
Visualize the results on the train dataset
We can now visualize the prediction result both at cell and sample level.
anno <- read.csv("../../../Figure 7 - Clinical Classifier/data_and_envs/CytoDX/fcs_info_train.csv")
ggplot(merge(x = condor$extras$classifier_model$train.Data.cell, y = anno, by.x = "sample", by.y = "fcsName"), aes(x = sample, y = y.Pred.s0, color = Label)) +
geom_jitter() +
geom_violin() +
scale_color_manual(values = c("#92278F", "#F15A29")) +
theme_bw() +
theme(aspect.ratio = 1) +
ylab("probability") +
ggtitle("cell level prediction - train data") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(merge(x = condor$extras$classifier_model$train.Data.sample, y = anno, by.x = "sample", by.y = "fcsName"), aes(x = sample, y = y.Pred.s0, color = Label)) +
geom_point(size = 4) +
scale_color_manual(values = c("#92278F", "#F15A29")) +
theme_bw() +
theme(aspect.ratio = 2) +
ylab("probability") +
ggtitle("sample level prediction - train data") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Visualization of the decision tree
We can use a cytoDX built-in function to visualize the decision tree
used for the cell level classification. See the cytoDX documentation for
further details.
tree <- treeGate(P = condor$extras$classifier_model$train.Data.cell$y.Pred.s0,
x= condor$expr$orig)
Testing on an independent dataset
Load the data
To validate the performance of the trained cytoDX model, we now apply it
to a test dataset that has no overlap with the training data.
condor_test <- prep_fcd(data_path = "../../../Figure 7 - Clinical Classifier/data_and_envs/CytoDX/test/",
max_cell = 10000000,
transformation = "auto_logi",
remove_param = c("FSC-A","FSC-W","FSC-H","Time"),
anno_table = "../../../Figure 7 - Clinical Classifier/data_and_envs/CytoDX/fcs_info_test.csv",
filename_col = "fcsName",
seed = 91)
Predict classification
We can now predict the labels using the trained model.
# Reorder the factor levels. This is not strictly needed, but the classification always considers the first level as the reference.
condor_test$anno$cell_anno$Label <- factor(condor_test$anno$cell_anno$Label,
levels = c("normal", "aml"),
labels = c("1_normal", "2_aml"))
The predict_classifier function requires a few user-defined inputs to
predict the labels of an external dataset using a previously prepared
cytoDX model.
- fcd: flow cytometry dataset of the new data.
- input_type: data slot to be used for the classification. Should match the option selected in train_classifier_model.
- data_slot: exact name of the data slot to be used (orig in the call below; use the batch-corrected slot if batch correction was performed). Should match the option selected in train_classifier_model.
- sample_names: name of the column in the cell annotation containing the sample names.
- model_object: cytoDX model; this is stored in the extras slot of the condor object used to train the model.
condor_test <- predict_classifier(fcd = condor_test,
input_type = "expr",
data_slot = "orig",
sample_names = "expfcs_filename",
model_object = condor$extras$classifier_model,
seed = 91)
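Since the prediction reuses the trained model, the test data presumably needs to contain the same markers as the training data. The sketch below shows one way to compare the column names of two expression tables in base R; both tables are invented, and in this vignette they would correspond to condor$expr$orig and condor_test$expr$orig.

```r
# Toy expression tables (invented marker names and values)
train_expr <- data.frame(CD34 = c(0.1, 2.3), CD45 = c(3.2, 2.9), CD117 = c(1.0, 0.2))
test_expr  <- data.frame(CD45 = c(2.2, 1.9), CD34 = c(0.4, 1.3))

# Markers used for training that are absent from the test data
missing_in_test <- setdiff(colnames(train_expr), colnames(test_expr))

# Markers present only in the test data
extra_in_test <- setdiff(colnames(test_expr), colnames(train_expr))

# TRUE only if both tables have the same markers in the same order
same_layout <- identical(colnames(train_expr), colnames(test_expr))

missing_in_test
extra_in_test
same_layout
```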
Explore the result of prediction in test dataset
We can now explore the results of the cell level and sample level
predictions on the test data. The results are stored together with the
model itself in the extras slot.
Cell level prediction result on the test dataset
The cell level result contains the probability of classification to aml
for each cell in the dataset.
## sample y.Pred.s0
## 1 sample1.fcs 0.6212374
## 2 sample1.fcs 0.6780328
## 3 sample1.fcs 0.5818562
## 4 sample1.fcs 0.3354043
## 5 sample1.fcs 0.4015879
## 6 sample1.fcs 0.7143018
Sample level prediction result on the test dataset
The sample level result contains the probability of classification to
aml for each sample in the dataset.
## sample y.Pred.s0
## sample1.fcs sample1.fcs 1.000000e+00
## sample10.fcs sample10.fcs 6.021657e-12
## sample2.fcs sample2.fcs 1.000000e+00
## sample3.fcs sample3.fcs 9.998114e-01
## sample4.fcs sample4.fcs 1.000000e+00
## sample5.fcs sample5.fcs 1.000000e+00
Visualize the results on the test dataset
We can now visualize the prediction result both at cell and sample level.
anno <- read.csv("../../../Figure 7 - Clinical Classifier/data_and_envs/CytoDX/fcs_info_test.csv")
tmp <- merge(x = condor_test$extras$classifier_prediction$xNew.Pred.cell, y = anno, by.x = "sample", by.y = "fcsName")
tmp$sample <- factor(tmp$sample, levels = c("sample1.fcs", "sample2.fcs", "sample3.fcs", "sample4.fcs", "sample5.fcs", "sample6.fcs", "sample7.fcs", "sample8.fcs", "sample9.fcs", "sample10.fcs"))
ggplot(tmp, aes(x = sample, y = y.Pred.s0, color = Label)) +
geom_jitter() +
geom_violin() +
scale_color_manual(values = c("#92278F", "#F15A29")) +
theme_bw() +
theme(aspect.ratio = 1) +
ylab("probability") +
ggtitle("cell level prediction - test data") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
tmp <- merge(x = condor_test$extras$classifier_prediction$xNew.Pred.sample, y = anno, by.x = "sample", by.y = "fcsName")
tmp$sample <- factor(tmp$sample, levels = c("sample1.fcs", "sample2.fcs", "sample3.fcs", "sample4.fcs", "sample5.fcs", "sample6.fcs", "sample7.fcs", "sample8.fcs", "sample9.fcs", "sample10.fcs"))
ggplot(tmp, aes(x = sample, y = y.Pred.s0, color = Label)) +
geom_point(size = 4) +
scale_color_manual(values = c("#92278F", "#F15A29")) +
theme_bw() +
theme(aspect.ratio = 2) +
ylab("probability") +
ggtitle("sample level prediction - test data") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
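The factor levels above are typed out by hand so that sample10.fcs is plotted after sample9.fcs rather than after sample1.fcs. A minimal sketch of deriving the same order from the numeric part of the file names is shown below; it assumes every name contains exactly one run of digits, as the names in this vignette do.

```r
# Sample names in the alphabetical order in which they appear in the output
samples <- c("sample1.fcs", "sample10.fcs", "sample2.fcs", "sample3.fcs")

# Extract the numeric part of each name by dropping all non-digit characters
idx <- as.integer(gsub("\\D", "", samples))

# Order the names by that number and use the result as factor levels
ordered_levels <- samples[order(idx)]
samples_f <- factor(samples, levels = ordered_levels)

levels(samples_f)
```

The resulting factor can be assigned to tmp$sample in place of the hard-coded vector when the number of samples changes.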
Session Info
