Import datasets

# install.packages(c("mixOmics","scales","knitr"))
library(mixOmics)
library(RColorBrewer)
library(knitr)

X <- as.matrix(read.csv2("genes.csv"))
y <- read.csv2("cellType.csv",header = F)$V1

PLS, sPLS and sPLS-DA

The package comes from (Lê Cao, Boitard, and Besse 2011) and the data come from (Zheng et al. 2017).

As a reminder, each axis of a PLS problem solves the following optimization problem :

max_{\begin{matrix} u^{T} u = 1 \\ v^{T} v = 1 \end{matrix}} c o v (X u, Y v) = max_{\begin{matrix} u^{T} u = 1 \\ v^{T} v = 1 \end{matrix}} u^{T} X^{T} Y v .

The deflation permits to repeat that problem over successive axes in orthonormal ways.

The sPLS is the $L_{1}$ modified optimization problem, like for the LASSO, which constraints the weights to be smaller than the unconstrained problem (PLS).

sPLS-DA is a PLS method which account to answer to 2 questions in plus than the classical PLS model :

Deal with classes instead of quantitative variables,
Select variables for each axis.

So, the sPLS-DA permits to select variables on different axes which will discriminate the different classes. For this we will have to tune two parameters :

$k e e p_{X}$ per component : the number of genes selected per component,
$n_{c o m p}$ the total number of components.

Remark : In the sPLS problem we do not have to tune $k e e p_{X}$ because we fix it to the number of variables.

Back to the 10X dataset!

We have decided to deal with the 10X dataset and to treat firstly the case of 4 populations which are easily separable, which are :

b-cells
cd14_monocytes
cd34
cd56_nk

So, first of all we create new datasets with only the considered cells. We have to standardize the X dataset :

indices_4_pop <- which(y %in%  c("b_cells",
                                 "cd14_monocytes",
                                 "cd34",
                                 "cd56_nk"))
X_4_pop <- scale(X[indices_4_pop,])
y_4_pop <- droplevels(y[indices_4_pop])

Tune the parameters

The tricky part of the algorithm is to tune the different parameters $n_{c o m p}$ and $k e e p_{X}$ .

The common way is to tune $k e e p_{X}$ on the $1^{s t}$ component, then do the deflation, then tune $k e e p_{X}$ on the $2^{n d}$ component and re-apply up to a coherent value for $n_{c o m p}$ .

To fix $n_{c o m p}$ we will most of times use the rule of thumb : $n_{c o m p}$ =K-1, where K is the number of classes in the dataset.

Here K=4 so we will build

n_{c o m p} = 3,

different components.

Now we have to find a way of selecting the number of variables per axis using a validation criterion. We will use here the quality of classification in a cross-validation based method.

More precisely

$n . f o l d s = 20$ cross-validation,
$k e e p_{X} = 1 : 20$ the grid over which we test the model.

Tune $k e e p_{X_{1}}$ : the number of variables selected along the first axis

Import the function crossValidate_splsDA from file crossValidate_splsDA.R.

source("crossValidate_splsDA.R")

Please use that function to perform cross-validation, plot the error and discuss the results.

n.folds <- 20
keepXs <- 1:20

ncomp <- 1
previsous.keepX <- NULL
errors <- crossValidate_splsDA(X_4_pop,y_4_pop,
                               ncomp=ncomp,
                               keepXs=keepXs,
                               previsous.keepX=previsous.keepX,
                               n.folds=n.folds)
plot(errors,pch=16,type="b",
     xlab="keepX grid for First component",ylab="Error",
     main="Cross-validation error \n n.folds=20, 1st component")

… it took a few times but you found certainly that $k e e p_{X} \in (2, 7)$ are quite good possibilities. Tell us if you found anything different…

Assume you have selected the parameter :

k e e p_{X_{1}} = 7.

Get $k e e p_{X_{2}}$ and $k e e p_{X_{3}}$

As we do not want to waste to much time, we have performed the other cross-validation procedures. The code is here :

# ncomp <- 2
# previsous.keepX <- 7
# errors <- crossValidate_splsDA(X_4_pop,y_4_pop,
#                                ncomp=ncomp,
#                                keepXs=keepXs,
#                                previsous.keepX=previsous.keepX,
#                                n.folds=20)
# plot(errors,pch=16,type="b")
# 
# ncomp <- 3
# previsous.keepX <- c(7,4)
# errors <- crossValidate_splsDA(X_4_pop,y_4_pop,
#                                ncomp=ncomp,
#                                keepXs=keepXs,
#                                previsous.keepX=previsous.keepX,
#                                n.folds=20)
# plot(errors,pch=16,type="b")

Which permits to select the following values :

(k e e p_{X_{2}}, k e e p_{X_{3}}) = (5, 6)

Check the final model

As we have selected the parameters which gives a good level of satisfaction in terms of prediction, we want to check out what we selected.

Model builing

Build the desired model and plot the components. Discuss about the interest of the components to discriminate the classes.

ncomp <- 3
modele <- splsda(X_4_pop,y_4_pop,ncomp = ncomp,keepX = c(7,5,6))

plots <- list()
for(i in 1:ncomp){
  dat <- data.frame(VariateX = modele$variates$X[,i], cell_type = y_4_pop)
  a <- ggplot(dat, aes(x = VariateX, fill = cell_type)) +
    geom_density(alpha = 0.6)+theme(
      legend.title = element_text(size = 20, face = "bold"),
      legend.text = element_text(size = 15),
        axis.title = element_text(size=15),
        axis.text = element_text(size=15))+
    xlab(paste("Component",i))
  plots[[i]] <- a
}
do.call(gridExtra::grid.arrange,  c(grobs=plots, ncol=3))

Selected variables

Find the positions of the genes whch are selected in that study

matGenes <- round(modele$loadings$X[-which(rowSums(modele$loadings$X)==0),],3)
matGenes[which(matGenes==0)] <- ''
kable(matGenes)

	comp 1	comp 2	comp 3
CD52			-0.084
S100A6	-0.251
S100A4	-0.447
RPS27	0.349
CD74			-0.846
HLA.A		0.006
HLA.C		0.13
LTB			-0.346
EEF1A1			-0.311
IFITM2		0.254
FTH1	-0.47
MALAT1	0.194
B2M		0.943
RPL13			-0.131
PFN1		0.172
OAZ1	-0.05
FTL	-0.595
CD37			-0.212

PlotVar representation

It permits to represent the weights in two-dimensionnal figures. Look for that function in the Help and plot it.

plotVar(modele,comp = 1:2,cex=3)

plotVar(modele,comp = c(1,3),cex=3)

What are the problems of that representation ?

Contribution plots

Those plots are quite fancy to check the weights of each variable selected along its component. Look for that function in the Help and plot it.

plotLoadings(modele, comp = 1, method = 'mean', contrib = 'max',legend=F)

plotLoadings(modele, comp = 2, method = 'mean', contrib = 'max',legend=F)

plotLoadings(modele, comp = 3, method = 'mean', contrib = 'max',legend=F)

cim representation

An interpretable way of representing the selected genes is to use cim representation from mixOmics. Plot the selected genes and only them ;).

matGenes <- modele$loadings$X[-which(rowSums(modele$loadings$X)==0),]
modele_cim <- plsda(X_4_pop[,
                            which(rowSums(abs(modele$loadings$X))!=0)[
                              order(apply(matGenes,1,function(a){which(a!=0)}))]],
                    y_4_pop,ncomp = 3)
cim(modele_cim,row.names = NA,
    row.sideColors = brewer.pal(4,name = "Set1")[y_4_pop],
    cluster = "none",
    xlab = "Genes selected",ylab = "Cells",
    legend=list( legend = levels(y_4_pop), 
    col = brewer.pal(4,name = "Set1"),title = "Cell-type", cex = 0.7),mapping="XY")

How can you comment that figure ?

Can you find information redundant with previous figures ?

Variances explained

Plot the variance explained by each axis

plot(modele$explained_variance$X,ylim=range(c(modele$explained_variance$X,modele$explained_variance$Y)),pch=16)
points(modele$explained_variance$Y,pch=16,col="red")

Are the variances increasing or decreasing ?

Is that necessarly the case

Predict functions

PLS-based methods permit to construct a classification model.

For example, if we want to test a few predictions :

id_test <- c(1,2,5,6,5,22,20,12,55,256,758,726,540,265,799)
predicted_classes <- predict(object = modele,newdata = X_4_pop[id_test,] )
# compare prediction to reality
kable(table(predicted_classes$class$max.dist[,ncomp], y_4_pop[id_test]))

	b_cells	cd14_monocytes	cd34	cd56_nk
b_cells	9	0	0	0
cd14_monocytes	0	2	0	0
cd34	0	0	1	0
cd56_nk	0	0	0	3

Do the same with the all dataset

No cheat here but ask for help if needed!

References

Lê Cao, Kim-Anh, Simon Boitard, and Philippe Besse. 2011. “Sparse Pls Discriminant Analysis: Biologically Relevant Feature Selection and Graphical Displays for Multiclass Problems.” BMC Bioinformatics 12 (1). BioMed Central: 253.

Zheng, Grace XY, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, et al. 2017. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nature Communications 8. Nature Publishing Group: 14049.

sPLS-DA for 10x Single-cell RNA-Seq

Hadrien LORENZO

29/11/2018

Import datasets

PLS, sPLS and sPLS-DA

Back to the 10X dataset!

Tune the parameters

Tune $k e e p_{X_{1}}$ : the number of variables selected along the first axis

Get $k e e p_{X_{2}}$ and $k e e p_{X_{3}}$

Check the final model

Model builing

Selected variables

PlotVar representation

Contribution plots

cim representation

Variances explained

Predict functions

Do the same with the all dataset

References

sPLS-DA for 10x Single-cell RNA-Seq

Hadrien LORENZO

29/11/2018

Import datasets

PLS, sPLS and sPLS-DA

Back to the 10X dataset!

Tune the parameters

Tune keepX1keepX1 : the number of variables selected along the first axis

Get keepX2keepX2 and keepX3keepX3

Check the final model

Model builing

Selected variables

PlotVar representation

Contribution plots

cim representation

Variances explained

Predict functions

Do the same with the all dataset

References

Tune $k e e p_{X_{1}}$ : the number of variables selected along the first axis

Get $k e e p_{X_{2}}$ and $k e e p_{X_{3}}$