unravelling `omics' data with the mixomics r package -...
Post on 13-Jul-2020
7 Views
Preview:
TRANSCRIPT
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Unravelling `omics' data with the mixOmics R
package
Illustration on several studies
Kim-Anh Lê Cao
Queensland Facility for Advanced BioinformaticsThe University of Queensland
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Challenges
The issue with integrative systems biology
Unlimited quantity of data(n << p problem)
Data from multiple sources
→ E�cient and biologically relevant statistical methodologies areneeded to combine the information in these heterogeneous datasets.
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
The biological questions
Single Omics analysis
Do we observe a `natural' separation between the di�erentgroups of patients?
Can we identify potential biomarker candidates predicting thestatus of the patients?
Integrative Omics analysis
Can we identify a subset of correlated genes and proteins frommatching data sets?
Can we predict the abundance of a protein given theexpression of a small subset of genes?
Do two matching omics data set contain the sameinformation?
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
The data
The data
n = number of patientsp, q = number of biological features (genes, proteins ..)
Single Omics analysis
one omic data set X(n x p)
for a supervised analysis, Y vector indicating the class of thepatients
Integrative Omics analysis
two matching omics data sets (measured on the same patients)
X (n x p) and Z (n x q)
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Multivariate analysis
Linear multivariate approaches enable:
Dimension reduction→ project the data in a smaller subspace
To handle multicollinear, irrelevant, missing variables
To capture experimental annd biological variation
In particular, in mixOmics, focus is on:
Data integration
Variable selection
Computationally e�cient methodologies for large biologicaldata sets
Interpretable graphical outputs
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
mixOmics
mixOmics is an R package dedicated to the exploration and theintegrative analyses of high dimensional biological data sets.
Website- R tutorials- Newsletter
Web Interface- User friendly interface- Comprehensive results page
-Lê Cao et al. (2009) integRomics/mixOmics: an R package to unravel
relationships between two omics data sets, Bioinformatics
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
mixOmics
One data set (n x p)
Two matching data sets
(n x p) & (n x q)
PCA IPCA
PLS-DA
PLS rCCA
Internal variable selection
Multivariate approach Data
sPCA sIPCA
sPLS-DA
sPLS canonical
Graphical outputs
PLS sPLS
regression
Missing values (n x p)
NIPALS imputation
Sample plots
Variable plots
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Introduction with PCA
Principal Component Analysis: PCA
Seek the best directions in the data that account for most of thevariability
→ principal components: arti�cial variablesthat are linear combinations of the originalvariables:
c = X v
(n) (nxp) (p)−1.0 −0.5 0.0 0.5 1.0
−0.4
−0.2
0.0
0.2
0.4
x1
x2
c is a linear function of the elements of X having maximalvariance
v is called the associated loading vector
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Introduction with PCA
Principal components cont.
The new PCs form a vectorialsubspace of dimension < p
Project the data on these newaxes.
−1.0 −0.5 0.0 0.5 1.0
−0.4
−0.2
0.0
0.2
0.4
z1
z2
→ approximate representation of the data points in a lowerdimensional space
Problem:Interpretation di�cult with very large number of (possibly)irrelevant variables
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Introduction with PCA
sparse Principal Component Analysis: sPCA
Principal components
−1.0 −0.5 0.0 0.5 1.0
−0.4
−0.2
0.0
0.2
0.4
x1
x2
loading vectors(PCA)
0 20 40 60 80 100
0.0
00
.05
0.1
00
.15
loadin
g 1
0 20 40 60 80 100
0.0
00
.05
0.1
00
.15
loadin
g 2
sparse loading vectors(sPCA)
0 20 40 60 80 100
0.0
00
.05
0.1
00
.15
0 20 40 60 80 100
0.0
00
.05
0.1
00
.15
loadin
g 1
loadin
g 2
The principal components are linear combinations of the originalvariables, variables weights are de�ned in the associated loadingvectors.sparse PCA computes the sparse loading vectors to removeirrelevant variables using lasso penalizations (Shen & Huang 2008, J.
Multivariate Analysis).
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Independent Principal Component Analysis
IPCA is based on Independent Component Analysis (ICA):
assumes non Gaussian data distribution ( 6= PCA).
`blind source' signal separation.
seeks for a set of independent components ( 6= PCA).
Combines the advantages of both PCA and ICA.
The PCA loadings are transformed via ICA to obtainindependent loading vectors and independent principalcomponents.
sparse IPCA also developed to select the variable contributingto the independent loading vectors
Yao, F. Coquery, J. and Lê Cao, K-A. 2012 Independent Principal Component
Analysis for biologically meaningful dimension reduction of large biological data sets,
BMC Bioinformatics.
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Independent Principal Component Analysis
Illustration of PCA and IPCASample representation for the kidney transplant study (3 groups ofrejection status patients)
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Discriminant Analysis
PLS - Discriminant Analysis
Similarly to Linear Discriminant Analysis, classical PLS-DAlooks for the best components to separate the sample groups.
As opposed to PCA/IPCA methods, it is a supervisedapproach.
In addition to this, sPLS-DA searches for discriminativevariables that can help separating the sample groups.
Evaluation of the discriminative power of the selected variablesusing external data sets or cross-validation.
Lê Cao K-A., Boitard S. and Besse P. (2011) Sparse PLS Discriminant Analysis:
biologically relevant feature selection and graphical displays for multiclass problems,
BMC Bioinformatics, 12:253.
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Discriminant Analysis
Illustration of sPLS-DA
number of selected genes
erro
r ra
te
1 3 5 7 9 20 40 60 80 100 300 600 900
0.00
0.05
0.10
0.15
0.20
comp 1comp 1:2comp 1:3
Genomics − training data
●
Figure: Tuning the number of var.
to select with cross-validation
NR
−0.2 −0.1 0.0 0.1 0.2 0.3
−0
.4−
0.2
0.0
0.2
0.4
0.6
Comp 1
Co
mp
2
Genomics
Train
AR
NR
Test
AR
Figure: Predicting the class of the
test set samples
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Motivation
Aims:
unravel the correlation stucture between two data sets
select co-regulated biological entities across samples
→ select and integrate in a one step procedure the di�erent typesof data
Partial Least Squares regression maximises the covariancebetween each linear combination (components) associated toeach data set
sparse PLS has been developed to include variable selectionfrom both data sets
Two modes are proposed to model the relationship betweenthe two data sets ('regression' and 'canonical')
Lê Cao et al. (2008), SAGMB, Lê Cao et al. (2009), BMC Bioinformatics
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Integrating genomics and proteomics
Illustration of sparse PLS: sample plotsPLS aims at selecting correlated variables (genes, proteins) acrossthe same samples by performing a multivariate regression.Regression: explain the protein abundance w.r.t the gene expression�⇒ relationship�.
The latent variables (components)are determined based on theselected genes and proteins→ give more insight into thesamples similarities.
Unsupervised approach
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Integrating genomics and proteomics
Illustration of sparsePLS: variable plot
Relevance networks are bipartite graphs directlyinferred from the sPLS components.
Some other insightful graphical outputs:
correlationcircle plots
clusteredimage maps
−1.0 −0.5 0.0 0.5 1.0
−1
.0−
0.5
0.0
0.5
1.0
Dimension 1
Dim
en
sio
n 2
Protein groupProbe set
González I., Lê Cao K.-A., Davis, M.D. and Déjean S. Visualising association between
paired `omics' data sets. In revision.
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Comparing two proteomics platforms
Illustration of sPLS canonical modeSelects correlated variables across the same samples and highlightsthe correlation structure between the two data sets.Canonical mode: �⇔ relationship�
H460 NS
RE
MEL
OV
RE
RELEU
NS
COL
COLBR
LEU
LEU
MEL
MEL MEL
MEL
BR
BR
MEL
MEL
RECNSOV
BRCNS
CNS
NSNS
BR
OV
NSNSCNSMEL
PRRERE
RE
RE
CNS BR
LEU
LEULEU
COLCOL
COL COL
COLOV
NSOVPROV
BR NS
COL
LE
MEL
NS
OV
PR
RE
BR
CNS
Figure: Arrow plot to highlight the similarties between 2 data sets
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Multilevel analysis
Cross-over design
One data set repeated
measurements (n x p)
Two matching data sets
repetated measurements (n x p) & (n x q)
multilevel PLS-DA
multilevel PLS
Internal variable selection
Multivariate approach Data
multilevel sPLS-DA
multilevel sPLS
Graphical outputs
supervised
Sample plots
Variable plots
canonical, 1 level
A novel approach for cross-over designs (repeated measurements up to 2 cross factors)
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Multilevel analysis
Illustration of multilevel sPLS-DA
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
−0.2 −0.1 0.0 0.1 0.2
−0.
2−
0.1
0.0
0.1
0.2
X−variate 1
X−
varia
te 2
●
●
●
●
GAG+
GAG−
LIPO5
NS
● t1
t2
gene517gene541gene589gene550gene567gene551gene579gene506gene540gene557gene527gene534gene509gene564gene528gene518gene553gene531gene502gene537gene546gene504gene524gene591gene519gene583gene585gene503gene521gene520gene582gene552gene568gene549gene576gene563gene539gene555gene588gene571gene529gene525gene511gene587gene515gene584gene526gene570gene594gene505gene574gene535gene566gene510gene580gene577gene538gene554gene569gene572gene530gene599gene507gene575gene560gene542gene501gene543gene597gene513gene547gene562gene595gene598gene516gene561gene578gene508gene558gene593gene536gene556gene532gene559gene573gene512gene592gene533gene586gene565gene522gene514gene600gene523gene544gene548gene581gene590gene545gene596gene19gene82gene85gene27gene92gene13gene99gene38gene81gene88gene4gene9gene28gene98gene37gene75gene76gene45gene83gene32gene55gene78gene39gene51gene20gene71gene30gene96gene63gene14gene18gene53gene73gene87gene2gene91gene25gene69gene23gene48gene43gene56gene22gene47gene59gene84gene15gene65gene1gene3gene34gene58gene80gene86gene24gene36gene49gene90gene7gene44gene67gene12gene29gene95gene17gene50gene66gene26gene64gene31gene21gene61gene5gene11gene40gene89gene97gene74gene77gene94gene100gene54gene35gene33gene62gene46gene93gene10gene52gene79gene8gene57gene16gene72gene6gene60gene68gene70gene41gene42gene613gene674gene673gene694gene603gene692gene671gene655gene606gene647gene621gene622gene654gene637gene662gene699gene635gene663gene665gene653gene664gene628gene641gene678gene632gene634gene642gene630gene616gene681gene614gene685gene652gene688gene644gene689gene672gene700gene617gene677gene650gene633gene640gene625gene631gene626gene660gene612gene636gene675gene618gene624gene697gene698gene646gene676gene607gene683gene696gene619gene656gene645gene651gene684gene680gene623gene668gene610gene679gene643gene670gene611gene686gene608gene602gene609gene620gene639gene695gene629gene666gene627gene605gene649gene669gene657gene604gene661gene658gene691gene667gene601gene659gene638gene648gene682gene687gene693gene615gene690gene160gene104gene200gene149gene190gene138gene151gene165gene132gene199gene120gene121gene179gene194gene177gene129gene123gene131gene148gene158gene171gene146gene192gene106gene124gene168gene170gene188gene162gene108gene118gene155gene173gene134gene139gene143gene150gene175gene145gene195gene122gene166gene140gene182gene187gene102gene128gene110gene147gene159gene164gene191gene105gene136gene111gene107gene135gene176gene153gene193gene183gene184gene125gene197gene185gene117gene161gene172gene101gene180gene144gene126gene156gene154gene109gene167gene137gene169gene133gene186gene103gene174gene115gene141gene130gene127gene198gene119gene142gene112gene181gene152gene157gene163gene196gene114gene116gene189gene113gene178gene837gene865gene812gene870gene817gene883gene805gene856gene839gene841gene871gene885gene802gene873gene808gene843gene851gene823gene828gene831gene818gene878gene844gene891gene824gene860gene880gene807gene884gene835gene834gene832gene866gene874gene894gene801gene819gene869gene889gene857gene863gene809gene872gene877gene836gene826gene876gene850gene868gene806gene820gene867gene887gene811gene845gene859gene846gene848gene886gene840gene861gene879gene897gene888gene892gene825gene890gene862gene875gene833gene881gene816gene827gene895gene830gene858gene810gene852gene900gene803gene849gene340gene381gene392gene309gene308gene350gene346gene366gene321gene373gene303gene329gene304gene400gene313gene367gene377gene398gene368gene355gene384gene347gene333gene361gene375gene383gene317gene343gene386gene306gene319gene348gene318gene349gene387gene332gene327gene363gene372gene396gene335gene388gene359gene378gene399gene311gene324gene315gene334gene330gene364gene357gene314gene316gene345gene322gene337gene344gene360gene353gene312gene385gene325gene382gene389gene354gene379gene352gene393gene358gene331gene391gene380gene390gene336gene310gene365gene339gene376gene301gene371gene320gene362gene394gene338gene307gene374gene326gene305gene328gene323gene351gene302gene370gene397gene342gene395gene341gene356
Sample123456789101112
StimulationGAG+GAG−LIPO5NS
−2
0
2
4
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Conclusions
Conclusions
The exploratory and integrative approaches are:
�exible and can answer various types of questions.
can highlight the potential of the data.
enable to generate new hypotheses to be further investigated.
Future work includes:
Cross-platform comparison
Integration of multiple data sets (unsupervised and supervised)
Time-course experiments
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Conclusions
Acknowledgements
mixOmics teamSébastien Déjean Univ. Tlse
Ignacio González Univ. Tlse
Xin Yi Chua QFAB
Other contributors to mixOmics
Je� Coquery QFAB, Sup'Biotech
Benoit Liquet Univ. Bordeaux
Eric Fangzhou Yao QFAB, Shanghai Univ ofFinance and Economics
Pierre Monget QFAB, CESI
Kim-Anh Lê Cao
mixOmics
Introduction Concept of mixOmics Single Omics Analysis Integrative Omics Analysis Recent developments Conclusions
Questions?
Questions?
http://mixomics.qfab.orgmixomics@math.univ-toulouse.fr
k.lecao@uq.edu.au
Kim-Anh Lê Cao
mixOmics
top related