Outline: Introduction, Concept of mixOmics, Single Omics Analysis, Integrative Omics Analysis, Recent developments, Conclusions

Unravelling `omics' data with the mixOmics R package

Illustration on several studies

Kim-Anh Lê Cao

Queensland Facility for Advanced Bioinformatics

The University of Queensland

Kim-Anh Lê Cao

mixOmics

Challenges

The issue with integrative systems biology

Unlimited quantity of data (n << p problem)

Data from multiple sources

→ Efficient and biologically relevant statistical methodologies are needed to combine the information in these heterogeneous data sets.

The biological questions

Single Omics analysis

Do we observe a `natural' separation between the different groups of patients?

Can we identify potential biomarker candidates predicting the status of the patients?

Integrative Omics analysis

Can we identify a subset of correlated genes and proteins from matching data sets?

Can we predict the abundance of a protein given the expression of a small subset of genes?

Do two matching omics data sets contain the same information?

The data

n = number of patients

p, q = number of biological features (genes, proteins, ...)

Single Omics analysis

one omics data set X (n x p)

for a supervised analysis, a vector Y indicating the class of the patients

Integrative Omics analysis

two matching omics data sets (measured on the same patients)

X (n x p) and Z (n x q)

Multivariate analysis

Linear multivariate approaches enable:

Dimension reduction → project the data into a smaller subspace

To handle multicollinear, irrelevant, missing variables

To capture experimental and biological variation

In particular, in mixOmics, focus is on:

Data integration

Variable selection

Computationally efficient methodologies for large biological data sets

Interpretable graphical outputs

mixOmics is an R package dedicated to the exploration and the integrative analysis of high dimensional biological data sets.

Website: R tutorials, newsletter

Web interface: user-friendly interface, comprehensive results page

Lê Cao et al. (2009) integrOmics/mixOmics: an R package to unravel relationships between two omics data sets, Bioinformatics

Overview of the methods, by type of data:

One data set (n x p)
    Multivariate approach: PCA, IPCA, PLS-DA
    Internal variable selection: sPCA, sIPCA, sPLS-DA

Two matching data sets (n x p) & (n x q)
    Multivariate approach: PLS, rCCA
    Internal variable selection: sPLS
    PLS and sPLS modes: regression, canonical

Missing values (n x p)
    NIPALS imputation

Graphical outputs: sample plots, variable plots

Introduction with PCA

Principal Component Analysis: PCA

Seek the best directions in the data that account for most of the variability

→ principal components: artificial variables that are linear combinations of the original variables:

c = X v, with c of size (n), X of size (n x p) and v of size (p)

c is a linear function of the elements of X having maximal variance

v is called the associated loading vector
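
The definition c = X v can be illustrated with a minimal Python sketch. It is not taken from the slides or from the mixOmics code: the function names are hypothetical, X is a small list of rows, and power iteration is used here only as one simple way to find the loading vector v that maximises the variance of c.

```python
import math

def center_columns(X):
    """Subtract each column mean so that every variable has mean zero."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def first_principal_component(X, n_iter=200):
    """Return (c, v) with c = X v, where v is the unit-norm first loading vector."""
    Xc = center_columns(X)
    n, p = len(Xc), len(Xc[0])
    # Start from the centred row with the largest norm (assumption: it is not
    # orthogonal to the leading direction, which holds for typical data).
    start = max(Xc, key=lambda row: sum(x * x for x in row))
    norm = math.sqrt(sum(x * x for x in start))
    if norm == 0.0:
        raise ValueError("X has no variability")
    v = [x / norm for x in start]
    for _ in range(n_iter):
        c = [sum(row[j] * v[j] for j in range(p)) for row in Xc]        # c = X v
        w = [sum(Xc[i][j] * c[i] for i in range(n)) for j in range(p)]  # w = X' c
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]                                       # renormalise
    c = [sum(row[j] * v[j] for j in range(p)) for row in Xc]
    return c, v

# Toy data: two perfectly correlated variables measured on four samples.
X = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
c, v = first_principal_component(X)
```

On this toy data both variables receive the same weight in v, up to sign, because they carry the same information.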

Principal components cont.

The new PCs form a vector subspace of dimension < p

Project the data onto these new axes.

→ approximate representation of the data points in a lower dimensional space

Problem: interpretation is difficult with a very large number of (possibly) irrelevant variables

sparse Principal Component Analysis: sPCA

[Figure: principal components, with the loading vectors (PCA) compared to the sparse loading vectors (sPCA)]

The principal components are linear combinations of the original variables; the variable weights are defined in the associated loading vectors.

Sparse PCA computes sparse loading vectors to remove irrelevant variables using lasso penalisation (Shen & Huang 2008, J. Multivariate Analysis).
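
The effect of a lasso penalty on a loading vector can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the mixOmics implementation: the function names and the penalty value `lam` are hypothetical, X is assumed to be column-centred, and the loop alternates between computing a component and soft-thresholding the loadings, in the spirit of the rank-one approach of Shen & Huang.

```python
import math

def soft_threshold(a, lam):
    """Lasso (soft-thresholding) operator: shrink a towards zero by lam."""
    return math.copysign(max(abs(a) - lam, 0.0), a)

def sparse_loading(X, lam, n_iter=100):
    """Sparse first loading vector for a column-centred matrix X (list of rows)."""
    n, p = len(X), len(X[0])
    # Hypothetical starting point: the row of X with the largest norm.
    v = list(max(X, key=lambda row: sum(x * x for x in row)))
    for _ in range(n_iter):
        c = [sum(row[j] * v[j] for j in range(p)) for row in X]        # c = X v
        norm_c = math.sqrt(sum(x * x for x in c))
        if norm_c == 0.0:
            break  # every loading has been shrunk to zero
        u = [x / norm_c for x in c]                                    # unit component
        z = [sum(X[i][j] * u[i] for i in range(n)) for j in range(p)]  # z = X' u
        v = [soft_threshold(zj, lam) for zj in z]                      # lasso step
    norm_v = math.sqrt(sum(x * x for x in v))
    return [x / norm_v for x in v] if norm_v > 0.0 else v

# Two informative variables and one weakly varying (irrelevant) variable.
X = [[2.0, 2.0, 0.1], [-2.0, -2.0, -0.1], [1.0, 1.0, -0.1], [-1.0, -1.0, 0.1]]
v_sparse = sparse_loading(X, lam=0.5)   # third loading is set exactly to zero
v_dense = sparse_loading(X, lam=0.0)    # no penalty: third loading stays non-zero
```

With the penalty switched on, the weight of the irrelevant variable becomes exactly zero, which is what makes the sparse component easier to interpret.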

Independent Principal Component Analysis

IPCA is based on Independent Component Analysis (ICA):

assumes a non-Gaussian data distribution (≠ PCA).

`blind source' signal separation.

seeks a set of independent components (≠ PCA).

Combines the advantages of both PCA and ICA.

The PCA loadings are transformed via ICA to obtain independent loading vectors and independent principal components.

Sparse IPCA was also developed to select the variables contributing to the independent loading vectors.

Yao F., Coquery J. and Lê Cao K-A. (2012) Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets, BMC Bioinformatics.

Independent Principal Component Analysis

Illustration of PCA and IPCA

Sample representation for the kidney transplant study (3 groups of patients defined by rejection status)

Discriminant Analysis

PLS - Discriminant Analysis

Similarly to Linear Discriminant Analysis, classical PLS-DA looks for the best components to separate the sample groups.

As opposed to PCA/IPCA methods, it is a supervised approach.

In addition, sPLS-DA searches for discriminative variables that can help separate the sample groups.

The discriminative power of the selected variables is evaluated using external data sets or cross-validation.

Lê Cao K-A., Boitard S. and Besse P. (2011) Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems, BMC Bioinformatics, 12:253.
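
PLS-DA is usually set up by recoding the class vector Y as an indicator (dummy) matrix, so that a PLS regression can then be fitted on it. The Python sketch below shows only that recoding step; the function name and the class labels are hypothetical and are not taken from the slides.

```python
def dummy_matrix(y):
    """Recode a vector of class labels as a 0/1 indicator matrix.

    Returns the sorted class labels and one row per sample, with a 1 in the
    column of the class the sample belongs to.
    """
    classes = sorted(set(y))
    rows = [[1 if label == c else 0 for c in classes] for label in y]
    return classes, rows

# Four samples from three hypothetical patient groups.
classes, Y = dummy_matrix(["A", "B", "A", "C"])
```

Each column of the indicator matrix then plays the role of a response variable, which is how a regression method can be used for a classification problem.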

Discriminant Analysis

Illustration of sPLS-DA

Figure: Tuning the number of variables to select with cross-validation (Genomics, training data; x-axis: number of selected genes; curves: comp 1, comp 1:2, comp 1:3)

Figure: Predicting the class of the test set samples (Genomics; x-axis: Comp 1)

Motivation

unravel the correlation structure between two data sets

select co-regulated biological entities across samples

→ select and integrate the different types of data in a one-step procedure

Partial Least Squares regression maximises the covariance between the linear combinations (components) associated with each data set

sparse PLS has been developed to include variable selection from both data sets

Two modes are proposed to model the relationship between the two data sets ('regression' and 'canonical')

Lê Cao et al. (2008), SAGMB, Lê Cao et al. (2009), BMC Bioinformatics
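
The covariance criterion can be made concrete with a short Python sketch for the first pair of components only. It is an illustration under stated assumptions rather than the mixOmics algorithm: the function names are hypothetical, both data sets are column-centred, and the weight vectors u and w that maximise the covariance between t = X u and s = Z w are found by power iteration on the cross-product matrix X'Z.

```python
import math

def center_columns(X):
    """Subtract each column mean so that every variable has mean zero."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def unit(vec):
    """Scale a vector to unit norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("zero vector: X and Z have no cross-covariance")
    return [x / norm for x in vec]

def first_pls_pair(X, Z, n_iter=200):
    """Return (u, w, t, s): unit weight vectors and components t = X u, s = Z w."""
    Xc, Zc = center_columns(X), center_columns(Z)
    n, p, q = len(Xc), len(Xc[0]), len(Zc[0])
    # Cross-product matrix M = X' Z (p x q).
    M = [[sum(Xc[i][j] * Zc[i][k] for i in range(n)) for k in range(q)]
         for j in range(p)]
    # Hypothetical starting point: the row of M with the largest norm.
    w = unit(max(M, key=lambda row: sum(x * x for x in row)))
    for _ in range(n_iter):
        u = unit([sum(M[j][k] * w[k] for k in range(q)) for j in range(p)])  # u ~ M w
        w = unit([sum(M[j][k] * u[j] for j in range(p)) for k in range(q)])  # w ~ M' u
    t = [sum(row[j] * u[j] for j in range(p)) for row in Xc]
    s = [sum(row[k] * w[k] for k in range(q)) for row in Zc]
    return u, w, t, s

# Toy data: the single Z variable follows the first X variable only.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, -1.0], [4.0, 1.0]]
Z = [[2.0], [4.0], [6.0], [8.0]]
u, w, t, s = first_pls_pair(X, Z)
```

In this toy example all the weight in u goes to the first X variable, the only one that covaries with Z.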

Integrating genomics and proteomics

Illustration of sparse PLS: sample plots

PLS aims at selecting correlated variables (genes, proteins) across the same samples by performing a multivariate regression.

Regression: explain the protein abundance with respect to the gene expression ("⇒ relationship").

The latent variables (components) are determined based on the selected genes and proteins → give more insight into the similarities between samples.

Unsupervised approach

Integrating genomics and proteomics

Illustration of sparse PLS: variable plot

Relevance networks are bipartite graphs directly inferred from the sPLS components.

Some other insightful graphical outputs:

correlation circle plots

clustered image maps

[Figure: variable plot; axis: Dimension 1; legend: protein group, probe set]

González I., Lê Cao K.-A., Davis, M.D. and Déjean S. Visualising association between paired `omics' data sets. In revision.

Comparing two proteomics platforms

Illustration of sPLS canonical mode

Selects correlated variables across the same samples and highlights the correlation structure between the two data sets.

Canonical mode: "⇔ relationship"

Figure: Arrow plot to highlight the similarities between the 2 data sets

Multilevel analysis

Cross-over design

One data set, repeated measurements (n x p)
    Multivariate approach: multilevel PLS-DA (supervised)
    Internal variable selection: multilevel sPLS-DA

Two matching data sets, repeated measurements (n x p) & (n x q)
    Multivariate approach: multilevel PLS (canonical, 1 level)
    Internal variable selection: multilevel sPLS

Graphical outputs: sample plots, variable plots

A novel approach for cross-over designs (repeated measurements up to 2 cross factors)
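
The slides do not detail how the repeated measurements are handled. A common way to treat such designs, assumed here for a single factor, is to remove each subject's mean profile so that only the within-subject variation is analysed. The Python sketch below illustrates that step only; the function name and the toy values are hypothetical.

```python
from collections import defaultdict

def within_subject_matrix(X, subjects):
    """Subtract from every row the mean profile of the subject it belongs to.

    X is a list of rows (one row per measurement), and subjects gives the
    subject identifier of each row.
    """
    p = len(X[0])
    sums = defaultdict(lambda: [0.0] * p)
    counts = defaultdict(int)
    for row, subj in zip(X, subjects):
        counts[subj] += 1
        for j in range(p):
            sums[subj][j] += row[j]
    return [[row[j] - sums[subj][j] / counts[subj] for j in range(p)]
            for row, subj in zip(X, subjects)]

# Two subjects measured twice each; subject "b" has a much higher baseline.
X = [[10.0, 1.0], [12.0, 3.0], [100.0, 5.0], [104.0, 9.0]]
Xw = within_subject_matrix(X, ["a", "a", "b", "b"])
# Xw is [[-1.0, -1.0], [1.0, 1.0], [-2.0, -2.0], [2.0, 2.0]]
```

After this step the large baseline difference between the two subjects no longer dominates, and the change between the repeated measurements becomes visible.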

Multilevel analysis

Illustration of multilevel sPLS-DA

[Figure: sample plot on X-variate 1 and plot of the selected genes; legend: samples 1 to 12; stimulation: GAG+, GAG−, LIPO5, NS]

Conclusions

The exploratory and integrative approaches:

are flexible and can answer various types of questions.

can highlight the potential of the data.

help generate new hypotheses to be further investigated.

Future work includes:

Cross-platform comparison

Integration of multiple data sets (unsupervised and supervised)

Time-course experiments

Conclusions

Acknowledgements

mixOmics team

Sébastien Déjean, Univ. Tlse

Ignacio González, Univ. Tlse

Xin Yi Chua, QFAB

Other contributors to mixOmics

Jeff Coquery, QFAB, Sup'Biotech

Benoit Liquet, Univ. Bordeaux

Eric Fangzhou Yao, QFAB, Shanghai Univ. of Finance and Economics

Pierre Monget, QFAB, CESI

Questions?

http://mixomics.qfab.org

mixomics@math.univ-toulouse.fr

k.lecao@uq.edu.au
