High Dimensional Biological Data Analysis and Visualization
DESCRIPTION
Examples of data analysis and visualization of high dimensional metabolomic data.
TRANSCRIPT
![Page 1: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/1.jpg)
Dmitry Grapov, PhD
Metabolomic Data Analysis for the Study of Diseases
![Page 2: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/2.jpg)
State of the art facility producing massive amounts of biological data…
>13,000 samples/yr
>160 studies
~32,000 data points/study
![Page 3: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/3.jpg)
Goals?
![Page 4: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/4.jpg)
Analysis at the Metabolomic Scale
![Page 5: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/5.jpg)
Univariate vs. Multivariate
Univariate: hypothesis testing (t-Test, ANOVA, etc.), e.g. Group 1 vs. Group 2
Multivariate: PCA
Predictive Modeling: O-/PLS/-DA
![Page 6: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/6.jpg)
Univariate vs. Multivariate
univariate/bivariate vs. multivariate
mixed up samples? outliers?
![Page 7: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/7.jpg)
Data Complexity
[Figure: data matrix of n samples by m variables; complexity grows from 1-D to 2-D to m-D]
Meta Data = Experimental Design
Variable # = dimensionality
![Page 8: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/8.jpg)
Statistical Analysis
• Identify differences in sample population means
• sensitive to distribution shape
• parametric = assumes normality
• error in Y, not in X (Y = mX + error)
• optimal for long data
• assumed independence
• false discovery rate (FDR)
[Figure: data shapes: long, wide, n-of-one]
![Page 9: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/9.jpg)
Achieving “significance” is a function of:
significance level (α) and power (1-β )
effect size (standardized difference in means)
sample size (n)
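The interplay of the four quantities above can be sketched with the standard normal approximation for a two-sided, two-sample comparison of means. This is a rough planning aid under stated assumptions, not the method used in the talk: the function name `samples_per_group` and the defaults (α = 0.05, power = 0.80) are illustrative choices, and an exact t-test calculation gives slightly larger n for small samples.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate samples per group for a two-sided, two-sample test of means.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2,
    where d is the standardized difference in means (effect size).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Smaller effects need many more samples to reach the same power.
for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: {samples_per_group(d)} samples per group")
```

Halving the effect size roughly quadruples the required sample size, which is why small metabolic differences are hard to call "significant" in small studies.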
![Page 10: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/10.jpg)
False Discovery Rate (FDR)
• Type I Error: False Positives
• Type II Error: False Negatives
• Type I risk = $1 - (1 - \text{p.value})^{m}$, where m = number of variables tested
FDR correction
• p-value adjustment or estimate of FDR (Fdr, q-value)
Bioinformatics (2008) 24 (12):1461-1462
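A minimal sketch of the Type I risk formula, assuming the m tests are independent and that "p.value" in the formula means the per-test significance threshold (called `alpha` here; the function name is illustrative):

```python
def familywise_error(alpha, m):
    """Chance of at least one false positive across m independent tests.

    alpha: per-test significance threshold ("p.value" in the slide formula)
    m: number of variables tested
    """
    return 1 - (1 - alpha) ** m


# The risk climbs quickly with the number of variables tested.
for m in (1, 10, 148):
    print(f"m = {m}: risk = {familywise_error(0.05, m):.4f}")
```

With a few hundred metabolites tested at 0.05, at least one false positive is close to certain, which motivates the corrections that follow.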
![Page 11: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/11.jpg)
FDR correction
[Figure: FDR adjusted p-value vs. p-value]
Benjamini & Hochberg (1995) (“BH”)
• Accepted standard
Bonferroni
• Very conservative
• adjusted p-value = p-value * # of tests (e.g. 0.005 * 148 = 0.74)
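Both adjustments are short enough to write out. The sketch below uses only the standard library; the function names and the example p-values are illustrative, not taken from the talk.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """BH adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
print("Bonferroni:", bonferroni(pvalues))
print("BH:        ", benjamini_hochberg(pvalues))
```

BH scales each p-value by m / rank rather than by m, so it is never more conservative than Bonferroni on the same data.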
![Page 12: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/12.jpg)
Multivariate Analysis
Clustering
• Grouping based on similarity/dissimilarity
Principal Components Analysis (PCA)
• Identify modes of variance in the data
Partial Least Squares (PLS)
• Identify modes of variance in the data correlated with a hypothesis
![Page 13: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/13.jpg)
Cluster Analysis
Use similarity/dissimilarity to group a collection of samples or variables
Approaches
• hierarchical (HCA)
• non-hierarchical (k-NN, k-means)
• distribution (mixture models)
• density (DBSCAN)
• self organizing maps (SOM)
[Figure panels: Linkage, k-means, Distribution, Density]
![Page 14: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/14.jpg)
Hierarchical Cluster Analysis
similarity/dissimilarity defines “nearness” or distance
objects are grouped based on linkage methods
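The two ingredients, a distance and a linkage rule, can be sketched as a naive agglomerative procedure. It assumes Euclidean distance and a hypothetical function name; it is written for readability rather than speed, and real analyses would normally use a numerical library.

```python
from math import dist  # Euclidean distance, Python 3.8+


def hierarchical_clusters(points, n_clusters, linkage="average"):
    """Merge the two nearest clusters until n_clusters remain.

    Returns clusters as sorted lists of point indices.
    """
    if linkage not in ("single", "complete", "average"):
        raise ValueError("unknown linkage: " + linkage)
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")

    clusters = [[i] for i in range(len(points))]  # start: every point alone

    def cluster_distance(a, b):
        pairwise = [dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":      # nearest members
            return min(pairwise)
        if linkage == "complete":    # farthest members
            return max(pairwise)
        return sum(pairwise) / len(pairwise)  # average over all pairs

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (ties broken by index).
        _, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical samples measured on two variables.
samples = [(1.0, 1.1), (1.2, 0.9), (5.0, 5.2), (5.1, 4.9), (9.0, 0.5)]
print(hierarchical_clusters(samples, n_clusters=3))
```

Changing `linkage` changes how cluster-to-cluster distance is defined, which can change the grouping even when the point-to-point distances are identical.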
![Page 15: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/15.jpg)
Hierarchy of Similarity
[Figure: axis label: Similarity]
How does my metadata match my data structure?
Hierarchy of effect sizes
![Page 16: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/16.jpg)
Projection of Data
The algorithm defines the position of the light source
Principal Components Analysis (PCA)
• unsupervised
• maximize variance (X)
Partial Least Squares Projection to Latent Structures (PLS)
• supervised
• maximize covariance (Y ~ X)
James X. Li, 2009, VisuMap Tech.
[Figure: raw data projected onto PCA dimensions PC1 and PC2]
http://www.scholarpedia.org/article/Eigenfaces
![Page 17: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/17.jpg)
Interpreting PCA Results
Variance explained (eigenvalues)
Row (sample) scores and column (variable) loadings
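The three outputs named above can be produced with a small, dependency-free sketch. It finds the leading eigenvectors of the covariance matrix by power iteration with deflation, a technique swapped in here to avoid external libraries (production code would typically use an SVD routine). The function name, defaults, and example data are assumptions.

```python
import random
from math import sqrt


def pca(rows, n_components=2, n_iter=500, seed=0):
    """Return (scores, loadings, explained variance fractions)."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    cov = [[sum(x[a] * x[b] for x in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    total_variance = sum(cov[j][j] for j in range(m))
    if total_variance == 0:
        raise ValueError("data have no variance")

    rng = random.Random(seed)
    loadings, eigenvalues = [], []
    for component in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(m)]  # random start vector
        norm = sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        for step in range(n_iter):
            w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
            norm = sqrt(sum(c * c for c in w))
            if norm <= 1e-12 * total_variance:  # no variance left to explain
                break
            v = [c / norm for c in w]
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(m))
                         for a in range(m))
        loadings.append(v)
        eigenvalues.append(eigenvalue)
        # Deflate: remove the variance captured by this component.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(m)]
               for a in range(m)]

    # Scores are the centered data projected onto the loadings.
    scores = [[sum(x[j] * v[j] for j in range(m)) for v in loadings]
              for x in centered]
    explained = [ev / total_variance for ev in eigenvalues]
    return scores, loadings, explained


# Hypothetical data: 5 samples (rows) by 3 variables (columns).
data = [[2.0, 4.1, 1.0], [3.0, 6.2, 0.8], [4.0, 7.9, 1.1],
        [5.0, 10.1, 0.9], [6.0, 12.0, 1.2]]
scores, loadings, explained = pca(data)
print("variance explained:", [round(e, 3) for e in explained])
```

Each sample score is the dot product of the centered sample with a loading vector, so large loadings identify the variables that drive a sample's position on that component. The sign of a component is arbitrary.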
![Page 18: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/18.jpg)
How are scores and loadings related?
![Page 19: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/19.jpg)
Centering and Scaling
PMID: 16762068
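A minimal sketch of column-wise centering with two scaling options widely used in metabolomics, autoscaling (divide by the standard deviation) and Pareto scaling (divide by its square root). Which methods the slide illustrated is not recoverable from the transcript, so the selection and the function name are assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def scale_columns(rows, method="auto"):
    """Center each variable (column), then scale it.

    method: "center" (no scaling), "auto" (unit variance), "pareto" (sqrt of SD)
    """
    scaled_columns = []
    for column in zip(*rows):  # iterate over variables
        center = mean(column)
        sd = stdev(column)     # sample standard deviation
        if method == "center":
            divisor = 1.0
        elif method == "auto":
            divisor = sd
        elif method == "pareto":
            divisor = sqrt(sd)
        else:
            raise ValueError("unknown method: " + method)
        if divisor == 0:
            raise ValueError("cannot scale a constant variable")
        scaled_columns.append([(x - center) / divisor for x in column])
    return [list(row) for row in zip(*scaled_columns)]


# Two hypothetical metabolites on very different concentration scales.
data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
for method in ("center", "auto", "pareto"):
    print(method, scale_columns(data, method))
```

Without scaling, the high-abundance variable dominates any variance-based method such as PCA. Autoscaling weights all variables equally, while Pareto scaling is a compromise that keeps some of the original magnitude.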
![Page 20: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/20.jpg)
Use PLS to test a hypothesis
Partial Least Squares (PLS) is used to identify planes of maximum correlation between X measurements and Y (hypothesis)
[Figure: samples at time = 0 and 120 min.; panels: PCA, PLS]
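For a single Y, the first PLS latent variable has a closed form: the weight vector is proportional to the covariance of each X variable with Y. The sketch below computes only that first component (full PLS repeats the step on deflated data). The function name and the toy data, with Y coded 0/1 for two time points, are assumptions.

```python
from math import sqrt


def pls_first_component(X, y):
    """Return (weights, scores, slope) for the first PLS component.

    X: samples by variables, y: one response value per sample.
    """
    n, m = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(m)] for row in X]
    yc = [value - y_mean for value in y]

    # Weight of each variable is proportional to its covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
    norm = sqrt(sum(c * c for c in w))
    if norm == 0:
        raise ValueError("no X variable covaries with y")
    w = [c / norm for c in w]

    # Scores: projection of each sample onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(m)) for i in range(n)]
    # Slope of the regression of centered y on the scores.
    slope = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return w, t, slope


# Hypothetical data: variable 1 changes with time, variable 2 does not.
X = [[1.0, 5.0], [1.2, 4.0], [3.1, 4.0], [2.9, 5.0]]
y = [0, 0, 1, 1]  # 0 = baseline, 1 = later time point
weights, scores, slope = pls_first_component(X, y)
print("weights:", [round(v, 3) for v in weights])
```

Unlike PCA, the direction is chosen using Y, so the component can highlight variables that track the hypothesis even when they carry little of the total variance.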
![Page 21: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/21.jpg)
PLS model validation is critical
Determine in-sample (Q2) and out-of-sample error (RMSEP) and compare to a random model
• permutation tests
• training/testing
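A permutation test compares the observed fit with fits obtained after shuffling Y, which breaks any real X to Y relationship. In the sketch below, `r_squared` on a single predictor stands in for a model-quality statistic such as Q2; in a real PLS validation the whole model would be refit for every permutation. The names, the number of permutations, and the data are assumptions.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def permutation_p_value(x, y, statistic=r_squared, n_permutations=999, seed=0):
    """Return (observed statistic, permutation p-value)."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # random model: labels no longer match samples
        if statistic(x, shuffled) >= observed:
            hits += 1
    # Add one to both counts so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)


# Hypothetical model scores and responses.
x = [0.1, 0.4, 0.35, 0.8, 0.9, 1.2, 1.1, 1.6]
y = [1.0, 1.3, 1.2, 2.1, 2.0, 2.6, 2.4, 3.1]
observed, p = permutation_p_value(x, y)
print(f"observed R2 = {observed:.3f}, permutation p = {p:.3f}")
```

A model whose fit is not clearly better than the permuted fits has learned noise, however good its in-sample statistics look.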
![Page 22: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/22.jpg)
Databases for organism specific biochemical information:
Multiple organisms
• KEGG
• BioCyc
• Reactome
Human
• HMDB
• SMPDB
Biochemical domain information
![Page 23: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/23.jpg)
Pathway Enrichment Analysis
http://www.metaboanalyst.ca/MetaboAnalyst/faces/UploadView.jsp
enrichment; topological importance
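The enrichment part is commonly framed as an over-representation test: given how many measured metabolites belong to a pathway, how surprising is the number that were also significantly changed? Below is a sketch using the hypergeometric tail. This is the generic formulation, not necessarily what MetaboAnalyst computes, and it does not cover topological importance. All names and counts are hypothetical.

```python
from math import comb


def enrichment_p_value(hits, pathway_size, selected, background):
    """P(at least `hits` pathway members among the selected metabolites).

    hits: selected metabolites that belong to the pathway
    pathway_size: pathway members present in the background
    selected: number of significantly changed metabolites
    background: total number of measured metabolites
    """
    upper = min(pathway_size, selected)
    total = comb(background, selected)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 148 measured, 20 changed, 8 in the pathway, 5 of those changed.
print(enrichment_p_value(hits=5, pathway_size=8, selected=20, background=148))
```

Testing many pathways is itself a multiple-testing problem, so the resulting p-values are candidates for the same FDR correction applied to individual metabolites.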
![Page 24: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/24.jpg)
Biochemical Network Mapping
doi:10.1186/1471-2105-13-99
Structural Similarity
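One common way to score structural similarity is the Tanimoto coefficient between substructure fingerprints, with a network edge drawn when the score passes a threshold. Whether the mapping shown here uses exactly this measure and cutoff is an assumption. The fingerprints below are made-up sets of feature indices, not real chemical fingerprints.

```python
def tanimoto(a, b):
    """Shared features divided by all features present in either set."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_edges(fingerprints, threshold=0.7):
    """Return (name, name, score) for every pair at or above the threshold."""
    names = sorted(fingerprints)
    edges = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = tanimoto(fingerprints[first], fingerprints[second])
            if score >= threshold:
                edges.append((first, second, score))
    return edges


# Hypothetical fingerprints: each number marks a substructure that is present.
fingerprints = {
    "metabolite_A": {1, 2, 3, 4},
    "metabolite_B": {1, 2, 3, 4, 5},
    "metabolite_C": {9, 10},
}
print(similarity_edges(fingerprints))
```

Similarity edges can connect metabolites that have no annotated reaction between them, which complements networks built from pathway databases.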
![Page 25: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/25.jpg)
Data visualization as form of analysis
[Figure labels: DM = Dextromethorphan, Liver CYP2D6, dextrorphan]
additives in
• high fructose corn syrup
• antioxidants
• flavor
![Page 26: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/26.jpg)
Identification of relationships between altered metabolites
[Figure labels: urea cycle, nucleotide synthesis, protein glycosylation]
![Page 27: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/27.jpg)
Identification of treatment effects
![Page 28: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/28.jpg)
Analysis of differential metabolic responses
Treatment 1 Treatment 2
![Page 29: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/29.jpg)
Resources
• DeviumWeb: Dynamic multivariate data analysis and visualization platform
url: https://github.com/dgrapov/DeviumWeb
• imDEV: Microsoft Excel add-in for multivariate analysis
url: http://sourceforge.net/projects/imdev/
• MetaMapR: Network analysis tools for metabolomics
url: https://github.com/dgrapov/MetaMapR
• TeachingDemos: Tutorials and demonstrations
url: http://sourceforge.net/projects/teachingdemos/?source=directory
url: https://github.com/dgrapov/TeachingDemos
• CDS Blog: Data analysis case studies
url: http://imdevsoftware.wordpress.com/
![Page 30: High Dimensional Biological Data Analysis and Visualization](https://reader036.vdocuments.site/reader036/viewer/2022062319/554e8563b4c90573338b4678/html5/thumbnails/30.jpg)
[email protected] metabolomics.ucdavis.edu
This research was supported in part by NIH 1 U24 DK097154