Pattern Recognition for the Natural Sciences
Explorative Data Analysis
Principal Component Analysis (PCA)
Lutgarde Buydens, IMM, Analytical Chemistry
Why Explorative Data Analysis?
Paradigm change in natural sciences
Classical science: hypothesis driven
Science with advanced technologies: data driven
[Diagrams: "?" and "System" for classical science; "?", "Explorative analysis of data" and "System" for science with advanced technologies]
Explorative Data Analysis
Advanced technology: high-throughput (high-quality) analysis
NMR, HPLC, GC, MS/MS, immunoassays, hybrids
Nano/sensor technology
Genomics (gene expression profiling)
Proteomics, metabolomics
Fingerprinting
Profiling in drug design
Overwhelming amount of data
Explorative Data Analysis
Visualization (principal component analysis, projections)
Unsupervised Pattern recognition (clustering)
Supervised Pattern recognition (classification)
Quantitative analysis (correlations, predictions)
Principal Component Analysis: an Example
150 samples of Italian wines from the same region, 3 different cultivars
Is it possible to characterise the cultivars?
Which variables are relevant for which cultivars?
Data Matrix
X: n × p data matrix with n = 150 wine samples (objects) as rows and p = 13 properties (variables) as columns
$x_{ij}$: value of variable j for sample i, e.g. the flavanoid concentration (variable 7) of sample 75
$x_i$: row i of X (one sample); $x_j$: column j of X (one variable)
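The row and column convention of the data matrix can be sketched in a few lines of NumPy. This is only an illustration: the numbers are random stand-ins (assumption), not the wine measurements, and the names `X`, `x_i`, `x_j` simply mirror the symbols on the slide.

```python
import numpy as np

# Hypothetical stand-in for the wine data: random numbers, NOT real measurements.
rng = np.random.default_rng(0)
n, p = 150, 13                 # n objects (wine samples), p variables (properties)
X = rng.normal(size=(n, p))    # data matrix: one row per sample, one column per property

# The slides count from 1, NumPy counts from 0.
i, j = 75, 7                   # sample 75, property 7 (flavanoids on the slide)
x_ij = X[i - 1, j - 1]         # single element: property j of sample i
x_i = X[i - 1, :]              # row vector: all 13 properties of sample 75
x_j = X[:, j - 1]              # column vector: property 7 over all 150 samples

print(X.shape, x_i.shape, x_j.shape)
```

A row is a point in the 13-dimensional variable space, and a column is a point in the 150-dimensional object space.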
Principal Component Analysis
[Figures: bar plot of 1 wine sample; line plot of 1 wine sample]
Data Matrix Representation
X: n × p matrix with elements $x_{ij}$; rows $x_i$ (i = 1 ... n, # samples), columns $x_j$ (j = 1 ... p, # properties)
Wine data: n = 150, p = 13
p (13)-dimensional variable space $S^p$: the 150 samples are points, e.g. row $x_i$ for sample 75
n (150)-dimensional object space $S^n$: the 13 variables are points, e.g. column $x_j$ for property 7 (flavanoids)
Principal Component Analysis
PCA: visualization: projection in 2 dimensions
p (13)-dim. space of variables $S^p$, 150 samples → r (2)-dim. space $S^2$ (axes lv1, lv2), 150 samples
n (150)-dim. space of objects $S^n$, 13 variables → r (2)-dim. space $S^2$ (axes lv1, lv2), 13 variables
Principal Component Analysis
[Figure: 12 samples plotted in $S^3$, the space of 3 variables x1, x2, x3]
Principal Component Analysis
[Figure: 12 samples in $S^3$ (axes x1, x2, x3) with the direction PC1 and the projections (x) of the samples onto PC1]
$PC_1 = l_{11} x_1 + l_{12} x_2 + l_{13} x_3$
Criterion: maximum variance of the projections (x)
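The maximum-variance criterion can be checked numerically. The sketch below assumes 12 synthetic samples in 3 variables (random numbers, not data from the slides) and obtains the PC1 weights from the SVD of the mean-centered data, one common way to compute them. It then compares the variance of the PC1 projections with the variance along 1000 random unit directions.

```python
import numpy as np

rng = np.random.default_rng(1)
# 12 synthetic samples in 3 variables, stretched mostly along x1 (assumption)
X = rng.normal(size=(12, 3)) @ np.diag([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)                      # column-wise mean-centering

# Rows of Vt are unit-length weight vectors; the first holds l11, l12, l13
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
l1 = Vt[0]
pc1 = Xc @ l1                                # projections of the samples onto PC1
var_pc1 = pc1.var(ddof=1)

# Variance of the projections onto 1000 random unit directions, for comparison
dirs = rng.normal(size=(1000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
var_random = (Xc @ dirs.T).var(axis=0, ddof=1)

print(var_pc1, var_random.max())             # no random direction exceeds PC1
```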
Principal Component Analysis
$PC_1 = l_{11} x_1 + l_{12} x_2 + l_{13} x_3$
$PC_2 = l_{21} x_1 + l_{22} x_2 + l_{23} x_3$
Criterion: maximum variance of the projections (x)
PC1 ⊥ PC2
[Figure: 12 samples in $S^3$ (axes x1, x2, x3) with the directions PC1 and PC2 and the projections (x) of the samples]
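A minimal sketch of the relation between the first two components, again on synthetic data (assumption) and using the SVD as the computational route: the two weight vectors are orthogonal, the two score vectors are uncorrelated, and PC2 carries no more variance than PC1.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 3)) @ np.diag([3.0, 1.0, 0.3])   # synthetic samples
Xc = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
l1, l2 = Vt[0], Vt[1]             # weights (l11, l12, l13) and (l21, l22, l23)
scores = Xc @ Vt[:2].T            # 12 x 2: coordinates of the samples on PC1 and PC2

dot = l1 @ l2                                  # orthogonality of the directions
cov12 = np.cov(scores, rowvar=False)[0, 1]     # covariance between the two scores
var1, var2 = scores.var(axis=0, ddof=1)        # variance carried by each PC

print(dot, cov12, var1, var2)
```

The `scores` array holds the coordinates one would plot in the two-dimensional principal component space.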
Principal Component Analysis
Principal Components Space
[Figure: the 12 samples plotted in the principal components space $S^2$ with axes PC1 and PC2]
Principal Component Analysis
Score plot: the 150 samples, projected from the p (13)-dim. space of variables $S^p$ onto the r (2)-dim. space $S^2$ with axes pc1 and pc2
[Figure: wine data score plot, PC1 (38%) versus PC2 (20%)]
Principal Component Analysis
Loading plot: the 13 variables, projected from the n (150)-dim. space of objects $S^n$ onto the r (2)-dim. space $S^2$ with axes pc1 and pc2
[Figure: wine data loading plot, PC1 (38%) versus PC2 (20%)]
Singular Value Decomposition (SVD)
$X_{n \times p} = U_{n \times r} D_{r \times r} V^T_{r \times p}$
U: left singular vectors, PC scores
V: right singular vectors, PC loadings
$U^T U = V^T V = I$
Score plot: 150 samples in $S^2$ (axes pc1, pc2), from the variable space $S^p$ (13)
Loading plot: 13 variables in $S^2$ (axes pc1, pc2), from the object space $S^n$ (150)
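The decomposition and its orthonormality conditions can be verified with `numpy.linalg.svd`. This is a sketch on random stand-in data (assumption), with column-wise mean-centering applied first. Note that many texts define the scores as U D rather than U; the two differ only in the scaling of each component, and the sketch uses U D so that the scores equal the projections of the data onto the loadings.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r = 150, 13, 2
X = rng.normal(size=(n, p))          # stand-in data, NOT the wine measurements
Xc = X - X.mean(axis=0)

# Economy-size SVD: U is n x p, d holds the p singular values, Vt is p x p
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

I = np.eye(p)
ok_U = np.allclose(U.T @ U, I)                   # U^T U = I
ok_V = np.allclose(Vt @ Vt.T, I)                 # V^T V = I, with V = Vt.T
ok_X = np.allclose(U @ np.diag(d) @ Vt, Xc)      # X = U D V^T

# Keep r = 2 components for the score plot and the loading plot
scores = U[:, :r] * d[:r]            # n x r, one point per sample
loadings = Vt[:r].T                  # p x r, one point per variable
ok_scores = np.allclose(scores, Xc @ loadings)

print(ok_U, ok_V, ok_X, ok_scores, scores.shape, loadings.shape)
```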
Principal Component Analysis: Biplot
Biplot: 150 samples + 13 variables in one plot (axes pc1, pc2)
Principal Component Analysis: an Example
[Figure: wine data, PC1 (38%) versus PC2 (20%)]
Principal Component Analysis: Some Issues
• How many PCs?
• Scaling
• Outliers
How many PCs?
Cumulative % of variance for the first k PCs: $\frac{\sum_{i=1}^{k} d_i^2}{\sum_{i=1}^{p} d_i^2} \times 100\%$, with $d_i$ the diagonal elements of D
[Figures: cumulative % of variance versus number of PCs; scree plot of log variance versus number of PCs]
[Figure: How many PCs? Wine data]
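Both diagnostics follow directly from the singular values. The sketch below uses random stand-in data (assumption) and computes the percentage of variance per PC, the cumulative percentage, and the log variance that a scree plot would show. The 90% cut-off is an arbitrary illustration and is not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in data: 13 variables with decreasing spread (assumption)
X = rng.normal(size=(150, 13)) * np.linspace(3.0, 0.5, 13)
Xc = X - X.mean(axis=0)

d = np.linalg.svd(Xc, compute_uv=False)        # singular values d_1 >= ... >= d_p
pct = 100.0 * d**2 / np.sum(d**2)              # % of variance per PC
cum_pct = np.cumsum(pct)                       # cumulative % of variance
log_var = np.log10(d**2 / (Xc.shape[0] - 1))   # values for a scree plot

# Smallest number of PCs reaching a hypothetical 90% threshold
k = int(np.searchsorted(cum_pct, 90.0) + 1)

print(np.round(cum_pct, 1), k)
```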
PCA: Scaling
For better interpretation; may obscure results
• Raw data
• Mean-centering (column-wise, row-wise, double)
• Auto-scaling (column-wise, row-wise)
• …
[Figures: wine data, mean-centered; wine data, autoscaled]
PCA: Scaling
[Figures: wine data, raw; wine data, mean-centered; axes PC1 (99.79%) and PC2 (0.20%) in both plots]
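Why raw and mean-centered data can both give a first PC with nearly all of the variance is easy to reproduce when one variable is measured on a much larger scale than the others. The sketch below constructs such a case with synthetic numbers (assumption: the last column is given a large mean and spread) and compares raw, column-wise mean-centered and column-wise autoscaled data. The helper `pct_variance` is hypothetical, introduced only for this illustration.

```python
import numpy as np

def pct_variance(A):
    """Percentage of the total sum of squares captured by each component of A."""
    d = np.linalg.svd(A, compute_uv=False)
    return 100.0 * d**2 / np.sum(d**2)

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 13))                 # stand-in data, NOT the wine data
X[:, 12] = 750.0 + 300.0 * X[:, 12]            # one variable on a much larger scale

raw = X                                        # no preprocessing
centered = X - X.mean(axis=0)                  # column-wise mean-centering
autoscaled = centered / X.std(axis=0, ddof=1)  # column-wise autoscaling (unit variance)

pc1_raw = pct_variance(raw)[0]
pc1_centered = pct_variance(centered)[0]
pc1_auto = pct_variance(autoscaled)[0]

print(round(pc1_raw, 2), round(pc1_centered, 2), round(pc1_auto, 2))
```

In this construction the large-scale variable dominates PC1 until autoscaling puts all variables on an equal footing.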
PCA: Scaling
[Figure: 12 samples in $S^3$ (3 variables x1, x2, x3) with PC1]
PCA: Outliers
[Figure: 12 samples + 1 outlier in $S^3$ (3 variables x1, x2, x3) with PC1]
PCA: Outliers
[Figure: the samples and the outlier in $S^3$ (3 variables x1, x2, x3), with two PC1 directions drawn]
Leverage effect
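The leverage effect can be illustrated with a small simulation. This is a sketch under stated assumptions: 12 synthetic samples spread mainly along x1, plus one hypothetical outlier placed far away along x3. The helper `first_pc` is introduced only for this example.

```python
import numpy as np

def first_pc(A):
    """Unit-length direction of PC1 for the column-centered data A."""
    Ac = A - A.mean(axis=0)
    return np.linalg.svd(Ac, full_matrices=False)[2][0]

rng = np.random.default_rng(6)
t = rng.normal(size=12)
# 12 samples spread mainly along x1, with a little noise in all directions
X = 3.0 * np.outer(t, [1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=(12, 3))
outlier = np.array([[0.0, 0.0, 30.0]])         # one extreme sample along x3
X_out = np.vstack([X, outlier])

pc1_clean = first_pc(X)
pc1_out = first_pc(X_out)

# Angle between the two PC1 directions (the sign of a PC is arbitrary)
cos = abs(pc1_clean @ pc1_out)
angle = np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

print(np.round(pc1_clean, 2), np.round(pc1_out, 2), round(angle, 1))
```

A single extreme sample is enough to pull PC1 away from the direction that describes the other 12 samples.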
Principal Component Analysis: a Recent Research Example
X: data matrix of gene expression values $x_{ij}$, 50,000 genes (rows) × 4 treatments (columns)
Organon, Department of Cell Biology
PCA Interaction Gene Treatment