Page 1: IPK Gatersleben Pattern Recognition Group Correlation-based Data Processing and its Application to Biology stricker@ipk-gatersleben.de Marc Strickert Osnabrück,

IPK Gatersleben, Pattern Recognition Group

Correlation-based Data Processing

and its Application to Biology

stricker@ipk-gatersleben.de

Marc Strickert

Osnabrück, 14 January 2005

Pattern Recognition Group

Schloss Dagstuhl

Leibniz Institute of Plant Genetics and Crop Plant Research Gatersleben

Page 2

Goals

1. Attribute rating

2. Clustering

3. Classification

4. Visualization

of biological data,

exploiting properties of

Pearson correlation.

Page 3

Euclidean distances may be problematic

$d_1 = (x_1-y_1)^2 + \dots + (x_5-y_5)^2$

$d_2 = (x_1-y_1)^2 + \dots + (x_5-y_5)^2$

identical despite different shapes

[ John Lee and Michel Verleysen ]
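A small numeric illustration of the point on this slide, using made-up five-point profiles (the vectors `x`, `y_shift`, and `y_zigzag` are assumptions for the example, not data from the talk): two comparison profiles can have the same squared Euclidean distance to a reference while their shapes, and hence their Pearson correlations, differ.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_shift = x + 1.0                                      # same shape, shifted up
y_zigzag = x + np.array([1.0, -1.0, 1.0, -1.0, 1.0])   # different shape

def sq_euclid(a, b):
    """Squared Euclidean distance between two profiles."""
    return float(np.sum((a - b) ** 2))

def pearson(a, b):
    """Pearson correlation coefficient of two profiles."""
    return float(np.corrcoef(a, b)[0, 1])

d1 = sq_euclid(x, y_shift)
d2 = sq_euclid(x, y_zigzag)
r1 = pearson(x, y_shift)
r2 = pearson(x, y_zigzag)

print(d1, d2)   # both squared distances are 5.0
print(r1, r2)   # the correlations differ
```

The Euclidean view cannot tell the two cases apart, whereas the correlation separates the pure shift from the zigzag.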

Page 4

Pearson correlation invariant to scaling and shifting

[Figure: up-regulated gene profiles. Euclidean view: raw data, differing in amplitude and vertical offset. 'Pearson' view: same profiles, aligned; same correlations as above!]
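The invariance can be checked numerically. The sketch below uses random 14-point profiles and an arbitrary positive scale factor and offset (all assumed values for illustration): rescaling and shifting one profile leaves the Pearson correlation unchanged but alters the Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=14)          # hypothetical profile with 14 time points
y = rng.normal(size=14)

def pearson(a, b):
    """Pearson correlation from centred profiles."""
    ac, bc = a - a.mean(), b - b.mean()
    return float(ac @ bc / np.sqrt((ac @ ac) * (bc @ bc)))

y_scaled = 3.0 * y + 10.0        # amplitude times 3, vertical offset 10

r_raw = pearson(x, y)
r_scaled = pearson(x, y_scaled)
d_raw = float(np.linalg.norm(x - y))
d_scaled = float(np.linalg.norm(x - y_scaled))

print(r_raw, r_scaled)   # equal up to rounding
print(d_raw, d_scaled)   # clearly different
```

Note that the invariance holds for positive scale factors; a negative factor flips the sign of the correlation.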

Page 5

Derivatives of squared Euclidean and Pearson correlation

Squared Euclidean:

Pearson correlation:
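The formulas on this slide did not survive in the transcript. As a hedged reconstruction from the standard definitions: for $d(x,w)=\sum_i (x_i-w_i)^2$ the derivative is $\partial d/\partial w_k = -2(x_k-w_k)$, and for the Pearson correlation $r(x,w)$ with centred vectors it is $\partial r/\partial w_k = (x_k-\mu_x)/\sqrt{CD} - r\,(w_k-\mu_w)/D$, where $C=\sum_i (x_i-\mu_x)^2$ and $D=\sum_i (w_i-\mu_w)^2$. The notation is an assumption; the sketch below checks both expressions against finite differences.

```python
import numpy as np

def grad_sq_euclid(x, w):
    """Gradient of d(x, w) = sum_i (x_i - w_i)^2 with respect to w."""
    return -2.0 * (x - w)

def pearson(x, w):
    xc, wc = x - x.mean(), w - w.mean()
    return float(xc @ wc / np.sqrt((xc @ xc) * (wc @ wc)))

def grad_pearson(x, w):
    """Gradient of the Pearson correlation r(x, w) with respect to w."""
    xc, wc = x - x.mean(), w - w.mean()
    C, D = xc @ xc, wc @ wc
    r = xc @ wc / np.sqrt(C * D)
    return xc / np.sqrt(C * D) - r * wc / D

def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference gradient of f at w."""
    g = np.zeros_like(w)
    for k in range(w.size):
        step = np.zeros_like(w)
        step[k] = eps
        g[k] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
x, w = rng.normal(size=8), rng.normal(size=8)

err_d = np.max(np.abs(grad_sq_euclid(x, w)
                      - numeric_grad(lambda v: np.sum((x - v) ** 2), w)))
err_r = np.max(np.abs(grad_pearson(x, w)
                      - numeric_grad(lambda v: pearson(x, v), w)))
print(err_d, err_r)   # both should be tiny
```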

Page 6

Applications for derivative of similarity measure

1. Attribute rating (variance analogue)

2. Clustering (Neural Gas for Correlation, NG-C)

3. Classification (GRLVQ-C)

4. Visualization (High-Throughput MDS)

Page 7

Attribute rating

Variance as a double sum of derivatives of the squared Euclidean distance
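The slide's equation is lost in the transcript, so the following is only one consistent way to write the idea, with the normalisation chosen here as an assumption: since $\partial d(x^i,x^j)/\partial x^i_k = 2(x^i_k - x^j_k)$, summing the squared derivatives over all sample pairs gives $8n^2$ times the population variance of attribute $k$. The sketch verifies this on random data.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))     # hypothetical data: 20 samples, 5 attributes
n = X.shape[0]

# d(x^i, x^j) = sum_k (x^i_k - x^j_k)^2, so dd/dx^i_k = 2 (x^i_k - x^j_k).
deriv = 2.0 * (X[:, None, :] - X[None, :, :])   # shape (n, n, attributes)

# Summing squared derivatives over all pairs (i, j) yields 8 n^2 times
# the population variance of each attribute.
var_from_derivs = (deriv ** 2).sum(axis=(0, 1)) / (8.0 * n ** 2)
var_direct = X.var(axis=0)       # population variance (ddof=0)

print(np.allclose(var_from_derivs, var_direct))
```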

Page 8

Correlation Analogue to Euclidean Variance

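The slide's formula is not recoverable from the transcript. One plausible reading, stated here as an assumption, is to keep the double-sum construction used for variance but replace the derivative of the squared Euclidean distance with the derivative of the Pearson correlation, giving a rating $\sum_i\sum_j (\partial r(x^i,x^j)/\partial x^i_k)^2$ for attribute $k$. The function name `grad_pearson_rows` and the random data are hypothetical.

```python
import numpy as np

def grad_pearson_rows(X):
    """G[i, j, k] = d r(x^i, x^j) / d x^i_k for all row pairs of X."""
    Xc = X - X.mean(axis=1, keepdims=True)   # centre each sample (row)
    S = np.sum(Xc ** 2, axis=1)              # per-row sum of squares
    norm = np.sqrt(np.outer(S, S))           # sqrt(S_i * S_j)
    R = (Xc @ Xc.T) / norm                   # pairwise correlations
    # d r / d x^i_k = xc^j_k / sqrt(S_i S_j) - r_ij * xc^i_k / S_i
    term1 = Xc[None, :, :] / norm[:, :, None]
    term2 = R[:, :, None] * (Xc / S[:, None])[:, None, :]
    return term1 - term2

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 6))     # hypothetical data: 30 samples, 6 attributes
G = grad_pearson_rows(X)

# Assumed correlation analogue of variance: one rating per attribute.
rating = (G ** 2).sum(axis=(0, 1))
print(rating)
```

Attributes with a larger rating are those to which the pairwise correlations are, on average, more sensitive.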

Page 9

Clustering: Neural Gas (NG revisited)

NG-C:
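The update rules shown on this slide are missing from the transcript. The following is a minimal sketch in the spirit of neural gas, not the talk's exact NG-C formulation: prototypes are ranked by their correlation with each sample and moved along the gradient of the Pearson correlation, weighted by a rank-based neighbourhood. The function name, learning rate, annealing schedule, and toy data are all assumptions.

```python
import numpy as np

def pearson_and_grad(x, w):
    """Return r(x, w) and its gradient with respect to w."""
    xc, wc = x - x.mean(), w - w.mean()
    C, D = xc @ xc, wc @ wc
    r = xc @ wc / np.sqrt(C * D)
    return r, xc / np.sqrt(C * D) - r * wc / D

def neural_gas_corr(X, n_protos=3, epochs=50, lr=0.1, lam0=1.5, seed=0):
    """Neural-gas-style clustering that ranks and adapts by correlation."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), n_protos, replace=False)].copy()
    for epoch in range(epochs):
        # Shrink the neighbourhood range from lam0 down to 0.01.
        lam = lam0 * (0.01 / lam0) ** (epoch / max(epochs - 1, 1))
        for x in X[rng.permutation(len(X))]:
            stats = [pearson_and_grad(x, w) for w in W]
            r = np.array([s[0] for s in stats])
            ranks = np.argsort(np.argsort(-r))   # rank 0 = most correlated
            for j, (_, g) in enumerate(stats):
                W[j] += lr * np.exp(-ranks[j] / lam) * g   # ascend correlation
    return W

# Toy data: three profile shapes with random amplitude, offset, and noise.
t = np.linspace(0.0, 1.0, 10)
bases = [t, np.sin(np.pi * t), np.sin(4 * np.pi * t)]
rng = np.random.default_rng(0)
X = np.array([rng.uniform(1, 3) * b + rng.uniform(-2, 2)
              + 0.05 * rng.normal(size=t.size)
              for b in bases for _ in range(20)])

W = neural_gas_corr(X)
best_r = np.mean([max(pearson_and_grad(x, w)[0] for w in W) for x in X])
print(best_r)   # mean correlation of each sample with its best prototype
```

Because the correlation ignores amplitude and offset, profiles of the same shape can share a prototype even when their raw values are far apart.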

Page 10

High centroid reproducibility with NG-C

NG-C

k-means

23 gene expression centroids, 10 independent runs

Indeterminate final states.

Crisp final states.

Page 11

Classification with relevance learning

Used, for example, in Generalized Relevance Learning Vector Quantization with Correlation (GRLVQ-C)

Adaptive Pearson correlation:
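The formula itself is missing from the transcript. A common way to make the Pearson correlation adaptive, assumed here for illustration, is to weight each attribute's contribution by a squared relevance factor $\lambda_i^2$ while keeping the plain means; whether the talk used exactly this form is not confirmed by the source. The function name `adaptive_pearson` is hypothetical.

```python
import numpy as np

def adaptive_pearson(x, w, lam):
    """Pearson correlation with per-attribute relevance factors lam."""
    xc, wc = x - x.mean(), w - w.mean()
    l2 = lam ** 2                                 # squared relevance factors
    num = np.sum(l2 * xc * wc)
    den = np.sqrt(np.sum(l2 * xc ** 2) * np.sum(l2 * wc ** 2))
    return float(num / den)

rng = np.random.default_rng(5)
x, w = rng.normal(size=12), rng.normal(size=12)

r_plain = adaptive_pearson(x, w, np.ones(12))     # all attributes equally relevant
lam = np.ones(12)
lam[:6] = 0.0                                     # switch off the first six attributes
r_masked = adaptive_pearson(x, w, lam)
print(r_plain, r_masked)
```

With all relevance factors equal to one, the measure reduces to the ordinary Pearson correlation; in relevance learning the factors are adapted together with the prototypes.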

Page 12

Leukemia cancer data set: AML / ALL separation

GRLVQ-C: Relevance factors top 10 gene ranking.

1 prototype per class + relevance learning.

consistent with Golub et al.

Page 13

Visualization of high-dimensional data

High-dimensional data (constant source)

Low-dimensional points (variable target)

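[Figure: points A, B, C in 3D with pairwise distances d12, d13, d23 are mapped (“embedding”) to points A', B', C' in 2D with reconstructed distances d12, d13, d23.]

Gradient-based stochastic optimization: HiT-MDS.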

Page 14

Maximize distance correlations: source ≈ reconstruction

original inter-point distance matrix

reconstructed inter-point distance matrix

Adaptive parameters: point coordinates

Minimize embedding stress function using negative Fisher's Z':
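The stress formula is not preserved in the transcript. Using the standard Fisher transform $Z'(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r} = \operatorname{artanh}(r)$, a stress of the form $s = -Z'(r)$ can be sketched as follows, where $r$ is the Pearson correlation between original and reconstructed pairwise distances. The helper names and the random test data are assumptions.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(D, D_hat):
    """Negative Fisher Z of the correlation between two distance matrices."""
    iu = np.triu_indices_from(D, k=1)        # use each pair (i, j) once
    r = np.corrcoef(D[iu], D_hat[iu])[0, 1]
    return -np.arctanh(r), r

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))                 # hypothetical 3D source points
D = pairwise_dist(X)

# Candidate 2D embeddings: drop one coordinate vs. unrelated random points.
s_proj, r_proj = stress(D, pairwise_dist(X[:, :2]))
s_rand, r_rand = stress(D, pairwise_dist(rng.normal(size=(30, 2))))
print(s_proj, s_rand)   # the projection should have the lower stress
```

The transform stretches correlations near 1, so improvements of an already good embedding still change the stress noticeably; for a perfect reconstruction ($r = 1$) the stress diverges to negative infinity.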

Page 15

Iterative gradient descent for stress function minimization

| derivative of Fisher's Z'

| for Euclidean spaces
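The gradient expression on this slide is lost in the transcript. A chain-rule reconstruction from the standard definitions, given as a sketch: hats mark embedded quantities, sums run over unordered point pairs, $\mu_d$ and $\mu_{\hat d}$ are the mean distances, and $C=\sum (d_{ij}-\mu_d)^2$, $\hat C=\sum (\hat d_{ij}-\mu_{\hat d})^2$. This notation is an assumption and may differ from the slide.

```latex
\begin{align*}
s &= -Z'(r) = -\tfrac{1}{2}\ln\frac{1+r}{1-r},
  \qquad \frac{\partial s}{\partial r} = -\frac{1}{1-r^2}, \\
\frac{\partial s}{\partial \hat{x}_{ik}}
  &= \frac{\partial s}{\partial r}
     \sum_{j \neq i} \frac{\partial r}{\partial \hat{d}_{ij}}\,
     \frac{\partial \hat{d}_{ij}}{\partial \hat{x}_{ik}}, \\
\frac{\partial r}{\partial \hat{d}_{ij}}
  &= \frac{d_{ij}-\mu_{d}}{\sqrt{C\,\hat{C}}}
     - r\,\frac{\hat{d}_{ij}-\mu_{\hat{d}}}{\hat{C}}, \\
\frac{\partial \hat{d}_{ij}}{\partial \hat{x}_{ik}}
  &= \frac{\hat{x}_{ik}-\hat{x}_{jk}}{\hat{d}_{ij}}
  \quad \text{(Euclidean target space)}.
\end{align*}
```

The first factor is the derivative of Fisher's Z', and the last factor is the part that depends on the target space being Euclidean.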

Page 16

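High-Throughput Multi-Dimensional Scaling (HiT-MDS)

HiT-MDS algorithm

Initialize $\hat X$ by random projection (or smarter).

Calculate correlation $r(X, \hat X)$ once.

Draw next pattern $x_i$.

Minimize stress $s$ to all $\hat x_j$: $\Delta \hat x_{ik} \sim -\partial s / \partial \hat x_{ik}$.

Recalculate distances $\hat d_{ij}$; adapt …, …, and $r$.

[Diagram: input $x_i \in X$ with distances $d_{ij}$; embedding $\hat x_i \in \hat X$ with distances $\hat d_{ij}$; correlation $r(d_{ij}, \hat d_{ij})$; stress $s$; steps numbered 1 to 4.]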
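The loop above can be sketched in a simplified form. This is not the HiT-MDS implementation: it recomputes all embedded distances and the correlation for every drawn point instead of adapting running quantities incrementally, and the function name `hitmds_sketch`, learning rate, epoch count, and random data are assumptions.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def corr_upper(D, D_hat, iu):
    """Pearson correlation between the upper triangles of two matrices."""
    return np.corrcoef(D[iu], D_hat[iu])[0, 1]

def hitmds_sketch(X, dim=2, epochs=30, lr=0.2, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    D = pairwise_dist(X)
    iu = np.triu_indices(n, k=1)
    X_hat = X @ rng.normal(size=(X.shape[1], dim))   # random projection
    r_start = corr_upper(D, pairwise_dist(X_hat), iu)
    for _ in range(epochs):
        for i in rng.permutation(n):                 # draw next pattern
            D_hat = pairwise_dist(X_hat)
            d, dh = D[iu], D_hat[iu]
            dc, dhc = d - d.mean(), dh - dh.mean()
            C, Ch = dc @ dc, dhc @ dhc
            r = dc @ dhc / np.sqrt(C * Ch)
            j = np.arange(n) != i                    # all other points
            # dr / d(d_hat_ij) for the pairs that involve point i
            dr_dd = ((D[i, j] - d.mean()) / np.sqrt(C * Ch)
                     - r * (D_hat[i, j] - dh.mean()) / Ch)
            # d(d_hat_ij) / d(x_hat_i) in a Euclidean target space
            dd_dx = (X_hat[i] - X_hat[j]) / np.maximum(D_hat[i, j], 1e-12)[:, None]
            # stress s = -artanh(r), so ds/dr = -1 / (1 - r^2)
            ds_dx = -(1.0 / (1.0 - r ** 2)) * (dr_dd[:, None] * dd_dx).sum(axis=0)
            X_hat[i] -= lr * ds_dx                   # gradient descent on s
    return X_hat, r_start, corr_upper(D, pairwise_dist(X_hat), iu)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))                         # hypothetical 5D data
X_hat, r_start, r_end = hitmds_sketch(X)
print(f"distance correlation: {r_start:.3f} -> {r_end:.3f}")
```

Recomputing everything per point costs quadratic time per update, which is fine for a toy example but is presumably what the incremental adaptation on the slide is meant to avoid.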

Page 17

Applications of dimension reduction (visualization)

1. Gene space browser.

2. Macro-experiment grouping.

[Figure: panels 1 and 2, with labels day 0 and day 26.]

Page 18

Embedding 12k Genes (14 time points) in 2D

[Figure: 2D embedding panels labelled orig, spline, and FIT for the measures EUC, COR, and SRC; regions marked U, I, and D.]

EUC: Euclidean distance

COR: Pearson correlation

SRC: Spearman rank correlation
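To make the three measures in the legend concrete, here is a small sketch on made-up 14-point profiles (the profiles and helper names are assumptions). Spearman rank correlation is computed as the Pearson correlation of the ranks, which is exact when there are no ties.

```python
import numpy as np

def ranks(v):
    """Ranks 0..n-1 of the entries of v (assumes no ties)."""
    return np.argsort(np.argsort(v)).astype(float)

def euc(a, b):
    """EUC: Euclidean distance."""
    return float(np.linalg.norm(a - b))

def cor(a, b):
    """COR: Pearson correlation."""
    return float(np.corrcoef(a, b)[0, 1])

def src(a, b):
    """SRC: Spearman rank correlation (Pearson correlation of ranks)."""
    return cor(ranks(a), ranks(b))

t = np.arange(1.0, 15.0)   # 14 hypothetical time points
a = t                      # linear increase
b = np.exp(0.5 * t)        # monotone in t but strongly nonlinear

print(euc(a, b), cor(a, b), src(a, b))
```

The two profiles are far apart in the Euclidean sense and only moderately Pearson-correlated, yet their rank correlation is 1 because both increase monotonically.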

Page 19

Gene browser (4824 high-quality genes)

[Figure axis: 0, 2, ..., 26 DAF]

[ visualization: www.ggobi.org ]

Page 20

Gene browser for powers of correlation: $(1-r)^8$
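A minimal sketch of such a dissimilarity matrix, assuming the rows of the input are expression profiles (the function name `corr_dissimilarity` and the toy profiles are hypothetical):

```python
import numpy as np

def corr_dissimilarity(X, power=8):
    """Matrix of (1 - r_ij)^power between the rows of X."""
    R = np.corrcoef(X)           # rows are treated as profiles
    return (1.0 - R) ** power

t = np.arange(5.0)
X = np.vstack([t, 2.0 * t + 1.0, -t])   # same shape, rescaled, and inverted
D = corr_dissimilarity(X)
print(D)
```

Since $1-r$ lies between 0 and 2, a high power pushes dissimilarities of positively correlated profiles towards zero and stretches those of anti-correlated profiles, up to $2^8 = 256$ for $r = -1$.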

Page 21

Gene clustering (k=11), relevant genes in front

Page 22

3D-View of 62 macroarrays (4824 genes)

Page 23

Data processing challenges in biology

Data sets from
- metabolite measurements (2D-gels, HPLC),
- QTL LOD-score pattern compression,
- DNA-sequence arrangement.

Missing value imputation (probabilistic models)

Association studies (common latent space, CCA)

Rank-based data analysis (distribution models)

Faithful low-dimensional data representation

Proximity data handling

Common language: R / MATLAB / … ?

Page 24

Thanks

http://pgrc-16.ipk-gatersleben.de/~stricker/

http://hitmds.webhop.net/

Pattern recognition group (IPK, headed by Udo Seiffert)

Nese Sreenivasulu (IPK, Molecular Biology)

Barbara Hammer (TU-Clausthal)

Thomas Villmann (University of Leipzig)

Page 25

Some References

Strickert, M.; Sreenivasulu, N.; Peterek, S.; Weschke, W.; Mock, H.-P.; Seiffert, U. Unsupervised Feature Selection for Biomarker Identification in Chromatography and Gene Expression Data. In F. Schwenker and S. Marinai (Eds.), Artificial Neural Networks in Pattern Recognition, LNAI 4087, pp. 274-285, 2006.

Strickert, M.; Sreenivasulu, N.; Seiffert, U. Sanger-driven MDSLocalize - A Comparative Study for Genomic Data. In M. Verleysen (Ed.), Proc. 14th European Symp. Artificial Neural Networks (ESANN 2006), Bruges, Belgium. D-Side publishers, Evere, Belgium, pp. 265-270, 2006.

Strickert, M.; Seiffert, U.; Sreenivasulu, N.; Weschke, W.; Villmann, T.; Hammer, B. Generalized Relevance LVQ (GRLVQ) with Correlation Measures for Gene Expression Data. Neurocomputing 69 (2006), pp. 651-659, Springer, 2006.

Strickert, M.; Sreenivasulu, N.; Usadel, B.; Seiffert, U. Correlation-maximizing surrogate gene space for visual mining of gene expression patterns in developing barley endosperm tissue. To appear in BMC Bioinformatics, 2007.

Strickert, M.; Sreenivasulu, N.; Seiffert, U. Browsing temporally regulated gene expressions in correlation-maximizing space. Accepted presentation at conference on Analysis of Compatibility Pathways (March 4-6, 2007).