Predictive Analysis of Gene Expression Data from Human SAGE Libraries. Alexessander Alves*, Nikolay Zagoruiko+, Oleg Okun§, Olga Kutnenko+, Irina Borisova+ (* University of Porto, Portugal; + Russian Academy of Sciences, Russia; § University of Oulu, Finland)

Post on 10-Jan-2016

TRANSCRIPT

Page 1: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Predictive Analysis of Gene Expression Data from Human SAGE Libraries

Alexessander Alves*, Nikolay Zagoruiko+, Oleg Okun§, Olga Kutnenko+, Irina Borisova+

* University of Porto, Portugal

+ Russian Academy of Sciences, Russia

§ University of Oulu, Finland

Page 2: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Outline

1. Goals

2. Background

3. SAGE Data

4. Gene Expression Data

5. Feature Selection

6. GRAD

7. Experiments

8. Conclusions

Page 3: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Goal

Predictive Analysis:

• Feature Selection Methods in Bioinformatics and Machine Learning

• Cancer Classification

Page 4: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Background

Genes code for proteins and other large biomolecules

Genes are expressed in a two-step process (Central Dogma of Biology)

Several technologies measure transcription: SAGE, microarrays, …

Central Dogma of Biology

Gene Expression Process

1- Transcribed into an RNA Sequence

2- Translated into a protein

Molla et al, 2003

Page 5: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

SAGE DATA

Advantages:

• Samples from different organs and patients can be compared (no normalisation required)

• Collects the complete gene expression profile of a cell/tissue without prior knowledge of the mRNA to be profiled

Page 6: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

SAGE DATA

Drawbacks:

• Very expensive to collect data using the SAGE method

• Very few examples (as a consequence)

Page 7: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

GENE EXPRESSION DATA

Challenges posed to Machine Learning:

• The number of genes dramatically exceeds the number of examples

• Curse of dimensionality (the data are too sparse to estimate the model accurately)

• Over-fitting (higher probability of finding chance relationships among data attributes)
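The over-fitting risk can be illustrated with a small simulation. The sketch below is not from the original slides: it assumes purely random "gene" values (Gaussian noise) for 74 samples and 822 genes, with 50 samples labelled cancer and 24 normal, and reports the strongest correlation between any noise gene and the class label. The function names and the seed are illustrative assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def max_spurious_correlation(n_cancer=50, n_normal=24, n_genes=822, seed=0):
    """Largest |correlation| between the class label and pure-noise genes."""
    rng = random.Random(seed)
    labels = [1.0] * n_cancer + [0.0] * n_normal
    strongest = 0.0
    for _ in range(n_genes):
        # Each simulated gene is noise: it carries no information about the label.
        gene = [rng.gauss(0.0, 1.0) for _ in labels]
        strongest = max(strongest, abs(pearson(gene, labels)))
    return strongest


strongest = max_spurious_correlation()
print(f"strongest correlation found among noise genes: {strongest:.2f}")
```

Even though no gene is related to the label by construction, scanning hundreds of genes over a few dozen samples is expected to turn up at least one that looks moderately correlated, which is the kind of chance relationship a selection method has to guard against.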

Page 8: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Feature Selection

Remove Irrelevant and Redundant Genes

Methods:

• Wrapper: fit a classifier to a subset of the data and use classification accuracy to drive the search for relevant genes (e.g. C4.5 accuracy)

• Filtering: use a function to assess the goodness of a subset of genes (e.g. Euclidean distance, entropy, correlation, etc.)

Problem Complexity

• O(2^n), where n is the number of genes

• Smaller dataset: n = 822

• 2^822 ≈ 2.8 × 10^247 candidate subsets: intractable using a simple exhaustive search
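The size of the exhaustive search space can be checked with exact integer arithmetic. This is a minimal sketch; the variable names are illustrative.

```python
import math

n_genes = 822

# Every gene is either in or out of a subset, so there are 2^n candidate subsets.
n_subsets = 2 ** n_genes

digits = len(str(n_subsets))
exponent = n_genes * math.log10(2)

print(f"2^{n_genes} has {digits} decimal digits")
print(f"log10(2^{n_genes}) = {exponent:.2f}")
```

A number with 248 decimal digits is on the order of 10^247, so enumerating all subsets is out of reach regardless of how cheaply each one is evaluated.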

Page 9: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Gene Selection in Bioinformatics

Filtering is usually preferred because it is computationally less expensive.

Several works on classification select genes with:

• Wilcoxon test

• t-test

• Additionally, genes with low entropy, variability, or absolute expression level are also removed

Cons:

• Redundancy

• Unaware of interdependency
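A univariate filter of this kind can be sketched as follows. The example ranks genes by the absolute Welch t statistic between the two classes; the gene names, the toy expression values, and the function names are hypothetical, and the exact test variant used in the works the slide refers to is not stated in the source.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic for two samples (assumes non-zero variance in at least one)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes(expression, labels):
    """Rank genes by |t| between class 1 and class 0, most discriminative first.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of 0/1 class labels, one per sample
    """
    scores = {}
    for gene, values in expression.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(welch_t(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)


labels = [1, 1, 1, 0, 0, 0]
expression = {
    "geneA": [10, 11, 12, 1, 2, 3],              # differentially expressed
    "geneB": [5, 6, 4, 5, 4, 6],                 # same mean in both classes
    "geneC": [10.5, 11.5, 12.5, 1.5, 2.5, 3.5],  # shifted copy of geneA
}
ranking = rank_genes(expression, labels)
print(ranking)
```

In this toy data geneC is a shifted copy of geneA, yet both receive the same high score, because each gene is scored on its own. That is the redundancy problem listed under "Cons"; a pair of genes that is only informative jointly would likewise go unnoticed.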

Page 10: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Our Proposals

Study Bioinformatics Filtering Techniques

Compare with Machine Learning Algorithms

• Avoid Redundancy

• Consider interdependency and weakly expressed genes

Introduce a new Filtering Algorithm GRAD

Page 11: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

GRAD

Search Strategy

1. Use exhaustive search to form informative groups of attributes (“granules”)

2. Use AdDel to choose subsets of granules

• AdDel: a combination of forward sequential search (FSS) and backward sequential search (BSS)

• The number of attributes to include in a subset is estimated by the algorithm
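One way to combine forward and backward sequential search is to alternate between adding a few items and deleting fewer. The sketch below follows that idea; it is not the authors' implementation. The step sizes `n_add` and `n_del`, the stopping rule, and the `toy_quality` function are assumptions made for illustration, since the slides only describe AdDel as a combination of FSS and BSS.

```python
def ad_del(features, quality, n_add=2, n_del=1, max_size=None):
    """Alternate greedy addition (FSS) and greedy deletion (BSS).

    quality: callable scoring a list of features (higher is better).
    Returns the best subset seen during the search and its quality.
    """
    assert n_add > n_del >= 0, "the subset must grow on every cycle"
    remaining = list(features)
    max_size = len(remaining) if max_size is None else max_size
    selected = []
    best_subset, best_q = [], float("-inf")

    def record():
        nonlocal best_subset, best_q
        q = quality(selected)
        if q > best_q:
            best_subset, best_q = list(selected), q

    while True:
        # Forward phase: add the feature that helps the current subset most.
        for _ in range(n_add):
            if not remaining or len(selected) >= max_size:
                break
            f = max(remaining, key=lambda c: quality(selected + [c]))
            remaining.remove(f)
            selected.append(f)
            record()
        if not remaining or len(selected) >= max_size:
            break
        # Backward phase: drop the feature whose removal hurts least.
        for _ in range(n_del):
            if len(selected) <= 1:
                break
            f = max(selected, key=lambda c: quality([s for s in selected if s != c]))
            selected.remove(f)
            remaining.append(f)
            record()
    return best_subset, best_q


def toy_quality(subset):
    """Hypothetical quality: g1 and g2 are informative only together."""
    s = set(subset)
    score = 1.0 if {"g1", "g2"} <= s else (0.3 if "g3" in s else 0.0)
    return score - 0.05 * len(s)  # small penalty per feature


best_subset, best_q = ad_del(["g1", "g2", "g3", "g4"], toy_quality)
print(best_subset, best_q)
```

In this toy run the forward steps first pick `g3`, the only individually useful feature; once `g1` and `g2` are both present, a deletion step discards `g3`. Because the search records the best subset seen along the way, the subset size falls out of the search rather than being fixed in advance.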

Page 12: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

GRAD

Algorithm

P0: x1, x2, …, xn (initial set of features)

Formation of granules:

• G1: ordering by individual relevance: x7, x33, x12, …, xn

• G2: all pairs by exhaustive search: x3x8, x15x88, …, xi xj

• G3: all triplets by exhaustive search: x75x1x35, x11x49x55, …, xi xj xk

Top level: most relevant granules chosen using AdDel

• G = <G1, G2, G3> … AdDel

Distance function f(r1, r2), where r1 and r2 are the distances to the closest neighbors, one from each class
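The granule formation step can be sketched as an exhaustive enumeration of single features, pairs, and triplets, each scored by a relevance function. This is only an illustration under stated assumptions: `toy_relevance`, the `keep` cut-off, and the feature names are hypothetical, and the relevance measure GRAD actually uses is not reproduced here.

```python
from itertools import combinations
from math import comb


def form_granules(features, relevance, max_order=3, keep=2):
    """Score every group of 1..max_order features; keep the best `keep` per order."""
    granules = {}
    for order in range(1, max_order + 1):
        groups = combinations(features, order)  # exhaustive enumeration
        ranked = sorted(groups, key=relevance, reverse=True)
        granules[order] = ranked[:keep]
    return granules


def toy_relevance(group):
    """Hypothetical relevance: x1 and x2 are informative only as a pair."""
    g = set(group)
    score = 0.2 * ("x3" in g)
    if {"x1", "x2"} <= g:
        score += 1.0
    return score - 0.01 * len(g)


granules = form_granules(["x1", "x2", "x3", "x4"], toy_relevance)
print(granules[1][0], granules[2][0], granules[3][0])

# Number of pairs and triplets to score for the 822-gene dataset.
print(comb(822, 2), comb(822, 3))
```

For 822 genes there are 337,431 pairs and 92,231,140 triplets, so exhaustive search is feasible at these low orders even though searching all subsets is not. The pair (x1, x2) tops G2 although neither feature ranks well in G1, which is the interdependency a purely individual ranking misses.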

Page 13: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Experiments Comparison

1. GRAD

2. Wrapper C4.5

3. Original Dataset

4. Filtering: Wilcoxon test, low entropy, variability, and very low absolute expression level

Classifiers

1. C4.5

2. SVM

3. RBF

4. NN-MLP

Data

• Small Dataset: 74 × 822 (samples × genes)

Page 14: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Data Characterization

Not all organs have samples of both classes

Unbalanced number of cases:

• 50 Cancer Samples

• 24 Normal Samples

Most of the data show relatively low expression levels

The mean is quite far from the median, potentially due to outliers
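Because the classes are unbalanced, accuracy figures are easier to read against the score of a trivial classifier that always predicts the majority class. A short calculation using the sample counts from this slide (the variable names are illustrative):

```python
n_cancer, n_normal = 50, 24

# A classifier that always answers "cancer" is right on every cancer sample.
baseline = max(n_cancer, n_normal) / (n_cancer + n_normal)

print(f"majority-class baseline accuracy: {baseline:.1%}")
```

This gives roughly 67.6%, so any reported predictive accuracy should be judged by how far it exceeds that level rather than 50%.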

Page 15: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Data Characterization

[Figure: two scatter plots, average vs. standard deviation and average vs. range]

Both the range and the standard deviation have a roughly linear relationship with the average gene expression level

Page 16: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Experimental Results

Predictive Accuracy

GRAD   Wrapper   Original   Filtering
86%    82%       79%        78%

GRAD is significantly better than using the original or the filtered dataset

Wrapper approach is not

Page 17: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

GRAD Results

Importance of considering dependence

Distance function: f(r1, r2)

10 best by GRAD: P = 100%

10 most individually informative: P = 75.7%
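The two inputs of the distance function, the distances from a sample to its closest neighbor in each class, can be computed as below. The sketch stops at those two distances and does not attempt to reproduce f itself. Which distance is called r1 and which r2 is not specified here, so the names `r_same` and `r_other`, the Euclidean metric, and the toy data are assumptions.

```python
import math


def nearest_class_distances(index, samples, labels):
    """Distances from samples[index] to its nearest neighbor in its own class
    and to its nearest neighbor in any other class (Euclidean metric)."""
    own = labels[index]
    x = samples[index]
    r_same = min(
        math.dist(x, s)
        for i, s in enumerate(samples)
        if i != index and labels[i] == own
    )
    r_other = min(
        math.dist(x, s)
        for i, s in enumerate(samples)
        if labels[i] != own
    )
    return r_same, r_other


samples = [(0, 0), (1, 0), (5, 0), (6, 0)]
labels = ["normal", "normal", "cancer", "cancer"]
r_same, r_other = nearest_class_distances(1, samples, labels)
print(r_same, r_other)
```

A sample that sits much closer to its own class than to the other one is well separated in the chosen attribute subspace, which is the kind of evidence a function of these two distances can summarise.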

Page 18: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

GRAD Results

Scatter Plot of GRAD Attributes

Interdependency between two non-differentially expressed genes selected with GRAD

Two differentially expressed genes selected with GRAD.

Page 19: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

GRAD Results

Examples ordered by the value of the Distance Function

In the future, this ordering could help to estimate the degree of risk, support early diagnosis, and monitor a course of treatment

Page 20: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Induced Classifiers

[Figure: two decision trees, C4.5 induced on GRAD attributes and C4.5 induced using a wrapper approach]

Page 21: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Conclusions

1. Coping with redundancy and dependency between attributes is very important.

2. The GRAD algorithm is an effective means of selecting a subset of attributes from a very large initial set.

3. The results presented here are illustrative only.

4. We are open to cooperation with those interested in the biological interpretation of the results.

Page 22: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Questions

Page 23: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

As n increases, relevance grows at first; then the growth stops and relevance begins to decrease as less informative, noisy attributes are added.

The maximum of the quality curve indicates the optimal number of attributes. Only algorithms of the AdDel family have this property.

GRAD

Page 24: Predictive Analysis  of Gene Expression Data  from Human SAGE Libraries

Feature Selection

Wrapper:

• Considers the classifier while searching for the best subset

• Accuracy improves

• May overfit due to small sample sizes and huge dimensionality

• Computationally more expensive

Filtering:

• Potentially less accurate

• Faster: does not require the induction of a predictor

• Commonly preferred approach in bioinformatics
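The wrapper idea can be sketched as a greedy forward search in which every candidate subset is scored by the accuracy of a classifier trained on it. The experiments in the slides use C4.5; to keep the example short and self-contained, a 1-nearest-neighbor classifier with leave-one-out accuracy is swapped in here. The function names and the toy data are hypothetical.

```python
import math


def loo_accuracy(samples, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on `subset`."""
    correct = 0
    for i, x in enumerate(samples):
        xi = [x[f] for f in subset]
        nearest = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(xi, [samples[j][f] for f in subset]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(samples)


def wrapper_forward_selection(samples, labels, n_features, max_size):
    """Add one feature at a time while the classifier's accuracy keeps improving."""
    selected, best_acc = [], 0.0
    while len(selected) < max_size:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        f = max(candidates, key=lambda c: loo_accuracy(samples, labels, selected + [c]))
        acc = loo_accuracy(samples, labels, selected + [f])
        if acc <= best_acc:
            break  # no improvement: stop searching
        selected.append(f)
        best_acc = acc
    return selected, best_acc


# Feature 0 separates the classes; feature 1 is misleading noise.
samples = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (1.0, 4.9), (1.1, 1.1), (1.2, 9.1)]
labels = [0, 0, 0, 1, 1, 1]
selected, accuracy = wrapper_forward_selection(samples, labels, n_features=2, max_size=2)
print(selected, accuracy)
```

Each candidate subset requires a full round of classifier evaluations, which is where the extra computational cost comes from. With so few samples the accuracy estimate that drives the search is itself noisy, which is how a wrapper can end up fitting the selection to chance patterns.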