SFselect-E: Ensemble classification techniques for detecting signatures of natural selection from site frequency spectrum (Andrew Stewart, JHU, Spring 2014)

Upload: andrew-stewart

Post on 11-Aug-2014


Page 1: Ensemble classification techniques for detecting signatures of natural selection from site frequency spectrum

SFselect-E
Ensemble classification techniques for detecting signatures of natural selection from site frequency spectrum

Andrew Stewart
JHU Spring 2014

Page 2: Introduction

● Searching for signatures of selection
● SFselect (Ronen, 2013)
● Multi-K (Whiteman, 2010)
● Introducing: SFselect-E

Page 3: Contents

1) The selection classification problem
2) Overview of SVM classification with SFselect
3) Ensemble preprocessing with Multi-*
4) Generating model variance
5) Introducing SFselect-E
6) Experimental Results
7) Conclusion

Page 4: Natural selection

● Population genetics
● Evolution: Descent with modification
● Selection
  o Directional: Positive, Negative
  o Neutral

Page 5: Classifying natural selection

● Record of demographic history
● Increased LD, reduced variation
● Site frequency spectrum
  o e.g., Tajima's D
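Tajima's D is one summary statistic that can be computed directly from the site frequency spectrum. The sketch below uses the standard formula on an unfolded SFS; the function name `tajimas_d` and the input layout (entry `i-1` holds the number of sites whose derived allele is seen in `i` of `n` sampled chromosomes) are assumptions made for illustration and are not taken from the slides.

```python
from math import sqrt

def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS.

    sfs[i-1] = number of segregating sites whose derived allele
    appears in i of the n sampled chromosomes, for i = 1..n-1.
    """
    n = len(sfs) + 1          # sample size implied by the spectrum length
    S = sum(sfs)              # number of segregating sites
    if n < 4 or S == 0:
        raise ValueError("need n >= 4 and at least one segregating site")

    # Mean pairwise differences (pi) computed from the spectrum.
    pairs = n * (n - 1) / 2
    pi = sum(count * i * (n - i) for i, count in enumerate(sfs, start=1)) / pairs

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between pi and Watterson's estimator, normalized.
    return (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))
```

A spectrum dominated by rare alleles gives a negative value, and one dominated by intermediate-frequency alleles gives a positive value.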

Page 6: Background: SFselect (Ronen, 2013)

● Scaled Site Frequency Spectrum
● Linear kernel Support Vector Machines
● Trained on extensive population simulations
  o SFselect, SFselect-s, SFselect-XP

Page 7: Background: Multi-K Clustering

● Bootstrap aggregation
  o Random sampling
  o Aggregation method
  o Highly accurate, but computationally expensive
● Multi-K
  o Iterative K-means clustering
  o Classify new points based on centroid proximity
  o Optimize K_end with cross validation
● Multi-KX, Multi-SVD

Page 8: Generating ensemble diversity

● Generating ensemble diversity
  o Generalizers
  o Specializers
● Applied to SFS classification:
  o Improve overall classification accuracy?
  o Produce classifiers robust to wide variations in genetic diversity

Page 9: SFselect-E

● SFselect General SVM
● SFselect-E: Bagging approach
● SFselect-E: Multi-K approach

Page 10: Population simulations

● 1000 individuals
● s = [0.005, 0.01, 0.02, 0.04, 0.08]
● t = [0, 50, 150, 200, …, 3500, 4000]
● n = 500
● labels = [-1, 1] (neutral, selected)

Page 11: Training the standard model

● Compute allele frequencies
● Scale, normalize, bin into vectors
● Trained linear kernel SVM on entire dataset
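The first two steps (allele frequencies, then scale / normalize / bin) can be sketched as below. This is a minimal illustration, not the SFselect code: the function names, the choice to scale entry `i` by `i` (so that a neutral spectrum, which is expected to fall off as 1/i, becomes roughly flat), and the equal-width frequency bins are all assumptions.

```python
def sfs_from_counts(derived_counts, n):
    """Tally an unfolded SFS from per-site derived-allele counts.

    Entry i-1 is the number of sites whose derived allele is seen
    i times among n sampled chromosomes. Monomorphic sites are skipped.
    """
    sfs = [0] * (n - 1)
    for c in derived_counts:
        if 0 < c < n:
            sfs[c - 1] += 1
    return sfs

def scaled_binned_sfs(sfs, n_bins):
    """Scale, bin, and normalize an SFS into a fixed-length feature vector."""
    n = len(sfs) + 1
    # Assumed scaling: multiply the count at frequency class i by i.
    scaled = [i * count for i, count in enumerate(sfs, start=1)]
    bins = [0.0] * n_bins
    for i, value in enumerate(scaled, start=1):
        # Equal-width bins over allele frequency i/n (integer arithmetic).
        bins[(i * n_bins) // n] += value
    total = sum(bins)
    # Normalize so the vector sums to 1 (left as zeros if the SFS is empty).
    return [v / total for v in bins] if total else bins

# Hypothetical toy input: derived-allele counts at five sites, n = 10.
sfs = sfs_from_counts([1, 1, 2, 5, 9], n=10)
vec = scaled_binned_sfs(sfs, n_bins=3)
```

Vectors of this kind, one per simulated region with a label of -1 or 1, would then be the input to the linear-kernel SVM.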

Page 12: Computational limits

● Very time intensive
  o Population simulations
  o Vectorization of SFS
  o Training SVMs on SFS
● Simulations grouped/indexed by replicate
  o Proved a major limitation on ensemble sampling

Page 13: SFselect-E: Bagging approach

● Random sampling
  o k = 100, n = 200
● Aggregation
  o Majority voting
● Validation
  o Cross validation
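A minimal sketch of bagging with majority voting follows. It reads k as the number of bootstrap models and n as the number of samples drawn (with replacement) for each, which is an interpretation of the slide. To stay self-contained it uses a nearest-centroid rule as a stand-in base learner; SFselect-E uses SVMs, and all function names here are hypothetical.

```python
import random
from collections import Counter

def train_centroid(X, y):
    """Stand-in base learner (NOT an SVM): store the mean vector of each class."""
    sums, counts = {}, Counter(y)
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict_centroid(model, x):
    """Return the label whose class mean is closest to x."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: dist2(model[label]))

def train_bagged(X, y, k=100, n=200, seed=0):
    """Train k base models, each on n points sampled with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        idx = [rng.randrange(len(X)) for _ in range(n)]
        models.append(train_centroid([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagged(models, x):
    """Majority vote over the base models (ties go to the label seen first)."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]
```

With an even k a tie is possible, so an odd number of models or an explicit tie-break rule may be preferable in practice.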

Page 14: SFselect-E: Multi-K approach

● Iterative K-means clustering of D
● K_start = 2 to K_end = 8
● Train on each K
● Cross validation to determine optimal K_end
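One possible reading of these steps is sketched below: for each K from K_start to K_end, cluster D with K-means, fit one model per cluster, route a new point to the model of its nearest centroid, and combine the answers across K by vote. This is an interpretation of the slide bullets, not the published Multi-K algorithm. The per-cluster "model" is reduced to the cluster's majority label as a stand-in for an SVM, and the vote across K is an assumption.

```python
from collections import Counter

def dist2(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda c: dist2(centroids[c], x))

def kmeans(X, k, iters=20):
    """Plain K-means with a deterministic init (k points spread through X)."""
    centroids = [list(X[(i * len(X)) // k]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in X:
            groups[nearest(centroids, x)].append(x)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def train_multi_k(X, y, k_start=2, k_end=8):
    """For each K, cluster the data and fit one stand-in model per cluster."""
    layers = []
    for k in range(k_start, k_end + 1):
        centroids = kmeans(X, k)
        votes = [Counter() for _ in range(k)]
        for x, label in zip(X, y):
            votes[nearest(centroids, x)][label] += 1
        # Stand-in per-cluster model: the majority label (None if cluster is empty).
        labels = [v.most_common(1)[0][0] if v else None for v in votes]
        layers.append((centroids, labels))
    return layers

def predict_multi_k(layers, x):
    """Route x to its nearest cluster at every K, then vote across K."""
    ballot = Counter()
    for centroids, labels in layers:
        label = labels[nearest(centroids, x)]
        if label is not None:
            ballot[label] += 1
    return ballot.most_common(1)[0][0]
```

Choosing K_end by cross validation would mean repeating this training for several candidate values of `k_end` and keeping the one with the best held-out accuracy.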

Page 15: Experimental analysis: K-fold C.V.

How to cross validate an ensemble?

● For each fold K_i, hold out K_i and train on D - K_i
● Test classifier on K_i
● Report mean accuracy (fraction of correct classifications)
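The hold-out loop can be sketched as a generic routine that accepts any training and prediction function, so the same code would apply to a single model or to a whole ensemble trained on D - K_i. The names `k_fold_accuracy`, `train_threshold`, and `predict_threshold` are hypothetical, and the threshold learner is only a toy used to make the example runnable.

```python
import random

def k_fold_accuracy(X, y, train_fn, predict_fn, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k disjoint folds covering all points
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in idx if i not in held]
        # Train on everything except the held-out fold.
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        # Test on the held-out fold only.
        correct = sum(predict_fn(model, X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

def train_threshold(X, y):
    """Toy learner: midpoint between the class means of the first feature."""
    neg = [x[0] for x, label in zip(X, y) if label == -1]
    pos = [x[0] for x, label in zip(X, y) if label == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def predict_threshold(threshold, x):
    return 1 if x[0] > threshold else -1
```

For an ensemble, `train_fn` would build every component model from the training folds and `predict_fn` would return the aggregated vote.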

Page 16: Experimental analysis: C.V. Results

Model                     Accuracy
Standard SFselect SVM     74.28
Bagged SFselect-E SVM     73.86
Multi-K SFselect-E SVM    NA

Page 17: Experimental analysis: Time series

● For each t in [0, 4000], test D_t
  o Neutral vs Selected
  o Dependent T-Test on time sample accuracies

p-value of 2.0136 × 10^-24
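A dependent (paired) t-test compares two sets of accuracies measured at the same time points by testing whether their mean difference is zero. The sketch below computes only the test statistic and degrees of freedom with the standard library; the accuracy values are made up for illustration and are not the experiment's data. `scipy.stats.ttest_rel` can be used when a p-value is needed.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)                     # sample standard deviation of differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1

# Hypothetical per-time-point accuracies for two classifiers.
acc_model_a = [0.70, 0.75, 0.80, 0.85]
acc_model_b = [0.68, 0.74, 0.77, 0.80]
t_stat, dof = paired_t_statistic(acc_model_a, acc_model_b)
```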

Page 18: Conclusions

● SFselect-E consistent with SFselect
  o No separation of specialized classifiers
  o Smaller subsets?
● Limitations of structure of training data as implemented in SFselect
● Model variance best obtained by separating by s, t.

Page 19: Conclusions

● Computing time for training a major obstacle
● Multi-SVD preprocessing could reduce training time
● Refactoring required first

Page 20: Future work

● Refactor to treat populations independently
● Bagging: random sampling across s, t
● Multi-K: hierarchical clustering of training data
● Multi-KX, Multi-SVD
● SFselect-s as component models

Page 21: Future work

● Cross population: SFselect-XP, XP-SFS
● Cross species: SFS + conserved regions
  o XS-SFS
● Tune ensemble diversity to population genetic diversity