A platform for pattern discovery in sets of biological sequences C. Alland, J. Nicolas

Post on 19-Dec-2015

Framework: bioinformatics platform of Genopole Ouest

Functional exploration, Proteomics, Sequencing, Genotyping, Biochips, Bioinformatics

Coordination, Data Bases, Bioinformatics Software, High Performance Computing, Teaching

PCIO: SunFire 6800, 56 UltraSparc III, 56 GB RAM

http://www.sb-roscoff.fr/BioInfo-GPO/

O. Collin

H. Leroy

Welcome page of the bioinformatics platform service

http://idefix.univ-rennes1.fr:8080/Serveur-GPO/

Software page of the bioinformatics platform service

http://idefix.univ-rennes1.fr:8080/Serveur-GPO/services.php


Aims of the project

• Annotation of genomes: discovery of new genes/proteins, characterization of functional families

• Experimental comparison of methods: choice of complexities and representations of patterns, copy/implementation of several algorithms

• Practical tool: parameter tuning, filtering…

Set of biological sequences

Common characteristic or discriminant pattern


Architecture of the platform

Pattern Discovery Algorithms

Visualization of results

Alignment of sequences, Search in banks, Pattern filtering

Tool box

Supervisor

Search of patterns

Statistical analysis of inter-motif regions

Practical Use

Refinement

Interface

Welcome page of the pattern discovery service

Jonassen

Marsan

Pevzner

Regular language inference methods

Brazma hierarchy for (generalized) regular patterns

• +J full regular languages (finite automata)


Example of the discovery of candidates in the defensin family

• Defensins are a major family of antimicrobial peptides found in mammals: cationic peptides 28-42 amino acids long, containing 3 intramolecular disulfide bonds.

• Starting point: a set of 30 sequences (all organisms included), 4 of them human.

• Aim : discovery of new candidates

Collaboration with GERM (C. Pineau, F. Bourgeon), directed by B. Jégou, staffed with 40 people and specialized in research on male reproduction in mammals.

Pratt: principle of the algorithm

1. One starts from a pattern graph containing all the most specific allowed patterns covering at least k of the n sequences in the training set;

2. A pattern search tree is explored, starting from the most general pattern (the empty pattern) and specializing it by adding allowed components (taken from the pattern graph, plus generalization operators) as long as the patterns obtain a better score. Several scores and search strategies are available;

3. The most significant patterns are filtered, and a refinement phase may be applied to specialize flexible wildcards into ambiguous letters.
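The general-to-specific search of step 2 can be illustrated with a toy sketch. This is not Pratt itself: it has no pattern graph, no flexible wildcards, no ambiguous letters and no scoring function, and the names `support` and `discover` and the parameters `max_letters` and `max_gap` are assumptions made for the illustration. It only shows the idea of starting from the empty pattern and specializing it while the pattern still covers at least k of the n sequences.

```python
import re


def support(pattern, sequences):
    """Number of sequences containing at least one match of `pattern`.

    Lowercase 'x' is a one-residue wildcard; uppercase letters match themselves.
    """
    regex = re.compile(pattern.replace("x", "."))
    return sum(1 for seq in sequences if regex.search(seq))


def discover(sequences, k, alphabet, max_letters=4, max_gap=2):
    """Toy general-to-specific search (hypothetical helper, not Pratt).

    Starts from the empty pattern and appends 'x'*gap + letter on the right
    as long as the result still matches at least k sequences. Patterns that
    cannot be extended any further are reported.
    """
    found = []

    def specialize(pattern, letters):
        extended = False
        if letters < max_letters:
            for gap in range(max_gap + 1):
                if not pattern and gap:
                    continue  # a pattern never starts with a wildcard
                for residue in alphabet:
                    candidate = pattern + "x" * gap + residue
                    if support(candidate, sequences) >= k:
                        extended = True
                        specialize(candidate, letters + 1)
        if pattern and not extended:
            found.append(pattern)

    specialize("", 0)
    # Most specific first: more letters, then shorter, then alphabetical.
    return sorted(set(found), key=lambda p: (-sum(c != "x" for c in p), len(p), p))


if __name__ == "__main__":
    toy = ["MCAAC", "KCGGC", "LCTTC"]  # made-up sequences, not real defensins
    print(discover(toy, k=3, alphabet="ACGKLMT"))  # → ['CxxC']
```

On the three made-up sequences only `C` is shared by all of them, and the only extension that keeps full coverage is a second `C` two positions further, so the search ends on `CxxC`.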

Pratt: three levels of use

1. Simple : most parameters are fixed or simplified;

2. Expert: all parameters available;

3. Meta : Pratt is applied to sequences of patterns.

Simple Pratt parameters

Simple Pratt results

Advanced Pratt parameters

Advanced Pratt results

Visualization of selected results

Meta Pratt

Search pattern in a databank

Results of the search in a databank

View of the search in a databank


Statistical Analysis of inter-motif regions

Results for refinement of patterns

Reverse Search in a Genome

Reverse Search in a Genome: principle

• From the patterns and knowledge of exon/intron splicing, a formal grammar may be inferred.

• Genomes are translated in all six reading frames and compiled into a suffix tree data structure.

• Syntactic analysis is carried out with operations on suffix trees and yields potential new candidates.
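The six-frame translation mentioned above can be sketched as follows. This covers only the translation step: the suffix tree and the grammar-based analysis used by the platform are not reproduced. The function names are assumptions made for the illustration, the standard genetic code is assumed, and incomplete or ambiguous codons are rendered as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the six translations keyed by strand and frame, e.g. '+1' or '-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGTGCTAA").items():
        print(name, protein)
```

Each translated frame can then be scanned for a protein pattern. A production tool would index the frames once, as the slides describe, rather than rescanning them for every query.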

To: [email protected]

Pattern : C-x(2,4)-G-x(1,3)-C-x(3,4)-C-x(7)-[AG]-[HKNRST]-C-x(5,6)-C-C

Organism Chromosome Phase Position LengthOcc

Length Ch preOcc Occ postOcc

No match

No match

No match

No match

No match

No match
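The pattern in the result above is written in PROSITE-style notation. As a minimal sketch of how such a pattern can be searched, the hypothetical helper below converts the subset of the notation used here (single residues, `x` wildcards, bracketed alternatives and `(n)` or `(n,m)` repeat counts) into a Python regular expression. Other PROSITE features, such as excluded residues or anchors, are not handled, and this is not the search engine used by the platform.

```python
import re

# One pattern element: a residue, 'x', or a bracketed set, with optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern (subset only) to a regular expression string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        token = "." if residue == "x" else residue
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


if __name__ == "__main__":
    defensin = "C-x(2,4)-G-x(1,3)-C-x(3,4)-C-x(7)-[AG]-[HKNRST]-C-x(5,6)-C-C"
    print(prosite_to_regex(defensin))
    # → C.{2,4}G.{1,3}C.{3,4}C.{7}[AG][HKNRST]C.{5,6}CC
```

The resulting string can be passed to `re.search` or `re.finditer` on a protein sequence, for example on each frame of a translated genome.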


Conclusion / Perspectives

10 new potential defensins discovered

Importance of a complete environment : coupling highly expressive patterns with syntactical search in banks

Current research: "meta level" using grammatical inference, i.e. inferring any regular language from a set of positive AND negative instances.

Open questions: better filtering of patterns, introduction of probabilities, long-distance interactions.
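The slides do not say which inference algorithm is used for the meta level. As a hedged illustration of the setting only, the sketch below builds a prefix tree acceptor from positive and negative instances, which is the usual starting point of state-merging inference methods. It performs no generalization: words outside the sample stay undecided. The names `build_pta` and `classify` are assumptions.

```python
def build_pta(positives, negatives):
    """Build a prefix tree acceptor labelled with positive and negative words.

    Returns (transitions, labels): transitions maps (state, symbol) to a state,
    labels maps a state to True (accepting), False (rejecting) or None (unknown).
    State 0 is the root, i.e. the empty prefix.
    """
    transitions = {}
    labels = {0: None}
    sample = [(w, True) for w in positives] + [(w, False) for w in negatives]
    for word, label in sample:
        state = 0
        for symbol in word:
            key = (state, symbol)
            if key not in transitions:
                new_state = len(labels)
                transitions[key] = new_state
                labels[new_state] = None
            state = transitions[key]
        if labels[state] is not None and labels[state] != label:
            raise ValueError(f"{word!r} is both a positive and a negative instance")
        labels[state] = label
    return transitions, labels


def classify(word, transitions, labels):
    """Return True, False, or None when the sample does not decide the word."""
    state = 0
    for symbol in word:
        state = transitions.get((state, symbol))
        if state is None:
            return None
    return labels[state]


if __name__ == "__main__":
    transitions, labels = build_pta(positives=["CAC", "CC"], negatives=["CA"])
    for word in ("CAC", "CA", "C", "G"):
        print(word, classify(word, transitions, labels))
```

An inference method would then merge states of this tree, rejecting any merge that makes a negative word accepted, to obtain a smaller automaton that generalizes beyond the sample.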
