a platform for pattern discovery in sets of biological sequences c. alland, j. nicolas
Post on 19-Dec-2015
A platform for pattern discovery
in sets of biological sequences
C. Alland, J. Nicolas
Framework : bioinformatics platform of Genopole Ouest
Functional exploration
Proteomics
Sequencing / Genotyping
Biochips
Bioinformatics
Coordination Data Bases Bioinformatics Software High Performance Computing Teaching
PCIO : SunFire 6800, 56 UltraSparc III, 56 GB RAM
http://www.sb-roscoff.fr/BioInfo-GPO/
O. Collin
H. Leroy
Welcome page of the bioinformatics platform service
http://idefix.univ-rennes1.fr:8080/Serveur-GPO/
Software page of the bioinformatics platform service
http://idefix.univ-rennes1.fr:8080/Serveur-GPO/services.php
Aims of the project
Annotation of genomes : discovery of new genes/proteins, characterization of functional families
Experimental comparison of methods : choice of complexities and representations of patterns, copy/implementation of several algorithms
Practical tool : parameter tuning, filtering…
Set of biological sequences
Common characteristic or discriminant pattern
Architecture of the platform
Pattern Discovery Algorithms
Visualization of results
Alignment of sequences, search in banks, pattern filtering
Tool box
Supervisor
Search of patterns
Statistical analysis of inter-motif regions
Practical Use
Refinement
Interface
Welcome page of the pattern discovery service
Jonassen
Marsan
Pevzner
Regular languages
inferring methods
Brazma hierarchy for (generalized) regular patterns
• +J full regular languages (finite automata)
Example of the discovery of candidates in the defensin family
• Defensins are a major family of antimicrobial peptides found in mammals: cationic peptides 28 to 42 amino acids long, containing 3 intramolecular disulfide bonds.
• Starting point : a set of 30 sequences (all organisms combined), of which 4 are human.
• Aim : discovery of new candidates
Collaboration with GERM (C. Pineau, F. Bourgeon), directed by B. Jégou, staffed with 40 people and specializing in research on male reproduction in mammals.
Pratt : principle of the algorithm
1. Start from a pattern graph containing all the most specific allowed patterns that cover at least k of the n sequences in the training set;
2. A pattern search tree is explored, starting from the most general pattern (the empty pattern) and specializing it by adding allowed components (taken from the pattern graph, plus generalization operators) for as long as the patterns obtain a better score. Several scores and search strategies are available;
3. The most significant patterns are filtered, and a refinement phase may be applied to specialize flexible wildcards into ambiguous letters.
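The "at least k of the n sequences" criterion in step 1 can be illustrated with a small sketch. This is not Pratt's implementation: the function names, the toy sequences and the specificity score (number of non-wildcard positions) are assumptions made for illustration, and patterns are written as plain regular expressions made of letters and "." wildcards.

```python
import re

def support(regex, sequences):
    """Count the sequences that contain at least one match of the pattern."""
    rx = re.compile(regex)
    return sum(1 for seq in sequences if rx.search(seq))

def filter_by_support(candidates, sequences, k):
    """Keep the candidates matched by at least k sequences.

    Results are sorted most specific first. Specificity is a toy score
    (hypothetical, not one of Pratt's scores): the number of positions
    that are not the '.' wildcard.
    """
    kept = [(p, support(p, sequences)) for p in candidates]
    kept = [(p, n) for p, n in kept if n >= k]
    return sorted(kept, key=lambda pn: (-sum(c != "." for c in pn[0]), -pn[1]))

# Toy data, invented for the example.
sequences = ["ACDCG", "TCECA", "GCKCC", "AAAAA"]
candidates = ["C.C", "CDC", "A", "C..C"]
print(filter_by_support(candidates, sequences, k=3))
```

Raising k makes the filter stricter: with k=4 no candidate in this toy set survives.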
Pratt : three levels of use
1. Simple : most parameters are fixed or simplified;
2. Expert: all parameters available;
3. Meta : Pratt is applied to sequences of patterns.
Simple Pratt parameters
Simple Pratt results
Advanced Pratt parameters
Advanced Pratt results
Visualization of selected results
Meta Pratt
Search pattern in a databank
Results of the search in a databank
View of the search in a databank
Statistical Analysis of inter-motif regions
Results for refinement of patterns
Reverse Search in a Genome
Reverse Search in a Genome : principle
• From the patterns and knowledge of exon/intron splicing, a formal grammar may be inferred.
• Genomes are translated in the six reading frames and compiled into a suffix tree data structure.
• Syntactic analysis is carried out using operations on suffix trees and yields potential new candidates.
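The six-frame translation step can be sketched as follows. This is a minimal illustration assuming the standard genetic code; it is not the platform's code, the function names are hypothetical, and the suffix tree construction mentioned above is left out.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate codon by codon; codons with unknown bases (e.g. N) give 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

# Toy input, invented for the example.
for label, protein in six_frames("ATGGCCTGA").items():
    print(label, protein)
```

Stop codons are kept as "*" so that a later search step can decide whether a match is allowed to span them.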
Pattern : C-x(2,4)-G-x(1,3)-C-x(3,4)-C-x(7)-[AG]-[HKNRST]-C-x(5,6)-C-C
Result table (columns: Organism, Chromosome, Phase, Position, LengthOcc; Length Ch, preOcc, Occ, postOcc): six rows, each reporting "No match".
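The pattern above uses PROSITE-style syntax: "x" is a wildcard, "x(2,4)" a gap of 2 to 4 residues, and "[AG]" a choice between residues. As a rough sketch, such a pattern can be converted into a regular expression and searched in a protein string. The converter below is hypothetical, covers only the constructs listed here plus "{...}" exclusions, and the peptide is synthetic, not a real defensin.

```python
import re

# One pattern element: a symbol, optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for elem in pattern.split("-"):
        m = ELEMENT.fullmatch(elem)
        if m is None:
            raise ValueError(f"unsupported element: {elem!r}")
        sym, lo, hi = m.groups()
        if sym == "x":                      # wildcard: any residue
            sym = "."
        elif sym.startswith("{"):           # exclusion: any residue but these
            sym = "[^" + sym[1:-1] + "]"
        if lo is not None:                  # repetition count or range
            sym += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(sym)
    return "".join(parts)

pattern = "C-x(2,4)-G-x(1,3)-C-x(3,4)-C-x(7)-[AG]-[HKNRST]-C-x(5,6)-C-C"
regex = prosite_to_regex(pattern)
print(regex)

# Synthetic peptide built to satisfy the pattern (illustration only).
peptide = "MK" + "C" + "AA" + "G" + "A" + "C" + "AAA" + "C" + "A" * 7 + "AH" + "C" + "A" * 5 + "CC"
match = re.search(regex, peptide)
print(match.span() if match else "No match")
```

Applied to each of the six translated frames of a genome, the same search would report the frame and position of every occurrence, or "No match" when there is none.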
Conclusion / Perspectives
10 new potential defensins discovered
Importance of a complete environment : coupling highly expressive patterns with syntactic search in databanks
Current research : "meta level" using grammatical inference, to infer any regular language from a set of positive AND negative instances.
Open questions : better filtering of patterns, introduction of probabilities, long-distance interactions.