a platform for pattern discovery in sets of biological sequences c. alland, j. nicolas

Post on 19-Dec-2015


A platform for pattern discovery in sets of biological sequences

C. Alland, J. Nicolas

Framework: bioinformatics platform of Genopole Ouest

Functional exploration

Proteomics

Sequencing, Genotyping

Biochips

Bioinformatics

Coordination Data Bases Bioinformatics Software High Performance Computing Teaching

PCIO: SunFire 6800, 56 UltraSparc III, 56 GB RAM

http://www.sb-roscoff.fr/BioInfo-GPO/

O. Collin

H. Leroy

Welcome page of the bioinformatics platform service

http://idefix.univ-rennes1.fr:8080/Serveur-GPO/

Software page of the bioinformatics platform service

http://idefix.univ-rennes1.fr:8080/Serveur-GPO/services.php

Aims of the project

Annotation of genomes:

• Discovery of new genes/proteins

• Characterization of functional families

Experimental comparison of methods:

• Choice of complexities and representations of patterns

• Copy/implementation of several algorithms

Practical tool:

• Parameter tuning

• Filtering…

Set of biological sequences

Common characteristic or discriminant pattern

Architecture of the platform

Pattern Discovery Algorithms

Visualization of results

Alignment of sequences

Search in banks

Pattern filtering

Tool box

Supervisor

Search of patterns

Statistical Analysis of inter-motif regions

Practical Use

Refinement

Interface

Welcome page of the pattern discovery service

Jonassen

Marsan

Pevzner

Regular languages

inferring methods

Brazma hierarchy for (generalized) regular patterns

• +J full regular languages (finite automata)

Example of the discovery of candidates in the defensin family

• Defensins are a major family of antimicrobial peptides found in mammals: cationic peptides, 28 to 42 amino acids long, containing 3 intramolecular disulfide bonds.

• Starting point: a set of 30 sequences covering all organisms, 4 of them human.

• Aim : discovery of new candidates

Collaboration with GERM (C. Pineau, F. Bourgeon), a group directed by B. Jégou, staffed with 40 people and specialized in research on male reproduction in mammals.

Pratt: principle of the algorithm

1. One starts from a pattern graph containing all the most specific allowed patterns covering at least k of the n sequences in the training set;

2. A pattern search tree is explored, starting from the most general pattern (the empty pattern) and specializing it by adding allowed components (taken from the pattern graph, plus generalization operators) for as long as the patterns obtain a better score. Several scores and search strategies are available;

3. The most significant patterns are filtered, and a refinement phase may be applied to specialize flexible wildcards with ambiguous letters.
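The specialization loop in step 2 can be illustrated with a minimal Python sketch. It is not Pratt itself: it assumes an exhaustive breadth-first search with no pattern graph, no scoring function and no refinement phase, and the names `support` and `discover` and the parameters `max_gap` and `max_letters` are hypothetical. Patterns are written as regular expressions in which `.` stands for a wildcard position.

```python
import re
from typing import List


def support(pattern: str, seqs: List[str]) -> int:
    """Number of sequences containing at least one match of the pattern."""
    rx = re.compile(pattern)
    return sum(1 for s in seqs if rx.search(s))


def discover(seqs: List[str], k: int, max_gap: int = 2,
             max_letters: int = 4) -> List[str]:
    """Grow patterns from the empty pattern, keeping those matching >= k sequences.

    Illustrative assumption: every extension appends 0..max_gap wildcards
    followed by one letter, and all surviving patterns are kept at each level
    (Pratt instead prunes with a score and a search strategy).
    """
    alphabet = sorted(set("".join(seqs)))
    frontier = [""]  # the most general pattern: empty
    for _ in range(max_letters):
        next_frontier = []
        for p in frontier:
            # No leading wildcards in front of the first letter.
            gaps = range(max_gap + 1) if p else range(1)
            for gap in gaps:
                for letter in alphabet:
                    q = p + "." * gap + letter
                    if support(q, seqs) >= k:
                        next_frontier.append(q)
        if not next_frontier:
            break  # no further specialization keeps enough support
        frontier = next_frontier
    return frontier


if __name__ == "__main__":
    toy = ["ACDEF", "GACXEF", "ACQEW"]  # synthetic sequences, not real data
    print(discover(toy, k=3, max_gap=1, max_letters=3))
```

Because support can only decrease as a pattern is specialized, any pattern that falls below k sequences can be dropped without losing solutions, which is what keeps the search tree tractable.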

Pratt: three levels of use

1. Simple : most parameters are fixed or simplified;

2. Expert: all parameters available;

3. Meta : Pratt is applied to sequences of patterns.

Simple Pratt parameters

Simple Pratt results

Advanced Pratt parameters

Advanced Pratt results

Visualization of selected results

Meta Pratt

Search pattern in a databank

Results of the search in a databank

View of the search in a databank

Statistical Analysis of inter-motif regions

Results for refinement of patterns

Reverse Search in a Genome

Reverse Search in a Genome: principle

• From the patterns and knowledge of exon/intron splicing, a formal grammar may be inferred.

• Genomes are translated in the six reading frames and compiled into a suffix tree data structure.

• Syntactical analysis is done with the help of operations on suffix trees and yields potential new candidates.
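The six-frame translation mentioned above can be sketched in Python as follows. This is an illustration only: it assumes the standard genetic code and plain ACGT input, it leaves out the suffix tree and the grammar-based analysis entirely, and the names `translate` and `six_frames` are hypothetical rather than taken from the platform.

```python
# Standard genetic code, enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Reverse complement of an upper-case ACGT string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Protein translations for the three forward and three reverse frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTGA").items():
        print(name, protein)
```

Stop codons are kept as `*` so that a downstream search can decide whether a candidate may span them.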

To: jnicolas@irisa.fr

Pattern : C-x(2,4)-G-x(1,3)-C-x(3,4)-C-x(7)-[AG]-[HKNRST]-C-x(5,6)-C-C

Organism Chromosome Phase Position LengthOcc

Length Ch preOcc Occ postOcc

No match

No match

No match

No match

No match

No match
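The pattern shown above uses PROSITE-style notation, which maps directly onto a regular expression. The Python sketch below shows one way to do the conversion and to scan a protein sequence with the result. It handles only the elements that appear in this pattern (single letters, `x`, bracketed alternatives, and `(n)` or `(n,m)` repeats); the function name `prosite_to_regex` is hypothetical, and the test sequence is synthetic, built only to satisfy the pattern, not a real defensin.

```python
import re

PATTERN = "C-x(2,4)-G-x(1,3)-C-x(3,4)-C-x(7)-[AG]-[HKNRST]-C-x(5,6)-C-C"

# One pattern element: a letter, x, or [..], optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern (subset of the syntax) to a regex string."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        core = "." if core == "x" else core  # x means any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(core + repeat)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex(PATTERN)
    print(regex)
    # Synthetic sequence constructed to satisfy the pattern (not real data).
    synthetic = "MK" + "CAAGACAAACAAAAAAAAHCAAAAACC" + "LL"
    hit = re.search(regex, synthetic)
    print(hit.start() if hit else "No match")
```

A regular-expression scan like this is linear in the sequence length per pattern, whereas the suffix-tree approach described in the slides indexes the translated genome once and reuses it across patterns.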

Conclusion / Perspectives

10 new potential defensins discovered

Importance of a complete environment : coupling highly expressive patterns with syntactical search in banks

Current research : « meta level » using grammatical inference. Infer any regular language from a set of positive AND negative instances.

Open questions : Better filtering of patterns, introduction of probabilities, long distance interaction.
