October 13-14 Novosibirsk,Indian_Russian Meeting, 2008

Combined network of transcription regulation and protein-protein interaction for inferring genome-wide functional linkages

Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia

Russian-Indian Collaborative Project

State Research Center of Genetics and Selection of Industrial Microorganisms, Moscow, Russia

Prof. Shekhar Mande

Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences

Prof. Mikhail Gelfand

Comparative genomics can show gene functional linkages

• Co-occurrence in known operons

• Minimal distance between a pair of genes in a genome (unknown operons)

• Phylogenetic profiling (similar behaviour of a gene pair in several genomes)

266 linear genomes allow functional linkages between genes to be evaluated by statistical methods

Yellaboina et al. Genome Research, 2007 17: 527-535
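
As an illustration of phylogenetic profiling, the sketch below compares presence/absence profiles of two genes across a set of genomes. The Jaccard index is used here only as a simple stand-in; it is an assumption of this example and not necessarily the statistic used in the cited paper. The function name and the toy profiles are hypothetical.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (1 = gene present)."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Toy profiles over four genomes (hypothetical data).
gene_x = [1, 1, 0, 1]
gene_y = [1, 1, 0, 0]
print(profile_similarity(gene_x, gene_y))
```

Gene pairs whose profiles are more similar than expected by chance are candidates for a functional linkage; with hundreds of genomes the significance of the similarity can be assessed statistically.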

What are the mechanisms behind “functionally related genes”?

Protein-protein interactions obtained by high-throughput experimental methods correlate well with the functional relatedness of genes inferred by bioinformatics.

Yellaboina et al. Genome Research, 2007 17: 527-535

Protein-protein interaction…

BUT …

• Metabolic pathways

• Transcription co-regulation

• Other extravagant mechanisms (direct interactions in genome)

• Or several simultaneously

Bioinformatics of transcription regulation of bacterial genes

• Specific promoters

• RNA-based switches

• Specific protein transcription regulatory factors (TF)

TF-mediated regulation is often responsible for regulation of complex processes

Cross-talk

Non-trivial concentration dependence (quorum sensing)

DNA-signals responsible for TF binding

Bacterial regulatory sites:

1. Usually long and divergent

2. Positioning relative to the promoter is often important

3. Sites for cross-talking proteins may overlap

Integrated database

Functionally related genes

Metabolism-associated genes

Transcription-associated genes

Protein interactions

Bioinformatics for hierarchy of organization levels of biosystems

12 program components integrated into a single system:

TandemSWAN, BASIO, ALEX, SeSiMCMC, STRUSWER, STRUDL, RNA-MBFS, Prophet, Oligomeasure, PSACR, Combinator, KMD

[Diagram labels: DNA sequence; RNA (sequence, structure); protein (sequence, structure, complex); variation between species and individuals in populations]

Some technical points

Two integrated databases

• Molecular entities

• Genome annotations

Pathway Studio, Ariadne Genomics, Inc.

Original database of genome annotation and transcription regulation

Integration of data on binding sites and genome annotations

• All experimental and predicted binding sites and other segment data are mapped onto the genome.

• Filtering of duplicate entries and obviously irrelevant sites in EcoCyc

• Site positioning relative to other genomic structures (repeats, genes)

• Motifs are represented as lists of allowed words

• Different experimental sources, as well as comparative genomics studies, are used for motif construction
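
A motif stored as a list of allowed words can be mapped onto a genome by a direct scan of both strands. The following is a minimal sketch, assuming all words have the same length and the sequence uses only the letters A, C, G and T; the function names are hypothetical and are not part of the described system.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def scan_word_list(genome, words):
    """Report (position, strand, matched text) for every occurrence of a word.

    Assumes all words share one length. A '-' hit means the window equals
    the reverse complement of an allowed word.
    """
    length = len(words[0])
    forward = set(words)
    reverse = {reverse_complement(w) for w in words}
    hits = []
    for pos in range(len(genome) - length + 1):
        window = genome[pos:pos + length]
        if window in forward:
            hits.append((pos, "+", window))
        if window in reverse:
            hits.append((pos, "-", window))
    return hits


print(scan_word_list("TTACGGACCGTAA", ["ACGG"]))
```

Positions found this way can then be compared with annotated genes and repeats to describe site positioning.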

Viewpoints

• A database that contains experimental data and computational predictions in an integrated manner

• XML format for organizing data flow

• Possible distributed computations

• Possible platform independence (Ruby & Java)

Unified storage for experimental data from different sources

SELEX

Comparative genomics

Footprinting

Motif models

Genome

small-BiSMark: XML-based small language for Biological Sequence Markup

Database engine: filtering identical and irrelevant motifs, preprocessing

Identification of optimal binding motifs using stochastic optimization

• SeSiMCMC: a Gibbs-sampler-based algorithm for binding motif identification

• Multiple local alignment of candidate genomic sequences

• Optimization of the motif length

• Modeling of dyads (palindromes and tandem repeats) in motif structures

• Priors for absent sites and for sites on the forward and backward DNA strands

SeSiMCMC result on a TRANSFAC dataset

Known binding site motif (SELEX, a sequence logo for Sp1 factor binding site)
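
To make the idea of Gibbs sampling for motif identification concrete, here is a deliberately simplified sketch. It assumes one site per sequence, a fixed motif width, the forward strand only, and a uniform background; SeSiMCMC itself, as listed above, additionally optimizes motif length, models dyads, and uses priors for absent sites and strands. All names and parameters below are hypothetical.

```python
import random


def build_profile(sites, width, pseudocount=0.5):
    """Column-wise letter frequencies of aligned sites, with pseudocounts."""
    columns = []
    for j in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        columns.append({base: c / total for base, c in counts.items()})
    return columns


def gibbs_motif_search(seqs, width, iterations=200, seed=0):
    """Basic site sampler: returns site start positions and the final profile."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Profile from all sequences except the one being resampled.
            others = [s[p:p + width]
                      for k, (s, p) in enumerate(zip(seqs, positions)) if k != i]
            profile = build_profile(others, width)
            # Likelihood ratio (motif vs. uniform background) for each start.
            weights = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for j in range(width):
                    ratio *= profile[j][seq[start + j]] / 0.25
                weights.append(ratio)
            positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
    sites = [s[p:p + width] for s, p in zip(seqs, positions)]
    return positions, build_profile(sites, width)
```

Because the procedure is stochastic, it is usually run several times from different seeds and the best-scoring alignment is kept.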

SeSiMCMC sampler page

Identification of spaced and overlapping motifs

Regulatory regions: Different types of architecture

ArcA sites

Promoter

Homotypic Clusters

Clusters aligned with promoters

Overlapping and spaced binding sites

Statistical validation of selectivity and identification of optimal binding motifs

• AhoPro algorithm for calculating the P-value of site binding

• Comparison of different binding motifs

• Using different motif models

• Selection of the optimal motif

• Direct calculation of motif selectivity for different specificity levels

Motif model support includes:

• Positional weight matrices

• Word lists

• IUPAC strings
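
The word-list and IUPAC-string models are directly related: an IUPAC string is a compact notation for a list of words. The sketch below expands a string written in the standard IUPAC nucleotide codes into the words it allows. The function name is hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_iupac(pattern):
    """Return every concrete DNA word matched by an IUPAC string."""
    return ["".join(letters) for letters in product(*(IUPAC[c] for c in pattern))]


print(expand_iupac("AY"))
```

The number of words grows quickly with the number of ambiguous positions, which is one reason to keep both representations available.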

Aho-Pro algorithm: exact P-value calculation

Aho-Corasick pattern matching automaton for two motifs:

H1 = {ACC, AACC, ACCTT}

H2 = {AT, CTAT, CT}

[Diagram: automaton trie growing from the root, with edges labelled A, C and T]

• Each state at the i-th step is a class $C_i(r_1, r_2; q)$, where

i is the step (text length),

$r_1$ is the number of occurrences of the first motif,

$r_2$ is the number of occurrences of the second motif,

$q$ is the longest suffix of the text that belongs to the prefix closure of $H_1 \cup H_2$.

• Probabilities to be at each state (probability transducer): $\Pr(C_{i+1}(r_1, r_2; q'))$

We developed an algorithm for exact p-value calculation for multiple occurrences of multiple motifs.

Boeva, V., J. Clement, M. Regnier, M.A. Roytberg, and V.J. Makeev. 2007. Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules. Algorithms Mol Biol 2: 13.

AhoPro – p-value calculator!
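
The principle of exact P-value calculation over a pattern-matching automaton can be shown with a small dynamic program. This is a simplified sketch and not the AhoPro code: it handles a single set of words with one occurrence counter, assumes an i.i.d. background (uniform by default), and builds the automaton naively. The function names are hypothetical.

```python
def build_automaton(words, alphabet="ACGT"):
    """States are all prefixes of the words; a transition goes to the longest
    suffix of (state + letter) that is itself a prefix of some word."""
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}

    def longest_suffix(text):
        for start in range(len(text) + 1):
            if text[start:] in prefixes:
                return text[start:]

    step = {(s, a): longest_suffix(s + a) for s in prefixes for a in alphabet}
    # Number of motif words that end exactly when the automaton is in a state.
    hits = {s: sum(s.endswith(w) for w in words) for s in prefixes}
    return step, hits


def p_value(words, text_length, min_occurrences, probs=None):
    """P(at least min_occurrences occurrences, overlaps counted) in random text."""
    probs = probs or {base: 0.25 for base in "ACGT"}
    step, hits = build_automaton(words, "".join(probs))
    # Distribution over (automaton state, occurrence count capped at the target).
    dist = {("", 0): 1.0}
    for _ in range(text_length):
        new_dist = {}
        for (state, count), p in dist.items():
            for letter, p_letter in probs.items():
                nxt = step[state, letter]
                capped = min(min_occurrences, count + hits[nxt])
                new_dist[nxt, capped] = new_dist.get((nxt, capped), 0.0) + p * p_letter
        dist = new_dist
    return sum(p for (_, count), p in dist.items() if count == min_occurrences)


print(p_value(["ACC", "AACC", "ACCTT"], 100, 3))
```

Tracking a pair of counters, one per motif, instead of a single counter gives the heterotypic case described in the cited paper.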

A data flow for motif model construction

[Data-flow diagram, labels as recovered:]

• Footprinting results: genome-mapped, with correct flanking sequences

• ChIP-chip: raw long sequences

• SELEX: short sites or site parts; may be used as mask, to be used as initial mask

• SeSiMCMC (two runs), each with additional motif length estimation

• Motif model (one per SeSiMCMC run)

• Verification

• Example: Sp1 binding site

Obtaining clean data from specific sources

Using TRANSFAC as the base data source for binding sites of a selected factor

[Pipeline diagram, labels as recovered:]

• TRANSFAC: entries with binding sites of the factor

• Each TRANSFAC entry: footprinted sequence, nearest gene, chromosome

• Filtering ambiguous entries

• Extracting the chromosome region containing the footprinted sequence, with 5000 bp flanks on each side

• Dataset: footprinted sequences with flanks

• small-BiSMark, database engine
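
The region-extraction step can be sketched as a small helper that cuts a footprinted site out of a chromosome together with its flanks. This is an illustrative assumption about how such a step might look, with 0-based half-open coordinates and a hypothetical function name; the 5000 bp default follows the flank size shown in the diagram.

```python
def extract_with_flanks(chromosome, start, end, flank=5000):
    """Return (region, offset): the site chromosome[start:end] plus flanks,
    clipped to the chromosome ends, and the site's offset inside the region."""
    left = max(0, start - flank)
    right = min(len(chromosome), end + flank)
    return chromosome[left:right], start - left


chromosome = "A" * 10 + "GATTACA" + "C" * 10
region, offset = extract_with_flanks(chromosome, 10, 17, flank=3)
print(region, offset)
```

Returning the offset keeps the position of the footprinted sequence known after extraction, so the flanks can later be distinguished from the site itself.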

A verification procedure for created motif model

New motif model

Testing sequence set

Wisely chosen set of motif-containing sequences

AhoPro

Choosing optimal motif specificity

Selectivity testing

Processed experimental data (via SeSiMCMC)

Newly discovered motif (by SeSiMCMC or ScanSeq)

Comparative motif analysis

Testing sequence set

Footprinting, ChIP-chip data

Randomly generated set

Known motif model 1

New motif model

Known motif model 2

Known motif model 3

AhoPro

Selectivity testing

Comparative analysis

Selecting best motif model
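
A very reduced form of such a comparison is sketched below: each motif model, given here as a word list, is scored by how often it hits motif-containing sequences versus background sequences. The scoring rule (hit rate on positives minus hit rate on background) is an assumption made for this example and is not the P-value-based selectivity computed by AhoPro; all names and data are hypothetical.

```python
def hit_rate(sequences, words):
    """Fraction of sequences containing at least one allowed word."""
    return sum(any(w in s for w in words) for s in sequences) / len(sequences)


def select_best_model(models, positives, background):
    """Pick the model maximizing (hit rate on positives - hit rate on background)."""
    def score(name):
        return hit_rate(positives, models[name]) - hit_rate(background, models[name])
    return max(models, key=score)


positives = ["AAGGGCGGGG", "TTGGGCGGAA"]    # toy motif-containing set
background = ["ATATATAT", "CCGGTTAA"]       # toy random set
models = {"strict": ["GGGCGG"], "loose": ["GG"]}
print(select_best_model(models, positives, background))
```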

Genome-wide motif distribution mapping

Known motif model 1

New motif model

Known motif model 2

Known motif model 3

Genome-wide, globally positioned on the chromosome: sites of different quality

Possible clustering of sites:

• different models for one factor

• best models for different factors

Positioning within specific DNA regions: CRM, CpG islands, etc.

Distributed computations support

• «Theatre manager»: possible management of multiple Opera Houses for grid computing support (request redirecting and resource balancing only)

• «Opera House»: a multi-process remote task execution control service on a single physical machine

• «Opera libretto»: a specified scenario to be executed

• Main database

Overview of the technical realization of the complex

• Database level: MySQL

• Server level: Ruby-powered cross-platform DRb-based server

• Data-workflow level: Ruby-powered cross-platform scenario scripts

• Application level: SeSiMCMC and AhoPro, high-speed C++ code

• Web-interface level: Ruby-based CGI, Ruby on Rails in future

• small-BiSMark processing: Ruby and Java-based tools (REXML, JAXP, SAXON)

THE END

Acknowledgments

• GosNIIgenetika group:

- Vsevolod Makeev

- Alexander Favorov

- Elizaveta Permina

- Valentina Boeva

- Ivan Kulakovsky

- Dmitry Malko

• Financial support:

- Russian Federation State Innovation Project

- Russian Foundation for Basic Research

- DST India

Biological data analysis components

• DNA analysis:

• Basio: large-scale sequence analysis, compositional segmentation

• TandemSWAN: tandem repeats in DNA sequences

• SeSiMCMC: DNA motif identification

• Oligomeasure: DNA structure from DNA sequence

TandemSWAN

• Tandem repeats with substitutions but without indels, with control of the statistical significance of each repeat

tttatttatttatttatttatttatttatttatttatttatttatttatttatttatttattta

Finds micro- and minisatellites with substitutions

BASIO: BAyesian Segmentation Information Optimizer

• Parses DNA into segments of uniform composition

• Uses Bayesian optimization over all possible segment configurations

• Uses the Bayesian Information Criterion (BIC) to control segmentation resolution

[Pipeline diagram, labels as recovered:]

• Input sequence

• Split: sequence preprocessing

• Basio: basic segmentation algorithm, producing a list of segments

• Filter: remove short or redundant segments

• Report: select the appropriate output format and format the output

Example: atcatatca|ggcggcgcagccgcagcc|tctcttcttc
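
The role of a BIC-style penalty in controlling segmentation resolution can be illustrated with a greedy recursive splitter. This is a swapped-in, much simpler technique than the optimization over all segment configurations that Basio performs: it tries every single cut, keeps the best one only if the log-likelihood gain exceeds the penalty, and recurses. The penalty form and all names are assumptions of this sketch.

```python
import math
from collections import Counter


def log_likelihood(seq):
    """Maximum log-likelihood of a segment under its own letter composition."""
    n = len(seq)
    return sum(c * math.log(c / n) for c in Counter(seq).values())


def segment(seq, offset=0, alphabet_size=4):
    """Return sorted boundary positions found by greedy recursive splitting."""
    n = len(seq)
    if n < 2:
        return []
    whole = log_likelihood(seq)
    best_gain, best_pos = -1.0, None
    for pos in range(1, n):
        gain = log_likelihood(seq[:pos]) + log_likelihood(seq[pos:]) - whole
        if gain > best_gain:
            best_gain, best_pos = gain, pos
    # BIC-style penalty: (alphabet_size - 1) new frequencies plus one boundary.
    penalty = 0.5 * alphabet_size * math.log(n)
    if best_gain <= penalty:
        return []
    return (segment(seq[:best_pos], offset, alphabet_size)
            + [offset + best_pos]
            + segment(seq[best_pos:], offset + best_pos, alphabet_size))


print(segment("a" * 20 + "g" * 20))
```

A larger penalty yields fewer, longer segments; a smaller one yields finer segmentation.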

SeSiMCMC – Sequence Similarity Markov Chain Monte Carlo

• SeSiMCMC: a Gibbs-sampler-based algorithm for binding motif identification

• Multiple local alignment of candidate genomic sequences

• Optimal identification of the motif length

• Analysis of symmetries in motif structures

• Priors for absent sites and for sites on the forward and backward DNA strands

SeSiMCMC result on a TRANSFAC dataset

Known binding site motif (SELEX, a sequence logo for Sp1 factor binding site)

ALEX: Alignment of Exons

Identifies exons in a genomic alignment

CTGACGCACAGACCCAAGTGACGACGAGGCCGA

CGGACGGACAGACCCAAGTGACGACGAGGCCGA

REG BEG M | REG END M | REG BEG H | REG END H | Glob | Best Exon Beg | Best Exon End | Best Exon Type | Best Start Exon Beg | Best Start Exon End | Best Inner Exon Beg | Best Inner Exon End | Best Stop Exon Beg | Best Stop Exon End | Best One Exon Beg | Best One Exon End
1425 | 2127 | 126 | 831 | 16.85 | 1779 | 1882 | 1 | 1779 | 1882 | 1752 | 1882 | 1752 | 1886 | 0 | 0
2373 | 2896 | 1202 | 1741 | 36.65 | 2452 | 2705 | 0 | 2398 | 2705 | 2452 | 2705 | 2849 | 2896 | 0 | 0
3279 | 3544 | 2238 | 2503 | 30.56 | 3279 | 3544 | 0 | 3512 | 3544 | 3279 | 3544 | 3279 | 3515 | 0 | 0
3827 | 4806 | 2815 | 3785 | 59.30 | 4002 | 4490 | 2 | 4017 | 4061 | 4002 | 4061 | 4002 | 4490 | 4017 | 4490
5051 | 5436 | 4059 | 4454 | 8.37 | 5051 | 5131 | 2 | 0 | 0 | 0 | 0 | 5051 | 5131 | 0 | 0

PROTEIN ANALYSIS

• Struswer: Smith-Waterman aligner that takes secondary structure into account

• Prophet: secondary structure predictor based on discriminant analysis

• PSIC – multiple alignment with homologs

[Per-residue table: position, residue, secondary-structure state, three state scores]

 1  N  C  0.990  0.007  0.006
 2  A  C  0.900  0.040  0.044
 3  K  C  0.769  0.108  0.093
 4  L  C  0.666  0.315  0.074
 5  K  C  0.539  0.489  0.048
 6  P  H  0.405  0.639  0.033
 7  V  H  0.100  0.875  0.030
 8  Y  H  0.072  0.908  0.025
 9  D  H  0.055  0.926  0.012
10  S  H  0.059  0.928  0.007
11  L  H  0.069  0.919  0.005
12  D  H  0.078  0.921  0.002
13  A  H  0.051  0.946  0.003
14  V  H  0.062  0.928  0.006
15  R  H  0.096  0.880  0.010
16  R  H  0.107  0.894  0.013
17  C  H  0.104  0.899  0.015
18  A  H  0.051  0.945  0.014

STRUSWER: STRUcture extension of Smith-Waterman alignER

STRUSWER aligns protein sequences with reference to their secondary structure

1a04A.exp A <-> 1au7A.exp A
erd.vnqLtprerdi.lklIaqGlpnkmiarrLdites.......tvkvhvkh.....mlkkmklksrveAavwvhqErif.........
...gmraLeqfanefkvrrIklGytqtnvgeaL...aavhgsefsqtticrfenlqlsfknac....klkAilskwlEe..aeqkrrtti
LLL.HHHLLHHHHHH.HHHHHLLLLHHHHHHHHLLLHH.......HHHHHHHH.....HHHHHLLLLHHHHHHHHHHHLLL.........
...LHHHHHHHHHHHHHHHHHHLLLHHHHHHHH...HHLLLLLLLHHHHHHHHLLLLEHHHHH....HHHHHHHHHHHH..LLLLLLLLL
score: 326.000000 ID : 0.11

Output alignment format. Top: alignment of primary structures; bottom: alignment of secondary structures.
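
One simple way to let secondary structure influence a Smith-Waterman alignment is to add a bonus whenever the aligned residues share the same structural state. The sketch below computes only the best local score under that assumption, with a linear gap penalty and arbitrary illustrative parameters; it is not the STRUSWER scoring scheme, and the function name is hypothetical.

```python
def ss_aware_local_score(seq_a, seq_b, ss_a, ss_b,
                         match=2, mismatch=-1, gap=-2, ss_bonus=1):
    """Best Smith-Waterman local alignment score with a secondary-structure bonus.

    ss_a and ss_b give one structure letter (e.g. H or L) per residue.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if ss_a[i - 1] == ss_b[j - 1]:
                pair += ss_bonus  # reward agreement of structural states
            table[i][j] = max(0,
                              table[i - 1][j - 1] + pair,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
            best = max(best, table[i][j])
    return best


print(ss_aware_local_score("ACD", "ACD", "HHH", "HHH"))
```

With such a bonus, sequences with low identity (0.11 in the example above) can still be aligned along their shared helices.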

Protein Secondary Structure Prediction PROPHET

RNA-MBFS (RNA Multi-Branch-Free Structures)

Creates an optimal RNA structure without branching

[Fig. 3.1. RNA secondary structure, with elements labelled S, B, C, M, E, F, G, H, I and T and the 5' and 3' ends marked]

M is a multi-branched loop with branching degree 3. The substructures to the left of M (i.e. E-F-G-I-H) and above M (i.e. C-B-S-T) are non-branching.

Integration on the level of computation and data

• Easily accessible via a web interface

• Integration at data level

• Cluster and local network distributed computation support

• Cross-platform

Building complex computational applications

• Individual scenarios can be created for any special task

• Pipelining support for computational pipelines

• Simple XML format for scenario and pipeline descriptors

Individual user spaces and profiles

• Individual result storage and file library

• Individual user account support

Easy remote administration via web interface

What’s under the hood

• Technologies and software tools used:

- MySQL 5 database as result and user space storage

- JSP for web interface

- Apache Tomcat 5 JSP/Servlet container

- Java 5 and RMI for distributed computations server and node software

Acknowledgements

• Financial support of

- Russian Federation State Contract № 02.434.11.100 (Intellectual technologies 2). Prof. Tumanyan V.G.

- Russian Academy of Sciences project in Molecular and Cellular Biology

• Contributors

- Institute of Mathematical Problems of Molecular Biology (Moscow Region, Puschino, Russia)

- Voronezh State University, Voronezh, Russia

- State Research Center of Genetics and Selection of Industrial Microorganisms, GosNIIgenetika, Moscow, Russia