simple rearrangements


Upload: frye

Post on 16-Jan-2016



TRANSCRIPT

Page 1: Simple Rearrangements
Page 2: Simple Rearrangements

Simple Rearrangements

Page 3: Simple Rearrangements
Page 4: Simple Rearrangements

Reversals

• Blocks represent conserved genes.

[Figure: a genome drawn as ten blocks, numbered in order]

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Page 5: Simple Rearrangements

Reversals

[Figure: the same ten blocks after blocks 4 through 8 have been reversed]

1, 2, 3, -8, -7, -6, -5, -4, 9, 10

• Blocks represent conserved genes.

• In the course of evolution or in a clinical context, blocks 1, …, 10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10.
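A reversal on a signed gene order can be sketched in a few lines. This is a minimal illustration, assuming the genome is a Python list of signed integers; the function name `reverse_segment` and the half-open index convention are choices made here, not part of the slides.

```python
def reverse_segment(genome, i, j):
    """Return a copy of genome with genome[i:j] reversed and every sign flipped."""
    flipped = [-g for g in reversed(genome[i:j])]
    return genome[:i] + flipped + genome[j:]


genome = list(range(1, 11))           # 1, 2, ..., 10
print(reverse_segment(genome, 3, 8))  # reverses blocks 4 through 8
```

Reversing the order and negating each block together model the fact that an inverted segment is read from the opposite strand.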

Page 6: Simple Rearrangements

Types of Rearrangements

Reversal
1 2 3 4 5 6 → 1 2 -5 -4 -3 6

Translocation
1 2 3 | 4 5 6 → 1 2 6 | 4 5 3

Fusion
two chromosomes carrying genes 1 to 6 → 1 2 3 4 5 6

Fission
1 2 3 4 5 6 → two chromosomes carrying genes 1 to 6
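The multichromosome operations can be sketched the same way. This is only an illustration, assuming each chromosome is a list of signed genes; the function names and the cut positions passed in are hypothetical choices, and only the tail-swapping form of translocation is shown.

```python
def translocate(a, b, i, j):
    """Exchange the tails a[i:] and b[j:] of two chromosomes."""
    return a[:i] + b[j:], b[:j] + a[i:]


def fuse(a, b):
    """Join two chromosomes end to end."""
    return a + b


def fission(c, i):
    """Cut one chromosome into two at position i."""
    return c[:i], c[i:]


print(translocate([1, 2, 3], [4, 5, 6], 2, 2))
print(fuse([1, 2, 3], [4, 5, 6]))
print(fission([1, 2, 3, 4, 5, 6], 3))
```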

Page 7: Simple Rearrangements

Sorting by reversals: 5 steps

Step 0: π = 2 -4 -3 5 -8 -7 -6 1
Step 1: 2 3 4 5 -8 -7 -6 1
Step 2: 2 3 4 5 6 7 8 1
Step 3: 2 3 4 5 6 7 8 -1
Step 4: -8 -7 -6 -5 -4 -3 -2 -1
Step 5: γ = 1 2 3 4 5 6 7 8

Page 8: Simple Rearrangements

Sorting by reversals: 4 steps

Step 0: π = 2 -4 -3 5 -8 -7 -6 1
Step 1: 2 3 4 5 -8 -7 -6 1
Step 2: -5 -4 -3 -2 -8 -7 -6 1
Step 3: -5 -4 -3 -2 -1 6 7 8
Step 4: γ = 1 2 3 4 5 6 7 8

Page 9: Simple Rearrangements

Sorting by reversals: 4 steps

Step 0: π = 2 -4 -3 5 -8 -7 -6 1
Step 1: 2 3 4 5 -8 -7 -6 1
Step 2: -5 -4 -3 -2 -8 -7 -6 1
Step 3: -5 -4 -3 -2 -1 6 7 8
Step 4: γ = 1 2 3 4 5 6 7 8

• What is the reversal distance for this permutation?
• Can it be sorted in 3 steps?
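For a permutation this small, the question can be checked by brute force. The sketch below is a depth-limited breadth-first search over all signed reversals; it is not the algorithm from the slides, and the names `neighbors` and `min_reversals` are made up for this illustration. Run with a limit of 3 on the permutation above, it should find no solution, which is consistent with the four-step sort being optimal.

```python
def neighbors(perm):
    """Yield every signed permutation that is one reversal away from perm."""
    n = len(perm)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield perm[:i] + tuple(-g for g in reversed(perm[i:j])) + perm[j:]


def min_reversals(perm, max_steps):
    """Return the reversal distance of perm, or None if it exceeds max_steps."""
    start = tuple(perm)
    target = tuple(range(1, len(start) + 1))
    frontier, seen = {start}, {start}
    for steps in range(max_steps + 1):
        if target in frontier:
            return steps
        if steps == max_steps:
            break
        # Expand one more layer, keeping only permutations not seen before.
        frontier = {q for p in frontier for q in neighbors(p)} - seen
        seen |= frontier
    return None


pi = (2, -4, -3, 5, -8, -7, -6, 1)
print(min_reversals(pi, 3))
```

The search space grows by a factor of about n(n+1)/2 per step, so this approach is only practical for short permutations and small step limits.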

Page 10: Simple Rearrangements

From Signed to Unsigned Permutation (Continued)

0 5 6 10 9 15 16 12 11 7 8 14 13 17 18 3 4 1 2 19 20 22 21 23

• Construct the breakpoint graph as usual

• Notice the alternating cycles in the graph between every other vertex pair

• Since these cycles came from the same signed vertex, we will not be performing any reversal on both pairs at the same time; therefore, these cycles can be removed from the graph
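The signed-to-unsigned step can be sketched as follows, assuming the standard encoding in which +x becomes the pair (2x-1, 2x), -x becomes (2x, 2x-1), and the result is framed by 0 and 2n+1. The signed input used here was inferred by inverting that mapping on the sequence shown above; it does not appear in the slides.

```python
def signed_to_unsigned(signed):
    """Encode a signed permutation of 1..n as an unsigned one on 0..2n+1."""
    unsigned = [0]
    for x in signed:
        low, high = 2 * abs(x) - 1, 2 * abs(x)
        unsigned.extend((low, high) if x > 0 else (high, low))
    unsigned.append(2 * len(signed) + 1)
    return unsigned


signed = [3, -5, 8, -6, 4, -7, 9, 2, 1, 10, -11]  # inferred, see note above
print(signed_to_unsigned(signed))
```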

Page 11: Simple Rearrangements

Reversal Distance with Hurdles

• Hurdles are obstacles in the genome rearrangement problem

• They cause a higher number of required reversals for a permutation to transform into the identity permutation

• Let h(π) be the number of hurdles in permutation π, and c(π) the number of cycles in its breakpoint graph

• Taking hurdles into account, the following formula gives a tighter bound on reversal distance:

d(π) ≥ n + 1 − c(π) + h(π)
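The cycle term of this bound is easy to compute; hurdle detection is more involved and is left out here. The sketch below counts alternating cycles in the breakpoint graph of a signed permutation and returns n + 1 − c(π), which is the bound with h(π) taken as 0. The function names are illustrative, and the encoding (+x as 2x-1, 2x; -x as 2x, 2x-1) is the usual one, assumed rather than quoted from the slides.

```python
def count_cycles(signed):
    """Number of alternating cycles in the breakpoint graph of a signed permutation."""
    n = len(signed)
    seq = [0]
    for x in signed:
        low, high = 2 * abs(x) - 1, 2 * abs(x)
        seq.extend((low, high) if x > 0 else (high, low))
    seq.append(2 * n + 1)

    # Black edges join neighbours across gene boundaries: seq[0]-seq[1], seq[2]-seq[3], ...
    black = {}
    for k in range(0, len(seq), 2):
        u, v = seq[k], seq[k + 1]
        black[u], black[v] = v, u

    seen, cycles = set(), 0
    for start in seq:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = black[v]                        # follow a black edge
            seen.add(w)
            v = w + 1 if w % 2 == 0 else w - 1  # follow the gray edge 2k <-> 2k+1
    return cycles


def cycle_lower_bound(signed):
    """n + 1 - c(pi): the bound above with the hurdle term omitted."""
    return len(signed) + 1 - count_cycles(signed)


pi = [2, -4, -3, 5, -8, -7, -6, 1]
print(count_cycles(pi), cycle_lower_bound(pi))
```

For the permutation sorted in the earlier slides this gives 5 cycles and a bound of 4, so no sequence of three reversals can sort it.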

Page 12: Simple Rearrangements

Median Problem

• Goal: find M so that D_AM + D_BM + D_CM is minimized

• NP-hard for most metric distances

Page 13: Simple Rearrangements

Genome Enumeration for Multichromosome Genomes

• For genomes on genes {1, 2, 3}

[Figure: genome enumeration tree whose branches are labeled with the signed genes 1, -1, 2, -2, 3, -3 and the marker $]

Example genomes at the leaves:

‹ 1, 2, 3 ›

‹ 1, 2, -3 ›

‹ 1, 2 › ‹ 3 ›

‹ 1, 2 › ‹ -3 ›

Page 14: Simple Rearrangements

Rearrangement Phylogeny

Page 15: Simple Rearrangements

Compute A Given Tree (Start)

Page 16: Simple Rearrangements

Compute A Given Tree (First Median)

Page 17: Simple Rearrangements

Compute A Given Tree (Second Median)

Page 18: Simple Rearrangements

Compute A Given Tree (Third Median)

Page 19: Simple Rearrangements

Compute A Given Tree (After 1st Iteration)

Page 20: Simple Rearrangements

Binary Encoding

Page 21: Simple Rearrangements

MLBE Sequences

Page 22: Simple Rearrangements

Experimental Results (Equal Content)

80% inversion, 20% transposition

Page 23: Simple Rearrangements

An Example: New Genomes

Original genomes:

1 2 3 4 5 6 7 8 9 10

1 -4 5 2 8 10 9 -7 -6 3

New genomes (only genes 1, 3, 5, 7, 9 kept):

1 3 5 7 9

1 5 9 -7 3
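Building the new genomes amounts to deleting the same genes from every genome while preserving order and sign. A minimal sketch, assuming genomes are lists of signed integers with equal gene content; `restrict` and `jackknife` are hypothetical names, and how the number of kept genes relates to the "jackknifing rate" on the later slides is left to the caller.

```python
import random


def restrict(genome, kept):
    """Keep only the genes whose absolute value is in kept, preserving order and sign."""
    return [g for g in genome if abs(g) in kept]


def jackknife(genomes, n_keep, seed=None):
    """Draw one random gene subset of size n_keep and apply it to every genome."""
    rng = random.Random(seed)
    genes = sorted({abs(g) for g in genomes[0]})
    kept = set(rng.sample(genes, n_keep))
    return [restrict(genome, kept) for genome in genomes]


g1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
g2 = [1, -4, 5, 2, 8, 10, 9, -7, -6, 3]
print(restrict(g1, {1, 3, 5, 7, 9}))
print(restrict(g2, {1, 3, 5, 7, 9}))
```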

Page 24: Simple Rearrangements

Jackknifing Rate

Page 25: Simple Rearrangements

Support Value Threshold - FP

Up to 90% of false positives (FP) can be identified with 85% as the threshold

Page 26: Simple Rearrangements

Jackknife Properties

• Jackknifing is necessary and useful for gene order phylogeny, and a large number of errors can be identified

• 40% jackknifing rate is reasonable

• 85% is a conservative threshold, 75% can also be used

• Low support branches should be examined in detail

Page 27: Simple Rearrangements

Protein

Page 28: Simple Rearrangements

In-silico Biochemistry

• Online servers exist to determine many properties of your protein sequences
  • Molecular weight
  • Extinction coefficients
  • Half-life

• It is also possible to simulate protease digestion

• All these analysis programs are available on www.expasy.ch

Page 29: Simple Rearrangements

Analyzing Local Properties

• Many local properties are important for the function of your protein
  • Hydrophobic regions are potential transmembrane domains
  • Coiled-coil regions are potential protein-interaction domains
  • Hydrophilic stretches are potential loops

• You can discover these regions
  • Using sliding-window techniques (easy)
  • Using prediction methods such as hidden Markov models (more sophisticated)

Page 30: Simple Rearrangements

Sliding-window Techniques

• Ideal for identifying strong signals

• Very simple methods
  • Few artifacts
  • Not very sensitive

• Use ProtScale on www.expasy.org

• Make the window the same size as the feature you’re looking for
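The sliding-window idea itself fits in a few lines. The sketch below averages a per-residue scale over each window; it uses the Kyte-Doolittle hydropathy values as an assumed example scale (ProtScale offers several scales, and this is not a reimplementation of that server). The function name and the toy sequence are illustrative.

```python
# Kyte-Doolittle hydropathy values (assumed example scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_window_profile(sequence, window, scale=KYTE_DOOLITTLE):
    """Return the mean scale value of every window of the given size."""
    values = [scale[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


toy = "IIIIIKKKKK"  # hydrophobic stretch followed by a charged stretch
for start, score in enumerate(sliding_window_profile(toy, 5)):
    print(start, round(score, 2))
```

A window roughly as long as the feature of interest (about 20 residues is a common choice for a membrane-spanning helix) smooths out single-residue noise while keeping the signal.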

Page 31: Simple Rearrangements

www.expasy.org/cgi-bin/protscale.pl

Page 32: Simple Rearrangements

www.expasy.org/cgi-bin/protscale.pl

Page 33: Simple Rearrangements

www.expasy.org/cgi-bin/protscale.pl

Page 34: Simple Rearrangements

www.expasy.org/cgi-bin/protscale.pl

Hphob. / Eisenberg

Page 35: Simple Rearrangements

Transmembrane Domains

• Discovering a transmembrane domain tells you a lot about your protein

• Many important receptors have 7 transmembrane domains

• Transmembrane segments can be found using ProtScale

• The most accurate predictions come from using TMHMM

Page 36: Simple Rearrangements

Using TMHMM

• TMHMM is the best method for predicting transmembrane domains

• TMHMM uses an HMM
  • Its principle is very different from that of ProtScale
  • TMHMM output is a prediction

Page 37: Simple Rearrangements

TMHMM vs. ProtScale

Page 38: Simple Rearrangements

>sp|P78588|FREL_CANAX Probable ferric reductase transmembrane component OS=Candida albicans GN=CFL1 PE=3 SV=1
MTESKFHAKYDKIQAEFKTNGTEYAKMTTKSSSGSKTSTSASKSSKSTGSSNASKSSTNA
HGSNSSTSSTSSSSSKSGKGNSGTSTTETITTPLLIDYKKFTPYKDAYQMSNNNFNLSIN
YGSGLLGYWAGILAIAIFANMIKKMFPSLTNNLSGSISNLFRKHLFLPATFRKKKAQEFS
IGVYGFFDGLIPTRLETIIVVIFVVLTGLFSALHIHHVKDNPQYATKNAELGHLIADRTG
ILGTFLIPLLILFGGRNNFLQWLTGWDFATFIMYHRWISRVDVLLIIVHAITFSVSDKAT
GKYKNRMKRDFMIWGTVSTICGGFILFQAMLFFRRKCYEVFFLIHIVLVVFFVVGGYYHL
ESQGYGDFMWAAIAVWAFDRVVRLGRIFFFGARKATVSIKGDDTLKIEVPKPKYWKSVAG
GHAFIHFLKPTLFLQSHPFTFTTTESNDKIVLYAKIKNGITSNIAKYLSPLPGNTATIRV
LVEGPYGEPSSAGRNCKNVVFVAGGNGIPGIYSECVDLAKKSKNQSIKLIWIIRHWKSLS
WFTEELEYLKKTNVQSTIYVTQPQDCSGLECFEHDVSFEKKSDEKDSVESSQYSLISNIK
QGLSHVEFIEGRPDISTQVEQEVKQADGAIGFVTCGHPAMVDELRFAVTQNLNVSKHRVE
YHEQLQTWA

Search with accession number P78588 at http://www.uniprot.org/uniprot/

Page 39: Simple Rearrangements

www.cbs.dtu.dk/services/TMHMM-2.0

Page 40: Simple Rearrangements

www.cbs.dtu.dk/services/TMHMM-2.0

Page 41: Simple Rearrangements

Predicting Post-translational Modifications

• Post-translational modifications often occur on similar motifs in different proteins

• PROSITE is a database containing a list of known motifs, each associated with a function or a post-translational modification

• You can search PROSITE by looking for each motif it contains in your protein (the server does that for you!)

• PROSITE entries come with extensive documentation on the function of each motif

Page 42: Simple Rearrangements

Searching for PROSITE Patterns

• Search your protein against PROSITE on ExPASy
  • www.expasy.org/tools/scanprosite

• PROSITE motifs are written as patterns
  • Short patterns are not very informative by themselves
  • They only indicate a possibility
  • Combine them with other information to draw a conclusion

• Remember: Not everything is in PROSITE!
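To see why short patterns match so easily, it helps to translate one into a regular expression. The sketch below handles the common parts of PROSITE pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden ones, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends); it is an approximation written for this illustration, not the ScanProsite implementation, and the function names are invented. The example pattern is the widely cited N-glycosylation motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        core, _, repeat = element.partition("(")
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # forbidden residues
            core = "[^" + core[1:-1] + "]"
        if repeat:                           # x(2) or x(2,4)
            core += "{" + repeat.rstrip(")") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return 0-based start positions of all matches, overlapping ones included."""
    regex = "(?=" + prosite_to_regex(pattern) + ")"
    return [m.start() for m in re.finditer(regex, sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_sites("N-{P}-[ST]-{P}", "AANGSAA"))
```

A four-residue pattern like this one is expected to occur by chance in many proteins, which is why a hit only indicates a possibility.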

Page 43: Simple Rearrangements

www.expasy.org/tools/scanprosite

P12259

Page 44: Simple Rearrangements

www.expasy.org/tools/scanprosite

Page 45: Simple Rearrangements

Interpreting PROSITE Patterns

• Check the pattern function: Is it compatible with the protein?

• Sometimes patterns suggest nonexistent protein features
  • For instance: If you find a myristoylation pattern in a prokaryote, ignore it; prokaryotic proteins have no myristoylation!

• Short patterns are more informative if they are conserved across homologous sequences
  • In that case, you can build a multiple-sequence alignment
  • This slide shows an example

Page 46: Simple Rearrangements

Patterns and Domains

• Patterns are usually the most striking feature of the more general motifs (called domains)

• Domains are less conserved than patterns but usually longer

• In proteins, domain analysis is gradually replacing pattern analysis

Page 47: Simple Rearrangements

Protein Domains

• Proteins are usually made of domains

• A domain is an autonomous folding unit

• Domains are more than 50 amino acids long

• It’s common to find these together:
  • A regulatory domain
  • A binding domain
  • A catalytic domain

Page 48: Simple Rearrangements

Discovering Domains

• Researchers discover domains by
  • Comparing proteins that have similar functions
  • Aligning those proteins
  • Identifying conserved segments

• A domain is a multiple-sequence alignment formulated as a profile

• For each column, a domain profile indicates which amino acids are most likely to occur
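A profile in its simplest form is a table of per-column residue frequencies. The sketch below builds one from a toy alignment; real domain databases use more elaborate models (pseudocounts, log-odds scores, or profile HMMs), so this only illustrates the idea. The alignment and function names are made up for the example.

```python
from collections import Counter


def column_profile(alignment):
    """For each alignment column, map each residue to its relative frequency."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({aa: n / len(column) for aa, n in counts.items()})
    return profile


def consensus(profile):
    """Most frequent residue in each column."""
    return "".join(max(col, key=col.get) for col in profile)


toy_alignment = ["ACDE", "ACDF", "AGDE"]  # hypothetical aligned segments
profile = column_profile(toy_alignment)
print(profile[1])
print(consensus(profile))
```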

Page 49: Simple Rearrangements

Domain Collections

• Scientists have been discovering and characterizing protein domains for more than 20 years

• 8 collections of domains have been established
  • Manual collections are very precise but small
  • Automatic collections are very extensive but less informative

• These collections
  • Overlap
  • Have been assembled by different scientists
  • Have different strengths and weaknesses

• We recommend using them all!

Page 50: Simple Rearrangements

The Magnificent 8

• Pfam is the most extensive manual collection

• Pfam is often used as a reference

Page 51: Simple Rearrangements

Searching Domain Collections

• Domains in Pfam often include known functions

• A match between your protein and a domain is desirable
  • A match is a potential indication of a function
  • This is VERY informative for further research!

• Three servers exist to compare proteins and domain collections:
  • InterProScan: www.ebi.ac.uk/interproscan
  • CD-Search (Conserved Domain): www.ncbi.nlm.nih.gov
  • Motif Scan: www.ch.embnet.org

Page 52: Simple Rearrangements

Using InterProScan

• InterProScan is the most comprehensive search engine for domain databases

• Makes it possible to compare alternative results on most collections

• Does not provide a statistical score

Page 53: Simple Rearrangements

>sp|P53539|FOSB_HUMAN Protein fosB OS=Homo sapiens GN=FOSB PE=1 SV=1
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS
GGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSY
TSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL

Page 54: Simple Rearrangements

www.ebi.ac.uk/InterProScan

Page 55: Simple Rearrangements

www.ebi.ac.uk/InterProScan

Page 56: Simple Rearrangements

The CD-Search Output

• CD-Search is less extensive than InterProScan

• Results come with a statistical evaluation (E-value)
  • 10e-15: low E-value, good match
  • 2.1: high E-value, bad match

Page 57: Simple Rearrangements

www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

Page 58: Simple Rearrangements

www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

Page 59: Simple Rearrangements

www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

Page 60: Simple Rearrangements

Predicting Functions with Domains

• Finding a match with a domain having a catalytic function is good news . . . but what, exactly, does it mean?

• A match indicates that your sequence has the domain structure . . . but does it also have the function?

• You cannot say before looking into these details:
  • Where are the catalytic residues on the domain?
  • Does your sequence have the right residues at these positions?

Page 61: Simple Rearrangements

Looking into the Details

• Catalytic residues are normally highly conserved in domains

• Motif Scan makes it possible to check whether these important residues are conserved in your sequence
  • High bar above 0 = Highly conserved residues
  • Green = Your sequence has an expected residue
  • Red = Your sequence has an unexpected residue

Page 62: Simple Rearrangements

Looking into the Details (cont’d.)

• R (arginine) is highly expected at this position
  • High bar
  • Potential active site

• If your protein has an arginine at this position . . .
  • Bar is filled with green
  • Your protein could be active

Page 63: Simple Rearrangements

myhits.isb-sib.ch/cgi-bin/motif_scan

Page 64: Simple Rearrangements

Protein 3D Structure

Page 65: Simple Rearrangements

Primary, Secondary and Tertiary Structures

• Proteins are made of 20 amino acids

• Proteins are on average 400 amino acids long

• Protein structure has 3 levels:
  • The primary structure is the sequence of a protein
  • The secondary structure is the local structure
  • The tertiary structure is the exact position of each atom on a 3D model

Page 66: Simple Rearrangements

Secondary Structures

• Helix
  • A stretch of amino acids that twists like a spring

• Beta strand or extended
  • A stretch of amino acids that forms a line without twisting

• Random coils
  • A stretch of amino acids with a structure neither helical nor extended
  • Amino-acid loops are usually coils

Page 67: Simple Rearrangements

Guessing the Secondary Structure of Your Protein

• Secondary structure predictions are good

• If your protein has enough homologues, expect 80% accuracy

• The most accurate secondary structure prediction server is PSIPRED

Page 68: Simple Rearrangements

PSIPRED Output

• Conf = Confidence
  • 9 is the best, 0 the worst

• Pred = Every amino acid is assigned a letter:
  • C for coils
  • E for extended or beta-strand
  • H for helix
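Once the Pred and Conf strings have been copied out of the result, turning them into segments is straightforward. The sketch below assumes you already have the two strings for the same residues; the example strings are invented, and the code does not parse any particular PSIPRED file format.

```python
from itertools import groupby


def segments(pred):
    """Collapse a per-residue prediction string into (state, start, end) runs."""
    runs, position = [], 0
    for state, run in groupby(pred):
        length = len(list(run))
        runs.append((state, position, position + length))  # end is exclusive
        position += length
    return runs


def segment_confidence(pred, conf):
    """Attach the mean confidence digit (0-9) to every segment."""
    return [
        (state, start, end, sum(int(c) for c in conf[start:end]) / (end - start))
        for state, start, end in segments(pred)
    ]


pred = "CCHHHHEEC"  # hypothetical prediction line
conf = "998877659"  # hypothetical confidence line
for row in segment_confidence(pred, conf):
    print(row)
```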

Page 69: Simple Rearrangements

>gi|15892329|ref|NP_360043.1| translocation protein TolB [Rickettsia conorii str. Malish 7]
MRNIIYFILSLLFSVTSYALETINIEHGRADPTPIAVNKFDADNSAADVLGHDMVKVISNDLKLSGLFRP
ISAASFIEEKTGIEYKPLFAAWRQINASLLVNGEVKKLESGKFKVSFILWDTLLEKQLAGEMLEVPKNLW
RRAAHKIADKIYEKITGDAGYFDTKIVYVSESSSLPKIKRIALMDYDGANNKYLTNGKSLVLTPRFARSA
DKIFYVSYATKRRVLVYEKDLKTGKESVVGDFPGISFAPRFSPDGRKAVMSIAKNGSTHIYEIDLATKQL
HKLTDGFGINTSPSYSPDGKKIVYNSDRNGVPQLYIMNSDGSDVQRISFGGGSYAAPSWSPRGDYIAFTK
ITKGDGGKTFNIGIMKACPQDDENSERIITSGYLVESPCWSPNGRVIMFAKGWPSSAKAPGKNKIFAIDL
TGHNEREIMTPADASDPEWSGVLN

Page 70: Simple Rearrangements

bioinf.cs.ucl.ac.uk/psipred//?program=psipred

Page 71: Simple Rearrangements

bioinf.cs.ucl.ac.uk/psipred//?program=psipred

Page 72: Simple Rearrangements

bioinf.cs.ucl.ac.uk/psipred//?program=psipred

Page 73: Simple Rearrangements

bioinf.cs.ucl.ac.uk/psipred//?program=psipred

Page 74: Simple Rearrangements

Predicting Other Secondary Features

• It is also possible to predict these accurately:
  • Transmembrane segments
  • Solvent accessibility
  • Globularity
  • Coiled-coil regions

• All these predictions have an expected accuracy higher than 70%

Page 75: Simple Rearrangements

Servers

• www.predictprotein.org

• cubic.bioc.columbia.edu/predictprotein

• www.sdsc.edu/predicprotein

• www.cbi.pku.edu.cn/predictprotein

Page 76: Simple Rearrangements

Predicting 3D Structures

• Predicting 3D structures from sequences only is almost impossible

• The only reliable way to establish the 3D structure of a protein is a real-world experiment using
  • X-ray crystallography
  • Nuclear magnetic resonance (NMR)

• Structures established this way are stored in the PDB database

• “The PDB of my protein” is synonymous with “The structure of my protein”

Page 77: Simple Rearrangements

Retrieving Protein Structures from PDB

• All PDB entries have 4-character identifiers!
  • 1CRZ, 2BHL . . .

• Sometimes the chain identifier is added:
  • 1CRZA, 1CRZB . . .

• To access all PDB entries, go to www.rcsb.org
  • PDB contains 42,000 entries
  • PDB contains the structure of 16,000 unique proteins or RNAs

• You can download the coordinates and display the structure
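Splitting an identifier such as 1CRZA into entry and chain is a one-liner worth getting right. The sketch below assumes the 4-character entry plus optional single-character chain convention described above. The download address pattern `https://files.rcsb.org/download/<ID>.pdb` is an assumption about the current RCSB file service, not something stated in the slides, and no network request is made here.

```python
def split_pdb_id(identifier):
    """Split '1CRZA' into ('1CRZ', 'A'); a bare entry gives (entry, None)."""
    identifier = identifier.strip().upper()
    if len(identifier) not in (4, 5):
        raise ValueError("expected a 4-character entry, optionally followed by a chain")
    return identifier[:4], (identifier[4:] or None)


def download_url(entry):
    """Build a coordinate-file URL for an entry (URL pattern is an assumption)."""
    return f"https://files.rcsb.org/download/{entry}.pdb"


entry, chain = split_pdb_id("1CRZA")
print(entry, chain, download_url(entry))
```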

Page 78: Simple Rearrangements

www.rcsb.org

Page 79: Simple Rearrangements

www.rcsb.org

Page 80: Simple Rearrangements

Displaying a PDB Structure

• You can use any of the online viewers to display the structure

• They will let you rotate the structure, zoom in and out, or color it

• PDB files themselves are plain-text coordinate listings and are hard to read by eye
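What a viewer does first is read the atom coordinates. The sketch below extracts them from ATOM and HETATM records using the fixed column positions of the legacy PDB format (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54). The example line is synthetic and the function names are illustrative; for real work an established parser is the safer choice.

```python
def parse_atom_line(line):
    """Extract a few fields from one fixed-width ATOM/HETATM record."""
    return {
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "resseq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def read_atoms(lines):
    """Parse every coordinate record in an iterable of PDB-format lines."""
    return [
        parse_atom_line(line)
        for line in lines
        if line.startswith(("ATOM  ", "HETATM"))
    ]


# Synthetic record laid out in the standard columns (not taken from a real entry).
EXAMPLE = "ATOM      1  N   MET A   1      27.340  24.430   2.614"
print(read_atoms(["HEADER    EXAMPLE", EXAMPLE]))
```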

Page 81: Simple Rearrangements

Predicting the Structure of Your Protein

• The bad news:
  • It is very hard to predict protein 3D structures

• The good news:
  • Similar proteins have similar structures
  • If your favorite protein has a homologue with a known structure . . .
  • You can do homology modeling

• How?
  • Start with a BLAST (more about that in the next slide)

Page 82: Simple Rearrangements

ncbi.nlm.nih.gov/BLAST

Page 83: Simple Rearrangements

ncbi.nlm.nih.gov/BLAST

Page 84: Simple Rearrangements

BLASTing PDB for Structures

• BLAST your protein against PDB

• If you get a very good hit, it means PDB contains a protein similar to yours

• Your protein and this hit probably have the same structure

Page 85: Simple Rearrangements

Be Careful!

• Sometimes only one of the domains contained in your protein has been characterized
  • If that’s the case, the PDB will only contain this domain
  • Always check the alignments

• Red line = full protein in PDB

• Blue line = one domain only in this entry

Page 86: Simple Rearrangements

Structures and Sequences

• Highly conserved sequences are often important in the structure

• Make a multiple-sequence alignment to identify these important positions

• Highly conserved positions are either in the core or important for protein/protein interactions

Page 87: Simple Rearrangements

3D Predictions

• If you want to predict the structure of your protein automatically, try Swiss Model
  • Swiss Model runs the BLAST for you
  • The program does a bit of homology modeling
  • The process delivers a new PDB entry
  • You can access it at swissmodel.expasy.org

• Swiss Model gives good results for proteins having homologues in PDB

Page 88: Simple Rearrangements

zhanglab.ccmb.med.umich.edu/I-TASSER/

Page 89: Simple Rearrangements

zhanglab.ccmb.med.umich.edu/I-TASSER/

Page 90: Simple Rearrangements

3D-BLAST

• Use this technique if you have a structure and you want to find other similar structures

• Use VAST or DALI to look for proteins having the same 3D shape as yours
  • www.eb.ac.uk/dali
  • www.ncbi.nlm.nih/vast

Page 91: Simple Rearrangements

3D Movements

• Most proteins need to move to do their job

• Predicting protein movement is possible using molecular dynamics
  • Check out this site: molmolvdb.mbb.yale.edu

• Good molecular dynamics requires extremely powerful computers
  • Don’t expect miracles from standard online resources