phylogenetic trees • protein structure · phylogenetic analysis variable position conserved...

Post on 07-Aug-2020


Prof. Bystroff talks about BIOINFORMATICS

• Sequence database searching • Phylogenetic Trees • Protein Structure


hi

AAAGAGATTCTGCTAGCGGTCGGAGAGATGCTGCAGCGAGTCGGCC

AAAGAGATTCTGCTAGCGGTCGGAGAGATGCTGCAGCGAGTCGGCC

AAAGAGATTCTGCTAGCGGTCGG

AGAGATGCTGCAGCGAGTCGGCC


Protein sequence alignment uses a "substitution matrix".

[Figure: grid of substitution scores with Sequence 1 along one axis and Sequence 2 along the other]

Find the best pathway through the substitution scores, and you have an alignment.

This is the "dynamic programming" algorithm.
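The best-pathway idea can be sketched as a global (Needleman-Wunsch style) alignment. This is a minimal sketch, not the method of any particular program: the flat scores (match +1, mismatch -1, gap -1) are assumptions standing in for a real substitution matrix, and the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    def sub(x, y):
        # Assumed flat scoring; a real protein alignment would look this up
        # in a substitution matrix instead.
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # align two residues
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace the best pathway back from the bottom-right corner.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The table has (len(a)+1) x (len(b)+1) cells, so the cost grows with the product of the two sequence lengths.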

BLAST searches millions of sequences

GenBank contains over 162 million sequences!!

The score for each should be the optimal alignment score. Even if we can do 1 per millisecond, it would take 45 hours to do one search. BLAST usually finishes in under a minute.
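The 45-hour figure follows directly from the two numbers above (162 million sequences, one optimal alignment per millisecond):

```latex
162 \times 10^{6}\ \text{alignments} \times 10^{-3}\ \frac{\text{s}}{\text{alignment}}
  = 1.62 \times 10^{5}\ \text{s}
  = \frac{1.62 \times 10^{5}\ \text{s}}{3600\ \text{s/h}}
  = 45\ \text{h}
```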

How does BLAST do it so fast?

BLAST precalculates all triplet hits in the database.

My sequence has this triplet: PGQ

BLAST uses an expansion table to allow for near-perfect matches:

PGQ → PGQ PGR PGS ... PGT PGV PGW PGY PAQ PCQ PDQ PEQ PFQ ...

BLAST saves a lookup table (called an INDEX) of the locations of all near-identity triplets in the whole database.

This is all done when BLAST is set up, before any searches are carried out.
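A toy version of the triplet index can make the idea concrete. This is a sketch under stated assumptions, not BLAST's actual implementation: the function names are hypothetical, "near-perfect" is simplified to "differs at one position", and the expansion is done on the query triplet at search time rather than when the index is built.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_index(database, k=3):
    """Map every k-letter word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def neighbors(word):
    """The word itself plus every word that differs from it at exactly one position."""
    words = {word}
    for i in range(len(word)):
        for aa in AMINO_ACIDS:
            words.add(word[:i] + aa + word[i + 1:])
    return words


def find_hits(query, index, k=3):
    """Return sorted (query offset, sequence id, database offset) triplet hits."""
    hits = []
    for q_offset in range(len(query) - k + 1):
        for word in neighbors(query[q_offset:q_offset + k]):
            for seq_id, db_offset in index.get(word, []):
                hits.append((q_offset, seq_id, db_offset))
    return sorted(hits)


# Hypothetical two-protein database, built once before any search.
database = {"p1": "MPGQK", "p2": "APGRL"}
index = build_index(database)
print(find_hits("PGQ", index))
```

Because the index is built once, each search only pays for dictionary lookups, not for a scan of the database.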

BLAST finds diagonal arrangements of triplet hits

triplet hits in one database protein

Hits are joined by extension

BLAST scores only the best hits (saves time)

BLAST connects the diagonals (FASTA algorithm)
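Two hits lie on the same diagonal when the database offset minus the query offset is the same for both. A minimal sketch of that grouping, assuming hits are stored as (query offset, sequence id, database offset) tuples (a hypothetical layout chosen for this example):

```python
from collections import Counter


def diagonal_counts(hits):
    """Count triplet hits per (sequence id, diagonal)."""
    return Counter(
        (seq_id, db_offset - q_offset) for q_offset, seq_id, db_offset in hits
    )


def best_diagonal(hits):
    """Return ((sequence id, diagonal), hit count) for the most populated diagonal."""
    return diagonal_counts(hits).most_common(1)[0]


# Hypothetical hits: two share diagonal 5 in protein p1.
hits = [(0, "p1", 5), (3, "p1", 8), (1, "p1", 2), (0, "p2", 4)]
print(best_diagonal(hits))
```

Only the diagonals with several hits would then be worth extending and scoring, which is where the time saving comes from.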

This protein is given a score, and we save it for later only if the score passes a cutoff.

Re-scoring.

Convert score to an e-value*.

Rank by e-value.

cutoff

*later...


Protein Databases available for BLAST search

Go to the BLAST search page (e.g. blastp), select a database to search, and then select ? to learn a little about that database.


Forms of BLAST

BLAST       query              database
blastn      nucleotide         nucleotide
blastp      protein            protein
tblastn     protein            translated DNA
blastx      translated DNA     protein
tblastx     translated DNA     translated DNA
psi-blast   protein, profile   protein
phi-blast   pattern            protein

How significant is that?

Please give me a number for how likely the data would not have been the result of chance, as opposed to a specific inference.

e-value

A better metric of significance.

E-value = p-value × (number of attempts)

Scores from random alignments are used to calculate the p-value of an alignment score

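[Figure: histogram of random alignment scores. x-axis: score; y-axis: frequency. The p-value of a score x is the area under the fitted curve to the right of x.]

p-value of x = $\int_{x}^{\infty} f(s)\,ds$, where $f$ is the normalized normal distribution fit to the random scores.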

p-value is the significance of one (1) alignment score.

e-value is the significance of one score of many tries.

Searching a database of 162 million sequences for one hit is like trying 162 million times to get one good alignment. The number of times you will see that score by chance is the p-value times 162 million!

e-value = p-value * 162,000,000 (GenBank search)
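The two formulas can be sketched in a few lines. This follows the normal-distribution fit described above; production BLAST statistics are based on an extreme value distribution instead, so treat this as an illustration only. The function names and the random scores are hypothetical.

```python
import math
import statistics


def p_value(score, random_scores):
    """Upper-tail area of a normal distribution fit to the random scores."""
    mu = statistics.mean(random_scores)
    sigma = statistics.pstdev(random_scores)
    z = (score - mu) / sigma
    # For a normal distribution, P(S >= score) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2))


def e_value(p, n_attempts):
    """Expected number of chance hits at this score over n_attempts tries."""
    return p * n_attempts


# Hypothetical scores from aligning shuffled sequences.
random_scores = [8, 10, 12, 9, 11, 10, 13, 7]
p = p_value(25, random_scores)
print(p, e_value(p, 162_000_000))
```

An e-value can exceed 1, because it is an expected count of chance hits and not a probability.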

Pop quiz

BLAST HIT ........ e-value
1. annotation     3.0
2. annotation     3.0
3. annotation     3.0
4. annotation     3.0
5. annotation     3.0
6. annotation     3.0
7. annotation     3.0
8. annotation     3.0
9. annotation     3.0
10. annotation    3.0

How many of the above 10 hits are expected to be by chance?

Pop quiz

BLAST HIT ........ e-value
1. annotation     1.0
2. annotation     2.0
3. annotation     3.0
4. annotation     4.0
5. annotation     5.0
6. annotation     6.0
7. annotation     7.0
8. annotation     8.0
9. annotation     9.0
10. annotation    10.0

How many of the above 10 hits are expected to be by chance?

Pop quiz

BLAST HIT ........ e-value
1. annotation     0.0
2. annotation     0.01
3. annotation     0.01
4. annotation     0.01
5. annotation     0.02
6. annotation     0.02
7. annotation     0.02
8. annotation     0.02
9. annotation     0.02
10. annotation    10.0

How many of the above 10 hits are expected to be by chance?

Bioinformatics


• Sequence database searching • Phylogenetic Trees • Protein Structure

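[Figure: the same tree of A, B, C, and D drawn three ways. Cladogram: branch lengths have no meaning. Phylogram: branch lengths show genetic change (branch labels 1, 1, 6, 3, 5). Ultrametric tree: branch lengths show evolutionary time.]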

(D:5,(A:1,(C:1,B:6):1):3)

Parenthesis (Newick) notation has both labels and distances.
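To see how the labels and distances nest, here is a minimal sketch of a reader for simple Newick strings like the one above. It assumes unquoted labels and no comments; the function names and the dictionary layout are choices made for this example, not part of any tree library.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with name, length, children."""
    s = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(s) and s[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while pos < len(s) and s[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")"
        start = pos
        while pos < len(s) and s[pos] not in ":,()":
            pos += 1
        name = s[start:pos].strip()
        length = 0.0
        if pos < len(s) and s[pos] == ":":
            pos += 1  # skip ":"
            start = pos
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            length = float(s[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if pos != len(s):
        raise ValueError(f"unexpected text at position {pos}")
    return root


def leaf_depths(node, depth=0.0):
    """Sum branch lengths from the root down to each labeled leaf."""
    depth += node["length"]
    if not node["children"]:
        return {node["name"]: depth}
    depths = {}
    for child in node["children"]:
        depths.update(leaf_depths(child, depth))
    return depths


tree = parse_newick("(D:5,(A:1,(C:1,B:6):1):3)")
print(leaf_depths(tree))  # {'D': 5.0, 'A': 4.0, 'C': 5.0, 'B': 10.0}
```

The root-to-leaf sums differ between leaves, so this example tree is a phylogram, not an ultrametric tree.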

A multiple sequence alignment is made using many pairwise sequence alignments

Multiple Sequence Alignment

Construct a distance-based tree

[Figure: pairwise distance matrix for sequences A through F. Extracted values: 97, 81, 77, 82, 59, 32, 80, 55, 31, 90, 65, 40, 61, 42, 33. Exercise: draw the tree from the distances.]
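One common way to build a tree from a distance matrix is UPGMA: repeatedly join the two closest clusters and average their distances to everything else. The sketch below assumes UPGMA is an acceptable stand-in for the distance-based method meant here, and it uses hypothetical distances, not the values from the figure.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a nested-tuple tree; dist[a][b] holds the distance for a before b in names."""
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {
        frozenset((i, j)): dist[names[i]][names[j]]
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Join the closest pair of clusters.
        i, j = sorted(min(d, key=d.get))
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for k in clusters:
            d_ik = d.pop(frozenset((i, k)))
            d_jk = d.pop(frozenset((j, k)))
            d[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del d[frozenset((i, j))]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


# Hypothetical distances: A and B are close, C is far from both.
names = ["A", "B", "C"]
dist = {"A": {"B": 2, "C": 8}, "B": {"C": 8}}
print(upgma(names, dist))
```

UPGMA assumes a constant rate of change along all branches; neighbor joining is a widely used alternative when that assumption is doubtful.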

Life is not strictly a tree -- horizontal gene transfer


BF Smets, T Barkay (2005) “Horizontal gene transfer: perspectives at a crossroads of scientific disciplines” Nature Reviews Microbiology.

Discrete Steps Needed for Stability of Gene Transfer

Stably incorporating horizontally transferred genes into a recipient genome involves five distinct steps (Fig. 1).

1. First, a particular segment of DNA or RNA is prepared for transfer from the donor strain through one of several processes, including excision and circularization of conjugative transposons, initiation of conjugal plasmid transfer by synthesis of a mating pair-formation protein complex, or packaging of nucleic acids into phage virions.
2. Next, the segment is transferred either by conjugation, which requires contact between the donor and recipient cells, or by transformation and transduction without direct contact.
3. During the third step, genetic material enters the recipient cell, where cell exclusion may abort the transfer.
4. Otherwise, during the fourth step, the incoming gene is integrated into the recipient genome by legitimate or site-specific recombination or by plasmid circularization and complementary strand synthesis. Barriers to transfer during this step come from restriction modification systems, failure to integrate and replicate within the new host genome, and incompatibility with resident plasmids.
5. In the final step, transferred genes are replicated as part of the recipient genome and transmitted to daughter cells in stable fashion over successive generations.

Researchers from different disciplines tend to focus on specific stages within this five-step sequence. Thus, evolutionary biologists who examine microbial genomes for evidence of past transfers tend to look at HGTs from the perspective of step five. Molecular biologists are more likely to examine the details of the transfer events, while microbial ecologists look more broadly when they describe the magnitude and diversity of the mobile gene pool, sometimes called the mobilome.

Sequence homology trees are complicated by paralogy

Orthologs: homologs originating from a speciation event.

Paralogs: homologs originating from a gene duplication event.

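[Figure: a species tree of clam, duck, crab, and fish next to a sequence tree of gene copies (clam A, duck A, crab B, fish A, duck B, fish B). The sequence tree does not match the true species tree. The reconciled trees mark the events that explain the difference: duplication, speciation, and gene loss.]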

Use orthologs

• To make the right inferences about evolution, make sure your phylogenetic tree is composed of orthologs.

How do you know it's an ortholog?

1. It has the same function in both species.
2. It has about the same number of differences across species as other orthologs.
3. You don't.

Functional inference from multiple sequence alignments

[Figure: alignment positions classified as conserved or not conserved and related to folding or function. Labels: stability, kinetics, enzyme activity, binding, post-translational modification, species differences.]

Next time:

• Visit rcsb.org

• Try visualizing a protein.

• Locate a residue that is conserved across all species in a BLAST search.

• Locate one that is conserved except in one species. What might be its function?


[Figure: species-by-position table of CatSper cysteines. Columns: Mm1:C362, Mm1:S393, Mm1:C499, Mm2:C210, Mm3:C96, Mm3:C197, Mm4:C114, Mm4:C170, Mm4:C233.]

Phylogenetic analysis

variable position conserved

single position conserved

A transmembrane Cys can still form a self-reacting disulfide (SS) when it mutates to a new position. Therefore, variable-position conserved Cys are self-reacting. Single-position conserved Cys are cross-reacting.

[Figure: the same species-by-position table, showing only columns Mm1:C362 and Mm1:S393.]

In this case, mammals found to be missing conserved cysteines in the sperm specific calcium channel CatSper were species that lacked sperm competition.

http://etetoolkit.org

Format of sequence alignment for ETE tree

>Squirrel
FLVVCLNT---CIFLCIYV---LTLMFTCLF---LLRICRVLR---VSICTSEFA---LGFCLFGI---LTILVCEV---LVHVCMAV---ICITQDGW
>Beaver
FVTVCLNT---CIFLCIYV---LILMFTCMF---LLRICRVLR---VSICTSEFF---LGFCLFGI---LTILICEV---LVHVCMAV---ICITQDGW
>Blind mole rat
FLVVCLNT---SIFLCIYI---LTLMFTCLF---LLRICRVLK---VSTYACEFF---LGFCLFGV---LTILTCEV---LVHVCMAV---ICITQDGW
>Mouse
FIVVCLNT---SIFLSIYV---LTLMFTCLF---LLRVCRVLR---VSVYVCEFL---LGFCLFGV---LTILICEV---LVHVCMAV---ICITQDGW
>Pika
FLVICLNT---CIFLSIYV---LTLMFTCLF---LLRICRVLR---VSIYASEFS---LGFCLFGT---LTILICEV---LLHVCMSV---ICITQDGW
>Rabbit
FLVVCLNT---CIFLCIYM---FVLMFTCLF---LLRICRVLR---VSIYASEFS---LGFCLFGA---LTILFCEV---LLHVCMAV---ICITQDGW
>Gibbon
FFVVCLNT---SIFFCIYV---LILMFTCLF---LLRICRVLR---VSICTSELF---LGFCLFGS---LTILICEV---LVHVCMAV---ICITQDGW
>Monkey
FFIVCLNT---SIFFCIYV---LILMFTCLF---FLRICRVLR---VSICTSELA---LGFCLFGS---LTILICEV---LVHVCMAV---ICITQDGW
>Bushbaby
FFIICLNT---CIFFCIYV---LILMFTCLF---FLRICRVLR---VGIYSAEFY---LGFCLFGV---LSILVCEV---LIHVCMAV---ICITQDGW
>Lemur
FFIICLNT---AIFFSIYL---LILMFTCLF---FLRICRVLR---VSIYSSEFV---LGFCLFGV---LTILICEV---LVHVCMAV---ICITQDGW
>Sifaka
FFVICLNT---SIFFCIYV---LILMFTCLF---FLRICRVLR---VSIYSSEFS---LGFCLFGV---LTILICEV---LVHVCMAV---ICITQDGW

FASTA format. Output by most alignment programs and packages.
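An aligned FASTA file like this can be read with a few lines of code and scanned for cysteine columns. This is a minimal sketch with hypothetical function names; it assumes each ">" header sits on its own line and that all sequences are already aligned. The example uses the first two segments of two of the sequences above.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA text; sequences may span several lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def cysteine_columns(records):
    """Map each alignment column that contains a Cys to the number of sequences with Cys there."""
    seqs = list(records.values())
    counts = {}
    for col in range(min(len(s) for s in seqs)):
        n = sum(1 for s in seqs if s[col] == "C")
        if n:
            counts[col] = n
    return counts


EXAMPLE = """>Squirrel
FLVVCLNT---CIFLCIYV
>Blind mole rat
FLVVCLNT---SIFLCIYI
"""

records = parse_fasta(EXAMPLE)
print(cysteine_columns(records))
```

A column where every sequence has Cys is a candidate single-position conserved cysteine. A column where only some sequences have Cys could be a starting point for looking for a Cys that is conserved at a variable position nearby.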


Format of tree for ETE tree

(Beaver:0.106861,(('Blind mole rat':0.0870003,Mouse:0.128141):0.0287991,('Naked mole rat':0.316691,((Pika:0.0584227,Rabbit:0.0514835):0.0419969,(((Gibbon:0.062089,Monkey:0.0723263):0.0501071,(Bushbaby:0.104971,(Lemur:0.0853643,Sifaka:0.0510973):0.0395091):0.00449631):0.0111712,((Marsupials:0.390453,((Manatee:0.0517099,Mole:0.11989):0.00669083,'Elephant shrew':0.143216):0.0258566):0.00193119,('Star-nosed mole':0.111938,((Alpaca:0.102393,Pig:0.0618056):0.010159,(Leopard:0.0585696,('Brown bat':0.108369,('Fruit bat':0.0725375,('Horseshoe bat':0.0640651,'Leaf-nosed bat':0.0497925):0.0219586):0.00813098):0.00277678):0.00371913):0.00419253):0.00732498):0.0142435):0.00485045):0.00731201):0.00875856):0.00305836,Squirrel:0.0786295);

Newick format. Output by UGENE, NCBI, and most tree tools.


Protein Data Bank

• rcsb.org

• 4ms2, a voltage-gated calcium channel.

1) Visualize the overall structure in NGL.
2) View ligands.
3) View electron density.
4) Find an amino acid. Zoom in.
5) Homology modeling.

superposed homologs


Homology modeling in a nutshell.

target    ACDEFG....HIKLMNPQRSTVWY
           ||:||    || :|  | ||||:
template  .CDDFGACDGHIYIM..QQSTVWF

Modeling action...

• Add Ala to the N-terminal Cys using energy minimization.
• Keep the conserved Phe sidechain and backbone.
• Cut out the four-residue insertion and connect G to H.
• Switch non-similar sidechains Y->K. Possibly move backbone.
• Cut at M-Q, insert two residues, Asn-Pro.
• Switch similar sidechains F->Y. Keep backbone fixed.
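The modeling actions can be read off the alignment column by column. The sketch below does only that bookkeeping, with hypothetical action names; it does not build coordinates or run energy minimization.

```python
from collections import Counter


def modeling_actions(target, template, gap="."):
    """Classify each alignment column as keep, substitute, insert, or delete."""
    actions = []
    for pos, (t, m) in enumerate(zip(target, template), start=1):
        if t == gap and m == gap:
            continue
        if m == gap:
            # The target has a residue with no template counterpart: build it.
            actions.append((pos, "insert", t))
        elif t == gap:
            # The template has an extra residue: cut it out and reconnect.
            actions.append((pos, "delete", m))
        elif t == m:
            actions.append((pos, "keep", t))
        else:
            # Swap the template sidechain for the target one.
            actions.append((pos, "substitute", f"{m}->{t}"))
    return actions


target = "ACDEFG....HIKLMNPQRSTVWY"
template = ".CDDFGACDGHIYIM..QQSTVWF"
actions = modeling_actions(target, template)
for pos, kind, detail in actions:
    if kind != "keep":
        print(pos, kind, detail)
print(Counter(kind for _, kind, _ in actions))
```

Whether a substitution is "similar" (backbone kept fixed) or "non-similar" (backbone possibly moved) would need a substitution matrix lookup on top of this.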

ALIGNMENT

Automatic homology modeling by SWISS-MODEL


https://swissmodel.expasy.org/interactive

Next time:

• Read the CatSper paper: Bystroff, C. (2018). Intramembranal disulfide cross-linking elucidates the super-quaternary structure of mammalian CatSpers. Reproductive Biology, 18(1), 76-82.

Read at least one part of the paper in detail. Bring a comment, suggestion, or question to class 2/19.
