Introduction to Bioinformatics - Tutorial no. 2 BLAST

Post on 20-Dec-2015

Page 1: Introduction to Bioinformatics - Tutorial no. 2 BLAST

Introduction to Bioinformatics - Tutorial no. 2

BLAST



BLAST – Outline

• Sequence alignment
• Complexity and indexing
• BLASTN and BLASTP
• Basic parameters
• PAM and BLOSUM matrices
• Affine gap model
• E values (once again)


Advanced BLAST

• Databases
• BLAST options
• BLAST output
• Taxonomic BLAST
• Pairwise BLAST


BLAST Variations

Name      Query type            Database
blastn    Genomic               Genomic
blastp    Protein               Protein
blastx    Translated genomic    Protein
tblastn   Protein               Translated genomic
tblastx   Translated genomic    Translated genomic

Translated genomic searches test all 6 reading frames: 3 codon frames on each of the 2 strands (the sequence itself and its reverse complement).
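The six translations used by blastx, tblastn and tblastx can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the standard genetic code; the function names (`six_frame_translations` and its helpers) are made up for this example and are not part of any BLAST tool.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; codons with ambiguous bases (e.g. N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGAAATAGCCGT").items():
        print(name, protein)
```

Each of the six protein strings is then compared against the protein database (blastx) or against the translated database entries (tblastx).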


BLASTN Databases

nr       GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
htgs     High-throughput genomic sequences (draft)
pat      Patented nucleotide sequences
mito     Mitochondrial sequences
vector   Vector subset of GenBank
month    GenBank, EMBL, DDBJ, PDB sequences from the last 30 days
chrom    Contigs and chromosomes from RefSeq


BLASTP Databases

nr          GenBank CDS translations, RefSeq, PDB, SWISS-PROT, PIR, PRF
swissprot   SWISS-PROT
pat         Patented protein sequences
pdb         Protein Data Bank
month       GenBank CDS translations, PDB, SWISS-PROT, PIR, PRF sequences from the last 30 days


BLASTN/P Options (1)

Only search part of the database, using the NCBI Entrez query format

Search a specific organism

Remove regions of low information content, e.g. short repeats or regions rich in only 2 nucleotides

Remove known human repeats (LINEs, SINEs)
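To make "low information content" concrete, here is a rough Python sketch that masks windows whose Shannon entropy falls below a cutoff. It is not the filter NCBI BLAST actually applies (that is DUST for nucleotides and SEG for proteins); the window size, the cutoff and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the letter distribution inside one window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=1.2):
    """Replace every position covered by a low-entropy window with 'N'.

    window and min_entropy are illustrative values, not BLAST defaults.
    """
    seq = seq.upper()
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        # Entropy is always computed on the original, unmasked sequence.
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "N"
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("ACGTTGCAAGCTATATATATATATATATGGCATCGATCGA"))
```

A stretch built from only 2 nucleotides has at most 1 bit of entropy per position, so it falls below the cutoff, while a region using all 4 nucleotides evenly reaches 2 bits.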


BLASTN/P Options (2)

Threshold for the significance of results

Use an index based on words of 7, 11 or 15 nucleotides

Costs to open and extend a gap, and scores for a nucleotide match or mismatch. Allowed gap costs (open/extend): 10/1, 10/2, 11/1, 8/2, 9/2
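The word-size option refers to the index BLAST uses to find short exact matches (seeds) before extending them into alignments. A simplified Python sketch of that idea follows; real BLAST indexing is considerably more elaborate, and `build_word_index` and `find_seeds` are hypothetical helper names.

```python
from collections import defaultdict


def build_word_index(subject, word_size=11):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(subject) - word_size + 1):
        index[subject[pos:pos + word_size]].append(pos)
    return index


def find_seeds(query, index, word_size=11):
    """Return (query_position, subject_position) pairs of exact word matches."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for spos in index.get(word, []):
            seeds.append((qpos, spos))
    return seeds


if __name__ == "__main__":
    subject = "GGGACGTACGTTTT"
    index = build_word_index(subject, word_size=7)
    print(find_seeds("ACGTACG", index, word_size=7))
```

A smaller word size finds more seeds and is therefore more sensitive but slower; a larger word size is faster but can miss weaker similarities.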


BLASTP Options

Scoring matrix: PAM, etc…

Search for a motif (PSI-BLAST)

Costs to open and extend gap
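The open and extend costs implement the affine gap model from the outline: starting a gap is expensive, lengthening it is cheap. The sketch below scores an alignment that has already been computed, assuming the convention that a gap of length k costs open + k × extend; the function names and the toy match/mismatch scores are illustrative only.

```python
def simple_nucleotide_score(a, b, match=1, mismatch=-3):
    """Toy substitution score; a protein search would look up a PAM or BLOSUM matrix."""
    return match if a == b else mismatch


def score_alignment(aligned_query, aligned_subject, substitution_score,
                    gap_open=11, gap_extend=1):
    """Score two gapped strings ('-' marks a gap) under affine gap costs."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    previous_gap_side = None  # "query", "subject", or None for an aligned pair
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-":
            gap_side = "query"
        elif s == "-":
            gap_side = "subject"
        else:
            gap_side = None
        if gap_side is None:
            score += substitution_score(q, s)
        else:
            if gap_side != previous_gap_side:
                score -= gap_open   # a new gap starts at this column
            score -= gap_extend     # every gap column pays the extension cost
        previous_gap_side = gap_side
    return score


if __name__ == "__main__":
    one_long_gap = score_alignment("ACGTACGT", "ACG--CGT",
                                   simple_nucleotide_score, gap_open=5, gap_extend=2)
    two_short_gaps = score_alignment("ACGTACGT", "AC-TA-GT",
                                     simple_nucleotide_score, gap_open=5, gap_extend=2)
    print(one_long_gap, two_short_gaps)
```

With the same number of matches and gap columns, one gap of length 2 scores higher than two separate gaps of length 1, which is the behavior the affine model is designed to produce.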


BLASTN/P Formatting (1)

Show colored bar chart

Number of sequences listed

Number of alignments shown

Other (less important) options on what to show


BLASTN/P Formatting (2)

How to display alignments

Only show results that match an Entrez search or come from a specific organism

Only show results with E values in this range


BLASTN Results

Query sequence representation

Matched areas of database sequences


BLAST Output Header

Request ID for later retrieval

Query sequence details

Database details

Tax BLAST


BLAST Alignments (1)

Sequence Identifier

Sequence description

Score and E value


BLAST Alignments (2)

Normalized score of alignment

Expected number of such hits (2e-11 = 2 × 10⁻¹¹)

Number of exact matches

Number of matches with positive score

Number of insertions / deletions (gaps)
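The identity, positive and gap counts shown for each alignment can be recomputed from the two aligned strings. The Python sketch below assumes gaps are written as "-" and takes the substitution score as a caller-supplied function; `alignment_statistics` is a made-up name, and the toy scoring function in the example is not a real matrix.

```python
def alignment_statistics(aligned_query, aligned_subject, substitution_score=None):
    """Count identities, positives and gaps over the alignment columns."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    identities = positives = gaps = 0
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" or s == "-":
            gaps += 1
        elif q == s:
            identities += 1
            positives += 1  # assumes identical residues always score positively
        elif substitution_score is not None and substitution_score(q, s) > 0:
            positives += 1  # a mismatch that still has a positive score
    return {
        "length": len(aligned_query),
        "identities": identities,
        "positives": positives,
        "gaps": gaps,
    }


if __name__ == "__main__":
    # Toy scoring: treat K/R as a conservative substitution with a positive score.
    def toy_score(a, b):
        return 2 if {a, b} == {"K", "R"} else -1

    print(alignment_statistics("MKV-LA", "MRVQLA", toy_score))
```

For nucleotide alignments there is no separate "positives" count, so the function can be called without a scoring function.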


BLAST Alignments (3)

Query sequence

Exact match

Insertion / deletion

Matched sequence

Mismatch with positive score

Position within sequence

Masked low complexity region


Expectation Values

Increases linearly with the length of the query sequence

Increases linearly with the length of the database

Decreases exponentially with the score of the alignment
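These three properties follow from the Karlin-Altschul formula E = K · m · n · e^(−λS), where m is the query length, n the database length and S the raw alignment score. A small Python sketch is given below; the values of λ and K are illustrative placeholders only, since the real values depend on the scoring matrix and gap costs in use.

```python
import math

# Illustrative Karlin-Altschul parameters, assumed for this sketch only.
LAMBDA = 0.318
K = 0.134


def expect_value(raw_score, query_length, database_length, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S)."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    base = expect_value(100, 500, 1_000_000)
    print("E value:", base)
    print("ratio after doubling the query length:",
          expect_value(100, 1000, 1_000_000) / base)
    print("ratio after raising the score by 10:",
          expect_value(110, 500, 1_000_000) / base)
```

Written in terms of the normalized (bit) score, the same quantity is E = m · n · 2^(−S'), which is why bit scores can be compared across searches that use different scoring systems.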


Tax BLAST

Lineage of organism with strongest hit

Score of organism’s strongest hit

Number of organism hits

Shared ancestry in taxonomic tree


BLAST2SEQ

This tool aligns two given sequences using the BLAST engine for local alignment.

[Screenshot of the BLAST2SEQ form. Labeled fields: type of program, scoring matrix, scoring scheme, gap model, expect value, advanced options, the two input sequences, and the GO! button.]


Questions

You have two query sequences, query1 and query2:

>query1
CCGTCCGTCCGTCGTCCTCCTCGCTTGCGGGGCGCCGGGCCCGTCCTCGAGCCCCCNNNNNCCGTCCGGCCGCGTCGGGGCCTCGCCGCGCTCTACCTACCTACCTGGTTGATCCTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTACGCACGGCCGGTACAGTGAAACTGCGAATGGCTCATTAAATCAGTTATGGTTCCTTTGGTCGCTCGCTCCTCTCCTACTTGGATAACTGTGGTAATTCTAGAGCTAATACATGCCGACGGGCGCTGACCCCCTTCGCGGGGGGGATGCGTGCATTTATCAGATCAAAACCAACCCGGTCAGCCCCTCTCCGGCCCCGGCCGGGGGGCGGGCCGCGGCGGCTTTGGTGACTCTAGATAACCTCGGGCCGATCGCACGCCCCCCGTGGCGGCGACGACCCATTCGAACGTCTGCCCTATCAACTTTCGATGGTAGTCGCCGTGCCTACCATGGTGACCACGGGTGACGGGGAATCAGGGTTCGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCACTCCCGACCCGGGGAGGTAGTGACGAAAAATAACAATACAGGACTCTTTCGAGGCCCTGTAATTGGAATGAGTCCACTTTAAATCCTTTAACGAGGATCCATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGCTGCAGTTAAAAAGCTCGTAGTTGGATCTTGGGAGCGGGCGGGCGGTCCGCCGCGAGGCGAGCCACCGCCCGTCCCCGCCCCTTGCCTCTCGGCGCCCCCTCGATGCTCTTAGCTGAGTGTCCCGCGGGGCCCGAAGCGTTTACTTTGAAAAAATTAGAGTGTTCAAAGCAGGCCCGAGCCGCCTGGATACCGCAGCTAGGAATAATGGAATAGGACCGCGGTTCTATTTTGTTGGTTTTCGGAACTGAGGCCATGATTAAGAGGGACGGCCGGGGGCATTCGTATTGCGCCGCTAGAGGTGAAATTCTTGGACCGGCGCAAGACGGACCAGAGCGAAAGCATTTGCCAAGAATGTTTTCATTAATCAAGAACGAAAGTCGGAGGTTCGAAGACGATCAGATACCGTCGTAGTTCCGACCATAAACGATGCCGACCGGCGATGCGGCGGCGTTATTCCCATGACCCGCCGGGCAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGAAACCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGATAGCTCTTTCTCGATTCCGTGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGCGATTTGTCTGGTTAATTCCGATAACGAACGAGACTCTGGCATGCTAACTAGTTACGCGACCCCCGAGCGGTCGGCGTCCCCCAACTTCTTAGAGGGACAAGTGGCGTTCAGCCACCCGAGATTGAGCAATAACAGGTCTGTGATGCCCTTAGATGTCCGGGGCTGCACGCGCGCTACACTGACTGGCTCAGCGTGTGCCTACCCTACGCCGGCAGGCGCGGGTAACCCGTTGAACCCCATTCGTGATGGGGATCGGGGATTGCAATTATTCCCCATGAACGAGGAATTCCCAGTAAGTGCGGGTCATAAGCTTGCGTTGATTAAGTCCCTGCCCTTTGTACACACCGCCCGTCGCTACTACCGATTGGATGGTTTAGTGAGGCCCTCGGATCGGCCCCGCCGGGGTCGGCCCACGGCCTGGCGGAGCGCTGAGAAGACGGTCGAA


Questions

>query2
TACGAACGCTGGCGGCATGCTAATACATGCAAGTCGAACGAGACCTTCGGGTCTAGTGGCGCACGGGTGGCTAACGCGTGGGAATCTGCCCTTGGGTTCGGAATAACTTCGGGAAACTGAAGCTAATACCGGATGATGACGAAAGTCCAAAGATTTATCGCCCAGGGATGAGCCCGCGTAGGATTAGCTAGTTGGTGGGGTAAAGGCTCACCAAGGCAACGATCCTTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTTACCCGAGATGATAATGACAGTATCGGGAGAATAAGCTCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCGATTTAAGTCAGAGGTGAAAGCCCGGGCTCAACCCCGAACTGCCTTTGAGACTGGATTGCTAGAATCTTGGAGAGGCGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGCGAAGGCGGCTCGCTGGACAAGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGATAACTAGCTGCCGGGGCACATGGTGTTTCGGTGGCGCACGTAACGCATTAAGTTATCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCCTGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCAGCGTTTGACATCCTCATCGCGGATTTCAGAGATGATTTCCTTCAGTTCGGCTGGATGAGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTTAGTTGCCAGCATTTAGTTGGGTACTCTAAAGGAACCGCCGGTGATAAGCCGGAGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTACGCGCTGGGCTACACACGTGCTACAATGGCGACTACAGTGGGCTGCAACCGTGCGAGCGGTAGCTAATCTCCAAAAGTCGTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAGCATGAAGGCGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCAGGCCTTGTACACACCGCCCGTCACACCATGGGATTTGGATTCACCCGAAGGCACTGCGTTAACCCGCAAGGGAGACAGGTGACCACGGTGGGTTTAGAGACTGGGGTGAA


Questions

Using BLASTN:

• Find what each of these sequences codes for.



Questions

• To which organism is each sequence related?

• Do these sequences code for proteins?

Suppose the information used to answer the previous questions were not available to you. Could you suggest another way to answer them?

Answer: use BLASTX.


Questions

• Look carefully at the e-value column of the first 50 results for each query. What can you learn about these sequences? Are they generally conserved across other organisms?

5 last answers


Questions

• Use bl2seq to align the two query sequences. What can you say about the relation between them? Based on the previous answers, does this last result make sense?


Questions

You have two query sequences:

>query3
ATGTCTGCTCCACAAGCCAAGATTTTGTCTCAAGCTCCAACTGAATTGGAATTACAAGTTGCTCAAGCTTTCGTTGAATTGGAAAATTCTTCTCCAGAATTGAAAGCTGAGTTGAGACCTTTGCAATTCAAGTCCATCAGAGAAGT

>query4
GTATGTTATTAATTTGAATCTAAACTTAAGAATAATGGAGAGTAACAAAGGAAAAAAGTGTGAACGGGACGATACCAGAATGTTTCAATCTAGAAAAGTATAAAAGATAAGGACTAGGACTCAAATGTATTTGGCTGACTATCGCCTGAACCTTGATGCTAAGCAAATACCATATCTTCAAGAAAAAGCCTACTCCAGTGTTTAAGAAGAAGGGAACGATTTACTAGATCATGCTATACGCAGTAAGGTTCTGATAGTTAATTACAATCGGTCCAAGTTCTAAGCGGTGTCGTCCATGCATATATCATTTACAAGTTACTGGCGTCAACTCTTCAAATATTCAAAATATCACCTAATCAAACTTACTAACATTTTCCTTTTTTGTTTTCCTTCTTTTATAG

Now use BLASTX:

• Which protein does each of these sequences code for?

• Are these proteins conserved in other organisms?


Answers (BLASTX)

• Query 3: a conserved protein component of the small (40S) ribosomal subunit of S. cerevisiae.

• Query 4: no protein (e-value 3.2).


Questions

• You are told that the sequences were extracted from the same gene. How could you explain the above results?

• Answer: query4 was extracted from a non-coding region (an intron) and thus doesn't code for any protein.
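A quick sanity check on this answer: as printed above, query4 begins with GTATGT and ends with AG, which is consistent with the GT...AG rule for intron boundaries, whereas query3 begins with the start codon ATG. The Python sketch below shows how such checks could be automated. It is a rough heuristic under these assumptions, not a gene-prediction method, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_gt_ag_boundaries(seq):
    """True if the sequence starts with GT and ends with AG (canonical intron ends)."""
    seq = seq.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def count_stop_codons(seq, frame=0):
    """Count stop codons when reading the forward strand from offset `frame`."""
    seq = seq.upper()
    return sum(
        1 for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS
    )


if __name__ == "__main__":
    # Toy sequences; paste query3 and query4 here to check the real ones.
    for name, seq in (("coding-like", "ATGTCTGCTCCA"),
                      ("intron-like", "GTATGTTAATAGCTTTTATAG")):
        stops = [count_stop_codons(seq, frame) for frame in range(3)]
        print(name, has_gt_ag_boundaries(seq), stops)
```

A sequence with stop codons in every reading frame and GT...AG ends is unlikely to be a protein-coding exon, which matches the BLASTX result of no significant protein hit.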