access to sequences: genbank – a place to start and then some more... links: embl nucleotide...


Upload: gillian-carter

Post on 24-Dec-2015


TRANSCRIPT

Page 1: Access to sequences: GenBank – a place to start and then some more... Links: embl nucleotide archive

Access to sequences: GenBank – a place to start and then some more...

Links:
EMBL Nucleotide Archive (ENA): http://www.ebi.ac.uk/ena/
DNA Data Bank of Japan: http://www.ddbj.nig.ac.jp/
GenBank: http://www.ncbi.nlm.nih.gov/


GenBank contains a wealth of data of many types


…but the main part consists of sequences (DNA, RNA, amino acid; short fragments, genomes…)

For an annotated sample GenBank sequence record, click here.

There are lots of categories and information, but you can also view the sequence in a much more streamlined form (called FASTA format):

>gi|1293613|gb|U49845.1|SCU49845 Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cdsGATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAGACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAAGTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATATTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAATTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCGTCACAAATAAATTTTGGCAACTTATGTTTCCTCTTCGAGCAGTACTCGAGCCCTGTCTCAAGAATGTAATAATACCCATCGTAGGTATGGTTAAAGATAGCATCTCCACAACCTCAAAGCTCCTTGCCGAGAGTCGCCCTCCTTTGTCGAGTAATTTTCACTTTTCATATGAGAACTTATTTTCTTATTCTTTACTCTCACATCCTGTAGTGATTGACACTGCAACAGCCACCATCACTAGAAGAACAGAACAATTACTTAATAGAAAAATTATATCTTCCTCGAAACGATTTCCTGCTTCCAACATCTACGTATATCAAGAAGCATTCACTTACCATGACACAGCTTCAGATTTCATTATTGCTGACAGCTACTATATCACTACTCCATCTAGTAGTGGCCACGCCCTATGAGGCATATCCTATCGGAAAACAATACCCCCCAGTGGCAAGAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATACCTATAAATCGTCTGTAGACAAGACAGCTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACTCTAGTTCTAGAACGTTCTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTCAATGTAATACTCGAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTTGTTACAAACCGTCCATCCATCTCGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTATGGTTATACTAACGGCAAAAACGCTCTGAAACTAGATCCTAATGAAGTCTTCAACGTGACTTTTGACCGTTCAATGTTCACTAACGAAGAATCCATTGTGTCGTATTACGGACGTTCTCAGTTGTATAATGCGCCGTTACCCAATTGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCACCGGTGATAAACTCGGCGATTGCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTTTCTGCCGTTGAGGTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAATAGTTTGATAATCAACGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTATCTCGATGACGATCCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGGGTGGCATTAGATAATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGAACTCCAATCCTGCCAATTTTTCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGAAGTTGTCTCCACAACGGATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATGGTTCTCCTACTATTTTTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTACTAATTCAAGCC
AAGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAGTGCCCAAGAATTTCGACAAGCTTTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAGCTATATTTTAACATCATTGGCATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACGTCCACAAGAAGTTCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAAATTTCTTCTACCTCCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTTCATCTCACAATAAAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAGTAGCTCTCATTTGCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTACCGCATGCTATTAGTGGACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTACACCTTTGAACAACCCCTTTGATGATGATGCTTCCTCGTACGATGATACTTCAATAGCAAGAAGATTGGCTGCTTTGAACACTTTGAAATTGGATAACCACTCTGCCACTGAATCTGATATTTCCAGCGTGGATGAAAAGAGAGATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAATCCCAAAGTAAAGAAGAATTATTAGCAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATAGGTCTTCTTCTGTGTATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTGTCACCAGTCTCTGATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACTTTTCGATTTAGAAGCACCAGAGAAGGAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGGACCCTTGGAACAGCAATATTAGCCCTTCTCCCGTAAGAAAATCAGTAACACCATCACCATATAACGTAACGAAGCATCGTAACCGCCACTTACAAAATATTCAAGACTCTCAAAGCGGTAAAAACGGAATCACTCCCACAACAATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAAAATTTTTGCTGGGTCCATAGCATGGAACCAGACAGAAGACCAAGTAAGAAAAGGTTAGTAGATTTTTCAAATAAGAGTAATGTCAATGTTGGTCAAGTTAAGGACATTCACGGACGCATCCCAGAAATGCTGTGATTATACGCAACGATATTTTGCTTAATTTTATTTTCCTGTTTTATTTTTTATTAGTGGTTTACAGATACCCTATATTTTATTTAGTTTTTATACTTAGAGACATTTAATTTTAATTCCATTCTTCAAATTTCATTTTTGCACTTAAAACAAAGATCCAAAAATGCTCTCGCCCTCTTCATATTGAGAATACACTCCATTCAAAATTTTGTCGTCACCGCTGATTAATTTTTCACTAAACTGATGAATAATCAAAGGCCCCACGTCAGAACCGACTAAAGAAGTGAGTTTTATTTTAGGAGGTTGAAAACCATTATTGTCTGGTAAATTTTCATCTTCTTGACATTTAACCCAGTTTGAATCCCTTTCAATTTCTGCTTTTTCCTCCAAACTATCGACCCTCCTGTTTCTGTCCAACTTATGTCCTAGTTCCAATTCGATCGCATTAATAACTGCTTCAAATGTTATTGTGTCATCGTTGACTTTAGGTAATTTCTCCAAATGCATAATCAAACTATTTAAGGAAGATCGGAATTCGTCGAACACTTCAGTTTCCGTAATGATCTGATCGTCTTTATCCACATGTTGTAATTCACTAAAATCTAAAACGTATTTTTCAATGCATAAATCGTTCTTTTTATTAATAATGCAGATGGAAAATCTGTAAACGTGCGTTAATTTAGAAAGAACATCCAGTATAAGTTC
TTCTATATAGTCAATTAAAGCAGGATGCCTATTAATGGGAACGAACTGCGGCAAGTTGAATGACTGGTAAGTAGTGTAGTCGAATGACTGAGGTGGGTATACATTTCTATAAAATAAAATCAAATTAATGTAGCATTTTAAGTATACCCTCAGCCACTTCTCTACCCATCTATTCATAAAGCTGACGCAACGATTACTATTTTTTTTTTCTTCTTGGATCTCAGTCGTCGCAAAAACGTATACCTTCTTTTTCCGACCTTTTTTTTAGCTTTCTGGAAAAGTTTATATTAGTTAAACAGGGTCTAGTCTTAGTGTGAAAGCTAGTGGTTTCGATTGACTGATATTAAGAAAGTGGAAATTAAATTAGTAGTGTAGACGTATATGCATATGTATTTCTCGCCTGTTTATGTTTCTACGTACTTTTGATTTATAGCAAGGGGAAAAGAAATACATACTATTTTTTGGTAAAGGTGAAAGCATAATGTAAAAGCTAGAATAAAATGGACGAAATAAAGAGAGGCTTAGTTCATCTTTTTTCCAAAAAGCACCCAATGATAATAACTAAAATGAAAAGGATTTGCCATCTGTCAGCAACATCAGTTGTGTGAGCAATAATAAAATCATCACCTCCGTTGCCTTTAGCGCGTTTGTCGTTTGTATCTTCCGTAATTTTAGTCTTATCAATGGGAATCATAAATTTTCCAATGAATTAGCAATTTCGTCCAATTCTTTTTGAGCTTCTTCATATTTGCTTTGGAATTCTTCGCACTTCTTTTCCCATTCATCTCTTTCTTCTTCCAAAGCAACGATCCTTCTACCCATTTGCTCAGAGTTCAAATCGGCCTCTTTCAGTTTATCCATTGCTTCCTTCAGTTTGGCTTCACTGTCTTCTAGCTGTTGTTCTAGATCCTGGTTTTTCTTGGTGTAGTTCTCATTATTAGATCTCAAGTTATTGGAGTCTTCAGCCAATTGCTTTGTATCAGACAATTGACTCTCTAACTTCTCCACTTCACTGTCGAGTTGCTCGTTTTTAGCGGACAAAGATTTAATCTCGTTTTCTTTTTCAGTGTTAGATTGCTCTAATTCTTTGAGCTGTTCTCTCAGCTCCTCATATTTTTCTTGCCATGACTCAGATTCTAATTTTAAGCTATTCAATTTCTCTTTGATC

where the first line, introduced by ‘>’, is the header, and anything after the first line break is considered to be the sequence. FASTA (or Pearson’s) format is the most widely used sequence format in bioinformatics!
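The header/sequence rule can be sketched as a small parser. This is only a minimal sketch: the function name `parse_fasta` is illustrative (not part of any NCBI tool), and the example record keeps only the first 40 bases of U49845.1.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # the sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Shortened, illustrative record: only the first 40 bases are kept.
example = """>gi|1293613|gb|U49845.1|SCU49845 Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds
GATCCTCCATATACAACGGT
ATCTCCACCTCAGGTTTAGA
"""

records = list(parse_fasta(example))
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

Joining the sequence lines matters because FASTA files usually wrap the sequence at a fixed width.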


…but first, you have to find it!


you can search by keyword (could be a name, an abbreviation...)


... or by a unique identifier, the ‘Accession number’


... or first filter for all sequences of a particular organism


... and then use a keyword


check the results you want to save, click ‘Display settings’, then ‘Apply’


and copy results into any text editor


or click ‘Send to’, set Format to FASTA, and save the file wherever you want

This way, you can also download the whole protein/nucleotide set of any particular taxonomic unit, or even the genomic sequence. Try to figure out how!
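The slides only show the web interface. NCBI also offers programmatic access through its E-utilities service, which is not covered here. As a hedged sketch, the snippet below only builds the request URLs (it performs no network access); the helper names `esearch_url` and `efetch_fasta_url` and the example search term are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nuccore", retmax=20):
    """URL for a keyword search; `term` may carry field tags such as [Organism]."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax}
    )


def efetch_fasta_url(ids, db="nuccore"):
    """URL that returns the given accessions as FASTA text."""
    return f"{EUTILS}/efetch.fcgi?" + urlencode(
        {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    )


# Organism filter plus keyword, then download by accession number.
print(esearch_url("Saccharomyces cerevisiae[Organism] AND AXL2"))
print(efetch_fasta_url(["U49845.1"]))
```

Opening the second URL in a browser should return the same kind of FASTA text that ‘Send to’ produces.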


... you can also search by similarity/homology using BLAST


• Set of sequence comparison algorithms (1990)
• Search sequence databases for optimal local alignments to a query
• Heuristic approach based on the Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• Available as WWW, standalone, and network clients

The BLAST programs (Basic Local Alignment Search Tool)

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” Nucleic Acids Res. 25:3389-3402.

BLAST+


1) Choose the sequence (query)

2) Select the BLAST program

3) Choose the database to search

4) Choose optional parameters

The BLAST programs (Basic Local Alignment Search Tool)


Program   Description

blastp    Compares an amino acid query sequence against a protein sequence database.

blastn    Compares a nucleotide query sequence against a nucleotide sequence database.

blastx    Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.

tblastn   Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.

tblastx   Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

The BLAST programs: Select the BLAST program


Program                            Notes

Megablast (nucleotide only)
  Contiguous                       Nearly identical sequences
  Discontiguous                    Cross-species comparison

Position Specific (protein only)
  PSI-BLAST                        Automatically generates a position-specific score matrix (PSSM)
  RPS-BLAST                        Searches a database of PSI-BLAST PSSMs

The BLAST programs: Select the BLAST program


First choose the appropriate database and algorithm: if you have an amino acid sequence and you are after proteins, use blastp (protein BLAST); if you are looking for the coding sequence, use tblastn (translated BLAST); and so on.
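The choice of program follows mechanically from the type of the query and the type of the database, as listed in the program table of these slides. The helper below is a hypothetical illustration of that lookup, not part of BLAST itself; the function name and the `translate_both` flag are assumptions made for the sketch.

```python
def choose_blast_program(query, database, translate_both=False):
    """Pick a BLAST program; query and database are 'protein' or 'nucleotide'.

    translate_both=True asks for a protein-level comparison of two
    nucleotide inputs (six-frame translation on both sides, i.e. tblastx).
    """
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "nucleotide"): "blastn",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }
    if (query, database) not in table:
        raise ValueError("query and database must be 'protein' or 'nucleotide'")
    if translate_both:
        if (query, database) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return table[(query, database)]


# Amino acid query, looking for proteins: blastp.
print(choose_blast_program("protein", "protein"))
# Amino acid query, looking for the coding sequence: tblastn.
print(choose_blast_program("protein", "nucleotide"))
```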


paste your query sequence or accession number here

sometimes it’s handy to narrow the search to a specific group of organisms


How does it work?

BLAST Algorithm in layers

“The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990)

Three heuristic layers: seeding, extension, and evaluation

• Seeding – identify where to start alignment

• Extension – extending alignment from seeds

• Evaluation – Determine which alignments are statistically significant


BLAST Algorithm: Seeding

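Compile a list of word pairs (w = 3) with a score of at least the threshold T.

Example: for a human RBP query …FSGTWYA…, the list of words (w = 3) is:

FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS

The first row holds the words taken directly from the query; the rows below hold similar words.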

BLAST locates all common words in a pair of sequences, then uses them as seeds for the alignment

Discriminating between real and artificial matches is done using an estimate of probability that the match might occur by chance.

scores (S) and e-values (E) of BLAST hits

word = a defined number of letters
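The seeding step can be sketched in a few lines. This is a minimal illustration, not the BLAST implementation: `toy_score` is a hypothetical stand-in for a real substitution matrix such as BLOSUM 62, so the neighbourhood it produces differs from what BLAST would report.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    # Hypothetical scoring (NOT BLOSUM 62): +5 for identity, -2 otherwise.
    return 5 if a == b else -2


def query_words(query, w=3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold, score=toy_score):
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


words = query_words("FSGTWYA")
neighbours = neighborhood("GTW", threshold=8)
print(words)
print(len(neighbours), neighbours[:5])
```

With this toy scoring, a threshold of 8 admits the word itself plus every word with a single substitution. A real matrix would admit only conservative substitutions, which keeps the word list short.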


BLAST Algorithm: Seeding: Score

score=alignment quality


• Substitution matrices are used for amino acid alignments.
  – Each possible residue substitution is given a score.

• A simpler unitary matrix is used for DNA pairs (+1 for a match, -2 for a mismatch)
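As a small worked example of the unitary DNA scoring, the sketch below scores an ungapped pairing of two equal-length sequences with +1 per match and -2 per mismatch. The function name and the example sequences are illustrative.

```python
def score_ungapped(seq_a, seq_b, match=1, mismatch=-2):
    """Sum of per-position scores for two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


# Nine matching positions and one mismatch: 9 * 1 + 1 * (-2) = 7
example_score = score_ungapped("GATCCTCCAT", "GATCGTCCAT")
print(example_score)
```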


BLAST Algorithm: Seeding: Scoring matrix

aa frequency, aa properties


BLOSUM vs PAM

• BLOSUM 62 is the default in BLAST 2.0.
  - Tailored for comparisons of moderately distant proteins; performs well in detecting closer relationships.
  - A search for distant relatives may be more sensitive with a different matrix.

More divergent                   Less divergent
BLOSUM 45       BLOSUM 62       BLOSUM 90
PAM 250         PAM 160         PAM 100

PAM (Percent Accepted Mutation)
- theoretical approach
- based on assumptions of mutation probabilities

BLOSUM (BLOcks SUbstitution Matrix)
- empirical
- constructed from multiply aligned protein families
- ungapped segments (blocks) clustered based on percent identity

BLAST Algorithm: Seeding: Scoring matrix


BLAST Algorithm: Seeding: E value

• Low E-values suggest that sequences are homologous
• Statistical significance depends on both the size of the alignments and the size of the sequence database
  ‣ Important consideration for comparing results across different searches
  ‣ E-value increases as the database gets bigger
  ‣ E-value decreases as alignments get longer

Suggested BLAST Cutoffs

• For nucleotide based searches, one should look for hits with E-values of 10^-6 or less and sequence identity of 70% or more

• For protein based searches, one should look for hits with E-values of 10^-3 or less and sequence identity of 25% or more
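The suggested cutoffs translate directly into a filter over search hits. The sketch below assumes each hit is already available as a dictionary with `evalue` and `identity` (percent) keys; that data layout and the three example hits are made up for illustration.

```python
# (maximum E-value, minimum percent identity) per search type, from the slide.
CUTOFFS = {
    "nucleotide": (1e-6, 70.0),
    "protein": (1e-3, 25.0),
}


def passes_cutoff(hit, search_type):
    """True if the hit meets both the E-value and the identity cutoff."""
    max_evalue, min_identity = CUTOFFS[search_type]
    return hit["evalue"] <= max_evalue and hit["identity"] >= min_identity


# Made-up hits for illustration only.
hits = [
    {"id": "subject_1", "evalue": 1e-30, "identity": 98.0},
    {"id": "subject_2", "evalue": 1e-4, "identity": 85.0},
    {"id": "subject_3", "evalue": 1e-10, "identity": 60.0},
]

kept = [h["id"] for h in hits if passes_cutoff(h, "nucleotide")]
print(kept)
```

These thresholds are rules of thumb; borderline hits are usually worth inspecting by eye rather than discarding automatically.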

E-value = significance of the alignment

The E-value is the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E-value, the more significant the score.
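The slides do not give a formula, but the standard Karlin-Altschul estimate is E = K · m · n · e^(−λS), where m is the query length, n the database length, and K and λ depend on the scoring system. The sketch below uses assumed, illustrative values for K and λ; real values are reported in the BLAST output for each search.

```python
import math

# Assumed, illustrative parameters; BLAST reports the actual values per search.
LAMBDA = 0.318
K = 0.134


def expect_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalised score S' = (lambda * S - ln K) / ln 2, so E = m * n * 2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)


small_db = expect_value(raw_score=60, query_len=250, db_len=1_000_000)
large_db = expect_value(raw_score=60, query_len=250, db_len=100_000_000)
better = expect_value(raw_score=90, query_len=250, db_len=1_000_000)
print(small_db, large_db, better)
```

The same raw score gives a larger E-value in the larger database, and a higher score (typically from a longer alignment) gives a smaller one. That is why E-values from different searches are not directly comparable.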


When you manage to find a hit (i.e. a match between a “word” and a database entry), extend the hit in both directions.

Keep track of the score (use a scoring matrix)

Stop when the score drops below some cutoff.

KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)

MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)

← extend   Hit!   extend →

BLAST Algorithm: Extension and Evaluation

Originally, every hit was extended in either direction; in a later refinement of BLAST, two independent hits are required before extension.
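The extend-and-stop rule can be sketched as an ungapped extension that gives up once the running score falls too far below the best score seen so far. This is a simplified illustration using the +1/-2 DNA scoring; the function names, the `x_drop` value, and the example sequences are assumptions, and the two-hit requirement is not modelled.

```python
def match_score(a, b, match=1, mismatch=-2):
    return match if a == b else mismatch


def _extend(pairs, x_drop):
    """Best score gain and its length over successive aligned pairs."""
    running = best = 0
    best_len = 0
    for length, (a, b) in enumerate(pairs, start=1):
        running += match_score(a, b)
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score dropped too far below the best: stop extending
    return best, best_len


def extend_seed(query, subject, q_pos, s_pos, w, x_drop=3):
    """Extend a seed of length w left and right; return (score, q_begin, q_end)."""
    seed = sum(
        match_score(a, b)
        for a, b in zip(query[q_pos:q_pos + w], subject[s_pos:s_pos + w])
    )
    right_gain, right_len = _extend(
        zip(query[q_pos + w:], subject[s_pos + w:]), x_drop
    )
    left_gain, left_len = _extend(
        zip(reversed(query[:q_pos]), reversed(subject[:s_pos])), x_drop
    )
    return seed + left_gain + right_gain, q_pos - left_len, q_pos + w + right_len


query = "TTTTACGTACGTGGGG"
subject = "CCCCACGTACGTAAAA"
score, begin, end = extend_seed(query, subject, q_pos=4, s_pos=4, w=3)
print(score, query[begin:end])
```

The segment returned is the highest-scoring stretch around the seed, which corresponds to the HSP described on the next slide.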


BLAST Algorithm: Extension and Evaluation

BLAST algorithm extends the initial “seed” hit into an HSP

HSP = high scoring segment pair = Local optimal alignment


BLAST-related tools for genomic DNA

• MegaBLAST at NCBI

• BLAT (BLAST-like alignment tool). BLAT parses an entire genomic DNA database into words (11-mers), then searches them against a query, a mirror image of the BLAST strategy.

http://genome.ucsc.edu

• SSAHA at Ensembl uses a strategy similar to that of BLAT.

http://www.ensembl.org
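The "index the database, then look up the query" idea can be sketched as a dictionary from words to positions. This is a toy illustration, not BLAT or SSAHA themselves. Taking non-overlapping database words (the default `step`) is an assumption of the sketch, and the test run uses k = 4 instead of 11 to keep the example readable.

```python
from collections import defaultdict


def build_index(database, k=11, step=None):
    """Map each database word of length k to the positions where it starts."""
    step = k if step is None else step  # assumption: non-overlapping words
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, step):
        index[database[pos:pos + k]].append(pos)
    return index


def find_seeds(query, index, k=11):
    """Look up every overlapping query word; return (query_pos, db_pos) pairs."""
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for db_pos in index.get(query[q_pos:q_pos + k], ()):
            seeds.append((q_pos, db_pos))
    return seeds


index = build_index("AAAACCCCGGGGTTTT", k=4)
seeds = find_seeds("TTCCCCGG", index, k=4)
print(seeds)
```

Because the index is built once and reused for every query, this layout suits repeated searches against the same genome.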


it’ll even tell you whether it found any known domain

... or level of similarity


scroll down to bottom...

the more the better


check hits you want to save ... then click ‘Download’


Access to sequenced data: Species and Taxa Specific Databases

https://genome.ucsc.edu/ENCODE/

http://www.genecards.org/

http://www.biobase-international.com/product/hgmd


Comparative database of eukaryotic pathogens


gene/metabolic pathway oriented databases