ncbi fieldguide a field guide part 2 august 30, 2005 university of colorado health sciences center

Post on 12-Jan-2016

217 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

NC

BI

Fie

ldG

uid

eA Field Guide part 2

August 30, 2005 University of Colorado Health Sciences Center

NC

BI

Fie

ldG

uid

ePart 2

Entrez: text searching• a GenBank record• preview/index

BLAST: sequence searching• pre-computed searches• algorithms• what’s new?

VAST: structure searching

Example: mapping oligos to a genome

NC

BI

Fie

ldG

uid

eGenBank Records

Header

Feature Table

Sequence

The Flatfile Format

NC

BI

Fie

ldG

uid

eA Typical GenBank Record

LOCUS NM_019570 4279 bp mRNA linear INV 28-OCT-2004DEFINITION Mus musculus REV1-like(S. cerevisiae)(Rev1l),mRNAACCESSION NM_019570VERSION NM_019570.3 GI:50811869 KEYWORDS .

= Title

NC

BI

Fie

ldG

uid

eGenBank Record: Feature Table

NC

BI

Fie

ldG

uid

e

GenPept identifier

GenBank Record: Feature Table, con’t.

NC

BI

Fie

ldG

uid

eGenBank Record: sequence

skip

NC

BI

Fie

ldG

uid

eIndexing for Nucleotide UID 59958365

Field Indexed Terms

[primary accession] NM_001012399[title] Bos taurus hemochromatosis (hfe), mRNA.[organism] Bos taurus[sequence length] 1168[modification date] 2005/02/19[properties] biomol mrna

gbdiv mamsrcdb refseq

[accn]

[orgn]

[mdat][prop]

NC

BI

Fie

ldG

uid

eGlobal Entrez Search: HFE

HFE

NC

BI

Fie

ldG

uid

e

Entrez Nucleotide: HFE 137 records

Not HFE [Title]

NC

BI

Fie

ldG

uid

eSmarter Query

hfe[title]

42 records

Curated HFE splice variants(11 total)

AND human[orgn]

NC

BI

Fie

ldG

uid

ehfe[title] AND human[orgn] (con’t)

Primary data

NC

BI

Fie

ldG

uid

ePreview/Index

NC

BI

Fie

ldG

uid

ePreview/Index

NC

BI

Fie

ldG

uid

ePreview/Index: Properties, srcdb

srcdbProperties

NC

BI

Fie

ldG

uid

ePreview/Index: Properties, srcdb

…AND srcdb refseq[Properties]…AND srcdb refseq[Properties]

NC

BI

Fie

ldG

uid

ePreview/Index: Properties, srcdb

…AND srcdb ddbj/embl/genbank[Properties]…AND srcdb ddbj/embl/genbank[Properties]

NC

BI

Fie

ldG

uid

e#1 hfe 137#2 hfe[title] AND human[orgn] 42

#3 #2 AND srcdb refseq[prop] 11#4 #2 AND srcdb ddbj/embl/genbank[prop] 31

Database Queries

#5 #4 AND gbdiv pri[prop] 29

#4 #4 AND gbdiv est[prop] 2

Primate division gbdiv pri[prop]EST division gbdiv est[prop]

NC

BI

Fie

ldG

uid

e

Molecule Queries

#1 hfe 116

#2 hfe[title] AND human[orgn] 42

#3 #2 AND biomol mrna[prop] 29

#4 #2 AND biomol genomic[prop] 13

Genomic DNA biomol genomic[prop]cDNA biomol mrna[prop]

NC

BI

Fie

ldG

uid

eMore Queries…

Fields are database-specific

Entrez Nucleotide

Reviewed RefSeqs with transcript variants:

srcdb refseq reviewed[prop] AND transcript[title] AND variant[title]

Topoisomerase genes from Archaea:

topoisomerase[gene name] AND archaea[organism]

Entrez Gene

Genes on human chromosome 2 with OMIM links

2[chromosome] AND human[organism] AND “gene omim”[filter]

Membrane proteins linked to cancer:

“integral to plasma membrane”[gene ontology] AND cancer[dis]

NC

BI

Fie

ldG

uid

e

Other Entrez Databases

UniSTS: markers on the Genethon map of human chromosome 12

Genethon[Map Name] AND human[organism] AND 12[chromosome]

UniGene: rat clusters that have at least one mRNA

rat[organism] NOT 0[mrna count]

Structure: structures of bacterial kinases with resolutions below 2 Å

bacteria[organism] AND kinase AND 000.00:002.00[resolution]

SNP: uniquely mapped microsatellites on human chr2

microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome]) AND human[orgn]

NC

BI

Fie

ldG

uid

e

Basic Local Alignment Search Tool

NC

BI

Fie

ldG

uid

eBLAST Web Searches, 2005

200,000

NC

BI

Fie

ldG

uid

e

Nucleotide or protein: Related

Sequences

BLAST link: BLink

Precomputed BLAST Services

Transcript clusters: UniGene

Protein homologs: HomoloGene

NC

BI

Fie

ldG

uid

eLink to Related Sequences

NC

BI

Fie

ldG

uid

eRelated Sequences

Most similar

Least similar

NC

BI

Fie

ldG

uid

e

BLink (BLAST Link)

NC

BI

Fie

ldG

uid

eBLink Output

Best hitsBest hits 3D structures3D structures CDD-SearchCDD-Search

NC

BI

Fie

ldG

uid

eGlobal vs Local Alignment

Seq 1

Seq 2

Seq 1

Seq 2

Global alignment

Local alignment

NC

BI

Fie

ldG

uid

e

Global vs Local Alignment

Seq1: WHEREISWALTERNOW (16aa)Seq2: HEWASHEREBUTNOWISHERE (21aa)

Global

Seq1: 1 W--HEREISWALTERNOW 16 W HERE

Seq2: 1 HEWASHEREBUTNOWISHERE 21

LocalSeq1: 1 W--HERE 5 Seq1: 1 W--HERE 5 W HERE W HERESeq2: 3 WASHERE 9 Seq2: 15 WISHERE 21

NC

BI

Fie

ldG

uid

eThe Flavors of BLAST

• Standard BLAST– nucleotide, protein and translations (blastn, blastp,

blastx, tblastn, tblastx)– traditional “contiguous” word hit

• Megablast– optimized for large batch searches– can use discontiguous words

• PSI-BLAST– constructs PSSMs automatically; uses as query– very sensitive protein search

• RPS BLAST– searches a database of PSSMs– tool for conserved domain searches

“contiguous”

discontiguous

NC

BI

Fie

ldG

uid

eFast- heuristic approach based on Smith Waterman

Local alignments

Statistical significance- Expect value

Versatile- blastn, blastp, blastx, tblastn, tblastx, rps-blast,

psi-blast- www, standalone, and network clients

Why Is BLAST So Popular?

NC

BI

Fie

ldG

uid

e

How BLAST Works

• Make lookup table of “words” for query

• Scan database for hits

• Ungapped extensions of hits (initial HSPs)

• Gapped extensions (no traceback)

• Gapped extensions (traceback; alignment

details)

• Make lookup table of “words” for query

• Scan database for hits

• Ungapped extensions of hits (initial HSPs)

• Gapped extensions (no traceback)

• Gapped extensions (traceback; alignment

details)

NC

BI

Fie

ldG

uid

eNucleotide Words

GTACTGGACATGGACCCTACAGGAAQuery:

GTACTGGACAT

TACTGGACATG

ACTGGACATGG

CTGGACATGGA

TGGACATGGAC

GGACATGGACC

GACATGGACCC

ACATGGACCCT

Make a lookuptable of words

11-mer

. . .

NC

BI

Fie

ldG

uid

eProtein Words

GTQITVEDLFYNIATRRKALKNQuery:

Neighborhood Words

LTV, MTV, ISV, LSV, etc.

GTQ

TQI

QIT

ITV

TVE

VED

EDL

DLF

...

Make a lookuptable of words

Word size = 3 (default)

Word size can only be 2 or 3

[ -f 11 = blastp default ]

NC

BI

Fie

ldG

uid

e

Minimum Requirements for a Hit

• Nucleotide BLAST requires one exact match• Protein BLAST requires two neighboring matches within 40 aa

GTQITVEDLFYNI

SEI YYN

ATCGCCATGCTTAATTGGGCTT

CATGCTTAATT

neighborhood words

one exact match

two matches

[ -A 40 = blastp default ]

NC

BI

Fie

ldG

uid

e

BLASTP Summary

YLS HFLSbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333

Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47

Gapped extension with trace back

Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50 +E YA YL K F+YLSL +SP+ +DVNVHP+K VHFL+++ I + +Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337

Final HSP

+E YA YL K F+ L +SP+ +DVNVHP+K V +++ I

High-scoring pair (HSP)

HFL 18HFV 15 HFS 14HWL 13NFL 13DFL 12HWV 10etc …

YLS 15YLT 12 YVS 12YIT 10etc …

Neighborhood words

Neighborhood score threshold

T (-f) =11

Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…

example query words

NC

BI

Fie

ldG

uid

e

Scoring Systems - Nucleotides

A G C T

A +1 –3 –3 -3

G –3 +1 –3 -3

C –3 –3 +1 -3

T –3 –3 –3 +1

Identity matrix

CAGGTAGCAAGCTTGCATGTCA

|| |||||||||||| ||||| raw score = 19-9 = 10

CACGTAGCAAGCTTG-GTGTCA

[ -r 1 -q -3 ]

NC

BI

Fie

ldG

uid

eScoring Systems - Proteins

Position Independent MatricesPAM Matrices (Percent Accepted Mutation)

• Derived from observation; small dataset of alignments• Implicit model of evolution• All calculated from PAM1• PAM250 widely used

BLOSUM Matrices (BLOck SUbstitution Matrices)• Derived from observation; large dataset of highly

conserved blocks• Each matrix derived separately from blocks with a

defined percent identity cutoff• BLOSUM62 - default matrix for BLAST

Position Specific Score Matrices (PSSMs)PSI- and RPS-BLAST

NC

BI

Fie

ldG

uid

e

A 4R -1 5 N -2 0 6D -2 -2 1 6C 0 -3 -3 -3 9Q -1 1 0 0 -3 5E -1 0 0 2 -4 2 5G 0 -2 0 -1 -3 -2 -2 6H -2 0 1 -1 -3 0 0 -2 8I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 A R N D C Q E G H I L K M F P S T W Y V X

BLOSUM62

D

F

Negative for less likely substitutions

D

Y

FPositive for more likely substitutions

NC

BI

Fie

ldG

uid

e

Position-Specific Score Matrix

DAF-1

Serine/Threonine protein kinases catalytic loop

1 7 4PSSM scores 5 4

NC

BI

Fie

ldG

uid

e

A R N D C Q E G H I L K M F P S T W Y V 435 K -1 0 0 -1 -2 3 0 3 0 -2 -2 1 -1 -1 -1 -1 -1 -1 -1 -2 436 E 0 1 0 2 -1 0 2 -1 0 -1 -1 0 0 0 -1 0 0 -1 -1 -1 437 S 0 0 -1 0 1 1 0 1 1 0 -1 0 0 0 2 0 -1 -1 0 -1 438 N -1 0 -1 -1 1 0 -1 3 3 -1 -1 1 -1 0 0 -1 -1 1 1 -1 439 K -2 1 1 -1 -2 0 -1 -2 -2 -1 -2 5 1 -2 -2 -1 -1 -2 -2 -1 440 P -2 -2 -2 -2 -3 -2 -2 -2 -2 -1 -2 -1 0 -3 7 -1 -2 -3 -1 -1 441 A 3 -2 1 -2 0 -1 0 1 -2 -2 -2 0 -1 -2 3 1 0 -3 -3 0 442 M -3 -4 -4 -4 -3 -4 -4 -5 -4 7 0 -4 1 0 -4 -4 -2 -4 -1 2 443 A 4 -4 -4 -4 0 -4 -4 -3 -4 4 -1 -4 -2 -3 -4 -1 -2 -4 -3 4 444 H -4 -2 -1 -3 -5 -2 -2 -4 10 -6 -5 -3 -4 -3 -2 -3 -4 -5 0 -5 445 R -4 8 -3 -4 0 -1 -2 -3 -2 -5 -4 0 -3 -2 -4 -3 -3 0 -4 -5 446 D -4 -4 -1 8 -6 -2 0 -3 -3 -5 -6 -3 -5 -6 -4 -2 -3 -7 -5 -5 447 I -4 -5 -6 -6 -3 -4 -5 -6 -5 3 5 -5 1 1 -5 -5 -3 -4 -3 1 448 K 0 0 1 -3 -5 -1 -1 -3 -3 -5 -5 7 -4 -5 -3 -1 -2 -5 -4 -4 449 S 0 -3 -2 -3 0 -2 -2 -3 -3 -4 -4 -2 -4 -5 2 6 2 -5 -4 -4 450 K 0 3 0 1 -5 0 0 -4 -1 -4 -3 4 -3 -2 2 1 -1 -5 -4 -4 451 N -4 -3 8 -1 -5 -2 -2 -3 -1 -6 -6 -2 -4 -5 -4 -1 -2 -6 -4 -5 452 I -3 -5 -5 -6 0 -5 -5 -6 -5 6 2 -5 2 -2 -5 -4 -3 -5 -3 3 453 M -4 -4 -6 -6 -3 -4 -5 -6 -5 0 6 -5 1 0 -5 -4 -3 -4 -3 0 454 V -3 -3 -5 -6 -3 -4 -5 -6 -5 3 3 -4 2 -2 -5 -4 -3 -5 -3 5 455 K -2 1 1 4 -5 0 -1 -2 1 -4 -2 4 -3 -2 -3 0 -1 -5 -2 -3 456 N 1 1 3 0 -4 -1 1 0 -3 -4 -4 3 -2 -5 -2 2 -2 -5 -4 -4 457 D -3 -2 5 5 -1 -1 1 -1 0 -5 -4 0 -2 -5 -1 0 -2 -6 -4 -5 458 L -3 -1 0 -3 0 -3 -2 3 -4 -2 3 0 1 1 -2 -2 -3 5 -1 -3

Position-Specific Score Matrix

catalytic loop

NC

BI

Fie

ldG

uid

eLocal Alignment Statistics

High scores of local alignments between two random sequencesfollow the Extreme Value Distribution

Score (S)

Alig

nm

en

ts

(applies to ungapped alignments)

E = Kmne-S or E = mn2-S’

K = scale for search space = scale for scoring system S’ = bitscore = (S - lnK)/ln2

Expect ValueE = number of database hits you expect to find by chance, ≥ S

your score

expected number of

random hits

More info: www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

NC

BI

Fie

ldG

uid

e

Gapped Alignments

Gapping provides more biologically realistic alignments

Gapped BLAST parameters are simulated for each scoring matrix

Affine gap costs = -(a+bk)a = gap open penalty b = gap extend penaltyA gap of length 1 receives the score -(a+b)

NC

BI

Fie

ldG

uid

e

An Alignment BLAST Cannot Make

1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG || | || || || | || || || || | ||| |||||| | | || | ||| | 1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT | || || || ||| || | |||||| || | |||||| ||||| | | 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC |||| || ||||| || || | | |||| || ||| 121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG || | || || || | || || || || | ||| |||||| | | || | ||| | 1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT | || || || ||| || | |||||| || | |||||| ||||| | | 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC |||| || ||||| || || | | |||| || ||| 121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Reason:

no contiguous exact match of 7 bp.

NC

BI

Fie

ldG

uid

e

BLAST 2 Sequences (blastx) output:

An Alignment BLAST Can Make

Solution: compare protein sequences; BLASTXScore = 290 bits (741), Expect = 7e-77Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)Frame = +3

Score = 290 bits (741), Expect = 7e-77Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)Frame = +3

NC

BI

Fie

ldG

uid

e

Other BLAST Algorithms

• Megablast

• Discontiguous Megablast

• PSI-BLAST

• PHI-BLAST

• Megablast

• Discontiguous Megablast

• PSI-BLAST

• PHI-BLAST

NC

BI

Fie

ldG

uid

e

Megablast: NCBI’s Genome Annotator

• Long alignments of similar DNA sequences

• Greedy algorithm

• Concatenation of query sequences

• Faster than blastn; less sensitive

• Long alignments of similar DNA sequences

• Greedy algorithm

• Concatenation of query sequences

• Faster than blastn; less sensitive

NC

BI

Fie

ldG

uid

e

MegaBLAST & Word Size

Trade-off: sensitivity vs speed

Too fast foryou?

NC

BI

Fie

ldG

uid

e

MegaBLAST & Word Size

Trade-off: sensitivity vs speed

23blastp

828megablast

711blastn

minimumdefaultWORD SIZE

NC

BI

Fie

ldG

uid

e

Discontiguous Megablast

• Uses discontiguous word matches

• Better for cross-species comparisons

• Uses discontiguous word matches

• Better for cross-species comparisons

NC

BI

Fie

ldG

uid

e

Templates for Discontiguous Words

W = 11, t = 16, coding: 1101101101101101W = 11, t = 16, non-coding: 1110010110110111W = 12, t = 16, coding: 1111101101101101W = 12, t = 16, non-coding: 1110110110110111W = 11, t = 18, coding: 101101100101101101W = 11, t = 18, non-coding: 111010010110010111W = 12, t = 18, coding: 101101101101101101W = 12, t = 18, non-coding: 111010110010110111W = 11, t = 21, coding: 100101100101100101101W = 11, t = 21, non-coding: 111010010100010010111W = 12, t = 21, coding: 100101101101100101101W = 12, t = 21, non-coding: 111010010110010010111

Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5

W = word size; # matches in template

t = template length

NC

BI

Fie

ldG

uid

eDiscontiguous (Cross-species)

MegaBLAST

NC

BI

Fie

ldG

uid

eDiscontiguous Word

Options

NC

BI

Fie

ldG

uid

eMegaBLAST vs Discontiguous

MegaBLAST

NM_017460 Homo sapiens cytochrome P450, family 3, subfamily A, polypeptide 4 (CYP3A4),

transcript variant 1, mRNA (2768 letters)

vs Drosophila

NC

BI

Fie

ldG

uid

e

MegaBLAST vs Discontiguous MegaBLAST

MegaBLAST = “No significant similarity found.”

Discontiguous megaBLAST =

NC

BI

Fie

ldG

uid

e

Another Example . . .

Discontiguous megaBLAST = numerous hits . . .

Query: NM_078651

Drosophila melanogaster CG18582-PA (mbt) mRNA, (3244 bp)

/note= mushroom bodies tiny; synonyms: Pak2, STE20, dPAK2

MegaBLAST = “No significant similarity found.”

Database: nr (nt), Mammalia[orgn]

NC

BI

Fie

ldG

uid

eEx: Discontiguous MegaBLAST

NC

BI

Fie

ldG

uid

eEx: BLASTN

NC

BI

Fie

ldG

uid

e

PSI-BLAST

Example: Confirming relationships of purine

nucleotide metabolism proteins

Position-specific Iterated BLAST

NC

BI

Fie

ldG

uid

e>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINEMAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK

PSI-BLAST

0.005 E value cutoff for PSSM

NC

BI

Fie

ldG

uid

eRESULTS: Initial BLASTP

Same results as protein-protein BLAST; different format

NC

BI

Fie

ldG

uid

eResults of First PSSM Search

Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

NC

BI

Fie

ldG

uid

eTenth PSSM Search: Convergence

Just below threshold, another nucleotide metabolism enzyme

Check to add to PSSM

NC

BI

Fie

ldG

uid

eReverse PSI-BLAST (RPS)-BLAST

NC

BI

Fie

ldG

uid

eAdenosine/AMP Deaminase Domain

AMP Deaminases

.

.

.

NC

BI

Fie

ldG

uid

e

PHI-BLAST

>gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASELIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHVIKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDILKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEIASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK

[GA]xxxxGK[ST]

NC

BI

Fie

ldG

uid

eGenome BLAST

NC

BI

Fie

ldG

uid

eGenome BLAST via Map Viewer

NC

BI

Fie

ldG

uid

eExample Search Pathways:

Hemochromatosis

Gene

OMIMOMIM GeneGene

“hemochromatosis”HFE

nucleotide sequence

GenomeBLAST Map Viewer

SNP

Protein

Domains

text search

sequence search

NC

BI

Fie

ldG

uid

eExample: Human Genome BLAST

TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAGTGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGAACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGATGCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGATGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGGGGAAGAGC

Human EST

NC

BI

Fie

ldG

uid

eHuman Genome BLAST: Results

NC

BI

Fie

ldG

uid

e

Human Genome BLAST: MapViewer

Entrez GeneEntrez Gene

NC

BI

Fie

ldG

uid

e

What’s New?

NC

BI

Fie

ldG

uid

e

BLAST DatabasesNucleotide

• refseq_rna = NM_*, XM_*

• refseq_genomic = NC_*, NG_*

• env_nt– environmental sample[filter], e.g., 16S

rRNA

Protein

• refseq = NP_*, XP_*

• env_nr

NC

BI

Fie

ldG

uid

eNew Formatter

Select lower case Select red

NC

BI

Fie

ldG

uid

eNew Formatter

• gray line = same database hit

• hsp’s color-coded independently

NC

BI

Fie

ldG

uid

e

BLAST Output: Alignments & Filter

low complexity sequence filtered

NC

BI

Fie

ldG

uid

eAdvanced Options

Limit to Organism

all[filter] NOT ma

Example Entrez Queriesall[Filter] NOT mammalia[Organism]ray finned fishes[Organism]srcdb refseq[Properties]

Nucleotide only:biomol mrna[Properties]biomol genomic[Properties]

OtherAdvanced–e 10000 expect value-v 2000 descriptions-b 2000alignments

Example Entrez Queriesall[Filter] NOT mammalia[Organism]ray finned fishes[Organism]srcdb refseq[Properties]

Nucleotide only:biomol mrna[Properties]biomol genomic[Properties]

OtherAdvanced–e 10000 expect value-v 2000 descriptions-b 2000alignments

-e 10000 -v 2000

NC

BI

Fie

ldG

uid

eSearching by Structure

Why search for similar structures?

• Find homologs with low sequence similarity

• Explore protein evolution: similar protein folds can support different functions

• Identify conserved core elements to model related proteins of unknown structure

Why search for similar structures?

• Find homologs with low sequence similarity

• Explore protein evolution: similar protein folds can support different functions

• Identify conserved core elements to model related proteins of unknown structure

NC

BI

Fie

ldG

uid

e

Indexing into MMDB

Structure

id 1 , name "helix 1" , type helix , location subgraph residues interval { { molecule-id 1 , from 49 , to 61 } } } ,

Add secondary structure

inter-residue-bonds { { atom-id-1 { molecule-id 1 , residue-id 1 , atom-id 1 } , atom-id-2 { molecule-id 1 , residue-id 2 , atom-id 9 } } ,

Add chemical bonds

• Import only experimentally determined structures• Convert to ASN.1 • Verify sequences

• Create “backbone” model (Cα, P only)• Create single-conformer model

MMDBMolecular Modeling Data Base

NC

BI

Fie

ldG

uid

e

Structure Summary

Conserved Domains3D Domain Neighbors

Structure Neighbors

NC

BI

Fie

ldG

uid

e

3D Domains

1

32

4

NC

BI

Fie

ldG

uid

e

Conserved Domains

TyrKc

SH3

SH2

NC

BI

Fie

ldG

uid

e

VAST: Alignment

For each protein chain,

locate SSEs (secondarystructure elements),

represent SSEs asindividual vectors, 1

2

3

4

5 6

Human IL-4

IL-4 &Leptinalign the vectors.

NC

BI

Fie

ldG

uid

e

VAST

Structure neighbors

Taq DNA polymerase

NC

BI

Fie

ldG

uid

eVAST Results for the Chain

Table view

NC

BI

Fie

ldG

uid

e

VAST

Vector Alignment Search Tool

3D Domain structure neighbors

NC

BI

Fie

ldG

uid

eVAST Results for Domain 1

Not found with Chain query!

NC

BI

Fie

ldG

uid

e

Best way to convert PDB files to MMDB format

for viewing with Cn3D!

Best way to convert PDB files to MMDB format

for viewing with Cn3D!

submit file to PDB

NC

BI

Fie

ldG

uid

eExample: Mapping Oligos Onto

a Genome

>forwardCCATGGCGACCCTGGAAAAGC

>reverseCAGCAGCGGCTGTGCCTGCGG

??

?

NC

BI

Fie

ldG

uid

eMap Oligos Onto Genome

>CCATGGCGACCCTGGAAAAGCNNNNNNNNNNCAGCAGCGGCTGTGCCTGCGG

-W 7 –e 1000

forward primer reverse primer

NC

BI

Fie

ldG

uid

eGenome BLAST Results

NC

BI

Fie

ldG

uid

e

Primer Alignments

forward primer

reverse primer

NC

BI

Fie

ldG

uid

e

MapViewer

NC

BI

Fie

ldG

uid

e

MapViewer

NC

BI

Fie

ldG

uid

eSequence View (sv)

forward

reverse

NC

BI

Fie

ldG

uid

e

Service Addresses

•BLAST blast-help@ncbi.nlm.nih.gov

•General Help info@ncbi.nlm.nih.gov•Wayne Matten matten@ncbi.nlm.nih.gov

•BLAST blast-help@ncbi.nlm.nih.gov

•General Help info@ncbi.nlm.nih.gov•Wayne Matten matten@ncbi.nlm.nih.gov

top related