A Field Guide to GenBank and NCBI Molecular Biology Resources, slightly modified from Peter Cooper (ftp://ftp.ncbi.nih.gov/pub/cooper/FieldGuide/) and Eric Sayers (ftp://ftp.ncbi.nih.gov/pub/sayers/Field_Guide/U_Penn/)

Upload: gladys-wiggins

Post on 18-Jan-2016


Page 1: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

A Field Guide to GenBank and NCBI Molecular Biology Resources

slightly modified from

Peter Cooper
ftp://ftp.ncbi.nih.gov/pub/cooper/FieldGuide/

Eric Sayers
ftp://ftp.ncbi.nih.gov/pub/sayers/Field_Guide/U_Penn/

Page 2: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

NCBI Resources

• About NCBI
• NCBI Sequence Databases
  – Primary Database – GenBank
  – Derivative Databases – RefSeq
• Entrez Databases and Text Searching
• BLAST Services
• Genomic Resources

Page 3: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

The National Center for Biotechnology Information (NCBI)

• Created as a part of NLM in 1988
  – Establish public databases
  – Perform research in computational biology
  – Develop software tools for sequence analysis
  – Disseminate biomedical information
• Tools: BLAST (1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Human genome (2001)

Page 4: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

NCBI Home Page

http://www.ncbi.nlm.nih.gov

To learn more, visit the “Site Map” and “About NCBI” web pages.

Page 5: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

About NCBI

Page 6: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Some NCBI Statistics

[Figure: Growth of GenBank, 1982 to 2002. Two series: Base Pairs of DNA (millions; axis 0 to 30,000) and Sequences (millions; axis 0 to 23).]

Page 7: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Users per day

[Figure: users per day, 1997 to 2001 (axis 0 to 250,000), with Christmas Day marked on the plot.]

Page 8: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Molecular Databases

• Primary Databases
  – Original submissions by experimentalists
  – Database staff organize but don’t add additional information
    • Example: GenBank
• Derivative Databases
  – Human curated
    • compilation and correction of data
    • Example: SWISS-PROT, NCBI RefSeq mRNA
  – Computationally Derived
    • Example: UniGene
  – Combinations
    • Example: NCBI Genome Assembly

Page 9: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

What is GenBank? NCBI’s Primary Sequence Database

• Nucleotide only sequence database
• GenBank Data
  – Direct submissions of individual records (BankIt, Sequin)
  – Batch submissions via email (EST, GSS, STS)
  – ftp accounts established for sequencing centers
• Data shared amongst three collaborating databases:
  – GenBank
  – DNA Database of Japan (DDBJ)
  – European Molecular Biology Laboratory Database (EMBL)

Page 10: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

The International Nucleotide Sequence Database Collaboration

[Diagram: GenBank (NCBI, NIH), EMBL (EBI), and DDBJ (CIB, NIG) exchange data with one another. Each database receives submissions and updates. Retrieval tools shown: Entrez, SRS, getentry. Submission routes shown: Sequin, BankIt, ftp.]

Page 11: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

GenBank: NCBI’s Primary Sequence Database

• full release every two months
• incremental and cumulative updates daily
• available only through internet

ftp://ftp.ncbi.nih.gov/genbank/

Release 133, December 2002

• 22,318,883 Records
• 28,507,990,166 Nucleotides
• 110,000+ Species
• >90 Gigabytes of data

Page 12: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Entrez Nucleotide

23,464,770 records

GenBank  71%
DDBJ     19%
EMBL      9%
RefSeq    1%
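The chart gives only percentage shares, so rough record counts per source can be worked out by multiplying each share by the total. The following is a minimal sketch of that arithmetic; the shares are rounded on the slide, so the results are estimates only, and the names `TOTAL_RECORDS`, `SHARES`, and `approximate_counts` are illustrative rather than anything from NCBI.

```python
# Figures as quoted on the Entrez Nucleotide slide.
TOTAL_RECORDS = 23_464_770
SHARES = {"GenBank": 71, "DDBJ": 19, "EMBL": 9, "RefSeq": 1}  # percent


def approximate_counts(total: int, shares: dict[str, int]) -> dict[str, int]:
    """Estimate the record count for each source from its percentage share.

    Integer arithmetic is used, so each estimate is rounded down.
    """
    return {name: total * pct // 100 for name, pct in shares.items()}


if __name__ == "__main__":
    for name, count in approximate_counts(TOTAL_RECORDS, SHARES).items():
        print(f"{name:8s} ~{count:,} records")
```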

Page 13: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Primary vs. Derivative Databases

[Diagram: short example sequences (ATTGACTA, ACGTGC, TTGACA, TATAGCCG, and others) drawn around the labels Labs, Sequencing Centers, GenBank, Curators, RefSeq, Algorithms, UniGene, and Genome Assembly, ending in one longer assembled sequence, TATAGCCGAGCTCCGATACCGATGACAA.]

Page 14: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Traditional GenBank Divisions

BCT  Bacterial and Archaeal
INV  Invertebrate
MAM  Mammalian (ex. ROD and PRI)
PHG  Phage
PLN  Plant and Fungal
PRI  Primate
ROD  Rodent
SYN  Synthetic (cloning vectors)
VRL  Viral
VRT  Other Vertebrate

• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized

Page 15: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

A Traditional GenBank Record

[Annotated record. Labeled fields:]

• Locus Field
• Molecule Type
• GenBank Division
• Modification Date
• Definition Line
• Accession Number
• Version
• Taxonomy
• GI (GenInfo)
• Keywords

Page 16: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

A Traditional GenBank Record

Page 17: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Bulk Sequence Divisions of GenBank

EST  Expressed Sequence Tag
STS  Sequence Tagged Site
GSS  Genome Survey Sequence
HTG  High Throughput Genomic
HTC  High Throughput cDNA

• Batch Submissions (email and ftp)
• Inaccurate
• Poorly Characterized
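The three-letter division codes from the traditional and bulk division slides lend themselves to a simple lookup table. The sketch below is one possible way to encode them in Python; the function name `describe_division` and the "traditional"/"bulk" labels are illustrative choices, not an NCBI API.

```python
# Division codes as listed on the traditional and bulk division slides.
TRADITIONAL_DIVISIONS = {
    "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate",
    "MAM": "Mammalian (excluding ROD and PRI)",
    "PHG": "Phage",
    "PLN": "Plant and Fungal",
    "PRI": "Primate",
    "ROD": "Rodent",
    "SYN": "Synthetic (cloning vectors)",
    "VRL": "Viral",
    "VRT": "Other Vertebrate",
}

BULK_DIVISIONS = {
    "EST": "Expressed Sequence Tag",
    "STS": "Sequence Tagged Site",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic",
    "HTC": "High Throughput cDNA",
}


def describe_division(code: str) -> tuple[str, str]:
    """Return (group, description) for a GenBank division code.

    Raises KeyError for codes that are not listed on these two slides.
    """
    key = code.strip().upper()
    if key in TRADITIONAL_DIVISIONS:
        return "traditional", TRADITIONAL_DIVISIONS[key]
    if key in BULK_DIVISIONS:
        return "bulk", BULK_DIVISIONS[key]
    raise KeyError(f"unknown division code: {code!r}")


if __name__ == "__main__":
    print(describe_division("ROD"))
    print(describe_division("est"))
```

Separating the two dictionaries keeps the slides' distinction between direct, well-characterized submissions and batch submissions visible in the code.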

Page 18: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Organization of GenBank

23,087,196 records

EST            67%
GSS            19%
Traditional     8%
PAT             4%
STS, HTG, HTC   2%

• 11 Traditional Divisions
• 5 Bulk Divisions
• 1 Patent Division

Page 19: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

What is UniGene?

A gene-oriented view of sequence entries

• MegaBlast-based automated sequence clustering
• Nonredundant set of gene-oriented clusters
• Each cluster represents a unique gene
• Provides information on tissue-specific expression and map locations
• Includes well-characterized genes and novel ESTs
• Useful for gene discovery and selection of mapping reagents
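The idea of grouping sequences into nonredundant, gene-oriented clusters can be illustrated with a toy example. The sketch below is not UniGene's actual pipeline: it assumes the pairwise similarity hits (which the slide attributes to MegaBlast) have already been computed and are passed in as pairs of made-up sequence IDs, and it simply merges linked sequences with a union-find structure.

```python
def cluster_sequences(ids, links):
    """Group sequence IDs into clusters connected by similarity links.

    ids   -- iterable of sequence identifiers (hypothetical names here)
    links -- iterable of (id_a, id_b) pairs judged similar by some
             upstream comparison; every ID in a pair must appear in ids
    Returns a sorted list of sorted clusters.
    """
    ids = list(ids)
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        # Follow parent pointers to the cluster root, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for seq_id in ids:
        clusters.setdefault(find(seq_id), []).append(seq_id)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    sequences = ["est1", "est2", "est3", "est4", "mrna1"]
    hits = [("est1", "est2"), ("est2", "mrna1")]
    for members in cluster_sequences(sequences, hits):
        print(members)
```

In this toy run, a known mRNA and two ESTs end up in one cluster while the unlinked ESTs remain singletons, loosely mirroring the slide's point that clusters include both well-characterized genes and novel ESTs.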

Page 20: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Organisms Represented in UniGene

Page 21: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Genome Sequencing

[Diagram labels: Whole BAC insert (or genome); shredding; cloning, isolating; sequencing (GSS division or trace archive); assembly; Draft Sequence (HTG division).]

Page 22: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Working Draft Sequence

[Diagram label: gaps]

Page 23: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

HTG Division: High Throughput Genome

phase 1   Acc = AC109609.1    HTG
phase 2   Acc = AC109609.6    HTG
phase 3   Acc = AC109609.10   ROD
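The identifiers on this slide (AC109609.1, AC109609.6, AC109609.10) share one accession and differ only in the number after the dot, which is the sequence version. The following minimal sketch splits such an identifier into its two parts; the function name `split_accession_version` is illustrative.

```python
def split_accession_version(identifier: str) -> tuple[str, int]:
    """Split 'ACCESSION.VERSION' into (accession, version).

    Raises ValueError if there is no numeric version after the last dot.
    """
    accession, dot, version = identifier.strip().rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, int(version)


if __name__ == "__main__":
    for ident in ["AC109609.1", "AC109609.6", "AC109609.10"]:
        print(split_accession_version(ident))
```

Converting the version to an integer matters when sorting: as strings, "10" would sort before "6".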

Page 24: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

HTG Division: High Throughput Genome

Page 25: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

NCBI’s Third Party Annotation (TPA) Database

NEW

• NCBI now accepts the submission of new annotations of existing GenBank sequences
• Facilitates the annotation of genomes by experts

Page 26: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

A Sample TPA record

Page 27: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

RefSeq: NCBI’s Derivative Sequence Database

• Curated transcripts and proteins
  – reviewed
  – human, mouse, rat, fruit fly, zebrafish, Arabidopsis
• Human model transcripts and proteins
• Assembled Genomic Regions (contigs)
  – draft human genome
  – mouse genome
• Chromosome records
  – microbial
  – viral
  – organelle

Page 28: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

The RefSeq Accession Numbers

mRNAs and Proteins
NM_123456  Curated mRNA
NP_123456  Curated Protein
NR_123456  Curated non-coding RNA
XM_123456  Predicted Transcript (human, mouse)
XP_123456  Predicted Protein (human, mouse)
XR_123456  Predicted non-coding RNA

Gene Records
NG_123456  Reference Genomic Sequence (human)

Assemblies
NT_123456  Contig (Mouse and Human)
NW_123456  Supercontig (Mouse)
NC_123456  Chromosome (Microbial, Viral, Arabidopsis)
NR_123456  Interim Identifier for Microbial Chromosomes

Organisms listed: human, mouse, rat, fruit fly, zebrafish, Arabidopsis
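Because the two-letter prefix of a RefSeq accession encodes the record type, a small lookup is enough to classify an accession. The sketch below uses only the prefixes and descriptions listed on this slide; the function name `refseq_record_type` is illustrative. The slide lists `NR_` twice, so the table keeps only the first meaning (curated non-coding RNA), which is a simplification made here.

```python
# Prefix meanings as listed on the slide. NR_ appears twice there; only
# the first meaning is kept in this table.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted transcript",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NT": "Contig",
    "NW": "Supercontig",
    "NC": "Chromosome",
}


def refseq_record_type(accession: str) -> str:
    """Return the record type implied by a RefSeq accession prefix."""
    prefix, underscore, _ = accession.strip().upper().partition("_")
    if not underscore or prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"not a recognized RefSeq accession: {accession!r}")
    return REFSEQ_PREFIXES[prefix]


if __name__ == "__main__":
    for acc in ["NM_123456", "XP_123456", "NC_123456"]:
        print(acc, "->", refseq_record_type(acc))
```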

Page 29: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Curated RefSeq Records: NM_, NP_

Page 30: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Entrez: Linking and Neighboring

Page 31: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

The Entrez Databases

Page 32: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

The (ever) Expanding Entrez System

[Diagram of databases in the Entrez system:]

• Nucleotide
• Protein
• Structure
• PubMed
• PopSet
• Genome
• OMIM
• Taxonomy
• Books
• ProbeSet
• 3D Domains
• UniSTS
• SNP
• CDD
• UniGene
• Journals
• PubMed Central

Page 33: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Entrez Nucleotides

Search: glucose 6 phosphate dehydrogenase

Page 34: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Document Summaries

glucose 6 phosphate dehydrogenase[All Fields] = 748 hits
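The slides run this search through the Entrez web page. The same query can also be expressed as a URL for NCBI's E-utilities `esearch` service, which the slides do not cover; the sketch below is offered only as an assumption about how one might script the search. It builds the URL and sends no request. The database name `nucleotide`, the `retmax` value, and the function name `build_esearch_url` are choices made for this sketch.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch service (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "nucleotide", retmax: int = 20) -> str:
    """Build an ESearch URL for a text query; no request is sent."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # The query shown on the slide, including its field qualifier.
    print(build_esearch_url("glucose 6 phosphate dehydrogenase[All Fields]"))
```

The hit count returned today would differ from the 748 quoted on the slide, since the databases have grown considerably since then.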

Page 35: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Entrez Nucleotides: Limits

Search: glucose 6 phosphate dehydrogenase

Fields available for limiting the search:

• Accession
• All Fields
• Author Name
• EC/RN Number
• Feature key
• Filter
• Gene Name
• Issue
• Journal Name
• Keyword
• Modification Date
• Organism
• Page Number
• Primary Accession
• Properties
• Protein Name
• Publication Date
• SeqID String
• Sequence Length
• Substance Name
• Text Word
• Title Word
• Uid
• Volume

Page 36: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Entrez Nucleotides: Preview/Index

Page 37: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Adding Terms: Preview/Index

• Accession
• All Fields
• Author Name
• EC/RN Number
• Feature key
• Filter
• Gene Name
• Issue
• Journal Name
• Keyword
• Modification Date
• Organism
• Page Number
• Primary Accession
• Properties
• Protein Name
• Publication Date
• SeqID String
• Sequence Length
• . . .

Page 38: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Plant G6PD mRNAs
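Narrowing the G6PD search to plant mRNAs amounts to combining several field-qualified terms with AND, using the `term[Field]` syntax seen in the earlier query. The slide does not show the exact query, so the sketch below is a guess at how it might be composed: the field names come from the Limits list, while the specific terms `plants` and `biomol mrna` are assumptions, as are the helper names `fielded` and `and_query`.

```python
def fielded(term: str, field: str) -> str:
    """Attach an Entrez field qualifier to a search term."""
    return f"{term}[{field}]"


def and_query(*clauses: str) -> str:
    """Join already-qualified clauses with the AND operator."""
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Hypothetical reconstruction; the exact terms are assumptions.
    query = and_query(
        fielded("glucose 6 phosphate dehydrogenase", "Title Word"),
        fielded("plants", "Organism"),
        fielded("biomol mrna", "Properties"),
    )
    print(query)
```

Each added clause can only shrink the result set, which is how the Preview/Index page lets a broad search be narrowed step by step.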

Page 39: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Display: Formats, Links, and Neighbors

• Summary
• Brief
• ASN.1
• FASTA
• XML
• GenBank
• GI list
• LinkOut
• Nucleotide Neighbors
• Genome Links
• ProbeSet Links
• OMIM Links
• PopSet Links
• Protein Links
• PubMed Links
• SNP Links
• Structure Links
• Taxonomy Links
• UniSTS Links

Page 40: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

>gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd
CCACCAGATATAATTAAGTAGATCAGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGAGATAGCATAGGTACTGAATCTCCTGTAGCAAGAGAGGTACTTGAAACTGGCACACTCTCTATTGTTGTGCTTGGTGCTTCTGGTGATCTTGCCAAGAAGAAGACTTTTCCTGCACTTTTTCACTTATATAAACAGGAATTGTTGCCACCTGATGAAGTTCACATTTTTGGCTATGCAAGGTCAAAGATCTCCGATGATGAATTGAGAAACAAATTGCGTAGCTATCTTGTTCCAGAGAAAGGTGCTTCTCCTAAACAGTTAGATGATGTATCAAAGTTTTTACAATTGGTTAAATATGTAAGTGGCCCTTATGATTCTGAAGATGGATTTCGCTTGTTGGATAAAGAGATTTCAGAGCATGAATATTTGAAAAATAGTAAAGAGGGTTCATCTCGGAGGCTTTTCTATCTTGCACTTCCTCCTTCAGTGTATCCATCCGTTTGCAAGATGATCAAAACTTGTTGCATGAATAAATCTGATCTTGGTGGATGGACACGCGTTGTTGTTGAGAAACCCTTTGGTAGGGATCTAGAATCTGCAGAAGAACTCAGTACTCAGATTGGAGAGTTATTTGAAGAACCACAGATTTATCGTATTGATCACTATTTAGGAAAGGAACTAGTGCAAAACATGTTAGTACTTCGTTTTGCAAATCGGTTCTTCTTGCCTCTGTGGAACCACAACCACATTGACAATGTGCAGATAGTATTTAGAGAGGATTTTGGAACTGATGGTCGTGGTGGATATTTTGACCAATATGGAATTATCCGAGATATCATTCCAAACCATCTGTTGCAGGTTCTTTGCTTGATTGCTATGGAAAAACCCGTTTCTCTCAAGCCTGAGCACATTCGAGATGAGAAAGTGAAGGTTCTTGAATCAGTACTCCCTATTAGAGATGATGAAGTTGTTCTTGGACAATATGAAGGCTATACAGATGACCCAACTGTACCGGACGATTCAAACACCCCGACTTTTGCAACTACTATTCTGCGGATACACAATGAAAGATGGGAAGGTGTTCCTTTCATTGTGAAAGCAGGGAAGGCCCTAAATTCTAGGAAGGCAGAGATTCGGGTTCAATTCAAGGATGTTCCTGGTGACATTTTCAGGAGTAAAAAGCAAGGGAGAAACGAGTTTGTTATCCGCCTACAACCTTCAGAAGCTATTTACATGAAGCTTACGGTCAAGCAACCTGGACTGGAAATGTCTGCAGTTCAAAGTGAACTAGACTTGTCATATGGGCAACGATATCAAGGGATAACCATTCCAGAGGCTTATGAGCGTCTAATTCTCGACACAATTAGAGGTGATCAACAACATTTTGTTCGCAGAGACGAATTAAAGGCATCATGGCAAATATTCACACCACTTTTACACAAAATTGATAGAGGGGAGTTGAAGCCGGTTCCTTACAACCCGGGAAGTAGAGGTCCTGCAGAAGCAGATGAGTTATTAGAAAAAGCTGGATATGTTCAAACACCCGGTTATATATGGATTCCTCCTACCTTATAGAGTGACCAAATTTCATAATAAAACAAGGATTAGGATTATCAGGAGCTTATAAATAAGTCTTCAATAAGCTTGTGAAATTTTCGTTATAATCTCTCTCATTTTGGGGTGTATATCAAGCATTTAAGCGCGTGTTTGACACAGTTTGTGTAATAGATTTGGCTCTGAATGAAAATAAACGGGAATTGTTTCTTTTTGTTTTA

FASTA definition line

>gi|603218|gb|U18238.1|MSU18238

• > marks the start of the definition line
• gi number: 603218
• Database identifier: gb
• Accession number: U18238.1
• Locus name: MSU18238

Database identifiers

gb   GenBank
emb  EMBL
dbj  DDBJ
sp   SWISS-PROT
pdb  Protein Databank
pir  PIR
prf  PRF
ref  RefSeq
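The pipe-delimited layout of this definition line makes it straightforward to pull apart in code. The following is a minimal sketch that handles only the five-field `gi|number|db|accession|locus` form shown on the slide; other definition-line layouts exist and are not covered. The function name `parse_defline` and the returned dictionary keys are illustrative.

```python
# Database identifier tags as listed on the slide.
DATABASE_TAGS = {
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "SWISS-PROT",
    "pdb": "Protein Databank",
    "pir": "PIR",
    "prf": "PRF",
    "ref": "RefSeq",
}


def parse_defline(line: str) -> dict:
    """Parse a '>gi|number|db|accession|locus description' definition line."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    identifiers, _, description = line[1:].strip().partition(" ")
    fields = identifiers.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError(f"unsupported definition line layout: {identifiers!r}")
    _, gi, db_tag, accession, locus = fields
    return {
        "gi": int(gi),
        "database": DATABASE_TAGS.get(db_tag, db_tag),  # fall back to raw tag
        "accession": accession,
        "locus": locus,
        "description": description,
    }


if __name__ == "__main__":
    defline = ">gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd"
    for key, value in parse_defline(defline).items():
        print(f"{key}: {value}")
```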

Page 41: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Entrez Genome

Page 42: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Organism Pages

Page 43: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

The Map Viewer: a common platform for integrated display

Page 44: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

The Map Viewer

Page 45: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Entrez PubMed

Page 46: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Online Books

Page 47: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Entrez Specialized Databases

• Taxonomy: Searchable taxonomic tree having nodes for all species with records in an Entrez database
• OMIM: Online Mendelian Inheritance in Man, a database of genetically linked human diseases
• ProbeSet: Expression data (GEO) and microarray datasets

Page 48: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Entrez Taxonomy

Page 49: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Entrez OMIM

Page 50: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Entrez ProbeSet

Page 51: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Trace Archive

Page 52: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Entrez Structure

Page 53: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Structure Summary

Cn3D viewer

Related Structures

Conserved Domains

Page 54: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Cn3D: Displaying Structures

Page 55: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers

Structural Alignment

Page 56: A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper  Eric Sayers