NCBI FieldGuide
National Center for Biotechnology Information
A Field Guide to GenBank
and NCBI’s Molecular Biology Resources
August 30, 2005 University of Colorado Health Sciences Center
Topics
• About NCBI
• GenBank overview
• Primary vs derivative databases
• The Reference Sequence (RefSeq) project
• Entrez databases
• Genome resources
• Bookshelf
-break-
• Entrez text searching
• BLAST sequence searching
• VAST structure searching
• An integrated example
The National Institutes of Health
Bethesda, MD
The National Center for Biotechnology Information
• Accepts submissions of primary data
• Develops tools to analyze these data
• Creates derivative databases based on the primary data
• Provides free search, link, and retrieval of these data, primarily through the Entrez system
NCBI WWW Users per Day
[Chart: number of users per day, 1997 to 2003; vertical axis from 0 to 450,000 users; annotation: Christmas & New Year]
Homepage - accessing the data
all[filter]
all[filter]
1/11/2005
3/15/2005
8/15/2005
Entrez Nucleotide
Primary Data
  GenBank / DDBJ / EMBL   57.3 million (97.4 %)
Derivative Data
  RefSeq                  1.47 million (2.5 %)
  RefSeq reviewed         60,000
  PDB (structures)        5,973
“Total”                   59 million
GenBank
# records
GenBank: NCBI’s Primary Sequence Database
ftp://ftp.ncbi.nih.gov/genbank/ ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank
Release 149, August 2005
47 x 10^6 records
52 x 10^9 nucleotides
195 gigabytes, 816 files
• full release every two months
• incremental and cumulative updates daily
• available only through internet
• release notes: gbrel.txt
Over 100 billion bases!
What is GenBank?
• Nucleotide only sequence database
• Archival in nature
• GenBank Data
  - Direct submissions (traditional records)
  - Batch submissions (EST, GSS, STS)
  - ftp accounts (genome data)
• Three collaborating databases
  - GenBank
  - DNA Database of Japan (DDBJ)
  - European Molecular Biology Laboratory (EMBL) Database
GenBank Divisions
“Organismal”
PRI (28) Primate
ROD (15) Rodent
PLN (13) Plant and Fungal
BCT (11) Bacterial/Archaeal
INV (7) Invertebrate
VRT (7) Other Vertebrate
VRL (4) Viral
MAM (2) Mammalian
PHG (1) Phage
SYN (1) Synthetic
UNA (1) Unannotated
• Organized by taxonomy (sort of)
• Direct submissions (Sequin/Bankit)
• Accurate (~1 error per 10,000 bp)
• Well characterized
“Functional”
EST (377) Expressed Sequence Tag
GSS (138) Genome Survey Sequence
HTG (63) High Throughput Genomic
PAT (17) Patent
STS (9) Sequence Tagged Site
CON (1) Contigs, virtual
• Organized by sequence type
• Batch submissions (ftp/email)
• Inaccurate
• Poorly characterized
GenBank Functional (Bulk) Divisions
• EST: Expressed Sequence Tag (1st pass single read cDNA)
• GSS: Genome Survey Sequence (1st pass single read gDNA)
• HTG: High Throughput Genomic (incomplete sequences of genomic clones)
• STS: Sequence Tagged Site (PCR-based mapping reagents)
• Whole Genome Shotgun
EST Division: Expressed Sequence Tags
• nucleus: 30,000 genes
• RNA gene products
• make cDNA library: 80-100,000 unique cDNA clones in library
• isolate unique clones
• sequence once from each end (5’ and 3’)
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTATTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTCTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAATTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCAAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
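The two reads above are single-pass EST sequences in FASTA format, and the N characters mark bases that could not be called. As a minimal sketch (not an NCBI tool), the Python below splits FASTA text into records and reports the fraction of N bases in each. The function names are hypothetical, and the example sequences are only the first 31 bases of each read shown above, truncated for brevity.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ambiguous_fraction(seq):
    """Return the fraction of bases reported as N (uncalled)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


# Truncated prefixes of the two IMAGE:275615 reads, for illustration only.
example = """>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGG
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCG
"""

records = list(parse_fasta(example))
for header, seq in records:
    print(header, len(seq), round(ambiguous_fraction(seq), 3))
```

A high N fraction is one quick indicator of the lower accuracy expected in the bulk divisions.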
GSS, WGS, HTG
shred
Whole BAC insert (or genome)
isolate clones, sequence
GSS division or trace archive
Draft sequence (HTG division)
assembly whole genome shotgun assemblies (traditional division)
HTG Example: Honeybee Draft Sequences
• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences (Phase 3) move to traditional GenBank division
LOCUS AC141845 147720 bp DNA linear HTG 19-MAR-2004
DEFINITION Apis mellifera clone CH224-4A2, WORKING DRAFT
SEQUENCE, 14 unordered pieces.
ACCESSION AC141845
VERSION AC141845.1 GI:29124029
KEYWORDS HTG; HTGS_PHASE1; HTGS_DRAFT.
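The header of the honeybee record carries the division (HTG) and the draft phase in fixed keyword lines. As a rough sketch, the Python below pulls those fields out of the header text exactly as it appears on the slide. The helper names are hypothetical, and the parser assumes the whitespace-collapsed layout shown here rather than the column-aligned layout of a real GenBank flat file.

```python
# Header text as shown on the slide (whitespace already collapsed).
HTG_HEADER = """LOCUS AC141845 147720 bp DNA linear HTG 19-MAR-2004
DEFINITION Apis mellifera clone CH224-4A2, WORKING DRAFT
SEQUENCE, 14 unordered pieces.
ACCESSION AC141845
VERSION AC141845.1 GI:29124029
KEYWORDS HTG; HTGS_PHASE1; HTGS_DRAFT."""

HEADER_KEYS = ("LOCUS", "DEFINITION", "ACCESSION", "VERSION", "KEYWORDS")


def parse_header(text):
    """Map each header keyword to its value, joining continuation lines."""
    fields = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, rest = line.partition(" ")
        if key in HEADER_KEYS:
            current = key
            fields[current] = rest.strip()
        elif current is not None:
            # Anything else continues the previous field (e.g. DEFINITION).
            fields[current] += " " + line
    return fields


def parse_locus(value):
    """Split the LOCUS value into its seven whitespace-separated parts."""
    name, length, unit, molecule, topology, division, date = value.split()
    return {"name": name, "length": int(length), "unit": unit,
            "molecule": molecule, "topology": topology,
            "division": division, "date": date}


header = parse_header(HTG_HEADER)
locus = parse_locus(header["LOCUS"])
keywords = [k.strip() for k in header["KEYWORDS"].rstrip(".").split(";")]
print(locus["division"], locus["length"], keywords)
```

Checking for HTGS_PHASE1 among the keywords is one way to tell an unfinished draft from a finished (Phase 3) sequence.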
Whole Genome Shotgun Projects
351 projects
• Bacteria (251)
• Environmental sequences (6)
• Archaea (6)
• Eukaryotes (88), including:
  - Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
  - Pufferfish (2)
  - Honeybee, Anopheles, Fruit Flies (3), Silkworm
  - Nematode (2)
  - Yeasts (8), Aspergillus (2)
  - Rice (2)
Whole Genome Shotgun (WGS) Projects
wgs master[properties]
Derivative Databases
GenBank (updated ONLY by submitters)
• Sources: Labs, Sequencing Centers
• Divisions: EST, STS, HTG, GSS, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL
Derivative databases (updated by NCBI)
• UniGene
• UniSTS
• RefSeq: Entrez Gene and annotation pipelines
Why Make Reference Sequences?
Entrez Nucleotide query:
human[organism] AND lipase[title]
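The query above combines fielded terms with AND. As a hedged sketch, the Python below assembles the same query string and builds a search URL for NCBI's E-utilities esearch endpoint. Using E-utilities at all is an assumption (the slides use the Entrez web pages), as are the helper names and the "nucleotide" database name and "retmax" value passed here. The sketch only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# NCBI E-utilities search endpoint (programmatic Entrez access).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(*clauses):
    """Join (term, field) pairs into an Entrez query such as a[x] AND b[y]."""
    return " AND ".join(f"{term}[{field}]" for term, field in clauses)


def esearch_url(db, term, retmax=20):
    """Return an esearch URL for the given database and query term."""
    return EUTILS_ESEARCH + "?" + urlencode(
        {"db": db, "term": term, "retmax": retmax})


broad = build_query(("human", "organism"), ("lipase", "title"))
narrow = build_query(("human", "organism"), ("lipase", "title"),
                     ("endothelial", "title"))

print(broad)
print(esearch_url("nucleotide", narrow))
```

Adding a third clause narrows the result set, which is the same refinement the slides perform by hand.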
human[organism] AND lipase[title] AND endothelial[title]
3927 bp
4150 bp
3927 bp
2323 bp
261 bp
RefSeq Benefits
genomes, transcripts, proteins
• non-redundant; best representative
• updates to reflect current sequence data and biology
• distinct, stable accession series
Reference Sequence: RefSeq
Accession          Sequence Type
NM_123456789       mRNA
NP_123456789       protein, from NM_
NR_123456          non-coding RNA
XM_123456          predicted mRNA
XP_123456          predicted protein
XR_123456          predicted non-coding RNA
ZP_12345678        predicted from NZ_
NC_123456          genomic, e.g., chromosomes
NG_123455          genomic, incomplete region
NT_123456          genomic, BAC assembly
NW_123456          genomic, WGS assembly
NZ_ABCD12345678    genomic, WGS collection
blue=curated
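Because every RefSeq accession starts with a two-letter prefix and an underscore, the sequence type can be read directly from the accession. The Python below is a small sketch of that lookup, using only the prefixes and descriptions listed in the table above. The function name is hypothetical, and the accession numbers are the placeholder values from the slide, not real records.

```python
# Prefix-to-type mapping taken from the accession table on this slide.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def refseq_type(accession):
    """Return the RefSeq sequence type, or None for non-RefSeq accessions."""
    prefix, underscore, _ = accession.strip().partition("_")
    if not underscore:
        # GenBank accessions such as AC141845 have no underscore.
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_123456789", "XP_123456", "NZ_ABCD12345678", "AC141845"):
    print(acc, "->", refseq_type(acc))
```

The underscore is what makes the RefSeq accession series distinct from GenBank accessions, so the check for it doubles as a primary-versus-derivative test.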
Annotation Process
• Genomic DNA (NC, NT, NW)
• Model mRNA (XM), (XR)
• Curated mRNA (NM), (NR)
• Model protein (XP)
• Curated Protein (NP)
Scanning....
GenBank Sequences
RefSeq
Creating NM_ Records
NM’s must have cDNA support
Genome annotation
Longest mRNA
transcript variant 1
transcript variant 2
transcript variant 3
Where is RefSeq?
The Entrez System
• Nucleotide
• PubMed
• Protein
• Taxonomy
• Structure
• Domains
• 3D Domains
• Journals
• PMC
• OMIM
• Books
• PopSet
• SNP
• UniGene
• UniSTS
• Genome
• Gene
• GEO
• MeSH
• Cancer Chromosomes
• Homologene
• PubChem
• GENSAT
A Few Entrez Databases
• UniGene: Clusters of ESTs, mRNAs
• dbSNP: Single Nucleotide Polymorphisms
• GEO: Gene Expression Omnibus; microarray and other expression data
• CDD: Conserved Domain Database
  - protein families (COGs and KOGs)
  - single domains (PFAM, SMART, CD)
UniGene (unique gene)
Gene-oriented clusters of expressed sequences
• Automatic clustering using MegaBlast
• Each cluster represents a unique gene
• Informed by genome hits
• Information on tissue types and map locations
• Useful for gene discovery and selection of mapping reagents
A Cluster of ESTs
query
5’ EST hits
3’ EST hits
UniGene Collections
Example UniGene Cluster
Histogram of cluster sizes for UniGene Hs Build 177
(Now at Build #186)
UniGene Cluster Hs.95351
SELECTED PROTEIN SIMILARITIES
UniGene Cluster Hs.95351
GENE EXPRESSION
UniGene Cluster Hs.95351: expression
UniGene Cluster Hs.95351: seqs
Download sequences
web page
ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/
Entrez GEO
NCBI’s SNP Database
• Primary and derivative (RefSNP)
• Single nucleotide polymorphisms
• Repeat polymorphisms
• Insertion-deletion polymorphisms
• Over 19 million refSNPs (rsXXXXXXX) (August, 2005)
Searching dbSNP
RefSNP
RefSNP
RefSNP
RefSNP
Search Mouse SNP between strains
RefSNP
MapView GeneView SeqView OMIM No 3D
RefSNP
Entrez GEO
• GPL: Platform descriptions (submitted by manufacturer*)
• GSM: Raw/processed spot intensities from a single slide/chip (submitted by experimentalists)
• GSE: Grouping of slide/chip data, “a single experiment” (submitted by experimentalists)
• GDS: Grouping of experiments (curated by NCBI)
Entrez GEO
Entrez GEO Datasets
GEO SaMple: experimental conditions
GEO SEries: set of related samples
What’s a DataSet?
Supplied by submitter
• Platform (GPL): array definition
• Sample (GSM): hyb. measurements
• Series (GSE): related Samples
Assembled by GEO staff
• DataSet (GDS)
  - A collection of experimentally-related samples processed using the same platform.
  - Samples within DataSets are organized into subgroups based on experimental variables.
  - Form the basis of GEO’s query, analysis and data display tools.
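The four GEO record types can be told apart by accession prefix alone. As an illustrative sketch, the Python below maps a GEO accession to its record type and to who produces it, following the Platform/Sample/Series/DataSet breakdown above. The function name and the accession numbers in the example are hypothetical, chosen only to show the prefixes.

```python
import re

# Record types and their origin, as described on the DataSet slide.
GEO_TYPES = {
    "GPL": ("Platform", "supplied by submitter"),
    "GSM": ("Sample", "supplied by submitter"),
    "GSE": ("Series", "supplied by submitter"),
    "GDS": ("DataSet", "assembled by GEO staff"),
}

# A GEO accession is one of the four prefixes followed by digits.
_GEO_ACCESSION = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def classify_geo(accession):
    """Return a dict describing a GEO accession, or None if unrecognized."""
    match = _GEO_ACCESSION.match(accession.strip().upper())
    if match is None:
        return None
    kind, source = GEO_TYPES[match.group(1)]
    return {"prefix": match.group(1), "number": int(match.group(2)),
            "kind": kind, "source": source}


# Hypothetical accession numbers, for illustration only.
for acc in ("GPL96", "GSM1234", "GSE567", "GDS89", "NM_123456"):
    print(acc, "->", classify_geo(acc))
```

Only GDS records are curated groupings, so the prefix is enough to decide whether a record can feed the DataSet-based query and display tools.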
Gene Expression Omnibus (GEO)
Dataset browser
GEO Dataset Browser
GEO Dataset Report
GEO Profiles
… of 12625
Entrez CDD
Conserved Domain Database
• Multiple sequence alignments
• Position-specific scoring matrices (PSSM)
• Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)
CDD
>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPSSTNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEILKKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNSCVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDP
LLTSPEILRE
CDD
CD
Pfam
COG
Click on a colored bar to align your sequence to the CD
Conserved Domain Database: cd00371.1, HMA
CDD
CDART: Conserved Domain Architecture Retrieval Tool
cdd
Linking from Entrez Protein
Genome Resources
Gene database
Trace Archive
Map Viewer
Homologene
Genomic Biology
Genomic Biology
Gen Biol: Gen Resources
Gen Biol: Gen Resources
Gen Biol: Gen Resources
Genome Projects: microb
Gen Biol: Gen Resources
Gen Biol: Gen Resources
Gen Biol: Gen Resources
Gen Biol: Gen Resources
Gen Biol: Gen Resources
Genome Resources
Gene database
Trace Archive
Map Viewer
Homologene
Genomic Biology
Entrez Gene
A single query interface to …
• Sequences
  - RefSeqs
  - GenBank
  - Homologene
• Maps – MapViewer
• Entrez links
• Linkouts
More organisms, ~ 3000
Entrez integration
Global Entrez: NADH2
Entrez Gene: NADH2
Gene Record for Pongo NADH2
Homo sapiens
Not found with “nadh2”
A Record With More Data: Human HFE
Human HFE: Transcripts
Transcripts with experimental evidence
Gene Table
Introns/Exons: Gene Table
links to sequence
Human HFE: Links
Genotype
Genotype
Human HFE: Links
GeneView in dbSNP
SNP in Structure
SNP in Structure
SNP in Structure
H41
S43
C260
Another Variation Source: OMIM
Variants in OMIM
Genome Resources
Gene database
Trace Archive
Map Viewer
Homologene
Genomic Biology
The New Homologene
Automated detection of homologs among the annotated genes of completely sequenced eukaryotic genomes.
• No longer UniGene based
• Protein similarities first
• Guided by taxonomic tree
• Includes orthologs and paralogs
The New Homologene
Homologene Build 43.1 (8/23/05)
Species Number of genes input grouped groups
RAG1 → Homologene
RAG1 → Homologene
RAG1
RAG1
RING-finger
RAG1 → Homologene
RAG1
RAG1
Sugar_tr
Homologene: alignment scores
BLASTP bl2seq
Genome Resources
LocusLink
Gene database
UniGene
Trace Archive
Map Viewer
Homologene
List View
Human MapViewer
adar
MapViewer: Human ADAR
MV Hs ADAR
3’ UTR
5’ UTR
Maps & Options
--Sequence maps--
Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST, Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST, Variation
--Cytogenetic maps--
Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
--Genetic Maps--
deCODE, Genethon, Marshfield
--RH maps--
GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC
Variation = SNP
MapViewer
UniGene
Component
Repeats
Gene
Gene Phenotype Variation
Maps & Options
Genome Resources
LocusLink
Gene database
UniGene
Trace Archive
Map Viewer
Homologene
Trace Archive Page
Macaca mulatta Traces
Trace Archive BLAST Page
Access to sequences NOT in GenBank
Literature Links
BOOKS Database
BOOKS Database: hyperlinked
BOOKS Database
BOOKS Database
BOOKS Database
Genes & Dis
Genes & Dis
For More Information…
Intermission