ncbi fieldguide national center for biotechnology information a field guide to genbank and ncbi’s...

Post on 16-Dec-2015


NCBI FieldGuide

National Center for Biotechnology Information

A Field Guide to GenBank and NCBI’s Molecular Biology Resources

August 30, 2005 University of Colorado Health Sciences Center

Topics

• About NCBI
• GenBank overview
• Primary vs derivative databases
• The Reference Sequence (RefSeq) project
• Entrez databases
• Genome resources
• Bookshelf

-break-

• Entrez text searching
• BLAST sequence searching
• VAST structure searching
• An integrated example

The National Institutes of Health

Bethesda, MD

The National Center for Biotechnology Information

• Accepts submissions of primary data
• Develops tools to analyze these data
• Creates derivative databases based on the primary data
• Provides free search, link, and retrieval of these data, primarily through the Entrez system

NCBI WWW Users per Day

[Figure: Number of Users Per Day, 1997 to 2003. Vertical axis: number of users, 0 to 450,000. Annotation on the plot: Christmas & New Year.]

Homepage - accessing the data

all[filter]

all[filter]: 1/11/2005, 3/15/2005, 8/15/2005

Entrez Nucleotide

Source                        # records
Primary Data
  GenBank / DDBJ / EMBL       57.3 million (97.4 %)
Derivative Data
  RefSeq                      1.47 million (2.5 %)
  RefSeq reviewed             60,000
  PDB (structures)            5,973
“Total”                       59 million

GenBank: NCBI’s Primary Sequence Database

ftp://ftp.ncbi.nih.gov/genbank/
ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank

Release 149, August 2005
47 x 10^6 Records
52 x 10^9 Nucleotides
195 Gigabytes
816 files

• full release every two months
• incremental and cumulative updates daily
• available only through internet
• release notes: gbrel.txt

Over 100 billion bases!
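The Release 149 figures can be turned into a few rough per-record and per-file averages. The sketch below is only arithmetic on the numbers quoted on the slide; it assumes "Gigabytes" means decimal gigabytes (10^9 bytes), which the slide does not state, and the function name is illustrative.

```python
# Release 149 figures as quoted on the slide.
RECORDS = 47 * 10**6          # 47 x 10^6 records
NUCLEOTIDES = 52 * 10**9      # 52 x 10^9 nucleotides
SIZE_BYTES = 195 * 10**9      # 195 gigabytes (assumed decimal: 10^9 bytes each)
FILES = 816


def release_averages(records, nucleotides, size_bytes, files):
    """Return rough averages derived from the release totals."""
    return {
        "nt_per_record": nucleotides / records,       # mean sequence length
        "mb_per_file": size_bytes / files / 10**6,    # mean flat-file size in MB
        "bytes_per_nt": size_bytes / nucleotides,     # annotation overhead per base
    }


averages = release_averages(RECORDS, NUCLEOTIDES, SIZE_BYTES, FILES)
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the mean record is about 1,100 nucleotides long, which is consistent with a database dominated by short single-read entries such as ESTs and GSSs.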

What is GenBank?

• Nucleotide only sequence database
• Archival in nature
• GenBank Data
  - Direct submissions (traditional records)
  - Batch submissions (EST, GSS, STS)
  - ftp accounts (genome data)
• Three collaborating databases
  - GenBank
  - DNA Database of Japan (DDBJ)
  - European Molecular Biology Laboratory (EMBL) Database

GenBank Divisions

“Organismal”
PRI (28) Primate
ROD (15) Rodent
PLN (13) Plant and Fungal
BCT (11) Bacterial/Archaeal
INV (7) Invertebrate
VRT (7) Other Vertebrate
VRL (4) Viral
MAM (2) Mammalian
PHG (1) Phage
SYN (1) Synthetic
UNA (1) Unannotated

• Organized by taxonomy (sort of)
• Direct submissions (Sequin/Bankit)
• Accurate (~1 error per 10,000 bp)
• Well characterized

“Functional”
EST (377) Expressed Sequence Tag
GSS (138) Genome Survey Sequence
HTG (63) High Throughput Genomic
PAT (17) Patent
STS (9) Sequence Tagged Site
CON (1) Contigs, virtual

• Organized by sequence type
• Batch submissions (ftp/email)
• Inaccurate
• Poorly characterized

GenBank Functional (Bulk) Divisions

EST  Expressed Sequence Tag: 1st pass single read cDNA
GSS  Genome Survey Sequence: 1st pass single read gDNA
HTG  High Throughput Genomic: incomplete sequences of genomic clones
STS  Sequence Tagged Site: PCR-based mapping reagents

Whole Genome Shotgun

EST Division: Expressed Sequence Tags

nucleus: 30,000 genes
RNA gene products
make cDNA library: 80-100,000 unique cDNA clones in library
- isolate unique clones
- sequence once from each end (5’ and 3’)

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTATTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTCTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAATTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCAAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
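The two reads above are single-pass sequences, so they contain uncalled bases written as N. A minimal sketch of reading such FASTA records and counting the N calls follows; the function names are illustrative, and the example text uses only the first 30 bases of each read from the slide to keep the block short.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def ambiguous_fraction(sequence):
    """Fraction of positions reported as N (base not called)."""
    return sequence.count("N") / len(sequence) if sequence else 0.0


# First 30 bases of each EST read shown on the slide (truncated for brevity).
EXAMPLE = """\
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTG
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCC
"""

records = parse_fasta(EXAMPLE)
for header, sequence in records:
    print(header, len(sequence), f"{ambiguous_fraction(sequence):.3f}")
```

Both reads come from the same clone (IMAGE:275615), one from each end, which is why the headers differ only in the 5'/3' label.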

GSS, WGS, HTG

Whole BAC insert (or genome)
- shred
- isolate clones
- sequence: GSS division or trace archive
- assembly: Draft sequence (HTG division); whole genome shotgun assemblies (traditional division)

HTG Example: Honeybee Draft Sequences

• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences (Phase 3) move to traditional GenBank division

LOCUS       AC141845 147720 bp DNA linear HTG 19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT
            SEQUENCE, 14 unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1 GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.
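The header of the honeybee record can be read mechanically: each top-level keyword starts in the first column, and indented lines continue the previous field. The sketch below handles only these header lines, not a full GenBank flat file; the function names are illustrative, and the phase is taken from the HTGS_PHASE keyword shown on the slide.

```python
def parse_genbank_header(text):
    """Collect top-level keyword fields into a dict, joining continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            # Indented line: continuation of the previous keyword's value.
            if current is not None:
                fields[current] += " " + line.strip()
        else:
            key, _, value = line.partition(" ")
            current = key
            fields[current] = value.strip()
    return fields


def htgs_phase(fields):
    """Return the HTGS phase number found in KEYWORDS, or None if absent."""
    keywords = fields.get("KEYWORDS", "").rstrip(".").split(";")
    for keyword in (k.strip() for k in keywords):
        if keyword.startswith("HTGS_PHASE"):
            return int(keyword[len("HTGS_PHASE"):])
    return None


RECORD = """\
LOCUS       AC141845 147720 bp DNA linear HTG 19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT
            SEQUENCE, 14 unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1 GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.
"""

fields = parse_genbank_header(RECORD)
print(fields["ACCESSION"], "phase", htgs_phase(fields))
```

A record whose keywords reach phase 3 would, per the slide, be a finished sequence that moves to a traditional GenBank division.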

Whole Genome Shotgun Projects

351 projects
• Bacteria (251)
• Environmental sequences (6)
• Archaea (6)
• Eukaryotes (88), including:
  - Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
  - Pufferfish (2)
  - Honeybee, Anopheles, Fruit Flies (3), Silkworm
  - Nematode (2)
  - Yeasts (8), Aspergillus (2)
  - Rice (2)

Whole Genome Shotgun (WGS) Projects

wgs master[properties]

Derivative Databases

[Diagram: Labs and Sequencing Centers submit to GenBank (divisions EST, STS, HTG, GSS, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL), which is updated ONLY by submitters. Derivative databases updated by NCBI: UniGene, UniSTS, and RefSeq (RefSeq: Entrez Gene and annotation pipelines).]

Why Make Reference Sequences?

Entrez Nucleotide query:

human[organism] AND lipase[title]

human[organism] AND lipase[title] AND endothelial[title]

Record lengths shown on the slide: 3927 bp, 4150 bp, 3927 bp, 2323 bp, 261 bp
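Fielded Entrez queries like the ones above are plain strings of term[field] clauses joined by AND, so they are easy to assemble programmatically. The sketch below builds the query and a search URL for NCBI's E-utilities ESearch service; E-utilities is not mentioned on the slides and is used here as an assumption about how one might run the query outside the web page. The helper names are illustrative, and nothing is fetched.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public E-utilities ESearch endpoint (an assumption; the slides use the web form).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(*clauses):
    """Join (text, field) pairs into an Entrez query string with AND."""
    return " AND ".join(f"{text}[{field}]" for text, field in clauses)


def esearch_url(db, term, retmax=20):
    """Build (but do not fetch) an ESearch URL for the given database and query."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = entrez_term(("human", "organism"), ("lipase", "title"), ("endothelial", "title"))
url = esearch_url("nucleotide", term)
print(term)
print(url)
```

Adding the third clause narrows the search in the same way as on the slide; the results still mix records of different lengths for the same gene, which is the redundancy RefSeq is meant to address.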

RefSeq Benefits

genomes, transcripts, proteins

• non-redundant; best representative
• updates to reflect current sequence data and biology
• distinct, stable accession series

Reference Sequence: RefSeq

Accession          Sequence Type
NM_123456789       mRNA
NP_123456789       protein, from NM_
NR_123456          non-coding RNA
XM_123456          predicted mRNA
XP_123456          predicted protein
XR_123456          predicted non-coding RNA
ZP_12345678        predicted from NZ_
NC_123456          genomic, e.g., chromosomes
NG_123455          genomic, incomplete region
NT_123456          genomic, BAC assembly
NW_123456          genomic, WGS assembly
NZ_ABCD12345678    genomic, WGS collection

blue=curated
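Because the RefSeq accession series is distinct from GenBank's, the two-letter prefix before the underscore is enough to tell what kind of record an accession names. A minimal lookup based only on the table above might look like this; the dictionary and function names are illustrative.

```python
# Prefix-to-type mapping copied from the slide's table.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def refseq_type(accession):
    """Return the sequence type for a RefSeq accession, or None if not RefSeq-like."""
    prefix, underscore, rest = accession.partition("_")
    if not underscore or not rest:
        return None  # no underscore: e.g. a GenBank accession such as AC141845
    return REFSEQ_PREFIXES.get(prefix.upper())


for accession in ("NM_123456789", "XP_123456", "NZ_ABCD12345678", "AC141845"):
    print(accession, "->", refseq_type(accession))
```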

Annotation Process

GenBank Sequences, RefSeq

• Genomic DNA (NC, NT, NW)
• Curated mRNA (NM), (NR)
• Curated Protein (NP)
• Model mRNA (XM), (XR)
• Model protein (XP)

Creating NM_ Records

• NM’s must have cDNA support
• Genome annotation
• Longest mRNA
• transcript variant 1, transcript variant 2, transcript variant 3

Where is RefSeq?

The Entrez System

Entrez databases: Nucleotide, PubMed, Protein, Taxonomy, Structure, Domains, 3D Domains, Journals, PMC, OMIM, Books, PopSet, SNP, UniGene, UniSTS, Genome, Gene, GEO, GENSAT, MeSH, Cancer Chromosomes, Homologene, PubChem

A Few Entrez Databases

UniGene  Clusters of ESTs, mRNAs
dbSNP    Single Nucleotide Polymorphisms
GEO      Gene Expression Omnibus: microarray and other expression data
CDD      Conserved Domain Database: protein families (COGs and KOGs); single domains (PFAM, SMART, CD)

UniGene

Gene-oriented clusters of expressed sequences

• Automatic clustering using MegaBlast
• Each cluster represents a unique gene
• Informed by genome hits
• Information on tissue types and map locations
• Useful for gene discovery and selection of mapping reagents

A Cluster of ESTs

query

5’ EST hits

3’ EST hits

UniGene Collections

Example UniGene Cluster

Histogram of cluster sizes for UniGene Hs Build 177 (Now at Build #186)

UniGene Cluster Hs.95351: SELECTED PROTEIN SIMILARITIES

UniGene Cluster Hs.95351: GENE EXPRESSION

UniGene Cluster Hs.95351: expression

UniGene Cluster Hs.95351: seqs

Download sequences

web page

ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/

Entrez GEO

NCBI’s SNP Database

• Primary and derivative (RefSNP)
• Single nucleotide polymorphisms
• Repeat polymorphisms
• Insertion-deletion polymorphisms
• Over 19 million refSNPs (rsXXXXXXX) (August, 2005)

Searching dbSNP

RefSNP

RefSNP: Search Mouse SNP between strains

RefSNP: MapView, GeneView, SeqView, OMIM, No 3D

Entrez GEO

GPL  Platform descriptions. Submitted by Manufacturer*
GSM  Raw/processed spot intensities from a single slide/chip. Submitted by Experimentalists
GSE  Grouping of slide/chip data, “a single experiment”. Submitted by Experimentalists
GDS  Grouping of experiments. Curated by NCBI

Entrez GEO
• GEO SaMple: experimental conditions
• GEO SEries: set of related samples

Entrez GEO Datasets

What’s a DataSet?

Supplied by submitter
• Platform (GPL): array definition
• Sample (GSM): hyb. measurements
• Series (GSE): related Samples

Assembled by GEO staff
DataSet (GDS)
• A collection of experimentally-related samples processed using the same platform.
• Samples within DataSets are organized into subgroups based on experimental variables.
• Form the basis of GEO’s query, analysis and data display tools.
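The four GEO record types can be told apart by accession prefix, which also indicates who assembled the record. The sketch below encodes only what the slide states; the accession numbers in the example are made up for illustration, and the function name is hypothetical.

```python
import re

# Record kind and provenance per prefix, as described on the slide.
GEO_KINDS = {
    "GPL": ("Platform", "supplied by submitter"),
    "GSM": ("Sample", "supplied by submitter"),
    "GSE": ("Series", "supplied by submitter"),
    "GDS": ("DataSet", "assembled by GEO staff"),
}

_GEO_ACCESSION = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def geo_kind(accession):
    """Return (kind, provenance) for a GEO accession, or None if it does not match."""
    match = _GEO_ACCESSION.match(accession.strip().upper())
    if match is None:
        return None
    return GEO_KINDS[match.group(1)]


# Hypothetical accession numbers, used only to exercise the lookup.
for accession in ("GPL96", "GSM1234", "GSE5678", "GDS910", "rs12345"):
    print(accession, "->", geo_kind(accession))
```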

Gene Expression Omnibus (GEO): Dataset browser

GEO Dataset Browser

GEO Dataset Report

GEO Profiles

… of 12625

Entrez CDD

Conserved Domain Database

• Multiple sequence alignments
• Position-specific scoring matrices (PSSM)
• Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)

CDD

>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPSSTNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEILKKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNSCVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDPLLTSPEILRE

CDD

CD

Pfam

COG

Click on a colored bar to align your sequence to the CD

Conserved Domain Database: cd00371.1, HMA

CDD

CDART: Conserved Domain Architecture Retrieval Tool

Linking from Entrez Protein

cdd

Genome Resources

Gene database

Trace Archive

Map Viewer

Homologene

Genomic Biology

Genomic Biology

Gen Biol: Gen Resources

Genome Projects: microb

Gen Biol: Gen Resources

Genome Resources

Gene database

Trace Archive

Map Viewer

Homologene

Genomic Biology

Entrez Gene

A single query interface to …

• Sequences
  - RefSeqs
  - GenBank
  - Homologene
• Maps – MapViewer
• Entrez links
• Linkouts

More organisms, ~ 3000

Entrez integration

Global Entrez: NADH2

Entrez Gene: NADH2

Gene Record for Pongo NADH2

Homo sapiens

Not found with “nadh2”

A Record With More Data: Human HFE

Human HFE: Transcripts

Transcripts with experimental evidence

Gene Table

Introns/Exons: Gene Table

links to sequence

Human HFE: Links

Genotype

Human HFE: Links

GeneView in dbSNP

SNP in Structure

H41, S43, C260

Another Variation Source: OMIM

Variants in OMIM

Genome Resources

Gene database

Trace Archive

Map Viewer

Homologene

Genomic Biology

The New Homologene

Automated detection of homologs among the annotated genes of completely sequenced eukaryotic genomes.

• No longer UniGene based
• Protein similarities first
• Guided by taxonomic tree
• Includes orthologs and paralogs

The New Homologene

Homologene Build 43.1 (8/23/05)

[Table: Species; Number of genes (input, grouped); groups]

RAG1 → Homologene

RAG1: RING-finger

RAG1: Sugar_tr

Homologene: alignment scores

BLASTP, bl2seq

Genome Resources

LocusLink

Gene database

UniGene

Trace Archive

Map Viewer

Homologene

List View

Human MapViewer: adar

MapViewer: Human ADAR

MV Hs ADAR: 3’ UTR, 5’ UTR

Maps & Options

--Sequence maps--
Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST

--Cytogenetic maps--
Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease

--Genetic Maps--
deCODE, Genethon, Marshfield

--RH maps--
GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC

Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST, Variation

Variation = SNP

MapViewer

UniGene, Component, Repeats, Gene

Gene, Phenotype, Variation

Maps & Options

Genome Resources

LocusLink

Gene database

UniGene

Trace Archive

Map Viewer

Homologene

Trace Archive Page

Macaca mulatta Traces

Trace Archive BLAST Page

Access to sequences NOT in GenBank

Literature Links

BOOKS Database

BOOKS Database: hyperlinked

Genes & Dis

For More Information…

Intermission