what is genbank?– direct submissions individual records (bankit, sequin) – batch submissions via...


Post on 14-Jul-2020


TRANSCRIPT

Page 1: What is GenBank?– Direct submissions individual records (BankIt, Sequin) – Batch submissions via email (EST, GSS, STS) – ftp accounts sequencing centers • Data shared nightly
Page 2: What is GenBank? NCBI's Primary Sequence Database

• Nucleotide sequence database
• Archival in nature
• GenBank Data
  – Direct submissions: individual records (BankIt, Sequin)
  – Batch submissions via email (EST, GSS, STS)
  – ftp accounts: sequencing centers
• Data shared nightly among three collaborating databases
  – GenBank
  – DNA Database of Japan (DDBJ), Mishima, Japan
  – European Molecular Biology Laboratory Database (EMBL) at EBI, Hinxton, UK

Page 3: The International Nucleotide Sequence Database Collaboration (DDBJ/EMBL/GenBank)

[Diagram: three databases exchange data with one another, and each accepts its own submissions and updates.]
• GenBank: NCBI (NIH); accessed through Entrez
• EMBL Data Library: EBI (EMBL); accessed through SRS
• DDBJ: CIB (NIG); accessed through getentry

Page 4: NCBI Homepage

Page 5: NCBI Databases

Page 6: NCBI Databases and Services

• GenBank: largest sequence database
• Free public access to biomedical literature
  – PubMed: free Medline
  – PubMed Central: full text online access
• Entrez: integrated molecular and literature databases
• BLAST: highest volume sequence search service
• VAST: structure similarity searches
• Software and Databases

Page 7: A Traditional GenBank Record: The Flatfile Format

Header

LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.

Feature Table

FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWKNDFLDQSLISKYDG
                     DEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLFEKEIKEALDSIAAI
                     ESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLEDFLHKNEDLLYNIS
                     LIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIKGMIDNAWKKVNGKC
                     FTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHILSLLFQPLVN"

Sequence

ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//
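The three parts named on this slide (header, feature table, sequence) can be told apart by the keyword that starts each part. The following is a minimal sketch, assuming a record that uses FEATURES and ORIGIN (or BASE COUNT) as section markers and `//` as the terminator; `split_flatfile` and `EXAMPLE` are hypothetical names for illustration, and the record is abbreviated from the slide.

```python
def split_flatfile(record: str) -> dict:
    """Split one flatfile record into header, feature table and sequence text."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in record.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = "sequence"
        sections[current].append(line)
    return {name: "\n".join(lines) for name, lines in sections.items()}


# Abbreviated version of the record shown on the slide (illustration only).
EXAMPLE = """\
LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
FEATURES             Location/Qualifiers
     gene            1..1931
                     /gene="AFS1"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
//
"""

parts = split_flatfile(EXAMPLE)
for name, text in parts.items():
    print(name, "->", len(text.splitlines()), "line(s)")
```

A real parser would also need to handle multi-record files and continuation lines; this sketch only shows where the section boundaries fall.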

Page 8: Traditional GenBank Record

ACCESSION U07418
VERSION U07418.1 GI:466461

• Accession: stable, reportable, universal
• Version: tracks changes in sequence
• GI number: NCBI internal use

• well annotated
• the sequence is the data
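To make the relationship between the three identifiers concrete, here is a small sketch that takes a VERSION line apart. It assumes the layout shown on the slide (`VERSION accession.version GI:number`); `parse_version_line` and `VersionLine` are hypothetical names, not part of any NCBI tool.

```python
import re
from typing import NamedTuple


class VersionLine(NamedTuple):
    accession: str  # stable identifier for the record
    version: int    # increments when the sequence changes
    gi: int         # NCBI internal identifier


def parse_version_line(line: str) -> VersionLine:
    """Parse a line such as 'VERSION U07418.1 GI:466461'."""
    match = re.fullmatch(r"VERSION\s+([A-Z_0-9]+)\.(\d+)\s+GI:(\d+)", line.strip())
    if match is None:
        raise ValueError(f"not a VERSION line in the expected layout: {line!r}")
    accession, version, gi = match.groups()
    return VersionLine(accession, int(version), int(gi))


record_id = parse_version_line("VERSION U07418.1 GI:466461")
print(record_id.accession, record_id.version, record_id.gi)
```

The accession stays the same across updates, while the version suffix changes whenever the sequence itself changes.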

Page 9: Sequence and Database Identifiers: Locus, accession, gi, version

LOCUS AF062069 3808 bp mRNA INV 02-MAR-2000
DEFINITION Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION AF062069
VERSION AF062069.2 GI:7144484

LOCUS line fields:
• Locus Name: AF062069
• Sequence length: 3808 bp
• mol-type: mRNA (= cDNA), rRNA, snRNA, DNA
• GB Division: INV
• Modification Date: 02-MAR-2000

Other lines:
• DEFINITION: DEF line (Title)
• ACCESSION: Accession Number
• VERSION: Accession.version, gi number
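The LOCUS line fields labeled on this slide can be pulled out by splitting on whitespace. This is a sketch under the assumption that the line has either the seven tokens shown here or eight tokens when a topology such as `linear` is present (as in the AY182241 record earlier); `parse_locus_line` is a hypothetical helper.

```python
def parse_locus_line(line: str) -> dict:
    """Split a LOCUS line into the fields labeled on the slide."""
    tokens = line.split()
    if len(tokens) not in (7, 8) or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    return {
        "locus_name": tokens[1],
        "length": int(tokens[2]),
        "mol_type": tokens[4],
        # Topology is optional; it is absent from the slide's AF062069 line.
        "topology": tokens[5] if len(tokens) == 8 else None,
        "division": tokens[-2],
        "modification_date": tokens[-1],
    }


locus = parse_locus_line("LOCUS AF062069 3808 bp mRNA INV 02-MAR-2000")
print(locus)
```

Real flatfiles use fixed column positions, so whitespace splitting is only a convenience for lines shaped like the two examples in this deck.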

Page 10: Sequence

BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN                          (indicates beginning of sequence data)
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
<sequence omitted>
     3721 accaatgtta taatatgaaa tgaaataaag cagtcatggt agcagtggct gtttgaaata
     3781 aagatacagt aactagggaa aaaaaaaa
//                              (end of record)
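The numbers on this slide can be cross-checked against the 3808 bp length on the LOCUS line of the same record. The sketch below strips the position numbers from the two final sequence lines, works out the position of the last base, and sums the BASE COUNT values; `read_origin_lines` is a hypothetical helper name.

```python
from collections import Counter


def read_origin_lines(lines):
    """Join sequence lines, dropping the leading position number and spaces."""
    bases = []
    for line in lines:
        tokens = line.split()
        bases.extend(tokens[1:])  # tokens[0] is the position of the first base
    return "".join(bases)


# The last two sequence lines shown on the slide.
tail = [
    "3721 accaatgtta taatatgaaa tgaaataaag cagtcatggt agcagtggct gtttgaaata",
    "3781 aagatacagt aactagggaa aaaaaaaa",
]
fragment = read_origin_lines(tail)
last_position = 3721 + len(fragment) - 1

# BASE COUNT values copied from the slide.
base_count = {"a": 1201, "c": 689, "g": 782, "t": 1136}
total = sum(base_count.values())

print(last_position, total)       # both should match the LOCUS length
print(Counter(fragment))          # composition of the 88-base fragment only
```

Both the last position and the BASE COUNT total come to 3808, which agrees with the sequence length given on the LOCUS line.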

Page 11: Using Entrez

An integrated database search and retrieval system

Page 12: WWW Access

Entrez & BLAST

Page 13: Entrez: Neighboring and Hard Links

[Diagram: linked Entrez databases: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure (MMDB), Genomes, Taxonomy. Neighbor relationships labeled on the diagram: Word weight, BLAST, VAST, Phylogeny.]

Page 14: WWW Entrez

• All of MEDLINE plus others; abstracts; links to online journals
• GenBank, EMBL, DDBJ, RefSeq, PDB
• GenBank, DDBJ, EMBL translations; PDB, PIR, SWISS-PROT, PRF, RefSeq
• NCBI's MMDB, derived from PDB
• Graphical views; assembled sequence and mapping data
• NCBI's Taxonomy: hierarchical tree structure, access to sequences
• MIM: now in Entrez
• Population and phylogenetic studies

Page 15: Entrez Nucleotides

Mouse

Page 16: Document Summaries: Mouse[All Fields]

Chicken not mouse !?

3 million records

Page 17: Entrez Nucleotides: Limits: Preview/Index

Mouse

Page 18: Entrez Nucleotides: Limits

Mouse

• Field Restriction: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
• Only From: RefSeq, GenBank, EMBL, DDBJ
• Exclude unwanted categories of sequences
• Molecule: Genomic DNA/RNA, mRNA, rRNA
• Gene Location: Genomic DNA/RNA, Mitochondrion, Chloroplast

Page 19: Entrez Nucleotides: Limits: Organism

Mouse

Page 20: Document Summaries: Mouse[Organism]

2,976,070 [All Fields] - 2,921,009 [Organism] = 55,061

Page 21: Exclude Bulk Sequences, mRNA

Page 22: Adding Terms: Preview/Index

glyceraldehyde 3 phosphate dehydrogenase

• Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
• Search History

Page 23: Mouse GAPD Records

(("Mus musculus"[Organism] AND glyceraldehyde 3 phosphate dehydrogenase[Title Word]) AND ((((((1900[MDAT] : 3000[MDAT]) NOT gbdiv_est[PROP]) NOT gbdiv_sts[PROP]) NOT gbdiv_gss[PROP]) NOT gbdiv_htg[PROP]) NOT gbdiv_pat[PROP])) AND biomol_mrna[PROP]

Properties Field Terms
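The nested query on this slide follows a regular pattern: an organism and title-word clause, a date range wrapped in one NOT clause per excluded division, and a molecule property at the end. The sketch below rebuilds the same string from its parts; `build_entrez_query` and its parameters are hypothetical names, and the function only assembles text (it does not contact Entrez).

```python
def build_entrez_query(organism, title_words, excluded_divisions, molecule):
    """Assemble a query string with the same nesting as the slide's example."""
    subject = f'("{organism}"[Organism] AND {title_words}[Title Word])'
    # Start from the modification-date range, then wrap it once per exclusion.
    limits = "(1900[MDAT] : 3000[MDAT])"
    for division in excluded_divisions:
        limits = f"({limits} NOT gbdiv_{division}[PROP])"
    return f"({subject} AND {limits}) AND biomol_{molecule}[PROP]"


query = build_entrez_query(
    organism="Mus musculus",
    title_words="glyceraldehyde 3 phosphate dehydrogenase",
    excluded_divisions=["est", "sts", "gss", "htg", "pat"],
    molecule="mrna",
)
print(query)
```

Each `gbdiv_...[PROP]` term removes one bulk division (EST, STS, GSS, HTG, patents), and `biomol_mrna[PROP]` keeps only mRNA records, matching the "Exclude Bulk Sequences, mRNA" limits chosen earlier in the deck.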

Page 24: Displaying Mouse GAPD Records

Display options: Summary, Brief, GenBank, ASN.1, FASTA, GI list, LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links

• Formats
• Links and neighbors (related records)

Page 25: FASTA Format

>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glyceraldGGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCCAGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACCACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCTCCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCCCTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCTGACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATTAGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCACACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAACACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGTACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCACTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTATGACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAACTTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGCCATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGCCAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACCCCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGGCTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCACGGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTCGTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACATGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCGGCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC

FASTA Definition Line
>gi|193425|gb|M60978.1|MUSGAPDS

• gi number: 193425
• Database identifier: gb
• Accession number: M60978.1
• Locus Name: MUSGAPDS

Database Identifiers
• gb: GenBank
• emb: EMBL
• dbj: DDBJ
• sp: SWISS-PROT
• pdb: Protein Databank
• pir: PIR
• prf: PRF
• ref: RefSeq
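The pipe-separated definition line above can be split into the fields the slide labels. This is a sketch assuming the five-field `gi|number|db|accession.version|locus` layout of the slide's example; other database tags may arrange their fields differently, and `parse_defline` is a hypothetical helper.

```python
# Database identifier tags as listed on the slide.
DATABASES = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "prf": "PRF", "ref": "RefSeq",
}


def parse_defline(defline: str) -> dict:
    """Split a '>gi|...|db|acc.ver|locus title' line into labeled fields."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    identifier, _, title = defline[1:].partition(" ")
    fields = identifier.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError(f"unexpected identifier layout: {identifier!r}")
    _, gi, db, accession_version, locus = fields
    return {
        "gi": int(gi),
        "database": DATABASES.get(db, db),  # fall back to the raw tag
        "accession_version": accession_version,
        "locus": locus,
        "title": title,
    }


# Definition line from the slide; its title text is cut off in the source.
info = parse_defline(
    ">gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald"
)
print(info)
```

Everything after the first space is treated as the free-text title, so only the identifier portion is split on `|`.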

Page 26: Break!

5 minutes