BioinfRes SoSe 18

Bioinformatics Resources - GenBank -

Lecture & Exercises: Prof. B. Rost, Dr. L. Richter, J. Reeb

Institut für Informatik I12

Preliminary Schedule

* These exercises can earn you a bonus

April 13th   Intro, General Overview (1. sh.)
April 20th   Sequence Databases (2. sh.)
April 27th   Sequence Databases (3. sh.)
May 4th      Structure Databases (4. sh.)
May 11th     Lecture cancelled
May 18th     SQL (5. sh.)
May 25th     SQL, NoSql (6. sh.)
June 1st     Lecture cancelled
June 8th     NoSql 2 (7. sh.)
June 15th    MongoDB, JavaScript (8. sh.)
June 22nd    Node.js Applications (9. sh.)
June 29th    PredictProtein
July 6th     Wrap Up, Q&A
July 20th    Exam

National Center for Biotechnology Information, NCBI

http://nihrecord.nih.gov/newsletters/2013/07_19_2013/images/milestonesPic6.jpg

●  first ideas in the middle of the 80s
●  division of the National Library of Medicine (NLM) inside the National Institutes of Health (NIH)
●  political mission
●  founded in 1988
●  David Lipman

NCBI's political mission as defined by the bill:

1.  design, develop, implement, and manage automated systems for the collection, storage, retrieval, analysis, and dissemination of knowledge concerning human molecular biology, biochemistry, and genetics;
2.  perform research into advanced methods of computer-based information processing capable of representing and analyzing the vast number of biologically important molecules and compounds;
3.  enable persons engaged in biotechnology research and medical care to use systems developed under paragraph (1) and methods described in paragraph (2); and
4.  coordinate, as much as is practicable, efforts to gather biotechnology information on an international basis.

Selected NCBI Accomplishments

[Figure: timeline of milestones. 1990 to 1997: BLAST, GenBank at NCBI, NCBI website, Genomes, OMIM, PubMed. 1999 to 2008: Human Genome, PubMed Central, Entrez Gene / DTDs, NIH Public Access, Genome Reference Consortium, 1000 Genomes Project.]

NCBI Resources

●  NCBI currently hosts a large number of resources: http://www.ncbi.nlm.nih.gov/guide/all/
●  grouped according to various criteria:
   -  metadata, project-centric
   -  method oriented
   -  topic oriented
●  sorted into the sections: databases, downloads, submissions, tools, how-tos

GenBank's Origin

●  Walter Goad, Los Alamos National Laboratory
●  Los Alamos Sequence Database 1979
●  Creation and release of GenBank in 1982
●  End of 1982: 2000 sequences
●  Move to NCBI in 1992

http://www.lanl.gov/science-innovation/features/innovations/images/light/thumbnails/21.jpg

Minutes from the 20th anniversary of GenBank in 2002

"... Among them is a memo on Los Alamos National Laboratory stationery dated May 9, 1980, that reads: Monday, May 12 at 10:30 Steve Simon invites you for cake and coffee to celebrate 100,000 bases now in the DNA sequence library."

taken from https://www.genomeweb.com/genbank-turns-20

Growth of GenBank and WGS

[Figure: growth of GenBank and WGS, in bases and in sequences; diagrams for release 225, Apr. 2018]

-  doubling approx. every 18 months
-  current version, release 225: 260,189,141,631 bases in GenBank, 2,784,740,996,536 bases in WGS
-  current release 225: 208,452,303 sequences in GenBank, 621,379,029 sequences in WGS
-  taken from http://www.ncbi.nlm.nih.gov/genbank/statistics, release 225, Apr. 2018
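A doubling time of roughly 18 months can be turned into a simple projection. The sketch below only illustrates the arithmetic; the function names are made up for this example, and treating the growth as a clean exponential is an assumption, not a statement about how GenBank will actually grow.

```python
def annual_growth_factor(doubling_months: float = 18.0) -> float:
    """Factor by which the database grows in 12 months for a given doubling time."""
    return 2.0 ** (12.0 / doubling_months)


def projected_size(current: float, months_ahead: float,
                   doubling_months: float = 18.0) -> float:
    """Extrapolate a size assuming pure exponential growth."""
    return current * 2.0 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    bases_rel_225 = 260_189_141_631  # GenBank bases in release 225 (from the slide)
    print(f"growth per year: x{annual_growth_factor():.3f}")
    print(f"naive projection, 3 years ahead: {projected_size(bases_rel_225, 36):,.0f} bases")
```

With an 18 month doubling time the yearly factor is 2^(12/18), i.e. a bit under 1.6.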

References for GenBank

●  one current citation source: "GenBank". Nucleic Acids Res. 2014 Jan; 42(Database issue): D32-7. doi: 10.1093/nar/gkt1030. Epub 2013 Nov 11. PMID: 24217914
●  the most recent: "GenBank". Nucleic Acids Res. 2018 Jan 4; 46(D1): D41–D47. Published online 2017 Nov 13. doi: 10.1093/nar/gkx1094. PMCID: PMC5753231
●  more general, for NCBI services: "Database resources of the National Center for Biotechnology Information". Nucleic Acids Res. 2016 Jan 4; 44(Database issue): D7–D19. Published online 2015 Nov 28. doi: 10.1093/nar/gkv1290
●  GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC), together with the EMBL Nucleotide Sequence Database (EMBL-Bank), part of the European Nucleotide Archive (ENA), and the DNA Data Bank of Japan (DDBJ)

Most Growing Divisions

Division  Description                  Bases, Release 197 (8/2013)            Annual Increase (%)
WGS*      Whole-genome shotgun data    2,035,032,639,807 (from Release 219)
TSA*      Transcriptome shotgun data   149,038,907,599 (from Release 219)
WGS*      Whole-genome shotgun data    500,420,412,665                        62.4
TSA*      Transcriptome shotgun data   8,633,123,935                          49.9
PHG       Phages                       119,812,712                            42.5
VRL       Viruses                      1,757,202,472                          22.9
BCT       Bacteria                     10,281,048,518                         21.8
ENV       Environmental samples        3,743,277,434                          10.9
INV       Invertebrates                2,737,140,464                          9.8
PAT       Patented sequences           13,290,161,247                         9.7
PLN       Plants                       5,963,882,822                          8.8
GSS       Genome survey sequences      23,726,384,753                         8.1
VRT       Other vertebrates            3,068,956,026                          6.3
MAM       Other mammals                911,342,025                            5.6
...       ...                          ...                                    ...
TOTAL     All GenBank sequences        654,613,333,676                        45.1

* not distributed with the release; there are specific project sections on the server

Top Organisms (Rel. 207)

Organism                      Entries       Non-WGS base pairs
Homo sapiens                  20,921,637    17,714,786,437
Mus musculus                  9,727,522     9,995,696,539
Rattus norvegicus             2,193,812     6,526,236,496
Bos taurus                    2,227,298     5,410,360,312
Zea mays                      4,177,175     5,201,714,457
Sus scrofa                    3,297,029     4,895,127,638
Danio rerio                   1,727,668     3,133,901,682
Triticum aestivum             1,796,780     1,927,718,314
...                           ...           ...
Oryza sativa Japonica Group   1,376,410     1,265,556,227
...                           ...           ...
Arabidopsis thaliana          2,578,785     1,202,100,008
...                           ...           ...

Top Organisms (Rel. 219)

Organism                       Entries       Non-WGS base pairs
Homo sapiens                   24,231,652    18,893,466,733
Mus musculus                   9,883,173     10,229,286,664
Rattus norvegicus              2,197,781     6,528,984,315
Bos taurus                     2,229,235     5,429,379,063
Zea mays                       4,197,803     5,227,077,026
Sus scrofa                     3,298,802     5,071,347,463
Hordeum vulgare ssp. vulgare   1,346,798     3,235,834,212
Danio rerio                    1,729,033     3,190,913,255
Ovis canadensis canadensis     72            2,590,574,434
Triticum aestivum              1,812,814     1,942,831,630
...                            ...           ...
Oryza sativa Japonica Group    1,378,262     1,642,328,218
...                            ...           ...
Escherichia coli               118,884       1,571,576,668
...                            ...           ...

Distribution of Sequence Files (Rel. 207)

Division   Number of Files
BCT        178
CON        317
ENV        81
EST        478
HTG        142
INV        126
PAT        219
PLN        107
TSA        175
VRL        34

Release 207 consists of 2333 text files in total. Release 225 consists of 3120 text files in total.

Distribution of Sequence Files (Rel. 219)

Division   Number of Files
BCT        350
CON        359
ENV        97
EST        483
HTG
INV        153
PAT        290
PHG        4
PLN        145
PRI        56
SYN        10
TSA        230
VRL        48

Release 219 consists of 2225 text files in total.

Database Files (Rel. 225)

●  GenBank comes as a set of compressed text files available via FTP
●  see ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
●  3120 ASCII files (listed by division, plus additional list files) in the range of 0.7-520 MB
●  uncompressed ~885 GB
●  each file consists of two portions

Database Files

●  Part 1: highly conserved database file header

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
GBBCT1.SEQ          Genetic Sequence Data Bank
                          April 15 2015

                NCBI-GenBank Flat File Release 207.0

                     Bacterial Sequences (Part 1)

   51396 loci,    92682287 bases, from    51396 reported sequences
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

●  Part 2: sequence entries for the division described in the header

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
GBSMP.SEQ          Genetic Sequence Data Bank
                         December 15 1992

                 GenBank Flat File Release 74.0

                     Structural RNA Sequences

      2 loci,       236 bases, from     2 reported sequences

LOCUS       AAURRA                   118 bp ss-rRNA               RNA 16-JUN-1986
DEFINITION  A.auricula-judae (mushroom) 5S ribosomal RNA.
ACCESSION   K03160
VERSION     K03160.1
KEYWORDS    5S ribosomal RNA; ribosomal RNA.
SOURCE      A.auricula-judae (mushroom) ribosomal RNA.
  ORGANISM  Auricularia auricula-judae
            Eukaryota; Fungi; Eumycota; Basidiomycotina; Phragmobasidiomycetes;
            Heterobasidiomycetidae; Auriculariales; Auriculariaceae.
REFERENCE   1  (bases 1 to 118)
  AUTHORS   Huysmans,E., Dams,E., Vandenberghe,A. and De Wachter,R.
  TITLE     The nucleotide sequences of the 5S rRNAs of four mushrooms and
            their use in studying the phylogenetic position of basidiomycetes
            among the eukaryotes
  JOURNAL   Nucleic Acids Res. 11, 2871-2880 (1983)
FEATURES             Location/Qualifiers
     rRNA            1..118
                     /note="5S ribosomal RNA"
BASE COUNT       27 a     34 c     34 g     23 t
ORIGIN      5' end of mature rRNA.
        1 atccacggcc ataggactct gaaagcactg catcccgtcc gatctgcaaa gttaaccaga
       61 gtaccgccca gttagtacca cggtggggga ccacgcggga atcctgggtg ctgtggtt
//

LOCUS       ABCRRAA                  118 bp ss-rRNA               RNA 15-SEP-1990
DEFINITION  Acetobacter sp. (strain MB 58) 5S ribosomal RNA, complete sequence.
ACCESSION   M34766
VERSION     M34766.1
KEYWORDS    5S ribosomal RNA.
SOURCE      Acetobacter sp. (strain MB 58) rRNA.
  ORGANISM  Acetobacter sp.
            Prokaryotae; Gracilicutes; Scotobacteria; Aerobic rods and cocci;
            Azotobacteraceae.
REFERENCE   1  (bases 1 to 118)
  AUTHORS   Bulygina,E.S., Galchenko,V.F., Govorukhina,N.I., Netrusov,A.I.,
            Nikitin,D.I., Trotsenko,Y.A. and Chumakov,K.M.
  TITLE     Taxonomic studies of methylotrophic bacteria by 5S ribosomal RNA
            sequencing
  JOURNAL   J. Gen. Microbiol. 136, 441-446 (1990)
FEATURES             Location/Qualifiers
     rRNA            1..118
                     /note="5S ribosomal RNA"
BASE COUNT       27 a     40 c     32 g     17 t      2 others
ORIGIN
        1 gatctggtgg ccatggcggg agcaaatcag ccgatcccat cccgaactcg gccgtcaaat
       61 gccccagcgc ccatgatact ctgcctcaag gcacggaaaa gtcggtcgcc gccagayy
//
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

The GenBank Flat File Format

●  a sequence entry consists of many records (lines)
●  each record consists of two parts
●  Part 1: columns 1-10 / entry field name
●  Part 2: remainder of the line with the content

Part 1 (1/3)

●  a keyword, beginning in column 1 of the record (e.g., REFERENCE is a keyword)
●  a subkeyword beginning in column 3, with columns 1 and 2 blank (e.g., AUTHORS is a subkeyword of REFERENCE)
●  or a subkeyword beginning in column 4, with columns 1, 2, and 3 blank (e.g., PUBMED is a subkeyword of REFERENCE)

Part 1 (2/3)

●  blank characters, indicating that this record is a continuation of the information under the keyword or subkeyword above it
●  a code, beginning in column 6, indicating the nature of an entry (feature key) in the FEATURES table

Part 1 (3/3)

●  a number, ending in column 9 of the record:
   -  this number occurs in the portion of the entry describing the actual nucleotide sequence and designates the numbering of sequence positions
●  two slashes (//) in positions 1 and 2, marking the end of an entry

Part 2

●  the second part of each sequence entry record contains the information appropriate to its keyword
●  in positions 13 to 80 for keywords
●  in positions 11 to 80 for the sequence
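The column rules above can be sketched as a small record classifier. This is an illustration only, not a full GenBank parser: the function name, the returned tuple layout and the "kind" labels are assumptions made for this example, and the feature key columns (6-20, location from column 22) are taken from the feature table layout of the same format.

```python
def classify_record(line: str):
    """Classify one flat-file record (line) as (kind, name, content)."""
    line = line.rstrip("\n")
    if not line.strip():
        return ("blank", "", "")
    if line.startswith("//"):
        # two slashes in positions 1 and 2 end the entry
        return ("end", "//", "")
    number = line[:9].strip()
    if number.isdigit():
        # sequence record: number ends in column 9, sequence starts in column 11
        return ("sequence", number, line[10:].strip())
    indent = len(line) - len(line.lstrip(" "))
    if indent == 0:
        # keyword starts in column 1, content starts in column 13
        return ("keyword", line[:12].strip(), line[12:].strip())
    if indent in (2, 3):
        # subkeyword starts in column 3 or 4
        return ("subkeyword", line[:12].strip(), line[12:].strip())
    if indent == 5:
        # feature key starts in column 6, its location in column 22
        return ("feature_key", line[5:20].strip(), line[21:].strip())
    # anything else continues the keyword or subkeyword above it
    return ("continuation", "", line.strip())


if __name__ == "__main__":
    sample = [
        "REFERENCE   1  (bases 1 to 118)",
        "  AUTHORS   Huysmans,E., Dams,E., Vandenberghe,A. and De Wachter,R.",
        " " * 5 + "rRNA".ljust(16) + "1..118",
        "1".rjust(9) + " atccacggcc ataggactct",
        "//",
    ]
    for record in sample:
        print(classify_record(record))
```

The digit check runs before the indentation checks because a sequence position with four or more digits would otherwise start in the feature key column.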

Entry Field Types (incomplete)

●  Locus: a short mnemonic name for the entry, chosen to suggest the sequence's definition; mandatory keyword / exactly one record
●  Definition: a concise description of the sequence; mandatory keyword / one or more records
●  Accession:
   -  the primary accession number is a unique, unchanging identifier assigned to each GenBank sequence record
   -  to be used for citations from GenBank
   -  mandatory keyword / one or more records
●  Version:
   -  compound identifier consisting of the primary accession number and a numeric version number associated with the current version of the sequence data in the record
   -  optionally followed by an integer identifier (a "GI") assigned to the sequence by NCBI
   -  mandatory keyword / exactly one record
●  DBLINK: provides cross-references to resources that support the existence of a sequence record; optional keyword / one or more records
●  Keywords: short phrases describing gene products and other information about an entry; mandatory keyword in all annotated entries / one or more records
●  Source: common name of the organism or the name most frequently used in the literature; mandatory keyword in all annotated entries / one or more records / includes one subkeyword
●  Organism: formal scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines); mandatory subkeyword in all annotated entries / two or more records
●  Reference:
   -  citations for all articles containing data reported in this entry
   -  includes seven subkeywords and may repeat
   -  mandatory keyword / one or more records
●  Journal: lists the journal name, volume, year, and page numbers of the citation; mandatory subkeyword / one or more records
●  optional subkeywords: Authors, Consortium, Title, Medline, Pubmed, Remark
●  Features: table containing information on portions of the sequence that code for proteins and RNA molecules, and on sites of biological significance; optional keyword / one or more records
●  Origin:
   -  specification of how the first base of the reported sequence is operationally located within the genome
   -  mandatory keyword / exactly one record
   -  followed by sequence data (multiple records)
●  //: entry termination symbol; mandatory at the end of an entry / exactly one record

Detailed Locus Format

Columns   Contents
01-05     'LOCUS'
06-12     spaces
13-28     Locus name
29-29     space
30-40     Length of sequence, right-justified
41-41     space
42-43     bp
44-44     space
45-47     spaces, ss- (single-stranded), ds- (double-stranded), or ms- (mixed-stranded)
48-53     NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), mRNA (messenger RNA), uRNA (small nuclear RNA), left-justified
54-55     space
56-63     'linear' followed by two spaces, or 'circular'
64-64     space
65-67     The division code
68-68     space
69-79     Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
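The fixed columns of the LOCUS line translate directly into string slices. The following is a minimal sketch under the assumption that a line follows the table exactly; `format_locus`, `parse_locus` and the field names are hypothetical helpers for this example, and the sample values (in particular 'linear' and the division 'PLN') are illustrative rather than taken from a real record.

```python
# 1-based, inclusive column ranges from the table above
LOCUS_COLUMNS = {
    "name": (13, 28),
    "length": (30, 40),
    "strandedness": (45, 47),
    "molecule": (48, 53),
    "topology": (56, 63),
    "division": (65, 67),
    "date": (69, 79),
}


def format_locus(name, length, strandedness, molecule, topology, division, date):
    """Build a LOCUS line that follows the fixed-column layout."""
    return (
        "LOCUS" + " " * 7            # columns 01-12
        + name.ljust(16) + " "       # columns 13-29
        + str(length).rjust(11) + " "  # columns 30-41
        + "bp" + " "                 # columns 42-44
        + strandedness.ljust(3)      # columns 45-47
        + molecule.ljust(6) + "  "   # columns 48-55
        + topology.ljust(8) + " "    # columns 56-64
        + division.ljust(3) + " "    # columns 65-68
        + date                       # columns 69-79
    )


def parse_locus(line: str) -> dict:
    """Slice a LOCUS line into its fields by column position."""
    if not line.startswith("LOCUS"):
        raise ValueError("not a LOCUS line")
    fields = {key: line[start - 1:end].strip()
              for key, (start, end) in LOCUS_COLUMNS.items()}
    fields["length"] = int(fields["length"])
    return fields


if __name__ == "__main__":
    line = format_locus("AAURRA", 118, "ss-", "rRNA", "linear", "PLN", "16-JUN-1986")
    print(line)
    print(parse_locus(line))
```

Round-tripping through `format_locus` avoids counting spaces by hand, which is the usual source of errors with fixed-width formats.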

Accession Format

●  six or eight characters
●  six-character format:
   -  single uppercase letter
   -  5 digits
●  eight-character format:
   -  two uppercase letters
   -  6 digits
●  the primary accession number is always the first one
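The two accession layouts described above fit into one regular expression. This is a sketch of exactly those two patterns and nothing more; the helper names are invented for this example, and "AB123456" is a made-up identifier used only to exercise the eight-character form.

```python
import re

# one uppercase letter + 5 digits, or two uppercase letters + 6 digits
ACCESSION_RE = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})")


def is_accession(text: str) -> bool:
    """True if text matches the six- or eight-character accession format."""
    return ACCESSION_RE.fullmatch(text) is not None


def split_version(version: str):
    """Split a VERSION identifier such as 'K03160.1' into (accession, version number)."""
    accession, _, number = version.partition(".")
    if not is_accession(accession) or not number.isdigit():
        raise ValueError(f"not an accession.version identifier: {version!r}")
    return accession, int(number)


if __name__ == "__main__":
    for candidate in ("K03160", "M34766", "AB123456", "k03160", "K0316"):
        print(candidate, is_accession(candidate))
    print(split_version("K03160.1"))
```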

Features (Incomplete)

●  authoritative source: http://www.insdc.org/documents/feature-table
●  the feature table:
   -  contains information about genes and gene products
   -  contains information about regions of biological significance
   -  can enumerate differences between various reports
   -  provides cross-references to other data collections
   -  allows hierarchical relations between the features

Layout

●  the first line of the feature table is a header
●  it includes the keyword 'FEATURES' and the column header 'Location/Qualifiers'
●  each feature consists of:
   -  a descriptor line containing a feature key and a location
   -  a continuation line for the location may follow
   -  feature qualifiers may follow the descriptor line
   -  key: columns 6-20, location starts in column 22
   -  qualifiers on subsequent lines at column 22, starting with a '/'

A Few Frequent Features

●  CDS: sequence coding for amino acids in a protein (includes the stop codon)
●  exon: region that codes for part of a spliced mRNA
●  gene: region that defines a functional gene, possibly including upstream (promoter, enhancer, etc.) and downstream control elements, and for which a name has been assigned
●  mRNA: messenger RNA
●  ... > 60 features currently

Location and Qualifiers

●  Location:
   -  a location can be: a single base, a span of bases, a site between two bases, a join of sequences, ...
   -  examples: 23, 23..56, 23^24, join(23..56,87..110)
●  Qualifiers:
   -  format: from column 22, /qualifier_name[=value]
   -  types: free text, enumeration or controlled vocabulary, citations, sequences, feature labels
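The four example locations can be parsed with a few lines of code. The sketch below covers only those four forms (single base, span, site between two bases, and a flat join); the tuple representation is an assumption of this example, and the full INSDC location syntax has more operators than are handled here.

```python
import re


def parse_location(location: str):
    """Parse a simple feature location into a tagged tuple."""
    location = location.strip()
    if location.startswith("join(") and location.endswith(")"):
        # flat join only: nested operators inside join(...) are not handled
        parts = location[len("join("):-1].split(",")
        return ("join", [parse_location(part) for part in parts])
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match:
        return ("span", int(match[1]), int(match[2]))
    match = re.fullmatch(r"(\d+)\^(\d+)", location)
    if match:
        return ("between", int(match[1]), int(match[2]))
    if location.isdigit():
        return ("base", int(location))
    raise ValueError(f"unsupported location: {location!r}")


if __name__ == "__main__":
    for example in ("23", "23..56", "23^24", "join(23..56,87..110)"):
        print(example, "->", parse_location(example))
```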

Database Cross References / db_xref

●  http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref/
●  Qualifier: /db_xref="database:identifier"
●  Definition: database cross-reference: pointer to related information in another database
●  Scope: all feature keys
●  Example: /db_xref="Swiss-Prot:P12345"
●  currently > 120 databases available
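A qualifier of the form /qualifier_name[=value], and the db_xref case in particular, can be taken apart as follows. This is a sketch for single-line qualifiers only; the function names are invented for this example, and multi-line qualifier values are not handled.

```python
def parse_qualifier(text: str):
    """Split '/name=value' into (name, value); value is None for '/name' alone."""
    text = text.strip()
    if not text.startswith("/"):
        raise ValueError("a qualifier starts with '/'")
    name, separator, value = text[1:].partition("=")
    if not separator:
        return name, None
    return name, value.strip('"')


def parse_db_xref(text: str):
    """Split a /db_xref qualifier into (database, identifier)."""
    name, value = parse_qualifier(text)
    if name != "db_xref" or value is None:
        raise ValueError("not a db_xref qualifier")
    database, _, identifier = value.partition(":")
    return database, identifier


if __name__ == "__main__":
    print(parse_db_xref('/db_xref="Swiss-Prot:P12345"'))
    print(parse_qualifier('/note="5S ribosomal RNA"'))
```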

Anatomy of a GenBank Flat File

[Figure: annotated example flat file, highlighting in turn the locus line; the accession number, version and GI number; and the feature table with annotations]

Useful Resources from NCBI

●  Materials:
   -  Electronic bookshelf
   -  http://www.ncbi.nlm.nih.gov/education/factsheets/
   -  ftp://ftp.ncbi.nih.gov/pub/factsheets/Factsheet_Books.pdf
   -  NCBI manuals
   -  textbooks
●  Processes, e.g. Prokaryotic Genome Annotation Pipeline:
   -  designed for bacterial and archaeal genomes
   -  multi-level process including prediction of protein-coding genes and of functional genome units such as rRNAs, tRNAs, small RNAs, pseudogenes, control regions, repeats, insertion elements, and so forth
   -  combination of ab-initio prediction and homology-based methods
●  reference databases: RefSeq
   -  http://www.ncbi.nlm.nih.gov/refseq/
   -  comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins
   -  stable reference for genome annotation, esp. the subset RefSeqGene
   -  reference sequences
   -  reference coordinates
   -  accessible via BLAST, Entrez and FTP

RefSeq

●  created by:
   -  Eukaryotic Genome Annotation Pipeline
   -  Prokaryotic Genome Annotation Pipeline
   -  manual curation
   -  submission to INSDC members
●  reflects current knowledge of sequence data and biology
●  format consistency
●  accession number contains an "_"

[Figure: RefSeq Growth]

[Figure: Databases Accessible via Entrez, http://www.ncbi.nlm.nih.gov/gquery/]

[Figures: Computation: BLAST at NCBI]


Searching the NCBI / Entrez

●  the Entrez Programming Utilities (E-utilities) provide an integrated search interface to the different NCBI databases
●  Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
●  > 40 databases
●  stable interface of nine server-side programs
●  http://www.ncbi.nlm.nih.gov/books/NBK25501/

Entrez Guidelines

●  if you use the E-utilities against the guidelines you might be banned!
●  > 100 requests: run them on weekends or outside US peak times (9 pm - 5 am EST)
●  not more than 3 requests per second
●  provide email and tool name: &tool=<...>&email=<...>
●  registering email and tool name with NCBI may relax these restrictions
●  supported by BioPython
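The limit of three requests per second can be enforced on the client side with a small throttle. The class below is one possible sketch, not part of any NCBI tool: the name `Throttle` and its interface are assumptions of this example, and the clock and sleep functions are injectable only so that the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Space out calls so that at most max_per_second are made."""

    def __init__(self, max_per_second: float = 3,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self) -> None:
        """Block until the next request is allowed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(max_per_second=3)
    for i in range(4):
        throttle.wait()  # call this right before each E-utility request
        print("request", i, "at", round(time.monotonic(), 2))
```

Calling `wait()` immediately before every request keeps consecutive requests at least a third of a second apart.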

Constructing URLs

●  parameter: &lowerCaseName
●  exception: &WebEnv
●  no required order
●  null values and inappropriate parameters are generally ignored
●  no spaces, use + instead
●  use URL encodings for special characters, like %22 for ", %23 for #, or %40 for @
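These encoding rules are what Python's standard `urllib.parse.urlencode` produces by default (spaces become +, special characters become %XX). The sketch below builds an E-utility URL without sending it; `build_url` and the default tool name and email address are placeholders invented for this example, and the https form of the base URL is used here on the assumption that the plain http address from the slides redirects to it.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_url(utility: str, tool: str = "my_tool",
              email: str = "me@example.org", **params) -> str:
    """Assemble an E-utility URL; parameters set to None are left out."""
    query = {key: value for key, value in params.items() if value is not None}
    query["tool"] = tool      # placeholder tool name
    query["email"] = email    # placeholder contact address
    return BASE_URL + utility + "?" + urlencode(query)


if __name__ == "__main__":
    print(build_url("esearch.fcgi", db="pubmed",
                    term='"breast cancer" AND science[journal]'))
```

Letting the library do the encoding avoids hand-written %22 and %40 sequences and the typos that come with them.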

E-utilities

●  EInfo
●  ESearch
●  EPost
●  ESummary
●  EFetch
●  ELink
●  EGQuery
●  ESpell
●  ECitMatch

External Interfaces to Entrez / API

●  there are a number of APIs to access the various services from NCBI, described at:
●  http://www.ncbi.nlm.nih.gov/books/NBK25501/
●  base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
●  basic searching:
   -  esearch.fcgi?db=<database>&term=<query>
   -  Input: Entrez database (&db); any Entrez text query (&term)
   -  Output: list of UIDs matching the Entrez query
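ESearch returns its UID list as XML, which the standard library can read. The sketch below parses a hand-written, heavily abbreviated response; the two UIDs in it are invented, and a real response contains further elements that are ignored here.

```python
import xml.etree.ElementTree as ET

# Abbreviated, made-up ESearch response used only for illustration
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""


def parse_esearch(xml_text: str):
    """Return (total hit count, list of UIDs) from an ESearch XML response."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    count = int(root.findtext("Count", default="0"))
    uids = [element.text for element in root.findall("IdList/Id")]
    return count, uids


if __name__ == "__main__":
    print(parse_esearch(SAMPLE_RESPONSE))
```

The hit count can be larger than the number of returned UIDs, which is why both values are kept.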

ESearch

●  text search
●  eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
●  responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query

ESummary

●  document summary downloads
●  eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi
●  responds to a list of UIDs from a given database with the corresponding document summaries

EGQuery

●  global query
●  eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi
●  responds to a text query with the number of records matching the query in each Entrez database

EInfo

●  database statistics
●  eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
●  provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases
●  without &db: lists all available databases

EFetch

●  data record downloads
●  eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
●  responds to a list of UIDs in a given database with the corresponding data records in a specified format

ELink

●  Entrez links
●  eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi
●  responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database
●  checks for the existence of a specified link from a list of one or more UIDs
●  creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs

EPost

●  UID uploads
●  eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi
●  accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset

ESpell

●  spelling suggestions
●  eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi
●  retrieves spelling suggestions for a text query in a given database

ECitMatch

●  batch citation searching in PubMed
●  eutils.ncbi.nlm.nih.gov/entrez/eutils/ecitmatch.cgi
●  retrieves PubMed IDs (PMIDs) corresponding to a set of input citation strings

Identifiers

●  records are identified by an integer ID called UID
●  UIDs are database specific, like GI numbers, PMIDs, MMDB-IDs
●  UIDs serve as both input and output
●  especially useful in combination with the History server
●  a full description of parameters and syntax can be found at: http://www.ncbi.nlm.nih.gov/books/NBK25499/

Selected UIDs

Entrez Database     UID common name   E-utility Database Name
Books               Book ID           books
Conserved Domains   PSSM-ID           cdd
dbVar               dbVar ID          dbvar
EST                 GI number         nucest
Gene                Gene ID           gene
Genome              Genome ID         genome
MeSH                MeSH ID           mesh
NCBI Web Site       Web Site ID       ncbisearch
Nucleotide          GI number         nuccore
PubMed              PMID              pubmed
...                 ...               ...

Entrez Core Engine

●  EGQuery, ESearch, and ESummary
●  two tasks:
   -  assemble a list of UIDs that match a text query (ESearch)
   -  retrieve a brief summary record called a Document Summary (DocSum) for each UID (ESummary)
●  EGQuery: global version of ESearch
●  esearch.fcgi?db=database&term=query
●  esummary.fcgi?db=database&id=uid1,uid2,uid3,...
●  can be expanded into more complicated Entrez queries

Entrez Databases (EInfo, EFetch, and ELink)

●  EInfo:
   -  provides detailed information about each database
   -  including lists of the indexing fields in the database
   -  available links to other Entrez databases
●  added value to the raw data (EFetch):
   -  supports a variety of display formats: UID lists in XML and plain text (&retmode) for all databases; other formats (&rettype) are database specific
   -  http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly
   -  efetch.fcgi?db=database&id=uid1,uid2,uid3&rettype=report_type&retmode=data_mode
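The EFetch call pattern above can be assembled with a small helper. This sketch only builds the URL and does not contact NCBI; `efetch_url` is a hypothetical name, the https base URL is an assumption, and the combination rettype=fasta with retmode=text for the protein database is used as an example of a database-specific format.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def efetch_url(db: str, uids, rettype: str, retmode: str) -> str:
    """Build an EFetch URL for a list of UIDs in one database."""
    params = {
        "db": db,
        "id": ",".join(str(uid) for uid in uids),  # uid1,uid2,uid3
        "rettype": rettype,
        "retmode": retmode,
    }
    # keep the commas of the UID list readable instead of encoding them as %2C
    return BASE_URL + "efetch.fcgi?" + urlencode(params, safe=",")


if __name__ == "__main__":
    print(efetch_url("protein", [15718680, 157427902], "fasta", "text"))
```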

Entrez Databases (EInfo, EFetch, and ELink)

●  added value to the raw data (ELink):
   -  links to records in other Entrez databases, manifested as lists of associated UIDs
   -  UIDs must be valid in the source database (&dbfrom)
   -  elink.fcgi?dbfrom=protein&db=gene&id=15718680,157427902

Entrez History Server

●  simple: in the GUI, accessible via the respective tabs
●  you can temporarily store sets of UIDs as input for later queries through other tools
●  each list of UIDs is specified by:
   -  &query_key (integer label)
   -  &WebEnv (cookie string)

Creation of a stored UID list

●  EPost:
   -  EPost can be used to upload a UID list
   -  returns &query_key and &WebEnv
●  ESearch:
   -  stores the results if given &usehistory=y
●  ELink:
   -  stores the results if given &cmd=neighbor_history

Usage of stored UID lists

●  use of stored lists: esummary.fcgi?db=database&WebEnv=webenv&query_key=key
●  one web environment can hold multiple result lists
●  lists in the same web environment can be combined with AND, OR, NOT
●  by default every call creates a new environment
●  -> give &WebEnv in subsequent calls to store the lists in the same web environment
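A stored list is addressed by the two values that the history-enabled call returns. The sketch below reads them from a hand-written, abbreviated ESearch response and builds the follow-up ESummary URL; the WebEnv string is a made-up placeholder, the helper names are invented, and no request is sent.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Abbreviated, made-up response of an ESearch call with usehistory=y
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV_STRING</WebEnv>
</eSearchResult>"""


def history_reference(xml_text: str):
    """Extract (query_key, WebEnv) from a history-enabled response."""
    root = ET.fromstring(xml_text)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


def esummary_from_history(db: str, query_key: str, web_env: str) -> str:
    """Build an ESummary URL that reads its UIDs from the history server."""
    params = {"db": db, "query_key": query_key, "WebEnv": web_env}
    return BASE_URL + "esummary.fcgi?" + urlencode(params)


if __name__ == "__main__":
    key, env = history_reference(SAMPLE_RESPONSE)
    print(esummary_from_history("pubmed", key, env))
```

Note that WebEnv is the one parameter whose name is not all lower case, as mentioned in the URL construction rules.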

Sketching Pipelines

●  get DocSummaries or entries for keywords or IDs:
   -  ESearch -> ESummary/EFetch
   -  EPost -> ESummary/EFetch
●  filter/limit a record set:
   -  EPost/ELink -> ESearch
●  more advanced queries:
   -  ESearch -> ELink -> ESummary/EFetch
   -  EPost -> ELink -> ESearch -> EFetch
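The simplest of these pipelines, ESearch -> EFetch, can be sketched with the network access kept out of the logic. In the code below `call(utility, params)` stands for whatever function actually performs the HTTP request and returns the response text; it is a hypothetical interface for this example, and the demo uses a fake implementation with invented responses so that nothing is sent to NCBI.

```python
import xml.etree.ElementTree as ET


def search_then_fetch(db: str, term: str, call,
                      rettype: str = "fasta", retmode: str = "text") -> str:
    """ESearch -> EFetch, passing the result set via the history server."""
    # Step 1: search and store the matching UIDs on the history server
    search_xml = call("esearch.fcgi", {"db": db, "term": term, "usehistory": "y"})
    root = ET.fromstring(search_xml)
    # Step 2: fetch the stored set by query key and web environment
    return call("efetch.fcgi", {
        "db": db,
        "query_key": root.findtext("QueryKey"),
        "WebEnv": root.findtext("WebEnv"),
        "rettype": rettype,
        "retmode": retmode,
    })


def fake_call(utility: str, params: dict) -> str:
    """Stand-in for a real HTTP call; returns canned, made-up responses."""
    if utility == "esearch.fcgi":
        return ("<eSearchResult><QueryKey>1</QueryKey>"
                "<WebEnv>EXAMPLE_WEBENV</WebEnv></eSearchResult>")
    return f">record fetched with query_key={params['query_key']}"


if __name__ == "__main__":
    print(search_then_fetch("protein", "example query", fake_call))
```

Injecting `call` also gives one obvious place to add the tool and email parameters and to respect the request rate limit.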

●  storing results:
   -  esearch.fcgi?db=<database>&term=<query>&usehistory=y
   -  input: any Entrez text query (&term); Entrez database (&db); &usehistory=y
   -  output: web environment (&WebEnv) and query key (&query_key) parameters specifying the location on the Entrez history server of the list of UIDs matching the Entrez query
   -  example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science%5bjournal%5d+AND+breast+cancer+AND+2008%5bpdat%5d&usehistory=y

●  associating search results with existing search results:
   -  esearch.fcgi?db=<database>&term=<query1>&usehistory=y
   -  esearch.fcgi?db=<database>&term=<query2>&usehistory=y&WebEnv=$web1
   -  input: any Entrez text query (&term); Entrez database (&db); &usehistory=y; existing web environment (&WebEnv) from a prior E-utility call
   -  output: web environment (&WebEnv) and query key (&query_key) parameters specifying the location on the Entrez history server of the list of UIDs matching the Entrez query
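Once two searches share a web environment, their result lists can be referenced by query key inside a new search term, written as #1, #2 and so on (the # is what the %23 encoding is needed for). The sketch below only builds the URLs; the helper names and the WebEnv value are placeholders for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db: str, term: str, web_env: str = None) -> str:
    """Build a history-enabled ESearch URL, optionally reusing a web environment."""
    params = {"db": db, "term": term, "usehistory": "y"}
    if web_env is not None:
        params["WebEnv"] = web_env  # append the result to an existing environment
    return BASE_URL + "esearch.fcgi?" + urlencode(params)


def combine_query_keys(query_keys, operator: str = "AND") -> str:
    """Build a term such as '#1 AND #2' that refers to stored result lists."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(f"#{key}" for key in query_keys)


if __name__ == "__main__":
    web1 = "EXAMPLE_WEBENV"  # placeholder for the WebEnv returned by the first call
    print(esearch_url("pubmed", "science[journal]"))
    print(esearch_url("pubmed", "breast cancer", web_env=web1))
    print(esearch_url("pubmed", combine_query_keys([1, 2]), web_env=web1))
```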

E-utility Webinar

●  https://www.youtube.com/watch?v=iCFVVexp30o
