Genomics and Personalized Health Care Databases Bailee Ludwig Quality Management

Upload: luke-williamson

Post on 12-Jan-2016


Page 1: Genomics and Personalized Health Care Databases Bailee Ludwig Quality Management

Genomics and Personalized Health Care Databases

Bailee Ludwig

Quality Management

Page 2

Molecular Biology Databases

• An excellent means of storing a vast amount of information in a central, shareable location

• Biological databases are designed specifically for storing, searching, and retrieving biological data
– Keyword searches
– Cross-referencing
– 3D capabilities

Page 3

Database Categories

• Categories
– Nucleotide Sequence Databases
    • Gene Databases
    • Genome Databases
– Protein Sequence Databases
– Structure Databases
– Metabolic and Signaling Pathways
– Human Genes and Diseases
– Microarray Data and other Expression Databases
– …

• Each contains specific information
• Each is interrelated

Page 4

Nucleotide & Protein Sequence Databases

Page 5

National Center for Biotechnology Information (NCBI)

• Created in 1988 as part of the National Library of Medicine
– Establish public databases
– Perform research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information

• Databases
– Sequence, such as GenBank, RefSeq, dbSNP
– Literature, such as PubMed, OMIM

• Tools
– Entrez, BLAST, Cn3D, etc.
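Entrez can also be queried programmatically through NCBI's E-utilities. The following is a minimal sketch that only builds an ESearch URL and makes no network request. The function name `esearch_url`, the `retmax` default, and the example search term are illustrative assumptions, not part of the slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for an Entrez database (hypothetical helper)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Example search term (an assumption): the BRCA1 gene in human.
print(esearch_url("gene", "BRCA1[Gene Name] AND Homo sapiens[Organism]"))
```

Opening the printed URL in a browser should return an XML list of matching record identifiers, which is roughly what the Entrez web search does behind the scenes.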

Page 6

NCBI Homepage

Page 7

NCBI Site Map

Page 8

All Databases at NCBI:

Page 9

Let’s Check out NCBI

• http://www.ncbi.nlm.nih.gov/sites/gquery?itool=toolbar

Page 10
Page 11

Multiple ways to find Genes…

Page 12

Let’s Look at BRCA1

Page 13
Page 14
Page 15

GenBank

http://www.ncbi.nlm.nih.gov/Genbank/

Page 16

GenBank

• Nucleotide-only sequence database
• GenBank data
– Direct submissions of individual records (BankIt, Sequin)
– Batch submissions via email (EST, GSS, STS)
– FTP accounts established for sequencing centers

• Data shared nightly among three collaborating databases:
– GenBank
– DNA Data Bank of Japan (DDBJ)
– European Molecular Biology Laboratory Database (EMBL)

Page 17

[Figure: Growth of GenBank (1982-2009). Plotted by year: sequences (millions) and base pairs (billions).]

Page 18

GenBank Release 175.0

• ftp://ftp.ncbi.nih.gov/genbank/
• Full release every two months
• Incremental and cumulative updates daily

• Release 175.0 (12/15/2009)

• 112,910,950 sequences
• 110,118,557,163 bases
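As a quick worked calculation, the two Release 175.0 totals give the average record length. This is only arithmetic on the numbers shown on the slide; the variable names are illustrative.

```python
# Totals reported for GenBank Release 175.0 on the slide.
sequences = 112_910_950
bases = 110_118_557_163

# Mean number of bases per sequence record.
average_length = bases / sequences
print(f"Average record length: {average_length:.1f} bases")
```

The result is a little under 1,000 bases per record. That fits a database in which many entries are short reads such as ESTs rather than complete genomes.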

Page 19

NCBI Reference Sequences

Page 20

GenBank Record (Header)

LOCUS NM_001963 4913 bp mRNA linear PRI 20-SEP-2009

DEFINITION Homo sapiens epidermal growth factor (beta-urogastrone) (EGF), mRNA.

ACCESSION NM_001963

VERSION NM_001963.3 GI:166362727

KEYWORDS .

SOURCE Homo sapiens (human)

ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo.

REFERENCE 1 (bases 1 to 4913)

AUTHORS Hosgood,H.D. III, Menashe,I., He,X., Chanock,S. and Lan,Q.

TITLE PTEN identified as important risk factor of chronic obstructive pulmonary disease

JOURNAL Respir Med (2009) In press

PUBMED 19625176

REMARK GeneRIF: Observational study of gene-disease association.
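The LOCUS line packs several fields into one row: record name, length, molecule type, topology, GenBank division, and modification date. A minimal sketch of pulling those fields apart is shown below. It assumes whitespace-separated fields exactly as they appear in this record; `parse_locus` and the dictionary keys are hypothetical names, and other records may carry extra or missing fields.

```python
from datetime import date

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def parse_locus(line: str) -> dict:
    """Split a LOCUS line laid out like the one on the slide (assumed layout)."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    day, month, year = fields[7].split("-")
    return {
        "name": fields[1],
        "length": int(fields[2]),        # sequence length in base pairs
        "molecule": fields[4],           # e.g. mRNA
        "topology": fields[5],           # linear or circular
        "division": fields[6],           # e.g. PRI (primate)
        "date": date(int(year), MONTHS.index(month.upper()) + 1, int(day)),
    }


record = parse_locus("LOCUS NM_001963 4913 bp mRNA linear PRI 20-SEP-2009")
print(record["name"], record["length"], record["date"].isoformat())
```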

Page 21

Summary

Page 22

GenBank Record (Sequence)

ORIGIN
        1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
       61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
      121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
      181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
      241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
      301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
      361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
      421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
      481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
      541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
      601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
      661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
      721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
      781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
      841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
      901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
      961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
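In the ORIGIN block, each row starts with the position of its first base and then lists bases in groups of ten. To work with the sequence itself, the position numbers and spaces have to be stripped. A minimal sketch follows; `origin_to_sequence` is a hypothetical helper name.

```python
import re


def origin_to_sequence(origin_text: str) -> str:
    """Join the bases of a GenBank ORIGIN block into one continuous string."""
    # Drop the ORIGIN keyword, then keep only runs of letters,
    # which discards the position numbers and all whitespace.
    body = origin_text.replace("ORIGIN", "", 1)
    return "".join(re.findall(r"[a-z]+", body.lower()))


example = """ORIGIN
        1 aaaaagagaa actgttggga
"""
print(origin_to_sequence(example))  # prints aaaaagagaaactgttggga
```

Because only letters are kept, the same function also copes with position numbers that run into the neighbouring bases.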

Page 23

FASTA: Sequence Format
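A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below writes a sequence in that layout. The function name `to_fasta`, the 70-character line width, and the example header text are assumptions for illustration.

```python
def to_fasta(identifier: str, description: str, sequence: str, width: int = 70) -> str:
    """Format one sequence as a FASTA record (hypothetical helper)."""
    header = f">{identifier} {description}".rstrip()
    # Wrap the sequence into fixed-width lines.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *lines]) + "\n"


# First 60 bases of the EGF mRNA record shown in the GenBank slides.
bases = "aaaaagagaaactgttgggagaggaatcgtatctccatatttcttctttcagccccaatc"
print(to_fasta("NM_001963.3", "Homo sapiens EGF, mRNA", bases, width=30))
```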

Page 24

Sequence Viewer Graphics

Page 25

RefSeq

Page 26

RefSeq

• Database of reference sequences
– http://www.ncbi.nlm.nih.gov/RefSeq/

• Curated
– Many experimentally validated
– Some partially validated via ESTs
– Some computationally predicted

• Non-redundant: one record for each gene, or each splice variant, from each organism represented

Page 27

Accession Numbers

• DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data

• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.

• RefSeq identifiers include the following formats:
– Complete chromosome: NC_######
– Genomic contig: NT_######
– mRNA (DNA format): NM_######
– Protein: NP_######

Page 28

Accession Numbers: More Examples

Accession          Molecule   Description
AC_123456          Genomic    Alternate complete genomic
AP_123456          Protein    Protein products; alternate
NG_123456          Genomic    Incomplete genomic regions
NR_123456          RNA        Non-coding transcripts
NW_123456          Genomic    Genomic assemblies
NZ_ABCD12345678    Genomic    Whole genome shotgun data
XM_123456          mRNA       Transcript products
XP_123456          Protein    Protein products
XR_123456          RNA        Transcript products
YP_123456          Protein    Protein products
ZP_12345678        Protein    Protein products
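Because the two-letter prefix encodes the molecule type, an accession can be classified without looking the record up. The sketch below uses only the prefixes listed on these two slides. `REFSEQ_PREFIXES` and `classify_accession` are hypothetical names, and the table is not a complete list of RefSeq prefixes.

```python
# Prefix -> molecule type, taken from the accession slides above.
REFSEQ_PREFIXES = {
    "AC": "Genomic", "NC": "Genomic", "NG": "Genomic",
    "NT": "Genomic", "NW": "Genomic", "NZ": "Genomic",
    "NM": "mRNA", "XM": "mRNA",
    "NR": "RNA", "XR": "RNA",
    "AP": "Protein", "NP": "Protein", "XP": "Protein",
    "YP": "Protein", "ZP": "Protein",
}


def classify_accession(accession: str):
    """Return the molecule type for a RefSeq-style accession, or None."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not in the RefSeq format shown here
    return REFSEQ_PREFIXES.get(prefix.upper())


print(classify_accession("NM_001963.3"))  # prints mRNA
```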

Page 29

EST

Page 30

EST

• Expressed Sequence Tags database (dbEST) is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or "Expressed Sequence Tags", from a number of organisms

• http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucest&cmd=search&term=

Page 31

EST

• mRNA: transcripts of genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
– Copy of mRNA made using the mRNA as a template
– Sequence is complementary to the mRNA

• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
– Partial cDNA sequence
– Can be 5’ or 3’
– Typical size: 200-500 bp
– Represents mRNA actively transcribed in the cell
– Used to identify genes, alternative splicing, etc.
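To make "complementary to mRNA" concrete, the sketch below derives the first-strand cDNA for a short mRNA. It assumes the mRNA is written 5' to 3' in the RNA alphabet (A, C, G, U) and returns the cDNA 5' to 3', which is the reverse complement written with T instead of U. The name `first_strand_cdna` and the six-base example are illustrative.

```python
# RNA base -> complementary DNA base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}


def first_strand_cdna(mrna: str) -> str:
    """Return the first-strand cDNA (5'->3') for an mRNA given 5'->3'."""
    # The two strands are antiparallel, so read the mRNA backwards
    # while pairing each base with its complement.
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


print(first_strand_cdna("AUGGCC"))  # prints GGCCAT
```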

Page 32

Access to dbEST Data

• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous ftp and through Entrez

• The nucleotide sequences may be searched using the BLAST server
– The TBLASTN program, which takes an amino acid query sequence and compares it with six-frame translations of dbEST DNA sequences, is particularly useful.

• EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the /repository/dbEST directory at ftp.ncbi.nih.gov
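The six-frame translation mentioned for TBLASTN can be sketched as follows: a DNA sequence is translated in three reading frames on the given strand and three on its reverse complement. This is only an illustration of the idea using the standard genetic code, not how BLAST is implemented. All function names and the frame labels (`+1` to `-3`) are assumptions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict:
    """Return the translations of all six reading frames, keyed by label."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

A protein query can then be compared against each of the six translated strings, which is why a search of this kind can find a match in an EST whose reading frame and strand are unknown.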

Page 33

UniGene

Page 34

UniGene

• www.ncbi.nlm.nih.gov/UniGene

• Each UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene)

• In addition to sequences of well-characterized genes, hundreds of thousands of novel expressed sequence tag (EST) sequences have been included.

• UniGene may be of use as a resource for gene discovery.

• UniGene has also been used by experimentalists to select reagents for gene mapping projects and large-scale expression analysis.

Page 35

Numbers of UniGene Entries

• Bos taurus (cow) 42,843
• Canis lupus familiaris (dog) 27,853
• Equus caballus (horse) 8,348
• Homo sapiens (human) 123,396
• Mus musculus (mouse) 78,289
• Ovis aries (sheep) 18,814
• Rattus norvegicus (Norway rat) 63,434
• Sus scrofa (pig) 51,576
• Danio rerio (zebrafish) 51,481

Page 36

UniGene

• UniGene is a useful tool to look up information about expressed genes

• UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression

Page 37

Protein Structure

Page 38
Page 39

• Now… let’s give these databases a closer look with a lab