Genomics and Personalized Care in Health Systems
Lecture 2 Databases
Leming Zhou, PhD
School of Health and Rehabilitation Sciences
Department of Health Information Management

Outline
• Nucleotide and protein sequence databases
  – NCBI
    • GenBank, RefSeq, dbEST, UniGene
  – PDB
• FlyBase
• dbSNP
• OMIM and HuGE Navigator

Molecular Biology Databases
• Categories
  – Nucleotide Sequence Databases
  – Protein Sequence Databases
  – Structure Databases
  – Metabolic and Signaling Pathways
  – Human Genes and Diseases
  – Microarray Data and other Expression Databases
  – …
• Each database contains specific information
• The databases are interrelated
Nucleotide and Protein Sequence Databases

NCBI
• Created as a part of the National Library of Medicine in 1988
  – Establish public databases
  – Perform research in computational biology
  – Develop software tools for sequence analysis
  – Disseminate biomedical information
• Databases
  – Sequence, such as GenBank, RefSeq, dbSNP
  – Literature, such as PubMed, OMIM
• Tools
  – Entrez, BLAST, Cn3D, etc.
NCBI Homepage

Molecular Databases
• Primary Databases
  – Original submissions by experimentalists
  – Database staff organize the records but don't add additional information, for instance GenBank
• Derivative Databases
  – Human curated
    • Compilation and correction of data
    • Example: SWISS-PROT, NCBI RefSeq
  – Computationally derived
    • Example: UniGene

GenBank
• http://www.ncbi.nlm.nih.gov/Genbank
• Nucleotide-only sequence database
• GenBank data
  – Direct submissions: individual records (BankIt, Sequin)
  – Batch submissions via email (EST, GSS, STS)
  – ftp accounts established for sequencing centers
• Data shared nightly among three collaborating databases
  – GenBank
  – DNA Database of Japan (DDBJ)
  – European Molecular Biology Laboratory Database (EMBL)

GenBank Release 187.0
• ftp://ftp.ncbi.nih.gov/genbank/
• Full release every two months
• Incremental and cumulative updates daily
Release 187.0 (12/15/2011)
• 146,413,798 sequences
• 135,117,731,375 base pairs

GenBank Record (Header)
LOCUS       NM_001963    5600 bp    mRNA    linear   PRI 15-JAN-2012
DEFINITION  Homo sapiens epidermal growth factor (EGF), transcript variant 1, mRNA.
ACCESSION   NM_001963
VERSION     NM_001963.4  GI:296011011
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 5600)
  AUTHORS   de Diesbach,M.T., Cominelli,A., N'Kuli,F., Tyteca,D. and Courtoy,P.J.
  TITLE     Acute ligand-independent Src activation mimics low EGF-induced
            EGFR surface signalling and redistribution into recycling endosomes
  JOURNAL   Exp. Cell Res. 316 (19), 3239-3253 (2010)
  PUBMED    20832399

GenBank Record (Features)
FEATURES             Location/Qualifiers
     source          1..5600
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="4"
                     /map="4q25"
     gene            1..5600
                     /gene="EGF"
                     /gene_synonym="HOMG4; URG"
                     /note="epidermal growth factor"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
     exon            1..579
                     /number=1
     CDS             453..4076
                     /codon_start=1
                     /protein_id="NP_001954.2"
                     /db_xref="GI:166362728"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
                     /translation="MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP…"
     exon            580..779
                     /number=2
     exon            780..961
                     /number=3
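Feature locations in a GenBank record are 1-based, inclusive spans written as start..end, so the CDS of this record reads 453..4076. The sketch below is a minimal illustration of how such a span can be turned into a length; the helper names are hypothetical, and it handles only the simple start..end form, not joins or complements.

```python
import re


def parse_span(location: str) -> tuple[int, int]:
    """Parse a simple GenBank location 'start..end' (1-based, inclusive)."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    return int(match.group(1)), int(match.group(2))


def span_length(location: str) -> int:
    """Number of bases covered by an inclusive start..end span."""
    start, end = parse_span(location)
    return end - start + 1


cds_length = span_length("453..4076")   # CDS of the EGF record
codons = cds_length // 3                # includes the stop codon
print(cds_length, codons)               # 3624 1208
```

A CDS length that is a multiple of three is a quick sanity check on a restored location; here 1208 codons would correspond to 1207 residues plus a stop codon, assuming an ordinary terminated coding sequence.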

GenBank Record (Sequence)
ORIGIN
1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
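The ORIGIN block stores the sequence in groups of ten bases, with a running coordinate at the start of each line. A minimal sketch of how that layout could be flattened into a FASTA entry is shown below; the function names and the restriction to the letters a, c, g, t, n are assumptions made for illustration, not part of any NCBI tool.

```python
BASES = set("acgtn")  # assumption: plain bases only, no other ambiguity codes


def origin_to_sequence(lines):
    """Join ORIGIN lines into one string, dropping coordinates and spaces."""
    chunks = []
    for line in lines:
        for token in line.lower().split():
            # Coordinates ("61") and the ORIGIN keyword fail this test.
            if set(token) <= BASES:
                chunks.append(token)
    return "".join(chunks)


def to_fasta(name, sequence, width=70):
    """Format a sequence as a FASTA entry with fixed-width lines."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(rows)


origin_lines = [
    "ORIGIN",
    "1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc",
    "61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt",
]
sequence = origin_to_sequence(origin_lines)
print(len(sequence))
print(to_fasta("NM_001963 (first 120 bases)", sequence))
```

Only the first two lines of the record are used here, so the result is a fragment rather than the full 5600 bp mRNA.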

FASTA Format
>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
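A FASTA entry is a header line beginning with ">" followed by the sequence. In the NCBI header style used on this slide, the identifier part is a run of tag|value pairs (gi|number, ref|accession) and the free-text description follows the first space. The sketch below shows one way to pull those pieces apart; the function name is hypothetical, and the version suffix ".2" in the example is the assumed original form of the accession.

```python
def parse_ncbi_fasta_header(header: str) -> dict:
    """Split an NCBI-style FASTA header into gi number, accession, description."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    ids, _, description = header[1:].partition(" ")
    fields = ids.strip("|").split("|")
    # Fields alternate between a tag (gi, ref) and its value.
    tags = dict(zip(fields[0::2], fields[1::2]))
    return {
        "gi": tags.get("gi"),
        "accession": tags.get("ref"),
        "description": description.strip(),
    }


header = (">gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 "
          "(TP53), transcript variant 4, mRNA")
record = parse_ncbi_fasta_header(header)
print(record["gi"], record["accession"])
```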

Too Many Results
Search Limits
Reduced Search Results
Gene Record

RefSeq
• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted

Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
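Because the RefSeq prefix encodes the molecule type, a record's type can be read off its accession without opening it. The lookup below is a small sketch limited to the prefixes listed on this slide; the function name and the "unknown prefix" fallback are illustrative choices.

```python
# Prefixes exactly as listed on the slide; RefSeq defines others as well.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_molecule_type(accession: str) -> str:
    """Return the molecule type implied by a RefSeq accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown prefix")


for accession in ("NM_001963", "NP_001954.2", "AB123456"):
    print(accession, "->", refseq_molecule_type(accession))
```

The first two accessions are the EGF mRNA and protein from the GenBank record shown earlier; the third is a made-up non-RefSeq identifier used only to exercise the fallback.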
EST

EST
• mRNA: transcribed from genomic regions that are active in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using the mRNA as a template
  – Sequence is complementary to the mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous ftp and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous ftp, in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains), which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
• CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
• Pfam: http://www.sanger.ac.uk/Software/Pfam
• PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org

FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)
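To put these percentages in absolute terms, the arithmetic can be worked through for a genome of about 3 billion base pairs. That genome size is an assumption brought in for illustration (the slides do not state it); integer arithmetic is used so the results are exact.

```python
GENOME_BP = 3_000_000_000  # assumed approximate human genome size, not from the slides

# 0.1% of positions differ between two unrelated individuals.
differing_positions = GENOME_BP // 1000

# About 90% of those differences are SNPs.
snp_differences = differing_positions * 9 // 10

# A density of 1 SNP per 300 bp across the same genome.
snps_at_one_per_300 = GENOME_BP // 300

print(differing_positions)   # 3000000
print(snp_differences)       # 2700000
print(snps_at_one_per_300)   # 10000000
```

Under that assumption, two individuals differ at roughly 3 million positions, about 2.7 million of them SNPs. The last figure checks the density quoted on the dbSNP slide: 1 SNP per 300 bp gives the "roughly 10 million" SNPs in the population as a whole, which is a larger number because it counts every polymorphic site, not just the ones where two particular people differ.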

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions in which a sequence motif, usually 1-100 bases long, repeats in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
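The frequency rule above amounts to a single threshold test. The sketch below is an illustration only: the function name is hypothetical, and since the slide does not say how a frequency of exactly 1% is classified, treating it as a mutation is an arbitrary choice made here.

```python
SNP_THRESHOLD = 0.01  # the 1% allele-frequency cutoff from the slide


def classify_single_base_change(allele_frequency: float) -> str:
    """Label a single base change as 'SNP' or 'mutation' by population frequency."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    # Assumption: exactly 1% falls on the 'mutation' side of the cutoff.
    return "SNP" if allele_frequency > SNP_THRESHOLD else "mutation"


print(classify_single_base_change(0.05))    # SNP
print(classify_single_base_change(0.001))   # mutation
```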

SNPs
• SNPs can occur anywhere in a genome; they are classified by location
  – Many SNPs lie in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP depends strongly on where it occurs (gene or non-gene region) and on the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription, quantitatively or qualitatively
  – Affect gene translation, quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single nucleotide swap (an A for a T), which changes one amino acid in hemoglobin to valine instead of glutamic acid
Image source: httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
Figure: 1. Normal red blood cells; 2. Sickled red blood cells
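The sickle cell substitution can be followed at the codon level. In the standard genetic code GAG encodes glutamic acid and GTG encodes valine, so swapping the middle A for a T changes the amino acid. The sketch below uses a deliberately tiny codon table (three entries, enough for this example only) and hypothetical helper names.

```python
# Minimal excerpt of the standard genetic code (DNA coding-strand codons).
CODON_TABLE = {
    "GAG": "glutamic acid",
    "GTG": "valine",
    "CAG": "glutamine",
}


def substitute(codon: str, position: int, base: str) -> str:
    """Return the codon with one base replaced (position is 0-based)."""
    return codon[:position] + base + codon[position + 1:]


normal = "GAG"
sickle = substitute(normal, 1, "T")  # A -> T at the middle position

print(normal, "->", CODON_TABLE[normal])
print(sickle, "->", CODON_TABLE[sickle])
```

Glutamine (CAG) is included in the table only to show that it is a different amino acid from glutamic acid, with a different codon.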

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has for any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (SeattleSNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) IDs start with rs

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles="A/G"|mol="Genomic"
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
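The IUPAC table is what makes a dbSNP flank sequence readable: the single ambiguity letter at the variant position stands for the set of observed alleles. A short sketch of the lookup in both directions follows; the dictionary reproduces the table above, while the function names are hypothetical.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(code: str) -> set:
    """Return the set of bases an IUPAC code stands for."""
    return set(IUPAC[code.upper()])


def code_for(alleles) -> str:
    """Return the IUPAC code that represents exactly this set of bases."""
    wanted = {a.upper() for a in alleles}
    for code, bases in IUPAC.items():
        if set(bases) == wanted:
            return code
    raise ValueError(f"no IUPAC code for {sorted(wanted)}")


print(sorted(expand("R")))        # ['A', 'G']
print(code_for(["A", "C"]))       # M
```

R is the code at the variant position of the ss5586300 record (alleles A and G), and M is the code a record with alleles A and C would carry.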

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using the wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
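Queries like these are plain strings of term[Field] pieces joined by Boolean operators, so they are easy to assemble programmatically. The sketch below builds one and wraps it in a search URL without sending any request. The helper names are hypothetical, the use of NCBI's E-utilities esearch endpoint is an assumption not made in the lecture, and the field tags are copied from the slide and may not match what the current dbSNP index accepts.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: NCBI E-utilities esearch endpoint (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term: str, tag: str) -> str:
    """Format one search term with its field tag, e.g. 1[CHR]."""
    return f"{term}[{tag}]"


def any_of(*terms: str) -> str:
    """Join terms with OR inside parentheses so precedence is explicit."""
    return "(" + " OR ".join(terms) + ")"


def build_search_url(query: str, retmax: int = 20) -> str:
    """Build (but do not send) an esearch URL for the SNP database."""
    return ESEARCH_URL + "?" + urlencode({"db": "snp", "term": query, "retmax": retmax})


query = any_of(field("1", "CHR"), field("2", "CHR")) + " NOT " + field("unknown", "METHOD")
url = build_search_url(query)
print(query)
print(url)
```

The parentheses around the OR group are a deliberate choice: they make the intended grouping explicit instead of relying on how the search engine orders the operators.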
Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles="A/C"|mol="Genomic"|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.
Field                      Meaning
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP: ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Position of the variation allele (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen             Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs: the two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by /
Lower or upper case        Sequence in lower case marks regions identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)
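The field rules above are regular enough to parse mechanically, and the allelePos rule (5' flank length plus 1) gives a built-in consistency check. The following is a minimal sketch with hypothetical function names; it assumes header values may or may not be wrapped in double quotes and simply skips positional fields such as the submitter handle.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Parse a dbSNP FASTA header into a dict of its fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    parts = header[1:].strip().strip("|").split("|")
    record = {"source": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        if "=" in part:  # key=value fields; handle/submitter id are skipped
            key, _, value = part.partition("=")
            record[key] = value.strip('"')
    return record


def flank_lengths(record: dict) -> tuple:
    """Return (5' flank length, 3' flank length) from allelePos and total length."""
    total = int(record.get("totalLen") or record.get("len"))
    allele_pos = int(record["allelePos"])
    # The variation itself counts as one position.
    return allele_pos - 1, total - allele_pos


rs = parse_dbsnp_header(
    '>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|'
    'snpclass=1|alleles="A/C"|mol="Genomic"|build=130'
)
print(rs["accession"], rs["alleles"].split("/"), flank_lengths(rs))
```

For rs799916 this gives flanks of 300 and 300 bases around the variant position, which agrees with the 30 ten-base blocks on each side of the M in the sequence.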

GeneView of a SNP
Links to Various Gene Records
Gene and Disease

Disease-Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated from the results of published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder, and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator
Finding Disease-Causing Genes
Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases, and what is known about them, are stored in databases
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
Department of Health Information Management
Outlinebull Nucleotide and protein sequence databases
ndash NCBI
bull GenBank RefSeq dbEST UniGene
ndash PDB
bull Flybasebull dbSNPbull OMIM and HuGE Navigator
Department of Health Information Management
Molecular Biology Databasesbull Categories
ndash Nucleotide Sequence Databasesndash Protein Sequence Databasesndash Structure Databasesndash Metabolic and Signaling Pathwaysndash Human Genes and Diseasesndash Microarray Data and other Expression Databasesndash hellip
bull Each Database contains specific information
bull Each of these databases is interrelated
Nucleotide and Protein Sequence Databases
Department of Health Information Management
NCBIbull Created as a part of National Library of
Medicine in 1988ndash Establish public databases
ndash Perform research in computational biology
ndash Develop software tools for sequence analysis
ndash Disseminate biomedical information
bull Databasesndash Sequence such as GenBank RefSeq dbSNP
ndash Literature such as PubMed OMIM
bull Toolsndash Entrez Blast Cn3D etc
NCBI Homepage
Department of Health Information Management
Molecular Databasesbull Primary Databases
ndash Original submissions by experimentalists
ndash Database staff organize but donrsquot add additional
ndash Information for instance GenBank
bull Derivative Databasesndash Human curated
bull Compilation and correction of data
bull Example SWISS-PROT NCBI RefSeq
ndash Computationally Derived
bull Example UniGene
Department of Health Information Management
GenBankbull httpwwwncbinlmnihgovGenbank
bull Nucleotide only sequence database
bull GenBank Datandash Direct submissions individual records (BankIt Sequin)
ndash Batch submissions via email (EST GSS STS)
ndash ftp accounts established for sequencing centers
bull Data shared nightly amongst three collaborating databasesndash GenBank
ndash DNA Database of Japan (DDBJ)
ndash European Molecular Biology Laboratory Database (EMBL)
Department of Health Information Management
Department of Health Information Management
GenBank Release 1870bull ftpftpncbinihgovgenbankbull Full release every two monthsbull Incremental and cumulative updates daily
Release 1810 (12152011)
bull 146413798 Sequences bull 135117731375 Base Pairs
Department of Health Information Management
GenBank Record (Header)LOCUS NM_001963 5600 bp mRNA linear PRI 15-JAN-2012 DEFINITION Homo sapiens epidermal growth factor (EGF)
transcript variant 1 mRNA ACCESSION NM_001963 VERSION NM_0019634 GI296011011 KEYWORDS SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota Metazoa Chordata
Craniata Vertebrata Euteleostomi Mammalia Eutheria Euarchontoglires Primates Haplorrhini Catarrhini Hominidae Homo
REFERENCE 1 (bases 1 to 5600) AUTHORS de DiesbachMT CominelliA NKuliF
TytecaD and CourtoyPJ TITLE Acute ligand-independent Src activation mimics low EGF-induced EGFR surface signalling and redistribution into recycling endosomes
JOURNAL Exp Cell Res 316 (19) 3239-3253 (2010) PUBMED 20832399
Department of Health Information Management
GenBank Record (Features)FEATURES LocationQualifiers source 15600
organism=Homo sapiens mol_type=mRNA db_xref=taxon9606 chromosome=4 map=4q25
gene 15600 gene=EGF gene_synonym=HOMG4 URG note=epidermal growth factor db_xref=GeneID1950ldquo db_xref=MIM131530
exon 1579 number=1
CDS 4534076 codon_start=1 protein_id=NP_0019542 db_xref=GI166362728ldquo db_xref=GeneID1950ldquo db_xref=MIM131530 translation=MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP hellip
exon 580779 number=2
exon 780961 number=3
Department of Health Information Management
GenBank Record (Sequence)ORIGIN 1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
Department of Health Information Management
FASTA Formatgtgi|371502116|ref|NM_0011261132| Homo sapiens tumor protein p53 (TP53) transcript variant 4 mRNA GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
Department of Health Information Management
Too Many Results
Department of Health Information Management
Search Limits
Department of Health Information Management
Reduced Search Results
Department of Health Information Management
Gene Record

RefSeq
• Database of reference sequences
  - http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  - Many experimentally validated
  - Some partially validated via ESTs
  - Some computationally predicted
• Non-redundant: one record for each gene or each splice variant from each organism represented
• Status codes
  - Provisional (temporary)
  - Reviewed
  - Predicted

Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  - Complete chromosome: NC_
  - Genomic contig: NT_
  - mRNA (DNA format): NM_, XM_
  - Protein: NP_, XP_
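
The prefix list above maps directly onto a lookup table. The sketch below is illustrative only: the dictionary holds just the six prefixes named on the slide, the function name `describe_accession` is made up for this example, and `AB123456` is a hypothetical non-RefSeq identifier. The EGF accessions come from the GenBank record shown earlier in the lecture.

```python
# Prefix-to-record-type mapping taken from the slide. Other RefSeq prefixes
# exist but are not listed there, so they are reported as "unknown" here.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def describe_accession(accession):
    """Return the record type implied by the first three characters of an accession."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


for acc in ("NM_001963", "NP_001954.2", "AB123456"):
    print(acc, "->", describe_accession(acc))
```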
EST

EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  - Copy of mRNA, made using mRNA as a template
  - Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  - Partial cDNA sequence
  - Can be 5' or 3'
  - Typical size: 200-500 bp
  - Represents mRNA actively transcribed in the cell
  - Used to identify genes, alternative splicing, etc.
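
The statement that cDNA is complementary to its mRNA template can be made concrete with a short sketch. It assumes the mRNA is written 5' to 3' in the RNA alphabet (A, U, G, C) and returns the first-strand cDNA, also written 5' to 3'. The function name and the 12-base fragment are hypothetical, chosen only for illustration.

```python
# Watson-Crick pairing from an RNA template base to the DNA base that pairs with it.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first-strand cDNA (5'->3') for an mRNA written 5'->3'."""
    # Reversing the template keeps the result in 5'->3' orientation.
    return "".join(RNA_TO_CDNA[base] for base in reversed(mrna.upper()))


mrna = "AUGCUGCUCACU"  # hypothetical 12-base fragment, not a full transcript
print(first_strand_cdna(mrna))  # AGTGAGCAGCAT
```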

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  - Homo sapiens (human): 8,315,294
  - Mus musculus (mouse): 4,853,562
  - Arabidopsis thaliana (thale cress): 1,529,700
  - Danio rerio (zebrafish): 1,488,275
  - Drosophila melanogaster (fruit fly): 821,005
  - Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous ftp and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous ftp in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi
Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro: protein families and domains, http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL: curated protein sequences, http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
• CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
• Pfam: http://www.sanger.ac.uk/Software/Pfam
• PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  - PDB: http://www.pdb.org
  - SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  - MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  - Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org

FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among different individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of variable-length sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions, or translocations of large DNA fragments
  - Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  - A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  - When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
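
The 1% convention above is simple arithmetic on allele counts. The sketch below assumes the counts come from a genotyped sample (two alleles per diploid individual). The function name and the example counts are hypothetical. The slide does not say how to label a frequency of exactly 1%, so treating it as a mutation is an arbitrary choice made here.

```python
def classify_single_base_change(variant_allele_count, total_alleles, threshold=0.01):
    """Label a single-base change as 'SNP' or 'mutation' using a 1% frequency cutoff."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    frequency = variant_allele_count / total_alleles
    # Assumption: exactly 1% is labeled 'mutation'; the slide leaves this case open.
    label = "SNP" if frequency > threshold else "mutation"
    return frequency, label


# Hypothetical counts: 1,000 diploid individuals give 2,000 alleles.
for count in (30, 4):
    frequency, label = classify_single_base_change(count, 2000)
    print(f"{count}/2000 = {frequency:.2%} -> {label}")
```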

SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  - Many SNPs are in genomic non-coding regions
  - SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  - No consequence
  - Affect gene transcription quantitatively or qualitatively
  - Affect gene translation quantitatively or qualitatively
  - Change protein structure and function
  - Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  - e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  - e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single base change (an A swapped for a T), which causes valine to be inserted instead of glutamic acid in hemoglobin
Image source: http://mmcenters.discoveryhospital.com/shared/enc/img_htm/IM-56.htm
1. Normal red blood cells 2. Sickled red blood cells
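
The sickle cell change is a compact example of a non-synonymous SNP. The commonly cited substitution in the beta-globin gene (HBB) turns the codon GAG (glutamic acid) into GTG (valine); this codon detail is background knowledge rather than something stated on the slide. The sketch below uses a two-entry excerpt of the genetic code, and its function names are illustrative.

```python
# Two-entry excerpt of the standard genetic code; a full table has 64 codons.
CODON_TABLE = {"GAG": "Glu", "GTG": "Val"}


def point_substitution(codon, index, new_base):
    """Return the codon with the base at a 0-based index replaced."""
    return codon[:index] + new_base + codon[index + 1:]


def classify_change(before, after):
    """Call a codon change synonymous if the encoded amino acid is unchanged."""
    same = CODON_TABLE[before] == CODON_TABLE[after]
    return "synonymous" if same else "non-synonymous"


normal = "GAG"                               # glutamic acid codon
sickle = point_substitution(normal, 1, "T")  # middle A replaced by T
print(normal, CODON_TABLE[normal], "->", sickle, CODON_TABLE[sickle],
      "(" + classify_change(normal, sickle) + ")")
```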

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or one of the alternative DNA sequences at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  - http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  - http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  - http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  - http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  - Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  - Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  - Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify, and their number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted record
    • The original observations of sequence variation; submitted SNP (ss) record IDs start with ss
  - Computationally annotated record
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) IDs start with rs

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
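
The IUPAC table above is easy to hold as a dictionary, which makes it possible to go from an ambiguity code to its bases and back. This is a small sketch and the function names are illustrative. The two lookups at the end correspond to the A/G and A/C variants that appear as R and M in the dbSNP records in this lecture.

```python
# IUPAC nucleotide codes from the table, each mapped to the bases it stands for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup keyed by the set of bases, so the order of alleles does not matter.
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def expand(code):
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def code_for(alleles):
    """Return the IUPAC code for a collection of single-base alleles."""
    return BASES_TO_CODE[frozenset(base.upper() for base in alleles)]


print(code_for(["A", "G"]))  # R
print(code_for(["A", "C"]))  # M
print(sorted(expand("H")))   # ['A', 'C', 'T']
```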

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of ss records and batch search; allows SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
• Example: BRC*[Gene Name]
  Description: search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
• Example: 1[CHR] AND (frameshift[Function_Class])
  Description: search SNPs located on chromosome 1 with function class frameshift
• Example: 1[CHR] OR 2[CHR]
  Description: search all SNPs on chromosome 1 or 2
• Example: 1[CHR] OR 2[CHR] NOT unknown[METHOD]
  Description: search all SNPs on chromosome 1 or 2 detected by all methods except unknown
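
The search examples all follow one pattern: a value, a field name in square brackets, and Boolean operators between terms. The sketch below only assembles query strings in that pattern and does not contact any NCBI service. The helper names `field_term` and `combine` are hypothetical, and the field names are the ones shown in the examples.

```python
def field_term(value, field):
    """Format one search term as value[FIELD], the pattern used in the examples."""
    return f"{value}[{field}]"


def combine(operator, left, right):
    """Join two query fragments with AND, OR, or NOT."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError(f"unsupported operator: {operator}")
    return f"{left} {operator} {right}"


chr1 = field_term("1", "CHR")
chr2 = field_term("2", "CHR")

frameshift_on_chr1 = combine("AND", chr1, "(" + field_term("frameshift", "Function_Class") + ")")
chr1_or_chr2 = combine("OR", chr1, chr2)
known_method_only = combine("NOT", chr1_or_chr2, field_term("unknown", "METHOD"))

for query in (frameshift_on_chr1, chr1_or_chr2, known_method_only):
    print(query)
```

Building queries from small pieces like this makes it harder to mistype a bracket or operator when many searches are run in a row.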

Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Starting from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results
dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
Header: the FASTA header line starts with > and has fields separated by |. Each field is explained below.
• gnl: internal use
• dbSNP: database name
• ss or rs number: dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
• allelePos: variation allele position (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
• len/totalLen: total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation; the variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
• handle|submitted_snp_id: only for submitted SNPs; the two fields after totalLen are the submitter handle and the submitter SNP ID
• taxid: NCBI taxonomy ID
• mol: molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
• snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
• alleles: lists the alleles of the SNP, separated by /
• Lower or upper case: lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
• Green color: used for assay sequence (observed by the submitter)
• Black color: used for flank sequence (extracted from sequence databases)
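
The field descriptions above are enough to parse a dbSNP FASTA header and check the stated relationship between allelePos, totalLen, and the flank lengths. The sketch below uses the rs799916 header from the SNP Location slide, with the allele separator assumed to be `/` (the punctuation did not survive in the slide text). The function name `parse_dbsnp_header` is illustrative and is not an NCBI API.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header into a dict of its '|'-separated fields."""
    parts = header.lstrip(">").strip().split("|")
    record = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        # Only key=value fields are kept; bare fields such as a submitter
        # handle have no '=' and are skipped in this sketch.
        if "=" in part:
            key, value = part.split("=", 1)
            record[key.strip()] = value.strip().strip('"')
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
record = parse_dbsnp_header(header)

allele_pos = int(record["allelePos"])
total_len = int(record["totalLen"])
five_prime_len = allele_pos - 1           # allelePos is the 5' flank length plus 1
three_prime_len = total_len - allele_pos  # the variation itself counts as 1 base

print(record["accession"], record["alleles"].split("/"))
print(five_prime_len, three_prime_len)
```

For this record both flanks come out at 300 bases, which matches the thirty 10-base blocks on each side of the M code in the sequence.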

GeneView of a SNP
Links to Various Gene Records
Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder, which contains the following information
  - Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; when retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - Information on population prevalence of genetic variants
  - Gene-disease associations
  - Gene-gene and gene-environment interactions

HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method named "Genome-Wide Association Study" or "GWAS". The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
Department of Health Information Management
Molecular Biology Databasesbull Categories
ndash Nucleotide Sequence Databasesndash Protein Sequence Databasesndash Structure Databasesndash Metabolic and Signaling Pathwaysndash Human Genes and Diseasesndash Microarray Data and other Expression Databasesndash hellip
bull Each Database contains specific information
bull Each of these databases is interrelated
Nucleotide and Protein Sequence Databases
Department of Health Information Management
NCBIbull Created as a part of National Library of
Medicine in 1988ndash Establish public databases
ndash Perform research in computational biology
ndash Develop software tools for sequence analysis
ndash Disseminate biomedical information
bull Databasesndash Sequence such as GenBank RefSeq dbSNP
ndash Literature such as PubMed OMIM
bull Toolsndash Entrez Blast Cn3D etc
NCBI Homepage
Department of Health Information Management
Molecular Databasesbull Primary Databases
ndash Original submissions by experimentalists
ndash Database staff organize but donrsquot add additional
ndash Information for instance GenBank
bull Derivative Databasesndash Human curated
bull Compilation and correction of data
bull Example SWISS-PROT NCBI RefSeq
ndash Computationally Derived
bull Example UniGene
Department of Health Information Management
GenBankbull httpwwwncbinlmnihgovGenbank
bull Nucleotide only sequence database
bull GenBank Datandash Direct submissions individual records (BankIt Sequin)
ndash Batch submissions via email (EST GSS STS)
ndash ftp accounts established for sequencing centers
bull Data shared nightly amongst three collaborating databasesndash GenBank
ndash DNA Database of Japan (DDBJ)
ndash European Molecular Biology Laboratory Database (EMBL)
Department of Health Information Management
Department of Health Information Management
GenBank Release 1870bull ftpftpncbinihgovgenbankbull Full release every two monthsbull Incremental and cumulative updates daily
Release 1810 (12152011)
bull 146413798 Sequences bull 135117731375 Base Pairs
Department of Health Information Management
GenBank Record (Header)LOCUS NM_001963 5600 bp mRNA linear PRI 15-JAN-2012 DEFINITION Homo sapiens epidermal growth factor (EGF)
transcript variant 1 mRNA ACCESSION NM_001963 VERSION NM_0019634 GI296011011 KEYWORDS SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota Metazoa Chordata
Craniata Vertebrata Euteleostomi Mammalia Eutheria Euarchontoglires Primates Haplorrhini Catarrhini Hominidae Homo
REFERENCE 1 (bases 1 to 5600) AUTHORS de DiesbachMT CominelliA NKuliF
TytecaD and CourtoyPJ TITLE Acute ligand-independent Src activation mimics low EGF-induced EGFR surface signalling and redistribution into recycling endosomes
JOURNAL Exp Cell Res 316 (19) 3239-3253 (2010) PUBMED 20832399
Department of Health Information Management
GenBank Record (Features)FEATURES LocationQualifiers source 15600
organism=Homo sapiens mol_type=mRNA db_xref=taxon9606 chromosome=4 map=4q25
gene 15600 gene=EGF gene_synonym=HOMG4 URG note=epidermal growth factor db_xref=GeneID1950ldquo db_xref=MIM131530
exon 1579 number=1
CDS 4534076 codon_start=1 protein_id=NP_0019542 db_xref=GI166362728ldquo db_xref=GeneID1950ldquo db_xref=MIM131530 translation=MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP hellip
exon 580779 number=2
exon 780961 number=3
Department of Health Information Management
GenBank Record (Sequence)ORIGIN 1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
Department of Health Information Management
FASTA Formatgtgi|371502116|ref|NM_0011261132| Homo sapiens tumor protein p53 (TP53) transcript variant 4 mRNA GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
Department of Health Information Management
Too Many Results
Department of Health Information Management
Search Limits
Department of Health Information Management
Reduced Search Results
Department of Health Information Management
Gene Record
Department of Health Information Management
RefSeqbull Database of reference sequences
ndash httpwwwncbinlmnihgovRefSeq
bull Curatedndash Many experimentally validated
ndash Some partially validated via ESTs
ndash Some computationally predicted
bull Non-redundant one record for each gene or each splice variant from each organism represented
bull Status Codesndash Provisional (temporary)
ndash Reviewed
ndash Predicted
Department of Health Information Management
Department of Health Information Management
Page 26
Accession Numbersbull DNA sequences and other molecular data are
tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence
bull RefSeq identifiers include the following formatsndash Complete chromosome NC_
ndash Genomic contig NT_
ndash mRNA (DNA format) NM_ XM_
ndash Protein NP_ XP_
EST
Department of Health Information Management
ESTbull mRNA Genomic regions actively transcribed in
cellbull cDNA (complementary DNA)
ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA
bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify
bull Genes Alternative splicing etc
Department of Health Information Management
dbEST (release 120111 Dec 1
2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml
bull Number of Entries 71276166ndash Homo sapiens (human) 8315294
ndash Mus musculus (mouse) 4853562
ndash Arabidopsis thaliana (thale cress) 1529700
ndash Danio rerio (zebrafish) 1488275
ndash Drosophila melanogaster (fruit fly) 821005
ndash Gallus gallus (chicken) 600433
Department of Health Information Management
Access to dbEST Databull EST sequences are included in the EST division of
GenBank available from NCBI by anonymous ftp and through Entrez
bull The nucleotide sequences may be searched using the BLAST server
bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov
Protein Structure
Department of Health Information Management
Cn3D ftpftpncbinihgovcn3dCn3D-43msi
Department of Health Information Management
Crystal Structure of A Protein
Department of Health Information Management
Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains
httpwwwebiacukinterprobull Protein Information Resource (PIR)
httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences
httpwwwexpasychsprot bull UniProt
httpwwwexpasyuniprotorgindexshtml
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has become established in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.

SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and functions
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single nucleotide change (an A swapped for a T), which causes valine to be incorporated instead of glutamic acid in hemoglobin
Image source: httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
Figure labels: (1) normal red blood cells; (2) sickled red blood cells
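A minimal sketch of how a single-base change in a codon can be checked for its effect on the encoded amino acid, using the standard genetic code. The codons GAG and GTG are the commonly cited sickle cell change; they are not given on the slide and are supplied here as an assumption, as are the function and variable names.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (last base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_change(ref_codon, alt_codon):
    """Return (reference amino acid, variant amino acid, kind of change)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


# Assumed sickle cell codons: GAG (glutamic acid, E) changed to GTG (valine, V).
print(classify_coding_change("GAG", "GTG"))
# A change in the third base that leaves the amino acid unchanged.
print(classify_coding_change("GAG", "GAA"))
```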

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying which genotype a person has at a given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (SeattleSNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
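The SNP figures quoted in this lecture can be checked with simple arithmetic. This is a rough sketch: the genome size of about 3 billion base pairs is an assumption that does not appear on the slides, and the variable names are illustrative.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: approximate haploid human genome size
SNP_SPACING_BP = 300            # from the slide: on average 1 SNP per 300 bp

# Expected number of SNP positions across the genome.
estimated_snps = GENOME_SIZE_BP // SNP_SPACING_BP

# Two unrelated individuals differ at about 0.1% of positions,
# and roughly 90% of those differences are SNPs.
pairwise_differences = GENOME_SIZE_BP // 1000
pairwise_snp_differences = pairwise_differences * 9 // 10

print(f"Estimated SNPs in the population: {estimated_snps:,}")
print(f"Differences between two genomes:  {pairwise_differences:,}")
print(f"  of which SNPs (about 90%):      {pairwise_snp_differences:,}")
```

Under that assumed genome size, 1 SNP per 300 bp gives the "roughly 10 million" figure quoted on the slide.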

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted record
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with "ss"
  – Computationally annotated record
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) IDs start with "rs"

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
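The IUPAC table maps directly onto a lookup dictionary. The sketch below expands an ambiguity code into its bases, encodes a set of alleles as one code, and lists the ambiguous positions in a flanking sequence; all function names are illustrative assumptions.

```python
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}
# Reverse lookup: set of bases -> single IUPAC code.
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def expand(code):
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def encode(alleles):
    """Return the IUPAC code for a collection of single-base alleles."""
    return BASES_TO_CODE[frozenset(a.upper() for a in alleles)]


def ambiguous_positions(sequence):
    """List (1-based position, code) for every non-ACGT symbol, ignoring spaces."""
    compact = sequence.replace(" ", "")
    return [(i, base) for i, base in enumerate(compact, start=1)
            if base.upper() not in "ACGT"]


print(expand("R"))                     # the bases A and G
print(encode(["A", "C"]))              # M
print(ambiguous_positions("ACGR TY"))  # [(4, 'R'), (6, 'Y')]
```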

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records and batch search; allows SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
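Queries like those in the table can also be assembled programmatically. The sketch below only builds query strings and a request URL; it does not contact NCBI. The helper names are hypothetical, the NCBI E-utilities `esearch` endpoint is not mentioned in the lecture, and whether the current Entrez SNP index still accepts these exact field tags is an assumption that should be checked.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; the slides use the web form).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(value, tag):
    """Format one fielded search term, e.g. 1[CHR]."""
    return f"{value}[{tag}]"


def all_of(*terms):
    """Combine terms with AND."""
    return " AND ".join(terms)


def any_of(*terms):
    """Combine terms with OR, parenthesized so the group stays together."""
    return "(" + " OR ".join(terms) + ")"


def esearch_url(term, retmax=20):
    """Build (but do not send) an esearch request against the SNP database."""
    return ESEARCH_URL + "?" + urlencode({"db": "snp", "term": term, "retmax": retmax})


frameshift_on_chr1 = all_of(field("1", "CHR"), field("frameshift", "Function_Class"))
chr1_or_chr2 = any_of(field("1", "CHR"), field("2", "CHR"))

print(frameshift_on_chr1)
print(chr1_or_chr2)
print(esearch_url(frameshift_on_chr1))
```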

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP; "ss" refers to a submitted SNP accession, "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen             Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs; the two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green color is used for assay sequence (observed by the submitter)
ATCG (black)               Black color is used for flank sequence (extracted from sequence databases)
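The header layout described above can be parsed with a few lines of code. This is a minimal sketch, not an official parser: the function names are hypothetical, and the "/" in the `alleles` field of the example header is a reconstruction of punctuation lost in the transcript.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header line into a dictionary of fields."""
    if not header.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    tokens = [t for t in header[1:].strip().split("|") if t]
    if len(tokens) < 3 or tokens[0] != "gnl" or tokens[1] != "dbSNP":
        raise ValueError("not a dbSNP header")
    accession = tokens[2]
    if accession.startswith("ss"):
        record_type = "submitted"
    elif accession.startswith("rs"):
        record_type = "refSNP"
    else:
        raise ValueError("accession must start with ss or rs")
    record = {"accession": accession, "record_type": record_type, "extra": []}
    for token in tokens[3:]:
        key, sep, value = token.partition("=")
        if sep:
            record[key] = value.strip("'\"")
        else:
            # Fields without '=' (e.g. submitter handle and submitter SNP ID).
            record["extra"].append(token)
    for key in ("allelePos", "len", "totalLen", "taxid"):
        if key in record:
            record[key] = int(record[key])
    return record


def flank_lengths(record):
    """Return (5' flank length, 3' flank length); the variation itself counts as 1."""
    total = record.get("totalLen", record.get("len"))
    return record["allelePos"] - 1, total - record["allelePos"]


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
rec = parse_dbsnp_header(header)
print(rec["accession"], rec["record_type"], rec["alleles"], flank_lengths(rec))
```

For the refSNP header shown earlier, the allele position of 301 in a 601-base sequence corresponds to 300 flanking bases on each side.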

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder, which contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Nucleotide and Protein Sequence Databases
Department of Health Information Management
NCBIbull Created as a part of National Library of
Medicine in 1988ndash Establish public databases
ndash Perform research in computational biology
ndash Develop software tools for sequence analysis
ndash Disseminate biomedical information
bull Databasesndash Sequence such as GenBank RefSeq dbSNP
ndash Literature such as PubMed OMIM
bull Toolsndash Entrez Blast Cn3D etc
NCBI Homepage
Department of Health Information Management
Molecular Databasesbull Primary Databases
ndash Original submissions by experimentalists
ndash Database staff organize but donrsquot add additional
ndash Information for instance GenBank
bull Derivative Databasesndash Human curated
bull Compilation and correction of data
bull Example SWISS-PROT NCBI RefSeq
ndash Computationally Derived
bull Example UniGene
Department of Health Information Management
GenBankbull httpwwwncbinlmnihgovGenbank
bull Nucleotide only sequence database
bull GenBank Datandash Direct submissions individual records (BankIt Sequin)
ndash Batch submissions via email (EST GSS STS)
ndash ftp accounts established for sequencing centers
bull Data shared nightly amongst three collaborating databasesndash GenBank
ndash DNA Database of Japan (DDBJ)
ndash European Molecular Biology Laboratory Database (EMBL)
Department of Health Information Management
Department of Health Information Management
GenBank Release 1870bull ftpftpncbinihgovgenbankbull Full release every two monthsbull Incremental and cumulative updates daily
Release 1810 (12152011)
bull 146413798 Sequences bull 135117731375 Base Pairs
Department of Health Information Management
GenBank Record (Header)LOCUS NM_001963 5600 bp mRNA linear PRI 15-JAN-2012 DEFINITION Homo sapiens epidermal growth factor (EGF)
transcript variant 1 mRNA ACCESSION NM_001963 VERSION NM_0019634 GI296011011 KEYWORDS SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota Metazoa Chordata
Craniata Vertebrata Euteleostomi Mammalia Eutheria Euarchontoglires Primates Haplorrhini Catarrhini Hominidae Homo
REFERENCE 1 (bases 1 to 5600) AUTHORS de DiesbachMT CominelliA NKuliF
TytecaD and CourtoyPJ TITLE Acute ligand-independent Src activation mimics low EGF-induced EGFR surface signalling and redistribution into recycling endosomes
JOURNAL Exp Cell Res 316 (19) 3239-3253 (2010) PUBMED 20832399
Department of Health Information Management
GenBank Record (Features)FEATURES LocationQualifiers source 15600
organism=Homo sapiens mol_type=mRNA db_xref=taxon9606 chromosome=4 map=4q25
gene 15600 gene=EGF gene_synonym=HOMG4 URG note=epidermal growth factor db_xref=GeneID1950ldquo db_xref=MIM131530
exon 1579 number=1
CDS 4534076 codon_start=1 protein_id=NP_0019542 db_xref=GI166362728ldquo db_xref=GeneID1950ldquo db_xref=MIM131530 translation=MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP hellip
exon 580779 number=2
exon 780961 number=3
Department of Health Information Management
GenBank Record (Sequence)ORIGIN 1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
Department of Health Information Management
FASTA Formatgtgi|371502116|ref|NM_0011261132| Homo sapiens tumor protein p53 (TP53) transcript variant 4 mRNA GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
Department of Health Information Management
Too Many Results
Department of Health Information Management
Search Limits
Department of Health Information Management
Reduced Search Results
Department of Health Information Management
Gene Record
Department of Health Information Management
RefSeqbull Database of reference sequences
ndash httpwwwncbinlmnihgovRefSeq
bull Curatedndash Many experimentally validated
ndash Some partially validated via ESTs
ndash Some computationally predicted
bull Non-redundant one record for each gene or each splice variant from each organism represented
bull Status Codesndash Provisional (temporary)
ndash Reviewed
ndash Predicted
Department of Health Information Management
Department of Health Information Management
Page 26
Accession Numbersbull DNA sequences and other molecular data are
tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence
bull RefSeq identifiers include the following formatsndash Complete chromosome NC_
ndash Genomic contig NT_
ndash mRNA (DNA format) NM_ XM_
ndash Protein NP_ XP_
EST
Department of Health Information Management
ESTbull mRNA Genomic regions actively transcribed in
cellbull cDNA (complementary DNA)
ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA
bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify
bull Genes Alternative splicing etc
Department of Health Information Management
dbEST (release 120111 Dec 1
2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml
bull Number of Entries 71276166ndash Homo sapiens (human) 8315294
ndash Mus musculus (mouse) 4853562
ndash Arabidopsis thaliana (thale cress) 1529700
ndash Danio rerio (zebrafish) 1488275
ndash Drosophila melanogaster (fruit fly) 821005
ndash Gallus gallus (chicken) 600433
Department of Health Information Management
Access to dbEST Databull EST sequences are included in the EST division of
GenBank available from NCBI by anonymous ftp and through Entrez
bull The nucleotide sequences may be searched using the BLAST server
bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov
Protein Structure
Department of Health Information Management
Cn3D ftpftpncbinihgovcn3dCn3D-43msi
Department of Health Information Management
Crystal Structure of A Protein
Department of Health Information Management
Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains
httpwwwebiacukinterprobull Protein Information Resource (PIR)
httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences
httpwwwexpasychsprot bull UniProt
httpwwwexpasyuniprotorgindexshtml
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format
The Fasta header line starts with ">" and has fields separated by "|". Each field is explained below.

Header field               Meaning
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) on the Fasta sequence. It is always the 5' length plus 1
len / totalLen             Total number of bases of the Fasta sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green color is used for assay sequence (observed by the submitter)
ATCG (black)               Black color is used for flank sequence (extracted from sequence databases)
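
The header fields described above can be pulled apart with a few lines of code. The following is a minimal sketch, not an NCBI utility: `parse_snp_fasta_header` and `flank_lengths` are hypothetical names, and the "/" in the alleles field is assumed punctuation that the slide text seems to have dropped. It uses the rule that allelePos is the 5' length plus 1 to recover the two flank lengths from the header alone.

```python
def parse_snp_fasta_header(header):
    """Split a dbSNP FASTA header into its named fields."""
    parts = [p.strip() for p in header.lstrip(">").split("|") if p.strip()]
    record = {
        "prefix": parts[0],      # gnl (internal use)
        "database": parts[1],    # dbSNP
        "accession": parts[2],   # ss or rs number
    }
    for part in parts[3:]:
        # key=value fields; handle|submitted_snp_id have no "=" and are skipped here
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value.strip("'\"")
    return record


def flank_lengths(record):
    """Return (5' flank length, 3' flank length) implied by the header."""
    allele_pos = int(record["allelePos"])
    total = int(record.get("totalLen", record.get("len")))
    five_prime = allele_pos - 1          # allelePos = 5' length + 1
    three_prime = total - allele_pos     # the variation itself counts as 1 base
    return five_prime, three_prime


HEADER = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")

if __name__ == "__main__":
    rec = parse_snp_fasta_header(HEADER)
    print(rec["accession"], rec["alleles"])  # rs799916 A/C
    print(flank_lengths(rec))                # (300, 300)
```

For rs799916 this gives 300 bases on each side of the variant, which matches 300 + 1 + 300 = 601.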

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information:
  - description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - information on population prevalence of genetic variants
  - gene-disease associations
  - gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases
Department of Health Information Management
Disease Databasesbull Genes are involved in disease
bull Many diseases are well studied
bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim
ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method “Genome-Wide Association Study” (GWAS). The journal could be “PLoS Genetics”, “Nature Genetics”, etc. The disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
NCBI Homepage

Molecular Databases
• Primary Databases
  – Original submissions by experimentalists
  – Database staff organize but don’t add additional information (for instance, GenBank)
• Derivative Databases
  – Human curated
    • Compilation and correction of data
    • Example: SWISS-PROT, NCBI RefSeq
  – Computationally Derived
    • Example: UniGene

GenBank
• http://www.ncbi.nlm.nih.gov/Genbank
• Nucleotide-only sequence database
• GenBank Data
  – Direct submissions: individual records (BankIt, Sequin)
  – Batch submissions via email (EST, GSS, STS)
  – FTP accounts established for sequencing centers
• Data shared nightly amongst three collaborating databases
  – GenBank
  – DNA Database of Japan (DDBJ)
  – European Molecular Biology Laboratory Database (EMBL)

GenBank Release 187.0
• ftp://ftp.ncbi.nih.gov/genbank
• Full release every two months
• Incremental and cumulative updates daily

Release 181.0 (12/15/2011)
• 146,413,798 Sequences
• 135,117,731,375 Base Pairs

GenBank Record (Header)
LOCUS       NM_001963    5600 bp    mRNA    linear    PRI 15-JAN-2012
DEFINITION  Homo sapiens epidermal growth factor (EGF), transcript variant 1, mRNA.
ACCESSION   NM_001963
VERSION     NM_001963.4  GI:296011011
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 5600)
  AUTHORS   de Diesbach,M.T., Cominelli,A., N'Kuli,F., Tyteca,D. and Courtoy,P.J.
  TITLE     Acute ligand-independent Src activation mimics low EGF-induced EGFR surface signalling and redistribution into recycling endosomes
  JOURNAL   Exp. Cell Res. 316 (19), 3239-3253 (2010)
   PUBMED   20832399

GenBank Record (Features)
FEATURES             Location/Qualifiers
     source          1..5600
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="4"
                     /map="4q25"
     gene            1..5600
                     /gene="EGF"
                     /gene_synonym="HOMG4; URG"
                     /note="epidermal growth factor"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
     exon            1..579
                     /number=1
     CDS             453..4076
                     /codon_start=1
                     /protein_id="NP_001954.2"
                     /db_xref="GI:166362728"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
                     /translation="MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP…"
     exon            580..779
                     /number=2
     exon            780..961
                     /number=3
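
As a rough illustration of how the feature locations in this record can be used, the sketch below turns a location written as start..end into a length in bases and, for the CDS, into a protein length. The function names are hypothetical. It assumes a simple forward-strand location (no joins or complements) and a complete CDS whose last codon is the stop codon.

```python
def span_length(location: str) -> int:
    """Length in bases of a simple 'start..end' location (1-based, inclusive)."""
    start_text, end_text = location.split("..")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"unsupported location: {location!r}")
    return end - start + 1


def protein_length(cds_location: str) -> int:
    """Residues encoded by a complete CDS: number of codons minus the stop codon.

    Assumption: the CDS span includes the terminal stop codon.
    """
    bases = span_length(cds_location)
    if bases % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    return bases // 3 - 1
```

For the EGF record, the CDS location 453..4076 covers 3624 bases, that is 1208 codons, or 1207 amino acids once the stop codon is set aside.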

GenBank Record (Sequence)
ORIGIN
1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc

FASTA Format
>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
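
A FASTA file is simple enough to read by hand: a header line starting with > followed by one or more sequence lines. The sketch below is one minimal way to do that in Python. The function names are hypothetical, and the header splitter assumes the gi|number|ref|accession| layout seen in the example record, which not every FASTA file follows.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; sequence lines are joined together."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_ncbi_header(header: str) -> Dict[str, str]:
    """Split 'gi|<number>|ref|<accession>| description' into named parts."""
    ids, _, description = header.partition(" ")
    fields = ids.split("|")
    if len(fields) < 4 or fields[0] != "gi" or fields[2] != "ref":
        raise ValueError(f"unexpected header layout: {header!r}")
    return {"gi": fields[1], "accession": fields[3], "description": description}
```

Because read_fasta accepts any iterable of lines, it works equally on an open file handle or on a list of strings.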

Too Many Results

Search Limits

Reduced Search Results

Gene Record

RefSeq
• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status Codes
  – Provisional (temporary)
  – Reviewed
  – Predicted

Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers that identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
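
The prefix list above maps directly onto a lookup table. The following sketch, with a hypothetical function name, reports the molecule type for a RefSeq-style accession using only the six prefixes named on this slide; anything else is reported as unknown.

```python
# Prefixes and labels taken from the slide; other RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_molecule_type(accession: str) -> str:
    """Return the molecule type implied by the first three characters."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")
```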
EST

EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using mRNA as a template
  – Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5’ or 3’
  – Typical size: 200 - 500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
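
To make "complementary to mRNA" concrete, the sketch below derives the complementary DNA strand from a short mRNA string. The function name and the example sequence are made up for illustration. The result is written 5' to 3', so it is the reverse complement of the input.

```python
# Pairing rules for building DNA against an RNA (or DNA-lettered) template.
COMPLEMENT = {"A": "T", "U": "A", "T": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the DNA strand complementary to an mRNA, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))
```

For example, the mRNA fragment AUGGCC pairs with the cDNA GGCCAT.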

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of Entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro: protein families and domains, http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL: curated protein sequences, http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  • CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  • Pfam: http://www.sanger.ac.uk/Software/Pfam
  • PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins is available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there are 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results
Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called Single Nucleotide Polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among different individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
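
The frequency-based terminology on this slide amounts to a single threshold test. A minimal sketch follows, with a hypothetical function name. The slide does not say how to label a change at exactly 1%, so treating that case as a mutation is an assumption made here.

```python
def classify_single_base_change(allele_frequency: float, threshold: float = 0.01) -> str:
    """Label a single-base change as 'SNP' or 'mutation' by population frequency.

    allele_frequency is a fraction between 0 and 1 (0.01 means 1%).
    Assumption: a frequency exactly at the threshold is labeled 'mutation'.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    return "SNP" if allele_frequency > threshold else "mutation"
```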

SNPs
• SNPs can occur anywhere in a genome; they are classified by location
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP depends strongly on where it occurs (gene or non-gene) as well as on the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington’s disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  – e.g., cancers, heart disease, Alzheimer’s disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Caused by a single-base change, an A replaced by a T, which puts valine in place of glutamic acid in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1. Normal red blood cells 2. Sickled red blood cells
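
The effect of a single-base change on the encoded amino acid can be checked against the standard genetic code. The sketch below builds the standard codon table and labels a codon change as synonymous or non-synonymous. The function name is hypothetical, and the sickle-cell example assumes the textbook codon change GAG to GTG in the beta-globin gene; the codons themselves are not given in the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def describe_change(ref_codon: str, alt_codon: str):
    """Return (reference amino acid, altered amino acid, kind of change)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind
```

Under that assumption, describe_change("GAG", "GTG") reports E (glutamic acid) becoming V (valine), a non-synonymous change, whereas GAA to GAG leaves the amino acid unchanged.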

A Few Relevant Concepts
• Allele: a specific “version” of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
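
The two SNP figures on this slide, roughly 10 million SNPs and about 1 per 300 bp, agree with each other if the human genome is taken to be about 3 billion base pairs. That genome size is an assumption added here, not a number from the slides. A small sketch of the arithmetic, with a hypothetical function name:

```python
def expected_snp_count(genome_size_bp: int, bp_per_snp: int) -> int:
    """Number of SNPs expected if one occurs every bp_per_snp bases on average."""
    return genome_size_bp // bp_per_snp


# Assumption: a haploid human genome of about 3 billion base pairs.
HUMAN_GENOME_BP = 3_000_000_000
ESTIMATED_SNPS = expected_snp_count(HUMAN_GENOME_BP, 300)
```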

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (ss) records have IDs starting with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) IDs start with rs

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
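
The IUPAC table can be used in both directions: to expand an ambiguous sequence into the concrete sequences it stands for, and to find the single code for a set of alleles. The sketch below does both with the codes listed above; the function names are hypothetical.

```python
from itertools import product

# IUPAC nucleotide codes from the table above, each mapped to its possible bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}
BASES_TO_CODE = {bases: code for code, bases in IUPAC.items()}


def expand(sequence: str) -> list:
    """List every unambiguous sequence that an IUPAC-coded sequence can represent."""
    choices = (IUPAC[symbol] for symbol in sequence.upper())
    return ["".join(combo) for combo in product(*choices)]


def code_for(alleles: str) -> str:
    """Return the single IUPAC code for a set of alleles, e.g. 'AG' -> 'R'."""
    key = "".join(sorted(set(alleles.upper())))
    return BASES_TO_CODE[key]
```

The number of expansions grows quickly with the number of ambiguous positions, so expand is only practical for short stretches such as the bases around a single SNP.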

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of ss records, batch search, SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
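
Queries like those in the table are plain strings made of term[Field] clauses joined by operators, so they are easy to assemble in code. The sketch below only builds the query text; it does not contact NCBI. The helper names are hypothetical, and the field tags are the ones shown in the examples above.

```python
def tagged(term: str, field: str) -> str:
    """Format one search clause as term[Field]."""
    return f"{term}[{field}]"


def combine(operator: str, *clauses: str) -> str:
    """Join clauses with AND, OR, or NOT, as in the search examples."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(clauses)


# Rebuild two of the example queries from the table.
chr1_or_chr2 = combine("OR", tagged("1", "CHR"), tagged("2", "CHR"))
known_method_only = combine("NOT", chr1_or_chr2, tagged("unknown", "METHOD"))
```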

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in early-onset breast cancer
• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.
Field                     Meaning
gnl                       Internal use
dbSNP                     Database name
ss or rs number           dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                 Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
len/totalLen              Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id   Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                     NCBI taxonomy ID
mol                       Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                  Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                   Lists the alleles of the SNP, separated by /
Lower or upper case       Sequence in lower case marks regions identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)              Green is used for assay sequence (observed by the submitter)
ATCG (black)              Black is used for flank sequence (extracted from sequence databases)
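
The field descriptions above are enough to parse a dbSNP FASTA header and recover the flank lengths from allelePos and the total length. The sketch below is one way to do it, with hypothetical function names. It keeps only key=value fields, so the optional handle and submitted_snp_id fields of a submitted record are skipped.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into a dictionary of its fields."""
    fields = header.lstrip(">").strip().split("|")
    record = {"database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" in field:  # fields without '=' (e.g. submitter handle) are skipped
            key, value = field.split("=", 1)
            record[key] = value
    return record


def flank_lengths(record: dict) -> tuple:
    """Return (5' flank length, 3' flank length) around the variation.

    Uses allelePos = 5' length + 1 and total = 5' + 3' + 1.
    """
    total = int(record.get("totalLen", record.get("len")))
    allele_pos = int(record["allelePos"])
    return allele_pos - 1, total - allele_pos
```

For the rs799916 header, with allelePos=301 and totalLen=601, this gives 300 bases of flank on each side of the variation.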

GeneView of a SNP

Links to Various Gene Records
Gene and Disease

Disease-Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI - OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated from the results of published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder, and contains the following information
  – description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Molecular Databasesbull Primary Databases
ndash Original submissions by experimentalists
ndash Database staff organize but donrsquot add additional
ndash Information for instance GenBank
bull Derivative Databasesndash Human curated
bull Compilation and correction of data
bull Example SWISS-PROT NCBI RefSeq
ndash Computationally Derived
bull Example UniGene
Department of Health Information Management
GenBankbull httpwwwncbinlmnihgovGenbank
bull Nucleotide only sequence database
bull GenBank Datandash Direct submissions individual records (BankIt Sequin)
ndash Batch submissions via email (EST GSS STS)
ndash ftp accounts established for sequencing centers
bull Data shared nightly amongst three collaborating databasesndash GenBank
ndash DNA Database of Japan (DDBJ)
ndash European Molecular Biology Laboratory Database (EMBL)
Department of Health Information Management
Department of Health Information Management
GenBank Release 1870bull ftpftpncbinihgovgenbankbull Full release every two monthsbull Incremental and cumulative updates daily
Release 1810 (12152011)
bull 146413798 Sequences bull 135117731375 Base Pairs
Department of Health Information Management
GenBank Record (Header)LOCUS NM_001963 5600 bp mRNA linear PRI 15-JAN-2012 DEFINITION Homo sapiens epidermal growth factor (EGF)
transcript variant 1 mRNA ACCESSION NM_001963 VERSION NM_0019634 GI296011011 KEYWORDS SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota Metazoa Chordata
Craniata Vertebrata Euteleostomi Mammalia Eutheria Euarchontoglires Primates Haplorrhini Catarrhini Hominidae Homo
REFERENCE 1 (bases 1 to 5600) AUTHORS de DiesbachMT CominelliA NKuliF
TytecaD and CourtoyPJ TITLE Acute ligand-independent Src activation mimics low EGF-induced EGFR surface signalling and redistribution into recycling endosomes
JOURNAL Exp Cell Res 316 (19) 3239-3253 (2010) PUBMED 20832399
Department of Health Information Management
GenBank Record (Features)FEATURES LocationQualifiers source 15600
organism=Homo sapiens mol_type=mRNA db_xref=taxon9606 chromosome=4 map=4q25
gene 15600 gene=EGF gene_synonym=HOMG4 URG note=epidermal growth factor db_xref=GeneID1950ldquo db_xref=MIM131530
exon 1579 number=1
CDS 4534076 codon_start=1 protein_id=NP_0019542 db_xref=GI166362728ldquo db_xref=GeneID1950ldquo db_xref=MIM131530 translation=MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP hellip
exon 580779 number=2
exon 780961 number=3
Department of Health Information Management
GenBank Record (Sequence)ORIGIN 1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
Department of Health Information Management
FASTA Formatgtgi|371502116|ref|NM_0011261132| Homo sapiens tumor protein p53 (TP53) transcript variant 4 mRNA GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
Department of Health Information Management
Too Many Results
Department of Health Information Management
Search Limits
Department of Health Information Management
Reduced Search Results
Department of Health Information Management
Gene Record
Department of Health Information Management
RefSeqbull Database of reference sequences
ndash httpwwwncbinlmnihgovRefSeq
bull Curatedndash Many experimentally validated
ndash Some partially validated via ESTs
ndash Some computationally predicted
bull Non-redundant one record for each gene or each splice variant from each organism represented
bull Status Codesndash Provisional (temporary)
ndash Reviewed
ndash Predicted
Department of Health Information Management
Department of Health Information Management
Page 26
Accession Numbersbull DNA sequences and other molecular data are
tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence
bull RefSeq identifiers include the following formatsndash Complete chromosome NC_
ndash Genomic contig NT_
ndash mRNA (DNA format) NM_ XM_
ndash Protein NP_ XP_
EST
Department of Health Information Management
ESTbull mRNA Genomic regions actively transcribed in
cellbull cDNA (complementary DNA)
ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA
bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify
bull Genes Alternative splicing etc

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure
Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi
Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro: protein families and domains, http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL: curated protein sequences, http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  – CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  – Pfam: http://www.sanger.ac.uk/Software/Pfam
  – PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins is available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – Majority of SNPs do NOT directly contribute to any phenotypes
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions of variable length made up of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency greater than 1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at a frequency of less than 1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
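The frequency-based terminology above can be written as a small rule. This is only a sketch of the 1% convention stated on the slide; the function name is made up for illustration, and the handling of a frequency of exactly 1% is an assumption because the slide does not define it.

```python
def classify_single_base_change(allele_frequency, threshold=0.01):
    """Label a single-base change by its population allele frequency.

    Above the threshold the change is called a "SNP", below it a "mutation".
    Exactly at the threshold the slide gives no rule, so this sketch
    returns "boundary" (an assumption).
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    if allele_frequency > threshold:
        return "SNP"
    if allele_frequency < threshold:
        return "mutation"
    return "boundary"


print(classify_single_base_change(0.05))   # SNP
print(classify_single_base_change(0.001))  # mutation
```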

SNPs
• SNPs can occur anywhere in a genome; they are classified by location
  – Many SNPs lie in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP depends strongly on where it occurs (gene or non-gene) and on the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Caused by a single base substitution (an A swapped for a T), which makes the inserted amino acid valine instead of glutamic acid in hemoglobin
Image source: httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1. Normal red blood cells 2. Sickled red blood cells
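A single base swap changing one amino acid can be shown in a few lines. The sketch below assumes the commonly described sickle cell change, codon GAG (glutamic acid) becoming GTG (valine) in the beta-globin coding sequence; the specific codon is not given on the slide, and the lookup table holds only the two codons needed here rather than the full genetic code.

```python
# Only the two codons needed for this illustration, not a full codon table.
CODON_TO_AMINO_ACID = {"GAG": "glutamic acid", "GTG": "valine"}


def substitute_base(codon, index, new_base):
    """Return the codon with the base at the 0-based index replaced."""
    if index not in range(len(codon)):
        raise ValueError("index outside codon")
    return codon[:index] + new_base + codon[index + 1:]


normal_codon = "GAG"
sickle_codon = substitute_base(normal_codon, 1, "T")  # A -> T at the middle base
print(normal_codon, "->", CODON_TO_AMINO_ACID[normal_codon])
print(sickle_codon, "->", CODON_TO_AMINO_ACID[sickle_codon])
```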

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has for any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) exceed 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted record
    • The original observation of a sequence variation; submitted SNP (ss) record IDs start with ss
  – Computationally annotated record
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) IDs start with rs
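Since the two record classes differ only in their ID prefix, telling them apart is a one-line check. The sketch below assumes IDs have the form `ss` or `rs` followed by digits, as in the examples in this lecture; the function name is illustrative.

```python
import re

# "ss" or "rs" followed by one or more digits, e.g. ss5586300 or rs799916.
_DBSNP_ID = re.compile(r"^(ss|rs)(\d+)$", re.IGNORECASE)


def dbsnp_record_class(identifier):
    """Return 'submitted', 'reference cluster', or None for other strings."""
    match = _DBSNP_ID.match(identifier.strip())
    if match is None:
        return None
    return "submitted" if match.group(1).lower() == "ss" else "reference cluster"


print(dbsnp_record_class("ss5586300"))
print(dbsnp_record_class("rs799916"))
```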

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
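The IUPAC table can be used directly to expand an ambiguous sequence into the concrete sequences it stands for, which is how a SNP position such as R is read as "A or G". A minimal sketch follows; the dictionary restates the table, and `expand_iupac` is an illustrative name.

```python
from itertools import product

# Ambiguity codes from the table above, each mapped to the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand_iupac(sequence):
    """Return every concrete DNA sequence matching an IUPAC-coded sequence."""
    choices = [IUPAC[code] for code in sequence.upper()]
    return ["".join(bases) for bases in product(*choices)]


print(expand_iupac("CCARCT"))  # ['CCAACT', 'CCAGCT']
```

The number of expansions grows multiplicatively with each ambiguous position, so this is practical only for short stretches around a variant.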

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of ss records, batch search, and SNP record submission; no search limit options
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wildcard (*), ranging (:), and the AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
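Query strings like those in the examples can be assembled programmatically. The helpers below are a plain string-building sketch with hypothetical names; they do not contact NCBI, and they use only the field tags that appear in the examples. The explicit parentheses around the OR group in the last query are an added assumption to make the intended grouping unambiguous.

```python
def term(value, field):
    """Format one search term, e.g. term('1', 'CHR') gives '1[CHR]'."""
    return f"{value}[{field}]"


def group(expression):
    """Wrap an expression in parentheses."""
    return f"({expression})"


def combine(operator, *expressions):
    """Join expressions with AND, OR, or NOT."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    return f" {operator} ".join(expressions)


q1 = term("BRC*", "Gene Name")
q2 = combine("AND", term("1", "CHR"), group(term("frameshift", "Function_Class")))
q3 = combine("NOT",
             group(combine("OR", term("1", "CHR"), term("2", "CHR"))),
             term("unknown", "METHOD"))
print(q1)
print(q2)
print(q3)
```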

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results
dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen             Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs: the two fields after totalLen are the submitter handle and the submitter SNP id
taxid                      NCBI taxonomy id
mol                        Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Sequence in lower case marks regions identified by RepeatMasker as low-complexity or repetitive elements
Green text (ATCG)          Assay sequence (observed by the submitter)
Black text (ATCG)          Flank sequence (extracted from sequence databases)
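The field layout described above can be read with a short parser. This sketch assumes the first three pipe-separated fields are always the `gnl` tag, the database name, and the accession, and that later fields are either `key=value` pairs or bare values (such as a submitter handle); `parse_dbsnp_header` is an illustrative name, and the "/" in the alleles value follows the field description.

```python
def parse_dbsnp_header(header):
    """Parse a dbSNP FASTA header line into a dictionary of fields."""
    parts = header.strip().lstrip(">").strip("|").split("|")
    record = {
        "prefix": parts[0],
        "database": parts[1],
        "accession": parts[2],
        "extra": [],  # bare fields without '=', e.g. handle, submitted id
    }
    for field in parts[3:]:
        key, separator, value = field.partition("=")
        if separator:
            record[key] = value.strip('"')
        else:
            record["extra"].append(field)
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
snp = parse_dbsnp_header(header)
# allelePos is the 5' flank length plus 1, so the flank lengths follow directly.
five_prime_flank = int(snp["allelePos"]) - 1
three_prime_flank = int(snp["totalLen"]) - int(snp["allelePos"])
print(snp["accession"], snp["alleles"], five_prime_flank, three_prime_flank)
```

For this record both flanks come out at 300 bases, which matches a 601-base sequence with the variation at position 301.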

GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease-Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI - OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder, and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – Conducting population-based genomic research
  – Assessing the role of family health history in disease risk and prevention
  – Supporting a systematic process for evaluating genetic tests
  – Translating genomics into public health research and programs
  – Strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator
Finding Disease-Causing Genes
Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method named "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
FASTA Formatgtgi|371502116|ref|NM_0011261132| Homo sapiens tumor protein p53 (TP53) transcript variant 4 mRNA GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
Department of Health Information Management
Too Many Results
Department of Health Information Management
Search Limits
Department of Health Information Management
Reduced Search Results
Department of Health Information Management
Gene Record
Department of Health Information Management
RefSeqbull Database of reference sequences
ndash httpwwwncbinlmnihgovRefSeq
bull Curatedndash Many experimentally validated
ndash Some partially validated via ESTs
ndash Some computationally predicted
bull Non-redundant one record for each gene or each splice variant from each organism represented
bull Status Codesndash Provisional (temporary)
ndash Reviewed
ndash Predicted
Department of Health Information Management
Department of Health Information Management
Page 26
Accession Numbersbull DNA sequences and other molecular data are
tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence
bull RefSeq identifiers include the following formatsndash Complete chromosome NC_
ndash Genomic contig NT_
ndash mRNA (DNA format) NM_ XM_
ndash Protein NP_ XP_
EST
Department of Health Information Management
ESTbull mRNA Genomic regions actively transcribed in
cellbull cDNA (complementary DNA)
ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA
bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify
bull Genes Alternative splicing etc
Department of Health Information Management
dbEST (release 120111 Dec 1
2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml
bull Number of Entries 71276166ndash Homo sapiens (human) 8315294
ndash Mus musculus (mouse) 4853562
ndash Arabidopsis thaliana (thale cress) 1529700
ndash Danio rerio (zebrafish) 1488275
ndash Drosophila melanogaster (fruit fly) 821005
ndash Gallus gallus (chicken) 600433
Department of Health Information Management
Access to dbEST Databull EST sequences are included in the EST division of
GenBank available from NCBI by anonymous ftp and through Entrez
bull The nucleotide sequences may be searched using the BLAST server
bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov
Protein Structure
Department of Health Information Management
Cn3D ftpftpncbinihgovcn3dCn3D-43msi
Department of Health Information Management
Crystal Structure of A Protein
Department of Health Information Management
Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains
httpwwwebiacukinterprobull Protein Information Resource (PIR)
httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences
httpwwwexpasychsprot bull UniProt
httpwwwexpasyuniprotorgindexshtml
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying which genotype a person has at a given locus (or loci)

Genetic Variations Databases
• dbSNP
  - http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  - http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  - http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  - http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  - Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  - Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  - Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record identifiers start with ss
  - Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) identifiers start with rs

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
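
The IUPAC table above maps directly onto a lookup dictionary. The following Python sketch (function names are illustrative, not part of any NCBI tool) expands an ambiguity code into its possible bases and locates the ambiguous positions in a sequence fragment, such as the R (A or G) that marks the variation in a dbSNP record.

```python
# IUPAC nucleotide codes, as listed in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def expand(code):
    """Return the set of bases an IUPAC code can stand for."""
    return set(IUPAC[code.upper()])

def ambiguous_positions(sequence):
    """List (1-based position, code, possible bases) for every non-ACGT symbol."""
    hits = []
    for i, symbol in enumerate(sequence.replace(" ", ""), start=1):
        if symbol.upper() not in "ACGT":
            hits.append((i, symbol, sorted(expand(symbol))))
    return hits

# A 10-base fragment taken from the dbSNP record shown earlier.
print(ambiguous_positions("CATTCAATRT"))
```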

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of SS records, batch search, SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using the wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           All SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       All SNPs on chromosome 1 or 2 detected by any method except unknown
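
Queries like these can also be issued programmatically. As a minimal sketch, the Python snippet below only builds (and does not send) a request URL for the NCBI E-utilities ESearch service. The field tags are copied from the slide and are assumed to still be accepted by the SNP database; `snp_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def snp_search_url(term, retmax=20):
    """Build (but do not send) an ESearch URL for the SNP database."""
    params = {"db": "snp", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

# Query taken from the examples table: frameshift SNPs on chromosome 1.
url = snp_search_url("1[CHR] AND (frameshift[Function_Class])")
print(url)
```

Keeping URL construction separate from the network call makes the query easy to inspect before it is submitted.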

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.

Field                       Description
gnl                         Internal use
dbSNP                       Database name
ss or rs number             dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                   Position of the variation allele (1-based) in the FASTA sequence. It is always the 5' flank length plus 1
len / totalLen              Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id     Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
taxid                       NCBI taxonomy id
mol                         Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                    Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                     Lists the alleles of the SNP, separated by /
Lower or upper case         Lowercase is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)                Green is used for assay sequence (observed by the submitter)
ATCG (black)                Black is used for flank sequence (extracted from sequence databases)
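
The field layout described above can be read with a few lines of Python. This is a minimal sketch rather than an official parser: `parse_snp_header` is a hypothetical helper, and the example header assumes the alleles are written with a / separator, as the field description states.

```python
def parse_snp_header(header):
    """Split a dbSNP FASTA header into a dict of its fields."""
    fields = header.lstrip(">").strip().split("|")
    # fields[0] is "gnl" (internal use); fields[1] is the database name.
    record = {"database": fields[1], "accession": fields[2]}
    # ss = submitted SNP, rs = reference SNP cluster.
    record["kind"] = "submitted" if fields[2].startswith("ss") else "refSNP"
    for field in fields[3:]:
        # key=value fields; bare fields (submitter handle, submitter id) are skipped.
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip("\"'")
    return record

header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
rec = parse_snp_header(header)

# allelePos is the 5' flank length plus 1; the variation itself counts as 1 base.
five_prime = int(rec["allelePos"]) - 1
three_prime = int(rec["totalLen"]) - int(rec["allelePos"])
print(rec["accession"], rec["kind"], five_prime, three_prime)
```

For this record the two flank lengths and the single variation position add up to totalLen, which matches the rule given for the totalLen field.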

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated from the results of published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; once it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - information on population prevalence of genetic variants
  - gene-disease associations
  - gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
GenBank Release 187.0
• ftp://ftp.ncbi.nih.gov/genbank
• Full release every two months
• Incremental and cumulative updates daily
Release 187.0 (12/15/2011)
• 146,413,798 sequences
• 135,117,731,375 base pairs

GenBank Record (Header)
LOCUS       NM_001963    5600 bp    mRNA    linear    PRI 15-JAN-2012
DEFINITION  Homo sapiens epidermal growth factor (EGF), transcript variant 1, mRNA.
ACCESSION   NM_001963
VERSION     NM_001963.4  GI:296011011
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 5600)
  AUTHORS   de Diesbach,M.T., Cominelli,A., N'Kuli,F., Tyteca,D. and Courtoy,P.J.
  TITLE     Acute ligand-independent Src activation mimics low EGF-induced EGFR surface signalling and redistribution into recycling endosomes
  JOURNAL   Exp. Cell Res. 316 (19), 3239-3253 (2010)
  PUBMED    20832399

GenBank Record (Features)
FEATURES             Location/Qualifiers
     source          1..5600
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="4"
                     /map="4q25"
     gene            1..5600
                     /gene="EGF"
                     /gene_synonym="HOMG4; URG"
                     /note="epidermal growth factor"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
     exon            1..579
                     /number=1
     CDS             453..4076
                     /codon_start=1
                     /protein_id="NP_001954.2"
                     /db_xref="GI:166362728"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
                     /translation="MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP..."
     exon            580..779
                     /number=2
     exon            780..961
                     /number=3
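
Feature locations are 1-based and inclusive, so the length of a feature follows directly from its coordinates. The Python sketch below reads the CDS location as 453..4076 and assumes the final codon of the CDS is a stop codon; `span_length` is an illustrative helper that handles only simple start..end locations, not joins or complements.

```python
def span_length(location):
    """Length in bases of a simple 'start..end' location (1-based, inclusive)."""
    start, end = (int(x) for x in location.split(".."))
    return end - start + 1

cds_bases = span_length("453..4076")
codons = cds_bases // 3
amino_acids = codons - 1   # assumes the last codon is a stop codon

print(cds_bases, codons, amino_acids)
```

The same helper gives the exon sizes, for example `span_length("580..779")` for exon 2.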

GenBank Record (Sequence)
ORIGIN
1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc

FASTA Format
>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
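
A FASTA record is a header line starting with > followed by one or more sequence lines. The Python sketch below is a minimal reader written for illustration (`read_fasta` is not a library function); the example text reuses the header above with only the first 40 bases of the sequence, split over two lines.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

example = """>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53)
GATGGGATTGGGGTTTTCCC
CTCCCATGTGCTCAAGACTG
"""

records = list(read_fasta(example))
for header, seq in records:
    accession = header.split("|")[3]   # fourth |-separated field of this header style
    print(accession, len(seq))
```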

Too Many Results

Search Limits

Reduced Search Results

Gene Record

RefSeq
• Database of reference sequences
  - http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  - Many experimentally validated
  - Some partially validated via ESTs
  - Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  - Provisional (temporary)
  - Reviewed
  - Predicted

Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  - Complete chromosome: NC_
  - Genomic contig: NT_
  - mRNA (DNA format): NM_, XM_
  - Protein: NP_, XP_
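
Because the molecule type is encoded in the accession prefix, it can be looked up mechanically. The Python sketch below covers only the prefixes listed on the slide; `refseq_type` is an illustrative helper, and RefSeq defines further prefixes that are not handled here.

```python
# Prefixes exactly as listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}

def refseq_type(accession):
    """Return the molecule type implied by a RefSeq accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "prefix not listed on the slide")

for acc in ["NM_001963.4", "NP_001954.2", "NC_000004"]:
    print(acc, "->", refseq_type(acc))
```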

EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  - Copy of mRNA, made using the mRNA as a template
  - Sequence is complementary to the mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  - Partial cDNA sequence
  - Can be 5' or 3'
  - Typical size: 200-500 bp
  - Represents mRNA actively transcribed in the cell
  - Used to identify
    • Genes, alternative splicing, etc.
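
The complementarity between mRNA and cDNA can be made concrete with a short Python sketch. It assumes standard base pairing (A-T, U-A, G-C, C-G) and writes the resulting cDNA strand 5' to 3'; the input is a made-up nine-base fragment, not a real transcript.

```python
# Base pairing from an RNA template to DNA.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """Return the cDNA strand complementary to an mRNA, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))

print(first_strand_cdna("AUGCUGCUC"))
```

Reversing the template before complementing keeps the output in the conventional 5' to 3' orientation.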

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  - Homo sapiens (human): 8,315,294
  - Mus musculus (mouse): 4,853,562
  - Arabidopsis thaliana (thale cress): 1,529,700
  - Danio rerio (zebrafish): 1,488,275
  - Drosophila melanogaster (fruit fly): 821,005
  - Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP, in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D
ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  - CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  - Pfam: http://www.sanger.ac.uk/Software/Pfam
  - PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins is available thanks to techniques such as NMR and X-ray crystallography
  - PDB: http://www.pdb.org
  - SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  - MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  - Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase
http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeated in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions, or translocations of large DNA fragments
  - Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  - A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  - When a single base change occurs at <1%, it is considered a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
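
The 1% rule can be written as a small Python function. This is only a sketch of the definition: the allele counts are made-up numbers, `classify_variant` is a hypothetical helper, and because the slide does not say how to label a frequency of exactly 1%, the sketch arbitrarily groups that case with mutations.

```python
def classify_variant(minor_allele_count, total_alleles, threshold=0.01):
    """Label a single-base change as 'SNP' or 'mutation' by allele frequency."""
    freq = minor_allele_count / total_alleles
    # Strictly above the threshold counts as a SNP; exactly 1% is an arbitrary choice here.
    label = "SNP" if freq > threshold else "mutation"
    return freq, label

# 1,000 diploid individuals carry 2,000 alleles at a locus (illustrative counts).
print(classify_variant(30, 2000))   # minor allele seen 30 times
print(classify_variant(5, 2000))    # minor allele seen 5 times
```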
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases, and what is known about them, are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
GenBank Release 187.0
• ftp://ftp.ncbi.nih.gov/genbank
• Full release every two months
• Incremental and cumulative updates daily
Release 181.0 (12/15/2011)
• 146,413,798 sequences
• 135,117,731,375 base pairs
GenBank Record (Header)
LOCUS       NM_001963               5600 bp    mRNA    linear   PRI 15-JAN-2012
DEFINITION  Homo sapiens epidermal growth factor (EGF), transcript variant 1, mRNA.
ACCESSION   NM_001963
VERSION     NM_001963.4  GI:296011011
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 5600)
  AUTHORS   de Diesbach,M.T., Cominelli,A., N'Kuli,F., Tyteca,D. and Courtoy,P.J.
  TITLE     Acute ligand-independent Src activation mimics low EGF-induced EGFR
            surface signalling and redistribution into recycling endosomes
  JOURNAL   Exp. Cell Res. 316 (19), 3239-3253 (2010)
   PUBMED   20832399
GenBank Record (Features)
FEATURES             Location/Qualifiers
     source          1..5600
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="4"
                     /map="4q25"
     gene            1..5600
                     /gene="EGF"
                     /gene_synonym="HOMG4; URG"
                     /note="epidermal growth factor"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
     exon            1..579
                     /number=1
     CDS             453..4076
                     /codon_start=1
                     /protein_id="NP_001954.2"
                     /db_xref="GI:166362728"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
                     /translation="MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP…"
     exon            580..779
                     /number=2
     exon            780..961
                     /number=3
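In the GenBank flat-file format, a simple feature location is written as start..end, with 1-based, inclusive coordinates on the record's sequence. As a minimal sketch, assuming a plain location with no join() or complement() operators, the length of a feature such as the EGF CDS (453..4076) could be worked out in Python as follows. The helper name `span_length` is made up for illustration.

```python
def span_length(location: str) -> int:
    """Length in bases of a simple 'start..end' location (1-based, inclusive)."""
    start, end = (int(part) for part in location.split(".."))
    return end - start + 1


cds_length = span_length("453..4076")  # CDS of the EGF mRNA record
codons = cds_length // 3               # this count includes the stop codon
print(cds_length, codons, codons - 1)  # 3624 1208 1207

# The three exons shown on the slide
for exon in ("1..579", "580..779", "780..961"):
    print(exon, span_length(exon))
```

The inclusive coordinates are why the length is end - start + 1 rather than end - start.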
GenBank Record (Sequence)
ORIGIN
1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
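Each line of the ORIGIN block starts with the position of its first base, followed by groups of ten bases. A short Python sketch of how those lines could be joined into one continuous sequence is given below. The function name `origin_to_sequence` is an assumption for illustration, and only the first two lines of the record are used.

```python
def origin_to_sequence(lines):
    """Join GenBank ORIGIN lines into one uppercase sequence string."""
    bases = []
    for line in lines:
        parts = line.split()
        # Sequence lines begin with a position number; skip "ORIGIN" and "//".
        if parts and parts[0].isdigit():
            bases.extend(parts[1:])
    return "".join(bases).upper()


origin_lines = [
    "ORIGIN",
    "1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc",
    "61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt",
]
sequence = origin_to_sequence(origin_lines)
print(len(sequence), sequence[:10])
```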
FASTA Format
>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
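A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below shows one way to read such text in plain Python and pull the accession out of an NCBI-style header whose fields are separated by "|". The function name `parse_fasta` is an assumption, and the sequence is cut short for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Header taken from the slide; sequence truncated for illustration.
example = """>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTG
CTCAAGACTGGCGCTAAAAG
"""
header, sequence = parse_fasta(example)[0]
accession = header.split("|")[3]
print(accession, len(sequence))
```

Sequence lines are concatenated, so a record wrapped over many lines and a record on one long line give the same result.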
Too Many Results
Search Limits
Reduced Search Results
Gene Record
RefSeq
• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted
Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
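The prefix list above lends itself to a simple lookup. The following Python sketch maps an accession to the record type named on the slide. The function name and the non-RefSeq placeholder accession "AB123456" are made up for illustration, and only the six prefixes listed here are covered.

```python
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_record_type(accession: str) -> str:
    """Map a RefSeq accession to the record type listed for its prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown prefix")


for acc in ("NM_001963.4", "NP_001954.2", "AB123456"):
    print(acc, "->", refseq_record_type(acc))
```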
EST
EST
• mRNA: transcripts of the genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using mRNA as a template
  – Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
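To make "complementary to mRNA" concrete, here is a small Python sketch that derives the complementary DNA strand from an mRNA string by base pairing (A-T, U-A, G-C, C-G) and writes it 5' to 3'. The function name and the six-base input are assumptions for illustration only.

```python
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the DNA strand complementary to an mRNA, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


print(first_strand_cdna("AUGGCC"))  # GGCCAT
```

The strands pair antiparallel, so the complement is reversed before it is written out.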
dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433
Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous ftp and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous ftp, in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure
Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi
Crystal Structure of a Protein
Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml
Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains), which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  • CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  • Pfam: http://www.sanger.ac.uk/Software/Pfam
  • PROSITE: http://www.expasy.org/prosite
Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins is available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure
PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there are 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references
PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history
Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
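The frequency rule above can be written as a small Python function. This is only a sketch of the 1% threshold as stated on the slide. The function name is made up, frequencies are given as fractions between 0 and 1, and the case of exactly 1% (which the slide does not define) is reported separately rather than guessed.

```python
def classify_single_base_change(frequency: float) -> str:
    """Label a single-base change by its population frequency (0.0 to 1.0)."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if frequency > 0.01:
        return "SNP"
    if frequency < 0.01:
        return "mutation"
    return "undefined at exactly 1%"


for freq in (0.25, 0.002):
    print(freq, classify_single_base_change(freq))
```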
SNPs
• SNPs can occur anywhere in a genome; they are classified by location
  – Many SNPs lie in non-coding genomic regions
  – SNPs in gene regions, including the promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease
The Effect of SNPs
• The phenotypic consequence of a SNP depends strongly on where it occurs (gene or non-gene region) and on the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription, quantitatively or qualitatively
  – Affect gene translation, quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps
Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.
Sickle Cell Anemia
• Caused by a single-base substitution (an A swapped for a T) that places valine instead of glutamic acid in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
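The sickle cell change is a convenient worked example of a non-synonymous substitution. In the standard genetic code, GAG encodes glutamic acid and GTG encodes valine, so replacing the middle A with a T changes the amino acid, while GAG to GAA leaves it unchanged. The Python sketch below uses only the three codons needed for this example, and the function names are made up for illustration.

```python
# Subset of the standard genetic code (DNA coding-strand codons).
CODON_TABLE = {"GAG": "Glu", "GAA": "Glu", "GTG": "Val"}


def point_change(codon: str, position: int, new_base: str) -> str:
    """Return the codon after replacing one base (position is 0-based)."""
    return codon[:position] + new_base + codon[position + 1:]


def effect(codon: str, position: int, new_base: str):
    """Describe whether a single-base change alters the encoded amino acid."""
    mutated = point_change(codon, position, new_base)
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "synonymous" if before == after else "non-synonymous"
    return mutated, before, after, kind


print(effect("GAG", 1, "T"))  # ('GTG', 'Glu', 'Val', 'non-synonymous')
print(effect("GAG", 2, "A"))  # ('GAA', 'Glu', 'Glu', 'synonymous')
```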
A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying which genotype a person has at a given locus (or loci)
Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) IDs start with rs
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
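The code table can be held in a dictionary and used in both directions: expanding a code into its possible bases, or finding the single code that stands for a set of alleles. The Python sketch below follows the table above. The function names `expand` and `code_for` are assumptions for illustration.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(code: str) -> set:
    """Return the set of bases that an IUPAC code can stand for."""
    return set(IUPAC[code.upper()])


def code_for(alleles: str) -> str:
    """Return the IUPAC code for a set of alleles, e.g. 'AG' -> 'R'."""
    wanted = "".join(sorted(set(alleles.upper())))
    for code, bases in IUPAC.items():
        if bases == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {alleles!r}")


print(sorted(expand("R")))             # ['A', 'G']
print(code_for("AG"), code_for("CA"))  # R M
```

This is how an A/G SNP comes to be written as a single R in a dbSNP flanking sequence.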
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wildcard (*), ranging (:), AND, OR, and NOT operators
Example                                  Description
BRC*[Gene Name]                          Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])  Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                         Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]     Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP Fasta Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.
Field                     Description
gnl                       Internal use
dbSNP                     Database name
ss or rs number           dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                 Variation allele position (1-based) on the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen            Total number of bases of the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id   Only for submitted SNPs; the two fields after totalLen are the submitter handle and the submitter SNP id
taxid                     NCBI taxonomy id
mol                       Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                  Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                   Lists the alleles of the SNP, separated by "/"
Lower or upper case       Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)              Green color is used for assay sequence (observed by the submitter)
ATCG (black)              Black color is used for flanking sequence (extracted from sequence databases)
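The field layout described above can be checked against the rs799916 header shown earlier. The Python sketch below splits a dbSNP FASTA header on "|" and recovers the flank lengths from allelePos and totalLen. The function name and the dictionary keys "source", "database", and "accession" are labels chosen for this illustration, and positional fields without "=" (such as the submitter handle on ss records) are simply skipped.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into its named fields."""
    fields = header.lstrip(">").strip().split("|")
    record = {"source": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" in field:  # key=value fields; positional fields are skipped
            key, value = field.split("=", 1)
            record[key] = value.strip('"')
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
record = parse_dbsnp_header(header)

five_prime = int(record["allelePos"]) - 1                         # 5' flank length
three_prime = int(record["totalLen"]) - int(record["allelePos"])  # 3' flank length
print(record["accession"], record["alleles"], five_prime, three_prime)
# rs799916 A/C 300 300
```

The variation counts as one base, so the 5' flank, the variation, and the 3' flank add up to totalLen: 300 + 1 + 300 = 601.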
Department of Health Information Management
GenBank Record (Header)LOCUS NM_001963 5600 bp mRNA linear PRI 15-JAN-2012 DEFINITION Homo sapiens epidermal growth factor (EGF)
transcript variant 1 mRNA ACCESSION NM_001963 VERSION NM_0019634 GI296011011 KEYWORDS SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota Metazoa Chordata
Craniata Vertebrata Euteleostomi Mammalia Eutheria Euarchontoglires Primates Haplorrhini Catarrhini Hominidae Homo
REFERENCE 1 (bases 1 to 5600) AUTHORS de DiesbachMT CominelliA NKuliF
TytecaD and CourtoyPJ TITLE Acute ligand-independent Src activation mimics low EGF-induced EGFR surface signalling and redistribution into recycling endosomes
JOURNAL Exp Cell Res 316 (19) 3239-3253 (2010) PUBMED 20832399
Department of Health Information Management
GenBank Record (Features)FEATURES LocationQualifiers source 15600
organism=Homo sapiens mol_type=mRNA db_xref=taxon9606 chromosome=4 map=4q25
gene 15600 gene=EGF gene_synonym=HOMG4 URG note=epidermal growth factor db_xref=GeneID1950ldquo db_xref=MIM131530
exon 1579 number=1
CDS 4534076 codon_start=1 protein_id=NP_0019542 db_xref=GI166362728ldquo db_xref=GeneID1950ldquo db_xref=MIM131530 translation=MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP hellip
exon 580779 number=2
exon 780961 number=3
Department of Health Information Management
GenBank Record (Sequence)ORIGIN 1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
Department of Health Information Management
FASTA Formatgtgi|371502116|ref|NM_0011261132| Homo sapiens tumor protein p53 (TP53) transcript variant 4 mRNA GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
Department of Health Information Management
Too Many Results
Department of Health Information Management
Search Limits
Department of Health Information Management
Reduced Search Results
Department of Health Information Management
Gene Record
Department of Health Information Management
RefSeqbull Database of reference sequences
ndash httpwwwncbinlmnihgovRefSeq
bull Curatedndash Many experimentally validated
ndash Some partially validated via ESTs
ndash Some computationally predicted
bull Non-redundant one record for each gene or each splice variant from each organism represented
bull Status Codesndash Provisional (temporary)
ndash Reviewed
ndash Predicted
Department of Health Information Management
Department of Health Information Management
Page 26
Accession Numbersbull DNA sequences and other molecular data are
tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence
bull RefSeq identifiers include the following formatsndash Complete chromosome NC_
ndash Genomic contig NT_
ndash mRNA (DNA format) NM_ XM_
ndash Protein NP_ XP_
EST
EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using mRNA as a template
  – Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200 - 500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
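The statement that cDNA is complementary to its mRNA template can be illustrated with a short sketch. The function name first_strand_cdna and the six-base toy mRNA are made up for this example; the strand is written 5' to 3', so the complement is also reversed.

```python
# Base pairing between an RNA template and the DNA strand copied from it.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the cDNA strand (5'->3') complementary to an mRNA given 5'->3'."""
    # Reverse first so that the result is also written 5'->3'.
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))


toy_mrna = "AUGGCC"  # hypothetical six-base mRNA fragment
print(first_strand_cdna(toy_mrna))
```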
dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433
Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure
Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi
Crystal Structure of a Protein
Protein Databases
• Proteins have structure and function
• InterPro: protein families and domains
  http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR)
  http://pir.georgetown.edu
• SWISS-PROT/TrEMBL: curated protein sequences
  http://www.expasy.ch/sprot
• UniProt
  http://www.expasy.uniprot.org/index.shtml
Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  • CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  • Pfam: http://www.sanger.ac.uk/Software/Pfam
  • PROSITE: http://www.expasy.org/prosite
Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure
PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references
PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)
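To put these percentages in absolute terms, the sketch below works through the arithmetic. The genome size of 3 billion base pairs is an assumed round figure that is not stated on this slide, so the results are order-of-magnitude estimates only.

```python
# Assumption (not on the slide): a haploid human genome of about 3 billion bp.
GENOME_SIZE_BP = 3_000_000_000

# 0.1% of positions differ between two unrelated individuals (1 in 1000).
differing_positions = GENOME_SIZE_BP // 1000

# About 90% of that difference is attributed to SNPs.
snp_differences = differing_positions * 90 // 100

print(f"differing positions: {differing_positions:,}")
print(f"of which SNPs:       {snp_differences:,}")
```

Under that assumption, two unrelated genomes differ at about 3 million positions, roughly 2.7 million of them single-base differences.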
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history
Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
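The 1% rule above can be written as a small classifier. This is a sketch only: the function name classify_by_frequency and the example counts are invented for illustration, and because the slide does not say how a frequency of exactly 1% is labeled, that case is reported separately.

```python
def classify_by_frequency(allele_count, total_alleles):
    """Label a single-base change as 'SNP' (>1%) or 'mutation' (<1%).

    Integer arithmetic is used so the comparison with 1% is exact.
    """
    if total_alleles <= 0 or not 0 <= allele_count <= total_alleles:
        raise ValueError("counts must satisfy 0 <= allele_count <= total_alleles")
    scaled = allele_count * 100  # compare count/total with 1/100 without floats
    if scaled > total_alleles:
        return "SNP"
    if scaled < total_alleles:
        return "mutation"
    return "exactly 1% (not defined on the slide)"


# Hypothetical counts: 30 of 200 sampled chromosomes carry the variant allele.
print(classify_by_frequency(30, 200))
print(classify_by_frequency(1, 2000))
```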
SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter region, coding region, intronic and exonic regions, UTR, etc.
    • Often play an important role in differentiation and disease
The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and functions
  – Change gene regulation at different steps
Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.
Sickle Cell Anemia
• Due to a single nucleotide change, swapping an A for a T, which causes the encoded amino acid in hemoglobin to be valine instead of glutamic acid
http://mmcenters.discoveryhospital.com/shared/enc/img_htm/IM-56.htm
Figure: (1) Normal red blood cells; (2) Sickled red blood cells
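Whether a single-base change in a coding region is synonymous or non-synonymous can be checked by translating the codon before and after the change. The sketch below builds the standard genetic code and applies it to the well-known sickle cell change in the beta-globin gene, where codon GAG (glutamic acid, E) becomes GTG (valine, V). The function name classify_substitution is hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Return (class, reference amino acid, variant amino acid) for two codons."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


sickle = classify_substitution("GAG", "GTG")  # the A-to-T change in the middle base
silent = classify_substitution("GAG", "GAA")  # a third-base change, for contrast
print(sickle)
print(silent)
```

The contrast case shows why location alone is not enough: both changes hit the same codon, but only one alters the protein.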
A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)
Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (or single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (or short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) IDs start with rs
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
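The code table above maps directly onto a dictionary. The following sketch expands a code into its bases and goes the other way, from a set of alleles to the single letter that dbSNP places at the variant position. The function names expand_code and code_for_alleles are hypothetical.

```python
# IUPAC nucleotide codes as listed in the table.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def expand_code(code):
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles):
    """Return the IUPAC code that represents exactly the given alleles."""
    wanted = {allele.upper() for allele in alleles}
    for code, bases in IUPAC_CODES.items():
        if set(bases) == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {sorted(wanted)}")


print(expand_code("R"))             # the variant position of an A/G record
print(code_for_alleles(["A", "C"]))  # an A/C variant
```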
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records; batch search; allows SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                      Description
BRC*[Gene Name]                              Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])      Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                             Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]         Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
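Queries of this form are plain strings of value[Field] terms joined by Boolean operators, so they can be assembled programmatically. The helper names below (term, any_of, all_of, excluding) are hypothetical, and the sketch only builds the query text shown in the examples; it does not contact NCBI. The wild-card character * in the first query is assumed.

```python
def term(value, field):
    """Format one search term as value[Field]."""
    return f"{value}[{field}]"


def any_of(*terms):
    """Join terms with OR."""
    return " OR ".join(terms)


def all_of(*terms):
    """Join terms with AND."""
    return " AND ".join(terms)


def excluding(query, unwanted):
    """Append a NOT clause to an existing query."""
    return f"{query} NOT {unwanted}"


gene_wildcard = term("BRC*", "Gene Name")
chr1_frameshift = all_of(term("1", "CHR"), f"({term('frameshift', 'Function_Class')})")
chr1_or_chr2 = any_of(term("1", "CHR"), term("2", "CHR"))
known_method_only = excluding(chr1_or_chr2, term("unknown", "METHOD"))

for query in (gene_wildcard, chr1_frameshift, chr1_or_chr2, known_method_only):
    print(query)
```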
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous, coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP FASTA Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.
Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) in the FASTA sequence. It is always the 5' flank length plus 1
len / totalLen             Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by /
Lower or upper case        Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)
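The field descriptions above suggest a simple way to read a dbSNP FASTA header and to check the stated relationship between allelePos, the total length, and the flank lengths. This is a sketch under stated assumptions: the function names parse_dbsnp_header and flanks_are_consistent are hypothetical, the alleles are assumed to be separated by /, and the flanks in the example are placeholder strings of the right length rather than real sequence.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header into a dictionary of fields."""
    parts = header.lstrip(">").strip().split("|")
    record = {"source": parts[0], "database": parts[1], "accession": parts[2]}
    # ss = submitted SNP, rs = refSNP cluster (per the field table).
    record["record_class"] = {"ss": "submitted", "rs": "refSNP cluster"}.get(
        parts[2][:2], "unknown"
    )
    for part in parts[3:]:
        # Fields without "=" (submitter handle and ID) are skipped in this sketch.
        if "=" in part:
            key, value = part.split("=", 1)
            record[key.strip()] = value.strip().strip("'\"")
    return record


def flanks_are_consistent(record, five_prime, three_prime):
    """Check allelePos = 5' length + 1 and total = 5' + 1 + 3'."""
    total = int(record["totalLen"] if "totalLen" in record else record["len"])
    expected_pos = len(five_prime) + 1
    expected_total = len(five_prime) + 1 + len(three_prime)
    return int(record["allelePos"]) == expected_pos and total == expected_total


header = (
    ">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
    "snpclass=1|alleles=A/C|mol=Genomic|build=130"
)
record = parse_dbsnp_header(header)
alleles = record["alleles"].split("/")

# Placeholder flanks of 300 bases each, not the real flanking sequence.
print(record["accession"], alleles, flanks_are_consistent(record, "A" * 300, "C" * 300))
```

The same check applies to the ss5586300 record, which uses len instead of totalLen: a 213-base 5' flank gives allelePos 214, and 213 + 1 + 261 gives 475.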
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI - OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants
Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method called "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
Department of Health Information Management
FASTA Formatgtgi|371502116|ref|NM_0011261132| Homo sapiens tumor protein p53 (TP53) transcript variant 4 mRNA GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
Department of Health Information Management
Too Many Results
Department of Health Information Management
Search Limits
Department of Health Information Management
Reduced Search Results
Department of Health Information Management
Gene Record
Department of Health Information Management
RefSeqbull Database of reference sequences
ndash httpwwwncbinlmnihgovRefSeq
bull Curatedndash Many experimentally validated
ndash Some partially validated via ESTs
ndash Some computationally predicted
bull Non-redundant one record for each gene or each splice variant from each organism represented
bull Status Codesndash Provisional (temporary)
ndash Reviewed
ndash Predicted
Department of Health Information Management
Department of Health Information Management
Page 26
Accession Numbersbull DNA sequences and other molecular data are
tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence
bull RefSeq identifiers include the following formatsndash Complete chromosome NC_
ndash Genomic contig NT_
ndash mRNA (DNA format) NM_ XM_
ndash Protein NP_ XP_
EST
Department of Health Information Management
ESTbull mRNA Genomic regions actively transcribed in
cellbull cDNA (complementary DNA)
ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA
bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify
bull Genes Alternative splicing etc
Department of Health Information Management
dbEST (release 120111 Dec 1
2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml
bull Number of Entries 71276166ndash Homo sapiens (human) 8315294
ndash Mus musculus (mouse) 4853562
ndash Arabidopsis thaliana (thale cress) 1529700
ndash Danio rerio (zebrafish) 1488275
ndash Drosophila melanogaster (fruit fly) 821005
ndash Gallus gallus (chicken) 600433
Department of Health Information Management
Access to dbEST Databull EST sequences are included in the EST division of
GenBank available from NCBI by anonymous ftp and through Entrez
bull The nucleotide sequences may be searched using the BLAST server
bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov
Protein Structure
Department of Health Information Management
Cn3D ftpftpncbinihgovcn3dCn3D-43msi
Department of Health Information Management
Crystal Structure of A Protein
Department of Health Information Management
Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains
httpwwwebiacukinterprobull Protein Information Resource (PIR)
httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences
httpwwwexpasychsprot bull UniProt
httpwwwexpasyuniprotorgindexshtml
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record accessions start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) accessions start with rs
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
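A header like the one in this record can be split into named fields with a few lines of Python. This is a minimal sketch, not an NCBI tool. The function name and the `namespace`/`database`/`accession` keys are illustrative, and the `/` between the alleles is an assumption, since the slide text shows the two alleles without a separator.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into a dictionary of fields."""
    fields = header.lstrip(">").split("|")
    # The first three fields are positional; the rest are key=value pairs.
    record = {"namespace": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value
    return record

header = ">gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic"
rec = parse_dbsnp_header(header)
print(rec["accession"], rec["allelePos"], rec["alleles"])
```

All values are kept as strings; a caller that needs numbers would convert `allelePos` and `len` with `int()`.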
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
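The code table maps directly onto a lookup dictionary. The sketch below encodes the table as listed and expands a code into the bases it can stand for; the names `IUPAC` and `expand` are illustrative.

```python
# IUPAC nucleotide codes, taken from the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def expand(code: str) -> list:
    """Return the sorted list of bases represented by one IUPAC code."""
    return sorted(IUPAC[code.upper()])

print(expand("R"))  # the variant position in the ss5586300 record
print(expand("M"))
```

An unknown character raises `KeyError`, which is usually preferable to silently passing bad sequence data through.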
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records; batch search; allows SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                  Description
BRC*[Gene Name]                          Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])  Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                         Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]     Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
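Queries of this form can also be assembled programmatically. The sketch below only builds the query string and a URL for the NCBI E-utilities `esearch` endpoint; it does not contact the server. The helper names are illustrative. It is also an assumption that the field tags shown on the slide (`CHR`, `Function_Class`) are still accepted by the current service.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def term(value: str, field: str) -> str:
    """Format one search term with its field tag, e.g. 1[CHR]."""
    return f"{value}[{field}]"

def esearch_url(query: str, db: str = "snp") -> str:
    """Build (but do not fetch) an esearch URL for the given query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query})

query = f"{term('1', 'CHR')} AND {term('frameshift', 'Function_Class')}"
url = esearch_url(query)
print(query)
print(url)
```

Keeping query construction separate from the network call makes it easy to check the query text before sending anything.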
Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP FASTA Header Format
Header: the FASTA header line starts with > and has fields separated by |. Each field is explained below.
gnl: internal use
dbSNP: database name
ss or rs number: dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos: position of the variation allele (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen: total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation; the variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id: only for submitted SNPs; the two fields after totalLen are the submitter handle and the submitter SNP id
taxid: NCBI taxonomy id
mol: molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles: lists the alleles of the SNP, separated by /
Lower or upper case: lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG in green: assay sequence (observed by the submitter)
ATCG in black: flank sequence (extracted from sequence databases)
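The two length rules (allelePos is the 5' flank length plus 1, and the variation counts as one base in totalLen) can be checked with simple arithmetic. A minimal sketch, with an illustrative function name and placeholder flank sequences:

```python
def expected_header_values(flank5: str, flank3: str) -> tuple:
    """Return (allelePos, totalLen) implied by the two flank sequences."""
    allele_pos = len(flank5) + 1               # 1-based position of the IUPAC code
    total_len = len(flank5) + 1 + len(flank3)  # the variation counts as one base
    return allele_pos, total_len

# Placeholder flanks of 300 bases each, matching the flank lengths implied
# by the rs799916 header (allelePos=301, totalLen=601).
print(expected_header_values("A" * 300, "C" * 300))
```

The same check applied to a parsed record is a quick way to detect truncated or mis-joined sequence lines.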
GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder, or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country’s leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions
HuGE Navigator

Finding Disease Causing Genes

Finding Gene’s Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases, and what is known about them, are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method “genome-wide association study” (GWAS). The journal could be “PLoS Genetics”, “Nature Genetics”, etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
GenBank Record (Sequence)
ORIGIN
1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
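The ORIGIN block stores the sequence in groups of ten bases, each line prefixed by the position of its first base. A minimal sketch for turning such a block into one continuous sequence follows. The function name is illustrative, and the example uses only the first two lines of the record above plus the `//` terminator that ends a GenBank record.

```python
def origin_to_sequence(origin_block: str) -> str:
    """Strip position counters and spacing from a GenBank ORIGIN block."""
    bases = []
    for line in origin_block.splitlines():
        line = line.strip()
        if not line or line.startswith("ORIGIN") or line == "//":
            continue
        # Each data line is a position counter followed by groups of bases.
        bases.extend(part for part in line.split() if not part.isdigit())
    return "".join(bases).upper()

block = """ORIGIN
        1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
       61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
//"""
seq = origin_to_sequence(block)
print(len(seq), seq[:10])
```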
FASTA Format
>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
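A FASTA record is a header line starting with `>` followed by one or more sequence lines. The sketch below reads FASTA text into a dictionary keyed by header. It is a minimal illustration with a toy record rather than a replacement for an established parser, and the function name is hypothetical.

```python
def parse_fasta(text: str) -> dict:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}

example = ">seq1 toy record\nGATGGGATTG\nGGGTTTTCCC\n"
print(parse_fasta(example))
```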
Too Many Results

Search Limits

Reduced Search Results

Gene Record

RefSeq
• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted
Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers, which are used to identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
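The prefix conventions listed above lend themselves to a small lookup. The sketch below covers only the prefixes named on the slide. The function name is illustrative, and any other prefix is reported as unknown rather than guessed.

```python
# RefSeq accession prefixes as listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}

def refseq_type(accession: str) -> str:
    """Classify a RefSeq accession by its three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")

print(refseq_type("NM_001126113.2"))
print(refseq_type("AB123456"))
```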
EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using mRNA as a template
  – Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5’ or 3’
  – Typical size: 200–500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP, in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure

Cn3D
ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains), which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  – CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  – Pfam: http://www.sanger.ac.uk/Software/Pfam
  – PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins is available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure
PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there are 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase
http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results
Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)
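The percentages translate into rough base counts once a genome size is fixed. The sketch below assumes a haploid human genome of about 3 billion base pairs, a round figure that is not stated on this slide, so the outputs are order-of-magnitude estimates only.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: roughly 3 billion bp, not from the slide

# 0.1% of positions differ between two unrelated individuals.
differing_positions = GENOME_SIZE_BP // 1000
# About 90% of those differences are SNPs.
snp_differences = differing_positions * 90 // 100

print(differing_positions)
print(snp_differences)
```

Under this assumption, two unrelated individuals differ at about 3 million positions, of which roughly 2.7 million are single-base variations.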
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1–100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at a frequency of <1%, it is considered a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
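The frequency-based terminology amounts to a threshold test. In the sketch below the function name is illustrative. Treating a frequency of exactly 1% as a mutation is an arbitrary choice made here, because the definitions above cover only values above and below 1%.

```python
def classify_single_base_change(allele_frequency: float, threshold: float = 0.01) -> str:
    """Label a single-base change as 'SNP' or 'mutation' by population frequency."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    # Exactly at the threshold is treated as a mutation (arbitrary choice).
    return "SNP" if allele_frequency > threshold else "mutation"

print(classify_single_base_change(0.05))
print(classify_single_base_change(0.001))
```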
SNPs
• SNPs can occur anywhere in a genome; they are classified by location
  – Many SNPs lie in non-coding genomic regions
  – SNPs in gene regions, including the promoter region, coding region, intronic and exonic regions, UTRs, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP depends strongly on where it occurs (gene or non-gene region) and on the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington’s disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, interactions among those genes, and interactions between the genes and environmental factors
  – e.g., cancers, heart disease, Alzheimer's disease, diabetes, asthma, obesity, etc.
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
FASTA Format
>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
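A FASTA record like the one on this slide can be read with a few lines of code. The sketch below is only an illustration: it assumes the header starts with ">" and uses "|" to separate fields (as in NCBI-style headers), and the helper name `parse_fasta` and the shortened two-line sequence are made up for the example.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened example: only the first bases of the TP53 record are used here.
example = """>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGG
CGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGT
"""

records = parse_fasta(example)
header, sequence = records[0]
fields = header.split("|")
accession = fields[3]            # accession with version number
description = fields[4].strip()  # free-text description
```

Sequence lines are joined into one string, so line wrapping in the file does not affect the result.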
Too Many Results
Search Limits
Reduced Search Results
Gene Record
RefSeq
• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted
Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers that identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
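The prefix list above maps directly onto a lookup table. The following is a minimal sketch that covers only the prefixes named on the slide; the function name `refseq_record_type` is hypothetical, and real RefSeq has further prefixes that this table does not know about.

```python
# Prefix-to-type table taken from the slide; other RefSeq prefixes exist.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_record_type(accession):
    """Return the record type implied by the accession prefix, or 'unknown'."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


tp53_type = refseq_record_type("NM_001126113.2")
other_type = refseq_record_type("AB123456")
```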
EST
EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA made using the mRNA as a template
  – Sequence is complementary to the mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
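To make "complementary to mRNA" concrete, the sketch below derives the first-strand cDNA for a short, made-up mRNA fragment. Both strands are written 5' to 3', so the complement is also reversed; the function name and the example sequence are illustrative assumptions.

```python
# Base pairing used when DNA is synthesized from an RNA template.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the cDNA strand (5'->3') complementary to an mRNA given 5'->3'."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))


mrna_fragment = "AUGGCC"  # hypothetical fragment, not taken from a real record
cdna_fragment = first_strand_cdna(mrna_fragment)
```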
dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433
Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure
Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi
Crystal Structure of a Protein
Protein Databases
• Proteins have structure and function
• InterPro: protein families and domains, http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL: curated protein sequences, http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml
Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  • CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  • Pfam: http://www.sanger.ac.uk/Software/Pfam
  • PROSITE: http://www.expasy.org/prosite
Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins is available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure
PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references
PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among different individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history
Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
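The frequency-based terminology can be written as a small rule. This is a sketch of the 1% convention from the slide only; the slide does not say what to call a change at exactly 1%, so the "borderline" label is an assumption of this example, as is the function name.

```python
SNP_FREQUENCY_THRESHOLD = 0.01  # the 1% cut-off from the slide


def classify_single_base_change(allele_frequency):
    """Label a single-base change by its population allele frequency (0.0 to 1.0)."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if allele_frequency > SNP_FREQUENCY_THRESHOLD:
        return "SNP"
    if allele_frequency < SNP_FREQUENCY_THRESHOLD:
        return "mutation"
    # Exactly 1% is not covered by the slide; flagged rather than guessed.
    return "borderline"


common_variant = classify_single_base_change(0.05)
rare_variant = classify_single_base_change(0.001)
```

As the last bullet notes, databases do not enforce this distinction, so a record's presence in a SNP database says nothing about which label applies.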
SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease
The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription, quantitatively or qualitatively
  – Affect gene translation, quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps
Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.
Sickle Cell Anemia
• Due to a single nucleotide change, an A swapped for a T, which causes the amino acid inserted in hemoglobin to be valine instead of glutamic acid
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
Figure: (1) normal red blood cells; (2) sickled red blood cells
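The single-base swap can be traced at the codon level. The sketch below uses the commonly cited beta-globin codon change GAG to GTG; the codon table holds only the two entries needed here, and the helper name `substitute_base` is made up for the example.

```python
# Two entries of the standard genetic code (DNA coding-strand codons).
CODON_TO_AMINO_ACID = {"GAG": "Glu", "GTG": "Val"}


def substitute_base(codon, position, new_base):
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal_codon = "GAG"                                  # encodes glutamic acid
sickle_codon = substitute_base(normal_codon, 1, "T")  # middle A replaced by T
normal_aa = CODON_TO_AMINO_ACID[normal_codon]
sickle_aa = CODON_TO_AMINO_ACID[sickle_codon]
```

Because the change alters the encoded amino acid, it is a non-synonymous substitution in the sense used on the earlier slide.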
A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at a given locus (or loci)
Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (SeattleSNPs)
  – http://gvs.gs.washington.edu/GVS
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) exceed 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
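The two SNP figures on this slide are consistent with each other, which a short calculation shows. The genome size of about 3 billion base pairs is an assumption brought in for the example; it does not appear on the slide. The last line applies the 0.1% between-individual difference from the Polymorphisms slide to the same genome size.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumed approximate human genome size
BP_PER_SNP = 300                # "1 per 300 bp" from the slide

# One SNP every 300 bp across the assumed genome size.
expected_snps = GENOME_SIZE_BP // BP_PER_SNP

# 0.1% of positions differing between two unrelated individuals.
pairwise_differences = GENOME_SIZE_BP // 1000
```

Under that assumption, the density of 1 per 300 bp gives the slide's "roughly 10 million", and the 0.1% difference corresponds to about 3 million positions.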
dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (ss) record IDs start with ss
  – Computationally annotated records
    • Generated by computation during the dbSNP build cycle, based on the original submitted data; reference SNP cluster (refSNP) IDs start with rs
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
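The code table translates directly into a dictionary, which is handy for reading the variant position in a dbSNP FASTA record. A minimal sketch; the function name `expand_iupac` is hypothetical.

```python
# IUPAC nucleotide codes and the bases each one stands for (from the table).
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def expand_iupac(code):
    """Return the sorted list of bases represented by one IUPAC code."""
    return sorted(IUPAC_CODES[code.upper()])


r_alleles = expand_iupac("R")  # the code at the variant position of an A/G SNP
m_alleles = expand_iupac("M")  # the code at the variant position of an A/C SNP
```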
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of ss records, batch search, SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example: Description
BRC*[Gene Name]: Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class]): Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]: Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]: Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
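Queries like these can also be assembled in code and sent to NCBI programmatically. The sketch below only builds the request URL for the E-utilities ESearch endpoint and does not contact the server. The field tags are copied from the slide and may not match what the current dbSNP search accepts, and the helper names are made up for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value, field):
    """Format one search term with its field tag, e.g. 1[CHR]."""
    return f"{value}[{field}]"


def esearch_url(term, db="snp", retmax=20):
    """Build (but do not send) an ESearch request URL for the given query."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Second example from the table: frameshift SNPs on chromosome 1.
query = f"{field_term('1', 'CHR')} AND ({field_term('frameshift', 'Function_Class')})"
url = esearch_url(query)

# Decoding the URL again recovers the original query string.
decoded_term = parse_qs(urlsplit(url).query)["term"][0]
```

URL-encoding matters here because brackets, parentheses, and spaces are not safe to place in a URL unescaped.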
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous, coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP FASTA Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.
gnl: Internal use
dbSNP: Database name
ss or rs number: dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos: Variation allele position (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen: Total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id: Only for submitted SNPs; the two fields after totalLen are the submitter handle and the submitter SNP ID
taxid: NCBI taxonomy ID
mol: Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles: Lists the alleles of the SNP, separated by /
Lower or upper case: Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green): Green is used for assay sequence (observed by the submitter)
ATCG (black): Black is used for flank sequence (extracted from sequence databases)
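The field descriptions can be checked against the rs799916 header shown on the SNP Location slide. The sketch below assumes the header is on a single line and that the alleles are separated by "/" (that separator was lost in this transcript); the function name is hypothetical, and fields without "=" (such as a submitter handle) are simply skipped.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header into its fixed and key=value fields."""
    parts = [p for p in header.lstrip(">").strip().split("|") if p]
    record = {"namespace": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value.strip("'\"")
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
record = parse_dbsnp_header(header)

# allelePos is the 5' flank length plus 1; the variation itself counts as 1 base.
five_prime_flank = int(record["allelePos"]) - 1
three_prime_flank = int(record["totalLen"]) - int(record["allelePos"])
alleles = record["alleles"].split("/")
```

For this record both flanks come out at 300 bases, which matches the rule that totalLen is the sum of the two flanks plus one position for the variation.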
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI: OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants
Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
Too Many Results
Department of Health Information Management
Search Limits
Department of Health Information Management
Reduced Search Results
Department of Health Information Management
Gene Record
Department of Health Information Management
RefSeqbull Database of reference sequences
ndash httpwwwncbinlmnihgovRefSeq
bull Curatedndash Many experimentally validated
ndash Some partially validated via ESTs
ndash Some computationally predicted
bull Non-redundant one record for each gene or each splice variant from each organism represented
bull Status Codesndash Provisional (temporary)
ndash Reviewed
ndash Predicted
Department of Health Information Management
Department of Health Information Management
Page 26
Accession Numbersbull DNA sequences and other molecular data are
tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence
bull RefSeq identifiers include the following formatsndash Complete chromosome NC_
ndash Genomic contig NT_
ndash mRNA (DNA format) NM_ XM_
ndash Protein NP_ XP_
EST
Department of Health Information Management
ESTbull mRNA Genomic regions actively transcribed in
cellbull cDNA (complementary DNA)
ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA
bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify
bull Genes Alternative splicing etc
Department of Health Information Management
dbEST (release 120111 Dec 1
2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml
bull Number of Entries 71276166ndash Homo sapiens (human) 8315294
ndash Mus musculus (mouse) 4853562
ndash Arabidopsis thaliana (thale cress) 1529700
ndash Danio rerio (zebrafish) 1488275
ndash Drosophila melanogaster (fruit fly) 821005
ndash Gallus gallus (chicken) 600433
Department of Health Information Management
Access to dbEST Databull EST sequences are included in the EST division of
GenBank available from NCBI by anonymous ftp and through Entrez
bull The nucleotide sequences may be searched using the BLAST server
bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov
Protein Structure
Department of Health Information Management
Cn3D ftpftpncbinihgovcn3dCn3D-43msi
Department of Health Information Management
Crystal Structure of A Protein
Department of Health Information Management
Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains
httpwwwebiacukinterprobull Protein Information Resource (PIR)
httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences
httpwwwexpasychsprot bull UniProt
httpwwwexpasyuniprotorgindexshtml
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted records
    • The original observations of sequence variation; submitted SNP (ss) accessions start with "ss"
  - Computationally annotated records
    • Generated during the dbSNP build cycle by computation on the original submitted data; reference SNP cluster (refSNP) accessions start with "rs"
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
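The IUPAC table can be written as a lookup table, which makes it easy to find the variable positions in a dbSNP flanking sequence. This is a small sketch; the names `IUPAC_CODES`, `possible_bases`, and `ambiguous_positions` are illustrative and not part of any NCBI tool.

```python
# IUPAC nucleotide codes, as listed in the table above.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def possible_bases(code: str) -> list[str]:
    """Return the bases that one IUPAC code can stand for."""
    return list(IUPAC_CODES[code.upper()])


def ambiguous_positions(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position, code) for every code that means more than one base.

    Spaces used to group the sequence into blocks are ignored.
    """
    compact = sequence.replace(" ", "")
    return [
        (index, code)
        for index, code in enumerate(compact, start=1)
        if len(IUPAC_CODES[code.upper()]) > 1
    ]


print(possible_bases("R"))
print(ambiguous_positions("TTTAAAGRAG CCA R CTC"))
```

In the example fragment, both `R` positions are reported, each meaning "A or G" at that site.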
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of ss records, batch search, and SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                    Description
BRC*[Gene Name]                            SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           All SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       All SNPs on chromosome 1 or 2 detected by any method except unknown
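The queries above are typed into the Entrez web form. As a sketch of how the same kind of query string could be prepared for scripted use, the code below builds a request URL for NCBI's E-utilities `esearch` endpoint. Using E-utilities is an assumption added here, since the slides only describe the web interface, and the field tags are copied from the slide without checking that the current service still accepts them. The code only builds the URL and makes no network request; `field_term` and `build_snp_search_url` are hypothetical helper names.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value: str, field: str) -> str:
    """Format one search term in the value[Field] style used on the slide."""
    return f"{value}[{field}]"


def build_snp_search_url(term: str, max_results: int = 20) -> str:
    """Build (but do not send) an esearch URL for the SNP database."""
    query = urlencode({"db": "snp", "term": term, "retmax": max_results})
    return f"{ESEARCH_URL}?{query}"


# Second example from the table: chromosome 1 AND frameshift function class.
term = f"{field_term('1', 'CHR')} AND {field_term('frameshift', 'Function_Class')}"
url = build_snp_search_url(term)
print(url)
```

Building the term from small pieces keeps the boolean operators visible and leaves URL escaping of brackets and spaces to `urlencode`.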
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.
gnl: Internal use
dbSNP: Database name
ss or rs number: dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos: Variation allele position (1-based) on the FASTA sequence; it is always the 5' flank length plus 1
len/totalLen: Total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation; the variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id: Only for submitted SNPs; the two fields after totalLen are the submitter handle and the submitter SNP id
taxid: NCBI taxonomy id
mol: Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles: Lists the alleles of the SNP, separated by "/"
Lower or upper case: Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green): Green is used for assay sequence (observed by the submitter)
ATCG (black): Black is used for flank sequence (extracted from sequence databases)
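The field layout described above can be read with a few lines of code. The sketch below parses a header in that format and derives the flank lengths from allelePos and the total length. The function name `parse_snp_fasta_header` and the output keys are hypothetical, and the example header assumes the alleles are written with a "/" separator.

```python
def parse_snp_fasta_header(header: str) -> dict:
    """Parse a dbSNP FASTA header line into a dictionary of fields."""
    fields = header.strip().lstrip(">").split("|")
    record = {
        "namespace": fields[0],   # "gnl", internal use
        "database": fields[1],    # "dbSNP"
        "accession": fields[2],   # ss or rs number
    }
    record["record_type"] = "submitted" if fields[2].startswith("ss") else "refSNP"

    # Remaining fields are key=value pairs; fields without "=" (such as the
    # submitter handle and submitter SNP id) are skipped in this sketch.
    for field in fields[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value

    for key in ("allelePos", "len", "totalLen"):
        if key in record:
            record[key] = int(record[key])

    # allelePos is the 5' flank length plus 1; the variation itself counts as 1.
    total = record.get("totalLen", record.get("len"))
    if total is not None and "allelePos" in record:
        record["flank5_len"] = record["allelePos"] - 1
        record["flank3_len"] = total - record["allelePos"]
    return record


example = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
           "snpclass=1|alleles=A/C|mol=Genomic|build=130")
parsed = parse_snp_fasta_header(example)
print(parsed["accession"], parsed["record_type"])
print(parsed["flank5_len"], parsed["flank3_len"])
```

For the rs799916 header, the variation sits at position 301 of 601 bases, so both flanks come out at 300 bases.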
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI: OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated from the results of published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - Description and clinical features of a disorder, or of a gene involved in genetic disorders
  - Biochemical and other features
  - Cytogenetics and mapping
  - Molecular and population genetics
  - Diagnosis and clinical management
  - Animal models for the disorder
  - Allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; when it is retrieved, view its variants
Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - Conducting population-based genomic research
  - Assessing the role of family health history in disease risk and prevention
  - Supporting a systematic process for evaluating genetic tests
  - Translating genomics into public health research and programs
  - Strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - Information on population prevalence of genetic variants
  - Gene-disease associations
  - Gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
Search Limits
Reduced Search Results
Gene Record
RefSeq
• Database of reference sequences
  - http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  - Many experimentally validated
  - Some partially validated via ESTs
  - Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  - Provisional (temporary)
  - Reviewed
  - Predicted
Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other molecular data record
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  - Complete chromosome: NC_
  - Genomic contig: NT_
  - mRNA (DNA format): NM_, XM_
  - Protein: NP_, XP_
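The prefix list above is enough to tell what kind of molecule a RefSeq accession refers to. A minimal sketch follows; the function name `refseq_molecule_type` is illustrative, the accession numbers in the example are made up, and prefixes not listed on the slide are reported as unknown rather than guessed.

```python
# Prefix-to-type mapping taken from the slide.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_molecule_type(accession: str) -> str:
    """Return the molecule type implied by a RefSeq accession prefix."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix, "unknown")


# Hypothetical accession numbers, used only to show the prefixes.
for accession in ("NC_000000", "NM_000000", "XP_000000", "AB000000"):
    print(accession, "->", refseq_molecule_type(accession))
```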
EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  - Copy of mRNA, made using mRNA as a template
  - Sequence is complementary to the mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  - Partial cDNA sequence
  - Can be 5' or 3'
  - Typical size: 200-500 bp
  - Represents mRNA actively transcribed in the cell
  - Used to identify
    • Genes, alternative splicing, etc.
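To make "complementary to the mRNA" concrete, the sketch below derives a first-strand cDNA sequence from a short mRNA string by base pairing (A with T, U with A, G with C, C with G) and reversing it, so the result reads 5' to 3'. The function name and the six-base mRNA are illustrative; a real EST would be a 200-500 bp read from one end of such a cDNA.

```python
# Base-pairing rules from mRNA bases to cDNA bases.
MRNA_TO_CDNA = str.maketrans("AUGC", "TACG")


def first_strand_cdna(mrna: str) -> str:
    """Return the cDNA strand complementary to `mrna`, written 5' to 3'."""
    return mrna.upper().translate(MRNA_TO_CDNA)[::-1]


mrna = "AUGGCC"  # made-up example sequence
cdna = first_strand_cdna(mrna)
print(mrna, "->", cdna)
```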
dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  - Homo sapiens (human): 8,315,294
  - Mus musculus (mouse): 4,853,562
  - Arabidopsis thaliana (thale cress): 1,529,700
  - Danio rerio (zebrafish): 1,488,275
  - Drosophila melanogaster (fruit fly): 821,005
  - Gallus gallus (chicken): 600,433
Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP, in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure
Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi
Crystal Structure of a Protein
Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml
Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains), which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  - CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  - Pfam: http://www.sanger.ac.uk/Software/Pfam
  - PROSITE: http://www.expasy.org/prosite
Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  - PDB: http://www.pdb.org
  - SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  - MMDB: http://www.ncbi.nlm.nih.gov/Structure
PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  - Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about the protein, and literature references
PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs; single-base variations)
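To get a feel for what a 0.1% difference means in absolute terms, the percentages can be applied to an approximate genome size. The 3 billion bp figure is an assumption added here, not a number from the slide, so the results are order-of-magnitude estimates only.

```python
GENOME_LENGTH_BP = 3_000_000_000  # assumption: approximate human genome size

# 0.1% of positions differ between two unrelated individuals (from the slide).
differing_positions = GENOME_LENGTH_BP // 1000

# About 90% of those differences are SNPs (from the slide).
snp_differences = differing_positions * 90 // 100

print(f"Differing positions: {differing_positions:,}")
print(f"Of which SNPs:       {snp_differences:,}")
```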
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history
Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeated in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions, or translocations of large DNA fragments
  - Often cause serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  - A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  - When a single base change occurs at <1%, it is considered a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
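The frequency-based terminology above amounts to a simple threshold rule, sketched below. The slide does not say how a frequency of exactly 1% is labeled; treating it as a mutation is an assumption made here, and the function name is illustrative.

```python
SNP_FREQUENCY_THRESHOLD = 0.01  # 1%, from the slide


def classify_single_base_change(allele_frequency: float) -> str:
    """Label a single base change as 'SNP' or 'mutation' by population frequency."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    # Assumption: a frequency of exactly 1% is labeled 'mutation'.
    if allele_frequency > SNP_FREQUENCY_THRESHOLD:
        return "SNP"
    return "mutation"


print(classify_single_base_change(0.05))   # 5% of the population
print(classify_single_base_change(0.001))  # 0.1% of the population
```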
SNPs
• SNPs can occur anywhere in a genome; they are classified by their location
  - Many SNPs lie in genomic non-coding regions
  - SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease
The Effect of SNPs
• The phenotypic consequence of a SNP depends strongly on where it occurs (gene or non-gene region) and on the nature of the mutation (synonymous or non-synonymous)
  - No consequence
  - Affect gene transcription, quantitatively or qualitatively
  - Affect gene translation, quantitatively or qualitatively
  - Change protein structure and function
  - Change gene regulation at different steps
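The synonymous versus non-synonymous distinction can be checked mechanically for a coding SNP by translating the codon before and after the change. The sketch below builds the standard genetic code table (an addition here; the slides do not show it) and compares the two amino acids. The function name `classify_coding_snp` is illustrative.

```python
# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base T
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def classify_coding_snp(reference_codon: str, variant_codon: str) -> str:
    """Return 'synonymous' if both codons encode the same amino acid."""
    reference_aa = CODON_TABLE[reference_codon.upper()]
    variant_aa = CODON_TABLE[variant_codon.upper()]
    return "synonymous" if reference_aa == variant_aa else "non-synonymous"


print(classify_coding_snp("GAG", "GAA"))  # both encode E (glutamic acid)
print(classify_coding_snp("GAG", "GTG"))  # E (glutamic acid) to V (valine)
```

A change in the third codon position is often silent, as in the first example, while the second example changes the protein sequence.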
Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  - e.g., Huntington's disease, cystic fibrosis
• Many complex diseases result from mutations in multiple genes, interactions among those genes, and interactions between genes and environmental factors
  - e.g., cancers, heart disease, Alzheimer's disease, diabetes, asthma, obesity
Department of Health Information Management
Reduced Search Results
Department of Health Information Management
Gene Record
Department of Health Information Management
RefSeqbull Database of reference sequences
ndash httpwwwncbinlmnihgovRefSeq
bull Curatedndash Many experimentally validated
ndash Some partially validated via ESTs
ndash Some computationally predicted
bull Non-redundant one record for each gene or each splice variant from each organism represented
bull Status Codesndash Provisional (temporary)
ndash Reviewed
ndash Predicted
Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
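The prefix list above can be turned into a small lookup. This is only a sketch: the function name and the fallback message are made up for illustration, and the accession strings in the demo are used purely for their prefixes.

```python
# Prefix-to-record-type mapping copied from the RefSeq identifier list above.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_record_type(accession: str) -> str:
    """Return the record type implied by the first three characters."""
    prefix = accession[:3].upper()
    # Hypothetical fallback text for anything outside the six prefixes listed.
    return REFSEQ_PREFIXES.get(prefix, "not a RefSeq prefix listed here")


if __name__ == "__main__":
    # Example strings only; what matters is the prefix, not the number.
    for acc in ["NM_000518", "NP_000509", "NC_000011", "U01317"]:
        print(acc, "->", refseq_record_type(acc))
```

An accession without one of these prefixes (the last example) is simply reported as not covered by the list, which is how a primary GenBank accession would show up.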
EST
EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, using mRNA as a template
  – Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5’ or 3’
  – Typical size: 200 - 500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433
Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov
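Once a FASTA flat file has been downloaded, reading it takes only a few lines. The sketch below assumes the usual FASTA layout (a header line starting with ">" followed by one or more sequence lines); the function name and the two example records are hypothetical and are not real dbEST entries.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequences may be wrapped over several lines and printed in
            # space-separated blocks; join them into one upper-case string.
            records[header] += line.replace(" ", "").upper()
    return records


if __name__ == "__main__":
    example = [
        ">est_example_1 hypothetical record",
        "ATAAACATGG ACTTTTACAA",
        "AACCCATATC",
        ">est_example_2 hypothetical record",
        "GTATACCACC",
    ]
    for name, seq in read_fasta(example).items():
        print(name, len(seq), seq)
```

For a real file, the same function can be called on an open file handle, since a file object yields its lines one at a time.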
Protein Structure
Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi
Crystal Structure of a Protein
Protein Databases
• Proteins have structure and function
• InterPro: protein families and domains, http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL: curated protein sequences, http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml
Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
• CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
• Pfam: http://www.sanger.ac.uk/Software/Pfam
• PROSITE: http://www.expasy.org/prosite
Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available due to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure
PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there are 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references
PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called Single Nucleotide Polymorphisms (SNPs, single-base variations)
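To get a feel for what these percentages mean in absolute terms, the short calculation below applies them to a genome of roughly 3 billion base pairs. The genome size is an assumption added here for illustration; it is not stated on the slide.

```python
# Assumption (not from the slide): approximate haploid human genome size.
GENOME_SIZE_BP = 3_000_000_000
IDENTITY = 0.999   # from the slide: two unrelated individuals are 99.9% identical
SNP_SHARE = 0.90   # from the slide: roughly 90% of the differences are SNPs

# Positions expected to differ between two unrelated individuals.
differing_positions = GENOME_SIZE_BP * (1 - IDENTITY)
# Portion of those differences attributed to SNPs.
snp_positions = differing_positions * SNP_SHARE

print(f"Differing positions: ~{differing_positions:,.0f}")
print(f"Of which SNPs:       ~{snp_positions:,.0f}")
```

Under these assumptions the difference amounts to a few million positions, most of them single-base changes.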
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among different individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues to ancestral human migration history
Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs of variable length, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensic, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often causing serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
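The 1% convention can be written as a one-line rule. In this sketch the function name is hypothetical, and treating a frequency of exactly 1% as a SNP is an assumption, since the slide only gives ">1%" and "<1%".

```python
def classify_single_base_change(allele_frequency: float) -> str:
    """Label a single base change using the 1% allele-frequency convention."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    # Assumption: exactly 1% is counted as a SNP so every value gets a label.
    return "SNP" if allele_frequency >= 0.01 else "mutation"


if __name__ == "__main__":
    for freq in (0.25, 0.012, 0.003):
        print(f"{freq:.3f} -> {classify_single_base_change(freq)}")
```

As the last bullet notes, databases do not enforce this distinction strictly, so a record's presence in a "SNP database" says little about its frequency.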
SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter region, coding region, intronic and exonic regions, UTR, etc.
    • Often play an important role in differentiation and disease
The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription, quantitatively or qualitatively
  – Affect gene translation, quantitatively or qualitatively
  – Change protein structure and functions
  – Change gene regulation at different steps
Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington’s, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer’s, diabetes, asthma, obesity, etc.
Sickle Cell Anemia
• Due to a single nucleotide substitution (an A swapped for a T), which causes the encoded amino acid to be valine instead of glutamic acid in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1. Normal red blood cells 2. Sickled red blood cells
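The synonymous versus non-synonymous distinction, and the sickle cell change in particular, can be checked against the standard genetic code. The sketch below builds the codon table from the conventional TCAG ordering. The GAG to GTG codon pair is the change usually cited for sickle cell anemia and is supplied here as background; the slide itself only names the A-to-T swap. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (TCAG order).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Translate both codons and report whether the amino acid changes."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return f"{ref_codon}->{alt_codon}: {ref_aa}->{alt_aa} ({kind})"


if __name__ == "__main__":
    # A-to-T swap in the middle position: glutamic acid (E) becomes valine (V).
    print(classify_codon_change("GAG", "GTG"))
    # A change in the third position that leaves the amino acid unchanged.
    print(classify_codon_change("GAG", "GAA"))
```

The first call illustrates a non-synonymous change; the second shows that a different single-base change in the same codon can leave the protein untouched.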
A Few Relevant Concepts
• Allele: a specific “version” of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has for any given locus (or loci)
Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (or single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (or short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) are over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • The short insertions/deletions are difficult to quantify, and the number is likely to fall between those of SNPs and VNTRs
dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted record
    • The original observations of sequence variation; submitted SNP (SS) records start with ss
  – Computationally annotated record
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP clusters (refSNP) start with rs
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
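The table translates directly into a dictionary, which makes it easy to list the concrete sequences an ambiguity code stands for. This is a minimal sketch; the function name is hypothetical, and the short test sequence is modelled on the "CCA R CTC" stretch around the variant position in the record shown earlier.

```python
from itertools import product
from typing import List

# IUPAC nucleotide codes, as listed in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand_iupac(sequence: str) -> List[str]:
    """Return every plain A/C/G/T sequence matched by an IUPAC-coded one."""
    # Spaces are dropped because dbSNP prints sequences in spaced blocks.
    choices = [IUPAC[base] for base in sequence.upper().replace(" ", "")]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    # R means "A or G", so one R yields two possible sequences.
    print(expand_iupac("CCA R CTC"))
```

The number of expansions multiplies with every ambiguity code, so this approach is only practical for short stretches around a variant.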
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, allows SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
• BRC*[Gene Name]
  – Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
• 1[CHR] AND (frameshift[Function_Class])
  – Search SNPs located on chromosome 1 with function class frame-shift
• 1[CHR] OR 2[CHR]
  – Search all SNPs on chromosome 1 or 2
• 1[CHR] OR 2[CHR] NOT unknown[METHOD]
  – Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
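Queries like these can also be assembled in code and sent to NCBI programmatically. The sketch below only builds the request URL for the E-utilities ESearch endpoint and does not contact the server. The helper names are hypothetical, and the field tags are taken from the slide as-is; whether the current dbSNP index still accepts each tag is an assumption that should be checked against the live search form.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(value: str, tag: str) -> str:
    """Format one search term as value[TAG], the style used on the slide."""
    return f"{value}[{tag}]"


def esearch_url(term: str, db: str = "snp", retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL for the given term."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': term, 'retmax': retmax})}"


if __name__ == "__main__":
    # Second example from the slide: frameshift SNPs on chromosome 1.
    term = f"{field('1', 'CHR')} AND {field('frameshift', 'Function_Class')}"
    print(term)
    print(esearch_url(term))
```

Building the term from small pieces keeps the brackets and boolean operators consistent, and `urlencode` takes care of escaping the brackets and spaces.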
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.
• gnl: internal use
• dbSNP: database name
• ss or rs number: dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
• allelePos: variation allele position (1-based) on the FASTA sequence. It is always the 5’ flank length plus 1
• len / totalLen: total number of bases of the FASTA sequence, the sum of the lengths of the 5’ flank, the 3’ flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
• handle|submitted_snp_id: only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
• taxid: NCBI taxonomy id
• mol: molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
• snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
• alleles: lists the alleles of the SNP, separated by "/"
• Lower or upper case: sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
• ATCG in green: assay sequence (observed by the submitter)
• ATCG in black: flank sequence (extracted from sequence databases)
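The field descriptions above are enough to parse a header and recover the flank lengths. The sketch below uses the rs799916 header as it appears in this transcript (where punctuation inside the alleles value may have been lost). The function names are hypothetical, and positional fields without "=" (such as the submitter handle) are simply skipped.

```python
from typing import Dict, Tuple


def parse_dbsnp_header(header: str) -> Dict[str, str]:
    """Split a dbSNP FASTA header into its named fields."""
    parts = [p for p in header.strip().lstrip(">").split("|") if p]
    fields = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        if "=" in part:  # key=value fields; positional fields are ignored
            key, value = part.split("=", 1)
            fields[key] = value.strip("\"'")
    return fields


def flank_lengths(fields: Dict[str, str]) -> Tuple[int, int]:
    """Return (5' flank length, 3' flank length) from allelePos and total length."""
    total = int(fields["totalLen"] if "totalLen" in fields else fields["len"])
    allele_pos = int(fields["allelePos"])
    # allelePos = 5' length + 1, and total = 5' length + 1 + 3' length.
    return allele_pos - 1, total - allele_pos


if __name__ == "__main__":
    header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
              "snpclass=1|alleles=AC|mol=Genomic|build=130")
    fields = parse_dbsnp_header(header)
    kind = "refSNP cluster" if fields["accession"].startswith("rs") else "submitted SNP"
    print(fields["accession"], kind, flank_lengths(fields))
```

For this record the variant sits in the middle of the sequence, with flanks of equal length on both sides.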
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI - OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a human genetic disorders database built and curated using results from published studies
• Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder, which contains the following information
  – Description and clinical features of a disorder or a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when retrieved, view its variants
Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country’s leading chronic, infectious, environmental, and occupational diseases
• OPHG’s efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding Gene’s Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
Gene Record
Department of Health Information Management
RefSeqbull Database of reference sequences
ndash httpwwwncbinlmnihgovRefSeq
bull Curatedndash Many experimentally validated
ndash Some partially validated via ESTs
ndash Some computationally predicted
bull Non-redundant one record for each gene or each splice variant from each organism represented
bull Status Codesndash Provisional (temporary)
ndash Reviewed
ndash Predicted
Department of Health Information Management
Department of Health Information Management
Page 26
Accession Numbersbull DNA sequences and other molecular data are
tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence
bull RefSeq identifiers include the following formatsndash Complete chromosome NC_
ndash Genomic contig NT_
ndash mRNA (DNA format) NM_ XM_
ndash Protein NP_ XP_
EST
Department of Health Information Management
ESTbull mRNA Genomic regions actively transcribed in
cellbull cDNA (complementary DNA)
ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA
bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify
bull Genes Alternative splicing etc
Department of Health Information Management
dbEST (release 120111 Dec 1
2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml
bull Number of Entries 71276166ndash Homo sapiens (human) 8315294
ndash Mus musculus (mouse) 4853562
ndash Arabidopsis thaliana (thale cress) 1529700
ndash Danio rerio (zebrafish) 1488275
ndash Drosophila melanogaster (fruit fly) 821005
ndash Gallus gallus (chicken) 600433
Department of Health Information Management
Access to dbEST Databull EST sequences are included in the EST division of
GenBank available from NCBI by anonymous ftp and through Entrez
bull The nucleotide sequences may be searched using the BLAST server
bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov
Protein Structure
Department of Health Information Management
Cn3D ftpftpncbinihgovcn3dCn3D-43msi
Department of Health Information Management
Crystal Structure of A Protein
Department of Health Information Management
Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains
httpwwwebiacukinterprobull Protein Information Resource (PIR)
httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences
httpwwwexpasychsprot bull UniProt
httpwwwexpasyuniprotorgindexshtml
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP Fasta Header Format
The Fasta header line starts with ">" and has fields separated by "|". Each field is explained below.

Header field               Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. "ss" refers to a submitted SNP accession; "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) on the fasta. It is always the 5' length plus 1
len / totalLen             Total number of bases of the fasta sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
taxid                      NCBI taxonomy id
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green color is used for assay sequence (observed by the submitter)
ATCG (black)               Black color is used for flank sequence (extracted from sequence databases)
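The header format above is regular enough to parse mechanically. The sketch below is a hypothetical helper, not an NCBI tool. It assumes the header punctuation is ">" at the start, "|" between fields, and "/" between alleles, and it reuses the rs799916 header from the SNP Location slide to check the rule that allelePos is the 5' flank length plus 1.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into its '|'-separated fields."""
    fields = header.lstrip(">").strip().split("|")
    record = {"prefix": fields[0], "database": fields[1], "accession": fields[2]}
    # Remaining fields are key=value pairs; fields without '=' (such as the
    # submitter handle and submitter SNP id) are skipped in this sketch.
    for field in fields[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip('"')
    # Submitted records use 'len', refSNP records use 'totalLen'; store one name.
    if "len" in record:
        record["totalLen"] = record.pop("len")
    for key in ("allelePos", "totalLen", "taxid"):
        if key in record:
            record[key] = int(record[key])
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
rec = parse_dbsnp_header(header)

# allelePos = 5' flank length + 1; totalLen = 5' flank + 1 (variation) + 3' flank.
five_prime_len = rec["allelePos"] - 1
three_prime_len = rec["totalLen"] - rec["allelePos"]
print(rec["accession"], five_prime_len, three_prime_len)  # rs799916 300 300
```

For rs799916 this gives 300 flanking bases on each side of the variation, which matches the 30 ten-base blocks shown before and after the M in the sequence.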
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI: OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information:
  – Description and clinical features of a disorder or a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants
Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
RefSeq
• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted
Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
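The prefix list above maps directly to a lookup table. This is a minimal sketch with hypothetical names (`REFSEQ_PREFIXES`, `refseq_molecule_type`). It covers only the prefixes named on the slide, and the accession in the example is a made-up placeholder rather than a real record.

```python
# RefSeq accession prefixes and molecule types, as listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_molecule_type(accession: str) -> str:
    """Return the molecule type implied by a RefSeq accession prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown")


# 'NM_000000' is a placeholder accession used only for illustration.
print(refseq_molecule_type("NM_000000"))  # mRNA
print(refseq_molecule_type("AB123456"))   # unknown
```

Anything outside the six listed prefixes is reported as "unknown" rather than guessed.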
EST
EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using mRNA as a template
  – Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
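To make "sequence is complementary to mRNA" concrete, here is a small sketch that derives the first-strand cDNA from an mRNA string. The function name and the six-base input are illustrative assumptions. Both sequences are written 5' to 3', so the complement is also reversed.

```python
# Base pairing used when copying mRNA (A, U, G, C) into DNA (A, T, G, C).
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the first-strand cDNA (5'->3') for an mRNA given 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


print(first_strand_cdna("AUGGCC"))  # GGCCAT
```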
dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433
Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous ftp and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous ftp, in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure
Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi
Crystal Structure of A Protein
Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml
Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains), which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  • CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  • Pfam: http://www.sanger.ac.uk/Software/Pfam
  • PROSITE: http://www.expasy.org/prosite
Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins is available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure
PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there are 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references
PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called Single Nucleotide Polymorphisms (SNPs, single-base variations)
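The percentages above translate into rough absolute numbers. The sketch below assumes a genome size of about 3 billion base pairs, a round figure that does not appear on this slide. The results are order-of-magnitude estimates, not measurements.

```python
# Assumed haploid human genome size in base pairs (not stated on the slide).
GENOME_SIZE_BP = 3_000_000_000

# 0.1% of positions differ between two unrelated individuals.
differing_positions = GENOME_SIZE_BP // 1000

# About 90% of those differences are SNPs.
snp_differences = differing_positions * 90 // 100

print(differing_positions)  # 3000000
print(snp_differences)      # 2700000
```

Under that assumption, two unrelated genomes differ at roughly 3 million positions, of which about 2.7 million are single-base differences.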
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among different individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history
Major Types of Genetic Variations
• Single nucleotide mutation
  – Majority of SNPs do NOT directly contribute to any phenotypes
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of variable length, usually 1-100 bases long, of sequence motifs repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensic, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocation of large DNA fragments
  – Often causing serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
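The 1% rule above is a simple threshold. The sketch below uses a hypothetical function name. The slide does not say how to label a frequency of exactly 1%, so treating it as a mutation here is an arbitrary choice made only for illustration.

```python
SNP_FREQUENCY_THRESHOLD = 0.01  # the 1% cut-off from the slide


def classify_single_base_change(allele_frequency: float) -> str:
    """Label a single base change as 'SNP' (>1%) or 'mutation' (otherwise)."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    # Exactly 1% is not covered by the slide; this sketch calls it a mutation.
    return "SNP" if allele_frequency > SNP_FREQUENCY_THRESHOLD else "mutation"


print(classify_single_base_change(0.05))   # SNP
print(classify_single_base_change(0.001))  # mutation
```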
SNPs
• SNPs can occur anywhere on a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter region, coding region, intronic and exonic regions, UTR, etc.
    • Often play an important role in differentiation and disease
The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and functions
  – Change gene regulation at different steps
Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's, diabetes, asthma, obesity, etc.
Sickle Cell Anemia
• Caused by a single nucleotide change (an A swapped for a T), which makes the inserted amino acid valine instead of glutamic acid in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1. Normal red blood cells 2. Sickled red blood cells
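The sickle cell change is a convenient worked example of a non-synonymous substitution. The sketch below assumes the commonly cited codon change GAG to GTG (glutamic acid to valine), which the slide does not spell out. It uses a deliberately tiny codon table with only the three codons needed here, and the function name is hypothetical.

```python
# Minimal excerpt of the standard genetic code (DNA codons, coding strand).
CODON_TABLE = {
    "GAG": "Glu",  # glutamic acid
    "GAA": "Glu",  # glutamic acid
    "GTG": "Val",  # valine
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Return 'synonymous' or 'non-synonymous' for a single-codon change."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"


# A -> T in the middle position: Glu -> Val, the sickle cell change.
print(classify_substitution("GAG", "GTG"))  # non-synonymous
# G -> A in the third position: Glu -> Glu, no amino acid change.
print(classify_substitution("GAG", "GAA"))  # synonymous
```

The two calls show why the location and nature of a single-base change matter: the same codon can tolerate one substitution silently while another changes the protein.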
A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has for any given locus (or loci)
Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (or single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (or short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) are over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify, and their number is likely to fall between those of SNPs and VNTRs
dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted record
    • The original observations of sequence variation; submitted SNP (SS) record accessions start with "ss"
  – Computationally annotated record
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) accessions start with "rs"
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases
Department of Health Information Management
Disease Databasesbull Genes are involved in disease
bull Many diseases are well studied
bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim
ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml
Department of Health Information Management
Homework 1bull Using PubMed search for a recent paper related to genetic
disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc
bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation
bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
Department of Health Information Management
Page 26
Accession Numbersbull DNA sequences and other molecular data are
tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence
bull RefSeq identifiers include the following formatsndash Complete chromosome NC_
ndash Genomic contig NT_
ndash mRNA (DNA format) NM_ XM_
ndash Protein NP_ XP_
EST
Department of Health Information Management
ESTbull mRNA Genomic regions actively transcribed in
cellbull cDNA (complementary DNA)
ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA
bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify
bull Genes Alternative splicing etc
Department of Health Information Management
dbEST (release 120111 Dec 1
2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml
bull Number of Entries 71276166ndash Homo sapiens (human) 8315294
ndash Mus musculus (mouse) 4853562
ndash Arabidopsis thaliana (thale cress) 1529700
ndash Danio rerio (zebrafish) 1488275
ndash Drosophila melanogaster (fruit fly) 821005
ndash Gallus gallus (chicken) 600433
Department of Health Information Management
Access to dbEST Databull EST sequences are included in the EST division of
GenBank available from NCBI by anonymous ftp and through Entrez
bull The nucleotide sequences may be searched using the BLAST server
bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov
Protein Structure
Department of Health Information Management
Cn3D ftpftpncbinihgovcn3dCn3D-43msi
Department of Health Information Management
Crystal Structure of A Protein
Department of Health Information Management
Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains
httpwwwebiacukinterprobull Protein Information Resource (PIR)
httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences
httpwwwexpasychsprot bull UniProt
httpwwwexpasyuniprotorgindexshtml
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results
Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among different individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues of ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
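The frequency rule on this slide can be written as a small check. This is a minimal sketch, not a database convention: the function name is made up for illustration, and because the slide does not say what happens at exactly 1%, the sketch assigns that boundary case to "mutation" as an assumption.

```python
SNP_FREQUENCY_THRESHOLD = 0.01  # the 1% cutoff stated on the slide


def classify_single_base_change(allele_frequency: float) -> str:
    """Label a single-base change as 'SNP' or 'mutation' by population frequency.

    Assumption: a frequency of exactly 1% is labeled 'mutation', since the
    slide only defines the cases above and below the cutoff.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if allele_frequency > SNP_FREQUENCY_THRESHOLD:
        return "SNP"
    return "mutation"


for frequency in (0.25, 0.002):
    print(frequency, classify_single_base_change(frequency))
```

As the last bullet notes, real databases do not enforce this distinction, so a label produced this way describes the terminology only.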

SNPs
• SNPs can occur anywhere on a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and functions
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single-base substitution (an A replaced by a T), which causes valine to be incorporated instead of glutamic acid in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
Figure labels: 1. Normal red blood cells; 2. Sickled red blood cells
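The sickle cell substitution is a compact example of a non-synonymous change. The sketch below uses the commonly cited GAG to GTG codon change in the beta-globin gene; the codon table is deliberately partial (only the codons needed here), and the function names are illustrative rather than taken from any library.

```python
# Partial codon table: only the codons needed for this illustration.
CODON_TO_AMINO_ACID = {
    "GAG": "glutamic acid",
    "GAA": "glutamic acid",
    "GTG": "valine",
}


def substitute_base(codon: str, index: int, new_base: str) -> str:
    """Return the codon with the base at a 0-based index replaced."""
    return codon[:index] + new_base + codon[index + 1:]


def describe_change(original: str, mutated: str) -> str:
    """Report whether a codon change alters the encoded amino acid."""
    before = CODON_TO_AMINO_ACID[original]
    after = CODON_TO_AMINO_ACID[mutated]
    kind = "synonymous" if before == after else "non-synonymous"
    return f"{original} ({before}) -> {mutated} ({after}): {kind}"


normal_codon = "GAG"
sickle_codon = substitute_base(normal_codon, 1, "T")  # the A is replaced by a T

print(describe_change(normal_codon, sickle_codon))
print(describe_change("GAG", "GAA"))  # a change that leaves the amino acid intact
```

The second call shows the contrasting case: a base change in the third codon position that still encodes glutamic acid.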

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at a given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) exceed 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
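The two SNP figures on this slide agree with each other, and with the 0.1% figure from the Polymorphisms slide, under a round genome length. The sketch below checks the arithmetic; the 3 billion bp genome length is an assumption introduced here, not a number from the slides.

```python
# Assumption: a round 3 billion bp human genome (not stated on the slides).
ASSUMED_GENOME_LENGTH_BP = 3_000_000_000

# "On average 1 per 300 bp" from the dbSNP slide.
AVERAGE_SNP_SPACING_BP = 300
estimated_snp_count = ASSUMED_GENOME_LENGTH_BP // AVERAGE_SNP_SPACING_BP

# "0.1% difference", of which about 90% are SNPs, from the Polymorphisms slide.
differing_positions = ASSUMED_GENOME_LENGTH_BP // 1000
snp_differences = differing_positions * 90 // 100

print("Estimated SNPs in the population:", estimated_snp_count)
print("Differing positions between two individuals:", differing_positions)
print("Of which SNPs (about 90%):", snp_differences)
```

Under that assumption, one SNP per 300 bp gives the "roughly 10 million" quoted above, while any two individuals differ at a smaller subset of those positions.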

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record accessions start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP Cluster (refSNP) accessions start with rs
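The ss/rs prefix alone tells you which class a dbSNP accession belongs to. A minimal sketch, assuming accessions have the form of a lowercase prefix followed by digits, as in the examples used in this lecture; the function name is hypothetical.

```python
import re

# Assumption: an accession is "ss" or "rs" followed only by digits.
ACCESSION_PATTERN = re.compile(r"(ss|rs)(\d+)")

RECORD_CLASSES = {
    "ss": "submitted record",
    "rs": "reference SNP cluster",
}


def describe_accession(accession: str) -> tuple[str, int]:
    """Return the record class and numeric id for a dbSNP accession."""
    match = ACCESSION_PATTERN.fullmatch(accession.strip())
    if match is None:
        raise ValueError(f"not an ss/rs accession: {accession!r}")
    prefix, number = match.groups()
    return RECORD_CLASSES[prefix], int(number)


print(describe_accession("ss5586300"))
print(describe_accession("rs799916"))
```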

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
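The table maps directly onto a dictionary, which makes it easy to go from an ambiguity code to its bases and back. This is a small sketch with illustrative function names; it covers only the fifteen codes listed above.

```python
# IUPAC nucleotide codes from the table above, as code -> possible bases.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def expand_code(code: str) -> set[str]:
    """Return the set of bases that an IUPAC code stands for."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles) -> str:
    """Return the single IUPAC code that represents a set of alleles."""
    wanted = {allele.upper() for allele in alleles}
    for code, bases in IUPAC_CODES.items():
        if set(bases) == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {sorted(wanted)}")


print(sorted(expand_code("R")))
print(code_for_alleles(["A", "C"]))
```

This is why a record with alleles A and G shows an R at the variant position, and a record with alleles A and C shows an M.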

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of ss records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
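Queries like the ones in this table can also be assembled programmatically. The sketch below builds the query strings and wraps one in an NCBI E-utilities esearch URL without sending any request. The helper names are hypothetical, and whether the field tags shown on these slides (CHR, Function_Class, METHOD) are still accepted by the current service is an assumption that has not been checked here.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; no request is sent in this sketch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value: str, field: str) -> str:
    """Format a value with an Entrez field tag, e.g. 1[CHR]."""
    return f"{value}[{field}]"


def build_esearch_url(term: str, database: str = "snp") -> str:
    """Return an esearch URL for the given query term (URL-encoded)."""
    return ESEARCH_URL + "?" + urlencode({"db": database, "term": term})


# Field tags are copied from the slide and may differ in the current service.
chr1_frameshift = (
    f"{field_term('1', 'CHR')} AND ({field_term('frameshift', 'Function_Class')})"
)
chr1_or_2_known_method = (
    f"{field_term('1', 'CHR')} OR {field_term('2', 'CHR')} "
    f"NOT {field_term('unknown', 'METHOD')}"
)

print(chr1_frameshift)
print(chr1_or_2_known_method)
print(build_esearch_url(chr1_frameshift))
```

Building the term as a plain string first keeps it identical to what you would type into the Entrez search box, so the same query can be tested by hand.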

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format
The Fasta header line starts with > and has fields separated by |. Each field is explained below.

Field: Description
gnl: Internal use
dbSNP: Database name
ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos: Variation allele position (1-based) on the fasta. It is always the 5' length plus 1
len / totalLen: Total number of bases of the fasta sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and submitter SNP id
taxid: NCBI taxonomy id
mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles: Lists alleles of the SNP, separated by /
Lower or upper case: Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
Green ATCG: Green color is used for assay sequence (observed by the submitter)
Black ATCG: Black color is used for flank sequence (extracted from sequence databases)
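The field descriptions above translate into a short parser. This is a minimal sketch under stated assumptions: the function names are made up for illustration, the example header is the rs799916 header from the SNP Location slide with the `/` allele separator assumed, and fields without an `=` (such as a submitter handle) are simply collected rather than interpreted.

```python
def parse_dbsnp_fasta_header(header: str) -> dict:
    """Split a dbSNP Fasta header into its named fields."""
    if not header.startswith(">"):
        raise ValueError("a Fasta header must start with '>'")
    parts = header[1:].strip().split("|")
    record = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    unnamed = []
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value
        elif part:
            unnamed.append(part)  # e.g. handle and submitted_snp_id
    record["unnamed_fields"] = unnamed
    return record


def flank_lengths(record: dict) -> tuple[int, int]:
    """Derive the 5' and 3' flank lengths from allelePos and the total length.

    allelePos is the 5' length plus 1, and the variation counts as 1 base.
    """
    allele_pos = int(record["allelePos"])
    total = int(record["totalLen"] if "totalLen" in record else record["len"])
    return allele_pos - 1, total - allele_pos


# Example header; the "/" between alleles is an assumption.
header = (
    ">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
    "snpclass=1|alleles=A/C|mol=Genomic|build=130"
)
record = parse_dbsnp_fasta_header(header)
print(record["accession"], record["alleles"], flank_lengths(record))
```

For this record the variant sits at position 301 of 601, so both flanks come out at 300 bases, matching the layout of the sequence on the SNP Location slide.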

GeneView of a SNP

Links to Various Gene Records
Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). Suitable journals include "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
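Because the molecule type is encoded in the accession prefix, it can be read off with a simple lookup. The sketch below covers only the six prefixes listed on this slide; the function name and the example accessions are made up to illustrate the format and are not real records.

```python
# RefSeq prefixes from the slide, mapped to the record type they indicate.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_record_type(accession: str) -> str:
    """Return the record type implied by a RefSeq accession prefix."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix, "unknown prefix")


# Hypothetical accessions, used only to show the prefix format.
for accession in ("NM_000000", "NP_000000", "NC_000000", "AB000000"):
    print(accession, "->", refseq_record_type(accession))
```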
EST

EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using mRNA as a template
  – Sequence is complementary to the mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
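The statement that cDNA is complementary to its mRNA template can be made concrete with a few lines of code. This is a minimal sketch: the function name and the six-base example sequence are invented for illustration, and the result is written 5' to 3', which is why the sequence is reversed as well as complemented.

```python
# Base pairing from an RNA template to the DNA strand copied from it.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the DNA strand complementary to an mRNA, written 5' to 3'."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))


example_mrna = "AUGGCC"  # made-up sequence for illustration
print(example_mrna, "->", first_strand_cdna(example_mrna))
```

An EST is then a short read, typically a few hundred bases, taken from one end of such a cDNA.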

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure
Department of Health Information Management
Cn3D ftpftpncbinihgovcn3dCn3D-43msi
Department of Health Information Management
Crystal Structure of A Protein
Department of Health Information Management
Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains
httpwwwebiacukinterprobull Protein Information Resource (PIR)
httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences
httpwwwexpasychsprot bull UniProt
httpwwwexpasyuniprotorgindexshtml
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways:
  - Search for a gene or a disease; when it is retrieved, view its variants
Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental and occupational diseases
• OPHG's efforts focus on:
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - information on population prevalence of genetic variants
  - gene-disease associations
  - gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in:
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
EST
EST
• mRNA: Genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  - Copy of mRNA, made using mRNA as a template
  - Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  - Partial cDNA sequence
  - Can be 5' or 3'
  - Typical size: 200 - 500 bp
  - Represents mRNA actively transcribed in the cell
  - Used to identify
    • Genes, alternative splicing, etc.
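The relationship between mRNA, cDNA and an EST described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `first_strand_cdna` and `est_tag` are hypothetical, the input sequence is made up, and a real EST comes from sequencing a cDNA clone rather than from slicing a string.

```python
# Watson-Crick pairing from an RNA template to a DNA copy (RNA has U instead of T).
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the complementary DNA strand, written 5'->3', for an mRNA written 5'->3'."""
    complement = [RNA_TO_DNA_COMPLEMENT[base] for base in mrna.upper()]
    # The copy is antiparallel to the template, so reverse it to read 5'->3'.
    return "".join(reversed(complement))


def est_tag(cdna: str, length: int, end: str = "5'") -> str:
    """Return a partial cDNA read of the given length from the 5' or 3' end."""
    if end == "5'":
        return cdna[:length]
    if end == "3'":
        return cdna[-length:]
    raise ValueError("end must be \"5'\" or \"3'\"")


if __name__ == "__main__":
    cdna = first_strand_cdna("AUGGCC")  # made-up six-base mRNA
    print(cdna, est_tag(cdna, 2, "5'"), est_tag(cdna, 2, "3'"))
```

With real data the `length` argument would typically fall in the 200 - 500 bp range quoted on the slide.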
dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  - Homo sapiens (human): 8,315,294
  - Mus musculus (mouse): 4,853,562
  - Arabidopsis thaliana (thale cress): 1,529,700
  - Danio rerio (zebrafish): 1,488,275
  - Drosophila melanogaster (fruit fly): 821,005
  - Gallus gallus (chicken): 600,433
Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP, in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure
Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi
Crystal Structure of a Protein
Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml
Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs and structural domains
  - CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  - Pfam: http://www.sanger.ac.uk/Software/Pfam
  - PROSITE: http://www.expasy.org/prosite
Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  - PDB: http://www.pdb.org
  - SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  - MMDB: http://www.ncbi.nlm.nih.gov/Structure
PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides:
  - Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references
PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called Single Nucleotide Polymorphisms (SNPs; single-base variations)
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies and environmental insults such as bacteria, viruses and chemicals
• Genetic variations reveal clues about ancestral human migration history
Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of sequence motifs of variable length, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions or translocations of large DNA fragments
  - Often cause serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  - A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  - When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
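The frequency-based terminology above amounts to a simple threshold rule, sketched here in Python. The function name and the handling of a frequency of exactly 1% are assumptions: the slide only defines the >1% and <1% cases, so this sketch arbitrarily groups the boundary value with "mutation".

```python
def classify_single_base_change(allele_frequency: float, threshold: float = 0.01) -> str:
    """Label a single base change by its population frequency (a fraction between 0 and 1)."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be a fraction between 0 and 1")
    # Assumption: a frequency of exactly 1% is not covered by the slide; treated as "mutation" here.
    return "SNP" if allele_frequency > threshold else "mutation"


if __name__ == "__main__":
    for frequency in (0.25, 0.05, 0.001):
        print(f"{frequency:.3f} -> {classify_single_base_change(frequency)}")
```

As the last bullet notes, database contents do not follow this rule strictly, so a label computed this way says nothing about whether a variant appears in dbSNP.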
SNPs
• SNPs can occur anywhere in a genome; they are classified by their location
  - Many SNPs are in genomic non-coding regions
  - SNPs in gene regions, including promoter region, coding region, intronic/exonic region, UTR, etc.
    • Often play an important role in differentiation and disease
The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  - No consequence
  - Affect gene transcription, quantitatively or qualitatively
  - Affect gene translation, quantitatively or qualitatively
  - Change protein structure and function
  - Change gene regulation at different steps
Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  - e.g. Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  - e.g. cancers, heart diseases, Alzheimer's, diabetes, asthma, obesity, etc.
Sickle Cell Anemia
• Due to a single-nucleotide change, the swap of an A for a T, which causes the encoded amino acid to be valine instead of glutamic acid in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1: Normal red blood cells; 2: Sickled red blood cells
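The sickle cell example is a non-synonymous substitution, which a short Python sketch can make concrete. The codons used here (GAG for glutamic acid, GTG for valine) come from the standard genetic code rather than from the slide, the table is deliberately partial, and the function names are hypothetical.

```python
# Partial codon table (standard genetic code); only the codons needed for this example.
CODON_TO_AMINO_ACID = {"GAG": "Glu", "GAA": "Glu", "GTG": "Val", "GTA": "Val"}


def substitute_base(codon: str, position: int, new_base: str) -> str:
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


def classify_substitution(codon: str, position: int, new_base: str):
    """Return (mutated codon, amino acid before, amino acid after, kind of change)."""
    mutated = substitute_base(codon, position, new_base)
    before = CODON_TO_AMINO_ACID[codon]
    after = CODON_TO_AMINO_ACID[mutated]
    kind = "synonymous" if before == after else "non-synonymous"
    return mutated, before, after, kind


if __name__ == "__main__":
    # A -> T in the middle of GAG: glutamic acid becomes valine, as in sickle cell anemia.
    print(classify_substitution("GAG", 1, "T"))
    # G -> A at the third position: still glutamic acid, so no amino acid change.
    print(classify_substitution("GAG", 2, "A"))
```

The second call shows the contrasting case, a single-base change that leaves the protein sequence untouched.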
A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)
Genetic Variations Databases
• dbSNP
  - http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  - http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  - http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  - http://gvs.gs.washington.edu/GVS
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes:
  - Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  - Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  - Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted records
    • The original observations of sequence variation. Submitted SNP (ss) record accessions start with ss
  - Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data. Reference SNP cluster (refSNP) accessions start with rs
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
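The IUPAC table can be used in both directions: expanding a code into the bases it stands for, or finding the code for a set of alleles. The following Python sketch does both; the function names are hypothetical, and the "/" allele separator is an assumption about how alleles are written in a record.

```python
# IUPAC nucleotide codes mapped to the bases they represent (taken from the table above).
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup: an unordered set of bases maps back to its single-letter code.
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def expand_code(code: str) -> set:
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles: str) -> str:
    """Return the IUPAC code for alleles written like 'A/G' (separator is an assumption)."""
    return BASES_TO_CODE[frozenset(alleles.upper().split("/"))]


if __name__ == "__main__":
    print(sorted(expand_code("R")))   # bases covered by R
    print(code_for_alleles("A/G"))    # code for an A/G variation
    print(code_for_alleles("A/C"))    # code for an A/C variation
```

This is why an A/G variation appears as a single R in a dbSNP FASTA sequence, and an A/C variation as M.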
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of ss records, batch search, SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR and NOT operators
Example                                   Description
BRC*[Gene Name]                           Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])   Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR]                          Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]      Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
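Query strings like the examples above are plain text, so they can be assembled programmatically before being pasted into the search box. The sketch below only builds strings and does not contact NCBI; the helper names are hypothetical, and the field tags (CHR, METHOD, Function_Class) are simply copied from the slide.

```python
def field_term(value: str, field: str) -> str:
    """Format one search term as value[field], e.g. 1[CHR]."""
    return f"{value}[{field}]"


def any_of(*terms: str) -> str:
    """Join terms with OR."""
    return " OR ".join(terms)


def all_of(*terms: str) -> str:
    """Join terms with AND."""
    return " AND ".join(terms)


def excluding(query: str, term: str) -> str:
    """Append a NOT clause to an existing query."""
    return f"{query} NOT {term}"


if __name__ == "__main__":
    chromosomes = any_of(field_term("1", "CHR"), field_term("2", "CHR"))
    print(chromosomes)
    print(excluding(chromosomes, field_term("unknown", "METHOD")))
    print(all_of(field_term("1", "CHR"), field_term("frameshift", "Function_Class")))
```

How the search engine groups mixed AND/OR/NOT operators is not stated on the slide, so adding explicit parentheses around sub-queries is the safer habit.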
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous, coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP Fasta Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.
gnl: Internal use.
dbSNP: Database name.
ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs.
allelePos: Variation allele position (1-based) in the FASTA sequence. It is always the 5' flank length plus 1.
len / totalLen: Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation.
handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id.
taxid: NCBI taxonomy id.
mol: Molecular source of the sequence. Valid values are genomic, cDNA or mitochondria.
snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism). Click on snpclass for details.
alleles: Lists the alleles of the SNP, separated by /.
Lower or upper case: Lower-case sequence marks regions identified by RepeatMasker as low-complexity or repetitive elements.
ATCG (green): Assay sequence (observed by the submitter).
ATCG (black): Flank sequence (extracted from sequence databases).
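The header layout described above is regular enough to parse with basic string handling. The Python sketch below splits a header into its fields and then uses allelePos to separate the 5' flank, the variation and the 3' flank. The record is hypothetical (the accession rs0000 and the 11-base sequence are invented for illustration), the function names are assumptions, and headers that carry submitter fields without an "=" sign are only collected, not interpreted.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into a dict of its fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = header[1:].strip().split("|")
    record = {"prefix": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value
        elif field:
            # Fields without '=' (e.g. submitter handle and id) are kept unparsed.
            record.setdefault("other", []).append(field)
    return record


def split_flanks(sequence: str, allele_pos: int):
    """Return (5' flank, variation code, 3' flank); allele_pos is 1-based."""
    bases = "".join(sequence.split())  # drop the spaces between 10-base blocks
    index = allele_pos - 1
    return bases[:index], bases[index], bases[index + 1:]


if __name__ == "__main__":
    # Hypothetical record, shaped like the slide's examples but much shorter.
    header = ">gnl|dbSNP|rs0000|allelePos=6|totalLen=11|taxid=9606|snpclass=1|alleles=A/G|mol=Genomic"
    record = parse_dbsnp_header(header)
    five_prime, variation, three_prime = split_flanks("ACGTA R TTGCA", int(record["allelePos"]))
    print(record["accession"], five_prime, variation, three_prime)
```

The two rules from the table can be checked on the result: allelePos equals the 5' flank length plus 1, and totalLen equals both flank lengths plus 1 for the variation.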
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases
Department of Health Information Management
Disease Databasesbull Genes are involved in disease
bull Many diseases are well studied
bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim
ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml
Department of Health Information Management
Homework 1bull Using PubMed search for a recent paper related to genetic
disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc
bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation
bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
ESTbull mRNA Genomic regions actively transcribed in
cellbull cDNA (complementary DNA)
ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA
bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify
bull Genes Alternative splicing etc
Department of Health Information Management
dbEST (release 120111 Dec 1
2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml
bull Number of Entries 71276166ndash Homo sapiens (human) 8315294
ndash Mus musculus (mouse) 4853562
ndash Arabidopsis thaliana (thale cress) 1529700
ndash Danio rerio (zebrafish) 1488275
ndash Drosophila melanogaster (fruit fly) 821005
ndash Gallus gallus (chicken) 600433
Department of Health Information Management
Access to dbEST Databull EST sequences are included in the EST division of
GenBank available from NCBI by anonymous ftp and through Entrez
bull The nucleotide sequences may be searched using the BLAST server
bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov
Protein Structure
Department of Health Information Management
Cn3D ftpftpncbinihgovcn3dCn3D-43msi
Department of Health Information Management
Crystal Structure of A Protein
Department of Health Information Management
Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains
httpwwwebiacukinterprobull Protein Information Resource (PIR)
httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences
httpwwwexpasychsprot bull UniProt
httpwwwexpasyuniprotorgindexshtml
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  - Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  - Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  - Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
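The two SNP figures on this slide are consistent with each other, as a quick calculation shows. The sketch assumes a human genome size of about 3 billion base pairs, a round number that the slide does not state.

```python
# Rough consistency check of "1 SNP per 300 bp" against "about 10 million SNPs".
GENOME_SIZE_BP = 3_000_000_000  # assumption: approximate human genome size
SNP_SPACING_BP = 300            # from the slide: on average 1 SNP per 300 bp

estimated_snps = GENOME_SIZE_BP // SNP_SPACING_BP
print(f"Estimated SNPs: {estimated_snps:,}")
```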
dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted records
    • The original observations of sequence variation; submitted SNP (ss) record IDs start with ss
  - Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) IDs start with rs
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
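The code table can be expressed as a lookup, which makes it easy to find the variable positions in a dbSNP flanking sequence. This is a minimal sketch; the names `IUPAC_CODES`, `expand`, and `ambiguous_positions` are illustrative and not part of any NCBI tool.

```python
# IUPAC nucleotide codes mapped to the bases they can stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(code):
    """Return the list of bases an IUPAC code can represent."""
    return list(IUPAC_CODES[code.upper()])


def ambiguous_positions(sequence):
    """Return (1-based position, code, bases) for each ambiguity code.

    Whitespace in the sequence is ignored when counting positions.
    """
    bases = "".join(sequence.split())
    return [
        (index, base.upper(), expand(base))
        for index, base in enumerate(bases, start=1)
        if len(IUPAC_CODES[base.upper()]) > 1
    ]


# A short made-up fragment containing one R (A or G) position
print(ambiguous_positions("CCA R CTC"))
```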
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of ss records, batch search, and SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), and the AND, OR, and NOT operators
Example: BRC*[Gene Name]
Description: Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
Example: 1[CHR] AND (frameshift[Function_Class])
Description: Search SNPs located on chromosome 1 with function class frameshift
Example: 1[CHR] OR 2[CHR]
Description: Search all SNPs on chromosome 1 or 2
Example: 1[CHR] OR 2[CHR] NOT unknown[METHOD]
Description: Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
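Queries like these are plain strings, so they can be assembled programmatically before being pasted into the search box. The sketch below only builds the text and does not contact NCBI; the helper names are hypothetical, and the field tags ([CHR], [Function_Class], [Gene Name]) are taken from the slide and may differ in the current Entrez interface.

```python
ALLOWED_OPERATORS = {"AND", "OR", "NOT"}


def field_term(value, field):
    """Format one search term as value[field]."""
    return f"{value}[{field}]"


def group(term):
    """Wrap a term in parentheses."""
    return f"({term})"


def combine(operator, *terms):
    """Join terms with a Boolean operator (AND, OR, or NOT)."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


# Rebuild two of the example queries from the slide
wildcard_query = field_term("BRC*", "Gene Name")
frameshift_query = combine(
    "AND",
    field_term("1", "CHR"),
    group(field_term("frameshift", "Function_Class")),
)
print(wildcard_query)
print(frameshift_query)
```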
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in early-onset breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP FASTA Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.
gnl: internal use
dbSNP: database name
ss or rs number: dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs to the accession of a refSNP cluster of one or more submitted SNPs
allelePos: variation allele position (1-based) on the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen: total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation; the variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id: only for submitted SNPs; the two fields after totalLen are the submitter handle and the submitter SNP id
taxid: NCBI taxonomy id
mol: molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles: lists the alleles of the SNP, separated by /
Lower or upper case: lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
Green: used for assay sequence (observed by the submitter)
Black: used for flank sequence (extracted from sequence databases)
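The field layout described above can be turned into a small parser. This is a sketch under stated assumptions rather than NCBI's own code: the function name `parse_dbsnp_header` and the output keys are hypothetical, alleles are assumed to be written with a slash (A/C), and fields without an = sign after the accession (such as the submitter handle) are skipped.

```python
INTEGER_FIELDS = ("allelePos", "len", "totalLen", "taxid", "snpclass")


def parse_dbsnp_header(header):
    """Parse a dbSNP FASTA header line into a dictionary."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = header[1:].strip().split("|")
    record = {
        "namespace": fields[0],   # "gnl", internal use
        "database": fields[1],    # "dbSNP"
        "accession": fields[2],   # ss... or rs...
    }
    for field in fields[3:]:
        if "=" in field:          # skip handle, submitter id, empty fields
            key, value = field.split("=", 1)
            record[key] = value.strip("'\"")
    for key in INTEGER_FIELDS:
        if key in record:
            record[key] = int(record[key])
    # ss = submitted record, rs = refSNP cluster
    is_submitted = record["accession"].startswith("ss")
    record["record_type"] = "submitted" if is_submitted else "refSNP"
    return record


example = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
           "snpclass=1|alleles=A/C|mol=Genomic|build=130")
parsed = parse_dbsnp_header(example)

# allelePos is the 5' flank length plus 1, and the variation has length 1
five_prime_flank = parsed["allelePos"] - 1
three_prime_flank = parsed["totalLen"] - parsed["allelePos"]
print(parsed["accession"], parsed["record_type"], five_prime_flank, three_prime_flank)
```

For the rs799916 header, the two flank lengths work out to 300 bases each, matching the rule that totalLen is the sum of the 5' flank, the 3' flank, and one position for the variation.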
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI: OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease and, once it is retrieved, view its variants
Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - information on the population prevalence of genetic variants
  - gene-disease associations
  - gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  - Homo sapiens (human): 8,315,294
  - Mus musculus (mouse): 4,853,562
  - Arabidopsis thaliana (thale cress): 1,529,700
  - Danio rerio (zebrafish): 1,488,275
  - Drosophila melanogaster (fruit fly): 821,005
  - Gallus gallus (chicken): 600,433
Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP, in the repository/dbEST directory at ftp.ncbi.nih.gov
Protein Structure
Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi
Crystal Structure of a Protein
Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml
Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains), which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  - CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  - Pfam: http://www.sanger.ac.uk/Software/Pfam
  - PROSITE: http://www.expasy.org/prosite
Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  - PDB: http://www.pdb.org
  - SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  - MMDB: http://www.ncbi.nlm.nih.gov/Structure
PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  - Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about the protein, literature references
PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history
Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions, or translocations of large DNA fragments
  - Often cause serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  - A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  - When a single base change occurs at <1%, it is considered a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
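The frequency-based terminology amounts to a simple threshold rule, sketched below. The slide sets the cutoff at 1% but does not say how a frequency of exactly 1% is treated; counting it as a SNP here is an arbitrary assumption, and the function name is illustrative.

```python
SNP_FREQUENCY_THRESHOLD = 0.01  # the 1% cutoff from the slide


def classify_single_base_change(allele_frequency):
    """Label a single base change as 'SNP' or 'mutation' by allele frequency."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    # Assumption: a frequency of exactly 1% is counted as a SNP.
    if allele_frequency >= SNP_FREQUENCY_THRESHOLD:
        return "SNP"
    return "mutation"


print(classify_single_base_change(0.05))   # common variant
print(classify_single_base_change(0.001))  # rare variant
```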
SNPs
• SNPs can occur anywhere in a genome; they are classified by location
  - Many SNPs lie in genomic non-coding regions
  - SNPs in gene regions, including promoter, coding (exonic), intronic, and UTR regions
    • Often play an important role in differentiation and disease
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
Access to dbEST Databull EST sequences are included in the EST division of
GenBank available from NCBI by anonymous ftp and through Entrez
bull The nucleotide sequences may be searched using the BLAST server
bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov
Protein Structure
Department of Health Information Management
Cn3D ftpftpncbinihgovcn3dCn3D-43msi
Department of Health Information Management
Crystal Structure of A Protein
Department of Health Information Management
Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains
httpwwwebiacukinterprobull Protein Information Resource (PIR)
httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences
httpwwwexpasychsprot bull UniProt
httpwwwexpasyuniprotorgindexshtml
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)
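To get a feel for the scale of these percentages, the short Python sketch below multiplies them out. The genome length of roughly 3 billion base pairs is an assumption added here (the slide gives only the percentages), so the results are order-of-magnitude estimates rather than measured counts.

```python
# Rough scale of pairwise genetic variation.
# GENOME_LENGTH_BP is an assumed round figure, not a number from the slides.
GENOME_LENGTH_BP = 3_000_000_000

FRACTION_DIFFERENT = 0.001  # the 0.1% difference between two unrelated individuals
FRACTION_SNP = 0.90         # ~90% of that difference is attributed to SNPs


def expected_differences(genome_length_bp: int = GENOME_LENGTH_BP) -> tuple[int, int]:
    """Return (total differing bases, differing bases attributed to SNPs)."""
    total = round(genome_length_bp * FRACTION_DIFFERENT)
    snps = round(total * FRACTION_SNP)
    return total, snps


if __name__ == "__main__":
    total, snps = expected_differences()
    print(f"differing positions: ~{total:,}")
    print(f"attributed to SNPs:  ~{snps:,}")
```

Under that assumed genome length, two unrelated individuals would differ at a few million positions, most of them single-base changes.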
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues to ancestral human migration history
Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions, or translocations of large DNA fragments
  - Often cause serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  - A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  - When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
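The 1% frequency rule above can be written as a small Python function. This is only an illustrative sketch: the function name is hypothetical, and the handling of a frequency of exactly 1% is an assumption, since the rule as stated covers only the cases above and below the threshold.

```python
SNP_FREQUENCY_THRESHOLD = 0.01  # the 1% cut-off from the slide


def classify_single_base_change(allele_frequency: float) -> str:
    """Label a single-base change by its population allele frequency (0.0 to 1.0)."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    if allele_frequency > SNP_FREQUENCY_THRESHOLD:
        return "SNP"
    if allele_frequency < SNP_FREQUENCY_THRESHOLD:
        return "mutation"
    # Assumption: the rule does not say what happens at exactly 1%.
    return "undefined at exactly 1%"


if __name__ == "__main__":
    for frequency in (0.25, 0.004):
        print(frequency, classify_single_base_change(frequency))
```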
SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  - Many SNPs are in genomic non-coding regions
  - SNPs in gene regions, including promoter regions, coding regions, intronic and exonic regions, UTRs, etc.
    • Often play an important role in differentiation and disease
The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  - No consequence
  - Affects gene transcription quantitatively or qualitatively
  - Affects gene translation quantitatively or qualitatively
  - Changes protein structure and function
  - Changes gene regulation at different steps
Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  - e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  - e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.
Sickle Cell Anemia
• Due to a single nucleotide substitution (an A swapped for a T), which causes the encoded amino acid in hemoglobin to be valine instead of glutamic acid
http://mmcenters.discoveryhospital.com/shared/enc/img_htm/IM-56.htm
Figure labels: (1) normal red blood cells; (2) sickled red blood cells
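The sickle cell substitution can be traced at the codon level with a few lines of Python. The specific codons used here (GAG changing to GTG) are the standard textbook description of this mutation and are not stated on the slide; the lookup table is deliberately partial and the function names are hypothetical.

```python
# Partial codon table: only the codons this illustration needs.
CODON_TO_AMINO_ACID = {
    "GAG": "Glu",  # glutamic acid
    "GAA": "Glu",  # glutamic acid
    "GTG": "Val",  # valine
}


def substitute_base(codon: str, index: int, new_base: str) -> str:
    """Return the codon with the base at 0-based `index` replaced by `new_base`."""
    return codon[:index] + new_base + codon[index + 1:]


def is_synonymous(ref_codon: str, alt_codon: str) -> bool:
    """True when both codons encode the same amino acid."""
    return CODON_TO_AMINO_ACID[ref_codon] == CODON_TO_AMINO_ACID[alt_codon]


if __name__ == "__main__":
    normal = "GAG"
    sickle = substitute_base(normal, 1, "T")  # the A-to-T swap in the middle base
    print(normal, CODON_TO_AMINO_ACID[normal], "->", sickle, CODON_TO_AMINO_ACID[sickle])
    print("synonymous:", is_synonymous(normal, sickle))
```

The same helper shows why some single-base changes have no effect on the protein: GAG to GAA still encodes glutamic acid, so that change is synonymous.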
A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)
Genetic Variations Databases
• dbSNP
  - http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  - http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  - http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  - http://gvs.gs.washington.edu/GVS
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  - Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  - Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  - Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
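The two SNP figures on this slide (about 10 million SNPs, about one per 300 bp) are consistent with each other if the genome is roughly 3 billion base pairs long. That genome length is an assumption introduced here for the check; it is not given on the slide.

```python
GENOME_LENGTH_BP = 3_000_000_000  # assumed round figure, not from the slide
BP_PER_SNP = 300                  # "on average 1 per 300 bp"


def estimated_snp_count(genome_length_bp: int = GENOME_LENGTH_BP,
                        bp_per_snp: int = BP_PER_SNP) -> int:
    """Estimate how many SNPs a genome holds at a given average spacing."""
    return genome_length_bp // bp_per_snp


if __name__ == "__main__":
    print(f"estimated SNPs: {estimated_snp_count():,}")
```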
dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted records
    • The original observations of sequence variation; submitted SNP (ss) record accessions start with "ss"
  - Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) accessions start with "rs"
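Because the two record classes are distinguished by their accession prefix, a script can tell them apart with a simple pattern match. The sketch below assumes accessions are written in lower case as a prefix followed by digits (as in ss5586300 and rs799916); the function name is hypothetical.

```python
import re

# Assumed accession shape: "ss" or "rs" followed by one or more digits.
_ACCESSION_PATTERN = re.compile(r"^(ss|rs)(\d+)$")


def classify_accession(accession: str) -> str:
    """Return the dbSNP record class implied by an accession's prefix."""
    match = _ACCESSION_PATTERN.match(accession.strip())
    if match is None:
        raise ValueError(f"not an ss/rs accession: {accession!r}")
    if match.group(1) == "ss":
        return "submitted SNP"
    return "refSNP cluster"


if __name__ == "__main__":
    for accession in ("ss5586300", "rs799916"):
        print(accession, "->", classify_accession(accession))
```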
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC ATCAAGTCAT
YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA CTTTGAGGAA CATTCAATRT
CACCTGAAAG AGAAATGGGA AATGAGAACA TTCCAAGTAC AGTGAGCACA ATTAGCCGTA
ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA GTATTAATGA
AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT
GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG
AAGTAATTGT AAGCATCCTG AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT
TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
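The IUPAC table translates directly into a lookup dictionary, which is handy when reading the single ambiguity letter that marks a variant position in a dbSNP FASTA record. In this sketch the function names are hypothetical, and the "/" separator between alleles (as in "A/G") is an assumption about how the alleles field is written.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def expand(code: str) -> set[str]:
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles: str) -> str:
    """Return the IUPAC code for an allele string such as 'A/G'."""
    wanted = set(alleles.upper().split("/"))
    for code, bases in IUPAC_CODES.items():
        if set(bases) == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {alleles!r}")


if __name__ == "__main__":
    print(sorted(expand("R")))
    print(code_for_alleles("A/C"))
```

For example, alleles A and G map to R, and alleles A and C map to M.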
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of ss records, batch search, and SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                   Description
BRC*[Gene Name]                           Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])   Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                          Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]      Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
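Queries like these can also be sent programmatically. The sketch below only builds a request URL for NCBI's E-utilities ESearch endpoint and makes no network call; using E-utilities at all is an addition beyond the slides, and the field tags shown on the slide (such as [CHR] and [Function_Class]) date from the 2010 Entrez SNP interface, so whether the current index still accepts them is an assumption to verify.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (not mentioned on the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_search_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that queries the SNP database for `term`."""
    params = {"db": "snp", "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Query text copied from the slide's second example.
    print(build_snp_search_url("1[CHR] AND (frameshift[Function_Class])"))
```

Keeping URL construction separate from the request makes it easy to inspect the encoded query before sending it.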
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.
Header field               Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len or totalLen            Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs: the two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)
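The header layout described above is regular enough to parse with a few string operations. The following Python sketch is an illustration only: the function names are hypothetical, positional fields without an "=" sign (such as the submitter handle) are simply skipped, and the flank arithmetic follows the stated rule that allelePos is the 5' flank length plus 1.

```python
def parse_snp_fasta_header(header: str) -> dict[str, str]:
    """Split a dbSNP FASTA header line into a dictionary of its fields."""
    fields = header.lstrip(">").strip().split("|")
    record = {"source": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        # Only key=value fields are kept; bare fields (handle, submitter ID)
        # and empty trailing fields are ignored in this sketch.
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip('"')
    return record


def flank_lengths(record: dict[str, str]) -> tuple[int, int]:
    """Return (5' flank length, 3' flank length) from allelePos and the total length."""
    total = int(record.get("totalLen") or record["len"])
    allele_pos = int(record["allelePos"])
    five_prime = allele_pos - 1        # allelePos = 5' flank length + 1
    three_prime = total - allele_pos   # the variation itself counts as 1 base
    return five_prime, three_prime


if __name__ == "__main__":
    header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
              "snpclass=1|alleles=A/C|mol=Genomic|build=130")
    record = parse_snp_fasta_header(header)
    print(record["accession"], flank_lengths(record))
```

Applied to the rs799916 header, this gives 300 bases of flank on each side of the variant, matching the 30 ten-base groups before and after the ambiguity code in the record.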
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI: OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; when it is retrieved, view its variants
Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - Information on population prevalence of genetic variants
  - Gene-disease associations
  - Gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in databases
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method called "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Protein Structure
Department of Health Information Management
Cn3D ftpftpncbinihgovcn3dCn3D-43msi
Department of Health Information Management
Crystal Structure of A Protein
Department of Health Information Management
Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains
httpwwwebiacukinterprobull Protein Information Resource (PIR)
httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences
httpwwwexpasychsprot bull UniProt
httpwwwexpasyuniprotorgindexshtml
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country’s leading chronic, infectious, environmental, and occupational diseases
• OPHG’s efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - information on population prevalence of genetic variants
  - gene-disease associations
  - gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding Gene’s Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method named “Genome-Wide Association Study” (GWAS). The journal could be “PLoS Genetics”, “Nature Genetics”, etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank record of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
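
For the step of saving sequences in FASTA format, a minimal sketch of what such a file looks like may help. It assumes the common convention of a header line starting with > followed by the sequence wrapped at a fixed width. The function name, the default width of 70, and the sample record are hypothetical.

```python
def format_fasta(identifier, description, sequence, width=70):
    """Return one FASTA record: a '>' header line, then wrapped sequence lines."""
    # Remove any whitespace inside the sequence and normalise to upper case.
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        raise ValueError("sequence is empty")
    header = f">{identifier} {description}".rstrip()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([header] + lines) + "\n"


if __name__ == "__main__":
    # Hypothetical record, not a real GenBank entry.
    record = format_fasta("demo1", "hypothetical record", "ATGC atgc", width=4)
    with open("demo1.fasta", "w") as handle:
        handle.write(record)
    print(record, end="")
```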
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi
Crystal Structure of A Protein

Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  - CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  - Pfam: http://www.sanger.ac.uk/Software/Pfam
  - PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  - PDB: http://www.pdb.org
  - SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  - MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there are 62,787 searchable structures in the PDB database
• PDB provides
  - Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase
http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called Single Nucleotide Polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations help determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions, or translocations of large DNA fragments
  - Often causing serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  - A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  - When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
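
The frequency-based terminology above can be written as a small rule. This is only a sketch of the 1% convention stated on the slide; the function name is hypothetical, and a frequency of exactly 1% is reported separately because the slide does not say which side it falls on.

```python
def classify_single_base_change(allele_frequency, threshold=0.01):
    """Label a single-base change by its population allele frequency.

    allele_frequency is a fraction between 0 and 1 (0.01 means 1%).
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    if allele_frequency > threshold:
        return "SNP"
    if allele_frequency < threshold:
        return "mutation"
    # Exactly at the threshold: the slide gives no rule for this case.
    return "boundary (not specified)"


if __name__ == "__main__":
    for frequency in (0.25, 0.05, 0.001):
        print(frequency, classify_single_base_change(frequency))
```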

SNPs
• SNPs can occur anywhere on a genome; they are classified based on their locations
  - Many SNPs are in genomic non-coding regions
  - SNPs in gene regions, including the promoter region, coding region, intronic and exonic regions, UTRs, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  - No consequence
  - Affect gene transcription quantitatively or qualitatively
  - Affect gene translation quantitatively or qualitatively
  - Change protein structure and functions
  - Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  - e.g., Huntington’s disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  - e.g., cancers, heart diseases, Alzheimer’s disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single base change (an A swapped for a T) that causes the inserted amino acid to be valine instead of glutamic acid in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
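
The effect of a single-base change on the encoded amino acid can be checked with a codon lookup. The sketch below uses the well-known sickle-cell codon change, GAG (glutamic acid) to GTG (valine), and also labels a change as synonymous or non-synonymous. The codon table is deliberately partial and the function names are hypothetical.

```python
# Partial codon table: only the codons needed for this illustration.
CODON_TO_AMINO_ACID = {
    "GAG": "Glu", "GAA": "Glu",
    "GTG": "Val", "GTA": "Val",
    "CAG": "Gln", "CAA": "Gln",
}


def substitute_base(codon, index, new_base):
    """Return the codon with the base at a 0-based index replaced."""
    return codon[:index] + new_base + codon[index + 1:]


def describe_change(codon, index, new_base):
    """Return (amino acid before, amino acid after, kind of change)."""
    mutated = substitute_base(codon, index, new_base)
    before = CODON_TO_AMINO_ACID[codon]
    after = CODON_TO_AMINO_ACID[mutated]
    kind = "synonymous" if before == after else "non-synonymous"
    return before, after, kind


if __name__ == "__main__":
    # Middle base A -> T: GAG becomes GTG.
    print(describe_change("GAG", 1, "T"))
    # Third base G -> A: GAG becomes GAA, which still encodes Glu.
    print(describe_change("GAG", 2, "A"))
```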

A Few Relevant Concepts
• Allele: a specific “version” of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has for any given locus (or loci)

Genetic Variations Databases
• dbSNP
  - http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  - http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  - http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  - http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  - Single-base nucleotide substitutions (or single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  - Microsatellite repeat variations (or short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  - Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
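
The two SNP figures above (about 10 million in total, and about 1 per 300 bp) agree with each other if the human genome is taken to be roughly 3 billion base pairs. That genome size is an assumption added here, not a number from the slide. A quick arithmetic check:

```python
# Assumed round figure for the human genome size; not stated on the slide.
GENOME_SIZE_BP = 3_000_000_000
# Average spacing between SNPs, from the slide.
SNP_SPACING_BP = 300


def expected_snp_count(genome_size_bp, spacing_bp):
    """Number of SNPs expected at one SNP per spacing_bp bases."""
    return genome_size_bp // spacing_bp


if __name__ == "__main__":
    total = expected_snp_count(GENOME_SIZE_BP, SNP_SPACING_BP)
    print(f"expected SNPs: {total:,}")
    # "Less than half" stored means fewer than this many records.
    print(f"upper bound on stored SNPs: {total // 2:,}")
```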

dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record identifiers start with ss
  - Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) identifiers start with rs
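
Because the two record classes differ only in their prefix, an identifier can be sorted into its class with a simple pattern match. This is a sketch under the assumption that identifiers are the prefix ss or rs followed by digits; the function name and return labels are hypothetical.

```python
import re

# Assumed identifier shape: "ss" or "rs" followed by one or more digits.
ACCESSION = re.compile(r"^(ss|rs)(\d+)$")

RECORD_CLASSES = {
    "ss": "submitted",          # original observation from a submitter
    "rs": "reference cluster",  # refSNP cluster built from submissions
}


def record_class(accession):
    """Return (record class, numeric id) for a dbSNP-style identifier."""
    match = ACCESSION.match(accession.strip().lower())
    if match is None:
        raise ValueError(f"not an ss/rs identifier: {accession!r}")
    prefix, number = match.groups()
    return RECORD_CLASSES[prefix], int(number)


if __name__ == "__main__":
    for name in ("ss5586300", "rs799916"):
        print(name, record_class(name))
```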

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
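
The IUPAC table can be held as a lookup so that an ambiguity code in a dbSNP sequence is expanded into its possible bases, or a pair of alleles is turned back into its code. A minimal sketch follows; the function names are hypothetical, and writing alleles as A/G with a slash is an assumption about the input format.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC_CODES = {
    "A": frozenset("A"), "C": frozenset("C"),
    "G": frozenset("G"), "T": frozenset("T"),
    "M": frozenset("AC"), "R": frozenset("AG"), "W": frozenset("AT"),
    "S": frozenset("CG"), "Y": frozenset("CT"), "K": frozenset("GT"),
    "V": frozenset("ACG"), "H": frozenset("ACT"),
    "D": frozenset("AGT"), "B": frozenset("CGT"),
    "N": frozenset("ACGT"),
}

# Reverse lookup: set of bases -> single-letter code.
BASES_TO_CODE = {bases: code for code, bases in IUPAC_CODES.items()}


def expand(code):
    """Return the set of bases represented by one IUPAC code."""
    return IUPAC_CODES[code.upper()]


def code_for(alleles):
    """Return the IUPAC code for alleles written like 'A/G'."""
    bases = frozenset(alleles.upper().split("/"))
    return BASES_TO_CODE[bases]


if __name__ == "__main__":
    print(sorted(expand("R")))
    print(code_for("A/G"), code_for("A/C"))
```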

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
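
Query strings of this kind are plain text, so they can be assembled with a few helper functions before being pasted into the search box. The sketch below only builds strings and does not contact NCBI. The helper names are hypothetical, the field tags are the ones shown in the examples, and * is assumed to be the wildcard character.

```python
ALLOWED_OPERATORS = {"AND", "OR", "NOT"}


def term(value, field):
    """Build one search term such as 1[CHR]."""
    return f"{value}[{field}]"


def combine(operator, *terms):
    """Join terms with a Boolean operator, e.g. 1[CHR] OR 2[CHR]."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


if __name__ == "__main__":
    # Wildcard search on gene names (assumes * is the wildcard).
    print(term("BRC*", "Gene Name"))
    # Chromosome 1 and function class frameshift.
    print(combine("AND", term("1", "CHR"),
                  "(" + term("frameshift", "Function_Class") + ")"))
    # Chromosome 1 or 2, excluding method "unknown".
    either = combine("OR", term("1", "CHR"), term("2", "CHR"))
    print(combine("NOT", either, term("unknown", "METHOD")))
```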

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.
Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen             Total number of bases of the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs: the two fields after totalLen are the submitter handle and the submitter SNP id
taxid                      NCBI taxonomy id
mol                        Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by /
Lower or upper case        Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
Green color                Used for assay sequence (observed by the submitter)
Black color                Used for flank sequence (extracted from sequence databases)
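
The header layout described above is regular enough to parse by splitting on |. The sketch below turns a header into a dictionary and derives the two flank lengths from allelePos and the total length, using the rule that allelePos is the 5' flank length plus 1. The function names are hypothetical, and the / between alleles in the sample header is an assumption.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP-style FASTA header into a dictionary of fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    # Drop empty pieces, e.g. from a trailing '|'.
    parts = [part for part in header[1:].strip().split("|") if part]
    if len(parts) < 3 or parts[0] != "gnl" or parts[1] != "dbSNP":
        raise ValueError("expected a header beginning with gnl|dbSNP|")
    record = {"accession": parts[2]}
    for field in parts[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value
    return record


def flank_lengths(record):
    """Return (5' flank length, 3' flank length) from a parsed header."""
    total = int(record.get("totalLen", record.get("len")))
    position = int(record["allelePos"])
    # The variation itself counts as one base at allelePos.
    return position - 1, total - position


if __name__ == "__main__":
    header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
              "snpclass=1|alleles=A/C|mol=Genomic|build=130")
    parsed = parse_dbsnp_header(header)
    print(parsed["accession"], parsed["alleles"], flank_lengths(parsed))
```

For the rs799916 header this gives two flanks of 300 bases, which matches the 30 blocks of 10 bases on each side of the M in that record.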

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - description and clinical features of a disorder or of a gene involved in genetic disorders, biochemical and other features, cytogenetics and mapping, molecular and population genetics, diagnosis and clinical management, animal models for the disorder, allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases
Department of Health Information Management
Disease Databasesbull Genes are involved in disease
bull Many diseases are well studied
bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim
ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml
Department of Health Information Management
Homework 1bull Using PubMed search for a recent paper related to genetic
disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc
bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation
bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
Crystal Structure of A Protein
Department of Health Information Management
Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains
httpwwwebiacukinterprobull Protein Information Resource (PIR)
httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences
httpwwwexpasychsprot bull UniProt
httpwwwexpasyuniprotorgindexshtml
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management

SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.
• gnl: internal use
• dbSNP: database name
• ss or rs number: dbSNP accession for the SNP. "ss" refers to a submitted SNP accession; "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
• allelePos: variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
• len / totalLen: total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
• handle|submitted_snp_id: only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
• taxid: NCBI taxonomy id
• mol: molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
• snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
• alleles: lists the alleles of the SNP, separated by "/"
• Lower or upper case: lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
• Green letters (ATCG): assay sequence (observed by the submitter)
• Black letters (ATCG): flank sequence (extracted from sequence databases)
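
The field rules above can be checked with a small parser. This is a minimal sketch, not dbSNP's own tooling: the function name `parse_dbsnp_header` and the derived keys `five_prime_len` and `three_prime_len` are hypothetical. The "/" between alleles is assumed from the field description. The flank lengths follow from the stated rules (allelePos = 5' length + 1; totalLen = 5' length + 3' length + 1).

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header into a dictionary of named fields."""
    parts = header.strip().lstrip(">").split("|")
    record = {"database": parts[1], "accession": parts[2], "extra": []}
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value.strip("'\"")  # tolerate optional quotes
        elif part:
            # Submitted (ss) records may carry handle and submitter id here.
            record["extra"].append(part)
    position = int(record["allelePos"])
    # The total length appears as "totalLen" or "len" depending on the record.
    total = int(record.get("totalLen", record.get("len")))
    record["five_prime_len"] = position - 1
    record["three_prime_len"] = total - position
    record["alleles"] = record["alleles"].split("/")
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
rec = parse_dbsnp_header(header)
print(rec["accession"], rec["five_prime_len"], rec["three_prime_len"], rec["alleles"])
```

For this record both flanks come out at 300 bases, matching the 30 ten-base groups on either side of the variant position.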

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated from the results of published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – description and clinical features of a disorder, or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
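
For the "save in FASTA format" step, a minimal sketch of a FASTA writer is shown below. The function name `to_fasta`, the identifier `rs799916_flank_start`, and the line width are illustrative assumptions. The 30 bases are the first three groups of the rs799916 flank shown earlier in the lecture, used only as sample input.

```python
def to_fasta(identifier, sequence, width=60):
    """Format a sequence as FASTA text: one header line, then wrapped lines."""
    cleaned = "".join(sequence.split()).upper()  # drop spaces and newlines
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + identifier + "\n" + "\n".join(lines) + "\n"


# Hypothetical identifier; sequence taken from the start of the rs799916 flank.
fasta_text = to_fasta("rs799916_flank_start",
                      "AAAATAATCA AGAAGAGCAA AGCATGGATT", width=10)
print(fasta_text, end="")
```

Writing `fasta_text` to a file with a `.fasta` extension gives a record that most sequence tools can read.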
Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  • CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  • Pfam: http://www.sanger.ac.uk/Software/Pfam
  • PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010 there are 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, Atomic coordinates, Derived geometric data, Secondary structure content, Annotations about protein literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase
http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called Single Nucleotide Polymorphisms (SNPs, single-base variations)
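
To put these percentages into absolute numbers, here is a rough back-of-the-envelope calculation. The genome size of 3 billion base pairs is an assumption added here (the slide gives only the percentages), so the results are order-of-magnitude estimates.

```python
# Assumed haploid human genome size in base pairs (not stated on the slide).
GENOME_SIZE_BP = 3_000_000_000

# 0.1% of positions differ between two unrelated individuals.
differing_positions = GENOME_SIZE_BP // 1000

# Roughly 90% of those differences are SNPs.
snp_positions = differing_positions * 90 // 100

print(differing_positions)  # about 3 million differing positions
print(snp_positions)        # about 2.7 million of them are SNPs
```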

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among different individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often causing serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
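
The frequency-based terminology can be written as a one-line rule. The sketch below is illustrative only: the function name `classify_single_base_change` is hypothetical. The slide does not say how a frequency of exactly 1% is labeled, so treating it as a mutation is an assumption.

```python
def classify_single_base_change(frequency, threshold=0.01):
    """Label a single base change by its population allele frequency."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be a fraction between 0 and 1")
    # Above 1% -> SNP; otherwise mutation (the exact-1% case is an assumption).
    return "SNP" if frequency > threshold else "mutation"


print(classify_single_base_change(0.05))   # frequency of 5%
print(classify_single_base_change(0.001))  # frequency of 0.1%
```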

SNPs
• SNPs can occur anywhere on a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter region, coding region, intronic and exonic regions, UTR, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and functions
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g. Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  – e.g. cancers, heart diseases, Alzheimer's, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single base change (an A swapped for a T), which causes valine to be inserted instead of glutamic acid in hemoglobin
Figure (source: httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm): 1. Normal red blood cells; 2. Sickled red blood cells
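
The effect of this single base change can be traced at the codon level. In the standard genetic code, GAG codes for glutamic acid and GTG for valine. The slide does not spell out the codon, so using GAG as the affected beta-globin codon is stated here from general knowledge rather than from the lecture. The helper name `substitute_base` is hypothetical.

```python
# Two entries of the standard genetic code, enough for this example.
CODON_TO_AMINO_ACID = {"GAG": "glutamic acid", "GTG": "valine"}


def substitute_base(codon, index, new_base):
    """Return the codon with the base at `index` (0-based) replaced."""
    return codon[:index] + new_base + codon[index + 1:]


normal_codon = "GAG"
sickle_codon = substitute_base(normal_codon, 1, "T")  # middle A becomes T

print(normal_codon, "->", CODON_TO_AMINO_ACID[normal_codon])
print(sickle_codon, "->", CODON_TO_AMINO_ACID[sickle_codon])
```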

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (or single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (or short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) are over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
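
The two SNP figures on this slide (about 10 million SNPs, about 1 per 300 bp) agree with each other. The quick check below shows this under the assumption of a 3-billion-base-pair genome, a value that is not given on the slide.

```python
# Assumed haploid human genome size in base pairs (not stated on the slide).
GENOME_SIZE_BP = 3_000_000_000

# One SNP per 300 bp implies this many SNPs genome-wide.
snps_from_density = GENOME_SIZE_BP // 300

# Conversely, 10 million SNPs imply this average spacing in bp.
spacing_from_count = GENOME_SIZE_BP // 10_000_000

print(snps_from_density)
print(spacing_from_count)
```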

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted record
    • The original observations of sequence variation; submitted SNP (SS) records start with "ss"
  – Computationally annotated record
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP clusters (refSNP) start with "rs"
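
The two record classes can be told apart from the accession prefix alone. The sketch below assumes accessions have the simple form of "ss" or "rs" followed by digits, as in the examples in this lecture. The function name `classify_accession` is hypothetical.

```python
import re

# "ss" or "rs" followed by one or more digits (form assumed from the slides).
ACCESSION_PATTERN = re.compile(r"^(ss|rs)(\d+)$")


def classify_accession(accession):
    """Return (record class, numeric id) for a dbSNP accession string."""
    match = ACCESSION_PATTERN.match(accession.strip().lower())
    if match is None:
        raise ValueError("not an ss/rs accession: " + accession)
    prefix, number = match.groups()
    kind = "submitted SNP" if prefix == "ss" else "reference SNP cluster"
    return kind, int(number)


print(classify_accession("ss5586300"))
print(classify_accession("rs799916"))
```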

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code | Meaning
A | A
C | C
G | G
T | T
M | A or C
R | A or G
W | A or T
S | C or G
Y | C or T
K | G or T
V | A or C or G
H | A or C or T
D | A or G or T
B | C or G or T
N | G or A or T or C
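
The IUPAC table translates directly into a lookup structure. This is a minimal sketch; the helper names `expand_code` and `code_for_alleles` are hypothetical. It shows both directions: expanding an ambiguity code into its bases, and finding the code for a set of alleles such as the A/G variant in the record above.

```python
# IUPAC nucleotide codes from the table, with bases in alphabetical order.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup: sorted bases -> code.
BASES_TO_CODE = {bases: code for code, bases in IUPAC_CODES.items()}


def expand_code(code):
    """Return the list of bases an IUPAC code stands for."""
    return list(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles):
    """Return the IUPAC code that represents a collection of alleles."""
    key = "".join(sorted({allele.upper() for allele in alleles}))
    return BASES_TO_CODE[key]


print(expand_code("R"))
print(code_for_alleles("A/G".split("/")))
print(code_for_alleles(["C", "A"]))
```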

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, allows SNP record submission. No search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example | Description
BRC*[Gene Name] | Search SNPs on all genes with names starting with the letters BRC (i.e. BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class]) | Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] | Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD] | Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
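
Query strings like those in the table can be assembled from small pieces. This is only a sketch of how the terms compose: the helper names `tagged` and `combine` are hypothetical. The field tags (`CHR`, `Function_Class`, `METHOD`) are the ones shown in the examples, and nothing is sent to dbSNP.

```python
def tagged(value, field):
    """Attach an Entrez field tag to a search value, e.g. 1[CHR]."""
    return f"{value}[{field}]"


def combine(operator, *terms):
    """Join search terms with one Boolean operator (AND, OR, or NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("unsupported operator: " + operator)
    return f" {operator} ".join(terms)


chr1 = tagged("1", "CHR")
chr2 = tagged("2", "CHR")

frameshift_on_chr1 = combine("AND", chr1, "(" + tagged("frameshift", "Function_Class") + ")")
chr1_or_chr2 = combine("OR", chr1, chr2)
known_method_only = combine("NOT", chr1_or_chr2, tagged("unknown", "METHOD"))

print(frameshift_on_chr1)
print(chr1_or_chr2)
print(known_method_only)
```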
Department of Health Information Management
Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs
domains) which may have functional significance
bull Databases exist to store protein families motifs and structural domainsbull CDD
httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite
Department of Health Information Management
Protein Structure Databasesbull Proteins take on 3D structure
bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg
ndash SCOP httpscopmrc-lmbcamacukscop
ndash MMDB httpwwwncbinlmnihgovStructure
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
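The table above can be used to read the ambiguity codes (such as R or Y) that mark variable positions in a dbSNP FASTA sequence. Below is a small sketch that encodes the table as a dictionary and lists the ambiguous positions in a sequence; the helper names `alleles_for` and `ambiguous_positions` are hypothetical.

```python
# IUPAC nucleotide codes, transcribed from the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def alleles_for(code: str) -> list:
    """Return the bases that one IUPAC code stands for."""
    return sorted(IUPAC[code.upper()])


def ambiguous_positions(sequence: str) -> list:
    """List (1-based position, code, alleles) for every non-ACGT base.

    Whitespace is ignored, so the space-separated blocks of a dbSNP
    FASTA record can be passed in directly.
    """
    bases = "".join(sequence.split())
    return [
        (position, base, alleles_for(base))
        for position, base in enumerate(bases, start=1)
        if base.upper() not in "ACGT"
    ]


print(ambiguous_positions("CCA R CTC"))
```

For the short fragment `CCA R CTC`, the only ambiguous base is the R at position 4, which stands for A or G.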
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of SS records and batch search; allows SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
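Queries like the ones above are plain strings of `value[Field]` terms joined by Boolean operators, so they can also be assembled programmatically. The sketch below builds one such query and wraps it in a URL for the NCBI E-utilities `esearch` endpoint without sending any request. The helper names are hypothetical, and whether the field tags from these slides (`CHR`, `METHOD`) are still accepted by the current service is an assumption that is not verified here.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; no network call is made in this sketch.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(value: str, name: str) -> str:
    """Format one search term, e.g. field("1", "CHR") -> "1[CHR]"."""
    return f"{value}[{name}]"


def any_of(*terms: str) -> str:
    """Join terms with OR and parenthesize them so later operators bind clearly."""
    return "(" + " OR ".join(terms) + ")"


def esearch_url(term: str, retmax: int = 20) -> str:
    """Build (but do not fetch) an esearch URL for the SNP database."""
    return ESEARCH + "?" + urlencode({"db": "snp", "term": term, "retmax": retmax})


# SNPs on chromosome 1 or 2, excluding those detected by an unknown method.
query = any_of(field("1", "CHR"), field("2", "CHR")) + " NOT " + field("unknown", "METHOD")
print(query)
print(esearch_url(query))
```

The explicit parentheses are a design choice: they make the intended grouping of OR and NOT unambiguous, which the flat form on the slide leaves to the search engine's precedence rules.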
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP Fasta Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.
Header                     Explanation
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Position of the variation allele (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen             Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs: the two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by /
Lower or upper case        Lower case marks sequence identified by RepeatMasker as low-complexity or repetitive elements
Green ATCG                 Green is used for assay sequence (observed by the submitter)
Black ATCG                 Black is used for flank sequence (extracted from sequence databases)
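The field layout described above is regular enough to parse with a few string operations. Here is a minimal sketch, assuming a header of the shape shown in this lecture (three positional fields followed by `key=value` fields); the function names are hypothetical, and quote stripping is included only as a precaution in case allele values are quoted.

```python
def parse_snp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into a dictionary of named fields."""
    parts = header.lstrip(">").strip().split("|")
    # The first three fields are positional: namespace, database, accession.
    record = {"namespace": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        if "=" in part:  # skip empty or non key=value fields
            key, value = part.split("=", 1)
            record[key] = value.strip("'\"")
    return record


def flank_lengths(record: dict) -> tuple:
    """Derive (5' flank length, 3' flank length) from allelePos and the total length.

    Uses the rules from the table: allelePos = 5' length + 1, and the
    variation itself counts as one base in the total.
    """
    allele_pos = int(record["allelePos"])
    total = int(record.get("totalLen", record.get("len")))
    five_prime = allele_pos - 1
    three_prime = total - five_prime - 1
    return five_prime, three_prime


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
record = parse_snp_header(header)
print(record["accession"], record["alleles"].split("/"))
print(flank_lengths(record))
```

For the rs799916 header, the allele sits at position 301 of 601 bases, so the sketch reports 300 bases of flank on each side.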
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI - OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; when it is retrieved, view its variants
Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - Information on population prevalence of genetic variants
  - Gene-disease associations
  - Gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). Possible journals include "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  - PDB: http://www.pdb.org
  - SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  - MMDB: http://www.ncbi.nlm.nih.gov/Structure
PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  - Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references
PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)
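To give the percentages a sense of scale, the short calculation below applies the 99.9% identity figure and the roughly 90% SNP share to a whole genome. The genome size of 3 billion base pairs is an assumption (a commonly quoted round figure for the haploid human genome) and does not come from the slides.

```python
# Assumed round figure for the haploid human genome size, in base pairs.
GENOME_BP = 3_000_000_000

# 0.1% of positions differ between two unrelated individuals.
differing_bp = GENOME_BP // 1000

# About 90% of those differences are SNPs.
snp_differences = differing_bp * 90 // 100

print(f"differing positions: {differing_bp:,}")
print(f"of which SNPs:       {snp_differences:,}")
```

Under these assumptions, two unrelated genomes differ at about 3 million positions, of which about 2.7 million are single-base variations.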
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history
Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeated in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions, or translocations of large DNA fragments
  - Often cause serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  - A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  - When a single base change occurs at <1%, it is considered a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
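The 1% rule above can be written down as a small classification step. The sketch below computes a minor allele frequency from allele counts and applies the threshold; the function names are hypothetical, and treating a frequency of exactly 1% as a SNP is an arbitrary choice, since the definition on the slide leaves that boundary open.

```python
def minor_allele_frequency(allele_counts: dict) -> float:
    """Frequency of the second most common allele among all observed alleles."""
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no alleles observed")
    ordered = sorted(allele_counts.values(), reverse=True)
    # With only one allele observed there is no variation at this position.
    return ordered[1] / total if len(ordered) > 1 else 0.0


def classify_variant(frequency: float, threshold: float = 0.01) -> str:
    """Apply the 1% rule: at or above the threshold is a SNP, below is a mutation.

    Assigning exactly 1% to "SNP" is an assumption made for this sketch.
    """
    return "SNP" if frequency >= threshold else "mutation"


common = minor_allele_frequency({"A": 950, "G": 50})    # 50 of 1000 chromosomes
rare = minor_allele_frequency({"A": 1995, "G": 5})      # 5 of 2000 chromosomes
print(common, classify_variant(common))
print(rare, classify_variant(rare))
```

The allele counts here are made-up illustrative numbers, not data from any study.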
SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  - Many SNPs lie in genomic non-coding regions
  - SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions, etc.
    • Often play an important role in differentiation and disease
The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  - No consequence
  - Affect gene transcription, quantitatively or qualitatively
  - Affect gene translation, quantitatively or qualitatively
  - Change protein structure and functions
  - Change gene regulation at different steps
Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  - e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  - e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.
Sickle Cell Anemia
• Due to a single nucleotide change, swapping an A for a T, which causes the amino acid inserted into hemoglobin to be valine instead of glutamic acid
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1. Normal red blood cells  2. Sickled red blood cells
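The effect of a single-base swap on the encoded amino acid can be checked with the standard genetic code. The sketch below builds the codon table and compares a reference codon with its substituted form, labeling the change as synonymous or non-synonymous. The codons used (GAG changing to GTG in the beta-globin gene) are the commonly cited sickle cell change and are stated here as background knowledge, not taken from the slide; the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering (DNA codons, * = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def substitute(codon: str, position: int, new_base: str) -> str:
    """Replace the base at a 1-based position within a codon."""
    return codon[:position - 1] + new_base + codon[position:]


def describe_change(ref_codon: str, alt_codon: str) -> tuple:
    """Return (reference amino acid, altered amino acid, kind of change)."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


normal = "GAG"                       # codes for glutamic acid (E)
sickle = substitute(normal, 2, "T")  # A -> T in the middle position gives GTG
print(sickle, describe_change(normal, sickle))
print(describe_change("GAG", "GAA"))  # a third-position change that keeps E
```

The first comparison reports a non-synonymous change from E (glutamic acid) to V (valine), while the GAG to GAA change is synonymous because both codons encode glutamic acid.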
Department of Health Information Management
PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single
worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids
bull Understanding the shape of a molecule helps to understand how it works
bull As of January 2010 there are 62787 searchable structures in the PDB database
bull PDB providesndash Sequence Atomic Coordinates Derived geometric data
Secondary structure content Annotations about protein literature references
Department of Health Information Management
PDB Statistics
httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying which genotype a person has at a given locus (or loci)
Genetic Variations Databases
• dbSNP
  - http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  - http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  - http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  - http://gvs.gs.washington.edu/GVS
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  - Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  - Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  - Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
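The two SNP figures on this slide (roughly 10 million SNPs, about 1 per 300 bp) are consistent with each other if the genome is taken to be about 3 billion base pairs. That genome size is an assumption added here; it is not stated on the slide.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: approximate haploid human genome size
BP_PER_SNP = 300                # from the slide: on average 1 SNP per 300 bp


def expected_snp_count(genome_size_bp=GENOME_SIZE_BP, bp_per_snp=BP_PER_SNP):
    """Estimate how many SNP positions a given average spacing implies."""
    return genome_size_bp // bp_per_snp


if __name__ == "__main__":
    print(f"{expected_snp_count():,} SNPs")
```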
dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted records
    • The original observations of sequence variation; submitted SNP (ss) record IDs start with "ss"
  - Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the originally submitted data; reference SNP cluster (refSNP) IDs start with "rs"
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
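The IUPAC table maps directly onto a lookup dictionary. The sketch below expands an ambiguity code into its bases, finds the single code for a set of alleles, and reports the ambiguous positions in a sequence; all function names are illustrative.

```python
# IUPAC nucleotide codes, transcribed from the table above.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}
# Reverse lookup: set of bases -> single IUPAC code.
CODE_FOR_BASES = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def expand(code):
    """Return the set of bases an IUPAC code stands for."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles):
    """Return the single IUPAC code that represents a set of alleles."""
    return CODE_FOR_BASES[frozenset(a.upper() for a in alleles)]


def ambiguous_positions(sequence):
    """List (1-based position, code, sorted bases) for every ambiguity code.

    Spaces are ignored so that sequences printed in blocks of ten can be
    pasted in directly. Characters outside the table raise KeyError.
    """
    compact = sequence.replace(" ", "")
    return [
        (i + 1, base, sorted(expand(base)))
        for i, base in enumerate(compact)
        if len(IUPAC_CODES[base.upper()]) > 1
    ]


if __name__ == "__main__":
    print(expand("R"))
    print(code_for_alleles(["A", "C"]))
    print(ambiguous_positions("CCA R CTC"))
```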
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of ss records, batch search, and SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (e.g. BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
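Queries like these can also be assembled in code. The sketch below only builds query strings and a URL; it makes no network request. The helper names are illustrative, and the use of NCBI's E-utilities `esearch.fcgi` endpoint is an assumption added here, since the slides describe only the web interface. Whether the field tags shown on the slide are still accepted by the current dbSNP search should be checked against NCBI's documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: E-utilities ESearch endpoint, not mentioned in the slides.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term, tag):
    """Attach an Entrez field tag to a search term, e.g. 1[CHR]."""
    return f"{term}[{tag}]"


def esearch_url(query, retmax=20):
    """Build (but do not send) an ESearch URL for the SNP database."""
    params = {"db": "snp", "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def term_from_url(url):
    """Recover the decoded query term from a URL built by esearch_url."""
    return parse_qs(urlsplit(url).query)["term"][0]


if __name__ == "__main__":
    wildcard = field("BRC*", "Gene Name")
    frameshift = f'{field("1", "CHR")} AND ({field("frameshift", "Function_Class")})'
    for query in (wildcard, frameshift):
        print(query)
        print(esearch_url(query))
```

Keeping the query as a plain string and URL-encoding it only at the last step means the same string can be pasted into the Entrez SNP search box unchanged.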
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen             Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs: the two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Lowercase is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
Green sequence             Assay sequence (observed by the submitter)
Black sequence             Flank sequence (extracted from sequence databases)
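The field descriptions above translate into a small parser. This is a minimal sketch for headers shaped like the ones shown on these slides; the function names and the dictionary keys `database`, `accession`, `record_type`, and `extra` are illustrative, and the quote stripping is a precaution in case the allele list is quoted in real files.

```python
INTEGER_FIELDS = ("allelePos", "len", "totalLen", "taxid", "snpclass", "build")


def parse_dbsnp_header(header):
    """Parse a dbSNP FASTA header line into a dictionary.

    Expects the layout '>gnl|dbSNP|<ss or rs accession>|key=value|...'.
    Fields without '=' (such as the submitter handle and submitter SNP id
    of ss records) are collected under the 'extra' key.
    """
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    parts = [p for p in header[1:].strip().split("|") if p]
    if len(parts) < 3 or parts[0] != "gnl" or parts[1] != "dbSNP":
        raise ValueError("not a gnl|dbSNP header")
    accession = parts[2]
    if accession.startswith("ss"):
        record_type = "submitted SNP"
    elif accession.startswith("rs"):
        record_type = "refSNP cluster"
    else:
        record_type = "unknown"
    record = {"database": parts[1], "accession": accession,
              "record_type": record_type, "extra": []}
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value.strip("\"'")
        else:
            record["extra"].append(part)
    for key in INTEGER_FIELDS:
        if key in record:
            record[key] = int(record[key])
    return record


def flank_lengths(record):
    """Return (5' flank length, 3' flank length) implied by the header."""
    total = record.get("totalLen", record.get("len"))
    five_prime = record["allelePos"] - 1   # allelePos = 5' length + 1
    three_prime = total - record["allelePos"]
    return five_prime, three_prime


if __name__ == "__main__":
    header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
              "snpclass=1|alleles=A/C|mol=Genomic|build=130")
    rec = parse_dbsnp_header(header)
    print(rec["accession"], rec["record_type"], rec["alleles"].split("/"))
    print(flank_lengths(rec))
```

For the rs799916 header, the computed flank lengths can be checked against the sequence itself: 30 blocks of ten bases on each side of the single IUPAC code.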
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI - OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; when it is retrieved, view its variants
Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - information on population prevalence of genetic variants
  - gene-disease associations
  - gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases, and what is known about them, are stored in
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
FlyBase
http://www.flybase.org
FlyBase Introduction
Quick Searches
Quick Search Results
Gene Report Page: gfzf
More Details: Gene Model & Product
Sequence Searches (BLAST)
Choosing Database, Inputting Sequence
More BLAST Options
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history
Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of sequence motifs of variable length, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions, or translocations of large DNA fragments
  - Often cause serious genetic diseases
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases
Department of Health Information Management
Disease Databasesbull Genes are involved in disease
bull Many diseases are well studied
bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim
ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml
Department of Health Information Management
Homework 1bull Using PubMed search for a recent paper related to genetic
disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc
bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation
bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
FlyBase
httpwwwflybaseorg
Department of Health Information Management
FlyBase Introduction
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  - e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  - e.g., cancers, heart disease, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Caused by a single base substitution (an A swapped for a T) that changes the encoded amino acid in hemoglobin from glutamic acid to valine
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1. Normal red blood cells  2. Sickled red blood cells

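A single base swap of this kind can be traced through the genetic code. The sketch below assumes the commonly cited codon change GAG to GTG (glutamic acid to valine), which the slide does not spell out. The three-entry codon table and the helper names are illustrative only; a full table has 64 codons.

```python
# Only the codons needed for this illustration; a full table has 64 entries.
CODON_TABLE = {"GAG": "Glu", "GAA": "Glu", "GTG": "Val"}


def substitute(codon, position, base):
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + base + codon[position + 1:]


def effect(reference, variant):
    """Classify a codon change as synonymous or non-synonymous."""
    same = CODON_TABLE[reference] == CODON_TABLE[variant]
    return "synonymous" if same else "non-synonymous"


normal = "GAG"
sickle = substitute(normal, 1, "T")  # A -> T at the second codon position
silent = substitute(normal, 2, "A")  # G -> A at the third codon position

for variant in (sickle, silent):
    print(normal, "->", variant, ":",
          CODON_TABLE[normal], "->", CODON_TABLE[variant],
          "(" + effect(normal, variant) + ")")
```

The second substitution is included for contrast: it changes the DNA but not the amino acid, which is the synonymous case.
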
A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying which genotype a person has at a given locus (or loci)

Genetic Variations Databases
• dbSNP
  - http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  - http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  - http://www.hapmap.org
• Genome Variation Server (SeattleSNPs)
  - http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  - Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  - Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  - Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number likely falls between those of SNPs and VNTRs

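The two SNP figures on this slide are consistent with each other, as a quick calculation shows. The genome length of about 3 billion base pairs is an assumption added here; it is not stated on the slide.

```python
GENOME_LENGTH_BP = 3_000_000_000  # assumption: approximate human genome size
SNP_SPACING_BP = 300              # from the slide: on average 1 SNP per 300 bp

expected_snps = GENOME_LENGTH_BP // SNP_SPACING_BP
print(f"{expected_snps:,} expected SNPs")
```
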
dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with "ss"
  - Computationally annotated records
    • Generated during the dbSNP build cycle by computation on the original submitted data; Reference SNP cluster (refSNP) IDs start with "rs"

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C

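The code table maps directly onto a lookup dictionary. The sketch below is one possible encoding of it; the function names `expand` and `encode` are illustrative, not part of any dbSNP tool.

```python
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def expand(code):
    """Return the set of bases an IUPAC code stands for."""
    return set(IUPAC_CODES[code.upper()])


def encode(bases):
    """Return the single IUPAC code that represents a collection of bases."""
    wanted = set(base.upper() for base in bases)
    for code, members in IUPAC_CODES.items():
        if set(members) == wanted:
            return code
    raise ValueError(f"no IUPAC code for {sorted(wanted)}")


print(encode("AG"))         # alleles A and G share one ambiguity code
print(sorted(expand("M")))  # bases covered by the code M
```

This is why an A/G variation appears as a single R inside a dbSNP FASTA sequence.
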
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wildcard (*), range (:), and the AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by any method except unknown

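Queries like these are plain strings, so they can be composed programmatically. The sketch below uses made-up helper names and takes the field tags from the examples above; it assumes `*` is the wildcard character. Current Entrez field names may differ from those shown in the lecture.

```python
def field(value, name):
    """Format one Entrez search term as value[Field]."""
    return f"{value}[{name}]"


def any_of(*terms):
    """Join terms with OR, parenthesized so the grouping is explicit."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Join terms with AND."""
    return " AND ".join(terms)


def exclude(query, term):
    """Remove records matching term from the results of query."""
    return f"{query} NOT {term}"


chr1_or_2 = any_of(field("1", "CHR"), field("2", "CHR"))
queries = [
    field("BRC*", "Gene Name"),
    all_of(field("1", "CHR"), field("frameshift", "Function_Class")),
    exclude(chr1_or_2, field("unknown", "METHOD")),
]
for query in queries:
    print(query)
```

The parentheses around the OR group are a design choice made here so that the intended grouping does not depend on operator evaluation order.
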
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in early-onset breast cancer
• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

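The lecture runs this search through the Entrez web page. As an alternative, the same kind of query can be sent to NCBI's E-utilities ESearch endpoint. The sketch below only builds the request URL and does not contact the server. It filters by gene name alone, because the field names for validation status and function class are not given in the slides and guessing them would be unreliable.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="snp", retmax=20):
    """Build an E-utilities ESearch URL; nothing is sent over the network."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Assumption: only the gene-name field is used; the "validated" and
# "non-synonymous coding" filters from the exercise are left out.
url = esearch_url("BRCA1[Gene Name]")
print(url)
```

Opening the printed URL in a browser should return an XML list of matching record identifiers. The remaining filters can then be applied through the Limits options on the Entrez SNP page.
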
Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. "ss" refers to a submitted SNP accession; "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Position of the variation allele (1-based) in the FASTA sequence. It is always the 5' flank length plus 1
len / totalLen             Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)

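The field layout described above is regular enough to parse with a few string operations. The sketch below is illustrative and not an official dbSNP parser. The function names are made up, and the example header assumes the alleles field is written as `'A/C'`, with the quotes and slash that appear to have been lost from the record shown earlier.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header line into a dictionary of fields."""
    parts = [p.strip() for p in header.lstrip(">").split("|") if p.strip()]
    record = {"namespace": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        # key=value fields only; bare fields such as the submitter handle
        # and submitter SNP ID are skipped in this sketch.
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value.strip("'\"")
    # ss = submitted SNP, rs = reference SNP cluster
    prefix = record["accession"][:2]
    record["record_type"] = {"ss": "submitted", "rs": "refSNP"}.get(prefix, "unknown")
    return record


def flank_lengths(record):
    """Derive the 5' and 3' flank lengths from allelePos and len/totalLen."""
    position = int(record["allelePos"])
    total = int(record.get("totalLen", record.get("len")))
    # allelePos = 5' length + 1, and the variation itself counts as 1 base
    return position - 1, total - position


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles='A/C'|mol=Genomic|build=130")
rec = parse_dbsnp_header(header)
print(rec["accession"], rec["record_type"], rec["alleles"], flank_lengths(rec))
```

For this record the variation sits at position 301 of 601 bases, so the 5' and 3' flanks come out equal in length.
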
GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated from the results of published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; once it is retrieved, view its variants

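A 10-digit variant number can be split into its parts with a regular expression. The sketch assumes the usual written form: a six-digit entry number, a period, then a four-digit variant suffix. The identifier below is a hypothetical placeholder, not a real OMIM record.

```python
import re

VARIANT_ID = re.compile(r"^(\d{6})\.(\d{4})$")


def split_variant_id(variant_id):
    """Split an allelic variant number into (entry number, variant suffix)."""
    match = VARIANT_ID.match(variant_id)
    if match is None:
        raise ValueError(f"not a 10-digit variant number: {variant_id!r}")
    return match.group(1), match.group(2)


# Hypothetical identifier used only to show the shape of the number.
print(split_variant_id("123456.0001"))
```
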
Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - Information on population prevalence of genetic variants
  - Gene-disease associations
  - Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
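
For the "save in FASTA format" step, a sequence copied from a GenBank record can be written out with a short helper. This is a minimal sketch: the function name, identifier, and sequence are placeholders, and the line width is a common convention rather than a requirement stated in the assignment.

```python
import io


def write_fasta(handle, identifier, description, sequence, width=70):
    """Write one record: a '>' header line, then fixed-width sequence lines."""
    handle.write(f">{identifier} {description}\n")
    for start in range(0, len(sequence), width):
        handle.write(sequence[start:start + width] + "\n")


# Hypothetical identifier and placeholder sequence, for illustration only.
buffer = io.StringIO()
write_fasta(buffer, "example_gene", "placeholder record", "ATGC" * 40, width=60)
print(buffer.getvalue())
```

Passing a file opened with `open("genes.fasta", "w")` instead of the in-memory buffer saves the record to disk.
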
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
Quick Searches
Department of Health Information Management
Quick Search Results
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to one nucleotide change, swapping an A for a T, which causes the inserted amino acid to be valine instead of glutamic acid in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
Figure labels: 1. Normal red blood cells 2. Sickled red blood cells

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (or single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (or short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record accessions start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) accessions start with "rs"

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
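The IUPAC table above can be applied mechanically to find the variable position in a dbSNP flanking sequence. The following is a minimal sketch: the dictionary simply restates the table, while the function names and the short example sequence (a fragment around the "R" in the ss record shown earlier) are chosen here for illustration.

```python
# IUPAC nucleotide codes from the table above, mapped to the bases they stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand_iupac(code: str) -> list[str]:
    """Return the bases represented by one IUPAC code, in alphabetical order."""
    return sorted(IUPAC_CODES[code.upper()])


def ambiguous_positions(sequence: str) -> list[tuple[int, str, list[str]]]:
    """List (1-based position, code, possible bases) for every ambiguity code."""
    bases_only = "".join(sequence.split())  # drop the spaces between 10-base blocks
    return [
        (position, code, expand_iupac(code))
        for position, code in enumerate(bases_only, start=1)
        if len(IUPAC_CODES[code.upper()]) > 1
    ]


print(ambiguous_positions("CCA R CTC"))  # -> [(4, 'R', ['A', 'G'])]
```

Note that a flanking sequence may contain more than one ambiguity code, so the position reported in the FASTA header, not the first ambiguous base, identifies the variation of interest.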

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
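Queries like the ones in this table can also be assembled programmatically. The sketch below only builds the query string and a request URL for NCBI's E-utilities `esearch` endpoint; it does not send any request. The helper names are hypothetical, and the field tags (`CHR`, `Function_Class`) are taken from the slide, so they are an assumption about what the current Entrez SNP index accepts.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; "snp" is the Entrez database name for dbSNP.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value: str, field: str) -> str:
    """Format one Entrez term as value[Field], as in the slide's examples."""
    return f"{value}[{field}]"


def build_esearch_url(term: str, db: str = "snp", retmax: int = 20) -> str:
    """Build (but do not send) an esearch request URL for the given query."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': term, 'retmax': retmax})}"


# Chromosome 1 SNPs with function class frameshift (second row of the table).
term = f"{field_term('1', 'CHR')} AND {field_term('frameshift', 'Function_Class')}"
print(term)
print(build_esearch_url(term))
```

Building the term from parts keeps the Boolean operators explicit, which helps when combining several field restrictions in one search.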

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.
Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. "ss" refers to a submitted SNP accession; "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) in the FASTA sequence. It is always the 5' flank length plus 1
len / totalLen             Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)
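The field descriptions above are enough to parse a dbSNP FASTA header into named values. What follows is a minimal sketch rather than an official parser: the function name and the returned dictionary layout are hypothetical, and the example header writes the alleles as "A/C" on the assumption that "/" is the allele separator dropped from this transcript.

```python
def parse_dbsnp_fasta_header(header: str) -> dict:
    """Split a dbSNP FASTA header into its fields (sketch; not an official parser)."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    parts = [part.strip() for part in header[1:].split("|") if part.strip()]
    if len(parts) < 3 or parts[0] != "gnl" or parts[1] != "dbSNP":
        raise ValueError("expected a header of the form >gnl|dbSNP|<accession>|...")

    accession = parts[2]
    record = {
        "database": parts[1],
        "accession": accession,
        # "ss" marks a submitted record, "rs" a reference SNP cluster.
        "record_type": "submitted" if accession.startswith("ss") else "refSNP cluster",
    }
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value
        else:
            # Fields without "=", such as the submitter handle and submitter SNP ID.
            record.setdefault("other_fields", []).append(part)

    for key in ("allelePos", "len", "totalLen", "taxid"):
        if key in record:
            record[key] = int(record[key])
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
record = parse_dbsnp_fasta_header(header)
five_prime_length = record["allelePos"] - 1                     # allelePos is 5' length + 1
three_prime_length = record["totalLen"] - record["allelePos"]   # variation counts as 1 base
print(record["accession"], record["record_type"], five_prime_length, three_prime_length)
```

For the rs799916 header this gives 300 bases of flank on each side of the variation, which matches the 30 ten-base blocks shown before and after the "M" in the record.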

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI - OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when the record is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
Gene Report Page gfzf
Department of Health Information Management
More Details Gene Model amp Product
Department of Health Information Management
Sequence Searches (BLAST)
Department of Health Information Management
Choosing Database Inputting Sequence
41
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and functions
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single base change, swapping an A for a T, which causes the amino acid inserted at that position in hemoglobin to be valine instead of glutamic acid
http://mmcenters.discoveryhospital.com/shared/enc/img_htm/IM-56.htm
Figure: (1) Normal red blood cells; (2) Sickled red blood cells

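To make the single-base change concrete, the sketch below applies an A-to-T substitution to one codon and looks up the encoded amino acid. The codons GAG and GTG are not given on the slide; they are the standard textbook description of the sickle cell change, and the lookup table deliberately holds only the two codons needed here.

```python
# Partial codon table (standard genetic code): only the two codons used in this example.
CODON_TABLE = {"GAG": "Glu (glutamic acid)", "GTG": "Val (valine)"}


def substitute_base(codon: str, index: int, new_base: str) -> str:
    """Return the codon with the base at the 0-based index replaced."""
    return codon[:index] + new_base + codon[index + 1:]


NORMAL_CODON = "GAG"  # assumption: textbook codon for the normal allele
SICKLE_CODON = substitute_base(NORMAL_CODON, 1, "T")  # middle A becomes T

if __name__ == "__main__":
    print(NORMAL_CODON, "->", CODON_TABLE[NORMAL_CODON])
    print(SICKLE_CODON, "->", CODON_TABLE[SICKLE_CODON])
```

Because the substitution changes the encoded amino acid, this is a non-synonymous change in the sense used on the "The Effect of SNPs" slide.
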
A Few Relevant Concepts
• Allele: a specific "version" of a gene, or one of the alternative DNA sequences at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs

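The two SNP figures on this slide (about 10 million SNPs, about 1 per 300 bp) agree with each other if the genome is taken to be roughly 3 billion base pairs. That genome size is an assumption added here for the check; it does not appear on the slide.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: approximate haploid human genome size
BP_PER_SNP = 300                # from the slide: on average 1 SNP per 300 bp


def expected_snp_count(genome_size_bp: int = GENOME_SIZE_BP, bp_per_snp: int = BP_PER_SNP) -> int:
    """Estimate the number of SNP positions from an average spacing."""
    return genome_size_bp // bp_per_snp


if __name__ == "__main__":
    print(f"Expected SNPs: {expected_snp_count():,}")
```
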
dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) records have accessions starting with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP Cluster (refSNP) records have accessions starting with "rs"

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C

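The code table above maps directly onto a lookup dictionary. The sketch below expands an ambiguity code into its possible bases and, in the other direction, finds the code for a set of alleles. The function names are hypothetical; the mapping itself is the table above.

```python
# IUPAC nucleotide codes; each value lists the possible bases in alphabetical order.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand_code(code: str) -> set:
    """Return the set of bases that an IUPAC code can stand for."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles) -> str:
    """Return the single IUPAC code that represents the given alleles."""
    wanted = "".join(sorted({allele.upper() for allele in alleles}))
    for code, bases in IUPAC_CODES.items():
        if bases == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {wanted!r}")


if __name__ == "__main__":
    print("R expands to", sorted(expand_code("R")))
    print("Alleles A and C are written as", code_for_alleles(["A", "C"]))
```

This is why a record whose alleles are A and G shows an R at the variant position in its FASTA sequence.
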
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                  Description
BRC*[Gene Name]                          Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])  Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR]                         Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]     Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

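Queries like the ones in the table can also be assembled in code and sent to NCBI's E-utilities ESearch endpoint instead of the web form. The sketch below only builds the query string and URL and does not contact NCBI. The helper names are hypothetical, and the field tags ([CHR], [Function_Class]) are copied from the slide; they may not match what the current dbSNP index accepts.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(value: str, tag: str) -> str:
    """Format one Entrez search term, e.g. 1[CHR]."""
    return f"{value}[{tag}]"


def esearch_url(term: str, db: str = "snp") -> str:
    """Build an ESearch URL for the given query term (no request is made)."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': term})}"


# Second example from the table: chromosome 1, function class frameshift.
QUERY = f"{field('1', 'CHR')} AND {field('frameshift', 'Function_Class')}"
URL = esearch_url(QUERY)

if __name__ == "__main__":
    print(QUERY)
    print(URL)
    # Decode the URL again to confirm the term survives encoding unchanged.
    print(parse_qs(urlparse(URL).query)["term"][0])
```
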
Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous, coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

Field                     Meaning
gnl                       Internal use
dbSNP                     Database name
ss or rs number           dbSNP accession for the SNP. "ss" refers to a submitted SNP accession; "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                 Position of the variation allele (1-based) in the FASTA sequence. It is always the 5' flank length plus 1
len / totalLen            Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id   Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
taxid                     NCBI taxonomy id
mol                       Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                  Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                   Lists the alleles of the SNP, separated by "/"
Lower or upper case       Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
Green text                Used for assay sequence (observed by the submitter)
Black text                Used for flank sequence (extracted from sequence databases)

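The field descriptions above are enough to parse a header line and recover the flank lengths. This is a minimal sketch with hypothetical function names. It reads only the key=value fields and skips positional ones such as the submitter handle, and the "/" in the example allele lists is an assumption based on the "alleles" row of the table.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into its accession and key=value fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    fields = [part.strip() for part in header[1:].split("|") if part.strip()]
    if len(fields) < 3:
        raise ValueError("expected at least gnl|dbSNP|accession")
    accession = fields[2]
    if accession.startswith("ss"):
        record_type = "submitted SNP"
    elif accession.startswith("rs"):
        record_type = "refSNP cluster"
    else:
        record_type = "unknown"
    record = {"accession": accession, "record_type": record_type}
    for item in fields[3:]:
        if "=" in item:  # positional fields (handle, submitter id) are skipped
            key, value = item.split("=", 1)
            record[key] = value.strip("\"'")
    return record


def flank_lengths(record: dict) -> tuple:
    """Return (5' flank length, 3' flank length) from allelePos and the total length."""
    allele_pos = int(record["allelePos"])
    total = int(record.get("totalLen") or record["len"])
    return allele_pos - 1, total - allele_pos


RS_HEADER = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
             "snpclass=1|alleles=A/C|mol=Genomic|build=130")
SS_HEADER = ">gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic"

if __name__ == "__main__":
    for header in (RS_HEADER, SS_HEADER):
        parsed = parse_dbsnp_header(header)
        print(parsed["accession"], parsed["record_type"], flank_lengths(parsed))
```

For the rs799916 header, allelePos=301 and totalLen=601 give 300 bases on each side of the variant, which matches the rule that allelePos is the 5' flank length plus 1.
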
GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs; single-base variations)

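The percentages on this slide translate into rough absolute numbers once a genome size is assumed. The sketch below uses about 3 billion base pairs, a figure not stated on the slide, so the results are order-of-magnitude estimates only.

```python
GENOME_SIZE_BP = 3_000_000_000   # assumption: approximate haploid human genome size
DIFFERENCE_PER_THOUSAND = 1      # from the slide: 0.1% of positions differ
SNP_PERCENT_OF_DIFFERENCES = 90  # from the slide: ~90% of the differences are SNPs


def pairwise_differences(genome_size_bp: int = GENOME_SIZE_BP) -> int:
    """Estimate how many positions differ between two unrelated individuals."""
    return genome_size_bp * DIFFERENCE_PER_THOUSAND // 1000


def pairwise_snp_differences(genome_size_bp: int = GENOME_SIZE_BP) -> int:
    """Estimate how many of those differing positions are SNPs."""
    return pairwise_differences(genome_size_bp) * SNP_PERCENT_OF_DIFFERENCES // 100


if __name__ == "__main__":
    print(f"Differing positions: {pairwise_differences():,}")
    print(f"Of which SNPs:       {pairwise_snp_differences():,}")
```

Integer arithmetic is used so that the estimates are exact for the assumed inputs rather than subject to floating-point rounding.
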
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single-nucleotide change swapping an A for a T, causing the inserted amino acid to be valine instead of glutamic acid in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
Figure labels: 1. Normal red blood cells; 2. Sickled red blood cells
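
To make the sickle cell example concrete, the sketch below applies an A-to-T change to a codon and checks whether the encoded amino acid changes. The specific codons (GAG for glutamic acid, GTG for valine) are the widely documented sickle cell change in the beta-globin gene and are not stated on the slide. The codon table is deliberately partial, and the function names are hypothetical.

```python
# Partial codon table: only the codons used in this illustration.
CODON_TABLE = {
    "GAG": "Glu",  # glutamic acid
    "GAA": "Glu",  # glutamic acid (synonymous with GAG)
    "GTG": "Val",  # valine
}


def substitute_base(codon: str, index: int, new_base: str) -> str:
    """Return a copy of the codon with the base at a 0-based index replaced."""
    return codon[:index] + new_base + codon[index + 1:]


def classify_codon_change(reference: str, variant: str) -> str:
    """Report whether a codon change alters the encoded amino acid."""
    same_amino_acid = CODON_TABLE[reference] == CODON_TABLE[variant]
    return "synonymous" if same_amino_acid else "non-synonymous"


normal_codon = "GAG"
sickle_codon = substitute_base(normal_codon, 1, "T")  # A -> T at the middle base
print(sickle_codon, CODON_TABLE[sickle_codon],
      classify_codon_change(normal_codon, sickle_codon))
```

A change such as GAG to GAA would be reported as synonymous, which corresponds to the "no consequence" case at the protein level.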

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (SeattleSNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
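
The two SNP figures on this slide agree with each other, as a quick calculation shows. The genome size used here (about 3 billion base pairs) is an assumed round figure and is not stated on the slide.

```python
HUMAN_GENOME_BP = 3_000_000_000  # assumption: round haploid genome size, not from the slide
BP_PER_SNP = 300                 # from the slide: on average 1 SNP per 300 bp

estimated_snps = HUMAN_GENOME_BP // BP_PER_SNP
print(f"{estimated_snps:,}")  # prints 10,000,000
```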

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record accessions start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) accessions start with "rs"

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Codes and Meanings
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
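
The IUPAC table maps directly onto a lookup dictionary. The sketch below, with hypothetical names, expands a short sequence containing ambiguity codes into every concrete sequence it can represent. The test fragment "CCARCTC" is taken from around the variant position of the ss5586300 record shown earlier.

```python
from itertools import product

# IUPAC nucleotide codes, as listed in the table above.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand_iupac(sequence: str) -> list[str]:
    """List every concrete DNA sequence matched by an IUPAC-coded sequence."""
    choices = [IUPAC_CODES[base] for base in sequence.upper()]
    return ["".join(combination) for combination in product(*choices)]


print(expand_iupac("CCARCTC"))  # R stands for A or G
```

The number of expansions multiplies with each ambiguity code, so this approach suits short flanking fragments rather than whole records.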

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                   Description
BRC*[Gene Name]                           Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])   Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                          Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]      Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
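
Queries like these can also be assembled programmatically. The sketch below builds one of the example search terms and wraps it in a URL for the NCBI E-utilities ESearch endpoint without sending any request. The helper names are hypothetical, and the field tags are copied from the slide; whether the current dbSNP Entrez index still accepts these exact tags is an assumption that should be checked against NCBI's documentation.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value: str, field: str) -> str:
    """Format a single Entrez term such as 1[CHR]."""
    return f"{value}[{field}]"


def build_snp_search_url(term: str, max_results: int = 20) -> str:
    """Build (but do not send) an ESearch URL for the SNP database."""
    params = {"db": "snp", "term": term, "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Field tags below come from the slide and are not verified against the live index.
term = f"{field_term('1', 'CHR')} AND ({field_term('frameshift', 'Function_Class')})"
url = build_snp_search_url(term)

# Round-trip check: decoding the URL recovers the original search term.
decoded_term = parse_qs(urlparse(url).query)["term"][0]
print(term)
print(url)
print(decoded_term == term)
```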

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.
Field                     Description
gnl                       Internal use
dbSNP                     Database name
ss or rs number           dbSNP accession for the SNP; "ss" refers to a submitted SNP accession, and "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                 Variation allele position (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len or totalLen           Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id   Only for submitted SNPs; the two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                     NCBI taxonomy ID
mol                       Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                  Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                   Lists the alleles of the SNP, separated by "/"
Lower or upper case       Lower case marks sequence identified by RepeatMasker as low-complexity or repetitive elements
Green ATCG                Green is used for assay sequence (observed by the submitter)
Black ATCG                Black is used for flank sequence (extracted from sequence databases)
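
The header layout described above can be read with a small parser. The following is a minimal sketch rather than an official dbSNP tool: the function name and output keys are hypothetical, and the "/" separator between alleles is assumed, because the transcript of the slides has lost most punctuation. The flank lengths follow from the rule that allelePos is the 5' flank length plus 1.

```python
def parse_dbsnp_fasta_header(header: str) -> dict:
    """Split a dbSNP FASTA header into named fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = [f.strip() for f in header[1:].split("|") if f.strip()]
    if len(fields) < 3:
        raise ValueError("expected at least gnl|dbSNP|accession")

    record = {"prefix": fields[0], "database": fields[1], "accession": fields[2]}
    # "ss" marks a submitted SNP, "rs" a refSNP cluster.
    record_types = {"ss": "submitted", "rs": "refSNP cluster"}
    record["record_type"] = record_types.get(fields[2][:2], "unknown")

    extras = []
    for field in fields[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip('"')
        else:
            extras.append(field)  # e.g. submitter handle and submitter SNP ID
    record["extras"] = extras

    # Submitted records use "len"; refSNP records use "totalLen".
    total = record.get("totalLen", record.get("len"))
    if "allelePos" in record and total is not None:
        position, total_length = int(record["allelePos"]), int(total)
        record["five_prime_length"] = position - 1
        record["three_prime_length"] = total_length - position
    if "alleles" in record:
        record["alleles"] = record["alleles"].split("/")  # assumed separator
    return record


example = parse_dbsnp_fasta_header(
    ">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
    "snpclass=1|alleles=A/C|mol=Genomic|build=130"
)
print(example["accession"], example["record_type"],
      example["five_prime_length"], example["three_prime_length"],
      example["alleles"])
```

For the rs799916 header this yields two 300-base flanks around the variant, which matches the 30 ten-base blocks that precede the M code in the sequence shown on the SNP Location slide.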

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases
Department of Health Information Management
Disease Databasesbull Genes are involved in disease
bull Many diseases are well studied
bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim
ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml
Department of Health Information Management
Homework 1bull Using PubMed search for a recent paper related to genetic
disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc
bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation
bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
More BLAST Options
Department of Health Information Management
BLAST Results
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Sickle Cell Anemia
• Due to a single-nucleotide change: swapping an A for a T causes the inserted amino acid to be valine instead of glutamic acid in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
Figure: 1. Normal red blood cells; 2. Sickled red blood cells
A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)
Genetic Variations Databases
• dbSNP
  - http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  - http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  - http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  - http://gvs.gs.washington.edu/GVS
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  - Single-base nucleotide substitutions (or single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  - Microsatellite repeat variations (or short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) exceed 100,000 across the human genome
  - Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
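The two SNP figures above (about 10 million, and 1 per 300 bp) are consistent with each other, as a quick back-of-the-envelope check shows. The genome length of roughly 3 billion base pairs is an assumption added here; it is not stated on the slide.

```python
# Assumption: a haploid human genome of about 3 billion base pairs.
HUMAN_GENOME_BP = 3_000_000_000


def expected_variant_sites(genome_length_bp, bp_per_variant):
    """Estimate the number of variant sites at a density of one per bp_per_variant."""
    return genome_length_bp // bp_per_variant


print(expected_variant_sites(HUMAN_GENOME_BP, 300))  # 10000000
```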
dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted record
    • The original observations of sequence variation; submitted SNP (SS) record accessions start with "ss"
  - Computationally annotated record
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) accessions start with "rs"
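Because the record class is encoded in the accession prefix, it can be read off with a small helper. The sketch below is illustrative only; the function name and the returned labels are assumptions, not part of any dbSNP tool.

```python
import re

# "ss" or "rs" followed by digits, e.g. ss5586300 or rs799916.
_ACCESSION = re.compile(r"^(ss|rs)(\d+)$", re.IGNORECASE)


def record_class(accession):
    """Return (record class, numeric id) for a dbSNP accession string."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        raise ValueError(f"not a dbSNP ss/rs accession: {accession!r}")
    prefix = match.group(1).lower()
    kind = "submitted SNP" if prefix == "ss" else "reference SNP cluster"
    return kind, int(match.group(2))


print(record_class("ss5586300"))  # ('submitted SNP', 5586300)
print(record_class("rs799916"))   # ('reference SNP cluster', 799916)
```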
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
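The IUPAC codes listed above map naturally onto a lookup table. The sketch below expands a code into its possible bases and finds the code for a set of alleles; the helper names are illustrative.

```python
# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand_code(code):
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles):
    """Return the single IUPAC code that represents the given alleles."""
    wanted = {allele.upper() for allele in alleles}
    for code, bases in IUPAC_CODES.items():
        if set(bases) == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {sorted(wanted)}")


print(sorted(expand_code("R")))      # ['A', 'G']
print(code_for_alleles(["A", "C"]))  # M
```

This is how an A/G variation appears as a single "R" in the flanking sequence of a dbSNP FASTA record.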
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
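Query strings like the ones above can also be assembled programmatically. The following is a minimal sketch that builds the search term and a request URL for the NCBI E-utilities esearch endpoint, without sending anything over the network. The helper names are illustrative, and the field tags (CHR, Function_Class, METHOD) are taken from the slide; whether the current Entrez SNP index still accepts them is an assumption.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; db="snp" selects the SNP database.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(value, name):
    """Tag a value with an Entrez search field, e.g. 1[CHR]."""
    return f"{value}[{name}]"


def any_of(*terms):
    """Combine terms with OR, parenthesized so later operators apply to the group."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Combine terms with AND."""
    return " AND ".join(terms)


def exclude(term, unwanted):
    """Remove records matching `unwanted` from the results of `term`."""
    return f"{term} NOT {unwanted}"


def esearch_url(term, db="snp", retmax=20):
    """Build (but do not send) an esearch request URL for the given term."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = exclude(any_of(field("1", "CHR"), field("2", "CHR")), field("unknown", "METHOD"))
print(term)  # (1[CHR] OR 2[CHR]) NOT unknown[METHOD]
print(esearch_url(term))
```

The explicit parentheses around the OR group are a design choice of this sketch: they make the intended grouping visible instead of relying on the order in which the search engine evaluates Boolean operators.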
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP Fasta Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.
Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP; "ss" refers to a submitted SNP accession, "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len/totalLen               Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs: the two fields after totalLen are the submitter handle and the submitter SNP id
taxid                      NCBI taxonomy id
mol                        Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green color is used for assay sequence (observed by the submitter)
ATCG (black)               Black color is used for flank sequence (extracted from sequence databases)
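The header layout described above is regular enough to parse with plain string splitting. The sketch below is one possible reading of it, not an official dbSNP parser: the function names are illustrative, and it assumes that the alleles are written with a slash and optional quotes (the transcript has lost that punctuation).

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header line into a dict of its fields."""
    fields = [f for f in header.strip().lstrip(">").split("|") if f]
    record = {"tag": fields[0], "database": fields[1], "accession": fields[2]}
    for item in fields[3:]:
        if "=" in item:
            key, value = item.split("=", 1)
            record[key] = value.strip("'\"")
        else:
            # Unkeyed fields: submitter handle and submitter SNP id (ss records only).
            record.setdefault("positional", []).append(item)
    return record


def flank_lengths(record):
    """Return (5' flank length, 3' flank length); the variation itself counts as 1 base."""
    position = int(record["allelePos"])
    total = int(record["totalLen"] if "totalLen" in record else record["len"])
    return position - 1, total - position


def variation_code(record, sequence):
    """Return the IUPAC code found at allelePos, ignoring whitespace in the sequence."""
    bases = "".join(sequence.split())
    return bases[int(record["allelePos"]) - 1]


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles='A/C'|mol=Genomic|build=130")
record = parse_dbsnp_header(header)
print(record["accession"], record["alleles"].split("/"), flank_lengths(record))
# rs799916 ['A', 'C'] (300, 300)
```

For the rs799916 record, allelePos=301 and totalLen=601 give 300 flanking bases on each side of the variation, which matches the rule that allelePos is the 5' flank length plus 1.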
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI - OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; when it is retrieved, view its variants
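As a small illustration of the 10-digit variant number, the sketch below splits an identifier into its entry and variant parts. It assumes the digits are written as a 6-digit OMIM entry number, a dot, and a 4-digit variant suffix; the function name and the example identifier are made up for illustration.

```python
import re

# Assumed layout: six digits for the OMIM entry, a dot, four digits for the variant.
_OMIM_VARIANT = re.compile(r"^(\d{6})\.(\d{4})$")


def split_omim_variant(variant_id):
    """Return (entry number, variant index) for an identifier like 123456.0001."""
    match = _OMIM_VARIANT.match(variant_id.strip())
    if match is None:
        raise ValueError(f"not a 10-digit OMIM variant number: {variant_id!r}")
    return match.group(1), int(match.group(2))


# "123456.0001" is a placeholder, not a real OMIM record.
print(split_omim_variant("123456.0001"))  # ('123456', 1)
```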
Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - Information on population prevalence of genetic variants
  - Gene-disease associations
  - Gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
BLAST Results
Genetic Variations
Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)
Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history
Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of sequence motifs of variable length, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions, or translocations of large DNA fragments
  - Often cause serious genetic diseases
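Tandem repeat polymorphisms differ in copy number, and that number can be counted directly from a sequence. The sketch below finds the longest uninterrupted run of a given motif; the function name and the toy sequences are made up for illustration and are not real alleles.

```python
import re


def max_tandem_copies(sequence, motif):
    """Return the largest number of back-to-back copies of motif found in sequence."""
    if not motif:
        raise ValueError("motif must not be empty")
    # Match one or more consecutive copies of the motif, case-insensitively.
    pattern = re.compile(f"(?:{re.escape(motif)})+", re.IGNORECASE)
    runs = pattern.findall(sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Two toy alleles that differ only in the copy number of a CAG repeat.
allele_a = "TT" + "CAG" * 5 + "AA"
allele_b = "TT" + "CAG" * 8 + "AA"
print(max_tandem_copies(allele_a, "CAG"), max_tandem_copies(allele_b, "CAG"))  # 5 8
```

Comparing copy numbers at several such loci is the basic idea behind using tandem repeats as markers in DNA fingerprinting.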
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases
Department of Health Information Management
Disease Databasesbull Genes are involved in disease
bull Many diseases are well studied
bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim
ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml
Department of Health Information Management
Homework 1bull Using PubMed search for a recent paper related to genetic
disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc
bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation
bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Genetic Variations
Department of Health Information Management
Polymorphismsbull Genomic sequences from two unrelated
individuals are 999 identical
bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (SeattleSNPs)
  – http://gvs.gs.washington.edu/GVS
dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
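The two SNP figures on this slide ("roughly 10 million" and "1 per 300 bp") agree with each other if the haploid human genome is taken to be about 3 billion base pairs. That genome size is an assumption added here; it is not stated on the slide. A quick check:

```python
GENOME_SIZE_BP = 3_000_000_000  # assumed haploid genome size (not from the slide)
BP_PER_SNP = 300                # "1 per 300 bp", from the slide

# Expected number of SNP positions at that average density.
expected_snps = GENOME_SIZE_BP // BP_PER_SNP
print(f"{expected_snps:,}")
```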
dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; accessions of submitted SNP (ss) records start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; accessions of Reference SNP clusters (refSNP) start with "rs"
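The ss/rs naming rule is easy to apply mechanically. The following is a minimal sketch, assuming an accession is simply the two-letter prefix followed by digits; the function name and the returned labels are illustrative, not part of dbSNP.

```python
import re

# Assumed accession shape: "ss" or "rs" followed by one or more digits.
_ACCESSION = re.compile(r"^(ss|rs)(\d+)$")


def record_class(accession: str) -> str:
    """Return which of the two dbSNP record classes an accession belongs to."""
    match = _ACCESSION.match(accession.strip().lower())
    if match is None:
        raise ValueError(f"not an ss/rs accession: {accession!r}")
    return "submitted" if match.group(1) == "ss" else "refSNP cluster"


print(record_class("ss5586300"))
print(record_class("rs799916"))
```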
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
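The code table above translates directly into a lookup. This is a small sketch; the dictionary mirrors the table, while the helper names `bases_for` and `code_for` are illustrative.

```python
# IUPAC nucleotide codes, mirroring the table above.
# Each value lists the bases the code stands for, in alphabetical order.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def bases_for(code: str) -> set:
    """Return the set of bases that one IUPAC code represents."""
    return set(IUPAC[code.upper()])


def code_for(alleles) -> str:
    """Return the single IUPAC code covering exactly the given alleles."""
    wanted = "".join(sorted({a.upper() for a in alleles}))
    for code, bases in IUPAC.items():
        if bases == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {alleles!r}")


print(bases_for("R"))
print(code_for(["A", "G"]))
print(code_for(["A", "C"]))
```

This is why a SNP with alleles A and G appears as R in a dbSNP sequence, and one with alleles A and C appears as M.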
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of ss records; batch search; allows SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
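Queries like these can also be assembled in a script. The sketch below only builds the query strings and a request URL; it does not contact NCBI. It assumes the wild-card character is the asterisk used by Entrez, and it takes the field tags (Gene Name, CHR, Function_Class, METHOD) from the slide as-is, so they may not match the field names the current service accepts. The ESearch endpoint and its `db`, `term`, and `retmax` parameters belong to NCBI's E-utilities; the helper names are illustrative.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(value: str, tag: str) -> str:
    """Format one fielded Entrez term, e.g. 1[CHR]."""
    return f"{value}[{tag}]"


def esearch_url(term: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request against the snp database."""
    return ESEARCH + "?" + urlencode({"db": "snp", "term": term, "retmax": retmax})


chr1 = field("1", "CHR")
chr2 = field("2", "CHR")
queries = [
    field("BRC*", "Gene Name"),
    f"{chr1} AND ({field('frameshift', 'Function_Class')})",
    f"{chr1} OR {chr2}",
    f"{chr1} OR {chr2} NOT {field('unknown', 'METHOD')}",
]

for query in queries:
    print(query)
    print("  ", esearch_url(query))
```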
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC
AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA
TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT
GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA
AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT
CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT
CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA
GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC
ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT
CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT
GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.
Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP; "ss" refers to a submitted SNP accession, "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Position of the variation allele (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen             Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs: the two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Lowercase is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
Green sequence (ATCG)      Assay sequence (observed by the submitter)
Black sequence (ATCG)      Flank sequence (extracted from sequence databases)
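The field rules above can be checked against a record programmatically. Below is a minimal sketch, assuming the header is split on "|", that key=value fields use "=", and that alleles are separated by "/". The toy record (accession rs0000 and the four-base flanks) is made up for illustration and is not a real dbSNP entry; the function names are likewise illustrative.

```python
def parse_header(line: str) -> dict:
    """Split a dbSNP FASTA header into its fields."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    parts = line[1:].strip().split("|")
    info = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        # Fields without '=' (submitter handle and ID on ss records) are skipped.
        if "=" in part:
            key, value = part.split("=", 1)
            info[key] = value.strip("'\"")
    return info


def is_consistent(header: str, five_prime: str, variation: str, three_prime: str) -> bool:
    """Check allelePos and len/totalLen against the actual flank lengths."""
    info = parse_header(header)
    flank5 = "".join(five_prime.split())   # drop the spaces between 10-base blocks
    flank3 = "".join(three_prime.split())
    total = int(info["totalLen"] if "totalLen" in info else info["len"])
    return (int(info["allelePos"]) == len(flank5) + 1
            and total == len(flank5) + len(variation) + len(flank3))


# Hypothetical toy record: 4-base flanks around one IUPAC code.
toy_header = ">gnl|dbSNP|rs0000|allelePos=5|totalLen=9|taxid=9606|snpclass=1|alleles=A/G|mol=Genomic"
fields = parse_header(toy_header)
print(fields["accession"], fields["alleles"].split("/"))
print(is_consistent(toy_header, "ACGT", "R", "TGCA"))
```

The same check applies to the rs799916 record shown earlier: 300 bases of 5' flank put the variation at position 301, and 300 + 1 + 300 gives the totalLen of 601.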
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI: OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants
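A 10-digit variant number can be split into the entry it belongs to and the variant itself. This sketch assumes the ten digits are written as a 6-digit entry number, a period, and a 4-digit variant number; that layout is an assumption made for the example, since the slide gives only the digit count. The number used below is a made-up placeholder.

```python
import re

# Assumed layout: six digits, a period, four digits (ten digits in total).
_VARIANT = re.compile(r"^(\d{6})\.(\d{4})$")


def split_variant_number(number: str):
    """Return (entry_number, variant_number) for a 10-digit variant number."""
    match = _VARIANT.match(number.strip())
    if match is None:
        raise ValueError(f"not a 10-digit variant number: {number!r}")
    return match.group(1), match.group(2)


entry, variant = split_variant_number("123456.0001")  # placeholder, not a real record
print(entry, variant)
```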
Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases
Department of Health Information Management
Disease Databasesbull Genes are involved in disease
bull Many diseases are well studied
bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim
ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml
Department of Health Information Management
Homework 1bull Using PubMed search for a recent paper related to genetic
disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc
bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation
bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences
among different individuals
bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals
bull Genetic variations reveal clues of ancestral human migration history
Department of Health Information Management
Major Types of Genetic Variationsbull Single nucleotide mutation
ndash Majority of SNPs do NOT directly contribute to any phenotypes
bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of
variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)
bull Used as genetic markers for DNA finger printing (forensic parentage testing)
bull Many cause genetic diseases
ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)
bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments
ndash Often causing serious genetic diseases
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted records
    • The original observations of sequence variation: submitted SNP (SS) records, whose accessions start with ss
  - Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data: reference SNP clusters (refSNP), whose accessions start with rs
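
The two record classes can be told apart by accession prefix alone. The following is a small sketch of that rule; the function name and the returned labels are illustrative choices, not part of dbSNP.

```python
import re

# An accession is the prefix "ss" or "rs" followed by digits, e.g. ss5586300.
_ACCESSION = re.compile(r"^(ss|rs)(\d+)$")


def classify_accession(accession: str) -> str:
    """Return 'submitted' for ss accessions and 'refsnp' for rs accessions."""
    match = _ACCESSION.match(accession.strip().lower())
    if match is None:
        raise ValueError(f"not a dbSNP ss/rs accession: {accession!r}")
    return "submitted" if match.group(1) == "ss" else "refsnp"


print(classify_accession("ss5586300"))
print(classify_accession("rs799916"))
```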
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
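
The IUPAC table maps directly onto a lookup dictionary. The sketch below encodes the table and converts in both directions; the helper names `expand` and `code_for` are hypothetical.

```python
# IUPAC nucleotide codes, transcribed from the table above.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def expand(code: str) -> set:
    """Return the set of bases that a single IUPAC code stands for."""
    return set(IUPAC_CODES[code.upper()])


def code_for(alleles) -> str:
    """Return the IUPAC code that represents exactly the given bases."""
    target = frozenset(base.upper() for base in alleles)
    for code, bases in IUPAC_CODES.items():
        if frozenset(bases) == target:
            return code
    raise ValueError(f"no IUPAC code for alleles {sorted(target)}")


print(code_for("AG"))        # R
print(sorted(expand("M")))   # ['A', 'C']
```

This is why a SNP with alleles A and G appears as R at the variant position of a dbSNP FASTA sequence, and one with alleles A and C appears as M.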
Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Supports direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval
Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html
dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example                                    Description
BRC*[Gene Name]                            SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           All SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       All SNPs on chromosome 1 or 2 detected by any method except unknown
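
Queries like the ones above can also be assembled programmatically. The sketch below only builds the query string and a request URL; it does not contact NCBI. The E-utilities ESearch endpoint and the helper names are assumptions not taken from the slides, and the field tags are copied from the slide, so they may differ in the current dbSNP search interface.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not mentioned on the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tag(value: str, field: str) -> str:
    """Attach an Entrez field tag to a search value, e.g. 1[CHR]."""
    return f"{value}[{field}]"


def build_esearch_url(term: str, retmax: int = 20) -> str:
    """Return an ESearch URL for the SNP database with the term URL-encoded."""
    query = urlencode({"db": "snp", "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


# Rebuild the last example from the table: chromosome 1 or 2, method not unknown.
term = " ".join(
    [tag("1", "CHR"), "OR", tag("2", "CHR"), "NOT", tag("unknown", "METHOD")]
)
url = build_esearch_url(term)
print(term)
print(url)
```

Building the term from tagged pieces keeps the brackets and operators consistent, and `urlencode` handles the escaping of spaces and brackets in the URL.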
Legend in Results
Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
Entrez SNP Search Results
dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920
SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP FASTA Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.
Header field               Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Position of the variation allele (1-based) in the FASTA sequence. It is always the 5' flank length plus 1
len or totalLen            Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
taxid                      NCBI taxonomy id
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by /
Lower or upper case        Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)
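
The header layout described above is regular enough to parse with plain string operations. The following is a minimal sketch, not an official dbSNP parser: the function names are hypothetical, the sample header assumes the alleles are written with a slash separator, and fields without an `=` sign (such as the submitter handle and id) are simply collected in a list.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into its accession and key=value fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    parts = [p.strip() for p in header[1:].strip().split("|") if p.strip()]
    if len(parts) < 3 or parts[0] != "gnl" or parts[1] != "dbSNP":
        raise ValueError("expected a header of the form >gnl|dbSNP|<accession>|...")
    record = {"accession": parts[2], "extra": []}
    for part in parts[3:]:
        key, sep, value = part.partition("=")
        if sep:
            record[key] = value.strip("\"'")
        else:
            # Positional fields, e.g. handle and submitted_snp_id.
            record["extra"].append(part)
    return record


def flank_lengths(record: dict) -> tuple:
    """Return (5' flank length, 3' flank length) from allelePos and len/totalLen."""
    allele_pos = int(record["allelePos"])
    total = int(record.get("totalLen", record.get("len")))
    # allelePos is the 5' flank length plus 1; the variation itself counts as 1.
    return allele_pos - 1, total - allele_pos


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
record = parse_dbsnp_header(header)
print(record["accession"], record["alleles"].split("/"))
print(flank_lengths(record))  # (300, 300)
```

For the rs799916 record, an allele position of 301 in a 601-base sequence means 300 flanking bases on each side of the single variant position.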
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI: OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; once the record is retrieved, view its variants
Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - information on the population prevalence of genetic variants
  - gene-disease associations
  - gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in databases
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of variable-length sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions, or translocations of large DNA fragments
  - Often cause serious genetic diseases
SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  - A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  - When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
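
The frequency-based terminology amounts to a single threshold test. Below is a minimal sketch of it; the function name is hypothetical, and treating a frequency of exactly 1% as a SNP is an assumption, since the slide defines only the cases above and below 1%.

```python
def classify_by_frequency(allele_frequency: float, threshold: float = 0.01) -> str:
    """Label a single-base change as 'SNP' or 'mutation' by its population frequency.

    Assumption: a frequency exactly equal to the threshold is labeled 'SNP'.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    return "SNP" if allele_frequency >= threshold else "mutation"


print(classify_by_frequency(0.05))   # SNP
print(classify_by_frequency(0.001))  # mutation
```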
SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  - Many SNPs lie in genomic non-coding regions
  - SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease
The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  - No consequence
  - Affects gene transcription, quantitatively or qualitatively
  - Affects gene translation, quantitatively or qualitatively
  - Changes protein structure and function
  - Changes gene regulation at different steps
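
For a SNP inside a coding region, the synonymous versus non-synonymous distinction can be decided by translating the codon before and after the change. The sketch below uses the standard genetic code and DNA (coding-strand) codons; the function name is hypothetical, and the inputs are illustrative, including the well-known sickle-cell codon change GAG to GTG (glutamic acid to valine).

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon: str, position: int, alt_base: str) -> str:
    """Classify a single-base change in a codon (position is 0, 1, or 2)."""
    codon = codon.upper()
    alt_base = alt_base.upper()
    if len(codon) != 3 or position not in (0, 1, 2) or alt_base not in BASES:
        raise ValueError("expected a 3-base codon, position 0-2, and a DNA base")
    mutated = codon[:position] + alt_base + codon[position + 1:]
    same_amino_acid = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same_amino_acid else "non-synonymous"


# GAG -> GTG replaces glutamic acid (E) with valine (V).
print(classify_coding_snp("GAG", 1, "T"))  # non-synonymous
# GAG -> GAA still encodes glutamic acid (E).
print(classify_coding_snp("GAG", 2, "A"))  # synonymous
```

A synonymous change leaves the protein sequence unchanged, which is one route to the "no consequence" outcome listed above, although such changes can still affect regulation or translation efficiency.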
Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  - e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  - e.g., cancers, heart disease, Alzheimer's disease, diabetes, asthma, obesity, etc.
Department of Health Information Management
SNPs and Mutationsbull Terminology for variation at a single nucleotide
position is defined by allele frequencyndash A single base change occurring in a population at a
frequency of gt1 is termed a single nucleotide polymorphism (SNP)
ndash When a single base change occurs at lt1 it is considered to be a mutation
bull A SNP is a polymorphic position where the point mutation has been fixed in the population
bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc
Department of Health Information Management
SNPsbull SNPs can occur anywhere on a genome they are
classified based on their locationsndash Many SNPs in genomic non-coding regions
ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc
bull Often play an important role in differentiation and disease
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format
Header | The FASTA header line starts with > and has fields separated by |. Each field is explained below.
gnl | Internal use
dbSNP | Database name
ss or rs number | dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos | Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
len / totalLen | Total number of bases of the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id | Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
taxid | NCBI taxonomy id
mol | Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass | Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles | Lists the alleles of the SNP, separated by /
Lower or upper case | Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green) | Green is used for assay sequence (observed by the submitter)
ATCG (black) | Black is used for flank sequence (extracted from sequence databases)
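
The field descriptions above translate directly into a small parser. This is a sketch under stated assumptions rather than an official dbSNP tool: it assumes the alleles are separated by `/` (possibly wrapped in quotes), treats any field without `=` as an unkeyed value such as the submitter handle, and uses hypothetical function names `parse_dbsnp_header` and `flank_lengths`. It also applies the rule that allelePos is the 5' flank length plus 1.

```python
# Fields whose values are integers according to the header description.
INT_FIELDS = {"allelePos", "totalLen", "len", "taxid", "snpclass", "build"}


def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header line into a dictionary of its fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    fields = [f.strip() for f in header[1:].split("|") if f.strip()]
    if len(fields) < 3:
        raise ValueError("expected at least gnl|dbSNP|<accession>")
    accession = fields[2]
    record = {
        "prefix": fields[0],
        "database": fields[1],
        "accession": accession,
        # rs = refSNP cluster, ss = submitted SNP
        "record_type": "refSNP cluster" if accession.startswith("rs") else "submitted SNP",
        "unkeyed": [],  # e.g. submitter handle and submitted_snp_id
    }
    for field in fields[3:]:
        if "=" not in field:
            record["unkeyed"].append(field)
            continue
        key, value = field.split("=", 1)
        value = value.strip("'\"")
        if key in INT_FIELDS:
            record[key] = int(value)
        elif key == "alleles":
            record[key] = value.split("/")
        else:
            record[key] = value
    return record


def flank_lengths(record):
    """Return (5' flank length, 3' flank length); the variation counts as 1 base."""
    total = record.get("totalLen", record.get("len"))
    return record["allelePos"] - 1, total - record["allelePos"]


rs = parse_dbsnp_header(
    ">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
    "snpclass=1|alleles=A/C|mol=Genomic|build=130"
)
print(rs["accession"], rs["record_type"], rs["alleles"], flank_lengths(rs))
```

For the rs799916 header, the flank lengths come out as 300 and 300, which matches a 601-base record with the variation at position 301.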

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder, which contains the following information
  – description and clinical features of a disorder or a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method named "Genome-Wide Association Study" or "GWAS". The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease

SNPs
• SNPs can occur anywhere on a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including the promoter region, coding region, intronic and exonic regions, UTRs, etc.
• Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and functions
  – Change gene regulation at different steps
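
Whether a coding SNP is synonymous or non-synonymous can be decided by translating the codon before and after the substitution. The sketch below is an illustration and not part of the lecture: it assumes the standard genetic code on the coding strand (the nuclear code, written as DNA with T instead of U), and the function name `classify_coding_snp` is hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, position, new_base):
    """Classify a single-base substitution inside one codon.

    position is 0-based within the codon (0, 1, or 2).
    Returns (mutant codon, reference amino acid, new amino acid, class).
    """
    codon = codon.upper()
    mutant = codon[:position] + new_base.upper() + codon[position + 1:]
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[mutant]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return mutant, ref_aa, alt_aa, kind


# GAA -> GAG keeps glutamic acid (E); GAG -> GTG replaces it with valine (V).
print(classify_coding_snp("GAA", 2, "G"))
print(classify_coding_snp("GAG", 1, "T"))
```

A synonymous result only says the encoded amino acid is unchanged; as the list above notes, a SNP can still affect transcription, translation, or regulation.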

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to one SNP swapping an A for a T, causing the inserted amino acid to be valine instead of glutamic acid in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
Figure: 1. Normal red blood cells; 2. Sickled red blood cells

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) exceed 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
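
The two SNP figures above (about 10 million SNPs, about 1 per 300 bp) can be checked against each other with simple arithmetic. The short sketch below does this; the comparison to a haploid human genome of roughly 3 billion bp is an outside assumption, not a number from the slide.

```python
# Figures quoted on the slide.
SNP_COUNT = 10_000_000   # roughly 10 million SNPs in the human population
BP_PER_SNP = 300         # on average 1 SNP per 300 bp

# Genome length implied by the two figures together.
implied_genome_bp = SNP_COUNT * BP_PER_SNP

# "Less than half" stored in dbSNP gives an upper bound on the stored count.
max_snps_in_dbsnp = SNP_COUNT // 2

print(f"Implied genome length: {implied_genome_bp:,} bp")
print(f"SNPs stored in dbSNP: fewer than {max_snps_in_dbsnp:,}")
```

The implied length of 3,000,000,000 bp is in line with the commonly cited size of the human genome, so the two figures are consistent with each other.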

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted record
    • The original observations of sequence variation; submitted SNP (ss) record accessions start with ss
  – Computationally annotated record
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP Cluster (refSNP) accessions start with rs

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code | Meaning
A | A
C | C
G | G
T | T
M | A or C
R | A or G
W | A or T
S | C or G
Y | C or T
K | G or T
V | A or C or G
H | A or C or T
D | A or G or T
B | C or G or T
N | G or A or T or C
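
The IUPAC table above is small enough to hold in a dictionary, which makes it easy to translate between an ambiguity code and the alleles it stands for. The following is a minimal sketch; the names `expand`, `encode`, and `variable_positions` are hypothetical helpers, not functions from any NCBI tool.

```python
# IUPAC nucleotide codes and the bases each one represents (from the table above).
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup: a set of bases -> the single code that represents it.
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def expand(code):
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def encode(alleles):
    """Return the IUPAC code for a collection of alleles, e.g. ['A', 'G'] -> 'R'."""
    return BASES_TO_CODE[frozenset(a.upper() for a in alleles)]


def variable_positions(sequence):
    """List (1-based position, code) for every non-ACGT symbol, ignoring spaces."""
    compact = sequence.replace(" ", "")
    return [
        (i, base) for i, base in enumerate(compact, start=1)
        if base.upper() not in "ACGT"
    ]


print(expand("M"))                       # the alleles behind code M
print(encode("A/G".split("/")))          # the code for alleles A/G
print(variable_positions("CCA R CTC"))   # where ambiguity codes occur
```

This is how a header entry such as alleles A/G corresponds to the single letter R printed at the variant position in a dbSNP FASTA record.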

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of ss records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval
Department of Health Information Management
The Effect of SNPsbull The phenotypic consequence of a SNP is
significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence
ndash Affect gene transcription quantitatively or qualitatively
ndash Affect gene translation quantitatively or qualitatively
ndash Change protein structure and functions
ndash Change gene regulation at different steps
Department of Health Information Management
SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are
often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc
bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes
asthmas obesity etc
Department of Health Information Management
Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid
to be valine instead of glutamine in hemoglobin
httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
1 Normal red blood cells 2 Sickled red blood cells
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
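For the "save their sequences in FASTA format" step, a minimal sketch of writing one record by hand may help. The function name, the identifier, and the toy sequence below are all hypothetical placeholders (not data from any GenBank record), and the 60-character line width is only a common convention assumed here.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Return one FASTA record: a '>' header line followed by wrapped sequence lines."""
    header = f">{identifier} {description}".rstrip()
    # Break the sequence into fixed-width lines (width is an assumed convention).
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


# Toy sequence for illustration only; it is not a real gene sequence.
toy_sequence = "ACGT" * 40
fasta_text = to_fasta("toy_record", "hypothetical example", toy_sequence)
print(fasta_text)
```

In practice the sequence would be downloaded from the GenBank record page in FASTA display format; the sketch only shows what the saved file should look like.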

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart disease, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Caused by a single-nucleotide substitution (an A swapped for a T) that places valine instead of glutamic acid in hemoglobin
Image source: httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm
Figure: 1. Normal red blood cells 2. Sickled red blood cells
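The single-base change can be traced at the codon level with a short sketch. The specific codons (GAG changing to GTG in the beta-globin gene) are the standard textbook description and are not stated on the slide, so treat them as an added assumption; the codon table is deliberately cut down to the two codons needed.

```python
# Only the two codons used in this example; a full table has 64 entries.
CODON_TO_AMINO_ACID = {"GAG": "Glu", "GTG": "Val"}


def substitute_base(codon, position, new_base):
    """Return a copy of codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal_codon = "GAG"  # assumed normal codon (glutamic acid)
sickle_codon = substitute_base(normal_codon, 1, "T")  # middle A swapped for T

print(normal_codon, CODON_TO_AMINO_ACID[normal_codon])
print(sickle_codon, CODON_TO_AMINO_ACID[sickle_codon])
```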

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at a given locus (or loci)
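To make the allele/genotype distinction concrete, the sketch below lists the genotypes a person could carry at one locus. It assumes a diploid individual and ignores which parent contributed which allele; the function name is made up for illustration.

```python
from itertools import combinations_with_replacement


def possible_genotypes(alleles):
    """List unordered allele pairs (genotypes) for a diploid individual at one locus."""
    return ["".join(pair) for pair in combinations_with_replacement(sorted(alleles), 2)]


# A SNP with two alleles, A and G, gives three possible genotypes.
genotypes = possible_genotypes(["A", "G"])
print(genotypes)  # ['AA', 'AG', 'GG']
```

Genotyping, in these terms, is the process of finding out which of the listed pairs a given person actually has.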

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (SeattleSNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
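The two SNP figures on this slide are consistent with each other, as a quick calculation shows. The genome length of about 3 billion base pairs is an assumption added here; it is not given on the slide.

```python
GENOME_LENGTH_BP = 3_000_000_000  # assumed approximate human genome length
BP_PER_SNP = 300                  # "1 per 300 bp" from the slide

expected_snps = GENOME_LENGTH_BP // BP_PER_SNP
print(f"{expected_snps:,}")  # 10,000,000
```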

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) IDs start with "rs"
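Because the record class is encoded in the accession prefix, it can be read off mechanically. The sketch below is one possible way to do that; the function name and the returned labels are illustrative choices, not dbSNP terminology.

```python
import re


def dbsnp_record_class(accession):
    """Classify a dbSNP accession by its prefix: 'ss' (submitted) or 'rs' (refSNP cluster)."""
    match = re.fullmatch(r"(ss|rs)(\d+)", accession.strip().lower())
    if match is None:
        raise ValueError(f"not an ss/rs accession: {accession!r}")
    return "submitted" if match.group(1) == "ss" else "reference cluster"


print(dbsnp_record_class("ss5586300"))
print(dbsnp_record_class("rs799916"))
```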

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
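The code table translates directly into a lookup. The sketch below stores the table as a dictionary and converts in both directions; the names `IUPAC_CODES`, `bases_for`, and `code_for` are made up for this example.

```python
# IUPAC nucleotide codes from the table: code -> bases it can stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def bases_for(code):
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def code_for(bases):
    """Return the IUPAC code that stands for exactly this set of bases."""
    wanted = {base.upper() for base in bases}
    for code, members in IUPAC_CODES.items():
        if set(members) == wanted:
            return code
    raise ValueError(f"no IUPAC code for {sorted(wanted)}")


print(sorted(bases_for("R")))  # ['A', 'G']
print(code_for("AG"))          # R
```

This is why a SNP with alleles A and G appears as a single R in the flanking sequence of a dbSNP FASTA record.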

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), range (:), AND, OR, and NOT operators
Example: BRC*[Gene Name]
Description: Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
Example: 1[CHR] AND (frameshift[Function_Class])
Description: Search SNPs located on chromosome 1 with function class frameshift
Example: 1[CHR] OR 2[CHR]
Description: Search all SNPs on chromosome 1 or 2
Example: 1[CHR] OR 2[CHR] NOT unknown[METHOD]
Description: Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
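The same query strings can also be sent programmatically. The sketch below only builds a request URL for the NCBI E-utilities `esearch` service and does not contact the network; the helper name and the `retmax` default are illustrative, and the field tags are copied from the examples above, so they may not match the tags that the current dbSNP index accepts.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="snp", retmax=20):
    """Build (but do not send) an E-utilities esearch URL for an Entrez query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Query text taken from the slide; the field tags are assumed to still be valid.
query = "1[CHR] AND (frameshift[Function_Class])"
url = esearch_url(query)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, would return the matching record IDs if the query is accepted.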

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous, coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.
Field: gnl
Description: Internal use
Field: dbSNP
Description: Database name
Field: ss or rs number
Description: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
Field: allelePos
Description: Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
Field: len / totalLen
Description: Total number of bases of the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
Field: handle|submitted_snp_id
Description: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
Field: taxid
Description: NCBI taxonomy id
Field: mol
Description: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
Field: snpclass
Description: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
Field: alleles
Description: Lists the alleles of the SNP, separated by /
Lower or upper case: Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
Green color: used for assay sequence (observed by the submitter)
Black color: used for flank sequence (extracted from sequence databases)
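The field layout described above can be unpacked with a few lines of code. This is a minimal sketch rather than an official parser: the function and key names are made up, the header is the ss5586300 example with the "/" between alleles written out (an assumption), and positional fields without "=" such as the submitter handle are simply skipped.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header into its named fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = header[1:].split("|")
    record = {"namespace": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" in field:  # key=value fields; positional fields are ignored here
            key, value = field.split("=", 1)
            record[key] = value
    return record


header = ">gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic"
record = parse_dbsnp_header(header)

# allelePos is the 5' flank length plus 1, and the variation counts as 1 base.
five_prime_len = int(record["allelePos"]) - 1
three_prime_len = int(record["len"]) - int(record["allelePos"])
print(record["accession"], record["alleles"], five_prime_len, three_prime_len)
```

For this record the 5' flank is 213 bases and the 3' flank is 261 bases, and 213 + 1 + 261 gives the stated length of 475.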

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – description and clinical features of a disorder, or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants
Department of Health Information Management
A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an
alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits
bull Genotype the genetic constitution of a cell an organism or an individual
bull Genotyping the process of identifying what genotype a person has for any given locus (loci)
Department of Health Information Management
Genetic Variations Databasesbull dbSNP
ndash httpwwwncbinlmnihgovSNP
bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim
bull International HapMap Projectndash httpwwwhapmaporg
bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS
Department of Health Information Management
dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a
public- domain archive for a broad collection of simple genetic variations
bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide
polymorphisms -SNPs)
bull Roughly 10 million in human population or on average 1 per 300 bps
bull Less than half of these SNPs are identified and stored in the database
ndash Microsatellite repeat variations (or short tandem repeats - STRs)
bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome
ndash Small-scale multi-base deletions or insertions
bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.
Field                       Explanation
gnl                         Internal use
dbSNP                       Database name
ss or rs number             dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                   Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
len / totalLen              Total number of bases of the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id     Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
taxid                       NCBI taxonomy id
mol                         Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                    Variation class of the SNP. The most common value is 1 (single nucleotide polymorphism)
alleles                     Lists the alleles of the SNP, separated by /
Lower or upper case         Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
Green ATCG                  Assay sequence (observed by the submitter)
Black ATCG                  Flank sequence (extracted from sequence databases)
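
The field rules above can be checked mechanically. The following is a minimal sketch, not an NCBI tool: the function names `parse_dbsnp_header` and `flank_lengths` are hypothetical, and the `/` separator inside the `alleles` field is assumed from the field description. It splits a header on `|` and derives the two flank lengths from `allelePos` and `len`/`totalLen`.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header line into its '|'-separated fields."""
    parts = header.lstrip(">").strip().rstrip("|").split("|")
    record = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value.strip("'\"")
        else:
            # Fields without '=' (e.g. submitter handle and submitter SNP id)
            record.setdefault("extra", []).append(part)
    return record


def flank_lengths(record: dict) -> tuple:
    """Return (5' flank length, 3' flank length) implied by the header."""
    total = int(record.get("totalLen", record.get("len")))
    allele_pos = int(record["allelePos"])
    five_prime = allele_pos - 1        # allelePos is the 5' length plus 1
    three_prime = total - allele_pos   # the variation itself counts as 1 base
    return five_prime, three_prime


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
rec = parse_dbsnp_header(header)
print(rec["accession"], flank_lengths(rec))  # rs799916 (300, 300)
```

For the rs799916 record shown earlier, the header implies 300 bases on each side of the variation, which matches 300 + 1 + 300 = 601.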
GeneView of a SNP
Links to Various Gene Records
Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI - OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - description and clinical features of a disorder or of a gene involved in genetic disorders
  - biochemical and other features
  - cytogenetics and mapping
  - molecular and population genetics
  - diagnosis and clinical management
  - animal models for the disorder
  - allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - information on population prevalence of genetic variants
  - gene-disease associations
  - gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
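
The first homework step can also be scripted rather than typed into the PubMed search box. The following is only a sketch under stated assumptions: it builds a search URL for the NCBI E-utilities `esearch` endpoint without sending any request, the helper name `pubmed_search_url` is hypothetical, and the exact query wording is one possible choice, not the required one.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(disease: str, journal: str, retmax: int = 20) -> str:
    """Build (but do not fetch) a PubMed search URL for GWAS papers."""
    term = (f'({disease}) AND ("genome-wide association study" OR GWAS) '
            f'AND "{journal}"[Journal]')
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


url = pubmed_search_url("type 2 diabetes", "Nature Genetics", retmax=5)
print(url)
```

Opening the printed URL in a browser returns the matching PubMed ids as XML; the same search terms can be pasted directly into the PubMed web interface.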
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease

Genetic Variations Databases
• dbSNP
  - http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  - http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  - http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  - http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  - Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  - Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) exceed 100,000 across the human genome
  - Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
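
The two SNP figures on this slide agree with each other if the human genome is taken to be about 3 billion base pairs long. That genome size is an assumption added here; it is not stated on the slide. A quick arithmetic check:

```python
GENOME_SIZE_BP = 3_000_000_000  # assumed approximate haploid human genome size
BP_PER_SNP = 300                # "1 per 300 bp" from the slide

expected_snps = GENOME_SIZE_BP // BP_PER_SNP
print(f"{expected_snps:,}")  # 10,000,000
```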

dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted records
    • The original observations of sequence variation; submitted SNP (ss) records have accessions starting with ss
  - Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) records have accessions starting with rs
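
Because the record class is encoded in the accession prefix, it can be read off programmatically. A minimal sketch, with a hypothetical helper name `classify_accession` and the assumption that an accession is the prefix followed only by digits:

```python
import re

# "ss" or "rs" followed by digits, e.g. ss5586300 or rs799916
_ACCESSION = re.compile(r"^(ss|rs)(\d+)$")


def classify_accession(accession: str) -> str:
    """Return the dbSNP record class implied by an accession prefix."""
    match = _ACCESSION.match(accession.strip().lower())
    if match is None:
        raise ValueError(f"not a dbSNP ss/rs accession: {accession!r}")
    if match.group(1) == "ss":
        return "submitted SNP"
    return "reference SNP cluster"


print(classify_accession("ss5586300"))  # submitted SNP
print(classify_accession("rs799916"))   # reference SNP cluster
```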

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning
IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
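
The code table translates directly into a lookup. The sketch below is illustrative only, with hypothetical helper names `expand` and `code_for`; it shows why a record with alleles A and G carries R at the variation position, and one with alleles A and C carries M.

```python
# IUPAC nucleotide codes and the bases each one stands for (from the table)
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(code: str) -> set:
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC[code.upper()])


def code_for(alleles) -> str:
    """Return the IUPAC code that represents exactly the given bases."""
    wanted = frozenset(a.upper() for a in alleles)
    for code, bases in IUPAC.items():
        if frozenset(bases) == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {sorted(wanted)}")


print(sorted(expand("R")))   # ['A', 'G']
print(code_for(["A", "C"]))  # M
```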

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of ss records, batch search, and SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wildcard (*), range (:), AND, OR, and NOT operators
Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
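
Queries of this form are plain strings, so they can be assembled from small pieces. The sketch below uses hypothetical helper names (`term`, `all_of`, `any_of`, `excluding`) and only the field tags that appear in the examples above; the `*` wildcard character is assumed. It builds query text only and does not contact NCBI.

```python
def term(value: str, field: str) -> str:
    """Format one search term as value[Field]."""
    return f"{value}[{field}]"


def all_of(*terms: str) -> str:
    """Join terms with AND."""
    return " AND ".join(terms)


def any_of(*terms: str) -> str:
    """Join terms with OR."""
    return " OR ".join(terms)


def excluding(query: str, excluded: str) -> str:
    """Append a NOT clause to an existing query."""
    return f"{query} NOT {excluded}"


chr1_or_2 = any_of(term("1", "CHR"), term("2", "CHR"))
queries = [
    term("BRC*", "Gene Name"),
    all_of(term("1", "CHR"), term("frameshift", "Function_Class")),
    chr1_or_2,
    excluding(chr1_or_2, term("unknown", "METHOD")),
]
for query in queries:
    print(query)
```

Each printed line can be pasted into the Entrez SNP search box.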
Department of Health Information Management
dbSNP Data Typesbull The dbSNP contains two classes of records
ndash Submitted record
bull The original observations of sequence variation submitted SNPs (SS) records started with ss
ndash Computationally annotated record
bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs
Department of Health Information Management
A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
Department of Health Information Management
International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
SNP Fasta Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
len / totalLen             Total number of bases of the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)
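The field rules above can be checked with a short parser. This is a minimal sketch, not dbSNP's own tooling: the function names `parse_dbsnp_header` and `flank_lengths` are hypothetical, and the alleles are assumed to be separated by "/" (the separator is lost in this transcript). It reads the header fields and derives the flank lengths from the rule that allelePos is the 5' flank length plus 1.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into named fields.

    The first three fields are positional (gnl, database name, accession);
    the rest are key=value pairs. Fields without '=' (such as the submitter
    handle and submitter SNP id on ss records) are collected under 'extra'.
    """
    parts = [p for p in header.lstrip(">").strip().split("|") if p]
    record = {"prefix": parts[0], "database": parts[1],
              "accession": parts[2], "extra": []}
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value.strip("'\"")
        else:
            record["extra"].append(part)
    # The lecture's ss record uses 'len'; its rs record uses 'totalLen'.
    length = record.get("totalLen", record.get("len"))
    record["total_length"] = int(length) if length is not None else None
    if "allelePos" in record:
        record["allelePos"] = int(record["allelePos"])
    return record


def flank_lengths(record: dict) -> tuple:
    """Return (5' flank length, 3' flank length) implied by the header.

    allelePos = 5' length + 1, and the variation counts as one base
    in the total length.
    """
    five_prime = record["allelePos"] - 1
    three_prime = record["total_length"] - record["allelePos"]
    return five_prime, three_prime


rs = parse_dbsnp_header(
    ">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
    "snpclass=1|alleles=A/C|mol=Genomic|build=130"
)
print(rs["accession"], rs["alleles"].split("/"), flank_lengths(rs))
```

For the rs799916 header this reports two alleles (A and C) and flanks of 300 bases on each side, which matches the 30 blocks of 10 bases before and after the "M" in the sequence shown on the "SNP Location" slide.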
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease-centric databases
- OMIM: http://www.ncbi.nlm.nih.gov/omim
- CDC HuGE Navigator: http://hugenavigator.net
- HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
- A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
NCBI - OMIM
Online Mendelian Inheritance in Man (OMIM)
- http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
- OMIM is a database of human genetic disorders, built and curated using results from published studies
- Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information:
  - description and clinical features of a disorder or of a gene involved in genetic disorders
  - biochemical and other features
  - cytogenetics and mapping
  - molecular and population genetics
  - diagnosis and clinical management
  - animal models for the disorder
  - allelic variants
- OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
OMIM Variant
- The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities
- Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; when it is retrieved, view its variants
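As an illustration of the 10-digit variant number, the sketch below assumes the usual OMIM layout: the 6-digit MIM number of the parent entry, a decimal point, and a 4-digit variant number. The helper name `split_variant_id` and the identifier 113705.0001 are used only as an example and are not taken from the lecture.

```python
import re

# Assumed layout: 6-digit MIM number, a dot, then a 4-digit variant number.
VARIANT_ID = re.compile(r"^(\d{6})\.(\d{4})$")


def split_variant_id(variant_id: str) -> tuple:
    """Split an allelic variant identifier into (MIM number, variant number)."""
    match = VARIANT_ID.match(variant_id.strip())
    if match is None:
        raise ValueError(f"not a 10-digit OMIM variant number: {variant_id!r}")
    return match.group(1), match.group(2)


mim_number, variant_number = split_variant_id("113705.0001")
print(mim_number, variant_number)
```

Splitting the identifier this way gives the MIM number needed to look up the parent gene or disease entry, from which the listed variants can then be viewed.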
Variants in OMIM Records
- For most genes, only selected mutations are included
  - Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
- Most of the variants represent disease-producing mutations, NOT polymorphisms
- A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
- Few neutral polymorphisms are included in OMIM
- Some SNPs in dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
- The CDC established the Office of Public Health Genomics (OPHG) in 1997
- OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
- OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs
HuGENet
- The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
- HuGENet™ resources
  - HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
- HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - information on population prevalence of genetic variants
  - gene-disease associations
  - gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding Gene's Associated Diseases
Disease Databases
- Genes are involved in disease
- Many diseases are well studied
- Descriptions of diseases and what is known about them are stored in
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Homework 1
- Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
- Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
- Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
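The first and third homework steps can also be prepared as request URLs. The sketch below is one possible approach, not part of the assignment: it assumes the NCBI E-utilities `esearch` and `efetch` endpoints, and the helper names, the example disease and journal, and the accession `NM_007294` are placeholders to be replaced with values from the chosen paper. The code only builds URLs and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption; the lecture uses the web pages).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pubmed_search_url(disease: str, journal: str, retmax: int = 10) -> str:
    """Build a PubMed search URL for GWAS papers on a disease in one journal."""
    term = (
        f'("genome-wide association study" OR GWAS) '
        f'AND {disease} AND "{journal}"[Journal]'
    )
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(params)


def genbank_fasta_url(accession: str) -> str:
    """Build a URL that returns one nucleotide record in FASTA format."""
    params = {"db": "nuccore", "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?" + urlencode(params)


# Placeholder inputs: substitute the disease, journal, and accession you chose.
print(pubmed_search_url("obesity", "PLoS Genetics"))
print(genbank_fasta_url("NM_007294"))
```

Opening the second URL in a browser, or saving its response to a file, gives the FASTA text that the assignment asks to extract and save.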
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
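The code table maps directly onto a lookup dictionary. The following minimal sketch (the names `IUPAC`, `expand`, and `code_for` are illustrative) shows how a sequence containing ambiguity codes, such as the "R" or "M" that marks the variant position in a dbSNP FASTA record, corresponds to concrete sequences, and how a pair of alleles maps back to its single code.

```python
from itertools import product

# IUPAC nucleotide codes from the table, with bases in alphabetical order.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(sequence: str) -> list:
    """Return every concrete sequence consistent with the IUPAC codes.

    The result grows multiplicatively with each ambiguity code, so this
    is only practical for short sequences with a few such codes.
    """
    choices = [IUPAC[base] for base in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


def code_for(alleles) -> str:
    """Return the single IUPAC code that represents a set of alleles."""
    wanted = "".join(sorted(set(alleles)))
    for code, bases in IUPAC.items():
        if bases == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {alleles!r}")


print(expand("CCARCT"))            # the two alleles of an A/G variant in context
print(code_for("A/C".split("/")))  # code used for an A/C variant
```

Here `expand("CCARCT")` yields `CCAACT` and `CCAGCT`, and `code_for` returns `M` for the alleles A and C.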
Different Ways to Search SNPs in dbSNP
- dbSNP web site
  - Direct search of ss records, batch search, and SNP record submission. No search limits
- Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval
Department of Health Information Management
Different Ways to Search SNPs in dbSNP
bull dbSNP web site
ndash Direct search of SS record batch search allow SNP record submission No search limit
bull Entrez SNP
ndash httpwwwncbinlmnihgovsitesentrezdb=Snp
ndash Search limits options allows precise retrieval
Department of Health Information Management
Search SNPs from dbSNP Web Page
bull httpwwwncbinlmnihgovSNPindexhtml
Department of Health Information Management
dbSNP Search Examples
Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names
starting with the letter BRC (ie BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])
Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]
Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
Department of Health Information Management
Legend in Results
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country’s leading chronic, infectious, environmental, and occupational diseases
• OPHG’s efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene’s Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method “Genome-Wide Association Study” (GWAS). The journal could be “PLoS Genetics”, “Nature Genetics”, etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
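The retrieval steps in Homework 1 are meant to be done through the NCBI web pages, but they could also be scripted. The sketch below only builds request URLs for the NCBI E-utilities (ESearch and EFetch) and does not contact the network. The search term, the rs number (taken from the SNP Location slide), and the accession NM_007294 (a BRCA1 RefSeq mRNA) are illustrative assumptions; substitute the values from the chosen paper.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that lists record IDs matching `term` in `db`."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_fasta_url(db: str, record_id: str) -> str:
    """Build an EFetch URL that returns one record as plain-text FASTA."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Step 1: find GWAS papers on a disease in PubMed (example term only).
pubmed_term = (
    '"genome-wide association study"[Title/Abstract] '
    "AND obesity[Title/Abstract]"
)
paper_search = esearch_url("pubmed", pubmed_term)

# Step 2: look up a SNP reported in the paper in dbSNP (example rs number).
snp_search = esearch_url("snp", "rs799916")

# Step 3: download the nucleotide sequence of a relevant gene as FASTA
# (example accession; replace it with the gene from the paper).
gene_fasta = efetch_fasta_url("nuccore", "NM_007294")

for url in (paper_search, snp_search, gene_fasta):
    print(url)
```

Opening the printed URLs in a browser returns the search results as XML and the sequence as FASTA text; displaying the protein structure in Cn3D remains a manual step.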

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators
Example | Description
BRC*[Gene Name] | Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class]) | Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR] | Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD] | Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
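The query patterns in the table above can be assembled programmatically. The following is a small sketch that only builds the query strings; the helper names `tagged` and `combine` are hypothetical, and the explicit parentheses are added here to make the intended grouping unambiguous rather than copied from the slide.

```python
def tagged(value: str, field: str) -> str:
    """Return a field-qualified Entrez term such as '1[CHR]'."""
    return f"{value}[{field}]"


def combine(operator: str, *terms: str) -> str:
    """Join two or more terms with AND, OR, or NOT and wrap them in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    if len(terms) < 2:
        raise ValueError("need at least two terms to combine")
    return "(" + f" {operator} ".join(terms) + ")"


# Wild-card search: all genes whose names start with BRC.
brc_genes = tagged("BRC*", "Gene Name")

# Chromosome 1 SNPs with function class frameshift.
chr1_frameshift = combine(
    "AND", tagged("1", "CHR"), tagged("frameshift", "Function_Class")
)

# SNPs on chromosome 1 or 2.
chr1_or_chr2 = combine("OR", tagged("1", "CHR"), tagged("2", "CHR"))

# The same SNPs, excluding those whose detection method is unknown.
known_method = combine("NOT", chr1_or_chr2, tagged("unknown", "METHOD"))

for query in (brc_genes, chr1_frameshift, chr1_or_chr2, known_method):
    print(query)
```

Each printed string can be pasted into the dbSNP search box.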

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.
gnl: Internal use
dbSNP: Database name
ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos: Variation allele position (1-based) on the FASTA sequence. It is always the 5' length plus 1
len, totalLen: Total number of bases of the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
taxid: NCBI taxonomy id
mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism). Click on snpclass for details
alleles: Lists the alleles of the SNP, separated by /
Lower or upper case: Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG in green: Green color is used for assay sequence (observed by the submitter)
ATCG in black: Black color is used for flank sequence (extracted from sequence databases)
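To make the field descriptions above concrete, here is a minimal sketch of a parser for such a header, applied to the rs799916 header from the SNP Location slide. The function names are hypothetical, and the handling of fields without an = sign (the submitter handle and id on ss records) is an assumption about how one might store them, not something dbSNP prescribes.

```python
from typing import Dict, Tuple


def parse_snp_fasta_header(header: str) -> Dict[str, str]:
    """Split a dbSNP FASTA header into a dictionary of named fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    parts = [part for part in header[1:].strip().split("|") if part]
    if len(parts) < 3:
        raise ValueError("expected at least gnl|dbSNP|accession")
    fields = {"tag": parts[0], "database": parts[1], "accession": parts[2]}
    extras = []
    for part in parts[3:]:
        key, sep, value = part.partition("=")
        if sep:
            # key=value field; drop any surrounding quotes around the value.
            fields[key] = value.strip("'\"")
        else:
            # Positional field, e.g. submitter handle or submitted SNP id.
            extras.append(part)
    if extras:
        fields["extra"] = "|".join(extras)
    return fields


def flank_lengths(fields: Dict[str, str]) -> Tuple[int, int]:
    """Return (5' flank length, 3' flank length) from allelePos and totalLen."""
    allele_pos = int(fields["allelePos"])
    total_len = int(fields["totalLen"])
    # allelePos is the 5' length plus 1; the variation itself counts as 1 base.
    return allele_pos - 1, total_len - allele_pos


header = (
    ">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
    "snpclass=1|alleles=A/C|mol=Genomic|build=130"
)
fields = parse_snp_fasta_header(header)
print(fields["accession"], fields["alleles"], flank_lengths(fields))
```

For this record the arithmetic is 300 bases of 5' flank, one IUPAC code (M, meaning A or C), and 300 bases of 3' flank, which matches totalLen=601.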

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384
Department of Health Information Management
Search dbSNP Example bull Some mutations on human BRCA1 gene have been
reported to be involved in the early onset of breast cancer
bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp
Department of Health Information Management
Entrez SNP Search Results
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information:
  – description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
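Entrez searches and cross-links can also be requested programmatically through NCBI's E-utilities. The sketch below only builds the request URLs and does not contact the server. The helper names are hypothetical, and the search term and the MIM number 219700 (cystic fibrosis) are illustrative values, not taken from the slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities services.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL; fetching it returns IDs of records matching `term`."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def elink_url(dbfrom, db, uid):
    """Build an ELink URL asking for records in `db` linked to `uid` in `dbfrom`."""
    query = urlencode({"dbfrom": dbfrom, "db": db, "id": uid})
    return f"{EUTILS_BASE}/elink.fcgi?{query}"


# Search OMIM by disease name, then ask for Gene records linked to one entry.
print(esearch_url("omim", "cystic fibrosis"))
print(elink_url("omim", "gene", "219700"))
```

Opening either URL in a browser, or with any HTTP client, returns an XML document listing the matching or linked record IDs.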

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways:
  – Search for a gene or a disease; when it is retrieved, view its variants
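As a small illustration of the 10-digit variant number, the sketch below assumes the usual OMIM layout: a six-digit MIM number for the entry, a decimal point, and a four-digit variant number. The function name and the example identifier are illustrative only.

```python
import re

# Assumed layout: six-digit MIM number, ".", four-digit allelic variant number.
VARIANT_RE = re.compile(r"^(\d{6})\.(\d{4})$")


def split_variant_number(variant_id):
    """Split an OMIM allelic variant ID into (mim_number, variant_number)."""
    match = VARIANT_RE.match(variant_id.strip())
    if match is None:
        raise ValueError(f"not a 10-digit OMIM variant number: {variant_id!r}")
    return match.group(1), match.group(2)


mim, variant = split_variant_number("113705.0001")
print(mim, variant)
```

Keeping both parts as strings preserves the leading zeros of the variant number.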

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on:
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in:
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
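For the last step, one possible way to retrieve and save a sequence in FASTA format without the web interface is NCBI's E-utilities EFetch service. The sketch below builds the request URL and includes small helpers for reading and writing FASTA text. The function names are hypothetical and the accession NM_000492 is only a placeholder; substitute the record found for the gene of interest.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_fasta_url(accession, db="nuccore"):
    """Build an EFetch URL that returns the record as plain-text FASTA."""
    query = urlencode({"db": db, "id": accession,
                       "rettype": "fasta", "retmode": "text"})
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_fasta(records, width=70):
    """Format (header, sequence) tuples as FASTA text wrapped at `width` bases."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Placeholder accession; the URL can be opened in a browser or any HTTP client.
print(efetch_fasta_url("NM_000492"))
```

The text returned by the URL can be passed to `read_fasta` and written to a file with `write_fasta`. Setting `db="protein"` requests a protein record instead.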
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases
Department of Health Information Management
Disease Databasesbull Genes are involved in disease
bull Many diseases are well studied
bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim
ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml
Department of Health Information Management
Homework 1bull Using PubMed search for a recent paper related to genetic
disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc
bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation
bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases
Department of Health Information Management
Disease Databasesbull Genes are involved in disease
bull Many diseases are well studied
bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim
ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml
Department of Health Information Management
Homework 1bull Using PubMed search for a recent paper related to genetic
disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc
bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation
bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|
snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases
Department of Health Information Management
Disease Databasesbull Genes are involved in disease
bull Many diseases are well studied
bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim
ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml
Department of Health Information Management
Homework 1bull Using PubMed search for a recent paper related to genetic
disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc
bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation
bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
-
Department of Health Information Management
SNP Fasta Header FormatHeader
Fasta header line starts with gt and has fields separated by | Each field is explained below
Gnl Internal usedbSNP Database name
ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp
allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1
lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id
Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id
Taxid NCBI taxonomy id
MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria
snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details
Alleles Lists alleles of the snp separated by
Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG Green color is used for assay sequence (observed by the submitter)
ATCGBlack color is used for flank sequence (extracted from sequence databases )
Department of Health Information Management
GeneView of a SNP
Department of Health Information Management
Links to Various Gene Records
Gene and Disease
Department of Health Information Management
Disease Causing GenesDisease centric databases
bull OMIM httpwwwncbinlmnihgovomim
bull CDC HugeNavigator httphugenavigatornet
bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp
bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384
Department of Health Information Management
NCBImdashOMIM
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
GeneView of a SNP
Links to Various Gene Records
Gene and Disease
Disease Causing Genes
Disease centric databases
• OMIM http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator http://hugenavigator.net
• HGMD https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies http://www.genome.gov/26525384
NCBI - OMIM
Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder, which contains the following information
– description and clinical features of a disorder or a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
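As a rough illustration of searching OMIM through Entrez and following its cross-links, the sketch below builds request URLs for the NCBI E-utilities ESearch and ELink endpoints. It only constructs the URLs and sends nothing over the network. The function names, the default target database "gene", and the sample search term are assumptions made for this example and are not part of the lecture.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def omim_search_url(term: str, max_records: int = 20) -> str:
    """Build an ESearch URL that looks up OMIM records matching a search term."""
    query = urlencode({"db": "omim", "term": term, "retmax": max_records})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def omim_links_url(mim_number: str, target_db: str = "gene") -> str:
    """Build an ELink URL that follows cross-links from one OMIM record
    to another Entrez database (hypothetical default: "gene")."""
    query = urlencode({"dbfrom": "omim", "db": target_db, "id": mim_number})
    return f"{EUTILS_BASE}/elink.fcgi?{query}"
```

For example, omim_search_url("cystic fibrosis") gives a URL that can be opened in a browser or fetched with any HTTP client; the response format and the usage limits are described in the E-utilities documentation.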
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutation/variation, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
– Search for a gene or a disease; when retrieved, view its variants
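The 10-digit variant number can be handled programmatically. The sketch below assumes the usual layout of an OMIM allelic variant identifier: the six-digit MIM number of the parent entry, a decimal point, and a four-digit variant suffix. The function and class names are hypothetical, and the sample identifier is used only for its shape.

```python
import re
from typing import NamedTuple


class OmimVariant(NamedTuple):
    mim_number: str       # six-digit number of the parent OMIM entry
    variant_suffix: str   # four-digit allelic variant number


# Six digits, a dot, then four digits (ten digits in total).
_VARIANT_PATTERN = re.compile(r"^(\d{6})\.(\d{4})$")


def parse_omim_variant(identifier: str) -> OmimVariant:
    """Split a variant identifier such as '141900.0243' into its two parts."""
    match = _VARIANT_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a 10-digit OMIM variant number: {identifier!r}")
    return OmimVariant(match.group(1), match.group(2))
```

Keeping both parts as strings preserves the leading zeros of the suffix, which would be lost if it were converted to an integer.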
Variants in OMIM Records
• For most genes, only selected mutations are included
– Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
– conducting population-based genomic research
– assessing the role of family health history in disease risk and prevention
– supporting a systematic process for evaluating genetic tests
– translating genomics into public health research and programs
– strengthening capacity for public health genomics in disease prevention programs
HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
– Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
– HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
– information on population prevalence of genetic variants
– gene-disease associations
– gene-gene and gene-environment interactions
HuGE Navigator
Finding Disease Causing Genes
Finding Gene's Associated Diseases
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
– OMIM http://www.ncbi.nlm.nih.gov/Omim
– Tumor Gene Family Databases http://www.tumor-gene.org/tgdf.html
Homework 1
• Using PubMed, search for a recent paper about a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
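Parts of this homework can also be scripted rather than done through the web pages. The sketch below is one possible approach, not a required one. It builds an E-utilities ESearch URL for a PubMed GWAS query, builds an EFetch URL that requests a sequence in FASTA format, and parses FASTA text into records. The function names, the query wording, and the sample accession are assumptions for illustration. No network request is made here.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pubmed_gwas_search_url(disease: str, journal: Optional[str] = None,
                           max_records: int = 20) -> str:
    """Build an ESearch URL for GWAS papers on a disease (hypothetical query wording)."""
    term = ('"genome-wide association study"[Title/Abstract] '
            f'AND {disease}[Title/Abstract]')
    if journal:
        term += f' AND "{journal}"[Journal]'
    query = urlencode({"db": "pubmed", "term": term, "retmax": max_records})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def fasta_fetch_url(accession: str, db: str = "nucleotide") -> str:
    """Build an EFetch URL that returns one sequence record as FASTA text."""
    query = urlencode({"db": db, "id": accession,
                       "rettype": "fasta", "retmode": "text"})
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

Text downloaded from a URL such as fasta_fetch_url("NM_000518") can be saved to a .fasta file as-is; parse_fasta is useful only when the sequences need further processing. Use db="protein" for the corresponding protein sequences.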
Department of Health Information Management
Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM
bull OMIM is a human genetic disorders database built and curated using results from published studies
bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved
in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants
bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources
Department of Health Information Management
OMIM Variantbull The OMIM database includes genetic disorders
caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities
bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its
variants
Department of Health Information Management
Variants in OMIM Recordsbull For most genes only selected mutations are included
ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance
bull Most of the variants represent disease-producing mutations NOT polymorphisms
bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders
bull Few neutral polymorphisms are included in OMIM
bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Department of Health Information Management
Office of Public Health Genomics CDCbull The CDC established the Office of Public Health
Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health
research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases
bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and
preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and
programsbull strengthening capacity for public health genomics in disease
prevention programs
Department of Health Information Management
HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)
ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease
bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops
Reviews Case studies Book
bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
ndash information on population prevalence of genetic variants
ndash gene-disease associations
ndash gene-gene and gene- environment interactions
Department of Health Information Management
HuGE Navigator
Department of Health Information Management
Finding Disease Causing Genes
Department of Health Information Management
Finding Genersquos Associated Diseases
Department of Health Information Management
Disease Databasesbull Genes are involved in disease
bull Many diseases are well studied
bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim
ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml
Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method “Genome-Wide Association Study” or “GWAS”. The journal could be “PLoS Genetics”, “Nature Genetics”, etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
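The homework steps can also be scripted. The following is a minimal sketch, not part of the original slides, assuming NCBI's E-utilities web service (ESearch and EFetch) is used instead of the web pages. The helper names (`esearch_url`, `efetch_fasta_url`, `parse_fasta`, `format_fasta`) are hypothetical, and the code only builds request URLs and handles FASTA text; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of NCBI E-utilities (an assumption of this sketch; the slides
# only describe the web interfaces of PubMed, dbSNP, and GenBank).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that lists record IDs matching a query.

    db is an Entrez database name such as "pubmed", "snp", or "nuccore".
    """
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def efetch_fasta_url(db, record_id):
    """Build an EFetch URL that returns one sequence record as FASTA text."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EUTILS}/efetch.fcgi?{query}"


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=70):
    """Format one record as FASTA, wrapping the sequence at `width` columns."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"
```

A possible workflow: open `esearch_url("pubmed", "genome-wide association study AND obesity")` to list candidate papers, look up the chosen SNP with `esearch_url("snp", ...)`, then download the text at `efetch_fasta_url("nuccore", accession)` and save it with a `.fasta` extension. The accession and query strings are placeholders to be replaced with values from the chosen paper.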
- Genomics and Personalized Care in Health Systems Lecture 2 Databases
- Nucleotide and Protein Sequence Databases
- NCBI Homepage
- EST
- Protein Structure
- FlyBase
- Genetic Variations
- Gene and Disease
OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when retrieved, view its variants
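To make the 10-digit variant number concrete, here is a small sketch. It assumes the usual OMIM convention, which the slide does not spell out, that the number is the 6-digit MIM number of the entry, a decimal point, and a 4-digit variant number. The function name `split_omim_variant` and the identifier used below are made up for illustration.

```python
import re

# Assumed layout: 6-digit MIM entry number, ".", 4-digit variant number.
_VARIANT_RE = re.compile(r"^(\d{6})\.(\d{4})$")


def split_omim_variant(variant_id):
    """Split an OMIM variant identifier into (entry number, variant number)."""
    match = _VARIANT_RE.match(variant_id.strip())
    if match is None:
        raise ValueError(f"not a 10-digit OMIM variant number: {variant_id!r}")
    return match.group(1), match.group(2)


# Made-up identifier, used only to show the shape of the result.
entry, variant = split_omim_variant("123456.0001")
print(entry, variant)
```

Splitting the identifier this way mirrors the search path on the slide: the first part locates the gene or disease entry, and the second part selects one of its listed variants.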
Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records
Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country’s leading chronic, infectious, environmental, and occupational diseases
• OPHG’s efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs