Alastair Kerr, Ph.D. WTCCB Bioinformatics Core An introduction to DNA and Protein Sequence Databases

Upload: obelia

Post on 27-Jan-2016


Page 1: Alastair Kerr, Ph.D. WTCCB Bioinformatics Core

Alastair Kerr, Ph.D. WTCCB Bioinformatics Core

An introduction to DNA and Protein Sequence Databases

Page 2

Questions to address

What are the main sequence databases?

Which one to use for: looking up a gene name/identifier from a paper

Identifiers: what should I use and why?

Coordinate-based systems

Annotation: protein domains, Gene Ontology

Page 3

Database Varieties

Sequence Warehouses: “everything under one roof”

Genome Databases: containing single genome dataset(s)

Reference Sets: often human curated, the 'standard' for a particular gene or protein from which variants are defined

Specialist:

Short reads from next generation sequencing (Short Read Archive)

[EST] Expressed sequence tags and [GSS] Genome survey sequences

Page 4

NCBI GenBank, EMBL, DDBJ

Sharing primary data

Page 5

NCBI

Warehouse: GenBank <live demo>

NR dataset: NR = non-redundant (but it is not..)

Reference dataset: RefSeq

Genome datasets: NCBI Genomes

Page 6

EMBL

Warehouse: EMBL

Historically, the protein set was called translated EMBL (TrEMBL)

The gold standard reference set was called SwissProt

Reference set = UniProt

UniProtKB/Swiss-Prot: manually annotated and reviewed

UniProtKB/TrEMBL: automatically annotated and not reviewed

Genome database: Ensembl <live demo>

Page 7

Live Demo

Search GenBank for human adh4

How many are there?

How many should there be?

Why are some different to those found in UniProt?

Are there better databases to use?

Which identifier should you use in your lab book?

Page 8

We should now be able to answer these:

What are the main sequence databases?

Which one to use for:

Looking up a gene identifier from a paper

Searching for a gene name

Searching for orthologous genes from another species

Page 9

Identifiers

Or what to write in your lab book

Page 10

How to identify a feature

Gene/protein name: common name, standardised name

Database identifier: unique for each database; some have revision numbers

Position in genome: dependent on genome build

Position in a gene/protein: protein domains

Page 11

Never use common names

Example of EPHB2

EPH receptor B2
EPHT
MGC87492
DRT
ERK
EPTH3
Hek5
Renal carcinoma antigen NY-REN-47
Tyro5
hEK5
CAPB
HEK5
PCBC
EK5
EPHB2
TYRO5
protein-tyrosine kinase HEK5
EPH-like kinase 5
EphB
ephrin type-B receptor 2
elk-related tyrosine kinase
Tyrosine-protein kinase TYRO5
eph tyrosine kinase 3
Tyrosine-protein kinase receptor EPH-3
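As a rough illustration of why common names are risky, the sketch below maps a handful of the aliases listed on this slide back to the single standardised symbol EPHB2. The lookup table and the function names are hypothetical, written only for this example; a real lookup would come from the relevant nomenclature committee's data.

```python
# A few of the aliases shown on the slide. The table is illustrative only.
EPHB2_ALIASES = ["Hek5", "Tyro5", "EK5", "CAPB", "PCBC", "DRT", "ERK"]


def build_alias_index(symbol, aliases):
    """Map the symbol and each alias (case-insensitively) to the symbol."""
    index = {symbol.upper(): symbol}
    for alias in aliases:
        index[alias.upper()] = symbol
    return index


ALIAS_INDEX = build_alias_index("EPHB2", EPHB2_ALIASES)


def to_standard_symbol(name, index=ALIAS_INDEX):
    """Return the standardised symbol for a name, or None if it is unknown."""
    return index.get(name.strip().upper())


if __name__ == "__main__":
    for name in ["hek5", "TYRO5", "EphB2", "ADH4"]:
        print(name, "->", to_standard_symbol(name))
```

Writing the standardised symbol in the lab book removes the need for this kind of reverse lookup later.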

Page 12

Consortia identifiers

Most key species have a consortium / group / community that provides the key identifiers in the field

Humans: was HUGO (HUman Genome Organisation), now the HGNC (HUGO Gene Nomenclature Committee)

Page 13

Database Identifiers

Every database has its own system for identifying a gene/protein

Example: Human ADH4

Ensembl: ENSG00000198099 ENST00000423445 ENSP00000397939

SwissProt: ADH4_HUMAN P08319

RefSeq: NM_000670.3 NP_000661.2

GenBank: gi|71565152|ref|NP_000661.2|
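Because each database has its own identifier style, it is often possible to guess the source of an identifier from its shape. The following is a minimal sketch using the human ADH4 examples from this slide. The regular expressions are simplified assumptions rather than official definitions, and the function names are made up for the example.

```python
import re

# Simplified patterns, checked in order. These are assumptions for
# illustration, not authoritative definitions of each identifier scheme.
PATTERNS = [
    ("Ensembl", re.compile(r"^ENS[A-Z]*?[GTP]\d{11}(\.\d+)?$")),
    ("RefSeq", re.compile(r"^(NM|NP|NR|XM|XP|NC|NG)_\d+(\.\d+)?$")),
    ("UniProt accession", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("UniProt entry name", re.compile(r"^[A-Z0-9]{1,5}_[A-Z0-9]{1,5}$")),
]


def classify(identifier):
    """Guess which database an identifier comes from, based on its shape."""
    if identifier.startswith("gi|"):
        return "GenBank GI defline"
    for name, pattern in PATTERNS:
        if pattern.match(identifier):
            return name
    return "unknown"


def parse_gi_defline(defline):
    """Split 'gi|<number>|<source>|<accession>|' into its parts."""
    fields = defline.strip("|").split("|")
    return {"gi": fields[1], "source": fields[2], "accession": fields[3]}


if __name__ == "__main__":
    examples = ["ENSG00000198099", "ADH4_HUMAN", "P08319", "NM_000670.3",
                "gi|71565152|ref|NP_000661.2|"]
    for identifier in examples:
        print(identifier, "->", classify(identifier))
```

A shape-based guess like this is only a convenience; the database itself remains the authority on whether an identifier exists.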

Page 14

Keeping Track of Changes

Gene models can change

Will the ID you used yesterday still get the same sequence today?

Or: how do you get the latest version of a sequence?

Page 15

Keeping Track of Changes

GenBank: GI or “GenBank identifier”

The GI number changes each time, and is often removed when it gets superseded

SwissProt: Accession and ID

The accession (P08319) is the stable identifier; the ID, or entry name (ADH4_HUMAN), can change

RefSeq and Ensembl: revision-based IDs

NM_000670.3

ENSG00000198099.1

Format: XXX.number

XXX always retrieves the latest version

XXX.number retrieves that specific version
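The XXX.number convention can be handled with a few lines of code. This is a minimal sketch, assuming identifiers shaped like the RefSeq and Ensembl examples on this slide. The function names are hypothetical.

```python
def split_version(identifier):
    """Split 'XXX.number' into (XXX, number); number is None if absent."""
    accession, dot, version = identifier.rpartition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return identifier, None


def same_record(id_a, id_b):
    """True if both identifiers refer to the same record, ignoring version."""
    return split_version(id_a)[0] == split_version(id_b)[0]


if __name__ == "__main__":
    print(split_version("NM_000670.3"))        # versioned: a fixed revision
    print(split_version("ENSG00000198099"))    # unversioned: 'latest'
    print(same_record("NM_000670.3", "NM_000670.2"))
```

For a lab book, the versioned form is the safer one to record, since the unversioned form may return a different sequence later.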

Page 16

Demo: Retrieving old data

Page 17

Defining: Chromosome coordinates

Demo: Ensembl

Page 18

Chromosome Positions

Features identified by chromosome & position

File formats: BED, WIG, GFF ..

All major genome databases store features as coordinates

Ubiquitous in deep sequencing studies

Note: coordinates change depending on the assembly

Always note the build number of the genome assembly if you are using coordinates
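One way to follow the advice above is to store the assembly name alongside every coordinate. The sketch below parses a tab-separated BED line (BED starts are 0-based and ends are exclusive) into a record with 1-based inclusive coordinates that carries its assembly. The `Feature` class, the function names and the example coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    assembly: str   # genome build the coordinates refer to
    chrom: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    name: str = ""


def parse_bed_line(line, assembly):
    """Convert one BED line (0-based start, exclusive end) to a Feature."""
    fields = line.rstrip("\n").split("\t")
    chrom, bed_start, bed_end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else ""
    return Feature(assembly, chrom, bed_start + 1, bed_end, name)


def describe(feature):
    """Render a feature in a form that is safe to copy into a lab book."""
    return f"{feature.assembly}:{feature.chrom}:{feature.start}-{feature.end}"


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    feature = parse_bed_line("chr4\t999\t2000\texample", "GRCh37")
    print(describe(feature))
```

Keeping the build inside the record means a coordinate can never be separated from the assembly it was measured against.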

Page 19

Coordinates

New concept of PATCH

This is an assembly update without changing the primary sequence

However, additional 'improved' contigs map to the reference

These will be in the next assembly: you may wish to use them

Genome assembly names can differ by institution but are the same underlying sequence: GenBank/UCSC

DEMO liftOver
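Since the same assembly can carry different names, a small lookup helps when comparing coordinates from different sources. This is a sketch only: the table lists a few well-known UCSC and Genome Reference Consortium name pairs, and the function names are hypothetical. Patch suffixes such as ".p13" are stripped because a patch does not change the primary sequence.

```python
import re

# A few well-known UCSC names and their GRC/NCBI equivalents.
UCSC_TO_GRC = {
    "hg18": "NCBI36",
    "hg19": "GRCh37",
    "hg38": "GRCh38",
    "mm10": "GRCm38",
}


def canonical_assembly(name):
    """Return the GRC-style name, without any patch suffix."""
    base = re.sub(r"\.p\d+$", "", name.strip())
    return UCSC_TO_GRC.get(base.lower(), base)


def same_assembly(name_a, name_b):
    """True if two assembly names share the same primary coordinates."""
    return canonical_assembly(name_a) == canonical_assembly(name_b)


if __name__ == "__main__":
    print(same_assembly("hg19", "GRCh37.p13"))
    print(same_assembly("hg19", "hg38"))
```

When the two names do not match, the coordinates need converting between builds (for example with liftOver) before they can be compared.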

Page 20

Protein Domains: Protein Positions

Page 21

Protein Domains

InterPro: site that stores information on known protein domains from different projects

Covered by InterPro:

Similarities between proteins

Conserved region in an alignment

Conserved protein folds

Not covered by InterPro:

Predicted features on primary protein sequence

Trans-membrane regions

Low complexity regions

Phosphorylation sites

Page 22

Domain Complexity

Many different types of domains

Vast amounts of domain-based data

Many different projects identifying them

Page 23

Old way of interacting with a database

Request information

Retrieve information from a single source

Page 24

Distributed Annotation

Page 25

DAS clients

Different types of software can have a DAS client built in

Genome browsers: Ensembl, IGB, IGV..

Multiple alignment editors: Jalview, STRAP

3D structures: Spice

3D electron microscopy data: PeppeR

Demo

Page 26

Annotation

Page 27

Annotation

Problem: many ways to name a gene

Reductase = oxidase = dehydrogenase

Gene Ontology Consortium [GO]

GO terms standardise naming

Note that errors may still occur in the assignment of terms

Found in RefSeq, UniProt and most genome databases

GO browsers, e.g. AmiGO

Page 28

Gene Ontology

all [535063 gene products]

GO:0008150 : biological_process [404412 gene products]

GO:0005575 : cellular_component [372379 gene products]

GO:0003674 : molecular_function [436597 gene products]

Page 29

Gene Ontology: acyclical Tree
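The ontology is not a strict tree: a term can have more than one parent, so it is better thought of as a directed acyclic graph. The sketch below collects every ancestor of a term by walking parent links. The term names and the `PARENTS` table are made up for illustration and are not real GO terms.

```python
# Toy ontology: each term maps to its direct parents (hypothetical names).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],   # two parents: this is what makes it a graph, not a tree
    "D": ["C"],
}


def ancestors(term, parents=PARENTS):
    """Return the set of all terms reachable by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("D")))
```

A gene product annotated to a term is implicitly annotated to all of that term's ancestors, which is why counts near the root are so large.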

Page 30

Evidence Codes

Experimental
EXP: Inferred from Experiment
IDA: Inferred from Direct Assay
IPI: Inferred from Physical Interaction
IMP: Inferred from Mutant Phenotype
IGI: Inferred from Genetic Interaction
IEP: Inferred from Expression Pattern

Computational
ISS: Inferred from Sequence or Structural Similarity
ISO: Inferred from Sequence Orthology
ISA: Inferred from Sequence Alignment
ISM: Inferred from Sequence Model
IGC: Inferred from Genomic Context
RCA: Inferred from Reviewed Computational Analysis

Author Statement
TAS: Traceable Author Statement
NAS: Non-traceable Author Statement

Curator Statement
IC: Inferred by Curator
ND: No biological Data available

Automatically-assigned
IEA: Inferred from Electronic Annotation
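Evidence codes make it possible to filter annotations by how they were derived. The following is a minimal sketch that groups the codes as on this slide and keeps only annotations from chosen groups. The annotation tuples in the usage example are invented for illustration, and the function names are hypothetical.

```python
# Evidence codes grouped as on the slide.
EVIDENCE_GROUPS = {
    "Experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "Computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "Author Statement": {"TAS", "NAS"},
    "Curator Statement": {"IC", "ND"},
    "Automatically-assigned": {"IEA"},
}


def evidence_group(code):
    """Return the group a code belongs to, or None if it is not listed."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code.upper() in codes:
            return group
    return None


def filter_annotations(annotations, allowed_groups=("Experimental",)):
    """Keep (gene, go_id, code) tuples whose code is in an allowed group."""
    return [a for a in annotations if evidence_group(a[2]) in allowed_groups]


if __name__ == "__main__":
    # Made-up annotations, for illustration only.
    annotations = [
        ("GENE1", "GO:0003674", "IDA"),
        ("GENE1", "GO:0008150", "IEA"),
        ("GENE2", "GO:0005575", "ISS"),
    ]
    print(filter_annotations(annotations))
```

Dropping IEA annotations entirely is a strict choice; widening `allowed_groups` trades confidence for coverage.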

Page 31

Best annotation?

Use DAS clients to get more information on genomic, gene or protein features

Protein domains are especially useful

The Gene Ontology is useful for general classification

BUT be aware of where the annotation was derived from