Page 1: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Marie-Claude.Blatter@isb-sib.ch, Swiss-Prot group, Geneva

SIB Swiss Institute of Bioinformatics

Protein sequence databases: dissemination of protein knowledge

http://education.expasy.org/cours/UniProt/

Page 2:

Menu

Introduction

Nucleic acid sequence databases: ENA, GenBank, DDBJ

Protein sequence databases:

UniProt databases (UniProtKB)

NCBI protein databases (NCBInr, RefSeq…)

Page 3:

Protein sequences are the fundamental determinants of biological structure and function.

http://www.ncbi.nlm.nih.gov/protein

Page 4:

Challenge

Flood of data -> it needs to be stored, curated and made available for analysis and knowledge discovery

Page 5:

Challenge (1): Many different protein sequence databases

TrEMBL, GenPept, Swiss-Prot, RefSeq, PRF, Ensembl, CCDS, UniParc, UniProtKB, PDB, (PIR), (IPI), UniMES, TPA, NCBInr

Page 6:
Page 7:

Challenge (1bis)

Different protein sequence databases: many identifiers for the same protein sequence

These identifiers all point to the same TP53 protein sequence (p53)!

P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, HIT000320921, XP_001172091, DD954676, JT0436, etc.
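To make the identifier problem concrete, here is a minimal Python sketch that guesses which resource an identifier comes from by its shape. The UniProtKB pattern follows the published accession format; the other patterns are simplified prefix checks chosen for this illustration, and the function name `guess_source` is hypothetical. Identifiers that match none of the patterns are reported as unrecognized rather than guessed.

```python
import re

# (resource, pattern) pairs; order matters only for readability here.
# Apart from the UniProtKB accession format, these are simplified
# assumptions for illustration, not official validators.
PATTERNS = [
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("RefSeq protein", re.compile(r"[NXY]P_\d+(\.\d+)?")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("CCDS", re.compile(r"CCDS\d+(\.\d+)?")),
    ("IPI", re.compile(r"IPI\d{8}")),
    ("UniProtKB", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
]


def guess_source(identifier):
    """Return the first resource whose pattern matches the whole identifier."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return name
    return "unrecognized"


if __name__ == "__main__":
    for ident in ["P04637", "NP_000537", "ENSG00000141510", "CCDS11118",
                  "UPI000002ED67", "IPI00025087", "JT0436"]:
        print(ident, "->", guess_source(ident))
```

A search engine that treats each of these strings as a different protein will report several hits for what is a single sequence, which is the situation described on this slide.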

Page 8:

A HUPO test sample study reveals common problems in mass spectrometry–based proteomics

PubMed 19448641 (2009)

• A single mass spectrometry experiment can identify up to about 4’000 proteins (15’000 peptides)

• Protein databases vary greatly in terms of their curation, completeness and comprehensiveness (searching different protein databases can give different results).

• Only 7 labs (out of 27) were able to identify the 20 human proteins present in a sample, mainly because the search engines used cannot distinguish among different identifiers for the same protein…

Page 9:

Challenge (3): (protein) sequence annotation

Nucleic Acids Res. 2010; 38(Database issue): D633–D639.

‘Examining links from the perspective of PubMed, we found that only a small fraction of published articles are linked to human genes (Entrez Gene).’

Page 10:

Journals do not (SHOULD NOT) accept a paper dealing with a nucleic acid sequence if the ENA/GenBank/DDBJ accession number (AC) is not available…

‘journal publishers generally require deposition prior to publication so that an accession number can be included in the paper.’ http://www.ncbi.nlm.nih.gov/books/bv.fcgi?highlight=refseq&rid=handbook.section.GenBank_ASM#GenBank_ASM.RefSeq

…this is not the case for protein sequences

!!! and it is no longer the case for many genomes !!!

Page 11:

Protein sequence origin…

Page 12:

More than 99 % of the protein sequences are derived from the translation of nucleotide sequences (genomes and/or cDNAs)

sequencing quality

coding sequence (CDS) annotation accuracy

gene prediction quality

Page 13:
Page 14:

… ~ 2’500 genomes sequenced (single organism, varying sizes, including viruses)

… ~ 5’000 ongoing genome sequencing projects

Page 15:

http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html

http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat

~ 50-100 genomes/month

+ ~2’500 viral genomes

=> Total ~ 5’000 genomes

Page 16:

… ~ 2’500 genomes sequenced (single organism, varying sizes, including viruses)

… ~ 5’000 ongoing genome sequencing projects

… cDNA sequencing projects (ESTs or cDNAs)

… metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms

Page 17:

Metagenomics: study of genetic material recovered directly from environmental samples

• Global Ocean Sampling (C. Venter): 1 ml of sea water contains 1 million bacteria and 10 million viruses

• Whale fall (AAFZ00000000.1)

• Soil, sand beach, New York air, …

• Human fluids, mouse gut (millions of bacteria within the human body)

• Water treatment industry…

• Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi

Venter’s Sorcerer II

Page 18:

… ~ 2’500 genomes sequenced (single organism, varying sizes)

… ~ 5’000 ongoing genome sequencing projects

… cDNA sequencing projects (ESTs or cDNAs)

… metagenome sequencing projects

… personal human genomes

New-generation sequencers: Illumina, 25 billion bp/day

Page 19:

http://www.youtube.com/watch?v=mVZI7NBgcWM

…2’700 genomes in 2010, 30’000 genomes in 2011?

3’000’000’000 $ (public consortium, 2000)

300’000’000 $ (Celera, 2000)

70’000’000 $ (diploid, 2007)

2’000’000 $ (2007)

2010

Page 20:

But… we now know that his apoE allele is the one associated with an increased risk of Alzheimer's disease and that he has the ‘blue eye’ allele…

Page 21:

apoE gene (Ensembl genome browser)

Page 22:

New projects (Homo sapiens)

• 1000 Genomes (first publication, October 2010)

• Multiple personal genomes (sexual cells, lymphoid cells, cancer cells…)

• International Cancer Genome Consortium (www.icgc.org): for each of the most common cancers, the genomes of 500 patients with cancer and 500 healthy individuals are sequenced…

How to define the human proteome? Which sequences?

Page 23:

How many protein-coding genes in the end?

Page 24:

190'500'025'042

1st estimate: ~30 million species (1.8 million named)

2nd estimate:

20 million bacteria/archaea x 4'000 genes

1 million protists x 6'000 genes

5 million insects x 14'000 genes

2 million fungi x 6'000 genes

0.5 million plants x 20'000 genes

0.5 million molluscs, worms, arachnids, etc. x 20'000 genes

0.1 million vertebrates x 25'000 genes

The calculation: $2\times10^{7}\times4000 + 1\times10^{6}\times6000 + 5\times10^{6}\times14000 + 2\times10^{6}\times6000 + 5\times10^{5}\times20000 + 5\times10^{5}\times20000 + 1\times10^{5}\times25000 + 20000\ \text{(Craig Venter)} + 42\ \text{(Douglas Adams)} + \dots$
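The arithmetic behind this estimate can be checked with a short Python sketch. The species counts and genes-per-species values are the ones listed on the slide; the variable names are only illustrative. Note that the terms as written sum to 190'500'020'042, while the headline figure on the slide is 190'500'025'042; the difference of 5'000 would correspond to an extra term of 25'000 rather than 20'000, which is left here as an observation, not a correction.

```python
# (group, number of species, genes per species), as listed on the slide
ESTIMATE = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 1_000_000, 6_000),
    ("insects", 5_000_000, 14_000),
    ("fungi", 2_000_000, 6_000),
    ("plants", 500_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 500_000, 20_000),
    ("vertebrates", 100_000, 25_000),
]

# Sum of species x genes over all groups
species_total = sum(n_species * n_genes for _, n_species, n_genes in ESTIMATE)

# The two extra terms from the slide: +20000 (Craig Venter) + 42 (Douglas Adams)
grand_total = species_total + 20_000 + 42

if __name__ == "__main__":
    print(f"{species_total:,}")  # about 190 billion
    print(f"{grand_total:,}")
```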

Page 25:

About 190 billion proteins (?)

About 14.0 million ‘known’ protein sequences in 2011 (from ~300’000 species)

More than 99 % of the protein sequences are derived from the translation of nucleotide sequences

Less than 1 % come from direct protein sequencing (Edman, MS/MS…)

-> It is important that protein database users know where the protein sequence comes from…

Page 26:

cDNAs, ESTs, genes, genomes, …

Nucleic acid sequence databases

The ideal life of a sequence …

Protein sequence databases

Page 27:

Menu

Introduction

Nucleic acid sequence databases: ENA, GenBank, DDBJ

Protein sequence databases:

UniProt databases (UniProtKB)

NCBI protein databases

Page 28:

ENA (EMBL-Bank): European Nucleotide Archive

GenBank

DDBJ: DNA Data Bank of Japan

Archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.

Page 29:

http://www.insdc.org/

ENA/GenBank/DDBJ

Page 30:

ENA/GenBank/DDBJ

• Serve as archives: ‘nothing goes out’

• Contain all public sequences derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers (cDNAs, ESTs…)
– Individual scientists (15 % of entries)
– Patent offices (e.g. European Patent Office, EPO)

• Currently: ~210 x 10^6 sequences, ~320 x 10^9 bp

• Sequences from > 300’000 different species

Page 31:

Archival databases:

- Can be very redundant for some loci

- Sequence records are owned by the original submitter and cannot be altered by a third party (except TPA)

Page 32:

taxonomy

Cross-references

references

accession number

Page 33:

CDS annotation (predicted or experimentally determined)

sequence

CDS: CoDing Sequence (proposed by submitters)

Page 34:

The hectic life of a sequence …

cDNAs, ESTs, genes, genomes, …

ENA, GenBank, DDBJ

Data not submitted to public databases, delayed or cancelled…

with or without annotated CDS provided by authors

CDS: CoDing Sequence

portion of DNA/RNA translated into protein (from Met to STOP)

Experimentally proved or derived from gene prediction

!!! not so well documented !!!
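The "from Met to STOP" definition can be illustrated with a minimal Python sketch that takes the first ATG in an mRNA, reads codons in frame until the first stop codon, and translates them with the standard genetic code. This is a naive illustration, not what submitters or gene predictors actually do: the function name `first_cds` is hypothetical, and choosing the first ATG is an assumption that real annotation often has to revisit.

```python
# Standard genetic code, built from the conventional TCAG ordering
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def first_cds(mrna):
    """Return (start, end, protein) for the first ATG..stop reading frame.

    start is the 0-based index of the A of ATG, end is the index just past
    the stop codon. Returns None if there is no ATG or no in-frame stop.
    """
    seq = mrna.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return None
    protein = []
    for pos in range(start, len(seq) - 2, 3):
        # Codons with ambiguous bases (e.g. N) are translated as X
        residue = CODON_TABLE.get(seq[pos:pos + 3], "X")
        if residue == "*":
            return start, pos + 3, "".join(protein)
        protein.append(residue)
    return None


if __name__ == "__main__":
    print(first_cds("GGATGGCCAAATAAGG"))
```

On the toy sequence above, the CDS starts at the ATG at index 2 and ends after the TAA stop codon; everything outside that interval is untranslated.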

Page 35:

CoDing Sequence: alignment between an mRNA (CONTIG) and a genomic sequence

Aligned blocks: exons. Gaps in the mRNA (CONTIG) row: introns.

CONTIG  --------------------------------------------------------------------------------------CGANGGCCTATCAACAATGAAAGGTCGAAACCTG
Genomic AGCTACAAACAGATCCTTGATAATTGTCGTTGATTTTACTTTATCCTAAATTTATCTCAAAAATGTTGAAATTCAGATTCGTCAAGCGAGGGCCTATCAACAATG-AAGGTCGAAACCTG

CONTIG  CGTTTACTCCGGATACAAGATCCACCCAGGACACGGNAAAGAGACTTGTCCGTACTGACGGAAAG-------------------------------------------------------
Genomic CGTTTACTCCGGATACAAGATCCACCCAGGACACGG-AAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTA

CONTIG  ------------------------------------------------------------------------------------------------------------------------
Genomic TATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTT

CONTIG  ----------------------------------------------GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC
Genomic GAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC

CONTIG  TGTCCTCTACAGAATCAAGAACAAGAAG---------------------------------------------GGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC
Genomic TGTCCTCTACAGAATCAAGAACAAGAAGGTACTTGAGATCCTTAAACGCAGTTGAAAATTGGTAATTTTACAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC

CONTIG  CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA
Genomic CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA

CONTIG  TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTNCCAACAAG-----------------------------------------------------------------------------
Genomic TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTGCCAACAAGGTAAACTTTCTACAATATTTATTATAAACTTTAGCATGCTGTTAGAGCTTGTAAGGTATATGTGATTTTACGAGTGT

CONTIG  -------------------------------------------------------------------------------------------------------------------GNAAA
Genomic GTTATTTGAAGCTGTAATATCAATAAGCATGTCTCGTGTGAAGTCCGACAATTTACCATATGCATGAAATTTAAAAACAAGTTAATTTTGTCAATTCTTTATCATTGGTTTTCAGGAAAA

CONTIG  GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATNTNAAGACTGCTGCTCCNCGTGTCGGNGGAAANCGATAAACGTTCTCGGNCCCGTTATTGTAATAAATTTTGTTGAC
Genomic GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATGTGAAGACTGCTGCTCCACGTGTCGGAGGAAAGCGATAAACGTTCTCGGTCCCGTTATTGTAATAAATTTTGTTGAC

CONTIG  C-----------------------------------------------------------------------------------------------------------------------
Genomic CGTTAAAGTTTTAATGCAAGACATCCAACAAGAAAAGTATTCTCAAATTATTATTTTAACAGAACTATCCGAATCTGTTCATTTGAGTTTGTTTAGAATGAGGACTCTTCGAATAGCCCA
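How exons and introns are read off such an alignment can be sketched in a few lines of Python. The sketch below takes only the gapped mRNA row and reports the column ranges it covers (exons) and the long gaps between them (introns). It assumes the alignment columns correspond to genomic positions and that short gaps are sequencing or alignment noise; the function names and the `min_intron` threshold are hypothetical choices for this illustration, not part of any alignment tool.

```python
import re


def exon_blocks(aligned_mrna, min_intron=20):
    """Return (start, end) column intervals covered by the mRNA row.

    Runs of non-gap characters separated by fewer than min_intron dashes
    are merged, so single-base indels do not split an exon.
    """
    blocks = []
    for match in re.finditer(r"[^-]+", aligned_mrna):
        start, end = match.start(), match.end()
        if blocks and start - blocks[-1][1] < min_intron:
            blocks[-1] = (blocks[-1][0], end)  # extend the previous exon
        else:
            blocks.append((start, end))
    return blocks


def intron_blocks(blocks):
    """Return the column intervals lying between consecutive exons."""
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:])]


if __name__ == "__main__":
    # Toy mRNA row: leading gap, exon, 25-column gap, exon with a 1-base gap
    row = "--ACGT" + "-" * 25 + "GG-TT" + "---"
    exons = exon_blocks(row)
    print(exons, intron_blocks(exons))
```

Leading and trailing gaps are not reported as introns, since they only reflect genomic sequence flanking the transcript.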

Page 36:

CDS translation provided by ENA

CDS provided by the submitters

The first Met !

Page 37:

UCSC: human EPO

5’ 3’

mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)

contig

Page 38:

mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)

Page 39:

Very rarely done…

Page 40:

Complete genome (submitted)

but only ~ 2,000 CDS/proteins available !

Page 41:

http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

…annotated CDS in UniProtKB

Page 42:

From nucleic acid to amino acid sequence databases…

Page 43:

The hectic life of a protein sequence …

cDNAs, ESTs, genomes, …

ENA, GenBank, DDBJ

Data not submitted to public databases, delayed or cancelled…

…if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries)

Protein sequence databases

Nucleic acid databases

Gene prediction: RefSeq, Ensembl

no CDS

RefSeq, Ensembl and other

Page 44:

Why do things the simple way when you can do them in a very complex one?

Page 45:

The hectic life of a sequence …

TrEMBL Genpept

CoDing Sequences provided by submitters

cDNAs, ESTs, genomes, …

ENA, GenBank, DDBJ

Data not submitted to public databases, delayed or cancelled…

Swiss-Prot

RefSeq PRF

Scientific publications derived sequences

Ensembl

CCDS

UniParc

UniProtKB

PDB(PIR)

+ all ‘species’ specific databases (EcoGene, TAIR, …)

(IPI)

UniMES

CoDing Sequences provided by submitters

and gene prediction

TPA

Page 46:

Major ‘general’ protein sequence database ‘sources’

UniProtKB: Swiss-Prot + TrEMBL

NCBI-nr: Swiss-Prot + TrEMBL + GenPept + PIR + PDB + PRF + RefSeq + TPA

UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)

UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (300’000 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank; 3D data and associated sequences

PRF: Protein Research Foundation; journal scan of ‘published’ peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)

TPA: Third Party Annotation

Integrated resources

‘cross-references’

Resources kept separated

TPA

not complete !!! (only entries created before 2007 ?)

Page 47:

Look for EPO (Homo sapiens)

Swiss-Prot

TrEMBL

www.uniprot.org
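The same lookup can be scripted. The Python sketch below only builds a search URL and does not contact any server. It assumes the present-day UniProt REST endpoint (`rest.uniprot.org/uniprotkb/search`) with its `query`, `format` and `fields` parameters and the query fields `gene`, `organism_id` and `reviewed`; these reflect the current website rather than the 2011 interface shown on the slides and should be checked against the UniProt documentation. The helper name `uniprot_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the current UniProtKB search service
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def uniprot_search_url(gene, taxon_id, reviewed=None):
    """Build a UniProtKB search URL for a gene name in one species.

    reviewed=True restricts to Swiss-Prot, reviewed=False to TrEMBL,
    reviewed=None searches both sections.
    """
    query = f"gene:{gene} AND organism_id:{taxon_id}"
    if reviewed is not None:
        query += f" AND reviewed:{'true' if reviewed else 'false'}"
    params = {
        "query": query,
        "format": "tsv",
        "fields": "accession,id,protein_name,reviewed",
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # 9606 is the NCBI taxonomy identifier for Homo sapiens
    print(uniprot_search_url("EPO", 9606))
    print(uniprot_search_url("EPO", 9606, reviewed=True))
```

Splitting the search by `reviewed` mirrors the Swiss-Prot / TrEMBL distinction shown in the result list on this slide.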

Page 48:

Look for EPO (Homo sapiens)

Swiss-Prot

Swiss-Prot

GenPept

RefSeq

RefSeq

GenPept

Page 49:

Menu

Introduction

Nucleic acid sequence databases: ENA, GenBank, DDBJ

Protein sequence databases:

UniProt databases (UniProtKB)

NCBI protein databases

Page 50:

UniProt consortium

EBI: European Bioinformatics Institute (UK)

SIB: Swiss Institute of Bioinformatics (CH)

PIR: Protein Information Resource (US)

Page 51:

UniProt databases

Page 52:

UniProtKB: protein sequence knowledgebase with 2 sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, BLAST, download) (~14 million entries)

UniParc: protein sequence archive (ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated (query, BLAST, download) (~25 million entries)

UniRef: 3 sets of clusters of protein sequences at 100, 90 and 50 % identity; useful to speed up sequence similarity searches (BLAST) (query, BLAST, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries)

UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 million entries, included in UniParc)
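Why clustering at 100, 90 and 50 % identity gives progressively smaller sets can be shown with a toy Python sketch. It is not the procedure used to build UniRef: identity here is a naive ungapped, left-anchored comparison relative to the longer sequence, and the clustering is a simple greedy pass in which each sequence joins the first cluster whose representative is similar enough. All names and the example sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions, ungapped, relative to the longer sequence."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_clusters(sequences, threshold):
    """Greedy clustering: the first member of each cluster is its representative."""
    clusters = []
    # Longest sequences first, so representatives are the longest members
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


if __name__ == "__main__":
    toy = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLKQK", "GGGGGGGGGG"]
    for threshold in (1.0, 0.9, 0.5):
        print(threshold, len(greedy_clusters(toy, threshold)))
```

Searching only the cluster representatives first is what makes a lower-identity set faster to scan than the full database.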

Page 53:

UniProtKB: an encyclopedia on proteins

composed of 2 sections:

UniProtKB/TrEMBL: unreviewed, automatically annotated

UniProtKB/Swiss-Prot: reviewed, manually annotated

released every 4 weeks

Page 54:

UniProtKB

from EMBL to TrEMBL

UniProtKB protein sequence data are mainly derived from EMBL (CDS) but also from Ensembl, RefSeq, model organism databases (MODs; e.g. TAIR) and PDB.

Data from the PIR database have been integrated in UniProtKB since 2003.

Page 55:

UniProtKB/Swiss-Prot and UniProtKB/TrEMBL give access to all the protein sequences which are available to the public.

However, UniProtKB excludes the following protein sequences:
- Most non-germline immunoglobulins and T-cell receptors
- Synthetic sequences
- Most patent application sequences
- Small fragments encoded from nucleotide sequence (<8 amino acids)
- Pseudogenes*
- Fusion/truncated proteins
- Not real proteins

* Many putative pseudogene sequences (which are tagged as potential pseudogenes) may be expected to remain in UniProtKB for some time, as it can be difficult to prove the non-existence of a protein

Page 56:

Data increase in UniProtKB

[Chart: number of sequences (0 to 14,000,000) versus date, for UniProtKB and UniProtKB/Swiss-Prot]

Page 57:

EMBL

Automated extraction of protein sequence (translated CDS), gene name and references + automated annotation

TrEMBL

Page 58:

One protein sequence, one species

Automated annotation: Keywords and Gene Ontology

Automated annotation: Function, Subcellular location, Catalytic activity, Sequence similarities…

Automated annotation: transmembrane domains, signal peptide…

Cross-references to over 125 databases

References

Protein and gene names, taxonomic information

UniProtKB/TrEMBL (www.uniprot.org)

Page 59:

UniProtKB: from EMBL to TrEMBL

Automated annotation

1. Protein sequence

2. Biological information

Page 60:

UniProtKB/TrEMBL

Protein sequence
- The quality of UniProtKB/TrEMBL protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS).
- 100% identical sequences (same length, same organism) are merged automatically.

Biological information: sources of annotation
- Provided by the submitter (EMBL, PDB, TAIR…)
- From automated annotation (SAAS: automatically generated annotation rules)
- From automated annotation (UniRule: manually generated annotation rules)
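The merging of 100% identical sequences from the same organism can be pictured as grouping records by an (organism, sequence) key. The Python sketch below is only an illustration of that idea with made-up record identifiers; the function name `merge_identical` and the record layout are assumptions, not the actual TrEMBL pipeline.

```python
from collections import defaultdict


def merge_identical(records):
    """Group source identifiers that share the same organism and exact sequence.

    records: iterable of (source_id, organism, sequence) tuples.
    Returns a dict mapping (organism, sequence) to the list of source ids.
    """
    merged = defaultdict(list)
    for source_id, organism, sequence in records:
        merged[(organism, sequence)].append(source_id)
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical CDS translations; identifiers are invented for the example
    records = [
        ("cds_1", "Homo sapiens", "MGVHECPAW"),
        ("cds_2", "Homo sapiens", "MGVHECPAW"),   # identical: merged with cds_1
        ("cds_3", "Mus musculus", "MGVHECPAW"),   # same sequence, other species
        ("cds_4", "Homo sapiens", "MGVHECPAWL"),  # one residue longer: separate
    ]
    for key, ids in merge_identical(records).items():
        print(key, ids)
```

Because the key includes the organism, identical sequences from different species stay in separate entries, in line with the "one protein sequence, one species" principle.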

Page 61:
Page 62:

Page 63:

Automatic annotation in UniProtKB/TrEMBL

System    Rule creation  Trigger    Annotations                                       Scope
SAAS      automatic      InterPro   comments, KW                                      all taxa
UniRules  manual         InterPro*  protein names, comments, features, KW, GO terms   all taxa

* Flexibility to create custom signatures for InterPro as required

Page 64:

SAAS

• Rules are derived from the UniProtKB/Swiss-Prot manual annotation.

• Fully automated rule generation based on C4.5 decision tree algorithm.

• One annotation, one rule.

• Precision calculated for each rule vs UniProtKB/Swiss-Prot.

• High stringency – require 99% or greater estimated precision on UniProtKB/TrEMBL to generate annotation.

• Rules are produced, updated and validated at each release.

UniProtKB/TrEMBL
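The stringency criterion for SAAS rules can be sketched as a precision filter. The Python below does not implement C4.5 rule generation; it only shows, under simplified assumptions, how the precision of a candidate rule could be measured against a manually annotated reference set and compared with a 99% cut-off. The function names and the set-based representation of annotations are hypothetical.

```python
def rule_precision(predicted, reference):
    """Fraction of entries annotated by the rule that carry the annotation in the reference.

    predicted: set of entry ids the rule would annotate.
    reference: set of entry ids that have the annotation in the reviewed data.
    """
    if not predicted:
        return 0.0
    return len(predicted & reference) / len(predicted)


def keep_rules(rules, reference, min_precision=0.99):
    """Return the names of rules whose precision reaches the threshold.

    rules: dict mapping rule name to the set of entries it would annotate.
    reference: dict mapping rule name to the set of correctly annotated entries.
    """
    return [
        name
        for name, predicted in rules.items()
        if rule_precision(predicted, reference[name]) >= min_precision
    ]


if __name__ == "__main__":
    entries = set(range(100))
    rules = {"rule_A": entries, "rule_B": entries}
    reference = {
        "rule_A": set(range(99)),  # 99 of 100 predictions confirmed
        "rule_B": set(range(97)),  # 97 of 100 predictions confirmed
    }
    print(keep_rules(rules, reference))
```

With these toy numbers only rule_A passes the 99% threshold; rule_B, at 97%, would not be used to generate annotation.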

Page 65:
Page 66:
Page 67:

UniRules (RuleBase, HAMAP, PIRSF)

• Rules of varying complexity: annotation varies from simple KW attribution to complete annotation as for UniProtKB/Swiss-Prot

• Rules are manually curated:
– From SAAS rules as input
– From UniProtKB/Swiss-Prot annotation and InterPro match data, taxonomy information – continuously reported to curators
– From literature-based curation of characterized families, with the possibility to create new signatures for specific functional groups

• Rules are continuously monitored – validation on UniProtKB/Swiss-Prot – 97% confidence

• HAMAP is also used for the annotation of some UniProtKB/Swiss-Prot entries

Page 68:

UniRule – HAMAP

Page 69:

UniRule – HAMAP

Page 70:

Automatic annotation in UniProtKB/TrEMBL - Summary

• SAAS – automatically generated annotation rules for comments, KWs
– tested on UniProtKB/Swiss-Prot

• UniRule – manually curated annotation rules (e.g. HAMAP)
– annotation varies from simple KWs to full annotation
– start point can be SAAS rules, InterPro reports, literature-based curation of protein families
– possibility to create custom signatures -> InterPro

• Automatic annotation of UniProtKB/TrEMBL is refreshed and validated at each UniProtKB release
– validation using UniProtKB/Swiss-Prot as reference
– ~10% of the rules are ‘refreshed’ at each release

• The source of each annotation is indicated - users can access rule logic

Page 71:

Current status – coverage of UniProtKB/TrEMBL

System                            Rules   Coverage*
SAAS                              1684    17.2%
UniRules: RuleBase                1108    23.0%
UniRules: PIR name / site rules   142     0.26%
UniRules: HAMAP                   1087    4.7%

* Proportion of entries with at least one annotation from the specified system

UniProtKB/TrEMBL 2010_12: 12,769,092 entries; all systems combined: 33%

Page 72:

Current status - coverage of UniProtKB/TrEMBL

[Bar chart: % coverage of UniProtKB/TrEMBL (0 to 35) by annotation type (All, CC, DE, FT, GN, KW), for UniRule, UniRule + SAAS, and SAAS]

UniProt release 2010_10 included annotations from 7767 SAAS rules, 1814 UniRules

Page 73: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

GO annotation
- KW2GO
- InterPro2GO
- HAMAP2GO

Page 74: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProtKB

from TrEMBL to Swiss-Prot

Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL

-> minimal redundancy

Page 75: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

EMBL

-> Automated extraction of protein sequence (translated CDS), gene name and references + automated annotation

TrEMBL

-> Manual annotation of the sequence and associated biological information

Swiss-Prot

Page 76: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAARAFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFLNKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRGTVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGRAGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG

UniProtKB/Swiss-Prot (www.uniprot.org)

One protein sequence, one gene, one species

Manual annotation: keywords and Gene Ontology

Manual annotation: function, subcellular location, catalytic activity, disease, tissue specificity, pathway…

Manual annotation: post-translational modifications, variants, transmembrane domains, signal peptide…

Cross-references to over 125 databases

References

Protein and gene names; taxonomic information

Alternative products: protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation…

Page 77: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProtKB: from TrEMBL to Swiss-Prot

Manual annotation

1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)

2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)

Page 78: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProtKB

1- Sequence curation

Page 79: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

The displayed protein sequence

…canonical, representative, consensus…

Page 80: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

The displayed sequence is the most prevalent protein sequence and/or the protein sequence which is also found in orthologous species.

The displayed sequence is generally derived from the translation of the genomic sequence (when available).

Sequence differences are documented.

1 entry <-> 1 gene (1 species) 1 displayed sequence

(annotation of alternative sequences, when available)

UniProtKB/Swiss-Prot protein sequence annotation

‘Merging/Redundancy policy’: a gene-centric view of protein space

Page 81: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

What is the current status?

• At least 20% of Swiss-Prot entries required a minimal amount of curation effort so as to obtain the “correct” sequence.

• Typical problems
– unsolved conflicts;
– uncorrected initiation sites;
– frameshifts;
– other ‘problems’

Page 83: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

… once a gene on chromosome 11…

Page 84: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Quality of protein information from genome projects

• Let's look at proteins originating from genome projects:

– Drosophila: the paradigm of what a curated genome should look like (thanks to FlyBase): only 1.8% of the gene models conflict with Swiss-Prot sequences;

– Arabidopsis: a typical example of a genome where a lot of annotation was done when it was sequenced, but with no update since then (at least in the public view): 20% of the gene models are erroneous;

– Tetraodon nigroviridis: the typical example of a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins.

– Bacteria and Archaea have almost no splicing, so predictions are “easier”; however, errors are still made: start codons, missed small proteins (<100 aa)…

Page 85: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProtKB/Swiss-Prot: protein sequence annotation

Page 86: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Example of problem (derived from gene prediction pipeline)

Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences.

ID   URAD_HUMAN            Unreviewed;       171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;

In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.

Page 87: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

• Producing a clean set of sequences is not a trivial task;

• It is not getting easier as more and more types of sequence data are submitted;

Page 88: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

‘Protein existence’ tag

• The ‘Protein existence’ tag indicates the evidence for the existence of a given protein.

• Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location),…)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58%)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)

http://www.uniprot.org/docs/pe_criteria

Page 90: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

To avoid ‘pseudogenes’ and most of the improbable protein sequences, you can filter your query to exclude sequences with ‘protein existence tag’ = ‘Uncertain’.
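The same filter can be sketched offline on flat-file text. This is a minimal illustration, assuming each entry carries a `PE` line of the form shown in the URAD_HUMAN example (`PE   4: Predicted;`); the function names are hypothetical and not part of any UniProt tool.

```python
import re

# Matches a flat-file protein existence line such as "PE   4: Predicted;"
PE_LINE = re.compile(r"^PE\s+(\d):\s*([^;]+);", re.MULTILINE)


def protein_existence(entry_text):
    """Return (level, label) from the PE line of one entry, or None if absent."""
    match = PE_LINE.search(entry_text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


def drop_uncertain(entries):
    """Keep entries whose PE label is not 'Uncertain' (entries without a PE line are kept)."""
    kept = []
    for entry in entries:
        pe = protein_existence(entry)
        if pe is None or pe[1].lower() != "uncertain":
            kept.append(entry)
    return kept


# Abbreviated version of the entry shown on the slide
example = "ID   URAD_HUMAN   Unreviewed;   171 AA.\nPE   4: Predicted;\n"
print(protein_existence(example))
```

On the web site the equivalent restriction is applied through the query itself; the sketch only shows what the tag looks like to a script.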

Page 92: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

The ‘alternative’ sequence(s)

Page 93: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProtKB/Swiss-Prot

1 entry <-> 1 gene (1 species)

Annotation of the sequence differences

(including conflicts, polymorphisms, splice variants, etc.)

-> annotation of protein diversity

Page 94: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Multiple alignment of the end of the available GCR sequences

Annotation of the sequence differences (protein diversity)

1 entry <-> 1 gene (1 species)

…and natural variants

Page 95: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

P04150

www.uniprot.org

Page 96: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProtKB (and RefSeq) do under-represent alternatively spliced products

According to PMID:21307931, alternative splicing seems to occur in more than 90% of protein-coding genes (although it might not always modify the protein sequence).

Transcript variants are only made when there is information available on the full-length nature of the product; if multiple alternate exons are found along the length of the gene, no assumption is made about the combination of alternate exons that may exist in vivo.

Uncertain alternative sequences (confirmed by only one cDNA) are tagged with ‘No experimental confirmation available’.

http://www.ncbi.nlm.nih.gov/books/NBK50679/#RefSeqFAQ.what_does_a_reviewed_status_me

Page 97: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Available in separate files!

Important remark

> 30’000 additional sequences (total)

Page 98: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

The ‘alternative’ sequence(s)

not ‘directly available’ in many tools (including protein identification tools and Blast), depending on the server!

Not included yet in the UniProtKB complete proteome sets !

Page 99: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Depending upon the organism, the inclusion of alternative sequences in the basic set of protein sequences can make a tremendous difference. For instance, in Homo sapiens, alternative sequences currently represent close to 40% of the total number of annotated human sequences described in UniProtKB/Swiss-Prot.

http://www.uniprot.org/faq/38

Page 100: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Blast P04150 against Swiss-Prot / Homo sapiens @ UniProt

Isoform sequences

Page 101: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Blast P04150 against Swiss-Prot / Homo sapiens @ NCBI

The isoform sequences are not present in the NCBI protein databases!

The .x number (e.g. P06401.4) corresponds to the version number of the sequence, not to an alternatively spliced sequence!

Page 102: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

How to track sequence changes ?

• The sequence version number applies to the canonical sequence only

• There is no easy way yet to track sequence updates of isoforms
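The difference between a sequence version suffix and an isoform identifier can be made explicit in a few lines. This is a sketch, assuming versioned identifiers look like `P06401.4` (as in the NCBI results) and UniProtKB isoform identifiers take the accession-dash-number form (e.g. `P04150-2`); `describe_identifier` is a hypothetical helper.

```python
import re

# "P06401.4" -> accession + sequence version (canonical sequence only)
# "P04150-2" -> accession + isoform number
VERSIONED = re.compile(r"^([A-Z0-9]+)\.(\d+)$")
ISOFORM = re.compile(r"^([A-Z0-9]+)-(\d+)$")


def describe_identifier(identifier):
    """Split an identifier into (kind, accession, number)."""
    match = VERSIONED.match(identifier)
    if match:
        return ("sequence version", match.group(1), int(match.group(2)))
    match = ISOFORM.match(identifier)
    if match:
        return ("isoform", match.group(1), int(match.group(2)))
    return ("accession", identifier, None)


for example in ("P06401.4", "P04150-2", "P04150"):
    print(describe_identifier(example))
```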

Page 103: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProtKB

2- Biological data curation

Page 104: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProtKB/Swiss-Prot gathers data from multiple sources:

- publications (literature/PubMed)
- prediction programs (Prosite, TMHMM, …)
- contacts with experts
- other databases
- nomenclature committees

An evidence attribution system makes it easy to trace the source of each annotation

Extract literature information and protein sequence analysis

Maximum usage of controlled vocabulary

Page 105: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Protein and gene names

Page 106: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

…enable researchers to obtain a summary of what is known about a protein…

General annotation

(Comments)

www.uniprot.org

Page 107: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Human protein manual annotation: some statistics (Aug 2010)

Page 108: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Sequence annotation

(Features)

…enable researchers to obtain a summary of what is known about a protein…

www.uniprot.org

Page 109: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Human protein manual annotation: some statistics

(PTM)

Page 110: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Non-experimental qualifiers

UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between both.

Level   Type of evidence                                                                Qualifier
1st     Strong experimental evidence                                                    Ref.X
2nd     Light experimental evidence                                                     Probable
3rd     Inferred by similarity with homologous protein (data of 1st or 2nd level)       By similarity
4th     Inferred by sequence prediction                                                 Potential
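As a rough illustration of how these qualifiers can be used when processing annotation text, the sketch below maps a trailing qualifier to its level. It assumes annotations are plain strings ending in a parenthesised qualifier such as `(By similarity)`, and that a string without any qualifier counts as level 1; both are assumptions made for the example, and `evidence_level` is a hypothetical helper.

```python
# Qualifier text (lower case) -> evidence level from the table above
QUALIFIER_LEVEL = {"probable": 2, "by similarity": 3, "potential": 4}


def evidence_level(annotation):
    """Return the evidence level (1-4) implied by a trailing qualifier."""
    text = annotation.strip().rstrip(".").lower()
    for qualifier, level in QUALIFIER_LEVEL.items():
        if text.endswith("(" + qualifier + ")"):
            return level
    # Assumption: no non-experimental qualifier means experimental evidence
    return 1


for note in ("Cytoplasm (By similarity).", "Phosphoserine (Potential)", "Nucleus."):
    print(note, "->", evidence_level(note))
```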

Page 111: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)

Page 112: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProtKB

Additional information can be found in the cross-references (to more than 140 databases)

Page 113: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links!

2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE

Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs

Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN

Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB

Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase

Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome

Other: BindingDB, DrugBank, NextBio, PMAP-CutDB

Sequence: EMBL, IPI, PIR, RefSeq, UniGene

3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR

PTM: GlycoSuiteDB, PhosphoSite, PhosSite

Proteomic: PeptideAtlas, PRIDE, ProMEX

PPI: DIP, IntAct, MINT, STRING

Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB

Polymorphism: dbSNP

Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline

Ontologies: GO

Page 114: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Protein sequence origin

http://www.uniprot.org/faq/35

Page 115: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Access to UniProtKB

www.uniprot.org

Page 116: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

The UniProt web site - www.uniprot.org

• Powerful search engine, Google-like and easy to use, which also supports very directed field searches (similar to SRS)

• Scoring mechanism presenting relevant matches first

• Entry views, search result views and downloads are customizable

• The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access

• Tools: Blast, Align, IDmapping, Batch retrieval (Retrieve)
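Because the URL reflects the query, a script can build the same address that a browser would show. The sketch below only constructs a URL and does not contact any server; the path `/uniprot/` and the parameter names `query` and `format` are assumptions made for illustration and should be checked against the current UniProt documentation.

```python
from urllib.parse import urlencode


def build_query_url(base_url, query, **params):
    """Return base_url with the query and any extra parameters URL-encoded."""
    return base_url + "?" + urlencode({"query": query, **params})


# Hypothetical path and parameter names, chosen for illustration only
url = build_query_url("http://www.uniprot.org/uniprot/", "albumin", format="tab")
print(url)
```

Keeping the base URL as an argument means the same helper can follow the site if its address scheme changes.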

Page 117: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Search

A very powerful text search tool, with autocompletion and refinement options, for finding UniProt entries and documentation by biological information

Page 118: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge


Page 119: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

The search interface guides users with helpful suggestions and hints

Page 121: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Advanced Search

A very powerful search tool

To be used when you know in which entry section the information is stored

First have a look at annotation examples.

Page 122: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Find all human proteins with experimental evidence for their

location in the nucleus

Page 123: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

The information is stored in the ‘General annotation’ section, Subcellular location

Page 125: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Find all human proteins with experimental evidence for their

location in the nucleus

Page 126: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Result pages: Highly customizable

Page 127: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Custom downloads….

Accession  Genes                Domains              Protein Existence
P02768     ALB (GIG20) (GIG42) (PRO0903) (PRO1708) (PRO2044) (PRO2619) (PRO2675) (UNQ696/PRO1341)  Albumin domains (3)  Evidence at protein level
P02769     ALB                  Albumin domains (3)  Evidence at protein level
P02770     Alb                  Albumin domains (3)  Evidence at protein level
P07724     Alb (Alb-1) (Alb1)   Albumin domains (3)  Evidence at protein level
P08759     alb-A                Albumin domains (3)  Evidence at transcript level
P14872     alb-B                Albumin domains (3)  Evidence at transcript level
P43652     AFM (ALB2) (ALBA)    Albumin domains (3)  Evidence at protein level
P08835     ALB                  Albumin domains (3)  Evidence at protein level
P49822     ALB                  Albumin domains (3)  Evidence at protein level
P19121     ALB                  Albumin domains (3)  Evidence at protein level

Open with Excel etc.
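A downloaded table can also be read by a script instead of a spreadsheet. This is a minimal sketch, assuming the download is tab-delimited text with a header row; the sample reuses three columns and two rows from the table above, and `read_tab_delimited` is a hypothetical helper.

```python
import csv
import io


def read_tab_delimited(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


sample = (
    "Accession\tGenes\tProtein Existence\n"
    "P02769\tALB\tEvidence at protein level\n"
    "P08759\talb-A\tEvidence at transcript level\n"
)
rows = read_tab_delimited(sample)

# Keep only the accessions with evidence at protein level
protein_level = [
    row["Accession"]
    for row in rows
    if row["Protein Existence"] == "Evidence at protein level"
]
print(protein_level)
```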

Page 128: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

The URL (results) can be bookmarked and manually modified.

Page 129: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Blast

A tool, with the standard options, for searching sequences in the UniProt databases

Page 130: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Blast results: customize display

Page 131: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Blast: use of UniProt annotation

Amino-acid highlighting options and feature annotation highlighting option in the local alignment

Page 132: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Align

A ClustalW multiple alignment tool with amino-acid highlighting options and a feature annotation highlighting option

Page 133: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

ClustalW multiple alignment of insulin sequences

Amino-acid highlighting options and feature annotation highlighting option in the alignment

Page 134: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Retrieve

A UniProt-specific tool for retrieving a list of entries from identifiers in several standard formats.

You can then query your ‘personal database’ with the UniProt search tool.

Page 135: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Your dataset: results of a Scan Prosite

Page 136: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

ID Mapping

Maps identifiers between different databases for a given protein

Page 137: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

These identifiers are all pointing to TP53 (p53) !

P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, etc.
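To see why a mapping service is needed, the sketch below guesses which database each of these identifiers comes from, using only its shape. The regular expressions are simplified assumptions that happen to fit the examples on this slide; real identifier formats have more variants, and recognising the source of an identifier is not the same as mapping it.

```python
import re

# Simplified, illustrative patterns (assumptions, not official definitions)
PATTERNS = [
    ("UniParc", re.compile(r"^UPI[0-9A-F]{10}$")),
    ("RefSeq protein", re.compile(r"^[NXY]P_\d+(\.\d+)?$")),
    ("Ensembl gene", re.compile(r"^ENSG\d{11}$")),
    ("CCDS", re.compile(r"^CCDS\d+(\.\d+)?$")),
    ("IPI", re.compile(r"^IPI\d{8}(\.\d+)?$")),
    ("UniProtKB AC", re.compile(r"^[A-Z][0-9][A-Z0-9]{3}[0-9]$")),
]


def guess_source(identifier):
    """Return the name of the first pattern that matches, or 'unknown'."""
    for name, pattern in PATTERNS:
        if pattern.match(identifier):
            return name
    return "unknown"


tp53_ids = ["P04637", "NP_000537", "ENSG00000141510",
            "CCDS11118", "UPI000002ED67", "IPI00025087"]
for identifier in tp53_ids:
    print(identifier, "->", guess_source(identifier))
```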

Page 139: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Download

Page 140: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Downloading UniProt http://www.uniprot.org/downloads

Page 141: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Downloading UniProt http://www.uniprot.org/downloads

Page 142: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Canonical and isoform sequences

Page 143: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Complete proteome

‘gene’ centred or all known proteins?

Page 144: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

http://www.uniprot.org/faq/38

Page 146: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Remark: Some peptides are not associated with the keyword ‘Complete proteome’ because they do not match with the human genome

Page 147: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProt proteome sets, if downloaded in UniProt flat file or XML format, contain one sequence per UniProt record!

‘gene’ centred

all protein sequences in UniProtKB/Swiss-Prot…

Missing: other alternatively spliced protein sequences in UniProtKB/TrEMBL

Page 148: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

IPI closure

• The complete MOUSE proteome will be composed of all MOUSE sequences in UniProtKB/Swiss-Prot plus those MOUSE sequences in UniProtKB/TrEMBL that have a cross-reference to an Ensembl protein.

• The complete HUMAN proteome will therefore be composed of all HUMAN sequences in UniProtKB/Swiss-Prot plus those HUMAN sequences in UniProtKB/TrEMBL that have a cross-reference to an Ensembl protein.

• News: 30th March 2011: next UniProtKB release.
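The selection rule stated above can be written as a short filter. This is a sketch under stated assumptions: entries are represented as dictionaries with the hypothetical keys `accession`, `reviewed` (True for UniProtKB/Swiss-Prot, False for UniProtKB/TrEMBL) and `xrefs`, and the accessions are made-up placeholders.

```python
def complete_proteome(entries):
    """Select accessions that are in Swiss-Prot, or in TrEMBL with an Ensembl cross-reference."""
    return [
        entry["accession"]
        for entry in entries
        if entry["reviewed"] or "Ensembl" in entry["xrefs"]
    ]


# Placeholder records for illustration only
entries = [
    {"accession": "ACC1", "reviewed": True, "xrefs": {"EMBL"}},
    {"accession": "ACC2", "reviewed": False, "xrefs": {"EMBL", "Ensembl"}},
    {"accession": "ACC3", "reviewed": False, "xrefs": {"EMBL"}},
]
print(complete_proteome(entries))
```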

Page 149: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProtKB

Statistics

Page 150: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

520’000 + 14’000’000 14’000’000

Swiss-Prot & TrEMBL introduce a new arithmetical concept!

Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot

12’000 species 130’000 species

Swiss-Prot TrEMBL

Page 151: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

12’000 species, mainly model organisms

Page 152: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Not yet available

Page 153: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

~ 200 new entries / day

New release every 4 weeks

- Annotation is useful, good annotation is better, update is essential !

- Some entries have gone through more than 120 versions since their integration in UniProtKB/Swiss-Prot

Page 154: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniProtKB entry history

Always cite the primary accession number (AC) !

Page 155: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniParc

Page 156: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniParc

- non-redundant protein sequence archive, containing both active and inactive sequences (including sequences which are not in UniProtKB, e.g. immunoglobulins…)

- the equivalent of ENA/GenBank/DDBJ at the protein level

- species-merged: sequences from different species are merged when they are 100% identical over the whole length.

- no annotation (only taxonomy)

- can be searched only with database names, taxonomy, checksum (CRC64) and accession numbers (ACs) or UniProtKB, UniRef and UniParc IDs.

- Beware: contains wrong predictions, pseudogenes, etc.
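The ‘species-merged’ idea can be illustrated with a dictionary keyed by the exact sequence. This is a toy sketch, not the UniParc implementation: the record layout, identifiers and sequences are made up, and a plain string key stands in for the CRC64 checksum mentioned above.

```python
from collections import defaultdict


def build_archive(records):
    """Group (source_id, species, sequence) records by exact sequence."""
    archive = defaultdict(list)
    for source_id, species, sequence in records:
        archive[sequence.upper()].append((source_id, species))
    return dict(archive)


# Made-up records: the first two share a 100% identical sequence
records = [
    ("SRC_A1", "Species one", "MKTAYIAKQR"),
    ("SRC_B7", "Species two", "MKTAYIAKQR"),
    ("SRC_C3", "Species one", "GSHMLEDPVA"),
]
archive = build_archive(records)
for sequence, sources in archive.items():
    print(sequence, sources)
```

Each distinct sequence is stored once, while the list of sources keeps track of every database record and species in which it was seen.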

Page 157: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

Query UniParc

Page 158: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniRef

Page 159: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

‘UniRef is useful for comprehensive BLAST similarity searches by providing

sets of representative sequences’

Page 160: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

«Collapsing BLAST results»

Three collections of sequence clusters from UniProtKB and selected UniParc entries:

One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments are grouped in a single record) -> reduction of 12 %

One UniRef90 entry -> sequences that are at least 90 % identical -> reduction of 40 %

One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %

Based on sequence identity -> Independent of the species !
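How the number of clusters shrinks as the identity threshold drops can be shown with a toy example. This is not the procedure used to build UniRef, which the slides do not describe: it assumes equal-length sequences compared position by position with no alignment, ignores sub-fragments, and uses made-up sequences.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy measure needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first representative it matches, else start a new cluster."""
    clusters = []  # list of (representative, members)
    for seq in sequences:
        for representative, members in clusters:
            same_length = len(seq) == len(representative)
            if same_length and identity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


# Made-up sequences: two identical, one differing at a single position, one unrelated
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYIAKQR", "GSHMLEDPVA"]
for threshold in (1.0, 0.9, 0.5):
    print(threshold, len(greedy_cluster(seqs, threshold)))
```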

Page 162: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniRef90: independent of species and sequence length

Page 163: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

UniMES

Page 164: Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: dissemination of protein knowledge

The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental protein data (only GOS data for the moment).

Download only (but included in UniParc -> Blast).

- UniMES Fasta sequences
- UniMES matches to InterPro methods

ftp.uniprot.org/pub/databases/uniprot


UniMES: sequences in FASTA format
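Since UniMES is distributed as a FASTA download, a minimal reader may help. The sketch below assumes plain FASTA text (a `>` header line followed by one or more sequence lines); the function name `read_fasta` and the example records are hypothetical and not taken from the UniMES files.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example; a real file would be opened with open(path).
example = [
    ">seq1 hypothetical environmental protein",
    "MKTAYIAK",
    "QRQISFVK",
    ">seq2",
    "MSSS",
]
records = list(read_fasta(example))
```

Because the function takes any iterable of lines, the same code works on an open file handle without loading the whole download into memory.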


Menu

Introduction

Nucleic acid sequence databases ENA, GenBank, DDBJ

Protein sequence databases

UniProt databases (UniProtKB)

NCBI protein databases


NCBI protein databases

(Entrez protein, NCBI nr)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein


Major ‘general’ protein sequence database ‘sources’

UniProtKB: Swiss-Prot + TrEMBL

NCBI-nr: Swiss-Prot + TrEMBL + GenPept + PIR + PDB + PRF + RefSeq + TPA

PIR PDB PRF

UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)

UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (300’000 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Databank: 3D data and associated sequences

PRF: Protein Research Foundation journal scan of ‘published’ peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)

TPA: Third Party Annotation

Integrated resources

‘cross-references’

Resources kept separated

TPA

not complete !!! (only entries created before 2007 ?)


Query at Entrez protein

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein


Typical result of a query at

« Entrez protein » RefSeq

Swiss-Prot

Genpept


A Swiss-Prot entry with the NCBI look


A TrEMBL entry with the NCBI look

!!!!


GI number ‘GenInfo identifier’ number

- In addition to the AC number assigned by the original database, each protein sequence in the NCBI nr database (including Swiss-Prot entries) has a GI number.


AC


GI number: ‘GenInfo identifier’ number

- If the sequence changes in any way, a new GI number will be assigned:

GI identifiers provide a mechanism for identifying the exact sequence that was used or retrieved in a given search.

- A separate GI number is assigned to each protein translation (alternative products)

- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record:

http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
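To make the distinction between GI number and AC number concrete, here is a small sketch that splits a legacy NCBI-style FASTA definition line of the form `gi|<number>|<db>|<accession>|...`. The function name `parse_gi_defline` is hypothetical, and the GI value in the example is a placeholder rather than the real GI of the entry.

```python
def parse_gi_defline(defline):
    """Split a legacy 'gi|number|db|accession|...' defline into its parts."""
    fields = defline.lstrip(">").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("not a gi|...|db|accession defline: %r" % defline)
    return {
        "gi": int(fields[1]),     # sequence-specific GI number
        "db": fields[2],          # source database tag, e.g. 'sp' for Swiss-Prot
        "accession": fields[3],   # AC number from the original database
    }


# The GI value 12345 is a placeholder, not the real GI number of this entry.
record = parse_gi_defline(">gi|12345|sp|P04150|GCR_HUMAN")
```

The accession stays attached to the entry, whereas the GI number identifies one exact sequence and would change if the sequence were revised.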


ID/AC mapping


http://www.ebi.ac.uk/Tools/picr/


GenPept

Translation from annotated CDS in GenBank

Contains all translated CDS annotated in GenBank/ENA/DDBJ sequences

- equivalent to UniProtKB/TrEMBL, except that it is redundant with other databases (Swiss-Prot, RefSeq, PIR…)


GenPept: ‘translations from all annotated coding regions (CDS) in GenBank’


RefSeq

Produced by NCBI and NLM

http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf

FAQ: http://www.ncbi.nlm.nih.gov/books/NBK50679/

http://www.ncbi.nlm.nih.gov/RefSeq/


The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

Protein – mRNA – genomic sequence

Also chromosomes, organelle genomes, plasmids, intermediate assembled genomic contigs, ncRNAs.

- tightly linked to Entrez Gene (« interdependent curated resources »)


Example: NP_000790

Beware: NeXtProt accession number: NX_P00918
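Because RefSeq, neXtProt and UniProtKB accession numbers look alike at first glance, a small classifier can help tell them apart. The sketch below is an assumption-laden illustration: the function name `classify_accession` is hypothetical, the RefSeq pattern is deliberately simplified (two-letter prefix, underscore, digits, optional version), and the UniProtKB pattern follows the accession format published in the UniProt help pages.

```python
import re

# UniProtKB accession format (6 or 10 characters).
UNIPROT_AC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Simplified RefSeq pattern: two-letter prefix, underscore, digits, optional version.
REFSEQ_AC = re.compile(r"[A-Z]{2}_[0-9]+(?:\.[0-9]+)?")


def classify_accession(ac):
    """Return a rough guess of the database an accession number belongs to."""
    if ac.startswith("NX_"):
        return "neXtProt"   # checked first: 'NX_' resembles a RefSeq prefix
    if REFSEQ_AC.fullmatch(ac):
        return "RefSeq"
    if UNIPROT_AC.fullmatch(ac):
        return "UniProtKB"
    return "unknown"


guesses = {ac: classify_accession(ac) for ac in ["NP_000790", "NX_P00918", "P04150"]}
```

The neXtProt check comes first precisely because of the pitfall flagged on the slide: `NX_` looks like a RefSeq prefix but is not one.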


KW

AC

Taxonomy

References


GenBank source and status

Annotation and ontologies


Curated records


UniProtKB vs RefSeq


UniProtKB/Swiss-Prot merges all CDS available for a given gene and describes the sequence differences

UniProtKB/Swiss-Prot P04150 (GCR_HUMAN):


RefSeq chooses one or several protein reference sequences for a given gene: they do not annotate the sequence differences.

- If there is an alternative splicing event, there will be several distinct entries for a given gene

Example: GCR_HUMAN

GCR_HUMAN (UniProtKB/Swiss-Prot): 1 UniProtKB entry cross-linked with 7 RefSeq entries


Protein feature annotation found in RefSeq

- Conserved domains

- Signal and mature peptides

- Propagation of a subset of features from Swiss-Prot.


PTM annotation: Swiss-Prot vs RefSeq (GCR_HUMAN)


RefSeq statistics

The numbers are not comparable: entries ‘sequence’ (RefSeq) vs entries ‘gene’ (UniProtKB/Swiss-Prot)


Summary: UniProtKB vs NCBI protein


ENA/GenBank/DDBJ | RefSeq (www.ncbi.nlm.nih.gov/RefSeq/) | UniProt (www.uniprot.org)

Protein and nucleotide data | Genomic, RNA and protein data | Protein data only

Biological data added by the submitters (gene name, tissue…) | Biological data annotated by curators, also found in the corresponding Entrez Gene entry | Biological data annotated by curators (Swiss-Prot), within the entry

Not curated | Partially manually curated (‘reviewed’ entries) | Manually curated in Swiss-Prot, not in TrEMBL

Author submission | NCBI creates from existing data + gene prediction | UniProt creates from existing data

Only author can revise (except TPA) | NCBI revises as new data emerge | UniProt revises as new data emerge

Multiple records for same loci common | Single records for each molecule of major organisms | Single records for each protein from one gene of major organisms (in Swiss-Prot, TrEMBL is redundant)

Records can contradict each other | - | Identification and annotation of discrepancy

No limit to species included | Limited to model organisms | Priority (but not limited) to model organisms

Data exchanged among INSDC members | NCBI database; collaboration with UniProt | UniProt database; collaboration with NCBI (RefSeq, CCDS)


PIR


PIR: the Protein Information Resource

PIR-PSD is no longer updated, but exists as an archive


PDB


PDB

• PDB (Protein Data Bank), 3D structure

• Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies

• Also contains the corresponding protein sequences

* The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools

• Includes protein sequences that are mutated, chimeric, etc. (created specifically to study the effect of a mutation on the 3D structure)


PDB: Protein Data Bank
www.rcsb.org/pdb/

• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).

• Associated with specialized programs that allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).

• Currently there are ~68’000 structures for about 15’000 different proteins, but far fewer protein families (highly redundant)!


PDB: example


Coordinates of each atom

Sequence


Visualisation with Jmol


PRF

Protein Research Foundation


http://www.genome.jp/dbget-bin/www_bfind?prf

Looks for peptide sequences described in publications (and which have not been submitted to databases!)


Other protein databases


Ensembl: http://www.ensembl.org/

Review: http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D610

Annotation pipeline: http://www.genome.org/cgi/content/full/14/5/942


- Ensembl: aligns the genomic sequences with all the sequences found in ENA, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)

- Also performs gene prediction (-> novel genes)

Ensembl = UniProtKB + RefSeq + gene prediction

- DNA, RNA and protein sequences available for several species.

- Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant, fungal and metazoan genomes.


Example of problem (derived from gene prediction pipeline)

Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences.

ID   URAD_HUMAN            Unreviewed;       171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;

In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.
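An entry like the one above can be screened automatically. The following sketch parses a few lines of the UniProtKB flat-file format (two-letter line code, then the value) and flags entries whose protein existence (PE) level is 4 (predicted) or 5 (uncertain). The function names are hypothetical, and only a subset of the entry's lines is embedded.

```python
from collections import defaultdict

# A subset of the lines of the URAD_HUMAN entry shown on the slide.
ENTRY = """\
ID   URAD_HUMAN            Unreviewed;       171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
GN   Name=PRHOXNB;
DR   Ensembl; ENSG00000183463; Homo sapiens.
PE   4: Predicted;
"""


def parse_flat_file(text):
    """Collect the values of each two-letter line code into a list."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 5 or line.startswith("//"):
            continue  # skip short lines and the entry terminator
        code, value = line[:2], line[5:].strip()
        fields[code].append(value)
    return dict(fields)


def is_weakly_supported(fields):
    """True if the PE level is 4 (predicted) or 5 (uncertain)."""
    level = int(fields["PE"][0].split(":")[0])
    return level >= 4


fields = parse_flat_file(ENTRY)
flagged = is_weakly_supported(fields)
```

A filter of this kind is one way to set aside predicted entries such as this one before counting the proteins of a proteome.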


IPI: http://www.ebi.ac.uk/IPI/IPIhelp.html

IPI: Closure !


Automatic approach that builds clusters through combining knowledge already present in the primary data source (UniProtKB, RefSeq, Ensembl) and sequence similarity.

IPI = UniProtKB + RefSeq + Ensembl (+ H-InvDB, TAIR + VEGA).

!!! Complete proteome sets include all alternative splicing sequences…

Available for human, mouse, rat, zebrafish, Arabidopsis, chicken and cow


CCDS


http://www.ncbi.nlm.nih.gov/CCDS/


CCDS (human, mouse)

Combining different approaches (ab initio, by similarity) and taking advantage of the expertise acquired by different institutes, including manual annotation…

Consensus between 4 institutions…


Gene Ontology (GO)


Standards: why are they so important?

• ‘The ever-increasing number of sequencing projects necessitates a standardized system (…) to ensure that the flood of information produced can be effectively utilized.’ (PMID 19577473)

• Standardization of biological data/information (data sharing and computational analysis).

• Aim: extract and compare annotation between different resources or species (semantic similarity).


Secreted or not secreted ?

PMID 19299134


Gene Ontology (GO)

• The Gene Ontology is a controlled vocabulary, a set of standard terms (words and phrases) used for indexing and retrieving information. In addition to defining terms, GO also defines the relationships between the terms, making it a structured vocabulary. Contains ~30’000 terms.


Gene Ontology (GO) terms

biological process

• broad biological phenomena, e.g. mitosis, growth, digestion

molecular function

• molecular role, e.g. catalytic activity, binding

cellular component

• subcellular location, e.g. nucleus, ribosome, origin recognition complex
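What makes GO a structured vocabulary is that each term points to broader terms. The sketch below walks such links to collect all ancestors of a term. The `IS_A` table is a deliberately simplified toy hierarchy made up for illustration, not the actual GO graph, and the function name `ancestors` is hypothetical.

```python
# Toy hierarchy (simplified, NOT the real GO graph): term -> broader terms.
IS_A = {
    "nucleus": ["organelle"],
    "ribosome": ["organelle"],
    "organelle": ["cellular component"],
    "cellular component": [],
}


def ancestors(term, is_a=IS_A):
    """Return every broader term reachable from `term` via is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


nucleus_ancestors = ancestors("nucleus")
```

Thanks to these links, a protein annotated with a specific term can also be retrieved when searching with any of the broader terms.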


GO terms associated with human Erythropoietin


http://www.geneontology.org


Caveats

• Annotation is the process of assigning/mapping GO terms to gene products…

• Electronic vs Manual annotation…
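The electronic versus manual distinction is recorded in the evidence code attached to each GO annotation. As a rough sketch, annotations can be sorted by that code: `IEA` marks electronic annotation, while codes such as `EXP`, `IDA`, `IPI`, `IMP`, `IGI` and `IEP` denote experimental evidence. The annotation tuples below are made-up placeholders and the function name `evidence_class` is hypothetical.

```python
# GO evidence codes; here 'IPI' means 'Inferred from Physical Interaction',
# not the IPI protein database.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
ELECTRONIC_CODES = {"IEA"}


def evidence_class(code):
    """Classify a GO evidence code as experimental, electronic or other."""
    if code in EXPERIMENTAL_CODES:
        return "experimental"
    if code in ELECTRONIC_CODES:
        return "electronic"
    return "other"


# Placeholder annotations: (protein, GO term, evidence code).
annotations = [
    ("protein_A", "term_1", "IDA"),
    ("protein_A", "term_2", "IEA"),
    ("protein_B", "term_3", "IEA"),
    ("protein_B", "term_4", "ISS"),
]
experimental_only = [a for a in annotations if evidence_class(a[2]) == "experimental"]
```

Filtering on the evidence code before an analysis makes it explicit how much of a result rests on computer-assigned terms.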


Example with EPO


Histone H4

!!! Large scale derived data (‘proteome’)


GO terms: essential link between biological knowledge and high-throughput genomic and proteomic datasets…

PMID: 15514041

‘summary of the gene ontology classifications for all mapped ESTs…’


Human proteins functional distribution

Maybe, potentially, putative, expected, probably, hopefully…

~40 % of human proteins have no known function (experimental data), but many more are associated with GO terms (computer-assigned).


All documents (including practicals) are online

http://education.expasy.org/cours/UniProt/