Part I: Identifying sequences with …

Speaker: S. Gaj

Date: 11-01-2005

Annotation

Annotation
• Best possible description available for a given sequence at the current time.

How to annotate?
• Combining:
  • Alignment tools
  • Databases
  • Datamining (scripts)

Microarrays

Introduction

Global alignment
• Optimal alignment between two sequences, containing as many characters of the query as possible.
• Ex: predicting evolutionary relationships between genes, …

Local alignment
• Optimal alignment between two sequences, identifying identical area(s).
• Ex: identifying key molecular structures (S-bonds, α-helices, …)
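The difference between the two alignment types can be sketched in a few lines of Python. This is only an illustration, assuming the classic Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a simple linear gap penalty; the scoring values are arbitrary choices and do not come from the slides.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-1):
    """Return (global_score, local_score) for two sequences.

    Global: Needleman-Wunsch, every character of both sequences is aligned.
    Local:  Smith-Waterman, only the best-matching region is scored.
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    glob = [[0] * cols for _ in range(rows)]
    loc = [[0] * cols for _ in range(rows)]
    # Global alignment pays for leading gaps; local alignment does not.
    for i in range(1, rows):
        glob[i][0] = i * gap
    for j in range(1, cols):
        glob[0][j] = j * gap
    best_local = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + s,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            # Local scores are never allowed to drop below zero.
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])
    return glob[-1][-1], best_local


global_score, local_score = alignment_scores("TTTTACGTTTTT", "GGACGTGG")
print(global_score, local_score)
```

For two sequences that only share a short core (here ACGT), the local score stays high while the global score is pulled down by the unmatched flanks.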


Introduction

Basic Local Alignment Search Tool (BLAST)
• Aligning an unknown sequence (query) against all sequences present in a chosen database, based on a score value.
• Aim: obtaining structural or functional information on the unknown sequence.


Programs

• Different BLAST programs available:

  Query \ Database    Nucleic    Protein
  Nucleic             BlastN     BlastX
  Protein             -          BlastP

• Usable criteria: E-value, Gap Opening Penalty (GOP), Gap Extension Penalty (GEP), …

• Terms:
  • Query: sequence which will be aligned
  • Subject: sequence present in the database
  • Hit: alignment result
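The program table can be read as a simple lookup from (query type, database type) to program name. A minimal sketch, where the function name and the type labels are hypothetical and only the three programs listed on the slide are covered:

```python
# (query type, database type) -> BLAST program, as listed in the table.
BLAST_PROGRAMS = {
    ("nucleic", "nucleic"): "BlastN",
    ("nucleic", "protein"): "BlastX",
    ("protein", "protein"): "BlastP",
}


def choose_program(query_type, database_type):
    """Return the program name, or None for the combination shown as '-'."""
    return BLAST_PROGRAMS.get((query_type, database_type))


print(choose_program("nucleic", "protein"))
```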

Common BLAST problems

• BlastN: sequencing errors

  Clone seq   CGATAGCCCGCCAGGATATA
              ||||||||| ||||||||||
  mRNA        CGATAGCCC-CCAGGATATA

• Solution: low penalty for GOP and GEP (= 1)
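To see why low gap penalties help here, the BlastN example above can be scored by hand. The sketch below assumes an affine gap model in which a gap of length k costs GOP + k × GEP; the match and mismatch values are illustrative assumptions, not settings taken from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-3, gop=1, gep=1):
    """Score two pre-aligned sequences of equal length; '-' marks a gap.

    Assumed affine model: a gap of length k costs gop + k * gep.
    Adjacent gap columns are treated as one gap, whichever row they are in.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            # Opening a gap costs gop + gep; extending it costs gep only.
            score -= gep if in_gap else gop + gep
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


clone = "CGATAGCCCGCCAGGATATA"
mrna = "CGATAGCCC-CCAGGATATA"
print(score_alignment(clone, mrna, gop=1, gep=1))  # low penalties
print(score_alignment(clone, mrna, gop=5, gep=2))  # stricter penalties
```

With GOP = GEP = 1 the single inserted base costs only 2 points against 19 matching positions, so the hit keeps a high score despite the sequencing error.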

Translation Problems

• 6-Frame translation

>embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct...

+1  L A L * P S S Q H E G S H C S G A

All six reading frames of the same sequence:

+1  L A L * P S S Q H E G S H C S G A
+2  * H S D L A V N M K A L I V L G
+3, -1, -2, -3  (the four remaining frames)
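The 6-frame translation can be reproduced with a short script. This is a minimal sketch assuming the standard genetic code; the function names are hypothetical, and only the visible part of the lysozyme sequence is used, since the slide truncates it with "...".

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return the translations for frames +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


fragment = "ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct"
for name, protein in six_frames(fragment).items():
    print(name, protein)
```

Frames +1 and +2 reproduce the translations shown on the slide; '*' marks a stop codon.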


[Figure: Gene X, from gene (exons and introns) to full mRNA, splicing to mRNA, and translation]


[Figure: mRNA with coding and non-coding regions; clones derived from the mRNA]

BlastX against protein sequence: 3 possible hit-situations


• Yields no protein hit (clone from the non-coding region)
• Aligns with protein in 1 of the 6 frames (clone from the coding region)
• Part perfect alignment (clone covering both coding and non-coding region)

Part II: Databases and annotation

Introduction

Primary database:
– DNA sequence (EMBL, GenBank, …)
– Amino acid sequence (SwissProt, PIR, …)
– Protein structure (PDB, …)

Secondary database:
– Derived from primary DB
– DNA sequence (UniGene, RefSeq, …)
– Combination of all (LocusLink, ENSEMBL, …)

Structure:
– Flat file databases

Primary Databases

EMBL:
– DNA sequence
– Human: 4.126.190.851 nucleotides in 292.205 entries
– Clones, mRNA, (Riken) cDNA, …
– New sequences can be submitted by anyone.
– No curation check before admittance.

Primary Databases

SwissProt:
– Amino acid sequence
– Human:
– Contains protein information
– SwissProt (EU), PIR (USA)
– Crosslinks to most informative DBs (PDB, OMIM)
– Part of the UniProt consortium.
– Each addition needs validation by appointed curators.
– Highly curated

Secondary Databases

TrEMBL:
– Translated EMBL
– Hypothetical proteins
– After careful assessment: SpTrEMBL → SwissProt

Secondary Databases

UniGene:
– Automated clustering of sequences with high similarity
– Derived from GenBank / EMBL
– 1 consensus sequence
– Species-specific

Secondary Databases

LocusLink:
– Curated sequences
– Descriptive information about genetic loci

RefSeq:
– Non-redundant set of sequences.
– Genomic DNA, mRNA, protein
– Stable reference for gene identification and characterization.
– High curation

Database Quality?

[Figure: submission routes. DNA / mRNA into EMBL: submitter, then database manager. Protein into SwissProt: submitter, then database manager, then curators.]

How to Annotate?

BlastN against random nucleotide DB
– ESTs

BlastN against structured nucleotide DB (UniGene, RefSeq)
– mRNA hits
– Sometimes not annotated at all
– Best information

Microarrays

Part III: Annotation Techniques

What do we have?

Probe sequence

Alignment Tools (e.g. BLAST)

Databases

!?! What to choose ?!?


Possibilities?

1. Do it like everyone else does.

2. Make use of the curation of certain databases

Goal: Annotate as many genes with as much information as possible (e.g. SwissProt ID)


1st Approach - General

“Done by most array manufacturers”

Step-by-step approach:
– BLAST sequences against a nucleic database (preferably UniGene)
– Extract high quality (HQ) hits (>95%)
– For each HQ hit, search crosslinks.
– Find a well-described (SwissProt) ID for each sequence.
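The steps above can be sketched as a small datamining script. This is only an outline under stated assumptions: hits are taken to be (reporter, UniGene ID, percent identity) tuples already parsed from BLAST output, ">95%" is read as percent identity, and the crosslink table and all IDs are made up for illustration.

```python
def annotate_general(hits, unigene_to_swissprot, min_identity=95.0):
    """Map each reporter to a SwissProt ID via its best high-quality hit.

    hits: iterable of (reporter_id, unigene_id, percent_identity).
    unigene_to_swissprot: crosslink table {unigene_id: swissprot_id}.
    Reporters whose best hit has no crosslink are mapped to None.
    """
    best = {}
    for reporter, subject, identity in hits:
        # Keep only HQ hits, and per reporter only the best one.
        if identity > min_identity and identity > best.get(reporter, (None, 0.0))[1]:
            best[reporter] = (subject, identity)
    return {reporter: unigene_to_swissprot.get(subject)
            for reporter, (subject, _) in best.items()}


# Hypothetical example data.
example_hits = [
    ("rep1", "Mm.1", 99.0),
    ("rep1", "Mm.2", 96.0),
    ("rep2", "Mm.3", 90.0),   # below the threshold, dropped
    ("rep3", "Mm.4", 97.5),   # HQ hit without a SwissProt crosslink
]
crosslinks = {"Mm.1": "SP_A"}
print(annotate_general(example_hits, crosslinks))
```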


1st Approach - Concept


2nd Approach - General

“Make use of present database curation”

Other way around:
– Use SwissProt to clean out EMBL
– Result: “cleaned” EMBL database (cEMBL) with direct SP crosslinks
– BLAST against cEMBL
– Extract high quality alignment hits (>95%)
– Convert EMBL ID to SP ID.
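A possible reading of this second approach, sketched in Python. It assumes that "cleaning" means keeping only the EMBL entries that a SwissProt record crosslinks to; the data structures, function names and IDs are hypothetical, and the BLAST step itself is represented by a list of already parsed hits.

```python
def build_cleaned_embl(embl_records, swissprot_crosslinks):
    """Keep only EMBL entries referenced by a SwissProt record.

    embl_records: {embl_id: sequence}
    swissprot_crosslinks: {swissprot_id: [embl_id, ...]}
    Returns (cleaned_records, embl_to_sp).
    """
    embl_to_sp = {}
    for sp_id, embl_ids in swissprot_crosslinks.items():
        for embl_id in embl_ids:
            # The first SwissProt record that references an entry wins.
            embl_to_sp.setdefault(embl_id, sp_id)
    cleaned = {embl_id: seq for embl_id, seq in embl_records.items()
               if embl_id in embl_to_sp}
    return cleaned, embl_to_sp


def convert_hits(hits, embl_to_sp, min_identity=95.0):
    """Convert HQ hits (reporter, embl_id, percent_identity) to SP IDs.

    Only the first qualifying hit per reporter is kept.
    """
    annotation = {}
    for reporter, embl_id, identity in hits:
        if identity > min_identity and embl_id in embl_to_sp:
            annotation.setdefault(reporter, embl_to_sp[embl_id])
    return annotation


# Hypothetical example data.
embl = {"E1": "ACGT", "E2": "GGCC", "E3": "TTAA"}
crosslinks = {"SP_A": ["E1"], "SP_B": ["E3"]}
cleaned, embl_to_sp = build_cleaned_embl(embl, crosslinks)
print(sorted(cleaned))
print(convert_hits([("r1", "E1", 98.0), ("r2", "E3", 90.0)], embl_to_sp))
```

Because every entry left in cEMBL already carries a SwissProt crosslink, each high quality hit converts directly to an SP ID without a separate crosslink search.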


2nd Approach - Concept


Annotating Incyte Reporters

Total: 13.497

cEMBL-approach: 2.898 (21,47%) SP-IDs

DM approach: 10.013 (74,18%) UG-IDs, of which
– M = 4.723 (34,9%) SP-IDs
– MR = 5.147 (38,1%) SP-IDs
– MRH = 6.641 (49,2%) SP-IDs



All reporters present on “Incyte Mouse UniGene 1” converted
Total: 9.596 reporters

Old annotation: 9.370 (97,6%) UG-IDs, of which
– Non-existing UG-IDs = 5.713 (59,5%)
– M = 1.939 (20,2%) SP-IDs
– MR = 2.096 (21,8%) SP-IDs
– MRH = 2.582 (26,9%) SP-IDs

Datamining approach: 8.532 (88,9%) UG-IDs, of which
– M = 4.145 (43,2%) SP-IDs
– MR = 4.499 (38,1%) SP-IDs
– MRH = 5.576 (60,1%) SP-IDs

Custom EMBL-approach: 2.898 (30,2%) SP-IDs



Combined methods, “Incyte Mouse UniGene 1” reporters
Total: 9.596 reporters

No annotation: 1.062 (11%) reporters
Annotated with SP-ID: 5.895 (61,3%) reporters, of which
– 2.184 (22,7%) identical SP-IDs
– 532 (5%) reporters with improved SP-IDs by EMBL-method
– 174 (1,8%) reporters with different mouse SP-IDs
– 5 reporters found only by EMBL-method


Conclusions

• Annotation is much needed
  – Array sequences can point to different genes

• Direct translation into protein is not the best option:
  – Sequencing errors
  – Addition or deletion of nucleotides
  – 6-frame window

• Public nucleotide databases are redundant.
  – Sequencing errors
  – Differences in sequence length
  – Attachment of vector sequence

Questions?

End
