
Bioinformatics for Proteomics

Shu-Hui Chen (陳淑慧), Department of Chemistry

National Cheng Kung University

DNA (5’ → 3’)
↓ Transcription
mRNA
↓ Splicing
↓ Translation
Polypeptide
↓ Folding
Protein
• Transport / Localization
• Oligomerization
• PTM (Post-Translational Modification)
↓
Function

How do we find protein coding regions, introns and exons in genomic DNA sequences?

Bioinformatics I

What is Proteomics?

Systematic analysis of:
• All protein sequences
• All protein expression patterns
• All protein interactions

This involves:
• Protein isolation
• Protein separation
• Protein identification
• Functional characterization of all proteins

The tools of Proteomics

Traditional protein chemistry assay methods struggle to establish identity.

Identity requires:
• Specificity of measurement (precision): mass spectrometry and MS-based data acquisition algorithms
• A reference for comparison: protein sequence databases and search algorithms

MS-based Proteomics and Bioinformatics

• MS instruments are so far not sensitive enough to resolve the proteins in a biological system based solely on the measured signals.

• MS can, however, acquire enough data to map a protein to a database entry when computer algorithms are used to analyze the data.

• This is the field of bioinformatics.

Instrumentation

[Figure: block diagram of a mass spectrometer showing the sample inlet, ion source, mass analyzer, and data acquisition, with the vacuum system.]

“Bioanalytical Chemistry” Mikkelsen, S.R., published by John Wiley & Sons, Inc.

MS-based Protein Identification

Mass Mapping

Peptide Sequencing

Conventional Methodology: Expression Proteomics

Trypsin Digestion

Trypsin cleaves polypeptides on the C-terminal side of the basic amino acids lysine (K) and arginine (R).

-NH-CH(R1)-CO-NH-CH(R2)-CO-  + H2O  --trypsin-->  -NH-CH(R1)-COOH  +  H2N-CH(R2)-CO-
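The cleavage rule above can be tried out in code. The following is a minimal sketch of an in-silico tryptic digest, not the procedure used by any particular search engine. It assumes the commonly used convention that trypsin does not cut when the next residue is proline (an assumption not stated on the slide), allows no missed cleavages, and uses standard monoisotopic residue masses. The function names `tryptic_digest` and `peptide_mass` are illustrative.

```python
import re

# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # added for a singly protonated ion [M+H]+


def tryptic_digest(sequence):
    """Cut after K or R, except when the next residue is P (assumed rule).

    Splitting on a zero-width pattern requires Python 3.7 or later.
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# First 50 residues of the TRAP-1 sequence shown in the database search slide.
fragment = "RALRRAPALAAVPGGKPILCPRRTTAQLGPRRNPAWSLQAGRLFSTQTAE"
for pep in tryptic_digest(fragment):
    print(f"{pep:20s} [M+H]+ = {peptide_mass(pep) + PROTON:.4f}")
```

The list of theoretical peptide masses produced this way is what a mass-mapping search compares against the measured spectrum.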

[Figure: mass spectrum; x-axis m/z, y-axis ion intensity.]

Mass Spectrometry

Protein identified by database mapping

Automated Database Search

Number 1 match: tumor necrosis factor type 1 receptor-associated protein TRAP-1 (Mr): 76030.27

  1 RALRRAPALA AVPGGKPILC PRRTTAQLGP RRNPAWSLQA GRLFSTQTAE
 51 DKEEPLHSII SSTESVQGST SKHEFQAETK KLLDIVARSL YSEKEVFIRE
101 LISNASDALE KLRHKLVSDG QALPEMEIHL QTNAEKGTIT IQDTGIGMTQ
151 EELVSNLGTI ARSGSKAFLD ALQNQAEASS KIIGQFGVGF YSAFMVADRV
201 EVYSRSAAPG SLGYQWLSDG SGVFEIAEAS GVRTGTKIII HLKSDCKEFS
251 SEARVRDVVT KYSNFVSFPL YLNGRRMNTL QAIWMMDPKD VGEWQHEEFY
301 RYVAQAHDKP RYTLHYKTDA PLNIRSIFYV PDMKPSMFDV SRELGSSVAL
351 YSRKVLIQTK ATDILPKWLR FIRGVVDSED IPLNLSRELL QESALIRKLR
401 DVLQQRLIKF FIDQSKKDAE KYAKFFEDYG LFMREGIVTA TEQEVKEDIA
451 KLLRYESSAL PSGQLTSLSE YASRMRAGTR NIYYLCAPNR HLAEHSPYYE
501 AMKKKDTEVL FCFEQFDELT LLHLREFDKK KLISVETDIV VDHYKEEKFE
551 DRSPAAECLS EKETEELMAW MRNVLGSRVT NVKVTLRLDT HPAMVTVLEM
601 GAARHFLRMQ QLAKTQEERA QLLQPTLEIN PRHALIKKLN HCAQASLAWL
651 SCWWIRYTRT P

Total coverage: 33.4%
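Sequence coverage is the percentage of residues in the protein that fall inside at least one matched peptide. The sketch below shows one simple way to compute it, assuming the matched peptides are given as plain sequence strings. The function name `sequence_coverage` and the toy data are illustrative and are not taken from the search report.

```python
def sequence_coverage(protein, matched_peptides):
    """Percentage of residues covered by at least one matched peptide."""
    covered = [False] * len(protein)
    for pep in matched_peptides:
        if not pep:
            continue
        start = protein.find(pep)
        while start != -1:
            # Mark every residue of this occurrence as covered.
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Toy example: 6 of 10 residues are covered.
print(sequence_coverage("ABCDEFGHIJ", ["ABC", "HIJ"]))
```

Overlapping peptides are counted only once per residue, which is why the function marks positions instead of summing peptide lengths.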

Minimal content of a "protein sequence" database

• Sequences!
• Accession number (AC)
• Taxonomic data
• References
• Annotation / curation
• Keywords
• Cross-references
• Documentation


SWISS-PROT/TrEMBL

• Collaboration between the SIB (CH) and EMBL/EBI (UK)

• SWISS-PROT: a fully (manually) annotated, non-redundant, cross-referenced, documented protein sequence database.

• TrEMBL: generated automatically from annotated EMBL coding sequences (CDS) and annotated using software tools.

http://www.expasy.org/sprot/


ExPASy Web Server

ExPASy = Expert Protein Analysis System

History of MS Searching

• 1993: MOWSE (Molecular Weight Search), by Pappin and Bleasby
• 1994: SEQUEST, by Yates and Eng
• 1996: MOWSE II
• 1997: MOWSE III
• 1998: MASCOT, by Matrix Science

Scoring algorithm

Final score = -10 × log10(P), where P is the absolute probability that the observed match is a random event.

E value (expectation value): the number of hits one can expect to see by chance when searching a database of a particular size. A value of zero indicates that no matches would be expected by chance.

Significant hits at the 95% confidence level (p < 0.05): there is less than a 1 in 20 chance that the observed match is a random event.
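The score formula is easy to check numerically. The sketch below converts between P and the score in both directions. The third function, `expected_random_hits`, uses the simple approximation E = N × P for N independent candidates; that approximation is an assumption made here for illustration and is not the exact E value calculation of any search engine.

```python
import math


def score_from_probability(p):
    """Score = -10 * log10(P), with P the probability of a random match."""
    return -10.0 * math.log10(p)


def probability_from_score(score):
    """Inverse of the score formula."""
    return 10.0 ** (-score / 10.0)


def expected_random_hits(p, n_candidates):
    """Assumed approximation: E = N * P for N independent candidates."""
    return p * n_candidates


print(round(score_from_probability(0.05), 2))   # score at p = 0.05
print(round(score_from_probability(1e-6), 2))   # score at p = 1e-6
print(expected_random_hits(1e-6, 50_000))       # random hits expected among 50,000 candidates
```

Because the scale is logarithmic, every 10 points of score correspond to a tenfold drop in the probability that the match is random.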

[Figure: search results as the mass tolerance is increased; labels 5 and 7.]
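The role of the mass tolerance in mass mapping can be sketched as follows: an observed peptide mass counts as a match when it lies within the tolerance of any theoretical mass. The function name `count_matches` and all mass values are hypothetical and chosen only to show the trend.

```python
def count_matches(observed, theoretical, tolerance_da):
    """Number of observed masses within tolerance_da of any theoretical mass."""
    return sum(
        1
        for obs in observed
        if any(abs(obs - theo) <= tolerance_da for theo in theoretical)
    )


# Hypothetical masses in Da, for illustration only.
observed = [1000.52, 1500.70, 2000.10]
theoretical = [1000.50, 1500.30, 2000.95]

for tol in (0.05, 0.5, 1.0):
    print(tol, count_matches(observed, theoretical, tol))
```

A wider tolerance admits more matches, but it also admits more random ones, so a larger match count does not by itself mean a more reliable identification.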

MS-based Protein Identification

Mass Mapping

Peptide Sequencing

Tandem Mass Spectrometry (MS/MS)

MS/MS acquisition is controlled by software settings.

Protein Identification

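Peptide Sequencing using MS/MS

[Figure: CID of the precursor ion (peptide ABCDEF) breaks it at each peptide bond, giving the fragment pairs A + BCDEF, AB + CDEF, ABC + DEF, ABCD + EF, and ABCDE + F. In the m/z spectrum the fragment series A, AB, ABC, ABCD, ABCDE appears as a ladder of peaks below the precursor ion, and the spacing between adjacent peaks reads out the residues A, B, C, D, E.]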

Nomenclature used for CID peptide fragmentation

Low energy (eV): Q, TOF, FT
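A fragment ladder of this kind can be computed from a peptide sequence. The sketch below assumes the standard naming in which N-terminal fragments are called b ions and C-terminal fragments are called y ions, with singly charged ions and monoisotopic masses; the slide text itself does not name the ion types. The function names are illustrative.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def by_ions(peptide):
    """Singly charged b and y ion m/z values for each peptide bond.

    b_ions[i] is the N-terminal fragment of length i + 1; y_ions[i] is the
    complementary C-terminal fragment produced by the same bond cleavage.
    """
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(RESIDUE_MASS[a] for a in peptide[:i]) + PROTON)
        y_ions.append(sum(RESIDUE_MASS[a] for a in peptide[i:]) + WATER + PROTON)
    return b_ions, y_ions


def ladder_differences(ions):
    """Mass differences between neighboring ions in a ladder."""
    return [hi - lo for lo, hi in zip(ions, ions[1:])]


b, y = by_ions("SAMPLER")
print([round(m, 4) for m in b])
print([round(d, 4) for d in ladder_differences(b)])  # residue masses of A, M, P, L, E
```

Reading a sequence from the differences works because each step in the ladder adds exactly one residue. Leucine and isoleucine have the same mass, so this approach alone cannot tell them apart.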


Protein Identification by Database Search


Sequence Tag Approach for Peptide Sequencing


The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences.

The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.

BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

NCBI BLAST: http://www.ncbi.nlm.nih.gov/blast/

BLAST: Basic Local Alignment Search Tool

Sequence alignments and comparison

1: MYTAILORISRICH
2: MONTAILLEURESTRICHE

Global Alignment

1: MY-TAIL--ORIS-RICH-
   ¦x ¦¦¦¦  x¦x¦ ¦¦¦¦
2: MONTAILLEURESTRICHE

Two Local Alignments

1: TAILO      1: RICH
   ¦¦¦¦x         ¦¦¦¦
2: TAILL      2: RICHE

¦ = Identity
x = Mismatch
- = Insertion / Deletion
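The marker line of an alignment can be generated from the two gapped sequences. The sketch below annotates an alignment that is already given; it does not compute the alignment itself, and it is not how BLAST works internally. The function name `annotate` is illustrative.

```python
def annotate(aligned_1, aligned_2):
    """Build the marker line: ¦ identity, x mismatch, space for a gap."""
    if len(aligned_1) != len(aligned_2):
        raise ValueError("aligned sequences must have the same length")
    marks = []
    for a, b in zip(aligned_1, aligned_2):
        if a == "-" or b == "-":
            marks.append(" ")   # insertion / deletion
        elif a == b:
            marks.append("¦")   # identity
        else:
            marks.append("x")   # mismatch
    return "".join(marks)


seq1 = "MY-TAIL--ORIS-RICH-"
seq2 = "MONTAILLEURESTRICHE"
marks = annotate(seq1, seq2)
print(seq1)
print(marks)
print(seq2)
print("identities:", marks.count("¦"), "mismatches:", marks.count("x"))
```

For the global alignment of the two example sequences this gives 11 identities, 3 mismatches, and 5 gap positions over 19 columns.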


HBA_CHICK   VL-SAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYFPHF-DL  48
HBAD_CHICK  ML-TAEDKKLIQQAWEKAASHQEEFGAEALTRMFTTYPQTKTYFPHF-DL  48
HBPI_CHICK  AL-TQAEKAAVTTIWAKVATQIESIGLESLERLFASYPQTKTYFPHF-DV  48
HBB_CHICK   VHWTAEEKQLITGLWGKV--NVAECGAEALARLLIVYPWTQRFFASFGNL  48
HBE_CHICK   VHWSAEEKQLITSVWSKV--NVEECGAEALARLLIVYPWTQRFFASFGNL  48
HBRH_CHICK  VHWSAEEKQLITSVWSKV--NVEECGAEALARLLIVYPWTQRFFDNFGNL  48
MYG_CHICK   GL-SDQEWQQVLTIWGKVEADIAGHGHEVLMRLFHDHPETLDRFDKFKGL  49

HBA_CHICK   SH-----GSAQIKGHGKKVVAALIEAANHIDDIAGTLSKLSDLHAHKLRV  93
HBAD_CHICK  SP-----GSDQVRGHGKKVLGALGNAVKNVDNLSQAMAELSNLHAYNLRV  93
HBPI_CHICK  SQ-----GSVQLRGHGSKVLNAIGEAVKNIDDIRGALAKLSELHAYILRV  93
HBB_CHICK   SSPTAILGNPMVRAHGKKVLTSFGDAVKNLDNIKNTFSQLSELHCDKLHV  98
HBE_CHICK   SSPTAIMGNPRVRAHGKKVLSSFGEAVKNLDNIKNTYAKLSELHCDKLHV  98
HBRH_CHICK  SSPTAIIGNPKVRAHGKKVLSSFGEAVKNLDNIKNTYAKLSELHCEKLHV  98
MYG_CHICK   KTPDQMKGSEDLKKHGATVLTQLGKILKQKGNHESELKPLAQTHATKHKI  99

HBA_CHICK   DPVNFKLLGQCFLVVVAIHHPAALTPEVHASLDKFLCAVGTVLTAKYR--  141
HBAD_CHICK  DPVNFKLLSQCIQVVLAVHMGKDYTPEVHAAFDKFLSAVSAVLAEKYR--  141
HBPI_CHICK  DPVNFKLLSHCILCSVAARYPSDFTPEVHAEWDKFLSSISSVLTEKYR--  141
HBB_CHICK   DPENFRLLGDILIIVLAAHFSKDFTPECQAAWQKLVRVVAHALARKYH--  146
HBE_CHICK   DPENFRLLGDILIIVLASHFARDFTPACQFAWQKLVNVVAHALARKYH--  146
HBRH_CHICK  DPENFRLLGNILIIVLAAHFTKDFTPTCQAVWQKLVSVVAHALAYKYH--  146
MYG_CHICK   PVKYLEFISEVIIKVIAEKHAADFGADSQAAMKKALELFRNDMASKYKEF  149

HBA_CHICK   ----  141
HBAD_CHICK  ----  141
HBPI_CHICK  ----  141
HBB_CHICK   ----  146
HBE_CHICK   ----  146
HBRH_CHICK  ----  146
MYG_CHICK   GFQG  153

Consensus length: 154; Identity: 19 (12.3%); Similarity: 51 (33.1%)

In the alignment output, '*' marks a position that is perfectly conserved and '.' marks a position that is well conserved.

Multiple Sequence Alignment (MSA)

Programs:

• CLUSTALW

• T_COFFEE

• MULTALIGN
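The identity figure reported for a multiple alignment counts the columns in which every sequence carries the same residue. The sketch below shows that count for an alignment that has already been computed; it does not perform the alignment, which is the job of the programs listed above. The function names and the toy alignment are illustrative.

```python
def conserved_columns(alignment):
    """Indices of columns where all rows share the same non-gap residue."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    return [
        i
        for i in range(length)
        if alignment[0][i] != "-"
        and all(row[i] == alignment[0][i] for row in alignment)
    ]


def percent_identity(alignment):
    """Perfectly conserved columns as a percentage of the alignment length."""
    return 100.0 * len(conserved_columns(alignment)) / len(alignment[0])


# Toy alignment: columns 0, 2 and 4 are perfectly conserved.
toy = ["MKT-A", "MRTLA", "MKTLA"]
print(conserved_columns(toy), percent_identity(toy))
```

Dividing by the alignment length rather than by the length of any single sequence matches the way the summary line above is reported (19 of 154 positions).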


Searching databases with multiple alignments

PSI-BLAST: Position-Specific Iterative BLAST (Altschul et al., 1997)

1. Starting with a single sequence, PSI-BLAST searches a database using BLAST and builds a multiple sequence alignment and a profile.

2. The profile is then used to search the protein database again.

3. Running the program several times can further refine the profile and increase search sensitivity.
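The profile built in step 1 can be pictured as a table of per-column residue scores. The sketch below builds a simple log-odds profile from an alignment and scores a candidate segment against it. It covers only the profile step, not the database search or the iteration, and it assumes a uniform background frequency and a fixed pseudocount, which are simplifications chosen here and not the PSI-BLAST implementation. All names are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_seqs, pseudocount=1.0):
    """One dict of log-odds scores per alignment column."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum of column scores for a segment as long as the profile."""
    if len(segment) != len(profile):
        raise ValueError("segment length must equal profile length")
    return sum(col[aa] for col, aa in zip(profile, segment))


profile = build_profile(["MKTA", "MKSA", "MRTA"])
print(round(score(profile, "MKTA"), 2), round(score(profile, "WWWW"), 2))
```

A segment that follows the conserved pattern scores well above an unrelated one, which is why searching with a profile can pick up distant relatives that a single-sequence search misses.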

Error tolerance search

0.2 Da / 0.2 Da: 32
0.05 Da / 0.05 Da: 27
0.5 Da / 0.5 Da: 33

MS/MS Scan Functions

[Figure: two mass filters, Q1 and Q3, separated by a collision chamber filled with gas (N2). Each filter runs either in mass scan mode, passing the ions m1, m2, m3, m4 in turn, or in single mass transmission mode, passing only m2.]

Scan function                        Q1     Q3
Product Ion Scan (PI)                Fix    Scan
Multiple Reaction Monitoring (MRM)   Fix    Fix
Precursor Ion Scan (PS)              Scan   Fix
Neutral Loss Scan (NL)               Scan   Scan
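The four scan functions differ only in which of Q1 and Q3 is held fixed and which is scanned. The sketch below models each fragmentation as a precursor m/z and product m/z pair and shows what each scan function would report. It assumes singly charged ions, so that a neutral loss is simply the difference between the two m/z values. The class, the function names, and the m/z values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragmentation:
    precursor_mz: float
    product_mz: float


def _close(a, b, tol):
    return abs(a - b) <= tol


def product_ion_scan(events, precursor_mz, tol=0.5):
    """Q1 fixed, Q3 scanned: all products of one chosen precursor."""
    return sorted(e.product_mz for e in events
                  if _close(e.precursor_mz, precursor_mz, tol))


def precursor_ion_scan(events, product_mz, tol=0.5):
    """Q1 scanned, Q3 fixed: all precursors that give one chosen product."""
    return sorted(e.precursor_mz for e in events
                  if _close(e.product_mz, product_mz, tol))


def neutral_loss_scan(events, loss, tol=0.5):
    """Q1 and Q3 scanned with a constant offset: precursors losing 'loss'."""
    return sorted(e.precursor_mz for e in events
                  if _close(e.precursor_mz - e.product_mz, loss, tol))


def mrm(events, precursor_mz, product_mz, tol=0.5):
    """Q1 and Q3 fixed: is one specific transition present?"""
    return any(_close(e.precursor_mz, precursor_mz, tol)
               and _close(e.product_mz, product_mz, tol) for e in events)


# Hypothetical fragmentation events, for illustration only.
events = [Fragmentation(500.0, 402.0), Fragmentation(500.0, 300.0),
          Fragmentation(600.0, 502.0), Fragmentation(700.0, 300.0)]
print(product_ion_scan(events, 500.0))
print(precursor_ion_scan(events, 300.0))
print(neutral_loss_scan(events, 98.0))
print(mrm(events, 500.0, 402.0))
```

In the neutral loss scan both filters move together, which is why it appears as Scan/Scan in the table even though it selects for a single fixed mass difference.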

IP (immunoprecipitation) + MS-based ID for finding protein interaction complexes

Conclusions

Protein identification by MS is a key element of proteomics, and the ID process is an informatics-based methodology.

MS combined with sequence databases represents a huge leap for protein biochemistry: a large-scale analysis approach.

Biochemical manipulation combined with protein ID can provide functional information about proteins.

Bioinformatics tools are needed to link proteomics data to protein interactions and biological pathways.
