Post on 20-Dec-2015
Bioinformatics for Proteomics
Shu-Hui Chen (陳淑慧), Department of Chemistry
National Cheng Kung University
DNA (5' to 3') → Transcription → mRNA → Splicing → Translation → Polypeptide → Folding → Protein
Protein → Function, by way of:
• Transport / Localization
• Oligomerization
• PTM (Post-Translational Modification)
How do we find protein coding regions, introns and exons in genomic DNA sequences?
Bioinformatics I
What is Proteomics?
Systematic analysis of:
• All protein sequences
• All protein expression patterns
• All protein interactions
This involves:
• Protein isolation
• Protein separation
• Protein identification
• Functional characterization of all proteins
The tools of Proteomics
Traditional protein chemistry assay methods struggle to establish identity.
Identity requires:
• Specificity of measurement (precision): mass spectrometry, MS-based data acquisition algorithms
• A reference for comparison: protein sequence databases, search algorithms
MS-based Proteomics and Bioinformatics
• MS instruments are not yet sensitive enough to resolve the proteins in a biological system based solely on the measured signals.
• MS can, however, acquire sufficient data to map a protein against a database when new computer algorithms are used to analyze the data.
• This is the field of bioinformatics.
Instrumentation
[Figure: block diagram of a mass spectrometer showing sample inlet, ion source, mass analyzer, vacuum, and data acquisition]
“Bioanalytical Chemistry” Mikkelsen, S.R., published by John Wiley & Sons, Inc.
MS-based Protein Identification
Mass Mapping
Peptide Sequencing
Conventional Methodology: Expression Proteomics
Trypsin Digestion
Trypsin cleaves polypeptides C-terminal to the basic amino acids lysine (K) and arginine (R).
-NH-CH(R1)-CO-NH-CH(R2)-CO-  → (trypsin) →  -NH-CH(R1)-COOH + H2N-CH(R2)-CO-
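The cleavage rule above can be sketched as an in-silico digest. This is a minimal illustration, not the procedure of any particular search engine: the function name `tryptic_digest` and the `missed_cleavages` parameter are hypothetical, and the "no cleavage before proline" exception is a commonly used rule of thumb that the slide does not state.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Split a protein sequence C-terminal to K or R.

    Assumption (not from the slide): cleavage is suppressed when the
    next residue is proline, a common rule of thumb.
    """
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing piece that does not end in K or R.
        fragments.append(sequence[start:])

    # Optionally join neighbouring fragments to model missed cleavages.
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


if __name__ == "__main__":
    # First 16 residues of the TRAP-1 sequence shown later in the deck.
    print(tryptic_digest("RALRRAPALAAVPGGK"))
```

Setting `missed_cleavages=1` also returns each pair of adjacent fragments joined together, which is how incomplete digestion is usually modelled.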
[Figure: mass spectrum; x-axis m/z, y-axis ion intensity]
Mass Spectrometry
Protein identified by database mapping
Automated Database Search
Number 1 match: tumor necrosis factor type 1 receptor-associated protein TRAP-1 (Mr): 76030.27
1 RALRRAPALA AVPGGKPILC PRRTTAQLGP RRNPAWSLQA GRLFSTQTAE
51 DKEEPLHSII SSTESVQGST SKHEFQAETK KLLDIVARSL YSEKEVFIRE
101 LISNASDALE KLRHKLVSDG QALPEMEIHL QTNAEKGTIT IQDTGIGMTQ
151 EELVSNLGTI ARSGSKAFLD ALQNQAEASS KIIGQFGVGF YSAFMVADRV
201 EVYSRSAAPG SLGYQWLSDG SGVFEIAEAS GVRTGTKIII HLKSDCKEFS
251 SEARVRDVVT KYSNFVSFPL YLNGRRMNTL QAIWMMDPKD VGEWQHEEFY
301 RYVAQAHDKP RYTLHYKTDA PLNIRSIFYV PDMKPSMFDV SRELGSSVAL
351 YSRKVLIQTK ATDILPKWLR FIRGVVDSED IPLNLSRELL QESALIRKLR
401 DVLQQRLIKF FIDQSKKDAE KYAKFFEDYG LFMREGIVTA TEQEVKEDIA
451 KLLRYESSAL PSGQLTSLSE YASRMRAGTR NIYYLCAPNR HLAEHSPYYE
501 AMKKKDTEVL FCFEQFDELT LLHLREFDKK KLISVETDIV VDHYKEEKFE
551 DRSPAAECLS EKETEELMAW MRNVLGSRVT NVKVTLRLDT HPAMVTVLEM
601 GAARHFLRMQ QLAKTQEERA QLLQPTLEIN PRHALIKKLN HCAQASLAWL
651 SCWWIRYTRT P
Total coverage: 33.4%
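A coverage figure like the one above is the fraction of residues in the database sequence that fall inside at least one matched peptide. The sketch below shows that calculation under simple assumptions: the function name `sequence_coverage` is hypothetical, peptides are matched by exact substring search, and the example peptides are made up for illustration (they are not the peptides behind the 33.4% value).

```python
def sequence_coverage(protein, peptides):
    """Return the percentage of protein residues covered by the peptides."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        # Mark every occurrence of the peptide, including overlapping ones.
        start = protein.find(peptide)
        while start != -1:
            for k in range(start, start + len(peptide)):
                covered[k] = True
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)


if __name__ == "__main__":
    # Toy sequence and peptides, for illustration only.
    print(sequence_coverage("ABCDEFGHIJ", ["ABC", "HIJ"]))
```

Residues covered by more than one peptide are counted once, so overlapping matches do not inflate the percentage.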
Minimal content of a "protein sequence" database
• Sequences!
• Accession number (AC)
• Taxonomic data
• References
• ANNOTATION/CURATION
• Keywords
• Cross-references
• Documentation
SWISS-PROT/TrEMBL
• Collaboration between the SIB (CH) and EMBL/EBI (UK)
• SWISS-PROT: fully (manually) annotated, non-redundant, cross-referenced, documented protein sequence database.
• TrEMBL: automatically generated from annotated EMBL coding sequences (CDS) and annotated using software tools.
http://www.expasy.org/sprot/
ExPASy Web Server
ExPASy = Expert Protein Analysis System
History for MS Searching
[Timeline figure; years marked: 1993, 1994, 1996, 1997, 1998]
• MOWSE (Molecular Weight Search), by Pappin and Bleasby
• MOWSE II
• MOWSE III
• SEQUEST (1994), by Yates and Eng
• MASCOT, by Matrix Science
Scoring algorithm
Final score = -10 × log10(P), where P is the absolute probability that the observed match is a random event.
E value (expected value): the number of hits one can expect to see by chance when searching a database of a particular size. A value of zero indicates that no matches would be expected by chance.
Significant hits at the 95% confidence level (p < 0.05): there is less than a 1 in 20 chance that the observed match is a random event.
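To make the score formula concrete, the sketch below converts between probability and score. The function names are hypothetical. The `expect_value` helper uses E ≈ P × (number of candidates), which is a common first-order approximation and an assumption here, not necessarily what any specific search engine computes.

```python
import math


def probability_score(p):
    """Score = -10 * log10(P), as defined on the slide."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)


def score_to_probability(score):
    """Invert the score formula: P = 10 ** (-score / 10)."""
    return 10.0 ** (-score / 10.0)


def expect_value(p, n_candidates):
    """Assumed approximation: expected chance hits = P * candidates searched."""
    return p * n_candidates


if __name__ == "__main__":
    # A match with P = 0.05 sits right at the 95% confidence boundary.
    print(round(probability_score(0.05), 2))
    print(score_to_probability(30.0))
    print(expect_value(1e-6, 100_000))
```

Because the scale is logarithmic, every additional 10 points corresponds to a tenfold drop in the probability that the match is random.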
Increase mass tolerance
[Figure: search results after increasing the mass tolerance; values shown: 5, 7]
MS-based Protein Identification
Mass Mapping
Peptide Sequencing
Tandem Mass Spectrometry: MS/MS
MS/MS acquisition is controlled by software settings.
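Protein Identification
Peptide Sequencing using MS/MS
CID breaks the precursor ion (peptide ABCDEF) at each position along the backbone:
A | BCDEF
AB | CDEF
ABC | DEF
ABCD | EF
ABCDE | F
[Figure: MS/MS spectrum (m/z axis) showing the fragment ladder A, AB, ABC, ABCD, ABCDE up to the precursor ion, with peaks labeled A B C D E]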
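The fragment ladder above can be sketched numerically. The example assumes the widely used b/y ion nomenclature (N-terminal and C-terminal fragments), singly charged ions, and standard monoisotopic residue masses; none of these specifics are stated on the slide, and the function name `fragment_ions` and the example peptide are hypothetical.

```python
# Monoisotopic residue masses in Da (standard reference values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i-1] is the N-terminal fragment of length i,
    y_ions[i-1] is the C-terminal fragment of length i.
    """
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("SAMPLER")  # made-up peptide
    for i, (bm, ym) in enumerate(zip(b, y), start=1):
        print(f"b{i} = {bm:.4f}   y{i} = {ym:.4f}")
```

The difference between consecutive ions in the same series equals one residue mass, which is what allows a sequence to be read from the ladder. Leucine and isoleucine have identical masses, so this approach cannot tell them apart.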
Nomenclature used for CID peptide fragmentation
Low energy (eV): Q, TOF, FT
“Bioanalytical Chemistry” Mikkelsen, S.R., published by John Wiley & Sons, Inc.
Protein Identification by Database Search
Sequence Tag Approach for Peptide Sequencing
“Bioanalytical Chemistry” Mikkelsen, S.R., published by John Wiley & Sons, Inc.
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences.
The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.
BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
NCBI BLAST: http://www.ncbi.nlm.nih.gov/blast/
BLAST: Basic Local Alignment Search Tool
Sequence alignments and comparison
1: MYTAILORISRICH
2: MONTAILLEURESTRICHE

Global Alignment
1: MY-TAIL--ORIS-RICH-
   ¦x ¦¦¦¦  x¦x¦ ¦¦¦¦
2: MONTAILLEURESTRICHE

Two Local Alignments
1: TAILO    RICH
   ¦¦¦¦x    ¦¦¦¦
2: TAILL    RICHE

¦ = Identity
x = Mismatch
- = Insertion / Deletion
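A global alignment like the one above can be computed by dynamic programming. The sketch below uses the Needleman-Wunsch algorithm with a simple match/mismatch/gap scoring scheme. The function name and the scores (+1, -1, -1) are assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties, so the gaps may land in different places than on the slide.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with linear gap penalties.

    Returns (aligned_a, aligned_b, score).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]


if __name__ == "__main__":
    top, bottom, total = global_align("MYTAILORISRICH", "MONTAILLEURESTRICHE")
    print(top)
    print(bottom)
    print("score:", total)
```

A local alignment (Smith-Waterman) differs mainly in clamping each cell at zero and starting the traceback from the highest-scoring cell, which is why it reports only the well-matching regions.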
Multiple Sequence Alignment (MSA)

HBA_CHICK  VL-SAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYFPHF-DL 48
HBAD_CHICK ML-TAEDKKLIQQAWEKAASHQEEFGAEALTRMFTTYPQTKTYFPHF-DL 48
HBPI_CHICK AL-TQAEKAAVTTIWAKVATQIESIGLESLERLFASYPQTKTYFPHF-DV 48
HBB_CHICK  VHWTAEEKQLITGLWGKV--NVAECGAEALARLLIVYPWTQRFFASFGNL 48
HBE_CHICK  VHWSAEEKQLITSVWSKV--NVEECGAEALARLLIVYPWTQRFFASFGNL 48
HBRH_CHICK VHWSAEEKQLITSVWSKV--NVEECGAEALARLLIVYPWTQRFFDNFGNL 48
MYG_CHICK  GL-SDQEWQQVLTIWGKVEADIAGHGHEVLMRLFHDHPETLDRFDKFKGL 49
           .... . ..* . .. * * * *.. .* * * * ..

HBA_CHICK  SH-----GSAQIKGHGKKVVAALIEAANHIDDIAGTLSKLSDLHAHKLRV 93
HBAD_CHICK SP-----GSDQVRGHGKKVLGALGNAVKNVDNLSQAMAELSNLHAYNLRV 93
HBPI_CHICK SQ-----GSVQLRGHGSKVLNAIGEAVKNIDDIRGALAKLSELHAYILRV 93
HBB_CHICK  SSPTAILGNPMVRAHGKKVLTSFGDAVKNLDNIKNTFSQLSELHCDKLHV 98
HBE_CHICK  SSPTAIMGNPRVRAHGKKVLSSFGEAVKNLDNIKNTYAKLSELHCDKLHV 98
HBRH_CHICK SSPTAIIGNPKVRAHGKKVLSSFGEAVKNLDNIKNTYAKLSELHCEKLHV 98
MYG_CHICK  KTPDQMKGSEDLKKHGATVLTQLGKILKQKGNHESELKPLAQTHATKHKI 99
           . *. .. ** .*.. . . .. .. . *.. * ..

HBA_CHICK  DPVNFKLLGQCFLVVVAIHHPAALTPEVHASLDKFLCAVGTVLTAKYR-- 141
HBAD_CHICK DPVNFKLLSQCIQVVLAVHMGKDYTPEVHAAFDKFLSAVSAVLAEKYR-- 141
HBPI_CHICK DPVNFKLLSHCILCSVAARYPSDFTPEVHAEWDKFLSSISSVLTEKYR-- 141
HBB_CHICK  DPENFRLLGDILIIVLAAHFSKDFTPECQAAWQKLVRVVAHALARKYH-- 146
HBE_CHICK  DPENFRLLGDILIIVLASHFARDFTPACQFAWQKLVNVVAHALARKYH-- 146
HBRH_CHICK DPENFRLLGNILIIVLAAHFTKDFTPTCQAVWQKLVSVVAHALAYKYH-- 146
MYG_CHICK  PVKYLEFISEVIIKVIAEKHAADFGADSQAAMKKALELFRNDMASKYKEF 149
           . .... . .* . . ... . .* . .. **.

HBA_CHICK  ---- 141
HBAD_CHICK ---- 141
HBPI_CHICK ---- 141
HBB_CHICK  ---- 146
HBE_CHICK  ---- 146
HBRH_CHICK ---- 146
MYG_CHICK  GFQG 153

Consensus length: 154; Identity: 19 (12.3%); Similarity: 51 (33.1%)
Character to show that a position in the alignment is perfectly conserved: '*'
Character to show that a position is well conserved: '.'
Programs:
• CLUSTALW
• T_COFFEE
• MULTALIGN
Searching databases with multiple alignments
PSI-BLAST: Position-Specific Iterative BLAST (Altschul et al., 1997)
1. Starting with a single sequence, PSI-BLAST searches a database using BLAST and builds a multiple sequence alignment and a profile.
2. The profile is then used to search the protein database again.
3. Running the program several times can further refine the profile and increase search sensitivity.
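The profile idea in the steps above can be illustrated with a very reduced position-specific scoring matrix. This is a sketch only, not PSI-BLAST itself: it assumes a uniform background frequency, a flat pseudocount, no sequence weighting, and ungapped scanning, and all function names are hypothetical. The aligned block is taken from the globin alignment shown earlier in the deck.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Build per-column log-odds scores from equal-length aligned sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for col in range(len(aligned[0])):
        counts = Counter(s[col] for s in aligned if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_window(profile, window):
    """Sum the column scores for one ungapped window of the query."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, window))


def best_hit(profile, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max((score_window(profile, sequence[s:s + width]), s)
               for s in range(len(sequence) - width + 1))


if __name__ == "__main__":
    block = ["DPVNFKLL", "DPVNFKLL", "DPVNFKLL",
             "DPENFRLL", "DPENFRLL", "DPENFRLL"]
    profile = build_profile(block)
    # Made-up query containing a variant not present in any single row.
    print(best_hit(profile, "AAADPVNFRLLGG"))
```

The query window DPVNFRLL matches no row of the block exactly, yet it scores well because each column tolerates the residues seen at that position. That tolerance is what lets a profile find more distant relatives than a single-sequence search.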
Error tolerance search
• 0.2 Da / 0.2 Da: 32
• 0.05 Da / 0.05 Da: 27
• 0.5 Da / 0.5 Da: 33
MS/MS Scan Functions
[Figure: ions m1 to m4 passing through two mass filters separated by a collision chamber filled with N2 gas; each filter runs either in mass scan mode or in single mass transmission mode]

Scan function                        Q1     Q3
Product Ion Scan (PI)                Fix    Scan
Multiple Reaction Monitoring (MRM)   Fix    Fix
Precursor Ion Scan (PS)              Scan   Fix
Neutral Loss Scan (NL)               Scan   Scan
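The Q1/Q3 settings for the four scan functions can be mimicked with a toy model in which each fragmentation event is a (precursor m/z, product m/z) pair. Everything here is hypothetical: the function names, the tolerance, and the m/z values are made up, and a real instrument scans continuously rather than filtering a list.

```python
def product_ion_scan(transitions, q1_mz, tol=0.5):
    """Q1 fixed, Q3 scanned: all products of one chosen precursor."""
    return sorted(prod for prec, prod in transitions if abs(prec - q1_mz) <= tol)


def precursor_ion_scan(transitions, q3_mz, tol=0.5):
    """Q1 scanned, Q3 fixed: all precursors that give one chosen product."""
    return sorted(prec for prec, prod in transitions if abs(prod - q3_mz) <= tol)


def neutral_loss_scan(transitions, loss, tol=0.5):
    """Q1 and Q3 scanned with a constant offset: pairs that differ by `loss`."""
    return sorted((prec, prod) for prec, prod in transitions
                  if abs((prec - prod) - loss) <= tol)


def mrm(transitions, q1_mz, q3_mz, tol=0.5):
    """Q1 and Q3 both fixed: is this one precursor/product pair observed?"""
    return any(abs(prec - q1_mz) <= tol and abs(prod - q3_mz) <= tol
               for prec, prod in transitions)


if __name__ == "__main__":
    # Made-up singly charged transitions.
    observed = [(500.0, 175.1), (500.0, 402.0), (620.3, 175.1), (620.3, 522.3)]
    print(product_ion_scan(observed, 500.0))
    print(precursor_ion_scan(observed, 175.1))
    print(neutral_loss_scan(observed, 98.0))
    print(mrm(observed, 620.3, 175.1))
```

The neutral loss filter works directly on m/z differences, which is only valid for singly charged ions; multiply charged precursors would need the offset divided by the charge.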
IP + MS/ID for identifying protein interaction complexes
Conclusions
Protein identification by MS is a key element of proteomics, and the ID process is an informatics-based methodology.
MS combined with sequence databases represents a huge leap for protein biochemistry: a large-scale analysis approach.
Biochemical manipulation combined with protein ID can provide functional information about proteins.
Bioinformatics tools are needed to link proteomics data to protein interaction and biological pathways.