__________________________________________________________________________________________________ 12/6/2013 GCBA 815
Tools and Algorithms in Bioinformatics GCBA815, Fall 2013
Week-14: Protein Structure and PTM Analysis Tools
Babu Guda
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center
Structural Bioinformatics
Human cancer-related protein (MDM2) with embedded small-molecule drug compounds (“nutlin”). MDM2 is shown as stick figures; “nutlin” is shown as small cyan-colored spheres (van der Waals radii).
Picture taken from BayeNetwork
Binding of Drug compound to a cancer-related protein, MDM2
Structural View of Biology
• The function of a biological macromolecule is highly dependent on its structural conformation
• Deciphering the structure of DNA (double helix) revolutionized biological research
• Similarly, enzyme functions are highly specific and are regulated by the proper orientation of their active sites
• While many proteins act as enzymes, there are a number of structural proteins that support cellular and tissue-level infrastructure and aid in intra- and intercellular communication
Examples
• Actin: Supports the size, shape, structure and motion of cells
• Cadherin: Adhesive proteins that glue cells together
• Clathrin: Vesicular trafficking
• Collagen: About 25% of all protein in our body
• Integrins: On the cell surface, linking cells
• Vaults: Symmetrical shells made of vault proteins
• PDB-101: http://www.pdb.org/pdb/101/structural_view_of_biology.do
The 20 natural amino acids
• Primary structure: The linear amino acid sequence of the polypeptide (PP) chain including post-translational modifications and disulfide bonds.
• Secondary structure: Local structure of linear segments of the PP backbone atoms without regard to the conformation of the side chains.
• Tertiary structure: The three-dimensional arrangement of all atoms in a single PP chain.
• Quaternary structure: The arrangement of separate PP chains (subunits) into the functional protein.
Bovine Mitochondrial F1-Atpase (ATP Synthase Chain Heart Isoform; Ec: 3.6.1.34) Chain α : A, B, C; Chain β: D, E, F; Chain γ: G
Calcium/Calmodulin-Dependent Protein Kinase
Structural Forms of Proteins
Protein Data Bank (PDB) http://www.rcsb.org/pdb
                     Molecule Type
Exp. Method          Proteins  Nucleic Acids  Protein/NA Complexes  Other  Total
X-ray                79224     1496           4125                  4      84849
NMR                  8949      1054           197                   7      10207
Electron Microscopy  493       51             162                   0      706
Other                208       7              8                     14     237
Total                88874     2608           4492                  25     95999
Protein structure data format: PDB
PDB IDs
• Four-character code for the compound, case insensitive (Ex: 2HHB)
• Always starts with a digit, followed by three alphanumeric characters
• Each compound may have multiple chains; a chain ID is denoted by the compound ID followed by ‘:’ and the chain identifier (Ex: 2HHB:A)
• If the compound has only one chain (monomer), ‘_’ denotes the chain position (Ex: 1BBS:_)
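The ID conventions listed above can be checked mechanically. The following is a minimal sketch, assuming only the rules stated on this slide; the names PDB_ID_RE and parse_pdb_id are hypothetical and are not part of any PDB software.

```python
import re

# Rules taken from the slide: a digit, then three alphanumeric characters,
# optionally followed by ':' and a one-character chain identifier
# ('_' marks the only chain of a monomer).
PDB_ID_RE = re.compile(r"^([0-9][A-Za-z0-9]{3})(?::([A-Za-z0-9_]))?$")

def parse_pdb_id(text):
    """Return (entry_id, chain) with the entry ID upper-cased, or None if malformed."""
    match = PDB_ID_RE.match(text.strip())
    if match is None:
        return None
    entry_id, chain = match.groups()
    # IDs are case insensitive, so normalize the entry part to upper case.
    return entry_id.upper(), chain

print(parse_pdb_id("2hhb"))    # entry only, no chain given
print(parse_pdb_id("2HHB:A"))  # entry plus chain A
print(parse_pdb_id("1BBS:_"))  # monomer, chain position '_'
print(parse_pdb_id("HHB2"))    # rejected: does not start with a digit
```

The chain identifier is left in its original case in this sketch.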
• Structural alignment involves establishing equivalencies between residues in two or more proteins based on their 3D-coordinates
• 3-D coordinates from C-α atoms are most commonly used for calculation of distance in structural alignments
Structure Alignments
[Figure: residue equivalences in an example structural alignment (LFKR / IFGR / LFKR / LWGP)]
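To illustrate how C-α coordinates are used to measure distance, here is a minimal sketch that computes the root-mean-square deviation (RMSD) over pairs of equivalent C-α atoms. It assumes the residue equivalences are already known and the two structures are already superimposed; real alignment programs also search for the equivalences and the best superposition, which this sketch does not attempt. The function name and the coordinates are hypothetical.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) C-alpha coordinates.

    Position i in coords_a is assumed to be equivalent to position i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared Euclidean distances between equivalent atoms.
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Made-up coordinates in angstroms, for illustration only.
protein_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
protein_2 = [(0.0, 0.5, 0.0), (3.8, 0.5, 0.0), (7.6, 0.5, 0.0)]
print(ca_rmsd(protein_1, protein_2))
```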
Protein 3-D Visualization Tools
• Jmol (http://jmol.sourceforge.net)
• Simple viewer (PDB)
• Protein workshop (PDB)
• QuickPDB viewer (PDB)
• DeepView - Swiss-Pdb Viewer (http://spdbv.vital-it.ch/)
• PyMOL (http://www.pymol.org)
• KiNG viewer (http://kinemage.biochem.duke.edu/software/king.php)
Visualization of Protein Structures
• All Alpha:
• Haemoglobin – 1BAB
• K+ Channel Protein - 1BL8
• All Beta: Porin - 2POR
• Mixed Alpha-beta: TIM barrel - 1YPI
Educational resources
• PDB: http://www.rcsb.org/pdb
• http://public.csusm.edu/jayasinghe
• Expasy tools: http://expasy.org
Predicting Post-translational Modification (PTM) Sites of Proteins
General Method for PTM site Prediction
• PROSITE provides consensus patterns for a number of PTM sites. PTMs, however, depend on the structural or environmental context of the site in the protein fold
• For this reason, methods based on regular expressions (regex) or local alignment produce a large number of false positives
• Almost all PTM site prediction methods use artificial neural networks (ANNs) or hidden Markov models (HMMs).
• General procedure:
• Prepare datasets with experimentally-known PTM sites
• Separate the dataset into training and testing data
• Train a network using training data and test it with the test dataset. This process is iterated until the model is well refined
• A sufficient number of training sequences and good-quality data are important for the success of any neural network method
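The data-handling part of the general procedure can be sketched as follows. This is a minimal illustration, assuming the dataset is a list of (sequence window, label) pairs; the function names, the 20% test fraction and the placeholder data are hypothetical, and the model itself (ANN or HMM) is left out.

```python
import random

def split_dataset(examples, test_fraction=0.2, seed=0):
    """Shuffle labelled examples and split them into (training, testing) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, test_set):
    """Fraction of test examples for which predict(window) equals the known label."""
    correct = sum(1 for window, label in test_set if predict(window) == label)
    return correct / len(test_set)

# Placeholder examples: (window, is_modified). Real data would come from
# experimentally verified PTM sites.
examples = [(f"window{i}", i % 2 == 0) for i in range(10)]
training, testing = split_dataset(examples)
print(len(training), len(testing))
```

Keeping the test set separate from the training set is what makes the measured accuracy a fair estimate of performance on unseen sequences.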
Different Post-translational modifications (PTMs)
• Glycosylation
• Asn (N)-glycosylation (NetNGlyc)
• O-glycosylation (NetOGlyc)
• Sulfation (Sulfinator)
• Phosphorylation (NetPhos)
• Myristoylation/Palmitoylation (adding a lipid group)
• SUMOylation (ubiquitin-like proteins)
• S-nitrosylation
Prediction of Phosphorylation Sites: NetPhos (http://www.cbs.dtu.dk/services/NetPhos/)
• Protein kinases are a very large family of enzymes that catalyze phosphorylation
• NetPhos produces neural network predictions for serine (S), threonine (T) or tyrosine (Y) phosphorylation sites in eukaryotic proteins; phosphorylation at these sites affects a multitude of cellular signaling processes
• Y-kinase phosphorylation
• S or T phosphorylation by Casein Kinase II
• Since these are very short patterns, the amino acids surrounding a phosphorylated residue are significant in determining whether a particular site can be phosphorylated or not
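Because the surrounding residues matter, predictors typically look at a sequence window centred on each candidate S, T or Y rather than at the residue alone. The sketch below shows one way to extract such windows; the window half-width of 4 and the padding character 'X' are assumptions made for illustration and are not NetPhos parameters.

```python
def candidate_windows(sequence, half_width=4, residues="STY", pad="X"):
    """Return (position, residue, window) for every candidate residue.

    Positions are 1-based. Windows near the termini are padded with `pad`
    so that every window has length 2 * half_width + 1.
    """
    sequence = sequence.upper()
    padded = pad * half_width + sequence + pad * half_width
    windows = []
    for i, aa in enumerate(sequence):
        if aa in residues:
            # In `padded`, residue i sits at index i + half_width,
            # so the slice below is centred on it.
            windows.append((i + 1, aa, padded[i:i + 2 * half_width + 1]))
    return windows

# Made-up peptide, for illustration only.
for position, aa, window in candidate_windows("MKRSPEEYLLA"):
    print(position, aa, window)
```

Each window could then be encoded and passed to a trained classifier that scores how likely the central residue is to be phosphorylated.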
Prediction of Glycosylation Sites (NetNGlyc, NetOGlyc)
• Glycoproteins are formed by covalent attachment of oligosaccharides to certain proteins at asparagine (Asn) residues (N-glycosylation) or at serine or threonine residues (O-glycosylation).
• These are usually exported to extracellular destinations, such as mucin in the alimentary tract or glycoprotein hormones in the anterior pituitary gland.
• N-glycosylation
• O-glycosylation
• No consensus pattern
• SEA domain is associated with it
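For N-glycosylation, a consensus scan is straightforward to sketch. The example below assumes the PROSITE N-glycosylation pattern N-{P}-[ST]-{P} (entry PS00001), which is not spelled out in the slide text; the function name is hypothetical. A match only marks a candidate site, since actual glycosylation depends on structural context that a pattern cannot capture.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} written as a regular expression.
# The lookahead lets overlapping candidate sites be reported.
SEQUON_RE = re.compile(r"(?=(N[^P][ST][^P]))")

def find_sequons(sequence):
    """Return (1-based position of N, matched 4-residue segment) for each candidate."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON_RE.finditer(sequence.upper())]

# Made-up sequence: N at position 2 matches, N at position 7 is followed by P.
print(find_sequons("MNGTAANPSA"))
```

An asparagine within the last three residues of a sequence is never reported by this sketch, because the pattern needs three residues after it.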
Prediction of Sulfation Sites
• Tyrosine (Y) sulfation is an important post-translational modification for proteins that go through the secretory pathway. It regulates several protein-protein interactions and modulates the binding affinity of transmembrane (TM) peptide receptors
• Based on the rules described above, HMMs could be trained to build models for predicting protein sequences with patterns that abide by these rules
Sulfinator Algorithm (http://us.expasy.org/tools/sulfinator/)
• Sulfinator employs four different HMMs to recognize sulfated tyrosines located N-terminally (HMM-N), internally (HMM-I), C-terminally (HMM-C), and in Y-clusters (HMM-Y)
Prediction of protein subcellular localization
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=mcb.figgrp.4668
Protein Sorting in Eukaryotic Cells
ngLOC: An n-gram-based Bayesian method
King and Guda, Genome Biology (2007)
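To give a feel for what an n-gram-based Bayesian classifier does, here is a toy naive Bayes sketch over overlapping sequence n-grams with add-one smoothing. It is not the published ngLOC implementation; the function names, the smoothing scheme and the two training sequences are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def ngrams(sequence, n):
    """All overlapping substrings of length n."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

def train(examples, n):
    """Count n-grams per location label; examples are (sequence, label) pairs."""
    gram_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for sequence, label in examples:
        label_counts[label] += 1
        for gram in ngrams(sequence, n):
            gram_counts[label][gram] += 1
            vocabulary.add(gram)
    return gram_counts, label_counts, vocabulary

def classify(sequence, model, n):
    """Return the label with the highest log posterior under add-one smoothing."""
    gram_counts, label_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        # The extra +1 reserves probability mass for n-grams never seen in training.
        denominator = sum(gram_counts[label].values()) + len(vocabulary) + 1
        score = math.log(count / total)  # log prior of the label
        for gram in ngrams(sequence, n):
            score += math.log((gram_counts[label][gram] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data, invented for this sketch.
toy_examples = [("KKKRKKKR", "NUC"), ("LLLALLLA", "PLA")]
model = train(toy_examples, n=2)
print(classify("KKRK", model, n=2))
```

Working in log space avoids numerical underflow when many small n-gram probabilities are multiplied for a full-length protein.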
Predicting subcellular proteomes using ngLOC
                    Yeast         Worm          Fruitfly      Mosquito      Zebrafish     Chicken       Mouse         Human         RANGE
                    S.cerevisiae  Nematode      D.melano.     A.gambiae     D.rerio       G.gallus      M.musculus    H.sapiens
Proteome Size:      5799          22400         13649         15145         13803         5394          33043         38149
GO annotated:       5486          12357         9997          8847          10106         4363          23744         24638
% ngLOC Coverage:   97.48         94.92         96.73         97.94         98.64         99.82         94.79         94.52         94.79 - 99.82
Proteome Estimated: 5653          21262         13203         14833         13616         5384          31320         36059
% CYT               15.22         14.80         12.74         14.43         15.01         13.66         13.44         14.14         12.74 - 15.22
% CSK               1.07          1.19          1.05          1.11          1.31          1.24          1.50          1.48          1.05 - 1.50
% END               2.71          3.47          2.85          3.25          3.34          2.53          2.99          3.04          2.53 - 3.47
% EXC               8.88          12.60         12.26         14.28         9.91          12.65         11.52         11.71         8.88 - 14.28
% GOL               1.48          1.31          1.40          1.07          1.68          1.47          1.52          1.56          1.07 - 1.68
% LYS               0.11          0.58          0.55          0.53          0.65          0.44          0.59          0.67          0.11 - 0.67
% MIT               9.55          5.84          4.86          5.52          4.72          4.16          4.24          4.80          4.16 - 9.55
% NUC               33.53         29.75         37.38         29.50         30.31         28.24         27.35         28.38         27.35 - 37.38
% PLA               16.19         24.41         20.06         21.36         21.66         22.78         27.18         24.08         16.19 - 27.18
% POX               0.54          0.66          0.42          0.48          0.51          0.25          0.44          0.46          0.25 - 0.66
% Single-Localized  89.29         94.60         93.59         91.53         89.11         87.42         90.77         90.32
% Multi-Localized   10.71         5.40          6.41          8.47          10.89         12.58         9.23          9.68
% CYT-NUC           6.49          2.36          2.76          3.44          5.40          6.27          4.51          4.74