Classification of G Protein-Coupled Receptors using
Machine Learning Techniques
ZIA-UR-REHMAN
PhD thesis
Department of Computer and Information Sciences, Pakistan Institute of
Engineering and Applied Sciences, Nilore, Islamabad, Pakistan
Classification of G Protein-Coupled Receptors using
Machine Learning Techniques
By
Zia-ur-Rehman
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer and Information Sciences
To
Department of Computer and Information Sciences, Pakistan Institute of
Engineering and Applied Sciences, Nilore, Islamabad, Pakistan
2013
ABSTRACT
G protein-coupled receptors (GPCRs) are located at the boundary of a cell and are used for
intercellular communication. They are mostly found in eukaryotic cells, but can also be found in
some prokaryotic cells. GPCRs modulate synaptic transmission in the spinal cord and brain, and
can trigger signaling pathways for the regulation of cell proliferation and gene expression. They
are physiologically very important: according to one estimate, more than 50% of marketed drugs
target GPCRs. Computational prediction of unknown GPCRs is of great importance in
pharmacology because malfunction of GPCRs can cause many diseases. The goal of this thesis
is to propose new methods for the classification of GPCRs using machine learning approaches.
The work in this thesis is divided into two parts. The first part addresses the
classification of GPCRs using machine learning methods. We analyze biological, statistical, and
transform-domain feature extraction strategies, and exploit various physiochemical properties to
generate discriminative features of GPCR sequences. We have developed several GPCR
classification methods. In the first method, GPCRs are predicted using a hybridization of pseudo
amino acid composition and the multi-scale energy representation of physiochemical properties;
here, our focus is on the introduction of various physiochemical properties (hydrophobicity,
electronic and bulk properties). In the second method, GPCRs are predicted using the grey
incidence degree measure and principal component analysis, whereby the relation between
various components of GPCR sequences is exploited. In the third method, we perform weighted
ensemble classification of GPCRs using evolutionary information and multi-scale energy based
features. The weight of each classifier is optimized using a genetic algorithm, which improves
classification performance.
The second part of the thesis is based on multiple sequence alignment of GPCRs, whereby
we utilize their structural information. The three-dimensional structures of several Rhodopsin-
like GPCRs have been resolved at atomic resolution, validating the prediction, made from
sequence information alone, that the GPCR fold comprises a bundle of seven transmembrane
helices (TMs). The dataset is first aligned using multiple sequence alignment methods and the
TMs are extracted. The dataset is composed of 19 sub-families of Rhodopsin receptors,
belonging to 62 species. Weights are assigned to avoid bias toward any particular species.
Position specific scoring
matrices (PSSMs) are computed for the seven TM regions, and pseudocounts are added using the
conventional BLOSUM62 scoring matrix. Unknown receptors are then classified using the
PSSMs of the known receptors and by TM similarity methods.
Our research may make valuable contributions to the fields of bioinformatics, pattern
classification, and computational biology, and has yielded results comparable with existing
approaches. We conclude that our research may help researchers to further explore membrane
protein classification and other subcellular localization problems.
This thesis was carried out under the supervision of
Dr. Asifullah Khan
Associate Professor
Department of Computer and Information Sciences,
Pakistan Institute of Engineering and Applied Sciences,
Islamabad, Pakistan
This work was financially supported by the Higher Education Commission (HEC) of Pakistan
under the Indigenous 5000 Ph.D. Fellowship Program
Pin # 074-1844-PS4-406.
DECLARATION
I declare that all material in this thesis which is not my own work has been identified, and that
no material has previously been submitted and approved for the award of a degree by this or any
other university.
Signature: ____________________
Author’s Name: Zia-ur-Rehman
It is certified that the work in this thesis was carried out and completed under my
supervision.
Supervisor's signature: _____________
Dr. Asifullah Khan
Associate Professor, DCIS, PIEAS, Islamabad
Head of the Department: _____________
Dr. Javaid Khurshid
DCIS, PIEAS, Islamabad
Dedicated to my parents & supervisor
Acknowledgements
I am thankful to Allah Almighty for showering His blessings on me and honoring me with the
strength and determination to accomplish this PhD research work. He helped me during every
phase of my PhD and guided me to take the right decisions.
I am thankful to my loving parents, siblings, cousins (especially Abid Hussain) and other
relatives, who helped me and prayed for me to successfully complete my PhD degree. I am also
thankful to my friends (especially Shozab Mehdi, Khurram Jawad, Mattiullah, Mehdi Hassan,
Adnan Idris, Iqbal Mirza and the PIEAS Bachelor juniors), who were very helpful during my
stay at PIEAS.
I would also like to thank all my beloved teachers (especially Dr. Mutawwarra, Dr. Abdul
Jalil, Dr. Abdul Majid, Dr. Anila Usman, Mr. Fayyaz and Dr. Henri Xhaard), who guided me
through my course work and research. I am very thankful to Dr. Henri Xhaard for helping me
conduct one phase of this PhD research. I am especially thankful to Dr. Asifullah Khan for
supervising me wholeheartedly and inspiring me to conduct this research. He gave his full
devotion and care to supervising my PhD research. Beyond the PhD research itself, he also
assisted me in non-technical affairs and guided me very well.
Finally, I want to thank the Higher Education Commission (HEC) of Pakistan for the financial
support provided through the Indigenous 5000 PhD program, with reference to Pin # 74-1844-PS4-406.
List of Journal Publications
Zia-ur-Rehman and A. Khan, “G Protein-Coupled receptor prediction using pseudo-amino-acid
composition and multi-scale energy representation of different physiochemical properties”,
Anal. Biochem., 412(2), 2011, pp. 173-182 (impact factor: 3.2)
Zia-ur-Rehman, Asifullah Khan, Muhammad Tayyeb Mirza and Henri Xhaard, “Predicting G-
Protein Coupled Receptors Families using Different Physiochemical Properties and Pseudo
Amino Acid Composition”, Methods in Enzymology, 2012 (impact factor: 2.0)
Zia-ur-Rehman, Asifullah Khan, “Prediction of GPCRs with Pseudo Amino Acid Composition:
Employing Composite Features and Grey Incidence Degree Based Classification”, Protein &
Pept. Lett., 18(9), 2011, pp. 872-878 (impact factor: 1.82)
Zia-ur-Rehman, Asifullah Khan, “Identify GPCRs and their types with Chou's pseudo amino
acid composition: an approach from multi-scale energy representation and position specific
scoring matrix”, Protein & Pept. Lett., 19(8), 2012, pp. 890-903 (impact factor: 1.82)
Zia-ur-Rehman, Maiju Rinne, Henri Xhaard, Asifullah Khan, “Re-classification of Rhodopsin-
like receptors using transmembrane helical structures”, to be submitted to the European Journal
of Pharmaceutical Sciences.
Contents
ABSTRACT .................................................................................................................................................................. 2
List of Figures .............................................................................................................................................................. 14
List of Tables ............................................................................................................................................................... 16
List of Abbreviations ................................................................................................................................................... 17
Symbol Table ............................................................................................................................................................... 19
1. INTRODUCTION .............................................................................................................................................. 21
1.1. STRUCTURE OF GPCRS ......................................................................................................................... 21
1.2. GPCR CLASSIFICATIONS AND THEIR SIGNIFICANCE ................................................................... 22
1.3. RESEARCH CONTRIBUTIONS AND OBJECTIVES ............................................................................ 24
1.4. STRUCTURE OF THESIS ........................................................................................................................ 25
2. LITERATURE SURVEY AND THEORY ........................................................................................................ 27
2.1. ALIGNMENT DEPENDENT CLASSIFICATION OF GPCRS ............................................................... 27
2.1.1. Sequence alignment ........................................................................................................................... 27
2.1.1.1 Local and global alignments ......................................................................................................... 28
2.1.1.2 Pairwise alignments ...................................................................................................................... 28
2.1.2. Multiple Sequence Alignment ........................................................................................................... 29
2.1.2.1 Progressive alignments ................................................................................................................. 29
2.1.2.2 Iterative methods ........................................................................................................................... 30
2.1.2.3 Hidden Markov models ................................................................................................................. 30
2.1.2.4 Motif finding algorithms ............................................................................................................... 30
2.1.2.5 Genetic algorithms and simulated annealing methods .................................................................. 32
2.1.3. Protein scoring matrices .................................................................................................................... 32
2.1.3.1 Point accepted mutation (PAM) .................................................................................................... 33
2.1.3.2 Block substitution matrix (BLOSUM) .......................................................................................... 34
2.1.4. Position specific scoring matrices (PSSM)........................................................................................ 35
2.2. ALIGNMENT INDEPENDENT CLASSIFICATION .............................................................................. 36
2.2.1. Machine Learning .............................................................................................................................. 36
2.2.2. Feature Extraction Strategies ............................................................................................. 37
2.2.2.1 Amino Acid Composition ............................................................................................. 38
2.2.2.2 Pseudo Amino Acid Composition ................................................................................. 38
2.2.2.3 Wavelet based multi scale energy features ................................................................... 39
2.2.2.4 Fast Fourier transform based features ........................................................................... 40
2.2.2.5 Split amino acid ............................................................................................................ 41
2.2.2.6 Evolutionary information based features using PSSM .................................................. 42
2.2.3. Classification Algorithms .................................................................................................. 43
2.2.3.1 Nearest Neighbor .......................................................................................................... 43
2.2.3.2 Support vector machines ............................................................................................... 43
2.2.3.3 Probabilistic Neural Network ........................................................................................ 45
2.2.4. Performance Assessment ................................................................................................... 46
2.2.5. Genetic Algorithms ........................................................................................................... 47
2.2.5.1 Initialization .................................................................................................................. 48
2.2.5.2 Selection ........................................................................................................................ 48
2.2.5.3 Genetic operators .......................................................................................................... 49
2.2.5.4 Termination ................................................................................................................... 50
2.3. GPCR DATASETS .................................................................................................................................... 50
3. GPCR PREDICTION BY EMPLOYING PHYSIOCHEMICAL PROPERTIES USING HYBRID
FEATURES ............................................................................................................................................... 52
3.1. PHYSIOCHEMICAL PROPERTIES ......................................................................................... 53
3.2. FEATURE EXTRACTION AND CLASSIFICATION .............................................................. 54
3.3. GPCR-HYBRID ......................................................................................................................................... 55
3.4. RESULTS AND DISCUSSIONS .............................................................................................................. 56
3.4.1. Family Level Classification ............................................................................................................... 56
3.4.1.1 Performance for PseAA2 .............................................................................................................. 57
3.4.1.2 Performance for PseAA3 .............................................................................................................. 57
3.4.1.3 Performance for MSE-PseAA ....................................................................................................... 57
3.4.1.4 Performance using MSE-AA ........................................................................................................ 57
3.4.2. Sub Family Classification.................................................................................................................. 58
3.4.2.1 Performance for PseAA2 .............................................................................................................. 58
3.4.2.2 Performance for PseAA3 .............................................................................................................. 59
3.4.2.3 Performance for MSE-PseAA ....................................................................................................... 59
3.4.2.4 Performance for MSE-AA ............................................................................................................ 59
3.4.3. Sub-sub Family Classification ........................................................................................................... 60
3.4.3.1 Performance for PseAA2 .............................................................................................................. 60
3.4.3.2 Performance for PseAA3 .............................................................................................................. 61
3.4.3.3 Performance for MSE-PseAA ....................................................................................................... 61
3.4.3.4 Performance for MSE-AA ............................................................................................................ 61
3.4.4. Comparison with Selective Top Down Approach ............................................................................. 62
3.4.5. Comparison with other methods ........................................................................................................ 63
4. GPCRs PREDICTION USING GREY INCIDENCE DEGREE MEASURE AND PRINCIPAL
COMPONENT ANALYSIS .......................................................................................................................... 66
4.1. GREY INCIDENCE DEGREE MEASURE .............................................................................................. 67
4.2. PRINCIPAL COMPONENT ANALYSIS ................................................................................................. 68
4.3. RESULTS AND DISCUSSIONS ............................................................................................. 69
4.3.1. Family level classification ................................................................................................................. 70
4.3.2. Sub family level classification ........................................................................................................... 70
4.3.3. Sub-sub family level classification .................................................................................................... 70
4.3.4. Comparison with other methods ........................................................................................................ 71
4.3.4.1 Comparison with Selective top down approach ............................................................................ 71
4.3.4.2 Comparison with other existing methods on D167 and D566 datasets ......................................... 72
5. GPCRs PREDICTION USING GENETIC ALGORITHM BASED ENSEMBLE CLASSIFICATION ............. 74
5.1. CLASSIFICATION ALGORITHM .......................................................................................................... 75
5.2. WEIGHT OPTIMIZATION USING GENETIC ALGORITHM ............................................................... 75
5.3. RESULTS AND DISCUSSIONS .............................................................................................................. 77
5.3.1. Classification performance on D8354 ............................................................................................... 77
5.3.1.1 Family level classification............................................................................................. 77
5.3.1.2 Classification performance at sub family level ............................................................. 78
5.3.1.3 Classification performance at sub-sub family level ...................................................... 78
5.3.2. Comparison with existing approaches on D8354 .............................................................................. 81
5.3.3. Comparison on D167, D365 and D566 datasets ................................................................................ 82
6. ALIGNMENT BASED STRUCTURAL CLASSIFICATION OF GPCRS USING TRANSMEMBRANE
REGIONS .................................................................................................................................................................... 89
6.1. SEVEN MOTIFS OF RHODOPSIN LIKE GPCRS .................................................................................. 90
6.2. POSITION SPECIFIC SCORING MATRIX USING PSEUDO COUNTS .............................................. 91
6.3. EXTREME VALUE DISTRIBUTION (EVD) .......................................................................................... 93
6.4. MOTIF DETECTION ALGORITHM ....................................................................................................... 96
6.5. MULTI DIMENSIONAL SCALING ........................................................................................................ 97
7. CONCLUSIONS AND FUTURE DIRECTIONS .............................................................................................. 99
7.1. ALIGNMENT INDEPENDENT CLASSIFICATION .............................................................................. 99
7.2. ALIGNMENT DEPENDENT CLASSIFICATION ................................................................................ 100
7.3. FUTURE DIRECTIONS ......................................................................................................................... 101
8. REFERENCES ................................................................................................................................................. 102
List of Figures
Figure 1-1: Structure of Rhodopsin receptor ............................................................................................................... 22
Figure 1-2: GPCR classification methods ................................................................................................................... 25
Figure 2-1: Overview of chapter 2 ............................................................................................................................... 28
Figure 2-2: Simple motif based alignment of GPCRs ................................................................................................. 31
Figure 2-3: PAM matrix ................................................................................................................ 34
Figure 2-4: Classification of GPCRs using machine learning ..................................................................................... 38
Figure 3-1: Overview of chapter 3 ............................................................................................................................... 53
Figure 3-2: GPCR-Hybrid web interface ..................................................................................................................... 55
Figure 3-3: Working of GPCR-Hybrid ........................................................................................................................ 56
Figure 3-4: GPCR classification performance for family level in terms of Accuracy, sensitivity and specificity ...... 58
Figure 3-5: GPCR classification performance for family level in terms of MCC and F-Measure .............................. 59
Figure 3-6: GPCR classification performance for sub family level ............................................................................. 60
Figure 3-7: GPCR classification performance for sub-sub family level ...................................................................... 61
Figure 3-8: Comparison with Selective Top Down method ........................................................................................ 62
Figure 3-9: Comparison on D167 dataset .................................................................................................................... 63
Figure 3-10: Comparison on D365 dataset .................................................................................................................. 64
Figure 3-11: Comparison on D566 dataset .................................................................................................................. 64
Figure 4-1: Overview of chapter 4 ............................................................................................................................... 67
Figure 4-2: Overview of GPCR-GID ........................................................................................................................... 69
Figure 4-3: Performance of GID and Euclidean distance methods .............................................................. 70
Figure 4-4: Comparison with selective top down approach ........................................................................................ 71
Figure 4-5: Comparison on D167 ................................................................................................................................ 72
Figure 4-6: Comparison on D566 ................................................................................................................................ 73
Figure 5-1: Overview of chapter 5 ............................................................................................................................... 75
Figure 5-2: Overview of PSE-PSSM method .............................................................................................................. 77
Figure 5-3: GA run for family level............................................................................................................................. 78
Figure 5-4: GA run for subfamily level ....................................................................................................................... 79
Figure 5-5: GA run for sub-subfamily level ................................................................................................................ 80
Figure 5-6: Classification performance on D8354 dataset ........................................................................................... 80
Figure 5-7: Comparison on D8354 dataset .................................................................................................................. 82
Figure 5-8: Classification performance on D365 and D566 datasets ........................................................... 83
Figure 5-9: Comparison on D167 dataset .................................................................................................................... 83
Figure 5-10: GA run for D167 using MSE-PseAA ..................................................................................................... 84
Figure 5-11: GA run for D167 using PSE-PSSM ........................................................................................................ 85
Figure 5-12: Classification performance on D365 and D566 datasets ........................................................................ 85
Figure 5-13: GA run for D365 dataset ......................................................................................................................... 86
Figure 5-14: GA run for D566 ..................................................................................................................................... 87
Figure 5-15: Comparisons on D365 dataset in terms of % accuracy ........................................................................... 87
Figure 5-16: Comparison on D566 .............................................................................................................................. 88
Figure 6-1: Overview of chapter 6 ............................................................................................................................... 90
Figure 6-2: PSSM plot tested on Chemokine PSSM ................................................................................................... 93
Figure 6-3: Plot of pdf for motif-1 of Amine sub family ............................................................................................. 95
Figure 6-4: Plot of E-values for motif-3 Amine sub family ........................................................................................ 96
Figure 6-5: Number of false positives for different E-values ...................................................................................... 96
Figure 6-6: MDS plot based on sequence similarity between various sub families..................................................... 98
List of Tables
Table 2-1: BLOSUM 62 matrix ................................................................................................................................... 34
List of Abbreviations
AA Amino Acid
BLAST Basic Local Alignment Search Tool
BLOSUM Block Substitution Matrix
COV Covariance matrix
DWT Discrete Wavelet Transform
FFT Fast Fourier Transform
GA Genetic Algorithm
GID Grey Incidence Degree
GPCR G Protein-Coupled Receptor
HMM Hidden Markov Model
MAFFT Multiple Alignment with Fast Fourier Transform
MCC Matthews Correlation Coefficient
MEME Multiple EM for Motif Elicitation
MSA Multiple Sequence Alignment
MSE Wavelet based Multi-Scale Energy
MUSCLE Multiple Sequence Comparison by Log-Expectation
NN Nearest Neighbor
PAM Point Accepted Mutation
PCA Principal Component Analysis
PNN Probabilistic Neural Network
POA Partial Order Alignment
PseAA Pseudo amino acid
PseAA2 Pseudo amino acid with two physiochemical properties
PseAA3 Pseudo amino acid with three physiochemical properties
PSI-BLAST Position Specific Iterated BLAST
PSSM Position Specific Scoring Matrix
RBF Radial Basis Function
SAAC Split Amino Acid Composition
SAM Sequence Alignment and Modeling System
SVM Support Vector Machine
T-Coffee Tree-based Consistency Objective Function for Alignment Evaluation
TM Transmembrane
7TM Seven Transmembrane
Symbol Table
a_mk Root mean square energy of the wavelet approximation coefficients at the
mth decomposition level of the kth sequence
d_jk Root mean square energy of the wavelet detail coefficients at the
corresponding jth decomposition level of the kth sequence
D Euclidean distance
f_i Occurrence frequency of amino acid i
G_i GPCR sequence i
g_ij jth amino acid in GPCR sequence i
H, h Physiochemical property function for an amino acid
i, j, k Indices of elements
N Total number of sequences
n, m Total number of elements in a vector
c_i Correlation factors in PseAA
R_i Physiochemical property value of amino acid i
L Length of a GPCR sequence
L Length of a motif of a GPCR sequence
C Classes
S_ij Substitution score for replacing amino acid i with amino acid j
O Grey incidence degree between two sequences
P Occurrence probability of amino acids
q_i Background probabilities
T Final form of extracted features for a GPCR sequence
V Decision function for classifiers
Z(i), z(i) Class prediction by a classifier for GPCR sequence i
1. INTRODUCTION
The cell is the basic functional unit of all living organisms. Organisms can be classified into two
categories, i.e. unicellular or multicellular. Each cell has an outer membrane that protects it from
unwanted substances in the environment. Cells communicate with each other through signaling
pathways. G Protein-Coupled Receptors (GPCRs) provide this cellular communication by
transducing extracellular stimuli into intracellular signals. GPCRs are a family of membrane
proteins found mainly in eukaryotic cells, such as those of yeasts, plants and animals, and also in
some prokaryotes such as bacteria. GPCRs perform various tasks, such as triggering signaling
pathways, regulating gene expression and cell proliferation, controlling the proper reaction of
cells, tissues, organs and organisms to the changing environment, and modulating synaptic
transmission in the brain and spinal cord (Lundstrom & Chiu, 2006). Due to their biological
significance, GPCRs are widely useful for drug discovery: currently, more than 50% of drugs on
the market target GPCRs (Lundstrom & Chiu, 2006) (Bhasin & Raghava, 2004).
1.1.STRUCTURE OF GPCRS
GPCR sequences are polypeptide chains made up of amino acids. Amino acids are the basic
building blocks of proteins. There are 20 amino acids (Salam, 2012), named:
Cysteine (C), Alanine (A), Glutamic acid (E), Aspartic acid (D), Glycine (G), Phenylalanine (F),
Isoleucine (I), Histidine (H), Leucine (L), Lysine (K), Asparagine (N), Methionine (M),
Glutamine (Q), Proline (P), Serine (S), Arginine (R), Valine (V), Threonine (T), Tyrosine (Y), and
Tryptophan (W).
The structure of Rhodopsin-like GPCRs consists of an extracellular N-terminus and an intracellular
C-terminus, as shown in Figure 1-1. They have transmembrane (TM) helical structures passing
through the membrane seven times and are hence called seven-TM (7TM) receptors. The 7TM
structure is connected by three extracellular and three intracellular loops, and the receptor is
coupled to the G protein alpha, beta, and gamma subunits.
Figure 1-1: Structure of Rhodopsin receptor
The sample GPCR sequence belonging to Rhodopsin family and Amine sub family of GPCR is
given below:
>ENSGMOP00000010676_Gmor_/1-1355
NMSVDWDPWFASYIAMEVVIAVLSVLGNVLVVWAVILNRSLRDTTFCFIFSLALADIAV
GSLAIPLAITISIGLQTTFYSCLVGTCTMLVLTQSSILALLAIAIDRYLRVKIPMSYRWVVT
PRRARTAVGLCWLVSFMVGLTPLLGWNKLQHANGTVGSGPEAQMTCTFENIISMDYM
VYFNFLGWVLPPLLLMLLIYIEIFYIIHKHLNKKVTASQAGPRRRQDYGKELKLVKSLAL
VLFLFTVSWLPVHILNCITLFCPKCVEHKKGIRIAILLSHGNSAVNPVVYSFHINKFHTAF
RKIWQQYILCRDPVGKLPQKSGQSGWNHAVRRRHNSKDAHEF.
Throughout this work, our datasets consist of sequences of this type.
1.2.GPCR CLASSIFICATIONS AND THEIR SIGNIFICANCE
GPCRs can be categorized into families in various ways. Based on the similarity of the
transmembrane region, GPCRs are divided into five families: Rhodopsin, Secretin,
Adhesion, Glutamate, and Frizzled (George, O`Dowd, & Lee, 2012). However, on the basis of
sequence homologies, GPCRs are divided into six families, i.e. Rhodopsin, Secretin,
Metabotropic glutamate, Pheromone, Cyclic AMP, and Frizzled receptors (Horn, Bettler, Oliveira,
Campagne, Cohen, & Vriend, 2003). The structure of only Rhodopsin like GPCR has been
solved up until now. It is the biggest family of GPCRs and comprises about 80% of all GPCRs.
Rhodopsin family receptors are activated by many signals, such as peptides, nucleotides, small
monoamines (e.g. adrenaline or dopamine), and odorant molecules; some are activated by
proteolysis, while Rhodopsin itself reacts to light through the activation of a chromophore (Rehman
& Khan, 2011), (Fridmanis, Fredriksson, Kapa, Helgi, & Klovins, 2006). They control paracrine,
autocrine, and endocrine processes and transduce extracellular signals through interaction with
nucleotide-binding proteins. The structure of the Rhodopsin family is described in section 1.1. Its
known receptors are usually further classified into sub families named Amine,
Prostaglandin, Beta, Sog and MCH, Opsin, Meca, Melatonin, Purin, Chemokine, Mas, and
Glycoprotein. Some of these sub families are further divided into two groups, making 19 sub
families in total. Secretin family receptors play an important role in the binding of some parathyroid
hormones and glucagon (Cardoso, Pinto, Vieira, Clark, & Power, 2006) and are mostly found in
animals. Their known receptors can be further classified into three sub families. Metabotropic
family receptors are triggered by metabotropic processes (Das & Banker, 2006) and are involved
in peripheral and central nervous system processes. They control learning capabilities and feelings
of grief and pain. The Pheromone family is involved in chemical interaction in some organisms
(Nakagawa, Sakurai, Nishioka, & Touhara, 2005). It consists of eight different types of
receptors forming three sub groups. Cyclic AMP receptors perform chemotactic signaling in
slime molds (Prabhu & Eichinger, 2006) and control development in Amoeba species.
Frizzled and Smoothened receptors perform Wnt binding (Foord, Jupe, & Holbrook,
2002). There are 10 Frizzled receptors; their function is to control embryonic development, cell
polarity, cell proliferation, and the formation of neural synapses. In the 1980s and 1990s, newly
sequenced GPCRs were first named from their pharmacological properties. With the expansion
of known GPCRs, molecular phylogenetics has been increasingly used to confirm the naming
convention of new sequences according to evolutionary criteria (Rehman, Mirza, Khan, &
Xhaard, 2013). There are several works on the phylogenetic classification of GPCRs, such as
(Moereels, Lewi, Koymans, & Janssen, 1997) and (Fredriksson, Lagerström, Lundin,
& Schiöth, 2003).
Because of the importance of GPCRs, research is being done on their computational
classification. The computational classification of GPCRs can be divided into two
categories, i.e. alignment based classification (phylogenetic analysis) and alignment
independent classification.
1.3.RESEARCH CONTRIBUTIONS AND OBJECTIVES
The classification of GPCRs can help in understanding their functions. Historically, GPCRs were
classified based on their pharmacological response; molecular phylogenetic analysis was then
used to cluster similar sequences together. Nowadays, it is largely agreed that the best
way to classify unknown GPCR sequences is through a phylogenetic analysis that includes
chromosomal mapping. Such studies are, however, difficult and inaccurate over long
evolutionary distances, and with the increasing number of newly discovered GPCRs,
experiment-based classification has become very expensive and infeasible. Hence, the demand for
computational classification has increased. The overall objective of our research is to perform
efficient computational classification of GPCRs. We have analyzed both alignment dependent and
alignment independent classification of GPCRs. We have used various machine learning,
evolutionary, statistical, and alignment algorithms, and adopted the following methods for the
classification of GPCRs:
Hybrid Classification of GPCRs using physiochemical properties
GPCR classification using grey incidence degree measure and Principal Component
Analysis
GPCR classification using ensemble approaches and evolutionary information
Alignment based structural classification of GPCRs using seven TM regions and position
specific scoring matrices.
The block diagram of our research work related to GPCRs classification is shown in Figure 1-2.
Figure 1-2: GPCR classification methods
1.4.STRUCTURE OF THESIS
In chapter 1, we have described GPCRs, their importance in different organisms and their
classifications. In chapter 2, we have given the detailed literature survey of existing GPCR
classification methods. In addition, we have also mentioned some feature extraction strategies
and classification algorithms used in our research. Further, optimization algorithms, machine
learning, alignment based methods, and the details of all the data sets are discussed in chapter 2.
Chapter 3 mainly focuses on the feature extraction of GPCRs using physiochemical properties.
Three physiochemical properties are used in this work: Hydrophobicity, Electronic, and Bulk.
These physiochemical properties are employed using various feature extraction
strategies. Chapter 4 discusses the GPCR classification using Grey incidence degree measure and
Principal Component Analysis (PCA). We use features obtained through Fast Fourier transform,
Split amino acid, and Pseudo amino acid composition. The PCA is employed to reduce the
number of features. Chapter 5 discusses weighted ensemble based approaches for the
classification of GPCRs. The Position Specific Scoring Matrices (PSSM) are used to extract
evolutionary features. The weights are optimized using binary genetic algorithms. Chapter 6
discusses the transmembrane domain based classification and alignment of Rhodopsin like
GPCRs. We also discuss general TM shapes and structures for different sub families of
Rhodopsin like GPCRs and the identification of motifs in GPCRs using PSSM and pseudo
counts. The generalization of this method can help in detecting motifs in other protein families as
well. In Chapter 7, we present the conclusion of the overall research along with our major
achievements, and discuss the future directions and improvements that can be applied to the
proposed methods.
2. LITERATURE SURVEY AND THEORY
In this chapter, we will give a detailed description of the existing GPCR classification and
alignment approaches. We will also explain the algorithms and terminologies used in the present
research. First, we will give an overview of alignment approaches and alignment based
classification of GPCRs. Later, we will explain alignment independent classification of GPCRs and
machine learning approaches. Then we will explain the different feature extraction, classification,
and optimization tools and the different GPCR datasets used in the present research. The layout of
chapter 2 is shown in Figure 2-1.
2.1.ALIGNMENT DEPENDENT CLASSIFICATION OF GPCRS
In alignment based classification methods, an alignment between the sequences is generated first,
and the classification task is then performed using that alignment. Alignment based classification
utilizes the structural information of the GPCR sequences. There are various approaches
which predict GPCRs using their 7TM regions (Inoue, Yamazaki, & Shimizu, 2005).
2.1.1. Sequence alignment
Technically speaking, sequence alignment is simply the re-arrangement of sequences such
that similar regions can be identified between them. These regions of similarity can be
functionally or structurally related. Sequences are placed row-wise, and gaps are often inserted
between the amino acid residues so that identical amino acid letters are aligned in successive
columns, e.g.
AM - - AMTCFGHGAMKCMTCMAK
- MCACMTMMHM - M -CMT - - - - -
Figure 2-1: Overview of chapter 2
Protein alignments usually use a substitution matrix, such as BLOSUM62 or PAM, to assign scores
to amino acid matches or mismatches, and a gap penalty for matching an amino acid in one
sequence to a gap in the other. There are different categories of sequence alignment methods,
such as local, global, and pairwise alignment.
2.1.1.1 Local and global alignments
Local alignment aligns only a small portion of a set of sequences, while global alignment
aligns all residues in a set of sequences. An example of global alignment is the
Needleman–Wunsch algorithm (Needleman & Wunsch, 1970), and an example of local alignment
is the Smith–Waterman algorithm (Smith & Waterman, 1981).
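As a concrete illustration of global alignment, the Needleman–Wunsch recurrence can be sketched in a few lines (a minimal sketch with a toy scoring scheme of match +1, mismatch -1, gap -2; a real protein alignment would instead draw scores from a substitution matrix such as BLOSUM62):

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    # dp[i][j] = best score aligning the prefix a[:i] with the prefix b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap                    # a[:i] aligned against gaps only
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap                    # b[:j] aligned against gaps only
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag,              # align a[i-1] with b[j-1]
                           dp[i - 1][j] + gap,   # gap inserted in b
                           dp[i][j - 1] + gap)   # gap inserted in a
    return dp[len(a)][len(b)]
```

Local alignment (Smith–Waterman) differs mainly in clamping each cell at zero and taking the maximum over the whole table.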
2.1.1.2 Pairwise alignments
Pairwise alignments are performed only between two sequences at a time. Some of the Pairwise
alignment methods are dot matrix, dynamic programming, and word methods (Mount, 2004).
They generate a highly accurate alignment, but if the number of sequences to be aligned
increases, these methods become very time-consuming and computationally expensive.
2.1.2. Multiple Sequence Alignment
Multiple sequence alignment (MSA) is the alignment of a set of sequences at a time. It can be used
to identify structurally or functionally similar regions across the set of sequences. The main
objective of MSA is to maximize the number of matches between the sequences and to minimize
gap insertions and mismatches. MSAs are computationally very expensive and difficult to build.
There are three possibilities at each position of an alignment, i.e. gaps, matches, and
mismatches.
AM -- CMTCFGHGAMKCMTCMAK
- MCACMTMMMM-M -CMT - - - - -
- - - MTMKAN - - - - MT- CM - - - - - -
There are various approaches for performing multiple sequence alignment, i.e. dynamic
programming methods, progressive methods, iterative methods, Hidden Markov Models
(HMM), genetic algorithms, simulated annealing, and motif finding methods. Dynamic
programming can guarantee the optimal multiple sequence alignment, but it is computationally
very expensive if the number of sequences is more than four.
2.1.2.1 Progressive alignments
Progressive alignment starts with a pairwise alignment between the most similar sequences and
progresses towards the most dissimilar ones. Alignments are produced by first putting the
sequences in a tree structure, known as a guide tree, in which the two most similar sequences
become siblings connected to the same ancestor. MSAs are built bottom-up along the guide tree by
pairwise alignments of sets of sequences. The pairwise distance matrix of the sequences is
computed using pairwise sequence alignments, and the distance matrix is used to differentiate
between closely related and distant sequences. The drawback of progressive methods is that a
mistake at any lower tree level propagates to the upper levels and cannot be corrected; also, they do
not guarantee a globally optimal alignment. There are various algorithms implementing
progressive methods, i.e. MAFFT, T-Coffee, and CLUSTALW. MAFFT uses the Fast Fourier
transform (FFT) to locate similar regions (Katoh, Misawa, Kuma, & Miyata, 2002). Although T-
Coffee is slower than CLUSTALW, it provides a more accurate alignment for protein
sequences which are distantly related (Thompson, Higgins, & Gibson, 1994), (Notredame,
Higgins, & Heringa, 2000).
2.1.2.2 Iterative methods
These methods produce an initial alignment by supposition and then iteratively improve the
multiple sequence alignment by minimizing error. The overall quality of the multiple sequence
alignment depends on the initial alignment. Common algorithms for iterative alignment are
DIALIGN-T (Subramanian, M.J., Kaufmann, & Morgenstern, 2005) and MUSCLE (Edgar, 2004).
2.1.2.3 Hidden Markov models
Hidden Markov models (HMMs) are solely based on probabilities (Baum & Petrie, 1966). An
HMM generates an accurate multiple sequence alignment, or a family of multiple sequence
alignments, by assigning likelihoods to all possibilities of gaps, matches, and mismatches. Both
local and global alignments can be generated by HMMs. MSAs in HMMs are represented by a
directed acyclic graph, whose nodes show possible entries in the multiple sequence alignment.
There are two types of states in HMMs, i.e. observed states and hidden states: the observed states
show columns of the alignment and the hidden states show an ancestor sequence. Software used
for implementing HMMs includes partial order alignment (POA) (Grasso & Lee, 2004), the
sequence alignment and modeling system (SAM) (Hughey & Krogh, 1996), and HMMER (Durbin,
Eddy, Krogh, & G., 1998).
2.1.2.4 Motif finding algorithms
From an evolutionary point of view, the hypothetical "common ancestor" of all current GPCRs is
likely to present seven motifs, characteristic of each of the 7TMs. However, for present day
GPCRs, some of these motifs might have mutated and are therefore not detectable anymore. In
addition, a part of the amino acid sequence might have evolved to present the same succession of
letters as the canonical motifs. This is more likely for shorter motifs, and for motifs containing
amino acids with high background frequencies. For example, it is more probable to find the motif
'GN' than 'CWxxPxxxY' in a random part of the sequence, simply because GN is composed of
fewer letters. This makes motif detection a challenging task.
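To make the preceding point concrete, the expected number of chance occurrences of a motif in a random sequence can be estimated as the product of the background frequencies of its fixed letters, summed over all start positions (a rough sketch assuming independent residues with a uniform background of 0.05 per amino acid, which real sequences only approximate):

```python
def expected_chance_hits(motif, seq_len, background=0.05):
    """Expected number of chance matches of a motif (with 'x' wildcards)
    in a random sequence of length seq_len, assuming independent residues
    with a uniform background frequency for each of the 20 amino acids."""
    p = 1.0
    for ch in motif:
        if ch != 'x':                      # 'x' matches any residue
            p *= background
    positions = seq_len - len(motif) + 1   # possible start positions
    return positions * p

# In a 350-residue sequence, a two-letter motif like 'GN' is expected to
# occur by chance far more often than the longer motif 'CWxxPxxxY'.
```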
In the Rhodopsin-like GPCR family, there are seven TM regions, which could also be termed
patterns or motifs. Aligning the common patterns or motifs can result in a multiple sequence
alignment. These motifs should occur sequentially in sequences, i.e. first motif 1, then motif 2,
and so on up to motif 7. It is possible that a sequence contains more than one occurrence of a
particular motif, or that a motif is not present at all. It is also possible that motif 1 is found after
motif 2; in that case, either motif 1 or motif 2 has to be ignored, depending on which is more
appropriate for the alignment. It is a challenging task to identify motifs and to preserve their
sequential order. One simple way to identify or define motifs in protein sequences is to look at a
multiple sequence alignment of sequences belonging to the same family, see the conserved amino
acid regions, and create a consensus of those regions. That conserved region can then help to
identify motifs in a new sequence of the same family. A consensus is not necessarily the right
combination; therefore, a regular expression notation should be adopted for searching for any
particular motif or pattern. The final motif should maximize true positives and minimize false
positives. Commonly used tools for motif finding are BLOCKS (Blocks WWW Server), MEME
(Bailey, Williams, Misleh, & Li, 2006), and MAST.
Figure 2-2: Simple motif based alignment of GPCRs
2.1.2.5 Genetic algorithms and simulated annealing methods
A genetic algorithm is a general-purpose optimization method with many applications in computer
science and bioinformatics. It starts by generating a population of chromosomes (made up of
genes). Every possible alignment can be represented as a chromosome composed of N genes,
where N is the number of sequences to be aligned. The genetic algorithm first generates some
MSAs, called the population of chromosomes, and evaluates the fitness of the population. A fitness
function can be defined based on the number of matching symbols and their locations in the
sequences, the number of gaps, or the sum-of-pairs method, which is used to assess the quality of
an MSA. The algorithm then performs selection, crossover, and mutation operations to improve the
MSA. In crossover, two MSAs are combined to form two new MSAs; some of the MSAs are
mutated. The fitness of the new or edited MSAs is calculated again and all MSAs are ranked by
fitness. The best MSAs are then selected to produce offspring for the next generation, and the
process is repeated until satisfactory solutions evolve or the maximum number of generations is
reached. Genetic algorithms perform better than dynamic programming methods when the number
of sequences is high. Genetic algorithms can be processed in parallel and can take advantage of
parallel computers. Their key advantage over other optimization methods is that they only need a
fitness function to evaluate the quality of different solutions; there is no need to change the inner
workings of the algorithm.
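The sum-of-pairs fitness mentioned above can be sketched as follows (a minimal sketch: each column of a candidate MSA is scored over all pairs of rows with an illustrative match/mismatch/gap scheme rather than a real substitution matrix):

```python
from itertools import combinations

def sum_of_pairs(msa, match=1, mismatch=-1, gap=-2):
    """Fitness of an MSA (a list of equal-length, gapped strings):
    the sum of pairwise scores over every column and every pair of rows."""
    score = 0
    for col in zip(*msa):                      # iterate over columns
        for a, b in combinations(col, 2):      # all row pairs in the column
            if a == '-' and b == '-':
                continue                       # gap-gap pairs are ignored
            elif a == '-' or b == '-':
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score
```

A genetic algorithm would maximize this value over the population of candidate alignments.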
2.1.3. Protein scoring matrices
Protein scoring matrices are used to score the alignment of any possible pair of amino acid residues
from two sequences and are the key elements in assessing the quality of an alignment. They can
also be termed similarity matrices. The amino acids taken from each of the two sequences are
matched and assigned a score from the similarity matrix. Similarity matrices are based on the
substitution probabilities of amino acids and are therefore sometimes called substitution matrices.
The score for an entire match between a pair of sequences is the sum of the scores of the
individually matched amino acids. A substitution matrix shows the rate at which one amino acid
residue in a sequence changes to another
residue. The most commonly used matrices are based on the Dayhoff model (Dayhoff, Schwartz, &
Orcutt, 1978), in which matrices are derived from a large set of protein sequences that are at
least 85% identical. Matrices particular to any sub family of proteins can also be developed
by examining pairwise alignments in a large MSA and extracting the frequency with which
amino acid i mutates into amino acid j. A substitution matrix formed in this
way is effective if the MSA is built from a large number of distantly related sequences.
Various substitution matrices have been proposed for scoring protein sequence alignments,
such as the point accepted mutation (PAM) and block substitution (BLOSUM) matrices. The
substitution matrices are usually presented in log-odds form (Henikoff, S. & Henikoff, J.G.,
1992). In log-odds form, each score in the matrix is the logarithm of an odds ratio: the ratio of
the likelihood of two amino acids appearing together with a biological sense to
the likelihood of the same amino acids appearing by chance. A positive entry in the matrix
shows a pair of amino acids that replace each other more often than expected by chance, and a
negative entry corresponds to a pair of amino acids that replace each other less
often than expected by chance.
2.1.3.1 Point accepted mutation (PAM)
PAM matrices are based on observed mutations. The PAM matrix was developed by Margaret
Dayhoff in 1978 (Dayhoff, Schwartz, & Orcutt, 1978). PAM examines mutations that occur
in closely related protein sequences over short evolutionary distances. The advantage of PAM
matrices over other similarity matrices is that they describe more accurately the changes in
amino acid composition that are expected after a given number of mutations.
There is a series of PAM matrices based on estimated mutation rates, e.g. PAM1,
PAM100, and PAM250. PAM1 corresponds to 1 accepted mutation per 100 amino acid residues,
PAM100 to 100 accepted mutations per 100 residues, and PAM250 to 250
mutations per 100 residues.
Figure 2-3: PAM matrix
2.1.3.2 Block substitution matrix (BLOSUM)
BLOSUM is mostly used for scoring alignments of evolutionarily divergent protein sequences
and is based on log-odds scores (Henikoff & Henikoff, 1992). The three most commonly
used BLOSUM matrices are BLOSUM45, BLOSUM62, and BLOSUM80. BLOSUM45 is
composed from alignments of amino acid sequences that are 45% similar; similarly,
BLOSUM62 is constructed from alignments of sequences with 62% similarity, and
BLOSUM80 from sequences with 80% similarity.
The BLOSUM matrix entry $S_{ij}$ is calculated using the following equation:

$S_{ij} = \frac{1}{\lambda}\log\frac{P_{ij}}{q_i q_j}$    (2.1)

where $P_{ij}$ is the substitution probability of amino acid $i$ with $j$, and $q_i$ and $q_j$ are the background
probabilities of amino acids $i$ and $j$. By rearranging terms, we achieve:

$P_{ij} = q_i q_j \exp(\lambda S_{ij})$    (2.2)

Since the sum of all substitution probabilities is one, the unknown $\lambda$ can be found by solving:

$\sum_{ij} q_i q_j \exp(\lambda S_{ij}) = 1$
where $q_i$, $q_j$ and $S_{ij}$ are already known (Sean, 2004). The BLOSUM62 matrix is shown in Table 2-1.
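The $\lambda$ of equation 2.2 can be recovered numerically; the following sketch solves the normalization constraint by bisection for a hypothetical two-letter alphabet (toy scores and background probabilities, not real BLOSUM values):

```python
import math

def sub_prob_sum(S, q, lam):
    """Left-hand side of the normalization constraint:
    sum over i, j of q_i * q_j * exp(lam * S_ij)."""
    n = len(q)
    return sum(q[i] * q[j] * math.exp(lam * S[i][j])
               for i in range(n) for j in range(n))

def solve_lambda(S, q, lo=0.1, hi=10.0, iters=100):
    """Bisection for the non-trivial lambda where the constraint equals 1.
    Assumes the constraint is below 1 at lo and above 1 at hi, which holds
    when the expected score is negative and some scores are positive."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sub_prob_sum(S, q, mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical two-letter alphabet: matches score +1, mismatches -2,
# with uniform background probabilities (toy values for illustration).
S = [[1, -2], [-2, 1]]
q = [0.5, 0.5]
lam = solve_lambda(S, q)
```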
Table 2-1: BLOSUM 62 matrix
C S T P A G N D E Q H R K M I L V F Y W
C 9 -1 -1 -3 0 -3 -3 -3 -4 -3 -3 -3 -3 -1 -1 -1 -1 -2 -2 -2
S -1 4 1 -1 1 0 1 0 0 0 -1 -1 0 -1 -2 -2 -2 -2 -2 -3
T -1 1 4 1 -1 1 0 1 0 0 0 -1 0 -1 -2 -2 -2 -2 -2 -3
P -3 -1 1 7 -1 -2 -1 -1 -1 -1 -2 -2 -1 -2 -3 -3 -2 -4 -3 -4
A 0 1 -1 -1 4 0 -1 -2 -1 -1 -2 -1 -1 -1 -1 -1 -2 -2 -2 -3
G -3 0 1 -2 0 6 -2 -1 -2 -2 -2 -2 -2 -3 -4 -4 0 -3 -3 -2
N -3 1 0 -2 -2 0 6 1 0 0 -1 0 0 -2 -3 -3 -3 -3 -2 -4
D -3 0 1 -1 -2 -1 1 6 2 0 -1 -2 -1 -3 -3 -4 -3 -3 -3 -4
E -4 0 0 -1 -1 -2 0 2 5 2 0 0 1 -2 -3 -3 -3 -3 -2 -3
Q -3 0 0 -1 -1 -2 0 0 2 5 0 1 1 0 -3 -2 -2 -3 -1 -2
H -3 -1 0 -2 -2 -2 1 1 0 0 8 0 -1 -2 -3 -3 -2 -1 2 -2
R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5 2 -1 -3 -2 -3 -3 -2 -3
K -3 0 0 -1 -1 -2 0 -1 1 1 -1 2 5 -1 -3 -2 -3 -3 -2 -3
M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5 1 2 -2 0 -1 -1
I -1 -2 -2 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4 2 1 0 -1 -3
L -1 -2 -2 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4 3 0 -1 -2
V -1 -2 -2 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4 -1 -1 -3
F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6 3 1
Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7 2
W -2 -3 -3 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11
2.1.4. Position specific scoring matrices (PSSM)
PSSMs are normally calculated from blocks of amino acids. Blocks are the highly conserved,
aligned, un-gapped portions of an MSA of amino acid sequences. The length of a PSSM is the
same as the width of the block, and it has 20 rows, one for each amino acid. The PSSM is used to
score alignments of blocks of amino acid or DNA sequences. It estimates the probabilities of
amino acids appearing at each position of the block. The scores in a column of the PSSM are based
on the frequencies of the amino acids observed in the corresponding column of the block; the
more frequently occurring amino acid gets a higher score.
The simplest representation of a PSSM is calculated from a multiple sequence alignment. Each
column of the alignment can be represented as a column vector of 20 values. These 20 entries
contain the observed frequencies of the 20 amino acids in the multiple sequence alignment.
These observed frequencies can be an imperfect representation of a position, because the
observed sequences are just a subset of the full set of related sequences. For some amino acids
the observed frequency can be zero, if that amino acid is missing in the column, and this can affect
performance. PSSMs are normally used to identify motifs in amino acid sequences (Ben, et
al., 2005). Without information about those missing amino acids, we cannot effectively identify
motifs or patterns in amino acid sequences. There are various ways to handle this situation. One
such solution is to model missing amino acids by adding pseudo-counts to the observed
frequency count vector.
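A minimal sketch of estimating one PSSM column with pseudo-counts from a block of aligned sequences (a simple additive pseudo-count is used here; more elaborate background-weighted pseudo-counts are also common):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm_column(column, pseudo=1.0):
    """Estimate amino acid probabilities for one column of a block,
    adding a pseudo-count so unobserved residues get non-zero mass."""
    counts = {aa: pseudo for aa in AMINO_ACIDS}   # start from pseudo-counts
    for residue in column:
        counts[residue] += 1                      # add observed counts
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}

# One column of a block of four aligned sequences: 'A' thrice, 'G' once.
probs = pssm_column("AAAG")
```

Without the pseudo-count, the 18 unobserved residues would get probability zero and any sequence containing them could never match the motif.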
2.2.ALIGNMENT INDEPENDENT CLASSIFICATION
Classification of GPCRs or protein sequences can also be performed using alignment
independent methods. Molecular phylogenetic tree analyses are fully dependent on MSAs, and as
the number of sequences to identify grows, the computational cost of alignment methods increases
exponentially. Alignment independent methods are useful and much faster. In alignment
independent methods, we need a feature extraction strategy and a classification
algorithm; the classification algorithm is trained and evaluated on a dataset once the feature
extraction is completed. Classification based on physiochemical or biochemical properties of
sequences can effectively predict the families of GPCRs. During the last few years, various
systems have been proposed for annotating the functions of GPCRs automatically by
exploiting their physiochemical or biochemical properties in a fast and efficient way. Various
statistical and machine learning methods have also been proposed in this regard, e.g. Bayesian
classification (Lundstrom & Chiu, 2006), SVM (Bhasin & Raghava, 2004), (Bhasin &
Raghava, 2005), (Karchin, Karplus, & Haussler, 2002), (Guo, et al., 2005) and Hidden Markov
models (Möller, Vilo, & Croning, 2001), (Papasaikas, Bagos, Litou, & Hamodrakas, 2003),
(Martelli, Fariselli, Malaguti, & Casadio, 2002), (Rehman & Khan, 2011), (Davies, Secker,
Freitas, Mendao, Timmis, & Flower, 2007). There are various online classification servers
available (Davies, BIAS-PROFS) and (Rehman Z., GPCR prediction, 2011).
2.2.1. Machine Learning
Machine learning is the category of algorithms that improve automatically with experience. The
process starts with learning and ends with testing: learning is acquired from training data,
and testing is applied to unseen new data. A machine learning method exploits various features of
the training data in the specification phase and makes intelligent decisions based on the training
data; its ability to categorize testing data is called generalization. There can be many features in the
available training data, and the selection of optimal features is also a machine learning task. There
are various machine learning based classification algorithms, such as SVM, Nearest Neighbor (NN),
Grey incidence degree measure (GID), and ensemble approaches. After classification, the method
can be further optimized and validated using jackknife or independent testing methods. After
validation, the performance of the method is assessed and compared with existing methods. The
machine learning based classification of GPCRs has the phases shown in Figure 2-4.
2.2.1. Feature Extraction Strategies
Classification algorithms usually require GPCR sequences in numeric form in order to classify
them. The numeric form of a sequence can be obtained using any physiochemical property. The
numeric form of the overall sequence can be very large and can vary in size, because sequences
can be long and of variable length. Hence, we reduce dimensionality by computing some useful
properties from this numeric GPCR sequence. These properties are called features, and the process
of computing them is called feature extraction. We have extracted features using amino acid
composition, pseudo amino acid composition, the fast Fourier transform, wavelet based multi scale
energy, split amino acid composition, and evolutionary information based methods. Let us consider
a GPCR sequence $G_1$ containing $n$ amino acids. After converting it to numeric form, it is
mathematically represented as:
$\mathbf{G}_1 = [\,g_{11}, g_{12}, \ldots, g_{1n}\,]$    (2.3)
where $g_{11}$ is the amino acid at residue 1 in sequence $G_1$, $g_{12}$ is the amino acid at position 2, and
similarly $g_{1n}$ represents the last amino acid, at position $n$, of sequence $G_1$.
Figure 2-4: Classification of GPCRs using machine learning
2.2.1.1 Amino Acid Composition
Amino acid (AA) composition is simply the frequency of occurrence of each amino acid in the
sequence (Elrod & Chou, 2002). It captures only the composition, discarding sequence order
information, and is given by:
$\mathbf{T} = [\,f_1, f_2, \ldots, f_{20}\,]$    (2.4)
where $f_i$ is the occurrence frequency of the $i$th amino acid and $\mathbf{T}$ is the numeric form of the
sequence.
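Equation 2.4 can be sketched directly, using the one-letter amino acid codes listed in section 1.1 and normalizing the counts by the sequence length:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    """20-dimensional feature vector of normalized amino acid frequencies."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence:
        if residue in counts:          # ignore non-standard letters
            counts[residue] += 1
    length = len(sequence)
    return [counts[aa] / length for aa in AMINO_ACIDS]
```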
2.2.1.2 Pseudo Amino Acid Composition
Pseudo amino acid (PseAA) composition is a more accurate feature extraction strategy than simple
AA composition (Chou, 2001), (Qiu, Huang, Liang, & Lu, 2009), because it also accounts for the
length and order of the sequence. The first 20 elements of the PseAA composition are the same as
in the AA feature vector, but it contains additional elements $(c_{21}, \ldots, c_{20+\lambda n})$ to account for the
sequence order of a protein. PseAA is mathematically represented as:
$\mathbf{T} = [\,f_1, f_2, \ldots, f_{20}, c_{21}, \ldots, c_{20+\lambda n}\,]$    (2.5)
where $n$ is the number of physiochemical properties used and $\lambda$ is the number of tiers, usually
between 1 and 21. The tiers are computed using correlation factors, and the number of correlation
factors depends on the number of physiochemical properties used. First tier correlation factors
couple the most contiguous residues along the protein chain, second tier correlation factors couple
the second most contiguous residues, and so on (Chou, 2001). These tiers capture the sequence
order information contained in the correlation factors by employing physiochemical properties. In
the case of two physiochemical properties (Hydrophobicity and Electronic), the correlation factors
are given by equations 2.6 and 2.7:
$\tau_1 = \frac{1}{L-1}\sum_{i=1}^{L-1} H^{1}_{i,i+1} \qquad \tau_2 = \frac{1}{L-1}\sum_{i=1}^{L-1} H^{2}_{i,i+1}$
$\tau_3 = \frac{1}{L-2}\sum_{i=1}^{L-2} H^{1}_{i,i+2} \qquad \tau_4 = \frac{1}{L-2}\sum_{i=1}^{L-2} H^{2}_{i,i+2}$
$\qquad\vdots$
$\tau_{2\lambda-1} = \frac{1}{L-\lambda}\sum_{i=1}^{L-\lambda} H^{1}_{i,i+\lambda} \qquad \tau_{2\lambda} = \frac{1}{L-\lambda}\sum_{i=1}^{L-\lambda} H^{2}_{i,i+\lambda}$    (2.6)

$H^{1}_{i,j} = h^{1}(R_i)\cdot h^{1}(R_j), \qquad H^{2}_{i,j} = h^{2}(R_i)\cdot h^{2}(R_j)$    (2.7)
where $L$ is the length of the GPCR sequence, $\tau_1$ is the first tier correlation factor based on the
Hydrophobicity property, $\tau_2$ is the first tier correlation factor based on the Electronic property,
$\tau_3$ is the second tier correlation factor using the Hydrophobicity property, $\tau_4$ is the second tier
correlation factor using the Electronic property, and so on. $h^{1}(R_i)$ is the Hydrophobicity value of
amino acid $i$ and $h^{2}(R_i)$ is the Electronic value of amino acid $i$ (any other physiochemical
property can be used in place of the Hydrophobicity/Electronic property and will be represented
as $R_i$).
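A sketch of the correlation factors of equations 2.6 and 2.7, using illustrative made-up property tables in place of the real Hydrophobicity and Electronic scales:

```python
def correlation_factor(sequence, prop, tier):
    """Tau for one physiochemical property and one tier: the average
    product of property values of residues 'tier' positions apart."""
    L = len(sequence)
    vals = [prop[aa] for aa in sequence]
    return sum(vals[i] * vals[i + tier] for i in range(L - tier)) / (L - tier)

def pseaa_features(sequence, props, max_tier):
    """AA composition (20 values) followed by max_tier * len(props)
    correlation factors, as in equation 2.5."""
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    comp = [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    taus = [correlation_factor(sequence, p, t)
            for t in range(1, max_tier + 1) for p in props]
    return comp + taus

# Illustrative property tables (toy numbers, not the FH scale):
hydro = {aa: 0.1 * i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
electronic = {aa: 1.0 for aa in "ACDEFGHIKLMNPQRSTVWY"}
features = pseaa_features("ACDCA", [hydro, electronic], max_tier=2)
```

With two properties and two tiers the feature vector has 20 + 4 elements, matching the $20 + \lambda n$ dimensions of equation 2.5.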
2.2.1.3 Wavelet based multi scale energy features
The discrete wavelet transform (DWT) can be used to represent a signal in the transform domain
(Qiu, Huang, Liang, & Lu, 2009). The DWT can be implemented in many different ways;
however, we have used Mallat's fast algorithm, in which the signal is decomposed
into several levels. At each level, approximation and detail coefficients are obtained by low
pass and high pass filters, respectively.
First, each sequence is converted into a numeric digital signal using Hydrophobicity values
(the FH scale) (Fauchere & Pliska, 1983). Then, the Haar transform of the digital signal is
computed. In the third step, the approximation and detail coefficients are computed at various
decomposition levels. The maximum decomposition level for a particular sequence is equal to
log2(length of sequence) and is denoted by m. For some sequences, zero padding is used to keep
the feature vector size consistent across all sequences. The resultant feature vector obtained in
this way is named the multi scale energy (MSE) vector (Shi, Zhang, Pan, Cheng, & Xie, 2007). The MSE feature vector of
(m+1)-Dimensions is formed as:
T_k = \left[ d_1^k, d_2^k, ..., d_j^k, ..., d_m^k, a_m^k \right]          (2.8)

where k = 1, 2, ..., N and N is the total number of sequences, a_m^k is the root mean square
energy of the wavelet approximation coefficients at the m-th decomposition level, and d_j^k is
the root mean square energy of the wavelet detail coefficients at the corresponding j-th
decomposition level:

d_j^k = \left\{ \frac{1}{N_j} \sum_{n=0}^{N_j - 1} \left[ u_j^k(n) \right]^2 \right\}^{1/2}          (2.9)

a_m^k = \left\{ \frac{1}{N_m} \sum_{n=0}^{N_m - 1} \left[ v_m^k(n) \right]^2 \right\}^{1/2}          (2.10)

where N_m is the number of approximation coefficients, N_j is the number of detail coefficients,
v_m^k(n) is the n-th approximation coefficient at the m-th level, and u_j^k(n) is the n-th detail
coefficient at the j-th decomposition level.
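A minimal sketch of the MSE feature vector (eqs. 2.8-2.10) using a hand-rolled Haar decomposition; the hydrophobicity values below are placeholders, not the FH scale:

```python
# MSE features via a simple Haar DWT; scale values are hypothetical.
import math

FH = {"A": 0.31, "L": 1.70, "G": 0.00, "K": -0.99}  # placeholder scale

def haar_step(signal):
    """One Haar decomposition step: (approximation, detail) coefficients."""
    a = [(signal[2*i] + signal[2*i+1]) / math.sqrt(2) for i in range(len(signal)//2)]
    d = [(signal[2*i] - signal[2*i+1]) / math.sqrt(2) for i in range(len(signal)//2)]
    return a, d

def rms(coeffs):
    """Root mean square energy, as in eqs. 2.9-2.10."""
    return math.sqrt(sum(c * c for c in coeffs) / len(coeffs))

def mse_features(seq, levels):
    """[d_1, ..., d_m, a_m] energies of eq. 2.8 for one sequence."""
    signal = [FH[r] for r in seq]
    signal += [0.0] * ((2 ** levels) - len(signal))   # zero-pad to 2^levels
    feats = []
    for _ in range(levels):
        signal, detail = haar_step(signal)
        feats.append(rms(detail))
    feats.append(rms(signal))                          # a_m of the last level
    return feats

print(mse_features("ALGKALGK", 3))
```

With m decomposition levels the vector has m detail energies plus one approximation energy, i.e. (m+1) dimensions as stated above.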
2.2.1.4 Fast Fourier transform based features
Fast Fourier transform (FFT) is an efficient way of implementing discrete Fourier transform
(DFT) algorithm (Guo, et al., 2005). It reduces the computational cost of the DFT from O(N^2)
to O(N log N). The FFT requires numerical input; it decomposes a set of values into different
frequency components. We have converted each GPCR sequence into numeric form using the
hydrophobicity property of proteins and then normalized the values using the following equation:

\hat{T}_i = \frac{T_i - \bar{T}}{\sigma}          (2.11)
where T_i is the numeric form of GPCR sequence i, \bar{T} is the average (mean) of T_i, and
\sigma is the standard deviation from the mean. Since the size of sequences varies, T_i will also
have variable sizes. The FFT has two benefits over some other feature extraction strategies.
First, it keeps the length of the feature vector consistent for all sequences. Secondly, some
features which are difficult to extract in the spatial domain can easily be extracted in the
transform domain. Sequences belonging to one class bear differences with sequences of other
classes, and these differences can easily be analyzed using the FFT.
The FFT of T_i is computed using eq. 2.12. After that, we have applied the power spectral
density (PSD) over 256 evenly spaced frequency points. The PSD is the squared magnitude of the
Fourier transform divided by the total number of frequency points. The formulas for the FFT and
PSD are given by the following equations:

X_k = \sum_{i=0}^{N-1} X_i \, \omega^{ik}          (2.12)

PSD = \frac{|X|^2}{n}          (2.13)

where \omega = \exp(-2\pi\sqrt{-1}/N) is an N-th root of unity and n is the number of frequency
points. The feature vector formed using the FFT and PSD in this case is of length 256, with
components given by eq. 2.14:

PSD_k = \frac{1}{n} \left| \sum_{i=0}^{N-1} T_i \, \omega^{ik} \right|^2          (2.14)
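The normalization and PSD computation of eqs. 2.11-2.14 can be sketched with a naive DFT (an FFT library would be used in practice). The scale values are placeholders and n_points is kept small here, whereas the thesis uses 256:

```python
# Sketch of the FFT/PSD features; naive DFT for clarity, placeholder scale.
import cmath

FH = {"A": 0.31, "L": 1.70, "G": 0.00, "K": -0.99}  # hypothetical scale

def psd_features(seq, n_points=8):
    t = [FH[r] for r in seq]
    mean = sum(t) / len(t)
    sd = (sum((x - mean) ** 2 for x in t) / len(t)) ** 0.5
    t = [(x - mean) / sd for x in t]                 # eq. 2.11 normalization
    feats = []
    for k in range(n_points):                        # eqs. 2.12-2.14
        xk = sum(t[i] * cmath.exp(-2j * cmath.pi * i * k / len(t))
                 for i in range(len(t)))
        feats.append(abs(xk) ** 2 / n_points)
    return feats

print(psd_features("ALGKALGK"))
```

Note that the zeroth component is (numerically) zero after mean subtraction, since X_0 is just the sum of the normalized signal.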
2.2.1.5 Split amino acid
The GPCRs have peptides at their N and C terminus regions, which are very informative. Split
amino acid composition (SAAC) helps in extracting N and C terminus information from the GPCR
sequence (Afridi, Khan, & Lee, 2012), (Chou & Shen, 2006). The GPCR sequence is split into 3
parts and the amino acid composition of each part is computed independently. The first part
consists of the 20 amino acids at the N terminus, the second part contains the 20 amino acids at
the C terminus, and the third part contains the amino acids lying between the two termini. The
overall size of the SAAC feature vector in our research is 60.
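The three-part split described above can be sketched as follows; the 20-residue terminus length matches the text, while the toy sequence is only for illustration:

```python
# Minimal SAAC sketch: per-part amino acid composition over the 20 standard
# residues, giving a 60-dimensional vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(part):
    """Fraction of each of the 20 amino acids in a sequence fragment."""
    n = max(len(part), 1)
    return [part.count(a) / n for a in AMINO_ACIDS]

def saac_features(seq, term=20):
    n_part, c_part = seq[:term], seq[-term:]
    middle = seq[term:-term] if len(seq) > 2 * term else ""
    return composition(n_part) + composition(middle) + composition(c_part)

vec = saac_features("MAL" * 30)   # toy 90-residue sequence
print(len(vec))                    # 60
```

Each of the three compositions sums to 1, so the whole vector sums to 3 for any sequence longer than 40 residues.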
2.2.1.6 Evolutionary information based features using PSSM
The evolution of GPCRs results in biological changes in GPCR sequences, e.g. deletions,
insertions or mutations of some amino acid residues. These changes, if analyzed, can help in the
classification of GPCRs. We have mathematically described this evolutionary information using
the Position Specific Scoring Matrix (PSSM) (Schaffer, et al., 2001), (Chou & Shen, 2010). A
PSSM shows the probabilities of substitution of one amino acid into another. We have computed
PSSMs using PSI-BLAST (Institute). For each sequence, we manually submitted the sequence
through the PSI-BLAST interface, selected the appropriate database, and ran 2 iterations of
PSI-BLAST to obtain the PSSM, which is saved to a file and further processed. The PSSM for each
sequence has a number of rows equal to the length of the sequence, with each row giving the
scores for the 20 amino acids:
P = \begin{bmatrix}
E_{1,1} & E_{1,2} & \cdots & E_{1,20} \\
E_{2,1} & E_{2,2} & \cdots & E_{2,20} \\
\vdots  & \vdots  &        & \vdots  \\
E_{L,1} & E_{L,2} & \cdots & E_{L,20}
\end{bmatrix}          (2.15)
where E_{i,j} is the score of the amino acid residue at the i-th position of the GPCR sequence
being substituted by amino acid j. The search threshold for PSI-BLAST is set to 0.001. Next, we
have normalized each E_{i,j} to zero mean and unit standard deviation:

\hat{E}_{i,j} = \frac{E_{i,j} - \bar{E}_i}{SD(E_i)}          (2.16)
where SD(E_i) is the standard deviation and \bar{E}_i is the mean of the scores of the i-th amino
acid residue in a GPCR sequence. Next, we have merged the PSSM features with the pseudo amino
acid formulation to represent a GPCR sequence as shown in eq. 2.17:

T = \left[ \bar{E}_1, \bar{E}_2, ..., \bar{E}_{20}, E'_1, E'_2, ..., E'_{20} \right]^T          (2.17)

where

\bar{E}_j = \frac{1}{L} \sum_{i=1}^{L} \hat{E}_{i,j}          (2.18)

E'_j = \frac{1}{L-\lambda} \sum_{i=1}^{L-\lambda} \left( \hat{E}_{i,j} - \hat{E}_{i+\lambda,j} \right)^2          (2.19)

and where \bar{E}_j (j = 1 to 20) are the mean scores of the amino acid residues in the GPCR
sequence and \lambda is the number of tiers used. The value of \lambda should be less than the
length of the shortest sequence present in the database. We have chosen \lambda = 49 in the
proposed research. We have named this feature extraction strategy PSE-PSSM in our present
research.
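A sketch of assembling the PSE-PSSM vector (eqs. 2.16-2.19) from an L x 20 PSSM. The random matrix below stands in for real PSI-BLAST output, and lam is kept small here, whereas the thesis uses lam = 49:

```python
# PSE-PSSM feature sketch; the input matrix is synthetic, not a real PSSM.
import random

def pse_pssm(pssm, lam):
    # eq. 2.16: normalize each row to zero mean, unit standard deviation
    norm = []
    for row in pssm:
        mean = sum(row) / 20
        sd = (sum((e - mean) ** 2 for e in row) / 20) ** 0.5 or 1.0
        norm.append([(e - mean) / sd for e in row])
    L = len(norm)
    mean_j = [sum(r[j] for r in norm) / L for j in range(20)]      # eq. 2.18
    lag_j = [sum((norm[i][j] - norm[i + lam][j]) ** 2
                 for i in range(L - lam)) / (L - lam)
             for j in range(20)]                                   # eq. 2.19
    return mean_j + lag_j                                          # eq. 2.17

random.seed(0)
pssm = [[random.randint(-5, 5) for _ in range(20)] for _ in range(60)]
print(len(pse_pssm(pssm, lam=5)))   # 40
```

The result is a fixed 40-dimensional vector regardless of sequence length, which is what makes the representation usable with the classifiers below.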
2.2.2. Classification Algorithms
For the sake of classification, we have used Support vector machine, probabilistic neural
network, nearest neighbor and ensemble classification approaches.
2.2.2.1 Nearest Neighbor
The nearest neighbor algorithm (NN) annotates a test sample in a sample space of N classes by
computing its distance to the training samples and assigning it the label of the class at minimum
distance (Rehman & Khan, 2011). The distance is calculated as:

D(x, x_i) = 1 - \frac{x \cdot x_i}{\|x\| \, \|x_i\|}, \qquad i = 1, 2, ..., N          (2.20)

where x is the test sample, x_i is a sample of the i-th training class, and \|x\| and \|x_i\| are
their respective moduli.
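The rule above can be sketched directly; the tiny 2-D "feature vectors" are made-up stand-ins for real GPCR features:

```python
# Nearest neighbor with the distance of eq. 2.20; toy data for illustration.
import math

def distance(x, y):
    """D(x, y) = 1 - (x . y) / (|x| |y|), as in eq. 2.20."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (math.sqrt(sum(a * a for a in x)) *
                      math.sqrt(sum(b * b for b in y)))

def nearest_neighbor(x, samples):
    """samples: list of (feature_vector, label) pairs."""
    return min(samples, key=lambda s: distance(x, s[0]))[1]

train = [((1.0, 0.1), "ClassA"), ((0.1, 1.0), "ClassB")]
print(nearest_neighbor((0.9, 0.2), train))   # ClassA
```

Since the distance depends only on the angle between vectors, samples are compared by direction rather than magnitude.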
2.2.2.2 Support vector machines
The SVM classification algorithm is a binary classifier, but it can be used for multi-class
classification problems (Karchin, Karplus, & Haussler, 2002). The model formed by the SVM
computes a decision boundary having maximum distance to the nearest points in the training
feature space. The SVM is based on the principle of finding the optimal linear hyperplane so as
to minimize the classification error for new test samples (Javed, Khan, Majid, Mirza, & Bashir,
2007). For linearly separable data of N training pairs (x_i, y_i), the decision surface function
V is given by Eq. (2.21):
V(x) = \sum_{i=1}^{N} \alpha_i y_i \, x_i^T x + bias          (2.21)
where the coefficient \alpha_i \geq 0 is a Lagrange multiplier in an optimization problem. A
sample x_i corresponding to \alpha_i > 0 is called a support vector. The function V(x) is
independent of the dimension of the feature space. To find an optimal hyperplane surface for
non-separable sample points, we have to solve the following problem:
\min_{W,\,\xi} \;\; \frac{1}{2} W^T W + o \sum_{i=1}^{N} \xi_i          (2.22)

subject to: \; y_i \left( W^T \phi(x_i) + bias \right) \geq 1 - \xi_i, \quad \xi_i \geq 0

where o is the penalty parameter of the error term \sum_{i=1}^{N} \xi_i. It represents the cost
of constraint violation for those data points which fall on the wrong side of the decision
boundary, and \phi(x) is the nonlinear mapping. The weight vector W minimizes the cost function
term W^T W.
For nonlinear data, the input data is mapped to a higher dimension through a mapping function
\phi(x) such that \phi: R^N \to F^M, M \geq N. Each point in the new feature space is defined by
a kernel function K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j). There are many different kernel
functions; we have evaluated the performance of our proposed method using the linear,
polynomial, radial basis function and sigmoid kernel functions. The nonlinear decision surface V
can now be constructed as given by Eq. (2.23):

V(x) = \sum_{i=1}^{N_s} \alpha_i y_i \, K(x_i, x) + bias          (2.23)
where N_s is the number of support vectors. Mathematically, the radial basis function (RBF)
kernel is defined as given by Eq. (2.24):

K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)          (2.24)

where the parameter \sigma shows the width of the Gaussian function.
As our classification problem is a multi-class problem, we have used the one-vs-all
classification strategy using the LIBSVM 2.88-1 package (libSVM). The SVM problem in this
software is solved using a nonlinear quadratic programming technique. During parameter
optimization of the SVM models, the average accuracy of the SVM models is maximized.
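Evaluating the trained decision surface of eqs. 2.23-2.24 can be sketched as follows. The support vectors, multipliers and bias are made-up values standing in for the output of LIBSVM training:

```python
# Evaluating an RBF-kernel decision surface; all model values are synthetic.
import math

def rbf_kernel(xi, xj, sigma):
    """Eq. 2.24: exp(-|xi - xj|^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / (2 * sigma ** 2))

def decision(x, support_vectors, alphas, labels, bias, sigma):
    """Eq. 2.23: sum_i alpha_i * y_i * K(x_i, x) + bias."""
    return sum(a * y * rbf_kernel(sv, x, sigma)
               for sv, a, y in zip(support_vectors, alphas, labels)) + bias

svs = [(0.0, 0.0), (2.0, 2.0)]        # hypothetical support vectors
alphas, labels, bias = [1.0, 1.0], [+1, -1], 0.0
print(decision((0.1, 0.0), svs, alphas, labels, bias, sigma=1.0) > 0)   # True
```

In a one-vs-all setup, one such decision function is trained per class and the class with the largest value of V(x) is predicted.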
2.2.2.3 Probabilistic Neural Network
The probabilistic neural network (PNN) was developed by Specht (Specht, 1990). It is based on
the Bayesian classification algorithm. A PNN has four layers: the input, pattern, summation, and
decision layers, with a different number of neurons at each layer. The PNN receives an
n-dimensional feature vector x = (x_1, x_2, ..., x_n) as input at the n nodes of the input layer.
There are M pattern layer nodes fully connected to these input nodes. In the pattern layer, for
each class k (1 \leq k \leq c), m_k Gaussian functions are calculated as given by Eq. (2.25):
p_j^k(x) = \frac{1}{(2\pi)^{n/2} \, |\Sigma_j^k|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_j^k)^T (\Sigma_j^k)^{-1} (x - \mu_j^k) \right)          (2.25)

where \mu_j^k is the mean and \Sigma_j^k is the covariance matrix of the training samples. The
summation layer computes the approximation of the class probability functions as given in Eq.
(2.26):
\Phi^k(x) = \sum_{j=1}^{m_k} \pi_j^k \, p_j^k(x)          (2.26)

where \pi_j^k is the within-class mixing proportion and \sum_{j=1}^{m_k} \pi_j^k = 1 for
k = 1, 2, ..., c. The decision layer makes the decision about the test sample by computing the
risk as given in Eq. (2.27):

V_k(x) = \sum_{l=1}^{c} v_l \, c_{lk} \, \Phi^l(x)          (2.27)

where v_l indicates the prior probability of class l and c_{lk} is the cost of assigning a sample
of true class l to class k. The test sample is assigned the label of the class for which the risk
is minimum. The performance of the PNN depends on an optimized smoothing factor, which is used
to control the deviations of the Gaussian functions.
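A minimal PNN sketch with the common simplification of a spherical covariance \sigma^2 I, equal mixing proportions, and equal priors and costs, so the decision reduces to picking the class with the largest summed Gaussian response; the training points are illustrative:

```python
# Simplified PNN (eqs. 2.25-2.26): spherical Gaussians with smoothing sigma.
import math

def pnn_predict(x, classes, sigma):
    """classes: dict label -> list of training vectors."""
    n = len(x)
    norm = 1.0 / ((2 * math.pi) ** (n / 2) * sigma ** n)
    scores = {}
    for label, patterns in classes.items():
        g = [norm * math.exp(-sum((a - b) ** 2 for a, b in zip(x, p))
                             / (2 * sigma ** 2)) for p in patterns]
        scores[label] = sum(g) / len(g)    # eq. 2.26 with equal mixing
    return max(scores, key=scores.get)

train = {"A": [(0.0, 0.0), (0.2, 0.1)], "B": [(2.0, 2.0), (1.8, 2.1)]}
print(pnn_predict((0.1, 0.1), train, sigma=0.5))   # A
```

The smoothing factor \sigma plays the role described above: a small \sigma gives sharply peaked class responses, a large \sigma smooths them out.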
2.2.3. Performance Assessment
After the classification of all classes, the performance of the classifier is assessed by some
statistical measures. The class assignment for each sequence is usually performed in a binary
way, i.e. into a negative and a positive class; even a multi-class problem can be broken down
into 2-class problems. The true positives (TP) and true negatives (TN) are the numbers of
correctly classified positive and negative sequences, respectively. Similarly, false positives
(FP) and false negatives (FN) are the numbers of negative and positive sequences, respectively,
that were classified incorrectly. In the end, the performance for the whole dataset is analyzed
using various measures such as overall accuracy, sensitivity, specificity, Matthews correlation
coefficient (MCC) and F-measure. Accuracy shows the overall effectiveness of the method and
gives the proportion of true predictions, i.e. true positives and true negatives. Specificity
indicates the proportion of true negatives, while sensitivity indicates the proportion of true
positives. There should be a proper tradeoff between the values of sensitivity and specificity,
and both should normally be high. The values of MCC lie between 1 and -1, where 1 means perfect
prediction, 0 means prediction no better than random, and -1 means total disagreement between
prediction and observation. MCC helps to expose the bias of a classifier towards the bigger
class in case of imbalanced data. The F-measure combines the precision and recall of the test
into a single score: it is the harmonic mean of precision and recall.
Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \times 100          (2.28)

Sensitivity = \frac{TP}{TP + FN} \times 100          (2.29)

Specificity = \frac{TN}{TN + FP} \times 100          (2.30)

MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}          (2.31)

F\text{-}measure = 2 \times \frac{Precision \times Recall}{Precision + Recall}          (2.32)

Precision = \frac{TP}{TP + FP}          (2.33)

Recall = \frac{TP}{TP + FN}          (2.34)
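The measures of eqs. 2.28-2.34 computed from raw counts; the counts below are illustrative, not results from the thesis:

```python
# Performance measures from TP/TN/FP/FN counts (eqs. 2.28-2.34).
import math

def metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                        # same as sensitivity
    return {
        "accuracy": 100 * (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": 100 * recall,
        "specificity": 100 * tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "f_measure": 2 * precision * recall / (precision + recall),
    }

m = metrics(tp=90, tn=85, fp=15, fn=10)
print(round(m["accuracy"], 2))   # 87.5
```

For a multi-class problem each class is scored one-vs-rest with its own TP/TN/FP/FN counts, and the per-class values are then averaged.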
2.2.4. Genetic Algorithms
Genetic algorithms (GA) are a family of evolution-inspired computational models, normally used
to solve complex optimization and search problems. They are based on the principles of natural
selection and survival of the fittest, as in most biological organisms. In nature, individuals
in a population usually compete with each other for resources and to attract a mate. The
individuals that succeed in surviving and attracting mates produce more offspring than poorly
adapted individuals. Genes from individuals that are highly adapted to their environment thus
spread to an increasing number of individuals in each generation. The combination of good
characteristics from different ancestors can result in offspring that are fitter than their
parents. Species evolve in this way to become better suited to their environment.
The basic terminology of GAs was first proposed by Holland (Holland, 1992). A GA starts by
encoding random solutions in the form of a population of chromosomes. A chromosome is a long,
complicated string of DNA (deoxyribonucleic acid) containing genes that determine particular
characteristics of an individual. The chromosomes are then evaluated and reproduced in such a
way that fitter chromosomes have a greater chance to evolve, resulting in better solutions.
Reproduction makes changes to the chromosomes: segments of the parents' chromosomes are
exchanged randomly by a process called crossover, so the offspring exhibit some characteristics
of the father and some of the mother. Mutation happens rarely and changes some characteristics.
Occasionally, an error can occur when chromosomes are copied during cell division; such
accidental mistakes can nevertheless produce a fitter individual. Genetic algorithms have
applications in many fields of science such as computational science, bioinformatics,
engineering, phylogenetics, economics, manufacturing, chemistry, physics and mathematics. The
functionality of a GA can be divided into 4 phases:
Initialization
Selection
Genetic operators
Termination
2.2.4.1 Initialization
A potential solution to a problem can be represented in terms of a set of genes (parameters).
The combination of these genes forms a chromosome (a string containing the combination of
parameters). An initial population is made by generating a number of random chromosomes
(solutions). There can be thousands or millions of possible solutions, and the size of the
population can vary from problem to problem. The encoding of a solution into a chromosome also
varies from problem to problem and can be binary or continuous. A randomly generated population
can represent the entire range of possible solutions.
2.2.4.2 Selection
Naturally, an individual with better survival characteristics will survive for a longer period
of time and has a better chance to produce offspring. Generation after generation, the
population will contain more genes from the superior individuals and fewer from the inferior
individuals. This process is called natural selection. Individual solutions at each generation
are selected using a fitness evaluation method, and ranked according to their fitness values. A
fitness function is defined as the maximization of the objective function of the problem and
varies from problem to problem; it can be converted to a minimization problem by negating the
function. It returns a fitness value, which is the quality measure of a chromosome (solution) of
the problem. The fitness of each individual of the population is measured. Individuals with a
high fitness value have a greater chance of being selected for the genetic operations and of
being used in successive generations. The most important things in a GA are to define a suitable
fitness function and a proper encoding of the parameters. Following are some of the selection
functions used in the literature:
Stochastic uniform selection
Remainder selection
Uniform selection
Shift linear selection
Roulette wheel selection
Tournament selection
Rank selection
2.2.4.3 Genetic operators
After selection and fitness evaluation, GA operators are applied to introduce diversity in the
chromosomes. The 3 most commonly used GA operators are reproduction, crossover and mutation.
Reproduction, which can also be termed selection, simply copies the better solutions into a new
population. In crossover, two or more parent chromosomes are chosen on the basis of fitness
values to produce a child chromosome. The crossover rate can be defined with respect to the
problem. Following are some of the crossover methods used in the existing literature:
Scattered crossover
One point crossover
Two point crossover
Intermediate crossover
Heuristic crossover
Arithmetic crossover
Custom crossover
In the mutation operator, some of the gene values (parameters) are altered to preserve diversity
in the chromosomes from generation to generation. Mutation changes a chromosome sometimes
partially and sometimes completely. Following are the well-known mutation approaches:
Gaussian Mutation
Uniform Mutation
Adaptive feasible Mutation
Custom Mutation
Boundary
Non-Uniform
Bit string mutation
These GA operators are used to make new offspring (new solutions). The overall objective of the
GA is to maximize the fitness function. New chromosomes are created by selection, alteration, or
combination of the characteristics of the currently well-performing chromosomes. Hence, a new
population of solutions is formulated.
2.2.4.4 Termination
The selection and genetic operator phases of the GA are repeated iteratively; one cycle of a GA
run is called a generation. The GA process repeats until some termination criteria are met or an
optimal solution is found. It can also be stopped when a maximum number of generations is
reached.
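The four phases above can be sketched with a minimal GA. The toy fitness function (maximize the number of 1-bits) and all parameter values are illustrative, not the settings used later in the thesis:

```python
# Minimal GA: initialization, tournament selection, one-point crossover,
# bit-string mutation, and termination after a fixed number of generations.
import random

random.seed(1)
GENES, POP, GENERATIONS, MUT_RATE = 16, 20, 60, 0.02

def fitness(chrom):               # toy objective: count of 1-bits
    return sum(chrom)

def select(pop):                  # tournament selection of size 2
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):            # one point crossover
    cut = random.randrange(1, GENES)
    return p1[:cut] + p2[cut:]

def mutate(chrom):                # bit string mutation
    return [g ^ 1 if random.random() < MUT_RATE else g for g in chrom]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]  # init
for _ in range(GENERATIONS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]
best = max(pop, key=fitness)
print(fitness(best))
```

Swapping in a different fitness function, encoding, or operator set from the lists above changes only the corresponding function, which is why GAs adapt easily from problem to problem.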
2.3. GPCR DATASETS
There are various protein databases on the web which provide GPCR datasets, e.g. Swiss-Prot,
UniProt and the Protein Data Bank. There are also some web servers which provide GPCR sequence
data belonging to different species, e.g. (GPCRDB, 2012) and (ENSEMBL). We have used 5 different
datasets in the present research, named: D8354, D167 (Elrod & Chou, 2002), D365 (Chou, 2005),
D566 (Chou & Elrod, 2002) and D11026. The D8354 dataset is available at
http://www.cs.kent.ac.uk/projects/biasprofs/. The dataset was identified using the Entrez search
and retrieval system (Wheeler, Barrett, Benson, & et.al., 2007). Sequences with length > 280
amino acids were deleted. There are 8354 sequences in total, of which Rhodopsin-like has 5526,
Secretin-like has 625, Metabotropic glutamate has 2172, fungal pheromone has 13 and cyclic AMP
has 18 sequences.
D365 contains 365 sequences belonging to 6 major families of GPCRs: (1) Rhodopsin-like, (2)
Secretin-like, (3) Metabotropic glutamate/pheromone, (4) Fungal pheromone, (5) cAMP receptor and
(6) Frizzled/smoothened family. D167 has 167 sequences and is classified into 4 sub-sub
families, i.e. (1) acetylcholine, (2) adrenoceptor, (3) dopamine and (4) serotonin. The dataset
D566 (Chou & Elrod, 2002) has 566 sequences belonging to 7 sub-sub families, i.e. (1)
Adrenoceptor, (2) Chemokine, (3) Dopamine, (4) Neuropeptide, (5) Olfactory type, (6) Rhodopsin
and (7) Serotonin. It is reported by Chou and Elrod that the sequences in D167, D566 and D365
have pairwise similarity of less than 40%.
The D11026 dataset is gathered from the ENSEMBL repository. It has sequences belonging to 19
known sub families of GPCRs and some unknown receptors. The sequences belong to 62 species from
10 groups of organisms: Eutheria, Marsupials, Monotremata, Amphibia, Reptilia, Birds, Ray-finned
fish, Zebra fish, Latimeria and Lamprey. Initially, the number of sequences in this dataset was
more than 12000. We aligned this sequence data, extracted the seven TMs individually from it,
and then merged the seven TMs. We then discarded those sequences that have an X in the TM
regions or that have more than five gaps across the seven TMs. After these refinements, we are
left with 11026 sequences, which are then used to train and test one of our methods.
3. GPCR PREDICTION BY EMPLOYING PHYSIOCHEMICAL
PROPERTIES USING HYBRID FEATURES
As discussed in chapter 1, GPCRs have been classified into different classes by different
researchers. We have followed a classification similar to that adopted in GPCRDB (GPCRDB,
2012). In GPCRDB, GPCRs are divided into 6 major classes, i.e. Rhodopsin-like (Class A),
Secretin (Class B), Metabotropic glutamate (Class C), fungal mating pheromone (Class D), cyclic
AMP (Class E) and frizzled or smoothened receptors (Class F). These 6 families are further
divided into sub classes and so on. In this method, we have classified GPCRs into three levels,
i.e. into families, sub families, and sub-sub families. At the first level, we have classified
GPCRs into 5 main families (class F is ignored because of its very small number of sequences),
at the second level into 40 sub families, and finally into 108 sub-sub families at the third
level, the same as done by (Davies, Secker, Freitas, Mendao, Timmis, & Flower, 2007). We have
used 4 datasets in this proposed method, named D8354, D167, D566 and D365; D8354 is the main
dataset for this method. These datasets are explained in chapter 2. The focus of this method is
to first investigate and utilize the importance of different physiochemical properties to
classify GPCRs, and to use a hybrid combination of spatial and transform domain methods to
increase the overall classification performance. We have named our method GPCR-Hybrid (Rehman &
Khan, 2011). An overview of this chapter is shown in Figure 3-1.
Figure 3-1: Overview of chapter 3
3.1. PHYSIOCHEMICAL PROPERTIES
The physiochemical properties used in the present method are the hydrophobicity, electronic and
bulk properties. The hydrophobicity property can be used to determine the structure and function
of a GPCR; its values can vary for different amino acids under different experimental
conditions. Biological molecules may have large non-polar regions, which can be described as
hydrophobic regions. Each GPCR contains 7 stretches of 20-30 hydrophobic amino acids essential
for passing through the cell membrane. Hydrophobicity can be quantified using many scales such
as KDH (James, et al., 1987), MH (Mandell, Selz, & Shlesinger, 1997), and FH (Fauchere & Pliska,
1983). Out of these scales, the Fauchere scale (Fauchere & Pliska, 1983) was found to be the
most discriminative for classifying GPCRs (Guo, et al., 2005), and it was therefore used by
Rehman (Rehman & Khan, 2011), (Rehman & Khan, 2012). The electronic property is given by the
value of the Electron-Ion Interaction Potential (EIIP) model, which is derived from the average
energy states of all valence electrons in a given amino acid (Cosic, 1994). Any particular amino
acid delocalizing electrons has the strongest impact on the electronic distribution of the whole
protein. The third physiochemical property (bulk) uses descriptors of the composition, polarity
and molecular volume (CPV) model (Grantham, 1974). The folding of the protein sequence is
greatly affected by this property.
3.2. FEATURE EXTRACTION AND CLASSIFICATION
We have performed feature extraction using three methods. In the first method, we have used
pseudo amino acid composition (PseAA) (Chou, 2001), employed with 2 and 3 physiochemical
properties of GPCRs: PseAA2 is computed using the hydrophobicity and electronic properties,
while PseAA3 is computed using the hydrophobicity, bulk and electronic properties. The second
feature extraction method is a hybrid feature vector (MSE-PseAA), which combines wavelet based
multi-scale energy (MSE) and PseAA based features; we have used 2 physiochemical properties in
MSE-PseAA, i.e. the hydrophobicity and electronic properties (Rehman & Khan, 2011). In the third
method, another hybrid feature vector (MSE-AA) is computed by combining the amino acid
composition and MSE features (Rehman & Khan, 2011). We have used 3 classification algorithms,
i.e. support vector machine (SVM), nearest neighbor (NN) and probabilistic neural network (PNN).
The details of these feature extraction strategies and classification algorithms are given in
chapter 2. For cross validation of the present method, we have used the jackknife test. At any
level, an unknown test sequence is classified using the classification algorithm that performed
best on the training data and the feature extraction method that best describes the training
data at that level. We have also developed a web site (GPCR-Hybrid), which takes a GPCR test
sequence as input and predicts its family, sub family and sub-sub family categories. GPCR-Hybrid
is available online at (Rehman Z., GPCR prediction, 2011).
Figure 3-2: GPCR-Hybrid web interface
The user enters a valid GPCR sequence in the textbox and clicks the Submit button. The web
server then shows the family, sub family and sub-sub family classes by running the appropriate
classification algorithm at each level.
3.3. GPCR-HYBRID
GPCR-Hybrid is an online web server, available at (Rehman Z., GPCR prediction, 2011), that
provides the classification of an unknown GPCR sequence into the three levels discussed in the
above sections. Using the training data, we have determined the best performing feature
extraction technique and classification algorithm at each level. The interface of GPCR-Hybrid is
shown in Figure 3-2. First, the user inputs a valid GPCR sequence in the textbox; the input
sequence should be in capital letters (e.g. MPWNG). Then, the user clicks the Submit button.
GPCR-Hybrid performs feature extraction of the input sequence using the best feature extraction
strategy of the family level (i.e. PseAA2) and predicts its class using the best performing
classifier (i.e. SVM) for the family level. After predicting the main family class, the same
process is repeated for predicting the sub family and sub-sub family levels. The names of the
family, sub family and sub-sub family classes are shown in a label, as in Figure 3-2. The
algorithm of GPCR-Hybrid is shown in Figure 3-3.
Figure 3-3: Working of GPCR-Hybrid
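The three-level cascade described above can be sketched as follows. The per-level routing (one model at the family level, then a model chosen by the predicted family, and so on) is an assumption about how a selective hierarchy is wired, and the lambda "models" are stubs standing in for the trained PseAA2+SVM and MSE-PseAA+SVM classifiers:

```python
# Sketch of the GPCR-Hybrid three-level prediction cascade; the models dict
# holds stand-in callables, not real trained classifiers.
def gpcr_hybrid(sequence, models):
    """models: dict mapping a level key or predicted class to a classifier."""
    family = models["family"](sequence)          # PseAA2 + SVM (best at level 1)
    subfamily = models[family](sequence)         # MSE-PseAA + SVM below level 1
    subsubfamily = models[subfamily](sequence)   # MSE-PseAA + SVM at level 3
    return family, subfamily, subsubfamily

toy_models = {
    "family": lambda s: "Rhodopsin-like",
    "Rhodopsin-like": lambda s: "Amine",
    "Amine": lambda s: "Serotonin",
}
print(gpcr_hybrid("MPWNG", toy_models))
```

Each prediction narrows the search for the next level, which is what keeps the 40 sub family and 108 sub-sub family problems tractable.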
3.4. RESULTS AND DISCUSSIONS
As discussed in the above sections, classification of a sequence is performed in 3 levels or
stages. At each level, the GPCR-Hybrid program selects the best feature extraction strategy and
classification algorithm. The results of classification for each level are described in the
following sections.
3.4.1. Family Level Classification
GPCR-Hybrid classifies GPCRs into five families. The performance is shown in terms of
sensitivity, overall accuracy, MCC, specificity and F-measure. The performance of each
classifier for each feature extraction strategy is described below.
3.4.1.1 Performance for PseAA2
The overall accuracies obtained using PseAA2 for the PNN, NN and SVM are reported as:
97.38%, 97.22 % and 97.86% respectively. The value of smoothing factor for PNN is chosen as
1. Using the same set of classifiers, MCC values are: 0.94, 0.93 and 0.95, respectively. Similarly
specificity values are: 96.72 %, 96.50 % and 96.89 %, sensitivity values are: 98.22 %, 98.13%
and 98.95%, and finally F-measures are: 0.96, 0.96 and 0.97, respectively.
3.4.1.2 Performance for PseAA3
The overall accuracies achieved using PseAA3 for PNN, NN and SVM are: 97.74%, 97.58% and
93.66%, respectively, with the smoothing factor of PNN = 0.6. Using the same set of classifiers,
sensitivity values are: 98.52%, 98.41% and 98.04%, specificity values are: 97.16%, 96.96% and
89.83%, MCC values are: 0.94, 0.94 and 0.85, and F-measures are: 0.96, 0.96 and 0.90,
respectively.
3.4.1.3 Performance for MSE-PseAA
The overall accuracies achieved for MSE-PseAA by using the PNN, NN and SVM are 96.98%,
96.89% and 97.41%, respectively. Using the same set of classifiers, specificity values are
96.16%, 96.01% and 96.58%, MCC values are: 0.93, 0.92 and 0.94, sensitivity: 98.01%, 97.97%
and 98.43% and F-measures: 0.96, 0.96 and 0.90, respectively.
3.4.1.4 Performance using MSE-AA
The overall accuracies achieved for MSE-AA using PNN, NN and SVM are: 96.28%, 96.22% and
97.06%, MCC: 0.91, 0.91 and 0.93, specificity: 95.22%, 95.08% and 96.06%, sensitivity: 97.57%,
97.59% and 98.23%, and F-measures: 0.94, 0.94 and 0.95, respectively.
It is obvious from the above results that PseAA2 with SVM gives the best performance at the
family level. The accuracy, sensitivity, MCC and F-measure values are the highest, while the
specificity value is also comparable. Therefore, for an unknown GPCR sequence, family level
feature extraction is performed using PseAA2 and classification using SVM. Results for family
level classification are shown in Figure 3-4 and Figure 3-5.
3.4.2. Sub Family Classification
There are 40 sub families in total. The performance is shown in terms of sensitivity, overall
accuracy and specificity. The detailed performance of each classifier for each feature
extraction strategy is described below.
3.4.2.1 Performance for PseAA2
The overall accuracies at sub family level achieved for PseAA2 using the PNN, NN and SVM are:
82.13%, 81.02% and 81.58%, specificity: 82.10%, 80.99% and 81.55%, and sensitivity: 81.30%,
80.55% and 81.15%, respectively.
Figure 3-4: GPCR classification performance for family level in terms of Accuracy, sensitivity
and specificity
Figure 3-5: GPCR classification performance for family level in terms of MCC and F-Measure
3.4.2.2 Performance for PseAA3
The Specificity measures for PseAA3 using PNN, NN and SVM are: 83.42%, 81.85% and
78.98%, overall accuracies: 83.47%, 81.88% and 79.02% and sensitivity: 83.18%, 81.52% and
78.85% respectively.
3.4.2.3 Performance for MSE-PseAA
The overall accuracies achieved for MSE-PseAA using the PNN, NN and SVM are: 80.36%,
80.73% and 84.97%, specificity: 80.27%, 80.69% and 84.94% and sensitivity values are 81.24%,
80.72% and 84.08% respectively.
3.4.2.4 Performance for MSE-AA
The overall accuracies achieved for MSE-AA using the PNN, NN and SVM are: 78.29%, 78.55% and
80.96%, specificity: 78.21%, 78.51% and 81.90%, and sensitivity: 78.79%, 78.51% and 81.95%,
respectively.
The MSE-PseAA feature extraction strategy with SVM is the most appropriate for sub family level
classification and is hence used by GPCR-Hybrid for sub family classification of any GPCR
sequence. The sub family classification results are also presented in Figure 3-6.
3.4.3. Sub-sub Family Classification
There are 108 sub-sub families in our main GPCR dataset. The performance is shown in terms of
sensitivity, overall accuracy and specificity. The detailed performance of each classifier for
each feature extraction strategy is given in the following sections.
Figure 3-6: GPCR classification performance for sub family level
3.4.3.1 Performance for PseAA2
The Specificity values at sub-sub family level for PseAA2 using PNN, NN and SVM are: 72.94%,
73.01% and 72.70%, overall accuracies: 72.88%, 72.95% and 72.65%, and sensitivity: 67.77%,
69.02% and 67.08% respectively.
3.4.3.2 Performance for PseAA3
The Specificity values at sub-sub family level for PseAA3 using PNN, NN and SVM are: 74.35%,
73.72% and 68.81%, overall accuracies: 74.29%, 73.67% and 68.78% and sensitivity: 69.82%,
69.71% and 68.96%, respectively.
3.4.3.3 Performance for MSE-PseAA
The Specificity values for MSE-PseAA using PNN, NN and SVM are: 71.15%, 72.53% and
70.32%, overall accuracies: 71.10%, 72.48% and 75.60% and sensitivity: 67.67%, 69.01% and
75.67%, respectively.
3.4.3.4 Performance for MSE-AA
The Specificity values for MSE-AA using PNN, NN and SVM are: 68.58%, 69.80% and 73.59%,
overall accuracies: 69.53%, 69.75% and 73.45% and sensitivity: 65.01%, 66.32% and 69.89%
respectively.
As shown in the above sections, the overall accuracy and sensitivity values for MSE-PseAA with
the SVM classifier are the highest. Hence, GPCR-Hybrid chooses MSE-PseAA with SVM for sub-sub
family level classification of any unknown GPCR sequence. Results for sub-sub family level
classification of GPCRs are shown in Figure 3-7.
Figure 3-7: GPCR classification performance for sub-sub family level
3.4.4. Comparison with Selective Top Down Approach
The selective top down approach (Davies, Secker, Freitas, Mendao, Timmis, & Flower, 2007) also
classifies GPCRs into 3 levels. At the family level, PseAA2 with SVM is superior and is hence
compared with the family level performance of the top down approach, while at the sub family and
sub-sub family levels MSE-PseAA with SVM is compared with the top down method's performance. The
selective top down approach reported performance only in terms of overall accuracy. The overall
accuracy achieved by the selective top down approach at the family stage is 95.87%, while
GPCR-Hybrid has an overall accuracy of 97.86%. At the sub family level, the selective top down
method achieves 80.77% and GPCR-Hybrid 84.97%. The selective top down method has an overall
accuracy of 69.98% at the sub-sub family level, while GPCR-Hybrid achieves 75.60%. At all 3
levels, GPCR-Hybrid performs much better than the selective top down approach. This improvement
in performance is due to the hybrid combination of transform and spatial domain
feature-extraction strategies; in addition, the use of physiochemical properties has positively
affected the performance.
Figure 3-8: Comparison with Selective Top Down method
3.4.5. Comparison with other methods
As mentioned in section 2.2, we have tested and compared our method on three additional datasets, termed D167, D566 and D365. The GPCR sequences in each of these datasets belong to only one of the levels. We have compared our overall accuracy with the overall accuracies of existing methods on these datasets. We have computed results on each of these datasets using the SVM classifier with four different kernels, i.e. Lin-SVM, Poly-SVM, RBF-SVM and Sig-SVM, and report the results of the best kernel function for each dataset.
On D167, we have compared the overall accuracy of GPCR-Hybrid with the overall accuracies of six existing methods (Elrod & Chou, 2002), (Huang, Cai, Ji, & Li, 2004), (Bhasin & Raghava, 2005), (Gao & Wang, 2006), (Gao, Wu, Ma, Lu, & He, 2008), (Peng, Yang, & Chen, 2010). The overall accuracy achieved by GPCR-Hybrid is higher than that of all six methods.
Figure 3-9: Comparison on D167 dataset
Two existing methods have used D365: GPCR-CA (Xiao, Wang, & Chou, 2009) and PCA-GPCR (Peng, Yang, & Chen, 2010). The overall accuracies achieved by PCA-GPCR and GPCR-CA are 92.60% and 83.56%, while the overall accuracy achieved by the GPCR-Hybrid method is 91.72%, which is about 8 percentage points higher than GPCR-CA and comparable to PCA-GPCR.
Figure 3-10: Comparison on D365 dataset
Figure 3-11: Comparison on D566 dataset
GPCR-Hybrid is compared with the PCA-GPCR method on the D566 dataset. The overall accuracy achieved by PCA-GPCR is 97.88%, while GPCR-Hybrid achieves 97.91%.
The improvements in the performance of GPCR-Hybrid over the existing methods result from the hybrid combination of spatial and transform domain features and the employment of physiochemical properties. The optimization of SVM parameters with a proper kernel for each dataset has also contributed to the improvement.
4. GPCRs PREDICTION USING GREY INCIDENCE DEGREE MEASURE
AND PRINCIPAL COMPONENT ANALYSIS
GPCR sequences are made up of amino acid polypeptide chains, which can also be called subunits. The number and arrangement of the subunits forming a GPCR sequence is called its quaternary structure. There are different types of quaternary structures in GPCRs, such as monomers, dimers, trimers, tetramers and pentamers. Some biological processes are directly affected by quaternary structure. For example, monomers form sodium channels (Chen, Alcayaga, Suarez-Isla, ORourke, Tomaselli, & Marban, 2002), homo-tetramers form potassium channels (Doyle, et al., 1998), homo-pentamers make phospholamban channels (Oxenoid & Chou, 2005), (Oxenoid, Rice, & Chou, 2007) and hetero-pentamers make the α7 nicotinic acetylcholine receptor (Chou, 2004). Some transitions only occur in tetramers, dimers bind some ligands and tetramers form some ion channels.
In this method, we have again classified GPCRs into three levels as in chapter 3. We have hybridized three feature extraction approaches, i.e. split amino acid composition (SAAC), pseudo amino acid (PseAA) composition and the fast Fourier transform (FFT). We have employed two physiochemical properties, i.e. Electronic and Bulk, in PseAA; these are already explained in chapter 3, and all of the feature extraction strategies are explained in chapter 2. The number of features taken in PseAA is 62, in SAAC 60 and in FFT 256, giving 378 features in total. As the number of features after hybridization becomes so high, to avoid the curse of dimensionality we have applied principal component analysis (PCA) to reduce the features. After applying PCA, the size of the feature vector is reduced to 180. For classification we have used the nearest neighbor algorithm. We have computed the nearest neighbors of a test sequence in two ways, i.e. with the grey incidence degree measure and with the Euclidean distance measure; the grey incidence degree measure performs better than the Euclidean distance. We have trained and tested our methods on D8354 and compared with other methods on the D167 and D566 datasets. An overview of the chapter is shown in Figure 4-1.
Figure 4-1: Overview of chapter 4
4.1. GREY INCIDENCE DEGREE MEASURE
Deng introduced grey theory in 1982 to analyze the uncertainty of a system (Deng, 1982). This theory is applicable to problems in which information is fuzzy or uncertain. The grey incidence degree (GID) measure is one of the major components of this theory (Liu, Fang, & Lin, 2005). The classification of GPCRs is also a fuzzy problem: some GPCR sequences can be put into one class based on some properties, but can also be put into another class because of other properties.
$T = \{T_1, T_2, \ldots, T_n\}$  (4.1)

$\gamma_k^{t,i} = \dfrac{\Delta_{Min} + \xi\, \Delta_{Max}}{\Delta_k^{t,i} + \xi\, \Delta_{Max}}$  (4.2)

where $T_1, T_2, \ldots, T_n$ are the numeric forms of the $n$ training sequences, $T_t$ is the test sequence, and $\gamma_k^{t,i}$ is the grey relational coefficient. Here $\Delta_k^{t,i} = |P_k^t - P_k^i|$, $\Delta_{Min} = \min_j \min_k |P_k^t - P_k^j|$ and $\Delta_{Max} = \max_j \max_k |P_k^t - P_k^j|$, where $j = 1, 2, \ldots, n$ are the indices of training sequences, $k = 1, 2, \ldots, 180$ are the indices of features of a GPCR sequence, and $\xi$ is the distinguishing coefficient, whose value lies between 0 and 1.
The grey incidence degree $O$ of the test sequence with a training sequence is a weighted sum of the grey relational coefficients and is given by the following equation.

$O(G_t, G_i) = \sum_{k=1}^{180} w_k\, \gamma_k^{t,i}$  (4.3)

where $w_k$ is the weight associated with each feature. We have given equal weight to each feature and taken the value of $\xi$ equal to 0.5, as in existing work (Tsai, Liou, & Jiang, 2005), (Xiao, Wang, & Chou, 2009). The grey incidence degree $O(G_t, G_i)$ is the correlation between the test sequence $G_t$ and the training sequence $G_i$. The training sequence closest to the test sequence will have a grey incidence degree higher than the other training sequences and hence can annotate the test sequence with its class. In this method, we have employed GID in the nearest neighbor algorithm to compute the neighbors of a test sequence, which in turn annotate the test sequence.
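The computation in eqs. (4.2)-(4.3) can be sketched in Python as follows. This is an illustrative sketch with equal feature weights ($w_k = 1/K$); the function and variable names are ours, not part of the thesis software.

```python
import numpy as np

def grey_incidence_degree(test, train, xi=0.5):
    """Grey incidence degree of a test feature vector against each row
    of a training matrix (eqs. 4.2 and 4.3, equal weights)."""
    diff = np.abs(train - test)            # |P_k^t - P_k^j| for all j, k
    d_min, d_max = diff.min(), diff.max()  # global min/max differences
    gamma = (d_min + xi * d_max) / (diff + xi * d_max)  # relational coefficients
    return gamma.mean(axis=1)              # equal-weight sum over features

def gid_nearest_neighbor(test, train, labels, xi=0.5):
    """Annotate the test sequence with the class of the training
    sequence having the highest grey incidence degree."""
    o = grey_incidence_degree(test, train, xi)
    return labels[int(np.argmax(o))]
```

A training vector identical to the test vector attains the maximum coefficient of 1 in every feature, so it dominates the incidence degree, which is the nearest-neighbor behavior described above.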
4.2. PRINCIPAL COMPONENT ANALYSIS
Principal component analysis (PCA) is a useful technique in pattern classification and machine learning for analyzing patterns in high dimensional data and for highlighting the differences and similarities in the data. It transforms high dimensional data into a much lower dimension without the loss of significant information. PCA is used in many different fields, from neuroscience to computer graphics, because it is a non-parametric method for extracting useful, relevant information from confusing data sets. The mathematical description of PCA is summarized in the sections given below.
The mathematical details of PCA are explained in (Howard, 2000). Suppose we have multi-dimensional data. We first compute the mean across each dimension and subtract the mean from each value of that dimension, so that the data now has zero mean. Then we calculate the covariance matrix of the zero-mean data. The covariance matrix shows the relations between the different dimensions of high dimensional data; covariance is measured between pairs of dimensions. The covariance matrix is an N x N matrix, where N is the number of dimensions of the data. The covariance of a dimension with itself is equal to the variance of that dimension.

$COV(X, Y) = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$  (4.4)
where $COV(X, Y)$ is the covariance between the $X$ and $Y$ dimensions, $\bar{X}$ is the mean of the $X$ dimension, $\bar{Y}$ is the mean of the $Y$ dimension and $n$ is the number of data points. Next, we compute the eigenvalues and eigenvectors of the covariance matrix and sort the eigenvectors according to their eigenvalues. We then discard some of the less important eigenvectors to reduce the dimensionality of the data. Finally, we multiply the transpose of the chosen eigenvectors with the original high dimensional data and pass this data as features to the classification algorithm. We have named the GID based method GPCR-GID (Rehman & Khan, 2011). The overview of GPCR-GID is shown in Figure 4-2.
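The mean-centering, eigendecomposition and projection steps above can be sketched as follows. This is an illustrative NumPy sketch; the names and the use of `numpy.linalg.eigh` are our choices, not the thesis implementation.

```python
import numpy as np

def pca_reduce(data, n_components=180):
    """Reduce feature dimensionality by PCA as described above:
    mean-center, eigendecompose the covariance matrix, keep the
    eigenvectors with the largest eigenvalues, and project."""
    centered = data - data.mean(axis=0)          # zero mean in every dimension
    cov = np.cov(centered, rowvar=False)         # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    top = eigvecs[:, order[:n_components]]       # keep the leading eigenvectors
    return centered @ top                        # project onto principal axes
```

The default of 180 components matches the reduced feature-vector size used in this chapter; any smaller value works the same way.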
Figure 4-2: Overview of GPCR-GID
4.3. RESULTS AND DISCUSSIONS
As explained at the start of this chapter, we have trained and tested our methods on D8354. The GPCRs in this dataset are classified into three levels, i.e. family, sub family and sub-sub family levels. In this proposed method, we have used only the accuracy measure for performance assessment. The following sections give the details of the results.
4.3.1. Family level classification
GPCRs are classified into five families. The percentage accuracy of the GID based method is 97.82%, while the Euclidean distance based method achieves 97.44%.
4.3.2. Sub family level classification
The five families of GPCRs are further classified into 40 sub families at this level. The percentage accuracy of the GID based method is 81.55%, while the Euclidean distance based method achieves 80.97%.
4.3.3. Sub-sub family level classification
The 40 sub families of GPCRs are further classified into 108 sub-sub families at this level. The percentage accuracy of the GID based method is 73.32%, while the Euclidean distance based method achieves 72.66%. The performance of both methods is also shown in Figure 4-3.
Figure 4-3: Performance of GID and Euclidian distance methods
Figure 4-3 clearly shows that the performance of GPCR-GID is superior to the Euclidean distance based method at all three levels. Hence, we have compared GPCR-GID with other existing methods.
4.3.4. Comparison with other methods
We have trained our method on the D8354 dataset and compared it with other methods using D8354. We have also compared our method with existing methods using the D167 and D566 datasets, which are already explained in chapter 2. The comparison details are as follows.
4.3.4.1 Comparison with Selective top down approach
In the selective top down approach, GPCRs are hierarchically classified into three levels (Davies, Secker, Freitas, Mendao, Timmis, & Flower, 2007). The selective top down method assesses its performance using the accuracy measure, so we have compared our accuracy with it, as shown in Figure 4-4.
Figure 4-4: Comparison with selective top down approach
At family level, the best percentage accuracy achieved by the selective top down approach is 95.87%, while the accuracy achieved by GPCR-GID is 97.82%. At sub family level, the best accuracy achieved by the selective top down approach is 80.77%, while GPCR-GID achieves 81.55%. The selective top down approach achieves 69.98% accuracy at sub-sub family level, while GPCR-GID achieves 73.32%. At all three levels of GPCRs, GPCR-GID is significantly superior to the selective top down approach, which strengthens the worth of GPCR-GID.
4.3.4.2 Comparison with other existing methods on D167 and D566 datasets
There are six existing methods with which we have compared GPCR-GID on the D167 dataset, i.e. (Elrod & Chou, 2002), (Huang, Cai, Ji, & Li, 2004), (Bhasin & Raghava, 2005), (Gao & Wang, 2006), (Gao, Wu, Ma, Lu, & He, 2008) and PCA-GPCR (Peng, Yang, & Chen, 2010). Again, we have used the accuracy measure for comparison. This comparison is shown in Figure 4-5, which clearly shows the superiority of GPCR-GID over all six methods.
Figure 4-5: Comparison on D167
There are two methods with which we have compared GPCR-GID on D566. One is PCA-GPCR (Peng, Yang, & Chen, 2010) and the other is by Chou (Chou & Elrod, 2002). The percentage accuracy achieved by PCA-GPCR is 97.88% and by (Chou & Elrod, 2002) is 92.05%, whereas the accuracy achieved by GPCR-GID is 97.96%.
Figure 4-6: Comparison on D566
Figure 4-6 shows the superiority of GPCR-GID over PCA-GPCR and Chou's method (Chou & Elrod, 2002). This improvement in the performance of GPCR-GID has several reasons. One is the hybridization of spatial domain and transform domain features together with PCA for feature reduction. Secondly, the GID measure based method can efficiently discriminate classes by numerically capturing the quaternary structure of a GPCR.
5. GPCRs PREDICTION USING GENETIC ALGORITHM BASED
ENSEMBLE CLASSIFICATION
This chapter focuses on the classification of GPCRs using ensemble approaches. In ensemble classification, various classifiers contribute their strengths to increase the performance of the overall classification. There are several types of ensemble approaches; our focus in this chapter is on weighted ensemble classification, in which weights are assigned to each classifier and optimized using appropriate optimization techniques. Each classifier votes for a class after weighting, and the label with the majority of votes is assigned to the unknown GPCR sequence. The binary genetic algorithm is one suitable technique for optimizing the weights; its optimization performance is controlled by appropriate parameter settings. The features of a GPCR sequence are first extracted using the MSE-PseAA and PSE-PSSM techniques. The physiochemical properties used in the MSE-PseAA approach are the Hydrophobicity, Electronic and Bulk properties, which are explained in detail in chapter 3. MSE-PseAA is also explained in chapter 3 and PSE-PSSM is already explained in chapter 2. PSE-PSSM incorporates evolutionary information in the features: the position specific scoring matrix is used to extract biological features (Schaffer, Aravind, Madden, Shavirin, Spouge, & al., 2001), so both physiochemical and biological properties are utilized. The classification algorithms used are NN, PNN, GID and SVM. The predictions of all four classifiers are combined by weighting, and the final prediction for a GPCR sequence is made. We have named this technique PSE-PSSM (Rehman & Khan, 2012). The overview of chapter 5 is shown in Figure 5-1.
Figure 5-1: Overview of chapter 5
The datasets used in PSE-PSSM are D8354, D167, D365 and D566. Again, we have classified GPCRs into three levels using the D8354 dataset, i.e. family, sub family and sub-sub family levels. PSE-PSSM is a very accurate method for feature extraction but it consumes a lot of time, so we have used it only for a smaller dataset, i.e. D167.
5.1. CLASSIFICATION ALGORITHM
As discussed at the start of this chapter, NN, PNN, SVM and GID are used as classification algorithms. The ensemble classifier is made from the weighted majority voting of these four main classifiers. For some datasets, we have used four different kernel functions for the SVM, i.e. Radial Basis Function (RBF-SVM), Polynomial (Poly-SVM), Sigmoid (Sig-SVM) and Linear (Lin-SVM). The LIBSVM 2.88-1 package (lib SVM) is available online and provides the code for the different SVMs. If we count each of these four kernels as a different classifier, then weighted majority voting is performed using seven classifiers in total. Each classifier votes for a class with a certain weight, and an unlabeled GPCR sequence is assigned the class with the maximum votes.
5.2. WEIGHT OPTIMIZATION USING GENETIC ALGORITHM
The binary genetic algorithm (GA) is discussed in detail in chapter 2. It has four main phases:
Population generation and initialization
Evaluation of fitness
Crossover, mutation and reproduction
Termination criteria
For each dataset, we have first computed prediction matrices using each classifier (Rehman & Khan, 2012). A chromosome in the GA represents the weight vector, one weight for each of the seven classifiers, to be optimized. This weight vector is multiplied with the prediction matrices as shown in the following equation.

$Z(i) = \max_{j=1,2,\ldots,C} \sum_{k=1}^{n} W_k\, z_{k,j}$  (5.1)

where $Z(i)$ is the prediction of the ensemble for a sequence $i$, $j = 1, 2, \ldots, C$ indexes the classes in the dataset, $k = 1, 2, \ldots, n$ indexes the classifiers used, $W_k$ is the weight of a particular classifier $k$ and $z_{k,j}$ is the prediction of the individual classifier. The unknown sequence $i$ is annotated with the label of the class with the maximum vote or score (after multiplying with the weight vector). The same process is repeated for all sequences in the dataset and the accuracy over the dataset is computed. The fitness function is defined as the negative of accuracy.

$Fitness = -Accuracy$  (5.2)
The GA's objective is to increase the overall accuracy, i.e. to decrease the fitness value. The increase in accuracy is achieved by optimizing the weights for each classifier. The ranking of chromosomes (weight vectors) is performed based on their accuracies for a dataset. After ranking, crossover, mutation and reproduction are performed with certain probabilities and the chromosome population proceeds to the next generation. The GA is run for 100 generations with a stall limit of 50 generations, i.e. if there is no improvement in performance over the last 50 generations, the GA stops. If the stall limit is not reached but some other termination criterion is met, the GA is also stopped and the weighting is considered optimized. The PSE-PSSM method is shown in Figure 5-2.
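Equations (5.1) and (5.2) amount to the following fitness evaluation for a single chromosome. This is a hedged Python sketch; the one-hot encoding of the prediction matrices and all names are our illustrative choices, not the thesis code.

```python
import numpy as np

def ensemble_fitness(weights, votes, labels):
    """Fitness of one GA chromosome (eqs. 5.1 and 5.2): weight each
    classifier's one-hot vote matrix, sum the weighted votes per class,
    predict the highest-scoring class, and return minus the accuracy."""
    # votes: (n_classifiers, n_sequences, n_classes) one-hot predictions
    weighted = np.tensordot(weights, votes, axes=1)  # (n_sequences, n_classes)
    predicted = weighted.argmax(axis=1)              # class with maximum score
    accuracy = np.mean(predicted == labels)
    return -accuracy                                 # the GA minimizes fitness
```

A GA library or hand-written GA loop would call this function once per chromosome per generation and rank the population by the returned value.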
Figure 5-2: Overview of PSE-PSSM method
5.3. RESULTS AND DISCUSSIONS
The performance on each dataset is first assessed individually for each of the classification algorithms. Then, weighted majority voting is performed as explained in section 5.2. The performance details are given in the following sections.
5.3.1. Classification performance on D8354
In the D8354 dataset, a GPCR sequence is predicted at three levels, i.e. family, sub family and sub-sub family levels, and for each sequence we output the family, sub family and sub-sub family class names. The details of the family, sub family and sub-sub family names and the respective numbers of sequences are already given in chapter 2. The crossover rate is 0.8, the mutation rate is 0.1 and the reproduction rate is 0.1 for all three levels.
5.3.1.1 Family level classification
The individual classifier accuracies achieved using PNN, NN, SVM and GID are 96.98%, 96.89%, 97.41% and 97.12%, respectively. The weights optimized by the GA are associated with each of these four classifiers to further improve performance. Initially, a population of 30 chromosomes is generated. The size of a chromosome is taken as 4 for the D8354 dataset. Roulette wheel is used as the selection function. The number of generations is 100 and the stall limit is set to 50. At the end of each generation, the weight vectors are improved. Finally, the optimized weight vector is obtained as: PNN=0.034, NN=0.119, SVM=0.209, GID=0.637. The accuracy after weighted majority voting is 97.414%.
5.3.1.2 Classification performance at sub family level
There are 40 classes at sub family level. The individual classifier accuracies achieved using PNN, NN, SVM and GID are 80.36%, 80.73%, 84.97% and 81.10%, respectively. The weight vector after optimization by the GA is PNN=0.298, NN=0.022, SVM=0.097 and GID=0.582. The accuracy achieved by weighted majority voting is 84.97%.
Figure 5-3: GA run for family level
5.3.1.3 Classification performance at sub-sub family level
There are 108 classes at sub-sub family level. The individual classifier accuracies using PNN, NN, SVM and GID are 71.10%, 72.48%, 75.60% and 72.90%, respectively. The weight vector after optimization by the GA is PNN=-0.086, NN=0.307, SVM=0.249 and GID=0.529. The accuracy achieved by weighted majority voting is 75.81%. The details of the results on the D8354 dataset are shown in Figure 5-6.
Figure 5-4: GA run for subfamily level
Figure 5-5: GA run for sub-subfamily level
Figure 5-6: Classification performance on D8354 dataset
5.3.2. Comparison with existing approaches on D8354
For comparison on the D8354 dataset, we have compared our method with the selective top down method (Davies, Secker, Freitas, Mendao, Timmis, & Flower, 2007) and with GPCR-Hybrid (Rehman & Khan, 2011). The selective top down approach classifies GPCRs hierarchically into three levels. The comparisons are shown in Figure 5-7.
We have compared the accuracies of PSE-PSSM with those of the GPCR-Hybrid and selective top down methods. The selective top down method achieves an accuracy of 95.87% at family level, GPCR-Hybrid achieves an overall accuracy of 97.86%, while PSE-PSSM achieves an accuracy of 97.41%; PSE-PSSM's accuracy is comparable with GPCR-Hybrid and slightly higher than the selective top down approach. At sub family level, the accuracy of the selective top down method is 80.77%, GPCR-Hybrid achieves 84.97% and PSE-PSSM also achieves an accuracy of 84.97%. Finally, at sub-sub family level, the selective top down method achieves an accuracy of 69.98%, GPCR-Hybrid 75.60% and PSE-PSSM 75.85%. At sub-sub family level, PSE-PSSM has performed better than the other two methods. We think that this improved performance is due, first, to the hybrid combination of wavelet based multi scale energy and pseudo amino acid composition based features; secondly, the optimized weighted majority voting has played an important role in improving the performance of the classification method.
Figure 5-7: Comparison on D8354 dataset
5.3.3. Comparison on D167, D365 and D566 datasets
As mentioned in chapter 2, there are 167 GPCR sequences in the D167 dataset. The population size in the GA for D167 is taken as 50, the mutation probability as 0.1, the crossover probability as 0.8 and the reproduction probability as 0.1. We have used the Roulette wheel method for the selection of chromosomes. We have run the GA for 100 generations and assessed the performance of both of our feature extraction strategies, i.e. MSE-PseAA and PSE-PSSM, separately. The results on D167 are shown in Figure 5-8; the GA graphs for MSE-PseAA and PSE-PSSM are shown in Figure 5-10 and Figure 5-11, respectively.
As shown in Figure 5-8, the performance of PSE-PSSM is slightly better than MSE-PseAA, so we have compared the PSE-PSSM based method with two existing methods, i.e. (Elrod & Chou, 2002) and (Huang, Cai, Ji, & Li, 2004). This improvement is due to the embedding of evolutionary information in the feature extraction. The accuracy achieved in (Elrod & Chou, 2002) is 83.23%, in (Huang, Cai, Ji, & Li, 2004) it is 83.20%, and PSE-PSSM achieves an accuracy of 95.81%; this comparison is shown in Figure 5-9.
Figure 5-8: Classification performance on D167 dataset
Figure 5-9: Comparison on D167 dataset
Figure 5-10: GA run for D167 using MSE-PseAA
Figure 5-11: GA run for D167 using PSE-PSSM
We have compared our method with the GPCR-CA (Xiao, Wang, & Chou, 2009) method on the D365 dataset (Chou, 2005). The GA parameters used for D365 are: population size = 100, selection function = Tournament selection, uniform mutation rate = 0.1 and 2-point crossover rate = 0.8. GPCR-CA achieves an accuracy of 83.56%, while PSE-PSSM achieves an accuracy of 90.14%. The GA run is shown in Figure 5-13 and the classification performance on D365 and D566 is shown in Figure 5-12.
Figure 5-12: Classification performance on D365 and D566 datasets
Figure 5-13: GA run for D365 dataset
We have compared PSE-PSSM with three existing methods on D566, i.e. the PCA-GPCR (Peng, Yang, & Chen, 2010) method, (Chou & Elrod, 2002) and GPCR-Hybrid (Rehman & Khan, 2011). The accuracy achieved by the PCA-GPCR method is 97.88%, the accuracy in (Chou & Elrod, 2002) is 92.05% and in GPCR-Hybrid it is 97.91%, while the accuracy achieved by PSE-PSSM is 97.88%. The graph of the GA run is shown in Figure 5-14.
Figure 5-14: GA run for D566
The comparison on D365 is shown in Figure 5-15 and on D566 in Figure 5-16.
Figure 5-15: Comparisons on D365 dataset in terms of % accuracy
Figure 5-16: Comparison on D566
6. ALIGNMENT BASED STRUCTURAL CLASSIFICATION OF GPCRS
USING TRANSMEMBRANE REGIONS
GPCRs can be classified based on ligand binding or by molecular phylogenetic analyses. Phylogenetic analyses are usually based on multiple sequence alignments (MSAs). There are various methods for MSA, i.e. progressive methods, iterative methods, local and global alignments and motif based alignment; we have already discussed the different types of MSA in chapter 2. We propose a novel motif based alignment method for the alignment and classification of 19 sub families (and unknown receptors) of Rhodopsin like GPCRs. In some cases, we have further divided sub families, and receptors that are unknown are kept separately. Rhodopsin like receptors have great diversity in structure and function and are in high demand for drug development. Humans have about 800 Rhodopsin like GPCRs. The structure of Rhodopsin receptors consists of an extracellular N terminal, an intracellular C terminal and seven transmembrane helical structures, and the family comprises about 80% of all GPCRs. We computed pseudo-count based position specific scoring matrices, whose scores are then mapped to the extreme value distribution (EVD). The EVD is used to set thresholds to identify motifs, and alignments are then performed based on the motifs. Based on the EVD scores and thresholds, we have then performed the classification of GPCRs.
Initially, we generated alignments of the 19 sub families with T-Coffee (T-Coffee). We set the human sequences as references in each family and extracted the motifs (TMs) of all sequences of that family. We removed those sequences that do not have all seven motifs or that have bad motifs (too many gaps in the motifs). The extracted motifs are merged together. We tested our method for various motif lengths and determined the appropriate combined motif length to be 182 amino acids. Position specific scoring matrices (PSSMs) are then computed for the extracted merged motif regions of the 19 sub families. To account for missing amino acids in the PSSMs, we have added pseudo counts using the Blosum62 matrix and a proposed GPCR scoring matrix. Raw scores from the PSSMs are mapped to the extreme value distribution to define thresholds for each family corresponding to each of the seven motifs. These thresholds are used to identify motifs and to classify Rhodopsin like GPCRs into the 19 sub families. The overview of chapter 6 is shown in Figure 6-1.
Figure 6-1: Overview of chapter 6
6.1. SEVEN MOTIFS OF RHODOPSIN LIKE GPCRS
The generalized structures of the seven motif regions (the seven transmembrane helical regions) in Rhodopsin like GPCRs, observed from the SYLYBS software, are as follows:
M1: xxxxxxxxxxxxxxxxGNxxxxxxxx
M2: xxxxxxxLxxxDxxxxxxxxxxxxxxxxx
M3: xxxxxxxxxxxxxxxxxxxxDRYxxx
M4: xxxxxxxxWxxxxxxxxxPx
M5: xxxxxxxxFxxPxxxxxxxYxxxxxxxx
M6: xxxxxxxxxxxxxxFxxCWxPxxxxxxxxx
M7: xxxxxxxxxxxxxxxNPxxYxxx
where M1 to M7 are the seven motifs and an x at any position indicates that any amino acid can occur there. These motif structures are found in most Rhodopsin like sequences, with occasional mutations at some places. There can be zero, one or more than one instance of a motif in a sequence. We have identified only those motifs that preserve the sequence order, i.e. M1 comes before M2, M2 before M3, and so on. This sequential nature of the motifs increases the overall quality of the multiple sequence alignment. Initially, the positions of the seven motifs are given manually to calculate the PSSMs and to train the method. The method can then be used to identify M1-M7 in any unknown sequence of the corresponding family.
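To make the order-preserving search concrete, the sketch below scans a sequence for simplified regular-expression cores of M1-M7 in order. This is our own illustrative simplification: the actual method locates motifs by PSSM scores against EVD thresholds, not by regular expressions.

```python
import re

# Conserved cores of the seven motifs listed above; "." stands for the
# x positions, which match any amino acid.
MOTIF_CORES = ["GN", "L.{3}D", "DRY", "W.{9}P", "F.{2}P.{7}Y",
               "F.{2}CW.P", "NP.{2}Y"]

def find_ordered_motifs(sequence):
    """Locate M1..M7 so that each motif starts after the previous one
    ends, preserving sequence order; returns the start positions, or
    None if any motif is missing in order."""
    positions, start = [], 0
    for core in MOTIF_CORES:
        match = re.compile(core).search(sequence, start)
        if match is None:
            return None              # motif absent in the required order
        positions.append(match.start())
        start = match.end()          # the next motif must come later
    return positions
```

Starting each search where the previous match ended enforces exactly the M1-before-M2-before-M3 constraint described in the text.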
6.2. POSITION SPECIFIC SCORING MATRIX USING PSEUDO COUNTS
Position specific scoring matrices (PSSMs) are calculated from blocks of aligned sequences; the length of a PSSM is the same as the length of its block. We have taken seven blocks corresponding to the seven motifs of the Rhodopsin like family. Each column is represented by a vector over the 20 amino acids: this 20-D vector counts the occurrences of the 20 amino acids in the column and their probabilities are computed, so an amino acid that occurs more frequently receives a higher score. PSSMs can be used to score the alignment of different sequences by sliding each sequence over the PSSM and looking up the value in the corresponding column of the PSSM. The length of the sliding window is the same as the length of the PSSM. Scores for each position of the window are computed and then summed to give the overall alignment score of that particular sliding window. Overall scores are usually computed in terms of log-odds (G.D., 1990) and (Altschul, 1991), so PSSMs are mostly composed of log-odds scores. For simplicity, we call the pseudo count based PSSM PSSM-PC.
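The sliding-window scoring described above can be sketched as follows. This is illustrative Python; it assumes a log-odds PSSM stored as an l x 20 NumPy array with our own column ordering, which is not part of the thesis software.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order assumed for the PSSM

def best_window_score(sequence, pssm):
    """Slide the sequence over an l x 20 log-odds PSSM: each window's
    score is the sum of the PSSM entries for its residues; return the
    best score and its start position."""
    l = pssm.shape[0]                        # PSSM length = motif length
    best_score, best_pos = float("-inf"), -1
    for start in range(len(sequence) - l + 1):
        window = sequence[start:start + l]
        score = sum(pssm[c, AMINO_ACIDS.index(aa)]
                    for c, aa in enumerate(window))
        if score > best_score:
            best_score, best_pos = score, start
    return best_score, best_pos
```

The best-scoring window is the candidate motif location, which is then accepted or rejected against the EVD threshold described in section 6.3.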
The drawback of the simple PSSM method is that the training sequences may be an incomplete sample of the full family set, so some amino acids may be missing in some columns of a block, resulting in zero counts. We have solved this problem by adding artificial pseudo counts for the missing counts. Pseudo counts can be added in various ways; one way is to use traditional scoring matrices like BLOSUM or PAM. We have used the BLOSUM 62 matrix and later computed a GPCR scoring matrix to compute pseudo counts. For a dataset of n sequences and a motif of length l residues, the PSSM-PC is an l x 20 matrix (Bissantz, Logean, & Rognan, 2004). We have computed seven PSSM-PCs corresponding to the seven motifs for each sub family of Rhodopsin like GPCRs. Each element $W_{ca}$ of the matrix is given by:
$W_{ca} = \log_2 \dfrac{f_{ca}}{f_a}$  (6.1)

where $c = 1, 2, \ldots, l$ and $a = 1, 2, \ldots, 20$. The $f_{ca}$ is the frequency of amino acid $a$ at position $c$ of the motif and $f_a$ is the overall frequency of amino acid $a$ in the current training data set. Pseudo counts are added in $f_{ca}$ to account for missing amino acid frequencies (Henikoff & Henikoff, 1996), so $f_{ca}$ is calculated as:
$f_{ca} = \dfrac{n_{ca} + b_{ca}}{N_c + B_c}$  (6.2)

where $b_{ca}$ is the pseudo count for amino acid $a$ at position $c$ of the motif, $n_{ca}$ is the number of counts of amino acid $a$ at position $c$ over the $n$ sequences, $N_c$ is the total number of counts at position $c$, and $B_c$ is the total number of pseudo counts at position $c$. $b_{ca}$ is obtained by multiplying the total number of pseudo counts at position $c$ by a factor $\alpha_{ca}$.

$b_{ca} = B_c\, \alpha_{ca}$  (6.3)
$\alpha_{ca} = \sum_{i=1}^{20} \dfrac{f_{ci}}{N_c}\, \dfrac{q_{ia}}{Q_i}, \qquad Q_i = \sum_{a=1}^{20} q_{ia}$  (6.4)
where $f_{ci}$ is the frequency of amino acid $i$ at position $c$ and $q_{ia}$ is the probability of replacement of amino acid $i$ by $a$ according to the Blosum62 matrix (Henikoff & Henikoff, 1992). We have calculated a PSSM-PC for each sub family. Then, we considered each PSSM as the training PSSM in turn, computed the scores of the other sub families, and plotted them. The plot helped us to identify the relationships between the sub families and to assign unknown receptors to one of the sub families or to put them in a new sub family, as shown in Figure 6-2.
Figure 6-2: PSSM plot tested on Chemokine PSSM
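A minimal sketch of the PSSM-PC computation of eqs. (6.1)-(6.2) follows. For brevity it uses simple background-proportional pseudo counts ($b_{ca} = B_c f_a$) in place of the Blosum62-based $\alpha_{ca}$ of eqs. (6.3)-(6.4); all names and the column ordering are our own.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm_pc(block, background, total_pseudo=5.0):
    """Log-odds PSSM with pseudo counts for one aligned motif block:
    W_ca = log2(f_ca / f_a), with f_ca = (n_ca + b_ca) / (N_c + B_c).
    Here b_ca = B_c * f_a (background-proportional pseudo counts)."""
    length = len(block[0])
    counts = np.zeros((length, 20))
    for seq in block:                        # n_ca: residue counts per column
        for c, aa in enumerate(seq):
            counts[c, AMINO_ACIDS.index(aa)] += 1
    n_c = counts.sum(axis=1, keepdims=True)  # N_c: total counts per column
    f = (counts + total_pseudo * background) / (n_c + total_pseudo)
    return np.log2(f / background)           # l x 20 log-odds matrix
```

The pseudo counts guarantee that residues absent from the training block receive a finite negative score rather than a zero frequency, which is exactly the problem they are introduced to solve.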
6.3. EXTREME VALUE DISTRIBUTION (EVD)
After the computation of the PSSM-PC, we have values for all possible amino acids at all positions of a motif, and each possible sliding window gives one particular score. If the motif length is 26, then there are $20^{26}$ possible window scores. EVDs are normally used to describe the distribution of the maxima or minima (extreme values) of samples of independent, identically distributed random variables; they are used to measure events that occur very rarely. Enumerating all $20^{26}$ scores is computationally far too expensive, so we have taken samples of 2-10 million randomly selected scores to fit the EVD. The EVD can be fitted using various methods, e.g. linear regression and maximum likelihood estimation; we have fitted the scores to the EVD using maximum likelihood estimation (Richard, 1992). There are three types of EVD, i.e. Gumbel (type I), Frechet (type II) and Weibull (type III). There are two statistical measures for the EVD, i.e. the P-value and the E-value. The P-value is the probability of observing at least one score greater than or equal to some score x; the E-value is the expected number of scores greater than or equal to x. We have taken the E-value threshold as 0.1 in most cases and as 0.00001 in a few cases. The probability density function of the EVD is:
Pdf(x) = λ exp(-λ(x-µ)) exp(-exp(-λ(x-µ)))    6.5
The E-value is given by:
E_value(x) = n [1 - exp(-exp(-λ(x-µ)))]    6.6
where x is a score, n is the number of scores, and λ and µ are the scale and location parameters of the EVD. These two parameters are estimated using maximum likelihood estimation. The likelihood of n random scores x_1, x_2, ..., x_n under the extreme value distribution is:
P(x_1, x_2, ..., x_n | λ, µ) = Π_{i=1}^{n} λ exp(-λ(x_i-µ)) exp(-exp(-λ(x_i-µ)))    6.7
By simplifying equation (6.7) we get:
P(x_1, x_2, ..., x_n | λ, µ) = λ^n exp(-λ Σ_{i=1}^{n} (x_i-µ)) exp(-Σ_{i=1}^{n} exp(-λ(x_i-µ)))    6.8
The log likelihood of eq. (6.7) or (6.8) is given by:
log L(λ, µ) = log P(x_1, x_2, ..., x_n | λ, µ)    6.9
log L(λ, µ) = n log λ - λ Σ_{i=1}^{n} (x_i-µ) - Σ_{i=1}^{n} exp(-λ(x_i-µ))    6.10
Now we have to compute the estimates of λ and µ such that the log likelihood is maximized. For this purpose, we take the partial derivatives of the log likelihood function and set them equal to 0.
∂(log L)/∂µ = nλ - λ Σ_{i=1}^{n} exp(-λ(x_i-µ)) = 0    6.11
∂(log L)/∂λ = n/λ - Σ_{i=1}^{n} (x_i-µ) + Σ_{i=1}^{n} (x_i-µ) exp(-λ(x_i-µ)) = 0    6.12
Solving eq. (6.11) for µ, we get:
µ = -(1/λ) log[(1/n) Σ_{i=1}^{n} exp(-λ x_i)]    6.13
Now, substituting this value of µ back into eq. (6.12) and simplifying, we get:
1/λ - (1/n) Σ_{i=1}^{n} x_i + [Σ_{i=1}^{n} x_i exp(-λ x_i)] / [Σ_{i=1}^{n} exp(-λ x_i)] = 0    6.14
After solving eq. (6.14) for λ using the Newton-Raphson method, we substitute λ back into eq. (6.13) to obtain the value of µ. Figure 6-3 shows the pdf of the EVD for motif-1 of the Amine sub family.
Figure 6-3: Plot of pdf for motif-1 of Amine sub family
E-values are used to define thresholds for the identification of motifs or transmembrane regions. The higher the E-value threshold, the greater the number of false positives in motif (or TM) detection. The plot of E-values for motif-3 is shown in Figure 6-4. The plot of the number of false positives (wrong detections of a motif) for different E-values is shown in Figure 6-5.
Figure 6-4: Plot of E-values for motif-3 Amine sub family
Figure 6-5: Number of false positives for different E-values
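Assuming fitted λ and µ, the P-value, E-value, and the score cut-off for a chosen E-value (such as the 0.1 threshold used here) follow directly from eq. (6.6); the inversion is a simple rearrangement:

```python
import math

def gumbel_pvalue(x, lam, mu):
    """P-value: probability of observing at least one score >= x,
    i.e. the survival function of the fitted Gumbel distribution."""
    return 1.0 - math.exp(-math.exp(-lam * (x - mu)))

def gumbel_evalue(x, lam, mu, n):
    """E-value (eq. 6.6): expected number of the n window scores >= x."""
    return n * gumbel_pvalue(x, lam, mu)

def threshold_for_evalue(e, lam, mu, n):
    """Invert eq. (6.6): the score threshold whose E-value equals e."""
    p = e / n
    return mu - math.log(-math.log(1.0 - p)) / lam
```

Raising the E-value threshold admits more candidate windows and hence more false positives, which is what Figure 6-5 illustrates.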
6.4.MOTIF DETECTION ALGORITHM
We have developed an algorithm for the detection of TMs in an unknown Rhodopsin-like sequence. In this algorithm, we have defined seven sliding windows, one equal in length to each of the seven motifs (TMs). These windows are slid over the sequence one by one, scores are computed, and E-values are then calculated from the scores using the λ and µ parameters. We have defined a threshold of E < 0.1 for the detection of all motifs. All scores with E-values less than 0.1 are candidates for a particular motif. We have verified on our data that the true motifs have the highest scores (or lowest E-values) 95-98% of the time. However, to be more careful, we select the top 5 scores as candidates for each motif so that we do not miss any motif. There can therefore be a maximum of 5^7 (= 78,125) possibilities for the selection of the seven motifs in one sequence. We have to select the combination of choices for the 7 motifs such that they are maintained sequentially (i.e. M1 comes first, then M2, then M3, ..., M7). There are four different E-value thresholds. The following are the steps involved in the motif detection algorithm.
1. Slide a motif window over the sequence, find the top 5 scores, and sort them.
2. Repeat step 1 for all 7 motifs.
3. If the top scores of each of the seven motifs preserve the sequence order, then output the locations of these seven top motifs in the test sequence and go to step 9.
4. Assign ratings to all choices of scores for a motif, i.e. the top score's rating = 5, the second = 4, ..., and the last = 1.
5. Find all those combinations in which the scores for the 7 motifs preserve the sequence order M1 → M2 → M3 → ... → M7.
6. Add up the ratings for each of the combinations (e.g. 5+5+4+3+5+4+5 = 31).
7. Select the combination which gives the highest rating.
8. Output the locations of the motifs in the test sequence whose score combination has given the highest rating.
9. End
After the detection of the motifs in the test sequence, its sub family is predicted. In addition, it is aligned against the training alignment of that sub family.
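Steps 4-8 amount to an exhaustive search over the rated candidate combinations (5^7 = 78,125 in the seven-motif case); a sketch with hypothetical (position, rating) inputs:

```python
from itertools import product

def choose_motif_combination(candidates):
    """candidates[m]: list of (position, rating) pairs for motif m, rating 5
    for the best-scoring window down to 1 for the fifth (steps 4-5 above).
    Returns the positions of the order-preserving combination with the
    highest summed rating (steps 6-8), or None if no valid one exists."""
    best, best_rating = None, -1
    for combo in product(*candidates):          # 5**7 = 78,125 choices for 7 motifs
        positions = [pos for pos, _ in combo]
        # The motifs must stay sequential: M1 before M2 before ... before M7.
        if all(a < b for a, b in zip(positions, positions[1:])):
            rating = sum(r for _, r in combo)
            if rating > best_rating:
                best, best_rating = positions, rating
    return best
```

With only 78,125 combinations per sequence, brute-force enumeration is cheap; for larger candidate lists a dynamic-programming pass over sorted positions would scale better.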
6.5.MULTI DIMENSIONAL SCALING
Multi-dimensional scaling (MDS) is a statistical technique used to show similarities or dissimilarities in different types of data. It can also visualize the relationships between items of high dimensional data in a corresponding low dimensional space. It takes as input an N × N symmetric matrix of pairwise dissimilarities with zero diagonal elements, and performs the scaling on it.
We have performed MDS in three ways: sequence similarity between families, sequence similarity individually between all sequences, and PSSM based sequence similarity for each sub family.
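The thesis does not state which MDS variant was used; classical (Torgerson) MDS, one common choice, can be sketched as:

```python
import numpy as np

def classical_mds(d, k=2):
    """Embed an N x N symmetric dissimilarity matrix d (zero diagonal)
    into k dimensions via classical (Torgerson) MDS."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]             # keep the k largest components
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Three collinear "families" at mutual distances 1, 1, 2: a 1-D embedding
# recovers the spacing exactly.
d = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
coords = classical_mds(d, k=1)
```

Families with small mutual distances land close together in the embedding, which is how Figure 6-6 should be read.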
Figure 6-6 shows the sequence similarities between different sub families. We have split most of the families into two parts and have also included many families of unknown receptors. Families that are more similar appear closer in the plot. This method correctly showed the closeness of most of the sub families, e.g. purin1 and purin2. However, in some cases it may not place related families close together, e.g. Beta1 and Beta2. We believe that PSSM based MDS can overcome this problem.
Figure 6-6: MDS plot based on sequence similarity between various sub families
7. CONCLUSIONS AND FUTURE DIRECTIONS
GPCRs are physiologically very important in living organisms and are targeted by more than 50% of the marketed drugs. The number of newly discovered GPCR sequences entering the databanks is increasing day by day, and it is therefore very difficult to annotate them manually. Hence, automatic and accurate classification is highly desired. A lot of research has already been done on the prediction of GPCRs. The focus of this thesis is to propose efficient and accurate prediction techniques for the classification of GPCRs; once a GPCR sequence is classified, it can be considered as a target for the relevant drugs. We have divided the thesis into two parts: alignment independent classification and alignment dependent classification of GPCRs. Alignment dependent classification is more accurate than alignment independent classification because it also includes structural information. In addition, alignment dependent classification can highlight important regions in a GPCR sequence. However, it is very complex and computationally expensive.
7.1.ALIGNMENT INDEPENDENT CLASSIFICATION
Chapters 3, 4, and 5 explain three alignment independent classification techniques. The GPCR classification presented in chapter 3 depends mainly on physiochemical properties and on a hybrid combination of spatial- and transform-domain feature extraction strategies. In the spatial domain, we have used PseAA as the feature extraction strategy, and in the transform domain, we have used MSE based feature extraction. Unlike conventional amino acid composition based feature extraction, pseudo amino acid composition also accounts for the order and length of the sequence. We have used D8354 as the primary dataset. D8354 contains GPCRs belonging to three levels, i.e. the family, sub family, and sub-sub family levels; therefore, we have extracted features at each level with the aforementioned feature extraction strategies. For classification, we have used SVM, NN, and PNN classifiers. At each GPCR level, we have chosen the best combination of feature extraction strategy and classification algorithm to annotate unknown GPCR sequences. We have tested the method on three other datasets, where our approach performed better than the existing methods. The improvement is due to the employment of appropriate physiochemical properties and the hybrid combination of spatial- and transform-domain feature extraction strategies. In addition, SVM has proved to be robust against the curse of dimensionality.
Chapter 4 proposes grey incidence degree (GID) based classification instead of Euclidean distance based classification. We have used three feature extraction strategies, i.e. FFT, PseAA, and SAAC. FFT extracts features in the transform domain, while PseAA and SAAC extract features in the spatial domain. To avoid the curse of dimensionality, we have reduced the features using PCA. We have observed that the hybrid combination of FFT, PseAA, and SAAC based features can improve the overall performance of GPCR prediction. The GID based method has efficiently analyzed the numerical relationships between the various quaternary structures of GPCRs. A GPCR sequence can have a certain level of similarity to one family and a different level of similarity to another family. The division of GPCRs into families, sub families, and sub-sub families is partial, and GID based classification is useful for such partial systems.
In chapter 5, we have proposed an ensemble classification in which the weights are optimized using a genetic algorithm. We have employed a hybrid combination of PseAA and MSE for feature extraction. We have also focused on evolutionary information based feature extraction using position specific scoring matrices. The hybrid combination of evolutionary information based feature extraction with PseAA or MSE can further improve the overall performance of the method, and the employment of evolutionary information in the features has indeed further improved its classification performance. However, the evolutionary information based method that we have proposed is time consuming and hence useful only for small datasets.
7.2.ALIGNMENT DEPENDENT CLASSIFICATION
We have explained different types of sequence alignments in chapter 2. Sequence alignment is useful in understanding the relationships between different sequences or families, and it highlights the conserved regions in a family. GPCRs have transmembrane helical structures. We have analyzed and aligned the 7 transmembrane helical structures of Rhodopsin-like GPCRs and have also proposed a general form for each of the 7 transmembrane helices. We have developed a 7TM detection algorithm, and a pseudo count based PSSM is computed for each transmembrane block. The pseudo count based PSSM can be used to score a TM region and can also help to identify a particular TM region. Pseudo counts help in assigning a score to any amino acid that is absent from the transmembrane region of a particular family or sequence. Unknown receptors can also be identified using pseudo count based PSSMs. Alignment dependent classification utilizes the structural information of the dataset and hence can be more accurate than alignment independent classification. Pseudo count based PSSMs also show the various relationships of each family to its PSSM and can find similarities between various families. The sub families can be further defined using pseudo count based PSSMs.
7.3.FUTURE DIRECTIONS
There are numerous possible future directions of this study. First, the alignment independent classification methods proposed in this thesis can also be used for subcellular localization prediction, membrane protein classification, and mitochondrial protein classification. Second, the physiochemical properties based method can be improved further by employing more appropriate physiochemical properties. Third, PCA can be replaced with a better feature reduction algorithm, or a new feature reduction algorithm suitable for GPCRs can be proposed. Fourth, the alignment dependent method discussed in this thesis can be applied to proteins of other cell parts by identifying their conserved regions. Fifth, 3D structures of GPCRs can be predicted by combining the internal properties of a protein with the properties of its membrane environment, and further by using its backbone coordinates and adding the appropriate side chains of each GPCR. Finally, improvements can be made to BLOSUM by considering different sequence similarity levels depending on the available sequence data for a particular protein family.
8. REFERENCES
Blocks WWW Server. (n.d.). Retrieved from http://blocks.fhcrc.org/
Afridi, T., Khan, A., & Lee, Y. (2012). Mito-GSAAC: Mitochondria Prediction using Genetic Ensemble Classifier
and Split Amino Acid Composition. Amino Acids , 1443-1454.
Altschul, S. (1991). Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol., 219 ,
555-565.
Bailey, T., Williams, N., Misleh, C., & Li, W. (2006). MEME: discovering and analyzing DNA and protein
sequence motifs. Nucleic Acids Res , W369-373.
Baum, L. E., & Petrie, T. (1966). Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The
Annals of Mathematical Statistics , 1554–1563.
Ben, G., Shani, A., Gohr, A., Grau, J., Arviv, S., Shmilovici, A., et al. (2005). Identification of Transcription Factor
Binding Sites with Variable-order Bayesian Networks. Bioinformatics , 2657–2666.
Bhasin, M., & Raghava, G. (2005). GPCRsclass: a web tool for the classification of amine type of G- protein
coupled receptors. Nucleic Acids , 143-147.
Bhasin, M., & Raghava, G. P. (2004). GPCRpred: an SVM-based method for prediction of families and sub-families
of G protein-coupled receptors. Nucleic Acids Res. , 383-389.
Bissantz, C., Logean, A., & Rognan, D. (2004). High-Throughput Modeling of Human G-Protein Coupled
Receptors: Amino Acid Sequence Alignment, Three-Dimensional Model Building, and Receptor Library Screening.
J. Chem. Inf. Comput. Sci., 44 , 1162-1176.
Cardoso, J., Pinto, V., Vieira, F., Clark, M., & Power, D. (2006). Evolution of secretin family GPCR members in the
metazoa. BMC Evolutionary Biology , 6:108.
Chen, Z., Alcayaga, C., Suarez-Isla, B., O'Rourke, B., Tomaselli, G., & Marban, E. (2002). A “minimal” sodium
channel construct consisting of ligated S5-P-S6 segments forms a toxin-activatable ionophore. J Biol Chem.,277 ,
24653–24658.
Chou, K. (2004). Insights from modelling the 3D structure of the extracellular domain of alpha7 nicotinic
acetylcholine receptor. Biochem Biophys Res Commun, 319 , 433–438.
Chou, K. (2005). Prediction of G-protein-coupled receptor classes. J Proteome Res ,4 , 1413-1418.
Chou, K. (2001). Prediction of protein cellular attributes using pseudo amino acid composition. PROTEINS: Structure, Function, and Genetics, 43, 246-255.
Chou, K., & Elrod, D. (2002). Bioinformatical analysis of G-protein-coupled receptors. J Proteome Res 1 , 429-433.
Chou, K., & Shen, H. (2010). A new method for predicting the subcellular localization of eukaryotic proteins with
both single and multiple sites: Euk-mPLoc 2.0 . Plos One , doi:10.1371/journal.pone.0009931.t002.
Chou, K., & Shen, H. (2006). Hum-PLoc: a novel ensemble classifier for predicting human protein Subcellular
localization. Biochem Biophys Res Commun., 347 , 150–157.
Chou, K., & Shen, H. (2006). Predicting eukaryotic protein subcellular location by fusing optimized evidence-
theoretic K-nearest neighbor classifiers. J Proteome Res., 5 , 1888–1897.
Cosic, I. (1994). Macromolecular bioactivity: is it resonant interaction between macromolecules?-Theory and
applications. IEEE Trans Biomed Eng, 41 , 1101–1114.
Das, S., & Banker, G. ( 2006). The role of protein interaction motifs in regulating the polarity and clustering of the
metabotropic glutamate receptor mGluR1a. Journal of Neuroscience , 8115–8125.
Davies, M. (n.d.). BIAS-PROFS. Retrieved from http://www.cs.kent.ac.uk/projects/biasprofs/
Davies, M., Secker, A., Freitas, A., Mendao, M., Timmis, J., & Flower, D. (2007). On the Hierarchical classification
of G-Protein coupled receptors. Bioinformatics, 23, 3113-3118.
Dayhoff, M., Schwartz, R., & Orcutt, B. (1978). A model of evolutionary change in proteins. Atlas of protein
sequence and structure , 345–358.
Deng, J. (1982). Control problems of grey systems. Syst Control Lett., 1(5) , 288–294.
Doyle, D., Morais, C., Pfuetzner, R., Kuo, A., Gulbis, J., Cohen, S., et al. (1998). The structure of the potassium
channel: molecular basis of K+ conduction and selectivity. Science , 280 , 69–77.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins
and nucleic acids. Cambridge University Press.
Edgar, R. (2004 ). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids
Research , 1792–1797.
Elrod, D., & Chou, K. (2002). A study on the correlation of G-protein-coupled receptor types with amino acid
composition. Protein Eng Des Sel, 15 , 713-715.
ENSEMBL. (n.d.). Retrieved from http://www.ensembl.org/index.html
Fauchere, J., & Pliska, V. (1983). Hydrophobic parameters of amino acid side chains from the partitioning of N-
acetyl-amino acid amides. Eur. J. Med. Chem.-Chim. Ther., 18 , 369–375.
Foord, S., Jupe, S., & Holbrook, J. (2002). Bioinformatics and type II G-protein-coupled receptors. Biochemical
Society Transactions , 473–479.
Fredriksson, R., Lagerström, M. C., Lundin, l. G., & Schiöth, H. B. (2003). The G-protein-coupled receptors in the
human genome form five main families. Phylogenetic analysis, paralogon groups, and fingerprints. Molecular
Pharmacology , 1256–1272.
Fridmanis, D., Fredriksson, R., Kapa, I., Helgi, B., & Klovins, J. (2006). Formation of new genes explains lower
intron density in mammalian Rhodopsin G protein-coupled receptors. Molecular Phylogenetics and Evolution , 864–
880.
Stormo, G. D. (1990). Consensus patterns in DNA. Methods Enzymol., 183, 211-221.
Gao, Q., & Wang, Z. (2006). Classification of G protein-coupled receptors at four levels. Protein Eng Des Sel., 19 ,
511-516.
Gao, Q., Wu, C., Ma, X., Lu, J., & He, J. (2008). Classification of amine type G-protein coupled receptors with
feature selection. Protein Pept Lett., 15 , 834-842.
George, S., O'Dowd, B., & Lee, S. (2002). G-Protein Coupled Receptor oligomerization and its potential for drug discovery. Nature Reviews Drug Discovery, 1, 808-820.
GPCRDB. (2012). Retrieved from http://www.gpcr.org/7tm/
Grantham, R. (1974). Amino acid difference formular to help explain protein evolution. Science, 185 , 862–864.
Grasso, C., & Lee, C. (2004). Combining partial order alignment and progressive multiple sequence alignment
increases alignment speed and scalability to very large alignment problems . Bioinformatics , 1546–1556.
Guo, Y., Li, M., Wang, K., Wen, Z., Lu, M., Liu, L., et al. (2005). Fast Fourier transform-based support vector
machine for prediction of G-protein coupled receptor subfamilies. Acta Biochim. Biophys. Sin. , 759–766.
Henikoff, J. G., & Henikoff, S. (1996). Using Substitution Probabilities To Improve Position-Specific Scoring
Matrices. Comput. Appl. Biosci., 12 , 135-143.
Henikoff, S., & Henikoff, J. (1992). Amino Acid Substitution Matrices from Protein Blocks. PNAS , 10915–10919.
Holland, J. H. (1992). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to
Biology, Control, and Artificial Intelligence. Cambridge: MA: MIT Press. ISBN 978-0262581110.
Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohen, F., & Vriend, G. (2003). GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res., 31(1), 294-297.
Howard, A. (2000). Elementary Linear Algebra. Wiley; 8 edition.
Huang, Y., Cai, J., Ji, L., & Li, Y. (2004). Classifying G-protein coupled receptors with bagging classification tree.
Comput Biol Chem., 28 , 275-280.
Hughey, R., & Krogh, A. (1996). Hidden Markov models for sequence analysis: extension and analysis of the basic
method. CABIOS , 95–107.
Inoue, Y., Yamazaki, Y., & Shimizu, T. (2005). How accurately can we discriminate G-protein-coupled receptors as
7-tms TM protein sequences from other sequences? Biochemical and Biophysical Research Communications ,
1542–1546.
Institute, E. B. (n.d.). EMBL-EBI. Retrieved from http://www.ebi.ac.uk/Tools/sss/psisearch/
Cornette, J. L., Cease, K. B., Margalit, H., Spouge, J. L., Berzofsky, J. A., & DeLisi, C. (1987). Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. Journal of Molecular Biology, 195, 659-685.
Javed, J., Khan, A., Majid, A., Mirza, A. M., & Bashir, J. (2007). Lattice Constant Prediction of orthorhombic
ABO3 Perovskites using Support Vector Machines. Computational Materials Science, 39 , 627-634.
Karchin, R., Karplus, K., & Haussler, D. (2002). Classifying G-protein coupled receptors with support vector
machines . Bioinformatics ,18, , 147-159.
Katoh, K., Misawa, K., Kuma, K., & Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence
alignment based on fast Fourier transform . Nucleic Acids Research , 3059–3066.
lib SVM. (n.d.). Retrieved from http://en.pudn.com/downloads136/sourcecode/math/detail580267_en.html
Liu, S., Fang, Z., & Lin, Y. (2005). A new definition for the degree of grey incidence. Sci Inq., 7(2) , 111–124.
Lundstrom, K. H., & Chiu, M. L. (2006). G- protein coupled receptors in drug discovery. CRC Press, Taylor &
Francis Group, Boca Raton, FL.
Mandell, A., Selz, K., & Shlesinger, M. (1997). Wavelet transformation of protein hydrophobicity sequences
suggests their memberships in structural families. Physica A, 244 , 254−262.
Martelli, P., Fariselli, P., Malaguti, L., & Casadio, R. (2002). Prediction of the disulfide bonding state of cysteines in
proteins with hidden neural networks. Protein Eng., 15 , 951-953.
Moereels, H., Lewi, P. J., Koymans, L. M., & Janssen, P. A. (1997). The alpha and omega of G-protein coupled receptors: a novel method for classification. Part 2. Bin classification. Annals of the New York Academy of Sciences, 147-148.
Möller, S., Vilo, J., & Croning, M. (2001). Prediction of the coupling specificity of G protein coupled receptors to
their Gproteins. Bioinformatics, 17, , 174-181.
Mount, D. (2004). Bioinformatics: Sequence and Genome Analysis (2nd edition). New York: Cold Spring Harbor Laboratory Press. ISBN 0-87969-608-7.
Nakagawa, T., Sakurai, T., Nishioka, T., & Touhara, K. (2005). Insect sex-pheromone signals mediated by specific
combinations of olfactory receptors. Science , 1638–1642.
Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino
acid sequence of two proteins. Journal of molecular biology , 443-453.
Notredame, C., Higgins, D., & Heringa, J. (2000). T-coffee: a novel method for fast and accurate multiple sequence
alignment. Journal of Molecular Biology , 205–217.
Oxenoid, K., & Chou, J. (2005). The structure of phospholamban pentamer reveals a channel-like architecture in
membranes. Proc Natl Acad Sci USA, 102 , 10870–10875.
Oxenoid, K., Rice, A., & Chou, J. (2007). Comparing the structure and dynamics of phospholamban pentamer in its
unphosphorylated and pseudo-phosphorylated states. Protein Sci.,16 , 1977–1983.
Papasaikas, P., Bagos, P., Litou, Z., & Hamodrakas, S. (2003). A novel method for GPCR recognition and family
classification from sequence alone using signatures derived from profile hidden Markov models. SAR and QSAR
Environmental Research, 14 , 413-420.
Peng, Z. L., Yang, J. Y., & Chen, X. (2010 ). An improved classification of G-protein-coupled receptors using
sequence-derived features, BMC Bioinformatics 11 (2010). BMC Bioinformatics , doi: 10.1186/1471-2105-11-420.
Prabhu, Y., & Eichinger, L. (2006). The Dictyostelium repertoire of seven transmembrane domain receptors.
European Journal of Cell Biology , 937–946.
Qiu, J., Huang, J., Liang, R., & Lu, X. (2009). Prediction of G-protein-coupled receptor classes based on the concept
of Chou's pseudo amino acid composition: an approach from discrete wavelet transform. Analytical Biochemistry,
390 , 68-73.
Rehman, Z. (2011). GPCR prediction. Retrieved from http://111.68.99.218/GPCR/default.aspx
Rehman, Z. u., Mirza, M. T., Khan, A., & Xhaard, H. (2013). Predicting G-protein-coupled receptors families using
different physiochemical properties and pseudo amino acid composition. Methods Enzymology , 61-79.
Rehman, Z., & Khan, A. (2011). G-protein-coupled receptor prediction using pseudo-amino-acid composition and
multiscale energy representation of different physiochemical properties. Analytical Biochemistry, 412(2), 173-182.
Rehman, Z., & Khan, A. (2012). Identifying GPCRs and their types with Chou's pseudo amino acid composition: an
approach from multi-scale energy representation and position specific scoring matrix. Protein Pept Lett., 19(8) ,
890-903.
Rehman, Z., & Khan, A. (2011). Prediction of GPCRs with pseudo amino acid composition: employing composite
features and grey incidence degree based classification. Protein Pept Lett.,18(9) , 872-878.
Richard, M. (1992). Maximum likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull. Math. Biol., 54, 59-75.
Salam, A.-K. (2012). The 20 Amino Acids - Protein Structure and Structural Bioinformatics. Retrieved from
http://www.proteinstructures.com/Structure/Structure/amino-acids.html
Schaffer, A., Aravind, L., Madden, T., Shavirin, S., Spouge, J., Wolf, Y., et al. (2001). Improving the accuracy of
PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids ,
2994-3005.
Sean, R. E. (2004). Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology, 22, 1035-1036.
Shi, J., Zhang, S., Pan, Q., Cheng, Y., & Xie, J. (2007). Prediction of protein subcellular localization by support
vector machines using multi-scale energy and pseudo amino acid composition. Amino Acids, 33 , 69–74.
Smith, T. F., & Waterman, M. S. (1981). Identification of Common Molecular Subsequences. Journal of Molecular
Biology , 195–197.
Specht, D. (1990). Probabilistic neural networks. Neural Networks, 3, 109-118.
Subramanian, A., Weyer-Menkhoff, J., Kaufmann, M., & Morgenstern, B. (2005). DIALIGN-T: An improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics, 6, 66.
T-Coffee. (n.d.). Retrieved from http://www.ebi.ac.uk/Tools/msa/tcoffee/
Thompson, J., Higgins, D., & Gibson, T. (1994). CLUSTAL W: improving the sensitivity of progressive multiple
sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic
Acids Res , 4673–4680.
Tsai, L., Liou, H., & Jiang, G. (2005). Application of grey relational analysis to the influential factors on natural
frequencies of helical springs. J Grey Syst., 8(2) , 141–156.
Wheeler, D., Barrett, T., Benson, D., & et.al. (2007). Database resources of the national center for biotechnology
information. Nucleic Acids Res., 35 , D5–D12.
Xiao, X., Wang, P., & Chou, K. (2009). GPCR-CA: A cellular automaton image approach for predicting G-protein-
coupled receptor functional classes. J Comput Chem, 30 , 1413-1423.