Classification of G Protein-Coupled Receptors using
Machine Learning Techniques
ZIA-UR-REHMAN
PhD thesis
Department of Computer and Information Sciences, Pakistan Institute of
Engineering and Applied Sciences, Nilore, Islamabad, Pakistan
Classification of G Protein-Coupled Receptors using
Machine Learning Techniques
By
Zia-ur-Rehman
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer and Information Sciences
To
Department of Computer and Information Sciences, Pakistan Institute of
Engineering and Applied Sciences, Nilore, Islamabad, Pakistan
2013
ABSTRACT
G protein-coupled receptors (GPCRs) are located at the boundary of a cell and are used for
intercellular communication. They are mostly found in eukaryotic cells, but can also be found in
some prokaryotic cells. GPCRs modulate synaptic transmission in the spinal cord and brain, and
can trigger signaling pathways for the regulation of cell proliferation and gene expression. They
are physiologically very important: according to one estimate, more than 50% of marketed drugs
target GPCRs. Computational prediction of unknown GPCRs is of great importance in
pharmacology because malfunction of GPCRs can cause many diseases. The goal of this thesis
is to propose new methods for the classification of GPCRs using machine learning approaches.
The work in this thesis is divided into two parts. The first part addresses the
classification of GPCRs using machine learning methods. We analyze biological, statistical, and
transform-domain feature extraction strategies, and exploit various physiochemical properties to
generate discriminative features of GPCR sequences. We have developed several GPCR
classification methods. In the first method, GPCRs are predicted using a hybridization of pseudo
amino acid composition and the multi-scale energy representation of physiochemical properties;
here, our focus is on the introduction of various physiochemical properties (hydrophobicity,
electronic and bulk properties). In the second method, GPCRs are predicted using the grey
incidence degree measure and principal component analysis, whereby the relation between
various components of GPCR sequences is exploited. In the third method, we perform weighted
ensemble classification of GPCRs using evolutionary information and multi-scale energy based
features. The weight of each classifier is optimized using a genetic algorithm, which improves
classification performance.
The second part of the thesis is based on multiple sequence alignment of GPCRs, whereby
we utilize their structural information. The three-dimensional structures of several Rhodopsin-
like GPCRs have been resolved at atomic resolution, validating the prediction, made from
sequence information alone, that the GPCR fold comprises a bundle of seven transmembrane
helices (TMs). The dataset is first aligned using multiple sequence alignment methods and the
TMs are extracted. The dataset is composed of 19 sub-families of Rhodopsin receptors,
belonging to 62 species. Weights are assigned to avoid bias toward any particular species.
Position specific scoring
matrices (PSSMs) are computed for the seven TM regions, and pseudocounts are added using the
conventional BLOSUM62 scoring matrix. Unknown receptors are then classified using the
PSSMs of the known receptors and by TM similarity methods.
Our research may make valuable contributions to the fields of bioinformatics, pattern
classification, and computational biology, and has yielded results comparable with existing
approaches. We conclude that our research may help researchers to further explore membrane
protein classification and other subcellular localization problems.
This thesis was carried out under the supervision of
Dr. Asifullah Khan
Associate Professor
Department of Computer and Information Sciences,
Pakistan Institute of Engineering and Applied Sciences,
Islamabad, Pakistan
This work was financially supported by the Higher Education Commission (HEC) of Pakistan
under the Indigenous 5000 Ph.D. Fellowship Program
Pin # 074-1844-PS4-406.
DECLARATION
I declare that all material in this thesis which is not my own work has been identified, and that
no material has previously been submitted and approved for the award of a degree by this or any
other university.
Signature: ____________________
Author’s Name: Zia-ur-Rehman
It is certified that the work in this thesis was carried out and completed under my
supervision.
Supervisor's signature: _____________
Dr. Asifullah Khan
Associate Professor, DCIS, PIEAS, Islamabad
Head of the Department: _____________
Dr. Javaid Khurshid
DCIS, PIEAS, Islamabad
Dedicated to my parents & supervisor
Acknowledgements
I am thankful to Allah Almighty for showering His blessings on me and honoring me with the
strength and determination to accomplish this PhD research work. He helped me during every
phase of my PhD and guided me to take the right decisions.
I am thankful to my loving parents, siblings, cousins (especially Abid Hussain) and other
relatives, who helped me and prayed for me to successfully complete my PhD degree. I am also
thankful to my friends (especially Shozab Mehdi, Khurram Jawad, Mattiullah, Mehdi Hassan,
Adnan Idris, Iqbal Mirza and the PIEAS Bachelor juniors), who were very helpful during my
stay at PIEAS.
I would also like to thank all my beloved teachers (especially Dr. Mutawwarra, Dr. Abdul
Jalil, Dr. Abdul Majid, Dr. Anila Usman, Mr. Fayyaz and Dr. Henri Xhaard), who guided me
through my course work and research. I am very thankful to Dr. Henri Xhaard for helping me
conduct one phase of this PhD research. I am especially thankful to Dr. Asifullah Khan for
supervising me wholeheartedly and inspiring me to conduct this research. He gave his full
devotion and care to supervising my PhD research. Beyond the PhD research itself, he also
assisted me in non-technical affairs and guided me very well.
Finally, I want to thank the Higher Education Commission (HEC) of Pakistan for the financial
support provided through the Indigenous 5000 PhD program, with reference to Pin # 74-1844-PS4-406.
List of Journal Publications
Zia-ur-Rehman and A. Khan, “G Protein-Coupled receptor prediction using pseudo-amino-acid
composition and multi-scale energy representation of different physiochemical properties”,
Anal. Biochem., 412(2), 2011, pp. 173-182 (impact factor: 3.2)
Zia-ur-Rehman, Asifullah Khan, Muhammad Tayyeb Mirza and Henri Xhaard, “Predicting G-
Protein Coupled Receptors Families using Different Physiochemical Properties and Pseudo
Amino Acid Composition”, Methods in Enzymology, 2012 (impact factor: 2.0)
Zia-ur-Rehman, Asifullah Khan, “Prediction of GPCRs with Pseudo Amino Acid Composition:
Employing Composite Features and Grey Incidence Degree Based Classification”, Protein &
Pept. Lett., 18(9), 2011, pp. 872-878 (impact factor: 1.82)
Zia-ur-Rehman, Asifullah Khan, “Identify GPCRs and their types with Chou's pseudo amino
acid composition: an approach from multi-scale energy representation and position specific
scoring matrix”, Protein & Pept. Lett., 19(8), 2012, pp. 890-903 (impact factor: 1.82)
Zia-ur-Rehman, Maiju Rinne, Henri Xhaard, Asifullah Khan, “Re-classification of Rhodopsin-
like receptors using transmembrane helical structures”, to be submitted to the European Journal
of Pharmaceutical Sciences.
Contents
ABSTRACT .................................................................................................................................................................. 2
List of Figures .............................................................................................................................................................. 14
List of Tables ............................................................................................................................................................... 16
List of Abbreviations ................................................................................................................................................... 17
Symbol Table ............................................................................................................................................................... 19
1. INTRODUCTION .............................................................................................................................................. 21
1.1. STRUCTURE OF GPCRS ......................................................................................................................... 21
1.2. GPCR CLASSIFICATIONS AND THEIR SIGNIFICANCE ................................................................... 22
1.3. RESEARCH CONTRIBUTIONS AND OBJECTIVES ............................................................................ 24
1.4. STRUCTURE OF THESIS ........................................................................................................................ 25
2. LITERATURE SURVEY AND THEORY ........................................................................................................ 27
2.1. ALIGNMENT DEPENDENT CLASSIFICATION OF GPCRS ............................................................... 27
2.1.1. Sequence alignment ........................................................................................................................... 27
2.1.1.1 Local and global alignments ......................................................................................................... 28
2.1.1.2 Pairwise alignments ...................................................................................................................... 28
2.1.2. Multiple Sequence Alignment ........................................................................................................... 29
2.1.2.1 Progressive alignments ................................................................................................................. 29
2.1.2.2 Iterative methods ........................................................................................................................... 30
2.1.2.3 Hidden Markov models ................................................................................................................. 30
2.1.2.4 Motif finding algorithms ............................................................................................................... 30
2.1.2.5 Genetic algorithms and simulated annealing methods .................................................................. 32
2.1.3. Protein scoring matrices .................................................................................................................... 32
2.1.3.1 Point accepted mutation (PAM) .................................................................................................... 33
2.1.3.2 Block substitution matrix (BLOSUM) .......................................................................................... 34
2.1.4. Position specific scoring matrices (PSSM)........................................................................................ 35
2.2. ALIGNMENT INDEPENDENT CLASSIFICATION .............................................................................. 36
2.2.1. Machine Learning .............................................................................................................................. 36
2.2.2. Feature Extraction Strategies ............................................................................................. 37
2.2.2.1 Amino Acid Composition ............................................................................................. 38
2.2.2.2 Pseudo Amino Acid Composition ................................................................................. 38
2.2.2.3 Wavelet based multi scale energy features ................................................................... 39
2.2.2.4 Fast Fourier transform based features ........................................................................... 40
2.2.2.5 Split amino acid ............................................................................................................ 41
2.2.2.6 Evolutionary information based features using PSSM .................................................. 42
2.2.3. Classification Algorithms .................................................................................................. 43
2.2.3.1 Nearest Neighbor .......................................................................................................... 43
2.2.3.2 Support vector machines ............................................................................................... 43
2.2.3.3 Probabilistic Neural Network ........................................................................................ 45
2.2.4. Performance Assessment ................................................................................................... 46
2.2.5. Genetic Algorithms ........................................................................................................... 47
2.2.5.1 Initialization .................................................................................................................. 48
2.2.5.2 Selection ........................................................................................................................ 48
2.2.5.3 Genetic operators .......................................................................................................... 49
2.2.5.4 Termination ................................................................................................................... 50
2.3. GPCR DATASETS .................................................................................................................................... 50
3. GPCR PREDICTION BY EMPLOYING PHYSIOCHEMICAL PROPERTIES USING HYBRID
FEATURES ............................................................................................................................................... 52
3.1. PHYSIOCHEMICAL PROPERTIES ......................................................................................... 53
3.2. FEATURE EXTRACTION AND CLASSIFICATION .............................................................. 54
3.3. GPCR-HYBRID ......................................................................................................................................... 55
3.4. RESULTS AND DISCUSSIONS .............................................................................................................. 56
3.4.1. Family Level Classification ............................................................................................................... 56
3.4.1.1 Performance for PseAA2 .............................................................................................................. 57
3.4.1.2 Performance for PseAA3 .............................................................................................................. 57
3.4.1.3 Performance for MSE-PseAA ....................................................................................................... 57
3.4.1.4 Performance using MSE-AA ........................................................................................................ 57
3.4.2. Sub Family Classification.................................................................................................................. 58
3.4.2.1 Performance for PseAA2 .............................................................................................................. 58
3.4.2.2 Performance for PseAA3 .............................................................................................................. 59
3.4.2.3 Performance for MSE-PseAA ....................................................................................................... 59
3.4.2.4 Performance for MSE-AA ............................................................................................................ 59
3.4.3. Sub-sub Family Classification ........................................................................................................... 60
3.4.3.1 Performance for PseAA2 .............................................................................................................. 60
3.4.3.2 Performance for PseAA3 .............................................................................................................. 61
3.4.3.3 Performance for MSE-PseAA ....................................................................................................... 61
3.4.3.4 Performance for MSE-AA ............................................................................................................ 61
3.4.4. Comparison with Selective Top Down Approach ............................................................................. 62
3.4.5. Comparison with other methods ........................................................................................................ 63
4. GPCRs PREDICTION USING GREY INCIDENCE DEGREE MEASURE AND PRINCIPAL
COMPONENT ANALYSIS .......................................................................................................................... 66
4.1. GREY INCIDENCE DEGREE MEASURE .............................................................................................. 67
4.2. PRINCIPAL COMPONENT ANALYSIS ................................................................................................. 68
4.3. RESULTS AND DISCUSSIONS ............................................................................................. 69
4.3.1. Family level classification ................................................................................................................. 70
4.3.2. Sub family level classification ........................................................................................................... 70
4.3.3. Sub-sub family level classification .................................................................................................... 70
4.3.4. Comparison with other methods ........................................................................................................ 71
4.3.4.1 Comparison with Selective top down approach ............................................................................ 71
4.3.4.2 Comparison with other existing methods on D167 and D566 datasets ......................................... 72
5. GPCRs PREDICTION USING GENETIC ALGORITHM BASED ENSEMBLE CLASSIFICATION ............. 74
5.1. CLASSIFICATION ALGORITHM .......................................................................................................... 75
5.2. WEIGHT OPTIMIZATION USING GENETIC ALGORITHM ............................................................... 75
5.3. RESULTS AND DISCUSSIONS .............................................................................................................. 77
5.3.1. Classification performance on D8354 ............................................................................................... 77
5.3.1.1 Family level classification............................................................................................. 77
5.3.1.2 Classification performance at sub family level ............................................................. 78
5.3.1.3 Classification performance at sub-sub family level ...................................................... 78
5.3.2. Comparison with existing approaches on D8354 .............................................................................. 81
5.3.3. Comparison on D167, D365 and D566 datasets ................................................................................ 82
6. ALIGNMENT BASED STRUCTURAL CLASSIFICATION OF GPCRS USING TRANSMEMBRANE
REGIONS .................................................................................................................................................................... 89
6.1. SEVEN MOTIFS OF RHODOPSIN LIKE GPCRS .................................................................................. 90
6.2. POSITION SPECIFIC SCORING MATRIX USING PSEUDO COUNTS .............................................. 91
6.3. EXTREME VALUE DISTRIBUTION (EVD) .......................................................................................... 93
6.4. MOTIF DETECTION ALGORITHM ....................................................................................................... 96
6.5. MULTI DIMENSIONAL SCALING ........................................................................................................ 97
7. CONCLUSIONS AND FUTURE DIRECTIONS .............................................................................................. 99
7.1. ALIGNMENT INDEPENDENT CLASSIFICATION .............................................................................. 99
7.2. ALIGNMENT DEPENDENT CLASSIFICATION ................................................................................ 100
7.3. FUTURE DIRECTIONS ......................................................................................................................... 101
8. REFERENCES ................................................................................................................................................. 102
List of Figures
Figure 1-1: Structure of Rhodopsin receptor ............................................................................................................... 22
Figure 1-2: GPCR classification methods ................................................................................................................... 25
Figure 2-1: Overview of chapter 2 ............................................................................................................................... 28
Figure 2-2: Simple motif based alignment of GPCRs ................................................................................................. 31
Figure 2-3: PAM matrix ................................................................................................................ 34
Figure 2-4: Classification of GPCRs using machine learning ..................................................................................... 38
Figure 3-1: Overview of chapter 3 ............................................................................................................................... 53
Figure 3-2: GPCR-Hybrid web interface ..................................................................................................................... 55
Figure 3-3: Working of GPCR-Hybrid ........................................................................................................................ 56
Figure 3-4: GPCR classification performance for family level in terms of Accuracy, sensitivity and specificity ...... 58
Figure 3-5: GPCR classification performance for family level in terms of MCC and F-Measure .............................. 59
Figure 3-6: GPCR classification performance for sub family level ............................................................................. 60
Figure 3-7: GPCR classification performance for sub-sub family level ...................................................................... 61
Figure 3-8: Comparison with Selective Top Down method ........................................................................................ 62
Figure 3-9: Comparison on D167 dataset .................................................................................................................... 63
Figure 3-10: Comparison on D365 dataset .................................................................................................................. 64
Figure 3-11: Comparison on D566 dataset .................................................................................................................. 64
Figure 4-1: Overview of chapter 4 ............................................................................................................................... 67
Figure 4-2: Overview of GPCR-GID ........................................................................................................................... 69
Figure 4-3: Performance of GID and Euclidean distance methods .............................................................. 70
Figure 4-4: Comparison with selective top down approach ........................................................................................ 71
Figure 4-5: Comparison on D167 ................................................................................................................................ 72
Figure 4-6: Comparison on D566 ................................................................................................................................ 73
Figure 5-1: Overview of chapter 5 ............................................................................................................................... 75
Figure 5-2: Overview of PSE-PSSM method .............................................................................................................. 77
Figure 5-3: GA run for family level............................................................................................................................. 78
Figure 5-4: GA run for subfamily level ....................................................................................................................... 79
Figure 5-5: GA run for sub-subfamily level ................................................................................................................ 80
Figure 5-6: Classification performance on D8354 dataset ........................................................................................... 80
Figure 5-7: Comparison on D8354 dataset .................................................................................................................. 82
Figure 5-8: Classification performance on D365 and D566 datasets ........................................................... 83
Figure 5-9: Comparison on D167 dataset .................................................................................................................... 83
Figure 5-10: GA run for D167 using MSE-PseAA ..................................................................................................... 84
Figure 5-11: GA run for D167 using PSE-PSSM ........................................................................................................ 85
Figure 5-12: Classification performance on D365 and D566 datasets ........................................................................ 85
Figure 5-13: GA run for D365 dataset ......................................................................................................................... 86
Figure 5-14: GA run for D566 ..................................................................................................................................... 87
Figure 5-15: Comparisons on D365 dataset in terms of % accuracy ........................................................................... 87
Figure 5-16: Comparison on D566 .............................................................................................................................. 88
Figure 6-1: Overview of chapter 6 ............................................................................................................................... 90
Figure 6-2: PSSM plot tested on Chemokine PSSM ................................................................................................... 93
Figure 6-3: Plot of pdf for motif-1 of Amine sub family ............................................................................................. 95
Figure 6-4: Plot of E-values for motif-3 Amine sub family ........................................................................................ 96
Figure 6-5: Number of false positives for different E-values ...................................................................................... 96
Figure 6-6: MDS plot based on sequence similarity between various sub families..................................................... 98
List of Tables
Table 2-1: BLOSUM 62 matrix ................................................................................................................................... 34
List of Abbreviations
AA Amino Acid
BLAST Basic Local Alignment Search Tool
BLOSUM Block Substitution Matrix
COV Covariance matrix
DWT Discrete Wavelet Transform
FFT Fast Fourier Transform
GA Genetic Algorithm
GID Grey Incidence Degree
GPCR G Protein-Coupled Receptor
HMM Hidden Markov Model
MAFFT Multiple Alignment with Fast Fourier Transform
MCC Matthews Correlation Coefficient
MEME Multiple EM for Motif Elicitation
MSA Multiple Sequence Alignment
MSE Wavelet based Multi-Scale Energy
MUSCLE Multiple Sequence Comparison by Log-Expectation
NN Nearest Neighbor
PAM Point Accepted Mutation
PCA Principal Component Analysis
PNN Probabilistic Neural Network
POA Partial Order Alignment
PseAA Pseudo amino acid
PseAA2 Pseudo amino acid with two physiochemical properties
PseAA3 Pseudo amino acid with three physiochemical properties
PSI-BLAST Position Specific Iterated BLAST
PSSM Position Specific Scoring Matrix
RBF Radial Basis Function
SAAC Split Amino Acid Composition
SAM Sequence Alignment and Modeling System
SVM Support Vector Machine
T-Coffee Tree-based Consistency Objective Function for Alignment Evaluation
TM Transmembrane
7TM Seven Transmembrane
Symbol Table
a_mk Root mean square energy of the wavelet approximation coefficients at the
mth decomposition level of the kth sequence
d_jk Root mean square energy of the wavelet detail coefficients at the
corresponding jth decomposition level of the kth sequence
D Euclidean distance
f_i Occurrence frequency of amino acid i
G_i GPCR sequence i
g_ij jth amino acid in GPCR sequence i
H, h Physiochemical property function for an amino acid
i, j, k Indices of elements
N Total number of sequences
n, m Total number of elements in a vector
c_i Correlation factors in PseAA
R_i Physiochemical property value of amino acid i
L Length of a GPCR sequence
L Length of a motif of a GPCR sequence
C Classes
S_ij Substitution score for replacing amino acid i with amino acid j
O Grey incidence degree between two sequences
P Occurrence probability of amino acids
q_i Background probabilities
T Final form of extracted features for a GPCR sequence
V Decision function for classifiers
Z(i), z(i) Class prediction by a classifier for GPCR sequence i
1. INTRODUCTION
The cell is the basic functional unit of all living organisms. Organisms can be classified into two
categories, i.e. unicellular or multicellular. Each cell has an outer membrane that protects it from
unwanted substances in the environment. Cells communicate with each other through signaling
pathways. G Protein-Coupled Receptors (GPCRs) provide this cellular communication by
transducing extracellular stimuli into intracellular signals. GPCRs are a family of membrane
proteins found mainly in eukaryotic cells, such as those of yeasts, plants and animals, and also in
some prokaryotes such as bacteria. GPCRs perform various tasks, such as triggering signaling
pathways, regulating gene expression and cell proliferation, controlling the proper reaction of
cells, tissues, organs and organisms to the changing environment, and modulating synaptic
transmission in the brain and spinal cord (Lundstrom & Chiu, 2006). Due to their biological
significance, GPCRs are widely useful for drug discovery: currently, more than 50% of drugs on
the market target GPCRs (Lundstrom & Chiu, 2006) (Bhasin & Raghava, 2004).
1.1.STRUCTURE OF GPCRS
GPCR sequences are polypeptide chains made up of amino acids. Amino acids are the basic
building blocks of proteins. There are 20 amino acids (Salam, 2012), named:
Cysteine (C), Alanine (A), Glutamic acid (E), Aspartic acid (D), Glycine (G), Phenylalanine (F),
Isoleucine (I), Histidine (H), Leucine (L), Lysine (K), Asparagine (N), Methionine (M),
Glutamine (Q), Proline (P), Serine (S), Arginine (R), Valine (V), Threonine (T), Tyrosine (Y), and
Tryptophan (W).
The structure of Rhodopsin-like GPCRs consists of an extracellular N-terminus and an intracellular
C-terminus, as shown in Figure 1-1. They have transmembrane (TM) helical structures passing
through the membrane seven times and are hence called seven-TM (7TM) receptors. The 7TM
structure is connected by three extracellular and three intracellular loops, and the receptor is
coupled to the G protein alpha, beta, and gamma subunits.
Figure 1-1: Structure of Rhodopsin receptor
The sample GPCR sequence belonging to Rhodopsin family and Amine sub family of GPCR is
given below:
>ENSGMOP00000010676_Gmor_/1-1355
NMSVDWDPWFASYIAMEVVIAVLSVLGNVLVVWAVILNRSLRDTTFCFIFSLALADIAV
GSLAIPLAITISIGLQTTFYSCLVGTCTMLVLTQSSILALLAIAIDRYLRVKIPMSYRWVVT
PRRARTAVGLCWLVSFMVGLTPLLGWNKLQHANGTVGSGPEAQMTCTFENIISMDYM
VYFNFLGWVLPPLLLMLLIYIEIFYIIHKHLNKKVTASQAGPRRRQDYGKELKLVKSLAL
VLFLFTVSWLPVHILNCITLFCPKCVEHKKGIRIAILLSHGNSAVNPVVYSFHINKFHTAF
RKIWQQYILCRDPVGKLPQKSGQSGWNHAVRRRHNSKDAHEF.
Throughout this work, our datasets consist of sequences of this type.
1.2.GPCR CLASSIFICATIONS AND THEIR SIGNIFICANCE
GPCRs can be categorized into families in various ways. Based on the similarity of the
transmembrane region, GPCRs are divided into five families: Rhodopsin, Secretin,
Adhesion, Glutamate, and Frizzled (George, O`Dowd, & Lee, 2012). However, on the basis of
sequence homologies, GPCRs are divided into six families, i.e. Rhodopsin, Secretin,
Metabotropic glutamate, Pheromone, Cyclic AMP, and Frizzled receptors (Horn, Bettler, Oliveira,
Campagne, Cohen, & Vriend, 2003). The structure of only Rhodopsin like GPCR has been
solved up until now. It is the biggest family of GPCRs and comprises about 80% of all GPCRs.
Rhodopsin family receptors are activated by many signals, such as peptides, nucleotides, small
monoamines (e.g. adrenaline or dopamine), and odorant molecules; some are activated by
proteolysis, while Rhodopsin itself reacts to light through the activation of a chromophore (Rehman
& Khan, 2011), (Fridmanis, Fredriksson, Kapa, Helgi, & Klovins, 2006). They control paracrine,
autocrine, and endocrine processes and transduce extracellular signals through interaction with
nucleotide-binding proteins. The structure of the Rhodopsin family is described in section 1.1. Its
known receptors are usually further classified into sub families named Amine,
Prostaglandin, Beta, Sog and MCH, Opsin, Meca, Melatonin, Purin, Chemokine, Mas, and
Glycoprotein. Some of these sub families are further divided into two groups, making 19 sub
families in total. Secretin family receptors play an important role in the binding of some parathyroid
hormones and glucagon (Cardoso, Pinto, Vieira, Clark, & Power, 2006) and are mostly found in
animals. Their known receptors can be further classified into three sub families. Metabotropic
family receptors are triggered by metabotropic processes (Das & Banker, 2006) and are involved
in peripheral and central nervous system processes. They control learning capabilities and feelings
of grief and pain. The Pheromone family is involved in chemical interaction in some organisms
(Nakagawa, Sakurai, Nishioka, & Touhara, 2005). It consists of eight different types of
receptors forming three sub groups. Cyclic AMP receptors perform chemotactic signaling in
slime molds (Prabhu & Eichinger, 2006) and control development in Amoeba species.
Frizzled and Smoothened receptors perform Wnt binding (Foord, Jupe, & Holbrook,
2002). There are 10 Frizzled receptors; their function is to control embryonic development, cell
polarity, cell proliferation, and the formation of neural synapses. In the 1980s and 1990s, newly
sequenced GPCRs were first named from their pharmacological properties. With the expansion
of known GPCRs, molecular phylogenetics has been increasingly used to confirm the naming
convention of new sequences according to evolutionary criteria (Rehman, Mirza, Khan, &
Xhaard, 2013). There are several works on the phylogenetic classification of GPCRs, such as
(Moereels, Lewi, Koymans, & Janssen, 1997) and (Fredriksson, Lagerström, Lundin,
& Schiöth, 2003).
Because of the importance of GPCRs, research is being done on their computational
classification. The computational classification of GPCRs can be divided into two
categories, i.e. alignment based classification (phylogenetic analysis) and alignment
independent classification.
1.3.RESEARCH CONTRIBUTIONS AND OBJECTIVES
The classification of GPCRs can help in understanding their functions. Historically, GPCRs were
classified based on their pharmacological response; molecular phylogenetic analysis was then
used to cluster similar sequences together. Nowadays, it is largely agreed that the best
way to classify unknown GPCR sequences is through a phylogenetic analysis that includes
chromosomal mapping. Such studies are, however, difficult and inaccurate over long
evolutionary distances, and with the increasing number of newly discovered GPCRs,
experiment-based classification has become very expensive and infeasible. Hence, the demand for
computational classification has increased. The overall objective of our research is to perform
efficient computational classification of GPCRs. We have analyzed both alignment dependent and
alignment independent classification of GPCRs. We have used various machine learning,
evolutionary, statistical, and alignment algorithms, and adopted the following methods for the
classification of GPCRs:
Hybrid Classification of GPCRs using physiochemical properties
GPCR classification using grey incidence degree measure and Principal Component
Analysis
GPCR classification using ensemble approaches and evolutionary information
Alignment based structural classification of GPCRs using seven TM regions and position
specific scoring matrices.
The block diagram of our research work related to GPCRs classification is shown in Figure 1-2.
Figure 1-2: GPCR classification methods
1.4.STRUCTURE OF THESIS
In chapter 1, we have described GPCRs, their importance in different organisms and their
classifications. In chapter 2, we have given the detailed literature survey of existing GPCR
classification methods. In addition, we have also mentioned some feature extraction strategies
and classification algorithms used in our research. Further, optimization algorithms, machine
learning, alignment based methods, and the details of all the data sets are discussed in chapter 2.
Chapter 3 mainly focuses on the feature extraction of GPCRs using physiochemical properties.
Three physiochemical properties are used in this work: Hydrophobicity, Electronic, and Bulk.
These physiochemical properties are employed using various feature extraction
strategies. Chapter 4 discusses the GPCR classification using Grey incidence degree measure and
Principal Component Analysis (PCA). We use features obtained through Fast Fourier transform,
Split amino acid, and Pseudo amino acid composition. The PCA is employed to reduce the
number of features. Chapter 5 discusses weighted ensemble based approaches for the
classification of GPCRs. The Position Specific Scoring Matrices (PSSM) are used to extract
evolutionary features. The weights are optimized using binary genetic algorithms. Chapter 6
discusses the transmembrane domain based classification and alignment of Rhodopsin like
GPCRs. We also discuss general TM shapes and structures for different sub families of
Rhodopsin like GPCRs and the identification of motifs in GPCRs using PSSM and pseudo
counts. The generalization of this method can help in detecting motifs in other protein families as
well. In Chapter 7, we present the conclusion of the overall research along with our major
achievements, and discuss the future directions and improvements that can be applied to the
proposed methods.
2. LITERATURE SURVEY AND THEORY
In this chapter, we will give a detailed description of the existing GPCR classification and
alignment approaches. We will also explain the algorithms and terminologies used in the present
research. First, we will give an overview of alignment approaches and alignment based
classification of GPCRs. Later, we will explain alignment independent classification of GPCRs and
machine learning approaches. Then we will explain the different feature extraction, classification,
and optimization tools and the different GPCR datasets used in the present research. The layout of
chapter 2 is shown in Figure 2-1.
2.1.ALIGNMENT DEPENDENT CLASSIFICATION OF GPCRS
In alignment based classification methods, an alignment between the sequences is generated first,
and the classification task is then performed using that alignment. Alignment based classification
utilizes the structural information of the GPCR sequences. There are various approaches
which predict GPCRs using their 7TM regions (Inoue, Yamazaki, & Shimizu, 2005).
2.1.1. Sequence alignment
Technically speaking, sequence alignment is simply the re-arrangement of sequences such
that similar regions can be identified between them. These regions of similarity can be
functionally or structurally related. Sequences are placed row-wise, and gaps are often inserted
between the amino acid residues so that identical amino acid letters are aligned in successive
columns, e.g.
AM - - AMTCFGHGAMKCMTCMAK
- MCACMTMMHM - M -CMT - - - - -
Figure 2-1: Overview of chapter 2
Protein alignments usually use a substitution matrix, such as BLOSUM62 or PAM, to assign scores
to amino acid matches or mismatches, and a gap penalty for matching an amino acid in one
sequence to a gap in the other. There are different categories of sequence alignment methods,
such as local, global, and pairwise alignment.
2.1.1.1 Local and global alignments
Local alignment aligns only a small portion of a set of sequences, while global alignment
aligns all residues in a set of sequences. An example of global alignment is the
Needleman–Wunsch algorithm (Needleman & Wunsch, 1970), and an example of local alignment
is the Smith–Waterman algorithm (Smith & Waterman, 1981).
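As a concrete illustration of global alignment, the Needleman–Wunsch recurrence can be sketched in a few lines (a minimal sketch with a toy scoring scheme of match +1, mismatch -1, gap -2; a real protein alignment would instead draw scores from a substitution matrix such as BLOSUM62):

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    # dp[i][j] = best score aligning the prefix a[:i] with the prefix b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap                    # a[:i] aligned against gaps only
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap                    # b[:j] aligned against gaps only
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag,              # align a[i-1] with b[j-1]
                           dp[i - 1][j] + gap,   # gap inserted in b
                           dp[i][j - 1] + gap)   # gap inserted in a
    return dp[len(a)][len(b)]
```

Local alignment (Smith–Waterman) differs mainly in clamping each cell at zero and taking the maximum over the whole table.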
2.1.1.2 Pairwise alignments
Pairwise alignments are performed only between two sequences at a time. Some of the Pairwise
alignment methods are dot matrix, dynamic programming, and word methods (Mount, 2004).
They generate a highly accurate alignment, but if the number of sequences to be aligned
increases, these methods become very time-consuming and computationally expensive.
2.1.2. Multiple Sequence Alignment
Multiple sequence alignment (MSA) is the alignment of a set of sequences at a time. It can be used
to identify structurally or functionally similar regions across the set of sequences. The main
objective of MSA is to maximize the number of matches between the sequences and to minimize
gap insertions and mismatches. MSAs are computationally very expensive and difficult to build.
There are three possibilities at each position of an alignment, i.e. gaps, matches, and
mismatches.
AM -- CMTCFGHGAMKCMTCMAK
- MCACMTMMMM-M -CMT - - - - -
- - - MTMKAN - - - - MT- CM - - - - - -
There are various approaches for performing multiple sequence alignment, i.e. dynamic
programming methods, progressive methods, iterative methods, Hidden Markov Models
(HMM), genetic algorithms, simulated annealing, and motif finding methods. Dynamic
programming can guarantee the optimal multiple sequence alignment, but it is computationally
very expensive if the number of sequences is more than four.
2.1.2.1 Progressive alignments
Progressive alignment starts with a pairwise alignment between the most similar sequences and
progresses towards the most dissimilar ones. Alignments are produced by first putting the
sequences in a tree structure, known as a guide tree, in which the two most similar sequences
become siblings connected to the same ancestor. MSAs are built bottom-up along the guide tree by
pairwise alignments of sets of sequences. The pairwise distance matrix of the sequences is
computed using pairwise sequence alignments, and the distance matrix is used to differentiate
between closely related and distant sequences. The drawback of progressive methods is that a
mistake at any lower tree level propagates to the upper levels and cannot be corrected; also, they do
not guarantee a globally optimal alignment. There are various algorithms implementing
progressive methods, i.e. MAFFT, T-Coffee, and CLUSTALW. MAFFT uses the Fast Fourier
transform (FFT) to locate similar regions (Katoh, Misawa, Kuma, & Miyata, 2002). Although T-
Coffee is slower than CLUSTALW, it provides a more accurate alignment for protein
sequences which are distantly related (Thompson, Higgins, & Gibson, 1994), (Notredame,
Higgins, & Heringa, 2000).
2.1.2.2 Iterative methods
These methods produce an initial alignment by supposition and then iteratively improve the
multiple sequence alignment by minimizing error. The overall quality of the multiple sequence
alignment depends on the initial alignment. Common algorithms for iterative alignment are
DIALIGN-T (Subramanian, M.J., Kaufmann, & Morgenstern, 2005) and MUSCLE (Edgar, 2004).
2.1.2.3 Hidden Markov models
Hidden Markov models (HMMs) are solely based on probabilities (Baum & Petrie, 1966). An
HMM generates an accurate multiple sequence alignment, or a family of multiple sequence
alignments, by assigning likelihoods to all possibilities of gaps, matches, and mismatches. Both
local and global alignments can be generated by HMMs. MSAs in HMMs are represented by a
directed acyclic graph, whose nodes show possible entries in the multiple sequence alignment.
There are two types of states in HMMs, i.e. observed states and hidden states: the observed states
show columns of the alignment and the hidden states show an ancestor sequence. Software used
for implementing HMMs includes partial order alignment (POA) (Grasso & Lee, 2004), the
sequence alignment and modeling system (SAM) (Hughey & Krogh, 1996), and HMMER (Durbin,
Eddy, Krogh, & G., 1998).
2.1.2.4 Motif finding algorithms
From an evolutionary point of view, the hypothetical "common ancestor" of all current GPCRs is
likely to present seven motifs, characteristic of each of the 7TMs. However, for present day
GPCRs, some of these motifs might have mutated and are therefore not detectable anymore. In
addition, a part of the amino acid sequence might have evolved to present the same succession of
letters as the canonical motifs. This is more likely for shorter motifs, and for motifs containing
amino acids with high background frequencies. For example, it is more probable to find the motif
'GN' than 'CWxxPxxxY' in a random part of the sequence, simply because GN is composed of
fewer letters. This makes motif detection a challenging task.
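To make the preceding point concrete, the expected number of chance occurrences of a motif in a random sequence can be estimated as the product of the background frequencies of its fixed letters, summed over all start positions (a rough sketch assuming independent residues with a uniform background of 0.05 per amino acid, which real sequences only approximate):

```python
def expected_chance_hits(motif, seq_len, background=0.05):
    """Expected number of chance matches of a motif (with 'x' wildcards)
    in a random sequence of length seq_len, assuming independent residues
    with a uniform background frequency for each of the 20 amino acids."""
    p = 1.0
    for ch in motif:
        if ch != 'x':                      # 'x' matches any residue
            p *= background
    positions = seq_len - len(motif) + 1   # possible start positions
    return positions * p

# In a 350-residue sequence, a two-letter motif like 'GN' is expected to
# occur by chance far more often than the longer motif 'CWxxPxxxY'.
```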
In the Rhodopsin-like GPCR family, there are seven TM regions, which could also be termed
patterns or motifs. Aligning the common patterns or motifs can result in a multiple sequence
alignment. These motifs should occur sequentially in sequences, i.e. first motif 1, then motif 2,
and so on up to motif 7. It is possible that a sequence contains more than one occurrence of a
particular motif, or that a motif is not present at all. It is also possible that motif 1 is found after
motif 2; in that case, either motif 1 or motif 2 has to be ignored, depending on which is more
appropriate for the alignment. It is a challenging task to identify motifs and to preserve their
sequential order. One simple way to identify or define motifs in protein sequences is to look at a
multiple sequence alignment of sequences belonging to the same family, see the conserved amino
acid regions, and create a consensus of those regions. That conserved region can then help to
identify motifs in a new sequence of the same family. A consensus is not necessarily the right
combination; therefore, a regular expression notation should be adopted for searching for any
particular motif or pattern. The final motif should maximize true positives and minimize false
positives. Commonly used tools for motif finding are BLOCKS (Blocks WWW Server), MEME
(Bailey, Williams, Misleh, & Li, 2006), and MAST.
Figure 2-2: Simple motif based alignment of GPCRs
2.1.2.5 Genetic algorithms and simulated annealing methods
A genetic algorithm is a general-purpose optimization method with many applications in computer
science and bioinformatics. It starts by generating a population of chromosomes (made up of
genes). Every possible alignment can be represented as a chromosome composed of N genes,
where N is the number of sequences to be aligned. The genetic algorithm first generates some
MSAs, called the population of chromosomes, and evaluates the fitness of the population. A fitness
function can be defined based on the number of matching symbols and their locations in the
sequences, the number of gaps, or the sum-of-pairs method, which is used to assess the quality of
an MSA. The algorithm then performs selection, crossover, and mutation operations to improve the
MSA. In crossover, two MSAs are combined to form two new MSAs; some of the MSAs are
mutated. The fitness of the new or edited MSAs is calculated again and all MSAs are ranked by
fitness. The best MSAs are then selected to produce offspring for the next generation, and the
process is repeated until satisfactory solutions evolve or the maximum number of generations is
reached. Genetic algorithms perform better than dynamic programming methods when the number
of sequences is high. Genetic algorithms can be processed in parallel and can take advantage of
parallel computers. Their key advantage over other optimization methods is that they only need a
fitness function to evaluate the quality of different solutions; there is no need to change the inner
workings of the algorithm.
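The sum-of-pairs fitness mentioned above can be sketched as follows (a minimal sketch: each column of a candidate MSA is scored over all pairs of rows with an illustrative match/mismatch/gap scheme rather than a real substitution matrix):

```python
from itertools import combinations

def sum_of_pairs(msa, match=1, mismatch=-1, gap=-2):
    """Fitness of an MSA (a list of equal-length, gapped strings):
    the sum of pairwise scores over every column and every pair of rows."""
    score = 0
    for col in zip(*msa):                      # iterate over columns
        for a, b in combinations(col, 2):      # all row pairs in the column
            if a == '-' and b == '-':
                continue                       # gap-gap pairs are ignored
            elif a == '-' or b == '-':
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score
```

A genetic algorithm would maximize this value over the population of candidate alignments.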
2.1.3. Protein scoring matrices
Protein scoring matrices are used to score the alignment of any possible pair of amino acid residues
from two sequences and are the key elements in assessing the quality of an alignment. They can
also be termed similarity matrices. The amino acids taken from each of the two sequences are
matched and assigned a score from the similarity matrix. Similarity matrices are based on the
substitution probabilities of amino acids and are therefore sometimes called substitution matrices.
The score for an entire match between a pair of sequences is the sum of the scores of the
individually matched amino acids. A substitution matrix shows the rate at which one amino acid
residue in a sequence changes to another
residue. The most commonly used matrices are based on the Dayhoff model (Dayhoff, Schwartz, &
Orcutt, 1978), in which matrices are derived from a large set of protein sequences that are at
least 85% identical. Matrices particular to any sub family of proteins can also be developed
by examining pairwise alignments in a large MSA and extracting the frequency with which
amino acid i mutates into amino acid j. A substitution matrix formed in this
way is effective if the MSA is built from a large number of distantly related sequences.
Various substitution matrices have been proposed for scoring protein sequence alignments,
such as the point accepted mutation (PAM) and block substitution (BLOSUM) matrices. The
substitution matrices are usually presented in log-odds form (Henikoff, S. & Henikoff, J.G.,
1992). In log-odds form, each score in the matrix is the logarithm of an odds ratio: the ratio of
the likelihood of two amino acids appearing together with a biological sense to
the likelihood of the same amino acids appearing by chance. A positive entry in the matrix
shows a pair of amino acids that replace each other more often than expected by chance, and a
negative entry corresponds to a pair of amino acids that replace each other less
often than expected by chance.
2.1.3.1 Point accepted mutation (PAM)
PAM matrices are based on observed mutations. The PAM matrix was developed by Margaret
Dayhoff in 1978 (Dayhoff, Schwartz, & Orcutt, 1978). PAM examines mutations that occur
in closely related protein sequences over short evolutionary distances. The advantage of PAM
matrices over other similarity matrices is that they describe more accurately the changes in
amino acid composition that are expected after a given number of mutations.
There is a series of PAM matrices based on estimated mutation rates, e.g. PAM1,
PAM100, and PAM250. PAM1 corresponds to 1 accepted mutation per 100 amino acid residues,
PAM100 to 100 accepted mutations per 100 residues, and PAM250 to 250
mutations per 100 residues.
Figure 2-3: PAM matrix
2.1.3.2 Block substitution matrix (BLOSUM)
BLOSUM is mostly used for scoring alignments of evolutionarily divergent protein sequences
and is based on log-odds scores (Henikoff & Henikoff, 1992). The three most commonly
used BLOSUM matrices are BLOSUM45, BLOSUM62, and BLOSUM80. BLOSUM45 is
composed from alignments of amino acid sequences that are 45% similar; similarly,
BLOSUM62 is constructed from alignments of sequences with 62% similarity, and
BLOSUM80 from sequences with 80% similarity.
The BLOSUM matrix entry $S_{ij}$ is calculated using the following equation:

$S_{ij} = \frac{1}{\lambda}\log\frac{P_{ij}}{q_i q_j}$    (2.1)

where $P_{ij}$ is the substitution probability of amino acid $i$ with $j$, and $q_i$ and $q_j$ are the background
probabilities of amino acids $i$ and $j$. By rearranging terms, we achieve:

$P_{ij} = q_i q_j \exp(\lambda S_{ij})$    (2.2)

Since the sum of all substitution probabilities is one, the unknown $\lambda$ can be found by solving:

$\sum_{ij} q_i q_j \exp(\lambda S_{ij}) = 1$
where $q_i$, $q_j$ and $S_{ij}$ are already known (Sean, 2004). The BLOSUM62 matrix is shown in Table 2-1.
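The $\lambda$ of equation 2.2 can be recovered numerically; the following sketch solves the normalization constraint by bisection for a hypothetical two-letter alphabet (toy scores and background probabilities, not real BLOSUM values):

```python
import math

def sub_prob_sum(S, q, lam):
    """Left-hand side of the normalization constraint:
    sum over i, j of q_i * q_j * exp(lam * S_ij)."""
    n = len(q)
    return sum(q[i] * q[j] * math.exp(lam * S[i][j])
               for i in range(n) for j in range(n))

def solve_lambda(S, q, lo=0.1, hi=10.0, iters=100):
    """Bisection for the non-trivial lambda where the constraint equals 1.
    Assumes the constraint is below 1 at lo and above 1 at hi, which holds
    when the expected score is negative and some scores are positive."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sub_prob_sum(S, q, mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical two-letter alphabet: matches score +1, mismatches -2,
# with uniform background probabilities (toy values for illustration).
S = [[1, -2], [-2, 1]]
q = [0.5, 0.5]
lam = solve_lambda(S, q)
```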
Table 2-1: BLOSUM 62 matrix
C S T P A G N D E Q H R K M I L V F Y W
C 9 -1 -1 -3 0 -3 -3 -3 -4 -3 -3 -3 -3 -1 -1 -1 -1 -2 -2 -2
S -1 4 1 -1 1 0 1 0 0 0 -1 -1 0 -1 -2 -2 -2 -2 -2 -3
T -1 1 4 1 -1 1 0 1 0 0 0 -1 0 -1 -2 -2 -2 -2 -2 -3
P -3 -1 1 7 -1 -2 -1 -1 -1 -1 -2 -2 -1 -2 -3 -3 -2 -4 -3 -4
A 0 1 -1 -1 4 0 -1 -2 -1 -1 -2 -1 -1 -1 -1 -1 -2 -2 -2 -3
G -3 0 1 -2 0 6 -2 -1 -2 -2 -2 -2 -2 -3 -4 -4 0 -3 -3 -2
N -3 1 0 -2 -2 0 6 1 0 0 -1 0 0 -2 -3 -3 -3 -3 -2 -4
D -3 0 1 -1 -2 -1 1 6 2 0 -1 -2 -1 -3 -3 -4 -3 -3 -3 -4
E -4 0 0 -1 -1 -2 0 2 5 2 0 0 1 -2 -3 -3 -3 -3 -2 -3
Q -3 0 0 -1 -1 -2 0 0 2 5 0 1 1 0 -3 -2 -2 -3 -1 -2
H -3 -1 0 -2 -2 -2 1 1 0 0 8 0 -1 -2 -3 -3 -2 -1 2 -2
R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5 2 -1 -3 -2 -3 -3 -2 -3
K -3 0 0 -1 -1 -2 0 -1 1 1 -1 2 5 -1 -3 -2 -3 -3 -2 -3
M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5 1 2 -2 0 -1 -1
I -1 -2 -2 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4 2 1 0 -1 -3
L -1 -2 -2 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4 3 0 -1 -2
V -1 -2 -2 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4 -1 -1 -3
F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6 3 1
Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7 2
W -2 -3 -3 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11
2.1.4. Position specific scoring matrices (PSSM)
PSSMs are normally calculated from blocks of amino acids. Blocks are the highly conserved,
aligned, un-gapped portions of an MSA of amino acid sequences. The length of a PSSM is the
same as the width of the block, and it has 20 rows, one for each amino acid. The PSSM is used to
score alignments of blocks of amino acid or DNA sequences. It estimates the probabilities of
amino acids appearing at each position of the block. The scores in a column of the PSSM are based
on the frequencies of the amino acids observed in the corresponding column of the block; the
more frequently occurring amino acid gets a higher score.
The simplest representation of a PSSM is calculated from a multiple sequence alignment. Each
column of the alignment can be represented as a column vector of 20 values. These 20 entries
contain the observed frequencies of the 20 amino acids in the multiple sequence alignment.
These observed frequencies can be an imperfect representation of a position, because the
observed sequences are just a subset of the full set of related sequences. For some amino acids
the observed frequency can be zero, if that amino acid is missing in the column, and this can affect
performance. PSSMs are normally used to identify motifs in amino acid sequences (Ben, et
al., 2005). Without information about those missing amino acids, we cannot effectively identify
motifs or patterns in amino acid sequences. There are various ways to handle this situation. One
such solution is to model missing amino acids by adding pseudo-counts to the observed
frequency count vector.
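A minimal sketch of estimating one PSSM column with pseudo-counts from a block of aligned sequences (a simple additive pseudo-count is used here; more elaborate background-weighted pseudo-counts are also common):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm_column(column, pseudo=1.0):
    """Estimate amino acid probabilities for one column of a block,
    adding a pseudo-count so unobserved residues get non-zero mass."""
    counts = {aa: pseudo for aa in AMINO_ACIDS}   # start from pseudo-counts
    for residue in column:
        counts[residue] += 1                      # add observed counts
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}

# One column of a block of four aligned sequences: 'A' thrice, 'G' once.
probs = pssm_column("AAAG")
```

Without the pseudo-count, the 18 unobserved residues would get probability zero and any sequence containing them could never match the motif.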
2.2.ALIGNMENT INDEPENDENT CLASSIFICATION
Classification of GPCRs or protein sequences can also be performed using alignment
independent methods. Molecular phylogenetic tree analyses are fully dependent on MSAs, and as
the number of sequences to identify grows, the computational cost of alignment methods increases
exponentially. Alignment independent methods are useful and much faster. In alignment
independent methods, we need a feature extraction strategy and a classification
algorithm; the classification algorithm is trained and evaluated on a dataset once the feature
extraction is completed. Classification based on physiochemical or biochemical properties of
sequences can effectively predict the families of GPCRs. During the last few years, various
systems have been proposed for annotating the functions of GPCRs automatically by
exploiting their physiochemical or biochemical properties in a fast and efficient way. Various
statistical and machine learning methods have also been proposed in this regard, e.g. Bayesian
classification (Lundstrom & Chiu, 2006), SVM (Bhasin & Raghava, 2004), (Bhasin &
Raghava, 2005), (Karchin, Karplus, & Haussler, 2002), (Guo, et al., 2005) and Hidden Markov
models (Möller, Vilo, & Croning, 2001), (Papasaikas, Bagos, Litou, & Hamodrakas, 2003),
(Martelli, Fariselli, Malaguti, & Casadio, 2002), (Rehman & Khan, 2011), (Davies, Secker,
Freitas, Mendao, Timmis, & Flower, 2007). There are various online classification servers
available (Davies, BIAS-PROFS) and (Rehman Z., GPCR prediction, 2011).
2.2.1. Machine Learning
Machine learning is the category of algorithms that improve automatically with experience. The
process starts with learning and ends with testing: learning is acquired from training data,
and testing is applied to unseen new data. A machine learning method exploits various features of
the training data in the specification phase and makes intelligent decisions based on the training
data; its ability to categorize testing data is called generalization. There can be many features in the
available training data, and the selection of optimal features is also a machine learning task. There
are various machine learning based classification algorithms, such as SVM, Nearest Neighbor (NN),
Grey incidence degree measure (GID), and ensemble approaches. After classification, the method
can be further optimized and validated using jackknife or independent testing methods. After
validation, the performance of the method is assessed and compared with existing methods. The
machine learning based classification of GPCRs has the phases shown in Figure 2-4.
2.2.1. Feature Extraction Strategies
Classification algorithms usually require GPCR sequences in numeric form in order to classify
them. The numeric form of a sequence can be obtained using any physiochemical property. The
numeric form of the overall sequence can be very large and can vary in size, because sequences
can be long and of variable length. Hence, we reduce dimensionality by computing some useful
properties from this numeric GPCR sequence. These properties are called features, and the process
of computing them is called feature extraction. We have extracted features using amino acid
composition, pseudo amino acid composition, the fast Fourier transform, wavelet based multi scale
energy, split amino acid composition, and evolutionary information based methods. Let us consider
a GPCR sequence $G_1$ containing $n$ amino acids. After converting it to numeric form, it is
mathematically represented as:
$\mathbf{G}_1 = [\,g_{11}, g_{12}, \ldots, g_{1n}\,]$    (2.3)
where $g_{11}$ is the amino acid at residue 1 in sequence $G_1$, $g_{12}$ is the amino acid at position 2, and
similarly $g_{1n}$ represents the last amino acid, at position $n$, of sequence $G_1$.
Figure 2-4: Classification of GPCRs using machine learning
2.2.1.1 Amino Acid Composition
Amino acid (AA) composition is simply the frequency of occurrence of each amino acid in the
sequence (Elrod & Chou, 2002). It captures only the composition, discarding sequence order
information, and is given by:
$\mathbf{T} = [\,f_1, f_2, \ldots, f_{20}\,]$    (2.4)
where $f_i$ is the occurrence frequency of the $i$th amino acid and $\mathbf{T}$ is the numeric form of the
sequence.
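Equation 2.4 can be sketched directly, using the one-letter amino acid codes listed in section 1.1 and normalizing the counts by the sequence length:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    """20-dimensional feature vector of normalized amino acid frequencies."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence:
        if residue in counts:          # ignore non-standard letters
            counts[residue] += 1
    length = len(sequence)
    return [counts[aa] / length for aa in AMINO_ACIDS]
```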
2.2.1.2 Pseudo Amino Acid Composition
Pseudo amino acid (PseAA) composition is a more accurate feature extraction strategy than simple
AA composition (Chou, 2001), (Qiu, Huang, Liang, & Lu, 2009), because it also accounts for the
length and order of the sequence. The first 20 elements of the PseAA composition are the same as
in the AA feature vector, but it contains additional elements $(c_{21}, \ldots, c_{20+\lambda n})$ to account for the
sequence order of a protein. PseAA is mathematically represented as:
$\mathbf{T} = [\,f_1, f_2, \ldots, f_{20}, c_{21}, \ldots, c_{20+\lambda n}\,]$    (2.5)
where $n$ is the number of physiochemical properties used and $\lambda$ is the number of tiers, usually
between 1 and 21. The tiers are computed using correlation factors, and the number of correlation
factors depends on the number of physiochemical properties used. First tier correlation factors
couple the most contiguous residues along the protein chain, second tier correlation factors couple
the second most contiguous residues, and so on (Chou, 2001). These tiers capture the sequence
order information contained in the correlation factors by employing physiochemical properties. In
the case of two physiochemical properties (Hydrophobicity and Electronic), the correlation factors
are given by equations 2.6 and 2.7:
$\tau_1 = \frac{1}{L-1}\sum_{i=1}^{L-1} H^{1}_{i,i+1} \qquad \tau_2 = \frac{1}{L-1}\sum_{i=1}^{L-1} H^{2}_{i,i+1}$
$\tau_3 = \frac{1}{L-2}\sum_{i=1}^{L-2} H^{1}_{i,i+2} \qquad \tau_4 = \frac{1}{L-2}\sum_{i=1}^{L-2} H^{2}_{i,i+2}$
$\qquad\vdots$
$\tau_{2\lambda-1} = \frac{1}{L-\lambda}\sum_{i=1}^{L-\lambda} H^{1}_{i,i+\lambda} \qquad \tau_{2\lambda} = \frac{1}{L-\lambda}\sum_{i=1}^{L-\lambda} H^{2}_{i,i+\lambda}$    (2.6)

$H^{1}_{i,j} = h^{1}(R_i)\cdot h^{1}(R_j), \qquad H^{2}_{i,j} = h^{2}(R_i)\cdot h^{2}(R_j)$    (2.7)
where $L$ is the length of the GPCR sequence, $\tau_1$ is the first tier correlation factor based on the
Hydrophobicity property, $\tau_2$ is the first tier correlation factor based on the Electronic property,
$\tau_3$ is the second tier correlation factor using the Hydrophobicity property, $\tau_4$ is the second tier
correlation factor using the Electronic property, and so on. $h^{1}(R_i)$ is the Hydrophobicity value of
amino acid $i$ and $h^{2}(R_i)$ is the Electronic value of amino acid $i$ (any other physiochemical
property can be used in place of the Hydrophobicity/Electronic property and will be represented
as $R_i$).
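A sketch of the correlation factors of equations 2.6 and 2.7, using illustrative made-up property tables in place of the real Hydrophobicity and Electronic scales:

```python
def correlation_factor(sequence, prop, tier):
    """Tau for one physiochemical property and one tier: the average
    product of property values of residues 'tier' positions apart."""
    L = len(sequence)
    vals = [prop[aa] for aa in sequence]
    return sum(vals[i] * vals[i + tier] for i in range(L - tier)) / (L - tier)

def pseaa_features(sequence, props, max_tier):
    """AA composition (20 values) followed by max_tier * len(props)
    correlation factors, as in equation 2.5."""
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    comp = [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    taus = [correlation_factor(sequence, p, t)
            for t in range(1, max_tier + 1) for p in props]
    return comp + taus

# Illustrative property tables (toy numbers, not the FH scale):
hydro = {aa: 0.1 * i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
electronic = {aa: 1.0 for aa in "ACDEFGHIKLMNPQRSTVWY"}
features = pseaa_features("ACDCA", [hydro, electronic], max_tier=2)
```

With two properties and two tiers the feature vector has 20 + 4 elements, matching the $20 + \lambda n$ dimensions of equation 2.5.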
2.2.1.3 Wavelet based multi scale energy features
The discrete wavelet transform (DWT) can be used to represent a signal in the transform domain
(Qiu, Huang, Liang, & Lu, 2009). The DWT can be implemented in many different ways;
however, we have used Mallat's fast algorithm, in which the signal is decomposed
into several levels. At each level, approximation and detail coefficients are obtained by low
pass and high pass filters, respectively.
First, each sequence is converted into a numeric digital signal using Hydrophobicity values
(the FH scale) (Fauchere & Pliska, 1983). Then, the Haar transform of the digital signal is
computed. In the third step, the approximation and detail coefficients are computed at various
decomposition levels. The maximum decomposition level for a particular sequence is equal to
log2(length of sequence) and is denoted by m. For some sequences, zero padding is used to keep
the feature vector size consistent across all sequences. The resultant feature vector obtained in
this way is named the multi scale energy (MSE) vector (Shi, Zhang, Pan, Cheng, & Xie, 2007). The MSE feature vector of
(m+1)-Dimensions is formed as:
T_k = \left[ d_1^k, d_2^k, ..., d_j^k, ..., d_m^k, a_m^k \right]          (2.8)

where k = 1, 2, ..., N and N is the total number of sequences, a_m^k is the root mean square
energy of the wavelet approximation coefficients at the m-th decomposition level, and d_j^k is
the root mean square energy of the wavelet detail coefficients at the corresponding j-th
decomposition level:

d_j^k = \left\{ \frac{1}{N_j} \sum_{n=0}^{N_j - 1} \left[ u_j^k(n) \right]^2 \right\}^{1/2}          (2.9)

a_m^k = \left\{ \frac{1}{N_m} \sum_{n=0}^{N_m - 1} \left[ v_m^k(n) \right]^2 \right\}^{1/2}          (2.10)

where N_m is the number of approximation coefficients, N_j is the number of detail coefficients,
v_m^k(n) is the n-th approximation coefficient at the m-th level, and u_j^k(n) is the n-th detail
coefficient at the j-th decomposition level.
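A minimal sketch of the MSE feature vector (eqs. 2.8-2.10) using a hand-rolled Haar decomposition; the hydrophobicity values below are placeholders, not the FH scale:

```python
# MSE features via a simple Haar DWT; scale values are hypothetical.
import math

FH = {"A": 0.31, "L": 1.70, "G": 0.00, "K": -0.99}  # placeholder scale

def haar_step(signal):
    """One Haar decomposition step: (approximation, detail) coefficients."""
    a = [(signal[2*i] + signal[2*i+1]) / math.sqrt(2) for i in range(len(signal)//2)]
    d = [(signal[2*i] - signal[2*i+1]) / math.sqrt(2) for i in range(len(signal)//2)]
    return a, d

def rms(coeffs):
    """Root mean square energy, as in eqs. 2.9-2.10."""
    return math.sqrt(sum(c * c for c in coeffs) / len(coeffs))

def mse_features(seq, levels):
    """[d_1, ..., d_m, a_m] energies of eq. 2.8 for one sequence."""
    signal = [FH[r] for r in seq]
    signal += [0.0] * ((2 ** levels) - len(signal))   # zero-pad to 2^levels
    feats = []
    for _ in range(levels):
        signal, detail = haar_step(signal)
        feats.append(rms(detail))
    feats.append(rms(signal))                          # a_m of the last level
    return feats

print(mse_features("ALGKALGK", 3))
```

With m decomposition levels the vector has m detail energies plus one approximation energy, i.e. (m+1) dimensions as stated above.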
2.2.1.4 Fast Fourier transform based features
Fast Fourier transform (FFT) is an efficient way of implementing discrete Fourier transform
(DFT) algorithm (Guo, et al., 2005). It reduces the computational cost of the DFT from O(N^2)
to O(N log N). The FFT requires numerical input; it decomposes a set of values into different
frequency components. We have converted each GPCR sequence into numeric form using the
hydrophobicity property of proteins and then normalized the values using the following equation:

\hat{T}_i = \frac{T_i - \bar{T}}{\sigma}          (2.11)
where T_i is the numeric form of GPCR sequence i, \bar{T} is the average (mean) of T_i, and
\sigma is the standard deviation from the mean. Since the size of sequences varies, T_i will also
have variable sizes. The FFT has two benefits over some other feature extraction strategies.
First, it keeps the length of the feature vector consistent for all sequences. Secondly, some
features which are difficult to extract in the spatial domain can easily be extracted in the
transform domain. Sequences belonging to one class bear differences with sequences of other
classes, and these differences can easily be analyzed using the FFT.
The FFT of T_i is computed using eq. 2.12. After that, we have applied the power spectral
density (PSD) over 256 evenly spaced frequency points. The PSD is the squared magnitude of the
Fourier transform divided by the total number of frequency points. The formulas for the FFT and
PSD are given by the following equations:

X_k = \sum_{i=0}^{N-1} X_i \, \omega^{ik}          (2.12)

PSD = \frac{|X|^2}{n}          (2.13)

where \omega = \exp(-2\pi\sqrt{-1}/N) is an N-th root of unity and n is the number of frequency
points. The feature vector formed using the FFT and PSD in this case is of length 256, with
components given by eq. 2.14:

PSD_k = \frac{1}{n} \left| \sum_{i=0}^{N-1} T_i \, \omega^{ik} \right|^2          (2.14)
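The normalization and PSD computation of eqs. 2.11-2.14 can be sketched with a naive DFT (an FFT library would be used in practice). The scale values are placeholders and n_points is kept small here, whereas the thesis uses 256:

```python
# Sketch of the FFT/PSD features; naive DFT for clarity, placeholder scale.
import cmath

FH = {"A": 0.31, "L": 1.70, "G": 0.00, "K": -0.99}  # hypothetical scale

def psd_features(seq, n_points=8):
    t = [FH[r] for r in seq]
    mean = sum(t) / len(t)
    sd = (sum((x - mean) ** 2 for x in t) / len(t)) ** 0.5
    t = [(x - mean) / sd for x in t]                 # eq. 2.11 normalization
    feats = []
    for k in range(n_points):                        # eqs. 2.12-2.14
        xk = sum(t[i] * cmath.exp(-2j * cmath.pi * i * k / len(t))
                 for i in range(len(t)))
        feats.append(abs(xk) ** 2 / n_points)
    return feats

print(psd_features("ALGKALGK"))
```

Note that the zeroth component is (numerically) zero after mean subtraction, since X_0 is just the sum of the normalized signal.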
2.2.1.5 Split amino acid
The GPCRs have peptides at their N and C terminus regions, which are very informative. Split
amino acid composition (SAAC) helps in extracting N and C terminus information from the GPCR
sequence (Afridi, Khan, & Lee, 2012), (Chou & Shen, 2006). The GPCR sequence is split into 3
parts and the amino acid composition of each part is computed independently. The first part
consists of the 20 amino acids at the N terminus, the second part contains the 20 amino acids at
the C terminus, and the third part contains the amino acids lying between the two termini. The
overall size of the SAAC feature vector in our research is 60.
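The three-part split described above can be sketched as follows; the 20-residue terminus length matches the text, while the toy sequence is only for illustration:

```python
# Minimal SAAC sketch: per-part amino acid composition over the 20 standard
# residues, giving a 60-dimensional vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(part):
    """Fraction of each of the 20 amino acids in a sequence fragment."""
    n = max(len(part), 1)
    return [part.count(a) / n for a in AMINO_ACIDS]

def saac_features(seq, term=20):
    n_part, c_part = seq[:term], seq[-term:]
    middle = seq[term:-term] if len(seq) > 2 * term else ""
    return composition(n_part) + composition(middle) + composition(c_part)

vec = saac_features("MAL" * 30)   # toy 90-residue sequence
print(len(vec))                    # 60
```

Each of the three compositions sums to 1, so the whole vector sums to 3 for any sequence longer than 40 residues.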
2.2.1.6 Evolutionary information based features using PSSM
The evolution of GPCRs results in biological changes in GPCR sequences, e.g. deletions,
insertions or mutations of some amino acid residues. These changes, if analyzed, can help in the
classification of GPCRs. We have mathematically described this evolutionary information using
the Position Specific Scoring Matrix (PSSM) (Schaffer, et al., 2001), (Chou & Shen, 2010). A
PSSM shows the probabilities of substitution of one amino acid into another. We have computed
PSSMs using PSI-BLAST (Institute). For each sequence, we manually submitted the sequence
through the PSI-BLAST interface, selected the appropriate database, and ran 2 iterations of
PSI-BLAST to obtain the PSSM, which is saved to a file and further processed. The PSSM for each
sequence has a number of rows equal to the length of the sequence, with each row giving the
scores for the 20 amino acids:
P = \begin{bmatrix}
E_{1,1} & E_{1,2} & \cdots & E_{1,20} \\
E_{2,1} & E_{2,2} & \cdots & E_{2,20} \\
\vdots  & \vdots  &        & \vdots  \\
E_{L,1} & E_{L,2} & \cdots & E_{L,20}
\end{bmatrix}          (2.15)
where E_{i,j} is the score of the amino acid residue at the i-th position of the GPCR sequence
being substituted by amino acid j. The search threshold for PSI-BLAST is set to 0.001. Next, we
have normalized each E_{i,j} to zero mean and unit standard deviation:

\hat{E}_{i,j} = \frac{E_{i,j} - \bar{E}_i}{SD(E_i)}          (2.16)
where SD(E_i) is the standard deviation and \bar{E}_i is the mean of the scores of the i-th amino
acid residue in a GPCR sequence. Next, we have merged the PSSM features with the pseudo amino
acid formulation to represent a GPCR sequence as shown in eq. 2.17:

T = \left[ \bar{E}_1, \bar{E}_2, ..., \bar{E}_{20}, E'_1, E'_2, ..., E'_{20} \right]^T          (2.17)

where

\bar{E}_j = \frac{1}{L} \sum_{i=1}^{L} \hat{E}_{i,j}          (2.18)

E'_j = \frac{1}{L-\lambda} \sum_{i=1}^{L-\lambda} \left( \hat{E}_{i,j} - \hat{E}_{i+\lambda,j} \right)^2          (2.19)

and where \bar{E}_j (j = 1 to 20) are the mean scores of the amino acid residues in the GPCR
sequence and \lambda is the number of tiers used. The value of \lambda should be less than the
length of the shortest sequence present in the database. We have chosen \lambda = 49 in the
proposed research. We have named this feature extraction strategy PSE-PSSM in our present
research.
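A sketch of assembling the PSE-PSSM vector (eqs. 2.16-2.19) from an L x 20 PSSM. The random matrix below stands in for real PSI-BLAST output, and lam is kept small here, whereas the thesis uses lam = 49:

```python
# PSE-PSSM feature sketch; the input matrix is synthetic, not a real PSSM.
import random

def pse_pssm(pssm, lam):
    # eq. 2.16: normalize each row to zero mean, unit standard deviation
    norm = []
    for row in pssm:
        mean = sum(row) / 20
        sd = (sum((e - mean) ** 2 for e in row) / 20) ** 0.5 or 1.0
        norm.append([(e - mean) / sd for e in row])
    L = len(norm)
    mean_j = [sum(r[j] for r in norm) / L for j in range(20)]      # eq. 2.18
    lag_j = [sum((norm[i][j] - norm[i + lam][j]) ** 2
                 for i in range(L - lam)) / (L - lam)
             for j in range(20)]                                   # eq. 2.19
    return mean_j + lag_j                                          # eq. 2.17

random.seed(0)
pssm = [[random.randint(-5, 5) for _ in range(20)] for _ in range(60)]
print(len(pse_pssm(pssm, lam=5)))   # 40
```

The result is a fixed 40-dimensional vector regardless of sequence length, which is what makes the representation usable with the classifiers below.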
2.2.2. Classification Algorithms
For the sake of classification, we have used Support vector machine, probabilistic neural
network, nearest neighbor and ensemble classification approaches.
2.2.2.1 Nearest Neighbor
The nearest neighbor algorithm (NN) annotates a test sample in a sample space of N classes by
computing its distance to the training samples and assigning it the label of the class at minimum
distance (Rehman & Khan, 2011). The distance is calculated as:

D(x, x_i) = 1 - \frac{x \cdot x_i}{\|x\| \, \|x_i\|}, \qquad i = 1, 2, ..., N          (2.20)

where x is the test sample, x_i is a sample of the i-th training class, and \|x\| and \|x_i\| are
their respective moduli.
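The rule above can be sketched directly; the tiny 2-D "feature vectors" are made-up stand-ins for real GPCR features:

```python
# Nearest neighbor with the distance of eq. 2.20; toy data for illustration.
import math

def distance(x, y):
    """D(x, y) = 1 - (x . y) / (|x| |y|), as in eq. 2.20."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (math.sqrt(sum(a * a for a in x)) *
                      math.sqrt(sum(b * b for b in y)))

def nearest_neighbor(x, samples):
    """samples: list of (feature_vector, label) pairs."""
    return min(samples, key=lambda s: distance(x, s[0]))[1]

train = [((1.0, 0.1), "ClassA"), ((0.1, 1.0), "ClassB")]
print(nearest_neighbor((0.9, 0.2), train))   # ClassA
```

Since the distance depends only on the angle between vectors, samples are compared by direction rather than magnitude.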
2.2.2.2 Support vector machines
The SVM classification algorithm is a binary classifier, but it can be used for multi-class
classification problems (Karchin, Karplus, & Haussler, 2002). The model formed by the SVM
computes a decision boundary having maximum distance to the nearest points in the training
feature space. The SVM is based on the principle of finding the optimal linear hyperplane so as
to minimize the classification error for new test samples (Javed, Khan, Majid, Mirza, & Bashir,
2007). For linearly separable data of N training pairs (x_i, y_i), the decision surface function
V is given by Eq. (2.21):
V(x) = \sum_{i=1}^{N} \alpha_i y_i \, x_i^T x + bias          (2.21)
where the coefficient \alpha_i \geq 0 is a Lagrange multiplier in an optimization problem. A
sample x_i corresponding to \alpha_i > 0 is called a support vector. The function V(x) is
independent of the dimension of the feature space. To find an optimal hyperplane surface for
non-separable sample points, we have to solve the following problem:
\min_{W,\,\xi} \;\; \frac{1}{2} W^T W + o \sum_{i=1}^{N} \xi_i          (2.22)

subject to: \; y_i \left( W^T \phi(x_i) + bias \right) \geq 1 - \xi_i, \quad \xi_i \geq 0

where o is the penalty parameter of the error term \sum_{i=1}^{N} \xi_i. It represents the cost
of constraint violation for those data points which fall on the wrong side of the decision
boundary, and \phi(x) is the nonlinear mapping. The weight vector W minimizes the cost function
term W^T W.
For nonlinear data, the input data is mapped to a higher dimension through a mapping function
\phi(x) such that \phi: R^N \to F^M, M \geq N. Each point in the new feature space is defined by
a kernel function K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j). There are many different kernel
functions; we have evaluated the performance of our proposed method using the linear,
polynomial, radial basis function and sigmoid kernel functions. The nonlinear decision surface V
can now be constructed as given by Eq. (2.23):

V(x) = \sum_{i=1}^{N_s} \alpha_i y_i \, K(x_i, x) + bias          (2.23)
where N_s is the number of support vectors. Mathematically, the radial basis function (RBF)
kernel is defined as given by Eq. (2.24):

K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)          (2.24)

where the parameter \sigma shows the width of the Gaussian function.
As our classification problem is a multi-class problem, we have used the one-vs-all
classification strategy using the LIBSVM 2.88-1 package (libSVM). The SVM problem in this
software is solved using a nonlinear quadratic programming technique. During parameter
optimization of the SVM models, the average accuracy of the SVM models is maximized.
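Evaluating the trained decision surface of eqs. 2.23-2.24 can be sketched as follows. The support vectors, multipliers and bias are made-up values standing in for the output of LIBSVM training:

```python
# Evaluating an RBF-kernel decision surface; all model values are synthetic.
import math

def rbf_kernel(xi, xj, sigma):
    """Eq. 2.24: exp(-|xi - xj|^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / (2 * sigma ** 2))

def decision(x, support_vectors, alphas, labels, bias, sigma):
    """Eq. 2.23: sum_i alpha_i * y_i * K(x_i, x) + bias."""
    return sum(a * y * rbf_kernel(sv, x, sigma)
               for sv, a, y in zip(support_vectors, alphas, labels)) + bias

svs = [(0.0, 0.0), (2.0, 2.0)]        # hypothetical support vectors
alphas, labels, bias = [1.0, 1.0], [+1, -1], 0.0
print(decision((0.1, 0.0), svs, alphas, labels, bias, sigma=1.0) > 0)   # True
```

In a one-vs-all setup, one such decision function is trained per class and the class with the largest value of V(x) is predicted.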
2.2.2.3 Probabilistic Neural Network
The probabilistic neural network (PNN) was developed by Specht (Specht, 1990). It is based on
the Bayesian classification algorithm. A PNN has four layers: the input, pattern, summation, and
decision layers, with a different number of neurons at each layer. The PNN receives an
n-dimensional feature vector x = (x_1, x_2, ..., x_n) as input at the n nodes of the input layer.
There are M pattern layer nodes fully connected to these input nodes. In the pattern layer, for
each class k (1 \leq k \leq c), m_k Gaussian functions are calculated as given by Eq. (2.25):
p_j^k(x) = \frac{1}{(2\pi)^{n/2} \, |\Sigma_j^k|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_j^k)^T (\Sigma_j^k)^{-1} (x - \mu_j^k) \right)          (2.25)

where \mu_j^k is the mean and \Sigma_j^k is the covariance matrix of the training samples. The
summation layer computes the approximation of the class probability functions as given in Eq.
(2.26):
\Phi^k(x) = \sum_{j=1}^{m_k} \pi_j^k \, p_j^k(x)          (2.26)

where \pi_j^k is the within-class mixing proportion and \sum_{j=1}^{m_k} \pi_j^k = 1 for
k = 1, 2, ..., c. The decision layer makes the decision about the test sample by computing the
risk as given in Eq. (2.27):

V_k(x) = \sum_{l=1}^{c} v_l \, c_{lk} \, \Phi^l(x)          (2.27)

where v_l indicates the prior probability of class l and c_{lk} is the cost of assigning a sample
of true class l to class k. The test sample is assigned the label of the class for which the risk
is minimum. The performance of the PNN depends on an optimized smoothing factor, which is used
to control the deviations of the Gaussian functions.
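A minimal PNN sketch with the common simplification of a spherical covariance \sigma^2 I, equal mixing proportions, and equal priors and costs, so the decision reduces to picking the class with the largest summed Gaussian response; the training points are illustrative:

```python
# Simplified PNN (eqs. 2.25-2.26): spherical Gaussians with smoothing sigma.
import math

def pnn_predict(x, classes, sigma):
    """classes: dict label -> list of training vectors."""
    n = len(x)
    norm = 1.0 / ((2 * math.pi) ** (n / 2) * sigma ** n)
    scores = {}
    for label, patterns in classes.items():
        g = [norm * math.exp(-sum((a - b) ** 2 for a, b in zip(x, p))
                             / (2 * sigma ** 2)) for p in patterns]
        scores[label] = sum(g) / len(g)    # eq. 2.26 with equal mixing
    return max(scores, key=scores.get)

train = {"A": [(0.0, 0.0), (0.2, 0.1)], "B": [(2.0, 2.0), (1.8, 2.1)]}
print(pnn_predict((0.1, 0.1), train, sigma=0.5))   # A
```

The smoothing factor \sigma plays the role described above: a small \sigma gives sharply peaked class responses, a large \sigma smooths them out.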
2.2.3. Performance Assessment
After the classification of all classes, the performance of the classifier is assessed by some
statistical measures. The class assignment for each sequence is usually performed in a binary
way, i.e. into a negative and a positive class; even a multi-class problem can be broken down
into 2-class problems. The true positives (TP) and true negatives (TN) are the numbers of
correctly classified positive and negative sequences, respectively. Similarly, false positives
(FP) and false negatives (FN) are the numbers of negative and positive sequences, respectively,
that were classified incorrectly. In the end, the performance for the whole dataset is analyzed
using various measures such as overall accuracy, sensitivity, specificity, Matthews correlation
coefficient (MCC) and F-measure. Accuracy shows the overall effectiveness of the method and
gives the proportion of true predictions, i.e. true positives and true negatives. Specificity
indicates the proportion of true negatives, while sensitivity indicates the proportion of true
positives. There should be a proper tradeoff between the values of sensitivity and specificity,
and both should normally be high. The values of MCC lie between 1 and -1, where 1 means perfect
prediction, 0 means prediction no better than random, and -1 means total disagreement between
prediction and observation. MCC helps to expose the bias of a classifier towards the bigger
class in case of imbalanced data. The F-measure combines the precision and recall of the test
into a single score: it is the harmonic mean of precision and recall.
Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \times 100          (2.28)

Sensitivity = \frac{TP}{TP + FN} \times 100          (2.29)

Specificity = \frac{TN}{TN + FP} \times 100          (2.30)

MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}          (2.31)

F\text{-}measure = 2 \times \frac{Precision \times Recall}{Precision + Recall}          (2.32)

Precision = \frac{TP}{TP + FP}          (2.33)

Recall = \frac{TP}{TP + FN}          (2.34)
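The measures of eqs. 2.28-2.34 computed from raw counts; the counts below are illustrative, not results from the thesis:

```python
# Performance measures from TP/TN/FP/FN counts (eqs. 2.28-2.34).
import math

def metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                        # same as sensitivity
    return {
        "accuracy": 100 * (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": 100 * recall,
        "specificity": 100 * tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "f_measure": 2 * precision * recall / (precision + recall),
    }

m = metrics(tp=90, tn=85, fp=15, fn=10)
print(round(m["accuracy"], 2))   # 87.5
```

For a multi-class problem each class is scored one-vs-rest with its own TP/TN/FP/FN counts, and the per-class values are then averaged.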
2.2.4. Genetic Algorithms
Genetic algorithms (GA) are a family of evolution-inspired computational models, normally used
to solve complex optimization and search problems. They are based on the principles of natural
selection and survival of the fittest, as in most biological organisms. In nature, individuals
in a population usually compete with each other for resources and to attract a mate. The
individuals that succeed in surviving and attracting mates produce more offspring than poorly
adapted individuals. Genes from individuals that are highly adapted to their environment thus
spread to an increasing number of individuals in each generation. The combination of good
characteristics from different ancestors can result in offspring that are fitter than their
parents. Species evolve in this way to become better suited to their environment.
The basic terminology of GAs was first proposed by Holland (Holland, 1992). A GA starts by
encoding random solutions in the form of a population of chromosomes. A chromosome is a long,
complicated string of DNA (deoxyribonucleic acid) containing genes that determine particular
characteristics of an individual. The chromosomes are then evaluated and reproduced in such a
way that fitter chromosomes have a greater chance to evolve, resulting in better solutions.
Reproduction makes changes to the chromosomes: segments of the parents' chromosomes are
exchanged randomly by a process called crossover, so the offspring exhibit some characteristics
of the father and some of the mother. Mutation happens rarely and changes some characteristics.
Occasionally, an error can occur when chromosomes are copied during cell division; such
accidental mistakes can nevertheless produce a fitter individual. Genetic algorithms have
applications in many fields of science such as computational science, bioinformatics,
engineering, phylogenetics, economics, manufacturing, chemistry, physics and mathematics. The
functionality of a GA can be divided into 4 phases:
Initialization
Selection
Genetic operators
Termination
2.2.4.1 Initialization
A potential solution to a problem can be represented in terms of a set of genes (parameters).
The combination of these genes forms a chromosome (a string containing the combination of
parameters). An initial population is made by generating a number of random chromosomes
(solutions). There can be thousands or millions of possible solutions, and the size of the
population can vary from problem to problem. The encoding of a solution into a chromosome also
varies from problem to problem and can be binary or continuous. A randomly generated population
can represent the entire range of possible solutions.
2.2.4.2 Selection
Naturally, an individual with better survival characteristics will survive for a longer period
of time and has a better chance to produce offspring. Generation after generation, the
population will contain more genes from the superior individuals and fewer from the inferior
individuals. This process is called natural selection. Individual solutions at each generation
are selected using a fitness evaluation method, and ranked according to their fitness values. A
fitness function is defined as the maximization of the objective function of the problem and
varies from problem to problem; it can be converted to a minimization problem by negating the
function. It returns a fitness value, which is the quality measure of a chromosome (solution) of
the problem. The fitness of each individual of the population is measured. Individuals with a
high fitness value have a greater chance of being selected for the genetic operations and of
being used in successive generations. The most important things in a GA are to define a suitable
fitness function and a proper encoding of the parameters. Following are some of the selection
functions used in the literature:
Stochastic uniform selection
Remainder selection
Uniform selection
Shift linear selection
Roulette wheel selection
Tournament selection
Rank selection
2.2.4.3 Genetic operators
After selection and fitness evaluation, GA operators are applied to introduce diversity in the
chromosomes. The 3 most commonly used GA operators are reproduction, crossover and mutation.
Reproduction, which can also be termed selection, simply copies the better solutions into a new
population. In crossover, two or more parent chromosomes are chosen on the basis of fitness
values to produce a child chromosome. The crossover rate can be defined with respect to the
problem. Following are some of the crossover methods used in the existing literature:
Scattered crossover
One point crossover
Two point crossover
Intermediate crossover
Heuristic crossover
Arithmetic crossover
Custom crossover
In the mutation operator, some of the gene values (parameters) are altered to preserve diversity
in the chromosomes from generation to generation. Mutation changes a chromosome sometimes
partially and sometimes completely. Following are the well-known mutation approaches:
Gaussian Mutation
Uniform Mutation
Adaptive feasible Mutation
Custom Mutation
Boundary
Non-Uniform
Bit string mutation
These GA operators are used to make new offspring (new solutions). The overall objective of the
GA is to maximize the fitness function. New chromosomes are created by selection, alteration, or
combination of the characteristics of the currently well-performing chromosomes. Hence, a new
population of solutions is formulated.
2.2.4.4 Termination
The selection and genetic operator phases of the GA are repeated iteratively; one cycle of a GA
run is called a generation. The GA process repeats until some termination criteria are met or an
optimal solution is found. It can also be stopped when a maximum number of generations is
reached.
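The four phases above can be sketched with a minimal GA. The toy fitness function (maximize the number of 1-bits) and all parameter values are illustrative, not the settings used later in the thesis:

```python
# Minimal GA: initialization, tournament selection, one-point crossover,
# bit-string mutation, and termination after a fixed number of generations.
import random

random.seed(1)
GENES, POP, GENERATIONS, MUT_RATE = 16, 20, 60, 0.02

def fitness(chrom):               # toy objective: count of 1-bits
    return sum(chrom)

def select(pop):                  # tournament selection of size 2
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):            # one point crossover
    cut = random.randrange(1, GENES)
    return p1[:cut] + p2[cut:]

def mutate(chrom):                # bit string mutation
    return [g ^ 1 if random.random() < MUT_RATE else g for g in chrom]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]  # init
for _ in range(GENERATIONS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]
best = max(pop, key=fitness)
print(fitness(best))
```

Swapping in a different fitness function, encoding, or operator set from the lists above changes only the corresponding function, which is why GAs adapt easily from problem to problem.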
2.3. GPCR DATASETS
There are various protein databases on the web which provide GPCR datasets, e.g. Swiss-Prot,
UniProt and the Protein Data Bank. There are also some web servers which provide GPCR sequence
data belonging to different species, e.g. (GPCRDB, 2012) and (ENSEMBL). We have used 5 different
datasets in the present research, named: D8354, D167 (Elrod & Chou, 2002), D365 (Chou, 2005),
D566 (Chou & Elrod, 2002) and D11026. The D8354 dataset is available at
http://www.cs.kent.ac.uk/projects/biasprofs/. The dataset was identified using the Entrez search
and retrieval system (Wheeler, Barrett, Benson, & et.al., 2007). Sequences with length > 280
amino acids were deleted. There are 8354 sequences in total, of which Rhodopsin-like has 5526,
Secretin-like has 625, Metabotropic glutamate has 2172, fungal pheromone has 13 and cyclic AMP
has 18 sequences.
D365 contains 365 sequences belonging to 6 major families of GPCRs: (1) Rhodopsin-like, (2)
Secretin-like, (3) Metabotropic glutamate/pheromone, (4) Fungal pheromone, (5) cAMP receptor and
(6) Frizzled/smoothened family. D167 has 167 sequences and is classified into 4 sub-sub
families, i.e. (1) acetylcholine, (2) adrenoceptor, (3) dopamine and (4) serotonin. The dataset
D566 (Chou & Elrod, 2002) has 566 sequences belonging to 7 sub-sub families, i.e. (1)
Adrenoceptor, (2) Chemokine, (3) Dopamine, (4) Neuropeptide, (5) Olfactory type, (6) Rhodopsin
and (7) Serotonin. It is reported by Chou and Elrod that the sequences in D167, D566 and D365
have pairwise similarity of less than 40%.
The D11026 dataset is gathered from the ENSEMBL repository. It has sequences belonging to 19
known sub families of GPCRs and some unknown receptors. The sequences belong to 62 species from
10 groups of organisms: Eutheria, Marsupials, Monotremata, Amphibia, Reptilia, Birds, Ray-finned
fish, Zebra fish, Latimeria and Lamprey. Initially, the number of sequences in this dataset was
more than 12000. We aligned this sequence data, extracted the seven TMs individually from it,
and then merged the seven TMs. We then discarded those sequences that have an X in the TM
regions or that have more than five gaps across the seven TMs. After these refinements, we are
left with 11026 sequences, which are then used to train and test one of our methods.
3. GPCR PREDICTION BY EMPLOYING PHYSIOCHEMICAL
PROPERTIES USING HYBRID FEATURES
As discussed in chapter 1, GPCRs have been classified into different classes by different
researchers. We have followed a classification similar to that adopted in GPCRDB (GPCRDB,
2012). In GPCRDB, GPCRs are divided into 6 major classes, i.e. Rhodopsin-like (Class A),
Secretin (Class B), Metabotropic glutamate (Class C), fungal mating pheromone (Class D), cyclic
AMP (Class E) and frizzled or smoothened receptors (Class F). These 6 families are further
divided into sub classes and so on. In this method, we have classified GPCRs into three levels,
i.e. into families, sub families, and sub-sub families. At the first level, we have classified
GPCRs into 5 main families (class F is ignored because of its very small number of sequences),
at the second level into 40 sub families, and finally into 108 sub-sub families at the third
level, the same as done by (Davies, Secker, Freitas, Mendao, Timmis, & Flower, 2007). We have
used 4 datasets in this proposed method, named D8354, D167, D566 and D365; D8354 is the main
dataset for this method. These datasets are explained in chapter 2. The focus of this method is
to first investigate and utilize the importance of different physiochemical properties to
classify GPCRs, and to use a hybrid combination of spatial and transform domain methods to
increase the overall classification performance. We have named our method GPCR-Hybrid (Rehman &
Khan, 2011). An overview of this chapter is shown in Figure 3-1.
Figure 3-1: Overview of chapter 3
3.1. PHYSIOCHEMICAL PROPERTIES
The physiochemical properties used in the present method are the hydrophobicity, electronic and
bulk properties. The hydrophobicity property can be used to determine the structure and function
of a GPCR; its values can vary for different amino acids under different experimental
conditions. Biological molecules may have large non-polar regions, which can be described as
hydrophobic regions. Each GPCR contains 7 stretches of 20-30 hydrophobic amino acids essential
for passing through the cell membrane. Hydrophobicity can be quantified using many scales such
as KDH (James, et al., 1987), MH (Mandell, Selz, & Shlesinger, 1997), and FH (Fauchere & Pliska,
1983). Out of these scales, the Fauchere scale (Fauchere & Pliska, 1983) was found to be the
most discriminative for classifying GPCRs (Guo, et al., 2005), and it was therefore used by
Rehman (Rehman & Khan, 2011), (Rehman & Khan, 2012). The electronic property is given by the
value of the Electron-Ion Interaction Potential (EIIP) model, which is derived from the average
energy states of all valence electrons in a given amino acid (Cosic, 1994). Any particular amino
acid delocalizing electrons has the strongest impact on the electronic distribution of the whole
protein. The third physiochemical property (bulk) uses descriptors of the composition, polarity
and molecular volume (CPV) model (Grantham, 1974). The folding of the protein sequence is
greatly affected by this property.
3.2. FEATURE EXTRACTION AND CLASSIFICATION
We have performed feature extraction using three methods. In the first method, we have used
pseudo amino acid composition (PseAA) (Chou, 2001), employed with 2 and 3 physiochemical
properties of GPCRs: PseAA2 is computed using the hydrophobicity and electronic properties,
while PseAA3 is computed using the hydrophobicity, bulk and electronic properties. The second
feature extraction method is a hybrid feature vector (MSE-PseAA), which combines wavelet based
multi-scale energy (MSE) and PseAA based features; we have used 2 physiochemical properties in
MSE-PseAA, i.e. the hydrophobicity and electronic properties (Rehman & Khan, 2011). In the third
method, another hybrid feature vector (MSE-AA) is computed by combining the amino acid
composition and MSE features (Rehman & Khan, 2011). We have used 3 classification algorithms,
i.e. support vector machine (SVM), nearest neighbor (NN) and probabilistic neural network (PNN).
The details of these feature extraction strategies and classification algorithms are given in
chapter 2. For cross validation of the present method, we have used the jackknife test. At any
level, an unknown test sequence is classified using the classification algorithm that performed
best on the training data and the feature extraction method that best describes the training
data at that level. We have also developed a web site (GPCR-Hybrid), which takes a GPCR test
sequence as input and predicts its family, sub family and sub-sub family categories. GPCR-Hybrid
is available online at (Rehman Z., GPCR prediction, 2011).
Figure 3-2: GPCR-Hybrid web interface
The user enters a valid GPCR sequence in the textbox and clicks the Submit button. The web
server then shows the family, sub family and sub-sub family classes by running the appropriate
classification algorithm at each level.
3.3. GPCR-HYBRID
GPCR-Hybrid is an online web server, available at (Rehman Z., GPCR prediction, 2011), that
provides the classification of an unknown GPCR sequence into the three levels discussed in the
above sections. Using the training data, we have determined the best performing feature
extraction technique and classification algorithm at each level. The interface of GPCR-Hybrid is
shown in Figure 3-2. First, the user inputs a valid GPCR sequence in the textbox; the input
sequence should be in capital letters (e.g. MPWNG). Then, the user clicks the Submit button.
GPCR-Hybrid performs feature extraction of the input sequence using the best feature extraction
strategy of the family level (i.e. PseAA2) and predicts its class using the best performing
classifier (i.e. SVM) for the family level. After predicting the main family class, the same
process is repeated for predicting the sub family and sub-sub family levels. The names of the
family, sub family and sub-sub family classes are shown in a label, as in Figure 3-2. The
algorithm of GPCR-Hybrid is shown in Figure 3-3.
Figure 3-3: Working of GPCR-Hybrid
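The three-level cascade described above can be sketched as follows. The per-level routing (one model at the family level, then a model chosen by the predicted family, and so on) is an assumption about how a selective hierarchy is wired, and the lambda "models" are stubs standing in for the trained PseAA2+SVM and MSE-PseAA+SVM classifiers:

```python
# Sketch of the GPCR-Hybrid three-level prediction cascade; the models dict
# holds stand-in callables, not real trained classifiers.
def gpcr_hybrid(sequence, models):
    """models: dict mapping a level key or predicted class to a classifier."""
    family = models["family"](sequence)          # PseAA2 + SVM (best at level 1)
    subfamily = models[family](sequence)         # MSE-PseAA + SVM below level 1
    subsubfamily = models[subfamily](sequence)   # MSE-PseAA + SVM at level 3
    return family, subfamily, subsubfamily

toy_models = {
    "family": lambda s: "Rhodopsin-like",
    "Rhodopsin-like": lambda s: "Amine",
    "Amine": lambda s: "Serotonin",
}
print(gpcr_hybrid("MPWNG", toy_models))
```

Each prediction narrows the search for the next level, which is what keeps the 40 sub family and 108 sub-sub family problems tractable.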
3.4. RESULTS AND DISCUSSIONS
As discussed in the above sections, classification of a sequence is performed in 3 levels or
stages. At each level, the GPCR-Hybrid program selects the best feature extraction strategy and
classification algorithm. The results of classification for each level are described in the
following sections.
3.4.1. Family Level Classification
GPCR-Hybrid classifies GPCRs into five families. The performance is shown in terms of
sensitivity, overall accuracy, MCC, specificity and F-measure. The performance of each
classifier for each feature extraction strategy is described below.
3.4.1.1 Performance for PseAA2
The overall accuracies obtained using PseAA2 for the PNN, NN and SVM are reported as:
97.38%, 97.22 % and 97.86% respectively. The value of smoothing factor for PNN is chosen as
1. Using the same set of classifiers, MCC values are: 0.94, 0.93 and 0.95, respectively. Similarly
specificity values are: 96.72 %, 96.50 % and 96.89 %, sensitivity values are: 98.22 %, 98.13%
and 98.95%, and finally F-measures are: 0.96, 0.96 and 0.97, respectively.
3.4.1.2 Performance for PseAA3
The overall accuracies achieved using PseAA3 for PNN, NN and SVM are: 97.74%, 97.58% and
93.66%, respectively, with the smoothing factor of PNN = 0.6. Using the same set of classifiers,
sensitivity values are: 98.52%, 98.41% and 98.04%, specificity values are: 97.16%, 96.96% and
89.83%, MCC values are: 0.94, 0.94 and 0.85, and F-measures are: 0.96, 0.96 and 0.90,
respectively.
3.4.1.3 Performance for MSE-PseAA
The overall accuracies achieved for MSE-PseAA by using the PNN, NN and SVM are 96.98%,
96.89% and 97.41%, respectively. Using the same set of classifiers, specificity values are
96.16%, 96.01% and 96.58%, MCC values are: 0.93, 0.92 and 0.94, sensitivity: 98.01%, 97.97%
and 98.43% and F-measures: 0.96, 0.96 and 0.90, respectively.
3.4.1.4 Performance using MSE-AA
The overall accuracies achieved for MSE-AA using PNN, NN and SVM are: 96.28%, 96.22% and
97.06%, MCC: 0.91, 0.91 and 0.93, specificity: 95.22%, 95.08% and 96.06%, sensitivity: 97.57%,
97.59% and 98.23%, and F-measures: 0.94, 0.94 and 0.95, respectively.
It is obvious from the above results that PseAA2 with SVM gives the best performance at the
family level. The accuracy, sensitivity, MCC and F-measure values are the highest, while the
specificity value is also comparable. Therefore, for an unknown GPCR sequence, family level
feature extraction is performed using PseAA2 and classification using SVM. Results for family
level classification are shown in Figure 3-4 and Figure 3-5.
3.4.2. Sub Family Classification
There are 40 sub families in total. The performance is shown in terms of sensitivity, overall
accuracy and specificity. The detailed performance of each classifier for each feature
extraction strategy is described below.
3.4.2.1 Performance for PseAA2
The overall accuracies at sub family level achieved for PseAA2 using the PNN, NN and SVM are:
82.13%, 81.02% and 81.58%, specificity: 82.10%, 80.99% and 81.55%, and sensitivity: 81.30%,
80.55% and 81.15%, respectively.
Figure 3-4: GPCR classification performance for family level in terms of Accuracy, sensitivity
and specificity
Figure 3-5: GPCR classification performance for family level in terms of MCC and F-Measure
3.4.2.2 Performance for PseAA3
The Specificity measures for PseAA3 using PNN, NN and SVM are: 83.42%, 81.85% and
78.98%, overall accuracies: 83.47%, 81.88% and 79.02% and sensitivity: 83.18%, 81.52% and
78.85% respectively.
3.4.2.3 Performance for MSE-PseAA
The overall accuracies achieved for MSE-PseAA using the PNN, NN and SVM are: 80.36%,
80.73% and 84.97%, specificity: 80.27%, 80.69% and 84.94% and sensitivity values are 81.24%,
80.72% and 84.08% respectively.
3.4.2.4 Performance for MSE-AA
The overall accuracies achieved for MSE-AA using the PNN, NN and SVM are: 78.29%, 78.55% and
80.96%, specificity: 78.21%, 78.51% and 81.90%, and sensitivity: 78.79%, 78.51% and 81.95%,
respectively.
The MSE-PseAA feature extraction strategy with SVM is the most appropriate for sub family level
classification and is hence used by GPCR-Hybrid for sub family classification of any GPCR
sequence. The sub family classification results are also presented in Figure 3-6.
3.4.3. Sub-sub Family Classification
There are 108 sub-sub families in our main GPCR dataset. The performance is shown in terms of
sensitivity, overall accuracy and specificity. The detailed performance of each classifier for
each feature extraction strategy is given in the following sections.
Figure 3-6: GPCR classification performance for sub family level
3.4.3.1 Performance for PseAA2
The Specificity values at sub-sub family level for PseAA2 using PNN, NN and SVM are: 72.94%,
73.01% and 72.70%, overall accuracies: 72.88%, 72.95% and 72.65%, and sensitivity: 67.77%,
69.02% and 67.08% respectively.
3.4.3.2 Performance for PseAA3
The Specificity values at sub-sub family level for PseAA3 using PNN, NN and SVM are: 74.35%,
73.72% and 68.81%, overall accuracies: 74.29%, 73.67% and 68.78% and sensitivity: 69.82%,
69.71% and 68.96%, respectively.
3.4.3.3 Performance for MSE-PseAA
The Specificity values for MSE-PseAA using PNN, NN and SVM are: 71.15%, 72.53% and
70.32%, overall accuracies: 71.10%, 72.48% and 75.60% and sensitivity: 67.67%, 69.01% and
75.67%, respectively.
3.4.3.4 Performance for MSE-AA
The Specificity values for MSE-AA using PNN, NN and SVM are: 68.58%, 69.80% and 73.59%,
overall accuracies: 69.53%, 69.75% and 73.45% and sensitivity: 65.01%, 66.32% and 69.89%
respectively.
As shown in the above sections, the overall accuracy and sensitivity values for MSE-PseAA with
the SVM classifier are the highest. Hence, GPCR-Hybrid chooses MSE-PseAA with SVM for sub-sub
family level classification of any unknown GPCR sequence. Results for sub-sub family level
classification of GPCRs are shown in Figure 3-7.
Figure 3-7: GPCR classification performance for sub-sub family level
3.4.4. Comparison with Selective Top Down Approach
The selective top down approach (Davies, Secker, Freitas, Mendao, Timmis, & Flower, 2007) also
classifies GPCRs into 3 levels. At the family level, PseAA2 with SVM is superior and is hence
compared with the family level performance of the top down approach, while at the sub family and
sub-sub family levels MSE-PseAA with SVM is compared with the top down method's performance. The
selective top down approach reported performance only in terms of overall accuracy. The overall
accuracy achieved by the selective top down approach at the family stage is 95.87%, while
GPCR-Hybrid has an overall accuracy of 97.86%. At the sub family level, the selective top down
method achieves 80.77% and GPCR-Hybrid 84.97%. The selective top down method has an overall
accuracy of 69.98% at the sub-sub family level, while GPCR-Hybrid achieves 75.60%. At all 3
levels, GPCR-Hybrid performs much better than the selective top down approach. This improvement
in performance is due to the hybrid combination of transform and spatial domain
feature-extraction strategies; in addition, the use of physiochemical properties has positively
affected the performance.
Figure 3-8: Comparison with Selective Top Down method
3.4.5. Comparison with other methods
As mentioned in section 2.2, we have tested and compared our method on three additional datasets, termed D167, D566 and D365. The GPCR sequences in each of these datasets belong to only one of the levels. We have compared our overall accuracy with the overall accuracies of existing methods on these datasets. We have computed results on each of these datasets using the SVM classifier with four different kernels, i.e. Lin-SVM, Poly-SVM, RBF-SVM and Sig-SVM, and report the results of the best kernel function for each dataset.
On D167, we have compared the overall accuracy of GPCR-Hybrid with the overall accuracies of six existing methods (Elrod & Chou, 2002), (Huang, Cai, Ji, & Li, 2004), (Bhasin & Raghava, 2005), (Gao & Wang, 2006), (Gao, Wu, Ma, Lu, & He, 2008), (Peng, Yang, & Chen, 2010). The overall accuracy achieved by GPCR-Hybrid is higher than that of all six methods.
Figure 3-9: Comparison on D167 dataset
Two existing methods have used D365: GPCR-CA (Xiao, Wang, & Chou, 2009) and PCA-GPCR (Peng, Yang, & Chen, 2010). The overall accuracies achieved by PCA-GPCR and GPCR-CA are 92.60% and 83.56%, while the overall accuracy achieved by the GPCR-Hybrid method is 91.72%, which is about 8 percentage points higher than GPCR-CA and comparable to PCA-GPCR.
Figure 3-10: Comparison on D365 dataset
Figure 3-11: Comparison on D566 dataset
GPCR-Hybrid is compared with the PCA-GPCR method on the D566 dataset. The overall accuracy achieved by PCA-GPCR is 97.88%, while GPCR-Hybrid achieves 97.91%.
The improvements in the performance of GPCR-Hybrid over the existing methods result from the hybrid combination of spatial and transform domain features and the employment of physiochemical properties. The optimization of SVM parameters with a proper kernel for each dataset has also contributed to the improvement.
4. GPCRs PREDICTION USING GREY INCIDENCE DEGREE MEASURE
AND PRINCIPAL COMPONENT ANALYSIS
GPCR sequences are made up of amino acid polypeptide chains, which can also be called subunits. The number and arrangement of the subunits forming a GPCR sequence is called its quaternary structure. There are different types of quaternary structures in GPCRs, such as monomers, dimers, trimers, tetramers and pentamers. Some biological processes are directly affected by quaternary structure. For example, monomers form sodium channels (Chen, Alcayaga, Suarez-Isla, ORourke, Tomaselli, & Marban, 2002), homo-tetramers form potassium channels (Doyle, et al., 1998), homo-pentamers make phospholamban channels (Oxenoid & Chou, 2005), (Oxenoid, Rice, & Chou, 2007) and hetero-pentamers make the α7 nicotinic acetylcholine receptor (Chou, 2004). Some transitions only occur in tetramers, dimers bind some ligands and tetramers form some ion channels.
In this method, we have again classified GPCRs into three levels as in chapter 3. We have hybridized three feature extraction approaches, i.e. split amino acid composition (SAAC), pseudo amino acid (PseAA) composition and the fast Fourier transform (FFT). We have employed two physiochemical properties, i.e. Electronic and Bulk, in PseAA; these are already explained in chapter 3, and all of the feature extraction strategies are explained in chapter 2. The number of features taken in PseAA is 62, in SAAC 60 and in FFT 256, giving 378 features in total. As the number of features after hybridization becomes so high, to avoid the curse of dimensionality we have applied principal component analysis (PCA) to reduce the features. After applying PCA, the size of the feature vector is reduced to 180. For classification we have used the nearest neighbor algorithm. We have computed the nearest neighbors of a test sequence in two ways, i.e. with the grey incidence degree measure and with the Euclidean distance measure; the grey incidence degree measure performs better than the Euclidean distance. We have trained and tested our methods on D8354 and compared with other methods on the D167 and D566 datasets. An overview of the chapter is shown in Figure 4-1.
Figure 4-1: Overview of chapter 4
4.1. GREY INCIDENCE DEGREE MEASURE
Deng introduced grey theory in 1982 to analyze the uncertainty of a system (Deng, 1982). This theory is applicable to problems in which information is fuzzy or uncertain. The grey incidence degree (GID) measure is one of the major components of this theory (Liu, Fang, & Lin, 2005). The classification of GPCRs is also a fuzzy problem: some GPCR sequences can be put into one class based on some properties, but can also be put into another class because of other properties.
$T = \{T_1, T_2, \ldots, T_n\}$  (4.1)

$\gamma_k^{t,i} = \dfrac{\Delta_{Min} + \xi\, \Delta_{Max}}{\Delta_k^{t,i} + \xi\, \Delta_{Max}}$  (4.2)

where $T_1, T_2, \ldots, T_n$ are the numeric forms of the $n$ training sequences, $T_t$ is the test sequence, and $\gamma_k^{t,i}$ is the grey relational coefficient. Here $\Delta_k^{t,i} = |P_k^t - P_k^i|$, $\Delta_{Min} = \min_j \min_k |P_k^t - P_k^j|$ and $\Delta_{Max} = \max_j \max_k |P_k^t - P_k^j|$, where $j = 1, 2, \ldots, n$ are the indices of training sequences, $k = 1, 2, \ldots, 180$ are the indices of features of a GPCR sequence, and $\xi$ is the distinguishing coefficient, whose value lies between 0 and 1.
The grey incidence degree $O$ of the test sequence with a training sequence is a weighted sum of the grey relational coefficients and is given by the following equation.

$O(G_t, G_i) = \sum_{k=1}^{180} w_k\, \gamma_k^{t,i}$  (4.3)

where $w_k$ is the weight associated with each feature. We have given equal weight to each feature and taken the value of $\xi$ equal to 0.5, as in existing work (Tsai, Liou, & Jiang, 2005), (Xiao, Wang, & Chou, 2009). The grey incidence degree $O(G_t, G_i)$ is the correlation between the test sequence $G_t$ and the training sequence $G_i$. The training sequence closest to the test sequence will have a grey incidence degree higher than the other training sequences and hence can annotate the test sequence with its class. In this method, we have employed GID in the nearest neighbor algorithm to compute the neighbors of a test sequence, which in turn annotate the test sequence.
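The computation in eqs. (4.2)-(4.3) can be sketched in Python as follows. This is an illustrative sketch with equal feature weights ($w_k = 1/K$); the function and variable names are ours, not part of the thesis software.

```python
import numpy as np

def grey_incidence_degree(test, train, xi=0.5):
    """Grey incidence degree of a test feature vector against each row
    of a training matrix (eqs. 4.2 and 4.3, equal weights)."""
    diff = np.abs(train - test)            # |P_k^t - P_k^j| for all j, k
    d_min, d_max = diff.min(), diff.max()  # global min/max differences
    gamma = (d_min + xi * d_max) / (diff + xi * d_max)  # relational coefficients
    return gamma.mean(axis=1)              # equal-weight sum over features

def gid_nearest_neighbor(test, train, labels, xi=0.5):
    """Annotate the test sequence with the class of the training
    sequence having the highest grey incidence degree."""
    o = grey_incidence_degree(test, train, xi)
    return labels[int(np.argmax(o))]
```

A training vector identical to the test vector attains the maximum coefficient of 1 in every feature, so it dominates the incidence degree, which is the nearest-neighbor behavior described above.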
4.2. PRINCIPAL COMPONENT ANALYSIS
Principal component analysis (PCA) is a useful technique in pattern classification and machine learning for analyzing patterns in high dimensional data and for highlighting the differences and similarities in the data. It transforms high dimensional data into a much lower dimension without the loss of significant information. PCA is used in many different fields, from neuroscience to computer graphics, because it is a non-parametric method for extracting useful, relevant information from confusing data sets. The mathematical description of PCA is summarized in the sections given below.
The mathematical details of PCA are explained in (Howard, 2000). Suppose we have multi-dimensional data. We first compute the mean across each dimension and subtract the mean from each value of that dimension, so that the data now has zero mean. Then we calculate the covariance matrix of the zero-mean data. The covariance matrix shows the relations between the different dimensions of high dimensional data; covariance is measured between pairs of dimensions. The covariance matrix is an N x N matrix, where N is the number of dimensions of the data. The covariance of a dimension with itself is equal to the variance of that dimension.

$COV(X, Y) = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$  (4.4)
where $COV(X, Y)$ is the covariance between the $X$ and $Y$ dimensions, $\bar{X}$ is the mean of the $X$ dimension, $\bar{Y}$ is the mean of the $Y$ dimension and $n$ is the number of data points. Next, we compute the eigenvalues and eigenvectors of the covariance matrix and sort the eigenvectors according to their eigenvalues. We then discard some of the less important eigenvectors to reduce the dimensionality of the data. Finally, we multiply the transpose of the chosen eigenvectors with the original high dimensional data and pass this data as features to the classification algorithm. We have named the GID based method GPCR-GID (Rehman & Khan, 2011). The overview of GPCR-GID is shown in Figure 4-2.
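The mean-centering, eigendecomposition and projection steps above can be sketched as follows. This is an illustrative NumPy sketch; the names and the use of `numpy.linalg.eigh` are our choices, not the thesis implementation.

```python
import numpy as np

def pca_reduce(data, n_components=180):
    """Reduce feature dimensionality by PCA as described above:
    mean-center, eigendecompose the covariance matrix, keep the
    eigenvectors with the largest eigenvalues, and project."""
    centered = data - data.mean(axis=0)          # zero mean in every dimension
    cov = np.cov(centered, rowvar=False)         # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    top = eigvecs[:, order[:n_components]]       # keep the leading eigenvectors
    return centered @ top                        # project onto principal axes
```

The default of 180 components matches the reduced feature-vector size used in this chapter; any smaller value works the same way.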
Figure 4-2: Overview of GPCR-GID
4.3. RESULTS AND DISCUSSIONS
As explained at the start of this chapter, we have trained and tested our methods on D8354. The GPCRs in this dataset are classified into three levels, i.e. family, sub family and sub-sub family levels. In this proposed method, we have used only the accuracy measure for performance assessment. The following sections give the details of the results.
4.3.1. Family level classification
GPCRs are classified into five families. The percentage accuracy of the GID based method is 97.82%, while the Euclidean distance based method achieves 97.44%.
4.3.2. Sub family level classification
The five families of GPCRs are further classified into 40 sub families at this level. The percentage accuracy of the GID based method is 81.55%, while the Euclidean distance based method achieves 80.97%.
4.3.3. Sub-sub family level classification
The 40 sub families of GPCRs are further classified into 108 sub-sub families at this level. The percentage accuracy of the GID based method is 73.32%, while the Euclidean distance based method achieves 72.66%. The performance of both methods is also shown in Figure 4-3.
Figure 4-3: Performance of GID and Euclidian distance methods
Figure 4-3 clearly shows that the performance of GPCR-GID is superior to the Euclidean distance based method at all three levels. Hence, we have compared GPCR-GID with other existing methods.
4.3.4. Comparison with other methods
We have trained our method on the D8354 dataset and compared it with other methods using D8354. We have also compared our method with existing methods using the D167 and D566 datasets, which are already explained in chapter 2. The comparison details are as follows.
4.3.4.1 Comparison with Selective top down approach
In the selective top down approach, GPCRs are hierarchically classified into three levels (Davies, Secker, Freitas, Mendao, Timmis, & Flower, 2007). The selective top down method assesses its performance using the accuracy measure, so we have compared our accuracy with it, as shown in Figure 4-4.
Figure 4-4: Comparison with selective top down approach
At family level, the best percentage accuracy achieved by the selective top down approach is 95.87%, while the accuracy achieved by GPCR-GID is 97.82%. At sub family level, the best accuracy achieved by the selective top down approach is 80.77%, while GPCR-GID achieves 81.55%. The selective top down approach achieves 69.98% accuracy at sub-sub family level, while GPCR-GID achieves 73.32%. At all three levels of GPCRs, GPCR-GID is significantly superior to the selective top down approach, which strengthens the worth of GPCR-GID.
4.3.4.2 Comparison with other existing methods on D167 and D566 datasets
There are six existing methods with which we have compared GPCR-GID on the D167 dataset, i.e. (Elrod & Chou, 2002), (Huang, Cai, Ji, & Li, 2004), (Bhasin & Raghava, 2005), (Gao & Wang, 2006), (Gao, Wu, Ma, Lu, & He, 2008) and PCA-GPCR (Peng, Yang, & Chen, 2010). Again, we have used the accuracy measure for comparison. This comparison is shown in Figure 4-5, which clearly shows the superiority of GPCR-GID over all six methods.
Figure 4-5: Comparison on D167
There are two methods with which we have compared GPCR-GID on D566. One is PCA-GPCR (Peng, Yang, & Chen, 2010) and the other is by Chou (Chou & Elrod, 2002). The percentage accuracy achieved by PCA-GPCR is 97.88% and by (Chou & Elrod, 2002) is 92.05%, whereas the accuracy achieved by GPCR-GID is 97.96%.
Figure 4-6: Comparison on D566
Figure 4-6 shows the superiority of GPCR-GID over PCA-GPCR and Chou's method (Chou & Elrod, 2002). This improvement in the performance of GPCR-GID has several reasons. One is the hybridization of spatial domain and transform domain features together with PCA for feature reduction. Secondly, the GID measure based method can efficiently discriminate classes by numerically capturing the quaternary structure of a GPCR.
5. GPCRs PREDICTION USING GENETIC ALGORITHM BASED
ENSEMBLE CLASSIFICATION
This chapter focuses on the classification of GPCRs using ensemble approaches. In ensemble classification, various classifiers contribute their strengths to increase the performance of the overall classification. There are several types of ensemble approaches; our focus in this chapter is on weighted ensemble classification, in which weights are assigned to each classifier and optimized using appropriate optimization techniques. Each classifier votes for a class after weighting, and the label with the majority of votes is assigned to the unknown GPCR sequence. The binary genetic algorithm is one suitable technique for optimizing the weights; its optimization performance is controlled by appropriate parameter settings. The features of a GPCR sequence are first extracted using the MSE-PseAA and PSE-PSSM techniques. The physiochemical properties used in the MSE-PseAA approach are the Hydrophobicity, Electronic and Bulk properties, which are explained in detail in chapter 3. MSE-PseAA is also explained in chapter 3 and PSE-PSSM is already explained in chapter 2. PSE-PSSM incorporates evolutionary information in the features: the position specific scoring matrix is used to extract biological features (Schaffer, Aravind, Madden, Shavirin, Spouge, & al., 2001), so both physiochemical and biological properties are utilized. The classification algorithms used are NN, PNN, GID and SVM. The predictions of all four classifiers are combined by weighting, and the final prediction for a GPCR sequence is made. We have named this technique PSE-PSSM (Rehman & Khan, 2012). The overview of chapter 5 is shown in Figure 5-1.
Figure 5-1: Overview of chapter 5
The datasets used in PSE-PSSM are D8354, D167, D365 and D566. Again, we have classified GPCRs into three levels using the D8354 dataset, i.e. family, sub family and sub-sub family levels. PSE-PSSM is a very accurate method for feature extraction but it consumes a lot of time, so we have used it only for a smaller dataset, i.e. D167.
5.1. CLASSIFICATION ALGORITHM
As discussed at the start of this chapter, NN, PNN, SVM and GID are used as classification algorithms. The ensemble classifier is made from the weighted majority voting of these four main classifiers. For some datasets, we have used four different kernel functions for the SVM, i.e. Radial Basis Function (RBF-SVM), Polynomial (Poly-SVM), Sigmoid (Sig-SVM) and Linear (Lin-SVM). The LIBSVM 2.88-1 package (lib SVM) is available online and provides the code for the different SVMs. If we count each of these four kernels as a different classifier, then weighted majority voting is performed using seven classifiers in total. Each classifier votes for a class with a certain weight, and an unlabeled GPCR sequence is assigned the class with the maximum votes.
5.2. WEIGHT OPTIMIZATION USING GENETIC ALGORITHM
The binary genetic algorithm (GA) is discussed in detail in chapter 2. It has four main phases:
Population generation and initialization
Evaluation of fitness
Crossover, mutation and reproduction
Termination criteria
For each dataset, we have first computed prediction matrices using each classifier (Rehman & Khan, 2012). A chromosome in the GA represents the weight vector, one weight for each of the seven classifiers, to be optimized. This weight vector is multiplied with the prediction matrices as shown in the following equation.

$Z(i) = \max_{j=1,2,\ldots,C} \sum_{k=1}^{n} W_k\, z_{k,j}$  (5.1)

where $Z(i)$ is the prediction of the ensemble for a sequence $i$, $j = 1, 2, \ldots, C$ indexes the classes in the dataset, $k = 1, 2, \ldots, n$ indexes the classifiers used, $W_k$ is the weight of a particular classifier $k$ and $z_{k,j}$ is the prediction of the individual classifier. The unknown sequence $i$ is annotated with the label of the class with the maximum vote or score (after multiplying with the weight vector). The same process is repeated for all sequences in the dataset and the accuracy over the dataset is computed. The fitness function is defined as the negative of accuracy.

$Fitness = -Accuracy$  (5.2)
The GA's objective is to increase the overall accuracy, i.e. to decrease the fitness value. The increase in accuracy is achieved by optimizing the weights for each classifier. The ranking of chromosomes (weight vectors) is performed based on their accuracies for a dataset. After ranking, crossover, mutation and reproduction are performed with certain probabilities and the chromosome population proceeds to the next generation. The GA is run for 100 generations with a stall limit of 50 generations, i.e. if there is no improvement in performance over the last 50 generations, the GA stops. If the stall limit is not reached but some other termination criterion is met, the GA is also stopped and the weighting is considered optimized. The PSE-PSSM method is shown in Figure 5-2.
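Equations (5.1) and (5.2) amount to the following fitness evaluation for a single chromosome. This is a hedged Python sketch; the one-hot encoding of the prediction matrices and all names are our illustrative choices, not the thesis code.

```python
import numpy as np

def ensemble_fitness(weights, votes, labels):
    """Fitness of one GA chromosome (eqs. 5.1 and 5.2): weight each
    classifier's one-hot vote matrix, sum the weighted votes per class,
    predict the highest-scoring class, and return minus the accuracy."""
    # votes: (n_classifiers, n_sequences, n_classes) one-hot predictions
    weighted = np.tensordot(weights, votes, axes=1)  # (n_sequences, n_classes)
    predicted = weighted.argmax(axis=1)              # class with maximum score
    accuracy = np.mean(predicted == labels)
    return -accuracy                                 # the GA minimizes fitness
```

A GA library or hand-written GA loop would call this function once per chromosome per generation and rank the population by the returned value.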
Figure 5-2: Overview of PSE-PSSM method
5.3. RESULTS AND DISCUSSIONS
The performance on each dataset is first assessed individually for each of the classification algorithms. Then, weighted majority voting is performed as explained in section 5.2. The performance details are given in the following sections.
5.3.1. Classification performance on D8354
In the D8354 dataset, a GPCR sequence is predicted at three levels, i.e. family, sub family and sub-sub family levels, and for each sequence we output the family, sub family and sub-sub family class names. The details of the family, sub family and sub-sub family names and the respective numbers of sequences are already given in chapter 2. The crossover rate is 0.8, the mutation rate is 0.1 and the reproduction rate is 0.1 for all three levels.
5.3.1.1 Family level classification
The individual classifier accuracies achieved using PNN, NN, SVM and GID are 96.98%, 96.89%, 97.41% and 97.12%, respectively. The weights optimized by the GA are associated with each of these four classifiers to further improve performance. Initially, a population of 30 chromosomes is generated. The size of a chromosome is taken as 4 for the D8354 dataset. Roulette wheel is used as the selection function. The number of generations is 100 and the stall limit is set to 50. At the end of each generation, the weight vectors are improved. Finally, the optimized weight vector is obtained as: PNN=0.034, NN=0.119, SVM=0.209, GID=0.637. The accuracy after weighted majority voting is 97.414%.
5.3.1.2 Classification performance at sub family level
There are 40 classes at sub family level. The individual classifier accuracies achieved using PNN, NN, SVM and GID are 80.36%, 80.73%, 84.97% and 81.10%, respectively. The weight vector after optimization by the GA is PNN=0.298, NN=0.022, SVM=0.097 and GID=0.582. The accuracy achieved by weighted majority voting is 84.97%.
Figure 5-3: GA run for family level
5.3.1.3 Classification performance at sub-sub family level
There are 108 classes at sub-sub family level. The individual classifier accuracies using PNN, NN, SVM and GID are 71.10%, 72.48%, 75.60% and 72.90%, respectively. The weight vector after optimization by the GA is PNN=-0.086, NN=0.307, SVM=0.249 and GID=0.529. The accuracy achieved by weighted majority voting is 75.81%. The details of the results on the D8354 dataset are shown in Figure 5-6.
Figure 5-4: GA run for subfamily level
Figure 5-5: GA run for sub-subfamily level
Figure 5-6: Classification performance on D8354 dataset
5.3.2. Comparison with existing approaches on D8354
For comparison on the D8354 dataset, we have compared our method with the selective top down method (Davies, Secker, Freitas, Mendao, Timmis, & Flower, 2007) and with GPCR-Hybrid (Rehman & Khan, 2011). The selective top down approach classifies GPCRs hierarchically into three levels. The comparisons are shown in Figure 5-7.
We have compared the accuracies of PSE-PSSM with those of the GPCR-Hybrid and selective top down methods. The selective top down method achieves an accuracy of 95.87% at family level, GPCR-Hybrid achieves an overall accuracy of 97.86%, while PSE-PSSM achieves an accuracy of 97.41%; PSE-PSSM's accuracy is comparable with GPCR-Hybrid and slightly higher than the selective top down approach. At sub family level, the accuracy of the selective top down method is 80.77%, GPCR-Hybrid achieves 84.97% and PSE-PSSM also achieves an accuracy of 84.97%. Finally, at sub-sub family level, the selective top down method achieves an accuracy of 69.98%, GPCR-Hybrid 75.60% and PSE-PSSM 75.85%. At sub-sub family level, PSE-PSSM has performed better than the other two methods. We think that this improved performance is due, first, to the hybrid combination of wavelet based multi scale energy and pseudo amino acid composition based features; secondly, the optimized weighted majority voting has played an important role in improving the performance of the classification method.
Figure 5-7: Comparison on D8354 dataset
5.3.3. Comparison on D167, D365 and D566 datasets
As mentioned in chapter 2, there are 167 GPCR sequences in the D167 dataset. The population size in the GA for D167 is taken as 50, the mutation probability as 0.1, the crossover probability as 0.8 and the reproduction probability as 0.1. We have used the Roulette wheel method for the selection of chromosomes. We have run the GA for 100 generations and assessed the performance of both of our feature extraction strategies, i.e. MSE-PseAA and PSE-PSSM, separately. The results on D167 are shown in Figure 5-8; the GA graphs for MSE-PseAA and PSE-PSSM are shown in Figure 5-10 and Figure 5-11, respectively.
As shown in Figure 5-8, the performance of PSE-PSSM is slightly better than MSE-PseAA, so we have compared the PSE-PSSM based method with two existing methods, i.e. (Elrod & Chou, 2002) and (Huang, Cai, Ji, & Li, 2004). This improvement is due to the embedding of evolutionary information in the feature extraction. The accuracy achieved in (Elrod & Chou, 2002) is 83.23%, in (Huang, Cai, Ji, & Li, 2004) it is 83.20%, and PSE-PSSM achieves an accuracy of 95.81%; this comparison is shown in Figure 5-9.
Figure 5-8: Classification performance on D167 dataset
Figure 5-9: Comparison on D167 dataset
Figure 5-10: GA run for D167 using MSE-PseAA
Figure 5-11: GA run for D167 using PSE-PSSM
We have compared our method with the GPCR-CA (Xiao, Wang, & Chou, 2009) method on the D365 dataset (Chou, 2005). The GA parameters used for D365 are: population size = 100, selection function = Tournament selection, uniform mutation rate = 0.1 and 2-point crossover rate = 0.8. GPCR-CA achieves an accuracy of 83.56%, while PSE-PSSM achieves an accuracy of 90.14%. The GA run is shown in Figure 5-13 and the classification performance on D365 and D566 is shown in Figure 5-12.
Figure 5-12: Classification performance on D365 and D566 datasets
Figure 5-13: GA run for D365 dataset
We have compared PSE-PSSM with three existing methods on D566, i.e. the PCA-GPCR (Peng, Yang, & Chen, 2010) method, (Chou & Elrod, 2002) and GPCR-Hybrid (Rehman & Khan, 2011). The accuracy achieved by the PCA-GPCR method is 97.88%, the accuracy in (Chou & Elrod, 2002) is 92.05% and in GPCR-Hybrid it is 97.91%, while the accuracy achieved by PSE-PSSM is 97.88%. The graph of the GA run is shown in Figure 5-14.
Figure 5-14: GA run for D566
The comparison on D365 is shown in Figure 5-15 and on D566 in Figure 5-16.
Figure 5-15: Comparisons on D365 dataset in terms of % accuracy
Figure 5-16: Comparison on D566
6. ALIGNMENT BASED STRUCTURAL CLASSIFICATION OF GPCRS
USING TRANSMEMBRANE REGIONS
GPCRs can be classified based on ligand binding or by molecular phylogenetic analyses. Phylogenetic analyses are usually based on multiple sequence alignments (MSAs). There are various methods for MSA, i.e. progressive methods, iterative methods, local and global alignments and motif based alignment; we have already discussed the different types of MSA in chapter 2. We propose a novel motif based alignment method for the alignment and classification of 19 sub families (and unknown receptors) of Rhodopsin like GPCRs. In some cases, we have further divided sub families, and receptors that are unknown are kept separately. Rhodopsin like receptors have great diversity in structure and function and are in high demand for drug development. Humans have about 800 Rhodopsin like GPCRs. The structure of Rhodopsin receptors consists of an extracellular N terminal, an intracellular C terminal and seven transmembrane helical structures, and the family comprises about 80% of all GPCRs. We computed pseudo-count based position specific scoring matrices, whose scores are then mapped to the extreme value distribution (EVD). The EVD is used to set thresholds to identify motifs, and alignments are then performed based on the motifs. Based on the EVD scores and thresholds, we have then performed the classification of GPCRs.
Initially, we generated alignments of the 19 sub families with T-Coffee (T-Coffee). We set the human sequences as references in each family and extracted the motifs (TMs) of all sequences of that family. We removed those sequences that do not have all seven motifs or that have bad motifs (too many gaps in the motifs). The extracted motifs are merged together. We tested our method for various motif lengths and determined the appropriate combined motif length to be 182 amino acids. Position specific scoring matrices (PSSMs) are then computed for the extracted merged motif regions of the 19 sub families. To account for missing amino acids in the PSSMs, we have added pseudo counts using the Blosum62 matrix and a proposed GPCR scoring matrix. Raw scores from the PSSMs are mapped to the extreme value distribution to define thresholds for each family corresponding to each of the seven motifs. These thresholds are used to identify motifs and to classify Rhodopsin like GPCRs into the 19 sub families. The overview of chapter 6 is shown in Figure 6-1.
Figure 6-1: Overview of chapter 6
6.1. SEVEN MOTIFS OF RHODOPSIN LIKE GPCRS
The generalized structures of the seven motif regions (the seven transmembrane helical regions) in Rhodopsin like GPCRs, observed from the SYLYBS software, are as follows:
M1: xxxxxxxxxxxxxxxxGNxxxxxxxx
M2: xxxxxxxLxxxDxxxxxxxxxxxxxxxxx
M3: xxxxxxxxxxxxxxxxxxxxDRYxxx
M4: xxxxxxxxWxxxxxxxxxPx
M5: xxxxxxxxFxxPxxxxxxxYxxxxxxxx
M6: xxxxxxxxxxxxxxFxxCWxPxxxxxxxxx
M7: xxxxxxxxxxxxxxxNPxxYxxx
where M1 to M7 are the seven motifs and an x at any position indicates that any amino acid can occur there. These motif structures are found in most Rhodopsin like sequences, with occasional mutations at some places. There can be zero, one or more than one instance of a motif in a sequence. We have identified only those motifs that preserve the sequence order, i.e. M1 comes before M2, M2 before M3, and so on. This sequential nature of the motifs increases the overall quality of the multiple sequence alignment. Initially, the positions of the seven motifs are given manually to calculate the PSSMs and to train the method. The method can then be used to identify M1-M7 in any unknown sequence of the corresponding family.
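To make the order-preserving search concrete, the sketch below scans a sequence for simplified regular-expression cores of M1-M7 in order. This is our own illustrative simplification: the actual method locates motifs by PSSM scores against EVD thresholds, not by regular expressions.

```python
import re

# Conserved cores of the seven motifs listed above; "." stands for the
# x positions, which match any amino acid.
MOTIF_CORES = ["GN", "L.{3}D", "DRY", "W.{9}P", "F.{2}P.{7}Y",
               "F.{2}CW.P", "NP.{2}Y"]

def find_ordered_motifs(sequence):
    """Locate M1..M7 so that each motif starts after the previous one
    ends, preserving sequence order; returns the start positions, or
    None if any motif is missing in order."""
    positions, start = [], 0
    for core in MOTIF_CORES:
        match = re.compile(core).search(sequence, start)
        if match is None:
            return None              # motif absent in the required order
        positions.append(match.start())
        start = match.end()          # the next motif must come later
    return positions
```

Starting each search where the previous match ended enforces exactly the M1-before-M2-before-M3 constraint described in the text.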
6.2. POSITION SPECIFIC SCORING MATRIX USING PSEUDO COUNTS
Position specific scoring matrices (PSSMs) are calculated from blocks of aligned sequences; the length of a PSSM is the same as the length of its block. We have taken seven blocks corresponding to the seven motifs of the Rhodopsin like family. Each column is represented by a vector over the 20 amino acids: this 20-D vector counts the occurrences of the 20 amino acids in the column and their probabilities are computed, so an amino acid that occurs more frequently receives a higher score. PSSMs can be used to score the alignment of different sequences by sliding each sequence over the PSSM and looking up the value in the corresponding column of the PSSM. The length of the sliding window is the same as the length of the PSSM. Scores for each position of the window are computed and then summed to give the overall alignment score of that particular sliding window. Overall scores are usually computed in terms of log-odds (G.D., 1990) and (Altschul, 1991), so PSSMs are mostly composed of log-odds scores. For simplicity, we call the pseudo count based PSSM PSSM-PC.
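The sliding-window scoring described above can be sketched as follows. This is illustrative Python; it assumes a log-odds PSSM stored as an l x 20 NumPy array with our own column ordering, which is not part of the thesis software.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order assumed for the PSSM

def best_window_score(sequence, pssm):
    """Slide the sequence over an l x 20 log-odds PSSM: each window's
    score is the sum of the PSSM entries for its residues; return the
    best score and its start position."""
    l = pssm.shape[0]                        # PSSM length = motif length
    best_score, best_pos = float("-inf"), -1
    for start in range(len(sequence) - l + 1):
        window = sequence[start:start + l]
        score = sum(pssm[c, AMINO_ACIDS.index(aa)]
                    for c, aa in enumerate(window))
        if score > best_score:
            best_score, best_pos = score, start
    return best_score, best_pos
```

The best-scoring window is the candidate motif location, which is then accepted or rejected against the EVD threshold described in section 6.3.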
The drawback of the simple PSSM method is that the training sequences may be an incomplete sample of the full family set, so some amino acids may be missing in some columns of a block, resulting in zero counts. We have solved this problem by adding artificial pseudo counts for the missing counts. Pseudo counts can be added in various ways; one way is to use traditional scoring matrices like BLOSUM or PAM. We have used the BLOSUM 62 matrix and later computed a GPCR scoring matrix to compute pseudo counts. For a dataset of n sequences and a motif of length l residues, the PSSM-PC is an l x 20 matrix (Bissantz, Logean, & Rognan, 2004). We have computed seven PSSM-PCs corresponding to the seven motifs for each sub family of Rhodopsin like GPCRs. Each element $W_{ca}$ of the matrix is given by:
$W_{ca} = \log_2 \dfrac{f_{ca}}{f_a}$  (6.1)

where $c = 1, 2, \ldots, l$ and $a = 1, 2, \ldots, 20$. The $f_{ca}$ is the frequency of amino acid $a$ at position $c$ of the motif and $f_a$ is the overall frequency of amino acid $a$ in the current training data set. Pseudo counts are added in $f_{ca}$ to account for missing amino acid frequencies (Henikoff & Henikoff, 1996), so $f_{ca}$ is calculated as:
$f_{ca} = \dfrac{n_{ca} + b_{ca}}{N_c + B_c}$  (6.2)

where $b_{ca}$ is the pseudo count for amino acid $a$ at position $c$ of the motif, $n_{ca}$ is the number of counts of amino acid $a$ at position $c$ over the $n$ sequences, $N_c$ is the total number of counts at position $c$, and $B_c$ is the total number of pseudo counts at position $c$. $b_{ca}$ is obtained by multiplying the total number of pseudo counts at position $c$ by a factor $\alpha_{ca}$.

$b_{ca} = B_c\, \alpha_{ca}$  (6.3)
$\alpha_{ca} = \sum_{i=1}^{20} \dfrac{f_{ci}}{N_c}\, \dfrac{q_{ia}}{Q_i}, \qquad Q_i = \sum_{a=1}^{20} q_{ia}$  (6.4)
where $f_{ci}$ is the frequency of amino acid $i$ at position $c$ and $q_{ia}$ is the probability of replacement of amino acid $i$ by $a$ according to the Blosum62 matrix (Henikoff & Henikoff, 1992). We have calculated a PSSM-PC for each sub family. Then, we considered each PSSM as the training PSSM in turn, computed the scores of the other sub families, and plotted them. The plot helped us to identify the relationships between the sub families and to assign unknown receptors to one of the sub families or to put them in a new sub family, as shown in Figure 6-2.
Figure 6-2: PSSM plot tested on Chemokine PSSM
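A minimal sketch of the PSSM-PC computation of eqs. (6.1)-(6.2) follows. For brevity it uses simple background-proportional pseudo counts ($b_{ca} = B_c f_a$) in place of the Blosum62-based $\alpha_{ca}$ of eqs. (6.3)-(6.4); all names and the column ordering are our own.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm_pc(block, background, total_pseudo=5.0):
    """Log-odds PSSM with pseudo counts for one aligned motif block:
    W_ca = log2(f_ca / f_a), with f_ca = (n_ca + b_ca) / (N_c + B_c).
    Here b_ca = B_c * f_a (background-proportional pseudo counts)."""
    length = len(block[0])
    counts = np.zeros((length, 20))
    for seq in block:                        # n_ca: residue counts per column
        for c, aa in enumerate(seq):
            counts[c, AMINO_ACIDS.index(aa)] += 1
    n_c = counts.sum(axis=1, keepdims=True)  # N_c: total counts per column
    f = (counts + total_pseudo * background) / (n_c + total_pseudo)
    return np.log2(f / background)           # l x 20 log-odds matrix
```

The pseudo counts guarantee that residues absent from the training block receive a finite negative score rather than a zero frequency, which is exactly the problem they are introduced to solve.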
6.3. EXTREME VALUE DISTRIBUTION (EVD)
After the computation of the PSSM-PC, we have values for all possible amino acids at all positions of a motif, and each possible sliding window gives one particular score. If the motif length is 26, then there are $20^{26}$ possible window scores. EVDs are normally used to describe the distribution of the maxima or minima (extreme values) of samples of independent, identically distributed random variables; they are used to measure events that occur very rarely. Enumerating all $20^{26}$ scores is computationally far too expensive, so we have taken samples of 2-10 million randomly selected scores to fit the EVD. The EVD can be fitted using various methods, e.g. linear regression and maximum likelihood estimation; we have fitted the scores to the EVD using maximum likelihood estimation (Richard, 1992). There are three types of EVD, i.e. Gumbel (type I), Frechet (type II) and Weibull (type III). There are two statistical measures for the EVD, i.e. the P-value and the E-value. The P-value is the probability of observing at least one score greater than or equal to some score x; the E-value is the expected number of scores greater than or equal to x. We have taken the E-value threshold as 0.1 in most cases and as 0.00001 in a few cases. The probability density function of the EVD is:
Pdf(x) = λ exp(-λ(x-µ)) exp(-exp(-λ(x-µ)))    6.5
The E-value is given by:
E_value(x) = n [1 - exp(-exp(-λ(x-µ)))]    6.6
where x is a score, n is the number of scores, and λ and µ are the scale and location parameters of the EVD. These two parameters are estimated using maximum likelihood estimation. The likelihood of n random scores x_1, x_2, ..., x_n under the extreme value distribution is:
P(x_1, x_2, ..., x_n | λ, µ) = Π_{i=1}^{n} λ exp(-λ(x_i-µ)) exp(-exp(-λ(x_i-µ)))    6.7
By simplifying equation (6.7) we get:
P(x_1, x_2, ..., x_n | λ, µ) = λ^n exp(-λ Σ_{i=1}^{n} (x_i-µ)) exp(-Σ_{i=1}^{n} exp(-λ(x_i-µ)))    6.8
The log likelihood of eq. (6.7) or (6.8) is given by:
log L(λ, µ) = log P(x_1, x_2, ..., x_n | λ, µ)    6.9
log L(λ, µ) = n log λ - λ Σ_{i=1}^{n} (x_i-µ) - Σ_{i=1}^{n} exp(-λ(x_i-µ))    6.10
Now we have to compute the estimates of λ and µ such that the log likelihood is maximized. For this purpose, we take the partial derivatives of the log likelihood function and set them equal to 0.
∂(log L)/∂µ = nλ - λ Σ_{i=1}^{n} exp(-λ(x_i-µ)) = 0    6.11
∂(log L)/∂λ = n/λ - Σ_{i=1}^{n} (x_i-µ) + Σ_{i=1}^{n} (x_i-µ) exp(-λ(x_i-µ)) = 0    6.12
Solving eq. (6.11) for µ, we get:
µ = -(1/λ) log[(1/n) Σ_{i=1}^{n} exp(-λ x_i)]    6.13
Now, substituting this value of µ back into eq. (6.12) and simplifying, we get:
1/λ - (1/n) Σ_{i=1}^{n} x_i + [Σ_{i=1}^{n} x_i exp(-λ x_i)] / [Σ_{i=1}^{n} exp(-λ x_i)] = 0    6.14
After solving eq. (6.14) for λ using the Newton-Raphson method, we substitute λ back into eq. (6.13) to obtain the value of µ. Figure 6-3 shows the pdf of the EVD for motif-1 of the Amine sub family.
Figure 6-3: Plot of pdf for motif-1 of Amine sub family
E-values are used to define thresholds for the identification of motifs or transmembrane regions. The higher the E-value threshold, the greater the number of false positives in motif (or TM) detection. The plot of E-values for motif-3 is shown in Figure 6-4. The plot of the number of false positives (wrong detections of a motif) for different E-values is shown in Figure 6-5.
Figure 6-4: Plot of E-values for motif-3 Amine sub family
Figure 6-5: Number of false positives for different E-values
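Assuming fitted λ and µ, the P-value, E-value, and the score cut-off for a chosen E-value (such as the 0.1 threshold used here) follow directly from eq. (6.6); the inversion is a simple rearrangement:

```python
import math

def gumbel_pvalue(x, lam, mu):
    """P-value: probability of observing at least one score >= x,
    i.e. the survival function of the fitted Gumbel distribution."""
    return 1.0 - math.exp(-math.exp(-lam * (x - mu)))

def gumbel_evalue(x, lam, mu, n):
    """E-value (eq. 6.6): expected number of the n window scores >= x."""
    return n * gumbel_pvalue(x, lam, mu)

def threshold_for_evalue(e, lam, mu, n):
    """Invert eq. (6.6): the score threshold whose E-value equals e."""
    p = e / n
    return mu - math.log(-math.log(1.0 - p)) / lam
```

Raising the E-value threshold admits more candidate windows and hence more false positives, which is what Figure 6-5 illustrates.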
6.4.MOTIF DETECTION ALGORITHM
We have developed an algorithm for the detection of TMs in an unknown Rhodopsin-like sequence. In this algorithm, we have defined seven sliding windows, one equal in length to each of the seven motifs (TMs). These windows are slid over the sequence one by one, scores are computed, and E-values are then calculated from the scores using the λ and µ parameters. We have defined a threshold of E < 0.1 for the detection of all motifs. All scores with E-values less than 0.1 are candidates for a particular motif. We have verified on our data that the true motifs have the highest scores (or lowest E-values) 95-98% of the time. However, to be more careful, we select the top 5 scores as candidates for each motif so that we do not miss any motif. There can therefore be a maximum of 5^7 (= 78,125) possibilities for the selection of the seven motifs in one sequence. We have to select the combination of choices for the 7 motifs such that they are maintained sequentially (i.e. M1 comes first, then M2, then M3, ..., M7). There are four different E-value thresholds. The following are the steps involved in the motif detection algorithm.
1. Slide a motif window over the sequence, find the top 5 scores, and sort them.
2. Repeat step 1 for all 7 motifs.
3. If the top scores of each of the seven motifs preserve the sequence order, then output the locations of these seven top motifs in the test sequence and go to step 9.
4. Assign ratings to all choices of scores for a motif, i.e. the top score's rating = 5, the second = 4, ..., and the last = 1.
5. Find all those combinations in which the scores for the 7 motifs preserve the sequence order M1 → M2 → M3 → ... → M7.
6. Add up the ratings for each of the combinations (e.g. 5+5+4+3+5+4+5 = 31).
7. Select the combination which gives the highest rating.
8. Output the locations of the motifs in the test sequence whose score combination has given the highest rating.
9. End
After the detection of the motifs in the test sequence, its sub family is predicted. In addition, it is aligned against the training alignment of that sub family.
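Steps 4-8 amount to an exhaustive search over the rated candidate combinations (5^7 = 78,125 in the seven-motif case); a sketch with hypothetical (position, rating) inputs:

```python
from itertools import product

def choose_motif_combination(candidates):
    """candidates[m]: list of (position, rating) pairs for motif m, rating 5
    for the best-scoring window down to 1 for the fifth (steps 4-5 above).
    Returns the positions of the order-preserving combination with the
    highest summed rating (steps 6-8), or None if no valid one exists."""
    best, best_rating = None, -1
    for combo in product(*candidates):          # 5**7 = 78,125 choices for 7 motifs
        positions = [pos for pos, _ in combo]
        # The motifs must stay sequential: M1 before M2 before ... before M7.
        if all(a < b for a, b in zip(positions, positions[1:])):
            rating = sum(r for _, r in combo)
            if rating > best_rating:
                best, best_rating = positions, rating
    return best
```

With only 78,125 combinations per sequence, brute-force enumeration is cheap; for larger candidate lists a dynamic-programming pass over sorted positions would scale better.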
6.5.MULTI DIMENSIONAL SCALING
Multi-dimensional scaling (MDS) is a statistical technique used to show similarities or dissimilarities in different types of data. It can also visualize the relationships between items of high dimensional data in a corresponding low dimensional space. It takes as input an N × N symmetric matrix of pairwise dissimilarities with zero diagonal elements, and performs the scaling on it.
We have performed MDS in three ways: sequence similarity between families, sequence similarity individually between all sequences, and PSSM based sequence similarity for each sub family.
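The thesis does not state which MDS variant was used; classical (Torgerson) MDS, one common choice, can be sketched as:

```python
import numpy as np

def classical_mds(d, k=2):
    """Embed an N x N symmetric dissimilarity matrix d (zero diagonal)
    into k dimensions via classical (Torgerson) MDS."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]             # keep the k largest components
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Three collinear "families" at mutual distances 1, 1, 2: a 1-D embedding
# recovers the spacing exactly.
d = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
coords = classical_mds(d, k=1)
```

Families with small mutual distances land close together in the embedding, which is how Figure 6-6 should be read.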
Figure 6-6 shows the sequence similarities between different sub families. We have split most of the families into two parts and have also included many families of unknown receptors. Families that are more similar appear closer in the plot. This method correctly showed the closeness of most of the sub families, e.g. purin1 and purin2. However, in some cases it may not place related families close together, e.g. Beta1 and Beta2. We believe that PSSM based MDS can overcome this problem.
Figure 6-6: MDS plot based on sequence similarity between various sub families
7. CONCLUSIONS AND FUTURE DIRECTIONS
GPCRs are physiologically very important in living organisms and are targeted by more than 50% of the marketed drugs. The number of newly discovered GPCR sequences entering the databanks is increasing day by day, and it is therefore very difficult to annotate them manually. Hence, automatic and accurate classification is highly desired. A lot of research has already been done on the prediction of GPCRs. The focus of this thesis is to propose efficient and accurate prediction techniques for the classification of GPCRs; once a GPCR sequence is classified, it can be considered as a target for the relevant drugs. We have divided the thesis into two parts: alignment independent classification and alignment dependent classification of GPCRs. Alignment dependent classification is more accurate than alignment independent classification because it also includes structural information. In addition, alignment dependent classification can highlight important regions in a GPCR sequence. However, it is very complex and computationally expensive.
7.1.ALIGNMENT INDEPENDENT CLASSIFICATION
Chapters 3, 4, and 5 explain three alignment independent classification techniques. The GPCR classification presented in chapter 3 depends mainly on physiochemical properties and on a hybrid combination of spatial- and transform-domain feature extraction strategies. In the spatial domain, we have used PseAA as the feature extraction strategy, and in the transform domain, we have used MSE based feature extraction. Unlike conventional amino acid composition based feature extraction, pseudo amino acid composition also accounts for the order and length of the sequence. We have used D8354 as the primary dataset. D8354 contains GPCRs belonging to three levels, i.e. the family, sub family, and sub-sub family levels; therefore, we have extracted features at each level with the aforementioned feature extraction strategies. For classification, we have used SVM, NN, and PNN classifiers. At each GPCR level, we have chosen the best combination of feature extraction strategy and classification algorithm to annotate unknown GPCR sequences. We have tested the method on three other datasets, where our approach performed better than the existing methods. The improvement is due to the employment of appropriate physiochemical properties and the hybrid combination of spatial- and transform-domain feature extraction strategies. In addition, SVM has proved to be robust against the curse of dimensionality.
Chapter 4 proposes grey incidence degree (GID) based classification instead of Euclidean distance based classification. We have used three feature extraction strategies, i.e. FFT, PseAA, and SAAC. FFT extracts features in the transform domain, while PseAA and SAAC extract features in the spatial domain. To avoid the curse of dimensionality, we have reduced the features using PCA. We have observed that the hybrid combination of FFT, PseAA, and SAAC based features can improve the overall performance of GPCR prediction. The GID based method has efficiently analyzed the numerical relationships between the various quaternary structures of GPCRs. A GPCR sequence can have a certain level of similarity to one family and a different level of similarity to another family. The division of GPCRs into families, sub families, and sub-sub families is partial, and GID based classification is useful for such partial systems.
In chapter 5, we have proposed an ensemble classification in which the weights are optimized using a genetic algorithm. We have employed a hybrid combination of PseAA and MSE for feature extraction. We have also focused on evolutionary information based feature extraction using position specific scoring matrices. The hybrid combination of evolutionary information based feature extraction with PseAA or MSE can further improve the overall performance of the method, and the employment of evolutionary information in the features has indeed further improved its classification performance. However, the evolutionary information based method that we have proposed is time consuming and hence useful only for small datasets.
7.2.ALIGNMENT DEPENDENT CLASSIFICATION
We have explained different types of sequence alignments in chapter 2. Sequence alignment is useful in understanding the relationships between different sequences or families, and it highlights the conserved regions in a family. GPCRs have transmembrane helical structures. We have analyzed and aligned the 7 transmembrane helical structures of Rhodopsin-like GPCRs and have also proposed a general form for each of the 7 transmembrane helices. We have developed a 7TM detection algorithm, and a pseudo count based PSSM is computed for each transmembrane block. The pseudo count based PSSM can be used to score a TM region and can also help to identify a particular TM region. Pseudo counts help in assigning a score to any amino acid that is absent from the transmembrane region of a particular family or sequence. Unknown receptors can also be identified using pseudo count based PSSMs. Alignment dependent classification utilizes the structural information of the dataset and hence can be more accurate than alignment independent classification. Pseudo count based PSSMs also show the various relationships of each family to its PSSM and can find similarities between various families. The sub families can be further defined using pseudo count based PSSMs.
7.3.FUTURE DIRECTIONS
There are numerous possible future directions of this study. First, the alignment independent classification methods proposed in this thesis can also be used for subcellular localization prediction, membrane protein classification, and mitochondrial protein classification. Second, the physiochemical properties based method can be improved further by employing more appropriate physiochemical properties. Third, PCA can be replaced with a better feature reduction algorithm, or a new feature reduction algorithm suitable for GPCRs can be proposed. Fourth, the alignment dependent method discussed in this thesis can be applied to proteins of other cell parts by identifying their conserved regions. Fifth, 3D structures of GPCRs can be predicted by combining the internal properties of a protein with the properties of its membrane environment, and further by using its backbone coordinates and adding the appropriate side chains of each GPCR. Finally, improvements can be made to BLOSUM by considering different sequence similarity levels depending on the available sequence data for a particular protein family.
8. REFERENCES
Blocks WWW Server. (n.d.). Retrieved from http://blocks.fhcrc.org/
Afridi, T., Khan, A., & Lee, Y. (2012). Mito-GSAAC: Mitochondria Prediction using Genetic Ensemble Classifier
and Split Amino Acid Composition. Amino Acids , 1443-1454.
Altschul, S. (1991). Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol., 219 ,
555-565.
Bailey, T., Williams, N., Misleh, C., & Li, W. (2006). MEME: discovering and analyzing DNA and protein
sequence motifs. Nucleic Acids Res , W369-373.
Baum, L. E., & Petrie, T. (1966). Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The
Annals of Mathematical Statistics , 1554–1563.
Ben, G., Shani, A., Gohr, A., Grau, J., Arviv, S., Shmilovici, A., et al. (2005). Identification of Transcription Factor
Binding Sites with Variable-order Bayesian Networks. Bioinformatics , 2657–2666.
Bhasin, M., & Raghava, G. (2005). GPCRsclass: a web tool for the classification of amine type of G- protein
coupled receptors. Nucleic Acids , 143-147.
Bhasin, M., & Raghava, G. P. (2004). GPCRpred: an SVM-based method for prediction of families and sub-families
of G protein-coupled receptors. Nucleic Acids Res. , 383-389.
Bissantz, C., Logean, A., & Rognan, D. (2004). High-Throughput Modeling of Human G-Protein Coupled
Receptors: Amino Acid Sequence Alignment, Three-Dimensional Model Building, and Receptor Library Screening.
J. Chem. Inf. Comput. Sci., 44 , 1162-1176.
Cardoso, J., Pinto, V., Vieira, F., Clark, M., & Power, D. (2006). Evolution of secretin family GPCR members in the
metazoa. BMC Evolutionary Biology , 6:108.
Chen, Z., Alcayaga, C., Suarez-Isla, B., O'Rourke, B., Tomaselli, G., & Marban, E. (2002). A “minimal” sodium
channel construct consisting of ligated S5-P-S6 segments forms a toxin-activatable ionophore. J Biol Chem.,277 ,
24653–24658.
Chou, K. (2004). Insights from modelling the 3D structure of the extracellular domain of alpha7 nicotinic
acetylcholine receptor. Biochem Biophys Res Commun, 319 , 433–438.
Chou, K. (2005). Prediction of G-protein-coupled receptor classes. J Proteome Res ,4 , 1413-1418.
Chou, K. (2001). Prediction of protein cellular attributes using pseudo amino acid composition. PROTEINS: Structure, Function, and Genetics, 43, 246-255.
Chou, K., & Elrod, D. (2002). Bioinformatical analysis of G-protein-coupled receptors. J Proteome Res 1 , 429-433.
Chou, K., & Shen, H. (2010). A new method for predicting the subcellular localization of eukaryotic proteins with
both single and multiple sites: Euk-mPLoc 2.0 . Plos One , doi:10.1371/journal.pone.0009931.t002.
Chou, K., & Shen, H. (2006). Hum-PLoc: a novel ensemble classifier for predicting human protein Subcellular
localization. Biochem Biophys Res Commun., 347 , 150–157.
Chou, K., & Shen, H. (2006). Predicting eukaryotic protein subcellular location by fusing optimized evidence-
theoretic K-nearest neighbor classifiers. J Proteome Res., 5 , 1888–1897.
Cosic, I. (1994). Macromolecular bioactivity: is it resonant interaction between macromolecules?-Theory and
applications. IEEE Trans Biomed Eng, 41 , 1101–1114.
Das, S., & Banker, G. ( 2006). The role of protein interaction motifs in regulating the polarity and clustering of the
metabotropic glutamate receptor mGluR1a. Journal of Neuroscience , 8115–8125.
Davies, M. (n.d.). BIAS-PROFS. Retrieved from http://www.cs.kent.ac.uk/projects/biasprofs/
Davies, M., Secker, A., Freitas, A., Mendao, M., Timmis, J., & Flower, D. (2007). On the Hierarchical classification
of G-Protein coupled receptors. Bioinformatics, 23, 3113-3118.
Dayhoff, M., Schwartz, R., & Orcutt, B. (1978). A model of evolutionary change in proteins. Atlas of protein
sequence and structure , 345–358.
Deng, J. (1982). Control problems of grey systems. Syst Control Lett., 1(5) , 288–294.
Doyle, D., Morais, C., Pfuetzner, R., Kuo, A., Gulbis, J., Cohen, S., et al. (1998). The structure of the potassium
channel: molecular basis of K+ conduction and selectivity. Science , 280 , 69–77.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins
and nucleic acids. Cambridge University Press.
Edgar, R. (2004 ). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids
Research , 1792–1797.
Elrod, D., & Chou, K. (2002). A study on the correlation of G-protein-coupled receptor types with amino acid
composition. Protein Eng Des Sel, 15 , 713-715.
ENSEMBL. (n.d.). Retrieved from http://www.ensembl.org/index.html
Fauchere, J., & Pliska, V. (1983). Hydrophobic parameters of amino acid side chains from the partitioning of N-
acetyl-amino acid amides. Eur. J. Med. Chem.-Chim. Ther., 18 , 369–375.
Foord, S., Jupe, S., & Holbrook, J. (2002). Bioinformatics and type II G-protein-coupled receptors. Biochemical
Society Transactions , 473–479.
Fredriksson, R., Lagerström, M. C., Lundin, l. G., & Schiöth, H. B. (2003). The G-protein-coupled receptors in the
human genome form five main families. Phylogenetic analysis, paralogon groups, and fingerprints. Molecular
Pharmacology , 1256–1272.
Fridmanis, D., Fredriksson, R., Kapa, I., Helgi, B., & Klovins, J. (2006). Formation of new genes explains lower
intron density in mammalian Rhodopsin G protein-coupled receptors. Molecular Phylogenetics and Evolution , 864–
880.
Stormo, G. D. (1990). Consensus patterns in DNA. Methods Enzymol., 183, 211-221.
Gao, Q., & Wang, Z. (2006). Classification of G protein-coupled receptors at four levels. Protein Eng Des Sel., 19 ,
511-516.
Gao, Q., Wu, C., Ma, X., Lu, J., & He, J. (2008). Classification of amine type G-protein coupled receptors with
feature selection. Protein Pept Lett., 15 , 834-842.
George, S., O'Dowd, B., & Lee, S. (2002). G-Protein Coupled Receptor oligomerization and its potential for drug discovery. Nature Reviews Drug Discovery, 1, 808-820.
GPCRDB. (2012). Retrieved from http://www.gpcr.org/7tm/
Grantham, R. (1974). Amino acid difference formular to help explain protein evolution. Science, 185 , 862–864.
Grasso, C., & Lee, C. (2004). Combining partial order alignment and progressive multiple sequence alignment
increases alignment speed and scalability to very large alignment problems . Bioinformatics , 1546–1556.
Guo, Y., Li, M., Wang, K., Wen, Z., Lu, M., Liu, L., et al. (2005). Fast Fourier transform-based support vector
machine for prediction of G-protein coupled receptor subfamilies. Acta Biochim. Biophys. Sin. , 759–766.
Henikoff, J. G., & Henikoff, S. (1996). Using Substitution Probabilities To Improve Position-Specific Scoring
Matrices. Comput. Appl. Biosci., 12 , 135-143.
Henikoff, S., & Henikoff, J. (1992). Amino Acid Substitution Matrices from Protein Blocks. PNAS , 10915–10919.
Holland, J. H. (1992). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to
Biology, Control, and Artificial Intelligence. Cambridge: MA: MIT Press. ISBN 978-0262581110.
Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohen, F., & Vriend, G. (2003). GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res., 31(1), 294-297.
Howard, A. (2000). Elementary Linear Algebra. Wiley; 8 edition.
Huang, Y., Cai, J., Ji, L., & Li, Y. (2004). Classifying G-protein coupled receptors with bagging classification tree.
Comput Biol Chem., 28 , 275-280.
Hughey, R., & Krogh, A. (1996). Hidden Markov models for sequence analysis: extension and analysis of the basic
method. CABIOS , 95–107.
Inoue, Y., Yamazaki, Y., & Shimizu, T. (2005). How accurately can we discriminate G-protein-coupled receptors as
7-tms TM protein sequences from other sequences? Biochemical and Biophysical Research Communications ,
1542–1546.
Institute, E. B. (n.d.). EMBL-EBI. Retrieved from http://www.ebi.ac.uk/Tools/sss/psisearch/
Cornette, J. L., Cease, K. B., Margalit, H., Spouge, J. L., Berzofsky, J. A., & DeLisi, C. (1987). Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. Journal of Molecular Biology, 195, 659-685.
Javed, J., Khan, A., Majid, A., Mirza, A. M., & Bashir, J. (2007). Lattice Constant Prediction of orthorhombic
ABO3 Perovskites using Support Vector Machines. Computational Materials Science, 39 , 627-634.
Karchin, R., Karplus, K., & Haussler, D. (2002). Classifying G-protein coupled receptors with support vector
machines . Bioinformatics ,18, , 147-159.
Katoh, K., Misawa, K., Kuma, K., & Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence
alignment based on fast Fourier transform . Nucleic Acids Research , 3059–3066.
lib SVM. (n.d.). Retrieved from http://en.pudn.com/downloads136/sourcecode/math/detail580267_en.html
Liu, S., Fang, Z., & Lin, Y. (2005). A new definition for the degree of grey incidence. Sci Inq., 7(2) , 111–124.
Lundstrom, K. H., & Chiu, M. L. (2006). G- protein coupled receptors in drug discovery. CRC Press, Taylor &
Francis Group, Boca Raton, FL.
Mandell, A., Selz, K., & Shlesinger, M. (1997). Wavelet transformation of protein hydrophobicity sequences
suggests their memberships in structural families. Physica A, 244 , 254−262.
Martelli, P., Fariselli, P., Malaguti, L., & Casadio, R. (2002). Prediction of the disulfide bonding state of cysteines in
proteins with hidden neural networks. Protein Eng., 15 , 951-953.
Moereels, H., Lewi, P. J., Koymans, L. M., & Janssen, P. A. (1997). The alpha and omega of G-protein coupled receptors: a novel method for classification. Part 2. Bin classification. Annals of the New York Academy of Sciences, 147-148.
Möller, S., Vilo, J., & Croning, M. (2001). Prediction of the coupling specificity of G protein coupled receptors to
their Gproteins. Bioinformatics, 17, , 174-181.
Mount, D. (2004). Bioinformatics: Sequence and Genome Analysis (2nd edition). New York: Cold Spring Harbor Laboratory Press. ISBN 0-87969-608-7.
Nakagawa, T., Sakurai, T., Nishioka, T., & Touhara, K. (2005). Insect sex-pheromone signals mediated by specific
combinations of olfactory receptors. Science , 1638–1642.
Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino
acid sequence of two proteins. Journal of molecular biology , 443-453.
Notredame, C., Higgins, D., & Heringa, J. (2000). T-coffee: a novel method for fast and accurate multiple sequence
alignment. Journal of Molecular Biology , 205–217.
Oxenoid, K., & Chou, J. (2005). The structure of phospholamban pentamer reveals a channel-like architecture in
membranes. Proc Natl Acad Sci USA, 102 , 10870–10875.
Oxenoid, K., Rice, A., & Chou, J. (2007). Comparing the structure and dynamics of phospholamban pentamer in its
unphosphorylated and pseudo-phosphorylated states. Protein Sci.,16 , 1977–1983.
Papasaikas, P., Bagos, P., Litou, Z., & Hamodrakas, S. (2003). A novel method for GPCR recognition and family
classification from sequence alone using signatures derived from profile hidden Markov models. SAR and QSAR
Environmental Research, 14 , 413-420.
Peng, Z. L., Yang, J. Y., & Chen, X. (2010 ). An improved classification of G-protein-coupled receptors using
sequence-derived features, BMC Bioinformatics 11 (2010). BMC Bioinformatics , doi: 10.1186/1471-2105-11-420.
Prabhu, Y., & Eichinger, L. (2006). The Dictyostelium repertoire of seven transmembrane domain receptors.
European Journal of Cell Biology , 937–946.
Qiu, J., Huang, J., Liang, R., & Lu, X. (2009). Prediction of G-protein-coupled receptor classes based on the concept
of Chou's pseudo amino acid composition: an approach from discrete wavelet transform. Analytical Biochemistry,
390 , 68-73.
Rehman, Z. (2011). GPCR prediction. Retrieved from http://111.68.99.218/GPCR/default.aspx
Rehman, Z. u., Mirza, M. T., Khan, A., & Xhaard, H. (2013). Predicting G-protein-coupled receptors families using
different physiochemical properties and pseudo amino acid composition. Methods Enzymology , 61-79.
Rehman, Z., & Khan, A. (2011). G-protein-coupled receptor prediction using pseudo-amino-acid composition and
multiscale energy representation of different physiochemical properties. Analytical Biochemistry, 412(2), 173-182.
Rehman, Z., & Khan, A. (2012). Identifying GPCRs and their types with Chou's pseudo amino acid composition: an
approach from multi-scale energy representation and position specific scoring matrix. Protein Pept Lett., 19(8) ,
890-903.
Rehman, Z., & Khan, A. (2011). Prediction of GPCRs with pseudo amino acid composition: employing composite
features and grey incidence degree based classification. Protein Pept Lett.,18(9) , 872-878.
Richard, M. (1992). Maximum likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull. Math. Biol., 54, 59-75.
Salam, A.-K. (2012). The 20 Amino Acids - Protein Structure and Structural Bioinformatics. Retrieved from
http://www.proteinstructures.com/Structure/Structure/amino-acids.html
Schaffer, A., Aravind, L., Madden, T., Shavirin, S., Spouge, J., Wolf, Y., et al. (2001). Improving the accuracy of
PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids ,
2994-3005.
Sean, R. E. (2004). Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology, 22, 1035-1036.
Shi, J., Zhang, S., Pan, Q., Cheng, Y., & Xie, J. (2007). Prediction of protein subcellular localization by support
vector machines using multi-scale energy and pseudo amino acid composition. Amino Acids, 33 , 69–74.
Smith, T. F., & Waterman, M. S. (1981). Identification of Common Molecular Subsequences. Journal of Molecular
Biology , 195–197.
Specht, D. (1990). Probabilistic neural networks. Neural Networks, 3, 109-118.
Subramanian, A., Weyer-Menkhoff, J., Kaufmann, M., & Morgenstern, B. (2005). DIALIGN-T: An improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics, 6, 66.
T-Coffee. (n.d.). Retrieved from http://www.ebi.ac.uk/Tools/msa/tcoffee/
Thompson, J., Higgins, D., & Gibson, T. (1994). CLUSTAL W: improving the sensitivity of progressive multiple
sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic
Acids Res , 4673–4680.
Tsai, L., Liou, H., & Jiang, G. (2005). Application of grey relational analysis to the influential factors on natural
frequencies of helical springs. J Grey Syst., 8(2) , 141–156.
Wheeler, D., Barrett, T., Benson, D., & et.al. (2007). Database resources of the national center for biotechnology
information. Nucleic Acids Res., 35 , D5–D12.
Xiao, X., Wang, P., & Chou, K. (2009). GPCR-CA: A cellular automaton image approach for predicting G-protein-
coupled receptor functional classes. J Comput Chem, 30 , 1413-1423.