

A graph theoretic model to understand the behavioral difference of PPCA among its paralogs towards recognition of DXCA

SHANKAR K GHOSH1, SUVANKAR GHOSH2, GOUTAM PAUL3* and RAJA BANERJEE2*

1 Advanced Computing and Microelectronics Unit, Indian Statistical Institute, Kolkata 700 108, India
2 Maulana Abul Kalam Azad University of Technology, West Bengal (Formerly WBUT), Kolkata 700 064, India
3 Cryptology and Security Research Unit, R. C. Bose Centre for Cryptology and Security, Indian Statistical Institute, Kolkata 700 108, India

*Corresponding authors (Emails, [email protected]; [email protected])

MS received 10 August 2020; accepted 13 January 2021

Among all the proteins of the Periplasmic C type Cytochrome family obtained from cytochrome C7 found in Geobacter sulfurreducens, only the Periplasmic C type Cytochrome A (PPCA) protein can recognize deoxycholate (DXCA), while its other paralogs cannot, as observed from the crystal structures. Though some existing works have used graph-theoretic approaches to realize the 3-D structural properties of proteins, their use in rationalising the physicochemical behavior of proteins has been very limited. To understand the driving force behind the recognition of DXCA exclusively by PPCA among its paralogs, in this work we propose two graph theoretic models based on two combinatorial properties, namely base-pair type and impact, of the nucleotide bases and the amino acid residues, respectively. Combinatorial analysis of the binding sequences using the proposed base-pair type based graph theoretic model reveals the differential behaviour of PPCA among its paralogs. Further, to investigate the underlying chemical phenomenon, another graph theoretic model has been developed based on impact. Analysis of the results obtained from the impact-based model clearly points towards the helix formation of PPCA, which is essential for the recognition of DXCA, making PPCA a completely different entity from its paralogs.

Keywords. PPCA; DXCA; conformational analysis; base-pair type; impact; graph-theoretical modeling

1. Introduction

Geobacter sulfurreducens, one of the predominant metal and sulfur reducing bacteria (Bond and Lovley 2003), found below the surface of the earth, is a comma-shaped, gram-negative, anaerobic bacterium. This organism can act as an electron donor and participate in redox reactions (Caccavo et al. 1994). This ability can be used to increase the effectiveness of microbial fuel cells (Caccavo et al. 1994). Geobacter sulfurreducens encodes over 100 C-type cytochromes, several of which have important roles in the respiration of this organism under various conditions (Caccavo et al. 2007; Shi et al. 2007). The Periplasmic C type Cytochrome A (PPCA) family proteins belong to the cytochrome C7 family found in Geobacter sulfurreducens and are used for the reduction of Fe(III) (Lloyd et al. 2003). Besides this, PPCA can interact with deoxycholate (DXCA), also known as deoxycholic acid. Deoxycholic acid is one of the secondary bile acids, which are metabolic byproducts of intestinal bacteria and are used in the medicinal field to emulsify fats for absorption in the intestine (LaRusso et al. 1977). DXCA is also used in research as a mild detergent for the isolation of membrane associated proteins (Burgess and Deutscher 2009). From the observed crystal structures it is striking that, among the Periplasmic C type Cytochrome A family proteins found in Geobacter sulfurreducens, only PPCA can interact with deoxycholate (DXCA), while its other paralogs, namely PPCB, PPCC, PPCD and PPCE, cannot, although they have high sequence identity with PPCA (Pokkuluri et al. 2010, 2011). Moreover, residue numbers 4, 29, 37, 38, 41, 45 and 50 of PPCA are utilized for the interaction with DXCA (Pokkuluri et al. 2011; Morgado et al. 2012). It is therefore worthwhile to identify the underlying reason why paralogous proteins with high sequence similarity differ so sharply in recognizing a single compound.

To understand the behavioural differences among proteins, various graph theoretic models have been used extensively in the field of proteomics (Vishveshwara et al. 2002; Artymiuk et al. 2005). A state-of-the-art survey of graph-theoretic analysis of protein structures is presented, and various protein graph construction algorithms are analysed, in Vishveshwara et al. (2002). In such graphs, the notion of nodes and edges changes depending on the motivation for constructing the graph.

As an example, in Grindley and Willet (1994), to perform fold and pattern identification through graph theoretic modelling, secondary structure types such as α-helix and β-strand were used to construct the set of nodes, and spatial closeness between these secondary structures was used to construct the set of edges. Using the same set of vertices and edges, in a different work, a dynamic matrix construction based on graph topology was used for the testing of folding rules. Although these kinds of models can realize various aspects of the three-dimensional structures of proteins efficiently, the implications they offer regarding biological properties are quite limited. In Fiser et al. (2000), the authors proposed an automated modelling technique using an energy function to improve the accuracy of loop predictions in protein structures; however, they did not investigate whether the chemical properties of proteins can be realized from their loop prediction technique. Adrian et al. (2003) developed an algorithm that uses results from graph theory to solve the combinatorial problem encountered in side-chain prediction, but its applicability to protein–protein interaction has not been clarified. In Artymiuk et al. (2005), the authors used the Bron-Kerbosch clique detection algorithm and the Ullmann subgraph isomorphism algorithm to identify structural relationships between biological macromolecules. However, no further conclusions regarding the physicochemical properties of proteins were drawn.

A closer look at the existing graph-theoretic models reveals that there are two principal objectives for developing such models of a particular protein. The first is to gain insight into the structural details of the protein, which are largely governed by non-covalent interactions (Vishveshwara et al. 2002). Here, the structural details are explored based on analyses of graph topology (Vishveshwara et al. 2002; Fiser et al. 2000). The second is to explore the physicochemical properties of proteins with known structures, e.g., the clustering of specific types of amino acids, which is important for structure, folding and function (Vishveshwara et al. 2002; Fiser et al. 2000; Adrian et al. 2003). In this phase, biochemical properties are realized from the graph-theoretic representations of the proteins based on some predefined combinatorial or statistical parameters (Fiser et al. 2000; Adrian et al. 2003). To the best of the authors' knowledge, the former aspect of graph theoretic models has drawn much more attention from the research community than the latter, which deals with the structure-function relationship. The effectiveness of the latter aspect relies on designing effective combinatorial parameters. Although a large number of protein structures have been analysed by X-ray crystallography (Vishveshwara et al. 2002), designing automatic methods to analyze the biological properties of proteins is still challenging.

Towards this end, in this work we develop a graph-theoretic model based on combinatorial properties of the primary sequences of the proteins. The information obtained from the crystal structures on the interactions of PPCA paralogs with DXCA is taken as the basis of the proposed graph-theoretic model for exploring the behavioral difference among PPCA paralogs towards DXCA. This helps not only to rationalize the physicochemical significance of the interaction but also to validate the structure based recognition of ligands. We first introduce the concept of a binding sequence for PPCA family proteins. In our analysis, the binding sequence of a protein is the contiguous sequence of the amino acid residues at the binding locations of that protein, i.e., the residues that interact with the DXCA molecule. The binding sequences of the respective paralogous proteins are compared using their sequence alignments. We consider two combinatorial properties of the binding sequences: base-pair type (b-type) and impact. The b-type of a nucleotide base represents the nature of the base pair it forms, in terms of hydrogen bonding. For example, Adenine (A) forms a base pair with Thymine (T) through two hydrogen bonds; hence A and T have the same b-type value. The impact of an amino acid intuitively represents the similarity of the b-types of the first two nucleotide bases of the triplet codon representing that amino acid. Our graph theoretic analysis of the binding sequences based on b-type clearly demonstrates the differential nature of PPCA among its paralogs. To establish the basic motif behind such a difference between the paralogous proteins in recognizing a single compound, a graph-theoretic model based on impact is also proposed. Analysis of the results obtained from the impact-based model clearly points towards the helix formation of PPCA, which is essential for the recognition of DXCA.

One may ask why the genomic sequence is used for the analyses when information about the protein sequence and structure is available. To our understanding, such a route map from DNA to protein using the combinatorial approach addresses the problem from the root, as all the biological information that passes to a protein for its function is primarily embedded in DNA, the blueprint of life (Leavitt and Sarah 1961; Larkin et al. 2007). As an example, in Tuqan and Rushdi (2008) a Digital Signal Processing (DSP)-based model was developed to explain the underlying mechanism of the period-3 component. There, the DNA-based model was used to characterize the DNA spectrum by a set of numerical sequences termed filtered polyphase sequences. Such sequence/structural analysis of protein(s) starting from the DNA sequence helps to justify the behavioural difference of PPCA from its paralogs, which adds value to the basic biology (Myers 2015).

Electronic supplementary material: The online version contains supplementary material available at https://doi.org/10.1007/s12038-021-00144-8.

A preliminary version of this paper (Ghosh et al., 2015) was archived on 17 September, 2015 (arXiv:1608.09009).

http://www.ias.ac.in/jbiosci

J Biosci (2021) 46:35 © Indian Academy of Sciences DOI: 10.1007/s12038-021-00144-8

2. Preliminaries from graph theory

In this section, some graph theoretic terminologies (West 2001; Bondy and Murty 2006) are introduced, which are used in the subsequent combinatorial analyses.

• A graph $G = (V, E)$ consists of a set of vertices $V = \{v_1, v_2, \ldots, v_n\}$ and a set of edges $E = \{e_1, e_2, \ldots, e_m\}$ such that each edge $e_k$, $k \in \{1, 2, 3, \ldots, m\}$, is identified with a pair of vertices $\{v_i, v_j\}$ called its endpoints. In a graph, the vertex set $V$ is necessarily non-empty. $G$ is called a directed graph if the endpoints form an ordered pair of vertices, written as $(v_i, v_j)$. For example, the directed graph shown in A-(i) (figure 1) is defined as $(V_1, E_1)$, where $V_1 = \{a, b, c\}$ and $E_1 = \{(a, c), (c, b), (c, b), (b, a), (a, a)\}$. On the other hand, the graph shown in A-(ii) (figure 1) is defined as $(V_2, E_2)$, where $V_2 = \{a, b, c\}$ and $E_2 = \{\{a, c\}, \{b, c\}, \{b, c\}, \{b, a\}, \{a, a\}\}$. It should be noted that the order of vertices does not matter in the definition of the graph shown in A-(ii) (figure 1).

• An edge having identical end vertices is called a loop. For example, $e_3$ is a self loop in the graph shown in A-(i) (figure 1). If more than one edge is associated with a given pair of vertices, the edges are referred to as parallel edges. For example, $e_1$ and $e_2$ are parallel edges in the graph shown in A (figure 1).

• A graph is said to be connected if there exists at least one path between every pair of vertices; otherwise the graph is called disconnected. A disconnected graph consists of more than one connected subgraph, each of which is called a connected component. For example, the graph shown in B (figure 1) consists of only one connected component, whereas the graph shown in C (figure 1) consists of two connected components X and Y: there is no path between any vertex of component X (e.g. vertex c) and any vertex of component Y (e.g. vertex e).

• A walk is a list $v_0, e_1, v_1, \ldots, e_k, v_k$ of vertices and edges such that, for $1 \le i \le k$, the edge $e_i$ has endpoints $v_{i-1}$ and $v_i$. In the graph shown in C (figure 1), $a - e_4 - c - e_2 - b - e_9 - p$ is a walk.

• A closed walk is a walk that begins and ends at the same vertex. A closed walk in which no vertex appears more than once (except the initial and final vertex) is called a cycle. In the graph shown in A-(i) (figure 1), $a - e_4 - c - e_2 - b - e_5 - a$ is a cycle.
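The definitions above can be made concrete with a small sketch. It is only an illustration, not part of the original method: the edge list `E1` is the one given for the directed graph A-(i), while the helper names (`loops`, `parallel_edges`, `connected_components`) are hypothetical names chosen here.

```python
from collections import defaultdict

# Directed multigraph from example A-(i): edges are ordered pairs and
# repeated pairs (parallel edges) are allowed, so a list is used, not a set.
V1 = {"a", "b", "c"}
E1 = [("a", "c"), ("c", "b"), ("c", "b"), ("b", "a"), ("a", "a")]


def loops(edges):
    """Return the edges whose two endpoints coincide."""
    return [edge for edge in edges if edge[0] == edge[1]]


def parallel_edges(edges, directed=True):
    """Return endpoint pairs that are used by more than one edge."""
    counts = defaultdict(int)
    for u, v in edges:
        # For an undirected graph the order of the endpoints is irrelevant.
        key = (u, v) if directed else frozenset((u, v))
        counts[key] += 1
    return {key: n for key, n in counts.items() if n > 1}


def connected_components(vertices, edges):
    """Return the connected components, ignoring edge direction."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in sorted(vertices):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components
```

For `E1`, `loops` returns the single edge `("a", "a")`, `parallel_edges` reports the repeated pair `("c", "b")`, and `connected_components(V1, E1)` yields one component, so the graph is connected.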

3. Methods

The primary sequences and the 3-D coordinate files of the PPCA protein and its paralogs (PPCB, PPCC, PPCD and PPCE) have been obtained from the Protein Data Bank (http://www.rcsb.org). The corresponding PDB IDs of the respective proteins are: 1OS6: PPCA; 3BXU: PPCB; 3H33: PPCC; 3H4N: PPCD and 3H34: PPCE.


3.1 Sequence and structural alignments

The sequence identity of each of the PPCA paralogs with respect to PPCA is obtained by pairwise alignment of sequences using the sequence alignment tool CLUSTAL W (Maiti et al. 2004). Further, alignments of the 3D structure of each paralogous protein with respect to PPCA have been performed using the SuperPose server (Kabsch and Sander 1983), which also helps to cross-check the sequence alignments and provides RMSD statistics and difference distance plots, along with interactive images of the superimposed structures.

3.2 Conformational analysis of PPCA family

Conformational analysis of the PPCA paralogs has been carried out using the DSSP program (West 2001) on the respective PDB entries. This gives a detailed overview of the secondary structure, including the main chain torsion angles (φ, ψ), the virtual bond angle (κ), the virtual torsion angle (α) and the nature of the hydrogen bonds between atoms.
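As a rough illustration of how the (φ, ψ) pairs reported by DSSP can be read, the sketch below maps a pair of torsion angles to a coarse backbone region. The function name `backbone_region` and, in particular, the numeric region boundaries are assumptions made for this example; they are not taken from the paper or from DSSP, which assigns its pattern letters from hydrogen-bond geometry rather than from angle boxes.

```python
def backbone_region(phi, psi):
    """Label a (phi, psi) pair, in degrees, with a coarse backbone region.

    The rectangular boundaries are illustrative assumptions only.
    """
    # Right-handed helical region (alpha and 3-10 helices fall here).
    if -160.0 <= phi <= -20.0 and -120.0 <= psi <= 50.0:
        return "helical"
    # Polyproline II (PPII)-like region.
    if -110.0 <= phi <= -45.0 and 90.0 <= psi <= 180.0:
        return "PPII-like"
    # Extended (beta-strand-like) region.
    if -180.0 <= phi < -110.0 and 90.0 <= psi <= 180.0:
        return "extended"
    return "other"
```

With the values listed in table 2 for position 37, `backbone_region(-75.8, -23.9)` (PPCA) gives `"helical"` and `backbone_region(-63.5, 136.5)` (PPCB) gives `"PPII-like"`, in line with the contrast discussed in the secondary structure analysis.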

4. Proposed combinatorial parameters

In this section, we propose two combinatorial parameters, namely the base-pair type (b-type) of a nucleotide base and the impact of an amino acid.

4.1 Base-pair type (b-type) of a nucleotide base

Considering the characteristics of base pairs determined by the hydrogen bonding nature of the nucleotide bases, two categories of base-pair types, shortened as b-type, are defined for nucleotide bases. Adenine (A) forms a base pair with Thymine (T) through two hydrogen bonds, whereas Guanine (G) forms a base pair with Cytosine (C) through three hydrogen bonds. In protein biosynthesis, the translation process (in the central dogma) starts from single stranded RNA, where pairing between A-T or G-C is redundant because such interaction is only possible in a DNA duplex. So, without loss of generality, we assign b-type value 1 to both A and U/T (2 hydrogen bonds) and b-type value 0 to both C and G (3 hydrogen bonds).

Although the b-type captures the hydrogen bonding nature of A, U/T, C and G (in RNA, T is replaced by U), it cannot uniquely identify the four nucleotide bases. Information-theoretically (Cover and Thomas 2006), to uniquely recognize 4 nucleotide bases, $\log_2 4 = 2$ binary symbols are needed. In general, one could assign any permutation of the strings 00, 01, 10, 11 to the symbols A, U/T, C and G, which generates $4! = 24$ possible encodings. Here, the choice of encoding scheme is crucial, as the entire analysis depends on it. The encoding scheme should be mathematically consistent and biologically relevant. To identify a consistent encoding scheme, we have performed a systematic analysis of all possible encodings based on the Hamming distance, a well-known metric for combinatorial analysis. This analysis also takes the b-type values of the nucleotide bases into account. Finally, we correlate the resulting encoding with the purine-pyrimidine content of the nucleotide bases.

Figure 1. Illustration of some graph theoretic preliminaries.

Given two strings of equal length over a set of symbols, the Hamming distance (Anfinsen 1973) between them is the number of positions in which they differ. For example, the Hamming distance between the strings 0101 and 0011 is 2, as the first and the last bits are identical in both strings, while each of the two bits in the middle differs. Thus, in some sense, it measures the similarity between the two strings: the larger the Hamming distance, the lower the similarity. This similarity is only at the level of the bit-string representation and need not correlate with similarity between nucleotide bases.

It can be shown that the Hamming distance [denoted by the function $d(\cdot, \cdot)$] satisfies the definition of a metric, i.e., for any strings $p$, $q$ and $r$ of equal length:

$d(p, q) \ge 0$ (non-negativity),
$d(p, q) = 0$ if and only if $p = q$ (coincidence),
$d(p, q) = d(q, p)$ (symmetry), and
$d(p, q) \le d(p, r) + d(q, r)$ (triangle inequality).
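The Hamming distance and its metric properties can be checked directly on the four 2-bit strings used for the nucleotide encodings. The following is a minimal sketch; the names `hamming`, `CODES` and `is_metric_on` are chosen here for illustration.

```python
from itertools import product


def hamming(p, q):
    """Number of positions in which two equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(p, q))


# The four 2-bit strings: "00", "01", "10", "11".
CODES = ["".join(bits) for bits in product("01", repeat=2)]


def is_metric_on(codes):
    """Exhaustively check the four metric axioms on a finite set of strings."""
    for p, q, r in product(codes, repeat=3):
        if hamming(p, q) < 0:                               # non-negativity
            return False
        if (hamming(p, q) == 0) != (p == q):                # coincidence
            return False
        if hamming(p, q) != hamming(q, p):                  # symmetry
            return False
        if hamming(p, q) > hamming(p, r) + hamming(q, r):   # triangle inequality
            return False
    return True
```

Here `hamming("0101", "0011")` evaluates to 2, matching the example in the text, and `is_metric_on(CODES)` confirms the axioms on the 2-bit strings by brute force (a check on this finite set, not a general proof).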

In this context, any two distinct strings in the above representation have a Hamming distance of 1 or 2. For simplicity in the further analysis, we define intra-b-type and inter-b-type distances as follows. For a particular encoding, the intra-b-type distance is the Hamming distance between the two strings representing nucleotides of the same b-type value. For example, for the encoding A ↔ 00, T ↔ 01, C ↔ 10, G ↔ 11 (the b-type of A/T is 1 while that of C/G is 0), the intra-b-type distance is $d(00, 01) = d(10, 11) = 1$. On the other hand, for the encoding A ↔ 00, T ↔ 11, C ↔ 01, G ↔ 10, the intra-b-type distance is $d(00, 11) = d(01, 10) = 2$.

It is to be noted that the Hamming distance between two strings does not change if both strings are bit-wise complemented. If the two strings assigned to one b-type differ in a single bit, the two remaining strings are exactly their bitwise complements and hence also differ in a single bit; if they differ in both bits, the two remaining strings are likewise complements of each other. Therefore the intra-b-type distance is the same for both b-types, i.e., it is unique for a particular encoding.

On the other hand, the inter-b-type distance is defined as the Hamming distance between two strings representing nucleotides of different b-type values. In the first example above, the inter-b-type distance can be $d(00, 10) = d(01, 11) = 1$ or $d(00, 11) = d(01, 10) = 2$. In contrast, in the second example, the inter-b-type distance can only be $d(00, 01) = d(00, 10) = d(11, 01) = d(11, 10) = 1$. It is worth noting that, unlike the intra-b-type distance, the inter-b-type distance for a particular encoding is not necessarily unique. Thus, considering the similarity of b-types, there are two possibilities: (A) an intra-b-type distance of 1, or (B) an intra-b-type distance of 2.

As shown in table 1, out of the 24 possible encodings, 16 belong to category A and 8 belong to category B. It may be noted that the encodings of category A have both 1 and 2 as inter-b-type distances, thereby leading to ambiguity. On the contrary, every encoding of category B has an inter-b-type distance of 1 only, so that both the inter-b-type and the intra-b-type distances are unique and uniform. For this reason, we adopt encoding category B. Further, it has been found that the target biochemical properties can be linked to the combinatorial properties of the category B encodings mentioned above, and any one of the 8 possible encodings leads to the same results, as one would expect. Hence, the category B encoding scheme is mathematically adequate for carrying out the necessary combinatorial analyses.

Now, one can debate the biological relevance of category B encoding. It may be noted that, in a category B encoding, the complementary base pairs (each consisting of one purine and one pyrimidine base) always get complementary bit-string representations. For example, in the encoding (A ↔ 00, T ↔ 11, C ↔ 01, G ↔ 10), A (a purine base) and T (a pyrimidine base) get the complementary bit-strings 00 and 11, respectively, while C (a pyrimidine base) and G (a purine base) get the complementary bit-strings 01 and 10, respectively. It is well known that pyrimidine-pyrimidine pairings do not occur, as they cannot form stable hydrogen bonds. On the other hand, purine-purine links do not form because these bases are too large to fit in the space. Thus the biological combination is favoured by geometrically dissimilar bases, which corresponds to and correlates with our Hamming-distance-wise dissimilar encodings. Hence, the category B encoding is biologically sound. Therefore, without loss of generality, in this study we fix the encoding A ↔ 00, T ↔ 11, C ↔ 01, G ↔ 10.
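The split of the 24 encodings into categories A and B can be reproduced by brute-force enumeration. The sketch below is an illustration under the definitions given in this section; the function names are hypothetical and the code is not the authors' implementation.

```python
from itertools import permutations

BASES = ("A", "T", "C", "G")
CODES = ("00", "01", "10", "11")


def hamming(p, q):
    """Number of positions in which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))


def all_encodings():
    """All 4! = 24 assignments of the 2-bit strings to A, T, C and G."""
    return [dict(zip(BASES, perm)) for perm in permutations(CODES)]


def intra_distances(enc):
    """Distances within each b-type: the pair (A, T) and the pair (C, G)."""
    return {hamming(enc["A"], enc["T"]), hamming(enc["C"], enc["G"])}


def inter_distances(enc):
    """Distances between bases of different b-type."""
    return {hamming(enc[x], enc[y]) for x in "AT" for y in "CG"}


def classify(encodings):
    """Split encodings into category A (intra = 1) and category B (intra = 2)."""
    category_a = [e for e in encodings if intra_distances(e) == {1}]
    category_b = [e for e in encodings if intra_distances(e) == {2}]
    return category_a, category_b
```

Calling `classify(all_encodings())` returns 16 encodings in category A and 8 in category B, as stated for table 1. Every category B encoding has `inter_distances` equal to `{1}`, whereas every category A encoding yields `{1, 2}`; the encoding fixed in this study (A ↔ 00, T ↔ 11, C ↔ 01, G ↔ 10) falls in category B.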

4.2 Impact of an amino acid

Any sequence XYZ of three nucleotide bases representing an amino acid can be considered as a sequence of 6 symbols $b_1 b_2 b_3 b_4 b_5 b_6$. Here $(b_1 b_2)$, $(b_3 b_4)$ and $(b_5 b_6)$ represent the encodings of X, Y and Z, respectively. Since the first two nucleotide bases mostly play the determining role in coding an amino acid (e.g., Glycine can be coded by GGU, GGC, GGA and GGG, where the first two nucleotide bases are GG), the first 4 symbols $b_1 b_2 b_3 b_4$ are used to define the impact of an amino acid. Given the encoding $b_1 b_2 b_3 b_4 b_5 b_6$ of an amino acid, the impact of that amino acid is represented as a pair $(I_1, I_2)$, where $I_1$ and $I_2$ are calculated as follows:

\[
I_1 = \frac{1}{2}\left[\mathrm{sym}(b_1, b_3) + \mathrm{sym}(b_2, b_4)\right], \qquad
I_2 = \frac{1}{2}\left[\mathrm{sym}(b_1, b_2) + \mathrm{sym}(b_3, b_4)\right], \tag{1}
\]

where the value of $\mathrm{sym}(b_i, b_j)$ is 0 if $b_i$ and $b_j$ are identical and 1 otherwise. To illustrate, for Glycine the encoding of the first two nucleotide bases, i.e., GG, is 1010; hence $I_1 = \frac{1}{2}[\mathrm{sym}(1, 1) + \mathrm{sym}(0, 0)] = 0$ and $I_2 = \frac{1}{2}[\mathrm{sym}(1, 0) + \mathrm{sym}(1, 0)] = \frac{1 + 1}{2} = 1$. Thus, the impact of Glycine is (0, 1). It may be noted that $I_1$ remains invariant if the assignment of symbols to the nucleotide bases is changed while keeping the intra-b-type distance 2. $I_2$ also remains invariant as long as A and T retain the strings 00 and 11; if A and T are instead assigned 01 and 10, $I_2$ is replaced by $1 - I_2$, so the distinction it draws between amino acids is preserved. Impact values of the amino acid residues forming the binding sequences of PPCA family proteins are shown in table 2.

At this point, one may ask: while the b-type of a nucleotide base is directly related to its hydrogen bonding nature, what is the relevance of the parameter impact to the chemical properties of amino acids? It may be noted that the impact of an amino acid represents the similarity of the b-types of the first two nucleotide bases of the triplet codon representing that amino acid. Both $I_1$ and $I_2$ are equal to $\frac{1}{2}$ when the first nucleotide base (X) and the second nucleotide base (Y) have different b-type values. On the other hand, $(I_1, I_2)$ may be (0, 0), (1, 0), (0, 1) or (1, 1) when X and Y have the same b-type value. In that case, $I_2$ identifies the b-type shared by X and Y ($I_2 = 0$ when both are A or U/T, of b-type 1, and $I_2 = 1$ when both are C or G, of b-type 0), whereas $I_1$ indicates whether X and Y are identical nucleotide bases: $I_1 = 0$ if X and Y are identical and $I_1 = 1$ if they are different.
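Equation (1) translates almost literally into code. The sketch below assumes the encoding fixed in section 4.1 (A ↔ 00, U/T ↔ 11, C ↔ 01, G ↔ 10); the names `ENCODING`, `sym` and `impact` are chosen here for illustration, and the example codons are standard genetic-code codons rather than data from the paper.

```python
# Encoding fixed in section 4.1; U is treated like T.
ENCODING = {"A": "00", "T": "11", "U": "11", "C": "01", "G": "10"}


def sym(x, y):
    """0 if the two symbols are identical, 1 otherwise."""
    return 0 if x == y else 1


def impact(codon, encoding=ENCODING):
    """Return (I1, I2) of equation (1) from the first two bases of a codon."""
    b1, b2 = encoding[codon[0]]  # bits of the first base X
    b3, b4 = encoding[codon[1]]  # bits of the second base Y
    i1 = (sym(b1, b3) + sym(b2, b4)) / 2
    i2 = (sym(b1, b2) + sym(b3, b4)) / 2
    return i1, i2
```

For a Glycine codon such as `"GGU"`, `impact` returns `(0.0, 1.0)`, which agrees with the worked example. A Lysine codon `"AAA"` gives `(0.0, 0.0)`, an Isoleucine codon `"AUU"` gives `(1.0, 0.0)`, and a Serine codon `"UCU"`, whose first two bases have different b-types, gives `(0.5, 0.5)`.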

5. Results and discussion

More than 50 years ago, the Chemistry Nobel Laureate C.B. Anfinsen hypothesized (Anfinsen 1973) that the information regarding the conformation (i.e., secondary and tertiary structure) of a protein is embedded in its primary amino acid sequence, and hence so is its functional aspect, as function and conformation are correlated. Utilizing this structure-function relationship of proteins along with the proposition based on mathematical modelling and graph theory, it can be substantiated here that the underlying impetus for the recognition of DXCA exclusively by the protein PPCA lies in its local conformation, which is guided by the first principle of folding.

Table 1. All possible encodings of nucleotide bases (Ghosh et al. 2015). Within each column group, the entries at the same position in the four rows together form one encoding.

| b-type | Nucleotide base | Intra-b-type distance 1 | Intra-b-type distance 2 |
|---|---|---|---|
| 1 | A | 00 00 01 01 00 00 10 10 10 11 10 11 01 11 01 11 | 00 00 11 11 01 01 10 10 |
| 1 | T | 01 01 00 00 10 10 00 00 11 10 11 10 11 01 11 01 | 11 11 00 00 10 10 01 01 |
| 0 | C | 10 11 10 11 01 11 01 11 00 00 01 01 00 00 10 10 | 01 10 01 10 00 11 00 11 |
| 0 | G | 11 10 11 10 11 01 11 01 01 01 00 00 10 10 00 00 | 10 01 10 01 11 00 11 00 |
|  |  | Category A: Inter-b-type distance is either 1 or 2 | Category B: Inter-b-type distance is always 1 |


5.1 Conformational aspect of interaction

From the crystal structure of PPCA (PDB ID: 1OS6) it is found that the amino acid residues at the 4th, 29th, 37th, 38th, 41st, 45th and 50th positions of the sequence participate in the interaction with the four oxygen atoms of DXCA. As the paralogs of PPCA cannot recognize DXCA, it is worthwhile to identify whether the underlying cause lies in the primary sequence of these particular residues, in their secondary structure, or in both.

5.2 Sequence alignment

Using the program BLAST (blastp) (Altschul et al. 1990), it has been observed that, with respect to PPCA, the sequence identity is approximately 77 percent for PPCB, 65 percent for PPCC, 62 percent for PPCD and 58 percent for PPCE, which establishes that each paralog differs from PPCA in at least about 23 percent of its sequence. More precisely, looking at the interacting residues of PPCA, it is found that isoleucine (I) in PPCA is replaced by methionine (M) in PPCB at the 4th position, lysine (K) in PPCA is replaced by valine (V) in PPCB at the 29th position, and glycine (G) in PPCA is replaced by serine (S) in PPCB at the 50th position. In PPCC, with respect to PPCA, mutations are observed at residue positions 29 and 37, where lysine (K) at the 29th position is replaced by glycine (G) and lysine (K) at the 37th position is replaced by arginine (R). In PPCD, almost all the interacting residues are mutated. In PPCE, mutations are observed at the 29th, 45th and 50th positions. A comparison of the primary amino acid sequences through pairwise alignment, highlighting the interacting residues of PPCA towards recognition of DXCA and the corresponding residues of the other paralogs of the PPCA family, is shown in figure 2.
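To illustrate how such comparisons can be read off a pairwise alignment, the sketch below computes a percent identity and lists the differences at chosen reference positions. It is a simplified illustration, not the blastp procedure: the function names are hypothetical, identity is taken here over columns where neither sequence has a gap (the denominator used by a given tool may differ), and the two short aligned strings in the usage note are toy inputs, not the real PPCA or PPCB sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical columns as a percentage of the columns without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(aligned_a, aligned_b)
                if x != gap and y != gap]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


def substitutions_at(aligned_ref, aligned_other, positions, gap="-"):
    """Map reference residue number -> (reference residue, other residue)
    for the requested positions at which the two sequences differ.
    A gap in the other sequence is reported as the gap character."""
    differences = {}
    ref_index = 0
    for x, y in zip(aligned_ref, aligned_other):
        if x == gap:
            continue  # insertion relative to the reference: no residue number
        ref_index += 1  # 1-based numbering along the reference sequence
        if ref_index in positions and x != y:
            differences[ref_index] = (x, y)
    return differences
```

For the toy pair `ref = "ADDIVLKA"` and `other = "ADDMVL-A"`, `substitutions_at(ref, other, {4})` returns `{4: ("I", "M")}`, mirroring the kind of I to M replacement reported at position 4, and `percent_identity(ref, other)` counts 6 matches over the 7 gap-free columns.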

5.3 Secondary structure analysis

Analysis of the secondary structure of PPCA and its paralogs (PPCB, PPCC, PPCD and PPCE) using the DSSP program (Kabsch and Sander 1983) (table 2) indicates that the amino acid at position 4 of each protein has an extended (E) conformation. At position 29, the amino acids of PPCA and PPCE adopt a 3_10 helical (G) conformation, while in PPCB the amino acid at the same position is in an α-helical (H) conformation; in PPCC and PPCD it forms an H-bonded turn (T) and a bend (S), respectively. From the standpoint of the secondary structure of the binding residues, the most striking difference is found at positions 37 and 38, which are mainly involved in recognition of DXCA through H-bonding. The amino acids at positions 37 and 38 of PPCA lie within the α-helical (H) region, whereas in the other paralogs these residues are part of a PPII structure (table 2). This is demonstrated in figure 3 by the superposition of the truncated secondary structures (residues 29 to 41) using the program Superpose (Maiti et al. 2004). The amino acids at positions 41 and 45 of all proteins are in coil and α-helical (H) conformations, respectively. At position 50, the amino acids in PPCA, PPCB and PPCC are in an H-bonded turn (T) conformation, while those in PPCD and PPCE are in an α-helical (H) conformation. A comparison of the overall 3D structures of all the paralogs with respect to PPCA is shown in supplementary figure 1. Considering the sites at which PPCA interacts with DXCA, one can conclude that the secondary structure at residues 37 and 38 (K and I), which play the major role in recognition of DXCA, deviates markedly between PPCA and the other members of the family. Moreover, at the other interacting positions (figure 2), deviations in the primary sequence may prevent the other paralogs from forming the hydrogen bonds needed to recognize DXCA.

Table 2. Representation of secondary structure obtained from DSSP along with backbone dihedral angles (φ, ψ) for the residues of PPCA participating in recognition of DXCA and the residues at similar positions of the PPCA paralogs. Each cell gives the DSSP pattern followed by (PHI, PSI) in degrees; a dash marks a blank pattern.

Position  PPCA               PPCB               PPCC               PPCD               PPCE
4         E (-120.2, 148.9)  E (-100.8, 147.6)  E (-124.7, 117)    E (-92.4, 109.3)   E (-115.6, 133.5)
29        G (-64.3, -27.7)   H (-52.8, -41.9)   T (-58.9, -52.5)   S (-109.6, 153.6)  G (-65.8, -19.2)
37        H (-75.8, -23.9)   - (-63.5, 136.5)   - (-60.1, 145.6)   - (-116.7, 101.1)  - (-58.4, 140.3)
38        H (-72.5, -46.8)   - (-89, 99.9)      - (-103.9, 110.6)  T (-57.6, -34.2)   - (-99.9, 112.6)
41        - (-69.3, 150.7)   - (-56.8, 137.2)   - (-70.6, 118.3)   S (165.5, -170.8)  - (-64.5, 139.2)
45        H (-64.4, -40.6)   H (-67.6, -44)     H (-72.2, -44.5)   H (-67.4, -44.1)   H (-60.5, -45.1)
50        T (-140.2, -76.4)  T (-136.8, -77.9)  T (-143.8, -79.6)  H (-68.5, -39.6)   H (-64.9, -62.1)

From the crystal structure of DXCA-bound PPCA, it is found that the binding is mediated through non-covalent interactions, either hydrophobic or hydrogen-bond interactions. By the usual convention in chemistry, an interaction is considered irrelevant if the distance between the two atoms is greater than 4 Å. In our study, therefore, we have considered only those interactions in which the interacting non-bonded atoms are less than 4 Å apart. For example, residues 29 and 33 lie far from the DXCA molecule, i.e., their distance is much greater than 4 Å, as tabulated in supplementary table 1 and shown in figure 3. Moreover, the side chains of residues 29, 33 and 49 are oriented away from DXCA in the crystal structure, so these residues are not in a favourable position for interaction. On close inspection, our alignment is on par with that used in Pokkuluri et al. (2004). There, the alignment was obtained through multiple sequence alignment, whereas we have used pairwise alignment. A shift of one or two positions in the sequence arises from the alignment software and does not deviate from the norms of sequence analysis.

From the analysis of the sequence and the secondary structure of the proteins of the PPCA family with respect to recognition of DXCA, it can be established that PPCA differs from its paralogs at the interacting residues, along with a few others, which may be correlated with its functionality. Moreover, the distinct conformation adopted at a particular residue is likely to be influenced by the neighbouring amino acids. The exclusive interaction of PPCA with DXCA is examined through the graph theoretic approach in the subsequent part of this manuscript.
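The contrast at positions 37 and 38 can also be read directly from the backbone dihedral angles of table 2. The sketch below labels each (PHI, PSI) pair with a coarse Ramachandran region. The region boundaries in `region` are illustrative assumptions chosen for this example, not the DSSP assignment rules and not thresholds used in this study.

```python
# (PHI, PSI) in degrees for residues 37 and 38, copied from table 2.
DIHEDRALS = {
    "PPCA": {37: (-75.8, -23.9), 38: (-72.5, -46.8)},
    "PPCB": {37: (-63.5, 136.5), 38: (-89.0, 99.9)},
    "PPCC": {37: (-60.1, 145.6), 38: (-103.9, 110.6)},
    "PPCD": {37: (-116.7, 101.1), 38: (-57.6, -34.2)},
    "PPCE": {37: (-58.4, 140.3), 38: (-99.9, 112.6)},
}


def region(phi, psi):
    """Coarse Ramachandran label; the cut-offs are assumptions for illustration."""
    if phi < 0 and -90.0 <= psi <= 30.0:
        return "helical"
    if phi < 0 and psi >= 90.0:
        return "extended/PPII"
    return "other"


for protein, residues in DIHEDRALS.items():
    labels = {pos: region(*angles) for pos, angles in residues.items()}
    print(protein, labels)
```

Under these assumed boundaries, both PPCA residues fall in the helical region, while residue 37 of every paralog falls in the extended/PPII region.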

Figure 2. Pairwise sequence alignment of the primary sequences of PPCA and its paralogs (with respect to PPCA) obtained from cytochrome C7 found in Geobacter sulfurreducens using Clustal W (Ghosh et al. 2015).


5.4 Combinatorial analysis

To justify the hypothesis, drawn from the structure-function relationship, regarding the behavioral difference of PPCA towards DXCA, two graph theoretic models have been developed, based on the b-type of nucleotide bases and the impact of amino acids, respectively. The graph theoretic models proposed here are based on the order in which the binding residues appear in the sequence. This is because the interaction at the binding region of PPCA originates from the spatial arrangement of the three-dimensional structure of the protein, in which each binding residue is part of a particular secondary structure. The local arrangement of these secondary structures originates from contiguous stretches of amino acids through different non-covalent interactions. These interactions expose the binding residues in an orientation that brings the protein and the ligand into proximity (a distance of less than 4 Å), facilitating their interaction. Therefore, the sequential connectivity in the graph representation of the binding residues correlates with the contiguous behaviour of the binding residues.

Figure 3. Cartoon representation of the overlay of secondary structural elements (from position 29 to 41) between PPCA (shown in red) and its different paralogs (shown in black) obtained from cytochrome C7 using the Superpose server, towards understanding the exclusive interaction between DXCA (shown as a ball-and-stick model) and PPCA (Ghosh et al. 2015).

5.4.1 Connectivity of graphs based on b-type: Each protein of the PPCA family is characterized by a binding sequence, defined as the contiguous sequence of amino acid residues at the binding locations. Thus the binding sequence for PPCA is I (position 4), K (position 29), K (position 37), I (position 38), F (position 41), M (position 45), G (position 50), as depicted in figure 2 and table 3. For any given protein of the PPCA family, we denote each nucleotide base of the binding sequence by $P_{ij}$, where $i \in \{1, 2, 3, 4, 5, 6, 7\}$ denotes the position of the interacting residue in the binding sequence and $j \in \{1, 2, 3\}$ denotes the position of the nucleotide base within the codon of that residue, as shown in table 3. For each member of the PPCA family, we draw two separate directed graphs (depicted in figure 4), namely $G_0 = (V_0, E_0)$ and $G_1 = (V_1, E_1)$, based on the $P_{ij}$ values as follows:

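$$
\begin{aligned}
V_0 &= \{P_{ij} : \text{b-type}(P_{ij}) = 0\}, &\quad E_0 &= \{(P_{ij}, P_{(i+1)j'}) : P_{ij}, P_{(i+1)j'} \in V_0,\ |j - j'| \le 1\},\\
V_1 &= \{P_{ij} : \text{b-type}(P_{ij}) = 1\}, &\quad E_1 &= \{(P_{ij}, P_{(i+1)j'}) : P_{ij}, P_{(i+1)j'} \in V_1,\ |j - j'| \le 1\}.
\end{aligned}
\tag{2}
$$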
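The b-type values entering equation (2) can be read off from the codons. The sketch below is a minimal illustration, assuming the encoding A → 00, T → 11, C → 01, G → 10 used in this study and the b-type values tabulated in table 3 (A and T have b-type 1; C and G have b-type 0). The helper names `encode` and `vertex_labels` are hypothetical.

```python
# Two-bit codes of the bases and the b-type of each base, as tabulated in table 3.
CODE = {"A": "00", "T": "11", "C": "01", "G": "10"}
B_TYPE = {"A": 1, "T": 1, "C": 0, "G": 0}

# Codons of the PPCA binding residues (positions 4, 29, 37, 38, 41, 45, 50).
PPCA_CODONS = ["ATC", "AAG", "AAG", "ATC", "TTC", "ATG", "GGC"]


def encode(codon):
    """Return the binary representation and the b-type triplet of a codon."""
    binary = "".join(CODE[base] for base in codon)
    b_types = tuple(B_TYPE[base] for base in codon)
    return binary, b_types


def vertex_labels(codons, b):
    """List the labels P_ij of all nucleotides whose b-type equals b."""
    return [
        f"P{i}{j}"
        for i, codon in enumerate(codons, start=1)
        for j, base in enumerate(codon, start=1)
        if B_TYPE[base] == b
    ]


print(encode("ATC"))
print(vertex_labels(PPCA_CODONS, 0))
```

For example, `encode("ATC")` reproduces the first PPCA entry of table 3, and `vertex_labels(PPCA_CODONS, 0)` enumerates the candidate vertices of $V_0$ for PPCA.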

To illustrate, let us consider the construction of graph $G_0$ for PPCA. From the columns corresponding to PPCA in table 3, it may be noted that for $L = 4$ (residue number in the protein) and $i = 1$, the b-type of $P_{13}$ (representing C) is 0; for $L = 29$ and $i = 2$, the b-type of $P_{23}$ (representing G) is 0. As a result, there is an edge between the nodes representing $P_{13}$ and $P_{23}$ in $G_0$, as depicted in figure 4. Similarly, $G_1$ can be drawn from the information presented in table 3. It may be noted that, following equation (2), the graph representation of a protein may not be unique. For example, in the graph representation of PPCC, either of the following paths may exist: $P_{13} \to P_{23} \to P_{33} \to P_{43} \to P_{53} \to P_{63} \to P_{73}$ or $P_{13} \to P_{22} \to P_{32} \to P_{43} \to P_{53} \to P_{63} \to P_{72}$, depending on the choice of $P_{ij}$. However, the number of connected components is independent of the choice of $P_{ij}$. Graphs for PPCB, PPCC, PPCD and PPCE are also depicted in figure 4.

From the graph theoretic representations of the different proteins of the PPCA family, it can be observed that in the graph of PPCA, both $G_0$ and $G_1$ consist of a single connected component. However, for each of PPCB to PPCE, at least one of $G_0$ and $G_1$ is disconnected, i.e., there exists a pair of vertices in either $G_0$ or $G_1$ with no path between them.

It should be noted that choosing any one of the 7 alternative encodings of category B (shown in table 1), instead of the one considered in this study (A → 00, T → 11, C → 01 and G → 10), produces the same result. This clearly demonstrates the differential nature of PPCA with respect to its paralogs.
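The connectivity check can be sketched in a few lines. The code below builds the vertex and edge sets of equation (2) literally from the codons of table 3 and groups the vertices into weakly connected components with a union-find structure. The criterion in `spans_all_residues` is an assumption made for this illustration: a graph is treated as connected when a single component touches every residue that contributes at least one vertex, which is one way to read the remark that the result does not depend on the choice of $P_{ij}$. It is not claimed to be the exact procedure used to draw figure 4, and only the three proteins whose codons appear in table 3 are included.

```python
# b-type of each base as tabulated in table 3 (A/T -> 1, C/G -> 0).
B_TYPE = {"A": 1, "T": 1, "C": 0, "G": 0}

# Codons of the binding residues (positions 4, 29, 37, 38, 41, 45, 50) from table 3.
CODONS = {
    "PPCA": ["ATC", "AAG", "AAG", "ATC", "TTC", "ATG", "GGC"],
    "PPCB": ["ATG", "GTA", "AAG", "ATC", "TTC", "ATG", "AGT"],
    "PPCC": ["ATC", "GGC", "AGG", "ATC", "TTC", "ATG", "GGC"],
}


def build_graph(codons, b):
    """Vertices (i, j) of b-type b; edges join residues i and i+1 when |j - j'| <= 1."""
    vertices = {
        (i, j)
        for i, codon in enumerate(codons, start=1)
        for j, base in enumerate(codon, start=1)
        if B_TYPE[base] == b
    }
    edges = [
        ((i, j), (i + 1, k))
        for (i, j) in vertices
        for k in (j - 1, j, j + 1)
        if (i + 1, k) in vertices
    ]
    return vertices, edges


def components(vertices, edges):
    """Weakly connected components computed with union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, w in edges:
        parent[find(u)] = find(w)
    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


def spans_all_residues(codons, b):
    """Assumed criterion: one component touches every residue that has a b-type-b vertex."""
    vertices, edges = build_graph(codons, b)
    residues = {i for i, _ in vertices}
    return any({i for i, _ in comp} == residues for comp in components(vertices, edges))


for name, codons in CODONS.items():
    print(name, "G0:", spans_all_residues(codons, 0), "G1:", spans_all_residues(codons, 1))
```

Under this reading, both graphs of PPCA satisfy the criterion, whereas $G_0$ fails for PPCB and $G_1$ fails for PPCC, in line with the observation above that at least one graph is disconnected for each paralog.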

5.4.2 Characterization of interacting residues using a graph theoretic approach based on impact: Although the graph theoretic approach based on the b-type values of the nucleotide bases establishes the uniqueness of PPCA among its paralogs with respect to its amino acid sequence, it gives no explicit indication of the underlying chemical phenomenon. At this juncture, it is therefore valuable to ask whether this exceptional property can be explained further from the first principle of protein folding (i.e., the importance of the primary sequence in attaining the 3D fold/conformation) by developing a graph theoretic model based on the impact of amino acids. Table 3 lists, location by location, the interacting amino acid residues of each protein of the PPCA family in terms of nucleotide codons (N.C.) (obtained from NCBI: http://www.ncbi.nlm.nih.gov/), along with the b-type values of the corresponding nucleotide bases in triplet form, the binary representations (b.r.) of the codons and the impact values.

Table 3. Binary representations (b.r.) of the nucleotide codons (N.C.) of the interacting residues of PPCA, in ($P_{i1}P_{i2}P_{i3}$) form as obtained from NCBI, and of the residues at similar positions of the PPCA paralogs, along with the calculated b-type (in parentheses) and impact of the amino acids (Ghosh et al. 2015). L is the location of the interacting residue in the protein and i its location in the binding sequence. Each cell gives the residue, N.C., (b-type for j = 1, 2, 3), b.r. and impact $(I_{1i}, I_{2i})$.

L   i   PPCA                          PPCB                              PPCC
4   1   I ATC (1,1,0) 001101 (1,0)    M ATG (1,1,0) 001110 (1,0)        I ATC (1,1,0) 001101 (1,0)
29  2   K AAG (1,1,0) 000010 (0,0)    V GTA (0,1,1) 101100 (1/2,1/2)    G GGC (0,0,0) 101001 (0,1)
37  3   K AAG (1,1,0) 000010 (0,0)    K AAG (1,1,0) 000010 (0,0)        R AGG (1,0,0) 001010 (1/2,1/2)
38  4   I ATC (1,1,0) 001101 (1,0)    I ATC (1,1,0) 001101 (1,0)        I ATC (1,1,0) 001101 (1,0)
41  5   F TTC (1,1,0) 111101 (0,0)    F TTC (1,1,0) 111101 (0,0)        F TTC (1,1,0) 111101 (0,0)
45  6   M ATG (1,1,0) 001110 (1,0)    M ATG (1,1,0) 001110 (1,0)        M ATG (1,1,0) 001110 (1,0)
50  7   G GGC (0,0,0) 101001 (0,1)    S AGT (1,0,1) 001011 (1/2,1/2)    G GGC (0,0,0) 101001 (0,1)

To develop the combinatorial model, the notion of alternative representation is introduced. The alternative representation of a binding sequence is defined as the contiguous sequence of the impact values $(I_{1i}, I_{2i})$, $i \in \{1, 2, 3, 4, 5, 6, 7\}$, of the interacting amino acid residues of the binding sequence, preserving their order of appearance in the binding sequence, as shown in table 3. For example, the binding sequence of PPCA is (I, K, K, I, F, M, G) (figure 2 and table 3). Hence the alternative representation of PPCA is {impact of I, impact of K, impact of K, impact of I, impact of F, impact of M, impact of G}, i.e., {(1, 0), (0, 0), (0, 0), (1, 0), (0, 0), (1, 0), (0, 1)} (table 3). The alternative representations of PPCB, PPCC, PPCD and PPCE are obtained similarly, as depicted in table 3.

From the alternative representation, a directed graph $G = (V, E)$ has been drawn for each protein of the PPCA family, as shown in figure 5, where $V$ is the set of distinct pairs from $\{(I_{1i}, I_{2i}) : i = 1, 2, 3, 4, 5, 6, 7\}$ and $E = \{((I_{1i}, I_{2i}), (I_{1j}, I_{2j})) : j = i + 1\}$. To illustrate, let us consider the construction of the graph for PPCA. Since there are only five distinct amino acid residues in the binding sequence of PPCA (due to repetition of I and K) but seven binding locations (table 3), repetition of impact values in the alternative representation follows from the pigeonhole principle (Rosen 2011). As shown in table 3, in the alternative representation of PPCA, $(I_{11}, I_{21}) = (1, 0)$, $(I_{12}, I_{22}) = (0, 0)$, $(I_{13}, I_{23}) = (0, 0)$, $(I_{14}, I_{24}) = (1, 0)$, $(I_{15}, I_{25}) = (0, 0)$, $(I_{16}, I_{26}) = (1, 0)$ and $(I_{17}, I_{27}) = (0, 1)$. To construct the graph, the impact values of the alternative representation are considered pairwise, starting from the pair $(I_{11}, I_{21})$ and $(I_{12}, I_{22})$. Since $(I_{11}, I_{21}) = (1, 0)$ and $(I_{12}, I_{22}) = (0, 0)$ are different impact values, two distinct nodes are created, representing (1, 0) and (0, 0) respectively, and an edge is drawn between them. Next, we consider the pair $(I_{12}, I_{22})$ and $(I_{13}, I_{23})$. Since both represent the same impact value (0, 0) and a node representing (0, 0) already exists, no new node is created and the edge $((I_{12}, I_{22}), (I_{13}, I_{23}))$ becomes a self loop on (0, 0). Analogous logic holds for the pairs $((I_{13}, I_{23}), (I_{14}, I_{24}))$, $((I_{14}, I_{24}), (I_{15}, I_{25}))$ and $((I_{15}, I_{25}), (I_{16}, I_{26}))$. Parallel edges are not shown, as they do not signify any special property. Following similar logic, the graphs for PPCB, PPCC, PPCD and PPCE have been drawn (figure 5). The graph theoretic representations remain unchanged even if the codes assigned to the different nucleotide bases are altered, provided the intra-b-type Hamming distance is kept at 2.

Figure 4. Graphical representations of each PPCA family protein based on b-type values, showing that for PPCA both $G_0$ and $G_1$ contain only one connected component, whereas for PPCB, PPCC, PPCD or PPCE either $G_0$ or $G_1$ is disconnected (Ghosh et al. 2015).

To relate the graph theoretic representations of the PPCA family proteins to their biological properties, an alias has been given to each vertex (figure 5). The alias for a vertex has the form $A_i$ ($C_i$ $L_i$), where $A_i$ indicates the amino acid residue, $C_i$ represents the characteristic feature of the amino acid according to the Chou-Fasman scale (Chou and Fasman 1974) [helix breaker (B), helix former (H) or helix indifferent (i)] and $L_i$ denotes the location of the amino acid in the respective protein. Two notable points can be observed from the graph theoretic representations of the PPCA family proteins (depicted in figure 5). The graph of PPCA contains the following cycle of length 2: (0, 0) → (1, 0) → (0, 0). The graph of each of the other paralogs contains a cycle of length 3, as follows: PPCB: (1, 0) → (1/2, 1/2) → (0, 0) → (1, 0); PPCC: (1, 0) → (0, 1) → (1/2, 1/2) → (1, 0); PPCD: (0, 1) → (1, 1) → (0, 0) → (0, 1); and PPCE: (0, 0) → (1, 0) → (1/2, 1/2) → (0, 0).

Figure 5. Graphical representation of each interacting amino acid based on the impact of the residues of the binding sequence of PPCA and the residues at similar positions of the paralogs of PPCA, showing that there is a cycle of length 3 in the graphs representing PPCB to PPCE, whereas there is no cycle of length 3 in the graph of PPCA (Ghosh et al. 2015).

This observation establishes the uniqueness of PPCA among its paralogs, as described in the previous sections with the help of b-type. Further, to delve into the chemical properties, we focus on residues 37 and 38 of the binding location. It may be noted that, although K and I are reported at positions 37 and 38 throughout PPCA to PPCE (table 3 lists R in place of K at position 37 of PPCC), only PPCA interacts with DXCA while the others do not. This is because, although the sequence is important, it cannot by itself explain all the phenomena. We therefore look into the secondary structure, which contains the spatial information. Structural analysis clearly shows that K and I in PPCA are part of a small but significant helix, while in the other paralogs they lie in flexible loops or extended structure (figure 3). Thus, only in PPCA, being part of the helical structure, are K and I exposed in the way that is critical for interaction with and recognition of DXCA.

It may be observed from figure 5 that the three-length cycles in the graphs of PPCB to PPCE contain at least one vertex that represents a helix breaker (B) (e.g. glycine) or a helix indifferent residue (i) (e.g. arginine, serine), while in the graph of PPCA, the vertex representing the helix breaker glycine (at position 50) does not belong to the cycle of length 2. It may also be noted that the cycle of length 2 in the graph of PPCA comprises the two vertices representing positions 37 (impact value (0, 0)) and 38 (impact value (1, 0)). All the amino acids grouped into these two vertices are helix formers. Unlike in the other paralogs, the nodes representing positions 37 and 38 in the graph of PPCA also have no incoming edges from a node representing a helix breaker or helix indifferent residue. These observations clearly point towards helix formation in PPCA involving positions 37 and 38, which is essential for recognition of DXCA, making PPCA a completely different entity from its paralogs.
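The cycle search on the impact-based graphs can be sketched as follows. The code takes the impact values of PPCA, PPCB and PPCC from table 3, builds the directed graph on distinct impact pairs with one edge per pair of consecutive binding residues, and looks for directed cycles of a given length by brute force over the (at most four) vertices. The function names are hypothetical, and the brute-force search is simply a convenient choice for graphs of this size rather than a description of how figure 5 was produced.

```python
from fractions import Fraction
from itertools import permutations

H = Fraction(1, 2)

# Alternative representations (impact values per binding residue) from table 3.
IMPACT = {
    "PPCA": [(1, 0), (0, 0), (0, 0), (1, 0), (0, 0), (1, 0), (0, 1)],
    "PPCB": [(1, 0), (H, H), (0, 0), (1, 0), (0, 0), (1, 0), (H, H)],
    "PPCC": [(1, 0), (0, 1), (H, H), (1, 0), (0, 0), (1, 0), (0, 1)],
}


def impact_graph(sequence):
    """Vertices are the distinct impact pairs; edges link consecutive residues."""
    vertices = set(sequence)
    edges = set(zip(sequence, sequence[1:]))  # parallel edges collapse, self loops kept
    return vertices, edges


def has_cycle(vertices, edges, length):
    """True if some directed cycle visits exactly `length` distinct vertices."""
    for path in permutations(vertices, length):
        if all((path[k], path[(k + 1) % length]) in edges for k in range(length)):
            return True
    return False


for name, sequence in IMPACT.items():
    vertices, edges = impact_graph(sequence)
    print(name, "2-cycle:", has_cycle(vertices, edges, 2), "3-cycle:", has_cycle(vertices, edges, 3))
```

For these three proteins the search finds the cycle of length 2 in PPCA and no cycle of length 3, while PPCB and PPCC each contain a cycle of length 3, matching the cycles listed above.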

5.4.3 Significance of our results: There are many instances in biology of ligand-induced conformational change (extended to helix) during ligand binding to a protein (Denessiouk et al. 2005; Sheet and Banerjee 2020). However, this process involves two contributions: (i) incorporation of helicity in the peptide/protein sequence, which is entropically unfavorable (disorder to order), and (ii) the interaction energy between the ligand-induced helix and the ligand. The overall stabilization of the binding interaction depends on the magnitude of the unfavorable contribution: the lower it is relative to the overall magnitude, the higher the stabilization. This indicates that, for overall stabilization, an interaction between a pre-formed helix and the ligand must be favored over an interaction involving a co-operative effect, i.e., ligand-induced helix formation followed by interaction.

The PPCA protein originally contains the helical region, whereas PPC[B-E] originally exist in the extended form. Accordingly, the magnitude of the unfavorable interaction is much lower in PPCA than in PPC[B-E]. As a result, PPCA can easily recognize DXCA, while the DXCA ligand is not sufficiently potent to induce helicity in the PPC[B-E] system. In other words, the energy involved in the PPC[B-E] system is much higher. This is an example of structure-based ligand recognition.

Considering the PPCA-DXCA interaction as a model system, in this work we try to reach a consensus on the structure-function relationship in the light of the biological explanation vis-a-vis our graph theoretic approach. It may be noted that our proposed models are not predictive and rely on a priori knowledge of crystal structures. Given a numerical output, mathematical models are often used to understand the underlying system behavior. Our graph theoretic model is built on the assumption that the binding sequences are known; however, the physiochemical properties of the different PPCA paralogs are assumed to be completely unknown. In other words, the model has no knowledge of which PPCA paralog binds DXCA. Starting from numerical interpretations of the binding sequences, our graph theoretic model delves into the physiochemical differences among the PPCA paralogs towards recognition of DXCA, which are not obvious from the binding sequences. Furthermore, the conclusion drawn from our proposed models coincides with that of the electrostatics-based study reported by Pokkuluri et al. (2004). This suggests that our model is robust enough to capture the physiochemical differences among the PPCA paralogs.

Usually, the ligand-protein interaction is mediated through weak non-covalent electrostatic interactions (e.g. hydrogen bonds, dipole-dipole interactions), which demand a favorable environment. In the case of PPCA, the orientation of the helix dipole, i.e., the dipole associated with the helical conformation of PPCA, may create a favorable opportunity for recognition of DXCA. Our graph theoretic model supports the view that the helix dipole would generate a favorable electrostatic potential towards recognition of DXCA. This is not possible for the other paralogs of PPCA.

There are multiple phenomena, and there may exist innumerable explanatory models. One model may fit exactly one system, multiple systems or all systems. Further, one system may be explained by a single model or by more than one model. Understanding which model is appropriate for which system is an open question. Thus, the study of the PPCA-DXCA interaction through our graph theoretic approach opens a new avenue that may be used to explain other natural systems. In the near future, building such mathematical models will pave the way towards rationalizing different biological phenomena whose root cause is unknown, although partial information is available to the research communities.

Towards understanding the physiochemical behaviour of PPCA proteins using graph theoretic models, our contributions in this work can be summarized as follows:

Graph theoretic model based on b-type:

• Node/vertices: b-type values of nucleotides of various interacting residues belonging to the binding sequence.

• Edges: Similarity of b-type values (of different nucleotides) between adjacent amino acid residues belonging to the binding sequence.

• Graph theoretic operations: Search for connectivity.

• Findings: Structural difference between different proteins of the PPCA family.

Graph theoretic model based on impact:

• Node/vertices: impact values of various interacting residues belonging to the binding sequence.

• Edges: Adjacency of amino acid residues belonging to the binding sequence.

• Graph theoretic operations: Search for cycle.

• Findings: Physiochemical difference between different proteins of the PPCA family.

6. Conclusion

The function of a protein is in fact a synchronized effect of the overall sequence and the structure of the binding sequence, comprising the interacting amino acids along with their neighbouring residues. The difference in behaviour of PPCA towards recognition of DXCA, in comparison with its paralogs in the C7 family, is embedded within the primary sequence and its secondary structure at the interacting region. This is ultimately guided by the genomic sequence, as converged upon from the results of structural bioinformatics and discrete mathematics. Combinatorial analysis of sequences and graph theoretic models thus reveal the harmony among the interacting residues, helping to understand the functional network of both the protein and the gene. In the near future, building such mathematical models will pave the way towards rationalizing different biological phenomena whose root cause is unknown, although partial information is available to the research communities.

Acknowledgements

SG and RB acknowledge the BIF-DBT centre WBUT for computational facility and financial assistance. SKG is thankful to Prof. Pabitra Pal Choudhury for providing necessary support at Applied Statistics Unit (ASU), Indian Statistical Institute, during the initial phase of this work.

References

Adrian A, Canutescu AA, Shelenkov, Roland L and Dunbrack JR 2003 A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci. 12 2001–2014

Altschul S, Gish W, Miller W, Myers E and Lipman D 1990 Basic local alignment search tool. J. Mol. Biol. 215 403–410

Anfinsen CB 1973 Principles that govern the folding of protein chains. Science 181 223–230

Artymiuk PJ, Spriggs RV and Willett P 2005 Graph theoretic methods for the analysis of structural relationships in biological macromolecule. J. Am. Soc. Inform. Sci. Technol. 56 518–528

Bond DR and Lovley DR 2003 Electricity production by Geobacter sulfurreducens attached to electrodes. Appl. Environ. Microbiol. 69 1548–1555

Bondy JA and Murty USR 2006 Graduate Texts in Mathematics (Springer)

Burgess R and Deutscher V 2009 Guide to protein purification. Methods Enzymol. 463 900

Caccavo F, et al. 2007 Importance of c-type cytochromes for u(vi) reduction by Geobacter sulfurreducens. BMC Microbiol. 7 3752–3759

Caccavo F, Lonergan DJ, Lovley DR, Davis M, Stolz JF and McInerney MJ 1994 Geobacter sulfurreducens sp. nov., a hydrogen- and acetate-oxidizing dissimilatory metal-reducing microorganism. Appl. Environ. Microbiol. 60 3752–3759

Chou PY and Fasman GD 1974 Prediction of protein conformation. Biochemistry 13 222–245

Cover TM and Thomas JA 2006 Elements of Information Theory (Wiley)

Denessiouk KA, Johnson MS and Denesyuk AI 2005 Novel cnn structural motif for protein recognition of phosphate ions. J. Mol. Biol. 345 611–629

Fiser A, Do RKG and Sali A 2000 Modeling of loops in protein structures. Protein Sci. 9 1753–1773

Ghosh S, Ghosh SK, Ray C, Paul G, Choudhury PP and Banerjee R 2015 Understanding the behavioural difference of ppca among its homologs in c7 family towards recognition of dxca. 1608.09009 [q-bio.BM]

Kabsch W and Sander C 1983 Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22 2577–2637

Larkin MA, Blackshields G and Brown NP 2007 Clustalw and clustalx version 2. Bioinformatics 23 2947–2948

LaRusso NF, Szczepanik PA and Hofmann AF 1977 Effect of deoxycholic acid ingestion on bile acid metabolism and biliary lipid secretion in normal subjects. Gastroenterol. Rev. PubMed 72 132–140

Leavitt SA 1961 Deciphering the Genetic Code: Marshall Nirenberg (Office of NIH History)

Lloyd JR, Leang C and Hodges Myerson AL 2003 Biochemical and genetic characterization of ppca, a periplasmic c-type cytochrome in Geobacter sulfurreducens. Biochem. J. 369 153–161

Maiti R, Domselaar GHV, Zhang H and Wishart DS 2004 Superpose: a simple server for sophisticated structural superposition. Nucleic Acids Res. 32 W590–W594

Morgado L, Paixo VB and Schiffer M 2012 Revealing the structural origin of the redox-bohr effect: the first solution structure of a cytochrome from Geobacter sulfurreducens. Biochem. J. 441 179–187

Myers CJ 2015 Computational synthetic biology: progress and the road ahead. IEEE Trans. Multiscale Comput. Syst. 1 19–32

Poirrette AR, Grindley HM, Rice DW, Artymiuk PJ and Willett P 1994 Structural similarity between binding sites in influenza sialidase and isocitrate dehydrogenase: implications for an alternative approach to rational drug design. Protein Sci. 3 1128–1130

Pokkuluri PR, Londer YY, Duke N, Long W and Schiffer M 2004 Family of cytochrome c7-type proteins from Geobacter sulfurreducens: structure of one cytochrome c7 at 1.45 Å resolution. Biochemistry 43 849–859

Pokkuluri PR, Londer YY and Yang X 2010 Structural characterization of a family of cytochromes c7 involved in fe(iii) respiration by Geobacter sulfurreducens. Biochim. Biophys. Acta Bioenergetics 1797 222–232

Pokkuluri PR, Londer YY and Duke NEC 2011 Structure of a novel dodecaheme cytochrome c from Geobacter sulfurreducens reveals an extended 12 nm protein with interacting hemes. J. Struct. Biol. 174 223–233

Rosen K 2011 Discrete Mathematics and its Applications (McGraw-Hill Education)

Sheet T and Banerjee R 2020 Design of a peptide-based model leads for scavenging anions. ACS Omega 5 9759–9767

Shi L, Squier TC, Zachara JM and Fredrickson JK 2007 Respiration of metal (hydr) oxides by Shewanella and Geobacter: a key role for multi haem c-type cytochromes. Mol. Microbiol. 65 12–20

Tuqan J and Rushdi A 2008 A dsp approach for finding the codon bias in dna sequences. IEEE J. Select. Top. Signal Process. 2 343–356

Vishveshwara S, Brinda KV and Kannan N 2002 Protein structure: insights from graph theory. J. Theor. Comput. Chem. 1 187–211

West DB 2001 Introduction to Graph Theory (Pearson)

Corresponding editor: Ravindra Venkatramani
