A holistic approach to understanding CAZy families through reductionist methods. Jens Eklöf Licentiate thesis KTH Royal Institute of Technology School of Biotechnology Department of Glycoscience Stockholm 2009


A holistic approach to understanding CAZy families through

reductionist methods.

Jens Eklöf

Licentiate thesis

KTH Royal Institute of Technology

School of Biotechnology

Department of Glycoscience

Stockholm 2009


© Jens Eklöf

Stockholm 2009

Royal Institute of Technology

School of Biotechnology

Albanova University Center

SE-100 44 Stockholm

Sweden

ISBN 978-91-7415-269-2

TRITA BIO Report 2009:5

ISSN 1654-2312


Abstract 

At a time when the amount of biological data in the public domain is growing ever more vast, the need for good classification systems has never been greater. In glycoscience, the need for a good classification of the enzymes involved in the biosynthesis, modification and degradation of polysaccharides is even more pronounced than in other fields. This is due to the complexity of the substrates, the polysaccharides: the theoretical number of possible hexasaccharide isomers built from hexoses alone exceeds 10^12.

An initiative to classify enzymes acting on carbohydrates was begun around 1990 by the French scientist Bernard Henrissat. The resulting database, the Carbohydrate Active enZymes database (CAZy), classifies enzymes by sequence similarity into families, allowing the inference of structure and catalytic mechanism. What CAZy does not provide, however, is a means to understand how the members of a family are related and in what way they differ from each other. The top-down approach used in this thesis, which combines phylogenetic analysis of whole CAZy families, or sub-families, with structure determination and detailed kinetic analysis, allows for exactly that.

Finding determinants for transglycosylation versus hydrolysis within the xth gene product family of GH16, and restricting the hydrolytic enzymes to a well-defined clade, are integral parts of paper II. In paper I, a new bacterial sub-clade within CE8 was discovered. The structure determination of the Escherichia coli outer membrane lipoprotein YbhC from the new sub-clade explained the difference in specificity. The information provided in the two papers of this thesis gives a better understanding of how different specificities have developed in diverse CAZy families and aids future gene product annotation. Furthermore, this work has begun to fill in the white spots uncovered in the phylogenetic trees.


Sammanfattning (Summary)

Now that the amount of biological data on servers around the world far exceeds what can be considered surveyable, the need for good classification systems is at its greatest. In glycoscience, the need for good classification methods for the enzymes responsible for the biosynthesis, modification and degradation of polysaccharides is especially great. The reason lies in the complexity of the substrates of these enzymes, the polysaccharides. The number of theoretically possible isomers of a hexasaccharide built from hexoses alone exceeds 10^12.

One such classification initiative was started around 1990 by the French scientist Bernard Henrissat. His work resulted in a database called Carbohydrate Active enZymes (CAZy), in which enzymes are classified into families on the basis of their amino acid sequence. This division into families allows users to draw conclusions about an enzyme's structure and mechanism. What CAZy does not do, however, is show how the sequences are related and in what way they differ. The top-down approach used in this thesis, a combination of phylogenetic analysis of whole CAZy families, or sub-families, with protein structure determination and detailed kinetic analysis, reveals precisely what CAZy leaves out. Two examples of this approach are presented in this thesis.

Paper II identifies determinants of transglycosylation versus hydrolysis within a sub-family (the xth gene product family) of glycosyl hydrolase family 16, GH16, and restricts the hydrolytic activity to a sub-family of the xth gene product family. Paper I identifies a new sub-family of carbohydrate esterase family 8, CE8, which, after structure determination of one of its members, the outer membrane lipoprotein YbhC from Escherichia coli, proved to have evolved a different specificity from its ancestors, the pectin methylesterases. The information in these papers increases the understanding of how new activities have evolved within these two families and simplifies future gene annotation. In addition, the white spots uncovered in these phylogenetic trees have begun to be filled in.


List of publications 

I The crystal structure of the outer membrane lipoprotein YbhC from Escherichia coli

sheds new light on the phylogeny of Carbohydrate Esterase family 8. Eklöf J.M.≡, Tan

T.C.≡, Divne C., Brumer H. Submitted (Proteins: Structure, Function, and Bioinformatics)

Personal contribution to paper I. I did the phylogenetic work on Carbohydrate Esterase Family 8

and the biochemical characterisation. I also contributed to the crystallisation of YbhC and

was the principal author.

II Structural Evidence for the Evolution of Xyloglucanase Activity from Xyloglucan Endo-Transglycosylases: Biological Implications for Cell Wall Metabolism. Baumann M.J., Eklöf J.M., Michel G., Kallas Å.M., Teeri T.T., Czjzek M., Brumer H., III. Plant Cell 2007; 19(6):1947-1963

Personal contribution to paper II. Shared the work on producing and characterising TmNXG1

and TmNXG2 with Martin Baumann, and the work on the phylogeny of the xth gene

product family with Gurvan Michel.


Related publications 

1. Analysis of nasturtium TmNXG1 complexes by crystallography and molecular

dynamics provides detailed insight into substrate recognition by family GH16 xyloglucan

endo-transglycosylases and endo-hydrolases. Mark P, Baumann M.J., Eklöf J.M., Gullfot F.,

Michel G., Kallas Å.M., Teeri T.T., Brumer H., Czjzek M. Proteins: Structure, Function, and

Bioinformatics 2008; DOI 10.1002/prot.22291

2. Top-Down Grafting of Xyloglucan to Gold Monitored by QCM-D and AFM:

Enzymatic Activity and Interactions with Cellulose. Nordgren N., Eklöf J. M., Zhou Q.,

Brumer H., Rutland M.W. Biomacromolecules 2008; 9(3):942-948.

3. Characterisation and 3-D structures of two distinct bacterial xyloglucanases from

families GH5 and GH12. Gloster T.M., Ibatullin F.M., Macauley K., Eklöf J.M., Roberts S.,

Turkenburg J.P., Bjornvad M.E., Jorgensen P.L., Danielsen S., Johansen K.E., Borchert

T.V., Wilson K.S., Brumer H., Davies G.J. Journal of Biological Chemistry 2007; 282(26):19177-

19189


Table of contents 

INTRODUCTION
1 PROTEINS
    1.1 Protein evolution
    1.2 Protein folding
2 BIOINFORMATICS
    2.1 Multiple sequence analysis
        2.1.1 An example of a multiple sequence alignment algorithm
    2.3 Phylogenetics
3 THE CARBOHYDRATE-ACTIVE ENZYMES DATABASE, CAZY
    3.1 Reaction mechanism of glycosyl hydrolases
    3.2 Glycosyl Hydrolase family 16
        3.2.1 Xyloglucan endo-transglycosidases
        3.2.2 XETs physiological role in plant cell walls
    3.3 Carbohydrate Esterase family 8
        3.3.1 Pectin methylesterases role in plant cell walls and pectin degradation
        3.3.2 The pectin methylesterases
PRESENT INVESTIGATION
4.1 PAPER I: A PHYLOGENETIC ANALYSIS OF CE8 LOCATES A NEW BACTERIAL SUB-CLADE AND THE STRUCTURAL DETERMINATION OF E. COLI YBHC.
4.2 PAPER II: INVESTIGATION OF THE GH16 XTH GENE FAMILY CLARIFIES THE DETERMINANTS FOR TRANSGLYCOSYLATION VERSUS HYDROLYSIS WITHIN THE FAMILY BY EXPLORING THE NEW 3D STRUCTURE OF TMNXG1 AND RESTRICTS THE HYDROLYTIC ACTIVITY TO A SPECIFIC SUB-CLADE.
CONCLUDING REMARKS
ACKNOWLEDGEMENT
REFERENCES

 

   


Introduction 

This thesis focuses on proteins, especially those acting on carbohydrates. The complexity of the substrates of these proteins, the polysaccharides built from carbohydrates, and the important functions they have in organisms make carbohydrate-active proteins an exciting and diverse field. These proteins are of interest to the pharmaceutical industry, because of the diseases caused by dysfunctional biosynthetic enzymes and the immune responses elicited by oligosaccharides; to the energy sector, for their use in making alternative fuels; and to the forest and agricultural industries, because of the roles polysaccharides play in their products, to mention just a few of the beneficiaries of research on these enzymes.

Because of the abundance and variety of carbohydrate-active proteins, an initiative (CAZy) was started around 1990 to order these proteins into families by their primary structure. Research in glycoscience often focuses on a single enzyme and on clarifying that enzyme's role. There is therefore a gap in information between the family level of CAZy and the individual proteins of these families. Each enzyme can be seen as an island in an ocean (representing a protein family), but there is no map of the ocean. The work in this thesis tries to bridge the gap between families and individual proteins by showing how they are related (mapping the ocean) and by investigating proteins in previously uncharacterised areas of these families.

Chapter 1 gives an introduction to proteins, their origin, evolution and folding. Chapter 2 introduces the field of bioinformatics together with some specific examples of bioinformatics methods used to understand the relationships between proteins, described for non-bioinformaticians. Chapter 3 describes the concept of the CAZy database in more detail, as well as the CAZy families that are investigated in chapter 4.

1 Proteins

Proteins are extremely versatile biopolymers whose complexity is matched only by that of polysaccharides. They perform crucial functions in virtually all biological processes: as catalysts (enzymes), in transport, in signal transmission, as providers of structural support, and in a whole array of other roles. The term protein was first mentioned in a letter by the Swedish chemist Jöns Jacob Berzelius in 1838 [1,2]:

“The name protein that I propose for the organic oxide of fibrin and albumin, I wanted to

derive from the Greek word πρωτειος, because it appears to be the primitive or principal

substance of animal nutrition.”

Apart from being essential to all forms of life, proteins have also become workhorses of modern society. They are manufactured and sold as a commodity in animal feed and as enzymes to the baking and brewing industries. Proteins are also used in detergents and as biocatalysts in the chemical industry, and demand from the biofuel sector is increasing [3]. The first protein used as a pharmaceutical drug was insulin, introduced in 1922, and today pharmaceutical companies focus much of their research on protein drugs [4]. The rationale for using enzymes for industrial purposes is their inherent specificity and the possibility of saving energy and replacing harmful chemicals. In 2007, the worldwide enzyme market was worth $4.1 billion according to the Freedonia Group Inc. (www.freedoniagroup.com).

Proteins are linear biopolymers built from monomers called amino acids. There are 20 common amino acids, and they all share a backbone structure with an amino group at one end and a carboxyl group at the other. The amino acids differ in their R group (side chain), which covers a wide variety of chemical and physical properties (Figure 1a).

Figure 1. The common building blocks of all proteins. a, the common structure of all amino acids. The R group is variable and unique to each of the 20 amino acids. b, two amino acids coupled together via a peptide bond (connecting the amino group and the carbonyl carbon within the box).

The biosynthesis of proteins begins with the copying of DNA into mRNA, called transcription; the mRNA is then decoded into a peptide chain, called translation. Transcription and translation form the core of what is called the central dogma of molecular biology [5,6], which stipulates the flow of information between DNA, RNA and proteins (Figure 2).

Figure 2. A simplified version of the central dogma of molecular biology.

During translation, which takes place at the ribosome, amino acids are coupled through a condensation reaction between the carboxyl group of one amino acid and the amino group of another. The bond created is called a peptide bond, and it is part of the backbone of all proteins (Figure 1b).
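The flow of information from DNA to mRNA to peptide can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the input is taken to be the coding strand of the DNA, and the codon table is deliberately partial (a full table has 64 entries).

```python
# Minimal sketch of transcription and translation (illustrative only).
# CODON_TABLE is deliberately partial; a complete table has 64 codons.
CODON_TABLE = {
    "AUG": "M",                          # start codon, methionine
    "GGU": "G", "GGC": "G",              # glycine
    "GCU": "A",                          # alanine
    "UGG": "W",                          # tryptophan
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon or the end is reached."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        # Raises KeyError for codons missing from the toy table.
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        peptide.append(residue)
    return "".join(peptide)


mrna = transcribe("ATGGGTGCTTGGTAA")
peptide = translate(mrna)
print(mrna, peptide)
```

The peptide string returned by `translate` is the primary structure in one-letter code; each adjacent pair of letters corresponds to one peptide bond.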

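[Figures 1 and 2, diagram labels only. Figure 1: (a) an amino acid and (b) two amino acids joined by a peptide bond. Figure 2: replication (DNA to DNA), transcription (DNA to mRNA) and translation (mRNA to protein).]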


The order in which amino acids are joined is dictated by the mRNA, and a protein's amino acid sequence is called its primary structure. The polypeptide chain forms regular structures such as α-helices, β-sheets, turns and coils, which are referred to as the protein's secondary structure. The tertiary structure is the protein's 3D structure, i.e. how the different secondary structure elements are arranged in space. Proteins can also form complexes with other proteins or domains; this is called the quaternary structure.

1.1 Protein evolution 

Even though the origin of proteins is unknown, one thing is certain: once proteins entered the stage of life, they came to stay. One theory of the origin of life, "the RNA world", states that RNA was the first biopolymer [7], acting both as a blueprint and as a catalyst for its own replication. Today the proposed roles of prehistoric RNA have to a large extent been taken over by other biomolecules. DNA has taken over as the blueprint of life (even though some viruses use RNA as their information carrier) and proteins have taken over as the workhorses of life. RNA still has important roles, but they have become increasingly specialised.

The reason proteins have taken over the proposed roles of RNA is their greater functional potential. Firstly, the peptide bond in proteins is less susceptible to hydrolysis than the phosphodiester bond in RNA, which increases the lifespan of a protein molecule compared with an RNA molecule. Secondly, and perhaps more importantly, RNA is limited in its number of building blocks, i.e. the four ribonucleotides, which are also similar in terms of functional groups. Proteins, on the other hand, are built from 20 amino acids with a wide variety of functional groups and chemical properties.


1.2 Protein folding 

Rendering a protein active and functional requires both that it is translated correctly and that the polypeptide chain folds properly. Incorrectly folded proteins are not just dysfunctional; they can aggregate and become harmful, as in Alzheimer's disease [8,9] or mad cow disease [10]. In both cases proteins adopt an alternative fold and build up insoluble plaques.

The mechanisms behind protein folding are not fully elucidated, but several theories have been presented. One of the first, hydrophobic collapse [11-13], originally hypothesised by Dorothy M. Wrinch and Harold Jeffreys as early as 1919, springs from the observation that most proteins have a hydrophobic core and that these residues interact during folding to minimise their exposure to the aqueous solvent, thereby maximising entropy. Other theories propose alternative driving forces, for example that small domains or islands of the polypeptide chain first adopt a correct secondary structure and that these islands then interact to build the correctly folded protein (see [14] and references therein). The consensus today is that several mechanisms work together in protein folding. However, there is still debate on what happens first and to what extent different mechanisms contribute to the folding of a given protein. One difficulty in studying protein folding is that the studies are usually conducted on small peptides or on particular proteins such as lysozyme, and it is hard to assess to what extent the knowledge gained applies to larger, more complex proteins [15-17].

So how many protein folds are there? As more and more 3D structures of proteins are solved, it is becoming possible to make an accurate estimate of the protein fold space. The millions of known protein sequences can be grouped into tens of thousands of families [18,19]. Even families with little or no detectable sequence similarity to each other can share a similar fold, as folds tend to be more conserved than sequences [20]. The tens of thousands of families can therefore be boiled down to roughly 1000 distinct folds [21,22].

Is there a common ancestor for these thousand folds, or do different folds represent distinct evolutionary events? Choi and Kim suggested that proteins may have had several independent origins, but also presented the possibility that new folds could arise from a frame shift, a reversed reading direction or a mutated stop codon [23]. Even though a fold is usually seen as something rather rigid, there is evidence for structural plasticity. In recent years it has been shown that the same sequence can adopt different conformations depending on its environment. Factors such as pH, ionic strength or the presence of metal ions can all act as conformational switches (see [24,25] and references therein), which is important both in certain diseases and potentially also in the development of new folds.

Out of all protein folds, some seem to have been more successful, as they are used by more sequences than others. The concept of designability has been used to explain this phenomenon [26]. Designability implies that proteins with a higher average number of contacts per residue are more stable and therefore allow a sequence to be further from optimal and still fold correctly (Figure 3). The higher stability of highly designable proteins renders them more evolvable, since they can withstand a higher degree of genetic drift [27].

Figure 3. Higher mountains have wider bases. A conceptual illustration of designability. Structure A, with a higher foldability, also has a larger space of possible sequences to explore while still being able to fold into a native state. Structure B more easily falls below the threshold of foldability and is less likely to evolve a new function. Picture adapted from Goldstein, 2008 [28].

The idea of designability has been supported by a study in yeast in which highly designable proteins had a faster evolution rate [29].

We have seen that proteins evolve, but how and why? Proteins evolve by attaining improved characteristics or acquiring new functionalities that make their host better suited to its environment. This evolution happens through random genetic drift in the DNA. Through mutations, deletions and insertions in the DNA sequence, a protein's amino acid sequence or expression is altered. Some changes are silent (they do not affect the amino acid sequence). Changes that lead to the translation of a different amino acid are likely to be deleterious or neutral to the protein, and only a few actually improve a property such as catalytic rate, temperature stability or substrate specificity. A number of factors influence the rate of this drift: if designability sets the framework for protein evolution, factors such as protein expression level and genetic linkage have been shown to set the rate at which a protein evolves [30].

2 Bioinformatics

The vast amount of data produced by scientists around the world demands efficient and fast tools for handling it. Whether the data come from a genome project, a metabolomics project or any other source, they need to be processed and ordered to get the most out of them. Since the early 1990s, bioinformatics has grown in importance and acceptance, and its tools are now used by scientists in all sectors of biology. Because of its diversity and complexity, the field is challenging to define. Two definitions are presented here, the first from the National Institutes of Health and the second from the National Center for Biotechnology Information:

“Bioinformatics: Research, development, or application of computational tools and

approaches for expanding the use of biological, medical, behavioral or health data, including

those to acquire, store, organize, archive, analyze, or visualize such data.”

(http://www.bisti.nih.gov/docs/CompuBioDef.pdf)

“Bioinformatics is the field of science in which biology, computer science, and information

technology merge into a single discipline. There are three important sub-disciplines within

bioinformatics: the development of new algorithms and statistics with which to assess

relationships among members of large data sets; the analysis and interpretation of various

types of data including nucleotide and amino acid sequences, protein domains, and protein

structures; and the development and implementation of tools that enable efficient access and

management of different types of information.”

(http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html)

Because of the complexity and diversity of bioinformatics, only those areas that have been touched upon in this work are presented: the related areas of multiple sequence alignment and phylogenetic analysis. Both are instrumental for deciphering the relationships between different organisms and proteins.


2.1 Multiple sequence analysis 

Multiple sequence alignments (MSAs) are widely used as starting points for protein function

and structure prediction, phylogenetic studies and other common tasks in sequence analysis

of proteins and DNA. They also aid in experimental design by revealing conserved residues

of potential functional importance.

Since the first version of the most commonly used program, ClustalW, was introduced in 1988 [31], many new programs have been developed, especially in the last half decade. These have all claimed to be either more accurate or faster than previous methods. The early programs align sequences on the basis of sequence alone. This can be done relatively quickly, for example with MAFFT [32] or MUSCLE [33], but as the sequence identity within a dataset drops below 30%, these methods become increasingly unreliable. Many newer programs therefore incorporate secondary or tertiary structure data into their algorithms to improve alignment quality for distantly related sequences. Profile searches with PSIBLAST [34] and secondary structure predictions from PSIPRED [35] are used by, for example, PROMALS [36], PRALINE [37] and SPEM [38], while other programs use real structural data (Expresso [39]) or both predicted and real structural data (PROMALS3D). The secondary structure elements, α-helices and β-sheets, are more important for structural stability than loops, so these newer programs can place insertions and deletions in loop regions, which makes them more accurate even when sequence similarity is low.
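To make the 30% threshold concrete, the sketch below computes the percent identity of two already aligned sequences. It is one possible definition, chosen here as an assumption: identities are counted only over columns where neither sequence has a gap. Other definitions (for example, dividing by the alignment length or by the shorter sequence length) give different numbers, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are ignored in this definition
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy example: 6 identical residues out of 8 compared columns.
identity = percent_identity("MKT-AYLGA", "MRTQAYVGA")
print(identity)
```

A pair scoring below roughly 30 in such a calculation would fall in the range where, according to the text, sequence-only alignment methods become increasingly unreliable.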

Quality control of MSA programs is done by comparing their alignments against reference benchmarks ("gold standards") such as BAliBASE [40], HOMSTRAD [41], Prefab [42] and SABmark [43]. By comparing the gold standard's true alignment with the alignment produced by an MSA program, the accuracy of the program can be assessed.
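One common way to express such a comparison is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The following Python sketch assumes this definition; the function names are hypothetical, both alignments must list the same sequences in the same order, and the individual benchmarks may define their scores differently.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column in an alignment.

    Each residue is identified by (sequence index, position in the ungapped sequence).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        residues = []
        for seq_idx, symbol in enumerate(column):
            if symbol != gap:
                residues.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        # Every pair of residues in the same column counts as aligned.
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_accuracy(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


reference = ["AC-GT", "ACAGT"]
test = ["ACG-T", "ACAGT"]
accuracy = sum_of_pairs_accuracy(test, reference)
print(accuracy)
```

In this toy case the test alignment misplaces one of the four reference pairs, so three quarters of the reference pairs are recovered.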

MSA programs can be divided into two categories, matrix-based and consistency-based. The matrix-based algorithms, such as ClustalW [31], MUSCLE [42] and SPEM [38], use substitution matrices to find the cost of matching two symbols or profiles. This method only takes the symbol or its immediate surroundings into account. The consistency-based methods, such as T-Coffee [44], MAFFT [32], ProbCons [45] and PROMALS [36], use a combination of local and global alignments to make a position-specific substitution matrix. Incorporating this additional information comes at a computational cost, which generally makes consistency-based methods slower.
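The matrix-based idea can be illustrated by scoring a pairwise alignment with a substitution matrix. The sketch below is a simplification under stated assumptions: the matrix values are hypothetical toy numbers (not BLOSUM or PAM), only four residue types are covered, and a fixed per-column gap penalty is used, whereas real programs typically use more elaborate gap models and score profiles rather than single sequences.

```python
# Toy substitution scores for four residues (hypothetical values, not BLOSUM or PAM).
TOY_MATRIX = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6,
    ("L", "I"): 2, ("A", "L"): -1, ("A", "I"): -1,
    ("A", "D"): -2, ("L", "D"): -4, ("I", "D"): -3,
}


def substitution_score(a, b, matrix=TOY_MATRIX):
    """Look up the score for matching residues a and b (the matrix is symmetric)."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def alignment_score(aligned_a, aligned_b, gap_penalty=-5, matrix=TOY_MATRIX):
    """Sum column scores; each column depends only on the two symbols in it."""
    total = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            total += gap_penalty
        else:
            total += substitution_score(a, b, matrix)
    return total


score = alignment_score("ALD-", "AIDA")
print(score)
```

Because each column is scored independently of all others, the score carries no information about how consistent this pairing is with alignments to other sequences, which is the information the consistency-based methods add.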

Choosing the right software for a given set of sequences can be difficult for the inexperienced user. To avoid some common pitfalls, one first needs to decide what matters most: biological accuracy, speed or memory usage. Parameters such as the number of sequences, their length and the degree of homology within the dataset also influence the choice of program.

As mentioned above, different programs use different means to achieve an alignment. Figure 4 shows the relationship between the results, rather than the methods, of different alignment algorithms on a given gold standard, reflecting how similar the algorithms are in terms of the alignments they produce.

Figure 4. Method tree. A tree showing the clustering of some multiple sequence alignment methods (leaf labels in the figure: ClustalW, Dialign, ProbCons, Muscle 6, T-Coffee 2, ginsi, finsi, muscle 3.52, PCMA, fftnsi, fftns2, fftns1, Poa-local, Poa-global, Dialign-T). fftns1, fftns2, fftnsi, finsi and ginsi are different versions of MAFFT v5.531. Pairwise distances were calculated using the HOMSTRAD [41] benchmark by computing differences in the Sum-of-Pairs scores produced by the individual methods. Methods in bold are included in the M-Coffee suite. M-Coffee uses these different algorithms and combines the results through T-Coffee to make an MSA [46]. Picture adapted from Wallace et al., 2005 [46].

2.1.1 An example of a multiple sequence alignment algorithm

To help the reader grasp how an alignment algorithm works, the outline of the MUSCLE algorithm [33,42] is presented below.

MUSCLE is a multiple sequence alignment program that is similar to MAFFT [32,47,48] in accuracy and speed, but slightly faster on large datasets, e.g. >5000 sequences. Although the two are similar in performance, the algorithms behind them differ. MUSCLE uses a three-step process (Figure 5): first building a rough guide tree, then making a more accurate tree, and finally iterative refinement. A more detailed description is given below:

1. 1.1 Computing the k-mer distance, a k-mer is a short string of length k, and related

sequences share more k-mers than expected by chance. The k-mer distance is a

measurement of have many k-mers two sequences have in common. 1.2 These

distances are made into a distance matrix and a guide tree is constructed. 1.3 A

progressive alignment is built by following the order of the branching in the first

guide tree.

2. In stage 2 the first tree is improved and a new progressive alignment made. 2.1 From

the alignment in 1.3 the Kimura distance49 is computed. The Kimura distance is a

fast scoring method for protein similarity that ignores gaps and only counts identities.

A score, S is calculated by dividing the number of identities, m by the number of

positions, npos. The distance, d is then calculated through a series of manipulations

(vide infra)

nposmS =

SD −=1

( )22.01ln DDd −−−=

2.2 From the new Kimura distance matrix a new tree is constructed that dictates the

order of the progressive alignment (2.3).

3. Refinement. 3.1 Tree 2 (Figure 5) is cut in two and a new alignment is made for each

of the two sub-trees (3.2). 3.3 These two sub-alignments are, in turn, realigned to each

other. 3.4 If the Sum-of-Pairs score (SP, a score that takes gaps into account) of the new

alignment is better than that of the old one, the new alignment is kept and enters a new

iteration from point 3; otherwise it is discarded. The iterative process continues until

convergence or until a predetermined limit is reached.
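The two distance measures used in stages 1 and 2 can be sketched in a few lines of Python. This is only an illustration and not the MUSCLE implementation: the function names are hypothetical, the k-mer distance is taken here as one minus the fraction of shared k-mers (an assumed but common definition), and MUSCLE's use of a compressed amino-acid alphabet is left out. The Kimura distance follows the three equations given in step 2.1.

```python
from collections import Counter
from math import log


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x: str, y: str, k: int = 3) -> float:
    """One minus the fraction of k-mers shared by two unaligned sequences."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum(min(n, cy[kmer]) for kmer, n in cx.items())
    possible = min(len(x), len(y)) - k + 1
    if possible <= 0:
        raise ValueError("sequences must be at least k residues long")
    return 1.0 - shared / possible


def kimura_distance(a: str, b: str) -> float:
    """Kimura distance between two rows of an alignment (gapped columns ignored)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(p, q) for p, q in zip(a, b) if p != "-" and q != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    m = sum(p == q for p, q in pairs)  # number of identities
    S = m / len(pairs)                 # fractional identity
    D = 1.0 - S
    inner = 1.0 - D - 0.2 * D * D
    if inner <= 0:
        raise ValueError("sequences too divergent for the Kimura correction")
    return -log(inner)
```

Note that the k-mer distance needs no alignment at all, which is why it can be used to build the first guide tree, whereas the Kimura distance requires the aligned rows produced in step 1.3.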


Figure 5. The flow of the MUSCLE algorithm. The picture is from Edgar.42

The more advanced programs that use secondary or tertiary structures include extra steps in

their algorithms. The program PROMALS, which uses profiles in its algorithm, works much

like MUSCLE up to step 1.3 in Figure 5. After that it quickly aligns sequences whose sequence

identity lies above a predetermined threshold value, producing groups of pre-aligned

sequences that are relatively divergent from each other. In the second step one sequence

from each group is chosen, and PSI-BLAST and PSIPRED are used to build secondary-

structure-based profiles. These profiles are then used to increase the quality of the

consistency scoring.36

2.3 Phylogenetics 

One of the reasons for making multiple sequence alignments is their use in

inferring phylogenetic relationships (Figure 6). As in the case of the multiple sequence

alignment programs, there is a multitude of phylogenetic, or at least tree-making, programs

based on different algorithms and methods.

The two main techniques for inferring phylogenies from proteins are the distance-based and

the character-based methods. The distance methods are usually faster but have some

drawbacks when it comes to making true phylogenies since they are only based on a distance

matrix (i.e. only using part of the information present). The UPGMA method (Unweighted


Pair Group Method with Arithmetic mean) is the simplest and fastest method, with the

drawback of being sensitive to differences in evolutionary rates and branch lengths. It is

often used as an initial tree builder in multiple sequence alignment programs due to its speed

(for example in the MUSCLE algorithm, vide supra). The other commonly used distance

method is the Neighbour Joining method (NJ).50 Unlike UPGMA, the NJ method does not

require the data set to be ultrametric (i.e. it does not require all lineages to have evolved

equal distances from the root, and thus allows different rates of evolution). Another difference

is that NJ starts from a star-like tree and

renders an unrooted tree, while UPGMA produces rooted trees.
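As an illustration of how a distance method works, the UPGMA procedure can be sketched as below. This is a minimal sketch under stated assumptions, not the code of any of the programs discussed: the function name and the input format (a dictionary keyed by pairs of taxon labels) are hypothetical, branch lengths are not computed, and ties are broken arbitrarily.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa with UPGMA and return the rooted topology as nested brackets.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) to the distance between taxa a and b
    """
    # Each active cluster maps its label to the number of taxa it contains.
    sizes = {name: 1 for name in names}
    d = dict(dist)
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        na, nb = sizes.pop(a), sizes.pop(b)
        # The distance to every remaining cluster is the size-weighted mean.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))
```

Because the closest pair is always joined first and distances are simply averaged, a fast-evolving lineage tends to be pushed towards the root, which is the rate sensitivity mentioned above.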

Figure 6. A typical workflow in the phylogenetic analysis of protein or DNA sequences.

Starting by choosing a dataset, aligning the dataset and finally inferring phylogenetic

relationships from the multiple sequence alignment.

The character-based methods have the advantage of using the information in the characters

of a sequence as opposed to only distances between sequences. This comes at the cost of

computational time making them more cumbersome for large data sets but generally more

accurate. The three character-based methods are maximum parsimony, maximum likelihood

and Bayesian inference phylogenetics.

The maximum parsimony method tries to minimise the “cost” of the mutational

events needed to explain a given data set. Unlike the other character-based methods it

fits the data without assuming any evolutionary model. The maximum parsimony method is therefore

not a true phylogenetic method but rather a cladistic method, since it does not build trees on

the basis of an evolutionary theory. A drawback of the maximum parsimony method is

that it only uses informative sites, i.e. sites where at least two different characters are present,

each in at least two sequences.
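The definition of an informative site can be made concrete with a short sketch. The function below is hypothetical and assumes the alignment is given as a list of equally long strings in which "-" marks a gap; gaps are simply ignored here, which is one possible convention among several.

```python
from collections import Counter


def informative_sites(alignment):
    """Return the 0-based columns that are parsimony-informative."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    sites = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        # Characters that occur in at least two sequences at this column.
        repeated = [c for c, n in counts.items() if n >= 2]
        if len(repeated) >= 2:
            sites.append(col)
    return sites
```

Columns that are fully conserved, or where only one character is shared by several sequences, are skipped and therefore do not contribute to the parsimony score.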


Maximum likelihood methods, on the other hand, use evolutionary models when

constructing phylogenies and therefore require the use of an appropriate model for a given

dataset. The evolutionary models for amino-acid sequences are mostly derived from

empirical data. An early pioneer was Margaret Dayhoff, the maker of the PAM matrices, which

were constructed by observing differences in closely related sequences.51 Other substitution

matrices soon followed, for example the JTT matrix52,53 and the BLOSUMs,54 which are

supposed to be more accurate for more distantly related proteins.

The BLOSUMs were calculated by comparing conserved local alignments in datasets with

different sequence identities giving rise to a number of matrices suitable for datasets of

different sequence divergence. For example, BLOSUM62 was made from a dataset with a

threshold at 62% identity, while BLOSUM42 had a threshold value of 42%. To complicate

matters, the PAM matrices are numbered in the opposite direction, with low numbers for datasets

of high similarity. There are also some special substitution matrices tailored for

transmembrane regions55 or for chloroplast proteins,56 just to mention a few.

Bayesian inference of phylogeny is also based on the likelihood function but finds a suitable

tree in a different way, using Bayes' theorem. Bayes' theorem

describes how one can update beliefs about a hypothesis in the light of new

data. This requires an a priori guess (hypothesis) about the tree you want to find. Fortunately,

an a priori guess can be that all trees are equally likely. The problem comes in assessing the

probability of all the possible trees. This has been solved numerically, using Markov Chain

Monte Carlo (MCMC) methods that work in a two-step process: (1) a new tree is proposed

by stochastically altering the current tree and (2) the new tree is either accepted or rejected

with a certain probability. This process of perturbing and evaluating trees forms the chain,

and the fraction of times a certain tree comes up in this chain is an estimate of its

posterior probability, i.e. how likely it is to be the best tree (ref.57 and references therein).
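The two-step MCMC process can be illustrated with a toy example. The sketch below is not how MrBayes or any other program is implemented: it assumes a handful of candidate trees whose likelihoods are already known (the labels and numbers are invented for illustration), a flat prior, and a proposal that simply picks any candidate at random instead of perturbing the current tree. The function name is hypothetical.

```python
import random
from collections import Counter


def metropolis_tree_sampler(likelihoods, steps=50_000, seed=1):
    """Estimate posterior probabilities of candidate trees by MCMC.

    likelihoods: dict mapping a tree label to a positive P(data | tree).
    A flat prior is assumed, so the posterior is proportional to the likelihood.
    """
    rng = random.Random(seed)
    trees = list(likelihoods)
    current = rng.choice(trees)
    visits = Counter()
    for _ in range(steps):
        # (1) Propose a new tree; here simply another candidate at random.
        proposal = rng.choice(trees)
        # (2) Accept with probability min(1, L(proposal) / L(current)).
        if rng.random() < likelihoods[proposal] / likelihoods[current]:
            current = proposal
        visits[current] += 1
    # The fraction of steps spent on a tree estimates its posterior probability.
    return {tree: visits[tree] / steps for tree in trees}
```

With three candidate trees whose likelihoods are in the ratio 6:3:1, the chain spends roughly 60%, 30% and 10% of its steps on them, without the probabilities of all trees ever having to be summed explicitly.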

Compared to maximum parsimony methods, the maximum likelihood and Bayesian

inference methods use all sequence information, potentially making them more

accurate, but on the downside they are very CPU-intensive and therefore slow. Bayesian

inference has the advantage over maximum likelihood that it

calculates the probability of a clade as a part of the algorithm, making statistical resampling such as

bootstrapping redundant.

Once the appropriate method has been chosen for a given data set the choice left is what

program to use. For the distance-based methods a program called MEGA58,59 is suitable. It

is both simple to use and has a nice tree-drawing function as a starting point for figure

making. MEGA can also calculate maximum parsimony cladograms. Other programs with

packages of methods like MEGA are PHYLIP60 and PAUP*

(http://paup.csit.fsu.edu/index.html). PHYLIP and PAUP* can also do maximum

likelihood phylogenies. The trees presented in this thesis were all built with PhyML,61 using

maximum likelihood calculations. For Bayesian phylogenetic inference the most popular

program is MrBayes3.62,63 Most of these programs are available as freeware to be downloaded

and run locally or as web-based services that can be run online. PAUP* is the only program

that needs to be purchased.


3 The Carbohydrate‐Active enZymes database, CAZy

The diversity in carbohydrates far exceeds that of any other biopolymer. In nature over 30 different

sugars64 have been found and these can be further modified by non-carbohydrate groups such

as sulphate, phosphate and acyl esters. Even assuming only unmodified carbohydrates, the number

of possible hexasaccharide isomers from D-hexoses reaches a staggering 10¹². The

enzymes responsible for their biosynthesis, modification and degradation therefore face a daunting

task. The work of classifying the carbohydrate-active enzymes was begun around 1990 by

Bernard Henrissat, who compared different glycosyl hydrolases.65,66 The resulting database,

CAZy, is an excellent example of a successful bioinformatic project, and today it is an

instrumental tool for all glyco-scientists; approximately 3000 pages are downloaded daily,

showing the key role sequence classification has in carbohydrate research.

Bernard Henrissat began the CAZy project by ordering 291 sequences into 35 glycosyl

hydrolase families, GH1-GH35, using an unusual method, hydrophobic cluster analysis

(HCA).67 From the original classification of 35 GH families, CAZy has continuously grown

to now contain almost 40 000 ORFs of GHs ordered into a total of 112 GH families (Figure

7).68

Figure 7. The year by year growth of GH ORFs in the CAZy database. Picture from

Davies & Sinnott, 2008.68


Following the initial effort to classify GHs, other types of carbohydrate-active proteins

were added to CAZy. The newer groups are glycosyl transferases (GTs), polysaccharide

lyases (PLs), carbohydrate esterases (CEs) and carbohydrate binding modules (CBMs).

The power of the CAZy database lies in that the classification contains more information

than what is found in an EC number (NC-IUBMB,

http://www.chem.qmul.ac.uk/iubmb/enzyme/). The EC numbers 3.2.1.x account for all

GH activities, as the first three numbers indicate that an enzyme hydrolyses O- or S-glycosyl

linkages. The shortcomings of EC numbers are that they only account for one reaction and

seldom reveal anything about the mechanism of the enzyme. For GHs, which often have

broad substrate specificities, a single EC number is not sufficient. Instead CAZy classifies

enzymes according to their amino acid sequence and fold. This classification enables the user

to infer protein fold, reaction mechanism and catalytic amino acids from other family

members, as well as to get hints about an enzyme's activity.

3.1 Reaction mechanism of glycosyl hydrolases 

The glycosidic bond, especially between two glucose residues, is the most stable bond found in

biopolymers, with a calculated half-life of about 5 million years. Glycosyl hydrolases face a

daunting task in hydrolysing these bonds but have managed to accelerate the reaction by

an impressive 10¹⁷-fold.69 This is accomplished by several means, but one of the more

important features is the catalytic residues within an enzyme's active site. Since Koshland

first laid out the basis of how GHs accomplish hydrolysis back in 1953,70 much work has been

done, but his early theories still hold true.71 Essentially all GH families use one of two

reaction mechanisms, either the inverting mechanism or the retaining mechanism.72
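To put the two numbers quoted above into perspective, they can be converted into rate constants. The short calculation below is only a back-of-the-envelope sketch: it assumes first-order kinetics for the spontaneous hydrolysis and treats the 10¹⁷-fold acceleration as a simple ratio of rate constants. The variable and function names are chosen for illustration.

```python
from math import log

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def first_order_rate_constant(half_life_s: float) -> float:
    """k = ln 2 / t_half for a first-order process."""
    return log(2) / half_life_s


half_life_s = 5e6 * SECONDS_PER_YEAR   # about 5 million years, from the text
k_uncat = first_order_rate_constant(half_life_s)
k_enzyme = k_uncat * 1e17              # 10^17-fold acceleration, from the text

print(f"uncatalysed: {k_uncat:.1e} per second")
print(f"enzymatic:   {k_enzyme:.0f} per second")
```

Under these assumptions the uncatalysed rate constant is of the order of 10⁻¹⁵ s⁻¹, while the accelerated one corresponds to a few hundred turnovers per second.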

The inverting mechanism is a one step mechanism leading to overall inversion of the

anomeric carbon. It uses a general base to activate a water molecule by extracting a proton

and an acid to facilitate the departure of the leaving group by protonation. The transition


state is proposed to be oxocarbenium-ion-like73 (Figure 8a). The catalytic residues are usually

aspartates or glutamates and, even though the distance between them has been shown to

vary considerably,74 a general rule of thumb is that they are ~10 Å apart.

Figure 8. General mechanisms for inverting and retaining glycosidases. a, the inverting

mechanism, a one step mechanism leading to overall inversion of the anomeric carbon.

b, the retaining mechanism, a two step mechanism leading to overall retention of the

anomeric carbon.

The other mechanism used by glycosidases is the retaining mechanism (Figure 8b). It is a

two-step mechanism involving a nucleophile and an acid-base functionality. As in the

inverting glycosidases, the catalytic residues are usually aspartates or glutamates, but with

some variation, as in clan GH-E, where a tyrosine acts as the nucleophile,75 and in GH20 and

GH84, which use substrate-assisted catalysis, where the nucleophile is an acetamido group

from the substrate itself.76 In the retaining mechanism the first step is the attack of the

nucleophile on the C-1 carbon of a sugar ring resulting in a covalent enzyme-substrate

intermediate. The departure of the leaving group is assisted by protonation from the acid-


base. In the second step the acid-base activates a water molecule that subsequently attacks

the C-1 carbon, releasing the product with retention of the anomeric configuration and restoring

the active site. For a more in-depth discussion of the mechanistic properties of glycosyl

hydrolases the reader is referred to a recent review by Vocadlo and Davies.71

3.2 Glycosyl Hydrolase family 16 

Glycosyl hydrolase family 16 is an interesting GH family that uses the retaining mechanism

to cleave glycosidic bonds. It is interesting for several reasons; for one, it has a wide spectrum

of enzyme specificities. The different enzymes of GH16 cleave β-1,4, β-1,3 and mixed β-1,3-1,4

linkages in various glucans and galactans. Even more interestingly, GH16 does not contain

only hydrolytic enzymes. Some enzymes in GH16 have evolved into strict transglycosidases.

While transglycosylation can often be forced upon an enzyme, especially enzymes using the

retaining mechanism with a glycosyl-enzyme intermediate, most members of the xth gene family, a

subgroup of GH16, seem to be strict xyloglucan endo-transglycosidases77 (and paper II)

without any hydrolytic activity. There are also indications of other transglycosylating enzymes

in GH16. These indications come from a group of fungal enzymes called Crh1 and Crh2,

proposed to covalently link chitin with β-1,3 branches of β-1,6 glucans,78,79 suggesting that

the GH16 scaffold might be well suited for constructing strict transglycosidases with novel

substrate specificities.

GH16 shares its β jelly roll fold with the related glycosyl hydrolase family 7. These two

glycosyl hydrolase families share a common ancestry and are ordered into clan B of the

CAZy classification. The enzymes of GH7 mainly act on cellulose, while GH16 enzymes

have, as stated previously, a wider substrate specificity. Phylogenetic analysis of GH16 has

shown that these different substrate specificities can be grouped into separate, distinct clades

(Figure 9).80,81


Figure 9. A tree representation of Clan B, showing its evolution81 with structural

representatives for the different substrate specificities.

3.2.1 Xyloglucan endo‐transglycosidases 

The xyloglucan endo-transglycosidases (XETs) are exclusively found within the kingdom

Plantae, where they act on xyloglucan, a heavily substituted type of glucan with a β-1,4-linked

glucosyl backbone. The smallest unit of xyloglucan is called a xylogluco-oligosaccharide

(XGO). The XGO is commonly Glc₄-based, with xylose attached via α-1,6 linkages to the first

three glucoses counting from the non-reducing end (Figure 10). The xyloses can then be

further substituted in a multitude of ways. To simplify the nomenclature, Fry et al. devised a

coding system for naming the different sidechains back in 1993.82 An unsubstituted glucosyl

residue is named G; if substituted with xylose it is called X. The last relevant substitution

within the scope of this thesis is when the xylosyl residue is further substituted with a


galactosyl residue via a β-1,2 bond, which is denoted L (Figure 10). The L unit can then be further

substituted with additional galactosyl, fucosyl or arabinosyl residues depending on species and

tissue.
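The one-letter code lends itself to a simple lookup. The sketch below covers only the three letters described in the text (G, X and L) and is meant as an illustration of how a name such as XXLG is read from the non-reducing end; the dictionary and function names are hypothetical.

```python
# One-letter side-chain codes described in the text (after Fry et al.).
XGO_CODES = {
    "G": ["Glc"],                # unsubstituted backbone glucose
    "X": ["Glc", "Xyl"],         # glucose carrying an alpha-1,6 xylose
    "L": ["Glc", "Xyl", "Gal"],  # as X, with a beta-1,2 galactose on the xylose
}


def describe_xgo(code: str):
    """Expand an XGO code, read from the non-reducing end, per backbone unit."""
    try:
        return [XGO_CODES[letter] for letter in code]
    except KeyError as err:
        raise ValueError(f"unsupported side-chain code: {err.args[0]}") from None


def sugar_composition(code: str):
    """Count the monosaccharides in an XGO given by its code."""
    composition = {}
    for unit in describe_xgo(code):
        for sugar in unit:
            composition[sugar] = composition.get(sugar, 0) + 1
    return composition
```

For XXLG this gives four glucoses, three xyloses and one galactose, matching the oligosaccharide shown in Figure 10.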

Figure 10. A xylogluco-oligosaccharide, XXLG, with the non-reducing end to the left

and the reducing end to the right.

Xyloglucan is the main hemicellulose in the primary cell wall of dicotyledons and

non-commelinoid monocotyledons, where it crosslinks cellulose microfibrils.83-85 It is also present

in other plant species but to a lesser extent; in the commelinoid monocotyledons, for example,

the crosslinking function is instead mainly held by glucuronoarabinoxylan

(GAX) and β-1,3-1,4 glucans.86-88 In some species xyloglucan is also used as a storage

polysaccharide,89,90 and it is produced on a large scale from tamarind seeds (Tamarindus indica).

3.2.2 XETs' physiological role in plant cell walls 

XETs are, like xyloglucan, mainly found in the primary cell wall of plants. This thin cell wall of

growing or fleshy tissues is made up of strength-bearing thin cellulose microfibrils in a matrix

of crosslinking glucans, pectins and some structural proteins (ref 85 and references therein).

The primary cell wall serves many purposes, such as defining cell shape, bearing load,

acting as a first defence against plant pathogens and allowing the passage of small molecules.

As this is the cell wall of growing cells it needs to allow the cells to expand. Some of the

functions of the cross-linking glucans are to prevent the cellulose microfibrils from aggregating, and

to make the cell wall resilient, yet sufficiently strong to remain intact.


Plant cells grow either by tip growth or by a more general diffuse growth where the cell

grows in all directions. The diffuse growth of plant cells is governed by an increase in turgor

pressure accompanied by a decrease in apoplastic pH and can be described as a controlled

polymer creep.85 The XETs act in the diffuse growth by transiently weakening the

interactions between cellulose microfibrils, allowing them to slide with respect to each other

as the cell wall expands. At the same time new cell wall material is synthesised and deposited

into the growing wall to maintain its thickness. XETs accomplish this transient weakening

by a cleavage and religation process. They cleave a xyloglucan chain, but instead of

being hydrolysed by water, the stable glycosyl-enzyme intermediate can only be

broken down by a new xyloglucan chain, resulting in a new crosslink. It has been proposed,

and shown, that at least some XETs can use acceptors other than xyloglucan, but at reduced

rates91 (and paper II). The biological importance of this activity is, however, not clear and

it might be irrelevant given the low rate of these reactions.

Plants have several xth genes in their genomes and at this early stage in plant genomics, when

only a few plant genomes have been published, it seems that plants regardless of origin have

20-50 xth genes in their genomes: thale cress (Arabidopsis thaliana) has 33 xth genes,92 the

first monocot genome, from rice (Oryza sativa), has 29 xth genes93 and the first tree genome,

from black cottonwood (Populus trichocarpa), has 41 xth genes.94 The differential expression of

the Arabidopsis xth genes has been investigated in detail using GUS fusions, showing their

spatial and temporal expression patterns.95

The first XET protein structure solved was that of an XET from Populus tremula × tremuloides

(PttXET16-34, PDB ID 1UMZ), in 2002 by Johansson et al.77 Compared to other GH16

enzymes, XETs have a C-terminal extension that elongates the acceptor side of the active site

cleft. A ligand structure with an XGO bound in the positive sub-sites of PttXET16-34

showed that the elongation was necessary to accommodate a full XGO.77 The second

structure of an xth gene product was the mainly hydrolytic xyloglucan endo-hydrolase (XEH),

TmNXG1 (paper II) from the ornamental plant nasturtium (Tropaeolum majus). This is the

first and so far only xth gene product that has been shown to have hydrolytic activity.96-98

Earlier phylogenetic analysis on xth gene products divided the xth gene product family into

three different subgroups, I, II and III with TmNXG1 in subgroup III.99,100 This grouping


has recently been revised, merging groups I and II and splitting group III into IIIA and IIIB,

with the XEH activity restricted to subgroup IIIA (paper II).

3.3 Carbohydrate Esterase family 8 

Carbohydrate esterases generally hydrolyse an ester into an acid and an alcohol. The alcohol

group can be contributed by the sugar, as in the C-2 and C-3 acetyl groups of xylan, or the alcohol

can be exchanged for an amine, as in the N-acetyl groups of chitin.101 In carbohydrate esterase

family 8, CE8, of the CAZy database all characterised enzymes work on one substrate. This

is the plant polysaccharide homogalacturonan, where the acid group of the ester originates

from the sugar. These enzymes are called pectin methylesterases (PMEs, EC 3.1.1.11).

The members of CE8 can be divided into groups depending on their kingdom of origin, e.g.

plantae, eubacteria or fungi (Figure 11, paper I and ref.102). The bacterial and fungal proteins

of CE8 are either from plant cell wall-degrading organisms or from plant pathogens. The

plant enzymes are either responsible for building/remodelling the cell wall or for degrading it,

as in fruit ripening.103 To this day no proteins of archaeal origin have been found belonging to

carbohydrate esterase family 8.


Figure 11. Maximum likelihood tree showing the tree topology of all CE8 entries in CAZy.

The plant clade names are from Markovič and Janeček.102 The tree is a version of the tree

presented in paper I.

3.3.1 Pectin methylesterases' role in plant cell walls and pectin degradation 

Pectin is a collective name for a diverse collection of plant polysaccharides that are found in

the primary cell wall and in the middle lamella.104 The common denominator for pectins is

that they are extractable from the cell wall with hot water. Pectin contains four major

polysaccharide domains called homogalacturonan (HGA), rhamnogalacturonan-I (RGI),

rhamnogalacturonan-II (RGII) and xylogalacturonan (XGA). While RGI and RGII are

complex polysaccharides with extensive branching, HGA is an unbranched homopolymer

consisting of α-1,4-linked galacturonyl residues.105 HGA can be modified with acetyl groups at the

C-2 and C-3 positions. XGA has a structure similar to HGA but is thought to be closely

connected to RGI, and it is branched with single xylosyl residues, β-1,3-linked to the

galacturonan backbone. Exactly how these different pectin polysaccharides are connected to

each other is not known, and an old consensus that HGA, RGI and RGII composed the


backbone of pectin was challenged in 2003 by Vincken who argued that HGA was in fact a

very long sidechain of RGI.106

Pectin is synthesised in the Golgi, where HGA becomes heavily methylesterified before it

is released into the cell wall.107 The properties of HGA are profoundly dependent on

postsecretory changes to its structure caused by enzymes in the cell wall. At least three different

types of enzymes act on homogalacturonan in muro. These are the PMEs; the

polygalacturonases (PGs), which hydrolyse the α-1,4 bond between two galacturonyl residues;

and the pectate lyases (PLs), which break α-1,4 bonds like PGs but through an elimination

reaction, yielding a non-reducing end made up of a 4-deoxy-α-D-galact-4-enuronosyl residue.

PMEs expose the carboxylate of galacturonic acid by releasing protons and methanol,

thereby making HGA negatively charged. Most plant PMEs are believed

to produce contiguous stretches of free carboxyl groups, i.e. to work in a block-wise fashion.108

The negatively charged groups of different pectin chains can then be crosslinked via Ca²⁺

ions through “egg boxes” (Figure 12).109,110 This stiffens111 the cell wall, restricting cell growth,

or glues112 cells together as in the middle lamella. The action of PMEs can also weaken the

cell wall. A random de-esterification pattern does not promote “egg box” formation;

instead, the reduction in apoplastic pH due to the released protons inhibits PME activity and

activates pectin-degrading enzymes such as polygalacturonases and pectate and pectin lyases.

In addition, many pectinolytic enzymes are more efficient in cleaving demethylated HGA

(Figure 12).113


Figure 12. The action of PMEs in cell walls. Random de-esterification promotes pectin

degradation while block-wise de-esterification promotes cell wall stiffening.

The action pattern of fungal PMEs differs from that of plant PMEs (although no plant PME

from the fungal-like plant clade X2 (Figure 11) has been characterised), as fungal PMEs have been

shown to work in a random fashion.114 A random demethylation pattern prevents the

formation of Ca²⁺ egg boxes and promotes degradation by pectinases. Even though

bacterial PMEs have been proposed to work like fungal PMEs, crystallographic data for

PmeA from E. chrysanthemi suggest that at least this bacterial PME works in a processive

manner, preferring esterified galacturonyl residues at the -1 subsite and a free carboxyl group

at the +1 subsite.115 It is possible, however, that different bacterial enzymes work in different

fashions, either in a block-wise, random or mixed mode.


Most plant PMEs have an extra N-terminal domain. This domain is called the pro-domain

and shares sequence similarity with pectin methylesterase inhibitors (PMEIs) found in

plants.116 The function of the pro-domain is unknown, but it has been proposed to be a

folding chaperone or to act as a PMEI during transport to the cell wall, where it is cleaved off,

since PMEs found in the cell wall only contain the catalytic domain. Whether the final

destination of the pro-domain is the cytosol or the cell wall is unknown, but it is likely cleaved

off at a conserved di-basic KEX2-like protease site situated in a linker region between the

two domains.102

Plant PMEs can be further divided by their pI. Most plant PME isoforms have a

basic pI: two thirds of the PMEs in Arabidopsis have a pI over 8, and only a few have a pI lower

than 6. The pI of plant PMEs has been proposed to have important effects on the mode of

action of the enzymes; while acidic plant PMEs can be extracted from plant

cell walls with water, basic isoforms need harsher conditions for extraction.117 Since the negative charge

of the cell wall is mostly due to homogalacturonan, the substrate of PMEs, one can

hypothesise that basic PMEs might have additional binding sites for homogalacturonan,

aiding them in making egg box structures, and that this could be the reason for their

block-wise action pattern.

3.3.2 The pectin methylesterases 

The pectin methylesterases of CE8 all act on the C-6 methylesterified galacturonyl residues

of homogalacturonan, producing methanol, galacturonyl residues and protons. The

mechanism is not that of the common catalytic triad118 used by many esterases (Asp, His and

Ser); instead it resembles that of the aspartic acid proteases, with two carboxylic acids acting

as a nucleophile and an acid-base, respectively. The details of the PME mechanism have been

debated in the past, but recent work by Fries et al., 2007 strongly suggests that the mechanism

goes through a covalent enzyme-substrate intermediate. It begins with a nucleophilic attack

by Asp199 on the C-6 carbonyl carbon of a galacturonyl residue (1, Figure 13; Erwinia

chrysanthemi numbering, PDB ID 2NSP). A tetrahedral intermediate is formed with a negative

charge on the carbonyl oxygen. This negative charge is stabilised by the oxyanion hole made up

by Gln177 and Asp178 (2, Figure 13). The first leaving group, methoxide, is released, aided by


protonation from the acid-base, Asp178 (3, Figure 13). Asp178 sits in a hydrophobic

environment, making the deprotonated state unfavourable. Having donated its proton, it therefore acts as a base,

activating water to break the anhydride, again creating a negative charge on the carbonyl

oxygen. The tetrahedral intermediate breaks down, restoring the active site and releasing a

demethylated galacturonyl residue (4-6, Figure 13).

Figure 13. The proposed mechanism of CE8 pectin methylesterases.115

Pectin methylesterases have their right-handed β-helix structure in common with other

pectinolytic enzymes such as pectate and pectin lyases, polygalacturonases and


rhamnogalacturonases.119,120 In CAZy, there are three PME structures belonging to CE8 (not

counting the structure in paper I). Two are of plant origin, one from carrot121 and one from

tomato in complex with a pectin methylesterase inhibitor (PMEI).122 The third is from the

bacterial plant pathogen E. chrysanthemi.120 One striking difference between the plant and

bacterial enzymes is that in the bacterial PmeA from E. chrysanthemi, elongated loops line both

sides of the active site cleft (Figure 14).

Figure 14. a; Superposition of PmeA (silver) from E. chrysanthemi and a PME from

carrot, Daucus carota (gold) (PDB ID 1QJV and 1GQ8, respectively). b; Structure of

PME1 from tomato, Solanum lycopersicum (green), in complex with a pectin methylesterase

inhibitor from kiwi, Actinidia chinensis (beige) (PDB ID 1XG2). The catalytic residues are

shown as sticks.

These elongated loops of the bacterial PMEs could be an adaptation to avoid detection by

plant pectin methylesterase inhibitors (PMEIs). Plants secrete PMEIs into their cell walls to

regulate the activity of their own PMEs in muro. Plant PMEIs have been shown to inhibit

most plant PMEs but not fungal or bacterial PMEs.123,124 When the complex structure of a

plant PME and an inhibitor was solved in 2004122 the mechanism behind the inhibition

became clear (Figure 14b). The inhibitor binds to the active site cleft, mainly via polar

interactions and the reason why bacterial enzymes are not inhibited is most likely because of


the extended loops lining the active site cleft (Figure 14a) making an interaction impossible.

Loop extension is a common feature in the arms race between plants and pathogens to avoid

detection by inhibitors.125


Present investigation 


4.1 Paper I: A phylogenetic analysis of CE8 locates a new bacterial sub-clade and the structural determination of E. coli YbhC.

The phylogenetic analysis of the full carbohydrate esterase family 8 revealed a new bacterial

sub-clade previously not noted by others working on CE8.102,126 This new sub-clade
contained outer membrane lipoproteins exclusively from Gram-negative bacteria (YbhC and
PmeB sub-clades, Figure 15). In comparison with the other pectin methylesterases of CE8,
the YbhC sub-clade had some interesting sequence features that caught our attention. Apart
from major differences in loop regions, several important amino acids conserved in CE8 were
different, including the acid-base in the proposed mechanism115 of PMEs.
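The comparison described above, checking whether a residue conserved across a family is replaced in one sub-clade, can be illustrated with a minimal Python sketch. It assumes the sequences are already aligned and are held as equal-length strings. The function names, the toy fragments and the choice of column are hypothetical and are not taken from paper I.

```python
def residues_at_column(alignment, column):
    """Map each sequence name to the residue it carries at one alignment column."""
    return {name: seq[column] for name, seq in alignment.items()}


def sequences_lacking_residue(alignment, column, expected):
    """Return the names of sequences whose residue at `column` differs from `expected`."""
    observed = residues_at_column(alignment, column)
    return sorted(name for name, residue in observed.items() if residue != expected)


# Invented toy fragments, NOT real CE8 sequences. Column 3 (0-based) plays
# the role of a conserved catalytic position, here an aspartate ("D").
toy_alignment = {
    "typical_member_1": "AAGDKLLT",
    "typical_member_2": "AAGDRLLT",
    "divergent_member": "AAGNKLLT",
}

print(sequences_lacking_residue(toy_alignment, column=3, expected="D"))
# ['divergent_member']
```

In practice the column index would have to be mapped from the residue numbering of a reference structure onto the alignment, a step this sketch leaves out.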

Figure 15. A maximum likelihood phylogram of the publicly available CE8 sequences in
November 2007. The plant clades are named according to Markovič and Janeček.102 Clades
labelled in the figure: Bacterial clade; Fungal clade; Plant clades I-IV & X1; Plant clade X2;
YbhC sub-clade; PmeB sub-clade.

While PME activity most likely needs an acid-base to aid the departure of the methoxide
group, the previously indicated thioesterase activity127 does not necessarily need the assistance of
an acid-base for leaving group departure.128 This activity on palmitoyl-CoA could, however,
not be reproduced, nor could any general esterase activity (on acyl pNPs of various lengths)
or PME activity be demonstrated. Because YbhC is closely related to the experimentally verified PME
PmeB,129 it was also tested for pectin binding, but to no avail.

Figure 16. a, the structure of E. coli YbhC. b, Surface representation of YbhC (carbon,

yellow; nitrogen, blue, oxygen, red).

The genes encoding the proteins of the YbhC sub-clade all reside alone in their operons, which
gave no clue as to their function. Instead, clues to a putative function in the biosynthesis of
murein or in cell division came from clustering of transcriptomic data (http://genexpdb.ou.edu),
where the top hits were all involved in murein synthesis and turnover. The crystal structure of
YbhC also gives some clues to the characteristics of a potential substrate. While PMEs have a
cleft-like active site, YbhC has one end of the cleft cut off by a long blocking loop and a large
insert (Figure 16). The closed end of the cleft is very hydrophobic, with mainly aliphatic side
chains, while the open end looks more like that of a PME, indicating that a substrate should have a
hydrophobic part and a hydrophilic part (Figure 16b). Such substrates present in the
periplasm could be lipids, lipoproteins or some quorum sensing molecules, to mention but a
few. Whatever the function of YbhCs might be, the crystal structure of the E. coli YbhC lays
the basis for further characterisation of these proteins.


4.2 Paper II: Investigation of the GH16 xth gene family clarifies the determinants for transglycosylation versus hydrolysis within the family by exploring the new 3D structure of TmNXG1 and restricts the hydrolytic activity to a specific sub-clade.

It has long been known that two different activities exist within the xth gene
family. The hydrolytic activity96 was discovered early, but it is still confined to a
single characterised member, namely TmNXG1, while all other characterised members have been
transglycosidases, XETs (29 according to CAZy). The underlying mechanisms for
transglycosylation versus hydrolysis are interesting from both a fundamental research
perspective and a biotechnology perspective. Therefore, the previous phylogenetic work
done on the xth gene product family was revisited. Our analysis included more sequences
than previous work, and we could conclude that the old grouping was misleading. The old
families I and II were merged in our tree, and the old family III was clearly divided into two
sub-clades, IIIA and IIIB (Figure 17). This revised grouping was supported by high bootstrap
values but also by enzyme characteristics. While all characterised members of groups I and II
had been shown to be XETs, family III had conflicting activities, with XETs in IIIB and the
xyloglucan endo-hydrolase (XEH) TmNXG1 in group IIIA.


Figure 17. Unrooted phylogenetic tree of ca. 130 full length xth gene products and Bacillus

licheniformis lichenase (1GBG, GenPept CAA40547). Bootstrap values from 100

Maximum Likelihood resamplings are indicated.
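The bootstrap values in Figure 17 come from resampled data sets. As a minimal sketch of the resampling step only, the Python code below draws alignment columns with replacement to build pseudo-replicate alignments. It assumes the standard non-parametric bootstrap, which the caption does not spell out. The function names and the toy sequences are hypothetical, and the maximum likelihood tree search that would be run on each replicate is left to an external phylogeny program.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap pseudo-replicate).

    `alignment` maps sequence names to aligned sequences of equal length.
    The same column draw is applied to every sequence, so columns stay intact.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_columns = lengths.pop()
    columns = [rng.randrange(n_columns) for _ in range(n_columns)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Build `n_replicates` pseudo-replicate alignments from one seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Invented toy alignment, NOT real xth gene product sequences.
toy_alignment = {"seqA": "MKTAYIAK", "seqB": "MKSAYLAK", "seqC": "MRTAYIGK"}
replicates = bootstrap_replicates(toy_alignment, n_replicates=100, seed=1)
print(len(replicates))  # 100
```

A tree is then inferred from each replicate, and the bootstrap value of a branch is the percentage of replicate trees in which the same grouping recurs.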

To elucidate why TmNXG1 is hydrolytic, and following the earlier success of the lab with
PttXET16-34,77 a group I-II member, TmNXG1 was also produced and crystallised. A loop
region conserved in family IIIB looked to be a promising target for mutagenesis, and
therefore a chimera was made using TmNXG1 as the scaffold and the loop from PttXET16-34.
In order to characterise these proteins in detail, a new HPLC-based assay was developed
that could measure both hydrolysis and transglycosylation simultaneously.

Interestingly, the chimera TmNXG1-ΔYNIIG showed intermediate characteristics, with a
transglycosylation rate in between those of TmNXG1 and PttXET16-34 and a hydrolytic rate


close to 6 times lower than that of the parental scaffold TmNXG1 under saturating substrate

conditions (Figure 18).

Figure 18. Initial rate kinetics of TmNXG1 (A), TmNXG1-ΔYNIIG (B), and PttXET16-

34 (C) as a function of XGO2 concentration. Closed circles, rate of XGO3 production

due to transglycosylation (2 XGO2 → XGO3 + XGO1); open squares, total rate of XGO1

production; closed squares, corrected rate of XGO1 production obtained by subtracting

contribution of XGO1 release due to substrate transglycosylation. This observed rate is

twice the actual catalytic rate, according to the stoichiometry of the hydrolysis reaction

(XGO2 → 2 XGO1).
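The stoichiometric correction described in the caption of Figure 18 can be written out as a short Python sketch. It assumes that every transglycosylation event (2 XGO2 → XGO3 + XGO1) releases exactly one XGO1, and that every hydrolysis event (XGO2 → 2 XGO1) releases two. The function names are hypothetical, and the numbers in the example are arbitrary illustrations, not measured rates.

```python
def hydrolysis_rate(total_xgo1_rate, xgo3_rate):
    """Catalytic hydrolysis rate implied by the Figure 18 stoichiometry.

    The XGO3 rate equals the rate of XGO1 release by transglycosylation, so it
    is subtracted from the total XGO1 rate. Each hydrolysis event yields two
    XGO1 molecules, so the corrected XGO1 rate is halved.
    """
    corrected_xgo1_rate = total_xgo1_rate - xgo3_rate
    if corrected_xgo1_rate < 0:
        raise ValueError("XGO3 rate cannot exceed the total XGO1 rate")
    return corrected_xgo1_rate / 2.0


def transglycosylation_to_hydrolysis_ratio(total_xgo1_rate, xgo3_rate):
    """Ratio of the transglycosylation rate to the catalytic hydrolysis rate."""
    hydrolysis = hydrolysis_rate(total_xgo1_rate, xgo3_rate)
    if hydrolysis == 0:
        raise ValueError("no hydrolysis detected; the ratio is undefined")
    return xgo3_rate / hydrolysis


# Arbitrary illustrative rates in the same units (for example 1/min).
print(hydrolysis_rate(total_xgo1_rate=5.0, xgo3_rate=1.0))  # 2.0
print(transglycosylation_to_hydrolysis_ratio(5.0, 1.0))     # 0.5
```

Keeping the two corrections in separate steps mirrors the caption: the closed squares in Figure 18 correspond to the corrected XGO1 rate before halving.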

The presence of xyloglucan endo-hydrolases (group IIIA members) in plants is not only for
breaking down xyloglucan in germinating seeds. In fact, only a few species use xyloglucan as
a storage polysaccharide; instead, it seems that group IIIA members are expressed in
fast-growing tissues. The Arabidopsis group IIIA members XTH31 and XTH32 are expressed in
the root elongation zone and in the shoot apex, respectively,95 both being fast-growing or
dividing tissues.

[Figure 18 plot panels: TmNXG1, TmNXG1-ΔYNIIG and PttXET16-34; x-axis, [Glc8-based XGO2]; y-axis, v0/[E0] (1/min).]


Concluding remarks 

The work presented in this thesis has led to a better understanding of the determinants for
different activities in glycosyl hydrolase family 16 and carbohydrate esterase family 8, as well as
shedding some light on the evolution of these two families. CAZy contains hundreds of
families, and while the methods applied in this thesis are applicable to all of them, the
predictive power is probably greater in some families, especially those working on polymeric
substrates.

The comparison of the phylogenetic analyses of CE8 and GH16 in this work with previous
analyses done by others clearly shows that incorporating more sequences into an analysis
improves the quality of the outcome. Selecting only a few sequences without knowledge of
how they are related can lead to false conclusions, such as the old family grouping of the xth
gene product family discussed in paper II.

While the speed of genome sequencing has increased, the annotation of a genome remains
challenging and time-consuming. The previous misannotation of YbhC as a PME, addressed in
paper I, is an example of the challenges facing electronic annotation. Continuing the functional
annotation of genes is important, and the methodology used in this thesis can predict, and has
predicted, which proteins are interesting to study from a functional divergence perspective.


Acknowledgement 

It all began with dissected radios on the living room table and with small experiments such as
spraying cold water on hot light bulbs. Yes, you're right, they do explode, but who would
know if no one tried? I think dad thought Emil i Lönneberga was far too kind a nickname
and probably swore behind grinning teeth...

My inspiration to explore nature came as a result of walks in the woods near our
summer house with my grandmother, and from digging for flint tools in the fields of my
aunt's summer house in Bohuslän. A less obvious source of inspiration was that
elusive uncle of mine, who was always somewhere else: weighing the largest southern elephant
seal ever, close to Antarctica (I think he was called Stalin, a bad boy), or diving with sperm
whales around Galapagos, or hunting down wolves in Canada. Once home, he taught me
about plants and birds and took me on excursions. On one of these trips, my sister and I
saw a jumping sperm whale, but no one believed us. “They’re not supposed to jump around
Lofoten” someone said… An early sign of global warming perhaps?

In all these years I never said thanks. So here it comes: grandma, thanks, wherever you
are, and Tom, thanks for everything; you are the reason I ended up here, working with great
enthusiasm on something that almost no one else in the world cares about…

On a sunny and warm early summer day I stepped into Tuula’s office and sat down with her

and my future mentor in the lab, Martin Baumann. A lot of laughs and Friday beers later my

diploma work ended and my PhD with Harry Brumer began.

Hooked on plants, I’m still here and I would like to take the opportunity to thank all the
people in the lab for their support and the good times we have had, especially the people in
my group, past and present: Farid, Fredrika, Johan, Maria, Martin, Niklas and Nomchit. One
could not wish for better team mates. Special thanks go to the people in my room, Erik,
Felicia, Gustav and Johanna, for making work a lot easier, maybe not always more efficient, but
hey, it’s all good, and finally to Vincent, the biggest fish we’ve got.


Did you really think I had forgotten you, Harry? Of course I hadn’t. I’m thanking you not
only for work-related help but also for interesting conversations and for putting up with me...

Finally, thank you, friends and family, for your support; someday, hopefully, I can repay you.


References 

1. Vickery HB. The origin of the word protein. Yale J Biol Med, 1950(22):387-393.

2. Berzelius JJ. Brevväxling mellan Berzelius och G.J. Mulder (1834-1837). Bref Utgifna af Kungl Svenska Vetenskapsakademien genom HG Söderbaum. Uppsala; 1916.

3. Ragauskas AJ, Williams CK, Davison BH, Britovsek G, Cairney J, Eckert CA, Frederick WJ, Hallett JP, Leak DJ, Liotta CL, Mielenz JR, Murphy R, Templer R, Tschaplinski T. The path forward for biofuels and biomaterials. Science, 2006;311(5760):484-489.

4. Wong G. Biotech scientists bank on big pharma's biologics push. Nat Biotech, 2009;27(3):293-295.

5. Crick F. Central Dogma of Molecular Biology. Nature, 1970;227(5258):561-563.

6. Crick FHC. Symp. Soc. Exp. Biol., The Biological Replication of Macromolecules. 1958.

7. Gilbert W. Origin of life: The RNA world. Nature, 1986;319(6055):618-618.

8. Hardy J, Selkoe DJ. Medicine - The amyloid hypothesis of Alzheimer's disease: Progress and problems on the road to therapeutics. Science, 2002;297(5580):353-356.

9. Lauren J, Gimbel DA, Nygaard HB, Gilbert JW, Strittmatter SM. Cellular prion protein mediates impairment of synaptic plasticity by amyloid-β oligomers. Nature, 2009;457(7233):1128-1132.

10. Prusiner SB. Prions causing degenerative neurological diseases. Annu Rev Med, 1987;38:381-398.

11. Wrinch DM, Jeffreys H. On Some Aspects of the Theory of Probability. Philosophical Magazine, 1919;38:715-731.

12. Fiebig KM, Dill KA. Protein core assembly processes. J Chem Phys, 1993;98(4):3475-3487.

13. Kauzmann W. Some Factors in the Interpretation of Protein Denaturation. In: Anfinsen CB, Anson ML, Bailey K, Edsall JT, editors. Advances in Protein Chemistry. Vol 14. Academic Press; 1959. p 1-63.

14. Ivarsson Y, Travaglini-Allocatelli C, Brunori M, Gianni S. Mechanisms of protein folding. Eur Biophys J Biophy, 2008;37(6):721-728.

15. Mok KH, Kuhn LT, Goez M, Day IJ, Lin JC, Andersen NH, Hore PJ. A pre-existing hydrophobic collapse in the unfolded state of an ultrafast folding protein. Nature, 2007;447(7140):106-109.

16. Dill KA, Ozkan SB, Shell MS, Weikl TR. The Protein Folding Problem. Annu Rev Biophys, 2008;37(1):289-316.

17. Udgaonkar JB. Multiple Routes and Structural Heterogeneity in Protein Folding. Annu Rev Biophys, 2008;37(1):489-510.

18. Pearl FMG, Bennett CF, Bray JE, Harrison AP, Martin N, Shepherd A, Sillitoe I, Thornton J, Orengo CA. The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res, 2003;31(1):452-455.

19. Yan Y, Moult J. Protein Family Clustering for Structural Genomics. J Mol Biol 2005;353(3):744-759.

20. Flaherty KM, McKay DB, Kabsch W, Holmes KC. Similarity of the 3-dimensional structures of actin and ATPase fragment of a 70 kDa heat-shock cognate protein. Proc Nat Acad Sci USA 1991;88(11):5041-5045.


21. Chothia C. One thousand families for the molecular biologist. Nature, 1992;357(6379):543-544.

22. Zhang C, DeLisi C. Estimating the number of protein folds. J Mol Biol 1998;284(5):1301-1305.

23. Choi I-G, Kim S-H. Evolution of protein structural classes and protein sequence families. Proc Nat Acad Sci USA, 2006;103(38):14056-14061.

24. Minor DL, Kim PS. Context-dependent secondary structure formation of a designed protein sequence. Nature, 1996;380(6576):730-734.

25. Pagel K, Koksch B. Following polypeptide folding and assembly with conformational switches. Curr Opin Chem Biol, 2008;12(6):730-739.

26. Govindarajan S, Goldstein RA. Why are some protein structures so common? Proc Nat Acad Sci USA 1996;93(8):3341-3345.

27. Zeldovich KB, Berezovsky IN, Shakhnovich EI. Physical Origins of Protein Superfamilies. J Mol Biol, 2006;357(4):1335-1343.

28. Goldstein RA. The structure of protein evolution and the evolution of protein structure. Curr Opin Struct Biol 2008;18(2):170-177.

29. Bloom JD, Drummond DA, Arnold FH, Wilke CO. Structural Determinants of the Rate of Protein Evolution in Yeast. Mol Biol Evol, 2006;23(9):1751-1761.

30. Pál C, Papp B, Lercher MJ. An integrated view of protein evolution. Nat Rev Genet, 2006;7(5):337-348.

31. Higgins DG, Sharp PM. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 1988;73(1):237-244.

32. Katoh K, Misawa K, Kuma K-i, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res, 2002;30(14):3059-3066.

33. Edgar R. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinf 2004;5(1):113.

34. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997;25(17):3389-3402.

35. McGuffin LJ, Bryson K, Jones DT. The PSIPRED protein structure prediction server. Bioinformatics, 2000;16(4):404-405.

36. Pei J, Grishin NV. PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics, 2007;23(7):802-808.

37. Simossis VA, Heringa J. PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res, 2005;33:W289-W294.

38. Zhou H, Zhou Y. SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics, 2005;21(18):3615-3621.

39. O'Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C. 3DCoffee: Combining Protein Sequences and Structures within Multiple Sequence Alignments. J Mol Biol, 2004;340(2):385-395.

40. Thompson JD, Plewniak F, Poch O. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 1999;15(1):87-88.

41. Mizuguchi K, Deane CM, Blundell TL, Overington JP. HOMSTRAD: A database of protein structure alignments for homologous families. Protein Sci, 1998;7(11):2469-2471.


42. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004;32(5):1792-1797.

43. Van Walle I, Lasters I, Wyns L. SABmark--a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 2005;21(7):1267-1268.

44. Notredame C, Higgins DG, Heringa J. T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000;302(1):205-217.

45. Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res, 2005;15(2):330-340.

46. Wallace IM, O'Sullivan O, Higgins DG, Notredame C. M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res, 2006;34(6):1692-1699.

47. Katoh K, Kuma K-i, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res, 2005;33(2):511-518.

48. Katoh K, Toh H. Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform, 2008;9(4):286-298.

49. Kimura M. The Neutral Theory of Molecular Evolution: Cambridge University Press; 1985. 384 p.

50. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 1987;4(4):406-425.

51. Dayhoff MO, Eck EV, Park CM. A Model of Evolutionary Change in Proteins. In: Dayhoff MO, editor. Atlas of Protein Sequence and Structure. Vol 5. Silver Spring, Maryland: National Biomedical Research Foundation; 1972. p 89–99.

52. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data from protein sequences. CABIOS, 1992;8(3):275-282.

53. Gonnet GH, Cohen MA, Benner SA. Exhaustive matching of the entire protein-sequence database. Science, 1992;256(5062):1443-1445.

54. Henikoff S, Henikoff JG. Amino-acid substitution matrices from protein blocks. Proc Nat Acad Sci USA, 1992;89(22):10915-10919.

55. Ng PC, Henikoff JG, Henikoff S. PHAT: a transmembrane-specific substitution matrix. Bioinformatics, 2000;16(9):760-766.

56. Adachi J, Waddell PJ, Martin W, Hasegawa M. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J Mol Evol, 2000;50(4):348-358.

57. Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP. Evolution - Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 2001;294(5550):2310-2314.

58. Kumar S, Tamura K, Nei M. MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Brief Bioinform, 2004;5(2):150-163.

59. Tamura K, Dudley J, Nei M, Kumar S. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) Software Version 4.0. Mol Biol Evol, 2007;24(8):1596-1599.

60. Felsenstein J, Churchill GA. A hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol, 1996;13(1):93-104.

61. Guindon S, Gascuel O. A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood. Syst Biol, 2003;52(5):696 - 704.

62. Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 2001;17(8):754-755.


63. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 2003;19(12):1572-1574.

64. Marth JD. A unified vision of the building blocks of life. Nat Cell Biol, 2008;10(9):1015-1016.

65. Henrissat B, Claeyssens M, Tomme P, Lemesle L, Mornon JP. Cellulase families revealed by hydrophobic cluster-analysis. Gene, 1989;81(1):83-95.

66. Henrissat B. A classification of Glycosyl hydrolases based on amino-acid sequence similarities. Biochem J 1991;280:309-316.

67. Gaboriaud C, Bissery V, Benchetrit T, Mornon JP. Hydrophobic cluster analysis: An efficient new way to compare and analyse amino acid sequences. FEBS Lett, 1987;224(1):149-155.

68. Davies GJ, Sinnott ML. Sorting the diverse:the sequence-based classification of carbohydrate-active enzymes. Biochem J, 2008;DOI 10.1042/BJ20080382 (online only).

69. Wolfenden R, Lu X, Young G. Spontaneous Hydrolysis of Glycosides. J Am Chem Soc 1998;120(27):6814-6815.

70. Koshland JDE. Stereochemistry and the mechanism of enzymatic reactions. Biol Rev Camb Philos Soc, 1953;28(4):416-436.

71. Vocadlo DJ, Davies GJ. Mechanistic insights into glycosidase chemistry. Curr Opin Chem Biol 2008;12(5):539-555.

72. Rye CS, Withers SG. Glycosidase mechanisms. Curr Opin Chem Biol, 2000;4(5):573-580.

73. Vocadlo DJ, Davies GJ, Laine R, Withers SG. Catalysis by hen egg-white lysozyme proceeds via a covalent intermediate. Nature, 2001;412(6849):835-838.

74. Zechel DL, Withers SG. Glycosidase Mechanisms: Anatomy of a Finely Tuned Catalyst. Acc Chem Res 2000;33(1):11-18.

75. Watts AG, Damager I, Amaya ML, Buschiazzo A, Alzari P, Frasch AC, Withers SG. Trypanosoma cruzi Trans-sialidase Operates through a Covalent Sialyl-Enzyme Intermediate: Tyrosine Is the Catalytic Nucleophile. J Am Chem Soc, 2003;125(25):7532-7533.

76. Macauley MS, Whitworth GE, Debowski AW, Chin D, Vocadlo DJ. O-GlcNAcase Uses Substrate-assisted Catalysis: Kinetic analysis and development of highly selective mechanism-inspired inhibitors. J Biol Chem, 2005;280(27):25313-25322.

77. Johansson P, Brumer H, Baumann MJ, Kallas AM, Henriksson H, Denman SE, Teeri TT, Jones TA. Crystal structures of a poplar xyloglucan endotransglycosylase reveal details of transglycosylation acceptor binding. Plant Cell, 2004;16(4):874-886.

78. Cabib E, Blanco N, Grau C, Rodríguez-Peña JM, Arroyo J. Crh1p and Crh2p are required for the cross-linking of chitin to β-(1-6)glucan in the Saccharomyces cerevisiae cell wall. Mol Microbiol, 2007;63(3):921-935.

79. Cabib E, Farkas V, Kosik O, Blanco N, Arroyo J, McPhie P. Assembly of the Yeast Cell Wall: Crh1p AND Crh2p act as transglycosylases in vivo and in vitro. J Biol Chem, 2008;283(44):29859-29872.

80. Barbeyron T, Gerard A, Potin P, Henrissat B, Kloareg B. The kappa-carrageenase of the marine bacterium Cytophaga drobachiensis. Structural and phylogenetic relationships within family-16 glycoside hydrolases. Mol Biol Evol, 1998;15(5):528-537.

81. Michel G, Chantalat L, Duee E, Barbeyron T, Henrissat B, Kloareg B, Dideberg O. The κ-carrageenase of P. carrageenovora Features a Tunnel-Shaped Active Site: A Novel Insight in the Evolution of Clan-B Glycoside Hydrolases. Structure, 2001;9(6):513-525.

82. Fry SC, York WS, Albersheim P, Darvill A, Hayashi T, Joseleau JP, Kato Y, Lorences EP, Maclachlan GA, McNeil M, Mort AJ, Reid JSG, Seitz HU, Selvendran RR, Voragen AGJ, White AR. An unambiguous nomenclature for xyloglucan-derived oligosaccharides. Physiol Plant, 1993;89(1):1-3.

83. Rose JKC, Bennett AB. Cooperative disassembly of the cellulose-xyloglucan network of plant cell walls: parallels between cell expansion and fruit ripening. Trends Plant Sci, 1999;4(5):176-183.

84. Carpita NC, Gibeaut DM. Structural models of primary-cell walls in flowering plants - consistency of molecular-structure with the physical-properties of the walls during growth. Plant J, 1993;3(1):1-30.

85. Cosgrove DJ. Growth of the plant cell wall. Nat Rev Mol Cell Biol, 2005;6(11):850-861.

86. Pauly M, Albersheim P, Darvill A, York WS. Molecular domains of the cellulose/xyloglucan network in the cell walls of higher plants. Plant J, 1999;20(6):629-639.

87. Fry SC. The Structure and Functions of Xyloglucan. J Exp Bot, 1989;40(210):1-11.

88. Hayashi T. Xyloglucans in the Primary-Cell Wall. Annu Rev Plant Physiol Plant Molec Biol, 1989;40:139-168.

89. Kooiman P. On the occurence of amyloids in plant seeds. Acta Bot Neerl, 1960;9:208-219.

90. Tine MAS, Silva CO, Lima DUd, Carpita NC, Buckeridge MS. Fine structure of a mixed-oligomer storage xyloglucan from seeds of Hymenaea courbaril. Carbohy Pol, 2006;66(4):444-454.

91. Hrmova M, Farkas V, Lahnstein J, Fincher GB. A Barley Xyloglucan Xyloglucosyl Transferase Covalently Links Xyloglucan, Cellulosic Substrates, and (1,3;1,4)-beta-D-Glucans. J Biol Chem, 2007;282(17):12951-12962.

92. Yokoyama R, Nishitani K. A comprehensive expression analysis of all members of a gene family encoding cell-wall enzymes allowed us to predict cis-regulatory regions involved in cell-wall construction in specific organs of arabidopsis. Plant Cell Physiol, 2001;42(10):1025-1033.

93. Yokoyama R, Rose JKC, Nishitani K. A surprising diversity and abundance of xyloglucan endotransglucosylase/hydrolases in rice. Classification and expression analysis. Plant Physiol, 2004;134(3):1088-1099.

94. Geisler-Lee J, Geisler M, Coutinho PM, Segerman B, Nishikubo N, Takahashi J, Aspeborg H, Djerbi S, Master E, Andersson-Gunneras S, Sundberg B, Karpinski S, Teeri TT, Kleczkowski LA, Henrissat B, Mellerowicz EJ. Poplar Carbohydrate-Active Enzymes. Gene Identification and Expression Analyses. Plant Physiol, 2006;140(3):946-962.

95. Becnel J, Natarajan M, Kipp A, Braam J. Developmental Expression Patterns of Arabidopsis XTH Genes Reported by Transgenes and Genevestigator. Plant Mol Biol, 2006;61(3):451-467.

96. Edwards M, Dea IC, Bulpin PV, Reid JS. Purification and properties of a novel xyloglucan-specific endo-(1 →4)-beta-D-glucanase from germinated nasturtium seeds (Tropaeolum majus L.). J Biol Chem, 1986;261(20):9489-9494.


97. Fanutti C, Gidley MJ, Reid JS. Action of a pure xyloglucan endo-transglycosylase (formerly called xyloglucan-specific endo-(1-->4)-beta-D-glucanase) from the cotyledons of germinated nasturtium seeds. Plant J, 1993;3(5):691-700.

98. Fanutti C, Gidley MJ, Reid JS. Substrate subsite recognition of the xyloglucan endo-transglycosylase or xyloglucan-specific endo-(1-->4)-beta-D-glucanase from the cotyledons of germinated nasturtium (Tropaeolum majus L.) seeds. Planta, 1996;200(2):221-228.

99. Xu W, Campbell P, Vargheese AK, Braam J. The Arabidopsis XET-related gene family: environmental and hormonal regulation of expression. Plant J, 1996;9(6):879-889.

100. Uozu S, Tanaka-Ueguchi M, Kitano H, Hattori K, Matsuoka M. Characterization of XET-Related Genes of Rice. Plant Physiol, 2000;122(3):853-860.

101. Davies GJ, Gloster TM, Henrissat B. Recent structural insights into the expanding world of carbohydrate-active enzymes. Curr Opin Struct Biol, 2005;15(6):637-645.

102. Markovič O, Janeček S. Pectin methylesterases: sequence-structural features and phylogenetic relationships. Carbohydr Res, 2004;339(13):2281-2295.

103. Duan XW, Cheng GP, Yang E, Yi C, Ruenroengklin N, Lu WJ, Luo YB, Jiang YM. Modification of pectin polysaccharides during ripening of postharvest banana fruit. Food Chem, 2008;111(1):144-149.

104. Iwai H, Masaoka N, Ishii T, Satoh S. A pectin glucuronyltransferase gene is essential for intercellular attachment in the plant meristem. Proc Natl Acad Sci U S A, 2002;99(25):16319-16324.

105. Visser J, Voragen AGJ. Progress in Biotechnology 14: Pectins and pectinases. Amsterdam: Elsevier; 1996.

106. Vincken J-P, Schols HA, Oomen RJFJ, McCann MC, Ulvskov P, Voragen AGJ, Visser RGF. If Homogalacturonan Were a Side Chain of Rhamnogalacturonan I. Implications for Cell Wall Architecture. Plant Physiol, 2003;132(4):1781-1789.

107. Scheller HV, Jensen JK, Sorensen SO, Harholt J, Geshi N. Biosynthesis of pectin. Physiol Plant, 2007;129(2):283-295.

108. Markovic O, Kohn R. Mode of pectin deesterification by Trichoderma reesei pectinesterase. Experientia, 1984;40(8):842-843.

109. Cabrera JC, Boland A, Messiaen J, Cambier P, Van Cutsem P. Egg box conformation of oligogalacturonides: The time-dependent stabilization of the elicitor-active conformation increases its biological activity. Glycobiology, 2008;18(6):473-482.

110. Jarvis MC, Apperley DC. Chain conformation in concentrated pectic gels - evidence from 13C NMR. Carbohydr Res, 1995;275(1):131-145.

111. Bordenave M, Goldberg R. Immobilized and free apoplastic pectinmethylesterases in mung bean hypocotyl. Plant Physiol, 1994;106(3):1151-1156.

112. Wen FS, Zhu YM, Hawes MC. Effect of pectin methylesterase gene expression on pea root development. Plant Cell, 1999;11(6):1129-1140.

113. Payasi A, Sanwal R, Sanwal GG. Microbial pectate lyases: characterization and enzymological properties. World J Microbiol Biotechnol, 2009;25(1):1-14.

114. Limberg G, Körner R, Buchholt HC, Christensen TMIE, Roepstorff P, Mikkelsen JD. Analysis of different de-esterification mechanisms for pectin by enzymatic fingerprinting using endopectin lyase and endopolygalacturonase II from A. niger. Carbohydr Res, 2000;327(3):293-307.

115. Fries M, Ihrig J, Brocklehurst K, Shevchik VE, Pickersgill RW. Molecular basis of the activity of the phytopathogen pectin methylesterase. EMBO J, 2007;26(17):3879-3887.

116. Micheli F. Pectin methylesterases: cell wall enzymes with important roles in plant physiology. Trends Plant Sci, 2001;6(9):414-419.

117. Micheli F, Sundberg B, Goldberg R, Richard L. Radial Distribution Pattern of Pectin Methylesterases across the Cambial Region of Hybrid Aspen at Activity and Dormancy. Plant Physiol, 2000;124(1):191-200.

118. Matthews BW, Sigler PB, Henderson R, Blow DM. Three-dimensional Structure of Tosyl-α-chymotrypsin. Nature, 1967;214(5089):652-656.

119. Jenkins J, Pickersgill R. The architecture of parallel β-helices and related folds. Prog Biophys Mol Biol, 2001;77(2):111-175.

120. Jenkins J, Mayans O, Smith D, Worboys K, Pickersgill RW. Three-dimensional structure of Erwinia chrysanthemi pectin methylesterase reveals a novel esterase active site. J Mol Biol, 2001;305(4):951-960.

121. Johansson K, El-Ahmad M, Friemann R, Jörnvall H, Markovič O, Eklund H. Crystal structure of plant pectin methylesterase. FEBS Lett, 2002;514(2-3):243-249.

122. Di Matteo A, Giovane A, Raiola A, Camardella L, Bonivento D, De Lorenzo G, Cervone F, Bellincampi D, Tsernoglou D. Structural Basis for the Interaction between Pectin Methylesterase and a Specific Inhibitor Protein. Plant Cell, 2005;17(3):849-858.

123. Giovane A, Servillo L, Balestrieri C, Raiola A, D'Avino R, Tamburrini M, Ciardiello MA, Camardella L. Pectin methylesterase inhibitor. Biochim Biophys Acta, Proteins Proteomics, 2004;1696(2):245-252.

124. Raiola A, Camardella L, Giovane A, Mattei B, De Lorenzo G, Cervone F, Bellincampi D. Two Arabidopsis thaliana genes encode functional pectin methylesterase inhibitors. FEBS Lett, 2004;557(1-3):199-203.

125. Misas-Villamil JC, van der Hoorn RAL. Enzyme-inhibitor interactions at the plant-pathogen interface. Curr Opin Plant Biol, 2008;11(4):380-388.

126. Spök A, Stubenrauch G, Schorgendorfer K, Schwab H. Molecular cloning and sequencing of a pectinesterase gene from Pseudomonas solanacearum. J Gen Microbiol, 1991;137:131-140.

127. Kuznetsova E, Proudfoot M, Sanders SA, Reinking J, Savchenko A, Arrowsmith CH, Edwards AM, Yakunin AF. Enzyme genomics: Application of general enzymatic screens to discover new enzymes. FEMS Microbiol Rev, 2005;29(2):263-279.

128. Maskill H. The Physical Basis of Organic Chemistry. Oxford University Press; 1986. 480 p.

129. Shevchik VE, Condemine G, Hugouvieux-Cotte-Pattat N, Robert-Baudouy J. Characterization of pectin methylesterase B, an outer membrane lipoprotein of Erwinia chrysanthemi 3937. Mol Microbiol, 1996;19(3):455-466.