spectroscopic data and multivariate analysis - diva portal140881/fulltext01.pdf · spectroscopic...

81
Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees ~ Susanne Wiklund Akademisk avhandling Som med tillstånd av rektorsämbetet vid Umeå universitet för erhållande av Teknologie Doktorsexamen vid Teknisk–Naturvetenskapliga fakulteten, framlägges till offentlig granskning vid Kemiska institutionen, Umeå universitet, sal N360, fredagen den 9:e November 2007, Kl 10.00 Fakultetsopponent: Professor Olav Kvalheim, Department of Chemistry, University of Bergen, Norway Department of Chemistry Research Group for Chemometrics, Umeå University, Umeå 2007

Upload: others

Post on 20-Oct-2019

26 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Spectroscopic Data

and Multivariate Analysis ~Tools to Study Genetic Perturbations

in Poplar Trees ~

Susanne Wiklund

Akademisk avhandling Som med tillstånd av rektorsämbetet vid Umeå universitet för erhållande av Teknologie Doktorsexamen vid Teknisk–Naturvetenskapliga fakulteten, framlägges till offentlig granskning vid Kemiska institutionen, Umeå universitet, sal N360, fredagen den 9:e November 2007, Kl 10.00 Fakultetsopponent: Professor Olav Kvalheim, Department of Chemistry, University of Bergen, Norway

Department of Chemistry Research Group for Chemometrics,

Umeå University, Umeå 2007

Page 2: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Copyright © 2007 Susanne Wiklund Printed in Sweden by Solfjädern Offset AB, Umeå 2007

ISBN: 978-91-7264-380-2

Page 3: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Title Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~ Author Susanne Wiklund, Department of Chemistry, Umeå University, SE-90187 Umeå, Sweden Abstract

In our society in the 21st century one of the greatest challenges is to provide raw materials to the industry in a sustainable way. This requires increased use of renewable raw materials such as wood. Wood is widely used in pulp, paper and textile industries and ongoing research efforts aim to extend the use of wood in e.g. liquid biofuels and green chemicals. At Umeå Plant Science Center (UPSC) poplar trees are used as model systems to study wood formation. The objective is to understand the function of the genes underlying the wood forming process. This knowledge could result in improved chemical and physical wood properties suitable for different industrial processes. This will in turn meet the demands for a sustainable development.

This thesis presents tools and strategies to unravel information regarding genetic perturbations in poplar trees by the use of nuclear magnetic resonance (NMR) spectroscopy and multivariate analysis (MVA). Furthermore, gas chromatography/mass spectrometry (GC/MS) is briefly discussed in this context. Multivariate methods to find patterns and trends in NMR data have been used for more than 30 years. In the beginning MVA was applied in pattern recognition studies in order to characterize chemical structures with different ligands and in different solvents. Today, the multivariate methods have developed and the research have changed focus towards the study of biofluids from plant extracts, urine, blood plasma, saliva etc. NMR spectra of biofluids can contain thousands of resonances, originating from hundreds of different compounds. This type of complex data can be hard to summarize and interpret without appropriate tools and require sophisticated strategies for data evaluation. Related fields of research are rapidly growing and are here referred to as metabolomics.

Five different research projects are presented which includes analysis of poplar samples where macromolecules such as pectin and also small molecules such as metabolites were analysed by high resolution magic angle spinning (HR/MAS) NMR spectroscopy, 1H-13C HSQC NMR and GC/MS. The discussion topics include modelling of metabolomic time dependencies in combination with genetic variation, validation of orthogonal projections to latent structures (OPLS) models, selection of putative biomarkers related to genetic modification from OPLS-discriminant analysis (DA) models, measuring one of the main components found in the primary cell-walls of poplar i.e. pectin, the use of Fourier transformed two-dimensional (2D) NMR data in OPLS modelling and model complexity in a PLS model.

Keywords: NMR spectroscopy, Metabolomics, Metabonomics, Poplar, MLR, PLS, OPLS, PCA, Pectin, Wood, Degree of methylation, S-plot, SUS-plot

ISBN: 978-91-7264-380-2

Page 4: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~
Page 5: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

To Oliver, Nova and Fredric

Page 6: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Contents 11.. LLiisstt ooff PPaappeerrss __________________________________________________________________________________ 88

22.. LLiisstt ooff AAbbbbrreevviiaattiioonnss ______________________________________________________________________ 99

33.. LLiisstt ooff NNoottaattiioonnss ____________________________________________________________________________ 1100

44.. IInnttrroodduuccttiioonn __________________________________________________________________________________ 1111

55.. AAiimm__________________________________________________________________________________________________ 1133

66.. BBaacckkggrroouunndd ____________________________________________________________________________________ 1144

6.1. Metabolomics _____________________________________ 15 6.1.1. Definition ____________________________________ 15 6.1.2. Goals in Metabolomic Studies ____________________ 15 6.1.3. Dynamics ____________________________________ 16

6.2. Macromolecules ___________________________________ 17 6.2.1. Pectin _______________________________________ 18

77.. IInnssttrruummeennttaattiioonn ____________________________________________________________________________ 2200

7.1. NMR Spectroscopy ________________________________ 20 7.1.1. Basic Theory _________________________________ 20 7.1.2. Applications __________________________________ 21 7.1.3. Data Quality __________________________________ 25

7.2. Two-Dimensional NMR Spectroscopy _________________ 25

7.3. Gas Chromatography/Mass Spectrometry _______________ 26

88.. CChheemmoommeettrriiccss ________________________________________________________________________________ 2288

8.1. Multiple Linear Regression __________________________ 28

8.2. Principal Component Analysis________________________ 29 8.2.1. Interpretation of PCA Models ____________________ 29

8.3. Partial Least Squares Projections to Latent Structures______ 31 8.3.1. Model Rank __________________________________ 32 8.3.2. Interpretation of PLS Models_____________________ 33

8.4. Orthogonal Projections to Latent Structures _____________ 34 8.4.1. Interpretation of OPLS Models ___________________ 35 8.4.2. Putative Biomarker Identification _________________ 37

8.5. Batch Processing __________________________________ 38

Page 7: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

8.6. Multivariate Analysis on NMR Data ___________________ 40 8.6.1. History ______________________________________ 40 8.6.2. Multivariate Analysis and NMR Today_____________ 41

99.. RReessuullttss aanndd DDiissccuussssiioonn__________________________________________________________________ 4444

9.1. Modelling of Time Dependencies (Paper I)______________ 44

9.2. OPLS-DA Model Visualization (Paper II)_______________ 48

9.3. Pectin Measurements Using HSQC NMR (Paper III) ______ 52

9.4. HSQC NMR and MVA (Paper IV) ____________________ 56

9.5. PLS Component Selection (Paper V)___________________ 62

1100.. CCoonncclluussiioonnss aanndd FFuuttuurree PPeerrssppeeccttiivveess__________________________________________ 6655

1111.. AAcckknnoowwlleeddggeemmeennttss________________________________________________________________________ 6677

1122.. RReeffeerreenncceess ______________________________________________________________________________________ 7700

Page 8: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

8

11.. LLiisstt ooff PPaappeerrss

This thesis is based on the papers listed below, which will be referred to in the text by the corresponding Roman numerals (I-V). Papers I and V are reprinted with kind permission from the publishers. I Wiklund, S., Karlsson, M., Antti, H., Johnels, D., Sjöström, M.,

Wingsle, G., Edlund, U., (2005) A new metabonomic strategy for analysing the growth process in poplar tree, Plant Biotechnology Journal 3: 353-362

II Wiklund, S., Johansson, E., Sjöström, L., Mellerowicz, J.E.,

Edlund, U., Shockcor, P.J., Gottfries, J., Moritz, T., Trygg, J., (2007) Visualization of GC-TOF/MS based metabolomics data for identification of biochemically interesting compounds using OPLS class models, Analytical Chemistry, in press

III Siedlecka, A., Wiklund, S., Péronne M-A., Micheli, F.,

Lesniewska, J., Sethson, I., Edlund, U., Richard, L., Sundberg, B., and Mellerowicz, J. E., (2007) Pectin methyl esterase inhibits intrusive and symplastic cell growth in developing wood of Populus trees, manuscript

IV Hedenström, M*., Wiklund, S*., Sundberg, B., Edlund, U., (2007)

Visualization and interpretation of OPLS models based on 2D NMR Data, submitted

*To be considered as joint first authors V Wiklund, S., Nilsson, D., Eriksson, L., Sjöström, M., Wold, S.,

Faber, K., (2007) A randomization test for PLS component selection, Journal of Chemometrics 21: 427-439

Page 9: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

9

22.. LLiisstt ooff AAbbbbrreevviiaattiioonnss

Abbreviation Meaning CP Cross Polarized DmodX Distance to Model in X DoE Design of Experiments DA Discriminant Analysis DM Degree of Methylation FID Free Induction Decay FT Fourier Transformation GC Gas Chromatography HSQC Heteronuclear Single Quantum Coherence

Spectroscopy MAS Magic Angle Spinning MLR Multiple Linear Regression MS Mass Spectrometry MVA Multivariate Data Analysis NMR Nuclear Magnetic Resonance 2D NMR Two Dimensional NMR OPLS Orthogonal Projections to Latent Structures PCA Principal Component Analysis PLS Partial Least squares Projections to Latent

Structures PME Pectin Methyl Esterase RMSEP Root Mean Squared Error of Prediction SS Sum of Squares

Page 10: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

10

33.. LLiisstt ooff NNoottaattiioonnss

Matrices are represented by bold capital letters, vectors by bold lower case letters and scalars by lower case letters. Notation Meaning A Number of components or practical rank of a

model c Weight vector for y Corr(tp,X) Modelled predictive correlation vector, in OPLS Cov(tp,X) Modelled predictive covariation vector, in OPLS

E Residual matrix for X F Residual matrix for Y K Number of columns/variables in X N Number of rows in X and Y P Loading matrix for X p Loading vector for X po Orthogonal loading vector, in OPLS pp Predictive loading vector, in OPLS T Score matrix t Score vector to Orthogonal score vector, in OPLS w Weight vector of X X Data matrix comprised of N observations and K

variables Y Response matrix

Page 11: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Introduction

11

44.. IInnttrroodduuccttiioonn

LANT BREEDING IS THE MANIPULATION OF PLANTS in order to develop plants with specific properties. The history

of breeding is believed to extend more than 10000 years which has provided benefits that include improvements in crop yields, crop quality and resistance against various plant pests, such as viruses, bacteria, fungi and insects. In the early days of breeding, manipulations were performed more or less as a “black box”. However, the generation of plant hybrids can result in unintended alterations of gene expression, which could lead to redundant properties. For this reason, it is important to control all biological and chemical changes in the breeding process.

The genetic manipulation of plants is now performed in a more systematic and controlled manner and this research field is very active. Umeå Plant Science Centre (UPSC) is a world-leading center for basic plant research. At UPSC, poplar trees are used as model systems to elucidate processes involved in wood formation, xylogenesis1-3 (e.g. cell wall synthesis, lignification and cell death); a key objective of plant research. Another objective is to investigate tree modification in order to incorporate suitable fibre properties. Today, wood is used as an energy source, in construction and in the pulp and paper industry. In the future, it might be possible to use wood in biofuels, biopolymers, biomaterials and in the production of ‘green’ chemicals.4 Hence, plant breeding is of both economical and environmental interest. One approach to achieve these goals is to initiate genetic perturbations and to observe their effects on, variables such as the trees growth rates, wood fibre properties, lignin properties, pulping properties and chemical content of the wood components.

This thesis presents a number of studies where NMR spectroscopy and GC/MS were used in conjunction with multivariate data analysis to monitor perturbations of two different poplar genes. NMR spectroscopy was the primary analytical technique used and consequently this thesis focus on the applications of different NMR techniques. One of the genes studied was the PttMYB76 poplar gene. This gene is a transcription factor that is believed to influence parts of the phenyl propanoid synthetic pathway, which leads to the formation of S- and G-lignin, flavonoids and sinapate esters. The effects of up and down regulation of the PttPME1

P

Page 12: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Introduction

12

gene were also studied. This gene produces pectin methyl esterase, PME, an enzyme that de-esterifies the methylated groups within pectin.

Page 13: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Aim

13

55.. AAiimm

HIS THESIS IS BASED ON studies where biological problems were evaluated using NMR spectroscopy or GC/MS

and multivariate statistics. One objective with this thesis is to describe analytical techniques which are powerful in order to gain a deeper understanding of perturbations in wood components. Furthermore, the characterisation of biological samples using the described techniques will result in complex multivariate profiles. The other objective is therefore to present tools and strategies to evaluate the data from the presented analytical platforms. Thus emphasis is placed on how to use and interpret multivariate models.

T

Page 14: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Background

14

66.. BBaacckkggrroouunndd

OOD CELL WALLS ARE COMPOSED OF cellulose embedded in a hemicellulose (xylan, xyloglucan,

glucomannan)-pectin-protein matrix encrusted by lignin.5 These components are referred to as the macromolecules of wood polymers. Furthermore, the cell wall is divided into primary and secondary walls, which are produced in different developmental stages and/or tissues and differ in composition, as illustrated in figure 1.6, 7

As the major components, cellulose, hemicellulose and lignin, are synthesized in the process of growth, it is important to understand the function of metabolism at different developmental stages. Therefore the study of both xylem (consisting mainly of secondary cell walls) and phloem tissues (only primary cell walls) is important for understanding xylogenesis.

Figure 1. Composition of poplar plant cell walls: a) primary, b) secondary

Small molecules in a plant have many different functions, for example they may be a source of energy or may be involved in diverse biosynthetic pathways. These small molecules are referred to as metabolites because they function in different metabolic systems. The research field that aims to analyze and understand the biochemical processes of different metabolites and their interactions is called metabolomics. This field is further described in section 6.1.

W

Page 15: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Background

15

In addition to elucidating the functions of genes in the wood forming process, information regarding their genotypes (DNA), phenotypes (characteristics for the individual), wood morphology, fiber properties, cellulose, hemicellulose, lignin and pectin etc. is important for developing an overall understanding of their complex biological systems.

66..11.. MMeettaabboolloommiiccss

HE FIELD OF METABOLOMICS has emerged onto the scientific scene during the past years.8-16 Major objectives of

metabolomics are to elucidate how metabolic variations caused by diseases, environmental or genetic changes affect the studied systems, and to contribute to wider systems biology studies, in which information from diverse fields is combined in attempts to elucidate the interactions between genes, proteins and metabolites. 17-19

66..11..11.. DDeeffiinniittiioonn

N THE YEAR 2000, metabolomics was defined as the “comprehensive analysis of the whole metabolome under a given

set of conditions”20 which can be traced back to definition of the metabolome “the sum of metabolites produced by an investigated biological system” which was suggested in 1998.21-23 Another definition presented by Jeremy Nicholson 1999, of the similar research area metabonomics is ”the quantitative measurement of the dynamic multiparametric metabolic response of living systems to pathophysiological stimuli or genetic modification”.8 Since both metabolomics and metabonomics seem to have the same philosophy and goals, they will not be differentiated in this thesis. However, in this thesis all experimental work regarding the analysis of small biochemical molecules will be referred to as metabolomics.

66..11..22.. GGooaallss iinn MMeettaabboolloommiicc SSttuuddiieess

S GENES TRANSLATE TO PROTEINS and proteins are involved in many enzymatic reactions, such as anabolism

and catabolism, these reactions influence the formation of metabolites by various metabolic processes. A primary objective of plant metabolomics

T

I

A

Page 16: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Background

16

is to identify biomarkers related to genetic modifications or environmental factors and to apply this information to understand a biological system.11, 14, 24-27 Metabolomics as applied to human studies has additional objectives. For example it may be applied as a clinical diagnostic tool for the early detection of disease or used to determine whether a potential candidate drug will have organ-specific toxicity.28-34 It is also used in applications of physical exercise with the objective to increase the understanding of the biology and mechanism behind exercise.35 Since no analytical technique (today) can detect all bio-molecules in one experiment, it is important to use different analytical platforms to maximise the total number of measured metabolites. Today the most frequently used techniques are NMR, GC/MS and Liquid Chromatography/Mass Spectrometry (LC/MS).16, 36 NMR spectroscopy is a versatile technique with high reproducibility and relatively simple sample preparation requirements.29, 30 On the other hand, NMR has a very low sensitivity compared to MS-based techniques. An additional advantage of GC/MS and LC/MS techniques is that they include a chromatographic step prior to MS that resolves many of the metabolites.

66..11..33.. DDyynnaammiiccss

IOLOGICAL SYSTEMS ARE DYNAMIC by nature. These systems can be of different character for example, the time

required for a drug metabolite to degrade, the time for a biological system to recover from stress or time that is related to the growth process.37-39

Moreover, in order to understand the dynamics of a system in relation to genetic modifications, drug toxicity, stress or disease as well as time dependency, experimental design is of great importance.40-42 Metabolomics studies generate large data matrices that are associated with hundreds to thousands of biochemical compounds. Underlying systematic variation can be difficult to summarize using classical data assessment approaches, especially in cases of subtle changes. Chemometrics or multivariate statistical methods such as Principal Component Analysis (PCA), partial least squares to latent structures (PLS) and orthogonal projections to latent structures (OPLS), are tools that will facilitate the analysis of complex data structures.43-50 In Paper I, variations related to genetic alterations are jointly examined with the growth process of a poplar plant.

B

Page 17: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Background

17

One difficulty in metabolomics studies is the detection of putative biomarkers or biochemically significant compounds. Paper II evaluates a strategy for the visualization of metabolomics data and the detection of putative biomarkers. The multivariate techniques used are described in more detail in chapter 8.

66..22.. MMaaccrroommoolleeccuulleess

LANT CELL WALLS OF POPULUS TREES are mainly composed of cellulose, lignin, hemicelluloses and pectic

polysaccharides. As previously mentioned, the cells in the wood-forming tissue contain primary and secondary cell walls. Wood components can also be divided into xylem and phloem. The phloem fibres are located close to the bark and are composed of growing tissues which do not contain any secondary cell walls. The xylem is dead tissue and contains secondary cell walls, as depicted in figure 2.

Figure 2. Illustration of the location of xylem and phloem in wood forming tissues.

P

Page 18: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Background

18

66..22..11.. PPeeccttiinn

ECTINS COMPRISE A COMPLEX GROUP of heterogeneous polysaccharides found in all vascular plants. It is one of the

main carbohydrates found in the primary cell walls. The dominant pectin structure is a linear chain of poly-α-(1→4)-D-galacturonic acid, called homogalacturonan (HG) figure 3. The carboxylic side chain attached to carbon 5 is either esterified or de-esterified in HG and the ratio of esterified to de-esterified carboxylic side chains is referred to as the degree of methyl esterification, DM. Pectin is widely used for its gelling properties and is often used as a gelling agent in the food industry. Pectin with a low DM will form gels in the presence of Ca2+ ions and pectin with a high DM will form gels in acidic media.51 It also known that pectin has the ability to crosslink its side chains via Ca2+ bridges to form an “egg box” structure.52, 53 However, the precise structure of pectin is still a subject of debate.

Figure 3. Structure of the most commonly form of pectin found in the primary cell walls of poplar.

In plant cell walls, pectins create a hydrophilic matrix that is capable of storing several important ions which regulate wall extensibility and cellular adhesion as well as participating in biotic stress-response signalling. Pectins form a highly branched network that influences the permeability of the cell wall to different molecules through the regulation of pore size. They also interact with other molecules in the cell wall. Covalent bonds between pectins and arabinogalactan proteins,54

P

Page 19: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Background

19

phenolics55 and xyloglucan have also been observed.56 In addition, pectins may also interact with cellulose.57 These interactions create a three-dimensional network that affects the mechanical properties of cell walls that are critical for cell expansion. In this context, the DM of pectin is of special importance.58, 59 It determines the tendency of pectin to form intra-molecular complexes, harbouring Ca2+ ions. These complexes are thought to be involved in wall stiffening and intercellular adhesion.

Several techniques are available for determining DM, including Infra-Red Spectroscopy (IR) 60, Pyrolysis-MS61, 1H NMR spectroscopy62,

63, titration64, and High-Performance Liquid Chromatography/Refractive Index (HPLC-RI) analysis65. However, these methods require a large amount of pure pectin and/or saponification of the samples prior to analysis. Since most of the pectin in poplar plants is found in the phloem tissues, the amount of sample material available can be a limiting factor. Thus, the ability to determine and compare the DM of pectin using a small amount of sample is very important for plant biology research. Moreover, measuring a sample without any purification would enable the simultaneous analysis of other plant components e.g. metsbolites. In Paper III a new strategy is presented, in which 1H-13C NMR is used to measure the DM of phloem extracted from samples of Populus that displayed PttPME1 alternations. The lines used in this study were two up-regulated lines (2B and 7) and one down-regulated line (5), which were compared to the wild type (WT). A line is a genetically modified plant with a significant gene expression related to the modification. The transgenic lines originate from different gene transformations of wild type.

Page 20: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Instrumentation

20

77.. IInnssttrruummeennttaattiioonn

HIS SECTION DESCRIBES the analytical techniques that were applied to biochemical compounds of interest. NMR

spectroscopy was the main technique used throughout the studies. However, GC/MS was also used in one investigation and a short description of this technique is also provided.

All materials analysed in the studies presented in Papers I-IV were derived from poplar samples. Phloem and xylem extracts were analysed by 1H-13C HSQC NMR spectroscopy and GC/MS and a semi solid plant slurry was analysed using 1H High-Resolution Magic Angle Spinning (HR/MAS) NMR spectroscopy. Data were produced for metabolites and pectin in the Populus samples.

77..11.. NNMMRR SSppeeccttrroossccooppyy

UCLEAR MAGNETIC PROPERTIES were discovered in 1924 by O. Stern and W. Gerlach, who were awarded a Nobel

Prize for their work in 1945. Later, F. Bloch at Stanford University and E.M Purcell at Harvard University observed the first NMR signals. This achievement was recognised when they received the Nobel Prize for Physics in 1952.66, 67 However, it was not until 1958 that the first nuclear magnetic resonance spectrometer became commercially available, thereby enabling several important discoveries during the following decades. The technique was further developed by the introduction of Fourier transform NMR spectroscopy by Richard R. Ernst and his colleagues, who received the Nobel Prize for Chemistry in 1991.68-72

77..11..11.. BBaassiicc TThheeoorryy

NE BASIC PRINCIPLE OF NMR is that it can only apply to nuclei that have a magnetic moment, e.g. 1H, 13C, 15N, 31P and

19F. These nuclei can adopt two magnetic quantum states, a lower energy state, m=1/2, and a higher energy state, m=-1/2, referred to as the alpha and beta states. Relative to an applied external magnetic field, B0, the nuclei will align with a small population excess of the lower energy state oriented in the B0 direction. This excess leads to a net magnetization, M0,

T

N

O

Page 21: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Instrumentation

21

in the B0 direction. The magnitude of M0 is determined according to the Boltzmann distribution

kTE

eNN Δ

− =2/1

2/1 , Eq. 1

where N-1/2 is the higher energy state population and N1/2 is the lower energy state population, k is the Boltzmann constant (1.38*10-23 J/K), T the absolute temperature and ΔE the difference in energy which is directly proportional to the external magnetic field, B0 according to

πγ2

0hBE =Δ , Eq. 2

where γ is the gyromagnetic ratio, which is a constant for a given nucleus (26.75*107 radT-1S-1 for 1H), and h is Planck’s constant (6.63*10-34 JS). The resonance frequencies arising from one observed nucleus in a molecule that is exposed to an electromagnetic pulse are functions of the nucleus’ Larmor frequency and the nuclear environment (for example solvent, pH, temperature and the distance and type of neighbouring nuclei). The free induction decay, FID, from a simple one-pulse 1H experiment is a mixture of frequencies from all nuclei affected by the electromagnetic pulse. The directly observed signal arise in the time-domain and requires Fourier transformation to the frequency-domain prior to interpretation. This is a short description of NMR theory. For additional reading see references.68-70, 73

77..11..22.. AApppplliiccaattiioonnss

ROTONS ARE THE MOST FREQUENTLY measured nuclei using NMR because the 1H isotope has 99% natural

abundance, while other magnetic nuclei have lower abundances. For example, 13C has only 1% natural abundance. One commonly used application of NMR spectroscopy is within medicine where, for example, the imaging technique, MRI, is applied to the structural investigation of organs such as the human brain. It is also regularly used in the structural elucidation of proteins and small organic compounds. The information

P

Page 22: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Instrumentation

22

provided by NMR depends on the type of experiment that is conducted. For example, a routine one-dimensional (1D) 1H experiment gives information about chemical shifts, J-couplings and integrals. The chemical shift which corresponds to a certain position in a molecule will be unique for each observed nucleus, but it will depend upon the surrounding environment (for example, type of solvent, pH, temperature and the adjacent nuclei). One of the most important pieces of information in structure elucidation is the J-coupling. J-couplings are the result of interactions between different spin states through the chemical bonds of a molecule, which result in splitting of the NMR signals. The splitting patterns provide information about the number of neighbouring units and their angle. The integral is a mathematical calculation of the size of a peak. In the case of 1H NMR the peak integral is proportional to the number of protons that give rise to the peak.

The integration of peaks to give the relative number of protons is relatively straightforward when the sample is pure. This simplifies quantification since only one major chemical component is present and the peaks are well resolved. However, scientists have started to analyse and compare more complex biological samples, such as urine,33 blood plasma74 and plant extracts.14, 75 Such complex matrices require additional analytical tools for the evaluation of NMR data. This requirement will be discussed further in section 8.5 and analyses of plant NMR data are presented in Papers I, III and IV.

NMR spectroscopy has several useful features for studies of this type, the most important of which are:

• easy sample preparation • ability to obtain detailed information about the chemical

structure on a molecular level • reproducibility and • the non-destructive nature of the technique.

NMR is also an extremely versatile technique since it offers the possibility to analyse macromolecules and morphology in solid samples by 13C Cross Polarized/Magic Angle Spinning (CP/MAS) NMR.76, 77 An example of the information obtained by 13C CP/MAS NMR analysis of scraped xylem tissues can be seen in figure 4. In addition, quantitative and structural measurements of both macromolecules and small molecules can be performed on semisolid samples by High-resolution (HR)/MAS NMR

Page 23: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Instrumentation

23

spectroscopy, and on dissolved samples by solution state NMR spectroscopy. The ‘magic’ of the MAS technique in comparison to normal NMR is that the sample is spun at a high frequency while it is set at an angle of 54.7o relative to the magnetic field. This enhances the signal resolution of a sample that is not in a solution.78

Figure 4. 125 MHz 13C CP/MAS spectra from scraped poplar xylem. The assignment in the upper left corner corresponds to carbon shifts of cellulose, hemicelluloses and lignin components. The S represents syringyl units G guaicyl units, e is bound to another unit through ether bonds and f is not bound. Although the CP/MAS spectrum is highly informative, the assignments are complicated due to the broad overlapping resonances. In contrast to 13C CP/MAS, an example of the additional information that can be obtained from a solution state 1H-13C HSQC spectrum from acetylated and dissolved xylem tissues (the whole plant cell-wall), can be seen in figure 5.

Page 24: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Instrumentation

24

Figure 5. 600 MHz Cryoprobe 1H-13C HSQC NMR spectrum. This spectrum is from acetylated and dissolved xylem tissues.79 The assignment corresponds to cellulose (C), p-hydroxybenzoate (PB), syringyl (S) guaicyl (G), resinol (R), β-aryl ether (K) and phenylcoumaran (P).80, 81 The analysis of dissolved plant cell-wall was developed in 2003 by Fachuang Lu and John Ralph at the University of Wisconsin, Madison, USA. This method dissolved the whole plant cell-wall in chloroform subsequent to acetylation and was analysed by HSQC 1H-13C NMR spectroscopy. 79, 81 The resolution seen in figure 5 is significantly improved compared to CP/MAS, which allows for a detailed structural information of cellulose, lignin and hemicelluloses. Examples of poplar plant data obtained by 1H HR/MAS NMR and an additional solution state HSQC 1H-13C NMR spectrum can be seen in the results sections, 9.1, 9.3 and 9.4.

There are also some disadvantages that should be mentioned. Beside the fact that NMR is a very expensive technique, its high detection limit is the most important limitation of NMR. The technique is also complicated by the large number of parameters that must be optimised before data acquisition, especially for MAS experiments. This makes NMR analysis time demanding and sometimes operationally complex. However, despite its technical complexity and high detection limits, NMR, HR/MAS NMR and CP/MAS NMR jointly offer the possibility to

Page 25: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Instrumentation

25

obtain chemical understanding of wood and plant material on a molecular level.

77..11..33.. DDaattaa QQuuaalliittyy

HE QUALITY OF NMR DATA depends on several experimental factors, including sample preparation, data

acquisition, pulse sequence, optimisation of instrumental parameters and pre-processing of data. A high field spectrometer is needed to obtain a high quality spectrum, i.e. high signal-to-noise S/N ratio and high resolution. A field increase from 500 MHz to 800 MHz will increase the S/N by approximately 60% at room temperature, calculated from equations 1 and 2. A recent technical development of cooling the probes has led to increased sensitivity due to reductions in electronic and thermal noise. These probes are called cryoprobes and they improve the S/N by a factor 4 or even more. Other important factors include optimisation of the experimental parameters prior to data acquisition. The probe temperature has also an impact on data quality. According to the Boltzmann distribution, reducing the temperature will enhance nuclear polarisation (i.e. increase the population of M0) resulting in increased instrumental sensitivity. However, a low temperature will also reduce the speed of molecular motion, which will decrease the S/N ratio due to line broadening. Slow molecular motion can be a problem when analysing macromolecules such as pectin, which is why a higher temperature was used for such experiments in the study described in Paper III.

Following data acquisition, the standard procedure for both 1D and two-dimensional (2D) NMR data is to pre-process the FID in order to obtain interpretable data. Pre-processing usually includes zero filling, Fourier transformation (FT), phasing, calibration (or alignment), base line correction and normalization. Fourier transformation transports the data from the time domain to the frequency domain and for 2D NMR data the FT is performed twofold, in both the F1 and F2 directions.

77..22.. TTwwoo--DDiimmeennssiioonnaall NNMMRR SSppeeccttrroossccooppyy

S PREVIOUSLY MENTIONED, the nucleus most commonly observed by NMR is 1H due to its high natural abundance.

However, the information gained from a 1D spectrum is often inadequate

T

A

Page 26: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Instrumentation

26

for spectral assignments. In addition, a complex 1D spectrum from urine, plasma or plant extracts can contain many overlapping peaks from multiple compounds. Such overlaps obscure peak assignments and compromise the quantitative analysis. In order to overcome these constraints, data are acquired for 13C as well as for 1H. A heteronuclear NMR experiment offers considerable time savings by indirect observation (rather than direct observation) of the 13C nucleus. There are a number of 2D NMR techniques, such as TOCSY (1H-1H), COSY (1H-1H) and HSQC (1H-13C)82. However, only HSQC is discussed here, since this was the only 2D technique used in the studies.

A 2D NMR spectrum is an “image-like” spectrum in which the spectral resonances are resolved in two dimensions, F1 and F2. For HSQC these axes are the 1H and 13C frequencies from the observed nuclei.

77..33.. GGaass CChhrroommaattooggrraapphhyy//MMaassss SSppeeccttrroommeettrryy

AS CHROMATOGRAPHY AND MASS SPECTROMETRY, GC/MS, are two analytical techniques that can be combined.83,

84 In the first step (GC), analytes are resolved according to their vapour pressure and their interaction with the stationary phase. In metabolomics studies, metabolites are often high boiling compounds which must be derivatized in order to increase their volatility so that they are GC-compatible. This step can be performed using trimethysilyl (TMS) which will ad to the functional groups OH, NH or SH. This derivatisation forms TMS-ethers which have a lower boiling temperature than the original metabolite.

Following the GC step the metabolites are introduced to the second system which involves mass spectrometry. In MS the metabolites are ionized and fragmentized in an ion-source. The generated ions are separated in the mass analyser according to their mass to charge ratios, m/z. The mass analyser presents the ions to a detector and the signal intensity for each separated ion is recorded.

GC/MS technique creates a three dimensional data structure represented by a time axis, m/z axis and the signal intensity axis. The data can be visualized as the Total Ion Current (TIC). Gas chromatography may not separate all the components of the sample, meaning that compounds may co-elute and the resulting mass spectra will not be pure.

G

Page 27: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Instrumentation

27

This phenomenon causes problems in the identification and quantification of compounds. It is preferable to process data that are compromised in this way so that each compound can be identified and quantified. In Paper II, GC/MS was utilised to study metabolic changes in PttPME1 plants. The acquired GC/MS data were pre-processed using Hierarchical Multivariate Curve Resolution (H-MCR) modelling.85 H-MCR resolves the chromatographic profiles, calculates the area of the resolved profiles and generates the corresponding mass spectrum. If the resolved mass spectrum is of acceptable quality, it can be subjected to a library search and the compound can thus be identified.

Page 28: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

28

88.. CChheemmoommeettrriiccss

HE ANALYSIS OF BIOLOGICAL SAMPLES as described in previous sections, will result in a complex multivariate

profile that defines the properties of the samples as characterized by the analytical platforms. Chemometrics is the information aspect of chemistry as it provides mathematical and statistical tools to analyse complex chemical and biological data.86 This type of data will be referred to as X[N,K] where N is the number of samples and K the number of variables or descriptors n the data matrix. Chemometrics can be divided into two core areas, multivariate analysis (MVA) and design of experiments, (DoE). The MVA tools used in these studies include principal component analysis (PCA), partial least squares projections to latent structures (PLS) and orthogonal projections to latent structures (OPLS). The other area, DoE, is the aspect of planning an experimental set up in a systematic and structured manner, which is an important part for a successful studie.40, 87,

88 This chapter describes the application of the regression methods

MLR, PCA, PLS, OPLS and batch modelling. These multivariate methods are useful for summarizing, visualising and interpreting complex data structures in a robust and statistically appropriate way.

88..11.. MMuullttiippllee LLiinneeaarr RReeggrreessssiioonn

HEN APPLYING DOE, the variables in X are usually set to be independent or orthogonal. A designed study can be

solved by MLR under the assumption that X´X is of full rank i.e. that there is no co-linearity in the X data and N>K. A MLR model with two x-variables can be described as y=1b0+b1x1+ b2x2+ b12x1x2+ b11x1

2+ b22x22+f , Eq. 3

b=(XTX)-1(XTy), Eq. 4 where x represents the centered variables, y the response vector, b the MLR coefficients and f the residuals. This solution can find linear

T

W

Page 29: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

29

dependence between x and y but it can also find non-linear terms i.e. interaction terms and quadratic terms.

The classical statistical diagnostic tools that are used to evaluate an MLR model include Analysis of Variance, ANOVA, R2 and Q2. R2 represents the degree of variation explained by the model (also called goodness of fit) and Q2 is the cross-validated predictive ability (also called goodness of prediction). These two diagnostic tools are described further in section 8.3. An ANOVA test compares whether the variation of y that is explained by the model is significantly larger than the variation of y that is associated with the residuals. The results of an ANOVA determine the significance of the model. The ANOVA also test if the model error is significantly larger than the experimental error, this is referred to as Lack of Fit (LoF). These two tests are performed by a statistical F-test. Moreover, the model terms are evaluated by their confidence intervals. In Paper IV MLR was used to evaluate the experimental design.

88..22.. PPrriinncciippaall CCoommppoonneenntt AAnnaallyyssiiss

HE PURPOSE OF PCA is to compress co-varying data from an X[N,K] matrix into fewer independent (orthogonal) latent

variables or principal components related to systematic variation in the original data.46, 89 These new latent variables are eigenvectors extracted from the X matrix using either the NIPALS algorithm90 or singular value decomposition SVD91, to obtain A uncorrelated latent variables, principal components.

88..22..11.. IInntteerrpprreettaattiioonn ooff PPCCAA MMooddeellss

HE NEW PRINCIPAL COMPONENTS thus identified are comprised of scores, T[N,A], and loadings, P[K,A],

representing the large correlated systematic variations in X. The scores and loadings can be interpreted simultaneously since the t(A) direction describes the variation seen within the observations and p(A) represents the variables contributing to this direction. Using the nonlinear iterative partial least squares (NIPALS) algorithm, the first principal component is extracted to explain the maximum variance in the data. All principal components are calculated to be independent and are optimised to explain

T

T

Page 30: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

30

the directions of maximum variance from the original data. The resulting PCA model describes the raw data as a sum of extracted components. PCA model of X: EptptptETPX ++++=+= T

AAT22

T11

T ..... Eq. 5

Interpretations of scores provide an overview of the data in which groups, outliers and trends can be detected. The loadings provide information about the different directions observed for the scores. The distance to model in X, DmodX, is the residual (un-modeled part) standard deviations for an observation and is used to detect minor outliers or deviating behaviours. DmodX is therefore informative when evaluating if observations fits the model. PCA is frequently applied in metabolomics studies that generate NMR and GC/MS data.

Although the unsupervised method PCA is a powerful tool in data evaluation it can be an advantage to use other alternative multivariate tools. Some of these include the supervised methods PLS, OPLS and batch modeling. These are described in the following sections.

Page 31: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

31

88..33.. PPaarrttiiaall LLeeaasstt SSqquuaarreess PPrroojjeeccttiioonnss ttoo LLaatteenntt

SSttrruuccttuurreess

ARTIAL LEAST SQUARES projections to Latent Structures (PLS) is a linear regression method that is commonly used to

find relations between the X data and a response matrix, Y.89, 92, 93 If the response is comprised only of a binary vector describing e.g. class membership, the regression is called PLS-discriminant analysis (DA). The difference between PCA and regression methods is that the extracted components of PCA are rotated towards the variation of y so that this variation will be seen in the first components. PLS regression can also be calculated by the NIPALS algorithm and for single y the following equations are used:

1) w=XTy/(yTy) 2) w=w/║w║ *w is normalized to the length of 1 3) t=Xw 4) c=yTt/(tTt) 5) p=XTt/(tTt) 6) E=X-tpT, f=y-tcT

If the PLS model is comprised of more than one component the same calculations are performed by using the residual matrix E for X and f for y. This algorithm can be generalised for multiple-y which, in the following sections, is denoted as Y, even in the case of a single vector. The representations of X and Y are described in a model of both data sets as PLS Model of: X=TPT+E Eq. 6 PLS Model of: Y=TCT+F Eq. 7

The model significance is evaluated by its predictive ability Q2(Y), from equation 8. The predictive ability is determined by cross-validation, CV. Cross-validation is performed through the exclusion of a number of samples from the data (usually 1/7) prior to modelling. The excluded samples are then predicted using the model and the error of

P

Page 32: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

32

prediction is calculated. This procedure is repeated until all samples have been excluded once. Other important statistical tools for model evaluation are R2(X) and R2(Y) (equations 8 and 9). These two statistics represent the degree of variation explained by the model in the X and Y data, respectively. The model diagnostic tools are calculated as

Q2(Y)=1-PRESS/SS(Y)=1-SS(Y-Ypred)/SS(Y), Eq. 8 R2(X)=1-SS(E)/SS(X), Eq. 9 R2(Y)=1-SS(F)/SS(Y) , Eq. 10 where SS(E) is the sum of squares from the residual matrix E and SS(X) is the total sum of squares for X. PRESS is the predictive error sum of squares for Y and is calculated by cross-validation, and SS(Y) is the total sum of squares for Y. A high Q2(Y) is desirable and the optimum value is 1 for a model with a high predictive ability.

88..33..11.. MMooddeell RRaannkk

ULTIVARIATE MODELS CAN BE SENSITIVE towards model rank i.e. the number of significant components, A.

The sensitivity lies within the model’s predictive ability, which can be compromised if the wrong model rank is selected. Using too few components will result in the model being under-fitted with significant variation remaining unexplained. Using too many components will create an over-fitted model that attempts to explain noise. Therefore one important step in PLS is the selection of an appropriate number of significant components. One commonly used strategy is to select components by finding the best Y predictive ability, Q2(Y), using cross-validation, CV.94-96 For each additional component the value of Q2(Y) is calculated and, as long as the predictive ability continues to increase, one additional component is extracted. When the Q2(Y) decreases, the calculation is stopped and it is assumed that the optimal model has been identified.

In the ideal case this strategy is straight forward and easy. However, this strategy is not always straightforward to apply since a number of samples (usually 1/7) must be excluded from the data set in each CV round. For example, the maximum Q2(Y) may not be found after A components, but after varying numbers of components, as illustrated in

M

Page 33: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

33

figure 6. Consequently the selection of components is often based on softer rules, such as “the beginning of a plateau” or “the first local maximum”. There are other alternative rules and parameters that can be used in combination with, or instead of, Q2(Y). These are briefly described in section 9.5 and Paper IV.

Figure 6. Selection of an optimal model is generally performed using cross-validation. In a) the data behaves ideally and the selection is straightforward. An asterisk indicates the selected number of components and optimal model. b) The best model is not readily apparent which is indicated by the question mark.

However, even with such approaches, the “true” model rank might be difficult to select. Paper V, introduces an alternative, new strategy to find the model rank without any exclusion of data and with more objective guidance. This method is based on a randomisation test (also referred to as permutation test) for cov(t,yp), where yp is the permuted response vector. The covariation between y and t is tested against the distribution of the randomised covariation. The test is performed for each calculated PLS component and will indicate how many components to use in the model.

88..33..22.. IInntteerrpprreettaattiioonn ooff PPLLSS MMooddeellss

PLS MODEL IS INTERPRETED in a similar way to a PCA model, by plotting score vectors, t(A), and the corresponding

weight vector, w(A). A notable difference is that the directions are related to both the X and the Y data. An issue associated with PLS is that the directions seen in score space are influenced by variation in X which is

A

Page 34: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

34

not related (orthogonal) to Y. Thus, interpretation of single y in score space usually involves more than one component. For this reason the extension method OPLS, which was developed by Trygg et al. (2002), is preferably used.48

88..44.. OOrrtthhooggoonnaall PPrroojjeeccttiioonnss ttoo LLaatteenntt SSttrruuccttuurreess

RTHOGONAL PROJECTIONS TO LATENT structures (OPLS) is related to PLS regression but with OPLS the

systematic variation in X is separated into two elements, one that is correlated (linearly related) to Y and one that is uncorrelated (orthogonal) to Y. The extension step in the NIPALS algorithm involves a rotation to find variation in X which is uncorrelated to Y.48, 97 The calculations are performed as follows for single y, where the first five steps are the same as for PLS. For multi Y, this algorithm is further modified.

1) w= XTy/(yTy) 2) w=w/║w║ *w is normalized to the length of 1 3) t=Xw 4) c=yTt/(tTt) 5) p=XTt/(tTt) 6) wo=p-w/║p-w║ 7) to=Xwo 8) po=XTto/(to

Tto) 9) Xnew=X-topo

T Return to step 1 and use X=Xnew.

This algorithm is repeated until no further systematic orthogonal

variation is observed i.e. ║p-w║ is equal to or close to zero or Q2(y) is optimized. A geometrical illustration of the differences between PLS and OPLS in terms of model interpretation in score space is provided in Figure 7.

.

O

Page 35: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

35

Figure 7. Illustration of geometrical differences between a) PLS-DA score space and b) OPLS-DA score space.

The representations of X and Y are described in a model of both data sets as OPLS model of X: X= TpPp

T +ToPoT + E Eq. 11

OPLS model of Y: Y=TpCpT+F Eq. 12

Hence, the OPLS model comprises two blocks of modelled

variation, the Y-predictive block (TpPpT), which represents the variation

related to the response vector Y, and the Y-orthogonal block (ToPoT), also

referred to as the uncorrelated variation. The OPLS solution is closely related to ‘target rotation’ originally developed by Kvalheim and Karstang.97

88..44..11.. IInntteerrpprreettaattiioonn ooff OOPPLLSS MMooddeellss

HE INTERPRETATION OF AN OPLS MODEL is similar to the interpretation of a PLS model but easier since the

meaning of the model parameters is more transparent. Both visualization tools and model diagnostics are separated into predictive and orthogonal variation so that they can be interpreted independently. This separation facilitates better model interpretation in comparison to PLS. The score

T

Page 36: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

36

vectors and the loading vectors are separated into predictive vectors, tp and pp, and orthogonal vectors, to and po, as shown in figure 8.

Figure 8. OPLS model representation for single y. The systematic variation is separated into one predictive, tppp, and orthogonal, ToPo components.

The estimation of explained variation, R2(X), is also separated into two elements. The first, R2(X)p, is the magnitude of predictive variation in X related to Y, and the second element, R2(X), is the total variation explained by the model (i.e. including both the predictive and orthogonal variation).

Model visualization and diagnostics are improved by OPLS. However, a risk in all data driven modelling is the over-fitting, this was also was discussed in section 8.3.1. In OPLS the predictive score vector will change when more components are added and if the data are over-fitted the score vector, tp can appear to be perfectly correlated to the response vector y. From the modelled score plot itself it is impossible to distinguish a very good model from an over fitted model. In Paper II, visualization and validation of OPLS class models are discussed, and it is recommended that instead of using the regular score plot, a cross-validated score plot98 is used in combination with Q2(y) as a tool for the validation of class separation as this plot will give better discrimination between over-fitted models and very good models. Nevertheless, even though CV-score plots are valuable since they provide model transparency, the best validation is always performed using an external test set, preferably sampled and analysed on a different occasion.

Page 37: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

37

88..44..22.. PPuuttaattiivvee BBiioommaarrkkeerr IIddeennttiiffiiccaattiioonn

NE DIFFICULTY IN METABOLOMICS studies are the extraction of putative biomarkers (or biochemically

interesting compounds). One source to this arise from that metabolomic data with a large dynamic range in concentration will usually have few metabolites with a large concentration and few with low concentration. This will result in a data where the variables have unequal variance due to unequal concentration. Moreover, the data also include experimental errors, noise and artefacts that could have a negative impact on the model quality. For these reasons different pre-treatments are normally tested before selection of the model. The most frequently used pre-treatments are: 1) centering, which only removes the average from each variable and 2) centering + unit variance scaling (UV scaling), which is a combination of centering and a step in which all variables are divided by the standard deviation (SD) and 3) pareto scaling, which is a softer scaling than UV and divides the data by the square root of the standard deviation (SD) rather than by the SD.

If the best model between X and y is established using centered data prior to modelling, the resulting loading vector, pp, is the modelled covariance, as in equation 13. The modelled covariance is slightly different from the covariance between X and y, (equation 14). The modelled covariance is based on the OPLS modelled score vector, tp. If the model is good, tp and y will be highly correlated, making the difference between modelled covariance and the covariance small.

However, with centered data the loading vector, pp, depends on the concentration (or intensity) of the variables. This might lead to a selection of putative biomarkers biased towards high concentration which in turn might lead to incorrect conclusions. With UV scaling the loading vector is the modelled correlation between X and y, which is independent of concentration (equation 15). As previously mentioned, noise and artefacts is common in metabolomics data and if these factors are equally weighted for all variables, UV scaled models will be more sensitive to the noise as they might introduce erroneous correlations. These correlations are also referred to as spurious correlations i.e. correlations that are caused by something other than the modelled response.

O

Page 38: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

38

Consequently, UV scaling is not always used in metabolomics and other MVA studies in order to avoid problems that can arise from spurious correlations. Modelled covariance: Cov(tp,X)=(tp

TX) /(N-1) Eq. 13 Covariance: Cov(y,X)= (yTX)/(N-1) Eq. 14 Modelled correlation: Corr(tp,X)= Cov(tp,X)/stsX Eq. 15 Correlation: Corr(y,X)= Cov(y,X)/sysX Eq. 16

where s represents the estimated standard deviation. A new approach to handle these problems is discussed in Paper II, in which the modelled covariance and modelled correlation is combined in a scatter plot, S-plot, and proposed as a tool for visualization and interpretation of classification models.

88..55.. BBaattcchh PPrroocceessssiinngg

ULTIVARIATE STATISTICAL BATCH PROCESSING, BP, is a well established method in industry and is used to

monitor time-related data.99-101 The BP strategy is a two level, projection-based method applied to 3-dimensional data. On the first level, observation level, one batch is represented by several time points and PLS regression is used to monitor time dependencies. At the second level, batch level, PCA or PLS is used to relate the individual batches by their score values calculated from the first level.

Since metabolomic studies often generate variation caused by time dependencies, BP has been proposed as a strategy to monitor how, for example, rats are affected by drugs over time or, as described in Paper I, how the growth process can be monitored in modified poplar plants and compared to WT plants. In poplar plants, time can be substituted by the plants’ internodes since these represent growth linked to time. These internodes are located in between two subsequent leaves on the stem. The BP strategy for plants is illustrated in figure 9.

M

Page 39: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

39

Alternative approaches to the interpretation of 3D data are described in several papers and are generally referred to as multi-way methods, such as Tucker3,102 PARAFAC103, 104 and N-PLS.105

Figure 9. Illustration of BP, using plants collated by their internodes. a) At the observation level the X data are arranged in three dimensions i.e. plants, internodes and NMR data. This matrix is unfolded into two directions where each plant is represented j times according to the number of internodes. The response vector y reflects the internodes and is used in PLS regression. b) At the batch level, the score values from the observation level are rearranged so that each observation is one plant and the variables are the score values from all components.

Page 40: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

40

88..66.. MMuullttiivvaarriiaattee AAnnaallyyssiiss oonn NNMMRR DDaattaa

MR IS AN VERY VERSATILE ANALYTICAL technique where the utility of the data has been extended to include

analysis of complex biofluids in combination with multivariate analysis. The interpretation of a 1H NMR spectrum collected from biofluids can be very complex due to the large number of compounds in the mixture which also often is overlapping. Regular spectral integration or visual inspection of such data includes a priori selection of peaks, which is subjective, impractical, cumbersome and subtle changes will be difficult to detect. Powerful MVA methods have been developed to interpret such complex data structures.

88..66..11.. HHiissttoorryy

VA ON NMR DATA CAN BE TRACED BACK TO 1977, when Michael Sjöström and Ulf Edlund were pioneers of the

field and used 13C NMR in pattern recognition studies of norbornanes with different exo and endo 2-substituents.106 This started an era of chemical characterisation of solvent and substituent effects on NMR data using pattern recognition tools such as PCA and PLS.107-111 The analysis of more complex samples started in the early 1990s when 13C NMR was used to study crude oil samples from the North sea.112, 113 In the late 1980s, Hans Grahn et al extended MVA techniques from 1D NMR to 2D NMR in order to analyse DNA and RNA using NOESY and COSY data.114-116 Their method was based on a single 2D spectrum and their ultimate goal was to develop automated tools for spectral assignments of macromolecules. Another extension of 2D NMR and MVA is the use of 1H-13C HSQC from biofluids resulting from multiple samples. Anders Berglund et al demonstrated this by interpretation of the 2D NMR time-domain data from multiple samples to study the folding of the MerP protein.117 In Paper IV their method is extended to the application of multiple frequency domain (Fourier Transformed) HSQC NMR spectra. The data analysis was improved by creating the capability to select or exclude spectral regions. This method facilitate the interpretation of multivariate models when compared to 1D NMR.

Although NMR in combination with MVA entered the scientific field 30 years ago it has only recently become accepted and established. Today many examples of its use can be found in the metabolomics

N

M

Page 41: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

41

literature, including pattern recognition in complex biofluids in clinical studies, toxicity analyses and plant studies.9, 15, 74, 75, 118-121 One of the pioneers in the metabonomics field is Jeremy Nicholson, who started the era with analysis of blood plasma and urine samples by 1H NMR.122-124 The original purpose of applying multivariate methods to NMR data from biofluids was to observe trends and to characterize the objects according to their origins (genetic modification or toxic response).9 It is still a tool of interest for observing trends and groupings, but the scientific focus has changed their interest to the detection and understanding of putative biomarkers and the creation of diagnostic tools for diseases.

88..66..22.. MMuullttiivvaarriiaattee AAnnaallyyssiiss aanndd NNMMRR TTooddaayy

LTHOUGH NMR HAS BEEN SUCCESSFULLY APPLIED in combination with multivariate analysis, data handling is a

critical step in ensuring successful (or even useful) results. One complication associated with the data is related to the shifting of peaks or alignment problems, i.e. cases in which a peak from the same compound in all samples is shifted to slightly different frequencies. Such peak shifts can be related to the sample pH or the salt concentration. Phasing errors can also be disturbing as well as noise, variation in line-width, artefacts, baseline errors and temperature variations. These errors are usually not easy to correct, but some papers can be found in the literature in which automated algorithms are proposed to address some of the causes.125-131

An alternative strategy to handle variation in chemical shift, phasing problems and line-width, is by spectral binning also called bucketing. This is commonly applied in metabolomics studies. Bucketing divides the NMR data into a number of segments by integration over a small spectral width (usually 0.01-0.04 ppm). This procedure reduces the spectral resolution so that resonances from different compounds might be re-aligned in the same region, which is a possible disadvantage for data interpretation. Figure 10 illustrates the effects of bucketing. However, bucketing also simplifies the data by reducing its size and it smooths out alignment errors and line-shapes between different samples.

A

Page 42: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

42

Figure 10. Illustration of the effects after bucketing a) NMR data results in loss of spectral resolution. The spectrum below is bucketed while the above has original resolution, which can be contrasted to b) an image with different number of pixels. Picture courtesy of Jack Newton at Chenomx.

An additional complication, that can irritate NMR spectroscopists when analysing NMR data using MVA methods, is the interpretation of loading vectors from a model based on UV scaling. The problem is that the loading plot has distorted line shapes, which make interpretation more complicated. NMR spectroscopists prefer to visualize data that resemble a regular spectrum. However, when the model is UV scaled, the data no longer resemble a spectrum. Cloarec´ et al therefore proposed the STOCSY plot.98, 132 The STOCSY approach combines the OPLS modelled correlation, corr(tp,X), with the OPLS modelled covariance, cov(tp,X) in the same plot by colouring the covariance with the correlation, figure 11.

Page 43: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Chemometrics

43

Figure 11. STOCSY plot. This plot combines the modelled correlation with the modelled covariance from the OPLS predictive component. The cov(tp,X) is plotted and coloured by corr(tp,X). The colour in the plot spans between 0-1 where 1 is the highest correlation. This plot is derived from the same data discussed in section 9.1.

This strategy is very useful as information regarding both line shape and peak intensity (which is related to concentration), as well as their modelled correlation, will be visualized. This strategy can also be applied to other types of data such as GC/MS, as presented in Paper II.

Page 44: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

44

99.. RReessuullttss aanndd DDiissccuussssiioonn

HIS SECTION DESCRIBES five papers in which some of the problems addressed in previous chapters are discussed. In

Paper I, section 9.1, the handling of time dependencies from plants using 1H HR/MAS NMR data is discussed. In Paper II, section 9.2, the importance of proper visualization of an OPLS class model is discussed from the perspective of identifying putative biomarkers and model validation. In Paper III, section 9.3, MLR is used to evaluate DM measurements in poplar plants from HSQC 1H-13C data. In Paper IV, section 9.4, a tool to visualise and evaluate multiple Fourier transformed HSQC 1H-13C data by MVA is discussed. In Paper V, section 9.5, a new tool for the selection of components in a PLS model is tested and discussed.

99..11.. MMooddeelllliinngg ooff TTiimmee DDeeppeennddeenncciieess ((PPaappeerr II))

HIS STUDY WAS EXPERIMENTALLY DESIGNED to evaluate metabolic variation caused by a modification within

the PttMYB76 gene, wherein the WT line was compared to a single transgenic line. The PttMYB76 gene is believed to affect components of the phenyl propanoid synthetic pathway, which leads to the formation of S- and G-lignin, flavonoids and sinapate esters. Three replicated plants were used for each line. Each plant was divided into eight segments by their internodes and each internode was analysed using 1H HR/MAS NMR to produce the X data used in modelling. The advantage with HR/MAS NMR compared to solution NMR is that sample preparation is very simple. No extractions are needed, instead the sample was ball milled and diluted in D2O to a slurry prior to 1H HR/MAS NMR measurement. Since interest was to identify changes in the internode direction also related to genetic perturbations, batch processing, BP, seemed to be an interesting approach.

A PLS based BP model was derived using 1H HR/MAS NMR results as X data and internodes as the response vector, y. The principle of BP is to create a reference model using one of the plants and then to compare other plants to the reference model. In this study the transgenic plants were used to create the reference system (N=24) and the WT poplar

T

T

Page 45: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

45

plants were tested. To test and compare the WT to the modelled reference plant, the samples were predicted onto the model. Deviating behaviour from the WT plant was then visualized both in scores and residual (DmodX) plots.

In this example, the WT plants seemed to behave identically to the transgenic plants all the way to internode 5 (figure 12). After the fifth internode, the WT plants deviated from the transgenic growth direction. However, in the DmodX residual plot, the WT plants appeared to behave differently after the first internode.

This observed behaviour led to the generation of a BP model based on both types of plant in which, at the batch level, the score values calculated from the observation level were used as new descriptor variables, XB, for each of the six individual plants. Consequently, at the batch level each plant was represented by one observation.

.

Figure 12. Batch processing model of transgenic plants in a) The average trajectory PLS for the first score vector. Wild type plants predicted on the BP model at the observation level (broken lines). Deviating behaviour in WT plants at internode 6 and beyond. b) The DModX plot revealed deviating behaviour at all internodes except internode 1.

PCA at the batch level revealed a separation between the WT and transgenic lines, as illustrated in figure 13. In order to interpret this separation, the loadings from the batch level and the loadings from the observation level were used simultaneously. The finding from the BP model was that the WT and transgenic plants have a different correlation

Page 46: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

46

structure. These differences were at the lower internodes (4-8) primarily related to acetyl groups that were higher in abundance in the WT plant (methoxy groups were more abundant in the transgenic plant) as visualised in figures 13b and 14b. At the higher internodes (1-3) the separation was related to aromatic compounds that were more abundant in the transgenic plant. Some separating peaks could also be seen in the sugar region (figures 13b and 14c).

One possible assignment from these observations was that the dominating methoxy group found in the transgenic plant could be associated with lignin compounds, for example, syringyl complexes.

Figure 13. BP model on the batch level. a) PCA score plot indicating a class separation b) loading plot, p(1), presenting the cause for the class separation.

Page 47: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

47

Figure 14. Interpretation of loadings from the observation level. a) PLS loading vector w*(1). b) Separating peaks at the lower region of the plant can be seen in loading vector 2, p(2). c) Peaks causing separation in the upper region can be seen in loading vector p(4).

Page 48: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

48

99..22.. OOPPLLSS--DDAA MMooddeell VViissuuaalliizzaattiioonn ((PPaappeerr IIII))

HIS STUDY WAS EXPERIMENTALLY DESIGNED to evaluate metabolic alternations associated with a modification

within the PttPME1 gene, wherein the WT line was compared to transgenic line 2B (up-regulated) and line 5 (down regulated). For each line, internode 42 was collected from 7-10 biological replicates. The analysis of these samples was performed using GC/MS. In Paper II, the variables in the X data comprised resolved and integrated profiles using H-MCR for which most compounds were identified through a library search. Two OPLS-DA models were calculated, in each of which one transgenic line was compared to the WT line.

As discussed in section 8.3, the validation of an OPLS model must be performed with particular caution since the score vectors will change when more components are added. In order to avoid misleading interpretation from the modelled score plots, cross-validated score plots were used, as indicated in figure 15. The plot includes both the modelled score value and the CV score value based on full cross- validation. In combination with Q2(y) this made the validation more transparent since samples that deviate are clearly visualized. For the model, in which line 5 was compared to WT, the predictive ability, Q2(y), was 77%. For the second model, in which line 2B was compared to WT, Q2(y) was 64%.

Figure 15. Cross-validated score plots from two models. a) Line 5 vs WT and b) Line 2B vs WT.

T

Page 49: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

49

One other problem in metabolomics studies is the selection of putative biomarkers. It was suggested to use the S-plot in combination with the jack-knifed confidence interval to select putative biomarkers. The S-plot combines the modelled covariance, cov(t,X), with the modelled correlation, corr(t,X) and can both be useful for the predictive component as well as the orthogonal components (figures 16 and 17). This is the same as combining the contribution (cov(t,X)) which relates to peak intensity, with the reliability or effect (corr(t,X)). This combination is especially important when the variables in X do not have equal variance and UV-scaling is not appropriate for model quality.

An additional concern for the selection of biomarkers is how to decide the cut-off limits using the suggested tools. Cut-off limits cannot be set based on correlation or covariation. The simple reason for this is that the limit depends on the stability/variability of the data. If the number of measured samples is sufficiently large, the data and the model stabilise, resulting in that very small correlations can be proved to be statistically true.133-135

When selecting putative biomarkers, it is necessary to address both reliability/variability (statistical significance) and the size of the effect. One approach to do this is using the jack-knifed confidence interval in combination with the modelled correlation, figure 16. The jack-knifed confidence interval is calculated as

CIJK=SEcv*t(α,df),

where SEcv is the standard error for each loading variable based on CV, t(α,df) is the statistical t-value based on a probability limit, usually set to α=0.05, with df degrees of freedom based on the number of CV rounds. Whenever selecting biomarkers purely by a t-test, which is very common in metabolomics studies, one should recall that a statistical p-value only displays the probability that the groups have similar average and not the actual significance of the size or effect.

Page 50: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

50

Figure 16. Visualization of OPLS-DA model from WT vs line 5 a) S-plot from the predictive component, cov(tp,X) vs corr(tp,X) b) some of the loading pp values (cov(tp,X)) with the Jack-knifed confidence interval, CIJK. The grey coloured metabolites were not considered significant.

Whichever method is used for the selection, a question still remains. Are the selected biomarkers biochemically significant? Unfortunately correlation is not synonymous with causation, and this question cannot be answered by any type of statistics.

.

Figure 17. The orthogonal (uncorrelated) variation can be visualized by the use of an S-plot, cov(to,X) vs corr(to,X).

Page 51: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

51

Whenever the interest is to compare two or more class models, the shared and unique structure plot, SUS-plot, is applicable. In this example the corr(tp,X) vectors are plotted for both classes, figure 18. Only the metabolites that were considered significant according to the S-plot, in combination with the confidence interval in the respective class model, were included to clarify the interpretation.

Figure 18. The SUS-plot is a useful tool to compare the outcome from two or three OPLS-DA models.

In this study some metabolites were shared between the two classes. One possible explanation for this is that the metabolites provide information regarding an unexpected stress reaction. Some unique metabolites were also found. For example, in line 5, malic acid and glucaric acid were up-regulated, while in line 2B inositol were up-regulated and fructose-1-phosphate were down-regulated. These findings are still being biologically validated and cannot be further discussed at present.

Page 52: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

52

99..33.. PPeeccttiinn MMeeaassuurreemmeennttss UUssiinngg HHSSQQCC NNMMRR ((PPaappeerr IIIIII))

HE EXTRACTION OF PECTIN FROM POPLAR plants is regularly performed using cyclohexanediamine tetraacetic acid

(CDTA) in aqueous solution. CDTA is a chelating agent that is used to improve extraction of pectin by its complexation to Ca2+. The CDTA solution will also extract many other water-soluble compounds. Global metabolite analysis, providing information (inter alia) on the pectin polymer, would thus be possible in a single NMR experiment. However, CDTA resonances will obscure many signals of the 1H-13C HSQC spectra, including metabolites, acetyl groups and the methyl group of the pectin polymer. Therefore the experimental design in Paper III was performed to investigate the possibility of reducing the concentration of CDTA in pectin extractions. However, the results of this experiment provided information that was of greater scientific interest than the extraction parameters. Consequently the experimental focus was switched to investigate the differences in the degree of methylation, DM, between poplar samples by comparing the WT line to the up-regulated lines 2B and 7, and the down-regulated line 5. These plants are described in section 9.2.

T

Page 53: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

53

Figure 19. Experimental design performed to investigate the variation of DM in pectin. As a first step phloem samples scraped from four trees were pooled. Two pools were prepared for each line. Each pool was divided into three samples and extracted using varying concentrations of CDTA. Finally the extracted samples were analysed by 2D NMR spectroscopy.

The extraction protocol was applied to two sets of phloem samples obtained from four plants for each poplar line. The pooled samples were each divided into three aliquots and extracted using 50, 30 and 10 mM CDTA, as illustrated in figure 19.

Following the HSQC measurements, the DM was determined by separately integrating the spectral regions of the 5´ signal for the carboxylate unit and the 5 signal for the esterified peaks, as illustrated in figure 20. The reason for selecting only the H5 resonances for volume integration is that these pectin signals are resolved as opposed to the other pectin resonances. The assignments shown in figure 20 correspond to the resonances previously reported in pectin62, 63 and the resonances obtained by analyzing HSQC 1H-13C spectra of two reference samples: 95%

Page 54: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

54

esterified pectin from citrus (Sigma-Aldrich) and polygalacturonic acid from orange (Sigma-Aldrich). The DM was calculated as

'55

5

III

DM+

= ,

where I5 is the integral of the esterified carbon five and I5’ is the de-esterified integral of carbon five in the pectin sample.

Figure 20. Pectin assignments and general regions for other poplar plant compounds. a) Example of a phloem poplar sample (line 5) extracted by CDTA and analysed by 1H-13C HSQC NMR. The grey resonances are assigned to the pectin polymer. b) The pectin carbons in the esterified unit are referred to as 1-5 and carbons in the de-esterified unit as 1´-5´.

The data were evaluated using multiple linear regressions, MLR, where the variables (x) and the response (y) were:

x 1: CDTA (50, 30, 10) x 2: Lines (WT vs one transgenic, a binary factor) y: DM%

Three MLR models were calculated. In each model one transgenic

line was compared to the WT line. This evaluation allowed the terms CDTA, lines CDTA*lines and CDTA*CDTA to be investigated.

The results clearly indicated that an increase in the concentration of CDTA correlated with a decrease in DM for all plants, figure 21 a-c

Page 55: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

55

Figure 21. Results from the DM study of the up- and down-regulated PttPME1 poplar plants. The data were evaluated by MLR and the bars shown in the figures represent the effects of the evaluated model terms. The model statistics are displayed in each corner of the model. Q2 represents the modelled predictive ability and the p-value represents the ANOVA test in which the modelled variation was compared to the residuals. In a) the down-regulated line 5 is compared to WT in b) the up-regulated line 7 is compared to WT and in c) the up-regulated line 2B is compared to WT. T1=CDTA, T2=Transgenic vs WT and T3=CDTA*line.

The results indicate that each of the samples contained pectins of

varying DM. It was also shown that 50 mM CDTA solution extracted a greater total amount of pectin than the weaker solutions (see figure 23, section 9.4). It can be concluded that pectins with a higher DM are less complex bound to Ca2+, which is why these pectins are easier to extract using a lower concentration of CDTA. Line 5 was observed to have a higher DM than WT. The results for the up-regulated lines (2B and 7) showed they had slightly lower DM than WT. These results were also supported by other data presented in Paper III.

Unfortunately, there is a high risk of losing some pectin fractions if the concentration of CDTA is reduced. Therefore it would be unwise to decrease the concentration of CDTA in order to improve the NMR signals of other plant metabolites.

Page 56: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

56

99..44.. HHSSQQCC NNMMRR aanndd MMVVAA ((PPaappeerr IIVV)) 1H NMR SPECTRUM FROM BIOFLUIDS can contain thousands of resonances originating from hundreds of

compounds. Severe overlap occurs in congested regions and the presence of additives and solvents can sometimes obscure important spectral features. The superior advantage with 2D NMR compared to 1D NMR data is the reduced spectral overlap and the added information from another nuclei. For HSQC this added information can be 13C or 15N. These advantages has the potential to facilitate the assignments of biomarkers as well improve the modelled correlation structure which will improve interpretation on a molecular level. In this study it was demonstrated how to use Fourier transformed 2D NMR data in OPLS regression. It is based on the analysis of raw spectral data without any volume integration as presented in the former study, section 9.3. The strategy for evaluating a number, N, of 2D NMR spectra using MVA analysis is illustrated in figure 22.

A

Page 57: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

57

Figure 22. General illustration of MVA of 2D NMR data. a) Each 2D NMR spectrum is unfolded to row vectors. The resulting X(N,K) matrix is represented by N observations and K=L*M variables. b) Exclusion of noise and additives X(N,K’) followed by MVA. c) Model interpretation by examining scores and loadings. The loading vector is back-transformed into a matrix of the same size as the original spectrum.

The first step is to FT all spectra in both the F1 and F2 directions followed by manual phasing and baseline correction. Before further pre-treatments, each 2D spectrum is unfolded in the F2 direction (figure 22a), generating a long row vector of size K=L*M; these vectors create the X[N,K] matrix which is imported for Matlab136 applications. Since the 2D spectrum can contain a significant quantity of noise and artefacts, which could complicate modelling due to spurious correlations, a noise threshold is set that depends on the variance of the noise. All peaks below the noise threshold are then automatically excluded (figure 22b), resulting in a reduced matrix, X[N,K´]. This procedure does not exclude phasing errors and thus some noise remains (depending on the threshold). In the example, the threshold was set to 2*variance of the noise calculated from a noise region of all samples, as illustrated in figure 23. The reduced X matrix can be used in any type of modelling, including OPLS regression

Page 58: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

58

as demonstrated below. The OPLS Scores plot can be directly interpreted (figure 22c). However, as the loading plot is represented as a row vector it must be back-transformed into a matrix of the same size as the original 2D spectrum (figure 22d).

Figure 23. Prior to modelling, noise is excluded. This is done by defining a threshold according to the variance of the noise, calculated from all samples used in the model. a) The variance of all variables from all samples used in the time model. b) An expanded region from a) indicating different noise limits. Black=max variance, grey=mean variance*2, white=mean variance. In the example, the noise exclusion threshold was set to the grey limit.

When the loading vector is back transformed it can be visualized with the same axes as in the original 2D spectrum.

In Paper IV, two models were demonstrated using OPLS regression. The response vectors, y, were the CDTA direction and the time direction, which were modelled against the HSQC data, X. The CDTA vector was as described in section 9.3 and the time vector utilised two consecutive experiments for the same observations at 75oC. The time direction model was comprised of X(16,372922) before the exclusion of noise, CDTA and TRIS and after their exclusion, to give X(16,21861). For the CDTA model, a sub-region was selected which did not include any interfering CDTA resonances or artefacts. This was done in order to avoid influence of additives on the model diagnostics. After noise exclusion, this matrix was of size X(12,6167). For model validation, line 5 data were excluded from the time model and were used as an external

Page 59: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

59

test set. In the CDTA model, line 2B data were excluded and used as an external test set.

It was concluded that, for the time axis, the exclusion of noise, CDTA and artefacts from CDTA peaks is important to ensure model quality. This is because such peaks would generate a high number of spurious correlations and there were very few time-related signals, only 2% predictive variation (R2(X)p). The samples were observed to have a common time direction when measured at high temperature, as shown in figure 24a. In the loading plot this variation can be interpreted as an increase of pectin peaks, as shown in figure 24b-d. This observation can be explained as the loss of viscosity and a depolymerisation reaction within the pectin polymer caused by heat.

Since the data were centered prior to OPLS, the predictive loading vector was the modelled covariance, cov(tp,X), which especially highlight peaks with a high intensity, as in figure 24b. In order to prevent incorrect interpretations, the cov(tp,X) loading vector was combined with the corr(tp,X) vector (figure 24c-d). The combination and interpretation were performed in a similar way to those described for the S-plot (Paper II), although the present S-plot was a combination of two figures, i.e. figures 24b and c).

Page 60: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

60

Figure 24. OPLS time model. a) Score plot. Line 5 data were used as the external prediction set; black corresponds to the first time point and red to the second. b) Loading plot, modelled covariance. Some of the disturbances that can be seen in the plot are caused by t1-noise arising from the CDTA resonances. c) The modelled correlation, between -1 (blue) and +1 (red), where the red correlations indicate an increase in time caused by temperature and blue a decrease. d) A region from (c) expanded to clarify the detail of the plot.

Page 61: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

61

In the CDTA model, it was observed that all pectin peaks increased with increasing concentrations of CDTA (figure 25).

Figure 25. OPLS model of CDTA. a) Score plot. Line 2B data (red) were used as the external prediction set. b) Loading plot, modelled correlation. The modelled correlation spans between -1 (blue) and +1 (red), where the red correlations indicate an increase due to CDTA.

It is hypothesized that this was a result of the complexation between the de-esterified pectin and Ca2+ which makes the increased concentration of CDTA important for improved extraction of pectins.

These examples demonstrate the high potential of OPLS regression on informative rich 2D NMR data. This type of models based on 2D NMR data will be important in studies on complex biological samples where the key aspect is to identify detailed chemical changes on a molecular level

Page 62: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

62

99..55.. PPLLSS CCoommppoonneenntt SSeelleeccttiioonn ((PPaappeerr VV))

HE PURPOSE OF THE RANDOMISATION TEST for selecting PLS components was to develop a method that would

be simpler, objective and each added component could be statistically proved to be true. The method was evaluated using 10 different datasets. For each dataset, PLS regression was calculated using two different software packages: SIMCA and Unscrambler. The difference between the programs lies in the procedure used to calculate cross-validation. In SIMCA, partial CV is used for PLS but in Unscrambler full CV is used. The two programs also have slightly different strategies for component selection. For this reason it was interesting to compare the outcome of number of components between the two software programs. The tests included ‘leave one out CV’, LOO, and 7-fold CV, as implemented in both SIMCA and Unscrambler. SIMCA applies a number of rules to confirm or reject the component. These rules were used in the study and are described below. 1. A component is significant according to Rule 1 when the amount of

predicted (cross validation) variance Q2(Y)>limit. For a PLS model, Limit 0= for models with more than 100 objects and Limit 0.05= for models with 100 objects or less.

2. A component is significant according to Rule 2 when at least one Y-variable for a PLS model has QV

2(Y)>limit. QV2 is the fraction from

one Y variable, i.e. QV2(y)=1-PRESS/SS(y), where PRESS is the

predicted residual sum of squares and SS is sum of squares for one response vector y. The total from all responses is Q2(Y).

A PLS component is not significant if: 1. Rule 1 or 2 is not fulfilled. 2. The data have insufficient degrees of freedom after previous

components, i.e. if (N-A) or (K-A) = 0. 3. The explained variance for Y (PLS) is less than 1% and no single Y-

variable has more than 3% (PLS) explained variance.

Unscrambler applies leverage correction as a “crude” selection rule. All of the strategies detailed above were compared to the new randomisation test.

T

Page 63: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

63

The results, summarised in table 1, indicated that in some cases the selection of model rank differed between the two software programs. The number of components selected by the randomisation test seemed to agree with the number of components identified by the software programs. The only rule that deviated from all other selection rules was the leverage correction applied in Unscrambler. However, it should be noted that this strategy is only recommended as a “quick-and-dirty” alternative to the more time-consuming cross validation. Table 1. Selected number of PLS components using LOO cross validation (CV), CV with r=7 segments, size of the eigenvalue, rules of significance (Rules), leverage correction and the new randomization test (Rand-test). Data set Pre-

treatment (+centering)

SIMCA Unscrambler

CV (LOO)

CV (r=7)

Eigenvalue Rules CV (LOO)

CV (r=7)

Leverage correction

Rand-test

Carra – 6 6 2 6 6 6 6 6-9 SNV 5 8 4 5 5 5 5 8-10 Water – 8 8 3 9 6 8 11 9 ISO-brightness

– 6 3-6 1 6 7 5 18 3-6

MSC 5 5 2 5 5 5 18 3-5 Gas oil – 2 2 3 2 2 2 2 5 RGB – 3 3 8 5 5 6 10 4 Wavelet – 7-10 7-10 4 6 5 4 6 10 Internodes – 1 1 3 1 1 1 6 1

Poplar class – 4 4 3 5 4 4 7 4

Kelder – 2 2 2 2 3 5 3 2 Hexapep UVS 3 – 2 3 11 – 4 4

To evaluate how close these methods and the randomisation test

were to the “true” model rank, four data sets were divided into calibration and test sets. The test set was used to calculate the root mean square sum of error of prediction, RMSEP, for each of the four data sets. The estimated RMSEP was calculated as

RMSEP=√((∑y-ypred)2/N), where y is the measured response and ypred the prediction from the model.

Page 64: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Results and Discussion

64

This evaluation suggested that the “true” model rank might not be after A number of components but rather after varying numbers of components, as summarised in table 2. Table 2. Table of RMSEP values for independent validation set. Data set Number of PLS components 1 2 3 4 5 6 7 8 9 10 Carra 18.5 14.9 7.30 7.04 6.04 3.21 3.11 2.35 1.79 1.85 Carra (SNV)

14.2 7.29 6.75 5.86 3.22 2.39 1.68 1.52 1.16 1.36

Water 9.17 6.99 6.70 6.07 6.04 5.47 5.39 5.43 5.44 5.31 Gas oil 0.18 0.10 0.085 0.055 0.045 0.039 0.039 0.037 0.039 0.038

Wavelet 0.27 0.24 0.22 0.23 0.23 0.23 0.22 0.22 0.22 0.22

What was interesting though regarding the randomisation test,

was the outcome of the designed dataset. Since designed datasets usually contain few samples, the exclusion of samples during CV can compromise cross validation. However, because the randomization test does not exclude samples, it can provide better analysis of the designed data set. Moreover, the randomization test could be used to indicate when the data require pre-treatment prior to modelling. The randomization test could also be useful in combination with other selection rules in order to simplify the selection of components.

Page 65: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Conclusions and Future Perspectives

65

1100.. CCoonncclluussiioonnss aanndd FFuuttuurree PPeerrssppeeccttiivveess

O UNRAVEL THE FUNCTION OF GENES in the underlying wood forming process is challenging task. To meet all complex

questions regarding wood formation, powerful analytical platforms and strategies for data assessment is crucial. Multivariate data analysis in evaluation of complex data is a powerful tool in the study of biological problems. MVA methods provide good visualization, which is of particular importance as it makes data transparent and thus facilitates the interpretation. In this thesis some useful visualization tools were presented and discussed. In Paper I modelling of time dependencies in poplar plants using 1H HR/MAS NMR spectroscopy was described. This method could be further developed for application to smaller-sub sets in order to account for non-linearity. HR/MAS NMR has the advantage that analysis can be performed without any extraction thus remaining the sample intact. In Paper II the difficulties of putative biomarker identification were discussed. The use of a statistical S-plot and the jack-knifed confidence interval were proposed as tools for biomarker identification. In Paper III, a new method was applied to measure the DM in pectin using 1H-13C HSQC NMR data. This method could be optimised to decrease the depolymerisation reaction in pectin. This would facilitate pectin quantification. The effect of CDTA upon pectin extraction was also evaluated. In Paper IV, the application of MVA to 2D frequency domain NMR data was presented. 2D NMR spectroscopy is a very powerful tool thanks to the superior resolution in the HSQC spectra compared to 1H NMR spectra. It also includes more information than a 1D spectrum by the second 13C dimension. These enhancements facilitate interpretation regarding bio-molecules on a molecular level. The possibility to remove additives which could interfere with the multivariate diagnostics is also facilitated by the increased resolution. A study could be conducted to compare the outcome from a 1D model to a 2D model. The exclusion of noise could be evaluated by wavelet compression. In Paper V a randomisation test was applied to the selection of PLS components. This method was proposed as an objective assessment of model complexity. The randomisation test agreed well with other rule-based strategies and can be used in combination with other selection rules for component selection. Its performance should be further evaluated using designed datasets.

This thesis deals with two important fields of plant research which aims to understand the function of wood formation: metabolomics and

T

Page 66: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Conclusions and Future Perspectives

66

pectin. The methods presented could preferably be extended to the analysis of lignin, cellulose etc. Examples of this are shown in figure 5 illustrating a HSQC NMR spectrum from dissolved plant cell-walls.79, 81 This new approach of analysing the plant cell-wall greatly increased the speed of sample preparation, it improved the recovery of lignin from the plant samples (compared to other extraction methods) and it enabled the simultaneous analysis of other cell wall components such as cellulose and hemicellulose. Using 2D NMR spectroscopy this also resulted in a more complex spectrum, where small subtitle changes between WT and transgenic plants were difficult to access with traditionally used methods. Within this field 2D NMR and MVA have a promising future.

Understanding the function of genes in trees is an important step towards the improvement of forest trees. Improved forest trees will not only meet the demands for an increased use of biomass, but also for wood fibres with chemical and physical wood properties suitable for different industrial processes. This could in turn meet the demands for a sustainable development

This thesis discusses methods and strategies to measure and evaluate perturbations in modified plants. The author does not aim to find the “perfect” wood fibre for a given situation. However, it is hoped that application of some of the developed strategies discussed in this thesis, in conjunction with spectroscopic analyses, will be helpful for plant biologists investigating xylogenesis and related phenomena.

Page 67: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Acknowledgements

67

1111.. AAcckknnoowwlleeddggeemmeennttss

Många år har passerat på Umeå universitet, närmare bestämt 12 år varav 11 år vid kemiska institutionen. Två underbara ungar har det också blivit under den tiden. Under årens lopp har mycket hänt och det mesta har varit roligt. Förutom allt jag har fått lära mig så har jag fått möjligheten att träffa många fantastiska forskare och vänner. Dessa möten har berikat min tid till något utöver vad jag förväntade mig innan jag började doktorera. Det finns många som bidragit till denna resa och några vill jag ge ett extra varmt Tack! Framförallt till min handledare Ulf Edlund för att du gett mig denna möjlighet och för att du alltid har en lugn men bestämd attityd. Till min biträdande handledare Michael Sjöström för att du alltid (verkar) har tid för frågor och diskussion. Även ett stort tack för all god honung. Till min biträdande handledare Johan Trygg för alla givande diskussioner. Du har varit en riktig klippa och inspirationskälla under resans gång. Även ett stort tack för att du tyckte att jag skulle kaxa till mig lite ☺. Till min biträdande handledare Thomas Moritz för att du alltid har tid för frågor och för att du kan tillrättavisa mig då jag försöker vara biolog. Dan Johnels för att du är så allsidig, du har svar på de flesta frågor. Dessutom är du en riktigt underhållande typ, alltid med roliga anekdoter i bakfickan. Ingmar Sethsson för att du är ett så otroligt härligt sällskap. Att få jobba med dig vid NMR spektrometern är ett rent nöje som är få förunnat. Dina kloka ord har också alltid fått mig på gott humör då läget känts uselt. Johan Trygg, Thomas Moritz och Henrik Antti för att ni sprider kunskap och inspiration över metabolomics området. Erik Johansson för att alltid är rak och ärlig. Du är också en otroligt bra drivkraft. Svante Wold och Lennart Eriksson för inspiration och stabilitet i normalfördelade diskussioner. Mattias Hedenström för ett roligt samarbete, alla givande diskussioner och hjälp vid NMR spektrometern. Du är en klippa! Alla i Kemometrigruppen genom tiderna för givande diskussioner och skidturer. Ingen nämnd ingen glömd!

Page 68: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Acknowledgements

68

Metabolomicsgruppen även denna grupp har delgivit en massa intressanta diskussioner och presentationer. Max för alla givande diskussioner inom kemometri och för att du alltid har ett leende på läpparna. David N för trevliga kemometri diskussioner och för en stark insats i fårhagen i somras. Vi skall försöka bli bättre på att komma till Innansjön. Innebandygänget för hårt och otroligt roligt spel. Hans S för att det är alltid lika kul att titta in i ditt och Max rum och störa lite. Man lär sig alltid något nytt och särskilt givande är alla ”kloka” ord, de får mig att landa. För korrekturläsning inom väl utvalda delar Dan, Max , Mattias, Erik och Lennart. Carina Sandberg för att inget hade funkat utan dig. Du har inte bara koll på administration och ekonomi utan allt annat som ingen annan orkar ta tag i. Ann-Helen för att du alltid hjälper till med mina reseräkningar och fakturor. UPSC medarbetarna Björn Sundberg, Ewa Mellerowicz och Gunnar Wingsle ert engagemang och intresse för NMR har betytt otroligt mycket för min egen inspiration i växtforskningen. John and Sally Ralph thanks for the wonderful time and great experience you gave me at my visit in Madison, Wisconsin. Mina rumskamrater: Knatti du är en underbar ”roomy” som har en ny ”världens bästa låt” och ”världens bästa grupp” minst varje vecka☺, Malin W alias Big evil du är faktiskt en ängel. Anton L alltid lika snäll som gullig. Organisk kemi visst är vi en större familj idag med hela kemiska institutionen, men inget illa menat jag måste tillägna detta tack till organisk kemi för det är trots allt denna avdelning jag spenderat 99% av min tid tillsammans med. Tack för att ni är så trevliga, för allt gott fredagsfika, för all djävuls java (kanske inte så gott, men i alla fall funktionellt) och för alla trevliga julluncher och övriga fester. Det har också varit en upplevelse att få se alla intressanta mattrender. I början var det kulinarisk pasta (makaroner) med bacon och sås (ketchup) som ägde. Idag är det en något hälsosammare stil som gäller då blodpudding med broccoli och äpple som äger lunchtrenden. Till er som ställer upp inför min mottagning och fest Elin C, Elin T, Knatti, Sussi (vilket inte skall förknippas med mig själv, jag kallas Suss), Lotta och Malin. Mina matglada vänner Anna-Karin, Malin, Annika, Lissan (vilande medlem), Stina (vilande medlem), Kristina, Linus, och nykomlingen

Page 69: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

Acknowledgements

69

Sussi. Vi har så otroligt härliga middagar tillsammans, mycket mat, vin och skvaller. En perfekt kombination för en fulländad kväll! Ullis och Nicke världens bästa vänner i Staaan. Det är få saker som är så underbart som att få sitta och blicka ut över Skellefteälven tillsammans med er en varm sommarkväll. Ni är så goa! Mina barndomsvänner Ullis, Anneli, Linda vi har alltid så roligt ihop. Tillsammans med er blir jag förevigat ung. Härliga och goa Ume´ vänner Lotta och Jonas, Anna-Karin och Björn, Anna och Marcus jag har alltid trevligt ihop med er Umeås bästa grannar Siv och Sigge, John och Therese, Tza och Wenchao. Fredric du är bäst. JAG ÄLSKAR DIG! Värdens underbaraste barn Oliver och Nova ni är mina livlinor. Vad vore livet utan er? Kenneth för att du är min bästa bror. Kul att du flyttar hem igen! Och sist men inte minst Mamma och Pappa utan er hade jag aldrig fixat dessa år. Ni har alltid ställt upp när tidsplanen inte gått ihop. Nu åker vi till Thailand och njuter...

Page 70: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

References

70

1122.. RReeffeerreenncceess

1. http://www.upsc.se. 2007 2. Mellerowicz, E. J., Baucher, M., Sundberg, B. and Boerjan, W.

Unravelling cell wall formation in the woody dicot stem. Plant Molecular Biology. 2001;47:239-274.

3. Sterky, F., Bhalerao, R. R., Unneberg, P., Segerman, B., Nilsson, P., Brunner, A. M., et al. A Populus EST resource for plant functional genomics. Proceedings of the National Academy of Sciences of the United States of America. 2004;101:13951-13956.

4. Hahn-Hagerdal, B., Galbe, M., Gorwa-Grauslund, M. F., Liden, G. and Zacchi, G. Bio-ethanol - the fuel of tomorrow from the residues of today. Trends in Biotechnology. 2006;24:549-556.

5. Higuchi, T. Biochemistry and molecular biology of wood. New York: Springer; 1997.

6. Simson, B. W. and Timell, T. E. Polysaccharides in cambial tissues of polulus tremuloides and Tilia americana. I. Isolation, fractionation and chemical composition of the cambial tissues. Cellulose Chemistry and Technology. 1978;12:39-50.

7. Simson, B. W. and Timell, T. E. Polysaccharides in cambial tissues of polulus tremuloides and Tilia americana. II. Isolation and structure of xyloglucan. Cellulose Chemistry and Technology 1978;12:51-62.

8. Nicholson, J. K., Lindon, J. C. and Holmes, E. 'Metabonomics': understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica. 1999;29:1181-1189.

9. Holmes, E., Nicholls, A. W., Lindon, J. C., Ramos, S., Spraul, M., Neidig, P., et al. Development of a model for classification of toxin-induced lesions using H-1 NMR spectroscopy of urine combined with pattern recognition. NMR in Biomedicine. 1998;11:235-244.

10. Lindon, J. C., Holmes, E. and Nicholson, J. K. Pattern recognition methods and applications in biomedical magnetic resonance. Progress in Nuclear Magnetic Resonance Spectroscopy. 2001;39:1-40.

11. Lindon, J. C., Nicholson, J. K., Holmes, E., Keun, H. C., Craig, A., Pearce, J. T. M., et al. Summary recommendations for

Page 71: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

References

71

standardization and reporting of metabolic analyses. Nature Biotechnology. 2005;23:833-838.

12. Marchesi, J. R., Holmes, E., Khan, F., Kochhar, S., Scanlan, P., Shanahan, F., et al. Rapid and noninvasive metabonomic characterization of inflammatory bowel disease. Journal of Proteome Research. 2007;6:546-551.

13. Brescia, A. M., Di Martino, G., Fares, C., Di Fonzo, N., Gehelli, S. P., Reniero, F., et al. Characterization of Italian Durum Wheat Semolina by Means of Chemical Analytical and Spectroscopic Determinations. Cereal Chemistry. 2002;79:238-242.

14. Catchpole, G. S., Beckmann, M., Enot, D. P., Mondhe, M., Zywicki, B., Taylor, J., et al. Hierarchical metabolomics demonstrates substantial compositional similarity between genetically modified and conventional potato crops. Proceedings of the National Academy of Sciences of the United States of America. 2005;102:14458-14462.

15. Bundy, J. G., Osborn, D., Weeks, J. M., Lindon, J. C. and Nicholson, J. K. An NMR-based metabonomic approach to the investigation of coelomic fluid biochemistry in earthworms under toxic stress. Febs Letters. 2001;500:31-35.

16. Dunn, W. B. and Ellis, D. I. Metabolomics: Current analytical platforms and methodologies. Trac-Trends in Analytical Chemistry. 2005;24:285-294.

17. Goodacre, R., Vaidyanathan, S., Dunn, W. B., Harrigan, G. G. and Kell, D. B. Metabolomics by numbers: acquiring and understanding global metabolite data. Trends in Biotechnology. 2004;22:245-252.

18. Hollywood, K., Brison, D. R. and Goodacre, R. Metabolomics: Current technologies and future trends. Proteomics. 2006;6:4716-4723.

19. Kitano, H. Systems biology: A brief overview. Science. 2002;295:1662-1664.

20. Fiehn, O., Kopka, J., Dormann, P., Altmann, T., Trethewey, R. N. and Willmitzer, L. Metabolite profiling for plant functional genomics. Nature Biotechnology. 2000;18:1157-1161.

21. Tweeddale, H., Notley-McRobb, L. and Ferenci, T. Effect of slow growth on metabolism of Escherichia coli, as revealed by global metabolite pool ("Metabolome") analysis. Journal of Bacteriology. 1998;180:5109-5116.

22. Raamsdonk, L. M., Teusink, B., Broadhurst, D., Zhang, N. S., Hayes, A., Walsh, M. C., et al. A functional genomics strategy

Page 72: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

References

72

that uses metabolome data to reveal the phenotype of silent mutations. Nature Biotechnology. 2001;19:45-50.

23. Oliver, S. G., Winson, M. K., Kell, D. B. and Baganz, F. Systematic functional analysis of the yeast genome. Trends in Biotechnology. 1998;16:373-378.

24. Fiehn, O. Metabolomics - the link between genotypes and phenotypes. Plant Molecular Biology. 2002;48:155-171.

25. Fiehn, O., Kloska, S. and Altmann, T. Integrated studies on plant biology using multiparallel techniques. Current Opinion in Biotechnology. 2001;12:82-86.

26. Ratcliffe, R. G. and Shachar-Hill, Y. Probing plant metabolism with NMR. Annual Review of Plant Physiology and Plant Molecular Biology. 2001;52:499-526.

27. Bligny, R. and Douce, R. NMR and plant metabolism. Current Opinion in Plant Biology. 2001;4:191-196.

28. Smith, L. L. Key challenges for toxicologists in the 21st century. Trends in Pharmacological Sciences. 2001;22:281-285.

29. Robertson, D. G. Metabonomics in Toxicology: A Review. Toxicological Sciences. 2005;85:809-822.

30. Robertson, D. G., Reily, M. D., Sigler, R. E., Wells, D. F., Paterson, D. A. and Braden, T. K. Metabonomics: Evaluation of nuclear magnetic resonance (NMR) and pattern recognition technology for rapid in vivo screening of liver and kidney toxicants. Toxicological Sciences. 2000;57:326-337.

31. Beckwith-Hall, B. M., Nicholson, J. K., Nicholls, A. W., Foxall, P. J. D., Lindon, J. C., Connor, S. C., et al. Nuclear magnetic resonance spectroscopic and principal components analysis investigations into biochemical effects of three model hepatotoxins. Chemical Research in Toxicology. 1998;11:260-272.

32. Eide, I., Neverdal, G., Thorvaldsen, B., Arneberg, R., Grung, B. and Kvalheim, O. M. Toxicological evaluation of complex mixtures: fingerprinting and multivariate analysis. Environmental Toxicology and Pharmacology. 2004;18:127-133.

33. Holmes, E., Loo, R. L., Cloarec, O., Coen, M., Tang, H. R., Maibaum, E., et al. Detection of urinary drug metabolite (Xenometabolome) signatures in molecular epidemiology studies via statistical total correlation (NMR) spectroscopy. Analytical Chemistry. 2007;79:2629-2640.

34. Holmes, E., Nicholls, A. W., Lindon, J. C., Connor, S. C., Connelly, J. C., Haselden, J. N., et al. Chemometric models for

Page 73: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

References

73

toxicity classification based on NMR spectra of biofluids. Chemical Research in Toxicology. 2000;13:471-478.

35. Pohjanen, E., Thysell, E., Jonsson, P., Eklund, C., Silfver, A., Carlsson, I. B., et al. A multivariate screening strategy for investigating metabolic effects of strenuous physical exercise in human serum. Journal of Proteome Research. 2007;6:2113-2120.

36. Hollywood, K., Brison, D. R. and Goodacre, R. Metabolomics: current technologies and future trends. Proteomics 2006;6:4716-4723.

37. Azmi, J., Griffin, J. L., Antti, H., Shore, R. F., Johansson, E., Nicholson, J. K., et al. Metabolic trajectory characterisation of xenobiotic-induced hepatotoxic lesions using statistical batch processing of NMR data. Analyst. 2002;127:271-276.

38. Wiklund, S., Karlsson, M., Antti, H., Johnels, D., Sjöström, M., Wingsle, G., et al. A new metabonomic strategy for analysing the growth process of the poplar tree. Plant Biotechnology Journal. 2005;3:353-362.

39. Jonsson, P., Stenlund, H., Moritz, T., Trygg, J., Sjostrom, M., Verheij, E. R., et al. A strategy for modelling dynamic responses in metabolic samples characterized by GC/MS. Metabolomics. 2006;2:135-143.

40. Box, G. E. P., Hunter, W. G. and Hunter, J. S. Statistics for experimenters an introduction to design, data analysis and model building: John Wiley & Sons; 1978:653.

41. Lundstedt, T., Seifert, E., Abramo, L., Thelin, B., Nystrom, A., Pettersen, J., et al. Experimental design and optimization. Chemometrics and Intelligent Laboratory Systems. 1998;42:3-40.

42. Trygg, J. and Lundstedt, T. Chemometrics Techniques for Metabolomics. In: J. C. Lindon, J. K. Nicholson and E. Holmes, eds. The Handbook of Metabonomics and Metabolomics. Amsterdam: Elsevier; 2007.

43. Eriksson, L., Antti, H., Gottfries, J., Holmes, E., Johansson, E., Lindgren, F., et al. Using chemometrics for navigating in the large data sets of genomics, proteomics, and metabonomics (gpm). Analytical and Bioanalytical Chemistry. 2004;380:419-429.

44. Trygg, J., Holmes, E. and Lundstedt, T. Chemometrics in metabonomics. Journal of Proteome Research. 2007;6:469-479.

45. Jackson, J. E. A Users Guide to Principal Components. New York: John Wiley; 1991.

Page 74: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

References

74

46. Wold, S., Esbensen, K. and Geladi, P. Principal Component Analysis. Chemometrics and Intelligent Laboratory Systems. 1987;2:37-52.

47. Wold, S., Trygg, J., Berglund, A. and Antti, H. Some recent developments in PLS modeling. Chemometrics and Intelligent Laboratory Systems. 2001;58:131-150.

48. Trygg, J. and Wold, S. Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics. 2002;16:119-128.

49. Wold, S., Johansson, E., Sjöström, M. and Cocchi, M. PLS, In: ESCOM Science, Ledien; 1993:523-550.

50. Holmes, E. and Antti, H. Chemometric contributions to the evolution of metabonomics: mathematical solutions to characterising and interpreting complex biological NMR spectra. Analyst. 2002;127:1549-1557.

51. Siew, C. K., Williams, P. A. and Young, N. W. G. New insights into the mechanism of gelation of alginate and pectin: Charge annihilation and reversal mechanism. Biomacromolecules. 2005;6:963-969.

52. Grant, G. T., Morris, E. R., Rees, D. A., Smith, P. J. C. and Thom, D. Biological interactions between polysaccharides and divalent cations the egg-box model. FEBS Letters. 1973;32:195-198.

53. Braccini, I. and Perez, S. Molecular basis of Ca2+ -induced gelation in alginates and pectins: The egg-box model revisited. Biomacromolecules 2001;2:1089-1096.

54. Immerzeel, P., Eppink, M. M., De Vries, S. C., Schols, H. A. and Voragen, A. G. J. Carrot arabinogalactan proteins are interlinked with pectins Physiologia Plantarum 2006;128 18-28

55. Ralet, M. C., Andre-Leroux, G., Quemener, B. and Thibault, J. F. Sugar beet (Beta vulgaris) pectins are covalently cross-linked through diferulic bridges in the cell wall Phytochemistry 2005; 66 2800-28.

56. Cumming, C. M., Rizkallah, H. D., McKendrick, K. A., Abdel-Massih, R. M., Baydoun, E. A. H. and Brett, C. T. Biosynthesis and cell-wall deposition of a pectin-xyloglucan complex in pea. Planta. 2005;222:546-555.

57. Zykwinska, A. W., Ralet, M. C. J., Garnier, C. D. and Thibault, J. F. J. Evidence for in vitro binding of pectin side chains to cellulose Plant Physiology 2005;139:397-407.

58. Micheli, F. Pectin methylesterases: cell wall enzymes with important roles in plant physiology. Trends in Plant Science 2001; 6: 414-419.

Page 75: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

References

75

59. Willats, W. G. T., Orfila, C., Limberg, G., Buchholt, H. C., van Alebeek, G., Voragen, A. G. J., et al. Modulation of the degree and pattern of methyl-esterification of pectic homogalacturonan in plant cell walls - Implications for pectin methyl esterase action, matrix properties, and cell adhesion Journal of Biological Chemistry. 2001;276 19404-19413

60. Gnanasambandam, R. and Proctor, A. Determination of pectin degree of esterification by diffuse reflectance Fourier transform infrared spectroscopy. Food Chemistry. 2000;68:327-332.

61. Maness, N. O., Ryan, J. D. and Mort, A. J. Determination of the degree of methyl esterification of pectins in small samples by selective reduction of esterified galacturonic acid to galactose. Analytical Biochemistry. 1990;185:346-352.

62. Grasdalen, H., Einar Bakoy, O. and Larsen, B. Determination of the degree of esterification and the distribution of methylated and free carboxyl groups in pectins by 1H-n.m.r. spectroscopy. Carbohydrate Research. 1988;184:183-191.

63. Rosenbohm, C., Lundt, I., Christensen, T. and Young, N. W. G. Chemically methylated and reduced pectins: preparation, characterisation by H-1 NMR spectroscopy, enzymatic degradation, and gelling properties. Carbohydrate Research. 2003;338:637-649.

64. Schultz, H. T. Determination of the degree of esterification of pectin. Methods in Carbohydrate Chemistry. 1962;5:189-194.

65. Levigne, S., Thomas, M., Ralet, M. C., Quemener, B. and Thibault, J. F. Determination of the degrees of methylation and acetylation of pectins using a C18 column and internal standards. Food Hydrocolloids. 2002;16:547-550.

66. Bloch, F., Hansen, W. and Packard, M. E. Nuclear induction. Physical Review. 1946;69.

67. Purcell, E. M., Torrey, H. C. and Pound, R. V. Resonance absorption by nuclear magnetic moments in a solid. Physical Review. 1946;69:37-38.

68. Ernst, R. R. Sensitivity enhancement in magnetic resonance. New York: Academic Press Inc.; 1966:1-135.

69. Ernst, R. R., Bodenhausen, G. and Wokaun, A. Principles of nuclear magnetic resonance in one and two dimensions. Oxford: Oxford science publications; 1987.

70. Ernst, R. R. Nobel Lecture. Nuclear magnetic resonance Fourier transform spectroscopy. Bioscience Report. 1992;12:143-87.

Page 76: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

References

76

71. King, R. W. and Williams, K. R. The fourier transform in chemistry. Part 1. Nuclear magnetice: Introduction. Journal of Chemical Education. 1989;66:A213-A219.

72. King, R. W. and Williams, K. R. The fourier transform in chemistry. Part 2. Nuclear magnetic resonance: the single pulse experiment. Journal of Chemical Education. 1989;66:A243-A248.

73. Keeler, J. Understanding NMR Spectroscopy. West Sussex: Wiley; 2005.

74. Brindle, J. T., Antti, H., Holmes, E., Tranter, G., Nicholson, J. K., Bethell, H. W. L., et al. Rapid and noninvasive diagnosis of the presence and severity of coronary heart disease using H-1-NMR-based metabonomics. Nature Medicine. 2002;8:1439-1444.

75. Choi, Y. H., Kim, H. K., Linthorst, H. J. M., Hollander, J. G., Lefeber, A. W. M., Erkelens, C., et al. NMR metabolomics to revisit the tobacco mosaic virus infection in Nicotiana tabacum leaves. Journal of Natural Products. 2006;69:742-748.

76. Wikberg, H. and Maunu, S. L. Characterisation of thermally modified hard- and softwoods by C-13 CPMAS NMR. Carbohydrate Polymers. 2004;58:461-466.

77. Elg-Christofferson, K., Hauksson, J., Edlund, U., Sjostrom, M. and Dolk, M. Characterisation of dissolving pulp using designed process variables, NIR and NMR spectroscopy, and multivariate data analysis. Cellulose. 1999;6:233-249.

78. Waters, N. J., Garrod, S., Farrant, R. D., Haselden, J. N., Connor, S. C., Connelly, J., et al. High-resolution magic angle spinning H-1 NMR spectroscopy of intact liver and kidney: Optimization of sample preparation procedures and biochemical stability of tissue during spectral acquisition. Analytical Biochemistry. 2000;282:16-23.

79. Lu, F. C. and Ralph, J. Non-degradative dissolution and acetylation of ball-milled plant cell walls: high-resolution solution-state NMR. Plant Journal. 2003;35:535-544.

80. Ralph, J., Lapierre, C., Marita, J. M., Kim, H., Lu, F., Hatfield, R. D., et al. Elucidation of new structures in lignins of CAD- and COMT-deficient plants by NMR. Phytochemistry. 2001;57:993-1003.

81. Ralph, A. S., Ralph, J. and Landucci, L. L. NMR Database of Lignin and Cell Wall Model Compounds. 2001:Printed version, contact US Forest Products Laboratory, One Gifford Pinchot Drive, Madison, WI 53705, Attn. Sally A. Ralph

Page 77: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

References

77

82. Kay, L. E., Keifer, P. and Saarinen, T. Pure absorption gradient enhanced hetero nuclear single quantum correlations spectroscopy with improved sensitivity. Journal of American Chemical Society 1992;114:10663-10665

83. Vandendool, H. and Kratz, P. D. A generalization of the retention index system including linear temperature programmed gas-liquid partition chromatography. Journal of Chromatography. 1963;11:463-471.

84. Herbert, C. G. and Johnstone, R. A. W. Mass spectrometry basics: CRC Press LLC; 2003.

85. Jonsson, P., Johansson, E. S., Wuolikainen, A., Lindberg, J., Schuppe-Koistinen, I., Kusano, M., et al. Predictive metabolite profiling applying hierarchical multivariate curve resolution to GC-MS datas - A potential tool for multi-parametric diagnosis. Journal of Proteome Research. 2006;5:1407-1414.

86. Workman, J. The state of multivariate thinking for scientists in industry: 1980-2000. Chemometrics and Intelligent Laboratory Systems. 2002;60:13-23.

87. Olsson, I. M., Gottfries, J. and Wold, S. Controlling coverage of D-optimal onion designs and selections. Journal of Chemometrics. 2004;18:548-557.

88. Olsson, I. M., Johansson, E., Berntsson, M., Eriksson, L., Gottfries, J. and Wold, S. Rational DOE protocols for 96-well plates. Chemometrics and Intelligent Laboratory Systems. 2006;83:66-74.

89. Höskuldsson, A. A Combined Theory for PCA and PLS. Journal of Chemometrics. 1995;9:91-123.

90. Wold, H. Estimation of principal components and related models by iterative least squares. In: P. R. Krishnaiah, ed. Multivariate Analysis. New York: Academic Press; 1966:391-420.

91. Jolliffe, I. T. New York: Springer; 1986. 92. Wold, S., Ruhe, A., Wold, H. and Dunn III, W. J. The Collinearity

problem in linear regression: The partial least squares approach to generalized inverses. SIAM Journal of Scientific Computing 1984;5:735-743.

93. Geladi, P. and Kowalski, B. R. Partial least-squares regression-a tutorial. Analytica Chimica Acta. 1986;185:1-17.

94. Wold, S. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics. 1978;20:397-405.

Page 78: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

References

78

95. Xu, Q. S. and Liang, Y. Z. Monte Carlo cross validation. Chemometrics and Intelligent Laboratory Systems. 2001;56:1-11.

96. Xu, Q. S., Liang, Y. Z. and Du, Y. P. Monte Carlo cross-validation for selecting a model and estimating the prediction error in multivariate calibration. Journal of Chemometrics. 2004;18:112-120.

97. Kvalheim, O. M. and Karstang, T. V. Interpretation of latent-variable regression-models. Chemometrics and Intelligent Laboratory Systems 1989:37-52.

98. Cloarec, O., Dumas, M. E., Craig, A., Barton, R. H., Trygg, J., Hudson, J., et al. Statistical total correlation spectroscopy: An exploratory approach for latent biomarker identification from metabolic H-1 NMR data sets. Analytical Chemistry. 2005;77:1282-1289.

99. Nomikos, P. and MacGregor, J. F. Multi-way partial least squares in monitoring batch processes. Chemometrics and Intelligent Laboratory Systems. 1995;30:97-108.

100. Wold, S., Kettaneh, N., Friden, H. and Holmberg, A. Modelling and diagnostics of batch processes and analogous kinetic experiments. Chemometrics and Intelligent Laboratory Systems. 1998;44:331-340.

101. Nomikos, P. and Macgregor, J. F. Multivariate spc charts for monitoring batch processes. Technometrics. 1995;37:41-59.

102. Tucker, L. Some mathematical notes on the three-model factor analysis. Psychometrica. 1966;31:279-311.

103. Bro, R. Multiway calibration. Multilinear PLS. Journal of Chemometrics. 1996;10:47-61.

104. Bro, R. PARAFAC. Tutorial and applications. Chemometrics and Intelligent Laboratory Systems. 1997;38:149-171.

105. Bro, R., Workman, J. J., Mobley, P. R. and Kowalski, B. R. Review of chemometrics applied to spectroscopy: 1985-95 .3. Multi-way analysis. Applied Spectroscopy Reviews. 1997;32:237-261.

106. Sjöström, M. and Edlund, U. Analysis of 13C NMR data by means of pattern recognition methodology Journal of Magnetic Resonance. 1977;25:258-297.

107. Musumarra, G., Scarlata, G. and Wold, S. Studies of substituent effects by 13C NMR spectroscopy. IV Application of principal component analysis to 13C-NMR shifts of (Z)-[alpha-(p-substituted-phenyl)-beta-(5-substituted-2-thienyl) acrylonitriles]. Gazetta Chimica Italia. 1981;111:499-502.

Page 79: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

References

79

108. Musumarra, G., Wold, S. and Gronowitz, S. Application of principal component analysis to 13C NMR shifts of chalcones and their thiophene and furan analogues: a useful tool for the shift assignement and for the study of substituent effects. Organic Magnetic Resonance. 1981;17:118-123.

109. Eliasson, B., Johnels, D., Wold, S. and Edlund, U. On the correlation between solvent scales and solvent-induced 13C NMR chemical shifts of a planar lithium carbanion. A multivariate data analysis using principal component- multiple regression-like formalism. Acta Chemica Scandinavica. 1982;B36:155-164.

110. Johnels, D. C-13 NMR substituent induced chemical-shifts - a comparison of an even alternant and a non-alternant aromatic system, using a multivariate data analytic approach. Chimica Scripta 1984;24

111. Brekke, T., Kvalheim, O. M. and Sletten, E. Prediction of physical-properties of hydrocarbon mixtures by partial-least-squares calibration of C-13 nuclear magnetic-resonance data. Analytica Chimica Acta. 1989;223:123-134.

112. Brekke, T., Barth, T., Kvalheim, O. M. and Sletten, E. Multivariate-analysis of C-13 nuclear magnetic-resonance spectra - Identification and quantification of average structures in petroleum distillates. Analytical Chemistry. 1990;62:49-56.

113. Urdahl, O., Brekke, T. and Sjoblom, J. C-13 NMR and multivariate statistical-analysis of adsorbed surface-active crude-oil fractions and the corresponding crude oils. Fuel. 1992;71:739-746.

114. Grahn, H., Delaglio, F., Delsuc, M. A. and Levy, G. C. Multivariate data-analysis for pattern-recognition in two-dimensional NMR. Journal of Magnetic Resonance. 1988;77:294-307.

115. Grahn, H., Edlund, U., van den Hoogen, Y. T., Altona, C., Delaglio, F., Roggenbuck, M. W., et al. Toward a computer-assisted analysis of noesy spectra - a multivariate data-analysis of an RNA noesy spectrum. Journal of Biomolecular Structure & Dynamics. 1989;6:1135-1150.

116. Grahn, H., Szeverenyi, N. M., Roggenbuck, M. W., Delaglio, F. and Geladi, P. Data-analysis of multivariate magnetic-resonance images .1. a principal component analysis approach. Chemometrics and Intelligent Laboratory Systems. 1989;5:311-322.

Page 80: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

References

80

117. Berglund, A., Brorsson, A. C., Jonsson, B. H. and Sethson, I. The equilibrium unfolding of MerP characterized by multivariate analysis of 2D NMR data. Journal of Magnetic Resonance. 2005;172:24-30.

118. Lindon, J. C., Nicholson, J. K., Holmes, E. and Everett, J. R. Metabonomics: Metabolic processes studied by NMR spectroscopy of biofluids. Concepts in Magnetic Resonance. 2000;12:289-320.

119. Lindon, J. C., Holmes, E. and Nicholson, J. K. Pattern recognition methods and applications in biomedical magnetic resonance. Progress in Nuclear Magnetic Resonance Spectroscopy. 2001;39:1-40.

120. Dumas, M. E., Wilder, S. P., Bihoreau, M. T., Barton, R. H., Fearnside, J. F., Argoud, K., et al. Direct quantitative trait locus mapping of mammalian metabolic phenotypes in diabetic and normoglycemic rat models. Nature Genetics. 2007;39:666-672.

121. Fan, T. W.-M. Metabolite profiling by one- and two-dimensional NMR analysis of complex mixtures. Progress in Nuclear Magnetic Resonance Spectroscopy. 1996;28:161-219.

122. Nicholson, J. K., Buckingham, M. J. and Sadler, P. J. High-resolution H-1-NMR studies of vertebrate blood and plasma. Biochemical Journal. 1983;211.

123. Nicholson, J. K., Timbrell, J. A. and Sadler, P. J. Proton NMR-spectra of urine as indicators of renal damage - mercury-induced nephrotoxicity in rats. Molecular Pharmacology. 1985;27.

124. Bales, J. R., Higham, D. P., Howe, J. K., Nicholson, J. K. and Sadler, P. J. Use of high-resolution proton nuclear magnetic-resonance spectroscopy for rapid multi-component analysis of urine. Clinical Chemistry. 1984;30.

125. Forshed, J., Schuppe-Koistinen, I. and Jacobsson, S. P. Peak alignment of NMR signals by means of a genetic algorithm. Analytica Chimica Acta. 2003;487:189-199.

126. Witjes, H., Melssen, W. J., Zandt, H., van der Graaf, M., Heerschap, A. and Buydens, L. M. C. Automatic correction for phase shifts, frequency shifts, and lineshape distortions across a series of single resonance lines in large spectral data sets. Journal of Magnetic Resonance. 2000;144:35-44.

127. Ernst, R. R. Numerical hilbert transform and automatic phase correction in magnetic resonance spectroscopy. Journal of Magnetic Resonance. 1969;1:7-26.

Page 81: Spectroscopic Data and Multivariate Analysis - DiVA portal140881/FULLTEXT01.pdf · Spectroscopic Data and Multivariate Analysis ~Tools to Study Genetic Perturbations in Poplar Trees~

References

81

128. Sotak, C. H., Dumoulin, C. L. and Newsham, M.-D. Automatic phase correction of fourier-transform NMR-Spectra based on the dispersion versus absorption (dispa) lineshapes analysis. Journal of Magnetic Resonance. 1984;57:453-462.

129. Ober, R. J. and Ward, E. S. Correcting for phase-distortion of NMR-spectra analyzed using singular-value decomposition of Hankel-Matrices. Journal of Magnetic Resonance Series A. 1995;114:120-123.

130. Brown, T. R. and Stoyanova, R. NMR spectral quantitation by principal-component analysis .2. Determination of frequency and phase shifts. Journal of Magnetic Resonance Series B. 1996;112:32-43.

131. Pedersen, H. T., Bro, R. and Engelsen, S. B. Towards rapid and unique curve resolution of low-field NMR relaxation data: Trilinear SLICING versus two-dimensional curve fitting. Journal of Magnetic Resonance. 2002;157:141-155.

132. Cloarec, O., Dumas, M. E., Trygg, J., Craig, A., Barton, R. H., Lindon, J. C., et al. Evaluation of the orthogonal projection on latent structure model limitations caused by chemical shift variability and improved visualization of biomarker changes in H-1 NMR spectroscopic metabonomic studies. Analytical Chemistry. 2005;77:517-526.

133. Cohen, J. What I Have learned (So Far). American Phychologist. 1990;45:1304-1312.

134. Cohen, J. Statistical power analysis for the behavioral sciences 2ed: Lawrence Erlbaum Associates; 1988.

135. Cohen, J. The Earth Is Round (p<.05). American Psychologist. 1994;49:997-1003.

136. MATLAB: The Mathworks, I., Natick, MA, USA.