
Clustering nutrients and toxic cyanobacteria communities using a self-organizing map (SOM)

www.lcbp.org

A bloom near Venise-en-Quebec in August, 2008.

Credit: Quebec Ministry of Sustainable Development, Environment and Parks.

Andrea Pearce (1), Donna Rizzo (1), Lori Stevens (2), Mary Watzin (3)

(1) School of Engineering, University of Vermont, Votey Hall, 33 Colchester Ave, Burlington, VT 05405

(2) Department of Biology, University of Vermont, Marsh Life Science, 109 Carrigain Dr, Burlington, VT 05450

(3) Rubenstein School of Environment and Natural Resources, University of Vermont, 324 Aiken Center, Burlington, VT 05405


Acknowledgements

•Vermont EPSCoR - Graduate Research Fellowship

•NSF - EAR Award #061154

•Casella Waste Management, EPA - Landfill Data

•Rubenstein Lab crew, VT ANR & volunteer monitors - Lake Champlain Data

•CSYS Group

Background Photo Credit: Larry Dupont www.lcbp.org

Big Picture Goal

Develop a computational method that uses microbial diversity and other data to describe spatial patterns in the environment

Lake Champlain Cyanobacteria Dataset

•Many correlated variables

•Data correlated in space and time

•Multiple indicators of water quality

•Observation of trends requires costly, long-term monitoring

•Lack of appropriate computational tools for:

  •data mining multiple data types

  •producing maps/estimations at multiple scales

Introduction: Computational Methodology

Self-Organizing Map (SOM) -

A clustering artificial neural network (ANN)

Why ANNs?

•Data-driven methods

•Exploit complex functional relationships in a dataset without explicitly defining them


Why the SOM?

•Unsupervised ANN

•Reduces data dimensionality

•Does not require a priori knowledge of the number of groupings or features of those groups

•Outperforms many clustering methods with noisy datasets


Kohonen 1990, Mangiameli et al. 1996

Method Demonstration: Kohonen Animal Example

       small  medium  big  2 legs  4 legs  hair  hooves  mane  feathers  hunt  run  fly  swim
dove       1       0    0       1       0     0       0     0         1     0    0    1     0
hen        1       0    0       1       0     0       0     0         1     0    0    0     0
duck       1       0    0       1       0     0       0     0         1     0    0    1     1
goose      1       0    0       1       0     0       0     0         1     0    0    1     1
owl        1       0    0       1       0     0       0     0         1     1    0    1     0
hawk       1       0    0       1       0     0       0     0         1     1    0    1     0
eagle      0       1    0       1       0     0       0     0         1     1    0    1     0
fox        0       1    0       0       1     1       0     0         0     1    0    0     0
dog        0       1    0       0       1     1       0     0         0     0    1    0     0
wolf       0       1    0       0       1     1       0     0         0     1    1    0     0
cat        1       0    0       0       1     1       0     0         0     1    0    0     0
tiger      0       0    1       0       1     1       0     0         0     1    1    0     0
lion       0       0    1       0       1     1       0     1         0     1    1    0     0
horse      0       0    1       0       1     1       1     1         0     0    1    0     0
zebra      0       0    1       0       1     1       1     1         0     0    1    0     0
cow        0       0    1       0       1     1       1     0         0     0    0    0     0

SOM Network Architecture

[Figure: each input x1, x2, x3, … x13 (e.g., Hair, 2-Legs, Hooves, Swim) is connected to node (i, j) of the 2-D Output Map by a weight w(i,j)1, w(i,j)2, w(i,j)3, … w(i,j)13.]

Minimize Euclidean Distance

$$ \left[ \sum_{k=1}^{K} \left( x_k - w(i,j)_k \right)^2 \right]^{1/2} $$
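The distance criterion on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `euclidean_distance` and `best_matching_unit` and the toy 2x2 map are assumptions made for the example. The node that minimizes the distance is usually called the best-matching unit (BMU).

```python
import math

def euclidean_distance(x, w):
    """Distance between input vector x and one node's weight vector w."""
    return math.sqrt(sum((xk - wk) ** 2 for xk, wk in zip(x, w)))

def best_matching_unit(x, weights):
    """Return ((i, j), distance) for the map node whose weights are closest to x.

    weights[i][j] is the K-element weight vector of node (i, j).
    """
    best, best_dist = None, float("inf")
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            d = euclidean_distance(x, w)
            if d < best_dist:
                best, best_dist = (i, j), d
    return best, best_dist

# Toy 2x2 output map with K = 3 inputs (illustrative values only).
toy_weights = [
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
]
bmu, dist = best_matching_unit([0.9, 0.1, 0.0], toy_weights)
```

Ties are broken in favor of the first node visited, which is an arbitrary choice for this sketch.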



$$ w(i,j)_k^{new} = w(i,j)_k^{old} + \alpha \left( x_k - w(i,j)_k^{old} \right) $$

where α is the learning rate.
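A small sketch of the weight update for a single node follows. The parameter names `alpha` (learning rate) and `h` (neighbourhood factor) are assumptions: the slide does not name them, and the neighbourhood factor reflects the standard Kohonen formulation rather than anything stated in this deck.

```python
def update_weights(w_old, x, alpha, h=1.0):
    """Return the updated weight vector for one map node.

    alpha: learning rate (assumed name).
    h: neighbourhood factor in [0, 1]; 1.0 for the best-matching node itself,
       smaller for nodes farther away on the map (standard SOM practice).
    """
    return [wk + alpha * h * (xk - wk) for wk, xk in zip(w_old, x)]

# With alpha = 0.5 each weight moves halfway toward the corresponding input.
w_new = update_weights([0.0, 1.0, 0.5], [1.0, 1.0, 0.0], alpha=0.5)
```

With `alpha = 0` the weights stay put, and with `alpha = 1` and `h = 1` they jump straight to the input, so values in between control how quickly the map adapts.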



SOM Network Training

•Present the input data vectors to the network one at a time

•Iterate through all input vectors a predetermined number of times

[Figure: input vectors (Duck, Fox, Horse, …, Lion) are presented one at a time to the 2-D Output Map during training.]
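The training procedure can be sketched end to end on the animal table. This is a hedged, minimal version: the 6x6 map size, the linearly decaying learning rate, the Gaussian neighbourhood, the 50 epochs, and all function names are assumptions chosen for illustration, since the slides do not state the settings actually used.

```python
import math
import random

# Binary features in the slide's column order: small, medium, big, 2 legs,
# 4 legs, hair, hooves, mane, feathers, hunt, run, fly, swim
ANIMALS = {
    "dove":  [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    "hen":   [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    "duck":  [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    "goose": [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    "owl":   [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0],
    "hawk":  [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0],
    "eagle": [0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0],
    "fox":   [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    "dog":   [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0],
    "wolf":  [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0],
    "cat":   [1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    "tiger": [0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0],
    "lion":  [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0],
    "horse": [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0],
    "zebra": [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0],
    "cow":   [0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
}

def distance(x, w):
    return math.sqrt(sum((xk - wk) ** 2 for xk, wk in zip(x, w)))

def find_bmu(x, weights):
    """Grid index (i, j) of the node whose weight vector is closest to x."""
    nodes = [(i, j) for i in range(len(weights)) for j in range(len(weights[0]))]
    return min(nodes, key=lambda ij: distance(x, weights[ij[0]][ij[1]]))

def train_som(data, rows, cols, n_epochs, alpha0=0.5, seed=0):
    """Train a rows x cols SOM; all hyperparameters here are illustrative."""
    rng = random.Random(seed)
    k = len(data[0])
    weights = [[[rng.random() for _ in range(k)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(n_epochs):              # predetermined number of passes
        decay = 1.0 - epoch / n_epochs
        alpha = alpha0 * decay                 # learning rate shrinks over time
        sigma = max(sigma0 * decay, 0.5)       # neighbourhood radius shrinks too
        for x in data:                         # present vectors one by one
            bi, bj = find_bmu(x, weights)
            for i in range(rows):
                for j in range(cols):
                    h = math.exp(-((i - bi) ** 2 + (j - bj) ** 2)
                                 / (2.0 * sigma ** 2))
                    w = weights[i][j]
                    for n in range(k):
                        w[n] += alpha * h * (x[n] - w[n])
    return weights

def quantization_error(data, weights):
    """Mean distance from each input vector to its best-matching node."""
    total = 0.0
    for x in data:
        i, j = find_bmu(x, weights)
        total += distance(x, weights[i][j])
    return total / len(data)

data = list(ANIMALS.values())
initial = train_som(data, rows=6, cols=6, n_epochs=0)   # untrained random map
trained = train_som(data, rows=6, cols=6, n_epochs=50)
placement = {name: find_bmu(vec, trained) for name, vec in ANIMALS.items()}
```

Animals with identical feature vectors (horse and zebra, owl and hawk, duck and goose) necessarily land on the same node; comparing `quantization_error` before and after training gives a quick check that the map has adapted to the data.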

Method Demonstration: Kohonen Animal Example

[Figure: output map Before Training and After Training]

Method Demonstration - Animal Example U-Matrix
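A U-matrix (unified distance matrix) summarizes a trained map by the distance between each node's weight vector and those of its neighbours; large values suggest cluster boundaries. The sketch below assumes a rectangular grid with four neighbours per interior node, which may differ from the lattice used for the figure on this slide, and the function name `u_matrix` is hypothetical.

```python
import math

def u_matrix(weights):
    """Mean Euclidean distance from each node's weights to its grid neighbours'.

    weights[i][j] is the weight vector of node (i, j) on a rectangular grid.
    """
    rows, cols = len(weights), len(weights[0])
    umat = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    a, b = weights[i][j], weights[ni][nj]
                    dists.append(math.sqrt(sum((p - q) ** 2
                                               for p, q in zip(a, b))))
            umat[i][j] = sum(dists) / len(dists) if dists else 0.0
    return umat

# Toy 1x3 map: the first two nodes are identical, the third is far away.
toy = [[[0.0], [0.0], [10.0]]]
toy_u = u_matrix(toy)
```

In the toy map the boundary between the second and third nodes shows up as the larger U-matrix values.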

Method Demonstration - Component Planes

Method Demonstration

Field Site - Schuyler Falls Landfill, NY

Mouser 2006

Describe the Dataset

hydrochemistry: Sp. Cond., Eh, pH, Turbidity, BOD, COD, Ammonia, Nitrate, Cations, Anions, Heavy Metals, Acetone, Benzene, Carbon Tetrachloride, …, Xylene


http://rdp8.cme.msu.edu/html/t-rflp_jul02.html

Mouser et al. 2005

Roling et al. 2001

Microbial Data from the Landfill

Community profiles of Archaea, Bacteria and Geobacteraceae

Relative abundance of "Operational Taxonomic Units" (OTUs) for each target

Community data from 25 monitoring wells

Reduce OTU data dimensionality by computing principal components; 75% of variance captured by:

2 PCs of Archaea

3 PCs of Bacteria

3 PCs of Geobacteraceae
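Choosing how many principal components to keep for a variance target can be sketched as a cumulative sum over the component variances (eigenvalues). The eigenvalues below are hypothetical, not taken from the landfill data, and the function name is an assumption.

```python
def n_components_for_variance(eigenvalues, target=0.75):
    """Smallest number of leading components whose variance share >= target."""
    total = sum(eigenvalues)
    running = 0.0
    for n, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev
        if running / total >= target:
            return n
    return len(eigenvalues)

# Hypothetical eigenvalues for one community profile (illustration only).
example_eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
n_pcs = n_components_for_variance(example_eigenvalues, target=0.75)
```

Applied separately to each target group, this kind of rule would yield a per-group component count like the ones listed on the slide.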

Method Demonstration - Data Organization

[Figure: SOM output map After Training. Node labels: C = Clean, F = Fringe, P = Polluted; a separate marker denotes an Unused Node.]

Non-Parametric MANOVA

F = Between Group Variability / Within Group Variability

How Many Clusters?

Anderson, M. J. (2001), McArdle, B.H. and M.J. Anderson (2001), Jones, D. (2003)

•Non-parametric

•Suitable for unbalanced designs

•Can be applied to any distance matrix
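The F ratio can be computed directly from a distance matrix, following the sums-of-squares partition described by Anderson (2001). The sketch below is a minimal illustration with made-up one-dimensional points; the function names and the permutation count are assumptions. To compare candidate numbers of clusters, the same function would be called once per candidate grouping.

```python
import random

def pseudo_f(dist, labels):
    """Pseudo-F = (SS_among / (a - 1)) / (SS_within / (N - a)).

    dist: symmetric N x N distance matrix; labels: group label per observation.
    """
    n = len(labels)
    groups = sorted(set(labels))
    a = len(groups)
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(dist[i][j] ** 2
                         for p, i in enumerate(idx) for j in idx[p + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permutation_p_value(dist, labels, n_perm=999, seed=0):
    """Share of label permutations with a pseudo-F at least as large as observed."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up example: two tight, well-separated groups on a line.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
labels = ["A", "A", "B", "B"]
f_stat = pseudo_f(dist, labels)
p_value = permutation_p_value(dist, labels, n_perm=199)
```

Because only distances enter the calculation, any dissimilarity measure can be substituted for the absolute difference used here.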

[Figure: F-Statistic (y-axis, 0 to 30) versus Number of Clusters (x-axis, 0 to 12)]

Preliminary Results

[Figures: SOM cluster maps for 2-Clusters, 3-Clusters and 4-Clusters; the 4-Clusters map is shown for "Bacteria & Geobacter Only" and for "Archaea, Bacteria & Geobacter".]

Landfill Application Preliminary Conclusions

•The SOM is effective at distinguishing a gradient of contamination at the landfill based on microbial communities.

•The spatial pattern of the groups generated by the algorithm agrees with hydrochemical analysis.

•Microbial communities may be able to serve as an advance indicator of migrating pollution.

•More knowledge of specific sub-groups of organisms (and primers to amplify relevant DNA) could improve the algorithm's ability to separate these groups.

Map courtesy of M. Watzin

Lake Champlain Dataset

Columns, by group:

Sample Collection Details: ID; datetime collected; location; rep; sample type

Algae Community Composition: diatoms; greens; chrysophytes; cryptophytes; dinoflagellates; euglenophytes; bacillariophyceae; indeterminate; Potential Toxin Producers; total

Cyanotoxin Concentration: Net, Plankton or Whole Water Toxin; Microcystin, ug/L

Lake Chemistry: Whole Water Chl, ug/L; Total P; Total N; SRP

32  7/2/07 13:20  Chapman Bay  1  net  8.3  15.7  4.4  0.0  570.2  598.6  ww  <0.05  3.25  27.41  0.77  7.0

33  7/2/07 13:20  Chapman Bay  1  net  8.3  15.7  4.4  0.0  570.2  598.6  wwp  0.006  3.25  27.41  0.77  7.0

36  7/2/07 14:25  Highgate Cliffs  1  net  5.5  84.0  0.0  0.0  680.6  775.9  wwp  0.009  0.00  46.76  1.04  20.3

37  7/2/07 14:25  Highgate Cliffs  1  net  5.5  84.0  0.0  0.0  680.6  775.9  ww  <0.05  0.00  46.76  1.04  20.3

38  7/2/07 14:25  Highgate Cliffs  2  net  2.6  44.3  0.0  0.0  929.6  976.4  ww  0.061  4.97  46.57  0.96  20.7

40  7/2/07 14:40  Highgate Springs  1  net  0.7  73.0  0.0  0.1  317.0  390.8  ww  <0.05  4.07  57.31  1.03  21.2

41  7/2/07 14:40  Highgate Springs  2  net  2.8  49.5  0.0  0.0  198.2  250.5  ww  0.062  4.07  56.76  1.16  21.1


Lake Champlain Dataset: Preliminary Analysis


Input Dataset:

All sampling locations in Missisquoi Bay ('03-'07)

All sampling dates (May - October)

356 samples total

Variables:

Total N (µg/L)

Total P (µg/L)

Chlorophyll (µg/L)

Anabaena (cells/mL)

Aphanizomenon (cells/mL)

Microcystis (cells/mL)

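Lake Champlain Dataset - BMU by Microcystin Conc.

[Figure: SOM output map with each best-matching unit (BMU) marked by microcystin concentration class. Legend:]

< 10 µg/L toxin (n = 339)

10 - 100 µg/L toxin (n = 14)

> 100 µg/L toxin (n = 3)

Unused Node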
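Labelling map nodes by toxin class amounts to binning each sample's microcystin concentration and tallying the classes per BMU. The sketch below is illustrative only: the BMU coordinates and the sample values are hypothetical, the function names are assumptions, and replacing a censored entry such as "<0.05" by its reporting limit is one possible convention rather than the authors' stated choice.

```python
from collections import Counter, defaultdict

def parse_concentration(value):
    """Convert a table entry such as '0.061' or '<0.05' to a float.

    Assumption: a censored entry ('<0.05') is replaced by its reporting limit.
    """
    text = str(value).strip()
    return float(text[1:]) if text.startswith("<") else float(text)

def toxin_class(conc_ug_per_l):
    """Map a microcystin concentration (ug/L) to the slide's three classes."""
    if conc_ug_per_l < 10:
        return "< 10"
    if conc_ug_per_l <= 100:
        return "10 - 100"
    return "> 100"

def classes_per_node(samples):
    """samples: iterable of (bmu, concentration) pairs; bmu is an (i, j) tuple."""
    counts = defaultdict(Counter)
    for bmu, conc in samples:
        counts[bmu][toxin_class(parse_concentration(conc))] += 1
    return counts

# Hypothetical BMU assignments and concentrations, for illustration only.
demo_samples = [((0, 0), "<0.05"), ((0, 0), "0.061"), ((2, 3), "150")]
node_counts = classes_per_node(demo_samples)
```

A node that never appears as a key in `node_counts` corresponds to an unused node on the map.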

Lake Champlain Dataset - Component Planes

[Figure panels: Anabaena, Aphanizomenon, Microcystis, Total Nitrogen, Total Phosphorus, Chlorophyll]

Lake Champlain Dataset - Ongoing Work

1) Create new input variables (Principal Components)

A. Lake Chemistry (N, P, Chl, H2O Temp)

B. Biology (Cyanobacteria Community)

C. Environmental Conditions (Air Temp, Cloud Cover)

2) Explore the spatial and temporal autocorrelation in the dataset

References

Canfield, D.E., B. Thamdrup, and E. Kristensen (2005). Aquatic Geomicrobiology. Elsevier, New York.

Anderson, M.J. (2001). A new method for non-parametric multivariate analysis of variance. Austral Ecology 26:32-46.

Jones, D. (2003). Fathom: a MATLAB toolbox for ecological and oceanographic data analysis. University of Miami RSMAS, Department of Marine Biology and Fisheries. Available at: http://www.rsmas.miami.edu/personal/djones. Accessed: 3 December 2008.

Roling, W.F.M., B.M. van Breukelen, M. Braster, B. Lin, and H.W. van Verseveld (2001). "Relationships between Microbial Community Structure and Hydrochemistry in a Landfill Leachate-Polluted Aquifer". Applied and Environmental Microbiology 67(10):4619-4629.

Grant, L.M., L.M. Muckian, N.J.W. Clipson, and E.M. Doyle (2006). "Microbial community changes during the bioremediation of creosote-contaminated soil". Letters in Applied Microbiology 44:293-300.

McArdle, B.H. and M.J. Anderson (2001). Fitting multivariate models to community data: A comment on distance-based redundancy analysis. Ecology 82(1):290-297.

Mouser, P.J., D.M. Rizzo, W.F.M. Roling, and B.M. van Breukelen (2005). "A multivariate statistical approach to spatial representation of groundwater contamination using hydrochemistry and microbial community profiles". Environmental Science and Technology 39:7551-7559.

Mouser, P.J. (2006). Improving Detection and Long-Term Monitoring Strategies for Landfill Leachate Contaminated Groundwater with Molecular-Based Microbiological Data Using Geostatistics and Artificial Neural Networks. Doctoral Dissertation, University of Vermont.

Mangiameli, P., S.K. Chen, and D. West (1996). "A comparison of SOM neural network and hierarchical clustering methods". European Journal of Operational Research 93:402-417.

Pace, N.R. (1997). "A Molecular View of Microbial Diversity and the Biosphere". Science 276(5313):734-740.
