Clustering nutrients and toxic cyanobacteria communities using a self-organizing map (SOM)
www.lcbp.org
A bloom near Venise-en-Quebec in August, 2008.
Credit: Quebec Ministry of Sustainable Development, Environment and Parks.
Andrea Pearce (1), Donna Rizzo (1), Lori Stevens (2), Mary Watzin (3)
(1) School of Engineering, University of Vermont, Votey Hall, 33
Colchester Ave, Burlington, VT 05405
(2) Department of Biology, University of Vermont, Marsh Life Science,
109 Carrigain Dr, Burlington, VT 05450
(3) Rubenstein School of Environment and Natural Resources, University
of Vermont, 324 Aiken Center, Burlington, VT 05405
Acknowledgements
•Vermont EPSCoR - Graduate Research Fellowship
•NSF - EAR Award #061154
•Casella Waste Management, EPA - Landfill Data
•Rubenstein Lab crew, VT ANR & volunteer
monitors - Lake Champlain Data
•CSYS Group
Background Photo Credit: Larry Dupont www.lcbp.org
Big Picture Goal
Develop a computational method to use
microbial diversity and other data to describe
spatial patterns in the environment
Lake Champlain Cyanobacteria Dataset
•Many correlated variables
•Data correlated in space and time
•Multiple indicators of water quality
•Observation of trends requires costly, long-term monitoring
•Lack of appropriate computational tools for:
  •data mining multiple data types
  •producing maps/estimations at multiple scales
Introduction: Computational Methodology
Self-Organizing Map (SOM) -
A clustering artificial neural network (ANN)
Why ANNs?
•Data-driven methods
• Exploit complex functional relationships in a
dataset without explicitly defining them
Why the SOM?
•Unsupervised ANN
•Reduces data dimensionality
•Does not require a priori knowledge of the
number of groupings or features of those groups
•Outperforms many clustering methods with
noisy datasets.
Kohonen 1990, Mangiameli et al. 1996
Method Demonstration: Kohonen Animal Example
animal  small  medium  big  2 legs  4 legs  hair  hooves  mane  feathers  hunt  run  fly  swim
dove        1       0    0       1       0     0       0     0         1     0    0    1     0
hen         1       0    0       1       0     0       0     0         1     0    0    0     0
duck        1       0    0       1       0     0       0     0         1     0    0    1     1
goose       1       0    0       1       0     0       0     0         1     0    0    1     1
owl         1       0    0       1       0     0       0     0         1     1    0    1     0
hawk        1       0    0       1       0     0       0     0         1     1    0    1     0
eagle       0       1    0       1       0     0       0     0         1     1    0    1     0
fox         0       1    0       0       1     1       0     0         0     1    0    0     0
dog         0       1    0       0       1     1       0     0         0     0    1    0     0
wolf        0       1    0       0       1     1       0     0         0     1    1    0     0
cat         1       0    0       0       1     1       0     0         0     1    0    0     0
tiger       0       0    1       0       1     1       0     0         0     1    1    0     0
lion        0       0    1       0       1     1       0     1         0     1    1    0     0
horse       0       0    1       0       1     1       1     1         0     0    1    0     0
zebra       0       0    1       0       1     1       1     1         0     0    1    0     0
cow         0       0    1       0       1     1       1     0         0     0    0    0     0
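As a minimal sketch, the attribute table above could be encoded as a binary matrix to serve as SOM input. The names `FEATURES`, `ANIMALS`, `X`, and `hamming` are illustrative assumptions, not part of the original presentation; the bit strings are copied row by row from the table.

```python
import numpy as np

# Column order follows the table header.
FEATURES = ["small", "medium", "big", "2 legs", "4 legs", "hair", "hooves",
            "mane", "feathers", "hunt", "run", "fly", "swim"]

# One 13-character bit string per animal, copied from the table rows.
ANIMALS = {
    "dove":  "1001000010010",
    "hen":   "1001000010000",
    "duck":  "1001000010011",
    "goose": "1001000010011",
    "owl":   "1001000011010",
    "hawk":  "1001000011010",
    "eagle": "0101000011010",
    "fox":   "0100110001000",
    "dog":   "0100110000100",
    "wolf":  "0100110001100",
    "cat":   "1000110001000",
    "tiger": "0010110001100",
    "lion":  "0010110101100",
    "horse": "0010111100100",
    "zebra": "0010111100100",
    "cow":   "0010111000000",
}

names = list(ANIMALS)
# X has one row per animal and one column per attribute (16 x 13).
X = np.array([[int(bit) for bit in bits] for bits in ANIMALS.values()],
             dtype=float)

def hamming(a, b):
    """Number of attributes on which animals a and b differ."""
    return int(np.sum(X[names.index(a)] != X[names.index(b)]))
```

Animals with identical attribute vectors (for example horse and zebra) present the same input to the network, so they necessarily map to the same output node.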
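SOM Network Architecture
[Figure: input nodes x1, x2, x3, …, x13 (one per attribute, e.g. Hair, 2-Legs, Hooves, Swim) connect to each node (i, j) of the 2-D output map through weights w(i,j)1, w(i,j)2, w(i,j)3, …, w(i,j)13.]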
SOM Network Architecture
Minimize Euclidean Distance
$$\left( \sum_{k=1}^{K} \left( x_k - w(i,j)_k \right)^2 \right)^{1/2}$$
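A minimal sketch of the distance-minimization step, assuming the weights are stored in a NumPy array of shape (rows, cols, K). The function name `best_matching_unit` and the array layout are assumptions made for illustration.

```python
import numpy as np

def best_matching_unit(x, weights):
    """Return the (i, j) index of the output node closest to input x.

    x       : input vector of length K
    weights : array of shape (rows, cols, K), one weight vector per node
    """
    # Euclidean distance between x and every node's weight vector.
    dist = np.sqrt(((weights - x) ** 2).sum(axis=2))
    # Index of the smallest distance, converted back to grid coordinates.
    return np.unravel_index(np.argmin(dist), dist.shape)

# Tiny example: a 2 x 2 map with K = 3 inputs.
weights = np.zeros((2, 2, 3))
weights[1, 0] = [1.0, 0.0, 1.0]
bmu = best_matching_unit(np.array([0.9, 0.1, 1.0]), weights)
```

Here the input is closest to the weight vector stored at node (1, 0), so that node is selected.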
SOM Network Architecture
$$w(i,j)_k^{\text{new}} = w(i,j)_k^{\text{old}} + \alpha \left( x_k - w(i,j)_k^{\text{old}} \right)$$
where α is the learning rate.
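The weight update can be sketched as follows. The learning rate `alpha`, the Gaussian neighborhood function, and its `radius` are assumptions commonly used with SOMs; the slides do not state which neighborhood function was used. At the winning node itself the neighborhood factor equals 1, so the update reduces to moving the weight a fraction `alpha` of the way toward the input.

```python
import numpy as np

def update_weights(weights, x, bmu, alpha=0.5, radius=1.0):
    """Move weights toward input x, most strongly near the winning node.

    weights : array of shape (rows, cols, K)
    x       : input vector of length K
    bmu     : (i, j) grid index of the winning node
    """
    rows, cols, _ = weights.shape
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    # Squared grid distance of every node from the winning node.
    grid_dist2 = (ii - bmu[0]) ** 2 + (jj - bmu[1]) ** 2
    # Gaussian neighborhood (an assumption): 1 at the winner, decaying outward.
    h = np.exp(-grid_dist2 / (2.0 * radius ** 2))
    # w_new = w_old + alpha * h * (x - w_old)
    return weights + alpha * h[:, :, None] * (x - weights)

old = np.zeros((3, 3, 2))
new = update_weights(old, np.array([1.0, 1.0]), bmu=(1, 1), alpha=0.5)
```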
SOM Network Training
•Present input data vectors to the network one at a time (Duck, Fox, Horse, …, Lion)
•Iterate through all input vectors a predetermined number of times
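The two training steps can be sketched as a loop. This is a minimal illustration, not the authors' implementation: the map size, the linearly decaying learning rate and neighborhood radius, the Gaussian neighborhood, and the names `train_som` and `map_to_nodes` are all assumptions.

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=50, alpha0=0.5, radius0=2.0, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows, cols, K)."""
    rng = np.random.default_rng(seed)
    k = data.shape[1]
    weights = rng.random((rows, cols, k))  # random initial weights
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    for epoch in range(epochs):            # predetermined number of passes
        frac = epoch / epochs
        alpha = alpha0 * (1.0 - frac)                  # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)      # shrinking neighborhood
        for x in data:                     # present input vectors one at a time
            dist = np.linalg.norm(weights - x, axis=2)
            bi, bj = np.unravel_index(np.argmin(dist), dist.shape)
            h = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2) / (2.0 * radius ** 2))
            weights += alpha * h[:, :, None] * (x - weights)
    return weights

def map_to_nodes(data, weights):
    """Return the winning (i, j) node for each input vector."""
    nodes = []
    for x in data:
        dist = np.linalg.norm(weights - x, axis=2)
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        nodes.append((int(i), int(j)))
    return nodes

# Toy data: two well-separated pairs of points.
data = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
weights = train_som(data)
nodes = map_to_nodes(data, weights)
```

After training, inputs from the two separated groups are expected to land on different nodes of the output map.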
Method Demonstration - Kohonen Animal Example
[Figure: SOM output map before training and after training.]
Method Demonstration - Animal Example U-Matrix
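A U-matrix summarizes, for each output node, how far its weight vector lies from those of its grid neighbors; large values mark boundaries between clusters. The sketch below assumes a rectangular grid with four-connected neighbors and a weight array of shape (rows, cols, K). The function name `u_matrix` is illustrative.

```python
import numpy as np

def u_matrix(weights):
    """Mean Euclidean distance from each node's weights to its grid neighbors."""
    rows, cols, _ = weights.shape
    u = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            dists = []
            # Four-connected neighborhood (an assumption; hexagonal grids differ).
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    dists.append(np.linalg.norm(weights[i, j] - weights[ni, nj]))
            u[i, j] = np.mean(dists)
    return u

# Example: the right-hand column of the map differs from the rest.
weights = np.zeros((3, 3, 2))
weights[:, 2] = 1.0
u = u_matrix(weights)
```

In this example the values are zero in the left column and largest along the boundary between the middle and right columns.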
Method Demonstration - Component Planes
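A component plane is the slice of the trained weight array belonging to a single input variable, viewed over the 2-D output map. A minimal sketch, assuming weights of shape (rows, cols, K) and a list of variable names in input order; `component_plane` is a hypothetical helper and the random weights stand in for a trained map.

```python
import numpy as np

FEATURES = ["small", "medium", "big", "2 legs", "4 legs", "hair", "hooves",
            "mane", "feathers", "hunt", "run", "fly", "swim"]

def component_plane(weights, names, variable):
    """Return the (rows, cols) grid of weights for one input variable."""
    return weights[:, :, names.index(variable)]

# Placeholder weights standing in for a trained 3 x 3 map with 13 inputs.
weights = np.random.default_rng(0).random((3, 3, len(FEATURES)))
swim_plane = component_plane(weights, FEATURES, "swim")
```

Plotting each plane as a heat map shows where on the output map a given variable takes high or low values.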
Method Demonstration
Field Site - Schuyler Falls Landfill, NY
Mouser 2006
Describe the Dataset
hydrochemistry
Sp. Cond.
Eh
pH
Turbidity
BOD
COD
Ammonia
Nitrate
Cations
Anions
Heavy Metals
Acetone
Benzene
Carbon Tetrachloride
…
Xylene
http://rdp8.cme.msu.edu/html/t-rflp_jul02.html
Mouser et al. 2005
Roling et al. 2001
Microbial Data from the Landfill
Community profiles of Archaea, Bacteria and Geobacteraceae
Relative abundance of 'Operational Taxonomic Units', or OTUs, for each target
Community data from 25 monitoring wells
Reduce OTU data dimensionality by computing principal components; 75% of variance captured by:
2 PCs of Archaea
3 PCs of Bacteria
3 PCs of Geobacteraceae
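The dimensionality reduction can be sketched with a singular value decomposition, keeping the smallest number of principal components whose cumulative explained variance reaches a target (0.75 here, matching the 75% on the slide). The synthetic abundance matrix, the function name `principal_components`, and the choice of SVD on mean-centered data are assumptions for illustration only.

```python
import numpy as np

def principal_components(X, variance_target=0.75):
    """Project X onto the fewest PCs that capture variance_target of the variance.

    X : array of shape (samples, variables), e.g. wells x OTU relative abundance
    Returns (scores, explained) where explained holds each kept PC's share.
    """
    Xc = X - X.mean(axis=0)                       # center each variable
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)           # variance share per PC
    # First position where the cumulative share reaches the target.
    n_pc = int(np.searchsorted(np.cumsum(explained), variance_target) + 1)
    scores = Xc @ Vt[:n_pc].T
    return scores, explained[:n_pc]

# Synthetic stand-in: 25 wells, 10 OTUs, rows sum to 1 (relative abundance).
rng = np.random.default_rng(0)
otu = rng.dirichlet(np.ones(10), size=25)
scores, explained = principal_components(otu)
```

The resulting score columns would then replace the raw OTU abundances as SOM input variables.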
Method Demonstration - Data Organization
C = Clean
F = Fringe
P = Polluted
= Unused Node
After Training
Non-Parametric MANOVA
F = Between Group Variability / Within Group Variability
How Many Clusters?
Anderson, M. J. (2001), McArdle, B.H. and M.J. Anderson (2001), Jones, D. (2003)
•Non-parametric
•Suitable for unbalanced designs
•Can be applied to any distance matrix
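A minimal sketch of the distance-based pseudo-F statistic described by Anderson (2001), computed directly from a distance matrix, with a label-permutation test. The function names, the permutation count, and the toy data are assumptions; this is not the Fathom toolbox implementation.

```python
import numpy as np

def pseudo_f(dist, labels):
    """Pseudo-F = (among-group SS / (a - 1)) / (within-group SS / (N - a))."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    groups = np.unique(labels)
    a = len(groups)
    # Total sum of squares from all pairwise distances.
    iu = np.triu_indices(n, k=1)
    ss_total = (dist[iu] ** 2).sum() / n
    # Within-group sum of squares, one term per group.
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = dist[np.ix_(idx, idx)]
        ss_within += (np.triu(sub, k=1) ** 2).sum() / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permutation_p(dist, labels, n_perm=99, seed=0):
    """Observed pseudo-F and its permutation p-value."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    f_obs = pseudo_f(dist, labels)
    hits = sum(pseudo_f(dist, rng.permutation(labels)) >= f_obs
               for _ in range(n_perm))
    return f_obs, (hits + 1) / (n_perm + 1)

# Toy example: four points on a line forming two tight groups.
pts = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(pts[:, None] - pts[None, :])
f_obs, p_value = permutation_p(dist, [0, 0, 1, 1])
```

For the toy data the total sum of squares is 101 and the within-group sum of squares is 1, giving a pseudo-F of 200. Repeating the calculation for cluster assignments with different numbers of clusters gives one F value per candidate, which can then be compared.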
[Figure: F-statistic (y-axis, 0 to 30) plotted against Number of Clusters (x-axis, 0 to 12).]
Preliminary Results
[Figure panels: 2-Clusters; 3-Clusters; 4-Clusters; 4-Clusters, Bacteria & Geobacter Only; 4-Clusters, Archaea, Bacteria & Geobacter.]
Landfill Application Preliminary Conclusions
•The SOM is effective at distinguishing a gradient of contamination at the landfill based on microbial communities.
•The spatial pattern of the groups generated by the algorithm agrees with the hydrochemical analysis.
•Microbial communities may be able to serve as an advance indicator of migrating pollution.
•More knowledge of specific sub-groups of organisms (and primers to amplify relevant DNA) could improve the algorithm's ability to distinguish contamination levels.
Map courtesy of M. Watzin
Lake Champlain Dataset
Column groups: Sample Collection Details; Algae Community Composition; Lake Chemistry; Cyanotoxin Concentration
Columns: ID | datetime collected | location | rep | sample type | diatoms | greens | chrysophytes | cryptophytes | dinoflagellates | euglenophytes | bacillariophyceae | indeterminate | Potential Toxin Producers | total | Net, Plankton or Whole Water Toxin | Microcystin, ug/L | Whole Water Chl, ug/L | Total P | Total N | SRP
32 7/2/07 13:20 Chapman Bay 1 net 8.3 15.7 4.4 0.0 570.2 598.6 ww <0.05 3.25 27.41 0.77 7.0
33 7/2/07 13:20 Chapman Bay 1 net 8.3 15.7 4.4 0.0 570.2 598.6 wwp 0.006 3.25 27.41 0.77 7.0
34 7/2/07 13:20 Chapman Bay 2 net 63.5 24.4 2.3 0.1 151.8 245.5 wwp 0.002 2.75 28.80 0.78 6.9
35 7/2/07 13:20 Chapman Bay 2 net 63.5 24.4 2.3 0.1 151.8 245.5 ww <0.05 2.75 28.80 0.78 6.9
36 7/2/07 14:25 Highgate Cliffs 1 net 5.5 84.0 0.0 0.0 680.6 775.9 wwp 0.009 0.00 46.76 1.04 20.3
37 7/2/07 14:25 Highgate Cliffs 1 net 5.5 84.0 0.0 0.0 680.6 775.9 ww <0.05 0.00 46.76 1.04 20.3
38 7/2/07 14:25 Highgate Cliffs 2 net 2.6 44.3 0.0 0.0 929.6 976.4 ww 0.061 4.97 46.57 0.96 20.7
39 7/2/07 14:25 Highgate Cliffs 2 net 2.6 44.3 0.0 0.0 929.6 976.4 wwp 0.043 4.97 46.57 0.96 20.7
40 7/2/07 14:40 Highgate Springs 1 net 0.7 73.0 0.0 0.1 317.0 390.8 ww <0.05 4.07 57.31 1.03 21.2
41 7/2/07 14:40 Highgate Springs 2 net 2.8 49.5 0.0 0.0 198.2 250.5 ww 0.062 4.07 56.76 1.16 21.1
Lake Champlain Dataset: Preliminary Analysis
Input Dataset:
All sampling locations in Missisquoi Bay ('03-'07)
All sampling dates (May - October)
356 samples total
Variables:
Total N (µg/L)
Total P (µg/L)
Chlorophyll (µg/L)
Anabaena (cells/mL)
Aphanizomenon (cells/mL)
Microcystis (cells/mL)
Lake Champlain Dataset - BMU by Microcystin Conc.
< 10 µg/L toxin (n = 339)
10 - 100 µg/L toxin (n = 14)
> 100 µg/L toxin (n = 3)
Unused Node
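The three toxin classes in the legend can be assigned with a simple threshold rule. This is a sketch under assumptions: the class labels "low", "medium", and "high" are hypothetical names, and values of exactly 10 or 100 µg/L are placed in the middle class, following the legend's "10 - 100" range.

```python
import numpy as np

def toxin_class(conc_ug_per_l):
    """Assign each microcystin concentration (µg/L) to a legend class."""
    c = np.asarray(conc_ug_per_l, dtype=float)
    # < 10 is "low"; 10 to 100 inclusive is "medium"; > 100 is "high".
    return np.where(c < 10, "low", np.where(c <= 100, "medium", "high"))

# Illustrative concentrations, not values from the dataset.
classes = toxin_class([0.05, 10.0, 100.0, 150.0])
counts = {label: int(np.sum(classes == label))
          for label in ("low", "medium", "high")}
```

Each sample's class can then be used to color its best-matching unit on the output map.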
Lake Champlain Dataset - Component Planes
[Figure: component planes for Anabaena, Aphanizomenon, Microcystis, Total Nitrogen, Total Phosphorus, and Chlorophyll.]
Lake Champlain Dataset - Ongoing Work
1) Create new input variables (Principal Components)
A. Lake Chemistry (N, P, Chl, H2O Temp)
B. Biology (Cyanobacteria Community)
C. Environmental Conditions (Air Temp, Cloud Cover)
2) Explore the spatial and temporal autocorrelation in the dataset
References
Canfield, D.E., B. Thamdrup, and E. Kristensen (2005). Aquatic Geomicrobiology. Elsevier, New York.
Anderson, M.J. (2001). "A new method for non-parametric multivariate analysis of variance". Austral Ecology 26:32-46.
Jones, D. (2003). Fathom: a MATLAB toolbox for ecological and oceanographic data analysis. University of Miami RSMAS, Department of Marine Biology and Fisheries. Available at: http://www.rsmas.miami.edu/personal/djones Accessed: 3 December 2008.
Roling, W.F.M., B.M. van Breukelen, M. Braster, B. Lin, and H.W. van Verseveld (2001). "Relationships between Microbial Community Structure and Hydrochemistry in a Landfill Leachate-Polluted Aquifer". Applied and Environmental Microbiology 67(10):4619-4629.
Grant, L.M., L.M. Muckian, N.J.W. Clipson, and E.M. Doyle (2006). "Microbial community changes during the bioremediation of creosote-contaminated soil". Letters in Applied Microbiology 44:293-300.
McArdle, B.H. and M.J. Anderson (2001). "Fitting multivariate models to community data: A comment on distance-based redundancy analysis". Ecology 82(1):290-297.
Mouser, P.J., D.M. Rizzo, W.F.M. Roling, and B.M. van Breukelen (2005). "A multivariate statistical approach to spatial representation of groundwater contamination using hydrochemistry and microbial community profiles". Environmental Science and Technology 39:7551-7559.
Mouser (2006). Improving Detection and Long-Term Monitoring Strategies for Landfill Leachate Contaminated Groundwater with Molecular-Based Microbiological Data Using Geostatistics and Artificial Neural Networks. Doctoral Dissertation, University of Vermont.
Mangiameli, P., S.K. Chen, and D. West (1996). "A comparison of SOM neural network and hierarchical clustering methods". European Journal of Operational Research 93:402-417.
Pace, N.R. (1997). "A Molecular View of Microbial Diversity and the Biosphere". Science 276(5313):734-740.