Page 1: Mapping Uncharted Waters: Exploratory Analysis ...kilian/papers/lewisj-OceanMapping.pdf · Mapping Uncharted Waters: Exploratory Analysis, Visualization, and Clustering of Oceanographic

Mapping Uncharted Waters: Exploratory Analysis, Visualization, and Clustering of Oceanographic Data

Joshua M. Lewis
Department of Cognitive Science
University of California, San Diego
San Diego, CA 92093-0515
[email protected]

Pincelli M. Hull
Scripps Institution of Oceanography
University of California San Diego
San Diego, CA [email protected]

Kilian Q. Weinberger
Yahoo! Research
2821 Mission College Blvd
Santa Clara, CA [email protected]

Lawrence K. Saul
CSE Department
University of California, San Diego
San Diego, CA 92093-0404
[email protected]

Abstract

In this paper we describe an interdisciplinary collaboration between researchers in machine learning and oceanography. The collaboration was formed to study the problem of open ocean biome classification. Biomes are regions on Earth with similar climate (e.g., temperature and rainfall) and vegetation structure (e.g., grasslands, coniferous forests, and deserts). To discover biomes in the open ocean, we apply leading methods in high dimensional data analysis, clustering, and visualization to oceanographic measurements culled from multiple existing databases. We compare traditional approaches, such as k-means clustering and principal component analysis, to newer approaches such as Isomap and maximum variance unfolding. Our work provides the first quantitative classification of open ocean biomes from an automated statistical analysis of multivariate data. It also provides a valuable case study in the use (and misuse) of recently developed algorithms for high dimensional data analysis.

1 Introduction

It is now widely recognized that advances in machine learning are ushering in a new era of computational and experimental science. This era will be characterized by increasingly powerful, automated, and large-scale methods in data analysis and visualization (Mjolsness & DeCoste, 2001). The full potential of machine learning will only be realized, however, by making the field's leading algorithms available and accessible to researchers in other data-driven and computation-intensive areas of science.

Figure 1. The global projection of oceanic biomes produced by k-means clustering (k = 11) after data analysis by maximum variance unfolding; see section 4 for details. The biomes are colored according to cluster membership and sorted by the first principal component of the cluster centroids. (Legend: clusters 1 through 11, plus n/a.)

As a successful example of this practice, in this paper we describe an interdisciplinary collaboration between researchers in machine learning and oceanography. The collaboration was formed to study the problem of open ocean biome classification. Biomes are regions on Earth with similar climate (e.g., temperature and rainfall) and vegetation structure (e.g., grasslands, coniferous forests and deserts). Ecologists are interested in biomes as units of classification because they correspond broadly with the community structure and dynamics of the organisms that live there. In essence, dividing the world into meaningful biomes provides a rough estimate of the number, location, and similarity of the different ecosystems on Earth. Moreover, this problem is ripe for automated analysis because oceanic biomes are large relative to the resolution of available data. In particular, the available data makes it especially feasible to classify surface waters of the open ocean, namely that part of the ocean that is offshore and restricted to the upper 200 meters of water.

In this paper, we provide the first quantitative classification of open ocean biomes based on a fully automated, statistical analysis of climatological and primary production parameters. To discover these biomes, we applied leading methods in high dimensional data analysis, clustering, and visualization to oceanographic measurements culled from multiple databases and previous ecological studies. Fig. 1 provides an overall visualization of our results.

While our results are interesting in their own right, they also showcase the real-world applicability of several recently developed algorithms for high dimensional data analysis. In addition to standard procedures such as k-means clustering and principal component analysis, we also experimented with spectral clustering (Ng et al., 2002), Isomap (Tenenbaum et al., 2000), and maximum variance unfolding (Weinberger et al., 2004; Sun et al., 2006). Notably, our most interesting results were obtained not from the original implementations of these algorithms, but by adapting the algorithms in various ways and incorporating advances suggested by follow-up work (de Silva & Tenenbaum, 2003; Weinberger et al., 2007).

Our results also illuminate certain stark differences between these algorithms. For the most part, recently developed algorithms for "manifold learning" yield fairly similar results on the carefully controlled data sets used to benchmark algorithms in this area. The basic assumptions behind these algorithms also appear to be satisfied by most real-world data sets on which they have been tested. The ocean data set in this paper, however, appears to violate the basic assumption of (at least) one popular algorithm for manifold learning, leading to fairly divergent results in a real-world application of interest. We highlight the reasons for this divergence in the discussion of our methods and experimental results. Our work on this particular application thus serves a broader purpose than the application itself. More generally, it provides a valuable case study for researchers in the area of manifold learning.

Our paper is organized as follows. In section 2, we describe the problem of open ocean biome classification in more detail, as well as the data set compiled for this task. In section 3, we briefly survey our methods for data analysis. In section 4, we analyze the results from different methods and compare the different aspects of biome structure that they reveal. Finally, in section 5, we conclude with general lessons from our work.

2 Open ocean biome classification

The endeavor to classify open ocean biomes is not a new one. Oceanic biomes have been identified for more than 100 years based on species range distributions (Giesbrecht, 1892; McGowan, 1971), with the explicit acknowledgment that some set of habitat characteristics were likely driving the patterns. However, it was only in the last decade that oceanic biomes were first determined from physical, chemical and biological habitat characteristics (Longhurst, 1998). This advance was made possible by the recent accumulation of multiple long-term global databases of shipboard measurements and satellite images. This manual classification of oceanic biomes and provinces was based on a visual assessment of a decade of ocean color images corroborated by numerous physical, chemical, and biotic factors. While this biome classification has proven useful, the methodology behind it has limitations. It is difficult to determine the relative importance of biome separations or to assess the similarity of geographically separate regions. Furthermore, visual biome assessments are not readily adapted to oceanographic analyses on different temporal or spatial scales and they offer no systematic method for evaluating the effect of additional variables on biome assignment. In spite of these limitations, there has yet to be a quantitative, multivariate assessment of the identity, location, and similarity of ocean biomes. Automated methods have the potential to (i) provide further insight into the structure, distribution, and interrelationships of surface ocean ecosystems; (ii) enable the identification of biomes on a range of temporal scales; and (iii) support quantitative intercomparisons of open ocean biomes.

In this study we characterize open ocean biomes based on both the long-term annual mean and the average annual range of the following seven global ocean characteristics: temperature (°C), salinity, 10 to 200 m density difference, photosynthetically active radiation (PAR, Einsteins/m²/day), phosphate (µmol/L), dissolved oxygen (ml/L), and net primary productivity (NPP, mg C/m²/day). These measurements were considered at a 2° by 2° resolution (latitude by longitude), resulting in a d = 14 dimensional data set of n = 9105 geographic locations. All measurements except PAR and NPP were obtained as objectively analyzed annual and monthly climatologies in a 1° by 1° resolution from the World Ocean Atlas 2005 (WOA05) (Locarnini et al., 2006; Antonov et al., 2006; Garcia et al., 2006a; Garcia et al., 2006b), with 10 to 200 m density difference calculated from WOA05 temperature and salinity climatologies. Monthly NPP and PAR from 1998-2005 at 1/6° by 1/6° resolution were obtained from the Ocean Productivity website; NPP was estimated based on SeaWiFS ocean color images and the Vertically Generalized Production Model (VGPM) (Behrenfeld & Falkowski, 1997). For all variables, annual range (as a measure of intra-annual variability) was calculated as the range in average monthly conditions for a given 1° by 1° location. In the final data preparation, we excluded all neritic (i.e., coastal) zones using a 200-meter depth mask from WOA05 and averaged all high resolution variables to a 2° resolution. We normalized all oceanographic measures to have zero mean and unit standard deviation. The normalization was applied to attach equal importance to each measurement.
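The normalization step can be illustrated with a minimal sketch. The function name `standardize` is hypothetical, and the use of the population standard deviation (rather than the sample estimate) is an assumption; the text only states that each measure is scaled to zero mean and unit standard deviation.

```python
import numpy as np

def standardize(X):
    """Scale each column (one oceanographic measure) to zero mean, unit std."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)        # population std (ddof=0); an assumption
    std[std == 0] = 1.0        # avoid division by zero for constant columns
    return (X - mean) / std
```

After this step every column contributes on the same scale to Euclidean distances, which is what attaching "equal importance to each measurement" amounts to for distance-based methods.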

Ocean biomes were identified by searching for clusters in the normalized data described above. Note that by design, latitude and longitude coordinates are not included in this data. Ocean regions that are distant from one another can have very similar features; for instance, both polar regions are characterized by low temperatures. Similarly, nearby areas in the ocean can have dramatically different ecological compositions. The geographical coherence in our results is a product of the oceanographic measures alone.

We used two criteria to choose the appropriate number of clusters: first, that there should be enough clusters to highlight differences between different methodologies for data analysis; second, that cluster boundaries should suggest visible changes in open ocean biome classification, as assessed by an expert oceanographer. The first criterion sets the lower limit on the number of clusters. In this analysis, positive and negative aspects of the various methodologies were difficult to discern with fewer than 8 clusters. The second criterion acknowledges that for the purpose of classifying open ocean biomes, the data analysis should be performed at a sufficiently coarse resolution to avoid too much subdivision along coastlines. Coastal ocean biomes are known to be geographically much smaller than the open ocean biomes they abut. We began to observe a proliferation of small coastal biomes in our results when the algorithms were asked to compute more than 10 to 12 clusters. This range in cluster numbers is approximately equal to the number of open ocean biomes within an ocean basin as determined by other methods (Longhurst, 1998), further supporting the use of 10 to 12 clusters. Based on these criteria, we used 11 clusters throughout our analyses as a generally useful choice for all algorithms.

As our paper came to press, another group published a largely objective study of surface ocean biomes (Oliver & Irwin, 2008). Our study and this recent study are complementary in the aspects of surface ocean classification that they explore. In the present study we include a wide array of variables and focus on the effect of dimensionality reduction on the classification of biomes. In contrast, Oliver and Irwin focus on the problem of automatically identifying the number of surface ocean biomes. Furthermore, while the present study uses k-means clustering exclusively as a final processing step, Oliver and Irwin combine k-means clustering, Ward's linkage agglomerative clustering, and a post-hoc separation of clusters based on geographic continuity. It is notable that in spite of the large differences in variables included and study methodology, the major features of our classifications are similar.

Although oceanographers have identified ocean biomes in the past, there is no standard global set of ocean biomes against which to compare our results, and thus there is no explicit ground truth for evaluating each of our methods. This is hardly a unique problem, however, as there are many application domains lacking a well-defined ground truth. Nevertheless, one may still want to investigate those domains with modern machine learning techniques. In the case of ocean biome classification, our evaluation criteria are defined as (i) the correspondence of cluster boundaries with known currents, faunal breaks, or other biogeochemical boundaries, and (ii) the biogeographic and environmental similarity of clustered regions. While these evaluation criteria lack the succinct appeal of a black-and-white classification set, they reflect the state of knowledge within the domain. To apply these criteria to our results, we rely on the expert analysis of an oceanographer well-versed in the ocean biome literature.

3 Methods

Ocean biomes were discovered by analyzing the n = 9105 samples of d = 14 normalized measurements described in the previous section. We used a number of different methods for exploratory analysis, visualization, and clustering (Burges, 2005) of the data. These included traditional methods, such as k-means clustering and principal component analysis, as well as newer methods such as Isomap (Tenenbaum et al., 2000) and maximum variance unfolding (Weinberger et al., 2004). The methods in manifold learning are strongly motivated by the three dimensional nature of the ocean itself (characterized by latitude, longitude, and depth). For this reason, we anticipated that the manifold learning algorithms would extract three dimensional representations of the data (which would, in turn, be fed as input to k-means clustering). Manifold learning techniques are natural in this application because they allow us to achieve a manageable (and unbiased) representation of the ocean without losing much information. We also experimented with spectral clustering (Ng et al., 2002), but it performed poorly for reasons that we suggest in section 5. In the following sections, we provide a brief survey of the algorithms used in our study.

3.1 K-means

K-means is one of the simplest and most popular algorithms for unsupervised clustering of multivariate data. The algorithm assigns each data point $\vec{x}_i \in \mathbb{R}^d$ to one of $k$ disjoint clusters $\{C_\alpha\}_{\alpha=1}^{k}$, where $\bigcup_\alpha C_\alpha = \{\vec{x}_i\}_{i=1}^{n}$. Each cluster $C_\alpha$ has a corresponding centroid $\vec{\mu}_\alpha \in \mathbb{R}^d$. The centroids and cluster assignments are chosen to minimize the vector quantization error:

$$L_{k\text{-means}} = \sum_{\alpha=1}^{k} \sum_{i \in C_\alpha} \|\vec{x}_i - \vec{\mu}_\alpha\|^2. \qquad (1)$$

Eq. (1) is minimized by a two-step iterative process. First, each input $\vec{x}_i$ is assigned to the cluster with the closest centroid. Second, each centroid $\vec{\mu}_\alpha$ is re-estimated as the mean of the inputs assigned to it. This process is repeated until convergence. Although this optimization is not convex, in practice it often converges to good local minima with high reliability. We investigated $k$ ranging from 2 to 16. For $k = 2$, we initialized the cluster centroids at random; for $k > 2$, we initialized the centroids by inheriting those from a previous run of the algorithm with $k - 1$ clusters, then choosing a new centroid from among the data points in the cluster with the largest variance.
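The two-step iteration and the incremental initialization can be sketched as follows. This is an illustrative reading of the description, not the code used in the study: the function names are hypothetical, "variance" of a cluster is taken to be the mean squared distance to its centroid, and the new centroid is drawn uniformly at random from that cluster; both choices are assumptions.

```python
import numpy as np

def lloyd(X, centroids, n_iter=100):
    """Alternate assignment and re-estimation until assignments stop changing."""
    labels = None
    for _ in range(n_iter):
        # Step 1: assign each point to the cluster with the closest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: re-estimate each centroid as the mean of its assigned points.
        for a in range(len(centroids)):
            members = X[labels == a]
            if len(members) > 0:   # keep the old centroid if a cluster empties
                centroids[a] = members.mean(axis=0)
    return centroids, labels

def quantization_error(X, centroids, labels):
    """Vector quantization error of eq. (1)."""
    return float(((X - centroids[labels]) ** 2).sum())

def incremental_kmeans(X, k_max, seed=0):
    """Run k-means for k = 2..k_max, inheriting centroids from the k-1 run."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # k = 2: random initialization from the data points.
    centroids = X[rng.choice(len(X), size=2, replace=False)].copy()
    results = {}
    for k in range(2, k_max + 1):
        centroids, labels = lloyd(X, centroids)
        results[k] = (centroids.copy(), labels.copy())
        if k < k_max:
            # Per-cluster variance: mean squared distance to the centroid.
            var = np.zeros(k)
            for a in range(k):
                members = X[labels == a]
                if len(members) > 0:
                    var[a] = ((members - centroids[a]) ** 2).sum() / len(members)
            worst = var.argmax()
            # New centroid: a data point from the largest-variance cluster.
            new_centroid = X[rng.choice(np.flatnonzero(labels == worst))]
            centroids = np.vstack([centroids, new_centroid])
    return results
```

Because each run starts from the previous solution plus one extra centroid placed on a data point, the quantization error cannot increase as k grows, which makes results for successive k directly comparable.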

3.2 PCA and MDS

Principal component analysis (PCA) is a simple linear method for high dimensional data analysis and visualization. PCA computes a linear orthogonal projection $P \in \mathbb{R}^{r \times d}$ that maps the original input space $\mathbb{R}^d$ into a lower dimensional subspace $\mathbb{R}^r$. For simplicity, assume the data points are centered. Then the projection $P$ is computed by minimizing the reconstruction error:

$$L_{\text{PCA}} = \sum_i \|\vec{x}_i - P^\top P\,\vec{x}_i\|^2 \qquad (2)$$

subject to $PP^\top = I$, where $I$ is the $r \times r$ identity matrix. The constraint ensures that the rows of $P$ are orthonormal. Although the optimization in eq. (2) is not convex, its global minimum can be computed by singular value decomposition. In particular, the rows of $P$ are given by the top $r$ eigenvectors of the covariance matrix $C = \sum_i \frac{1}{n}\vec{x}_i\vec{x}_i^\top$. Each eigenvalue of $C$ reveals the variance of the coordinate obtained from the projection onto its eigenvector. These coordinates represent the data's so-called principal components.
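As a minimal sketch of this computation, the following NumPy function builds $P$ from the eigenvectors of the covariance matrix. The function name `pca` and its return values are illustrative choices, not taken from the paper.

```python
import numpy as np

def pca(X, r):
    """Return the projection P (r x d), the projected data, and all eigenvalues."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center the data
    C = Xc.T @ Xc / len(Xc)                 # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to descending
    eigvals = eigvals[order]
    P = eigvecs[:, order[:r]].T             # rows of P: top r eigenvectors
    Y = Xc @ P.T                            # principal components of each point
    return P, Y, eigvals
```

With this construction the reconstruction error of eq. (2) equals $n$ times the sum of the discarded eigenvalues, which is why the eigenvalue spectrum indicates how many dimensions are needed.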

A closely related method for linear dimensionality reduction is metric multidimensional scaling (MDS). This method produces outputs $\vec{y}_i \in \mathbb{R}^r$ that best preserve the inner product structure of the original data. The outputs are computed by minimizing:

$$L_{\text{MDS}} = \sum_{ij} (\vec{y}_i \cdot \vec{y}_j - \vec{x}_i \cdot \vec{x}_j)^2 \qquad (3)$$

Despite its different motivation, it can be shown that MDS yields the same solution as PCA. Though the loss function explicitly penalizes differences in inner products, MDS is often used to derive outputs that approximately preserve pairwise Euclidean distances. In practice, starting from a pairwise distance matrix, one infers the corresponding inner products which are then fed as input to MDS.
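The passage from distances to inner products can be sketched as below. The double-centering formula used here is the standard construction for classical MDS and is an assumption about how the inference is done; the function name `classical_mds` is hypothetical.

```python
import numpy as np

def classical_mds(D, r):
    """Embed n points in r dimensions from an n x n matrix of pairwise distances."""
    D = np.asarray(D, dtype=float)
    n = len(D)
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    G = -0.5 * H @ (D ** 2) @ H               # inferred inner products (Gram matrix)
    eigvals, eigvecs = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1][:r]     # top r eigenpairs
    lam = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)   # rows are the outputs y_i
```

When the distances are exactly Euclidean and r matches the true dimensionality, the outputs reproduce the input distances up to rotation and translation.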

3.3 Isomap

The Isomap algorithm (Tenenbaum et al., 2000) provides a powerful nonlinear extension of multidimensional scaling. It was developed to analyze high dimensional data points sampled from a low dimensional manifold. Whereas MDS focuses on preserving Euclidean distances, Isomap focuses on preserving geodesic distances along the manifold.

The algorithm has three basic steps. The first step computes the κ-nearest neighbors of each data point, then uses this information to create an adjacency graph whose nodes represent data points and whose (undirected) edges indicate nearest neighbor relations. The second step estimates the geodesic pairwise distances $\delta_{ij}$ along the manifold between points $\vec{x}_i$ and $\vec{x}_j$. This is done by computing shortest paths through the adjacency graph, with edges weighted by nearest neighbor distances. Finally, the third step uses these pairwise distances as input to MDS. From these distances, MDS produces low dimensional outputs whose Euclidean distances $\|\vec{y}_i - \vec{y}_j\|$ are approximately equal to the geodesic distances $\delta_{ij}$ found in step two.
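The three steps can be sketched for a small data set as follows. This is the basic algorithm, not the landmark variant used in the study, and it relies on a dense Floyd-Warshall pass that is only practical for a few hundred points. The function name `isomap` is hypothetical, and the sketch assumes there are no duplicate points.

```python
import numpy as np

def isomap(X, n_neighbors, r):
    """Basic Isomap: returns the r-dimensional outputs and the geodesic distances."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    # Step 1: undirected nearest-neighbor graph, edges weighted by distance.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    nn = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]  # column 0 is the point itself
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = nn.ravel()
    G[rows, cols] = D[rows, cols]
    G[cols, rows] = D[rows, cols]

    # Step 2: geodesic distances as shortest paths (Floyd-Warshall).
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    if np.isinf(G).any():
        raise ValueError("neighborhood graph is disconnected; increase n_neighbors")

    # Step 3: classical MDS on the geodesic distances.
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (G ** 2) @ H
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:r]
    Y = V[:, order] * np.sqrt(np.clip(w[order], 0.0, None))
    return Y, G
```

On points sampled along a half circle, for example, the geodesic distance between the two endpoints approaches the arc length π rather than the straight-line distance 2, and a one dimensional embedding recovers the ordering along the arc.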

To analyze our data set of size n = 9105 we used a fast approximate implementation of the original Isomap algorithm, known as landmark Isomap (de Silva & Tenenbaum, 2003). Landmark Isomap scales better to large data sets because it only computes the shortest paths between the original data points and some smaller subset of data points designated as landmarks. In our experiments with landmark Isomap, we used 500 landmarks and κ = 15 nearest neighbors.

The performance guarantees for Isomap depend on an assumption that the data's underlying manifold can be isometrically mapped to a convex subset of Euclidean space. When this assumption does not hold, the algorithm can return spurious results. Later we will examine this assumption in the context of open ocean biome classification.

Figure 2. Normalized eigenvalue spectra from PCA, Isomap, and MVU. Each normalized eigenvalue reveals the relative amount of variance in the corresponding principal component.

3.4 Maximum variance unfolding

Maximum variance unfolding (MVU) (Weinberger et al., 2004; Sun et al., 2006) provides yet another nonlinear extension of multidimensional scaling. Like Isomap, it was developed to analyze data sampled from a low dimensional manifold. While Isomap attempts to preserve geodesic distances, though, MVU focuses on preserving local distances. Let us define the adjacency matrix $\eta_{ij} \in \{0, 1\}$ with $\eta_{ij} = 1$ if $\vec{x}_i$ and $\vec{x}_j$ are κ-nearest neighbors and $\eta_{ij} = 0$ otherwise. MVU attempts to compute the maximum variance configuration of low dimensional outputs $\vec{y}_i \in \mathbb{R}^r$ that preserve the distances between κ-nearest neighbors. The outputs are computed by minimizing the loss function:

$$L_{\text{MVU}} = \sum_{ij} \eta_{ij}\left(\|\vec{y}_i - \vec{y}_j\|^2 - \|\vec{x}_i - \vec{x}_j\|^2\right)^2 - \nu \sum_i \|\vec{y}_i\|^2, \qquad (4)$$

where the constant $\nu$ balances the distance-preserving and variance-maximizing goals of the optimization. The loss function in eq. (4) is not convex. However, its optimization can be reformulated as an instance of semidefinite programming by relaxing the dimensionality of the outputs $\vec{y}_i$. To analyze our data set of size n = 9105, we implemented a fast approximation (Weinberger et al., 2007) to the original implementation of MVU. This approximation solves a much smaller semidefinite program to find an approximate minimum of eq. (4). The loss function is then further minimized by conjugate gradient descent.
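To make the two competing terms of eq. (4) concrete, the sketch below evaluates the loss for a candidate set of outputs. It does not solve the semidefinite program or perform the conjugate gradient refinement. The function names are hypothetical, and symmetrizing the neighbor relation (an edge exists if either point is among the other's nearest neighbors) is an assumption.

```python
import numpy as np

def knn_adjacency(X, n_neighbors):
    """Binary adjacency matrix eta for the nearest-neighbor relation."""
    n = len(X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(D2, axis=1)[:, 1:n_neighbors + 1]  # skip the point itself
    eta = np.zeros((n, n))
    eta[np.repeat(np.arange(n), n_neighbors), nn.ravel()] = 1.0
    return np.maximum(eta, eta.T)                      # symmetrize (assumption)

def mvu_loss(Y, X, eta, nu):
    """Evaluate eq. (4): local distance distortion minus nu times the variance."""
    dy = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    dx = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    distortion = (eta * (dy - dx) ** 2).sum()          # distance-preserving term
    variance = (Y ** 2).sum()                          # variance-maximizing term
    return distortion - nu * variance
```

Setting the outputs equal to the inputs makes the distortion term vanish, so the loss reduces to $-\nu \sum_i \|\vec{x}_i\|^2$; any rescaling of the outputs that stretches neighbor distances is penalized by the first term.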

4 Results

We applied the methods described in the last section to derive low dimensional representations of the data. Fig. 2 shows the normalized eigenvalue spectra from these methods. Note that the top three principal components in PCA account for less than 80% of the variance in the original data. On the other extreme, MVU accounts for nearly all of the variance in three dimensions.
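A normalized eigenvalue spectrum of the kind plotted in Fig. 2 can be computed as sketched below. The function name is hypothetical, and the sketch works from the covariance of whatever coordinates it is given; how the spectra for Isomap and MVU were extracted in the study is not specified here.

```python
import numpy as np

def normalized_spectrum(Y):
    """Eigenvalues of the covariance of Y, sorted descending and scaled to sum to 1."""
    Y = np.asarray(Y, dtype=float)
    Yc = Y - Y.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Yc.T @ Yc / len(Yc))[::-1]  # descending order
    eigvals = np.clip(eigvals, 0.0, None)                    # remove round-off negatives
    return eigvals / eigvals.sum()
```

The cumulative sum of the first three entries gives the fraction of variance captured in three dimensions, which is the quantity compared across methods in the text.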

We used the k-means algorithm to derive clusters in the data and to locate likely borders between ocean biomes. The k-means algorithm was applied to the raw d = 14 dimensional data, as well as to the three dimensional representations discovered by Isomap and MVU. Running k-means on top of these representations is appropriate since we are looking for biomes, which conceptually align well with clusters in ocean feature space. To assess the quality of the clustering, k-means cluster assignments were mapped onto global ocean projections and analyzed by an expert oceanographer. The quality of the clustering was measured using two criteria: (i) the correspondence of geographic cluster boundaries with known currents, faunal breaks, or other biogeochemical boundaries, and (ii) the biogeographic and environmental similarity of regions clustered together.

In general, all the methods from section 3 produced reasonable results, exhibiting both geographical continuity and high covariance with major oceanographic and biogeographic regions. However, a detailed review of the results revealed significant differences between the various methods. In what follows, we use two regional comparisons (North Atlantic and Antarctic) to highlight the particular strengths and weaknesses that emerged from this review. For the sake of brevity, we discuss geographic patterns in general terms and in reference to accepted hydrographic (Tomczak & Godfrey, 2003) and biogeographic (Longhurst, 1998) features, with limited examples from species ranges. Also, as shorthand, in the following sections we use Isomap and MVU to refer to the results obtained by k-means clustering of the low dimensional representations discovered by these algorithms.

4.1 North Atlantic comparison

Within the North Atlantic (top panel of Fig. 3), previous expert assessments of biome classification appear to coincide best with the automated results from MVU, followed next by those of raw k-means, then Isomap. Generally speaking, MVU identifies biomes with boundaries and geographic extents that are concordant with oceanographic and biogeographic features, with some evidence for latitudinal over-division. Only MVU has cluster boundaries that lie along a line stretching from northern Canada to northern Norway, delineating the North Atlantic Current (Tomczak & Godfrey, 2003) and many species range limits. For example, copepods like Calanus hyperboreus and Paraeuchaeta norvegica range to the north of the current, while copepods like Neocalanus gracilis and Metridia lucens range to the south (Barnard et al., 2004). On the other hand, only raw k-means identifies a separate coastal biome along the northeastern coast of Africa (in orange); this divides a productive coastal upwelling region from the less productive gyre biome (Longhurst, 1998). Isomap suboptimally clusters this coastal region with the Mediterranean and MVU fails to clearly identify a coastal biome at all. MVU and raw k-means have reasonable albeit different subdivisions of the subtropics, a region characterized by low productivity, a stable pycnocline, and high species richness. MVU subdivides the subtropical gyre into a northern biome including the Gulf Stream Extension and Azores Current (yellow), and a southern biome composed of the Sargasso Sea and a southeastern extension along the Subtropical Convergence (red orange) (Tomczak & Godfrey, 2003). Raw k-means clusters the subtropical gyre along a longitudinal divide with both biomes (in yellow and green) notably truncated in their southern extent relative to the Subtropical Convergence (the conventional southern boundary for the subtropical gyre). MVU's latitudinal subdivision of this region is understandable given the different conditions along the edges of the gyre (e.g., intensified currents and associated features) (Longhurst, 1998). However, MVU latitudinally subdivides the tropical Atlantic as well, suggesting a general tendency for latitudinal over-division. Raw k-means highlights a longitudinal subdivision in the subtropical gyre with scant support from known biogeographic ranges, although the biome division has previously been proposed based on east-west differences in Sargassum kelp abundance, among other factors (Longhurst, 1998). Overall in the North Atlantic, MVU best emphasizes the known importance of the North Atlantic Current as both a hydrographic feature and a biogeographic boundary.

Figure 3. Regional geographic projections of k-means clustering (with k = 11) on raw data (d = 14) and processed data from the North Atlantic (top) and Antarctic (bottom). Columns, left to right: Raw, Isomap, MVU.

4.2 Antarctic comparison

In the Southern Ocean (bottom panel of Fig. 3), the best overall characterization of open ocean biomes is provided by k-means clustering of the raw data. The southernmost biome (south of the Antarctic Divergence at 65°S) appears in the raw k-means and MVU clusters only. This biome classification is consistent with the occurrence of ice-adapted fauna (e.g., Stephos longipes and Euphausia crystallorophias) in this seasonally ice-covered region (Longhurst, 1998). In the waters stretching from the Antarctic Divergence to the Subtropical Front (approximately 65° to 45°S), raw k-means and MVU identify one and two biomes, respectively; both are supported by biogeographic ranges. This broad region spans the marginal ice to ice-free zones and crosses several fronts (Tomczak & Godfrey, 2003); some abundant species span this entire region (e.g., Salpa thompsoni, Calanus propinquus, Calanoides acutus) (Longhurst, 1998), while others characterize waters south or north of the Polar Front (e.g., Euphausia superba and Euphausia frigida respectively) (Brinton, 1962). The next major hydrographic and biogeographic region is the Subtropical Convergence Zone (north of 65°) (Longhurst, 1998), which is characterized by transitional species like the krill Thysanoessa gregaria and Nematoscelis megalops (Brinton, 1962). The Subtropical Convergence appears in raw k-means and Isomap (green and orange respectively), but is over-split in MVU (green and yellow bands). Unlike the extra subdivision in the waters stretching from the Antarctic Divergence to the Subtropical Front, the additional MVU division in the Subtropical Convergence is not supported by species ranges or by hydrographic features. Therefore, given the concordance with known faunal and physical features in the Southern Ocean, raw k-means is preferred in this region over both MVU and Isomap.

Figure 4. Color-coded geographic projections of the leading three coordinates from MVU and their combination. To improve the overall visualization, each channel displays the same range of color. The green and blue channels would have much less range if weighted by their actual proportion of the variance.

5 Discussion

In this paper, we have applied leading methods in high dimensional data analysis to the visualization and clustering of oceanographic data. We conclude by highlighting the general lessons that emerged from this particular application.

Exploratory analysis and visualization are greatly facilitated by the ability to discover faithful two or three dimensional representations of multivariate data. As shown in Fig. 2, on the d = 14 dimensional data set in this paper, both PCA and Isomap fail to discover such representations. PCA fails presumably because it focuses only on linear structure, and our data does not lie principally in a two or three dimensional subspace. We speculate that Isomap's failure stems from its underlying assumption that the data can be isometrically mapped to a convex subset of Euclidean space. Oceanic regions are bounded by currents, across which sharp discontinuities in environmental conditions occur. This characteristic of the ocean may lead to gaps in the continuity of the environmental data (i.e., "holes") and thus violate the convexity assumptions behind the Isomap algorithm, leading to spurious results.

Where PCA and Isomap fail in our application, however, MVU essentially succeeds. Fig. 4 maps the three dimensional representation discovered by MVU onto the globe. The figure enables the three coordinates of MVU to be interpreted (by an expert) in terms of actual environmental variables. In particular, the first MVU coordinate has a clear latitudinal gradient which is highly correlated to the mean annual temperature, PAR, and dissolved oxygen. Likewise, the second and third MVU coordinates each show a weak correlation to five other environmental variables (annual variability in NPP, PAR, temperature, and water column density difference as well as mean annual NPP) that are poorly captured by the first MVU coordinate. Finally, the composite image of these coordinates in the bottom panel of the figure provides a highly graphic and interpretable visualization of the entire data set.

Our results in clustering also present an opportunity to compare traditional versus more recently proposed methods in high dimensional data analysis. For the purpose of open ocean biome classification, the k-means algorithm on the raw data provides a surprisingly successful clustering. However, raw k-means appears to be somewhat insensitive to fine gradients in conditions, as evidenced by the coarse clusters identified in the subpolar North Atlantic. In the subtropical to polar North Atlantic, the results from MVU produce clusterings most consistent with known oceanographic biomes. Unfortunately, this attention to detail by MVU also leads to an emphasis of latitudinal variations in the Antarctic, unlike the results from raw k-means. These distinctions are certainly present in the data, but they might not be useful if they are too fine for organisms to exploit or if the ecological importance of an environmental change is not linearly related to the magnitude of the change itself.
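For readers unfamiliar with k-means, the following is a minimal sketch of the standard Lloyd iteration in pure Python. It is not the implementation used for the results above; the six two-dimensional points and the choice of initial centers are illustrative assumptions standing in for the 14-dimensional environmental vectors.

```python
import math

def kmeans(points, init_centers, max_iter=100):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    centers = [tuple(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's center unchanged
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, init_centers=[points[0], points[3]])
print(labels)
print(centers)
```

Because every point is assigned wholesale to its nearest center, small but systematic differences within a cluster are not represented in the output, which is consistent with the coarseness noted above.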

Based on the results from all of our methods, we recommend MVU for oceanographic analyses within specific subregions (e.g., North Atlantic, North Pacific) or for those seeking to measure fine spatial or temporal gradients. For analyses that are concerned more with general trends or take place on larger scales, however, simple k-means may be the most appropriate method. In addition to the above methods, we tried several other methods not reported here. Spectral clustering (Ng et al., 2002) produced results that lacked geographical continuity, perhaps because, like kernel PCA with a Gaussian kernel, it projects distant points into orthogonal vectors, making it ill-equipped to discover low dimensional manifolds (Weinberger et al., 2004). We investigated several other techniques (e.g., locally linear embedding and additional variants of MVU (Song et al., 2008)) but they were too similar to our main methods to warrant separate discussion.
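The remark about orthogonal vectors can be made concrete with a small sketch. For a Gaussian kernel, k(x, x) = 1, so every point maps to a unit-norm vector in the implicit feature space and k(x, y) is the cosine of the angle between two such vectors. The bandwidth sigma and the sample points below are illustrative assumptions.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))

origin = (0.0, 0.0)
same = gaussian_kernel(origin, origin)      # unit norm in feature space
near = gaussian_kernel(origin, (0.1, 0.0))  # close to 1: nearly parallel
far = gaussian_kernel(origin, (10.0, 0.0))  # close to 0: nearly orthogonal
print(same, near, far)
```

Once two points are separated by several bandwidths, their kernel value is effectively zero regardless of how much farther apart they are, so information about large-scale distances is lost.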

In conclusion, perhaps the most important lesson of our work is that modern methods in machine learning provide new avenues for exploratory analysis of scientific data. In general, we have found that an interdisciplinary approach is needed to combine the statistical expertise of machine learning researchers with the domain knowledge of natural scientists. This paper provides one example of such a fruitful collaboration.

References

Antonov, J. I., Locarnini, R. A., Boyer, T. P., Mishonov, A. V., & Garcia, H. E. (2006). World Ocean Atlas 2005. Volume 2: Salinity. In S. Levitus (Ed.), NOAA Atlas NESDIS, vol. 62, 1–182. Washington, D.C.: U.S. Government Printing Office.

Barnard, R., et al. (2004). Continuous plankton records: Plankton atlas of the North Atlantic Ocean (1958-1999). II. Biogeographical charts. Marine Ecology-Progress Series, Supplement, 11–75.

Behrenfeld, M. J., & Falkowski, P. G. (1997). Photosynthetic rates derived from satellite-based chlorophyll concentration. Limnology and Oceanography, 42 (1), 1–20.

Brinton, E. (1962). The distribution of pacific euphausiids. Bulletin of the Scripps Institution of Oceanography, 8 (2), 46–269.

Burges, C. J. C. (2005). Geometric methods for feature extraction and dimensional reduction. In L. Rokach and O. Maimon (Eds.), Data mining and knowledge discovery handbook: A complete guide for practitioners and researchers. Kluwer Academic Publishers.

de Silva, V., & Tenenbaum, J. B. (2003). Global versus local methods in nonlinear dimensionality reduction. Advances in Neural Information Processing Systems 15 (pp. 721–728). Cambridge, MA: MIT Press.

Garcia, H. E., Locarnini, R. A., Boyer, T. P., & Antonov, J. I. (2006a). World Ocean Atlas 2005. Volume 3: Dissolved Oxygen, Apparent Oxygen Utilization, and Oxygen saturation. In S. Levitus (Ed.), NOAA Atlas NESDIS, vol. 63, 1–342. Washington, D.C.: U.S. Government Printing Office.

Garcia, H. E., Locarnini, R. A., Boyer, T. P., & Antonov, J. I. (2006b). World Ocean Atlas 2005. Volume 4: Nutrients (phosphate, nitrate, silicate). In S. Levitus (Ed.), NOAA Atlas NESDIS, vol. 64, 1–396. Washington, D.C.: U.S. Government Printing Office.

Giesbrecht, W. (1892). Systematik und Faunistik der pelagischen Copepoden des Golfes von Neapel und der angrenzenden Meeres-Abschnitte. Berlin: R. Friedlaender.

Locarnini, R. A., Mishonov, A. V., Antonov, J. I., Boyer, T. P., & Garcia, H. E. (2006). World Ocean Atlas 2005. Volume 1: Temperature. In S. Levitus (Ed.), NOAA Atlas NESDIS, vol. 61, 1–182. Washington, D.C.: U.S. Government Printing Office.

Longhurst, A. (1998). Ecological Geography of the Sea. San Diego: Academic Press.

McGowan, J. A. (1971). Oceanic biogeography of the pacific. In B. Funnell and W. R. Riedel (Eds.), The Micropalaeontology of the Oceans, 2–74. Cambridge: Cambridge University Press.

Mjolsness, E., & DeCoste, D. (2001). Machine Learning for Science: State of the Art and Future Prospects. Science, 293, 2051–2055.

Ng, A. Y., Jordan, M., & Weiss, Y. (2002). On spectral clustering: analysis and an algorithm. Advances in Neural Information Processing Systems 14 (pp. 849–856). Cambridge, MA: MIT Press.

Oliver, M. J., & Irwin, A. J. (2008). Objective global ocean biogeographic provinces. Geophys. Res. Lett., 35, L15601.

Song, L., Smola, A., Borgwardt, K., & Gretton, A. (2008). Colored maximum variance unfolding. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in neural information processing systems 20. Cambridge, MA: MIT Press.

Sun, J., Boyd, S., Xiao, L., & Diaconis, P. (2006). The fastest mixing Markov process on a graph and a connection to a maximum variance unfolding problem. SIAM Review, 48(4), 681–699.

Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.

Tomczak, M., & Godfrey, J. S. (2003). Regional Oceanography: An Introduction. Delhi: Daya Publishing House.

Weinberger, K., Sha, F., Zhu, Q., & Saul, L. K. (2007). Graph Laplacian regularization for large-scale semidefinite programming. In B. Schölkopf, J. Platt and T. Hofmann (Eds.), Advances in neural information processing systems 19. Cambridge, MA: MIT Press.

Weinberger, K. Q., Sha, F., & Saul, L. K. (2004). Learning a kernel matrix for nonlinear dimensionality reduction. Proceedings of the Twenty First International Conference on Machine Learning (ICML-04) (pp. 839–846). Banff, Canada.

