
Earth Surf. Dynam., 4, 445–460, 2016
www.earth-surf-dynam.net/4/445/2016/
doi:10.5194/esurf-4-445-2016
© Author(s) 2016. CC Attribution 3.0 License.

An introduction to learning algorithms and potential applications in geomorphometry and Earth surface dynamics

Andrew Valentine¹ and Lara Kalnins²

¹Department of Earth Sciences, Universiteit Utrecht, Postbus 80.021, 3508TA Utrecht, the Netherlands
²Department of Earth Sciences, Science Labs, Durham University, Durham, DH1 3LE, UK

Correspondence to: Andrew Valentine ([email protected])

Received: 31 January 2016 – Published in Earth Surf. Dynam. Discuss.: 2 February 2016
Revised: 28 April 2016 – Accepted: 19 May 2016 – Published: 30 May 2016

Abstract. “Learning algorithms” are a class of computational tool designed to infer information from a data set, and then apply that information predictively. They are particularly well suited to complex pattern recognition, or to situations where a mathematical relationship needs to be modelled but where the underlying processes are not well understood, are too expensive to compute, or where signals are over-printed by other effects. If a representative set of examples of the relationship can be constructed, a learning algorithm can assimilate its behaviour, and may then serve as an efficient, approximate computational implementation thereof. A wide range of applications in geomorphometry and Earth surface dynamics may be envisaged, ranging from classification of landforms through to prediction of erosion characteristics given input forces. Here, we provide a practical overview of the various approaches that lie within this general framework, review existing uses in geomorphology and related applications, and discuss some of the factors that determine whether a learning algorithm approach is suited to any given problem.

1 Introduction

The human brain has a remarkable capability for identifying patterns in complex, noisy data sets, and then applying this knowledge to solve problems or negotiate new situations. The research field of “learning algorithms” (or “machine learning”) centres around attempts to replicate this ability via computational means, and is a cornerstone of efforts to create “artificial intelligence”. The fruits of this work may be seen in many different spheres – learning algorithms feature in everything from smartphones to financial trading. As we shall discuss in this paper, they can also prove useful in scientific research, providing a route to tackling problems that are not readily solved by conventional approaches. We will focus particularly on applications falling within geomorphometry and Earth surface dynamics, although the fundamental concepts are applicable throughout the geosciences, and beyond.

This paper does not attempt to be comprehensive. It is impossible to list every problem that could potentially be tackled using a learning algorithm, or to describe every technique that might somehow be said to involve “learning”. Instead, we aim to provide a broad overview of the possibilities and limitations associated with these approaches. We also hope to highlight some of the issues that ought to be considered when deciding whether to approach a particular research question by exploring the use of learning algorithms.

The artificial intelligence literature is vast, and can be confusing. The field sits at the interface of computer science, engineering, and statistics: each brings its own perspective, and sometimes uses different language to describe essentially the same concept. A good starting point is the book by Mackay (2003), although this assumes a certain level of mathematical fluency; Bishop (2006) is drier and has a somewhat narrower focus, but is otherwise aimed at a similar readership. Unfortunately, the nature of the field is such that there are few good-quality reviews targeted to the less mathematically inclined, although one example can be found in Olden et al. (2008).


There is also a wealth of tutorials and other course material available online, varying in scope and quality. Finally, we draw readers’ attention to a recent review by Jordan and Mitchell (2015), offering a broad survey of machine learning and its potential applications.

Learning algorithms are computational tools, and a number of software libraries are available which provide users with a relatively straightforward route to solving practical problems. Notable examples include pybrain and scikit-learn for Python (Schaul et al., 2010; Pedregosa et al., 2011), and the commercially available “Statistics and Machine Learning” and “Neural Network” toolboxes for Matlab. Most major techniques are also available as packages within the statistical programming language R. Nevertheless, we encourage readers with appropriate interest and skills to spend some time writing their own implementations for basic algorithms: our experience suggests that this is a powerful aid to understanding how methods behave, and provides an appreciation for potential obstacles or pitfalls. In principle, this can be achieved using any mainstream programming language, although it is likely to be easiest in a high-level language with built-in linear algebra support (e.g. Matlab, Python with NumPy). Efficient “production” applications are likely to benefit from use of the feature-intensive and highly optimised tools available within the specialist software libraries mentioned above.

This paper begins with a brief overview of the general framework within which learning algorithms operate. We then introduce three fundamental classes of problem that are often encountered in geomorphological research, and which seem particularly suited to machine learning solutions. Motivated by this, we survey some of the major techniques within the field, highlighting some existing applications in geomorphology and related fields. Finally, some of the practical considerations that affect implementation of these techniques are discussed, and we highlight some issues that should be noted when considering exploring learning algorithms further.

1.1 Learning algorithms: a general overview

Fundamentally, a learning algorithm is a set of rules that are designed to find and exploit patterns in a data set. This is a familiar process when patterns are known (or assumed) to take a certain form – consider fitting a straight line to a set of data points, for example – but the power of most learning algorithm approaches lies in their ability to handle complex, arbitrary structures in data. Traditionally, a distinction is drawn between “supervised” learning algorithms – which are aimed at training the system to recognise known patterns, features, or classes of object – and “unsupervised” learning, aimed at finding patterns in the data that have not previously been identified or that are not well defined. Supervised learning typically involves optimising some pre-defined measure of the algorithm’s performance – perhaps minimising the difference between observed values of a quantity and those predicted by the algorithm – while in unsupervised learning, the goal is usually for the system to reach a mathematically stable state.

At a basic level, most learning algorithms can be regarded as “black boxes”: they take data in, and then output some quantity based upon that data. The detail of the relationship between inputs and outputs is governed by a number of adjustable parameters, and the “learning” process involves tuning these to yield the desired performance. Thus, a learning algorithm typically operates in two modes: a learning or “training” phase, where internal parameters are iteratively updated based on some “training data”, and an “operational” mode in which these parameters are held constant, and the algorithm outputs results based on whatever it has learned. Depending on the application, and the type of learning algorithm, training may operate as “batch learning” – where the entire data set is assimilated in a single operation – or as “online learning”, where the algorithm is shown individual data examples sequentially and updates its model parameters each time. This may be particularly useful in situations where data collection is ongoing, and it is therefore desirable to be able to refine the operation of the system based on this new information.
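To make the distinction concrete, the following Python sketch (using scikit-learn, one of the libraries mentioned above) fits the same simple classifier in both modes on a purely synthetic data set; the arrays, the choice of SGDClassifier, and the 20-example chunks standing in for newly collected data are illustrative assumptions rather than a prescription.

```python
# Illustrative sketch (synthetic data): batch vs. online training in scikit-learn.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_all = rng.normal(size=(200, 5))                    # 200 data vectors, 5 elements each
y_all = (X_all[:, 0] + X_all[:, 1] > 0).astype(int)  # desired outputs

# Batch learning: the entire training set is assimilated in a single operation.
batch_model = SGDClassifier(random_state=0)
batch_model.fit(X_all, y_all)

# Online learning: examples arrive in small groups, and the internal
# parameters are updated each time new data become available.
online_model = SGDClassifier(random_state=0)
for start in range(0, len(X_all), 20):
    X_new, y_new = X_all[start:start + 20], y_all[start:start + 20]
    online_model.partial_fit(X_new, y_new, classes=[0, 1])

# Operational mode: parameters are held constant and the models simply predict.
print(batch_model.predict(X_all[:5]))
print(online_model.predict(X_all[:5]))
```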

In the context of learning algorithms, a “data set” is generally said to consist of numerous “data vectors”. For our purposes, each data vector within the data set will usually correspond to the same set of physical observations made at different places in space or time. Thus, a data set might consist of numerous stream profiles, or different regions extracted from a lidar-derived digital elevation model (DEM). It is possible to combine multiple, diverse physical observations into a single data vector: for example, it might be desirable to measure both the cross section and variations in flow rate across streams, and regard both as part of the same data vector. It is important to ensure that all data vectors constituting a given data set are obtained and processed in a similar manner, so that any difference between examples can be attributed solely to physical factors. In practice, pre-processing and “standardising” data to enhance features that are likely to prove “useful” for the desired task can also significantly impact performance: we return to this in Sect. 4.1.
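As a minimal sketch of how such data vectors might be assembled and standardised in practice (the profile and flow-rate arrays here are random placeholders, and z-scoring is one option among several):

```python
# Illustrative sketch: combining two observation types into data vectors and
# standardising each element to zero mean and unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

n_sites = 50
profiles = np.random.rand(n_sites, 20)           # e.g. 20 height samples per stream cross section
flow_rates = 100.0 * np.random.rand(n_sites, 1)  # one flow-rate measurement per site

# Each row is one data vector combining both observation types.
data_vectors = np.hstack([profiles, flow_rates])

# Rescale so that no element dominates purely because of its units.
scaler = StandardScaler()
data_standardised = scaler.fit_transform(data_vectors)
print(data_standardised.shape)                   # (50, 21)
```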

2 Some general classes of geomorphological problem

Broadly speaking, we see three classes of problem where learning algorithms can be particularly useful in geomorphology: classification and cataloguing; cluster analysis and dimension reduction; and regression and interpolation. All represent tasks that can be difficult to implement effectively via conventional means, and which are fundamentally data-driven. However, there can be considerable overlap between all three, and many applications will not fit neatly into one category.


2.1 Classification and cataloguing

Classification problems are commonplace in observational science, and provide the canonical application for supervised learning algorithms. In the simplest case, we have a large collection of observations of the same type – perhaps cross sections across valleys – and we wish to assign each to one of a small number of categories (for example, as being of glacial or riverine form). In general, this kind of task is straightforward to an experienced human eye. However, it may be difficult to codify the precise factors that the human takes into account, preventing their implementation as computer code: simple rules break down in the face of the complexities inherent to real data from the natural world. With a learning algorithm approach, the user typically classifies a representative set of examples by hand, so that each data vector is associated with a “flag” denoting the desired classification. The learning algorithm then assimilates information about the connection between observations and classification, seeking to replicate the user’s choices as closely as possible. Once this training procedure has been completed, the system can be used operationally to classify new examples, in principle without the need for further human involvement.
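The workflow just described might look as follows in code; this is a schematic example only, with synthetic “cross sections”, labels generated by a stand-in rule rather than a human expert, and an arbitrary choice of classifier:

```python
# Illustrative sketch of supervised classification: hand-labelled training data,
# a training phase, and a check on examples withheld from training.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
cross_sections = rng.normal(size=(300, 25))            # data vectors (e.g. valley cross sections)
flags = (cross_sections.mean(axis=1) > 0).astype(int)  # stand-in for hand-assigned flags
                                                       # (e.g. 0 = glacial, 1 = riverine)

X_train, X_test, y_train, y_test = train_test_split(
    cross_sections, flags, test_size=0.25, random_state=0)

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)                       # training phase
print("held-out accuracy:", classifier.score(X_test, y_test))  # check on unseen examples
```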

Beyond an obvious role as a labour-saving device, automated systems may enable users to explore how particular factors affect classification. It is straightforward to alter aspects of data processing, or the labelling of training examples, and then re-run the classification across a large data set. Another advantage lies in the repeatable nature of the classification: observations for a new area, or obtained at a later date, can be processed in exactly the same manner as the original data set, even if personnel differ. It is also possible to use multiple data sets simultaneously when performing the classification – for example, identification of certain topographic features may be aided by utilising high-resolution local imagery, plus lower-resolution data showing the surrounding region, or land use classification may benefit from using topography together with satellite imagery.

It is sometimes claimed that it is possible to somehow “interrogate” the learning algorithm so as to discover its internal state and understand which aspects of the data are used to make a particular classification. This information could offer new insights into the physical processes underpinning a given problem. For the simplest classifiers, this may be possible, but in general we believe it ought to be approached with some scepticism. Classification systems are complex, and subtle interactions between their constituent parts can prove important. Thus, simplistic analysis may prove misleading. A more robust approach, where feasible, would involve classifying synthetic (artificial) data and exploring how the parameters controlling its generation affect results (see also Hillier et al., 2015). It may also be instructive to explore how performance varies when different subsets of observables are used.

Conventionally, classification problems assume that all examples presented to the system can be assigned to one category or another. A closely related problem, which we choose to call “cataloguing”, involves searching a large data set for examples of a particular feature – for example, locating moraines or faults in regional-scale topographic data. This introduces additional challenges: each occurrence of the feature should be detected only once, and areas of the data set that do not contain the desired feature may nevertheless vary considerably in their characteristics. As a result, cataloguing problems may require an approach that differs from other classification schemes.

Examples of classification problems in geomorphology where machine learning techniques have been applied include classifying elements of urban environments (Miliaresis and Kokkas, 2007), river channel morphologies (Beechie and Imaki, 2014), and landslide susceptibility levels (e.g. Brenning, 2005). An example of a cataloguing problem is given in Valentine et al. (2013), aimed at identifying seamounts in a range of tectonic settings.

2.2 Cluster analysis and dimension reduction

Classification problems arise when the user has prior knowledge of the features they wish to identify within a given data set. However, in many cases we may not fully understand how a given process manifests itself in observable phenomena. Cluster analysis and dimension reduction techniques provide tools for “discovering” structure within data sets, by identifying features that frequently occur, and by finding ways to partition a data set into two or more parts, each with a particular character. An accessible overview of the conceptual basis for cluster analysis, as well as a survey of available approaches, can be found in Jain (2010).

In many applications, data vectors are overparameterised: the representations used for observable phenomena have more degrees of freedom than the underlying physical system. For example, local topography might be represented as a grid of terrain heights. If samples are taken every 10 m, then 11 samples span a distance of 100 m, and a 100 m × 100 m area is represented by 121 distinct measurements. A single 121-dimensional data vector is obtained by “unwrapping” the grid according to some well-defined scheme – perhaps by traversing lines of latitude. Nevertheless, the underlying process of interest might be largely governed by a handful of parameters. This implies that there is a high level of redundancy within the data vector, and it could therefore be transformed to a lower-dimensional state without significant loss of information. This may be beneficial, either in its own right or as an adjunct to other operations: low-dimensional systems tend to be easier to handle computationally, and similarities or differences between examples may be clearer after transformation. Dimension reduction algorithms aim to find the optimal low-dimensional representation of a given data set.
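The 11 × 11 example above can be written out directly; the heights here are random numbers standing in for a real DEM patch:

```python
# Illustrative sketch: "unwrapping" a grid of terrain heights into a data vector.
import numpy as np

patch = np.random.rand(11, 11)          # hypothetical heights sampled every 10 m over 100 m x 100 m
data_vector = patch.flatten()           # unwrap row by row into a single vector
print(data_vector.shape)                # (121,)

# A data set is then a stack of such vectors, one row per patch.
patches = np.random.rand(500, 11, 11)
data_set = patches.reshape(500, -1)     # shape (500, 121)
```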

One particularly important application of dimension reduction lies in visualisation.


Where individual data vectors are high-dimensional, it may be difficult to devise effective means of plotting them in (typically) two dimensions. This makes it difficult to appreciate the structure of a data set, and how different examples relate to one another. In order to tackle this problem, learning algorithms may be used to identify a two-dimensional representation of the data set that somehow preserves the higher-dimensional relationships between examples. This process involves identifying correlations and similarities between individual data vectors, and generally does not explicitly incorporate knowledge of the underlying physical processes. Thus, the coordinates of each example within the low-dimensional space may not have any particular physical significance, but examples that share common characteristics will yield closely spaced points. It may then be possible to identify visual patterns, and hence discover relationships within the high-dimensional data.

Geomorphological applications of cluster analysis include identifying flow directions from glacial landscapes (Smith et al., 2016), identifying different types of vegetation (Belluco et al., 2006), or extracting common structural orientations from a laser scan of a landslide scarp (Dunning et al., 2009). An example of dimension reduction comes again from landslide susceptibility studies, this time aimed at identifying the most influential observables within a suite of possibilities (Baeza and Corominas, 2001).

2.3 Regression and interpolation

The third class of problem involves learning relationships between physical parameters, in order to make predictions or to infer properties. Very often, it is known that one set of observable phenomena is closely related to a different set – but the details of that relationship may be unknown, or it may be too complex to model directly. However, if it is possible to obtain sufficient observations where both sets of phenomena have been measured, a learning algorithm can be used to represent the link, and predict one given the other – for example, an algorithm might take measurements of soil properties and local topography and then output information about expected surface run-off rates. Alternatively, the same training data could be used to construct a system that infers the soil parameters given topography and run-off measurements. This may be useful when there are fewer measurements available for one of the physical parameters, perhaps because it is harder or more expensive to measure: once trained on examples where this parameter has been measured, the algorithm can be used to estimate its value in other locations based on the more widely available parameters.

Questions of this sort may be framed deterministically – so that the system provides a single prediction – or statistically, where the solution is presented as a probability distribution describing the range of possible outcomes. The choice of approach will depend upon the nature of the underlying problem, and upon the desired use of the results. In general, probabilistic approaches are desirable, since they provide a more realistic characterisation of the system under consideration – deterministic approaches can be misleading when more than one solution is compatible with available data, or where uncertainties are large. However, in some cases it may be difficult to interpret and use information presented as a probability distribution. For completeness, we observe that most learning algorithms have their roots in statistical theory, and even when used “deterministically”, the result is formally defined within a statistical framework.

In geomorphology, a common application of machine learning for regression and interpolation is to link widely available remote sensing measurements with underlying parameters of interest that cannot be measured directly: for example, sediment and chlorophyll content of water from colour measurements (Krasnopolsky and Schiller, 2003) or marine sediment properties from bathymetry and proximity to the coast (Li et al., 2011; Martin et al., 2015).

3 Some popular techniques

The aforementioned problems can be tackled in almost any number of ways: there is rarely a single “correct” approach to applying learning algorithms to any given question. As will become clear, once a general technique has been selected, there remains a considerable array of choices to be made regarding its precise implementation. Usually, there is no clear reason to make one decision instead of another – often, the literature describes some “rule of thumb”, but its underlying rationale may not always be obvious. A certain amount of trial and error is generally required to obtain optimal results with a learning algorithm. This should perhaps be borne in mind when comparisons are drawn between different approaches: although many studies can be found in the literature that conclude that one method outperforms another for a given problem, it is unlikely that this has been demonstrated to hold for all possible implementations of the two methods. It is also worth noting that the relationship between performance and computational demands may differ between algorithms: a method that gave inadequate performance on a desktop computer a decade ago may nevertheless excel given the vastly increased resources of a modern, high-performance machine.

In what follows, we outline a selection of common methods, with an emphasis on conveying general principles rather than providing precise formal definitions. There is no particular rationale underpinning the methods we choose to include here, beyond a desire to cover a spectrum of important approaches. Other authors would undoubtedly make a different selection (for example, see Wu et al., 2008, although this has a narrower scope than the present work). In an effort to promote readability, we order our discussion roughly according to complexity, although this is not an objective assessment. A brief summary of the methods discussed can be found in Table 1.


Table 1. Summary of methods discussed in this paper. For each of the “popular techniques” discussed in Sect. 3, we indicate the classes of problem for which they are generally used (as described in Sect. 2); whether the method generally operates as “supervised” learning (based on optimising a pre-determined performance measure), or unsupervised (attempting to reach a stable state); and whether the method is typically “deterministic”, so that it is guaranteed to yield identical results each time it is applied to a given data set. Note that Bayesian inference is itself deterministic, but it is most often encountered in contexts where it is applied to randomly chosen observations. The indications given here are not intended to be exhaustive: it may be possible to adapt each technique to suit the full range of applications.

Method             | Classification | Clustering | Dimension reduction | Regression | Supervised | Unsupervised | Deterministic
Decision trees     | X              |            |                     |            | X          |              |
K-means            |                | X          |                     |            |            | X            |
PCA                |                |            | X                   |            |            | X            | X
Neural networks    | X              |            | X                   | X          | X          | X            |
SVMs               | X              |            |                     |            | X          |              |
SOMs               |                | X          | X                   |            |            | X            |
Bayesian inference | X              | X          |                     | X          | X          | X            | (X)

3.1 Decision trees and random forests

A decision tree is a system that takes a data vector, and processes it via a sequence of if-then-else constructs (“branch points”) until an output state can be determined (see Fig. 1). This is clearly well suited to addressing simple classification problems, and to predicting the states of certain physical systems: essentially, the system resembles a flowchart. In this context, “learning” involves choosing how the data set should be partitioned at each branch.

Typically, each data vector contains a number of “elements” – distinct observations, perhaps made at different points in space or time, or of different quantities relevant to the phenomenon of interest. Each vector is also associated with a particular “desired outcome” – the classification or state that the tree should output when given that example. Basic decision tree generation algorithms aim to identify a test that can be applied to any one element, which separates desired outcomes as cleanly as possible (e.g. Fig. 1a, b). This is typically quantified using a measure such as “information entropy”, which assesses the degree to which a system behaves predictably. Once the training data have been partitioned into two sets, the algorithm can be applied recursively on each, until a complete tree has been constructed. Commonly encountered tree generation schemes include ID3 (Quinlan, 1986) and its successor C4.5 (Quinlan, 1993).

Tree generation assumes that the data are perfect, and will therefore continue adding branch points until all training data can be classified as desired. When real-world data sets – which invariably contain errors and “noise” – are used, this tends to result in overly complex trees, with many branches. This phenomenon is known as “overfitting”, and tends to result in a tree with poor generalisation performance: when used to process previously unseen examples, the system does not give the desired outcome as often as one might hope. It is therefore usual to adopt some sort of “pruning” strategy, by which certain branches are merged or removed. Essentially, this entails prioritising simple trees over perfect performance for training data; a variety of techniques exist, and the choice will probably be application-specific.

Another approach to this issue, and to the fact that in many problems the number of data elements vastly exceeds the number of possible outcomes, lies in the use of “random forests” (Breiman, 2001). By selecting data vectors from the training set at random (with replacement), and discarding some randomly chosen elements from these, we can construct a number of new data sets. It is then straightforward to build a decision tree for each of these. Typically, this results in trees that perform well for some – but not all – examples. However, if we use each tree to predict an outcome for a given data vector, and then somehow compute the average of these predictions, performance is usually significantly better than can be achieved with any one tree. This strategy, where a number of similar, randomised systems are constructed and then used simultaneously, is sometimes referred to as an “ensemble” method. Again, when treated in more detail, a variety of approaches are possible, and the intended application may help dictate which should be used.
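A hedged sketch of the ensemble idea, using scikit-learn's random forest implementation on synthetic stand-ins for the kind of seafloor proxy data discussed below (the array names and the simple linear relationship are invented for illustration):

```python
# Illustrative sketch: a random forest averaging many randomised trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
proxies = rng.normal(size=(400, 6))     # e.g. water depth, distance to coast, ... (synthetic)
porosity = 0.5 + 0.1 * proxies[:, 0] + 0.02 * rng.normal(size=400)   # invented target values

forest = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(proxies, porosity)           # each tree is grown on a bootstrap resample,
                                        # considering a random subset of elements at each split
print(forest.predict(proxies[:3]))      # predictions averaged over all 200 trees
```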

A recent example of an application of random forests to Earth surface data can be found in Martin et al. (2015). Here, the goal is to predict the porosity of sediments lying on the ocean floor from a range of different data, including measures such as water depth or distance from sediment-producing features. Such observations are much easier to obtain than direct measurement of the seafloor, which obviously requires ocean-bottom sampling. Using training data from locations where such samples have been collected, a random forest is constructed that enables porosity predictions to be made throughout the oceans. This concept could readily be adapted to a variety of other situations where local physical properties must be estimated from remotely sensed data.


Figure 1. Evolution of a decision tree for land use data based on normalised parameters for vegetation index, seasonal colour variability, and topographic roughness. (a) The algorithm tests a variety of conditions for the first branch point, looking for a condition such as (b) that cleanly separates different classes. (c) A second branch point is added to differentiate between built-up and grazing land.

Further discussion of the use of random forests for interpolation between samples can be found in, for example, Li et al. (2011), who test a variety of techniques including random forests and support vector machines in order to produce a regional map of mud content in sediments based on discrete sampling. Another example comes from Bhattacharya et al. (2007), who construct models of sediment transport using a variant of decision trees where each “leaf” is a linear regression function, rather than a single classification; the decision tree is essentially used to choose which mathematical model to apply in each particular combination of circumstances.

3.2 The k-means algorithm

By far the most well-known technique for cluster analysis, k-means is usually said to have its origins in work eventually published as Lloyd (1982), but disseminated earlier (thus, for example, Hartigan and Wong, 1979, set out a specific implementation). The algorithm is designed to divide a set of N data vectors, each consisting of M elements, into K clusters. These clusters are defined so that the distance of each point in the cluster from its centre is as small as possible.

The algorithm is readily understood, and is illustrated in Fig. 2. We begin by generating K points, which represent our initial guesses for the location of the centre of each cluster – these may be chosen completely at random, or based on various heuristics which attempt to identify a “sensible” initial configuration. We then assign each element of our training data to the cluster with the nearest centre (Fig. 2b). Once this has been done, we recompute the position of the central point by averaging the locations of all points in the cluster (Fig. 2c). This process of assignment and averaging is repeated (Fig. 2d) until a stable configuration is obtained (Fig. 2e). The resulting clusters may then be inspected to ascertain their similarities and differences, and new data can be classified by computing its distance from each cluster centre.

In order to implement this, it is necessary to define what the word “distance” means in the context of comparing any two data vectors. There are a number of possible definitions, but it is most common to use the “Euclidean” distance: the sum of the squared difference between each element of the two vectors. Thus, if x is a vector with M elements (x_1, x_2, ..., x_M) and y a second vector (y_1, y_2, ..., y_M), then the Euclidean distance between them is defined

d = \sum_{i=1}^{M} (x_i - y_i)^2.     (1)

This definition is a natural extension of our everyday understanding of the concept of “distance”. However, where data vectors are comprised of more than one class of observation – perhaps combining topographic heights with soil properties – problems can arise if the measurements differ considerably in typical scale. The Euclidean distance between two two-element data vectors (1, 10^{-9}) and (1, 10^{-5}) is very small, despite the second elements differing by 4 orders of magnitude, because both are negligible in comparison to the first element. It may be necessary to rescale the various measurements that make up a data vector to ensure that they have a similar magnitude and dynamic range. Alternatively, and equivalently, the definition of “distance” can be adapted to assign different weights to various data types.
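The following NumPy fragment applies Eq. (1) to the two-element example above, and shows one simple way of rescaling (z-scoring each element; the zero-variance guard is only needed because this toy data set has just two vectors):

```python
# Numerical illustration of Eq. (1) and of the scaling problem described above.
import numpy as np

def squared_euclidean(x, y):
    """Distance measure of Eq. (1): sum of squared element-wise differences."""
    return np.sum((np.asarray(x) - np.asarray(y)) ** 2)

a = np.array([1.0, 1e-9])
b = np.array([1.0, 1e-5])
print(squared_euclidean(a, b))          # ~1e-10: the second element is effectively ignored

# Rescale each element of the data set to a comparable magnitude first.
data = np.vstack([a, b])
std = data.std(axis=0)
std[std == 0.0] = 1.0                   # avoid dividing by zero where an element never varies
scaled = (data - data.mean(axis=0)) / std
print(squared_euclidean(scaled[0], scaled[1]))   # now of order 1, no longer negligible
```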

In its basic form, the k-means algorithm requires the user to specify the number of clusters to be sought as a priori information. In many cases, this may not be known, and a range of different solutions have been proposed – see, for example, Jain (2010).


Figure 2. The k-means algorithm. (a) A data set contains points clustered around three distinct locations (colour-coded for ease of reference). (b) We first guess locations for these centres at random (black squares) and assign each datum to the nearest cluster (shown divided by black lines). (c) We then re-compute the location of each cluster centre by averaging all points within the cluster (old centres, grey squares; new centres, black squares), and (d) update the cluster assignment of each point to reflect this. It may be necessary to repeat steps (c) and (d) for several iterations until a stable partitioning of the data set is found (e).

Fundamentally, these entail balancing increased complexity (i.e., an increased number of clusters) against any resulting reduction in the average distance between samples and the cluster centre. In general, this reduction becomes insignificant once we pass a certain number of clusters, and this is taken to provide an appropriate description of the data.
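A brief sketch of this trade-off, using scikit-learn's k-means on a synthetic data set with three genuine groups (the numbers and cluster locations are arbitrary): the summed within-cluster distance drops sharply up to K = 3 and only marginally thereafter.

```python
# Illustrative sketch: within-cluster distance as a function of the number of clusters K.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
centres = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in centres])

for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, round(model.inertia_, 1))   # inertia_: summed squared distance to cluster centres
```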

Clustering algorithms such as k-means have seen use in geomorphology as an aid to analysis or interpretation of a variety of data sets. It is common to first apply some transformation or pre-processing to raw data so as to improve sensitivity to particular classes of feature. For example, Miliaresis and Kokkas (2007) take lidar DEMs, apply various filters designed to enhance the visibility of the built environment within the image, and then use k-means to distinguish different areas of urban environment (such as differentiating between vegetation, buildings, and roads/pavements). Similarly, Belluco et al. (2006) use the technique to assist in mapping vegetation based on remotely sensed spectral imaging, although they find that which method performs best depends on the type of imaging and field data. On a much smaller scale, Dunning et al. (2009) use k-means and other clustering algorithms to extract discontinuity orientations from laser-derived observations of landslide scarps, which help constrain the mechanism behind the slope failure.

3.3 Principal component analysis

Often, the different observations comprising a given data vector are correlated, and thus not fully independent: for example, topographic heights at adjacent sites are likely to be reasonably similar to one another, and an imprint of topography will often be found in other data sets such as soil thickness or temperature. For analysis purposes, it is often desirable to identify the patterns common to multiple data elements, and to transform observations into a form where each parameter, or component, is uncorrelated from the others. Principal component analysis (PCA) provides one of the most common techniques for doing so, and has its roots in the work of Pearson (1901). Essentially the same mathematical operation arises in a variety of other contexts, and has acquired a different name in each: for example, “singular value decomposition”, “eigenvector analysis”, and the concept of “empirical orthogonal functions” are all closely related to PCA.

Figure 3. Principal component analysis of a simple data set. Dominant principal component shown in red; secondary principal component shown in blue. Line lengths are proportional to the weight attached to each component. It is apparent that the principal components align with the directions in which the data set shows most variance.

Numerical algorithms for performing PCA are complex, and there is usually little need for the end user to understand their intricacies. In general terms, PCA involves finding the “direction” in which the elements of a data set exhibit the greatest variation, and then repeating this with the constraint that each successive direction considered must be at right angles (orthogonal) to those already found. Although easiest to visualise in two or three dimensions (see Fig. 3), the principle works in the same way for data vectors with more elements: for a set of data vectors with M elements, it is possible to construct up to M perpendicular directions.

Thus, the outcome of PCA is a set of orthogonal directions (referred to as principal components) ordered by their importance in explaining a given data set: in a certain sense, this can be regarded as a new set of co-ordinate axes against which data examples may be measured.


Figure 4. Image reconstruction using principal components. PCA has been performed on a data set containing 1000 square “patches” of bathymetric data, each representing an area of dimension 150 km × 150 km, centred upon a seamount (the “training set” used in Valentine et al., 2013). Each patch is comprised of 64 × 64 samples – thus, each can be seen as a 4096-dimensional object. Three examples from this set are shown here (one per row). In the left-most column we show the original bathymetry for each; then, we present reconstructions of this bathymetry using only the N most significant principal components, for N = 1, 10, 100, 200, and 500. It is apparent that the large-scale structures within this data set can be represented using only 100–200 dimensions, while around 500 dimensions are required to allow some of the fine-scale structure to be represented, particularly in the third example. This still represents almost an order of magnitude reduction, in comparison to the original, 4096-dimensional, data.

The principal components may be regarded as a set of weighted averages of different combinations of the original parameters, chosen to best describe the data. Often, much of the structure of a data set can be expressed using only the first few principal components, and PCA can therefore be used as a form of dimensionality reduction (see Fig. 4). In a similar vein, it can form a useful precursor to other forms of analysis, such as clustering, by virtue of its role in unpicking the various signals contributing to a data set: often, each component will exhibit sensitivity to particular physical parameters. However, particularly where the original data are composed of a variety of different physical data types, the results of PCA may not be straightforward to interpret: each principal component may be influenced by a number of disparate measurements.
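In the spirit of Fig. 4, the sketch below reduces a set of patches to a few hundred principal components and reconstructs them; random patches are used here in place of the bathymetric data, and the choice of 200 components is arbitrary:

```python
# Illustrative sketch: PCA as dimension reduction, with reconstruction from the
# leading components (cf. Fig. 4).
import numpy as np
from sklearn.decomposition import PCA

patches = np.random.rand(1000, 64 * 64)     # 1000 patches, each a 4096-element data vector

pca = PCA(n_components=200)
scores = pca.fit_transform(patches)         # each patch reduced to 200 numbers
approx = pca.inverse_transform(scores)      # approximate reconstruction from those 200 components

print(scores.shape, approx.shape)           # (1000, 200) (1000, 4096)
print(pca.explained_variance_ratio_[:5])    # variance captured by the leading components
```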

One example of this can be found in Cuadrado and Perillo (1997), where PCA is performed on a data set consisting of bathymetric measurements for a given region repeated over several months. The first principal component is then found to describe the mean bathymetry of the period, while the second provides information about the general trend of change in bathymetry over time, highlighting areas of deposition and erosion. Another typical application occurs in Baeza and Corominas (2001), where PCA is used to identify the observable parameters that best serve as predictors of landslide hazard. A more recent example is Tamene et al. (2006), who use PCA to identify which observable parameters best explain variability in sediment yield between different river catchments in Ethiopia. This example also shows how a single principal component may be a combination of observables: their first component, which explains about 50 % of the variability, is dominated by a combination of topographic variables, such as height difference, elongation ratio, and catchment area. The second component, which explains approximately 20 % of the variability, is dominated by variables associated with lithology and land use/vegetation cover.

3.4 Neural networks

Perhaps the most varied and versatile class of algorithm discussed in this paper is the neural network. As the name suggests, these were originally developed as a model for the workings of the brain, and they can be applied to tackling a wide range of problems. It has been shown (e.g. Hornik, 1991) that neural networks can, in principle, be used to represent arbitrarily complex mathematical functions, and their use is widespread in modern technology – with applications including tasks such as voice recognition or automatic language translation. They may be used to tackle problems falling in all three of the classes discussed in Sect. 2. A comprehensive introduction to neural networks may be found in Bishop (1995); the aforementioned book by Mackay (2003) also discusses them at some length. Some readers may also be interested in a review by Mas and Flores (2008), targeted at the remote sensing community.


Figure 5. Schematic of a simple neural network (a “multi-layer perceptron”). The network takes an N-dimensional vector, x, on the left side and transforms it into an M-dimensional vector y on the right side. Each grey box represents a “neuron”, and is a simple mathematical operation which takes many inputs (lines coming in from the left) and returns a single output value which is sent to every neuron in the next “layer” (lines coming out from the right). The neuron’s behaviour is governed by a unique set of “weights” (w): one common mathematical relation has a neuron in a layer with K inputs return y = \tanh(w_0 + \sum_{i=1}^{K} w_i x_i), where the single output element y will become part of the inputs for the next layer.

A neural network is constructed from a large number of interconnected “neurons”. Each neuron is a processing unit that takes a number of numerical inputs, computes a weighted sum of these, and then uses this result to compute an output value. The behaviour of the neuron can therefore be controlled by altering the weights used in this computation. By connecting together many neurons, with the outputs from some being used as inputs to others, a complex system – or network – can be created, as in Fig. 5. The weights associated with each neuron are unique, so that the behaviour of the network as a whole is controlled by a large number of adjustable parameters. Typically, these are initially randomised; then, a “training” procedure is used to iteratively update the weights until the network exhibits the desired behaviour for a given training set.
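A minimal NumPy sketch of the forward pass through a network of the kind shown in Fig. 5, using the tanh relation given in the caption; the weights here are simply random numbers, standing in for values that training would normally determine:

```python
# Illustrative sketch: forward pass through a small two-layer network of tanh neurons.
import numpy as np

rng = np.random.default_rng(4)

def layer(inputs, weights):
    """One layer of neurons: output_j = tanh(w_j0 + sum_i w_ji * input_i)."""
    w0 = weights[:, 0]           # per-neuron offsets
    w = weights[:, 1:]           # per-neuron input weights
    return np.tanh(w0 + w @ inputs)

x = rng.normal(size=8)                    # an 8-element input data vector
weights_1 = rng.normal(size=(12, 1 + 8))  # hidden layer: 12 neurons, each with 8 inputs plus offset
weights_2 = rng.normal(size=(3, 1 + 12))  # output layer: 3 neurons fed by the 12 hidden outputs

hidden = layer(x, weights_1)
y = layer(hidden, weights_2)
print(y)                                  # the network's 3-element output
```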

Such a brief description glosses over the richness of approaches within “neural networks”: choices must be made regarding how individual neurons behave, how they are connected together (the “network architecture”), and how the training algorithm operates. These are generally somewhat application-dependent, and may dictate how effectively the network performs. The simplest form of neural network, sometimes called a “multi-layer perceptron” (MLP), consists of neurons arranged in two or three “layers”, with the outputs from neurons in one layer being used as the inputs for the next layer (see Fig. 5). Traditionally, these are trained by “back-propagation” (Rumelhart et al., 1986), which involves calculating how network outputs are influenced by each individual weight. The canonical use for such a network is as a classifier (e.g. Lippmann, 1989; Bischof et al., 1992), although they are exceptionally versatile: for example, Ermini et al. (2005) demonstrate that MLPs can be used for landslide susceptibility prediction. They may also be used to model a wide variety of physical relationships between observables, for example, in interpolation problems such as estimating difficult-to-measure surface properties from satellite observations (e.g. Krasnopolsky and Schiller, 2003), and to detect unusual or unexpected features within data sets (e.g. Markou and Singh, 2003).

In recent years, attention has increasingly focussed on “deep learning”, where many more layers of neurons are used. This has proven effective as a means to “discover” structure in large, complex data sets and represent these in a low-dimensional form (Hinton and Salakhutdinov, 2006). Typically, these require specialised training algorithms, since the number of free parameters in the network is exceptionally large. Such systems are particularly useful for applications in cluster analysis and dimension reduction, and these properties can be exploited to enable cataloguing of geomorphological features in large data sets, as in Valentine et al. (2013).

3.5 Support vector machines

The modern concept of the support vector machine (SVM) stems from the work of Cortes and Vapnik (1995), although this builds on earlier ideas. The approach is targeted towards classification problems, framed in terms of finding, and then utilising, “decision boundaries” that separate one class from another. In the simplest case, linear decision boundaries can be found – that is, any two classes can always be separated by a straight line when two-dimensional slices through the data set are plotted on a graph. The SVM method provides an algorithm for constructing linear decision boundaries that maximise the “margin” between boundary and adjacent data points, as shown in Fig. 6.

However, in most realistic cases, the data set cannot be cleanly categorised using linear boundaries: all possible linear decision boundaries will misclassify some data points.


Figure 6. Segmenting data sets with linear decision boundaries. In the original one dimension (top), it is not possible to separate red squares from blue circles. However, by mapping this data set into an artificial second dimension, it becomes possible to draw a linear “decision boundary” that distinguishes the two classes. The support vector machine provides a technique for achieving this in such a way that the “margin” between the boundary and the nearest data points is maximised (as shown by the dotted lines).

To handle this scenario, the SVM approach uses a mathematical trick to create nonlinear decision boundaries in a way that is computationally tractable. Data are first mapped into a higher-dimensional “feature space” (the opposite of dimensionality reduction); in this space, the data can then be separated using linear boundaries (see Fig. 6). This mapping, or transformation, may be nonlinear, so that lines in the feature space may correspond to curves in the original data space. Various extensions to this approach exist that allow for a less-than-perfect division of the data set, to reflect the presence of classification errors or observational noise. Once boundaries have been determined, the SVM may then be used for classification of new examples. SVMs have some similarities in structure to simple neural networks, although the “training” or optimisation procedure is quite distinct.
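The sketch below uses scikit-learn's SVM implementation with a radial basis function kernel, which plays the role of the implicit mapping to a higher-dimensional feature space; the two-ring synthetic data set and the parameter values are chosen purely for illustration:

```python
# Illustrative sketch: a nonlinear SVM classifier on data that no straight line can separate.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.08, random_state=0)

svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X, y)
print("training accuracy:", svm.score(X, y))
print("support vectors per class:", svm.n_support_)
```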

Again, landslide susceptibility assessment offers one aspect of geomorphology where SVMs have found significant application (e.g. Brenning, 2005; Yao et al., 2008; Marjanovic et al., 2011; Peng et al., 2014). They have also been used to differentiate between fluvial and glacial river valleys (Matías et al., 2009). Another similar use can be found in Beechie and Imaki (2014): there, the authors use an SVM to classify river channel morphologies based on geospatial data including DEMs, precipitation, and geology. This can then be used in river conservation and restoration to infer the natural patterns that existed in areas that have been subject to extensive human intervention.

3.6 Self-organising maps

The concept of the self-organising map (SOM) stems from the work of Kohonen (1990), and it can be viewed as a particular class of neural network. The SOM implements a form of dimensionality reduction, and is generally used to help identify clusters and categories within a data set. The basic premise is to take a (usually) two-dimensional grid and, through training, create a mapping between that 2-D space and a higher-dimensional data set (see Fig. 7). The 2-D representation can then be used to help visualise the structure of the data. The SOM is also typically designed to be significantly smaller than the training set, spanning the same data space with fewer points, and is thus easier to analyse.

To create an SOM, we start with a map consisting of a number of “nodes”, often arranged as a regular grid, so that it is possible to define a “distance” between any two nodes, and hence identify the set of nodes that lie within a certain radius of a given node, known as its “neighbourhood”. Each node is associated with a random “codebook vector” with the same dimensionality as the data (Fig. 7a). During training, we iteratively select a data example at random from the training set, identify the node with the closest-matching codebook vector, and then adjust this vector, and those of neighbouring nodes, to better match the training example (Fig. 7b). Given sufficient training, the codebook vectors come to mirror the distribution of data in the training set (Fig. 7c). Typically, both the radius used to define the “neighbourhood” of a given node and the extent to which codebook vectors are allowed to change are reduced as training proceeds in order to promote fine tuning of performance.
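A minimal NumPy sketch of this training loop is given below for a small two-dimensional grid of nodes; the grid size, the schedules for the learning rate and neighbourhood radius, and the random training data are all illustrative choices, and practical work would normally use a dedicated SOM library:

```python
# Illustrative sketch: training a small self-organising map.
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(size=(500, 3))                  # training set of 3-element data vectors

grid = np.array([(i, j) for i in range(8) for j in range(8)])   # 8 x 8 grid of node positions
codebook = rng.normal(size=(64, 3))               # one codebook vector per node

n_steps = 5000
for step in range(n_steps):
    frac = step / n_steps
    learn_rate = 0.5 * (1.0 - frac)               # both shrink as training proceeds
    radius = 0.5 + 4.0 * (1.0 - frac)

    x = data[rng.integers(len(data))]             # select a training example at random
    best = np.argmin(np.sum((codebook - x) ** 2, axis=1))        # best-matching node

    # Move the winner and its grid neighbours towards the chosen example.
    grid_dist = np.sum((grid - grid[best]) ** 2, axis=1)
    influence = np.exp(-grid_dist / (2.0 * radius ** 2))
    codebook += learn_rate * influence[:, None] * (x - codebook)

print(codebook[:3])                               # codebook vectors now mirror the data distribution
```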

Once the SOM is trained, various approaches exist to enable visualisation of the codebook vectors, with the goal of highlighting any underlying structure within the data set. One common approach is to try to identify clusters in the data set by examining the distances between the codebook vectors for neighbouring nodes, often by plotting the SOM grid coloured according to these distances (sometimes described as depicting the “U-matrix”). Alternatively, a method can be used to distort the grid in such a way that when plotted in 2-D, the distance between nodes is proportional to the distance between their codebook vectors; one common technique for this is “Sammon’s mapping” (Sammon, 1969). Another visualisation approach, sometimes called “component plane analysis”, looks for correlations between input parameters, e.g. between rainfall and elevation or slope orientation and landslide risk. Here, the SOM grid is coloured according to the values of particular elements of the codebook vectors, with each element corresponding to an input parameter. Correlated parameters can then be identified by their similar colour patterns.

Figure 7. The self-organising map (SOM). A data set consists of numerous data points (blue), and is to be represented by a one-dimensional SOM (red). The SOM consists of a number of nodes, with a well-defined spatial relationship (here, depicted by lines connecting points). Initially, all SOM nodes are associated with random locations in data space (a). During training (b–d) a datum is selected at random (highlighted in green). The closest SOM node is identified, and this and its nearest neighbours are adjusted so as to be closer to the chosen point (arrows), with the scale of adjustment proportional to the separation. After many iterations of this procedure, the distribution of SOM nodes mirrors that of the data set (e). For ease of illustration, this figure depicts a two-dimensional data set and a one-dimensional SOM; in practice, the data set will usually have much higher dimensionality, and the SOM is usually organised as a two-dimensional grid.

Potential applications in geomorphology are numerous. Marsh and Brown (2009) use an SOM-based method to analyse and classify bathymetry and backscatter data, allowing near-real-time identification of regions with particular seafloor characteristics, e.g. for benthic habitat mapping. Similarly, Ehsani and Quiel (2008) demonstrate that SOMs can be used to classify topographic features, identifying characteristic morphologies contained within DEMs such as channels, ridge crests, and valley floors. In a third example, Friedel (2011) uses SOMs to assess post-fire hazards in recently burned landscapes. Both PCA and k-means clustering are then used to divide the 540 areas studied into 8 distinct groups, which the author suggests could be used for focussing future field research and development of empirical models. As illustrated by these examples, one of the key strengths of the SOM method is that learning is unsupervised, and does not rely on the user having any prior knowledge of the "important" structures contained within the data set.

3.7 Bayesian inference

To conclude this section, we mention the concept of Bayesian inference. This is a much broader topic than any of the methods discussed so far; indeed, in many cases these methods are themselves formally derived from Bayesian concepts. Bayes' theorem (Bayes, 1763) underpins a significant fraction of modern statistical techniques, and explains how new observations can be used to refine our knowledge of a system. It tells us that the probability that the system is in a certain state, given that we have made a particular observation ("obs."), P(state | obs.), can be expressed in the form

P(state | obs.) = P(obs. | state) P(state) / P(obs.).     (2)

Here, P(state) represents our prior knowledge – that is, our assessment of the probability that the system is in the given state without making any measurements – while P(obs.) represents the probability with which we expect to obtain the exact measurements we did, in the absence of any knowledge about the state of the system. Finally, P(obs. | state) expresses the probability that we would get those measurements for a system known to be in the relevant state.

In many cases, it is possible to estimate or compute the various probabilities required to implement Bayes' theorem, and thus it is possible to make probabilistic assessments. This is often useful: for example, hazard assessment is generally better framed in terms of the chance, or risk, of an event, rather than attempting to provide deterministic predictions. An extensive discussion of Bayesian analysis can be found in Mackay (2003). Sivia (1996) is another useful reference, and a paper by Griffiths (1982) provides some geomorphological context. A wide range of computational techniques have been developed to enable Bayesian calculations in various settings, for which a review by Sambridge and Mosegaard (2002) may be a good starting point.

As a simple example, suppose we are interested in classifying land use from satellite imagery. Grassland will appear as a green pixel 80 % of the time, although it may also be brown: thus, P(green | grass) = 0.8. On the other hand, desert environments appear as brown in 99 % of cases. A particular region is known to be mainly desert, with only 10 % grassland – so an image of the area will consist of 8.9 % green pixels (= (0.01 × 0.9) + (0.8 × 0.1)). If we look at one specific pixel, and observe that it is green, Bayes' theorem tells us there is a 90 % chance that the location is grassland: P(grass | green) = P(green | grass)P(grass)/P(green) = 0.8 × 0.1/0.089 ≈ 0.899. On the other hand, if a pixel is seen to be brown, we can classify it as desert with 98 % certainty. This illustrates a property that emerges from the Bayesian formalism: unusual observations convey more information than routine ones. Before we obtained any satellite imagery, we could have guessed with 90 % accuracy that a particular location in the region was desert; observing a brown pixel leads to only a relatively modest increase in the certainty of this determination. However, if we were to observe a green pixel, the chance that the location is a desert drops markedly, to only 10 %.
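
The arithmetic of this example can be reproduced in a few lines; the probabilities below are simply those quoted in the text.

```python
# Probabilities from the land-use example in the text.
p_green_given_grass = 0.80    # grassland appears green 80 % of the time
p_brown_given_desert = 0.99   # desert appears brown 99 % of the time
p_grass = 0.10                # prior: 10 % of the region is grassland
p_desert = 0.90

# Total probability of observing a green pixel.
p_green = (1 - p_brown_given_desert) * p_desert + p_green_given_grass * p_grass
print(p_green)                                    # 0.089

# Bayes' theorem: probability of grassland given a green pixel.
print(p_green_given_grass * p_grass / p_green)    # ~0.899

# Probability of desert given a brown pixel.
p_brown = 1 - p_green
print(p_brown_given_desert * p_desert / p_brown)  # ~0.978
```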

Most examples in geomorphology again come from landslide susceptibility: Lee et al. (2002) and Das et al. (2012) use Bayesian techniques to assess landslide susceptibility in Korea and the Himalayas, respectively. In a similar application, Mondini et al. (2013) map and classify landslides in southern Taiwan using a Bayesian framework. Examples from other areas of geomorphology include Gutierrez et al. (2011), who use a Bayesian network to predict shoreline evolution in response to sea-level change, and Schmelter et al. (2011), who use Bayesian techniques for sediment transport modelling.

4 Practical considerations

Each of the techniques discussed in the previous section – and the wide variety of alternatives not mentioned here – has its strengths and weaknesses, and a number of practical issues may need to be considered when implementing a solution to a particular problem. Here, we discuss some topics that may be relevant across a range of different approaches, and which may affect the viability of a learning algorithm solution in any given case.

4.1 Constructing a training set

Unsurprisingly, the training data used when implementing a learning algorithm can have a major impact upon results: everything the system "knows" is derived entirely from these examples. Any biases or deficiencies within the training data will therefore manifest themselves in the performance of the trained system. This is not, in itself, necessarily problematic – indeed, the landform cataloguing system introduced by Valentine et al. (2013) relies upon this property – but it must be borne in mind when tools are used and results interpreted. If training data are low-quality or contain artefacts, glitches, and other observational "noise", results will suffer. In general, time invested in properly selecting and processing data sets will be well spent. However, it is also important that training data remain representative of the data that will be used during operation.

One particular issue that can arise stems from the fact that the learning algorithm lacks the trained researcher's sense of context: it has no preconception that certain structures in the data are more or less significant than others. For example, suppose a system is developed to classify valley profiles as being formed by either a river or a glacier. The training data for this system would consist of a sequence of hand-classified valley profiles, each expressed as a vector of topographic measurements. If, for the sake of example, all glacial examples happen to be drawn from low-elevation regions, and all riverine examples from high-elevation regions, it is likely that the system would learn to treat elevation as an important factor during classification.

A second potential pitfall arises from the statistical basis underpinning most learning algorithms. Typically, the extent to which a particular feature or facet of the data set will be "learnt" depends on its prevalence within the training examples as a whole. This can make it difficult for a system to recognise and use information that occurs infrequently in the training data, since it gets "drowned out" by more common features. Again, considering the problem of valley classification, if 99 % of training examples are glacial, the system is likely to learn to disregard its inputs and simply classify everything as glacial, since this results in a very low error rate; for best results, both types should occur in roughly equal proportions within the training set. As before, this should be regarded as a natural property of the learning process, rather than as being inherently problematic; indeed, it can be exploited as a tool for "novelty detection", allowing unusual features within a data set to be identified (for a review, see e.g. Marsland, 2002). Nevertheless, it is a factor that ought to be borne in mind when a system is designed and used.
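
One simple way of approaching such an imbalance, if additional field examples cannot be obtained, is to resample the minority class (or subsample the majority class) so that both appear in roughly equal numbers during training; the sketch below uses hypothetical arrays of glacial and fluvial profiles and is only one of several possible strategies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, highly imbalanced training set.
glacial = rng.normal(size=(990, 50))   # 990 glacial valley profiles
fluvial = rng.normal(size=(10, 50))    # only 10 fluvial valley profiles

# Resample the minority class (with replacement) to match the majority.
idx = rng.integers(len(fluvial), size=len(glacial))
fluvial_balanced = fluvial[idx]

# Balanced training set and labels.
X = np.vstack([glacial, fluvial_balanced])
y = np.array([0] * len(glacial) + [1] * len(fluvial_balanced))
```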

To avoid problems, it is important to choose training data with care, and to develop strategies for evaluating and monitoring the performance of the trained system. It is often beneficial to invest time in finding the best way to represent a given data type, so as to accentuate the features of interest, and remove irrelevant variables. This process is sometimes referred to as "feature selection" (e.g. Guyon and Elisseeff, 2003). Thus, in the situation mentioned above, the valley profiles might be more usefully represented as variations relative to their highest (or lowest) points, rather than as a sequence of absolute elevations. In line with our comments in Sect. 3.2, it is often helpful to "de-mean" and normalise the data: using the training set, it is straightforward to compute the mean input vector, as well as the standard deviation associated with each component of this. Then, all examples – during training, and during operation – can be standardised by subtraction of this mean and rescaling to give unit standard deviation. Clearly, the inverse of these transformations may need to be applied before results are interpreted. For more complex applications, specialist representations such as the "geomorphons" developed by Jasiewicz and Stepinski (2013) may be beneficial, and provide a targeted route to encoding geomorphological information. Training examples should be chosen to cover a spread of relevant cases, perhaps including different regions or measurements made at different times of year, to avoid introducing unintended biases into results.
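
The de-meaning and rescaling described here can be implemented directly from training set statistics; the sketch below assumes training and new examples stored as rows of hypothetical arrays X_train and X_new.

```python
import numpy as np

def fit_standardiser(X_train):
    """Per-component mean and standard deviation computed from training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0          # avoid division by zero for constant inputs
    return mean, std

def standardise(X, mean, std):
    """Subtract the training-set mean and rescale to unit standard deviation."""
    return (X - mean) / std

# Example usage with hypothetical data arrays.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=100.0, scale=25.0, size=(500, 12))
X_new = rng.normal(loc=100.0, scale=25.0, size=(20, 12))

mean, std = fit_standardiser(X_train)
X_train_std = standardise(X_train, mean, std)
X_new_std = standardise(X_new, mean, std)   # same transform at operation time
```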

4.2 Assessing performance

For "supervised" learning algorithms, where a pre-defined relationship or structure is to be learnt, it is possible to assess performance using a second set of user-selected data – often referred to as a "test" or "monitoring" data set. These test data are intended to provide an independent set of examples of the phenomena of interest, allowing quantitative measures of effectiveness to be evaluated; this strategy is sometimes referred to as "cross-validation". It is important to do this using examples separate from the training set in order to ensure that the system's "generalisation performance" is measured: we want to be sure that the algorithm has learned properties that can be applied to new cases, rather than learning features specific to the particular examples used during training. As an analogy: a dog may learn to respond to particular commands, but only when issued in a very specific manner; it cannot then be said to properly understand a spoken language.
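
In practice, this separation is easy to enforce by holding back part of the labelled data before training; the sketch below uses scikit-learn's train_test_split with a hypothetical classifier and data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical labelled data set.
X = rng.normal(size=(400, 10))
y = rng.integers(0, 2, size=400)

# Hold back 25 % of the examples; they play no part in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = SVC(kernel="rbf").fit(X_train, y_train)

# Generalisation performance is measured on the held-out examples only.
print("training accuracy:", model.score(X_train, y_train))
print("test accuracy:    ", model.score(X_test, y_test))
```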

The metric by which performance is assessed is likely to be situation-dependent. In general, a supervised learning algorithm will be designed to optimise certain "error measures", and these are likely to provide a good starting point. Nevertheless, other statistics may also prove useful. For classification systems, analysis of "receiver operating characteristics" (ROCs) such as hit and false positive rates may be instructive (e.g. Fawcett, 2006). More difficult to quantify, but still potentially valuable, is the experienced researcher's sense of the plausibility of a system's predictions: do results exhibit geomorphologically reasonable patterns? For example, does a prediction of landslide risk seem plausible in its relationship with the topography and underlying bedrock?
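
For a binary classifier that outputs a score or probability, hit rates and false positive rates across a range of decision thresholds can be obtained with standard tools; a minimal sketch, assuming hypothetical arrays of true labels and predicted scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and classifier scores for a test set.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.45, 0.6, 0.9, 0.3])

# False positive rate and hit rate (true positive rate) at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(np.column_stack([thresholds, fpr, tpr]))
print("area under ROC curve:", roc_auc_score(y_true, y_score))
```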

Assessing performance in unsupervised learning is more challenging, as we may not have prior expectations against which to measure results: fundamentally, it may be difficult to define what "good performance" should signify. In many cases, application-specific statistics may be helpful – for example, in cluster analysis it is possible to calculate the standard deviation of each cluster, quantifying how tightly each is defined – and again, the researcher's sense of plausibility may provide some insight. It may also prove instructive to repeat training using a different subset of examples, to assess how stable results are with respect to variations in the training data: a structure or grouping that appears consistently is more likely to be real and significant than one that is very dependent on the precise examples included in the training set.
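
One simple way of putting these suggestions into practice is to repeat a clustering run on random subsets of the data and compare the resulting cluster spreads across trials; the sketch below uses k-means from scikit-learn on a hypothetical data array.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # hypothetical data set

for trial in range(3):
    # Cluster a random 80 % subset of the data.
    subset = X[rng.choice(len(X), size=800, replace=False)]
    km = KMeans(n_clusters=4, n_init=10, random_state=trial).fit(subset)

    # Per-cluster standard deviation: a measure of how tightly each
    # cluster is defined, to be compared across trials.
    for k in range(4):
        members = subset[km.labels_ == k]
        print(trial, k, members.std(axis=0).mean())
```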

4.3 Overtraining and random noise

The phenomenon of "overtraining" or "overfitting", which typically arises in the context of supervised learning algorithms, has already been alluded to. It occurs when an iterative training procedure is carried out for too many iterations: at some point, the system usually begins to learn information that is specific to the training examples, rather than being general. This tends to reduce the performance of the system when subsequently applied to unseen data. It can often be detected by monitoring the algorithm's generalisation performance using an independent set of examples as training proceeds: this enables the training procedure to be terminated once generalisation performance begins to decrease. In certain cases, post-training strategies can be used to reduce the degree of over-fitting: the example of "pruning" decision trees has already been mentioned. It has also been shown that "ensemble methods" may be useful (indeed, "random forests" provide one example of an ensemble method), whereby multiple instances of the same learning algorithm are (over-)trained, from different randomised starting points, and their outputs are then somehow averaged (e.g. Dietterich, 2000). Because each instance becomes sensitive to distinct aspects of the training data, due to their different initialisation conditions, each performs well on a subset of examples, and poorly on the remainder. However, the overall performance of the ensemble (or "committee") as a whole is typically better than that of any individual member.
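
A minimal sketch of this idea, using several small neural networks from scikit-learn trained from different random initialisations and then averaged; the network sizes and the synthetic regression problem are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical regression problem: noisy samples of a smooth function.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Train several instances of the same network from different random
# starting points; each will fit the noise in a different way.
ensemble = [MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000,
                         random_state=seed).fit(X, y)
            for seed in range(10)]

# The committee prediction is the average of the individual predictions.
X_new = np.linspace(-3, 3, 50).reshape(-1, 1)
y_pred = np.mean([member.predict(X_new) for member in ensemble], axis=0)
```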

Another strategy that is adopted is to add random noise to the training data. The rationale here is that training examples typically exhibit an underlying pattern or signal of interest, overprinted by other processes and observational errors. In order to learn the structure of interest, we wish to desensitise our training procedure to the effects of this overprinting. If we can define a "noise model" that approximates the statistical features of the unwanted signal, adding random realisations of this to each training example allows us to limit the extent to which the algorithm learns to rely on such features: by making their appearance random, we make them less "useful" to the algorithm. Returning again to the example of valley classification, local variations in erosion and human intervention might be modelled as correlated Gaussian noise on each topographic measurement. During training, each example is used multiple times, with different noise on each occasion; in theory, this results in only the gross features of the valley profile being taken into account for classification purposes. However, it may be challenging to identify and construct appropriate noise models in many realistic cases.
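
A sketch of this strategy, assuming each training profile is a vector of topographic measurements and that the overprinting can be approximated by correlated Gaussian noise; the covariance model, length scale, and amplitude below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_points = 50                                  # measurements per valley profile
profiles = rng.normal(size=(200, n_points))    # hypothetical training profiles

# Simple correlated noise model: Gaussian covariance with a length scale of a
# few sample points, plus a small diagonal term for numerical stability.
idx = np.arange(n_points)
cov = 0.05 * np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2 * 3.0 ** 2))
cov += 1e-6 * np.eye(n_points)

def noisy_copy(profile):
    """Return the profile with a fresh random realisation of the noise model."""
    return profile + rng.multivariate_normal(np.zeros(n_points), cov)

# Each example is presented several times during training, with different
# noise on each occasion.
augmented = np.array([noisy_copy(p) for p in profiles for _ in range(5)])
```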

It is worth noting here that a similar strategy may prove useful in other cases where it is desirable to desensitise systems to particular aspects of the data. For example, spatial observations are typically reported on a grid, aligned with geographic coordinates. However, natural phenomena typically do not display any such alignment, and orientation information may be irrelevant in many cases. If 2-D spatial information is used in a particular case, it may be desirable to make use of multiple, randomly rotated copies of each training set example. This allows an effectively larger training set to be created, and reduces the chance that features are treated differently depending on their alignment.
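
For gridded 2-D data such as a DEM patch, rotated copies can be generated with standard image-processing routines; a minimal sketch using scipy, in which the patch itself is hypothetical.

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)
patch = rng.normal(size=(64, 64))      # hypothetical DEM patch

# Several randomly rotated copies of the patch; reshape=False keeps the
# output on the same 64 x 64 grid, cropping the rotated corners.
rotated_copies = [rotate(patch, angle=rng.uniform(0, 360),
                         reshape=False, order=1, mode="nearest")
                  for _ in range(8)]
```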

4.4 Operational considerations

As with any analysis technique, results obtained using learning algorithms can be misleading if not treated carefully. This is especially true where a technique invites treatment as a "black box", and where the mechanism by which it operates is not always easily understood. The great strength of artificial intelligence is that it enables a computer to mimic the experienced researcher – but this is also a potential drawback, tending to distance the researcher from their data. In some sense, this is an inevitable consequence of the ever-increasing quantity of data available to researchers – but there is a risk that it leads to subtleties being missed, or results interpreted wrongly due to misapprehensions surrounding computational processing. To minimise the risk of these issues arising, it is important that researchers develop heuristics and procedures that enable "intelligent" systems to be monitored. For example, users of automated data classification systems should monitor the statistical distributions of classification outputs, and investigate any deviations from the norm; it is also desirable to spot-check classifications. This is particularly true in settings where there is a risk that new examples lie outside the scope of the training set – perhaps where data are drawn from new geographic regions, or were collected at a different time of year.
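
One possible monitoring heuristic, among many, is to compare the distribution of classes assigned to a new batch of data against a reference distribution established during development, flagging cases where they diverge; a sketch using a chi-squared test from scipy, with hypothetical class counts.

```python
import numpy as np
from scipy.stats import chisquare

# Class fractions observed during development (reference) and the class
# counts produced by the system on a new batch of data (hypothetical).
reference_fractions = np.array([0.70, 0.20, 0.10])
new_counts = np.array([520, 310, 170])

# Expected counts if the new batch followed the reference distribution.
expected = reference_fractions * new_counts.sum()
stat, p_value = chisquare(f_obs=new_counts, f_exp=expected)

if p_value < 0.01:
    print("Output class distribution differs from the reference -"
          " spot-check a sample of classifications.")
```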

Learning algorithms have immense potential in enabling exploration of large, complex data sets and in automating complicated tasks that would otherwise have to be done manually. However, developing a learning algorithm system – especially one targeted at classification, regression, or interpolation – can also be time-consuming and resource-intensive. Computational demands may also be significant: although many applications require no more than a standard laptop or desktop computer, some may only be viable with access to large-scale parallel computing resources. By way of an illustration for the more computationally intensive end of the spectrum: training the learning algorithm used to catalogue seamounts in Valentine et al. (2013) currently requires a few hundred CPU hours on a 1.9 GHz machine, in addition to considerable computational demands for processing the raw bathymetric data. Thus, learning algorithms do not currently present an economic solution to all problems – although this balance will undoubtedly change as technology and algorithms continue to evolve.

5 Outlook

In this paper, we have attempted to provide a general survey of the field of learning algorithms, and how they might be useful to researchers working in the fields of geomorphometry and Earth surface dynamics. These fields benefit from extensive, large-scale, feature-rich data sets, bringing opportunities and challenges in equal measure. Although currently dominated by a few specific topics, such as landslide hazard assessment, the use of artificial intelligence to help explore, process, and interpret geomorphological data is almost certain to be an increasingly significant aspect of research in coming years.

An increased use of learning algorithms in geomorphological communities is likely to require developments in computational infrastructure. There are obvious requirements for access to appropriate hardware and software resources, as well as skill development. In particular, larger problems or more complex algorithms may make the use of parallel computing and other high-performance computing techniques essential. In addition, in light of the potentially substantial development cost of implementing some of the more complex learning algorithms, it is worth trying to plan for a flexible implementation. Most of these approaches can, in principle, be implemented in a fairly general framework, allowing the same underlying algorithm to be applied to many problems.

The computational implications of both parallelisation and generalisation are beyond the scope of this review, but one area of particular relevance to the geomorphology community concerns data input and output, and hence data format: the ability to reuse an algorithm across multiple data types is a key element of flexibility in a field with such a diversity of measurements. The ability to handle large file sizes and support efficient data access, potentially in parallel, is also an important consideration. As these techniques develop, this places increasing importance on the development and use of robust community standards for data formats. File frameworks, which allow the development of multiple specialist file formats all adhering to a common set of rules, may be particularly valuable in combining consistency from an algorithmic point of view with flexibility to accommodate varied data.

However, learning algorithms are not a panacea. "Traditional" approaches to data analysis will remain important for the foreseeable future, and are well suited to tackling many problems; we do not advocate their wholesale replacement. In addition, it is important that the use of more advanced computational techniques is not allowed to become a barrier between researchers and data; it is almost certain that the nature and practice of "research" will need to evolve to accommodate increasing use of these technologies. Some interesting perspectives on these issues may be found in – for example – an issue of Nature dealing with "2020 computing" (e.g. Muggleton, 2006; Szalay and Gray, 2006), and in work aimed at developing a "robot scientist" (e.g. King et al., 2009). Nevertheless, artificial intelligence opens up many new possibilities within the fields of geomorphology and Earth surface processes – particularly given the common scenarios of large data sets; complex, interacting processes; and/or natural variability that make many geomorphological situations difficult to classify or model using conventional techniques. Learning algorithms such as those discussed here thus represent a powerful set of tools with the potential to significantly expand our capabilities across many branches of the field.

Acknowledgements. We are grateful to the associate editor, John Hillier, and to Niels Anders, J. J. Becker, Ian Evans, and Evan Goldstein for reviews and comments on the initial draft of this manuscript. We also thank Jeannot Trampert for numerous useful discussions. A. P. Valentine is supported by the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013)/ERC grant agreement no. 320639.

Edited by: J. K. Hillier

References

Baeza, C. and Corominas, J.: Assessment of shallow landslide susceptibility by means of multivariate statistical techniques, Earth Surf. Proc. Land., 26, 1251–1263, 2001.


Bayes, T.: An essay towards solving a problem in the doctrine of chances, Philos. T. R. Soc. A, 53, 370–418, 1763.
Beechie, T. and Imaki, H.: Predicting natural channel patterns based on landscape and geomorphic controls in the Columbia River basin, USA, Water Resour. Res., 50, 39–57, 2014.
Belluco, E., Camuffo, M., Ferrari, S., Modenese, L., Silvestri, S., Marani, A., and Marani, M.: Mapping salt-marsh vegetation by multispectral and hyperspectral remote sensing, Remote Sens. Environ., 105, 54–67, 2006.
Bhattacharya, B., Price, R., and Solomatine, D.: Machine learning approach to modeling sediment transport, J. Hydraul. Eng.-ASCE, 133, 440–450, 2007.
Bischof, H., Schneider, W., and Pinz, A.: Multispectral classification of Landsat-images using neural networks, IEEE T. Geosci. Remote S., 30, 482–490, 1992.
Bishop, C.: Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, 1995.
Bishop, C.: Pattern Recognition and Machine Learning, Springer, New York, USA, 2006.
Breiman, L.: Random Forests, Mach. Learn., 45, 5–32, 2001.
Brenning, A.: Spatial prediction models for landslide hazards: review, comparison and evaluation, Nat. Hazards Earth Syst. Sci., 5, 853–862, doi:10.5194/nhess-5-853-2005, 2005.
Cortes, C. and Vapnik, V.: Support-Vector Networks, Mach. Learn., 20, 273–297, 1995.
Cuadrado, D. and Perillo, G.: Principal component analysis applied to geomorphologic evolution, Estuar. Coast. Shelf S., 44, 411–419, 1997.
Das, I., Stein, A., Kerle, N., and Dadhwal, V. K.: Landslide susceptibility mapping along road corridors in the Indian Himalayas using Bayesian logistic regression models, Geomorphology, 179, 116–125, 2012.
Dietterich, T.: Ensemble methods in machine learning, in: Multiple Classifier Systems, Lecture Notes in Computer Science, 1857, edited by: Kittler, J. and Roli, F., Springer-Verlag, Berlin, Germany, 1–15, 2000.
Dunning, S., Massey, C., and Rosser, N.: Structural and geomorphological features of landslides in the Bhutan Himalaya derived from terrestrial laser scanning, Geomorphology, 103, 17–29, 2009.
Ehsani, A. and Quiel, F.: Geomorphometric feature analysis using morphometric parameterization and artificial neural networks, Geomorphology, 99, 1–12, 2008.
Ermini, L., Catani, F., and Casagli, N.: Artificial neural networks applied to landslide susceptibility assessment, Geomorphology, 66, 327–343, 2005.
Fawcett, T.: An introduction to ROC analysis, Pattern Recogn. Lett., 27, 861–874, 2006.
Friedel, M. J.: Modeling hydrologic and geomorphic hazards across post-fire landscapes using a self-organizing map approach, Environ. Modell. Soft., 26, 1660–1674, 2011.
Griffiths, G.: Stochastic Prediction in Geomorphology Using Bayesian Inference Models, Math. Geol., 14, 65–75, 1982.
Gutierrez, B. T., Plant, N. G., and Thieler, E. R.: A Bayesian network to predict coastal vulnerability to sea level rise, J. Geophys. Res.-Earth, 116, F02009, doi:10.1029/2010JF001891, 2011.
Guyon, I. and Elisseeff, A.: An introduction to variable and feature selection, J. Mach. Learn. Res., 3, 1157–1182, 2003.
Hartigan, J. and Wong, M.: A K-Means Clustering Algorithm, J. Roy. Stat. Soc. C-App., 28, 100–108, 1979.
Hillier, J., Conway, S., and Sofia, G.: Perspective – Synthetic DEMs: A vital underpinning for the quantitative future of landform analysis?, Earth Surf. Dynam., 3, 587–598, 2015.
Hinton, G. and Salakhutdinov, R.: Reducing the Dimensionality of Data with Neural Networks, Science, 313, 504–507, 2006.
Hornik, K.: Approximation capabilities of multilayer feedforward networks, Neural Networks, 4, 251–257, 1991.
Jain, A.: Data clustering: 50 years beyond K-means, Pattern Recogn. Lett., 31, 651–666, 2010.
Jasiewicz, J. and Stepinski, T.: Geomorphons – a pattern recognition approach to classification and mapping of landforms, Geomorphology, 182, 147–156, 2013.
Jordan, M. and Mitchell, T.: Machine learning: Trends, perspectives, and prospects, Science, 349, 255–260, 2015.
King, R., Rowland, J., Aubrey, W., Liakata, M., Markham, M., Soldatova, L., Whelan, K., Clare, A., Young, M., Sparkes, A., Oliver, S., and Pir, P.: The robot scientist Adam, Computer, 42, 46–54, 2009.
Kohonen, T.: The Self-Organizing Map, Proceedings of the IEEE, 78, 1464–1480, 1990.
Krasnopolsky, V. and Schiller, H.: Some neural network applications in environmental sciences. Part I: forward and inverse problems in geophysical remote measurements, Neural Networks, 16, 321–334, 2003.
Lee, S., Choi, J., and Min, K.: Landslide susceptibility analysis and verification using the Bayesian probability model, Environ. Geol., 43, 120–131, 2002.
Li, J., Heap, A., Potter, A., and Daniell, J.: Application of machine learning methods to spatial interpolation of environmental variables, Environ. Modell. Soft., 26, 1647–1659, 2011.
Lippmann, R.: Pattern classification using neural networks, IEEE Commun. Mag., 27, 47–50, 1989.
Lloyd, S.: Least squares quantization in PCM, IEEE T. Inform. Theory, 28, 129–137, 1982.
Mackay, D.: Information Theory, Inference and Learning Algorithms, Cambridge University Press, Cambridge, UK, 2003.
Marjanovic, M., Kovacevic, M., Bajat, B., and Voženílek, V.: Landslide susceptibility assessment using SVM machine learning algorithm, Eng. Geol., 123, 225–234, 2011.
Markou, M. and Singh, S.: Novelty detection: a review – part 2: neural network based approaches, Signal Process., 83, 2499–2521, 2003.
Marsh, I. and Brown, C.: Neural network classification of multibeam backscatter and bathymetry data from Stanton Bank (Area IV), Appl. Acoust., 70, 1269–1276, 2009.
Marsland, S.: Novelty detection in learning systems, Neural Computing Surveys, 3, 1–39, 2002.
Martin, K., Wood, W., and Becker, J.: A global prediction of seafloor sediment porosity using machine learning, Geophys. Res. Lett., 42, 10640–10646, 2015.
Mas, J. and Flores, J.: The application of artificial neural networks to the analysis of remotely sensed data, Int. J. Remote Sens., 29, 617–664, 2008.
Matías, K., Ordóñez, C., Taboada, J., and Rivas, T.: Functional support vector machines and generalized linear models for glacier geomorphology analysis, Int. J. Comput. Math., 86, 275–285, 2009.


Miliaresis, G. and Kokkas, N.: Segmentation and object-based classification for the extraction of the building class from LIDAR DEMs, Comput. Geosci., 33, 1076–1087, 2007.
Mondini, A. C., Marchesini, I., Rossi, M., Chang, K.-T., Pasquariello, G., and Guzzetti, F.: Bayesian framework for mapping and classifying shallow landslides exploiting remote sensing and topographic data, Geomorphology, 201, 135–147, 2013.
Muggleton, S.: Exceeding human limits, Nature, 440, 409–410, 2006.
Olden, J., Lawler, J., and Poff, N.: Machine Learning Methods Without Tears: A Primer for Ecologists, Q. Rev. Biol., 83, 171–193, 2008.
Pearson, K.: On Lines and Planes of Closest Fit to Systems of Points in Space, Philos. Mag., 2, 559–572, 1901.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E.: Scikit-learn: Machine Learning in Python, J. Mach. Learn. Res., 12, 2825–2830, 2011.
Peng, L., Niu, R., Huang, B., Wu, X., Zhao, Y., and Ye, R.: Landslide susceptibility mapping based on rough set theory and support vector machines: A case of the Three Gorges area, China, Geomorphology, 204, 287–301, 2014.
Quinlan, J.: Induction of Decision Trees, Mach. Learn., 1, 81–106, 1986.
Quinlan, J.: C4.5: Programs for Machine Learning, Morgan Kaufmann, San Francisco, CA, USA, 1993.
Rumelhart, D., Hinton, G., and Williams, R.: Learning representations by back-propagating errors, Nature, 323, 533–536, 1986.
Sambridge, M. and Mosegaard, K.: Monte Carlo methods in geophysical inverse problems, Rev. Geophys., 40, 1–29, 2002.
Sammon, J.: A Nonlinear Mapping for Data Structure Analysis, IEEE T. Comput., C-18, 401–409, 1969.
Schaul, T., Bayer, J., Wierstra, D., Sun, Y., Felder, M., Sehnke, F., Rückstieß, T., and Schmidhuber, J.: PyBrain, J. Mach. Learn. Res., 11, 743–746, 2010.
Schmelter, M., Hooten, M., and Stevens, D.: Bayesian sediment transport model for unisize bed load, Water Resour. Res., 47, W11514, doi:10.1029/2011WR010754, 2011.
Sivia, D.: Data analysis: A Bayesian tutorial, Oxford University Press, Oxford, UK, 1996.
Smith, M., Anders, N., and Keesstra, S.: CLustre: semi-automated lineament clustering for paleo-glacial reconstruction, Earth Surf. Proc. Land., 41, 364–377, 2016.
Szalay, A. and Gray, J.: Science in an exponential world, Nature, 440, 413–414, 2006.
Tamene, L., Park, S., Dikau, R., and Vlek, P.: Analysis of factors determining sediment yield variability in the highlands of northern Ethiopia, Geomorphology, 76, 76–91, 2006.
Valentine, A., Kalnins, L., and Trampert, J.: Discovery and analysis of topographic features using learning algorithms: A seamount case-study, Geophys. Res. Lett., 40, 3048–3054, 2013.
Wu, X., Kumar, V., Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G., Ng, A., Liu, B., Yu, P., Zhou, Z.-H., Steinbach, M., Hand, D., and Steinberg, D.: Top 10 algorithms in data mining, Knowl. Inf. Syst., 14, 1–37, 2008.
Yao, X., Tham, L., and Dai, F.: Landslide susceptibility mapping based on Support Vector Machine: A case study on natural slopes of Hong Kong, China, Geomorphology, 101, 572–582, 2008.
