Machine Learning for Scientific Applications
David Lary (http://davidlary.info)


DESCRIPTION

Machine Learning for Scientific Applications at the European Space Agency Summer School on Remote Sensing

TRANSCRIPT

Page 1: Machine Learning for Scientific Applications

Machine Learning for Scientific Applications

David Lary (http://davidlary.info)

Need: accounting for complex multi-variate context, which is often not fully described by theory

Monday, August 11, 14

Page 2: Machine Learning for Scientific Applications

Long Term Data Sets: Uncertainty, Cross-Calibration,

Data Fusion & Machine Learning

Motivated by Data Assimilation

With examples from Land, Atmosphere & Ocean

Monday, August 11, 14

Page 3: Machine Learning for Scientific Applications

Bias Detection

"Who may discern his errors, ...." Psalm 19:12

7

Monday, August 11, 14

Page 4: Machine Learning for Scientific Applications

Why is it an issue?

• When fusing multiple datasets, bias is often an issue (very relevant for climate variables).

• Data assimilation is a least-squares approach, i.e., a Best Linear Unbiased Estimator (BLUE), and so implicitly assumes the observations are unbiased.

8

Monday, August 11, 14

Page 5: Machine Learning for Scientific Applications

.... runs deeper still

• Instrument teams have a keen sense of faithfully reporting the data as it is, warts and all. They are naturally loath to empirically correct biases; they would like to understand the cause of the bias and data issues theoretically, from first principles.

The Earth system is so complex, with many interacting processes, and the instruments are often also complex, that this is not always possible.

Residual data issues can, and usually do, remain.

• Modelers know that data biases exist, but are very reluctant to make changes to data products.

.... we therefore have a problem of closure.

9

Monday, August 11, 14

Page 6: Machine Learning for Scientific Applications

The problem!

• Biases are ubiquitous, and not all of them can be explained theoretically. Yet we typically need to fuse multiple datasets to construct long-term time series and/or improve global coverage.

• If the biases are not corrected before data fusion, we introduce further problems, such as ...

• spurious trends, leading to the possibility of unsuitable policy decisions.

• when assimilation is involved, the suboptimal use of observations, non-physical structures in the analysis, biases in the assimilated fields, and extrapolation of biases due to multivariate background constraints.

10

Monday, August 11, 14

Page 7: Machine Learning for Scientific Applications

A Further Problem

The instruments whose data we would like to fuse are often not making coincident measurements in time or space.

It is imperative to inter-compare observations in their appropriate context.

11

Monday, August 11, 14

Page 8: Machine Learning for Scientific Applications

Integrate multiple satellite datasets for applications

The comparison above shows the total ozone column observed by EP TOMS and Aura OMI. The higher-resolution coverage that Aura OMI provides is clearly seen. The particular event shown includes a tropopause fold over Texas.

12

Monday, August 11, 14

Page 9: Machine Learning for Scientific Applications

An Example

13

representativeness

Monday, August 11, 14

Page 10: Machine Learning for Scientific Applications

14

Monday, August 11, 14

Page 11: Machine Learning for Scientific Applications

[Figure: Relative frequency distributions of O3 v.m.r. (of order 1e-6) for all years, month 01 (1900 K < theta < 2300 K, -90° < equivalent latitude < -79°), comparing Aura MLS O3 (23), CLAES v9 O3 (207), ISAMS v10 O3 (19), UARS MLS v5 183 GHz O3 (379), UARS MLS v5 205 GHz O3 (490), SAGE 2 v6.2 O3 (21), and SBUV v8 O3 (33); the number of observations is given in parentheses.]

15

Monday, August 11, 14

Page 12: Machine Learning for Scientific Applications

Geophysical Insights

(a) (b)

(c) (d)

Figure 2: N2O equivalent PV latitude - potential temperature cross sections of (a) representativeness uncertainty (v.m.r.), (b) observational uncertainty (v.m.r.), (c) observation (v.m.r.), and (d) analysis uncertainty (v.m.r.). The data used are from the Upper Atmosphere Research Satellite (UARS) Cryogenic Limb Array Etalon Spectrometer (CLAES) version 9 for January 1992.

16

Monday, August 11, 14

Page 13: Machine Learning for Scientific Applications

Bias is Spatially Dependent

[Figure: % bias (UARS MLS v5 183 GHz O3 - HALOE v19 O3) for January of all years, shown as equivalent PV latitude (-75° to 75°) versus potential temperature (250-2500 K) cross sections; colour scale from -30% to +30%.]

17

Monday, August 11, 14

Page 14: Machine Learning for Scientific Applications

So what can we do about this?

.... we do not have a theoretical explanation

18

Monday, August 11, 14

Page 15: Machine Learning for Scientific Applications

Machine Learning: for when our understanding is incomplete

19

... and that is quite often!

Monday, August 11, 14

Page 16: Machine Learning for Scientific Applications

What is Machine Learning?

• Machine learning is a sub-field of artificial intelligence that is concerned with the design and development of algorithms that allow computers to learn the behavior of data sets empirically.

• A major focus of machine-learning research is to produce (induce) empirical models from data automatically.

• This approach is usually used in the absence of adequate and complete theoretical models, which would be more desirable conceptually.

20

Monday, August 11, 14

Page 17: Machine Learning for Scientific Applications

What is Machine Learning?

The use of machine learning can actually help us to construct a more complete theoretical model, as it allows us to determine which factors are statistically capable of providing the data mappings we seek, e.g. the multi-variate, non-linear, non-parametric mapping between satellite radiances and a suite of ocean products.

21

Monday, August 11, 14

Page 18: Machine Learning for Scientific Applications

Machine Learning

Is for:

Regression

➡ Multivariate, non-linear, non-parametric

Classification

➡ Supervised and unsupervised

22

Monday, August 11, 14

Page 19: Machine Learning for Scientific Applications

Machine Learning

Comes in Several Flavors, for example:

• Neural Networks

• Support Vector Machines

• Gaussian Process Models

• Decision Trees

• Random Forests

23

Monday, August 11, 14

Page 20: Machine Learning for Scientific Applications

Machine Learning Regression

[Diagram: inputs x1, x2, x3, x4, x5, ..., xn feed a learned mapping that produces the output(s) y.]

y = f(x1, x2, x3, x4, x5, ..., xn)

Multivariate, non-linear, non-parametric; n can be very large

Training Data

Monday, August 11, 14
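Given training data of this form, here is a minimal Python sketch of a multivariate, non-linear, non-parametric regression. It is an illustration only, not anything from the talk: the synthetic data and the choice of a random forest as the regressor are assumptions made purely for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic training data: n inputs, one output (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(5000, 8))            # x1 ... xn, here n = 8
y = np.sin(3 * X[:, 0]) * X[:, 1] + X[:, 2] ** 2 + 0.05 * rng.normal(size=len(X))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A non-parametric, non-linear regressor: no functional form is assumed for f.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("held-out R^2:", model.score(X_test, y_test))
```

The same pattern applies whatever the learner is (neural network, SVM, Gaussian process): supply examples of the inputs and the target output, and let the algorithm induce the mapping empirically.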

Page 21: Machine Learning for Scientific Applications

Machine Learning Supervised Classification

[Diagram: inputs x1, x2, x3, x4, x5, ..., xn together with a labelled output y form the training data for the classifier.]

Multivariate, non-linear, non-parametric; n can be very large

Training Data

Monday, August 11, 14

Page 22: Machine Learning for Scientific Applications

Machine Learning Unsupervised Classification

Multivariate, non-linear, non-parametric; n can be very large

[Diagram: inputs x1, x2, x3, x4, x5, ..., xn with no labelled output; the algorithm groups the training data into classes by itself.]

Training Data

Monday, August 11, 14

Page 23: Machine Learning for Scientific Applications

Neural Networks

In a neural network model, simple nodes (neurons) are connected together to form a network of nodes. Its practical use comes from algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow.

27

Monday, August 11, 14
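To make the idea concrete, here is a minimal Python sketch (an illustration, not the network architecture used in the talk) of a feed-forward neural network regressor whose connection weights are adjusted during training; the data, layer sizes, and activation are arbitrary assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Illustrative data: map two inputs to one output through a non-linear function.
rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, size=(2000, 2))
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] ** 2

# Two hidden layers of simple nodes; training iteratively adjusts the weights.
net = MLPRegressor(hidden_layer_sizes=(32, 32), activation="tanh",
                   max_iter=2000, random_state=1)
net.fit(X, y)

print("training R^2:", net.score(X, y))
```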

Page 24: Machine Learning for Scientific Applications

Support Vector Machines

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression.

Intuitively, an SVM model is a representation of the training examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.

Vladimir Vapnik

28

Monday, August 11, 14
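A minimal sketch of the classification case described above, using scikit-learn's SVC on synthetic two-class data; everything here (the data, the kernel, the parameters) is an illustrative assumption rather than anything from the talk.

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic classes separated by a (noisy) gap.
rng = np.random.default_rng(2)
class_a = rng.normal(loc=[-2.0, 0.0], scale=0.8, size=(200, 2))
class_b = rng.normal(loc=[+2.0, 0.0], scale=0.8, size=(200, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 200 + [1] * 200)

# The SVM places the decision boundary so that the margin (the gap between
# the two categories) is as wide as possible.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```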

Page 25: Machine Learning for Scientific Applications

Gaussian Process Models

Gaussian processes (GPs) (Rasmussen and Williams 2006) fit a multivariate Gaussian probability distribution to any set of regressors, allowing for analytic inference. As a principled Bayesian technique, GPs go beyond SVMs by allowing us to supply a full posterior distribution for our regressors, giving us both mean estimates and an indication of the uncertainty in them.

29

Monday, August 11, 14
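The following Python sketch illustrates the point made above, that a GP returns both a mean prediction and an uncertainty estimate; the synthetic 1-D data, the RBF plus white-noise kernel, and the default hyper-parameter optimisation are all assumptions made for the example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Sparse, noisy observations of an unknown function.
rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0.0, 10.0, size=(25, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=len(X))

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, random_state=3)
gp.fit(X, y)

# The posterior gives a mean estimate and a standard deviation at every point.
X_new = np.linspace(0.0, 10.0, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)
for x, m, s in zip(X_new.ravel(), mean, std):
    print(f"x = {x:4.1f}  mean = {m:+.3f}  std = {s:.3f}")
```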

Page 27: Machine Learning for Scientific Applications

A key issue is training dataset size: the bigger the better!

..... until we run out of memory

31

Monday, August 11, 14

Page 28: Machine Learning for Scientific Applications

Variations in Stratospheric Cly Between 1991 and the present

David Lary, Anne Douglass, Darryn Waugh, Richard Stolarski, Paul Newman, Hamse Mussa

• Data can be biased, maybe as a function of many parameters.

• May be observing a proxy for what we really want to know.

32

Monday, August 11, 14

Page 29: Machine Learning for Scientific Applications

ozone reductions there (SOCOL and E39C), and the model with the largest cold bias in the Antarctic lower stratosphere in spring (LMDZrepro) simulates very low ozone.

CCMs show a large range of ozone trends over the past 25 years (see left panels in Figure 3-26 of Chapter 3) and large differences from observations. Some of these differences may in part be related to differences in the simulated Cly, e.g., E39C and SOCOL show a trend smaller than observed, whereas AMTRAC and UMETRAC show a trend larger than observed in extrapolar area-weighted mean column ozone. However, other factors also contribute, e.g., biases in tropospheric ozone (Austin and Wilson, 2006).

The CCM evaluation discussed above and in Eyring et al. (2006) has guided the level of confidence we place on each model simulation. The CCMs vary in their skill in representing different processes and characteristics of the atmosphere. Because the focus here is on ozone recovery due to declining ODSs, we place importance on the models' ability to correctly simulate stratospheric Cly as well as the representation of transport characteristics and polar temperatures. Therefore, more credence is given to those models that realistically simulate these processes. Figure 6-7 shows a subset of the diagnostics used to evaluate these processes, and the CCMs shown with solid curves in Figures 6-7, 6-8, 6-10 and 6-12 to 6-14 are those that are in good agreement with the observations in Figure 6-7. However, these line styles should not be over-interpreted, as both the ability of the CCMs to represent these processes and the relative importance of Cly, temperature, and transport vary between different regions and altitudes. Also, analyses of model dynamics in the Arctic, and differences in the chlorine budget/partitioning in these models, when available, might change this evaluation for some regions and altitudes.

Figure 6-8. October zonal mean values of total inorganic chlorine (Cly in ppb) at 50 hPa and 80°S from CCMs. Panel (a) shows Cly and panel (b) the difference in Cly from that in 1980. The symbols in (a) show estimates of Cly in the Antarctic lower stratosphere in spring from measurements from the UARS satellite in 1992 and the Aura satellite in 2005, yielding values around 3 ppb (Douglass et al., 1995; Santee et al., 1996) and around 3.3 ppb (see Figure 4-8), respectively.

[Figure panels: Cly (ppbv) and Cly - Cly(1980) (ppbv) versus year, at 50 hPa, 80°S, in October.]

33

Monday, August 11, 14

Page 30: Machine Learning for Scientific Applications


A large range of Cly in the model simulations

Constrained by a limited number of Cly observations

33

Monday, August 11, 14

Page 31: Machine Learning for Scientific Applications

• We need to know the distribution of inorganic chlorine (Cly) in the stratosphere to:

• Attribute changes in stratospheric ozone to changes in halogens.

• Assess the realism of chemistry-climate models.

34

Monday, August 11, 14

Page 32: Machine Learning for Scientific Applications

Cly = HCl + ClONO2 + ClO + HOCl + 2Cl2O2 + 2Cl2

Long time-series

Sporadic

Long time-series

Since 2004

Estimating Cly is hampered by lack of observations

Estimating Cly is hampered by inter-instrument biases

35

Monday, August 11, 14

Page 33: Machine Learning for Scientific Applications

Using PDFs for Bias Detection

[Figure: Relative frequency distributions of HCl v.m.r. for 2005/01 (460 K < theta < 590 K, 49° < equivalent latitude < 61°), comparing ACE v2.2 HCl (75), Aura MLS HCl (1544), and HALOE v19 HCl (101); the number of observations is given in parentheses. Annotation: HALOE - Aura HCl offset.]

http://www.pdfcentral.info/

If we now repeat this globally for all periods of overlap

36

Monday, August 11, 14
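As a rough illustration of the PDF-based approach (not the actual pdfcentral.info machinery), the Python sketch below builds relative-frequency distributions for two synthetic "instruments" observing the same quantity and reports the apparent offset between them; the data, bins, and offset are assumptions made for the example.

```python
import numpy as np

# Two synthetic instruments measuring the same quantity, one biased high.
rng = np.random.default_rng(4)
instrument_a = rng.normal(loc=1.50e-9, scale=0.12e-9, size=1500)   # v.m.r.
instrument_b = rng.normal(loc=1.62e-9, scale=0.12e-9, size=900)    # v.m.r.

# Relative-frequency PDFs on a common set of bins.
bins = np.linspace(0.8e-9, 2.2e-9, 60)
pdf_a, _ = np.histogram(instrument_a, bins=bins, density=True)
pdf_b, _ = np.histogram(instrument_b, bins=bins, density=True)

centres = 0.5 * (bins[:-1] + bins[1:])
mode_a = centres[np.argmax(pdf_a)]
mode_b = centres[np.argmax(pdf_b)]
print(f"distribution modes: {mode_a:.3e} vs {mode_b:.3e}")
print(f"apparent inter-instrument offset: {mode_b - mode_a:.3e} v.m.r.")
```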

Page 34: Machine Learning for Scientific Applications

[Scatter plot: ATMOS HCl (ppbv) versus HALOE HCl (ppbv), 0-4 ppbv, showing the data, the 1:1 line, and a weighted fit. Slope = 1.05, intercept = 0.23 ppbv.]

HCl Inter-comparisons

37

Monday, August 11, 14
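The slopes and intercepts on these inter-comparison slides come from weighted straight-line fits of one instrument against another. A minimal sketch of one way to do such a fit (with hypothetical coincident data and per-point uncertainties used as weights) is:

```python
import numpy as np

# Hypothetical coincident measurements from two instruments, with 1-sigma errors.
rng = np.random.default_rng(5)
haloe = rng.uniform(0.5, 3.5, size=300)                    # ppbv
other = 1.05 * haloe + 0.23 + rng.normal(0.0, 0.1, 300)    # ppbv, biased
sigma = np.full_like(other, 0.1)

# Weighted least-squares straight-line fit: weight each point by 1/sigma.
slope, intercept = np.polyfit(haloe, other, deg=1, w=1.0 / sigma)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f} ppbv")
```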

Page 35: Machine Learning for Scientific Applications

[Scatter plot: ACE v2.2 HCl (ppbv) versus HALOE HCl (ppbv), 0-4 ppbv, showing the data, the 1:1 line, and a weighted fit. Slope = 1.18, intercept = -0.050 ppbv.]

HCl Inter-comparisons

37

Monday, August 11, 14

Page 36: Machine Learning for Scientific Applications

[Scatter plot: MLS HCl (ppbv) versus HALOE HCl (ppbv), 0-3 ppbv, showing the data, the 1:1 line, and a weighted fit. Slope = 1.09, intercept = 0.070 ppbv.]

HCl Inter-comparisons

37

Monday, August 11, 14

Page 37: Machine Learning for Scientific Applications

[Scatter plots: MLS HCl (ppbv) versus HALOE HCl (ppbv) before and after neural network adjustment, each showing the data, the 1:1 line, and a weighted fit. Before: slope = 1.09, intercept = 0.070 ppbv. After NN adjustment: slope = 0.995, intercept = 0.0093 ppbv.]

HCl Inter-comparisons

37

Monday, August 11, 14

Page 38: Machine Learning for Scientific Applications

Neurological algorithms

[Diagram: Inputs → Process → Outputs]

38

Monday, August 11, 14

Page 39: Machine Learning for Scientific Applications

An example neural network

[Diagram: Inputs → Process → Outputs]

39

Monday, August 11, 14

Page 40: Machine Learning for Scientific Applications

An example neural network

[Diagram: Inputs → Process → Outputs]

39

Objective design of neural networks using genetic algorithms

Monday, August 11, 14
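The slide above mentions using genetic algorithms to design neural networks objectively. As a heavily simplified sketch of that idea (selection and mutation only, no crossover, toy data, cross-validated skill as the fitness; none of this is the actual procedure used in the talk), one could evolve the hidden-layer sizes of a network as follows:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# Toy data standing in for a real training set.
rng = np.random.default_rng(6)
X = rng.uniform(-1.0, 1.0, size=(800, 4))
y = np.sin(2 * X[:, 0]) + X[:, 1] * X[:, 2]

def fitness(layers):
    """Cross-validated skill of a candidate architecture."""
    net = MLPRegressor(hidden_layer_sizes=tuple(layers), max_iter=1500, random_state=0)
    return cross_val_score(net, X, y, cv=3).mean()

# Evolve a small population of architectures: keep the fittest, mutate the rest.
population = [list(rng.integers(4, 64, size=2)) for _ in range(6)]
for generation in range(5):
    scored = sorted(population, key=fitness, reverse=True)
    best = scored[0]
    population = [best] + [
        [max(2, n + int(rng.integers(-8, 9))) for n in best] for _ in range(5)
    ]
    print(f"generation {generation}: best layers {best}, score {fitness(best):.3f}")
```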

Page 41: Machine Learning for Scientific Applications

An example neural network

40

Monday, August 11, 14

Page 42: Machine Learning for Scientific Applications

Re-calibration using a Neural Network

[Scatter plots of neural network outputs A versus targets T for HCl (v.m.r., of order 1e-9), each showing the data points, the best linear fit, and the A = T line. Training: linear fit A = (0.97)T + (5e-11), R = 0.98739. Validation: linear fit A = (0.98)T + (2.9e-11), R = 0.99232.]

41

Monday, August 11, 14
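A minimal sketch of this kind of neural-network re-calibration (illustrative only: the "instruments", contextual inputs, and data below are hypothetical stand-ins, not the actual HALOE/MLS processing): train a network to map one instrument's measurements, plus context, onto a reference instrument, then check the slope and intercept on held-out validation data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Hypothetical coincident data: instrument B is biased relative to reference A,
# with a bias that depends on context (potential temperature, latitude).
rng = np.random.default_rng(7)
n = 4000
theta = rng.uniform(400.0, 2000.0, n)            # potential temperature (K)
lat = rng.uniform(-80.0, 80.0, n)                # equivalent latitude (deg)
ref_a = rng.uniform(0.5, 3.5, n)                 # reference instrument (ppbv)
inst_b = (ref_a * (1.05 + 0.05 * (theta - 1200.0) / 800.0)
          + 0.07 * np.cos(np.radians(lat)) + rng.normal(0.0, 0.05, n))

X = np.column_stack([inst_b, theta, lat])
X_train, X_val, y_train, y_val = train_test_split(X, ref_a, random_state=7)

# The network learns the context-dependent mapping from instrument B onto A.
recal = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000,
                                   random_state=7))
recal.fit(X_train, y_train)

slope, intercept = np.polyfit(y_val, recal.predict(X_val), deg=1)
print(f"validation: slope = {slope:.3f}, intercept = {intercept:.3f} ppbv")
```

The point of splitting off a validation set, as on this slide and the next, is that the slope near 1 and intercept near 0 are demonstrated on data the network never saw during training.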

Page 43: Machine Learning for Scientific Applications

Re-calibration using a Neural Network

[Same outputs-versus-targets scatter plots as the previous slide, highlighting the validation panel.]

Totally independent validation

41

Monday, August 11, 14

Page 44: Machine Learning for Scientific Applications

Long-term continuity

42

Monday, August 11, 14

Page 45: Machine Learning for Scientific Applications

Long-term continuity

Applied Neural Network Re-calibration to HALOE

42

Monday, August 11, 14

Page 46: Machine Learning for Scientific Applications

[Figure: Monthly average Cly (v.m.r., up to about 4e-9) versus year (1995-2005) at 2° and 61° equivalent latitude, for the 800 K and 525 K levels, with curves for mean ages of air of 2 to 6 years.]

October

Use neural networks to infer Cly from HCl, CH4, ϕpv, and θ.

Long-term continuity for Cly

43

Monday, August 11, 14

Page 47: Machine Learning for Scientific Applications


43

Monday, August 11, 14

Page 48: Machine Learning for Scientific Applications


43

Monday, August 11, 14

Page 49: Machine Learning for Scientific Applications

44

Monday, August 11, 14

Page 50: Machine Learning for Scientific Applications

45

Monday, August 11, 14

Page 51: Machine Learning for Scientific Applications

Other uses of machine learning

• Cross calibration of vegetation indices from AVHRR, MODIS, SPOT and SeaWIFS

• Inferring CO2 fluxes from vegetation indices and surface temperature

• Inferring ocean pigment concentrations and other parameters

• Inferring drought stress and endophyte infection in cacao (cocoa)

• Learning the chaotically tumbling orbit of the Hubble Space Telescope

• Detecting online eBay fraud

• Acceleration of expensive code elements

46

Monday, August 11, 14

Page 52: Machine Learning for Scientific Applications

Another application: dissolved organic carbon

47

Monday, August 11, 14

Page 53: Machine Learning for Scientific Applications

48

Monday, August 11, 14


Page 61: Machine Learning for Scientific Applications

Method used to estimate DOC     R

SeaWiFS bands GP NL             0.99977
MODIS bands GP NL               0.9997
All bands GP NL                 0.99901
UV & SeaWiFS bands GP NL        0.99899
All bands NN                    0.95859
UV & SeaWiFS bands NN           0.94609
MODIS bands NN                  0.92585
SeaWiFS bands NN                0.91653

49

Monday, August 11, 14

Page 62: Machine Learning for Scientific Applications

[Taylor diagram summarising standard deviation, correlation coefficient, and RMSD for the candidate models A-H.]

Gaussian Process Models

50

Monday, August 11, 14

Page 63: Machine Learning for Scientific Applications

Relative Importance of the Inputs

Wavelength     Relative Importance

Rrs490         0.00087123
Rrs555         0.011976
Rrs670         1.5876
Rrs510         9.8423
Rrs443         13.0898
Rrs412         20.2553

The GPM hyper-parameters give an indication of the relative importance of the inputs. For the DOC SeaWiFS bands, the best inputs are those with the smallest values; here they are sorted in order of importance, from most important (top) to least important (bottom).

51

Monday, August 11, 14
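This is essentially automatic relevance determination. As an illustrative sketch (not the actual GPM used for the DOC work), an anisotropic RBF kernel in scikit-learn gives one fitted length scale per input; after training, a smaller length scale means the output varies more quickly with that input, i.e. the input matters more. The data and input names below are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Synthetic stand-in: four inputs, only the first two actually drive the output.
rng = np.random.default_rng(8)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = np.sin(6 * X[:, 0]) + 2.0 * X[:, 1] + 0.01 * rng.normal(size=len(X))

# One length scale per input (automatic-relevance-determination style kernel).
kernel = RBF(length_scale=np.ones(4))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=8)
gp.fit(X, y)

# Smaller fitted length scale => more relevant input (listed most relevant first).
for name, scale in sorted(zip(["x1", "x2", "x3", "x4"], gp.kernel_.length_scale),
                          key=lambda item: item[1]):
    print(f"{name}: fitted length scale = {scale:.3f}")
```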

Page 64: Machine Learning for Scientific Applications

[Scatter plot: salinity versus a412 - a443, with fits overlaid: polynomial (r2 = 0.928), NN (r2 = 0.933), SVM (r2 = 0.933).]

52

Monday, August 11, 14

Page 65: Machine Learning for Scientific Applications

Visibility

Variable   R

Td        -0.29
q         -0.26
T         -0.19
U         -0.18
RH         0.13
SLP        0.05

53

Monday, August 11, 14

Page 66: Machine Learning for Scientific Applications

High Resolution Identification of Dust Sources Using Machine Learning and Remote Sensing Data

Annette Walker and David J. Lary

A42A-08

Monday, August 11, 14

Page 67: Machine Learning for Scientific Applications

NRL High-resolution Dust Source Database

[Imagery: 20030820 NRL DEP over Iran and Pakistan.]

• 10 years of DEP (2 yr MSG/RGB) imagery
• COAMPS 10 m wind overlays
• Surface weather plots
• ENVI (GIS-like software)
• NGDC topographical 10° x 10° tiles
• Overlay 0.25° grid or use Google Earth (GE)

• Dust source area entered into database (cursor location tool = 1 km precision)
• Cross-correlate land and water features using maps, atlases, and Landsat images (detailed topographic, geographic, and geomorphic information, GE)
• Technical and governmental reports

Approach and Methodology

20110630 NRL MSG/RGB

Saudi Arabia

20030820 MODIS True Color

Monday, August 11, 14

Page 68: Machine Learning for Scientific Applications

NRL High-resolution Dust Source Database

20030820 NRL DEP20030820 NRL DEP

Iran

Pakistan

Iran

Pakistan

• 10 years of DEP (2 yr MSG/RGB) imagery• COAMPS 10 m wind overlays • Surface weather plots • ENVI (Gis-like software)• NGDC topographical 10ºX10º tiles• Overlay 0.25º grid or use Google Earth (GE)

• Dust source area entered into database (cursor location tool = 1km precision)• Cross-correlate land and water features using maps, atlases, Landsat images (detailed topographic, geographic, and geomorphic information, GE) • Technical and governmental reports

Approach and Methodology

20110630 NRL MSG/RGB

Saudi Arabia

20030820 MODIS True Color20030820 NRL DEP

Iran

Pakistan

Monday, August 11, 14

Page 69: Machine Learning for Scientific Applications

NRL High-resolution Dust Source Database

Solid red and purple shapes identify dust source areas located using DEP and MSG.

SW Asia DSD East Asia DSD

Mongolia

Saudi Arabia

Monday, August 11, 14

Page 70: Machine Learning for Scientific Applications

Self-Organizing Map

Self-organizing maps (SOMs) are a data visualization and unsupervised classification technique, invented by Professor Teuvo Kohonen (Kohonen 1982; 1990), that reduces the dimensions of data through the use of self-organizing neural networks.

They help us address the issue that humans simply cannot visualize high dimensional data.

Monday, August 11, 14

Page 71: Machine Learning for Scientific Applications

Self-Organizing Map

SOMs reduce dimensionality by producing a map that objectively plots the similarities of the data, grouping similar data items together.

SOMs learn to classify input vectors according to how they are grouped in the input space.

SOMs learn both the distribution and topology of the input vectors they are trained on. This allows SOMs to accomplish two things: reduce dimensions and display similarities.

Monday, August 11, 14
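To show what "grouping similar data items together on a grid" means in practice, here is a toy, from-scratch Python sketch of a SOM; it is not the implementation used for the MODIS classification, and the synthetic 7-band "spectra", grid size, and decay schedules are all assumptions made for the example.

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a toy self-organizing map: a grid of weight vectors is pulled
    towards the data so that nearby grid nodes represent similar inputs."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.uniform(0.0, 1.0, size=(rows, cols, data.shape[1]))
    # Grid coordinates, used by the neighbourhood function.
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Best-matching unit: the node whose weights are closest to x.
            dist = np.linalg.norm(weights - x, axis=2)
            bi, bj = np.unravel_index(np.argmin(dist), dist.shape)
            # Learning rate and neighbourhood radius decay during training.
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 0.5
            # Pull the BMU and its grid neighbours towards the input vector.
            grid_dist2 = (ii - bi) ** 2 + (jj - bj) ** 2
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, :, None] * (x - weights)
            step += 1
    return weights

# Illustrative use: cluster synthetic 7-band "reflectance" spectra.
rng = np.random.default_rng(1)
spectra = np.clip(rng.normal(0.3, 0.15, size=(500, 7)), 0.0, 1.0)
som = train_som(spectra)
# Classify one spectrum by its best-matching map node (its SOM class).
d = np.linalg.norm(som - spectra[0], axis=2)
print("SOM class of first spectrum:", np.unravel_index(np.argmin(d), d.shape))
```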

Page 72: Machine Learning for Scientific Applications

Detecting Dust Sources

Monday, August 11, 14

Page 73: Machine Learning for Scientific Applications

Self Organizing Map Classification

7 Bands

MODIS MCD43C3

bihemispherical reflectance

Monday, August 11, 14

Page 74: Machine Learning for Scientific Applications

All 1000-Classes mapped for North Africa

Monday, August 11, 14

Page 75: Machine Learning for Scientific Applications

Libyan Dust Event: May 9, 2010 (8Z - 12Z)

Jabal al Akhdar (الجبل الأخضر, Al Ǧabal al 'Aḫḍar; English: Green Mountains), a coastal mountain range with heights of 1.0-1.5 km.

Monday, August 11, 14


Page 84: Machine Learning for Scientific Applications

Plumes originate on the leeward side of Al Jabal al Akhdar, where drainage occurs along the slopes.

Corresponding SOM-Classes: 49, 93, 94

Libyan Dust Event: May 9, 2010 (6Z - 8Z)

Jabal al Akhdar (الجبل الأخضر, Al Ǧabal al 'Aḫḍar; English: Green Mountains), a coastal mountain range with heights of 1.0-1.5 km.

Monday, August 11, 14

Page 85: Machine Learning for Scientific Applications

Chad: Bodélé Depression Dust Event: March 16, 2010 (7Z -12Z)

Located at the southern edge of the Sahara Desert in north-central Africa, the Bodélé Depression is the lowest point in Chad. Dust storms from the Bodélé Depression occur on average about 100 days per year. The Bodélé Depression is a single spot in the Sahara that provides most of the mineral dust reaching the Amazon forest.

Monday, August 11, 14


Page 95: Machine Learning for Scientific Applications

Selected SOM Classes

Chad: Bodélé Depression

NRL MSG-RGB 20110109

The source area is not designated in the first pass of the MODIS reflectance and land-surface classification.

1000 SOM Classes

Monday, August 11, 14

Page 96: Machine Learning for Scientific Applications

Selected Classes with Class 137

Chad: Bodélé Depression

NRL MSG-RGB 20110109

Class 137 maps diatom sediment in depression.

1000 SOM Classes

Monday, August 11, 14

Page 97: Machine Learning for Scientific Applications

Selected Classes Without Class 137

Chad: Bodélé Depression

NRL MSG-RGB 20110109

1000 SOM Classes

Class 137 maps diatom sediment in depression.

Monday, August 11, 14

Page 98: Machine Learning for Scientific Applications

Solid black circles/ovals show plume source

Corresponding SOM Classes within open circles/ovals

Northern Sahara: 36, 40, 63, 100 Sahel: 147, 229, 230, 405

West Africa: Feb 2, 2011 13Z

Monday, August 11, 14

Page 99: Machine Learning for Scientific Applications

Selected Classes for North Africa (This involves 40 distinct classes)

Monday, August 11, 14

Page 100: Machine Learning for Scientific Applications

Jan 1, 2006 True Color

Jan 1, 2006 NRL DEP

Sources along New Mexico/Texas border

The North American sources have a different spectral signature from those we saw in SW Asia.

Agricultural areas on the high plains

Blue desert areas

Monday, August 11, 14

Page 101: Machine Learning for Scientific Applications

Sources in Arizona and Colorado

Apr 17, 2006 NRL DEP

Apr 17, 2006 True color

Monday, August 11, 14

Page 102: Machine Learning for Scientific Applications

Selected Classes for North America (n=64)

Monday, August 11, 14

Page 103: Machine Learning for Scientific Applications

All 1000-Classes mapped for South America

Monday, August 11, 14

Page 104: Machine Learning for Scientific Applications

All 1000-Classes mapped for South America

Blue-colored SOM-Classes are concentrated in the Atacama and Salar de Uyuni deserts

White areas are salt flats

Monday, August 11, 14

Page 105: Machine Learning for Scientific Applications

South America: Bolivia and Chile

July 18, 2010 MODIS Terra True Color

Monday, August 11, 14

Page 106: Machine Learning for Scientific Applications

South America: Bolivia and Chile

July 18, 2010 MODIS Terra True Color Selected SOM-Classes in 200s, 300s, and 400s

Monday, August 11, 14

Page 107: Machine Learning for Scientific Applications

• SOMs provide an effective mechanism for automating the identification of dust sources.

• Using SOMs lets us map dust sources globally at high resolution (1-10 km).

• This saved time in finding dust sources when comparing with satellite imagery.

• It can be done in real time, giving a dynamically changing map of dust sources.

Monday, August 11, 14

Page 108: Machine Learning for Scientific Applications

[Diagram labels: Model; Existing; New]

78

Monday, August 11, 14

Page 109: Machine Learning for Scientific Applications

[Diagram labels: Model; Existing; New]

• Personalized Health Care

• Proactive Health Care System

• Business Analytics

• Smart Logistics

• Disaster Response

• Fraud Detection

http://holistics3.com

Monday, August 11, 14

Page 110: Machine Learning for Scientific Applications

Visualization

Decision Support

Machine Learning

Insight & Discovery

Existing
• Social Media
• Socioeconomic, Census
• News feeds
• Environmental
• Weather
• Satellite
• Sensors
• Health
• Economic

New
• Business Analytics 2.0
• UAVs
• Hyper-spectral Imaging
• Smart Dust
• Wearable Sensors
• Autonomous Cars

Simulation
• Global Weather Models
• Economic Models
• Earthquake Models

GigaPop Pipe

TACC

Monday, August 11, 14