Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Mark Schwabacher

NASA Ames Research Center

Computational Sciences Division

[email protected]

http://ic-www.arc.nasa.gov/people/schwabacher/

Joint work with Pat Langley and Jeff Shrager (ISLE) and Chris Potter, Steve Klooster, Lisy Torregrosa, and Vanessa Brooks (NASA Earth Science)

Outline

• Description of Earth science problem

• Choice of representation and algorithm

• Results

• Visualizations

• Discovery of an error in the data

• Future Work

Earth Science Problem

• The Normalized Difference Vegetation Index (NDVI) is a measure of vegetation across the globe derived from satellite data

• NDVI is used in various Earth-science models

• Unfortunately, NDVI is available only for the years since 1983, when a satellite carrying the necessary sensors was launched

• We would like to predict NDVI at a point on the globe from ground-based climate variables representing temperature, precipitation, and moisture

Choice of Representation

For scientific applications, the learned models should be

• Understandable

• Communicable

Representation used by scientists

Our Earth Science collaborators had built the following model with an “if” statement to select between two linear models, one for warmer locations and one for cooler locations:

if GDD < 3000 then

    ln(NDVI) = 0.715 ln(GDD) + 0.377 ln(PPT) - 0.448

if GDD >= 3000 then

    NDVI = 189.89 AMI + 44.02 ln(PPT) + 227.99
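The two-branch model above can be written as a single function. This is a minimal sketch: the function name `expert_ndvi` is hypothetical, and the slides do not state units for GDD, PPT, or AMI, so the inputs are treated as plain numbers. The cooler branch is linear in log space, so the sketch exponentiates to return NDVI itself.

```python
import math


def expert_ndvi(gdd: float, ppt: float, ami: float) -> float:
    """Evaluate the collaborators' two-branch model (requires gdd > 0, ppt > 0)."""
    if gdd < 3000:
        # Cooler locations: the model predicts ln(NDVI), so exponentiate.
        log_ndvi = 0.715 * math.log(gdd) + 0.377 * math.log(ppt) - 0.448
        return math.exp(log_ndvi)
    # Warmer locations: linear in AMI and ln(PPT).
    return 189.89 * ami + 44.02 * math.log(ppt) + 227.99
```

A call such as `expert_ndvi(1000.0, 100.0, 0.5)` takes the cooler branch; any GDD of 3000 or more takes the warmer one.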

Choice of Algorithm

• We selected regression rules as a generalization of the Earth scientists’ representation

• We selected Cubist to learn them (http://www.rulequest.com)

First Results

Method                                        r       # rules
Cubist                                        0.91    41
Linear regression with expert-selected cut    0.86    2

Cubist achieved a higher r, but its 41-rule model was hard to understand.
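The slides do not define r. Assuming it is the Pearson correlation between predicted and observed NDVI, it could be computed as in the sketch below. The function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / math.sqrt(var_p * var_o)
```

Under this reading, r is insensitive to a constant offset or rescaling of the predictions, so it measures how well the model tracks the pattern of NDVI rather than its absolute level.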

Varying the Cubist minimum rule cover parameter

2-rule Cubist model

if PPT <= 25.457 then

NDVI = -3.225 + 7.07 PPT + 0.0521 CDD - 84 AMI + 0.4 ln(PPT) + 0.0001 GDD

if PPT > 25.457 then

NDVI = 386.3 + 316 AMI + 0.0294 GDD - 0.99 PPT + 0.2 ln(PPT)
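The 2-rule model can be encoded as a list of regression rules, each pairing a condition with a linear model. This is a sketch, not Cubist's own format or API: the names `RegressionRule`, `RULES`, and `predict` are hypothetical, and only the coefficients come from the slide. The two conditions are mutually exclusive, so returning the first matching rule is enough here.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]


@dataclass
class RegressionRule:
    name: str
    condition: Callable[[Features], bool]  # when the rule applies
    model: Callable[[Features], float]     # linear model used when it applies


RULES: List[RegressionRule] = [
    RegressionRule(
        "PPT <= 25.457",
        lambda x: x["PPT"] <= 25.457,
        lambda x: (-3.225 + 7.07 * x["PPT"] + 0.0521 * x["CDD"] - 84.0 * x["AMI"]
                   + 0.4 * math.log(x["PPT"]) + 0.0001 * x["GDD"]),
    ),
    RegressionRule(
        "PPT > 25.457",
        lambda x: x["PPT"] > 25.457,
        lambda x: (386.3 + 316.0 * x["AMI"] + 0.0294 * x["GDD"]
                   - 0.99 * x["PPT"] + 0.2 * math.log(x["PPT"])),
    ),
]


def predict(rules: List[RegressionRule], x: Features) -> Tuple[str, float]:
    """Return (rule name, predicted NDVI) for the first rule that covers x."""
    for rule in rules:
        if rule.condition(x):
            return rule.name, rule.model(x)
    raise ValueError("no rule covers this point")
```

Returning the rule name alongside the prediction keeps track of which rule was active at each point, which is the information a rule-activity map needs.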

Visualization #1: Cubist model in one variable

Visualization #2: Activity of Cubist Rules

Visualization #3: Error of Cubist model
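The rule-activity and error visualizations both reduce to filling a spatial grid, one value per location. The sketch below shows one way to build the two grids before plotting. Everything in it is an assumption for illustration: the names `build_maps`, `render`, and `toy_predict` are hypothetical, the grid is indexed by (row, column) rather than latitude and longitude, and `toy_predict` is a made-up stand-in, not the learned Cubist model.

```python
from typing import Callable, Dict, List, Optional, Tuple

Features = Dict[str, float]
Cell = Tuple[int, int]  # (row, col) index of a grid cell


def build_maps(
    observations: Dict[Cell, Tuple[Features, float]],
    predict: Callable[[Features], Tuple[int, float]],
    n_rows: int,
    n_cols: int,
) -> Tuple[List[List[Optional[int]]], List[List[Optional[float]]]]:
    """Return (activity, error) grids; cells without data stay None."""
    activity: List[List[Optional[int]]] = [[None] * n_cols for _ in range(n_rows)]
    error: List[List[Optional[float]]] = [[None] * n_cols for _ in range(n_rows)]
    for (row, col), (features, observed) in observations.items():
        rule_id, predicted = predict(features)
        activity[row][col] = rule_id           # which rule fired here
        error[row][col] = predicted - observed  # signed prediction error
    return activity, error


def render(grid, fmt=str) -> str:
    """Text rendering of a grid; '.' marks cells with no data."""
    return "\n".join(
        " ".join("." if v is None else fmt(v) for v in row) for row in grid
    )


def toy_predict(x: Features) -> Tuple[int, float]:
    # Made-up two-rule model for illustration only.
    if x["PPT"] <= 25.457:
        return 0, 7.0 * x["PPT"]
    return 1, 400.0
```

In practice the two grids would be passed to a plotting library and drawn as colored maps; the text rendering is only a dependency-free placeholder.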

Testing the model across years

• We trained Cubist using one year’s data

• We tested the resulting model on other years’ data

• If the model transfers to other years, it is useful to Earth scientists

• If it sometimes fails to transfer, the failure could point to a scientific discovery
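The train-on-one-year, test-on-others procedure can be sketched as follows. The learner here is a one-variable least-squares line used as a stand-in for Cubist, the score is root-mean-square error (an assumption; the slides do not say which score was used for this test), and the names `fit_line` and `cross_year_rmse` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

YearData = Tuple[List[float], List[float]]  # (predictor values, observed NDVI)


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_year_rmse(data: Dict[int, YearData], train_year: int) -> Dict[int, float]:
    """Train on one year, then report RMSE on every other year."""
    slope, intercept = fit_line(*data[train_year])
    scores: Dict[int, float] = {}
    for year, (xs, ys) in data.items():
        if year == train_year:
            continue
        sq_err = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
        scores[year] = math.sqrt(sum(sq_err) / len(sq_err))
    return scores
```

A year whose score is far worse than the others is the one to inspect: either the relationship genuinely changed or something is wrong with that year's data.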

Discovery of an error in the data

[Figure panels: "Cross-validate 1985" and "Train 1984, test 1985"]

Related Work

• Regression trees: Breiman et al.’s CART (1984)

• Classification applied to Earth science: Brodley & Friedl (1999); Ester, Kriegel, & Xu (1996)

• Visualizing classes on map: Brodley & Friedl (1999); Smyth, Ghil, & Ide (1999)

• Detecting and correcting faulty class labels in data: John (1995); Brodley & Friedl (1999)

• Detecting and correcting calibration problems in remote-sensing systems using predefined model: Chen (1997)

Future Work

• Cubist/NDVI work

– Incorporate time explicitly

– Include other variables (e.g., elevation)

– Test understandability

• Other work

– Improve CASA model (next talk)

– Implement an interactive system that lets scientists direct high-level search for improved ecosystem models

Lessons Learned

We’ve identified three problems that arise in scientific applications of ML and proposed initial solutions:

• Communicability: Use the same representation as the scientists.

• Understandability: When using spatial data, spatially visualize the model’s errors and the activity of its components.

• Quantitative errors: When using time-series data, test a model trained on one time period against data from other time periods to identify quantitative errors.
