Page 1: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Mark Schwabacher

NASA Ames Research Center

Computational Sciences Division

[email protected]

http://ic-www.arc.nasa.gov/people/schwabacher/

Joint work with Pat Langley and Jeff Shrager (ISLE) and Chris Potter, Steve Klooster, Lisy Torregrosa, and Vanessa Brooks (NASA Earth Science)

Page 2: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Outline

• Description of Earth science problem

• Choice of representation and algorithm

• Results

• Visualizations

• Discovery of an error in the data

• Future Work

Page 3: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Earth Science Problem

• The Normalized Difference Vegetation Index (NDVI) is a measure of vegetation across the globe derived from satellite data

• NDVI is used in various Earth-science models

• Unfortunately, NDVI is only available for the years since 1983, when a satellite with these sensors was launched

• We would like to predict NDVI at a point on the globe from ground-based climate variables representing temperature, precipitation, and moisture

Page 4: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Choice of Representation

For scientific applications, the learned models should be

• Understandable

• Communicable

Page 5: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Representation used by scientists

Our Earth Science collaborators had built the following model with an “if” statement to select between two linear models, one for warmer locations and one for cooler locations:

if GDD < 3000 then
    ln(NDVI) = 0.715 ln(GDD) + 0.377 ln(PPT) - 0.448

if GDD >= 3000 then
    NDVI = 189.89 AMI + 44.02 ln(PPT) + 227.99
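The two-branch model above can be written as a small function. This is a minimal sketch, not the collaborators' code: the function name `predict_ndvi` is hypothetical, and because the slide does not expand the abbreviations GDD, PPT, and AMI or give their units, they are treated here as plain numeric inputs (GDD and PPT must be positive for the logarithms).

```python
import math


def predict_ndvi(gdd: float, ppt: float, ami: float) -> float:
    """Piecewise model from the slide: one linear model per temperature regime.

    Assumption: gdd, ppt, ami are plain numbers in whatever units the
    original model used; the slide does not state them.
    """
    if gdd < 3000:
        # Cooler locations: the model is linear in log space,
        # so exponentiate to get back to NDVI.
        log_ndvi = 0.715 * math.log(gdd) + 0.377 * math.log(ppt) - 0.448
        return math.exp(log_ndvi)
    # Warmer locations: linear in AMI and ln(PPT).
    return 189.89 * ami + 44.02 * math.log(ppt) + 227.99
```

Note that the cooler branch predicts ln(NDVI) while the warmer branch predicts NDVI directly, so only the first needs the exponential.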

Page 6: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Choice of Algorithm

• We selected regression rules as a generalization of the Earth scientists’ representation

• We selected Cubist to learn them (http://www.rulequest.com)

Page 7: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

First Results

Method                                       r      # rules
Cubist                                       0.91   41
Linear regression with expert-selected cut   0.86   2

Cubist produced better accuracy, but its 41-rule model was hard to understand.

Page 8: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Varying the Cubist minimum rule cover parameter

Page 9: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

2-rule Cubist model

if PPT <= 25.457 then
    NDVI = -3.225 + 7.07 PPT + 0.0521 CDD - 84 AMI + 0.4 ln(PPT) + 0.0001 GDD

if PPT > 25.457 then
    NDVI = 386.3 + 316 AMI + 0.0294 GDD - 0.99 PPT + 0.2 ln(PPT)
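To illustrate why regression rules generalize the scientists' "if" statement, the 2-rule model above can be stored as data: a list of (condition, linear model) pairs with a generic evaluator. This is a sketch only. The names `RegressionRule`, `RULES`, and `predict` are hypothetical and are not part of Cubist, and the input is assumed to be a dictionary of plain numbers keyed by the slide's variable names.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]


@dataclass
class RegressionRule:
    condition: Callable[[Features], bool]  # when the rule applies
    intercept: float
    coefficients: Dict[str, float]         # term name -> weight


def with_log_terms(x: Features) -> Features:
    """Add the derived ln(PPT) term that both rules use."""
    return {**x, "ln(PPT)": math.log(x["PPT"])}


# Coefficients copied from the 2-rule model on the slide.
RULES: List[RegressionRule] = [
    RegressionRule(
        lambda x: x["PPT"] <= 25.457, -3.225,
        {"PPT": 7.07, "CDD": 0.0521, "AMI": -84.0, "ln(PPT)": 0.4, "GDD": 0.0001},
    ),
    RegressionRule(
        lambda x: x["PPT"] > 25.457, 386.3,
        {"AMI": 316.0, "GDD": 0.0294, "PPT": -0.99, "ln(PPT)": 0.2},
    ),
]


def predict(x: Features, rules: List[RegressionRule] = RULES) -> float:
    """Return the prediction of the first rule whose condition holds."""
    t = with_log_terms(x)
    for rule in rules:
        if rule.condition(x):
            return rule.intercept + sum(
                w * t[name] for name, w in rule.coefficients.items()
            )
    raise ValueError("no rule covers this input")
```

The two conditions here are mutually exclusive, so taking the first match is enough. A model with more rules is just a longer list, which is what makes a 41-rule model expressible in the same form but harder to read.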

Page 10: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Visualization #1: Cubist model in one variable

Page 11: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Visualization #2: Activity of Cubist Rules

Page 12: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Visualization #2: Error of Cubist model

Page 13: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Testing the model across years

• We trained Cubist using one year’s data

• We tested the resulting model on other years’ data

• If it transfers, it’s useful for Earth scientists

• If it sometimes doesn’t transfer, that could point to a scientific discovery
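The train-on-one-year, test-on-others procedure above can be sketched as follows. Several things here are assumptions for illustration: ordinary least squares (via NumPy) stands in for Cubist, the data are synthetic, the year labels `year_a`, `year_b`, `year_c` are placeholders, and `year_c` is deliberately generated from a different relationship so that the model fails to transfer to it.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept (a stand-in for Cubist)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]


def pearson_r(a, b):
    return float(np.corrcoef(a, b)[0, 1])


def cross_year_r(data_by_year, train_year):
    """Train on one year, report r between predictions and data for the rest."""
    X_train, y_train = data_by_year[train_year]
    coef = fit_linear(X_train, y_train)
    return {
        year: pearson_r(predict_linear(coef, X), y)
        for year, (X, y) in data_by_year.items()
        if year != train_year
    }


# Synthetic data only; none of these numbers come from the NDVI study.
rng = np.random.default_rng(0)


def make_year(n=200, flip=False):
    X = rng.uniform(0.0, 1.0, size=(n, 3))
    # flip=True changes the sign of one weight, mimicking a year whose
    # data follow a different relationship.
    w = np.array([2.0, 1.0 if flip else -1.0, 0.5])
    y = X @ w + rng.normal(0.0, 0.05, size=n)
    return X, y


data = {"year_a": make_year(), "year_b": make_year(), "year_c": make_year(flip=True)}
scores = cross_year_r(data, train_year="year_a")
for year, r in scores.items():
    print(f"{year}: r = {r:.2f}")
```

A year whose r falls well below the others is the one to inspect. Because r is insensitive to a constant offset, a fuller check might also compare the mean residual per year.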

Page 14: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Discovery of an error in the data

[Figure: two panels, "Cross-validate 1985" and "Train 1984, test 1985"]

Page 15: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Related Work

• Regression trees: Breiman et al.'s CART (1984)

• Classification applied to Earth science: Brodley & Friedl (1999); Ester, Kriegel, & Xu (1996)

• Visualizing classes on map: Brodley & Friedl (1999); Smyth, Ghil, & Ide (1999)

• Detecting and correcting faulty class labels in data: John (1995); Brodley & Friedl (1999)

• Detecting and correcting calibration problems in remote-sensing systems using predefined model: Chen (1997)

Page 16: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Future Work

• Cubist/NDVI work

– Incorporate time explicitly

– Include other variables (e.g., elevation)

– Test understandability

• Other work

– Improve CASA model (next talk)

– Implement an interactive system that lets scientists direct high-level search for improved ecosystem models

Page 17: Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Lessons Learned

We’ve identified three problems that arise in scientific applications of ML, and proposed initial solutions:

Communicability: Use the same representation as the scientists.

Understandability: When using spatial data, spatially visualize the model’s errors and the activity of its components.

Quantitative errors: When using time-series data, identify quantitative errors by testing a model trained on one time period against data from other time periods.
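The understandability lesson above (map the model's errors and the activity of its components) might be prepared along these lines. This is a sketch under stated assumptions: the function name `spatial_maps` is hypothetical, each data point is assumed to carry integer row and column indices into a latitude/longitude grid, and `rule_id` is assumed to record which rule fired for that point.

```python
import numpy as np


def spatial_maps(lat_idx, lon_idx, observed, predicted, rule_id, shape):
    """Scatter per-point error and rule activity onto a 2-D grid.

    Cells with no data keep NaN (error map) or -1 (rule-activity map).
    """
    error = np.full(shape, np.nan)
    active = np.full(shape, -1, dtype=int)
    error[lat_idx, lon_idx] = np.asarray(predicted) - np.asarray(observed)
    active[lat_idx, lon_idx] = rule_id
    return error, active


# Tiny made-up example on a 2 x 2 grid.
error_map, rule_map = spatial_maps(
    lat_idx=np.array([0, 1]),
    lon_idx=np.array([1, 0]),
    observed=np.array([1.0, 2.0]),
    predicted=np.array([1.5, 1.0]),
    rule_id=np.array([0, 1]),
    shape=(2, 2),
)
```

Either array can then be handed to an image-plotting routine (for example matplotlib's `imshow`) so that regions of large error, or the region where each rule is active, show up directly on the map.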