interactive discovery in large data sets kiri l. wagstaff jet propulsion laboratory...

32
Interactive Discovery in Large Data Sets Kiri L. Wagstaff Jet Propulsion Laboratory [email protected] October 11, 2012 Joint work with David R. Thompson (JPL), Nina Lanza (Los Alamos National Lab), Thomas G. Dietterich (Oregon State University), and Diana Blaney (JPL) Guest lecture for CS 886, Applied Machine Lear This work was carried out in part at the Jet Propulsion Laboratory, California Institute of Technology, © 2012. Government sponsorship acknowledged. It was also supported by the Defense Advanced Research Projects Agency (DARPA) under Contract W911NF-11-C-0088. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author’s and do not necessarily reflect

Upload: simon-george

Post on 27-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Interactive Discovery in Large Data SetsKiri L. WagstaffJet Propulsion [email protected]

October 11, 2012

Joint work with David R. Thompson (JPL), Nina Lanza (Los Alamos National Lab),Thomas G. Dietterich (Oregon State University), and Diana Blaney (JPL)

Guest lecture for CS 886, Applied Machine Learning

This work was carried out in part at the Jet Propulsion Laboratory, California Institute of Technology, © 2012. Government sponsorship acknowledged. It was also supported by the Defense Advanced

Research Projects Agency (DARPA) under Contract W911NF-11-C-0088. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author’s and do not

necessarily reflect the views of the DARPA, the Army Research Office, or the US government.

Interactive Discovery in Large Data Sets - CS 886

2

Interactive Discovery in Large Data Sets

Discovery◦What is interesting? Novel?◦Big NASA data sets

LSST: 28 TB/day SKA: 86 TB/day

Explanations◦AI: actions + reasons for them

Why “interactive”?◦No general definition of “interesting”

Large Synoptic Survey Telescope

(LSST)

Square Kilometre Array (SKA)

Interactive Discovery in Large Data Sets - CS 886

3

Spirit’s McMurdo Panorama, 1000 sols, October 2006 (NASA/JPL/Cornell)

22,348 x 5771 pixels = 386 MBWhat’s most interesting here?

Example: Mars Rover Panorama

Interactive Discovery in Large Data Sets - CS 886

4

Zooming in

Portion of Spirit’s McMurdo Panorama, 1000 sols, October 2006 (NASA/JPL/Cornell)

Interactive Discovery in Large Data Sets - CS 886

Zooming in

Portion of Spirit’s McMurdo Panorama, 1000 sols, October 2006 (NASA/JPL/Cornell)

Possible meteorit

e

Impact ejecta

Sand ripples

Exposed layers

Bright rocks

Dark rocks

5

Interactive Discovery in Large Data Sets - CS 886

6

Discovery

Exploration of large data sets

Desiderata◦Diverse sampling of data set

System

learns model

System

chooses

item

Unsupervised learning

What to select?Items that differ from those

previously seenPrincipal Components Model

◦Approximate model of data set variation

◦Keep only the top K vectors from U

Known items

[M. Scholz, 2006] Interactive Discovery in Large Data Sets - CS 886 7

What to select?Items that differ from those

previously seenPrincipal Components Model

◦Approximate model of data set variation

◦Keep only the top K vectors from U◦Select items in D that are

difficult to represent with model UReconstruction error

Reconstruction of x

Mean of X

For x in D Interactive Discovery in Large Data Sets - CS 886 8

Known items

Updating model U with new xRedo PCA from scratch:

expensiveIncrementally update U: fast!

◦U depends only on previous U and new x

◦[Ross et al., 2008]

Iterations

Data

Principal Components

X1 X2 X3 X4 X5

U1 U2 U3 U4 U5

Interactive Discovery in Large Data Sets - CS 886 9

DEMUD: Discovery through Eigenbasis Modeling of Uninteresting Data

Select most interesting x

in D

Update model U to include x

Compute score for all x in D using

U

Initial ranking of D by PCA-1

reconstruction error

Interactive Discovery in Large Data Sets - CS 886 10

Treat x as “no longer

interesting”

Interactive Discovery in Large Data Sets - CS 886

11

McMurdo selections1200 features: 100x100 RGB,

downsamp. 5xK=20

Selection #1

Interactive Discovery in Large Data Sets - CS 886

12

McMurdo selections1200 features: 100x100 RGB,

downsamp. 5xK=20

Selection #2

Interactive Discovery in Large Data Sets - CS 886

13

McMurdo selections1200 features: 100x100 RGB,

downsamp. 5xK=20

Selection #10

Interactive Discovery in Large Data Sets - CS 886

14

ExplanationsReconstruction residuals

Original Features Residuals

1

2

3

Dark: lower intensity than expectedBright: higher intensity than expected

Dark area at bottom has small residual learning happened!

Interactive Discovery in Large Data Sets - CS 886

15

Interactive Discovery

Guided exploration of large data sets

Desiderata◦Quickly find items of interest, even if

rare◦Don’t miss anything!

User reviews item

System

chooses

item

Semi-supervised learning

Interactive DEMUD

Select most interesting x in

D

Query user on x• Interesting or

uninteresting?

If uninteresting, update model

U

Compute score for all x in D using

U

Initial ranking of D by PCA-1

reconstruction error

Interactive Discovery in Large Data Sets - CS 886 16

Alternatives: 1SVM-intTrain a one-class SVM to model

the interesting classSelect most interesting item

Smooth

Varied

Interactive Discovery in Large Data Sets - CS 886 17Dark

Light

Alternatives: 1SVM-unintTrain a one-class SVM to model

the uninteresting classSelect least uninteresting item

Smooth

Varied

Interactive Discovery in Large Data Sets - CS 886 18Dark

Light

Alternatives: 2SVM-bothTrain a two-class SVM to model

both classesSelect most interesting item

Smooth

Varied

Interactive Discovery in Large Data Sets - CS 886 19Dark

Light

Alternatives: 2SVM-activeTrain a two-class SVM to model

both classesSelect most ambiguous item

Smooth

Varied

Interactive Discovery in Large Data Sets - CS 886 20Dark

Light

Premature Specialization

Interactive Discovery in Large Data Sets - CS 886 21

(A possible danger of training on positives in a

discovery setting)

Alternatives: Static BaselineSelect by PCA-K ordering

◦Same initial model as DEMUDNo feedback

Interactive Discovery in Large Data Sets - CS 886 22

Faces Data Set40 people, 10 poses eachHigh dimensionality: 10,304Goal: Discover 3 women

◦Data set is mostly men◦Challenge: disjunction

Data from AT&T Laboratories CambridgeInteractive Discovery in Large Data Sets -

CS 886 23

Faces: Three Women Discovery

Premature specialization

Delayed start

Perfect

Random

Interactive Discovery in Large Data Sets - CS 886 24

Exploration Rate

Algorithm Queries

DEMUD 43

1SVM-unint 50

1SVM-int 164

2SVM-both 124

2SVM-active 124

Static baseline 223

Number of queries to find one image of each woman

Interactive Discovery in Large Data Sets - CS 886 25

Rare Class Discovery

6 classes, d=7, n=336, K=4 6 classes, d=9, n=214, K=4

DEMUDDEMUD

Interactive Discovery in Large Data Sets - CS 886 26

Figures from [He and Carbonell, 09]

Random subset of CRISM data

CRISM: Magnesite Discovery Magnesite (MgCO3): possible groundwater deposit CRISM data: 0.364 to 3.92 μm, 197 bands Only 17 of 15,400 items match

Data from Mars Reconnaissance Orbiter/CRISM

MgCO3

Interactive Discovery in Large Data Sets - CS 886 27

CRISM: Magnesite Discovery

Delayed start

Perfect

Random

Ran out of time (3 days!)~2 mins per query

Interactive Discovery in Large Data Sets - CS 886 28

ChemCam: CarbonatesChemCam: LIBS instrument on

MSLData set: 60 lab samples

+ 40 carbonates◦6143 features (bands)◦K (8) chosen to

capture 90% variance

Interactive Discovery in Large Data Sets - CS 886 29

Regular PCA

DEMUD

Colored items are carbonates; white are non-carbonates

1SVM-int

DEMUD: ExplanationsTop 10 items chosen by DEMUD

Ranked by “interestingness” Explanations (residuals)

Interactive Discovery in Large Data Sets - CS 886 30

Interactive Discovery in Large Data Sets - CS 886

31

Other ApplicationsText

◦Long Wavelength Array system log files

◦Detect anomalous system behavior

Onboard prioritization◦Imaging spectrometers

Hyperion on EO-1: 256x6000x242

◦Assign priorities for input to onboard compression: ROI-ICER

Interactive Discovery in Large Data Sets - CS 886

32

SummaryDiscovery

◦PCA-based model + reconstruction error

Explanations◦Why was it chosen?

Interactive discovery◦Model the uninteresting to avoid it

Next challenge: evolving class of interest Thank you!

Contact: [email protected]