Page 1: Surprise Detection in Multivariate Astronomical Data · astrostatistics.psu.edu/su11scma5/lectures/kborne-SCMAV.pdf · Kirk Borne

Surprise Detection in Multivariate Astronomical Data

Kirk Borne

George Mason University

[email protected] , http://classweb.gmu.edu/kborne/

Page 2:

Outline

• What is Surprise Detection?

• Example Application: The LSST Project

• New Algorithm for Surprise Detection: KNN-DD

Page 4:

Outlier Detection has many names

• Semi-supervised Learning

• Outlier Detection

• Novelty Detection

• Anomaly Detection

• Deviation Detection

• Surprise Detection

Page 6:

Outlier Detection as Surprise Detection

Graphic from S. G. Djorgovski

• Benefits of very large datasets:

– best statistical analysis of “typical” events

– automated search for “rare” events

… Surprise !

Page 7:

Basic Knowledge Discovery Problem

• Outlier detection: (unknown unknowns)

– Finding the objects and events that are outside the bounds of our expectations (outside known clusters)

– These may be real scientific discoveries or garbage

– Outlier detection is therefore useful for:

• Novelty Discovery – is my Nobel prize waiting?

• Anomaly Detection – is the detector system working?

• Science Data Quality Assurance – is the data pipeline working?

– “One person’s garbage is another person’s treasure.”

– “One scientist’s noise is another scientist’s signal.”

Page 8:

Outlier detection: (unknown unknowns)

– Simple techniques exist for uni- and multi-variate data:

– Outlyingness: O(x) = | x – μ(Xn) | / σ(Xn)

– Mahalanobis distance: [formula shown on slide]

– Normalized Euclidean distance: [formula shown on slide]

– Numerous (countless?) outlier detection algorithms have been developed. For example, see these reviews:

• “Novelty Detection: A Review – Part 1: Statistical Approaches,” by Markou & Singh, Signal Processing, 83, 2481-2497 (2003).

• “A Survey of Outlier Detection Methodologies,” by Hodge & Austin, Artificial Intelligence Review, 22, 85-126 (2004).

• “Capabilities of Outlier Detection Schemes in Large Datasets,” by Tang, Chen, Fu, & Cheung, Knowledge and Information Systems, 11 (1), 45-84 (2006).

• “Outlier Detection, A Survey,” by Chandola, Banerjee, & Kumar, Technical Report (2007).

– How does one optimally find outliers in 10^3-D parameter space? or in interesting subspaces (lower dimensions)?

– How do we measure their “interestingness”?
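
As a rough illustration of the simple techniques named on this slide, the sketch below computes the univariate outlyingness O(x), the Mahalanobis distance, and the normalized Euclidean distance in plain Python. The last two use the standard textbook definitions, D_M(x) = sqrt((x − μ)^T Σ^-1 (x − μ)) and D_E(x) = sqrt(Σ_i (x_i − μ_i)^2 / σ_i^2), which are not necessarily the exact forms shown on the slide. The function names, the toy data, and the restriction of the Mahalanobis example to two dimensions are assumptions made to keep the example short.

```python
import math
import statistics

def outlyingness(x, sample):
    """O(x) = |x - mean(sample)| / stdev(sample) for univariate data."""
    return abs(x - statistics.mean(sample)) / statistics.stdev(sample)

def normalized_euclidean(point, data):
    """sqrt(sum_i ((x_i - mu_i) / sigma_i)^2): each axis scaled by its own stdev."""
    columns = list(zip(*data))
    return math.sqrt(sum(
        ((x - statistics.mean(col)) / statistics.stdev(col)) ** 2
        for x, col in zip(point, columns)
    ))

def mahalanobis_2d(point, data):
    """sqrt((x - mu)^T Sigma^-1 (x - mu)) for 2-D data (2-D assumed for brevity)."""
    xs, ys = zip(*data)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(data)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form using the closed-form inverse of a 2x2 matrix
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

# Hypothetical, strongly correlated 2-D sample
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
on_trend = (6.0, 6.0)    # extends the correlation
off_trend = (2.0, 4.0)   # close to the mean on each axis, but off the trend

print(outlyingness(10.0, [1.0, 2.0, 3.0, 4.0, 5.0]))
print(normalized_euclidean(off_trend, data), mahalanobis_2d(off_trend, data))
print(normalized_euclidean(on_trend, data), mahalanobis_2d(on_trend, data))
```

In this toy sample the normalized Euclidean distance ranks the on-trend point as the more distant one, while the Mahalanobis distance, which accounts for the correlation between the two axes, ranks the off-trend point as far more unusual.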

Page 9:

Outline

• What is Surprise Detection?

• Example Application: The LSST Project

• New Algorithm for Surprise Detection: KNN-DD

Page 10:

LSST = Large Synoptic Survey Telescope

http://www.lsst.org/

8.4-meter diameter primary mirror = 10 square degrees (field of view)!

Hello !

Page 11:

Observing Strategy: One pair of images every 40 seconds for each spot on the sky, then continue across the sky continuously every night for 10 years (~2019-2029), with time domain sampling in log(time) intervals (to capture dynamic range of transients).

• LSST (Large Synoptic Survey Telescope):

– Ten-year time series imaging of the night sky – mapping the Universe !

– ~1,000,000 events each night – anything that goes bump in the night !

– Cosmic Cinematography! The New Sky! @ http://www.lsst.org/

Education and Public Outreach have been an integral and key feature of the project since the beginning – the EPO program includes formal Ed, informal Ed, Citizen Science projects, and Science Centers / Planetaria.

Page 12:

LSST in time and space:

– When? ~2019-2029

– Where? Cerro Pachon, Chile

LSST Key Science Drivers: Mapping the Dynamic Universe

– Solar System Inventory (moving objects, NEOs, asteroids: census & tracking)

– Nature of Dark Energy (distant supernovae, weak lensing, cosmology)

– Optical transients (of all kinds, with alert notifications within 60 seconds)

– Digital Milky Way (proper motions, parallaxes, star streams, dark matter)

Architect’s design of LSST Observatory

Page 13:

LSST Summary

http://www.lsst.org/

• 3-Gigapixel camera

• One 6-Gigabyte image every 20 seconds

• 30 Terabytes every night for 10 years

• 100-Petabyte final image data archive anticipated – all data are public!!!

• 20-Petabyte final database catalog anticipated

• Real-Time Event Mining: 1-10 million events per night, every night, for 10 yrs

– Follow-up observations required to classify these

• Repeat images of the entire night sky every 3 nights: Celestial Cinematography

Page 14:

The LSST will represent a 10K-100K times increase in the VOEvent network traffic.

This poses significant real-time classification demands on the event stream:

from data to knowledge!

from sensors to sense!

Page 15:

The LSST Data Mining Raison d’etre

• More data is not just more data … more is different!

• Discover the unknown unknowns.

• Massive Data-to-Knowledge challenge.

Page 16:

The LSST Data Mining Challenges

1. Massive data stream: ~2 Terabytes of image data per hour that must be mined in real time (for 10 years).

2. Massive 20-Petabyte database: more than 50 billion objects need to be classified, and most will be monitored for important variations in real time.

3. Massive event stream: knowledge extraction in real time for 1,000,000 events each night.

• Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3.

• Look at these in more detail ...

Page 19:

LSST challenges # 1, 2

• Each night for 10 years LSST will obtain the equivalent amount of data that was obtained by the entire Sloan Digital Sky Survey

• My grad students will be asked to mine these data (~30 TB each night ≈ 60,000 CDs filled with data): a sea of CDs

Image: The CD Sea in Kilmington, England (600,000 CDs)

Page 20:

LSST challenges # 1, 2

• Each night for 10 years LSST will obtain the equivalent amount of data that was obtained by the entire Sloan Digital Sky Survey

• My grad students will be asked to mine these data (~30 TB each night ≈ 60,000 CDs filled with data):

– A sea of CDs each and every day for 10 yrs

– Cumulatively, a football stadium full of 200 million CDs after 10 yrs

• The challenge is to find the new, the novel, the interesting, and the surprises (the unknown unknowns) within all of these data.

• Yes, more is most definitely different !
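
The CD comparison on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and 365 observing nights per year, both of which are simplifying assumptions rather than figures from the talk; the per-CD capacity is derived from the slide's own two numbers.

```python
TB = 10 ** 12          # assumed decimal terabyte, in bytes
MB = 10 ** 6           # assumed decimal megabyte, in bytes

nightly_bytes = 30 * TB        # ~30 TB each night (from the slide)
cds_per_night = 60_000         # ~60,000 CDs each night (from the slide)

# Capacity per CD implied by the slide's two numbers
implied_cd_mb = nightly_bytes / cds_per_night / MB
print(f"Implied capacity per CD: {implied_cd_mb:.0f} MB")

# Cumulative totals, assuming 365 observing nights per year for 10 years
nights = 365 * 10
total_cds = cds_per_night * nights
total_pb = nightly_bytes * nights / (1000 * TB)
print(f"CDs after 10 years: {total_cds / 1e6:.0f} million")
print(f"Data after 10 years: {total_pb:.1f} PB")
```

Under these assumptions the totals come out near 219 million CDs and about 110 PB, in line with the slide's rounded "200 million CDs" and the 100-Petabyte image archive quoted in the LSST summary.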

Page 24:

LSST data mining challenge # 3

• Approximately 1,000,000 times each night for 10 years LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data: more data points help !

[Figure: flux vs. time for a new sky event]

Characterize first ! Classify later.

Page 25:

Characterization includes …

• Feature detection and extraction:

– Identify and describe features in the data

– Extract feature descriptors from the data

– Curate these features for scientific search & re-use

• Find other parameters and features from other archives, other databases, other sky surveys – and use those to help characterize (ultimately classify) each new event.

• … hence, cope with a highly multivariate parameter space

• Outlier / Anomaly / Novelty / Surprise detection

• Clustering:

– unsupervised learning ; class discovery

• Correlation discovery

Page 26:

Outline

• What is Surprise Detection?

• Example Application: The LSST Project

• New Algorithm for Surprise Detection: KNN-DD

(work done in collaboration with Arun Vedachalam)

Page 27:

Challenge: which data points are the outliers ?

Page 28:

Inlier or Outlier?

Is it in the eye of the beholder?

Page 29:

3 Experiments

Page 30:

Experiment #1-A (L-TN)

• Simple linear data stream – Test A

• Is the red point an inlier or an outlier?

Page 31:

Experiment #1-B (L-SO)

• Simple linear data stream – Test B

• Is the red point an inlier or an outlier?

Page 32:

Experiment #1-C (L-HO)

• Simple linear data stream – Test C

• Is the red point an inlier or an outlier?

Page 33:

Experiment #2-A (V-TN)

• Inverted V-shaped data stream – Test A

• Is the red point an inlier or an outlier?

Page 34:

Experiment #2-B (V-SO)

• Inverted V-shaped data stream – Test B

• Is the red point an inlier or an outlier?

Page 35:

Experiment #2-C (V-HO)

• Inverted V-shaped data stream – Test C

• Is the red point an inlier or an outlier?

Page 36:

Experiment #3-A (C-TN)

• Circular data topology – Test A

• Is the red point an inlier or an outlier?

Page 37:

Experiment #3-B (C-SO)

• Circular data topology – Test B

• Is the red point an inlier or an outlier?

Page 38:

Experiment #3-C (C-HO)

• Circular data topology – Test C

• Is the red point an inlier or an outlier?

Page 41:

KNN-DD = K-Nearest Neighbors Data Distributions

f_O(d[x_i, O]) (distribution of distances between the test point O and its K nearest neighbors x_i)

f_K(d[x_i, x_j]) (distribution of distances among the K nearest neighbors themselves)

Page 42:

The Test: K-S test

• Tests the Null Hypothesis: the two data distributions are drawn from the same parent population.

• If the Null Hypothesis is rejected, then it is probable that the two data distributions are different.

• This is our definition of an outlier:

– The Null Hypothesis is rejected. Therefore…

– the data point’s location in parameter space deviates in an improbable way from the rest of the data distribution.
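
To make the test concrete, here is a minimal pure-Python sketch of the two-sample K-S statistic D, the largest vertical gap between the two empirical cumulative distributions. The decision rule uses the standard large-sample critical value D_crit = c(α)·sqrt((n+m)/(n·m)) with c(0.05) ≈ 1.358. The function names and the toy samples are hypothetical, and a real analysis would more likely call a library routine such as `scipy.stats.ks_2samp`.

```python
import math
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample K-S statistic: max |F_a(x) - F_b(x)| over all observed x."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        # Empirical CDFs: fraction of each sample that is <= x
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

def rejects_null(a, b, c_alpha=1.358):
    """True if D exceeds the large-sample critical value (c_alpha=1.358 for alpha=0.05)."""
    n, m = len(a), len(b)
    d_crit = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(a, b) > d_crit

# Toy samples (hypothetical): two interleaved sets and one shifted set
same_1 = [0.1 * i for i in range(50)]
same_2 = [0.1 * i + 0.05 for i in range(50)]
shifted = [0.1 * i + 3.0 for i in range(50)]

print(ks_statistic(same_1, same_2), rejects_null(same_1, same_2))
print(ks_statistic(same_1, shifted), rejects_null(same_1, shifted))
```

Because the statistic depends only on the ordering of the pooled values, it needs no assumption about the shape of either distribution.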

Page 43:

Advantages and Benefits of KNN-DD

• Based on the non-parametric K-S test

– It makes no assumption about the shape of the data distribution or about “normal” behavior

– It compares the cumulative distributions of the data values (i.e., the sets of inter-point distances) without regard to the nature of those distributions

… (to be continued) …
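
Putting the pieces together, the following is one possible reading of KNN-DD in pure Python, not the authors' implementation. For a test point O it finds the K nearest neighbors, collects the pairwise distances among those neighbors (f_K) and the distances from O to each of them (f_O), and compares the two sets with a two-sample K-S test. The asymptotic p-value approximation, the function names, the synthetic data, and the use of K=20 with a p<0.05 cut in the demo are assumptions for illustration.

```python
import math
import random
from bisect import bisect_right
from itertools import combinations

def ks_pvalue(a, b):
    """Two-sample K-S test: asymptotic p-value (Kolmogorov series approximation)."""
    a, b = sorted(a), sorted(b)
    d = max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
            for x in a + b)
    ne = len(a) * len(b) / (len(a) + len(b))          # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        return 1.0   # p is 1 to many decimals here; the series converges slowly
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                 for j in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))

def knn_dd(test_point, data, k=20):
    """K-S p-value comparing f_O (test point to neighbors) with f_K (neighbor to neighbor)."""
    neighbors = sorted(data, key=lambda x: math.dist(x, test_point))[:k]
    f_k = [math.dist(p, q) for p, q in combinations(neighbors, 2)]
    f_o = [math.dist(p, test_point) for p in neighbors]
    return ks_pvalue(f_o, f_k)

# Synthetic "linear data stream" (hypothetical): points scattered along y = x
random.seed(0)
data = [(t, t + random.gauss(0.0, 0.05)) for t in
        (random.uniform(0.0, 1.0) for _ in range(200))]

p_in = knn_dd((0.5, 0.5), data)    # lies on the stream
p_out = knn_dd((0.5, 1.5), data)   # far above the stream
print(f"on-stream p = {p_in:.3f}, off-stream p = {p_out:.3g}")
print("off-stream point flagged as outlier:", p_out < 0.05)
```

Only K distances to the test point and K(K-1)/2 neighbor-to-neighbor distances enter the test, so the cost per test point after the neighbor search does not grow with the dimensionality of the data. The pairwise distances are not independent of one another, so the p-value from this sketch is best read as a relative score rather than an exact probability.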

Page 44:

Cumulative Data Distribution (K-S test) for Experiment 1A (L-TN)

Page 45:

Cumulative Data Distribution (K-S test) for Experiment 2B (V-SO)

Page 46:

Cumulative Data Distribution (K-S test) for Experiment 3C (C-HO)

Page 47:

Results of KNN-DD experiments

Experiment ID | Short Description of Experiment       | K-S Test p-value | Outlier Index = 1-p = Outlyingness Likelihood | Outlier Flag (p<0.05?)
L-TN          | Linear data stream, True Normal test  | 0.590            | 41.0%                                         | False
L-SO          | Linear data stream, Soft Outlier test | 0.096            | 90.4%                                         | Potential Outlier
L-HO          | Linear data stream, Hard Outlier test | 0.025            | 97.5%                                         | TRUE
V-TN          | V-shaped stream, True Normal test     | 0.366            | 63.4%                                         | False
V-SO          | V-shaped stream, Soft Outlier test    | 0.063            | 93.7%                                         | Potential Outlier
V-HO          | V-shaped stream, Hard Outlier test    | 0.041            | 95.9%                                         | TRUE
C-TN          | Circular stream, True Normal test     | 0.728            | 27.2%                                         | False
C-SO          | Circular stream, Soft Outlier test    | 0.009            | 99.1%                                         | TRUE
C-HO          | Circular stream, Hard Outlier test    | 0.005            | 99.5%                                         | TRUE

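The K-S test p-value is the probability, assuming the Null Hypothesis is true, of obtaining a difference between the two distributions at least as large as the one observed; a small p-value therefore counts as evidence against the Null Hypothesis.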
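
The last two columns of the table follow directly from the p-value. A minimal sketch, assuming that "Potential Outlier" marks p-values between 0.05 and 0.10 (the slide states only the p<0.05 cutoff, so the 0.10 bound and the function name are assumptions):

```python
def outlier_summary(p_value, hard_cut=0.05, soft_cut=0.10):
    """Return (outlier index in percent, flag) for a K-S p-value.

    hard_cut=0.05 is the slide's threshold; soft_cut=0.10 is an assumed
    upper bound for the "Potential Outlier" label.
    """
    index_percent = round((1.0 - p_value) * 100.0, 1)
    if p_value < hard_cut:
        flag = "TRUE"
    elif p_value < soft_cut:
        flag = "Potential Outlier"
    else:
        flag = "False"
    return index_percent, flag

# p-values taken from the results table
p_values = {"L-TN": 0.590, "L-SO": 0.096, "L-HO": 0.025,
            "V-TN": 0.366, "V-SO": 0.063, "V-HO": 0.041,
            "C-TN": 0.728, "C-SO": 0.009, "C-HO": 0.005}

for name, p in p_values.items():
    index, flag = outlier_summary(p)
    print(f"{name}: p = {p:.3f}, outlier index = {index:.1f}%, flag = {flag}")
```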

Page 49:

Astronomy data experiment #1: Star-Galaxy separation

• Approximately 100 stars and 100 galaxies were selected from the SDSS & 2MASS catalogs

• 8 parameters (ugriz and JHK magnitudes) were extracted

• 7 colors were computed: u-g, g-r, r-i, i-z, z-J, J-H, H-K

– these are used to locate each object in feature space

– hence, we have a 7-dimensional parameter feature space

• The galaxies are treated as the “outliers” relative to the stars – i.e., can KNN-DD separate them from the stars?

• Results: (for p=0.05 and K=20)

– 78% of the galaxies were correctly classified as “outliers” (TP)

– 1% of the stars were incorrectly classified as “outliers” (FP)
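
The seven colors are simply differences of adjacent magnitudes in the ordered band list u, g, r, i, z, J, H, K. A short sketch, with made-up magnitudes for one hypothetical object and an illustrative function name:

```python
BANDS = ["u", "g", "r", "i", "z", "J", "H", "K"]   # SDSS ugriz + 2MASS JHK

def colors_from_magnitudes(mags):
    """Map 8 magnitudes to the 7 adjacent-band colors (u-g, g-r, ..., H-K)."""
    return {f"{a}-{b}": mags[a] - mags[b] for a, b in zip(BANDS, BANDS[1:])}

# Made-up magnitudes for a single hypothetical object
mags = {"u": 19.5, "g": 18.25, "r": 17.75, "i": 17.5,
        "z": 17.25, "J": 16.5, "H": 16.0, "K": 15.75}

colors = colors_from_magnitudes(mags)
feature_vector = list(colors.values())   # the object's position in 7-D feature space
print(colors)
print(len(feature_vector))
```

Working with colors rather than raw magnitudes removes the overall brightness of each object, so the feature space describes the shape of its spectral energy distribution.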

Page 50:

Astronomy data experiment #2: Star-Quasar separation

• 1000 stars selected from the SDSS & 2MASS catalogs

• 100 quasars selected from Penn St. astrostats website

• 8 parameters (ugriz and JHK magnitudes) were extracted

• 7 colors were computed: u-g, g-r, r-i, i-z, z-J, J-H, H-K

– these are used to locate each object in feature space

– hence, we have a 7-dimensional parameter feature space

• The quasars are treated as the “outliers” relative to the stars – i.e., can KNN-DD separate them from the stars?

• Results: (for p=0.05 and K=20)

– 100% of the quasars were correctly classified as “outliers” (TP)

– 29% of the stars were incorrectly classified as “outliers” (FP)
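
It can help to turn these rates into counts. Using only the sample sizes and percentages quoted on this slide, the sketch below derives how many of the flagged objects would actually be quasars. The variable names are illustrative, and the precision figure is a derived quantity, not one reported in the talk.

```python
n_stars, n_quasars = 1000, 100      # sample sizes from the slide
tp_rate, fp_rate = 1.00, 0.29       # 100% of quasars and 29% of stars flagged

true_positives = round(tp_rate * n_quasars)    # quasars flagged as outliers
false_positives = round(fp_rate * n_stars)     # stars flagged as outliers
flagged = true_positives + false_positives

# Precision: fraction of flagged objects that are really quasars
precision = true_positives / flagged
print(f"flagged = {flagged} ({true_positives} quasars + {false_positives} stars)")
print(f"precision = {precision:.1%}")
```

With ten times as many stars as quasars, roughly three of every four flagged objects would be stars, which is why the false-positive rate matters as much as the detection rate.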

Page 51:

Advantages and Benefits of KNN-DD

… continued …

• KNN-DD:

– operates on multivariate data (thus solving the curse of dimensionality)

– is algorithmically univariate (by estimating a function that is based only on the distance between data points, which themselves occupy high-dimensional parameter space)

– is simply extensible to higher dimensions

– is computed only on small-K local subsamples of the full dataset of N data points (K << N)

– is easily (embarrassingly) parallelized when testing multiple data points for outlyingness

Page 52:

Future Work for KNN-DD (i.e., deficiencies that need attention)

• Validate our choices of p and K, which are not well determined or justified

• Measure the KNN-DD algorithm’s learning times

• Determine the algorithm’s complexity

• Compare the algorithm against several other outlier detection algorithms

– We started doing that, but there are a very large number of them

• Evaluate the KNN-DD algorithm’s effectiveness on much larger datasets (e.g., 77,000 SDSS quasars)

– We started doing that, but with mixed results, which we are still analyzing

• Test its usability on event streams (streaming data)

Page 53:

Future Work in Surprise Detection

• Test and validate many more of the existing outlier detection algorithms on astronomical data:

– score them according to their effectiveness and efficiency

• Derive an “interestingness” index, which will probably be based upon a mixture of outlyingness metrics:

– test and validate this on single data points, data sequences (e.g., trending data), and data series (e.g., time series)

• Apply resulting algorithms and indices on very large datasets (e.g., SDSS+2MASS+GALEX+WISE catalogs, Kepler time series, etc.)

– test on LSST simulated data and catalogs, in preparation for the real thing at the end of the decade

• Investigate applicability of algorithms to SDQA (Science Data Quality Assessment) tasks for large sky surveys