Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture models (applications) Bayes network anomaly detection (application) Very high dimensional data NVO Problems

Upload: alice-atkinson

Post on 17-Jan-2016


Page 1

Computational AstroStatistics

Bob Nichol (Carnegie Mellon)

Motivation & Goals

Multi-Resolutional KD-trees (examples)

Npt functions (application)

Mixture models (applications)

Bayes network anomaly detection (application)

Very high dimensional data

NVO Problems

Page 2

Collaborators

Chris Miller, Percy Gomez, Kathy Romer, Andy Connolly, Andrew Hopkins, Mariangela Bernardi, Tomo Goto (Astro)

Larry Wasserman, Chris Genovese, Wong Jang, Pierpaolo Brutti (Statistics)

Andrew Moore, Jeff Schneider, Brigham Anderson, Alex Gray, Dan Pelleg (CS)

Alex Szalay, Gordon Richards, Istvan Szapudi & others (SDSS)

Pittsburgh Computational AstroStatistics (PiCA) Group

(See http://www.picagroup.org)

Page 3

First Motivation

Cosmology is moving from a “discovery” science into a “statistical” science

Drive for “high precision” measurements: cosmological parameters to a few percent; accurate description of the complex structure in the universe; control of observational and sampling biases

New statistical tools – e.g. non-parametric analyses – are often computationally intensive.

Also, often want to re-sample or Monte Carlo data.

Page 4

Second Motivation

Last decade was dedicated to building more telescopes and instruments; more coming this decade as well (SDSS, Planck, LSST, 2MASS, DPOSS, MAP). Also, larger simulations.

We have a “Data Flood”; SDSS is terabytes of data a night, while LSST is an SDSS every 5 nights! Petabytes by end of 00’s

Highly correlated datasets and high dimensionality

Existing statistics and algorithms do not scale into these regimes

New Paradigm where we must build new tools before we can analyze & visualize data

Page 5

SDSS

Page 6

SDSS

Page 7

SDSS Data

Quantity     SDSS value      Factor
Area         10000 sq deg    3
Objects      2.5 billion     200
Spectra      1.5 million     200
Depth        R=23            10
Attributes   144 presently   10

Combined (product of the factors): FACTOR OF 12,000,000

SDSS Science: Most Distant Object! 100,000 spectra!

Page 8

Start with tree data structures: Multi-resolutional kd-trees

Scale to n-dimensions (although for very high dimensions use new tree structures)

Use Cached Representation (store at each node summary sufficient statistics). Compute counts from these statistics

Prune the tree which is stored in memory!

See Moore et al. 2001 (astro-ph/0012333)

Many applications; suite of algorithms!

Goal to build new, fast & efficient statistical algorithms
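A minimal sketch of the cached representation described on this slide, assuming a plain median-split kd-tree in pure Python. The names `Node`, `build` and `leaf_size`, and the choice of count plus coordinate sum as the cached statistics, are illustrative assumptions and not taken from Moore et al. 2001.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    lo: Point                              # bounding-box minimum in each dimension
    hi: Point                              # bounding-box maximum in each dimension
    count: int                             # cached: number of points under this node
    total: Point                           # cached: per-dimension sum of coordinates
    points: Optional[List[Point]] = None   # raw points, kept only at leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(points: List[Point], leaf_size: int = 8) -> Node:
    """Build a kd-tree over a non-empty point list; leaf_size must be >= 1."""
    dims = len(points[0])
    lo = tuple(min(p[d] for p in points) for d in range(dims))
    hi = tuple(max(p[d] for p in points) for d in range(dims))
    total = tuple(sum(p[d] for p in points) for d in range(dims))
    node = Node(lo, hi, len(points), total)
    if len(points) <= leaf_size:
        node.points = points
        return node
    # Split at the median along the widest dimension of the bounding box.
    widest = max(range(dims), key=lambda d: hi[d] - lo[d])
    ordered = sorted(points, key=lambda p: p[widest])
    mid = len(ordered) // 2
    node.left = build(ordered[:mid], leaf_size)
    node.right = build(ordered[mid:], leaf_size)
    return node

random.seed(0)
catalog = [(random.random(), random.random()) for _ in range(1000)]
root = build(catalog)
centroid = tuple(s / root.count for s in root.total)  # read straight off the cache
```

Because every node carries its own count and sum, a query that accepts or rejects a whole node never has to touch the points beneath it.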

Page 9
Page 10

Range Searches

Fast range searches and catalog matching

Prune cells outside range

Also Prune cells inside! Greater saving in time
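The two pruning rules can be sketched as a counting query. This is an illustration only, assuming a spherical search region and a dictionary-based kd-tree; `build`, `dist2` and `range_count` are hypothetical names, not the PiCA group's code.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    total = 0.0
    for a, b in zip(p, q):
        total += (a - b) ** 2
    return total

def build(points, leaf_size=8):
    """Median-split kd-tree; each node caches its bounding box and point count."""
    dims = len(points[0])
    node = {
        "lo": [min(p[d] for p in points) for d in range(dims)],
        "hi": [max(p[d] for p in points) for d in range(dims)],
        "count": len(points),
        "points": None,
        "kids": (),
    }
    if len(points) <= leaf_size:
        node["points"] = points
        return node
    widest = max(range(dims), key=lambda d: node["hi"][d] - node["lo"][d])
    ordered = sorted(points, key=lambda p: p[widest])
    mid = len(ordered) // 2
    node["kids"] = (build(ordered[:mid], leaf_size), build(ordered[mid:], leaf_size))
    return node

def range_count(node, centre, radius):
    """Count points within `radius` of `centre`, pruning whole cells where possible."""
    near = far = 0.0   # squared distance to the nearest / farthest part of the cell
    for c, lo, hi in zip(centre, node["lo"], node["hi"]):
        near += max(lo - c, 0.0, c - hi) ** 2
        far += max(c - lo, hi - c) ** 2
    r2 = radius * radius
    if near > r2:
        return 0                  # cell entirely outside the range: prune
    if far <= r2:
        return node["count"]      # cell entirely inside: prune, use the cached count
    if node["points"] is not None:
        return sum(1 for p in node["points"] if dist2(p, centre) <= r2)
    return sum(range_count(kid, centre, radius) for kid in node["kids"])

random.seed(1)
catalog = [(random.random(), random.random()) for _ in range(2000)]
tree = build(catalog)
n_close = range_count(tree, (0.5, 0.5), 0.1)
```

Only cells that straddle the boundary of the search region are opened, which is where the time saving comes from.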

Page 11

N-point correlation functions

The 2-point function has a long history in cosmology (Peebles 1980). It is the excess joint probability of a pair of points over that expected from a Poisson process. Also long history (as point processes) in Statistics:

Similarly, the three-point is defined as (so on!)
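The equations shown on this slide did not survive in the transcript. For reference, the standard textbook definitions (as in Peebles 1980) are given here, on the assumption that they match what was displayed. In this notation, \bar{n} is the mean number density, dV_i are volume elements, \xi is the 2-point function and \zeta is the 3-point function.

```latex
% Joint probability of finding one point in each of dV_1 and dV_2
dP_{12} = \bar{n}^{2} \left[ 1 + \xi(r_{12}) \right] dV_1 \, dV_2

% Joint probability of finding one point in each of dV_1, dV_2 and dV_3
dP_{123} = \bar{n}^{3} \left[ 1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{31})
           + \zeta(r_{12}, r_{23}, r_{31}) \right] dV_1 \, dV_2 \, dV_3
```

For a Poisson process both \xi and \zeta vanish, so each expression reduces to a product of independent probabilities.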

Page 12

Same 2pt, very different 3pt

Naively, this is an n^N process, but it is really just a set of range searches.

Page 13

Dual Tree Approach

Usually binned into annuli rmin < r < rmax. Thus, for each r, traverse both trees and prune pairs of nodes with either dmax < rmin or dmin > rmax, where dmin and dmax are the smallest and largest possible separations between the two nodes; no pair of points from such nodes can fall in the annulus.

Also, if dmin > rmin & dmax < rmax, all pairs in these nodes are within the annulus. Therefore, only need to calculate pairs cutting the boundaries.

Extra speed-ups are possible by doing multiple r’s together and by using controlled approximations
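A minimal sketch of the dual-tree pair count for a single bin, assuming two separate point sets, the half-open bin rmin <= r < rmax, and squared distances throughout. The function names are hypothetical, and the multi-bin and approximate variants mentioned on the slide are not shown.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    total = 0.0
    for a, b in zip(p, q):
        total += (a - b) ** 2
    return total

def build(points, leaf_size=8):
    """Median-split kd-tree; each node caches its bounding box and point count."""
    dims = len(points[0])
    node = {
        "lo": [min(p[d] for p in points) for d in range(dims)],
        "hi": [max(p[d] for p in points) for d in range(dims)],
        "count": len(points),
        "points": None,
        "kids": (),
    }
    if len(points) <= leaf_size:
        node["points"] = points
        return node
    widest = max(range(dims), key=lambda d: node["hi"][d] - node["lo"][d])
    ordered = sorted(points, key=lambda p: p[widest])
    mid = len(ordered) // 2
    node["kids"] = (build(ordered[:mid], leaf_size), build(ordered[mid:], leaf_size))
    return node

def dual_count(a, b, rmin, rmax):
    """Count pairs (p in a, q in b) whose separation r satisfies rmin <= r < rmax."""
    dmin2 = dmax2 = 0.0   # squared bounds on the separation between the two boxes
    for alo, ahi, blo, bhi in zip(a["lo"], a["hi"], b["lo"], b["hi"]):
        dmin2 += max(alo - bhi, blo - ahi, 0.0) ** 2
        dmax2 += max(ahi - blo, bhi - alo) ** 2
    lo2, hi2 = rmin * rmin, rmax * rmax
    if dmin2 >= hi2 or dmax2 < lo2:
        return 0                           # no pair can fall in the bin: prune
    if dmin2 >= lo2 and dmax2 < hi2:
        return a["count"] * b["count"]     # every pair falls in the bin: prune
    if a["points"] is not None and b["points"] is not None:
        return sum(1 for p in a["points"] for q in b["points"]
                   if lo2 <= dist2(p, q) < hi2)
    # Otherwise open one node: a if b is a leaf or a is the larger internal node.
    if b["points"] is not None or (a["points"] is None and a["count"] >= b["count"]):
        return sum(dual_count(kid, b, rmin, rmax) for kid in a["kids"])
    return sum(dual_count(a, kid, rmin, rmax) for kid in b["kids"])

random.seed(2)
set_a = [(random.random(), random.random()) for _ in range(500)]
set_b = [(random.random(), random.random()) for _ in range(500)]
pairs = dual_count(build(set_a), build(set_b), 0.1, 0.2)
```

Passing the same tree as both arguments would count each pair twice (and self-pairs when rmin is 0), so an auto-correlation count needs that correction applied afterwards.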

Page 14

Time depends on density of points and on bin size & scale

[Figure: run-time scaling curves labelled N*N, NlogN and N*N*N]

Page 15

Fast Mixture Models

Describe the data in N-dimensions as a mixture of, say, Gaussians (kernel shape less important than bandwidth!)

The parameters of the model are then N gaussians each with a mean and covariance

Iterate, testing using BIC and AIC at each iteration. Fast because of kdtrees (20 mins for 100,000 points on a PC!)

Employ heuristic splitting algorithm as well

Details in Connolly et al. 2000 (astro-ph/0008187)
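The fit-and-score loop can be illustrated with a deliberately simplified sketch: plain EM for a one-dimensional Gaussian mixture, scored with BIC and AIC. It assumes quantile-based starting means and a fixed iteration count, and it omits both the kd-tree acceleration and the heuristic splitting step, so it is not the method of Connolly et al. 2000. All names are hypothetical.

```python
import math
import random

def mixture_density(x, weights, means, variances):
    """Density of a 1-D Gaussian mixture at x."""
    return sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
               for w, m, v in zip(weights, means, variances))

def fit_mixture(xs, k, iters=100):
    """Fit k Gaussians by EM and return the parameters with BIC and AIC scores."""
    n = len(xs)
    ordered = sorted(xs)
    grand_mean = sum(xs) / n
    # Assumed initialisation: means at evenly spaced quantiles, one broad shared variance.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    variances = [sum((x - grand_mean) ** 2 for x in xs) / n] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            parts = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                     for w, m, v in zip(weights, means, variances)]
            norm = max(sum(parts), 1e-300)
            resp.append([p / norm for p in parts])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, 1e-6)   # floor to avoid collapsing components
    loglik = sum(math.log(max(mixture_density(x, weights, means, variances), 1e-300))
                 for x in xs)
    n_params = 3 * k - 1   # k means, k variances, k - 1 free weights
    return {
        "weights": weights, "means": means, "variances": variances,
        "bic": -2.0 * loglik + n_params * math.log(n),
        "aic": -2.0 * loglik + 2.0 * n_params,
    }

rng = random.Random(0)
sample = ([rng.gauss(0.0, 1.0) for _ in range(200)]
          + [rng.gauss(10.0, 1.0) for _ in range(200)])
fits = {k: fit_mixture(sample, k) for k in (1, 2, 3)}
best_k = min(fits, key=lambda k: fits[k]["bic"])   # lower BIC is preferred
```

In more than one dimension each variance becomes a covariance matrix and the parameter count grows accordingly, but the E-step, M-step and scoring have the same shape.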

Page 16

EM-Based Gaussian Mixture Clustering: 1

Page 17

EM-Based Gaussian Mixture Clustering: 2

Page 18

EM-Based Gaussian Mixture Clustering: 4

Page 19

EM-Based Gaussian Mixture Clustering: 20

Page 20
Page 21

Applications

Used in SDSS quasar selection (used to map the multi-color stellar locus)

Gordon Richards @ PSU

Anomaly detector (look for low probability points in N-dimensions)

Optimal smoothing of large-scale structure

Page 22

SDSS QSO target selection in 4D color-space

Cluster 9999 spectroscopically confirmed stars

Cluster 8833 spectroscopically confirmed QSOs (33 gaussians)

99% for stars, 96% for QSOs

Page 23
Page 24

Bayes Net Anomaly Detector

Instead of using a single joint probability function (fitted to data), factorize into a smaller set of conditional probabilities

Directional and acyclical

If we know graph and conditional probabilities, we have a valid probability function for the whole model
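In symbols, the factorization for a directed acyclic graph over variables x_1, ..., x_n is the standard Bayes-network identity below, where \mathrm{pa}(x_i) denotes the parents of x_i in the graph (notation assumed here, not taken from the slides).

```latex
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)
```

Each factor involves only a variable and its parents, so it can be estimated from far fewer data than the full joint distribution would need.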

Page 25

Use 1.5 million SDSS sources to learn model (25 variables each)

Then evaluate the likelihood of each data point being drawn from the model

Lowest 1000 are anomalous; look at ’em and follow ’em up at Keck

Page 26

Unfortunately, a lot of errors

Advantage of Bayes Net is that it tells you why it was anomalous: the most unusual conditional probabilities

Therefore, iterate loop and get scientist to highlight obvious errors; then suppress those errors so they do not return again

Issue of productivity!
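A toy sketch of likelihood ranking with an explanation, assuming a fixed, hand-written graph over three discrete attributes and conditional tables estimated by smoothed counting. The graph, the attribute names and the records are invented for illustration and have nothing to do with the 25 SDSS variables; the real system also has to learn the graph.

```python
import math
from collections import Counter

# Hypothetical structure: each variable maps to the tuple of its parents (a DAG).
PARENTS = {"type": (), "colour": ("type",), "size": ("type",)}

def learn_cpts(records, parents, alpha=1.0):
    """Estimate P(variable | parents) by counting, with add-alpha smoothing."""
    values = {v: {r[v] for r in records} for v in parents}
    joint, marginal = Counter(), Counter()
    for r in records:
        for v, pa in parents.items():
            key = tuple(r[p] for p in pa)
            joint[(v, key, r[v])] += 1
            marginal[(v, key)] += 1

    def prob(v, record):
        key = tuple(record[p] for p in parents[v])
        return ((joint[(v, key, record[v])] + alpha)
                / (marginal[(v, key)] + alpha * len(values[v])))

    return prob

def explain(record, prob, parents):
    """Return (log-likelihood, variable whose conditional term is least likely)."""
    terms = {v: math.log(prob(v, record)) for v in parents}
    return sum(terms.values()), min(terms, key=terms.get)

normal_star = {"type": "star", "colour": "blue", "size": "point"}
normal_galaxy = {"type": "galaxy", "colour": "red", "size": "extended"}
odd_source = {"type": "star", "colour": "blue", "size": "extended"}
records = [normal_star] * 50 + [normal_galaxy] * 50 + [odd_source]

prob = learn_cpts(records, PARENTS)
ranked = sorted(records, key=lambda r: explain(r, prob, PARENTS)[0])  # least likely first
score, culprit = explain(ranked[0], prob, PARENTS)
```

The per-variable terms are what allow a scientist to see that the lowest-ranked record is unusual because of one specific conditional probability, and to flag that pattern as a known error on the next pass.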

Page 27

Will Only Get Worse

LSST will do an SDSS every 5 nights looking for transient objects producing petabytes of data (2007)

VISTA will collect 300 Terabytes of data (2005)

Archival Science is upon us! HST database has 20GBytes per day downloaded (10 times more than goes in!)

Page 28

Will Only Get Worse II

Surveys spanning electromagnetic spectrum

Combining these surveys is hard: different sensitivities, resolutions and physics

Mixture of imaging, catalogs and spectra

Difference between continuum and point processes

Thousands of attributes per source

Page 29

What is VO?

The “Virtual Observatory” must:

Federate multi-wavelength data sources (interoperability)

Empower everyone (democratise)

Be fast, distributed and easy

Allow input and output

Page 30

Computer Science + Statistics!

Scientists will need help through autonomous scientific discovery of large, multi-dimensional, correlated datasets

Scientists will need fast databases

Scientists will need distributed computing and fast networks

Scientists will need new visualization tools

CS and Statistics looking for new challenges: Also no data-rights & privacy issues

New breed of students needed with IT skills

Symbiotic Relationship

Page 31

VO Prototype

Ideally we would like all parts of the VO to be web services

[Diagram: components labelled DB, C#, dym, EM, .NET and http]

Page 32

Lessons We Learnt

Tough to marry research C code developed under Linux to MS (pointers to memory)

.NET has “unsafe” memory

.NET server is hard to set up!

Migrate to using VOTables to perform all I/O.

Have server running at CMU so we can control code

Page 33

Very High Dimensions

Using LLE and Isomap; looking for lower dimensional manifolds in higher dimensional spaces

500x2000 space from SDSS spectra

Page 34

Summary

Era of New Cosmology: Massive data sources and search for subtle features & high precision measurements

Need new methods that scale into these new regimes; “a virtual universe” (students will need different skills). Perfect synergy with Stats, CS, Physics

Good algorithms are as good as faster and more computers!

The “glue” to make a “virtual observatory” is hard and complex. Don’t under-estimate the job

Page 35

Are the Features Real? (FDR)!

This is an example of multiple hypothesis testing, e.g. is every point consistent with a smooth p(k)?

Page 36

Let us first look at a simulated example: consider a 1000x1000 image with 40000 sources.

Method       Sources detected   False detections   Sources missed   Background correctly rejected
FDR          30389              1505               9611             958495
2sigma       31497              22728              8503             937272
Bonferroni   27137              0                  12863            960000

FDR makes 15 times fewer mistakes for the same power as traditional 2-sigma

Why? Controls a scientifically meaningful quantity: FDR = No. of false discoveries/Total no. of discoveries
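The slides do not say which procedure was used to control the FDR. A common choice is the Benjamini-Hochberg step-up rule, sketched here as an assumption, together with the realised false-discovery fraction computed from the simulated counts on this slide. The function name and the example p-values are illustrative.

```python
def benjamini_hochberg(pvalues, alpha):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    # Step-up rule: find the largest rank whose p-value is below alpha * rank / m.
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True   # everything up to the cutoff rank is rejected
    return rejected

# Illustrative p-values only; they are not taken from the talk.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_rejected = sum(benjamini_hochberg(example, 0.05))

# Realised false-discovery fraction from the simulated counts on this slide:
# false detections / (true detections + false detections).
fdr_fraction = 1505 / (30389 + 1505)
two_sigma_fraction = 22728 / (31497 + 22728)
mistake_ratio = 22728 / 1505   # the "15 times fewer mistakes" quoted above
```

The threshold scales with the rank of each p-value, so it tightens or relaxes with the number of tests and with how many of them look significant, in contrast to the fixed per-test cut of a 2-sigma rule.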

Page 37

And it is adaptive to the size of the dataset

Page 38

We used an FDR of 0.25, i.e. on average up to 25% of circled points are in error

Therefore, we can say with statistical rigor that most of these points are rejected and are thus “features”

No single point is a 3sigma deviation

New statistics has enabled an astronomical discovery