Information Visualization, Nonlinear Dimensionality Reduction and Sampling for Large and Complex Data Sets

Misha Pesenson, Isaac Pesenson*, Bruce McCollum

California Institute of Technology, *Temple University

1/7/2010, 215th AAS Meeting, Washington DC




Acknowledgment

We would like to thank Dr. Mike Egan for his support.

This work was carried out at the SSC, Caltech, and supported by the National Geospatial-Intelligence Agency, Grant # HM1582-08-1-0019.


Motivation

The Data Big Bang

The Expanding Digital Universe

Inflationary Epoch


Motivation (cont.)

Data are now produced faster than they can be meaningfully analyzed

Modern data are complex: dozens or hundreds of useful parameters are associated with each astronomical object

• LSST: The ten-year survey will result in tens of petabytes of image and catalog data and will require ~250 TFlops of processing to reduce.

• A discussion related to LSST can be found in: The Spectrum of LSST Data Analysis Challenges: Kiloscale to Petascale, 2010, by T. Loredo, G. Babu, K. Borne, E. Feigelson, A. Gray, 215th AAS


Motivation (cont.)

To capitalize on the opportunities provided by these data sets one needs to be able to organize, analyze and visualize them

Traditional methods are often inadequate not merely because of the size in bytes of the data sets, but also because of the complexity of modern data sets

To be successful, these approaches must extend beyond traditional scientific analysis and information visualization


Motivation (cont.)

Moreover, detecting the expected and discovering the unexpected in massive data sets requires a synergistic approach that utilizes recent advances in:

• Statistics
• Applied mathematics
• Computer science
• Artificial intelligence
• Machine learning
• Knowledge representation
• Cognitive and perceptual sciences
• Decision sciences, and more


Motivation (cont.)

Valuable results pertaining to these problems are found mostly in publications outside of astronomy

There is a big gap between applied mathematics, artificial intelligence and computer science on the one side and astronomy on the other


Goals of This Presentation

To draw the attention of the astronomical community to the aforementioned gap

To help bridge this gap by briefly reviewing some of the advanced methods

“To increase the general awareness and avoidance of unprincipled data analysis methods” (Xiao Li Meng, 2009, Desired and Feared—What Do We Do Now and Over the Next 50 Years?, American Statistician, v. 63, 3, 202-210).


Complex Data: Spectral Imaging


224 spectral channels


Astronomical Data Types and Approaches to their Representation and Processing

Data type: Vector data

• Some astronomical applications: 1. Multiwavelength observations. 2. Multitemporal observations. 3. VO. 4. Spectra.
• Traditional approaches to data: 1. Linear dimension reduction: PCA and its modifications.
• Advanced approaches to data: 1. Spectral methods, eigenmaps, diffusion maps, LLE, ISOMAP. 2. Sampling on graphs. 3. Methods based on nonlinear dynamics. 4. Neural networks. 5. Genetic algorithms. 6. Scientific visualization. 7. Compressed sensing.

Data type: Manifold-valued and/or manifold-defined data

• Some astronomical applications: 1. Polarization measurements (CMB). 2. Gravitational lensing. 3. Solar astrophysics.
• Traditional approaches to data: 1. Various sampling distributions on a sphere.
• Advanced approaches to data: 1. HEALPix (2D sphere). 2. Needlets. 3. Sampling on manifolds. 4. Scientific visualization.


Scientific Visualization vs. Illustrative Visualization

Scientific Visualization (SV) does not simply reproduce visible things, but makes things visible

SV enables extraction of meaningful patterns from multiparametric data sets


The Curse of Dimensionality and Dimension Reduction (DR)

Extraction and visualization of meaningful structures from multiparametric, high-dimensional data sets require an accurate low-dimensional representation of the data

DR is motivated by the fact that the more we are able to reduce the dimensionality of a data set, the more regularities (correlations) we have found in it and therefore, the more we have learned from the data

• Pesenson M., Pesenson I., McCollum B., 2010, “The Data Big Bang and the Expanding Digital Universe: High-Dimensional, Complex and Massive Data Sets in an Inflationary Epoch”, Advances in Astronomy, special issue on Robotic Astronomy (accepted)


Dimension Reduction (cont.)

Greatly increases computational efficiency of machine learning algorithms

Improves statistical inference

Enables effective scientific visualization and classification


Dimension Reduction: “Linear” Data, PCA


If the data are mainly confined to an almost linear low-dimensional subspace, then simple linear methods such as principal component analysis (PCA) can be used to discover the subspace and estimate its dimensionality
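The statement above can be illustrated with a minimal sketch, assuming NumPy and synthetic data. The function name `pca`, the 99% variance threshold, and the test data (a 2-D plane embedded in 10 dimensions with small noise) are all hypothetical choices made for illustration, not part of the presentation.

```python
import numpy as np

def pca(data, variance_threshold=0.99):
    """Return (principal directions, explained-variance ratios, estimated dimension)."""
    centered = data - data.mean(axis=0)
    # SVD of the centered data matrix; rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    # Smallest number of components whose cumulative variance reaches the threshold.
    estimated_dim = int(np.searchsorted(np.cumsum(explained), variance_threshold) + 1)
    return vt, explained, estimated_dim

rng = np.random.default_rng(0)
# Hypothetical data: 500 points near a 2-D linear subspace of a 10-D space.
latent = rng.normal(size=(500, 2))
basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))   # orthonormal columns, shape (10, 2)
data = latent @ basis.T + 0.01 * rng.normal(size=(500, 10))

components, explained, estimated_dim = pca(data)
print(estimated_dim, np.round(explained[:3], 4))
```

The estimated dimensionality depends on the chosen threshold and on the noise level, so in practice it is worth inspecting the full spectrum of explained variances rather than a single cutoff.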


Limitations of Linear Methods

Linear methods such as PCA have a serious drawback in that they do not explicitly consider the structure of the manifold on which the data may possibly reside

PCA is intrinsically linear, so if the data points form a nonlinear manifold, there is no rotation and shift of the axes (which is what a linear transform like PCA provides) that can “unfold” a manifold such as the one on the next slide:


Data Lying on Manifolds

Formally applying geometrically linear methods would produce a complete misrepresentation of the data
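A small numerical illustration of this point, under assumptions of our own: the data are a synthetic spiral (a 1-D manifold embedded in 2-D), and the quality of the linear projection is measured by a rank correlation between the position along the curve and the first principal component. Neither the spiral nor this measure comes from the presentation.

```python
import numpy as np

# Hypothetical example: an Archimedean spiral, a 1-D manifold embedded in 2-D.
t = np.linspace(0.5 * np.pi, 4.5 * np.pi, 400)   # intrinsic coordinate along the curve
points = np.column_stack((t * np.cos(t), t * np.sin(t)))

# Linear projection onto the first principal component.
centered = points - points.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ vt[0]

def rank_correlation(a, b):
    """Spearman rank correlation computed with NumPy only (assumes no ties)."""
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    return np.corrcoef(ranks_a, ranks_b)[0, 1]

# If PCA had "unfolded" the spiral, the ordering along PC1 would match the
# ordering along the curve and the absolute correlation would be close to 1.
print(abs(rank_correlation(t, pc1)))
```

A value well below 1 means that points far apart along the curve are mapped close together by the linear projection, which is the misrepresentation the slide refers to.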


Data Lying on Manifolds + Noise (Balasubramanian, Schwartz 2002)

The practical usage of dimension reduction demands:

Representation of measurement errors in high-dimensional instrument calibration

• Connors A., van Dyk D., Freeman P., Kashyap V., Siemiginowska A., et al. 2008

Careful improvement of signal-to-noise ratio without smearing essential features

• Pesenson M., Roby W., McCollum B., 2008


Handling Geometrically Nonlinear Data

The modern approach to multidimensional images or data sets is to approximate them by graphs or Riemannian manifolds

Next, after constructing a weighted graph, one can introduce the corresponding combinatorial Laplace operator

• Belkin M., Niyogi P., 2005; Coifman R., Lafon S., 2006

• Application to astronomy: Richards J., Freeman P., Lee A., & Schafer C., 2009
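One common way to build such a weighted graph and its combinatorial Laplacian L = D - W is sketched below. The k-nearest-neighbor construction, the Gaussian edge weights, and the names `graph_laplacian`, `n_neighbors`, and `sigma` are assumptions made for illustration; the cited papers discuss several alternative constructions.

```python
import numpy as np

def graph_laplacian(points, n_neighbors=8, sigma=1.0):
    """Combinatorial Laplacian L = D - W of a symmetric k-nearest-neighbor graph."""
    n_points = len(points)
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs**2, axis=-1)
    # Gaussian (heat-kernel) edge weights; sigma is an assumed bandwidth.
    weights = np.exp(-sq_dists / (2.0 * sigma**2))
    np.fill_diagonal(weights, 0.0)
    # Keep each point's k nearest neighbors (column 0 is the point itself),
    # then symmetrize so that the graph is undirected.
    nearest = np.argsort(sq_dists, axis=1)[:, 1:n_neighbors + 1]
    mask = np.zeros((n_points, n_points), dtype=bool)
    mask[np.repeat(np.arange(n_points), n_neighbors), nearest.ravel()] = True
    mask |= mask.T
    weights = np.where(mask, weights, 0.0)
    degree = np.diag(weights.sum(axis=1))
    return degree - weights

rng = np.random.default_rng(1)
points = rng.normal(size=(200, 5))        # hypothetical 5-parameter data set
laplacian = graph_laplacian(points)
print(laplacian.shape)
```

By construction the matrix is symmetric, its rows sum to zero, and it is positive semidefinite. This dense version stores all pairwise distances, so it is only suitable for small data sets; a sparse neighbor search would be needed at scale.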


Nonlinear Dimension Reduction as an Approach to Nonlinear Data

The eigenfunctions and eigenvalues of the Laplacian form a basis, thus allowing one to develop a harmonic or Fourier analysis on graphs

This set of basis functions captures patterns intrinsic to a particular state space

Nonlinear dimension reduction finds a lower-dimensional representation of high-dimensional data without losing a significant amount of information
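The sketch below illustrates both ideas on a synthetic spiral: the Laplacian eigenvectors serve as a Fourier-like basis for signals on the graph, and the first non-constant eigenvector gives a one-dimensional embedding in the spirit of Laplacian eigenmaps. The spiral, the unit edge weights, and the value of `n_neighbors` are assumptions for illustration only and are not the authors' algorithms.

```python
import numpy as np

# Hypothetical data: a spiral, i.e. a 1-D manifold embedded in 2-D.
t = np.linspace(0.5 * np.pi, 4.5 * np.pi, 400)
points = np.column_stack((t * np.cos(t), t * np.sin(t)))
n_points = len(points)

# Symmetric k-nearest-neighbor graph with unit edge weights (an assumed choice).
n_neighbors = 6
sq_dists = np.sum((points[:, None, :] - points[None, :, :])**2, axis=-1)
nearest = np.argsort(sq_dists, axis=1)[:, 1:n_neighbors + 1]
adjacency = np.zeros((n_points, n_points))
adjacency[np.repeat(np.arange(n_points), n_neighbors), nearest.ravel()] = 1.0
adjacency = np.maximum(adjacency, adjacency.T)

# Combinatorial Laplacian and its eigen-decomposition (eigenvalues ascending).
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)

# Graph Fourier transform of a signal defined on the nodes, and its inverse.
signal = np.sin(t)
coefficients = eigenvectors.T @ signal
reconstructed = eigenvectors @ coefficients

# One-dimensional embedding: the first non-constant eigenvector.
embedding = eigenvectors[:, 1]
ranks_t = np.argsort(np.argsort(t))
ranks_e = np.argsort(np.argsort(embedding))
rank_agreement = abs(np.corrcoef(ranks_t, ranks_e)[0, 1])
print(rank_agreement)
```

A rank agreement close to 1 indicates that the embedding orders the points along the curve, which a linear projection of the same spiral cannot do.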


Nonlinear Dimension Reduction and Harmonic Analysis on Manifolds and Graphs

We have devised innovative algorithms for nonlinear data dimension reduction and data compression. These algorithms:

• enable one to overcome PCA’s limitations for handling nonlinear data manifolds

• allow one to deal effectively with 1) missing observations, 2) partial sky coverage, 3) non-regular sampling

For details:

• Pesenson I., 2009, J. of Geometric Analysis, 19 (2), 390

• Pesenson I., Pesenson M., 2010, J. of Math. Analysis and Applications, accepted

• Pesenson I., Pesenson M., 2010, J. of Fourier Analysis and Applications, accepted

• Pesenson M., Pesenson I., McCollum B., 2010, Advances in Astronomy, accepted


Visualization - Multispectral


From a set of images obtained at multiple wavebands, effective dimension reduction provides a single comprehensible, information-rich image with minimal loss of information and statistical detail, unlike simple coadding with arbitrary, empirical weights
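To make the contrast with arbitrary coadding weights concrete, the sketch below uses plain PCA as a simple linear stand-in for the dimension reduction step; it is not the nonlinear method of the presentation. The synthetic cube, the band gains, and the noise level are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical cube of 5 co-registered wavebands, 32 x 32 pixels each.
n_bands, height, width = 5, 32, 32
source = rng.random((height, width))                    # shared underlying structure
gains = np.array([1.0, 0.8, 0.5, 0.2, 0.1])             # assumed band responses
noise = 0.05 * rng.normal(size=(n_bands, height, width))
cube = gains[:, None, None] * source + noise

pixels = cube.reshape(n_bands, -1).T                    # one row per pixel
centered = pixels - pixels.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

def captured_variance(band_weights):
    """Fraction of the total variance kept by one weighted combination of bands."""
    unit = band_weights / np.linalg.norm(band_weights)
    return np.var(centered @ unit) / np.sum(np.var(centered, axis=0))

pca_image = (centered @ vt[0]).reshape(height, width)   # data-derived band weights
coadd_fraction = captured_variance(np.ones(n_bands))    # equal-weight coadd
pca_fraction = captured_variance(vt[0])
print(pca_fraction, coadd_fraction)
```

Among all single linear combinations of the bands, the first principal component retains the largest share of the variance, so it can never do worse than a fixed set of coadding weights by this measure. Variance is only one proxy for information content, which is part of the motivation for the nonlinear methods discussed in this presentation.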


Manifold-Valued Data and Data Lying on Manifolds

Applications:

Cosmic Microwave Background (CMB)

• Gorski K., et al. 2005

Solar Astrophysics

A powerful approach to the problem is based on needlets, second-generation spherical wavelets

• Geller D., & Marinucci D., 2008


Manifold-Valued Data and Data Lying on Manifolds (cont.)

Important properties of needlets that are not shared by other spherical wavelet constructions:

• they do not rely on any kind of tangent plane approximation
• they have good localization properties in both pixel and harmonic space
• needlet coefficients are asymptotically uncorrelated at any fixed angular distance (which makes their use in statistical procedures very promising)

• Pesenson, I., 2006, Integral Geometry and Tomography, Contemporary Mathematics, 405, 135-148, American Mathematical Society

• Geller D., Pesenson I., 2010, Tight Frames and Besov Spaces on Compact Homogeneous Manifolds, J. of Geometric Analysis (accepted)


Unsupervised Manifold Learning and Information Visualization

Manifold Learning and Visualization based on Nonlinear Dynamics

One needs to distinguish between geometrically nonlinear data and nonlinear methods of analysis


Unsupervised Manifold Learning – A Nonlinear Approach

Approximating a multidimensional image or a data set by a graph and associating a nonlinear dynamical system with each node enables us to unify three seemingly unrelated tasks:

• image segmentation
• unsupervised learning
• data visualization


Testing the Algorithm: a Simulated 3D Set of 10^3 Uniformly Distributed Random Points with a Double-Diamond Pattern

Left and middle: two screenshots from a running animation. Each point in the set oscillates (in this case in 3 dimensions) with its own random frequency

Right: synchronization makes the points that are connected by high-weight edges oscillate in phase, which reveals the pattern either visually or by automatically selecting in-phase oscillating points and highlighting the pattern in red

• Pesenson M., Pesenson I., McCollum B., 2010, Advances in Astronomy (accepted)

• Pesenson M., Pesenson I., 2010, Image Segmentation, Unsupervised Manifold Learning and Information Visualization: A Unified Approach Based on Nonlinear Dynamics (submitted)
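The synchronization idea can be sketched with the Kuramoto model of coupled phase oscillators. This model is our assumption; the presentation does not specify the dynamical system used. The data are also hypothetical: a compact 60-point cluster stands in for the double-diamond pattern, with 300 background points, and the values of `sigma`, `dt`, and `n_steps` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 300 uniform background points in the unit cube plus a
# compact 60-point pattern (a stand-in for the double-diamond pattern).
background = rng.random((300, 3))
pattern = 0.5 + 0.015 * rng.normal(size=(60, 3))
points = np.vstack((pattern, background))
is_pattern = np.arange(len(points)) < len(pattern)

# Gaussian edge weights: nearby points are joined by high-weight edges.
sigma = 0.03
sq_dists = np.sum((points[:, None, :] - points[None, :, :])**2, axis=-1)
weights = np.exp(-sq_dists / (2.0 * sigma**2))
np.fill_diagonal(weights, 0.0)

# One phase oscillator per node, each with its own random natural frequency
# and a random initial phase.
frequencies = rng.uniform(0.5, 1.5, size=len(points))
phases = rng.uniform(0.0, 2.0 * np.pi, size=len(points))

# Forward-Euler integration of the Kuramoto equations
#   d(theta_i)/dt = omega_i + sum_j W_ij * sin(theta_j - theta_i).
dt, n_steps = 0.01, 600
for _ in range(n_steps):
    phase_diffs = phases[None, :] - phases[:, None]   # entry (i, j) = theta_j - theta_i
    coupling = np.sum(weights * np.sin(phase_diffs), axis=1)
    phases = phases + dt * (frequencies + coupling)

def coherence(theta):
    """Kuramoto order parameter: 1 means fully in phase, near 0 means incoherent."""
    return np.abs(np.mean(np.exp(1j * theta)))

print(coherence(phases[is_pattern]), coherence(phases[~is_pattern]))
```

In this sketch the pattern labels are known in advance and are used only to compare the two coherence values. An automatic selection step would instead group points by their final phases, for example by clustering the phases on the circle.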


Conclusions

Many important challenges have been identified by various authors and presentations

Different groups have already been working on some of the problems:

• The Center for Astrostatistics at PSU (E. Feigelson, G. Babu)
• BIPS at Cornell (T. Loredo)
• InCA at CMU (C. Schafer et al.)
• SAMSI-SaFeDe Collaboration (V. Kashyap et al.)
• Caltech (M. Pesenson et al.)
• Caltech (G. Djorgovski et al.)
• AstroNeural collaboration (G. Longo et al.)
• Georgia Tech (A. Gray et al.)
• GMU (K. Borne et al.)
• IIC at Harvard (A. Goodman et al.)


Conclusions (cont.)

The concepts and approaches described in this presentation also contribute concrete steps toward the novel approaches and algorithms that are needed

Combined, the described efforts will enable effective automated analysis and processing of giant, complex data sets such as those of LSST
