Diffusion Geometries, Diffusion Wavelets and Harmonic Analysis of large data sets. R. R. Coifman, S. Lafon, MM. Mathematics Department, Program of Applied Mathematics, Yale University

Post on 01-Aug-2020

  • Motivations

    The main problem is to analyse large amounts of data in high dimensions.

    Paradigm: we have a large number of documents (e.g. web pages, gene array data, (hyper)spectral data, molecular dynamics data, etc.) and a way of measuring similarity between pairs.

    Model: a graph (G,E,W). In some cases the vertices are points in high-dimensional Euclidean space and the weights are a function of Euclidean distance.

    Problems

    • Understand data sets in high dimensions, and classes of functions on them

    • Approximation and “learning” of such functions

    • Parametrize low-dimensional data sets embedded in high dimensions

    • Fast algorithms
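
As a minimal sketch of the graph model, assuming a Gaussian kernel as the function of Euclidean distance: the kernel choice, the scale `eps`, and the helper name `gaussian_weights` are illustrative assumptions, not something the slides specify.

```python
import numpy as np

def gaussian_weights(points, eps):
    """Return W[i, j] = exp(-||x_i - x_j||^2 / eps) for the rows x_i of `points`."""
    diff = points[:, None, :] - points[None, :, :]   # pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)             # squared Euclidean distances
    return np.exp(-sq_dist / eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))      # 5 hypothetical "documents" in 10 dimensions
W = gaussian_weights(X, eps=10.0)
```

The resulting `W` is symmetric with ones on the diagonal, and similar points receive weights close to 1.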

  • High dimensional data: examples

    • Biotech data (gene arrays, proteomic data)

    • Customer databases: companies collect and process information on (potential) customers

    • Financial data

    • Web searching

    • Satellite imagery

    However...

    • In many situations constraints force the data to lie on sets which have a very small intrinsic dimensionality compared to that of the ambient space.

    • In the case of graphs, or arbitrary metric spaces, there are notions of intrinsic complexity, or of embeddability in low dimensional Hilbert spaces.

  • Curse of dimensionality

    The high dimension is an obstacle to the processing of the data:

    • Approximation of functions: to represent $C^1$ functions on a grid in dimension $n$ with accuracy $\epsilon$, one needs $\epsilon^{-n}$ grid points

    • Density estimation is difficult: one needs a lot of data points, otherwise most bins are empty

    • The computational cost of many algorithms grows exponentially with the dimension (e.g. nearest neighbor search, Fast Multipole Method)
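
To make the grid count concrete, here is a minimal sketch, assuming the unit cube in dimension n and a uniform grid of spacing `eps`. The function name `grid_points` is hypothetical.

```python
def grid_points(eps, n):
    """Approximate number of grid points, eps^(-n), for spacing eps in dimension n."""
    per_axis = round(1 / eps)   # points needed along each coordinate axis
    return per_axis ** n

# With accuracy 0.1, the count grows exponentially with the dimension.
for n in (1, 2, 10, 100):
    print(n, grid_points(0.1, n))
```

Even at the modest accuracy 0.1, dimension 100 would already require 10^100 grid points.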

  • Diffusion Geometries (R. R. Coifman & S. Lafon)

    Geodesic distance ---> Diffusion distance

    Diffusion distance is more stable: it uses a “preponderance of evidence”.

  • On the graph of “documents” with similarities there is a natural random walk: we get a Markov chain represented by a matrix P(x,y). If P is symmetric and positive semidefinite, with eigenvalues $\lambda_j$ and orthonormal eigenfunctions $\varphi_j$, and $p_m(x,y)$ denotes the kernel of $P^m$, we can define the diffusion distance by

    $$D_{2m}^2(x,y) = p_{2m}(x,x) + p_{2m}(y,y) - 2\,p_{2m}(x,y) = \|p_m(x,\cdot) - p_m(y,\cdot)\|^2 = \sum_{j \ge 0} \lambda_j^{2m} \bigl(\varphi_j(x) - \varphi_j(y)\bigr)^2 .$$

    Geometric diffusion map: $x \mapsto X_m(x) = \{\lambda_i^m \varphi_i(x)\}_i \in \ell^2$.

    This map embeds the graph in Euclidean space, up to a given precision, via the eigenfunctions, mapping diffusion distance into Euclidean distance. For a set of points in Euclidean space, sampled from a Riemannian manifold, one can build a discretized Laplace-Beltrami operator (associated to the canonical Brownian motion constrained on the manifold) and map the manifold with diffusion distance isometrically into Euclidean space.
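
The claim that diffusion distance becomes Euclidean distance under the eigenfunction map can be checked numerically. The sketch below rests on several assumptions that are not in the slides: a Gaussian kernel, random sample points, and a symmetric normalization of the kernel (which shares its eigenvalues with the row-stochastic random-walk matrix) so that `P` is symmetric and positive semidefinite. It compares the distance computed from the kernel of P^(2m) with the squared Euclidean distance between the embedded points whose coordinates are lam_i^m * phi_i(x).

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(30, 3))                       # hypothetical sample points

# Gaussian weights (an assumed kernel), then symmetric normalization.
sq = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq / 2.0)
d = W.sum(axis=1)
P = W / np.sqrt(np.outer(d, d))                      # symmetric, positive semidefinite

lam, phi = np.linalg.eigh(P)                         # P = phi @ diag(lam) @ phi.T
lam = np.clip(lam, 0.0, None)                        # remove tiny negative round-off

m = 3
P2m = np.linalg.matrix_power(P, 2 * m)               # kernel of P^(2m)
diag = np.diag(P2m)
D2_kernel = diag[:, None] + diag[None, :] - 2 * P2m  # p(x,x) + p(y,y) - 2 p(x,y)

emb = phi * lam ** m                                 # row x holds {lam_i^m phi_i(x)}
D2_embed = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)

print(np.allclose(D2_kernel, D2_embed))
```

In practice one would keep only the leading coordinates of `emb`, since the factors `lam ** m` decay quickly; that truncation is what an embedding "up to precision" refers to.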

  • [Figure: original points and their embeddings]

  • [Figure: panels Phi1, Phi2, Phi3]

  • Diffusion Wavelets (R. R. Coifman & MM)

    Eigenfunctions are like a global Fourier analysis on the data set: they live in different “frequency bands” but are not localized. We would like to have elements localized both in frequency and in space (compatibly with Heisenberg principles), and critically sampled at the “rate” corresponding to the frequency band.

    Where are the “frequencies”?

    [Figure: curves labelled (T^2), (T^4), (T^8), (T^16), plotted on a horizontal axis from 0 to 30 and a vertical axis from 0 to 1, with the subspaces V0, V1, V2, V3, ... marked]
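
The role of the dyadic powers can be illustrated numerically. The sketch below is not the diffusion wavelet construction itself. It only counts, for an assumed diffusion operator `T` (a lazy random walk on a cycle graph, chosen here as a hypothetical example), how many eigenvalues of T^(2^j) stay above a precision `eps`. The shrinking count suggests why coarser scales need fewer basis elements.

```python
import numpy as np

n = 64
# Lazy random walk on a cycle of n vertices:
# stay with probability 1/2, step left or right with probability 1/4 each.
T = 0.5 * np.eye(n)
for i in range(n):
    T[i, (i + 1) % n] += 0.25
    T[i, (i - 1) % n] += 0.25

lam = np.linalg.eigvalsh(T)          # T is symmetric; eigenvalues lie in [0, 1]
eps = 1e-6                           # assumed precision
ranks = []
for j in range(1, 8):
    power = 2 ** j
    # Number of eigenvalues of T^(2^j) above eps (its eps-numerical rank).
    ranks.append(int(np.sum(lam ** power > eps)))
print(ranks)
```

The list `ranks` is non-increasing in j: each squaring of the operator pushes more of the spectrum below the precision threshold.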

  • Multiresolution diffusion wavelet construction of orthonormal diffusion scaling functions.

  • All this can be done in order n log(n) operations, where n is the cardinality of the space!

  • Fast multipole method for generalized potentials

  • [Figure: four plots over the horizontal range 0 to 300, with panel labels 8, 12, 12 and 15]

  • [Figure: a plot over the horizontal range 0 to 300 with panel label 16, and a two-dimensional plot of the pair of functions with indices (16,2) and (16,3) evaluated at x]

  • Comments, Applications, etc...

    • This is wavelet analysis on manifolds (and more, e.g. fractals), graphs and Markov chains, while Laplacian eigenfunctions do Fourier analysis on manifolds (and fractals, etc.).

    • We are “compressing” powers of the operator, functions of the operator, subspaces of the function subspaces on which its powers act (Heisenberg principle...), and the space itself (sampling theorems, quadrature formulas...).

    • We are constructing a biorthogonal version of the transform (better adapted to studying Markov chains) and wavelet packets: this will allow efficient denoising, compression and discrimination on all the spaces mentioned above.

    • The multiscale spaces are a natural scale of complexity spaces for learning empirical functions on the data set.

    • Diffusion wavelets extend outside the set, in a natural multiscale fashion.

    • To be tied with measure-geometric considerations used to embed metric spaces in Euclidean spaces with small distortion.

    • Study and compression of dynamical systems.