so much data
Post on 30-Dec-2015
So Much Data
Bernard Chazelle
Princeton University
So Little Time
So Many Slides
So Little Time
(before lunch)
computation
math experimentation
algorithms
Computers have two problems
1. They don’t have steering wheels
2. End of Moore’s Law
party’s over!
computation
algorithms experimentation
32 × 17
= 224 + 320
= 544
This is not me
FFT
RSA
noisy
low entropy
uncertain
unevenly priced
big
Biomedical imaging
Sloan Digital Sky Survey
4 petabytes (~1MG)
10 petabytes/yr
150 petabytes/yr
Collected works of Micha Sharir
My A(9,9)-th paper
massive input
output
Sublinear Algorithms
Sample tiny fraction
Shortest Paths [C-Liu-Magen ’03]
New York
Delphi
Ray Shooting
Volume Intersection Point location
Approximate MST [C-Rubinfeld-Trevisan ’01]
Reduces to counting connected components
E[estimate] = no. connected components
var[estimate] << (no. connected components)^2
whp, the estimate is a good estimator of the no. of connected components
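The counting step can be illustrated with a small sampling sketch. It relies on the identity that summing 1/|C(u)| over all vertices u, where C(u) is the component containing u, gives the number of connected components. The code below is a simplified illustration under stated assumptions, not the algorithm of the cited paper: the function names, the adjacency-list input, and the fixed size cap are all hypothetical choices made here. Components larger than the cap contribute zero, so the estimate can undercount by at most n/cap.

```python
import random
from collections import deque


def component_size_capped(adj, start, cap):
    """BFS from start; return the component size, or None if it exceeds cap."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if len(seen) > cap:
                    return None  # component too large: stop exploring early
                queue.append(v)
    return len(seen)


def estimate_components(adj, samples, cap, rng):
    """Estimate the number of connected components from a vertex sample.

    adj     : adjacency lists, adj[u] = neighbours of vertex u (assumed format)
    samples : number of random vertices to probe
    cap     : largest component size explored per probe (assumed parameter)
    rng     : a random.Random instance
    """
    n = len(adj)
    total = 0.0
    for _ in range(samples):
        u = rng.randrange(n)
        size = component_size_capped(adj, u, cap)
        if size is not None:
            total += 1.0 / size  # each vertex carries weight 1/|C(u)|
    return n * total / samples
```

The work per probe is bounded by the cap rather than by the size of the graph, which is what makes reading only a tiny fraction of the input possible.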
worst case
input space
average case (uniform)
worst case
average case = actuarial view
“OK, if you elect NOT to have the surgery, the insurance company offers 6 days and 7 nights in Barbados.”
arbitrary, unknown random source
Self-Improving Algorithms
Yes ! This could be YOU, too !
E[Tk] → Optimal expected time for random source
time T1
time T2
time T3
time T4
Clustering [Ailon-C-Liu-Comandur ’05]
K-median over Hamming cube
minimize sum of distances
[Kumar-Sabharwal-Sen ’04]
COST ≤ OPT (1 + ε)
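To make the objective concrete, here is a minimal sketch of the k-median cost over the Hamming cube, assuming points are given as equal-length tuples of 0/1 values. All function names are hypothetical. For the special case k = 1 with the center allowed anywhere in the cube, the coordinate-wise majority vote minimizes the sum of Hamming distances; the sketch includes that case only as an illustration and is not the KSS algorithm cited above.

```python
def hamming(a, b):
    """Number of coordinates in which two 0/1 tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def kmedian_cost(points, centers):
    """Sum over all points of the distance to the nearest center."""
    return sum(min(hamming(p, c) for c in centers) for p in points)


def majority_center(points):
    """Optimal single center (k = 1): coordinate-wise majority vote."""
    n = len(points)
    d = len(points[0])
    return tuple(
        1 if 2 * sum(p[j] for p in points) > n else 0
        for j in range(d)
    )


def within_factor(cost, opt, eps):
    """Check the approximation guarantee COST <= OPT (1 + eps)."""
    return cost <= opt * (1 + eps)
```

For example, with points (0,0,1), (0,1,1), (1,1,1) the majority center is (0,1,1) and the cost of that single center is 2.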
How to achieve linear limiting time?
Input space {0,1}^dn
prob < O(dn)/KSS
Identify core
Tail:
Use KSS
Store sample of precomputed KSS
Nearest neighbor
Incremental algorithm
Main difficulty: How to spot the tail?
encode
decode
Data inaccessible before noise
What makes you think it’s wrong?
Data inaccessible before noise
must satisfy some property
(eg, convex, bipartite)
but does not quite
f(x) = ?
x
f(x)
data
f = access function
But life being what it is…
Humans
Define distance from any object to data class
f(x) = ?
x
g(x)
x1, x2,…
f(x1), f(x2),…
filter
g is access function for:
Online Data Reconstruction
Monotone function: [n] → R^d
Filter requires polylog(n) lookups
[Ailon-C-Liu-Comandur ’04]
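For a one-dimensional sequence, the distance to the class of monotone functions and a nearest monotone repair can be sketched as follows. This is an offline illustration that reads every value, so it is not the polylog-lookup online filter described on the slide; the function names are hypothetical. It uses the fact that the minimum number of entries to change to make a sequence nondecreasing equals its length minus the length of a longest nondecreasing subsequence.

```python
from bisect import bisect_right


def lnds_indices(values):
    """Indices of one longest nondecreasing subsequence (patience sorting)."""
    tails = []      # tails[k]: smallest last value of a run of length k + 1
    tail_idx = []   # position in values of tails[k]
    prev = [-1] * len(values)
    for i, v in enumerate(values):
        k = bisect_right(tails, v)  # v extends a run of length k
        if k == len(tails):
            tails.append(v)
            tail_idx.append(i)
        else:
            tails[k] = v
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else -1
    keep = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:  # walk predecessor links back from the end of the run
        keep.append(i)
        i = prev[i]
    return keep[::-1]


def distance_to_monotone(values):
    """Fewest entries that must change to make values nondecreasing."""
    return len(values) - len(lnds_indices(values))


def reconstruct_monotone(values):
    """Return a nondecreasing g that differs from values in as few places as possible."""
    keep = set(lnds_indices(values))
    out = list(values)
    last = None
    for i in range(len(out)):
        if i in keep:
            last = out[i]
        elif last is not None:
            out[i] = last  # copy the previous kept value
    if keep:
        first = min(keep)
        for i in range(first):
            out[i] = values[first]  # entries before the first kept one
    return out
```

For example, the sequence 1, 5, 2, 3, 4 is at distance 1 from monotone, and the repair changes only the 5.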
Convex polygon
Filter requires : lookups
[C-Comandur ’06 ]
Convex terrain
Filter requires : lookups
Iterated planar separator theorem
Iterated (weak) planar separator theorem
in sublinear time!
Using epsilon-nets in spaces of unbounded VC dimension
reconstruct
bipartite graph
k-connectivity
expander
denoising low-dim attractor sets
Priced computation & accuracy
spectrometry/cloning/gene chip
PCR/hybridization/chromatography
gel electrophoresis/blotting
001100001010001111110011001101011100001100000101111o1o1100001100
Linear programming
Pricing data
Factoring is easy. Here’s why…
Gaussian mixture sample: 00100101001001101010101….
Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu
Avner Magen, Ronitt Rubinfeld, Luca Trevisan