exploringbigbraindata - db.in.tum.dedb.in.tum.de/hosted/bigdata/presentations/ailamaki.pdf ·...
TRANSCRIPT
Exploring Big Brain Data
Anastasia Ailamaki
with Farhan Tauheed, Thomas Heinis, and many others
Data-‐Intensive Applica9ons and Systems (DIAS) Laboratory School of Computer and Communica9on Sciences
THE SEQUENCE EXPLOSION
the human genome at ten (nature, 4/2010)
benefits hindered by complexity
scien4fic ques4ons and answers
Past three paradigms: • theory • simula4on • experimenta4on fourth paradigm:
data-‐driven science wasted data worst ROI ever!
ques4ons to answer today
Domain ScienIst: • how to find interes4ng data? • how to move quickly through data? • how to enable scien4fic discovery?
Computer ScienIst: • What’s in it for me?
4
SPATIAL QUERIES ON UNSTRUCTURED MESHES
5
How to find interes4ng data:
8
Point Query Q
o Queries n Point n Range n Feature
Element
Node
Range Query R
o Use in simulation n Show ground velocity at Q n Draw the temperature of R
Courtesy Cornell Fracture Group
must scale with model complexity
spa4al range queries
9
A
B
C
D
H G
EFQ
K
A
C
E
B
D F
G
H
Q
B A C D E F G H
K M N
X Y
Disk page B A F E C D G H
X Y
K M L N
R-Tree Index
L N K
M Minimum Bounding Rectangle
(MBR)
R-‐Tree indexing
Ight connecIvity hurts performance
10
directed local search
A
B
C
D
E
F
G H
Q
insight: trade accuracy of target with efficiency
Q
J. Kozloski et al., Iden9fying, tabula9ng, and analyzing contacts between branched neuron morphologies, IBM Journal for Research and Development, Issue 52, Number 1/2, 2008
retrieval of a spaIal range
11 querying increasingly denser data
SimulaIons: 10K neurons
human: 86,000,000,000 neurons
brain diseases in Europe: €800B
12 ?
dye loading & raster scanning
3D reconstrucIon
images courtesy of the Blue Brain Project
the Blue Brain Project
13
neo cortex
a neuron
brain simula4on
14
single neuron, modeled with 3D cylinders
idea: index the brain
c
brain indexing with R-‐trees
15
Bulk Loaded R-‐Trees: Hilbert packed R-‐Tree, STR R-‐Tree PR-‐Tree
0
10
20
30
40
50
0 0.5 1 1.5 2 2.5 3
ExecuI
on Tim
e [sec]
per 1
0K re
sults
Segments in Million
R-‐Tree
Without Overlap
need to scale with density
sparse dense
FLAT! idea: seed-‐then-‐crawl
2) CRAWLING: traverse and retrieve remaining objects
16
c
1) SEEDING: find any object in region
Reachability?
tesselaIon + linking enable crawling
0
50
100
150
200
250
300
50 100 150 200 250 300 350 400 450
Total Page Re
ads [Th
ousand
s]
Density [millions of elements / volume unit]
Hilbert R-‐Tree
STR R-‐Tree
PR-‐Tree
FLAT
17
known techniques
2012: 1 million neurons
FLAT scales with data density
FAST DATA EXPLORATION THROUGH INTERACTIVE QUERIES
18
How to move quickly through data:
structural analysis: naviga4on
19
ad-‐hoc analysis visualizaIon model
refinement use cases:
sequences follow a latent guiding structure
sequence of queries issued interacIvely
idea: prefetch next query
20
1st query DISK CPU
query execuIon (I/O) analyzing results
Ime
2nd query 3rd query
known techniques do not scale
prefetching opportunity
• model structures in graph, then traverse • iden4fy guiding structure: itera4ve pruning • max accuracy: incremental prefetching
SCOUT: content-‐aware prefetching
21
3rd query 1st query DISK CPU
2nd query
prefetching for future query residual I/O analyzing results
predicIon computaIon
Ime
70-‐98% hit rate, 4x-‐15x speedup
QUERY PROCESSING ALGORITHMS
22
How to enable scien4fic discovery:
touch detec4on Model Synapses electrical connec4ons betw. axons and dendrites
Data Challenge 100K neurons => 5B synapses 30GB addl space to store synapse data Human Brain => 100B neurons ~PBs
• efficient spa4al proximity queries • precise distance calcula4on
a major bopleneck in brain simulaIon
disk-‐based spa4al join
24
a3
b2
b1
a4
a1
a2
Mul4ple Assignment : PBSM
Dataset A Dataset B
Mul4ple Matching : S3
b1 a1
a1 a3 b3
b2 a2
a3
b4
a4
a1 1 1 1 1
1
2 2
2 2 2 3
3
4 4
4 4
data par44oning space par44oning
Algorithm Object ReplicaIon
SensiIve to DistribuIon
Duplicate Results
Filtering
PBSM S3
TOUCH: in-‐memory join
25
Phases: 1) Building (A) 2) Assignment (B) 3) Join (Plane Sweep)
Building using Dataset A
Root
Leaf
Level 1
Assignment using Dataset B
Dataset A Dataset B
A-‐data, B-‐space
TOUCH is a distribu4on-‐aware join
26
Algorithm Object ReplicaIon
SensiIve to DistribuIon
Duplicate Results
Filtering
PBSM S3 TOUCH
FILTERED: During Assignment Phase Discarded Without Joining
simula4on trace analysis Need accurate and fast queries to • Discover and explore neuro-‐circuit behavior • Compare to behavior of biological 4ssue • Understand plas4city
Typical trace file ~0.8TB for 100K neurons for only 1 second of simula4on
In-‐memory efficient access method limit use of
complex query analysis Storage capacity limits longer simula4on 4me
on-‐demand tracking moving data
4me
NEXT-‐GENERATION QUERY ENGINES
28
What’s in it for computer science:
neurons synapses
circuits
whole brain
cogniIon
molecules
data
unifying models
panerns
rules
knowledge
the human brain project
images from the Blue Brain Project, EPFL
integrate clinical and simulaIon data
access to private hospital data
30 no move, no copy
Source: “An Overview of Business Intelligence Technology”. S. Chaudhuri, U. Dayal, V. Narasayya. CACM August 2011
BI: create database to run queries
RAW DATA FILES
LOAD INTO DB
QUERY REPORT RESULTS
APPLICATIONS
data “locked” in DB for performance
NoDB: In-‐situ queries over never-‐before-‐seen data
Year,Make,Model,Description,Price 1997,Ford,E350,"ac, abs, moon",3000.00 1999,Chevy,"Venture ""Extended Edition""","",4900.00 1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00 1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00
Posi4onal Maps
also: indexing, file system integra4on != Classical Data Loading
User does not need to control when, what, how or where data is cached Caching
data-‐to-‐query Ime = 0
0
500
1000
1500
2000
2500
3000
MySQL PostgreSQL DBMS X PostgresRAW
ExecuI
on Tim
e (s)
Query Sequence
Load
PostgreSQL + NoDB = PostgresRaw
40 analyIcs queries on CSV data
comparable/compeIIve performance
Avg runIme of last 5 queries
DBMS X 6.56 s RAW 2.97 s
0
500
1000
1500
2000
2500
3000
MySQL PostgreSQL DBMS X PostgresRAW
ExecuI
on Tim
e (s)
PostgreSQL + NoDB = PostgresRaw
40 analyIcs queries on CSV data
NoDB gets progressively faster
now: run queries to create database MapReduce
Engine Rela4onal DBMS
Repor4ng Server Spreadsheet
Enterprise Search Engine
…
…
… External Data
Sources
Opera4onal Databases
ViDa: in-situ query engine
making a difference
• Solve the domain scienIst’s problems – not ours
• One* soluIon does not fit all – *query language, data model, data type, index
• Go back to the lab and apply findings – then write computer science papers
• Build mulFple bridges to sciences
36