exploringbigbraindata - db.in.tum.dedb.in.tum.de/hosted/bigdata/presentations/ailamaki.pdf ·...

Exploring Big Brain Data

Anastasia Ailamaki

with Farhan Tauheed, Thomas Heinis, and many others

Data-Intensive Applications and Systems (DIAS) Laboratory, School of Computer and Communication Sciences

THE SEQUENCE EXPLOSION

the human genome at ten (nature, 4/2010)

benefits hindered by complexity

scientific questions and answers

Past three paradigms:
• theory
• simulation
• experimentation

fourth paradigm: data-driven science

wasted data: worst ROI ever!

questions to answer today

Domain Scientist:
• how to find interesting data?
• how to move quickly through data?
• how to enable scientific discovery?

Computer Scientist:
• What’s in it for me?

SPATIAL QUERIES ON UNSTRUCTURED MESHES

How to find interesting data:

[Figure: mesh with a Node and an Element marked, Point Query Q, and Range Query R]

• Queries
  – Point
  – Range
  – Feature

• Use in simulation
  – Show ground velocity at Q
  – Draw the temperature of R

Courtesy Cornell Fracture Group

must scale with model complexity

spatial range queries

[Figure: R-Tree Index over objects A-H and K-N with query Q; nodes X and Y are stored in disk pages and bounded by Minimum Bounding Rectangles (MBRs)]

R-Tree indexing

tight connectivity hurts performance
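A minimal sketch of how a range query uses MBRs to decide which disk pages to read. It is a flat, one-level stand-in for an R-Tree, not a full implementation; the page names, object coordinates, and function names are assumptions made for illustration. It shows that a page is read whenever its MBR overlaps the query, even when none of its objects does.

```python
# Boxes are (xmin, ymin, xmax, ymax). All names and data here are illustrative.

def intersects(a, b):
    """True if two axis-aligned boxes overlap (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def mbr(boxes):
    """Minimum bounding rectangle of a list of boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def range_query(pages, query):
    """Return (sorted matching object names, number of pages read)."""
    hits, pages_read = [], 0
    for objects in pages.values():
        # A page is read whenever its MBR overlaps the query,
        # even if none of the objects stored on it does.
        if intersects(mbr(list(objects.values())), query):
            pages_read += 1
            hits += [n for n, box in objects.items() if intersects(box, query)]
    return sorted(hits), pages_read

pages = {
    "X": {"A": (0, 0, 1, 1), "B": (4, 4, 5, 5)},  # MBR (0, 0, 5, 5)
    "Y": {"C": (2, 0, 3, 1), "D": (2, 3, 3, 4)},  # MBR (2, 0, 3, 4)
}
print(range_query(pages, (2.2, 1.5, 2.8, 2.5)))  # ([], 2): two reads, no result
print(range_query(pages, (0.5, 0.5, 0.6, 0.6)))  # (['A'], 1)
```

The first query falls in the empty space covered by both MBRs, so both pages are read for nothing; denser, more tightly interleaved data makes such overlap more frequent.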


directed local search


insight: trade accuracy of target with efficiency

J. Kozloski et al., Identifying, tabulating, and analyzing contacts between branched neuron morphologies, IBM Journal of Research and Development, Issue 52, Number 1/2, 2008

retrieval of a spatial range

querying increasingly denser data

Simulations: 10K neurons

human: 86,000,000,000 neurons

brain diseases in Europe: €800B

?

dye loading & raster scanning

3D reconstruction

images courtesy of the Blue Brain Project

the Blue Brain Project

neocortex

a neuron

brain simulation

single neuron, modeled with 3D cylinders

idea: index the brain

brain indexing with R-trees

Bulk-loaded R-Trees: Hilbert-packed R-Tree, STR R-Tree, PR-Tree

[Plot: execution time [sec] per 10K results vs. segments in millions (0 to 3); series: R-Tree, Without Overlap]

need to scale with density

sparse dense

FLAT! idea: seed-then-crawl

1) SEEDING: find any object in region

2) CRAWLING: traverse and retrieve remaining objects

Reachability?

tessellation + linking enable crawling
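The two phases can be sketched as follows. This is a toy illustration under stated assumptions, not the FLAT implementation: the tessellation is a regular grid of unit cells, links connect edge-adjacent cells, and seeding is a linear scan standing in for whatever index finds the first object. All function names are hypothetical.

```python
from collections import deque

def intersects(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def build_grid(n):
    """Assumed tessellation: n x n unit cells, linked to edge-adjacent cells."""
    boxes = {(i, j): (i, j, i + 1, j + 1) for i in range(n) for j in range(n)}
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    links = {(i, j): [(i + di, j + dj) for di, dj in steps
                      if (i + di, j + dj) in boxes]
             for (i, j) in boxes}
    return boxes, links

def seed(boxes, query):
    """SEEDING: find any one object in the region (linear scan stand-in)."""
    return next((k for k, box in boxes.items() if intersects(box, query)), None)

def crawl(boxes, links, query, start):
    """CRAWLING: follow links, expanding only through objects in the region."""
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in links[current]:
            if neighbor not in seen and intersects(boxes[neighbor], query):
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

boxes, links = build_grid(4)
query = (1.2, 1.2, 2.8, 2.8)
result = crawl(boxes, links, query, seed(boxes, query))
print(sorted(result))  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

Crawling reaches every object in the region only because the tessellation leaves no gaps between linked neighbors; without that, objects in the range could be unreachable from the seed.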

[Plot: total page reads [thousands] vs. density [millions of elements / volume unit] (50 to 450); series: Hilbert R-Tree, STR R-Tree, PR-Tree, FLAT]

known techniques

2012: 1 million neurons

FLAT scales with data density

FAST DATA EXPLORATION THROUGH INTERACTIVE QUERIES


How to move quickly through data:

structural analysis: navigation

use cases: ad-hoc analysis, visualization, model refinement

sequences follow a latent guiding structure

sequence of queries issued interactively

idea: prefetch next query

[Timeline: DISK and CPU activity over time for the 1st, 2nd, and 3rd queries, alternating query execution (I/O) and analyzing results]

known techniques do not scale

prefetching opportunity

• model structures in graph, then traverse
• identify guiding structure: iterative pruning
• max accuracy: incremental prefetching

SCOUT: content-aware prefetching
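A heavily simplified sketch of the content-aware idea, not SCOUT itself. Assumptions: each segment carries a structure id, queries are boxes, iterative pruning is approximated by keeping only the structures seen in every query so far, and the prefetch targets are the points where a surviving structure leaves the latest query box. All names and data are hypothetical.

```python
def inside(point, box):
    """True if a 2D point lies in an (xmin, ymin, xmax, ymax) box."""
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]

def structures_in(segments, box):
    """Ids of structures with at least one segment endpoint in the box."""
    return {sid for sid, p, q in segments if inside(p, box) or inside(q, box)}

def predict_next(segments, history):
    """Return likely centers for the next query, given past query boxes."""
    # Iterative pruning (simplified): keep structures present in every query.
    candidates = structures_in(segments, history[0])
    for box in history[1:]:
        found = structures_in(segments, box)
        candidates = (candidates & found) or found  # reset if nothing survives
    last, earlier = history[-1], history[:-1]
    targets = []
    for sid, p, q in segments:
        if sid in candidates and inside(p, last) != inside(q, last):
            outside = q if inside(p, last) else p
            # Skip the side the user already came from.
            if not any(inside(outside, box) for box in earlier):
                targets.append(outside)
    return sorted(targets)

# Structure n1 runs along the x axis; n2 crosses it vertically at x = 3.5.
n1 = [("n1", (x, 0), (x + 1, 0)) for x in range(5)]
n2 = [("n2", (3.5, y), (3.5, y + 1)) for y in (-1.5, -0.5, 0.5)]
segments = n1 + n2
history = [(0.5, -1, 2.5, 1), (2.5, -1, 4.5, 1)]
print(predict_next(segments, history))  # [(5, 0)]
```

With only the latest query as history, both structures remain candidates and four exit points are returned; the earlier query prunes n2 and the backward direction, leaving a single region to prefetch.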

[Timeline: DISK and CPU activity over time for the 1st, 2nd, and 3rd queries, showing prediction computation, prefetching for future query, residual I/O, and analyzing results]

70-98% hit rate, 4x-15x speedup

QUERY PROCESSING ALGORITHMS

How to enable scientific discovery:

touch detection

Model Synapses: electrical connections between axons and dendrites

Data Challenge:
• 100K neurons => 5B synapses
• 30GB additional space to store synapse data
• Human Brain => 100B neurons, ~PBs

• efficient spatial proximity queries
• precise distance calculation

a major bottleneck in brain simulation

disk-based spatial join

[Figure: objects a1-a4 of Dataset A and b1-b4 of Dataset B, partitioned for a join]

Multiple Assignment: PBSM

Multiple Matching: S3

data partitioning, space partitioning

[Table: PBSM and S3 compared on Object Replication, Sensitive to Distribution, Duplicate Results, Filtering]
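A minimal in-memory sketch of the multiple-assignment idea behind PBSM: space is cut into grid tiles, each object is copied into every tile it overlaps, tiles are joined independently, and duplicate result pairs are removed. The cell size, object coordinates, and function names are assumptions for illustration; the nested-loop join per tile is a simplification.

```python
from collections import defaultdict
from itertools import product

def intersects(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def tiles(box, cell):
    """All grid tiles (column, row) that a box overlaps."""
    x0, y0 = int(box[0] // cell), int(box[1] // cell)
    x1, y1 = int(box[2] // cell), int(box[3] // cell)
    return product(range(x0, x1 + 1), range(y0, y1 + 1))

def partition(dataset, cell):
    """Multiple assignment: replicate each object into every tile it overlaps."""
    grid = defaultdict(list)
    for name, box in dataset.items():
        for tile in tiles(box, cell):
            grid[tile].append(name)
    return grid

def pbsm_join(a, b, cell):
    """Return (set of intersecting pairs, number of comparisons made)."""
    grid_a, grid_b = partition(a, cell), partition(b, cell)
    pairs, comparisons = set(), 0
    for tile, names_a in grid_a.items():
        for na in names_a:
            for nb in grid_b.get(tile, ()):
                comparisons += 1
                if intersects(a[na], b[nb]):
                    pairs.add((na, nb))  # the set drops duplicate results
    return pairs, comparisons

a = {"a1": (0.5, 0.5, 3.5, 1.5), "a2": (5.0, 5.0, 5.5, 5.5)}
b = {"b1": (1.0, 1.0, 3.0, 1.2), "b2": (4.2, 4.2, 4.8, 4.8)}
print(pbsm_join(a, b, cell=2))  # ({('a1', 'b1')}, 3)
```

Here a1 and b1 both span two tiles, so the same pair is tested twice and found twice, which is the object replication and duplicate-result cost listed in the comparison.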

TOUCH: in-memory join

Phases:
1) Building (A)
2) Assignment (B)
3) Join (Plane Sweep)

[Figure: Building using Dataset A produces a hierarchy with Root, Level 1, and Leaf nodes; Assignment using Dataset B places Dataset B objects into it]

A-data, B-space

TOUCH is a distribution-aware join

[Table: PBSM, S3, and TOUCH compared on Object Replication, Sensitive to Distribution, Duplicate Results, Filtering]

FILTERED: During Assignment Phase Discarded Without Joining
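The three phases can be sketched roughly as below. This is an illustrative approximation, not the TOUCH implementation: the hierarchy is built by sorting Dataset A on xmin and grouping (an assumed stand-in for a real bulk-loading method), and the join uses nested loops where the slide names a plane sweep. Dataset A must be non-empty; all names are hypothetical.

```python
def intersects(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class Node:
    def __init__(self, children=(), a_objects=()):
        self.children = list(children)
        self.a_objects = list(a_objects)  # (name, box) pairs, leaves only
        self.b_objects = []               # filled during assignment
        boxes = [c.box for c in self.children] + [b for _, b in self.a_objects]
        self.box = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                    max(b[2] for b in boxes), max(b[3] for b in boxes))

def build(a, fanout):
    """Phase 1: build a hierarchy over Dataset A (simplified bulk load)."""
    items = sorted(a.items(), key=lambda kv: kv[1][0])
    level = [Node(a_objects=items[i:i + fanout])
             for i in range(0, len(items), fanout)]
    while len(level) > 1:
        level = [Node(children=level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

def assign(root, name, box):
    """Phase 2: push a B object down while exactly one child overlaps it."""
    if not intersects(root.box, box):
        return False                      # filtered: overlaps nothing in A
    node = root
    while node.children:
        hits = [c for c in node.children if intersects(c.box, box)]
        if not hits:
            return False                  # filtered: falls in empty space
        if len(hits) > 1:
            break                         # stays at this node
        node = hits[0]
    node.b_objects.append((name, box))
    return True

def leaf_objects(node):
    if not node.children:
        return node.a_objects
    return [obj for child in node.children for obj in leaf_objects(child)]

def join(node, out):
    """Phase 3: join each node's B objects with the A objects below it."""
    below = leaf_objects(node)
    for nb, b_box in node.b_objects:
        out.update((na, nb) for na, a_box in below if intersects(a_box, b_box))
    for child in node.children:
        join(child, out)

def touch_join(a, b, fanout=2):
    root = build(a, fanout)
    filtered = [name for name, box in b.items() if not assign(root, name, box)]
    pairs = set()
    join(root, pairs)
    return pairs, filtered

a = {"a1": (0, 0, 1, 1), "a2": (2, 0, 3, 1), "a3": (6, 0, 7, 1), "a4": (8, 0, 9, 1)}
b = {"b1": (0.5, 0.5, 2.5, 0.8), "b2": (4, 0, 5, 1),
     "b3": (2.5, 0.2, 6.5, 0.6), "b4": (20, 20, 21, 21)}
pairs, filtered = touch_join(a, b)
print(sorted(pairs))  # [('a1', 'b1'), ('a2', 'b1'), ('a2', 'b3'), ('a3', 'b3')]
print(filtered)       # ['b2', 'b4']
```

Each B object lands in exactly one node, so nothing is replicated and no duplicate pairs arise; b2 and b4 are discarded during assignment without any join work.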

simulation trace analysis

Need accurate and fast queries to
• Discover and explore neuro-circuit behavior
• Compare to behavior of biological tissue
• Understand plasticity

Typical trace file ~0.8TB for 100K neurons for only 1 second of simulation

In-memory efficient access method limits use of complex query analysis

Storage capacity limits longer simulation time

on-demand tracking of moving data over time

NEXT-‐GENERATION QUERY ENGINES


What’s in it for computer science:

[Figure labels: neurons, synapses, circuits, whole brain, cognition, molecules; data, unifying models, patterns, rules, knowledge]

the human brain project

images from the Blue Brain Project, EPFL

integrate clinical and simulation data

access to private hospital data

no move, no copy

Source: “An Overview of Business Intelligence Technology”. S. Chaudhuri, U. Dayal, V. Narasayya. CACM August 2011

BI: create database to run queries

RAW DATA FILES

LOAD INTO DB

QUERY REPORT RESULTS

APPLICATIONS

data “locked” in DB for performance

NoDB: In-situ queries over never-before-seen data

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00

Positional Maps

also: indexing, file system integration != Classical Data Loading

Caching: user does not need to control when, what, how or where data is cached

data-to-query time = 0
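A small sketch of querying a raw CSV file in place, assuming a much reduced form of positional maps and caching. Here the map only records the byte offset of each row (not of individual attributes), the cache holds every value already extracted, and an in-memory byte buffer stands in for the raw file. The `RawCsv` class and its fields are hypothetical, and quoted fields are assumed not to contain newlines.

```python
import csv
import io

SAMPLE = b'''Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00
'''

class RawCsv:
    def __init__(self, data):
        self.file = io.BytesIO(data)  # stands in for the raw file on disk
        self.header = None
        self.row_offsets = []         # simplified positional map
        self.cache = {}               # (row, column index) -> value
        self.rows_parsed = 0          # how often a row had to be tokenized

    def _build_map(self):
        """First touch of the file: record where each data row starts."""
        self.file.seek(0)
        self.header = next(csv.reader([self.file.readline().decode()]))
        while True:
            offset = self.file.tell()
            if not self.file.readline():
                break
            self.row_offsets.append(offset)

    def column(self, name):
        """Return one column, parsing only rows not already cached."""
        if self.header is None:
            self._build_map()
        col = self.header.index(name)
        values = []
        for row, offset in enumerate(self.row_offsets):
            if (row, col) not in self.cache:
                self.file.seek(offset)  # jump straight to the row
                fields = next(csv.reader([self.file.readline().decode()]))
                self.rows_parsed += 1
                self.cache[(row, col)] = fields[col]
            values.append(self.cache[(row, col)])
        return values

raw = RawCsv(SAMPLE)
print(raw.column("Price"))  # ['3000.00', '4900.00', '5000.00', '4799.00']
print(raw.column("Price"))  # same answer, served from the cache
print(raw.rows_parsed)      # 4
```

No load step runs before the first query, and repeated queries touch less of the raw file because the map and cache grow as a side effect of earlier queries.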

[Plot: execution time (s), 0 to 3000, of Load plus the query sequence for MySQL, PostgreSQL, DBMS X, and PostgresRAW]

PostgreSQL + NoDB = PostgresRaw

40 analytics queries on CSV data

comparable/competitive performance

Avg runtime of last 5 queries: DBMS X 6.56 s, RAW 2.97 s


NoDB gets progressively faster

now: run queries to create database

[Figure: MapReduce Engine, Relational DBMS, Reporting Server, Spreadsheet, Enterprise Search Engine, …, External Data Sources, Operational Databases]

ViDa: in-situ query engine

making a difference

• Solve the domain scientist’s problems
  – not ours

• One* solution does not fit all
  – *query language, data model, data type, index

• Go back to the lab and apply findings
  – then write computer science papers

• Build multiple bridges to sciences
