exploringbigbraindata - db.in.tum.dedb.in.tum.de/hosted/bigdata/presentations/ailamaki.pdf ·...

Exploring Big Brain Data. Anastasia Ailamaki, with Farhan Tauheed, Thomas Heinis, and many others. Data-Intensive Applications and Systems (DIAS) Laboratory, School of Computer and Communication Sciences


Post on 18-Oct-2019


Page 1:

Exploring Big Brain Data

Anastasia Ailamaki

with Farhan Tauheed, Thomas Heinis, and many others

Data-Intensive Applications and Systems (DIAS) Laboratory
School of Computer and Communication Sciences

Page 2:

THE SEQUENCE EXPLOSION

the human genome at ten (Nature, 4/2010)

benefits hindered by complexity

Page 3:

scientific questions and answers

Past three paradigms:
• theory
• simulation
• experimentation

fourth paradigm: data-driven science

wasted data

worst ROI ever!

Page 4:

questions to answer today

Domain Scientist:
• how to find interesting data?
• how to move quickly through data?
• how to enable scientific discovery?

Computer Scientist:
• What's in it for me?

Page 5:

How to find interesting data:

SPATIAL QUERIES ON UNSTRUCTURED MESHES

Page 6:
Page 7:
Page 8:

spatial range queries

Queries:
• Point
• Range
• Feature

Use in simulation:
• Show ground velocity at Q
• Draw the temperature of R

[Figure: unstructured mesh with labeled Element and Node, a point query Q, and a range query R. Courtesy Cornell Fracture Group.]

must scale with model complexity

Page 9:

R-Tree indexing

[Figure: objects A to H and a query Q, grouped into minimum bounding rectangles (MBRs) K, L, M, N under the nodes X and Y of an R-Tree index; the objects are stored on disk pages.]

tight connectivity hurts performance
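A minimal sketch of why overlapping MBRs cost extra reads: a point query has to descend into every node whose MBR covers the point. The `Node` class, the two hand-built trees, and all coordinates are made up for illustration; this is not the index used in the talk.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    mbr: tuple                                    # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list)  # inner node: child nodes
    objects: list = field(default_factory=list)   # leaf node: object MBRs

def contains(mbr, p):
    xmin, ymin, xmax, ymax = mbr
    return xmin <= p[0] <= xmax and ymin <= p[1] <= ymax

def point_query(node, p):
    """Return (objects whose MBR contains p, number of nodes read)."""
    reads = 1  # reading this node counts as one page access
    hits = [o for o in node.objects if contains(o, p)]
    for child in node.children:
        if contains(child.mbr, p):  # descend only where the MBR covers p
            h, r = point_query(child, p)
            hits += h
            reads += r
    return hits, reads

# Two leaves with disjoint MBRs.
disjoint = Node((0, 0, 9, 4), children=[
    Node((0, 0, 4, 4), objects=[(0, 0, 2, 2), (3, 0, 4, 4)]),
    Node((5, 0, 9, 4), objects=[(5, 0, 6, 4), (7, 1, 9, 3)]),
])
# Two leaves whose MBRs overlap around x = 3..6.
overlapping = Node((0, 0, 9, 4), children=[
    Node((0, 0, 6, 4), objects=[(0, 0, 2, 2), (3, 0, 6, 4)]),
    Node((3, 0, 9, 4), objects=[(3, 2, 5, 4), (7, 1, 9, 3)]),
])

q = (3.5, 1)
hits_d, reads_d = point_query(disjoint, q)
hits_o, reads_o = point_query(overlapping, q)
```

Both trees return a single matching object for `q`, but the overlapping layout reads one more node to find it.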

Page 10:

directed local search

[Figure: mesh elements A to H with query point Q.]

insight: trade accuracy of target for efficiency

Page 11:

retrieval of a spatial range

querying increasingly dense data

Simulations: 10K neurons

human: 86,000,000,000 neurons

J. Kozloski et al., "Identifying, tabulating, and analyzing contacts between branched neuron morphologies", IBM Journal of Research and Development, vol. 52, no. 1/2, 2008

Page 12:

brain diseases in Europe: €800B

?

Page 13:

the Blue Brain Project

[Figure: neocortex; a neuron; dye loading & raster scanning; 3D reconstruction. Images courtesy of the Blue Brain Project.]

Page 14:

brain simulation

single neuron, modeled with 3D cylinders

idea: index the brain

Page 15:

brain indexing with R-trees

Bulk-loaded R-Trees: Hilbert packed R-Tree, STR R-Tree, PR-Tree

[Figure: execution time [sec] per 10K results (0 to 50) versus segments in millions (0 to 3), ranging from sparse to dense; series: R-Tree, Without Overlap.]

need to scale with density

Page 16:

FLAT idea: seed-then-crawl

1) SEEDING: find any object in region
2) CRAWLING: traverse and retrieve remaining objects

Reachability?

tessellation + linking enable crawling
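The seed-then-crawl idea can be sketched roughly as follows. This is a simplification under stated assumptions, not FLAT itself: space is tessellated into a uniform grid (the cell size `CELL` is made up), each cell is implicitly linked to its four neighbours, objects are 2D points, and the seed is simply the cell under the query's lower-left corner.

```python
from collections import defaultdict, deque

CELL = 1.0  # assumed cell size of the tessellation

def cell_of(p):
    return (int(p[0] // CELL), int(p[1] // CELL))

def build(points):
    """Tessellate: store every point in the grid cell that contains it."""
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p)].append(p)
    return cells

def cell_intersects(c, q):
    xmin, ymin, xmax, ymax = q
    cx, cy = c[0] * CELL, c[1] * CELL
    return cx <= xmax and cx + CELL >= xmin and cy <= ymax and cy + CELL >= ymin

def range_query(cells, q):
    xmin, ymin, xmax, ymax = q
    seed = cell_of((xmin, ymin))            # 1) SEEDING: any cell in the region
    seen, queue, out = {seed}, deque([seed]), []
    while queue:                            # 2) CRAWLING: follow neighbour links
        c = queue.popleft()
        out += [p for p in cells.get(c, [])
                if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax]
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            n = (c[0] + dx, c[1] + dy)
            if n not in seen and cell_intersects(n, q):
                seen.add(n)
                queue.append(n)
    return out

points = [(0.5, 0.5), (1.5, 1.2), (2.5, 2.5), (3.5, 0.2), (1.1, 2.9), (4.2, 4.2)]
query = (1.0, 1.0, 3.0, 3.0)
result = range_query(build(points), query)
```

Because the cells cover all of space and every cell links to its neighbours, every cell touching the query is reachable from the seed, including empty ones. The crawl only visits cells that intersect the query, so its cost follows the size of the region and not the depth or overlap of a tree.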

Page 17:

FLAT scales with data density

[Figure: total page reads [thousands] (0 to 300) versus density [millions of elements / volume unit] (50 to 450); series: Hilbert R-Tree, STR R-Tree, PR-Tree, FLAT; annotations: known techniques; 2012: 1 million neurons.]

Page 18:

How to move quickly through data:

FAST DATA EXPLORATION THROUGH INTERACTIVE QUERIES

Page 19:

structural analysis: navigation

use cases: ad-hoc analysis, visualization, model refinement

sequences follow a latent guiding structure

Page 20:

sequence of queries issued interactively

[Figure: timeline of the 1st, 2nd, and 3rd query on DISK and CPU; labels: query execution (I/O), analyzing results, prefetching opportunity.]

idea: prefetch next query

known techniques do not scale

Page 21:

SCOUT: content-aware prefetching

• model structures in graph, then traverse
• identify guiding structure: iterative pruning
• max accuracy: incremental prefetching

[Figure: timeline of the 1st, 2nd, and 3rd query on DISK and CPU; labels: prefetching for future query, residual I/O, analyzing results, prediction computation.]

70-98% hit rate, 4x-15x speedup
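To make the prefetching idea concrete, here is a deliberately simple sketch that guesses the next query box by linearly extrapolating the part of a structure seen in the current box. This is not SCOUT: the slide describes a graph model with iterative pruning, whereas the function `predict_next_box`, the 2D polyline `branch`, and the one-box-length step are all assumptions made for illustration.

```python
import math

def inside(p, box):
    xmin, ymin, xmax, ymax = box
    return xmin <= p[0] <= xmax and ymin <= p[1] <= ymax

def predict_next_box(branch, box):
    """Guess the next query box from the branch points seen inside `box`.

    `branch` is an ordered list of 2D points along one structure.
    Returns None when the visible part is too short to give a direction.
    """
    seen = [p for p in branch if inside(p, box)]
    if len(seen) < 2:
        return None
    (x0, y0), (x1, y1) = seen[-2], seen[-1]
    norm = math.hypot(x1 - x0, y1 - y0)
    if norm == 0:
        return None
    w, h = box[2] - box[0], box[3] - box[1]
    cx = (box[0] + box[2]) / 2 + (x1 - x0) / norm * w  # shift the centre one box
    cy = (box[1] + box[3]) / 2 + (y1 - y0) / norm * h  # length along the branch
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

branch = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
current = (0, -1, 2, 1)
next_box = predict_next_box(branch, current)
prefetched = [p for p in branch if inside(p, next_box)]
```

While the user analyzes the results of `current`, the data in `next_box` could be read in the background; any part of the real next query that falls outside the prediction remains as residual I/O.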

Page 22:

How to enable scientific discovery:

QUERY PROCESSING ALGORITHMS

Page 23:

touch detection

Model Synapses: electrical connections between axons and dendrites

Data Challenge
• 100K neurons => 5B synapses
• 30GB additional space to store synapse data
• Human Brain => 100B neurons, ~PBs

• efficient spatial proximity queries
• precise distance calculation

a major bottleneck in brain simulation
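The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes 1 GB = 10**9 bytes and that storage grows linearly with the number of neurons; both are assumptions, not statements from the talk.

```python
neurons = 100_000
synapses = 5_000_000_000
extra_bytes = 30 * 10**9               # 30 GB, assuming 1 GB = 10**9 bytes

synapses_per_neuron = synapses // neurons      # synapses per simulated neuron
bytes_per_synapse = extra_bytes / synapses     # storage per synapse

human_neurons = 100 * 10**9                    # 100B neurons
scale = human_neurons // neurons               # how much larger a human brain is
human_petabytes = extra_bytes * scale / 10**15 # assuming linear scaling
```

Under these assumptions the model has 50,000 synapses per neuron at about 6 bytes each, and a human-scale model would need on the order of 30 PB, which matches the "~PBs" on the slide.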

Page 24:

disk-based spatial join

[Figure: objects a1 to a4 of Dataset A and b1 to b4 of Dataset B assigned to partitions 1 to 4; panels: Multiple Assignment: PBSM and Multiple Matching: S3; labels: data partitioning, space partitioning.]

[Table: columns Algorithm, Object Replication, Sensitive to Distribution, Duplicate Results, Filtering; rows PBSM, S3; cell marks not preserved.]
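The multiple-assignment strategy can be sketched as a grid join in the spirit of PBSM. This is an in-memory simplification under assumptions: the function names, the cell size, and the sample boxes are made up, candidates are compared by nested loops, and duplicates are removed with a set, whereas a disk-based join would partition to disk and sweep each partition.

```python
from collections import defaultdict

def intersects(a, b):
    """Closed-interval overlap test for boxes (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def cells_for(box, cell):
    """Multiple assignment: every grid cell the box overlaps."""
    x0, y0 = int(box[0] // cell), int(box[1] // cell)
    x1, y1 = int(box[2] // cell), int(box[3] // cell)
    return [(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]

def grid_join(A, B, cell=1.0):
    grid_a, grid_b = defaultdict(list), defaultdict(list)
    for i, a in enumerate(A):
        for c in cells_for(a, cell):
            grid_a[c].append(i)   # object replicated into each overlapped cell
    for j, b in enumerate(B):
        for c in cells_for(b, cell):
            grid_b[c].append(j)
    pairs, candidates = set(), 0  # the set drops duplicate results
    for c, ids_a in grid_a.items():
        for i in ids_a:
            for j in grid_b.get(c, []):
                candidates += 1
                if intersects(A[i], B[j]):
                    pairs.add((i, j))
    return pairs, candidates

A = [(0.2, 0.2, 1.8, 1.8), (2.1, 2.1, 2.9, 2.9)]
B = [(0.5, 0.5, 1.5, 1.5), (1.7, 1.7, 2.2, 2.2), (5, 5, 6, 6)]
pairs, candidates = grid_join(A, B)
```

In this example the first pair of boxes shares several cells, so it is tested more than once; `candidates` exceeds `len(pairs)`, which is the replication and duplicate-result cost the table refers to.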

Page 25:

TOUCH: in-memory join

Phases:
1) Building (A)
2) Assignment (B)
3) Join (Plane Sweep)

[Figure: tree with Root, Level 1, and Leaf levels; building using Dataset A; assignment using Dataset B.]

A-data, B-space

Page 26:

TOUCH is a distribution-aware join

[Table: columns Algorithm, Object Replication, Sensitive to Distribution, Duplicate Results, Filtering; rows PBSM, S3, TOUCH; cell marks not preserved.]

FILTERED: during the assignment phase, discarded without joining
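The three phases can be sketched as below. This is an illustration under assumptions, not the TOUCH implementation: the tree is packed by sorting A on x with a made-up `fanout` (at least 2, with A non-empty), and the join phase uses nested loops where the slide names a plane sweep.

```python
from dataclasses import dataclass, field

def intersects(a, b):
    """Closed-interval overlap test for boxes (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def union(boxes):
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

@dataclass
class Node:
    mbr: tuple
    children: list = field(default_factory=list)  # inner node: child nodes
    a_ids: list = field(default_factory=list)     # leaf node: indices into A
    b_ids: list = field(default_factory=list)     # indices into B assigned here

def build(A, fanout=2):
    """Phase 1: pack A into leaves (sorted by x), then build levels bottom-up."""
    order = sorted(range(len(A)), key=lambda i: A[i][0])
    level = []
    for k in range(0, len(order), fanout):
        ids = order[k:k + fanout]
        level.append(Node(union([A[i] for i in ids]), a_ids=ids))
    while len(level) > 1:
        parents = []
        for k in range(0, len(level), fanout):
            group = level[k:k + fanout]
            parents.append(Node(union([n.mbr for n in group]), children=group))
        level = parents
    return level[0]

def assign(root, B):
    """Phase 2: push each b down while exactly one child MBR overlaps it."""
    filtered = 0
    for j, b in enumerate(B):
        node = root if intersects(root.mbr, b) else None
        while node is not None and node.children:
            hits = [c for c in node.children if intersects(c.mbr, b)]
            if len(hits) == 1:
                node = hits[0]   # only one subtree can contain matches
            elif not hits:
                node = None      # b lies in empty space: no match possible
            else:
                break            # several subtrees overlap: stay at this node
        if node is None:
            filtered += 1
        else:
            node.b_ids.append(j)
    return filtered

def a_below(node):
    if not node.children:
        return list(node.a_ids)
    return [i for c in node.children for i in a_below(c)]

def join(node, A, B, out):
    """Phase 3: compare each node's assigned b's with the A objects below it."""
    for i in (a_below(node) if node.b_ids else []):
        for j in node.b_ids:
            if intersects(A[i], B[j]):
                out.add((i, j))
    for c in node.children:
        join(c, A, B, out)
    return out

A = [(0, 0, 1, 1), (2, 0, 3, 1), (6, 0, 7, 1), (8, 0, 9, 1)]
B = [(0.5, 0.5, 2.5, 0.8), (4, 0, 5, 1), (8.5, 0.5, 8.6, 0.6), (20, 20, 21, 21)]
root = build(A)
filtered = assign(root, B)
result = join(root, A, B, set())
```

Each object of B is assigned to exactly one node, so no object is replicated and no pair is reported twice; objects that overlap no child MBR are discarded during assignment without ever being joined.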

Page 27:

simulation trace analysis

Need accurate and fast queries to
• Discover and explore neuro-circuit behavior
• Compare to behavior of biological tissue
• Understand plasticity

Typical trace file: ~0.8TB for 100K neurons for only 1 second of simulation

• In-memory efficient access methods limit use of complex query analysis
• Storage capacity limits longer simulation time

on-demand tracking of moving data

[Figure axis: time]

Page 28:

What's in it for computer science:

NEXT-GENERATION QUERY ENGINES

Page 29:

the human brain project

[Figure labels: molecules, neurons, synapses, circuits, whole brain, cognition; data, unifying models, patterns, rules, knowledge. Images from the Blue Brain Project, EPFL.]

integrate clinical and simulation data

Page 30:

access to private hospital data

no move, no copy

Page 31:

BI: create database to run queries

RAW DATA FILES

LOAD INTO DB

QUERY REPORT RESULTS

APPLICATIONS

data "locked" in DB for performance

Source: "An Overview of Business Intelligence Technology", S. Chaudhuri, U. Dayal, V. Narasayya, CACM, August 2011

Page 32:

NoDB: In-situ queries over never-before-seen data

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00

Positional Maps

Caching: user does not need to control when, what, how, or where data is cached

also: indexing, file system integration

!= Classical Data Loading

data-to-query time = 0
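A rough sketch of the positional-map and caching ideas, assuming a CSV file with a header row and no line breaks inside quoted fields. The class `RawCsv` and its methods are hypothetical names for illustration; this is not the PostgresRaw code, and it only records row offsets where a fuller positional map would also track attribute positions within rows.

```python
import csv

class RawCsv:
    """Query a CSV file in place, remembering where each data row starts."""

    def __init__(self, path):
        self.path = path
        self.row_offsets = None  # positional map: byte offset of every data row
        self.cache = {}          # (row, column) -> parsed field value

    def _build_map(self):
        offsets = []
        with open(self.path, "rb") as f:
            f.readline()                 # skip the header row
            while True:
                pos = f.tell()
                if not f.readline():
                    break
                offsets.append(pos)
        self.row_offsets = offsets

    def value(self, row, col):
        if self.row_offsets is None:
            self._build_map()            # the first query pays for one scan
        key = (row, col)
        if key not in self.cache:
            with open(self.path, "rb") as f:
                f.seek(self.row_offsets[row])   # jump straight to the row
                line = f.readline().decode("utf-8")
            self.cache[key] = next(csv.reader([line]))[col]
        return self.cache[key]
```

Nothing is loaded up front: the map and the cache are filled as a side effect of queries, so later queries over the same rows avoid rescanning and reparsing the file.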

Page 33:

PostgreSQL + NoDB = PostgresRaw

40 analytics queries on CSV data

[Figure: execution time (s) (0 to 3000) for MySQL, PostgreSQL, DBMS X, and PostgresRAW; legend: Query Sequence, Load.]

comparable/competitive performance

Page 34:

PostgreSQL + NoDB = PostgresRaw

40 analytics queries on CSV data

[Figure: execution time (s) (0 to 3000) for MySQL, PostgreSQL, DBMS X, and PostgresRAW.]

Avg runtime of last 5 queries: DBMS X 6.56 s, RAW 2.97 s

NoDB gets progressively faster

Page 35:

now: run queries to create database

[Figure labels: MapReduce Engine, Relational DBMS, Reporting Server, Spreadsheet, Enterprise Search Engine, External Data Sources, Operational Databases, …]

ViDa: in-situ query engine

Page 36:

making a difference

• Solve the domain scientist's problems – not ours
• One* solution does not fit all – *query language, data model, data type, index
• Go back to the lab and apply findings – then write computer science papers
• Build multiple bridges to sciences