Designing for self-serve science

Post on 26-Jun-2015

Category: Data & Analytics

DESCRIPTION

I gave this talk at the UW Systems, Architecture, & Networking (SANE) retreat in May 2014. I argued that, as a community, big data system-builders may be great at building fast systems, but that these systems DO NOT serve the scientists we work with at the UW eScience Institute. I then offered a few ideas for how to build services that enable scientists to do their own work, thus "serving themselves".

TRANSCRIPT

Designing for self-serve science

Daniel Halperin

How much time “handling data” vs “doing science”?

90%

“I sort both my spreadsheets on Gene ID, then I copy matches into a new one”

We are the problem

[Bar chart: Benchmark 1 and Benchmark 2, comparing Old system, Your system, and Our system; axis from 0 to 120]

[Bar chart: the same benchmarks with a fourth series, "What people use"; axis from 0 to 10000]

[Figure, repeated on four slides: axes labeled Performance and Complexity]

Design for here

What we build / What they need

Steve Jurvetson https://www.flickr.com/photos/jurvetson/7408464122

sutton-images.com http://biser3a.com/formula-1/f1-airboxes-all-you-need-to-know/

terms: http://sutton-images.com/terms.asp

Lowering barrier to entry

Developing a new language

• SQL: 3 great features for science

  • THE language of data management!

  • We know how to scale it

  • Scientists can learn it

• MyriaL is better

  • Imperative & declarative: easy to write

  • Iteration & recursion!

  • Lots of practical extensions
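To make the "scientists can learn it" point concrete, here is a minimal sketch of the spreadsheet workflow quoted earlier ("I sort both my spreadsheets on Gene ID, then I copy matches into a new one") written as one SQL join. It uses Python's standard `sqlite3` module rather than the system described in the talk, and the table names, column names, and values are hypothetical stand-ins for the two spreadsheets.

```python
import sqlite3

# Hypothetical rows standing in for the two spreadsheets in the quote.
expression = [("g1", 2.5), ("g2", 0.7), ("g4", 1.1)]
annotation = [("g1", "kinase"), ("g3", "transporter"), ("g4", "ligase")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (gene_id TEXT, level REAL)")
conn.execute("CREATE TABLE annotation (gene_id TEXT, label TEXT)")
conn.executemany("INSERT INTO expression VALUES (?, ?)", expression)
conn.executemany("INSERT INTO annotation VALUES (?, ?)", annotation)

# The manual sort-and-copy step becomes a single join on the shared key.
matches = conn.execute(
    "SELECT e.gene_id, e.level, a.label "
    "FROM expression e JOIN annotation a ON e.gene_id = a.gene_id "
    "ORDER BY e.gene_id"
).fetchall()

print(matches)  # [('g1', 2.5, 'kinase'), ('g4', 1.1, 'ligase')]
```

Only genes present in both tables survive the join, which is the "copy matches into a new one" step.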

Giving users insight

Diagnosing problems

[Figure: axes labeled Source node and Destination node]

Automating the ‘CS parts’

• Do work on the user’s behalf (Ratul Mahajan’s Buffet Principle):

• Infer indexes and constraints!

• Aggressively reuse computation

• Speculatively apply queries to data

• Key enabler: science data is (mostly) read-only
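Two of the bullets above can be sketched in a few lines. The following is an illustrative toy, not the mechanism used by the system in the talk: a hypothetical `ReadOnlyStore` wrapper that reuses results for repeated queries (safe only because the data is assumed read-only) and infers candidate uniqueness constraints by counting distinct values. The class name, table, and data are all assumptions made for the example.

```python
import sqlite3


class ReadOnlyStore:
    """Hypothetical wrapper that does extra work on the user's behalf."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}      # normalized query text -> result rows
        self.executions = 0  # number of queries that actually reached the database

    def query(self, sql):
        # Naive normalization: collapse whitespace so reformatted copies of a
        # query share one cache entry (this also touches string literals).
        key = " ".join(sql.split())
        if key not in self.cache:
            self.executions += 1
            # Caching is only valid because the data never changes.
            self.cache[key] = self.conn.execute(sql).fetchall()
        return self.cache[key]

    def unique_columns(self, table):
        """Infer candidate keys: columns whose values never repeat."""
        columns = [row[1] for row in self.conn.execute(f"PRAGMA table_info({table})")]
        total = self.conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        unique = []
        for col in columns:
            distinct = self.conn.execute(
                f"SELECT COUNT(DISTINCT {col}) FROM {table}"
            ).fetchone()[0]
            if distinct == total:
                unique.append(col)
        return unique


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (read_id TEXT, gene_id TEXT, depth INTEGER)")
conn.executemany(
    "INSERT INTO reads VALUES (?, ?, ?)",
    [("r1", "g1", 10), ("r2", "g1", 12), ("r3", "g2", 10)],
)

store = ReadOnlyStore(conn)
first = store.query("SELECT gene_id, SUM(depth) FROM reads GROUP BY gene_id ORDER BY gene_id")
second = store.query("SELECT gene_id,  SUM(depth)\nFROM reads GROUP BY gene_id ORDER BY gene_id")

print(first)                          # [('g1', 22), ('g2', 10)]
print(store.executions)               # 1: the second call was served from the cache
print(store.unique_columns("reads"))  # ['read_id']
```

A real system would need a stronger notion of query equivalence than whitespace normalization, and would have to invalidate results if the read-only assumption ever failed.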

Enable authoring & sharing

• “Autocomplete for science” - predict query snippets as users work. (Nodira Khoussainova)

• Natural language interface: queries → English questions → queries, e.g. “Compute the fraction of CGs that are methylated in the oyster genome.”
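As a rough illustration of the "autocomplete for science" idea, the sketch below ranks candidate snippets by how often they appeared alongside what the user has already typed. It is a deliberately simple co-occurrence count, not the approach of the work credited on the slide; the query log, the snippet granularity, and the `suggest` function are hypothetical.

```python
from collections import Counter

# Hypothetical log of earlier queries, each reduced to a list of snippets.
QUERY_LOG = [
    ["FROM reads", "WHERE depth > 10", "GROUP BY gene_id"],
    ["FROM reads", "WHERE depth > 10"],
    ["FROM reads", "WHERE depth > 10", "GROUP BY gene_id"],
    ["FROM reads", "ORDER BY depth"],
    ["FROM genes", "WHERE chrom = 'X'"],
]


def suggest(partial, log, k=2):
    """Return up to k snippets that most often co-occur with the typed ones."""
    typed = set(partial)
    counts = Counter()
    for past in log:
        # Only learn from past queries that contain everything typed so far.
        if typed <= set(past):
            counts.update(s for s in past if s not in typed)
    return [snippet for snippet, _ in counts.most_common(k)]


print(suggest(["FROM reads"], QUERY_LOG))
# ['WHERE depth > 10', 'GROUP BY gene_id']
print(suggest(["FROM reads", "WHERE depth > 10"], QUERY_LOG))
# ['GROUP BY gene_id']
```

The suggestions narrow as the user types more, because fewer logged queries match the partial query.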

Improve their state of the art

• “You just did in 1 minute what took me a week”

• “Replaced 100 lines of Python with 1 line of SQL”

• “That 5-line MyriaL program was 100x faster than my R cluster, and much simpler”

Trust, but Verify (& Support)
