uc#berkeley# - university of california, los...

23
Making Sense at Scale with Algorithms, Machines & People PI: Michael Franklin University of California, Berkeley Expeditions in Computing PI Meeting May 15, 2013 UC BERKELEY

Upload: dangque

Post on 17-Feb-2018

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Making Sense at Scale with Algorithms, Machines & People!

PI: Michael Franklin!University of California, Berkeley!

!Expeditions in Computing PI Meeting!

May 15, 2013!

UC  BERKELEY  

Page 2: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

The Berkeley AMPLab!

2

Page 3: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

It’s  All  Happening  On-­‐line  Every: Click Ad impression Billing event Fast Forward, pause,… Friend Request Transaction Network message Fault …

User  Generated  (Web  &  Mobile)  

…..

Internet  of  Things  /  M2M   ScienCfic  CompuCng  

Sources Driving Big Data!

Page 4: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Challenge 1: Data is Big!Projected  Growth  

Increase  over  2

010  

0  

10  

20  

30  

40  

50  

60  

2010   2011   2012   2013   2014   2015  

Moore's  Law  Overall  Data  Par8cle  Accel.  DNA  Sequencers  

Data  Grows  faster  than  Moore’s  Law  [IDC  report,  Kathy  Yelick,  LBNL]  

Page 5: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Challenge 2: Data is Dirty!

•  Variety of diverse sources!•  Uncurated!•  No schema !•  Inconsistent syntax and semantics!

Dirty  Data  worse  than  Big  Data    

Page 6: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Challenge 3: Complex Questions!

•  Hard questions!– What is the impact on traffic and

home prices of building a new on-ramp?!

•  Detect real-time events!–  Is there a cyber attack going on?!

•  Open-ended questions !– How many supernovae happened

last year?!

Page 7: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Our Vision: A Necessary Synergy!lgorithms     achines     eople    

Challenge  1:  Data  is  Big   ✔   ✔  

Challenge  3:  Ques8ons    are  complex  

✔   ✔  ✔  

Challenge  2:  Data  is  Dirty   ✔   ✔  ✔  

Page 8: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

The AMPLab Big Bets!•  Traditional intellectual borders hinder “Big Data” stacks!

–  Need Machine Learning/Systems/Database Co-Design!–  Requires Cohabitation and Real Collaboration!

•  Now is a unique opportunity to rethink fundamental design points:!–  Changing Latency Demands!–  Changing Consistency Requirements!–  Cloud-based Elastic Resources!–  Huge Desire for New Solutions in the Marketplace!–  Open Source is the key to Tech Transfer in Big Data!

•  Need to consider role of people throughout the entire analytics lifecycle!8

Page 9: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

AMPLab: Collaborative Research!An integration of Faculty Interests (*Directors):!!!!!!

9

Alex  Bayen  (Mobile  Sensing)   Anthony  Joseph  (Sec./  Privacy)  Ken  Goldberg  (Crowdsourcing)   Randy  Katz  (Systems)  *Michael  Franklin  (Databases)   Dave  Pa`erson  (Systems)  Armando  Fox  (Systems)   *Ion  Stoica  (Systems)  *Mike  Jordan  (Machine  Learning)   Sco`  Shenker  (Networking)  

Twice-Yearly Research Retreats (industry & sponsors):!

50+ amazing grad students, post-docs, undergrads, developers, staff & visitors!

Page 10: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Co-Located for Collaboration!

10

Page 11: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Collaboration: Industry + Government!!AMPLab Launched January 2011 (5 yr plan)!Founding Sponsors:!!Sponsors and Affiliates:!!!!Federal Grants and Contracts:!

!11

Expeditions in Computing

XData Program

Page 12: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Collaboration: Applications!!

Participatory Sensing! Mobile Millenium - Traffic!Collective Discovery !!Opinion Space - Opinions!!Carat – Smartphone energy!

Urban Planning and Simulation!! UrbanSim – data integration!

Cancer Genomics/Personalized Medicine (w/ UCSF and UCSC) !!!SNAP: Fast Sequence Alignment!!Genome Data Warehouse!12

Page 13: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Shared Deliverable:Berkeley Data Analytics Stack (BDAS)!

13

Page 14: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

BDAS: Current Snapshot!

Mesos  

MPI  

Resource  Mgmt.  

Data    Processing  Storm  

Spark

Spark Streaming Shark

BlinkDB

                                                                                                                                                                                       HDFS  Data  Mgmt.  

Tachyon

Hadoop  

HIVE  Pig  Spark

Graph ML base

Released  (BDAS)   In  development  (BDAS)   Exis8ng  open  source  stack                                                                                                                                                                                            

BDAS  Components  being  released  under  BSD  or  Apache  Open  Source  License  

Page 15: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Big Data Landscape – Our Corner!

15

Page 16: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Impact (so far)!•  Open Source Release of BDAS components:!

•  Mesos: Cluster Virtualization !•  Business critical services on 6000+ servers at Twitter!•  see “How Twitter Rebuilt Google’s Secret Weapon” Wired 3/13!

•  Spark: In-memory Computation Framework &! Shark: Hive-Compatible SQL Query Engine on Spark!

•  in use at large companies, start ups, and govt. agencies !•  100x Performance Improvement over Hadoop/Apache Hive!•  available on Amazon Elastic Map Reduce!•  700+ member Meetup group!

•  Best Paper Awards: Eurosys 13, ICDE 13, NSDI 12, SIGCOMM 12 and Best Demo Award: SIGMOD 12!

•  Students in high-demand in academia and industry!16

Page 17: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Spark: Sys/ML Collaboration at Work!

iter.  1   iter.  2   .    .    .  

Logistic Regression Performance

29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)

Research Challenge Addressed: How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Technical Challenge: disk-oriented Hadoop Map Reduce inefficient for iterative Machine Learning

Solution: Resilient Distributed Datasets (RDDs)

Page 18: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Impact: Carat Smartphone App!

18 Over 500,000

downloads

Page 19: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

MLBase – Declarative ML!

19

Vision: Make Machine Learning usable by “mere mortals” Allow high-level (declarative) specification of ML tasks Use Database-style “query optimization to generate efficient execution strategy

Page 20: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Hybrid Human/Machine Systems!Use machines for bulk data processing!Leverage human activity for data collection and event detection!Leverage human knowledge, reasoning and perception for:!

• subjective entity comparisons!• complex predicates !• finding missing data!• disambiguating questions!

!!

20

Disk 2

Disk 1

Parser

Optimizer

Stat

istic

s

CrowdSQL Results

Executor

Files Access Methods

UI Template Manager

Form Editor

UI Creation

HIT Manager

Met

aDat

a

Turker Relationship Manager

e.g., CrowdDB Architecture

Page 21: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

Outreach!

21

AMPCamp I @ Berkeley, August 2012 AMPCamp II @ Strata Conf., Feb 2013 AMPCamp III @ Berkeley, August 2013 AMPCamp Online:

ampcamp.berkeley.edu

Page 22: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

What do we get from Expeditions?!

Simply put – the ability to!! ! ! ! ! “swing for the fences”!

22

Page 23: UC#BERKELEY# - University of California, Los Angelescadlab.cs.ucla.edu/expeditions_pi_meeting/slides/MichaelFranklin... · • 700+ member Meetup group! • Best Paper Awards:

For More Information!amplab.cs.berkeley.edu!

•  Papers and Project Pages!

•  News updates and Blogs!

Twitter: @amplab!Github and Apache!http://[email protected]!!!!

23