Big Data Curricula at the UW eScience Institute, JSM 2013

Bill Howe, PhD, Director of Research, Scalable Data Analytics, University of Washington eScience Institute. Big Data Curricula at the University of Washington eScience Institute.

Upload: bill-howe

Post on 27-Jan-2015


Category: Education



DESCRIPTION

A 25-minute talk from a panel on big data curricula at JSM 2013: http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664

TRANSCRIPT

Page 1: Big Data Curricula at the UW eScience Institute, JSM 2013


Bill Howe, PhD

Director of Research, Scalable Data Analytics

University of Washington eScience Institute

Big Data Curricula at the University of Washington eScience Institute

Bill Howe, UW

Page 2: Big Data Curricula at the UW eScience Institute, JSM 2013


“It’s a great time to be a data geek.”

-- Roger Barga, Microsoft Research

“The greatest minds of my generation are trying to figure out how to make people click on ads”

-- Jeff Hammerbacher, co-founder, Cloudera

Page 3: Big Data Curricula at the UW eScience Institute, JSM 2013

1. Theory (last 2000 yrs)
2. Experiment (last 200 yrs)
3. Simulation (last 50 yrs)
4. Data-Driven Discovery (last 5 yrs)

Page 4: Big Data Curricula at the UW eScience Institute, JSM 2013


The University of Washington eScience Institute

• Rationale
  – The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich
  – As a result, the techniques and technologies of data science must be widely practiced and widely adopted
• Mission
  – Advance the forefront of research both in modern data science techniques and technologies, and in the fields that depend upon them
• Strategy
  – Provide an umbrella organization for Big Data activities at UW and beyond (new curricula, collaborations, funding sources, hiring practices)
  – Bootstrap a national network of partners and peer institutes
  – Attract, develop, and retain “Pi-shaped people”

Page 5: Big Data Curricula at the UW eScience Institute, JSM 2013

π-shaped researchers

Broad in many areas; deep in at least two

Page 6: Big Data Curricula at the UW eScience Institute, JSM 2013


UW Data Science Education Efforts


Audience columns:
• Students: CS/Informatics (undergrads, grads); Non-Major (undergrads, grads)
• Non-Students: professionals, researchers

Program rows:
• UWEO Data Science Certificate
• Graduate Certificate in Big Data
• CS Data Management Courses
• eScience workshops
• Intro to data programming
• eScience Masters (planned)
• MOOC: Intro to Data Science
• Incubator: On-the-job-training

Previous courses:
• Scientific Data Management, Graduate CS, Summer 2006, Portland State University
• Scientific Data Management, Graduate CS, Spring 2010, University of Washington

Page 7: Big Data Curricula at the UW eScience Institute, JSM 2013


Three Activities

• Massive Open Online Course
• New PhD Tracks in Big Data
• An Incubator for Data Science Projects

• Other activities I won’t discuss
  – Undergraduate “Data Wizardry” Courses
  – 2-day Bootcamps in Python, SQL, GitHub, …
  – Certificate Programs in Data Science
  – Hackathons


Page 8: Big Data Curricula at the UW eScience Institute, JSM 2013


Three Activities

• Massive Open Online Course
• New PhD Tracks in Big Data
• An Incubator for Data Science Projects

• Other activities I won’t discuss
  – Undergraduate “Data Wizardry” Courses
  – 2-day Bootcamps in Python, SQL, GitHub, …
  – Certificate Programs in Data Science
  – Hackathons


Page 9: Big Data Curricula at the UW eScience Institute, JSM 2013


Page 10: Big Data Curricula at the UW eScience Institute, JSM 2013

• 8600 completed all programming assignments
• 7000 earned a certificate

Page 11: Big Data Curricula at the UW eScience Institute, JSM 2013
Page 12: Big Data Curricula at the UW eScience Institute, JSM 2013


Syllabus

• Data Science Landscape (~1 week)
• Data Manipulation at Scale
  – Relational Databases (~1 week)
  – MapReduce (~1 week)
  – NoSQL (~1 week)
• Analytics
  – Statistics Pearls (~1 week)
  – Machine Learning Pearls (~1 week)
• Visualization (~1 week)


Page 13: Big Data Curricula at the UW eScience Institute, JSM 2013


This Course

• tools vs. abstractions
• desktop vs. cloud
• structures vs. statistics
• hackers vs. analysts

Page 14: Big Data Curricula at the UW eScience Institute, JSM 2013


What are the abstractions of data science?

tools vs. abstractions

“Data Jujitsu”

“Data Wrangling”

“Data Munging”

Translation: “We have no idea what this is all about”

Page 15: Big Data Curricula at the UW eScience Institute, JSM 2013


• matrices and linear algebra?
• relations and relational algebra?
• objects and methods?
• files and scripts?
• data frames and functions?

What are the abstractions of data science?

tools vs. abstractions
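To make the question concrete, here is a minimal sketch (not from the talk) that computes the same per-group average under two of the abstractions listed: relations with relational algebra, using Python's built-in `sqlite3` module, and objects with methods, using plain dictionaries and a loop. The table name `readings`, its columns, the function names, and the sample values are all invented for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (sensor, value) pairs, invented for illustration.
ROWS = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)]


def mean_by_sensor_relational(rows):
    """Relations and relational algebra: group and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
        conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


def mean_by_sensor_objects(rows):
    """Objects and methods: the same aggregate with dictionaries and a loop."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sensor, value in rows:
        totals[sensor] += value
        counts[sensor] += 1
    return {sensor: totals[sensor] / counts[sensor] for sensor in totals}


if __name__ == "__main__":
    print(mean_by_sensor_relational(ROWS))
    print(mean_by_sensor_objects(ROWS))
```

Both functions return the same mapping from sensor to mean value. The relational version states what result is wanted and leaves the evaluation strategy to the engine, while the dictionary version spells out how to compute it step by step.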

Page 16: Big Data Curricula at the UW eScience Institute, JSM 2013


Data Access Hitting a Wall

Current practice based on data download (FTP/GREP)

Will not scale to the datasets of tomorrow

• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years
• Oh!, and 1 PB ~5,000 disks
• At some point you need indices to limit search, and parallel data search and analysis
• This is where databases can help

• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (~1$)
• … 2 days and 1K$
• … 3 years and 1M$

desktop vs. cloud

[slide src: Jim Gray]
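The arithmetic behind the GREP figures can be sketched as follows. This is a back-of-the-envelope estimate, not part of the original slide: it assumes a single sequential scan rate of 10 MB/s (the slide's round numbers imply rates of the same order of magnitude) and decimal units (1 PB = 10^15 bytes). The function and constant names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

# Assumption: one scanner reads sequentially at about 10 MB/s.
ASSUMED_BYTES_PER_SECOND = 10 * 10**6

# Decimal units, as an assumption: 1 MB = 10^6 bytes, ..., 1 PB = 10^15 bytes.
SIZES = {"1 MB": 10**6, "1 GB": 10**9, "1 TB": 10**12, "1 PB": 10**15}


def scan_seconds(num_bytes, bytes_per_second=ASSUMED_BYTES_PER_SECOND, workers=1):
    """Seconds to read num_bytes once, split evenly across `workers` scanners."""
    return num_bytes / (bytes_per_second * workers)


def humanize(seconds):
    """Render a duration in the largest unit that keeps the number readable."""
    if seconds < 60:
        return f"{seconds:.1f} seconds"
    if seconds < 3600:
        return f"{seconds / 60:.1f} minutes"
    if seconds < SECONDS_PER_DAY:
        return f"{seconds / 3600:.1f} hours"
    if seconds < SECONDS_PER_YEAR:
        return f"{seconds / SECONDS_PER_DAY:.1f} days"
    return f"{seconds / SECONDS_PER_YEAR:.1f} years"


if __name__ == "__main__":
    for label, size in SIZES.items():
        print(f"{label}: {humanize(scan_seconds(size))} on one scanner")
    # Idealized parallel scan over the ~5,000 disks mentioned on the slide.
    parallel = scan_seconds(SIZES["1 PB"], workers=5000)
    print(f"1 PB across 5,000 scanners: {humanize(parallel)}")
```

Under these assumptions a full scan of 1 PB takes roughly three years on one scanner and a few hours when spread perfectly across 5,000 disks. The parallel figure ignores coordination and network costs, so it is a lower bound rather than a prediction. An index avoids most of the scan altogether, which is the case the slide makes for databases.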

Page 17: Big Data Curricula at the UW eScience Institute, JSM 2013


US faces shortage of 140,000 to 190,000 people “with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”


-- McKinsey Global Institute

hackers vs. analysts

Page 18: Big Data Curricula at the UW eScience Institute, JSM 2013


Three types of tasks:


1) Preparing to run a model

Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging

“80% of the work”

-- Aaron Kimball

2) Running the model

3) Interpreting the results

“The other 80% of the work”

-- Aaron Kimball

structures vs. statistics

Page 19: Big Data Curricula at the UW eScience Institute, JSM 2013


Three Activities

• Massive Open Online Course
• New PhD Tracks in Big Data
• An Incubator for Data Science Projects

• Other activities I won’t discuss
  – Undergraduate “Data Wizardry” Courses
  – 2-day Bootcamps in Python, SQL, GitHub, …
  – Certificate Programs in Data Science
  – Hackathons


Page 20: Big Data Curricula at the UW eScience Institute, JSM 2013


New PhD Track: “Big Data U”

• Open to all departments
• New courses to “level the playing field”
  – “Molecular Biology for Computer Scientists” offered this Fall
• Dual advising in two disciplines
• Joint projects leading to multiple theses
  – Each methods thesis will include domain impact component
  – Each domain thesis will include methods impact component
• Contribution to a shared cyberinfrastructure
  – Software engineering experience as a side effect
• “Application Assistantships”
  – Like RAs and TAs; focused on solving a concrete problem


Magda Balazinska

Carlos Guestrin

Page 21: Big Data Curricula at the UW eScience Institute, JSM 2013


Three Activities

• Massive Open Online Course
• New PhD Tracks in Big Data
• An Incubator for Data Science

• Other activities I won’t discuss
  – Undergraduate “Data Wizardry” Courses
  – 2-day Bootcamps in Python, SQL, GitHub, …
  – Certificate Programs in Data Science
  – Hackathons


Page 22: Big Data Curricula at the UW eScience Institute, JSM 2013


Data Science Incubator: Motivation

• We need the right people
  – We produce “builders,” but 99% of them go to industry to “make people click on ads”
  – They aren’t motivated by writing papers
  – No viable career path in the academy
• We need the right processes
  – Hands-on, extended, intensive experience is required to produce π-shaped people
  – Data-driven discovery requires intensive collaboration


Page 23: Big Data Curricula at the UW eScience Institute, JSM 2013

Science Domains

Stats, Computer Science, Applied Math

• “Where’s the funding?”
• “How does this help me write a paper in my field?”
• Thin collaborations; nobody to work on the short-term, high-risk, high-impact “triage” projects
• “Does method X work on dataset Y?”

Page 24: Big Data Curricula at the UW eScience Institute, JSM 2013

Domain Labs

Research Programmers

• Expensive; doesn’t scale
• “Code Monkey” – No viable career path
• Can’t attract top people
• No sharing, no community, no cross-pollination

Page 25: Big Data Curricula at the UW eScience Institute, JSM 2013


Data Science Incubator: Structure

• Recruit top-flight data science talent
• Give them autonomy to select collaborations and projects
• Promote them according to “altmetrics” and project impact
  – “Data Scientist” “Senior Data Scientist” “Technical Fellow”
  – “Data Science Fellows”
• Perhaps non-tenure, but 3-5 year commitments
• Funded with contributions from Academic units, IT, Libraries, and soft money


Page 26: Big Data Curricula at the UW eScience Institute, JSM 2013


Data Science Incubator: Seed Grants

• Domain researchers submit Seed Grant applications for short, intensive 1-6 month projects
  – Reviewed by the Data Scientists themselves
• Awardees send 1+ students, postdocs, staff, or faculty to come and physically sit in the incubator space X days per week for the project duration
  – Application may or may not include funding for the student


Page 27: Big Data Curricula at the UW eScience Institute, JSM 2013

Domain Labs

Incubator

• Data Scientists have their own identity and prestige
• Cross-pollination between disciplines
• Awardees leave with skills and knowledge; become “disciples”

Page 28: Big Data Curricula at the UW eScience Institute, JSM 2013

Domain Labs

Incubator

• Data Scientists have their own identity and prestige
• Cross-pollination between disciplines
• Awardees leave with skills and knowledge; become “disciples”

Page 29: Big Data Curricula at the UW eScience Institute, JSM 2013


Three Activities

• Massive Open Online Course
• New PhD Tracks in Big Data
• An Incubator for Data Science

• Other activities I won’t discuss
  – Undergraduate “Data Wizardry” Courses
  – 2-day Bootcamps in Python, SQL, GitHub, …
  – Certificate Programs in Data Science
  – Hackathons


Page 30: Big Data Curricula at the UW eScience Institute, JSM 2013


MOOC “Introduction to Data Science”: https://www.coursera.org/course/datasci

Certificate program: http://www.pce.uw.edu/courses/data-science-intro


http://escience.washington.edu
