the rise of data science in the age of big data analytics: why data distillation and machine...

Revolution Confidential. The Rise of Data Science in the Age of Big Data Analytics: Why Data Distillation and Machine Learning Aren’t Enough. David M Smith, VP Marketing and Community, Revolution Analytics

Upload: revolution-analytics

Post on 10-May-2015


Category:

Documents



DESCRIPTION

Big Data is important because we want to use it to make sense of our world. It’s tempting to think there’s some “magic bullet” for analyzing big data, but simple “data distillation” often isn’t enough, and unsupervised machine-learning systems can be dangerous. (Like, bringing-down-the-entire-financial-system dangerous.) Data Science is the key to unlocking insight from Big Data: by combining computer science skills with statistical analysis and a deep understanding of the data and the problem, we can not only make better predictions but also fill in gaps in our knowledge, and even find answers to questions we hadn’t thought of yet.

TRANSCRIPT

Page 1: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Revolution Confidential

The Rise of Data Science in the Age of Big Data Analytics: Why Data Distillation and Machine Learning Aren’t Enough

David M Smith, VP Marketing and Community, Revolution Analytics

Page 2: Today, we’ll discuss:

• What is Data Science?
• Why machine learning isn’t enough
• Why Data Science works
• The Data Scientist’s Toolkit
• The Future of Big Data Analytics
• Closing thoughts and resources

Page 3

Image © Dov Harrington, CC BY 2.0: http://www.flickr.com/photos/idovermani/4110546683/

Page 4: Where is it safe to fish near San Francisco?

San Francisco Estuary Institute: http://www.sfei.org/tools/wqt

Page 5: Hurricane Sandy

Bob Rudis: http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/

Page 6: Hurricane Sandy

Ed Chen: http://blog.echen.me/hurricane-sandy-outages/

Page 7: When did Michael Jackson have his biggest hits?

New York Times, June 25 2009 (3 hours after Michael Jackson’s death): http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html

Page 8: Three Essential Skills of Data Scientists

Drew Conway: http://www.dataists.com/2010/09/the-data-science-venn-diagram/

[Venn diagram labels: Data Integration, Mashups, Applications; Models, Visualization, Predictions, Uncertainty; Problems, Data Sources, Credibility; Effective Data Applications]

Page 9

Image © Abode of Chaos, CC BY 2.0: http://www.flickr.com/photos/home_of_chaos/6418989233/

Page 10: Machine learning (ML) for predictions

[Diagram with three panels. Building the Model: features and responses feed ML, which produces scoring rules. Validating the Model: the scoring rules are applied to a validation set to give predictions, which are assessed for “accuracy”. Scoring new data: the scoring rules are applied to new data to give predictions (scores).]
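The build, validate, and score workflow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the "ML" step is a plain least-squares line fit standing in for whatever learner is actually used, and the names `fit_line`, `score`, and `rmse` are hypothetical helpers, not part of any tool named in the talk.

```python
import random


def fit_line(xs, ys):
    """Building the model: least-squares fit of y = a + b*x.

    Returns the 'scoring rule' as the pair (a, b).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def score(rule, xs):
    """Apply the scoring rule to features to get predictions (scores)."""
    a, b = rule
    return [a + b * x for x in xs]


def rmse(preds, ys):
    """One possible 'accuracy' measure: root mean squared error."""
    return (sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)) ** 0.5


# Synthetic features and responses (an assumption for illustration only).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Hold out the last 50 observations as the validation set.
train_x, valid_x = xs[:150], xs[150:]
train_y, valid_y = ys[:150], ys[150:]

rule = fit_line(train_x, train_y)                 # building the model
valid_rmse = rmse(score(rule, valid_x), valid_y)  # validating the model
new_scores = score(rule, [1.0, 5.0, 9.0])         # scoring new data
```

The validation error is computed only on data the fit never saw, which is the point of the middle panel: accuracy on the training data alone says little about how the scoring rules behave on new data.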

Page 11: Problem: A lack of perspective

Image © 2010 David M Smith. Some rights reserved, CC BY 2.0

Page 12: Problem: Lack of credibility

Page 13: Problem: Complexity

Page 14: Data Science to the Rescue!

Page 15: Answer Unasked Questions

Revolutions blog, “The Uncanny Valley of Big Data”: http://blog.revolutionanalytics.com/2012/02/the-uncanny-valley-of-big-data.html

Page 16: Fill in knowledge gaps

“More data beats better algorithms, every time” – Google

“Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.” – Tim O’Reilly

Google Research, “The Unreasonable Effectiveness of Data”: http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html

Tim O’Reilly on Google+: https://plus.google.com/107033731246200681024/posts/4Xa76AtxYwd

TechnoCalifornia: http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html

Page 17: Avoid ineffective reactions

Stupid Data Miner Tricks: http://nerdsonwallstreet.typepad.com/my_weblog/files/dataminejune_2000.pdf

[Chart label: S&P 500]

Page 18

Image © Henricks Photos, CC BY-ND 2.0: http://www.flickr.com/photos/hendricksphotos/3240667626/

Page 19: 0. Data (Big & Messy)

Page 20: 1. A language for programming with data

Download the White Paper “R is Hot”: bit.ly/r-is-hot

Page 21

Grant awards to homeless veterans FY09. Data: Data.gov. Analysis: Drew Conway

• User-defined functions
• Internet API interface
• XML parsing
• Custom graphics
• Data import and pre-processing
• Iterative data processing

Page 22: 2. Speed. Lots and lots of speed.

• Variable Transformation
• Model Estimation
• Model Refinement
• Model Comparison / Benchmarking
• Feature Selection
• Sampling
• Aggregation
• Data Predictions

Page 23: Use all available computing cycles

[Diagram labels: Disk; Shared Memory; Data, Data, Data; Multicore Processor (4, 8, 16+ cores); Core 0 (Thread 0), Core 1 (Thread 1), Core 2 (Thread 2), Core n (Thread n)]

Page 24: 3. Algorithms that don’t choke on Big Data

PEMAs: Parallel External-Memory Algorithms

[Diagram labels: BIG DATA; Master Node; four Compute Nodes, each with a Data Partition]
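The idea behind a parallel external-memory algorithm can be illustrated with a small Python sketch: each worker summarizes one block of data, and a master step combines the per-block summaries, so the full data set never has to be held by a single computation. This is an assumption-laden illustration, not Revolution Analytics' implementation: the function names are hypothetical, an in-memory list stands in for blocks read from disk, and Python threads are used only to show the structure (they do not give real CPU parallelism for this kind of work).

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_stats(chunk):
    """Worker step: sufficient statistics for one block (count, sum, sum of squares)."""
    n = len(chunk)
    s = sum(chunk)
    ss = sum(v * v for v in chunk)
    return n, s, ss


def combine(a, b):
    """Master step: merge two partial results; order does not matter."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def read_chunks(values, chunk_size):
    """Stand-in for reading fixed-size blocks from disk (assumption)."""
    for i in range(0, len(values), chunk_size):
        yield values[i:i + chunk_size]


def chunked_mean_var(values, chunk_size=1000, workers=4):
    """Mean and sample variance computed from per-block summaries only."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(chunk_stats, read_chunks(values, chunk_size)))
    total = (0, 0.0, 0.0)
    for part in partials:
        total = combine(total, part)
    n, s, ss = total
    mean = s / n
    # Sum-of-squares form: simple, but numerically fragile for large values.
    var = (ss - n * mean * mean) / (n - 1)
    return mean, var


data = [float(i % 17) for i in range(10_000)]
mean, var = chunked_mean_var(data)
```

The design point is that `chunk_stats` returns a small, fixed-size summary regardless of block size, which is what lets the work be spread across cores or compute nodes and merged cheaply on the master.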

Page 25: Drink less coffee!

[Chart labels: Single-Threaded, Non-optimized algorithms; Optimized, Parallelized Algorithms]

Page 26: 4. Move code to data (not vice versa)

Map-Reduce

RHadoop: http://bit.ly/RHadoop
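As a rough illustration of the Map-Reduce pattern named on this slide, the sketch below counts words across data partitions in plain Python. It runs in a single process and does not use Hadoop or RHadoop; the function names and the sample partitions are assumptions made for the example. In a real cluster the map step would run on the node that holds each partition, which is the sense in which code moves to the data.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, value) pair for every word in one record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(partitions):
    """Run map over every partition, then shuffle and reduce the output."""
    mapped = []
    for partition in partitions:  # each partition would live on its own node
        for record in partition:
            mapped.extend(map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two hypothetical data partitions.
partitions = [
    ["big data is big", "data science"],
    ["science needs data"],
]
counts = map_reduce(partitions)
```

Only the small (word, count) pairs cross partition boundaries during the shuffle; the raw records stay where they are.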

Page 27: Big Data Appliances

More info: http://bit.ly/R-Netezza

Page 28: Play Nice with Others

Presentation Layer
• Business Intelligence Tools
• Web-based data apps
• Reporting / Spreadsheets

Analytics Layer
• R

Data Layer
• Relational datastores
• Unstructured datastores

Page 29: What every data scientist needs

[Table comparing Open-Source R and Revolution R Enterprise on the following:]

Interface with multiple data sources

Exploratory data analysis

Wide range of statistical methods

High-speed computation

Big Data support

Data/code locality (Hadoop, etc.)

Print-quality data visualization

Scheduled batch production

Works in a multi-tool ecosystem

Integration into Data Apps


Page 30: Revolution R Enterprise: Big-Data R

[Table comparing Open-Source R and Revolution R Enterprise on the following:]

Interface with multiple data sources

Exploratory data analysis

Wide range of statistical methods

High-speed computation

Big Data support

Data/code locality (Hadoop, etc.)

Print-quality data visualization

Scheduled batch production

Works in a multi-tool ecosystem

Integration into Data Apps

www.revolutionanalytics.com/products

Page 31

Image © www.tinyplanetphotography.com

Page 32: And … the future?

Even more data

Cloud computing

Demand for Data Scientists

Diverging paradigms for data analytics

http://www.indeed.com/jobtrends

Page 33: Diverging data paradigms

[Diagram labels: Hadoop, NoSQL, Files, Clusters; Data Appliances; More data, better fault tolerance; Easier programming, better performance; Exploration, Modeling; Storage, Preprocessing; Production]

Page 34: Data Science in Production

Real-time Big Data Analytics: From Deployment to Production

Thursday, November 29, 2012, 10:00AM - 11:00AM Pacific Time

www.revolutionanalytics.com/news-events/free-webinars/

Page 35: Building Data Science Teams

DJ Patil in O’Reilly Radar: http://oreil.ly/I3H5fI

Statistics and Data Science graduates

Kaggle and Chorus

Revolution Analytics R Training: http://www.revolutionanalytics.com/services/training/

Page 36: Closing Thoughts

The Data Science process leads to more powerful and more useful models

Data Scientists need a technology platform to think about, explore, and model data

Revolution R Enterprise is R for Big Data

Page 37: Resources

Revolution R Enterprise: R for Big Data. www.revolutionanalytics.com/products

RHadoop: Connecting R and Hadoop. bit.ly/r-hadoop

Contact David Smith: [email protected], @revodavid, blog.revolutionanalytics.com

Page 38: Thank you.

www.revolutionanalytics.com, 650.646.9545, Twitter: @RevolutionR

The leading commercial provider of software and support for the popular open source R statistics language.