the rise of data science in the age of big data analytics: why data distillation and machine...

Post on 10-May-2015


DESCRIPTION

Big Data is important because we want to use it to make sense of our world. It’s tempting to think there’s some “magic bullet” for analyzing big data, but simple “data distillation” often isn’t enough, and unsupervised machine-learning systems can be dangerous. (Like, bringing-down-the-entire-financial-system dangerous.) Data Science is the key to unlocking insight from Big Data: by combining computer science skills with statistical analysis and a deep understanding of the data and the problem, we can not only make better predictions but also fill in gaps in our knowledge, and even find answers to questions we hadn’t thought of yet.

TRANSCRIPT

Revolution Confidential

The Rise of Data Science in the Age of Big Data Analytics: Why Data Distillation and Machine Learning Aren’t Enough

David M Smith, VP Marketing and Community, Revolution Analytics

Today, we’ll discuss:

• What is Data Science?
• Why machine learning isn’t enough
• Why Data Science works
• The Data Scientist’s Toolkit
• The Future of Big Data Analytics
• Closing thoughts and resources

© Dov Harrington, CC BY 2.0: http://www.flickr.com/photos/idovermani/4110546683/

Where is it safe to fish near San Francisco?

San Francisco Estuary Institute: http://www.sfei.org/tools/wqt

Hurricane Sandy

Bob Rudis: http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/

Hurricane Sandy

Ed Chen: http://blog.echen.me/hurricane-sandy-outages/

When did Michael Jackson have his biggest hits?

New York Times, June 25 2009 (3 hours after Michael Jackson’s death): http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html

Three Essential Skills of Data Scientists

Drew Conway: http://www.dataists.com/2010/09/the-data-science-venn-diagram/

[Diagram labels: Data Integration, Mashups, Applications; Models, Visualization, Predictions, Uncertainty; Problems, Data Sources, Credibility; Effective Data Applications]

Image © Abode of Chaos, CC BY 2.0: http://www.flickr.com/photos/home_of_chaos/6418989233/

Machine learning (ML) for predictions

[Diagram, three panels:

Building the Model: Features, Responses, ML, scoring rules

Validating the Model: Validation set, scoring rules, Predictions, Response, “Accuracy”

Scoring new data: New Data, scoring rules, Predictions (scores)]
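The three stages named on this slide (building the model, validating it, and scoring new data) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nearest-centroid classifier, the synthetic two-class data, and every function name below are hypothetical choices made for the sketch, not anything taken from the talk.

```python
import random

def fit_centroids(features, responses):
    """Build the model: learn one centroid (mean feature vector) per class."""
    sums, counts = {}, {}
    for x, y in zip(features, responses):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def score(centroids, x):
    """Scoring rule: predict the class whose centroid is nearest to x."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(centroids, key=lambda y: squared_distance(centroids[y]))

def accuracy(centroids, features, responses):
    """Validate: fraction of held-out responses that are predicted correctly."""
    hits = sum(score(centroids, x) == y for x, y in zip(features, responses))
    return hits / len(responses)

# Synthetic data (an assumption for illustration): two well-separated classes.
rng = random.Random(0)

def make_point(label):
    centre = 0.0 if label == "low" else 5.0
    return [rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)]

data = [(make_point(label), label) for label in ["low", "high"] * 100]
rng.shuffle(data)
train, valid = data[:150], data[150:]

# 1. Building the model: features + responses -> scoring rules.
model = fit_centroids([x for x, _ in train], [y for _, y in train])

# 2. Validating the model: compare predictions with held-out responses.
valid_accuracy = accuracy(model, [x for x, _ in valid], [y for _, y in valid])

# 3. Scoring new data: apply the same scoring rules to unseen features.
new_scores = [score(model, x) for x in [[0.2, -0.1], [4.8, 5.3]]]
```

Note that the "accuracy" computed in step 2 only says how well the scoring rules reproduce the held-out responses; it says nothing about whether the features or the question were the right ones, which is the gap the following slides address.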

Problem: A lack of perspective

Image © 2010 David M Smith. Some rights reserved, CC BY 2.0

Problem: Lack of credibility

Problem: Complexity

Data Science to the Rescue!

Answer Unasked Questions

Revolutions blog, “The Uncanny Valley of Big Data”: http://blog.revolutionanalytics.com/2012/02/the-uncanny-valley-of-big-data.html

Fill in knowledge gaps

“More data beats better algorithms, every time” – Google

“Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.” – Tim O’Reilly

Google Research, “The Unreasonable Effectiveness of Data”: http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html

Tim O’Reilly on Google+: https://plus.google.com/107033731246200681024/posts/4Xa76AtxYwd

TechnoCalifornia: http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html

Avoid ineffective reactions

Stupid Data Miner Tricks: http://nerdsonwallstreet.typepad.com/my_weblog/files/dataminejune_2000.pdf

[Chart label: S&P 500]

© Henricks Photos, CC BY-ND 2.0: http://www.flickr.com/photos/hendricksphotos/3240667626/

0. Data (Big & Messy)

1. A language for programming with data

Download the White Paper

R is Hot: bit.ly/r-is-hot

Grant awards to homeless veterans FY09

Data: Data.gov

Analysis: Drew Conway

[Slide annotations: User-defined functions, Internet API interface, XML parsing, Custom graphics, Data import and pre-processing, Iterative data processing]

2. Speed. Lots and lots of speed.

[Diagram labels: Variable Transformation, Model Estimation, Model Refinement, Model Comparison / Benchmarking, Feature Selection, Sampling, Aggregation, Data, Predictions]

Use all available computing cycles

[Diagram labels: Disk, Data, Shared Memory, Multicore Processor (4, 8, 16+ cores), Core 0 (Thread 0), Core 1 (Thread 1), Core 2 (Thread 2), Core n (Thread n)]

3. Algorithms that don’t choke on Big Data

PEMAs: Parallel External-Memory Algorithms

[Diagram labels: Big Data, Master Node, four Compute Nodes, four Data Partitions]
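The slide gives only the name "Parallel External-Memory Algorithms", so the following is a hedged sketch of the general idea rather than Revolution Analytics' implementation: process the data one chunk at a time, keep only small intermediate results per chunk, and combine them at the end. The example fits a straight line by least squares; the `Stats` class, the `read_chunks` helper, and the synthetic rows are all hypothetical names and data introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class Stats:
    """Small, combinable summary of one chunk (sufficient statistics)."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def __add__(self, other):
        # Combining two chunk summaries is plain addition, so chunks can be
        # summarised in any order, or on different workers, and merged later.
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)

def chunk_stats(chunk):
    """Summarise one chunk of (x, y) rows without keeping the rows."""
    stats = Stats()
    for x, y in chunk:
        stats.n += 1
        stats.sx += x
        stats.sy += y
        stats.sxx += x * x
        stats.sxy += x * y
    return stats

def read_chunks(rows, size):
    """Stand-in for reading a large file block by block (assumption)."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def fit_line(chunks):
    """Least-squares slope and intercept from the combined chunk summaries."""
    total = Stats()
    for chunk in chunks:
        total = total + chunk_stats(chunk)
    slope = ((total.n * total.sxy - total.sx * total.sy)
             / (total.n * total.sxx - total.sx ** 2))
    intercept = (total.sy - slope * total.sx) / total.n
    return slope, intercept

# Synthetic rows lying exactly on y = 2x + 1, processed 100 rows at a time.
rows = [(float(x), 2.0 * x + 1.0) for x in range(1000)]
slope, intercept = fit_line(read_chunks(rows, 100))
```

Only one chunk is held in memory at a time, and because each chunk's summary is independent of the others, the `chunk_stats` calls are the part that could be spread across cores or compute nodes.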

Drink less coffee!

[Chart labels: Single-threaded, non-optimized algorithms; Optimized, parallelized algorithms]

4. Move code to data (not vice versa)

Map-Reduce

RHadoop: http://bit.ly/RHadoop
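As a rough illustration of the map-reduce pattern named on this slide, the toy word count below runs in a single Python process. It does not use Hadoop or RHadoop; the function names and the two small "partitions" are assumptions made for the sketch. The point it tries to show is that the map step runs against each partition where it sits, and only the small emitted key/value pairs are moved and combined.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)

def map_reduce(partitions):
    """Run map on every partition, then shuffle and reduce the results."""
    emitted = []
    for partition in partitions:      # in a cluster, one mapper per partition
        for record in partition:
            emitted.extend(map_phase(record))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(emitted).items())

# Two hypothetical data partitions, as if stored on different nodes.
partitions = [
    ["big data", "data science"],
    ["Data beats algorithms"],
]
word_counts = map_reduce(partitions)
```

In a real cluster the loop over partitions would be carried out by the framework on the nodes that hold the data, which is what "move code to data" refers to.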

Big Data Appliances

More info: http://bit.ly/R-Netezza

Play Nice with Others

Presentation Layer

• Business Intelligence Tools
• Web-based data apps
• Reporting / Spreadsheets

Analytics Layer

• R

Data Layer

• Relational datastores
• Unstructured datastores

What every data scientist needs

Open-Source R

Revolution R Enterprise

Interface with multiple data sources

Exploratory data analysis

Wide range of statistical methods

High-speed computation

Big Data support

Data/code locality (Hadoop, etc.)

Print-quality data visualization

Scheduled batch production

Works in a multi-tool ecosystem

Integration into Data Apps

Revolution R Enterprise: Big-Data R

[Same feature checklist as the previous slide, comparing Open-Source R and Revolution R Enterprise]

www.revolutionanalytics.com/products

Image © www.tinyplanetphotography.com

And … the future?

Even more data

Cloud computing

Demand for Data Scientists

Diverging paradigms for data analytics

http://www.indeed.com/jobtrends

Diverging data paradigms

[Diagram labels: Hadoop, NoSQL, Files, Clusters, Data Appliances; More data, better fault tolerance; Easier programming, better performance; Exploration, Modeling, Storage, Preprocessing, Production]

Data Science in Production

Real-time Big Data Analytics: From Deployment to Production

Thursday, November 29, 2012, 10:00AM - 11:00AM Pacific Time

www.revolutionanalytics.com/news-events/free-webinars/

Building Data Science Teams

DJ Patil in O’Reilly Radar: http://oreil.ly/I3H5fI

Statistics and Data Science graduates

Kaggle and Chorus

Revolution Analytics R Training: http://www.revolutionanalytics.com/services/training/

Closing Thoughts

The Data Science process leads to more powerful and more useful models

Data Scientists need a technology platform to think about, explore, and model data

Revolution R Enterprise is R for Big Data

Resources

Revolution R Enterprise: R for Big Data www.revolutionanalytics.com/products

RHadoop: Connecting R and Hadoop bit.ly/r-hadoop

Contact David Smith: david@revolutionanalytics.com, @revodavid, blog.revolutionanalytics.com

Thank you.

www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR

The leading commercial provider of software and support for the popular open source R statistics language.
