Winning With Big Data: Secrets of the Successful Data Scientist
DESCRIPTION
The world is experiencing an Industrial Revolution of Data. In any given minute, the machines around us are tracking billions of mouse clicks, credit card swipes, and GPS coordinates. Increasingly, this data is being saved, aggregated, and analyzed. These massive data flows present big challenges to firms, but also new opportunities for deriving insights. Presented at the June 2010 gathering of the Bay Area's Business Intelligence Special Interest Group.
TRANSCRIPT
![Page 1: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/1.jpg)
WINNING WITH BIG DATA
Secrets of the Successful Data Scientist
Michael Driscoll
@dataspora
SDForum BI SIG
June 15, 2010
![Page 2: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/2.jpg)
WHY DATA MATTERS NOW
![Page 3: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/3.jpg)
THE INDUSTRIAL AGE OF DATA
![Page 4: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/4.jpg)
WHAT IS BIG DATA?
Data that is distributed.

| class | size | manage with | how it fits | examples |
| --- | --- | --- | --- | --- |
| small | < 10 GB | Excel, R | fits in one machine’s memory | thousands of sales figures |
| medium | 10 GB - 1 TB | indexed files, monolithic DB | fits on one machine’s disk | millions of web pages |
| big | > 1 TB | Hadoop, distributed DBs | stored across many machines | billions of web clicks |
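
The size classes above can be read as a simple rule of thumb. The following is a minimal sketch, not something shown in the talk: the function name `size_class`, the use of decimal units (1 GB = 10^9 bytes), the treatment of the exact boundaries, and the sample sizes are all assumptions made for illustration.

```python
GB = 10**9   # assumption: decimal gigabytes
TB = 10**12  # assumption: decimal terabytes

def size_class(n_bytes: int) -> str:
    """Return 'small', 'medium', or 'big' using the slide's thresholds."""
    if n_bytes < 10 * GB:
        return "small"    # fits in one machine's memory
    if n_bytes <= 1 * TB:
        return "medium"   # fits on one machine's disk
    return "big"          # stored across many machines

# Hypothetical dataset sizes, chosen only to exercise each class.
for label, n in [("sales figures", 2 * GB), ("web pages", 300 * GB), ("web clicks", 5 * TB)]:
    print(label, "->", size_class(n))
```

The thresholds are a heuristic: what fits in memory or on one disk depends on the machine, so the constants are worth adjusting to the hardware at hand.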
![Page 5: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/5.jpg)
WHAT IS DATA SCIENCE?
![Page 6: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/6.jpg)
WHY DATA SCIENCE IS SEXY
![Page 7: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/7.jpg)
+ =
“The sexy job in the next ten years will be statisticians…” - Hal Varian
![Page 8: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/8.jpg)
![Page 9: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/9.jpg)
data: 1000 bytes
model: 2 bytes
![Page 10: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/10.jpg)
9 WAYS TO WIN WITH DATA
![Page 11: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/11.jpg)
1. CHOOSE THE RIGHT TOOL
You don’t need a chainsaw to cut butter.
![Page 12: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/12.jpg)
2. COMPRESS EVERYTHING
The world is IO-bound.
mysqldump -u myuser -pmypass sourceDB | \
  gzip | ssh [email protected] "cat - | \
  gunzip | mysql -u myuser -pmypass targetDB"
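
To get a feel for why compressing before a transfer pays off when IO is the bottleneck, here is a minimal sketch using Python's standard `gzip` module. The `sales` table and its rows are hypothetical stand-ins for a database dump, and `compression_ratio` is a helper name invented for this example; actual ratios depend entirely on the data.

```python
import gzip

def compression_ratio(payload: bytes) -> float:
    """Return the original size divided by the gzip-compressed size."""
    return len(payload) / len(gzip.compress(payload))

# Hypothetical dump-like text: repetitive INSERT statements compress well.
rows = "".join(
    f"INSERT INTO sales VALUES ({i}, 'widget', 9.99);\n" for i in range(10_000)
).encode("utf-8")

packed = gzip.compress(rows)
assert gzip.decompress(packed) == rows  # the round trip is lossless

print(f"{len(rows)} bytes -> {len(packed)} bytes "
      f"({compression_ratio(rows):.1f}x smaller)")
```

Fewer bytes on the wire means less time waiting on the network or disk, at the cost of some CPU on both ends, which is the trade the pipeline above makes.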
![Page 13: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/13.jpg)
3. SPLIT UP YOUR DATA
Split, apply, combine.
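
As an illustration of split, apply, combine, the sketch below groups records by a key, applies a function to each group, and collects the results. It uses only the standard library; the function name `split_apply_combine` and the regional sales figures are hypothetical and not taken from the talk.

```python
from collections import defaultdict

def split_apply_combine(records, key, apply):
    """Group records by key(record), apply a function to each group,
    and combine the per-group results into one dict."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)                 # split
    return {k: apply(v) for k, v in groups.items()}  # apply, then combine

# Hypothetical (region, amount) sales records.
sales = [("west", 120.0), ("east", 80.0), ("west", 30.0),
         ("east", 20.0), ("north", 50.0)]

totals = split_apply_combine(
    sales,
    key=lambda rec: rec[0],
    apply=lambda rows: sum(amount for _, amount in rows),
)
print(totals)
```

Because each group is processed independently, the apply step is the natural place to parallelize when the groups live on different machines.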
![Page 14: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/14.jpg)
4. WORK WITH SAMPLES
Big Data is heavy, samples are light.
perl -ne 'print if (rand() < 0.01)' data.csv > sample.csv
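
The same idea as the one-liner above, keeping each line independently with 1% probability, can be sketched in Python. The function name `sample_lines`, the `seed` parameter, and the generated rows are assumptions for illustration; a fixed seed is added only so the sample is reproducible.

```python
import random

def sample_lines(lines, rate=0.01, seed=None):
    """Yield each line independently with probability `rate`."""
    rng = random.Random(seed)  # private generator, so a seed gives a repeatable sample
    for line in lines:
        if rng.random() < rate:
            yield line

# Hypothetical stand-in for the rows of a large CSV file.
rows = [f"{i},value{i}\n" for i in range(100_000)]
sample = list(sample_lines(rows, rate=0.01, seed=42))
print(len(sample), "of", len(rows), "rows kept")
```

Since `sample_lines` accepts any iterable of lines, an open file handle can be passed in place of `rows`, so the full file never has to be loaded into memory. Note that this kind of sampling treats a CSV header like any other line, so the header usually needs to be copied over separately.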
![Page 15: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/15.jpg)
5. USE STATISTICS
![Page 16: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/16.jpg)
6. COPY FROM OTHERS
Use open source.
git clone git://github.com/kevinweil/hadoop-lzo
![Page 17: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/17.jpg)
7. ESCHEW CHART TYPOLOGIES
Charts are compositions, not containers.
![Page 18: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/18.jpg)
8. COLOR WITH CARE
Color can enhance or insult.
![Page 19: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/19.jpg)
9. TELL A STORY
People are listening.
![Page 20: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/20.jpg)
ONE SUCCESS STORY
![Page 21: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/21.jpg)
WHY DO TELCO CUSTOMERS LEAVE?
Sign up Leave
Goal: “less churn.”
![Page 22: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/22.jpg)
DATA: BILLIONS OF CALLS
… and millions of callers.
![Page 23: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/23.jpg)
DOES CALL QUALITY MATTER?
… a difference, but not significant.
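
One common way to check whether a difference in churn between two groups is statistically significant is a two-proportion z-test. The slides do not say which test was used, so the sketch below is a swapped-in technique, and the customer counts are hypothetical numbers chosen only to show a difference that is not significant.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1 of n1 customers churned in group 1, x2 of n2 in group 2.
    Returns (z, p_value), using the pooled standard error.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Hypothetical counts: churners among customers with poor vs. good call quality.
z, p = two_proportion_z_test(33, 1000, 30, 1000)
print(f"z = {z:.2f}, p = {p:.2f}")
if p >= 0.05:
    print("a difference, but not significant at the 5% level")
```

With samples in the millions, even tiny differences can become statistically significant, so the size of the effect deserves a look alongside the p-value.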
![Page 24: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/24.jpg)
WHAT ABOUT SOCIAL NETWORKS?
Hmmm...
![Page 25: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/25.jpg)
BUILD THE CALL GRAPH
… but is it predictive?
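
A call graph can be represented as an adjacency map from each caller to the set of people they have exchanged calls with. The sketch below assumes call records arrive as `(caller, callee)` pairs and treats calls as undirected links; the record format, the function name `build_call_graph`, and the names in the sample data are all hypothetical.

```python
from collections import defaultdict

def build_call_graph(call_records):
    """Build an undirected graph: person -> set of people they exchanged calls with."""
    graph = defaultdict(set)
    for caller, callee in call_records:
        if caller != callee:           # ignore self-calls
            graph[caller].add(callee)
            graph[callee].add(caller)
    return dict(graph)

# Hypothetical call records; repeated calls collapse into a single edge.
calls = [("alice", "bob"), ("bob", "carol"), ("alice", "bob"), ("dave", "alice")]
graph = build_call_graph(calls)
for person in sorted(graph):
    print(person, "->", sorted(graph[person]))
```

At the scale of billions of calls the same structure would be built with a distributed job rather than in one process, but the shape of the result is the same.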
![Page 26: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/26.jpg)
EVOLUTION OF A CALL GRAPH
April
![Page 27: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/27.jpg)
EVOLUTION OF A CALL GRAPH
May
![Page 28: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/28.jpg)
EVOLUTION OF A CALL GRAPH
June
![Page 29: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/29.jpg)
EVOLUTION OF A CALL GRAPH
July
![Page 30: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/30.jpg)
700% INCREASE IN CHURN
when a cancellation occurs in a call network.
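
A 700% increase means the churn rate among customers whose call network contains a cancellation is eight times the baseline rate. The sketch below shows one way such a comparison could be computed from a call graph; it is not the author's method, and the function names, the toy graph, and the churn sets are hypothetical.

```python
def percent_increase(exposed_rate, baseline_rate):
    """Relative increase of exposed_rate over baseline_rate, in percent."""
    return (exposed_rate - baseline_rate) / baseline_rate * 100.0

def churn_rates(graph, cancelled_earlier, churned_now):
    """Return (exposed_rate, baseline_rate) among customers who had not yet cancelled.

    A customer is 'exposed' if at least one of their contacts cancelled earlier.
    """
    exposed, unexposed = [], []
    for customer, contacts in graph.items():
        if customer in cancelled_earlier:
            continue  # already gone, cannot churn again
        if contacts & cancelled_earlier:
            exposed.append(customer)
        else:
            unexposed.append(customer)

    def rate(group):
        return sum(c in churned_now for c in group) / len(group) if group else 0.0

    return rate(exposed), rate(unexposed)

# Hypothetical toy call graph: customer -> set of contacts.
toy_graph = {
    "a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e"}, "g": {"h"}, "h": {"g"},
}
exposed_rate, baseline_rate = churn_rates(
    toy_graph, cancelled_earlier={"a"}, churned_now={"b", "h"}
)
print(f"exposed {exposed_rate:.0%}, baseline {baseline_rate:.0%}, "
      f"increase {percent_increase(exposed_rate, baseline_rate):.0f}%")

# Worked arithmetic: an 8% rate against a 1% baseline is a 700% increase.
print(f"{percent_increase(0.08, 0.01):.0f}%")
```

The toy graph is far too small to support any conclusion; it only shows how the exposed and baseline groups are separated before the rates are compared.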
![Page 31: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/31.jpg)
FINAL THOUGHTS
![Page 32: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/32.jpg)
THE BIG DATA STACK
Big Data Dedicated RDBMS
Analytics (R, SPSS, SAS, SAP)
Data Products (Content Filters, Rec Engines)
Data
Actions
Insights
![Page 33: Winning With Big Data: Secrets of the Successful Data Scientist](https://reader033.vdocuments.site/reader033/viewer/2022061206/5482bfaab4af9f513f8b4840/html5/thumbnails/33.jpg)
THANKS! QUESTIONS?
Michael [email protected]
@dataspora on Twitter
http://www.dataspora.com/blog
SDForum BI SIG
June 15, 2010