A Tour of the Data Science Process, a Case Study Using Movie Industry Data


Upload: domino-data-lab

Post on 05-Jan-2017


TRANSCRIPT

Page 1: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

A tour of the (a? many?) “data science process(es)”

Including a short case study using movie industry data.

Eduardo Ariño de la Rubia, Chief Data Scientist, Domino Data Lab

@earino

Page 3: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Rough Timeline of “Data Science” in My Life

1. 1996 - First account on a supercomputer (MasPar MP-1)

2. 1997 - Fell in love with Genetic Algorithms for job shop scheduling (PVM/MPI)

3. 1999 - Hired my first ML engineer (“I think aNNs may be useful for predicting users’ buying patterns.”)

4. 2003 - Expert / Fuzzy systems for accounting continuing education compliance

5. 2005 - ML (mostly aNNs) and Six Sigma statistical approaches for manufacturing

6. 2007 - Computer vision approaches for pre-press support (first “big” data, PB)

7. 2010 - ML for manufacturing automation (vision/job shop)

8. 2012 - A/B testing for effectiveness of new designs

9. 2013 - NLP for “jurisdictionally aware” obscenity detection

10. 2014 - Classifiers for “at risk” student intervention in Ed. Tech.

11. 2015 - Vendor (!)

Page 4: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

What Kind of Data Scientist Are You?

Page 5: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Me. Probably.

@drewconway

Page 11: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

@willynguen

Page 12: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Me. Definitely.

Page 17: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Thanks to:

@BecomingDataSci, @StephdeSilva, @josecamoessilva

Page 18: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

How Are You Doing Data Science?

Page 23: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

CRISP-DM: “de facto standard for developing data mining and knowledge discovery projects”

Between 2006 and 2008 a CRISP-DM 2.0 SIG was formed and there were discussions about updating the CRISP-DM process model. The current status of these efforts is not known. However, the original crisp-dm.org website cited in the reviews and the CRISP-DM 2.0 SIG website are both no longer active.

Page 24: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

© Szilard Pafka

Page 25: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

© UC Berkeley, Understanding Science

Page 26: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

© UC Berkeley, Understanding Science

Page 27: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

© UC Berkeley, Understanding Science

Page 28: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

© UC Berkeley, Understanding Science

Page 29: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

© UC Berkeley, Understanding Science

Page 30: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

© UC Berkeley, Understanding Science

Page 32: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

A Case Study: So How Much Would That Movie Bank?

Page 33: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

A Brief Disclaimer… I don’t, like… watch a lot of movies

Page 34: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

I have paid actual real money to see these movies in theaters. I should probably not be trusted.

Page 35: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

The IMDb Movie Dataset is pretty massive...

Page 36: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

And not exactly friendly...

Page 39: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Signs You’re Doing Data Science

1. The problem is poorly specified

2. The data is messy and unstructured

3. Unique entities aren’t
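One way to picture “unique entities aren’t”: the same movie can show up under several spellings. The sketch below is only an illustration in Python (the deck's own analysis used R). The records, the `normalize_title` rules, and the choice to match on title plus year are all assumptions, not the cleaning steps used in the talk.

```python
import re
import unicodedata
from collections import defaultdict

def normalize_title(title):
    """Reduce a title to a crude matching key (illustrative rules only)."""
    # Strip accents, e.g. "Amélie" -> "Amelie".
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().strip()
    # Move a trailing article to the front: "matrix, the" -> "the matrix".
    match = re.match(r"^(.*),\s*(the|a|an)$", text)
    if match:
        text = f"{match.group(2)} {match.group(1)}"
    # Replace punctuation with spaces, then collapse runs of whitespace.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def group_duplicates(records):
    """Group (title, year) records whose normalized key collides."""
    groups = defaultdict(list)
    for title, year in records:
        groups[(normalize_title(title), year)].append(title)
    return {key: titles for key, titles in groups.items() if len(titles) > 1}

# Made-up records showing typical spelling variants.
records = [
    ("The Matrix", 1999),
    ("Matrix, The", 1999),
    ("THE MATRIX ", 1999),
    ("Amélie", 2001),
    ("Amelie", 2001),
    ("Rocky", 1976),
]

duplicates = group_duplicates(records)
for key, titles in duplicates.items():
    print(key, titles)
```

Rules like these will both miss real duplicates and merge distinct movies (remakes released in the same year, for instance), so the output is a list of candidates to review rather than a final answer.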

Page 40: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Make Your Data Tidy

1. Each variable is a column

2. Each observation is a row

3. Each type of observational unit is a table

Tidy Manifesto:

1. Share data structures

2. Compose simple pieces

3. Embrace functional programming

4. Write for humans
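The first two tidy rules can be sketched with a small reshaping example. This is a Python illustration (the deck lists the R packages tidyr and dplyr for this kind of work); the table, the column names, and the `pivot_longer` helper are hypothetical. The “wide” input is untidy because its year column headers are really values of a `year` variable.

```python
# Hypothetical "wide" table: the year column headers are values, not variables.
wide_rows = [
    {"studio": "Studio A", "2014": 1.2, "2015": 1.5},
    {"studio": "Studio B", "2014": 0.8, "2015": 0.6},
]

def pivot_longer(rows, id_column, names_to, values_to):
    """Turn every non-id column into its own (name, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append({
                id_column: row[id_column],
                names_to: int(column),  # headers are years in this example
                values_to: value,
            })
    return long_rows

# After reshaping: one row per (studio, year) observation,
# one column per variable (studio, year, gross_billions).
tidy_rows = pivot_longer(wide_rows, "studio", "year", "gross_billions")
for row in tidy_rows:
    print(row)
```

Each output row is one observation, which makes later grouping, filtering, and plotting by `year` straightforward.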

Page 41: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Signs You’re Doing Data Science

1. You want to make your data “tidy”

2. Your data surprises you with its wrongness…

Can Anyone Guess What is Wrong???

Wait… what?

Page 43: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Movies seem to be making more and more of their money in worldwide receipts… It seems like the domestic gross is just about covering your costs...

Page 44: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Some Movies Make Amazing Multiples!

Page 45: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Interesting Factoids - Ratios

1. The movie with the highest “ratio” that won an Academy Award: Rocky

2. Braveheart had the lowest “ratio” (it made back 2.9 times its budget) of any 1990s movie that won the Academy Award for Best Picture

3. Sports movies and adult movies make back the best ratio. Crime and Western movies make the worst.

4. Movies that won an Academy Award make, on average, 132% more return per invested dollar at the box office than movies that don’t.
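The slides do not spell out how the “ratio” is computed. The sketch below assumes it is gross divided by production budget (consistent with a movie that “made 2.9” times its budget back), and that group comparisons average the per-movie ratios. The movies and numbers are invented purely to show the arithmetic, and the code is Python rather than the R used in the talk.

```python
from statistics import mean

# Invented figures (in millions) used only to illustrate the calculation.
movies = [
    {"title": "Film A", "budget": 10.0, "gross": 50.0, "won_award": True},
    {"title": "Film B", "budget": 20.0, "gross": 60.0, "won_award": True},
    {"title": "Film C", "budget": 40.0, "gross": 80.0, "won_award": False},
    {"title": "Film D", "budget": 50.0, "gross": 100.0, "won_award": False},
]

def gross_to_budget_ratio(movie):
    """Assumed definition of the ratio: gross per budget dollar."""
    return movie["gross"] / movie["budget"]

winner_ratio = mean(gross_to_budget_ratio(m) for m in movies if m["won_award"])
other_ratio = mean(gross_to_budget_ratio(m) for m in movies if not m["won_award"])

# "X% more return" is read here as the relative difference of the two means.
percent_more = (winner_ratio / other_ratio - 1) * 100

print(f"winners: {winner_ratio:.2f}x, others: {other_ratio:.2f}x")
print(f"winners return {percent_more:.0f}% more per budget dollar")
```

Under this reading, “132% more” would mean the award winners' average ratio is 2.32 times that of the other movies. Averaging per-movie ratios and dividing total gross by total budget can give different answers, so the definition matters.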

Page 46: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Prepare to be Disappointed...

> cor(movie_and_genre$`Production Budget`, movie_and_genre$`Worldwide Gross`)
[1] 0.7359111
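R's `cor` returns the Pearson correlation coefficient by default. As a rough aid to reading the number above, the Python sketch below implements the same formula on invented budget and gross figures, then squares the reported coefficient. In a simple linear fit, r squared is the share of variance explained, so one reading is that budget alone accounts for roughly 54% of the variation in worldwide gross. The sample data and variable names are assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Invented figures (in millions), not the dataset from the talk.
budgets = [10.0, 20.0, 30.0, 40.0]
grosses = [25.0, 30.0, 80.0, 70.0]
print(f"sample r = {pearson(budgets, grosses):.3f}")

# The coefficient reported on the slide, squared.
reported_r = 0.7359111
variance_explained = reported_r ** 2
print(f"r^2 = {variance_explained:.4f}")
```

Correlation only measures linear association, so this says nothing about whether a larger budget causes a larger gross.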

Page 47: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Analysis on the Analysis

Total Hours Spent - 4

Total Number of Commands - 497

Total Number of Packages Used - 12

RCurl

lubridate

ggplot2

h2o

readr

dplyr

rvest

tidyr

caret

shiny

stringr

purrr

Total Number of Plots Generated - 25

Total Data Downloaded - 3.4 GB

Total Number of Models Trained - 8

REMEMBER THIS GUY?!

Page 48: A Tour of the Data Science Process, a Case Study Using Movie Industry Data

Thanks for Your Time!