![Page 1: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/1.jpg)
The Wild West of Data Wrangling
Sarah Guido PyCon 2017
@sarah_guido
![Page 2: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/2.jpg)
This talk:
• A day in the life
• Three examples of dealing with uncooperative data
• Not ground truth!
![Page 3: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/3.jpg)
Who am I?
• Senior data scientist at Mashable
• Mashable == internet culture media!
• Data sciencing in Python
• Twitter: @sarah_guido
![Page 4: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/4.jpg)
Iris Dataset
![Page 5: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/5.jpg)
Iris Dataset
![Page 6: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/6.jpg)
![Page 7: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/7.jpg)
![Page 8: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/8.jpg)
Example 1: Predicting building sales
• The problem: can we predict if a building will sell the following year?
• The data: floors, location, square footage, price per square foot, etc.
• The goal: provide valuable insight to platform users
![Page 9: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/9.jpg)
Example 1: Predicting building sales
• First thought: logistic regression using scikit-learn
• Binary classification: sale/no sale
![Page 10: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/10.jpg)
Problem…
Data: 95% no sale, 5% sale
Logistic regression: 95% accurate
DONE!
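Why "95% accurate" is not actually done can be shown with a small sketch. The labels below are hypothetical and only mirror the 95/5 split on the slide; the "model" is a stand-in that always predicts the majority class, not the logistic regression from the talk.

```python
from collections import Counter

# Hypothetical labels matching the slide's split: 95% "no sale" (0), 5% "sale" (1).
y_true = [0] * 95 + [1] * 5

# A stand-in "model" that ignores its input and always predicts the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

# Accuracy: share of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the "sale" class: share of actual sales the model finds.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_sale = true_pos / sum(t == 1 for t in y_true)

print(f"accuracy: {accuracy:.2f}")        # accuracy: 0.95
print(f"sale recall: {recall_sale:.2f}")  # sale recall: 0.00
```

The accuracy matches the slide, yet not a single sale is predicted, which is the case the platform users actually care about.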
![Page 11: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/11.jpg)
![Page 12: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/12.jpg)
Problem: Class imbalance
Class imbalance
When the classes you are trying to predict are not equally represented, classification models can become biased toward the majority class.
![Page 13: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/13.jpg)
Solution: Gradient boosting
Gradient boosting
Produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
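To make "an ensemble of weak prediction models" concrete, here is a minimal sketch of boosting with one-split decision stumps on a single feature, using squared-error residuals. It is an illustration under simplifying assumptions, not the model from the talk and not scikit-learn's implementation; the function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Weak learner: one threshold split minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # start from the mean label
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_boosting(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Hypothetical toy data: one feature, mostly "no sale" (0) with a few "sale" (1).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 0, 0, 1, 1]

model = fit_boosting(xs, ys)
print(predict_boosting(model, 2) > 0.5)  # False
print(predict_boosting(model, 8) > 0.5)  # True
```

Each stump on its own is a poor predictor; the ensemble improves because every new stump corrects what the previous ones still get wrong, including on the minority examples.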
![Page 14: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/14.jpg)
Example 2: Clustering user interactions
• The problem: how can we identify similar patterns based on click data?
• The data: time, geolocation, cookie, browser user-agent string, referrer
• The goal: understand how people interact with content over time
![Page 15: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/15.jpg)
Why Scala?
![Page 16: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/16.jpg)
Problem: Clustering user interactions
K-means clustering
An unsupervised learning method of grouping data together based on a distance metric.
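A minimal sketch of k-means with Euclidean distance as the distance metric. The points are hypothetical per-user features (number of interactions, average days between interactions), and the starting centroids are picked by hand so the example is reproducible; real implementations typically initialize randomly or with k-means++, and scale features first.

```python
import math


def kmeans(points, init_centroids, n_iters=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(init_centroids)
    k = len(centroids)
    for _ in range(n_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            # Keep the old centroid if a cluster ends up empty.
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    return centroids, labels


# Hypothetical users: (number of interactions, average days between them).
points = [(5, 8.25), (6, 7.0), (5, 9.0), (40, 1.0), (38, 1.5), (42, 0.8)]

centroids, labels = kmeans(points, init_centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Occasional visitors and frequent visitors land in separate clusters because they are far apart under the chosen distance metric.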
![Page 17: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/17.jpg)
Problem: Clustering the data
• Only look at users with 5 or more interactions
• Each user has a different number of interactions
• Each data point ends up in a different cluster
![Page 18: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/18.jpg)
![Page 19: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/19.jpg)
![Page 20: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/20.jpg)
![Page 21: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/21.jpg)
![Page 22: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/22.jpg)
Solution: Transform the data
![Page 23: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/23.jpg)
Solution: Transform the data
date: 2017-04-09, 2017-04-13, 2017-04-30, 2017-05-01, 2017-05-12
Number of interactions: 5
Average time between interactions: ~8 days
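The transformation on this slide can be sketched with the standard library: collapse a user's variable-length list of interaction dates into a fixed-length summary. The function and key names are hypothetical; the dates are the ones from the slide.

```python
from datetime import date

# Interaction dates for one user, taken from the slide.
interaction_dates = [
    date(2017, 4, 9),
    date(2017, 4, 13),
    date(2017, 4, 30),
    date(2017, 5, 1),
    date(2017, 5, 12),
]


def summarize_interactions(dates):
    """Turn a list of dates into a count and an average gap in days."""
    ordered = sorted(dates)
    gaps = [(later - earlier).days
            for earlier, later in zip(ordered, ordered[1:])]
    avg_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return {"n_interactions": len(ordered), "avg_days_between": avg_gap}


features = summarize_interactions(interaction_dates)
print(features)  # {'n_interactions': 5, 'avg_days_between': 8.25}
```

The gaps are 4, 17, 1, and 11 days, so the average is 33 / 4 = 8.25, which the slide rounds to about 8 days. Every user now has the same number of features regardless of how many interactions they had.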
![Page 24: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/24.jpg)
Solution: Transform the data
referrer: facebook, twitter
One-hot encode and transform to matrix
• Facebook: [1, 0]
• Twitter: [0, 1]
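A minimal sketch of the one-hot step in plain Python. The helper name and the sample referrer list are hypothetical; the category order follows the slide, so facebook maps to [1, 0] and twitter to [0, 1].

```python
def one_hot_encode(values, categories):
    """Map each value to a row with a 1 in its category's column."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows


categories = ["facebook", "twitter"]
referrers = ["facebook", "twitter", "twitter", "facebook"]  # hypothetical clicks

matrix = one_hot_encode(referrers, categories)
print(matrix)  # [[1, 0], [0, 1], [0, 1], [1, 0]]
```

Encoding categories this way avoids implying an order or distance between referrers, which matters for a distance-based method like k-means.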
![Page 25: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/25.jpg)
Solution: Transform the data
![Page 26: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/26.jpg)
Example 3: Understand audience composition
• The problem: how can we effectively describe our audience?
• The data: anonymized demographic and psychographic data
• The goal: audience segmentation and channel analysis
![Page 27: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/27.jpg)
Problem: insufficient data
• Google Analytics data – 1/3 of URLs
• Finicky API
• Semi-useless psychographic data
![Page 28: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/28.jpg)
Solution: accept defeat
![Page 29: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/29.jpg)
Solution: ~~accept defeat~~ make it work!
![Page 30: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/30.jpg)
Solution: make it work!
• Theory of highly performant links
• Segmentation through archetypal analysis
• Go get more data!
![Page 31: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/31.jpg)
General strategy
• What problem are you trying to solve?
• What’s wrong with your data?
• What do you need that you don’t have?
![Page 32: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/32.jpg)
Keep in mind…
• Data your company collects is complicated
• What you do to your data will affect the model
• Creativity is your friend
• Lots of ways to solve the problem
![Page 33: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/33.jpg)
![Page 34: The Wild West of Data Wrangling](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a656de77f8b9a06748b48f5/html5/thumbnails/34.jpg)
Thank you!
@sarah_guido