The Road to Data Science - Joel Grus, June 2015
TRANSCRIPT
![Page 1: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/1.jpg)
Joel Grus
Seattle DAML Meetup
June 23, 2015
Data Science from Scratch
![Page 2: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/2.jpg)
About me
Old-school DAML-er
Wrote a book
SWE at Google
Formerly data science at VoloMetrix, Decide, Farecast
![Page 3: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/3.jpg)
The Road to Data Science
![Page 4: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/4.jpg)
The (My) Road to Data Science
![Page 5: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/5.jpg)
![Page 6: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/6.jpg)
Grad School
![Page 7: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/7.jpg)
![Page 8: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/8.jpg)
![Page 9: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/9.jpg)
Fareology
![Page 10: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/10.jpg)
Data Science Is A Broad Field
Some Stuff
More Stuff
Even More Stuff
Data Science
People who think they're data scientists, but they're not really data scientists
People who are a danger to everyone around them
People who say "machine learnings"
![Page 11: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/11.jpg)
![Page 12: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/12.jpg)
a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers.
JOEL GRUS
![Page 33: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/33.jpg)
A lot of stuff!
![Page 34: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/34.jpg)
What Are Hiring Managers Looking For?
Let's check LinkedIn
![Page 36: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/36.jpg)
![Page 37: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/37.jpg)
a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers.
JOEL GRUS
grad students!
![Page 38: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/38.jpg)
Learning Data Science
![Page 39: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/39.jpg)
I want to be a data scientist.
Great!
![Page 40: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/40.jpg)
The Math Way
I like to start with matrix decompositions. How's your measure theory?
![Page 41: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/41.jpg)
The Math Way
The Good:
Solid foundation
Math is the noblest known pursuit
The Bad:
Some weirdos don't think math is fun
Can be pretty forbidding
Can miss practical skills
![Page 43: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/43.jpg)
So, did you count the words in that document?
No, but I have an elegant proof that the number of words is finite!
![Page 44: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/44.jpg)
OK, Let's Try Again
![Page 45: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/45.jpg)
I want to be a data scientist.
Great!
![Page 46: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/46.jpg)
The Tools Way
Here's a list of the 25 libraries you really ought to know. How's your R programming?
![Page 47: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/47.jpg)
The Tools Way
The Good:
Don't have to understand the math
Practical
Can get started doing fun stuff right away
The Bad:
Don't have to understand the math
Can get started doing bad science right away
![Page 49: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/49.jpg)
So, did you build that model?
Yes, and it fits the training data almost perfectly!
![Page 50: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/50.jpg)
OK, Maybe Not That Either
![Page 51: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/51.jpg)
So Then What?
![Page 52: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/52.jpg)
Example: k-means clustering
Unsupervised machine learning technique
Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares
i.e. in a way such that the clusters are as "small" as possible (for a particular conception of "small")
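
To make "within-cluster sum-of-squares" concrete, here is a minimal Python sketch. The function names (`squared_distance`, `cluster_mean`, `within_cluster_sum_of_squares`) and the toy data are assumptions made for illustration and are not taken from the slides. It scores a clustering by summing, over every cluster, the squared distance from each point to its cluster's mean.

```python
def squared_distance(p, q):
    """Sum of squared coordinate differences between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def cluster_mean(cluster):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(cluster)
    return [sum(coords) / n for coords in zip(*cluster)]


def within_cluster_sum_of_squares(clusters):
    """Total squared distance from each point to the mean of its cluster."""
    total = 0.0
    for cluster in clusters:
        if not cluster:
            continue  # an empty cluster contributes nothing
        center = cluster_mean(cluster)
        total += sum(squared_distance(point, center) for point in cluster)
    return total


# Two ways of splitting the same four points into k = 2 clusters.
tight = [[(0, 0), (0, 2)], [(10, 10), (10, 12)]]
loose = [[(0, 0), (10, 10)], [(0, 2), (10, 12)]]

print(within_cluster_sum_of_squares(tight))
print(within_cluster_sum_of_squares(loose))
```

The grouping that keeps nearby points together gets the lower score, which is the sense of "small" that k-means tries to minimize.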
![Page 53: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/53.jpg)
![Page 54: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/54.jpg)
The Math Way
![Page 55: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/55.jpg)
The Math Way
![Page 56: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/56.jpg)
The Tools Way

    # a 2-dimensional example
    x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
    colnames(x) <- c("x", "y")
    (cl <- kmeans(x, 2))
    plot(x, col = cl$cluster)
    points(cl$centers, col = 1:2, pch = 8, cex = 2)
![Page 57: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/57.jpg)
The Tools Way

    >>> from sklearn import cluster, datasets
    >>> iris = datasets.load_iris()
    >>> X_iris = iris.data
    >>> y_iris = iris.target

    >>> k_means = cluster.KMeans(n_clusters=3)
    >>> k_means.fit(X_iris)
    KMeans(copy_x=True, init='k-means++', ...
    >>> print(k_means.labels_[::10])
    [1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
    >>> print(y_iris[::10])
    [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
![Page 58: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/58.jpg)
So What To Do?
![Page 59: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/59.jpg)
Bootcamps?
![Page 60: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/60.jpg)
Data Science from Scratch
This is to certify that Joel Grus has honorably completed the course of study outlined in the book Data Science from Scratch: First Principles with Python, and is entitled to all the Rights, Privileges, and Honors thereunto appertaining.
Joel Grus
June 23, 2015
Certificate Programs?
![Page 61: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/61.jpg)
Hey! Data scientists!
![Page 62: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/62.jpg)
Learning By Building
You don't really understand something until you build it
For example, I understand garbage disposals much better now that I had to replace one that was leaking water all over my kitchen
More relevantly, I thought I understood hypothesis testing, until I tried to write a book chapter + code about it.
![Page 63: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/63.jpg)
Learning By Building
Functional Programming
![Page 64: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/64.jpg)
Break Things Down Into Small Functions
![Page 65: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/65.jpg)
So you don't end up with something like this
![Page 66: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/66.jpg)
Don't Mutate
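
One way to read this slide, sketched in Python. The example is hypothetical and not from the talk, and the names `normalize_in_place` and `normalized` are made up for illustration. The first function overwrites its argument; the second returns a new list and leaves the caller's data untouched.

```python
def normalize_in_place(xs):
    """Mutating version: overwrites the caller's list with shares of the total."""
    total = sum(xs)
    for i in range(len(xs)):
        xs[i] = xs[i] / total


def normalized(xs):
    """Non-mutating version: returns a new list and leaves xs alone."""
    total = sum(xs)
    return [x / total for x in xs]


counts = [1, 1, 2]
shares = normalized(counts)   # counts still holds the raw values

scratch = [1, 1, 2]
normalize_in_place(scratch)   # the raw values in scratch are gone
```

With the non-mutating version, `counts` can still be reused after the call, and the function can be tested by looking only at its return value.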
![Page 67: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/67.jpg)
Example: k-means clustering
Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares
Global optimization is hard, so use a greedy iterative approach
![Page 68: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/68.jpg)
Fun Motivation: Image Posterization
Image consists of pixels
Each pixel is a triplet (R,G,B)
Imagine pixels as points in space
Find k clusters of pixels
Recolor each pixel to its cluster mean
I think it's fun, anyway
8 colors
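
The posterization steps can be sketched roughly as follows. This is an illustrative sketch, not the code used to produce the slide's image: the function names, the fixed random seed, and the four-pixel "image" are all assumptions, and pixels are assumed to be (R, G, B) tuples.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two (R, G, B) triplets."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def closest(pixel, means):
    """Return the mean color nearest to the given pixel."""
    return min(means, key=lambda m: squared_distance(pixel, m))


def mean_color(pixels):
    """Channel-wise average of a non-empty list of pixels."""
    n = len(pixels)
    return tuple(sum(channel) / n for channel in zip(*pixels))


def find_means(pixels, k, num_iters=10, seed=0):
    """Greedy k-means over the pixels; may return fewer than k colors."""
    rng = random.Random(seed)          # seeded so repeated runs agree
    means = rng.sample(pixels, k)      # start from k of the pixels
    for _ in range(num_iters):
        clusters = {}
        for pixel in pixels:
            clusters.setdefault(closest(pixel, means), []).append(pixel)
        # means that attracted no pixels simply drop out
        means = [mean_color(cluster) for cluster in clusters.values()]
    return means


def posterize(pixels, k):
    """Recolor every pixel to its (rounded) cluster mean."""
    means = find_means(pixels, k)
    return [tuple(round(c) for c in closest(p, means)) for p in pixels]


# A toy four-pixel "image": two reddish and two bluish pixels.
pixels = [(250, 0, 0), (255, 5, 5), (0, 0, 250), (5, 5, 255)]
result = posterize(pixels, 2)
print(result)
```

Because the procedure is greedy and starts from randomly chosen pixels, which colors end up in the output depends on the starting sample; the only guarantee here is that at most k distinct colors remain.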
![Page 69: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/69.jpg)
Example: k-means clustering
given some points, find k clusters by

    choose k "means"
    repeat:
        assign each point to cluster of closest "mean"
        recompute mean of each cluster

sounds simple! let's code!
![Page 70: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/70.jpg)

    import random

    # mean(...) is the vector-mean helper listed in the recap slide
    def k_means(points, k, num_iters=10):
        # start with k randomly chosen points
        means = list(random.sample(points, k))
        # start with no cluster assignments
        assignments = [None for _ in points]

        # for each iteration
        for _ in range(num_iters):
            # assign each point to closest mean: for each point
            for i, point_i in enumerate(points):
                d_min = float('inf')
                # for each mean, compute the distance
                for j, mean_j in enumerate(means):
                    d = sum((x - y)**2 for x, y in zip(point_i, mean_j))
                    # assign the point to the cluster of the mean
                    # with the smallest distance
                    if d < d_min:
                        d_min = d
                        assignments[i] = j

            # recompute means
            for j in range(k):
                # find the points in each cluster
                cluster = [point for i, point in enumerate(points)
                           if assignments[i] == j]
                # and compute the new means
                means[j] = mean(cluster)

        return means

Not impenetrable, but a lot less helpful than it could be
Can we make it simpler?
![Page 82: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/82.jpg)
Break Things Down Into Small Functions
![Page 83: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/83.jpg)

    def k_means(points, k, num_iters=10):
        # start with k of the points as "means"
        means = random.sample(points, k)

        # and iterate finding new means
        for _ in range(num_iters):
            means = new_means(points, means)

        return means
![Page 84: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/84.jpg)
def new_means(points, means):
    # assign points to clusters
    # each cluster is just a list of points
    clusters = assign_clusters(points, means)

    # return the cluster means
    return [mean(cluster) for cluster in clusters]
![Page 85: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/85.jpg)
def assign_clusters(points, means):
    # one cluster for each mean
    # each cluster starts empty
    clusters = [[] for _ in means]

    # assign each point to cluster
    # corresponding to closest mean
    for point in points:
        index = closest_index(point, means)
        clusters[index].append(point)

    return clusters
![Page 86: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/86.jpg)
def closest_index(point, means):
    # return index of closest mean
    return argmin(distance(point, mean) for mean in means)

def argmin(xs):
    # return index of smallest element
    return min(enumerate(xs), key=lambda pair: pair[1])[0]
![Page 87: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/87.jpg)
To Recap
k_means(points, k, num_iters=10)
mean(points)

k_means(points, k, num_iters=10)
new_means(points, means)
assign_clusters(points, means)
closest_index(point, means)
argmin(xs)
distance(point1, point2)
mean(points)
add(point1, point2)
scalar_multiply(c, point)
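The recap names distance, mean, add and scalar_multiply, but the slides here don't show their bodies. The following is a minimal sketch of how all the pieces might fit together, assuming points are plain lists of numbers. The helper bodies are guesses, not the talk's code: distance is taken to be the squared Euclidean distance used in the one-function version, and keeping the old mean when a cluster comes up empty is an added safeguard that the slides don't mention.

```python
import random

def add(point1, point2):
    # componentwise sum (assumed implementation)
    return [x + y for x, y in zip(point1, point2)]

def scalar_multiply(c, point):
    # multiply every coordinate by c (assumed implementation)
    return [c * x for x in point]

def mean(points):
    # componentwise mean of a non-empty list of points (assumed implementation)
    total = points[0]
    for point in points[1:]:
        total = add(total, point)
    return scalar_multiply(1.0 / len(points), total)

def distance(point1, point2):
    # squared Euclidean distance, as in the one-function version;
    # it picks the same closest mean as the true distance would
    return sum((x - y) ** 2 for x, y in zip(point1, point2))

def argmin(xs):
    # return index of smallest element
    return min(enumerate(xs), key=lambda pair: pair[1])[0]

def closest_index(point, means):
    # return index of closest mean
    return argmin(distance(point, m) for m in means)

def assign_clusters(points, means):
    # one (initially empty) cluster per mean
    clusters = [[] for _ in means]
    for point in points:
        clusters[closest_index(point, means)].append(point)
    return clusters

def new_means(points, means):
    clusters = assign_clusters(points, means)
    # assumption: an empty cluster keeps its previous mean
    return [mean(cluster) if cluster else old_mean
            for cluster, old_mean in zip(clusters, means)]

def k_means(points, k, num_iters=10):
    # start with k of the points as "means"
    means = random.sample(points, k)
    # and iterate finding new means
    for _ in range(num_iters):
        means = new_means(points, means)
    return means

# two well-separated groups of three points each
random.seed(0)
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(sorted(k_means(points, 2)))
```

On this toy data the two means should settle near [0.33, 0.33] and [10.33, 10.33], one per group. Each helper is small enough to check on its own, which is the point of the exercise.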
![Page 88: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/88.jpg)
As a Pedagogical Tool
Can be used "top down" (as we did here)
  Implement high-level logic
  Then implement the details
  Nice for exposition
Can also be used "bottom up"
  Implement small pieces
  Build up to high-level logic
  Good for workshops
![Page 89: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/89.jpg)
Example: Decision Trees
Want to predict whether a given Meetup is worth attending (True) or not (False)
Inputs are dictionaries describing each Meetup
{ "group" : "DAML", "date" : "2015-06-23", "beer" : "free", "food" : "dim sum", "speaker" : "@joelgrus", "location" : "Google", "topic" : "shameless self-promotion" }
{ "group" : "Seattle Atheists", "date" : "2015-06-23", "location" : "Round the Table", "beer" : "none", "food" : "none", "topic" : "Godless Game Night" }
![Page 90: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/90.jpg)
Example: Decision Trees
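[Figure: decision tree. The root asks "beer?": "free" leads to True, "none" leads to False, and "paid" leads to a second question, "speaker?", whose branches @jakevdp and @joelgrus end in True/False leaves.]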
![Page 91: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/91.jpg)
Example: Decision Trees
class LeafNode:
    def __init__(self, prediction):
        self.prediction = prediction

    def predict(self, input_dict):
        # a leaf always predicts the same value
        return self.prediction

class DecisionNode:
    def __init__(self, attribute, subtree_dict):
        self.attribute = attribute
        self.subtree_dict = subtree_dict

    def predict(self, input_dict):
        # follow the branch matching this input's value for the attribute
        value = input_dict.get(self.attribute)
        subtree = self.subtree_dict[value]
        return subtree.predict(input_dict)
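To see the two classes in action, here is a small sketch that builds the beer/speaker tree by hand and classifies a few Meetups. The True/False leaves under "speaker" are placeholders (the transcript doesn't preserve which speaker maps to which leaf), and the None branch is a hypothetical addition that shows one thing input_dict.get makes possible: a fallback for Meetups that list no speaker at all.

```python
class LeafNode:
    def __init__(self, prediction):
        self.prediction = prediction

    def predict(self, input_dict):
        return self.prediction

class DecisionNode:
    def __init__(self, attribute, subtree_dict):
        self.attribute = attribute
        self.subtree_dict = subtree_dict

    def predict(self, input_dict):
        # .get returns None when the attribute is missing
        value = input_dict.get(self.attribute)
        subtree = self.subtree_dict[value]
        return subtree.predict(input_dict)

# placeholder leaves: the real slide's speaker outcomes are not recoverable
speaker_node = DecisionNode("speaker", {
    "@jakevdp": LeafNode(True),
    "@joelgrus": LeafNode(False),
    None: LeafNode(False),   # hypothetical fallback: no speaker listed
})

tree = DecisionNode("beer", {
    "free": LeafNode(True),
    "none": LeafNode(False),
    "paid": speaker_node,
})

print(tree.predict({"beer": "free", "speaker": "@joelgrus"}))   # True
print(tree.predict({"beer": "paid", "speaker": "@jakevdp"}))    # True
print(tree.predict({"beer": "paid"}))                           # False
```

Without the None branch, the last call would raise a KeyError, since no subtree is registered for a missing speaker.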
![Page 92: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/92.jpg)
Example: Decision Trees
Again inspiration from functional programming:
type Input = Map.Map String String
data Tree = Predict Bool                          -- always predict a specific value
          | Subtrees String (Map.Map String Tree) -- String: look at the "beer" entry; Map: a map from each possible "beer" value to a subtree
![Page 93: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/93.jpg)
Example: Decision Trees
type Input = Map.Map String String
data Tree = Predict Bool | Subtrees String (Map.Map String Tree)
predict :: Tree -> Input -> Bool
predict (Predict b) _ = b
predict (Subtrees a subtrees) input = predict subtree input
  where subtree = subtrees Map.! (input Map.! a)
![Page 94: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/94.jpg)
Example: Decision Trees
type Input = Map.Map String String
data Tree = Predict Bool | Subtrees String (Map.Map String Tree)
We can do the same: we'll say a decision tree is either
  True
  False
  (attribute, subtree_dict)
("beer", { "free" : True, "none" : False, "paid" : ("speaker", {...})})
![Page 95: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/95.jpg)
Example: Decision Trees
predict :: Tree -> Input -> Bool
predict (Predict b) _ = b
predict (Subtrees a subtrees) input = predict subtree input
  where subtree = subtrees Map.! (input Map.! a)

def predict(tree, input_dict):
    # leaf node predicts itself
    if tree in (True, False):
        return tree
    else:
        # destructure tree
        attribute, subtree_dict = tree
        # find appropriate subtree
        value = input_dict[attribute]
        subtree = subtree_dict[value]
        # classify using subtree
        return predict(subtree, input_dict)
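As a quick check, the tuple-based predict can be run on the two Meetup dictionaries from the earlier slide. This is only a sketch: the contents of the "speaker" subtree are elided on the slide, so the leaves used below are placeholders. Both example Meetups are decided at the "beer" level, so the placeholders don't affect their predictions.

```python
def predict(tree, input_dict):
    # leaf node predicts itself
    if tree in (True, False):
        return tree
    # destructure tree
    attribute, subtree_dict = tree
    # find appropriate subtree
    value = input_dict[attribute]
    subtree = subtree_dict[value]
    # classify using subtree
    return predict(subtree, input_dict)

tree = ("beer", {
    "free": True,
    "none": False,
    # placeholder leaves: the slide elides this subtree
    "paid": ("speaker", {"@jakevdp": True, "@joelgrus": False}),
})

daml = {"group": "DAML", "date": "2015-06-23", "beer": "free",
        "food": "dim sum", "speaker": "@joelgrus", "location": "Google",
        "topic": "shameless self-promotion"}

atheists = {"group": "Seattle Atheists", "date": "2015-06-23",
            "location": "Round the Table", "beer": "none", "food": "none",
            "topic": "Godless Game Night"}

print(predict(tree, daml))       # True  (free beer)
print(predict(tree, atheists))   # False (no beer)
```

Note that this version indexes input_dict directly, so a Meetup with paid beer and no "speaker" entry would raise a KeyError rather than return a prediction.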
![Page 96: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/96.jpg)
Not Just For Data Science
![Page 97: The Road to Data Science - Joel Grus, June 2015](https://reader036.vdocuments.site/reader036/viewer/2022062523/58eff1d11a28abd2578b45ad/html5/thumbnails/97.jpg)
In Conclusion
Teaching data science is fun, if you're smart about it
Learning data science is fun, if you're smart about it
Writing a book is not that much fun
Having written a book is pretty fun
Making slides is actually kind of fun
Functional programming is a lot of fun