Final Project! (& wind sprints)
CSCI 0931 - Intro. to Comp. for the Humanities and Social Sciences 1
Final Project
Define Problem → Find Data → Write a set of instructions → Solution
• ANYTHING: Numerical Data, Textual Data, Twitter Data, Geographical Data, ...
• ANYTHING COMPUTATIONAL: Find Trends/Patterns, Predict Outliers/Groups, Describe Anomalies, Assess Someone’s Claims, ...
• ANY TOOL: Excel or Python, Probably...
• ANY ANALYSIS: Plots, Tables, KML Files, ...
Only Constant: you will use computers to process data automatically
Today
• What do we wish we had known before?
• Example Final Projects
• General Project Description
– Scope
– Timeline
• K-means in Python
What do we wish we had known before?
Example 1: Where should I Live?
I want to buy a house somewhere near my hometown. Where should I buy?
Kasey Hass’s Project: https://sites.google.com/a/brown.edu/house-hunting-mecklenburg-county/home
In Each Zip Code:
• Rent price: Red dollar sign
• Vacancies: For sale sign
• Income: Green dollar sign
• Marriage rate: Wedding rings
Example 2: Athletes & Geography
Malcolm Gladwell’s Outliers
• Professional Canadian hockey players
– More are born in January than in any other month.
– Few are born in December.
– Junior league’s cutoff is Dec. 31st
• Any association between professional players and where they were born?
Daniel Newmark’s Project: https://sites.google.com/a/brown.edu/daniel-s-data/project-4--distribution-of-athletes
[Figure panels: Football (NFL), Hockey (NHL)]
Example 3: Growth of Top Digital Media Companies
How does the Growth Rate Compare in the Past 10 Years?
Nicholas Talbott’s Project: https://sites.google.com/site/growthofdigitalmediacompanies/home
YouTube Video: http://www.youtube.com/watch?v=QDtDF6_mrkc
• One KML File for Each Year
• Logo Size shows Growth
Scope – Think Big! (& Have Backup Plans)
Define Problem → Find Data → Write a set of instructions → Solution
• Analyze the same dataset in DIFFERENT ways:
– Make a number of related claims/hypotheses
• Utilize MANY data sources to test your claim.
• Display the Results in Different Ways
• GENERALIZE the Program
• Try to Implement something we haven’t discussed
Timeline
(Calendar columns: Sun Mon Tues Wed Thurs Fri Sat)
• 4/14 to 4/18: Final Project Out; Project 2 Due; Guest and In-Class Work Time
• 4/21 to 4/25: Guest and In-Class Work Time; Proposal Due; Guest and In-Class Work Time
• 4/28 to 5/2: In-Class Work Time (Reading Period); In-Class Work Time (Reading Period)
• 5/5 to 5/9: Final Project Due and Presentations
• Meet With Staff At Least Once (noted twice on the calendar)
Before You Begin...
Say you have a vague idea of what you want to do....
Define Problem → Find Data → Write a set of instructions → Solution
• Where is the data? What does it look like?
• What format will be easiest for you to analyze?
• What is your claim? Are there any sub-claims?
• What are your assumptions?
• What do you want the output to be?
• What analysis will best test your claim?
• What are the steps? What tools are best for each step?
• How can you GENERALIZE for a broader analysis?
K-means
• Recall goal: given data that might be clustered...find the clusters
Idea
• Guess cluster-centers at random
• Assign points to nearest “center” to build clusters
• Average points in cluster to update “center”
• Repeat!
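One round of the idea above can be traced by hand on a tiny data set. The sketch below is only an illustration: the four points and the two starting centers are made-up values, and `closest` and `average` are hypothetical helper names that are not part of the course skeleton.

```python
from math import hypot

# Made-up data: two obvious groups, one near (0, 0) and one near (10, 10).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]  # hypothetical initial guesses

def closest(p, centers):
    # Index of the center nearest to p (ties go to the lowest index).
    return min(range(len(centers)),
               key=lambda i: hypot(p[0] - centers[i][0], p[1] - centers[i][1]))

def average(cluster):
    # Coordinate-wise mean; assumes the cluster is nonempty.
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

# Assign each point to its nearest center to build the clusters.
clusters = [[p for p in points if closest(p, centers) == i]
            for i in range(len(centers))]

# Average the points in each cluster to update its center.
new_centers = [average(c) for c in clusters]
print(new_centers)
```

With this data the first two points land in one cluster and the last two in the other, so the centers move to (0.0, 1.0) and (11.0, 10.0). Repeating the two steps leaves the clusters unchanged, which is the signal to stop.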
Code
• A “point” is a (float, float) tuple.
def kmeans(points, k):
    ''' point list * int -> [point list]
    Take a list of n points in the plane, and a number k > 0 of
    groups to put them in. Return a list of points containing
    the k "center" points (the "means" of the clusters) '''
    # dummy center, used whenever a cluster ends up empty
    distant = findDistantPoint(points)
    means = initialMeans(points, k)
    clusters = findClusters(points, means)
    oldClusters = []
    # repeat until the clusters stop changing
    while (oldClusters != clusters) :
        oldClusters = clusters
        means = findMeans(clusters, distant)
        clusters = findClusters(points, means)
    return means
def findDistantPoint(points):
    ''' point list -> point
    Compute a point very far from all the selected points, to use
    as a dummy "cluster center" when some cluster "dies". '''
    x0 = 0
    y0 = 0
    for p in points :
        x0 = max(x0, abs(p[0]))
        y0 = max(y0, abs(p[1]))
    return (3 * x0, 3 * y0)
def initialMeans(points, k) :
    ''' point list * int -> point list
    Select k points at random from the input point list, and
    return a list containing those k selected points.'''
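The slide gives only the contract for initialMeans. One possible way to fill it in, assuming Python's standard `random` module is acceptable for the assignment, is sketched below. The body is an assumption, not the course's reference solution.

```python
import random

def initialMeans(points, k):
    # random.sample picks k distinct positions from the list, so the
    # same entry of `points` is never chosen twice. It raises
    # ValueError if k is larger than len(points).
    return random.sample(points, k)

# Example call on made-up data; the result differs from run to run.
example = initialMeans([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)], 2)
print(example)
```

Sampling without replacement avoids starting with two identical centers when the input points are all different.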
def findClusters(points, means) :
    ''' point list * point list -> point list list
    Given n points and k "mean" points, form k lists; the ith list
    consists of all input points that are closer to the ith mean
    than to any other. Return a list of these k lists of
    "closest points".'''
    selectionList = [indexOfClosest(point, means) for point in points]
    k = len(means)
    return [selectList(points, selectionList, i) for i in range(0,k)]
def indexOfClosest(point, meanPoints) :
    ''' point * point list -> int
    Take a point P and a list of points Q0, Q1, ..., Qk-1 (the
    "meanPoints" list) and compute the distance to each of the Q
    points. If the distance to Qi is the smallest of these, return i.
    Thus if the distances are [2, 2, 1, 6, 3], we'd return 2, because
    the "1" is at position 2 (counting from zero!)'''
from math import sqrt

def distance(p, q) :
    ''' point * point -> float
    Return the distance between two points in the plane. '''
    xdiff = p[0] - q[0]
    ydiff = p[1] - q[1]
    return sqrt( xdiff*xdiff + ydiff*ydiff)
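With distance available, the indexOfClosest contract described earlier can be met in a couple of lines. This is a minimal sketch rather than the official solution; distance is repeated here only so that the block runs on its own, and the tie-breaking rule (lowest index wins) is an assumption the slides do not state.

```python
from math import sqrt

def distance(p, q):
    # Same Euclidean distance as on the slide.
    xdiff = p[0] - q[0]
    ydiff = p[1] - q[1]
    return sqrt(xdiff * xdiff + ydiff * ydiff)

def indexOfClosest(point, meanPoints):
    # Compute the distance from `point` to every mean, then return the
    # position of the smallest one. list.index returns the first match,
    # so ties go to the lowest index.
    distances = [distance(point, q) for q in meanPoints]
    return distances.index(min(distances))

# The second mean, at (1.0, 0.0), is nearest to the origin.
print(indexOfClosest((0.0, 0.0), [(5.0, 0.0), (1.0, 0.0), (3.0, 0.0)]))
```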
def selectList(points, selectionList, index) :
    ''' (float, float) list * int list * int -> (float, float) list
    Given a list of points, a selection list (a list of indices from
    0 to k-1, for some k, such as [0, 1, 1, 0, 0] for k = 2), and an
    index that's between 0 and k-1 (inclusive), return the points
    corresponding to the selected index. For instance, if the
    selection list is [0, 1, 1, 0, 0] and index = 0, you'd return a
    list of three points: the ones at index 0, 3, and 4 in the
    "points" list.'''
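The selectList contract above maps naturally onto a list comprehension. The following is one possible sketch, not necessarily how the course intends it to be written; the sample points are made up.

```python
def selectList(points, selectionList, index):
    # Walk the two lists in step and keep each point whose entry in
    # selectionList equals the requested index.
    return [p for p, s in zip(points, selectionList) if s == index]

# The example from the docstring: indices 0, 3, and 4 are selected.
pts = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
print(selectList(pts, [0, 1, 1, 0, 0], 0))
```

An index that never appears in the selection list gives an empty list, which is how a cluster "dies" in findClusters.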
def findMeans(clusters, distant) :
    ''' (point list) list * point -> (point list)
    Given k clusters, each represented as a point list, find the mean
    (average) point for each cluster; if the cluster is empty, use
    "distant" as its mean.'''
    return [findMean(cluster, distant) for cluster in clusters]
def findMean(cluster, distant):
    ''' point list * point -> point
    Given a nonempty list of n points, return the average of the
    points, gotten by summing up the x-coordinates of all the points
    and dividing by n to get a number x0, doing the same with the
    y-coordinates to get y0, and returning the point (x0, y0).
    For an empty list of points, return "distant" instead.'''
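A body for findMean that follows the docstring step by step might look like the sketch below. It is an assumption about how the exercise could be completed, and the explicit float conversion is there only in case the code is run under Python 2, where dividing two integers truncates.

```python
def findMean(cluster, distant):
    # An empty cluster has no average, so fall back to the dummy point.
    if len(cluster) == 0:
        return distant
    n = float(len(cluster))  # float so that Python 2 does not truncate
    x0 = sum(p[0] for p in cluster) / n
    y0 = sum(p[1] for p in cluster) / n
    return (x0, y0)

print(findMean([(1, 2), (1, 4)], (10, 10)))  # average of the two points
print(findMean([], (10, 10)))                # empty cluster: the dummy point
```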
Shared Skeleton Code
• https://drive.google.com/a/brown.edu/file/d/0B4mKyAS1wRwhSnZvcTdFMnlHQXM/view?usp=sharing
• Available from course page, too
To do
• Person 1: create at least two tests
def TestfindMeans():
    cluster1 = [(1, 2), (1, 4)]
    cluster2 = [(4, 5)]
    clusters = [cluster1, cluster2]
    u = findMeans(clusters, (10, 10))
    if (u != [(1, 3), (4, 5)]) :
        print("error")
• Person 2: coder/typist
• Person 3: editor
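Person 1 is asked for at least two tests, and the slides show only one. A second test could target findDistantPoint, whose body is given in full earlier in the deck. The sketch below copies that function so it runs on its own; `TestfindDistantPoint`, its input points, and its True/False return value are hypothetical additions that follow the naming of the test above.

```python
def findDistantPoint(points):
    # Copied from the slide: three times the largest absolute
    # x and y coordinates seen in the input.
    x0 = 0
    y0 = 0
    for p in points:
        x0 = max(x0, abs(p[0]))
        y0 = max(y0, abs(p[1]))
    return (3 * x0, 3 * y0)

def TestfindDistantPoint():
    # Largest |x| is 4 and largest |y| is 3, so we expect (12, 9).
    points = [(1, -2), (-4, 3)]
    d = findDistantPoint(points)
    if d != (12, 9):
        print("error")
        return False
    return True

TestfindDistantPoint()
```

Using negative coordinates in the input checks that the absolute value is taken before the maximum.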