learn python through public data hacking requirements
TRANSCRIPT
![Page 1: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/1.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Learn Python Through Public Data Hacking
1
David Beazley @dabeaz
http://www.dabeaz.com
Presented at PyCon'2013, Santa Clara, CAMarch 13, 2013
Copyright (C) 2013, http://www.dabeaz.com
Requirements
2
• Python 2.7 or 3.3
• Support files:
http://www.dabeaz.com/pydata
• Also, datasets passed around on USB-key
![Page 2: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/2.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Welcome!
• And now for something completely different
• This tutorial merges two topics
• Learning Python
• Public data sets
• I hope you find it to be fun
3
Copyright (C) 2013, http://www.dabeaz.com
Primary Focus
• Learn Python through practical examples
• Learn by doing!
• Provide a few fun programming challenges
4
![Page 3: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/3.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Not a Focus
• Data science
• Statistics
• GIS
• Advanced Math
• "Big Data"
• We are learning Python
5
Copyright (C) 2013, http://www.dabeaz.com
Approach
• Coding! Coding! Coding! Coding!
• Introduce yourself to your neighbors
• You're going to work together
• A bit like a hackathon
6
![Page 4: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/4.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Your Responsibilities
• Ask questions!
• Don't be afraid to try things
• Read the documentation!
• Ask for help if stuck
7
Copyright (C) 2013, http://www.dabeaz.com
Ready, Set, Go...
8
![Page 5: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/5.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Running Python• Run it from a terminal
bash % pythonPython 2.7.3 (default, Jun 13 2012, 15:29:09) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwinType "help", "copyright", "credits" or "license" >>> print 'Hello World'Hello World>>> 3 + 47>>>
9
• Start typing commands
Copyright (C) 2013, http://www.dabeaz.com
IDLE• Look for it in the "Start" menu
10
![Page 6: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/6.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Interactive Mode• The interpreter runs a "read-eval" loop
>>> print "hello world"hello world>>> 37*421554>>> for i in range(5):... print i...01234>>>
• It runs what you type11
Copyright (C) 2013, http://www.dabeaz.com
Interactive Mode• Some notes on using the interactive shell
>>> print "hello world"hello world>>> 37*421554>>> for i in range(5):... print i...01234>>>
12
>>> is the interpreter prompt for starting a
new statement
... is the interpreter prompt for continuing a statement (it may be blank in some tools) Enter a blank line to
finish typing and to run
![Page 7: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/7.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Creating Programs
• Programs are put in .py files
# helloworld.pyprint "hello world"
• Create with your favorite editor (e.g., emacs)
• Can also edit programs with IDLE or other Python IDE (too many to list)
13
Copyright (C) 2013, http://www.dabeaz.com
Running Programs• Running from the terminal
• Command line (Unix)
bash % python helloworld.pyhello worldbash %
• Command shell (Windows)C:\SomeFolder>helloworld.pyhello world
C:\SomeFolder>c:\python27\python helloworld.pyhello world
14
![Page 8: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/8.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Pro-Tip
• Use python -i
bash % python -i helloworld.pyhello world>>>
• It runs your program and then enters the interactive shell
• Great for debugging, exploration, etc.
15
Copyright (C) 2013, http://www.dabeaz.com
Running Programs (IDLE)• Select "Run Module" from editor
• Will see output in IDLE shell window
16
![Page 9: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/9.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Python 101 : Statements
• A Python program is a sequence of statements
• Each statement is terminated by a newline
• Statements are executed one after the other until you reach the end of the file.
17
Copyright (C) 2013, http://www.dabeaz.com
Python 101 : Comments
• Comments are denoted by #
# This is a commentheight = 442 # Meters
18
• Extend to the end of the line
![Page 10: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/10.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Python 101: Variables
• A variable is just a name for some value
• Name consists of letters, digits, and _.
• Must start with a letter or _
height = 442 user_name = "Dave"filename1 = 'Data/data.csv'
19
Copyright (C) 2013, http://www.dabeaz.com
Python 101 : Basic Types• Numbers
a = 12345 # Integerb = 123.45 # Floating point
• Text Strings
name = 'Dave'filename = "Data/stocks.dat"
20
• Nothing (a placeholder)
f = None
![Page 11: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/11.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Python 101 : Math• Math operations behave normally
y = 2 * x**2 - 3 * x + 10z = (x + y) / 2.0
• Potential Gotcha: Integer Division in Python 2>>> 7/41>>> 2/30
21
• Use decimals if it matters>>> 7.0/41.75
Copyright (C) 2013, http://www.dabeaz.com
Python 101 : Text Strings
• A few common operations
a = 'Hello'b = 'World'
>>> len(a) # Length5>>> a + b # Concatenation'HelloWorld'>>> a.upper() # Case convert'HELLO'>>> a.startswith('Hell') # Prefix TestTrue>>> a.replace('H', 'M') # Replacement'Mello>>>
22
![Page 12: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/12.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Python 101: Conversions• To convert values
a = int(x) # Convert x to integerb = float(x) # Convert x to floatc = str(x) # Convert x to string
• Example:>>> xs = '123'>>> xs + 10Traceback (most recent call last): File "<stdin>", line 1, in <module>TypeError: cannot concatenate 'str' and 'int' objects>>> int(xs) + 10133>>>
23
Copyright (C) 2013, http://www.dabeaz.com
Python 101 : Conditionals• If-else
if a < b: print "Computer says no"else: print "Computer says yes"
• If-elif-else
if a < b: print "Computer says not enough"elif a > b: print "Computer says too much"else: print "Computer says just right"
24
![Page 13: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/13.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Python 101 : Relations
25
• Relational operators< > <= >= == !=
• Boolean expressions (and, or, not)
if b >= a and b <= c: print "b is between a and c"
if not (b < a or b > c): print "b is still between a and c"
Copyright (C) 2013, http://www.dabeaz.com
Python 101: Looping
• while executes a loop
• Executes the indented statements underneath while the condition is true
26
n = 10while n > 10: print 'T-minus', n n = n - 1print 'Blastoff!'
![Page 14: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/14.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Python 101: Iteration
• for iterates over a sequence of data
• Processes the items one at a time
• Note: variable name doesn't matter
27
names = ['Dave', 'Paula', 'Thomas', 'Lewis']for name in names: print name
for n in names: print n
Copyright (C) 2013, http://www.dabeaz.com
Python 101 : Indentation
• There is a preferred indentation style
• Always use spaces
• Use 4 spaces per level
• Avoid tabs
• Always use a Python-aware editor
28
![Page 15: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/15.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Python 101 : Printing• The print statement (Python 2)
print x print x, y, zprint "Your name is", nameprint x, # Omits newline
• The print function (Python 3)
29
print(x)print(x, y, z)print("Your name is", name)print(x, end=' ') # Omits newline
Copyright (C) 2013, http://www.dabeaz.com
Python 101: Files• Opening a file
f = open("foo.txt","r") # Open for readingf = open("bar.txt","w") # Open for writing
• To read data
data = f.read() # Read all data
• To write text to a file
g.write("some text\n")
30
![Page 16: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/16.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Python 101: File Iteration
• Reading a file one line at a time
f = open("foo.txt","r") for line in f: # Process the line ...f.close()
31
• Extremely common with data processing
Copyright (C) 2013, http://www.dabeaz.com
Python 101: Functions• Defining a new function
def hello(name): print('Hello %s!' % name)
def distance(lat1, lat2): 'Return approx miles between lat1 and lat2' return 69 * abs(lat1 - lat2)
32
• Example:>>> hello('Guido')Hello Guido!>>> distance(41.980262, 42.031662)3.5465999999995788>>>
![Page 17: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/17.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Python 101: Imports• There is a huge library of functions
• Example: math functionsimport math
x = math.sin(2)y = math.cos(2)
33
• Reading from the web
import urllib # urllib.request on Py3
u = urllib.urlopen('http://www.python.org)data = u.read()
Copyright (C) 2013, http://www.dabeaz.com
Coding Challenge
34
"The Traveling Suitcase"
![Page 18: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/18.jpg)
Copyright (C) 2013, http://www.dabeaz.com
The Traveling Suitcase
35
Travis traveled to Chicago and took the Clark Street #22 bus up to Dave's office.
Problem: He just left his suitcase on the bus!
Your task: Get it back!
Copyright (C) 2013, http://www.dabeaz.com
Panic!
36
>>> import urllib>>> u = urllib.urlopen('http://ctabustracker.com/bustime/map/getBusesForRoute.jsp?route=22')>>> data = u.read()>>> f = open('rt22.xml', 'wb')>>> f.write(data)>>> f.close()>>>
• Start the Python interpreter and type this
• Don't ask questions: you have 5 minutes...
![Page 19: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/19.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Hacking Transit Data
37
• Many major cities provide a transit API
• Example: Chicago Transit Authority (CTA)
http://www.transitchicago.com/developers/
• Available data:
• Real-time GPS tracking
• Stop predictions
• Alerts
Copyright (C) 2013, http://www.dabeaz.com 38
![Page 20: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/20.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Here's the Data
39
<?xml version="1.0"?> <buses rt="22"> <time>1:14 PM</time> <bus> !! <id>6801</id> !! <rt>22</rt> !! <d>North Bound</d> !! <dn>N</dn> !! <lat>41.875033214174465</lat> !<lon>-87.62907409667969</lon> !<pid>3932</pid> !<pd>North Bound</pd> !<run>P209</run> !<fs>Howard</fs> !<op>34058</op> ... </bus> ...
Copyright (C) 2013, http://www.dabeaz.com
Here's the Data
40
<?xml version="1.0"?> <buses rt="22"> <time>1:14 PM</time> <bus> !! <id>6801</id> !! <rt>22</rt> !! <d>North Bound</d> !! <dn>N</dn> !! <lat>41.875033214174465</lat> !<lon>-87.62907409667969</lon> !<pid>3932</pid> !<pd>North Bound</pd> !<run>P209</run> !<fs>Howard</fs> !<op>34058</op> ... </bus> ...
![Page 21: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/21.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Your Challenge
41
• Task 1:
latitude 41.980262 longitude -87.668452
Travis doesn't know the number of the bus he was riding. Find likely candidates by parsing the data just downloaded and identifying vehicles traveling northbound of Dave's office.
Dave's office is located at:
Copyright (C) 2013, http://www.dabeaz.com
Your Challenge
42
• Task 2:
Write a program that periodically monitors the identified buses and reports their current distance from Dave's office.
When the bus gets closer than 0.5 miles, have the program issue an alert by popping up a web-page showing the bus location on a map.
Travis will meet the bus and get his suitcase.
![Page 22: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/22.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Parsing XML
43
from xml.etree.ElementTree import parsedoc = parse('rt22.xml')
• Parsing a document into a tree
<?xml version="1.0"?> <buses rt="22"> <time>1:14 PM</time> <bus> !! <id>6801</id> !! <rt>22</rt> !! <d>North Bound</d> !! <dn>N</dn> !! <lat>41.875033214174465</lat> !<lon>-87.62907409667969</lon> !<pid>3932</pid> !<pd>North Bound</pd> !<run>P209</run> !<fs>Howard</fs> !<op>34058</op> ... </bus> ...
roottime
bus
bus
bus
bus
id
rt
d
dn
lat
lon
doc
Copyright (C) 2013, http://www.dabeaz.com
Parsing XML
44
for bus in doc.findall('bus'): ...
• Iterating over specific element type
roottime
bus
bus
bus
bus
id
rt
d
dn
lat
lon
doc
![Page 23: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/23.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Parsing XML
45
for bus in doc.findall('bus'): ...
• Iterating over specific element type
roottime
bus
bus
bus
bus
id
rt
d
dn
lat
lon
doc
bus
Produces a sequence of
matching elements
Copyright (C) 2013, http://www.dabeaz.com
Parsing XML
46
for bus in doc.findall('bus'): ...
• Iterating over specific element type
roottime
bus
bus
bus
bus
id
rt
d
dn
lat
lon
doc
busProduces a sequence of
matching elements
![Page 24: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/24.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Parsing XML
47
for bus in doc.findall('bus'): ...
• Iterating over specific element type
roottime
bus
bus
bus
bus
id
rt
d
dn
lat
lon
doc
bus
Produces a sequence of
matching elements
Copyright (C) 2013, http://www.dabeaz.com
Parsing XML
48
for bus in doc.findall('bus'): ...
• Iterating over specific element type
roottime
bus
bus
bus
bus
id
rt
d
dn
lat
lon
doc
bus
Produces a sequence of
matching elements
![Page 25: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/25.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Parsing XML
49
for bus in doc.findall('bus'): d = bus.findtext('d') lat = float(bus.findtext('lat'))
• Extracting data : elem.findtext()
roottime
bus
bus
bus
bus
id
rt
d
dn
lat
lon
doc
bus "North Bound"
"41.9979871114"
Copyright (C) 2013, http://www.dabeaz.com
Mapping
50
• To display a map : Maybe Google Static Maps
https://developers.google.com/maps/documentation/staticmaps/
• To show a page in a browser
import webbrowserwebbrowser.open('http://...')
![Page 26: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/26.jpg)
Copyright (C) 2013, http://www.dabeaz.com 51
Copyright (C) 2013, http://www.dabeaz.com
Go Code...
52
30 Minutes• Talk to your neighbors
• Consult handy cheat-sheet
• http://www.dabeaz.com/pydata
![Page 27: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/27.jpg)
Copyright (C) 2013, http://www.dabeaz.com
New Concepts
53
Copyright (C) 2013, http://www.dabeaz.com
Data Structures• Real programs have more complex data
• Example: A place marker
Bus 6541 at 41.980262, -87.668452
• An "object" with three parts
• Label ("6541")
• Latitude (41.980262)
• Longitude (-87.668452)
54
![Page 28: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/28.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Tuples
• A collection of related values grouped together
• Example:
55
bus = ('6541', 41.980262, -87.668452)
• Analogy: A row in a database table
• A single object with multiple parts
Copyright (C) 2013, http://www.dabeaz.com
Tuples (cont)• Tuple contents are ordered (like an array)
bus = ('6541', 41.980262, -87.668452) id = bus[0] # '6541'lat = bus[1] # 41.980262lon = bus[2] # -87.668452
• However, the contents can't be modified
>>> bus[0] = '1234'TypeError: object does not support item assignment
56
![Page 29: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/29.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Tuple Unpacking• Unpacking values from a tuple
bus = ('6541', 41.980262, -87.668452)
id, lat, lon = bus# id = '6541'# lat = 41.980262# lon = -87.668452
• This is extremely common
• Example: Unpacking database row into vars
57
Copyright (C) 2013, http://www.dabeaz.com
Dictionaries• A collection of values indexed by "keys"
• Example:bus = { 'id' : '6541', 'lat' : 41.980262, 'lon' : -87.668452}
58
• Use:
>>> bus['id']'6541'>>> bus['lat'] = 42.003172>>>
![Page 30: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/30.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Lists• An ordered sequence of items
names = ['Dave', 'Paula', 'Thomas']
59
• A few operations
>>> len(names)3>>> names.append('Lewis')>>> names['Dave', 'Paula', 'Thomas', 'Lewis']>>> names[0]'Dave'>>>
Copyright (C) 2013, http://www.dabeaz.com
List Usage• Typically hold items of the same type
nums = [10, 20, 30]
buses = [ ('1412', 41.8750332142, -87.6290740967), ('1406', 42.0126361553, -87.6747320322), ('1307', 41.8886332973, -87.6295552408), ('1875', 41.9996211482, -87.6711741429), ('1780', 41.9097633362, -87.6315689087),]
60
![Page 31: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/31.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Dicts as Lookup Tables• Use a dict for fast, random lookups
• Example: Bus locations
61
bus_locs = { '1412': (41.8750332142, -87.6290740967), '1406': (42.0126361553, -87.6747320322), '1307': (41.8886332973, -87.6295552408), '1875': (41.9996211482, -87.6711741429), '1780': (41.9097633362, -87.6315689087),}
>>> bus_locs['1307'](41.8886332973, -87.6295552408)>>>
Copyright (C) 2013, http://www.dabeaz.com
Sets• An ordered collections of unique items
62
ids = set(['1412', '1406', '1307', '1875'])
• Common operations>>> ids.add('1642')>>> ids.remove('1406')>>> '1307' in idsTrue>>> '1871' in idsFalse>>>
• Useful for detecting duplicates, related tasks
![Page 32: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/32.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Coding Challenge
63
"Diabolical Road Biking"
Copyright (C) 2013, http://www.dabeaz.com
Problem
64
Not content to ride your bike on the lakefront path, you seek a new road biking challenge involving large potholes and heavy traffic.
Your Task: Find the five most post-apocalyptic pothole-filled 10-block sections of road in Chicago.
Bonus: Identify the worst road based on historical data involving actual number of patched potholes.
![Page 33: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/33.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Data Portals
65
• Many cities are publishing datasets online
• http://data.cityofchicago.org
• https://data.sfgov.org/
• https://explore.data.gov/
• You can download and play with data
Copyright (C) 2013, http://www.dabeaz.com 66
![Page 34: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/34.jpg)
Copyright (C) 2013, http://www.dabeaz.com 67
Pothole Datahttps://data.cityofchicago.org/Service-Requests/311-Service-
Requests-Pot-Holes-Reported/7as2-ds3y
Copyright (C) 2013, http://www.dabeaz.com
Getting the Data
• You can download from the website
• I have provided a copy on USB-key
68
Data/potholes.csv
• Approx: 31 MB, 137000 lines
![Page 35: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/35.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Parsing CSV Data
• You will need to parse CSV data
import csv
f = open('potholes.csv')for row in csv.DictReader(f): addr = row['STREET ADDRESS'] num = row['NUMBER OF POTHOLES FILLED ON BLOCK']
69
• Use the CSV module
Copyright (C) 2013, http://www.dabeaz.com
Tabulating Data
• You'll probably need to make lookup tablespotholes_by_block = {}
f = open('potholes.csv')for row in csv.DictReader(f): ... potholes_by_block[block] += num_potholes ...
70
• Use a dict. Map keys to counts.
![Page 36: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/36.jpg)
Copyright (C) 2013, http://www.dabeaz.com
String Splitting• You might need to manipulate strings
>>> addr = '350 N STATE ST'>>> parts = addr.split()>>> parts['350', 'N', 'STATE', 'ST']>>> num = parts[0]>>> parts[0] = num[:-2] + 'XX'>>> parts['3XX', 'N', 'STATE', 'ST']>>> ' '.join(parts)'3XX N STATE ST'>>>
71
• For example, to rewrite addresses
Copyright (C) 2013, http://www.dabeaz.com
Data Reduction/Sorting• Some useful data manipulation functions
>>> nums = [50, 10, 5, 7, -2, 8]>>> min(nums)-2>>> max(nums)50>>> sorted(nums)[-2, 5, 7, 8, 10, 50]>>> sorted(nums, reverse=True)[50, 10, 8, 7, 5, -2]>>>
72
![Page 37: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/37.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Exception Handling
• You might need to account for bad data
for row in csv.DictReader(f): try: n = int(row['NUMBER OF POTHOLES FILLED']) except ValueError: n = 0 ...
73
• Use try-except to catch exceptions (if needed)
Copyright (C) 2013, http://www.dabeaz.com
Code...
74
40 MinutesHint: This problem requires more thought
than actual coding
(The solution is small)
![Page 38: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/38.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Power Tools
(Python powered)
75
Copyright (C) 2013, http://www.dabeaz.com
List Comprehensions• Creates a new list by applying an operation
to each element of a sequence.>>> a = [1,2,3,4,5]>>> b = [2*x for x in a]>>> b[2, 4, 6, 8, 10]>>>
76
• Shorthand for this:>>> b = []>>> for x in a:... b.append(2*x)...>>>
![Page 39: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/39.jpg)
Copyright (C) 2013, http://www.dabeaz.com
List Comprehensions
• A list comprehension can also filter
>>> a = [1, -5, 4, 2, -2, 10]>>> b = [2*x for x in a if x > 0]>>> b[2, 8, 4, 20]>>>
77
Copyright (C) 2013, http://www.dabeaz.com
List Comp: Examples
• Collecting the values of a specific field
addrs = [r['STREET ADDRESS'] for r in records]
• Performing database-like queries
filled = [r for r in records if r['STATUS'] == 'Completed']
78
• Building new data structures
locs = [ (r['LATITUDE'], r['LONGITUDE']) for r in records ]
![Page 40: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/40.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Simplified Tabulation• Counter objects
79
from collections import Counter
words = ['yes','but','no','but','yes']wordcounts = Counter(words)
>>> wordcounts['yes']2>>> wordcounts.most_common()[('yes', 2), ('but', 2), ('no', 1)]>>>
Copyright (C) 2013, http://www.dabeaz.com
Advanced Sorting
• Use of a key-function
80
records.sort(key=lambda p: p['COMPLETION DATE'])records.sort(key=lambda p: p['ZIP'])
• lambda: creates a tiny in-line functionf = lambda p: p['COMPLETION DATE']
# Same asdef f(p): return p['COMPLETION DATE']
• Result of key func determines sort order
![Page 41: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/41.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Grouping of Data
• Iterating over groups of sorted data
81
from itertools import groupbygroups = groupby(records, key=lambda r: r['ZIP'])for zipcode, group in groups: for r in group: # All records with same zip-code ...
• Note: data must already be sorted by field
records.sort(key=lambda r: r['ZIP'])
Copyright (C) 2013, http://www.dabeaz.com
Index Building• Building indices to data
82
from collections import defaultdict
zip_index = defaultdict(list)for r in records: zip_index[r['ZIP']].append(r)
• Builds a dictionaryzip_index = { '60640' : [ rec, rec, ... ], '60637' : [ rec, rec, rec, ... ], ...}
![Page 42: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/42.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Third Party Libraries• Many useful packages
• numpy/scipy (array processing)
• matplotlib (plotting)
• pandas (statistics, data analysis)
• requests (interacting with APIs)
• ipython (better interactive shell)
• Too many others to list
83
Copyright (C) 2013, http://www.dabeaz.com
Coding Challenge
84
"Hmmmm.... Pies"
![Page 43: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/43.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Problem
85
You're ravenously hungry after all of that biking, but you can never be too careful.
Copyright (C) 2013, http://www.dabeaz.com
Problem
86
You're ravenously hungry after all of that biking, but you can never be too careful.
Your Task: Analyze Chicago's food inspection data and make a series of tasty pie charts and tables
![Page 44: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/44.jpg)
Copyright (C) 2013, http://www.dabeaz.com 87
The Datahttps://data.cityofchicago.org/Health-Human-
Services/Food-Inspections/4ijn-s7e5
• It's a 77MB CSV file. Don't download
• Available on USB key (passed around)
• New challenges abound!
Copyright (C) 2013, http://www.dabeaz.com 88
Problems of Interest
• Outcomes of a health-inspection (pass, fail)
• Risk levels
• Breakdown of establishment types
• Most common code violations
• Use your imagination...
![Page 45: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/45.jpg)
Copyright (C) 2013, http://www.dabeaz.com 89
To Make Charts...
You're going to have to install some packages...
Copyright (C) 2013, http://www.dabeaz.com 90
Bleeding Edge
![Page 46: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/46.jpg)
Copyright (C) 2013, http://www.dabeaz.com
Code
91
45 Minutes• Code should not be long
• For plotting/ipython consider EPD-Free, Anaconda CE, or other distribution
• See samples at http://www.dabeaz.com/pydata
Copyright (C) 2013, http://www.dabeaz.com 92
Where To Go From Here?
• Python coding
• Functions, modules, classes, objects
• Data analysis
• Numpy/Scipy, pandas, matplotlib
• Data sources
• Open government, data portals, etc.
![Page 47: Learn Python Through Public Data Hacking Requirements](https://reader033.vdocuments.site/reader033/viewer/2022052707/5881d3541a28abb9198b4f86/html5/thumbnails/47.jpg)
Copyright (C) 2013, http://www.dabeaz.com 93
Final Comments
• Thanks!
• Hope you had some fun!
• Learned at least a few new things
• Follow me on Twitter: @dabeaz