Learning Python from Data
DESCRIPTION
These are the slides for the COSCUP[1] 2013 hands-on[2] session "Learning Python from Data". They use examples to show the world of Python. I hope they will help you learn Python. [1] COSCUP: http://coscup.org/ [2] COSCUP Hands-on: http://registrano.com/events/coscup-2013-hands-on-mosky
TRANSCRIPT
LEARNING PYTHON FROM DATA
Mosky
1
THIS SLIDE
• The online version is at https://speakerdeck.com/mosky/learning-python-from-data.
• The examples are at https://github.com/moskytw/learning-python-from-data-examples.
2
MOSKY
• I am working at Pinkoi.
• I've taught Python for 100+ hours.
• A speaker at COSCUP 2014, PyCon SG 2014, PyCon APAC 2014, OSDC 2014, PyCon APAC 2013, ...
• The author of the Python packages: MoSQL, Clime, ZIPCodeTW, ...
• http://mosky.tw
3
SCHEDULE
• Warm-up
• Packages - Install the packages we need.
• CSV - Download a CSV file from the Internet and handle it.
• HTML - Parse HTML source code and write a Web crawler.
• SQL - Save the data into a SQLite database.
• The End
4
FIRST OF ALL,
5
PYTHON IS AWESOME!
6
2 OR 3?
• Use Python 3!
• But it actually depends on the libraries you need.
• https://python3wos.appspot.com/
• We will go ahead with Python 2.7, but I will also introduce the changes in Python 3.
7
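As a quick taste, here is a minimal Python 3 sketch of two of the changes this deck will keep pointing out: print became a function, and / became true division. (The variable names are just for illustration.)

```python
# Python 3: print is a function, / is true division, // is floor division.
quotient = 2 / 3    # a float in Python 3, not 0 as in Python 2
floored = 2 // 3    # 0 -- the old Python 2 integer-division behavior

print(quotient)     # note the parentheses: print is a function now
print(floored)
```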
THE ONLINE RESOURCES
• The Python Official Doc
• http://docs.python.org
• The Python Tutorial
• The Python Standard Library
• My Past Slides
• Programming with Python - Basic
• Programming with Python - Adv.
8
THE BOOKS
• Learning Python by Mark Lutz
• Programming in Python 3 by Mark Summerfield
• Python Essential Reference by David Beazley
9
PREPARATION
• Did you say "hello" to Python?
• If not, visit
• http://www.slideshare.net/moskytw/programming-with-python-basic.
• If yes, open your Python shell.
10
WARM-UP
The things you must know.
11
MATH & VARS
2 + 3
2 - 3
2 * 3
2 / 3, -2 / 3
(1+10)*10 / 2
2.0 / 3
2 % 3
2 ** 3

x = 2
y = 3
z = x + y
print z
'#' * 10
12
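In Python 3 the same warm-up looks like this; the notable differences are print(...) and the division operators:

```python
# The warm-up expressions, in Python 3 syntax.
print(2 + 3)               # 5
print(2 - 3)               # -1
print(2 * 3)               # 6
print(2 / 3)               # 0.666... -- true division in Python 3
print(-2 // 3)             # -1 -- floor division rounds toward negative infinity
print((1 + 10) * 10 / 2)   # 55.0
print(2 % 3)               # 2
print(2 ** 3)              # 8

x = 2
y = 3
z = x + y
print(z)                   # 5
print('#' * 10)            # ##########
```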
FOR
for i in [0, 1, 2, 3, 4]:
    print i

items = [0, 1, 2, 3, 4]
for i in items:
    print i

for i in range(5):
    print i

chars = 'SAHFI'
for i, c in enumerate(chars):
    print i, c

words = ('Samsung', 'Apple', 'HP', 'Foxconn', 'IBM')
for c, w in zip(chars, words):
    print c, w
13
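The enumerate and zip loops, in Python 3 syntax (collecting the enumerate pairs into a list so the result is easy to inspect):

```python
chars = 'SAHFI'
words = ('Samsung', 'Apple', 'HP', 'Foxconn', 'IBM')

# enumerate yields (index, item) pairs.
pairs = []
for i, c in enumerate(chars):
    pairs.append((i, c))
print(pairs)

# zip walks two sequences in lockstep.
for c, w in zip(chars, words):
    print(c, w)
```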
IF
for i in range(1, 10):
    if i % 2 == 0:
        print '{} is divisible by 2'.format(i)
    elif i % 3 == 0:
        print '{} is divisible by 3'.format(i)
    else:
        print '{} is not divisible by 2 nor 3'.format(i)
14
WHILE
while 1:
    n = int(raw_input('How big a pyramid do you want? '))
    if n <= 0:
        print 'It must be greater than 0: {}'.format(n)
        continue
    break
15
TRY
while 1:

    try:
        n = int(raw_input('How big a pyramid do you want? '))
    except ValueError as e:
        print 'It must be a number: {}'.format(e)
        continue

    if n <= 0:
        print 'It must be greater than 0: {}'.format(n)
        continue

    break
16
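The validate-and-retry logic is easier to test if the checks are factored out of the input loop; a Python 3 sketch (validate is an illustrative helper, not from the deck):

```python
def validate(text):
    """Turn `text` into the pyramid size, or raise ValueError
    with the same messages the loop above prints."""
    try:
        n = int(text)
    except ValueError as e:
        raise ValueError('It must be a number: {}'.format(e))
    if n <= 0:
        raise ValueError('It must be greater than 0: {}'.format(n))
    return n

print(validate('5'))
```

The loop then just calls validate(input(...)) and catches ValueError to retry.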
LOOP ... ELSE
for n in range(2, 100):
    for i in range(2, n):
        if n % i == 0:
            break
    else:
        print '{} is a prime!'.format(n)
17
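The same for ... else prime finder in Python 3, collecting the primes into a list instead of printing them:

```python
primes = []
for n in range(2, 100):
    for i in range(2, n):
        if n % i == 0:
            break
    else:                 # runs only when the inner loop did NOT break
        primes.append(n)

print(primes[:5])
```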
A PYRAMID
****
************
********************
****************************
************************************
18
A FATTER PYRAMID
******
**********************
*******************
19
YOUR TURN!
20
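One possible Python 3 solution sketch for the pyramid exercise; the deck's exact widths may differ (its rows grow by 8 stars), but the looping idea is the same:

```python
def pyramid(n):
    """Return an n-row centered pyramid of stars as one string.
    Row i (0-based) has 2*i + 1 stars."""
    rows = []
    for i in range(n):
        stars = '*' * (2 * i + 1)
        rows.append(stars.center(2 * n - 1).rstrip())
    return '\n'.join(rows)

print(pyramid(3))
```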
LIST COMPREHENSION
[n for n in range(2, 100)
   if not any(n % i == 0 for i in range(2, n))]
21
PACKAGES
import is important.
22
23
GET PIP - UN*X
• Debian family
• # apt-get install python-pip
• Red Hat family
• # yum install python-pip
• Mac OS X
• # easy_install pip
24
GET PIP - WIN*
• Follow the steps in http://stackoverflow.com/questions/4750806/how-to-install-pip-on-windows.
• Or just use easy_install to install it. easy_install should be found at C:\Python27\Scripts\.
• Or find the Windows installer on the Python Package Index.
25
3RD-PARTY PACKAGES
• requests - Python HTTP for Humans
• lxml - Pythonic XML processing library
• uniout - Print object representations in readable characters.
• clime - Convert a module into a CLI program w/o any config.
26
YOUR TURN!
27
CSV
Let's start by making an HTTP request!
28
HTTP GET
import requests

#url = 'http://stats.moe.gov.tw/files/school/101/u1_new.csv'
url = 'https://raw.github.com/moskytw/learning-python-from-data-examples/master/sql/schools.csv'

print requests.get(url).content
#print requests.get(url).text
29
FILE
save_path = 'school_list.csv'

with open(save_path, 'w') as f:
    f.write(requests.get(url).content)

with open(save_path) as f:
    print f.read()

with open(save_path) as f:
    for line in f:
        print line,
30
DEF
from os.path import basename

def save(url, path=None):

    if not path:
        path = basename(url)

    with open(path, 'w') as f:
        f.write(requests.get(url).content)
31
CSV
import csv
from os.path import exists

if not exists(save_path):
    save(url, save_path)

with open(save_path) as f:
    for row in csv.reader(f):
        print row
32
+ UNIOUT
import csv
from os.path import exists
import uniout  # You want this!

if not exists(save_path):
    save(url, save_path)

with open(save_path) as f:
    for row in csv.reader(f):
        print row
33
NEXT
with open(save_path) as f:
    next(f)  # skip the unwanted lines
    next(f)
    for row in csv.reader(f):
        print row
34
DICT READER
with open(save_path) as f:
    next(f)
    next(f)
    for row in csv.DictReader(f):
        print row

# We now have a great output. :)
35
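To see what DictReader does without downloading anything, here is a self-contained Python 3 sketch on an in-memory sample; the column names here are made up, not the real file's:

```python
import csv
import io

# A fake CSV file in memory; DictReader uses the first row as the keys.
sample = io.StringIO(
    'School Name,County\n'
    'First School,Taipei\n'
    'Second School,Tainan\n'
)

rows = list(csv.DictReader(sample))
print(rows[0]['School Name'])
```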
DEF AGAIN
def parse_to_school_list(path):
    school_list = []
    with open(path) as f:
        next(f)
        next(f)
        for school in csv.DictReader(f):
            school_list.append(school)

    return school_list[:-2]
36
+ COMPREHENSION
def parse_to_school_list(path='schools.csv'):
    with open(path) as f:
        next(f)
        next(f)
        school_list = [school for school in csv.DictReader(f)][:-2]

    return school_list
37
+ PRETTY PRINT
from pprint import pprint

pprint(parse_to_school_list(save_path))

# AWESOME!
38
PYTHONIC
school_list = parse_to_school_list(save_path)

# hmmm ...
for school in school_list:
    print school['School Name']

# It is more Pythonic! :)
print [school['School Name'] for school in school_list]
39
GROUP BY
from itertools import groupby

# You MUST sort it.
keyfunc = lambda school: school['County']
school_list.sort(key=keyfunc)

for county, schools in groupby(school_list, keyfunc):
    for school in schools:
        print '%s %r' % (county, school)
    print '---'
40
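The same groupby pattern on made-up records, as a runnable Python 3 sketch; note again that groupby only merges adjacent items, so the sort comes first:

```python
from itertools import groupby

# Illustrative records, not the real school data.
schools = [
    {'name': 'A', 'county': 'Taipei'},
    {'name': 'B', 'county': 'Tainan'},
    {'name': 'C', 'county': 'Taipei'},
]

keyfunc = lambda school: school['county']
schools.sort(key=keyfunc)  # groupby only merges ADJACENT items

grouped = {
    county: [s['name'] for s in group]
    for county, group in groupby(schools, keyfunc)
}
print(grouped)
```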
DOCSTRING
'''It contains some useful functions for parsing data from the government.'''

def save(url, path=None):
    '''It saves data from `url` to `path`.'''
    ...

--- Shell ---

$ pydoc csv_docstring
41
CLIME
if __name__ == '__main__':
    import clime.now

--- Shell ---

$ python csv_clime.py
usage: basename <p>
   or: parse-to-school-list <path>
   or: save [--path] <url>

It contains some useful functions for parsing data from the government.
42
DOC TIPS
help(requests)

print dir(requests)

print '\n'.join(dir(requests))
43
YOUR TURN!
44
HTML
Have fun with the final crawler. ;)
45
LXML
import requests
from lxml import etree

content = requests.get('http://clbc.tw').content
root = etree.HTML(content)

print root
46
CACHE
from os.path import exists

cache_path = 'cache.html'

if exists(cache_path):
    with open(cache_path) as f:
        content = f.read()
else:
    content = requests.get('http://clbc.tw').content
    with open(cache_path, 'w') as f:
        f.write(content)
47
SEARCHING
head = root.find('head')
print head

head_children = head.getchildren()
print head_children

metas = head.findall('meta')
print metas

title_text = head.findtext('title')
print title_text
48
XPATH
titles = root.xpath('/html/head/title')
print titles[0].text

title_texts = root.xpath('/html/head/title/text()')
print title_texts[0]

as_ = root.xpath('//a')
print as_
print [a.get('href') for a in as_]
49
MD5
from hashlib import md5

message = 'There should be one-- and preferably only one --obvious way to do it.'

print md5(message).hexdigest()

# Actually, it has nothing to do with HTML.
50
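One Python 3 change worth noting here: hashlib wants bytes, not str, so the message must be encoded first. A minimal sketch:

```python
from hashlib import md5

message = 'There should be one-- and preferably only one --obvious way to do it.'

# Python 3: hash functions take bytes, so encode the str first.
digest = md5(message.encode('utf-8')).hexdigest()
print(digest)
```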
DEF GET
from os import makedirs
from os.path import exists, join

def get(url, cache_dir_path='cache/'):

    if not exists(cache_dir_path):
        makedirs(cache_dir_path)

    cache_path = join(cache_dir_path, md5(url).hexdigest())

    ...
51
DEF FIND_URLS
def find_urls(content):
    root = etree.HTML(content)
    return [
        a.attrib['href']
        for a in root.xpath('//a')
        if 'href' in a.attrib
    ]
52
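lxml is a third-party package; as a stdlib-only alternative, the same find_urls idea can be sketched in Python 3 with html.parser (LinkParser is an illustrative name, not part of the deck's examples):

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag, like find_urls does with XPath."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)
            if 'href' in attrs:
                self.urls.append(attrs['href'])

def find_urls(content):
    parser = LinkParser()
    parser.feed(content)
    return parser.urls

print(find_urls('<a href="/a">A</a> <b>x</b> <a name="no-href">B</a>'))
```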
BFS 1/2
NEW = 0
QUEUED = 1
VISITED = 2

def search_urls(url):

    url_queue = [url]
    url_state_map = {url: QUEUED}

    while url_queue:

        url = url_queue.pop(0)
        print url
53
BFS 2/2
        # continue the previous page
        try:
            found_urls = find_urls(get(url))
        except Exception, e:
            url_state_map[url] = e
            print 'Exception: %s' % e
        except KeyboardInterrupt, e:
            return url_state_map
        else:
            for found_url in found_urls:
                if not url_state_map.get(found_url, NEW):
                    url_queue.append(found_url)
                    url_state_map[found_url] = QUEUED
            url_state_map[url] = VISITED
54
DEQUE
from collections import deque

...

def search_urls(url):

    url_queue = deque([url])

    ...

    while url_queue:

        url = url_queue.popleft()
        print url

    ...
55
YIELD
...

def search_urls(url):

    ...

    while url_queue:

        url = url_queue.pop(0)
        yield url

        ...

        except KeyboardInterrupt, e:
            print url_state_map
            return

...
56
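Putting the BFS, deque, and yield slides together: a runnable Python 3 sketch that swaps the network fetch for a hypothetical in-memory link graph (PAGES), so the traversal itself can be exercised:

```python
from collections import deque

# A fake "web": each page maps to the URLs it links to.
PAGES = {
    '/': ['/about', '/blog'],
    '/about': ['/'],
    '/blog': ['/post-1'],
    '/post-1': [],
}

NEW, QUEUED, VISITED = 0, 1, 2

def search_urls(start):
    url_queue = deque([start])
    url_state_map = {start: QUEUED}
    while url_queue:
        url = url_queue.popleft()   # deque makes pop-from-the-left O(1)
        yield url                   # a generator instead of printing
        for found_url in PAGES.get(url, []):
            if not url_state_map.get(found_url, NEW):
                url_queue.append(found_url)
                url_state_map[found_url] = QUEUED
        url_state_map[url] = VISITED

print(list(search_urls('/')))
```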
YOUR TURN!
57
SQL
How about saving the CSV file into a db?
58
TABLE
CREATE TABLE schools (
    id TEXT PRIMARY KEY,
    name TEXT,
    county TEXT,
    address TEXT,
    phone TEXT,
    url TEXT,
    type TEXT
);

DROP TABLE schools;
59
CRUD
INSERT INTO schools (id, name) VALUES ('1', 'The First');
INSERT INTO schools VALUES (...);

SELECT * FROM schools WHERE id='1';
SELECT name FROM schools WHERE id='1';

UPDATE schools SET id='10' WHERE id='1';

DELETE FROM schools WHERE id='10';
60
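The statements above can be exercised end-to-end with sqlite3 and an in-memory database; a Python 3 sketch:

```python
import sqlite3

# ':memory:' keeps the sketch self-contained -- no file on disk.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

cur.execute('CREATE TABLE schools (id TEXT PRIMARY KEY, name TEXT)')
cur.execute("INSERT INTO schools (id, name) VALUES ('1', 'The First')")
cur.execute("UPDATE schools SET id='10' WHERE id='1'")

cur.execute("SELECT name FROM schools WHERE id='10'")
name = cur.fetchone()[0]
print(name)

cur.execute("DELETE FROM schools WHERE id='10'")
cur.execute('SELECT COUNT(*) FROM schools')
count = cur.fetchone()[0]
print(count)
```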
COMMON PATTERN
import sqlite3

db_path = 'schools.db'
conn = sqlite3.connect(db_path)
cur = conn.cursor()

cur.execute('''CREATE TABLE schools (
    ...
)''')
conn.commit()

cur.close()
conn.close()
61
ROLLBACK
...

try:
    cur.execute('...')
except:
    conn.rollback()
    raise
else:
    conn.commit()

...
62
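A runnable Python 3 sketch of the rollback pattern, using an in-memory database and a duplicate primary key to force the failure:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE schools (id TEXT PRIMARY KEY)')
conn.commit()

try:
    cur.execute("INSERT INTO schools VALUES ('1')")
    cur.execute("INSERT INTO schools VALUES ('1')")  # duplicate key -> fails
except sqlite3.IntegrityError:
    conn.rollback()   # undoes the first INSERT too
else:
    conn.commit()

cur.execute('SELECT COUNT(*) FROM schools')
count = cur.fetchone()[0]
print(count)
```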
PARAMETERIZED QUERY
...

rows = ...

for row in rows:
    cur.execute('INSERT INTO schools VALUES (?, ?, ?, ?, ?, ?, ?)', row)

conn.commit()

...
63
EXECUTEMANY
...

rows = ...

cur.executemany('INSERT INTO schools VALUES (?, ?, ?, ?, ?, ?, ?)', rows)

conn.commit()

...
64
FETCH
...

cur.execute('select * from schools')

print cur.fetchone()

# or
print cur.fetchall()

# or
for row in cur:
    print row

...
65
TEXT FACTORY
# SQLite only: lets you pass 8-bit strings as parameters.

...

conn = sqlite3.connect(db_path)
conn.text_factory = str

...
66
ROW FACTORY
# SQLite only: lets you convert each tuple into a dict. It is `DictCursor` in some other connectors.

def dict_factory(cursor, row):
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d

...
conn.row_factory = dict_factory
...
67
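sqlite3 also ships a built-in row factory, sqlite3.Row, which gives name-based access without writing dict_factory by hand; a Python 3 sketch:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row   # built-in alternative to dict_factory

cur = conn.cursor()
cur.execute('CREATE TABLE schools (id TEXT, name TEXT)')
cur.execute("INSERT INTO schools VALUES ('1', 'The First')")

cur.execute('SELECT * FROM schools')
row = cur.fetchone()
print(row['name'])   # access columns by name, like a dict
```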
MORE
• Python DB API 2.0
• MySQLdb - MySQL connector for Python
• Psycopg2 - PostgreSQL adapter for Python
• SQLAlchemy - the Python SQL toolkit and ORM
• MoSQL - Build SQL from common Python data structures.
68
THE END
• You learned how to ...
• make an HTTP request
• load a CSV file
• parse an HTML file
• write a Web crawler
• use SQL with SQLite
• and a lot of other techniques today. ;)
69