Learning Python from Data
DESCRIPTION
These are the slides for the COSCUP[1] 2013 hands-on[2] session "Learning Python from Data". They use examples to show the world of Python. I hope they will help you learn Python. [1] COSCUP: http://coscup.org/ [2] COSCUP Hands-on: http://registrano.com/events/coscup-2013-hands-on-mosky
TRANSCRIPT
LEARNING PYTHON FROM DATA
Mosky
1
THIS SLIDE
• The online version is at https://speakerdeck.com/mosky/learning-python-from-data.
• The examples are at https://github.com/moskytw/learning-python-from-data-examples.
2
MOSKY
• I am working at Pinkoi.
• I've taught Python for 100+ hours.
• A speaker at COSCUP 2014, PyCon SG 2014, PyCon APAC 2014, OSDC 2014, PyCon APAC 2013, ...
• The author of the Python packages: MoSQL, Clime, ZIPCodeTW, ...
• http://mosky.tw
3
SCHEDULE
• Warm-up
• Packages - Install the packages we need.
• CSV - Download a CSV file from the Internet and handle it.
• HTML - Parse HTML source code and write a Web crawler.
• SQL - Save the data into a SQLite database.
• The End
4
FIRST OF ALL,
5
PYTHON IS AWESOME!
6
2 OR 3?
• Use Python 3!
• But it actually depends on the libraries you need.
• https://python3wos.appspot.com/
• We will go ahead with Python 2.7, but I will also introduce the changes in Python 3.
7
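As a quick taste, here is a minimal Python 3 sketch of two of the changes this deck will keep pointing out: print became a function, and / became true division. (The variable names are just for illustration.)

```python
# Python 3: print is a function, / is true division, // is floor division.
quotient = 2 / 3    # a float in Python 3, not 0 as in Python 2
floored = 2 // 3    # 0 -- the old Python 2 integer-division behavior

print(quotient)     # note the parentheses: print is a function now
print(floored)
```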
THE ONLINE RESOURCES
• The Python Official Doc
• http://docs.python.org
• The Python Tutorial
• The Python Standard Library
• My Past Slides
• Programming with Python - Basic
• Programming with Python - Adv.
8
THE BOOKS
• Learning Python by Mark Lutz
• Programming in Python 3 by Mark Summerfield
• Python Essential Reference by David Beazley
9
PREPARATION
• Did you say "hello" to Python?
• If not, visit
• http://www.slideshare.net/moskytw/programming-with-python-basic.
• If yes, open your Python shell.
10
WARM-UP
The things you must know.
11
MATH & VARS
2 + 3
2 - 3
2 * 3
2 / 3, -2 / 3
(1+10)*10 / 2
2.0 / 3
2 % 3
2 ** 3

x = 2
y = 3
z = x + y
print z
'#' * 10
12
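In Python 3 the same warm-up looks like this; the notable differences are print(...) and the division operators:

```python
# The warm-up expressions, in Python 3 syntax.
print(2 + 3)               # 5
print(2 - 3)               # -1
print(2 * 3)               # 6
print(2 / 3)               # 0.666... -- true division in Python 3
print(-2 // 3)             # -1 -- floor division rounds toward negative infinity
print((1 + 10) * 10 / 2)   # 55.0
print(2 % 3)               # 2
print(2 ** 3)              # 8

x = 2
y = 3
z = x + y
print(z)                   # 5
print('#' * 10)            # ##########
```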
FOR
for i in [0, 1, 2, 3, 4]:
    print i

items = [0, 1, 2, 3, 4]
for i in items:
    print i

for i in range(5):
    print i

chars = 'SAHFI'
for i, c in enumerate(chars):
    print i, c

words = ('Samsung', 'Apple', 'HP', 'Foxconn', 'IBM')
for c, w in zip(chars, words):
    print c, w
13
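The enumerate and zip loops, in Python 3 syntax (collecting the enumerate pairs into a list so the result is easy to inspect):

```python
chars = 'SAHFI'
words = ('Samsung', 'Apple', 'HP', 'Foxconn', 'IBM')

# enumerate yields (index, item) pairs.
pairs = []
for i, c in enumerate(chars):
    pairs.append((i, c))
print(pairs)

# zip walks two sequences in lockstep.
for c, w in zip(chars, words):
    print(c, w)
```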
IF
for i in range(1, 10):
    if i % 2 == 0:
        print '{} is divisible by 2'.format(i)
    elif i % 3 == 0:
        print '{} is divisible by 3'.format(i)
    else:
        print '{} is not divisible by 2 nor 3'.format(i)
14
WHILE
while 1:
    n = int(raw_input('How big a pyramid do you want? '))
    if n <= 0:
        print 'It must be greater than 0: {}'.format(n)
        continue
    break
15
TRY
while 1:

    try:
        n = int(raw_input('How big a pyramid do you want? '))
    except ValueError as e:
        print 'It must be a number: {}'.format(e)
        continue

    if n <= 0:
        print 'It must be greater than 0: {}'.format(n)
        continue

    break
16
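The validate-and-retry logic is easier to test if the checks are factored out of the input loop; a Python 3 sketch (validate is an illustrative helper, not from the deck):

```python
def validate(text):
    """Turn `text` into the pyramid size, or raise ValueError
    with the same messages the loop above prints."""
    try:
        n = int(text)
    except ValueError as e:
        raise ValueError('It must be a number: {}'.format(e))
    if n <= 0:
        raise ValueError('It must be greater than 0: {}'.format(n))
    return n

print(validate('5'))
```

The loop then just calls validate(input(...)) and catches ValueError to retry.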
LOOP ... ELSE
for n in range(2, 100):
    for i in range(2, n):
        if n % i == 0:
            break
    else:
        print '{} is a prime!'.format(n)
17
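The same for ... else prime finder in Python 3, collecting the primes into a list instead of printing them:

```python
primes = []
for n in range(2, 100):
    for i in range(2, n):
        if n % i == 0:
            break
    else:                 # runs only when the inner loop did NOT break
        primes.append(n)

print(primes[:5])
```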
A PYRAMID
****
************
********************
****************************
************************************
18
A FATTER PYRAMID
******
**********************
*******************
19
YOUR TURN!
20
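One possible Python 3 solution sketch for the pyramid exercise; the deck's exact widths may differ (its rows grow by 8 stars), but the looping idea is the same:

```python
def pyramid(n):
    """Return an n-row centered pyramid of stars as one string.
    Row i (0-based) has 2*i + 1 stars."""
    rows = []
    for i in range(n):
        stars = '*' * (2 * i + 1)
        rows.append(stars.center(2 * n - 1).rstrip())
    return '\n'.join(rows)

print(pyramid(3))
```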
LIST COMPREHENSION
[n for n in range(2, 100)
   if not any(n % i == 0 for i in range(2, n))]
21
PACKAGES
import is important.
22
23
GET PIP - UN*X
• Debian family
• # apt-get install python-pip
• Red Hat family
• # yum install python-pip
• Mac OS X
• # easy_install pip
24
GET PIP - WIN*
• Follow the steps in http://stackoverflow.com/questions/4750806/how-to-install-pip-on-windows.
• Or just use easy_install to install it. easy_install should be found at C:\Python27\Scripts\.
• Or find the Windows installer on the Python Package Index.
25
3RD-PARTY PACKAGES
• requests - Python HTTP for Humans
• lxml - Pythonic XML processing library
• uniout - Print object representations in readable characters.
• clime - Convert a module into a CLI program w/o any config.
26
YOUR TURN!
27
CSV
Let's start by making an HTTP request!
28
HTTP GET
import requests

#url = 'http://stats.moe.gov.tw/files/school/101/u1_new.csv'
url = 'https://raw.github.com/moskytw/learning-python-from-data-examples/master/sql/schools.csv'

print requests.get(url).content
#print requests.get(url).text
29
FILE
save_path = 'school_list.csv'

with open(save_path, 'w') as f:
    f.write(requests.get(url).content)

with open(save_path) as f:
    print f.read()

with open(save_path) as f:
    for line in f:
        print line,
30
DEF
from os.path import basename

def save(url, path=None):

    if not path:
        path = basename(url)

    with open(path, 'w') as f:
        f.write(requests.get(url).content)
31
CSV
import csv
from os.path import exists

if not exists(save_path):
    save(url, save_path)

with open(save_path) as f:
    for row in csv.reader(f):
        print row
32
+ UNIOUT
import csv
from os.path import exists
import uniout  # You want this!

if not exists(save_path):
    save(url, save_path)

with open(save_path) as f:
    for row in csv.reader(f):
        print row
33
NEXT
with open(save_path) as f:
    next(f)  # skip the unwanted lines
    next(f)
    for row in csv.reader(f):
        print row
34
DICT READER
with open(save_path) as f:
    next(f)
    next(f)
    for row in csv.DictReader(f):
        print row

# We now have a great output. :)
35
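To see what DictReader does without downloading anything, here is a self-contained Python 3 sketch on an in-memory sample; the column names here are made up, not the real file's:

```python
import csv
import io

# A fake CSV file in memory; DictReader uses the first row as the keys.
sample = io.StringIO(
    'School Name,County\n'
    'First School,Taipei\n'
    'Second School,Tainan\n'
)

rows = list(csv.DictReader(sample))
print(rows[0]['School Name'])
```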
DEF AGAIN
def parse_to_school_list(path):
    school_list = []
    with open(path) as f:
        next(f)
        next(f)
        for school in csv.DictReader(f):
            school_list.append(school)

    return school_list[:-2]
36
+ COMPREHENSION
def parse_to_school_list(path='schools.csv'):
    with open(path) as f:
        next(f)
        next(f)
        school_list = [school for school in csv.DictReader(f)][:-2]

    return school_list
37
+ PRETTY PRINT
from pprint import pprint

pprint(parse_to_school_list(save_path))

# AWESOME!
38
PYTHONIC
school_list = parse_to_school_list(save_path)

# hmmm ...
for school in school_list:
    print school['School Name']

# It is more Pythonic! :)
print [school['School Name'] for school in school_list]
39
GROUP BY
from itertools import groupby

# You MUST sort it.
keyfunc = lambda school: school['County']
school_list.sort(key=keyfunc)

for county, schools in groupby(school_list, keyfunc):
    for school in schools:
        print '%s %r' % (county, school)
    print '---'
40
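The same groupby pattern on made-up records, as a runnable Python 3 sketch; note again that groupby only merges adjacent items, so the sort comes first:

```python
from itertools import groupby

# Illustrative records, not the real school data.
schools = [
    {'name': 'A', 'county': 'Taipei'},
    {'name': 'B', 'county': 'Tainan'},
    {'name': 'C', 'county': 'Taipei'},
]

keyfunc = lambda school: school['county']
schools.sort(key=keyfunc)  # groupby only merges ADJACENT items

grouped = {
    county: [s['name'] for s in group]
    for county, group in groupby(schools, keyfunc)
}
print(grouped)
```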
DOCSTRING
'''It contains some useful functions for parsing data from the government.'''

def save(url, path=None):
    '''It saves data from `url` to `path`.'''
    ...

--- Shell ---

$ pydoc csv_docstring
41
CLIME
if __name__ == '__main__':
    import clime.now

--- Shell ---

$ python csv_clime.py
usage: basename <p>
   or: parse-to-school-list <path>
   or: save [--path] <url>

It contains some useful functions for parsing data from the government.
42
DOC TIPS
help(requests)

print dir(requests)

print '\n'.join(dir(requests))
43
YOUR TURN!
44
HTML
Have fun with the final crawler. ;)
45
LXML
import requests
from lxml import etree

content = requests.get('http://clbc.tw').content
root = etree.HTML(content)

print root
46
CACHE
from os.path import exists

cache_path = 'cache.html'

if exists(cache_path):
    with open(cache_path) as f:
        content = f.read()
else:
    content = requests.get('http://clbc.tw').content
    with open(cache_path, 'w') as f:
        f.write(content)
47
SEARCHING
head = root.find('head')
print head

head_children = head.getchildren()
print head_children

metas = head.findall('meta')
print metas

title_text = head.findtext('title')
print title_text
48
XPATH
titles = root.xpath('/html/head/title')
print titles[0].text

title_texts = root.xpath('/html/head/title/text()')
print title_texts[0]

as_ = root.xpath('//a')
print as_
print [a.get('href') for a in as_]
49
MD5
from hashlib import md5

message = 'There should be one-- and preferably only one --obvious way to do it.'

print md5(message).hexdigest()

# Actually, it has nothing to do with HTML.
50
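One Python 3 change worth noting here: hashlib wants bytes, not str, so the message must be encoded first. A minimal sketch:

```python
from hashlib import md5

message = 'There should be one-- and preferably only one --obvious way to do it.'

# Python 3: hash functions take bytes, so encode the str first.
digest = md5(message.encode('utf-8')).hexdigest()
print(digest)
```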
DEF GET
from os import makedirs
from os.path import exists, join

def get(url, cache_dir_path='cache/'):

    if not exists(cache_dir_path):
        makedirs(cache_dir_path)

    cache_path = join(cache_dir_path, md5(url).hexdigest())

    ...
51
DEF FIND_URLS
def find_urls(content):
    root = etree.HTML(content)
    return [
        a.attrib['href']
        for a in root.xpath('//a')
        if 'href' in a.attrib
    ]
52
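lxml is a third-party package; as a stdlib-only alternative, the same find_urls idea can be sketched in Python 3 with html.parser (LinkParser is an illustrative name, not part of the deck's examples):

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag, like find_urls does with XPath."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)
            if 'href' in attrs:
                self.urls.append(attrs['href'])

def find_urls(content):
    parser = LinkParser()
    parser.feed(content)
    return parser.urls

print(find_urls('<a href="/a">A</a> <b>x</b> <a name="no-href">B</a>'))
```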
BFS 1/2
NEW = 0
QUEUED = 1
VISITED = 2

def search_urls(url):

    url_queue = [url]
    url_state_map = {url: QUEUED}

    while url_queue:

        url = url_queue.pop(0)
        print url
53
BFS 2/2
        # continue the previous page
        try:
            found_urls = find_urls(get(url))
        except Exception, e:
            url_state_map[url] = e
            print 'Exception: %s' % e
        except KeyboardInterrupt, e:
            return url_state_map
        else:
            for found_url in found_urls:
                if not url_state_map.get(found_url, NEW):
                    url_queue.append(found_url)
                    url_state_map[found_url] = QUEUED
            url_state_map[url] = VISITED
54
DEQUE
from collections import deque

...

def search_urls(url):

    url_queue = deque([url])

    ...

    while url_queue:

        url = url_queue.popleft()
        print url

    ...
55
YIELD
...

def search_urls(url):

    ...

    while url_queue:

        url = url_queue.pop(0)
        yield url

        ...

        except KeyboardInterrupt, e:
            print url_state_map
            return

...
56
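Putting the BFS, deque, and yield slides together: a runnable Python 3 sketch that swaps the network fetch for a hypothetical in-memory link graph (PAGES), so the traversal itself can be exercised:

```python
from collections import deque

# A fake "web": each page maps to the URLs it links to.
PAGES = {
    '/': ['/about', '/blog'],
    '/about': ['/'],
    '/blog': ['/post-1'],
    '/post-1': [],
}

NEW, QUEUED, VISITED = 0, 1, 2

def search_urls(start):
    url_queue = deque([start])
    url_state_map = {start: QUEUED}
    while url_queue:
        url = url_queue.popleft()   # deque makes pop-from-the-left O(1)
        yield url                   # a generator instead of printing
        for found_url in PAGES.get(url, []):
            if not url_state_map.get(found_url, NEW):
                url_queue.append(found_url)
                url_state_map[found_url] = QUEUED
        url_state_map[url] = VISITED

print(list(search_urls('/')))
```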
YOUR TURN!
57
SQL
How about saving the CSV file into a db?
58
TABLE
CREATE TABLE schools (
    id TEXT PRIMARY KEY,
    name TEXT,
    county TEXT,
    address TEXT,
    phone TEXT,
    url TEXT,
    type TEXT
);

DROP TABLE schools;
59
CRUD
INSERT INTO schools (id, name) VALUES ('1', 'The First');
INSERT INTO schools VALUES (...);

SELECT * FROM schools WHERE id='1';
SELECT name FROM schools WHERE id='1';

UPDATE schools SET id='10' WHERE id='1';

DELETE FROM schools WHERE id='10';
60
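The statements above can be exercised end-to-end with sqlite3 and an in-memory database; a Python 3 sketch:

```python
import sqlite3

# ':memory:' keeps the sketch self-contained -- no file on disk.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

cur.execute('CREATE TABLE schools (id TEXT PRIMARY KEY, name TEXT)')
cur.execute("INSERT INTO schools (id, name) VALUES ('1', 'The First')")
cur.execute("UPDATE schools SET id='10' WHERE id='1'")

cur.execute("SELECT name FROM schools WHERE id='10'")
name = cur.fetchone()[0]
print(name)

cur.execute("DELETE FROM schools WHERE id='10'")
cur.execute('SELECT COUNT(*) FROM schools')
count = cur.fetchone()[0]
print(count)
```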
COMMON PATTERN
import sqlite3

db_path = 'schools.db'
conn = sqlite3.connect(db_path)
cur = conn.cursor()

cur.execute('''CREATE TABLE schools (
    ...
)''')
conn.commit()

cur.close()
conn.close()
61
ROLLBACK
...

try:
    cur.execute('...')
except:
    conn.rollback()
    raise
else:
    conn.commit()

...
62
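A runnable Python 3 sketch of the rollback pattern, using an in-memory database and a duplicate primary key to force the failure:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE schools (id TEXT PRIMARY KEY)')
conn.commit()

try:
    cur.execute("INSERT INTO schools VALUES ('1')")
    cur.execute("INSERT INTO schools VALUES ('1')")  # duplicate key -> fails
except sqlite3.IntegrityError:
    conn.rollback()   # undoes the first INSERT too
else:
    conn.commit()

cur.execute('SELECT COUNT(*) FROM schools')
count = cur.fetchone()[0]
print(count)
```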
PARAMETERIZED QUERY
...

rows = ...

for row in rows:
    cur.execute('INSERT INTO schools VALUES (?, ?, ?, ?, ?, ?, ?)', row)

conn.commit()

...
63
EXECUTEMANY
...

rows = ...

cur.executemany('INSERT INTO schools VALUES (?, ?, ?, ?, ?, ?, ?)', rows)

conn.commit()

...
64
FETCH
...

cur.execute('select * from schools')

print cur.fetchone()

# or
print cur.fetchall()

# or
for row in cur:
    print row

...
65
TEXT FACTORY
# SQLite only: lets you pass 8-bit strings as parameters.

...

conn = sqlite3.connect(db_path)
conn.text_factory = str

...
66
ROW FACTORY
# SQLite only: lets you convert each tuple into a dict. It is `DictCursor` in some other connectors.

def dict_factory(cursor, row):
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d

...
conn.row_factory = dict_factory
...
67
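sqlite3 also ships a built-in row factory, sqlite3.Row, which gives name-based access without writing dict_factory by hand; a Python 3 sketch:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row   # built-in alternative to dict_factory

cur = conn.cursor()
cur.execute('CREATE TABLE schools (id TEXT, name TEXT)')
cur.execute("INSERT INTO schools VALUES ('1', 'The First')")

cur.execute('SELECT * FROM schools')
row = cur.fetchone()
print(row['name'])   # access columns by name, like a dict
```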
MORE
• Python DB API 2.0
• MySQLdb - MySQL connector for Python
• Psycopg2 - PostgreSQL adapter for Python
• SQLAlchemy - the Python SQL toolkit and ORM
• MoSQL - Build SQL from common Python data structures.
68
THE END
• You learned how to ...
• make an HTTP request
• load a CSV file
• parse an HTML file
• write a Web crawler
• use SQL with SQLite
• and a lot of other techniques today. ;)
69