Scrape the Web: Strategies for programming websites that don’t expect it

Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org, +1-585-506-8865)

February 18, 2010

Outline

Programming the web

Stats pop quiz

The web: Round one

The web: HTTP and you

Recap and philosophy

Parser redux

Countermeasures

Automating the web browser

Other tricks

Conclusions

I You will learn neat tricks

I DO NOT BECOME AN EVIL COMMENT SPAMMER

I Theory, practice, and iterative development

I Brittle? Sometimes.

I The comics aren’t mine; ask me for references.

Format introduction

I I’ll stand up here and talk about things.

I You’ll ask me questions.

You know what sucks?

I It sucks when everyone’s thinking something and nobody’s saying it.

I If I am incoherent, stop me.

“Only” three hours

I Slow me down,

I or speed me up.

I Do this with your voice or by raising your hand.

I Don’t try to do it via Twitter.

What is screen scraping?

Brittle?

Remote procedure call

I Every time you press a key, you cause the remote computer to execute code.

I Every keypress causes a remote procedure call.

I If you understand this, you can document it as an API.

I We get to interact with the raw data.

I We could write our own interface.

I We get to programmatically interact with a system that only expects humans at the door.

Independence

I Design choices and restrictions fall away.

Power, too much

I WE CAN SEND SPAM!

I Don’t do that.

Programming the web

The Web

I It’s the twenty-first century.

I The Web is a massive, mostly-unrestricted remote procedure call system.

Mac OS “say”

I I’m not hip enough to have “say”

I but I do have the Web

Cepstral demo

Delicious

Curry on the web

http://mehfilindian.com/LunchMenuTakeOut.htm

Beneath the covers...

I FrontPage 6.0 is from 2003

I Some really ugly HTML...

I I like to call this 1998-style HTML

The easy way

examples/curry/trivial.py

I urllib2.urlopen() gives you a file-like object

I Now you can read() it... (and you get a big ol’ byte string)

I Test its contents for squash, and you’re done.
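The three steps above fit in a few lines. This is a minimal sketch, not the contents of examples/curry/trivial.py: it assumes Python 3, where urllib2.urlopen() has become urllib.request.urlopen(), and the helper names mentions() and fetch() are made up for illustration.

```python
from urllib.request import urlopen

# URL taken from the slides; it may no longer resolve.
MENU_URL = "http://mehfilindian.com/LunchMenuTakeOut.htm"


def fetch(url):
    """Download a URL and return the response body as one byte string."""
    with urlopen(url) as response:
        return response.read()


def mentions(page_bytes, word):
    """Return True if `word` occurs anywhere in the raw page, ignoring case."""
    return word.lower().encode("ascii") in page_bytes.lower()
```

Calling mentions(fetch(MENU_URL), "squash") answers the question without parsing any HTML at all, which is exactly why this is "the easy way".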

The Web and standards

I We don’t have to resort to visual screen scraping.

I The web has a standard data format for marking up page content.

I What is it called?

XHTML and HTML

I It’s 2010.

I Surely XHTML has won by now.

“Extract some information”

I HTML

I vs. XHTML (2000)

I Both are trees of tags; both can be visualized in FireBug.

I ...did XHTML win?

Stats pop quiz

(Stats from the MAMA survey published by Opera <http://dev.opera.com/articles/view/mama-key-findings/>.)

I Average page size? 16.5K

I HTML to XHTML ratio? 2:1

I Transitional vs. Strict/Frameset: 10:1

I How many in “Quirks” mode? 85%

I What’s more popular? TITLE or BODY? TITLE

I What percent validate in general? ca. 4.13%

I What percent of web pages that have validation badges validate? ca. 12

The web: Round one

Parsing considerations

A showcase of some of your options

I An example of valid HTML (written by hand) (examples/parsing/)

I Parsed with HTMLParser

I An example of invalid HTML (cooked by hand) (examples/parsing/invalid-html/)

I An example of valid XHTML (written by hand) (examples/parsing/valid-xhtml/)

I Parsed with xml.dom.minidom

I Parsed with HTMLParser

I An example of invalid XHTML <http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm> (examples/parsing/invalid-xhtml/)

I In Firefox

I In xml.dom.minidom

I In HTMLParser

I If web HTML is not always parseable, we need a different approach.
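To make the contrast concrete, here is a small sketch that feeds the same broken markup to a lenient event parser and to a strict XML parser. It assumes Python 3 (html.parser rather than the Python 2 HTMLParser module); the BROKEN sample and the TagCollector class are invented for illustration and are not the files under examples/parsing/.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hand-cooked invalid markup: <b> and both <p> tags are never closed.
BROKEN = "<html><body><p>Unclosed <b>bold<p>Next paragraph</body></html>"


class TagCollector(HTMLParser):
    """Record every start tag the lenient parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def strict_parse_ok(markup):
    """Return True if the XML parser accepts the markup, False if it rejects it."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


collector = TagCollector()
collector.feed(BROKEN)  # reports tags as events; does not complain about nesting
strict_result = strict_parse_ok(BROKEN)  # the XML parser refuses the same input
```

The event parser reports tags without building or repairing a tree, so neither stdlib option gives you a usable document for messy pages.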

Other ways to get information out of web pages?

I "squash" in page_contents.lower()

I re.search("squash", page_contents, re.IGNORECASE)

Inspirational quote: JWZ

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. – Jamie Zawinski

What’s wrong with regular expressions for scraping

I <a href="/whatever/">

I <a href='whatever'>

I <a href='whatever">

I Okay for “Reviews 1-10 of 430”

I Kodos: Regular expression GUI (since redemo.py seems unmaintained)
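A short sketch of both halves of that slide: a naive href pattern that silently misses quoting variants, next to a pattern for the pagination banner, where the text has one fixed shape. The pattern names and sample strings are illustrative assumptions.

```python
import re

# Naive pattern: only matches href values wrapped in double quotes.
NAIVE_HREF = re.compile(r'<a href="([^"]*)">')

# The pagination banner has a predictable shape, so a regex is a fine tool.
REVIEW_RANGE = re.compile(r"Reviews (\d+)-(\d+) of (\d+)")

links = ['<a href="/whatever/">', "<a href='whatever'>", "<a href=whatever>"]
matched = [bool(NAIVE_HREF.search(link)) for link in links]

first, last, total = map(int, REVIEW_RANGE.search("Reviews 1-10 of 430").groups())
```

Only the first link matches; the other two are links a browser would happily follow, which is the argument for handing tag soup to a real parser and keeping regular expressions for small, well-shaped strings.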

Inspirational quote: Jon Postel

Robustness principle: “Be conservative in what you do, be liberal in what you accept from others.” – Jon Postel, Transmission Control Protocol, RFC 793

Inspirational quote: Leonard Richardson

“You didn’t write that awful page. You’re just trying to get some data out of it. Right now, you don’t really care what HTML is supposed to look like.” – Leonard Richardson, author of BeautifulSoup

Back to curry

New goal for curry: Objectify

Map the menu to Python objects

I play with the source in BeautifulSoup

I ...this is a text processing problem, not tag processing.

Model the data

examples/curry/menu.py

class Entree:

I index

I name

I description

I long winded description

I price
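One way such a model could look, treating the menu as a text processing problem. This is a sketch, not the contents of examples/curry/menu.py: it assumes Python 3.7+ dataclasses, keeps only the index, name, description and price fields, and the line layout the regular expression expects is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Entree:
    index: int
    name: str
    description: str
    price: float


# Hypothetical line layout: "<index>. <name> - <description> $<price>"
LINE = re.compile(r"^(\d+)\.\s+(.+?)\s+-\s+(.+?)\s+\$(\d+\.\d{2})$")


def parse_entree(line):
    """Turn one line of menu text into an Entree, or None if it does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    index, name, description, price = match.groups()
    return Entree(int(index), name, description, float(price))
```

Returning None for lines that do not fit makes it easy to log the misses and tighten the pattern iteratively.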

Mini-lesson

I hand-written pages vs.

I machine-written pages

New goal: Scrape Yahoo! finance

I examples/tree-builders/beautifulsoup_yfinance.py

We’re done!

Right?

Trees of tags

What defines how HTML gets parsed?

Web browsers

Surfing tag trees in FireBug

I Or Opera Dragonfly

I Or Chrome’s Inspector

Parsing trees and finding elements

Early history

I 1998: HTML::TokeParser for Perl

I $p->get_tag("title")

I 1999: W3C XPath standard

I xmlDoc.selectNodes(“//title”)

I 2004: BeautifulSoup for Python, Release 1.0, “So rich and green”

I soup(“title”)

I 2006: scrAPI for Ruby

I CSS Selectors...

I title

I span.title

Recent history

I 2007: lxml.html improved, publicized by Ian Bicking

I CSS selectors for Pythonistas

I 2007: html5lib: Parse web pages like a browser

I 2008: BeautifulSoup 3.1.0, the end of an era

I 2010: html5lib deprecates BeautifulSoup

I “cannot correctly represent any HTML 5 tree (for lack of namespace support), and cannot represent at all any containing MathML or SVG”

Searching tag trees

I BeautifulSoup API (examples/tree-builders/beautifulsoup/search.py)

I html5lib creates BeautifulSoup objects (or others) (examples/tree-builders/html5lib/search.py)

I lxml provides XPath (examples/tree-builders/lxml/search_xpath.py)

I “minimal stable XPath”

I lxml provides CSSSelect (examples/tree-builders/lxml/search_css.py)
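For a feel of what tree searching looks like, here is a sketch using only the standard library. It swaps in xml.etree.ElementTree, which supports a limited XPath subset and only accepts well-formed input, in place of the lxml and BeautifulSoup examples from the talk; the XHTML sample is invented.

```python
import xml.etree.ElementTree as ET

# Well-formed sample document, made up for this sketch.
XHTML = """<html>
  <head><title>Lunch menu</title></head>
  <body>
    <span class="title">Squash curry</span>
    <span class="price">6.95</span>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# Roughly the XPath query //title
page_title = root.findtext(".//title")

# Roughly the CSS selector span.title (this predicate needs an exact class match,
# whereas a real CSS selector also matches class="title big")
span_titles = [el.text for el in root.findall(".//span[@class='title']")]
```

Short queries like these are in the spirit of "minimal stable XPath": the less of the page layout a query depends on, the less often it breaks.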

Interacting with the web

Basic Yahoo! search (hard-coded)

examples/search/yahoo.py

Basic Google! search (hard-coded)

examples/search/google.py

I Great code, but broken due to ?

Something’s wrong...

The web: HTTP and you

A network trace of an HTTP conversation

User-Agent, and other headers the client sends

Status codes

I 2xx: Success

I 3xx: Redirection

I 4xx: Error

I 402: Payment Required

I 404 Not Found

I 410 Gone

I 418 I’m a teapot
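A scraper usually branches on the class of the status code rather than on each individual value. A minimal sketch, assuming Python 3's http.HTTPStatus table; the helper names status_class() and describe() are made up.

```python
from http import HTTPStatus

CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Map a numeric HTTP status code to its broad class via the first digit."""
    return CLASSES.get(code // 100, "unknown")


def describe(code):
    """Return 'code phrase (class)', using the standard library's phrase table."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes missing from this Python version's table land here.
        phrase = "unregistered"
    return f"{code} {phrase} ({status_class(code)})"
```

For example, describe(410) tells you the page is gone for good, so a crawler can drop the URL instead of retrying it.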

HTTP methods

I POST

I BREW

Once we set User-Agent, are we just like Firefox?

I JavaScript behavior

I Image download behavior

I Cookie behavior

I Invalid HTML handling behavior (?)

I Accept: headers

What if we settle for approximate emulation?

Re-do of Google search with a cooked user-agent

examples/search/urllib2-user-agent/google_as_ie.py

Favorite User-Agent headers

I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)

I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;(compatible;))

I I can’t believe it’s not Googlebot/2.1
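Setting the header is a one-liner once you build the request yourself. This sketch assumes Python 3's urllib.request in place of the urllib2 module the talk's example uses; build_request() and the search URL are illustrative.

```python
from urllib.request import Request

# First User-Agent string from the list above.
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)"


def build_request(url, user_agent=USER_AGENT):
    """Build a request that announces itself with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("http://www.google.com/search?q=pycon")
# Passing `request` to urllib.request.urlopen() would send it with this header.
```

Without the explicit header, urllib identifies itself as Python-urllib, which is the kind of client some sites turn away.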

HTTP: State via cookies

I HTTP is stateless on its own; cookies implement state on top of it

robots.txt

I User-agent: *

I Disallow: /

I Allow: /crawlme.html

I http://www.robotstxt.org/
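If you do want to honor these rules, the standard library can evaluate them for you. A sketch, assuming Python 3's urllib.robotparser; the agent name "ExampleBot" and the example.com URLs are placeholders. As far as I know this parser applies the first rule that matches, so the Allow line is listed before the blanket Disallow here.

```python
from urllib.robotparser import RobotFileParser

# Same rules as on the slide, with Allow moved first for Python's parser.
RULES = [
    "User-agent: *",
    "Allow: /crawlme.html",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(RULES)  # parse lines we already have; no network access needed


def allowed(url, agent="ExampleBot"):
    """Return True if the rules above permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)
```

To check a live site instead, call parser.set_url() with the site's robots.txt address and then parser.read(), which performs the very GET /robots.txt request that can mark you as a robot.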

robots.txt and detectability

I “How does the server know you’re a robot?”

I Well, if you GET /robots.txt...

Filling out more forms: POST and GET

(Be sure to pay attention to the clock; minute 90 is when snack break starts.)

POST: Cepstral Weather demo (by hand)

http://cepstral.com/cgi-bin/demos/weather

Note the URL we POST to

I from FireBug

Note the data we POST

I from FireBug

Write simple Python that also POSTs

examples/cepstral/just_post.py
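Once FireBug has shown you the URL and the form data, reproducing the POST takes little code. This is a sketch rather than the talk's own script: it assumes Python 3's urllib, and the form field names "city" and "voice" are hypothetical stand-ins for whatever the real form sends.

```python
from urllib.parse import urlencode
from urllib.request import Request

# URL from the slides; the demo may no longer exist.
DEMO_URL = "http://cepstral.com/cgi-bin/demos/weather"


def build_post(url, fields):
    """Encode form fields the way a browser would and attach them as the body."""
    body = urlencode(fields).encode("ascii")
    return Request(url, data=body)  # supplying data turns the request into a POST


# Field names here are assumptions; copy the real ones from the network trace.
request = build_post(DEMO_URL, {"city": "Atlanta", "voice": "Allison"})
```

Passing the request to urllib.request.urlopen() would submit the form and return the server's response for further scraping.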

Pull out the .wav file and play it with mplayer

examples/cepstral/play_wav.py

POST: Cepstral weather demo (via mechanize)

examples/cepstral/just_post_via_mechanize.py

Basic Yahoo! search (via mechanize)

examples/search/yahoo_mechanize.py

I Great code, but broken due to robots.txt

Basic Yahoo! search (via mechanize, handle_robots=False)

examples/search/yahoo_mechanize_norobots.py

Basic Google! search (via mechanize, handle_robots=False, change user-agent)

examples/search/google_mechanize.py

Cookies

emusic: Log in and verify that we logged in successfully (with cookielib) (optional)

examples/cookies/emusic_login_byhand.py
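Doing cookies by hand mostly means creating a cookie jar and routing every request through an opener that uses it. A sketch, assuming Python 3, where cookielib is named http.cookiejar; make_session(), has_cookie() and the cookie name in the usage note are hypothetical.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_session():
    """Return (opener, jar); the opener stores and resends cookies via the jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def has_cookie(jar, name):
    """Return True if the jar currently holds a cookie with this name."""
    return any(cookie.name == name for cookie in jar)


opener, jar = make_session()
# Every opener.open(...) call from here on shares the same cookies.
```

After posting the login form through opener.open(), a check such as has_cookie(jar, "session_id") is one cheap way to verify that the login worked; the real cookie name depends on the site.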

emusic: Log in and verify that we logged in successfully (with mechanize)

examples/cookies/emusic_login_mechanize.py

emusic: Check how many downloads we have left (with mechanize)

examples/cookies/emusic_check_downloads.py

Now we’re done, right?

Recap and philosophy

We’ve seen:

I Loading web pages from the network with urllib2

I Parsing web pages (even broken ones)

I Scraping that page into a set of structured Python objects

I HTTP status codes

I Faking the user agent header

I Submitting forms

I Keeping a session with cookies

“Play nice” on the web

I Ignore Terms of Service at your own peril

I robots.txt

Why scrape the web?

I Anger

I Interoperation with unmaintained systems

I “Rogue interoperability”

Web APIs

Facebook uses standards!

I XMPP chat doesn’t support:

I grouping contacts

I status messages

I large profile images

I notifications

I What’s the point?

“Sorry”

I Ohloh: “Sorry, it is not currently possible to get the list of commits through the API.”

I Flickr: No way to get a user avatar via the API.

I API keys are evidence of submission.

I Where is the love?

I Why even play this game?

Parser redux

Choosing a parser

I Performance

I Ease-of-use

I Quality

I Especially as relates to cleaning broken HTML

I HTML: 1998-style, or 2003-style?

Benchmarks by Ian Bicking

I Benchmarks run by me this morning

I same results as Ian

Ease of use

Tree fixups

I lxml ≈ BeautifulSoup

I lxml ≈ html5lib

I BeautifulSoup 3.0.7 > BeautifulSoup 3.1.0

A winner

I lxml!

I ...?

More about CSS selectors

I FireQuark

I http://www.imdb.com/title/tt0111161/

I h5:contains("Release")

I CSS...

Countermeasures

Imagine a really stupid bot

Check Referer header

I mechanize solves this

Extra hidden form fields

Requiring cookies

Countermeasures: hard

Per-IP address query limits

Example: Yahoo web search API

I Use more IPs

I Tor, or

I your own machines

I Use SOCKS (plus SSH) to make this easy

CAPTCHAs

Example: Google web search (when you exceed undeclared query limits).

I uh-oh

JavaScript

Example: “Hash cash” system for avoiding comment spam.

I uh-oh

Invisible countermeasures

Behavior profiling

I Time-based?

Inserting false link visible only to bots

I “Tarpits”

robots.txt access

I As soon as you access it, you lose.

Getting around IP address limits

Understand

I We still have to stay within the limits. We can just take advantage of IPs we do have.

ssh -D

I Borrow the IP of any machine you can log in to

I ssh -D 1080 asheesh.org

socks_monkey

I SOCKSify Python from within Python

I examples/ip-limits/socks_monkey.py

tsocks

I SOCKSify Python via LD_PRELOAD

I examples/ip-limits/tsocks/

“The onion router”

I SOCKSify but borrow someone else’s IP

I (play nice...)

Cycling strategies

I Drain it dry

I easy to implement first

I Round-robin

I generally preferable
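The two strategies differ only in the order in which a request plan consumes each proxy's quota. A sketch under simple assumptions: every proxy has the same per-IP quota, proxy names are unique, and the function names and parameters are made up for illustration.

```python
from itertools import cycle


def drain_it_dry(proxies, quota, n_requests):
    """Use each proxy until its quota is spent, then move on to the next."""
    plan = []
    for proxy in proxies:
        take = min(quota, n_requests - len(plan))
        plan.extend([proxy] * take)
    return plan


def round_robin(proxies, quota, n_requests):
    """Rotate through the proxies one request at a time, respecting each quota."""
    remaining = {proxy: quota for proxy in proxies}
    plan = []
    for proxy in cycle(proxies):
        if len(plan) == n_requests or not any(remaining.values()):
            break  # done, or every proxy is exhausted
        if remaining[proxy]:
            remaining[proxy] -= 1
            plan.append(proxy)
    return plan
```

Both functions return the list of proxies to use, in order, and both stop early when the combined quota runs out. Round-robin spreads consecutive requests across addresses, which is one reason it is generally preferable.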

Return to JavaScript: breaking Hash Cash

Detecting its presence

I Attempt to submit a comment with JS disabled

I Attempt to submit a comment with JS enabled

I Trace the second in FireBug

Rewriting the JavaScript as Python

I You may think I’m joking, but this is a common strategy.
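To show what such a port can look like, here is a generic proof-of-work routine of the kind a "hash cash" script might run in the browser. It is an assumption-laden sketch: the real page's JavaScript defines the actual challenge format and hash, and the names solve(), verify() and the "challenge:nonce" layout are invented here.

```python
import hashlib
from itertools import count


def digest(challenge, nonce):
    """Hash the challenge together with a candidate nonce."""
    return hashlib.sha256(f"{challenge}:{nonce}".encode("utf-8")).hexdigest()


def solve(challenge, difficulty=3):
    """Find the smallest nonce whose digest starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    for nonce in count():
        if digest(challenge, nonce).startswith(prefix):
            return nonce


def verify(challenge, nonce, difficulty=3):
    """Check a nonce the cheap way, as the server would."""
    return digest(challenge, nonce).startswith("0" * difficulty)
```

The port only pays off when the JavaScript is small and stable; if the site changes its script, the Python copy silently stops matching.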

DOMForm

I Good news

“DOMForm is a Python module for web scraping and web testing. It knows how to evaluate embedded JavaScript code in response to appropriate events.” – John J. Lee of mechanize

I Bad news

“This module is unmaintained. Maybe someday...” Also, it does not execute page-global JavaScript, which is where HashCash is implemented.

python-spidermonkey

I Good news

I “Python/JavaScript bridge module, making use of Mozilla’s spidermonkey JavaScript implementation.”

I Bad news

I ...do you really want to parse the web page for JavaScript and execute it?

I examples/javascript/hashcash.py

I None of this is as clean and automated as mechanize.

“Breaking” CAPTCHAs

Fallback: yourself

I Can always just prompt the operator to figure it out and enter it

Mailinator: “Enter these words to delete the email”

I Only so many different images

I So build a look-up table

I ...indexed by URL?

I ...indexed by image contents?

I ...indexed by fuzzy image contents?

(I don’t have a good tool for the last one.)
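Indexing by image contents can be as simple as hashing the downloaded bytes. A sketch of that middle option only; the CaptchaTable class is hypothetical, and an exact hash will miss images that differ by even one byte, which is where the fuzzy variant would be needed.

```python
import hashlib


def image_key(image_bytes):
    """Identify an image by a digest of its exact bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


class CaptchaTable:
    """Look-up table from image contents to the answer a human once typed."""

    def __init__(self):
        self._answers = {}

    def learn(self, image_bytes, answer):
        self._answers[image_key(image_bytes)] = answer

    def lookup(self, image_bytes):
        """Return the stored answer, or None if this image has not been seen."""
        return self._answers.get(image_key(image_bytes))
```

When lookup() returns None, fall back to prompting the operator and feed the answer to learn(), so the table fills itself in over time.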

Audio captchas: “Simple” signal analysis

I Should be doable in pylab/matplotlib with fast Fourier transforms

JavaScript CAPTCHAs (like reCAPTCHA)

I re-implement CAPTCHA-downloading logic in Python

I ...or execute the JavaScript with spidermonkey

...JDownloader

I “Again, our captcha team did a great job and implemented many new captcha methods.”

The website from Hell: US PTO Public PAIR

http://portal.uspto.gov/external/portal/pair

Start with a CAPTCHA

Solve it and move on to...

I document.write()

The page is invisible.

Automating the web browser

Selenium Remote Control

examples/seleniumrc/start.py

Selenium IDE

I Our friend, XPath

I FireBug

Why don’t we just do this all the time?

I Firefox memory footprint

I Flexibility

Other tricks

Your parser may fail

Text encoding

I Look in the HTTP header!

I Try UTF-8!

I ...chardet, if you must
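That order of attempts translates directly into a fallback chain. A sketch using only the standard library: the function names are made up, and latin-1 is used as the last resort because it never fails (at the cost of possibly wrong characters).

```python
from email.message import Message


def charset_from_header(content_type):
    """Pull the charset parameter out of a Content-Type header value, if any."""
    if not content_type:
        return None
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()


def decode_body(body, content_type=None):
    """Decode bytes using the header's charset first, then UTF-8, then latin-1."""
    for encoding in (charset_from_header(content_type), "utf-8"):
        if encoding is None:
            continue
        try:
            return body.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding; try the next candidate
    # A detector such as chardet could be tried at this point instead.
    return body.decode("latin-1")
```

With urllib, the header value is available as response.headers.get("Content-Type"), so a typical call is decode_body(response.read(), response.headers.get("Content-Type")).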

Automatically reverse-engineer templates

I templatemaker by Adrian Holovaty

I everyblock templatemaker

table2dict

I Python bug tracker

Conclusions

Scaling and stability

I Choosing reliable queries from web pages

I Expanding to more IP addresses when necessary using SSH (and Python 2.6 multiprocessing for a plausible model of how to rotate SOCKS proxies)

I Tor (and other proxy considerations)

I registrar.py: was seven years stable...

Summary

I If it’s on a web page, you can scrape it out.

I “Now you have an API for everything.”

Future directions

I More automation

I Using cssselect everywhere, geez it’s cool

Bonus time

If we have time:

I Greasemonkey demo: scraping in the browser

I Audience-suggested scraping lab

I Workshopping on queries or regular expressions
