Scrape the Web: Strategies for programming websites that don’t expect it

Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org, +1-585-506-8865)

February 18, 2010

Outline

Intro

Programming the web

Stats pop quiz

The web: Round one

The web: HTTP and you

Recap and philosophy

Parser redux

Countermeasures

Automating the web browser

Other tricks

Conclusions

Intro

Meta

Hello

- You will learn neat tricks
- DO NOT BECOME AN EVIL COMMENT SPAMMER
- Theory, practice, and iterative development
- Brittle? Sometimes.
- The comics aren’t mine; ask me for references.

Format introduction

- I’ll stand up here and talk about things.
- You’ll ask me questions.

You know what sucks?

- It sucks when everyone’s thinking something and nobody’s saying it.
- If I am incoherent, stop me.

“Only” three hours

- Slow me down,
- or speed me up.
- Do this with your voice or by raising your hand.
- Don’t try to do it via Twitter.

What is screen scraping?

Photo

Photo

Brittle?

Remote procedure call

- Every time you press a key, you cause the remote computer to execute code.
- Every keypress causes a remote procedure call.
- If you understand this, you can document it as an API.

Power

- We get to interact with the raw data.
- We could write our own interface.
- We get to programmatically interact with a system that only expects humans at the door.

Independence

- Design choices and restrictions fall away.

Power, too much

- WE CAN SEND SPAM!
- Don’t do that.

Programming the web

Say

The Web

- It’s the twenty-first century.
- The Web is a massive, mostly-unrestricted remote procedure call system.

Mac OS “say”

- I’m not hip enough to have “say”
- but I do have the Web

Cepstral demo

Curry

Delicious

Curry on the web

http://mehfilindian.com/LunchMenuTakeOut.htm

Beneath the covers...

- FrontPage 6.0 is from 2003
- Some really ugly HTML...
- I like to call this 1998-style HTML

The easy way

examples/curry/trivial.py

- urllib2.urlopen() gives you a file-like object
- Now you can read() it... (and you get a big ol’ byte string)
- Test its contents for squash, and you’re done.
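The three steps above can be sketched roughly as follows. This is not the contents of examples/curry/trivial.py, which the transcript does not show; it is a minimal stand-in that assumes Python 3, where urllib.request.urlopen() plays the role urllib2.urlopen() played in Python 2. The helper names fetch and mentions are made up for illustration.

```python
from urllib.request import urlopen

# The lunch menu page named earlier in the talk.
MENU_URL = "http://mehfilindian.com/LunchMenuTakeOut.htm"


def fetch(url=MENU_URL, timeout=10):
    """Open the URL and return the whole response body as bytes.

    Needs network access; the page may no longer exist.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def mentions(page_bytes, word=b"squash"):
    """Case-insensitive substring test on the raw byte string."""
    return word.lower() in page_bytes.lower()
```

With a network connection, mentions(fetch()) would answer the "is there squash today?" question; the check itself works on any byte string.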

The Web and standards

- We don’t have to resort to visual screen scraping.
- The web has a standard data format for marking up page content.
- What is it called?

XHTML and HTML

- It’s 2010.
- Surely XHTML has won by now.

“Extract some information”

- HTML
- vs. XHTML (2000)
- Both are trees of tags; both can be visualized in FireBug.
- ...did XHTML win?

Stats pop quiz

(Stats from the MAMA survey published by Opera <http://dev.opera.com/articles/view/mama-key-findings/>.)

- Average page size? 16.5K
- HTML to XHTML ratio? 2:1
- Transitional vs. Strict/Frameset: 10:1
- How many in “Quirks” mode? 85%
- What’s more popular? TITLE or BODY? TITLE
- What percent validate in general? ca. 4.13%
- What percent of web pages that have validation badges validate? ca. 12

The web: Round one

Parsing considerations

A showcase of some of your options

- An example of valid HTML (written by hand) (examples/parsing/)
  - Parsed with HTMLParser
- An example of invalid HTML (cooked by hand) (examples/parsing/invalid-html/)
  - Parsed with HTMLParser
- An example of valid XHTML (written by hand) (examples/parsing/valid-xhtml/)
  - Parsed with xml.dom.minidom
  - Parsed with HTMLParser
- An example of invalid XHTML <http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm> (examples/parsing/invalid-xhtml/)
  - In Firefox
  - In xml.dom.minidom
  - In HTMLParser
- If web HTML is not always parseable, we need a different approach.
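To make the contrast concrete, here is a small sketch of a strict XML parser and an event-based HTML parser meeting the same broken markup. It is not taken from the examples/parsing/ files; the INVALID snippet and the helper names are invented for illustration, and it assumes Python 3, where the HTMLParser class lives in html.parser and is more forgiving than the Python 2 module of the same name was.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Made-up markup: <br> is never closed and <b> is closed by </p>.
INVALID = "<html><body><p>Squash curry<br><b>spicy</p></body></html>"
VALID = "<html><body><p>Squash curry</p></body></html>"


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parses_as_xml(markup):
    """True if xml.dom.minidom accepts the markup as well-formed XML."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        return False
    return True


def collect_tags(markup):
    """Feed the markup to the event-based parser and list its start tags."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

The strict parser rejects the whole document at the first mismatch, while the event-based parser reports tags one by one without building a tree, which is part of why neither is a complete answer for pages found in the wild.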

Other ways to get information out of web pages?

- "squash" in page_contents.lower()
- re.search("squash", page_contents, re.IGNORECASE)
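Both one-liners answer the same yes/no question. A runnable sketch, assuming page_contents is already a text string and using made-up function names:

```python
import re


def contains_plain(page_contents, word="squash"):
    """Lowercase both sides, then use the 'in' operator."""
    return word.lower() in page_contents.lower()


def contains_regex(page_contents, word="squash"):
    """Same test with a case-insensitive regular expression."""
    # re.escape keeps the word literal even if it contains regex metacharacters.
    return re.search(re.escape(word), page_contents, re.IGNORECASE) is not None
```

For a fixed word the two behave the same; the regular expression only earns its keep once the pattern is more than a literal string.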

Inspirational quote: JWZ

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. – Jamie Zawinski

What’s wrong with regular expressions for scraping

- <a href="/whatever/">
- <a href='whatever'>
- <a href='whatever">
- Okay for “Reviews 1-10 of 430”
- Kodos: Regular expression GUI (since redemo.py seems unmaintained)
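A short sketch of both halves of that slide: a link pattern that looks reasonable but only matches one quoting style, and a pattern for the "Reviews 1-10 of 430" case where a regular expression is a fine tool. The patterns and function names are illustrative assumptions, not code from the talk.

```python
import re

# Naive: assumes href is the only attribute and is always double-quoted.
NAIVE_HREF = re.compile(r'<a href="([^"]*)">')

# Plain text with a rigid shape is where regular expressions do well.
REVIEW_COUNT = re.compile(r"Reviews (\d+)-(\d+) of (\d+)")


def naive_links(html):
    """Return href values the naive pattern manages to find."""
    return NAIVE_HREF.findall(html)


def review_range(text):
    """Return (first, last, total) as integers, or None if absent."""
    match = REVIEW_COUNT.search(text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())
```

The single-quoted link is silently skipped rather than reported as an error, which is what makes regex-based tag matching brittle.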

Inspirational quote: Jon Postel

Robustness principle: “Be conservative in what you do, be liberal in what you accept from others.” – Jon Postel, Transmission Control Protocol, RFC 793

Inspirational quote: Leonard Richardson

“You didn’t write that awful page. You’re just trying to get some data out of it. Right now, you don’t really care what HTML is supposed to look like.” – Leonard Richardson, author of BeautifulSoup

Back to curry

New goal for curry: Objectify

Map the menu to Python objects

- Play with the source in BeautifulSoup
- ...this is a text processing problem, not tag processing.

Model the data

examples/curry/menu.py

class Entree:

- index
- name
- description
- long_winded_description
- price
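One way the Entree model might look in code, treating the menu as text rather than tags. This is a sketch, not the contents of examples/curry/menu.py: the field types, the "7. Name - description $8.95" line layout, and the parse_entree helper are all assumptions made for illustration, since the transcript does not show the real menu text.

```python
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Entree:
    index: int
    name: str
    description: str
    long_winded_description: str
    price: Decimal


# Hypothetical line layout: "<index>. <name> - <description> $<price>"
LINE = re.compile(r"^\s*(\d+)\.\s+(.+?)\s+-\s+(.+?)\s+\$(\d+\.\d{2})\s*$")


def parse_entree(line, long_winded_description=""):
    """Turn one line of menu text into an Entree, or None if it doesn't fit."""
    match = LINE.match(line)
    if match is None:
        return None
    index, name, description, price = match.groups()
    return Entree(int(index), name, description,
                  long_winded_description, Decimal(price))
```

Keeping the price as a Decimal avoids float rounding when prices are later summed or compared.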

Mini-lesson

- hand-written pages vs.
- machine-written pages

New goal: Scrape Yahoo! finance

- examples/tree-builders/beautifulsoup_yfinance.py

We’re done!

Right?

Trees of tags

What defines how HTML gets parsed?

Web browsers

Surfing tag trees in FireBug

- Or Opera Dragonfly
- Or Chrome’s Inspector

Parsing trees and finding elements

Early history

- 1998: HTML::TokeParser for Perl
  - $p->get_tag("title")
- 1999: W3C XPath standard
  - xmlDoc.selectNodes("//title")
- 2004: BeautifulSoup for Python, Release 1.0, “So rich and green”
  - soup("title")
- 2006: scrAPI for Ruby
  - CSS Selectors...
    - title
    - span.title
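For a feel of what these lookups do, here is a sketch using only the Python standard library. xml.etree.ElementTree supports a limited subset of XPath and needs well-formed input, so this is an illustration of the idea rather than a substitute for the libraries in the timeline; the PAGE snippet and find_texts helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page: ElementTree would reject sloppy HTML.
PAGE = """<html><head><title>Lunch menu</title></head>
<body><span class="title">Squash curry</span><span>Naan</span></body></html>"""


def find_texts(markup, path):
    """Parse the markup and return the text of every element matching path."""
    root = ET.fromstring(markup)
    return [element.text for element in root.findall(path)]
```

Here ".//title" corresponds to the "//title" XPath on the slide, and ".//span[@class='title']" is the closest ElementTree equivalent of the CSS selector span.title for an element whose class attribute is exactly "title".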

Recent history

- 2007: lxml.html improved, publicized by Ian Bicking
  - CSS selectors for Pythonistas
- 2007: html5lib: Parse web pages like a browser
- 2008: BeautifulSoup 3.1.0, the end of an era
- 2010: html5lib deprecates BeautifulSoup
  - “cannot correctly represent any HTML 5 tree (for lack of namespace support), and cannot represent at all any containing MathML or SVG”

Searching tag trees

I BeautifulSoup API (examples/tree-builders/beautifulsoup/search.py)

I html5lib creates BeautifulSoup objects (or others) (examples/tree-builders/html5lib/search.py)

I lxml provides XPath (examples/tree-builders/lxml/search_xpath.py)

I “minimal stable XPath”

I lxml provides CSSSelect (examples/tree-builders/lxml/search_css.py)
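To give a feel for these query styles (this is not the content of the example files named above), here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree as a stand-in: it understands only a small XPath subset and needs well-formed markup, whereas lxml accepts the same expressions and far more. The sample page and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real-world HTML usually needs a
# forgiving parser such as lxml.html, BeautifulSoup, or html5lib.
PAGE = """
<html>
  <head><title>Scrape the Web</title></head>
  <body>
    <span class="title">First</span>
    <span class="other">Skipped</span>
    <span class="title">Second</span>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# XPath-style query: the <title> element anywhere below the root.
page_title = root.find(".//title").text

# Rough equivalent of the CSS selector "span.title" (exact class match only).
span_titles = [el.text for el in root.findall(".//span[@class='title']")]

print(page_title)
print(span_titles)
```

A short query such as `.//title` tends to survive page redesigns better than a long absolute path, which is presumably the idea behind “minimal stable XPath”.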

Interacting with the web

Basic Yahoo! search (hard-coded)

examples/search/yahoo.py

Basic Google! search (hard-coded)

examples/search/google.py

I Great code, but broken due to ?

Something’s wrong...

The web: HTTP and you

A network trace of an HTTP conversation

User-Agent, and other headers the client sends

Status codes

I 2xx: Success

I 3xx: Redirection

I 4xx: Client error

I 402 Payment Required

I 404 Not Found

I 410 Gone

I 418 I’m a teapot
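A scraper usually branches on the class of the status code rather than on each individual value. The sketch below assumes Python 3's http.HTTPStatus for the reason phrases; the helper name status_class is made up, and the 1xx and 5xx classes are added for completeness even though the slide stops at 4xx.

```python
from http import HTTPStatus


def status_class(code):
    """Map a numeric HTTP status code to its broad class."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")


for code in (200, 301, 402, 404, 410):
    # HTTPStatus supplies the standard reason phrase for registered codes.
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```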

HTTP methods

I GET

I POST

I PUT

I BREW

Once we set User-Agent, are we just like Firefox?

I JavaScript behavior

I Image download behavior

I Cookie behavior

I Invalid HTML handling behavior (?)

I Accept: headers

What if we settle for approximate emulation?

Re-do of Google search with a cooked user-agent

examples/search/urllib2-user-agent/google_as_ie.py

Favorite User-Agent headers

I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)

I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;(compatible;))

I I can’t believe it’s not Googlebot/2.1
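Setting a cooked User-Agent takes one header. The talk's examples use urllib2; the sketch below assumes its Python 3 successor, urllib.request, and a placeholder URL. It only builds the request, so nothing is sent until you call urlopen yourself.

```python
import urllib.request

# One of the User-Agent strings from the slide.
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request carrying a custom User-Agent (not yet sent)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# example.com is a placeholder, not the search engine from the demo.
req = build_request("http://example.com/search?q=python")

# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))

# To actually send it: urllib.request.urlopen(req).read()
```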

HTTP: State via cookies

I HTTP has no built-in state; cookies layer session state on top of it
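The mechanism is small: the server sends a Set-Cookie header, and the client echoes the name=value pair back in a Cookie header on later requests. Here is a sketch of that round trip using Python 3's http.cookies; the cookie name and value are invented. In practice a cookie jar (cookielib, now http.cookiejar) or mechanize does this bookkeeping for you.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value a server might send after a login.
set_cookie_value = "sessionid=abc123; Path=/; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie_value)

# On the next request, the client sends back only the name=value pairs;
# attributes such as Path and HttpOnly stay on the client side.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in jar.items()
)
print("Cookie:", cookie_header)
```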

robots.txt

I User-agent: *

I Disallow: /

I Allow: /crawlme.html

I http://www.robotstxt.org/
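If you want to honor these rules in your own code, the standard library can evaluate them. A sketch assuming Python 3's urllib.robotparser, with the slide's rules fed in directly rather than fetched; the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The rules from the slide, parsed from a string instead of the network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
Allow: /crawlme.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "Disallow: /" blocks ordinary paths for every user agent.
blocked = not parser.can_fetch("*", "http://example.com/private.html")
print("private.html blocked:", blocked)

# How "Allow" interacts with an earlier "Disallow" varies between
# parsers, so check /crawlme.html with the parser you actually use.
print("crawlme.html allowed:",
      parser.can_fetch("*", "http://example.com/crawlme.html"))
```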

robots.txt and detectability

I “How does the server know you’re a robot?”

I Well, if you GET /robots.txt...

Filling out more forms: POST and GET

(Be sure to pay attention to the clock; minute 90 is when snack break starts.)

POST: Cepstral Weather demo (by hand)

http://cepstral.com/cgi-bin/demos/weather

Note the URL we POST to

I from FireBug

Note the data we POST

I from FireBug

Write simple Python that also POSTs

examples/cepstral/just_post.py
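The file above is not reproduced here, but the general shape of a hand-written POST might look like the following sketch. It assumes Python 3's urllib.request; the endpoint and the form field names are hypothetical, so substitute whatever your browser's network inspector shows.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and field names; read the real ones from your
# browser's network inspector, as the slides do with Firebug.
POST_URL = "http://example.com/cgi-bin/demo"
form_fields = {"city": "Atlanta", "voice": "default"}

# Form data travels URL-encoded in the request body, as bytes.
body = urllib.parse.urlencode(form_fields).encode("ascii")

# Supplying data= makes urllib issue a POST instead of a GET.
req = urllib.request.Request(POST_URL, data=body)
print(req.get_method(), req.full_url)
print(body)

# To send: response_bytes = urllib.request.urlopen(req).read()
```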

Pull out the .wav file and play it with mplayer

examples/cepstral/play_wav.py

POST: Cepstral weather demo (via mechanize)

examples/cepstral/just_post_via_mechanize.py

Basic Yahoo! search (via mechanize)

examples/search/yahoo_mechanize.py

I Great code, but broken due to robots.txt

Basic Yahoo! search (via mechanize, handle_robots=False)

examples/search/yahoo_mechanize_norobots.py

Basic Google! search (via mechanize, handle_robots=False, change user-agent)

examples/search/google_mechanize.py

Cookies

emusic: Log in and verify that we logged in successfully (with cookielib) (optional)

examples/cookies/emusic_login_byhand.py

emusic: Log in and verify that we logged in successfully (with mechanize)

examples/cookies/emusic_login_mechanize.py

emusic: Check how many downloads we have left (with mechanize)

examples/cookies/emusic_check_downloads.py

Now we’re done, right?

Whew.

Recap and philosophy

Recap

We’ve seen:

I Loading web pages from the network with urllib2

I Parsing web pages (even broken ones)

I Scraping that page into a set of structured Python objects

I HTTP status codes

I Faking the user agent header

I Submitting forms

I Keeping a session with cookies

“Play nice” on the web

I Ignore Terms of Service at your own peril

I robots.txt

I DO NOT BECOME AN EVIL COMMENT SPAMMER

Why scrape the web?

I Anger

I Interoperation with unmaintained systems

I “Rogue interoperability”

Web APIs

Facebook uses standards!

I XMPP chat doesn’t support:

I grouping contacts

I status messages

I large profile images

I notifications

I What’s the point?

“Sorry”

I Ohloh: “Sorry, it is not currently possible to get the list of commits through the API.”

I Flickr: No way to get a user avatar via the API.

I API keys are evidence of submission.

I Where is the love?

I Why even play this game?

Parser redux

Choosing a parser

I Performance

I Ease-of-use

I Quality

I Especially as relates to cleaning broken HTML

I HTML: 1998-style, or 2003-style?

Benchmarks by Ian Bicking

I Benchmarks run by me this morning

I same results as Ian

Ease of use

Tree fixups

I lxml ≈ BeautifulSoup

I lxml ≈ html5lib

I BeautifulSoup 3.0.7 > BeautifulSoup 3.1.0

A winner

I lxml!

I ...?

More about CSS selectors

I FireQuark

I http://www.imdb.com/title/tt0111161/

I h5:contains("Release")

I CSS...

Countermeasures

Easy

Imagine a really stupid bot

Check Referer header

I mechanize solves this

Extra hidden form fields

I mechanize solves this

Requiring cookies

I mechanize solves this

Countermeasures: hard

Per-IP address query limits

Example: Yahoo web search API

I Use more IPs

I Tor, or

I your own machines

I Use SOCKS (plus SSH) to make this easy

CAPTCHAs

Example: Google web search (when you exceed undeclared query limits).

I uh-oh

JavaScript

Example: “Hash cash” system for avoiding comment spam.

I uh-oh

Invisible countermeasures

Behavior profiling

I Time-based?

Inserting a false link visible only to bots

I “Tarpits”

robots.txt access

I As soon as you access it, you lose.

Getting around IP address limits

Understand

I We still have to stay within the limits. We can just take advantage of IPs we do have.

ssh -D

I Borrow the IP of any machine you can log in to

I ssh -D 1080 asheesh.org

socks_monkey

I SOCKSify Python from within Python

I examples/ip-limits/socks_monkey.py

tsocks

I SOCKSify Python via LD_PRELOAD

I examples/ip-limits/tsocks/

tor

“The onion router”

I SOCKSify but borrow someone else’s IP

I (play nice...)

Cycling strategies

I Drain it dry

I easy to implement first

I Round-robin

I generally preferable
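The two strategies differ only in the order in which proxies are handed out. A sketch with made-up proxy addresses (one local port per "ssh -D" tunnel is an assumption) and hypothetical function names:

```python
import itertools

# Hypothetical local SOCKS ports, e.g. one "ssh -D <port> host" each.
PROXIES = ["localhost:1080", "localhost:1081", "localhost:1082"]


def round_robin(proxies):
    """Yield proxies in rotation so the load spreads evenly."""
    return itertools.cycle(proxies)


def drain_it_dry(proxies, quota_per_proxy):
    """Yield each proxy until its quota is used up, then move on."""
    for proxy in proxies:
        for _ in range(quota_per_proxy):
            yield proxy


rr = list(itertools.islice(round_robin(PROXIES), 5))
dry = list(drain_it_dry(PROXIES, quota_per_proxy=2))
print(rr)
print(dry)
```

Round-robin keeps every address well below its limit for as long as possible, which may be one reason it is generally preferable; drain-it-dry needs no shared rotation state, which is what makes it easy to implement first.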

Return to JavaScript: breaking Hash Cash

Detecting its presence

I Attempt to submit a comment with JS disabled

I Attempt to submit a comment with JS enabled

I Trace the second in FireBug

Rewriting the JavaScript as Python

I You may think I’m joking, but this is a common strategy.
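To show what such a port looks like, here is a toy example. The JavaScript in the comment is invented for illustration and is not the actual Hash Cash script; the point is only that a small token function traced in the browser can be transcribed nearly line for line.

```python
# Hypothetical JavaScript found on a page (not the real Hash Cash code):
#
#   function token(seed) {
#       var h = 0;
#       for (var i = 0; i < seed.length; i++) {
#           h = (h * 31 + seed.charCodeAt(i)) % 65536;
#       }
#       return h;
#   }


def token(seed):
    """Python port of the hypothetical JavaScript token() above."""
    h = 0
    for ch in seed:
        # ord() matches charCodeAt() for characters in the Basic
        # Multilingual Plane; other characters would need extra care.
        h = (h * 31 + ord(ch)) % 65536
    return h


print(token("ab"))
```

The port has to be redone whenever the site changes its script, which is the brittleness warned about at the start of the talk.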

DOMForm

I Good news

“DOMForm is a Python module for web scraping and web testing. It knows how to evaluate embedded JavaScript code in response to appropriate events.” – John J. Lee of mechanize

I Bad news

“This module is unmaintained. Maybe someday...” Also, it does not execute page-global JavaScript, which is where HashCash is implemented.

python-spidermonkey

I Good news

I “Python/JavaScript bridge module, making use of Mozilla’s spidermonkey JavaScript implementation.”

I Bad news

I ...do you really want to parse the web page for JavaScript and execute it?

I examples/javascript/hashcash.py

Ick

I None of this is as clean and automated as mechanize.

“Breaking” CAPTCHAs

Fallback: yourself

I Can always just prompt the operator to figure it out and enter it

Mailinator: “Enter these words to delete the email”

I Only so many different images

I So build a look-up table

I ...indexed by URL?

I ...indexed by image contents?

I ...indexed by fuzzy image contents?

(I don’t have a good tool for the last one.)
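The "indexed by image contents" variant can be sketched with a digest of the exact image bytes as the key. Everything here is hypothetical (function names, the stand-in image bytes, the operator callback); it combines the look-up table with the "prompt the operator" fallback from the earlier slide, and it does nothing for the fuzzy case.

```python
import hashlib


def image_key(image_bytes):
    """Index an image by a digest of its exact bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


# Hypothetical table, filled in by a human the first time each image is seen.
known_answers = {}


def solve(image_bytes, ask_operator):
    """Return the stored answer, asking the operator only on a miss."""
    key = image_key(image_bytes)
    if key not in known_answers:
        known_answers[key] = ask_operator(image_bytes)
    return known_answers[key]


# Stand-in bytes and operator callback for demonstration.
calls = []


def fake_operator(image_bytes):
    calls.append(image_bytes)
    return "two words"


first = solve(b"\x89PNG fake image data", fake_operator)
second = solve(b"\x89PNG fake image data", fake_operator)
print(first, second, len(calls))
```

An exact digest breaks as soon as the server re-encodes the image or adds noise, which is where the fuzzy index would be needed.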

Audio captchas: “Simple” signal analysis

I Should be doable in pylab/matplotlib with fast Fourier transforms

JavaScript CAPTCHAs (like reCAPTCHA)

I re-implement CAPTCHA-downloading logic in Python

I ...or execute the JavaScript with spidermonkey

...JDownloader

I “Again, our captcha team did a great job and implemented many new captcha methods.”

The website from Hell: US PTO Public PAIR

http://portal.uspto.gov/external/portal/pair

Start with a CAPTCHA

Solve it and move on to...

I document.write()

The page is invisible.

Automating the web browser

Selenium Remote Control

examples/seleniumrc/start.py

Selenium IDE

I Our friend, XPath

I FireBug

Why don’t we just do this all the time?

I Firefox memory footprint

I Flexibility

Other tricks

Your parser may fail

Text encoding

I Look in the HTTP header!

I Try UTF-8!

I ...chardet, if you must
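That order of attempts can be written down directly. A sketch using only the standard library; the function names are made up, and the final latin-1 fallback is an assumption standing in for chardet, which is a third-party package.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(raw, content_type=None):
    """Decode bytes: declared charset first, then UTF-8, then a fallback."""
    candidates = []
    if content_type:
        declared = charset_from_content_type(content_type)
        if declared:
            candidates.append(declared)
    candidates.append("utf-8")
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset name: try the next candidate.
            continue
    # Last resort: latin-1 never fails; chardet could guess here instead.
    return raw.decode("latin-1")


print(decode_body("café".encode("latin-1"), "text/html; charset=ISO-8859-1"))
print(decode_body("café".encode("utf-8"), "text/html"))
```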

Automatically reverse-engineer templates

I templatemaker by Adrian Holovaty

I everyblock templatemaker

table2dict

I Python bug tracker

Conclusions

Scaling and stability

I Choosing reliable queries from web pages

I Expanding to more IP addresses when necessary using SSH (and Python 2.6 multiprocessing for a plausible model of how to rotate SOCKS proxies)

I Tor (and other proxy considerations)

I registrar.py: was seven years stable...

Summary

I If it’s on a web page, you can scrape it out.

I “Now you have an API for everything.”

Future directions

I More automation

I Using cssselect everywhere, geez it’s cool

Bonus time

If we have time:

I Greasemonkey demo: scraping in the browser

I Audience-suggested scraping lab

I Workshopping on queries or regular expressions
