Web Scraping with Python

Web Scraping with Python. Miguel Miranda de Mattos (@mmmattos, mmmattos.net). Porto Alegre, Brazil, 2012.


DESCRIPTION

An introduction to crawling sites and extracting content from unstructured data on the web, using the Python programming language and some existing Python modules.

TRANSCRIPT

Page 1: Web Scraping with Python

Web Scraping with Python

Miguel Miranda de Mattos (@mmmattos, mmmattos.net)

Porto Alegre, Brazil.

2012

Page 2: Web Scraping with Python

Web Scraping with Python

● Tools:

○ BeautifulSoup

○ Mechanize

Page 3: Web Scraping with Python

BeautifulSoup

An HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

● In Summary:

○ Navigate the "soup" of HTML/XML tags, programmatically

○ Access tags' properties and values

○ Search for tags and their attributes.
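
A minimal sketch of the three operations above, using a tiny document invented for illustration:

from BeautifulSoup import BeautifulSoup

# A small, made-up document to navigate.
doc = '<html><body><a href="/about" title="About">About us</a></body></html>'
soup = BeautifulSoup(doc)

# Navigate the "soup" programmatically: tags are attributes of their parents.
link = soup.html.body.a

# Access the tag's properties and values (dictionary-style access).
print link.name        # a
print link['href']     # /about

# Search for a tag by name and attributes.
print soup.find('a', attrs={'title': 'About'})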

Page 4: Web Scraping with Python

BeautifulSoup

○ Example:

from BeautifulSoup import BeautifulSoup

doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)
print soup.prettify()

# <html>
#  <h1>
#   Heading
#  </h1>
#  <p>
#   Text
#  </p>
# </html>

Page 5: Web Scraping with Python

BeautifulSoup

○ Searching / Looking for things

■ 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings'

■ findAll

● findAll(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

● Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have (each parameter is sketched below).
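
Each findAll parameter can be exercised on a small fragment; a sketch in which the markup, names, and values are all invented for illustration:

from BeautifulSoup import BeautifulSoup

doc = '<div id="menu"><a href="/a">one</a><a href="/b">two</a><a href="/c">three</a></div>'
soup = BeautifulSoup(doc)

print soup.findAll('a')                          # name: all three links
print soup.findAll('div', attrs={'id': 'menu'})  # attrs: attribute filter
print soup.findAll(text='two')                   # text: match strings, not tags
print soup.findAll('a', limit=2)                 # limit: stop after two matches
print soup.findAll('a', href='/c')               # **kwargs: attributes as keywords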

Page 6: Web Scraping with Python

BeautifulSoup

● Example:

>>> from BeautifulSoup import BeautifulSoup
>>> doc = "<table><tr><td>one</td><td>two</td></tr></table>"
>>> docSoup = BeautifulSoup(doc)
>>> print docSoup.findAll('tr')
[<tr><td>one</td><td>two</td></tr>]

>>> print docSoup.findAll('td')
[<td>one</td>, <td>two</td>]

Page 7: Web Scraping with Python

BeautifulSoup

● findAll (cont'd.):

>>> for t in docSoup.findAll('td'):
...     print t

<td>one</td>
<td>two</td>

>>> for t in docSoup.findAll('td'):
...     print t.getText()

one
two

Page 8: Web Scraping with Python

BeautifulSoup

● findAll using attributes to qualify:

>>> soup.findAll('div', attrs={'class': 'Menus'})
[<div>musicMenu</div>, <div>videoMenu</div>]

● For more options:

○ dir(BeautifulSoup)
○ help(yourSoup.<command>)

● Use BeautifulSoup rather than regexp patterns. Instead of:

patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
re.findall(patFinderTitle, html)

○ prefer:

soup = BeautifulSoup(html)
for tag in soup.findAll('a'):
    print tag['title']
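
To see why the parser beats the pattern, consider markup where the attribute quoting varies; a small sketch (the markup is invented here):

import re
from BeautifulSoup import BeautifulSoup

# The second link uses single quotes, which the pattern above misses.
html = '<a title="First">1</a> <a title=\'Second\'>2</a>'

patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
print patFinderTitle.findall(html)                 # ['First']

soup = BeautifulSoup(html)
print [tag['title'] for tag in soup.findAll('a')]  # [u'First', u'Second']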

Page 9: Web Scraping with Python

Mechanize

● Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.

● mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:

○ any URL can be opened, not just http:

○ mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener() (a configuration sketch follows this list).

● Easy HTML form filling.

● Convenient link parsing and following.

● Browser history (.back() and .reload() methods).

● The Referer HTTP header is added properly (optional).

● Automatic observance of robots.txt.

● Automatic handling of HTTP-Equiv and Refresh.
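
Because these features live on mechanize.UserAgentBase, they can be toggled directly on the Browser; a brief configuration sketch (the settings shown are illustrative choices, not recommended defaults):

import mechanize

br = mechanize.Browser()

# Toggle user-agent features without building a new OpenerDirector.
br.set_handle_robots(False)    # skip robots.txt checks (use responsibly)
br.set_handle_refresh(False)   # ignore HTTP-Equiv/Refresh redirects
br.set_handle_redirect(True)   # still follow ordinary 30x redirects

# Send a custom User-agent header with every request.
br.addheaders = [('User-agent', 'my-scraper/0.1')]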

Page 10: Web Scraping with Python

Mechanize

● Navigation commands:

○ open(url)

○ follow_link(link)

○ back()

○ submit()

○ reload()

● Examples

br = mechanize.Browser()
br.open("http://www.python.org/")
gothtml = br.response().read()
for link in br.links(url_regex="python.org"):
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()
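
The submit() command from the list above pairs with mechanize's form selection; a minimal sketch, assuming the page serves a form named "search" with a text field "q" (both names are assumptions for illustration):

import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/")

# Select a form by name, fill one field, and submit it.
br.select_form(name="search")   # hypothetical form name
br["q"] = "web scraping"        # hypothetical field name
response = br.submit()
print response.geturl()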

Page 11: Web Scraping with Python

Mechanize

● Example:

import re
import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/")

# follow second link with element text matching
# regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)

assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info()  # headers
print response1.read()  # body

Page 12: Web Scraping with Python

Mechanize

● Example: Combining Mechanize and BeautifulSoup

import re
import mechanize
from BeautifulSoup import BeautifulSoup

url = "http://www.hp.com"
br = mechanize.Browser()

br.open(url)
assert br.viewing_html()
html = br.response().read()
soup = BeautifulSoup(html)

found_divs = soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    print d

Page 13: Web Scraping with Python

Mechanize

● Example: Combining Mechanize and BeautifulSoup

import re
import mechanize
from BeautifulSoup import BeautifulSoup

url = "http://www.hp.com"
br = mechanize.Browser()

br.open(url)
assert br.viewing_html()
html = br.response().read()
soup = BeautifulSoup(html)

found_divs = soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    if d.has_key('class'):
        print d['class']
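
The has_key test can also be folded into the query itself; a sketch on made-up markup, filtering by a hypothetical class name:

from BeautifulSoup import BeautifulSoup

html = '<div class="Menus">musicMenu</div><div>plain</div>'
soup = BeautifulSoup(html)

# Let findAll do the class filtering instead of testing has_key:
for d in soup.findAll('div', attrs={'class': 'Menus'}):
    print d['class']   # Menus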