Web Scraping with Python


DESCRIPTION

Introduction to crawling sites and extracting content from unstructured data on the web, using the Python programming language and some existing Python modules.

TRANSCRIPT

Web Scraping with Python

Miguel Miranda de Mattos:@mmmattos - mmmattos.net

Porto Alegre, Brazil.

2012

Web Scraping with Python

● Tools:

○ BeautifulSoup

○ Mechanize

BeautifulSoup

An HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

● In Summary:

○ Navigate the "soup" of HTML/XML tags, programmatically

○ Access tags' properties and values (see the sketch after the example below)

○ Search for tags and their attributes.

BeautifulSoup

○ Example:

from BeautifulSoup import BeautifulSoup

doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)
print soup.prettify()

# <html>
#  <h1>
#   Heading
#  </h1>
#  <p>
#   Text
#  </p>
# </html>
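The summary bullets in practice: a minimal sketch (the HTML snippet is invented for illustration) of navigating the tree by tag name and reading tag attributes dictionary-style:

from BeautifulSoup import BeautifulSoup

doc = '<html><h1 id="top">Heading</h1><p><a href="/next">More</a></p></html>'
soup = BeautifulSoup(doc)
print soup.html.h1      # navigate by tag name: <h1 id="top">Heading</h1>
print soup.h1['id']     # dictionary-style attribute access: top
print soup.a['href']    # /next
print soup.h1.string    # text content: Heading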

BeautifulSoup

○ Searching / Looking for things

■ 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild',
  'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings',
  'findParent', 'findParents', 'findPrevious', 'findPreviousSibling',
  'findPreviousSiblings'

■ findAll

● findAll(self, name=None, attrs={}, recursive=True,
          text=None, limit=None, **kwargs)

● Extracts a list of Tag objects that match the given
  criteria. You can specify the name of the Tag and any
  attributes you want the Tag to have.
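A minimal sketch of those parameters (the markup is invented for illustration): attrs filters on attribute values, limit caps the number of matches, and text matches text nodes instead of tags.

>>> from BeautifulSoup import BeautifulSoup
>>> doc = "<p class='a'>one</p><p class='b'>two</p><p class='a'>three</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.findAll('p', attrs={'class': 'a'})
[<p class="a">one</p>, <p class="a">three</p>]
>>> soup.findAll('p', limit=1)
[<p class="a">one</p>]
>>> soup.findAll(text='two')
[u'two']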

BeautifulSoup

● Example:

>>> from BeautifulSoup import BeautifulSoup
>>> doc = "<table><tr><td>one</td><td>two</td></tr></table>"
>>> docSoup = BeautifulSoup(doc)
>>> print docSoup.findAll('tr')
[<tr><td>one</td><td>two</td></tr>]

>>> print docSoup.findAll('td')
[<td>one</td>, <td>two</td>]

BeautifulSoup

● findAll (cont'd.):

>>> for t in docSoup.findAll('td'):
...     print t

<td>one</td>
<td>two</td>

>>> for t in docSoup.findAll('td'):
...     print t.getText()

one
two

BeautifulSoup

● findAll using attributes to qualify:

>>> soup.findAll('div', attrs={'class': 'Menus'})
[<div class="Menus">musicMenu</div>, <div class="Menus">videoMenu</div>]

● For more options:

○ dir(BeautifulSoup)
○ help(yourSoup.<command>)

● Use BeautifulSoup rather than regexp patterns:

patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
re.findall(patFinderTitle, html)

○ by:

soup = BeautifulSoup(html)
for tag in soup.findAll('a'):
    print tag['title']
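A self-contained sketch of the comparison above (the html snippet is invented for illustration); both approaches pull the same titles, but the soup version does not tie the match to markup details like attribute order or quoting:

import re
from BeautifulSoup import BeautifulSoup

html = '<a title="Home" href="/">x</a><a href="/b" title="Blog">y</a>'

# regexp version: coupled to the exact spacing and quoting of the markup
patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
print re.findall(patFinderTitle, html)               # ['Home', 'Blog']

# soup version: works from the parse tree, not the raw text
soup = BeautifulSoup(html)
print [tag['title'] for tag in soup.findAll('a')]    # [u'Home', u'Blog']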

Mechanize

● Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.

● mechanize.Browser and mechanize.UserAgentBase implement the interface of
  urllib2.OpenerDirector, so:

○ any URL can be opened, not just http:

○ mechanize.UserAgentBase offers easy dynamic configuration of user-agent
  features like protocol, cookie, redirection and robots.txt handling,
  without having to make a new OpenerDirector each time, e.g. by calling
  build_opener().
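A minimal sketch of that dynamic configuration (the User-agent string is an invented example):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)     # don't observe robots.txt
br.set_handle_redirect(True)    # follow HTTP 30x redirects
br.set_handle_refresh(False)    # ignore HTTP-Equiv/Refresh
br.addheaders = [('User-agent', 'my-scraper/0.1')]  # invented UA string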

● Easy HTML form filling (see the sketch after this list).
● Convenient link parsing and following.
● Browser history (.back() and .reload() methods).
● The Referer HTTP header is added properly (optional).
● Automatic observance of robots.txt.
● Automatic handling of HTTP-Equiv and Refresh.
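Form filling in a minimal sketch (the URL, form name and field name are invented for illustration):

import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/search")  # invented URL
br.select_form(name="search")             # invented form name
br["q"] = "python"                        # invented field name
response = br.submit()
print response.read()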

Mechanize

● Navigation commands:

○ open(url)

○ follow_link(link)

○ back()

○ submit()

○ reload()

● Examples

br = mechanize.Browser()
br.open("http://www.python.org/")
gothtml = br.response().read()
for link in br.links(url_regex="python.org"):
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()

Mechanize

● Example:

import re
import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/")

# follow second link with element text matching
# regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)

assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info()  # headers
print response1.read()  # body

Mechanize

● Example: Combining Mechanize and BeautifulSoup

import re
import mechanize
from BeautifulSoup import BeautifulSoup

url = "http://www.hp.com"
br = mechanize.Browser()

br.open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)

found_divs = result_soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    print d

Mechanize

● Example: Combining Mechanize and BeautifulSoup

import re
import mechanize
from BeautifulSoup import BeautifulSoup

url = "http://www.hp.com"
br = mechanize.Browser()

br.open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)

found_divs = result_soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    if d.has_key('class'):
        print d['class']
