Don't scrape, Glean!
DESCRIPTION
Lacks the demo part, alas, but it's the slides I used.
TRANSCRIPT
![Page 1: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/1.jpg)
Don’t Scrape, Glean.
Tom Morris
![Page 2: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/2.jpg)
Scraping sucks.
![Page 3: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/3.jpg)
def lastlogin
  date = (@hmodel/"//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
  return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
end
![Page 4: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/4.jpg)
Hpricot for ‘Last login’ date on
MySpace.
![Page 5: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/5.jpg)
try: lastlogin = self.soup.findAll(True, {"width": "193"})[0].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string loginregex = re.compile( r"[0-9]/[0-9]+/[0-9]*") loginregex_inst = loginregex.search(lastlogin) if loginregex_inst is not None: self.lastlogin = loginregex_inst.group() except: pass
![Page 6: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/6.jpg)
Taken from a Python/BeautifulSoup library.
![Page 7: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/7.jpg)
(The Ruby is prettier, but who’s
counting?)
![Page 8: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/8.jpg)
getElementsByClassName("foo")[0].children
![Page 9: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/9.jpg)
It’s an edge case. MySpace’s HTML is
worse than average.
![Page 10: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/10.jpg)
But it is an ugly recipe for mental
turmoil.
![Page 11: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/11.jpg)
The alternative?
![Page 12: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/12.jpg)
flickr.getPhotos()
![Page 13: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/13.jpg)
And you get back nice XML or JSON (or even SOAP!)
![Page 14: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/14.jpg)
But ‘D.R.Y.’! APIs break that
principle.
![Page 15: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/15.jpg)
This is the data equivalent of the
‘accessible version’.
![Page 16: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/16.jpg)
Enter GRDDL.
![Page 17: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/17.jpg)
GRDDL defines a transformation
process for XHTML » RDF.
![Page 18: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/18.jpg)
XHTML? That’s what the
spec says.
![Page 19: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/19.jpg)
HTML 4 works too. Tidy!
![Page 20: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/20.jpg)
RDF? Yes. Trust me. It’s not evil.
![Page 21: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/21.jpg)
GRDDL can work like a data stylesheet
on top of your HTML.
![Page 22: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/22.jpg)
You simply use HTML (or XML) in the normal way...
![Page 23: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/23.jpg)
...and define how the data
is transformed.
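
For the curious, here is roughly how a GRDDL-aware client could find that definition in an XHTML page. The GRDDL spec has the page opt in with a `profile` on `<head>` and point at its transformation with `<link rel="transformation">`. This is only a sketch in Python using the standard library: the sample page, the `example.org` stylesheet URL and the `find_transformations` helper are made up for illustration, and actually running the XSLT it finds is left out.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Profile URI that marks an XHTML page as a GRDDL source document.
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# Hypothetical page; the stylesheet URL is invented for this example.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Links</title>
    <link rel="stylesheet" href="style.css" />
    <link rel="transformation" href="http://example.org/nsfw2rdf.xsl" />
  </head>
  <body><p>Hello</p></body>
</html>"""


def find_transformations(xhtml_text):
    """Return the href of every rel="transformation" link, or [] if the
    page does not declare the GRDDL profile."""
    root = ET.fromstring(xhtml_text)
    head = root.find("{%s}head" % XHTML_NS)
    if head is None:
        return []
    # profile is a space-separated list of URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    hrefs = []
    for link in head.findall("{%s}link" % XHTML_NS):
        # rel is also space-separated, so split before testing.
        rels = link.get("rel", "").split()
        if "transformation" in rels and link.get("href"):
            hrefs.append(link.get("href"))
    return hrefs


if __name__ == "__main__":
    print(find_transformations(PAGE))
```

A real client would then fetch each stylesheet and apply it to the page to get RDF out; the ordinary CSS link sits alongside and is simply ignored.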
![Page 24: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/24.jpg)
You can even use it as a bridge for
existing APIs and services.
![Page 25: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/25.jpg)
Could even be used
for other formats than RDF. Atom?
![Page 26: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/26.jpg)
Simple example: ‘Not Safe For Work’
![Page 28: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/28.jpg)
I can write that. I can’t write xFolk
by hand.
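
The slide with the actual markup is missing from this transcript, so what follows is a guess at the idea, not what was shown. Assume links are marked up with `class="nsfw"`. A GRDDL transformation would normally be an XSLT stylesheet; the Python below is a plain stand-in for it that turns each marked link into one N-Triples statement. The property URI, the sample page and the `glean_nsfw` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Hypothetical vocabulary term, invented for this example.
NSFW_PROPERTY = "http://example.org/vocab#notSafeForWork"

# Assumed markup: an ordinary link with an extra class name.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Links</title></head>
  <body>
    <p><a href="http://example.org/safe">Safe</a></p>
    <p><a class="nsfw" href="http://example.org/racy">Racy</a></p>
    <p><a class="external nsfw" href="http://example.org/worse">Worse</a></p>
  </body>
</html>"""


def glean_nsfw(xhtml_text):
    """Return one N-Triples line per link whose class list contains 'nsfw'."""
    root = ET.fromstring(xhtml_text)
    triples = []
    for a in root.iter("{%s}a" % XHTML_NS):
        classes = a.get("class", "").split()
        href = a.get("href")
        if "nsfw" in classes and href:
            triples.append('<%s> <%s> "true" .' % (href, NSFW_PROPERTY))
    return triples


if __name__ == "__main__":
    for line in glean_nsfw(PAGE):
        print(line)
```

The author only ever types the class name; what that class means as data lives in the transformation.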
![Page 29: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/29.jpg)
Is ‘nsfw’ a good class name? No.
![Page 30: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/30.jpg)
Do I care? No.
![Page 31: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/31.jpg)
The data layer becomes
separated like CSS is from HTML.
![Page 32: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/32.jpg)
That’s the theory. Now for the demo.
![Page 33: Don't scrape, Glean!](https://reader036.vdocuments.site/reader036/viewer/2022081412/5400dca28d7f72c1628b4591/html5/thumbnails/33.jpg)
irc.freenode.net #swig
#swhack