Download - The Bixo Web Mining Toolkit
Bixo - Web Mining Toolkit 23 Sep 2009
Web Mining Toolkit
Ken Krugler
TransPac Software, Inc.
My background - did a startup called Krugle from 2005 - 2008
Used Nutch to do a vertical crawl of the web, looking for technical software
pages.
Mined pages for references to open source projects.
Used experience to create Bixo, an open source web mining toolkit
Built on top of Hadoop, Cascading, Tika.
Web Mining 101
Extracting & Processing Web Data
More Than Just Search
Business intelligence, competitive intelligence,
events, people, companies, popularity, pricing,
social graphs, Twitter feeds, Facebook friends,
support forums, shopping carts…
Quick intro to web mining, so we’re on the same page
Most people think about the big search companies when they think about web
mining.
Search is clearly the biggest web mining category, and generates the most
revenue.
But other types of web mining have value that is high and growing.
This is what Bixo focuses on.
4 Steps in Mining
Collect - fetch content from web
Parse - extract data from formats
Analyze - tokenize, rate, classify, cluster
Produce - an index, a report
Search
Note - does not include serving up the search results
Why do I bring this up? To help clarify why web mining is not the same as
vertical search (next slide)
Vertical Search
Vertical crawl to get specific content
Common use case for Nutch, Heritrix
But web mining often has a different outcome
And specialized processing of data
Most people think of vertical search when they think of specialized web
mining.
Lots of people have been doing this, using OSS like Nutch & Heritrix.
End result is typically a Lucene index, plus the content, inverted links, etc.
Typical web mining is not the same as vertical search.
Often uses a white list, versus crawling to discover links.
More specialized processing of the data.
And these differences help answer the question of (next slide)…
Why Bixo?
Response to needs of commercial projects
– Plug into Cascading-based workflow
– Low IT time/skill requirements
– Run well in AWS EC2 environment
– Flexible I/O support for AWS - S3, HBase
– Toolkit for building custom solutions
• Fetch white list (parse/index, data mine)
• Scrape white list (social popularity)
Does the world really need yet another web crawler?
No, but it does need a web mining toolkit
Two companies agreed to sponsor work on Bixo as an open source project.
On the point of running well in an EC2 environment…
Even though there are many web mining tasks that can be handled on a single
computer,
you very quickly run into issues of scale if you can’t handle upwards of
100M pages.
Bixo Overview
MIT license open source project
In use by three companies
“Pipe” model for building workflows
Runs on top of Hadoop/Cascading
Full disclosure - Bixo makes heavy use of Cascading, which is under GPL.
So if you want to sell a product based on Bixo, you need to talk to Chris
Wensel.
The pipe model comes from our use of Cascading to define the workflows.
What is Cascading
API for Hadoop data processing workflows
Operations on tuples with named fields
Workflows created from pipes
Reduces painful low-level MR details
Key for complex/reliable workflows
I know Chris Wensel has previously talked about Cascading here, but just to
make sure we’re all on the same page…
“tuple” is like a row in a database. Named fields with values.
Example of tuple - result of fetching a page, has URL, time of fetch, content,
headers, response rate, etc.
Because you can build workflows out of a mix of pre-defined & custom pipes,
it’s a real toolkit.
Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels
more like C++ :)
Key aspect of reliable workflows is Cascading’s ability to check your
workflow (the DAG it builds)
Finds cases where fields aren’t available for operations.
Solves a key problem we ran into when customizing Nutch at Krugle
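To make the field-checking idea concrete, here is a minimal sketch in plain Python. It is not the Cascading API: the names Step and check_flow are hypothetical, and a simple linear list of steps stands in for the DAG that Cascading actually plans. It only shows the idea of catching a missing field before any job runs.

```python
from typing import NamedTuple


class Step(NamedTuple):
    """One stage of a workflow (hypothetical stand-in for a pipe operation)."""
    name: str
    needs: frozenset  # fields the step reads from incoming tuples
    adds: frozenset   # fields the step adds to outgoing tuples


def check_flow(source_fields, steps):
    """Fail up front if any step reads a field that nothing earlier provides."""
    available = set(source_fields)
    for step in steps:
        missing = step.needs - available
        if missing:
            raise ValueError(f"{step.name} is missing fields: {sorted(missing)}")
        available |= step.adds
    return available


# A fetch followed by a parse: each step only reads fields that already exist.
flow = [
    Step("fetch", frozenset({"url"}), frozenset({"fetch_time", "content", "headers"})),
    Step("parse", frozenset({"content"}), frozenset({"text"})),
]
fields = check_flow(["url"], flow)
```

Swapping the two steps makes check_flow raise immediately, which is the kind of mistake that otherwise only shows up partway through a long-running job.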
Architecture
This architecture looks nice and squeaky clean - and in general it is.
One issue is with the fetch phase of Bixo not fitting well into the MR model.
External resource constraints mean you can’t treat it like a regular job.
So lots of threads in a special reduce phase, with corresponding issues
-Stack size
-Error handling
HUGMEE
Hadoop
Users who
Generate the
Most
Effective
Emails
Let’s use a real example now of using Bixo to do web mining.
Imagine that the Apache Foundation decided to honor people who make
significant contributions to the Hadoop community.
In a typical company, determining the winner would depend on political
maneuvering, bribes, and sucking up.
But the Apache Foundation could decide to go for a quantitative approach for
the HUGMEE award.
Helpful Hadoopers
Use mailing list archives for data (collect)
Parse mbox files and emails (parse)
Score based on key phrases (analyze)
End result is score/name pair (produce)
How do you figure out the most helpful Hadoopers?
As we discussed previously, it’s a classic web mining problem
Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files.
How do we score based on key phrases (next slide)?
Scoring Algorithm
Very sophisticated point system
“thanks” == 5
“owe you a beer” == 50
“worship the ground you walk on” == 100
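The point system above can be sketched in a few lines of Python. This is an illustration, not the code from the talk: the names PHRASE_POINTS and score_text are hypothetical, and counting every occurrence of a phrase (rather than once per email) is an assumption.

```python
# Points per key phrase, taken from the slide.
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}


def score_text(text):
    """Sum the points for every occurrence of each key phrase (case-insensitive)."""
    lowered = text.lower()
    return sum(points * lowered.count(phrase)
               for phrase, points in PHRASE_POINTS.items())


reply_score = score_text("Thanks! I owe you a beer.")
```

Plain substring matching is deliberately naive here; it would also count "thanks" inside a longer word.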
High Level Steps
Collect emails
– Fetch mod_mbox generated page
– Parse it to extract links to mbox files
– Fetch mbox files
– Split into separate emails
Parse emails
– Extract key headers (messageId, email, etc)
– Parse body to identify quoted text
Parsing the mod_mbox page is simple with Tika’s HtmlParser
Cheated a bit when parsing emails - some users like Owen have many aliases
So hand-generated alias resolution table.
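As a rough sketch of the email parsing step, the following uses Python's standard email package. It is not the Bixo code: the function name parse_email, the ALIASES entries, and the rule that a quoted line starts with ">" are all assumptions for illustration, and multipart messages are simply skipped.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical hand-built alias table: alternate address -> canonical address.
ALIASES = {"alt@example.org": "main@example.org"}


def parse_email(raw):
    """Extract key headers and the non-quoted part of a single-part message."""
    msg = message_from_string(raw)
    _, address = parseaddr(msg.get("From", ""))
    address = ALIASES.get(address.lower(), address.lower())
    # Assumption: plain single-part bodies; multipart messages yield no text.
    body = "" if msg.is_multipart() else msg.get_payload()
    # Assumption: quoted text is any line beginning with ">".
    own_lines = [line for line in body.splitlines()
                 if not line.lstrip().startswith(">")]
    return {
        "message_id": msg.get("Message-ID"),
        "in_reply_to": msg.get("In-Reply-To"),
        "email": address,
        "own_text": "\n".join(own_lines),
    }


raw = (
    "From: Alice <ALT@example.org>\n"
    "Message-ID: <2@x>\n"
    "In-Reply-To: <1@x>\n"
    "\n"
    "> how do I fix this?\n"
    "Try this.\n"
)
parsed = parse_email(raw)
```

For the earlier step of splitting an mbox file into separate emails, the standard library's mailbox.mbox class can iterate over the messages in a file on disk.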
High Level Steps
Analyze emails
– Find key phrases in replies (ignore signoff)
– Score emails by phrases
– Group & sum by message ID
– Group & sum by email address
Produce ranked list
– Toss email addresses with no love
– Sort by summed score
Need to ignore “thanks” in “thanks in advance for doing my job for me”
signoff.
Generate two tuples for each email:
-One with messageId/name/address
-One with reply-to messageId/score
Group/sum aspect is classic reduce operation.
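The group and sum steps can be sketched in plain Python as follows. This is an in-memory illustration, not the Cascading flow from the talk: the function name rank_authors and the dictionary keys are hypothetical, and it assumes each email has already been scored.

```python
from collections import defaultdict


def rank_authors(emails):
    """Credit each reply's score to the author of the message it replies to.

    emails: iterable of dicts with message_id, email, in_reply_to, score.
    """
    author_of = {}                        # message ID -> author address
    score_for_message = defaultdict(int)  # message ID -> summed reply scores
    for e in emails:
        author_of[e["message_id"]] = e["email"]
        if e["in_reply_to"]:
            score_for_message[e["in_reply_to"]] += e["score"]

    # Second group and sum: roll message scores up to the author's address.
    score_for_author = defaultdict(int)
    for message_id, score in score_for_message.items():
        author = author_of.get(message_id)
        if author is not None:
            score_for_author[author] += score

    # Toss addresses with no love, then sort by summed score.
    ranked = [(a, s) for a, s in score_for_author.items() if s > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


emails = [
    {"message_id": "m1", "email": "ted@example.org", "in_reply_to": None, "score": 0},
    {"message_id": "m2", "email": "bob@example.org", "in_reply_to": "m1", "score": 55},
    {"message_id": "m3", "email": "amy@example.org", "in_reply_to": "m1", "score": 5},
    {"message_id": "m4", "email": "ted@example.org", "in_reply_to": "m2", "score": 0},
]
ranking = rank_authors(emails)
```

Note that the score on a reply goes to whoever wrote the message being replied to, not to the person saying thanks.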
Workflow
I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom
Cascading operations, 6 MR jobs.
OK, actually not so clear, but…
Key point is that only purple is stuff that I had to actually create
Some lines are purple as well, since that workflow (DAG) is also something I
defined - see next page.
But only two custom operations actually needed - parsing the mod_mbox page and
calculating the score
Running took about 30 minutes - mostly politely waiting until it was OK to
do another fetch.
Downloaded 150MB of mbox files
409 unique email addresses with at least one positive reply.
Building the Flow
Most of the code needed to create the workflow for this data mining app.
Lots of oatmeal code - which is good. Don’t want to be writing tricky code
here.
Could optimize, but that would be a mistake…most web mining is
programmer-constrained.
So just use more servers in EC2 - cheaper & faster.
mod_mbox Page
Example of the top-level pages that were fetched in first phase.
Then needed to be parsed to extract links to mbox files.
Custom Operation
Example of one of the two custom operations
Parsing mod_mbox page
Uses Tika to extract Ids
Emits tuple with URL for each mbox ID
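A rough Python equivalent of this operation is sketched below. The talk uses Tika; here the standard library's html.parser stands in for it. The names MboxLinkParser and mbox_urls, the example URL, and the assumption that the page links to paths whose first segment ends in ".mbox" are all illustrative guesses rather than facts from the slide.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MboxLinkParser(HTMLParser):
    """Collect unique mbox IDs from anchor tags, in page order."""

    def __init__(self):
        super().__init__()
        self.mbox_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: links look like "200909.mbox/thread"; keep the first segment.
        first_segment = href.split("/", 1)[0]
        if first_segment.endswith(".mbox") and first_segment not in self.mbox_ids:
            self.mbox_ids.append(first_segment)


def mbox_urls(page_url, html):
    """Return one absolute URL per mbox ID found in the page."""
    parser = MboxLinkParser()
    parser.feed(html)
    return [urljoin(page_url, mbox_id) for mbox_id in parser.mbox_ids]


page = (
    '<a href="200909.mbox/thread">Thread</a>'
    '<a href="200909.mbox/date">Date</a>'
    '<a href="200908.mbox/thread">Thread</a>'
    '<a href="/about">About</a>'
)
urls = mbox_urls("http://archives.example.org/mod_mbox/list/", page)
```

Each returned URL corresponds to one emitted tuple, ready to feed the second fetch cycle.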
Validate
Curve looks right - exponential decay.
409 unique email addresses that got some love from somebody.
This Hug’s for Ted!
And the winner is…Ted Dunning
I know - I should have colored the elephant yellow.
Produce
A list of the usual suspects
Coincidentally, Ted helped me derive the scoring algorithm I used…hmm.
Use Bixo to…
Find +/- product comments on forums
Compare web site quality
Track social network popularity
Derive optimized SEO terms
Scrape and analyze pricing data
Previous example could be easily changed to “find opinion makers on forums”
Many other use cases
All involve web mining workflow - fetch, parse, analyze, produce
Summary
Bixo is a web mining toolkit
Built on Hadoop, Cascading, Tika
Young project but used commercially
Future - Mahout, monitoring, HBase, URL
DB, cleanup, bug fixes, rinse, repeat
Lots to be done, of course, but moving fast
Resources
Web: http://bixo.101tec.com
List: http://tech.groups.yahoo.com/group/bixo-dev/
Source: http://github.com/emi/bixo/tree
Bugs: http://oss.101tec.com/jira/browse/bixo
URLs to find out more about the Bixo project.
Stefan Groschupf from 101tec helped with initial Bixo coding.
His company provides infrastructure for project, thus 101tec.com in URLs
above
Any Questions?