The Bixo Web Mining Toolkit

Post on 11-Nov-2014








Ken Krugler's talk at the Hadoop User Group (HUG) on the Bixo Web Mining Toolkit


Bixo - Web Mining Toolkit 23 Sep 2009


Web Mining Toolkit

Ken Krugler

TransPac Software, Inc.

My background - did a startup called Krugle from 2005 - 2008

Used Nutch to do a vertical crawl of the web, looking for technical software


Mined pages for references to open source projects.

Used experience to create Bixo, an open source web mining toolkit

Built on top of Hadoop, Cascading, Tika.



Web Mining 101

Extracting & Processing Web Data

More Than Just Search

Business intelligence, competitive intelligence,

events, people, companies, popularity, pricing,

social graphs, Twitter feeds, Facebook friends,

support forums, shopping carts…

Quick intro to web mining, so we’re on the same page

Most people think about the big search companies when they think about web


Search is clearly the biggest web mining category, and generates the most


But other types of web mining have value that is high and growing.

This is what Bixo focuses on.



4 Steps in Mining

Collect - fetch content from web

Parse - extract data from formats

Analyze - tokenize, rate, classify, cluster

Produce - an index, a report


Note - does not include serving up the search results

Why do I bring this up? To help clarify why web mining is not the same as

vertical search (next slide)



Vertical Search

Vertical crawl to get specific content

Common use case for Nutch, Heritrix

But web mining often has different outcome

And specialized processing of data

Most people think of vertical search when they think of specialized web


Lots of people have been doing this, using OSS like Nutch & Heritrix.

End result is typically a Lucene index, plus the content, inverted links, etc.

Typical web mining is not the same as vertical search.

Often uses a white list, versus crawling to discover links.

More specialized processing of the data.

And these differences help answer the question of (next slide)…



Why Bixo?

Response to needs of commercial projects

– Plug into Cascading-based workflow

– Low IT time/skill requirements

– Run well in AWS EC2 environment

– Flexible I/O support for AWS - S3, HBase

– Toolkit for building custom solutions

• Fetch white list (parse/index, data mine)

• Scrape white list (social popularity)

Does the world really need yet another web crawler?

No, but it does need a web mining toolkit

Two companies agreed to sponsor work on Bixo as an open source project.

On the point of running well in an EC2 environment…

Even though there are many web mining tasks that can be handled on a single


You very quickly run into issues of scale if you can’t handle upwards of

100M+ pages.



Bixo Overview

MIT license open source project

In use by three companies

“Pipe” model for building workflows

Runs on top of Hadoop/Cascading

Full disclosure - Bixo makes heavy use of Cascading, which is under GPL.

So if you want to sell a product based on Bixo, you need to talk to Chris


The pipe model comes from our use of Cascading to define the workflows.



What is Cascading

API for Hadoop data processing workflows

Operations on tuples with named fields

Workflows created from pipes

Reduces painful low-level MR details

Key for complex/reliable workflows

I know Chris Wensel has previously talked about Cascading here, but just to

make sure we’re all on the same page…

“tuple” is like a row in a database. Named fields with values.

Example of tuple - result of fetching a page, has URL, time of fetch, content,

headers, response rate, etc.
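To make the tuple idea concrete, here is a minimal Python sketch of a "fetched page" tuple. It is only an illustration of named fields with values, not Cascading's Java API; the field names, the units, and the `select` helper are assumptions.

```python
from typing import NamedTuple


class FetchedPage(NamedTuple):
    """One tuple: named fields with values, like a row in a database."""
    url: str
    fetch_time: int        # epoch milliseconds (assumed unit)
    content: bytes
    headers: dict
    response_rate: float   # bytes per second (assumed unit)


def select(tup, *names):
    """Project a tuple down to the named fields (hypothetical helper)."""
    return {name: getattr(tup, name) for name in names}


page = FetchedPage(
    url="http://example.com/",
    fetch_time=1253664000000,
    content=b"<html></html>",
    headers={"Content-Type": "text/html"},
    response_rate=20480.0,
)

# Downstream operations ask for fields by name, not by position.
print(select(page, "url", "fetch_time"))
```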

Because you can build workflows out of a mix of pre-defined & custom pipes,

it’s a real toolkit.

Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels

more like C++ :)

Key aspect of reliable workflows is Cascading’s ability to check your

workflow (the DAG it builds)

Finds cases where fields aren’t available for operations.

Solves a key problem we ran into when customizing Nutch at Krugle
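A rough sketch of what "checking the workflow" means, in Python. Cascading plans a full DAG in Java; this simplified version walks a linear chain of operations and reports any operation that needs a field nobody has produced yet. The `Operation` type and `check_workflow` function are hypothetical names for illustration.

```python
from typing import NamedTuple


class Operation(NamedTuple):
    name: str
    needs: frozenset   # fields the operation reads
    adds: frozenset    # fields the operation emits


def check_workflow(source_fields, operations):
    """Walk the chain before running it and return a list of field errors."""
    available = set(source_fields)
    errors = []
    for op in operations:
        missing = op.needs - available
        if missing:
            errors.append(f"{op.name}: missing {sorted(missing)}")
        available |= op.adds
    return errors


operations = [
    Operation("fetch", frozenset({"url"}), frozenset({"content", "headers"})),
    Operation("parse", frozenset({"content"}), frozenset({"text"})),
    # Deliberate mistake: nothing upstream produces a "body" field.
    Operation("score", frozenset({"body"}), frozenset({"score"})),
]

errors = check_workflow({"url"}, operations)
print(errors)
```

The point is that the error shows up when the workflow is assembled, not hours into a job.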




This architecture looks nice and squeaky clean - and in general it is.

One issue is that the fetch phase of Bixo doesn't fit well into the MR model.

External resource constraints mean you can’t treat it like a regular job.

So lots of threads in a special reduce phase, with corresponding issues

-Stack size

-Error handling
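To show why fetching is thread-heavy and needs explicit error handling, here is a small Python sketch. It is not Bixo's implementation: the per-host grouping, the fixed delay, and the names `fetch_politely` and `fake_fetch` are all assumptions. Each host is fetched sequentially with a pause between requests, different hosts run on separate threads, and a failed fetch becomes a result instead of killing the job.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def fetch_politely(urls, fetch, delay_seconds=1.0, max_threads=10):
    """Fetch URLs with one thread per host and a delay between same-host requests."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc].append(url)

    def fetch_host(host_urls):
        results = []
        for index, url in enumerate(host_urls):
            if index > 0:
                time.sleep(delay_seconds)  # wait before hitting the same host again
            try:
                results.append((url, fetch(url), None))
            except Exception as error:
                # Record the failure and keep going with the rest of the host's URLs.
                results.append((url, None, error))
        return results

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        per_host = list(pool.map(fetch_host, by_host.values()))
    return [result for host_results in per_host for result in host_results]


def fake_fetch(url):
    """Stand-in for a real HTTP fetch so the sketch runs without a network."""
    if url.endswith("/broken"):
        raise OSError("simulated failure")
    return f"content of {url}"


results = fetch_politely(
    ["http://a.example/1", "http://a.example/broken", "http://b.example/1"],
    fake_fetch,
    delay_seconds=0.0,
)
```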





[Slide graphic; surviving text: "Users who Generate the"]




Let’s use a real example now of using Bixo to do web mining.

Imagine that the Apache Foundation decided to honor people who make

significant contributions to the Hadoop community.

In a typical company, determining the winner would depend on political

maneuvering, bribes, and sucking up.

But the Apache Foundation could decide to go for a quantitative approach for

the HUGMEE award.



Helpful Hadoopers

Use mailing list archives for data (collect)

Parse mbox files and emails (parse)

Score based on key phrases (analyze)

End result is score/name pair (produce)

How do you figure out the most helpful Hadoopers?

As we discussed previously, it’s a classic web mining problem

Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files.

How do we score based on key phrases (next slide)?



Scoring Algorithm

Very sophisticated point system

“thanks” == 5

“owe you a beer” == 50

“worship the ground you walk on” == 100
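The point system is simple enough to sketch in a few lines of Python. The point values come from the slide; everything else is an assumption: case-insensitive substring matching, each phrase counted at most once per message, and the `score_text` name. The "thanks in advance" exclusion reflects the note later in the talk.

```python
# Point values from the slide.
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}

# "thanks in advance" is a request, not gratitude, so it is blanked out first.
IGNORED_PHRASES = ["thanks in advance"]


def score_text(text):
    """Sum the points for every key phrase found in the text (assumed matching rules)."""
    lowered = text.lower()
    for phrase in IGNORED_PHRASES:
        lowered = lowered.replace(phrase, " ")
    return sum(points for phrase, points in PHRASE_POINTS.items() if phrase in lowered)


print(score_text("Thanks, I owe you a beer!"))
print(score_text("Thanks in advance for doing my job for me"))
```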



High Level Steps

Collect emails

– Fetch mod_mbox generated page

– Parse it to extract links to mbox files

– Fetch mbox files

– Split into separate emails

Parse emails

– Extract key headers (messageId, email, etc)

– Parse body to identify quoted text

Parsing the mod_mbox page is simple with Tika’s HtmlParser

Cheated a bit when parsing emails - some users like Owen have many aliases

So hand-generated alias resolution table.
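A sketch of the email parsing step using Python's standard `email` module rather than the Java code used in the talk. Assumptions: messages are plain text (not multipart), quoted text is any line starting with ">", and the alias table entries and the `parse_email` name are made up for illustration.

```python
from email import message_from_string
from email.utils import parseaddr

# Hand-built alias table: alternate address -> canonical address (hypothetical entries).
ALIASES = {"alice@home.example": "alice@example.org"}


def parse_email(raw):
    """Extract key headers and the non-quoted part of the body from one message."""
    msg = message_from_string(raw)
    name, address = parseaddr(msg.get("From", ""))
    address = address.lower()
    address = ALIASES.get(address, address)
    body = msg.get_payload()  # a str for plain-text, non-multipart messages
    own_lines = [
        line for line in body.splitlines()
        if not line.lstrip().startswith(">")  # drop quoted text
    ]
    return {
        "message_id": msg.get("Message-ID"),
        "in_reply_to": msg.get("In-Reply-To"),
        "name": name,
        "address": address,
        "reply_text": "\n".join(own_lines).strip(),
    }


raw_email = (
    "From: Alice A <alice@home.example>\n"
    "Message-ID: <2@example.org>\n"
    "In-Reply-To: <1@example.org>\n"
    "Subject: Re: help\n"
    "\n"
    "> How do I set this up?\n"
    "Try the config file.\n"
)

parsed = parse_email(raw_email)
```

Only `reply_text` should be scored, so that phrases in the quoted original aren't credited twice.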



High Level Steps

Analyze emails

– Find key phrases in replies (ignore signoff)

– Score emails by phrases

– Group & sum by message ID

– Group & sum by email address

Produce ranked list

– Toss email addresses with no love

– Sort by summed score

Need to ignore “thanks” in “thanks in advance for doing my job for me”


Generate two tuples for each email:

-one with messageId/name/address

-one with reply-to messageId/score

Group/sum aspect is classic reduce operation.
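The two-tuple trick and the group/sum steps can be sketched in plain Python. This runs in memory rather than as MR jobs, and the `rank` function and the sample data are assumptions; the steps follow the slide: sum scores by message ID, sum by email address, toss addresses with no score, sort by total.

```python
from collections import defaultdict


def rank(identity_tuples, score_tuples):
    """Join scores to authors via message ID, then rank addresses by total score."""
    # Group & sum by message ID: points earned by each original message.
    score_by_message = defaultdict(int)
    for reply_to_id, score in score_tuples:
        score_by_message[reply_to_id] += score

    # Group & sum by email address: credit each message's points to its author.
    score_by_address = defaultdict(int)
    for message_id, _name, address in identity_tuples:
        score_by_address[address] += score_by_message.get(message_id, 0)

    # Toss addresses with no love, then sort by summed score.
    ranked = [(address, total) for address, total in score_by_address.items() if total > 0]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# One tuple per email: messageId / name / address.
identities = [
    ("<1>", "Ann", "ann@example.org"),
    ("<2>", "Bob", "bob@example.org"),
    ("<3>", "Ann", "ann@example.org"),
    ("<4>", "Cy", "cy@example.org"),
]
# One tuple per reply: reply-to messageId / score.
scores = [("<1>", 5), ("<1>", 50), ("<2>", 5), ("<3>", 100)]

ranking = rank(identities, scores)
print(ranking)
```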




I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom

Cascading operations, 6 MR jobs.

OK, actually not so clear, but…

Key point is that only purple is stuff that I had to actually create

Some lines are purple as well, since that workflow (DAG) is also something I

defined - see next page.

But only two custom operations actually needed - parsing the mod_mbox page and

calculating score

Running took about 30 minutes - mostly politely waiting until it was OK to

do another fetch.

Downloaded 150MB of mbox files

409 unique email addresses with at least one positive reply.



Building the Flow

The slide shows most of the code needed to create the workflow for this data mining app.

Lots of oatmeal code - which is good. Don’t want to be writing tricky code


Could optimize, but that would be a mistake…most web mining is


So just use more servers in EC2 - cheaper & faster.



mod_mbox Page

Example of the top-level pages that were fetched in first phase.

Then needed to be parsed to extract links to mbox files.



Custom Operation

Example of one of the two custom operations

Parsing mod_mbox page

Uses Tika to extract IDs

Emits tuple with URL for each mbox ID
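The talk's operation is Java and uses Tika; as a stand-in, here is the same idea with Python's built-in `html.parser`. The link shape is an assumption (relative hrefs such as `200909.mbox/thread`, where the first path segment is the mbox ID), and the base URL and the `MboxLinkParser` name are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MboxLinkParser(HTMLParser):
    """Collect one URL per mbox ID found in the page's links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.mbox_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        mbox_id = href.split("/")[0]  # assumed: ID is the first path segment
        if mbox_id.endswith(".mbox"):
            url = urljoin(self.base_url, mbox_id)
            if url not in self.mbox_urls:  # several links share one mbox ID
                self.mbox_urls.append(url)


page_html = (
    '<a href="200909.mbox/thread">Thread</a>'
    '<a href="200909.mbox/date">Date</a>'
    '<a href="200908.mbox/thread">Thread</a>'
    '<a href="/">Home</a>'
)

parser = MboxLinkParser("http://mail-archives.example.org/mod_mbox/hadoop-user/")
parser.feed(page_html)
print(parser.mbox_urls)
```

Each collected URL would become one output tuple feeding the second fetch cycle.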




Curve looks right - exponential decay.

409 unique email addresses that got some love from somebody.



This Hug’s for Ted!

And the winner is…Ted Dunning

I know - I should have colored the elephant yellow.




A list of the usual suspects

Coincidentally, Ted helped me derive the scoring algorithm I used…hmm.



Use Bixo to…

Find +/- product comments on forums

Compare web site quality

Track social network popularity

Derive optimized SEO terms

Scrape and analyze pricing data

Previous example could be easily changed to “find opinion makers on forums”

Many other use cases

All involve web mining workflow - fetch, parse, analyze, produce




Bixo is a web mining toolkit

Built on Hadoop, Cascading, Tika

Young project but used commercially

Future - Mahout, monitoring, HBase, URL

DB, cleanup, bug fixes, rinse, repeat

Lots to be done, of course, but moving fast








URLs to find out more about the Bixo project.

Stefan Groschupf from 101tec helped with initial Bixo coding.

His company provides infrastructure for the project, which is why it appears in the URLs.




Any Questions?
