TRANSCRIPT
SCALING THE KNOWLEDGE BASE FOR THE NEVER-ENDING LANGUAGE LEARNER (NELL): A STEP TOWARD LARGE-SCALE COMPUTING FOR AUTOMATED LEARNING
Joel Welling, PSC
4/10/2012
Automated Learning from Text is not a supercomputing problem
But it really should be: simultaneously infer patterns and identify data matching those patterns using large collections of text.
Corpus size is supercomputing-scale (~7TB compressed for the ClueWeb corpus)
Learning algorithms (like clustering and SVD) are well suited to supercomputers.
Largest scale projects typically use Hadoop
NELL: The Never-Ending Language Learner
NELL is a project of the ReadTheWeb group at CMU, led by Tom Mitchell and William Cohen.
Based on 500M web pages (ClueWeb09) and access to the rest of the web via search engine APIs
You can follow NELL on twitter as “cmunell”
NELL's tasks
Extract new instances of categories and relations.
In other words, find noun phrases that represent new examples of the input categories (e.g., "Barack Obama" is a person and politician), and find pairs of noun phrases that correspond to instances of the input relations (e.g., the pair "Jason Giambi" and "Yankees" is an instance of the playsOnTeam relation).
These new instances are added to the growing knowledge base of structured beliefs.
Learn to read better than yesterday. NELL uses a variety of methods to extract beliefs from the web. These are retrained, using the growing knowledge base as a self-supervised collection of training examples. The result is a semi-supervised learning method that couples the training of hundreds of different extraction methods for a wide range of categories and relations. Much of NELL’s current success is due to its algorithm for coupling the simultaneous training of many extraction methods.
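To make the coupling concrete, here is a toy Java sketch of one training cycle: extractors propose candidate beliefs, only candidates that several extractors agree on get promoted into the KB, and every extractor then retrains on the grown KB. All names and thresholds are illustrative assumptions, not NELL's actual code.

import java.util.*;

// Toy sketch of coupled self-training; names are illustrative, not NELL's code.
interface Extractor {
    Map<String, Double> propose(Set<String> kb); // candidate beliefs with confidences
    void retrain(Set<String> kb);                // the KB doubles as training data
}

public class CoupledTraining {
    public static void cycle(List<Extractor> extractors, Set<String> kb) {
        Map<String, Integer> votes = new HashMap<String, Integer>();
        for (Extractor e : extractors)
            for (Map.Entry<String, Double> c : e.propose(kb).entrySet())
                if (c.getValue() > 0.9) {                  // keep only confident candidates
                    Integer v = votes.get(c.getKey());
                    votes.put(c.getKey(), v == null ? 1 : v + 1);
                }
        for (Map.Entry<String, Integer> v : votes.entrySet())
            if (v.getValue() >= 2)                         // coupling: require agreement
                kb.add(v.getKey());
        for (Extractor e : extractors)
            e.retrain(kb);                                 // self-supervised retraining
    }
}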
NELL Learns In Cycles
Each cycle requires pouring the corpus disk-to-disk, applying templates and checking candidates against the currently known KB. Very slow.
We'd like to hold everything in memory and learn templates and facts simultaneously, speeding things up by a factor of 100 or more.
Typical Scaling Work-around
Currently the text corpus gets streamed disk-to-disk and a co-occurrence matrix is built.
Stats are taken from this, but it can't distinguish between “Bill ate the cow” and “The cow ate Bill”.
This is a “bag of words” model. We want to do “deep reading” instead; actually parsing and understanding individual sentences.
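A small Java sketch of the limitation: order-insensitive word-pair counts come out identical for the two sentences above, so who ate whom is lost (class and method names are illustrative).

import java.util.Map;
import java.util.TreeMap;

// A bag-of-words co-occurrence model gives identical statistics
// for "Bill ate the cow" and "The cow ate Bill".
public class CooccurrenceDemo {
    static Map<String, Integer> pairCounts(String sentence) {
        String[] w = sentence.toLowerCase().split("\\s+");
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (int i = 0; i < w.length; i++)
            for (int j = i + 1; j < w.length; j++) {
                String a = w[i], b = w[j];
                // Unordered pair key: (ate, bill) == (bill, ate)
                String key = a.compareTo(b) < 0 ? a + " " + b : b + " " + a;
                Integer v = counts.get(key);
                counts.put(key, v == null ? 1 : v + 1);
            }
        return counts;
    }
    public static void main(String[] args) {
        // Prints: true
        System.out.println(pairCounts("Bill ate the cow")
                .equals(pairCounts("The cow ate Bill")));
    }
}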
Barriers To Scaling NELL
• Java
• Use of web services
• Database implementation of the NELL Knowledge Base (KB)
• Hadoop-based mindset, if they think of scaling at all
• Single threads, with many JVMs working in parallel
Steps to solution
• Learn to run Java stably: there's a trick
• Move to scalable DB: the current project
• Minimize real-time web service interaction
• More parallelism. This is most easily achieved with many threads, so Blacklight is good.
Java GC Threads
On Blacklight, the JVM tries to spawn one garbage collection thread per core, over the whole machine.
To control this:
-XX:ParallelGCThreads=16
(for IBM JVMs, -Xgcthreads16)
Beware of Lucene: a common DB server written in Java; you must arrange to start it up with the same flag.
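For example, a KB service's JVM might be launched like this (the jar name is hypothetical; 16 matches the flag above):

java -XX:ParallelGCThreads=16 -jar kb-service.jar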
Knowledge Base Representations
The KB knows the ontology: what is what. It is referenced continuously as the program runs, maybe across multiple threads.
NELL's KB is described in 3 ways:
• Theo: syntactic, and evolving
• TokyoCabinet representation
• Graph Databases and Neo4J
Why change? Scaling, explicit transactions, natural structure
Theo
Theo is the abstract syntax of the knowledge base. It's still evolving.
Value : { string | number | list | pointer to Entity }
Entity : { PrimitiveEntity | Query | Belief }
PrimitiveEntity : string
Slot : one of a subset of the set of PrimitiveEntities
Query : ( Entity, Slot )
Belief : Query = Value
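A minimal sketch of this abstract syntax as Java types (the class names are assumptions for illustration, not NELL's Theo implementation):

import java.util.List;

// Value : { string | number | list | pointer to Entity }
interface Value {}
// Entity : { PrimitiveEntity | Query | Belief }; an entity can itself be a value
interface Entity extends Value {}
class StringValue implements Value { String s; }
class NumberValue implements Value { double n; }
class ListValue implements Value { List<Value> items; }
class PrimitiveEntity implements Entity { String name; }     // e.g. "Tom"
class Slot extends PrimitiveEntity {}                        // a distinguished subset of primitive entities
class Query implements Entity { Entity entity; Slot slot; }  // e.g. (Tom wife)
class Belief implements Entity { Query query; Value value; } // e.g. (Tom wife) = Sally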
Theo Examples
(Tom wife) = Sally
(Tom wife =Sally haircolor) = blonde
(Tom wife haircolor) = blonde
Some beliefs are currently valid but being dropped:
Tom = blonde (primitive entity treated like a slot)
Note the difference!
TokyoCabinet
TokyoCabinet is a fast, light key-value pair database. It was used for the first KB representation because it was handy.
Theo examples become DB key-value pairs:
Key                         Value
Tom wife                    Sally
Tom wife =Sally haircolor   blonde
Tom wife haircolor          blonde
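A minimal sketch of that mapping with the TokyoCabinet Java binding (the database file name is arbitrary; assuming the standard tokyocabinet.HDB hash-database API):

import tokyocabinet.HDB;

public class TheoToTokyoCabinet {
    public static void main(String[] args) {
        HDB hdb = new HDB();
        // Open (or create) a hash database file for the KB.
        if (!hdb.open("kb.tch", HDB.OWRITER | HDB.OCREAT))
            throw new RuntimeException("could not open kb.tch");
        // The Theo examples above, flattened to key-value pairs.
        hdb.put("Tom wife", "Sally");
        hdb.put("Tom wife =Sally haircolor", "blonde");
        hdb.put("Tom wife haircolor", "blonde");
        System.out.println(hdb.get("Tom wife")); // prints: Sally
        hdb.close();
    }
}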
Graph Databases
A number exist. Neo4J was chosen for:
• Compact
• Flexible
• Not specialized for web apps
• Matches common structure, like TinkerPop Blueprints
• Popular
There are other options; substitution would be easy.
Neo4J Elements
(diagram: two Node boxes joined by a Relationship, each with attached Properties)
• Nodes and relationships can be inserted into indices.
• Given a node, get its relationships and properties by name.
• Given a relationship, get either node.
• Properties are just scalars or arrays, accessed by name.
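A minimal sketch of these operations with the embedded 1.x Java API of the period (database path, property keys, and index name are illustrative assumptions, not NELL's code):

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class Neo4jKbSketch {
    public static void main(String[] args) {
        GraphDatabaseService db = new EmbeddedGraphDatabase("/tmp/kb-graph");
        Transaction tx = db.beginTx();  // Neo4J requires explicit transactions
        try {
            Node tom = db.createNode();
            tom.setProperty("name", "Tom");            // properties are scalars or arrays
            Node sally = db.createNode();
            sally.setProperty("name", "Sally");
            Relationship wife = tom.createRelationshipTo(
                    sally, DynamicRelationshipType.withName("wife")); // identified by name
            db.index().forNodes("entities").add(tom, "name", "Tom"); // nodes go into indices
            // Given a relationship, get either node:
            System.out.println(wife.getEndNode().getProperty("name")); // prints: Sally
            tx.success();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
}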
Problems
Ambiguity: the earlier version doesn't make a clear distinction between strings and entities.
Ambiguity: Should a value consisting of a list in TokyoCabinet get mapped to one graph element or many?
Meanwhile: The ReadTheWeb group is adding 'contexts', the time intervals in which things happen. The Neo4J representation will need to be updated to support this.
Neo4J Reference node
(diagram: the reference node links to an 'Entity' node with value "hillary" and to 'Slot' nodes; relationships are identified by name)
Example beliefs stored in the graph:
"billclinton dog =rover haircolor =blonde according to =tom"
"billclinton dog haircolor =gray according to =sam"
One arc in the diagram can only exist because of the ambiguity between entities and slots; the syntax needs to indicate when two such nodes need to merge.
Status
The graph database project is feeding back into the rest of the software base, helping to inform new design changes.
But we're about a month behind:
• The target is changing
• Effect of ambiguities was not anticipated
An implementation exists. The next step is timing tests.
Questions?