TRANSCRIPT
SCALING THE KNOWLEDGE BASE FOR THE NEVER-ENDING LANGUAGE LEARNER (NELL): A STEP TOWARD LARGE-SCALE COMPUTING FOR AUTOMATED LEARNING
Joel Welling, PSC
4/10/2012
Automated Learning from Text is not a supercomputing problem
But it really should be: simultaneously infer patterns and identify data matching those patterns using large collections of text.
Corpus size is supercomputing-scale (~7TB compressed for the ClueWeb corpus)
Learning algorithms (like clustering and SVD) are well suited to supercomputers.
Largest scale projects typically use Hadoop
NELL: The Never-Ending Language Learner
NELL is a project of the ReadTheWeb group at CMU, led by Tom Mitchell and William Cohen.
Based on 500M web pages (ClueWeb09) and access to the rest of the web via search engine APIs
You can follow NELL on twitter as “cmunell”
NELL's tasks
Extract new instances of categories and relations.
In other words, find noun phrases that represent new examples of the input categories (e.g., "Barack Obama" is a person and politician), and find pairs of noun phrases that correspond to instances of the input relations (e.g., the pair "Jason Giambi" and "Yankees" is an instance of the playsOnTeam relation).
These new instances are added to the growing knowledge base of structured beliefs.
Learn to read better than yesterday. NELL uses a variety of methods to extract beliefs from the web. These are retrained, using the growing knowledge base as a self-supervised collection of training examples. The result is a semi-supervised learning method that couples the training of hundreds of different extraction methods for a wide range of categories and relations. Much of NELL’s current success is due to its algorithm for coupling the simultaneous training of many extraction methods.
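To make the coupling concrete, here is a toy Java sketch of one training cycle: extractors propose candidate beliefs, only candidates that several extractors agree on get promoted into the KB, and every extractor then retrains on the grown KB. All names and thresholds are illustrative assumptions, not NELL's actual code.

import java.util.*;

// Toy sketch of coupled self-training; names are illustrative, not NELL's code.
interface Extractor {
    Map<String, Double> propose(Set<String> kb); // candidate beliefs with confidences
    void retrain(Set<String> kb);                // the KB doubles as training data
}

public class CoupledTraining {
    public static void cycle(List<Extractor> extractors, Set<String> kb) {
        Map<String, Integer> votes = new HashMap<String, Integer>();
        for (Extractor e : extractors)
            for (Map.Entry<String, Double> c : e.propose(kb).entrySet())
                if (c.getValue() > 0.9) {                  // keep only confident candidates
                    Integer v = votes.get(c.getKey());
                    votes.put(c.getKey(), v == null ? 1 : v + 1);
                }
        for (Map.Entry<String, Integer> v : votes.entrySet())
            if (v.getValue() >= 2)                         // coupling: require agreement
                kb.add(v.getKey());
        for (Extractor e : extractors)
            e.retrain(kb);                                 // self-supervised retraining
    }
}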
NELL Learns In Cycles
Each cycle requires pouring the corpus disk-to-disk, applying templates and checking candidates against the currently known KB. Very slow.
We'd like to hold everything in memory and learn templates and facts simultaneously, speeding things up by a factor of 100 or more.
Typical Scaling Work-around
Currently the text corpus gets streamed disk-to-disk and a co-occurrence matrix is built.
Stats are taken from this, but it can't distinguish between “Bill ate the cow” and “The cow ate Bill”.
This is a “bag of words” model. We want to do “deep reading” instead; actually parsing and understanding individual sentences.
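A small Java sketch of the limitation: order-insensitive word-pair counts come out identical for the two sentences above, so who ate whom is lost (class and method names are illustrative).

import java.util.Map;
import java.util.TreeMap;

// A bag-of-words co-occurrence model gives identical statistics
// for "Bill ate the cow" and "The cow ate Bill".
public class CooccurrenceDemo {
    static Map<String, Integer> pairCounts(String sentence) {
        String[] w = sentence.toLowerCase().split("\\s+");
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (int i = 0; i < w.length; i++)
            for (int j = i + 1; j < w.length; j++) {
                String a = w[i], b = w[j];
                // Unordered pair key: (ate, bill) == (bill, ate)
                String key = a.compareTo(b) < 0 ? a + " " + b : b + " " + a;
                Integer v = counts.get(key);
                counts.put(key, v == null ? 1 : v + 1);
            }
        return counts;
    }
    public static void main(String[] args) {
        // Prints: true
        System.out.println(pairCounts("Bill ate the cow")
                .equals(pairCounts("The cow ate Bill")));
    }
}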
Barriers To Scaling NELL
• Java
• Use of web services
• Database implementation of the NELL Knowledge Base (KB)
• Hadoop-based mindset, if they think of scaling at all
• Single threads, with many JVMs working in parallel
Steps to solution
• Learn to run Java stably: there's a trick
• Move to scalable DB: the current project
• Minimize real-time web service interaction
• More parallelism. This is most easily achieved with many threads, so Blacklight is good.
Java GC Threads
On Blacklight, the JVM tries to spawn one garbage collection thread per core, over the whole machine.
To control this:
-XX:ParallelGCThreads=16
(for IBM JVMs, -Xgcthreads16)
Beware of Lucene: a common DB server written in Java; you must arrange to start it up with the same flag.
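For example, a KB service's JVM might be launched like this (the jar name is hypothetical; 16 matches the flag above):

java -XX:ParallelGCThreads=16 -jar kb-service.jar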
Knowledge Base Representations
The KB knows the ontology: what is what. It is referenced continuously as the program runs, maybe across multiple threads.
NELL's KB is described in 3 ways:
• Theo: syntactic, and evolving
• TokyoCabinet representation
• Graph Databases and Neo4J
Why change? Scaling, explicit transactions, natural structure
Theo
Theo is the abstract syntax of the knowledge base. It's still evolving.
Value : { string | number | list | pointer to Entity }
Entity : { PrimitiveEntity | Query | Belief }
PrimitiveEntity : string
Slot : one of a subset of the set of PrimitiveEntities
Query : ( Entity, Slot )
Belief : Query = Value
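A minimal sketch of this abstract syntax as Java types (the class names are assumptions for illustration, not NELL's Theo implementation):

import java.util.List;

// Value : { string | number | list | pointer to Entity }
interface Value {}
// Entity : { PrimitiveEntity | Query | Belief }; an entity can itself be a value
interface Entity extends Value {}
class StringValue implements Value { String s; }
class NumberValue implements Value { double n; }
class ListValue implements Value { List<Value> items; }
class PrimitiveEntity implements Entity { String name; }     // e.g. "Tom"
class Slot extends PrimitiveEntity {}                        // a distinguished subset of primitive entities
class Query implements Entity { Entity entity; Slot slot; }  // e.g. (Tom wife)
class Belief implements Entity { Query query; Value value; } // e.g. (Tom wife) = Sally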
Theo Examples
(Tom wife) = Sally
(Tom wife =Sally haircolor) = blonde
(Tom wife haircolor) = blonde
Some beliefs are currently valid but being dropped:
Tom = blonde (primitive entity treated like a slot)
Note the difference!
TokyoCabinet
TokyoCabinet is a fast, light key-value pair database. It was used for the first KB representation because it was handy.
Theo examples become DB key-value pairs:
Key                         Value
Tom wife                    Sally
Tom wife =Sally haircolor   blonde
Tom wife haircolor          blonde
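A minimal sketch of that mapping with the TokyoCabinet Java binding (the database file name is arbitrary; assuming the standard tokyocabinet.HDB hash-database API):

import tokyocabinet.HDB;

public class TheoToTokyoCabinet {
    public static void main(String[] args) {
        HDB hdb = new HDB();
        // Open (or create) a hash database file for the KB.
        if (!hdb.open("kb.tch", HDB.OWRITER | HDB.OCREAT))
            throw new RuntimeException("could not open kb.tch");
        // The Theo examples above, flattened to key-value pairs.
        hdb.put("Tom wife", "Sally");
        hdb.put("Tom wife =Sally haircolor", "blonde");
        hdb.put("Tom wife haircolor", "blonde");
        System.out.println(hdb.get("Tom wife")); // prints: Sally
        hdb.close();
    }
}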
Graph Databases
A number exist. Neo4J was chosen for:
• Compact
• Flexible
• Not specialized for web apps
• Matches common structure, like TinkerPop Blueprints
• Popular
There are other options; substitution would be easy.
Neo4J Elements
(diagram: two Node boxes joined by a Relationship, each with attached Properties)
• Nodes and relationships can be inserted into indices.
• Given a node, get its relationships and properties by name.
• Given a relationship, get either node.
• Properties are just scalars or arrays, accessed by name.
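A minimal sketch of these operations with the embedded 1.x Java API of the period (database path, property keys, and index name are illustrative assumptions, not NELL's code):

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class Neo4jKbSketch {
    public static void main(String[] args) {
        GraphDatabaseService db = new EmbeddedGraphDatabase("/tmp/kb-graph");
        Transaction tx = db.beginTx();  // Neo4J requires explicit transactions
        try {
            Node tom = db.createNode();
            tom.setProperty("name", "Tom");            // properties are scalars or arrays
            Node sally = db.createNode();
            sally.setProperty("name", "Sally");
            Relationship wife = tom.createRelationshipTo(
                    sally, DynamicRelationshipType.withName("wife")); // identified by name
            db.index().forNodes("entities").add(tom, "name", "Tom"); // nodes go into indices
            // Given a relationship, get either node:
            System.out.println(wife.getEndNode().getProperty("name")); // prints: Sally
            tx.success();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
}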
Problems
Ambiguity: the earlier version doesn't make a clear distinction between strings and entities.
Ambiguity: Should a value consisting of a list in TokyoCabinet get mapped to one graph element or many?
Meanwhile: The ReadTheWeb group is adding 'contexts', the time intervals in which things happen. The Neo4J representation will need to be updated to support this.
Neo4J Reference node
(diagram: the reference node links to an 'Entity' node with value "hillary" and to 'Slot' nodes; relationships are identified by name)
Example beliefs stored in the graph:
"billclinton dog =rover haircolor =blonde according to =tom"
"billclinton dog haircolor =gray according to =sam"
One arc in the diagram can only exist because of the ambiguity between entities and slots; the syntax needs to indicate when two such nodes need to merge.
Status
The graph database project is feeding back into the rest of the software base, helping to inform new design changes.
But we're about a month behind:
• The target is changing
• Effect of ambiguities was not anticipated
An implementation exists. The next step is timing tests.
Questions?