Using Graph Theory to Understand Intent & Concepts - Neo4j User Group (January 2013)
Post on 20-Aug-2015
UNDERSTANDING INTENT & CONCEPTS
tumra.com
• Use case:
  - Enhancing Social TV user experience
  - Matching users to content that interests them
• Topics we’ll cover:
  - Natural Language Processing
  - Graph Theory
  - Machine Learning
USE CASE ENHANCED SOCIAL TV
• Objectives:
  - Increase engagement with content
  - Enhance multi-channel user experience
• We built a prototype solution:
  - Mines unstructured data in real-time
  - Understands:
    - What interests individual users
    - Entities & Concepts (People, Places, Events)
Help users to “follow the story” regardless of the news outlet, integrated with web / second-screen
THE CHALLENGE
Photo Credit: byrion on Flickr (cc)
THE PROBLEM
• Little useful data to work with…
  - Streams of continuous live TV
  - Have to create metadata
• Where did we start?
  - Ingest several live news channels
  - Extract whatever data was available:
    - In-video text using OCR
    - Subtitles / Closed Captions
We used a simple N-Gram model for exact matches; then Apache Lucene for everything else…
STEP 1 NAMED ENTITY RECOGNITION
EXAMPLE N.E.R.
“David Cameron and the German Chancellor Angela Merkel meet to discuss the debt crisis and signal their approval for greater eurozone integration.”
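A minimal sketch of the "simple N-Gram model for exact matches" idea, applied to the example sentence. The toy knowledge base, the function names and the longest-match-first rule are all assumptions made for illustration; the slides do not describe the actual implementation, and the Apache Lucene fallback for inexact matches is not shown.

```python
# Hypothetical toy knowledge base: lower-cased surface form -> entity type.
KB = {
    "david cameron": "Person",
    "angela merkel": "Person",
    "german chancellor": "Concept",
    "debt": "Concept",
    "eurozone": "Place",
}

def ngrams(tokens, n):
    """Yield (start_index, n-gram string) for every run of n tokens."""
    for i in range(len(tokens) - n + 1):
        yield i, " ".join(tokens[i:i + n])

def exact_matches(text, kb, max_n=3):
    """Return (surface form, type) pairs in text order, longest n-grams first."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    used = set()   # token positions already claimed by a longer match
    found = []
    for n in range(max_n, 0, -1):
        for i, gram in ngrams(tokens, n):
            span = range(i, i + n)
            if gram in kb and not used.intersection(span):
                used.update(span)
                found.append((i, gram, kb[gram]))
    return [(gram, kind) for _, gram, kind in sorted(found)]

sentence = ("David Cameron and the German Chancellor Angela Merkel meet to "
            "discuss the debt crisis and signal their approval for greater "
            "eurozone integration.")
matches = exact_matches(sentence, KB)
for surface, kind in matches:
    print(f"{surface} -> {kind}")
```

Exact matching finds the surface forms, but it cannot say which "David Cameron" is meant; that is the disambiguation problem the following slides address.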
DISAMBIGUATION
• Which “David Cameron”?
  - We have many in our Knowledgebase
  - Sportsmen, actors, painters & characters…
• Our initial simplistic approach was naïve
  - Works great with unambiguous matches
  - Best-case returns top-scoring entity
• We needed a smarter approach
RECAP
• We have an effectively ‘flat’ KB of Entities
  - “David Cameron” -> Politician (Person)
  - “Angela Merkel” -> Politician (Person)
  - “German Chancellor” -> Political office (Concept)
  - “Debt” -> Economic concept (Concept)
  - “Eurozone” -> Economic area (Place)
• We needed a way to find relationships between Entities
THE BIG IDEA
Graphs allow us to store relationships between entities, and graph algorithms allow us to interrogate those connections…
GRAPH DATABASES
Apache Giraph
Neo4j
GraphLab
Golden Orb
… of course there are many more open-source & proprietary ones
We had 250 million Nodes, and 4 billion Edges… great initial results but horrendously inefficient!
Example: “David Cameron” & “Angela Merkel”
STEP 2 BUILDING RELATIONSHIPS
INITIAL IMPROVEMENTS
• We didn’t need everything… just:
  - People: “David Cameron”, “Angela Merkel”
  - Places: “London”, “Downing Street”, “Eurozone”
  - Concepts: “Debt”, “President”, “Eurozone”
  - Things: Companies, Products etc.
• Pruned the graph using Map/Reduce
• This reduced the number of Entities…
  - … but we still had billions of connections
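The pruning step can be pictured as a map step that drops edges touching unwanted node types and a reduce step that regroups what survives. The sketch below runs in plain Python on a handful of made-up nodes; the node names, the type labels and the function names are hypothetical, and the real job ran as Map/Reduce over a far larger graph whose details the slides do not give.

```python
from collections import defaultdict

# Hypothetical node types and edges, for illustration only.
NODE_TYPE = {
    "David Cameron": "Person",
    "Angela Merkel": "Person",
    "London": "Place",
    "Debt": "Concept",
    "Unrelated Node": "Other",
}
EDGES = [
    ("David Cameron", "London"),
    ("David Cameron", "Unrelated Node"),
    ("Angela Merkel", "Debt"),
    ("Unrelated Node", "London"),
]
KEEP = {"Person", "Place", "Concept", "Thing"}

def map_edge(edge):
    """Map step: emit (source, target) only if both ends are wanted types."""
    src, dst = edge
    if NODE_TYPE.get(src) in KEEP and NODE_TYPE.get(dst) in KEEP:
        yield src, dst

def reduce_adjacency(pairs):
    """Reduce step: group the surviving targets by source node."""
    adjacency = defaultdict(set)
    for src, dst in pairs:
        adjacency[src].add(dst)
    return dict(adjacency)

pruned = reduce_adjacency(pair for edge in EDGES for pair in map_edge(edge))
print(pruned)
```

Dropping a node type removes every edge attached to it, but edges between the kept types all survive, which is consistent with still having billions of connections after pruning.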
EXAMPLE PEOPLE, PLACES, CONCEPTS
“David Cameron and the German Chancellor Angela Merkel meet to discuss the debt crisis and signal their approval for greater eurozone integration.”
People Concepts Places
DISAMBIGUATION
[Diagram: four “David Cameron” candidates (footballer, actor, politician, painter) and “Angela Merkel”, shown alongside the nodes “Politician”, “Head of State” and “Living Person”]
Possibilities: shortest path, number of common connections etc.
Sure, all that extra metadata was tasty, but we didn’t need it all to solve the use-case…
So we used Map/Reduce to count the common connections
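Counting common connections amounts to intersecting the neighbour sets of each candidate and of the context entity. The sketch below does this in plain Python rather than Map/Reduce. The neighbour sets are invented so that the counts come out as 3, 1, 1 and 0; which candidate carries which count in the original diagram is an assumption.

```python
# Hypothetical neighbour sets; the real graph is far richer than this.
NEIGHBOURS = {
    "Angela Merkel": {"Politician", "Head of State", "Living Person"},
    "David Cameron (politician)": {"Politician", "Head of State", "Living Person"},
    "David Cameron (footballer)": {"Footballer", "Living Person"},
    "David Cameron (actor)": {"Actor", "Living Person"},
    "David Cameron (painter)": {"Painter"},
}

def common_connections(a, b, neighbours):
    """Number of nodes directly connected to both a and b."""
    return len(neighbours[a] & neighbours[b])

def count_candidates(candidates, context, neighbours):
    """Map each candidate entity to its count of connections shared with context."""
    return {c: common_connections(c, context, neighbours) for c in candidates}

candidates = [name for name in NEIGHBOURS if name.startswith("David Cameron")]
counts = count_candidates(candidates, "Angela Merkel", NEIGHBOURS)
best = max(counts, key=counts.get)
print(counts)
print("best match:", best)
```

With these toy sets the politician shares the most connections with “Angela Merkel”, so it is the preferred reading of the mention.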
STEP 3 SIMPLIFYING THE GRAPH
SIMPLIFIED
[Diagram: the four “David Cameron” candidates (footballer, actor, politician, painter) and “Angela Merkel”, with edges labelled by number of common connections: 3, 1, 1]
Whoa … that looks a lot like a Least Cost Routing problem
LEAST COST PATH
[Diagram: the four “David Cameron” candidates (footballer, actor, politician, painter) and “Angela Merkel”, with edge costs 1/3, 1/1, 1/1]
cost = 1 / number of common connections
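Once each edge has a cost of 1 / (number of common connections), picking the right candidate becomes a cheapest-path search. Below is a small sketch using Dijkstra's algorithm with Python's `heapq`. The "mention" start node, the zero-cost edges from the mention to each candidate and the assignment of counts to candidates are assumptions for illustration; the slides do not say which routing algorithm was used in practice.

```python
import heapq

def dijkstra(graph, source, target):
    """Return (total cost, path) of the cheapest route, or (inf, []) if none."""
    queue = [(0.0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in settled:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return float("inf"), []

# Assumed counts of common connections with "Angela Merkel" per candidate.
COMMON = {
    "David Cameron (politician)": 3,
    "David Cameron (footballer)": 1,
    "David Cameron (actor)": 1,
    "David Cameron (painter)": 0,
}

MENTION = "mention: David Cameron"   # hypothetical start node
graph = {MENTION: {}}
for candidate, shared in COMMON.items():
    if shared == 0:
        continue                      # no common connections, so no edge
    graph[MENTION][candidate] = 0.0   # choosing a candidate is free
    graph[candidate] = {"Angela Merkel": 1.0 / shared}

cost, path = dijkstra(graph, MENTION, "Angela Merkel")
print(cost, path)
```

Taking the reciprocal turns "more shared connections" into "cheaper edge", so the strongest relationship lies on the least-cost path.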
RECAP
• Graphs allow us to interrogate relationships
  - Disambiguate when faced with multiple possibilities
  - Infer more about the context of what’s happening
• Went through iterations of improvements
  - Kept our Entity data in NoSQL = TBs
  - Used the Graph as an index of sorts = GBs
• Neo4j was a great fit for our needs
Some queries were taking ‘seconds’ and we needed to go a lot faster because TV won’t wait for us …
Do we really need to check the Graph every time?
STEP 4 MAKING IT WORK REAL-TIME
ENTER MACHINE LEARNING
• We can use simple predictors to estimate the likelihood of Entities occurring
  - i.e. every time we’ve looked for “David Cameron” in the past the best match was the Politician
• Keeping a ‘probabilistic context’ of recent Entities allows us to detect shifts in topics
  - Works especially well on News channels
  - Reduces the demand on Graph lookups
Looks complicated, but it’s basically just counting & division
BAYES THEOREM
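A sketch of the "counting & division" part: estimate P(entity | surface form) as the fraction of past lookups that resolved to that entity, and only fall back to the graph when the estimate is not confident enough. The class name, the 0.8 threshold and the observation counts are hypothetical, and the ‘probabilistic context’ of recent entities is not modelled here.

```python
from collections import Counter, defaultdict

class EntityPredictor:
    """Relative-frequency estimate of which entity a surface form refers to."""

    def __init__(self):
        # surface form -> Counter of entities it resolved to in the past
        self.history = defaultdict(Counter)

    def observe(self, surface, entity):
        """Record one resolved lookup (for example, after a graph query)."""
        self.history[surface][entity] += 1

    def predict(self, surface):
        """Return (most likely entity, estimated probability)."""
        counts = self.history.get(surface)
        if not counts:
            return None, 0.0
        entity, hits = counts.most_common(1)[0]
        return entity, hits / sum(counts.values())   # counting & division

predictor = EntityPredictor()
for _ in range(9):
    predictor.observe("David Cameron", "politician")
predictor.observe("David Cameron", "footballer")

entity, prob = predictor.predict("David Cameron")
THRESHOLD = 0.8                        # hypothetical confidence cut-off
needs_graph_lookup = prob < THRESHOLD
print(entity, prob, needs_graph_lookup)
```

Every confident prediction is a graph query avoided, which is where the latency saving would come from.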
Photo Credit: mattbuck007 on Flickr (cc)
We solved the problem for English, but what about other languages?
STEP 5 MAKING IT WORK WORLDWIDE
LANGUAGE
• Our core Entities of ‘People’, ‘Places’, & ‘Concepts’ are language agnostic…
• We needed a way to ditch ‘language’ and jump straight to entities…
  - The colour ‘Red’ means the same thing whether you call it ‘Rot’, ‘Rouge’ or ‘赤’
• Again, Graphs could solve the problem
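One way to picture "jumping straight to entities": per-language label nodes that all point at a single language-agnostic entity node. The sketch flattens that idea into a dictionary of label-to-entity edges. The entity identifier, the language codes and the function name are hypothetical, and nothing here reflects the actual schema.

```python
# Hypothetical label -> entity edges, keyed by (language code, label).
LABEL_TO_ENTITY = {
    ("en", "red"): "concept:red",
    ("de", "rot"): "concept:red",
    ("fr", "rouge"): "concept:red",
    ("ja", "赤"): "concept:red",
}

def resolve(label, lang):
    """Follow the edge from a language-specific label to its entity, if any."""
    return LABEL_TO_ENTITY.get((lang, label.casefold()))

resolved = {resolve("Red", "en"), resolve("Rot", "de"),
            resolve("Rouge", "fr"), resolve("赤", "ja")}
print(resolved)
```

Everything downstream (disambiguation, prediction) can then work on entity identifiers and never needs to know which language the text arrived in.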
Typical response time ~30ms … relevancy improves over time and learns new entities ‘online’
PROBLEM SOLVED
FINAL SOLUTION
[Diagram: from “Unstructured Data” to “Awesomeness!”, with components Neo4j, NER, NoSQL, Disambiguation, Language Model, Machine Learning]
“TUMRA” is a transliteration of the Sanskrit word for “BIG”; we thought it was a great name … (and the .COM was available)
ABOUT US
• We’ve built a product…
  - Our ‘Digital Marketing Optimization’ platform improves conversion rates & customer satisfaction for eCommerce & Marketing campaigns
  - Launches Q1 2013
• What else do we do?
  - ‘Big Data’ & ‘Data Science’ professional services
  - Bespoke prototype & solution development
THANKS FOR LISTENING
TUMRA You?
We’re hiring! Data Scientists & Developers