AnHai Doan, University of Wisconsin, @WalmartLabs
Social Media, Data Integration, and Human Computation
@WalmartLabs
Background
Professor at University of Wisconsin-Madison
In 2010 took unpaid leave and joined Kosmix
– Bay Area startup, did semantic analysis of social media
Acquired by Walmart in 2011, became WalmartLabs
– Based in San Bruno, local office in India, hundreds of people
Why did Walmart buy a social-media startup?
– Wanted to catch up with Amazon (<10B online vs. >35B for Amazon)
– Major problems if it doesn’t get close in 10 years (see Borders)
– Kosmix/WalmartLabs helps in many ways
  – Provides a core of technical people, helps attract more
  – Improves traditional e-commerce
  – Builds the e-commerce of the future: Social + Local + Mobile
Major R&D Groups at WalmartLabs
Search and Products
Polaris
Giant product catalog
Product intelligence
Demand Generation
SEO, SEM
Customer targeting and personalization
Social, Mobile, and Local E-Commerce
Mining social data
Stores + Mobile
Build social/mobile apps (Get on the Shelf, gift recommendation, etc.)
Special Initiatives
Big Fast Data
Large-scale Machine Learning
Data Extraction & Integration
Crowdsourcing
Social Genome
Social Genome
Mine everything we can out of social data
– From tweets, FB feeds, Foursquare, blogs, etc.
– Mine users, organizations, products, sentiments, events, etc.
Connect them to those in the traditional Web world
Put them into a giant knowledge base
– Big, evolves rapidly over time
– Call this “social genome”
Use social genome to power multiple e-commerce applications
– Search
– Product intelligence
– Gift recommendation
– Personalized “Groupon”
– Etc.
Social Genome
[Figure: the social genome as a graph. A taxonomy rooted at "all" covers people (actors: Angelina Jolie, Mel Gibson), places (Tahrir, Cairo, Egypt), events (celebrities, sports, politics, …; e.g. Gibson car crash, Egyptian uprising), Twitter users (@melgibson, @dsmith, …), and FB users (mel-gibson, davesmith, …). Edge labels: the-same-as, tweet-about, related-to, located-in, capital-of. Sample tweets: "@dsmith: Mel crashed. Maserati is gone." and "@far213: Tahrir is packed!"]
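One way to picture the knowledge base in the diagram is as a set of labeled edges between nodes. The sketch below is only an assumption about representation, since the slides do not say how the social genome is stored. The `is-a` label and the helpers `build_index` and `neighbors` are hypothetical; the other edge labels come from the figure, and which nodes each edge connects is one plausible reading of it.

```python
from collections import defaultdict

# (subject, edge label, object) triples; one plausible reading of the figure.
TRIPLES = [
    ("Mel Gibson", "is-a", "actors"),          # "is-a" is a hypothetical label
    ("actors", "is-a", "people"),
    ("@melgibson", "the-same-as", "Mel Gibson"),
    ("@dsmith", "tweet-about", "Gibson car crash"),
    ("Gibson car crash", "related-to", "Mel Gibson"),
    ("Tahrir", "located-in", "Cairo"),
    ("Cairo", "capital-of", "Egypt"),
]

def build_index(triples):
    """Index triples by subject so outgoing edges can be looked up quickly."""
    index = defaultdict(list)
    for subj, label, obj in triples:
        index[subj].append((label, obj))
    return index

def neighbors(index, node, label):
    """Return the objects reachable from `node` over edges with `label`."""
    return [obj for lbl, obj in index.get(node, []) if lbl == label]

index = build_index(TRIPLES)
print(neighbors(index, "Tahrir", "located-in"))  # ['Cairo']
```

A triple list keeps social nodes (Twitter users, tweets) and traditional nodes (actors, places) in one structure, which is the point of connecting the two worlds.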
Building Social Genome: Three Sample Challenges
[Figure: the social genome diagram (taxonomy of people, places, events, Twitter users, and FB users; edges the-same-as, tweet-about, related-to, located-in, capital-of; the sample tweets from @dsmith and @far213), with three areas marked 1, 2, and 3 as the sample challenges.]
Extraction and Disambiguation: Traditional Methods Ill Suited for Social Media
[Figure: taxonomy (all; people: actors, professors; places; events: celebrities, sports, politics, …) with nodes Angelina Jolie, Mel Gibson, Mel Brocks, Gibson car crash, and Egyptian uprising. Two pipelines are shown, each ending in a Disambiguation step: one whose Extraction step uses rule-based / NLP / machine learning techniques, and one whose Extraction step uses dictionaries.]
Sample inputs
– @dsmith: mel crashed. maserati is gone.
– Mel was arrested again. What a dramatic fall since his Oscar-winning day.
Contexts shown in the figure
– Long-term, Web context: actor, movie, Oscar, Hollywood
– Short-term, social context: crash, car, Maserati
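The dictionary-based pipeline on this slide can be sketched roughly as follows: look up each token of a tweet in a dictionary of known surface forms, then pick the candidate whose context words overlap the tweet most. This is a minimal illustration, not the WalmartLabs implementation. The dictionary entries, the overlap score, and the context words for Mel Brocks are assumptions; the context words for Mel Gibson are the ones listed on the slide.

```python
import re

# Surface form -> candidate entities. Entries are illustrative assumptions.
DICTIONARY = {
    "mel": ["Mel Gibson", "Mel Brocks"],
    "maserati": ["Maserati"],
}

# Context words per entity. Mel Gibson's come from the slide; the words for
# Mel Brocks and Maserati are hypothetical placeholders.
CONTEXT = {
    "Mel Gibson": {"actor", "movie", "oscar", "hollywood",  # long-term, Web
                   "crash", "car", "maserati"},             # short-term, social
    "Mel Brocks": {"professor", "lecture"},
    "Maserati": {"car"},
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def extract(text):
    """Dictionary-based extraction: (token, candidates) for each known token."""
    return [(tok, DICTIONARY[tok]) for tok in tokenize(text) if tok in DICTIONARY]

def disambiguate(text):
    """For each mention, pick the candidate whose context overlaps the text most."""
    words = set(tokenize(text))
    result = {}
    for tok, candidates in extract(text):
        result[tok] = max(candidates, key=lambda c: len(CONTEXT[c] & words))
    return result

print(disambiguate("@dsmith: mel crashed. maserati is gone."))
# {'mel': 'Mel Gibson', 'maserati': 'Maserati'}
```

The lowercase, ungrammatical tweet gives NLP-style extractors little to work with, while a dictionary lookup still finds "mel"; the short-term social context word "maserati" is what tips the choice toward Mel Gibson here.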
Must Maintain a Highly Dynamic Social Genome
[Figure: the taxonomy with nodes such as Angelina Jolie, Mel Gibson, Mel Brocks, Gibson car crash, and Egyptian uprising, annotated with two kinds of context.]
– Long-term, Web context: actor, movie, Oscar, Hollywood
– Short-term, social context: crash, car, Maserati
Latency less than 2 seconds
Maintained using a fast-data processing system
The Giant Traditional Taxonomy is the Secret Weapon
Without it, dictionary-based extraction is not possible
Provides a framework to
– “understand” social media, find related concepts, “hang” social contexts
Very hard to develop, takes years
– Integrate data from multiple sources, like learning a foreign language
Partly explains why it was hard for others to catch up
To integrate social media, must integrate traditional data well, then bootstrap
[Figure: fragment of the taxonomy: all; people, actors (Angelina Jolie, Mel Gibson); places (Tahrir, Cairo, Egypt), with edges located-in and capital-of.]
Context is also Absolutely Critical
Alice tweets “Go Giants!”
– Entity extraction: SF Giants or NY Giants?
– Context/disambiguation: Alice lives in NYC
– Result: NY Giants
Bob tweets “Go Giants!”
– Entity extraction: SF Giants or NY Giants?
– Context/disambiguation: Bob likes Buster Posey (SF Giants player)
– Result: SF Giants
Charlie tweets “Go Giants!”
– Entity extraction: SF Giants or NY Giants?
– Context/disambiguation: Charlie tweeted on Feb 4th, the day before the Super Bowl (event); the Web is talking about the NY Giants
– Result: NY Giants
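The three "Go Giants!" cases can be sketched as one resolver that counts how many context signals point to each candidate. This is a toy under stated assumptions: the lookup tables, the function `resolve`, and the vote-counting rule are hypothetical, and the year 2012 is assumed because the slide gives only "Feb 4th".

```python
from datetime import date

CANDIDATES = {"giants": ["SF Giants", "NY Giants"]}

# Hypothetical signal tables tying a piece of context to a candidate.
HOME_CITY = {"NYC": "NY Giants", "San Francisco": "SF Giants"}
PLAYER_TEAM = {"Buster Posey": "SF Giants"}
# What the Web is talking about on a given day (year is an assumption).
TRENDING = {date(2012, 2, 4): "NY Giants"}

def resolve(mention, city=None, likes=(), tweeted_on=None):
    """Score each candidate by counting the context signals that point to it."""
    scores = {c: 0 for c in CANDIDATES[mention]}
    signals = [HOME_CITY.get(city), TRENDING.get(tweeted_on)]
    signals += [PLAYER_TEAM.get(p) for p in likes]
    for team in signals:
        if team in scores:
            scores[team] += 1
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    # With no signal, or a tie, leave the mention unresolved.
    return winners[0] if best > 0 and len(winners) == 1 else None

print(resolve("giants", city="NYC"))                     # NY Giants (Alice)
print(resolve("giants", likes=["Buster Posey"]))         # SF Giants (Bob)
print(resolve("giants", tweeted_on=date(2012, 2, 4)))    # NY Giants (Charlie)
```

Returning `None` when nothing points either way keeps the resolver from guessing, which matters when the same tweet text maps to different entities for different users.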
Building Social Genome: Three Sample Challenges
[Figure: repeat of the social genome diagram with the three challenge areas marked 1, 2, and 3.]
Event Detection: Current Solutions
• Lots of current work in academia / industry
• Limitations of most of the current solutions
– exploit just one kind of heuristic
  • e.g., find hot, trending, popular words (Egypt, revolt)
– do not exploit crowdsourcing
– do not scale
[Figure: sources (Twitter, Foursquare, Facebook, Myspace, Flickr, …) feed event detection, which populates the events branch of the taxonomy (celebrities, sports, politics, …; e.g. Gibson car crash, Egyptian uprising).]
Event Detection: Our Solution
[Figure: sources (Twitter, Foursquare) feed Detector 1, Detector 2, …, Detector n. Each detector emits candidate events to an event evaluator and ranker, which outputs ranked events. Crowdsourcing populations 1, 2, 3, ... are also part of the pipeline.]
Muppet, a platform to process fast data over multiple machines
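The architecture in this figure (several detectors, each emitting candidate events, followed by an evaluator and ranker) might be sketched as below. Everything concrete here is an assumption made for illustration: the two detectors, their thresholds, the length-based word filter, and the way crowd votes are added to scores are hypothetical, since the slides do not describe the actual detectors or how crowdsourcing feeds in.

```python
from collections import Counter

def trending_word_detector(tweets, min_count=2):
    """Detector 1 (hypothetical): words that appear in at least `min_count` tweets."""
    counts = Counter(w for t in tweets for w in set(t.lower().split()))
    # Crude stand-in for stop-word removal: keep only words longer than 4 chars.
    return {w: n for w, n in counts.items() if n >= min_count and len(w) > 4}

def checkin_spike_detector(checkins, min_count=2):
    """Detector 2 (hypothetical): places with at least `min_count` check-ins."""
    counts = Counter(place.lower() for place in checkins)
    return {p: n for p, n in counts.items() if n >= min_count}

def evaluate_and_rank(candidate_sets, crowd_votes=None):
    """Merge candidates, sum detector scores plus optional crowd votes, and sort."""
    scores = Counter()
    for candidates in candidate_sets:
        scores.update(candidates)
    for event, votes in (crowd_votes or {}).items():
        if event in scores:
            scores[event] += votes
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

tweets = ["Tahrir is packed!", "heading to tahrir now", "egypt protest", "egypt news"]
checkins = ["Tahrir", "Tahrir", "Cairo"]
ranked = evaluate_and_rank(
    [trending_word_detector(tweets), checkin_spike_detector(checkins)],
    crowd_votes={"egypt": 1},
)
print(ranked)  # [('tahrir', 4), ('egypt', 3)]
```

Keeping each detector as an independent function that returns scored candidates is what lets the ranker combine more than one kind of heuristic, which is the limitation the previous slide attributes to most existing solutions.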
Processing Fast Data
Big data management is well known by now
– use MapReduce implementations
– simple programming model, widespread adoption
But a lot of fast data is also emerging
– 150 M tweets / day, 1 billion FB shares / day, 3 M Foursquare check-ins / day
– come into the system as very fast streams
Numerous applications over these streams
Need to process in real time
– to answer “what is happening now?”
Processing Fast Data
What we want: a platform that
– delivers real-time processing (over multiple machines)
– is highly scalable (as the data gets faster and faster)
– has simple programming model
  – so developers can quickly write hundreds of apps
  – ideally like map-reduce, which developers already know
– has real-time query and storage capability
  – apps can query content in real-time
  – distributed across multiple machines
Answer: Muppet, like Map-Reduce, but for fast data
– see “MapReduce-Style Processing of Fast Data”, VLDB-12
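To make "like Map-Reduce, but for fast data" concrete, here is a single-process toy in that spirit: a map step turns each incoming event into key/value pairs, and an update step folds each value into running per-key state that can be read at any moment. The names `map_fn`, `update_fn`, and `run`, and the hashtag-counting task, are illustrative assumptions and not Muppet's actual API; the real system distributes this work over multiple machines.

```python
from collections import defaultdict

def map_fn(event):
    """Map step: turn one incoming tweet into (key, value) pairs, one per hashtag."""
    for word in event["text"].split():
        if word.startswith("#"):
            yield word.lower(), 1

def update_fn(state, key, value):
    """Update step: fold one value into the running per-key state."""
    state[key] += value

def run(stream):
    """Single-process stand-in for the platform: map, then update, per event."""
    state = defaultdict(int)  # queryable at any time, not only at the end
    for event in stream:
        for key, value in map_fn(event):
            update_fn(state, key, value)
        yield dict(state)     # snapshot of the state after each event

stream = [{"text": "Go #Giants"}, {"text": "#giants win #SuperBowl"}]
snapshots = list(run(stream))
print(snapshots[-1])  # {'#giants': 2, '#superbowl': 1}
```

Unlike a batch job, there is no final reduce over a complete input: the state is always current, which is what allows an application to ask "what is happening now?".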
Using the Social Genome
Gift recommendation:
– “I love salt!”
– “Your friend has just tweeted about the movie SALT. Would you like to buy something related for her birthday?”
Using the Social Genome
Search query expansion
– “Advil” → “advil headache cramp”
Personalized “Groupon” with vendors
– “You seem to be interested in gourmet coffee. If 50 people sign up to buy the new DeLonghi coffee maker, you can get it for a 50% discount.”
Stocking a local store
– Lots of people in Mountain View are interested in outdoor sports
– Stock up local Walmart store with related products
A Siri-like shopping assistant
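The query-expansion example on this slide can be sketched as a lookup of related concepts appended to the user's query. The `RELATED` table and the function `expand_query` are hypothetical; in the talk's setting the related terms would presumably come from the social genome rather than a hard-coded dictionary.

```python
# Hypothetical related-concept table, standing in for a knowledge-base lookup.
RELATED = {"advil": ["headache", "cramp"]}

def expand_query(query, related=RELATED, max_terms=3):
    """Append up to `max_terms` related terms per query word, without duplicates."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for extra in related.get(term, [])[:max_terms]:
            if extra not in expanded:
                expanded.append(extra)
    return " ".join(expanded)

print(expand_query("Advil"))  # advil headache cramp
```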
Wrapping Up
The future of e-commerce: social, mobile, and local
Retailers must increasingly be data / Web players
Social media is important for e-commerce
Integrating social data is fundamentally much harder than integrating “traditional” data
– lack of context
– dynamic environment, new concepts appear quickly
– quality issues, lots of spam
– fast data
Must integrate “traditional” data well, then bootstrap
– giant taxonomy critical
Crowdsourcing becomes indispensable
– but raises interesting challenges