AnHai Doan University of Wisconsin @WalmartLabs Social Media, Data Integration, and Human Computation @WalmartLabs

Upload: donna-cox

Post on 16-Dec-2015


Page 1

AnHai Doan
University of Wisconsin, @WalmartLabs

Social Media, Data Integration, and Human Computation @WalmartLabs

Page 2

Background
Professor at University of Wisconsin-Madison
In 2010 took unpaid leave and joined Kosmix
– Bay Area startup that did semantic analysis of social media
Acquired by Walmart in 2011, became WalmartLabs
– Based in San Bruno, with a local office in India; hundreds of people

Why did Walmart buy a social-media startup?
– Wanted to catch up with Amazon (<$10B online vs. >$35B for Amazon)
– Major problems if it doesn't get close within 10 years (see Borders)
– Kosmix/WalmartLabs helps in many ways
  – Provides a core of technical people, helps attract more
  – Improves traditional e-commerce
  – Builds the e-commerce of the future: Social + Local + Mobile

Page 3

Major R&D Groups at WalmartLabs

Search and Products
– Polaris
– Giant product catalog
– Product intelligence

Demand Generation
– SEO, SEM
– Customer targeting and personalization

Social, Mobile, and Local E-Commerce
– Mining social data
– Stores + Mobile
– Build social/mobile apps (Get on the Shelf, gift recommendation, etc.)

Special Initiatives
– Big Fast Data
– Large-scale Machine Learning
– Data Extraction & Integration
– Crowdsourcing
– Social Genome

Page 4

Social Genome
Mine everything we can out of social data
– From tweets, FB feeds, Foursquare, blogs, etc.
– Mine users, organizations, products, sentiments, events, etc.
Connect them to their counterparts in the traditional Web world
Put them into a giant knowledge base
– Big, evolves rapidly over time
– Call this the “social genome”
Use the social genome to power multiple e-commerce applications
– Search
– Product intelligence
– Gift recommendation
– Personalized “Groupon”
– Etc.

Page 5

Social Genome

[Figure: the social genome drawn as a graph]
Nodes: all; people; actors (Angelina Jolie, Mel Gibson); places (Tahrir, Cairo, Egypt); Twitter users (@melgibson, @dsmith, …); FB users (mel-gibson, davesmith, …); events, grouped under celebrities, sports, politics, … (Gibson car crash, Egyptian uprising)
Sample tweets: “@dsmith: Mel crashed. Maserati is gone.”; “@far213: Tahrir is packed!”
Edge labels: the-same-as, tweet-about, related-to, located-in, capital-of
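
[Sketch, not from the slides] One simple way to hold a graph like the one above is as subject/relation/object triples. The node names and edge labels below come from the figure; which nodes each edge connects is partly an assumption, because the transcript lost the arrows. The `Triple` type and the `build_index` and `neighbors` helpers are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Edge endpoints are assumptions; only the labels survive in the transcript.
TRIPLES = [
    Triple("@melgibson", "the-same-as", "Mel Gibson"),
    Triple("mel-gibson", "the-same-as", "Mel Gibson"),
    Triple("@dsmith: Mel crashed. Maserati is gone.", "tweet-about", "Gibson car crash"),
    Triple("Gibson car crash", "related-to", "Mel Gibson"),
    Triple("@far213: Tahrir is packed!", "tweet-about", "Egyptian uprising"),
    Triple("Tahrir", "located-in", "Cairo"),
    Triple("Cairo", "capital-of", "Egypt"),
]


def build_index(triples):
    """Index triples by subject so outgoing edges can be looked up quickly."""
    index = defaultdict(list)
    for t in triples:
        index[t.subject].append((t.relation, t.obj))
    return index


def neighbors(index, node, relation):
    """Return every object reachable from `node` over `relation`."""
    return [obj for rel, obj in index.get(node, []) if rel == relation]


index = build_index(TRIPLES)
print(neighbors(index, "Tahrir", "located-in"))
```

A triple list keeps the social entities (tweets, user handles) and the traditional taxonomy entities in one structure, which is the connection the slide describes.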

Page 6

Page 7

Building Social Genome: Three Sample Challenges

[Figure: the social genome graph from Page 5, annotated with the numbers 1, 2, and 3 to mark the three sample challenges]

Page 8

Extraction and Disambiguation: Traditional Methods Ill Suited for Social Media

[Figure: two sample texts go through extraction, then disambiguation against the taxonomy]
Taxonomy nodes: all; people; actors, professors (Angelina Jolie, Mel Gibson, Mel Brocks); places; events, grouped under celebrities, sports, politics, … (Gibson car crash, Egyptian uprising)
Sample texts:
– “Mel was arrested again. What a dramatic fall since his Oscar-winning day.”
– “@dsmith: mel crashed. maserati is gone.”
Extraction:
– use rule-based / NLP / machine learning techniques
– use dictionaries
Disambiguation uses two kinds of context:
– Long-term, Web context: actor, movie, Oscar, Hollywood
– Short-term, social context: crash, car, Maserati
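
[Sketch, not from the slides] Dictionary-based extraction on the sample tweet can be illustrated as a lookup of each token against known surface forms. The `DICTIONARY` contents and the `extract_mentions` helper are hypothetical; in the talk the dictionary would be derived from the taxonomy.

```python
import re

# Hypothetical dictionary mapping surface forms to candidate taxonomy nodes.
DICTIONARY = {
    "mel": ["Mel Gibson", "Mel Brocks"],
    "maserati": ["Maserati"],
    "oscar": ["Oscar"],
}


def extract_mentions(text, dictionary):
    """Return (token, candidates) pairs for every token found in the dictionary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [(tok, dictionary[tok]) for tok in tokens if tok in dictionary]


mentions = extract_mentions("@dsmith: mel crashed. maserati is gone.", DICTIONARY)
print(mentions)
```

The lookup works on lowercase, ungrammatical text, but it leaves "mel" with two candidates, which is why a disambiguation step has to follow.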

Page 9

Must Maintain a Highly Dynamic Social Genome

[Figure: the taxonomy from Page 8, with long-term and short-term contexts attached]
– Long-term, Web context: actor, movie, Oscar, Hollywood
– Short-term, social context: crash, car, Maserati
– Latency less than 2 seconds
– Maintained using a fast-data processing system
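
[Sketch, not from the slides] A short-term social context can be pictured as the terms recently seen with an entity, with old terms dropping out as time passes. The sliding-window design, the window length, and the `ShortTermContext` class are assumptions for illustration; the slides state only the latency target and that a fast-data system maintains the contexts.

```python
from collections import Counter, deque


class ShortTermContext:
    """Keeps terms seen for an entity within the last `window_seconds`.

    Assumes timestamps arrive in non-decreasing order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, term) pairs, oldest first
        self.counts = Counter()

    def add(self, timestamp, terms):
        """Record new terms, then drop anything older than the window."""
        for term in terms:
            self.events.append((timestamp, term))
            self.counts[term] += 1
        self._expire(timestamp)

    def _expire(self, now):
        while self.events and now - self.events[0][0] > self.window_seconds:
            _, term = self.events.popleft()
            self.counts[term] -= 1
            if self.counts[term] == 0:
                del self.counts[term]

    def top_terms(self, k):
        """Return the k most frequent terms currently in the window."""
        return [term for term, _ in self.counts.most_common(k)]


ctx = ShortTermContext(window_seconds=3600)
ctx.add(0, ["crash", "car"])
ctx.add(10, ["crash", "maserati"])
print(ctx.top_terms(1))
```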

Page 10

The Giant Traditional Taxonomy is the Secret Weapon

Without it, dictionary-based extraction is not possible
Provides a framework to
– “understand” social media, find related concepts, “hang” social contexts
Very hard to develop, takes years
– Integrate data from multiple sources; like learning a foreign language
Partly explains why it was hard for others to catch up
To integrate social media, must integrate traditional data well, then bootstrap

[Figure: taxonomy fragment with nodes all; people; actors (Angelina Jolie, Mel Gibson); places (Tahrir, Cairo, Egypt); edge labels located-in, capital-of]

Page 11

Context is also Absolutely Critical

Alice tweets “Go Giants!”
– Entity extraction: SF Giants or NY Giants?
– Context: Alice lives in NYC
– Disambiguation: NY Giants

Bob tweets “Go Giants!”
– Entity extraction: SF Giants or NY Giants?
– Context: Bob likes Buster Posey (SF Giants player)
– Disambiguation: SF Giants

Charlie tweets “Go Giants!”
– Entity extraction: SF Giants or NY Giants?
– Context: Charlie tweeted on Feb 4th (the day before the Super Bowl, when the Web is talking about the NY Giants)
– Disambiguation: NY Giants
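
[Sketch, not from the slides] The three cases above can be read as scoring each candidate entity by how much its known context overlaps with the context observed for the tweet (the user's profile, likes, or what the Web is discussing that day). The overlap score, the `CANDIDATE_CONTEXT` table, and the `disambiguate` helper are assumptions chosen for illustration, not the method used at WalmartLabs.

```python
# Hypothetical context terms attached to each candidate entity.
CANDIDATE_CONTEXT = {
    "SF Giants": {"san francisco", "buster posey", "baseball"},
    "NY Giants": {"nyc", "super bowl", "football"},
}


def disambiguate(candidates, context_terms):
    """Pick the candidate whose context overlaps most with the observed context.

    Returns None when nothing overlaps or when the top score is tied.
    """
    scores = {c: len(CANDIDATE_CONTEXT[c] & context_terms) for c in candidates}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best, best_score = ranked[0]
    if best_score == 0:
        return None
    if len(ranked) > 1 and ranked[1][1] == best_score:
        return None
    return best


giants = ["SF Giants", "NY Giants"]
alice = disambiguate(giants, {"nyc"})            # Alice lives in NYC
bob = disambiguate(giants, {"buster posey"})     # Bob likes Buster Posey
charlie = disambiguate(giants, {"super bowl"})   # the Web is talking about the Super Bowl
unknown = disambiguate(giants, set())            # no context available
print(alice, bob, charlie, unknown)
```

Returning `None` when there is no usable context makes the point of the slide explicit: the tweet text alone cannot settle which Giants are meant.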

Page 12

Building Social Genome: Three Sample Challenges

[Figure: the social genome graph from Page 5, repeated with the markers 1, 2, and 3]

Page 13

Event Detection: Current Solutions

• A lot of current work in academia / industry
• Limitations of most current solutions
– exploit just one kind of heuristic
  • e.g., find hot, trending, popular words (Egypt, revolt)
– do not exploit crowdsourcing
– do not scale

[Figure: Twitter, Foursquare, Facebook, Myspace, Flickr, … feed into event detection, which outputs events grouped under celebrities, sports, politics, … (Gibson car crash, Egyptian uprising)]

Page 14

Event Detection: Our Solution

[Figure: event-detection pipeline]
– Inputs: Twitter, Foursquare, ...
– Detector 1, Detector 2, …, Detector n, each producing candidate events
– Event evaluator and ranker, producing ranked events
– Crowdsourcing Population 1, Population 2, Population 3
– Muppet, a platform to process fast data over multiple machines
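
[Sketch, not from the slides] The pipeline in the figure can be outlined as several independent detectors whose candidate events are merged, adjusted by crowd feedback, and ranked. The two detectors, the additive scoring, and the `crowd_votes` input are hypothetical stand-ins; the slides do not say how the real detectors or the evaluator work.

```python
import re
from collections import Counter, defaultdict


def trending_word_detector(messages, min_count=2):
    """Hypothetical detector: a word seen in at least `min_count` messages is a candidate."""
    counts = Counter(
        word for m in messages for word in set(re.findall(r"[a-z0-9]+", m.lower()))
    )
    return {word: float(n) for word, n in counts.items() if n >= min_count}


def hashtag_detector(messages):
    """Hypothetical detector: every hashtag is a candidate, scored by frequency."""
    counts = Counter(tag for m in messages for tag in re.findall(r"#(\w+)", m.lower()))
    return {tag: float(n) for tag, n in counts.items()}


def rank_events(messages, detectors, crowd_votes):
    """Sum detector scores per candidate, add crowd votes, and sort best first."""
    totals = defaultdict(float)
    for detect in detectors:
        for candidate, score in detect(messages).items():
            totals[candidate] += score
    for candidate in totals:
        totals[candidate] += crowd_votes.get(candidate, 0)
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


messages = ["Tahrir is packed! #egypt", "Protest in Tahrir #egypt", "lunch was great"]
ranked = rank_events(
    messages,
    detectors=[trending_word_detector, hashtag_detector],
    crowd_votes={"tahrir": 3},
)
print(ranked)
```

Keeping each detector behind the same "messages in, scored candidates out" shape is what lets more than one heuristic be combined, which is the limitation of single-heuristic solutions noted on the previous slide.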

Page 15

Processing Fast Data

Big data management is well known by now
– use MapReduce implementations
– simple programming model, widespread adoption
But a lot of fast data is also emerging
– 150 M tweets / day, 1 billion FB shares / day, 3 M Foursquare check-ins / day
– come into the system as very fast streams
Numerous applications over these streams
Need to process in real time
– to answer “what is happening now?”

Page 16

Processing Fast Data

What we want: a platform that
– delivers real-time processing (over multiple machines)
– is highly scalable (as the data gets faster and faster)
– has a simple programming model
  – so developers can quickly write hundreds of apps
  – ideally like MapReduce, which developers already know
– has real-time query and storage capability
  – apps can query content in real time
  – distributed across multiple machines
Answer: Muppet, like MapReduce, but for fast data
– see “MapReduce-Style Processing of Fast Data”, VLDB-12
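
[Sketch, not from the slides] To make "like MapReduce, but for fast data" concrete, here is a single-process illustration of a map step followed by a per-key update step that keeps running state, applied to events one at a time. This is not Muppet's API: `map_fn`, `update_fn`, and `process_stream` are hypothetical names, and the real system is distributed across machines.

```python
from collections import defaultdict


def map_fn(event):
    """Emit (key, value) pairs for one incoming event; here, one pair per hashtag."""
    for word in event["text"].split():
        if word.startswith("#"):
            yield word.lower(), 1


def update_fn(state, value):
    """Fold one value into the running state kept for a key."""
    return state + value


def process_stream(events):
    """Apply map_fn and update_fn to events as they arrive, never waiting for a batch."""
    state = defaultdict(int)
    for event in events:
        for key, value in map_fn(event):
            state[key] = update_fn(state[key], value)
        yield dict(state)  # snapshot an app could query after each event


stream = [{"text": "Tahrir is packed! #Egypt"}, {"text": "#egypt #jan25"}]
snapshots = list(process_stream(stream))
print(snapshots[-1])
```

Unlike a batch reduce, the update step never sees all values for a key at once, so the per-key state must be queryable at any moment; that matches the "real-time query and storage" requirement listed above.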

Page 17

Using the Social Genome

Gift recommendation:
– “I love salt!”
– “Your friend has just tweeted about the movie SALT. Would you like to buy something related for her birthday?”

Page 18

Using the Social Genome

Search query expansion
– “Advil” → “advil headache cramp”
Personalized “Groupon” with vendors
– “You seem to be interested in gourmet coffee. If 50 people sign up to buy the new DeLonghi coffee maker, you can get it at a 50% discount.”
Stocking a local store
– Lots of people in Mountain View are interested in outdoor sports
– Stock up the local Walmart store with related products
A Siri-like shopping assistant
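
[Sketch, not from the slides] The query-expansion example above amounts to appending terms related to the query. The `RELATED_TERMS` table and the `expand_query` helper are hypothetical; in the talk the related terms would come from the social genome rather than a hand-written table.

```python
# Hypothetical related-term table standing in for a lookup in the social genome.
RELATED_TERMS = {"advil": ["headache", "cramp"]}


def expand_query(query, related_terms, max_extra=2):
    """Append up to `max_extra` related terms that are not already in the query."""
    tokens = query.lower().split()
    extras = []
    for tok in tokens:
        for term in related_terms.get(tok, []):
            if term not in tokens and term not in extras:
                extras.append(term)
    return " ".join(tokens + extras[:max_extra])


print(expand_query("Advil", RELATED_TERMS))
```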

Page 19

Wrapping Up
The future of e-commerce: social, mobile, and local
Retailers must increasingly be data / Web players
Social media is important for e-commerce
Integrating social data is fundamentally much harder than integrating “traditional” data
– lack of context
– dynamic environment, new concepts appear quickly
– quality issues, lots of spam
– fast data
Must integrate “traditional” data well, then bootstrap
– giant taxonomy critical
Crowdsourcing becomes indispensable
– but raises interesting challenges