

Dynamic Topic Modeling for Monitoring Market Competition from Online Text and Image Data

Hao Zhang¹, Gunhee Kim², Eric P. Xing¹

¹ Carnegie Mellon University, ² Seoul National University

Collecting Data

• Crawled raw tweets and their associated images using the Twitter REST API

• 6.6M tweets and 7.5M images from Twitter and external links

• Time range: 10/20/2014 to 02/01/2015

Brand Competition Monitoring

• 2 groups of brands: Luxury (13 brands) and Beer (12 brands)

[Pie charts: tweets with vs. without external links, and tweets directly with vs. without images. Values shown: 72%, 70%, 30%, 28%.]

Tweet (Beer): "Come down to SVT and enjoy our Super Bowl Sunday special. Heineken 5/2000, Corona 5/3000, Guiness and Mackeson http://fb.me/2WdRGBkem"

Processing → Associations = (Heineken, Corona, Guiness), each counted +1
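The processing step can be sketched as a simple brand-mention matcher. This is only a minimal illustration, assuming case-insensitive whole-word matching against a hand-written brand list; the function name `brand_associations` and the brand list are hypothetical, and the poster does not state the actual matching rules.

```python
import re

def brand_associations(tweet, brands):
    """Return {brand: 1} for every brand in `brands` mentioned in `tweet`.

    Assumption: a brand is "associated" with a tweet if its name appears
    as a whole word, ignoring case. Each associated brand is counted +1.
    """
    text = tweet.lower()
    found = {}
    for brand in brands:
        # \b keeps a brand name from matching inside a longer token.
        if re.search(r"\b" + re.escape(brand.lower()) + r"\b", text):
            found[brand] = 1
    return found

tweet = ("Come down to SVT and enjoy our Super Bowl Sunday special. "
         "Heineken 5/2000, Corona 5/3000, Guiness and Mackeson "
         "http://fb.me/2WdRGBkem")
# Illustrative subset of a brand list; spelling follows the tweet.
beer_brands = ["Heineken", "Corona", "Guiness", "Budweiser"]
print(brand_associations(tweet, beer_brands))
```

Brands that are not in the list (here, Mackeson) are ignored, so the result depends entirely on which brand names are tracked.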

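Competitive Dynamic Multi-view STC (cdSTC)

The model aims to address 3 major challenges:

1. Multi-view: modeling of multi-view representations of text and images

2. Competition: modeling of latent topics that are competitively shared by multiple brands

3. Dynamic: tracking the temporal evolution of the topics and competitions

[Figure: graphical model of cdSTC. For each document $d = 1{:}D$: document code $\theta_d$; text word codes $z_{dn}$ with observations $u_{dn}$; visual word codes $y_{dm}$ with observations $v_{dm}$; brand codes $r_{db}$ with observations $g_{db}$. For each topic $k = 1{:}K$ and time slice $t = 1{:}T$: chains $\phi_k^t \to \phi_k^{t+1}$, $\beta_k^t \to \beta_k^{t+1}$, $\gamma_k^t \to \gamma_k^{t+1}$.]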

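$\phi^t$: Brand-topic occupation matrix at time $t$ ($\in \mathbb{R}^{K \times L}$)

$\beta^t$ / $\gamma^t$: Topic distributions over text/visual words at time $t$ ($\in \mathbb{R}^{K \times G}$ / $\mathbb{R}^{K \times H}$)

$\theta_d$: Document code of document $d$ ($\in \mathbb{R}^{K}$)

$z_{dn}$ / $y_{dm}$: Word code of text/visual word $n$/$m$ ($\in \mathbb{R}^{K}$)

$u_{dn}$ / $v_{dm}$: Occurrences of text/visual word $n$/$m$ in document $d$

$r_{db}$: Brand code of brand $b$ in document $d$ ($\in \mathbb{R}^{K}$)

$g_{db}$: Indicator for each brand label $b$ for document $d$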
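As an illustration of the observed variables in this notation, the sketch below builds the word occurrences and brand indicators for one document. It assumes that visual words are already available as integer ids from some quantized image-feature codebook (the poster does not say how visual words are extracted); `encode_document`, the vocabulary, and the ids are hypothetical.

```python
from collections import Counter

def encode_document(text_tokens, visual_word_ids, brands_in_doc, vocab, all_brands):
    """Build the observed variables of one document d.

    u: occurrences of text words   (u_dn), keyed by vocabulary index
    v: occurrences of visual words (v_dm), keyed by visual-word id
    g: 0/1 indicator per brand     (g_db), in the order of `all_brands`
    """
    # Out-of-vocabulary tokens are dropped (an assumption of this sketch).
    u = Counter(vocab[w] for w in text_tokens if w in vocab)
    v = Counter(visual_word_ids)
    g = [1 if b in brands_in_doc else 0 for b in all_brands]
    return dict(u), dict(v), g

vocab = {"perfume": 0, "bottle": 1, "bag": 2}   # hypothetical text vocabulary
all_brands = ["Chanel", "Gucci", "Prada"]
u, v, g = encode_document(
    text_tokens=["perfume", "bottle", "perfume", "blog"],
    visual_word_ids=[17, 17, 42],               # hypothetical visual-word ids
    brands_in_doc={"Gucci"},
    vocab=vocab,
    all_brands=all_brands,
)
print(u, v, g)
```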

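• Joint Probability

$$
p(\theta, z, u, y, v, r, g \mid \beta, \gamma, \phi)
= p(\theta)
\prod_{n \in N} p(z_n \mid \theta)\, p(u_n \mid z_n, \beta)
\prod_{m \in M} p(y_m \mid \theta)\, p(v_m \mid y_m, \gamma)
\prod_{b \in B} p(r_b \mid \theta)\, p(g_b \mid r_b, \phi)
$$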
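To see why a factorized joint of this kind leads to a sum-of-penalties objective, one can take its negative logarithm. The sketch below writes out only the text view and assumes, as is common in sparse topical coding, a Laplace-type prior on $\theta$ and a composite Gaussian/Laplace conditional for each word code; these specific forms are assumptions, not something stated on the poster.

```latex
\begin{aligned}
-\log p(\cdot \mid \beta, \gamma, \phi)
 &= -\log p(\theta)
    - \sum_{n \in N} \bigl[ \log p(z_n \mid \theta) + \log p(u_n \mid z_n, \beta) \bigr]
    + \text{(analogous sums over } m \in M \text{ and } b \in B\text{)}, \\
\text{assumed:}\quad
 p(\theta) &\propto \exp\bigl(-\lambda \lVert \theta \rVert_1\bigr), \qquad
 p(z_n \mid \theta) \propto \exp\bigl(-\nu_1 \lVert z_n - \theta \rVert_2^2 - \rho_1 \lVert z_n \rVert_1\bigr).
\end{aligned}
```

Under these assumptions the prior contributes $\lambda \lVert \theta \rVert_1$, and each word code contributes $\nu_1 \lVert z_n - \theta \rVert_2^2 + \rho_1 \lVert z_n \rVert_1$ plus a data-fit loss from $-\log p(u_n \mid z_n, \beta)$, which is the shape of the per-word terms in the MAP objective.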

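• MAP estimation then gives the following objective:

$$
\begin{aligned}
\min_{\{\Theta^t, \beta^t, \gamma^t, \phi^t\}_{t=1}^{T}} \quad
 & \sum_{t=1}^{T} \sum_{d=1}^{D_t} \lambda \lVert \theta_d^t \rVert_1
 && \text{(sparse term for document code)} \\
 & + \sum_{t=1}^{T} \left( \pi_1 \lVert \beta^t - \beta^{t-1} \rVert_2^2 + \pi_2 \lVert \gamma^t - \gamma^{t-1} \rVert_2^2 + \pi_3 \lVert \phi^t - \phi^{t-1} \rVert_2^2 \right)
 && \text{(evolving chain)} \\
 & + \sum_{t=1}^{T} \sum_{d=1}^{D_t} \sum_{n \in N_d^t} \left( \nu_1 \lVert z_{dn}^t - \theta_d^t \rVert_2^2 + \rho_1 \lVert z_{dn}^t \rVert_1 + L(z_{dn}^t, \beta^t) \right)
 && \text{(text)} \\
 & + \sum_{t=1}^{T} \sum_{d=1}^{D_t} \sum_{m \in M_d^t} \left( \nu_2 \lVert y_{dm}^t - \theta_d^t \rVert_2^2 + \rho_2 \lVert y_{dm}^t \rVert_1 + L(y_{dm}^t, \gamma^t) \right)
 && \text{(image)} \\
 & + \sum_{t=1}^{T} \sum_{d=1}^{D_t} \sum_{b \in B_d^t} \left( \nu_3 \lVert r_{db}^t - \theta_d^t \rVert_2^2 + \rho_3 \lVert r_{db}^t \rVert_1 + L(r_{db}^t, \phi^t) \right)
 && \text{(brand)} \\
\text{s.t.} \quad
 & \theta_d^t > 0, \ \forall d, t; \qquad z_{dn}^t,\ y_{dm}^t,\ r_{db}^t > 0, \ \forall d, n, m, b, t \\
 & \beta_k^t \in P_U, \quad \gamma_k^t \in P_V, \quad \phi_k^t \in P_B, \quad \forall k, t
 && \text{(simplex)}
\end{aligned}
$$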
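The penalty terms of this objective can be evaluated with a few lines of code. The sketch below is a hedged illustration, not the authors' implementation: it covers only the sparse term, the evolving-chain term, and the quadratic/ℓ1 part of the per-code terms. The data-fit loss $L$ is left out because the poster does not define it, and the chain is summed over consecutive slices $t = 2, \dots, T$ because the initial state at $t = 0$ is not specified. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equally shaped matrices (lists of rows)."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def chain_penalty(mats, weight):
    """weight * sum over consecutive slices of ||M^t - M^{t-1}||_2^2."""
    return weight * sum(sq_dist(cur, prev) for prev, cur in zip(mats, mats[1:]))

def document_sparsity(theta, lam):
    """lam * ||theta||_1 for one document code."""
    return lam * sum(abs(x) for x in theta)

def code_penalty(codes, theta, nu, rho):
    """Sum over codes c of nu * ||c - theta||_2^2 + rho * ||c||_1 (loss L omitted)."""
    total = 0.0
    for c in codes:
        total += nu * sum((ci - ti) ** 2 for ci, ti in zip(c, theta))
        total += rho * sum(abs(ci) for ci in c)
    return total

# Toy example: one topic over two words, two time slices.
betas = [[[0.5, 0.5]], [[0.75, 0.25]]]
theta = [1.0, 0.0]
word_codes = [[1.0, 0.0], [0.0, 1.0]]
print(chain_penalty(betas, weight=2.0))
print(document_sparsity(theta, lam=0.5))
print(code_penalty(word_codes, theta, nu=1.0, rho=0.5))
```

The same `code_penalty` function applies to the text, image, and brand views, each with its own $(\nu, \rho)$ pair.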

Evaluation: Topic Quality

Argument 1: Lower perplexity ≠ higher quality [J. Chang 2009]

Argument 2: Perplexity is not a fair metric for models with different distributions

• Define the Coherence Measure (CM) and the Validity Measure (VM):

$$
\mathrm{CM} = \frac{\#\ \text{of relevant words}}{\#\ \text{of words in valid topics}},
\qquad
\mathrm{VM} = \frac{\#\ \text{of valid topics}}{\#\ \text{of topics}}
$$

• Average VM/CM on text topics

Method           VM (Beer / Luxury)    CM (Beer / Luxury)
dLDA             0.53 / 0.68           0.55 / 0.52
STC + dyn        0.44 / 0.66           0.57 / 0.57
cdSTC + multi    0.51 / 0.70           0.63 / 0.59
cdSTC + text     0.605 / 0.71          0.61 / 0.59

• Average VM/CM on visual topics

Method           VM (Beer / Luxury)    CM (Beer / Luxury)
Kmeans           0.39 / 0.56           0.59 / 0.64
LDA + multi      0.57 / 0.63           0.51 / 0.69
cdSTC + multi    0.57 / 0.65           0.66 / 0.71
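The two measures are simple ratios and can be computed as follows. This is a minimal sketch that takes the validity and relevance judgments as given inputs (the poster does not describe how they are collected) and assumes that relevant words are counted only within valid topics; the function names and the toy judgments are hypothetical.

```python
def validity_measure(topics):
    """VM = (# of valid topics) / (# of topics)."""
    return sum(1 for valid, _, _ in topics if valid) / len(topics)

def coherence_measure(topics):
    """CM = (# of relevant words) / (# of words in valid topics)."""
    relevant = sum(n_rel for valid, n_rel, _ in topics if valid)
    total = sum(n_words for valid, _, n_words in topics if valid)
    return relevant / total

# Each topic: (is_valid, number_of_relevant_words, number_of_words_shown)
judged_topics = [
    (True, 8, 10),
    (True, 6, 10),
    (False, 0, 10),
    (True, 7, 10),
]
print(validity_measure(judged_topics))   # 3 of 4 topics are valid
print(coherence_measure(judged_topics))  # 21 relevant words out of 30
```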

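Evaluation: Prediction

• Task I: Given a novel tweet, can we predict its most associated brand?

[Figure: novel tweets are fed to the model, which infers the most associated brand. Example: "What is the most beautifully-designed perfume bottle? Tell us on the blog here: http://smarturl.it/ie2fka and win Gucci" → Model (infer) → Gucci]

$$
\begin{aligned}
\min_{\{\Theta^t, \mathcal{M}^t, \eta^t\}_{t=1}^{T}} \quad
 & \sum_{t=1}^{T} \left[ f(\Theta^t, \mathcal{M}^t, D^t) + C\, R(\Theta^t, \eta^t) + \tfrac{1}{2} \lVert \eta^t \rVert_2^2 \right] \\
\text{s.t.} \quad
 & \theta_d^t > 0, \ \forall d, t; \qquad z_{dn}^t,\ y_{dm}^t > 0, \ \forall d, n, m, t \\
 & \beta_k^t \in P_U, \quad \gamma_k^t \in P_V, \quad \forall k, t
\end{aligned}
$$

Task I-I: Randomly split the data in every time slice into 90% for training and 10% for testing.

[Figure: Task I-I results for (a) Beer and (b) Luxury]

Task I-II: Use the data in $[1, t-1]$ for training and the data in $[t-1, t]$ for testing.

[Figure: Task I-II results for (a) Beer and (b) Luxury]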
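The two evaluation protocols for Task I can be sketched as data-splitting helpers. This is an illustration under stated assumptions: documents are already grouped into a list of time slices, the interval $[t-1, t]$ is read as "slice $t$" with 1-indexed slices, and the function names and the fixed seed are hypothetical.

```python
import random

def random_split_per_slice(slices, test_fraction=0.1, seed=0):
    """Task I-I style: split every time slice independently into train/test."""
    rng = random.Random(seed)
    train, test = [], []
    for docs in slices:
        docs = list(docs)          # copy so the caller's data is not shuffled
        rng.shuffle(docs)
        n_test = round(len(docs) * test_fraction)
        test.append(docs[:n_test])
        train.append(docs[n_test:])
    return train, test

def temporal_split(slices, t):
    """Task I-II style: train on slices 1..t-1, test on slice t (1-indexed)."""
    return slices[:t - 1], slices[t - 1]

slices = [list(range(10)), list(range(10, 30))]
train, test = random_split_per_slice(slices)
print([len(s) for s in train], [len(s) for s in test])

past, current = temporal_split([["a"], ["b"], ["c"]], t=3)
print(past, current)
```

The first protocol tests interpolation within each slice, while the second tests whether a model learned on the past carries over to the next slice.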

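• Task II: Given an unseen past document, can we predict its timestamp?

[Figure: an unseen past tweet, e.g. "What is the most beautifully-designed perfume bottle? Tell us on the blog here: http://smarturl.it/ie2fka and win Gucci", is located on the timeline of past tweets at the time point $t$ at which it was sent.]

$$
\max_{t} \; p(d \mid \mathcal{M}^t), \quad \text{where} \quad
p(d \mid \mathcal{M}^t) = \prod_{n \in N_d} p(u_n \mid \beta^t) \prod_{m \in M_d} p(v_m \mid \gamma^t) \prod_{b \in B_d} p(g_b \mid \phi^t)
$$

[Figure: Task II results for (a) Beer and (b) Luxury]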
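Timestamp prediction amounts to scoring the document under every time-slice model and keeping the best slice. The sketch below works in log space to avoid underflow. It assumes, purely for illustration, that each slice model exposes one probability table per view as a plain dict; how $p(u_n \mid \beta^t)$ and the other factors are actually computed is not specified on the poster, and the names and toy numbers are hypothetical.

```python
import math

def log_likelihood(doc, model, floor=1e-12):
    """log p(d | M^t): sum of log-probabilities over the text, visual, and brand views.

    Assumption: model[view] maps an item (word, visual word, or brand) to its
    probability in that time slice; unseen items get a small floor value.
    """
    total = 0.0
    for view in ("text", "visual", "brand"):
        for item in doc.get(view, []):
            total += math.log(max(model[view].get(item, 0.0), floor))
    return total

def predict_timestamp(doc, models):
    """Return the 0-based index of the slice whose model scores the document highest."""
    scores = [log_likelihood(doc, m) for m in models]
    return max(range(len(models)), key=scores.__getitem__)

models = [
    {"text": {"beer": 0.9, "bag": 0.1}, "visual": {}, "brand": {"Gucci": 0.2, "Corona": 0.8}},
    {"text": {"beer": 0.1, "bag": 0.9}, "visual": {}, "brand": {"Gucci": 0.7, "Corona": 0.3}},
]
doc = {"text": ["bag", "bag"], "brand": ["Gucci"]}
print(predict_timestamp(doc, models))
```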

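• Task III: Can we predict future competition trends using past data?

[Figure: two steps. (1) Evolve the competition matrix: the matrices $\phi^t$ learned from the data in $[1, t-1]$ are evolved to $\phi^{t+1}$, which serves as the prediction. (2) Construct the "groundtruth" data: a matrix "gt" is obtained by counting the data at $t$. Prediction and groundtruth are shown side by side; the values shown for the topics Bags, Perfume, and Watch are 0.4019, 0.2615, and 0.0739.]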
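The "groundtruth by counting" step can be illustrated with a small helper that turns (topic, brand) co-occurrences into a row-normalized brand-topic matrix and compares it with a predicted matrix. This is a sketch under assumptions: each held-out document is taken to contribute one (topic, brand) pair, rows are normalized so that each topic is a distribution over brands, and mean absolute error is used only as a stand-in because the poster does not name the comparison metric.

```python
def groundtruth_matrix(assignments, n_topics, n_brands):
    """Count (topic, brand) pairs and normalize every topic row to sum to 1."""
    counts = [[0] * n_brands for _ in range(n_topics)]
    for topic, brand in assignments:
        counts[topic][brand] += 1
    rows = []
    for row in counts:
        s = sum(row)
        # A topic with no observations falls back to a uniform row.
        rows.append([c / s if s else 1.0 / n_brands for c in row])
    return rows

def mean_abs_error(pred, truth):
    """Average absolute entry-wise difference between two matrices."""
    diffs = [abs(p - q) for rp, rq in zip(pred, truth) for p, q in zip(rp, rq)]
    return sum(diffs) / len(diffs)

# Toy data: 2 topics, 3 brands; pairs are (topic index, brand index).
pairs = [(0, 0), (0, 0), (0, 1), (1, 2)]
gt = groundtruth_matrix(pairs, n_topics=2, n_brands=3)
predicted = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]   # hypothetical evolved matrix
print(gt)
print(mean_abs_error(predicted, gt))
```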

Objective (easy → difficult)

• How do brands occupy the market in every time slice?

• How does each textual/visual topic evolve over time?

• How does each brand's occupation change over time?

• What do the competition trends among multiple brands look like over time?
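Once a brand-topic matrix is available for every time slice, the questions about occupation over time reduce to reading out and differencing entries. The sketch below assumes the matrices are stored as a list `phis` indexed by time, with `phis[t][k][b]` the share of brand `b` in topic `k`; the layout, the function names, and the toy values are hypothetical.

```python
def occupation_series(phis, topic, brand):
    """Share of `brand` in `topic` for every time slice."""
    return [phi[topic][brand] for phi in phis]

def changes(series):
    """Slice-to-slice change of a series."""
    return [b - a for a, b in zip(series, series[1:])]

def leading_brand(phis, topic):
    """Index of the brand with the largest share in `topic`, per time slice."""
    return [max(range(len(phi[topic])), key=phi[topic].__getitem__) for phi in phis]

# Toy example: 1 topic, 2 brands, 3 time slices.
phis = [
    [[0.5, 0.5]],
    [[0.75, 0.25]],
    [[0.25, 0.75]],
]
series = occupation_series(phis, topic=0, brand=0)
print(series)
print(changes(series))
print(leading_brand(phis, topic=0))
```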

(a) Input: Tweets and associated images of competing brands

• "#Style #Prada Black Leather & Nylon Tessuto Saffiano Shoulder #Bag http://dlvr.it/8WZKM2 #Forsale #Auction"

• "Coat from @ASOS , top from @FreePeople, jeans from Rag & Bone, boots from #ChristianLouboutin & bag from @Prada ."

• "What is the most beautifully-designed perfume bottle? Tell us on the blog here: http://smarturl.it/ie2fka and win Gucci"

• "The latest crop of #Chanel Pre-Spring bags have arrived! See the full collection now: http://bit.ly/1z3PnKG"

• "Pretty In Pink: From @Chanel to @nailsinc, the best petal-hued make-up launches this spring http://vogue.uk/8p6UOi"

• "Designer Kate Spade, Invicta, Gucci & More Watches from $22 & Extra 20% Off http://www.dealsplus.com/t/1zr85Y"

(b) Output: Temporal evolution of topics and brands' proportion over the topics

Topics (text / visual words) along the timeline:

Time t:

• watch+diamond: rolex, watch, gold, dial, mens, datejust, ladies, steel, diamond, oyster, stainless, 18k

• glasses: chanel, giorgio, sunglasses, classic, glasses, reading, women's, #burberrygifts

• bags: bag, leather, gucci, handbag, tote, clothing, shoulder, canvas, reading, women's

Time t + 1:

• watch+diamond: watch, gold, white date, ladies, dial gift, rolex #deals_us, blue, vintage, bracelet, omega

• glasses: chanel, sunglasses, listen, green, funny, dark, xmas, womens, Armani, excellent, Havana. lacoste

• bags: authentic, leather, bag, shoes, gucci, handbag, prada, tote, deals, brown, wallet

Brands over topics: Chanel, Gucci, Prada
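Word lists like the ones shown for each topic are typically read off a topic-word matrix by ranking the words of each row. The sketch below shows that step for one time slice, assuming a row of $\beta^t$ is given as a list of weights aligned with a vocabulary list; `top_words` and the toy numbers are hypothetical.

```python
def top_words(beta_row, vocab, k=5):
    """Return the k highest-weight words of one topic row of beta^t."""
    ranked = sorted(range(len(beta_row)), key=lambda i: beta_row[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]

vocab = ["rolex", "watch", "bag", "gold", "leather"]
beta_t = [
    [0.40, 0.30, 0.00, 0.25, 0.05],   # a "watch"-like topic
    [0.00, 0.05, 0.50, 0.05, 0.40],   # a "bags"-like topic
]
for row in beta_t:
    print(top_words(row, vocab, k=2))
```

Running the same readout on consecutive slices gives the kind of side-by-side word lists used to show how a topic drifts over time.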

Problem Statement

The increasing pervasiveness of the Internet has led to a wealth of consumer-created data over a multitude of online platforms.

What can we learn?

• The general public's opinion towards different companies' products and services

• Performance evaluations in different market conditions (time, location, etc.)

What do marketers want to see?

• Detection: Listen in on consumers' opinions towards their products and their competitors

• Summarization: Summarize/visualize how a shared market is occupied by different brands

• Dynamics: Monitor the changes of market competition over time

[Figure: Brand Competitions. corona, budlight, and guiness compete on the topic "SuperBowl + beer"; rolex, omega, and burberry compete on the topic "Watch + luxury".]

Our Approach: Joint Analysis of Text and Images

Take advantage of the pervasiveness of images on social media:

• A large portion of tweets simply show images and links without any meaningful text. Images play an important role in representing topics in this type of document.

• Many users prefer to use images to deliver their ideas more clearly and broadly.

• The joint use of images with text also helps marketers interpret the discovered topics.

• Images may be essential for users to converse about customers' descriptions, experiences, and opinions toward the brands.