Text mining of Beauty Blogs: What do women talk about? Артем Просветов, Data Scientist, CleverDATA

Upload: cleverdata

Post on 14-Apr-2017


TRANSCRIPT

Page 1: Text mining of Beauty Blogs: What do women talk about? (Артем Просветов, Data Scientist, CleverDATA)

Text mining of Beauty Blogs: What do women talk about?

Артем Просветов, Data Scientist, CleverDATA

Page 2: Raw blog data

[Chart legend: empty; not English; techcrunch.com; photo/video pages; correct English page]

cleverdata.ru | [email protected]

Raw data: 98,496 pages stored as ~1,000,000 files.

Ready for analysis: 58,719 English pages (59.6%).

The remaining 40.4% was excluded: empty pages and pages with errors, non-English pages (23,461), photo/video pages without text (2,315), and articles from techcrunch.com (3,402).

Page 3: Pages → Authors

From ~60k pages → ~2,000 authors.

Page 4: Mean blog post size (in words)

One can distinguish 2 populations of bloggers:

• 'Twitter style' authors with short posts (~20%)

• full-length bloggers with a mean of 200-500 words per post (~80%)

Page 5: Sentiment analysis

APIs and services used:

- Sentity (https://sentity.io/)

- Twinword (https://www.twinword.com/)

- Textualinsights (http://www.textualinsights.com/)

- VivekN (https://github.com/vivekn/sentiment-web)

Page 6: Sentiment analysis

• The resulting sentiment rate is based on 4 independent rating systems.

• The majority of the blogs have a positive emotion rate.

• The mean sentiment rate is 0.72 ("warm positive").

• All these results are intuitively consistent and in good agreement with manual tests.
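
The slides do not say how the four ratings are merged into one rate. A minimal sketch, assuming each service's output has already been rescaled to [0, 1] and that the merge is a plain arithmetic mean (both are assumptions; `combined_sentiment` and the sample scores are hypothetical):

```python
from statistics import mean


def combined_sentiment(scores):
    """Average several sentiment scores, each assumed to lie in [0, 1]."""
    if not scores:
        raise ValueError("need at least one score")
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
    return mean(scores)


# Hypothetical outputs of four services for a single blog post.
post_scores = [0.81, 0.64, 0.75, 0.68]
rate = combined_sentiment(post_scores)
```

Averaging keeps the combined rate on the same scale as its inputs, so a value such as 0.72 can still be read as "warm positive".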

Page 7: Estimation of blog efficiency

We used several traffic rank systems:

• Alexa Rank, which audits and publishes the frequency of visits to various websites.

• Yandex Thematic Citation Index (TIC), which determines the “credibility” of Internet resources based on a qualitative assessment of links to them from other sites.

• Google PageRank, which counts the number and quality of links to a blog to produce a rough estimate of how important the website is.

Page 8: Relevance Rate

Content relevance rate is based on fuzzy string matching:

- Every company product name was string-matched against all blogs.

- String matching is based on the Levenshtein distance.

- Pages with a 90% matching rate were marked.

- Tests with direct brand-name matching showed about 90-100% accuracy for each product name, depending on the words in the title.

- The resulting relevance rate for each author is the sum of the marks over all of his or her pages.

Page 9: Levenshtein distance

Levenshtein distance is a string metric for measuring the difference between two sequences.

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

The Levenshtein-based similarity score between 'beer' and 'bread' is 44/100 (the raw edit distance between them is 3).
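
The definition above, together with the 90% matching rule used for the relevance rate, can be sketched as follows. This is an illustration, not the author's code: the normalisation in `similarity` (substitutions counted as two edits, divided by the combined length) is an assumption chosen because it reproduces the 44/100 figure, and the word-window scan in `mentions` is a hypothetical way to look for a product name inside a page.

```python
def levenshtein(a: str, b: str, sub_cost: int = 1) -> int:
    """Edit distance; with sub_cost=1 this is the classic Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation to [0, 1]: substitutions cost 2 edits."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - levenshtein(a, b, sub_cost=2)) / total


def mentions(product: str, text: str, threshold: float = 0.9) -> bool:
    """True if some word window of the text matches the product name."""
    p_words = product.lower().split()
    t_words = text.lower().split()
    if not p_words:
        return False
    target = " ".join(p_words)
    n = len(p_words)
    for i in range(max(len(t_words) - n + 1, 0)):
        window = " ".join(t_words[i:i + n])
        if similarity(target, window) >= threshold:
            return True
    return False
```

With these definitions, `levenshtein("beer", "bread")` is 3 and `round(100 * similarity("beer", "bread"))` is 44. An author's relevance rate would then be the number of his or her pages for which `mentions` returns `True`.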

Page 10: Sentiments vs Blog size

The most active authors write within a narrow range of sentiment rate: 0.74 ± 0.03.

[Figure axes: Sentiment rate; Blog size (pages)]

Page 11: Discussion vs Blog size

The most discussed blogs are those of mid-sized authors.

[Figure axes: Log(Blog size); Mean discussion]

Page 12: Words vs Pages

Again, 2 kinds of bloggers:

- 'Twitter style' authors with short posts

- full-length bloggers

[Figure axes: Log(mean words per page); Log(Blog size)]

Page 13: Discussion vs Sentiments

If you want to start a big discussion, you should praise something.

All highly discussed authors have a positive sentiment rate (>= 0.4).

[Figure axes: Sentiment rate; Mean discussion]

Page 14: Using the Klout score for bloggers

We use the Klout service to rank authors by online social influence. Klout measures the size of a user's social media network and correlates the content created to measure how other users interact with that content.

- the median Klout score is 40.1

Page 15: Sentiments vs Klout score

One can distinguish a population of beginner bloggers with a low Klout score who tend to amplify sentiment.

[Figure axes: Sentiment rate; Klout score]

Page 16: Final Author Rating

The final author rating is based on:

• Number of blog pages

• Mean discussion size

• AlexaRank + YandexTIC + Google PageRank

• Relevance rate

• Sentiment rate

• Klout score
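
The slides list the ingredients of the rating but not the formula. One possible way to combine such heterogeneous measures is sketched below, assuming every feature is oriented so that higher is better, is min-max normalised across authors, and is weighted equally. All of this is an assumption, and `author_rating` and the sample numbers are hypothetical.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1]; a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def author_rating(features):
    """features: {name: [value per author]}, higher assumed to be better.

    Returns one rating per author: the mean of the normalised features.
    """
    columns = [min_max(vals) for vals in features.values()]
    n_authors = len(columns[0])
    return [sum(col[i] for col in columns) / len(columns)
            for i in range(n_authors)]


# Hypothetical values for three authors.
ratings = author_rating({
    "pages": [100, 50, 10],
    "mean_discussion": [5, 20, 1],
    "sentiment": [0.7, 0.8, 0.6],
})
```

A rank-type measure where lower is better (such as a traffic rank) would have to be inverted before being passed in.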

Page 17

Make your data clever

Sentiment analysis:

• 4 independent sentiment rating systems are combined

• the resulting sentiment rate is fully consistent with tests

Blog efficiency rating:

• Alexa Rank

• Yandex Thematic Citation Index

• Google PageRank

Blog relevance rating:

• based on fuzzy string matching

• blogs are rated according to mentions of company products in the text

Pragmatic statistical information:

• list of the most PR-effective authors

• key recommendations for bloggers

Page 18: TOP Rated Authors

Name                             Url                                Sentiment  Pages  Mean Comments
Hayley Carr                      http://www.londonbeautyqueen.com   0.71       229    10.9
Luzanne                          http://pinkpeonies.co.za           0.77       66     68.3
Allison                          http://www.neversaydiebeauty.com   0.70       182    42.9
Mica Kelly, Beth, Jessica Diner  http://blog.birchbox.co.uk         0.74       196    0.26
Poonam                           http://beautyandmakeupmatters.com  0.78       142    4.3
Silvie                           http://mysillylittlegang.com       0.74       571    0.64

Page 19: Testing the result

Hayley Carr (Top Rated Author): “BlaBlaBla is definitely a brand to be reckoned with... All of the BlaBlaBla products have multiple purposes, as well as smelling and feeling fabulous; the packaging is clean and fresh whilst still looking great in your bathroom, as well as having unique application methods that only aid the product performance... It's definitely worth checking out this growing brand, before it starts taking over the world.”

Page 20: Authors ←→ Products

Page 21

In order to associate a blogger with a product we must:

• Find products for promotion

• Find main topics of each blogger

• Match topics of each blogger with product names

• Find best combinations of blogger and product

Page 22: Finding the most promising products for promotion

Page 23

In order to associate a blogger with a product we must:

• Find products for promotion

• Find main topics of each blogger

• Match topics of each blogger with product names

• Find best combinations of blogger and product

Page 24: Finding topics: the document-term matrix

Let's build a document-term matrix, where each row is a document, each column is a term, and a colored cell indicates that the term appears in the document at least once.

We can use the TF-IDF method to obtain the document-term matrix.
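
A minimal sketch of the binary version of this matrix, where a cell is 1 if the term occurs in the document at least once. The tokenizer and the sample posts are hypothetical; a real pipeline would likely also remove stop words.

```python
import re


def tokenize(text):
    """Lower-case the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document, one column per term."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for tok in tokenize(doc):
            matrix[i][index[tok]] = 1  # term appears at least once
    return vocab, matrix


posts = ["face oil for dry skin", "body oil review", "dry skin serum"]
vocab, matrix = document_term_matrix(posts)
```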

Page 25: Finding topics: TF-IDF

• Term frequency TF(t,d) is the number of times that term t occurs in document d.

• The inverse document frequency (IDF) is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.

• Term frequency–inverse document frequency (TF-IDF) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
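
The three definitions above can be put together in a few lines. This sketch uses the plain textbook weighting, raw count times log(N / document frequency); the slides do not say which TF-IDF variant was used, so the exact formula is an assumption, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [["oil", "oil", "skin"], ["skin", "serum"], ["skin", "mask"]]
weights = tf_idf(docs)
```

Here "skin" occurs in every document, so its weight is zero everywhere, while "oil" is frequent in the first document and absent elsewhere, so it gets the largest weight there.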

Page 26: Topic extraction: NMF

• Non-negative matrix factorization (NMF) is a variant of matrix factorization: we start with the document-term matrix D, approximate it as the product D ≈ W·T, and constrain the elements of W and T to be non-negative.

• This lets us interpret each row of the matrix T as a topic.
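
A small pure-Python sketch of this factorization, using the classic multiplicative update rules for the squared reconstruction error. The slides do not say which NMF solver was used, so the choice of algorithm, the function names, and the toy matrix are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(d, w, t):
    """Frobenius norm of D - W*T."""
    approx = matmul(w, t)
    return sum((x - y) ** 2
               for rd, ra in zip(d, approx) for x, y in zip(rd, ra)) ** 0.5


def nmf(d, k, n_iter=200, seed=0, eps=1e-9):
    """Factor D (n x m) into non-negative W (n x k) and T (k x m)."""
    rng = random.Random(seed)
    n, m = len(d), len(d[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    t = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # T <- T * (W^T D) / (W^T W T)
        wt = transpose(w)
        num, den = matmul(wt, d), matmul(matmul(wt, w), t)
        t = [[t[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (D T^T) / (W T T^T)
        tt = transpose(t)
        num, den = matmul(d, tt), matmul(w, matmul(t, tt))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, t


# Toy document-term matrix with two groups of co-occurring terms.
d = [[2, 1, 0, 0], [4, 2, 0, 0], [0, 0, 1, 3], [0, 0, 2, 6]]
w, t = nmf(d, k=2)
```

Each row of `t` holds one weight per term; the terms with the largest weights in a row describe that topic, and each row of `w` says how strongly a document expresses each topic.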

Page 27

In order to associate a blogger with a product we must:

• Find products for promotion

• Find main topics of each blogger

• Match topics of each blogger with product names

• Find best combinations of blogger and product

Page 28: Topic extraction

• For each author we build a document-term matrix.

• For each document-term matrix we perform matrix factorization and find the main topics.

• For each product we match the product name against the author's main topics and find the rate of intensity.

• If an author has the exact product name in one of his or her titles, we set the rate of intensity to 0 (the author has already reviewed the product).
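
The slides do not define how a product name is scored against the topics. The sketch below shows one hypothetical scoring: each topic is a dictionary of term weights (for example, a row of the T matrix), and the intensity is the best sum of weights over the words of the product name. Only the zeroing rule for already reviewed products comes from the slide; everything else is an assumption.

```python
def intensity(product, topics, titles):
    """Rate of intensity of one product for one author.

    topics: list of {term: weight} dicts for the author (hypothetical format).
    titles: the author's post titles.
    """
    name = product.lower()
    # Rule from the slide: an exact mention in a title means "already reviewed".
    if any(name in title.lower() for title in titles):
        return 0.0
    words = name.split()
    # Assumed scoring: best topic by summed weight of the product's words.
    return max((sum(topic.get(w, 0.0) for w in words) for topic in topics),
               default=0.0)


author_topics = [{"oil": 0.6, "face": 0.3}, {"hair": 0.9}]
new_product = intensity("Face Oil", author_topics, ["My hair routine"])
reviewed = intensity("Face Oil", author_topics, ["Face Oil review"])
```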

Page 29: The intensity matrix

Thus for each author-product pair we find the rate of intensity, which we can visualize as a heatmap where products are sorted by mean rate of intensity and authors are sorted by author rating.

Note: the top-rated authors show high intensity in the matrix.

Page 30

In order to associate a blogger with a product we must:

• Find products for promotion

• Find main topics of each blogger

• Match topics of each blogger with product names

• Find best combinations of blogger and product

Page 31: The intensity matrix

Next we extract the strongest peaks from the product-author intensity matrix. After each peak is extracted, the column containing it is dropped, so each author gets only one product.

We need to build recommendations for only 4 products, and we can select the 40 best-rated authors for this task.
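
The peak extraction described above can be read as a greedy loop: take the largest remaining cell, record the pair, remove that author's column, repeat. A minimal sketch under that reading, where the stopping rule `n_peaks` and the sample matrix are assumptions:

```python
def extract_peaks(matrix, products, authors, n_peaks):
    """Greedy peak extraction; matrix[i][j] is the intensity of
    product i for author j. Returns (product, author, intensity) tuples."""
    if not products:
        return []
    free = set(range(len(authors)))  # author columns not yet dropped
    pairs = []
    while free and len(pairs) < n_peaks:
        i, j = max(((i, j) for i in range(len(products)) for j in sorted(free)),
                   key=lambda ij: matrix[ij[0]][ij[1]])
        pairs.append((products[i], authors[j], matrix[i][j]))
        free.remove(j)  # drop the column: this author gets only one product
    return pairs


# Hypothetical 2 x 3 intensity matrix (rows: products, columns: authors).
products = ["Body Oil", "Face Serum"]
authors = ["A", "B", "C"]
matrix = [[0.9, 0.2, 0.4],
          [0.8, 0.7, 0.1]]
pairs = extract_peaks(matrix, products, authors, n_peaks=2)
```

Because only columns are dropped, the same product can be picked more than once; dropping the product's row as well would force one author per product.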

Page 32

In order to associate a blogger with a product we must:

• Find products for promotion

• Find main topics of each blogger

• Match topics of each blogger with product names

• Find best combinations of blogger and product

• Profit!

Page 33: The resulting associations

Product                   Author                 Url
BlaBlaBla Body Oil        Allison                http://www.neversaydiebeauty.com
BlaBlaBla Wrinkle Repair  Cindy Batchelor        http://mystylespot.net
BlaBlaBla Face Serum      Marie Papachatzis      http://iamthemakeupjunkie.blogspot.ru
BlaBlaBla Face Oil        Emily - Style Lobster  http://stylelobster.com
