Automatic Hierarchy Discovery and Opinion Mining of Political Blogs
Amit Goyal
Kristi McBurnie
November 28, 2007
Outline
Introduction
Previous Work
Our Approach
Example
Challenges and Future Work
Milestones
Conclusion
Introduction
The Web contains a wealth of opinions about products, politics, and more, in newsgroup posts, review sites, and elsewhere
Our interest: to mine opinions expressed in user generated content
Applications
Businesses and Organizations
  Market Intelligence: A huge amount of money is spent to find consumer sentiments and opinions
  Opinion polls, surveys
Individuals interested in others' opinions when
  Purchasing a product
  Finding opinions on political topics
  Using a service, etc.
Smart Ads
  Place an ad when one praises a product
  Place an ad from a competitor if one criticizes a product
Opinion Search
  Provide search for opinions
  Give me opinions on “gmail”
  Give me comparisons between “gmail vs yahoomail”
Types of opinions
Direct Opinions: sentiment expressions on objects, e.g. policies, politicians, movies, products
  E.g. “I find myself in support of the Senate Judiciary Committee, which approved legislation that clears the way for millions of undocumented workers to continue working in America and seek citizenship.”
Comparisons: relations expressing similarities or differences of more than one object
  E.g. “I think Bush will beat Kerry in the presidential elections” or “The lens quality of Camera A is better than Camera B”
Problem Statement
Given an object and a collection of reviews on it, the task is:
Identification of features
Making a hierarchy of features
Sentiment Analysis: determining the orientation and strength
Providing a visualization (summary)
Previous Work Mainly focused on product and movie reviews Feature Extraction
Opinion Observer (Hu and Liu, 2004) Opine (Popescu and Etzioni, 2005) Red Opal (Scaffidi, 2007)
Hierarchical Discovery To be filled by kristi
Previous Work
Opinion Observer By Bing Liu and Minqing Hu Feature Extraction
Identify Nouns using POS tagging Identify Noun phrases by Association Rule Mining Compactness pruning, redundancy pruning Opinion word extraction Infrequent feature identification
72% precision and 80% recall
Previous Work
OPINE Feature Extraction
First, extracts nouns and noun phrases, and retains those with frequency greater than some threshold
Evaluates each noun phrase by computing the PMI (point-wise mutual information) scores between the phrase and meronymy discriminators associated with the product class
E.g. “of scanner”, “scanner has”, “scanner comes with” etc. for the Scanner class
PMI(f,d) = Hits(d+f) / (Hits(d) * Hits(f))
Then, PMI scores are converted to binary features for a Naïve Bayes Classifier, which outputs a probability associated with each fact
Compared to Hu and Liu's work, 22% better precision and 3% lower recall
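The PMI formula above can be sketched in a few lines. This is only an illustration: the hit counts and the binarization threshold below are made-up values, not figures from OPINE, and the helper names are hypothetical.

```python
def pmi(hits_joint: int, hits_d: int, hits_f: int) -> float:
    """PMI(f, d) = Hits(d + f) / (Hits(d) * Hits(f)), as written on the slide."""
    if hits_d == 0 or hits_f == 0:
        return 0.0  # assumption: treat an unseen phrase as zero association
    return hits_joint / (hits_d * hits_f)


def binarize(score: float, threshold: float) -> int:
    """Turn a PMI score into a 0/1 feature (the threshold is an assumption)."""
    return 1 if score >= threshold else 0


# Hypothetical hit counts for the discriminator "scanner has" and the
# candidate feature "cover".
hits_d = 2000      # Hits("scanner has")
hits_f = 5000      # Hits("cover")
hits_joint = 150   # Hits("scanner has cover")

score = pmi(hits_joint, hits_d, hits_f)
feature = binarize(score, threshold=1e-5)
```

One binary feature per discriminator would then be fed to the Naïve Bayes classifier; that step is omitted here.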
Previous Work
Red Opal 3 components:
Feature Extractor Product Scorer User Interface
Performs better than Opinion Observer
Previous Work
Red Opal Feature Extraction
POS tagging, takes noun and noun phrases as potential features
Use lemma frequency to rank the features Product Scoring: Score of feature f of product p
o(r,f) is the number of occurrences of feature f in review r
w(r,f) is the weight of feature f in review r
Previous Work
Clustering
Conceptual clustering CLUSTER/2
Places object descriptions and attributes together to obtain domain-dependent goals
COBWEB Favours classes that maximize the information that can be predicted
from knowledge of class membership
Hierarchical clustering BIRCH
Hierarchically cluster elements in a dataset Level of clustering quality = level in the hierarchy
Previous Work
Hierarchy Discovery
Han and Fu formally define a concept hierarchy as “A sequence of mapping from a set of lower-level concepts to their higher-level correspondences”
  DBLearn automatically discovered a hierarchy of concepts for the purpose of data mining
  E.g. birthplace may have the following hierarchy: city, province, country
Foreman et al.
  Trains categorizers and automatically constructs a hierarchy of categories using human trainers
  Good GUI
  Difficult for novice users and hard to optimize
Previous Work
Hierarchy Discovery
Sanderson and Croft Automatically develop hierarchy in web documents Organize extracted words/phrases using subsumption No clustering or training techniques
Yang and Lee Hierarchies of web directories Text mining to discover relationships between documents and
between words Cluster them into document and word maps
Previous Work
Sentiment Analysis
Esuli and Sebastiani 3 stages:
Determine subjective/objective polarity Determine positive/negative polarity Determine strength of the positive/negative polarity
Uses SentiWordNet to assign 3 scores to each word (objectivity, positivity, negativity)
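The three stages can be sketched with a per-word score triple. The scores below are invented for illustration (they are not real SentiWordNet entries), and the 0.5 objectivity cut-off and the strength measure are assumptions made for this sketch.

```python
from typing import NamedTuple, Optional, Tuple


class WordScores(NamedTuple):
    """Three scores per word, assumed here to sum to 1."""
    objectivity: float
    positivity: float
    negativity: float


def classify(scores: WordScores,
             obj_cutoff: float = 0.5) -> Tuple[str, Optional[str], float]:
    # Stage 1: subjective vs. objective (cut-off is an assumption).
    if scores.objectivity >= obj_cutoff:
        return ("objective", None, 0.0)
    # Stage 2: positive vs. negative.
    polarity = "positive" if scores.positivity >= scores.negativity else "negative"
    # Stage 3: strength, taken here as the gap between the two scores.
    strength = abs(scores.positivity - scores.negativity)
    return ("subjective", polarity, strength)


# Hypothetical lexicon entries, not taken from SentiWordNet.
toy_lexicon = {
    "absurd": WordScores(objectivity=0.25, positivity=0.0, negativity=0.75),
    "report": WordScores(objectivity=1.0, positivity=0.0, negativity=0.0),
}

absurd_result = classify(toy_lexicon["absurd"])
report_result = classify(toy_lexicon["report"])
```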
Previous Work
Sentiment Analysis Pang and Lee
Only subjective sections of the movie review Machine learning techniques
Pair-wise relations between extracts to build an undirected graph Minimum cut
Efficient and results in higher accuracy rates
Agarwal and Bhattacharyya: SVM classifier
  Determine strength of polarity of subjective adjectives in good vs bad classification based on WordNet’s synonymy graph
  Applied cut-based graph similar to Pang et al.
  Reached accuracies of 84%-95.6%
Our proposal
Apply feature extraction and opinion mining in the political domain
Applications in the political domain:
  Automatic opinion polls
  Identification of local/global issues in elections
  Targeted campaigning in elections
  Impact of speech
Output: <politician, topic, opinion, polarity>
  Objects are politicians
  Categories are political organizations
  Topics may be policies, issues, etc.
In this project, we focus mainly on feature extraction and hierarchy discovery of the features
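One possible way to represent the output tuple is a small record type. The field names follow the slide; the class name, the tallying helper, and the sample record (borrowed from the deck's worked example) are illustrative assumptions, not part of the proposal.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class OpinionRecord(NamedTuple):
    """One <politician, topic, opinion, polarity> tuple."""
    politician: str
    topic: str
    opinion: str
    polarity: str


def polarity_counts(records: Iterable[OpinionRecord]) -> Counter:
    """Tally polarities per (politician, topic), e.g. for an automatic opinion poll."""
    return Counter((r.politician, r.topic, r.polarity) for r in records)


records = [
    OpinionRecord(
        politician="George Bush",
        topic="War in Iraq",
        opinion="I think this is an absurd amount of money to be spending "
                "on killing people and freeing oil fields.",
        polarity="negative",
    ),
]

counts = polarity_counts(records)
```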
Our Approach
Observations
Two kinds of opinions: Direct – talks about single object Comparison – talks about multiple objects
Two kinds of information Facts (objective) Opinions (subjective)
Sentiment Analysis can be done only on subjective information
Although features occur in both categories, subjective sentences are noisy
Comparison to product domain
                  Product Domain                    Political Domain
Category          Product Category (e.g. Camera)    Political Organizations (e.g. Democrats)
Object            Product (e.g. Camera A)           Leaders (e.g. Bush)
Features/Topics   Properties (e.g. lens)            Policies (e.g. Immigration)
Our Approach
Perform feature extraction
Split into objective and subjective phrases
Hierarchy discovery on features from objective sentences
Sentiment analysis on features from subjective sentences
Our Approach
Feature Extraction
Extract the features
  Extract nouns from POS tagging
  Extract noun phrases from Association Rule Mining
  Pruning
  Rank the features based on lemma frequency
Identify the subjectivity of all sentences
  Mine the opinion words (adjectives)
  Use key phrases dictionary (e.g. “can you believe”, “I think”, “I recommend” etc.)
  Visual differences: factual data is often represented in quotes
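The subjectivity step could be approximated with a key-phrase and opinion-word lookup, as sketched below. The phrase list comes from the slide; the one-word adjective lexicon and the function name are assumptions, and the "visual differences" cue (quoted text) is not handled.

```python
import re

# Key phrases from the slide; a real dictionary would be much larger.
KEY_PHRASES = ("can you believe", "i think", "i recommend")

# Hypothetical opinion-word lexicon (adjectives), for illustration only.
OPINION_ADJECTIVES = {"absurd"}


def is_subjective(sentence: str) -> bool:
    """Flag a sentence as subjective if it has a key phrase or an opinion word."""
    lowered = sentence.lower()
    for phrase in KEY_PHRASES:
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            return True
    words = set(re.findall(r"[a-z']+", lowered))
    return bool(words & OPINION_ADJECTIVES)


objective_sentence = ("The economic cost of the war in Iraq is estimated "
                      "to total $1.3 trillion.")
subjective_sentence = ("I think this is an absurd amount of money to be "
                       "spending on killing people and freeing oil fields.")

flags = [is_subjective(objective_sentence), is_subjective(subjective_sentence)]
```

A lookup like this is exactly where the noise mentioned earlier shows up: opinion words also occur in objective sentences, so a real system would need more than string matching.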
Our Approach
Hierarchy Discovery
3 approaches:
Subsumption (Sanderson and Croft)
  Look at every pair of terms and apply subsumption
  X subsumes Y if the documents in which Y occurs are a subset of the documents in which X occurs
  P(X|Y) = 1 and P(Y|X) < 1
Clustering
Use DBpedia and/or YAGO
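The subsumption test can be sketched directly from document sets. The term-to-document mapping below is made up, and the `threshold` parameter (defaulting to the strict value 1 from the slide) is an added assumption to allow a relaxed test.

```python
from typing import Dict, List, Set, Tuple


def subsumes(docs_x: Set[int], docs_y: Set[int], threshold: float = 1.0) -> bool:
    """True if X subsumes Y: P(X|Y) >= threshold and P(Y|X) < 1."""
    if not docs_x or not docs_y:
        return False
    both = len(docs_x & docs_y)
    p_x_given_y = both / len(docs_y)
    p_y_given_x = both / len(docs_x)
    return p_x_given_y >= threshold and p_y_given_x < 1.0


def subsumption_pairs(term_docs: Dict[str, Set[int]]) -> List[Tuple[str, str]]:
    """Check every ordered pair of terms; return (parent, child) pairs."""
    pairs = []
    for x, docs_x in term_docs.items():
        for y, docs_y in term_docs.items():
            if x != y and subsumes(docs_x, docs_y):
                pairs.append((x, y))
    return sorted(pairs)


# Hypothetical document IDs in which each term occurs.
term_docs = {
    "war in Iraq": {1, 2, 3, 4},
    "economic cost": {2, 3},
    "immigration": {5},
}

pairs = subsumption_pairs(term_docs)
```

Here "war in Iraq" subsumes "economic cost" because every document mentioning the cost also mentions the war, but not the other way round.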
Our Approach
Hierarchy Discovery
3 approaches:
Subsumption
Clustering (Yang and Lee)
  Cluster phrases by co-occurrence
  Uses an unsupervised learning algorithm: SOM networks
  Organizes phrases into a 2D map of neurons according to the similarity of their vectors
  3 steps: training process, assigning phrases to a neuron, labelling process
Use DBpedia and/or YAGO
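The first two SOM steps (training, then assigning phrases to neurons) can be sketched in plain Python. This is a toy self-organizing map, not Yang and Lee's implementation: the grid size, learning-rate schedule, neighbourhood rule, and the co-occurrence vectors are all assumptions, and the labelling step is omitted.

```python
import random
from typing import Dict, List, Tuple

Grid = Dict[Tuple[int, int], List[float]]


def best_matching_unit(weights: Grid, vector: List[float]) -> Tuple[int, int]:
    """Return the grid position whose weight vector is closest to `vector`."""
    return min(weights,
               key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], vector)))


def train_som(vectors: List[List[float]], rows: int, cols: int,
              epochs: int = 50, lr0: float = 0.5, seed: int = 0) -> Grid:
    """Step 1: train a rows x cols map on the phrase vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        lr = lr0 * (1.0 - epoch / epochs)  # linearly decaying learning rate
        for vector in vectors:
            br, bc = best_matching_unit(weights, vector)
            for (r, c), w in weights.items():
                # Update the winner and its direct grid neighbours.
                if abs(r - br) + abs(c - bc) <= 1:
                    for i in range(dim):
                        w[i] += lr * (vector[i] - w[i])
    return weights


def assign(weights: Grid, phrases: Dict[str, List[float]]) -> Dict[Tuple[int, int], List[str]]:
    """Step 2: assign each phrase to its best-matching neuron."""
    neurons: Dict[Tuple[int, int], List[str]] = {}
    for name, vector in phrases.items():
        neurons.setdefault(best_matching_unit(weights, vector), []).append(name)
    return neurons


# Hypothetical co-occurrence vectors for four phrases.
phrases = {
    "war in Iraq": [1.0, 1.0, 0.0],
    "Iraq war": [1.0, 1.0, 0.0],
    "immigration": [0.0, 0.0, 1.0],
    "citizenship": [0.0, 0.1, 1.0],
}

som = train_som(list(phrases.values()), rows=2, cols=2)
clusters = assign(som, phrases)
```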
Our Approach
Hierarchy Discovery
3 approaches:
Subsumption
Clustering
  Find a group of dominating clusters (neurons)
  Make these superclusters and put their neighbours one level down
  Repeat for the lower levels of the hierarchy under each subcluster
Use DBpedia and/or YAGO
Our Approach
Hierarchy Discovery
3 approaches: Subsumption Clustering Use DBpedia and/or YAGO
DBpedia provides 3 classification schemes: Wikipedia categories, YAGO classification, WordNet synset links
Our Approach
Hierarchy Discovery
3 approaches: Subsumption Clustering Use DBpedia and/or YAGO
Our Approach
Hierarchy Discovery
Our Approach
Sentiment Analysis
2 ways to approach this:
Subjective phrases
  What does the public think about each policy
Objective phrases
  What is the policy
  Rank parties on each policy on a scale from right-wing to left-wing
Our Approach
Sentiment Analysis
Subjective phrases: What does the public think about the policy
Cut-based classification (Pang and Lee)
  Individual scores
  Association scores
  Partition cost
A cut (S,T) of G is a partition of its nodes into sets S = {s} ∪ S’ and T = {t} ∪ T’, where s is not contained in S’ and t is not contained in T’. Its cost, cost(S,T), is the sum of the weights of all edges crossing from S to T.
A minimum cut of G is one of minimum cost.
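The cut cost defined above can be illustrated on a tiny graph by brute force. The sketch assumes the usual construction: the source s links to each sentence with its "subjective" individual score, each sentence links to the sink t with its "objective" individual score, and sentence pairs are linked by association scores. All scores below are made up, and enumerating every partition only works for toy inputs; a max-flow algorithm would be used in practice.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def cut_cost(source_side: Set[str],
             ind_subj: Dict[str, float],
             ind_obj: Dict[str, float],
             assoc: Dict[Tuple[str, str], float]) -> float:
    """Sum of the weights of all edges crossing from S (subjective) to T (objective)."""
    cost = 0.0
    for x in ind_subj:
        # x in S cuts its edge to t; x in T cuts its edge from s.
        cost += ind_obj[x] if x in source_side else ind_subj[x]
    for (a, b), weight in assoc.items():
        if (a in source_side) != (b in source_side):
            cost += weight  # association edge crosses the cut
    return cost


def minimum_cut(ind_subj, ind_obj, assoc) -> Tuple[Set[str], float]:
    """Try every partition and keep the cheapest one (toy sizes only)."""
    items = sorted(ind_subj)
    best_side, best_cost = set(), cut_cost(set(), ind_subj, ind_obj, assoc)
    for k in range(1, len(items) + 1):
        for subset in combinations(items, k):
            cost = cut_cost(set(subset), ind_subj, ind_obj, assoc)
            if cost < best_cost:
                best_side, best_cost = set(subset), cost
    return best_side, best_cost


# Hypothetical individual and association scores for three sentences.
ind_subj = {"a": 0.9, "b": 0.5, "c": 0.1}
ind_obj = {"a": 0.1, "b": 0.5, "c": 0.9}
assoc = {("a", "b"): 0.6, ("b", "c"): 0.1}

subjective, cost = minimum_cut(ind_subj, ind_obj, assoc)
```

Sentence b is undecided on its own (0.5 vs 0.5), but its strong association with a pulls it onto the subjective side.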
Our Approach
Sentiment Analysis
Subjective phrases: What does the public think about each policy
Agarwal and Bhattacharyya
  Determine adjective strength
  Cut-based classification between sentences (Pang and Lee)
  Cut-based classification between documents
  Improved accuracy
Our Approach
Sentiment Analysis
Objective phrases: What is the policy
Rank parties on each policy on a scale from right-wing to left-wing
Polarity would be defined as left/right, using a comparison of left-wing and right-wing policies/ideals, instead of the traditional positive/negative defined using the ideal words ‘poor’ and ‘excellent’
Left-wing (Liberal)
Right-wing (Conservative)
Example
The economic cost of the war in Iraq is estimated to total $1.3 trillion – roughly double the amount the White House has requested thus far, according to a new report by Democrats on Congress’ Joint Economic Committee. I think this is an absurd amount of money to be spending on killing people and freeing oil fields.
Ideal case:
Political Organization = Republicans
Politician = George Bush
Topic = War in Iraq
Sub-topic = cost
Opinion words = absurd, killing, freeing
Polarity = negative
Example
Feature Extraction:
Noun phrases: economic cost, war in Iraq, amount, report, amount, money, people, oil fields
Proper nouns: White House, Democrats on Congress’ Joint Economic Committee
Frequent features: economic cost, war in Iraq, money, oil fields, White House
Example
Identification of Subjectivity
Opinion words: think, absurd
1st sentence is objective, and 2nd is subjective
Interesting features: economic cost, war in Iraq
Example
Hierarchy Discovery – step 1
Identification of category/object for proper nouns using DBpedia
Category = Republicans
Object = George Bush
Example
Hierarchy Discovery – step 2
Identification of policy hierarchy using subsumption and clustering
Policies are derived from the interesting features: economic cost, war in Iraq
Example
Sentiment Analysis
Opinion is the subjective sentence
Polar words: absurd, spending, killing, freeing
Polarity: Negative
Challenges
Difficult to distinguish between objective and subjective information
Opinion words also occur in objective sentences
Identification of spam blogs
Identification of implicit features
Mapping politician to the policy in comparison blogs
Deciding on a distance measurement for clustering
Future Work
Implementation of algorithms
Summarization of opinions
Visualization
Refinements
Milestones
Decide on domain
Read previous works
Decide on an approach that is best for the domain
Write up an example to illustrate it
Challenges and future work
Presentation
Write the paper
Questions?
Previous Work
OPINE (Backup Slide)
Overall Process
Previous Work
Opinion Observer (Backup Slide)
By Bing Liu and Minqing Hu