

Deriving Topics and Opinions from Microblog

Research Proposal

Feng Jiang

fenjy009

110032208

10/06/2012

Supervisor: Jixue Liu

Associate Supervisor: Jiuyong Li


Contents

1 Statement of the Research Topic
1.1 Field of thesis
1.2 Context of research
1.3 Significance of research
1.4 Research questions
1.4.1 Topic extraction
1.4.2 Opinion extraction
1.5 Research sub questions
1.5.1 Updating problem
2 Review of literature
2.1 Keyword Extraction
2.2 Topic Extraction
2.3 Sentiment analysis
3 Significance and contributions
4 Methodologies
4.1 Text pre-processing
4.1.1 Word stemming
4.1.2 Feature filtering
4.2 Text Representation
4.2.1 Vector Space Model (VSM)
4.2.2 Feature item weighting
4.2.3 Text similarity
4.3 Dimensionality reduction
4.3.1 Feature selection
4.3.2 Feature Extraction
4.4 Classification
4.5 Clustering
5 Scope and limits
6 Project Plan
7 References
8 Trial Table of Contents of thesis


1 Statement of the Research Topic

1.1 Field of thesis

Text Mining

1.2 Context of research

In this day and age, microblogs play a pivotal role in people's daily lives. One famous microblog site is Twitter; approximately 20 million visitors use it every day [1]. Unlike traditional media such as TV, newspapers and magazines, it allows the public to freely express their own thoughts and ideas, discuss important events, share useful information and knowledge, and facilitate the dissemination of news. Although it contains plenty of useful information, it is very hard for individuals to manually seek and track important events, identify fashion trends and find popular products because of the sheer number of posts. It is very easy and convenient for bloggers to write and publish microblogs, so they often publish microblogs that are useless. Twitter, for example, allows users to send 140-character microposts, and about 40 million microposts are published per day. Therefore, Twitter now contains numerous posts and the number is increasing every day [3].

Obviously, we could draw on existing web and text mining methods to analyse and mine microblog data. However, the characteristics of microblogs pose many problems and challenges, which means we cannot directly apply these techniques to extract hot topics and opinions. Firstly, the data on microblogs is semi-structured or unstructured. Microblog writing has no single format, so bloggers often write in a free style that may contain grammatical and spelling errors; bloggers are also willing to use new words, grammar and abbreviations to express their opinions and emotions. Secondly, bloggers update their microblogs frequently: they can use smartphones to publish posts anywhere and at any time, so the update speed is increasingly fast. Thirdly, a post may have several comments, and it is difficult to judge the opinion of each comment [1,2,3]. For instance, some comments may not refer to the topic at all but relate to other events, and it is therefore hard to filter them.

1.3 Significance of research

It is important to explore good methods to mine microblogs because they contain plenty of spam. Taking Twitter as an example, the majority of microposts are pointless; only approximately 3.6% of microposts concern mainstream topics. Therefore, if we can find a good approach to detect topics and analyse sentiment, people can save a great deal of time and energy: they do not have to read many similar microposts and comments, and can instead quickly learn what is popular. On top of that, readers can seek and track important events, identify fashion trends and find popular products; governments can gather residents' opinions to improve public services; and companies can find customers' real needs in order to produce better products.

1.4 Research questions

This minor thesis will propose a method to categorise microposts, detect topics within the same categories, and analyse the sentiment of microposts to identify personal tendency.


1.4.1 Topic extraction

Although the number of posts on microblogs is very large, we can classify them into different categories and find a method to summarise the posts belonging to the same category with a brief sentence. Right now there are many keyword detection methods, but sentence detection approaches are lacking. So the problem is how to find the useful keywords to reduce dimensionality and create the topic sentence.

1.4.2 Opinion extraction

After generating the topic sentence, the question is how to analyse the topic sentences and then conduct sentiment analysis to find who supports the topic and who opposes it.

1.5 Research sub questions

1.5.1 Updating problem

There is a mass of data on microblogs. Information grows explosively and is frequently published and updated, generating large quantities of data. So the problem is how to generate stable clusters and deal with the latest posts after clustering.

2 Review of literature

2.1 Keyword Extraction

Keyword extraction is one of the most essential elements of the project, because both topic detection and sentiment analysis run their algorithms over the extracted keywords. In this part, I review some classical methods using specific examples [4].

Position Weight (PW). It considers the position of words in the article. However, it lays too much emphasis on the position feature, so it suits structured documents that have a clear title, abstract, subtitles, conclusion and so on. It may not be appropriate for unstructured documents like blogs and microblogs [5,6].

Word Frequency. Word frequency means how many times a word appears in a document. If a word's frequency is less than a chosen threshold, the word can be ignored [7].

Disadvantage: plenty of information retrieval research confirms that words with a small frequency may sometimes contain more information. Therefore, during feature selection, words should not simply be deleted according to their frequency [7,8].

Document frequency thresholding (DF). This is one of the simplest feature selection algorithms. The document frequency of a word is the number of documents in the whole collection or corpus that contain it. The advantage of DF is that it requires very few calculations and runs quickly, so it can be applied to large-scale classification tasks. In spite of this, DF is not without its downside: rare words may be concentrated in one kind of document and may carry important discriminative information, so if we simply give up these words, the accuracy of the classifier may be affected [6].

For example:


Suppose we have 10,000 documents, so the document set size is 10,000. If 1,000 documents include t1, 9,000 include t2, 4,000 include t3, 10 include t4, 5,500 include t5, and 200 include t6, then the document frequencies are

DF(t1) = 1,000 / 10,000 = 0.1
DF(t2) = 9,000 / 10,000 = 0.9
DF(t3) = 4,000 / 10,000 = 0.4
DF(t4) = 10 / 10,000 = 0.001
DF(t5) = 5,500 / 10,000 = 0.55
DF(t6) = 200 / 10,000 = 0.02

If we set the threshold band 0.3 < DF < 0.6, the keywords t3 and t5 are selected.
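The selection above is easy to check mechanically. The sketch below uses the counts that the example's DF values are actually computed from; the term names and the 0.3 to 0.6 band are taken from the example.

```python
# Document-frequency thresholding over the worked example's collection.
N = 10_000
containing = {"t1": 1_000, "t2": 9_000, "t3": 4_000,
              "t4": 10, "t5": 5_500, "t6": 200}

# DF(t) = documents containing t / total documents.
df = {term: count / N for term, count in containing.items()}

# Keep terms inside the band: very rare terms are likely noise,
# very common terms carry little discriminative information.
selected = [term for term, score in df.items() if 0.3 < score < 0.6]
print(selected)  # prints ['t3', 't5']
```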

Information gain (IG). It is widely used in machine learning. For each term, it considers both the documents in which the term occurs and those in which it does not. However, information gain requires large-scale computation when the number of microposts is big [9].

Formula (natural logarithm; reconstructed from the worked example below):

IG(t) = -Σi P(ci) log P(ci) + P(t) Σi P(ci|t) log P(ci|t) + P(~t) Σi P(ci|~t) log P(ci|~t)

For example: suppose the document set is 10,000 and there are two classes named c1 and c2; c1 has 5,000 documents and c2 has 5,000 documents.

keyword/class   c1      c2      total
t1              5,000   2,000   7,000
t2              10      5,000   5,010
t3              5,000   5,000   10,000

This table means that 7,000 documents contain keyword t1; of these, 5,000 belong to class c1 and 2,000 to class c2.

IG(t1) = -(5,000/10,000 * log(5,000/10,000) + 5,000/10,000 * log(5,000/10,000))
         + 7,000/10,000 * (5,000/7,000 * log(5,000/7,000) + 2,000/7,000 * log(2,000/7,000))
         + 3,000/10,000 * (0/3,000 * log(0/3,000) + 3,000/3,000 * log(3,000/3,000))
       = 0.693 - 0.7 * (0.240 + 0.358) - 0
       = 0.274

IG(t2) = -(5,000/10,000 * log(5,000/10,000) + 5,000/10,000 * log(5,000/10,000))
         + 5,010/10,000 * (10/5,010 * log(10/5,010) + 5,000/5,010 * log(5,000/5,010))
         + 4,990/10,000 * (4,990/4,990 * log(4,990/4,990) + 0/4,990 * log(0/4,990))
       = 0.693 - 0.007
       = 0.686

IG(t3) = -(5,000/10,000 * log(5,000/10,000) + 5,000/10,000 * log(5,000/10,000))
         + 10,000/10,000 * (5,000/10,000 * log(5,000/10,000) + 5,000/10,000 * log(5,000/10,000))
       = 0.693 - 0.693
       = 0

(All logarithms are natural; terms with a zero count contribute 0, since x log x tends to 0 as x tends to 0.)

The results are:

IG(t1) = 0.274
IG(t2) = 0.686
IG(t3) = 0

So keyword t2 is the best one to use for classification.
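The arithmetic above can be reproduced with a short Python sketch (natural logarithm, two classes, counts from the table; results match the worked example up to rounding):

```python
import math

def entropy(probs):
    """Shannon entropy (natural log); zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def information_gain(class_sizes, with_term, n_total):
    """IG(t) = H(C) - P(t) * H(C | t present) - P(~t) * H(C | t absent)."""
    n_t = sum(with_term)
    n_not = n_total - n_t
    without_term = [c - w for c, w in zip(class_sizes, with_term)]
    h_c = entropy([c / n_total for c in class_sizes])
    h_t = entropy([w / n_t for w in with_term]) if n_t else 0.0
    h_not = entropy([w / n_not for w in without_term]) if n_not else 0.0
    return h_c - (n_t / n_total) * h_t - (n_not / n_total) * h_not

classes = [5_000, 5_000]  # |c1|, |c2|
print(round(information_gain(classes, [5_000, 2_000], 10_000), 3))  # t1: 0.274
print(round(information_gain(classes, [10, 5_000], 10_000), 3))     # t2: 0.686
print(round(information_gain(classes, [5_000, 5_000], 10_000), 3))  # t3: 0.0
```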

Mutual information (MI). It evaluates the statistical dependence between a word and a category. Its time complexity is similar to that of information gain; indeed, information gain can be seen as a weighted average of the mutual information scores. However, it has a disadvantage: the score is easily influenced by the marginal probability of the term [10].


Formula (natural logarithm; reconstructed from the worked example below):

MI(t, c) = log(P(t|c) / P(t)) = log P(t|c) - log P(t)

For example: suppose the document set is 10,000 and there are two classes named c1 and c2; c1 has 5,000 documents and c2 has 5,000 documents. The table below means that 5,000 documents contain keyword t1; of these, 4,000 belong to class c1 and 1,000 to class c2.

keyword/class   c1      c2      total
t1              4,000   1,000   5,000
t2              1       5,000   5,001
t3              5,000   5,000   10,000
t4              10      1       11

MI(t1, c1) = log(4,000/5,000) - log(5,000/10,000) = -0.223 + 0.693 = 0.470
MI(t1, c2) = log(1,000/5,000) - log(5,000/10,000) = -1.609 + 0.693 = -0.916
MI(t2, c1) = log(1/5,000) - log(5,001/10,000) = -8.517 + 0.693 = -7.824
MI(t2, c2) = log(5,000/5,000) - log(5,001/10,000) = 0 + 0.693 = 0.693
MI(t3, c1) = log(5,000/5,000) - log(10,000/10,000) = 0
MI(t3, c2) = log(5,000/5,000) - log(10,000/10,000) = 0
MI(t4, c1) = log(10/5,000) - log(11/10,000) = -6.215 + 6.812 = 0.597
MI(t4, c2) = log(1/5,000) - log(11/10,000) = -8.517 + 6.812 = -1.705

Taking the maximum over the classes:

MImax(t1) = 0.470
MImax(t2) = 0.693
MImax(t3) = 0
MImax(t4) = 0.597

The result is MImax(t2) > MImax(t4) > MImax(t1) > MImax(t3), so keywords t2 and t4 are more important than the others and should be selected. However, this exposes a drawback: t4 appears in only 11 documents, yet its MImax is higher than that of t1, which appears in 5,000. The method considers words with a low document frequency to be more important than words with a high document frequency.
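A minimal sketch of the score used above, MI(t, c) = log P(t|c) - log P(t), with the natural logarithm and the counts from the table (results match the worked example up to rounding):

```python
import math

def mutual_information(n_tc, n_c, n_t, n_total):
    """MI(t, c) = log(P(t|c)) - log(P(t)).

    n_tc: documents of class c containing t
    n_c:  documents of class c
    n_t:  documents containing t (any class)
    """
    return math.log(n_tc / n_c) - math.log(n_t / n_total)

print(round(mutual_information(4_000, 5_000, 5_000, 10_000), 3))  # t1, c1: 0.47
print(round(mutual_information(1, 5_000, 5_001, 10_000), 3))      # t2, c1: -7.824
print(round(mutual_information(10, 5_000, 11, 10_000), 3))        # t4, c1: 0.598
```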

χ2 Statistic (CHI). It assumes that the occurrence of a word and the document category follow a χ2 distribution with one degree of freedom, and then measures the degree of association between words and document categories [5].

Formula (reconstructed from the worked example below; A is the number of documents in class c containing term t, B the number of documents outside c containing t, C the number of documents in c not containing t, D the number of documents outside c not containing t, and N = A + B + C + D):

CHI(t, c) = N * (A*D - B*C)^2 / ((A+C) * (B+D) * (A+B) * (C+D))

For example: suppose the document set is 10,000 and there are two classes named c1 and c2; c1 has 5,000 documents and c2 has 5,000 documents.

keyword/class   c1      c2      total
t1              4,000   1,000   5,000
t2              1       5,000   5,001
t3              5,000   5,000   10,000
t4              10      1       11

CHI(t1, c1) = 10,000 * (4,000 * 4,000 - 1,000 * 1,000)^2 / (5,000 * 5,000 * 5,000 * 5,000) = 3,600
CHI(t1, c2) = 3,600 (with two classes, the table is symmetric between c1 and c2)
CHI(t2, c1) = 10,000 * (1 * 0 - 5,000 * 4,999)^2 / (5,000 * 5,000 * 5,001 * 4,999) ≈ 10,000
CHI(t2, c2) ≈ 10,000
CHI(t3, c1) = 0
CHI(t3, c2) = 0
CHI(t4, c1) = 10,000 * (10 * 4,999 - 1 * 4,990)^2 / (5,000 * 5,000 * 11 * 9,989) ≈ 7.37
CHI(t4, c2) ≈ 7.37

The maximum scores are CHImax(t1) = 3,600, CHImax(t2) ≈ 10,000, CHImax(t3) = 0, CHImax(t4) ≈ 7.37, so CHImax(t2) > CHImax(t1) > CHImax(t4) > CHImax(t3).

So the keyword t2 is the best one.
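The χ2 scores can be checked with a sketch of the standard 2x2 contingency-table form (results match the worked example up to rounding; the zero-denominator guard handles a term that appears in every document):

```python
def chi_square(a, b, c, d):
    """chi2(t, c) from a 2x2 contingency table:
    a = docs in class with term,    b = docs outside class with term,
    c = docs in class without term, d = docs outside class without term."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term occurs in all (or no) documents: no association
    return n * (a * d - b * c) ** 2 / denom

print(chi_square(4_000, 1_000, 1_000, 4_000))    # t1, c1: 3600.0
print(round(chi_square(1, 5_000, 4_999, 0)))     # t2, c1: 9996 (~10,000)
print(chi_square(5_000, 5_000, 0, 0))            # t3, c1: 0.0
print(round(chi_square(10, 1, 4_990, 4_999), 2)) # t4, c1: 7.37
```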

Expected cross entropy (ECE). It is also an important method in machine learning. It considers the cross entropy between words and document categories [10].

Formula (natural logarithm; reconstructed from the worked example below):

ECE(t) = P(t) * Σi P(ci|t) log(P(ci|t) / P(ci))

For example: suppose the document set is 10,000 and there are two classes named c1 and c2; c1 has 5,000 documents and c2 has 5,000 documents.

keyword/class   c1      c2      total
t1              4,000   1,000   5,000
t2              1       5,000   5,001
t3              5,000   5,000   10,000
t4              10      1       11

ECE(t1) = 5,000/10,000 * (4,000/5,000 * log((4,000/5,000) / 0.5) + 1,000/5,000 * log((1,000/5,000) / 0.5))
        = 0.5 * (0.8 * log 1.6 + 0.2 * log 0.4)
        = 0.5 * (0.376 - 0.183) ≈ 0.096
ECE(t2) ≈ 0.5 * (0.0002 * log 0.0004 + log 2) ≈ 0.346
ECE(t3) = 0.5 * (0.5 * log 1 + 0.5 * log 1) = 0
ECE(t4) ≈ 0.00043

ECE(t2) > ECE(t1) > ECE(t4) > ECE(t3), so keyword t2 is the best one.
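The same table can be fed through a sketch of ECE(t) = P(t) Σ P(ci|t) log(P(ci|t)/P(ci)), with the natural logarithm; results match the worked example up to rounding:

```python
import math

def expected_cross_entropy(with_term, class_sizes, n_total):
    """ECE(t) = P(t) * sum_i P(ci|t) * log(P(ci|t) / P(ci))."""
    n_t = sum(with_term)
    acc = 0.0
    for w, size in zip(with_term, class_sizes):
        if w == 0:
            continue  # zero-count classes contribute nothing
        p_ci_t = w / n_t
        p_ci = size / n_total
        acc += p_ci_t * math.log(p_ci_t / p_ci)
    return (n_t / n_total) * acc

classes = [5_000, 5_000]
print(round(expected_cross_entropy([4_000, 1_000], classes, 10_000), 4))  # t1: 0.0964
print(round(expected_cross_entropy([1, 5_000], classes, 10_000), 3))      # t2: 0.346
print(round(expected_cross_entropy([5_000, 5_000], classes, 10_000), 3))  # t3: 0.0
print(round(expected_cross_entropy([10, 1], classes, 10_000), 5))         # t4: 0.00043
```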

Comparison. IG and the χ2 statistic (CHI) are the most effective at term removal. Document frequency thresholding (DF) has similar accuracy. By contrast, mutual information (MI) is the worst method, because it prefers rare terms and is easily affected by probability estimation errors [6].


2.2 Topic Extraction

Chung-Hong et al. [11] propose a novel method that not only detects hot topics but also ranks them in near real time from microblogs. They first filter out microblogging messages that include non-ASCII characters. The topic creation module then contains three major steps: dynamic term weighting, neighbourhood generation, and text clustering. In dynamic term weighting they consider the updating problem, so when assigning weights to keywords they add a burst score to the term frequency. In neighbourhood generation they propose a new method to analyse the neighbourhood of keywords, find the sentences containing those keywords, and then calculate the similarity between the texts. In text clustering they apply a density-based clustering method to filter spam, rather than a spam classifier, which sometimes only subjectively filters the keywords.

Sharifi et al. [12] propose two primary methods to summarise microblogs, both based on the extractive approach rather than the abstractive one, since the defining feature of microposts is that they are very short. One method is a graph-based algorithm: it finds the most common words, chooses the microposts containing those keywords, calculates a weight for each sentence, and selects the sentence with the highest score as the topic sentence. The other is the Hybrid TF-IDF algorithm, based on term frequency-inverse document frequency, which finds keywords with this feature: the frequency of the word within a document is high, but its frequency across the corpus is low. Their experiments show that on large data sets the Hybrid TF-IDF algorithm performs better than the graph-based algorithm. In my view, the major problem is redundancy: many microposts are quite similar and may receive the same similarity score or weight, so it is hard to select the best one.

Bossard et al. [13] develop an algorithm that first builds the similarities between all sentences, then uses a clustering method, fast global k-means, on the similarity matrix to generate clusters in which sentences share a similar topic. They use the Jaccard measure to calculate similarity; if two words are not identical, they use WordNet to find synonyms and hypernyms of the words. After clustering, they choose one sentence from each cluster as the summary covering the relevant information; simply, they can select the central sentence of each cluster, which optimises the similarities.

Lee [14] presents a density-based online clustering algorithm to detect emerging events that have temporal and geospatial characteristics. They point out that microposts lack semantic integrity, so it is hard to build a reliable weighting and clustering method. To resolve this, they propose a dynamic weighting algorithm for the real-time environment. Firstly, they apply the incremental DBSCAN clustering method to group microposts, since the data is updated incrementally. Then, when calculating weights, they assign different scores to different groups of words, such as uninformative words, topic words and common words. In my opinion, there are some limitations, as they concentrate only on temporal and spatial analysis and give few details about the density-based clustering algorithm.

Feifei et al. [15] present a new algorithm called FNODT, based on an extraction method, to detect hot topics. It takes many parameters as input, such as the number of users, the number of messages, word frequency, and the occurrence and time distribution of events. They sort the results and output the top 50 as the hot topics. In this process, they use a fuzzy clustering algorithm to cluster the microposts and use a classifier to classify the word set and generate the topic set. They then use the topic set to re-retrieve the microposts, compare them using the above parameters, rank them, and finally output the hot topic set.

Zitao et al. [16] propose a new feature selection method for topic detection based on part-of-speech tagging and HowNet. They select keywords that have different parts of speech but carry a considerable amount of information; in other words, the method filters the candidate words and removes spam information from the microposts. They then use HowNet, a knowledge base, to describe semantic features and split concrete concepts. They discuss how to set the thresholds in the algorithm to boost classification quality. The method can also be used in many web applications.

He et al. [17] analyse the characteristics of microblogs and propose a new lightweight algorithm, called the LN algorithm, to extract the latest hot topics. People or places are the input of this algorithm and are regarded as subjects; the algorithm then finds predicates and objects to describe the subjects and generates the hot topic sentence. Finally, they use a generation model tree to display the results. Specifically, they collect a set of rubbish information and use it to train a Naive Bayes classifier. Next, they pre-process the microblog corpus to remove hyperlinks and non-English words. They then use the trained Naive Bayes classifier to remove junk information, obtain a denoised microblog corpus, and use it as input to the next step. They use named entity recognition and classification to find proper nouns such as people's names or place names; here they apply the Google Geocoder API to seek topic centre words. Finally, they create a generation tree based on word frequency to show the hot topic. In my view, it is very hard to find the correct keywords using entity detection methods when the number of microposts is very large; if the number is small, it may be a good approach.

Hutton et al. [18] present a novel microblog summarisation method that summarises microposts automatically by adopting the Phrase Reinforcement Algorithm. It detects the most common phrase containing the topic phrase, which is then regarded as the summary. In this process, they apply a filtering method, a Naive Bayes classifier trained on Twitter data, to derive the most relevant content. After obtaining the relevant sentences, they build a graph to represent the common sequences of words that appear both before and after the topic words, and then give each node a weight to keep longer phrases from dominating the output. Finally, they select the most heavily weighted path as the topic sentence. This method has some limitations. Firstly, it cannot find keywords automatically and cannot categorise the microposts; instead, it adopts the Twitter HTTP-based API, so users must input a keyword and then obtain the cluster from Twitter. Secondly, the topic sentence may be too short when the number of nodes is too large and the frequencies of many nodes are too small.

2.3 Sentiment analysis

Joshi and Belsare [19] propose a method to analyse the sentiment of microposts. They consider that adjectives play a pivotal role in sentiment analysis, so firstly they use suffixes to tag the adjectives in the microposts; for example, the adjective "exquisite" can be marked with _JJ to become exquisite_JJ and then used in the next steps. They introduce QTag, a part-of-speech tagger for English, to mark the words. They provide a method that uses a seed list of adjectives; WordNet, an English lexical database, can offer such a list. They also describe another approach, which uses film reviews from IMDB to train the classifier: for example, low-rated reviews can be used to improve the negative classifier while, in contrast, high-rated reviews enhance the affirmative classifier. Finally, the negative and positive classifiers are used to judge the sentiment of microblogs.

Zhang et al. [20] use an SVM (support vector machine) to classify blog opinions. Sentences are input one by one, and each time the classifier deals with one sentence, outputting a subjectivity tag and a score for that sentence. The score represents the classifier's confidence in the tag: when the sentence tends to be subjective, the classifier gives it a positive score; when the sentence tends to be objective, the classifier gives it a negative score. A micropost is marked as opinionated when it includes at least one subjective sentence.

Guangxia et al. [21] state the features of emotion in microblogs: first, users use different words, phrases and patterns to express their emotions; second, the training data for an individual is often limited. Therefore, they propose a global model that refines personal models using a collaborative online learning algorithm. More precisely, the algorithm gathers the latest global data and uses it to update the global classification model; meanwhile, every user maintains a collaborative model that is updated from the user's microposts and the global model's parameters. Finally, global common knowledge can be used to carry out the sentiment analysis.

Turney [22] proposes an unsupervised learning algorithm that separates reviews into two groups: recommended and not recommended. It assumes that all the reviews contain adjectives or adverbs, which can be used for semantic orientation analysis. A phrase with good associations has a positive semantic orientation; by contrast, a phrase with bad associations has a negative semantic orientation. Mutual information is used to judge the semantic orientation: if a review has a positive semantic orientation on average, it should be recommended.

Pang et al. [23] classify documents by their overall sentiment, for example judging whether a review supports an article or not. They use film reviews to train the classifier and then apply three well-known classification methods to classify documents by the users' opinions. They find a useful result: the SVM performs best in the experiment, while Naive Bayes performs worst. Overall, these classification methods perform better on topic-based categorisation than on sentiment classification.

Qin et al. [24] introduce a judgment model based on a microblog platform. The input is vectors derived from microposts containing many messages; the output is the judgment tendency. The model contains two major components, named TDM (Topic Detection Matcher) and TJC (Tendency Judgment Classifier). The TJC is based on rule tree theory and the vector space model, and offers the opinion detection service.

3 Significance and contributions

I will propose a novel method to categorise the microposts, detect the topics within the same categories, and analyse the sentiment of microposts to identify personal tendency. The input is raw data from Twitter; the outputs are ranked topic sentences along with positive and negative opinions. Most importantly, I will integrate a classification algorithm and a clustering algorithm to categorise the microposts, and I will propose a method to assign weights to sentences automatically rather than manually.

4 Methodologies

Firstly, I will pre-process the raw data from microblogs and then use the VSM (vector space model) to represent the text data. Next, I will extract the keywords and reduce the dimensionality. Moreover, I will use a classification method, the SVM (support vector machine), to find several main categories such as education, sports, entertainment and politics, and use a clustering method, ROCK, to improve the classification results and find the specific hot topics. In addition, I will also use classification methods to extract opinions, i.e. affirmative opinions and negative opinions.

4.1 Text pre-processing

4.1.1 Word stemming

We can use word stemming to find the root of a word.

4.1.2 Feature filtering

In this step, a stop list is used to filter out function words such as articles, prepositions and conjunctions, which play no role and carry no classification information. For example, "a", "the", "and", "to" and "for" should be removed.
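The two pre-processing steps can be sketched together; the stop list and the suffix-stripping rule below are illustrative placeholders only (a real system would use a curated stop list and a proper stemmer such as Porter's):

```python
# Hypothetical, tiny stop list for illustration.
STOP_WORDS = {"a", "an", "the", "and", "to", "for", "of", "in", "is", "are"}

def crude_stem(word):
    """Very rough suffix stripping; illustration only, not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, remove stop words, and stem the remaining tokens."""
    tokens = [w.lower() for w in text.split()]
    return [crude_stem(w) for w in tokens if w not in STOP_WORDS]

print(preprocess("The bloggers are posting the links"))
# prints ['blogger', 'post', 'link']
```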

4.2 Text Representation

4.2.1 Vector Space Model (VSM)

It is also called the term vector model. It is an algebraic model for representing text documents (and objects in general) as vectors of identifiers [25].


4.2.2 Feature item weighting: TF*IDF

tf-idf (term frequency – inverse document frequency) is a well-known measure of how important a term is to a document, and a useful way to turn text into a vector for the VSM. It favours keywords with this property: the word is frequent within the document but rare across the corpus. However, the structure of TF*IDF is too simple to fully reflect the importance and distribution of keywords. In addition, TF*IDF does not consider position weight: words at different positions should have different weights.

For example, suppose document1 contains 1,000 words. We select three keywords (k1, k2, k3) in document1, which appear 100, 200 and 50 times respectively. The term frequencies are

TF1 = 100/1000 = 0.1
TF2 = 200/1000 = 0.2
TF3 = 50/1000 = 0.05

Next, assume the corpus contains 10,000 documents. If 1,000 documents contain k1, all 10,000 documents contain k2, and 5,000 documents contain k3, then the inverse document frequencies (using the natural logarithm) are

IDF1 = ln(10000/1000) = ln(10) ≈ 2.30
IDF2 = ln(10000/10000) = ln(1) = 0
IDF3 = ln(10000/5000) = ln(2) ≈ 0.69

The tf*idf score is the product of the two quantities:

tf*idf(k1) = 0.1 × 2.30 = 0.23
tf*idf(k2) = 0.2 × 0 = 0
tf*idf(k3) = 0.05 × 0.69 ≈ 0.035
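The numbers in the worked example can be reproduced with a few lines of Python (using the natural logarithm, as above):

```python
import math

DOC_LEN = 1000                                    # words in document1
COUNTS = {"k1": 100, "k2": 200, "k3": 50}         # keyword counts in document1
N_DOCS = 10000                                    # documents in the corpus
DOC_FREQ = {"k1": 1000, "k2": 10000, "k3": 5000}  # documents containing each keyword

def tf_idf(term):
    tf = COUNTS[term] / DOC_LEN
    idf = math.log(N_DOCS / DOC_FREQ[term])  # natural logarithm
    return tf * idf

for term in ("k1", "k2", "k3"):
    print(term, round(tf_idf(term), 4))
# -> k1 0.2303, k2 0.0, k3 0.0347
```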


4.2.3 Text similarity

There are several ways to calculate text similarity, such as the inner product, cosine similarity and Jaccard distance. The cosine measure is regarded as the best one [26].

Inner Product: sim(d1, d2) = Σi w1i · w2i

Cosine: sim(d1, d2) = (Σi w1i · w2i) / (√(Σi w1i²) · √(Σi w2i²))
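Both measures are straightforward to compute over VSM vectors; a small sketch (the weight vectors are made up for illustration) follows:

```python
import math

def inner_product(v1, v2):
    # dot product of two term-weight vectors
    return sum(a * b for a, b in zip(v1, v2))

def cosine(v1, v2):
    # inner product normalised by the vector lengths; 1.0 means same direction
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return inner_product(v1, v2) / norm if norm else 0.0

d1 = [0.1, 0.0, 0.5]   # hypothetical TF*IDF weights for document 1
d2 = [0.2, 0.3, 0.5]   # hypothetical TF*IDF weights for document 2
print(round(cosine(d1, d2), 3))  # -> 0.859
```

Unlike the raw inner product, the cosine measure is insensitive to document length, which is why it is usually preferred.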


4.3 Dimensionality reduction

4.3.1 Feature selection

I will use Information Gain (IG) to select the features.
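As a sketch of how IG scores a term (the tiny labelled corpus below is hypothetical), information gain compares the class entropy before and after splitting the documents on the term's presence:

```python
import math

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(term, docs, labels):
    # docs: list of token sets; labels: parallel list of class labels
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_t) / n) * entropy(with_t) \
                + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - conditional

# hypothetical toy corpus: two sports posts and two politics posts
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"debate", "party"}]
labels = ["sports", "sports", "politics", "politics"]
print(information_gain("goal", docs, labels))   # -> 1.0 (perfectly informative)
print(information_gain("match", docs, labels))  # lower: appears in only one post
```

Terms with the highest IG are kept as features; the rest are discarded.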

4.3.2 Feature extraction

Latent Semantic Analysis (LSA) is a dimensionality reduction technique that can resolve problems such as synonymy and polysemy. It applies singular value decomposition (SVD) to project the data into a lower-dimensional subspace, calculates semantic similarity using the main associative patterns, and discards the less useful information [27].

Input: term-by-document matrix
Output:
U: concept-by-term matrix
V: concept-by-document matrix
S: diagonal matrix whose elements assign weights to the concepts

For example, we can represent the text with a word-document matrix over nine documents, T1 to T9.


Then, we can use a histogram of the singular values to select an appropriate number of dimensions.

Next, we can use the first three dimensions to approximate the whole matrix.

4.4 Classification

I will use an SVM (support vector machine) to classify the data. The SVM is a supervised machine learning method: it seeks a global hyperplane that separates the classes in the training set, and generates a model to classify new data. It has proven to be an effective method that works well in high-dimensional spaces [28].


4.5 Clustering

Clustering is an effective technique for grouping and categorising documents based on the similarity among them. Documents in one cluster have high similarity with each other but low similarity with documents in other clusters. Many clustering algorithms exist, such as k-means, CURE, BIRCH and ROCK [29,30].

Traditional clustering algorithms use distance measures to cluster and cannot produce high-quality clusters on categorical data. ROCK (a robust clustering algorithm for categorical attributes) is an agglomerative hierarchical clustering algorithm that focuses on providing better-quality clusters for categorical data. It initially used the Jaccard coefficient as its similarity measure, but later considered the neighbourhoods of individual pairs of points: if two similar points also have similar neighbourhoods, then the two points likely belong to the same cluster and can be merged [29,30].

For example:

Suppose there are seven keywords, k1 to k7, and a number of articles such as {k1, k2, k3}, {k1, k2, k4}, {k1, k2, k5}, {k1, k3, k4} and {k1, k3, k5}. If we cluster them into two groups, the result is as follows.

Cluster one <k1, k2, k3, k4, k5>:
{k1, k2, k3} {k1, k2, k4} {k1, k2, k5} {k1, k3, k4} {k1, k3, k5}
{k1, k4, k5} {k2, k3, k4} {k2, k3, k5} {k2, k4, k5} {k3, k4, k5}

Cluster two <k1, k2, k6, k7>:
{k1, k2, k6} {k1, k2, k7} {k1, k6, k7} {k2, k6, k7}

Using the Jaccard coefficient, Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|, the pairs ({k1, k2, k3}, {k1, k2, k6}) and ({k1, k2, k3}, {k1, k2, k4}) have the same Jaccard coefficient, 0.5. But if the similarity threshold θ is set to 0.5, the link count of {k1, k2, k3} and {k1, k2, k4} is 5 (due to common neighbours {k1, k2, k5}, {k1, k3, k4}, {k2, k3, k4}, {k1, k2, k6} and {k1, k2, k7}), while the link count of {k1, k2, k3} and {k1, k2, k6} is only 3 (due to common neighbours {k1, k2, k5}, {k1, k2, k4} and {k1, k2, k7}). Therefore, {k1, k2, k3} and {k1, k2, k4} likely belong to the same cluster and can be merged into Cluster one.
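The link counts above can be checked mechanically; a sketch using the fourteen articles from the example and threshold θ = 0.5 follows:

```python
def jaccard(a, b):
    # Jaccard coefficient of two keyword sets
    return len(a & b) / len(a | b)

# the fourteen articles from the example above
DOCS = [frozenset(s) for s in (
    {"k1","k2","k3"}, {"k1","k2","k4"}, {"k1","k2","k5"}, {"k1","k3","k4"},
    {"k1","k3","k5"}, {"k1","k4","k5"}, {"k2","k3","k4"}, {"k2","k3","k5"},
    {"k2","k4","k5"}, {"k3","k4","k5"}, {"k1","k2","k6"}, {"k1","k2","k7"},
    {"k1","k6","k7"}, {"k2","k6","k7"},
)]
THETA = 0.5

def neighbours(p):
    # all other points whose Jaccard similarity with p reaches the threshold
    return {q for q in DOCS if q != p and jaccard(p, q) >= THETA}

def link(p, q):
    # number of common neighbours, not counting the pair itself
    return len((neighbours(p) & neighbours(q)) - {p, q})

a, b, c = (frozenset(s) for s in
           ({"k1","k2","k3"}, {"k1","k2","k4"}, {"k1","k2","k6"}))
print(link(a, b), link(a, c))  # -> 5 3
```

Although both candidate pairs tie on Jaccard similarity, the link counts separate them, which is exactly the idea behind ROCK.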

5 Scope and limits

This research concentrates only on the semi-structured and unstructured posts published on microblogs, which differ from well-structured databases and Web documents in their own features; information stemming from other sources is not considered. Apart from that, the novel algorithms and model analyse only text-based posts and cannot be applied to multimedia such as photos and videos. In future research I will consider more social media, such as Facebook and MySpace, to broaden the range of application of this mining method.

Technologically, I propose new methods for keyword and sentence extraction and for topic generation. However, I only use mature classification and clustering approaches in this study, and do not present more competitive algorithms or improvements to existing ones. To enhance the accuracy and efficiency of the blog mining model, I will devote more effort to the algorithms in future work.

6 Project Plan

27 February 2012 – 04 March 2012: Choose and contact the supervisor
05 March 2012 – 11 March 2012: Assign the thesis topic and discuss it in detail
12 March 2012 – 18 March 2012: Study the background of blog mining
19 March 2012 – 25 March 2012: Read the recommended papers
26 March 2012 – 01 April 2012: Write the topic generation summary
02 April 2012 – 15 April 2012: Find and read relevant papers from the database
16 April 2012 – 22 April 2012: Finish the annotated bibliography
23 April 2012 – 29 April 2012: Review keyword extraction methods
30 April 2012 – 06 May 2012: Review classification and clustering methods
07 May 2012 – 13 May 2012: Review topic extraction approaches
14 May 2012 – 20 May 2012: Review sentiment analysis methods
21 May 2012 – 27 May 2012: Compare and analyse blogs and microblogs
28 May 2012 – 03 June 2012: Prepare the presentation
04 June 2012 – 10 June 2012: Finish the research proposal
11 June 2012 – 17 June 2012: Design and improve the structure of the project
18 June 2012 – 22 July 2012: Write the programme to implement the functions
23 June 2012 – 22 July 2012: Adjust and improve the functions
23 July 2012 – 29 July 2012: Carry out the experiments
30 July 2012 – 05 August 2012: Evaluate the method and compare it with others
06 August 2012 – 12 August 2012: Improve the method
13 August 2012 – 19 August 2012: Test the method and evaluate it again
20 August 2012 – 26 August 2012: Write the results and discussion
27 August 2012 – 02 October 2012: Finish the first draft of the minor thesis
22 October 2012 – 04 November 2012: Modify the thesis following the supervisors' feedback
05 November 2012 – 18 November 2012: Finish and improve the second draft
19 November 2012 – 25 November 2012: Submit the final minor thesis


7 References

[1] W. Wenhao and W. Bin, "Comparing Twitter and Chinese native microblog," in Cybersecurity Summit (WCS), 2011 Second Worldwide, 2011, pp. 1-4.

[2] G. T. Lakshmanan and M. A. Oberhofer, "Knowledge Discovery in the Blogosphere: Approaches and Challenges," IEEE Internet Computing, vol. 14, pp. 24-32, 2010.

[3] A. Java, X. Song, T. Finin, and B. L. Tseng, "Why we twitter: An analysis of a microblogging community," in WebKDD/SNA-KDD, 2007, pp. 118-138.

[4] H. Xinghua and W. Bin, "Automatic Keyword Extraction Using Linguistic Features," in Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference on, 2006, pp. 19-23.

[5] M. Oka, et al., "Extracting topics from weblogs through frequency segments," in WWW2006 - 3rd Workshop on Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006.

[6] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97), Morgan Kaufmann, 1997, pp. 412-420.

[7] Z. Zhilong, et al., "Categorical Document Frequency Based Feature Selection for Text Categorization," in Information Technology, Computer Engineering and Management Sciences (ICM), 2011 International Conference on, 2011, pp. 65-68.

[8] L. Sungjick and K. Han-Joon, "News Keyword Extraction for Topic Tracking," in Networked Computing and Advanced Information Management, 2008. NCM '08. Fourth International Conference on, 2008, pp. 554-559.

[9] M. Hu and B. Liu, "Mining and summarizing customer reviews," presented at the Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, Seattle, WA, USA, 2004.

[10] H. Hui, et al., "Short Text Feature Extraction and Clustering for Web Topic Mining," in Semantics, Knowledge and Grid, Third International Conference on, 2007, pp. 382-385.

[11] L. Chung-Hong, et al., "An automatic topic ranking approach for event detection on microblogging messages," in Systems, Man, and Cybernetics (SMC), 2011 IEEE International Conference on, 2011, pp. 1358-1363.

[12] B. Sharifi, et al., "Experiments in Microblog Summarization," in Social Computing (SocialCom), 2010 IEEE Second International Conference on, 2010, pp. 49-56.


[13] A. Bossard, et al., "CBSEAS, a summarization system integration of opinion mining techniques to summarize blogs," presented at the Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations Session, Athens, Greece, 2009.

[14] C.-H. Lee, "Mining spatio-temporal information on microblogging streams using a density-based online clustering method," Expert Systems with Applications, vol. 39, pp. 9623-9641, 2012.

[15] P. Feifei, et al., "Research on algorithm of extracting micro-blog's hot topics," in Electronics, Communications and Control (ICECC), 2011 International Conference on, 2011, pp. 986-989.

[16] L. Zitao, et al., "Short Text Feature Selection for Micro-Blog Mining," in Computational Intelligence and Software Engineering (CiSE), 2010 International Conference on, 2010, pp. 1-4.

[17] Y. He, et al., "Summarizing Microblogs on Network Hot Topics," in Internet Technology and Applications (iTAP), 2011 International Conference on, 2011, pp. 1-4.

[18] M. Hutton, et al., "Summarizing microblogs automatically," presented at the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, 2010.

[19] M. Joshi and N. Belsare, "BlogHarvest: Blog mining and search framework," in International Conference on Management of Data COMAD, Delhi, India, 2006, pp. 226-230.

[20] W. Zhang, et al., "Opinion retrieval from blogs," presented at the Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, Lisbon, Portugal, 2007.

[21] L. Guangxia, et al., "Micro-blogging Sentiment Detection by Collaborative Online Learning," in Data Mining (ICDM), 2010 IEEE 10th International Conference on, 2010, pp. 893-898.

[22] P. D. Turney, "Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 417-424.

[23] B. Pang, et al., "Thumbs up? Sentiment Classification using Machine Learning Techniques," CoRR, cs.CL/0205070, 2002.

[24] Z. Qin, et al., "A content tendency judgment algorithm for micro-blog platform," in Intelligent Computing and Intelligent Systems (ICIS), 2010 IEEE International Conference on, 2010, pp. 168-172.

[25] N. Agarwal, et al., "WisColl: Collective wisdom based blog clustering," Information Sciences, vol. 180, pp. 39-61, 2010.


[26] Y. Chen, F.S. Tsai, and K.L. Chan, “Blog Search and Mining in the Business Domain,” Proc. Workshop on Domain Driven Data Mining in Conjunction with Knowledge Discovery and Data Mining, ACM Press, 2007, pp. 55–60.

[27] Y. Chen, et al., "Machine learning techniques for business blog search and mining," Expert Systems with Applications, vol. 35, pp. 581-590, 2008.

[28] S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," The Journal of Machine Learning Research, vol. 2, pp. 45-66, 2002.

[29] D. A. K. Rizwan Ahmad, "Document Topic Generation in Text Mining by Using Cluster Analysis with EROCK," International Journal of Computer Science and Security, vol. 4, pp. 176-182, 2010.

[30] S. Guha, et al., "ROCK: a robust clustering algorithm for categorical attributes," in Data Engineering, 1999. Proceedings., 15th International Conference on, 1999, pp. 512-521.

8 Trial Table of Contents of thesis

Abstract
1 Introduction
1.1 Motivation
1.2 Context of the research
1.3 Significance of the research
1.4 Research questions
1.5 Research sub questions
2 Related works
2.1 Keyword Extraction
2.2 Topic Extraction
2.3 Sentiment analysis
2.4 Updating problem
3 Significance and contributions
4 Methodologies
4.1 Text pre-processing
4.1.1 Stemming
4.1.2 Feature filtering
4.2 Text Representation
4.2.1 Vector Space Model (VSM)
4.2.2 Feature item weighting
4.2.3 Text similarity
4.3 Dimensionality reduction
4.3.1 Feature selection
4.3.2 Feature Extraction
4.4 Classification
4.5 Clustering
4.6 Topic extraction (summarization or sentence extraction)
4.7 Sentiment analysis
5 Experiments
6 Results
7 Discussion
8 Project Plan
9 References
Appendix - programme and examples
