International Journal of Computational Intelligence and Information Security, October 2014, Vol. 5, No. 7, ISSN: 1837-7823

Machine Learning Algorithms and their Significance in Sentiment Analysis for Context Based Mining

N. KARTHIKEYAN 1 and R. DHANAPAL 2

1* Head (B.C.A Dept), Department of Computer Applications,
Srimad Andavan Arts & Science College, Tiruchitrapalli, Tamil Nadu, India
E-mail: [email protected]

2* Principal, K.C.S. Kasi Nadar College of Arts & Science,
R.K. Nagar, Chennai – 600 021
URL: www.kcskasinadarcollege.in
E-mail: [email protected]

Abstract

Sentiment analysis requires examining various parts of a text to produce appropriate results. Since text is generally unstructured, it is difficult for an algorithm to determine the result. This paper uses machine learning algorithms (Neural Networks and SVM) and the J48 classification algorithm to determine the best approach for determining the polarity of a document for sentiment analysis. The results indicate that SVM performs better than the other techniques in determining document polarity.

Keywords: Context based mining, Sentiment analysis, SVM, ANN, J48

1. Introduction

In Content Based Image Retrieval (CBIR), we concentrate on retrieving images corresponding to a query image. In the usual text based image search, users provide keywords based on which images are retrieved. In text based search, the ability of the user to provide an exact query is limited by several factors: colour, texture and similar intricate details cannot be represented in textual form in a consistent manner. The inability to provide proper input automatically introduces bias or error in the output. Current generation image search therefore takes images as input, so that the match can be much better than with text as input.

The drawback of the current approach is that we are not searching the images in a single well defined context. The image could be anything and must be matched with all other images in the repository before providing the output. Image based search and matching has been successful in many domains that are context specific. For example, comparing iris scan images against a database containing only iris images has been very successful; similarly, facial recognition, fingerprint readers etc. are all very reliable because the images all come from a single well defined context.

When it comes to a broad category of images, the drawback of providing an image as input and searching for similar elements in a repository is that the user is handicapped because the context of the search is missing. For example, if the user provides the image of a dog and searches through the repository, the context could be any of the following: pets, breed based search, police/sniffer dogs, trained dogs, helper dogs, diseases suffered by dogs, food for dogs etc. By providing an image as input, the user is unable to specify the context that he/she is looking for in the image result.

The human way of looking at an image must be studied from a psychological point of view rather than treated as just reading all the pixels and trying to make sense of them. Human vision, or the perception of human vision to be precise, is based on the overall broad context; once we obtain the context, we ignore the local details. This is completely different from a computerized program. Here semantics and context sensitiveness play an important role. This brings out the need for filling the semantic gap in content based image retrieval. Concentrating on the low level features alone makes the search results biased and error prone. Also, changes in luminance, texture or colour do not change the context of an image, and we are looking for the context here.

The core concept in retrieving content from an image is currently based on pixel by pixel analysis of the image. But human vision doesn't give the same importance to all the pixels as a computer does. So in order to emulate human vision through computers, the key is semantics. To provide such a semantic based image retrieval system, the repository as well as the query image must be accompanied by some metadata. Metadata here provides the context. It could be keywords, descriptions and tags. Even sentiment polarity could be included to make the search much more effective and context sensitive. In this paper we try to bridge the semantic gap by including the sentiment polarity of the images in CBIR.

The remainder of this paper is structured as follows; section II provides

2. Related works

A lot of research has gone into content based image retrieval. Thijs Westerveld in [1] used Latent Semantic Indexing to uncover hidden semantics. That work concentrates on including co-occurrence statistics to uncover the hidden semantic information. It tries to bring the best of both worlds, image features (content) and words (context), into one semantic space. Though the work showed better performance in mono- and multilingual text retrieval, its application to multi-modal and cross-modal image retrieval involves a lot of computational complexity, and its subjectivity complicates the process further.

In [2] David et al proposed several views regarding the importance of context sensitiveness in image retrieval. They quoted examples from newspapers that provide text as well as images in a biased manner favouring a particular political or religious faction. They introduced a new platform and a diversity engine architecture for image retrieval based on opinion analysis, text analysis and content based information retrieval. Though they stressed the importance of semantics and context sensitiveness in image retrieval, they provided only an overview and summarized the existing text, image and other multimedia based retrieval systems.

In [3] Liyan et al presented an approach that utilizes context information to learn adaptive rules for automatic and human-in-the-loop clustering. The work is a bit more context aware as it considers the particular domain of face tagging and detection. The repository under consideration in their work consists only of human facial images, and hence context sensitiveness to a broader class is missing. Large scale context based retrieval of images requires analysis of millions or even billions of images and is hence computationally complex.

In [4] Thanh-Nghi Doan et al proposed a parallel incremental methodology for power mean SVM based classification of large scale image datasets, which is shown to handle thousands of visual classes effectively. Such a parallel approach towards context sensitive image retrieval could improve the performance and accuracy as well. It also considers dealing with imbalanced data. In [5] David Ahlstrom et al showed the effectiveness of simple and sophisticated tools for video exploration. It provides insights from a real time video search competition for video exploration.

The next step in web search is based on including users' sentiment/opinion effectively and hence providing context sensitive results. As suggested in [2], the importance of such sentiment analysis is on the rise as text mining systems are now being integrated with multimedia based information retrieval systems. So it is no longer just text or image based search, but a combination of them all, giving better results that are reliable in a wide variety of domains.

Several machine learning based methods have been proposed for lexical analysis of text corpora and for inferring sentiment polarity from them. In [6] Blinov et al proposed a machine learning approach based on Support Vector Machines (SVM) and the maximum entropy method. Their approach includes information about the proportion of positive and negative words, their collocations and emoticons to better identify the context. But their approach is based on manual formation of emotional dictionaries specifically made for each domain. Since such context based emotional dictionaries are not widely available for all domains, it is not a scalable solution for general web based image retrieval systems.

Automated text classification has been done using machine learning approaches for a long time now. In [7] Ikonomakis et al provided a detailed study of the state of the art in automated text classification using machine learning approaches. In [8] Stefano et al presented SentiWordNet 3.0, which is the latest edition of a lexical resource specifically designed for opinion mining and sentiment classification applications. The differences between the various versions of SentiWordNet and their features are also clearly explained, along with the research applications of such a lexical resource in automated text classification and sentiment polarity analysis. They also describe the algorithm for automatic WordNet annotation and how it classifies text into positive, negative and neutral elements.

Rudy et al in [9] proposed a hybrid approach for sentiment analysis based on rule based classification, supervised learning and machine learning. They applied it to movie reviews and product reviews and reported effective classification of sentiment polarity. Though the results are comparatively good, the hybridization increases the computational complexity of the approach considerably. Bo Pang et al in [10] considered sentiment analysis based on positive and negative polarity alone, independent of topic. They used Naive Bayes, maximum entropy classification and support vector machines for sentiment analysis, and reported that machine learning approaches perform better than the human baseline when it comes to sentiment polarity.

3. System architecture

[Figure 1 is a flowchart with the following stages, in order: Text; Content Analysis and Feature Vector Creation; Stop Word Elimination; Feature Matrix Creation (illustrated by a matrix of 0/1 entries); Context Based Sentiment Analysis using Machine Learning]

Figure 1: System Architecture





The process of context based image retrieval uses the base information available in the images to retrieve the context in which they are being used. The context based image retrieval system functions in four phases. The initial phase deals with analyzing the available data and creating a feature vector. These feature vectors are a broken down form of the available data. The second phase removes unnecessary words and shortlists the words needed for further processing: it removes the stop words and symbols from the feature vectors to make them more refined. After this refinement, the feature matrix is created using the reviews and feature vectors. This data serves as the base for performing the context based sentiment analysis. Machine learning is used for performing this analysis and finding the classification. Figure 1 shows the overall system architecture of the sentiment analysis methodology.

4. Context Based Image Retrieval Using Machine Learning Approaches

The term context refers to perspective or situation. Content retrieval using context as the key has its own complexities, the first and foremost being sentiment retrieval from the data. In general, context directly refers to the sentiments with which a certain text has been rendered. Emotion analysis is the next level of sentiment analysis. While sentiment analysis refers to finding the polarity of the document (positive, negative or neutral), emotion analysis goes deeper and refers to the level of emotions. Our methodology classifies the images based on the polarity of the text, using which the context can be retrieved. The following four phases describe the working methodology of our system.

4.1. Content analysis and Feature Vector Creation

Content of an image can be directly derived using the structural elements of the image. But deriving the context from an image is complex and mostly inaccurate. Hence it is necessary to search for other data that depict the context. This information is mostly found in the metadata and in content in close proximity to the image. Metadata here refers to tags, descriptions or keywords corresponding to the image.

Hence the initial process in sentiment mining is content analysis and feature vector creation. The content present in the available information is analyzed and tokenized, and the word vector is created. Here, the word vector is referred to as the feature vector. This vector contains information about each word and its frequency of occurrence. After the completion of this phase, all the data corresponding to the text that is to be analyzed will be listed.
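The tokenization and word counting described in this phase could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `build_feature_vector` and the tokenization rule (lowercased runs of letters and apostrophes) are assumptions.

```python
import re
from collections import Counter


def build_feature_vector(text):
    """Tokenize a text and count how often each word occurs."""
    # Assumed tokenization rule: lowercase, keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Counter maps each word to its frequency of occurrence.
    return Counter(tokens)


vector = build_feature_vector("The plot was good, the acting was not good.")
print(vector["good"], vector["plot"])
```

Each entry of the resulting mapping pairs a word with its frequency, which matches the description of the feature vector given above.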

4.2. Stop word elimination

Stop words are words that do not contribute to the meaning of a sentence. In short, these are connectors, articles or pronouns. The major contributors in the process of sentiment mining are the nouns, verbs, adverbs or adjectives that directly describe the activity taking place or determine the subject. All other words are mostly useless: they consume memory and reduce processing speed. Other types of stop words include punctuation such as the comma, full stop, colon, semicolon, question mark and exclamation mark.

The text considered for mining includes user provided unstructured data, which means the data does not have a proper format like data from a database. Further, the data might not even be proper English sentences. There is a very high possibility that this text contains colloquial forms of a language, and it might even be multilingual. Even though our current methodology does not deal with multilingual data, this could be addressed in future.

The process of stop word elimination uses the stop word collection of the storm project [12,13,14]. The feature vectors that were initially formed are filtered and the stop words occurring in them are eliminated. This removes a considerable amount of data from the main feature vector set, hence enabling faster computation.
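The filtering step can be illustrated with a small sketch. The stop word set below is a hypothetical, abbreviated list chosen for the example; it is not the collection referenced in [12,13,14], and `remove_stop_words` is an assumed helper name.

```python
# Hypothetical, abbreviated stop word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "and", "or", "of", "to"}


def remove_stop_words(feature_vector, stop_words=STOP_WORDS):
    """Return a copy of a word-frequency mapping without the stop words."""
    return {word: count
            for word, count in feature_vector.items()
            if word not in stop_words}


raw = {"the": 2, "plot": 1, "was": 2, "good": 2, "acting": 1}
filtered = remove_stop_words(raw)
print(filtered)
```

Only the content-bearing words remain, so later phases work on a smaller feature set.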

4.3. Feature matrix creation

The next phase is the creation of the feature matrix. This method maps the content onto the already defined feature vectors and creates a feature matrix. This phase creates an n×m matrix, where n refers to the number of texts considered for evaluation, and m refers to the number of items in the feature vector.
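A minimal sketch of building such an n×m matrix is given below. It assumes binary presence entries, as the 0/1 matrix drawn in Figure 1 suggests, and whitespace tokenization; both are assumptions, as is the function name `build_feature_matrix`.

```python
def build_feature_matrix(texts, features):
    """Build an n x m matrix: one row per text, one column per feature word."""
    matrix = []
    for text in texts:
        words = set(text.lower().split())
        # Entry is 1 if the feature word occurs in the text, otherwise 0.
        matrix.append([1 if feature in words else 0 for feature in features])
    return matrix


features = ["plot", "good", "acting", "boring"]    # m = 4 feature words
texts = ["good plot good acting", "boring plot"]   # n = 2 texts
matrix = build_feature_matrix(texts, features)
for row in matrix:
    print(row)
```

Here n is the number of rows (texts) and m the number of columns (feature words), matching the definition above.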





Figure 3: ROC for J48 (Positive Sentiment)

Figure 3 shows the ROC plot for the positive sentiment. From the curve, it can be observed that the accuracy is approximately 50%. Since J48 is a primitive classifier, the result obtained is only average; hence we conclude that a machine learning approach would be a better option.
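For readers less familiar with ROC plots, each point on the curve pairs the false positive rate (FPR) with the true positive rate (TPR) at one decision threshold. The sketch below computes a single such point; the labels, scores and threshold are made-up illustrative values, not data from this paper.

```python
def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1          # positive correctly flagged
        elif predicted_positive:
            fp += 1          # negative wrongly flagged
        elif label == 1:
            fn += 1          # positive missed
        else:
            tn += 1          # negative correctly rejected
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fpr, tpr


# Illustrative values only: 1 = positive sentiment, 0 = negative sentiment.
point = roc_point([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1], 0.5)
print(point)
```

A point where FPR equals TPR lies on the diagonal of the plot, which corresponds to chance-level separation of the two classes.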

Figure 4: Result of ANN

Figure 4 shows the working of the neural network model. Due to the continual training approach and the very large data size, the training time of the neural network is very high. Further, the error rate is also high. It can be observed from Figure 4 that the error rate is 2.133 and that the error reduction rate is very low. Hence the option of considering neural networks is eliminated. The ENCOG framework is used for implementing the neural network model. The neural network was constructed with three layers: the input and output layers have no biased neurons, and the processing layer has two biased neurons. The input layer was sized according to the number of words obtained after pre-processing, which in our case is 3190. The ActivationLinear function was used in the input layer, and the ActivationTanH function in the processing and output layers. The resilient propagation function was used to train the network. The network design is as follows (Table 1):

Table 1: Neural Network Setup

Parameter                                           Value
No. of Layers                                       3
No. of Neurons in Input Layer                       3190
No. of Biased Neurons in the Input Layer            0
No. of Neurons in Processing Layer                  3192
No. of Biased Neurons in the Processing Layer       2
No. of Neurons in Output Layer                      1
No. of Biased Neurons in the Output Layer           0
Activation Function Used in Input Layer             ActivationLinear
Activation Function Used in Processing Layer        ActivationTanH
Activation Function Used in Output Layer            ActivationTanH
Neural Network Training Function                    Resilient Propagation
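To make the layer arrangement concrete, the following is a simplified forward pass through a three-layer network with a linear input layer and tanh processing and output layers. It is a plain-Python sketch and does not use ENCOG; the toy layer sizes, the random weights, the single output bias term and the name `forward` are all assumptions, and resilient propagation training is not shown.

```python
import math
import random


def forward(inputs, hidden_weights, output_weights, output_bias=0.0):
    """Forward pass: linear input layer, tanh hidden layer, tanh output neuron."""
    # A linear activation on the input layer leaves the inputs unchanged.
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs)))
              for row in hidden_weights]
    # Single output neuron; its bias term is a simplifying assumption.
    return math.tanh(sum(w * h for w, h in zip(output_weights, hidden))
                     + output_bias)


random.seed(0)
n_inputs, n_hidden = 5, 4  # toy sizes; Table 1 reports 3190 input neurons
hidden_weights = [[random.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
output_weights = [random.uniform(-1, 1) for _ in range(n_hidden)]
score = forward([1, 0, 1, 1, 0], hidden_weights, output_weights)
print(score)
```

Because the output activation is tanh, the score always lies strictly between -1 and 1, which can be thresholded to obtain a polarity.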

The same data set is then analysed using SVM. The RBF kernel function is used for classification.

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2 + r\right), \quad \gamma > 0 \qquad (3)$$
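As an illustration, the standard form of the RBF kernel, exp(-γ‖x_i − x_j‖²) with γ > 0, can be computed as below. This sketch leaves out any additive offset such as r, which is an assumption, and the function name `rbf_kernel` is chosen for the example.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Standard RBF kernel: exp(-gamma * ||x_i - x_j||^2), with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    # Squared Euclidean distance between the two feature vectors.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


print(rbf_kernel([1, 0], [1, 0], gamma=0.5))  # identical vectors
print(rbf_kernel([1, 0], [0, 1], gamma=0.5))  # vectors at squared distance 2
```

Identical vectors give the maximum kernel value of 1, and the value decays towards 0 as the vectors move apart; γ controls how quickly it decays.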

The SVM requires a special format for reading the data. The expected format of input for an SVM is

[label] [index 1]:[value 1] [index 2]:[value 2] ... (4)
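The layout in (4) matches the sparse text format read by SVM tools such as LIBSVM. A minimal sketch of writing one sample in that layout follows; the helper name `to_sparse_line`, the 1-based indices and the compact number formatting are assumptions.

```python
def to_sparse_line(label, values):
    """Format one sample as '[label] [index]:[value] ...' with 1-based indices."""
    # The 'g' format drops trailing zeros, e.g. 1.0 -> '1' and 0.50 -> '0.5'.
    pairs = [f"{index}:{value:g}" for index, value in enumerate(values, start=1)]
    return " ".join([str(label)] + pairs)


line = to_sparse_line(1, [1.0, -1.0, 0.5])
print(line)
```

Each row of the feature matrix would be written as one such line, preceded by its class label.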

The values (value 1, value 2, … value n) in the given format are normalized within the range -1 to 1. In order to convert the data into the required format, Max-Min Normalization is used, which is of the form,

(5)
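A minimal sketch of min-max normalization into the range -1 to 1 is shown below, assuming the usual linear rescaling v' = (v − min) / (max − min) × (new_max − new_min) + new_min. The function name `min_max_normalize` and the handling of constant columns are assumptions.

```python
def min_max_normalize(values, new_min=-1.0, new_max=1.0):
    """Rescale values linearly so min(values) -> new_min and max(values) -> new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values equal: map them to the middle of the target range (assumed rule).
        return [(new_min + new_max) / 2.0 for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (value - old_min) * scale for value in values]


normalized = min_max_normalize([0, 5, 10])
print(normalized)
```

The smallest value maps to -1, the largest to 1, and everything else falls proportionally in between.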

Sample input data for the SVM is of the form shown in Figure 5.





Figure 5: Sample input data for SVM

Figure 6: ROC for SVM

Figure 6 shows the ROC plot, which indicates a promising accuracy. Hence, after analysis of the results, SVM is found to work efficiently for the process of context mining. Figure 7 shows the result obtained from the SVM classifier.

Figure 7: Result of SVM

6. Conclusion

This paper is an initial implementation that analyses the available data with the classification algorithms in order to select the appropriate technique for the next level of analysis. The implementation is carried out using data obtained from the IMDb dataset, and the results indicate that SVM works best in the area of context mining. This process can be further improved by using one-class classification techniques rather than multi-class classification. Further, our next research proposal will take this research forward into mining levels of polarity rather than providing a single polarity base. The level of polarity can be analyzed and used for performing emotion analysis, which is a deeper form of sentiment analysis.

7. References

[1] Thijs Westerveld, (2000), “Image Retrieval: Content versus Context”, University of Twente, Department of Computer Science, Parlevink Group, PO Box 217, 7500 AE Enschede, The Netherlands.

[2] David Paul Dupplaw, Michael Matthews, Richard Johansson, Giulia Boato, Andrea Costanzo, Marco Fontani, Enrico Minack, Elena Demidova, Roi Blanco, Thomas Griffiths, Paul Lewis, Jonathon Hare, Alessandro Moschitti, (2014), “Information extraction from multimedia web documents: an open-source platform and testbed”, Int J Multimed Info Retr 3:97–111.

[3] Liyan Zhang, Dmitri V. Kalashnikov, Sharad Mehrotra, (2014), “Context Assisted Face Clustering Framework with Human-in-the-Loop”, International Journal of Multimedia Information Retrieval, Volume 3, Issue 2, pp 69-88.

[4] Thanh-Nghi Doan, Thanh-Nghi Do, Francois Poulet, (2014), “Parallel Incremental Power Mean SVM for the Classification of Large Scale Image Datasets”, International Journal of Multimedia Information Retrieval.

[5] Klaus Schoeffmann, David Ahlstrom, Werner Bailer, Claudiu Cobarzan, Frank Hopfgartner, Kevin McGuinness, Cathal Gurrin, Christian Frisson, Duy-Dinh Le, Manfred Del Fabro, Hongliang Bai, Wolfgang Weiss, (2014), “The Video Browser Showdown: A Live Evaluation of Interactive Video Search Tools”, International Journal of Multimedia Information Retrieval.

[6] Blinov P. D., Klekovkina M. V., Kotelnikov E. V., Pestov O. A., (2013), “Research of lexical approach and machine learning methods for sentiment analysis”.

[7] M. Ikonomakis, S. Kotsiantis, V. Tampakas, (2005), “Text Classification Using Machine Learning Techniques”, Wseas Transactions On Computers, Issue 8, Volume 4, pp. 966-974.

[8] Stefano Baccianella, Andrea Esuli, Fabrizio Sebastiani, (2010), “SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining”, LREC, Vol. 10.

[9] Rudy Prabowo, Mike Thelwall, (2009), “Sentiment Analysis: A Combined Approach”, Journal of Informetrics 3.2: 143-157.

[10] Bo Pang, Lillian Lee, Shivakumar Vaithyanathan, (2002), “Thumbs up? Sentiment Classification using Machine Learning Techniques”, Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10.

[11] Rajaraman, A.; Ullman, J. D. (2011). "Data Mining". Mining of Massive Datasets. pp. 1–17. doi:10.1017/CBO9781139058452.002. ISBN 9781139058452.

[12] http://storm-project.net, Referred on: 3 Oct 2014.

[13] https://github.com/nathanmarz/storm, Referred on: 3 Oct 2014.

[14] https://github.com/nathanmarz/storm/wiki, Referred on: 3 Oct 2014.

[15] http://www.cs.cornell.edu/People/pabo/movie-review-data, Referred on: 3 Oct 2014.

[16] Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. (2002), "Thumbs up? Sentiment classification using machine learning techniques." Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10.

[17] http://reviews.imdb.com/Reviews, Referred on: 3 Oct 2014.
