
Sentiment Analysis on Bangla and Romanized Bangla Text (BRBT) using Deep Recurrent models

Asif Hassan (a), Dr. Nabeel Mohammed (a,b), Dr. Abul Kalam al Azad (a)

(a) Department of Computer Science and Engineering, University of Liberal Arts Bangladesh, Dhaka, Bangladesh
(b) Faculty of Information Technology, Monash University, Clayton Campus, Melbourne, Australia

Abstract

Sentiment Analysis (SA) is an active research area in the digital age. With the rapid and constant growth of online social media sites and services, and the increasing amount of textual data available in them, such as statuses, comments, reviews and blogs, the application of automatic SA is also on the rise. However, most research on SA in natural language processing (NLP) is based on the English language. Despite being the sixth most widely spoken language in the entire world, Bangla still does not have a dataset that is both large and standard. As a result, recent research works in Bangla have failed to produce results that are both comparable to the work of others and reusable as stepping stones for future researchers in this field. Therefore, in our work we first provide a textual dataset that includes not just Bangla but Romanized Bangla texts as well, and is substantial, post-processed and validated multiple times, for use in SA experiments. We tested this dataset on a deep recurrent model, specifically Long Short Term Memory (LSTM), using two loss functions, binary crossentropy and categorical crossentropy, and also did some experimental pre-training by using data from one validation to pre-train for the other and vice versa. Lastly, we documented the results, along with some analysis, which was promising.

1.0: Introduction

The purpose of this thesis is to discuss our work on Sentiment Analysis (SA) on Bangla (Bengali) and Romanized Bangla texts using deep recurrent models. Bangla is one of the top 10 most widely spoken languages in the world, with almost 200 million speakers worldwide, 160 million of whom are Bangladeshi [2]. With a growing economy, the declining price of technology and Government incentives, the traditional businesses that adopted IT, and the IT sector as a whole in Bangladesh, have enjoyed considerable and rapid growth. This in turn has widened the scope for more Bangladeshi people to get involved in online activities such as connecting with friends and families through social media, expressing opinions and thoughts on popular micro-blogging and social networking sites, commenting on online news portals, and shopping through online marketplaces. While there are many advantages for online-based businesses, there are disadvantages too. It becomes increasingly harder for such businesses to monitor and analyze market trends, especially when this is done by analyzing the reactions of customers to their products or services, because of the little or no human-to-human interaction in such businesses. Moreover, the task of going through comments and reviews from each individual customer and figuring out the sentiments within them is tedious and in some cases simply intractable, especially considering that a very high volume of data is usually generated very quickly in this day and age of digital connectivity. Therefore, the application of automatic SA can play a vital role here in increasing efficiency and productivity.

Sentiment Analysis is itself a very important area of research, and a vast number of studies have been done over the past few years. SA has been defined as:

"Sentiment analysis, also called opinion mining, is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes." [3]

Since it has a large area of application, it goes by many other names, e.g. opinion extraction, sentiment mining, opinion mining, subjectivity analysis, emotion analysis, review mining etc., depending on the area it is applied to. Thus, opinion mining and SA actually point to the same field of study. Most of the research works we find on SA are based on the English language, and not as many on Bangla. The interesting work by Das and Bandyopadhyay [4] on subjectivity detection included Bangla, but it is not self-sufficient, as English is also needed. We have discussed other studies on Bangla in more detail in chapter 2 (Literature review). However, none of those works truly considered Bangladesh's perspective. We need to consider not just standardized Bangla, but also Banglish (Bangla words mixed with English words) and Romanized Bangla. These three major types can again be loosely categorized as good, standard, bad, wrong, totally wrong, particular to a specific location (almost arcane) etc., depending on the level of clarity, grammatical correctness, meaningfulness, personal idiosyncrasies, impact of localization and so on. Moreover, for Romanized Bangla there is added complexity due to the variation in transliteration between people who know English well and those who don't [5]. The fact that no clear standard is followed when 160 million Bangladeshi people write in any of these types makes it all the more complicated and challenging to work with. The following table has some examples of texts in all three major types:

Samples | Type | Comments
অত্যন্ত সময় োপয়যোগী এবং উপকোরী পদয়েপ। (Translation: a very timely and helpful step) | Bangla | Standard, Meaningful, Good etc.
পপথীপবয়ত্ মোনূল মোয়েই ভূ কয় (Translation: To err is human) | Bangla | Bad, Wrong etc.
এইডো পক কইপ রর মোমো!! (Translation: What are you saying dude!!) | Bangla | Non-standard, Meaningful, Bad etc.
গম আয়সো পন বো? (Translation: How are you? / Are you doing well?) | Bangla | Highly localized, Non-standard etc.
I hate you! আর রকোনপদন রত্োমোয়ক love রকোরয়বো নো। never ever!! (Translation: I hate you! I won't love you anymore! Never ever!!) | Banglish | Okay, Meaningful etc.
Ottonto shomoyopojogi ebong upokari podokhkhep. (Translation: a very timely and helpful step) | Romanized Bangla | Standard, Meaningful, Good etc.
amar bareta akan teke car mael dora (Translation: my house is four miles from here) | Romanized Bangla | Non-standard, Transliteration error, Wrong etc.

Table 1: Examples of Bangla text variants

In the recent past, deep learning methods, specifically recurrent deep learning models, have enjoyed much more success in NLP (Natural Language Processing) than conventional machine learning methods [6]. While there are other approaches to SA, in this thesis we will concentrate exclusively on such deep techniques. Our key contributions are:

1. Pre-processing the data in a way that makes it readily usable by researchers.
2. Application of deep recurrent models on a Bangla and Romanized Bangla text corpus.
3. Pre-training on the dataset of one label for the other (and vice versa) to demonstrate its usefulness.

The paper is organized as follows. In chapter 2 we discuss the background of our work and the works of others in the same field that inspired and helped us. In chapter 3 we discuss in detail the dataset that we used for our experiments. Chapter 4 discusses the methodology and also includes the experimental setup for the deep recurrent models, as elaborately as possible. Chapter 5 (Results and discussion) discusses the various results found from our experimentation, and lastly chapter 6 concludes.

2.0: Background

Let us now look into the works of others to describe, summarize, evaluate and clarify our work on Sentiment Analysis using deep recurrent models for Bangla and Romanized Bangla texts.

2.1: Sentiment Analysis

Although the term "Sentiment Analysis" may have appeared for the first time in Nasukawa and Yi [7], research works on sentiment appeared as early as 2000 [8-10]. With the advent of social media on the internet, e.g. Facebook, Twitter, forum discussions and reviews, and its rapid growth, we were introduced to a humongous amount of digital data (mostly opinionated texts, e.g. statuses, comments, arguments etc.) like never before, and to deal with this huge amount of data the SA field enjoyed a similar growth. Since the early 2000s, sentiment analysis has become one of the most active research areas in NLP (Natural Language Processing) [3].

However, most of the work is highly concentrated on the English language, with just a few research papers for Bangla. English SA research has enjoyed great progress, favored by the presence of standard datasets. Standard datasets allow researchers to do their own experiments and compare their contributions with those of others. For the English language, an example of such a standard SA dataset is the IMDB Movie Review Dataset, which contains 50,000 movie reviews made by viewers, each annotated as a positive or negative review. This dataset was originally created by Maas, Daly [11] and since then has been used by a multitude of different studies.

A detailed survey paper [12] presented an overview of the recent updates in SA algorithms and applications, categorizing and summarizing a total of 54 articles published up to 2014. Figures 1 and 2 were taken from their paper.

Figure 1: Sentiment analysis on product review

Figure 2: Sentiment classification techniques

Godbole, Srinivasaiah [13] collected opinions from newspapers and blogs, and performed SA by assigning scores indicating positive or negative opinion to each distinct entity in the text corpus.

In [14], the authors proposed and investigated a paradigm to mine sentiment from a popular real-time micro-blogging service such as Twitter, and fashioned a hybrid approach using both corpus-based and dictionary-based methods to determine the semantic orientation of the tweets.

Figure 3: System architecture

2.2: Sentiment Analysis for Bangla

It is quite unfortunate that there is no standard collection of data for Bangla texts, such as the IMDB dataset, Twitter corpus etc. One effort towards standardization came from an automatic translation of the positive and negative words of SentiWordNet [15]. However, no corpus was created from this work, thereby limiting its usage to word-level determination of sentiment, rather than the more complex natural language processing methods. Additionally, such simplified techniques do not consider the variety of ways in which people usually write, e.g. spelling mistakes, use of colloquial terms etc.

A small dataset of Bangla Tweets was collected along with Hindi and Tamil by Patra, Das [16], where the authors reported on the outcome of a shared Sentiment Analysis task on Indian languages. They used 999 Bangla tweets for training and 499 for testing. They did some post-processing, such as pruning emoticons from the tweets and removing duplicated posts. This data was annotated manually by native speakers. However, in terms of accuracy, the dataset's insignificant size may have been their only setback.

Another similar collection was done in [17], where 1400 Bangla Tweets were collected. However, their dataset is not publicly available, and the size of the dataset is rather small when it comes to the question of usability for recent deep learning-based NLP techniques, as over-training on data this small is highly likely for such deep models.

A slightly larger corpus was collected, automatically annotated and manually verified by Das and Bandyopadhyay [4]; their collection comprised almost 2500 Bangla text samples from news items and blog posts. The uniqueness of their collection over the ones collected by others [16, 17] was the average size of 288 words per sample, which is quite a bit larger than the 140-character Tweet limit.

With most of the other works having proceeded in a similar way, the two biggest issues with the current state of affairs in Bangla SA research are, first and foremost, the absence of a standard and big enough dataset to compare against, which makes comparison of research work extremely difficult, and secondly, that none of the Bangla SA research takes into account the very prominent practical aspect of the use of Romanized Bangla [5].

2.3: Deep recurrent models

The models we used to run our experiments are deep recurrent models. The following sections give some insight into the background of deep recurrent models and the algorithms they apply.

2.3.1: Deep learning

AI (Artificial Intelligence) has traditionally been done in two ways: i) knowledge based, and ii) representation learning based. The knowledge-base approach to AI uses logical inference rules to reason about statements input by users. Cyc was one of the most famous of such projects [18]. However, these projects didn't see much success. The failure of the knowledge-based approach was the driving force in finding a way to give AI the ability to gather its own knowledge by extracting patterns, or learning, from data, popularly known as Machine Learning. These new algorithms were based on a representation of the data, or features; that is, the system is given a number of features about the task at hand, on which it makes a decision. Clearly, if any of the features is wrong, this means a wrong representation of the data and the system will not perform well. To rectify this situation, representation learning [19] was used. Such algorithms gave better results than manually tailored representations of the data, and allowed systems to adapt to new tasks with ease. However, they required that high-level abstract features be extracted from the raw data without any error caused by misinterpretation due to factors of variation, as there can be factors (e.g. an accent in a speaker's speech) that cause a false representation in the absence of a highly sophisticated (human-like) understanding. Deep learning handles this issue better, as it provides complex representations expressed in terms of a number of other, simpler representations. It may appear that deep learning arrived fairly recently, but in reality it has existed under different names since as early as the 1940s. However, deep learning did not get much attention until recently, and with this newfound importance the term "deep learning" has become popular.

The following Venn diagram shows the relationships between deep learning, machine learning and other AI technologies [20].

Figure 4: Venn diagram to show relationships among deep learning, machine learning and other AI technologies

2.3.2: Artificial Neural Network

An artificial neural network, or ANN for short, is a computational model inspired by the biological neural networks of natural neurons [21]. A neuron (also known as a nerve cell) is a special biological information-processing cell composed of a cell body (or soma), two types of outward tree-like branches, axons and dendrites, and, at the terminals of these branches, synapses. Signals from other neurons are received through the dendrites, and the signal generated in the cell body after processing is transmitted through the axons. Neurons are connected to each other through synapses, where the axon of one neuron connects to a dendrite of another neuron [22].

The very first conceptual model of artificial neurons was proposed by Warren S. McCulloch, who was a neuroscientist, and Walter Pitts [23]. In their paper they described the mathematical aspects of an artificial neuron: it computes a weighted sum of n input signals and outputs 1 if the sum is greater than a certain threshold, and 0 otherwise. This could be considered the very first conceptualization of perceptrons. The following is a graphical representation of a perceptron.

Figure A: graphical representation of a perceptron (inputs X1..Xn, weights w1..wn, weighted sum, activation function, output)
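To make the thresholding computation above concrete, here is a minimal numpy sketch of such a unit; the inputs, weights and threshold are made-up values for illustration only:

import numpy as np

def perceptron(x, w, threshold=0.5):
    # McCulloch-Pitts style unit: weighted sum of the inputs followed by a hard threshold
    s = np.dot(w, x)
    return 1 if s > threshold else 0

# hypothetical inputs and weights, just to show the computation
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.4, 0.3, 0.3])
print(perceptron(x, w))   # prints 1, since the weighted sum 0.7 exceeds the threshold 0.5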

2.3.3: Recurrent Neural Network

Recurrent Neural Networks, or RNNs for short, are widely used in speech recognition, handwriting recognition, natural language processing and other areas. Moreover, the RNN is the precursor to the LSTM, which makes it important to discuss and understand RNNs before we can get into LSTMs. While traditional neural networks failed to create a persistent model that would somewhat mimic the way our memory cells work for learning and remembering information, the RNN, a class of ANN, has an interesting model design with a loop that makes information persistent. As described by Bullinaria [24], "The fundamental feature of a Recurrent Neural Network (RNN) is that the network contains at least one feed-back connection, so the activations can flow round in a loop. That enables the networks to do temporal processing and learn sequences."

Figure 5 below shows a simple RNN diagram with a feed-back connection [1]. Here A takes input xt and outputs ht. The loop enables the flow of information from one step to the next.

Figure 5: A rolled RNN diagram

To better understand this looping mechanism in RNNs, we should consider the next figure, where an unrolled RNN diagram is shown (Figure 6).

Figure 6: Unrolled simple RNN Diagram [1]

In the diagram we see how input vector x0 generates output vector h0 and passes information on to the next step, where input vector x1 generates h1 and again passes information on to the following step. It is as if there were multiple copies of the same network, where each successor receives information from all its predecessors, connected in an architecture that excels at processing sequential data. Apart from the input and output vectors, a simple RNN uses the following formula to calculate the hidden vector [25]:

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)
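To make the recurrence concrete, the following plain-numpy sketch applies the update above over a short sequence; the dimensions and random weights are hypothetical, chosen only to illustrate how the hidden state is carried from step to step:

import numpy as np

def rnn_forward(x_seq, W_xh, W_hh):
    # simple RNN: h_t = tanh(W_hh h_{t-1} + W_xh x_t), starting from h_0 = 0
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_hh @ h + W_xh @ x_t)   # the feed-back connection: h depends on the previous h
        outputs.append(h)
    return outputs

# hypothetical sizes: 4-dimensional inputs, 3-dimensional hidden state, sequence of length 5
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 4))
W_xh = rng.normal(size=(3, 4))
W_hh = rng.normal(size=(3, 3))
print(rnn_forward(x_seq, W_xh, W_hh)[-1])    # hidden state after the last step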

2.3.4: LSTM

While the RNN's success was critical in speech and pattern recognition due to its ability to memorize long-term dependencies, it was not without problems. RNNs were able to connect previous information to the current task only when the gap between the pieces of information was small. As the gap widened, RNNs started to perform poorly. It is typical for all traditional RNNs to have this vanishing gradient problem as the depth and complexity of the layers are increased, unlike Long Short Term Memory neural networks, LSTM for short. An LSTM neural network is like an extension of the simple RNN [26]. In 1997 Hochreiter and Schmidhuber introduced the LSTM, in which a memory cell has a linear dependence between its present activity and its past activity. Input and output gates were introduced to efficiently modulate the input and output. However, the introduction of forget gates was crucial for effective modulation of the information flow between present and past activities [27, 28].

In [25], the authors presented an extension of the LSTM using a gate function called the depth gate, and provided the following equations of the LSTM (1.1 to 1.5):

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1})                    (1.1)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1})                    (1.2)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1})        (1.3)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_{t-1})                    (1.4)
h_t = o_t \odot \tanh(c_t)                                                    (1.5)
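Read literally, equations (1.1)-(1.5) translate into the following numpy sketch of a single LSTM step with peephole connections; the weight shapes are hypothetical and biases are omitted, as in the equations above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    # W is a dict of weight matrices named after the subscripts in equations (1.1)-(1.5)
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] @ c_prev)    # (1.1) input gate
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] @ c_prev)    # (1.2) forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev)  # (1.3) cell state
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] @ c_prev)    # (1.4) output gate
    h_t = o_t * np.tanh(c_t)                                              # (1.5) hidden state
    return h_t, c_t

# hypothetical sizes: 4-dimensional input, 3-dimensional hidden and cell states
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(3, 4)) if k.startswith('x') else rng.normal(size=(3, 3))
     for k in ['xi', 'hi', 'ci', 'xf', 'hf', 'cf', 'xc', 'hc', 'xo', 'ho', 'co']}
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W)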

2.4: Software tools used

We used the library and tools provided by Keras to design our models. Keras is a compact and highly modular neural networks library. It is written in Python and is compatible with Python 2.7~3.5. Keras runs on top of a back-end; Theano and TensorFlow are both compatible back-ends, but Keras may use only one of them at a time. Models are the core data structure of Keras; a model is a way to organize layers. There are two types of models in Keras:

1. the Sequential model, and
2. the Model class used with the functional API (Keras functional API).

All of our experiments ran in a Sequential model.

2.4.1: Sequential model characteristics

The Keras Sequential model is a linear stack of layers, which can be created either by passing a list of layer instances directly to the constructor or by using the .add() method. It is necessary to specify the input shape for the model, and only the first layer in a Sequential model needs this information, as the following layers can do automatic shape inference. This can easily be done by passing an input_shape argument to the very first layer, describing a tuple of integers or None entries; the latter tells the model that any positive integer can be expected. One can pass a batch_input_shape argument instead, where the batch dimension is included. For 2D layers such as Dense, the input shape can be specified via the input_dim argument, whereas 3D temporal layers support two arguments, input_dim and input_length.

Before the Sequential model can be trained, it is necessary to configure the learning process, which is done with the .compile() method. The following three arguments are quite important for the model:

1. Optimizer: usually a string identifier of an existing optimizer, e.g. rmsprop.
2. Loss function: the objective function that the model tries to minimize; again, usually a string identifier of an existing loss function, e.g. binary_crossentropy.
3. Metrics: for now, only the accuracy metric is supported.

Once the previous steps are taken care of, we come to the part of training the model. Numpy ndarrays of input data and labels are used for Keras model training. The method used for training is .fit(). Some of the important arguments of the training function are as follows:

1. X - input data, as a Numpy array, or a list of Numpy arrays in case there are multiple inputs.
2. y - labels, also as a Numpy array.
3. batch_size - number of samples per gradient update.
4. nb_epoch - the number of epochs for training the model.
5. validation_data - tuple of (X, y) to be used as validation data.

This function returns a history object that holds a record of the training loss and metric values for each epoch, as well as the validation loss and validation metric values. By using functions from modules such as h5py it is possible to save the model weights, and it is possible to save the entire model as well using the applicable modules. The .evaluate() method takes input data and labels (X and y) as arguments and returns the scalar test loss, or a list of scalars, on the input data, batch by batch (configured by the batch_size argument). All the information in this section was based on the Keras documentation, which is very well documented and helpful [29].
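The workflow just described (build, .compile(), .fit(), .evaluate()) can be sketched as follows, in the style of the Keras 1.x API this chapter refers to; the layer sizes and the random toy data are placeholders, not our experimental configuration:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# toy data: 1000 samples with 20 features each, and binary labels
X = np.random.random((1000, 20))
y = np.random.randint(2, size=(1000,))

model = Sequential()
model.add(Dense(64, input_dim=20, activation='relu'))   # only the first layer declares the input shape
model.add(Dense(1, activation='sigmoid'))

# configure the learning process: optimizer, loss function and metrics
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

# .fit() returns a History object holding per-epoch loss and accuracy values
history = model.fit(X, y, batch_size=32, nb_epoch=5, validation_data=(X[:100], y[:100]))

# .evaluate() returns the test loss and the configured metrics, computed batch by batch
score = model.evaluate(X, y, batch_size=32)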

2.4.2: Embedding Layers

The whole idea of the Embedding layer, or word embedding, is an outgrowth of the recent innovation called word2vec [30]. Put simply, word2vec is a technique that converts words into unique discrete values and then maps each word into a continuous vector space. Likewise, Keras' Embedding layer takes positive integers as indexes and turns them into dense vectors of fixed size. To use an embedding, one must use it as the first layer in a model. The Embedding layer takes input_dim as its first argument, which is the size of the vocabulary, or in other words the number of unique words, such that the largest integer (i.e. word index) in the input should be no larger than input_dim - 1. If it is larger, there will be errors during the model run.
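This constraint on input_dim can be illustrated with a small sketch; the vocabulary size, output dimension and sequence length below are illustrative values, not the settings used in our experiments:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

vocab_size = 1000                       # input_dim: the largest token index must be at most 999
model = Sequential()
model.add(Embedding(vocab_size, 64, input_length=10))   # must be the first layer of the model
model.compile(optimizer='rmsprop', loss='mse')

tokens = np.random.randint(vocab_size, size=(32, 10))   # 32 sequences of 10 token indexes
vectors = model.predict(tokens)
print(vectors.shape)                    # (32, 10, 64): one dense 64-d vector per token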

3.0: Dataset details

The dataset we used is primarily a BRBT (Bangla and Romanized Bangla Text) dataset, based on the work of M. R. Amin [31]; the version used in our work varies mostly in terms of a modified number of posts and a bit more polishing. Currently, the Bangla Sentiment Analysis (SA) dataset consists of a total of 9337 post samples. The dataset is unique not only because it is really big but also because it encompasses the until-now-ignored Romanized Bangla. Romanized Bangla is Bangla written in the English alphabet. The inclusion of Romanized Bangla is paramount, because the ease of writing Bangla on any standard QWERTY keyboard (without a Bangla keyboard, e.g. the Bijoy keyboard) and the simplicity of using English as the base language for posts have lifted the popularity of Romanized Bangla not just in personal messages and micro-blogs but also in Govt.-sanctioned mass messages/announcements. The dataset is currently kept private for safe keeping and further improvement. However, it may be made available by personally contacting the owner/authors.

Figure 7: Bangla and Romanized Bangla data ratio

3.1: Data Statistics

Total number of entries: 9337 (number of rows in sheet: 9338)
Bangla entries: 6698 (number of rows in sheet: 6699)
Romanized Bangla entries: 2639 (number of rows in sheet: 2640)

3.1.1: Data Sources

Data were collected from various micro-blog sites, such as Facebook, Twitter and YouTube, as well as some online news portals, product review panels etc. The data sources break down as follows:

From Facebook: 4621
From Twitter: 2610
From YouTube: 801
From online news portals: 1255
From product review pages: 50

Figure 8: Data comparison by data source

3.1.2: Post collection data processing

Removal of emoticons: emoticons and hash-tags were removed to give annotators unbiased, text-only content on which to make a decision based on three criteria: positive, negative and ambiguous.

Removal of proper nouns: proper nouns were replaced with tags to provide ambiguity. All text samples were collected from publicly available sources and do not reflect the opinion of the authors. (The original text samples have been preserved but are not publicly available. These can be obtained by emailing the authors directly and signing the required consent form.)

Manual validation (by native speakers): collected data samples were manually annotated into one of three categories: positive (1), negative (0) and ambiguous (A). Each text sample was independently annotated by two different native Bangla-speaking individuals, for a total of two validations. Each annotator validated the data without knowing the decisions made by the other. This ensures that the validations are unbiased and personal.

Text Sample | 1st Annotator | 2nd Annotator
অয়নক ভোয়ো হয় য়ে গোন! (Translation: very nice song!) | Positive | Positive
মম মোপন্তক সক দুঘ মটনো ৩ জন পনহত্। (Translation: 3 dead in a tragic road accident.) | Negative | Negative
Chotobelar modhur din gulo khub miss kori (Translation: really miss the sweet childhood days) | Positive | Negative
Sympony er set gula kemon? (Translation: How are Symphony mobile sets?) | Positive | Ambiguous
আয়ো আয়ো তু্পম কখয়নো আমোর হয়বনো (Translation: Light, light, you'll never be mine) | Ambiguous | Negative

Table 2: Dataset validation samples

3.2: Double validation analysis

Table 3 gives us a better perspective of the agreement and disagreement between the first and second validations. Rows give the first validation's agreement/disagreement for all three annotation types (first row positive, next row negative, last row ambiguous) against the second validation's positive, negative and ambiguous annotations (in that order), and the columns do the same thing for the second validation's agreement/disagreement with the first validation. For example, the first row tells us that, for all positive annotations of the first validation, the second validation agreed on 2817 as positive and disagreed on 538 as negative and 392 as ambiguous. We can also see that there were a total of 2817 + 538 + 392 = 3747 positive annotations in the first validation, of which the second validation agreed with 2817 and disagreed with 3747 - 2817 = 930; in other words, the second annotator agreed with the first annotator about 75% of the time for positive annotations, and so on. That is why this confusion matrix is of great importance for doing all sorts of analysis on the two validations.

                      Second Validation
First Validation   Positive   Negative   Ambiguous
Positive               2817        538         392
Negative                178       3864         404
Ambiguous                27         95        1022

Table 3: Confusion matrix of the two validations
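The agreement rates quoted above follow directly from Table 3; a small numpy sketch of the calculation:

import numpy as np

# rows: first validation (positive, negative, ambiguous)
# columns: second validation (positive, negative, ambiguous)
confusion = np.array([[2817,  538,  392],
                      [ 178, 3864,  404],
                      [  27,   95, 1022]])

row_totals = confusion.sum(axis=1)            # annotations per class in the first validation
agreement = np.diag(confusion) / row_totals   # fraction on which the second annotator agreed
print(agreement)   # roughly [0.75, 0.87, 0.89] for positive, negative and ambiguous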

3.3: Dataset preparation

We prepared the data from the dataset for our convenience, so that we can easily access any specific type of data (e.g. Bangla posts only, Romanized Bangla posts only etc.), and store and distribute it; this way one can reuse parts (or the whole) of the dataset for his/her experiments with the models without accessing the actual dataset (the xlsx file). We used Python's pickling mechanism to make pickle files serializing the data from the datasheets. We have explained in detail how we prepared the dataset in chapter 4 (Methodology). We have uploaded all the .pkl files (along with the Python code) to GitHub under public access [32].

3.3.1: Pickle file details

Although the readme files attached to the GitHub repository have all the information a potential experimenter needs, we give here some details of what the repository holds and what each part does, and explain the methodology in chapter 4. There are two folders:

1. pickled-sheets, and
2. pickled-sheet-split.

Pickled-sheets (folder):

This folder holds a single .pkl.gz file for each individual sheet in the BRBT dataset (three in total). Each .pkl file consists of a shuffled Numpy array of [[data], [label1], [label2]], where data means the tokens from the tokenized strings, and label1 and label2 are the first and second validations respectively.

1. Sentiment_Analysis.pkl.gz for the sheet that has both Bangla and Romanized Bangla posts.
2. Bangla_Sentiment_Analysis.pkl.gz for the sheet with only the Bangla posts.
3. Romanized_Bangla_Sentiment_Analysis.pkl.gz for the Romanized Bangla sheet only.

Pickled-sheet-split (folder):

This folder holds .pkl.gz files for each .pkl file from the "pickled-sheets" folder, split into three sets: training, testing and validation. So for each "pickled-sheets" file there are three files, for a total of 3x3 = 9 files. Each split set is a Numpy array of [[data], [label1], [label2]], where data means the tokens, and label1 and label2 correspond to the first and second validations of the data. The intent was to keep 80% of the total data as the training set, 15% for testing and 5% for validation; however, the ratio could not be maintained exactly in most cases. The exact lengths taken for each datasheet are as follows.

Sheet1 (Sentiment_Analysis.pkl.gz): total length 9337
1. Training set length: 7500 (brbt_split_train.pkl.gz)
2. Validation set length: 500 (brbt_split_validate.pkl.gz)
3. Test set length: 1337 (brbt_split_test.pkl.gz)

Bangla (Bangla_Sentiment_Analysis.pkl.gz): total length 6698
1. Training set length: 5400 (bangla_split_train.pkl.gz)
2. Validation set length: 400 (bangla_split_validation.pkl.gz)
3. Test set length: 898 (bangla_split_test.pkl.gz)

Romanized Bangla (Romanized_Bangla_Sentiment_Analysis.pkl.gz): total length 2639
1. Training set length: 2200 (rb_split_train.pkl.gz)
2. Validation set length: 150 (rb_split_validation.pkl.gz)
3. Test set length: 289 (rb_split_test.pkl.gz)

This folder also contains a simple Python script, split_three_ways.py, that reads the .pkl.gz files found in the pickled-dataset folder and splits them into the three sets mentioned above, based on the lengths defined in the source code. It is quite basic in terms of coding, and its methods are self-explanatory for users who would want to try out different lengths for their sets. The code expects the pickled files to be in the same directory as the code file.
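For a reader who wants to load these files, a minimal un-pickling sketch might look like the following; it assumes Python 3, gzip-compressed pickles and the [[data], [label1], [label2]] layout described above (depending on how the files were written, a Python 2/cPickle protocol or an encoding argument to pickle.load may be needed):

import gzip
import pickle

# file name taken from the list above
with gzip.open('brbt_split_train.pkl.gz', 'rb') as f:
    sheet = pickle.load(f)                    # shuffled array of [[data], [label1], [label2]]

data, label1, label2 = sheet[0], sheet[1], sheet[2]
print(len(data), len(label1), len(label2))    # token sequences and the two validations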

4.0: Dataset setup

In this chapter we discuss the methods used for the collection and preparation of the dataset, setting up the models for the experiments, labeling each experiment, and explaining our models and experiments.

We have already discussed the statistical details of our dataset in the previous chapter (chapter 3). In this section we briefly discuss the methods used for data collection and for setting up the dataset to make it research-ready, not just for ourselves but for other interested researchers as well.

The data was manually picked from various online micro-blog sites, product review panels, news portals etc. For tweets, the 'bn' parameter was used in the search option to access Bangla tweets only. There are over 10000 total Bangla and Romanized Bangla posts in the dataset [31].

We checked for empty rows or columns, missing annotations, proper tagging (for the dataset with proper nouns replaced), proper categorization etc. The resulting dataset is now both unique and free of the abovementioned flaws.

Two additional sheets were added: one for Bangla texts only and the other for Romanized Bangla posts only. The code was written in Python, and modules such as openpyxl and cPickle were used in scripts to automate tasks such as the following (see the sketch after this list):

1. Reading data from the "xlsx" files
2. Converting textual data into tokens
3. Saving the data as a tuple ([data], [label1], [label2])
4. Applying a random shuffle on the Numpy array converted from the simple tuple
5. Serializing each datasheet, splitting three sets from each, and making them publicly available for download, so that others can un-pickle and use them in their models.

For our experiments we also applied the tokenizing, splitting and serializing scripts on the "full-text" column (the unmodified texts of the dataset with all the proper nouns, emoticons etc. intact), hence creating additional sets of pickle files. However, we did not make these publicly available, as they were only produced for experimental purposes.
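A rough sketch of that preparation pipeline is given below; the workbook name, the column layout (text plus the two annotations) and the use of the Keras tokenizer are our illustrative assumptions, not the exact scripts used:

import gzip
import pickle
import numpy as np
from openpyxl import load_workbook
from keras.preprocessing.text import Tokenizer

# 1. read texts and the two annotations from the workbook (column positions assumed)
ws = load_workbook('BRBT_dataset.xlsx').active
rows = [(r[0].value, r[1].value, r[2].value) for r in ws.iter_rows(min_row=2) if r[0].value]
texts, label1, label2 = zip(*rows)

# 2. convert the textual data into integer tokens
tok = Tokenizer()
tok.fit_on_texts(texts)
data = tok.texts_to_sequences(texts)

# 3-4. pack as ([data], [label1], [label2]) and apply a random shuffle
idx = np.random.permutation(len(data))
packed = np.array([[data[i] for i in idx],
                   [label1[i] for i in idx],
                   [label2[i] for i in idx]], dtype=object)

# 5. serialize the sheet as a compressed pickle file
with gzip.open('Sentiment_Analysis.pkl.gz', 'wb') as f:
    pickle.dump(packed, f)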

5.0: Model Implementation

Our dataset consists of three categories:

1. Positive,
2. Negative, and
3. Ambiguous.

Depending on the dataset used and the number of categories classified, we used three variants of the fully connected neural network layer, known as the Dense layer in Keras: Dense(1), Dense(2) and Dense(3) (Figures 9, 10 and 11 respectively).

Figure 9: Dense(1) model (80 tokens -> Embedding layer -> LSTM layer (128) -> Dense(1))
Figure 10: Dense(2) model (80 tokens -> Embedding layer -> LSTM layer (128) -> Dense(2))
Figure 11: Dense(3) model (80 tokens -> Embedding layer -> LSTM layer (128) -> Dense(3))

While Dense(1) is sufficient to output 1 and 0 values (1 for positive and 0 for negative), when categorical crossentropy is used as the loss and the Ambiguous category is taken into consideration ('A' changed to the integer value 2), we used Dense(3). Dense(2) was used for the "Ambiguous removed" experiment sets, where we omitted data entries with an 'A' annotation (by either the 1st or the 2nd validation) and only positives and negatives were counted; in this case, 0 and 1 go to two different neurons instead of one.

We used the data of one validation set as pre-training for the other validation set. More specifically, we first fit the data with the 1st-validation labels to the model, to pre-train it for the 2nd-validation data, which is then fit to the same model afterwards. Likewise, we fit the data with the 2nd-validation labels to pre-train for the 1st-validation data. This sort of pre-training was done to check whether it can be useful to pre-train on independently annotated sentiment analysis data even if the labels do not match.

6.0: Experiments

In this chapter we discuss everything about the experiments: the setup, the model labels and tags used to distinguish the different experiments, and tables showing all experiments by label, number of epochs, dense layer size, type of max_features etc.

6.1: Experimental setup

Our model is based on Recurrent Neural Networks (RNN); more specifically, we used an LSTM neural network. We used Keras' model-level library, since it has all the required features to help us develop our deep learning model, with Theano as the back-end for Keras. All our models are Keras Sequential models. The first layer of the Sequential model is the Embedding layer, which we used to implement the word-to-vector representation of the words in our dataset. We used a variable named max_features as the input dimension argument for the Embedding layer; it corresponds to the highest token value returned by the tokenizer during tokenization of our words, which means that max_features is also the vocabulary size (input_dim). The value of max_features must be equal to or higher than the vocabulary size for the model to run without a runtime error. The second layer is a Long Short Term Memory (LSTM) layer with an internal state of 128 dimensions. The third is a fully connected NN layer, known in Keras terminology as a Dense layer. Usually a one-dimensional Dense layer is used, which outputs to a single neuron; for our model implementation we need to work with two types of values, positive and negative, represented by 1 and 0 respectively, and a single output neuron can hold the values 0 and 1. However, for some experiments we needed Dense layers of more than one dimension: we used a two-dimensional Dense layer, and a three-dimensional Dense layer with the categorical loss function, to include the additional value, Ambiguous or neutral.

Figure 12: Model schematic (80 tokens -> Embedding layer -> LSTM layer (128) -> Dense layer (1/2/3))

The input to our Sequential model is a series of tokens; this is the reason we tokenized the words from our dataset first, during data preparation. For the input we took a maximum of 80 tokens at a time. The consequence of this is that our proposed model cannot process more than 80 words at a time; however, that may not be much of a limitation, since 80 words at a time is still large enough. We applied the 'sigmoid' activation function to the output. Depending on our data and labels, we used both 'binary_crossentropy' and 'categorical_crossentropy' as loss functions. Dropouts of 0.2 were used in both the Embedding layer and the LSTM layer, which helps reduce overfitting by randomly setting a fraction of the input units to 0 at each update during training [33]. A sketch of this setup, together with the pre-training procedure of chapter 5, follows below.
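The setup above can be sketched as follows in the Keras 1.x-style API used in this work, here for the binary "Ambiguous removed" case; the synthetic data, the variable names and the rmsprop optimizer are illustrative choices rather than the exact training script:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Activation
from keras.preprocessing import sequence

maxlen = 80              # at most 80 tokens per post
max_features = 35000     # vocabulary size; must be at least the highest token index + 1

# synthetic stand-ins for the pickled token sequences and the two validations' labels
X = [list(np.random.randint(1, max_features, size=np.random.randint(5, 60))) for _ in range(500)]
y_val1 = np.random.randint(2, size=500)      # 1st-validation labels (0/1)
y_val2 = np.random.randint(2, size=500)      # 2nd-validation labels (0/1)
X = sequence.pad_sequences(X, maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_features, 128, dropout=0.2))    # word-to-vector layer, dropout 0.2
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))      # 128-dimensional internal state
model.add(Dense(1))                                     # Dense(1) for the binary setup
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# pre-training as in chapter 5: fit on one validation's labels, then continue on the other
model.fit(X, y_val1, batch_size=32, nb_epoch=5)         # pre-train on the 1st validation
model.fit(X, y_val2, batch_size=32, nb_epoch=5)         # continue training on the 2nd validation
score, acc = model.evaluate(X, y_val2, batch_size=32)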

6.2: Experiment model label tags

There are 36 unique experiments using the same LSTM model, depending on the dataset used, the processing of the texts, the loss function used, the processing of the labels (annotations on the data), and the input_dim value for the Embedding layer. This turns into a total of 72 experiments: one half of the experiments where label 1 (1st validation) is used for pre-training, and the other half where label 2 (2nd validation) is used for pre-training. The following are the tags used in the experiment labels and what they mean.

Tags used for the different types of dataset:

Dataset type | Tag used in experiment labels
Bangla and Romanized Bangla (total) | brbt
Bangla (only) | bangla
Romanized Bangla (only) | rb

Tags used depending on the processing of texts/posts:

Processing of texts | Tag used in experiment labels
Proper nouns replaced and other modifications | PN
Full texts (no modification) | FT

Tags used based on the loss function:

Loss function used | Tag used in experiment labels
binary_crossentropy | bin
categorical_crossentropy | cat

Tags used based on annotation data modification:

Annotation data modification | Tag used in experiment labels
Annotation value of 'A' removed (label along with data removed) | ra
Annotation value of 'A' converted to 2 | ato2

Tags used based on the type of max_features applied:

max_features type | Tag used in experiment labels
Non-fixed, ranging from 20,000 ~ 40,000 depending on the dataset type and size | 1
Value fixed at 500 | 2

6.3: Experiment model table

The following table has all 36 experiment labels and their other significant specifications, which were the same for both sets of pre-training. These labels denote the experiment set where label 1 is used for pre-training; for the alternate experiments, where label 2 is used for pre-training, the only change is an 'ALT' prefix on the experiment label.

Experiment label | Dense layer dimension | max_features value | Number of epochs
brbt_bin_PN_ra_1 | 1 | 35000 | 50
brbt_bin_PN_ra_2 | 1 | 500 | 50
brbt_bin_FT_ra_1 | 1 | 40000 | 50
brbt_bin_FT_ra_2 | 1 | 500 | 50
brbt_cat_PN_ra_1 | 2 | 35000 | 25
brbt_cat_PN_ra_2 | 2 | 500 | 25
brbt_cat_FT_ra_1 | 2 | 40000 | 25
brbt_cat_FT_ra_2 | 2 | 500 | 25
brbt_cat_PN_ato2_1 | 3 | 35000 | 50
brbt_cat_PN_ato2_2 | 3 | 500 | 50
brbt_cat_FT_ato2_1 | 3 | 35000 | 50
brbt_cat_FT_ato2_2 | 3 | 500 | 50
bangla_bin_PN_ra_1 | 1 | 35000 | 25
bangla_bin_PN_ra_2 | 1 | 500 | 25
bangla_bin_FT_ra_1 | 1 | 28000 | 25
bangla_bin_FT_ra_2 | 1 | 500 | 25
bangla_cat_PN_ra_1 | 2 | 35000 | 25
bangla_cat_PN_ra_2 | 2 | 500 | 25
bangla_cat_FT_ra_1 | 2 | 35000 | 25
bangla_cat_FT_ra_2 | 2 | 500 | 25
bangla_cat_PN_ato2_1 | 3 | 40000 | 25
bangla_cat_PN_ato2_2 | 3 | 500 | 25
bangla_cat_FT_ato2_1 | 3 | 40000 | 25
bangla_cat_FT_ato2_2 | 3 | 500 | 25
rb_bin_PN_ra_1 | 1 | 20000 | 25
rb_bin_PN_ra_2 | 1 | 500 | 25
rb_bin_FT_ra_1 | 1 | 25000 | 25
rb_bin_FT_ra_2 | 1 | 500 | 25
rb_cat_PN_ra_1 | 2 | 20000 | 25
rb_cat_PN_ra_2 | 2 | 500 | 25
rb_cat_FT_ra_1 | 2 | 20000 | 25
rb_cat_FT_ra_2 | 2 | 500 | 25
rb_cat_PN_ato2_1 | 3 | 20000 | 25
rb_cat_PN_ato2_2 | 3 | 500 | 25
rb_cat_FT_ato2_1 | 3 | 35000 | 25
rb_cat_FT_ato2_2 | 3 | 500 | 25

Table 4: Experiment labels table

7.0: Results and discussion

7.1: Result table

The following table holds the results of the experiments where the 2nd validation is pre-trained with the 1st validation data:

Experiment label | Validation used | Test score (loss) | Test accuracy
brbt_bin_PN_ra_1 | 1st validation | 1.88182373031 | 0.623299319728
                 | 2nd validation | 1.65080066764 | 0.679593720705
brbt_bin_PN_ra_2 | 1st validation | 0.95320324427 | 0.593537415169
                 | 2nd validation | 1.10312330084 | 0.632502309063
brbt_bin_FT_ra_1 | 1st validation | 1.84389834437 | 0.627986347919
                 | 2nd validation | 1.4472491316 | 0.691244240345
brbt_bin_FT_ra_2 | 1st validation | 0.913010644424 | 0.622866894401
                 | 2nd validation | 1.02413836789 | 0.639631336625
brbt_cat_PN_ra_1 | 1st validation | 1.49438894849 | 0.636904761905
                 | 2nd validation | 1.58974196519 | 0.660203139923
brbt_cat_PN_ra_2 | 1st validation | 1.00040410733 | 0.577380952786
                 | 2nd validation | 1.19003349273 | 0.62973222486
brbt_cat_FT_ra_1 | 1st validation | 1.46728742489 | 0.654436859661
                 | 2nd validation | 1.43848987329 | 0.688479261904
brbt_cat_FT_ra_2 | 1st validation | 0.703368600115 | 0.640784983139
                 | 2nd validation | 0.774347296124 | 0.658064515854
brbt_cat_PN_ato2_1 | 1st validation | 2.37038821664 | 0.529543754318
                 | 2nd validation | 2.8942532849 | 0.519820494088
brbt_cat_PN_ato2_2 | 1st validation | 1.40178696362 | 0.507105460253
                 | 2nd validation | 1.69921113683 | 0.471204188281
brbt_cat_FT_ato2_1 | 1st validation | 2.43535874321 | 0.546746447716
                 | 2nd validation | 2.65894489638 | 0.519820493286
brbt_cat_FT_ato2_2 | 1st validation | 1.24154179383 | 0.519072550219
                 | 2nd validation | 1.56229666569 | 0.501121914378
bangla_bin_PN_ra_1 | 1st validation | 1.4950274012 | 0.625790138914
                 | 2nd validation | 1.41194984732 | 0.6910344828
bangla_bin_PN_ra_2 | 1st validation | 0.84057880322 | 0.608091023568
                 | 2nd validation | 0.902013720808 | 0.649655173154
bangla_bin_FT_ra_1 | 1st validation | 1.59519414057 | 0.61772151944
                 | 2nd validation | 1.57762829749 | 0.675900277091
bangla_bin_FT_ra_2 | 1st validation | 0.771766250948 | 0.593670885321
                 | 2nd validation | 0.874607833799 | 0.639889197171
bangla_cat_PN_ra_1 | 1st validation | 1.53535538588 | 0.633375473933
                 | 2nd validation | 1.3721139773 | 0.707586207554
bangla_cat_PN_ra_2 | 1st validation | 0.818257555196 | 0.60809102387
                 | 2nd validation | 0.898441794659 | 0.663448275903
bangla_cat_FT_ra_1 | 1st validation | 1.54158453458 | 0.635443038126
                 | 2nd validation | 1.50628914397 | 0.667590027783
bangla_cat_FT_ra_2 | 1st validation | 0.783895753758 | 0.61772151944
                 | 2nd validation | 0.902959698125 | 0.649584488195
bangla_cat_PN_ato2_1 | 1st validation | 2.21905702797 | 0.525027808676
                 | 2nd validation | 2.35434000338 | 0.533926585161
bangla_cat_PN_ato2_2 | 1st validation | 1.11962867512 | 0.516129032258
                 | 2nd validation | 1.28997873955 | 0.526140155762
bangla_cat_FT_ato2_1 | 1st validation | 2.15015999162 | 0.530589544004
                 | 2nd validation | 2.39394572473 | 0.529477196885
bangla_cat_FT_ato2_2 | 1st validation | 1.06338288429 | 0.539488320389
                 | 2nd validation | 1.32858297875 | 0.521690767519
rb_bin_PN_ra_1 | 1st validation | 1.52705907256 | 0.608695648876
                 | 2nd validation | 1.67791411082 | 0.6375
rb_bin_PN_ra_2 | 1st validation | 0.954761880424 | 0.612648221108
                 | 2nd validation | 1.15460999012 | 0.65
rb_bin_FT_ra_1 | 1st validation | 1.22663058467 | 0.682203389831
                 | 2nd validation | 1.32435384307 | 0.638766519824
rb_bin_FT_ra_2 | 1st validation | 0.859760753179 | 0.648305085756
                 | 2nd validation | 1.23520370973 | 0.621145374449
rb_cat_PN_ra_1 | 1st validation | 1.45886840415 | 0.62450592814
                 | 2nd validation | 1.81545053323 | 0.616666666667
rb_cat_PN_ra_2 | 1st validation | 1.04351792529 | 0.59288537478
                 | 2nd validation | 1.09681313038 | 0.6375
rb_cat_FT_ra_1 | 1st validation | 1.11829374705 | 0.648305083736
                 | 2nd validation | 1.29434119126 | 0.665198237885
rb_cat_FT_ra_2 | 1st validation | 0.933010570074 | 0.610169490515
                 | 2nd validation | 1.13431040043 | 0.656387665461
rb_cat_PN_ato2_1 | 1st validation | 1.85035417814 | 0.477508650519
                 | 2nd validation | 2.3243691055 | 0.456747404844
rb_cat_PN_ato2_2 | 1st validation | 1.31008294462 | 0.508650519031
                 | 2nd validation | 1.47633354969 | 0.463667820069
rb_cat_FT_ato2_1 | 1st validation | 2.00189424318 | 0.525951557093
                 | 2nd validation | 2.16044338199 | 0.505190311419
rb_cat_FT_ato2_2 | 1st validation | 1.56623152981 | 0.536332179931
                 | 2nd validation | 1.91411103592 | 0.512110726644

Table 5: Experiment results for pre-training label 2

Experiment label | Validation used | Test score (loss) | Accuracy
ALTbangla_bin_FT_ra_1 | 1st validation | 1.67468570637 | 0.6602
                 | 2nd validation | 1.38265603005 | 0.7825
ALTbangla_bin_FT_ra_2 | 1st validation | 0.963279091859 | 0.6741
                 | 2nd validation | 0.800574915561 | 0.7704
ALTbangla_cat_FT_ra_1 | 1st validation | 1.5892310106 | 0.6407
                 | 2nd validation | 1.41864707728 | 0.7523
ALTbangla_cat_FT_ra_2 | 1st validation | 0.903364171258 | 0.6713
                 | 2nd validation | 0.769961072963 | 0.7855

Table 6: Experiment results for pre-training label 1

Table 5 shows the results of the half of the experiments where the 1st validation was used for pre-training, and Table 6 shows the completed experiments from the alternate set. In Table 5, the highest accuracy was attained on the Bangla dataset with categorical crossentropy loss, modified text, Ambiguous removed and non-fixed max_features, with about 70% accuracy, which is 20% more than chance for a two-category dataset. The corresponding experiment on the BRBT dataset with categorical loss, modified text and ambiguous converted to 2 has a lower accuracy score of 55%, but for a three-category task it scores 22% more than chance (33%). Therefore, it is clear that most of the experiment sets (dataset-wise, PN-FT tag-wise, loss function-wise, and label category-wise) scored above chance. However, none of the experiments with a fixed max_features (vocabulary size for the Embedding layer) scored well compared to the non-fixed variants.

Following are the graphs for some of the experiments with high accuracy scores:

Figure 13: loss-val_loss graph for bangla_cat_PN_ra_1 (2nd validation)
Figure 14: acc-val_acc graph for bangla_cat_PN_ra_1 (2nd validation)
Figure 15: loss-val_loss graph of bangla_bin_PN_ra_1 (2nd validation)
Figure 16: acc-val_acc graph of bangla_bin_PN_ra_1 (2nd validation)

8.0: Conclusion

Our goals in this project were:

1. Pre-processing the data in a way that makes it readily usable by researchers.
2. Application of deep recurrent models on a Bangla and Romanized Bangla text corpus.
3. Pre-training on the dataset of one label for the other (and vice versa) to demonstrate its usefulness.

In meeting our goals, we pre-processed a BRBT (Bangla and Romanized Bangla Text) dataset with a total of 9337 entries, 6698 of which are Bangla and 2639 Romanized Bangla texts. The dataset was then split and serialized into training, testing and validation sets of the lengths defined in section 3.3.1, and made publicly available so that it can be used by researchers.

For our experiments, we applied LSTM, which is a deep recurrent model. There are a total of 36 different experiments based on the same model, differing only in the dataset used, the loss function applied, the modifications done (or not) on the data (proper nouns replaced with tags, duplicate removal etc.) and so on (this has been discussed in detail in chapter 6). While most of the experiments scored an accuracy higher than chance, the Bangla dataset with categorical crossentropy as the loss function, non-fixed max_features for the Embedding layer and "Ambiguous removed" scored highest, with 78% accuracy for two categories (comparing results from both pre-training sets of experiments), and the Bangla and Romanized Bangla dataset (modified text set) with categorical crossentropy loss, non-fixed max_features and "Ambiguous converted to 2" scored highest for three categories, with 55% accuracy.

Our implementation of pre-training on the dataset of one label for the other has shown that it is useful to pre-train on independently annotated SA data even if the labels do not match. Due to time constraints we could not finish all 36 alternate experiments, which use label 2 for pre-training for the label 1 data; we intend to complete these before writing the paper for this research. However, the four experiments done from the alternate experiment set showed consistent results on the 2nd validation data (label 2).

References:

1. Olah, C., Understanding LSTM Networks. 2016.
2. Banglapedia. Bangla Language. Available from: http://en.banglapedia.org/index.php?title=Bangla_Language.
3. Liu, B., Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 2012. 5(1): p. 1-167.
4. Das, A. and S. Bandyopadhyay, Subjectivity detection in English and Bengali: A CRF-based approach. Proceeding of ICON, 2009.
5. Khan, S., Convergence in spelling, and spell-checker for Romanized Bangla in computers and mobile phones. In Informatics, Electronics & Vision (ICIEV), 2014 International Conference on. 2014. IEEE.
6. LeCun, Y., Y. Bengio, and G. Hinton, Deep learning. Nature, 2015. 521(7553): p. 436-444.
7. Nasukawa, T. and J. Yi, Sentiment analysis: Capturing favorability using natural language processing. In Proceedings of the 2nd International Conference on Knowledge Capture. 2003. ACM.
8. Pang, B., L. Lee, and S. Vaithyanathan, Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10. 2002. Association for Computational Linguistics.
9. Das, S. and M. Chen, Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference (APFA). 2001. Bangkok, Thailand.
10. Wiebe, J., Learning subjective adjectives from corpora. In AAAI/IAAI. 2000.
11. Maas, A.L., et al., Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. 2011. Association for Computational Linguistics.
12. Medhat, W., A. Hassan, and H. Korashy, Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 2014. 5(4): p. 1093-1113.
13. Godbole, N., M. Srinivasaiah, and S. Skiena, Large-scale sentiment analysis for news and blogs. ICWSM, 2007. 7(21): p. 219-222.
14. Kumar, A. and T.M. Sebastian, Sentiment analysis on Twitter. IJCSI International Journal of Computer Science Issues, 2012. 9(4): p. 372-373.
15. Das, D. and S. Bandyopadhyay, Developing Bengali WordNet Affect for analyzing emotion. In International Conference on the Computer Processing of Oriental Languages. 2010.
16. Patra, B.G., et al., Shared task on sentiment analysis in Indian languages (SAIL) tweets - an overview. In International Conference on Mining Intelligence and Knowledge Exploration. 2015. Springer.
17. Chowdhury, S. and W. Chowdhury, Performing sentiment analysis in Bangla microblog posts. In Informatics, Electronics & Vision (ICIEV), 2014 International Conference on. 2014. IEEE.
18. Lenat, D.B. and R.V. Guha, Building large knowledge-based systems; representation and inference in the Cyc project. 1989: Addison-Wesley Longman Publishing Co., Inc.
19. Bengio, Y., A. Courville, and P. Vincent, Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013. 35(8): p. 1798-1828.
20. Goodfellow, I., Y. Bengio, and A. Courville, Deep Learning. 2016.
21. Gershenson, C., Artificial neural networks for beginners. arXiv preprint cs/0308031, 2003.
22. Jain, A.K., J. Mao, and K.M. Mohiuddin, Artificial neural networks: A tutorial. IEEE Computer, 1996. 29(3): p. 31-44.
23. McCulloch, W.S. and W. Pitts, A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 1943. 5(4): p. 115-133.
24. Bullinaria, J.A., Recurrent neural networks. Neural Computation: Lecture, 2013. 12.
25. Yao, K., et al., Depth-Gated Recurrent Neural Networks. arXiv preprint arXiv:1508.03790, 2015.
26. Elman, J.L., Finding structure in time. Cognitive Science, 1990. 14(2): p. 179-211.
27. Hochreiter, S. and J. Schmidhuber, Long short-term memory. Neural Computation, 1997. 9(8): p. 1735-1780.
28. Gers, F.A., J. Schmidhuber, and F. Cummins, Learning to forget: Continual prediction with LSTM. Neural Computation, 2000. 12(10): p. 2451-2471.
29. Chollet, F., Keras. 2015; Available from: https://github.com/fchollet/kera.
30. Mikolov, T. and J. Dean, Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 2013.
31. Amin, M.R., BRBT: A dataset of Bangla and Romanized Bangla Texts for Sentiment Analysis. 2016, University of Liberal Arts Bangladesh.
32. Hassan, A., Repository for BRBT pickle files. 2016; Available from: https://github.com/Asif-Hassan/BRBT-dataset-pickles.
33. Srivastava, N., et al., Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014. 15(1): p. 1929-1958.