research article improving feature representation based on a...

Research ArticleImproving Feature Representation Based on a NeuralNetwork for Author Profiling in Social Media Texts

Helena Goacutemez-Adorno1 Ilia Markov1 Grigori Sidorov1 Juan-Pablo Posadas-Duraacuten2

Miguel A Sanchez-Perez1 and Liliana Chanona-Hernandez2

1 Instituto Politecnico Nacional (IPN) Centro de Invetigacion en Computacion (CIC) Mexico City Mexico2Instituto Politecnico Nacional (IPN) Escuela Superior de Ingenierıa Mecanica y Electrica Unidad Zacatenco (ESIME-Zacatenco)Mexico City Mexico

Correspondence should be addressed to Helena Gomez-Adorno helenaadornogmailcom

Received 30 January 2016 Revised 19 July 2016 Accepted 14 August 2016

Academic Editor Francesco Camastra

Copyright copy 2016 Helena Gomez-Adorno et alThis is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

We introduce a lexical resource for preprocessing social media data We show that a neural network-based feature representation isenhanced by using this resourceWe conducted experiments on the PAN 2015 and PAN 2016 author profiling corpora and obtainedbetter results when performing the data preprocessing using the developed lexical resource The resource includes dictionaries ofslang words contractions abbreviations and emoticons commonly used in social media Each of the dictionaries was built for theEnglish Spanish Dutch and Italian languages The resource is freely available

1 Introduction

Nowadays the messages extracted from social media arebeing used for various purposes in the research area of naturallanguage processing (NLP) and its practical applicationsTasks such as sentiment analysis author profiling authoridentification opinion mining plagiarism detection andtasks related to computing text similarity among others relyon social media content in order to develop robust systemsthat help the decision making process in marketing politicseducation forensics and so forth

Approaches based on neural networks (word embed-dings) for unsupervised feature representation often do notperform data cleaning [1 2] considering that the networkitself would solve the related problems These approachestreat special characters such as ldquordquo ldquordquo ldquordquo ldquordquo ldquordquo and user-name mentions (ldquordquo) as a regular word [1 3] Still in someworks where word embeddings are used the use of basic datacleaning (stop words removal URL filtering removal of rareterms etc) significantly improves the feature representationand consequently the results of the classification task [4ndash6]

One of the problems with the content of social mediamessages is that usually they contain a large number andvariety of nonstandard language expressions [7 8] The mainproblem is the nonstandardized writing style and irregularityof the language features used in this type of platforms due tothe short nature of the messages most of the authors use alarge vocabulary of slang words abbreviations and emoti-cons [9] Slang words are not considered as a part of thestandard vocabulary of a language and they are mostly usedin informal messages Abbreviations are shortened forms of aword or name that are used in order to replace the full formsEmoticons usually convey the current feeling of the messagewriter The sets of slang words and abbreviations are specificfor each language and therefore the systems that performprocesses over social media messages require specific dic-tionaries for each language In general emoticons are moreuniversal

The main goal of this work consists in development andevaluation of usefulness of a lexical resource which containsdictionaries of abbreviations contractions slang words andemoticons This resource allows improvement of the feature

Hindawi Publishing CorporationComputational Intelligence and NeuroscienceVolume 2016 Article ID 1638936 13 pageshttpdxdoiorg10115520161638936

2 Computational Intelligence and Neuroscience

representation obtained by a well-known neural networkmethod Doc2vec [1] We consider that our dictionaries canhelp in achieving better results in tasks based on social mediadata since they allow standardizing nonstandard languageexpressions that are used in a different way by differentauthors We evaluate our hypothesis on the author profiling(AP) task which aims at predicting the age and gender of anauthor of a given text [10] In our research we used Twittermessages However given that the writing style in social net-works like Twitter Facebook and Instagram among othersis similar these dictionaries are useful for preprocessing andcleaning messages obtained from all these social networks

The rest of the paper is structured as follows Section 2describes related work Section 3 explains the methodologyused for preparation of each dictionary and presents thestructure of the dictionaries Section 4 describes a neuralnetwork-based feature representation Section 5 presents thecase study for the author profiling task Finally Section 6draws the conclusions from this work and points to thepossible directions of future work

2 Related Work

With the unprecedentedly growing amount of social mediadata available on the Internet and the rapid expansion in user-generated content text preprocessing using correspondinglexical resources is becoming more and more crucial for sub-sequent accurate text analysis In this section we present sev-eral works that demonstrate the importance of the text pre-processing step and its usefulness in achieving better resultsfor different NLP tasks particularly for the tasks that useneural network-based feature representation

The challenges that social media offers to NLP werediscussed in detail in [11] Further we focus on the works(usually not related with word embeddings) that considerdifferent preprocessing approaches in order to solve thesechallenges

Clark and Araki [12] discuss the major problems relatedto processing social media messages written in EnglishThe authors improved the performance of open sourcespellcheckers on Twitter data by developing a preprocessingsystem for automatically normalizing casual social mediaEnglishThe authors report that when using their preprocess-ing system average errors per sentence decreased from 15to less than 5

The role of preprocessing has gained much importanceespecially when dealing with social media data It has beenproven that an appropriate text preprocessing for the taskof sentiment analysis (on social media data) significantlyimproves the results for this task [13ndash16] Some of the mostcommonly used preprocessing steps include removing URLsspecial characters and repeated letters from a word expand-ing abbreviations removing stop words handling nega-tions and performing stemming Besides preprocessing hasproved to be useful even when dealing with other types oftextual data for example source codes [17]

According to the overview papers of the author profiling(AP) task at PAN 2013 [18] PAN 2014 [19] and PAN 2015 [10]many participants carried out some kind of preprocessing

mainly for removing HTML tags from the tweets and sec-ondly for handling hashtags URLs and username mentionsin different ways

In PAN 2015 there was only one team [20] that removedall character sequences representing emojis in the originaltweets Unlike emoticons emojis are typically represented byUnicode characters so in this work their use was replacedby unknown character markers In the same work theauthors used a multi-input dependency parser [21] capable ofidentifying positive and negative emoticons for their furtheruse as features in the classification process

Several research teams that participated in the AP taskat PAN (editions 2013 2014 and 2015) exploited the use ofemoticons and slang words as stylistic and content featuresrespectively Some of these teams extracted emoticons usingregular expressions [22 23]Other research teams built lexicalresources for normalizing emoticons and Out Of Vocabulary(OOV) words with their corresponding normalized terms[24] however the authors did not publish these resourcesmaking it difficult for others to reproduce their results Therest of the teams that also used emoticons in their works didnot mention any information concerning how the emoticonswere identified in the corpus nor the source where they wereobtained from (in case they used a dictionary) [25ndash29]

With respect to the use of slang words in the AP taskGoswami et al [30] added slang words and the average lengthof sentences to the feature set improving accuracy to 803in age group identification and to 892 in gender detectionFarias et al [28] and Diaz and Hidalgo [31] used dictionariesextracted from the web (httpwwwchatslangcomtermscommon [last access 01072016] all other URLs in this docu-ment were also verified on this date) However these lists arevery short in comparisonwith our emoticons and slangwordsdictionaries which were collectedmanually from several websources

Regarding neural networks several approaches havebeen proposed for vector-space distributed representationsof words and phrases These models are used mainly forpredicting a word given a surrounding context Howevermost of the authors indicate that distributed representationsof words and phrases can also capture syntactic and semanticsimilarity or relatedness [1 32 33] This particular behaviourmakes these methods attractive to solve several NLP tasksnevertheless at the same time it rises new issues that isdealing with unnormalized texts which are typically presentin social media forums such as Twitter Facebook and Insta-gram Researchers have proposed several preprocessing stepsin order to overcome this issue which led to an overall perfor-mance increase Yan et al [4] enhanced system performanceby approximately 2 using standard NLP preprocessingwhich consists in tokenization lowercasing and removingstop words and rare terms Rangarajan Sridhar [5] focused onthe spelling issues in social media messages which includesrepeated letters omitted vowels use of phonetic spellingssubstitution of letters with numbers (typically syllables) anduse of shorthands and user created abbreviations for phrasesIn a data-driven approach Brigadir et al [3] apply URL filter-ing combined with standard NLP preprocessing techniques

Computational Intelligence and Neuroscience 3

As it can be seen there are many works that tackle theproblem of social media texts preprocessing however to thebest of our knowledge only few works based on a neuralnetwork for feature representation bothered to take intoconsideration the effect that data cleaning has on the qualityof the representation (specially on social media data) In thiswork we present our social media lexicon and demonstrateits usefulness for the author profiling task This task aims atidentifying the profile of the authors that is the age and gen-der of social media messages For our experiments we usedtwo corpora composed of Twittermessages obtained from thePAN competition in 2015 [10] and 2016 [34]

3 Creation of the Social Media Lexicon

We decided to develop the dictionaries for the four followinglanguages English Spanish Dutch and Italian because weneeded to preprocess tweets written in these languages forthe author profiling task at PAN 2015 [10] The appropriatepreprocessing of tweets was crucial for our system [35] per-formance since our approach relied on correct extraction ofsyntactic 119899-grams [36]We reviewed the tweets present in thePAN corpus and found an excessive use of shortened vocab-ulary which can be divided into three main categories slangwords abbreviations and contractions

The need to shorten the words emerged in order to savetime andmessage length and it became an essential part of aninformal language For example in Twitter messages one canfind shortened expressions such as ldquoxoxordquo (kisses and hugs)ldquoBFFrdquo (best friend for ever) ldquoLOLrdquo (laughing out loud)ldquoArghrdquo (exclamation of disappointment) and ldquo4everrdquo (for-ever)Moreover we noticed that the same slangwords (aswellas other types of shortened expressions) are used differentlyby various authors In order to standardize these shortenedexpressions we grouped them in our dictionary of slangwords The dictionary of abbreviations is conformed by theset of shortened forms of words or phrases that can be used inboth formal and informal languages Our dictionary of abbre-viations also includes acronyms which are abbreviations thatconsist of the initial letters or parts of words or phrases Inorder to give a few examples of abbreviations and acronymsincluded in our dictionaries we can cite ldquoStrdquo (street) ldquoAverdquo(avenue) ldquoMrrdquo (mister) ldquoNYrdquo (NewYork) and ldquolbrdquo (pound)among many others The third category of the shortenedvocabulary includes the contractions A contraction is alsoa shortened version of a word syllable or phrase created byomitting internal letters Different rules for different lan-guages are established in order to create a contraction Someexamples of constructions for the English language are ldquoletrsquosrdquo(let us) ldquoarenrsquotrdquo (are not) ldquocanrsquotrdquo (cannot) ldquowhorsquosrdquo (who is)and so forth

Moreover we came across a large number of emoticonsEmoticon is a typographic display of a facial representationThe use of emoticons aims at making the text messages moreexpressive We included in our emoticons dictionary twotypes of these graphic representations western or horizontal(mainly from America and Europe) and eastern or vertical(mainly from east Asia) Western style emoticons are writtenfrom left to right as if the head is rotated 90 degrees

counterclockwise ldquo-)rdquo (smiley face) ldquo-rdquo (doubtful face)ldquo-ordquo (shocked face) amongothers Eastern emoticons are notrotated and the eyes often play a bigger role in the expressionldquo(v)rdquo (smiley face) ldquo((+ +))rdquo (doubtful face) and ldquo(oo)rdquo(shocked face)

The compilation of the dictionaries is divided into threesteps

(1) We searched on the Internet for freely availablewebsites thatwere used as sources for the extraction oflists of slang words contractions and abbreviations

(2) From the selected sources we manually extracted thelists of slang words contractions and abbreviationsalong with their corresponding meanings in eachlanguage (English Spanish Dutch and Italian) Wecreated different files with the same structure for eachweb source

(3) Once we had all the lists in different files we pro-ceeded to join all the files of the same nature Weformatted and cleaned each file we also removed theduplicated entries Finally we manually checked themeanings of each entry in the dictionaries

In order to evaluate the created social media lexicon theinitial idea consisted in using the Amazon Mechanical Turk[37] to collect crowdsourced judgments However unlikeother NLP tasks such as word sense disambiguation andsentiment analysis which can be effectively annotated bycrowdsourced raters the annotation of the created dictio-naries requires specific knowledge of the Internet informallanguageTherefore as a reasonable alternative the proposedresource was validated manually by 5 researchers who havelinguistic backgrounds speak multiple languages and havean experience of working with social media content

It is worth mentioning that all the dictionaries wereextracted from the Internet sources with one exceptionmdashthe dictionary of slang words for the Spanish language Thisdictionary was expandedwith the list of slang words obtainedfrom [38] In this work the authors manually selectedslang words from 21000 tweets that had been previouslydownloaded using the hashtags of emotions (feliz (happy)felicidad (happiness) alegrıa (joy) etc) as search queriesand the Tweepy (httpwwwtweepyorg) library for PythonThen the researchers selected the words that are not presentin the Spanish dictionary provided by the Freeling [39] toolThese words along with their contexts were manually revisedby the authors in order to determine whether there existsan official word that conveys the corresponding meaning inorder to include these entries into the slang words list Theauthors provided us with the list of 200 slang words with theirmeanings and it was added to the 739 slang words extractedfrom the web sources making a total of 939 slang words forthe Spanish language

The links of the web pages that were used in order toextract the lists of slang words contractions and abbrevia-tions for the English Spanish Dutch and Italian languagesas well as the links of the pages used to obtain the list ofemoticons for the English language are presented in theAppendix Since themeanings of the emoticons in English are


Table 1 Number of entries in each dictionary

Type of dictionary Dutch Italian English SpanishAbbreviations 1237 107 1346 527Contractions 15 56 131 11Slang words 250 362 1296 939Emoticons mdash mdash 482 482Total 1520 525 3255 1959

the same as in Spanish we translated them in order to createtwo dictionaries of emoticons (one for each language)

31 Description of the Dictionaries Each dictionary isordered alphabetically and each file consists of two columnsseparated by a tabulation The first column corresponds toa slang word abbreviation or contraction entry accordingto the nature of the dictionary The second column containscorresponding meanings of the entry The meanings are sep-arated by a fraction line and ordered from the most commonto the least common ones

The statistics for each dictionary is presented in Table 1where we can see a high number of slang entities available forthe English and Spanish languages however there are muchless slang entities forDutch and ItalianThere are a large num-ber of abbreviations for DutchThe total number of entities inour social media lexicon is 7259 The dictionaries are freelyavailable on our website (httpwwwcicipnmxsimsidorovlexiconzip)

There are several ways of using the proposed resourceOne option is to replace the occurrence of a shortenedexpression in the text by its corresponding meaning in thedictionary in order to standardize the use of the expressionIn this case the occurrence can be replaced by the most com-monmeaning (the firstmeaning in the second column) or themeaning can be selectedmanually from the available optionsAnother way is to replace a shortened expression by a symbolin order to keep track of its occurrence and remove informa-tion related to the specific shortened expressionThere is alsoan option of removing the nonstandard vocabulary instanceidentified by its presence in the dictionary

4 Feature Representation Based ona Neural Network

In this work we use a feature representation based on a neuralnetwork algorithm that is the features are learned in anautomatic manner from the corpus There are two ways tolearn the feature representation of documents (1) to generatedocuments vectors from word embeddings (or word vectors)or (2) to learn document vectors In this work we learndocument vectors using the Doc2vec method following theprevious research on learning word embeddings (Word2vec)[33] Figure 1(a) shows the Word2vec method for learningword vectors The Word2vec method uses an iterative algo-rithm to predict a word given its surrounding context In thismethod each word is mapped to a unique vector representedby a column in a matrix 119882 Formally given a sequence of

Classifier

Averageconcatenate

Word matrix (W)

w4

w1 w2 w3

(a)

Classifier

Averageconcatenate

w4

w1 w2 w3Document matrix

Document id

D

(b)

Figure 1 (a) Framework for learning word vectors and (b) frame-work for learning document vector

training words 1199081 1199082 1199083 119908

119879 the training objective is to

maximize the average of the log probability

1

119879

119879minus119896

sum

119905=119896

log119901 (119908119905| 119908119905minus119896 119908

119905+119896) (1)

For learning document vectors the same method [1] isused as for learning word vectors (see Figure 1) Documentvectors are asked to contribute to the prediction task of thenext word given many contexts sampled from the documentin the same manner as the word vectors are asked tocontribute to the prediction task about the next word in a sen-tence In the document vectormethod (Doc2vec) each docu-ment ismapped to a unique vector represented by a column indocument matrix The word or document vectors are initial-ized randomly but in the end they capture semantics as anindirect result of the prediction taskThere are twomodels fordistributed representation of documents Distributed Mem-ory (DM) and Distributed Bag-of-Words (DBOW)

The Doc2vec method implements a neural network-based unsupervised algorithm that learns feature representa-tions of fixed length from texts (of variable length) [1] In thisworkwe use a freely available implementation of theDoc2vecmethod included in the GENSIM (httpsradimrehurekcomgensim) Python module The implementation of theDoc2vec method requires the following three parametersthe number of features to be returned (length of the vector)


Table 2 Parameters of the Doc2vec method for each language

Parameter Vectorlength Window size Minimum frequency

English 200 14 3Spanish 350 10 3Dutch 200 11 5Italian 200 4 4

the size of the window that captures the neighborhood andthe minimum frequency of words to be included into themodelThe values of these parameters depend on the corpusTo the best of our knowledge no work has been done on theauthor profiling task using the Doc2vec method Howeverin previous work related to opinion classification task [40]a vector length of 300 features a window size equal to 10 anda minimum frequency of 5 were reported In order to narrowdown theDoc2vec parameters search we follow this previousresearch and conduct a grid search over the following fixedranges vector length [50 350] size of window [3 19] andminimum frequency [3 5] Based on this grid search weselected the Doc2vec parameters as shown in Table 2

It is recommended to train the Doc2vec model severaltimes with unlabeled data while exchanging the input orderof the documents Each iteration of the algorithm is called anepoch and its purpose is to increase the quality of the outputvectors The selection of the input order of the documentsis usually done by a random number generator In order toensure the reproducibility of the conducted experiments inthis work we use a set of nine rules in order to perform thechanges in the order the documents are input in each epoch(we run 9 epochs) of the training process Considering the listof all unlabeled documents in the corpus 119879 = [119889

1 1198892 119889

119894]

we generate a new list of the documents with the differentorder 1198791015840 as follows

(1) Use the inverted order of the elements in the set 119879that is 1198791015840 = [119889

119894 119889119894minus1 119889

1]

(2) Select first the documents with an odd index inascending order and then the documentswith an evenindex that is 1198791015840 = [119889

1 1198893 119889

2 1198894 ]

(3) Select first the documents with an even index inascending order and then the documents with an oddindex that is 1198791015840 = [119889

2 1198894 119889

1 1198893 ]

(4) For each document with an odd index exchange itwith the document of index 119894 + 1 that is 1198791015840 =[1198892 1198891 1198894 1198893 ]

(5) Shift in a circular way two elements to the left that is1198791015840= [119889119894+2 119889119894+3 119889119894+4 119889

1 1198892 ]

(6) For each document with index 119894 exchange it withthe document whose index is 119894 + 3 that is 1198791015840 =[1198894 1198895 1198896 1198891 1198892 1198893 ]

(7) For each document with index 119894 if 119894 is a multiple ofthree exchange it with the document next to it (119894 + 1)that is 1198791015840 = [119889

1 1198892 1198894 1198893 1198895 1198897 1198896 ]

(8) For each document with index 119894 if 119894 is a multiple offour exchange it with the document next to it (119894 + 1)that is 1198791015840 = [119889

1 1198892 1198893 1198896 1198895 1198894 ]

(9) For each document with index 119894 if 119894 is a multiple ofthree exchange it with the document whose index is119894 + 2 that is 1198791015840 = [119889

1 1198892 1198895 1198894 1198893 1198898 ]

5 Case Study The Author Profiling Task

The author profiling (AP) task consists in identifying certainaspects of a writer such as age gender and personality traitsbased on the analysis of text samplesThe profile of an authorcan be used in many areas for example in forensics to obtainthe description of a suspect by analyzing posted social mediamessages or it can be used by companies to personalize theadvertisements they promote in the social media or in theemail interfaces [41]

In recent years different methods have been proposedin order to tackle the AP task automatically most of whichuse techniques from machine learning data mining andnatural language processing From a machine learning pointof view the AP task can be considered as a supervisedmultilabel classification problem where a set of text samples119878 = 119878

1 1198782 119878

119894 is given and each sample is assigned

to multiple target labels (1198971 1198972 119897119896) where each position

represents an aspect of the author (gender age personalitytraits etc) The task consists in building a classifier 119865 thatassigns multiple labels to unlabeled texts

51 Experimental Settings Since this work aims at evaluatingthe impact of our social media resource when using it in apreprocessing phase we conducted the experiments with andwithout preprocessing the text samples that is replacing theshortened terms with their full representation To performthe preprocessing we made use of the dictionaries describedearlier in Section 31 in the following manner

(1) Given a target shortened term or expression wesearch for it in the corresponding dictionary

(2) If the shortened term is present in the dictionary wereplace it with the most common meaning (the firstmeaning in the second column) If the term is notpresent in the dictionary we leave it unchanged

We used a machine learning approach which is dividedinto two stages training and testing In the training stagea vector representation of each text sample is automaticallyobtained by a neural network framework that is V119894 =V1 V2 V

119895 where V119894 is the vector representation of the

text sample 119878119894 To obtain the vector representation of the text

samples a neural network-based distributed representationmodel is trained using the Doc2vec method [1]The Doc2vecalgorithm is fed with both labeled and unlabeled samplesin order to learn the distributed representation of each textsample We employed word 119899-grams with 119899 ranging from 1to 3 as input representation for the Doc2vec method Thealgorithm takes into account the context of word 119899-gramsand it is able to capture the semantics of the input texts


Table 3 Age and gender distribution over the PAN author profiling2015 training corpus

English Spanish Dutch ItalianAge

18ndash24 58 22 mdash mdash25ndash34 60 46 mdash mdash35ndash49 22 22 mdash mdash50ndashxx 12 10 mdash mdash

GenderMale 76 50 17 19Female 76 50 17 19

Total 152 100 34 38

Then a classifier is trained using the distributed vectorrepresentations of the labeled samples We conducted theexperiments using scikit-learn [42] implementation of theSupport Vector Machines (SVM) and logistic regression(LR) classifiers since these classifiers with default parametershave previously demonstrated a good performance on high-dimensional and sparse data We generated different classifi-cationmodels for each one of the aspects of an author profilethat is one model for the age profile and another for thegender profile

In the testing phase the vector representations of unla-beled texts are obtained using the distributed model built inthe training stage then the classifiers are asked to assign alabel for each aspect of the author profile to each unlabeledsample

We also performed an alternative evaluation of theproposed dictionaries using as baseline two state-of-the-artapproaches for the AP task character 3-grams and wordunigrams (Bag-of-Words model) feature representations

52 Datasets In order to evaluate the proposed approach weexploited two corpora designed for the 2015 and 2016 authorprofiling tasks at PAN (httppanwebisde) which is held aspart of the CLEF conference The PAN 2015 corpus is com-posed of tweets in four different languages English SpanishDutch and Italian whereas the PAN2016 corpus is composedof tweets in English Spanish and Dutch Both corpora aredivided into two parts training and test Each part consists ofa set of labeled tweets that correspond to the age and genderIn addition the PAN 2015 corpus includes several per-sonality traits (extroverted stable agreeable conscientiousand open) The full description of the corpora is given inTables 3 and 4

Both corpora are perfectly balanced in terms of gendergroupsHowever as can be seen comparingTables 3 and 4 thenumber of instances in the PAN 2016 corpus is much higherthan in the PAN 2015 Furthermore the PAN 2016 corpus ismore unbalanced in terms of age groups and the number ofthese groups is higher than in the PAN 2015 corpus whichmakes the PAN 2016 AP task more challenging

The PAN 2015 and 2016 AP corpora are different in termsof the number of instances and the number and the distri-bution of age groups which allows drawing more general


English Spanish DutchAge18ndash24 26 16 mdash25ndash34 135 64 mdash35ndash49 181 126 mdash50ndash64 78 38 mdash65ndashxx 6 6 mdash

GenderMale 213 125 192Female 213 125 192

Total 426 250 384

conclusionswhen comparing the results of our approachwithand without preprocessing

Due to the policies of the organizers of the PAN APtask only the 2015 and 2016 AP training datasets have beenreleasedTherefore in order to evaluate our proposal we con-ducted the experiments on the training corpus under 10-foldcross-validation in two ways (1) extracting the vectors fromthe whole dataset and (2) extracting the vectors from eachfold of the cross-validation using only the training data Inboth cases we predicted the labels of unlabeled documents ineach fold and calculated the accuracy of the predicted labelsof all documents against the gold standard It is worth men-tioning that at the PAN competition the submitted systemsare evaluated on the test data which is only available on theevaluation platform (httpwwwtiraio) used at the compe-tition A preliminary version of the developed dictionarieswas used in our submissions at PAN 2015 [35] and PAN 2016[43]

53 Evaluation Results with Vectors Extracted from the WholeDataset Tables 5ndash11 present the age and gender classes accu-racy obtained on the PAN 2015 and PAN 2016 corpora usingtwo different classifiers (LR and SVM) with and without pre-processing Here ldquoLR-NPrdquo is logistic regression without pre-processing ldquoLR-WPrdquo logistic regression with preprocessingldquoSVM-NPrdquo SVM without preprocessing ldquoSVM-WPrdquo SVMwith preprocessing ldquoD2Vrdquo Doc2vecThe best results for eachclassifier (withwithout preprocessing) are in bold The bestresults for each feature set are underlined Note that whenconducting experiments on the PAN 2015 corpus for the ageclass we only consider the Spanish and the English datasetssince for Dutch and Italian this class is currently not availableFor the same reason for the PAN 2016 corpus we providethe results for the age and gender classes for the English andSpanish languages while for the Dutch language the resultsare provided only for the gender class

For the majority of cases Doc2vec method outperformsthe baseline approaches However for the Dutch and Italiandatasets of PAN 2015 the character 3-grams approach pro-vides higher accuracy This can be explained by the fact that


Table 5 Obtained results (accuracy ) for age and gender classification on the PAN author profiling 2015 English training corpus under10-fold cross-validation

Feature set AgeLR-NP LR-WP SVM-NP SVM-WP

D2V (1-gram) 6645 6645 6842 6973D2V (1 + 2-grams) 7105 7434 7105 7236D2V (1 + 2 + 3-grams) 6973 7039 6842 7039Character 3-grams 6578 6776 6644 6710Bag-of-Words 6578 6578 6513 6513

Feature set GenderLR-NP LR-WP SVM-NP SVM-WP


Table 6 Obtained results (accuracy ) for age and gender classification on the PAN author profiling 2015 Spanish training corpus under10-fold cross-validation





Table 7 Obtained results (accuracy ) for gender classification on the PAN author profiling 2015 Dutch training corpus under 10-foldcross-validation



Table 8 Obtained results (accuracy ) for gender classification on the PAN author profiling 2015 Italian training corpus under 10-foldcross-validation














Table 11 Obtained results (accuracy ) for age and gender classification on the PAN author profiling 2016 Dutch training corpus under10-fold cross-validation



these corpora have a small number of instances while theDoc2vec method requires a larger number of instances tobuild an appropriate feature representation

We also observed that for all the datasets except for theDutch 2016 the highest accuracy when using the Doc2vecmethod was obtained with higher-order 119899-grams as inputrepresentation (1 + 2-grams or 1 + 2 + 3-grams) Thisbehaviour is due to the fact that these input representations

allow the Doc2vec method to take into account syntactic andgrammatical patterns of the authors with the same demo-graphic characteristics

Moreover the obtained results indicate that in mostcases system performance was enhanced when performingthe preprocessing using the developed dictionaries regard-less of the classifier or the feature set Furthermore the high-est results for each dataset for both the age and gender classes


Table 12 Obtained results (accuracy ) for age and gender classification using SVM classifier on the PAN author profiling 2015 Englishtraining corpus under 10-fold cross-validation with vectors extracted only from the training data

Feature set Age GenderSVM-NP SVM-WP SVM-NP SVM-WP

D2V (1-gram) 6657 7015 5938 6732D2V (1 + 2-grams) 6846 6555 6313 6964D2V (1 + 2 + 3-grams) 6697 6944 6696 6705

Table 13 Obtained results (accuracy ) for age and gender classification using SVM classifier on the PAN author profiling 2015 Spanishtraining corpus under 10-fold cross-validation with vectors extracted only from the training data



Table 14 Obtained results (accuracy ) for gender classificationusing SVM classifier on the PAN author profiling 2015 Dutch train-ing corpus under 10-fold cross-validation with vectors extractedonly from the training data

Feature set GenderSVM-NP SVM-WP

D2V (1-gram) 7250 7500D2V (1 + 2-grams) 6500 7000D2V (1 + 2 + 3-grams) 6500 7250

(except for the Spanish gender class on the PAN 2015 corpus)were obtained when performing the data preprocessing

54 Evaluation Results with Vectors Extracted from Each Foldof the Cross-Validation Using Only the Training Data Inorder to avoid possible overfitting produced by obtainingthe vectors from the whole dataset we conducted additionalexperiments extracting the vectors from each fold of thecross-validation using only the training data of the PAN 2015and PAN 2016 corporaThis better matches the requirementsof a realistic scenario when the test data is not available at thetraining stage

Tables 12ndash17 present several results when the extractionof vectors from each fold in the cross-validation was doneonly with the training documents of this fold and not with thewhole dataset We follow the notations of the tables providedin the previous subsection

The experimental results presented in Tables 12ndash17 con-firm the conclusion that higher classification accuracy isobtained when performing preprocessing using the devel-oped dictionaries and that higher-order 119899-grams input rep-resentation for the Doc2vec method allowing the inclusionof syntactic and grammatical information in the majority ofcases outperforms the word unigrams input representation

Furthermore it can be observed that the results obtainedwhen extracting the vectors from each fold of the cross-validation using only the training data are comparable to

Table 15 Obtained results (accuracy ) for age and genderclassification using SVM classifier on the PAN author profiling 2016English training corpus under 10-fold cross-validation with vectorsextracted only from the training data



Table 16 Obtained results (accuracy ) for age and genderclassification using SVM classifier on the PAN author profiling 2016Spanish training corpus under 10-fold cross-validation with vectorsextracted only from the training data






those when training on the whole dataset ensuring nooverfitting of the examined classifier This also indicates thatour method can be successfully applied under more realisticconditions when there is no information on the test data atthe training stage


Table 18 Significance levels

Symbol Significance level Significance= 119901 gt 005 Not significant+ 005 ge 119901 gt 001 Significant++ 001 ge 119901 gt 0001 Very significant+++ 119901 le 0001 Highly significant

55 Statistical Significance of the Obtained Results In orderto ensure the contribution of our approach we performeda pairwise significance test between the results of differentexperiments We considered the Doc2vec approach with andwithout preprocessing as well as the baseline approaches(character 119899-grams and Bag-of-Words) We used theWilcoxon signed-ranks (Wsr) test [44] for computing thestatistical significance of differences in resultsTheWsr test ispreferred over other statistical tests (such as Studentrsquos 119905-test)for comparing the output of two classifiers [45] TheWsr is anonparametric test that ranks the differences in performanceof two classifiers in different datasets and compares theranks for positive and negative differences An importantrequirement of theWsr test is that the compared classifiers areevaluated using exactly the same random samples and at leastfive experiments are conducted for eachmethod In thisworkwe performed a stratified 10-fold cross-validation in eachexperiment which allowed us to apply this test

The significance levels are encoded as shown in Table 18As is generally assumed when 119901 lt 005 then we consider thesystems to be significantly different from each other Table 19presents the results of the Wsr test calculated for the EnglishSpanish and Dutch languages Here ldquoD2V-NPrdquo correspondsto the set of results obtained using the Doc2vec approachwithout preprocessing ldquoD2V-WPrdquo with preprocessing Theresults obtained with the character 3-grams and Bag-of-Words approaches are represented as ldquoChar 3-grams-NPrdquoand ldquoBag-of-Words-NPrdquo respectively

From the significance test we can conclude that theDoc2vec approach with preprocessing obtained very signif-icant (see Table 18) results as compared to both character 119899-grams and Bag-of-Words approaches without preprocessingfor the considered languages With respect to the Doc2vecmethodwith andwithout preprocessing the results are some-times significant and sometimes are not For example in caseof English the ldquoD2V-WPrdquo obtained significant improvementsover ldquoD2V-NPrdquo however in case of Spanish and Dutch theresults of ldquoD2V-WPrdquo are not significantly better than ldquoD2V-NPrdquo

6 Conclusions and Future Work

Preprocessing of user-generated social media content is achallenging task due to nonstandardized and usually infor-mal style of these messages Furthermore it is an essentialstep to a correct and accurate subsequent text analysis

We developed a resource namely a social media lexi-con for text preprocessing which contains the dictionariesof slang words contractions abbreviations and emoticonscommonly used in social mediaThe resource is composed of

the dictionaries for the English Spanish Dutch and Italianlanguages We described the methodology of the data collec-tion listed the web sources used for creating each dictionaryand explained the standardization process We also providedinformation concerning the structure of the dictionaries andtheir length

We conducted experiments on the PAN 2015 and PAN2016 author profiling datasets The author profiling task aimsat identifying the age and gender of authors based on their useof language We proposed the use of a neural network-basedmethod for learning the feature representation automaticallyand a classification process based on machine learningalgorithms in particular SVM and logistic regression Weperformed a cross-validated evaluation of the PAN 2015 andPAN 2016 training corpora in two ways (1) extracting thevectors from the whole dataset and (2) extracting the vectorsfrom each fold of the cross-validation using only the trainingdata The experiments were conducted with and withoutpreprocessing the corpora using our social media lexicon

We showed that the use of our social media lexiconimproves the quality of a neural network-based featurerepresentation when used for the author profiling task Weobtained better results in themajority of cases for the age andgender classes for the examined languages when using theproposed resource regardless of the classifier We performeda statistical significance test which showed that in mostcases the results improvements obtained by using the devel-oped dictionaries are statistically significant We considerthat these improvements were achieved due to the stan-dardization of the shortened vocabulary which is used inabundance in social media messages

We noticed that there are some commonly used terms insocial media that are not present in our web sources speciallyfor the English Dutch and Italian languages Therefore infuture work we intend to expand the dictionaries of slangwords with manually collected entries for each language asit was done for the Spanish slang words dictionary

Appendix

In this appendix we provide the links of the web pages thatwere used to build our resource

Sources links of each dictionary of the English languageare the following

Abbreviationshttppublicoedcomhow-to-use-the-oedabbrevia-tionshttpwwwenglishleapcomother-resourcesabbre-viationshttpwwwmacmillandictionarycomusthesaurus-categoryamericanwritten-abbreviations

ContractionshttpsenwikipediaorgwikiWikipediaList of Eng-lish contractionshttpgrammaraboutcomodwordsaNotes-On-Contractionshtm


Table 19 Significance of results differences between pairs of experiments for the English Spanish and Dutch languages where NPcorresponds to ldquowithout preprocessingrdquo and WP ldquowith preprocessingrdquo

Approaches English Spanish Dutch2015 2016 2015 2016 2015 2016

D2V-NP versus D2V-WP + + = = = =Char 3-grams-NP versus D2V-WP +++ +++ + +++ ++ ++Bag-of-Words-NP versus D2V-WP +++ +++ = +++ + +++

Slang Words

httpwwwnoslangcomhttptransl8itcomlargest-best-top-text-message-listhttpwwwwebopediacomquick reftextmessage-abbreviationsasp

Sources links of each dictionary of the Spanish languageare the following

Abbreviations

httpwwwwikilenguaorgindexphpLista de abre-viaturas Ahttpwwwreglasdeortografiacomabreviaturashtmhttpbusconraeesdpdapendicesapendice2html

Contractions

httpwwwgramaticasnet201109ejemplos-de-con-traccionhtml

Slang Words

httpspanishaboutcomodwrittenspanishasmshtmhttpswwwduolingocomcomment8212904httpwwwhispafenelonnetESPACIOTRABAJOVOCABULARIOSMShtml

Sources links of each dictionary of theDutch language arethe following

Abbreviations

httpsnlwikipediaorgwikiLijst van afkortingen inhet Nederlands

Contractions

httpsenwiktionaryorgwikiCategoryDutch con-tractionshttpswwwduolingocomcomment5599651

Slang Words

httpwwwwelingelichtekringennltech398386je-tiener-op-het-web-en-de-afkortingenhtml

httpwwwphrasebasecomarchive2dutchdutch-internet-slanghtmlhttpsnlwikipediaorgwikiChattaal

Sources links of each dictionary of the Italian language arethe following

Abbreviations

httphomeschassutorontocasimngarganocorsicor-rispabbreviazionihtmlhttpwwwpaginainiziocomserviceabbreviazionihtm

Contractions

httpwwwliquisearchcomcontraction grammaritalianhttpsenwikipediaorgwikiContraction 28gram-mar29Italian

Slang Words

httpsitwikipediaorgwikiGergo di InternetIl ger-go comune di internethttpunluogocomunealtervistaorgelenco-abbrevi-azioni-in-uso-nelle-chathttpwwwwiredcom201311web-semantics-gergo-di-internet

Sources links of the dictionary of emoticons are thefollowing

httpsenwikipediaorgwikiList of emoticonshttppcnetemoticonshttpwwwnetlingocomsmileysphp

Competing Interests

The authors report no conflict of interests

Acknowledgments

This work was done under the support of CONACYT Project240844 CONACYT under the Thematic Networks program(Language Technologies Thematic Network Projects 260178and 271622) SNI COFAA-IPN and SIP-IPN 2016194720161958 20151589 20162204 and 20162064


References

[1] Q Le and T Mikolov ldquoDistributed representations of sentencesand documentsrdquo in Proceedings of the 31st International Con-ference on Machine Learning (ICML rsquo14) pp 1188ndash1196 BeijingChina 2014

[2] R Socher C C-Y Lin C D Manning and A Y Ng ldquoParsingnatural scenes and natural language with recursive neuralnetworksrdquo in Proceedings of the 28th International Conferenceon Machine Learning (ICML rsquo11) pp 129ndash136 Bellevue WashUSA July 2011

[3] I Brigadir D Greene and P Cunningham ldquoAdaptive represen-tations for tracking breaking news on Twitterrdquo httparxivorgabs14032923

[4] C Yan F Zhang and L Huang ldquoDRWS a model for learningdistributed representations for words and sentencesrdquo in PRICAI2014 Trends in Artificial Intelligence D-N Pham and S-B ParkEds vol 8862 ofLectureNotes inComputer Science pp 196ndash207Springer 2014

[5] V K Rangarajan Sridhar ldquoUnsupervised text normalizationusing distributed representations of words and phrasesrdquo inProceedings of the 1st Workshop on Vector Space Modeling forNatural Language Processing (NAACL rsquo15) pp 8ndash16 Associationfor Computational Linguistics Denver Colo USA June 2015

[6] F Jiang Y Liu H Luan M Zhang and S Ma ldquoMicroblogsentiment analysis with emoticon spacemodelrdquo in Social MediaProcessing pp 76ndash87 Springer Berlin Germany 2014

[7] D Pinto D Vilarino Y Aleman H Gomez-Adorno N Loyaand H Jimenez-Salazar ldquoThe soundex phonetic algorithmrevisited for SMS text representationrdquo in Text Speech andDialogue 15th International Conference TSD 2012 Brno CzechRepublic September 3ndash7 2012 Proceedings vol 7499 of LectureNotes in Computer Science pp 47ndash55 Springer Berlin Ger-many 2012

[8] J Atkinson A Figueroa and C Perez ldquoA semantically-basedlattice approach for assessing patterns in text mining tasksrdquoComputacion y Sistemas vol 17 no 4 pp 467ndash476 2013

[9] D Das and S Bandyopadhyay ldquoDocument level emotiontagging machine learning and resource based approachrdquo Com-putacion y Sistemas vol 15 pp 221ndash234 2011

[10] F Rangel F Celli P Rosso M Potthast B Stein and WDaelemans ldquoOverview of the 3rd author profiling task at PAN2015rdquo in Proceedings of the CLEF Labs and Workshops vol 1391of Notebook Papers CEUR Workshop Proceedings ToulouseFrance September 2015

[11] T Baldwin ldquoSocial media friend or foe of natural languageprocessingrdquo in Proceedings of the 26th Pacific Asia Conferenceon Language Information and Computation (PACLIC rsquo12) pp58ndash59 November 2012

[12] E Clark and K Araki ldquoText normalization in social mediaprogress problems and applications for a pre-processing systemof casual englishrdquo ProcediamdashSocial and Behavioral Sciences vol27 pp 2ndash11 2011

[13] I Hemalatha G P S Varma andAGovardhan ldquoPreprocessingthe informal text for efficient sentiment analysisrdquo InternationalJournal of Emerging Trends amp Technology in Computer Sciencevol 1 no 2 pp 58ndash61 2012

[14] E Haddi X Liu and Y Shi ldquoThe role of text pre-processing insentiment analysisrdquo Procedia Computer Science vol 17 pp 26ndash32 2013

[15] S Roy S Dhar S Bhattacharjee and A Das ldquoA lexicon basedalgorithm for noisy text normalization as pre-processing for

sentiment analysisrdquo International Journal of Research in Engi-neering and Technology vol 2 no 2 pp 67ndash70 2013

[16] Z Ben-Ami R Feldman and B Rosenfeld ldquoUsing multi-viewlearning to improve detection of investor sentiments on twitterrdquoComputacion y Sistemas vol 18 no 3 pp 477ndash490 2014

[17] G Sidorov M Ibarra Romero I Markov R Guzman-Cabrera L Chanona-Hernandez and F Velasquez ldquoDeteccionautomatica de similitud entre programas del lenguaje deprogramacion Karel basada en tecnicas de procesamiento delenguaje naturalrdquo Computacion y Sistemas vol 20 no 2 pp279ndash288 2016 (Spanish)

[18] F Rangel P Rosso M Koppel E Stamatatos and G InchesldquoOverview of the author profiling task at PAN 2013rdquo in Proceed-ings of the Cross Language Evaluation Forum Conference (CLEFrsquo13) vol 1179 Notebook Papers September 2013

[19] F Rangel P Rosso I Chugur et al ldquoOverview of the 2nd authorprofiling task at PAN 2014rdquo in Proceedings of the CLEF 2014Labs and Workshops vol 1180 of Notebook Papers pp 898ndash927Sheffield UK 2014

[20] S Nowson J Perez C Brun S Mirkin and C Roux ldquoXRCEpersonal language analytics engine for multilingual authorprofilingrdquo in Proceedings of the Working Notes of CLEF 2015mdashConference and Labs of the Evaluation Forum vol 1391 CEURToulouse France 2015

[21] S Ait-Mokhtar J Chanod and C Roux ldquoA multi-input depen-dency parserrdquo in Proceedings of the in 7th International Work-shop on Parsing Technologies (IWPT rsquo01) pp 201ndash204 BeijingChina October 2001

[22] P Przybyla and P Teisseyre ldquoWhat do your look-alikes sayabout you Exploiting strong and weak similarities for authorprofilingrdquo in Proceedings of the Conference and Labs of theEvaluation Forum vol 1391 of Working Notes of CLEF CEUR2015

[23] S Maharjan P Shrestha and T Solorio ldquoA simple approach toauthor profiling in MapReducerdquo in Proceedings of the WorkingNotes of CLEF 2014mdashConference and Labs of the EvaluationForum vol 1180 pp 1121ndash1128 CEUR Sheffield UK September2014

[24] Y Aleman N Loya D Vilari and D Pinto ldquoTwo method-ologies applied to the author profiling taskrdquo in Proceedings ofthe Working Notes of CLEF 2013-Conference and Labs of theEvaluation forum vol 1179 CEUR 2013

[25] A Bartoli A D Lorenzo A Laderchi E Medvet and F TarlaoldquoAn author profiling approach based on language-dependentcontent and stylometric featuresrdquo in Proceedings of the WorkingNotes of CLEF 2015mdashConference and Labs of the EvaluationForum vol 1391 CEUR 2015

[26] A P Garibay A T Camacho-Gonzalez R A Fierro-VillanedaI Hernandez-Farias D Buscaldi and I V M Ruız ldquoA randomforest approach for authorship profilingrdquo in Proceedings of theConference and Labs of the Evaluation Forum vol 1391 ofWorking Notes of CLEF CEUR 2015

[27] J Marquardt G Farnadi G Vasudevan et al ldquoAge and genderidentification in social mediardquo in Proceedings of the WorkingNotes of CLEF 2014-Conference and Labs of the Evaluationforum vol 1180 pp 1129ndash1136 CEUR 2014

[28] D I H Farias R Guzman-Cabrera A Reyes andM A RochaldquoSemantic-based features for author profiling identificationrdquo inProceedings of theWorking Notes of CLEF 2013mdashConference andLabs of the Evaluation Forum vol 1179 CEUR 2013

[29] L Flekova and IGurevych ldquoCanwehide in theweb Large scalesimultaneous age and gender author profiling in social mediardquo


in Proceedings of the Working Notes of CLEF 2013mdashConferenceand Labs of the Evaluation Forum vol 1179 CEUR 2013

[30] S Goswami S Sarkar and M Rustagi ldquoStylometric analysis ofbloggersrsquo age and genderrdquo in Proceedings of the 3rd InternationalConference onWeblogs and SocialMedia (ICWSM rsquo09) San JoseCalif USA May 2009

[31] A A C Diaz and J M G Hidalgo ldquoExperiments with SMStranslation and stochastic gradient descent in Spanish textauthor profilingrdquo in Proceedings of the Working Notes of CLEF2013mdashConference and Labs of the Evaluation Forum vol 1179CEUR 2013

[32] R Socher A Perelygin J Wu et al ldquoRecursive deep modelsfor semantic compositionality over a sentiment treebankrdquo inProceedings of the Conference on Empirical Methods in NaturalLanguage Processing (EMNLP rsquo13) pp 1631ndash1642 Associationfor Computational Linguistics 2013

[33] T Mikolov K Chen G Corrado and J Dean ldquoEfficientestimation of word representations in vector spacerdquo ComputingResearch Repository httpsarxivorgabs13013781

[34] F Rangel P Rosso B Verhoeven W Daelemans M Potthastand B Stein ldquoOverview of the 4th author profiling task at PAN2016 cross-genre evaluationsrdquo in Proceedings of the WorkingNotes Papers of the CLEF 2016 Evaluation Labs CEURWorkshopProceedings CLEF and CEUR-WSorg Evora Portugal 2016

[35] J Posadas-Duran H Gomez-Adorno I Markov et al ldquoSyn-tactic n-grams as features for the author profiling taskrdquo inProceedings of theWorking Notes of CLEF 2015mdashConference andLabs of the Evaluation Forum Toulouse France 2015

[36] G Sidorov H Gomez-Adorno I Markov D Pinto and NLoya ldquoComputing text similarity using tree edit distancerdquo inProceedings of the Annual Conference of the North AmericanFuzzy Information Processing Society (NAFIPS) held jointly with5thWorld Conference on Soft Computing (WConSC rsquo15) pp 1ndash4IEEE Redmond Wash USA August 2015

[37] R Snow B OrsquoConnor D Jurafsky and A Y Ng ldquoCheap andfastmdashbut is it good Evaluating non-expert annotations fornatural language tasksrdquo in Proceedings of the Conference onEmpirical Methods in Natural Language Processing (EMNLPrsquo08) pp 254ndash263 Association for Computational Linguistics2008

[38] V Camacho-Vazquez G Sidorov and SNGalicia-Haro ldquoCon-struccion de un corpus marcado con emociones para el analisisde sentimientos en Twitter en espanolrdquo Escritos Revista delCentro de Ciencias del Lenguaje In press

[39] L Padro and E Stanilovsky ldquoFreeling 30 towards widermultilingualityrdquo in Proceedings of the Language Resources andEvaluation Conference (LREC rsquo12) ELRA 2012

[40] TMikolov I Sutskever K Chen G Corrado and J Dean ldquoDis-tributed representations of words and phrases and their compo-sitionalityrdquo inProceedings of the 27thAnnual Conference onNeu-ral Information Processing Systems vol 26 ofAdvances in NeuralInformation Processing Systems pp 3111ndash3119 Lake Tahoe NevUSA December 2013

[41] S Argamon M Koppel J W Pennebaker and J SchlerldquoAutomatically profiling the author of an anonymous textrdquoCommunications of the ACM vol 52 no 2 pp 119ndash123 2009

[42] L Buitinck G Louppe M Blondel et al ldquoAPI design formachine learning software experiences from the scikit-learnprojectrdquo in Proceedings of the ECML PKDD Workshop Lan-guages for Data Mining and Machine Learning pp 108ndash122Prague Czech Republic September 2013

[43] I Markov H Gomez-Adorno G Sidorov and A GelbukhldquoAdapting cross-genre author profiling to language and corpusrdquoin Proceedings of the Working Notes of CLEF 2016-Conferenceand Labs of the Evaluation forum vol 1609 of CEUR WorkshopProceedings pp 947ndash955 2016 CLEF and CEUR-WSorg

[44] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics Bulletin vol 1 no 6 pp 80ndash83 1945

[45] J Demsar ldquoStatistical comparisons of classifiers over multipledata setsrdquo Journal of Machine Learning Research vol 7 pp 1ndash302006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Distributed Sensor Networks


Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014


ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014


Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications


Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia


Biomedical Imaging


ArtificialNeural Systems

Advances in


RoboticsJournal of



Computational Intelligence and Neuroscience

Industrial EngineeringJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in



representation obtained by a well-known neural networkmethod Doc2vec [1] We consider that our dictionaries canhelp in achieving better results in tasks based on social mediadata since they allow standardizing nonstandard languageexpressions that are used in a different way by differentauthors We evaluate our hypothesis on the author profiling(AP) task which aims at predicting the age and gender of anauthor of a given text [10] In our research we used Twittermessages However given that the writing style in social net-works like Twitter Facebook and Instagram among othersis similar these dictionaries are useful for preprocessing andcleaning messages obtained from all these social networks

The rest of the paper is structured as follows Section 2describes related work Section 3 explains the methodologyused for preparation of each dictionary and presents thestructure of the dictionaries Section 4 describes a neuralnetwork-based feature representation Section 5 presents thecase study for the author profiling task Finally Section 6draws the conclusions from this work and points to thepossible directions of future work

2 Related Work

With the unprecedentedly growing amount of social mediadata available on the Internet and the rapid expansion in user-generated content text preprocessing using correspondinglexical resources is becoming more and more crucial for sub-sequent accurate text analysis In this section we present sev-eral works that demonstrate the importance of the text pre-processing step and its usefulness in achieving better resultsfor different NLP tasks particularly for the tasks that useneural network-based feature representation

The challenges that social media offers to NLP werediscussed in detail in [11] Further we focus on the works(usually not related with word embeddings) that considerdifferent preprocessing approaches in order to solve thesechallenges

Clark and Araki [12] discuss the major problems relatedto processing social media messages written in EnglishThe authors improved the performance of open sourcespellcheckers on Twitter data by developing a preprocessingsystem for automatically normalizing casual social mediaEnglishThe authors report that when using their preprocess-ing system average errors per sentence decreased from 15to less than 5

The role of preprocessing has gained much importanceespecially when dealing with social media data It has beenproven that an appropriate text preprocessing for the taskof sentiment analysis (on social media data) significantlyimproves the results for this task [13ndash16] Some of the mostcommonly used preprocessing steps include removing URLsspecial characters and repeated letters from a word expand-ing abbreviations removing stop words handling nega-tions and performing stemming Besides preprocessing hasproved to be useful even when dealing with other types oftextual data for example source codes [17]

According to the overview papers of the author profiling(AP) task at PAN 2013 [18] PAN 2014 [19] and PAN 2015 [10]many participants carried out some kind of preprocessing

mainly for removing HTML tags from the tweets and sec-ondly for handling hashtags URLs and username mentionsin different ways

In PAN 2015 there was only one team [20] that removedall character sequences representing emojis in the originaltweets Unlike emoticons emojis are typically represented byUnicode characters so in this work their use was replacedby unknown character markers In the same work theauthors used a multi-input dependency parser [21] capable ofidentifying positive and negative emoticons for their furtheruse as features in the classification process

Several research teams that participated in the AP taskat PAN (editions 2013 2014 and 2015) exploited the use ofemoticons and slang words as stylistic and content featuresrespectively Some of these teams extracted emoticons usingregular expressions [22 23]Other research teams built lexicalresources for normalizing emoticons and Out Of Vocabulary(OOV) words with their corresponding normalized terms[24] however the authors did not publish these resourcesmaking it difficult for others to reproduce their results Therest of the teams that also used emoticons in their works didnot mention any information concerning how the emoticonswere identified in the corpus nor the source where they wereobtained from (in case they used a dictionary) [25ndash29]

With respect to the use of slang words in the AP taskGoswami et al [30] added slang words and the average lengthof sentences to the feature set improving accuracy to 803in age group identification and to 892 in gender detectionFarias et al [28] and Diaz and Hidalgo [31] used dictionariesextracted from the web (httpwwwchatslangcomtermscommon [last access 01072016] all other URLs in this docu-ment were also verified on this date) However these lists arevery short in comparisonwith our emoticons and slangwordsdictionaries which were collectedmanually from several websources

Regarding neural networks several approaches havebeen proposed for vector-space distributed representationsof words and phrases These models are used mainly forpredicting a word given a surrounding context Howevermost of the authors indicate that distributed representationsof words and phrases can also capture syntactic and semanticsimilarity or relatedness [1 32 33] This particular behaviourmakes these methods attractive to solve several NLP tasksnevertheless at the same time it rises new issues that isdealing with unnormalized texts which are typically presentin social media forums such as Twitter Facebook and Insta-gram Researchers have proposed several preprocessing stepsin order to overcome this issue which led to an overall perfor-mance increase Yan et al [4] enhanced system performanceby approximately 2 using standard NLP preprocessingwhich consists in tokenization lowercasing and removingstop words and rare terms Rangarajan Sridhar [5] focused onthe spelling issues in social media messages which includesrepeated letters omitted vowels use of phonetic spellingssubstitution of letters with numbers (typically syllables) anduse of shorthands and user created abbreviations for phrasesIn a data-driven approach Brigadir et al [3] apply URL filter-ing combined with standard NLP preprocessing techniques
























Classifier

Averageconcatenate

Word matrix (W)

w4

w1 w2 w3

(a)

Classifier

Averageconcatenate

w4


Document id

D

(b)


training words 1199081 1199082 1199083 119908



1

119879

119879minus119896

sum

119905=119896

log119901 (119908119905| 119908119905minus119896 119908

119905+119896) (1)









1 1198892 119889

119894]



119894 119889119894minus1 119889

1]


1 1198893 119889

2 1198894 ]


2 1198894 119889

1 1198893 ]



1 1198892 ]



1 1198892 1198894 1198893 1198895 1198897 1198896 ]


1 1198892 1198893 1198896 1198895 1198894 ]


1 1198892 1198895 1198894 1198893 1198898 ]




1 1198782 119878
















Total 152 100 34 38










Total 426 250 384














































































Appendix









Slang Words



Abbreviations


Contractions


Slang Words



Abbreviations


Contractions


Slang Words




Abbreviations


Contractions


Slang Words




Competing Interests


Acknowledgments



References
























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in


























Classifier

Averageconcatenate

Word matrix (W)

w4

w1 w2 w3

(a)

Classifier

Averageconcatenate

w4


Document id

D

(b)


training words 1199081 1199082 1199083 119908



1

119879

119879minus119896

sum

119905=119896

log119901 (119908119905| 119908119905minus119896 119908

119905+119896) (1)









1 1198892 119889

119894]



119894 119889119894minus1 119889

1]


1 1198893 119889

2 1198894 ]


2 1198894 119889

1 1198893 ]



1 1198892 ]



1 1198892 1198894 1198893 1198895 1198897 1198896 ]


1 1198892 1198893 1198896 1198895 1198894 ]


1 1198892 1198895 1198894 1198893 1198898 ]




1 1198782 119878
















Total 152 100 34 38










Total 426 250 384














































































Appendix









Slang Words



Abbreviations


Contractions


Slang Words



Abbreviations


Contractions


Slang Words




Abbreviations


Contractions


Slang Words




Competing Interests


Acknowledgments



References
























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in









1 1198892 119889

119894]



119894 119889119894minus1 119889

1]


1 1198893 119889

2 1198894 ]


2 1198894 119889

1 1198893 ]



1 1198892 ]



1 1198892 1198894 1198893 1198895 1198897 1198896 ]


1 1198892 1198893 1198896 1198895 1198894 ]


1 1198892 1198895 1198894 1198893 1198898 ]




1 1198782 119878
















Total 152 100 34 38










Total 426 250 384














































































Appendix









Slang Words



Abbreviations


Contractions


Slang Words



Abbreviations


Contractions


Slang Words




Abbreviations


Contractions


Slang Words




Competing Interests


Acknowledgments



References
























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in








Total 152 100 34 38










Total 426 250 384














































































Appendix









Slang Words



Abbreviations


Contractions


Slang Words



Abbreviations


Contractions


Slang Words




Abbreviations


Contractions


Slang Words




Competing Interests


Acknowledgments



References
























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in












































































Appendix









Slang Words



Abbreviations


Contractions


Slang Words



Abbreviations


Contractions


Slang Words




Abbreviations


Contractions


Slang Words




Competing Interests


Acknowledgments



References
























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in







Slang Words



Abbreviations


Contractions


Slang Words



Abbreviations


Contractions


Slang Words




Abbreviations


Contractions


Slang Words




Competing Interests


Acknowledgments



References
























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




References
























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




























Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in



research article improving feature representation based on a...

Documents