

Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 586–595, Beijing, China, July 26-31, 2015. © 2015 Association for Computational Linguistics

Context-aware Entity Morph Decoding

Boliang Zhang1, Hongzhao Huang1, Xiaoman Pan1, Sujian Li2, Chin-Yew Lin3

Heng Ji1, Kevin Knight4, Zhen Wen5, Yizhou Sun6, Jiawei Han7, Bulent Yener1

1Rensselaer Polytechnic Institute, 2Peking University, 3Microsoft Research Asia, 4University of Southern California, 5IBM T. J. Watson Research Center, 6Northeastern University, 7University of Illinois at Urbana-Champaign

1{zhangb8,huangh9,panx2,jih,yener}@rpi.edu, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

People create morphs, a special type of fake alternative names, to achieve certain communication goals such as expressing strong sentiment or evading censors. For example, "Black Mamba", the name for a highly venomous snake, is a morph that Kobe Bryant created for himself due to his agility and aggressiveness in playing basketball games. This paper presents the first end-to-end context-aware entity morph decoding system that can automatically identify, disambiguate, and verify morph mentions based on specific contexts, and resolve them to their target entities. Our approach assumes an absolute "cold start": it does not require any candidate morph or target entity lists as input, nor any manually constructed morph-target pairs for training. We design a semi-supervised collective inference framework for morph mention extraction, and compare various deep learning based approaches for morph resolution. Our approach achieved significant improvement over the state-of-the-art method (Huang et al., 2013), which used a large amount of training data.1

1 Introduction

Morphs (Huang et al., 2013; Zhang et al., 2014) refer to the fake alternative names created by social media users to entertain readers or evade censors. For example, during the World Cup in 2014,

1The data set and programs are publicly available at http://nlp.cs.rpi.edu/data/morphdecoding.zip and http://nlp.cs.rpi.edu/software/morphdecoding.tar.gz

a morph "Su-tooth" was created to refer to the Uruguay striker "Luis Suarez" for his habit of biting other players. Automatically decoding human-generated morphs in text is critical for downstream deep language understanding tasks such as entity linking and event argument extraction.

However, even for humans, it is difficult to decode many morphs without certain historical, cultural, or political background knowledge (Zhang et al., 2014). For example, "The Hutt" can be used to refer to a fictional alien entity in the Star Wars universe ("The Hutt stayed and established himself as ruler of Nam Chorios"), or the governor of New Jersey, Chris Christie ("The Hutt announced a bid for a seat in the New Jersey General Assembly"). Huang et al. (2013) did a pioneering pilot study on morph resolution, but their approach assumed the entity morphs were already extracted and used a large amount of labeled data. In fact, they resolved morphs at the corpus level instead of the mention level, and thus their approach was context-independent. A practical morph decoder, as depicted in Figure 1, involves two problems: (1) Morph Extraction: given a corpus, extract morph mentions; and (2) Morph Resolution: for each morph mention, figure out the entity that it refers to.

In this paper, we aim to solve the fundamental research problem of end-to-end morph decoding and propose a series of novel solutions to tackle the following challenges.

Challenge 1: Large-scope candidates

Only a very small percentage of terms can be used as morphs, and these terms should be interesting and fun. In a sample of 4,668 Chinese Weibo tweets that we annotated, only 450 out of 19,704 unique terms are morphs.


[Figure 1 shows four example tweets (d1-d4) and three target entities. d1: "Conquer West King from Chongqing fell from power, do we still need to sing red songs?" d2: "Buhou and Little Brother Ma." d3: "Attention! Chongqing Conquer West King! Attention! Brother Jun!" d4: "Wu Sangui met Wei Xiaobao, and led the army of Qing dynasty into China, and then became Conquer West King." Target entities: 薄熙来 (Bo Xilai), 马英九 (Ma Ying-jeou), 王立军 (Wang Lijun).]

Figure 1: An Illustration of Morph Decoding Task.

To extract morph mentions, we propose a two-step approach: first identify individual mention candidates to narrow down the search scope, and then verify whether they refer to morphed entities rather than their original meanings.

Challenge 2: Ambiguity, Implicitness, Informality

Compared to regular entities, many morphs contain informal terms with hidden information. For example, "不厚 (not thick)" is used to refer to "薄熙来 (Bo Xilai)", whose last name "薄 (Bo)" means "thin". Therefore we attempt to model the rich contexts with careful consideration of morph characteristics both globally (e.g., language models learned from a large amount of data) and locally (e.g., phonetic anomaly analysis) to extract morph mentions.

For morph resolution, the main challenge lies in that the surface forms of morphs usually appear quite different from their target entity names. Based on the distributional hypothesis (Harris, 1954), which states that words that often occur in similar contexts tend to have similar meanings, we propose to use deep learning techniques to capture and compare the deep semantic representations of a morph and its candidate target entities based on their contextual clues. For example, the morph "平西王 (Conquer West King)" and its target entity "薄熙来 (Bo Xilai)" share similar implicit contextual representations such as "重庆 (Chongqing)" (Bo was the governor of Chongqing) and "倒台 (fall from power)".

Challenge 3: Lack of labeled data

To the best of our knowledge, no sufficient mention-level morph annotations exist for training an end-to-end decoder, and manual morph annotation requires native speakers with certain cultural background (Zhang et al., 2014). In this paper we focus on exploring novel approaches to save annotation cost in each step. For morph extraction, based on the observation that morphs tend to share similar characteristics and appear together, we propose a semi-supervised collective inference approach to extract morph mentions from multiple tweets simultaneously. For morph resolution, since deep learning techniques have been successfully used to model word representations in an unsupervised fashion, we make use of a large amount of unlabeled data to learn the semantic representations of morphs and target entities based on the unsupervised continuous bag-of-words method (Mikolov et al., 2013b).

2 Problem Formulation

Following the recent work on morphs (Huang et al., 2013; Zhang et al., 2014), we use Chinese Weibo tweets for experiments. Our goal is to develop an end-to-end system that automatically extracts morph mentions and resolves them to their target entities. Given a corpus of tweets $D = \{d_1, d_2, ..., d_{|D|}\}$, we define a candidate morph $m_i$ as a unique term $t_j$ in $T$, where $T = \{t_1, t_2, ..., t_{|T|}\}$ is the set of unique terms in $D$. To extract $T$, we first apply several well-developed Natural Language Processing tools, including the Stanford Chinese word segmenter (Chang et al., 2008), the Stanford part-of-speech tagger (Toutanova et al., 2003) and the Chinese lexical analyzer ICTCLAS (Zhang et al., 2003), to process the tweets and identify noun phrases. Then we define a morph mention $m_i^p$ of $m_i$ as the $p$-th occurrence of $m_i$ in a specific document $d_j$. Note that a mention with the same surface form as $m_i$ but referring to its original entity is not considered a morph mention. For instance, the "平西王 (Conquer West King)" in $d_1$ and $d_3$ in Figure 1 are morph mentions since they refer to the modern politician "薄熙来 (Bo Xilai)", while the one in $d_4$ is not a morph mention since it refers to the original entity, the king "吴三桂 (Wu Sangui)".

For each morph mention, we discover a list of target candidates $E = \{e_1, e_2, ..., e_{|E|}\}$ from Chinese web data for morph mention resolution. We design an end-to-end morph decoder which consists of the following procedure:

• Morph Mention Extraction

– Potential Morph Discovery: This first step aims to obtain a set of potential entity-level morphs $M = \{m_1, m_2, ...\}$ $(M \subseteq T)$. Then, we only verify and resolve the mentions of these potential morphs, instead of all the terms in $T$ in a large corpus.

– Morph Mention Verification: In this step, we aim to verify whether each mention $m_i^p$ of the potential morph $m_i$ $(m_i \in M)$ in a specific context $d_j$ is a morph mention or not.

• Morph Mention Resolution: The final step is to resolve each morph mention $m_i^p$ to its target entity (e.g., "薄熙来 (Bo Xilai)" for the morph mention "平西王 (Conquer West King)" in $d_1$ in Figure 1).

3 Morph Mention Extraction

3.1 Why Traditional Entity Mention Extraction Doesn't Work

In order to automatically extract morph mentions from any given documents, a natural first attempt is to formulate the task as a sequence labeling problem, just like labeling regular entity mentions. We adopted the commonly used conditional random fields (CRFs) (Lafferty et al., 2001) and obtained only 6% F-score. Many morphs are not presented as regular entity mentions. For example, the morph "天线 (Antenna)" refers to "温家宝 (Wen Jiabao)" because it shares one character "宝 (baby)" with the famous children's television series "天线宝宝 (Teletubbies)". Even when morphs are presented as regular entity mentions, they refer to new target entities which are different from the regular ones. So we propose the following novel two-step solution.

3.2 Potential Morph Discovery

We now introduce the first step of our approach, potential morph discovery, which aims to narrow down the scope of morph candidates without losing recall. This step takes advantage of the common characteristics shared among morphs and identifies potential morphs using a supervised method, since it is relatively easy to collect a certain number of corpus-level morphs as training data, compared to labeling morph mentions. Formulating this task as a binary classification problem, we adopt Support Vector Machines (SVMs) (Cortes and Vapnik, 1995) as the learning model. We propose the following four categories of features.

Basic: (i) character unigram, bigram, trigram, and surface form; (ii) part-of-speech tags; (iii) the number of characters; (iv) whether some characters are identical. These basic features help identify several common characteristics of morph candidates (e.g., they are very likely to be nouns, and very unlikely to be single characters).

Dictionary: Many morphs are non-regular names derived from proper names while retaining some characteristics. For example, the morphs "薄督 (Governor Bo)" and "吃省 (Gourmand Province)" are derived from their target entity names "薄熙来 (Bo Xilai)" and "广东省 (Guangdong Province)", respectively. Therefore, we adopt a dictionary of proper names (Li et al., 2012) and propose the following features: (i) whether a term occurs in the dictionary; (ii) whether a term starts with a commonly used last name and includes uncommonly used characters as its first name; (iii) whether a term ends with a geo-political entity or organization suffix word but is not in the dictionary.

Phonetic: Many morphs are created based on phonetic (Chinese pinyin in our case) modifications. For instance, the morph "饭饼饼 (Rice Cake)" has the same phonetic transcription as its target entity name "范冰冰 (Fan Bingbing)". To extract phonetic-based features, we compile a dictionary composed of 〈phonetic transcription, term〉 pairs from the Chinese Gigaword corpus (https://catalog.ldc.upenn.edu/LDC2011T07). Then for each term, we check whether it has the same phonetic transcription as any entry in the dictionary while including different characters.

Language Modeling: Many morphs rarely appear in a general news corpus (e.g., "六步郎 (Six Step Man)" refers to the NBA basketball player "勒布朗·詹姆斯 (LeBron James)"). Therefore, we use character-based language models trained on Gigaword to calculate the occurrence probability of each term, and use n-gram probabilities (n ∈ [1:5]) as features.
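To make the feature design concrete, the following is a minimal, hypothetical sketch of the potential morph discovery classifier using scikit-learn. The proper-name dictionary, pinyin table, and language model are toy placeholders standing in for the resources described above, not the paper's actual data.

```python
# Hypothetical sketch of the potential-morph discovery classifier (Section 3.2).
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

proper_names = {"薄熙来", "广东省"}                 # toy proper-name dictionary
pinyin_of = {"饭饼饼": "fan bing bing",            # toy pinyin table
             "范冰冰": "fan bing bing"}

def lm_logprob(term, order):
    # Placeholder for the character n-gram language-model probabilities
    # trained on Gigaword; returns a dummy constant here.
    return -5.0 * order

def extract_features(term, pos_tag):
    feats = {
        "pos": pos_tag,
        "num_chars": len(term),
        "has_repeated_char": len(set(term)) < len(term),
        "in_dictionary": term in proper_names,
        # Phonetic feature: same pinyin as some other known term.
        "pinyin_match": any(p == pinyin_of.get(term) and t != term
                            for t, p in pinyin_of.items()),
    }
    for n in (1, 2, 3):                            # character n-grams
        for i in range(len(term) - n + 1):
            feats["ng%d_%s" % (n, term[i:i + n])] = True
    for n in range(1, 6):                          # LM probability features
        feats["lm%d" % n] = lm_logprob(term, n)
    return feats

# The paper uses 225 positive / 225 negative corpus-level morphs; toy data here.
train_terms = [("饭饼饼", "NN"), ("吃省", "NN"), ("苹果", "NN"), ("跑步", "VV")]
labels = [1, 1, 0, 0]
vec = DictVectorizer()
X = vec.fit_transform([extract_features(t, p) for t, p in train_terms])
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(vec.transform([extract_features("薄督", "NN")])))
```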

3.3 Morph Mention Verification

The second step is to verify whether a mention of the discovered potential morphs is indeed used as a morph in a specific context. Based on the observation that closely related morph mentions often occur together, we propose a semi-supervised graph-based method to leverage a small set of labeled seeds, coreference and correlation relations, and a large amount of unlabeled data to perform collective inference and thus save annotation cost. According to our observation of morph mentions, we propose the following two hypotheses:

Hypothesis 1: If two mentions are coreferential, then both should be morph mentions or both non-morph mentions. For instance, the morph mentions "平西王 (Conquer West King)" in d1 and d3 in Figure 1 are coreferential: they both refer to the modern politician "薄熙来 (Bo Xilai)".

Hypothesis 2: Highly correlated mentions tend to all be morph mentions or all non-morph mentions. In our annotated dataset, 49% of morph mentions co-occur at the tweet level. For example, "平西王 (Conquer West King)" and "军哥 (Brother Jun)" are used together in d3 in Figure 1.

Based on these hypotheses, we aim to design an effective approach to compensate for the limited annotated data. Graph-based semi-supervised learning approaches (Zhu et al., 2003; Smola and Kondor, 2003; Zhou et al., 2004) have been successfully applied to many NLP tasks (Niu et al., 2005; Chen et al., 2006; Huang et al., 2014). Therefore we build a mention graph to capture the semantic relatedness (weighted arcs) between potential morph mentions (nodes) and propose a semi-supervised graph-based algorithm to collectively verify a set of relevant mentions using a small amount of labeled data. We now describe the detailed algorithm.

Mention Graph Construction

First, we construct a mention graph that reflects the association between all the mentions of potential morphs. According to the above two hypotheses, mention coreference and correlation relations are the basis for building our mention graph, which is represented by a matrix.

In Chinese Weibo, there exist rich and clean social relations including authorship, replying, retweeting, and user mentioning relations. We make use of these social relations to judge the possibility of two mentions of the same potential morph being coreferential. If there exists a social relation between two mentions $m_i^p$ and $m_i^q$ of the morph $m_i$, they are usually coreferential and assigned an association score of 1. We also detect coreferential relations by performing content similarity analysis, adopting cosine similarity over the tf-idf representations of the contexts of two mentions. Then we get a coreference matrix $W^1$:

$$W^1_{m_i^p, m_i^q} = \begin{cases} 1.0 & \text{if } m_i^p \text{ and } m_i^q \text{ are linked by a certain social relation} \\ \cos(m_i^p, m_i^q) & \text{else if } q \in \mathrm{kNN}(p) \\ 0 & \text{otherwise} \end{cases}$$

where $m_i^p$ and $m_i^q$ are two mentions of the same potential morph $m_i$, and kNN means that each mention is connected to its $k$ nearest neighboring mentions.

Users tend to use morph mentions together to achieve their communication goals. To incorporate such evidence, we measure the correlation between two mentions $m_i^p$ and $m_j^q$ of two different potential morphs $m_i$ and $m_j$ as $\mathrm{corr}(m_i^p, m_j^q) = 1.0$ if there exists a certain social relation between them, and $\mathrm{corr}(m_i^p, m_j^q) = 0$ otherwise. Then we obtain the correlation matrix $W^2_{m_i^p, m_j^q} = \mathrm{corr}(m_i^p, m_j^q)$.

To balance the coreference and correlation relations during learning, we first obtain two matrices $\bar{W}^1$ and $\bar{W}^2$ by row-normalizing $W^1$ and $W^2$, respectively. Then we obtain the final mention matrix $W$ as a linear combination $W = \alpha \bar{W}^1 + (1 - \alpha)\bar{W}^2$, where $\alpha$ is a coefficient between 0 and 1 ($\alpha$ is set to 0.8 in this paper, optimized on the development set).
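As a concrete illustration of the graph construction, here is a minimal NumPy sketch of the row normalization and linear combination step; W1 and W2 are toy matrices standing in for the coreference and correlation matrices defined above.

```python
# Sketch of combining the coreference and correlation matrices into W.
import numpy as np

def row_normalize(W):
    sums = W.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0          # leave all-zero rows untouched
    return W / sums

# Toy coreference (W1) and correlation (W2) matrices over three mentions.
W1 = np.array([[0.0, 1.0, 0.3],
               [1.0, 0.0, 0.0],
               [0.3, 0.0, 0.0]])
W2 = np.array([[0.0, 0.0, 1.0],
               [0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0]])

alpha = 0.8                        # optimized on the development set
W = alpha * row_normalize(W1) + (1 - alpha) * row_normalize(W2)
print(W)
```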

Graph-based Semi-supervised Learning

Intuitively, if two mentions are strongly connected, they tend to hold the same label. A label of 1 indicates that a mention is a morph mention, and 0 means a non-morph mention. We use $Y = [Y_l\ Y_u]^T$ to denote the label vector of all mentions, where the first $l$ nodes are verified mentions labeled as 1 or 0, and the remaining $u$ nodes need to be verified and are initialized with the label 0.5. Our final goal is to obtain the label vector $Y_u$ by incorporating evidence from the initial labels and the mention graph.

Following the graph-based semi-supervised learning algorithm of Zhu et al. (2003), the mention verification problem is formulated as optimizing the objective function

$$Q(Y) = \mu \sum_{i=1}^{l} (y_i - y_i^0)^2 + \frac{1}{2} \sum_{i,j} W_{ij} (y_i - y_j)^2$$

where $y_i^0$ denotes the initial label, and $\mu$ is a regularization parameter that controls the trade-off between the initial labels and the consistency of labels on the mention graph. Zhu et al. (2003) proved that this formulation has both a closed-form and an iterative solution.
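The iterative solution can be sketched as standard label propagation over the mention graph. The update below is one common variant consistent with the objective above (clamping seeds toward their initial labels with weight μ), not necessarily the authors' exact implementation.

```python
# Label-propagation sketch for verifying morph mentions (one common variant).
import numpy as np

def propagate(W, y0, labeled, mu=1.0, iters=100):
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # row-normalize
    y = np.where(labeled, y0, 0.5)        # unlabeled nodes start at 0.5
    for _ in range(iters):
        y = P @ y                          # smooth labels over the graph
        # Pull seed nodes back toward their initial labels, weighted by mu.
        y[labeled] = (mu * y0[labeled] + y[labeled]) / (mu + 1.0)
    return y                               # score > 0.5 => morph mention

# Toy three-mention graph: mentions 0 and 1 are labeled seeds.
W = np.array([[0.0, 0.8, 0.2],
              [0.8, 0.0, 0.0],
              [0.2, 0.0, 0.0]])
y0 = np.array([1.0, 0.0, 0.5])
labeled = np.array([True, True, False])
print(propagate(W, y0, labeled))
```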

4 Morph Mention Resolution

The final step is to resolve the extracted morph mentions to their target entities.

4.1 Candidate Target Identification

We start by identifying a list of target candidates for each morph mention from comparable corpora including Sina Weibo, Chinese news, and English Twitter. After preprocessing the corpora with word segmentation, noun phrase chunking and name tagging, the named entity list is still too large and too noisy for candidate ranking. To clean the named entity list, we adopt the Temporal Distribution Assumption proposed in our recent work (Huang et al., 2013), which assumes that a morph m and its real target e should have similar temporal distributions in terms of their occurrences. Following the same heuristic, we assume that an entity is a valid candidate for a morph if and only if the candidate appears fewer than seven days after the morph's appearance.
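A minimal sketch of this temporal filter, assuming first-occurrence dates have already been computed for the morph and each candidate entity; the names and dates here are toy inputs.

```python
# Hypothetical sketch of the temporal-distribution candidate filter.
from datetime import date

def valid_candidates(morph_first_seen, entity_first_seen, window_days=7):
    """entity_first_seen: dict mapping entity name -> first-occurrence date."""
    return [e for e, d in entity_first_seen.items()
            if (d - morph_first_seen).days < window_days]

print(valid_candidates(date(2013, 5, 1),
                       {"薄熙来": date(2013, 5, 3),     # kept: 2 days after
                        "吴三桂": date(2013, 6, 20)}))  # dropped: 50 days after
```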

4.2 Candidate Target Ranking

Motivations of Using Deep Learning

Compared to regular entity linking tasks (Ji et al., 2010; Ji et al., 2011; Ji et al., 2014), the major challenge of ranking a morph's candidate target entities lies in that surface features such as the orthographic similarity between a morph and its target candidates have been proven inadequate (Huang et al., 2013). Therefore, it is crucial to capture the semantics of both mentions and target candidates. For instance, in order to correctly resolve "平西王 (Conquer West King)" from $d_1$ and $d_3$ in Figure 1 to the modern politician "薄熙来 (Bo Xilai)" instead of the ancient king "吴三桂 (Wu Sangui)", it is important to model the surrounding contextual information effectively to capture important clues (e.g., "重庆 (Chongqing)", "倒台 (fall from power)", and "唱红歌 (sing red songs)") to represent the mentions and target entity candidates. Inspired by the recent success of deep learning based techniques in learning semantic representations for various NLP tasks (e.g., (Bengio et al., 2003; Collobert et al., 2011; Mikolov et al., 2013b; He et al., 2013)), we design and compare the following two approaches, which employ hierarchical architectures with multiple hidden layers to extract useful features and map morphs and target entities into a latent semantic space.

Pairwise Cross-genre Supervised Learning

Ideally, we hope to obtain a large amount of coreferential entity mention pairs for training. A natural knowledge resource is Wikipedia, which includes anchor links. We compose an anchor's surface string and the title of the entity it links to as a positive training pair. Then we randomly sample negative training instances from those pairs that don't share any links.
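A small sketch of this pair generation procedure; the anchors list is a toy placeholder for anchor-title pairs parsed from a Wikipedia dump.

```python
# Sketch of generating positive/negative training pairs from Wikipedia anchors.
import random

# Toy stand-in for (anchor surface string, linked page title) pairs.
anchors = [("平西王", "吴三桂"), ("薄督", "薄熙来"), ("饭饼饼", "范冰冰")]

linked = set(anchors)
surfaces = [s for s, _ in anchors]
titles = [t for _, t in anchors]

positives = [(s, t, 1) for s, t in anchors]
negatives = []
while len(negatives) < len(positives):
    s, t = random.choice(surfaces), random.choice(titles)
    if (s, t) not in linked:                 # never-linked combination
        negatives.append((s, t, 0))
training_pairs = positives + negatives
```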

Our approach consists of the following steps: (1) generating high quality embeddings for each training instance; (2) pre-training with the stacked denoising auto-encoder (Bengio et al., 2003) for feature dimension reduction; and (3) supervised fine-tuning to optimize the neural networks towards a similarity measure (e.g., dot product). Figure 2 depicts the overall architecture of this approach.

[Figure 2: Overall Architecture of Pairwise Cross-genre Supervised Learning. A mention and a candidate target are each encoded by n stacked auto-encoder layers f, topped by a pair-wise supervised fine-tuning layer that computes sim(m, c) = Dot(f(m), f(c)).]
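The following is a compact PyTorch sketch of the architecture in Figure 2: shared stacked encoders map a mention vector and a candidate vector into a latent space, and the dot product of their codes scores the pair. The denoising pre-training is omitted, the margin ranking loss is one plausible training criterion, and the input dimension is an assumption; the layer sizes follow the settings reported in Section 5.4.

```python
# Sketch of the pairwise scorer; not the authors' exact implementation.
import torch
import torch.nn as nn

class PairwiseScorer(nn.Module):
    def __init__(self, dim_in=800, hidden=1000, code=200, n_layers=3):
        super().__init__()
        layers, d = [], dim_in
        for _ in range(n_layers):              # stacked encoder layers
            layers += [nn.Linear(d, hidden), nn.Sigmoid()]
            d = hidden
        layers.append(nn.Linear(d, code))      # 200-unit fine-tuning layer
        self.f = nn.Sequential(*layers)

    def forward(self, mention, candidate):
        # sim(m, c) = Dot(f(m), f(c)), as in Figure 2.
        return (self.f(mention) * self.f(candidate)).sum(dim=-1)

model = PairwiseScorer()
m, pos, neg = torch.randn(4, 800), torch.randn(4, 800), torch.randn(4, 800)
# Margin ranking loss: linked pairs should outscore sampled negatives.
loss = torch.clamp(1.0 - model(m, pos) + model(m, neg), min=0).mean()
loss.backward()
```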

However, morph resolution is significantly different from the traditional entity linking task, since the latter mainly focuses on formal and explicit entities (e.g., "薄熙来 (Bo Xilai)") which tend to have stable referents in Wikipedia. In contrast, morphs tend to be informal, implicit, and have newly emergent meanings which evolve over time. In fact, these morph mentions rarely appear in Wikipedia. For example, almost all "平西王 (Conquer West King)" mentions in Wikipedia refer to the ancient king instead of the modern politician "薄熙来 (Bo Xilai)". In addition, the contextual words used in Wikipedia to describe entities are quite different from those in social media. For example, to describe a death event, Wikipedia usually uses a formal expression "去世 (pass away)", while the informal expression "挂了 (hang up)" is used more often in tweets. Therefore this approach suffers from the knowledge discrepancy between the two genres.

Within-genre Unsupervised Learning

[Figure 3: Continuous Bag-of-Words Architecture. The input layer holds the context terms (glossed "already", "fell from power", "sing", "red song" in the example); the projection layer sums their embedding vectors into $X_w$; the output layer applies $\sigma(X_w^T \theta)$.]

To address the above challenge, we propose a second approach that learns semantic embeddings of both morph mentions and entities directly from tweets. We also prefer unsupervised learning methods due to the lack of training data. Following Mikolov et al. (2013a), we develop a continuous bag-of-words (CBOW) model that can effectively model the surrounding contextual information. CBOW is discriminatively trained by maximizing the conditional probability of a term $w_i$ given its contexts $c(w_i) = \{w_{i-n}, ..., w_{i-1}, w_{i+1}, ..., w_{i+n}\}$, where $n$ is the contextual window size, and $w_i$ is a term obtained using the preprocessing step introduced in Section 2 (each $w_i$ is not limited to the noun phrases we consider as candidate morphs). The architecture of CBOW is depicted in Figure 3. We obtain a vector $X_{w_i}$ through the projection layer by summing up the embedding vectors of all terms in $c(w_i)$, and then use the sigmoid activation function to obtain the final embedding of $w_i$ in $c(w_i)$ in the output layer.

Formally, the objective function of CBOW can be formulated as

$$L(\theta) = \sum_{w_i \in W} \sum_{w_j \in W} \log p(w_j \mid c(w_i))$$

where $W$ is the set of unique terms obtained from the whole training corpus, and $p(w_j \mid c(w_i))$ is the conditional likelihood of $w_j$ given the context $c(w_i)$, formulated as follows:

$$p(w_j \mid c(w_i)) = \left[\sigma(X_{w_i}^T \theta_{w_j})\right]^{L_{w_i}(w_j)} \times \left[1 - \sigma(X_{w_i}^T \theta_{w_j})\right]^{1 - L_{w_i}(w_j)}$$

where $L_{w_i}(w_j) = \begin{cases} 1, & w_i = w_j \\ 0, & \text{otherwise} \end{cases}$, $\sigma$ is the sigmoid activation function, and $\theta_{w_i}$ are the embeddings of $w_i$ to be learned with back-propagation during training.

Data              Training  Development  Testing
# Tweets          1,500     500          2,688
# Unique Terms    10,098    4,848        15,108
# Morphs          250       110          341
# Morph Mentions  1,342     487          2,469

Table 1: Data Statistics
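In practice, embeddings of this kind can be trained with an off-the-shelf CBOW implementation. The sketch below uses gensim's word2vec (4.x API) as a stand-in for the model above; the sentences are toy segmented tweets, while the window size and dimension follow the settings in Section 5.4.

```python
# CBOW embedding training with gensim as a stand-in for the model in Figure 3.
from gensim.models import Word2Vec

segmented_tweets = [
    ["平西王", "重庆", "倒台", "唱", "红歌"],   # toy segmented tweets
    ["薄熙来", "重庆", "打黑", "倒台"],
]
model = Word2Vec(segmented_tweets,
                 vector_size=800, window=10,   # settings from Section 5.4
                 sg=0,                         # sg=0 selects CBOW
                 min_count=1, epochs=50)
print(model.wv.most_similar("平西王", topn=10))  # cf. Figure 5
```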

5 Experiments

5.1 Data

We retrieved 1,553,347 tweets from Chinese Sina Weibo posted from May 1 to June 30, 2013, and 66,559 web documents from the URLs embedded in these tweets for experiments. We then randomly sampled 4,688 non-redundant tweets and asked two Chinese native speakers to manually annotate morph mentions in these tweets. The annotated dataset is randomly split into training, development, and testing sets, with detailed statistics shown in Table 1. (We will make all of these annotations and other resources available for research purposes.) We used 225 positive instances and 225 negative instances to train the model in the first step of potential morph discovery.

We collected a Chinese Wikipedia dump from October 9, 2014, which contains 2,539,355 pages. We pulled out person, organization, and geo-political pages based on entity type matching with DBpedia (http://dbpedia.org), and filtered out pages with fewer than 300 words. For training the model, we use 60,000 mention-target pairs along with one negative sample randomly generated for each pair; 20% of the pairs are reserved for parameter tuning.

5.2 Overall: End-to-End Decoding

In this subsection, we first study the end-to-end decoding performance of our best system, and compare it with the state-of-the-art supervised learning-to-rank approach proposed by Huang et al. (2013), which is based on information network construction and traversal with meta-paths. We feed the 225 extracted morphs as input to the system of Huang et al. (2013). The experiment setting, implementation and evaluation process are similar to those of Huang et al. (2013).


The overall performance of our approach using within-genre learning for resolution is shown in Table 2. We can see that our system achieves significantly better performance (at the 95.0% confidence level by the Wilcoxon Matched-Pairs Signed-Ranks Test) than the approach proposed by Huang et al. (2013). We found that Huang et al. (2013) failed to resolve many unpopular morphs (e.g., "小马 (Little Ma)" is a morph referring to Ma Ying-jeou, and it only appeared once in the data), because their method heavily relies on aggregating contextual and temporal information from multiple instances of each morph. In contrast, our unsupervised resolution approach only leverages the pre-trained word embeddings to capture the semantics of morph mentions and entities.

Model               Precision  Recall  F1
Huang et al., 2013  40.2       33.3    36.4
Our Approach        41.1       35.9    38.3

Table 2: End-to-End Morph Decoding (%)

5.3 Diagnosis: Morph Mention Extraction

The first step discovered 888 potential morphs (covering 80.1% of all morphs while comprising only 5.9% of all terms), which indicates that this step successfully narrowed down the scope of candidate morphs.

Method        Precision  Recall  F1
Naive         58.0       83.1    68.3
SVMs          61.3       80.7    69.7
Our Approach  88.2       77.2    82.3

Table 3: Morph Mention Verification (%)

Now we evaluate the performance of morph mention verification. We compare our approach with two baseline methods: (i) Naive, which considers all mentions as morph mentions; (ii) SVMs, a fully supervised model using Support Vector Machines (Cortes and Vapnik, 1995) based on unigram and bigram features. Table 3 shows the results. We can see that our approach achieves significantly better performance than the baseline approaches. In particular, it can verify the mentions of newly emergent morphs. For instance, "棒棒棒 (Good Good Good)" is mistakenly identified by the first step as a potential morph, but the second step correctly filters it out.

5.4 Diagnosis: Morph Mention Resolution

The target candidate identification step successfully filters out 86% of irrelevant entities with high precision (98.5% of morphs retain their target entities). For candidate ranking, we compare with several baseline approaches as follows:

• BOW: We compute cosine similarity over bag-of-words vectors with tf-idf values to measure the context similarity between a mention and its candidates.

• Pairwise Cross-genre Supervised Learning: We first construct a vocabulary by choosing the top 100,000 most frequent terms. Then we randomly sample 48,000 instances for training and 12,000 instances for development. At the pre-training step, we set the number of hidden layers to 3, the size of each hidden layer to 1,000, the masking noise probability for the first layer to 0.7, and a Gaussian noise with standard deviation of 0.1 for the higher layers. The learning rate is set to 0.01. At the fine-tuning stage, we add a 200-unit layer on top of the auto-encoders and optimize the neural network models based on the training data.

• Within-genre Unsupervised Learning: We directly train morph mention and entity embeddings from the large-scale tweets and web documents that we collected, setting the window size to 10 and the vector dimension to 800 based on the development set (see the sketch following this list).
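A sketch of the within-genre ranking step, reusing the gensim model from the earlier sketch: a morph mention is represented by the sum of the embeddings of its context terms, and candidates are ranked by cosine similarity. The context and candidate lists are toy inputs.

```python
# Ranking candidate targets by context-embedding similarity (toy sketch).
import numpy as np

def rank_candidates(model, mention_context, candidates):
    # Mention representation: sum of context-term embeddings.
    ctx = np.sum([model.wv[w] for w in mention_context if w in model.wv], axis=0)
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    scored = [(e, cosine(ctx, model.wv[e])) for e in candidates if e in model.wv]
    return sorted(scored, key=lambda x: -x[1])

# `model` is the gensim Word2Vec model trained in the earlier sketch.
print(rank_candidates(model, ["重庆", "倒台", "唱", "红歌"], ["薄熙来", "吴三桂"]))
```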

The overall performance of the various resolution approaches using perfect morph mentions is shown in Figure 4. We can clearly see that our second, within-genre learning approach achieves the best performance. Figure 5 demonstrates the differences between our two deep learning based methods. When learning semantic embeddings directly from Wikipedia, we can see that the top 10 closest entities of the mention "平西王 (Conquer West King)" are all related to the ancient king "吴三桂 (Wu Sangui)". Therefore this method is only able to capture the original meanings of morphs. In contrast, when we learn embeddings directly from tweets, most of the closest entities are relevant to its target entity "薄熙来 (Bo Xilai)".

6 Related Work

The first morph decoding work (Huang et al., 2013) assumed morph mentions were already discovered and didn't take contexts into account. To the best of our knowledge, this is the first work on context-aware end-to-end morph decoding.


[Figure 5: Top 10 closest entities to morph and target in different genres. For "平西王 (Conquer West King)" in Wikipedia, the closest entities all relate to the ancient king: Eight Beauties, Surrender to Qing Dynasty, Qinhuai, Army of Qing, Year 1644, Break the Defense, Fall of Qin Dynasty, Chen Yuanyuan, Wu Sangui, Entitled as. For "平西王 (Conquer West King)" in tweets: Bo Yibo, Manchuria, Bo Xilai, Wang Lijun, Wen Qiang, Fall of Qin Dynasty, Zhang Dejiang, King of Han, Bo, Wu Sangui. For "薄熙来 (Bo Xilai)" in tweets/web docs: Violation of Rules, Be Distinguished, Bo Xilai, Suppress Gangster, Wang Lijun, Murdering Case, Zhang Dejiang, Neil Heywood, Huang Qifan, Introduce Investment.]

[Figure 4: Resolution Acc@K for Perfect Morph Mentions]

Morph decoding is related to several traditional NLP tasks: entity mention extraction (e.g., (Zitouni and Florian, 2008; Ohta et al., 2012; Li and Ji, 2014)), metaphor detection (e.g., (Wang et al., 2006; Tsvetkov, 2013; Heintz et al., 2013)), word sense disambiguation (WSD) (e.g., (Yarowsky, 1995; Mihalcea, 2007; Navigli, 2009)), and entity linking (EL) (e.g., (Mihalcea and Csomai, 2007; Ji et al., 2010; Ji et al., 2011; Ji et al., 2014)). However, none of these previous techniques can be applied directly to tackle this problem. As mentioned in Section 3.1, entity morphs are fundamentally different from regular entity mentions. Our task is also different from metaphor detection because morphs cover a much wider range of semantic categories and can include either abstract or concrete information. Some common features for detecting metaphors (e.g., (Tsvetkov, 2013)) are not effective for morph extraction: (1) Semantic categories: metaphors usually fall into certain semantic categories such as noun.animal and noun.cognition. (2) Degree of abstractness: if the subject or an object of a concrete verb is abstract, then the verb is likely to be a metaphor. In contrast, morphs can be very abstract (e.g., "函数 (Function)" refers to "杨幂 (Yang Mi)" because her first name "幂 (Mi)" means the power function) or very concrete (e.g., "薄督 (Governor Bo)" refers to "薄熙来 (Bo Xilai)"). In contrast to traditional WSD, where the senses of a word are usually quite stable, the "sense" (target entity) of a morph may be newly emergent or evolve rapidly over time, and the same morph can also have multiple senses. The EL task focuses more on explicit and formal entities (e.g., named entities), while morphs tend to be informal and convey implicit information.

Morph mention detection is also related to malware detection (e.g., (Chandola et al., 2009; Firdausi et al., 2010; Christodorescu and Jha, 2003)), which discovers abnormal behavior in code and malicious software. In contrast, our task tackles anomalous text in its semantic context.

Deep learning-based approaches have been demonstrated to be effective in disambiguation-related tasks such as WSD (Bordes et al., 2012), entity linking (He et al., 2013) and question answering (Yih et al., 2014; Bordes et al., 2014; Yang et al., 2014). In this paper we showed that it is crucial to keep the genres consistent between learning embeddings and applying embeddings.

7 Conclusions and Future Work

This paper describes the first work on context-aware end-to-end morph decoding. By conducting deep analysis to identify the common characteristics of morphs and the unique challenges of this task, we leverage a large amount of unlabeled data and the coreferential and correlation relations to perform collective inference to extract morph mentions. Then we explore deep learning-based techniques to capture the semantics of morph mentions and entities and resolve morph mentions on the fly. Our future work includes exploiting the profiles of target entities as feedback to refine the results of morph mention extraction. We will also extend the framework to event morph decoding.

Acknowledgments

This work was supported by the U.S. ARL NS-CTA No. W911NF-09-2-0053, DARPA DEFT No. FA8750-13-2-0041, NSF Awards IIS-1523198, IIS-1017362, IIS-1320617, IIS-1354329 and HDTRA1-10-1-0120, and gift awards from IBM, Google, Disney and Bosch. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

References

Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, March.

A. Bordes, X. Glorot, J. Weston, and Y. Bengio. 2012. Joint learning of words and meaning representations for open-text semantic parsing. In Proc. of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS 2012).

A. Bordes, S. Chopra, and J. Weston. 2014. Question answering with subgraph embeddings. In Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).

V. Chandola, A. Banerjee, and V. Kumar. 2009. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15.

P. Chang, M. Galley, and C. D. Manning. 2008. Optimizing Chinese word segmentation for machine translation performance. In Proc. of the Third Workshop on Statistical Machine Translation (StatMT 2008).

J. Chen, D. Ji, C. Tan, and Z. Niu. 2006. Relation extraction using label propagation based semi-supervised learning. In Proc. of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL 2006).

M. Christodorescu and S. Jha. 2003. Static analysis of executables to detect malicious patterns. In Proc. of the 12th Conference on USENIX Security Symposium (SSYM 2003).

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, November.

C. Cortes and V. Vapnik. 1995. Support-vector networks. Machine Learning, 20:273–297, September.

I. Firdausi, C. Lim, A. Erwin, and A. Nugroho. 2010. Analysis of machine learning techniques used in behavior-based malware detection. In Proc. of the 2010 Second International Conference on Advances in Computing, Control, and Telecommunication Technologies (ACT 2010).

Z. Harris. 1954. Distributional structure. Word, 10:146–162.

Z. He, S. Liu, M. Li, M. Zhou, L. Zhang, and H. Wang. 2013. Learning entity representation for entity disambiguation. In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013).

I. Heintz, R. Gabbard, M. Srivastava, D. Barner, D. Black, M. Friedman, and R. Weischedel. 2013. Automatic extraction of linguistic metaphors with LDA topic modeling. In Proc. of the ACL 2013 Workshop on Metaphor in NLP.

H. Huang, Z. Wen, D. Yu, H. Ji, Y. Sun, J. Han, and H. Li. 2013. Resolving entity morphs in censored data. In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013).

H. Huang, Y. Cao, X. Huang, H. Ji, and C. Lin. 2014. Collective tweet wikification based on semi-supervised graph regularization. In Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014).

H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. 2010. Overview of the TAC 2010 knowledge base population track. In Proc. of the Text Analysis Conference (TAC 2010).

H. Ji, R. Grishman, and H. T. Dang. 2011. Overview of the TAC 2011 knowledge base population track. In Proc. of the Text Analysis Conference (TAC 2011).

H. Ji, J. Nothman, and B. Hachey. 2014. Overview of TAC-KBP2014 entity discovery and linking tasks. In Proc. of the Text Analysis Conference (TAC 2014).

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the Eighteenth International Conference on Machine Learning (ICML 2001).

Q. Li and H. Ji. 2014. Incremental joint extraction of entity mentions and relations. In Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014).

Q. Li, H. Li, H. Ji, W. Wang, J. Zheng, and F. Huang. 2012. Joint bilingual name tagging for parallel corpora. In Proc. of the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012).

R. Mihalcea and A. Csomai. 2007. Wikify!: Linking documents to encyclopedic knowledge. In Proc. of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM 2007).

R. Mihalcea. 2007. Using Wikipedia for automatic word sense disambiguation. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2007).

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26.

R. Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41:10:1–10:69, February.

Z. Niu, D. Ji, and C. Tan. 2005. Word sense disambiguation using label propagation based semi-supervised learning. In Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005).

T. Ohta, S. Pyysalo, J. Tsujii, and S. Ananiadou. 2012. Open-domain anatomical entity mention detection. In Proc. of the ACL 2012 Workshop on Detecting Structure in Scholarly Discourse.

A. Smola and R. Kondor. 2003. Kernels and regularization on graphs. In Proc. of the Annual Conference on Computational Learning Theory and Kernel Workshop (COLT 2003).

K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL 2003).

Y. Tsvetkov. 2013. Cross-lingual metaphor detection using common semantic features. In Proc. of the ACL 2013 Workshop on Metaphor in NLP.

Z. Wang, H. Wang, H. Duan, S. Han, and S. Yu. 2006. Chinese noun phrase metaphor recognition with maximum entropy approach. In Proc. of the Seventh International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2006).

M. Yang, N. Duan, M. Zhou, and H. Rim. 2014. Joint relational embeddings for knowledge-based question answering. In Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).

D. Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proc. of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL 1995).

W. Yih, X. He, and C. Meek. 2014. Semantic parsing for single-relation question answering. In Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014).

H. Zhang, H. Yu, D. Xiong, and Q. Liu. 2003. HHMM-based Chinese lexical analyzer ICTCLAS. In Proc. of the Second SIGHAN Workshop on Chinese Language Processing (SIGHAN 2003).

B. Zhang, H. Huang, X. Pan, H. Ji, K. Knight, Z. Wen, Y. Sun, J. Han, and B. Yener. 2014. Be appropriate and funny: Automatic entity morph encoding. In Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014).

D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Scholkopf. 2004. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, pages 321–328.

X. Zhu, Z. Ghahramani, and J. Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. of the International Conference on Machine Learning (ICML 2003).
