MODELING HETEROGENEOUS NETWORKS FOR INFORMATION RANKING, ENRICHMENT AND RESOLUTION
ON MICROBLOGS
By
Hongzhao Huang
A Dissertation Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Subject: COMPUTER SCIENCE
Examining Committee:
Heng Ji, Dissertation Adviser
Peter Fox, Member
Jim Hendler, Member
Chin-Yew Lin, Member
Yizhou Sun, Member
Rensselaer Polytechnic Institute
Troy, New York
April 2015 (For Graduation May 2015)
CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
ACKNOWLEDGMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivations of Research in Microblogging . . . . . . . . . . . . . . . . . 1
1.2 Overall Problem: Enhancing Natural Language Understanding for Microblogs . . . . . . 3
1.3 Insights of the Thesis: Leveraging and Modeling Heterogeneous Information Networks for Natural Language Processing . . . . . . 6
    1.3.1 Microblog Ranking . . . . . . 8
    1.3.2 Microblog Wikification . . . . . . 9
    1.3.3 Morph Decoding . . . . . . 12
1.4 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 14
2. Background and Relevant Literature . . . . . . . . . . . . . . . . . . . . . . . 16
2.1 Homogeneous and Heterogeneous Information Networks . . . . . . . . . 16
2.2 Graph-based Approaches . . . . . . 18
    2.2.1 Ranking . . . . . . 18
    2.2.2 Similarity Measurement . . . . . . 21
    2.2.3 Semi-supervised Learning . . . . . . 23
2.3 Related Work to the Thesis Topic . . . . . . 24
    2.3.1 Ranking in Microblogging . . . . . . 24
    2.3.2 Microblog Wikification . . . . . . 25
    2.3.3 Morph Decoding . . . . . . 26
3. Microblog Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Motivations and Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Our Proposed Approach: Tri-HITS . . . . . . 28
    3.2.1 Overview . . . . . . 28
    3.2.2 Filtering non-informative Tweets . . . . . . 29
    3.2.3 Initializing Ranking Scores . . . . . . 30
    3.2.4 Constructing Heterogeneous Networks . . . . . . 32
    3.2.5 Iterative Propagation . . . . . . 32
    3.2.6 Redundancy Removal . . . . . . 34
3.3 Experiments . . . . . . 35
    3.3.1 Data and Evaluation Metric . . . . . . 35
    3.3.2 Effect of Parameters . . . . . . 37
    3.3.3 Performance and Analysis . . . . . . 38
    3.3.4 Remaining Challenges . . . . . . 41
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4. Microblog Wikification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Principles and Approach Overview . . . . . . 44
    4.2.1 Principles . . . . . . 45
    4.2.2 Approach Overview . . . . . . 45
4.3 A Deep Semantic Relatedness Model . . . . . . 46
    4.3.1 The DSRM Architecture . . . . . . 46
    4.3.2 Learning the DSRM . . . . . . 49
4.4 Relational Graph Construction . . . . . . 50
    4.4.1 Local Compatibility . . . . . . 51
    4.4.2 Meta Path . . . . . . 52
    4.4.3 Coreference . . . . . . 53
    4.4.4 Semantic Relatedness . . . . . . 54
    4.4.5 The Combined Relational Graph . . . . . . 54
4.5 Semi-supervised Graph Regularization . . . . . . . . . . . . . . . . . . . 55
4.6 Experiments . . . . . . 56
    4.6.1 Data and Scoring Metric . . . . . . 57
    4.6.2 End-to-End Wikification . . . . . . 58
    4.6.3 Quality of Semantic Relatedness Measurement . . . . . . 61
    4.6.4 Concept Disambiguation . . . . . . 63
    4.6.5 Discussions . . . . . . 65
    4.6.6 Remaining Challenges . . . . . . 66
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5. Morph Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Morph Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Morph Resolution . . . . . . 71
    5.3.1 Target Candidate Identification . . . . . . 71
    5.3.2 Target Candidate Ranking . . . . . . 72
        5.3.2.1 Surface Features . . . . . . 72
        5.3.2.2 Semantic Features . . . . . . 72
        5.3.2.3 Social Features . . . . . . 77
        5.3.2.4 Learning-to-Rank . . . . . . 78
5.4 Experiments . . . . . . 78
    5.4.1 Data and Evaluation Metric . . . . . . 79
    5.4.2 Morph Detection Performance . . . . . . 79
    5.4.3 Morph Resolution Performance . . . . . . 80
    5.4.4 Remaining Challenges . . . . . . 83
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6. Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . 86
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
LIST OF TABLES
1.1 Distributions of morph examples. . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Meta paths in DBLP bibliographic network. . . . . . . . . . . . . . . . . . 19
3.1 Description of methods (methods with * make use of the Bayesian approach to initialize user credibility scores). . . . . . . 36
3.2 Tweet distribution by grade. . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Grade distributions for filtered tweets. . . . . . . . . . . . . . . . . . . . . 40
4.1 Description of wikification methods. . . . . . . . . . . . . . . . . . . . . . 57
4.2 Statistics of Freebase KG. . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Overall performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 The performance of systems without using concatenated meta paths. . . . . 60
4.5 Overall performance of concept semantic relatedness methods. . . . . . . . 62
4.6 Examples of relatedness scores between a sample of concepts and the concept “National Basketball Association”. . . . . . . 62
4.7 Examples of relatedness scores between a sample of concepts and the concept “National Football League”. . . . . . . 63
4.8 Examples of relatedness scores between a sample of concepts and the concept “Apple Inc.”. . . . . . . 63
4.9 Overall disambiguation performance on AIDA dataset. . . . . . . . . . . . . 65
4.10 Overall disambiguation performance on tweet set. . . . . . . . . . . . . . . 65
4.11 Impact of semantic KGs and DNN on concept semantic relatedness. . . . . 66
4.12 Impact of semantic KGs and DNN on concept disambiguation. . . . . . . . 66
5.1 Description of feature sets. * Glob only uses the same set of similarity measures when combined with other semantic features. . . . . . . 78
5.2 Data statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3 Performance of morph detection. . . . . . . . . . . . . . . . . . . . . . . . 80
5.4 The system performance based on each single feature set. . . . . . . . . . . 81
5.5 The system performance based on combinations of surface and semantic features. . . . . . . 81
5.6 The system performance of integrating cross-source and cross-genre information. . . . . . . 82
5.7 The effects of social features. . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.8 The effects of temporal constraint. . . . . . . . . . . . . . . . . . . . . . . 83
5.9 Accuracy of target candidate detection. . . . . . . . . . . . . . . . . . . . . 83
5.10 Performance of two categories. . . . . . . . . . . . . . . . . . . . . . . . . 84
5.11 Effects of popularity of morphs. . . . . . . . . . . . . . . . . . . . . . . . . 85
LIST OF FIGURES
1.1 A sample of tweets related to Hurricane Irene in 2011. . . . . . . 2
1.2 A sample of tweets with informal and implicit information. . . . . . . . . . 3
1.3 An illustration of the wikification task for tweets. Concept mentions detected in tweets are marked in bold, and correctly linked concepts are underlined. The concept candidates are ranked by their prior popularity, which will be explained in section 4.4.1, and only the top 2 ranked concepts are listed. . . . . . . 4
1.4 A heterogeneous information network example. . . . . . . . . . . . . . . . 6
1.5 An example of Freebase. Nodes represent concepts such as “Miami Heat”, and edges represent semantic relations such as “Coach” and “Location”. Each concept is also provided with a textual description and concept types. . . . . . . 8
1.6 An illustration of topical coherence for a text. . . . . . . . . . . . . . . . . 11
1.7 Cross-source comparable data example (each morph and target pair is shown in the same color). . . . . . . 14
2.1 (a) Heterogeneous DBLP bibliographic network, (b) Homogeneous co-author network. . . . . . . 17
2.2 Schema of the heterogeneous DBLP bibliographic network. . . . . . . 17
3.1 Web-Tweet-User heterogeneous networks. . . . . . . . . . . . . . . . . . . 29
3.2 Overview of Tri-HITS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Annotation guideline for tweet ranking. . . . . . . . . . . . . . . . . . . . . 36
3.4 Effect of parameters: (a) λtd and λdt for Web-Tweet networks, (b) λtd for Web-Tweet networks, (c) λdt for Web-Tweet-User networks. . . . . . . 38
3.5 Performance comparison of ranking methods. . . . . . . . . . . . . . . . . 39
3.6 (a) Explicit vs inferred implicit Tweet-User relations to construct Tweet-User networks; (b) TextRank vs one-step propagation on explicit Tweet-User networks using the Bayesian approach and retweet/reply/user mention relations. . . . . . . 40
3.7 Co-HITS vs Tri-HITS on (a) Web-Tweet networks, (b) Tweet-User networks. 41
4.1 Approach overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 The DSRM architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Schema of the Twitter network. . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 An example of the relational graph constructed for the example tweets in Figure 1.3. Each node represents a pair ⟨m, c⟩, separated by a comma. The edge weight is obtained from the linear combination of the weights of the three proposed relations. Not all mentions are included due to space limitations. . . . . . . 55
4.5 The effect of labeled tweet size. . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6 The effect of parameter µ. . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.7 Error distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Overview of morph decoding. . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Network Schema of Morph-Related Heterogeneous Information Network . . 73
5.3 Example of morph-related heterogeneous information network. . . . . . . . 74
ACKNOWLEDGMENT
First of all, I would like to express my sincere gratitude and appreciation to my advisor
Prof. Heng Ji. When I first joined her group, I had very limited research experience. It
was her tremendous guidance, support, enthusiasm, and encouragement that led me into
the world of scientific research and introduced me to the fascinating field of natural language
processing. Prof. Ji is always willing to accept new ideas and gives me her full support
to pursue my research goals. In addition, she is always ready to help with any of my
personal issues beyond research. It is my great honor to have her as my supervisor.
I would also like to thank my other doctoral committee members, Prof. Peter Fox,
Prof. Jim Hendler, Dr. Chin-Yew Lin, and Prof. Yizhou Sun, for the great effort they
have put into supervising this thesis. During the writing of this thesis, they provided
many insightful comments and suggestions that pushed me to think bigger and
capture the whole picture, which are not only critical and valuable to this thesis but will
also benefit my future career. Special thanks to Dr. Chin-Yew Lin, who provided me with
a great summer internship opportunity in his group at Microsoft Research Asia. The
academic work of Prof. Sun has also greatly inspired this thesis.
I also owe my gratitude to many collaborators who contributed a great deal to this thesis
and provided tremendous guidance. The exciting discussions with Dr. Hongbo Deng
sparked the idea of Tri-HITS, and the joint work with Dr. Zhen Wen led to the exciting
morph work. Dr. Yunbo Cao, Dr. Xiaojiang Huang, and Dr. Shuming Shi offered
great guidance during my first internship at Microsoft Research Asia. I was also
very fortunate to have Dr. Larry Heck as my supervisor during my second internship at
Microsoft Research. Dr. Heck introduced me to the fascinating techniques of deep learning
and gave me many suggestions and tremendous help for my future career.
I wish to thank all members and visitors of the Blender lab at both CUNY and RPI. I
was extremely fortunate to spend the past four years with them, and I appreciate all of their
tremendous help in both research and daily life. Special thanks to Dr. Haibo
Li and Dr. Arkaitz Zubiaga, with whom I worked on the tweet ranking project; Dr.
Taylor Cassidy for the joint work on the tweet wikification project; Prof. Sujian Li, Dian
Yu, Boliang Zhang, and Xiaoman Pan for the teamwork on the morph projects; and Prof. Hong Yu
for providing many constructive comments on my thesis work.
I would also like to thank my parents, my sister, and my wife. Their selfless love
and encouragement helped me through all of the difficult times and kept me moving
forward. In particular, I would like to thank my wife for her sacrifice and full understanding.
The past two years were very difficult for her because we had to live apart
after I moved to RPI. Finally, I am most grateful to my beloved grandmother for her
love and caring. I feel very guilty for not being able to accompany her during the most
difficult time at the end of her life. This thesis is dedicated to her.
ABSTRACT
Microblogging, a new type of online information sharing platform through short messages
of up to 140 characters, has grown quickly and received increasing attention in
recent years. A microblogging platform (e.g., Twitter) enables both individuals and
organizations to disseminate information, from current affairs to breaking news, in a timely
fashion, which makes it a valuable knowledge source with super-fresh information. For
example, during Hurricane Irene in 2011, updates from users living in New York City
and transportation/evacuation posts from the government provided very useful information
for people tracking the disaster. Therefore, Natural Language Processing (NLP)
research on this new genre is needed to assist knowledge mining and discovery.
Unlike semi-structured knowledge bases (e.g., Wikipedia) and traditional news,
microblogs tend to be noisy, short, and informal, and the phenomenon of information
implicitness is more prominent and pervasive in microblogging. These characteristics
bring unique challenges to people's reading and understanding of microblogs, as
well as to many knowledge mining and discovery tasks. Thus, to alleviate these
problems, in this thesis we propose to filter noisy and uninformative information,
enrich short microblogs with background knowledge from knowledge bases such as
Wikipedia, and resolve informal and implicit information to their regular referents.
To achieve our goals, we propose to leverage and model heterogeneous information
networks (HINs), in contrast to most existing NLP approaches for traditional genres (e.g.,
news), which explore only a single type of information (e.g., text). Microblogging contains
heterogeneous types of information, from social network structures to cross-genre
linkages, forming rich HINs. By designing effective approaches to model both unstructured
texts and structured HINs, we can incorporate additional evidence from HIN structures
beyond text. In this thesis, we present approaches to construct HINs from cross-genre,
cross-source, and cross-type information by incorporating existing clean social
relations, as well as by performing deep content analysis with well-developed
NLP approaches. We also present various effective approaches, including unsupervised
propagation, semi-supervised graph regularization, supervised learning-to-rank, and deep
neural networks, to model HINs for ranking, classification, and similarity measurement.
Our experimental results demonstrate that heterogeneous information network analysis
approaches are also powerful in the field of NLP.
CHAPTER 1
Introduction

1.1 Motivations of Research in Microblogging

Microblogging, a new type of online information sharing platform through short
messages of up to 140 characters, has grown quickly and received increasing attention
in recent years. A microblogging platform (e.g., Twitter [1] and Sina Weibo [2])
enables both individuals and organizations to seek and disseminate information, from current
affairs, breaking news, and personal updates to nearby events, in a timely fashion [3], [4].
The study in [4] further revealed that a retweeted microblog post could reach 1,000 users
on average and it would be disseminated instantly after the first retweet. In addition,
microblogging platforms generate a frequently updated set of trending topics by summarizing
the large number of messages that reflect the hot topics being discussed at a given
moment [5]. All these properties make microblogging a valuable knowledge source and fast
information diffusion platform with super-fresh information. Figure 1.1 shows a sample
of Twitter messages (tweets) posted during Hurricane Irene in 2011. From them we can obtain
very useful information, such as detailed evacuation zones and the closure of transportation
systems, to keep track of the disaster. Thus it is crucial to conduct related Natural Language
Processing (NLP) research to assist knowledge mining and discovery from microblogs.
Unlike semi-structured knowledge bases (e.g., Wikipedia [6]) and traditional
news, microblogging serves as a unique information source with real-time and
detailed information from diverse sources. It has its own distinctive characteristics: (i)
Portions of this chapter previously appeared as: H. Huang, A. Zubiaga, H. Ji, H. Deng, D. Wang, H. Le, T. Abdelzaher, J. Han, A. Leung, J. Hancock, and C. Voss, “Tweet ranking based on heterogeneous networks,” in Proc. of the 24th Int. Conf. on Comput. Linguist., Mumbai, India, 2012, pp. 1239–1256.
H. Huang, Y. Cao, X. Huang, H. Ji, and C.-Y. Lin, “Collective tweet wikification based on semi-supervised graph regularization,” in Proc. of the 52nd Annu. Meeting of the Assoc. for Comput. Linguist., Baltimore, Maryland, 2014, pp. 380–390.
H. Huang, Z. Wen, D. Yu, H. Ji, Y. Sun, J. Han, and H. Li, “Resolving entity morphs in censored data,” in Proc. of the 51st Annu. Meeting of the Assoc. for Comput. Linguist., Sofia, Bulgaria, 2013, pp. 1083–1093.
across the street is an evacuation zone, but my side of the street isn't. here's to the hurricane coloring in the lines... #irene

NYC evacuation order covers 370,000 people who must relocate by tomorrow at 5 pm. Nearly 30 m people under Hurricane Warning on East Coast.

Good morning hurricane Irene hit my side at 5:30am .... as she passing her way to upstate NY

Hurricane Irene Prompts Mandatory Emergency Evacuation of New York City http://t.co/r2ZEokx

No subway, no Broadway in New York: America's biggest subway system was ordered shut down as Hurricane Irene bor... http://t.co/BuGSvsc

Figure 1.1: A sample of tweets related to Hurricane Irene in 2011.
Noisiness, microblog posts from diverse sources tend to contain uninformative noise such
as subjective comments and conversations. For instance, during Hurricane Irene, there were
many informative tweets such as New Yorkers, find your exact evacuation zone by your
address here: http://t.co/9NhiGKG /via @user #Irene #hurricane #NY. However, the majority
of tweets were babble, such as Me, Myself, and Hurricane Irene. and I'm ready For
hurricane Irene.. The Pear Analytics (2009) report [7] on 2000 sample tweets demonstrated
that 40.55% of the tweets are pointless babble, 37.55% are conversations, and
only 8.7% have pass-along value. (ii) Shortness, the maximum length of 140 characters
results in a lack of information within a single post. This not only
makes it difficult for people to understand a single post, but also brings unique challenges
for many NLP tasks, such as entity linking and text classification, which rely extensively
on the richness of contextual and topical clues. (iii) Informality and Implicitness, the
free use of language has resulted in many misspellings, informal writings, and the
use of aliases/morphs. People also tend to create their own language to achieve particular
communication goals, such as avoiding active censorship, expressing positive or negative
sentiment, or making their descriptions more vivid. Thus, information implicitness is
more prominent and pervasive in microblogs. For example, in Chinese microblogging,
Internet users tend to use morphs such as “Conquer West King” or “Governor Bo” to refer
to the former politician “Bo Xilai”. Figure 1.2 gives more examples of the
phenomenon of information informality and implicitness in microblogging. The informal
terms “KD” and “LBJ” and the morph “King” are used to refer to the basketball players
“Kevin Durant” and “LeBron James”.
Alice: “Will KD and LeBron burn out? @Bob takes a look at the fatigue factor entering the playoffs.”
Bob: “@Alice KD hasn’t had any rest at all. King will be good he used to these moments and he’s had some rest #Heat”
Alice: “@Bob LBJ will perform much better in the playoffs, heat has a difficult regular season...”
Figure 1.2: A sample of tweets with informal and implicit information.
1.2 Overall Problem: Enhancing Natural Language Understanding for Microblogs

Due to the above unique characteristics of the microblog genre, it is crucial to
develop automatic tools to process microblogs and provide more background knowledge
to assist people's reading and understanding of these noisy, short, and informal texts,
as well as to assist downstream knowledge mining and discovery tasks. This motivates
the overall problem of this thesis: enhancing natural language understanding for
microblogs. We propose to resolve three important sub-problems, corresponding to the
above characteristics, to achieve this goal.
Sub-problem 1: Identification of salient information. Automatic detection of important
information and filtering of uninformative content address the noisiness problem.
This is particularly useful in emergency situations, because eyewitnesses
might be live-tweeting about anything happening at ongoing events [8] such as natural
disasters.
To assist in these situations, we propose to develop a ranking system that organizes
microblog posts by informativeness, so that informative posts are readily identified, while
pointless and speculative observations are filtered out. However, the definition of
informativeness can vary with the point of view. Microblogging users produce
diverse content ranging from news and events to conversations and personal status
updates. While personal updates and conversations might be relevant to a specific group of
people, we aim to find messages on topics that are informative to a general audience, such
as breaking news and real-time coverage of ongoing events. For example, during Hurricane
Irene in 2011, updates from a user living in New York City about her own safety
might be very informative to her friends and relatives, but not so informative to others.
To produce rankings that are as relevant to as many people as possible, we define infor-
Tweets and concept candidates:
t1: Go Gators!!! → Florida Gators football; Florida Gators men's basketball
t2: Stay up Hawk Fans. We are going through a slump now, but we have to stay positive. Go Hawks! → Fan (person); Mechanical fan | Slump (geology); Slump (sports) | Atlanta Hawks; Hawks (film)
t3: Congrats to UCONN and Kemba Walker. 5 wins in 5 days, very impressive... → University of Connecticut; Connecticut Huskies | Kemba Walker
t4: Just getting to the Arena, we play the Bucks tonight. Let's get it! → Arena; Arena (magazine); Arena (TV series) | Bucks County, Pennsylvania; Milwaukee Bucks

Figure 1.3: An illustration of the wikification task for tweets. Concept mentions detected in tweets are marked in bold, and correctly linked concepts are underlined. The concept candidates are ranked by their prior popularity, which will be explained in section 4.4.1, and only the top 2 ranked concepts are listed.
mativeness as the extent to which a message meets the general interest of people involved
with or tracking the event. For example, during disasters such as hurricanes, general
audiences are concerned about the causes and impacts of the disaster. They would like to be
informed about whether they need to evacuate or whether the transportation systems are
affected. And during sports-related events such as the World Cup, the latest sports
results meet the general interest of people.
Sub-problem 2: Information enrichment from a knowledge base with rich and
clean knowledge. Information ranking alleviates the noisiness problem, but
it fails to solve the information brevity problem. Therefore, information enrichment is
crucial to automatically obtain topically related background knowledge for short
messages. Fortunately, web-scale knowledge bases (KBs) (e.g., Wikipedia, DBpedia [9],
and Freebase [10]) with rich and clean information have been emerging. These knowledge
bases contain rich knowledge about concepts, including textual descriptions and
facts. This motivates us to study the popular Wikification (Disambiguation to Wikipedia)
task [11], which aims to automatically identify each concept mention in a microblog post,
and link it to a concept referent in a KB (e.g., Wikipedia). For example, as shown in
Figure 1.3, Hawks in t2 is an identified mention, and its correct referent concept in Wikipedia
is Atlanta Hawks. An end-to-end wikification system needs to resolve two sub-problems:
(i) concept mention detection, and (ii) concept mention disambiguation.
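As a minimal sketch of these two sub-problems, the toy example below detects mentions with a hand-built lexicon and disambiguates each one by prior popularity alone. The surface forms and popularity scores are hypothetical stand-ins for statistics mined from Wikipedia anchor texts; they are not the method of this thesis, which adds collective, graph-based evidence on top of such priors.

```python
# Toy wikification sketch. The mention lexicon and prior-popularity
# scores below are hypothetical illustrations, not real statistics.

# Surface form -> candidate concepts with a prior P(c | m), e.g. how
# often an anchor text links to each Wikipedia page.
CANDIDATES = {
    "hawks": {"Atlanta Hawks": 0.6, "Hawks (film)": 0.2, "Hawk": 0.2},
    "bucks": {"Milwaukee Bucks": 0.5, "Bucks County, Pennsylvania": 0.5},
}

def detect_mentions(tweet):
    """Sub-problem (i): find tokens that match the mention lexicon."""
    tokens = tweet.lower().replace("!", "").replace(".", "").split()
    return [t for t in tokens if t in CANDIDATES]

def disambiguate(mention):
    """Sub-problem (ii): pick the candidate with the highest prior."""
    cands = CANDIDATES[mention]
    return max(cands, key=cands.get)

tweet = "Stay up Hawk Fans. Go Hawks!"
for m in detect_mentions(tweet):
    print(m, "->", disambiguate(m))
```

A prior-only ranker like this fails exactly where context matters; Chapter 4 addresses that by modeling semantic relatedness among candidates with a relational graph and semi-supervised graph regularization.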
Automatically linking information to these KBs relieves the information brevity problem:
it allows a reader to easily grasp the related topics and enriched information from
a KB. From a system-to-system perspective, its usefulness has been demonstrated in a
variety of applications, including coreference resolution [12], classification [13], and user
interest discovery [14], [15].
Sub-problem 3: Identification and resolution of informal and implicit information.
Due to the free use of language, there exists a huge amount of informal and implicit
information in microblog posts. In particular, one form of language evolution
creates new ways to communicate sensitive subjects in the presence of internet
censorship. We call this phenomenon information morph. For
example, when Chinese online users talk about the former politician “Bo Xilai”, they use
the morph “Conquer West King” instead, a historical figure from four hundred years ago who
governed the same region as Bo. A morph can be either a regular term with a new meaning
or a newly created term, and it can be considered a special case of alias used for hiding
true entities in malicious environments [16], [17]. However, social networks play an important
role in generating morphs: morphs are usually generated by harvesting the collective
wisdom of the crowd to achieve certain communication goals. Aside from avoiding
censorship, other motivations for using morphs include expressing sarcasm/irony or
positive/negative sentiment, and making descriptions of some entities or
events more vivid.
The tweet ranking and wikification tasks fail to detect and link such implicit information
to a KB for two main reasons: (i) unsuccessful identification of candidate entries
in a KB. Informal language is rarely used in a KB with formal texts,
so no explicit linkages exist between a morph mention and its concept referent.
For instance, the anchor text “Conquer West King” always links to the original king
“Wu Sangui” in Wikipedia, and never to the former politician “Bo Xilai”. (ii)
The creation and usage of morphs is usually triggered by a certain ongoing event, and
such up-to-date information may not be reflected in KBs in a timely fashion. For example,
“Conquer West King” came to refer to “Bo Xilai” because Bo fell out of
power and shared many common characteristics with the ancient king “Wu Sangui”.
To correctly resolve morphs, it is crucial to explore and leverage background knowledge
from comparable data sources such as news.
To address these limitations, we propose a new task, “morph decoding”, which aims
to detect implicit morphs and resolve them to Web entities. We believe that successful
discovery and resolution of morphs is a crucial step toward automated understanding of
the fast-evolving social media language, which is important for solving the informality and
implicitness problem. Another application is to help common users without enough
background/cultural knowledge to understand internet language in their daily lives.
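One intuition behind resolving a morph against comparable data can be sketched with a toy temporal co-occurrence heuristic: a morph in censored tweets and its true target in comparable uncensored news tend to spike in the same time periods. The daily counts below are hypothetical, and this single feature is only a stand-in for the much richer surface, semantic, and social features that Chapter 5 combines via learning-to-rank.

```python
# Toy morph resolution sketch: rank target candidates by how well their
# daily mention counts in comparable news align with the morph's daily
# mention counts in tweets. All counts are hypothetical illustrations.
import math

def cosine(u, v):
    """Cosine similarity between two day-indexed frequency vectors."""
    dot = sum(u.get(d, 0) * v.get(d, 0) for d in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Daily mention counts of the morph "Conquer West King" in tweets.
morph_counts = {"d1": 9, "d2": 14, "d3": 2}

# Daily mention counts of target candidates in comparable news text.
candidate_counts = {
    "Bo Xilai":  {"d1": 7, "d2": 12, "d3": 1},
    "Wu Sangui": {"d2": 1},
}

# Rank candidates by temporal co-occurrence with the morph.
ranked = sorted(candidate_counts,
                key=lambda c: cosine(morph_counts, candidate_counts[c]),
                reverse=True)
print(ranked[0])
```

With these illustrative counts, "Bo Xilai" aligns far better with the morph's burst than the literal referent "Wu Sangui", which is the kind of cross-source evidence morph decoding exploits.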
We believe that solving these three important issues are crucial steps to advance
natural language understanding in the informal microblog data. By detecting salient in-
formation, linking mentions to a KB with rich background knowledge, identifying and
resolving implicit morphs, it can benefit down-streaming natural language understanding
systems such as semantic parsing, question answering, and relation extraction.
1.3 Insights of the Thesis: Leveraging and Modeling Heterogeneous
Information Networks for Natural Language Processing
[Figure: a network connecting microblog posts (linked by retweet and reply relations), social users and user communities (linked by follow relations), concept mentions (linked by coreference and semantic relatedness), Web documents, and concepts in a knowledge base (linked by semantic relationships).]
Figure 1.4: A heterogeneous information network example.
Many state-of-the-art NLP systems rely only on the content of a single mi-
croblog post, and perform much worse than systems designed for traditional formal genres
due to the informal writing style, the noisiness, and the lack of context and labeled data in
microblogs. However, unlike traditional formal genres such as news, microblogging
platforms contain heterogeneous types of inter-connected objects, including social
network structures and cross-genre and cross-type linkages. As Figure 1.4 shows,
multiple types of objects in microblogging are connected with each other through
multiple types of relations. For example, microblog posts link directly
to other posts through retweeting and replying relations, microblogs connect
to social users through the authorship relation, users connect to other users via the
follower-followee relation, and users also form communities. In addition, some
tweets have cross-genre linkages to formal-genre web documents via em-
bedded URLs or topical relatedness. Furthermore, concept mentions with different
relationships such as coreference and semantic relatedness can be extracted from both
tweets and web documents, with linkages to concepts in a KB. Finally, the concepts in a
KB also form a large-scale network with different types of concepts and semantic rela-
tions; Figure 1.5 shows an example from Freebase in the sports domain. Networks with
multiple types of objects or multiple types of relations are defined as Heteroge-
neous Information Networks (HINs), in contrast to Homogeneous Information Networks,
which contain a single type of object and a single type of relation.
HINs have achieved remarkable success on various data mining tasks, including
ranking [18], [19], classification [20], [21], clustering [19], [22], and similarity search
and link analysis [23], [24]. HINs have also shown advantages over homogeneous
networks on these tasks, because a homogeneous network is an information-loss
projection of a heterogeneous one [25]: modeling HINs directly can incorporate evidence from
multi-typed networks and differentiate between types of objects and relations.
In the field of NLP, homogeneous networks have been applied successfully to var-
ious tasks, including document summarization [26], [27], entity linking [28], word sense
disambiguation [29], and relation extraction [30]. However, HINs have received little
attention from NLP researchers. A natural question then arises: can we leverage
heterogeneous networks to enhance state-of-the-art NLP approaches, especially on
microblogs? HINs provide feasible ways to incorporate and combine evidence from both
unstructured texts and structured networks, and to capture the discrepancies between
multi-typed nodes and linkages. This motivates the general solution of this thesis:
leveraging and modeling heterogeneous information networks to
Figure 1.5: An example of Freebase. Nodes represent concepts such as “Miami Heat”, and edges represent semantic relations such as “Coach” and “Location”. Each concept is also provided with a textual description and concept types.
enhance NLP for microblogs. In the following subsections, we introduce the motivations
for our proposed HIN-based approaches to the issues discussed above.
1.3.1 Microblog Ranking
The challenge for this task is that microblogs come from very diverse sources and
are noisy. Previous research on microblog ranking relied either on the text of mi-
croblogs or on explicit social network features such as retweets, replies, and follower-
followee relationships; we believe that such networks can be enhanced by integrating
information from a formal genre. On one hand, tweets from different sources tend to
contain non-informative noise such as subjective comments and conversations, so
it is challenging to identify salient information from microblog content alone. On the
other hand, events of general interest such as natural disasters or political elections are
discussed in microblogs sent by many users from multiple communities that are not
connected to each other. In these situations, users are likely to be unaware of each other.
As a result, they fail to connect with many others on topics of mutual interest. This lack of
social interaction produces networks with few explicit linkages between users, and there-
fore between microblogs and users. This sparsity limits the effectiveness of
features extracted from social networks.
To address these limitations, we propose to rank microblogs based on a heteroge-
neous network consisting of microblogs, social users, and web documents. We es-
tablish cross-genre linkages between microblogs and web documents, and infer implicit
tweet-user relations beyond the explicit ones, so that the networks are enriched by connecting
users who share similar content. To model cross-genre and cross-type linkages and
capture strong social signals from social networks, we then propose an effective propaga-
tion model that refines the ranking scores of the three types of objects simultaneously.
The detailed approach will be introduced in Chapter 3.
1.3.2 Microblog Wikification
Motivations of a semi-supervised collective inference model. Context similarity
provides crucial evidence for this task. Measuring it normally requires leveraging
the surrounding context of a concept mention and the article describing a con-
cept in the KB. However, the lack of rich contextual information in a microblog
post makes it challenging to compute context similarity accurately. For instance, relying
only on the context of each single microblog to compute context similarity for the
mentions in Figure 1.3 achieves only 25% disambiguation accuracy: the context
of a single microblog usually cannot provide enough information for dis-
ambiguation. However, the four microblogs in Figure 1.3 are topically relevant
and were posted by the same author within a short time. If we perform
collective inference over them, we can reliably link ambiguous mentions
such as “Gators”, “Hawks”, and “Bucks” to basketball teams instead of other con-
cepts such as the county “Bucks County”. This motivates us to leverage social network
relations to expand each microblog with more topically relevant information, and
to design a collective inference model that jointly resolves multiple mentions over multiple
microblogs.
For more accurate prominent mention detection and disambiguation, it is also cru-
cial to use a set of labeled seeds as guidance for model learning. Sufficient labeled data
is crucial for supervised models. However, manual wikification annotation for short doc-
uments is challenging and time-consuming [31]. The challenges are: (i) unlinkability: a
valid concept may not exist in the KB; (ii) ambiguity: the correct concept cannot be
determined because of the dearth of information within a single tweet, or because there is more
than one correct answer. For instance, it would be difficult to determine the correct ref-
erent concept for “Gators” in t1 in Figure 1.3, and linking “UCONN” in t3 to University
of Connecticut may also be acceptable, since Connecticut Huskies is the athletic team
of the university; (iii) prominence: it is challenging to select a set of linkable mentions
that are important and relevant. It is easy to select “Fans”, “slump”, and “Hawks” as
linkable mentions, but other mentions such as “stay up” and “stay positive” are not promi-
nent. It is therefore difficult to create sufficient high-quality labeled microblogs for
supervised models, and worth considering semi-supervised learning that exploits
unlabeled data. Beyond these annotation issues, it is also challenging to incor-
porate multi-dimensional global evidence into supervised models, which makes
the problem intractable and optimal solutions impossible to find [32].
To address these unique challenges of wikification for short microblogs,
we employ graph-based semi-supervised learning algorithms [33]–[37] for collective in-
ference, exploiting the manifold (cluster) structure in both unlabeled and labeled data.
Unlike unsupervised methods, semi-supervised learning approaches can leverage
a small set of labeled seeds to guide model learning, which is crucial for salient and link-
able mention detection. And in contrast to supervised models, semi-supervised learning
algorithms can use a large amount of unlabeled data to help discover the
real data distribution. To construct a semantically rich relational graph capturing
the similarity between mentions and concepts, we introduce three novel
fine-grained relations based on a set of local features and HINs.
Motivations for better concept semantic relatedness approaches. Beyond context
expansion, another crucial source of evidence for this task is topical coherence: information
from the same context tends to belong to the same topic. For instance, the
text in Figure 1.6 is about a specific topic, NBA basketball, and the mentions
in it are linked to concepts related to that topic. Modeling topical coherence
normally requires a measure of semantic relatedness between the candidate
concepts of mentions from the same context. The standard relatedness measure widely
adopted in existing wikification and entity linking systems leverages Wikipedia anchor
links with the Normalized Google Distance [38], which can be formulated as:
SR_mw(c_i, c_j) = 1 − [log max(|C_i|, |C_j|) − log |C_i ∩ C_j|] / [log |C| − log min(|C_i|, |C_j|)],
where |C| is the total number of concepts in Wikipedia, and C_i and C_j are the sets of
concepts that link to c_i and c_j, respectively. Our analysis reveals that this measure gener-
ates unreliable relatedness scores in many cases and tends to be biased toward popular
concepts. For instance, it predicts that “NBA” is more semantically related to the city
“Chicago” than to its basketball team “Chicago Bulls”.1 This is because popular concepts
such as “Chicago” tend to share more incoming links with other concepts in
Wikipedia. Moreover, an underlying assumption of this method, that semantically related
concepts must share common anchor links, is too strong.
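The anchor-link measure above can be sketched in a few lines of Python, assuming each concept's incoming links are available as a set of IDs (the sets and total in the test below are toy values, not real Wikipedia counts):

```python
import math

def sr_mw(links_i, links_j, total_concepts):
    """Link-based relatedness: 1 - (log max - log common) / (log |C| - log min).
    links_i / links_j are the sets of concepts linking to c_i and c_j;
    total_concepts is |C|. Returns 0 when the concepts share no in-links."""
    common = links_i & links_j
    if not common:
        return 0.0
    numerator = math.log(max(len(links_i), len(links_j))) - math.log(len(common))
    denominator = math.log(total_concepts) - math.log(min(len(links_i), len(links_j)))
    return 1.0 - numerator / denominator
```

Note how the score depends only on in-link overlap: a popular concept with a very large in-link set shares incidental links with many concepts, which is exactly the bias discussed above.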
[Figure text: “NBA basketball - Friday 's results : Detroit 93 Cleveland 81 New York 103 Miami 85 Phoenix 101 Sacramento 95. Miami is going through a slump now.” Its mentions link to the concepts National Basketball Association, Detroit Pistons, Cleveland Cavaliers, New York Knicks, Miami Heat, Phoenix Suns, Sacramento Kings, and Slump (sports).]
Figure 1.6: An illustration of topical coherence for a text.
To address these limitations, we propose a novel deep semantic relatedness model
(DSRM) that leverages semantic knowledge graphs (KGs) and deep neural networks
(DNNs). In the past decade, tremendous effort has gone into constructing large-
scale structured and linked KGs (e.g., Freebase and DBpedia), which store a huge amount
of clean and important knowledge about concepts, from contextual and typed information
to structured facts. Each fact is represented as a triple of the form {left concept, relation,
right concept} connecting a pair of concepts by a certain relationship. An example
for the concept “Miami Heat” in Freebase is shown in Figure 1.5. These semantic
KGs are valuable resources for enhancing relatedness measurement and the deep understanding
of concepts.
1The relatedness score generated by [38] between “NBA” and “Chicago Bulls” is 0.59, while the score between “NBA” and “Chicago” is 0.83.
Low-dimensional representations (i.e., distributed representations) of objects (e.g.,
words, documents, and entities) have shown remarkable success in NLP and
information retrieval thanks to their ability to capture the latent semantics of objects [39],
[40]. Deep learning techniques have been applied successfully to learn distributed repre-
sentations, since they can extract hidden semantic features with hierarchical architectures
and map objects into a latent space (e.g., [39]–[43]). Motivated by this work,
we propose to learn latent semantic entity representations with deep learning techniques
to enhance entity relatedness measurement. We directly encode heterogeneous types of
semantic knowledge from KGs, including structured knowledge (i.e., concept facts and
concept types) and textual knowledge (i.e., concept descriptions), into a DNN. Compared
to the standard approach of [38], our proposed DSRM is thus by nature a
deep semantic model that captures the latent semantics of concepts. Another advan-
tage is that it can capture semantic relatedness between concepts that share no
common anchor links. We will present the detailed approach in Chapter 4.
1.3.3 Morph Decoding
An end-to-end morph decoding system needs to solve two sub-problems: (1)
morph detection and (2) morph resolution. Morph detection is difficult for the following
reasons: (i) large candidate scope: any term may serve as a morph candidate, but only
a very small percentage are actually used as morphs. In a sample of 4,668 tweets that we
annotated, only 450 out of 19,704 unique terms are morphs; (ii) informality: many morphs
are informal terms (e.g., Chinese morphs meaning “rice and cake” and “not thick”),
compared to regular entity names.
Morph resolution is also challenging, for the following reasons. First, the sen-
sitive real targets are often automatically filtered from the same data source under active
censorship. Table 1.1 presents the distributions of some example morphs
and their targets on English Twitter and Chinese Sina Weibo. For example, the target
“Chen Guangcheng” appears only once on Weibo. The co-occurrence of a morph
and its target is thus quite low in the vast amount of information in social media. Second, most
morphs were not created from the pronunciations, spellings, or other encodings of their
targets; instead, they were created from semantically related entities in
historical and cultural narratives (e.g., “Conquer West King” as a morph of “Bo Xilai”), and
are thus very difficult to capture with typical lexical features. Third, tweets from Twit-
ter and Chinese Weibo are short (up to 140 characters) and noisy, and the lack of
context makes it difficult to extract rich and accurate evidence.
Table 1.1: Distributions of morph examples.

Morph     | Target          | Freq. in Twitter (Morph / Target) | Freq. in Weibo (Morph / Target)
Hu Ji     | Hu Jintao       | 1 / 3,864                         | 2,611 / 71
Blind Man | Chen Guangcheng | 18 / 2,743                        | 20,941 / 1
Baby      | Wen Jiabao      | 2,238 / 2,021                     | 26,279 / 8
Although a morph and its target may have very different orthographic forms, they
tend to be embedded in similar semantic contexts involving similar topics and events.
Figure 1.7 presents example messages under censorship (Weibo) and not under cen-
sorship (Twitter and Chinese news). They share similar topics and events
(e.g., “fell from power”, “gang crackdown”, “sing red songs”) and semantic relations
(e.g., family relations with “Bo Guagua”). If we can automatically extract and
exploit these indicative semantic contexts, we can therefore narrow down the real targets effectively.
To tackle these challenges, we propose a HIN-based approach that effectively
models the contexts of a morph and its target. We first construct HINs from multiple
sources, such as Twitter, Sina Weibo, and formal-genre web documents (e.g., news),
because a morph and its target tend to appear in similar contexts. Previous work on
alias detection [16] utilized homogeneous networks to model unstructured texts.
To capture the differing contributions of different neighbor sets, we explore and
propose various meta path-based similarity measures to extract effective semantic features
for morph resolution. We will describe this approach in detail in Chapter 5.
In this thesis, two notions of “semantics” are used. First, there exist specific seman-
tic relationships between many concepts or objects in the world. For instance, as shown in
Weibo (censored):
• Peace West King from Chongqing fell from power, still need to sing red songs?
• There is no difference between that guy’s plagiarism and Buhou’s gang crackdown.
• Remember that Buhou said that his family was not rich at the press conference a few days before he fell from power. His son Bo Guagua is supported by his scholarship.
Twitter and Chinese News (uncensored):
• Bo Xilai: ten thousand letters of accusation have been received during the Chongqing gang crackdown.
• The webpage of “Tianze Economic Study Institute” owned by the liberal party has been closed. This is the first affected website of the liberal party after Bo Xilai fell from power.
• Bo Xilai gave an explanation about the source of his son Bo Guagua’s tuition.
• Bo Xilai led Chongqing city leaders and 40 district and county party and government leaders to sing red songs.
Figure 1.7: Cross-source comparable data example (each morph and target pair is shown in the same color).
Figure 1.5, a “Coach” relationship exists between the person “Erik Spoelstra” and the bas-
ketball team “Miami Heat”, and a “Location” relationship exists between “Miami Heat”
and the city “Miami”. These relationships vary across domains: important
relationships in the sports domain include “Coach”, “Founder”, and “Roster”, while
popular relationships in the movie domain include “Director”, “Actor”, and “Genre”. By
defining such relationships and schemas, many web-scale semantic knowledge graphs
such as Freebase and DBpedia have been constructed. On the other hand, many objects
are conceptually related without explicit, direct relationships between
them; in other words, their relationships are latent. For instance, the three concepts
“Atlanta Hawks”, “Miami Heat”, and “Slump (sports)” are all related to the sports do-
main. Even though no specific relationships exist between them, capturing
such latent semantics is also crucial for many NLP tasks.
1.4 Contributions of the Thesis
After identifying the unique characteristics of the informal microblog genre,
we have proposed to tackle three crucial issues to enhance natural language understanding
in microblogs. Our general solution is to leverage and model heterogeneous information
networks to enhance the current state-of-the-art approaches to the studied sub-problems.
We summarize our key contributions as follows:
• The most important contribution of this thesis is a new and unique angle
for improving current NLP approaches in microblogging: conducting
heterogeneous information network analysis for NLP. Through three case
studies, we show that heterogeneous information network analysis is also powerful
for many NLP tasks. This matters because the previous successes of HIN-
based approaches in the data mining field were mostly based on existing clean and rich
HINs (e.g., DBLP [44]). In this thesis, we explore and construct HINs that
involve unstructured texts and tend to include considerable noise. This thesis thus
demonstrates the potential of HIN-based approaches in the field of NLP,
especially on microblogs.
• Another important contribution is that we have enhanced natural language under-
standing in microblogging for both humans and machines. Our work helps users identify
salient information, provides rich background knowledge, and resolves
morphed entities to their regular, easily understood referents. It can also
benefit many downstream NLP and knowledge mining tasks such as
information extraction and text classification.
• We propose, explore, and adapt various approaches, including unsupervised propa-
gation, semi-supervised graph regularization, supervised learning-to-rank, and deep
neural networks, to model HINs for ranking, classification, and similarity measure-
ment. We achieve state-of-the-art performance on several NLP tasks; for
instance, we advance the standard concept relatedness method adopted
in many existing wikification and entity linking systems.
• We propose methods to construct HINs directly from the noisy raw texts with both
existing social network relations and well-developed NLP approaches. We also
explore cross-genre, cross-platform, and cross-type information to construct HINs.
• We propose a brand-new task, morph decoding, which is crucial for studying the
fast-evolving language of social media.
CHAPTER 2
Background and Relevant Literature
In this chapter, we introduce the necessary background knowledge and review the rel-
evant literature. We first formally define homogeneous and heterogeneous information
networks and introduce their applications. Next, we survey the graph-based approaches
for ranking, similarity measurement, and classification that have broad applications in the
fields of data mining and NLP. In this thesis, we also extend and exploit these approaches
to solve our problems. Finally, we review the research related to our overall thesis topic,
with an emphasis on the informal microblog genre.
2.1 Homogeneous and Heterogeneous Information Networks
Two core concepts for this thesis are homogeneous and heterogeneous information
networks. Formally, an information network can be defined as a directed graph G =
(V, E) with an object type mapping function τ: V → A and a link type mapping function
φ: E → R, where each object v ∈ V belongs to one particular object type τ(v) ∈ A, and
each link e ∈ E belongs to a particular relation φ(e) ∈ R. If two links belong to the
same relation type, they share the same starting object type as well as the same
ending object type. An information network is homogeneous if and only if there is exactly
one object type and one link type; it is heterogeneous when
the objects are of multiple distinct types or there is more than one type of link.
Figure 2.1(a) shows an example heterogeneous DBLP bibliographic network, which
includes three types of objects: venues, papers, and authors. Links between papers
and venues indicate the “publish”/“published by” relationship, links between papers
Portions of this chapter previously appeared as: H. Huang, A. Zubiaga, H. Ji, H. Deng, D. Wang, H. Le, T. Abdelzaher, J. Han, A. Leung, J. Hancock, and C. Voss, “Tweet ranking based on heterogeneous networks,” in Proc. of the 24th Int. Conf. on Comput. Linguist., Mumbai, India, 2012, pp. 1239–1256.
H. Huang, Y. Cao, X. Huang, H. Ji, and C.-Y. Lin, “Collective tweet wikification based on semi-supervised graph regularization,” in Proc. of the 52nd Annu. Meeting of the Assoc. for Comput. Linguist., Baltimore, Maryland, 2014, pp. 380–390.
H. Huang, Z. Wen, D. Yu, H. Ji, Y. Sun, J. Han, and H. Li, “Resolving entity morphs in censored data,” in Proc. of the 51st Annu. Meeting of the Assoc. for Comput. Linguist., Sofia, Bulgaria, 2013, pp. 1083–1093.
and authors indicate the “write”/“written by” relationship, and links between papers
indicate the “cite”/“cited by” relationship. Figure 2.1(b) shows an example
homogeneous co-author network, with only one type of object (authors)
and one type of relation (co-authorship).
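The definition above can be made concrete with a toy data structure (illustrative only; the node and relation names below are hypothetical):

```python
# Every node carries an object type from A (tau: V -> A) and every edge a
# relation type from R (phi: E -> R), mirroring the formal definition above.
class InfoNetwork:
    def __init__(self):
        self.node_type = {}   # tau: V -> A
        self.edges = []       # list of (source, target, relation)

    def add_node(self, v, obj_type):
        self.node_type[v] = obj_type

    def add_edge(self, u, v, relation):
        self.edges.append((u, v, relation))

    def is_heterogeneous(self):
        # heterogeneous iff multiple object types OR multiple relation types
        obj_types = set(self.node_type.values())
        rel_types = {r for _, _, r in self.edges}
        return len(obj_types) > 1 or len(rel_types) > 1

# Toy DBLP-style network with venues, papers, and authors.
dblp = InfoNetwork()
dblp.add_node("ACL", "venue")
dblp.add_node("p1", "paper")
dblp.add_node("a1", "author")
dblp.add_edge("p1", "ACL", "published_by")
dblp.add_edge("p1", "a1", "written_by")
```

A co-author-only network built the same way, with a single `author` type and a single `co_author` relation, would be homogeneous under this test.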
Figure 2.1: (a) Heterogeneous DBLP bibliographic network; (b) homogeneous co-author network.
Figure 2.2: Schema of the heterogeneous DBLP bibliographic network.
There has been extensive previous work on homogeneous information networks for
various tasks such as ranking, classification, clustering, and link analysis and prediction [26],
[45]–[52]. In recent years, mining directly over heterogeneous information networks
(HINs) has received increasing attention and has demonstrated advantages over
approaches relying on homogeneous networks in the field of data mining, mainly
because the latter are information-loss projections of the former. For example,
the co-author network contains only the co-author relationship, while other information
about papers and venues is missing.
One specific type of HIN is the web-scale knowledge graph (KG), such as DBpe-
dia and Freebase, with an example shown in Figure 1.5. Semantic KGs contain millions
of concepts and store a huge amount of knowledge, from structured facts and concept types
to textual descriptions. Each fact is represented as a triple of the form {left concept,
relation, right concept} connecting a pair of concepts by a certain relationship. For
instance, {Miami Heat, Founded, 1988} states that the basketball team Miami
Heat was founded in 1988. Semantic KGs have been demonstrated to be useful resources
for external knowledge mining for entity and relation extraction [53], [54] and for corefer-
ence and entity linking [55], [56]. Some recent work learns distributed representations
for concepts directly from KGs for semantic parsing [57], link prediction [42], [43], [58],
and question answering [59], [60]. By learning distributed representations that
preserve the existing relationships between entities, new relationships can be inferred
to complete the KGs.
An important concept defined over HINs is the meta path: a path defined over
a network, composed of a sequence of relations between object types [24].
For example, Table 2.1 shows a set of meta paths extracted from the DBLP bibliographic
network whose schema is shown in Figure 2.2 [23]. Each meta path normally
has its own semantic meaning; for instance, the path “A - P - A” indicates that two authors
a_i and a_j are co-authors. The meta path concept has been successfully applied to enhance
various tasks, including link prediction [23], similarity search [24], clustering [22], and
classification [21]. The advantage of meta path-based approaches is their ability
to capture the different semantic meanings of each type of path.
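A standard way to count meta-path instances is via adjacency-matrix products; the sketch below uses toy data (hypothetical authors a0–a2 and papers p0–p3, not drawn from DBLP):

```python
import numpy as np

# For the meta path A - P - A, the product W_AP @ W_AP.T gives in cell
# (i, j) the number of papers co-authored by a_i and a_j; chaining further
# matrices would count longer paths such as A - P - V - P - A.
W_AP = np.array([          # rows: authors a0..a2, columns: papers p0..p3
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
])
apa_counts = W_AP @ W_AP.T  # path-count matrix for A - P - A
```

Here `apa_counts[i, j]` is the Path Count feature for the pair (a_i, a_j) under this particular meta path, and the diagonal gives each author's paper count.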
2.2 Graph-based Approaches
2.2.1 Ranking
Link-based ranking approaches are an important class of ranking algorithms that
utilize link structures to determine the authority or importance of a node in a network. We
survey several popular link-based ranking algorithms based on homogeneous or hetero-
geneous networks.
Table 2.1: Meta paths in the DBLP bibliographic network.

Meta Path             | Semantic Meaning of the Path
A - P - A             | a_i and a_j are co-authors
A - P → P - A         | a_i cites a_j
A - P ← P - A         | a_i is cited by a_j
A - P - V - P - A     | a_i and a_j publish in the same venue
A - P - A - P - A     | a_i and a_j are co-authors of the same author
A - P - T - P - A     | a_i and a_j write papers on the same topic
A - P → P → P - A     | a_i cites papers that cite a_j
A - P → P ← P - A     | a_i and a_j cite the same paper
A - P ← P → P - A     | a_i and a_j are cited by the same paper
PageRank. The first important link-based ranking algorithm is PageRank [45],
a random-walk-based weight propagation algorithm. Its underlying assumption
is that both the number of nodes pointing to a node and the quality of those
nodes matter for ranking: high-quality nodes should contribute more. Given a
graph G = (V, E), where V is a set of nodes and E a set of edges, PageRank can
be formulated as:

r_i = (1 − d) + d · Σ_{v_j ∈ In(v_i)} r_j / |Out(v_j)|,

where v_i is a vertex with ranking score r_i, In(v_i) is the set of nodes with
links to v_i, Out(v_j) is the set of nodes that v_j links to, and |Out(v_j)| is the size of
that set. The damping factor d controls the probability that the random
walk continues from the current node.
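The update above can be sketched as a short power iteration (a toy Python version, assuming an unweighted adjacency-list graph):

```python
def pagerank(out_links, d=0.85, iters=50):
    """Iterate r_i = (1 - d) + d * sum_{v_j in In(v_i)} r_j / |Out(v_j)|.
    out_links maps each node to the list of nodes it links to (toy sketch)."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    # build the reverse index In(v) from the forward links
    in_links = {v: [] for v in nodes}
    for u, vs in out_links.items():
        for v in vs:
            in_links[v].append(u)
    r = {v: 1.0 for v in nodes}
    for _ in range(iters):
        r = {v: (1 - d) + d * sum(r[u] / len(out_links[u]) for u in in_links[v])
             for v in nodes}
    return r
```

A node with no in-links settles at the floor score 1 − d, while nodes pointed to by well-scored nodes rise above it.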
Many variants of PageRank have been proposed to handle weighted or undi-
rected graphs. For example, TextRank [26] was proposed to rank sentences in a weighted,
undirected sentence graph for document summarization, and can be formulated as
follows:

r_i = (1 − d) + d · Σ_{v_j ∈ In(v_i)} (w_ji · r_j) / (Σ_{v_k ∈ Out(v_j)} w_jk).
Another variant is Personalized PageRank [47], which includes personalization
evidence r⁰_i for each node v_i:

r_i = (1 − d) · r⁰_i + d · Σ_{v_j ∈ In(v_i)} r_j / |Out(v_j)|.
HITS. HITS [46] is another popular link-based ranking algorithm. Its main differ-
ence from PageRank is that it introduces a second type of score for nodes acting as hubs.
The authors argue that good authorities do not necessarily point to other good authorities;
rather, good authorities are linked to by many good hubs, and a good hub points to many
good authorities. In HITS, each node v_i has two scores, an authority score a_i and a hub
score h_i, updated as:

a_i = Σ_{v_j ∈ In(v_i)} h_j,
h_j = Σ_{v_i ∈ Out(v_j)} a_i.
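These mutually reinforcing updates can be sketched as follows (toy Python; the per-step L2 normalization is standard practice to keep the scores bounded, though the formulas above omit it):

```python
import math

def hits(edges, iters=50):
    """edges is a list of (source, target) pairs. Iterate:
    authority a_i = sum of hub scores of in-neighbors,
    hub h_j = sum of authority scores of out-neighbors."""
    nodes = {u for e in edges for u in e}
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iters):
        auth = {v: sum(hub[u] for u, w in edges if w == v) for v in nodes}
        hub = {v: sum(auth[w] for u, w in edges if u == v) for v in nodes}
        # normalize so scores do not grow without bound
        na = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        nh = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {v: x / na for v, x in auth.items()}
        hub = {v: x / nh for v, x in hub.items()}
    return auth, hub
```

In a small graph where two hubs point to one page and only one hub points to another, the doubly linked page gets the higher authority score, and the hub covering both authorities gets the higher hub score.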
Co-HITS. Co-HITS [18] is a link analysis algorithm designed for a bipartite
graph with content from two types of objects. The intuition behind its score propagation
is mutual reinforcement to boost co-linked objects.
Given a bipartite graph G = (U ∪ V, E), where U and V are two disjoint sets of
vertices, we use w^uv_ij (or w^vu_ji) to denote the weight of the edge between u_i and v_j.
Collecting all the weights between U and V, we write W^uv ∈ R^{|U|×|V|} (or
W^vu ∈ R^{|V|×|U|}) for the weight matrix between U and V; note that W^uv
is the transpose of W^vu, since w^uv_ij = w^vu_ji. For each u_i ∈ U, the transition
probability p^uv_ij is the probability that vertex u_i in U reaches vertex v_j in V
at the next step, defined as the normalized weight p^uv_ij = w^uv_ij / Σ_k w^uv_ik, such
that Σ_{j ∈ V} p^uv_ij = 1. Similarly, for each v_j ∈ V we obtain the transition
probability p^vu_ji = w^vu_ji / Σ_k w^vu_jk with Σ_{i ∈ U} p^vu_ji = 1.
The iterative framework of Co-HITS can then be formulated as:

r(u_i) = (1 − λ_u) · r⁰(u_i) + λ_u · Σ_{j ∈ V} p^vu_ji · r(v_j),
r(v_j) = (1 − λ_v) · r⁰(v_j) + λ_v · Σ_{i ∈ U} p^uv_ij · r(u_i),

where λ_u ∈ [0, 1] and λ_v ∈ [0, 1] are personalization parameters, r⁰(u_i) and r⁰(v_j)
are the initial ranking scores of u_i and v_j, and r(u_i) and r(v_j) are their updated ranking
scores. When both λ_u and λ_v are set to 1, Co-HITS reduces to the HITS
algorithm; when exactly one of λ_u or λ_v is set to 1, it reduces to
personalized PageRank.
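The Co-HITS iteration can be sketched on a tiny bipartite graph (toy Python; edge weights and initial scores below are hypothetical):

```python
def co_hits(w, r0_u, r0_v, lam_u=0.5, lam_v=0.5, iters=50):
    """w maps (u_i, v_j) pairs to edge weights; r0_u / r0_v are dicts of
    initial scores. Implements the two coupled updates above."""
    U = {i for i, _ in w}
    V = {j for _, j in w}
    # row/column-normalized transition probabilities
    p_uv = {(i, j): wij / sum(w[(i, k)] for k in V if (i, k) in w)
            for (i, j), wij in w.items()}
    p_vu = {(j, i): wij / sum(w[(k, j)] for k in U if (k, j) in w)
            for (i, j), wij in w.items()}
    ru, rv = dict(r0_u), dict(r0_v)
    for _ in range(iters):
        ru = {i: (1 - lam_u) * r0_u[i]
              + lam_u * sum(p_vu[(j, i)] * rv[j] for j in V if (j, i) in p_vu)
              for i in U}
        rv = {j: (1 - lam_v) * r0_v[j]
              + lam_v * sum(p_uv[(i, j)] * ru[i] for i in U if (i, j) in p_uv)
              for j in V}
    return ru, rv
```

With λ_u = λ_v = 0.5, each score blends its prior with the propagated evidence, so a node on the V side that is linked by more U-side nodes ends up ranked higher.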
2.2.2 Similarity Measurement
Similarity measurement is crucial in this thesis since it is directly related to the microblog wikification and morph resolution tasks. Accurate similarity measurement also enables us to construct cleaner networks. We review several commonly used graph-based similarity measures for link prediction [48]. Given a graph $G = (V, E)$, where $V$ is a set of nodes and $E$ is the set of existing links, the following measures can be used to predict the probability of linkage between two nodes $x$ and $y$. Each of them provides a different angle on the similarity between two nodes. When labeled data is available, supervised approaches such as learning-to-rank algorithms can be leveraged to combine them [16], [23].
Common Neighbors. This measure is the size of the common neighbor set of $x$ and $y$: $sim(x, y) = |\Gamma(x) \cap \Gamma(y)|$, where $\Gamma(x)$ and $\Gamma(y)$ are the neighbor sets of $x$ and $y$, and $|\cdot|$ is the size of a set.

Jaccard's coefficient. This is a commonly used similarity measure, formulated as $sim(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$.

Adamic/Adar. This measure aims to capture the importance of each common neighbor. It refines the simple counting of common neighbors by putting lower weights on more frequent neighbors: $sim(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log |\Gamma(z)|}$.
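The three neighbor-based measures can be computed directly from an adjacency structure. A minimal sketch, assuming the graph is stored as a dictionary mapping each node to its set of neighbors (the degree-1 guard in Adamic/Adar, which avoids division by $\log 1 = 0$, is our addition):

```python
from math import log

def common_neighbors(nbrs, x, y):
    """Size of the common neighbor set of x and y."""
    return len(nbrs[x] & nbrs[y])

def jaccard(nbrs, x, y):
    """Common neighbors normalized by the size of the neighbor union."""
    union = nbrs[x] | nbrs[y]
    return len(nbrs[x] & nbrs[y]) / len(union) if union else 0.0

def adamic_adar(nbrs, x, y):
    """Down-weight frequent (high-degree) common neighbors;
    skip degree-1 neighbors, whose log-degree is zero."""
    return sum(1.0 / log(len(nbrs[z]))
               for z in nbrs[x] & nbrs[y] if len(nbrs[z]) > 1)
```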
The above measures are based on neighbor sets, and the following are path-based
measures.
Path Count. It measures the number of path instances between x and y.
Random Walk. It measures the probability of a random walk that starts from x and
ends at y.
SimRank. SimRank [49] is also a random walk based approach, with the assumption that two similar nodes should share many similar neighbors:

$$sim(x, y) = \gamma \cdot \frac{\sum_{a \in \Gamma(x)} \sum_{b \in \Gamma(y)} sim(a, b)}{|\Gamma(x)||\Gamma(y)|}.$$
Normalized Google Distance (NGD). NGD [61] was originally invented to measure the similarity between words and phrases based on their co-occurrence in a large-scale document collection. Formally, it can be presented as:

$$NGD(x, y) = \frac{\log \max(|f(x)|, |f(y)|) - \log |f(x) \cap f(y)|}{\log N - \log \min(|f(x)|, |f(y)|)},$$

where $x$ and $y$ are two words/phrases, $f(x)$ is the set of documents that contain $x$, and $N$ is a normalizing factor. Intuitively, if two words co-occur in many documents, they tend to be more related to each other. NGD has been combined with Wikipedia anchor links to measure the relatedness of concepts [38], an approach that has been widely adopted in many wikification and entity linking systems.
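A direct implementation of NGD from the document sets is short. Note that NGD is a distance, so smaller values mean more related terms; the infinity convention for terms that never co-occur is our addition to this sketch:

```python
from math import log

def ngd(docs_x, docs_y, n_total):
    """Normalized Google Distance between two terms, given the sets of
    documents each occurs in and the normalizing factor N (total documents)."""
    fx, fy = len(docs_x), len(docs_y)
    fxy = len(docs_x & docs_y)
    if fxy == 0:
        return float("inf")  # terms never co-occur: maximally distant
    return (log(max(fx, fy)) - log(fxy)) / (log(n_total) - log(min(fx, fy)))
```

Two terms occurring in exactly the same documents get distance 0; partially overlapping terms get a positive distance that shrinks as their overlap grows.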
The above measures have also been adapted to heterogeneous information networks using the meta path concept [23], [24]. Let $\mathcal{P}$ be a specific type of meta path between two objects $x$ and $y$; we summarize some of the similarity measures based on $\mathcal{P}$. Different from the above measures, meta path-based methods leverage only one type of path at a time to compute a specific feature, and can thus capture the unique contribution of each path.

Path Count. It measures the number of path instances of $\mathcal{P}$ between $x$ and $y$: $sim(x, y) = |\{p : p \in \mathcal{P}\}|$.

Random Walk. It measures the probability of a random walk that starts from $x$ and ends at $y$ following meta path $\mathcal{P}$: $sim(x, y) = \sum_{p \in \mathcal{P}} prob(p)$, where $\mathcal{P}$ is the set of all path instances starting at $x$ and ending at $y$.

Pairwise Random Walk. For a meta path $\mathcal{P}$ that can be decomposed into two shorter meta paths of the same length, $\mathcal{P} = (\mathcal{P}_1 \mathcal{P}_2)$, pairwise random walk measures the probability that two random walks starting from $x$ and $y$ respectively reach the same intermediate object: $sim(x, y) = \sum_{(p_1 p_2) \in (\mathcal{P}_1 \mathcal{P}_2)} prob(p_1)\, prob(p_2^{-1})$, where $p_2^{-1}$ is the inverse of $p_2$.
PathSim. PathSim [24] is proposed to measure similarity between peer objects. Given a symmetric meta path $\mathcal{P}$, the similarity between two objects $x$ and $y$ can be defined as

$$sim(x, y) = \frac{2 \times |\{p_{x \rightsquigarrow y} : p_{x \rightsquigarrow y} \in \mathcal{P}\}|}{|\{p_{x \rightsquigarrow x} : p_{x \rightsquigarrow x} \in \mathcal{P}\}| + |\{p_{y \rightsquigarrow y} : p_{y \rightsquigarrow y} \in \mathcal{P}\}|},$$

where $p_{x \rightsquigarrow y}$ is a path instance between $x$ and $y$ that follows the meta path $\mathcal{P}$, $p_{x \rightsquigarrow x}$ is one between $x$ and itself, and $p_{y \rightsquigarrow y}$ is one between $y$ and itself.
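Meta path-based path counts and PathSim can be computed by multiplying the adjacency matrices along the meta path. A minimal sketch using dense list-of-lists matrices (for a symmetric meta path such as author-paper-author, the sequence is an adjacency matrix followed by its transpose):

```python
def path_count_matrix(adj_matrices):
    """Multiply the adjacency matrices along a meta path; entry [x][y] of the
    result is the number of path instances between objects x and y."""
    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]
    M = adj_matrices[0]
    for A in adj_matrices[1:]:
        M = matmul(M, A)
    return M

def pathsim(M, x, y):
    """PathSim for a symmetric meta path whose path-count matrix is M."""
    denom = M[x][x] + M[y][y]
    return 2.0 * M[x][y] / denom if denom else 0.0
```

For instance, with an author-paper matrix and its transpose, `path_count_matrix` yields the author-paper-author path counts, and `pathsim` turns them into peer similarities.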
2.2.3 Semi-supervised Learning
Another important family of graph-based algorithms is graph-based semi-supervised, or transductive, learning algorithms, which exploit the manifold (cluster) structure in both unlabeled and labeled data [33]–[37]. These approaches normally assume label smoothness over a defined graph, where the nodes represent a set of labeled and unlabeled instances and the weighted edges reflect the closeness of each pair of instances. The goal of these approaches is twofold: (i) the refined labels should be close to the annotated labels, and (ii) the refined labels should be smooth over the whole graph. We summarize two popular approaches.
Experimental Setting. Denote a dataset with $n$ instances as $X = \{x_1, ..., x_l, ..., x_n\}$ and its label vector as $F = \{f_1, ..., f_l, ..., f_n\}$, where each $f_i$ belongs to a label set $L$. The first $l$ instances are labeled seeds with label vector $F_l$, and the remaining $n - l$ instances are unlabeled. The goal of a transductive algorithm is then to infer labels $F_u$ for the unlabeled instances based on a graph constructed over both the labeled and unlabeled data.
Weight matrix computation. Normally, transductive learning relies on an $n \times n$ symmetric weight matrix $W$ that encodes the similarity of each pair of instances. Suppose $x \in \mathbb{R}^m$, i.e., each $x_i$ is represented as an $m$-dimensional feature vector $x_i = \langle x_{i1}, ..., x_{im} \rangle$; then a common way to compute $W$ is:

$$W_{ij} = \exp\left(-\sum_{d=1}^{m} \frac{(x_{id} - x_{jd})^2}{\sigma_d^2}\right).$$
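The weight matrix computation is a direct translation of the formula above. A short sketch with per-dimension bandwidths $\sigma_d$ passed in as a list:

```python
from math import exp

def weight_matrix(X, sigma):
    """Pairwise RBF weights W_ij = exp(-sum_d (x_id - x_jd)^2 / sigma_d^2);
    X is a list of m-dimensional feature vectors, sigma the per-dimension
    bandwidths."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            W[i][j] = exp(-sum((X[i][d] - X[j][d]) ** 2 / sigma[d] ** 2
                               for d in range(len(sigma))))
    return W
```

The resulting matrix is symmetric, with ones on the diagonal and weights decaying toward zero for distant instance pairs.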
Label Propagation. One of the earliest transductive learning algorithms is label propagation (LP) [33]. It minimizes the following objective function, which ensures that unlabeled instances that are strongly connected in the graph receive similar labels:

$$\Omega(F) = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (F_i - F_j)^2.$$
LLGC. Another popular transductive learning approach is learning with local and global consistency (LLGC) [36]. It minimizes the objective function:

$$\Omega(F) = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2 + \mu \sum_{i=1}^{n} \left\| F_i - F_i^0 \right\|^2,$$

where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$, $F^0$ encodes the initial labels, and $\mu$ controls the trade-off between graph smoothness and fit to the annotated labels.
Both closed-form and iterative solutions exist for LP and LLGC, since their objective functions are convex [33], [36]. In practice, the closed-form solutions suffer from scalability and efficiency issues, so iterative solutions are more suitable for practical applications on large-scale datasets. The assumptions of LP and LLGC are similar; the difference mainly lies in the choice of loss function and regularizer.
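As an illustration of the iterative form, the LLGC update can be written as $F \leftarrow \alpha S F + (1 - \alpha) F^0$ with $S = D^{-1/2} W D^{-1/2}$ [36], where $\alpha \in (0, 1)$ plays the role of the trade-off otherwise governed by $\mu$. A small Python sketch under the assumption that label scores are stored as rows of a dense matrix:

```python
def llgc(W, F0, alpha=0.5, iters=100):
    """Iterative LLGC: F <- alpha * S F + (1 - alpha) * F0, with
    S = D^{-1/2} W D^{-1/2}; rows of F0 are initial per-class label scores."""
    n, c = len(W), len(F0[0])
    d = [sum(row) for row in W]  # node degrees
    # Symmetrically normalized weight matrix S.
    S = [[W[i][j] / ((d[i] * d[j]) ** 0.5) if d[i] and d[j] else 0.0
          for j in range(n)] for i in range(n)]
    F = [row[:] for row in F0]
    for _ in range(iters):
        # Spread labels along the graph while clamping toward the seeds F0.
        F = [[alpha * sum(S[i][j] * F[j][k] for j in range(n))
              + (1 - alpha) * F0[i][k] for k in range(c)] for i in range(n)]
    return F
```

On a three-node chain with the two endpoints seeded with different classes, the middle node ends up with equal scores for both classes, while each endpoint keeps leaning toward its own seed label.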
2.3 Related Work to the Thesis Topic
In this section, we summarize the previous work related to this thesis.
2.3.1 Ranking in Microblogging
Previous research on microblog ranking has relied on the analysis of content [62], user credibility [63]–[66], and URL availability, or combinations of them [67], [68]. In addition, Huang et al. [68] also exploited content similarity to propagate evidence within the microblog genre. Most work has been based on supervised learning models such as RankSVM, Naive Bayes classifiers, and linear regression. Inouye and Kalita [69] compared various unsupervised methods to rank microblogs for summarization purposes, but used only lexical-level content analysis features.
In analyzing the information credibility of microblogs, Castillo et al. [70] relied on various levels of features (i.e., message-based, user-based, topic-based, and propagation-based features) and supervised learning models for information credibility assessment in Twitter, which Gupta et al. [71] extended by capturing relations among events, tweets, and users. A Bayesian model was proposed in [72], [73] to assess microblog credibility. However, it remains a preliminary approach due to the linear assumption made in the iterative algorithm of the basic fact-finding scheme. Intensive research has also been conducted on information credibility analysis (cf. [74]).
In identifying influential users in microblogging services, TwitterRank [75] used a variant of the PageRank algorithm with both content information and link structure to measure user influence. Pal and Counts [76] leveraged a clustering approach to avoid bias toward highly visible users. Romero et al. [77] analyzed the information forwarding activity of users and showed that user popularity does not necessarily indicate influence.
2.3.2 Microblog Wikification
The task of linking concept mentions has received increased attention over the past several years, from the linking of concept mentions in a single text [11] to the linking of clusters of coreferent named entity mentions spread across different documents (Entity Linking) [78], [79].
Building such a linking system requires solving two sub-problems: mention detection and mention disambiguation. A significant portion of recent work considers the two sub-problems separately and focuses on the latter, first defining candidate concepts for a detected mention based on anchor links. Mention disambiguation is then formulated as a ranking problem, either by resolving one mention at a time (non-collective approaches) or by disambiguating a set of relevant mentions simultaneously (collective approaches). Non-collective methods usually rely on prior popularity and context similarity with supervised models [11], [80], [81], while collective approaches further leverage the global coherence between concepts, normally through supervised or graph-based re-ranking models [28], [31], [32], [80]–[91]. Notably, when applying collective methods to short messages from social media, evidence from other messages usually needs to be considered [31], [90], [91].
Joint detection and disambiguation of mentions for tweets was proposed in [92], [93]. [92] studied several supervised machine learning models, but without considering any global evidence, either from a single tweet or from other relevant tweets. [93] explored second-order entity-to-entity relations but did not incorporate evidence from multiple tweets.
2.3.3 Morph Decoding
We propose a new task, morph decoding, to detect and resolve informal and implicit morphs to their true target entities.
Many morphs are created to avoid active censorship. Bamman et al. [94] automatically discovered politically sensitive terms from Chinese tweets based on message deletion analysis, in order to study social user behavior under active censorship. In contrast, our work goes beyond target identification by resolving implicit morphs to their real targets.

Some recent work attempted to detect and normalize Chinese informal words into formal words [95]–[100]. Information morph discovery also belongs to the category of anomalous text detection. However, morphs are created and used by social media users intentionally, to achieve certain communication goals, and not all anomalous texts are used as morphs (e.g., "geliable" used to replace "awesome"). In addition, many morphs are regular terms (e.g., "Conquer West King" and "King").
Our morph resolution work is closely related to alias detection [16], [17], [101], [102]. [16], [17] proposed to detect aliases in malicious environments by modeling the behaviors of entities with semantic models and an information-theoretic approach. [101] studied the alias detection problem on the Web with lexical pattern-based and co-occurrence-based models. However, sensitive morphs rarely co-occur with their real targets, which causes pattern-based methods to fail. [102] detected email aliases by modeling the co-occurrence of pairs of email addresses on web pages.
CHAPTER 3
Microblog Ranking
In this chapter, we introduce Tri-HITS, a novel propagation model that leverages global
information iteratively computed across heterogeneous networks constructed from web
documents, microblogs, and users, to rank microblogs on a topic by informativeness. We
propose three high-level hypotheses that motivate the presented methods of construct-
ing heterogeneous networks of microblogs, users, and web documents. The proposed
model, Tri-HITS, operates iteratively over all networks incorporating the semantics and
importance of different linkages. By ranking microblogs about Hurricane Irene, we demonstrate that incorporating a formal genre such as web documents, inferring implicit social networks, and performing effective ranking score propagation with the proposed model can significantly improve ranking quality.
3.1 Motivations and Hypotheses
Next, we describe the motivational aspects and hypotheses of this work, which we aim to validate.
Hypothesis 1: Informative tweets are more likely to be posted by credible users, and vice versa (credible users are more likely to post informative tweets). [67], [68] consider that users who have more followers, mentions, and retweets, and are listed more often, are more likely to be authoritative. They used retweet, reply, user mention, and follower counts to compute the degree of authoritativeness of users, and showed that user account authority is a helpful feature for tweet ranking. However, for events of general interest involving multiple communities, users are more likely to be unaware of each other and rarely interact. This makes it insufficient to rely on user-user networks constructed from retweet and reply interactions to compute user credibility scores. To overcome this problem, we apply a Bayesian approach that computes the credibility of users by incorporating the contents they share.
This chapter previously appeared as: H. Huang, A. Zubiaga, H. Ji, H. Deng, D. Wang, H. Le, T. Abdelzaher, J. Han, A. Leung, J. Hancock, and C. Voss, "Tweet ranking based on heterogeneous networks," in Proc. of the 24th Int. Conf. on Comput. Linguist., Mumbai, India, 2012, pp. 1239–1256.
Hypothesis 2: Tweets involving many users are more likely to be informative. Having many users share similar tweets at the same time helps identify informative tweets.
For example, in the context of Hurricane Irene, users were likely to share information
about the Evacuation Zone when they found relevant news or events. The synchronization
of information within groups has been successfully harnessed in other fields like financial
trading, autonomous swarms of exploratory robots, and flocks of communicating software
agents [103]. This idea has also been successfully exploited for event summarization from
tweets [104].
Hypothesis 3: Tweets aligned with the contents of web documents are more likely to be informative. Tweets come from diverse sources and can carry diverse content, ranging from news and events to conversations and personal status updates. Therefore, informative tweets tend to be interspersed with noisy and non-informative tweets. This differs from formal genres such as web documents, which tend to be cleaner. In the case of current events such as natural disasters or political elections, there are tight correlations between social media and web documents: important information shared in social media tends to be posted in web documents as well. For example, the following informative tweets would rank highly because they are linked to informative web documents: "New Yorkers, find your exact evacuation zone by your address here: http://t.co/9NhiGKG /via @user #Irene #hurricane #NY" and "Details of Aer Lingus flights affected by Hurricane Irene can be found at http://t.co/PCqE74V201d". As far as we know, this is the first work to integrate information from a formal genre such as web documents to enhance tweet ranking.
3.2 Our Proposed Approach: Tri-HITS
Based on the formulated hypotheses, we now describe how Tri-HITS works. Tri-HITS
is developed over the heterogeneous networks that include web documents, tweets, and
users, as shown in Figure 3.1.
3.2.1 Overview
Figure 3.2 depicts how Tri-HITS works. For a set of tweets on a specific topic,
a rule-based filtering component is first applied to filter out a subset of non-informative
tweets. For the remaining tweets, we define queries based on top terms in tweets, and use
Figure 3.1: Web-Tweet-User heterogeneous networks.
Bing Search API [105] to retrieve the titles of the top m web documents for those queries (m = 2 in these experiments). Then we apply TextRank and a Bayesian approach to initialize ranking scores for tweets, web documents, and users. Finally, we iteratively propagate ranking scores for web documents, tweets, and users across the networks to refine the tweet ranking.
3.2.2 Filtering non-informative Tweets
Tweets are more likely to be shortened or informally written than texts from a formal genre such as web documents. Thus, a prior filtering step can clean up the set of tweets and improve the ranking quality. We observed that numerous non-informative tweets share common characteristics, which help infer patterns to clean up the set of tweets. In our filtering method, we define several patterns to capture the characteristics of a non-informative tweet, i.e., very short tweets without a complementary URL, tweets with first-person pronouns, or informal tweets containing slang words [106]. These features have been shown to be effective in previous work on tweet ranking and information credibility [66], [67], [70]. Our filtering component accurately filters out non-informative tweets, achieving 96.59% precision.

(Footnote: We rely on page titles, but the approach could be extended straightforwardly to the whole content of web documents.)
Figure 3.2: Overview of Tri-HITS.
3.2.3 Initializing Ranking Scores
Initializing scores for tweets and web documents. For a set of tweets $T$, we first construct an undirected, weighted graph $G = (V, E)$. After removing stopwords and punctuation, the bag of words of each tweet $t_i$ is represented as a vertex $v_i \in V$, and the weight of the edge between two tweets is the cosine similarity of their TF-IDF representations. Then, we use TextRank to compute initial scores. The same approach is used to initialize ranking scores for web documents.

Initializing user credibility scores. Based on Hypothesis 1, we define two approaches to compute initial user credibility scores. First, we construct a user network based on retweets, replies, and user mentions as in [67]. This results in a directed, weighted graph $G_d = (V, E)$, where $V$ is the set of users and $E$ is the set of directed edges. A directed edge exists from $u_i$ to $u_j$ if user $u_i$ interacts with $u_j$ (i.e., mentions, retweets, or replies to $u_j$). The weight of the edge is defined as $N_{ij}$, the number of such interactions. On this graph, we use TextRank to compute initial user credibility scores.
In addition, we also use the Bayesian ranking approach [72], [73], which considers the credibility scores of tweets and users simultaneously, based on Tweet-User networks. We are given a set of users $U = \{u_1, u_2, ..., u_m\}$ and a set of claims $C = \{c_1, c_2, ..., c_n\}$ that the users make (each claim corresponds to a cluster of tweets in this work). We also define a matrix $W^{cu}$ where $w^{cu}_{ji} = 1$ if user $u_i$ makes claim $c_j$, and zero otherwise. Let $u^t_i$ denote the proposition that "user $u_i$ speaks the truth" and $c^t_j$ the proposition that "claim $c_j$ is true". Also, let $P(u^t_i)$ and $P(u^t_i | W^{cu})$ be the prior and posterior probabilities that user $u_i$ speaks the truth. Similarly, $P(c^t_j)$ and $P(c^t_j | W^{cu})$ are the prior and posterior probabilities that claim $c_j$ is true. We define the credibility rank of a claim, $Rank(c_j)$, as the increase in the posterior probability that the claim is true, normalized by the prior probability $P(c^t_j)$. Similarly, the credibility rank of a user, $Rank(u_i)$, is the increase in the posterior probability that the user is credible, normalized by the prior probability $P(u^t_i)$. In other words:

$$Rank(c_j) = \frac{P(c^t_j | W^{cu}) - P(c^t_j)}{P(c^t_j)} \quad (3.1)$$

$$Rank(u_i) = \frac{P(u^t_i | W^{cu}) - P(u^t_i)}{P(u^t_i)} \quad (3.2)$$

In our previous work, we showed that the following relations hold between the credibility rank of a claim $Rank(c_j)$ and that of a user $Rank(u_i)$:

$$Rank(c_j) = \sum_{k \in Users_j} Rank(u_k) \quad (3.3)$$

$$Rank(u_i) = \sum_{k \in Claims_i} Rank(c_k) \quad (3.4)$$

where $Users_j$ is the set of users who make claim $c_j$, and $Claims_i$ is the set of claims user $u_i$ makes. From the above, the credibility of sources and claims can be derived as:

$$P(c^t_j | W^{cu}) = p^t_a \left( Rank(c_j) + 1 \right) \quad (3.5)$$

$$P(u^t_i | W^{cu}) = p^t_s \left( Rank(u_i) + 1 \right) \quad (3.6)$$

where $p^t_a$ and $p^t_s$ are initialization constants: the ratio of true claims to total claims, and the ratio of credible users to total users, respectively.

Equation 3.6 is then used to compute initial user credibility scores as our second approach.
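Equations (3.3)-(3.4) suggest a simple mutual-reinforcement iteration between user and claim ranks. The sketch below implements only that recursion, with L1 normalization each round to keep the scores bounded; the full Bayesian derivation in [72], [73], including the priors in Equations (3.5)-(3.6), is more involved:

```python
def credibility_ranks(claims_of_user, iters=20):
    """Mutual reinforcement between users and claims: a claim's rank is the
    sum of its users' ranks (Eq. 3.3) and a user's rank is the sum of the
    ranks of the claims they make (Eq. 3.4)."""
    users = set(claims_of_user)
    claims = {c for cs in claims_of_user.values() for c in cs}
    rank_u = {u: 1.0 / len(users) for u in users}
    for _ in range(iters):
        rank_c = {c: 0.0 for c in claims}
        for u, cs in claims_of_user.items():
            for c in cs:
                rank_c[c] += rank_u[u]          # Eq. (3.3)
        rank_u = {u: sum(rank_c[c] for c in cs)
                  for u, cs in claims_of_user.items()}  # Eq. (3.4)
        # L1-normalize both vectors so scores stay comparable across rounds.
        zc = sum(rank_c.values()) or 1.0
        zu = sum(rank_u.values()) or 1.0
        rank_c = {c: r / zc for c, r in rank_c.items()}
        rank_u = {u: r / zu for u, r in rank_u.items()}
    return rank_u, rank_c
```

A claim supported by more users ends up with a higher rank, and a user who makes more well-supported claims ends up more credible.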
3.2.4 Constructing Heterogeneous Networks
Next, we describe the two types of networks we build as constituent parts of the heterogeneous networks.

Tweet-User networks. Based on Hypothesis 2, we expand the Tweet-User networks by inferring implicit tweet-user relations. If a user $u_i$ posted a set of tweets $T_i$ during a period of time, we say an implicit relation exists between $u_i$ and a tweet $t_j$ if the maximum cosine similarity between $t_j$ and any $t_i \in T_i$ is at least a threshold $\delta_{tu}$.

Web-Tweet networks. Given a set of tweets $T$ and a set of associated web documents $D$, we build a bipartite graph $G = (T \cup D, E)$, where an undirected edge with weight $w^{td}_{ij}$ is added when the cosine similarity between $t_i \in T$ and $d_j \in D$ is at least $\delta_{td}$. This approach creates cross-genre linkages between tweets and web documents about similar events (e.g., evacuation events).

We discuss the effects of the parameters $\delta_{td}$ and $\delta_{tu}$ in Section 3.3.2.
3.2.5 Iterative Propagation
We introduce a novel algorithm that incorporates both initial ranking scores and global evidence from heterogeneous networks, propagating ranking scores across the networks iteratively. Our algorithm is an extension of Co-HITS [18], which we introduced in detail in Chapter 2.

The problem with Co-HITS in our setting is the transition probability. As mentioned before, we choose cosine similarity as the weight of the edge between two vertices, yielding a similarity matrix $W$ in which each entry $w_{ij}$ is the similarity between vertices $u_i$ and $v_j$. Although the transition probability is a natural normalization of the weight between two vertices, it may not be suitable for a similarity matrix: the original similarities between objects are already normalized, so a further normalization from the similarity matrix to a transition matrix may weaken or damage the inherent meaning of the original similarities. For example, if a tweet $u_i$ is aligned with one and only one document $v_j$, with a relatively low similarity weight, the transition probability $p^{uv}_{ij}$ is inflated to 1 after normalization. Similarly, some high similarity weights may be normalized to small transition probabilities.
By extending and adapting Co-HITS, we develop Tri-HITS to handle heterogeneous networks with three types of objects: users, tweets, and web documents. Given the similarity matrices $W^{dt}$ (between documents and tweets) and $W^{tu}$ (between tweets and users), and initial ranking scores $s^0(d)$, $s^0(t)$, and $s^0(u)$, we aim to refine the initial ranking scores and obtain final ranking scores $s(d)$, $s(t)$, and $s(u)$. Starting from the document scores $s(d)$, the update process considers both the initial score $s^0(d)$ and the propagation from connected tweets $s(t)$, which can be expressed as:

$$\hat{s}(d_i) = \sum_{j \in T} w^{td}_{ji}\, s(t_j), \qquad s(d_i) = (1 - \lambda_{td})\, s^0(d_i) + \lambda_{td} \frac{\hat{s}(d_i)}{\sum_i \hat{s}(d_i)}, \quad (3.7)$$

where $W^{td}$ is the transpose of $W^{dt}$, and $\lambda_{td} \in [0, 1]$ is the parameter that balances initial and propagated ranking scores. Tri-HITS normalizes the propagated ranking scores $\hat{s}(d_i)$, while Co-HITS propagates normalized ranking scores by using the transition matrix instead of the original similarity matrix, potentially weakening or damaging the inherent meaning of the original similarities. Similarly, we define the propagation from tweets to users as:

$$\hat{s}(u_k) = \sum_{j \in T} w^{tu}_{jk}\, s(t_j), \qquad s(u_k) = (1 - \lambda_{tu})\, s^0(u_k) + \lambda_{tu} \frac{\hat{s}(u_k)}{\sum_k \hat{s}(u_k)}. \quad (3.8)$$
Each tweet score $s(t_j)$ may be influenced by propagation from both documents and users:

$$\hat{s}_d(t_j) = \sum_{i \in D} w^{dt}_{ij}\, s(d_i), \qquad \hat{s}_u(t_j) = \sum_{k \in U} w^{ut}_{kj}\, s(u_k),$$
$$s(t_j) = (1 - \lambda_{dt} - \lambda_{ut})\, s^0(t_j) + \lambda_{dt} \frac{\hat{s}_d(t_j)}{\sum_j \hat{s}_d(t_j)} + \lambda_{ut} \frac{\hat{s}_u(t_j)}{\sum_j \hat{s}_u(t_j)}, \quad (3.9)$$

where $W^{ut}$ is the transpose of $W^{tu}$, and $\lambda_{dt}$ and $\lambda_{ut}$ are parameters that balance initial and propagated ranking scores. The $\lambda$ parameters define the networks being considered: (i) when $\lambda_{dt}$ is set to 0, only Tweet-User networks are considered (Method 3 in Table 3.1); (ii) when $\lambda_{ut}$ is set to 0, only Web-Tweet networks are considered (Method 4); (iii) when both $\lambda_{dt}$ and $\lambda_{ut}$ are non-zero, the entire heterogeneous Web-Tweet-User network is considered (Method 5). For methods relying on bipartite graphs, we speak of one-step propagation when the propagation is performed in a single direction, and of two-step propagation when it is performed in both directions. The choice between one-step and two-step propagation is controlled by the $\lambda$ parameters.
Model Convergence Proof: From Equation (3.7), assuming $\lambda_{td} > 0$ (the ranking scores $s(d)$ for web documents would not change if $\lambda_{td} = 0$), we get:

$$\bar{s}(d_i) = \frac{1}{\lambda_{td}} \left[ s(d_i) - (1 - \lambda_{td})\, s^0(d_i) \right] = \frac{\hat{s}(d_i)}{\sum_i \hat{s}(d_i)}. \quad (3.10)$$

The normalized score $\bar{s}(d_i)$ is analogous to the normalized authority or hub scores defined in HITS [46], the only difference being the choice of vector norm. Kleinberg proved that such normalized scores converge as the iterative procedure continues, from which the convergence of the ranking scores $s(d)$ for web documents follows. The same argument proves the convergence of the ranking scores for tweets and users.
Algorithm 1 summarizes Tri-HITS.
3.2.6 Redundancy Removal
Since a list of top-ranked tweets might contain redundant information, diversity is
an important factor to be considered. Diversity has been previously considered, not only
Input: A set of tweets (T) and users (U) on a given topic.
Output: Ranking scores (S_t) for T.
1: Use the rule-based method to filter out noisy tweets (remaining tweets T posted by users U);
2: Retrieve relevant web documents D for T;
3: Use TextRank and Bayesian ranking to compute initial ranking scores S^0_t for T and S^0_d for D, and initial credibility scores S^0_u for U;
4: Construct heterogeneous networks across T, U, and D;
5: k ← 0, diff ← 10e6;
6: while k < MaxIteration and diff > MinThreshold do
7:    Use Eq. (3.9) to compute S^{k+1}_t;
8:    Use Eq. (3.8) to compute S^{k+1}_u;
9:    Use Eq. (3.7) to compute S^{k+1}_d;
10:   Normalize S^{k+1}_t, S^{k+1}_d, and S^{k+1}_u;
11:   diff ← Σ |S^{k+1}_t − S^k_t|;
12:   k ← k + 1
13: end while

Algorithm 1: Tri-HITS: tweet ranking using heterogeneous networks.
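The core update loop of Tri-HITS (Equations (3.7)-(3.9)) can be sketched as follows. This is a simplified Python version over sparse edge-weight dictionaries; the default λ values reflect the best settings reported later in the experiments, the convergence test is replaced by a fixed iteration count, and the filtering, retrieval, and initialization steps are assumed to have already produced the inputs:

```python
def tri_hits(w_dt, w_tu, s0_d, s0_t, s0_u,
             lam_td=0.2, lam_tu=0.6, lam_dt=0.4, lam_ut=0.2, iters=30):
    """Simplified Tri-HITS update loop (Eqs. 3.7-3.9). w_dt[(d, t)] and
    w_tu[(t, u)] hold cosine-similarity edge weights; s0_* are initial scores."""
    def normalize(s):
        z = sum(s.values()) or 1.0
        return {k: v / z for k, v in s.items()}

    s_d, s_t, s_u = dict(s0_d), dict(s0_t), dict(s0_u)
    for _ in range(iters):
        # Eq. (3.9): tweets blend initial scores with document and user evidence.
        prop_d = normalize({t: sum(w * s_d[d] for (d, tt), w in w_dt.items() if tt == t)
                            for t in s_t})
        prop_u = normalize({t: sum(w * s_u[u] for (tt, u), w in w_tu.items() if tt == t)
                            for t in s_t})
        s_t = {t: (1 - lam_dt - lam_ut) * s0_t[t]
               + lam_dt * prop_d[t] + lam_ut * prop_u[t] for t in s_t}
        # Eq. (3.8): users receive evidence from the tweets they relate to.
        prop = normalize({u: sum(w * s_t[t] for (t, uu), w in w_tu.items() if uu == u)
                          for u in s_u})
        s_u = {u: (1 - lam_tu) * s0_u[u] + lam_tu * prop[u] for u in s_u}
        # Eq. (3.7): documents receive evidence from aligned tweets.
        prop = normalize({d: sum(w * s_t[t] for (dd, t), w in w_dt.items() if dd == d)
                          for d in s_d})
        s_d = {d: (1 - lam_td) * s0_d[d] + lam_td * prop[d] for d in s_d}
    return s_d, s_t, s_u
```

On a toy network, a tweet strongly aligned with an informative web document ends up ranked above an otherwise identical tweet with no document support.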
for information retrieval [107], but also for multi-document summarization [108]. Since users on Twitter may tweet similar information obliviously, and retweet and reply to others' tweets, redundancy has been shown to be a pervasive phenomenon [109]. This issue has not been considered in previous work on tweet ranking [67], [68]. In this work, we perform a redundancy removal step to diversify the top-ranked tweets. To do so, we adopt the widely used greedy procedure [107], [108] and apply redundancy removal on top of Tri-HITS. Based on the initial ranking of each approach, the tweet $t_i$ in position $i$ is pruned when its cosine similarity with some higher-ranked tweet $t_j \in \{t_1, ..., t_{i-1}\}$ is at least a predefined threshold $\delta_{red}$.
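The greedy diversification step amounts to a single scan over the ranked list. A minimal sketch, where `sim` stands for any pairwise similarity function (cosine similarity over TF-IDF vectors in our setting; the toy word-overlap function used to exercise it below is not part of the method):

```python
def remove_redundant(ranked_tweets, sim, threshold=0.6):
    """Greedy diversification: scan tweets in ranked order and drop any
    tweet whose similarity to an already-kept tweet meets the threshold."""
    kept = []
    for t in ranked_tweets:
        if all(sim(t, k) < threshold for k in kept):
            kept.append(t)
    return kept
```

Because higher-ranked tweets are scanned first, the procedure always keeps the most informative member of each group of near-duplicates.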
3.3 Experiments
Next, we present the experiment settings and analyze the methods shown in Table 3.1.
3.3.1 Data and Evaluation Metric
We use tweets on Hurricane Irene from August 26 to September 2, 2011 for our experiments. Using the query terms hurricane or irene to monitor tweets, we collected
(Footnote: We choose $\delta_{red} = 0.6$ as the threshold, obtained from our empirical studies with values from 0.1 to 1.0 on the development set.)
Table 3.1: Description of methods (methods with * use the Bayesian approach to initialize user credibility scores).

Method — Description — Hypotheses
1. Baseline — TextRank based on tweet-tweet networks. —
2. 1+Filtering — Baseline with filtering included. —
3. 2+Tweet-User* — Propagation on explicit and implicit Tweet-User networks. — 1 and 2
4. 2+Web-Tweet — Propagation on Web-Tweet networks. — 3
5. 3+4 Web-Tweet-User* — Propagation on Web-Tweet-User networks. — all
176,014 tweets posted by 139,136 users within that timeframe. For evaluation purposes,
we segment the tweets into 153 hours with an average of 1,150 tweets in each hour.
The quality of a tweet was judged on a 5-star Likert scale, according to the informativeness and readability of the content. Tweets with grade 5 are the most informative, while tweets with label 1 are the least informative. Two basic criteria judge the informativeness of a tweet: (1) Is the tweet likely to be news? (2) Does the tweet include information that a general audience will be concerned about during an event?

- Tweets with label 5 are very informative and have good readability. They can be used as news titles directly.
  Example: "AP) - NYC Mayor Michael Bloomberg has ordered mandatory evacuations for residents in low-lying coastal areas ahead of Hurricane Irene."
- Tweets with label 4 are informative and have good readability.
  Example: "Patch Storm Tracker: Follow the track of Hurricane Irene here. http://t.co/iSI8kzL"
- Tweets with label 3 are informative but readability is not good.
  Example: "'Prayer for the US RT @etharkamal: RT @guardiannews: Hurricane Irene hits New York – live updates http://t.co/t0ZHFqB"
- Tweets with label 2 can provide some limited information.
  Example: "About to leave for school and hurricane Irene decides to hit #whatthefuck"
- Tweets with label 1 cannot provide any useful information.
  Example: "Me, Myself, and Hurricane Irene."

Figure 3.3: Annotation guideline for tweet ranking.
We randomly chose tweets from three hours to be manually annotated as our reference. This subset contains 3,460 tweets posted on different days: August 27, 2011, August 28, 2011, and September 1, 2011. Following the annotation guidelines defined by [68], two annotators independently assigned each tweet a grade on a 5-star Likert scale. The
Table 3.2: Tweet distribution by grade.

Grade:   5    4    3    2    1
Hour 1:  65   48   93   119  847
Hour 2:  135  159  255  164  458
Hour 3:  129  102  162  123  602
annotation guideline is shown in Figure 3.3. Tweets with grade 5 are the most informative, while tweets with grade 1 are the least informative. When the label difference between the two annotators was 1, the lower grade was selected. When the label difference was greater than 1, those tweets were re-annotated until the label difference did not exceed 1. Table 3.2 shows the distribution of grades for each of the three hours of tweets.
To evaluate tweet ranking, we rely on three-fold cross-validation using nDCG as the measure [110], which considers both the informativeness and the position of a tweet:

$$nDCG(\Omega, k) = \frac{1}{|\Omega|} \sum_{i=1}^{|\Omega|} \frac{DCG_{ik}}{IDCG_{ik}}, \qquad DCG_{ik} = \sum_{j=1}^{k} \frac{2^{rel_{ij}} - 1}{\log(1 + j)},$$

where $\Omega$ is the set of documents in the test set (each document corresponding to an hour of tweets in our case), $rel_{ij}$ is the human-annotated label of tweet $j$ in document $i$, and $IDCG_{ik}$ is the DCG score of the ideal ranking. The average nDCG score for the top $k$ tweets is $Avg@k = \sum_{i=1}^{k} nDCG(\Omega, i)/k$. To favor diversity among the top-ranked tweets, redundant tweets are penalized, lowering the final score.
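The per-ranking DCG and nDCG computation is a short function. A minimal sketch taking the relevance grades of tweets in system-ranked order; since `j` is 0-based in code, the discount $\log(1 + j)$ of the formula becomes `log(2 + j)` (the logarithm base cancels in the DCG/IDCG ratio):

```python
from math import log

def dcg(rels, k):
    """DCG over the top-k relevance grades, in ranked order."""
    return sum((2 ** rels[j] - 1) / log(2 + j)
               for j in range(min(k, len(rels))))

def ndcg(rels, k):
    """Normalized DCG: DCG of the given ranking over the ideal DCG,
    obtained by sorting the grades in descending order."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal else 0.0
```

A perfectly ordered ranking scores exactly 1.0; any ranking that places a lower-graded tweet above a higher-graded one scores strictly less.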
3.3.2 Effect of Parameters
We study the impact of the different parameters on the training set. Due to the lack of space, we present only the most representative figures. For TextRank, we explore δ_tt values from 0 to 1. For the enhanced approaches, we first perform one-step propagation of ranking scores from web documents to tweets, considering all pairs of δ_td and λ_dt from 0 to 1 with a step of 0.1. For each δ_td, the corresponding λ_dt and the best average nDCG scores for the top 10 and 100 tweets are shown in Figure 3.4(a). We notice that when both the initial tweet ranking scores and the ranking scores propagated from web documents are considered (i.e., δ_td is set from 0 to 0.9 and λ_dt > 0), the ranking quality outperforms that obtained by considering only the initial ranking scores of tweets (i.e., δ_td = 1).

Secondly, for the ranking performance of double-step ranking score propagation, we set δ_td = 0.1, λ_dt = 0.4 and test λ_td from 0 to 1. Figure 3.4(b) shows an encouraging improvement in ranking quality, which is also more stable than the baseline and one-step propagation. This suggests that two-step propagation provides mutual improvement in ranking quality: the ranking of web documents may itself be refined using tweet and user evidence, thanks to the large volume and synchrony of tweeting [109]. Here, λ_td = 0.2 yields the best performance. The same process is followed for Tweet-User networks, finding the best performance at δ_tu = 0.1, λ_ut = 0.2, and λ_tu = 0.6.

When validating on the test set, Method 4, based on Web-Tweet networks, outperforms Method 3, which relies on Tweet-User networks. Therefore, for Web-Tweet-User networks, we keep the above values and explore λ_ut values from 0 to 0.6 (i.e., 1 − λ_dt). Figure 3.4(c) shows that integrating web documents, tweets and users improves the ranking quality over both Web-Tweet networks and Tweet-User networks.
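The one-step propagation explored above can be sketched as follows. This is an illustrative simplification under stated assumptions (the function name, the normalization step, and the matrix layout are ours), not the exact Tri-HITS update, which is defined earlier in this chapter:

```python
import numpy as np

def one_step_propagation(s_t0, s_d, W_dt, lam_dt):
    """Illustrative one-step propagation of ranking scores from web
    documents to tweets: the new tweet score mixes the initial score
    with scores propagated over doc-tweet similarity links.
    s_t0: initial tweet scores; s_d: document scores;
    W_dt: doc-by-tweet similarity matrix; lam_dt: propagation weight."""
    prop = W_dt.T @ s_d                   # scores flowing in from documents
    if prop.sum() > 0:
        prop = prop / prop.sum()          # keep scores on a common scale
    return (1.0 - lam_dt) * s_t0 + lam_dt * prop
```

With lam_dt = 0 the initial scores are returned unchanged, matching the δ_td = 1 / λ_dt = 0 baseline discussed above.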
Figure 3.4: Effect of parameters on average nDCG (Avg@10 and Avg@100): (a) δ_td and λ_dt for Web-Tweet networks; (b) λ_td for Web-Tweet networks (δ_td = 0.1, λ_dt = 0.4); (c) λ_ut for Web-Tweet-User networks (δ_tu = 0.1, δ_td = 0.1, λ_tu = 0.6, λ_dt = 0.4, λ_td = 0.2).
3.3.3 Performance and Analysis
Figure 3.5 shows the performance of the ranking methods. The performance gain from Method 1 to Method 2 shows the need for filtering short and informal tweets. In this case, filtering reduced the set from 3,460 to 1,765 tweets (a reduction of roughly 49%). Table 3.3 shows the distribution of labels for the filtered tweets: a great majority of 91.75% had been annotated as 1, while only 0.11% had been annotated as 5.

Figure 3.5: Performance comparison of ranking methods (nDCG@n, n = 1, ..., 10): 1. TextRank; 2. 1+Filtering; 3. 2+Tweet-User; 4. 2+Web-Tweet; 5. Web-Tweet-User (3+4).
Methods 3, 4 and 5, which integrate heterogeneous networks after filtering, out-
perform the baseline TextRank. When tweets are aligned with web documents (Method
4), the ranking quality improves significantly, showing that web documents can help identify informative tweets by adding support from a formal genre. The fact that tweets with low initial ranking scores are aligned with web documents helps improve their ranking positions
(Hypothesis 3). For example, the ranking of the tweet “Hurricane Irene: City by City
Forecasts http://t.co/x1t122A” is improved compared to TextRank, helped by the fact that
10 retrieved web documents are about this topic.
Integrating users (Method 5) further improves performance. This indicates that
Web-Tweet and Tweet-User networks may complement each other in improving rank-
ing. For example, the tweet “A social-media guide to dealing with Hurricane Irene
http://t.co/0XBEnEJ” is not top-ranked when only using Web-Tweet networks, since none
of the retrieved web documents is related to it. However, similar tweets appear with high
frequency in the tweet set. Hence, inferring implicit tweet-user relations and propagating
information through the tweet-user network also improves the ranking.
Figure 3.6(a) shows that inferring implicit tweet-user relationships outperforms using only explicit tweet-user relations, especially for top positions.

Table 3.3: Grade distributions for filtered tweets.

Grade       5      4      3      2      1
Percentage  0.11%  0.17%  3.13%  4.84%  91.75%

Figure 3.6: (a) Explicit vs. inferred implicit Tweet-User relations to construct Tweet-User networks; (b) TextRank vs. one-step propagation on explicit Tweet-User networks using the Bayesian approach and retweet/reply/user-mention relations.

Looking into lower positions, we find that redundancy removal performs better when only explicit relations are used. However, both approaches still perform similarly in positions 5–10. This
corroborates the synchronous behavior of users as an indicator of informative contents
(Hypothesis 2). Since it is likely that a large set of users tweet only once within a short timeframe, limiting ourselves to explicit tweet-user relations results in sparse links, and the ranking quality cannot be bootstrapped. Interestingly, inferring implicit tweet-user relations can
capture synchronous behavior of users, which indicates subjects that users are concerned
about.
Figure 3.6(b) shows that initializing user credibility scores with the Bayesian ap-
proach and performing one-step ranking score propagation from users to tweets based
on the explicit tweet-user networks also outperforms TextRank. This corroborates our
hypothesis that credible users are more likely to post informative tweets (Hypothesis 1).
In addition, using only retweets, replies, and user mentions to compute initial user ranking scores does not improve performance over TextRank. The reason is that for an event of general interest like Hurricane Irene, users from different communities rarely interact with each other.
Finally, Figure 3.7 shows that Tri-HITS significantly outperforms Co-HITS over bipartite graphs, with the only exception of position n = 2 for the Web-Tweet network.

Figure 3.7: Co-HITS vs. Tri-HITS on (a) Web-Tweet networks, (b) Tweet-User networks.
This corroborates that normalizing the similarity matrix weakens semantic relations be-
tween different objects, and that capturing inherent meanings of cross-genre linkages is
crucial for information propagation.
3.3.4 Remaining Challenges
Tri-HITS shows encouraging improvements in ranking quality with respect to a
state-of-the-art model like TextRank. However, there are still some issues to be addressed
for further improvements.
(i) Topically-relevant tweet identification. We tracked tweets containing the key-
words “Hurricane” and “Irene”. Using such a query to follow tweets might also return
tweets that are not related to the event being followed. This may occur either because
the terms are ambiguous, or because of spam being injected into trending conversations
to gain visibility. For example, the tweet “Hurricane Kitty: http://t.co/cdIexE3” is an
advertisement, which is not topically related to Irene.
(ii) Non-informative tweet identification. Our rule-based filtering component achieves high precision (96.59%) on the identification of non-informative tweets, but its recall of 70.7% means that a number of non-informative tweets are still missed. Performing deeper linguistic analysis,
such as exploring subjectivity, might help clean up the tweet set by identifying additional
non-informative tweets. For example, an analysis of writing styles would help identify
the tweet “Hurricane names hurricane names http://t.co/iisc7UY ;)” as informal because
it contains repeated phrases. And the tweet “My favorite parts of Hurricane coverage is
when the weathercasters stand in those 100 MPH winds right on the beach. Good stuff.”
is clearly subjective commentary that may entertain but will not meet the general interest
of people involved with or tracking the event.
(iii) Deep semantic analysis of the content. Users may rely on distinct terms to
refer to the same concept. More extensive semantic analyses of text could help iden-
tify those terms, possibly enhancing the propagation process. For example, information
extraction tools can be used to extract entities and events, and their coreferential rela-
tions, such as “NYC” and “New York City”, or “MTA closed” and “subway shutting
down”. Likewise, existing dictionaries such as WordNet [111] can be utilized to mine
synonym/hypernym/hyponym relations, and Brown clusters [112] can be explored to
mine other types of relations.
3.4 Summary

In this chapter,
(1) We have introduced Tri-HITS, a novel propagation model that makes use of het-
erogeneous networks composed of tweets, users, and web documents to rank tweets based
on informativeness. This approach can help filter noisy and uninformative information for
end users, and alleviate the “information noiseness” problem in microblogging.
(2) We have conducted cross-genre information analysis between the formal genre
of web documents and the informal genre of microblogs.
(3) We have inferred more social network relations in order to capture the collective
wisdom of the crowd and extract more effective evidence from social networks.
(4) We have studied the integration of different genres to capture the discrepancy between “tweet - user” and “tweet - web” networks.
CHAPTER 4
Microblog Wikification
In Chapter 3, we proposed to identify informative microblogs to alleviate the “information noiseness” problem. However, information filtering alone cannot enrich short microblogs with rich and clean background knowledge. In this chapter, we introduce our collective inference Wikification approach and deep semantic relatedness model to enhance microblog wikification, so that we can enrich short microblogs with background knowledge from a knowledge base. Our collective inference model is based on semi-supervised graph regularization, which leverages both a small amount of labeled microblogs and a large amount of unlabeled microblogs. The deep semantic relatedness model is designed to enhance concept semantic relatedness measurement for topical coherence modeling.
4.1 Preliminaries

Concept and Concept Mention: We define a concept c as a Wikipedia article
(e.g., Atlanta Hawks), and a concept mention m as an n-gram from a specific tweet.
Each concept has a set of textual representation fields [92], including title (the title of the
article), sentence (the first sentence of the article), paragraph (the first paragraph of the
article), content (the entire content of the article), and anchor (the set of all anchor texts
with incoming links to the article).
Wikipedia Lexicon Construction: We first construct an offline lexicon with each
entry of the form ⟨m, {c_1, ..., c_k}⟩, where {c_1, ..., c_k} is the set of possible referent concepts for the mention m. Following previous work [82], [113], [114], we extract the possible
mentions for a given concept c using the following resources: the title of c; the aliases
appearing in the introduction and infoboxes of c (e.g., The Evergreen State is an alias
of Washington state); the titles of pages redirecting to c (e.g., State of Washington is a
Portions of this chapter previously appeared as: H. Huang, Y. Cao, X. Huang, H. Ji, and C.-Y. Lin, “Collective tweet wikification based on semi-supervised graph regularization,” in Proc. of the 52nd Annu. Meeting of the Assoc. for Comput. Linguist., Baltimore, MD, USA, 2014, pp. 380–390.
redirecting page of Washington (state)); the titles of the disambiguation pages containing
c; and all the anchor texts appearing in at least 5 pages with hyperlinks to c (e.g., WA
is a mention for the concept Washington (state) in the text “401 5th Ave N [[Seattle]], [[Washington (state)|WA]] 98109 USA”). We also propose three heuristic rules to extract
mentions (i.e., different combinations of the family name and given name for a person,
the headquarters of an organization, and the city name for a sports team).
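As a sketch, the lexicon construction above might look like the following, assuming the per-concept resources (title, aliases, redirect titles, anchor texts) have already been collected; the record format is hypothetical:

```python
from collections import defaultdict

def build_lexicon(concept_records):
    """Build the offline lexicon mapping each mention string to its set of
    possible referent concepts. Each record is a hypothetical tuple
    (title, aliases, redirect_titles, anchor_texts) gathered from Wikipedia."""
    lexicon = defaultdict(set)
    for title, aliases, redirects, anchors in concept_records:
        for mention in {title, *aliases, *redirects, *anchors}:
            lexicon[mention.lower()].add(title)
    return lexicon
```

In practice the anchor texts would additionally be filtered by the at-least-5-pages threshold described above before being added.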
Concept Mention Extraction: Based on the constructed lexicon, we then consider
all n-grams of size up to n (n = 7 in this thesis) as concept mention candidates if their entries
in the lexicon are not empty. We first segment @usernames and #hashtags into regular
tokens (e.g., @amandapalmer is segmented as amanda palmer and #WorldWaterDay is
split as World Water Day) using the approach proposed by [115]. Segmentation assists
finding concept candidates for these non-regular mentions.
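The candidate extraction step can then be sketched as follows, assuming tokens have already been segmented and lower-cased as described (the function name is illustrative):

```python
def mention_candidates(tokens, lexicon, max_n=7):
    """Enumerate every n-gram (n <= max_n) whose lexicon entry is
    non-empty, returning (mention, candidate concepts) pairs."""
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            mention = " ".join(tokens[i:i + n]).lower()
            concepts = lexicon.get(mention)
            if concepts:
                candidates.append((mention, concepts))
    return candidates
```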
4.2 Principles and Approach Overview
Figure 4.1: Approach overview. Labeled and unlabeled tweets, together with the knowledge base (Wikipedia), feed Wikipedia lexicon construction and concept mention and concept candidate extraction; a relational graph is then constructed from local compatibility (local features, cosine similarity), coreference (meta path, mention similarity), and semantic relatedness (meta path, concept semantic relatedness) relations, and semi-supervised graph regularization outputs ⟨mention, concept⟩ pairs.
4.2.1 Principles
A single tweet may not provide enough evidence to identify prominent mentions
and infer their correct referent concepts due to the lack of contextual information. To
tackle this problem, we propose incorporating global evidence from multiple tweets and
performing collective inference for both mention identification and disambiguation. We
first introduce the following three principles that our approach relies on.
Principle 1 (Local compatibility): Two pairs ⟨m, c⟩ with strong local compatibility tend to have similar labels. Mentions and their correct referent concepts usually tend to share a set of characteristics such as string similarity between m and c (e.g., ⟨Chicago, Chicago⟩ and ⟨Facebook, Facebook⟩). We define local compatibility to model this set of characteristics.
Principle 2 (Coreference): Two coreferential mentions should be linked to the
same concept. For example, if we know “nc” and “North Carolina” are coreferential,
then they should both be linked to North Carolina.
Principle 3 (Semantic Relatedness): Two highly semantically-related mentions
are more likely to be linked to two highly semantically-related concepts. For instance,
when “Sweet 16” and “Hawks” often appear together within relevant contexts, they can
be reliably linked to two basketball-related concepts NCAA Men’s Division I Basketball
Championship and Atlanta Hawks, respectively.
4.2.2 Approach Overview
Given a set of tweets ⟨t_1, ..., t_|T|⟩, our system first generates a set of candidate concept mentions, and then extracts a set of candidate concept referents for each mention based on the Wikipedia lexicon. Given a pair ⟨m, c⟩ of a mention and its candidate referent concept, the remaining task of wikification is to assign either a positive label if m
should be selected as a prominently linkable mention and c is its correct referent concept,
or otherwise a negative label. The label assignment is obtained by our semi-supervised
graph regularization framework based on a relational graph, which is constructed from
local compatibility, coreference, and semantic relatedness relations. An overview of our approach is illustrated in Figure 4.1.
4.3 A Deep Semantic Relatedness Model

In order to construct the relational graph, we first introduce our newly proposed
deep semantic relatedness model (DSRM) for more accurate concept relatedness mea-
surement and more effective topical coherence modeling. In order to learn low dimen-
sional representations (i.e., distributed representations) that capture latent semantics of
concepts, we directly encode heterogeneous types of semantic knowledge from semantic
knowledge graphs (KGs) including structured knowledge (i.e., concept facts and con-
cept types) and textual knowledge (i.e., concept descriptions) into deep neural networks
(DNN). By automatically mining a large amount of training instances from KGs and
Wikipedia, we then train the neural network models discriminatively in a supervised fash-
ion such that the distances between semantically-related concepts are minimized in a la-
tent space. In this way, the neural networks can be optimized directly for the concept
relatedness task and capture semantics in this dimension.
Our proposed DSRM is related to the work in [40]. We extend their work to large-scale semantic KGs by leveraging both structured and contextual knowledge for semantic representation learning of concepts. We then apply the approach to model topical coherence for concept disambiguation, as opposed to Web search. He et al. [116] first explored deep learning techniques to measure local context similarity for concept disambiguation. Our work complements theirs, since we aim to measure entity relatedness for global topical coherence modeling.
4.3.1 The DSRM Architecture
The architecture of the DSRM is shown in Figure 4.2. In order to compute the semantic relatedness score between a given pair of concepts c_i and c_j (e.g., “Miami Heat”
and “National Basketball Association”), the DSRM first maps each concept into a low-
dimensional numerical feature vector (i.e., distributed representations) through a hierar-
chical architecture. The hierarchical architecture consists of (1) a feature vector layer that
represents a concept with heterogeneous types of knowledge from semantic KGs, (2) A
word hashing layer that transforms a feature vector with high dimension (e.g., 5m) into
a vector with a relatively small dimension (e.g., 105k), and (3) multiple semantic layers that extract hidden semantic features through non-linear projections. After obtaining the distributed representations y_i and y_j for c_i and c_j, we use them to measure concept semantic relatedness.

Figure 4.2: The DSRM architecture. Each concept’s feature vector, built from its description D, connected concepts C, relations R, and concept types CT, is reduced by the word hashing layer to 105k dimensions (50k + 3.2k + 1.6k + 50k), then passed through multi-layer non-linear projections with 300 units per layer; the resulting semantic vectors are compared via cosine similarity to produce SR(c_i, c_j).
Feature Vector Layer: The knowledge representations of a concept from KGs
are shown in the bottom layer (Feature Vector). In particular, we leverage four types of
knowledge from KGs to represent each concept c, described in detail as follows:
• Connected Concepts C: the set of connected concepts of c. For instance, as shown
in Figure 1.5, C = {“Erik Spoelstra”, “Miami”, “NBA”, “Dwyane Wade”} for “Miami
Heat”.
• Relations R: the set of relations that c holds. For example, R = {“Coach”, “Location”,
“Founded”, “Member”, “Roster”} for “Miami Heat” in Figure 1.5.
• Concept Types CT : the set of attached concept types for c. CT = {“professional
sports team”} for “Miami Heat”.
• Concept Description D: the textual description of a concept. The description provides
a concise summary of salient information of c. For instance, from the description of
“Miami Heat”, we can learn about its important information such as role, location, and
founder.
Word Hashing Layer: Following [40], we adopt the letter-n-gram based word
hashing technique to reduce the dimensionality of the bag-of-word term vectors. This is
because the vocabulary size of large-scale KGs is often very large (e.g., Wikipedia contains more than 4 million concepts and over 1 million unique words), which makes the “one-hot” vector representation very expensive. However, the word hashing techniques can
dramatically reduce the vector dimensionality to a constant small size (e.g., 50k). It also
can handle the out-of-vocabulary words and newly created concepts. The specific ap-
proach we use is based on letter tri-grams. For instance, the word “cat” can be split into
letter tri-grams (#ca, cat, at#) by first adding start- and end- marks to the word (e.g.,
#cat#). We then use a vector of letter tri-grams to represent the word.
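The letter tri-gram hashing described above is straightforward to implement:

```python
def letter_trigrams(word):
    """Split a word into letter tri-grams after adding start/end marks,
    e.g. "cat" -> "#cat#" -> ["#ca", "cat", "at#"]."""
    marked = "#" + word + "#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]
```

A word's feature vector is then the bag of its tri-grams over the fixed tri-gram vocabulary, which is what keeps the dimensionality constant regardless of vocabulary growth.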
For each concept, we generate its surface form and represent it as bag-of-words.
And then the word hashing layer transforms each word into a letter tri-gram vector. Sim-
ilarly, we represent the concept description of a concept as bag-of-words, which are then
transformed by the word hashing layer into letter tri-gram vectors.
We do not adopt word hashing techniques to break down relations and concept
types because their sizes are relatively small (i.e., 3.2k relations and 1.6k concept types).
Thus each relation or concept type is represented as a binary “one-hot” vector (e.g., [0, ..., 0, 1, 0, ..., 0]).
Semantic Layers: On top of the word hashing layer, we have multiple hidden
layers to perform non-linear transformations, which allow the DNN to extract hidden
semantic features by performing back propagation with respect to an objective function
designed for the concept relatedness task. Finally, we can obtain the semantic representation y for c from the top layer. Denoting l_1 as the output vector of the word hashing layer, y as the output semantic vector of c, N as the number of layers, l_i (i = 2, ..., N − 1) as the output vectors of the intermediate hidden layers, and W_i and b_i as the weight matrix and bias term of the i-th layer, respectively, we can formally present the DSRM as:

l_i = f(W_i l_{i−1} + b_i),  i = 2, ..., N − 1,
y = f(W_N l_{N−1} + b_N),

where we use tanh as the activation function at the output layer and the intermediate hidden layers: f(x) = tanh(x) = (1 − e^{−2x}) / (1 + e^{−2x}).
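A minimal sketch of the forward pass and the relatedness computation, assuming the weight matrices and biases are already trained (the function names are ours):

```python
import numpy as np

def dsrm_forward(x, layers):
    """Forward pass through the semantic layers: l_i = tanh(W_i l_{i-1} + b_i).
    layers: list of (W, b) pairs; the last layer's output is the
    semantic representation y."""
    out = x
    for W, b in layers:
        out = np.tanh(W @ out + b)
    return out

def semantic_relatedness(y_i, y_j):
    # cosine similarity between the two semantic representations
    return float(y_i @ y_j / (np.linalg.norm(y_i) * np.linalg.norm(y_j)))
```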
Concept Semantic Relatedness Measurement: After obtaining the semantic representations for concepts c_i and c_j, we use cosine similarity to measure their relatedness as

SR_DSRM(c_i, c_j) = y_{c_i}^T y_{c_j} / (||y_{c_i}|| ||y_{c_j}||),

where y_{c_i} and y_{c_j} are the semantic representations of c_i and c_j, respectively.
4.3.2 Learning the DSRM
Training Data Mining: In order to train the DSRM which can capture semantics
specific to the concept relatedness task, we first automatically mine training data based on
KGs and Wikipedia anchor links. Beyond using linked concept pairs from KGs as positive
training instances, we also mine more training data (especially negative instances) from
Wikipedia. Suppose ti is an anchor text from a Wikipedia article, and it is linked to a
concept ci. And tj is an anchor text within � = 150 character window of ti, and cj is
its linked concept. Then we consider hci, cji as a positive training instance. To obtain
negative training instances for ci, we randomly sample 5 other candidate concepts of tj(denoted as ˆCj), and consider each hci, c0ji as a negative training instance for each c0j 2 ˆCj .
Similarly, we obtain negative training instances for cj . In this way, we finally obtain about
20 million positive training pairs and 200 million negative training pairs. By mining the
training instances automatically, we can train the DSRM in an unsupervised way and save
tremendous human annotation efforts. The disadvantages are that we cannot provide fine-grained annotations for more accurate model learning, and that the training data contains some noise.
Model Training: Following [39], [40], [117], we formulate the loss function as:

L(Λ) = − log ∏_{(c, c⁺)} P(c⁺ | c),

where Λ denotes the set of parameters of the DSRM, and c⁺ is a semantically-related concept of c. P(c_j | c_i) is the posterior probability of concept c_j given c_i through the softmax function:

P(c_j | c_i) = exp(γ · SR_DSRM(c_i, c_j)) / Σ_{c′ ∈ C_i} exp(γ · SR_DSRM(c_i, c′)),

where γ is a smoothing parameter determined on a held-out set, and C_i is the set of related and non-related concepts of c_i in the training data.
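The posterior and the loss can be sketched as follows for a batch of mined instances; the data layout (per-instance candidate scores plus the index of the positive concept) is an assumption for illustration:

```python
import math

def posterior(sr_scores, j, gamma):
    """P(c_j | c_i): softmax over gamma-scaled relatedness scores of the
    positive and sampled negative candidates C_i."""
    exps = [math.exp(gamma * s) for s in sr_scores]
    return exps[j] / sum(exps)

def dsrm_loss(instances, gamma):
    """-log prod P(c+ | c); each instance is (candidate scores, positive index)."""
    return -sum(math.log(posterior(scores, pos, gamma))
                for scores, pos in instances)
```

Minimizing this loss pushes the posterior of each positive pair toward 1, exactly the intuition stated below.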
To obtain the optimal solution, we need to minimize the above loss function. The
idea of the loss function is to ensure that the posterior probabilities of positive training
instances are higher than the negative ones. The model is trained using mini-batch based
stochastic gradient descent (SGD) [39],[40], and the training normally converges after 20
epochs in our experiments.
Implementation Details: In order to avoid over-fitting, we determine model parameters with cross validation by randomly splitting the mined concept pairs into two sets: a training set and a validation set. We set the number of hidden layers to 2 and the number of units in each hidden layer and the output layer to 300. Further gains have been observed by increasing the number of hidden layers to 2 or 3 in DNNs for many tasks such as
Web Search [40] and digit recognition [118]. But adding too many hidden layers (e.g., ≥ 4) can worsen the generalization performance, since over-fitting is more likely to occur [118]. Following [40], we initialize each weight matrix W_i, i = 2, ..., N − 1, with a uniform distribution:

W_i ∼ U[ −√(6 / (|l_{i−1}| + |l_i|)), √(6 / (|l_{i−1}| + |l_i|)) ],

where |l| is the size of the vector l.
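The initialization rule can be written directly (the function name and random-generator argument are illustrative):

```python
import numpy as np

def init_weight(fan_in, fan_out, rng):
    """W_i ~ U[-sqrt(6/(|l_{i-1}|+|l_i|)), +sqrt(6/(|l_{i-1}|+|l_i|))],
    i.e. a uniform distribution whose range shrinks with layer width."""
    r = (6.0 / (fan_in + fan_out)) ** 0.5
    return rng.uniform(-r, r, size=(fan_out, fan_in))
```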
During SGD optimization, we set the mini-batch size to 1,024 training instances. Model training takes roughly 72 hours on an NVidia Tesla K20 GPU machine.
4.4 Relational Graph Construction

We first construct the relational graph G = ⟨V, E⟩, where V = {v_1, ..., v_n} is a set of nodes and E = {e_1, ..., e_m} is a set of edges. Each v_i = ⟨m_i, c_i⟩ represents a tuple of a mention m_i and its referent concept candidate c_i. An edge is added between two nodes v_i and v_j if there is a proposed relation between them based on the three principles described in Section 4.2.1.
4.4.1 Local Compatibility
We first compute local compatibility (Principle 1) by considering a set of novel local
features to capture the importance and relevance of a mention m to a tweet t, as well as
the correctness of its linkage to a concept c. We have designed a number of features which
are similar to those commonly used in wikification and entity linking work [11],[92],[93].
Mention Features We define the following features based on information from
mentions.
• IDF_f(m) = log(|C| / df(m)), where |C| is the total number of concepts in Wikipedia, df(m) is the total number of concepts in which m occurs, and f indicates the field property, including title, content, and anchor.

• Keyphraseness(m) = |C_a(m)| / df(m), which measures how likely m is to be used as an anchor in Wikipedia, where C_a(m) is the set of concepts in which m appears as an anchor.

• LinkProb(m) = Σ_{c ∈ C_a(m)} count(m, c) / Σ_{c ∈ C} count(m, c), where count(m, c) indicates the number of occurrences of m in c.
• SNIL(m) and SNCL(m) to count the number of concepts that are equal to or
contain a sub-n-gram of m, respectively [92].
Concept Features The concept features are solely based on Wikipedia, including
the number of incoming and outgoing links for c, and the number of words and characters
in c.
Mention + Concept Features This set of features considers information from both
mention and concept:
• Prior popularity prior(m, c) = count(m, c) / Σ_{c′} count(m, c′), where count(m, c) measures the frequency of the anchor links from m to c in Wikipedia.

• TF_f(m, c) = count_f(m, c) / |f|, which measures the relative frequency of m in each field representation f of c, normalized by the length of f. The fields include title, sentence, paragraph, content and anchor.
• NCT (m, c), TCN(m, c), and TEN(m, c) to measure whether m contains the title
of c, whether the title of c contains m, and whether m is equal to the title of c,
respectively.
Context Features This set of features includes (i) context capitalization features, which indicate whether the current mention, the token before, and the token after are capitalized; and (ii) tf-idf based features, which include the dot product of the two word vectors v_c and v_t, and the average tf-idf value of the common items in v_c and v_t, where v_c and v_t are the top-100 tf-idf word vectors of c and t.
Local Compatibility Computation For each node v_i = ⟨m_i, c_i⟩, we collect its local features as a feature vector F_i = ⟨f_1, f_2, ..., f_d⟩. To avoid features with large numerical values dominating other features, the values of each feature are re-scaled using a feature standardization approach. Cosine similarity is then adopted to compute the local compatibility of two nodes and construct a k-nearest-neighbor (kNN) graph, where each node is connected to its k nearest neighboring nodes. We compute the weight matrix that represents the local compatibility relation as:

W^loc_ij = cosine(F_i, F_j) if j ∈ kNN(i), and W^loc_ij = 0 otherwise.
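A sketch of the W^loc construction, assuming the standardized feature vectors are stacked as rows of a matrix F (the function name is ours):

```python
import numpy as np

def local_compatibility_matrix(F, k):
    """W_loc[i][j] = cosine(F_i, F_j) if j is among the k nearest
    neighbors of i, and 0 otherwise. F: one feature vector per node."""
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
    S = Fn @ Fn.T                      # pairwise cosine similarities
    np.fill_diagonal(S, -np.inf)       # a node is not its own neighbor
    W = np.zeros_like(S)
    for i in range(len(S)):
        nearest = np.argsort(S[i])[-k:]
        W[i, nearest] = S[i, nearest]
    return W
```

Note the resulting matrix is not necessarily symmetric, since kNN membership is directional.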
4.4.2 Meta Path
Figure 4.3: Schema of the Twitter network: Mention, Tweet, User, and Hashtag objects connected by the links post/post⁻¹ and contain/contain⁻¹.
In this subsection, we introduce the meta paths we will use to detect coreference
(section 4.4.3) and semantic relatedness relations (section 4.4.4).
Recall that in Chapter 2 we introduced the concept of a meta path, which is a path
defined over a network and composed of a sequence of relations between different object
types [24]. In our experimental setting, we can construct a natural Twitter network sum-
marized by the network schema in Figure 4.3. The network contains four types of objects:
Mention (M), Tweet (T), User (U), and Hashtag (H). Tweets and mentions are connected
by links “contain” and “contained by” (denoted as “contain⁻¹”); users and tweets are connected by links “post” and “posted by” (denoted as “post⁻¹”); and tweets and #hashtags are connected by links “contain” and “contained by” (denoted as “contain⁻¹”).
We then define the following five types of meta paths to connect two mentions as:
• “M - T - M”,
• “M - T - U - T - M”,
• “M - T - H - T - M”,
• “M - T - U - T - M - T - H - T - M”,
• “M - T - H - T - M - T - U - T - M”.
Each meta path represents one particular semantic relation. For instance, the first three
paths express the explicit relations that two mentions are from the same tweet, posted by
the same user, and share the same #hashtag, respectively. The last two paths are constructed by concatenating the first three simple paths to express the implicit relation that two mentions co-occur with a third mention sharing either the same authorship or #hashtag. Such complicated paths can be exploited to detect more semantically-related mentions from richer contexts. For example, the relational link between “narita airport” and “Japan” would be missed without using the path “narita airport - t_1 - u_1 - t_2 - american - t_3 - h_1 - t_4 - Japan”, since the two mentions do not directly share any authorship or #hashtag.
4.4.3 Coreference
A coreference (Principle 2) usually occurs across multiple tweets due to the highly
redundant information in Twitter. To ensure high precision, we propose a simple yet
effective approach utilizing the rich social network relations in Twitter.
We consider two mentions mi and mj coreferential if either mi and mj share the
same surface form or one mention is an abbreviation of the other, and at least one meta
path exists between m_i and m_j. Then we define the weight matrix representing the coreferential relation as:

W^coref_ij = 1.0 if m_i and m_j are coreferential and c_i = c_j, and W^coref_ij = 0 otherwise.
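The coreference weight can be sketched as follows; the abbreviation and meta-path tests are passed in as predicates, since their implementations are described elsewhere (the function name is ours):

```python
def coref_weight(v_i, v_j, has_meta_path, is_abbreviation):
    """W_coref for two nodes v = (mention, concept): 1.0 iff the mentions
    are coreferential (same surface form or abbreviation, plus at least
    one meta path) and they refer to the same concept."""
    (m_i, c_i), (m_j, c_j) = v_i, v_j
    surface_match = m_i.lower() == m_j.lower() or is_abbreviation(m_i, m_j)
    if surface_match and has_meta_path(m_i, m_j) and c_i == c_j:
        return 1.0
    return 0.0
```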
4.4.4 Semantic Relatedness
Ensuring topical coherence (Principle 3) has been beneficial for wikification on
formal texts (e.g., News) by linking a set of semantically-related mentions to a set of
semantically-related concepts simultaneously [28], [32], [119]. However, the shortness of
a single tweet means that it may not provide enough topical clues. Therefore, it is im-
portant to extend this evidence to capture semantic relatedness information from multiple
tweets.
We define the semantic relatedness score between two mentions as SR(m_i, m_j) = 1.0 if at least one meta path exists between m_i and m_j, and SR(m_i, m_j) = 0 otherwise. Then we compute a weight matrix representing the semantic relatedness relation as:

W^rel_ij = SR(N_i, N_j) if SR(N_i, N_j) ≥ δ, and W^rel_ij = 0 otherwise,

where SR(N_i, N_j) = SR(m_i, m_j) × SR(c_i, c_j), SR(c_i, c_j) is given by a concept semantic relatedness model, and δ = 0.3 is optimized on a development set.
4.4.5 The Combined Relational Graph
Based on the above three weight matrices W^loc, W^coref, and W^rel, we obtain the combined graph G with weight matrix W, where W_ij = αW^loc_ij + βW^coref_ij + γW^rel_ij. Here α, β, and γ are three coefficients between 0 and 1 with the constraint that α + β + γ = 1; they control the contributions of the three relations in our semi-supervised graph regularization model. An example of the graph G is shown in Figure 4.4. Compared to the referent graph used in previous graph-based re-ranking approaches [28], [90], which considers each mention or concept as a node, our novel graph representation has two advantages: (i) it can easily incorporate more features related to both mentions and concepts; (ii) it is more appropriate for our graph-based semi-supervised model, since it is difficult to assign
55
hawks, Atlanta Hawks
uconn, Connecticut
Huskies
bucks, Milwaukee
Bucks
kemba walker, Kemba Walker
0.404
gators, Florida Gators
men's basketballnow, Now
days, Day
tonight, Tonight
0.932
0.7640.665
0.467
0.5630.538
0.447
Figure 4.4: A example of the relational graph constructed for the example tweets inFigure 1.3. Each node represents a pair of hm, ci, separated by a comma.The edge weight is obtained from the linear combination of the weightsof the three proposed relations. Not all mentions are included due to thespace limitations.
labels to a pair of mention and concept in the referent graph.
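As a concrete illustration, the linear combination above can be sketched as follows. This is a minimal NumPy sketch with names and example coefficients of our own choosing, not an implementation from the thesis:

```python
import numpy as np

def combine_relations(W_loc, W_coref, W_rel, alpha, beta, gamma):
    """Combine the three relation-specific weight matrices into the
    weight matrix W of the relational graph G. The coefficients are
    tuned on a development set subject to alpha + beta + gamma = 1."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return alpha * W_loc + beta * W_coref + gamma * W_rel
```

For example, `combine_relations(W_loc, W_coref, W_rel, 0.5, 0.3, 0.2)` yields the combined matrix W used by the regularization model.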
4.5 Semi-supervised Graph Regularization

Given the constructed relational graph with the weight matrix W and the label vector Y of all nodes, we assume the first l nodes are labeled as Y_l and the remaining u nodes (u = n − l) are initialized with labels Y^0_u. Then our goal is to refine Y^0_u and obtain the final label vector Y_u.
Intuitively, if two nodes are strongly connected, they tend to hold the same label.
We propose a novel semi-supervised graph regularization framework based on the graph-
based semi-supervised learning algorithm [33]:
    Q(Y) = μ ∑_{i=l+1}^{n} (y_i − y_i^0)^2 + (1/2) ∑_{i,j} W_ij (y_i − y_j)^2.
The first term is a loss function that incorporates the initial labels of unlabeled examples
into the model. In our method, we adopt prior popularity (section 4.4.1) to initialize the
labels of the unlabeled examples. The second term is a regularizer that smoothes the
refined labels over the constructed graph. µ is a regularization parameter that controls the
trade-off between initial labels and the consistency of labels on the graph. The goal of the
proposed framework is to ensure that the refined labels of unlabeled nodes are consistent
with their strongly connected nodes, as well as not too far away from their initial labels.
The above optimization problem can be solved directly since Q(Y) is convex [33], [36]. Let I be an identity matrix and D_W be a diagonal matrix with entries D_ii = ∑_j W_ij. We can split the weight matrix W into four blocks:

    W = [ W_ll  W_lu ]
        [ W_ul  W_uu ],

where W_mn is an m × n matrix; D_W is split similarly. We assume that the label vector Y_l of the labeled examples is fixed, so we only need to infer the refined label vector Y_u of the unlabeled examples. In order to minimize Q(Y), we need to find Y*_u such that
    ∂Q/∂Y_u |_{Y_u = Y*_u} = (D_uu + μI_uu) Y_u − W_uu Y_u − W_ul Y_l − μ Y^0_u = 0.

Therefore, a closed-form solution can be derived as

    Y*_u = (D_uu + μI_uu − W_uu)^{-1} (W_ul Y_l + μ Y^0_u).
However, for practical application to a large-scale data set, an iterative solution is more efficient. Let Y^t_u be the refined labels after the t-th iteration; the iterative update can be derived as:

    Y^{t+1}_u = (D_uu + μI_uu)^{-1} (W_uu Y^t_u + W_ul Y_l + μ Y^0_u).

The iterative solution is more efficient since (D_uu + μI_uu) is a diagonal matrix and its inverse is very easy to compute.
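The iterative update above fits in a few lines; since (D_uu + μI_uu) is diagonal, the matrix inverse reduces to an element-wise division. The following is an illustrative NumPy sketch under our own naming conventions, not the thesis's released code:

```python
import numpy as np

def graph_regularization(W, Y_l, Y0_u, mu=0.1, n_iter=200):
    """Iterative solution of the graph regularization objective.
    W    : (n, n) symmetric weight matrix, labeled nodes first
    Y_l  : (l,)  fixed labels of the labeled nodes
    Y0_u : (n-l,) initial labels (e.g., prior popularity) of unlabeled nodes
    Returns the refined label vector Y_u."""
    l = Y_l.shape[0]
    W_uu, W_ul = W[l:, l:], W[l:, :l]
    # (D_uu + mu*I) is diagonal, so inverting it is an element-wise division
    d = W[l:, :].sum(axis=1) + mu
    Y_u = Y0_u.copy()
    for _ in range(n_iter):
        Y_u = (W_uu @ Y_u + W_ul @ Y_l + mu * Y0_u) / d
    return Y_u
```

On small graphs the result can be checked against the closed-form solution (D_uu + μI_uu − W_uu)^{-1}(W_ul Y_l + μY^0_u).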
4.6 Experiments

In this section, we compare our proposed collective tweet wikification approach with the state-of-the-art methods shown in Table 4.1. We then study the quality of various concept relatedness measurement approaches and their impact on wikification.
Table 4.1: Description of wikification methods.

Methods    | Descriptions
TagMe      | The approach described in [86], which annotates short texts based on prior popularity and semantic relatedness of concepts. It is basically unsupervised, except that it needs a development set to tune the probability threshold for linkable mentions.
Meij       | A state-of-the-art system described in [92]: a supervised approach based on the random forest model. It performs mention detection and disambiguation jointly and is trained on 400 labeled tweets.
SSRegu1    | Our proposed model based on Principle 1, using 200 labeled tweets.
SSRegu12   | Our proposed model based on Principles 1 and 2, using 200 labeled tweets.
SSRegu13   | Our proposed model based on Principles 1 and 3, using 200 labeled tweets.
SSRegu123  | Our proposed full model based on Principles 1, 2, and 3, using 200 labeled tweets.
Table 4.2: Statistics of Freebase KG.

Knowledge Graph Element | Size
# Concepts              | 4.12m
# Relations             | 3.17k
# Concept Types         | 1.57k
4.6.1 Data and Scoring Metric
For our experiments, we use a Wikipedia dump from May 3, 2013 as our knowledge base, which includes 30 million pages. To reduce noise, we remove the entities which have fewer than 5 incoming anchor links, obtaining 4 million entities. We use a portion of Freebase limited to Wikipedia concepts as the semantic KG, with detailed statistics shown in Table 4.2.
For our experiments, we use a public data set [92] of 502 tweets posted by 28 verified users, annotated by two annotators. We randomly sample 102 tweets for development and use the remaining tweets for evaluation. For computational efficiency, we also filter some mention candidates by applying the preprocessing approach proposed in [86], and, similar to [93], remove from each mention's concept set all concepts with prior popularity below 2%. For concept disambiguation, we compute both standard micro (aggregated over all mentions) and macro (aggregated over all documents) precision scores over the top-ranked candidate concepts. For end-to-end wikification, we use the standard precision, recall, and F1 measures. A mention-and-concept pair ⟨m, c⟩ is judged correct if and only if m is linkable and c is the correct referent concept for m.
To evaluate the quality of concept relatedness, we use a benchmark test set created by [120] from CoNLL 2003 data. It includes 3,314 concepts as testing queries, and each query has 91 candidate concepts on average for relatedness measurement. After obtaining the ranked orders of candidate concepts for these queries, we compute nDCG [110] and mean average precision (MAP) [121] scores to evaluate the relatedness measurement quality.

Table 4.3: Overall performance.

Methods    | Precision | Recall | F1
TagMe      | 0.329     | 0.423  | 0.370
Meij       | 0.393     | 0.598  | 0.475
SSRegu1    | 0.538     | 0.435  | 0.481
SSRegu12   | 0.638     | 0.438  | 0.520
SSRegu13   | 0.541     | 0.457  | 0.495
SSRegu123  | 0.650     | 0.441  | 0.525
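For reference, the two ranking metrics can be sketched as below. This is one common formulation, written by us; [110] and [121] may use slightly different variants (e.g., exponential gains in DCG):

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain of a ranked list of relevance gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def average_precision(rels):
    """Average precision for a ranked list of binary relevance labels;
    MAP is the mean of this value over all queries."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels):
        if r:
            hits += 1
            total += hits / (i + 1)
    return total / hits if hits else 0.0
```

A perfect ranking gives nDCG = 1.0; placing relevant candidates lower in the list reduces both scores.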
4.6.2 End-to-End Wikification
Overall Performance The overall performance of various approaches is shown in
Table 4.3. The results of the supervised method proposed by [92] are obtained from 5-
fold cross validation. For our semi-supervised setting, we experimentally sample 200
tweets for training and use the remaining set as unlabeled and testing sets. In our semi-
supervised regularization model, the matrix W^loc is constructed as a kNN graph (k = 20). The regularization parameter μ is empirically set to 0.1, and the coefficients α, β, and γ are learned from the development set by considering all the combinations of values from
0 to 1 at 0.1 intervals. In order to randomize the experiments and make the comparison
fair, we conduct 20 test runs for each method and report the average scores across the 20
trials.
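The coefficient search described above enumerates all grid points with α + β + γ = 1. A minimal sketch of that enumeration (function name is ours; the dev-set scoring function is hypothetical):

```python
def coefficient_grid(step=0.1):
    """All (alpha, beta, gamma) combinations on a `step` grid
    satisfying alpha + beta + gamma = 1, as used to tune the
    contributions of the three relations."""
    k = round(1 / step)
    grid = []
    for i in range(k + 1):
        for j in range(k + 1 - i):
            grid.append((i * step, j * step, (k - i - j) * step))
    return grid
```

The best setting would then be chosen as `max(coefficient_grid(), key=score_on_dev)`, where `score_on_dev` is a hypothetical function returning development-set F1 for a coefficient triple.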
The relatively low performance of the baseline system TagMe demonstrates that relying only on prior popularity and topical information within a single tweet is not enough for an end-to-end wikification system on short tweets. As an example, it is
difficult to obtain topical clues in order to link the mention “Clinton” to Hillary Rodham
Clinton by relying on the single tweet “wolfblitzercnn: Behind the scenes on Clinton’s
Mideast trip #cnn”. Therefore, the system mistakenly links it to the most popular concept
Bill Clinton.
In comparison with the supervised baseline proposed by [92], our model SSRegu1, relying on local compatibility alone, already achieves comparable performance with 50% of the labeled data. This is because our model performs collective inference by making use of the manifold (cluster) structure of both labeled and unlabeled data, and because the local compatibility relation is detected with high precision (89.4%; here we define precision as the percentage of links that hold the same label). For example, the three pairs ⟨pelosi, Nancy Pelosi⟩, ⟨obama, Barack Obama⟩, and ⟨gaddafi, Muammar Gaddafi⟩ have strong local compatibility with each other, since they share many similar characteristics captured by the local features, such as string similarity between the mention and the concept. Suppose the first pair is labeled; its positive label will then be propagated to the other, unlabeled nodes through the local compatibility relation, correctly predicting their labels.
Incorporating the coreferential or semantic relatedness relation into SSRegu1 provides further gains, demonstrating the effectiveness of these two relations. For instance, “wh” is correctly linked to White House by incorporating evidence from its coreferential mention “white house”. The coreferential relation (Principle 2) proves more beneficial than the semantic relatedness relation (Principle 3) because the former is detected with much higher precision (99.7%) than the latter (65.4%).
Our full model SSRegu123 achieves a significant improvement over the supervised baseline (5% absolute F1 gain at the 95.0% confidence level by the Wilcoxon Matched-Pairs Signed-Ranks Test), showing that incorporating global evidence from multiple tweets with fine-grained relations is beneficial. For instance, the supervised baseline fails to link “UCONN” and “Bucks” in our examples to Connecticut Huskies and Milwaukee Bucks, respectively. Our full model corrects these two wrong links by propagating evidence through the semantic links shown in Figure 4.4 to obtain mutual ranking improvement. The best performance of our full model also illustrates that the three relations complement each other.
Effect of Concatenated Meta Paths In this chapter, we propose a unified framework utilizing meta path-based semantic relations to explore richer relevant context. Beyond the straightforward meta paths, we introduce more complicated ones by concatenating the simple ones. The performance of the systems without the concatenated meta paths is shown in Table 4.4. In comparison, we can clearly see that the systems using concatenated meta paths significantly outperform those relying on the simple ones alone. This is because the concatenated meta paths incorporate more relevant information with implicit relations into the models, increasing the number of coreference links by 1.6% and the number of semantic relatedness links by 9.3%. For example, the mention “narita airport” is correctly disambiguated to the concept “Narita International Airport” with higher confidence since its semantic relatedness relation with “Japan” is detected through the concatenated meta path described in section 4.4.2.

Table 4.4: The performance of systems without using concatenated meta paths.

Methods    | Precision | Recall | F1
SSRegu12   | 0.644     | 0.423  | 0.510
SSRegu13   | 0.543     | 0.441  | 0.486
SSRegu123  | 0.657     | 0.419  | 0.512
Figure 4.5: The effect of labeled tweet size.
Effect of Labeled Data Size In the previous experiments, we set the number of labeled tweets to 200 for the overall performance comparison with the baselines. In this subsection, we study the effect of labeled data size on our full model. We
randomly sample 100 tweets as testing data, and randomly select 50, 100, 150, 200, 250,
and 300 tweets as labeled data. 20 test runs are conducted and the average results are
reported across the 20 trials, as shown in Figure 4.5. We find that as the size of the
labeled data increases, our proposed model achieves better performance, demonstrating
that our proposed relational graph can capture the semantic relations between mentions
and concepts effectively. It is encouraging to see that our approach, with only 31.3%
labeled tweets (125 out of 400), already achieves a performance that is comparable to the
state-of-the-art supervised model trained from 100% (400) labeled tweets.
Figure 4.6: The effect of parameter µ.
Parameter Analysis In the previous experiments, we empirically set the parameter μ = 0.1. μ is the regularization parameter that controls the trade-off between the initial labels and the consistency of labels on the graph. As μ increases, the model tends to trust the initial labels more. Figure 4.6 shows the performance of our model as μ varies from 0.02 to 50. We can easily see that the system performance is stable when μ < 0.4. However, when μ ≥ 0.4, the system performance decreases dramatically, showing that prior popularity alone is not enough for an end-to-end wikification system.
4.6.3 Quality of Semantic Relatedness Measurement

In this subsection, we evaluate the relatedness measurement quality of various relatedness methods: (i) M&W, the Wikipedia anchor link-based method proposed by [38]; (ii) DSRM1, our proposed DSRM based on connected concepts; (iii) DSRM12, the DSRM based on connected concepts and relations; (iv) DSRM123, the DSRM based on connected concepts, relations, and concept types; and (v) DSRM1234, the DSRM based on all four types of knowledge.

The overall performance of the various relatedness methods is shown in Table 4.5.
Table 4.5: Overall performance of concept semantic relatedness methods.

Methods   | nDCG@1 | nDCG@5 | nDCG@10 | MAP
M&W       | 0.538  | 0.518  | 0.548   | 0.483
DSRM1     | 0.677  | 0.609  | 0.623   | 0.558
DSRM12    | 0.717  | 0.642  | 0.650   | 0.592
DSRM123   | 0.742  | 0.653  | 0.661   | 0.605
DSRM1234  | 0.814  | 0.732  | 0.739   | 0.682
Table 4.6: Examples of relatedness scores between a sample of concepts and the concept “National Basketball Association”.

Concept          | M&W  | DSRM1234
New York City    | 0.90 | 0.22
New York Knicks  | 0.79 | 0.79
Atlanta          | 0.71 | 0.39
Atlanta Hawks    | 0.53 | 0.83
Houston          | 0.57 | 0.37
Houston Rockets  | 0.49 | 0.80
Milwaukee        | 0.62 | 0.38
Milwaukee Bucks  | 0.50 | 0.79
We can see that our proposed DSRM significantly outperforms the standard relatedness method M&W (p < 0.05, according to the Wilcoxon Matched-Pairs Signed-Ranks Test),
indicating that deep semantic models are more effective for relatedness measurement. As
we incorporate more types of knowledge into the DSRM, it achieves better relatedness
quality, showing that the four types of semantic knowledge complement each other.
To study the main differences between M&W and the DSRM, we also show some
examples of relatedness scores in Table 4.6, 4.7 and 4.8. From Table 4.6, we can see that
M&W predicts that “NBA” is more semantically-related to cities/states than to basketball teams. However, the DSRM produces more reasonable scores, indicating that these basketball teams are highly semantically-related to their association. In addition, the DSRM
generates very similar scores between these basketball teams and their association (e.g.,
the scores in bold in Table 4.6), which is strong evidence that the DSRM can capture deep
semantics of concepts. We can also see that M&W tends to generate high relatedness
scores for popular concepts (e.g., “Google” and “Barack Obama”), but the DSRM does
not have such a bias.
Table 4.7: Examples of relatedness scores between a sample of concepts and the concept “National Football League”.

Concept             | M&W  | DSRM1234
New York City       | 0.89 | 0.09
New York Jets       | 0.92 | 0.63
Boston              | 0.92 | 0.19
Boston Bruins       | 0.62 | 0.38
Dallas              | 0.87 | 0.34
Dallas Cowboys      | 0.72 | 0.68
Philadelphia        | 0.93 | 0.19
Philadelphia Eagles | 0.79 | 0.65
Miami               | 0.54 | 0.27
Miami Dolphins      | 0.92 | 0.69
Table 4.8: Examples of relatedness scores between a sample of concepts and the concept “Apple Inc.”.

Concept                 | M&W  | DSRM1234
Apple                   | 0.32 | 0.27
Google                  | 0.98 | 0.81
Microsoft               | 0.86 | 0.86
Samsung                 | 0.49 | 0.69
Facebook                | 0.83 | 0.65
Twitter                 | 0.83 | 0.60
The New York Times      | 0.78 | 0.38
The Wall Street Journal | 0.78 | 0.49
Steve Jobs              | 0.78 | 0.74
Bill Gates              | 0.79 | 0.68
Barack Obama            | 0.71 | 0.36
4.6.4 Concept Disambiguation
It is also interesting to study the performance of our proposed graph regularization-based collective inference on a news dataset to demonstrate its effectiveness. We therefore use a benchmark news dataset (AIDA) based on CoNLL 2003 data [122], which includes 131 documents and 4,485 non-NIL mentions. We compare our methods with two state-of-the-art approaches on the news dataset: (i) Shirak, which utilizes a probabilistic taxonomy with a Naive Bayes model [123]; (ii) AIDA, a graph-based collective approach which finds a dense subgraph for joint disambiguation [122].
Topical coherence modeling is mainly used to enhance disambiguation rather than mention detection; thus, to better study the impact of the various semantic relatedness methods, we focus on concept disambiguation in this subsection. For concept disambiguation,
many existing approaches are unsupervised. To compare with these state-of-the-art meth-
ods, we also develop an unsupervised graph regularization framework (GraphRegu) for
concept disambiguation, which makes our model more robust to unseen and new data. We
only leverage the semantic relatedness relation to construct the relational graph to study
the impact of relatedness measurement approaches on disambiguation.
We initialize the ranking score of each node based on a sub-system of AIDA [122],
which relies on the linear combination of prior popularity and context similarity. The
context similarity proposed in AIDA is computed based on the extracted keyphrases (e.g.,
Wikipedia anchor texts) of an entity and all of their partial matches in the text of a men-
tion. We also adopt two heuristics to mine a set of labeled seed nodes for the graph
regularization model: (i) If a node v = ⟨m, e⟩ contains an unambiguous mention, then v is selected as a seed node with an initial ranking score of 1.0. (ii) For a mention m whose top-ranked candidate entity by prior popularity is e, if the prior popularity satisfies p(e|m) ≥ 0.95 and e is also the top-ranked entity by context similarity, then all nodes related to m are selected as labeled seeds: the node v = ⟨m, e⟩ is assigned a ranking score of 1.0, and the other nodes are assigned a ranking score of 0. During the
graph regularization process, the ranking scores of these labeled seed nodes will remain
unchanged.
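The two seed-mining heuristics above can be sketched as follows. The heuristics themselves are from the text; the data structures and function name are our own illustrative choices:

```python
def select_seed_scores(candidates, prior, context_top, threshold=0.95):
    """Mine labeled seed nodes <m, e> for graph regularization.
    candidates  : dict mention -> list of candidate entities
    prior       : dict (mention, entity) -> prior popularity p(e|m)
    context_top : dict mention -> top entity by context similarity
    Returns dict (mention, entity) -> fixed ranking score."""
    seeds = {}
    for m, ents in candidates.items():
        if len(ents) == 1:  # heuristic (i): unambiguous mention
            seeds[(m, ents[0])] = 1.0
            continue
        top = max(ents, key=lambda e: prior[(m, e)])
        # heuristic (ii): dominant prior confirmed by context similarity
        if prior[(m, top)] >= threshold and context_top.get(m) == top:
            for e in ents:
                seeds[(m, e)] = 1.0 if e == top else 0.0
    return seeds
```

The returned scores would stay fixed throughout the regularization iterations.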
The overall disambiguation performance is shown in Tables 4.9 and 4.10 for the AIDA dataset and the tweet set, respectively. Compared with the other strong baseline approaches, our unsupervised approach GraphRegu + M&W achieves very competitive performance on both datasets, illustrating that our proposed collective inference approach effectively models topical coherence for concept disambiguation.
Our best system, based on the DSRM with all four types of knowledge (denoted as DSRM1234), significantly outperforms the various strong baseline competitors on both datasets (all with p < 0.05). In particular, compared with the standard method M&W, DSRM1234 obtains 4.4% and 6.8% absolute micro precision gains in disambiguation for news and tweets, respectively. For instance, GraphRegu + M&W fails to disambiguate the mention “Middlesbrough” to the football club “Middlesbrough F.C.” in the text “Lee Bowyer was expected to play against Middlesbrough on Saturday.” This is because M&W generates the same semantic relatedness score (0.39) for ⟨“Middlesbrough F.C.”, “Lee Bowyer”⟩ and ⟨“Middlesbrough”, “Lee Bowyer”⟩. However, DSRM1234 computes a relatedness score of 0.68 for the former pair, much higher than the 0.33 of the latter, so GraphRegu + DSRM1234 correctly disambiguates the mention.
Table 4.9: Overall disambiguation performance on AIDA dataset.

Methods              | micro P@1 | macro P@1
Shirak               | 0.814     | 0.835
AIDA                 | 0.823     | 0.820
GraphRegu + M&W      | 0.822     | 0.811
GraphRegu + DSRM1    | 0.842     | 0.833
GraphRegu + DSRM12   | 0.853     | 0.839
GraphRegu + DSRM123  | 0.849     | 0.836
GraphRegu + DSRM1234 | 0.866     | 0.855
Table 4.10: Overall disambiguation performance on tweet set.

Methods              | micro P@1 | macro P@1
TagMe                | 0.610     | 0.605
Meij                 | 0.683     | 0.692
GraphRegu + M&W      | 0.651     | 0.662
GraphRegu + DSRM1    | 0.692     | 0.690
GraphRegu + DSRM12   | 0.702     | 0.696
GraphRegu + DSRM123  | 0.715     | 0.709
GraphRegu + DSRM1234 | 0.719     | 0.717
4.6.5 Discussions
In this subsection, we aim to answer two questions: (i) Are semantic KGs better
resources than Wikipedia anchor links for relatedness measurement? (ii) Is the DNN
a better choice than Normalized Google Distance (NGD) [61] and Vector Space Model
(VSP) [124] for relatedness measurement?
In order to answer these two questions, we directly apply NGD and VSP with the
tf-idf representations on the same KG that we use to learn the DSRM. Then we combine
them with the graph regularization model and study their impact on concept disambigua-
tion. Table 4.11 and 4.12 show the relatedness quality and disambiguation performance,
respectively. As shown in the first three rows of both tables, we can clearly see that NGD and VSP based on KGs significantly outperform their variants with Wikipedia anchor links (p < 0.05), which confirms that semantic KGs are better resources than the
Wikipedia anchor links for relatedness measurement. This is because KGs contain cleaner
semantic knowledge about concepts than Wikipedia anchor links. For instance, “Apple
Inc.” and “Barack Obama” share many noisy incoming links (e.g., “Austin, Texas” and
“2010s”) that are not helpful to capture their relatedness.
Table 4.11: Impact of semantic KGs and DNN on concept semantic relatedness.

Methods   | nDCG@1 | nDCG@5 | nDCG@10 | MAP
M&W       | 0.538  | 0.518  | 0.548   | 0.483
M&W1234   | 0.692  | 0.578  | 0.576   | 0.514
VSP1234   | 0.680  | 0.579  | 0.583   | 0.520
DSRM1234  | 0.814  | 0.732  | 0.737   | 0.682
Table 4.12: Impact of semantic KGs and DNN on concept disambiguation.

          | AIDA dataset          | Tweet set
Methods   | micro P@1 | macro P@1 | micro P@1 | macro P@1
M&W       | 0.822     | 0.811     | 0.651     | 0.662
M&W1234   | 0.846     | 0.838     | 0.682     | 0.692
VSP1234   | 0.848     | 0.835     | 0.694     | 0.702
DSRM1234  | 0.867     | 0.855     | 0.719     | 0.717
From the last three rows of Tables 4.11 and 4.12, we can see that the DSRM based on the DNN significantly outperforms NGD and VSP on both relatedness measurement and concept disambiguation (p < 0.05), illustrating that the DNN is indeed more effective for measuring concept relatedness. By extracting useful semantic features layer by layer with nonlinear functions and transforming sparse binary “one-hot” vectors into low-dimensional feature vectors in a latent space, the DNN is better able to represent concepts semantically.
4.6.6 Remaining Challenges
Figure 4.7 shows the distribution of errors from our best system. We can easily see that a large portion (69.2%) of the mistakes are directly caused by mention detection, showing that mention detection is the performance bottleneck of a tweet wikification system. This is consistent with the conclusion obtained in [93]. Even though our joint model successfully identifies 68.4% of the mentions which are not linkable and salient, detecting linkable mentions remains a very challenging problem. 15.8% of the errors are related to mention disambiguation, showing that disambiguation is relatively easy in microblog messages. Among the disambiguation errors, 69% are on commonly used terms such as “coverage” and “record”, rather than named entities. Some of the most challenging mentions for disambiguation in both news and microblogs are city and country names
(e.g., “Chicago”) that actually refer to sports teams (e.g., “Chicago Bulls”). This is because our proposed DSRM produces accurate relatedness scores between these sports teams as well as between cities and countries, so the system is biased towards the popular cities and countries. One possible solution is to design a joint model that performs mention disambiguation and discovers document interest simultaneously. Concept candidate extraction is also more challenging in microblogs due to the informal use of language. In this work, we have segmented @usernames and #hashtags into regular tokens for more accurate candidate extraction. Further normalization of typos and abbreviations into regular tokens could further improve candidate extraction performance.
Figure 4.7: Error distributions over six categories: mention detection (false positives), mention detection (false negatives), mixture of mention detection and disambiguation, mention disambiguation, concept candidate extraction, and annotation.
4.7 Summary

In this chapter,
(1) We have introduced a novel semi-supervised graph regularization framework for
wikification to simultaneously tackle the unique challenges of annotation and information
brevity in short tweets. To the best of our knowledge, this is the first work to explore the
semi-supervised collective inference model for the wikification task.
(2) We have extracted various semantic meta paths from HINs to expand the con-
texts of short tweets, which was proved to be an effective method to incorporate more
topically-relevant information for collective inference.
(3) We have constructed a relational graph with three types of fine-grained relations
including local compatibility based on a set of local features, coreference, and semantic
relatedness. We have also studied the impact of each relation and showed that these
relations can complement each other for the wikification task.
(4) We have introduced a deep semantic relatedness model (DSRM) based on deep
neural networks and semantic knowledge graphs. The DSRM maps each concept into a
low-dimensional vector that captures its latent semantics. We have compared the impact of
semantic KGs and Wikipedia anchor links, as well as the DNN and some classic similarity
measures that do not use semantics on relatedness measurement. We proved that both
semantic KGs and DNN are better choices for relatedness measurement.
(5) By studying three novel fine-grained relations to construct the relational graph,
detecting semantically-related information with semantic meta paths, and exploiting the
data manifolds in both unlabeled and labeled data for collective inference, our work can
dramatically save annotation cost and achieve better performance.
CHAPTER 5
Morph Decoding
In previous chapters, we introduced our methods to rank microblogs and to enrich microblogs with background knowledge from knowledge bases, thereby alleviating the information noise and information brevity problems. However, wikification systems fail to detect and resolve morphs, which tend to be informal terms conveying more implicit information. In this chapter, we propose novel approaches to tackle the newly proposed morph decoding problem.
5.1 Approach Overview
Figure 5.1: Overview of morph decoding: comparable data acquisition (censored and uncensored data); semantic annotation, morph detection, and target candidate identification; and target candidate ranking via learning to rank with surface, semantic, and social features.
Portions of this chapter previously appeared as: H. Huang, Z. Wen, D. Yu, H. Ji, Y. Sun, J. Han, andH. Li, “Resolving entity morphs in censored data,” in Proc. of the 51st Annu. Meeting of the Assoc. forComput. Linguist., Sofia, Bulgaria, 2013, pp. 1083–1093.
Given a set of Weibo tweet messages W = {w_1, w_2, ..., w_n}, the goal of morph decoding is to find a set of morphs M = {m_1, m_2, ..., m_p}, and then resolve each m_i to its real targets. Figure 5.1 depicts the general procedure of our approach. First, relevant comparable data sets that include m are retrieved; here we collect comparable censored data from Weibo and uncensored data from Twitter and Web documents such as news articles. We then apply various annotations, such as word segmentation, part-of-speech tagging, noun phrase chunking, and name tagging, to these data sets to obtain a set of unique terms, entities, and events. The approach then consists of three main steps.
• Morph Detection: To detect morphs, we propose a set of novel features to capture
the common characteristics of morphs.
• Target Candidate Identification: For each m, identify a list of target candidates E = {e_1, e_2, ..., e_N}. We make use of temporal distribution constraints to identify target candidates.
• Target Candidate Ranking: Rank the target candidates in E. We explore various
features including surface, semantic and social features, and incorporate them into
a learning to rank framework. Finally, the top ranked candidate is produced as the
resolved target.
5.2 Morph Detection

We first introduce the first step of our approach – morph detection. This step takes
advantage of the common characteristics shared among morphs and identifies the po-
tential morphs using a supervised method, since it is relatively easy to collect a certain
number of corpus-level morphs as training data. Formulating this task as a binary classification problem, we adopt Support Vector Machines (SVMs) [125] as the learning model. We propose the following four categories of features.
Basic: (i) character unigram, bigram, trigram, and surface form; (ii) part-of-speech
tags; (iii) the number of characters; (iv) whether some characters are identical. These
basic features will help identify several common characteristics of morph candidates (e.g.,
they are very likely to be nouns, and very unlikely to contain single characters).
Dictionary: Many morphs are non-regular names derived from proper names while retaining some of their characteristics. For example, the morphs “Governor Bo” and “Gourmand Province” are derived from their target entity names “Bo Xilai” and “Guangdong Province”, respectively. Therefore, we adopt a dictionary of proper names [126] and propose the following features: (i) whether a term occurs in the dictionary; (ii) whether a term starts with a commonly used last name and includes uncommonly used characters as its first name; (iii) whether a term ends with a geo-political entity or organization suffix word but is not in the dictionary.
Phonetic: Many morphs are created through phonetic (Chinese pinyin, in our case) modifications. For instance, the morph “Rice Cake” has the same phonetic transcription as its target entity “Fan Bingbing”. To extract phonetic-based features, we compile a dictionary of ⟨phonetic transcription, term⟩ pairs from the Chinese Gigaword corpus. Then, for each term, we check whether it has the same phonetic transcription as any entry in the dictionary while including different characters.
Language Modeling: Many morphs rarely appear in a general news corpus (e.g., “Brother Octopus”, referring to an octopus in Germany famous for soccer game predictions). Therefore, we propose to use character-based language models trained on Gigaword to calculate the occurrence probability of each term, and use n-gram probabilities (n ∈ [1, 5]) as features.
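To give a flavor of these language-modeling features, here is a deliberately simplified sketch: it scores a term by the average smoothed log-frequency of its character n-grams against background counts (assumed precomputed from Gigaword). A real character language model would use conditional n-gram probabilities; this unigram-of-ngrams version only illustrates the intuition that rare morphs score low:

```python
import math

def ngram_logprob_features(term, counts, total, n_max=5):
    """Average add-one-smoothed log-probability of the term's character
    n-grams (n = 1..n_max) under background corpus counts.
    Morphs that rarely occur in a general news corpus get low scores."""
    feats = []
    for n in range(1, n_max + 1):
        grams = [term[i:i + n] for i in range(len(term) - n + 1)]
        if not grams:
            feats.append(float("-inf"))
            continue
        lp = sum(math.log((counts.get(g, 0) + 1) / (total + 1)) for g in grams)
        feats.append(lp / len(grams))
    return feats
```

The resulting vector (one value per n) would be fed to the SVM alongside the other feature categories.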
5.3 Morph Resolution

5.3.1 Target Candidate Identification
The general goal of the first step is to identify a list of target candidates for each morph query from the comparable corpora, which include Sina Weibo, Chinese news websites, and English Twitter. Obviously, however, we cannot consider all of the named entities in these sources as target candidates, due to the sheer volume of information. In addition, morphs are not limited to named entity forms. In order to narrow down the scope of target
candidates, we propose a Temporal Distribution Assumption as follows. The intuition is
that a morph m and its real target e should have similar temporal distributions in terms of
their occurrences. Suppose the data sets are separated into Z temporal slots (e.g. by day),
the assumption can be stated as:
Let $T_m = \{t_{m_1}, t_{m_2}, \dots, t_{m_{Z_m}}\}$ be the set of temporal slots in which a morph m occurs, and $T_e = \{t_{e_1}, t_{e_2}, \dots, t_{e_{Z_e}}\}$ be the set of slots in which a target candidate e occurs. Then e is considered a target candidate of m if and only if, for each $t_{m_i} \in T_m$ ($i = 1, 2, \dots, Z_m$), there exists a $j \in \{1, 2, \dots, Z_e\}$ such that $|t_{m_i} - t_{e_j}| \le \delta$, where $\delta$ is a threshold value (in this work we set the threshold to 7 days, optimized on a development set).
For comparison, we also attempted a topic modeling approach to detect target candidates, as shown in Section 5.4.
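The temporal filter above can be sketched as follows (an illustrative Python rendering with integer day indices; names are ours):

```python
def temporal_candidates(morph_slots, candidate_slots, delta=7):
    """Keep a candidate e only if every temporal slot in which the morph
    occurs lies within `delta` days of some slot in which e occurs
    (the Temporal Distribution Assumption)."""
    kept = []
    for entity, slots in candidate_slots.items():
        if all(any(abs(tm - te) <= delta for te in slots)
               for tm in morph_slots):
            kept.append(entity)
    return kept
```

With δ = 7 this retains only candidates whose bursts of mentions track the morph's bursts of mentions.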
5.3.2 Target Candidate Ranking
Next, we propose a learning-to-rank framework to rank target candidates using various novel features derived from surface, semantic and social analysis.
5.3.2.1 Surface Features
We first extract surface features between the morph and the candidate based on orthographic similarity measures commonly used in entity coreference resolution (e.g., [16], [127]).
String edit distance: The minimum number of insertions, deletions, and substitu-
tions required to transform one string into the other.
Normalized string edit distance: The string edit distance normalized by the maximum length of the two strings [128].
Longest common subsequence: The length of the longest subsequence common to the two strings [129]. These measures can be effective when the morph keeps some characters from the target; for example, the morph "Qiao Boss" refers to "Steve Jobs", whose Chinese transliteration shares the character "Qiao".
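Standard dynamic-programming implementations of the three measures look like this (an illustrative sketch):

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def normalized_edit_distance(a, b):
    """Edit distance normalized by the longer string's length."""
    return edit_distance(a, b) / max(len(a), len(b), 1)

def lcs_length(a, b):
    """Length of the longest common subsequence."""
    dp = [0] * (len(b) + 1)
    for ca in a:
        prev = 0
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], (prev + 1 if ca == cb
                                  else max(dp[j], dp[j - 1]))
    return dp[len(b)]
```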
5.3.2.2 Semantic Features
Information Network Construction: To construct the information networks for morphs, we apply the Stanford Chinese word segmenter with the Chinese Penn Treebank segmentation standard [130] and the Stanford part-of-speech tagger [131] to process each sentence in the comparable data sets. Then we apply ICTCLAS [132], a hierarchical Hidden Markov Model (HMM) based Chinese lexical analyzer, to extract named entities, noun phrases and events.
We also attempted to use the results from Dependency Parsing, Relation Extraction and Event Extraction tools [133] to enrich the link types. Unfortunately, the state-of-the-art techniques for these tasks still perform poorly on social media in terms of both accuracy and coverage of important information, and these sophisticated semantic links all had a negative impact on the target ranking performance. Therefore, we limited the types of vertices to: Morph (M), Entity (E), which includes target candidates, Event (EV), and Non-Entity Noun Phrase (NP), and used co-occurrence as the edge type. We extract entities, events, and non-entity noun phrases that occur in more than one tweet as neighbors. For two vertices $x_i$ and $x_j$, the weight $w_{ij}$ of their edge is the number of tweets in which they co-occur. A network schema of such networks is shown in Figure 5.2. Figure 5.3 presents an example of a heterogeneous information network
Figure 5.2: Network Schema of Morph-Related Heterogeneous Information Network
from the motivating examples following the above network schema, which connects the morphs "Peace West King" and "Buhou" and their corresponding target "Bo Xilai".
Meta-Path-Based Semantic Similarity Measurements: Given the constructed network, a straightforward solution for finding the target of a morph is link-based similarity search. However, objects are now linked to different types of neighbors; if all neighbors are treated the same, information may be lost. For example, the entity "Chongqing" is a very important aspect characterizing the politician "Bo Xilai", since he governed it; if a morph m is also highly correlated with "Chongqing", it is very likely that "Bo Xilai" is the real target of m. Therefore, the semantic features generated from neighbors such as the entity "Chongqing" should be treated differently from those generated from other types of neighbors such as the noun phrase "talented people".
Figure 5.3: Example of morph-related heterogeneous information network.
In this work, we propose to measure the similarity of two nodes over heterogeneous networks as shown in Figure 5.2, distinguishing neighbors into three types according to the network schema (i.e., entities, events, and non-entity noun phrases). We then adopt meta-path-based similarity measures [23], [24], which are defined over heterogeneous networks, to extract semantic features. A meta-path is a path defined over a network, composed of a sequence of relations between different object types. For example, as shown in Figure 5.2, a morph and its target candidate can be connected by three meta-paths: "M - E - E", "M - EV - E", and "M - NP - E". Intuitively, each meta-path provides a unique angle on how similar two objects are.

For the determined meta-paths, we extract semantic features using the similarity measures proposed in [16], [23]. We denote the neighbor sets of a certain type for a morph m and a target candidate e as $\Gamma(m)$ and $\Gamma(e)$, and a meta-path as P. We list several meta-path-based similarity measures below.
Common neighbors (CN). It measures the number of common neighbors that m and e share: $|\Gamma(m) \cap \Gamma(e)|$.

Path count (PC). It measures the number of path instances between m and e following meta-path P.

Pairwise random walk (PRW). For a meta-path P that can be decomposed into two shorter meta-paths of the same length, $P = (P_1 P_2)$, pairwise random walk measures the probability of a pairwise random walk starting from both m and e and reaching the same middle object. More formally, it is computed as
$$\sum_{(p_1 p_2) \in (P_1 P_2)} prob(p_1)\, prob(p_2^{-1}),$$
where $p_2^{-1}$ is the inverse of $p_2$.
Kullback-Leibler distance (KLD). For m and e, the pairwise random walk probabilities of their neighbors can be represented as two probability vectors $\langle p_m(x_1), \dots, p_m(x_N)\rangle$ and $\langle p_e(x_1), \dots, p_e(x_N)\rangle$. The Kullback-Leibler distance [16] can then be computed as
$$\sum_{i=1}^{N} \left( p_m(x_i) \log \frac{p_m(x_i)}{p_e(x_i)} + p_e(x_i) \log \frac{p_e(x_i)}{p_m(x_i)} \right).$$
Beyond the above similarity measures, we also propose a cosine-similarity-style normalization method to modify the common neighbor and pairwise random walk measures, so that we can ensure the morph node and the target candidate node are strongly connected and also have similar popularity. The modified measures penalize features involving highly popular objects, since such objects are more likely to have accidental interactions with each other.

Normalized common neighbors (NCN). Normalized common neighbors can be measured as
$$sim(m, e) = \frac{|\Gamma(m) \cap \Gamma(e)|}{\sqrt{|\Gamma(m)|}\,\sqrt{|\Gamma(e)|}}.$$
It refines the simple counting of common neighbors by avoiding bias toward highly visible or concentrated objects.
Pairwise random walk/cosine (PRW/cosine). Pairwise random walk weights linkages disproportionately by their visibility to their neighbors, which may be too strong. Instead, we propose a tamer normalization method:
$$\sum_{(p_1 p_2) \in (P_1 P_2)} f(p_1)\, f(p_2^{-1}),$$
where
$$f(p_1) = \frac{count(m, x)}{\sqrt{\sum_{x \in \Omega} count(m, x)}}, \qquad f(p_2) = \frac{count(e, x)}{\sqrt{\sum_{x \in \Omega} count(e, x)}},$$
$\Omega$ is the set of middle objects connecting the decomposed meta-paths $p_1$ and $p_2^{-1}$, and $count(y, x)$ is the total number of paths between y and the middle object x, where y can be m or e.
The above similarity measures can also be applied to homogeneous networks that
do not differentiate the neighbor types.
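For a single neighbor type (e.g., entities for the meta-path "M - E - E"), these measures reduce to simple set and weight computations; an illustrative sketch, where `w_m` and `w_e` map middle objects to co-occurrence edge weights:

```python
import math

def common_neighbors(nbrs_m, nbrs_e):
    """CN: number of neighbors of one type shared by m and e."""
    return len(set(nbrs_m) & set(nbrs_e))

def normalized_common_neighbors(nbrs_m, nbrs_e):
    """NCN: CN with cosine-style normalization penalizing popular nodes."""
    if not nbrs_m or not nbrs_e:
        return 0.0
    return common_neighbors(nbrs_m, nbrs_e) / (
        math.sqrt(len(set(nbrs_m))) * math.sqrt(len(set(nbrs_e))))

def path_count(w_m, w_e):
    """PC: number of path instances m - x - e under co-occurrence weights."""
    return sum(w * w_e.get(x, 0) for x, w in w_m.items())

def pairwise_random_walk(w_m, w_e):
    """PRW: probability that walks from m and from e meet at the same x."""
    sm, se = sum(w_m.values()), sum(w_e.values())
    if sm == 0 or se == 0:
        return 0.0
    return sum((w / sm) * (w_e.get(x, 0) / se) for x, w in w_m.items())
```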
Global Semantic Feature Generation: A morph tends to have a higher temporal correlation with its real target, and to share more similar topics with it than with irrelevant targets. Therefore, we propose to incorporate temporal information into the similarity measures to generate global semantic features.
Let $T = t_1 \cup t_2 \cup \dots \cup t_N$ be a set of temporal slots (i.e., by day), and E be the set of target candidates for a morph m. Then for each $t_i \in T$ and each $e \in E$, the local semantic feature $sim_{t_i}(m, e)$ is extracted based only on the information posted within $t_i$, using one of the similarity measures introduced in Section 5.3.2.2. We then propose two approaches to generate global semantic features. The first adds up the similarity scores between m and e over the temporal slots to obtain the first set of global features:
$$sim_{global\_sum}(m, e) = \sum_{t_i \in T} sim_{t_i}(m, e).$$
The second first normalizes the similarity score within each temporal slot $t_i$, then sums the normalized scores to generate the second set of global features:
$$sim_{global\_norm}(m, e) = \sum_{t_i \in T} norm_{t_i}(m, e), \quad \text{where } norm_{t_i}(m, e) = \frac{sim_{t_i}(m, e)}{\sum_{e' \in E} sim_{t_i}(m, e')}.$$
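The two global feature generators can be sketched as follows (an illustrative rendering in which `local_sim` stands for any per-slot similarity measure from above):

```python
def global_sum(local_sim, morph, cand, slots):
    """sim_global_sum: add up per-slot similarities."""
    return sum(local_sim(morph, cand, t) for t in slots)

def global_norm(local_sim, morph, cand, candidates, slots):
    """sim_global_norm: normalize per slot over all candidates, then sum."""
    total = 0.0
    for t in slots:
        denom = sum(local_sim(morph, e, t) for e in candidates)
        if denom > 0:
            total += local_sim(morph, cand, t) / denom
    return total
```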
Integrate Cross Source/Cross Genre Information: Due to internet information censorship or surveillance, users may need to use morphs to post sensitive information. For example, a Chinese Weibo message translated as "Already put in prison, still need to serve Buhou?" includes the morph "Buhou". In contrast, users are less restricted in other, uncensored social media such as Twitter. For example, a tweet from Twitter translated as "...call Bo Xilai 'Peace West King' or 'Buhou'..." contains both the morph and the real target "Bo Xilai". Therefore, we propose to integrate information from another source (e.g., Twitter) to help resolve sensitive morphs in Weibo.
Another difficulty of morph resolution in microblogging is that tweets may contain at most 140 characters and carry a lot of noise and diverse topics. The shortness and diversity of tweets may limit the power of content analysis for semantic feature extraction. However, formal genres such as web documents are cleaner and contain richer contexts, and thus can provide more topically related information. In this work, we also exploit the background web documents from the URLs embedded in tweets to enrich information network construction. After applying the same annotation techniques as for tweets to the uncensored data sets, sentence-level co-occurrence relations are extracted and integrated into the network following the schema shown in Figure 5.2.
5.3.2.3 Social Features
It has been shown that there exists correlation between neighbors in social networks [134], [135]. Because of such social correlation, close social neighbors in social media such as Twitter and Weibo may post similar information or share similar opinions. Therefore, we can utilize social correlation to assist in resolving morphs.

As social correlation can be defined as a function of the social distance between a pair of users, we use social distance as a proxy for social correlation in our approach. The social distance between users i and j is defined by considering the degree of separation in their interactions (e.g., retweeting and mentioning) and the amount of interaction. Similar definitions have been shown effective in characterizing social distance in social networks extracted from communication data [135], [136]. Specifically, it is
$$dist(i, j) = \sum_{k=1}^{K-1} \frac{1}{strength(v_k, v_{k+1})},$$
where $v_1, \dots, v_K$ are the nodes on the shortest path from user i to user j, and $strength(v_k, v_{k+1})$ measures the strength of interactions between $v_k$ and $v_{k+1}$ as
$$strength(i, j) = \frac{\log(X_{ij})}{\max_j \log(X_{ij})},$$
where $X_{ij}$ is the total number of interactions between users i and j, including both retweeting and mentioning (if $X_{ij} < 10$, we set $strength(i, j) = 0$).
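One concrete reading of these definitions (an illustrative sketch that finds the shortest path with Dijkstra's algorithm over edge costs 1/strength; the interaction counts and user names are toy data):

```python
import heapq
import math

def strength(x_counts, i, j, floor=10):
    """strength(i, j) = log(X_ij) / max_j' log(X_ij'); 0 below the floor."""
    xij = x_counts.get((i, j), 0)
    if xij < floor:
        return 0.0
    max_log = max(math.log(x) for (a, _), x in x_counts.items()
                  if a == i and x >= floor)
    return math.log(xij) / max_log

def social_distance(x_counts, src, dst, floor=10):
    """Sum of 1/strength along the cheapest interaction path (Dijkstra)."""
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > best.get(u, math.inf):
            continue
        for (a, b), _ in x_counts.items():
            if a != u:
                continue
            s = strength(x_counts, a, b, floor)
            if s == 0.0:
                continue  # interactions below the floor carry no edge
            nd = d + 1.0 / s
            if nd < best.get(b, math.inf):
                best[b] = nd
                heapq.heappush(heap, (nd, b))
    return math.inf
```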
We integrate social correlation and temporal information to define our social features. The intuition is that when a morph is used by a user, the real target may also appear in the posts by the user or his/her close friends within a certain time period. Let T be the set of temporal slots in which a morph m occurs, $U_t$ be the set of users whose posts include m in slot $t \in T$, and $U_c$ be the set of close friends (i.e., social distance < 0.5) of the users in $U_t$. The social features are defined as
$$s(m, e) = \frac{\sum_{t \in T} f(e, t, U_t, U_c)}{|T|},$$
where $f(e, t, U_t, U_c)$ is an indicator function that returns 1 if one of the users in $U_t$ or $U_c$ posts tweets that include the target candidate e within 7 days before t.
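The social feature can be sketched as follows (an illustrative rendering with integer day indices; the argument names are ours):

```python
def social_feature(morph_slots, users_by_slot, close_friends, target_posts,
                   window=7):
    """s(m, e): fraction of morph slots t where a user in U_t or a close
    friend posted the target candidate within `window` days before t."""
    if not morph_slots:
        return 0.0
    hits = 0
    for t in morph_slots:
        users = set(users_by_slot.get(t, []))
        for u in list(users):
            users |= set(close_friends.get(u, []))  # expand U_t with U_c
        if any(t - window <= day <= t
               for u in users for day in target_posts.get(u, [])):
            hits += 1
    return hits / len(morph_slots)
```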
5.3.2.4 Learning-to-Rank
Similar to [16], [23], we model the probability of linkage between a morph m and its target candidate e as a function incorporating the surface, semantic and social features. Given a training pair ⟨m, e⟩, we choose the standard logistic regression model to learn weights for the features defined above. The learnt model is used to predict the probability of linking an unseen morph to each of its target candidates. We then rank the candidates by descending probability and select the top k as the final answers.
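A minimal rendering of this ranking step (an illustrative from-scratch logistic regression rather than any particular toolkit; feature values and candidate names are toy data):

```python
import math

def train_logreg(X, y, lr=0.5, epochs=300):
    """Gradient descent for logistic regression; the last weight is the bias."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the log loss w.r.t. z
            for j, xj in enumerate(xi):
                w[j] -= lr * g * xj
            w[-1] -= lr * g
    return w

def link_probability(w, x):
    """Predicted probability that the candidate is the morph's target."""
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def top_k_targets(w, candidates, k):
    """Rank (candidate, feature-vector) pairs by predicted probability."""
    ranked = sorted(candidates, key=lambda ce: -link_probability(w, ce[1]))
    return [c for c, _ in ranked[:k]]
```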
5.4 Experiments

Next, we present experiments under the various settings shown in Table 5.1, and the impacts of cross-source and cross-genre information.
Table 5.1: Description of feature sets. * Glob only uses the same set of similarity measures when combined with other semantic features.

Feature set  Description
Surf         Surface features
HomB         Semantic features extracted from homogeneous CN, PC, PRW, and KLD
HomE         HomB + semantic features extracted from homogeneous NCN and PRW/cosine
HetB         Semantic features extracted from heterogeneous CN, PC, PRW, and KLD
HetE         HetB + semantic features extracted from heterogeneous NCN and PRW/cosine
Glob*        Global semantic features
Social       Social network features
Table 5.2: Data statistics.

Data            Training  Development  Testing
# Tweets        1,500     500          2,688
# Unique Terms  10,098    4,848        15,108
# Morphs        250       110          341
5.4.1 Data and Evaluation Metric
We collected 1,553,347 tweets from Chinese Sina Weibo posted from May 1 to June 30 to construct the censored data set, and retrieved 66,559 web documents from the URLs embedded in tweets as the initial uncensored data set. Retweets and redundant web documents are filtered out to ensure more reliable frequency counting of co-occurrence relations. We then randomly sampled 4,688 non-redundant tweets and asked two Chinese native speakers to manually annotate morphs in these tweets. The annotated data set is randomly split into training, development, and testing sets, with detailed statistics shown in Table 5.2. In addition, we used 23 sensitive morphs and the entities that appear in the tweets as queries and retrieved 25,128 Chinese tweets from the 10% Twitter feeds within the same time period, as well as 7,473 web documents from the embedded URLs, and added them to the uncensored data set. For morph resolution, we are more interested in resolving popular morphs, which tend to have more social impact. Thus we filtered out the manually annotated morphs that appeared on fewer than 5 days and obtained a test set consisting of 107 morph entities (81 persons and 26 locations) and their real targets as our references.

To evaluate system performance, we use leave-one-out cross validation and compute accuracy as $Acc@k = \frac{C_k}{Q}$, where $C_k$ is the total number of correctly resolved morphs within the top k ranked answers, and Q is the total number of morph queries. We consider a morph correctly resolved at the top k answers if the top k answer set contains the real target of the morph.
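The metric can be computed as follows (an illustrative sketch; the morph and target names are toy data):

```python
def acc_at_k(ranked, gold, k):
    """Acc@k = C_k / Q: fraction of morph queries whose real target
    appears among the top-k ranked answers."""
    correct = sum(1 for m, target in gold.items()
                  if target in ranked.get(m, [])[:k])
    return correct / len(gold)
```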
5.4.2 Morph Detection Performance
Table 5.3 shows the performance of morph detection using different feature sets.
We can see that the recall values keep increasing as we use more features, while the precision values remain relatively stable. With all features, our approach discovers 888 potential morphs in the test data (9.3% of all the terms in the testing data), indicating that this step can effectively narrow down the scope of morphs. The basic features greatly
Table 5.3: Performance of morph detection.
Features           Precision  Recall  F1
1. Basic           0.270      0.702   0.390
2. 1 + Dictionary  0.230      0.780   0.356
3. 2 + Phonetic    0.235      0.786   0.362
4. 3 + LM          0.245      0.801   0.376
narrow down the scope of candidates by filtering out terms that are easily judged to be non-morphs (e.g., regular names with the part-of-speech tag NR, or terms with only one character). Dictionary and phonetic-based features further help to improve recall by detecting irregular terms derived from regular names. For example, the dictionary features can help to identify the commonly used characters (e.g., "province") in irregular mentions (e.g., "Singing Province"). The phonetic features can detect some irregular terms (e.g., "Charred Cloth") with the same phonetics as regular names (e.g., "Ministry of Foreign Affairs"). LM-based features further detect informal terms such as "Six Step Man" and "Octopus Brother", which are rarely used in the standard corpus. However, our approach fails to detect some potential morphs such as "Boxing Champion", "Great People", and "Nursing Supervisor". We find that these missed morphs are general terms, and a deeper understanding of their true targets is crucial to discover them.
5.4.3 Morph Resolution Performance
Single Genre Information: We first study the contributions of each set of surface and semantic features, as shown in the first five rows of Table 5.4. The poor performance of the surface features shows that the morph resolution task is very challenging, since 70% of morphs are not orthographically similar to their real targets. Thus, capturing a morph's semantic meaning is crucial. Overall, the results demonstrate the effectiveness of our proposed methods. Specifically, comparing "HomB" with "HetB" and "HomE" with "HetE", we can see that the semantic features based on heterogeneous networks have advantages over those based on homogeneous networks. This corroborates that different neighbor sets contribute differently, and such discrepancies should be captured. The comparisons of "HomB" with "HomE" and "HetB" with "HetE" demonstrate the effectiveness of our two newly proposed measures. To evaluate the importance of each similarity measure, we delete
the semantic features obtained from each measure in "HetE" and re-evaluate the system. We find that NCN is the most effective measure, while KLD is the least important one. Further adding the global semantic features significantly improves the performance. This indicates that capturing temporal correlations and the semantics of morphing simultaneously is important for morph resolution.
Table 5.5 shows that the combination of surface and semantic features further improves the performance, showing that they are complementary. For example, using only surface features, the real target "Steve Jobs" of the morph "Qiao Boss" is not top-ranked, since some other candidates such as "George" are more orthographically similar. However, "Steve Jobs" is ranked first when surface features are combined with semantic features.
Table 5.4: The system performance based on each single feature set.
Features  Surf    HomB    HomE    HetB    HetE
Acc@1     0.028   0.201   0.192   0.224   0.252
Acc@5     0.159   0.313   0.369   0.393   0.421
Acc@10    0.243   0.346   0.407   0.439   0.467
Acc@20    0.313   0.411   0.467   0.500   0.523
Features          + Glob  + Glob  + Glob  + Glob
Acc@1             0.230   0.285   0.257   0.285
Acc@5             0.402   0.407   0.449   0.458
Acc@10            0.435   0.458   0.500   0.495
Acc@20            0.486   0.523   0.565   0.542
Table 5.5: The system performance based on combinations of surface and semantic features.

Features  Surf + HomB  Surf + HomE  Surf + HetB  Surf + HetE
Acc@1     0.234        0.238        0.262        0.276
Acc@5     0.416        0.444        0.481        0.519
Acc@10    0.477        0.505        0.533        0.570
Acc@20    0.519        0.561        0.565        0.598
Features  + Glob       + Glob       + Glob       + Glob
Acc@1     0.290        0.341        0.322        0.346
Acc@5     0.505        0.495        0.528        0.533
Acc@10    0.551        0.551        0.579        0.584
Acc@20    0.594        0.603        0.636        0.631
Impact of Cross Source and Cross Genre Information: We integrate the cross
source information from Twitter, and the cross genre information from web documents
into Weibo tweets for information network construction, and extract a new set of semantic
features. Table 5.6 shows that further gains can be achieved. Notice that integrating tweets
from Twitter mainly improves the ranking for top k where k > 1. This is because Weibo
dominates our dataset, and in Weibo many of these sensitive morphs are mostly used with
their traditional meanings instead of the morph senses. Further performance improvement
is achieved by integrating information from background formal web documents which can
provide richer context and relations.
Table 5.6: The system performance of integrating cross-source and cross-genre information.

Features  Surf + HomB + Glob  Surf + HomE + Glob  Surf + HetB + Glob  Surf + HetE + Glob
Acc@1     0.290               0.341               0.322               0.346
Acc@5     0.505               0.495               0.528               0.533
Acc@10    0.551               0.551               0.579               0.584
Acc@20    0.594               0.603               0.636               0.631
Features  + Twitter           + Twitter           + Twitter           + Twitter
Acc@1     0.308               0.336               0.336               0.346
Acc@5     0.514               0.519               0.547               0.565
Acc@10    0.579               0.594               0.594               0.636
Acc@20    0.631               0.640               0.668               0.668
Features  + Web               + Web               + Web               + Web
Acc@1     0.327               0.360               0.341               0.379
Acc@5     0.528               0.519               0.565               0.575
Acc@10    0.594               0.589               0.622               0.645
Acc@20    0.631               0.650               0.678               0.678
Effects of Social Features: Table 5.7 shows that adding social features improves the best performance achieved so far. This is because a group of people with close relationships may share similar opinions. As an example, the two tweets "...of course the reputation of Buhou is a little too high! //@User1: //@User2: Chongqing event tells us..." and "...do not follow Bo Xilai...@User1..." are from two users in the same social group. One includes the morph "Buhou" and the other includes its target "Bo Xilai".
Effects of Candidate Detection: The performance with and without the candidate detection step (using all features) is shown in Table 5.8. The gain is small, since the combination of all features in the learning-to-rank framework can already capture the relationship between a morph and a target candidate well. Nevertheless, the temporal distribution assumption is effective: it filters out 80% of unrelated targets and speeds up the system by a factor of five, while retaining 98.5% of the morph candidates that can be detected.
We also attempted a topic modeling approach to detect target candidates. Due
Table 5.7: The effects of social features.
Features  Surf + HomB + Glob  Surf + HomE + Glob  Surf + HetB + Glob  Surf + HetE + Glob
          + Twitter + Web     + Twitter + Web     + Twitter + Web     + Twitter + Web
Acc@1     0.327               0.360               0.341               0.379
Acc@5     0.528               0.519               0.565               0.575
Acc@10    0.594               0.589               0.622               0.645
Acc@20    0.631               0.650               0.678               0.678
Features  + Social            + Social            + Social            + Social
Acc@1     0.336               0.369               0.365               0.379
Acc@5     0.537               0.547               0.589               0.594
Acc@10    0.594               0.601               0.645               0.659
Acc@20    0.645               0.664               0.701               0.701
Table 5.8: The effects of temporal constraint.
System   Acc@1  Acc@5  Acc@10  Acc@20
Without  0.365  0.579  0.645   0.696
With     0.379  0.594  0.659   0.701
to the large amount of data, we first split the data set on a daily basis and then applied Probabilistic Latent Semantic Analysis (PLSA) [137]. Named entities that co-occur at least λ times with a morph query in the same topic are selected as its target candidates. As shown in Table 5.9 (K is the number of predefined topics), PLSA is not very effective, mainly because traditional topic modeling approaches do not perform well on short texts from social media. Therefore, we choose a simple method based on temporal distribution to detect target candidates.
Table 5.9: Accuracy of target candidate detection.
Method  All    Temporal  PLSA (K=5, λ=1)   PLSA (K=5, λ=2)
Acc     0.935  0.921     0.935             0.925
No.     8,111  1,964     6,380             4,776
Method  PLSA (K=10, λ=1)  PLSA (K=10, λ=2)  PLSA (K=20, λ=1)  PLSA (K=20, λ=2)
Acc     0.935             0.907             0.888             0.757
No.     5,117             3,138             3,702             1,664
5.4.4 Remaining Challenges
Compared with the standard alias detection (“Surf+HomB”) approach [16], our
proposed approach achieves significantly better performance (99.0% confidence level by
the Wilcoxon Matched-Pairs Signed-Ranks Test for Acc@1). We further explore several types of factors that may affect the system performance.
One important aspect affecting resolution performance is morph/non-morph ambiguity. We categorize a morph query as "Unique" if the string is mainly used as a morph when it occurs, such as "Governor Bo", which is used to refer to "Bo Xilai"; otherwise as "Common" (e.g., "Baby", "President"). Table 5.10 presents the separate scores for these two categories. We can see that the morphs in the "Unique" category are resolved much better than those in the "Common" category. This is because morphs in the "Common" category are also used with their original meanings, which introduces a lot of noise into heterogeneous network construction. This could be avoided by performing context-aware morph resolution: determining whether a term is used as a morph in a specific microblog post, and only leveraging the posts in which a term is used as a morph to construct the network.
Table 5.10: Performance of two categories.
Category  Number  Acc@1  Acc@5  Acc@10  Acc@20
Unique    72      0.479  0.715  0.771   0.819
Common    35      0.171  0.343  0.400   0.429
Our resolution system successfully identifies the true targets for 70% of morph queries within the top 20 ranked candidates. Our analysis reveals that deeper profile understanding of both morphs and target entities is required to select the true targets for many morph queries. For instance, the morphs for the three politicians "Kim Il-sung", "Kim Jong-il", and "Kim Jong-un" are "Kim Big Fat", "Kim Second Fat", and "Kim Third Fat", respectively. These morphs and their true targets are very similar to each other, so it is crucial to conduct deeper inference to capture their family relationships. Detecting the types of both morphs and target candidates can also help filter candidates whose types are inconsistent with the morph queries.
We also investigate the effect of morph popularity on resolution performance. We split the queries into 5 equal-sized bins by non-descending frequency and evaluate Acc@1 separately. As shown in Table 5.11, popularity is not highly correlated with performance.
Table 5.11: Effects of popularity of morphs.
Rank    0–20%  20–40%  40–60%  60–80%  80–100%
All     0.333  0.476   0.341   0.429   0.318
Unique  0.321  0.679   0.379   0.571   0.483
Common  0.214  0.214   0.071   0.071   0.286
5.5 Summary

In this chapter,
(1) We have studied the brand-new “morph decoding” task.
(2) We have proposed a set of novel features to capture common characteristics of
morphs and learnt a supervised morph detection model that can greatly narrow down the
scope of morph candidates.
(3) We have proposed to detect target candidates by exploiting the dynamics of social media to extract temporal distributions of entities, based on the assumption that the popularity of an individual is correlated between censored and uncensored text within a certain time window.
(4) We have built and analyzed heterogeneous information networks from multiple sources, such as Twitter, Sina Weibo and web documents in formal genres (e.g., news), with some well-developed NLP approaches, because a morph and its target tend to appear in similar contexts.
(5) We have proposed two new similarity measures, as well as integrating temporal
information into the similarity measures to generate global semantic features.
(6) We have modeled social user behaviors and used social correlation to assist in
measuring semantic similarities because the users who posted a morph and its correspond-
ing target tend to share similar interests and opinions.
(7) We have adopted a supervised learning-to-rank framework to combine various
features, including surface features, semantic features extracted from HINs, and social
features.
(8) We have compared various methods based on heterogeneous networks and ho-
mogeneous networks for morph resolution, and showed that HIN-based methods substan-
tially outperform those based on homogeneous networks.
CHAPTER 6
Conclusions and Future Directions
6.1 Conclusions

In this thesis, we have aimed to enhance natural language understanding of informal microblogs for both humans and machines by studying three important issues related to information ranking, enrichment, and resolution. By identifying salient and informative information, enriching short microblog posts with rich and clean background knowledge from knowledge bases, and detecting and resolving informal and implicit morphs to their regular referents, this thesis can assist people's reading and understanding of microblogs and can benefit many downstream knowledge mining and discovery tasks. We have introduced a series of approaches based on heterogeneous information networks (HINs) to achieve our goals. We have shown that mining and modeling HINs is also powerful in the field of NLP. Thus, this thesis sheds light on many other NLP tasks that can explore and leverage HINs. Some recent work has also demonstrated that modeling HINs is effective in other NLP tasks. For example, Yu et al. [138] adopted an idea similar to our tweet ranking framework and achieved state-of-the-art slot filling validation performance. The work in [139] directly modeled HINs with both content information and social networks to enhance information recommendation. We have mainly conducted experiments on microblog posts. However, many of the approaches proposed in this thesis can also be easily applied and adapted to data of other genres, especially data from social media. This is because heterogeneous types of information (e.g., social networks, retweeting and replying relations, or thread information) also exist in many other social media platforms such as Facebook and discussion forums. Moreover, our approaches are able to construct HINs directly from unstructured texts (e.g., our morph resolution and wikification systems). Our findings can be summarized as follows:
• For information ranking, directly modeling heterogeneous networks is more effective than modeling homogeneous networks. Performing cross-genre information analysis between formal-genre web documents and informal-genre microblogs improves the identification of informative posts such as news. Leveraging both explicit and inferred implicit social network relations helps detect informative tweets that meet the general interests of social users. Cross-genre information analysis and social user behavior analysis provide complementary evidence to enhance information ranking.
• Information brevity in each single microblog post brings unique challenges for
the tweet wikification tasks. It is crucial to expand microblog contexts with more
topically-related information. We showed that extracting semantic meta paths from
HINs is an effective way for context expansion. We also demonstrated that leverag-
ing heterogeneous types of relations including local compatibility based on a set of
local features, coreference and semantic relatedness relations enhance tweet wikifi-
cation. In addition, graph-based semi-supervised learning algorithms that perform
collective inference and make use of a large amount of unlabeled data save tremen-
dous annotation costs for this challenging task.
• Modeling topical coherence is crucial for the wikification task, and it requires accurate semantic relatedness measurement between concepts. We showed that semantic knowledge graphs are better resources than Wikipedia anchor links for relatedness measurement, since the latter contain more noisy links. Deep semantic models based on deep neural networks (DNNs) are also better choices than similarity measures that do not use semantics (e.g., Normalized Google Distance and the Vector Space Model). This is because a DNN exploits hierarchical structures with non-linear functions to extract useful hidden semantic features, and it can represent concepts with low-dimensional representations that capture their latent semantics. We further showed that encoding heterogeneous types of knowledge, including structured facts, concept types, and textual descriptions, into deep neural networks advances relatedness measurement.
• In morph decoding, we showed that heterogeneous networks provide a more effec-
tive way to model unstructured texts than homogeneous networks. By categorizing
the surrounding contexts of morphs and target entities into entities, events, and
other non-entity noun phrases, and capturing their discrepant contributions with
meta path-based heterogeneous information analysis approaches, we substantially
enhance morph resolution performance.
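The meta path-based analysis that recurs throughout these findings can be illustrated with a PathSim-style sketch. The names and the toy Tweet-Hashtag-Tweet network below are hypothetical; the meta paths used in the thesis span richer node types.

```python
from collections import defaultdict

def pathsim(path_counts, x, y):
    """PathSim similarity of x and y under one fixed meta path:
    2 * |paths x->y| / (|paths x->x| + |paths y->y|)."""
    xy = path_counts[x].get(y, 0)
    xx = path_counts[x].get(x, 0)
    yy = path_counts[y].get(y, 0)
    if xx + yy == 0:
        return 0.0
    return 2.0 * xy / (xx + yy)

# Toy HIN: tweets linked to hashtags; meta path Tweet-Hashtag-Tweet.
tweet_hashtags = {
    "t1": {"#sandy", "#nyc"},
    "t2": {"#sandy"},
    "t3": {"#election"},
}

# Count meta path instances Tweet -> Hashtag -> Tweet (shared hashtags).
counts = defaultdict(dict)
for a, tags_a in tweet_hashtags.items():
    for b, tags_b in tweet_hashtags.items():
        counts[a][b] = len(tags_a & tags_b)

print(pathsim(counts, "t1", "t2"))  # 2*1/(2+1) ≈ 0.667
print(pathsim(counts, "t1", "t3"))  # no shared hashtag -> 0.0
```

The self-normalization by the diagonal counts is what distinguishes PathSim from a raw path count: highly connected nodes do not dominate.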
6.2 Future Directions

Information Freshness Measurement In tweet ranking, we proposed to rank tweets
based on informativeness after applying temporal and spatial constraints to obtain an initial set of tweets on a topic. From the perspective of end users, information freshness is also a crucial factor in judging ranking quality. Our current approach has not taken this factor into consideration, although we have removed redundant tweets and penalized redundancy during ranking quality evaluation. Thus, a first natural extension is to incorporate information freshness into the ranking model. One approach is to measure information freshness based on temporal information and to select informative tweets that are not redundant with the informative tweets already selected.
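A hedged sketch of such an extension follows. The function names, the 1/(1 + age) freshness decay, and the weighting scheme are assumptions for illustration, not the thesis's model: tweets are greedily selected by a utility that rewards informativeness and freshness and penalizes redundancy with already-selected tweets, in the spirit of MMR-style selection.

```python
def rank_fresh(tweets, sim, k=3, alpha=0.7, beta=0.2):
    """Greedily select k tweets, trading off informativeness, freshness,
    and redundancy with tweets already selected.

    tweets: dicts with 'id', 'score' (informativeness in [0, 1]),
            'age' (hours since posting), plus whatever 'sim' needs.
    sim: function(tweet_a, tweet_b) -> similarity in [0, 1].
    """
    selected, candidates = [], list(tweets)
    while candidates and len(selected) < k:
        def utility(t):
            freshness = 1.0 / (1.0 + t["age"])  # newer posts score higher
            redundancy = max((sim(t, s) for s in selected), default=0.0)
            return (alpha * t["score"] + beta * freshness
                    - (1 - alpha - beta) * redundancy)
        best = max(candidates, key=utility)
        selected.append(best)
        candidates.remove(best)
    return [t["id"] for t in selected]

def jaccard(a, b):  # word-overlap redundancy measure (illustrative)
    return len(a["words"] & b["words"]) / len(a["words"] | b["words"])

tweets = [
    {"id": "a", "score": 0.9, "age": 1.0, "words": {"storm", "nyc"}},
    {"id": "b", "score": 0.6, "age": 0.5, "words": {"storm", "nyc", "flood"}},
    {"id": "c", "score": 0.4, "age": 2.0, "words": {"election"}},
]
print(rank_fresh(tweets, jaccard, k=2))  # ['a', 'b']
```

Here the redundancy term only compares against previously selected tweets, which matches the idea of selecting informative tweets that are not redundant with those chosen before them.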
NIL entity recognition and clustering in microblogs In tweet wikification, we focused only on detecting salient mentions that are linkable to Wikipedia, which has some limitations. A knowledge base such as Wikipedia is usually constructed manually and is not updated in a timely fashion, so many important concepts and facts are still missing from it. New information, however, emerges quickly, especially in microblogging, where information comes directly from millions of individuals and organizations. This makes NIL entity recognition and clustering necessary, since NIL entities can also be salient information in their specific contexts. NIL entity recognition and clustering was introduced in the Knowledge Base Population (KBP) track at TAC 2011, and existing successful approaches mainly leveraged unsupervised clustering algorithms and topic modeling, supervised approaches, string matching, and within-document coreference [79]. Another natural extension is therefore to adapt and enhance these existing approaches for microblogs by incorporating additional evidence (e.g., semantic meta paths) mined from HINs.
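As a minimal illustration of the string-matching baseline mentioned above (the normalization and function names are hypothetical; real KBP systems combine several signals), NIL mentions can be detected by a failed knowledge-base lookup and then clustered by normalized surface form:

```python
import re
from collections import defaultdict

def nil_cluster(mentions, kb_titles):
    """Split mentions into KB-linked ones and clustered NIL ones.

    kb_titles is assumed to already hold normalized titles.
    """
    def normalize(m):
        return re.sub(r"[^a-z0-9]", "", m.lower())

    linked, clusters = [], defaultdict(list)
    for m in mentions:
        if normalize(m) in kb_titles:
            linked.append(m)            # linkable to the KB
        else:
            clusters[normalize(m)].append(m)  # NIL: group by surface form
    return linked, dict(clusters)

linked, clusters = nil_cluster(["Obama", "FooCorp", "foo-corp"], {"obama"})
print(linked)    # ['Obama']
print(clusters)  # {'foocorp': ['FooCorp', 'foo-corp']}
```

Meta path-based evidence from HINs would then refine these coarse string-match clusters, e.g., by splitting clusters whose members occur in topically unrelated contexts.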
Richer and Cleaner Heterogeneous Information Network Construction To fully
leverage the power of heterogeneous network structures, it is crucial to construct HINs
with rich and clean information. One significant distinction between the fields of data mining and NLP is that NLP focuses more on processing unstructured texts; in many cases we need to construct HINs directly from such texts. Thus, one natural extension is to enhance the current approaches for HIN construction. In this thesis, we have leveraged well-developed NLP approaches, explored and proposed computational linguistic features, and leveraged existing HINs to detect more types of nodes and relations for HIN construction. We can further improve these approaches in several directions: (i) Leveraging existing web-scale semantic knowledge graphs (KGs). Semantic KGs such as Freebase and DBpedia contain a huge number of entities, relations, and facts from various domains. In this thesis, we have shown that semantic KGs are valuable resources for measuring semantic relatedness between concepts. Some recent work has successfully leveraged deep learning techniques to jointly model these KGs with unstructured texts for entity fact extraction [58]. These approaches aim to learn latent semantic representations of words, concepts, and relations such that the relationships between concepts in the KGs are preserved. We can leverage and extend these approaches to extract more types of nodes and relations directly from texts. Another direction is to use distant supervision with these KGs to develop web-scale extraction models. (ii) Inferring social network relations. During events of general interest such as natural disasters or political elections, social networks evolve and new communities form quickly [140]. In our study of tweet ranking, we also found that microblog information about such events tends to be posted by users from diverse communities among whom few explicit social network linkages exist. To construct HINs with rich information, it is therefore crucial to infer more implicit social relations by automatically discovering social communities and identifying social leaders and influencers.
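Constructing an HIN from extracted nodes and relations amounts to maintaining a typed graph. A minimal container might look like the following; the class and relation names are illustrative, not the thesis's implementation:

```python
from collections import defaultdict

class HIN:
    """Minimal heterogeneous information network: typed nodes, typed edges."""

    def __init__(self):
        self.node_type = {}            # node id -> node type
        self.edges = defaultdict(set)  # (node, relation) -> neighbor set

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, src, rel, dst):
        self.edges[(src, rel)].add(dst)
        self.edges[(dst, rel + "^-1")].add(src)  # keep the inverse relation

    def neighbors(self, node, rel):
        return self.edges[(node, rel)]

hin = HIN()
hin.add_node("u1", "User"); hin.add_node("t1", "Tweet"); hin.add_node("Sandy", "Entity")
hin.add_edge("u1", "posts", "t1")
hin.add_edge("t1", "mentions", "Sandy")
print(hin.neighbors("t1", "mentions"))  # {'Sandy'}
```

Storing inverse relations explicitly makes it easy to enumerate meta path instances such as User -posts- Tweet -mentions- Entity by chaining `neighbors` calls in either direction.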
Better Modeling and Mining Approaches In this thesis, we have separately encoded heterogeneous types of knowledge from semantic KGs into DNNs by placing each type of knowledge into a different block of dimensions of the input feature vector. It would be interesting to directly encode semantic meta paths, or even a subgraph, into neural networks in order to capture semantics more effectively. Possible solutions include leveraging other types of neural networks, such as convolutional neural networks and recursive neural networks. In addition, linguistic knowledge and features are crucial for many NLP tasks. Thus, another interesting extension is to design a unified framework that automatically learns the weights associated with both linguistic features and HIN structures. Inspired by the work on clustering and topic modeling with network structures [141], [142], we can explore probabilistic models with a joint objective function combining linguistic features with regularization from HIN structures.
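The joint objective suggested above can be sketched as a fitting term over linguistic features plus a smoothness regularizer over HIN edges, in the spirit of network-regularized topic models [141]. The exact form below, including the squared-loss fit and the quadratic smoothness penalty, is an assumption for illustration:

```python
import numpy as np

def joint_objective(F, X, W, A, lam=0.5):
    """Fit node scores F to linguistic features X (via weights W), plus a
    graph regularizer that smooths F over the HIN affinity matrix A.

    F: (n,) node scores; X: (n, d) features; W: (d,) feature weights;
    A: (n, n) symmetric affinity matrix; lam: regularization weight.
    """
    fit = np.sum((F - X @ W) ** 2)                       # linguistic-feature term
    i, j = np.nonzero(A)
    smooth = np.sum(A[i, j] * (F[i] - F[j]) ** 2) / 2.0  # HIN smoothness term
    return fit + lam * smooth

F = np.array([1.0, 0.0])                # scores for two nodes
X = np.array([[1.0], [0.0]])            # one linguistic feature per node
W = np.array([1.0])
A = np.array([[0.0, 1.0], [1.0, 0.0]])  # the two nodes are HIN neighbors
print(joint_objective(F, X, W, A))      # fit 0.0 + 0.5 * smoothness 1.0 = 0.5
```

Minimizing such an objective jointly over F and W pushes connected HIN nodes toward similar scores while still respecting the linguistic evidence, which is exactly the coupling the unified framework would need.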
REFERENCES
[1] Twitter Inc., “Twitter.” [Online]. Available: https://twitter.com/, (Date Last Accessed March 7, 2015).
[2] Sina Corp., “Sina weibo.” [Online]. Available: http://weibo.com/, (Date Last Accessed March 7, 2015).
[3] A. Java, X. Song, T. Finin, and B. Tseng, “Why we twitter: Understanding microblogging usage and communities,” in Proc. of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Anal., New York, NY, USA, 2007, pp. 56–65.
[4] H. Kwak, C. Lee, H. Park, and S. Moon, “What is twitter, a social network or a news media?” in Proc. of the 19th Int. Conf. on WWW, New York, NY, USA, 2010, pp. 591–600.
[5] A. Zubiaga, D. Spina, V. Fresno, and R. Martínez, “Classifying trending topics: A typology of conversation triggers on twitter,” in Proc. of the 20th ACM Int. Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2011, pp. 2461–2464.
[6] Wikipedia, “Wikipedia.” [Online]. Available: http://www.wikipedia.org/, (Date Last Accessed March 7, 2015).
[7] Pear Analytics, “Pear analytics twitter study,” 2009. [Online]. Available: http://pearanalytics.com/wp-content/uploads/2012/12/Twitter-Study-August-2009.pdf, (Date Last Accessed March 12, 2015).
[8] N. Diakopoulos, M. De Choudhury, and M. Naaman, “Finding and assessing social media information sources in the context of journalism,” in Proc. of the SIGCHI Conf. on Human Factors in Computing Syst., New York, NY, USA, 2012, pp. 2451–2460.
[9] DBpedia, “Dbpedia.” [Online]. Available: http://dbpedia.org/, (Date Last Accessed March 7, 2015).
[10] Freebase, “Freebase.” [Online]. Available: https://www.freebase.com/, (Date Last Accessed March 7, 2015).
[11] R. Mihalcea and A. Csomai, “Wikify!: Linking documents to encyclopedic knowledge,” in Proc. of the 16th ACM Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2007, pp. 233–242.
[12] L. Ratinov and D. Roth, “Learning-based multi-sieve co-reference resolution with knowledge,” in Proc. of the 2012 Joint Conf. on Empirical Methods in Natural Language Process. and Comput. Natural Language Learn., Jeju Island, Korea, 2012, pp. 1234–1244.
[13] D. Vitale, P. Ferragina, and U. Scaiella, “Classification of short texts by deploying topical annotations,” in Proc. of the 34th European Conf. on Advances in Inform. Retrieval, Berlin, Heidelberg, 2012, pp. 376–387.
[14] M. Michelson and S. A. Macskassy, “Discovering users’ topics of interest on twitter: A first look,” in Proc. of the 4th Workshop on Analytics for Noisy Unstructured Text Data, New York, NY, USA, 2010, pp. 73–80.
[15] Z. Xu, L. Ru, L. Xiang, and Q. Yang, “Discovering user interest on twitter with a modified author-topic model,” in Proc. of the 2011 IEEE/WIC/ACM Int. Conf. on Web Intell. and Intelligent Agent Technology, Washington, DC, USA, 2011, pp. 422–429.
[16] P. Hsiung, A. Moore, D. Neill, and J. Schneider, “Alias detection in link data sets,” in Proc. of the Int. Conf. on Intell. Anal., McLean, VA, USA, 2005, pp. 1–6.
[17] P. Pantel, “Alias detection in malicious environments,” in AAAI Fall Symp. on Capturing and Using Patterns for Evidence Detection, Menlo Park, CA, USA, 2006, pp. 14–20.
[18] H. Deng, M. R. Lyu, and I. King, “A generalized co-hits algorithm and its application to bipartite graphs,” in Proc. of the 15th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, New York, NY, USA, 2009, pp. 239–248.
[19] Y. Sun, Y. Yu, and J. Han, “Ranking-based clustering of heterogeneous information networks with star network schema,” in Proc. of the 15th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, New York, NY, USA, 2009, pp. 797–806.
[20] M. Ji, Y. Sun, M. Danilevsky, J. Han, and J. Gao, “Graph regularized transductive classification on heterogeneous information networks,” in Proc. of the 2010 European Conf. on Mach. Learning and Knowl. Discovery in Databases, Berlin, Heidelberg, 2010, pp. 570–586.
[21] X. Kong, P. S. Yu, Y. Ding, and D. J. Wild, “Meta path-based collective classification in heterogeneous information networks,” in Proc. of the 21st ACM Int. Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2012, pp. 1567–1571.
[22] Y. Sun, B. Norick, J. Han, X. Yan, P. S. Yu, and X. Yu, “Pathselclus: Integrating meta-path selection with user-guided object clustering in heterogeneous information networks,” ACM Trans. Knowl. Discov. Data, vol. 7, no. 3, pp. 11:1–11:23, Sep. 2013.
[23] Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, “Co-author relationship prediction in heterogeneous bibliographic networks,” in Proc. of the 2011 Int. Conf. on Advances in Social Networks Anal. and Mining, Washington, DC, USA, 2011, pp. 121–128.
[24] Y. Sun, J. Han, X. Yan, P. Yu, and T. Wu, “Pathsim: Meta path-based top-k similarity search in heterogeneous information networks,” Proc. of the VLDB Endow., vol. 4, no. 11, pp. 992–1003, Aug. 2011.
[25] Y. Sun and J. Han, “Mining heterogeneous information networks: A structural analysis approach,” SIGKDD Explor. Newsl., vol. 14, no. 2, pp. 20–28, Apr. 2013.
[26] R. Mihalcea and P. Tarau, “Textrank: Bringing order into texts,” in Proc. of the 2004 Conf. on Empirical Methods in Natural Language Process., Barcelona, Spain, 2004, pp. 404–411.
[27] G. Erkan and D. R. Radev, “Lexrank: Graph-based lexical centrality as salience in text summarization,” J. Artif. Int. Res., vol. 22, no. 1, pp. 457–479, Dec. 2004.
[28] X. Han, L. Sun, and J. Zhao, “Collective entity linking in web text: A graph-based method,” in Proc. of the 34th Int. ACM SIGIR Conf. on Res. and Development in Inform. Retrieval, New York, NY, USA, 2011, pp. 765–774.
[29] Z.-Y. Niu, D.-H. Ji, and C. L. Tan, “Word sense disambiguation using label propagation based semi-supervised learning,” in Proc. of the 43rd Annu. Meeting of the Assoc. for Comput. Linguist., Ann Arbor, Michigan, 2005, pp. 395–402.
[30] J. Chen, D. Ji, C. L. Tan, and Z. Niu, “Relation extraction using label propagation based semi-supervised learning,” in Proc. of the 21st Int. Conf. on Comput. Linguist. and 44th Annu. Meeting of the Assoc. for Comput. Linguist., Sydney, Australia, 2006, pp. 129–136.
[31] T. Cassidy, H. Ji, L.-A. Ratinov, A. Zubiaga, and H. Huang, “Analysis and enhancement of wikification for microblogs with context expansion,” in Proc. of the 24th Int. Conf. on Comput. Linguist., Mumbai, India, 2012, pp. 441–456.
[32] L. Ratinov, D. Roth, D. Downey, and M. Anderson, “Local and global algorithms for disambiguation to wikipedia,” in Proc. of the 49th Annu. Meeting of the Assoc. for Comput. Linguist.: Human Language Technologies, Portland, OR, USA, 2011, pp. 1375–1384.
[33] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in Proc. of the 20th Int. Conf. on Mach. Learn., Washington, DC, USA, 2003, pp. 912–919.
[34] A. J. Smola and I. R. Kondor, “Kernels and regularization on graphs,” in Proc. of the Annu. Conf. on Comput. Learn. Theory, Washington, DC, USA, 2003, pp. 144–158.
[35] A. Blum, J. Lafferty, M. R. Rwebangira, and R. Reddy, “Semi-supervised learning using randomized mincuts,” in Proc. of the 21st Int. Conf. on Mach. Learn., New York, NY, USA, 2004, pp. 13–20.
[36] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, “Learning with local and global consistency,” in Advances in Neural Inform. Process. Syst. 16, Vancouver, Canada, 2004, pp. 321–328.
[37] P. P. Talukdar and K. Crammer, “New regularized algorithms for transductive learning,” in Proc. of the European Conf. on Mach. Learn. and Knowl. Discovery in Databases, Berlin, Heidelberg, 2009, pp. 442–457.
[38] D. Milne and I. Witten, “An effective, low-cost measure of semantic relatedness obtained from wikipedia links,” in Proc. of the 23rd Conf. on Artif. Intell., Chicago, IL, USA, 2008, pp. 25–30.
[39] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Nov. 2011.
[40] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning deep structured semantic models for web search using clickthrough data,” in Proc. of the 22nd ACM Int. Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2013, pp. 2333–2338.
[41] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, Mar. 2003.
[42] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in Advances in Neural Inform. Process. Syst. 26, Lake Tahoe, NV, USA, 2013, pp. 2787–2795.
[43] R. Socher, D. Chen, C. Manning, and A. Ng, “Reasoning with neural tensor networks for knowledge base completion,” in Advances in Neural Inform. Process. Syst. 26, Lake Tahoe, NV, USA, 2013, pp. 926–934.
[44] DBLP, “Dblp.” [Online]. Available: http://dblp.uni-trier.de/, (Date Last Accessed March 8, 2015).
[45] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web,” in Proc. of the 7th Int. Conf. on WWW, Brisbane, Australia, 1998, pp. 161–172.
[46] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” J. ACM, vol. 46, no. 5, pp. 604–632, Sep. 1999.
[47] T. Haveliwala, S. Kamvar, and G. Jeh, “An analytical comparison of approaches to personalizing pagerank,” Stanford InfoLab, Menlo Park, CA, USA, Tech. Rep. 2003-35, June 2003.
[48] D. Liben-Nowell and J. Kleinberg, “The link prediction problem for social networks,” in Proc. of the Int. Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2003, pp. 556–559.
[49] G. Jeh and J. Widom, “Simrank: A measure of structural-context similarity,” in Proc. of the 8th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, New York, NY, USA, 2002, pp. 538–543.
[50] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.
[51] P. Sen, G. M. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad, “Collective classification in network data,” AI Mag., vol. 29, no. 3, pp. 93–106, Sep. 2008.
[52] Y. Duan, F. Wei, M. Zhou, and H.-Y. Shum, “Graph-based collective classification for tweets,” in Proc. of the 21st ACM Int. Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2012, pp. 2323–2326.
[53] D. Hakkani-Tur, L. Heck, and G. Tur, “Using a knowledge graph and query click logs for unsupervised learning of relation detection,” in Proc. of the 2013 IEEE Int. Conf. on Acoust., Speech and Signal Process., Vancouver, BC, Canada, 2013, pp. 8327–8331.
[54] L. Heck, D. Hakkani-Tur, and G. Tur, “Leveraging knowledge graphs for web-scale unsupervised semantic parsing,” in Proc. of Conf. of the Int. Speech Commun. Assoc., Lyon, France, 2013, pp. 1594–1598.
[55] H. Hajishirzi, L. Zilles, D. Weld, and L. S. Zettlemoyer, “Joint coreference resolution and named-entity linking with multi-pass sieves,” in Proc. of the 2013 Conf. on Empirical Methods in Natural Language Process., Seattle, WA, USA, 2013, pp. 289–299.
[56] S. Dutta and G. Weikum, “Cross-document co-reference resolution using sample-based clustering with knowledge enrichment,” Trans. of the Assoc. for Comput. Linguist., vol. 3, no. 1, pp. 15–28, Jan. 2015.
[57] A. Bordes, X. Glorot, J. Weston, and Y. Bengio, “Joint learning of words and meaning representations for open-text semantic parsing,” in Proc. of the 15th Int. Conf. on Artif. Intell. and Stat., La Palma, Spain, 2012, pp. 127–135.
[58] Z. Wang, J. Zhang, J. Feng, and Z. Chen, “Knowledge graph and text jointly embedding,” in Proc. of the 2014 Conf. on Empirical Methods in Natural Language Process., Doha, Qatar, 2014, pp. 1591–1601.
[59] A. Bordes, S. Chopra, and J. Weston, “Question answering with subgraph embeddings,” in Proc. of the 2014 Conf. on Empirical Methods in Natural Language Process., Doha, Qatar, 2014, pp. 615–620.
[60] M. Yang, N. Duan, M. Zhou, and H. Rim, “Joint relational embeddings for knowledge-based question answering,” in Proc. of the 2014 Conf. on Empirical Methods in Natural Language Process., Doha, Qatar, 2014, pp. 645–650.
[61] R. L. Cilibrasi and P. M. B. Vitanyi, “The google similarity distance,” IEEE Trans. on Knowl. and Data Eng., vol. 19, no. 3, pp. 370–383, Mar. 2007.
[62] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling, “Twitterstand: News in tweets,” in Proc. of the 17th ACM SIGSPATIAL Int. Conf. on Advances in Geographic Inform. Syst., New York, NY, USA, 2009, pp. 42–51.
[63] S. A. Golder, A. Marwick, and S. Yardi, “A structural approach to contact recommendations in online social networks,” in Proc. SIGIR 2009 Workshop on Search in Social Media, Boston, MA, USA, 2009, pp. 1–4.
[64] Y. Yamaguchi, T. Takahashi, T. Amagasa, and H. Kitagawa, “Turank: Twitter user ranking based on user-tweet graph analysis,” in Proc. of the 11th Int. Conf. on Web Inform. Syst. Eng., Berlin, Heidelberg, 2010, pp. 240–253.
[65] J. Hannon, M. Bennett, and B. Smyth, “Recommending twitter users to follow using content and collaborative filtering approaches,” in Proc. of the 4th ACM Conf. on Recommender Syst., New York, NY, USA, 2010, pp. 199–206.
[66] I. Uysal and W. B. Croft, “User oriented tweet ranking: A filtering approach to microblogs,” in Proc. of the 20th ACM Int. Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2011, pp. 2261–2264.
[67] Y. Duan, L. Jiang, T. Qin, M. Zhou, and H.-Y. Shum, “An empirical study on learning to rank of tweets,” in Proc. of the 23rd Int. Conf. on Comput. Linguist., Stroudsburg, PA, USA, 2010, pp. 295–303.
[68] M. Huang, Y. Yang, and X. Zhu, “Quality-biased ranking of short texts in microblogging services,” in Proc. of the 5th Int. Joint Conf. on Natural Language Process., Chiang Mai, Thailand, 2011, pp. 373–382.
[69] D. Inouye and J. K. Kalita, “Comparing twitter summarization algorithms,” in Proc. of the 2011 IEEE 3rd Int. Conf. on Social Computing, Boston, MA, USA, 2011, pp. 298–306.
[70] C. Castillo, M. Mendoza, and B. Poblete, “Information credibility on twitter,” in Proc. of the 20th Int. Conf. on WWW, New York, NY, USA, 2011, pp. 675–684.
[71] M. Gupta, P. Zhao, and J. Han, “Evaluating event credibility on twitter,” in Proc. of the Twelfth SIAM Int. Conf. on Data Mining, Anaheim, CA, USA, 2012, pp. 153–164.
[72] D. Wang, T. Abdelzaher, H. Ahmadi, J. Pasternack, D. Roth, M. Gupta, J. Han, O. Fatemieh, H. Le, and C. Aggrawal, “On bayesian interpretation of fact-finding in information networks,” in Proc. of the 14th Int. Conf. on Inform. Fusion, Chicago, IL, USA, 2011, pp. 1–8.
[73] D. Wang, L. Kaplan, H. Le, and T. Abdelzaher, “On truth discovery in social sensing: A maximum likelihood estimation approach,” in Proc. of the 11th Int. Conf. on Inform. Process. in Sensor Networks, New York, NY, USA, 2012, pp. 233–244.
[74] M. Gupta and J. Han, “Heterogeneous network-based trust analysis: A survey,” SIGKDD Explor. Newsl., vol. 13, no. 1, pp. 54–71, Aug. 2011.
[75] J. Weng, E.-P. Lim, J. Jiang, and Q. He, “Twitterrank: Finding topic-sensitive influential twitterers,” in Proc. of the 3rd ACM Int. Conf. on Web Search and Data Mining, New York, NY, USA, 2010, pp. 261–270.
[76] A. Pal and S. Counts, “Identifying topical authorities in microblogs,” in Proc. of the 4th ACM Int. Conf. on Web Search and Data Mining, New York, NY, USA, 2011, pp. 45–54.
[77] D. M. Romero, W. Galuba, S. Asur, and B. A. Huberman, “Influence and passivity in social media,” in Proc. of the 20th Int. Conf. Companion on WWW, New York, NY, USA, 2011, pp. 113–114.
[78] H. Ji, R. Grishman, H. Dang, K. Griffitt, and J. Ellis, “Overview of the tac 2010 knowledge base population track,” in Text Anal. Conf., Gaithersburg, MD, USA, 2010, pp. 1–25.
[79] H. Ji, R. Grishman, and H. Dang, “Overview of the tac 2011 knowledge base population track,” in Text Anal. Conf., Gaithersburg, MD, USA, 2011, pp. 1–33.
[80] D. Milne and I. H. Witten, “Learning to link with wikipedia,” in Proc. of the 17th ACM Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2008, pp. 509–518.
[81] X. Han and L. Sun, “A generative entity-mention model for linking entities with knowledge base,” in Proc. of the 49th Annu. Meeting of the Assoc. for Comput. Linguist.: Human Language Technologies, Portland, OR, USA, 2011, pp. 945–954.
[82] S. Cucerzan, “Large-scale named entity disambiguation based on Wikipedia data,” in Proc. of the 2007 Joint Conf. on Empirical Methods in Natural Language Process. and Comput. Natural Language Learn., Prague, Czech Republic, 2007, pp. 708–716.
[83] X. Han and J. Zhao, “Named entity disambiguation by leveraging wikipedia semantic knowledge,” in Proc. of the 18th ACM Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2009, pp. 215–224.
[84] S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti, “Collective annotation of wikipedia entities in web text,” in Proc. of the 15th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, New York, NY, USA, 2009, pp. 457–466.
[85] M. Pennacchiotti and P. Pantel, “Entity extraction via ensemble semantics,” in Proc. of the 2009 Conf. on Empirical Methods in Natural Language Process., Singapore, 2009, pp. 238–247.
[86] P. Ferragina and U. Scaiella, “Tagme: On-the-fly annotation of short text fragments (by wikipedia entities),” in Proc. of the 19th ACM Int. Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2010, pp. 1625–1628.
[87] Y. Guo, W. Che, T. Liu, and S. Li, “A graph-based method for entity linking,” in Proc. of the 5th Int. Joint Conf. on Natural Language Process., Chiang Mai, Thailand, 2011, pp. 1010–1018.
[88] Z. Chen and H. Ji, “Collaborative ranking: A case study on entity linking,” in Proc. of the 2011 Conf. on Empirical Methods in Natural Language Process., Edinburgh, Scotland, UK, 2011, pp. 771–781.
[89] Z. Kozareva, K. Voevodski, and S. Teng, “Class label enhancement via related instances,” in Proc. of the 2011 Conf. on Empirical Methods in Natural Language Process., Edinburgh, Scotland, UK, 2011, pp. 118–128.
[90] W. Shen, J. Wang, P. Luo, and M. Wang, “Linking named entities in tweets with knowledge base via user interest modeling,” in Proc. of the 19th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, New York, NY, USA, 2013, pp. 68–76.
[91] X. Liu, Y. Li, H. Wu, M. Zhou, F. Wei, and Y. Lu, “Entity linking for tweets,” in Proc. of the 51st Annu. Meeting of the Assoc. for Comput. Linguist., Sofia, Bulgaria, 2013, pp. 1304–1311.
[92] E. Meij, W. Weerkamp, and M. de Rijke, “Adding semantics to microblog posts,” in Proc. of the 5th ACM Int. Conf. on Web Search and Data Mining, New York, NY, USA, 2012, pp. 563–572.
[93] S. Guo, M.-W. Chang, and E. Kiciman, “To link or not to link? A study on end-to-end tweet entity linking,” in Proc. of the 2013 Conf. of the North Amer. Chapter of the Assoc. for Comput. Linguist.: Human Language Technologies, Atlanta, GA, USA, 2013, pp. 1020–1030.
[94] D. Bamman, B. O’Connor, and N. A. Smith, “Censorship and deletion practices in chinese social media,” First Monday, vol. 17, no. 3, pp. 1–21, Mar. 2012.
[95] Y. Xia, K.-F. Wong, and W. Gao, “Nil is not nothing: Recognition of chinese network informal language expressions,” in Proc. of the 4th SIGHAN Workshop on Chinese Language Process., Jeju Island, Korea, 2005, pp. 95–102.
[96] Y. Xia and K.-F. Wong, “Anomaly detecting within dynamic chinese chat text,” in Proc. Workshop on New Text Wikis and Blogs and Other Dynamic Text Sources, Trento, Italy, 2006, pp. 48–55.
[97] Y. Xia, K.-F. Wong, and W. Li, “A phonetic-based approach to chinese chat text normalization,” in Proc. of the 21st Int. Conf. on Comput. Linguist. and 44th Annu. Meeting of the Assoc. for Comput. Linguist., Sydney, Australia, 2006, pp. 993–1000.
[98] Z. Li and D. Yarowsky, “Mining and modeling relations between formal and informal chinese phrases from web corpora,” in Proc. of the Conf. on Empirical Methods in Natural Language Process., Stroudsburg, PA, USA, 2008, pp. 1031–1040.
[99] A. Wang, M.-Y. Kan, D. Andrade, T. Onishi, and K. Ishikawa, “Chinese informal word normalization: An experimental study,” in Proc. of the 6th Int. Joint Conf. on Natural Language Process., Nagoya, Japan, 2013, pp. 127–135.
[100] A. Wang and M.-Y. Kan, “Mining informal language from chinese microtext: Joint word recognition and segmentation,” in Proc. of the 51st Annu. Meeting of the Assoc. for Comput. Linguist., Sofia, Bulgaria, 2013, pp. 731–741.
[101] D. Bollegala, Y. Matsuo, and M. Ishizuka, “Automatic discovery of personal name aliases from the web,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 6, pp. 831–844, Apr. 2011.
[102] R. Holzer, B. Malin, and L. Sweeney, “Email alias detection using social network analysis,” in Proc. of the 3rd Int. Workshop on Link Discovery, New York, NY, USA, 2005, pp. 52–57.
[103] I. Couzin, “Collective minds,” Nature, vol. 445, no. 7129, p. 715, Feb. 2007.
[104] A. Zubiaga, D. Spina, E. Amigo, and J. Gonzalo, “Towards real-time summarization of scheduled events from Twitter streams,” in Proc. of the 23rd ACM Conf. on Hypertext and Social Media, New York, NY, USA, 2012, pp. 319–320.
[105] Microsoft Corp., “Bing search api.” [Online]. Available: http://www.bing.com/toolbox/bingdeveloper/, (Date Last Accessed March 15, 2015).
[106] M. Hunter, “Twitter slang words.” [Online]. Available: http://www.mltcreative.com/blog/bid/54272/Social-Media-Minute-Big-A-List-of-Twitter-Slang-and-Definition, (Date Last Accessed March 15, 2015).
[107] B. Carterette and P. Chandar, “Probabilistic models of ranking novel documents for faceted topic retrieval,” in Proc. of the 18th ACM Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2009, pp. 1287–1296.
[108] R. McDonald, “A study of global inference algorithms in multi-document summarization,” in Proc. of the 29th European Conf. on IR Res., Berlin, Heidelberg, 2007, pp. 557–564.
[109] F. M. Zanzotto, M. Pennacchiotti, and K. Tsioutsiouliklis, “Linguistic redundancy in twitter,” in Proc. of the Conf. on Empirical Methods in Natural Language Process., Stroudsburg, PA, USA, 2011, pp. 659–669.
[110] K. Jarvelin and J. Kekalainen, “Cumulated gain-based evaluation of ir techniques,” ACM Trans. Inf. Syst., vol. 20, no. 4, pp. 422–446, Oct. 2002.
[111] G. A. Miller, “Wordnet: A lexical database for english,” Commun. ACM, vol. 38, no. 11, pp. 39–41, Nov. 1995.
[112] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, “Class-based n-gram models of natural language,” Comput. Linguist., vol. 18, no. 4, pp. 467–479, Dec. 1992.
[113] R. Bunescu, “Using encyclopedic knowledge for named entity disambiguation,” in Proc. of the 11th Conf. of the European Chapter of the Assoc. for Comput. Linguist., Trento, Italy, 2006, pp. 9–16.
[114] B. Hachey, W. Radford, J. Nothman, M. Honnibal, and J. R. Curran, “Evaluating entity linking with wikipedia,” Artif. Intell., vol. 194, pp. 130–150, Jan. 2013.
[115] K. Wang, C. Thrasher, and B.-J. P. Hsu, “Web scale nlp: A case study on url word breaking,” in Proc. of the 20th Int. Conf. on WWW, New York, NY, USA, 2011, pp. 357–366.
[116] Z. He, S. Liu, M. Li, M. Zhou, L. Zhang, and H. Wang, “Learning entity representation for entity disambiguation,” in Proc. of the 51st Annu. Meeting of the Assoc. for Comput. Linguist., Sofia, Bulgaria, 2013, pp. 30–34.
[117] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil, “Learning semantic representations using convolutional neural networks for web search,” in Proc. of the Companion Publication of the 23rd Int. Conf. on WWW, Republic and Canton of Geneva, Switzerland, 2014, pp. 373–374.
[118] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, “Exploring strategies for training deep neural networks,” J. Mach. Learn. Res., vol. 10, no. 1, pp. 1–40, Jun. 2009.
[119] X. Cheng and D. Roth, “Relational inference for wikification,” in Proc. of the 2013 Conf. on Empirical Methods in Natural Language Process., Seattle, WA, USA, 2013, pp. 1787–1796.
[120] D. Ceccarelli, C. Lucchese, S. Orlando, R. Perego, and S. Trani, “Learning relatedness measures for entity linking,” in Proc. of the 22nd ACM Int. Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2013, pp. 139–148.
[121] C. D. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval, 1st ed. New York, NY, USA: Cambridge University Press, 2008.
[122] J. Hoffart, M. Yosef, I. Bordino, H. Furstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum, “Robust disambiguation of named entities in text,” in Proc. of the 2011 Conf. on Empirical Methods in Natural Language Process., Edinburgh, Scotland, UK, 2011, pp. 782–792.
[123] M. Shirakawa, H. Wang, Y. Song, Z. Wang, K. Nakayama, T. Hara, and S. Nishio, “Entity disambiguation based on a probabilistic taxonomy,” Microsoft Research, Seattle, WA, USA, Tech. Rep. MSR-TR-2011-125, 2011.
[124] G. Salton, A. Wong, and C. S. Yang, “A vector space model for automatic indexing,” Commun. ACM, vol. 18, no. 11, pp. 613–620, Nov. 1975.
[125] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, Sep. 1995.
[126] Q. Li, H. Li, H. Ji, W. Wang, J. Zheng, and F. Huang, “Joint bilingual name tagging for parallel corpora,” in Proc. of the 21st ACM Int. Conf. on Inform. and Knowl. Manage., New York, NY, USA, 2012, pp. 1727–1731.
[127] V. Ng, “Supervised noun phrase coreference research: The first fifteen years,” in Proc. of the 48th Annu. Meeting of the Assoc. for Comput. Linguist., Uppsala, Sweden, 2010, pp. 1396–1411.
[128] R. A. Wagner and M. J. Fischer, “The string-to-string correction problem,” J. ACM, vol. 21, no. 1, pp. 168–173, Jan. 1974.
[129] D. S. Hirschberg, “Algorithms for the longest common subsequence problem,” J. ACM, vol. 24, no. 4, pp. 664–675, Oct. 1977.
[130] P.-C. Chang, M. Galley, and C. D. Manning, “Optimizing chinese word segmentation for machine translation performance,” in Proc. of the 3rd Workshop on Statistical Mach. Translation, Columbus, OH, USA, 2008, pp. 224–232.
[131] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer, “Feature-rich part-of-speech tagging with a cyclic dependency network,” in Proc. of the 2003 Conf. of the North Amer. Chapter of the Assoc. for Comput. Linguist. on Human Language Technology, Edmonton, Alberta, Canada, 2003, pp. 173–180.
[132] H.-P. Zhang, H.-K. Yu, D.-Y. Xiong, and Q. Liu, “Hhmm-based chinese lexical analyzer ictclas,” in Proc. of the 2nd SIGHAN Workshop on Chinese Language Process., Stroudsburg, PA, USA, 2003, pp. 184–187.
[133] H. Ji and R. Grishman, “Refining event extraction through cross-document inference,” in Proc. of the 46th Annu. Meeting of the Assoc. for Comput. Linguist., Columbus, OH, USA, 2008, pp. 254–262.
[134] A. Anagnostopoulos, R. Kumar, and M. Mahdian, “Influence and correlation in social networks,” in Proc. of the 14th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, New York, NY, USA, 2008, pp. 7–15.
[135] Z. Wen and C.-Y. Lin, “On the quality of inferring interests from social neighbors,” in Proc. of the 16th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, New York, NY, USA, 2010, pp. 373–382.
[136] C. Lin, L. Wu, Z. Wen, H. Tong, V. Griffiths-Fisher, L. Shi, and D. Lubensky, “Social network analysis in enterprise,” Proc. of the IEEE, vol. 100, no. 9, pp. 2759–2776, Jul. 2012.
[137] T. Hofmann, “Probabilistic latent semantic indexing,” in Proc. of the 22nd Annu. Int. ACM SIGIR Conf. on Res. and Development in Inform. Retrieval, New York, NY, USA, 1999, pp. 50–57.
[138] D. Yu, H. Huang, T. Cassidy, H. Ji, C. Wang, S. Zhi, J. Han, C. Voss, and M. Magdon-Ismail, “The wisdom of minority: Unsupervised slot filling validation based on multi-dimensional truth-finding,” in Proc. of the 25th Int. Conf. on Comput. Linguist., Dublin, Ireland, 2014, pp. 1567–1578.
[139] Q. Zhang and H. Wang, “Collaborative topic regression with multiple graphs factorization for recommendation in social media,” in Proc. of the 25th Int. Conf. on Comput. Linguist., Dublin, Ireland, 2014, pp. 233–244.
[140] Y. Tyshchuk, H. Li, H. Ji, and W. A. Wallace, “Evolution of communities on twitter and the role of their leaders during emergencies,” in Proc. of the 2013 IEEE/ACM Int. Conf. on Advances in Social Networks Analysis and Mining, New York, NY, USA, 2013, pp. 727–733.
[141] Q. Mei, D. Cai, D. Zhang, and C. Zhai, “Topic modeling with network regularization,” in Proc. of the 17th Int. Conf. on WWW, New York, NY, USA, 2008, pp. 101–110.
[142] Y. Sun, C. C. Aggarwal, and J. Han, “Relation strength-aware clustering of heterogeneous information networks with incomplete attributes,” Proc. VLDB Endow., vol. 5, no. 5, pp. 394–405, Jan. 2012.