Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 25–30, Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics

Visualizing and Curating Knowledge Graphs over Time and Space

Tong Ge¹, Yafang Wang¹*, Gerard de Melo², Haofeng Li¹, Baoquan Chen¹

¹Shandong University, China; ²Tsinghua University, China

* The corresponding author: [email protected]

Abstract

Publicly available knowledge repositories, such as Wikipedia and Freebase, benefit significantly from volunteers, whose contributions ensure that the knowledge keeps expanding and is kept up-to-date and accurate. User interactions are often limited to hypertext, tabular, or graph visualization interfaces. For spatio-temporal information, however, other interaction paradigms may be better-suited. We present an integrated system that combines crowdsourcing, automatic or semi-automatic knowledge harvesting from text, and visual analytics. It enables users to analyze large quantities of structured data and unstructured textual data from a spatio-temporal perspective and gain deep insights that are not easily observed in individual facts.

1 Introduction

There has been an unprecedented growth of publicly available knowledge repositories such as the Open Directory, Wikipedia, Freebase, etc. Many additional knowledge bases and knowledge graphs are built upon these, including DBpedia, YAGO, and Google’s Knowledge Graph. Such repositories benefit significantly from human volunteers, whose contributions ensure that the knowledge keeps expanding and is kept up-to-date and accurate.

Despite the massive growth of such structured data, user interactions are often limited to simple browsing interfaces, showing encyclopedic text with hyperlinks, tabular listings, or graph visualizations. Sometimes, however, users may seek a spatio-temporal perspective of such knowledge. Given that the spatio-temporal dimensions are fundamental with respect to both the physical world and human cognition, they constitute more than just a particular facet of human knowledge. Of course, there has been ample previous work on spatio-temporal visualization. However, most previous work either deals with social media (Ardon et al., 2013) rather than knowledge repositories, or focuses on geo-located entities such as buildings, cities, and so on (Hoffart et al., 2011a).

From a data analytics perspective, however, much other knowledge can also be analyzed spatio-temporally. For example, given a person like Napoleon or a disease such as the Bubonic Plague, we may wish to explore relevant geographical distributions. This notion of spatio-temporal analytics goes beyond simple geolocation and time metadata.

In fact, the relevant spatio-temporal cues may need to be extracted from text. Unfortunately, accurate spatio-temporal extraction is also a challenging task (Wang et al., 2011b). Most existing information extraction tools neglect spatio-temporal information and tend to produce very noisy extractions.

It appears that the best strategy is to put the human in the loop by combining knowledge harvesting with methods to refine the extractions, similar to YALI (Wang et al., 2013), a browser plug-in that calls AIDA (Hoffart et al., 2011b) for named entity recognition and disambiguation (NERD) in a real-time manner. That system transparently collects user feedback to gather statistics on name-entity pairs and implicit training data for improving NERD accuracy.

Overall, we observe that there is a need for more sophisticated spatio-temporal knowledge analytics frameworks with advanced knowledge harvesting and knowledge visualization support. In this paper, we present an integrated system to achieve these goals, enabling users to analyze large amounts of structured and unstructured textual data and gain deeper insights that are not easily observed in individual facts.



Figure 1: System architecture

2 Architecture

Figure 1 depicts the overall architecture of our system. Spatio-temporal events come from three different sources: crowdsourcing, information extraction, and existing knowledge repositories. Our system provides users with interfaces to enter textual information, videos, and images of events. The crowdsourced events are used as seed facts to extract additional spatio-temporal event information from the Internet. We describe this in more detail in Section 3. The extracted spatio-temporal facts are stored in the knowledge base. Both the crowdsourced facts and the extracted facts are presented visually in the visualization platform. Users can browse as well as edit the event information. Finally, the system comes pre-loaded with events taken from the Web of Data, particularly the YAGO (Suchanek et al., 2007) knowledge base, which contains events from different categories that serve as seed data for the platform.

The system maintains the edit history for every event, allowing users to revert any previous modification. Moreover, users’ personal activity logs are also captured and are available for browsing.

Relevant spatio-temporal events are simultaneously visualized on a map and on a timeline. A heat-map is added as the top layer of the map to reflect the distribution and frequency of events. There is also a streaming graph and line chart visualization enabling the user to analyze events based on their frequency. These may allow the user to discover salient correlations.

System Implementation. Our system is implemented in Java, with Apache Tomcat¹ as the Web server. While parsing text documents, we rely on OpenNLP² for part-of-speech tagging, lemmatizing verbs, and stemming nouns. All data are stored in a PostgreSQL³ database. The maps used in our system are based on OpenStreetMap⁴.

3 Spatio-Temporal Knowledge Harvesting

Spatio-Temporal Facts. Crowdsourcing is just one way to populate the spatio-temporal knowledge in our system. Additional facts are semi-automatically mined from the Web using information extraction techniques. We build on previous work that has developed methods for extracting temporal facts (Wang et al., 2011a), but extend this line of work to also procure spatial facts.

Our aim is to extract spatio-temporal factual knowledge from free text. A fact here consists of a relation and two arguments, as well as optional temporal and spatial attributes. For instance, the spatio-temporal fact

playsForClub(Beckham; Real Madrid)@<[2003,2008);Spain>

expresses that Beckham played for Real Madrid in Spain from 2003 to 2007 (the half-open interval [2003,2008) excludes 2008). Temporal attributes involve either a time interval or a time point, indicating that the fact applies to a specific time period or just a given point in time, respectively. Spatial attributes are described in terms of a disambiguated location entity. For example, “Georgia” often refers to the country in Europe, but may also refer to the state with the same name in the US. Thus, we use disambiguated entity identifiers.
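The fact notation above can be illustrated with a small data structure. The following is a minimal sketch, not the system's actual representation: the class `SpatioTemporalFact`, the function `parse_fact`, and the regular expression are hypothetical names introduced here, and only the interval form `@<[begin,end);location>` shown in the example is handled (time points are left out).

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SpatioTemporalFact:
    """A relation with two arguments plus optional temporal/spatial attributes."""
    relation: str
    arg1: str
    arg2: str
    interval: Optional[Tuple[int, int]] = None  # half-open [begin, end)
    location: Optional[str] = None              # disambiguated location entity

    def holds_in(self, year: int) -> bool:
        # Assumption: a fact without a temporal attribute is treated as always valid.
        if self.interval is None:
            return True
        begin, end = self.interval
        return begin <= year < end


# Hypothetical pattern for the notation relation(arg1; arg2)@<[begin,end);location>
_FACT_RE = re.compile(
    r"^(?P<rel>\w+)\((?P<a1>[^;]+);\s*(?P<a2>[^)]+)\)"
    r"(?:@<(?:\[(?P<begin>\d+),(?P<end>\d+)\))?;?(?P<loc>[^>]*)>)?$"
)


def parse_fact(text: str) -> SpatioTemporalFact:
    match = _FACT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a fact in the expected notation: {text!r}")
    interval = None
    if match.group("begin"):
        interval = (int(match.group("begin")), int(match.group("end")))
    location = (match.group("loc") or "").strip() or None
    return SpatioTemporalFact(
        match.group("rel"),
        match.group("a1").strip(),
        match.group("a2").strip(),
        interval,
        location,
    )


fact = parse_fact("playsForClub(Beckham; Real Madrid)@<[2003,2008);Spain>")
print(fact.holds_in(2007), fact.holds_in(2008))
```

Storing the interval as a half-open pair keeps adjacent intervals (for example, consecutive club memberships) from overlapping at their shared boundary year.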

Pattern Analysis. The extraction process starts with a set of seed facts for a given relation. For example, playsForClub(Beckham; Real Madrid)@<[2003,2008);Spain> would be a valid seed fact for the playsForClub relation. The input text is processed to construct a pattern-fact graph. Named entities are recognized and disambiguated using AIDA (Hoffart et al., 2011b). When a pair of entities matches a seed fact, the surface string between the two entities is lifted to a pattern. This is constructed by replacing the entities with placeholders marked with their types, and keeping only canonical lemmatized forms of nouns and verbs as well as the last preposition. We use n-gram based feature vectors to describe the patterns (Wang et al., 2011a).

¹ http://tomcat.apache.org/
² http://opennlp.apache.org/
³ http://www.postgresql.org/
⁴ https://www.openstreetmap.org/

For example, given a sentence such as “Ronaldo signed for Milan from Real Madrid.”, Milan is disambiguated as A.C. Milan. The corresponding pattern for leaving Real Madrid is “sign for 〈club〉 from”. Each pattern is evaluated by investigating how frequently the pattern occurs with seed facts of a particular relation. The normalized value (between 0 and 1) is assigned as the initial value for each pattern, for the fact extraction stage.
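To make the pattern evaluation step concrete, here is a minimal sketch of one possible scoring scheme. The text only states that the score reflects how often a pattern occurs with seed facts and that it is normalized to [0, 1]; the specific normalization below (seed co-occurrences divided by all occurrences of the pattern) is an assumption, and `pattern_strengths` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (first entity, second entity)


def pattern_strengths(
    observations: Iterable[Tuple[str, Pair]],
    seed_pairs: Set[Pair],
) -> Dict[str, float]:
    """Score each pattern by the share of its occurrences that match a seed fact.

    `observations` holds (pattern, entity pair) tuples found in the text;
    `seed_pairs` holds the argument pairs of the seed facts for one relation.
    """
    total: Counter = Counter()
    with_seed: Counter = Counter()
    for pattern, pair in observations:
        total[pattern] += 1
        if pair in seed_pairs:
            with_seed[pattern] += 1
    # Assumed normalization: a value in [0, 1] per pattern.
    return {pattern: with_seed[pattern] / total[pattern] for pattern in total}


observations = [
    ("sign for <club> from", ("Ronaldo", "Real Madrid")),
    ("sign for <club> from", ("Some Player", "Some Club")),
    ("visit", ("Some Player", "Some City")),
]
seeds = {("Ronaldo", "Real Madrid")}
print(pattern_strengths(observations, seeds))
```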

Fact Candidate Gathering. Entity pairs that match patterns whose strength is above a minimum threshold become fact candidates and are fed into the fact extraction stage of label propagation. Temporal and spatial expressions occurring within a window of k words in the sentence are considered as the temporal or spatial attribute of the fact candidate (Wang et al., 2011a). These fact candidates may have both temporal and spatial attributes simultaneously.
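The window-based attachment of attributes can be sketched as follows. This is an illustration under stated assumptions: the sentence is already tokenized and tagged (the tags `ENT`, `TIME`, `LOC`, and `O` are hypothetical), the window is taken to extend k tokens to either side of the span covered by the two arguments, and the first matching expression is used.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (text, tag), with tag in {"ENT", "TIME", "LOC", "O"}


def attach_attributes(
    tokens: List[Token], arg1_idx: int, arg2_idx: int, k: int
) -> Tuple[Optional[str], Optional[str]]:
    """Return the (temporal, spatial) expressions found within k tokens of the arguments."""
    lo = max(0, min(arg1_idx, arg2_idx) - k)
    hi = min(len(tokens), max(arg1_idx, arg2_idx) + k + 1)
    window = tokens[lo:hi]
    time = next((text for text, tag in window if tag == "TIME"), None)
    place = next((text for text, tag in window if tag == "LOC"), None)
    return time, place


sentence = [
    ("In", "O"), ("2003", "TIME"), ("Beckham", "ENT"), ("joined", "O"),
    ("Real Madrid", "ENT"), ("in", "O"), ("Spain", "LOC"),
]
print(attach_attributes(sentence, 2, 4, k=2))
print(attach_attributes(sentence, 2, 4, k=1))
```

With k = 2 both attributes fall inside the window, whereas with k = 1 the location lies just outside it, so only the temporal attribute is attached.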

Fact Extraction. Building on Wang et al. (2011a), we utilize Label Propagation (Talukdar and Crammer, 2009) to determine the relation and observation type expressed by each pattern. We create a graph $G = (V_F \cup V_P, E)$ with one vertex $v \in V_F$ for each fact candidate observed in the text and one vertex $v \in V_P$ for each pattern. Edges between $V_F$ and $V_P$ are introduced whenever a fact candidate appeared with a pattern. Their weight is derived from the co-occurrence frequency. Edges among $V_P$ nodes have weights derived from the n-gram overlap of the patterns.

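Let $\mathcal{L}$ be the set of labels, consisting of the relation names plus a special dummy label to capture noise. Further, let $Y \in \mathbb{R}_+^{|V| \times |\mathcal{L}|}$ denote the graph’s initial label assignment, let $\hat{Y} \in \mathbb{R}_+^{|V| \times |\mathcal{L}|}$ stand for the estimated labels of all vertices, let $S_\ell$ encode the seeds’ weights on its diagonal, and let $R_{*\ell}$ be a matrix of zeroes except for a column for the dummy label. Then, the objective function is:

\[
\mathcal{L}(\hat{Y}) = \sum_{\ell} \Big[ (Y_{*\ell} - \hat{Y}_{*\ell})^{T} S_\ell \, (Y_{*\ell} - \hat{Y}_{*\ell}) + \mu_1 \hat{Y}_{*\ell}^{T} L \, \hat{Y}_{*\ell} + \mu_2 \lVert \hat{Y}_{*\ell} - R_{*\ell} \rVert^2 \Big] \tag{1}
\]

Here, following Talukdar and Crammer (2009), $L$ in the second term denotes the graph Laplacian, and $\mu_1$ and $\mu_2$ are hyperparameters that weight the smoothness and regularization terms.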

Figure 2 shows an example of a pattern-fact graph. Existing events in the database serve as seeds in the graph. For instance, playsForClub(David Beckham, LA Galaxy)@US is a seed fact in the example, which will propagate the label playsForClub to other nodes in the graph. After optimizing the objective, the fact candidates which bear a relation’s label with weight above a threshold are accepted as new facts (Wang et al., 2011a). These facts, which may include temporal or spatial or both kinds of attributes, are stored in the database with provenance information, and can subsequently be used in several kinds of visualizations.
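The propagation of seed labels through a pattern-fact graph, followed by thresholding, can be illustrated with a deliberately simplified scheme. The sketch below is not the optimizer of Talukdar and Crammer (2009) and does not minimize the objective above: it clamps seed vertices and repeatedly replaces every other vertex's label scores with the weighted average of its neighbours' scores. All vertex names, the `NOISE` dummy label, and the functions `propagate_labels` and `accept` are hypothetical.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, Dict[str, float]]


def propagate_labels(
    edges: Dict[Tuple[str, str], float],
    seeds: Dict[str, str],
    labels: List[str],
    iterations: int = 20,
) -> Scores:
    """Simplified clamped label propagation over an undirected weighted graph."""
    neighbours: Dict[str, List[Tuple[str, float]]] = {v: [] for v in seeds}
    for (u, v), weight in edges.items():
        neighbours.setdefault(u, []).append((v, weight))
        neighbours.setdefault(v, []).append((u, weight))

    scores: Scores = {v: {label: 0.0 for label in labels} for v in neighbours}
    for vertex, label in seeds.items():
        scores[vertex][label] = 1.0  # seeds start with full weight on their label

    for _ in range(iterations):
        updated: Scores = {}
        for vertex, nbrs in neighbours.items():
            if vertex in seeds or not nbrs:
                updated[vertex] = scores[vertex]  # seeds stay clamped
                continue
            total = sum(weight for _, weight in nbrs)
            updated[vertex] = {
                label: sum(weight * scores[u][label] for u, weight in nbrs) / total
                for label in labels
            }
        scores = updated
    return scores


def accept(scores: Scores, candidates: List[str], relation: str, threshold: float) -> List[str]:
    """Keep the fact candidates whose score for `relation` exceeds the threshold."""
    return [c for c in candidates if scores[c][relation] > threshold]


edges = {
    ("fact:Beckham-LAGalaxy", "pattern:play for"): 2.0,
    ("fact:Ronaldo-Milan", "pattern:play for"): 1.0,
    ("fact:X-Paris", "pattern:visit"): 1.0,
    ("fact:Y-Rome", "pattern:visit"): 1.0,
}
seeds = {"fact:Beckham-LAGalaxy": "playsForClub", "fact:Y-Rome": "NOISE"}
scores = propagate_labels(edges, seeds, ["playsForClub", "NOISE"])
print(accept(scores, ["fact:Ronaldo-Milan", "fact:X-Paris"], "playsForClub", 0.5))
```

The dummy label plays the same role as in the text: candidates that are only connected to noisy patterns accumulate weight on `NOISE` rather than on a relation label, so they fall below the acceptance threshold.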

4 Data Visualization and Analytics

Our system enables several different forms of visual analytics, as illustrated in Figure 3, which combines several different screenshots of the system.

Spatio-Temporal Range Queries. Users may issue range queries for both temporal and spatial knowledge. In Figure 3, Screenshots 1, 3, and 4 show results of temporal range queries, while Screenshot 5 shows the result of a spatial range query. After choosing a particular span on the timeline at the bottom, the events relevant for the selected time interval are displayed both on a temporal axis and on the map. A heat-map visualizes the frequency of events with respect to their geographical distribution. Users may also scroll the timeline to look at different events. The events shown on the map dynamically change when the scrollbar is moved. In Screenshot 1, we see that items on the timeline are shown with different symbols to indicate different categories of events. Screenshots 3 and 4 show results from different time intervals. If users choose a spatial range by drawing on the map, any events relevant to this geographical area during the selected time interval are retrieved. Screenshot 5 shows how the system can visualize the retrieval results using a pie chart. The area highlighted in blue is the bounding box of the polygon, as determined within Algorithm 1. The different colors in the pie chart indicate different event categories and their relative frequency.

Event Browsing and Checking. Users can either consult the events listed on the timeline by clicking on the icons, or browse the streaming graph and line chart, which show the frequency of events. When selecting an event on the timeline, a pop-up window appears on top of the map near the relevant location. Normally, this window simply provides the entity label, as in Screenshot 4, while detailed information about the event is displayed in the sidebar on the left, as in Screenshot 6. However, when the user moves the cursor over the label, it expands and additional information is displayed. For an example of this, see Screenshot 3, which shows information for the “Battle of Noreia”. There are also links for related videos and images. If there is no interaction with a pop-up window for an extended period of time, it is made transparent. When users move the cursor above an event on the timeline, an icon on the map pops up to provide the location and name of that event. At the same time, an icon is displayed in the histogram, which is located beneath the timeline. With these coupled effects, the user simultaneously obtains information about both the accurate location on the map and the accurate time point within the timeline (see Screenshot 4).

Figure 2: Spatio-temporal pattern-fact graph

Algorithm 1 Spatial range query algorithm
Input: spatial polygon on the map P, event database E
Output: events in the polygon
  min_x ← minimum latitude of all points of P    ▷ Get bounding box of polygon P
  max_x ← maximum latitude of all points of P
  min_y ← minimum longitude of all points of P
  max_y ← maximum longitude of all points of P
  E_P ← {e ∈ E | min_x ≤ e.x ≤ max_x ∧ min_y ≤ e.y ≤ max_y}    ▷ Query event database
  ED ← edges of polygon P    ▷ Get edges of polygon
  for each e ∈ E_P do
      line ← (e.x, e.y; −∞, e.y)
      if e is not located on the edges ∧ line intersects ED an even number of times then
          E_P ← E_P − {e}
  return E_P
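Algorithm 1 combines a bounding-box filter with a ray-casting point-in-polygon test. A minimal in-memory sketch is given below, assuming events are plain `(name, x, y)` tuples and the polygon is a list of `(x, y)` vertices; in the actual system the bounding-box step is a database query, and the function names here are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]        # (x, y) = (latitude, longitude), as in Algorithm 1
Event = Tuple[str, float, float]   # (name, x, y); a hypothetical event record
Edge = Tuple[Point, Point]


def on_segment(p: Point, a: Point, b: Point, eps: float = 1e-9) -> bool:
    """True if p lies on the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    if abs(cross) > eps:
        return False
    return (min(ax, bx) - eps <= px <= max(ax, bx) + eps
            and min(ay, by) - eps <= py <= max(ay, by) + eps)


def crossings(p: Point, edges: List[Edge]) -> int:
    """Count edges crossed by the ray from p towards minus infinity in x."""
    px, py = p
    count = 0
    for (ax, ay), (bx, by) in edges:
        if (ay > py) != (by > py):  # edge straddles the ray's y coordinate
            x_at_y = ax + (py - ay) * (bx - ax) / (by - ay)
            if x_at_y < px:
                count += 1
    return count


def spatial_range_query(polygon: List[Point], events: List[Event]) -> List[Event]:
    # Bounding box of the polygon.
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    # Cheap pre-filter (a range query on the event database in the paper).
    candidates = [e for e in events
                  if min_x <= e[1] <= max_x and min_y <= e[2] <= max_y]
    n = len(polygon)
    edges = [(polygon[i], polygon[(i + 1) % n]) for i in range(n)]
    result = []
    for event in candidates:
        p = (event[1], event[2])
        on_edge = any(on_segment(p, a, b) for a, b in edges)
        # Drop candidates that are off the boundary with an even crossing count.
        if on_edge or crossings(p, edges) % 2 == 1:
            result.append(event)
    return result


triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
events = [("A", 1.0, 1.0), ("B", 3.0, 3.0), ("C", 5.0, 5.0), ("D", 2.0, 2.0)]
print([name for name, _, _ in spatial_range_query(triangle, events)])
```

The bounding-box step exists because it maps to a simple range condition that a database index can answer quickly, leaving the exact polygon test for the few remaining candidates.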

Users can also scroll the map to navigate to places of interest, and observe how frequently relevant events happen in that area, as visualized with the heat-map. When the user double clicks on a location on the map, all the events pertaining to that location are shown on the left of the window. Screenshot 6 shows three events that occurred in Beijing. Further details for each event are displayed if the user clicks on them.

Our system also supports querying related events for a specific person. Screenshot 8 provides the results when querying for Napoleon, where important events related to Napoleon are displayed on the map.

Figure 3: User interface screenshots

Visual Analytics. Users may use the line chart on the timeline and the heat-map to jointly inspect statistics pertaining to the retrieved events. For instance, Screenshot 2 shows the results as displayed in the line chart on the timeline. Different colors here refer to different event categories. As the user moves the time window at the bottom of the timeline, events on the timeline and maps are updated. The histogram at the bottom of the timeline shows the overall event statistics for the current state of the knowledge base. Each column refers to the number of events for a given five-year interval. The heat-map changes profoundly when transitioning from Screenshot 3 to Screenshot 4, especially for Europe. The total number of events increases as well. The line chart visualization of events on the timeline⁵ supports zooming in and out by adjusting the time interval. Hence, it is not necessary to initiate a new query if one wishes to drill down on particular subsets of events among the query results.
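The five-year columns of the histogram amount to a simple binning of event years. A minimal sketch, assuming events are reduced to integer years with negative values for BCE dates (the helper name `five_year_histogram` is hypothetical):

```python
from collections import Counter
from typing import Dict, Iterable


def five_year_histogram(years: Iterable[int], width: int = 5) -> Dict[int, int]:
    """Map the start year of each bin to the number of events falling into it."""
    # Floor division keeps bins aligned for negative (BCE) years as well.
    counts = Counter((year // width) * width for year in years)
    return dict(sorted(counts.items()))


print(five_year_histogram([1801, 1803, 1805, 1812, -113]))
```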

Adding/Editing Event Information. After logging into the system, users can enter or update event information. Our system provides an interface to add or edit textual information, images, and videos for events. This can be used to extend current text-based knowledge bases into multimodal ones (de Melo and Tandon, 2016).

⁵ We use the line chart developed by AmCharts, www.amcharts.com/

The system also stores the patterns from the extraction component. Hence, users can track and investigate the provenance of extracted facts in more detail. They can not only edit or remove noisy facts but also engage in a sort of debugging process to correct or remove noisy patterns. Corrected or deleted patterns and facts provide valuable feedback to the system in future extraction rounds.

After logging in, all user activities, including queries, additions, edits, etc., are recorded in order to facilitate navigation and to enable potential user analytics. For example, users may arrive at an interesting result using an entire series of operations. Then they may continue to browse the data aiming at further analyses. At some point in time, they may wish to go back to consult previously obtained results. It may be challenging to remember the exact sequence of operations that had led to a particular set of results, especially when there are many different querying conditions. The activity log addresses this by making it easy to go back to any earlier step. Screenshot 7 shows the use of a graph visualization to depict all the operations of a user after login. This same data can also be used for studying user behavior.

Furthermore, similarly to Wikipedia, the tool captures the complete edit history for a particular event. The interface for this uses a tabular form, not shown here due to space constraints. Wikipedia’s edit history has been put to many uses in previous research. For instance, one can study the evolution of entity types or the time of appearance of entities and their geographical distribution.

Providing Ground-Truth Data for Relation Extraction Evaluation. Our system continuously gathers ground-truth information on factual events (especially spatio-temporal facts) based on user contributions. The knowledge in our system consists of relations of interest: event happened in place, event happened on date, person is related to person, person is related to event, etc. This can serve as a growing basis for systematically evaluating and comparing different relation extraction methods and systems, going well beyond currently used benchmarks.

Historical Maps. Geographical boundaries are fluid. For instance, countries have changed and borders have evolved quite substantially during the course of history. Our system allows uploads of historical map data to reflect previous epochs. Subsequently, users can choose to have our system display available historical maps rather than the standard map layer, based on the temporal selection.

5 Conclusion

We have presented a novel integrated system that combines crowdsourcing, semi-automatic knowledge harvesting from text, and visual analytics for spatio-temporal data. Unlike previous work, the system goes beyond just showing geo-located entities on the map by enabling spatio-temporal analytics for a wide range of entities and enabling users to drill down on specific kinds of results. The system combines user contributions with spatio-temporal knowledge harvesting in order to enable analytics over large amounts of data. Given the broad appeal of Wikipedia and similar websites, we believe that this sort of platform can serve the needs of a broad range of users, from casually interested people wishing to issue simple queries over the collected knowledge all the way to experts in digital humanities seeking novel insights via the system’s advanced knowledge harvesting support.

Acknowledgments

This project was sponsored by National 973 Program (No. 2015CB352500), National Natural Science Foundation of China (No. 61503217), Shandong Provincial Natural Science Foundation of China (No. ZR2014FP002), and The Fundamental Research Funds of Shandong University (No. 2014TB005, 2014JC001). Gerard de Melo’s research is supported by China 973 Program Grants 2011CBA00300, 2011CBA00301, and NSFC Grants 61033001, 61361136003, 61550110504.

References

Sebastien Ardon, Amitabha Bagchi, Anirban Mahanti, Amit Ruhela, Aaditeshwar Seth, Rudra Mohan Tripathy, and Sipat Triukose. 2013. Spatio-temporal and events based analysis of topic popularity in Twitter. In CIKM, pages 219–228.

Gerard de Melo and Niket Tandon. 2016. Seeing is believing: The quest for multimodal knowledge. ACM SIGWEB Newsletter, 2016(Spring).

Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, Edwin Lewis-Kelham, Gerard de Melo, and Gerhard Weikum. 2011a. YAGO2: Exploring and querying world knowledge in time, space, context, and many languages. In WWW.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011b. Robust disambiguation of named entities in text. In EMNLP.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A Core of Semantic Knowledge. In WWW.

Partha Pratim Talukdar and Koby Crammer. 2009. New regularized algorithms for transductive learning. In ECML PKDD, pages 442–457.

Yafang Wang, Bin Yang, Lizhen Qu, Marc Spaniol, and Gerhard Weikum. 2011a. Harvesting facts from textual web sources by constrained label propagation. In CIKM, pages 837–846.

Yafang Wang, Bin Yang, Spyros Zoupanos, Marc Spaniol, and Gerhard Weikum. 2011b. Scalable spatio-temporal knowledge harvesting. In WWW, pages 143–144.

Yafang Wang, Lili Jiang, Johannes Hoffart, and Gerhard Weikum. 2013. Yali: a crowdsourcing plug-in for NERD. In SIGIR, pages 1111–1112.
