history sample research paper 4



Page 1: History Sample Research Paper 4

Undefined 1 (2012) 1–5
IOS Press

Semantic Technologies for Historical Research: A Survey
Editor(s): Name Surname, University, Country
Solicited review(s): Name Surname, University, Country
Open review(s): Name Surname, University, Country

Albert Meroño-Peñuela a,c, Ashkan Ashkpour b, Marieke van Erp a, Kees Mandemakers b, Leen Breure d, Andrea Scharnhorst c, Stefan Schlobach a, and Frank van Harmelen a

a Vrije Universiteit Amsterdam, De Boelelaan 1081a, 1081HV Amsterdam, NL
E-mail: {albert.merono, marieke.van.erp, k.s.schlobach, frank.van.harmelen}@vu.nl
b International Institute of Social History, Cruquiusweg 31, 1019AT Amsterdam, NL
E-mail: {ashkan.ashkpour, kma}@iisg.nl
c Data Archiving and Networked Services - Royal Netherlands Academy of Arts and Sciences, Anna van Saksenlaan 10, 2593HT Den Haag, NL
E-mail: {albert.merono, andrea.scharnhorst}@dans.knaw.nl
d Universiteit Utrecht, Princetonplein 5, De Uithof, 3584 CC Utrecht, NL
E-mail: [email protected]

Abstract. The sources of information for historical research are diverse: they fill a continuum between individual accounts, transmitted for instance in letters but also in poems and songs, and aggregated statistical information, as in the case of historical censuses. Historiography shares this heterogeneity and complexity of source material with other humanities fields. Methods to order this rich material, and by this ordering also to determine the way history is told, are as old as history writing and vary among the different branches (or subdisciplines) of historical research.

In this paper we focus on the work of historians, and even more specifically on economic and social history. At the crossroads of information and historical sciences, so-called Historical Informatics or History and Computing emerged as a specific profession during the 1990s. Together with computer scientists, historians created a research agenda concentrating on the question of how to create, design, enrich, edit, retrieve, analyze and present historical information with the help of information technology. There exist a number of problems and challenges in this field; some of them are closely related to the semantics and meaning of knowledge in general. In this context, Semantic Web technologies can be applied in a number of situations, environments and applications of historical computing and historical information science. However, only a few contributions have yet considered these technologies. In this survey we present an overview of the past and present problems, challenges and advances of historical science computing, from the perspective of semantic technology.

Keywords: Semantic technology, Digital Humanities, Historical datasets, Historical Computing

1. Introduction

This survey covers current approaches, namely papers, projects, online resources, and tools and technologies, on how to apply semantic technology to historical research. Here, by historical research we mean strictly research performed by historians, and we talk about history as a research domain. Thus, we exclude other fields of the humanities in which historical research is also performed, such as art history or history of literature. The intended audience of the paper is twofold. On the one hand, section 3 offers a general overview of recent advances in eHistory, historical computing and historical information sciences, as well as a landscape of recent problems that can be tackled with technologies provided by the Semantic Web community. On the other hand, section 4 gives a summary of current Semantic Web technology developments, as well as typical scenarios in which these technologies have provided valuable benefits. Hence, both communities, namely scholars doing research in history and Semantic Web advocates, can clearly identify their domains and, at the same time, have the opportunity to gain insights into the other field.

0000-0000/12/$00.00 © 2012 – IOS Press and the authors. All rights reserved

Goal. We claim that some current issues in research done by historians can be successfully addressed by applying semantic technologies. These issues originate in the inner semantics implicitly contained in historical sources, which can be appropriately identified, formalized and linked using semantic technologies. For example, any reference to a historical person contained in a historical source can be conveniently modeled in RDF, given meaning with a set of appropriate OWL vocabularies, and linked with other entities (for example, events in which that person participated, places that person visited, actions that person performed). This contributes to a very rich graph of knowledge that machines can easily read and process, and from which scholars can extract knowledge meaningful to their research, facilitating their role as history detectives.
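The person, event and place example above can be sketched in a few lines. This is only an illustration under stated assumptions: the `http://example.org/history/` namespace, the property names `participatedIn` and `tookPlaceAt`, and the person "Jan Jansen" are all hypothetical, and triples are held in a plain Python set rather than in a real RDF library or OWL vocabulary.

```python
from collections import namedtuple

# Hypothetical namespace; the rdf:type and rdfs:label URIs are the standard W3C ones.
EX = "http://example.org/history/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Wrapper that marks an object as a literal value instead of a URI.
Literal = namedtuple("Literal", "value")


def uri(local_name):
    """Build a full URI in the hypothetical example namespace."""
    return EX + local_name


# Invented statements about an invented person, for illustration only.
triples = {
    (uri("person/jan_jansen"), RDF_TYPE, uri("Person")),
    (uri("person/jan_jansen"), RDFS_LABEL, Literal("Jan Jansen")),
    (uri("person/jan_jansen"), uri("participatedIn"), uri("event/marriage_1850")),
    (uri("event/marriage_1850"), RDF_TYPE, uri("Event")),
    (uri("event/marriage_1850"), uri("tookPlaceAt"), uri("place/utrecht")),
    (uri("place/utrecht"), RDF_TYPE, uri("Place")),
}


def objects(graph, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def places_of_person(graph, person):
    """Follow person -> event -> place links in the graph."""
    places = set()
    for event in objects(graph, person, uri("participatedIn")):
        places |= objects(graph, event, uri("tookPlaceAt"))
    return places


def to_ntriples(graph):
    """Serialize the graph in an N-Triples-like form (no literal escaping)."""
    lines = []
    for s, p, o in sorted(graph, key=str):
        obj = '"%s"' % o.value if isinstance(o, Literal) else "<%s>" % o
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines)
```

Calling `places_of_person(triples, uri("person/jan_jansen"))` walks two links of the graph, which is the kind of traversal a scholar would otherwise perform by hand across several sources.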

Methodology. We proceed with a twofold analysis. First, we introduce historical computing, identifying current problems. Second, we go through a set of 67 relevant contributions, organized into published papers, research projects, online resources, and tools and technologies, and build a catalog of them according to the importance or priority they give to several areas at the crossroads of the Semantic Web and research by historians: knowledge modeling and ontologies, text processing and mining, search and retrieval, and semantic interoperability.

Contribution. This paper offers the current landscape of advances in faithfully representing historical data with semantic technology. Few contributions cover the full spectrum of what is desired from a complete workflow, going from historical sources to semantic data that can be easily linked and integrated. Although there exist efforts implementing several parts of the process, much more work is needed to successfully understand, process and represent historical sources using Semantic Web formats. Nevertheless, this paper identifies the most relevant problems found in research by historians, and suggests how these issues can be addressed using Semantic Web technologies.

In section 2 we describe elements of the epistemic frames used in this field together with ideal-typical workflows, and in section 3 we briefly introduce Historical Information Sciences. The aim is to identify problems where Semantic Web technology could offer solutions. A prerequisite for that is to bridge between different terminologies concerning sources, data, data models and structured or unstructured data. In section 4 we start from Semantic Web technology and apply the concepts developed in section 3 to classify past and current research (as documented in publications but also in projects) along several dimensions. Having all these contributions in a single view will hopefully help to show how diverse the landscape at the border between historical computing and Semantic Web research is, and also indicate promising research lines for the future.

2. Historical research and semantic technology

The field of historical research concerns the understanding of our past, and it is currently undergoing major changes in its methodology, largely due to the advent of, and growing interest in, high-quality digital resources [86]. Historical data encompasses statistical information, texts, images, and objects. Datasets contain both factual and terminological knowledge about events, people and processes, with a temporal perspective that makes them valuable resources and interesting objects of study. What makes the humanities scholar's work even more interesting for a semantic description is its context dependence and the variety of possible interpretations. Semantic technology could play an important role when it comes to keeping and comparing different perspectives; relating different sources, both in terms of historical facts and in terms of their perception and interpretation; and transferring knowledge produced in scholarly discourse to the ongoing reflection of society and mankind.

Historical research is also part of disciplines other than history itself. The difference in data handling is that historians want to look through the sources to understand a society of the past, while art history and history of literature are much more focused on the art work itself. This also implies differences in encoding information. In this paper we address history or historiography as a research domain.


Historical sources - the data of historians

Historical sources can be characterized and divided in many ways, but a basic distinction used by historians is between primary and secondary sources. Primary sources can be divided roughly into administrative sources and narratives such as biographies or chronicles. Secondary sources are all material that has been written by historians or their predecessors about the past [33]. We restrict ourselves here to primary sources.

Administrative sources contain records of some administration (census, birth, marriage and death rolls, administrative accounts of taxes and expenses, resolutions and minutes of administrative bodies, deeds, contracts etc.). Typically, historians want to extract the facts in order to gather statistical data. A good example is the project Historic Sample of the Netherlands (HSN¹). The source material for the HSN database consists mainly of the certificates of birth², marriage³ and death⁴ and of the population registers⁵. From those sources the life courses of about 78,000 people born in the Netherlands during the period 1812-1922 have been reconstructed. Stored in a database and downloadable as files, this information forms a unique tool for research in Dutch history and in the fields of sociology and demography. As in the case of the HSN, this type of source is usually stored in archives and, for the majority from a more remote past, is not yet machine readable and not easy to analyze with Natural Language Processing (NLP) techniques. There is one major pitfall in linking this kind of data: extracting data about persons, events, institutions and locations is one thing, but linking their different instantiations (for instance different name spellings, or persons with the same name) and keeping good documentation is the real challenge [21].
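The linkage pitfall described above can be made concrete with a small sketch. It is not the method used by the HSN or by [21]; it only assumes that records carry a name and a birth year, uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, and works on invented records. The threshold of 0.85 is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case a name, drop periods and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def name_similarity(a, b):
    """Similarity ratio between two normalized names, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_candidates(records, threshold=0.85):
    """Propose links between records with similar names and equal birth years.

    The result keeps the score, so every proposed link stays documented
    and can be reviewed instead of being merged silently.
    """
    pairs = []
    for i, left in enumerate(records):
        for right in records[i + 1:]:
            if left["birth_year"] != right["birth_year"]:
                continue  # same name but another birth year: treat as a different person
            score = name_similarity(left["name"], right["name"])
            if score >= threshold:
                pairs.append((left["id"], right["id"], round(score, 2)))
    return pairs


# Invented records: two spellings of one name, plus a namesake born in another year.
records = [
    {"id": "birth-001", "name": "Johannes Janssen", "birth_year": 1850},
    {"id": "marriage-017", "name": "Johannes Jansen", "birth_year": 1850},
    {"id": "death-112", "name": "Johannes Jansen", "birth_year": 1821},
]

candidates = link_candidates(records)
```

Here the two spelling variants are proposed as a link, while the namesake with a different birth year is kept apart. Real linkage would need more evidence than a name and a year, which is exactly why documentation of each decision matters.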

Narrative sources are full-text documents containing a description of the past, made by an author who was an eyewitness: think of diaries, chronicles, newspaper articles, diplomatic reports, political pamphlets etc. Historians may be interested in both factual information and the author's vision and bias. Here, too, knowledge encoding is interesting, and it requires more sophisticated ontologies than are currently available.

¹ http://www.iisg.nl/hsn/index.html
² http://www.iisg.nl/hsn/data/birth.html
³ http://www.iisg.nl/hsn/data/marriage.html
⁴ http://www.iisg.nl/hsn/data/death.html
⁵ http://www.iisg.nl/hsn/data/population.html

One has to be aware of the fact that historical data (as present in sources) are fundamentally different from hard science data: historical data have not been produced under the controlled conditions of an experiment. So historical research will always have something of the work of a detective: the historian must be careful not to destroy certain details (read: annoying inconsistencies), because these details may contain relevant information. On the other hand, to be able to extract statistical information and come up with more general statements, some formalization is needed: relating information and harmonizing expressions of what is later used as variables. Harmonization, the process of making data sources uniformly accessible, is closely related to issues of standardization and formalization [?]. But harmonization can have different targets and purposes.

Moreover, there is no common language to label facts: the terminology is time and space bound, which makes formalization difficult and results dependent on different interpretations.
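The tension between harmonizing and not destroying details can be illustrated with a minimal sketch. It assumes a hypothetical mapping from Dutch occupational terms to one harmonized label; the mapping, the field names and the records are invented for illustration. The point of the design is that the harmonized value is added next to the source value and never replaces it.

```python
# Hypothetical mapping from source terms to a harmonized label.
HARMONIZED_OCCUPATIONS = {
    "landbouwer": "farmer",
    "boer": "farmer",
    "smid": "blacksmith",
}


def harmonize(record, mapping=HARMONIZED_OCCUPATIONS):
    """Return a copy of the record with a harmonized label added.

    The original spelling stays untouched under "occupation"; terms that
    are not in the mapping get None instead of a guessed label.
    """
    key = record["occupation"].strip().lower()
    result = dict(record)
    result["occupation_harmonized"] = mapping.get(key)
    return result


source_records = [
    {"id": 1, "occupation": "Landbouwer "},
    {"id": 2, "occupation": "boer"},
    {"id": 3, "occupation": "koopman"},
]

harmonized_records = [harmonize(record) for record in source_records]
```

Records 1 and 2 become comparable as "farmer" for statistical purposes, while record 3 stays explicitly unharmonized and all three keep their source spelling, including the trailing space that a detective-minded historian might still want to see.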

We will discuss in this paper how semantic technologies can support both research practices depicted archetypically above: the careful, authentic and preserving collection of a variety of evidence (historian as detective), and the ordering of that evidence along concepts, theories and models to form explanations debatable in scientific discourse (historian as scientist).

Structures and data models

In section 3 we use the distinction between structured and non-structured to classify historical data. We would like to underline that this differentiation is very different from the use of those notions in history, where administrative sources are often labeled as structured and the textual secondary sources as unstructured. Narrative sources also have internal structures, which can be made explicit. From the 19th century onwards historians have made scholarly source editions, which contain structured and annotated information. Nowadays the printed source editions are replaced and supplemented by databases and XML-based digital editions. So, structured and unstructured are relative notions: administrative sources usually have an obvious structured layout, while narrative sources have a latent structure, hidden at first sight, which is made explicit as soon as they appear in a scholarly source edition. So, both administrative and narrative sources can appear in the form of structured or unstructured data in computer science jargon.

eHistory and the digital humanities

The identification of unexplored links between semantic technologies and historiography requires a careful account of the history of earlier encounters between computer scientists and historians. We present a short history of the so-called historical information sciences at the beginning of section 3. It is also important to realize that the interdisciplinary collaboration between computer science and humanities researchers paves the way for true integration of semantic technologies in what one could call eHistory.

The challenge of a semantic enrichment of historical sources may bring new facilities: on the one hand, for humanities researchers, also outside of history, allowing them to search, retrieve and compare the information they need for their everyday work using a variety of dimensions and scopes; and on the other hand, for practitioners, giving them new data sources to develop history-aware applications for public institutions, private companies and citizens.

What we call eHistory throughout this paper is nothing more than a common label for the emergence of new research technologies. Co-evolving with them, new theoretical and methodological frameworks are penetrating existing historical research, widening and shifting boundaries to other academic fields and partly defining new scientific subfields. In this paper we focus on the most recent technological trends connected to the emergence of the internet and the web [25], and look in particular into Semantic Web technologies [3]. We would like to point out, nevertheless, that changes in historical research have been closely connected to the emergence of new scientific methods for centuries. Statistics has influenced many fields, including history, and paved the ground for quantitative studies [20]. However, this kind of historical study is more and more the domain of sociologists, economists and demographers rather than of scientists educated as historians [27]⁶.

More recently, the invention of computers and the formation of computer sciences have inspired historians from the start. History computing and Humanities computing were labels used in the pre-Internet era [22]. Many pioneers in computer-aided historical analysis have a background in both history and informatics, and reflected early on about the usefulness of computational and digital techniques for historical research [86]. Though not all research dreams materialized in the way originally envisioned [93], nowadays historians, in particular in the areas of economic and social history, aim for world-wide, large-scale collaborations. This kind of web-based cooperation allows historical information to be collected, distributed, annotated and analyzed all around the globe [8]. While we concentrate on historical research, and on more specific branches such as social and economic history, we would also like to stress that similar solutions are emerging in other humanities fields at the turn to e-humanities or Digital Humanities [5,28]. As historical research overlaps with literary studies, ancient language studies, archeology, art history and other humanities fields, these areas of encounter are also natural candidates for the transfer of generic methods developed from a semantic technology perspective for historical research or other humanities fields.

⁶ http://www.hist.umn.edu/~ruggles/hist5011/Decline.pptx

3. Historical Information Science

Since this paper focuses on applications of Semantic Web technologies in research by historians, it makes sense to briefly introduce the field of historical information science, describing the state of the art and giving some insights into current challenges.

This paper also looks forward to how Semantic Web technology can be applied to historical datasets, and how these technologies can facilitate, boost and improve research by historians. To build a meaningful discourse, it is necessary to explain that semantic technologies have specific requirements in order to be correctly deployed in history: they need to be applied to historical source datasets in a complex, layered and properly adapted pipeline. That pipeline, hence, is not static at all: its configuration strongly depends on the way and the degree to which the datasets are structured. Computer science and history may use different meanings for the term structure, which can lead to confusion and ambiguity. For this reason, we dedicate an entire subsection to existing classifications of historical datasets, to how their inner structure is related to their type of source, and to how it strongly influences the semantic pipelines to be applied. Whenever we use the term structure, we mean the software engineering definition, in which the term structured data refers to information that is contained in a pre-defined data model independent of the content (the structure is an empty shell).

Likewise, the term semantics has to be interpreted from the computational point of view. Computational semantics⁷ deals with how to automate the process of constructing and reasoning with meaning representations of natural language expressions. Natural language is one of the most common forms in which historical sources are written (that is, any kind of free speech or free discourse text like letters, biographies, manuscripts or memoirs). Specifically, Semantic Web technologies use formal languages (such as RDF, RDFS or OWL) to define concepts, persons, places and any other kind of entity, so that they can be identified and referred to from those texts in a way that lets machines easily differentiate the human expressions from the meaningful, linked knowledge behind them.

Section 3 is organized in two subsections. The first subsection gives an overview of historical computing, explaining, first, the evolution (life cycle) of data streams in research by historians and, second, the diverse types of historical sources and their classifications. The second subsection focuses on underlying problems and challenges when dealing with these sources from a semantic point of view.

3.1. State of the art in Historical Information Science

Ever since the advent of computing, historians have been using computers in their research or teaching in one way or another. The first revolution in the 1960s allowed researchers to harness the potential of computational techniques in order to analyze more data than had ever been possible before, enabling verification and comparison of their research data but also giving more precision to their findings [2]. However, this was a marginal group within historical research; in general, the usage of computers by humanists could be described as occasional [10]. The emphasis was more on providing historians with the tools to do what they had always done, but now in a more effective and efficient way.

Although computing tools are currently embedded in the daily life of most researchers, the use of these tools did not revolutionize all sciences equally. History, in particular, failed to acknowledge many of the tools computing had come up with [86]. Instead of improving the quality of historians' work and assisting them in this process, software developed for historians often requires attending several summer schools before being able to use it [7]. Currently there are still many challenges and information problems in historical research. These difficulties range from textual, linkage and structuring problems to interpretation and visualization problems [86].

⁷ http://en.wikipedia.org/wiki/Computational_semantics

Despite these challenges, computing in history, and in the broader sense in the humanities, also brought some significant contributions in certain fields like linguistics (corpus annotations, text mining, historical thesauri etc.), archeology (impossible without GIS nowadays), and other fields using sources that have been digitized for historical (comparative) research and converted to databases [86]. The use of electronic tools and media is incredibly valuable and important for opening up various sources for research which would otherwise remain unused. Open access to research data has always been an issue, especially in the humanities. However, over the past years various efforts have been made to open up these black boxes and make them available to researchers. These different sources contain rich information from various fields, and are often digital in nature, in the form of databases, text corpora, visual objects etc. These sources or isolated databases often contain a lot of semantics, but their data models were designed asynchronously, making them difficult to compare (interoperability problem). So, while more and more sources are being digitized, more attention has to be given to the development of computational methods to process and analyze all these different types of information [14].

A key issue for historians and other humanities researchers when dealing with historical data for comparative research concerns the lack of consistency and comparability across time and space, due to changing meanings, various interpretations of the same historical situations or processes, changing classifications etc. In this article we look into the benefits of using Semantic Web technologies in the field of history, providing novel insights for historians to deal with these problems and to use contemporary methods and technologies to gain a better understanding of historical data sources.

3.1.1. The life cycle

Like all scientific objects, historical objects go through several distinct phases with specific transformations in order to produce an outcome suitable for historical research or for the specific needs of a researcher. The main object of study in historical information science is historical information, and the various ways to create, design, enrich, edit, retrieve, analyze and present historical information with the help of information technology [86]. In line with this perspective, the handling of historical information can be laid out as sequential phases of a historical information life cycle (see Figure 1). It should be mentioned that these stages, although sequential, do not always have to be passed through in sequence, and some can be skipped when necessary. The stages are also quite comparable with the practice in other fields of science.

Fig. 1. The life cycle of historical information.

The first stage of the life cycle is the creation stage. The main aspect of this stage consists of the physical creation of digital data, including the design of the information structure and the research project's design. In the enrichment stage the most important aspect is to enrich the data which has been created with metadata, describing the historical information in more detail. It is suggested to do this in a standardized way, using systems such as Dublin Core⁸. Historical context enrichment also involves linking the data which belongs together (nominal record linkage), for example persons with the same name, places or events. The editing phase involves activities which aim to enhance the data further, and entails the actual encoding of textual information. Examples are: entering data in databases, inserting mark-up tags, annotating original data with extra information and bibliographical references, and creating links to related passages. Once the data has been created and processed it is ready to be viewed and used. The retrieval stage mainly involves selection mechanisms such as SQL queries for traditional databases, or XPath⁹ and XQuery¹⁰ for text retrieval. The analysis stage involves both statistical analysis of historical data sets and qualitative comparison and analysis of text. The presentation of historical information is very diverse and can be expressed in different ways, such as digital text editions, databases, visualizations, statistical views or even virtual exhibitions.

⁸ http://dublincore.org/
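The two retrieval mechanisms named for the retrieval stage can be sketched side by side. The example below is illustrative only: the table layout, the XML element names and all values are invented, and it relies on Python's standard `sqlite3` and `xml.etree.ElementTree` modules, the latter of which supports only a limited subset of XPath.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Retrieval from a traditional database: an in-memory table of invented persons.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (name TEXT, birth_year INTEGER, place TEXT)")
conn.executemany(
    "INSERT INTO persons VALUES (?, ?, ?)",
    [("Jan Jansen", 1850, "Utrecht"), ("Maria de Vries", 1862, "Amsterdam")],
)
born_before_1860 = conn.execute(
    "SELECT name FROM persons WHERE birth_year < ? ORDER BY name", (1860,)
).fetchall()
conn.close()

# Retrieval from a marked-up text: a tiny invented edition of a letter,
# where the editing stage has already inserted <person> and <place> tags.
EDITION = """
<letter>
  <p>Written in <place>Utrecht</place> by <person>Jan Jansen</person>.</p>
  <p>Addressed to <person>Maria de Vries</person>.</p>
</letter>
"""
root = ET.fromstring(EDITION.strip())

# ".//person" is an XPath expression selecting every <person> element at any depth.
persons_mentioned = [element.text for element in root.findall(".//person")]
places_mentioned = [element.text for element in root.findall(".//place")]
```

The SQL query selects rows by a condition on a column, whereas the XPath expression selects tagged passages by their position in the document tree; both only work because an earlier stage made the structure explicit.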

At the center of the historical information life cycle, three aspects are identified which are central to computing in the humanities in general. Durability ensures the long-term use of the data; usability refers to ease of use, efficiency, effectiveness and user satisfaction; and modeling here refers to the more general modeling of research processes and historical information systems.

3.1.2. Structured and unstructured historical datasets

We have presented a general overview of historical computing, putting emphasis on how historical information science approaches the life cycle of historical information [86], and how the cycle starts with constructing the relevant historical datasets. At the end of this creation phase one may expect to have a set of all data needed for further processes. However, the nature of many of the next steps may strongly depend on the way the resulting dataset is structured. Indeed, the way we could attach Semantic Web technologies to these historical sources (e.g. to extract RDF triples) is strongly dependent on the degree of structure of these sources.

To mark these differences we introduce the dichotomy of structured and unstructured data. Structured data refers, then, to information that is based on sources that have a very clear pre-defined data model, like census material published in rows and columns, while unstructured data refers to information which has none. A data model is an abstract model that documents and organizes data for communication, and is used as a plan for developing applications.

Since in this paper we only consider digitized sources, we talk about a structured historical digitized source (in short, structured source) when such an abstract model for the data contained in the digitized source does exist. Well-known examples of such a structure are sources encoded as relational databases, XML files, spreadsheet workbooks or RDF triple-stores. It is easy to see that all these examples meet a certain abstract model for the data they represent (relational schemes, DTD constraints, tabular formats and RDF triple statements). In case such a data model is lacking, we define these as unstructured historical digitized sources: these are the ones that have little or no structure at all, commonly unconstrained corpora.

⁹ http://www.w3.org/TR/xpath/
¹⁰ http://www.w3.org/TR/xquery/
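To show why a pre-defined data model eases the step towards semantic data, the following sketch turns a small tabular source into subject-predicate-object statements. Everything in it is an assumption made for illustration: the `http://example.org/census/` namespace, the column names and the values are invented, and the output is a list of plain tuples rather than real RDF.

```python
import csv
import io

# Hypothetical namespace for the generated identifiers.
BASE = "http://example.org/census/"

# Invented census-like table; the header row is the data model of the source.
CENSUS_CSV = """municipality,year,inhabitants
Town A,1859,1000
Town B,1859,2000
"""


def rows_to_triples(csv_text):
    """Turn each table cell into one (subject, predicate, object) statement.

    Each row becomes a subject, each column name a predicate and each cell
    value an object, so the tabular structure carries over unchanged.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for index, row in enumerate(reader, start=1):
        subject = "%sobservation/%d" % (BASE, index)
        for column, value in row.items():
            triples.append((subject, BASE + "column/" + column, value))
    return triples


census_triples = rows_to_triples(CENSUS_CSV)
```

The mapping needs no interpretation of the content because rows and columns already say what each value is. An unstructured source, such as a letter, offers no such header, and entities would first have to be recognized in the text before any statement could be produced.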

Some historians [86] working from an information-oriented perspective come close to this distinction between structured and unstructured sources. They propose to classify historical data according to the further machine processing it will (probably) require: textual data, quantitative data and visual data. Textual data comprises the whole set of unstructured historical sources, such as letters, memoranda or biographies, all in the form of free text. Quantitative data comprises historical sources intended for quantitative analysis, like church registers, census tables and municipal microdata. Finally, visual data gathers all kinds of historical evidence not encoded as text or numbers, such as photographs, video footage and sound recordings.

Nevertheless, we do not consider source classification a major issue in the semantic technologies pipeline. Although structure really matters for deciding what has to be applied to the sources, whether those sources are administrative or narrative, deliberate or inadvertent, does not really matter as long as their inner structure is clearly identified. Their belonging to one type or another may have an influence at some point (for instance, a secondary source may require an OWL ontology with a vocabulary that allows historical interpretations to be described), but in general the procedure for extracting RDF triples from the sources relies strongly on the type of source with regard to its structure. The goal is a faithful representation of the source in Semantic Web formats: a source-close representation that models data as-is, meeting the same requirements of faithfulness as critical source editions (which are the standard for historians). It is essential for semantic representations to consider context and source structure as critical editions do, because these may be relevant for the interpretation of the data. A digitized, semantically enabled historical source should ideally preserve context and structure and support goal-oriented extraction of data, in order to construct historical facts in the framework of a particular research project. By means of dataset interlinking and appropriate design and use of ontologies and vocabularies, it should be possible to preserve context and source structure using semantic technologies.

That said, we propose the source classification system depicted in Figure 2, which distinguishes three levels of inner structure in the digitized historical source: structured, semi-structured and unstructured. Each level of structure can be divided into several types of structure. In turn, dotted arrows express how typical workflows tend to transform data contained in unstructured sources, identifying entities, relations and events mentioned in natural (unstructured) language, and modeling them in formal (structured) languages.

Fig. 2. Classification of sources according to their degree of internal structure.

Structured historical sources can be divided into relational databases, graph/tree data and tabular data. Relational databases are the most fixed, static known way of committing to a schema for representing historical objects and their relationships. Because of their structure, relational databases are ideal for goal- or model-oriented representation of historical data [86] with some concrete conception of reality in mind, but as a drawback their schemas are the most common obstacle in data integration. On the other hand, this problem is easy to overcome by converting them into a flat file. Another solution is to convert the data into a more common data structure with standardized variables, such as the Intermediate Data Structure in use by large databases with historical microdata [1]. More serious are misconceptions about the data itself (poor conceptualization of entities and relationships, inconsistent normalization, etc.), which may trouble a historian more than other scientists.

Graph/tree data is found in historical samples that come in formats such as XML (trees), RDF (graphs) or JSON. Although they are conceived for modeling data in very disparate models (a tree, a graph and nested dictionaries, respectively) and for different purposes (e.g. JSON is mainly used for data interchange), these formats also rely on certain assumptions to impose structure on historical data.

In tabular data, typically represented by means of comma-separated values (CSV11) or Excel workbooks (XLS12) (although other tabular data formats13 are also common), historical information is encoded following a model based on cells, columns and rows, which is well suited to datasets containing essentially numerical variables. Census spreadsheets are a good example of structured tabular data.
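To illustrate how such a cell-based model can be mapped onto RDF-style statements, the following minimal Python sketch turns each cell of a small CSV table into one triple. The namespace, the column names and the one-triple-per-cell mapping are assumptions made only for this illustration; they are not a method prescribed by the works surveyed here.

```python
import csv
import io

# Hypothetical namespace and sample data, used only for illustration.
BASE = "http://example.org/census/"

SAMPLE_CSV = """municipality,occupation,men,women
Amsterdam,baker,120,15
Utrecht,weaver,80,64
"""

def rows_to_triples(csv_text):
    """Turn each CSV cell into one (subject, predicate, object) triple."""
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader, start=1):
        subject = f"{BASE}row/{index}"  # one subject per table row
        for column, value in row.items():
            predicate = f"{BASE}variable/{column}"  # one predicate per column
            triples.append((subject, predicate, value))
    return triples

def to_ntriples(triples):
    """Serialize the triples as N-Triples lines with plain literal objects."""
    lines = []
    for s, p, o in triples:
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)

triples = rows_to_triples(SAMPLE_CSV)
print(to_ntriples(triples))
```

Keeping one subject per row and one predicate per column stays close to the source layout; any interpretation (typed numbers, shared vocabularies) would be a later, separate step.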

Semi-structured sources consist solely of annotated corpora. Although they appear more often as an intermediate representation between unstructured and structured historical data than as sources in their own right, annotated corpora can be treated as historical sources as well. The technologies typically applied here are markup languages, such as XML, used to mark special characteristics of historical texts in specific regions of the corpus.

Unstructured sources are the most common representation of digital historical samples. They are usually digital transcriptions of historical texts. Objects of a wide variety of historical natures can be included in this category: letters, books, memoranda, acts, etc.

3.2. Information problems

The weapon of choice of historians was and remains the database, particularly in relational form [2]. This not only enables historians to retain some of the integrity of the original data sources but has also paved the way for rapid advances on issues such as classification and record linkage. Although many advances have been made in different fields and computers are seen as valuable assets, a large number of historians are unfamiliar with semantic technologies or remain unconvinced that they may become a new methodological asset [2,31]. Consequently, historians typically do research using their own datasets, resulting in the creation of a vast amount of scattered data and in specific technological challenges. By understanding the use and advantages of semantic technologies, practitioners and researchers of historical data can not only connect their own data sources but also disseminate their data on the Semantic Web and integrate it with other data sources, which was previously impossible or cumbersome. In historical research, information problems can be divided into four main categories [86]: information problems of historical sources, information problems of relationships between sources, information problems in historical analysis, and information problems of the presentation of sources or analysis. In section 3.2 we elaborate on these issues, except historical analysis. In section 3.2.1 we go into the issue of semantic interoperability and historical sources.

11 See RFC 4180: http://tools.ietf.org/html/rfc4180
12 http://en.wikipedia.org/wiki/Microsoft_Excel#File_formats
13 http://en.wikipedia.org/wiki/Comparison_of_spreadsheets

Historical Sources. As historians often have different interpretations and no clear research question when starting an investigation, it is neither possible nor desirable to model the data according to certain requirements in advance. The main information problem with historical sources relates to the fact that different sources have been produced throughout different periods in history with different views and motives. These sources often vary in format and structure, and they are often inconsistent, unclear, ambiguous or incomplete. Historical census data is a good example of these inconsistencies, varying structures and changing levels of detail, which hinder comparative social history research in both past and present efforts [26]. Moreover, in historical research the meaning of data cannot exist without interpretation [86]. Textual problems, for example, include the meaning of words, their relation to other objects, and the context or underlying thoughts of the data, all of which are subject to different interpretations. Due to drifting concepts in history, different interpretations may exist with regard to certain data. However, as the interpretation of data is a subjective matter, this information should be added in a non-destructive way, preserving the original source data. Another main issue relates to the data structuring problem of historical data. As historical researchers often deal with various (isolated) sources, they face the problem of how to integrate these dissimilar sources for their purposes and have to decide what an adequate data model for historical data is. The main discussion here is whether to use a source-oriented or a goal-oriented data model for historical data. Researchers in favor of the source-oriented approach claim that commitment to a certain data model suitable for analysis should be postponed to the final stages of a project in order to maintain flexibility and build on the data in a non-destructive manner. This is especially the case when the database is supposed to be shared with other researchers or is to be used in the future [21].

Relationships Between Sources. Quite often several sources are used in historical research, which makes linking different sources another key problem. Think, for example, of microdata about the same person contained in different censuses, parish registers, marriage or death certificates, etc. The challenge here is how to relate the appearance of a person in one source document to the appearance in another document. Obvious linkage problems are how to disambiguate between persons with the same name, how to manage changing names (e.g. when a woman marries) and how to standardize spelling variations in names. Other problems relate to the spatial and temporal context of historical data: for example, occupational titles that evolve over time, or the changing geographical boundaries of a country (compare the contemporary geographic position of Poland with the situation in 1930 and in 1900).
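As a rough illustration of the linkage problem described above, the following Python sketch normalizes spelling variations in names and proposes candidate matches between two records. The record fields, the similarity threshold and the tolerated gap in birth years are hypothetical choices made for this example; real record linkage systems use considerably more evidence.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize_name(name):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in without_accents)
    return " ".join(cleaned.lower().split())

def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()

def same_person(rec_a, rec_b, threshold=0.85, max_year_gap=2):
    """Candidate match: similar names and compatible birth years (assumed rule)."""
    close_names = name_similarity(rec_a["name"], rec_b["name"]) >= threshold
    close_years = abs(rec_a["birth_year"] - rec_b["birth_year"]) <= max_year_gap
    return close_names and close_years

# Invented records standing in for a census entry and two parish register entries.
census = {"name": "Johannes van der Meer", "birth_year": 1851}
parish = {"name": "Johannis van der Meer", "birth_year": 1850}
other = {"name": "Pieter de Vries", "birth_year": 1851}

print(same_person(census, parish))
print(same_person(census, other))
```

A rule of this kind only proposes candidates; cases such as name changes after marriage need additional sources or manual judgment.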

Historical Analysis. As historical research often deals with changes in time and space, historians require tools which enable them to deal with these aspects. Accordingly, several techniques have been developed for historical research, but their applicability has yet to be determined [86]. For example, various statistical techniques, such as multilevel regression, are borrowed from the social sciences, while others, such as event history analysis, have been developed specifically for historical research.

Presentation of Sources. One of the main problems of comparative historical researchers relates to an adequate presentation of sources. As noted before, the presentation of historical information may take different shapes, varying from digitized documents and well or poorly modeled databases to visualizations and GIS representations. Currently there is a great need for tools and methods to present changes over time and space. Moreover, for historians, different types of presentation are suitable at different stages of a research project.

3.2.1. Semantic Interoperability

So far, we have presented a classification of digital historical sources according to their level of structure and the software they may have been encoded with. Moreover, we have stated that the further workflows that researchers may need to apply to these sources depend strongly on such structural types. The same can be said of the semantic interoperability challenges that these workflows may encounter while extracting and transforming the information contained in the sources. This section analyzes which specific semantic challenges may occur in these processes, assuming that disparate sources of the same kind (e.g. spreadsheets coming from a dozen sources, XML documents with different data structuring) have to be aligned (that is, it must be possible to query them uniformly).

Structured digitized sources

Relational databases. Relational databases have their own languages (SQL) and systems (MySQL, Postgres) to represent and store historical data. Semantic issues appear often in historical datasets encoded this way, especially when trying to merge different entity-relationship schemas from disparate databases. The field of data integration is currently producing results in this respect.

Graph/tree sources. Hierarchical sources may be impossible to merge or compare due to the following semantic mismatches.

– Schema mismatch occurs when two different sources cannot be compared because of semantic differences in their defining schemas. For instance, two XML files conforming to different DTDs may define and structure the same historical fact, event or person differently. Similarly, two RDF datasets may describe these historical facts, events or persons by means of triples which use different (and not necessarily compatible) vocabularies.

– Constraints on values regarding range or type may also mismatch across datasets, even when the datasets are schema- or vocabulary-compatible. For instance, an attribute may encode the variable social class with categories A, B, C, while another dataset may do so with categories high, medium, low. Likewise, categorical values may be represented as numerical in other samples, or have special restrictions (for instance, if a person has a job then he or she must be at least 16 years old).
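The value-level mismatches just listed can be sketched in a few lines of Python. The correspondence between the codes A, B, C and the labels high, medium, low is an assumption made for this example (the two schemes need not align this way in real data), and the record layout is hypothetical.

```python
# Hypothetical harmonization table: both coding schemes map to one shared scale.
SOCIAL_CLASS_MAP = {
    "A": "high", "B": "medium", "C": "low",
    "high": "high", "medium": "medium", "low": "low",
}

MIN_WORKING_AGE = 16  # restriction taken from the example in the text

def harmonize_class(value):
    """Map a source-specific social class code to the shared scale."""
    try:
        return SOCIAL_CLASS_MAP[value]
    except KeyError:
        raise ValueError(f"unknown social class code: {value!r}")

def violates_job_constraint(person):
    """True if the record has an occupation but is below the minimum working age."""
    return person.get("occupation") is not None and person["age"] < MIN_WORKING_AGE

# Two invented datasets that encode social class differently.
dataset_1 = [{"id": 1, "social_class": "A", "age": 34, "occupation": "notary"}]
dataset_2 = [{"id": 2, "social_class": "low", "age": 12, "occupation": "weaver"}]

merged = []
for record in dataset_1 + dataset_2:
    # Build a harmonized copy; the original record is left untouched.
    harmonized = dict(record, social_class=harmonize_class(record["social_class"]))
    harmonized["constraint_violation"] = violates_job_constraint(record)
    merged.append(harmonized)

for record in merged:
    print(record)
```

Flagging a constraint violation instead of rejecting the record keeps the source data intact, which matters when the "violation" may itself be historically meaningful.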

Tabular data. Tabular sources, which are typically represented by means of Excel spreadsheet workbooks (XLS files) or CSV files, may present a variety of semantic heterogeneities.

– Variable mismatch occurs when sources use variables which do not perfectly overlap. For example, a column marital status in one source and a column civil status in another may refer to the same concept (i.e. the attribute stating whether a person is married or not).

– Value mismatch may happen even when two sources use the same variable or concept. For example, two data tables may encode a variable occupation, but due to temporal, geographical or cultural differences the values for that variable may not be comparable.

– Problems with hierarchy in variables and values. This problem is best illustrated with tabular data. Since tables offer a very open model consisting of cells arranged in columns and rows, data may be represented according to some particular logic of hierarchical columns and rows. For example, it is very common to find split and spanned columns in tabular census sources, as well as rows with hierarchies that classify territories, occupations or social classes.
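The following Python sketch illustrates two of the tabular heterogeneities above: aliasing variable names that denote the same concept, and flattening a two-level header with spanned columns into one label per column. The alias table, the header layout and the convention that an empty upper-row cell continues the span to its left are assumptions made for this example.

```python
# Hypothetical alias table for variables that denote the same concept.
VARIABLE_ALIASES = {"civil status": "marital status"}

def canonical_variable(name):
    """Return the shared variable name for a source-specific column name."""
    key = name.strip().lower()
    return VARIABLE_ALIASES.get(key, key)

def flatten_headers(header_rows):
    """Combine stacked header rows into one label per column.

    In every row except the last, an empty cell is assumed to continue the
    spanned cell to its left and inherits that cell's label.
    """
    *upper_rows, bottom_row = header_rows
    filled_rows = []
    for row in upper_rows:
        filled, last = [], ""
        for cell in row:
            last = cell or last
            filled.append(last)
        filled_rows.append(filled)
    filled_rows.append(list(bottom_row))
    # Join the non-empty labels found in each column, top to bottom.
    return [" / ".join(p for p in parts if p) for parts in zip(*filled_rows)]

# Invented census-like table with two spanned columns.
header_rows = [
    ["municipality", "occupation", "married", "", "unmarried", ""],
    ["", "", "men", "women", "men", "women"],
]
data_row = ["Amsterdam", "baker", "70", "5", "50", "10"]

columns = flatten_headers(header_rows)
record = dict(zip(columns, data_row))
print(columns)
print(record)
```

Once every cell has a single, fully qualified column label, the table can be treated like any other flat tabular source.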

Semi-structured sources

Annotated corpora. An annotated corpus is one of the most common examples of semi-structured data. Such corpora are usually raw historical texts with annotations on defined text sections, usually implemented with a markup language like XML. However, semantic variations may be observed across these kinds of datasets.

– The annotation procedure by which annotations have been generated can differ between samples. We distinguish two ways of annotating corpora: manual annotation and automatic (or semi-automatic) annotation. If the corpus has been manually annotated, then divergence in criteria from one human expert to another may cause semantic mismatches when trying to make the datasets mutually compatible. If the corpus has been annotated by an algorithm, then semantic heterogeneity may have been introduced by the use of different algorithms or configurations.

– As with structured sources, especially hierarchical sources, the schemas or vocabularies used for annotation may differ from one historical sample to another, producing additional semantic gaps.

Unstructured sources

Raw corpora. Bringing structure into an unstructured source using natural language processing (NLP) techniques seems the best option. Two major concerns appear during this process: how to extract semantics (meaning) from historical texts, and how to represent the extracted semantics in a structured way.

– Different NLP pipelines and algorithms can be chosen to extract relevant information from unstructured historical datasets. Due to the wide variety of choices, several concerns regarding semantics arise here: the precision and recall of the technique used (e.g. when detecting persons in a text, what proportion of the persons mentioned in the source are actually detected), the pipeline being constructed (which depends strictly on the goal to be achieved), the nature of the text being analyzed, and the existence and availability of other related texts for supervised learning.

– Likewise, different vocabularies and schemas can be chosen to represent the extracted data. Languages like XML or RDF can be useful for representing data in a more structured way, moving towards a more formal representation of the concepts, individuals and relationships contained in historical texts. This also raises the question of which vocabulary or schema is most appropriate for convenient historical data modeling.
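The quality concern raised for person detection can be made concrete with the standard notions of precision and recall. The sketch below, in Python, compares the output of a hypothetical person detector with a manually established list; all names are invented for the example.

```python
def precision_recall(detected, gold):
    """Return (precision, recall) for a set of detected person mentions."""
    detected, gold = set(detected), set(gold)
    true_positives = len(detected & gold)
    # Precision: share of the detected mentions that are really persons.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of the persons in the source that were detected.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Invented data: persons actually mentioned in a text versus detector output.
gold_persons = {"Jan Jansen", "Maria de Boer", "Klaas Visser", "Anna Smit"}
detected_persons = {"Jan Jansen", "Maria de Boer", "Den Haag"}

precision, recall = precision_recall(detected_persons, gold_persons)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three detections are correct and two of the four persons are found; which of the two measures matters more depends on the goal of the pipeline.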

4. Semantic technologies for Historical Information Science

So far, we have given a general description of the state of the art in historical computing, highlighting the life cycle of historical data, describing convenient classifications for sources, and identifying the most relevant issues that history scholars still have to face at the borders where history and computing meet. As we show in this section, semantic technologies can provide solutions to some of these problems. First we offer a brief description of the relevant semantic technology, and then we describe how these technologies can be (and, in some cases, are currently being) applied to digital historical sources. Since the former is intended as an overview of the Semantic Web, most Semantic Web advocates may prefer to skip it and go directly to the latter.


4.1. The Semantic Web and semantic technologies

Semantic technologies do not require much introduction in an article in a Semantic Web journal. Still, we would like to refer briefly to this notion for scoping and reference. Throughout the paper we use Semantic technology and Semantic Web technology in the most general possible way, referring both to the existing standard technology and to novel extensions to come. More concretely, Semantic technologies are based on formal (usually symbolic) representation languages in which some meaning is encoded separately from data and content. Without overcommitting to any specific formalism, in the analysis of current work in the historical domain we usually start with the standardized Semantic Web languages and data models in mind, such as the Resource Description Framework (RDF14), the Web Ontology Language (OWL15) and the SPARQL Protocol and RDF Query Language (SPARQL16).

The purpose of this article is wider, though, than the study of particular languages or paradigms. In general we are interested in the challenges and potential of Semantic technology, which implies that any extension, variant or alternative of such methods is highly relevant to the work presented in this paper. This applies not just to alternative languages for representing vocabularies (such as SKOS17) or provenance (such as PROV-O18) but also to novel semantic formalisms, such as multi-dimensional, temporal, contextualized and spatial languages (and many more of this kind) that are currently considered as extensions of the standardized formalisms.

4.2. How can semantic technologies help with historical information science challenges?

So far, we have presented the most significant challenges regarding semantics in historical information science, as well as an overall introduction to the Semantic Web and linked data. But how are semantic technologies supposed to help solve these semantic issues in historical computing? This section presents a general view of how Semantic Web facilities like RDF* and OWL can help (and actually are helping) historians to bridge semantic gaps in their datasets, and describes in more depth the relations between the issues of historical datasets and the semantic tools identified.

14 http://www.w3.org/TR/rdf-primer/
15 http://www.w3.org/TR/owl-ref/
16 http://www.w3.org/TR/rdf-sparql-query/
17 http://www.w3.org/2004/02/skos
18 http://www.w3.org/2011/prov/

In section 3 we have discussed five well-known problems history scholars face when dealing with digital historical datasets: problems with historical sources, problems with relationships between sources, problems with historical analysis, problems with historical data presentation and visualization, and problems with data interoperability. The last of these belong more to the computing domain, but they must be taken into account, for example, in scenarios where disparate sources represented with different data models or formats need to exchange information or have to be queried uniformly. Still, those problem domains are too broadly defined to help analyze how semantic technologies can fit each of them.

Therefore, we introduce six more specific problems or challenges for which we think semantic technologies have proven helpful (see Figure 3). Among them is the lack of formalized historical domains: classifications and ontologies do exist, but not for all areas, not in Semantic Web languages and not always agreed upon. The absence of mechanisms for automatic inference, by which new, implicit historical knowledge can be derived, is another issue. While quite a lot of different data and information sources exist, they are not always interlinked. This isolation of historical data sources hampers their discovery, and it also limits how they can be further processed and connected. This is why we define the existence of non-interoperable historical data sources as a separate issue. Here we point in particular to the co-existence of incompatible data models, which profoundly hampers the exchange of information. It is considered bad practice in historical research not to get your historical data modeling right at the beginning. What is valuable in terms of scientific rigor can become a barrier to the comparison of different historical data models. Comparison requires a shared framework and models flexible enough to be matched to one another. We capture this with the notion of non-flexibility of historical data modeling. In the end, it forces history scholars to make their data selection and processing dependent on a certain data model that cannot be (easily) replaced or altered if needed; this usually happens in environments with very changeable requirements (like research) or requirements creep [17]. The non-flexibility of data models is related to a non-flexibility of historical data transformations. Digital historical data sources get altered in the life cycle of historical information (see Figure 1). But if update, enrichment, analytic and interpretative operations are not controlled, those transformations lead to a variety of historical data representations which can hardly be related to each other any more, neither in terms of provenance nor in terms of relatedness.

Some of these concrete issues are, obviously, subsets of the broader ones analyzed in section 3; for instance, one may say that the isolation of historical data sources is part of the problem of relationships between historical sources, or that non-interoperable historical datasets are part of the problem of semantic interoperability.

Figure 3 brings together the general historical data issues identified and the semantic technologies, and offers an overall perspective on which semantic tools may be useful in historical computing for solving issues typically found during the development of projects. Green cells, also marked with a (+) sign, denote semantic technologies which are strongly related to concrete solutions for a historical computing problem. Yellow cells, also marked with a (+/-) sign, identify tools which are related to the described problems, but to a lesser degree than the ones marked in green (+). Finally, red cells, also marked with a (-) sign, are assigned to technologies which, though in some cases it could make sense to apply them, are less related as solutions to those problems. In the following we describe some steps to address those problems. While doing this we also indicate which of the five selected semantic technologies are useful for which problem, explaining the decisions behind Figure 3.

4.2.1. Creating controlled vocabularies and ontologies for historical information as a way to address lack of formalization

The first problem found when trying to model historical datasets in RDF is the lack of controlled vocabularies for describing historical facts. Although some ontologies have been developed for describing events19 (such as the Simple Event Model [35]), these models are insufficient for the vast amount and variety of historical data that still has to be published in the web of data, especially when key issues for historians like interpretations or evidence need to be modeled and conveniently linked as well. Historical ontologies and vocabularies have become a reality in recent approaches. OWL ontologies describing classes and properties of some historical concern, such as concepts around the Pearl Harbor attack in 1941 [90], are an exciting modeling exercise for researchers but also a necessary step towards better structuring of historical information on the web. OWL ontologies and RDF vocabularies offer a way of controlling the predicates, classes, properties and terms that the community uses as a standard for describing factual and terminological knowledge about history. Designing good ontologies for historical domains is also an area with plenty of challenges: how can ontologies capture the many conceptions of a historical reality depending on the temporal dimension of the events described [91]? Moreover, how can differences in meaning and in relations between concepts be traced, as time and historical realities change these concepts [36]? These questions, which involve semantic technologies, knowledge acquisition and knowledge modeling techniques, are not yet completely understood and are a significant challenge in semantic historical information science research. On the other hand, over the centuries dictionaries, thesauri and classification systems have been developed which are relevant here: think of classifications of historical occupations, or of historical dictionaries and lexicons. How to bring those specifically grown ordering principles to the web in a way that makes them explorable and linkable to other ontologies is an interesting challenge, which requires close collaboration between historians, who know and design those specific tools, and computer scientists, who often rely on much broader and more generic ontologies.

19 http://labs.mondeca.com/dataset/lov/

Fig. 3. Semantic technologies (columns) supporting historical computing issues (rows).

4.2.2. Inferencing or implicit knowledge discovery by means of ontologies

From the point of view of linked data, ontologies and vocabularies are designed to control the terms in which datasets may express data, as well as the data model in which these data are represented. From a more Semantic Web perspective, however, one may expect these ontologies and vocabularies to facilitate the discovery of new knowledge; that is, to make explicit some implicit fact that is not trivial for the human eye to deduce, especially in big knowledge bases.

Indeed, OWL historical ontologies can be used to facilitate historical knowledge discovery through inference, using OWL reasoners. Assuming that a particular domain is completely formalized (as far as the logical capabilities of OWL allow), it is possible to run a reasoner to highlight derived, inferred facts that were not present in the original model as explicit knowledge (i.e. knowledge that was directly introduced by the user as input), but that were there as implicit knowledge. For instance, if an ontology describes, on the one hand, the fact that a letter was sent from one government diplomatic institution to another and, on the other hand, the fact that government diplomatic institutions have a person responsible for sending and receiving letters, then it may be possible for the reasoner to infer the concrete persons who sent and received, respectively, the letter in question. As the knowledge base grows, implicit knowledge becomes less evident than it was in the beginning, and reasoners may save an enormous amount of work and produce highly valuable pieces of historical knowledge.
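The letter example can be imitated with a small rule-based sketch in Python. This is a hand-written forward-chaining loop, not an OWL reasoner, and all identifiers (sentFrom, sentTo, hasCorrespondent, sentBy, receivedBy) are hypothetical; in OWL 2 a comparable effect might be obtained with property chain axioms.

```python
# Explicit knowledge, stated as (subject, predicate, object) triples.
# All identifiers are hypothetical and chosen only for this illustration.
facts = {
    ("letter42", "sentFrom", "ministryA"),
    ("letter42", "sentTo", "embassyB"),
    ("ministryA", "hasCorrespondent", "person_jansen"),
    ("embassyB", "hasCorrespondent", "person_devries"),
}

def infer(facts):
    """Apply two rules until no new triple appears; return only derived triples.

    Rule 1: letter sentFrom X and X hasCorrespondent P  =>  letter sentBy P
    Rule 2: letter sentTo X   and X hasCorrespondent P  =>  letter receivedBy P
    """
    known = set(facts)
    while True:
        new = set()
        for letter, p1, institution in known:
            for inst2, p2, person in known:
                if inst2 != institution or p2 != "hasCorrespondent":
                    continue
                if p1 == "sentFrom":
                    new.add((letter, "sentBy", person))
                elif p1 == "sentTo":
                    new.add((letter, "receivedBy", person))
        if new <= known:  # fixpoint reached: nothing new was derived
            return known - set(facts)
        known |= new

inferred = infer(facts)
for triple in sorted(inferred):
    print(triple)
```

The derived triples were never entered by the user; they follow from combining the two kinds of explicit statements, which is the sense of implicit knowledge used above.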

4.2.3. Data integration as a means to link isolated data sources

One of the big claims of linked data is that, by linking datasets, the relations established between nodes of these datasets greatly enrich the information contained in them. That way, browsing datasets is no longer an isolated task: by allowing users (and machines) to explore URI entities through their predicate links, data acquire new meanings, countless contexts and useful perspectives for historians.

For example, consider a scenario with three different SPARQL endpoints exposing RDF triples of a census with occupational data, a historical register of labour strikes, and a generic classification system for occupations (in the context of one particular country, for instance). Suppose that the occupational census exposes triples with counts on occupations (for example, how many men and women worked in a particular occupation in a given city), the historical register of labour strikes contains counts of how many people participated in labour strikes (number of women and men, per occupation and city), and the generic classification system harmonizes the names of the occupations between the two previous datasets (for example, it gives a common code for occupation names that may vary between census occupations and labour strike occupations). Then several SPARQL queries can be constructed that give the historian very meaningful and interesting linked data. For instance, such a query may return, given a city and an occupation code, what proportion of men and women took part in a particular well-known labour strike. Another SPARQL query may return a list of historical labour strikes ordered by relevance, according to several indicators (strike success ratio, total number of workers on strike, density of people on strike depending on the location, etc.). The possibilities increase if we think of more related historical sources to link, like datasets describing historical weather or historical geographical names and areas.
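The first of these queries can be mimicked with a small Python sketch that joins three in-memory stand-ins for the three endpoints. This is not SPARQL, only an illustration of the join that such a query would express; the data layout, the occupation labels, the codes and all counts are invented for the example.

```python
# In-memory stand-ins for the three endpoints; all values are invented.
census = [  # (city, occupation label as written in the census, men, women)
    ("Amsterdam", "bakker", 120, 30),
    ("Amsterdam", "wever", 200, 100),
]
strikes = [  # (strike, city, occupation label in the strike register, men, women)
    ("strike_1903", "Amsterdam", "broodbakker", 60, 6),
]
classification = {  # occupation label -> shared (hypothetical) occupation code
    "bakker": "OCC-001",
    "broodbakker": "OCC-001",
    "wever": "OCC-002",
}

def participation_ratio(strike_id, city, code):
    """Share of men and women in a city and occupation who joined a strike."""
    workers = {"men": 0, "women": 0}
    for c_city, label, men, women in census:
        # The classification links differently spelled labels to one code.
        if c_city == city and classification.get(label) == code:
            workers["men"] += men
            workers["women"] += women
    strikers = {"men": 0, "women": 0}
    for s_id, s_city, label, men, women in strikes:
        if s_id == strike_id and s_city == city and classification.get(label) == code:
            strikers["men"] += men
            strikers["women"] += women
    return {sex: (strikers[sex] / workers[sex] if workers[sex] else None)
            for sex in workers}

ratios = participation_ratio("strike_1903", "Amsterdam", "OCC-001")
print(ratios)
```

The shared occupation code plays the role of the linking vocabulary: without it, the census label and the strike register label would not be recognized as the same occupation.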

Although this kind of study can be (and actually is) carried out without them, applying semantic technologies and Semantic Web principles as in the example above may provide significant improvements. First, data retrieval is handled automatically by SPARQL: once pointed to the appropriate endpoint(s), the triples needed to answer the query are fetched automatically. Second, data comes with the warranty of the authority providing the SPARQL endpoint: well-known domain names embedded in entity-defining URIs are the signature of the institution issuing the data. Third, potential semantic heterogeneity problems with the data are solved if appropriate vocabularies are applied. Fourth, SPARQL technology provides handy mechanisms to automatically perform the data transformations needed, according to user requirements. Last but not least, results from SPARQL queries can be customized in structure (i.e. which variables are returned) and format (raw text, HTML, JSON, CSV, etc.), which makes it easy to plug them directly into visualization tools (and in general into any tool consuming the requested data).

4.2.4. Data interoperability and new search & retrieval possibilities

Data interoperability is one of the major challenges to be achieved in historical databases. Very often, research efforts tend to divide research tasks into several teams, project groups or even research calls, fragmenting the data modeling process across several database schemas, and thus making the data impossible to query in a uniform way. In section 3 we have identified a set of semantic interoperability problems that may appear depending on the type of structure in the source historical datasets.

How can these data problems be addressed? A major advantage of linked data is that exposed RDF datasets, when published through controlled vocabularies, are interoperable enough to enable cross-querying without manually solving semantic or schema heterogeneities. That is, they can be uniformly queried as long as they share standard vocabularies to expose common factual knowledge. SPARQL queries can homogeneously query these endpoints as long as the user knows the vocabularies being used in the remote graphs.
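A minimal sketch of this idea follows, with two tiny in-memory graphs standing in for remote endpoints. All identifiers (the `voc:`, `occ:`, `census:` and `strike:` prefixes) are hypothetical; the point is only that a single pattern can be evaluated against both graphs because they share the predicate `voc:occupation`.

```python
def match(graph, pattern):
    """Yield the triples of a graph that fit a (s, p, o) pattern; None is a wildcard."""
    for triple in graph:
        if all(want is None or want == have for want, have in zip(pattern, triple)):
            yield triple


# Two independently published graphs that happen to share one vocabulary.
CENSUS_GRAPH = {
    ("census:obs1", "voc:occupation", "occ:OCC-1"),
    ("census:obs1", "voc:city", "Amsterdam"),
}
STRIKE_GRAPH = {
    ("strike:s7", "voc:occupation", "occ:OCC-1"),
    ("strike:s7", "voc:city", "Amsterdam"),
    ("strike:s8", "voc:occupation", "occ:OCC-2"),
}


def subjects_with_occupation(graphs, code):
    """Run the same pattern over every graph and collect the matching subjects."""
    found = []
    for graph in graphs:
        found.extend(s for s, _, _ in match(graph, (None, "voc:occupation", code)))
    return sorted(found)


print(subjects_with_occupation([CENSUS_GRAPH, STRIKE_GRAPH], "occ:OCC-1"))
# prints ['census:obs1', 'strike:s7']
```

If the strike register used its own predicate for occupations, the same pattern would silently miss its records, which is why shared vocabularies matter.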

4.2.5. Flexibility, changeability and mapping of data models

It is a well-known problem that the choice of a particular data model to represent historical data is a critical issue for most historical computing projects. Moreover, choosing some 'appropriate' data model may seem a good design decision at some stage of the project, but new requirements, research directions or stakeholder priorities may turn that data model (once the most suitable one) into a nightmare. Indeed, flexibility of the data with respect to the data model used to represent historical facts is desirable to avoid restructuring entire databases again and again.

Applying semantic technologies and linked data principles to historical datasets may have a major advantage regarding historical data workflows: complete flexibility in the historical data modeling process. As explained before, two different approaches to historical data modeling have traditionally been followed in historical computing: the source-oriented representation, and the model-oriented (also known as goal-oriented) representation [86]. The source-oriented representation tries to design a database schema which respects the structure of the historical source. As opposed to this, the model-oriented representation models historical data according to user or application goals, thus committing to a particular view on the historical reality being encoded. Both have advantages and drawbacks, and several examples follow a hybrid approach between them.

Semantic technologies allow a flexible representation of historical datasets. RDF represents factual knowledge by means of triples, thus expressing data at a fine granularity. With this in mind, RDF triplestores can act as a middleware representation for further views on the data [23], which can be modeled as close to any particular historical interpretation as needed. This way, the decision of which data model best suits the historical source can be postponed until the very end of the workflow design, or adopted as early as the user may desire, providing complete flexibility regarding the data schema. This feature also saves historians from designing their databases over and over again, avoiding spending resources on data migrations.
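The following sketch illustrates how one set of triples can serve both a source-oriented and a model-oriented view without restructuring the underlying data. It is an assumption-laden toy: the predicates (`src:page`, `src:rawOccupation`, `model:occupationCode`) and the records are invented for illustration.

```python
# Hypothetical triples: source-level facts and an interpretation side by side.
TRIPLES = [
    ("rec:1", "src:page", "12"),
    ("rec:1", "src:rawOccupation", "wever"),
    ("rec:1", "model:occupationCode", "OCC-1"),
    ("rec:2", "src:page", "13"),
    ("rec:2", "src:rawOccupation", "dokwerker"),
    ("rec:2", "model:occupationCode", "OCC-2"),
]


def view(triples, predicates):
    """Build one row per subject, keeping only the requested predicates."""
    rows = {}
    for s, p, o in triples:
        if p in predicates:
            rows.setdefault(s, {})[p] = o
    return rows


# Two views derived late, from the same untouched triples.
source_view = view(TRIPLES, {"src:page", "src:rawOccupation"})
model_view = view(TRIPLES, {"model:occupationCode"})

print(source_view["rec:1"])  # prints {'src:page': '12', 'src:rawOccupation': 'wever'}
print(model_view["rec:1"])   # prints {'model:occupationCode': 'OCC-1'}
```

Adding a third view later only requires another call to `view`; no migration of the stored triples is involved.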

4.2.6. Flexibility regarding data transformations (non-destructive updates)

Another big problem when dealing with historical sources is supporting data transformations under two constraints: (a) without modifying source data (so the originals stay intact); and (b) with the ability to trace all the changes performed on the data. Destructive updates are, thus, a major concern while selecting, aggregating and modifying data. On the one hand, specific kinds of digital files (such as CSV, spreadsheets or XML documents) do not support non-destructive updates: if one wants to keep the original data, some versioning system has to be maintained so previous states can be retrieved. On the other hand, relational databases can be inefficient if all transformations, edits and manipulations have to be recorded using a certain provenance model.

Non-destructive updates and specific data views are both supported by the CONSTRUCT and SELECT SPARQL queries. The former allows the construction of RDF triples according to the supplied graph pattern, facilitating data transformations without altering the consistency of previous factual knowledge in the knowledge base. The latter, as described before, permits the selection of triples (again, according to some graph pattern) and the arrangement of the data they contain in any desired view format (for example, columns matching some interesting historical variables). This way, SPARQL can provide features equivalent to the UPDATE and SELECT SQL queries, plus non-destructive updates and the ability to retrieve previous data transformations and states, via the application of some provenance model for linked data.
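The behaviour of a CONSTRUCT-style transformation can be imitated as follows. This is a sketch under stated assumptions, not SPARQL itself: the source triples, the label-to-code mapping and the activity identifier are hypothetical, and attaching `prov:wasGeneratedBy` directly to a Python tuple is a simplification of what a real provenance model (for instance with named graphs) would do.

```python
# Hypothetical source data; a frozenset makes accidental modification impossible.
SOURCE = frozenset({
    ("rec:1", "src:rawOccupation", "wever"),
    ("rec:2", "src:rawOccupation", "dokwerker"),
})
MAPPING = {"wever": "OCC-1", "dokwerker": "OCC-2"}  # invented label-to-code table


def construct_codes(source, mapping, activity):
    """Derive new triples from a pattern; return them with provenance statements."""
    derived = set()
    provenance = set()
    for s, p, o in sorted(source):
        if p == "src:rawOccupation" and o in mapping:
            new_triple = (s, "model:occupationCode", mapping[o])
            derived.add(new_triple)
            # Record which transformation run produced the derived triple.
            provenance.add((new_triple, "prov:wasGeneratedBy", activity))
    return derived, provenance


derived, provenance = construct_codes(SOURCE, MAPPING, "activity:coding-v1")
print(len(SOURCE), len(derived), len(provenance))  # prints 2 2 2
```

The source set is never written to, so an earlier state is always recoverable, and the provenance set says which run produced each derived statement.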


5. Current Historical Semantic Web

In section 3 and section 4 we reviewed concepts and problem definitions as articulated in historical information science and mapped them to approaches developed by the Semantic Web community. The goal of this operation was to identify possible bridgeheads for common problem-solving processes on a conceptual level. In preparation for such a shared conceptualization we reviewed a variety of publications, but also projects, datasets, and technologies relevant for the foundation of a long-term collaboration between specialists from both fields. We built the collection starting from some key publications [86] and main journals (Historical Methods, Journal of the Association for History and Computing), and conducted about 8 interviews with pioneers in this area in the Netherlands. We accompanied this with an extensive web search. Still, we are aware that our selection covers only a small part of the relevant literature.

Another specific characteristic of this survey is that the information stems from at least two different communities: history and computer science. That makes the set very heterogeneous, but also very special. Ultimately, our selection is driven by the curiosity to identify relevant contributions to semantic research questions. Here, with semantic we do not point exclusively to an understanding as cultivated in computer science. We identify semantic research questions as shared objects of concern for both historians and computer scientists. How to achieve such a shared understanding has been elaborated in the previous sections. In this final section we present an analysis and a categorization of 67 sources we have found to be relevant for depicting the roadmap for a historical Semantic Web.

The sources are divided into the categories of scientific papers, research projects, online resources (like presentations or online articles), and tools and technologies (like ontologies, demos, applications or programming libraries). Figures 4, 5, 6, 7 and 8 show these sources and a classification based on the general goals they are aiming at. We have used the colors green, yellow and red, and correspondingly the signs (+), (+/-) and (-), to express that a particular work is strongly related, more or less related, or weakly related, respectively, to each one of these general goals.

The first identified semantic research question is knowledge modeling and ontologies, and under this category we consider contributions that intensively apply technology to model historical knowledge or historical facts, essentially using semantic technologies such as RDF, OWL and SPARQL. In text processing and mining we gather all the work that at some stage deals with unstructured text, facing problems like text storage in convenient database systems, or automatic or semi-automatic entity extraction (such as events or persons) via NLP techniques. In search and retrieval we include systems that exploit semantic formalisms as a new way of indexing, querying and accessing historical data, instead of relying on the traditional text-based or keyword-based algorithms for information retrieval. Finally, under the category of semantic interoperability we analyze to what extent contributions consider the problem of data integration and use the Semantic Web to deal with that problem, which often involves data model mismatches, schema incompatibilities and disparate source formats. In the following we describe those generic semantic questions in greater detail, and we review selected papers shown in Figures 4, 5, 6, 7 and 8 to illustrate how the identified questions are reflected in the literature. In addition, there are a couple of sources which cannot be unambiguously allocated to one of the four questions. Those we review under a last group called holistic approaches.

5.1. Knowledge modeling and ontologies

Under this category we have grouped work that contributes to a semantically enabled historical web through the following main streams of research: classifying historical knowledge by means of classification systems; building and publishing historical structured data; and designing historical data models and ontologies.

5.1.1. Classification systems (and census management)

When dealing with vast amounts of historical data, classification systems are a necessity in order to organize and make sense of the data. The main goal of a classification system is therefore to put things into meaningful groups [4]. This entails an allocation of classes which are created according to certain relations or similarities.

The main issue with historical classification systems is that they are not consistent over time, making comparative historical studies problematic. Historical census data is a typical example of this problem. Census data are the only historical data on population characteristics that are not strongly distorted, and they constitute an extremely valuable source of information for researchers [27].


Fig. 4. Research contributions to historical Semantic Web (rows) related to specific semantic issues (columns) 1/5.

However, major changes in the classification and coding of the different censuses have hindered comparative historical research in both past and present efforts [26]. Researchers are forced to create their own classification systems in order to answer their research questions; however, this process often results in disparate systems which are not comparable, contain a lot of expert knowledge and different interpretations of the data, and cannot easily be (re)used by other researchers. The fact that many of the modeling techniques are destructive in nature (we cannot go back to the source) makes it even more cumbersome to comprehend these sources. In order to deal with the changing classifications and vast differences at both national and international level, we need to bridge the gaps between the datasets and conform to certain standard classification systems.

Fig. 5. Research contributions to historical Semantic Web (rows) related to specific semantic issues (columns) 2/5.

Fig. 6. Research contributions to historical Semantic Web (rows) related to specific semantic issues (columns) 3/5.

Several significant efforts have been made in this direction. The IPUMS International project, for example, faces the problem of bridging 8 different occupational classification systems and a total of 3200 different categories, containing the richest source of quantitative information on the American population. The North Atlantic Population Project (NAPP) provides a machine-readable database of nine censuses from several countries. The main focus of the NAPP project is to harmonize these data sets and link individuals across different censuses for longitudinal and comparative analysis. Their linking strategy involves the use of variables which (theoretically) do not change over time. In this process records are only checked if there is an exact match for some variables, such as race and state of birth. Other variables, like age and name, are permitted to have some variation. Another significant historical classification system is the Historical International Standard Classification of Occupations (HISCO). As occupation is one of the most problematic variables in historical research (SOURCES), HISCO aims to overcome the problem of changing occupational terminologies over time and space. This system combines various kinds of information on tasks and duties in historical settings, by classifying tens of thousands of occupational titles and linking these to short descriptions and images of the content of the work²⁰. Naturally there are many other classification systems which aim to put historical data into meaningful groups. However, the common characteristic of all these historical classification systems is that they contain a wealth of semantics, which offers a great stimulus for the field of historical information science to look for novel semantic technologies, such as RDF, in order to deal with old problems with contemporary techniques.

20 http://hisco.antenna.nl/
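The record-linking rule described above for NAPP (exact agreement on variables that should not change, tolerance on age and name) can be sketched as follows. This is not the project's actual algorithm: the thresholds, the field names and the use of `difflib` string similarity are assumptions chosen only to make the rule concrete, and the records are invented.

```python
from difflib import SequenceMatcher


def same_person(a, b, max_age_error=2, min_name_similarity=0.8):
    """Decide whether two census records plausibly describe one individual."""
    # Variables assumed not to change over time must agree exactly.
    if a["race"] != b["race"] or a["birthplace"] != b["birthplace"]:
        return False
    # Age is allowed to deviate slightly from what the census interval predicts.
    expected_age = a["age"] + (b["census_year"] - a["census_year"])
    if abs(expected_age - b["age"]) > max_age_error:
        return False
    # Names are compared approximately to tolerate spelling variation.
    similarity = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return similarity >= min_name_similarity


# Invented records from two hypothetical census years.
earlier = {"name": "Jon Smith", "age": 30, "race": "code-1",
           "birthplace": "Ohio", "census_year": 1880}
later = {"name": "John Smith", "age": 51, "race": "code-1",
         "birthplace": "Ohio", "census_year": 1900}
elsewhere = dict(later, birthplace="Iowa")

print(same_person(earlier, later))      # prints True
print(same_person(earlier, elsewhere))  # prints False
```

Checking the exact-match variables first keeps the more expensive fuzzy name comparison for candidate pairs that have already passed the strict test.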

Fig. 7. Research contributions to historical Semantic Web (rows) related to specific semantic issues (columns) 4/5.

Fig. 8. Research contributions to historical Semantic Web (rows) related to specific semantic issues (columns) 5/5.

There are two main alternatives when applying any classification or model to historical data. It is important to decide at an early stage whether the data should be modeled according to a source-oriented or a goal-oriented approach. The source-oriented approach aims to postpone enforcing any standards or classifications, to resemble the underlying source data as closely as possible (schema-free representation), and hence to allow room for multiple interpretations of the data. In recent years special attention has been given to modeling techniques such as RDF-based representations. By opting for a Linked Data model such as RDF, we can avoid an early commitment to a certain data model and build a database in a non-destructive manner. The other approach is the goal-oriented or model-oriented approach. Historical data is often plagued with inconsistencies, changing structures and classifications, redundant or erroneous data and so forth. The goal-oriented perspective therefore advocates the use of sounder data models to start with. This means restructuring the data according to certain views or goals, which mainly depend on expert knowledge, and committing to such a model at an early stage.

5.1.2. Building and publishing historical structured data

Although they are not pure initiatives to publish historical datasets in Semantic Web formats, there are so many contributions extracting and modeling historical information that they cannot all be described here. However, some researchers in history have centered their interest on how semantics can help to relate and link historical sources and entities, regardless of the underlying technology: historical semantic networks are a computer-based method for working with historical data. Objects (e.g., people, places, events) can be entered into a database and connected to each other relationally. Both qualitative and quantitative research could profit from such an approach [92]. Indeed, the landscape of current projects exposing structured historical information extracted from unstructured sources shows a tendency towards more and more contributions exposing data in RDF.

There is a huge variety of projects looking for that structure, though not doing so solely (or explicitly) in RDF. For instance, the CKCC project in the Netherlands tries to formalize an epistolary network for the circulation of knowledge in Europe in the 17th century, extracting such knowledge from the correspondence of scholars of that time. Along similar lines, the CCed project does so with clerical careers from the Church of England Database. While these projects mine the historical sources for important historical personalities, other approaches such as the SAILS project dive into more concrete historical events and link World War I naval registries, although ship’s personnel is also analyzed and linked as a primary goal. As the reader may have deduced, the common goal seems to be to produce a semantic network of historical data containing objects like people, places and events connected to each other, which overlaps with the general purpose of the Web of Data (WoD).

Other projects make explicit use of semantic technologies and expose their datasets using RDF, describing them with well-known vocabularies and thus facilitating their linkage to others. For instance, the Agora project aims at formally describing museum collections and linking their objects with historical context, using the SEM (Simple Event Model) for those descriptions. Historical events, however, can also be extracted from other historical sources, like government letters and memoranda. This is the aim of the FDR Pearl Harbor project, which links events, persons, dates, and correspondence between the US and Japanese governments surrounding the Pearl Harbor attack in 1941: these entities are represented in RDF as well, to model a graph of historical knowledge about that particular event. From a more socio-demographic point of view, the Verrijkt Koninkrijk project links RDF concepts found in a structured version of De Jong’s studies on the pillarization of Dutch society after World War II. This set of approaches, which share the step of extracting RDF entities from historical sources, can be summarized with the Fawcett toolkit and the Armadillo project. The latter exports RDF from any unstructured historical source, producing a particular graph of historical knowledge that encodes the historical entities and their relationships expressed in that source.

5.1.3. Designing historical ontologies and data models

Data models are necessary for giving structure to any historical data, since it is the abstract model that documents and organizes data properly for communication. In this direction, semantic data models are preferably developed in the Web Ontology Language (OWL), but other efforts do so with XML-based formats.

Regarding XML we found the Historical Event Markup and Linking Project (HEML²¹), which aims at providing a specific markup language for describing historical events. In a more concrete approach, The Semantic Web for Family History²² exposes a set of genealogy markup languages based on XML to semantically tag genealogical information in sources containing that kind of historical data. In the context of the Text Encoding Initiative (TEI²³) we find an interesting discussion building the bridge between XML and OWL in historical data: SIG: Ontologies²⁴ contains a full log of contributions on how to use ontologies with TEI formats; namely, how TEI XML-encoded documents can refer to historical concepts and properties that have been previously formalized in an external OWL ontology.

21 http://heml.mta.ca/heml-cocoon/description#N10098
22 http://jay.askren.net/Projects/SemWeb/
23 http://www.tei-c.org/index.xml
24 http://wiki.tei-c.org/index.php/SIG:Ontologies


In the pure semantic technology world, historical ontologies in OWL have begun to be modeled. Though some interesting studies point out specific modeling needs of the historical domain (e.g. how historical ontologies should reflect the way a particular time frame influences definitions of concepts [91]), most practical results show that the concept of event is at the core of historical knowledge modeling. In this line we find the Simple Event Model (SEM), used to model RDF in the Agora project, but also the Event Ontology²⁵ and efforts on Linking Open Descriptions of Events²⁶, even though their usage may be subject to particular modeling needs.

5.2. Text processing and mining

Textual resources play an important role in historical research: for example, primary sources such as letters written by historically important people, but also secondary sources that may describe historical events or persons.

In the Netherlands, three projects are currently underway that are concerned with structuring historical information from textual resources for further analysis: Agora²⁷, Bridge²⁸ and HiTime²⁹. The CKCC project³⁰ is also relevant in this context.

The Agora project aims to enrich museum collections with historical context in order to help users place museum objects in their historical contexts. The Agora project is thus explicitly aimed at the general public. To this end, Agora employs information extraction techniques from statistical natural language processing to extract named entities (actors, locations, times, event names) from textual resources such as Wikipedia and collection catalogues, which are used to populate SEM instances. Relevant historical entities are also extracted from the object descriptions, and these can then be linked to the events. In this fashion, it is possible to relate an object such as the ship model of the Medusa³¹ to the Battle of Shimonoseki, even though the Battle of Shimonoseki.
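The object-to-event linking described above can be sketched with a handful of triples. The sketch assumes the SEM terms `sem:Event` and `sem:hasActor` under the namespace shown, which should be checked against the published vocabulary; the `example.org` identifiers, the `depicts` property and the function `events_for_object` are hypothetical and do not reproduce Agora's actual data or code.

```python
SEM = "http://semanticweb.cs.vu.nl/2009/11/sem/"  # assumed SEM namespace
EX = "http://example.org/"                         # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

TRIPLES = {
    # An event instance, as might be populated from extracted named entities.
    (EX + "battle-of-shimonoseki", RDF_TYPE, SEM + "Event"),
    (EX + "battle-of-shimonoseki", SEM + "hasActor", EX + "medusa"),
    (EX + "battle-of-shimonoseki", SEM + "hasPlace", EX + "shimonoseki"),
    # An entity extracted from a museum object description.
    (EX + "ship-model", EX + "depicts", EX + "medusa"),
}


def events_for_object(triples, museum_object):
    """Find events whose actors are entities depicted by the museum object."""
    depicted = {o for s, p, o in triples
                if s == museum_object and p == EX + "depicts"}
    events = {s for s, p, o in triples
              if p == RDF_TYPE and o == SEM + "Event"}
    return sorted(s for s, p, o in triples
                  if s in events and p == SEM + "hasActor" and o in depicted)


print(events_for_object(TRIPLES, EX + "ship-model"))
# prints ['http://example.org/battle-of-shimonoseki']
```

The link between object and event is never stated directly; it follows from the shared entity, which is what makes entity extraction on both sides worthwhile.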

The Bridge project aims to bring more cohesion into Dutch television archives by finding relevant links between the official archives maintained at the Netherlands Institute for Sound and Vision and other information sources such as program guides and broadcasting organizations' websites. The Bridge project is focused on improving access to television archives for media professionals. In order to do so, relevant entities are extracted from archives by using statistical natural language processing techniques. Furthermore, the project will detect interesting events in television archives by detecting redundant stories.

25 http://purl.org/NET/c4dm/event.owl#
26 http://linkedevents.org/ontology/
27 http://agora.cs.vu.nl
28 http://ilps.science.uva.nl/node/735
29 http://ilk.uvt.nl/hitime
30 http://ckcc.huygens.knaw.nl/
31 http://www.rijksmuseum.nl/collectie/NG-MC-485/halfmodel-van-het-schroefstoomschip-medusa

The HiTime project is aimed at detecting and structuring biographical events. To this end they analyze biographies of persons from Dutch union history to create timelines that tell the life story of these persons, and social networks of the persons they interacted with.

5.3. Improved search and retrieval

It is not a coincidence that a high number of papers, projects and tools that aim at extracting RDF (or, more generally, structured data) about entities from historical sources also point at some desired system able to improve search and retrieval of those historical entities. Indeed, by constructing a semantic graph of historical knowledge, search and retrieval of that knowledge, as well as indexing systems that give exact pointers to the sources in which particular historical entities (such as persons or places) are mentioned, can be easily built and improved, especially compared to purely textual, keyword-based search systems. The Agora (museum collections), Bridge (historical TV metadata), CHORAL (historical audio metadata), HiTime (biographical events), Verrijkt Koninkrijk (concepts of Dutch post-war social clusters) and FDR Pearl Harbor (historical events around the Pearl Harbor attack in 1941) projects are all good examples of this tendency: once the knowledge is successfully extracted from the historical sources and formalized appropriately (with RDF, vocabularies and ontologies, although other tools are also used), entities structured this way can be used for graph-based search and retrieval, for instance through SPARQL queries. Other projects, like the H-BOT³² project, use a natural language interface instead of a query system for querying such historical structured knowledge (in this case, mined from historical Wikipedia articles).
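The difference between keyword-based search and an entity index with exact source pointers can be sketched as follows. The documents, the person names and the identifiers (`person:jan`, `person:johanna`) are invented, and the list of mentions stands in for the output of a hypothetical entity-extraction step.

```python
# Invented source snippets.
SOURCES = {
    "letter-12": "J. de Vries wrote to the union board.",
    "memo-3": "Chairman de Vries received the memorandum.",
    "letter-40": "Johanna de Vries is mentioned in passing.",
}
# Hypothetical extraction output: (document, surface form, entity identifier).
MENTIONS = [
    ("letter-12", "J. de Vries", "person:jan"),
    ("memo-3", "Chairman de Vries", "person:jan"),
    ("letter-40", "Johanna de Vries", "person:johanna"),
]


def keyword_search(sources, keyword):
    """Return documents whose text contains the keyword, ignoring case."""
    return sorted(doc for doc, text in sources.items()
                  if keyword.lower() in text.lower())


def build_entity_index(mentions):
    """Map each entity identifier to the documents that mention it."""
    index = {}
    for doc, _surface, entity in mentions:
        index.setdefault(entity, set()).add(doc)
    return index


print(keyword_search(SOURCES, "de Vries"))
# prints ['letter-12', 'letter-40', 'memo-3']
print(sorted(build_entity_index(MENTIONS)["person:jan"]))
# prints ['letter-12', 'memo-3']
```

The keyword search cannot tell the two persons apart, whereas the index returns only the sources about the intended one, however the name is spelled in each source.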

Indexing historical content is another way of improving search and retrieval of historical sources. Indexing and historical data storage systems have a long tradition [86]. CLIO [100] is a traditional example of such a system; nowadays indexing is performed by XML annotation-oriented approaches, such as HEML (which also includes facilities for visualizing and searching for concrete historical entities across sources) or TEI with ontology support, which again bridges markup languages and Semantic Web formats. These initiatives should consider the emerging RDFa, microformats and microdata technologies to assess how they fit in the vast domain of historical text annotation systems.

32 http://chnm.gmu.edu/tools/h-bot/

5.4. Semantic interoperability

Semantic interoperability has much to do with data integration, namely, how to query in a common way and uniformly represent data that come from disparate sources (i.e. fitting several, probably incompatible, data models). There is a high number of publications dealing with this problem, especially in classification systems [88,89,94,99]. The HISCO coding system³³ for historical occupations deals directly with such a problem: occupations are encoded very differently if some work has to be performed against a variety of historical sources that come from different times, regions or languages.

Semantic heterogeneity of historical sources is especially present in social history projects, which also have this problem of very disparate sources: the LINKS project (reconstruction of families), the CEDAR project³⁴ (exposure of historical Dutch census data in Semantic Web formats), and the North Atlantic Population Project (exposure of microdata from several Atlantic countries) share this problem of data harmonization, in which the heterogeneity of sources requires intense work on how to resolve data model inconsistencies among datasets. The work developed in Spatial cyberinfrastructures, ontologies, and the humanities [29] also has to be mentioned in the more general context of the humanities, but also concerning history: its deep analysis deals with how semantic heterogeneity can be addressed exclusively with semantic technologies, achieving success in environments with very disparate data models.

33 http://hisco.antenna.nl/
34 http://www.cedar-project.nl/

5.5. Holistic approaches

Finally, there are a few, but wide-ranging, contributions we have classified as holistic, because they cover a complete workflow on semantic issues present in historical datasets, and solve some concrete historical problem (see section 3) with the application of some specific semantic technology (see section 4).

The Agora project is one of these contributions, as it generates historical RDF of events extracted using NLP techniques from unstructured texts, uses it for enhanced search and retrieval, reduces semantic heterogeneity and gives context by linking to other datasets. In a related line of research we can place the Fawcett toolkit and the Verrijkt Koninkrijk project; while the former also extracts event-oriented RDF triples from unstructured texts, and additionally allows historians to install a full semantic toolbox with widgets to experiment with their data, the latter is another example of how historical RDF datasets become much richer when they are linked to one another. The FDR Pearl Harbor project also contributes to this line, applying NLP to disparate sources, and additionally opening up the very promising field of historical knowledge inference through the formalization and usage of historical OWL ontologies. A good abstract approach, which summarizes the generic idea behind all these contributions, is the Armadillo architecture of Semantic Web Services, which covers all aspects of historical data mining and historical RDF generation in the context of the Semantic Web. The NSF-ITR/MALACH project also fits this frame of complete, meaningful workflows, this time being a candidate for multimedia digital sources (historical videotapes) instead of unstructured texts.

Additionally, there exist some theoretical studies envisaging how the Semantic Web can enhance research by historians. The most remarkable one is Past, present and future of historical information science [86], a major work on the evolution of historical computing, eHistory and historical information science, which gives deep insight into how computer science approaches can help to solve long-standing problems in historical research.

6. Conclusions

In this paper we have presented a general overview of semantic technologies applied to historical datasets, as well as a survey of recent contributions to this line of research. First, we have described a general approach to historical computing, introducing its core elements, such as the historical data life cycle and meaningful classifications for historical sources, putting special emphasis on how structure is a key characteristic of digital sources to take into account when applying a semantic pipeline. Then, we have given an overview of several problematic areas that history scholars have to face while working with digital historical datasets. After that, a summary of semantic technologies has allowed us to examine how these technologies can help in solving some of these historical data problems, offering solutions for interlinking datasets, solving semantic interoperability problems, or enabling historical knowledge inferencing. Finally, we have presented a collection of contributions that provide advances in some areas of semantic historical computing.

It is important to realize that current developments labeled as eHumanities or Digital Humanities pave the way for an integration of semantic technologies in what one could call eHistory. While this paper reflects debates between historians and computer scientists, we would like to underline that a semantic enrichment of historical sources brings new facilities: on the one hand, for humanities researchers also outside of history, allowing them to search, retrieve and compare the information they need for their everyday work using a variety of dimensions and scopes; and on the other hand, for practitioners, giving them new data sources to develop history-aware applications for public institutions, private companies and citizens.

We claim that semantic technologies are suitable for representing the semantics implicitly contained in historical sources, which can be appropriately identified, formalized and linked using the cited tools. With the appropriate pipelines, algorithms can extract entities from digital historical sources and transform these occurrences into RDF triples according to some convenient historical vocabulary provided in RDFS or OWL, linking those entities to each other and to other external linked datasets for enrichment, and contributing to an open, world-wide, online, persistent graph of historical linked knowledge. All the work presented in this survey has some relevance at one stage or another of this graph-building pipeline, providing solutions for the previously analyzed problems in historical computing.

Acknowledgements

COMMIT, Comp. Humanities / KNAW, CEDAR project. Special thanks: D. Roorda, O. Boonstra.

References

[1] George Alter, Kees Mandemakers, and Myron P. Gutmann. Defining and distributing longitudinal historical data in a general way through an intermediate structure. Historical Social Research, Vol. 34(No. 3 = No. 129):78–114, 2009.

[2] Ian Anderson. History and computing. Making History, 2008.

[3] Grigoris Antoniou and Frank van Harmelen. A Semantic Web Primer (Cooperative Information Systems). The MIT Press, April 2004.

[4] C. Beghtol. Classification Theory. Encyclopedia of Library and Information Science, pages 1045–60, 2010.

[5] David M. Berry, editor. Understanding Digital Humanities. Palgrave Macmillan, New York, 2012.

[6] Onno Boonstra, Leen Breure, and Peter Doorn. Past, present and future of historical information science. NIWI-KNAW, Amsterdam, 1st edition, 2004.

[7] B. Bos and G. Welling. The significance of user-interfaces for historical software. Proceedings of the Eighth International Conference of the Association for History and Computing, pages 223–236, 1995.

[8] Stefan Dormans and Jan Kok. An alternative approach to large historical databases. Exploring best practices with collaboratories. Historical Methods, 43(3):97–107, 2010.

[9] Albert Esteve and Matthew Sobek. Challenges and Methods of International Census Harmonization. Historical Methods, 36(2):37–41, 2003.

[10] Mary Feeney and Seamus Ross. Information Technology in Humanities Scholarship: British Achievements, Prospects, and Barriers. Historical Social Research, 19(1):3–59, 1994.

[11] Fox. Nine step process of historical research. 1969.

[12] Ronald Goeken, Marjorie Bryer, and Cassandra Lucas. Making Sense of Census Responses: Coding Complex Variables in the 1920 PUMS. Historical Methods, 32(3):37–41, 1999.

[13] Good and Scates. Methods of research. 1972.

[14] Bernhard Haslhofer, Rainer Simon, Robert Sanderson, and Herbert Van de Sompel. The open annotation collaboration (OAC) model. CoRR, abs/1106.5178, 2011.

[15] Nancy Ide and David Woolner. Exploiting semantic web technologies for intelligent access to historical documents. Proceedings of the Fourth Language Resources and Evaluation Conference (LREC), pages 2177–2180, 2004.

[16] Nancy Ide and David Woolner. Historical Ontologies. In Words and Intelligence II: Essays in Honor of Yorick Wilks, pages 137–152. Springer, 2007.

[17] C. Jones. Strategies for managing requirements creep. Computer, 29(6):92–94, June 1996.

[18] Maximilian Kalus. Semantic networks and historical knowledge management: Introducing new methods of computer-based research. Ann Arbor, MI: MPublishing, University of Michigan Library, 2007.

[19] Jan Kok and Paul Wouters. Virtual Knowledge in Family History: Visionary Technologies, Research Dreams, and Research Agendas. In Paul Wouters, Anne Beaulieu, Andrea Scharnhorst, and Sally Wyatt, editors, Virtual Knowledge. Experimenting in the Humanities and the Social Sciences, page xxx. MIT Press, Cambridge, Mass., 2013.

[20] Thomas Kuczynski, editor. Wirtschaftsgeschichte und Mathematik. Akademie-Verlag, Berlin, 1985.

[21] Kees Mandemakers and Lisa Dillon. Best Practices with Large Databases on Historical Populations. Historical Methods, 37(1):34–38, 2004.

[22] Willard McCarty. Humanities Computing. In Miriam Drake, editor, Encyclopedia of Library and Information Science, pages 1124–35. New York, 2nd edition, 2003.

[23] Albert Meroño Peñuela, Ashkan Ashkpour, Laurens Rietveld, Stefan Schlobach, and Rinke Hoekstra. Data Harmonization: a Structured Approach - With a Case-study in Historic Census Data. 2012.

[24] Peter B. Meyer and Anastasiya M. Osborne. Proposed category system for 1960-2000 census occupations. U.S. Bureau of Labor Statistics, 2005.

[25] Michael Nentwich. Cyberscience: Research in the Age of the Internet. Austrian Academy of Sciences Press, Vienna, 2003.

[26] Bart Van De Putte and Andrew Miles. A Social Classification Scheme for Historical Occupational Data. Historical Methods, 38(2):61–92, 2005.

[27] Steven Ruggles and Russell R. Menard. The Minnesota Historical Census Projects. Historical Methods, 28(1):6–10, 1995.

[28] Susan Schreibman, Ray Siemens, and John Unsworth, editors. A Companion to Digital Humanities. Blackwell Publishing Inc, Malden, MA, 2004.

[29] Renee E. Sieber, Christopher C. Wellen, and Yuan Jin. Spatial cyberinfrastructures, ontologies, and the humanities. Proceedings of the National Academy of Sciences of the United States of America, 2011.

[30] Matthew Sobek. The comparability of occupations and the generation of income scores. Historical Methods, 1, 1995.

[31] W. A. Speck. History and computing: some reflections on the past decade. History and Computing, 6(1):28–32, 1994.

[32] M. Thaller. Automation on Parnassus. CLIO - A databank oriented system for historians. Historical Social Research, 15, 1980.

[33] John Tosh. The Pursuit of History: Aims, Methods, and New Directions in the Study of History. Pearson Education, Harlow, 5th edition, 2010.

[34] D. B. Van Dalen. Understanding educational research. New York: McGraw-Hill, 1979.

[35] Willem Robert van Hage, Véronique Malaisé, Roxane H. Segers, Laura Hollink, and Guus Schreiber. Design and use of the simple event model (SEM). Web Semantics: Science, Services and Agents on the World Wide Web, 9(2), 2011.

[36] Shenghui Wang, Stefan Schlobach, and Michel C. A. Klein. Concept drift and how to identify it. J. Web Sem., 9(3):247–265, 2011.

Surveyed work

[37] Agora project. http://agora.cs.vu.nl.

[38] Armadillo: Historical data mining project. http://www.hrionline.ac.uk/armadillo/armadillo.html.

[39] askSam. https://www.asksam.com/.

[40] ATLAS.ti. http://www.atlasti.com/index.html.

[41] Bridge project. http://ilps.science.uva.nl/node/735.

[42] CCEd project. http://www.theclergydatabase.org.uk/publications/jeh_article.html.

[43] CHoral project. http://hmi.ewi.utwente.nl/choral/.

[44] CKCC project. http://ckcc.huygens.knaw.nl/.

[45] CLARIN-VK project. http://verrijktkoninkrijk.nl/.

[46] CultureSampo - Finnish culture on the Semantic Web 2.0: Thematic perspectives for the end-user. http://www.museumsandtheweb.com/mw2009/papers/hyvonen/hyvonen.html.

[47] Data portal for social sciences open data with SPARQL endpoint. http://www.rechercheisidore.fr/.

[48] Dublin Core. http://dublincore.org/.

[49] Fawcett: A toolkit to begin an historical semantic web. http://www.digitalstudies.org/ojs/index.php/digital_studies/article/view/175/217.

[50] FDR Pearl Harbor project.

[51] FrameNet. https://framenet.icsi.berkeley.edu/fndrupal/.

[52] Gapminder. http://www.gapminder.org/.

[53] GATE. http://gate.ac.uk/.

[54] H-Bot project. http://chnm.gmu.edu/tools/h-bot/.

[55] Heml project. http://heml.mta.ca/heml-cocoon/description.

[56] HISCO project. http://hisco.antenna.nl/.

[57] HiTiME project. http://ilk.uvt.nl/hitime/.

[58] LINKS project. http://www.iisg.nl/hsn/news/links-project.php.

[59] MILO. http://sigmakee.cvs.sourceforge.net/viewvc/sigmakee/KBs/Mid-level-ontology.kif.

[60] NLP2RDF. http://nlp2rdf.org/.

[61] NLTK. http://www.nltk.org/.

[62] North Atlantic Population Project. http://www.nappdata.org/napp/.

[63] NSF-ITR/MALACH project. http://malach.umiacs.umd.edu/.

[64] OpenCyc. http://www.opencyc.org/.

[65] RDF vocabularies for historic place-names and relations between them. http://groups.google.com/group/caa-semantic-sig/browse_thread/thread/ae1db7fa31a1b5a0?pli=1.

[66] SAILS project. http://sailsproject.cerch.kcl.ac.uk/2010/07/about-the-sails-project/.

[67] SCRATCH project. http://www.ai.rug.nl/alice/nwo-catch-scratch/index_english.html.

[68] SEM event model. http://www.cs.vu.nl/~guus/papers/Hage11b.pdf.

[69] Semantic web approaches in digital history: an introduction. http://www.slideshare.net/mpasin/semantic-web-approaches-in-digital-history-an-introduciton.

[70] The semantic web for family history. http://jay.askren.net/Projects/SemWeb/.

[71] SGML. http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language.

[72] SIG:Ontologies. http://wiki.tei-c.org/index.php/SIG:Ontologies.

[73] SIMILE/Timeline. http://www.simile-widgets.org/timeline/.

[74] Spatial cyberinfrastructures, ontologies, and the humanities. http://www.pnas.org/content/108/14/5504.full.

[75] SUMO. http://sigmakee.cvs.sourceforge.net/viewvc/sigmakee/KBs/Merge.kif.

[76] TACT. http://projects.chass.utoronto.ca/tact/.

[77] TAPoR. http://portal.tapor.ca/portal/portal.

[78] TEI (Text Encoding Initiative). http://www.tei-c.org/index.xm.

[79] Text mining for historical documents: Topics and papers. http://www.coli.uni-saarland.de/courses/tm-hist/readings.html.

[80] TokenX. http://tokenx.unl.edu/.

[81] Voyage of the Slave Ship Sally project. http://www.stg.brown.edu/projects/sally/.

[82] WordCruncher. http://www.wordcruncher.com/wordcruncher/default.htm.

[83] WordNet. http://wordnet.princeton.edu/.

[84] XCES. http://www.xces.org/.

[85] Isabelle Augenstein, Sebastian Padó, and Sebastian Rudolph. Lodifier: Generating linked data from unstructured text. In Elena Simperl, Philipp Cimiano, Axel Polleres, Oscar Corcho, and Valentina Presutti, editors, Proceedings of the 9th Extended Semantic Web Conference, volume 7295 of LNCS, pages 210–224. Springer, May 2012.

[86] Onno Boonstra, Leen Breure, and Peter Doorn. Past, present and future of historical information science. NIWI-KNAW, Amsterdam, 1st edition, 2004.

[87] Panos Constantopoulos, Martin Doerr, Maria Theodoridou, and Manolis Tzobanakis. Historical documents as monuments and as sources. 2009.

[88] Albert Esteve and Matthew Sobek. Challenges and Methods of International Census Harmonization. Historical Methods, 36(2):37–41, 2003.

[89] Ronald Goeken, Marjorie Bryer, and Cassandra Lucas. Making Sense of Census Responses: Coding Complex Variables in the 1920 PUMS. Historical Methods, 32(3):37–41, 1999.

[90] Nancy Ide and David Woolner. Exploiting semantic web technologies for intelligent access to historical documents. Proceedings of the Fourth Language Resources and Evaluation Conference (LREC), pages 2177–2180, 2004.

[91] Nancy Ide and David Woolner. Historical Ontologies. In Words and Intelligence II: Essays in Honor of Yorick Wilks, pages 137–152. Springer, 2007.

[92] Maximilian Kalus. Semantic networks and historical knowledge management: Introducing new methods of computer-based research. Ann Arbor, MI: MPublishing, University of Michigan Library, 2007.

[93] Jan Kok and Paul Wouters. Virtual Knowledge in Family History: Visionary Technologies, Research Dreams, and Research Agendas. In Paul Wouters, Anne Beaulieu, Andrea Scharnhorst, and Sally Wyatt, editors, Virtual Knowledge. Experimenting in the Humanities and the Social Sciences, page xxx. MIT Press, Cambridge, Mass., 2013.

[94] Peter B. Meyer and Anastasiya M. Osborne. Proposed category system for 1960-2000 census occupations. U.S. Bureau of Labor Statistics, 2005.

[95] Jan Oldervoll. Censsys - a system for analyzing census-type data. Computer Applications in the Historical Sciences: Selected Contributions to the Cologne Computer Conference, pages 17–22, 1989.

[96] B. G. Robertson. Visualizing an historical semantic web with Heml. Proceedings of the 15th international conference on World Wide Web, pages 1051–1052, 2006.

[97] Bruce Robertson. Exploring Historical RDF with Heml. Digital Humanities Quarterly, 3(1). http://www.digitalhumanities.org/dhq/vol/003/1/000026.html.

[98] Roxane Segers, Marieke van Erp, Lourens van der Meij, Lora Aroyo, Jacco van Ossenbruggen, Guus Schreiber, Bob Wielinga, Johan Oomen, and Geertje Jacobs. Hacking history via event extraction. In Proceedings of the sixth international conference on Knowledge capture, K-CAP ’11, pages 161–162, New York, NY, USA, 2011. ACM. http://doi.acm.org/10.1145/1999676.1999705.

[99] Matthew Sobek. The comparability of occupations and the generation of income scores. Historical Methods, 1, 1995.

[100] Manfred Thaller. Clio - a databank oriented system for historians. Historical Social Research / Historische Sozialforschung, 15, 1980.

[101] U. Thiel, H. Brocks, A. Dirsch-Weigand, A. Everts, I. Frommholz, and A. Stein. Queries in context: Access to digitized historic documents in a collaboratory for the humanities. Book chapter, 2007.

[102] Chiel van den Akker et al. Digital hermeneutics: Agora and the online understanding of cultural heritage. WebSci 2011 Proceedings, 2011.

[103] Rene Witte, Ralf Krestel, Thomas Kappler, and Peter C. Lockemann. Converting a historical architecture encyclopedia into a semantic knowledge base. IEEE Intelligent Systems, 25:58–67, 2010.