Introduction to Semantic Web Technology and Geodata

Arnulf CHRISTL

Heerstraße 162, 53111 Bonn

E-Mail: [email protected]

Abstract: The Semantic Web is an emerging idea. Currently we can see three phases, the first of which has already started, as a diverse set of independent technologies. Semantics cannot be implemented by one single new technology or software and thus it is not an obvious target for developers or big vendors. We use the Web within our own semantic context without paying much notice because we are blessed with intuition, inference and association. We can visually deduce and coordinate content by simply looking at web sites (images and text). Machines do not have any of these capabilities. Instead they are really fast but also incredibly dumb. The Semantic Web is about capacitating machines by preparing data in a way that is intelligible to machines. Initial efforts to put geographic data on the Web in a semantic context are ongoing. This article gives an introduction to current Internet technology and is aimed at geospatial professionals who want to get a better understanding of how their data can become part of the Semantic Web. The Geoweb is just one aspect of the semantic web, albeit a highly interesting one because it ties virtual data to real world locations. The outlook of converging standards, crowd sourcing and semantics is promising.

1 Introduction

The term "Semantic Web" is not formally defined; it is just an idea, albeit a very good one [Tim Berners-Lee (2008)]. It is used to describe concepts, formats and standards, some of which have been proposed by the World Wide Web Consortium (W3C). The key idea of the Semantic Web is to always technically associate (link) data with a meaningful context. This means that the meta data required to fully understand the data always need to be readily available. Data and meta data have often been considered as separate things but actually the difference is mostly just a point of view. One conveys the semantics of the other, ideally in a clearly defined ontology.

Like any new idea, the Semantic Web is worth nothing unless it is realized and put to real world use. In this article we will look into concepts and technologies that enable the geospatial Semantic Web. We will look into the technologies used to implement the Internet and the Web, starting from DNS, TCP/IP, HTTP, HTML and XML. From this basis we move on to resource oriented patterns and RDF.

There are many other technologies like microformats [Berriman 2011] that have been developed with similar goals even though they are not always described as "Semantic Web" components. All of these technologies, formats and standards intend to provide a formal description of concepts, terms, and relationships within a given domain. Finally we will look into the Resource Oriented Architecture (ROA) pattern to see how it all fits together on the Web [Richardson 2007].

The geospatial aspect is focused on data formats including raster graphics for map images, vector data coded in GML and KML and spatially extended syndication formats like GeoRSS. This type of data can be created dynamically by web services as specified by the Open Geospatial Consortium (OGC), for example in the WMS, WFS and SWE implementation standards. These standards were created before resource orientation and the Semantic Web emerged and lack some of these aspects.

Currently most data on the Web are only implicitly linked to their corresponding ontology and these are oftentimes not defined in technical terms at all. We also lack seamlessly integrated tools that will help us to understand the semantic context that the Web can already give us.

One reason why the Semantic Web is so slow to emerge is that it is fairly easy (albeit error prone) for human beings to associate content of a web site with the correct domain. Machines have not developed this capacity at all. The availability of machine-readable meta data is intended to help remedy this problem and eventually enable software agents to access the Web, perform tasks and locate information automagically. The Semantic Web as a global vision is only slowly emerging and many critics still say that it is not feasible at all.

2 Semantics

Semantics is the study of meaning. In the context of the science of linguistics, semantics is used to describe the meaning of words, terms and phrases used by humans to communicate. This concept can be extended to computer sciences by describing the meaning of words, terms and phrases in programming languages. The technicality of computers makes this type of definition much narrower and more precise than for human language.

In information sciences the term semantics describes the technical relation between a datum and its context or ontology. In the context of the Web the term semantics describes the relation between data and its meta data. The goal is to make both intelligible to machines by formatting and structuring them in standard ways.

This paper assumes that a lot of semantic content is already on the Web but that machines are incapable of uncovering or using it coherently. A better understanding of the goals of the Semantic Web will allow geospatial professionals to make much better use of existing Web technology and achieve better interoperability.

2.1 Syntax and Pragmatics

Linguistic science has two more areas of research besides semantics: syntax and pragmatics [Levinson 2000]. Syntax describes the rules by which terms and words can be constructed into sentences and phrases.

But it is possible to create syntactically correct sentences which make no sense semantically. Noam Chomsky coined the phrase "colorless green ideas sleep furiously" [Chomsky 1957] as an example of a syntactically correct sentence with nonsensical semantic meaning. Several interpretations of this sentence have been undertaken to show that it could make sense in special contexts. Especially when speaking figuratively (by adding a context or meta data) "colorless" can be interpreted as "nondescript" and "green" as "new" or "fresh". Given a short introduction, a reader could wrench some meaning from the otherwise nonsensical sentence. This shows that the context of any datum has a profound influence on the related meaning.

This makes up the third pillar of studies in Linguistics: Pragmatics. It is the relation between the term or word and the observer. This highly interesting aspect of Linguistics has so far not been formally adopted in information technology which may be one reason why semantics is still irrelevant to many practical aspects of the Web.

2.2 Ontology

Ontology is the study of the nature of being and existence and their relations. It belongs to the branch of philosophy known as metaphysics and analyzes what exists or can be said to exist, how these entities can be grouped and how they are related to each other. Typically relations are grouped and subdivided in hierarchies according to similarities and differences. In computer and information sciences ontologies are formal representations of the knowledge of different domains [Gruber 2009]. Ontologies can be used to describe a domain in a formal manner. The relationships between domains can also be described in ontologies. Ontologies are formal, explicit specifications of shared concepts providing a vocabulary with defined semantic meaning. The vocabulary can be used to model a domain with a defined syntax by describing the types of objects, their properties and relations.

Ontologies can be formally described using different standards and languages, for example the Web Ontology Language (OWL). For the context of this article we will not go into further detail but first get an overview of the technologies already in common use.

3 Web Technology

The Web (or World Wide Web) is a complex network of interlinked hypertext documents, typically served through web sites. The Web is accessed through the interlinked computer network known as the Internet. As we will see later it is important to clearly separate these two concepts because the Internet is a hierarchically organized computer network whereas the Web is a logically organized directed graph of resources residing in the Internet. This means that the Web is an application that runs on the Internet.

We will first explore some of the Internet technologies required to run the Web and then look into patterns and concepts which enable the Web semantically using this same technology. This section is not intended to be comprehensive on either the Internet or the Web. Instead it only highlights specific aspects of the Internet which are relevant to building semantic context on the Web.

3.1 Internet Protocol

The technical foundation of the Web is the Internet Protocol. It has been created to connect nodes. A node can be a server hosting a web site and documents, an email server (mail delivery agent), a router, a firewall or even a printer, basically anything that is addressable with an Internet Protocol (IP) address such as 94.23.196.65. The protocol has been designed on the assumption that the underlying physical and logical network infrastructure is inherently unreliable. Nodes may unexpectedly disappear or dynamically move elsewhere. The location of objects and servers can change at any time. Transport can be interrupted and must be failsafe. This assumption very well reflects the current experience of the Web at large, including geospatial services and data. There is no central monitoring which tracks or maintains the state of this network.

The Internet provides the basis for the logical domain naming system of the Web, which typically has a two-level naming scheme. The Internet top-level domains (TLDs) make up the root level [Iana 2010]. Most consist of two-letter combinations usually derived from political jurisdictions such as "de" for Deutschland, "fr" for France, "us" for the USA, and so on. Some specially reserved TLDs consist of three letters. These include "com" for commercial, "org" for organization, "gov" for the government of the USA, "mil" for the military of the USA and "edu" for educational institutions of the USA. They reflect the origins of the Internet as a U.S. federal government-sponsored research network.

Top-level domains are not directly addressable; they are empty nodes. To the left of the TLD appears the domain name as in osgeo.org, w3.org or metaspatial.net. These names are directly addressable. A web browser typically uses the Hyper Text Transfer Protocol, therefore the domain names are normally prefixed by "http://". Between this protocol identifier and the domain name it is possible to add sub domains. Oftentimes this is simply "www" as in http://www.gov.vu/. Other domains are hierarchically broken down into further sub domains as in http://inspire.jrc.ec.europa.eu/. This has no effect on navigation or addressability.
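The naming structure described above can be taken apart with a few lines of Python. This is only a minimal sketch: the function name `describe_url` is made up for illustration, and it assumes that the addressable domain consists of exactly the last two labels of the host name, which does not hold for every registry.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into protocol, TLD, domain, sub domains and path.

    Assumption for illustration: the addressable domain is always the
    last two labels of the host name.
    """
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    return {
        "scheme": parts.scheme,          # protocol identifier, e.g. "http"
        "tld": labels[-1],               # top-level domain
        "domain": ".".join(labels[-2:]), # domain name including the TLD
        "subdomains": labels[:-2],       # everything left of the domain
        "path": parts.path,              # directories to the right
    }


info = describe_url("http://inspire.jrc.ec.europa.eu/")
print(info)
```

Applied to http://www.gov.vu/ the same function reports "www" as the only sub domain.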

At the right end of the URL directories can be added. Older sites make a habit of organizing their content in virtual and otherwise empty directories as in the example http://www.osgeo.org/content/sponsorship/sponsors.html. The identical content of that page is also referenced by the URL http://www.osgeo.org/sponsors. The additional structure implemented by adding the virtual directories "content" and "sponsorship" to the URL does not add meaning and is mostly superfluous. The extension .html indicates what type of document the browser should expect but is otherwise also superfluous. Some of the bigger websites like Wikipedia have no directory hierarchy at all, with almost all content available on the same single level.

The redundancy and flexibility of Web content appearing through the Internet becomes apparent when we access the very same document through a variety of resources. The front page of the private web site of the author is currently reachable through the following URLs:

• http://arnulf.us

• http://www.arnulf.us

• http://arnulf.us/Main_Page

• http://arnulf.us/Runder_tisch_gis/introduction_to_the_Web

• http://zpatial.org

• http://r32916.ovh.net

• http://94.23.196.65

• http://178.32.100.197/

This will change over time. Remember that one of the assumptions of the Internet is that everything is dynamic and in a constant state of flux.

3.2 Transport: Push and Pull; SMTP and HTTP

The Internet is about data transport. Several protocols are used to transmit data and messages across the Internet. Transport can either be typified as initiated by push or pull.

The Simple Mail Transfer Protocol (SMTP) is a typical example of a push based protocol. It is used by mail delivery agents to send, relay and deliver emails. The work flow of sending and receiving emails is fairly straightforward: One machine is ordered to send an email. To do this it will wrap the message in a package and add a sticker to it that contains the address. Then it sends (pushes) the package to the next node which will pass the package on until it ends up at the given destination address. If the destination server does not accept the message it returns the mail as undeliverable together with a message stating the reason for the rejection. If the destination server is unavailable altogether then the last node that accepted the package will return the message stating just that.

The Hyper Text Transfer Protocol (HTTP) is an application layer protocol designed within the framework of the Internet Protocol Suite. It is the foundation of data communication in the Web and it is pull-based. It is important to remember that HTTP is not the transport protocol (transport is provided by the Internet through TCP/IP) but the application layer on top of it.

HTTP implements four well defined operations following the CRUD paradigm, which stands for "Create", "Read", "Update" and "Delete" data. The four main HTTP operations are HTTP PUT, GET, POST and DELETE respectively.

Each of the four main HTTP operations has a set of status codes to address errors that can occur, either in the underlying Internet network or in the application running HTTP. The protocol includes a special set of codes to deal with changing Internet addresses, broken links and moving information. This is again based on the core assumption that the underlying network is unreliable. The most important aspects of HTTP are that it is simple, failure tolerant and well defined.
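The operations and codes described above can be sketched with the Python standard library. The dictionary `CRUD_TO_HTTP` simply follows the pairing given in this article (many Web APIs pair create with POST and update with PUT instead), and the helper name `explain` as well as the coarse category labels are assumptions made for this illustration.

```python
from http import HTTPStatus

# Pairing as given in the text; other pairings are common in practice.
CRUD_TO_HTTP = {
    "create": "PUT",
    "read": "GET",
    "update": "POST",
    "delete": "DELETE",
}


def explain(status):
    """Return a short description of an HTTP status code."""
    code = HTTPStatus(status)
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"        # the resource has moved
    elif 400 <= code < 500:
        kind = "client error"    # for example a broken link
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code.value} {code.phrase}: {kind}"


for status in (200, 301, 404, 503):
    print(explain(status))
```

The 3xx and 4xx ranges are the ones that deal with moved information and broken links.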

HTTP is by definition stateless. This means that it does not rely on a defined status between client and server but handles every request independently of the next. It is a framework that allows access to documents through an otherwise opaque network. The user has no information about the path that the data takes. The client always requests data; the server never actively sends anything on its own.

These two basic protocols show the difference between the push and pull paradigm. But for many work flows the architecture has to allow for a combination of pushing and pulling data. An example: When users want to read emails that have been sent through SMTP they will typically first have to pull the email from the mail delivery agent, for example by using the Post Office Protocol (POP3). But in other scenarios emails can also be pushed to the user's hand held device by an active server component as soon as they arrive at the mail delivery agent.

Push concepts can also be implemented on top of HTTP by adding another layer of architectural logic. One such concept is called WebHooks (http://www.webhooks.org). It is based on the assumption that a server might be interested in delivering data instead of relying on clients pulling them on their own. To do this the server must have some information about where to push the data. WebHooks does this by allowing clients to register with the server. Once the server has new data to distribute it will actively notify the client. The comparable pull oriented version of this process is known as syndication and comes in the flavor of the standards RSS, Atom and the like. These are strictly stateless and pull based, which requires that the clients actively retrieve information. As the client has no information about when the data of interest changes on the server it must regularly poll for changes, which can be advertised using a syndication format.
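The polling side of syndication can be sketched as follows. To keep the example self-contained the two "poll results" are inline Atom documents with invented content; a real client would fetch the feed over HTTP at regular intervals. The function name `new_entries` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Two snapshots of a hypothetical feed, as seen by two consecutive polls.
FEED_V1 = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry><id>urn:example:1</id><title>First map update</title></entry>
</feed>"""

FEED_V2 = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry><id>urn:example:2</id><title>Second map update</title></entry>
  <entry><id>urn:example:1</id><title>First map update</title></entry>
</feed>"""


def new_entries(feed_xml, seen_ids):
    """Return titles of entries not seen before and remember their ids."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        if entry_id not in seen_ids:
            seen_ids.add(entry_id)
            fresh.append(entry.findtext(ATOM + "title"))
    return fresh


seen = set()                               # state is kept by the client
first_poll = new_entries(FEED_V1, seen)
second_poll = new_entries(FEED_V2, seen)
third_poll = new_entries(FEED_V2, seen)    # nothing changed on the server
print(first_poll, second_poll, third_poll)
```

Note that all state (the set of known entry ids) lives in the client, which matches the stateless, pull based nature of syndication.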

Another prominent combination of active push and pull based data transport is implemented by social network systems like LinkedIn, Facebook, Twitter and the like. These platforms notify the user by sending an email, relying on the fact that the user will poll the mail server (see above) at regular and frequent intervals. In general the mail does not contain all the data, just a short teaser and the link through which the complete data can be accessed.

These are typical methods of combining pull and push systems to implement an (almost) seamless user experience. Unfortunately most machines have no email account or cannot use it properly.

3.3 Content: Web Sites, HTML, Data and XML

For the context of this article a web site can be seen as an arbitrary directory structure containing documents which are made accessible through Internet technology. The directories contain HTML documents which typically contain texts and references to images or other data which can be displayed directly by web browsers.

The primary document format on the Internet is the Hypertext Markup Language (HTML). HTML is a markup language to describe web pages. It is used to format text and to embed multimedia content, mostly images, videos, sound and the like.

HTML syntax was not intended to give semantic meaning to the data it encodes. It was implemented to work well with HTTP, to be displayed on a computer screen and to be consumed by human beings. HTML defines a specific set of tags to add meta data but these are often not used. Meta information can be given in the TITLE and in specific meta tags, including authoring information, date of creation, expiration, or an abstract of the content. Inside HTML documents images can have ALT attributes which make them intelligible to clients who cannot "see" (this can be a blind person but also a machine or robot).
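How a machine can pick up the meta information mentioned above is sketched below with Python's built-in HTML parser. The sample page, its author and the class name `MetaCollector` are invented for this illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the title, named meta tags and image alt texts of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.alt_texts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page used only for this example.
PAGE = """<html><head><title>Sponsors</title>
<meta name="author" content="Example Author">
<meta name="description" content="List of sponsors">
</head><body><img src="map.png" alt="Map of Bonn"></body></html>"""

collector = MetaCollector()
collector.feed(PAGE)
print(collector.title, collector.meta, collector.alt_texts)
```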

Links are one of the most important aspects of HTML. Links are relations, typically to other web sites or data, sometimes also to references within the same document. Links make up the logical network aspect of the Web. Interestingly, even though links function on the Internet, this logical network is independent of the underlying physical network. Links make up the Web, a directed graph residing on a hierarchical structure.

Data referenced through links in HTML documents typically come in files of arbitrary formats. A small subset of standard formats has been captured as MIME types [IANA 2011], increasing the chances that interested software will eventually learn how to interpret the data correctly and do something coherent with it. Due to the sheer number of incompatible data formats and the very limited range of web capable software (besides web browsers) most data simply has to be downloaded before it can be used coherently.

XML (eXtensible Markup Language) is similar to HTML in that it is a markup language, but it is more generic. Whereas HTML has been explicitly designed to encode documents for web sites, XML can be used to encode practically any information. Additionally, XML makes it possible to add arbitrary semantic context to the text and data it references.

XML is a commonly accepted format to represent trees and hierarchies. This means that web site structures can be represented as XML trees. Added together, the whole Internet could theoretically be represented as one single XML tree, resulting in a very flat hierarchy. From the root of this tree (the Internet as a whole) each domain and web site represents a branch and every HTML document or single chunk of data a leaf.
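A toy version of this tree can be built with Python's ElementTree module. The element names `internet`, `site` and `document` as well as the listed sites and paths are placeholders chosen for this sketch, not an existing vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of "the Internet": domain -> documents.
SITES = {
    "example.org": ["index.html", "maps/bonn.html"],
    "example.net": ["Main_Page"],
}


def build_tree(sites):
    """Represent domains as branches and documents as leaves."""
    root = ET.Element("internet")
    for domain, documents in sites.items():
        site = ET.SubElement(root, "site", name=domain)
        for path in documents:
            ET.SubElement(site, "document", path=path)
    return root


def depth(element):
    """Number of levels below and including this element."""
    return 1 + max((depth(child) for child in element), default=0)


tree = build_tree(SITES)
print(ET.tostring(tree, encoding="unicode"))
print("levels:", depth(tree))
```

However many sites are added, the tree only grows wider, not deeper, which is the flatness the text refers to.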

Currently the Internet comprises more than a hundred million active web sites but most of them only have one or at most a very few levels of content "depth". Therefore this representation would represent the Internet as a very flat hierarchy. The representation of the Web (as opposed to the Internet) therefore also needs to include the (semantic) relations between web sites and documents. The Internet is too "wide" and too "flat" to be useful as a hierarchy of content.

We have to extend the concept of a leaf on a tree in order to describe the Web. Each leaf can become the node of a network if it has links to other documents or is linked from other documents.

The XML Linking Language (XLink) is designed to create internal and external links within XML documents. These links can also be created with associated meta data. It is a W3C specification but is not yet well supported by most software packages. XLink has great potential to become the common data source for tools of a semantically enabled Web, providing meta data together with the associated documents and data. XLink could be an option to combine the hierarchical concept of XML with the network node concept of the Web.

3.4 Relations: The Graph as URL in RDF

To understand the Web as a directed graph we need an appropriate format to represent the relations. One such format is the Resource Description Framework (RDF). It is a family of World Wide Web Consortium (W3C) specifications originally designed as a meta data model. It is now used as a method to conceptually describe and model information that is implemented in web resources, using a variety of syntax formats. Concepts can be described in RDF Schema and modeled using the Web Ontology Language (OWL). Special languages such as SPARQL can be used to make rule based queries on RDF structures [Hitzler 2009].

RDF can represent relations between HTML documents in triples. A triple consists of a subject, a relation (the predicate) and an object. The subject can be any HTML document which links (relates) to any other HTML document (object). Any level, branch and leaf of one tree (data or document on a web site) can relate to any other branch or leaf on any other tree. From this perspective the hierarchy is mostly irrelevant. It is replaced by relations, typically represented through links.
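The triple structure can be illustrated without any RDF library by storing triples as plain Python tuples. This is a minimal sketch and not a real triple store: the URLs, the predicate names `linksTo` and `depicts` and the function `match` are all invented for illustration.

```python
# Each triple is (subject, predicate, object); all values are hypothetical.
TRIPLES = [
    ("http://example.org/a.html", "linksTo", "http://example.net/b.html"),
    ("http://example.org/a.html", "linksTo", "http://example.com/map.png"),
    ("http://example.net/b.html", "linksTo", "http://example.org/a.html"),
    ("http://example.com/map.png", "depicts", "Bonn"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples that agree with the given pattern.

    A value of None acts as a wildcard, loosely similar to a variable
    in a query language such as SPARQL.
    """
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


outgoing = match(TRIPLES, subject="http://example.org/a.html")
incoming = match(TRIPLES, obj="http://example.org/a.html")
print(len(outgoing), "outgoing,", len(incoming), "incoming")
```

The query ignores which web site (tree) a document belongs to; only the relations matter.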

Links on the Web are different from the edges of a tree because they are always directed. A link from one hypertext document can point to any addressable URL but the document at that URL does not necessarily need to link back. On a tree (the Internet) this is different because going up and down does not make much of a difference. This is one of the main differences between the Web, which is a directed graph, and the Internet, which is the hierarchical structure in which the Web resides.
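The directedness of links can be made concrete with a small adjacency structure. The page names and the helper names `build_graph` and `links_back` are assumptions for this sketch.

```python
from collections import defaultdict

# Hypothetical links as (source, target) pairs.
LINKS = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]


def build_graph(links):
    """Map every source document to the set of documents it links to."""
    outgoing = defaultdict(set)
    for source, target in links:
        outgoing[source].add(target)
    return outgoing


def links_back(graph, source, target):
    """True if target also links to source (the link is reciprocated)."""
    return source in graph.get(target, set())


graph = build_graph(LINKS)
print("b links back to a:", links_back(graph, "a", "b"))
print("c links back to a:", links_back(graph, "a", "c"))
```

Page "a" links to "b" without being linked from it, which cannot be expressed by the parent and child relation of a tree.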

Currently the Resource Description Framework (RDF) is the best technology to explore the graph that represents the Web. The Semantic Web does not need to be reinvented; it is already there. What we are lacking is a common way of representing it in a comprehensible way. Even though RDF is typically formatted in a readable XML format the content is not immediately visually intelligible to human perception. This type of representation of data on the Web is very much designed to be consumed by machines.

The graph, RDF and triple stores can be represented using XML, which brings us back to the hierarchy of the Internet and working concepts. Currently there is no good way to visualize the graph as a whole [Christl, 2010]. All we can currently create are two-dimensional representations such as the Linking Open Data cloud diagram [LinkedData 2010] shown in image 1.

Image 1: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Simple representations of the Web graph are already in use but we are still a long way off making the intricately linked Web readily intelligible to human perception. Currently the only practical way is to cut the multidimensional graph into planes creating two-dimensional cross sections. These could be a starting point for human interaction which can lead to creating new ever changing two dimensional hierarchical cross section planes. Basically these are maps of patterns as we use them in geospatial contexts. The potential relationship between geographic data and the Web still requires further research.

3.5 The Resource Oriented Architecture

Currently there are three common architectural styles in use on the Internet: Remote Procedure Call (RPC), Key Value Pair (KVP) and Representational State Transfer (REST). The RPC style architecture has evolved right out of software development. It allows calling procedures (functions) on a remote machine. This requires intimate prior knowledge of the interface of the called software, including parameters, values, and error codes. The remote machine performs the operation and typically sends back a message containing the result. This architecture style is message oriented; a commonly used technology is SOAP.

The REST style limits operations to those of the protocol it is based on; in the case of the Web this is HTTP. As we have learned, HTTP has four well defined (CRUD) functions for persistent storage. No other function or operation beyond these can be used in a RESTful interface. All the logic has to be designed into the data model and work flow. This is a very different approach from simply opening up a piece of software through an API style remote procedure call interface. The architecture paradigm associated with REST is the Resource-Oriented Architecture (ROA). It proposes four concepts:

• Resources

• Their names (URIs)

• Their representations

• The links between them

and four properties:

• Addressability

• Statelessness

• Connectedness

• A uniform interface

These concepts and properties can be implemented perfectly using HTTP and hypermedia making the ROA the best fit for the requirements of the Semantic Web. The ROA describes the pattern that results from applying REST principles to make best use of HTTP based Internet technology. It also describes a set of Best Practices and shows how data needs to be designed to be made available on the Web.
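The following sketch illustrates the idea of a uniform interface over addressable resources with an in-memory store. It is not an HTTP server; the class name `ResourceStore`, the URI and the choice of returned status codes are assumptions made for illustration, and POST is left out for brevity.

```python
class ResourceStore:
    """A tiny in-memory stand-in for a resource oriented service.

    Every resource is addressed by its URI, every call is independent
    of earlier calls (stateless), and the same few operations apply to
    all resources (uniform interface).
    """

    def __init__(self):
        self._resources = {}

    def put(self, uri, representation):
        created = uri not in self._resources
        self._resources[uri] = representation
        return 201 if created else 200   # Created / OK

    def get(self, uri):
        if uri in self._resources:
            return 200, self._resources[uri]
        return 404, None                 # Not Found

    def delete(self, uri):
        if self._resources.pop(uri, None) is not None:
            return 204                   # No Content
        return 404


store = ResourceStore()
# Hypothetical resource whose representation links to a related resource.
first = store.put("/maps/bonn", {"title": "Bonn", "links": ["/maps/cologne"]})
second = store.put("/maps/bonn", {"title": "Bonn (revised)", "links": []})
status, body = store.get("/maps/bonn")
print(first, second, status, body)
```

All application logic sits in the data (the representations and their links); the interface itself stays the same for every resource.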

In this architecture pattern software and services are the result of designing data. The focus does not lie on implementing functionality for a user in a piece of software. The ROA reduces software to a thin and opaque layer around data. This emphasis on the data makes the ROA highly relevant to the emerging Semantic Web. At the same time it is hard to understand and follow for traditional software development, which is used to thinking in a much more software-centric way.

4 Geographic Data on the Web

Geographic data will probably go through at least three phases of the evolution of the Semantic Web. We are already experiencing the first results of phase one. This includes simple publication of maps and data which is inherently linked to other data.

In the second phase geospatial data will be published in semantically enabled formats like RDF. This requires only small changes to the existing infrastructure and some sites are already coming up, for example Ordnance Survey UK with the OS OpenData initiative [Ordnance Survey 2011]. During both of these phases traditional GIS work will still mostly be done on local machines with powerful query languages like SQL and highly specialized tools for geospatial operations in traditional GIS. In the third phase, which is probably still many years off, spatial operations might become an inherent feature of the Web, which then may become a real GeoWeb.

To make geographical data available on the Web it needs to be formatted in a way that can be browsed (or crawled by agents) just like the Web. Many current catalogs and structured meta data follow ISO standards and rules and regulations as defined by INSPIRE [INSPIRE (2011)]. This meta data provides a valuable source of information but it is not yet related (linked) well enough. The technology and the processes around this meta data still fall short of addressing the needs of a spatially enabled Semantic Web. In addition to the highly structured, hierarchical XML meta data we need to add a more relational perception of the data itself. The understanding of how geospatial data relates to and links with other data will eventually grow beyond the geospatial expert domain. But to get there the experts first have to make the data accessible in a way that follows semantic paradigms.

4.1 OGC Web Services

The members of the Open Geospatial Consortium (OGC) have created a set of service standards to publish geographical data. Maps are increasingly delivered through the OGC Web Map Service standard [OGC 2011] which can be parameterized to deliver dynamically rendered images of maps.
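What "parameterized" means in practice can be sketched by assembling a GetMap request as a key value pair URL. The parameter names follow WMS version 1.1.1 as far as the author of this sketch is aware; the endpoint, the layer name and the bounding box are placeholders, and the function name `getmap_url` is an assumption.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def getmap_url(endpoint, layers, bbox, width, height,
               srs="EPSG:4326", image_format="image/png"):
    """Build a WMS 1.1.1 GetMap URL (bbox = minx, miny, maxx, maxy)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",                       # empty: server default style
        "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint and layer name.
url = getmap_url("http://example.org/wms", ["roads"],
                 (7.0, 50.6, 7.2, 50.8), 400, 400)
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(url)
```

The number of mandatory parameters gives an impression of why even this comparatively simple standard requires prior knowledge on the client side.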

Geospatial features can be made available through the OGC Web Feature Service (WFS) standard. The WFS interface standard allows access to individual geographic data objects and implements a query language similar to SQL but less powerful. WFS services can be configured to only serve data or to also store objects. Geographic objects can be modeled using GML (see below).

Both standards are mature but especially the OGC WFS standard is complex and hard to use unless the client has prior knowledge of the service. This currently still prevents wider usage in contexts other than those of geospatial professionals. Even the much simpler OGC WMS standard is considered "difficult" by many Web developers.

4.2 Geographic Data Formats

Raw geographical data is typically made available in the OGC standard formats GML and KML, and increasingly also GeoRSS. All of them are based on XML. GML in particular can be so complex and individually modeled that to date practically no Web software has evolved that can use this data right away. One exception is the Open Source software OpenLayers, which has been extended to dynamically render GML in the EU-funded project "European SDI Network" [ESDIN 2011].

In addition to carrying geographic coordinates, KML can also contain rendering instructions. This makes it easier for software to overlay the data visually on top of other maps and has led to wider adoption on the Web. But even Google, the original designer of the format, does not fully support it in its web map application. Both GML and especially KML make use of XLink and promise interesting future options for the Semantic Web.

Most of the software packages have in common that they only display maps and offer little or no functionality to further process or even link geographical data.

5 Examples

We will finish this short excursion by looking at two examples of leveraging Internet technology to make geographical data accessible on the Semantic Web. They are very different, but each is promising in its own way. One is OpenStreetMap (OSM), the other is OS OpenData by Ordnance Survey in the UK.

5.1 OpenStreetMap

OpenStreetMap (OSM) [OpenStreetMap 2011] collects, maintains and makes geographic data available in a crowdsourced process. Michael Goodchild coined the term Volunteered Geographic Information (VGI) [Goodchild 2008] to describe the production side of the project and describes how it changes the world of mapping. This is a very good definition, but it does not emphasize the openness of the OSM project, which also allows anyone to access and use the data for any purpose. For the geo-enabled Web this is probably an even more important aspect.

Anyone may download, use and pass on OSM data for any purpose, similar to the definition of Open Source commonly used in the development of Free Software. This differentiates OpenStreetMap from other geographical data producers who collect users' data without giving back full access to it. One example is the navigation system provider TomTom, which operates MapShare. MapShare allows users to submit corrections to the data and to access changes submitted by others, but denies access to the underlying original data source [TomTom 2011].

The core data of OpenStreetMap is maintained in a wiki-style mode. This means that there is no predefined, fixed structure of the data. This allows for a lot of flexibility, but at the cost of a defined structure that would allow access with a priori knowledge of the data. The data can also be stored in traditional object-relational databases like PostgreSQL with PostGIS, allowing more structured access.

OpenStreetMap implements several levels of access to this data, most prominently the OSM Application Programming Interface (API), an interface that over the years has grown to suit the needs of the communities. The OSM API is the server component to which REST requests are addressed.

The REST requests take the form of HTTP GET, PUT, POST, and DELETE messages. Any payload is in XML form, using the MIME type "text/xml" and UTF-8 character encoding. It may be compressed on the HTTP layer if the client indicates through the HTTP "Accept-Encoding" header that it can handle compressed messages.
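To make the request pattern more concrete, the following Python sketch prepares a GET request for a single node and parses an XML payload of the kind the API returns. It assumes the API is reachable under `https://api.openstreetmap.org/api/0.6` and that a node is addressed as `/node/<id>`. The sample payload is made up for illustration, and no request is actually sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed base address of the version 0.6 API.
API_BASE = "https://api.openstreetmap.org/api/0.6"


def build_node_request(node_id):
    """Prepare (but do not send) a GET request for a single OSM node."""
    req = urllib.request.Request(f"{API_BASE}/node/{node_id}", method="GET")
    # Tell the server that a gzip-compressed response is acceptable.
    req.add_header("Accept-Encoding", "gzip")
    return req


def parse_node(xml_text):
    """Extract id, coordinates and tags from an OSM XML node payload."""
    node = ET.fromstring(xml_text).find("node")
    tags = {t.get("k"): t.get("v") for t in node.findall("tag")}
    return {"id": int(node.get("id")), "lat": float(node.get("lat")),
            "lon": float(node.get("lon")), "tags": tags}


# Made-up payload in the shape of an OSM XML response.
SAMPLE = """<osm version="0.6">
  <node id="1" lat="50.7374" lon="7.0982">
    <tag k="name" v="Bonn"/>
  </node>
</osm>"""

req = build_node_request(1)
print(req.get_method(), req.full_url)
print(parse_node(SAMPLE))
```

If the request were sent with `urllib.request.urlopen`, a gzip-compressed body would have to be decompressed by the caller, since `urllib` does not do this automatically.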

Although this API is not an international open standard itself, it does comply with several other standards, including a correct implementation of HTTP. It has a lot of traction due to the momentum of the project itself. The current (January 2011) stable version 0.6 has been in use for more than 20 months. The OpenStreetMap project can be considered a spatial data infrastructure in a resource oriented architecture pattern. It is easy to integrate with other web applications, allows access through other software projects and makes use of hypermedia.

The most common representation of the geographical data contained in the OpenStreetMap database is through map images. Several cartographic layouts based on a variety of rendering engines are available and maintained by specific domain groups (for example for hikers, cyclists, road traffic, environmental interests, and many more). The most commonly used interface to this map data is the OpenStreetMap tiling system, which breaks down the world into a set of predefined tiles at predefined scale levels in a predefined coordinate system. OpenStreetMap data is based on the geographic coordinate system WGS 84 (EPSG:4326, latitude/longitude). The corresponding open standard is the OGC Web Map Tile Service implementation standard.
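The idea of predefined tiles can be illustrated with a short calculation. The sketch below assumes the widely used "slippy map" scheme, in which zoom level z divides the world into 2^z by 2^z tiles in a spherical Mercator projection, with tile (0, 0) in the north-west corner. The URL template is an assumption for illustration, and the function names are hypothetical.

```python
import math


def tile_for(lat_deg, lon_deg, zoom):
    """Return (x, y) of the tile containing a WGS 84 point at a zoom level.

    Valid for longitudes in [-180, 180) and latitudes within roughly
    +/-85.05 degrees, the limits of the spherical Mercator projection.
    """
    n = 2 ** zoom                       # number of tiles per axis
    lat_rad = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


def tile_url(lat_deg, lon_deg, zoom,
             template="https://tile.openstreetmap.org/{z}/{x}/{y}.png"):
    """Fill an assumed URL template with the tile indices for a point."""
    x, y = tile_for(lat_deg, lon_deg, zoom)
    return template.format(z=zoom, x=x, y=y)


# Bonn lies in the north-eastern quadrant of the world at zoom level 1.
print(tile_for(50.73, 7.1, 1))
print(tile_url(50.73, 7.1, 12))
```

Because every tile has a stable address, tiles can be cached and linked like any other Web resource.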

OSM data can also be downloaded as database dumps to create individual maps with specific content and cartographic layouts. This again follows an international OGC standard, Well-Known Text (OGC WKT).

The LinkedGeoData project [LinkedGeoData 2011] regularly creates RDF dump files for download, probably making OpenStreetMap the most prominent candidate for a geographically enabled Semantic Web. But the sheer size of the data makes it very difficult to handle: RDF is not the right format for mass data processing. How the RDF data can be broken down to be of use in the Semantic Web still requires a lot of work from researchers, architects and software developers.
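A small sketch may help to show why volume becomes an issue. In RDF every attribute of an object is a separate statement (triple). The Python snippet below serializes one point as N-Triples using the W3C Basic Geo vocabulary and `rdfs:label`. The subject URI is a hypothetical placeholder, and the choice of vocabulary is an assumption, not necessarily what the LinkedGeoData dumps use.

```python
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def node_to_ntriples(subject_uri, lat, lon, label):
    """Return one N-Triples line per attribute of a point feature."""
    s = f"<{subject_uri}>"
    return [
        f'{s} <{GEO}lat> "{lat}"^^<{XSD}decimal> .',
        f'{s} <{GEO}long> "{lon}"^^<{XSD}decimal> .',
        f'{s} <{RDFS}label> "{escape_literal(label)}" .',
    ]


# Hypothetical subject URI, for illustration only.
lines = node_to_ntriples("https://example.org/node/1", 50.7374, 7.0982, "Bonn")
print("\n".join(lines))
```

A single named point already needs three statements here. Each further tag adds another one, so the number of triples grows with every attribute of every object.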

5.2 Ordnance Survey OpenData

Ordnance Survey in Great Britain has a long history of collecting, maintaining and publishing maps. Recently Ordnance Survey has considerably enhanced access to online maps by publishing an API which allows access to the OpenData [Ordnance Survey 2011] tiling map scheme. The main difference from OSM is that access is only allowed to the map images, not the underlying data (similar to TomTom but without the possibility to post updates). The maps can be accessed through a web based framework by using a special API key. All maps are served from servers under the control of Ordnance Survey, generally free of cost.

In addition to providing access to map images, the OpenData initiative also publishes some administrative data in RDF format. This data can be used to easily link documents and data already on the Web with location information. It also gives access to the data itself, although in a format that is not typical for GIS and without the coordinates of the geographic objects.

There are several ways to make use of this data. John Goodwin describes a simple case of linking tabular data through the postal code of addresses [Goodwin 2010]. In all cases a unique identifier is required that can serve as the link between the geography and the dataset. In the example provided by John Goodwin the data comes with addresses and post codes. As post codes are part of the OS OpenData model, the data can be readily linked using the RDF datasets.
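The linking step can be sketched in a few lines of Python. The example normalizes the post code of each row in a small table and derives a URI from it that can act as the shared identifier. The URI pattern is an assumption and should be checked against the published OS linked data before use. The table content and the function names are made up for illustration.

```python
import csv
import io

# Assumed URI pattern for post code resources; verify before use.
POSTCODE_BASE = "http://data.ordnancesurvey.co.uk/id/postcodeunit/"


def normalize_postcode(raw):
    """Remove all whitespace and upper-case a post code."""
    return "".join(raw.split()).upper()


def link_rows(csv_text, postcode_column="postcode"):
    """Add a 'postcode_uri' field to every row of a CSV table."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        code = normalize_postcode(row[postcode_column])
        rows.append({**row, "postcode_uri": POSTCODE_BASE + code})
    return rows


# Made-up table, for illustration only.
TABLE = """name,postcode
Library,so16 0as
School,SW1A 1AA
"""

linked = link_rows(TABLE)
for row in linked:
    print(row["name"], row["postcode_uri"])
```

Once each row carries such a URI, any statement that the RDF dataset makes about the same post code resource can be joined to the table.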

6 Conclusion and Outlook

The Internet provides a functional, highly scalable technological foundation for creating, publishing, maintaining and (to a certain degree) processing geographical data on the Web. Traditional GIS processing and associated workflows are still miles away from leveraging this potential. Interestingly, some OGC standards like WMS and WFS were implemented long before the Semantic Web or Resource Oriented Architecture concepts were laid out, but they already implement some of the Web paradigms presented in this article. It will be interesting to follow the development of the existing standards and the convergence of GML and KML on the one hand, and Atom, RSS and RDF on the other.

Now is the time to upgrade the existing OGC standards to be able to address these new challenges. This may also include a change in the self-perception of the standardization community, which has grown into a highly expert domain that now has trouble integrating with the more general Web. To allow this integration the OGC community needs to embrace the communities living in the Web, which will also require structural changes, some of which are already under way.

OpenStreetMap, on the other hand, may want to grow its professional background to better interface with existing expert domains. This might include upgrading the OpenStreetMap API to an OGC standard so that it can eventually run through the ISO standardization process. This process must not kill innovation; this imminent danger must be taken seriously. On the positive side it will allow other structures, like public administrations that are bound by ISO, to leverage the power of OpenStreetMap.

Personally I do not foresee this happening anytime soon, but some crossover between communities already takes place, for example in the Open Source Geospatial Foundation (OSGeo). The OSGeo Public Geospatial Data Project lists initiatives, organizations and individuals interested in pursuing this broader and promising perspective [OSGeo 2011]. The Semantic Web will grow midway between the data producers, the consumers, the crowds and the standards. As we typically belong to one or two of these groups but seldom to all at the same time, progress is hard to perceive for many.

7 Literature

Berners-Lee, Tim (2008): The Time for the Semantic Web is Now. URL: http://www.readwriteweb.com/archives/tbl_calls_for_semweb.php Last accessed on 2010-12-14

Berriman, Frances (2011): Artificial Intelligence. URL: http://fberriman.com/2010/06/16/science-hack-day-turing-tests-and-google/ Last accessed on 2010-12-10

Chomsky, Noam (1957): Syntactic Structures. The Hague: Mouton.

Christl, Arnulf (2010): The Hierarchy and the Graph. URL: http://arnulf.us/sevendipity/archives/35-The-Hierarchy-and-the-Graph.html Last accessed on 2010-11-20

ESDIN (2011): European Spatial Data Infrastructure Network; Support in Action for INSPIRE. URL: http://www.esdin.eu Last accessed on 2011-01-17

Goodchild, Michael F. (2008): Citizens as Sensors: The World of Volunteered Geography. URL: http://www.ncgia.ucsb.edu/projects/vgi/docs/position/Goodchild_VGI2007.pdf Last accessed on 2009-03-21

Goodwin, John (2010): So what can I do with the new ordnance survey linked data. URL: http://johngoodwin225.wordpress.com/2010/10/25/so-what-can-i-do-with-the-new-ordnance-survey-linked-data/ Last accessed on 2011-01-17

Gruber, Tom (2009): Ontology. In: The Encyclopedia of Database Systems, Ling Liu and M. Tamer Özsu (eds.), Springer-Verlag.

Hitzler, Pascal; Krötzsch, Markus; Rudolph, Sebastian (2009): Foundations of Semantic Web Technologies, Chapman & Hall/CRC

IANA (2011): Root Zone Database. URL: http://www.iana.org/domains/root/db/ Last accessed on 2011-01-11

INSPIRE (2011): Infrastructure for Spatial Information in the European Community. URL: http://inspire.jrc.ec.europa.eu/ Last accessed on 2011-01-24

Levinson, Stephen C. (2000): Pragmatics. Cambridge: Cambridge University Press.

LinkedGeoData (2011): LinkedGeoData Data Set. URL: http://linkedgeodata.org/Datasets Last accessed on 2011-01-24

OGC (2011): Web Map Standard. URL: http://www.opengeospatial.org/standards/wms Last accessed on 2011-01-24

OpenStreetMap (2011): OpenStreetMaps: Free Maps for the World. URL: http://www.openstreetmap.org Last accessed on 2011-01-21

Ordnance Survey (2011): OS OpenData. URL: http://www.ordnancesurvey.co.uk/oswebsite/opendata/ Last accessed on 2011-01-11

OSGeo (2011): Public Geospatial Data Project. URL: http://wiki.osgeo.org/wiki/Public_Geospatial_Data_Project Last accessed on 2011-01-24

Richardson, Leonard; Ruby, Sam (2007): RESTful Web Services. O'Reilly Media, Inc: USA