GeoInfo 2006 Presentation by Chris Jones, Cardiff University
1
Geographical Information Retrieval
Christopher Jones
Cardiff University
See www.geo-spirit.org for information on SPIRIT project, the contributing partners, and downloads of articles and project deliverables.
2
What is Geo-information?
• Geo-information associates things and events with places
• Rich vocabulary: place names, coordinates, geometric objects, spatial relationships, spatial structures, patterns, paths, flows, interactions…
3
Where is Geo-information?
• Personal knowledge
– of landscape, of where things, people and services are located, where things happened
• Documents (various media)
– Lists of where facilities, resources, structures are located
– Textual descriptions of geographic phenomena
– Images and videos of geographic space
• Maps
4
GIS and the Web
A GIS is typically:
– Isolated
– Supports an individual organisation
– Small range of topics
– Structured data / geo-coded locations
– Finds answers
– Accessed privately
– Complicated to use
The World Wide Web is:
– Globally networked
– Supports everyone on the Internet
– Vast range of topics
– Unstructured free text / images
– Finds documents
– Accessed publicly
– Easy to use
5
Problems with WWW as a source of geo-information
• Geographic context embedded in natural language descriptions
• Place names ambiguous and confused with names of organisations, people, buildings and streets
• Web queries depend on exact match of text terms
• No intelligent interpretation of spatial relationships (“near”, “west” etc)
• No geo-relevance ranking
6
Current motivation of GIR: find geo-specific resources on the Web
Find web resources about
Something related_to Somewhere
related_to = in, near, within X km, north_of, etc.
• Resolve ambiguity of names (many places have same name)
• Interpret the query's spatial relationships to give a query footprint
• Find documents geographically associated with region of query footprint
• Relevance rank geographically by place and subject
[Figure: illustrations labelled “near” and “north”]
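The "Something related_to Somewhere" query form above can be sketched as a small parser. This is only an illustration: the `GeoQuery` type, the `parse_query` function and the regular expression are hypothetical names introduced here, and the operator list is limited to the examples given on the slide.

```python
import re
from typing import NamedTuple, Optional


class GeoQuery(NamedTuple):
    concept: str                   # the "Something"
    relation: str                  # the spatial relationship
    distance_km: Optional[float]   # only set for "within X km of"
    place: str                     # the "Somewhere"


# Assumed surface forms: "<concept> in|near|north_of <place>"
# and "<concept> within <X> km of <place>".
_PATTERN = re.compile(
    r"^(?P<concept>.+?)\s+"
    r"(?:(?P<rel>north_of|near|in)|within\s+(?P<km>\d+(?:\.\d+)?)\s*km\s+of)\s+"
    r"(?P<place>.+)$",
    re.IGNORECASE,
)


def parse_query(text: str) -> Optional[GeoQuery]:
    """Split a query into concept, spatial relation and place name."""
    m = _PATTERN.match(text.strip())
    if m is None:
        return None  # no recognisable spatial relationship
    km = m.group("km")
    relation = "within" if km is not None else m.group("rel").lower()
    distance = float(km) if km is not None else None
    return GeoQuery(m.group("concept"), relation, distance, m.group("place"))
```

For example, `parse_query("hotels near Cardiff")` yields the concept "hotels", the relation "near" and the place "Cardiff"; resolving the place name and building the query footprint are separate steps.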
7
GIR, GIS and The Web
[Figure: diagram relating Geo-knowledge, GIS, The Web, GIR and World Knowledge]
8
Geographical Search Engines
• Google etc. have “local” versions.
– Based on business (yellow pages) directories.
9
Geographical Search Engines
SPIRIT research prototype for general geo-web search
Structured user interface:
– Dropdown menu of spatial relationships
10
Geographical search engines
SPIRIT results are listed as URLs plus symbols on a map
User interface screenshots from Ross Purves et al., University of Zurich
11
Anatomy of a Geographical Search Engine
[Figure: architecture diagram. Components: User Interface, Broker, Query Disambiguation, Place Ontology, Search Engine, Textual and Spatial Indexes, Relevance Ranking, Geo-tagging, Text Indexing, Web Resources, Document Footprints. Data flows: Search Request + Query Footprint, Unranked Results, Ranked Results.]
12
Geo-Tagging = Geo-parsing + Geo-coding
• Geo-parsing
– Recognising geographic references (ignoring non-geographic uses of place terminology)
• Geo-coding
– Attaching a unique quantitative location (footprint) to geographic references
13
Geo-parsing
• The presence of place names can be recognised with gazetteers (i.e. lists of names)
• Some types of genuine geographic reference
– the name of the place: Sao Paulo
– an address: School of Computer Science, Cardiff University, 5 The Parade, Cardiff
– an address fragment: “Ross lived in Dalmeny Street in Edinburgh”
– a postcode / zip code: CF24 3XF
– a phone number: most Cardiff phone numbers start with 02920
14
Geo-Parsing: true & false references
Some types of false geographic reference
• Personal names: Smedes York, Jack London
• Business names: Dorchester Hotel, York Properties…
• Street names: Oxford Street, London Road…
• Common words that are also places: bath, battle, derby, over, well…
15
Geo-Parsing: distinguishing between false and true geo-references
Look for patterns and context
• Personal names (Jack London, Mr York): <First_name> <Location>; <Title> <Location>
• Business names (Paris Hotel): <Business_type> <Location> (or vice versa)
• Street names (Oxford Street): <Location> <Road_type>
• Detect spatial prepositions (in, near, south of, outside, etc.): “he lived in Over”
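The pattern and context checks above can be sketched as a gazetteer lookup followed by a few neighbour-word rules. Everything in this sketch is an assumption made for illustration: the word lists are tiny hand-made samples, `find_geo_references` is a hypothetical name, and a real geo-parser would need far richer context handling.

```python
# Tiny hand-made word lists for illustration only.
GAZETTEER = {"london", "york", "paris", "oxford", "over", "cardiff"}
FIRST_NAMES = {"jack"}
TITLES = {"mr", "mrs", "dr"}
BUSINESS_TYPES = {"hotel", "properties"}
ROAD_TYPES = {"street", "road"}
COMMON_WORDS = {"over", "bath", "well"}       # place names that are also ordinary words
SPATIAL_PREPOSITIONS = {"in", "near", "outside"}


def find_geo_references(text):
    """Return gazetteer matches that survive the false-reference patterns."""
    tokens = [t.strip(".,;:!?\"'“”()").lower() for t in text.split()]
    found = []
    for i, tok in enumerate(tokens):
        if tok not in GAZETTEER:
            continue
        prev = tokens[i - 1] if i > 0 else ""
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # <First_name> <Location> or <Title> <Location>: a personal name.
        if prev in FIRST_NAMES or prev in TITLES:
            continue
        # <Business_type> <Location> or the reverse: a business name.
        if prev in BUSINESS_TYPES or nxt in BUSINESS_TYPES:
            continue
        # <Location> <Road_type>: a street name.
        if nxt in ROAD_TYPES:
            continue
        # Ordinary words count as places only after a spatial preposition.
        if tok in COMMON_WORDS and prev not in SPATIAL_PREPOSITIONS:
            continue
        found.append(tok)
    return found
```

With these lists, "Jack London lived in Over" keeps only "over", because "London" follows a first name while "Over" follows the preposition "in".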
16
Geo-coding (grounding) the genuine geo-references
• Many different places have the same name (referent ambiguity): Newport, Cambridge, Springfield…
• Use context to decide (references to parent or nearby places)
• Or choose the most important one (by population or place type hierarchy)
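The two grounding strategies above (context first, importance as a fallback) might look like the following sketch. The `Candidate` type, the `ground` function and the candidate table are hypothetical, and the population figures are invented placeholders, not real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    parent: str
    population: int  # invented placeholder values, not real figures


# Hypothetical gazetteer entries for two ambiguous names.
CANDIDATES = {
    "Cambridge": [
        Candidate("Cambridge", "England", 200),
        Candidate("Cambridge", "Massachusetts", 100),
    ],
    "Newport": [
        Candidate("Newport", "Wales", 200),
        Candidate("Newport", "Oregon", 100),
    ],
}


def ground(name, context_words):
    """Pick one referent: by a parent place in the context, else by population."""
    candidates = CANDIDATES.get(name, [])
    if not candidates:
        return None
    context = {w.lower() for w in context_words}
    by_context = [c for c in candidates if c.parent.lower() in context]
    if len(by_context) == 1:
        return by_context[0]
    # No (or conflicting) contextual evidence: fall back to the "most important".
    pool = by_context or candidates
    return max(pool, key=lambda c: c.population)
```

Here `ground("Cambridge", ["MIT", "Massachusetts"])` selects the Massachusetts candidate, while a context with no parent place falls back to the larger population value.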
17
Geo-Coding: what is the geo-focus of a web page?
• Frequency of occurrence
• Do multiple places have a common parent?
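A minimal sketch of the two geo-focus cues above, assuming the page has already been geo-parsed into a list of place mentions. The `PARENT` table and the `geo_focus` function are hypothetical, and the rule for combining the two cues is a guess for illustration rather than the method used in the project.

```python
from collections import Counter

# Hypothetical containment table (place -> parent place).
PARENT = {
    "Cardiff": "Wales",
    "Swansea": "Wales",
    "Newport": "Wales",
    "Bristol": "England",
}


def geo_focus(place_mentions):
    """Guess the geographic focus of a page from its place mentions."""
    counts = Counter(place_mentions)
    if not counts:
        return None
    parents = {PARENT.get(place) for place in counts}
    # Several distinct places that share one known parent: focus on the parent.
    if len(counts) > 1 and len(parents) == 1 and None not in parents:
        return parents.pop()
    # Otherwise use the most frequently mentioned place.
    return counts.most_common(1)[0][0]
```

A page mentioning Cardiff twice and Swansea once would be assigned "Wales"; a page mentioning Cardiff twice and Bristol once would be assigned "Cardiff".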
18
Anatomy of a Geographical Search Engine
[Figure: architecture diagram, repeated from slide 11]
19
Indexing Web Resources
Standard text index is an inverted file
Query: Restaurants in Cardiff
– Find documents that contain all terms
– Works literally for “in” but won’t find contained places.
– Doesn’t work in general for “near”, “X km from”, “north_of” etc.

Text term     List of resources containing term
apple         Doc79, Doc89, Doc822, …
Cardiff       Doc2, Doc19, Doc37, …
door          Doc16, Doc49, Doc112, …
hotel         Doc1, Doc2, Doc23, …
in            Doc4, Doc7, Doc19, …
London        Doc20, Doc35, Doc150, …
pub           Doc9, Doc11, Doc100, …
restaurant    Doc19, Doc22, Doc37, …
…             …
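The inverted file and the "contain all terms" rule can be illustrated with a few lines of Python. The three documents are invented for the example (Roath is a district of Cardiff), and `build_inverted_index` and `search_all_terms` are hypothetical helper names.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return the documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Invented toy documents, already lower-cased and stemmed.
DOCS = {
    "Doc19": "restaurant in cardiff bay",
    "Doc22": "restaurant in roath",   # inside Cardiff, but never says "cardiff"
    "Doc2": "hotel cardiff",
}
INDEX = build_inverted_index(DOCS)
```

`search_all_terms(INDEX, "restaurant in cardiff")` returns only Doc19. Doc22 is missed even though it describes a place contained in Cardiff, which is the limitation the slide points out.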
20
Why Spatial Indexing?
Query: “Castles in Wales”
– Need to find documents that refer to names of places in Wales (perhaps without mentioning “Wales”)
Query: “Hotels outside and within 30 km of Rio”
– Need to find documents referring to hotels that are in places other than Rio
• In both cases, conventional text indexing would require the query to contain the names of all places in Wales, or of all places outside Rio within 30 km
21
Spatial indexing of resources
• Use prime foci of documents to create document footprints (point, polygon, bounding rectangle…)
• Use footprints to index documents
• Convert query to a query footprint
• Match query footprint to doc. footprints
[Figure: spatial query and result]
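The matching step can be sketched with bounding rectangles as footprints. This is a simplification made for illustration: `Rect`, `intersects` and `match_footprints` are hypothetical names, coordinates are abstract units, and the linear scan stands in for a real spatial index.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding rectangle used as a footprint."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def intersects(a: Rect, b: Rect) -> bool:
    """True when the two rectangles overlap or touch."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x
            and a.min_y <= b.max_y and b.min_y <= a.max_y)


def match_footprints(query_footprint, doc_footprints):
    """Return ids of documents whose footprint meets the query footprint."""
    return sorted(doc_id for doc_id, fp in doc_footprints.items()
                  if intersects(query_footprint, fp))
```

A query footprint `Rect(0, 0, 10, 10)` matches a document footprint `Rect(8, 8, 12, 12)` but not `Rect(20, 20, 30, 30)`.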
22
Combining text and spatial indexing: spatio-textual indexing
• Space-primary (ST) : textual index for each spatial cell
• Text-primary (TS) : spatial index for each term
• Separate S and T indexes (T)
23
Spatial-primary (ST) method
• Each spatial cell has a text index
• Retrieve document ids for query terms lying in cells intersected by query footprint
• High storage overhead with multiple text indexes
[Figure: spatial query over grid cells A, B, C and D, with a separate text index (terms → docs) shown for cell B and for cell D, and the results]
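A minimal sketch of the spatial-primary idea, assuming a uniform grid and point footprints. `SpacePrimaryIndex`, `CELL_SIZE` and the helper functions are hypothetical names, and the result is a candidate set at cell granularity rather than an exact footprint match.

```python
from collections import defaultdict

CELL_SIZE = 10.0  # assumed grid resolution, in abstract units


def cell_of(x, y):
    """Grid cell containing the point (x, y)."""
    return (int(x // CELL_SIZE), int(y // CELL_SIZE))


def cells_for_rect(min_x, min_y, max_x, max_y):
    """All grid cells intersected by a rectangular query footprint."""
    cx0, cy0 = cell_of(min_x, min_y)
    cx1, cy1 = cell_of(max_x, max_y)
    return [(cx, cy) for cx in range(cx0, cx1 + 1) for cy in range(cy0, cy1 + 1)]


class SpacePrimaryIndex:
    def __init__(self):
        # cell -> term -> set of document ids (one text index per cell)
        self.cells = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, terms, x, y):
        for term in terms:
            self.cells[cell_of(x, y)][term].add(doc_id)

    def query(self, terms, rect):
        """Documents containing all terms, in cells met by the query rectangle."""
        result = set()
        for cell in cells_for_rect(*rect):
            text_index = self.cells.get(cell)
            if text_index is None or not terms:
                continue
            postings = [text_index.get(term, set()) for term in terms]
            result |= set.intersection(*postings)
        return result
```

Because every cell carries its own term dictionary, a term that occurs in many cells is stored many times, which is the storage overhead noted above.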
24
Text primary spatial indexing
• For each text query term, retrieve ids of documents lying in spatial cells intersecting the query footprint
• High storage overhead – multiple spatial indexes
• Time performance better than ST
• For each term, store spatial index of documents containing the term
Index entry: term2 : cellB(D1, D7); cellD(D3, D11, D13)…
[Figure: spatial query; text index with term1, term2 and term3, each pointing to its own grid of cells A, B, C and D; results]
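The text-primary layout can be sketched as a dictionary from term to a per-term cell map. The `term2` entry below copies the index entry shown on the slide; the `term1` entry and the `query_text_primary` function are hypothetical additions for the example.

```python
# term -> cell -> document ids (one small spatial index per term).
TEXT_PRIMARY_INDEX = {
    "term2": {"B": {"D1", "D7"}, "D": {"D3", "D11", "D13"}},  # entry from the slide
    "term1": {"B": {"D7"}, "C": {"D20"}},                      # invented entry
}


def query_text_primary(index, terms, cells):
    """Documents containing all terms that lie in any of the given cells.

    `cells` is the list of cells intersected by the query footprint.
    """
    per_term = []
    for term in terms:
        spatial_index = index.get(term, {})
        docs = set()
        for cell in cells:
            docs |= spatial_index.get(cell, set())
        per_term.append(docs)
    return set.intersection(*per_term) if per_term else set()
```

A query for `term2` whose footprint intersects only cell B returns D1 and D7; adding `term1` to the query narrows the result to D7.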
25
Separate spatial and textual index
• Access spatial index with query footprint
• Access text index with concept terms
• Merge results – find intersection
• Relatively small storage overhead with spatial index
• Time performance superior (in latest experiments)

Text index:
Term1    D1, D2, D23, …
Term2    D9, D11, D100, …
Term3    D27, D85, …
[Figure: spatial index with regions R, R1, R2, R3 and R4 containing documents D1 to D16]
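The "access both indexes, then intersect" scheme can be sketched as follows. The postings reuse the start of the Term1 and Term2 lists from the slide; the document coordinates are invented, and the linear scan in `spatial_lookup` is a stand-in for whatever spatial index structure is actually used.

```python
# Text index: the first postings shown on the slide.
TEXT_INDEX = {
    "term1": {"D1", "D2", "D23"},
    "term2": {"D9", "D11", "D100"},
}

# Invented point footprints for some of the documents (abstract units).
DOC_POINTS = {
    "D1": (2, 3), "D2": (8, 8), "D9": (1, 1), "D11": (9, 2), "D23": (3, 3),
}


def text_lookup(text_index, terms):
    """Documents containing all concept terms."""
    postings = [text_index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def spatial_lookup(doc_points, rect):
    """Documents whose point footprint lies inside the query rectangle."""
    min_x, min_y, max_x, max_y = rect
    return {doc_id for doc_id, (x, y) in doc_points.items()
            if min_x <= x <= max_x and min_y <= y <= max_y}


def search(text_index, doc_points, terms, rect):
    """Query each index independently, then merge by set intersection."""
    return text_lookup(text_index, terms) & spatial_lookup(doc_points, rect)
```

Each document is stored once in each index, which is why the storage overhead stays comparatively small; the cost moves to the merge step.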
26
Anatomy of a Geographical Search Engine
[Figure: architecture diagram, repeated from slide 11]
27
Geographical Relevance Ranking
• Determine “distance” between query footprint and document footprint
• Depends on query spatial operator (in, outside, X km, north_of etc.)
Spatial score
Example: “airports near Leicester”. The further away, the lower the spatial score.
[Figure: footprints labelled D and Q]
Figure from Marc van Kreveld, University of Utrecht
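For the "near" operator, a score that falls as distance grows can be sketched with an exponential decay. The decay function, the `scale_km` parameter and the use of point footprints on a flat plane measured in kilometres are all assumptions made for this example; the slide only states that the score decreases with distance.

```python
import math


def near_score(query_point, doc_point, scale_km=50.0):
    """Spatial score in (0, 1]: 1 at zero distance, decaying with distance.

    Points are (x, y) pairs in kilometres on a flat plane (an assumption);
    scale_km controls how quickly the score drops.
    """
    distance = math.dist(query_point, doc_point)
    return math.exp(-distance / scale_km)
```

Other operators would need their own scoring functions, for instance one that rewards containment for "in" or direction for "north_of".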
28
Combining textual and spatial scores
• Textual scores: BM25
• Spatial scores: by spatial footprint analysis
[Figure: plot with axes “normalized BM25 score” and “spatial score”, each running from 0 to 1; labels: “query / ideal footprint”, “footprints of documents”]
Figure from Marc van Kreveld, University of Utrecht
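One plausible reading of the figure is that each document is a point in a two-dimensional score space and that documents closer to the ideal point (1, 1) rank higher. The sketch below follows that reading; it is an assumption rather than a confirmed description of the project's ranking, and both the normalisation by the maximum BM25 score and the function name are introduced here for illustration.

```python
import math


def rank_by_distance_to_ideal(scores):
    """Rank documents by Euclidean distance to the ideal point (1, 1).

    scores: dict of doc_id -> (raw_bm25, spatial_score), spatial in [0, 1].
    BM25 is normalised by the largest score in the result set (an assumption).
    """
    max_bm25 = max((bm25 for bm25, _ in scores.values()), default=0.0)
    ranked = []
    for doc_id, (bm25, spatial) in scores.items():
        text = bm25 / max_bm25 if max_bm25 > 0 else 0.0
        distance = math.hypot(1.0 - text, 1.0 - spatial)
        ranked.append((distance, doc_id))
    ranked.sort()  # smallest distance to the ideal point first
    return [doc_id for _, doc_id in ranked]
```

A weighted sum of the two scores would be another simple way to combine them; the distance form has the property that a document must do reasonably well on both axes to rank near the top.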
29
Anatomy of a Geographical Search Engine
[Figure: architecture diagram, repeated from slide 11]
30
Place Ontology
Encodes knowledge of terminology and structure of geographic space
• alternative names, languages
• place types (political, topographic, social…)
• footprint (point, MBR, polygon)
• spatial relationships and attributes: containment, adjacency, overlap
• imprecise (vernacular) places (“Midlands”, “south of France”, “Scottish borders”, “Pennines”, “Highlands”…)
Derive from gazetteers, thesauri, maps & web
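A place ontology with the properties listed above might be held in a structure like the following sketch. The `Place` and `PlaceOntology` classes are hypothetical, only alternative names, place type, a point footprint and containment are modelled, and the coordinates are approximate longitude/latitude values included purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Place:
    name: str
    place_type: str
    footprint: Tuple[float, float]          # approximate (lon, lat) point
    alt_names: List[str] = field(default_factory=list)
    parent: Optional[str] = None            # containment relationship


class PlaceOntology:
    def __init__(self, places):
        self.places = {p.name: p for p in places}
        self.by_name = {}
        for p in places:
            for n in [p.name, *p.alt_names]:
                self.by_name.setdefault(n.lower(), []).append(p.name)

    def resolve(self, name):
        """All places known under this (possibly alternative) name."""
        return self.by_name.get(name.lower(), [])

    def contained_in(self, region):
        """Names of all places directly or indirectly inside the region."""
        result = []
        for p in self.places.values():
            current = p.parent
            while current is not None:
                if current == region:
                    result.append(p.name)
                    break
                parent_place = self.places.get(current)
                current = parent_place.parent if parent_place else None
        return sorted(result)


ONTOLOGY = PlaceOntology([
    Place("Wales", "country", (-3.7, 52.3), alt_names=["Cymru"]),
    Place("Cardiff", "city", (-3.18, 51.48), alt_names=["Caerdydd"], parent="Wales"),
    Place("Swansea", "city", (-3.94, 51.62), alt_names=["Abertawe"], parent="Wales"),
    Place("Roath", "district", (-3.16, 51.49), parent="Cardiff"),
])
```

`ONTOLOGY.contained_in("Wales")` lists Cardiff, Roath and Swansea, which is the kind of expansion a query such as “Castles in Wales” needs when documents name places inside Wales without mentioning Wales itself.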
31
Roles of Place Ontology
[Figure: diagram of the ontology and the components it serves: User Interface; Query Disambiguation; Query Expansion (query footprint); Search Component; Spatial Index (document footprints); Relevance Ranking; Geo-Tagging; Metadata Extraction (document footprints); Web collection]
32
Mining text on the web for vernacular place name knowledge
• Objective: estimate spatial extent of vague place
• Documents that refer to vague places may also refer to more precise places inside them.
• Places that occur frequently in association with a target named place may have higher chance of being inside
• Analyse frequency of occurrence of co-located places
33
Places mentioned in documents retrieved by queries on the “Cotswolds”
Figure from Ross Purves et al., University of Zurich
34
Summary of web mining procedure
• Submit web search engine queries referring to a target place
• Geo-Parse resulting highest ranking web pages for occurrence of place names
• Geocode place names with coordinates
• Create density surface model of co-occurring places and extract approximate boundary (contour).
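The last step of the procedure can be approximated by counting geocoded points per grid cell and keeping the cells above a threshold. This is a deliberately crude stand-in: it replaces a smoothed density surface and a contour with a raw cell count, and the function names, cell size and coordinates are all assumptions for the example.

```python
from collections import Counter


def density_grid(points, cell_size):
    """Count geocoded points per grid cell (a crude density surface)."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)


def extent_cells(grid, threshold):
    """Cells whose point count reaches the threshold: the estimated extent."""
    return {cell for cell, count in grid.items() if count >= threshold}
```

With points `[(1, 1), (2, 2), (3, 1), (12, 1), (55, 55)]` and a cell size of 10, a threshold of 2 keeps only the cell containing the first three points; the isolated point at (55, 55), which could represent a wrongly geocoded name, is excluded.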
35
Formulating appropriate web queries
• Region only, e.g. “Rocky Mountains”
– Retrieves all documents mentioning the name
• Region + Concept, e.g. “Hotels in Cotswolds”
– Tends to retrieve directory pages listing places associated with the target place
• Region and lexical pattern (trigger phrase), e.g. “Midwest towns such as”; “in the South of France”
– Reduces the number of relevant documents retrieved but can work well for those documents
– Problem of not enough “hits” for statistical analysis
• Region + Concept produces highest numbers of co-associated places in top ranking documents.
36
Devon (county)
[Figure panels: distribution of associated places; density surface; density surface at three threshold levels (1, 0.5, 0.25 points per cell); thresholded boundary compared with actual boundary]
Note: some places wrongly geocoded
Figure from Ross Purves et al., University of Zurich
37
Vague place: Mittelland
Evidence for validity of method
[Figure panels: human interpretations of the extent (+ is the “core”); density surface of web mining results]
Figure from Ross Purves, University of Zurich
38
GIR and GIS
• GIR currently dominated by web search
– Unstructured results in multiple documents
• Sometimes single focused result wanted
• Hotels within 1 kilometre of the British Museum in London
• Where are pre-sixteenth century dwellings in USA?
• Which areas of East Anglia would be flooded if sea level rose by 1 metre?
39
Bringing GIR and GIS together
[Figure: two versions of the diagram relating Geo-knowledge, GIS, The Web, GIR and World Knowledge]
40
GeoInformation Services
Encode Geo-information in Web Services (Geo-services)
• Parse natural language queries
• Interpret geo-terminology of queries
• Identify the relevant geo-services to match geo and non-geo concepts
• Compose appropriate chain of services
41
Where is GIR going?
• Improve “conventional GIR” components:
– Geo-tagging, spatio-textual indexing and geo-relevance ranking
• Creation of rich place ontologies with world-wide coverage
• Improved understanding of spatial natural language terminology
• Open GeoInformation Web services
• Adapt GIR to personal needs (“What’s the quickest way out of here?”)
42
More Information
• SPIRIT project partners with local representatives:
– Cardiff University (Chris Jones, Project coordinator)
– University of Sheffield (Mark Sanderson and Paul Clough)
– IGN, Paris (Anne Ruas)
– University of Utrecht (Marc van Kreveld)
– University of Hannover (Monika Sester)
– University of Zurich (Ross Purves and Rob Weibel)
• See www.geo-spirit.org for information on SPIRIT project and downloads of articles and project deliverables.
[N.B. Prototype search engine (with link from SPIRIT web site) is no longer functional]