OpenStreetMap geocoder based on Solr
DESCRIPTION
Presented by Ishan Chattopadhyaya, LucidWorks. This talk covers the technical aspects of a new OpenStreetMap geocoder based on Apache Solr and Lucene. Recent releases of Apache Lucene and Apache Solr (4.0 onwards) have markedly improved spatial search capabilities. Solr's improved support for distributed storage and search, via SolrCloud mode, also makes it easier to scale applications built on it. OpenStreetMap's current geocoder, Nominatim, is based on PostgreSQL/PostGIS. Benefits of using Solr (compared to a database system like Postgres) for building a geocoder include robust partial text search, analysis in various languages (stemming, tokenization, stop words, etc.), spell check, faceting, and highlighting. Through this presentation, the author intends to build an appreciation for a Solr-based geocoder.
TRANSCRIPT
Ishan Chattopadhyaya
LucidWorks
OpenStreetMap Foundation
Twitter: @ichattopadhyaya, OSM: chatman
What is OpenStreetMap?
● Wikipedia of GeoData
● OpenStreetMap is a project aimed squarely at creating and providing free geographic data such as street maps to anyone who wants them.
State of OSM
● Commercial competitors
– Google Maps
– Bing Maps
● http://tools.geofabrik.de/mc/
The OpenStreetMap Software Stack
What is a Geocoder?
● Input: raw query
● Output: geocoordinates
Goals for the new Geocoder
● Search for:
– Cities and towns
– Streets
– Address points
– Places of Interest, Businesses, Amenities, Attractions etc.
● Reverse geocoding
● Support for fuzzy queries
Good changes in Lucene/Solr 4.x
● Support for indexing polygons
– RecursivePrefixTree indexing
● Special spatial search predicates
– Contains
– IsWithin
– Intersects
– Etc.
● Reference: David Smiley's LuceneRevolution presentation
● SolrCloud mode for distributed indexing/searching
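As a rough illustration of the spatial predicates listed above, the sketch below builds a Solr 4 style filter string of the form `field:"Predicate(WKT)"`. The helper name `spatial_filter` and the sample polygon are assumptions made for this example; the field name `geo` follows the schema slides. Polygon shapes in Solr 4 typically also require the JTS library on the classpath.

```python
# Predicates understood by Solr 4 spatial (RecursivePrefixTree) fields.
ALLOWED_PREDICATES = {"Intersects", "IsWithin", "Contains", "IsDisjointTo"}


def spatial_filter(field: str, predicate: str, wkt: str) -> str:
    """Return a filter query string such as geo:"Intersects(POLYGON((...)))".

    `spatial_filter` is a hypothetical helper, not part of any Solr client.
    """
    if predicate not in ALLOWED_PREDICATES:
        raise ValueError(f"unsupported predicate: {predicate}")
    return f'{field}:"{predicate}({wkt})"'


# Hypothetical bounding polygon around central Dublin (lon lat order, as in WKT).
dublin_box = "POLYGON((-6.3 53.3, -6.2 53.3, -6.2 53.4, -6.3 53.4, -6.3 53.3))"
print(spatial_filter("geo", "IsWithin", dublin_box))
```

The resulting string would be passed as an `fq` parameter on a search request.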
Architecture
[Diagram components: Planet dumps, Indexer, Solr, API Layer, www.Geocoder.in]
Indexing: OSM Data format
● Node
– “A node defines a single geospatial point using a latitude and longitude.”
● Way
– “A way is an ordered list of between 2 and 2,000 nodes. Ways are used to represent linear features (vectors), such as rivers or roads.”
● Relation
– “A Relation is an all-purpose data structure that documents a relationship between two or more other objects.”
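The three OSM element types can be modelled in a few lines. This is a minimal sketch for orientation only: the class layout, the `Member` type, and the `way_coordinates` helper are assumptions for this example, not the indexer's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A single geospatial point."""
    id: int
    lat: float
    lon: float
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    """An ordered list of between 2 and 2,000 node ids."""
    id: int
    node_ids: List[int]
    tags: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not 2 <= len(self.node_ids) <= 2000:
            raise ValueError("a way needs between 2 and 2,000 nodes")


@dataclass
class Member:
    """One member of a relation: a reference to a node, way, or relation."""
    type: str  # "node", "way" or "relation"
    ref: int
    role: str = ""


@dataclass
class Relation:
    """Documents a relationship between other objects."""
    id: int
    members: List[Member]
    tags: Dict[str, str] = field(default_factory=dict)


def way_coordinates(way: Way, nodes: Dict[int, Node]) -> List[Tuple[float, float]]:
    """Resolve a way's node ids to (lon, lat) pairs, e.g. to build a line shape."""
    return [(nodes[n].lon, nodes[n].lat) for n in way.node_ids]
```

Resolving node references like this is the step an indexer would need before it can turn a way into a shape for the `geo` field.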
Indexing: Facts and figures
● Number of OSM Nodes in the database = 2071039612
● Number of OSM Ways in the database = 202570637
● Number of OSM Relations in the database = 2217240
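Indexing: Schema
● Address fields: admin2, admin3, admin4, admin5, admin6, admin7, street, st_type
● Example values, in slide order with empty fields omitted: Ireland, Dublin County, Dublin, Ballsbridge, Lansdowne, Street
● Other fields: name, level, geo, popularity
● Example values: name = Lansdowne Street, level = s, geo = <shape>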
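Indexing: Schema
● Address fields: admin2, admin3, admin4, admin5, admin6, admin7, street, st_type
● Example values, in slide order with empty fields omitted: Ireland, Dublin County, Dublin
● Other fields: name, level, geo, popularity
● Example values: name = Dublin, level = 6, geo = <shape>, popularity = 1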
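Indexing: Schema (POIs)
● Address fields: admin2, admin3, admin4, admin5, admin6, admin7, street, st_type
● Example values, in slide order with empty fields omitted: Ireland, Dublin County, Dublin, Ballsbridge
● Other fields: name, category, geo
● Example values: name = Ballsbridge Hotel, category = hotel, geo = <shape>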
Searching
Raw query → Classifier → Classifications → Validator → Valid classifications → Geocoder (lookup) → Structured location + geocodes
Searching: Classification
Query → Tokenizer → Shingles → Bloom Filters → Classifications
● Query = “hotels near lansdowne rd dublin”
● Shingles: hotels, near, lansdowne, rd, dublin, hotels near, near lansdowne, lansdowne rd, rd dublin, .., hotels near lansdowne rd dublin
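The shingle step can be sketched as generating every contiguous run of tokens in the query. This is a minimal sketch assuming whitespace tokenization and lowercasing; the function name `shingles` is chosen for this example, and the real tokenizer may well differ.

```python
from typing import List


def shingles(query: str) -> List[str]:
    """Return all contiguous word n-grams of the query, shortest first."""
    tokens = query.lower().split()
    result = []
    for size in range(1, len(tokens) + 1):
        for start in range(len(tokens) - size + 1):
            result.append(" ".join(tokens[start:start + size]))
    return result


for shingle in shingles("hotels near lansdowne rd dublin"):
    print(shingle)
```

A five-token query yields 5 + 4 + 3 + 2 + 1 = 15 shingles, from the single words up to the full query.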
Searching: Classification
● Each shingle is checked against the Bloom filters: Cat, A2, A4, A5, Streets
– hotels: Match
– dublin: Match, Match
– lansdowne: Match, Match
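The Bloom filter lookup can be sketched as one filter per field, with every shingle tested against each of them. Everything below is an assumption made for illustration: the tiny `BloomFilter` class, its size and hash scheme, and the sample vocabulary loaded into each filter. A production system would populate the filters from the indexed data.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List, Tuple


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def build_filters(vocabulary: Dict[str, Iterable[str]]) -> Dict[str, BloomFilter]:
    """Create one filter per field and load its known terms."""
    filters = {}
    for label, terms in vocabulary.items():
        bloom = BloomFilter()
        for term in terms:
            bloom.add(term)
        filters[label] = bloom
    return filters


def classify(candidates: Iterable[str],
             filters: Dict[str, BloomFilter]) -> List[Tuple[str, str]]:
    """Return (shingle, field) for every filter that reports a possible match."""
    return [(c, label) for c in candidates
            for label, bloom in filters.items() if c in bloom]


# Hypothetical vocabulary; real filters would be built from the index.
filters = build_filters({
    "category": ["hotels"],
    "admin5": ["lansdowne", "dublin"],
    "street": ["lansdowne", "dublin"],
})
print(classify(["hotels", "near", "lansdowne", "rd", "dublin"], filters))
```

Because a Bloom filter can report false positives, a later validation step against the real index is still needed, which fits the Validator stage in the search pipeline.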
Searching: Classifications
● Query = “hotels near lansdowne rd dublin”
● Classifications:
– hotels = category
– lansdowne = admin5
– lansdowne = street
– dublin = admin5
– dublin = street
● Possible permutations:
– C.5.5
– C.S.5
– C.5.S
– C...5
– C.5..
– etc.
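The permutation notation appears to use one character per query token: C for category, 5 for admin5, S for street, and "." for a token left unclassified. Under that reading, the candidate patterns can be sketched as a Cartesian product over each token's options. The function name `candidate_patterns` and the rule that any token may stay unclassified are assumptions for this example.

```python
from itertools import product
from typing import Dict, List

# One-character codes, as read from the slide notation.
CODES = {"category": "C", "admin5": "5", "street": "S"}


def candidate_patterns(tokens: List[str],
                       classifications: Dict[str, List[str]]) -> List[str]:
    """Enumerate every way of assigning each token one of its labels or '.'."""
    options = []
    for token in tokens:
        labels = classifications.get(token, [])
        options.append([CODES[label] for label in labels] + ["."])
    return ["".join(combo) for combo in product(*options)]


tokens = ["hotels", "near", "lansdowne", "rd", "dublin"]
classifications = {
    "hotels": ["category"],
    "lansdowne": ["admin5", "street"],
    "dublin": ["admin5", "street"],
}
for pattern in candidate_patterns(tokens, classifications):
    print(pattern)
```

This enumeration also produces patterns that make little sense, such as one with two streets; pruning those would be the job of the validation step.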
Searching: Solr Query
● Query = “hotels near lansdowne rd dublin”
● Possible permutations:
– C.5.5: +level:5 +admin5:lansdowne +admin5:dublin
– C.S.5: +level:s +street:lansdowne +admin5:dublin
– C.5.S: +level:s +street:dublin +admin5:lansdowne
– C...5: +level:5 +admin5:dublin
– C.5..: +level:5 +admin5:lansdowne
– etc.
"POINT (-6.232063,53.333833)"
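The mapping from a permutation to a Solr query string can be sketched as follows. The rules used here are inferred from the examples on the slide and are assumptions: the level clause is `s` when any token is treated as a street and `5` otherwise, street clauses come before admin5 clauses, and the category token is left out because it is handled in the POI search. The function name `pattern_to_query` is hypothetical.

```python
from typing import List


def pattern_to_query(tokens: List[str], pattern: str) -> str:
    """Translate a pattern such as 'C.S.5' into required Lucene clauses."""
    if len(tokens) != len(pattern):
        raise ValueError("pattern must have one character per token")
    street_clauses = []
    admin_clauses = []
    for token, code in zip(tokens, pattern):
        if code == "S":
            street_clauses.append(f"+street:{token}")
        elif code == "5":
            admin_clauses.append(f"+admin5:{token}")
        # 'C' (category) and '.' (unclassified) add no location clause.
    if not street_clauses and not admin_clauses:
        raise ValueError("pattern contains no location component")
    level = "s" if street_clauses else "5"
    return " ".join([f"+level:{level}"] + street_clauses + admin_clauses)


tokens = ["hotels", "near", "lansdowne", "rd", "dublin"]
for pattern in ["C.5.5", "C.S.5", "C.5.S", "C...5", "C.5.."]:
    print(pattern, "->", pattern_to_query(tokens, pattern))
```

Each generated query would be run against Solr, and a matching document's shape or point then serves as the anchor location for the POI search.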
Searching: Searching for POIs
● Query = “hotels near lansdowne rd dublin”
● Query = “hotels near” near "POINT (-6.232063,53.333833)"
● Solr query:
– fl=*,score
– sort=score asc
– q={!geofilt score=distance filter=false sfield=geo pt=53.333833,-6.232063 d=10}
– fq=+category:hotel
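The request parameters for the POI search can be assembled with the standard library. This is a minimal sketch: the function name `poi_search_params` and its defaults are assumptions for this example, while the parameter values mirror the Solr query shown on the slide. Note that `pt` takes latitude first, the reverse of the WKT `POINT` order.

```python
from typing import Dict
from urllib.parse import urlencode


def poi_search_params(category: str, lat: float, lon: float,
                      radius_km: float = 10, field: str = "geo") -> Dict[str, str]:
    """Build Solr parameters that rank POIs of one category by distance."""
    geofilt = (
        f"{{!geofilt score=distance filter=false "
        f"sfield={field} pt={lat},{lon} d={radius_km:g}}}"
    )
    return {
        "q": geofilt,                 # score each document by its distance
        "fq": f"+category:{category}",  # restrict to the requested category
        "fl": "*,score",              # return all fields plus the distance score
        "sort": "score asc",          # nearest first
    }


params = poi_search_params("hotel", lat=53.333833, lon=-6.232063)
print(urlencode(params))
```

With `filter=false`, the spatial clause only scores documents and does not restrict them to the radius; the category filter in `fq` does the narrowing.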
Challenges: Indexing
● Street Associativity
● Incomplete polygons
Challenges
● Handling Updates
● Data validation
Distributed Search
● Need for distributed search?
● Geographical partitioning