lucene 4 spatial

© 2012 The MITRE Corporation. All rights reserved. LUCENE 4 SPATIAL 2012 Basis Technology Open Source Search Conference Presented by David Smiley, MITRE

Upload: david-smiley

Post on 14-Jan-2015

DESCRIPTION

Covers the new Apache Lucene 4 spatial module. Includes Solr usage info. Applicable to ElasticSearch too. Presented at the 2012 Open Source Search in Government conference by Basis Technology.

TRANSCRIPT

Page 1: Lucene 4 spatial

© 2012 The MITRE Corporation. All rights reserved.

LUCENE 4 SPATIAL

2012 Basis Technology Open Source Search Conference

Presented by David Smiley, MITRE

Page 2: Lucene 4 spatial

About David Smiley
• Working at MITRE for 12 years
  • Web development, Java, search
  • 3 Solr apps, 1 Endeca
• Published 1st book on Solr; then 2nd edition (2009, 2011)
• Apache Lucene / Solr committer (2012)
  • Specializing in spatial
• Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011)
• Taught Solr classes at MITRE (2010, 2011, 2012)
• Solr search consultant within MITRE and its sponsors, and privately via OpenSource Connections

Page 3: Lucene 4 spatial

What is Spatial Search?

Primary features:
• Spatial filter query
• Spatial distance sorting
• Spatial distance relevancy (i.e. spatial query score)

NOT “geocoding”, i.e. resolving a place name such as “Boston” to its latitude and longitude

Typical use-case:

1. Index a location for each Lucene document given a latitude & longitude

2. Then search for matching documents by a circle (point-radius) or bounding box

3. Then sort results by distance
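
The three steps above can be sketched without any search library. The following is a minimal illustration in plain Python, not how Lucene implements it: the haversine formula, the Earth radius constant, the `search_circle` helper, and the sample coordinates are all assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius, for illustration only


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def search_circle(docs, center_lat, center_lon, radius_km):
    """Keep docs inside the circle (step 2), then sort by distance (step 3)."""
    hits = []
    for doc_id, (lat, lon) in docs.items():
        d = haversine_km(center_lat, center_lon, lat, lon)
        if d <= radius_km:
            hits.append((d, doc_id))
    hits.sort()
    return [(doc_id, d) for d, doc_id in hits]


# Step 1: "index" one location per document (hypothetical sample data).
docs = {
    "boston": (42.3601, -71.0589),
    "bedford": (42.4906, -71.2760),
    "oakland": (37.8044, -122.2712),
}
results = search_circle(docs, 42.3601, -71.0589, 50.0)
```

A real index avoids computing the distance for every document; the strategies on the later slides exist precisely to narrow the candidates first.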

Page 4: Lucene 4 spatial

History of Spatial for Lucene & Solr
• 2007: Local-Lucene
  • by Patrick O’Leary (AOL)
• 2009-09: LL -> Lucene spatial contrib in Lucene 2.9.0
  • Local-Lucene graduates to an official Lucene contrib module
• 2009-12: Spatial Search Plugin (SSP) for Solr
  • by Chris Male (JTeam -> Orange11, ElasticSearch)
• 2010-10: SOLR-2155, a geohash prefix tree filter
  • by David Smiley (MITRE)
• 2011-01: Lucene Spatial Playground (LSP)
  • by Ryan McKinley (Voyager GIS), David, and Chris
• 2011-03: Solr 3.1 new spatial features
  • by Grant Ingersoll and Yonik Seeley (LucidWorks)
• 2012-03: LSP -> Lucene 4 spatial module + Spatial4j
  • replaces former Lucene spatial contrib module

Page 5: Lucene 4 spatial

Lucene Spatial Committers
• David Smiley, MITRE
  • Bedford, MA
• Chris Male, ElasticSearch
  • New Zealand
• Ryan McKinley, Voyager GIS
  • Oakland, CA

Page 6: Lucene 4 spatial

Breakdown of Spatial Components

• Spatial4j: 43%
• Lucene spatial: 36%
• Solr adapters: 6%
• Misc: 16%

Total: 4,781 Non-Comment Source Statements (without javadocs or tests)

Page 7: Lucene 4 spatial

Spatial4j: It’s all about the shapes
• Shapes
  • Types: Point, Rectangle, Circle, Polygon
  • Geospatial & Euclidean/2D implementations
  • Intersection: within, contains, intersects, disjoint
• Distance and area math utilities
• Input/Output serialization to Well Known Text (WKT)
  • Ex: POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))
• ASL licensed project independent of Apache on GitHub
• Requires JTS (3rd party LGPL) for polygon & WKT support
• Ported to .NET as Spatial4n and used by RavenDB
  • by Itamar Syn-Hershko
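
To make the four intersection relations concrete, here is a small sketch for axis-aligned rectangles in a flat (Euclidean) plane. It is not Spatial4j's API: the `Rect` class and `relate` function are hypothetical names for this example, and dateline-crossing geospatial rectangles are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def relate(a: Rect, b: Rect) -> str:
    """Return how rectangle a relates to rectangle b (Euclidean only)."""
    # No shared area or edge at all.
    if (a.max_x < b.min_x or a.min_x > b.max_x
            or a.max_y < b.min_y or a.min_y > b.max_y):
        return "DISJOINT"
    # a lies entirely inside b (equal rectangles also land here).
    if (a.min_x >= b.min_x and a.max_x <= b.max_x
            and a.min_y >= b.min_y and a.max_y <= b.max_y):
        return "WITHIN"
    # b lies entirely inside a.
    if (b.min_x >= a.min_x and b.max_x <= a.max_x
            and b.min_y >= a.min_y and b.max_y <= a.max_y):
        return "CONTAINS"
    # Partial overlap.
    return "INTERSECTS"


small, big = Rect(0, 0, 1, 1), Rect(0, 0, 10, 10)
print(relate(small, big), relate(big, small))
```

Note that "within" and "contains" are mirror images of each other, which is why a library only needs one of them plus the ability to swap operands.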

Page 8: Lucene 4 spatial

Lucene 4 Spatial Module
• There isn’t one best way to implement spatial indexing for all use-cases
  • Index just points, or other shapes too? Which?
  • Multiple shapes per field?
  • Query by Intersection? Contains? Within? Equals? Disjoint? …
  • Distance sorting? Query boost by distance?
    • Or more exotic shape relevancy like overlap percentage?
  • Tradeoff shape precision for speed?
• Multiple SpatialStrategy implementations:
  • RecursivePrefixTreeStrategy and TermQueryPrefixTreeStrategy
  • PointVectorStrategy
  • BBoxStrategy (currently in trunk, not 4x)
  • JtsGeoStrategy (in Spatial4j/LSP)

Names subject to change!

Page 9: Lucene 4 spatial

Strategy: PointVector
• Similar to Solr’s PointType / LatLonType
  • X & Y trie double fields; caching via FieldCache
• Characteristics
  • Indexes points (only)
  • Single-valued field (no multi)
  • Query by rectangle or circle (only)
    • Circle uses FieldCache (requires memory)
    • Circle does bbox pre-filter for performance
  • Relations: Intersects, Within (only)
  • Exact precision for x & y coordinates and query shape
  • Distance sort
    • Uses FieldCache (requires memory)
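
The "bbox pre-filter" idea for circle queries can be sketched as follows. This is only an illustration in a flat 2D plane with an in-memory dictionary standing in for the index; the `circle_filter` name and the sample points are assumptions, and the real strategy works on indexed numeric fields rather than Python loops.

```python
import math


def circle_filter(points, cx, cy, radius):
    """Return ids of points within `radius` of (cx, cy).

    points: dict of doc_id -> (x, y). The cheap bounding-box test plays the
    role of two numeric range queries; the exact distance is computed only
    for the points that survive it.
    """
    min_x, max_x = cx - radius, cx + radius
    min_y, max_y = cy - radius, cy + radius
    hits = []
    for doc_id, (x, y) in points.items():
        # Pre-filter: reject anything outside the circle's bounding box.
        if not (min_x <= x <= max_x and min_y <= y <= max_y):
            continue
        # Exact check: points in the box corners are still outside the circle.
        if math.hypot(x - cx, y - cy) <= radius:
            hits.append(doc_id)
    return hits


points = {"a": (0.0, 0.0), "b": (0.9, 0.9), "c": (5.0, 5.0)}
matches = circle_filter(points, 0.0, 0.0, 1.0)
```

Here "c" is rejected by the box alone, while "b" passes the box but fails the exact distance test because it sits in a corner of the box.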

Page 10: Lucene 4 spatial

Strategy: RecursivePrefixTree
• Grid / Tile / Trie / Prefix-Tree based
  • With recursive descent algorithm
  • Or TermQueryPrefixTree alternative
• Choose Geohash (geo only) or Quad tree
• The most mature strategy to date
  • The current evolution of SOLR-2155

Potential rename to GridFilterSpatialStrategy
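
To show why a geohash behaves like a prefix tree, here is a sketch of the standard public geohash encoding in plain Python. It is not the module's own code; the function name and default precision are assumptions for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=11):
    """Encode a lat/lon point as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    even = True  # even bits halve the longitude range, odd bits the latitude
    while len(chars) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


coarse = geohash_encode(57.64911, 10.40744, 4)
fine = geohash_encode(57.64911, 10.40744, 8)
```

Each extra character subdivides the previous cell, so the coarse hash is always a prefix of the finer one. That property is what lets grid cells be indexed as terms and matched by prefix.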

Page 11: Lucene 4 spatial

Strategy: RecursivePrefixTree
• Characteristics:
  • Indexes all shapes
    • Variable precision of shape edges
    • Highly precise shapes other than point won’t scale
    • LineStrings possibly not precise enough for your needs
  • Multi-valued field support
  • Query by any shape
    • Variable precision for query shape
    • Highest precision usually scales
  • Relations: Intersects (only)
  • Distance sort (w/ multi-value support)
    • Warning: immature, won’t scale
    • Uses significant amounts of memory
  • Fast spatial filtering; no cache needed
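
The trade-off between precision and scale becomes clearer with a sketch of recursive descent over a quad tree. The code below is a simplified illustration, not the strategy's implementation: the `cover` function, the A/B/C/D child labels, and the world bounds are assumptions, and only rectangle queries are handled.

```python
WORLD = (-180.0, -90.0, 180.0, 90.0)  # (min_x, min_y, max_x, max_y)


def intersects(a, b):
    """True if rectangles a and b share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def within(a, b):
    """True if rectangle a lies entirely inside rectangle b."""
    return a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]


def cover(query, max_level, cell=WORLD, token=""):
    """Return tokens of quad cells covering `query`, descending recursively."""
    if not intersects(cell, query):
        return []  # prune this whole subtree
    if within(cell, query) or len(token) == max_level:
        return [token]  # fully covered, or precision limit reached
    min_x, min_y, max_x, max_y = cell
    mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
    children = {
        "A": (min_x, mid_y, mid_x, max_y),  # top-left
        "B": (mid_x, mid_y, max_x, max_y),  # top-right
        "C": (min_x, min_y, mid_x, mid_y),  # bottom-left
        "D": (mid_x, min_y, max_x, mid_y),  # bottom-right
    }
    tokens = []
    for label, child in children.items():
        tokens.extend(cover(query, max_level, child, token + label))
    return tokens


tokens = cover((10.0, 10.0, 20.0, 20.0), 4)
```

Raising `max_level` tightens the approximation of the shape's edges but multiplies the number of cells, which is the reason highly precise non-point shapes do not scale well.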

Page 12: Lucene 4 spatial

Strategy: BBox
• Implemented with 4 doubles & 1 boolean
• Ported from ESRI Open Source GeoPortal
• Characteristics:
  • Indexes rectangles (only)
  • Single-valued field (no multi)
  • Query by rectangle (only)
    • Supports all relations: Intersects, Within, Contains, …
  • Distance sort from box center
    • Uses FieldCache (requires memory)
  • Area overlap sorting
    • Sort results by percentage overlap between query and indexed boxes
    • Uses FieldCache (requires memory)
• Note: FieldCache needs are somewhat high
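
Area overlap sorting can be sketched as below. The exact scoring formula used by the strategy is not given on the slide, so this example assumes the simplest possible measure: the fraction of the query box covered by the indexed box. The function names and sample boxes are hypothetical.

```python
def box_area(box):
    """Area of (min_x, min_y, max_x, max_y); zero for inverted boxes."""
    min_x, min_y, max_x, max_y = box
    return max(0.0, max_x - min_x) * max(0.0, max_y - min_y)


def intersection(a, b):
    """Intersection box of a and b (inverted if they do not overlap)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def overlap_fraction(query, indexed):
    """Assumed score: share of the query box's area covered by `indexed`."""
    q_area = box_area(query)
    if q_area == 0.0:
        return 0.0
    return box_area(intersection(query, indexed)) / q_area


def sort_by_overlap(query, boxes):
    """Return (doc_id, score) pairs, highest overlap first."""
    scored = [(overlap_fraction(query, box), doc_id)
              for doc_id, box in boxes.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored]


boxes = {
    "a": (0, 0, 10, 10),
    "b": (5, 5, 15, 15),
    "c": (20, 20, 30, 30),
}
ranked = sort_by_overlap((0, 0, 10, 10), boxes)
```

Because only four numbers per document are needed to compute the score, the values can be held in memory per document, which matches the slide's note about FieldCache use.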

Page 13: Lucene 4 spatial

Strategy: JtsGeoStrategy
• Stores any JTS geometry in Lucene 4’s DocValues
  • Stores WKB (Well Known Binary), the binary counterpart of WKT
  • Full vector geometry is retained for search
• DocValues is mostly a better FieldCache
  • Faster loading into memory
  • Can be disk resident or memory
• Characteristics:
  • Indexes any shape
  • Single valued field but can be MultiPoint, MultiPolygon, etc.
  • Query by any shape
    • Uses DocValues (memory use optional)
  • Supports all relations: intersect, within, contains, …
  • No sorting
  • Experimental / immature status
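
For readers unfamiliar with WKB, the sketch below encodes and decodes a single 2D point using only Python's `struct` module. It follows the standard WKB layout (a byte-order flag, a 32-bit geometry type, then the coordinates as doubles) and is not taken from JTS or the strategy itself; the helper names are assumptions.

```python
import struct

WKB_LITTLE_ENDIAN = 1  # byte-order flag value for little-endian
WKB_POINT = 1          # geometry type code for a 2D Point
_POINT_FORMAT = "<BIdd"  # flag, type, x, y with no padding


def point_to_wkb(x, y):
    """Serialize a 2D point to little-endian WKB (21 bytes)."""
    return struct.pack(_POINT_FORMAT, WKB_LITTLE_ENDIAN, WKB_POINT, x, y)


def wkb_to_point(data):
    """Parse little-endian WKB produced by point_to_wkb back into (x, y)."""
    byte_order, geom_type, x, y = struct.unpack(_POINT_FORMAT, data)
    if byte_order != WKB_LITTLE_ENDIAN or geom_type != WKB_POINT:
        raise ValueError("only little-endian 2D points are handled in this sketch")
    return x, y


blob = point_to_wkb(-90.57341, 43.17614)
x, y = wkb_to_point(blob)
```

The compact, fixed layout is what makes WKB a better fit than WKT text for per-document binary storage.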

Page 14: Lucene 4 spatial

Solr Adapters
• Configuration:
    <fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
               spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
               distErrPct="0.025" maxDistErr="0.000009" />
    <field name="geo" type="geo" indexed="true" stored="true" multiValued="true" />
• Adding data:
    <field name="geo">43.17614,-90.57341</field>
    <field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>
• Search Filter:
    fq=geo:"Intersects(Circle(54.729696,-98.525391 d=10))"
• Distance Sort:
    sort=query($sortsq) asc&sortsq={! score=distance v=$sq}&sq=geo:"Intersects(Circle(54.729696,-98.525391 d=10))"
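
When these parameters are sent over HTTP they need URL encoding, which is easy to get wrong by hand. The sketch below builds the query string with Python's standard library. The `spatial_params` helper is hypothetical, and the `q=*:*` match-all main query is an assumption added so the request is complete; the filter and sort values mirror the slide.

```python
from urllib.parse import parse_qs, urlencode


def spatial_params(field, lat, lon, d, sort_by_distance=False):
    """Build Solr request parameters for a circle filter, optionally sorted."""
    shape_query = f'{field}:"Intersects(Circle({lat},{lon} d={d}))"'
    params = [
        ("q", "*:*"),         # assumed match-all main query
        ("fq", shape_query),  # spatial filter
    ]
    if sort_by_distance:
        params += [
            ("sort", "query($sortsq) asc"),
            ("sortsq", "{! score=distance v=$sq}"),
            ("sq", shape_query),
        ]
    return params


params = spatial_params("geo", 54.729696, -98.525391, 10, sort_by_distance=True)
query_string = urlencode(params)  # safe to append after "/select?"
decoded = parse_qs(query_string)
```

Passing the shape once as `sq` and referencing it with `$sq` keeps the sort clause in step with whatever shape is being queried.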

Page 15: Lucene 4 spatial

Future Possibilities
• Solr:
  • Filter out points in multi-valued field from search results not matching filter
  • Heatmap/grid faceting spatial summarization
• Spatial-Temporal search
  • 3d (x,y,t) point shapes, and “track” shape queries
• Support any query shape for all Strategies
• PrefixTreeStrategy:
  • More efficient binary grid encoding; use Hilbert Curve order
  • Better multi-value point caches
  • Cache-less sort of top-N results
  • More query relations: Contains, Within
• Configurable DocValues vs. FieldCache choice
• Choose floats or configurable bits instead of forcing doubles
• CircleStrategy