h-hypermap heatmap analytics at scale

Post on 07-Jan-2017

49 Views

Category:

Technology

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A

H-Hypermap: Heatmap Analytics at ScaleDavid Smiley

Freelance Search Developer/Consultant

About: David Smiley• Software Engineer (16 years)• Search (7 years)• Java (full-stack), Web, Spatial• Freelance search consultant / developer• Apache Lucene / Solr committer & PMC• Wrote first book on Solr, updated twice

Agenda• About this project• Architecture• Solr & time sharding• Experiences with:

– Kotlin, Dropwizard, Swagger

– Kafka– Docker, Kontena

• Solr for geo-enrichment• Solr adapter for Lucene

BKD Lat-Lon point search & sort

• Heatmaps– Existing functionality

• demo– New functionality

H-Hypermap / BOP• Harvard University, CGA:

Center for Geospatial Analysis http://gis.harvard.edu

• Harvard Hypermap Project– Managed by Ben Lewis

• BOP “Billion Object Platform”– Funded by the Sloan Foundation

BOP Requirements Summary

• Most recent ~billion geo-tweets• Realtime search (<5 sec latency)• Sub-second queries– Including heatmaps!

• On the cheap: ~6 mediocre boxes

Provide a proof-of-concept platform designed to lower the barrier for researchers who need to access big streaming spatio-temporal datasets.

Logical High-Level Architecture

Archival

Realtime

Harvesting Enrichment

various clients...

various clients...

Data flows via Apache Kafka Systems exposeHTTP web services

“BOP”

Shard: W51

The BOPKafka Topic Ingester

ZooKeeper

Shard: W52Shard: W53Shard: W54Shard: RT

...

Web-Service

Kafka Streams• Create Solr doc• Routes to shard

REST/JSON API• Keyword search• Faceting• Heatmaps• CSV export

...

BOP Solr Sharding ArchitectureRealtime

T2016_05_20T2016_05_06T2016_04_22T2016_04_08

… 4-5 mo.

T2016_05_20T2016_05_06T2016_04_22T2016_04_08

… 4-5 mo.

G_North_America G_Elsewhere

Lone Realtime Collection/Shard. 1-25 hrsCopy then delete, at night

• Realtime shard is where realtime search happens. No caches, but small.

• Primary collections have useful caches• Housekeeping Tasks:

• Move data from RT to primary• Create new shards; expire old• Merge/optimize shards

Building a Search Web-Service• Kotlin language (JVM based)– Nullity as first-class language feature

• DropWizard framework– Designed for web-services

• Swagger– Dynamically generated dev UI for web-services

Apache Kafka• Kafka: a scalable message/queue platform• See new Kafka Streams & Kafka Connect APIs• No back-pressure; can be a challenge• Non-obvious use:– For storage; time partitioning• Lots of benefits yet serious limitations

Docker• Easy to find/try/use

software– No installation– Simplified configuration

(env variables)– Common logging– Isolated

• Ideal for:– Continuous Int. servers– Trying new software– Production advantages

• But “new”

Docker in Production• I use “Kontena”• Common logging, machine/proc stats, security– VPN to secure network; access everything as local

• No longer need to care about:– Ansible, Chef, Puppet, etc.– Security at network or proxy; not service specific

• Challenges: state & big-data

Enrichment

Geo: Query Solr via spatial point query; attach related metadata to tweet

Kafka Topic Enrich Kafka

Topic

Twitter Sentiment Classifier

Geo: Solr with regional polygons & metadata

Solr for Geo Enrichment• Tweets (docs) can have a geo lat/lon• Enrich tweet with Country, State/Province, …– Gazetteer lookup (point-in-polygon)

Data Set Features Raw size Index time Index size

Admin2 46,311 824 MB 510 min 892 MB

US States 74,002 747 MB 4.9 min 840 MB

Massachusetts Census Blocks 154,621 152 MB 5.9 min 507 MB

Fast Point-in-Polygon TricksIndex/Config• Optimize to 1 segment• RptWithGeometry

SpatialField– precisionModel=

"floating_single"– autoIndex="true"

• <cache name="perSegSpatialFieldCache_WKT" …

Search• Embed Solr (in-process)• Use docValues, not stored

– fl=block:field(GEOID10)Query like this:• q={!field cache=false

f=WKT}Intersects(POINT($lon $lat))

Sub-Millisecond!

Lucene “LatLonPoint”• Uses new PointValues (BKD index) in Lucene 6• Fastest: http://home.apache.org/~mikemccand/geobench.html

• Presently in Lucene sandbox module• Some limitations: WGS84 points only• Credit to Rob Muir and Mike McCandless

Solr Adapter For LatLonPoint• New Solr FieldType for Lucene LatLonPoint– Filter points by circle, rect, polygon– Distance sort; but no boosting

Coming soon! Solr 6.4?

How-to: Heatmaps• On an RPT field geo="false" worldBounds= "ENVELOPE( -180, 180, 180, -180)" prefixTree="packedQuad"

• Query: /select?facet=true&facet.heatmap=geo_rpt&facet.heatmap.geom= ["-180 -90" TO "180 90”]&facet.heatmap.format= ints2D or png

// Normal Solr response..."facet_counts":{ ... // facet response fields "facet_heatmaps":{ "geo_rpt":[ "gridLevel",2, "columns",32, "rows",32, "minX",-180.0, "maxX",180.0, "minY",-90.0, "maxY",90.0, "counts_ints2D”,[null, null, [0, 1, ... ]]...

New HeatmapSpatialField• Why?– With new BKD/PointValues, no “RPT” field to use– Scalable for heatmaps; don’t worry about search• Scalable at all resolutions; many millions of docs/shard

– Can be specific about grid resolutions

Coming soon! Solr 6.4?

Heatmaps with Stats• Instead of counting docs; calculate a metric– Ex: avg(minuteOfDay)

• Will require JSON Facet API• Inherently slower than just doc counts

Coming soon! Solr 6.4?

Final Remarks• Open-Source– https://github.com/dsmiley/hhypermap-bop

• In-progress• Improvements to Solr expected to be available

before December; officially in Solr 6.4.

top related