greenbacker open analyticsdc

20
Open Source Software for Geospatial Analytics on Unstructured Big Data Charlie Greenbacker, Principal Data Scientist

Upload: open-analytics

Post on 26-Jan-2015

106 views

Category:

Technology


0 download

DESCRIPTION

Berico:

TRANSCRIPT

Page 1: Greenbacker open analyticsdc

Open Source Software for Geospatial Analytics on Unstructured Big DataCharlie Greenbacker, Principal Data Scientist

Page 2: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 2

Background

About Me:

Data Scientist

Natural Language Processing

Unstructured Text Information

Berico Technologies:

Veteran-owned Small Business

Big Data Analytics in the Cloud

Defense & Intel Community

Page 3: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 3

The Problem: geotagging unstructured text

Growing demand forgeospatial analytics

Most of human knowledge remains “trapped” in text

Existing solutions are expensive and don’t scale

Need an open source solution

Page 4: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 4

The Solution: an open source geoparser

1. Data Ingestion

Input: unstructured text

2. Entity Extraction

Named entity recognition

Find location names in text

3. Entity Resolution

Match against a gazetteer

“The Springfield Problem”

4. Data Enrichment

Output: structured geo data

Page 5: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 5

Data Ingestion: unstructured text

photo: Flickr user NS Newsflash

Page 6: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 6

Entity Extraction: named entity recognition

Page 7: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 7

Entity Resolution: match against a gazetteer

Page 8: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 8

Data Enrichment: structured geo data

Page 9: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 9

“The Springfield Problem”

Page 10: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 10

Dealing with Ambiguity

Intelligent Context-based Heuristics

First: rank by population

Next: look for other locations mentioned in the same document

“Springfield” + “Chicago” = Illinois

“Springfield” + “Boston” = Massachusetts

Soon: calculate distance based on lat/lons

Resolve alternate names to same geospatial entity

“Ivory Coast” = “Cote d’Ivoire”

Use fuzzy matching to capture misspelled place names

Including both phonetic spelling & typographical errors

Page 11: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 11

CLAVIN: an open source geoparser

Page 12: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 12

System Architecture

Page 13: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 13

Live Demonstration

Page 14: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 14

Live Demonstration

What can I do with this data?

Page 15: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 15

Map Visualizations

Page 16: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 16

Hierarchical Geospatial Search

Virginia

ArlingtonReston

Page 17: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 17

Geospatial Bounding Box Search

Page 18: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 18

Geospatial Analytics on Unstructured Text

Page 19: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 19

Performance Metrics & Features

Accurate: 0.75 F-measure

Fast: 100 locations per sec per cpu

Scalable: processes 1 million documentsin 1 hour on a 9-node Hadoop cluster

Smart: natural language processing, context-based heuristics, & fuzzy matching

Easy to use: simple Java-based API

Open source: Apache License

CLAVIN

“Cartographic

Location

And

Vicinity

INdexer

Page 20: Greenbacker open analyticsdc

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 20

clavin.bericotechnologies.com

Charlie Greenbacker@greenbacker

meetup.com/DC-NLP

@DCNLP