greenbacker open analyticsdc
DESCRIPTION
Berico:TRANSCRIPT
Open Source Software for Geospatial Analytics on Unstructured Big DataCharlie Greenbacker, Principal Data Scientist
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 2
Background
About Me:
Data Scientist
Natural Language Processing
Unstructured Text Information
Berico Technologies:
Veteran-owned Small Business
Big Data Analytics in the Cloud
Defense & Intel Community
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 3
The Problem: geotagging unstructured text
Growing demand forgeospatial analytics
Most of human knowledge remains “trapped” in text
Existing solutions are expensive and don’t scale
Need an open source solution
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 4
The Solution: an open source geoparser
1. Data Ingestion
Input: unstructured text
2. Entity Extraction
Named entity recognition
Find location names in text
3. Entity Resolution
Match against a gazetteer
“The Springfield Problem”
4. Data Enrichment
Output: structured geo data
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 5
Data Ingestion: unstructured text
photo: Flickr user NS Newsflash
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 6
Entity Extraction: named entity recognition
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 7
Entity Resolution: match against a gazetteer
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 8
Data Enrichment: structured geo data
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 9
“The Springfield Problem”
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 10
Dealing with Ambiguity
Intelligent Context-based Heuristics
First: rank by population
Next: look for other locations mentioned in the same document
“Springfield” + “Chicago” = Illinois
“Springfield” + “Boston” = Massachusetts
Soon: calculate distance based on lat/lons
Resolve alternate names to same geospatial entity
“Ivory Coast” = “Cote d’Ivoire”
Use fuzzy matching to capture misspelled place names
Including both phonetic spelling & typographical errors
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 11
CLAVIN: an open source geoparser
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 12
System Architecture
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 13
Live Demonstration
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 14
Live Demonstration
What can I do with this data?
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 15
Map Visualizations
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 16
Hierarchical Geospatial Search
Virginia
ArlingtonReston
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 17
Geospatial Bounding Box Search
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 18
Geospatial Analytics on Unstructured Text
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 19
Performance Metrics & Features
Accurate: 0.75 F-measure
Fast: 100 locations per sec per cpu
Scalable: processes 1 million documentsin 1 hour on a 9-node Hadoop cluster
Smart: natural language processing, context-based heuristics, & fuzzy matching
Easy to use: simple Java-based API
Open source: Apache License
CLAVIN
“Cartographic
Location
And
Vicinity
INdexer
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 20
clavin.bericotechnologies.com
Charlie Greenbacker@greenbacker
meetup.com/DC-NLP
@DCNLP