How IKANOW uses MongoDB to help organizations solve really big problems

Post on 05-Dec-2014


The Open Source document analysis platform

Or, how IKANOW uses MongoDB to help organizations solve really big problems

Agenda

• What is Document Analysis?
• The Infinit.e Solution
– Infinit.e’s Architecture
– Why and How we use MongoDB
• Analyzing #MongoDC
• Questions

This is what Big Data Looks Like

Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/

What is Document Analysis?

"Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set."

Source: http://www.text-tech.com/docanalysis/definition.html

Document Analysis

• Common document source formats:

RSS JSON XML

HTML PDF TXT

RTF Word PPT

Multimedia Files RDBMS Records ETC.

Document Analysis

• The goal is to:
– Extract Entities (people, places, things)
– Create Associations between entities (in the form of noun-verb-noun), e.g.:
    • John Doe lives in Washington, D.C.
    • John Doe is married to Jane Doe
    • John Doe is a Virgo
    • John Doe traveled to Mexico on July 6th, 2011
• And…

Document Analysis

• Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization.

Who: people, organizations, facilities, companies

What: events, summaries, facts, themes

When: past, present, future dates

Where: city, state, country, coordinate
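The entity and association goals above can be sketched in a few lines of Python. This is only an illustration and not Infinit.e's actual data model: the `Entity` and `Association` classes and the `group_by_dimension` helper are hypothetical names, and the entity types are assumed for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    type: str        # e.g. "Person", "City" (assumed type labels)
    dimension: str   # one of "Who", "What", "When", "Where"


@dataclass(frozen=True)
class Association:
    # A noun-verb-noun statement linking two entities.
    subject: Entity
    verb: str
    obj: Entity


def group_by_dimension(associations):
    """Collect the distinct entity names seen under each dimension."""
    grouped = defaultdict(set)
    for assoc in associations:
        for entity in (assoc.subject, assoc.obj):
            grouped[entity.dimension].add(entity.name)
    return {dim: sorted(names) for dim, names in grouped.items()}


john = Entity("John Doe", "Person", "Who")
associations = [
    Association(john, "lives in", Entity("Washington, D.C.", "City", "Where")),
    Association(john, "is married to", Entity("Jane Doe", "Person", "Who")),
    Association(john, "traveled to", Entity("Mexico", "Country", "Where")),
]

dimensions = group_by_dimension(associations)
print(dimensions)
```

Grouping by dimension is what lets one structure feed both a "who is involved" view and a "where did it happen" view.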

• Infinit.e is an Open Source document discovery and analysis platform that has these very cool Open Source tools lurking under the hood.

The Infinit.e Solution

github.com/ikanow/Infinit.e

The Infinit.e Solution

Infinit.e is a scalable framework for Collecting, Storing, Enriching, Retrieving, Analyzing, and Visualizing Structured and Unstructured Documents

IkanMeow

Document Collection

• Infinit.e harvests documents from:

– URLs

– File Shares

– Databases

Sample RSS Document

<rss version="2.0">
<channel>
…
<item>
    <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
    <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
    <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most … </description>
    <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
    <dc:creator>unknown</dc:creator>
    <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
</item>
…
</channel>
</rss>

Full Text Source
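As a rough illustration of the first harvesting step, the sketch below parses an RSS feed into flat dictionaries using only Python's standard library. It is not Infinit.e's harvester: the feed text is a shortened, hypothetical stand-in for the sample above (with an example.com link and a `dc` namespace declaration added so it parses on its own), and `rss_items_to_docs` is an invented helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened feed modelled on the sample RSS document.
RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Mediterranean conference seeks to flourish tourism</title>
      <link>http://example.com/article.html</link>
      <description>Report by egyptlastminute.com CAIRO: ...</description>
      <dc:creator>unknown</dc:creator>
      <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
    </item>
  </channel>
</rss>"""

# Prefix-to-URI map so the Dublin Core elements can be addressed by prefix.
DC = {"dc": "http://purl.org/dc/elements/1.1/"}


def rss_items_to_docs(rss_text):
    """Turn each <item> into a flat dict with title, url, description, creator."""
    root = ET.fromstring(rss_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": item.findtext("title", default="").strip(),
            "url": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
            "creator": item.findtext("dc:creator", default="", namespaces=DC),
        })
    return docs


docs = rss_items_to_docs(RSS)
print(docs[0]["title"])
```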

Source Ingestion Data Flow

Document DBs and Collections

Document Metadata

• doc_metadata.metadata

{
    "_id" : ObjectId("4f93638e0cf212156d0559d2"),
    "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
    "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
    "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...",
    "created" : ISODate("2012-04-22T01:49:02Z"),
    "metadata" : {…},
    "associations" : […],
    "entities" : […],
    ...
}

Harvested Document Metadata

• doc_metadata.metadata.metadata

"metadata" : {
    "location" : [
        {
            "region" : "South Asia",
            "citystateprovince" : {
                "stateprovince" : "Rolpa",
                "city" : "Newang"
            },
            "country" : "Nepal"
        }
    ],
    "icn" : [ "200573487" ],
    "incidentdate" : [ "07/25/2005" ],
    "organization" : [ "Communist Party of Nepal (Maoist)/United People's Front" ],
    ...
},

Note: It is okay to laugh at this

Document Enrichment

• Infinit.e supports the extraction of entities and creation of associations using a combination of built-in enrichment libraries and third-party NLP APIs including:

Harvested Entities

• feature.entity

{
    "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
    "alias" : [
        "Zine el Abidine Ben Ali",
        "Zine El Abidine Ben Ali",
        "Zine el Abidine ben Ali"
    ],
    "batch_resync" : true,
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(143),
    "db_sync_time" : "1338751174988",
    "dimension" : "Who",
    "disambiguated_name" : "Zine El Abidine Ben Ali",
    "doccount" : 152,
    "index" : "zine el abidine ben ali/person",
    "totalfreq" : 353,
    "type" : "Person"
}
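One way to picture how per-document mentions might roll up into a record like the one above is the sketch below. It is an assumption-laden illustration, not Infinit.e's code: the helper names are hypothetical, and the reading of `doccount` as "documents mentioning the entity" and `totalfreq` as "total mentions" is inferred from the field names only.

```python
def entity_index(name, entity_type):
    """Build a key in the style of "zine el abidine ben ali/person"."""
    return f"{name.lower()}/{entity_type.lower()}"


def aggregate_entities(mention_docs):
    """Roll per-document mentions up into one record per entity.

    Each document is a list of tuples:
    (surface_form, disambiguated_name, entity_type, frequency).
    """
    features = {}
    for mentions in mention_docs:
        seen_in_doc = set()
        for alias, name, entity_type, freq in mentions:
            key = entity_index(name, entity_type)
            record = features.setdefault(key, {
                "index": key,
                "disambiguated_name": name,
                "type": entity_type,
                "alias": set(),
                "doccount": 0,
                "totalfreq": 0,
            })
            record["alias"].add(alias)
            record["totalfreq"] += freq
            # Count each document at most once per entity.
            if key not in seen_in_doc:
                record["doccount"] += 1
                seen_in_doc.add(key)
    return features


# Hypothetical mentions from two documents.
mention_docs = [
    [("Zine el Abidine Ben Ali", "Zine El Abidine Ben Ali", "Person", 2),
     ("Zine El Abidine Ben Ali", "Zine El Abidine Ben Ali", "Person", 1)],
    [("Zine el Abidine ben Ali", "Zine El Abidine Ben Ali", "Person", 3)],
]
features = aggregate_entities(mention_docs)
ben_ali = features["zine el abidine ben ali/person"]
print(ben_ali["doccount"], ben_ali["totalfreq"])
```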

Harvested Entities

Harvested Associations

• feature.association

{
    "_id" : ObjectId("4f9189d48baf188282a1ca24"),
    "assoc_type" : "Fact",
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(70),
    "db_sync_time" : "1338491609281",
    "doccount" : NumberLong(73),
    "entity1" : [
        "zine el abidine ben ali",
        "zine el abidine ben ali/person"
    ],
    "entity1_index" : "zine el abidine ben ali/person",
    "entity2" : [ "president", "president/position" ],
    "entity2_index" : "president/position",
    "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
    "verb" : [ "career", "current", "past" ],
    "verb_category" : "career"
}

Harvested Associations

Geolocation of Entities/Events

• feature.geo

{
    "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
    "search_field" : "cairo",
    "country" : "Egypt",
    "country_code" : "EG",
    "city" : "cairo",
    "region" : "Al Qahirah",
    "region_code" : "EG11",
    "population" : 7734602,
    "latitude" : "30.05",
    "longitude" : "31.25",
    "geoindex" : {
        "lat" : 30.05,
        "lon" : 31.25
    }
}

Note: MongoDB 2d Index
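To make the 2d index note concrete, here is a small sketch that builds the index specification and a proximity filter as plain Python structures, so it runs without a database. The `near_query` helper and the `GEO_INDEX_SPEC` name are hypothetical; `$near` and `$maxDistance` are MongoDB query operators, but the exact queries Infinit.e issues are not shown in the slides. The coordinate order here simply mirrors the sample record (`lat`, then `lon`); MongoDB's documentation recommends storing longitude first, and the units of `$maxDistance` depend on the operator and coordinate system, so check the reference before relying on either.

```python
# Index specification in the (field, type) form accepted by PyMongo's
# Collection.create_index; "2d" is MongoDB's legacy flat geospatial index.
GEO_INDEX_SPEC = [("geoindex", "2d")]


def near_query(lat, lon, max_distance=None):
    """Build a legacy-coordinate $near filter on the geoindex field.

    Coordinates are passed in the same order as the stored sub-document
    in the sample record (lat, then lon). This ordering is an assumption
    carried over from that record, not a MongoDB requirement.
    """
    query = {"geoindex": {"$near": [lat, lon]}}
    if max_distance is not None:
        query["geoindex"]["$maxDistance"] = max_distance
    return query


cairo_area = near_query(30.05, 31.25, max_distance=0.5)
print(cairo_area)

# With a live server and PyMongo this could be used roughly as:
#   collection.create_index(GEO_INDEX_SPEC)
#   collection.find(cairo_area)
```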

Geolocation of Entities/Events

Who, What, Where and When

Why MongoDB? – Reason #1

Document-Oriented Storage

• MongoDB’s document-oriented storage (i.e. schema-less) is perfectly suited to the data design requirements of a system that needs to ingest a wide variety of structured and unstructured document formats and normalize them into one unified, semi-structured format
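A minimal sketch of what "normalize into one unified, semi-structured format" could look like follows. The field mappings, source type names, and the `normalize` helper are all hypothetical; the only thing taken from the slides is the idea of a few common top-level fields plus a free-form `metadata` sub-document whose values are lists.

```python
def normalize(record, source_type):
    """Map a source-specific record onto one shared document shape.

    Known fields are promoted to the top level; everything else is kept
    under "metadata" so no source-specific detail is lost.
    """
    # Hypothetical per-source field mappings.
    field_maps = {
        "rss": {"title": "title", "link": "url", "description": "description"},
        "db": {"name": "title", "uri": "url", "summary": "description"},
    }
    mapping = field_maps[source_type]
    doc = {"title": None, "url": None, "description": None, "metadata": {}}
    for key, value in record.items():
        if key in mapping:
            doc[mapping[key]] = value
        else:
            # Wrap scalars in a list, matching the list-valued metadata sample.
            doc["metadata"][key] = value if isinstance(value, list) else [value]
    return doc


rss_doc = normalize(
    {"title": "T", "link": "http://example.com/a", "publisher": "Press Release Bureau"},
    "rss",
)
db_doc = normalize(
    {"name": "Incident report", "icn": "200573487", "incidentdate": "07/25/2005"},
    "db",
)
print(rss_doc)
print(db_doc)
```

Because the store does not enforce a schema, the two documents can sit in the same collection even though their `metadata` keys differ.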

Why MongoDB? – Reason #2

JSON

• The standard language of open document analysis
– JSON is a common interchange format supported by tools like elasticsearch and SaaS NLP engines
– BSON (Binary JSON) is MongoDB’s native data format
– Infinit.e ingests and exports JSON natively via the REST-based API

Note: Infinit.e uses Google’s GSON Java library to convert JSON to POJOs and back
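The JSON-to-object round trip mentioned in the note can be illustrated with a Python analogue. Infinit.e itself does this in Java with GSON; the sketch below uses Python's standard `json` and `dataclasses` modules instead, and the `GeoFeature` class (with a subset of the geo record's fields) is a hypothetical stand-in for a POJO.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class GeoFeature:
    # A typed view of a few fields from the geo record (illustrative subset).
    search_field: str
    country: str
    country_code: str
    population: int


def from_json(text):
    """Deserialize a JSON object into a typed GeoFeature."""
    return GeoFeature(**json.loads(text))


def to_json(feature):
    """Serialize a GeoFeature back to a JSON string."""
    return json.dumps(asdict(feature), sort_keys=True)


raw = ('{"search_field": "cairo", "country": "Egypt", '
       '"country_code": "EG", "population": 7734602}')
feature = from_json(raw)
round_tripped = json.loads(to_json(feature))
print(feature.country, round_tripped["population"])
```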

This is the JSON logo

Why MongoDB? – Reason #3

MongoDB Is Web Scale*

*Shards are the secret ingredients in the web scale sauce. They just work.

Why MongoDB? – Reason #3

Scalability

• Seriously, MongoDB Scales

– Harvesting and enriching documents requires a lot of disk space

– MongoDB scales to arbitrary sizes in both read/write dimensions

– Sophisticated sharding keys provide powerful/flexible balancing

BUT building an initial cluster can be complex and managing cluster changes is “fiddly”

Why MongoDB? – Reason #4

Integration with Apache Hadoop

• Hadoop is rapidly becoming the de facto standard for data analytics
– Open Source, very customizable
– Proven scalability
– Java libraries
• The MongoDB Hadoop Adaptor allows Hadoop to read from and write to MongoDB instead of HDFS
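To show the shape of an analytics job that such a setup might run, here is a map/reduce sketch in plain Python. It does not use Hadoop or the MongoDB Hadoop Adaptor API; it only mimics the map and reduce steps in memory. The input documents are invented, and the assumption that each document carries an `entities` list whose items have an `index` field is a guess based on the samples earlier in the deck.

```python
from collections import Counter
from functools import reduce


def map_entities(doc):
    """Map step: count each entity index appearing in one document."""
    return Counter(entity["index"] for entity in doc.get("entities", []))


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


# Invented input documents, for illustration only.
docs = [
    {"entities": [{"index": "cairo/city"}, {"index": "egypt/country"}]},
    {"entities": [{"index": "egypt/country"}, {"index": "tunisia/country"}]},
    {"entities": []},
]

totals = reduce(reduce_counts, map(map_entities, docs), Counter())
print(totals.most_common(1))
```

In a real cluster the map calls would run in parallel across input splits and the reduce step would merge their partial results.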


Tweeting about MongoDC

• Source: http://search.twitter.com/search.rss?q=mongodc
– Who’s Tweeting?
– What are they Tweeting?
– What does basic document analysis of these Tweets tell us?
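The three questions above can be approximated with very basic text processing. The sketch below is not how Infinit.e analyzed the feed: the tweets are invented sample data, and `analyze` is a hypothetical helper that counts authors and hashtags and records who mentions whom.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")


def analyze(tweets):
    """Count authors and hashtags, and collect author -> mention edges."""
    authors, hashtags, edges = Counter(), Counter(), Counter()
    for author, text in tweets:
        authors[author] += 1
        # Lowercase hashtags so #MongoDC and #mongodc count together.
        hashtags.update(tag.lower() for tag in HASHTAG.findall(text))
        for mentioned in MENTION.findall(text):
            edges[(author, mentioned)] += 1
    return authors, hashtags, edges


# Invented sample data, for illustration only.
tweets = [
    ("alice", "Great sharding talk at #MongoDC with @bob"),
    ("bob", "Thanks @alice! #mongodc #mongodb"),
    ("alice", "Slides are up #MongoDC"),
]
authors, hashtags, edges = analyze(tweets)
print(authors.most_common(1), hashtags.most_common(1), sorted(edges))
```

The author counts answer "who", the hashtag counts hint at "what", and the mention edges form the graph behind "how are they connected".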

Who’s Tweeting about MongoDC?

How are Tweeters Connected?

What are they Tweeting About?

Sentiment?

Twitter has its Limits…

Thank You!

Craig Vitter

www.ikanow.com

cvitter@ikanow.com
