How IKANOW uses MongoDB to help organizations solve really big problems


Upload: ikanow

Post on 05-Dec-2014


Page 1: How IKANOW uses MongoDB to help organizations solve really big problems

The Open Source document analysis platform

Or, how IKANOW uses MongoDB to help organizations solve really big problems

Page 2: How IKANOW uses MongoDB to help organizations solve really big problems

Agenda

• What is Document Analysis?

• The Infinit.e Solution

– Infinit.e’s Architecture

– Why and How we use MongoDB

• Analyzing #MongoDC

• Questions

Page 3: How IKANOW uses MongoDB to help organizations solve really big problems

This is what Big Data Looks Like

Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/

Page 4: How IKANOW uses MongoDB to help organizations solve really big problems

What is Document Analysis?

"Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set."

Source: http://www.text-tech.com/docanalysis/definition.html

Page 5: How IKANOW uses MongoDB to help organizations solve really big problems

Document Analysis

• Common document source formats:

RSS JSON XML

HTML PDF TXT

RTF Word PPT

Multimedia Files RDBMS Records ETC.

Page 6: How IKANOW uses MongoDB to help organizations solve really big problems

Document Analysis

• The goal is to:

– Extract Entities (people, places, things)

– Create Associations between entities (in the form of noun-verb-noun), e.g.:

  • John Doe lives in Washington, D.C.

  • John Doe is married to Jane Doe

  • John Doe is a Virgo

  • John Doe traveled to Mexico on July 6th, 2011

• And…

Page 7: How IKANOW uses MongoDB to help organizations solve really big problems

Document Analysis

• Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization.

Who: people, organizations, facilities, company

What: events, summaries, facts, themes

When: past, present, future dates

Where: city, state, country, coordinate
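As a rough illustration of the two slides above, entities and noun-verb-noun associations could be held in a structure like the one below. This is a minimal Python sketch, not Infinit.e's actual schema: the class names, the field names and the type-to-dimension table are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical mapping from entity type to the Who/What/When/Where dimension.
DIMENSIONS = {
    "person": "Who",
    "organization": "Who",
    "event": "What",
    "date": "When",
    "city": "Where",
    "country": "Where",
}


@dataclass(frozen=True)
class Entity:
    name: str
    type: str

    @property
    def dimension(self) -> str:
        # Unknown types fall back to "What" (an arbitrary choice for this sketch).
        return DIMENSIONS.get(self.type.lower(), "What")


@dataclass(frozen=True)
class Association:
    """A noun-verb-noun statement linking two entities."""
    subject: Entity
    verb: str
    obj: Entity

    def sentence(self) -> str:
        return f"{self.subject.name} {self.verb} {self.obj.name}"


john = Entity("John Doe", "Person")
mexico = Entity("Mexico", "Country")
trip = Association(john, "traveled to", mexico)

print(trip.sentence())
print(john.dimension, mexico.dimension)
```

Keeping the dimension as a derived property rather than a stored field is just one possible design; the entity sample later in the deck stores it explicitly.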

Page 8: How IKANOW uses MongoDB to help organizations solve really big problems

The Infinit.e Solution

• Infinit.e is an Open Source document discovery and analysis platform that has these very cool Open Source tools lurking under the hood.

github.com/ikanow/Infinit.e

Page 9: How IKANOW uses MongoDB to help organizations solve really big problems

The Infinit.e Solution

Infinit.e is a scalable framework for Collecting, Storing, Enriching, Retrieving, Analyzing and Visualizing Structured and Unstructured Documents

Page 10: How IKANOW uses MongoDB to help organizations solve really big problems

IkanMeow

Page 11: How IKANOW uses MongoDB to help organizations solve really big problems

Document Collection

• Infinit.e harvests documents from:

– URLs

– File Shares

– Databases

Page 12: How IKANOW uses MongoDB to help organizations solve really big problems

Sample RSS Document

<rss version="2.0">
  <channel>
    …
    <item>
      <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
      <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
      <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most … </description>
      <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
      <dc:creator>unknown</dc:creator>
      <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
    </item>
    …
  </channel>
</rss>
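To make the harvesting step concrete, here is a minimal Python sketch of turning RSS items like the one above into flat records. It is not Infinit.e's harvester: the function name, the output field names and the shortened sample feed (including its example.com link) are assumptions for illustration. The only external fact relied on is the standard Dublin Core namespace URI for the `dc:` prefix.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace commonly bound to the "dc" prefix in RSS feeds.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Shortened, made-up feed modeled on the sample slide.
SAMPLE_RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Mediterranean conference seeks to flourish tourism</title>
      <link>http://example.com/press-release.html</link>
      <description>Report by egyptlastminute.com CAIRO: ...</description>
      <dc:publisher>Press Release Bureau</dc:publisher>
      <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
    </item>
  </channel>
</rss>"""


def rss_items_to_docs(rss_text):
    """Turn each <item> into a flat dict (field names are hypothetical)."""
    root = ET.fromstring(rss_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
            "publisher": item.findtext("dc:publisher", default="", namespaces=NS),
            "published": item.findtext("dc:date", default="", namespaces=NS),
        })
    return docs


docs = rss_items_to_docs(SAMPLE_RSS)
print(docs[0]["title"])
```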

Page 13: How IKANOW uses MongoDB to help organizations solve really big problems

Full Text Source

Page 14: How IKANOW uses MongoDB to help organizations solve really big problems

Source Ingestion Data Flow

Page 15: How IKANOW uses MongoDB to help organizations solve really big problems

Document DBs and Collections

Page 16: How IKANOW uses MongoDB to help organizations solve really big problems

Document Metadata

• doc_metadata.metadata

{
  "_id" : ObjectId("4f93638e0cf212156d0559d2"),
  "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
  "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
  "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...",
  "created" : ISODate("2012-04-22T01:49:02Z"),
  "metadata" : {…},
  "associations" : […],
  "entities" : […],
  ...
}

Page 17: How IKANOW uses MongoDB to help organizations solve really big problems

Harvested Document Metadata

• doc_metadata.metadata.metadata

"metadata" : {
  "location" : [
    {
      "region" : "South Asia",
      "citystateprovince" : {
        "stateprovince" : "Rolpa",
        "city" : "Newang"
      },
      "country" : "Nepal"
    }
  ],
  "icn" : [ "200573487" ],
  "incidentdate" : [ "07/25/2005" ],
  "organization" : [
    "Communist Party of Nepal (Maoist)/United People's Front"
  ],
  ...
},

Note: It is okay to laugh at this

Page 18: How IKANOW uses MongoDB to help organizations solve really big problems

Document Enrichment

• Infinit.e supports the extraction of entities and creation of associations using a combination of built-in enrichment libraries and third-party NLP APIs including:

Page 19: How IKANOW uses MongoDB to help organizations solve really big problems

Harvested Entities

• feature.entity

{
  "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
  "alias" : [
    "Zine el Abidine Ben Ali",
    "Zine El Abidine Ben Ali",
    "Zine el Abidine ben Ali"
  ],
  "batch_resync" : true,
  "communityId" : ObjectId("4f8f138103644ee8003bf518"),
  "db_sync_doccount" : NumberLong(143),
  "db_sync_time" : "1338751174988",
  "dimension" : "Who",
  "disambiguated_name" : "Zine El Abidine Ben Ali",
  "doccount" : 152,
  "index" : "zine el abidine ben ali/person",
  "totalfreq" : 353,
  "type" : "Person"
}
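In the sample above, the "index" value looks like the disambiguated name and the type, both lowercased and joined with a slash. The Python sketch below encodes that reading and uses the "alias" list to resolve spelling variants to one key. The rule is inferred from this single sample and may not match what Infinit.e really does; both function names are hypothetical.

```python
def entity_index(disambiguated_name, entity_type):
    """Build a 'name/type' key, both parts lowercased (inferred convention)."""
    return f"{disambiguated_name.strip().lower()}/{entity_type.strip().lower()}"


def find_by_alias(alias, entities):
    """Return the index of the first entity listing this exact alias, else None."""
    for entity in entities:
        if alias in entity["alias"]:
            return entity["index"]
    return None


# Trimmed copy of the entity shown on the slide.
ben_ali = {
    "alias": [
        "Zine el Abidine Ben Ali",
        "Zine El Abidine Ben Ali",
        "Zine el Abidine ben Ali",
    ],
    "disambiguated_name": "Zine El Abidine Ben Ali",
    "type": "Person",
}
ben_ali["index"] = entity_index(ben_ali["disambiguated_name"], ben_ali["type"])

print(ben_ali["index"])
print(find_by_alias("Zine el Abidine ben Ali", [ben_ali]))
```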

Page 20: How IKANOW uses MongoDB to help organizations solve really big problems

Harvested Entities

Page 21: How IKANOW uses MongoDB to help organizations solve really big problems

Harvested Associations

• feature.association

{
  "_id" : ObjectId("4f9189d48baf188282a1ca24"),
  "assoc_type" : "Fact",
  "communityId" : ObjectId("4f8f138103644ee8003bf518"),
  "db_sync_doccount" : NumberLong(70),
  "db_sync_time" : "1338491609281",
  "doccount" : NumberLong(73),
  "entity1" : [
    "zine el abidine ben ali",
    "zine el abidine ben ali/person"
  ],
  "entity1_index" : "zine el abidine ben ali/person",
  "entity2" : ["president", "president/position"],
  "entity2_index" : "president/position",
  "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
  "verb" : ["career", "current", "past"],
  "verb_category" : "career"
}

Page 22: How IKANOW uses MongoDB to help organizations solve really big problems

Harvested Associations

Page 23: How IKANOW uses MongoDB to help organizations solve really big problems

Geolocation of Entities/Events

• feature.geo

{
  "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
  "search_field" : "cairo",
  "country" : "Egypt",
  "country_code" : "EG",
  "city" : "cairo",
  "region" : "Al Qahirah",
  "region_code" : "EG11",
  "population" : 7734602,
  "latitude" : "30.05",
  "longitude" : "31.25",
  "geoindex" : {
    "lat" : 30.05,
    "lon" : 31.25
  }
}

Note: MongoDB 2d Index
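The sample stores latitude and longitude twice: as strings, and as numbers inside "geoindex", the field the 2d index note refers to. A minimal Python sketch of deriving the numeric sub-document from the string fields might look like this. The function name and the range checks are assumptions, not Infinit.e code; the key order simply mirrors the slide.

```python
def build_geoindex(record):
    """Convert string 'latitude'/'longitude' fields into a numeric sub-document."""
    lat = float(record["latitude"])
    lon = float(record["longitude"])
    # Reject values outside the usual geographic ranges.
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return {"lat": lat, "lon": lon}


cairo = {"search_field": "cairo", "latitude": "30.05", "longitude": "31.25"}
cairo["geoindex"] = build_geoindex(cairo)
print(cairo["geoindex"])
```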

Page 24: How IKANOW uses MongoDB to help organizations solve really big problems

Geolocation of Entities/Events

Page 25: How IKANOW uses MongoDB to help organizations solve really big problems

Who, What, Where and When

Page 26: How IKANOW uses MongoDB to help organizations solve really big problems

Why MongoDB? – Reason #1

Document-Oriented Storage

• MongoDB’s document-oriented storage (i.e. schema-less) is perfectly suited to the data design requirements of a system that needs to ingest a wide variety of structured and unstructured document formats and normalize them into one unified, semi-structured format

Page 27: How IKANOW uses MongoDB to help organizations solve really big problems

Why MongoDB? – Reason #2

JSON

• The standard language of open document analysis

– JSON is a common interchange format supported by tools like elasticsearch and SaaS NLP engines

– BSON (Binary JSON) is MongoDB’s native data format

– Infinit.e ingests and exports JSON natively via the REST-based API

Note: Infinit.e uses Google’s GSON Java library to convert JSON to POJOs and back

This is the JSON logo
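The JSON-to-object round trip mentioned in the note can be sketched in a few lines. The example below is a Python analog using dataclasses and the standard json module, not GSON and not Infinit.e's Java code; the class names and the trimmed-down geo record are assumptions chosen for the illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class GeoPoint:
    lat: float
    lon: float


@dataclass
class GeoFeature:
    search_field: str
    country: str
    geoindex: GeoPoint


def from_json(text):
    """Parse JSON text into typed objects (the 'JSON to POJO' direction)."""
    raw = json.loads(text)
    return GeoFeature(
        search_field=raw["search_field"],
        country=raw["country"],
        geoindex=GeoPoint(**raw["geoindex"]),
    )


def to_json(feature):
    """Serialize the object graph back to JSON text."""
    return json.dumps(asdict(feature), sort_keys=True)


text = '{"search_field": "cairo", "country": "Egypt", "geoindex": {"lat": 30.05, "lon": 31.25}}'
feature = from_json(text)
round_tripped = from_json(to_json(feature))
print(feature == round_tripped)
```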

Page 28: How IKANOW uses MongoDB to help organizations solve really big problems

Why MongoDB? – Reason #3

MongoDB Is Web Scale*

*Shards are the secret ingredients in the web scale sauce. They just work.

Page 29: How IKANOW uses MongoDB to help organizations solve really big problems

Why MongoDB? – Reason #3

Scalability

• Seriously, MongoDB Scales

– Harvesting and enriching documents requires a lot of disk space

– MongoDB scales to arbitrary sizes in both read/write dimensions

– Sophisticated sharding keys provide powerful/flexible balancing

BUT building an initial cluster can be complex and managing cluster changes is “fiddly”
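Why the choice of sharding key matters for balancing can be shown with a toy model. The Python sketch below places documents on shards by hashing a key value and taking it modulo the shard count. This is a deliberate simplification, not MongoDB's chunk-based balancing, and the documents, field names and shard count are made up. It only illustrates that a key with few distinct values cannot spread load across many shards.

```python
import hashlib
from collections import Counter


def shard_for(value, n_shards):
    """Toy placement rule: hash the shard-key value, then take it modulo n_shards."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


def distribution(docs, key, n_shards):
    """Count how many documents land on each shard for a given shard key."""
    return Counter(shard_for(doc[key], n_shards) for doc in docs)


# Made-up documents: 1,000 distinct URLs but only two distinct types.
docs = [
    {"url": f"http://example.com/doc/{i}", "type": "Person" if i % 2 else "Place"}
    for i in range(1000)
]

by_url = distribution(docs, "url", n_shards=4)
by_type = distribution(docs, "type", n_shards=4)

print("by url: ", sorted(by_url.items()))
print("by type:", sorted(by_type.items()))
```

With "type" as the key, at most two shards can ever receive data, however many shards the cluster has.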

Page 30: How IKANOW uses MongoDB to help organizations solve really big problems

Why MongoDB? – Reason #4

Integration with Apache Hadoop

• Hadoop is rapidly becoming the de facto standard for data analytics

– Open Source, very customizable

– Proven scalability

– Java libraries

• The MongoDB Hadoop Adaptor allows Hadoop to read from and write to MongoDB instead of HDFS

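The kind of analytics job this integration enables can be pictured as a map phase and a reduce phase over stored documents. The Python sketch below imitates those phases in memory to count how many documents mention each entity. It uses neither the Hadoop API nor the MongoDB Hadoop Adaptor, and the layout of the "entities" array is an assumption, since the slides elide it.

```python
from collections import defaultdict


def map_doc(doc):
    """Map phase: emit (entity index, 1) for each entity listed in a document."""
    for entity in doc.get("entities", []):
        yield entity["index"], 1


def reduce_counts(key, values):
    """Reduce phase: sum the values emitted for one key."""
    return key, sum(values)


def run_job(docs):
    """Group mapped pairs by key (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Made-up documents; the shape of each entity entry is hypothetical.
docs = [
    {"title": "a", "entities": [{"index": "cairo/city"},
                                {"index": "zine el abidine ben ali/person"}]},
    {"title": "b", "entities": [{"index": "cairo/city"}]},
    {"title": "c"},
]

counts = run_job(docs)
print(counts)
```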

Page 31: How IKANOW uses MongoDB to help organizations solve really big problems

Tweeting about MongoDC

• Source: http://search.twitter.com/search.rss?q=mongodc

– Who’s Tweeting?

– What are they Tweeting?

– What does basic document analysis of these Tweets tell us?

Page 32: How IKANOW uses MongoDB to help organizations solve really big problems

Who’s Tweeting about MongoDC?

Page 33: How IKANOW uses MongoDB to help organizations solve really big problems

How are Tweeters Connected?

Page 34: How IKANOW uses MongoDB to help organizations solve really big problems

What are they Tweeting About?

Page 35: How IKANOW uses MongoDB to help organizations solve really big problems

Sentiment?

Page 36: How IKANOW uses MongoDB to help organizations solve really big problems

Twitter has its Limits…