Weekend Business Analytics Praxis (Transcript)
NoSQL & MongoDB..Part III
Arindam Chatterjee
Aggregation in MongoDB
• Aggregations are operations that process data records and return computed
results.
• MongoDB provides a rich set of aggregation operations that examine and
perform calculations on the data sets.
• Running data aggregation on the mongod instance simplifies application code
and limits resource requirements.
• Like queries, aggregation operations in MongoDB use collections of
documents as an input and return results in the form of one or more
documents.
• In MongoDB aggregations are implemented using
– Aggregation Pipeline
– Map-Reduce
Aggregation Pipeline
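The pipeline figure for this slide is not reproduced in the transcript. As a stand-in, the following plain-JavaScript sketch mimics what a two-stage pipeline ($match followed by $group) computes over a made-up orders array; the equivalent mongo-shell call is shown in a comment. This illustrates the pipeline's semantics only, not how MongoDB executes it.

```javascript
// Plain-JavaScript sketch of a two-stage aggregation pipeline.
// The equivalent mongo-shell call (hypothetical "orders" collection) would be:
//   db.orders.aggregate([
//     { $match: { status: "A" } },
//     { $group: { _id: "$cust_id", total: { $sum: "$price" } } }
//   ])

const orders = [
  { cust_id: "abc123", status: "A", price: 25 },
  { cust_id: "abc123", status: "A", price: 30 },
  { cust_id: "xyz001", status: "B", price: 10 },
];

// $match stage: keep only the documents with status "A".
const matched = orders.filter(doc => doc.status === "A");

// $group stage: sum price per cust_id.
const totals = {};
for (const doc of matched) {
  totals[doc.cust_id] = (totals[doc.cust_id] || 0) + doc.price;
}

console.log(totals); // { abc123: 55 }
```

Each stage consumes the documents produced by the previous stage, which is why the shell API takes an ordered array of stage documents.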
Map Reduce
• MongoDB applies the map phase to each input document (i.e. the documents
in the collection that match the query condition).
• The map function emits key-value pairs.
• For those keys that have multiple values, MongoDB applies the reduce
phase, which collects and condenses the aggregated data.
• MongoDB then stores the results in a collection.
• MongoDB supports sharded collections both as input and output.
Map Reduce: Illustration
[Figure not reproduced in the transcript]
Map Reduce..more example
• Insert data in collection "orders" as follows
  db.orders.insert({
      _id: ObjectId("50a8240b927d5d8b5891743c"),
      cust_id: "abc123",
      ord_date: new Date("Oct 04, 2012"),
      status: 'A',
      price: 25,
      items: [ { sku: "mmm", qty: 5, price: 2.5 },
               { sku: "nnn", qty: 5, price: 2.5 } ]
  });
• Task: Find the total price per customer
• Step I: Define a map function that emits a ("cust_id", "price") pair
  var mapFunction1 = function() {
      emit(this.cust_id, this.price);
  };
Map Reduce..more example..2
• Step II: Define a reduce function with two arguments, keyCustId and valuesPrices
  – valuesPrices is an array whose elements are the price values emitted by the map function, grouped by keyCustId.
  – The function reduces the valuesPrices array to the sum of its elements.
  var reduceFunction1 = function(keyCustId, valuesPrices) {
      return Array.sum(valuesPrices);
  };
• Step III: Perform the map-reduce on all documents in the orders collection, using mapFunction1 as the map function and reduceFunction1 as the reduce function
  db.orders.mapReduce(
      mapFunction1,
      reduceFunction1,
      { out: "map_reduce_example" }
  )
• Step IV: Do a find() to check the new collection "map_reduce_example"
  db.map_reduce_example.find();
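The functions above can be exercised outside MongoDB with a small driver that mimics how mapReduce groups emitted values and reduces keys holding multiple values. Everything here is illustrative: Array.sum is a mongo-shell helper (defined by hand below), and the orders array is made-up sample data with a second customer added so the grouping is visible.

```javascript
// Mimic db.orders.mapReduce(mapFunction1, reduceFunction1, ...) in plain JS.
// Array.sum is a mongo-shell helper, so define it here.
Array.sum = arr => arr.reduce((s, v) => s + v, 0);

var mapFunction1 = function () {
  emit(this.cust_id, this.price); // emit resolves to the global set below
};

var reduceFunction1 = function (keyCustId, valuesPrices) {
  return Array.sum(valuesPrices);
};

// Hypothetical sample orders.
const orders = [
  { cust_id: "abc123", price: 25 },
  { cust_id: "abc123", price: 30 },
  { cust_id: "xyz001", price: 10 },
];

function mapReduce(docs, mapFn, reduceFn) {
  const grouped = {};
  // Map phase: call mapFn with each document as `this`, collecting emits.
  globalThis.emit = (key, value) => {
    (grouped[key] = grouped[key] || []).push(value);
  };
  docs.forEach(doc => mapFn.call(doc));
  delete globalThis.emit;

  // Reduce phase: only keys holding multiple values go through reduceFn,
  // matching MongoDB's behavior.
  const out = {};
  for (const key of Object.keys(grouped)) {
    out[key] =
      grouped[key].length > 1 ? reduceFn(key, grouped[key]) : grouped[key][0];
  }
  return out;
}

console.log(mapReduce(orders, mapFunction1, reduceFunction1));
// { abc123: 55, xyz001: 10 }
```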
Full Text Search in MongoDB
• Important Concepts
– Stop Words: words that are filtered out because they are irrelevant for searching, e.g. is, at, the, am, I, your.
– Stemming: the process of reducing words to their root (base) form, e.g. "waiting", "waited", and "waits" all share the root "wait".
• Example: I am your father, Luke
– "I", "am", "your" are stop words
– After removing the stop words, the words left are "father" and "Luke"
– These are processed in the next step
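The two concepts can be sketched in a few lines of JavaScript. The stop-word list and the suffix-stripping rule below are tiny illustrative stand-ins, not MongoDB's actual (Snowball-based) text processing.

```javascript
// Toy stop-word filtering and stemming, applied to the slide's example.
const STOP_WORDS = new Set(["i", "am", "your", "is", "at", "the"]);

// Naive suffix stripping: "waiting", "waited", "waits" -> "wait".
const stem = word => word.replace(/(ing|ed|s)$/, "");

function analyze(text) {
  return text
    .toLowerCase()
    .split(/[^a-z]+/)                // tokenize on non-letters
    .filter(Boolean)                 // drop empty tokens
    .filter(w => !STOP_WORDS.has(w)) // remove stop words
    .map(stem);                      // reduce to root form
}

console.log(analyze("I am your father, Luke")); // [ 'father', 'luke' ]
```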
Text Search process in MongoDB
• Tokenizes and stems the search term(s) during both the index creation and the text command execution.
• Assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.
• By default, the text command returns at most the top 100 matching documents as determined by the scores.
Full Text Search in MongoDB..Example
• While starting the MongoDB server, use the following parameters
– mongod --setParameter textSearchEnabled=true
• Create a text Index on Collection “txt”
– db.txt.ensureIndex( {txt: "text"} )
• To show up the text index use the following
– db.txt.getIndices()
• Insert data in collection “txt”
– db.txt.insert( {txt: "I am your father, Luke"} )
• Stop word filtering has already happened. The following command shows
only 2 keys in the index txt.txt.$txt_text
– db.txt.validate()
• Perform a Full Text Search using the following
– db.txt.runCommand( "text", { search : "father" } )
Text Analytics
What is Text Analytics
• Process of identifying meaningful information from unstructured content
Social Media Analytics: Facebook, Twitter
• What do people feel about the latest movie?
• What is the response to the last ad campaign?
• What are people feeling about the new brand of product?
• What is our competitor doing in the market?
• What is the sentiment of people in the organization?
Text Analytics..2
Email Analytics
• Customer Support
• Regulatory Compliance
Log Analytics
• IT Server Log
Text Analytics..3
Fraud Detection Analytics
• Insurance Claims
• Credit Card Transactions
• Tax Return Claims
Text Analytics: Scenarios
• Obtain reviews from various blogs, review sites about a new movie
• Highlight important viewers' comments on the movie
In the process, the Text Analytics engine performs the following:
• Understands human language
• Distinguishes positive from negative comments
• Identifies sarcasm, criticism, and puns
• Tries to interpret like a human being
Sentiment Analysis of the movie
[IMDb listing] Krrish 3 (2013, Hindi, U), 152 min, Action, November 2013 (India). Rating: 6.5/10 from 6,762 users. Reviews: 135 user | 26 critic. Plot: Krrish and his scientist father have to save the world and their own family from an evil man named Kaal and his team of human-animal mutants led by the ruthless Kaya. Director: Rakesh Roshan. Writers: Robin Bhatt (screenplay), Honey Irani (screenplay). Stars: Priyanka Chopra, Hrithik Roshan, Amitabh Bachchan.
Sample user review titles:
• "Wish I were 12 again" (shahin mahmud, 1 November 2013)
• "Plagiarism..Plagiarism... Everywhere" (venugopal19196 from Guntur, 2 November 2013)
• "Krrish ek soch hain jo hum tak nahi pahunch paye" (Hindi: "Krrish is an idea that never reached us") (darkshadowsxtreme from India, 4 November 2013)
• "Far below expectations" (Arpan Mallik from India, 3 November 2013)
• "Krrish 3: No more than a mere rubbish.." (amruthvvkp from India, 3 November 2013)
Text Analytics: Information Extraction
• Distill structured data from unstructured and semi-structured text
• Exploit the extracted data in your applications
Noun               Adjective   Comment
Krrish 3           good        "Krrish ek soch hain jo hum tak nahi pahunch paye"
Rakesh Roshan      worst       "rubbish"
Priyanka Chopra    more        "plagiarism"
Hrithik Roshan     below
Amitabh Bachchan
Robin Bhatt
Honey Irani
[Diagram: unstructured content, together with extraction logic, flows through the Text Extraction Engine to produce structured content]
Text Analytics: Information Extraction..2
Pattern Recognition
• Phone numbers
• Date formats
• Email addresses
• URLs
Entities and Relations
• Person
• Location
• Organization
• Association between entities
Others
• Topic identification
• Sentiment / Opinion
• Classification
• Ontology
Linguistic Annotation
• Tokenization
• Parts of Speech
• Normalization
• Co-reference resolution
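Pattern recognition of the first kind can be sketched with regular expressions. The patterns below are deliberately simplified illustrations (a hypothetical US-style phone format, a loose email shape, an ISO-style date), not production-grade validators.

```javascript
// Toy regex-based pattern recognition for phone numbers, emails, and dates.
const PATTERNS = {
  phone: /\b\d{3}[- ]\d{3}[- ]\d{4}\b/g, // e.g. 555-123-4567
  email: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, // loose, illustrative shape
  date: /\b\d{4}-\d{2}-\d{2}\b/g,        // ISO-style yyyy-mm-dd
};

function extract(text) {
  const found = {};
  for (const [name, re] of Object.entries(PATTERNS)) {
    found[name] = text.match(re) || []; // all non-overlapping matches
  }
  return found;
}

const sample = "Call 555-123-4567 or mail support@example.com by 2013-11-04.";
console.log(extract(sample));
// { phone: [ '555-123-4567' ],
//   email: [ 'support@example.com' ],
//   date: [ '2013-11-04' ] }
```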
Text Analytics Terminology
• RegEx: Regular expression to recognize patterns of text, e.g. Phone number
• Dictionaries: A list of entries containing domain specific terms. Example:
dictionary of city names, dictionary of IT companies
• Text Extraction Script: A script that uses dictionaries and regex on a set of
text documents and performs extraction of text. Example: GATE Extractor
program
• Annotation: A labeled span of text matching a particular criterion. Example: a person name
• Precision: Measure of the exactness or accuracy of a pattern recognition program
• Recall: Measure of its completeness
The higher the precision and recall, the better the program.
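As a worked example with hypothetical numbers: suppose an extractor returns 8 annotations, 6 of which are correct, while the text actually contains 10 true entities.

```javascript
// Precision = true positives / all returned results (exactness).
// Recall    = true positives / all relevant items  (completeness).
const precision = (truePositives, returned) => truePositives / returned;
const recall = (truePositives, relevant) => truePositives / relevant;

console.log(precision(6, 8)); // 0.75 -> 6 of the 8 returned annotations are correct
console.log(recall(6, 10));   // 0.6  -> 6 of the 10 true entities were found
```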
Text Analytics Approaches
• Grammar based
– Input text viewed as a sequence of tokens
– Rules expressed as regular expression patterns over these tokens
• Algebra based
– Extract SPANs matching a dictionary or regex
– Create an operator for each basic operation
– Compose operators to build complex extractors
MongoDB as Analytics Platform
• The flexibility of MongoDB makes it well suited to storing analytics data.
• Customers build different types of analytics engines on the MongoDB platform, such as
– usage metrics,
– business domain specific metrics,
– financial platforms.
• The most common type of metric that clients start tracking is events (e.g. "how many people walked into my stores" or "how many people opened an iPhone application").
• The queries that answer these questions must remain efficient in a distributed environment.
MongoDB as Analytics Platform…2
• Example: Insert data as follows
  db.events.insert({
      store_id: ObjectId(), // Object id of a store
      event: "door open",   // will be one of "door open", "sale made", or "phone call"
      created_at: new Date("2013-01-29T08:43:00Z")
  })
• To query by store_id and created_at, run the following
  db.events.find({store_id: ObjectId("aaa"),
      created_at: {$gte: new Date("2013-01-29T00:00:00Z"),
                   $lte: new Date("2013-01-30T00:00:00Z")}})
• The above query runs fast in a local environment but is painfully slow in a distributed environment with a large database.
• Multiple compound indexes are created to increase speed
  db.events.ensureIndex({store_id: 1, created_at: 1})
  db.events.ensureIndex({event: 1, created_at: 1})
  db.events.ensureIndex({store_id: 1, event: 1, created_at: 1})
MongoDB as Analytics Platform…3
• Achieving Optimization
– Each of the indexes should fit into RAM.
– Any new document will have a seemingly random store_id, so an insert has a high probability of landing in the middle of an index.
– To minimize RAM usage, it is best to insert sequentially, termed "writing to the right side of the index": any new key is greater than or equal to the previous index key.
MongoDB as Analytics Platform…4
• Achieving Optimization using a "time bucket"
– Create a time_bucket attribute that breaks down acceptable date ranges into hour, day, week, month, quarter, and/or year.
  {
      store_id: ObjectId(), // Object id of a store
      event: "door open",
      created_at: new Date("2013-01-29T08:43:00Z"),
      time_bucket: [
          "2013-01-29 08-hour", "2013-01-29-day", "2013-04-week",
          "2013-01-month", "2013-01-quarter", "2013-year" ]
  }
– Create the following indexes
  db.events.ensureIndex({time_bucket: 1, store_id: 1, event: 1})
  db.events.ensureIndex({time_bucket: 1, event: 1})
– Instead of running the query over the entire date range, run the following
  db.events.find({store_id: ObjectId("aaa"), "time_bucket": "2013-01-29-day"})
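A helper to build the time_bucket array from created_at might look like the sketch below. The bucket string formats follow the slide; the week and quarter arithmetic is a simple illustration (a real deployment would pick a convention such as ISO week numbering).

```javascript
// Build the slide's time_bucket array from a created_at Date (UTC).
function timeBuckets(date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  const d = String(date.getUTCDate()).padStart(2, "0");
  const h = String(date.getUTCHours()).padStart(2, "0");
  // Week number: days elapsed since Jan 1, divided into 7-day blocks.
  const dayDiff =
    (Date.UTC(y, date.getUTCMonth(), date.getUTCDate()) - Date.UTC(y, 0, 1)) /
    86400000;
  const w = String(Math.ceil(dayDiff / 7) || 1).padStart(2, "0");
  // Quarter, labeled by its starting month (01, 04, 07, 10).
  const q = String(Math.floor(date.getUTCMonth() / 3) * 3 + 1).padStart(2, "0");
  return [
    `${y}-${m}-${d} ${h}-hour`,
    `${y}-${m}-${d}-day`,
    `${y}-${w}-week`,
    `${y}-${m}-month`,
    `${y}-${q}-quarter`,
    `${y}-year`,
  ];
}

console.log(timeBuckets(new Date("2013-01-29T08:43:00Z")));
// [ '2013-01-29 08-hour', '2013-01-29-day', '2013-04-week',
//   '2013-01-month', '2013-01-quarter', '2013-year' ]
```

Inserting these precomputed strings lets the day query hit the {time_bucket: 1, ...} indexes directly instead of scanning a created_at range.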
MongoDB as Analytics Platform…5
• Benefit of the "time bucket"
– Using the optimized time_bucket, new documents are added to the right side of the index.
– Any inserted document has a greater time_bucket value than the previous documents.
– By adding to the right side of the index and querying by time_bucket, MongoDB can swap rarely accessed older documents to disk, resulting in minimal RAM usage.
– The "hot data" is the most recently accessed (typically the last 1-3 months for most analytics applications), and the older data settles nicely to disk.
– Neither queries nor inserts access the middle of the index, so older index chunks can swap to disk.
Thank You