Weekend Business Analytics Praxis (Transcript)
NoSQL & MongoDB..Part III
Arindam Chatterjee
Aggregation in MongoDB
• Aggregations are operations that process data records and return computed
results.
• MongoDB provides a rich set of aggregation operations that examine and
perform calculations on the data sets.
• Running data aggregation on the mongod instance simplifies application code
and limits resource requirements.
• Like queries, aggregation operations in MongoDB use collections of
documents as an input and return results in the form of one or more
documents.
• In MongoDB aggregations are implemented using
– Aggregation Pipeline
– Map-Reduce
Aggregation Pipeline
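The pipeline figure for this slide is not reproduced in the transcript. As a stand-in, the following plain-JavaScript sketch mimics what a two-stage pipeline ($match followed by $group) computes over a made-up orders array; the equivalent mongo-shell call is shown in a comment. This illustrates the pipeline's semantics only, not how MongoDB executes it.

```javascript
// Plain-JavaScript sketch of a two-stage aggregation pipeline.
// The equivalent mongo-shell call (hypothetical "orders" collection) would be:
//   db.orders.aggregate([
//     { $match: { status: "A" } },
//     { $group: { _id: "$cust_id", total: { $sum: "$price" } } }
//   ])

const orders = [
  { cust_id: "abc123", status: "A", price: 25 },
  { cust_id: "abc123", status: "A", price: 30 },
  { cust_id: "xyz001", status: "B", price: 10 },
];

// $match stage: keep only the documents with status "A".
const matched = orders.filter(doc => doc.status === "A");

// $group stage: sum price per cust_id.
const totals = {};
for (const doc of matched) {
  totals[doc.cust_id] = (totals[doc.cust_id] || 0) + doc.price;
}

console.log(totals); // { abc123: 55 }
```

Each stage consumes the documents produced by the previous stage, which is why the shell API takes an ordered array of stage documents.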
Map Reduce
• MongoDB applies the map phase to each input document (i.e. the documents
in the collection that match the query condition).
• The map function emits key-value pairs.
• For those keys that have multiple values, MongoDB applies the reduce
phase, which collects and condenses the aggregated data.
• MongoDB then stores the results in a collection.
• MongoDB supports sharded collections both as input and output.
Map Reduce: Illustration
[Figure not reproduced in the transcript]
Map Reduce..more example
• Insert data in collection "orders" as follows
  db.orders.insert({
      _id: ObjectId("50a8240b927d5d8b5891743c"),
      cust_id: "abc123",
      ord_date: new Date("Oct 04, 2012"),
      status: 'A',
      price: 25,
      items: [ { sku: "mmm", qty: 5, price: 2.5 },
               { sku: "nnn", qty: 5, price: 2.5 } ]
  });
• Task: Find the total price per customer
• Step I: Define a map function that emits a ("cust_id", "price") pair
  var mapFunction1 = function() {
      emit(this.cust_id, this.price);
  };
Map Reduce..more example..2
• Step II: Define a reduce function with two arguments, keyCustId and valuesPrices
  – valuesPrices is an array whose elements are the price values emitted by the map function, grouped by keyCustId.
  – The function reduces the valuesPrices array to the sum of its elements.
  var reduceFunction1 = function(keyCustId, valuesPrices) {
      return Array.sum(valuesPrices);
  };
• Step III: Perform the map-reduce on all documents in the orders collection, using mapFunction1 as the map function and reduceFunction1 as the reduce function
  db.orders.mapReduce(
      mapFunction1,
      reduceFunction1,
      { out: "map_reduce_example" }
  )
• Step IV: Do a find() to check the new collection "map_reduce_example"
  db.map_reduce_example.find();
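The functions above can be exercised outside MongoDB with a small driver that mimics how mapReduce groups emitted values and reduces keys holding multiple values. Everything here is illustrative: Array.sum is a mongo-shell helper (defined by hand below), and the orders array is made-up sample data with a second customer added so the grouping is visible.

```javascript
// Mimic db.orders.mapReduce(mapFunction1, reduceFunction1, ...) in plain JS.
// Array.sum is a mongo-shell helper, so define it here.
Array.sum = arr => arr.reduce((s, v) => s + v, 0);

var mapFunction1 = function () {
  emit(this.cust_id, this.price); // emit resolves to the global set below
};

var reduceFunction1 = function (keyCustId, valuesPrices) {
  return Array.sum(valuesPrices);
};

// Hypothetical sample orders.
const orders = [
  { cust_id: "abc123", price: 25 },
  { cust_id: "abc123", price: 30 },
  { cust_id: "xyz001", price: 10 },
];

function mapReduce(docs, mapFn, reduceFn) {
  const grouped = {};
  // Map phase: call mapFn with each document as `this`, collecting emits.
  globalThis.emit = (key, value) => {
    (grouped[key] = grouped[key] || []).push(value);
  };
  docs.forEach(doc => mapFn.call(doc));
  delete globalThis.emit;

  // Reduce phase: only keys holding multiple values go through reduceFn,
  // matching MongoDB's behavior.
  const out = {};
  for (const key of Object.keys(grouped)) {
    out[key] =
      grouped[key].length > 1 ? reduceFn(key, grouped[key]) : grouped[key][0];
  }
  return out;
}

console.log(mapReduce(orders, mapFunction1, reduceFunction1));
// { abc123: 55, xyz001: 10 }
```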
Full Text Search in MongoDB
• Important Concepts
– Stop Words: words that are filtered out because they are irrelevant for searching, e.g. is, at, the, am, I, your.
– Stemming: the process of reducing words to their root (base) form, e.g. "waiting", "waited", and "waits" all share the root "wait".
• Example: I am your father, Luke
– "I", "am", "your" are stop words
– After removing the stop words, the words left are "father" and "Luke"
– These are processed in the next step
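The two concepts can be sketched in a few lines of JavaScript. The stop-word list and the suffix-stripping rule below are tiny illustrative stand-ins, not MongoDB's actual (Snowball-based) text processing.

```javascript
// Toy stop-word filtering and stemming, applied to the slide's example.
const STOP_WORDS = new Set(["i", "am", "your", "is", "at", "the"]);

// Naive suffix stripping: "waiting", "waited", "waits" -> "wait".
const stem = word => word.replace(/(ing|ed|s)$/, "");

function analyze(text) {
  return text
    .toLowerCase()
    .split(/[^a-z]+/)                // tokenize on non-letters
    .filter(Boolean)                 // drop empty tokens
    .filter(w => !STOP_WORDS.has(w)) // remove stop words
    .map(stem);                      // reduce to root form
}

console.log(analyze("I am your father, Luke")); // [ 'father', 'luke' ]
```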
Text Search process in MongoDB
• Tokenizes and stems the search term(s) during both the index creation and the text command execution.
• Assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.
• By default, the text command returns at most the top 100 matching documents as determined by the scores.
Full Text Search in MongoDB..Example
• While starting the MongoDB server, use the following parameters
– mongod --setParameter textSearchEnabled=true
• Create a text Index on Collection “txt”
– db.txt.ensureIndex( {txt: "text"} )
• To show up the text index use the following
– db.txt.getIndices()
• Insert data in collection “txt”
– db.txt.insert( {txt: "I am your father, Luke"} )
• Stop word filtering has already happened. The following command shows
only 2 keys in the index txt.txt.$txt_text
– db.txt.validate()
• Perform a Full Text Search using the following
– db.txt.runCommand( "text", { search : "father" } )
Text Analytics
What is Text Analytics
• Process of identifying meaningful information from unstructured content
Social Media Analytics: Facebook, Twitter
• What do people feel about the latest movie?
• What is the response to the last ad campaign?
• What are people feeling about the new brand of product?
• What is our competitor doing in the market?
• What is the sentiment of people in the organization?
Text Analytics..2
Email Analytics
• Customer Support
• Regulatory Compliance
Log Analytics
• IT Server Log
Text Analytics..3
Fraud Detection Analytics
• Insurance Claims
• Credit Card Transactions
• Tax Return Claims
Text Analytics: Scenarios
• Obtain reviews from various blogs, review sites about a new movie
• Highlight important viewers' comments on the movie
In the process, the Text Analytics engine performs the following:
• Understands human language
• Distinguishes positive from negative comments
• Identifies sarcasm, criticism, and puns
• Tries to interpret like a human being
Sentiment Analysis of the movie
[IMDb listing] Krrish 3 (2013, Hindi, U), 152 min, Action, November 2013 (India). Rating: 6.5/10 from 6,762 users. Reviews: 135 user | 26 critic. Plot: Krrish and his scientist father have to save the world and their own family from an evil man named Kaal and his team of human-animal mutants led by the ruthless Kaya. Director: Rakesh Roshan. Writers: Robin Bhatt (screenplay), Honey Irani (screenplay). Stars: Priyanka Chopra, Hrithik Roshan, Amitabh Bachchan.
Sample user review titles:
• "Wish I were 12 again" (shahin mahmud, 1 November 2013)
• "Plagiarism..Plagiarism... Everywhere" (venugopal19196 from Guntur, 2 November 2013)
• "Krrish ek soch hain jo hum tak nahi pahunch paye" (Hindi: "Krrish is an idea that never reached us") (darkshadowsxtreme from India, 4 November 2013)
• "Far below expectations" (Arpan Mallik from India, 3 November 2013)
• "Krrish 3: No more than a mere rubbish.." (amruthvvkp from India, 3 November 2013)
Text Analytics: Information Extraction
• Distill structured data from unstructured and semi-structured text
• Exploit the extracted data in your applications
Noun               Adjective   Comment
Krrish 3           good        "Krrish ek soch hain jo hum tak nahi pahunch paye"
Rakesh Roshan      worst       "rubbish"
Priyanka Chopra    more        "plagiarism"
Hrithik Roshan     below
Amitabh Bachchan
Robin Bhatt
Honey Irani
[Diagram: unstructured content, together with extraction logic, flows through the Text Extraction Engine to produce structured content]
Text Analytics: Information Extraction..2
Pattern Recognition
• Phone numbers
• Date formats
• Email addresses
• URLs
Entities and Relations
• Person
• Location
• Organization
• Association between entities
Others
• Topic identification
• Sentiment / Opinion
• Classification
• Ontology
Linguistic Annotation
• Tokenization
• Parts of Speech
• Normalization
• Co-reference resolution
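Pattern recognition of the first kind can be sketched with regular expressions. The patterns below are deliberately simplified illustrations (a hypothetical US-style phone format, a loose email shape, an ISO-style date), not production-grade validators.

```javascript
// Toy regex-based pattern recognition for phone numbers, emails, and dates.
const PATTERNS = {
  phone: /\b\d{3}[- ]\d{3}[- ]\d{4}\b/g, // e.g. 555-123-4567
  email: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, // loose, illustrative shape
  date: /\b\d{4}-\d{2}-\d{2}\b/g,        // ISO-style yyyy-mm-dd
};

function extract(text) {
  const found = {};
  for (const [name, re] of Object.entries(PATTERNS)) {
    found[name] = text.match(re) || []; // all non-overlapping matches
  }
  return found;
}

const sample = "Call 555-123-4567 or mail support@example.com by 2013-11-04.";
console.log(extract(sample));
// { phone: [ '555-123-4567' ],
//   email: [ 'support@example.com' ],
//   date: [ '2013-11-04' ] }
```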
Text Analytics Terminology
• RegEx: Regular expression to recognize patterns of text, e.g. Phone number
• Dictionaries: A list of entries containing domain specific terms. Example:
dictionary of city names, dictionary of IT companies
• Text Extraction Script: A script that uses dictionaries and regex on a set of
text documents and performs extraction of text. Example: GATE Extractor
program
• Annotation: A labeled span of text matching a particular criterion. Example: a person name
• Precision: Measure of the exactness or accuracy of a pattern recognition program
• Recall: Measure of its completeness
The higher the precision and recall, the better the program.
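As a worked example with hypothetical numbers: suppose an extractor returns 8 annotations, 6 of which are correct, while the text actually contains 10 true entities.

```javascript
// Precision = true positives / all returned results (exactness).
// Recall    = true positives / all relevant items  (completeness).
const precision = (truePositives, returned) => truePositives / returned;
const recall = (truePositives, relevant) => truePositives / relevant;

console.log(precision(6, 8)); // 0.75 -> 6 of the 8 returned annotations are correct
console.log(recall(6, 10));   // 0.6  -> 6 of the 10 true entities were found
```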
Text Analytics Approaches
• Grammar based
– Input text viewed as a sequence of tokens
– Rules expressed as regular expression patterns over these tokens
• Algebra based
– Extract SPANs matching a dictionary or regex
– Create an operator for each basic operation
– Compose operators to build complex extractors
MongoDB as Analytics Platform
• The flexibility of MongoDB makes it well suited to storing analytics data.
• Customers build different types of analytics engines on the MongoDB platform, such as
– usage metrics,
– business domain specific metrics,
– financial platforms.
• The most common type of metric that clients start tracking is events (e.g. "how many people walked into my stores" or "how many people opened an iPhone application").
• The queries that answer these questions must remain efficient in a distributed environment.
MongoDB as Analytics Platform…2
• Example: Insert data as follows
  db.events.insert({
      store_id: ObjectId(), // Object id of a store
      event: "door open",   // will be one of "door open", "sale made", or "phone call"
      created_at: new Date("2013-01-29T08:43:00Z")
  })
• To query by store_id and created_at, run the following
  db.events.find({store_id: ObjectId("aaa"),
      created_at: {$gte: new Date("2013-01-29T00:00:00Z"),
                   $lte: new Date("2013-01-30T00:00:00Z")}})
• The above query runs fast in a local environment but is painfully slow in a distributed environment with a large database.
• Multiple compound indexes are created to increase speed
  db.events.ensureIndex({store_id: 1, created_at: 1})
  db.events.ensureIndex({event: 1, created_at: 1})
  db.events.ensureIndex({store_id: 1, event: 1, created_at: 1})
MongoDB as Analytics Platform…3
• Achieving Optimization
– Each of the indexes should fit into RAM.
– Any new document will have a seemingly random store_id, so an insert has a high probability of landing in the middle of an index.
– To minimize RAM usage, it is best to insert sequentially, termed "writing to the right side of the index": any new key is greater than or equal to the previous index key.
MongoDB as Analytics Platform…4
• Achieving Optimization using a "time bucket"
– Create a time_bucket attribute that breaks down acceptable date ranges into hour, day, week, month, quarter, and/or year.
  {
      store_id: ObjectId(), // Object id of a store
      event: "door open",
      created_at: new Date("2013-01-29T08:43:00Z"),
      time_bucket: [
          "2013-01-29 08-hour", "2013-01-29-day", "2013-04-week",
          "2013-01-month", "2013-01-quarter", "2013-year" ]
  }
– Create the following indexes
  db.events.ensureIndex({time_bucket: 1, store_id: 1, event: 1})
  db.events.ensureIndex({time_bucket: 1, event: 1})
– Instead of running the query over the entire date range, run the following
  db.events.find({store_id: ObjectId("aaa"), "time_bucket": "2013-01-29-day"})
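A helper to build the time_bucket array from created_at might look like the sketch below. The bucket string formats follow the slide; the week and quarter arithmetic is a simple illustration (a real deployment would pick a convention such as ISO week numbering).

```javascript
// Build the slide's time_bucket array from a created_at Date (UTC).
function timeBuckets(date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  const d = String(date.getUTCDate()).padStart(2, "0");
  const h = String(date.getUTCHours()).padStart(2, "0");
  // Week number: days elapsed since Jan 1, divided into 7-day blocks.
  const dayDiff =
    (Date.UTC(y, date.getUTCMonth(), date.getUTCDate()) - Date.UTC(y, 0, 1)) /
    86400000;
  const w = String(Math.ceil(dayDiff / 7) || 1).padStart(2, "0");
  // Quarter, labeled by its starting month (01, 04, 07, 10).
  const q = String(Math.floor(date.getUTCMonth() / 3) * 3 + 1).padStart(2, "0");
  return [
    `${y}-${m}-${d} ${h}-hour`,
    `${y}-${m}-${d}-day`,
    `${y}-${w}-week`,
    `${y}-${m}-month`,
    `${y}-${q}-quarter`,
    `${y}-year`,
  ];
}

console.log(timeBuckets(new Date("2013-01-29T08:43:00Z")));
// [ '2013-01-29 08-hour', '2013-01-29-day', '2013-04-week',
//   '2013-01-month', '2013-01-quarter', '2013-year' ]
```

Inserting these precomputed strings lets the day query hit the {time_bucket: 1, ...} indexes directly instead of scanning a created_at range.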
MongoDB as Analytics Platform…5
• Benefit of the "time bucket"
– Using the optimized time_bucket, new documents are added to the right side of the index.
– Any inserted document has a greater time_bucket value than the previous documents.
– By adding to the right side of the index and querying by time_bucket, MongoDB can swap rarely accessed older documents to disk, resulting in minimal RAM usage.
– The "hot data" is the most recently accessed (typically the last 1-3 months for most analytics applications), and the older data settles nicely to disk.
– Neither queries nor inserts access the middle of the index, so older index chunks can swap to disk.
Thank You