Realtime Analytics with Elasticsearch [New Media Inspiration 2013]
DESCRIPTION
A presentation from the New Media Inspiration 2013 conference (http://www.tuesday.cz/akce/new-media-inspiration-2013/) about using Elasticsearch's faceting features for realtime analytics of big data.
TRANSCRIPT
Real time analytics of big data with Elasticsearch
Karel Minařík
JSON
Facets
Analytics
http://www.youtube.com/watch?v=-GftBySG99Q
Realtime Analytics With ElasticSearch
http://karmi.cz
http://elasticsearch.com
Using a search engine for analytics?
wat?
HOW DOES SEARCH WORK?
A collection of documents
file_1.txt: The ruby is a pink to blood-red colored gemstone ...
file_2.txt: Ruby is a dynamic, reflective, general-purpose object-oriented programming language ...
file_3.txt: "Ruby" is a song by English rock band Kaiser Chiefs ...
How do you search documents?
File.read('file_1.txt').include?('ruby')
File.read('file_2.txt').include?('ruby')
...
The inverted index
http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices
TOKENS       POSTINGS
ruby         file_1.txt  file_2.txt  file_3.txt
pink         file_1.txt
gemstone     file_1.txt
dynamic      file_2.txt
reflective   file_2.txt
programming  file_2.txt
song         file_3.txt
english      file_3.txt
rock         file_3.txt
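A minimal Ruby sketch of how such a table could be built. The `documents` hash is a hypothetical in-memory stand-in for the three files, and `tokenize` and `build_index` are illustrative names, not part of Elasticsearch or Lucene. The tokenizer is deliberately naive: it keeps words such as "the" and "is", which the table above leaves out and which a real analyzer would typically handle differently.

```ruby
# Hypothetical in-memory stand-ins for the three files on the slide.
documents = {
  'file_1.txt' => 'The ruby is a pink to blood-red colored gemstone',
  'file_2.txt' => 'Ruby is a dynamic, reflective, general-purpose object-oriented programming language',
  'file_3.txt' => '"Ruby" is a song by English rock band Kaiser Chiefs'
}

# Naive analysis (an assumption): lowercase the text and keep unique words.
def tokenize(text)
  text.downcase.scan(/[a-z]+/).uniq
end

# Map every token to the list of documents that contain it (its postings).
def build_index(documents)
  index = Hash.new { |hash, token| hash[token] = [] }
  documents.each do |name, text|
    tokenize(text).each { |token| index[token] << name }
  end
  index
end

index = build_index(documents)
puts index['ruby'].inspect  # ["file_1.txt", "file_2.txt", "file_3.txt"]
puts index['song'].inspect  # ["file_3.txt"]
```

Unlike the `File.read(...).include?` approach, a lookup here does not re-read any document: the work is done once, at indexing time.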
The inverted index
search "ruby"
ruby         file_1.txt  file_2.txt  file_3.txt
The inverted index
search "song"
song         file_3.txt
The inverted index
search "ruby AND song"
ruby         file_1.txt  file_2.txt  file_3.txt
song         file_3.txt
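One way to read the "ruby AND song" search is as an intersection of the two postings lists. The sketch below assumes that reading; `search_all` is a hypothetical helper, and the `index` hash copies only three rows of the table.

```ruby
# Postings copied from the inverted index table (a subset of the tokens).
index = {
  'ruby' => ['file_1.txt', 'file_2.txt', 'file_3.txt'],
  'pink' => ['file_1.txt'],
  'song' => ['file_3.txt']
}

# Evaluate "a AND b AND ..." by intersecting the postings of every term.
# A term that is not in the index has empty postings, so nothing matches.
def search_all(index, *terms)
  postings = terms.map { |term| index.fetch(term.downcase, []) }
  postings.reduce { |left, right| left & right } || []
end

puts search_all(index, 'ruby', 'song').inspect  # ["file_3.txt"]
puts search_all(index, 'ruby', 'pink').inspect  # ["file_1.txt"]
```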
The inverted index
TOKENS       POSTINGS                            COUNT
ruby         file_1.txt  file_2.txt  file_3.txt  3
pink         file_1.txt                          1
gemstone     file_1.txt                          1
dynamic      file_2.txt                          1
reflective   file_2.txt                          1
programming  file_2.txt                          1
song         file_3.txt                          1
english      file_3.txt                          1
rock         file_3.txt                          1
Statistics!
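The step from search to statistics can be sketched in a few lines of Ruby: the length of a token's postings list is the number of documents that contain the token. The `index` hash below is copied from the table; the variable names are illustrative only.

```ruby
# The inverted index from the slides, as a plain Ruby hash.
index = {
  'ruby'        => ['file_1.txt', 'file_2.txt', 'file_3.txt'],
  'pink'        => ['file_1.txt'],
  'gemstone'    => ['file_1.txt'],
  'dynamic'     => ['file_2.txt'],
  'reflective'  => ['file_2.txt'],
  'programming' => ['file_2.txt'],
  'song'        => ['file_3.txt'],
  'english'     => ['file_3.txt'],
  'rock'        => ['file_3.txt']
}

# Document count per token: just the size of each postings list.
counts = index.transform_values(&:size)

# Highest counts first; ties are ordered alphabetically.
top = counts.sort_by { |token, count| [-count, token] }

puts top.first.inspect  # ["ruby", 3]
```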
ElasticSearch is an open source, scalable, distributed, cloud-ready, highly-available full-text search engine and database with powerful aggregation features, communicating by JSON over RESTful HTTP, based on Apache Lucene.
FACETS
Faceted Navigation
http://blog.linkedin.com/2009/12/14/linkedin-faceted-search/
Query
Facets
Faceted Navigation with Elasticsearch
curl "http://localhost:9200/people/_search?pretty=true" -d '{
  "query"  : { "match" : { "name" : "John" } },
  "filter" : { "terms" : { "employer" : ["IBM"] } },
  "facets" : { "employer" : { "terms" : { "field" : "employer", "size" : 3 } } }
}'
"query": user query
"filter": “checkboxes”
"facets": facets
http://www.elasticsearch.org/guide/reference/api/search/facets/index.html
Response
"facets" : {
  "employer" : {
    "missing" : 0,
    "total" : 10,
    "other" : 3,
    "terms" : [
      { "term" : "ibm", "count" : 3 },
      { "term" : "twitter", "count" : 2 },
      { "term" : "apple", "count" : 2 }
    ]
  }
}
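To make the numbers in the response easier to read, here is a rough Ruby sketch of what a terms facet reports. It assumes that "missing" counts documents without a value, "total" counts all values, and "other" counts the values that did not make it into the top `size` terms. The `employers` sample and the `terms_facet` helper are hypothetical, chosen only so that the output matches the response above; this is not how Elasticsearch computes the facet internally.

```ruby
# Hypothetical employer values, one per document (nil would mean the field is
# missing). The "other-*" entries are placeholders for less frequent employers.
employers = %w[ibm ibm ibm twitter twitter apple apple other-a other-b other-c]

def terms_facet(values, size)
  present = values.compact
  counts = Hash.new(0)
  present.each { |value| counts[value] += 1 }

  # Highest counts first; ties keep the order in which terms were first seen.
  top = counts.each_with_index
              .sort_by { |(_term, count), position| [-count, position] }
              .first(size)
              .map(&:first)

  {
    'missing' => values.size - present.size,
    'total'   => present.size,
    'other'   => present.size - top.sum { |_term, count| count },
    'terms'   => top.map { |term, count| { 'term' => term, 'count' => count } }
  }
end

facet = terms_facet(employers, 3)
puts facet.inspect
```

With this sample, `facet['total']` is 10 and `facet['other']` is 3, because the three returned terms account for 7 of the 10 values.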
Visualizing the Facets
http://mbostock.github.com/d3/tutorial/bar-1.html
d3.js ~ A Bar Chart, Part 1
DEMO: http://bl.ocks.org/4571766
‣ No batch orientation
‣ No stats precomputation and caching
‣ No predefined metrics or schemas
Important Concepts
‣ Combination of free text search, structured search, and facets
‣ Scripting for performing ad-hoc analytics
‣ Extendable: write your own facet types
Scripting
curl -X DELETE localhost:9200/demo-articles
curl -X POST localhost:9200/demo-articles -d '{
  "mappings": {
    "a": {
      "properties": {
        "url": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
curl -X PUT localhost:9200/demo-articles/a/1 -d '{"title":"...","url":"http://some.blogger.com/2012/09/01/index.html"}'
curl -X PUT localhost:9200/demo-articles/a/2 -d '{"title":"...","url":"http://some.blogger.com/2012/09/11/index.html"}'
curl -X PUT localhost:9200/demo-articles/a/3 -d '{"title":"...","url":"http://some.blogger.com/about.html"}'
curl -X PUT localhost:9200/demo-articles/a/5 -d '{"title":"...","url":"https://github.com/user/A"}'
curl -X PUT localhost:9200/demo-articles/a/5 -d '{"title":"...","url":"http://github.com/user/B"}'
curl -X POST localhost:9200/demo-articles/_refresh
curl -X GET 'localhost:9200/demo-articles/_search/?search_type=count&pretty' -d '{
  "facets": {
    "popular-domains": {
      "terms": {
        "field"  : "url",
        "script" : "term.replace(new RegExp(\"https?://\"), \"\").split(\"/\")[0]",
        "lang"   : "javascript"
      }
    }
  }
}'
Extract and aggregate most popular domains from article URLs
Response
"facets" : {
  "popular-domains" : {
    // ...
    "terms" : [
      { "term" : "some.blogger.com", "count" : 3 },
      { "term" : "github.com", "count" : 1 }
    ]
  }
}
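The facet script can be traced by hand with a small Ruby equivalent. This is only an illustration of the same string manipulation, not code that Elasticsearch runs; `domain` and `docs` are hypothetical names. The hash is keyed by document id because the last two PUT requests above both use id 5, so the second one replaces the first, which would explain why github.com is counted once.

```ruby
# Article URLs keyed by document id, in the order they were indexed.
docs = {}
docs[1] = 'http://some.blogger.com/2012/09/01/index.html'
docs[2] = 'http://some.blogger.com/2012/09/11/index.html'
docs[3] = 'http://some.blogger.com/about.html'
docs[5] = 'https://github.com/user/A'
docs[5] = 'http://github.com/user/B' # same id: replaces the previous document

# Ruby counterpart of the JavaScript facet script: strip the scheme and keep
# everything up to the first slash.
def domain(url)
  url.sub(%r{https?://}, '').split('/').first
end

# Count documents per extracted domain, most frequent first.
counts = Hash.new(0)
docs.each_value { |url| counts[domain(url)] += 1 }
facet = counts.sort_by { |term, count| [-count, term] }
              .map { |term, count| { 'term' => term, 'count' => count } }

puts facet.inspect
```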
Demonstrations
Demo
Thanks!