How to build your own Google ...
Data Wizards Dec 2015
Artur Grządziel: a few words about me
Currently: BigData and Machine Learning Leader
From Jan 2016: BigData Solution Architect at General Electric
PhD in progress at PAN (Polish Academy of Sciences) Systems Research Institute
Graduated from Warsaw University of Technology and Warsaw School of Economics
Big Data and Machine Learning enthusiast focused on applying both
to real business cases
Privately, husband and father
pl.linkedin.com/in/ArturGrzadziel
Introduction Data Wizards
Artur represents the "Data Wizards" group, an informal group of
Big Data, Machine Learning, and Data Science professionals located in
Poland who are interested in sharing knowledge and addressing business
challenges with modern Big Data and Machine Learning methods.
Agenda
1. Cloudera search
2. How it works?
MySearch: very high-level architecture
[Diagram with boxes: Data Source, Other Sources, Index, Cloudera Search (Apache Solr and Tika)]
Cloudera Search
Cloudera Search is one of Cloudera's near-real-time access products.
Cloudera Search enables non-technical users to search and explore data stored
in or ingested into Hadoop and HBase. Users do not need SQL or programming
skills to use Cloudera Search because it provides a simple, full-text interface for
searching.
Cloudera Search incorporates Apache Solr, which includes Apache Lucene,
SolrCloud, Apache Tika, and Solr Cell. Cloudera Search is tightly integrated
with Cloudera's Distribution, including Apache Hadoop (CDH). Cloudera Search
provides these key capabilities:
- Near-real-time indexing
- Batch indexing
- Simple, full-text data exploration and navigated drill down
http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
Cloudera search Tika
https://tika.apache.org/download.html
Cloudera search Tika – image
Cloudera search Tika – PDF file
Cloudera search Tika – gazeta.pl
Cloudera search Tika – formats
Supported Document Formats
• HyperText Markup Language
• XML and derived formats
• Microsoft Office document formats
• OpenDocument Format
• Portable Document Format
• Electronic Publication Format
• Rich Text Format
• Compression and packaging formats
• Text formats
• Audio formats
• Image formats
• Video formats
• Java class files and archives
• The mbox format
https://tika.apache.org/1.4/formats.html
Cloudera search Solr – how to start it …
.\bin\solr start -e cloud -noprompt
http://lucene.apache.org/solr/
Cloudera Search Administration
Cloudera Search Data
id | cat | name | price | inStock | author | series_t | sequence_i | genre_s
553573403 | book | A Game of Thrones | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 1 | fantasy
553579908 | book | A Clash of Kings | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 2 | fantasy
055357342X | book | A Storm of Swords | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 3 | fantasy
553293354 | book | Foundation | 7.99 | TRUE | Isaac Asimov | Foundation Novels | 1 | scifi
812521390 | book | The Black Company | 6.99 | FALSE | Glen Cook | The Chronicles of The Black Company | 1 | fantasy
812550706 | book | Ender's Game | 6.99 | TRUE | Orson Scott Card | Ender | 1 | scifi
441385532 | book | Jhereg | 7.95 | FALSE | Steven Brust | Vlad Taltos | 1 | fantasy
380014300 | book | Nine Princes In Amber | 6.99 | TRUE | Roger Zelazny | the Chronicles of Amber | 1 | fantasy
805080481 | book | The Book of Three | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 1 | fantasy
080508049X | book | The Black Cauldron | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 2 | fantasy
Cloudera Search Output format
Cloudera Search Simple query
Cloudera Search Simple query
Cloudera Search More advanced query
Cloudera Search Query with facets
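The screenshots for the query slides are not part of this transcript, so the queries below are illustrative and not the ones shown in the talk. This is a minimal sketch of how a simple query, a more advanced query, and a faceted query could be expressed as Solr /select URLs over the book data above. It assumes a local Solr on port 8983 and a collection named "gettingstarted" (the name used by the Solr quickstart for the cloud example); the helper build_query_url is hypothetical. The parameters q, rows, wt, facet and facet.field are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumption: local Solr started with the "cloud" example; host, port and the
# collection name "gettingstarted" are not stated on the slides.
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"


def build_query_url(q, facet_fields=(), rows=10, wt="json"):
    """Return a /select URL for query q, optionally with field facets.

    Hypothetical helper for illustration; it only builds the URL and does
    not contact a server.
    """
    params = [("q", q), ("rows", rows), ("wt", wt)]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field to count values for.
        params.extend(("facet.field", field) for field in facet_fields)
    return SOLR_SELECT + "?" + urlencode(params)


# Simple query: books whose name contains "game".
simple_url = build_query_url("name:game")

# More advanced query: boolean combination of two field conditions.
advanced_url = build_query_url("genre_s:fantasy AND inStock:true")

# Query with facets: match everything, return no rows, count books per genre.
facet_url = build_query_url("*:*", facet_fields=["genre_s"], rows=0)

for url in (simple_url, advanced_url, facet_url):
    print(url)
```

Opening any of the printed URLs in a browser against a running instance with the book data indexed should return the results in JSON, the output format selected by wt=json.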
Cloudera search Solr – other features
The MoreLikeThis search component enables users to query for documents
similar to a document in their result list. It works by using terms from the
original document to find similar documents in the index.
The SpellCheck component is designed to provide inline query suggestions
based on other, similar terms.
Highlighting in Solr allows fragments of documents that match the user's query
to be included with the query response.
Synonyms, stop words
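To make the idea of synonyms and stop words concrete, here is a toy sketch of text analysis before indexing. It is not Solr's implementation: the word lists and the analyze function are hypothetical and chosen only for illustration.

```python
# Hypothetical word lists, chosen only for this illustration.
STOP_WORDS = {"a", "the", "has"}
SYNONYMS = {"kitten": "cat", "puppy": "dog"}


def analyze(text):
    """Lowercase, map synonyms to a canonical term, then drop stop words."""
    tokens = text.lower().split()
    tokens = [SYNONYMS.get(token, token) for token in tokens]
    return [token for token in tokens if token not in STOP_WORDS]


print(analyze("John has a kitten"))  # ['john', 'cat']
```

Applying the same analysis to both documents and queries is what lets a search for "kitten" match a document that only mentions "cat".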
Cloudera search Solr – other features – geospatial search
Solr has sophisticated geospatial support, including searching within a
specified distance range of a given location (or within a bounding box),
sorting by distance, or even boosting results by the distance.
http://lucene.apache.org/solr/quickstart.html
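The two operations named above, filtering by distance from a point and sorting by distance, can be sketched in plain Python. This is only an illustration of the idea using the haversine great-circle formula; it is not how Solr implements geospatial search, and the place list, coordinates (approximate) and function names are assumptions made for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_distance(origin, places, max_km):
    """Return (name, km) pairs no farther than max_km, nearest first."""
    hits = []
    for name, (lat, lon) in places.items():
        distance = haversine_km(origin[0], origin[1], lat, lon)
        if distance <= max_km:
            hits.append((name, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical sample data with approximate city-centre coordinates.
PLACES = {
    "Warsaw": (52.2297, 21.0122),
    "Krakow": (50.0647, 19.9450),
    "Gdansk": (54.3520, 18.6466),
}

for name, km in within_distance(PLACES["Warsaw"], PLACES, max_km=300):
    print(f"{name}: {km:.0f} km")
```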
Cloudera Search Common Use Cases
Cloudera Search lets your entire business explore and analyze data quickly and
easily for a variety of critical use cases all within a single platform, including:
- Threat detection
- Customer 360-degree visibility
- Improved user experience
- Interactive market segmentation
- Accessible global knowledge base
https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
Cloudera Search Other Use Cases
Instagram: Instagram (a Facebook company) uses Solr to power its
geosearch API.
WhiteHouse.gov: The Obama administration's website is built on Drupal and
Solr.
Netflix: Solr powers basic movie searching on this extremely busy site.
StubHub.com: This ticket reseller uses Solr to help visitors search for concerts
and sporting events.
https://www.safaribooksonline.com/library/view/scaling-apache-solr/9781783981748/ch01s05.html
How it works ... ?
How it works … ? Data Source – documents …
Document Content
1 John has a cat
2 John has a dog
3 Eva has a cat
4 George has a dog
How it works … ? Data Source – documents … space of unique terms
Document | Content | Term IDs
1 | John has a cat | 1 2 3 4
2 | John has a dog | 1 2 3 5
3 | Eva has a cat | 6 2 3 4
4 | George has a dog | 7 2 3 5
List of unique words:
1. John
2. has
3. a
4. cat
5. dog
6. Eva
7. George
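A minimal sketch of this step: assign each unique word an ID in order of first appearance, then rewrite each document as a sequence of term IDs. The function names are hypothetical; the documents are the four from the slide.

```python
DOCS = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_vocabulary(docs):
    """Map each unique term to an ID, numbered from 1 by first appearance."""
    vocab = {}
    for doc_id in sorted(docs):
        for term in docs[doc_id].split():
            if term not in vocab:
                vocab[term] = len(vocab) + 1
    return vocab


def encode(docs, vocab):
    """Replace every term in every document with its vocabulary ID."""
    return {doc_id: [vocab[term] for term in text.split()]
            for doc_id, text in docs.items()}


VOCAB = build_vocabulary(DOCS)
ENCODED = encode(DOCS, VOCAB)

print(VOCAB)
for doc_id, ids in ENCODED.items():
    # "dog" gets ID 5, so documents 2 and 4 both end in 5.
    print(doc_id, ids)
```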
How it works … ? Data Source – Documents … boolean search with inverted index
Term | Tot. freq. | Doc #
John | 2 | 1, 2
has | 4 | 1, 2, 3, 4
a | 4 | 1, 2, 3, 4
cat | 2 | 1, 3
dog | 2 | 2, 4
Eva | 1 | 3
George | 1 | 4
(Term and Tot. freq. form the Dictionary; Doc # lists the Documents.)
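Boolean search with an inverted index can be sketched as follows: the index maps each term to the set of documents containing it, and AND/OR queries become set intersection and union. This is a toy illustration with hypothetical function names, not the data structure Lucene actually uses.

```python
from collections import defaultdict

DOCS = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_inverted_index(docs):
    """Map each term to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def search_and(index, *terms):
    """Documents containing all of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


def search_or(index, *terms):
    """Documents containing at least one of the given terms."""
    return sorted(set().union(*(index.get(term, set()) for term in terms)))


INDEX = build_inverted_index(DOCS)

print(search_and(INDEX, "John", "cat"))    # [1]
print(search_and(INDEX, "has", "dog"))     # [2, 4]
print(search_or(INDEX, "Eva", "George"))   # [3, 4]
```

Because every term occurs at most once per document here, the length of each document set equals the total frequency shown for that term.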
How it works … ? Data Source – documents as vectors
Documents
document 1 John has a cat
document 2 John has a dog
document 3 Eva has a cat
document 4 George has a dog
Space of unique terms -> John has a cat dog Eva George
vector representing doc1 -> 1 1 1 1 0 0 0
vector representing doc2 -> 1 1 1 0 1 0 0
vector representing doc3 -> 0 1 1 1 0 1 0
vector representing doc4 -> 0 1 1 0 1 0 1
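The vectors above can be reproduced with a few lines of code: each position corresponds to one unique term and holds 1 if the document contains it, 0 otherwise. The scoring step at the end, counting shared terms with a dot product, is an added assumption to show one simple use of such vectors; the slide itself does not specify a scoring method.

```python
DOCS = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}

# Space of unique terms, in the order used on the slide.
TERMS = ["John", "has", "a", "cat", "dog", "Eva", "George"]


def to_vector(text, terms=TERMS):
    """Binary vector: 1 where the term occurs in the text, else 0."""
    present = set(text.split())
    return [1 if term in present else 0 for term in terms]


def shared_terms(u, v):
    """Dot product of two binary vectors, i.e. the number of shared terms."""
    return sum(a * b for a, b in zip(u, v))


VECTORS = {doc_id: to_vector(text) for doc_id, text in DOCS.items()}

# Assumption for illustration: treat the query as a tiny document.
query = to_vector("John cat")
SCORES = {doc_id: shared_terms(query, vec) for doc_id, vec in VECTORS.items()}

for doc_id, vec in VECTORS.items():
    print(f"doc{doc_id}", vec, "score:", SCORES[doc_id])
```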
How it works … ? Data Source – Documents … vectors
Summary
Thank you Data Wizards
Links:
• Cloudera Search:
http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
• Tika:
https://tika.apache.org/
• Apache Solr:
http://lucene.apache.org/solr/
https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
• Vectors, Inverted Index, Frequency Matrix, etc.:
http://courses.ischool.berkeley.edu/i202/f05/LectureNotes/202-20051108.htm