large scale search, discovery and analytics in action

21
Confidential © Copyright 2012 Large Scale Search, Discovery and Analysis in Action Grant Ingersoll Chief Scientist September 18, 2012

Upload: grant-ingersoll

Post on 27-Jan-2015

109 views

Category:

Technology


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Large Scale Search, Discovery and Analytics in Action

Confidential © Copyright 2012

Large Scale Search, Discovery and Analysis in Action

Grant IngersollChief Scientist

September 18, 2012

Page 2: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

• Good keyword search is a commodity and easy to get up and running

• The Bar is Raised• Relevance is (always will

be?) hard

• Holistic view of the data AND the users is critical

• Search, Discovery and Analytics are the key to unlocking this view of users and data

Search is Dead, Long Live Search

Documents

User Interaction

Access

Content Relationships

Page 3: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Topics

•Background and needs

•Architecture•Road Ahead

•SDA In Action• Components• Challenges and Lessons Learned

•Wrap Up

Page 4: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Why Search, Discovery and Analytics (SDA)?

• User Needs• Real-time, ad hoc access to

content

• Aggressive Prioritization based on Importance

• Serendipity

• Feedback/Learning from past

• Business Needs• Deeper insight into users

• Leverage existing internal knowledge

• Cost effective

Search

DiscoveryAnalytics

Page 5: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Sample Use Cases

• Claims processing and analysis, including fraud analysis

• Large scale content acquisition and access for:• Defense, intelligence and pharmaceutical applications

• Views of data surrounding natural disasters and other tragedies for research, archiving and therapeutic purposes

• Analysis of Website and social media interactions

• Access and processing of genetic information for improved medical treatments

• Log processing and fraud detection in telecommunications

5

Page 6: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

In Focus: Personalized Medicine

6

Genetic Variations

Patient DNA

Alignment and other analysis

Search and Faceting

Standard Therapies

Alternative Therapies

Page 7: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

In Focus: Log Processing in Telecommunications

• Each year, large sums of money are lost due to fraudulent calls and poor service

• Logs are usually semi-structured and contain vital information about errors and fraud

• Deeper batch analytics can provide insight into patterns across vast amounts of data

• Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities

7

Page 8: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

What Does an SDA Platform Need?

• Fast, efficient, scalable search• Bulk and Near Real Time Indexing

• Handle billions of records with sub-second search and faceting

• Large scale, cost effective storage and processing capabilities• Need whole data consumption and analysis

• Experimentation/Sampling tools

• NLP and machine learning tools that scale to enhance discovery and analysis

Page 9: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Architecture

Page 10: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Under the Hood

• Lucene/Solr 4.0-dev

• Sharded with SolrCloud• 1 second (default) soft commits for NRT

updates

• 1 minute (default) hard commits (no searcher reopen)

• Transaction logs for recovery

• Solr takes care of leader election, etc. so no more master/slave

• RESTful services built on Restlet 2.1

• Service Discovery, load balancing, failover enabled via ZooKeeper + Netflix Curator

• Authentication and authorization over SSL (optional)

• Proxies for LucidWorks and WebHDFS API

• Workflow engine coordinates data flow

LucidWorks 2.1 SDA Engine

Page 11: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Under the Hood, cont.

• Apache Hadoop• Map-Reduce (MR) jobs for ETL

and bulk indexing into SolrCloud sharded system

• Leverage Pig and custom MR jobs for log processing and metric calculation

• WebHDFS

• Apache Mahout• K-Means Clustering

• Statistically Interesting Phrases

• Similar Docs

• More to come

• Apache HBase• Key-value and time series of all

calculated metrics

• Document storage

• Apache Pig• ETL

• Log analysis -> Hbase

• Apache ZooKeeper• Netflix Curator for service

discovery and higher level ZK client

• Apache Kafka• Pub-sub for collecting logs from

LucidWorks into HDFS

Page 12: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

The Road Ahead

• Our approach is from search and discovery outwards to analytics• Analytics in beta are focused around analysis of search logs

• Apache Hive support

• Analytics Themes• Relevance

• Data quality

• Discovery

• Experiment Management

• Machine Learning• Classification

• Recommendations

• Natural Language Processing

• Incorporate latest LucidWorks/Solr

Page 13: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Computation and Storage

LucidWorks Search/Solr

• Document Index

• Faceting

• SolrCloud makes sharding easy

Hadoop

• Stores Logs, Raw files, intermediate files, etc.

• WebHDFS

• Small files are an unnatural act

HBase

• Metric Storage

• User Histories/Profile

• Document Storage

Challenges• Who is the authoritative store?• Real time vs. Batch• Where should analysis be done?

Page 14: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Search In Practice

•Three primary concerns• Performance/Scaling

• Relevance

• Operations: monitoring, failover, etc.

•Business typically cares more about relevance

•Devs care more about performance at first…

Page 15: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Search: Relevance

•Always Be Testing•Experiment management is critical•Top X + sampling•Click Logs

•Track Everything!• Queries• Clicks• Displayed Documents • Mouse/Scroll tracking?

•Phrases are your friends

Page 16: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Discovery Components

Serendipity

• Trends• Topics• Recommendations• Related Items• More Like This• Did you mean?• Stat. Interesting

Phrases

Organization

• Importance• Clustering• Classification

• Named Entities• Time Factors• Faceting

Data Quality

• Document factor Distributions• Length• Boosts

• Duplicates

Challenges• Many of these are intense calculations or iterative• Many are subjective and require a lot of experimentation

Page 17: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Discovery with Mahout

• Mahout’s 3 “C”s provide tools for helping across many aspects of discovery• Collaborative Filtering

• Classification

• Clustering

• Also: • Collocations (Statistically Interesting Phrases)

• Singular Value Decomposition (SVD)

• Others

• Challenges:• High cost to iterative machine learning algorithms

• Mahout is very command line oriented

• Some areas less mature

Page 18: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Aside: Experiment Management

• Plan for running experiments from the beginning across Search and Discovery components• Your engine should help!

• Types of Experiments to consider• Indexing/Analysis

• Query parsing

• Scoring formulas

• Machine Learning Models

• Recommendations, many more

• Make it easy to do A/B testing across all experiments and compare and contrast the results

Page 19: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Analytics in Practice

• Many of the components discussed provide analytical features• Leverage existing tools: R, etc.

• Simple Counts:• Facets

• Term and Document frequencies

• Clicks

• Search and Discovery example metrics• Relevance measures like Mean Reciprocal Rank

• Histograms/Drilldowns around Number of Results

• Log and navigation analysis

• Data cleanliness analysis is helpful for finding potential issues in content

Page 20: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Wrap

• Search, Discovery and Analytics, when combined into a single, coherent system provides powerful insight into both your content and your users

• LucidWorks has combined many of these things into LucidWorks Big Data• http://www.lucidworks.com/products/lucidworks-big-data

• Design for the big picture when building search-based applications

Page 21: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Discussion and Resources

• Questions?

• http://www.lucidworks.com

[email protected]• @gsingers

21