TriHUG: Lucene Solr Hadoop

Where It All Began: Using Apache Hadoop for Search with Apache Lucene and Solr

Uploaded by grant-ingersoll on 26-Jan-2015


DESCRIPTION

High level talk with references on using Apache Hadoop with Apache Lucene, Solr, Nutch, etc.

TRANSCRIPT

Page 1: TriHUG: Lucene Solr Hadoop

Where It All Began

Using Apache Hadoop for Search with Apache Lucene and Solr

Page 2

Lucid Imagination, Inc.

Topics

Search

What is:

Apache Lucene?

Apache Nutch?

Apache Solr?

Where does Hadoop (ecosystem) fit?

Indexing

Search

Other

Page 3

Search 101

Search tools are designed for dealing with fuzzy data

Works well with structured and unstructured data

Performs well when dealing with large volumes of data

Many apps don’t need the limits that databases place on content

Search fits well alongside a DB too

Given a user’s information need (a query), find and, optionally, score content relevant to that need

Many different ways to solve this problem, each with tradeoffs

What’s “relevant” mean?

Page 4

Search 101

Relevance

Indexing

Finds and maps terms and documents

Conceptually similar to a book index

At the heart of fast search/retrieve

Vector Space Model (VSM) for relevance

Common across many search engines

Apache Lucene is a highly optimized implementation of the VSM
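The two ideas above, an inverted index for fast lookup and VSM-style scoring for relevance, can be pictured with a toy sketch. This is plain Python with made-up names for illustration only; Lucene's actual implementation is a far more optimized Java codebase.

```python
# Toy inverted index plus a simple TF-IDF score, sketching how term lookup
# and vector-space relevance fit together. Illustration only, not Lucene.
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def tfidf_score(query, doc_text, docs, index):
    """Score one document against a query with a simple TF-IDF sum."""
    n_docs = len(docs)
    tf = Counter(doc_text.lower().split())
    score = 0.0
    for term in query.lower().split():
        df = len(index.get(term, ()))
        if df:
            idf = math.log(n_docs / df) + 1.0  # rarer terms weigh more
            score += tf[term] * idf
    return score

docs = {
    1: "Lucene is a search library",
    2: "Hadoop scales batch processing",
    3: "Solr is a search server built on Lucene",
}
index = build_index(docs)
```

Like a book index, `index` answers "which documents contain this term?" without scanning every document; the TF-IDF score then ranks the matches.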

Page 5

Lucene is a mature, high-performance Java API for adding search capabilities to applications

Supports indexing, searching and a number of other commonly used search features (highlighting, spell checking, etc.)

Not a crawler and doesn’t know anything about Adobe PDF, MS Word, etc.

Created in 1997 and now part of the Apache Software Foundation

Important to note that Lucene does not have distributed index (shard) support

Page 6

Nutch

ASF project aimed at providing large-scale crawling, indexing and searching using Lucene and other technologies

Mike Cafarella and Doug Cutting originally created Hadoop as part of Nutch based on the Google paper by Dean and Ghemawat

http://labs.google.com/papers/mapreduce.html

Only much later did it spin out to become the Hadoop that we all know

In other words, Hadoop was born from the need to scale search crawling and indexing

Originally used Lucene for search/indexing, now uses Solr

Page 7

Solr

Solr is the Lucene-based search server that provides the infrastructure most users need to work with Lucene

Without knowing Java!

Also provides:

Easy setup and configuration

Faceting

Highlighting

Replication/Sharding

Lucene Best Practices

http://search.lucidimagination.com

Page 8

Lucene Basics

Content is modeled via Documents and Fields

Content can be text, integers, floats, dates, custom

Analysis can be employed to alter content before indexing

Searches are supported through a wide range of Query options

Keyword

Terms

Phrases

Wildcards, other
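The analysis step mentioned above can be pictured as a small chain of token transformations. A minimal sketch (Lucene's real analyzers are configurable Java components with many more filter options; the stopword list here is invented):

```python
# Minimal analysis chain sketch: tokenize, lowercase, drop stopwords.
# Illustrates altering content before indexing; not Lucene's API.
import re

STOPWORDS = {"a", "an", "the", "is", "of"}

def analyze(text):
    tokens = re.findall(r"\w+", text.lower())          # tokenize + lowercase
    return [t for t in tokens if t not in STOPWORDS]   # stopword filter
```

The same chain is typically applied to both documents at index time and queries at search time, so that terms match in their normalized form.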

Page 9

Quick Solr Demo

Pre-reqs:

Apache Ant 1.7.x

SVN

svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk

cd solr-trunk/solr/

ant example

cd example

java -jar start.jar

cd exampledocs; java -jar post.jar *.xml

http://localhost:8983/solr/browse

Page 10

Anatomy of a Distributed Search System

[Diagram: input docs flow through a sharding algorithm to indexers that write Shard[0]…Shard[n]; user queries go from the application through a fan in/out layer to searchers over the same shards, with a coordination layer tying it together]
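The indexing and query paths above can be sketched as a toy scatter-gather: documents are routed to a shard by a sharding function, and a query fans out to every shard, with per-shard hits merged (fanned in) into one top-k list. In-memory dicts stand in for real index and search services; all names are hypothetical.

```python
# Toy scatter-gather over n shards: route docs by hash, query all shards,
# merge the per-shard hits into a single top-k result. Illustration only.
import heapq

NUM_SHARDS = 3

def shard_for(doc_id):
    # Simple sharding algorithm: hash the id modulo the shard count.
    return hash(doc_id) % NUM_SHARDS

def index_docs(docs):
    """Indexing path: each document lands on exactly one shard."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id)][doc_id] = text
    return shards

def search(shards, term, k=2):
    """Query path: fan out to every shard, fan in the best k hits."""
    hits = []
    for shard in shards:
        for doc_id, text in shard.items():
            score = text.lower().split().count(term)
            if score:
                hits.append((score, doc_id))
    return heapq.nlargest(k, hits)
```

A real coordination layer would run the per-shard searches in parallel and handle shard failures; here the loop is sequential for clarity.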

Page 11

Sharding Algorithm

Good document distribution across shards is important

Simple approach:

hash(id) % numShards

Fine if the number of shards doesn’t change, or if reindexing is easy

Better:

Consistent Hashing

http://en.wikipedia.org/wiki/Consistent_hashing

Also key: how to deal with the shape/size of the cluster changing
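The difference between the two approaches shows up exactly when the cluster changes shape. A toy sketch (hypothetical names; real implementations tune the number of virtual nodes per shard): with `hash(id) % numShards`, adding a shard remaps most keys, while a consistent-hash ring only moves the keys that land near the new shard's positions.

```python
# Contrast modulo sharding with a consistent-hash ring when growing a
# cluster from 4 to 5 shards. Toy illustration, deterministic via md5.
import bisect
import hashlib

def h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with virtual nodes per shard."""
    def __init__(self, shards, replicas=50):
        self.points = sorted(
            (h(f"{s}#{r}"), s) for s in shards for r in range(replicas)
        )

    def shard_for(self, key):
        # First ring point at or after the key's hash, wrapping around.
        i = bisect.bisect(self.points, (h(key),)) % len(self.points)
        return self.points[i][1]

keys = [f"doc{i}" for i in range(1000)]

# Modulo sharding: how many keys move when going from 4 to 5 shards?
moved_mod = sum(h(k) % 4 != h(k) % 5 for k in keys)

# Consistent hashing: same change, far fewer keys move.
old = Ring(["s0", "s1", "s2", "s3"])
new = Ring(["s0", "s1", "s2", "s3", "s4"])
moved_ring = sum(old.shard_for(k) != new.shard_for(k) for k in keys)
```

With modulo sharding roughly four out of five keys change shards here, versus about one in five (the new shard's share) for the ring, which is why consistent hashing copes better with a changing cluster.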

Page 12

Hadoop and Search

Much of the Hadoop ecosystem is useful for search-related functionality

Indexing

Process of adding documents to the inverted index to make them searchable

In most cases, batch-oriented and embarrassingly parallel, so Hadoop Core can help

Search

Query the index and return documents and other info (facets, etc.) related to the result set

Subsecond response time usually required

ZooKeeper, Avro and others are still useful

Page 13

Indexing (Lucene)

Hadoop ships with contrib/index

Almost no documentation, but…

Good example of map-side indexing

Mapper does analysis and creates an in-memory index, which is written out to segments

Indexes merged on the reduce side

Katta

http://katta.sourceforge.net

Shard management, distributed search, etc.

Both give you a large amount of control, but you have to build out all of the search framework around them
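The map-side indexing pattern described above, mappers building in-memory partial indexes and the reduce side merging them, can be sketched with plain functions standing in for Hadoop tasks (all names hypothetical; contrib/index works on real Lucene segments rather than dicts):

```python
# Sketch of map-side indexing: each "mapper" builds a partial inverted
# index for its input split; the "reducer" merges the partials into one.
from collections import defaultdict

def map_index(split):
    """Mapper: analyze one split of (doc_id, text) pairs into a partial index."""
    partial = defaultdict(set)
    for doc_id, text in split:
        for term in text.lower().split():
            partial[term].add(doc_id)
    return partial

def reduce_merge(partials):
    """Reducer: union each term's postings across the partial indexes."""
    merged = defaultdict(set)
    for partial in partials:
        for term, postings in partial.items():
            merged[term] |= postings
    return merged
```

Because each split is indexed independently, the map phase is embarrassingly parallel; only the final merge needs to see output from every mapper.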

Page 14

Indexing (Solr)

https://issues.apache.org/jira/browse/SOLR-1301

Map side formats

Reduce-side indexing

Creates indexes on local file system (outside of HDFS) and copies to default FS (HDFS, etc.)

Manually install the index into a Solr core once built

https://issues.apache.org/jira/browse/SOLR-1045

Map-side indexing

Incomplete, but based on Hadoop contrib/index

Write a distributed Update Handler to handle updates on the server side

Page 15

Indexing (Nutch to Solr)

Use Nutch to crawl content, Solr to index and serve

Doesn’t support indexing to Solr shards just yet

Need to write/use Solr distributed Update Handler

Still useful for smaller crawls (< 100M pages)

http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

Page 16

Searching

Hadoop Core is not all that useful for distributed search

Exception: Hadoop RPC layer, possibly

Exception: Log analysis, etc. for search related items

Other Hadoop ecosystem tools are useful:

Apache ZooKeeper (more in a moment)

HDFS – storage of shards (pull down to local disk)

Avro, Thrift, Protocol Buffers (serialization utilities)

Page 17

ZooKeeper and Search

ZooKeeper is a centralized service for coordination, configuration, naming and distributed synchronization

In the context of search, it’s useful for:

Sharing configuration across nodes

Maintaining status about shards

Up/down/latency/rebalancing and more

Coordinating searches across shards/load balancing
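The coordination roles listed above amount to a shared registry of shard state that searchers consult per query. A toy stand-in (a plain dict with invented fields; real systems keep this state in ZooKeeper znodes with watches, not in local memory):

```python
# Toy shard-status registry: track per-replica up/down state and latency,
# and pick a healthy, fast replica per shard. Not the ZooKeeper API.
def pick_replica(registry, shard):
    """Choose the lowest-latency live replica for a shard."""
    live = [(latency, host)
            for host, (up, latency) in registry[shard].items() if up]
    if not live:
        raise RuntimeError(f"no live replicas for {shard}")
    return min(live)[1]

registry = {
    "shard0": {
        "hostA": (True, 12.0),   # up, slow
        "hostB": (True, 4.5),    # up, fast
        "hostC": (False, 1.0),   # down, excluded
    },
}
```

Keeping this registry in ZooKeeper rather than on each node is what lets every searcher see a consistent view of which replicas are up as the cluster changes.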

Page 18

ZooKeeper and Search (Practical)

Katta employs ZooKeeper for search coordination, etc.

Query distribution, status, etc.

Solr Cloud

All the benefits of Solr + ZooKeeper for coordinating distributed capabilities

Query distribution, configuration sharing, status, etc.

About to be committed to Solr trunk

http://wiki.apache.org/solr/SolrCloud

Page 19

Other Search Related Tasks

Log Analysis

Query analytics

Related Searches

Relevance assessments

Classification and Clustering

Mahout – http://mahout.apache.org

HBase and other stores for documents

Avro, Thrift, Protocol Buffers for serialization of objects across the wire

Page 20

Resources

http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

http://hadoop.apache.org

http://nutch.apache.org

http://lucene.apache.org

http://www.lucidimagination.com