Introduction to Apache Lucene/Solr

Introduction to Apache Solr & Lucid Imagination Grant Ingersoll Thursday, 29 July 2010 Sponsored by Co-sponsored by We deliver information solutions

DESCRIPTION

Lucene and Solr are state-of-the-art search technologies available for free as open source from The Apache Software Foundation. Lucene is the underlying search library, and Solr is a platform built on top of Lucene that makes it easy to build Lucene-based applications. Both are full-featured and have excellent performance, relevancy ranking and scalability. These technologies are used today by thousands of organizations and power substantial search applications at AOL, Comcast Interactive Media, IBM, Netflix, LinkedIn and MySpace. http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Introduction-Apache-Lucene-and-Solr

TRANSCRIPT

Page 1: Introduction to Apache Lucene/Solr

Introduction to Apache Solr & Lucid Imagination

Grant Ingersoll

Thursday, 29 July 2010

Sponsored by

Co-sponsored by

We deliver information solutions

Page 2: Introduction to Apache Lucene/Solr

Lucid Imagination, Inc. © 2010

Sponsored by

Co-sponsored by…

We consult and design.

We architect and build.

We support.

And we realise the true value of your content...

We deliver information solutions.

Steve Odart

www.ixxus.com

Page 3: Introduction to Apache Lucene/Solr

Agenda

Introductions

About Lucid Imagination & Open Source Search

LucidWorks for Solr

Searching your domain with Solr

Putting Solr into production

Questions

Slides are posted for download at the end of this presentation; full replay available within ~48 hours of live webcast

Page 4: Introduction to Apache Lucene/Solr

About me

Grant Ingersoll

Lucene/Solr committer

Co-founder Apache Mahout project

Co-author of upcoming “Taming Text”

Chair, Apache Lucene PMC

Page 5: Introduction to Apache Lucene/Solr

About Lucid Imagination

Build on and complement the open source technology and install base of Apache Lucene and Solr

Deliver subscription-based value-add software, support and training to enhance & extend Lucene/Solr

Center of excellence for Lucene/Solr app developers

Page 6: Introduction to Apache Lucene/Solr

Company Background

Lucene Project Launched: 1997

Solr Project Launched: 2006

Company Launched: Aug. 2007

Financing: Shasta Ventures, Granite Ventures, Walden International, In-Q-Tel

Paying Customers: 100+ (and counting…)

HQ: San Mateo, California, USA

Partners: US, Europe, Japan, Latin America

Page 7: Introduction to Apache Lucene/Solr

Lucid Imagination Offerings

Search

Customers

Building Better, Faster, Less Costly Search Applications

Best Practices

Training

Consulting Subscriptions

Certified Distributions

Health Checks

Page 8: Introduction to Apache Lucene/Solr

Lucene/Solr Success Stories with Lucid Imagination

Page 9: Introduction to Apache Lucene/Solr

Data Happens

Data is constantly growing faster and becoming more diverse

Mix of content, composition, and repositories: new terms, fields, range of data types grow in tandem with volume

Diversity and location of data are an application development problem

Search and discovery tools are the solution

Scalability, performance and relevancy are key to user success

Transparency, breadth and flexibility are key to development success

Page 10: Introduction to Apache Lucene/Solr

Page 11: Introduction to Apache Lucene/Solr

Lucene/Solr

Lucene: powerful flexible search library

Speed, accuracy, scalability, efficiency

Cross-platform portability of indexes

Java ported to 7 other environments (PHP, C++, Python, etc.)

Liberal Apache License

One of Top 5 Apache Projects

Top 10 Open Source Project

Solr: The Lucene Search Server

REST-like interface

Faceting

Rich Document Handling

Easy configuration

Hit highlighting

RDBMS integration

Distributed scalability

Lucene, Solr and their logos are trademarks of the Apache Software Foundation

Page 12: Introduction to Apache Lucene/Solr

Lucene/Solr Open Source Quality @ the tipping point

Scalability

823 billion documents searched by Lucene at MySpace.com

Performance

Real time: LinkedIn search covers 48 million members, adding one new member (with new content) per second

Relevancy

Open source APIs deliver better customization and the ability to fine tune results

Economics

5-8x reduction in server footprint over commercial search

No vendor lock-in lowers lifecycle costs

Page 13: Introduction to Apache Lucene/Solr

Creating Lasting Business Value

Three key trends…

Reduced risk

Better fit

Shorter time to market

Resulting from direct communication between innovators and users

From being locked into single-vendor relationships

Access to code results in increased adaptability of process to systems

…result in:

CREATING COMPETITIVE ADVANTAGE: Focus on core process innovations unique to your business instead of operating and maintaining 3rd party software packages

Page 14: Introduction to Apache Lucene/Solr

Search 101

Search tools are designed for dealing with fuzzy data

Works well with structured and unstructured data

Performs well when dealing with large volumes of data

Many apps don’t need the limits that databases place on content

Search fits well alongside a DB too

Given a user’s information need (a query), find and, optionally, score content relevant to that need

Many different ways to solve this problem, each with tradeoffs

What does “relevant” mean?

Page 15: Introduction to Apache Lucene/Solr

Two Foundation Concepts

Indexing

Finds and maps terms and documents

Conceptually similar to a book index

At the heart of fast search/retrieve

Relevance

Vector Space Model (VSM) for relevance

Common across many search engines

Apache Lucene is a highly optimized implementation of the VSM
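The two concepts on this slide can be illustrated with a toy sketch. It builds a small inverted index (term to documents, like a book index) and ranks documents by cosine similarity of tf-idf vectors, which is one textbook form of the Vector Space Model. The corpus, the function names and the idf formula are assumptions made for illustration; Lucene's real scoring function is considerably more refined.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; the document ids and texts are made up for illustration.
DOCS = {
    "d1": "solr is a search server built on lucene",
    "d2": "lucene is a search library",
    "d3": "a database stores structured rows",
}

def tokenize(text):
    """Lowercase and split on whitespace (real analyzers do much more)."""
    return text.lower().split()

def build_index(docs):
    """Map each term to {doc_id: term frequency}, like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def idf(index, term, n_docs):
    """Smoothed inverse document frequency: rarer terms weigh more."""
    df = len(index.get(term, {}))
    return math.log((1 + n_docs) / (1 + df)) + 1.0

def search(index, docs, query):
    """Rank documents by cosine similarity of tf-idf vectors."""
    n = len(docs)
    q_weights = {t: tf * idf(index, t, n)
                 for t, tf in Counter(tokenize(query)).items()}
    # Dot products: only the postings of the query terms are touched.
    dots = defaultdict(float)
    for term, qw in q_weights.items():
        for doc_id, tf in index.get(term, {}).items():
            dots[doc_id] += qw * tf * idf(index, term, n)
    # Normalise by the lengths of the query and document vectors.
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    results = []
    for doc_id, dot in dots.items():
        d_norm = math.sqrt(sum(
            (tf * idf(index, t, n)) ** 2
            for t, tf in Counter(tokenize(docs[doc_id])).items()))
        results.append((doc_id, dot / (q_norm * d_norm)))
    return sorted(results, key=lambda r: r[1], reverse=True)

INDEX = build_index(DOCS)
RESULTS = search(INDEX, DOCS, "lucene search library")
```

In this sketch the shorter document that matches all three query terms ranks first, and the document that shares no query term is never scored, which is the reason an inverted index keeps retrieval fast.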

Page 16: Introduction to Apache Lucene/Solr

Solr Basics

Content is modeled via Documents and Fields

Content can be text, integers, floats, dates, custom

Analysis can be employed to alter content before indexing

Controlled via schema.xml

Searches are supported through a wide range of Query options

Keyword

Terms

Phrases

Wildcards, other
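As a rough illustration of the query options listed above, the sketch below builds one select URL per query style. The base URL matches the example server used later in these slides; the field names `name` and `manu` and the query strings themselves are assumptions borrowed from the Solr example data, so substitute your own fields.

```python
from urllib.parse import urlencode

# Assumed base URL: the default example server used elsewhere in these slides.
SELECT_URL = "http://localhost:8983/solr/select/"

# Hypothetical queries; "name" and "manu" are fields from the Solr example schema.
EXAMPLE_QUERIES = {
    "keyword": "ipod",               # searched against the default field
    "term": "name:ipod",             # a single term in a named field
    "phrase": 'name:"ipod mini"',    # terms that must appear next to each other
    "wildcard": "manu:bel*",         # prefix match on a field
}

def select_url(q, **params):
    """Return a select URL with the query and extra parameters URL-encoded."""
    query = {"q": q}
    query.update(params)
    return SELECT_URL + "?" + urlencode(query)

URLS = {kind: select_url(q, fl="score,id")
        for kind, q in EXAMPLE_QUERIES.items()}
```

Encoding the query with `urlencode` matters in practice, because colons, quotes and asterisks in the Lucene query syntax are not safe to paste into a URL unescaped.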

Page 17: Introduction to Apache Lucene/Solr

Solr Basics

Schema

Define Fields, field metadata and Analysis

<field name="name" type="text" indexed="true" stored="true"/>

Solr Config

Define low-level Lucene controls

Specify how clients interact with Solr via Request Handlers (“mini servlets”)

Configure highlighting, spell checking, admin, etc.
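To make the schema field declaration shown above more concrete, here is a small sketch that reads a schema fragment and summarises what each field does. Only the `name` field comes from the slide; the `id` and `price` fields and the `summarize_fields` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical schema.xml fragment; only the "name" field is from the slide.
SCHEMA_FRAGMENT = """
<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="name" type="text" indexed="true" stored="true"/>
  <field name="price" type="float" indexed="true" stored="false"/>
</fields>
"""

def summarize_fields(schema_xml):
    """Return {field name: (type, indexed?, stored?)} for each <field>."""
    root = ET.fromstring(schema_xml)
    summary = {}
    for field in root.iter("field"):
        summary[field.get("name")] = (
            field.get("type"),
            field.get("indexed") == "true",   # searchable
            field.get("stored") == "true",    # returned in results
        )
    return summary

FIELDS = summarize_fields(SCHEMA_FRAGMENT)
```

The distinction the sketch surfaces is the one that usually matters when modeling: `indexed` controls whether a field can be searched, while `stored` controls whether its original value can be returned.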

Page 18: Introduction to Apache Lucene/Solr

Getting Started

1. Install LucidWorks Certified Distribution

2. Model your domain

3. Index your content

4. Test

5. Deploy

Page 19: Introduction to Apache Lucene/Solr

LucidWorks Certified Distribution

Free certified distribution

Installer

Simple

Plugins and enhancements

Updateable

Complete Reference Guide

Support for Linux, Windows, Mac

UI and headless both available

Get started at http://lucene.li/R

Page 20: Introduction to Apache Lucene/Solr

Master Your Domain with Solr

Get to know your content

Get to know your users

Page 21: Introduction to Apache Lucene/Solr

Modeling your Content

Collection/Aggregate

Examine collection level stats, like:

MIME Types

Number of Docs

Update rates

Languages present

Much, much more

Look for patterns and relationships

Identify helpful resources
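One way to gather the collection-level stats listed above is a quick script over a file listing. The sketch below counts guessed MIME types by file extension; the file names are invented, and extension-based guessing is only a first approximation of what content detection inside a real pipeline would report.

```python
import mimetypes
from collections import Counter

def mime_type_counts(filenames):
    """Count guessed MIME types; unknown extensions are grouped together."""
    counts = Counter()
    for name in filenames:
        guessed, _encoding = mimetypes.guess_type(name)
        counts[guessed or "unknown"] += 1
    return counts

# Hypothetical file names standing in for a real crawl or directory listing.
SAMPLE_FILES = [
    "report.pdf",
    "index.html",
    "notes.txt",
    "about.html",
    "data.zzunknown",
]
COUNTS = mime_type_counts(SAMPLE_FILES)
NUMBER_OF_DOCS = sum(COUNTS.values())
```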

Page 22: Introduction to Apache Lucene/Solr

Modeling your Content

Randomly sample a set of your documents

Look for:

Common structures like titles, tables, columns, etc.

Important metadata

Tokenization issues

Try out in http://localhost:8983/solr/admin/analysis.jsp

Importance Indicators

May also look at paragraph, sentence, word and character issues

Page 23: Introduction to Apache Lucene/Solr

Understanding your Users

Sophisticated vs. Simple

Speed and Relevance

Search and Discovery

Search

Faceting

Did you mean?

Similar Pages (More Like This)

Highlighting

UI expectations

Page 24: Introduction to Apache Lucene/Solr

Build your Application

Map your content into Documents and Fields via the Solr schema

Set up your Solr access patterns in solrconfig.xml

Index your content

Search/Browse/Discover

Page 25: Introduction to Apache Lucene/Solr

Indexing

Many Clients

Java, PHP, Ruby, etc.

See example/exampledocs

Example: Upload CSV, Solr XML

<add><doc>

<field name="id">EN7800GTX/2DHTV/256M</field>

<field name="manu">ASUS Computer Inc.</field>

<field name="cat">electronics</field>

</doc></add>
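The Solr XML message above can also be generated from application data rather than written by hand. The sketch below builds the same `<add><doc>` structure and prepares an HTTP POST for it without sending anything. The update URL is an assumption based on the default example server, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request

# Assumed update endpoint for the default example server; not stated on the slide.
UPDATE_URL = "http://localhost:8983/solr/update"

def build_add_xml(docs):
    """Serialise a list of {field: value} dicts as a Solr <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", {"name": name})
            field.text = str(value)   # special characters are escaped on output
    return ET.tostring(add, encoding="unicode")

def build_update_request(xml_body):
    """Prepare (but do not send) the HTTP POST carrying the XML."""
    return Request(
        UPDATE_URL,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )

DOC = {
    "id": "EN7800GTX/2DHTV/256M",
    "manu": "ASUS Computer Inc.",
    "cat": "electronics",
}
ADD_XML = build_add_xml([DOC])
REQUEST = build_update_request(ADD_XML)
# With a running Solr, urllib.request.urlopen(REQUEST) would send the document;
# a commit is needed afterwards before it becomes searchable.
```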

Page 26: Introduction to Apache Lucene/Solr

Search

Clients also support search through API calls

HTTP support by definition:

http://localhost:8983/solr/select/?q=*:*&fl=score,id

http://localhost:8983/solr/select/?q=name:iPod&fl=score,id
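Because search is plain HTTP, any client that can build a URL and parse the response can query Solr. The sketch below requests JSON output with `wt=json` and pulls ids and scores out of a response. The sample response is hand-written in the general shape Solr returns and its values are invented; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

SELECT_URL = "http://localhost:8983/solr/select/"

def build_search_url(q, fields="score,id"):
    """Build a select URL; wt=json asks for JSON instead of the default XML."""
    return SELECT_URL + "?" + urlencode({"q": q, "fl": fields, "wt": "json"})

def extract_hits(response_text):
    """Return (numFound, [(id, score), ...]) from a Solr JSON response."""
    body = json.loads(response_text)["response"]
    return body["numFound"], [(d["id"], d["score"]) for d in body["docs"]]

# Hand-written sample in the shape of a Solr JSON response; values are invented.
SAMPLE_RESPONSE = """
{"responseHeader": {"status": 0, "QTime": 1},
 "response": {"numFound": 2, "start": 0, "maxScore": 1.5,
              "docs": [{"id": "MA147LL/A", "score": 1.5},
                       {"id": "F8V7067-APL-KIT", "score": 0.9}]}}
"""

URL = build_search_url("name:iPod")
NUM_FOUND, HITS = extract_hits(SAMPLE_RESPONSE)
```

Against a live server, the text passed to `extract_hits` would come from fetching `URL`, for example with `urllib.request.urlopen`.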

Page 27: Introduction to Apache Lucene/Solr

Getting to Production

Some Issues to think about:

Scaling

Improving Findability

Page 28: Introduction to Apache Lucene/Solr

Scaling Solr

Get the most out of each machine

Typical Hardware (your mileage may vary):

Modern multicore CPU, Fast disk (SSD?), 4-16 GB RAM

High Query Volume

Large Index

Both

http://lucene.li/V

Page 29: Introduction to Apache Lucene/Solr

Improving Findability

Common Techniques

Analysis:

Lowercase, stemming, synonyms, stopwords, compound analysis (e.g. STR-AV220 -> STR AV 220)

Faceting

Spell Checking

Editorial

See http://lucene.li/U
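The analysis techniques named on this slide are normally configured as a chain of filters in the schema. As a language-neutral illustration, the sketch below applies the same ideas in plain Python: lowercasing, compound splitting, stopword removal, synonym mapping and a crude stemmer. The word lists and every function here are made-up stand-ins, not Solr's analyzers.

```python
import re

# Tiny, made-up stand-ins for real stopword and synonym resources.
STOPWORDS = {"the", "a", "an", "of", "for"}
SYNONYMS = {"tv": "television"}

def split_compound(token):
    """Split on hyphens and letter/digit boundaries: 'str-av220' -> str, av, 220."""
    return re.findall(r"[a-z]+|[0-9]+", token)

def naive_stem(token):
    """Crude plural stripping; real stemmers (e.g. Porter) are far more careful."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def analyze(text):
    """Run the whole chain and return the tokens that would be indexed."""
    tokens = []
    for raw in text.lower().split():
        for part in split_compound(raw):
            if part in STOPWORDS:
                continue
            part = SYNONYMS.get(part, part)
            tokens.append(naive_stem(part))
    return tokens

TOKENS = analyze("The STR-AV220 receivers for TV")
```

Findability improves only if the same chain, or a compatible one, is applied to queries as well, so that a search for "str av 220" and the indexed "STR-AV220" produce the same tokens.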

Page 30: Introduction to Apache Lucene/Solr

Improving Findability

Phrase Queries and other Position-based Queries (SpanQuery)

Disjunction Max Query (aka “DisMax”)

Intent Analysis

Invisible Queries

Fake Queries

Relevance Feedback and “More Like This”

See http://lucene.li/S
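The idea behind the Disjunction Max Query listed above can be shown with a few lines of arithmetic: a document's score is its best per-field score, plus a small tie-breaker share of the other field scores. The field scores and the tie value below are invented; in Solr the per-field scores come from Lucene and the tie-breaker is a request parameter.

```python
def dismax_score(field_scores, tie=0.0):
    """Best per-field score plus `tie` times the sum of the remaining scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

# Hypothetical per-field scores for the same query in two documents.
DOC_A = {"title": 0.9, "body": 0.1}   # strong match in one field
DOC_B = {"title": 0.5, "body": 0.5}   # weak matches in both fields

# Plain summation cannot tell the two documents apart.
SUM_A = sum(DOC_A.values())
SUM_B = sum(DOC_B.values())

# DisMax prefers the document with the single strongest field match.
DISMAX_A = dismax_score(list(DOC_A.values()), tie=0.1)
DISMAX_B = dismax_score(list(DOC_B.values()), tie=0.1)
```

With a tie value of 0 only the best field counts; with a tie value of 1 the formula degenerates into a plain sum.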

Page 31: Introduction to Apache Lucene/Solr

Resources

Websites

http://www.lucidimagination.com

http://search.lucidimagination.com

http://lucene.apache.org/solr

Solr Support

http://www.lucidimagination.com/How-We-Can-Help

[email protected]

Page 32: Introduction to Apache Lucene/Solr

Q&A

Slides are posted for download at http://lucene.li/a; full replay available within ~48 hours of live webcast