lucene, apache

Apache Lucene

-Vinay K (1PI03CS117)

What is Lucene?

• High-performance Information Retrieval (IR) library written in Java

• Creator: Doug Cutting

• Makes it easier for developers to add search engine capabilities to their applications through its simple and powerful API

• Lucene can be thought of as a layer that applications sit on top of

Indexing

• Sequentially scanning every document for each query is extremely slow

• Rapid searching needs a data structure that supports highly efficient look-up

• This data structure is called an ‘index’, and the process of converting text into it is called ‘indexing’

• This is a space-time trade-off: the index requires additional storage space
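
The contrast between scanning and look-up can be sketched in a few lines of Python. This is a toy inverted index, not Lucene's implementation; the names `build_index`, `scan`, and `docs` are made up for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Indexing: map each lowercased word to the sorted ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

def scan(docs, word):
    """Sequential scan: every document is read again for every query."""
    return sorted(d for d, text in docs.items() if word in text.lower().split())

# Hypothetical sample documents.
docs = {
    1: "Lucene is a search library",
    2: "Doug Cutting created Lucene",
    3: "An index trades space for time",
}
index = build_index(docs)   # extra storage, paid once
print(index["lucene"])      # one dictionary look-up → [1, 2]
```

Both functions return the same answer; the index just moves the work from query time to indexing time and pays for it in storage.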

Searching

• Process of looking up words in an index to find documents where they appear.

• Factors affecting quality of search:

– Search time

– Precision

– Result ranking

– Cached Content
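
Of these factors, precision is the easiest to put a number on: it is the fraction of returned documents that are actually relevant. A minimal sketch, assuming relevance judgments are available as a set of document ids (the function name and sample ids are illustrative):

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant (0.0 for an empty result)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)

# Hypothetical search result and relevance judgments.
print(precision([1, 2, 5, 7], {1, 2, 3}))  # 2 of 4 results are relevant → 0.5
```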

Lucene Index Architecture

Index as a ‘Black Box’

An Example Index

• Index - a collection of documents. Each index is contained in its own directory on a file system.

• Document - a sequence of fields. Documents can be dynamically added to or deleted from an index, but cannot be updated in place.

• Field - a name/value pair. E.g.: Author, Publisher
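
The index/document/field relationship can be modelled roughly as below. `Document` and `Index` here are toy Python classes invented for this sketch, not Lucene's Java classes of the same name; the point is that an "update" has to be expressed as a delete followed by an add.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    fields: dict  # field name -> value, e.g. {"Author": "..."}

@dataclass
class Index:
    docs: dict = field(default_factory=dict)  # doc id -> Document
    next_id: int = 0

    def add(self, document):
        doc_id = self.next_id
        self.docs[doc_id] = document
        self.next_id += 1
        return doc_id

    def delete(self, doc_id):
        del self.docs[doc_id]

    def replace(self, doc_id, document):
        """No in-place update: delete the old document, then add the new one."""
        self.delete(doc_id)
        return self.add(document)

idx = Index()
old_id = idx.add(Document({"Author": "A. Writer", "Publisher": "First Press"}))
new_id = idx.replace(old_id, Document({"Author": "A. Writer", "Publisher": "Second Press"}))
```

After `replace`, the document lives under a new id and the old id is gone.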

Lucene Index Structure

Sample Index Directory File Listing

Index Segments

• A Lucene index consists of one or more segments, and each segment is made up of several index files.

• Index files belonging to the same segment share a common prefix and differ in the suffix. The example shows an index with two segments, _lfyc and _gabh.

• Segments let us quickly add new documents to the index, because new documents are written to newly created index segments.
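
The prefix/suffix convention means a directory listing can be grouped back into segments by file name alone. A small sketch: the segment names come from the slide, but the listing itself and the suffixes in it are made up for illustration.

```python
from collections import defaultdict

def group_by_segment(filenames):
    """Group index files by segment prefix; files without a '_prefix.suffix' shape are skipped."""
    segments = defaultdict(list)
    for name in filenames:
        prefix, dot, suffix = name.partition(".")
        if dot and prefix.startswith("_"):
            segments[prefix].append(suffix)
    return {segment: sorted(suffixes) for segment, suffixes in segments.items()}

# Hypothetical directory listing.
listing = ["_lfyc.fnm", "_lfyc.frq", "_lfyc.prx",
           "_gabh.fnm", "_gabh.frq", "_gabh.prx", "segments"]
grouped = group_by_segment(listing)
```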

Key index files in Lucene

• Segments file - A single file holds the active segment information for each index. It lists the segments by name and records the size of each segment.

• Fields information file - Documents in the index are composed of fields, and this file contains the field information for the segment.

• Term information file - This core index file stores all of the terms and related information in the index, sorted by term.

• Frequency file - This file contains the list of documents that contain the terms, along with the term frequency in each document.

• Position file - This file contains the list of positions at which the term occurs within each document.
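
The division of labour between the term, frequency, and position files can be mimicked with three in-memory structures. This is only a sketch of what each file records, assuming whitespace tokenization; it says nothing about Lucene's on-disk encoding, and `build_segment` is a made-up name.

```python
def build_segment(docs):
    """Return (terms, freq, pos) for a dict of doc_id -> text."""
    pos = {}  # term -> {doc_id: [positions]}   (position data)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            pos.setdefault(term, {}).setdefault(doc_id, []).append(position)
    # term -> {doc_id: term frequency}          (frequency data)
    freq = {term: {d: len(p) for d, p in by_doc.items()} for term, by_doc in pos.items()}
    terms = sorted(pos)  # all terms, sorted     (term dictionary)
    return terms, freq, pos

terms, freq, pos = build_segment({0: "to be or not to be", 1: "to do"})
print(terms)         # → ['be', 'do', 'not', 'or', 'to']
print(freq["to"])    # → {0: 2, 1: 1}
print(pos["to"][0])  # → [0, 4]
```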

ANALYZERS

• Analysis is the process of converting field text into its most fundamental indexed representation: terms.

• An analyzer tokenizes text by performing tasks such as:

– extracting words

– discarding punctuation

– removing accents from characters

– lowercasing (also called normalizing)

– removing common words

– reducing words to a root form (stemming)

– changing words into the basic form (lemmatization)

• Analysis is done at 2 points:

– Adding fields to the index

– Preparing the user query for searching
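
Most of the analysis tasks listed above can be chained into a small pipeline. The sketch below is a simplification for illustration: the stop-word list is a made-up sample, `naive_stem` is a crude suffix stripper rather than a real stemmer, and lemmatization is left out.

```python
import re
import unicodedata

STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to"}  # sample list

def strip_accents(text):
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def naive_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    text = strip_accents(text).lower()        # remove accents, lowercase
    words = re.findall(r"[a-z0-9]+", text)    # extract words, discard punctuation
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(analyze("The Café is indexing documents!"))  # → ['cafe', 'index', 'document']
```

Running the same `analyze` function on field text at indexing time and on the user query at search time is what lets the query terms line up with the indexed terms.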

ANALYZERS contd..

• Lucene has 4 analyzers built into it:

– Whitespace Analyzer

– Simple Analyzer

– Stop Analyzer

– Standard Analyzer

• A stream of tokens is the fundamental output of the analysis process.

• During indexing, fields designated for tokenization are processed with the specified analyzer, and each token is written to the index as a term.

• Analyzers don’t help in field separation because their scope is a single field at a time. Instead, documents must be parsed into fields before analysis.
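
The first three built-in analyzers differ mainly in where they split and what they drop, which the rough Python approximations below try to show. These are not ports of the Java classes: the stop-word list is a made-up sample, and the Standard Analyzer's grammar-based tokenization is not reproduced.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "is"}  # sample list

def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()

def simple_analyzer(text):
    """Split at non-letter characters and lowercase."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]

def stop_analyzer(text):
    """Simple analysis plus removal of common words."""
    return [token for token in simple_analyzer(text) if token not in STOP_WORDS]

text = "The quick-brown fox, XY&Z Corp"
print(whitespace_analyzer(text))  # → ['The', 'quick-brown', 'fox,', 'XY&Z', 'Corp']
print(simple_analyzer(text))      # → ['the', 'quick', 'brown', 'fox', 'xy', 'z', 'corp']
print(stop_analyzer(text))        # → ['quick', 'brown', 'fox', 'xy', 'z', 'corp']
```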

SEARCHING

• Lucene provides a powerful Search syntax.

• Supports several kinds of advanced searches:

– Boolean operators - AND, OR, NOT

– Field search - "title:Lucene AND content:Java"

– Wildcard search - "tex*", "tex?", "?ex*"

– Fuzzy search - "Solarus~"

– Range search - "birthday:[20000101 TO 20060606]"
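
What the wildcard, fuzzy, and range queries match can be illustrated on single terms. The sketch below uses Python's `fnmatch` for `*` and `?`, a plain Levenshtein distance for fuzzy matching, and string comparison for ranges; the function names and the `max_edits` threshold are assumptions for illustration, not Lucene's query parser or its matching rules.

```python
from fnmatch import fnmatchcase

def wildcard_match(pattern, term):
    """'*' matches any run of characters, '?' exactly one."""
    return fnmatchcase(term, pattern)

def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(query, term, max_edits=2):
    return edit_distance(query, term) <= max_edits

def range_match(lower, upper, value):
    """Inclusive range; yyyymmdd strings sort in date order."""
    return lower <= value <= upper

print(wildcard_match("tex?", "text"))                    # → True
print(fuzzy_match("solarus", "solaris"))                 # → True
print(range_match("20000101", "20060606", "20030415"))   # → True
```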

Disjunctive Search Algorithm

The search algorithm on an inverted index follows 3 general steps:

1. Vocabulary search

The words and patterns present in the query are isolated and searched in the vocabulary. Notice that phrases and proximity queries are split into single words.

2. Retrieval of occurrences

The list of the occurrences of all the words found is retrieved.

3. Manipulation of occurrences

The occurrences are processed to solve phrases, proximity, or Boolean operations. If block addressing is used, it may be necessary to directly search the text to find the information missing from the occurrences (e.g., extract word positions to form phrases).
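
The three steps can be traced in a short sketch over a positional inverted index. It assumes whitespace tokenization and full word positions (no block addressing); `build_positional_index`, `search`, and the `mode` parameter are made-up names for illustration.

```python
def build_positional_index(docs):
    """term -> {doc_id: [positions]}"""
    index = {}
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(doc_id, []).append(position)
    return index

def search(index, query, mode="or"):
    if mode not in ("or", "and", "phrase"):
        raise ValueError("mode must be 'or', 'and' or 'phrase'")
    # Step 1: vocabulary search - the query is split into single words.
    words = query.lower().split()
    if not words:
        return []
    # Step 2: retrieval of occurrences for every word.
    occurrences = [index.get(word, {}) for word in words]
    # Step 3: manipulation of occurrences.
    doc_sets = [set(occ) for occ in occurrences]
    if mode == "or":
        return sorted(set.union(*doc_sets))
    candidates = set.intersection(*doc_sets)
    if mode == "and":
        return sorted(candidates)
    # Phrase: the words must appear at consecutive positions.
    return sorted(
        doc_id for doc_id in candidates
        if any(all(start + k in occurrences[k][doc_id] for k in range(1, len(words)))
               for start in occurrences[0][doc_id])
    )

docs = {1: "apache lucene search", 2: "lucene apache", 3: "search engine"}
index = build_positional_index(docs)
print(search(index, "apache lucene", "and"))     # → [1, 2]
print(search(index, "apache lucene", "phrase"))  # → [1]
```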

Disjunctive Search Algorithm contd..

• Advantages

– Linear search algorithm computes many document/query similarities that are zero

– Inverted-index approach obviates these computations by only considering documents that have some overlap with the query as ‘early references’.

• Disadvantages

– Exact score of each document cannot be computed in most of the situations.

– Increased disk usage: kind of space-time trade-off

Partial Ranking Algorithm

• The full score of a document is not known until all of the query terms have been processed.

• Algorithm maintains partial scores for each document considered.

• As each posting is processed the partial score for its document is updated. A queue of the current top k-scoring documents is maintained and returned as the result.

• Advantage: Reduction of disk accesses. Rather than reading each document vector from secondary storage, only the p-postings of the query terms need be read from the inverted-index.
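
A minimal sketch of the accumulator idea, assuming each posting already carries a per-term weight for its document. For brevity it selects the top k once at the end with a heap instead of maintaining the queue while postings stream in; `top_k` and the sample postings are illustrative.

```python
import heapq
from collections import defaultdict

def top_k(postings, query_terms, k):
    """postings: term -> list of (doc_id, weight). Returns the k best (doc_id, score) pairs."""
    partial = defaultdict(float)  # partial score per document seen so far
    for term in query_terms:
        # Only the postings of the query terms are read.
        for doc_id, weight in postings.get(term, []):
            partial[doc_id] += weight
    return heapq.nlargest(k, partial.items(), key=lambda item: item[1])

# Hypothetical postings with precomputed weights.
postings = {
    "lucene": [(1, 0.5), (2, 0.25)],
    "index": [(2, 0.5), (3, 0.125)],
}
print(top_k(postings, ["lucene", "index"], 2))  # → [(2, 0.75), (1, 0.5)]
```

Documents that share no term with the query never get an accumulator, so they cost nothing.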

Lucene scoring

Factor: Description

tf(t in d): Term frequency factor for the term (t) in the document (d).

idf(t): Inverse document frequency of the term.

boost(t.field in d): Field boost, as set during indexing.

lengthNorm(t.field in d): Normalization value of a field, given the number of terms within the field. This value is computed during indexing and stored in the index.

coord(q, d): Coordination factor, based on the number of query terms the document contains.

queryNorm(q): Normalization value for a query, given the sum of the squared weights of each of the query terms.
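
One way these factors might fit together is sketched below, for a single field. The factor definitions and the way they are combined are assumptions modelled on Lucene's classic TF-IDF similarity, so they may differ from the formula originally shown on this slide and from any particular Lucene version.

```python
import math

def tf(freq):
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    return 1.0 / math.sqrt(num_terms)

def coord(overlap, max_overlap):
    return overlap / max_overlap

def query_norm(sum_of_squared_weights):
    return 1.0 / math.sqrt(sum_of_squared_weights)

def score(freqs, doc_freqs, num_docs, field_length, field_boost=1.0):
    """freqs[i]: frequency of query term i in the document (0 if absent);
    doc_freqs[i]: number of documents containing query term i."""
    if not freqs:
        return 0.0
    idfs = [idf(df, num_docs) for df in doc_freqs]
    q_norm = query_norm(sum(w * w for w in idfs))
    norm = field_boost * length_norm(field_length)
    # Assumed combination: tf * idf^2 * boost * lengthNorm, summed over matching terms.
    total = sum(tf(f) * w * w * norm for f, w in zip(freqs, idfs) if f > 0)
    overlap = sum(1 for f in freqs if f > 0)
    return coord(overlap, len(freqs)) * q_norm * total

# One query term occurring 4 times in a 4-term field, present in 9 of 10 documents.
print(score([4], [9], num_docs=10, field_length=4))
```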

Shortcomings of Lucene

• No built-in support for synonyms

• Does not allow ‘update’ of documents

• No built-in support for parsing common file formats like .doc, .pdf, …

• Relatively large index size: some text compression mechanism is needed

Questions ?

Thank You!
