javaedge09 : java indexing and searching

Java Indexing and Searching
By: Shay Sofer & Evgeny Borisov

» Motivation
» Lucene Intro
» Hibernate Search
» Indexing
» Searching
» Scoring
» Alternatives

Agenda

Motivation
What is Full Text Search and why do I need it?

Use case: “Book” table

Id  Title                          Price
1   Head First Java                200
2   JBoss in action                120
3   Best jokes about Chuck Norris  250
4   Best of the best of the best   10

Search query: Good practices for Gava

» We’d like to:
   Index the information efficiently
   Answer queries using that index

» More common than you think

Full Text Search

» Integrated full text search engine in the database
   e.g. DBSight, recent versions of MySQL, MS SQL Server, Oracle Text, etc.
» Out of the box Search Appliances
   e.g. Google Search Appliance
» Third party libraries

Full Text Search Solutions


Lucene Intro

» The most popular full text search library
» Scalable and high performance
» Around for about 9 years
» Open source
» Supported by the Apache Software Foundation

Apache Lucene

» “Word-oriented” search
» Powerful query syntax
   Wildcards, typos, proximity search.
» Sorting by relevance (Lucene’s scoring algorithm) or any other field
» Fast searching, fast indexing
   Inverted index.

Lucene’s Features

DB

Id  Title
0   Head First Java
1   Best of the best of the best
2   Chuck Norris in action
3   JBoss in action

Inverted Index

Term    Document ids
Head    0
First   0
Java    0
Action  2 3
Best    1
JBoss   3
Chuck   2
Norris  2
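The term-to-document mapping on this slide can be sketched in a few lines of plain Java. This is only an illustration of the idea, not how Lucene stores its index; the class name `InvertedIndexSketch` and its methods are made up for the example, and document ids are simply the position of each title.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not Lucene's actual data structure.
final class InvertedIndexSketch {
    // term -> ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Splits the text into lower-cased words and records the document id for each.
    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of the documents containing the term (empty if none).
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    // Builds an index where each title's id is its position in the argument list.
    static InvertedIndexSketch ofTitles(String... titles) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        for (int i = 0; i < titles.length; i++) {
            index.add(i, titles[i]);
        }
        return index;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = ofTitles(
                "Head First Java",
                "Best of the best of the best",
                "Chuck Norris in action",
                "JBoss in action");
        System.out.println(index.lookup("action")); // [2, 3]
        System.out.println(index.lookup("best"));   // [1]
    }
}
```

Answering a query becomes a lookup per term instead of a scan over every title, which is why searching stays fast as the table grows.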

» A Field is a key+value. Value is always represented as a String (Textual)
» A Document can contain as many Fields as we’d like
» Lucene’s index is a collection of Documents

Basic Definitions

Using Lucene API…

    IndexSearcher is = new IndexSearcher("BookIndex");
    QueryParser parser = new QueryParser("title", analyzer);
    Query query = parser.parse("Good practices for Gava");
    return is.search(query);


OO domain model Vs. Lucene’s Index structure

[Diagram: OO Domain Model (extensible type system, strong type system, polymorphic) vs. Index structure]

» The Structural Mismatch
   Converting objects to strings and vice versa
   No representation of relations between Documents
» The Synchronization Mismatch
   The DB must be kept in sync with the index
» The Retrieval Mismatch
   Retrieving documents (= pairs of key + value) and not objects

Object vs Flat text mismatches


Hibernate Search

Emmanuel Bernard


» Leverages ORM and Lucene together to solve those mismatches

» Complements Hibernate Core by providing FTS on persistent domain models.

» It’s actually a bridge that hides the sometimes complex Lucene API usage.

» Open source.

» Document = Class (Mapped POJO)
» Hibernate Search metadata can be described by annotations only
» Regardless, you can still use Hibernate Core with XML descriptors (hbm files)
» Let’s create our first mapping – Book

Mapping

    @Entity
    @Indexed
    public class Book implements Serializable {
        @Id
        private Long id;

        @Boost(2.0f)
        @Field
        private String title;

        @Field
        private String description;

        private String imageURL;

        @Field(index = Index.UN_TOKENIZED)
        private String isbn;
        …
    }

» Types will be converted via “Field Bridge”.
» It is a bridge between the Java type and its representation in Lucene (aka String)
» Hibernate Search comes with a set for most standard types (Numbers – primitives and wrappers, Date, Class etc)
» They are extendable, of course

Bridges

» We can use a field bridge…

    @FieldBridge(impl = MyPaddedFieldBridge.class,
                 params = {@Parameter(name = "padding", value = "5")})
    public Double getPrice() {
        return price;
    }

» Or a class bridge, in case the data we want to index is more than just the field itself
   e.g. concatenation of 2 fields

Custom Bridges


» In order to create a custom bridge we need to implement the interface StringBridge

» ParameterizedBridge – to inject params

Custom Bridges
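A padded bridge of the kind referenced above might look roughly like the following. To keep the sketch compilable without Hibernate Search on the classpath, `StringBridge` and `ParameterizedBridge` are declared here as local stand-ins for the real interfaces in `org.hibernate.search.bridge` (the real `setParameterValues` takes a raw `Map`). The class name `PaddedNumberBridgeSketch` and the rounding behaviour are assumptions made for illustration, not the deck's `MyPaddedFieldBridge`.

```java
import java.util.Map;

// Local stand-ins for the Hibernate Search interfaces of the same name
// (assumption: simplified signatures, declared here only so the sketch compiles alone).
interface StringBridge {
    String objectToString(Object object);
}

interface ParameterizedBridge {
    void setParameterValues(Map<String, String> parameters);
}

// Hypothetical bridge: left-pads the rounded number with zeros so that
// comparing the indexed strings gives the same order as comparing the numbers.
final class PaddedNumberBridgeSketch implements StringBridge, ParameterizedBridge {
    private int padding = 5; // default width, replaced by the "padding" parameter if present

    @Override
    public void setParameterValues(Map<String, String> parameters) {
        String value = parameters.get("padding");
        if (value != null) {
            padding = Integer.parseInt(value);
        }
    }

    @Override
    public String objectToString(Object object) {
        if (object == null) {
            return null;
        }
        long rounded = Math.round(((Number) object).doubleValue());
        if (rounded < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        String digits = Long.toString(rounded);
        if (digits.length() > padding) {
            throw new IllegalArgumentException("value does not fit in " + padding + " digits");
        }
        StringBuilder padded = new StringBuilder();
        for (int i = digits.length(); i < padding; i++) {
            padded.append('0');
        }
        return padded.append(digits).toString();
    }
}
```

With a padding of 5, a price of 200 is indexed as "00200" and a price of 10 as "00010", so the strings sort in the same order as the numbers.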

» Directory is where Lucene stores its index structure.
» Filesystem Directory Provider
» In-memory Directory Provider
» Clustering

Directory Providers

» Default
» Most efficient
» Limited only by the disk’s free space
» Can be easily replicated
» Luke support

Filesystem Directory Provider

» Index dies as soon as SessionFactory is closed.
» Very useful when unit testing (alongside in-memory DBs).
» Data can be made persistent at any moment, if needed.
» Obviously, be aware of OutOfMemoryError

In-memory Directory Provider

<property name="hibernate.search.default.directory_provider">
    org.hibernate.search.store.FSDirectoryProvider
</property>

<property name="hibernate.search.com.alphacsp.Book.directory_provider">
    org.hibernate.search.store.RAMDirectoryProvider
</property>

Directory Providers Config Example


» Correlated queries - How do we navigate from one entity to another?

» Lucene doesn’t support relationships between documents

» Hibernate Search to the rescue - Denormalization

Relationships

    @Entity
    @Indexed
    public class Book {
        @ManyToOne
        @IndexEmbedded
        private Author author;
    }

    @Entity
    @Indexed
    public class Author {
        @Field // only indexed fields are embedded into the Book document
        private String firstName;
    }

» Object navigation is easy (author.firstName)

Relationships
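Denormalization means the author's data is copied into the book's own document under a prefixed field name. A minimal plain-Java sketch of that flattening follows; the classes below are simplified stand-ins for the mapped entities, `toDocument` is a hypothetical helper, and a `Map` stands in for a Lucene Document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: shows how an embedded association ends up as flat key/value fields.
final class DenormalizationSketch {
    static final class Author {
        final String firstName;

        Author(String firstName) {
            this.firstName = firstName;
        }
    }

    static final class Book {
        final String title;
        final Author author;

        Book(String title, Author author) {
            this.title = title;
            this.author = author;
        }
    }

    // Flattens a book and its author into one document-like map of field name -> value.
    static Map<String, String> toDocument(Book book) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", book.title);
        if (book.author != null) {
            // The embedded entity's field is stored under "<property>.<field>".
            fields.put("author.firstName", book.author.firstName);
        }
        return fields;
    }

    public static void main(String[] args) {
        Book book = new Book("JBoss in action", new Author("John")); // "John" is a made-up value
        System.out.println(toDocument(book)); // {title=JBoss in action, author.firstName=John}
    }
}
```

Because the author's name now lives inside every book document, a query on `author.firstName` needs no join, but the copies must be refreshed when the author changes.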


» Entities can be referenced by other entities.

Relationships – Denormalization Pitfall

» The solution: The association pointing back to the parent will be marked with @ContainedIn

    @Entity
    @Indexed
    public class Book {
        @ManyToOne
        @IndexEmbedded
        private Author author;
    }

    @Entity
    @Indexed
    public class Author {
        @OneToMany(mappedBy = "author")
        @ContainedIn
        private Set<Book> books;
    }

Relationships – Solution

» Responsible for tokenizing and filtering words
» Tokenizing – not as trivial as it seems
» Filtering – Clearing the noise (case, stop words etc) and applying “other” operations
» Creating a custom analyzer is easy
» The default analyzer is Standard Analyzer

Analyzers

» StandardTokenizer : Splits words and removes punctuation.
» StandardFilter : Removes apostrophes and dots from acronyms.
» LowerCaseFilter : Converts words to lower case.
» StopFilter : Eliminates common words.

Standard Analyzer
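The four steps listed for the Standard Analyzer can be imitated with plain string operations. The sketch below is a rough approximation written for illustration, not Lucene's implementation; the class name `AnalyzerSketch`, the splitting rule and the tiny stop word list are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a toy tokenize -> clean -> lowercase -> stop-word pipeline.
final class AnalyzerSketch {
    // A made-up, very short stop word list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // 1. Tokenize: split on whitespace and common punctuation.
        for (String raw : text.split("[\\s,;:!?()\"]+")) {
            // 2. Clean up: drop a trailing 's, then any remaining dots and apostrophes.
            String token = raw.replaceAll("'s$", "").replace(".", "").replace("'", "");
            // 3. Lower-case.
            token = token.toLowerCase(Locale.ROOT);
            // 4. Skip empty tokens and stop words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Chuck Norris's best jokes in the U.S.A."));
        // [chuck, norris, best, jokes, usa]
    }
}
```

The same pipeline has to run on both the indexed text and the query text, otherwise the terms in the index and in the query will not line up.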


Other cool Filters….

» N-Gram algorithm – Indexing a sequence of n consecutive characters.
» Usually when a typo occurs, part of the word is still correct
   Encyclopedia in 3-grams = Enc | ncy | cyc | ycl | clo | lop | ope | ped | edi | dia

Approximative Search
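The 3-gram split of "Encyclopedia" shown above can be reproduced with a short sliding-window loop. This is a plain-Java sketch, not the Lucene n-gram filter; `NGramSketch` and its `shared` helper are hypothetical names used only to show why a misspelled word still matches.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: character n-grams via a sliding window.
final class NGramSketch {
    // Returns every run of n consecutive characters in the word, in order.
    static List<String> ngrams(String word, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Counts the distinct n-grams of a that also occur in b.
    static long shared(String a, String b, int n) {
        List<String> other = ngrams(b, n);
        return ngrams(a, n).stream().distinct().filter(other::contains).count();
    }

    public static void main(String[] args) {
        System.out.println(ngrams("Encyclopedia", 3));
        // [Enc, ncy, cyc, ycl, clo, lop, ope, ped, edi, dia]
        System.out.println(shared("encyclopedia", "encyclopedea", 3)); // 8
    }
}
```

A typo near the end of the word changes only the last two 3-grams, so eight of the ten still match and the document can still be found.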

» Algorithms for indexing of words by their pronunciation
» The most widely known algorithm is Soundex
» Other algorithms that are available: RefinedSoundex, Metaphone, DoubleMetaphone

Phonetic Approximation
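To show what "indexing by pronunciation" means, here is a compact sketch of the classic American Soundex rules in plain Java. It is written for illustration and is not the encoder that Lucene or Solr would use; the class name `SoundexSketch` is made up.

```java
import java.util.Locale;

// Illustrative only: American Soundex (first letter + three digits).
final class SoundexSketch {
    static String encode(String word) {
        String letters = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (letters.isEmpty()) {
            return "";
        }
        StringBuilder code = new StringBuilder().append(letters.charAt(0));
        char previous = digit(letters.charAt(0));
        for (int i = 1; i < letters.length() && code.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            char d = digit(c);
            if (d != '0' && d != previous) {
                code.append(d);
            }
            previous = d; // a vowel resets this to '0', so a repeated code after it counts again
        }
        while (code.length() < 4) {
            code.append('0'); // pad short codes with zeros
        }
        return code.toString();
    }

    // Maps a letter to its Soundex digit; '0' means the letter is not coded.
    private static char digit(char c) {
        switch (c) {
            case 'B': case 'F': case 'P': case 'V':
                return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z':
                return '2';
            case 'D': case 'T':
                return '3';
            case 'L':
                return '4';
            case 'M': case 'N':
                return '5';
            case 'R':
                return '6';
            default:
                return '0';
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert")); // R163
        System.out.println(encode("Rupert")); // R163
    }
}
```

Words that sound alike get the same code, so indexing the code instead of the spelling lets "Rupert" match a search for "Robert".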

» Synonyms
   You can expand your synonym dictionary with your own rules (e.g. business-oriented words)
» Stemming
   Stemming is the process of reducing words to their stem, base or root form.
   “Fishing”, “Fisher”, “Fish” and “Fished” → Fish
   Snowball stemming language – supports over 15 languages

Synonyms & Stemming


» Lucene is bundled with the basic analyzers, tokenizers and filters.

» More can be found at Lucene’s contribution part and at Apache-Solr

Additional Analyzers

http://lucene.apache.org/java/2_9_0/lucene-contrib/index.html

http://lucene.apache.org/solr/

» No free Hebrew analyzer for Lucene
» Itamar Syn-Hershko
   Involved in the creation of CLucene (The C++ port of Lucene)
   Creating a Hebrew analyzer as a side project
   Looking to join forces

Hebrew?


The Lord of the Rings, first version: The Fellowship of the Ring


» When data has changed?
» Which data has changed?
» When to index the changing data?
» How to do it all efficiently?

Hibernate Search will do it for you!

Transparent indexing

Indexing

Indexing – On Rollback

[Diagram: Application, Session (Entity Manager), DB, Lucene Index and Queue. After “Start Transaction”, insert/update/delete operations go to the DB while the matching index changes are held in the Queue. When the transaction fails and is rolled back, the queued changes never reach the Lucene Index.]

Indexing – On Commit

[Diagram: same components. Once the transaction is committed, the queued insert/update/delete changes are applied to the Lucene Index.]
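The queue shown in the rollback and commit diagrams can be modelled in a few lines. This is a conceptual sketch only, not Hibernate Search's internals: `TransactionalIndexQueueSketch` is a made-up class, strings stand in for index work, and a list stands in for the Lucene index.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: index changes wait in a queue until the transaction outcome is known.
final class TransactionalIndexQueueSketch {
    private final List<String> index = new ArrayList<>(); // stands in for the Lucene index
    private final List<String> queue = new ArrayList<>(); // pending index work

    // Called for every insert/update/delete inside the transaction.
    void onEntityChange(String work) {
        queue.add(work);
    }

    // Commit: apply everything that was queued, in order.
    void commit() {
        index.addAll(queue);
        queue.clear();
    }

    // Rollback: throw the queued work away, so the index never sees it.
    void rollback() {
        queue.clear();
    }

    List<String> indexedWork() {
        return new ArrayList<>(index);
    }

    int pending() {
        return queue.size();
    }

    public static void main(String[] args) {
        TransactionalIndexQueueSketch session = new TransactionalIndexQueueSketch();
        session.onEntityChange("insert Book#1");
        session.rollback();
        System.out.println(session.indexedWork()); // []
        session.onEntityChange("insert Book#2");
        session.commit();
        System.out.println(session.indexedWork()); // [insert Book#2]
    }
}
```

Deferring the index work to commit time is what keeps the index from containing data that the database rolled back.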

<property name="org.hibernate.worker.execution">async</property>

<property name="org.hibernate.worker.thread_pool.size">2</property>

<property name="org.hibernate.worker.buffer_queue.max">10</property>

hibernate.cfg.xml


It’s too late! I already have a database without Lucene!

» FullTextSession extends the Session of Hibernate Core

    Session session = sessionFactory.openSession();
    FullTextSession fts = Search.getFullTextSession(session);

» index(Object entity)
» purge(Class entityType, Serializable id)
» purgeAll(Class entityType)

Manual indexing

    Transaction tx = fullTextSession.beginTransaction();
    // Read the data from the database
    Criteria criteria = fullTextSession.createCriteria(Book.class);
    List<Book> books = criteria.list();
    for (Book book : books) {
        // Add the entity to the Lucene index
        fullTextSession.index(book);
    }
    tx.commit();

Manual indexing

    tx = fullTextSession.beginTransaction();
    List<Integer> ids = getIds();
    for (Integer id : ids) {
        if (…) {
            fullTextSession.purge(Book.class, id);
        }
    }
    tx.commit();

» fullTextSession.purgeAll(Book.class);

Removing objects from the Lucene index

Rrrr!!! I got an OutOfMemoryError!

    // "session" is a FullTextSession here
    session.setFlushMode(FlushMode.MANUAL);
    session.setCacheMode(CacheMode.IGNORE);
    Transaction tx = session.beginTransaction();
    // Scroll through the rows instead of loading them all into memory
    ScrollableResults results = session.createCriteria(Item.class)
            .scroll(ScrollMode.FORWARD_ONLY);
    int index = 0;
    while (results.next()) {
        index++;
        session.index(results.get(0));
        if (index % BATCH_SIZE == 0) {
            session.flushToIndexes(); // apply the queued index work
            session.clear();          // free the memory held by the session
        }
    }
    tx.commit();


Searching

title:lord title:rings
+title:lord +title:rings
title:lord -author:Tolkien
title:r?ngs
title:r*gs
title:"Lord of the Rings"
title:"Lord Rings"~5
title:rengs~0.8
title:lord author:Tolkien^2
And more…

Lucene’s Query Syntax

» To build FTS queries we need to:
   Create a Lucene query
   Create a Hibernate Search query that wraps the Lucene query

Why?
» No need to build a framework around Lucene
» Converting document to object happens transparently.
» Seamless integration with Hibernate Core API

Querying

    String stringToSearch = "rings";
    Term term = new Term("title", stringToSearch);
    TermQuery query = new TermQuery(term);
    FullTextQuery hibQuery = session.createFullTextQuery(query, Book.class);
    List<Book> results = hibQuery.list();

Hibernate Queries Examples

    String stringToSearch = "r??gs";
    Term term = new Term("title", stringToSearch);
    WildcardQuery query = new WildcardQuery(term);
    ...
    List<Book> results = hibQuery.list();

WildcardQuery Example

Use case: Book table

Id  Title                   Price
1   Head First Java         200
2   Chuck Norris in action  120
3   Chuck Norris vs JBoss   120
4   JBoss strikes back      10

Search query: Good practices for Gava

HS Query Flowchart

[Flowchart: the Client issues a Hibernate Search query; the query runs against the Lucene index and receives the matching ids; the objects are then loaded from the Persistence Context, with DB access if needed.]

» You can use list(), uniqueResult(), iterate(), scroll() – just like in Hibernate Core!
» Multistage search engine
» Sorting
» Explanation object

Querying tips


Score

» Mostly based on Salton’s Vector Space Model

Term Rating

$$w_i = \log_{10}\left(\frac{N}{n_i}\right)$$

where $w_i$ is the term weight, $N$ is the number of documents in the index, and $n_i$ is the total number of documents containing term $i$.

Example query: best java in action books

Term Rating Calculation

$$\log_{10}\left(\frac{500}{500}\right) = 0$$

$$\log_{10}\left(\frac{5000}{50}\right) = 2$$

$$\log_{10}\left(\frac{5000}{5}\right) = 3$$

Scoring example

1. Head First Java
2. Best of the best of the best
3. Best examples from Hibernate in action
4. The best action of Chuck Norris

Search for: “best java in action books”

Term    Frequency  Score
Java    1          0.60206
Best    3          0.12494
Action  2          0.30103
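The scores in this example follow from the term rating formula: the base-10 logarithm of the number of documents divided by the number of documents containing the term. A small sketch that reproduces the three values for the four titles follows; `TermRatingSketch` and `termWeight` are illustrative names, and this covers only the term weight, not Lucene's full scoring.

```java
import java.util.Locale;

// Illustrative only: term weight = log10(total documents / documents containing the term).
final class TermRatingSketch {
    static double termWeight(int totalDocs, int docsWithTerm) {
        if (docsWithTerm <= 0 || docsWithTerm > totalDocs) {
            throw new IllegalArgumentException("docsWithTerm must be between 1 and totalDocs");
        }
        return Math.log10((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        int totalDocs = 4; // the four book titles in the example
        System.out.println(String.format(Locale.ROOT, "Java   %.5f", termWeight(totalDocs, 1))); // Java   0.60206
        System.out.println(String.format(Locale.ROOT, "Best   %.5f", termWeight(totalDocs, 3))); // Best   0.12494
        System.out.println(String.format(Locale.ROOT, "Action %.5f", termWeight(totalDocs, 2))); // Action 0.30103
    }
}
```

The rarer a term is across the index, the higher its weight, which is why "Java" (one title) outweighs "Best" (three titles).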

» Conventional Boolean retrieval
» Calculating score for only matching documents
» Customizing similarity algorithm
» Query boosting
» Custom scoring algorithms

Lucene’s scoring approach


Alternatives


Shay Banon


Simple

Lucene based

Configurable via XML or annotations

Local & External TX Manager

Integrates with popular ORM frameworks

Spring support

Distributed

» Enterprise Search Server
   Supports multiple protocols (xml, json, ruby, etc...)
» Runs as a standalone Full Text Search server within a servlet container e.g. Tomcat
» Heavily based on Lucene
» JSA – Java Search API (based on JPA)
   ODM (Object/Document Mapping)
   Spring integration (Transactions)

Apache Solr

» Powerful Web Administration Interface
   Can be tailored without any Java coding!
» Extensive plugin architecture
» Server statistics exposed over JMX
» Scalability – easily replicated

Apache Solr

Resources

Lucene: http://lucene.apache.org/
Lucene contrib part: http://lucene.apache.org/java/2_3_2/contributions.html
Hibernate Search: http://docs.jboss.org/hibernate/stable/search/reference/en/html_single/
Hibernate Search in Action / Emmanuel Bernard, John Griffin: http://www.amazon.com/Hibernate-Search-Action-Emmanuel-Bernard/dp/1933988649
Compass: http://www.compass-project.org/
Apache Solr: http://wiki.apache.org/solr/

Thank you!
Q & A
