javaedge09 : java indexing and searching
DESCRIPTION
From AlphaCSP's Java conference, JavaEdge09. The presentation that Evgeny Borisov and I gave on 'Java Indexing and Searching'. In this session we discussed the need for Full Text Search (as opposed to regular textual/SQL search), Lucene and its OO mismatches, the solution that Hibernate Search provides for those mismatches, and then a bit about Lucene's scoring algorithm.
TRANSCRIPT
1
Java Indexing and Searching
By: Shay Sofer & Evgeny Borisov
2
» Motivation
» Lucene Intro
» Hibernate Search
» Indexing
» Searching
» Scoring
» Alternatives
Agenda
3
Motivation
What is Full Text Search and why do I need it?
4
Id Title Price
1 Head First Java 200
2 JBoss in action 120
3 Best jokes about Chuck Norris 250
4 Best of the best of the best 10
Motivation
Use case: “Book” table
Good practices for Gava
5
» We’d like to:
  Index the information efficiently
  Answer queries using that index
» More common than you think
Full Text Search
Motivation
6
» Integrated full text search engine in the database
  e.g. DBSight, recent versions of MySQL, MS SQL Server, Oracle Text, etc.
» Out-of-the-box search appliances
  e.g. Google Search Appliance
» Third-party libraries
Full Text Search Solutions
Motivation
7
Lucene Intro
8
» The most popular full text search library
» Scalable and high performance
» Around for about 9 years
» Open source
» Supported by the Apache Software Foundation
Apache Lucene
Lucene Intro
9
Lucene Intro
10
» “Word-oriented” search
» Powerful query syntax
  Wildcards, typos, proximity search.
» Sorting by relevance (Lucene’s scoring algorithm) or any other field
» Fast searching, fast indexing
  Inverted index.
Lucene’s Features
Lucene Intro
11
Documents (id, title):
0 Head First Java
1 Best of the best of the best
2 Chuck Norris in action
3 JBoss in action
Inverted index (term, ids of the documents containing it):
Head 0
First 0
Java 0
Action 2 3
Best 1
JBoss 3
Chuck 2
Norris 2
Lucene Intro
Inverted Index DB
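The term-to-document-ids mapping on this slide can be sketched in a few lines of plain Java. This is only a toy illustration of the idea, not Lucene's actual data structure; the class name ToyInvertedIndex and its methods are made up for this example, and tokenizing is reduced to splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not Lucene's real implementation.
final class ToyInvertedIndex {
    // lower-cased term -> sorted ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new LinkedHashMap<>();
    private final List<String> documents = new ArrayList<>();

    // Adds a document and returns the id assigned to it (0, 1, 2, ...).
    int add(String text) {
        int id = documents.size();
        documents.add(text);
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Returns the ids of the documents containing the term (empty if none).
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Head First Java");
        index.add("Best of the best of the best");
        index.add("Chuck Norris in action");
        index.add("JBoss in action");
        System.out.println("action -> " + index.lookup("action"));
    }
}
```

A lookup is a single map access per term, which is why searching stays fast however many documents are indexed.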
12
» A Field is a key + value. The value is always represented as a String (textual)
» A Document can contain as many Fields as we’d like
» Lucene’s index is a collection of Documents
Basic Definitions
Lucene Intro
13
Lucene Intro
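Using Lucene API…
// Open the index stored in the "BookIndex" directory
IndexSearcher is = new IndexSearcher("BookIndex");
// Parse the user's text into a query against the "title" field
QueryParser parser = new QueryParser("title", analyzer);
Query query = parser.parse("Good practices for Gava");
return is.search(query);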
14
OO domain model Vs. Lucene’s Index structure
Lucene Intro
Extensible type system
Strong type system
Polymorphic
OO Domain Model
Index structure
15
» The Structural Mismatch
  Converting objects to strings and vice versa
  No representation of relations between Documents
» The Synchronization Mismatch
  DB must be sync’ed with the index
» The Retrieval Mismatch
  Retrieving documents (= pairs of key + value) and not objects
Object vs Flat text mismatches
Lucene Intro
16
Hibernate Search
Emmanuel Bernard
17
» Leverages ORM and Lucene together to solve those mismatches
» Complements Hibernate Core by providing FTS on persistent domain models.
» It’s actually a bridge that hides the sometimes complex Lucene API usage.
» Open source.
Hibernate Search
18
» Document = Class (Mapped POJO)
» Hibernate Search metadata can be described by annotations only
» Regardless, you can still use Hibernate Core with XML descriptors (hbm files)
» Let’s create our first mapping – Book
Mapping
Hibernate Search
19
@Entity
@Indexed
public class Book implements Serializable {
    @Id
    private Long id;

    @Boost(2.0f)
    @Field
    private String title;

    @Field
    private String description;

    private String imageURL;

    @Field(index = Index.UN_TOKENIZED)
    private String isbn;
    …
}
Hibernate Search
20
» Types will be converted via a “Field Bridge”.
» It is a bridge between the Java type and its representation in Lucene (aka String)
» Hibernate Search comes with a set for most standard types (Numbers – primitives and wrappers, Date, Class etc.)
» They are extendable, of course
Bridges
Hibernate Search
21
» We can use a field bridge…
@FieldBridge(impl = MyPaddedFieldBridge.class,
             params = {@Parameter(name = "padding", value = "5")})
public Double getPrice() {
    return price;
}
» Or a class bridge, in case the data we want to index is more than just the field itself, e.g. a concatenation of 2 fields
Custom Bridges
Hibernate Search
22
» In order to create a custom bridge we need to implement the interface StringBridge
» ParameterizedBridge – to inject params
Custom Bridges
Hibernate Search
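A padded bridge along the lines of the previous slide could look roughly like the sketch below. To keep it compilable on its own, it declares minimal local stand-ins for the StringBridge and ParameterizedBridge interfaces (in a real project they come from org.hibernate.search.bridge, and their exact signatures depend on the Hibernate Search version). The class name PaddedNumberBridge, the rounding to whole numbers, and the assumption of non-negative values are choices made for this example only.

```java
import java.util.Locale;
import java.util.Map;

// Local stand-ins so the sketch compiles without Hibernate Search on the classpath.
interface StringBridge {
    String objectToString(Object object);
}

interface ParameterizedBridge {
    void setParameterValues(Map<String, String> parameters);
}

// Hypothetical bridge: left-pads a (non-negative) number with zeros so that the
// indexed strings sort in the same order as the numbers themselves.
final class PaddedNumberBridge implements StringBridge, ParameterizedBridge {
    private int padding = 5; // default width, overridden by the "padding" parameter

    @Override
    public void setParameterValues(Map<String, String> parameters) {
        String value = parameters.get("padding");
        if (value != null) {
            padding = Integer.parseInt(value);
        }
    }

    @Override
    public String objectToString(Object object) {
        if (object == null) {
            return null;
        }
        long rounded = Math.round(((Number) object).doubleValue());
        return String.format(Locale.ROOT, "%0" + padding + "d", rounded);
    }

    public static void main(String[] args) {
        PaddedNumberBridge bridge = new PaddedNumberBridge();
        System.out.println(bridge.objectToString(120.0));
    }
}
```

Padding matters because Lucene compares field values as strings: without it, "200" would sort before "30".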
23
» Directory is where Lucene stores its index structure.
» Filesystem Directory Provider
» In-memory Directory Provider
» Clustering
Directory Providers
Hibernate Search
24
» Default
» Most efficient
» Limited only by the disk’s free space
» Can be easily replicated
» Luke support
Filesystem Directory Provider
Hibernate Search
25
» Index dies as soon as the SessionFactory is closed.
» Very useful when unit testing (alongside in-memory DBs)
» Data can be made persistent at any moment, if needed.
» Obviously, be aware of OutOfMemoryError
In-memory Directory Provider
Hibernate Search
26
<!-- Hibernate Search Config -->
<property name="hibernate.search.default.directory_provider">
    org.hibernate.search.store.FSDirectoryProvider
</property>
<property name="hibernate.search.com.alphacsp.Book.directory_provider">
    org.hibernate.search.store.RAMDirectoryProvider
</property>
Directory Providers Config Example
Hibernate Search
27
» Correlated queries - How do we navigate from one entity to another?
» Lucene doesn’t support relationships between documents
» Hibernate Search to the rescue - Denormalization
Relationships
Hibernate Search
28
Hibernate Search
29
@Entity
@Indexed
public class Book {
    @ManyToOne
    @IndexEmbedded
    private Author author;
}

@Entity
@Indexed
public class Author {
    @Field // only indexed fields are embedded into the Book document
    private String firstName;
}
» Object navigation is easy (author.firstName)
Relationships
Hibernate Search
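What denormalization means for the index can be sketched without Hibernate at all: the embedded Author is flattened into the Book's own document under a dotted field name. The classes below are plain stand-ins written for this illustration (DenormalizationSketch, toDocument and the sample author name are hypothetical), not the mapped entities or the real indexing code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: shows the shape of a denormalized document.
final class DenormalizationSketch {
    static final class Author {
        final String firstName;
        Author(String firstName) { this.firstName = firstName; }
    }

    static final class Book {
        final String title;
        final Author author;
        Book(String title, Author author) {
            this.title = title;
            this.author = author;
        }
    }

    // Flattens a Book and its embedded Author into one flat key/value document.
    static Map<String, String> toDocument(Book book) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", book.title);
        if (book.author != null) {
            doc.put("author.firstName", book.author.firstName);
        }
        return doc;
    }

    public static void main(String[] args) {
        Book book = new Book("Head First Java", new Author("Jane"));
        System.out.println(toDocument(book));
    }
}
```

Because the author's name is copied into every book document, a change to the Author leaves those copies stale until the books are re-indexed.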
30
» Entities can be referenced by other entities.
Relationships – Denormalization Pitfall
Hibernate Search
33
» The solution: the association pointing back to the parent is marked with @ContainedIn
@Entity
@Indexed
public class Book {
    @ManyToOne
    @IndexEmbedded
    private Author author;
}

@Entity
@Indexed
public class Author {
    @OneToMany(mappedBy = "author")
    @ContainedIn
    private Set<Book> books;
}
Relationships – Solution
Hibernate Search
34
» Responsible for tokenizing and filtering words
» Tokenizing – not as trivial as it seems
» Filtering – clearing the noise (case, stop words etc.) and applying “other” operations
» Creating a custom analyzer is easy
» The default analyzer is Standard Analyzer
Analyzers
Hibernate Search
35
» StandardTokenizer : Splits words and removes punctuation.
» StandardFilter : Removes apostrophes and dots from acronyms.
» LowerCaseFilter : Lower-cases words.
» StopFilter : Eliminates common words.
Standard Analyzer
Hibernate Search
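The four steps above can be imitated in plain Java to see what ends up in the index. This is a rough sketch, not Lucene's StandardAnalyzer: the class name ToyAnalyzer, the splitting rule and the tiny stop-word list are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a hand-rolled tokenize / filter chain.
final class ToyAnalyzer {
    // A tiny, made-up stop-word list; real analyzers ship longer ones.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // 1. Tokenize: split on anything that is not a letter, digit or apostrophe.
        for (String raw : text.split("[^\\p{L}\\p{N}']+")) {
            // 2. Drop apostrophes (a crude imitation of the StandardFilter step).
            String token = raw.replace("'", "");
            // 3. Lower-case the token.
            token = token.toLowerCase(Locale.ROOT);
            // 4. Skip stop words and empty leftovers.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("JBoss in Action"));
    }
}
```

The same chain has to run on the query text as on the indexed text, otherwise "Action" in a query would never match "action" in the index.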
36
Other cool Filters….
Hibernate Search
37
» N-Gram algorithm – indexing sequences of n consecutive characters.
» Usually when a typo occurs, part of the word is still correct
  Encyclopedia in 3-grams = Enc | ncy | cyc | ycl | clo | lop | ope | ped | edi | dia
Approximative Search
Hibernate Search
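Splitting a word into n-grams is a simple sliding window. The sketch below reproduces the Encyclopedia example from the slide; the class and method names are made up for this illustration, and it is not the n-gram filter that ships with Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: character n-grams via a sliding window.
final class NGrams {
    static List<String> of(String word, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        // Slide a window of n characters over the word, one position at a time.
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints "Enc | ncy | cyc | ycl | clo | lop | ope | ped | edi | dia"
        System.out.println(String.join(" | ", of("Encyclopedia", 3)));
    }
}
```

A misspelled query still shares most of its n-grams with the correct word, which is what makes the approximate match possible.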
38
» Algorithms for indexing words by their pronunciation
» The most widely known algorithm is Soundex
» Other algorithms that are available: RefinedSoundex, Metaphone, DoubleMetaphone
Phonetic Approximation
Hibernate Search
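To give a feel for phonetic indexing, here is a simplified, hand-written Soundex encoder: the first letter is kept, the remaining consonants are mapped to digit classes, repeated codes are collapsed, and the result is padded to four characters. It is a sketch for ASCII letters only, and the class name SimpleSoundex is made up; a production system would use a tested library implementation instead.

```java
// Illustrative only: a simplified Soundex encoder for ASCII letters.
final class SimpleSoundex {
    // Digit class for each letter A..Z; '0' marks letters that are not coded
    // (vowels, H, W and Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char previousCode = '0';
        for (char raw : word.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
            } else if (code != '0' && code != previousCode) {
                out.append(code);
            }
            // H and W do not break a run of equal codes; vowels do.
            if (c != 'H' && c != 'W') {
                previousCode = code;
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert") + " " + encode("Rupert"));
    }
}
```

Words that sound alike get the same code, so indexing the code next to the word lets a query for one spelling find the other.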
39
» Synonyms
  You can expand your synonym dictionary with your own rules (e.g. business-oriented words)
» Stemming
  Stemming is the process of reducing words to their stem, base or root form.
  “Fishing”, “Fisher”, “Fish” and “Fished” → “Fish”
  Snowball stemming language – supports over 15 languages
Synonyms & Stemming
Hibernate Search
40
» Lucene is bundled with the basic analyzers, tokenizers and filters.
» More can be found at Lucene’s contribution part and at Apache-Solr
Additional Analyzers
Hibernate Search
41
» No free Hebrew analyzer for Lucene
» Itamar Syn-Hershko
  Involved in the creation of CLucene (the C++ port of Lucene)
  Creating a Hebrew analyzer as a side project
  Looking to join forces
Hebrew?
Hibernate Search
42
Hibernate Search
שר הטבעות, גירסה ראשונה: אחוות הטבעת (“The Lord of the Rings, first version: The Fellowship of the Ring”)
43
» Motivation
» Lucene Intro
» Hibernate Search
» Indexing
» Searching
» Scoring
» Alternatives
Agenda
44
» When has the data changed?
» Which data has changed?
» When to index the changing data?
» How to do it all efficiently?
Hibernate Search will do it for you!
Transparent indexing
Indexing
45
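Indexing – On Rollback
[Diagram: Application → Session (Entity Manager) → DB and Lucene Index. After “Start Transaction”, insert/update/delete operations go to the DB, while the matching index work waits in a Queue in front of the Lucene Index.]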
46
Indexing – On Rollback
[Diagram: Application → Session (Entity Manager) → DB and Lucene Index. The transaction failed: on Rollback, the insert/update/delete work waiting in the Queue is dropped and never reaches the Lucene Index.]
47
Indexing – On Commit
[Diagram: Application → Session (Entity Manager) → DB and Lucene Index. The transaction committed (√): the insert/update/delete work waiting in the Queue is applied to the Lucene Index.]
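The queueing behaviour in the three diagrams can be modelled with a few lines of plain Java: index work is collected while the transaction runs, applied on commit and thrown away on rollback. This is only a model of the idea; the class QueuedIndexer and its methods are invented for the illustration and are not Hibernate Search internals.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: index changes are queued per transaction.
final class QueuedIndexer {
    private final List<String> index = new ArrayList<>(); // stands in for the Lucene index
    private final List<String> queue = new ArrayList<>(); // pending work of the current transaction

    // Called for every insert/update while the transaction is open.
    void onEntityChanged(String document) {
        queue.add(document);
    }

    // Transaction committed: apply the queued work to the index.
    void commit() {
        index.addAll(queue);
        queue.clear();
    }

    // Transaction failed: drop the queued work, the index stays untouched.
    void rollback() {
        queue.clear();
    }

    List<String> indexedDocuments() {
        return new ArrayList<>(index);
    }

    public static void main(String[] args) {
        QueuedIndexer indexer = new QueuedIndexer();
        indexer.onEntityChanged("Head First Java");
        indexer.rollback();
        indexer.onEntityChanged("JBoss in action");
        indexer.commit();
        System.out.println(indexer.indexedDocuments());
    }
}
```

Deferring the index write until commit is what keeps the index from drifting ahead of the database.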
48
<property name="org.hibernate.worker.execution">async</property>
<property name="org.hibernate.worker.thread_pool.size">2</property>
<property name="org.hibernate.worker.buffer_queue.max">10</property>
hibernate.cfg.xml
Indexing
49
It’s too late! I already have a database without Lucene!
Indexing
50
» FullTextSession extends Hibernate Core’s Session
  Session session = sessionFactory.openSession();
  FullTextSession fts = Search.getFullTextSession(session);
» index(Object entity)
» purge(Class entityType, Serializable id)
» purgeAll(Class entityType)
Manual indexing
Indexing
51
tx = fullTextSession.beginTransaction();
// read the data from the database
Criteria criteria = fullTextSession.createCriteria(Book.class);
List<Book> books = criteria.list();
for (Book book : books) {
    // add (or replace) the entity in the Lucene index
    fullTextSession.index(book);
}
tx.commit();
Manual indexing
Indexing
52
tx = fullTextSession.beginTransaction();
List<Integer> ids = getIds();
for (Integer id : ids) {
    if (…) {
        fullTextSession.purge(Book.class, id);
    }
}
tx.commit();
» fullTextSession.purgeAll(Book.class);
Removing objects from the Lucene index
Indexing
53
Rrrr!!! I got an OutOfMemoryError!
Indexing
54
session.setFlushMode(FlushMode.MANUAL);
session.setCacheMode(CacheMode.IGNORE);
Transaction tx = session.beginTransaction();
ScrollableResults results = session.createCriteria(Item.class)
        .scroll(ScrollMode.FORWARD_ONLY);
int index = 0;
while (results.next()) {
    index++;
    session.index(results.get(0));
    if (index % BATCH_SIZE == 0) {
        session.flushToIndexes(); // apply the pending changes to the index
        session.clear();          // free the memory held by the session
    }
}
tx.commit();
Indexing
55
Searching
56
title:lord title:rings
+title:lord +title:rings
title:lord -author:Tolkien
title:r?ngs
title:r*gs
title:"Lord of the Rings"
title:"Lord Rings"~5
title:rengs~0.8
title:lord author:Tolkien^2
And more…
Lucene’s Query Syntax
Searching
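The two wildcard characters in the syntax above are easy to explain in code: ? stands for exactly one character and * for any run of characters. The sketch below translates a wildcard term into a regular expression to show these semantics. Lucene does not evaluate wildcards this way (it works against the terms in its index); the class WildcardSketch is purely an illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: wildcard semantics expressed as a regular expression.
final class WildcardSketch {
    // '?' matches exactly one character, '*' matches any run of characters.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("r?ngs", "rings"));
        System.out.println(matches("r*gs", "rings"));
    }
}
```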
57
» To build FTS queries we need to:
  Create a Lucene query
  Create a Hibernate Search query that wraps the Lucene query
Why?
» No need to build a framework around Lucene
» Converting documents to objects happens transparently.
» Seamless integration with Hibernate Core API
Querying
Searching
58
String stringToSearch = "rings";
Term term = new Term("title", stringToSearch);
TermQuery query = new TermQuery(term);
FullTextQuery hibQuery = session.createFullTextQuery(query, Book.class);
List<Book> results = hibQuery.list();
Hibernate Queries Examples
Searching
59
String stringToSearch = "r??gs";
Term term = new Term("title", stringToSearch);
WildcardQuery query = new WildcardQuery(term);
...
List<Book> results = hibQuery.list();
WildcardQuery Example
Searching
60
Id Title Price
1 Head First Java 200
2 Chuck Norris in action 120
3 Chuck Norris vs JBoss 120
4 JBoss strikes back 10
Motivation
Use case: Book table
Good practices for Gava
61
HS Query Flowchart
Searching
[Flowchart: the Client runs a Hibernate Search Query; the query is executed against the Lucene Index, which returns the matching ids; the objects are then loaded from the Persistence Context, with DB access if needed.]
62
» You can use list(), uniqueResult(), iterate(), scroll() – just like in Hibernate Core!
» Multistage search engine
» Sorting
» Explanation object
Querying tips
Searching
63
Score
64
» Mostly based on Salton’s Vector Space Model
Score
66
Term Rating
Score
$\text{term weight}_i = \log\left(\frac{\text{number of documents in the index}}{\text{total number of documents containing term } i}\right)$
best java in action books
67
Term Rating Calculation
Score
$\log\left(\frac{500}{500}\right) = 0$
$\log\left(\frac{5000}{50}\right) = 2$
$\log\left(\frac{5000}{5}\right) = 3$
68
1. Head First Java
2. Best of the best of the best
3. Best examples from Hibernate in action
4. The best action of Chuck Norris
Scoring example
Score
Search for: “best java in action books”
Term    Frequency  Score
Java    1          0.60206
Best    3          0.12494
Action  2          0.30103
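The scores in this example follow from the term-rating formula: the base-10 logarithm of the number of documents in the index divided by the number of documents containing the term. The sketch below recomputes them for the four titles. It is not Lucene's scoring code; the class TermWeightSketch is made up, splitting on whitespace is a simplification, and giving weight 0 to a term that appears nowhere (such as "books") is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: term weight = log10(total documents / documents containing the term).
final class TermWeightSketch {
    // Number of documents that contain the term at least once.
    static int documentFrequency(String term, List<String> documents) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String doc : documents) {
            Set<String> words = new HashSet<>(
                    Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\s+")));
            if (words.contains(wanted)) {
                count++;
            }
        }
        return count;
    }

    // Terms that occur in no document get weight 0 (an assumption of this sketch).
    static double weight(String term, List<String> documents) {
        int df = documentFrequency(term, documents);
        return df == 0 ? 0.0 : Math.log10((double) documents.size() / df);
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Head First Java",
                "Best of the best of the best",
                "Best examples from Hibernate in action",
                "The best action of Chuck Norris");
        for (String term : "best java in action books".split(" ")) {
            System.out.printf(Locale.ROOT, "%s %.5f%n", term, weight(term, titles));
        }
    }
}
```

The rarer a term is across the index, the more a match on it says about the document, which is why "Java" outweighs "Best" here.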
69
» Conventional Boolean retrieval
» Calculating score for only matching documents
» Customizing similarity algorithm
» Query boosting
» Custom scoring algorithms
Lucene’s scoring approach
Score
70
Alternatives
71
Alternatives
Shay Banon
72
Alternatives
Simple
Lucene based
Configurable via XML or annotations
Local & External TX Manager
Integrates with popular ORM frameworks
Spring support
Distributed
73
Alternatives
74
» Enterprise Search Server
  Supports multiple protocols (xml, json, ruby, etc.)
» Runs as a standalone Full Text Search server within a servlet container, e.g. Tomcat
» Heavily based on Lucene
» JSA – Java Search API (based on JPA)
  ODM (Object/Document Mapping)
  Spring integration (Transactions)
Apache Solr
Alternatives
75
» Powerful Web Administration Interface
  Can be tailored without any Java coding!
» Extensive plugin architecture
» Server statistics exposed over JMX
» Scalability – easily replicated
Apache Solr
Alternatives
76
Resources
Lucene
Lucene contrib part
Hibernate Search
Hibernate Search in Action / Emmanuel Bernard, John Griffin
Compass
Apache Solr
77
Thank you!
Q & A