Grouping and Joining in Lucene/Solr

Grouping & Joining, Martijn van Groningen, Lucene Committer & PMC Member, Thursday, May 17, 2012

DESCRIPTION

Presented by Martijn van Groningen, SearchWorkings (conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012). In the real world, data isn’t flat; it is often modelled in complex ways. Lucene is document oriented and doesn’t support relations natively. The only way to index such data has been to de-normalize the relations into a document with many fields and to execute subsequent queries. Subsequent queries can be expensive, and de-normalized data gets duplicated, so this isn’t always ideal. Solr and Lucene recently gained features that allow you to join and group. You can join and group on fields across documents and still have the power of Lucene’s free text search. In this presentation, we’ll look at these new alternatives, their advantages and disadvantages, and how the features can be used. We’ll also look at how these new capabilities impact the design of Solr-based search applications, primarily from infrastructure and operational perspectives.

TRANSCRIPT

Page 1: Grouping and Joining in Lucene/Solr

Grouping & Joining

Martijn van Groningen, Lucene Committer & PMC Member

Thursday, May 17, 2012

Page 2: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Overview

Grouping & Joining

‣ Background

‣ Joining

‣ Result grouping

‣ Conclusion

Page 3: Grouping and Joining in Lucene/Solr

Lucene’s model

Background

‣ Lucene is document based.

‣ Lucene doesn’t store information about relations between documents.

‣ Data often holds relations.

‣ Good free text search over relational data.

Page 4: Grouping and Joining in Lucene/Solr

Example

Background

‣ Product

  ‣ Name

  ‣ Description

‣ Product-item

  ‣ Color

  ‣ Size

  ‣ Price

‣ Goal: Show the most applicable product based on product-item criteria.

Page 5: Grouping and Joining in Lucene/Solr

Common Lucene solutions

Background

‣ Compound documents.

  ‣ May result in documents with many fields.

‣ Subsequent searches.

  ‣ May cause a lot of network overhead.

‣ Non-Lucene-based approach:

  ‣ If free text search isn’t very important, use a relational database.

Page 6: Grouping and Joining in Lucene/Solr

Example domain

Background

‣ Compound Product & Product-items document.

‣ Each product-item has its own field prefix.
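
As a rough illustration of the compound-document approach, the sketch below flattens one product and its product-items into a single flat field map. It is a toy model in plain Python, not Lucene code; the `item1_`, `item2_` prefix scheme and the function name are assumptions made for this example.

```python
def to_compound_document(product, items):
    """Flatten a product and its items into one flat field map."""
    doc = {"name": product["name"], "description": product["description"]}
    for i, item in enumerate(items, start=1):
        prefix = f"item{i}_"  # hypothetical per-item field prefix
        for field in ("color", "size", "price"):
            doc[prefix + field] = item[field]
    return doc


product = {"name": "T-shirt", "description": "Plain cotton shirt"}
items = [
    {"color": "red", "size": "M", "price": 10},
    {"color": "blue", "size": "L", "price": 12},
]
compound = to_compound_document(product, items)
```

The field count grows with the number of product-items, which is the "many fields" drawback mentioned earlier, and adding or removing an item means rebuilding the whole document.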

Page 7: Grouping and Joining in Lucene/Solr

Different solutions

Background

‣ Lucene offers solutions for 'relational'-like search.

‣ Parent child

‣ Grouping & joining aren't naturally supported.

‣ All the solutions do increase the search time.

‣ In some scenarios grouping and joining aren't the right solution.

Page 8: Grouping and Joining in Lucene/Solr

Joining

Modelling relations

Page 9: Grouping and Joining in Lucene/Solr

Introduction

Joining

‣ Support for parent-child-like search since Lucene 3.4.

‣ Not a SQL join.

‣ The parent and each of its children are stored as documents.

‣ Two types:

‣ Index time join

‣ Query time join

Page 10: Grouping and Joining in Lucene/Solr

Index time join

Joining

‣ Two block join queries:

‣ ToParentBlockJoinQuery

‣ ToChildBlockJoinQuery

‣ One Lucene collector:

‣ ToParentBlockJoinCollector

‣ Index time join requires block indexing.

Page 11: Grouping and Joining in Lucene/Solr

Block indexing

Joining

‣ Atomically adding documents.

‣ A block of documents.

‣ Each document gets a sequentially assigned Lucene document id.

‣ IndexWriter#addDocuments(docs);

Page 12: Grouping and Joining in Lucene/Solr

Block indexing

Joining

‣ Index doesn't record blocks.

‣ Segment merging doesn’t re-order documents in a segment.

‣ App is responsible for identifying block documents.

‣ Marking the last document in a block.

‣ Adding a document to a block requires you to reindex the whole block.

‣ Removing a document from a block doesn’t require reindexing the block.

Page 13: Grouping and Joining in Lucene/Solr

Example domain

Joining

‣ Parent is the last document in a block.
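
The block layout described here can be modelled in a few lines. This is a toy in-memory index in plain Python, not the Lucene `IndexWriter`; the `ToyIndex` class, the `product_block` helper and the `type` marker field are assumptions made for illustration. It only shows the two properties that matter for block joins: ids within a block are contiguous, and the parent comes last.

```python
class ToyIndex:
    """Minimal stand-in for an index that assigns sequential document ids."""

    def __init__(self):
        self.docs = []

    def add_documents(self, block):
        """Append a whole block at once and return the ids it received."""
        first_id = len(self.docs)
        self.docs.extend(block)
        return list(range(first_id, len(self.docs)))


def product_block(product, items):
    """Build a block: all children first, the marked parent last."""
    children = [dict(item, type="item") for item in items]
    parent = dict(product, type="product")  # hypothetical marker field
    return children + [parent]


index = ToyIndex()
ids_shirt = index.add_documents(product_block(
    {"name": "T-shirt"},
    [{"color": "red", "size": "M"}, {"color": "blue", "size": "L"}],
))
ids_jacket = index.add_documents(product_block(
    {"name": "Jacket"},
    [{"color": "red", "size": "S"}],
))
```

Because nothing in the index records where a block starts or ends, the marker on the parent is the only way to recover block boundaries at search time.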

Page 14: Grouping and Joining in Lucene/Solr

Block indexing

Joining

Marking parent documents

Page 15: Grouping and Joining in Lucene/Solr

Block indexing

Joining

Add block

Page 16: Grouping and Joining in Lucene/Solr

ToParentBlockJoinQuery

Joining

‣ Parent filter marks the parent documents.

‣ Child query is executed in the parent space.

‣ ToChildBlockJoinQuery works in the opposite direction.
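
To make the child-to-parent direction concrete, here is a toy model of the idea in plain Python. It is not the Lucene implementation and does no scoring; `join_to_parent`, `is_parent` and `child_matches` are hypothetical names. It assumes documents were indexed in blocks with the parent last, so a single forward scan can map matching children to the parent that closes their block.

```python
def join_to_parent(docs, is_parent, child_matches):
    """Return ids of parents whose block holds at least one matching child."""
    hits = []
    block_has_match = False
    for doc_id, doc in enumerate(docs):
        if is_parent(doc):
            # The parent closes the block: emit it if any child matched.
            if block_has_match:
                hits.append(doc_id)
            block_has_match = False
        elif child_matches(doc):
            block_has_match = True
    return hits


docs = [
    {"type": "item", "color": "red", "size": "M"},
    {"type": "item", "color": "blue", "size": "L"},
    {"type": "product", "name": "T-shirt"},
    {"type": "item", "color": "red", "size": "S"},
    {"type": "product", "name": "Jacket"},
]


def is_parent(doc):
    return doc["type"] == "product"


red_medium = join_to_parent(
    docs, is_parent, lambda d: d["color"] == "red" and d["size"] == "M")
any_red = join_to_parent(docs, is_parent, lambda d: d["color"] == "red")
```

Note that both criteria in `red_medium` must hold on the same child, which is something a compound document with one field per attribute cannot easily express.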

Page 17: Grouping and Joining in Lucene/Solr

Query time joining

Joining

‣ Query time joining is executed in two phases and is field based:

‣ fromField

‣ toField

‣ Doesn’t require block indexing.

Page 18: Grouping and Joining in Lucene/Solr

Query time joining

Joining

‣ First phase collects all the terms in the fromField for the documents that match with the original query.

‣ Currently doesn’t take the score from original query into account.

‣ The second phase returns the documents that match with the collected terms from the previous phase in the toField.

‣ Two different implementations:

‣ JoinUtil - Lucene (≥ 3.6)

‣ Join query parser - Solr (trunk)
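
The two phases can be sketched as follows. This is a toy model over plain Python dictionaries, not `JoinUtil` or the Solr join query parser; the function name and the `id` / `product_id` field names are assumptions for the example. As in the description above, the score of the original query is not carried over.

```python
def query_time_join(docs, from_field, to_field, from_query):
    """Toy two-phase join: collect terms, then match them in another field."""
    # Phase 1: collect the from_field terms of documents matching the query.
    terms = {
        doc[from_field]
        for doc in docs
        if from_field in doc and from_query(doc)
    }
    # Phase 2: return documents whose to_field holds one of those terms.
    return [doc for doc in docs if to_field in doc and doc[to_field] in terms]


docs = [
    {"id": "p1", "name": "T-shirt"},
    {"id": "p2", "name": "Jacket"},
    {"product_id": "p1", "color": "red"},
    {"product_id": "p1", "color": "blue"},
    {"product_id": "p2", "color": "green"},
]

# Find products that have at least one red product-item.
products = query_time_join(
    docs,
    from_field="product_id",
    to_field="id",
    from_query=lambda d: d.get("color") == "red",
)
```

Since the join is driven purely by field values, the documents do not need to be adjacent in the index, which is why no block indexing is required.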

Page 19: Grouping and Joining in Lucene/Solr

Query time joining - Indexing

Joining

Refers to the product id.

Page 20: Grouping and Joining in Lucene/Solr

Query time joining - Indexing

Joining

Page 21: Grouping and Joining in Lucene/Solr

Query time joining

Joining

‣ Result will contain one product.

‣ Possible to join over two indices.

Page 22: Grouping and Joining in Lucene/Solr

Final thoughts

Joining

‣ Joining module has good solutions to model parent child relations.

‣ Use block join if you care about scoring.

‣ Frequent updates can be problematic.

‣ Use query time join for parent child filtering.

‣ Query time join is slower than index time join.

‣ Mostly a Lucene feature only.

‣ All code is annotated as experimental.

Page 23: Grouping and Joining in Lucene/Solr

Result grouping

Previously known as Field Collapsing.

Page 24: Grouping and Joining in Lucene/Solr

Introduction

Result grouping

‣ Group matching documents that share a common property.

‣ Search hit represents a group.

‣ Facet counts & total hit count represent groups.

‣ Per group collect information:

  ‣ Most relevant document.

  ‣ Top three documents.

  ‣ Aggregated counts.

Page 25: Grouping and Joining in Lucene/Solr

Usages

Result grouping

‣ Group documents by a shared property

‣ Product-item by product id (Parent child)

‣ Collapse similar looking documents

‣ E.g. all results from the Wikipedia domains.

‣ Remove duplicates from the search result.

‣ Based on a field that contains a hash
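
The duplicate-removal use case might look like the sketch below: a hash of the normalized content is stored with each document at index time, and at search time only the best-ranked hit per hash is kept. This is a plain-Python illustration, not Lucene or Solr code; the `hash` field name, the choice of SHA-1 and the whitespace/case normalization are all assumptions.

```python
import hashlib


def content_hash(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def collapse_duplicates(ranked_hits):
    """Keep the first (best-ranked) hit for every distinct hash value."""
    seen = set()
    unique = []
    for hit in ranked_hits:
        if hit["hash"] not in seen:
            seen.add(hit["hash"])
            unique.append(hit)
    return unique


# Hits are assumed to arrive sorted by relevance, best first.
ranked_hits = [
    {"id": 1, "hash": content_hash("Blue  Shirt")},
    {"id": 2, "hash": content_hash("blue shirt")},
    {"id": 3, "hash": content_hash("Red shirt")},
]
unique_hits = collapse_duplicates(ranked_hits)
```

This is grouping with one document per group, where the group key is the hash field.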

Page 26: Grouping and Joining in Lucene/Solr

Example domain

Result grouping

‣ Each Product-item is a document, but includes the product data.

Page 27: Grouping and Joining in Lucene/Solr

Implementation

Result grouping

‣ Result grouping implemented with Lucene collectors.

‣ Module in trunk and a contrib in 3.x versions.

‣ Two pass result grouping.

‣ Grouping by indexed field, function or doc values.

‣ Single pass result grouping.

‣ Requires block indexing.

Page 28: Grouping and Joining in Lucene/Solr

Two pass implementation

Result grouping

‣ First pass collects the top N groups.

‣ Per group: group value + sort value

‣ Second pass collects data for each top group.

‣ The top N documents per group.

‣ Possible other aggregated information.

‣ Second pass search ignores all documents outside the top N groups.
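
The two passes can be modelled roughly as below. This is a toy version over a list of scored hits in plain Python, not the Lucene grouping collectors; the function names and the `score` key are assumptions, and groups are ranked by their best document score, which is only one possible group sort.

```python
def first_pass(hits, group_field, top_n_groups):
    """Pass 1: keep group value + sort value, return the top N group values."""
    best_score = {}
    for hit in hits:
        group = hit[group_field]
        if group not in best_score or hit["score"] > best_score[group]:
            best_score[group] = hit["score"]
    ranked = sorted(best_score.items(), key=lambda kv: kv[1], reverse=True)
    return [group for group, _ in ranked[:top_n_groups]]


def second_pass(hits, group_field, top_groups, docs_per_group):
    """Pass 2: collect top documents, only for the groups found in pass 1."""
    wanted = set(top_groups)
    grouped = {group: [] for group in top_groups}
    for hit in hits:
        if hit[group_field] in wanted:  # everything else is ignored
            grouped[hit[group_field]].append(hit)
    return [
        (g, sorted(grouped[g], key=lambda h: h["score"], reverse=True)[:docs_per_group])
        for g in top_groups
    ]


hits = [
    {"id": "a", "product_id": "p1", "score": 0.9},
    {"id": "b", "product_id": "p2", "score": 0.8},
    {"id": "c", "product_id": "p1", "score": 0.5},
    {"id": "d", "product_id": "p3", "score": 0.3},
    {"id": "e", "product_id": "p2", "score": 0.7},
    {"id": "f", "product_id": "p1", "score": 0.1},
]
top_groups = first_pass(hits, "product_id", top_n_groups=2)
result = second_pass(hits, "product_id", top_groups, docs_per_group=2)
```

The search runs twice over the matching documents, which is where the extra query time of two-pass grouping comes from.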

Page 29: Grouping and Joining in Lucene/Solr

Result grouping - Indexing

Result grouping

Page 30: Grouping and Joining in Lucene/Solr

Result grouping - Searching

Result grouping

Page 31: Grouping and Joining in Lucene/Solr

Result grouping made easier

Result grouping

‣ GroupingSearch

‣ Solr

‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id

‣ Many more options:

‣ http://wiki.apache.org/solr/FieldCollapsing
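
A request like the Solr one above can also be assembled programmatically. The sketch below builds the same URL with Python's standard library; `myhost` is the placeholder host from the example, and `group.limit` (documents returned per group) is added here only as an illustration of one of the further options.

```python
from urllib.parse import urlencode

# Parameters of the grouped search; dicts keep insertion order.
params = {
    "q": "shirt",
    "group": "true",
    "group.field": "product_id",
    "group.limit": 3,  # extra option added for this example
}
url = "http://myhost/solr/select?" + urlencode(params)
```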

Page 32: Grouping and Joining in Lucene/Solr

Parent child result

Result grouping

‣ TopGroups - Equivalent to TopDocs.

‣ Hit count

‣ Group count

‣ Groups

‣ Top documents

‣ Facet and total count can represent groups instead of documents.

‣ But requires more query time.

Page 33: Grouping and Joining in Lucene/Solr

Conclusion

Compare...

Page 34: Grouping and Joining in Lucene/Solr

Compare the parent child solutions

Conclusion

‣ Result grouping

‣ + Distributed support & Parent child relation as hit.

‣ - Parent data duplication

‣ - Impact on query time

‣ Joining

‣ + Fast & no data duplication

‣ - Index time join not optimal for updates

‣ - Query time join is limited.

Page 35: Grouping and Joining in Lucene/Solr

Compare the parent child solutions

Conclusion

‣ Compound documents.

‣ + Fast and works out of the box with all features.

‣ - Not flexible when it comes to updates.

‣ - Document granularity is set in stone.

Page 36: Grouping and Joining in Lucene/Solr

Any questions?

Page 37: Grouping and Joining in Lucene/Solr

Extra slides

We have time left!

Page 38: Grouping and Joining in Lucene/Solr

Future work

Conclusion

‣ Higher level parent-child API.

‣ Needs to cover search & indexing.

‣ Joining

‣ Distributed support.

‣ Represent a hit as a parent child relation in the search result.

‣ Result grouping

‣ Aggregated grouped information like: sum, avg, min, max etc.

Page 39: Grouping and Joining in Lucene/Solr

ToParentBlockJoinCollector

Joining

‣ TopGroups contains one group for each of the top N parent documents.

‣ Each group contains a parent and child documents.

Page 40: Grouping and Joining in Lucene/Solr

Groups & facet counts

Result grouping

‣ Faceting and result grouping are different features.

‣ But are often used together!

‣ Facet counts can be based on:

‣ Found documents.

‣ Found groups.

‣ Combination of facet value and group.

‣ All options are supported in Solr.
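
The three ways of counting can give different numbers for the same result set. The sketch below is a toy model in plain Python, not Solr's faceting code; the function names are hypothetical, and reading "found groups" as "count only the most relevant document of each group" is an interpretation made for this example.

```python
from collections import Counter


def facet_by_documents(hits, facet_field):
    """Count every matching document."""
    return Counter(hit[facet_field] for hit in hits)


def facet_by_top_document(hits, facet_field, group_field):
    """Count only the most relevant document of each group."""
    top = {}
    for hit in hits:
        group = hit[group_field]
        if group not in top or hit["score"] > top[group]["score"]:
            top[group] = hit
    return Counter(hit[facet_field] for hit in top.values())


def facet_by_value_and_group(hits, facet_field, group_field):
    """Count distinct groups per facet value."""
    pairs = {(hit[facet_field], hit[group_field]) for hit in hits}
    return Counter(value for value, _ in pairs)


hits = [
    {"product_id": "p1", "color": "red", "score": 0.9},
    {"product_id": "p1", "color": "red", "score": 0.6},
    {"product_id": "p1", "color": "blue", "score": 0.5},
    {"product_id": "p2", "color": "red", "score": 0.8},
    {"product_id": "p2", "color": "blue", "score": 0.4},
]
by_docs = facet_by_documents(hits, "color")
by_top = facet_by_top_document(hits, "color", "product_id")
by_pair = facet_by_value_and_group(hits, "color", "product_id")
```

Here "red" is counted 3 times per document but 2 times per product, while "blue" disappears entirely when only each product's top document is counted.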

Page 41: Grouping and Joining in Lucene/Solr

Doc values

Result grouping

‣ Doc values / Column Stride values

‣ Prevents the creation of expensive data structures in FieldCache.

‣ Inverted index is meant for free text search.

‣ All grouping collectors have doc values based implementations!
