Lucene/Solr Revolution 2016 Participation Report

Shinpei Nakata, Search Core Team, ECPD, Rakuten Inc.
twitter: @shinpeinkt
Dec/13/2016, 19th Lucene/Solr Study Meetup, GranTokyo South Tower

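Lucene/Solr Revolution

• An event hosted by Lucidworks, focused on Lucene/Solr
  – In the same vein as Cassandra Summit, Spark Summit, etc.
  – Registration fee: $1095
  – The 6th edition this year? (The first is said to have been in 2011, in Boston)
    • Perhaps 2010, the Lucid Imagination era, is not counted?
  – This year it was held in Boston, MA, USA
• Scale
  – 2 days
  – Attendees: 800+
  – 56 talks, 63 speakers
  – Many committers (17 talks)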

(Photo by me)

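About me and Rakuten

• Rakuten, Inc., Search Core Team
  – Development of search engines for internal services
  – Bug fixing, implementation, proposing new features
    • Lately playing with Solr 6
• OSC program
  – A chance, mainly for engineers, to attend an international conference once every two years
  – Fortunately I got to go this year
• Personal activities
  – Go as a hobby
  – Blog: http://shinpei.github.io

(Photo by me)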

Lucene/Solr Revolution overview

• Data science (11)
  – Relevancy tuning, Recommendation, BigData
• Ecosystem (10)
  – Combination with other software (Docker, UI, Spark, CI, Durability)
• Exploring Solr (14)
  – Streaming (Solr6), Security, Numeric points (Solr6)
• Keynote (7)
  – a.k.a. big-company use cases: IBM Watson, Salesforce, Commvault...
• Use case (17)
  – SIE, Bloomberg, Flipkart, Rakuten, Tech consultants...

Data Science

• "Working with Deeply Nested Documents in Apache Solr", Anshum Gupta, Alisa Zhila, IBM Watson
  – How to handle deeply nested documents in Solr
  – With recent Solr features you can do quite a lot
• Deeply nested document?
  – e.g., replies to comments on a blog article

(Figure: a post with a title and Comments; a comment with a title and Replies; a reply with a title)

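(Recap) Nested Document [1/3]

• Lucene can only hold a flat index
  – Parents and children are both stored as independent documents
  – A parent and its children are placed in a contiguous docid range
  – The structure can be walked sequentially from child to parent and from parent to child
• One more field is used to tell parents and children apart
  – e.g., <bool fieldName="isParent">false</bool>

(Figure: a Lucene index segment, docids 1 2 3 4; nested documents are stored at contiguous docids. Legend: Child, Parent)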
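The flat layout described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample posts, the field names other than isParent, and the helper names flatten_block and build_segment are made up here, and a plain dict stands in for a Lucene segment. It places each block's children first and its parent last, which is the order block joins in Lucene rely on.

```python
def flatten_block(post):
    """Turn one nested post into a flat block: children first, parent last."""
    block = []
    for comment in post.get("comments", []):
        block.append({"id": comment["id"], "text": comment["text"], "isParent": False})
    block.append({"id": post["id"], "text": post["text"], "isParent": True})
    return block


def build_segment(posts):
    """Concatenate blocks; a docid is just the position in the segment (1-based)."""
    flat = []
    for post in posts:
        flat.extend(flatten_block(post))
    return {docid: doc for docid, doc in enumerate(flat, start=1)}


# Hypothetical sample data: two posts with three comments each.
posts = [
    {"id": "post-1", "text": "About Solr", "comments": [
        {"id": "c-1", "text": "Great post"},
        {"id": "c-2", "text": "I prefer Elastic"},
        {"id": "c-3", "text": "Agreed"},
    ]},
    {"id": "post-2", "text": "Solr features", "comments": [
        {"id": "c-4", "text": "One is missing"},
        {"id": "c-5", "text": "Different version?"},
        {"id": "c-6", "text": "Thanks"},
    ]},
]

segment = build_segment(posts)
for docid, doc in segment.items():
    print(docid, "Parent" if doc["isParent"] else "Child", doc["id"])
```

With three comments per post, docids 1 to 3 and 5 to 7 are children and docids 4 and 8 are parents, matching the 1 2 3 4 layout in the figure.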

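• Matching a parent through its child
  – Parents (articles) that have a child (a reply on the blog) containing "Elastic"

(Figure: docids 1 to 8; legend: Child, Parent; one document is marked "Elastic")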

1. Searching children
   q=text:"Elastic" & fq=isParent:false
2. Searching from child to parent (Block Join Query)
   q={!parent which="isParent:true"}text:"Elastic"

(Figure: docids 1 to 8 with the "Elastic" match highlighted, once for each query)
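The two searches can be imitated on a toy segment to show why the contiguous docid layout matters. This is a minimal sketch, not Solr's implementation: the segment contents and the function names are assumptions, and substring matching stands in for a real term query.

```python
# Toy segment in docid order: each block is children first, then the parent.
SEGMENT = [
    {"docid": 1, "isParent": False, "text": "Nice article"},
    {"docid": 2, "isParent": False, "text": "I prefer Elastic"},
    {"docid": 3, "isParent": False, "text": "Thanks"},
    {"docid": 4, "isParent": True, "text": "About Solr"},
    {"docid": 5, "isParent": False, "text": "One feature is missing"},
    {"docid": 6, "isParent": False, "text": "Different version?"},
    {"docid": 7, "isParent": False, "text": "Good summary"},
    {"docid": 8, "isParent": True, "text": "Solr features"},
]


def search_children(segment, term):
    """Roughly q=text:term & fq=isParent:false."""
    return [d["docid"] for d in segment if not d["isParent"] and term in d["text"]]


def search_parents_of(segment, term):
    """Roughly q={!parent which="isParent:true"}text:term.

    Because a block is contiguous, one forward scan is enough: remember
    whether any child in the current block matched, and emit the parent
    that closes the block.
    """
    parents = []
    matched_in_block = False
    for doc in segment:
        if doc["isParent"]:
            if matched_in_block:
                parents.append(doc["docid"])
            matched_in_block = False
        elif term in doc["text"]:
            matched_in_block = True
    return parents


print(search_children(SEGMENT, "Elastic"))
print(search_parents_of(SEGMENT, "Elastic"))
```

The child search returns docid 2, and the block join returns docid 4, the parent that closes the block containing docid 2.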

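Deeply Nested Document

• Basically the same as Nested
  – A document at any level is one independent document
    • IDs are unique regardless of level
• Introduction of "path"
  – To distinguish fields that share a name but sit at different levels
    • "1.blog-posts.comments.title"
    • "2.blog-posts.comments.replies.title"
• A Preprocessor is provided to keep things simple
  – Pass it nested JSON and Paths and IDs are assigned automatically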
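To make the idea of such a preprocessor concrete, here is a rough sketch. It is not the tool from the talk: the function name preprocess, the integer ID scheme, and the sample JSON are assumptions, and the path format follows the "level.dotted-names" style used in the search examples (1.blog-posts, 2.blog-posts.comments, 3.blog-posts.comments.replies).

```python
from itertools import count


def preprocess(doc, name="blog-posts", _names=(), _ids=None):
    """Flatten nested JSON into independent documents with unique id and path.

    Lists of objects are treated as child documents; everything else stays
    a field of the current document. Children are emitted before their
    parent so that each block ends with its parent.
    """
    ids = _ids if _ids is not None else count(1)
    names = _names + (name,)
    flat = {"id": next(ids), "path": f"{len(names)}.{'.'.join(names)}"}
    children = []
    for key, value in doc.items():
        is_child_list = (
            isinstance(value, list) and value and all(isinstance(v, dict) for v in value)
        )
        if is_child_list:
            for child in value:
                children.extend(preprocess(child, key, names, ids))
        else:
            flat[key] = value
    return children + [flat]


# Hypothetical input document.
post = {
    "title": "About Solr and other search engines",
    "comments": [
        {"title": "A wonderful post", "sentiment": "positive",
         "replies": [{"title": "Exactly!", "sentiment": "positive"}]},
        {"title": "I prefer Elastic", "sentiment": "negative"},
    ],
}

docs = preprocess(post)
for d in docs:
    print(d["id"], d["path"], d["title"])
```

Every document gets its own ID regardless of level, and the path alone tells a comment title apart from a reply title.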

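Example

• Blog post
  – Comment
    • Reply to a comment

(Figure: example blog posts with comments and replies, each labeled with a sentiment: positive, neutral, or negative)

Texts shown in the figure:
"About Solr and other search engines"
"A wonderful post about Solr"
"Exactly!"
"I prefer Elastic"
"Introducing handy Solr features"
"An important feature has been left out!"
"Isn't that a different version?"
"Elastic implemented it first, though"

Note: this example is translated from the one used in the talk.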

Searching

• Searching from a comment (child) to its parent
  q={!parent which="path:1.blog-posts"}(path:2.blog-posts.comments AND sentiment:positive)





Searching

• Searching a specific level (Replies, level2)
  q={!child of="path:2.blog-posts.comments"}path:2.blog-posts.comments AND sentiment:negative
  &fq=path:3.blog-posts.comments.replies
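When these queries are sent over HTTP, the braces, quotes, and spaces have to be URL-encoded. The following sketch assembles the two queries from the slides with Python's standard library. The base URL, the collection name blog, and the helper names are hypothetical; only the query strings come from the talk.

```python
from urllib.parse import urlencode, parse_qs


def parent_query(parent_path, child_clause):
    """Block join from children up to parents at the given path."""
    return f'{{!parent which="path:{parent_path}"}}{child_clause}'


def child_query(parent_path, parent_clause):
    """Block join from parents at the given path down to their children."""
    return f'{{!child of="path:{parent_path}"}}{parent_clause}'


def select_url(base, q, fq=None):
    """Build a /select URL; urlencode takes care of the escaping."""
    params = [("q", q)]
    if fq is not None:
        params.append(("fq", fq))
    return f"{base}/select?{urlencode(params)}"


BASE = "http://localhost:8983/solr/blog"  # hypothetical host and collection

# Posts that have at least one positive comment.
posts_url = select_url(
    BASE,
    parent_query("1.blog-posts", "(path:2.blog-posts.comments AND sentiment:positive)"),
)

# Replies that belong to negative comments.
replies_url = select_url(
    BASE,
    child_query("2.blog-posts.comments", "path:2.blog-posts.comments AND sentiment:negative"),
    fq="path:3.blog-posts.comments.replies",
)

print(posts_url)
print(replies_url)
```

Keeping the local-params prefix in one helper makes it harder to mistype the which= and of= filters, which must always select the parent level.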











Response

• Use ChildDocTransformerFactory (Solr5.3+)
  q={!parent which="path:2.blog-posts.comments"}path:3.blog-posts.comments.replies AND sentiment:positive
  &fl=*,[child parentFilter=path:2.blog-posts.comments childFilter=path:3.blog-posts.comments.replies limit=50]











Also...

• Wildcards + Level Numbering
• Faceting
  – Block Join Faceting (from Solr5.5)

q={!parent which="path:2.*"}path:3.blog-posts.*.keywords AND text:Solr
&fq=path:2.blog-posts.title OR path:2.blog-posts.body

Reference

• "Solr's Nesting: On Solr's Capabilities to Handle (Deeply) Nested Document Structures", https://medium.com/@alisazhila/solr-s-nesting-on-solr-s-capabilities-to-handle-deeply-nested-document-structures-50eeaaa4347a#.w8plg0muk
• "Nested Objects in Solr", Solr 'n Stuff, http://yonik.com/solr-nested-objects/
• "Working with deeply nested documents in apache solr", http://www.slideshare.net/anshumg/working-with-deeply-nested-documents-in-apache-solr
• "Block-Join 虎の巻", 16th Lucene/Solr Study Meetup, http://www.slideshare.net/ebisawashinobu/block-join-toranomaki



Ecosystem

• "Rebalancing API for Solr Cloud", Bloomreach, Netflix
  – An introduction to the Rebalancing API that landed in Solr6

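Background

• Bloomreach, Personalization as a product
  – A company that hosts personalization services
  – A separate collection for each customer company, ~160M docs
• Managing Solr Cloud is hard
  – Multiple cores, collections, ranking, configuration
  – "QPS has grown, so let's go from 2 cores to 4"
    • But how...?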

Rebalancing API

• Rebalance API
  – Scaling Strategy
    • Auto Shard
    • Redistribute
    • Replace
    • Scale Up
    • Scale Down
    • Remove Dead Nodes
  – Allocation Strategy
    • Least Used Node
    • Unused Node
  – Size Based Sharding
  – Discovery Based

Redistribution

Example 1: Re-sharding

• Prepare a separate shard and merge into it first
  – Then split with IndexSplitter

http://host:port/solr/admin/collections?action=REBALANCE&scaling_strategy=AUTO_SHARD&shards=4&collection=collection_name

(Figure: Merge, then Split; labels: Node, Core)

Example 2: Migration

• Core migration
  http://host:port/solr/admin/collections?action=REBALANCE&scaling_strategy=REDISTRIBUTE&collection=collection_name

(Figure labels: Node, Core)

Example 3: Horizontal Scaling

• When you want redundancy
  http://host:port/solr/admin/collections?action=REBALANCE&scaling_strategy=SCALE_UP&num-replicas=2&collection=collection_name
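The three example calls differ only in their parameters, so they can be generated from one small helper. This is a sketch under assumptions: the helper name rebalance_url and the host localhost:8983 are made up, and the REBALANCE action and its parameter names are taken from the slides as presented, not checked against any particular Solr release.

```python
from urllib.parse import urlencode


def rebalance_url(host, collection, scaling_strategy, extra=None):
    """Build a Collections API URL for the REBALANCE action shown in the talk."""
    params = {"action": "REBALANCE", "scaling_strategy": scaling_strategy}
    params.update(extra or {})  # strategy-specific options, e.g. shards=4
    params["collection"] = collection
    return f"http://{host}/solr/admin/collections?{urlencode(params)}"


HOST = "localhost:8983"  # hypothetical
COLLECTION = "collection_name"

reshard = rebalance_url(HOST, COLLECTION, "AUTO_SHARD", {"shards": 4})
migrate = rebalance_url(HOST, COLLECTION, "REDISTRIBUTE")
scale_up = rebalance_url(HOST, COLLECTION, "SCALE_UP", {"num-replicas": 2})

for url in (reshard, migrate, scale_up):
    print(url)
```

Passing the extra options as a dict rather than keyword arguments is deliberate, since a name like num-replicas is not a valid Python identifier.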

Performance

#Doc  | Re-indexing | Open source Solr split shard | BloomReach Rebalance API | BloomReach Rebalance API with parallel split
~10K  | 2 - 3 min   | 35 - 40 secs                 | 30 - 35 secs             | 15 - 20 secs
~100K | 6 - 7 min   | 3 - 3.5 min                  | 2.5 - 3 mins             | 40 - 55 secs
~1M   | 35 min      | 13 - 15 mins                 | 10 - 12 mins             | 2 - 3 mins
~10M  | 1h 15 min   | 28 - 30 mins                 | 21 - 24 mins             | 3 - 4 mins
~150M | 7h~         | Timeout                      | ~ 1 hour                 | 18 - 20 mins

c.f. http://engineering.bloomreach.com/solrcloud-rebalance-api/

• Fast because no reindexing is involved
• Not only is the index split, the core configuration is migrated automatically as well

Exploring Solr

• "The Evolution of Lucene & Solr Numerics from Strings to Points", Steve Rowe, Lucidworks
  – Looks back at how Lucene/Solr handle numbers, from the viewpoint of how the internal data structures evolved
  – Benchmark report on the latest Dimensional Points

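String representation of numbers

1. In the early days, numbers were stored as Strings
2. Solr's Int/Long/Float/Double were base-10 variable-width Strings
3. If you wanted to sort numerically, you were told to pad with zeros
   e.g., 15 → 0000015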
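A short demonstration of why the zero padding was needed: strings compare character by character, so variable-width numbers sort in the wrong order. The sample values and the width of 7 (taken from the 0000015 example) are only for illustration.

```python
values = [15, 2, 100]

# Variable-width strings sort lexicographically, not numerically.
as_strings = sorted(str(v) for v in values)

# Left-padding to a fixed width makes string order equal numeric order.
WIDTH = 7
as_padded = sorted(str(v).zfill(WIDTH) for v in values)

print(as_strings)  # ['100', '15', '2']
print(as_padded)   # ['0000002', '0000015', '0000100']
```

The padded form only works for non-negative integers of bounded width, which hints at why a dedicated numeric encoding was eventually worth having.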

String representation of numbers

• 2000, Lucene 0.0.1
  – Modified UTF-8 terms
    • Null is 2 bytes
• 2008, Lucene 2.4
  – UTF-8 terms
• 2012, Lucene 4.0
  – Binary terms

Space for fast computation

• 2005, Lucene 1.4, FieldCache appears
  – Data can now be kept in memory
• 2009, Lucene 2.9/Solr1.4
  – Trie numerics are introduced
• 2016, Lucene/Solr 6.0
  – Trie numerics give way to Dimensional Points

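(Recap) Trie Numerics

• Numbers are stored in a trie
  – Range searches become faster
    • Only the minimum necessary set of range queries is generated
  – The granularity of the splits is specified with precision steps

c.f., https://epic.awi.de/17813/1/Sch2007br.pdf

(Figure: a trie with nodes 4 → 42 (421, 423) and 44 (445, 446, 448), and 5 → 52 (521, 522))

intField: [423 TO 599]
Without the trie: intField:423 OR intField:424 OR intField:425 OR intField:426 OR ...
With the trie: intField:423 OR intField:44 OR intField:5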
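The rewrite of [423 TO 599] into three terms can be reproduced with a small decimal trie. This is a simplified sketch: real trie numerics split values by bits according to the precision step rather than by decimal digits, and the function names here are made up. The indexed values are the leaves shown in the figure.

```python
def build_trie_terms(values, width):
    """Index every prefix of every value, like one trie node per prefix."""
    terms = set()
    for value in values:
        digits = str(value).zfill(width)
        for length in range(1, width + 1):
            terms.add(digits[:length])
    return terms


def cover(lo, hi, terms, width, prefix=""):
    """Pick the fewest indexed prefix terms that together cover [lo, hi]."""
    chosen = []
    for digit in "0123456789":
        term = prefix + digit
        if term not in terms:
            continue  # nothing indexed under this prefix
        pad = width - len(term)
        term_lo = int(term + "0" * pad)
        term_hi = int(term + "9" * pad)
        if term_hi < lo or term_lo > hi:
            continue  # subtree lies entirely outside the range
        if lo <= term_lo and term_hi <= hi:
            chosen.append(term)  # whole subtree matches: one term is enough
        else:
            chosen.extend(cover(lo, hi, terms, width, term))  # partial overlap: descend
    return chosen


indexed_values = [421, 423, 445, 446, 448, 521, 522]
terms = build_trie_terms(indexed_values, width=3)

print(cover(423, 599, terms, width=3))  # ['423', '44', '5']
```

The prefix terms 44 and 5 each stand for a whole subtree, so the query visits three terms instead of one term per matching value.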

Making use of the index

• 2012, Lucene/Solr 4.0
  – DocValues are introduced
    • A FieldCache that is built into the index at indexing time
  – Flexible indexing
    • Codecs are introduced, making the index itself customizable

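Toward more efficient partitioning

• 2015, Lucene/Solr 5.2
  – Auto prefix terms
    • Adjusts the split ranges automatically, to avoid the cases where a statically fixed precision step is inefficient
  – Removed in Lucene/Solr 6.2 (LUCENE-7317)
    • Because Dimensional Points can replace it
• 2016, Dimensional Points are introduced
  – They replace all numeric types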

Dimensional Points

1. Values are fixed width (up to 128 bits)
2. 1D-8D
3. Block k-d tree

(Figure labels: 1-16 bytes; 1-8 dimensions)

k dimension tree

1. Split along the X axis
2. Split along the Y axis
3. Split along the X axis (2nd)

A block k-d tree stops splitting once the count in a node is down to a fixed number
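The splitting steps above can be sketched as a recursive function. It is a toy version under assumptions: it alternates X and Y strictly as in the slides and splits at the median, while Lucene's actual block k-d tree implementation differs in details; the names build_block_kd and leaves, and the sample points, are made up.

```python
def build_block_kd(points, max_leaf_size, depth=0):
    """Split 2D points on alternating axes until a cell is small enough."""
    if max_leaf_size < 1:
        raise ValueError("max_leaf_size must be at least 1")
    if len(points) <= max_leaf_size:
        return {"leaf": sorted(points)}  # stop splitting: this cell becomes a leaf block
    axis = depth % 2  # 0 = X, 1 = Y, then X again, ...
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "axis": axis,
        "split": ordered[mid][axis],
        "left": build_block_kd(ordered[:mid], max_leaf_size, depth + 1),
        "right": build_block_kd(ordered[mid:], max_leaf_size, depth + 1),
    }


def leaves(node):
    """Collect the leaf blocks from left to right."""
    if "leaf" in node:
        return [node["leaf"]]
    return leaves(node["left"]) + leaves(node["right"])


# Deterministic sample points on a 16 x 16 grid.
points = [(i * 7 % 16, i * 5 % 16) for i in range(16)]
tree = build_block_kd(points, max_leaf_size=4)

for block in leaves(tree):
    print(block)
```

With 16 points and a leaf size of 4, two rounds of splitting (X, then Y) are enough, giving four leaf blocks of four points each.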

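Dimensional Points

1. Values are fixed width (up to 128 bits)
2. 1D-8D
3. Block k-d tree
4. Points are sorted
5. Once a cell has been split down to a fixed count or fewer, it is written to disk as a leaf block
6. An in-memory binary tree holds the mapping to the blocks
7. Adaptive optimal partitioning
   – Balanced according to density

(Figure labels: 1-16 bytes; 1-8 dimensions)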
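For the one-dimensional case, items 4 to 6 can be imitated with sorted blocks plus a small in-memory index. This is only a sketch of the idea: a sorted list searched with bisect stands in for the in-memory binary tree, Python lists stand in for on-disk leaf blocks, and all names and sample values are assumptions.

```python
from bisect import bisect_left, bisect_right


def write_leaf_blocks(values, block_size):
    """Sort the points, cut them into leaf blocks, and index each block's minimum."""
    ordered = sorted(values)
    blocks = [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]
    index = [block[0] for block in blocks]  # small enough to keep in memory
    return index, blocks


def range_query(index, blocks, lo, hi):
    """Return values in [lo, hi] and how many leaf blocks had to be read."""
    # First candidate: the last block whose minimum is below lo (it may still hold values >= lo).
    first = max(bisect_left(index, lo) - 1, 0)
    # Blocks whose minimum is above hi cannot contain a match.
    last = bisect_right(index, hi)
    hits = []
    blocks_read = 0
    for block in blocks[first:last]:
        blocks_read += 1
        hits.extend(v for v in block if lo <= v <= hi)
    return hits, blocks_read


# Deterministic sample data: 60 distinct values below 101.
values = [(i * 37) % 101 for i in range(60)]
index, blocks = write_leaf_blocks(values, block_size=8)

hits, blocks_read = range_query(index, blocks, 20, 40)
print(hits, "read", blocks_read, "of", len(blocks), "blocks")
```

Only the blocks that can overlap the range are read, which is the point of keeping the block mapping in memory.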

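Characteristics of Dimensional Points

• Still Lucene only
  – Solr support is tracked in SOLR-8396
• Multi-valued fields are supported
• Retrieving the value is not supported
  – Add store=true as well
• Sorting and faceting are not supported either
  – Use DocValues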

Replacing the numeric types

• 1D Native
  – LongPoint, IntPoint, DoublePoint, FloatPoint, BinaryPoint
• 1D 128bit
  – BigIntegerPoint, InetAddressPoint
• 1D – 4D Range
  – LongRangeField, IntRangeField, DoubleRangeField, FloatRangeField
• 2D Geospatial
  – LatLonPoint
• 3D Geospatial
  – Geo3DPoint

Benchmark (1)

• McCandless benchmark & Adrien Grand re-run
  – 36% faster at query time
  – 71% faster at index time
  – 66% less disk
  – 85% less memory

...isn't that a bit too good?

Benchmark

• Fixed range query
• 25M NYC taxi data
• Three kinds of Long
  – Trie numerics, precision step 8
  – Point fields
  – Trie numerics, maximum precision step

Benchmark

                      | Indexing time | Index size
Points                | 31s           | 1.2GiB
Trie                  | 53s           | 1.6GiB
Single-precision Trie | 19s           | 0.7GiB

• 24 fields: 6 string, 1 text, 2 long fields, 1 int field, 14 double fields.

http://www.slideshare.net/lucidworks/the-evolution-of-lucene-solr-numerics-from-strings-to-points-presented-by-steve-rowe-lucidworks

Benchmark

field             | cardinality | hits  | field type    | time
passenger_count   | 10          | 7.5M  | IntPoint      | 86ms
                  |             |       | TrieInt/8     | 114ms
                  |             |       | TrieInt/32    | 116ms
pick_up_date_time | 4.1M        | 10.4M | LongPoint     | 69ms
                  |             |       | TrieLong/16   | 105ms
                  |             |       | TrieLong/64   | 365ms
trip_distance     | 4,754       | 9.6M  | DoublePoint   | 116ms
                  |             |       | TrieDouble/16 | 92ms
                  |             |       | TrieDouble/64 | 105ms


References

• "The Evolution of Lucene & Solr Numerics from Strings to Points", Steve Rowe, Lucidworks, http://www.slideshare.net/lucidworks/the-evolution-of-lucene-solr-numerics-from-strings-to-points-presented-by-steve-rowe-lucidworks
• "Fun with flexible indexing", Michael McCandless, http://blog.mikemccandless.com/2010/10/fun-with-flexible-indexing.html



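Impressions from attending the conference...

• Looks very worthwhile for people doing core development
  – Many of the new features are commits from companies
  – The community is happy, and the companies gain on maintenance
  – A chance to meet many committers
    • Though as for the Elasticsearch-leaning Lucene committers... (the rest is omitted)
  – You get a feel for the energy around Lucene/Solr
  – G1GC, OK!
• Future
  – IBM Watson looks likely to become a fairly large use case
  – Yonik seemed to have high expectations for Expressions
• There are some business elements too
  – For better or worse, it is not purely technical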

We are hiring!

• Rakuten tech blog
  – http://techblog.rakuten.co.jp/
• Rakuten Engineer hiring
  – http://corp.rakuten.co.jp/careers/




Question?
