Dawid Weiss: Finite State Automata in Lucene
DESCRIPTION
Finite state automata and transducers made it into Lucene fairly recently, but they already show a very promising impact on search performance. This data structure is rarely exploited because it is commonly (and unfairly) associated with high complexity. During the talk, I will try to show that automata and transducers are in fact very simple, that their construction can be very efficient (in both memory and time), and that their field of application is very broad.
TRANSCRIPT
![Page 1: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/1.jpg)
Finite State Automata in Lucene
Dawid Weiss
![Page 2: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/2.jpg)
Dawid Weiss
20+ years of coding: 10 years assembly only
Academia & Research: PhD in Information Retrieval, PUT
Open source: Carrot2, HPPC, Lucene, …
Industry & Business: Carrot Search s.c.
![Page 3: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/3.jpg)
Talk outline
State machines (automata): FSAs, DFAs, FSTs and other XXXs.
Use cases in Lucene and Solr: Suggester. FuzzySearch. Index.
No API details: still @experimental.
![Page 4: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/4.jpg)
(Non)? Deterministic Finite State (Automata|Machines)
![Page 5: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/5.jpg)
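HashSet
hash → slot → value
0x29384d34 → lucene
0xde3e3354 → lucid
0x00000666 → lucifer
FSA (deterministic)
[Diagram: automaton for lucene, lucid and lucifer; the words share the path l u c, then branch into e n e and i, which splits into d and f e r]
exists(sequence)
floor(prefix)
ceil(prefix)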
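The operations listed on this slide can be illustrated with a small sketch. It assumes a plain in-memory trie (a deterministic but non-minimal automaton) rather than Lucene's data structure; the names `State`, `build_trie`, `exists` and `ceil` are illustrative and are not Lucene's API.

```python
class State:
    def __init__(self):
        self.arcs = {}       # label -> next State
        self.final = False   # True if a word ends in this state


def build_trie(words):
    root = State()
    for word in words:
        state = root
        for ch in word:
            state = state.arcs.setdefault(ch, State())
        state.final = True
    return root


def exists(root, sequence):
    state = root
    for ch in sequence:
        state = state.arcs.get(ch)
        if state is None:
            return False
    return state.final


def _smallest(state, acc):
    # Follow the smallest label until the first final state is reached.
    while not state.final:
        label = min(state.arcs)
        acc.append(label)
        state = state.arcs[label]
    return "".join(acc)


def ceil(root, key):
    """Smallest accepted sequence that is >= key, or None."""
    if not root.arcs and not root.final:
        return None                      # empty automaton
    path = [root]
    for ch in key:
        nxt = path[-1].arcs.get(ch)
        if nxt is None:
            break
        path.append(nxt)
    matched = len(path) - 1
    if matched == len(key):
        # Everything below this state starts with key, so it is >= key.
        return _smallest(path[-1], list(key))
    # Backtrack to the deepest state that has an arc larger than the key's.
    for depth in range(matched, -1, -1):
        bigger = [l for l in path[depth].arcs if l > key[depth]]
        if bigger:
            label = min(bigger)
            return _smallest(path[depth].arcs[label],
                             list(key[:depth]) + [label])
    return None


fsa = build_trie(["lucene", "lucid", "lucifer"])
print(exists(fsa, "lucid"), exists(fsa, "luc"))   # True False
print(ceil(fsa, "luci"), ceil(fsa, "lucie"))      # lucid lucifer
```

A hash set can answer only the first question; the ordered walk over arcs is what makes `floor` and `ceil` style queries possible on an automaton.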
![Page 6: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/6.jpg)
![Page 7: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/7.jpg)
![Page 8: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/8.jpg)
[Diagrams: three automata built from arcs labeled b, k, i, l, l]
deterministic, non-minimal
deterministic, minimal
non-deterministic, non-minimal
![Page 9: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/9.jpg)
![Page 10: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/10.jpg)
![Page 11: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/11.jpg)
(Sorted)Map
lucene → 1
lucid → 2
lucifer → 666
FST (transducer)
[Diagram: l|1 u c, then e n e; i|1, then d; f|664 e r]
![Page 12: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/12.jpg)
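FST (transducer): two placements of the same outputs
Outputs on the last arc of each word: l u c e n e|1; i d|2; f e r|666
Outputs moved toward the root and summed along the path: l|1 u c e n e (lucene = 1); i|1 d (lucid = 1 + 1 = 2); f|664 e r (lucifer = 1 + 1 + 664 = 666)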
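A minimal sketch of how such a transducer is read: the value of a word is the sum of the outputs on the arcs it traverses. The state numbers and the table layout below are assumptions made for illustration; only the labels and outputs (l|1, i|1, f|664) come from the slide.

```python
# state -> {label: (output, next_state)}; state numbering is hypothetical.
ARCS = {
    0: {"l": (1, 1)},
    1: {"u": (0, 2)},
    2: {"c": (0, 3)},
    3: {"e": (0, 4), "i": (1, 7)},
    4: {"n": (0, 5)},
    5: {"e": (0, 6)},
    6: {},
    7: {"d": (0, 6), "f": (664, 8)},
    8: {"e": (0, 9)},
    9: {"r": (0, 6)},
}
FINAL = {6}


def lookup(word):
    """Return the summed output for word, or None if it is not accepted."""
    state, total = 0, 0
    for ch in word:
        arc = ARCS[state].get(ch)
        if arc is None:
            return None
        output, state = arc
        total += output
    return total if state in FINAL else None


print(lookup("lucene"), lookup("lucid"), lookup("lucifer"))  # 1 2 666
```

Moving outputs toward the root lets words with different values still share their suffix states, which is why the three words can end in a single final state here.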
![Page 13: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/13.jpg)
![Page 14: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/14.jpg)
NFSAs and Regular expressions
Determinization: state explosion, not always possible
Backtracking: recursion explosion
[Diagrams: NFA fragments for the regular expression forms a, e1e2, e+, e*, e?]
![Page 15: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/15.jpg)
Pattern a?ⁿaⁿ
n=3 → a?a?a?aaa
Source: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).
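The pattern family can be reproduced with a short sketch. `pattern` builds a?ⁿaⁿ as a string, and `nfa_match` simulates the corresponding NFA by tracking the set of reachable states instead of backtracking. Both function names and the state numbering (0 to 2n) are assumptions for this illustration, and the simulation is specialized to this one pattern rather than being a general regex engine.

```python
import re


def pattern(n):
    # n optional a's followed by n mandatory a's, e.g. n=3 -> a?a?a?aaa
    return "a?" * n + "a" * n


def closure(states, n):
    # Each of the first n positions is optional, so it may be skipped
    # without consuming input (an epsilon move to the next state).
    out = set(states)
    for s in states:
        while s < n:
            s += 1
            out.add(s)
    return out


def nfa_match(text, n):
    """Match text against a?^n a^n by advancing a set of states."""
    current = closure({0}, n)
    for ch in text:
        if ch != "a":
            return False
        current = closure({s + 1 for s in current if s < 2 * n}, n)
        if not current:
            return False
    return 2 * n in current


print(pattern(3))                                   # a?a?a?aaa
print(re.fullmatch(pattern(3), "aaa") is not None)  # True
print(nfa_match("a" * 30, 30))                      # True
```

The state-set simulation does at most about 2n units of work per input character, so n=30 is answered immediately. A backtracking matcher may instead try an exponential number of ways to assign the optional a's before it succeeds.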
![Page 16: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/16.jpg)
![Page 17: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/17.jpg)
![Page 18: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/18.jpg)
[Plot: time in ms (0 to 35000) against n (0 to 30)]
Time of matching aⁿ against the pattern a?ⁿaⁿ, depending on n. Java 1.6, modern hardware.
![Page 19: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/19.jpg)
Linear-time, minimal, deterministic FSA construction
Linear algorithm from sorted input: by Daciuk, Mihov, et al.
Active path: states that still can change
States dictionary: nodes that will never change
![Page 20: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/20.jpg)
1) common AP prefix
2) freeze the rest of AP
3) add suffix → new AP
Next input: lucene
![Page 21: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/21.jpg)
Active path: l u c e n e
Next input: lucid
![Page 22: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/22.jpg)
[Diagram: l u c e n e, with a new branch i d leaving the path after l u c]
![Page 23: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/23.jpg)
[Diagram: l u c e n e, with the branch i d]
Next input: lucifer
![Page 24: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/24.jpg)
[Diagram: after adding lucifer, the branch i splits into d and f e r]
![Page 25: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/25.jpg)
[Diagram: resulting automaton for lucene, lucid, lucifer]
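The three steps can be sketched as follows. This is a simplified illustration of incremental construction from sorted input, not Lucene's builder: the names `State`, `freeze`, `build_minimal_fsa`, `accepts` and `count_states` are hypothetical, and the states dictionary is an ordinary Python dict keyed by a state signature.

```python
class State:
    def __init__(self):
        self.arcs = {}       # label -> State
        self.final = False

    def signature(self):
        # Frozen states are equivalent when they agree on finality and on
        # their (label, target) pairs; targets are already canonical.
        return (self.final,
                tuple(sorted((label, id(t)) for label, t in self.arcs.items())))


def freeze(active, word, keep, register):
    """Freeze active-path states deeper than `keep`, deepest first."""
    while len(active) - 1 > keep:
        child = active.pop()
        label = word[len(active) - 1]
        # Reuse an equivalent frozen state if the dictionary has one.
        active[-1].arcs[label] = register.setdefault(child.signature(), child)


def build_minimal_fsa(sorted_words):
    root = State()
    register = {}            # states dictionary: signature -> frozen state
    active = [root]          # active path: active[i] is reached by previous[:i]
    previous = ""
    for i, word in enumerate(sorted_words):
        if i > 0 and word <= previous:
            raise ValueError("input must be sorted and free of duplicates")
        # 1) common prefix with the active path
        common = 0
        while (common < min(len(word), len(previous))
               and word[common] == previous[common]):
            common += 1
        # 2) freeze the rest of the active path
        freeze(active, previous, common, register)
        # 3) add the suffix, which becomes the new tail of the active path
        for ch in word[common:]:
            nxt = State()
            active[-1].arcs[ch] = nxt
            active.append(nxt)
        active[-1].final = True
        previous = word
    freeze(active, previous, 0, register)
    return root


def accepts(root, word):
    state = root
    for ch in word:
        state = state.arcs.get(ch)
        if state is None:
            return False
    return state.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.arcs.values())
    return len(seen)


fsa = build_minimal_fsa(["lucene", "lucid", "lucifer"])
print(accepts(fsa, "lucifer"), accepts(fsa, "luci"))  # True False
print(count_states(fsa))                              # 10
```

A plain trie for the same three words needs 12 states; the dictionary lookup in `freeze` merges the three end states into one, giving 10.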
![Page 26: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/26.jpg)
FS(A|T)s in (Lucene|Solr)
![Page 27: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/27.jpg)
Automata in Lucene|Solr
org.apache.lucene.util.automaton.*: partial port of brics, FuzzyQuery, AutomatonTermsEnum
org.apache.lucene.util.automaton.fst.FST: FSA and FSTs from sorted data, suggester, indexes
![Page 28: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/28.jpg)
org.apache.lucene.util.automaton.fst.*
FSA representation
Arc-based, not state-based: Moore vs. Mealy. Compact vs. intuitive.
Next-state chaining: requires unusual tricks during construction.
Everything in a byte[]: traversal-ready, memory-efficient.
Dual transition storage format: lookup by bsearch or linear scan.
Input: abc, bd, bde.
[Diagrams: two automata for this input]
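To make the byte[] idea and the two lookup strategies concrete, here is a toy encoding. It is an assumption made for illustration and is not Lucene's actual on-disk or in-memory format: every arc takes exactly three bytes (label, flags, target offset), and the flag names `FINAL` and `LAST` are hypothetical.

```python
FINAL, LAST = 0x01, 0x02
ARC_SIZE = 3  # label byte, flags byte, target offset byte


def pack_state(arcs):
    """Serialize one state's arcs; arcs are (label, is_final, target),
    sorted by label, with target in the range 0..255."""
    out = bytearray()
    for i, (label, is_final, target) in enumerate(arcs):
        flags = (FINAL if is_final else 0) | (LAST if i == len(arcs) - 1 else 0)
        out += bytes([ord(label), flags, target])
    return bytes(out)


def find_arc_linear(data, offset, label):
    """Scan arcs one by one until the LAST flag; needs no arc count."""
    while True:
        if data[offset] == ord(label):
            return offset
        if data[offset + 1] & LAST:
            return -1
        offset += ARC_SIZE


def find_arc_bsearch(data, offset, count, label):
    """Binary search; relies on fixed-size arcs and a known arc count."""
    lo, hi = 0, count - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        at = offset + mid * ARC_SIZE
        if data[at] == ord(label):
            return at
        if data[at] < ord(label):
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


state = pack_state([("a", False, 10), ("b", False, 20), ("d", True, 0)])
print(find_arc_linear(state, 0, "b"), find_arc_bsearch(state, 0, 3, "b"))  # 3 3
print(find_arc_linear(state, 0, "c"), find_arc_bsearch(state, 0, 3, "c"))  # -1 -1
```

The trade-off in this sketch: a linear scan works with variable-length arcs and suits states with few transitions, while binary search pays for padding arcs to a fixed size and pays off on states with many transitions.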
![Page 29: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/29.jpg)
[Diagram: automaton for abc, bd, bde with states s1 to s5, and its arcs laid out in a flat array]
Serialized arcs: cFL bL eFL dL a bL
Serialized arcs, second variant: cFL bL eFL dL abLN
![Page 30: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/30.jpg)
![Page 31: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/31.jpg)
![Page 32: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/32.jpg)
| Input | Input size (MB) | Input size (terms) | Lucene (MB) | morf. (MB) | gzip (MB) |
|---|---|---|---|---|---|
| Wikipedia t.index | 481 | 38 092 045 | 258 | 164 | 149 |
| Polish infl. | 162 | 3 672 200 | 3.1 | 1.7 | 15.4 |
![Page 33: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/33.jpg)
Use Cases: Solr's Autocomplete
![Page 34: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/34.jpg)
Solr's Suggesters
Design choices: sort order (alpha, score), prefix vs. spelling, boost exact matches?
Weights: term→weight, lookup(term, onlyMorePopular)
org.apache.solr.spelling.suggest.Lookup: JaspellLookup, TSTLookup, FSTLookup
![Page 35: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/35.jpg)
Take 1
Input (term|weight): flour|3, four|4, fourier|3, furious|2
[Diagram: automaton over the input terms, each followed by the | separator and its weight]
Find prefix.
Depth-first traversal for completions.
PQ on score|alpha.
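A rough sketch of Take 1, assuming a nested-dict trie in place of the automaton and Python's `heapq` as the priority queue. The function names are hypothetical, and the sketch assumes that neither terms nor prefixes contain the `|` character, which is used as the end-of-term marker.

```python
import heapq


def build_trie(weighted_terms):
    root = {}
    for term, weight in weighted_terms.items():
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["|"] = weight   # end marker carries the weight, as in term|weight
    return root


def completions(node, prefix):
    """Depth-first traversal yielding every (term, weight) below node."""
    stack = [(node, prefix)]
    while stack:
        node, text = stack.pop()
        for label, child in node.items():
            if label == "|":
                yield text, child
            else:
                stack.append((child, text + label))


def suggest(root, prefix, n):
    node = root
    for ch in prefix:             # find the prefix
        node = node.get(ch)
        if node is None:
            return []
    # Priority queue on score first, alphabetical order as the tie-breaker.
    best = heapq.nsmallest(n, completions(node, prefix),
                           key=lambda tw: (-tw[1], tw[0]))
    return [term for term, _ in best]


trie = build_trie({"flour": 3, "four": 4, "fourier": 3, "furious": 2})
print(suggest(trie, "fou", 2))   # ['four', 'fourier']
print(suggest(trie, "f", 3))     # ['four', 'flour', 'fourier']
```

The weakness of this approach is visible in the code: every completion under the prefix is visited before the best n are known, so short prefixes over a large dictionary are expensive.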
![Page 36: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/36.jpg)
Example query for Take 1: fou*
![Page 37: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/37.jpg)
Take 2
Input (weight prepended to the term): 2furious, 3flour, 3fourier, 4four
[Diagram: automaton with one root arc per weight (2, 3, 4)]
From score roots, until N collected:
Find prefix.
Depth-first traversal for completions, stop if N collected.
Find/boost exact match.
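Take 2 can be sketched in the same style, assuming one nested-dict trie per discretized weight in place of a single automaton with weight-labeled root arcs. The names are hypothetical, and the exact-match boosting mentioned on the slide is left out to keep the sketch short.

```python
def build(weighted_terms):
    """Return {weight: trie}; the empty-string key marks the end of a term."""
    roots = {}
    for term, weight in weighted_terms.items():
        node = roots.setdefault(weight, {})
        for ch in term:
            node = node.setdefault(ch, {})
        node[""] = True
    return roots


def walk(node, text):
    """Depth-first traversal in alphabetical order."""
    for label in sorted(node):
        if label == "":
            yield text
        else:
            yield from walk(node[label], text + label)


def suggest(roots, prefix, n):
    out = []
    for weight in sorted(roots, reverse=True):   # best score root first
        node = roots[weight]
        for ch in prefix:                        # find the prefix
            node = node.get(ch)
            if node is None:
                break
        else:
            for term in walk(node, prefix):
                out.append(term)
                if len(out) == n:                # stop once N are collected
                    return out
    return out


roots = build({"flour": 3, "four": 4, "fourier": 3, "furious": 2})
print(suggest(roots, "fou", 2))   # ['four', 'fourier']
print(suggest(roots, "f", 3))     # ['four', 'flour', 'fourier']
```

Because the highest weights are visited first, the traversal can stop as soon as N terms are found, with no priority queue and no need to enumerate every completion.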
![Page 38: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/38.jpg)
Example query for Take 2: fou*
![Page 39: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/39.jpg)
Take 3 (infixes)
2furious
5urious|furious
5rious|furious
5ious|furious
5ous|furious
5us|furious
5s|furious
3flour
…
![Page 40: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/40.jpg)
[Diagram: infix automaton with root arcs for the weights 2 to 7]
![Page 41: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/41.jpg)
Constant time lookups!
Regardless of the terms dictionary size.
Regardless of prefix length.
Exact matches only.
Static snapshot (not incremental).
Discretized weights.
![Page 42: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/42.jpg)
![Page 43: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/43.jpg)
Top50KWiki.utf8, 676 KB, 50 000 terms

| | Jaspell | TST | FST |
|---|---|---|---|
| RAM [B] | 7 869 415 | 7 914 524 | 300 175 |
| PREFIX [100-200] | 458 | 966 | 742 |
| PREFIX [6-9] | 330 | 228 | 659 |
| PREFIX [2-4] | 126 | 29 | 501 |

PREFIX rows: queries per second, tpq.
![Page 44: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/44.jpg)
Summary
![Page 45: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/45.jpg)
Summary and Conclusions
Automata: compact, powerful, efficient data structure
Lucene/Solr benefits: behind the scenes, but spreading: index, queries, suggesters
API in Lucene… is being shaped right now, still @experimental
![Page 46: Dawid Weiss- Finite state automata in lucene](https://reader034.vdocuments.site/reader034/viewer/2022050712/5562187ad8b42a00138b5564/html5/thumbnails/46.jpg)
Acknowledgement
Michael McCandless
Robert Muir
committer: .+