bouma2 talk
TRANSCRIPT
A High-Performance Input-Aware
Multiple String-Match Algorithm
Erez Buchnik
Page 2
Agenda
• Problem
• Existing Solutions
• Bouma2 – Model
• Comparisons
• Preprocessing in Detail
• Future Work
Page 4
The Multiple String-Match Problem
• Goal: Given a set of strings and input
text, find all occurrences of any of the
strings in the text
• Input: Set of strings L and input text M
• Output: Offsets 1 ≤ i ≤ |M| where a
substring of M matches any of the
strings in L
• Uses: AV, IPS, DPI, DNA Search etc…
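As a reference point for the problem statement above, here is a minimal brute-force sketch. The function name `find_all` is hypothetical, and it reports 0-based offsets rather than the 1-based offsets used on the slide.

```python
def find_all(strings, text):
    """Return sorted (offset, string) pairs for every occurrence (0-based offsets)."""
    hits = []
    for s in strings:
        start = text.find(s)
        while start != -1:
            hits.append((start, s))
            # Advance by one symbol so overlapping occurrences are kept.
            start = text.find(s, start + 1)
    return sorted(hits)


print(find_all(["bits", "boat"], "habits of rabbits"))
# [(2, 'bits'), (13, 'bits')]
```

This searches the text once per string, which is exactly the cost the algorithms on the following slides try to avoid.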
Page 5
The Multiple String-Match Problem - References
• Aho-Corasick ’75
• Commentz-Walter ’79
• Rabin-Karp ’87
• Wu-Manber ’94
• Muth-Manber ’96
• Hopcroft-Motwani-Ullman ’00
• Dori-Landau ’06
Page 7
Stateful Approach (e.g. Aho-Corasick)
• Linear in the length of the input
• Large automatons cause cache-
misses and degrade performance
• One state transition per symbol
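For comparison with what follows, a compact sketch of the stateful approach in the textbook Aho-Corasick form (trie plus failure links). All names are illustrative, and this variant follows failure links at match time instead of precomputing one full transition per symbol.

```python
from collections import deque


def build_automaton(strings):
    goto = [{}]   # goto[state][symbol] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # strings recognised on entering each state
    for s in strings:
        state = 0
        for ch in s:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(s)
    queue = deque(goto[0].values())   # depth-1 states keep fail = 0
    while queue:                      # breadth-first: shallower states first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(strings, text):
    goto, fail, out = build_automaton(strings)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for s in out[state]:
            hits.append((i - len(s) + 1, s))
    return sorted(hits)


print(search(["bore", "core", "corridor", "or"], "corridor"))
# [(0, 'corridor'), (1, 'or'), (6, 'or')]
```

Every input symbol touches the automaton tables, so large string sets spread those accesses over a large memory area.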
Page 9
Guidelines
• INTUITIVE: Search for ‘Hints’ of
a Match Before the Full Match
• REALISTIC: Use Prior
Knowledge of Expected Input
• SIMPLE: Trivial Match Process
Page 10
Bouma2: Motif-Based String Match
• Preprocessing: Map every string to its own substring: Motif
Set of strings: bore, core, bits, corridor, boat, book, cooks, trek
Set of selected 2-symbol-long substrings (motifs): bi, re, or, at, ok, ek
Q1: How to select motifs?
Page 11
Bouma2: Motif-Based String Match (cont.)
Input: “rabbits hate book cooks”
[Figure: full-match attempts around motif occurrences in the input, each labeled Match or No match; the candidate strings shown are ‘cooks’, ‘boat’ and ‘bits’]
• Match: Examine symbols 2-by-2 (STATELESS); attempt full match around motif occurrences
Q2: How to resolve collisions?
Page 12
Capturing all Occurrences
Input: “habits of rabbits”
[Figure: ‘bits’ occurs twice in the input, once in “habits” and once in “rabbits”; both occurrences are labeled Match]
• Even-offset occurrences and odd-offset occurrences require separate passes, but instead…
Page 13
Upgrade #1: 2-Symbol Strides
Input: “habits of rabbits”
[Figure: both occurrences of ‘bits’ in the input are labeled Match]
• We map each string TWICE: once to an even-offset motif, and once to an odd-offset motif
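A minimal sketch of why the double mapping lets a single pass in 2-symbol strides catch every occurrence. It assumes strings of at least 3 symbols and simply takes the first motif of each parity; the talk selects motifs more carefully. `map_twice` and `scan` are hypothetical names.

```python
def map_twice(strings):
    """Map each string to one even-offset and one odd-offset 2-symbol motif."""
    table = {}  # motif -> [(string, offset of the motif inside the string)]
    for s in strings:
        if len(s) < 3:
            raise ValueError("sketch assumes strings of at least 3 symbols")
        for k in (0, 1):  # naive choice: offset 0 (even) and offset 1 (odd)
            table.setdefault(s[k:k + 2], []).append((s, k))
    return table


def scan(strings, text):
    """One pass over the text, looking only at even input offsets."""
    table = map_twice(strings)
    hits = []
    for i in range(0, len(text) - 1, 2):
        for s, k in table.get(text[i:i + 2], ()):
            start = i - k
            if start >= 0 and text.startswith(s, start):
                hits.append((start, s))
    return sorted(hits)


print(scan(["bits"], "habits of rabbits"))
# [(2, 'bits'), (13, 'bits')]
```

An occurrence starting at an even input offset lines its even-offset motif up with a stride boundary, and an occurrence starting at an odd offset does the same with its odd-offset motif, so each occurrence is found exactly once.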
Page 14
Upgrade #2: Fast-Path / Slow-Path
Input: “habits of rabbits”
[Figure: the fast-path reports offsets 4 and 14 in the input]
• Fast-Path:
- Stateless
- “Monolithic” (zero branches)
- Cache-Aware (small direct-table)
- SIMPLE…
Page 15
Upgrade #2: Fast-Path / Slow-Path
Input: “habits of rabbits”
[Figure: for the offsets 4 and 14 reported by the fast-path, the slow-path compares ‘bits’ against the input; both occurrences are labeled Match]
• Slow-Path:
- Memory-Efficient (pointers to
original strings for comparison)
- “Localized” (separate structure for
every motif)
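The split can be sketched as follows. Several things here are assumptions rather than details from the talk: the direct table is one flag byte per 16-bit value (65,536 entries), the slow-path is a plain list per motif, and ‘bits’ is mapped to ‘ts’ (even offset 2) and ‘it’ (odd offset 1), which is one assignment consistent with the offsets 4 and 14 on the slide. The comprehension in `fast_path` is not branch-free, unlike the "monolithic" fast-path described above.

```python
def build_tables(mapping):
    """mapping: {2-byte motif: [(string, offset of the motif inside the string)]}"""
    direct = bytearray(1 << 16)   # fast-path: direct table indexed by 2 symbols
    slow = {}                     # slow-path: separate structure for every motif
    for motif, entries in mapping.items():
        key = motif[0] << 8 | motif[1]
        direct[key] = 1
        slow[key] = entries       # references to the original strings
    return direct, slow


def fast_path(direct, data):
    """Stateless scan in 2-symbol strides; returns offsets of motif occurrences."""
    return [i for i in range(0, len(data) - 1, 2)
            if direct[data[i] << 8 | data[i + 1]]]


def slow_path(slow, data, offsets):
    """Attempt a full comparison around every offset the fast-path reported."""
    hits = []
    for i in offsets:
        for s, k in slow[data[i] << 8 | data[i + 1]]:
            start = i - k
            if start >= 0 and data.startswith(s, start):
                hits.append((start, s))
    return hits


direct, slow = build_tables({b"ts": [(b"bits", 2)], b"it": [(b"bits", 1)]})
data = b"habits of rabbits"
offsets = fast_path(direct, data)
print(offsets)                         # [4, 14]
print(slow_path(slow, data, offsets))  # [(2, b'bits'), (13, b'bits')]
```

The fast-path touches only the direct table and the input, while the per-motif slow-path structures are visited only when a motif actually occurs.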
Page 17
Bouma2 vs. Aho-Corasick
• n – length of input
• S – no. of string-matches in n
• m – no. of motif-matches in n
• l – length of the longest string
• Match Complexities:
- Aho-Corasick: $O(n + S)$
- Bouma2: $O(n/2 + m \cdot l)$
Page 18
Bouma2 vs. Aho-Corasick (Speed)
• In practice, Bouma2 is usually at least twice as fast as Aho-Corasick
• Fast-path alone is 10 times faster
[Chart legend: Bouma2 Fast-Path; Bouma2 Slow-Path (Sub-Optimal); Aho-Corasick]
Q3: How to optimize slow-path?
Page 19
Bouma2 vs. Aho-Corasick (Cache)
• Bouma2 exhibits 8.5 times fewer cache-misses than Aho-Corasick (fast-path + slow-path)
[Chart legend: Bouma2 Cache-Misses; Aho-Corasick Cache-Misses]
Page 20
Bouma2 vs. Aho-Corasick (Memory)
• Bouma2 footprint is less than 70% of Aho-Corasick for textual search (down to 35% in other cases)
[Chart legend: Bouma2 Fast-Path; Bouma2 Slow-Path; Aho-Corasick; Original Strings]
Page 22
Q1: How to select motifs?
• A1: Out of all 2-symbol substrings, find a minimum subset that covers all given strings (even & odd offsets)
(• = the 2-symbol substring occurs in the string at an offset of that parity)
Offset  String (split)  bo  co  do  id  or  re  ri  rr
Even    bo|re           •                   •
Even    co|re               •               •
Even    co|rr|id|or         •       •   •           •
Odd     b|or|e                          •
Odd     c|or|e                          •
Odd     c|or|ri|do|r            •       •       •
Page 23
Q1: How to select motifs?
• But… maybe the minimum subset is not the optimal subset?
(√ = selected, Χ = not selected)
Offset  String (split)  bo  co  do  id  or  re  ri  rr
Even    bo|re           Χ                   √
Even    co|re               Χ               √
Even    co|rr|id|or         Χ       Χ   √           Χ
Odd     b|or|e                          √
Odd     c|or|e                          √
Odd     c|or|ri|do|r            Χ       √       Χ
Page 24
Q1: How to select motifs?
• Bad selection of motifs for English text searches: substrings of ‘the’ - the most common word in English
Input: “The good, the bad and the ugly” in theaters nearby
[Figure: a full match of ‘theater’ is attempted around occurrences of ‘the’ in the input; the attempts are labeled No match, except inside “theaters”, which is labeled Match]
Offset  String (split)  at  ea  er  he  te  th
Even    th|ea|te|r          Χ           Χ   √
Odd     t|he|at|er      Χ       Χ   √
Page 25
Q1: How to select motifs?
• Use input-specific occurrence statistics to optimize motif-sets
• REALISTIC…
2-Symbol Sequence   Occurrence Probability
bo                  0.0002
re                  0.001861
co                  0.001028
rr                  0.000031
id                  0.001756
or                  0.000444
ri                  0.000284
do                  0.000151
Page 26
Q1: How to select motifs?
• NOTE: After selecting the motif-set, remove redundant mappings from the final String-to-Motif mapping
(√ = selected, Χ = not selected)
Offset  String (split)  bo  co  do  id  or  re  ri  rr
Even    bo|re           √                   Χ
Even    co|re               √               Χ
Even    co|rr|id|or         √       Χ   √           Χ
Odd     b|or|e                          √
Odd     c|or|e                          √
Odd     c|or|ri|do|r            Χ       √       Χ
Page 27
Statistics for Motif Selection
[Chart (top): occurrences of each 2-symbol sequence in IP traffic; x-axis 0 to 70000, y-axis 0 to 10,000,000, axis label “Occurrences (more than 100,000)”; labeled peaks: “\r\n”, 00 00, FF FF]
[Chart (bottom): occurrences of each 2-symbol sequence in OS files; x-axis 0 to 70000, y-axis 0 to 35,000,000, axis label “Occurrences (more than 40,000)”; labeled peaks: 00 00, “??”, FF FF]
• 2-symbol sequence statistics: IP traffic (top) vs. OS files (bottom)
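Statistics of this kind could be gathered from a sample of the expected input roughly as follows. This is a sketch only: the function name is hypothetical, and it counts 2-symbol sequences at every offset of the sample.

```python
from collections import Counter


def pair_probabilities(sample: bytes):
    """Estimate the occurrence probability of each 2-symbol sequence in a sample."""
    counts = Counter(sample[i:i + 2] for i in range(len(sample) - 1))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


probs = pair_probabilities(b"\r\n\r\n\x00\x00")
print(probs[b"\r\n"])  # 0.4 (2 of the 5 pairs in the sample)
```

Sequences that never occur in the sample are simply absent from the result, so a caller would treat a missing key as probability 0.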
Page 28
Motif Selection as an ILP Problem
• L: a given string-set
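• $T_L$: all 2-symbol substrings of strings in L
• c(t): cost-function for every t in $T_L$
Minimize $\sum_{t \in T_L} c(t) \cdot x_t$, where $x_t \in \{0,1\}$ for every $t \in T_L$
Subject to: for every $w \in L$,
$\sum_{t \in T_L} x_t \cdot assoc_0(w,t) \ge 1$, and
$\sum_{t \in T_L} x_t \cdot assoc_1(w,t) \ge 1$
($assoc_0$ and $assoc_1$ are not defined on the slide; presumably $assoc_p(w,t) = 1$ when t occurs in w at an offset of parity p, and 0 otherwise)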
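For a handful of strings the formulation above can be checked by exhaustive search. The sketch below is not an ILP solver and not the talk's preprocessing: it enumerates every subset of $T_L$, which is only feasible for tiny examples. It assumes that $assoc_0$ / $assoc_1$ refer to even / odd offsets and uses the probabilities from the earlier slide as the cost function. All names are hypothetical.

```python
from itertools import combinations


def motifs_by_parity(w):
    """Return (even, odd): the 2-symbol substrings of w at even / odd offsets."""
    pairs = [(k, w[k:k + 2]) for k in range(len(w) - 1)]
    return ({t for k, t in pairs if k % 2 == 0},
            {t for k, t in pairs if k % 2 == 1})


def select_motifs(strings, cost):
    """Minimise the summed cost over subsets that cover every string twice."""
    needs = [p for w in strings for p in motifs_by_parity(w)]
    candidates = sorted(set().union(*needs))
    best, best_cost = None, float("inf")
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates, r):
            chosen = set(subset)
            total = sum(cost(t) for t in chosen)
            # Feasible only if each even set and each odd set is hit.
            if total < best_cost and all(chosen & need for need in needs):
                best, best_cost = chosen, total
    return best


strings = ["bore", "core", "corridor"]
probs = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
         "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}

print(sorted(select_motifs(strings, lambda t: 1)))         # ['or', 're']
print(sorted(select_motifs(strings, lambda t: probs[t])))  # ['bo', 'co', 'or']
```

With a unit cost the result is the minimum subset {or, re}; with the occurrence probabilities as cost it becomes {bo, co, or}, which is larger but expected to trigger fewer motif occurrences in the input.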
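Page 29
Q2: How to resolve collisions?
• A2:
- Examine adjacent symbols at relative offsets to eliminate strings
- New structure: The Mangled-Trie
[Figure: the strings ‘bore’, ‘core’ and ‘corridor’ aligned so that their ‘or’ motif sits at relative offset 0, on an axis of relative offsets -6 to 6; ‘corridor’ appears twice, once for each of its ‘or’ occurrences]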
Page 30
The Mangled-Trie
[Figure: the strings ‘bore’, ‘core’ and ‘corridor’ (twice) aligned on the ‘or’ motif, relative offsets -6 to 6, with the decision structure below]
‘or’ Motif at Offset 0
• Resolve: Offset -1
- ‘b’: ‘e’ in Offset 2? YES: “bore” in Offset -1; NO: NO MATCH
- ‘c’: Resolve: Offset 2
    - ‘e’: “core” in Offset -1
    - ‘r’: “idor” in Offset 3? YES: “corridor” in Offset -1; NO: NO MATCH
    - OTHER: NO MATCH
- ‘d’: “corri” in Offset -6? YES: “corridor” in Offset -6; NO: NO MATCH
- OTHER: NO MATCH
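The decision structure for the ‘or’ motif can be transcribed into a small sketch. The tuple encoding and the names `OR_TRIE` and `resolve` are illustrative assumptions, not the talk's data layout; `resolve` assumes the fast-path has already found ‘or’ at input offset `i`.

```python
# Node kinds: ("resolve", relative offset, {symbol: node})
#             ("confirm", fragment, relative offset, (string, start offset))
#             ("report", (string, start offset))
OR_TRIE = ("resolve", -1, {
    "b": ("confirm", "e", 2, ("bore", -1)),
    "c": ("resolve", 2, {
        "e": ("report", ("core", -1)),
        "r": ("confirm", "idor", 3, ("corridor", -1)),
    }),
    "d": ("confirm", "corri", -6, ("corridor", -6)),
})


def resolve(node, text, i):
    """Return (start, string) for the motif occurrence at offset i, or None."""
    while True:
        kind = node[0]
        if kind == "resolve":
            _, rel, branches = node
            pos = i + rel
            if not 0 <= pos < len(text) or text[pos] not in branches:
                return None                  # OTHER -> NO MATCH
            node = branches[text[pos]]
        elif kind == "confirm":
            _, fragment, rel, result = node
            pos = i + rel
            if pos < 0 or not text.startswith(fragment, pos):
                return None                  # NO -> NO MATCH
            node = ("report", result)
        else:                                # "report"
            string, rel = node[1]
            return (i + rel, string)


print(resolve(OR_TRIE, "corridor", 1))  # (0, 'corridor')
print(resolve(OR_TRIE, "corridor", 6))  # (0, 'corridor')
print(resolve(OR_TRIE, "more", 1))      # None
```

Each step inspects a single symbol or one short fragment at a fixed relative offset, so most candidate strings are eliminated without a full comparison.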
Page 32
Q3: How to optimize slow-path?
• A3:
- Optimize Frequent Scenarios: Apply statistics to Mangled-Trie construction
- Improve Motif-Set Quality: Avoid slow-path altogether when possible
Page 33
More Future Work…
• Adaptive System: Collect statistics “on-the-go” and improve motif-set
• Faster Preprocessing: Custom Branch-and-Cut (Margot ’10)
• Regular Expressions
• Hardware Implementation
• Bouma3?…
Page 34
“ Search has always been about
people. It's not an abstract thing.
It's not a formula. It's about getting
people what they need... It depends
on the type of search you do—and
how to take all those signals and
put them together.”
- Udi Manber, Google, 2008
Thank You