bouma2 talk

A High-Performance Input-Aware Multiple String-Match Algorithm Erez Buchnik

Upload: erez-buchnik

Post on 09-Jun-2015


TRANSCRIPT

Page 1: Bouma2 talk

A High-Performance Input-Aware Multiple String-Match Algorithm

Erez Buchnik

Page 2: Bouma2 talk

Agenda

• Problem

• Existing Solutions

• Bouma2 – Model

• Comparisons

• Preprocessing in Detail

• Future Work

Page 3: Bouma2 talk

Agenda

• Problem

• Existing Solutions

• Bouma2 – Model

• Comparisons

• Preprocessing in Detail

• Future Work

Page 4: Bouma2 talk

The Multiple String-Match Problem

• Goal: Given a set of strings and input text, find all occurrences of any of the strings in the text

• Input: Set of strings L and input text M

• Output: Offsets 1 ≤ i ≤ |M| where a substring of M matches any of the strings in L

• Uses: AV, IPS, DPI, DNA Search etc…
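As a baseline for the problem statement above, here is a minimal brute-force sketch in Python. The name `find_all` is illustrative and not from the talk, and the sketch assumes that an "offset" means the 1-based position where a matching substring begins.

```python
def find_all(strings, text):
    """Return sorted (offset, string) pairs for every occurrence of any string.

    Offsets are 1-based, following the slide's 1 <= i <= |M| convention
    (assumed here to denote the start of the matching substring).
    """
    hits = []
    for s in strings:
        start = text.find(s)
        while start != -1:
            hits.append((start + 1, s))
            # Resume one position later so overlapping occurrences are found too.
            start = text.find(s, start + 1)
    return sorted(hits)


print(find_all(["bits", "book"], "habits of rabbits"))
```

This rescans the text once per string, which is exactly the cost that the algorithms in the rest of the talk try to avoid.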

Page 5: Bouma2 talk

The Multiple String-Match Problem - References

• Aho-Corasick ’75

• Commentz-Walter ’79

• Rabin-Karp ’87

• Wu-Manber ’94

• Muth-Manber ’96

• Hopcroft-Motwani-Ullman ’00

• Dori-Landau ’06

Page 6: Bouma2 talk

Agenda

• Problem

• Existing Solutions

• Bouma2 – Model

• Comparisons

• Preprocessing in Detail

• Future Work

Page 7: Bouma2 talk

Stateful Approach (e.g. Aho-Corasick)

• Linear in the length of the input

• Large automatons cause cache-misses and degrade performance

• One state transition per symbol
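To make the stateful approach concrete, here is a compact textbook-style Aho-Corasick sketch in Python. It is not code from the talk: the dict-based automaton and the names `build_automaton` and `ac_search` are illustrative choices, and no attempt is made at a cache-friendly layout.

```python
from collections import deque


def build_automaton(strings):
    """Build goto/fail/output tables; state 0 is the root."""
    goto, fail, out = [{}], [0], [[]]
    for s in strings:
        state = 0
        for ch in s:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(s)
    # Breadth-first pass: a state's failure link points to the longest
    # proper suffix of its path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def ac_search(text, automaton):
    """Return (0-based start offset, string) pairs in order of match end."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for s in out[state]:
            hits.append((i - len(s) + 1, s))
    return hits


automaton = build_automaton(["bore", "core", "corridor", "or"])
print(ac_search("corridor", automaton))
```

Every input symbol drives a lookup in a table whose size grows with the string set, which is the source of the cache-miss concern raised on this slide.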

Page 8: Bouma2 talk

Agenda

• Problem

• Existing Solutions

• Bouma2 – Model

• Comparisons

• Preprocessing in Detail

• Future Work

Page 9: Bouma2 talk

Guidelines

• INTUITIVE: Search for ‘Hints’ of a Match Before the Full Match

• REALISTIC: Use Prior Knowledge of Expected Input

• SIMPLE: Trivial Match Process

Page 10: Bouma2 talk

Bouma2: Motif-Based String Match

• Preprocessing: Map every string to one of its own substrings: its Motif

Set of strings: bore, core, bits, corridor, boat, book, cooks, trek

Set of selected 2-symbol-long substrings (motifs): bi, re, or, at, ok, ek

Q1: How to select motifs?

Page 11: Bouma2 talk

Bouma2: Motif-Based String Match (cont.)

[Figure: the input text “rabbits hate book cooks” with full-match attempts around motif occurrences: “bits” (Match), “boat” (No match), “book” (Match), “cooks” (No match at one motif occurrence, Match at another)]

• Match: Examine symbols 2-by-2 (STATELESS); attempt full match around motif occurrences

Q2: How to resolve collisions?
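The match step can be sketched in Python as follows, using the strings and motifs listed on the preceding slide. The string-to-motif assignment is an assumption (the first 2-symbol substring of each string that is in the motif set), since the transcript does not preserve the slide's arrows. For simplicity this version examines every offset rather than striding.

```python
MOTIFS = {"bi", "re", "or", "at", "ok", "ek"}
STRINGS = ["bore", "core", "bits", "corridor", "boat", "book", "cooks", "trek"]


def build_motif_map(strings, motifs):
    """Map each motif to the (string, index of the motif inside it) pairs it stands for."""
    table = {}
    for s in strings:
        for k in range(len(s) - 1):
            if s[k:k + 2] in motifs:
                # Assumed rule: take the first substring that is a motif.
                table.setdefault(s[k:k + 2], []).append((s, k))
                break
        else:
            raise ValueError(f"no motif covers {s!r}")
    return table


def motif_match(text, table):
    """Stateless scan: look up each 2-symbol window, then verify around it."""
    hits = []
    for i in range(len(text) - 1):
        for s, k in table.get(text[i:i + 2], ()):
            start = i - k
            if start >= 0 and text.startswith(s, start):
                hits.append((start, s))
    return hits


print(motif_match("rabbits hate book cooks", build_motif_map(STRINGS, MOTIFS)))
```

On this input the ‘at’ in “hate” triggers a failed attempt for “boat”, and each ‘ok’ triggers attempts for both “book” and “cooks”, which is the collision the question on this slide refers to.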

Page 12: Bouma2 talk

Capturing all Occurrences

[Figure: the input text “habits of rabbits” with both occurrences of “bits” marked Match]

• Even-offset occurrences and odd-offset occurrences require separate passes, but instead…

Page 13: Bouma2 talk

Upgrade #1: 2-Symbol Strides

[Figure: the input text “habits of rabbits” with both occurrences of “bits” marked Match]

• We map each string TWICE: once to an even-offset motif, and once to an odd-offset motif
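A minimal Python sketch of the double mapping, under a simplifying assumption: the even-offset motif is always the string's first two symbols and the odd-offset motif is the two symbols starting at index 1. The talk selects motifs by optimization instead, and the function names here are illustrative.

```python
def build_stride_table(strings):
    """Map each string twice: motif at index 0 (even) and at index 1 (odd)."""
    table = {}
    for s in strings:
        if len(s) < 3:
            raise ValueError("this sketch needs strings of length >= 3")
        for k in (0, 1):
            table.setdefault(s[k:k + 2], []).append((s, k))
    return table


def stride_match(text, table):
    """Single pass that examines only even text offsets (2-symbol strides)."""
    hits = []
    for i in range(0, len(text) - 1, 2):
        for s, k in table.get(text[i:i + 2], ()):
            start = i - k
            if start >= 0 and text.startswith(s, start):
                hits.append((start, s))
    return hits


print(stride_match("habits of rabbits", build_stride_table(["bits"])))
```

An occurrence that starts at an even text offset is caught through its even-index motif, and one that starts at an odd offset through its odd-index motif, so one pass over half the positions finds both.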

Page 14: Bouma2 talk

Upgrade #2: Fast-Path / Slow-Path

[Figure: the fast-path scans “habits of rabbits” and reports offsets 4 and 14]

• Fast-Path:

- Stateless

- “Monolithic” (zero branches)

- Cache-Aware (small direct-table)

- SIMPLE…

Page 15: Bouma2 talk

Upgrade #2: Fast-Path / Slow-Path

[Figure: the offsets 4 and 14 reported by the fast-path are passed to the slow-path, which confirms both occurrences of “bits” in “habits of rabbits” (Match)]

• Slow-Path:

- Memory-Efficient (pointers to original strings for comparison)

- “Localized” (separate structure for every motif)
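The split between the two paths can be sketched in Python as below. Several things are assumptions: the motifs ‘ts’ and ‘it’ for “bits” are chosen only because they reproduce the offsets 4 and 14 shown on the slides (0-based), the 64 KiB flag table is one possible reading of "small direct-table", and Python cannot express a truly branch-free loop, so the fast path here only shows the data flow.

```python
def build_direct_table(motifs):
    """One flag per possible 2-byte value, indexed directly without hashing."""
    table = bytearray(65536)
    for m in motifs:
        table[(m[0] << 8) | m[1]] = 1
    return table


def fast_path(data, table):
    """Stateless scan in 2-byte strides; returns offsets where a motif occurs."""
    return [i for i in range(0, len(data) - 1, 2)
            if table[(data[i] << 8) | data[i + 1]]]


def slow_path(data, offsets, per_motif):
    """Verify candidates using a separate small structure per motif."""
    hits = []
    for i in offsets:
        for s, k in per_motif[data[i:i + 2]]:
            start = i - k
            if start >= 0 and data.startswith(s, start):
                hits.append((start, s))
    return hits


# Hypothetical mapping for "bits": even-index motif "ts", odd-index motif "it".
PER_MOTIF = {b"ts": [(b"bits", 2)], b"it": [(b"bits", 1)]}
DATA = b"habits of rabbits"

candidates = fast_path(DATA, build_direct_table(PER_MOTIF))
print(candidates)
print(slow_path(DATA, candidates, PER_MOTIF))
```

Only the offsets that survive the fast path touch the per-motif structures and the original strings, which keeps the hot loop's working set small.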

Page 16: Bouma2 talk

Agenda

• Problem

• Existing Solutions

• Bouma2 – Model

• Comparisons

• Preprocessing in Detail

• Future Work

Page 17: Bouma2 talk

Bouma2 vs. Aho-Corasick

• n – length of input

• S – no. of string-matches in n

• m – no. of motif-matches in n

• l – length of the longest string

• Match Complexities:

- Aho-Corasick: $O(n + S)$

- Bouma2: $O\left(\frac{n}{2} + m \cdot l\right)$

Page 18: Bouma2 talk

Bouma2 vs. Aho-Corasick (Speed)

• In practice, Bouma2 is usually at least twice as fast as Aho-Corasick

• Fast-path alone is 10 times faster

[Figure: speed chart; series: Bouma2 Fast-Path, Bouma2 Slow-Path (Sub-Optimal), Aho-Corasick]

Q3: How to optimize slow-path?

Page 19: Bouma2 talk

Bouma2 vs. Aho-Corasick (Cache)

• Bouma2 exhibits 8.5 times fewer cache-misses than Aho-Corasick (fast-path + slow-path)

[Figure: cache-miss chart; series: Bouma2 Cache-Misses, Aho-Corasick Cache-Misses]

Page 20: Bouma2 talk

Bouma2 vs. Aho-Corasick (Memory)

• Bouma2 footprint is less than 70% of Aho-Corasick for textual search (down to 35% in other cases)

[Figure: memory chart; series: Bouma2 Fast-Path, Bouma2 Slow-Path, Aho-Corasick, Original Strings]

Page 21: Bouma2 talk

Agenda

• Problem

• Existing Solutions

• Bouma2 – Model

• Comparisons

• Preprocessing in Detail

• Future Work

Page 22: Bouma2 talk

Q1: How to select motifs?

• A1: Out of all 2-symbol substrings, find a minimum subset that covers all given strings (even & odd offsets)

Offset   String          bo   co   do   id   or   re   ri   rr
Even     bo re           •    -    -    -    -    •    -    -
Even     co re           -    •    -    -    -    •    -    -
Even     co rr id or     -    •    -    •    •    -    -    •
Odd      b or e          -    -    -    -    •    -    -    -
Odd      c or e          -    -    -    -    •    -    -    -
Odd      c or ri do r    -    -    •    -    •    -    •    -
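For the three strings in this example, the minimum covering subset can be found by exhaustive search. The Python sketch below is illustrative only (the function names are not from the talk) and assumes that "covers" means every string contains at least one selected motif at an even index and at least one at an odd index.

```python
from itertools import combinations


def pairs_by_parity(s):
    """2-symbol substrings of s that start at even and at odd indices."""
    even = {s[k:k + 2] for k in range(0, len(s) - 1, 2)}
    odd = {s[k:k + 2] for k in range(1, len(s) - 1, 2)}
    return even, odd


def minimum_motif_set(strings):
    """Smallest set of 2-symbol substrings covering every string at both parities."""
    parities = [pairs_by_parity(s) for s in strings]
    candidates = sorted(set().union(*(e | o for e, o in parities)))
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(chosen & e and chosen & o for e, o in parities):
                return chosen
    return None


print(sorted(minimum_motif_set(["bore", "core", "corridor"])))
```

Exhaustive search is exponential in the number of candidates, so it only serves to check small examples like this one.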

Page 23: Bouma2 talk

Q1: How to select motifs?

• But… maybe the minimum subset is not the optimal subset?

Offset   String          bo   co   do   id   or   re   ri   rr
Even     bo re           Χ    -    -    -    -    √    -    -
Even     co re           -    Χ    -    -    -    √    -    -
Even     co rr id or     -    Χ    -    Χ    √    -    -    Χ
Odd      b or e          -    -    -    -    √    -    -    -
Odd      c or e          -    -    -    -    √    -    -    -
Odd      c or ri do r    -    -    Χ    -    √    -    Χ    -

Page 24: Bouma2 talk

Q1: How to select motifs?

• Bad selection of motifs for English text searches: substrings of ‘the’ - the most common word in English

[Figure: the input text “The good, the bad and the ugly” in theaters nearby; full-match attempts for “theater” triggered by motifs taken from ‘the’ end in No match, except in “theaters” (Match)]

Offset   String        at   ea   er   he   te   th
Even     th ea te r    -    Χ    -    -    Χ    √
Odd      t he at er    Χ    -    Χ    √    -    -

Page 25: Bouma2 talk

Q1: How to select motifs?

• Use input-specific occurrence statistics to optimize motif-sets

• REALISTIC…

2-Symbol Sequence   Occurrence Probability
bo                  0.0002
re                  0.001861
co                  0.001028
rr                  0.000031
id                  0.001756
or                  0.000444
ri                  0.000284
do                  0.000151
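With these probabilities, the effect of the choice can be checked by simple arithmetic. The Python sketch below compares the minimum subset (or, re) with the three-motif set (bo, co, or) from this deck; summing the probabilities as "the chance that an examined 2-symbol window is a motif" is an interpretation, not a statement from the talk.

```python
PROB = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
        "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}


def motif_hit_probability(motifs, prob=PROB):
    """Probability that a given 2-symbol window equals one of the motifs.

    A window holds exactly one 2-symbol value, so the events are disjoint
    and the probabilities simply add up.
    """
    return sum(prob[m] for m in motifs)


minimum_set = motif_hit_probability({"or", "re"})
weighted_set = motif_hit_probability({"bo", "co", "or"})
print(f"or+re:    {minimum_set:.6f}")
print(f"bo+co+or: {weighted_set:.6f}")
```

The larger set is hit less often (about 0.0017 versus 0.0023 per window), so it would send fewer positions to the slow path even though it contains more motifs.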

Page 26: Bouma2 talk

Q1: How to select motifs?

• NOTE: After selecting the motif-set, remove redundant mappings from the final String-to-Motif mapping

Offset   String          bo   co   do   id   or   re   ri   rr
Even     bo re           √    -    -    -    -    Χ    -    -
Even     co re           -    √    -    -    -    Χ    -    -
Even     co rr id or     -    √    -    Χ    √    -    -    Χ
Odd      b or e          -    -    -    -    √    -    -    -
Odd      c or e          -    -    -    -    √    -    -    -
Odd      c or ri do r    -    -    Χ    -    √    -    Χ    -

Page 27: Bouma2 talk

Statistics for Motif Selection

[Figure, top: occurrence counts of 2-symbol sequences in IP traffic (y-axis: occurrences, more than 100,000; x-axis: 0 to 70000); labeled peaks: “\r\n”, 00 00, FF FF]

[Figure, bottom: occurrence counts of 2-symbol sequences in OS files (y-axis: occurrences, more than 40,000; x-axis: 0 to 70000); labeled peaks: 00 00, “??”, FF FF]

• 2-symbol sequence statistics: IP traffic (top) vs. OS files (bottom)

Page 28: Bouma2 talk

Motif Selection as an ILP Problem

• L: a given string-set

• $T_L$: all 2-symbol substrings of strings in L

• c(t): cost-function for every t in $T_L$

Minimize $\sum_{t \in T_L} c(t) \cdot x_t$, where $x_t \in \{0,1\}$ for every $t \in T_L$

Subject To: for every $w \in L$,

$\sum_{t \in T_L} x_t \cdot \mathrm{assoc}_0(w,t) \ge 1$, and $\sum_{t \in T_L} x_t \cdot \mathrm{assoc}_1(w,t) \ge 1$
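For a small string-set this program can be solved by enumerating all 0/1 assignments. The Python sketch below does so for bore, core and corridor; it assumes that assoc with subscript 0 (or 1) equals 1 exactly when t occurs in w at an even (or odd) index, and that c(t) is the occurrence probability from the statistics slide. Both are readings of the slides rather than definitions given in the talk, and a real implementation would use an ILP solver.

```python
from itertools import product

PROB = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
        "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}


def assoc(w, t, parity):
    """Assumed meaning: 1 if t occurs in w at an index of this parity, else 0."""
    return int(any(w[k:k + 2] == t for k in range(parity, len(w) - 1, 2)))


def solve_motif_ilp(L, c):
    """Enumerate x in {0,1}^|T_L| and keep the cheapest feasible assignment."""
    T = sorted({w[k:k + 2] for w in L for k in range(len(w) - 1)})
    best, best_cost = None, float("inf")
    for x in product((0, 1), repeat=len(T)):
        feasible = all(
            sum(xt * assoc(w, t, parity) for xt, t in zip(x, T)) >= 1
            for w in L for parity in (0, 1)
        )
        if not feasible:
            continue
        cost = sum(c[t] * xt for xt, t in zip(x, T))
        if cost < best_cost:
            best, best_cost = {t for xt, t in zip(x, T) if xt}, cost
    return best, best_cost


L = ["bore", "core", "corridor"]
print(solve_motif_ilp(L, PROB))                  # probability-weighted cost
print(solve_motif_ilp(L, {t: 1 for t in PROB}))  # unit cost: minimum subset
```

With unit costs the program reduces to the minimum-subset question, while probability costs favour rarer motifs even if more of them are needed.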

Page 29: Bouma2 talk

Q2: How to resolve collisions?

• A2:

- Examine adjacent symbols at relative offsets to eliminate strings

- New structure: The Mangled-Trie

[Figure: the strings “bore”, “core” and “corridor” (shown twice) laid out on a relative-offset axis from -6 to 6]

Page 30: Bouma2 talk

The Mangled-Trie

[Figure: the strings “bore”, “core” and “corridor” (shown twice) laid out on a relative-offset axis from -6 to 6, with markers 1, 2 and 3]

‘or’ Motif at Offset 0

Resolve: Offset -1
    ‘b’: ‘e’ in Offset 2?
        YES: “bore” in Offset -1
        NO: NO MATCH
    ‘c’: Resolve: Offset 2
        ‘e’: “core” in Offset -1
        ‘r’: “idor” in Offset 3?
            YES: “corridor” in Offset -1
            NO: NO MATCH
        OTHER: NO MATCH
    ‘d’: “corri” in Offset -6?
        YES: “corridor” in Offset -6
        NO: NO MATCH
    OTHER: NO MATCH
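The decision structure for the ‘or’ motif can be written out as a hand-coded resolver. The Python sketch below follows the branches on this slide for the strings bore, core and corridor; the function names are illustrative, and a real Mangled-Trie would be generated from the string-set rather than hard-coded.

```python
def at(text, pos, expected):
    """True if `expected` occurs in `text` starting exactly at `pos`."""
    return pos >= 0 and text.startswith(expected, pos)


def resolve_or(text, i):
    """Resolve an 'or' motif found at offset i; return (start, string) or None."""
    prev = text[i - 1] if i >= 1 else ""
    if prev == "b":
        # Only "bore" has 'b' at relative offset -1; confirm its last symbol.
        return (i - 1, "bore") if at(text, i + 2, "e") else None
    if prev == "c":
        # "core" and "corridor" (first 'or') collide; offset 2 tells them apart.
        nxt = text[i + 2:i + 3]
        if nxt == "e":
            return (i - 1, "core")
        if nxt == "r":
            return (i - 1, "corridor") if at(text, i + 3, "idor") else None
        return None
    if prev == "d":
        # Second 'or' of "corridor": the string starts 6 symbols earlier.
        return (i - 6, "corridor") if at(text, i - 6, "corri") else None
    return None


text = "a core corridor bore"
for i in range(len(text) - 1):
    if text[i:i + 2] == "or":
        print(i, resolve_or(text, i))
```

Each branch inspects only the symbols needed to tell the remaining candidates apart, so most motif occurrences are dismissed after one or two comparisons.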

Page 31: Bouma2 talk

Agenda

• Problem

• Existing Solutions

• Bouma2 – Model

• Comparisons

• Preprocessing in Detail

• Future Work

Page 32: Bouma2 talk

Q3: How to optimize slow-path?

• A3:

- Optimize Frequent Scenarios: Apply statistics to Mangled-Trie construction

- Improve Motif-Set Quality: Avoid slow-path altogether when possible

Page 33: Bouma2 talk

More Future Work…

• Adaptive System: Collect statistics “on-the-go” and improve motif-set

• Faster Preprocessing: Custom Branch-and-Cut (Margot ’10)

• Regular Expressions

• Hardware Implementation

• Bouma3?…

Page 34: Bouma2 talk

“ Search has always been about

people. It's not an abstract thing.

It's not a formula. It's about getting

people what they need... It depends

on the type of search you do—and

how to take all those signals and

put them together.”

- Udi Manber, Google, 2008

Page 35: Bouma2 talk

Thank You