
    The Raymond and Beverly Sackler Faculty of Exact Sciences
    The Blavatnik School of Computer Science

    High Performance Deep Packet Inspection

    Thesis submitted for the degree of Doctor of Philosophy

    by

    Yaron Koral

    This work was carried out under the supervision of
    Professor Yehuda Afek and Doctor Anat Bremler-Barr

    Submitted to the Senate of Tel Aviv University
    September 2012

    © 2012
    Copyright by Yaron Koral
    All Rights Reserved

    This work is dedicated to the pursuit of a safe and secure world.

    Acknowledgements

    First and foremost, I would like to thank my advisors, Yehuda Afek and Anat Bremler-Barr, for their continued support and guidance throughout my Ph.D. I have learned a lot from you, whether in doing research, writing papers, or giving presentations. Above all, you taught me how to walk in the world of science and to think sharply.

    I had the pleasure of working with the following people: David Hay, Yotam Harchol, Shimrit Tzur-David and Victor Zigdon. I thank you for your companionship and support. Working with you was both enriching and a great delight.

    Last, and certainly most, I thank my family: my beloved wife Keren; my charming kids Omer, Ofri, Romi and Yarden; and my parents Akiva and Rahel, for their unfailing love, encouragement and support.

    The work in this thesis was partially supported by the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013)/ERC Grant agreement no. 259085.

    Abstract

    Deep packet inspection (DPI) is a form of network packet filtering that searches a packet’s content for the presence of certain patterns. The inspected content includes headers and data-protocol structures as well as the payload of the message. DPI enables advanced network management, user services, and security functions, as well as Internet data mining, eavesdropping, and censorship. It is currently used by enterprises, service providers, and governments in a wide range of applications.

    DPI may be implemented by a wide range of pattern matching algorithms. The general problem of pattern matching is considered fundamental in computer science and has been researched thoroughly over the last decades. Still, when applied to the network domain of recent years, the traditional algorithms fail to meet current challenges. The first challenge is the continual increase in Internet traffic rates, which requires a design that is scalable in both speed and memory usage. The second challenge arises from the increase in Web traffic compression, driven by the growing popularity of Web surfing over mobile devices. A security device is forced to decompress this traffic prior to inspection, which in turn incurs processing and space penalties. The third challenge is the requirement for a solution that is resilient to attacks that overload the security device. We address these challenges here. Moreover, we apply several technological advances to boost the performance of the traditional algorithms, including, for example, the presence of Ternary Content Addressable Memory (TCAM) elements in network devices and the availability of multi-core platforms for the DPI task.

    The work presented in this thesis focuses on DPI algorithms and techniques that relate to network security elements. In Chapter 3, we provide an algorithm for a scalable design of a DPI engine. Our design reduces the problem of pattern matching to the well-studied problem of Longest Prefix Match (LPM), which can be solved either in TCAM, in IP-lookup chips, or in software.

    Next we deal with the challenge of DPI over compressed traffic. Chapters 4 and 5 focus on reducing the space and time penalties resulting from compressed traffic. These works show that, by using the meta-data generated during the compression stage, pattern matching over compressed traffic can be accelerated significantly as compared to traditional pattern matching over non-compressed traffic, and that the space penalty can be reduced by a factor of six as compared to current designs. Chapter 6 introduces an algorithm for scanning traffic compressed by SDCH, the compression scheme used by Google. Our design gains a performance boost of over 40%.

    Finally, we address the challenge of performing DPI when the system is under a denial-of-service attack mounted through algorithmic complexity attacks. We provide a system design that takes advantage of commercial multi-core platforms to efficiently mitigate complexity attacks of varying intensity.

    The algorithms and techniques presented in this thesis provide a suitable DPI solution that confronts today’s network challenges.

    Contents

    1 Introduction
      1.1 Method
      1.2 Overview of Results
        1.2.1 CompactDFA
        1.2.2 SOP Algorithm
        1.2.3 SPC Algorithm
        1.2.4 SDCH
        1.2.5 MCA2
      1.3 Related Work
        1.3.1 DFA Compression
        1.3.2 Compressed Web-Traffic
        1.3.3 DPI Using Multi-Core Platforms
        1.3.4 Denial-of-Service Mitigation
      1.4 Significance

    2 Background
      2.1 DFA based Pattern Matching
      2.2 Compressed Web-Traffic
        2.2.1 Gzip Compression
        2.2.2 SDCH Compression
      2.3 Complexity attack

    3 CompactDFA
      3.1 The CompactDFA Scheme
        3.1.1 CompactDFA Output
        3.1.2 CompactDFA Algorithm

        3.1.3 The Aho-Corasick Algorithm-like Properties
        3.1.4 Stage I: State Grouping
        3.1.5 Stage II: Common Suffix Tree
        3.1.6 Stage III: State and Node Encoding
      3.2 CompactDFA for total memory minimizations
      3.3 CompactDFA for DFA with strides
      3.4 Implementing CompactDFA using IP-lookup Solutions
        3.4.1 Implementing CompactDFA with non-TCAM IP-lookup solutions
        3.4.2 Implementing CompactDFA with TCAM
      3.5 Experimental Results

    4 Space Efficient DPI of Compressed Web Traffic
      4.1 SOP Packing technique
        4.1.1 Buffer Packing: Swap Out of boundary Pointers (SOP)
        4.1.2 Huffman Coding Scheme
        4.1.3 Unpacking the Buffer: Gzip Decompression
      4.2 Combining SOP with ACCH algorithm
      4.3 Experimental Results
        4.3.1 Experimental Environment
        4.3.2 Data Set
        4.3.3 Space and Time Results
        4.3.4 Time Results Analysis
        4.3.5 DPI of Compressed Traffic

    5 Shift-based Pattern Matching for Compressed Traffic
      5.1 The Modified Wu-Manber Algorithm
      5.2 Shift-based Pattern matching for Compressed traffic (SPC)
      5.3 Experimental Results
        5.3.1 Data Set
        5.3.2 Pattern Set
        5.3.3 SPC Characteristics Analysis
        5.3.4 SPC Run-Time Performance
        5.3.5 SPC Storage Requirements

    6 Decompression-Free Inspection
      6.1 Our Decompression-Free algorithm
        6.1.1 Motivating Example
        6.1.2 Correctness
        6.1.3 Optimizations
        6.1.4 Dealing with Gzip over SDCH
      6.2 Regular Expressions Inspection
      6.3 Experimental Results

    7 MCA2
      7.1 Snort Cache-Miss Complexity Attack
      7.2 The MCA2 System Description
        7.2.1 MCA2 Design overview
        7.2.2 Cross-Thread Communication Mechanism
        7.2.3 Thread Allocation Scheme
        7.2.4 Flow Affinity
      7.3 MCA2 for Cache-Miss Attacks
      7.4 MCA2 for Active-States Attacks
      7.5 Experimental Results
        7.5.1 Experimental Environment
        7.5.2 Cache-Miss Attack Simulation Results
        7.5.3 Active-State Attack Simulation Results

    8 Conclusion

    Bibliography

    List of Tables

    3.1 Statistics of the pattern sets used in Section 3.5
    3.2 Summary of experimental results for Snort and ClamAV pattern sets
    4.1 Comparison of Time and Space parameters of different algorithms
    4.2 Overview of pattern matching with gzip processing
    5.1 Storage Requirements (KB)
    6.1 Step by step execution of our algorithm on the example of Section 6.1.1
    7.1 No-drop setting parameters
    7.2 The non-common states ratio
    7.3 Validation of the thread allocation model of Section 7.2.3

    List of Figures

    1.1 The goodput of MCA2 for different attack intensities
    2.1 Example of an Aho-Corasick DFA and methods to store it in memory
    2.2 LZ77 example on Yahoo! home page
    3.1 Aho-Corasick DFA toy example
    3.2 Illustration of the intra-flow interleaving on a single packet
    3.3 Expansion factor under Truncated CompactDFA
    3.4 The distribution of the values C
    3.5 The latency of inter-flow and intra-flow interleaving
    4.1 Sketch of the gzip 32KB memory buffer
    4.2 Sketch of the memory buffer in different scenarios
    4.3 Illustration of common terms
    4.4 Sketch of the memory buffer including the Status Vector
    4.5 HTTP Compression usage among the Alexa top-site lists
    5.1 MWM algorithm example
    5.2 Pointer scan procedure example
    5.3 Skipped Character Ratio (Sr)
    5.4 Normalized Throughput
    6.1 Example of an Aho-Corasick Automaton
    6.2 The depth of first three states of each failure path
    6.3 Comparison between the scan-ratio and compression-ratio
    6.4 Comparison when considering also regular expression matching
    7.1 The effects of a cache-miss attack
    7.2 Illustration of MCA2
    7.3 Sketch of a record in the bad packet queue
    7.4 Distribution of cache-misses under normal traffic and under attack
    7.5 CDF of the percentage of normal traffic packets
    7.6 The total system throughput for a different number of common states
    7.7 CDF of the percentage of mild attack
    7.8 Distribution of maximal average number of active states
    7.9 Average throughput per thread over time
    7.10 Goodput of Hybrid-FA and of Hybrid-FA with MCA2 full-drop setup

    Chapter 1

    Introduction

    Deep packet inspection (DPI) consists of inspecting both the packet header and payload and alerting the system when signatures of malicious software appear in the traffic. These signatures are identified through pattern matching algorithms, which are classified as either string matching, in which the patterns are a set of strings, or regular expression matching, in which the patterns are defined as regular expressions. DPI is a basic element in today’s security tools, such as Network Intrusion Detection/Prevention Systems (NIDS/NIPS) and Web application firewalls, which are used to detect malicious activities. Moreover, DPI and its corresponding pattern matching algorithms are also crucial building blocks for other networking applications such as traffic monitoring and HTTP load-balancing. Today, the performance of security tools is dominated by the speed of the underlying pattern matching algorithms [49].

    Both string matching and regular expression matching are fundamental problems in computer science and have been a topic of intensive research for decades. In what follows, we provide a brief description of the main approaches to these problems and explain why they are not adequate for contemporary needs.

    The fundamental string matching paradigm derives from the Aho-Corasick (AC) [23] algorithm. This algorithm constructs a deterministic finite automaton (DFA) that detects all occurrences of any pattern from a given set by processing the input in a single pass, performing a state transition for each input byte. An alternative is the shift-based paradigm, which includes the Boyer-Moore (BM) [32] and modified Wu-Manber (MWM) [94] algorithms. This paradigm aims at improving the average-case performance by exploiting a heuristic approach for skipping portions of the input. These algorithms achieve sublinear performance on average.
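The single-pass behavior described above can be sketched as follows. This is a minimal illustrative Python implementation of the classic Aho-Corasick construction (goto, failure, and output functions), not the thesis’s optimized variant:

```python
from collections import deque

def build_ac(patterns):
    """Aho-Corasick construction: trie (goto), failure links and output sets."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:                      # 1) build the trie of all patterns
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    q = deque(goto[0].values())               # 2) BFS computes failure links
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            cand = goto[f].get(ch, 0)
            fail[t] = cand if cand != t else 0
            out[t] |= out[fail[t]]            # matches may end as suffixes here
    return goto, fail, out

def search(text, goto, fail, out):
    """Single pass over the input: one state transition per input character."""
    s, matches = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]                       # follow failure links on a miss
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

For example, searching "ushers" against {"he", "she", "his", "hers"} reports "she", "he", and "hers", each with its start offset, in one left-to-right pass.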


    Likewise, regular expression matching also has two mainstream approaches, based on either a deterministic or a non-deterministic finite automaton (DFA or NFA). The DFA has superior runtime complexity of O(1), and thus constant time per input byte, as compared with an O(n²) runtime complexity for the NFA. The NFA, on the other hand, has superior space complexity of O(n) linear space, as compared with O(2ⁿ) exponential space for the DFA. Complexity is calculated as a function of n, the length of the regular expression [54].
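The exponential space gap can be observed directly on the textbook family of expressions (a|b)*a(a|b)^(n-1), "the n-th symbol from the end is a". The sketch below (an illustration, not from the thesis) hard-codes the (n+1)-state NFA for this language and counts the states produced by subset construction; the determinized automaton has 2ⁿ reachable states:

```python
def nfa_last_nth_a(n):
    """Step function of an (n+1)-state NFA accepting strings over {a,b}
    whose n-th symbol from the end is 'a'."""
    def step(states, ch):
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)              # (a|b)* self-loop at the start state
                if ch == 'a':
                    nxt.add(1)          # guess: this 'a' is n-th from the end
            elif s < n:
                nxt.add(s + 1)          # consume the remaining n-1 symbols
        return frozenset(nxt)
    return step

def count_dfa_states(n):
    """Subset construction: BFS over reachable state sets of the NFA."""
    step = nfa_last_nth_a(n)
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        S = todo.pop()
        for ch in 'ab':
            T = step(S, ch)
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return len(seen)
```

Each reachable subset encodes which of the last n symbols were 'a', so count_dfa_states(n) returns 2ⁿ while the NFA keeps only n+1 states.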

    None of the aforementioned traditional solutions is suitable for coping with today’s requirements, due to problems with scalability, compressed traffic, and resiliency. We detail these problems below.

    Scalability  It is essential to increase the speed and reduce the memory requirements of pattern matching solutions. As DPI is performed in the critical path of packet processing, current solutions must handle network speeds of 10–100 Gbps. Moreover, the solutions must deal with thousands of patterns. For example, the ClamAV [4] virus-signature database consists of 61K patterns, and the popular Snort NIDS [9] has more than 30K patterns. Typically, the number of patterns considered by NIDS grows dramatically over time. The size of the pattern database prohibits storing it entirely in a fast memory such as the CPU cache or SRAM; thus, the memory requirement of a solution has a direct effect on its time performance. Current research has focused on reducing the memory requirement by compressing the corresponding DFA [24, 45, 85, 88, 89]; however, all proposed techniques suggest a pure-hardware solution, which usually incurs prohibitive deployment and development costs.

    Compressed Traffic  Scalable DPI solutions should support increasing rates of Internet traffic. One method for supporting these rates at network servers is to compress the outgoing traffic, thereby transferring data more efficiently. This method is used today for compressing HTTP text when transferring pages over the Web. The sharply increasing number of compressed Web pages is largely motivated by the increase in Web surfing over mobile devices. Sites such as Yahoo!, Google, MSN, YouTube, Facebook and others use HTTP compression to enhance the speed of their content download. For example, in February 2012, W3Techs published a ranking breakdown report [10], which shows that 44.7% of Web sites compress their traffic; among the top 1,000 sites, a remarkable 83.4% compress their traffic. HTTP 1.1 has a standards-based method of delivering this compressed content; therefore it is supported by all modern browsers.

    The presence of compressed traffic makes Internet traffic harder for the DPI routine to analyze, because doing so requires two time-consuming phases: traffic decompression and pattern matching. Currently, most security tools fail to analyze compressed traffic. In some cases they simply do not scan compressed traffic, thus compromising their original goal of detecting malicious activities. In other cases, the security tools ensure that there is no compressed traffic by rewriting the HTTP header between the original client and server; this solution leads to a waste of bandwidth and a higher per-bit cost.

    Resiliency  Increased traffic rates and compressed traffic are legitimate Internet phenomena. Still, NIDS and firewalls, as the security tools that protect against malicious users, naturally become a favored target for illegitimate phenomena such as denial-of-service attacks. A recent trend is a two-phase combined attack on security devices: the attackers first neutralize the device, for example by overwhelming it with traffic, and then, once it has been knocked down, attack the assets it was protecting. For example, a recent attack on SONY combined a distributed denial-of-service (DDoS) attack with credit card theft [16]. Combined attacks usually affect NIDS and NIPS differently. In a NIDS, where the device operates in stealth mode and only monitors the traffic, issuing alerts when it detects malicious activity, these DDoS attacks may force the device to stop inspecting part, or all, of the traffic, thereby allowing another attack to pass unnoticed. An in-line NIPS, on the other hand, because it inspects packets on their critical path, might be forced to drop legitimate traffic, causing, in practice, a denial-of-service on the servers it is supposed to protect. Bro and Snort, for example, are both vulnerable to this kind of attack [65].

    1.1 Method

    The research presented in this thesis requires multiple methods from different disciplines. A major effort was invested in finding new algorithms and design approaches for DPI. We analyze the performance of the algorithms for both the average case, with normal routine traffic, and the worst case, with traffic from malicious users. The normal traffic is obtained either from prepared data traces, such as the DARPA MIT traces in [5], or from traces collected in our simulation environment using live Internet traffic. The worst-case traffic traces are usually synthesized with respect to the attack in question. Since most of today’s security-tool source code and pattern sets are freely available to the public, an adversary may devise a tailored attack accordingly. Therefore we consider both adaptive adversaries, which can observe the actions of our proposed scheme and respond accordingly in real time, and oblivious adversaries, which cannot.

    In addition, we verify the performance of our proposed schemes with simulations and experiments. This is especially crucial for heuristics that have no theoretical performance guarantees. More specifically, we implement all our software solutions and test their performance using both synthetic and real pattern sets, including patterns and regular expressions from contemporary security applications such as Snort [9], ModSecurity [6], Bro [2] and ClamAV [4]. Our simulation environment includes multi-core platforms with different cache architectures, to test their influence on our proposed algorithms.

    1.2 Overview of Results

    The following subsections briefly describe five research results presented in this thesis.

    1.2.1 CompactDFA: Scalable Pattern Matching using Longest Prefix Match Solutions

    Recently, much effort has been devoted to compressing the Aho-Corasick DFA in order to improve the algorithm’s performance [28, 63, 64, 66, 76, 87–89, 93, 96]. Other works, such as [77], focused on the construction of the above compressed DFAs using small intermediate parsers. While most of these works either suggest dedicated hardware solutions or introduce a non-constant, higher processing time, we present a generic DFA compression algorithm and show how to store the resulting DFA in off-the-shelf hardware. Our novel algorithm works on a large class of Aho-Corasick–like DFAs, whose unique properties are defined in the following chapter. The algorithm reduces the rule set to the minimum possible size: only one rule per state. A key observation is that in prior works the state codes were chosen arbitrarily; we take advantage of this degree of freedom and embed information about a state’s properties in its code. This allows us to encode all transitions to a specific state by a single prefix that captures a set of current states. Moreover, if a state matches more than one rule, the rule with the longest prefix is selected. Thus, our scheme reduces the problem of pattern matching to the well-studied problem of Longest Prefix Match (LPM).
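The flavor of this reduction can be conveyed with a toy LPM lookup. The state codes and rules below are hypothetical, chosen only to illustrate how one prefix rule can stand for many current states and how the longest matching prefix wins; they do not reproduce the actual CompactDFA encoding:

```python
def lpm_lookup(rules, key):
    """Longest Prefix Match over binary-string prefixes.
    rules: list of (prefix, next_state); key: fixed-width bit string
    (here, a current-state code concatenated with symbol bits)."""
    best, best_len = None, -1
    for prefix, value in rules:
        if key.startswith(prefix) and len(prefix) > best_len:
            best, best_len = value, len(prefix)
    return best

# Hypothetical rule set: a transition shared by many current states is
# written once, as a prefix covering all of their codes.
rules = [
    ("",    "s0"),   # default rule: fall back to the root state
    ("1",   "s1"),   # any state whose code starts with 1
    ("101", "s2"),   # a more specific prefix; longest match wins
]
assert lpm_lookup(rules, "10110") == "s2"
assert lpm_lookup(rules, "11000") == "s1"
assert lpm_lookup(rules, "01010") == "s0"
```

In hardware, the same lookup is exactly what a TCAM or an IP-lookup chip performs natively, which is the point of the reduction.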

    The reduction in the number of rules comes with a small overhead in the number of bits used to encode a state. For example, a DFA based on Snort requires 17 bits to encode each state with an arbitrary code, while under our scheme it requires 36 bits; for ClamAV’s DFA the code width increases from 22 bits to 59 bits.

    In addition, we present two extensions to our basic scheme. Our first extension, called CompactDFA for total memory minimization, aims at minimizing the product of the number of rules and the code width, rather than only the number of rules. This captures situations in which, at some point, the reduction in the number of rules is not worth the additional bits in the state code. Specifically, for the above pattern sets, this extension reduced the memory requirement by up to an additional 40%. The second extension deals with variable-stride DFAs, which were created to speed up the inspection process by inspecting more than one symbol at a time; like the first extension, it requires more rules than the number of states.

    One of the main advantages of CompactDFA is that it fits into commercially available IP-lookup solutions, implying that they may also be used for performing fast pattern matching. We demonstrate the power of this reduction by implementing the Aho-Corasick algorithm on an IP-lookup chip (as in [84]) and on a TCAM.

    Specifically, in TCAM, each rule is mapped to one entry. Since TCAMs are configured with an entry width that is a multiple of 36 or 40 bits, minimizing the number of bits needed to encode a state is less important, and the basic CompactDFA, which minimizes the number of rules, is more appropriate.

    We also deal with two obstacles that arise when using TCAMs: the power consumption and the latency induced by the pipeline in the TCAM chip, which is especially significant since CompactDFA works in a closed-loop manner (that is, the input for one lookup depends on the output of a previous lookup). To overcome the latency problem, we propose two kinds of interleaved executions (namely, inter-flow interleaving and intra-flow interleaving). We show that combining these executions provides low latency (on the order of a few tens of microseconds) at high throughput. We reduce the power consumption of the TCAM by taking advantage of the fact that today’s vendors partition the TCAM into blocks and allow, in every lookup, activation of only some of these blocks. We suggest dividing the rules into different blocks, each associated with a different subset of symbols. Dividing the rules into blocks in this way reduces the number of bits required for encoding the symbol field of a rule to the logarithm of the number of symbols that are mapped to the same block.

    The small memory requirement of the compressed rules and the low power consumption enable the use of multiple TCAMs simultaneously, where each performs pattern matching over different sessions or packets (namely, inter-flow interleaving). Furthermore, one can take advantage of the common multiprocessor architecture of contemporary security tools and design a high-throughput solution, applicable to the common case of multiple sessions/packets. Notice that while state-of-the-art TCAM chips are 5 MB in size, high throughput may be achieved using multiple small TCAM chips. For the Snort pattern set, we achieve a throughput of 10 Gbps and a latency of less than 60 microseconds by using 5 small TCAMs of 0.5 MB each, and as much as 40 Gbps (at the same latency) with 20 small TCAMs.

    This work was published in the proceedings of Infocom 2010 [35].

    1.2.2 Space-Efficient Deep Packet Inspection of Compressed Web Traffic

    Networking devices that perform deep packet inspection (DPI) over compressed traffic first need to decompress the message in order to inspect its payload. Gzip compression, which is used for compressed Web traffic, replaces repeated strings with back references, denoted as pointers, to their prior occurrence within the last 32KB of the text. Therefore, the decompression process requires a 32KB buffer of the most recent decompressed data, so as to keep all bytes that might be back-referenced by pointers; this causes a major space penalty. With today’s mid-range firewalls, which are built to support 100K to 200K concurrent connections, keeping a 32KB window buffer for each connection occupies a few gigabytes of main memory. Decompression also causes a time penalty, but this penalty was successfully reduced in [36].
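To make the role of these pointers concrete, the following sketch (illustrative only; real DEFLATE additionally Huffman-codes the token stream and caps the pointer distance at 32KB) decodes a stream of literals and (distance, length) back references:

```python
def lz77_decode(tokens):
    """Decode a token stream of literal bytes and (distance, length)
    back references, as in the LZ77 stage of gzip/DEFLATE (sketch)."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):      # byte-by-byte copy: a reference
                out.append(out[-dist])   # may overlap its own output
        else:
            out.append(tok)              # a literal byte
    return bytes(out)

# Three literals plus one overlapping back reference expand to 10 bytes.
tokens = [ord('a'), ord('b'), ord('c'), (3, 7)]
assert lz77_decode(tokens) == b"abcabcabca"
```

Since every pointer may reach up to 32KB backwards, the decoder must keep that whole window of output per connection, which is precisely the space penalty discussed above.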

    This high memory requirement leaves vendors and network operators with three bad options: ignore compressed traffic, forbid compression, or divert the compressed traffic for offline processing. Obviously, none of these is acceptable, as each presents either a security hole or serious performance degradation.

    The basic structure of our approach is to keep the 32KB buffers of all connections compressed, except for the data of the connection whose packet(s) is now being processed. When a packet arrives, we unpack its connection buffer and process it. One may naïvely suggest keeping only the original compressed data as it was received. However, this approach fails, since the buffer would then contain recursive pointers to data more than 32KB backwards. Our technique, called “Swap Out-of-boundary Pointers” (SOP), packs the connection buffer by combining recent information from both the compressed and uncompressed 32KB buffers to create a new compressed buffer that contains only pointers referring to locations within itself. We show that by employing our technique on real-life data we reduce the space requirement by a factor of 5, with a time penalty of 26%. Note that while our method modifies the compressed data locally, it is transparent to both the client and the server.
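The core packing idea can be sketched as follows. This is a greatly simplified illustration under an assumed token representation (positions, token format, and the tiny window in the example are all made up); it ignores the Huffman re-encoding and the re-compression of swapped literals that the actual SOP technique handles:

```python
def sop_pack(tokens, window_start, plaintext):
    """Sketch of the Swap Out-of-boundary Pointers (SOP) idea: re-pack a
    connection buffer so every remaining pointer refers only within it.
    tokens: list of (abs_pos, tok), where tok is a literal byte value or
    a (distance, length) pointer; window_start: absolute offset where the
    retained window begins; plaintext: decompressed text by absolute pos."""
    packed = []
    for pos, tok in tokens:
        if pos < window_start:
            continue                          # data outside the window: drop
        if isinstance(tok, tuple):
            dist, length = tok
            if pos - dist < window_start:     # pointer reaches out of bounds:
                for i in range(length):       # swap it for literal bytes
                    packed.append(plaintext[pos + i])
                continue
        packed.append(tok)                    # in-bound pointer or literal
    return packed
```

After packing, the buffer decodes on its own: no pointer refers to data that was discarded, so the original 32KB of decompressed history need not be kept per connection.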

    We further design an algorithm that combines our SOP space-reducing technique with the ACCH algorithm presented in [36] (the ACCH algorithm accelerates pattern matching on compressed HTTP traffic). The combined algorithm achieves an improvement of 42% in the time requirement and 79% in the space requirement. The time-space tradeoff offered by our technique provides the first solution that enables DPI on compressed traffic at wire speed for network devices such as IPS and Web application firewalls.

    This work was published in the proceedings of IFIP Networking 2011 [21]. An

    extended version was published in the Computer Communications Journal [22].

    1.2.3 Shift-based Pattern Matching for Compressed Web Traffic

    In this work we provide a method for accelerating DPI over compressed traffic. The

    most common algorithm for compressed traffic uses the gzip compression algorithm,

    which eliminates repetitions of strings using back references (pointers). The key insight

    is to store information produced by the pattern matching algorithm for scanned decom-

    pressed traffic, and in the case of pointers, use this data to either find a match or to skip

    scanning that area. Recent work [36] presents the ACCH technique for pattern match-

    ing on compressed traffic. This technique decompresses the traffic and then uses data

    from the decompression phase to accelerate the process. This work analyzed the case

    of using the well-known Aho-Corasick (AC) [23] algorithm as a multi-pattern match-

    ing technique. The basic Aho-Corasick has good worst-case performance since every

    character requires traversing exactly one deterministic finite automaton (DFA) edge.

    However, the adaptation for compressed traffic, where some characters represented by

    pointers may be skipped, is complicated since the Aho-Corasick requires inspection of


    every byte within the traffic.

    Inspired by the insights of that work, we investigate the case of performing DPI over

    compressed Web traffic using the shift-based multi-pattern matching technique of the

    modified Wu-Manber algorithm [94]. The Wu-Manber algorithm does not scan every

    position within the traffic; in fact it shifts (skips) scanning areas in which the algorithm

    concludes that no pattern starts.
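The shift idea can be sketched with a toy implementation of our own (2-byte blocks and naive verification; this is not the modified algorithm of Section 5.1):

```python
def build_shift_table(patterns, block=2):
    # m = length of the shortest pattern; only m-char prefixes matter.
    m = min(len(p) for p in patterns)
    default = m - block + 1            # block in no prefix: maximum shift
    shift = {}
    for p in patterns:
        for i in range(m - block + 1):
            blk = p[i:i + block]
            # shift = distance from this block to the prefix's end
            shift[blk] = min(shift.get(blk, default), m - block - i)
    return shift, default, m

def wm_scan(text, patterns, block=2):
    shift, default, m = build_shift_table(patterns, block)
    hits, i = [], m                    # window of length m ends at index i
    while i <= len(text):
        s = shift.get(text[i - block:i], default)
        if s == 0:                     # a pattern may end here: verify
            hits += [(i - m, p) for p in patterns
                     if text.startswith(p, i - m)]
            i += 1
        else:
            i += s                     # safe skip: no pattern starts earlier
    return hits
```

Whenever the trailing block of the current window occurs in no pattern prefix, the scan jumps ahead by `m - block + 1` positions without inspecting the skipped bytes, which is exactly the property SPC exploits for compressed traffic.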

    As a preliminary step, we present an improved version for the Wu-Manber algo-

    rithm (see Section 5.1). This modification improves both time and space complexity to

    fit the large number of patterns within current pattern sets such as those in the Snort

    database [9]. We then present our Shift-based Pattern matching for Compressed traffic

algorithm (SPC), which accelerates Wu-Manber on compressed traffic. SPC results in

    a simpler algorithm, with higher throughput and lower storage overhead than the accel-

erated AC, since the Wu-Manber algorithm’s basic operation involves shifting (skipping)

    over some of the traffic. Thus, it is natural to combine Wu-Manber with the idea of

    shifting (skipping) some of the pointers.

    We show that we can skip scanning up to 87.5% of the data and gain a performance

    boost of more than 73% as compared to the Wu-Manber algorithm on real Web traffic

    and security-tool signatures. Furthermore, we show that the suggested algorithm also

    gains a normalized throughput improvement of 51% as compared to ACCH. Finally, the

    SPC algorithm reduces the additional space required for previous scan results by half,

    by storing only 4KB per connection as compared to the 8KB of ACCH.

    This work was published in the proceedings of HPSR 2011 [37].

    1.2.4 Decompression-Free Inspection: DPI for Shared Dictionary

    Compression over HTTP

    Gzip works well as a compression method for each individual HTTP-response, but it

    often happens that a lot of common data is shared by a group of pages. This type

    of sharing is known as inter-response redundancy. Therefore, next generation Web

    compression methods are inter-file, where there is one dictionary that may be referenced

    by several files. An example of a compression method that uses a shared dictionary is

    Shared Dictionary Compression over HTTP (SDCH).

    SDCH [38] was proposed by Google Inc.; thus, Google Chrome (Google’s browser)

    supports it by default. According to W3Schools [3], Google’s Chrome browser surpassed


    Mozilla’s Firefox browser in March 2012 (after it surpassed Microsoft’s Internet Explorer

    browser back in April 2011) to become the clear, dominant winner in the latest browser

wars. Thus, the popularity of SDCH compression is expected to increase accordingly.

    Android is a software stack for mobile devices that includes an operating system,

    middleware and key applications. The Android operating system, also introduced by

    Google, is currently the world’s best-selling smartphone platform, with a 68.1% market

share worldwide [1]. SDCH code also appears in the Android platform and is likely to be

    used in the near future. Therefore, a solution for DPI on shared dictionary compressed

    data is essential for this platform as well. SDCH is complementary to gzip or Deflate,

    i.e., it could be used before applying gzip. On Web pages containing Google search

    results, the data size reduction when adding SDCH compression before gzip is about

    40% better than gzip alone.

    The idea of the shared dictionary approach is to transmit the data that is common

    to each response once and after that send only the parts that differ. In SDCH notations,

    the common data is called the dictionary and the differences are stored in a delta file.

    A dictionary is composed of the data used by the compression algorithm, as well as

    metadata describing its scope and lifetime. The scope is specified by the domain and

    path attributes; thus, a user may download several dictionaries, even from the same

    server.

    In this work we present a novel pattern matching algorithm for SDCH. Our algorithm

    operates in two phases, the offline phase and the online phase. The offline phase

    starts when the device gets the dictionary. In this phase the algorithm uses the Aho-

    Corasick [23] pattern matching algorithm to scan the dictionary for patterns and marks

    auxiliary information to facilitate the scan of the delta files. Once received, the delta

    file is scanned online using Aho-Corasick algorithm. Since the delta file eliminates

    repetitions of strings using references to the common strings in the dictionary, our

algorithm tries to skip these references, so each plain-text byte is scanned only once

    (either in the offline or the online phase). We show that we skip up to 99% of the

    referenced data and gain up to 56% improvement in the performance of the multi-

    pattern matching algorithm, compared with scanning the plain text directly.
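The two-phase idea can be sketched as follows. This is a toy of our own: a VCDIFF-like delta of LITERAL/COPY operations and plain substring search in place of Aho-Corasick, with matches that cross a copy boundary ignored for brevity (the real algorithm handles them):

```python
def offline_index(dictionary, patterns):
    # Offline phase: for each dictionary position, record the patterns
    # ending there (a stand-in for the Aho-Corasick dictionary scan).
    ends = {}
    for p in patterns:
        i = dictionary.find(p)
        while i != -1:
            ends.setdefault(i + len(p), []).append(p)
            i = dictionary.find(p, i + 1)
    return ends

def scan_delta(ops, ends, patterns):
    # Online phase: ops is a list of ('lit', data) / ('copy', off, length).
    # COPY spans are never rescanned; matches wholly inside the copied
    # region are read off the offline index.
    pos, hits = 0, []
    for op in ops:
        if op[0] == "lit":
            data = op[1]
            for p in patterns:                  # scan literal bytes online
                j = data.find(p)
                while j != -1:
                    hits.append(pos + j)
                    j = data.find(p, j + 1)
            pos += len(data)
        else:
            _, off, length = op
            for end, ps in ends.items():        # reuse offline results
                for p in ps:
                    if off <= end - len(p) and end <= off + length:
                        hits.append(pos + (end - len(p) - off))
            pos += length
    return hits
```

The returned positions are offsets in the reconstructed (decompressed) response, yet the dictionary bytes referenced by COPY operations were scanned only once, during the offline phase.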

    We are the first to address the problem of pattern matching algorithms for SDCH. In

    addition, we have designed a novel algorithm that scans only a negligible number of bytes

    more than once, as our evaluations confirm (see Section 6.3). This is a remarkable result


    considering that bytes in the dictionary can be referenced multiple times by different

    positions in one delta file and moreover, by different delta files. The SDCH compression

    ratio is about 44%, implying that 56% of the data is copied from the dictionary. Thus,

    in a single scan, our algorithm achieves 56% improvement over a plain text file scan.

    Our algorithm also has low memory consumption. It stores only the dictionary being

    used (along with some auxiliary information per dictionary). In the case of SDCH, since

    it was developed for Web traffic, one dictionary usually supports many connections.

    In other words, the memory consumption depends on the number of dictionaries and

    their sizes and not on the number of connections, in contrast to intra-file compression

    methods.

    Finally, an important contribution is a mechanism to deal with matching regular-

    expression signatures in SDCH-compressed traffic. Regular expression signatures are

    becoming increasingly popular due to their superior expressibility [26]. We show how to

    use our algorithm as a building block for regular expression matching. Our experiments

    show that our regular expression matching mechanism gains a similar 56% boost in

    performance.

    This work was published in the proceedings of Infocom 2012 [33].

    1.2.5 MCA2: Multi Core Architecture for Mitigating

    Complexity Attacks

    This work deals with complexity attacks, which exploit the gap between the amount of

    resources the system requires in processing normal packets and carefully crafted packets

    that consume drastically more resources (computing, memory, cache, or other). These

    crafted packets, which we call heavy packets, are easy to construct but require very

    intensive processing from the system. This implies that a small effort on the attacker’s

    side leads to a great effort on the part of the system, which is bound to lose.

    We present MCA2—a Multi-Core Architecture for Mitigating Complexity Attacks.

    MCA2 essentially isolates the malicious traffic to a fraction of the cores and deals with

    legitimate traffic on the remaining cores, which are therefore not affected by the attack.

    Our MCA2 system can be configured to mitigate any complexity attack with the

    following properties:

    1. There are heavy and normal packets, where heavy packets consume considerably

    more resources from the security device when being processed.


    2. There is a method to identify heavy packets that requires very few resources.

    3. Packets can be moved efficiently between system cores.

    4. There is a special method that handles heavy packets more efficiently than the

    method used for normal packets.1

    It turns out that there are quite a few complexity attacks that meet these criteria.

    However, we restrict our discussion to the DPI component of NIDS/NIPS. We consider

    three examples that have the above properties: cache-miss attack on Snort’s signature

    detection engine; active states explosion attack on the Hybrid-FA [27] regular expression

    detection engine; and forced construction attack on Bro IDS regular expression detection

    engine.

    We focus on the first example and use it to explain our method and the above-

    mentioned properties. We then show that the active states explosion complexity attack

    fits our requirements as well. We back up all our findings with experimental results,

showing the benefits of using MCA2 in conjunction with the NIDS. For the forced

    construction attack example, we look at the Bro IDS regular expression detection engine.

    Bro takes a lazy approach in order to cope with the large DFA size. Namely, it con-

    structs only the DFA parts it actually uses. Normal traffic uses only a small part of the

    DFA. Hence, a simple complexity attack forces Bro to construct a large portion of the

    DFA, which significantly degrades performance.

    With regard to our main example, we target Snort’s DPI engine, which uses some

    variant of the Aho-Corasick (AC) [23] algorithm for performing pattern matching. A

    complexity attack on the Aho-Corasick algorithm (in a stand-alone environment) is

    shown in [34]: Aho-Corasick uses a large DFA that cannot fit entirely in the cache. The

    common traffic, however, uses only a very small part of it, resulting in fast memory

    references and few cache misses. An attacker can easily craft malicious packets that

    cause an exhaustive traversal over the DFA, which in turn pollutes the cache. In this

    work, we show for the first time that Snort is indeed vulnerable to this attack: an attack

    on its DPI component degrades its overall performance by a factor of 4.2.

    After establishing that the threat of this attack is real, we turn to investigate how

    MCA2 mitigates such an attack. The key challenge is how to detect and isolate malicious

1 This special method usually handles normal packets poorly; otherwise it would have been used by the system in the first place.


    traffic. This is done in two steps. First, training data is used to identify and mark the

    common states of the DFA. These are the frequently-visited states while processing

    normal common traffic. Then, for each packet, we count the fraction of non-common

    states visited (out of the total number of states traversed by the packet). As soon as

    this fraction exceeds a certain threshold, the packet is marked heavy. When the fraction

    of heavy packets is above a second threshold, we allocate one or more cores to deal with

    them exclusively, while the rest of the cores continue to process only normal traffic (and

    to detect heavy packets); each subsequent heavy packet is moved to one of the dedicated

    cores. This process isolates the effect of heavy packets and protects the private caches

    of the non-dedicated cores from pollution. MCA2 can be further optimized by running

    on the dedicated cores an implementation that is optimized for heavy packets (albeit

    with penalty in the normal case).
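The per-packet test can be sketched as follows (a simplified version of our own; the DFA step function, the common-state set, and the 50% threshold are illustrative placeholders, not the tuned values used in the evaluation):

```python
def is_heavy(dfa_step, common_states, payload, threshold=0.5):
    # Walk the DFA over the payload, counting visits to states that
    # were NOT marked 'common' during offline training on benign
    # traffic. A packet whose uncommon-state fraction exceeds the
    # threshold is declared heavy and diverted to the dedicated cores.
    state, uncommon = 0, 0
    for symbol in payload:
        state = dfa_step(state, symbol)
        if state not in common_states:
            uncommon += 1
    return bool(payload) and uncommon / len(payload) > threshold
```

Because the check is a set-membership test per traversed state, its cost is negligible next to the DFA traversal itself, satisfying the second MCA2 property (cheap identification of heavy packets).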

    The main performance measure we use is the goodput of the system, namely the

    volume of the non-malicious packets that were processed. Our experimental results

    are summarized in Fig. 1.1, which shows the system’s goodput under different attack

    intensities (namely, in attack intensity of 50%, half of the incoming traffic is malicious).

    We compare the performance of MCA2 with two implementations of the Aho-Corasick

    algorithm: the first, denoted “Full Matrix AC,” is optimized for well-behaved normal

    traffic, and the second, denoted “Compressed AC,” is optimized to work under cache-

    miss attack (as described in Section 2.1).

    When the system is not allowed to drop packets, MCA2 uses “Full Matrix AC” on

    the cores that process normal traffic and “Compressed AC” on the dedicated cores. The

    number of cores of each type is dynamically determined as a function of the attack level.

    When there is no attack, MCA2 is reduced to “Full Matrix AC.”

    We also consider the case when the NIDS/NIPS is allowed to drop packets. Drop-

    ping all heavy packets implies that no dedicated threads are required, freeing up all

    processing resources for the detection of heavy packets and processing of non-heavy

    (mostly legitimate) packets, thus increasing the goodput.

    Our experiments show a significant goodput improvement: MCA2 achieves up to

    twice the goodput of both implementations, even without dropping packets. Further-

    more, it always outperforms a hybrid implementation that chooses the best of the

    previous implementations at any given time, with a goodput boost of up to 73%.

    As for the second example, we use the regular expressions Hybrid-FA data structure

[Figure 1.1 appears here: a plot of goodput (Mbps, 0–9000) against attack intensity (0–100%), with curves for Full Matrix AC, Compressed AC, MCA2, and MCA2 with drop.]

    Figure 1.1: The goodput of MCA2 for different attack intensities. MCA2 with no drops maintains a balance between all cores.

to illustrate an active states explosion attack. Hybrid-FA uses a single “head-DFA” for

    commonly-used states while other parts of the automaton are kept as separate DFAs,

    which are activated simultaneously when required. Usually, only the “head-DFA” is

    activated. Thus, our complexity attack causes the Hybrid-FA to activate many states

    in parallel, therefore causing the system to traverse several states per input byte; this

    degrades system throughput significantly. We show that MCA2 in full-drop setup can

    mitigate such an attack: our experiments show that under a mild active states explosion

    attack, the goodput of the system is increased by a factor of 4.8.

    This work was published in the proceedings of ANCS 2012 [20].

    1.3 Related Work

    The following section provides related work regarding the research presented in this

    thesis.

    1.3.1 DFA Compression

    Intensive efforts have been made to implement compact Aho-Corasick-like DFA that

    can fit into faster memory.

    Van-Lunteren [89] proposed a novel architecture that supports prioritized tables.

His results are equivalent to those of CompactDFA (presented in Chapter 3) with a suffix

    tree limited to depth 2, and thus require 25 (66) times more rules than the CompactDFA solution

    for Snort (ClamAV). CompactDFA in some sense is a generalization of [89], which


    eliminates all cross-transitions. Song et al. [85] proposed an n-step cache architecture

    to eliminate some of the DFA’s cross-transitions. This solution still has 4 (9) times

    more rules for Snort (ClamAV) than in CompactDFA. In addition, this solution, like

    other hardware solutions [76, 87], uses dedicated hardware and thus is not flexible.

    As far as we know, CompactDFA is the first proposed method for reducing the

number of transitions in a DFA to the minimum possible, namely the number of DFA states.

    CompactDFA does not depend on any specific hardware architecture or any statistical

    property of data (as opposed to the work of Tuck et al. [88]).

    The papers [96] and [86, 93] encode segments of the patterns in the TCAM and do

    not encode the DFA rules. However, both solutions require significantly larger TCAM

    (especially [93]) and more SRAM (an order of magnitude more). The work of Lin

    et al. [66] encodes the DFA rules in TCAM, just as CompactDFA does. CompactDFA

    and [66] are based on the same basic observation, that we can eliminate cross-transitions

    by using information from the next state label. However, [66] does not use the bits of

    the state to encode the information; on the contrary, they just append to each state

    code the last m bytes of its corresponding label to eliminate cross-transitions to depth

    m. Thus, for depth 4, [66] requires 62 bits while CompactDFA requires only 36 bits,

    and hence the solution is not scalable.

    A recent work presented a method for state encoding in a TCAM-based implemen-

    tation of Aho-Corasick NFA rather than Aho-Corasick DFA [97]. While such a method,

    which was developed concurrently with ours, shares some of the insights of our work

    (e.g., it also eliminates all failure transitions), it is limited to the TCAM implementa-

    tions where CompactDFA may be used with any known IP-lookup solution. In addition,

    unlike our work, the method in [97] does not deal with pipelined TCAM implementa-

    tions (which are common in contemporary TCAM chips) and therefore suffers from

    significant performance degradation if such TCAMs are used.

    Following our work, several methods to perform regular expression matching using

    TCAM [71, 78] were suggested. These methods rely on the same high-level principle

    of our work: exploiting the degree of freedom in the way states are encoded. Since

    these methods deal with regular expression rather than exact string matching, they do

    not use AC-DFA, but other automata that are geared to handle regular expressions.

Specifically, [71] uses D2FA, while [78] uses both a DFA and knowledge derived from a

    corresponding NFA; both methods then construct a tree (or a forest) structure, which is


    encoded similarly to CompactDFA. Finally, unlike our work, the methods in [71, 78, 97]

    do not deal with pipelined TCAM implementations (which are common in contemporary

    TCAM chips) and therefore suffer from significant performance degradation if such

    TCAMs are used.

    Two additional methods that use TCAMs to handle regular expression matching

    were presented by Liu et al. [68] and Zheng et al. [99]. These methods present orthogonal

improvements to utilizing TCAMs. Specifically, the method of Liu et al. [68] is based on

    implementing a fast and cheap pre-filter, so that only a portion of the traffic must be

    fully-inspected; on the other hand, Zheng et al. [99] suggest a technique that parallelizes

    the process of using TCAM by smartly dividing the pattern rule set and the flows to

    different TCAM blocks. Naturally, these two approaches can be easily combined with

    ours.

    Finally, we note that [71] introduces the table consolidation technique, which com-

    bines entries even if they lead to different states. This technique trades TCAM memory

    with a cheaper SRAM memory that stores the different states of each combined entry.

    Table consolidation, which requires solving complicated optimization problems, can be

    applied also to our results to further reduce TCAM space.

    1.3.2 Compressed Web-Traffic

    Extensive research has been conducted on performing pattern matching on compressed

    files as in [25, 59, 72, 73], but very limited work has been done on compressed traffic.

    Requirements for dealing with compressed traffic are: (1) on-line scanning (1-pass),

    (2) handling thousands of connections concurrently and (3) working with the LZ77

    compression algorithm, which is used by gzip, (as opposed to most papers, which deal

    with LZW/LZ78 compressions). To the best of our knowledge, [47, 52] are the only

    papers that deal with pattern matching over LZ77. However, the algorithms are for a

    single pattern and require two passes over the compressed text (file), which is not an

option in network domains that require “on-the-fly” processing.

Klein and Shapira [60] have suggested a modification to the LZ77 compression algorithm

    that changes the backward pointers into forward pointers. That modification

    makes the pattern matching easier in files and may save some of the space required by

    the 32KB buffer for each connection. However, their proposal is not implemented in

    today’s HTTP.


    The first paper to analyze the obstacles in dealing with compressed traffic is [36],

    but it only accelerated the pattern matching task on compressed traffic and did not

handle the space problem. Furthermore, it still requires full decompression.

    Techniques have been developed for in-place compression, the main one being

    LZO [75]. While LZO claims to support decompression without memory overhead,

    it works with files and assumes that the uncompressed data is available. We assume

    decompression of thousands of concurrent connections on-the-fly without having the

uncompressed data available. Thus, what comes for free in LZO constitutes overhead for

    compressed Web traffic. Furthermore, while gzip is considered the standard for Web

    traffic compression, LZO is not supported by any Web server or Web browser.

    1.3.3 DPI Using Multi-Core Platforms

    The recent proliferation of multi-core general purpose processors motivated many re-

    searchers to reinvestigate well-known problems in this new domain. Among these are

    several works that proposed a multi-core solution for DPI processing. These papers’

    main focus is on different ways to load balance the system tasks between the available

    cores.

    Current NIDS/NIPS systems such as Snort [9] and Bro [2] split the load to many

    sequential sub-tasks in a pipeline manner. Other works, such as [91], suggest fine-

    grained pipelining for parallelizing network applications on multi-core architectures.

    This partitioning is effective if the processing cost for each sub-task is similar, which is

    usually not the case for NIDS/NIPS.

    A different line of research focuses on load balancing the traffic flows equally between

    the different cores and performing the inspection in parallel [41, 53, 67, 74, 83]. The load

    balancing is based on both the packet header parameters and some layer-7 parameters.

    We note that such architectures are orthogonal to our MCA2 algorithm (see Chapter 7)

    and may be applied to load balance the work between general threads that process the

    normal traffic. If MCA2 is not used in conjunction with these architectures, they are

    all vulnerable to complexity attacks.

    Becchi et al. [30] focus on DPI and present a performance evaluation scheme for

    multiprocessor systems. The proposed design also splits the traffic between several

    cores with the same DPI engine that supports regular expression matching. Their

    study identifies and evaluates algorithmic and architectural trade-offs and limitations,


    and highlights how the presence of caches affects the overall performance. However, it is

    geared at optimizing the normal case and is vulnerable to similar complexity attacks as

    those we describe in this work. Such attacks can be mitigated by incorporating MCA2

    into this scheme as well.

    Another multi-core load-balancing approach is to partition the patterns among the

    cores (cf. [90, 95, 98]). Then different DPI algorithms, each specializing in different

    kinds of pattern sets, are run on each core. In some cases, the partitioning itself is

    done so as to balance the load between the algorithms. It is important to note that

    architectures of this kind differ from MCA2 in that each packet is examined by several

    cores (each performs only part of the inspection). In addition, they do not take into

    account the incoming traffic, and are vulnerable to separate attacks on each core.

    1.3.4 Denial-of-Service Mitigation

    Kumar et al. [62] present several methods to reduce regular-expressions-based DFA

    size. One of the mechanisms used in that paper is based on the assumption that normal

    flows rarely match more than the first few symbols of any signature. Thus, the most

    frequently visited portions of the automaton are used to build a fast path DFA, and the

    rest of the automaton is represented by a separate NFA, which is the slow path. The

    authors suggest a solution that is similar to MCA2 in that it handles heavy traffic with

    a different algorithm and applies a lightweight classification algorithm to distinguish

    between heavy and normal traffic. In addition, [62] proposes to protect against denial-

    of-service (DoS) attacks by attaching lower priority to flows with higher probability

    of being malicious. Nevertheless, that work analyzes the case of a single core, and

    therefore could not benefit from the multi-core properties as MCA2 does. Furthermore,

    the proposed protection in [62] fails under a continuous DoS attack because the heavy

    packets that receive lower priority eventually overload the system buffer. MCA2 is also

    resilient to DoS attacks of longer duration.

    1.4 Significance

    This thesis provides algorithms and techniques in the field of deep packet inspection for

    high performance network security tools. These algorithms focus on three problems:

    scalability, compressed traffic, and security-tool resiliency.


    For the first topic, that of scalability, we are the first to provide a scheme that reduces

    the pattern matching problem to the well-studied problem of Longest Prefix Matching

    (LPM), which may be solved either in TCAM, in commercially available chips, or in

    software.

    For the second topic, that of compressed traffic, we are the first to address the

    problem and to provide a set of state-of-the-art solutions that achieve good theoretical

    and practical results.

    As for the third topic, we have uncovered and demonstrated weaknesses of preva-

    lent security tools for commercial networks, by devising a denial-of-service algorithmic

    complexity attack over the Snort network intrusion detection server. Furthermore, we

    are the first to incorporate the common multi-core platform architecture to mitigate

    complexity attacks over network security tools.

Chapter 2

    Background

    In this chapter we provide background on topics that are relevant to the following

chapters: “pattern matching”, “compressed traffic”, and “complexity attacks”.

    2.1 DFA based Pattern Matching

    DPI is a major component in contemporary security tools, which heavily relies on

    pattern matching to detect signatures of malicious traffic. We consider the following

    two classes of pattern matching: exact matching and regular expression matching. The

    former usually uses a deterministic finite automaton (DFA), while the latter uses either

    a DFA or a non-deterministic finite automaton (NFA) for the ongoing inspection of the

    input data [54]. A sub-category of the latter class is the Ternary Content Addressable

    Memory (TCAM) based regular expression matching, which encodes the DFA rules

    using TCAM elements (as discussed in Chapter 3).

    In our main example, we mostly focus on the exact matching algorithms, which

use a DFA. A DFA is a five-tuple 〈S,Σ, δ, s0, F 〉, where S is a finite set of states, Σ is a

    finite set of input symbols, δ : S × Σ → S is a transition function, returning the next

    state, given the current state and any symbol from the input, s0 ∈ S is the initial state,

and F ⊆ S is the set of accepting states. The Aho-Corasick algorithm provides a method to

    build such an automaton (a.k.a. Aho-Corasick DFA) from a set of patterns. Given the

    DFA, a packet is inspected by traversing the automaton symbol by symbol from s0; a

    pattern is detected if a state in F is reached in this traversal. Fig. 2.1(a) depicts the

Aho-Corasick DFA for the pattern-set {E, BE, BD, BCD, CDBCAB, BCAA}.

    In today’s security tools, Aho-Corasick DFAs are huge—e.g., Snort’s Aho-Corasick

[Figure 2.1 appears here: (a) the Aho-Corasick DFA for pattern-set {E, BE, BD, BCD, CDBCAB, BCAA} with states s0–s14; (b) full-matrix encoding; (c) compressed automaton; (d) compressed encoding.]

    Figure 2.1: Example of an Aho-Corasick DFA and two methods to store it in memory: non-compressed (full-matrix) encoding, and compressed encoding. The compressed encoding is derived from a compressed automaton, in which fail transitions are taken without consuming input symbols, and transitions marked with ‘*’ indicate that a match was found.


DFA has 77,182 states for 31,094 patterns—raising the question of how to store it

    efficiently in memory. The alternatives naturally trade memory space with execution

time. In addition, most security tools (including Snort) divide their patterns into several

    sets, according to the type of traffic.

Snort uses a full-matrix encoding for its Aho-Corasick DFAs, as presented in [23]. In this representation (see Fig. 2.1(b)), transitions are stored in a two-dimensional array with |S| rows and |Σ| columns. The entry at position (i, j) holds the value of δ(si, j), implying that each entry requires at least ⌈log2 |S|⌉ bits. In the typical case, when the input is inspected one byte at a time, |Σ| = 256, resulting in an overall memory footprint of 256 · |S| · ⌈log2 |S|⌉ bits. For Snort’s Aho-Corasick DFAs, this translates to a combined footprint of 75.15 MB. On the other hand, the main advantage of this encoding is that a transition requires a single memory load operation, which directly reveals the next state.
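To make the single-load property concrete, the following toy sketch (illustrative code, not Snort's implementation) builds a full-matrix table for a single pattern; the layout generalizes to the multi-pattern Aho-Corasick DFA:

```python
# Toy full-matrix DFA for one pattern (hypothetical helper, not Snort's code):
# state s means "the last s input bytes equal the first s pattern bytes".
SIGMA = 256  # |Sigma| = 256 when inspecting one byte at a time

def build_full_matrix(pattern: bytes):
    m = len(pattern)
    # |S| x |Sigma| table; entry (s, c) holds the next state delta(s, c).
    delta = [[0] * SIGMA for _ in range(m + 1)]
    delta[0][pattern[0]] = 1
    fallback = 0  # state the automaton would be in after dropping the first byte
    for s in range(1, m + 1):
        delta[s] = delta[fallback][:]      # default: behave like the fallback state
        if s < m:
            delta[s][pattern[s]] = s + 1   # forward transition along the pattern
            fallback = delta[fallback][pattern[s]]
    return delta, m  # m is the single accepting state

def match_ends(delta, accept, text: bytes):
    s, ends = 0, []
    for i, b in enumerate(text):
        s = delta[s][b]        # one memory load per input byte
        if s == accept:
            ends.append(i)     # a pattern occurrence ends at offset i
    return ends
```

The price of the single lookup is the table size: (m + 1) · 256 entries even for one short pattern, which is exactly the space overhead the compressed encodings below try to avoid.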

An alternative approach is to implement an AC automaton using the concept of failure transitions. In such implementations, only some of the outgoing transitions of each state are stored explicitly. While traversing the automaton, if the transition from state s with symbol x is not stored explicitly, one takes the failure transition from s to another state s′ and looks for an explicit transition from s′ with x. This process is repeated until an explicit transition with x is found, resulting in failure paths. Naturally, since only some of the transitions are stored explicitly, these implementations (sometimes referred to as AC NFAs) are more compact, but incur higher processing time. A classical result states that the longest failure path is at most the length of the longest pattern, and that, regardless of the traffic pattern, the total number of transitions taken (failure and explicit) is at most twice the number of input symbols. This result does not take into account the representation of each individual state, which determines the time it takes to figure out whether an explicit transition exists or not.
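The scheme can be sketched as follows (an illustrative Python implementation, not the code evaluated in this thesis): only trie edges are stored per state, and each input symbol may trigger a walk along failure links before an explicit transition is found:

```python
from collections import deque

def build_ac_nfa(patterns):
    """Build trie edges (goto), failure links and output sets for the patterns."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        s = 0
        for ch in pat:                       # phase 1: insert the pattern as a chain
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    q = deque(goto[0].values())              # phase 2: BFS to set failure links
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:   # walk the failure path of s
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)     # longest proper suffix that is a state
            out[t] |= out[fail[t]]           # inherit matches from the suffix state
            q.append(t)
    return goto, fail, out

def scan(goto, fail, out, state, ch):
    """One AC step: follow failure links, then an explicit edge if any."""
    while state and ch not in goto[state]:
        state = fail[state]
    state = goto[state].get(ch, 0)
    return state, out[state]
```

Scanning "BEBCD" against the patterns {E, BE, BD, BCD} of Fig. 2.1 reports E and BE after the second symbol and BCD after the last one, matching the compact storage/extra work trade-off described above.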

We use the following definitions regarding this encoding: Let the label of a state s, denoted by L(s), be the concatenation of the symbols along the path from the root to s. Furthermore, let the depth of a state s be the length of its label L(s). The failure transition from s always leads to a state s′ whose label L(s′) is the longest proper suffix of L(s) among the labels of all other DFA states. This implies the following property of the Aho-Corasick DFA:

Property 1. If L(s′) is a suffix of L(s), then there is a failure path (namely, a path comprised only of failure transitions) from state s to state s′.

The DFA is traversed starting from the root. When the traversal goes through an accepting state, it indicates that some patterns are a suffix of the input read so far; one of these patterns always corresponds to the label of the accepting state. Formally, we denote by s.output the set of patterns matched by state s; if s is not an accepting state, then s.output = ∅. Finally, we denote by scan(s, b) the AC procedure for reading input symbol b while in state s; namely, transiting to a new state s′ after traversing failure transitions and a forward transition as necessary, and reporting matched patterns in case s′.output ≠ ∅. scan(s, b) returns the new state s′ as its output. The correctness of the AC algorithm essentially stems from the following simple property:

Property 2. Let b1, . . . , bn be the input, and let s1, . . . , sn be the sequence of states the AC algorithm goes through after scanning the symbols one by one, starting from the root s0 of the DFA. For any i ∈ {1, . . . , n}, L(si) is a suffix of b1, . . . , bi; furthermore, it is the longest such suffix among the labels of all states of the DFA.

    There are other encodings that require more than one memory access, but offer

    significant memory reduction. Several such encodings exist in the literature [29, 34, 88].

    Fig. 2.1(d) depicts one such alternative, as suggested in [34]; this encoding is based on

    a compressed automaton as depicted in Figure 2.1(c).

    The construction of AC’s DFA is done in two phases. First, the algorithm builds a

    trie of the pattern set: All the patterns are added from the root as chains, where each

    state corresponds to a single symbol. When patterns share a common prefix, they also

    share the corresponding set of states in the trie. In the second phase, additional edges

    are added to the trie. These edges deal with situations where the input does not follow

    the current chain in the trie (that is, the next symbol is not an edge of the trie) and

    therefore we need to transit to a different chain. In such a case, the edge leads to a

    state corresponding to a prefix of another pattern, which is equal to the longest suffix

    of the previously matched symbols.

    It is sometimes useful to look at the DFA as a directed graph whose vertex set is S

    and there is an edge between s1 and s2 with label x if and only if δ(s1, x) = s2. The

    input is inspected one symbol at a time: Given that the algorithm is in some state s ∈ S

    and the next symbol of the input is x ∈ Σ, the algorithm applies δ(s, x) to get the next

    state s′. If s′ is in F (that is, an accepting state) the algorithm indicates that a pattern

    was found. In any case, it then transits to the new state s′.


    We use the following simple definitions to capture the meaning of a state s ∈ S:

    The depth of a state s, denoted depth(s), is the length (in edges) of the shortest path

    between s0 and s. The label of a state s, denoted label(s), is the concatenation of the

edge symbols along the shortest path from s0 to s. Further, for every i ≤ depth(s),

    suffix(s, i) ∈ Σ∗ (respectively, prefix(s, i) ∈ Σ∗) is the suffix (prefix) of length i of

    label(s). The code of a state s, denoted code(s), is the unique number that is associated

    with the state, i.e., the number that encodes the state. Traditionally, this number is

    chosen arbitrarily; in this work we take advantage of this degree of freedom.

    We use the following classification of DFA transitions (cf. [85]):

• Forward transitions are the edges of the trie; each forward transition links a state of some depth d to a state of depth d + 1.

• Cross transitions are all other transitions. Each cross transition links a state of depth d to a state of depth d′, where d′ ≤ d. Cross transitions to the initial state s0 are also called failure transitions, and cross transitions to states of depth 1 are also called restartable transitions.
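Expressed as a small helper (hypothetical code following the definitions above; note that failure and restartable transitions are special cases of cross transitions):

```python
def classify_transition(depth_src: int, depth_dst: int) -> str:
    """Classify a DFA transition by the depths of its endpoints (cf. [85])."""
    if depth_dst == depth_src + 1:
        return "forward"                 # a trie edge, one level deeper
    assert depth_dst <= depth_src        # cross transitions never go deeper
    if depth_dst == 0:
        return "failure"                 # cross transition to the initial state
    if depth_dst == 1:
        return "restartable"             # cross transition to a depth-1 state
    return "cross"
```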

    2.2 Compressed Web-Traffic

This section provides an overview of the main techniques used to compress Web traffic on the Internet.

    2.2.1 Gzip Compression

HTTP 1.1 [19] supports the use of content-codings to allow a document to be compressed. The RFC suggests three content-codings: gzip, compress and deflate. In fact, gzip uses deflate as its underlying compression protocol, and for the purpose of this thesis they are considered the same. Gzip and deflate are the codings commonly supported by current browsers and Web servers (analyzing packets captured from the latest versions of the Internet Explorer, Firefox and Chrome browsers shows that these browsers accept only the gzip and deflate codings).

    The gzip algorithm uses a combination of the following compression techniques: first

    the text is compressed with the LZ77 algorithm and then the output is compressed with

Huffman coding. Let us elaborate on the two algorithms:


Figure 2.2: Example of LZ77 compression on the beginning of the Yahoo! home page. (a) Original; (b) after LZ77 compression.

LZ77 Compression The purpose of LZ77 [100] is to reduce the string representation size by spotting repeated strings within the last 32KB of the uncompressed data. The algorithm replaces a repeated string with a backward pointer consisting of a (distance, length) pair, where distance is a number in [1, 32768] (32K) indicating the distance in bytes to the previous occurrence of the string, and length is a number in [3, 258] indicating the length of the repeated string. For example, the text ‘abcdeabc’ can be compressed to ‘abcde(5,3)’; namely, “go back 5 bytes and copy 3 bytes from that point”. LZ77 refers to the above pair as a “pointer” and to uncompressed bytes as “literals”.
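The pointer semantics can be illustrated with a toy decoder (a sketch of the idea only; real DEFLATE streams are bit-packed and Huffman-coded):

```python
def lz77_decode(tokens):
    """Decode a token list of literal strings and (distance, length) pointers."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):       # byte-by-byte, so copies may overlap
                out.append(out[-dist])
        else:
            out.extend(tok.encode())      # literals pass through unchanged
    return out.decode()
```

For the example above, lz77_decode(["abcde", (5, 3)]) yields 'abcdeabc'; an overlapping pointer such as (1, 4) after 'a' expands to 'aaaaa'.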

Fig. 2.2 depicts an example extracted from the Yahoo! home page after LZ77 compression. Note that decompression is relatively cheap in time, since it reads and copies sequential data blocks; this spatial locality means only a few memory references are required.

Huffman Coding The second algorithm used by gzip is Huffman coding. This method works on a character-by-character basis, transforming each 8-bit character into a variable-size codeword; the more frequent the character, the shorter its corresponding codeword. The codewords are chosen such that no codeword is a prefix of another, so the end of each codeword can be easily determined. Dictionaries are provided to facilitate the translation of binary codewords into bytes.
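The prefix-free property is what makes bit-serial decoding unambiguous, as the following sketch shows (the codebook here is hypothetical; gzip's actual dictionaries are defined by the DEFLATE format):

```python
# Hypothetical prefix-free codebook: frequent symbols get shorter codewords.
CODEBOOK = {"0": "e", "10": "t", "110": "a", "111": "x"}

def prefix_decode(bits: str) -> str:
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in CODEBOOK:            # no codeword is a prefix of another,
            out.append(CODEBOOK[cur])  # so the first hit is the only possible one
            cur = ""
    if cur:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)
```

For instance, the bit string 0101100 splits unambiguously into 0, 10, 110, 0 and decodes to "etae".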

In the gzip format, Huffman coding maps both ASCII characters (that is, literals) and pointers into codewords, using two dictionaries: one for the literals and the pointer lengths, and the other for the pointer distances. Huffman may use either fixed or dynamic dictionaries, where the latter achieves a better compression ratio. The Huffman dictionaries for the two alphabets appear immediately after the header bits, prior to the compressed data.


A common implementation of Huffman decoding (cf. zlib [17]) uses two levels of lookup tables. The first level stores all codewords of length at most 9 bits in a table of 2⁹ entries that represents all possible 9-bit inputs; each entry holds the symbol value and its actual length. If a codeword exceeds 9 bits, there is an additional reference to a second lookup table. Thus, in most cases, decoding a symbol requires only a single memory reference, while for the less frequent symbols it requires two.

    2.2.1.1 Challenges in performing DPI on Compressed HTTP

While transparent to the end-user, compressed Web traffic needs special care from bump-in-the-wire devices that reside between the server and the client and perform DPI. The device first needs to decompress the data in order to inspect its payload, since there is no apparent “easy” way to perform DPI over compressed traffic without decompressing the data in some way. This is mainly because LZ77 is an adaptive compression algorithm; namely, the text represented by each symbol is determined dynamically by the data. As a result, the same substring is encoded differently depending on its location within the text. For example, the pattern ‘abcdef’ can appear in the compressed data as abcde ∗^j (j + 5, 5)f, where ∗^j stands for j arbitrary intervening symbols, for any j ≤ 32763.

One of the main problems with decompression is its memory requirement: the straightforward approach requires a 32KB sliding window for each connection. Note that this requirement is difficult to avoid, since a back-reference pointer can refer to any point within the sliding window and the pointers may be recursive (i.e., a pointer may point to an area containing another pointer). In contrast, DPI of non-compressed traffic requires storing only a two- or four-byte variable that holds the corresponding DFA state, aside from the DFA itself, which is stored in any case. Hence, dealing with compressed traffic poses a higher memory requirement by a factor of 8 000 to 16 000. Thus, a mid-range firewall that handles 100K-200K concurrent connections (like GTA’s G-800 [12], SonicWall’s Pro 3060 [13] or Stonesoft’s StoneGate SG-500 [14]) needs 3GB-6GB of memory, while a high-end firewall that supports 500K-10M concurrent connections (like the Juniper SRX5800 [15] or the Cisco ASA 5550 or 5580 [11]) would need 15GB-300GB of memory for the task of decompression alone. This memory requirement not only makes the architecture expensive or infeasible, but also limits the ability to perform caching or to use fast memory chips such as SRAM. Hence, reducing the space also boosts the speed, because faster memory technology such as SRAM becomes a viable option. This work deals with the challenges imposed by this space aspect.

Apart from the space penalty described above, the decompression stage also increases the overall time penalty. However, we note that DPI requires significantly more time than decompression: decompression reads consecutive memory locations and therefore enjoys the benefit of cache-block architectures and a low per-byte read cost, whereas DPI employs a very large data structure that is accessed through reads to non-consecutive memory areas, and therefore requires expensive main-memory accesses. In [36] we provided an algorithm that takes advantage of information gathered during the decompression phase in order to accelerate the commonly used Aho-Corasick pattern matching algorithm. By doing so, we significantly reduced the time requirement of the entire DPI process on compressed traffic.

    2.2.2 SDCH Compression

    2.2.2.1 The SDCH Framework

SDCH is a new compression mechanism proposed by Google Inc. In SDCH, a dictionary is downloaded (as a file) by the user agent from the server. The dictionary contains strings that are likely to appear in subsequent HTTP responses. If, for example, the header, footer, JavaScript and CSS are stored in a dictionary possessed by both user agent and server, the server can construct a delta file by substituting these elements with references to the dictionary, and the user agent can reconstruct the original page from the delta file using these references. By substituting dictionary references for repeated elements in HTTP responses, the payload size is reduced, eliminating cross-payload redundancy. In order to use SDCH, the user agent adds the label SDCH to the Accept-Encoding field of the HTTP header. The scope of a dictionary is specified by the domain and path attributes; thus, one server may have several dictionaries, and the user agent must possess the specific dictionary in order to decompress the server’s compressed traffic. If the user agent already has a dictionary from the negotiated server, it adds the dictionary id as a value of the Avail-Dictionary header. If the user agent does not have the specific dictionary that was used by the server, the server sends an HTTP response with the Get-Dictionary header and the dictionary path; the user agent can then issue a request to fetch the dictionary.


    2.2.2.2 The VCDIFF Compression Algorithm

SDCH encoding is built upon the VCDIFF compressed data format. The VCDIFF encoding process uses three types of instructions, called delta instructions: add, run and copy. add(i, str) appends to the output the i bytes specified in parameter str. run(i, b) appends i repetitions of the byte b. Finally, copy(p, x) copies the interval [p, p + x) from the dictionary (that is, x bytes starting at position p). The delta file contains the list of instructions with their arguments, and the dictionary is one long string composed of the characters that can be referenced by the copy instructions in the delta file. In the rest of this thesis, we ignore the run instruction, since it is barely used and, for our purposes, can be replaced with an equivalent add.

    For example, suppose that the dictionary is DBEAACDBCABC, and the delta file is

    given by the following commands:

    1. add (3,ABD)

    2. copy (0,5)

    3. add (1,A)

    4. copy (4,5)

    5. add (2,AB)

    6. copy (9,3)

    7. add (4,AACB)

    8. copy (5,3)

    9. add (1,A)

    10. copy (6,3)

The resulting plain-text is therefore (bytes copied from the dictionary are shown in bold):

    ABDDBEAAAACDBCABABCAACBCDBADBC
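Replaying the delta instructions above takes only a few lines (a simplified interpreter; the actual VCDIFF wire format encodes instructions in a compact binary form):

```python
def apply_delta(dictionary: str, delta) -> str:
    """Replay 'add' and 'copy' delta instructions against a dictionary."""
    out = []
    for op, a, b in delta:
        if op == "add":
            out.append(b)                     # add(i, str): append literal bytes
        elif op == "copy":
            out.append(dictionary[a:a + b])   # copy(p, x): dictionary[p, p+x)
    return "".join(out)

DICT = "DBEAACDBCABC"
DELTA = [("add", 3, "ABD"), ("copy", 0, 5), ("add", 1, "A"), ("copy", 4, 5),
         ("add", 2, "AB"), ("copy", 9, 3), ("add", 4, "AACB"), ("copy", 5, 3),
         ("add", 1, "A"), ("copy", 6, 3)]
```

Running apply_delta(DICT, DELTA) reproduces exactly the plain-text above.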

    2.3 Complexity attack

    In a complexity attack, the attacker exploits the system’s worst-case performance, which

differs from the average case that the system was designed for. Crosby and Wallach were among the first to demonstrate the phenomenon, on the commonly-used open hash data structure [43]: an attacker designs an input that requires O(n) elementary operations per insertion, instead of the O(1) operations required on average.

Recent works show that many other systems and algorithms are vulnerable to complexity attacks, including QuickSort [70], regular expression matchers [79], intrusion detection systems [34, 48, 82], the Linux route-table cache [92], the SSL authentication algorithm [40], and the retransmission algorithm in wireless networks [31]. Complexity attacks on different components of NIDS/NIPS have been suggested in the past. For example, Bro maintains a hash table with the IP header fields of packets as keys; thus, by tailoring traffic with specific headers, one can cause the hash insert operation to last significantly longer, causing Bro to fail. While in some cases modifying the algorithm suffices to mitigate the problem (e.g., Crosby and Wallach’s attack can be countered by using hash functions that are not known to the attacker), this does not hold in general.
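The attack can be reproduced in a few lines (an illustrative model with a hypothetical attacker-known hash function, not Bro's actual table): adversarial keys that all collide into a single bucket force each insertion to scan the entire chain:

```python
NBUCKETS = 64
hash_fn = lambda key: key          # hypothetical hash, known to the attacker

def insert(table, key):
    """Chained-hash insert; returns the number of key comparisons performed."""
    bucket = table[hash_fn(key) % NBUCKETS]
    comparisons = len(bucket)      # duplicate check scans the whole chain
    bucket.append(key)
    return comparisons

n = 1000
adversarial = [i * NBUCKETS for i in range(n)]   # all keys land in bucket 0
benign = list(range(n))                          # keys spread over all buckets

t1 = [[] for _ in range(NBUCKETS)]
t2 = [[] for _ in range(NBUCKETS)]
attack_cost = sum(insert(t1, k) for k in adversarial)   # Theta(n^2) total work
benign_cost = sum(insert(t2, k) for k in benign)        # roughly n^2 / NBUCKETS
```

With these parameters the adversarial stream costs about 500,000 comparisons against roughly 7,000 for the benign one: per-insertion cost degrades from O(1) to O(n), exactly the gap a complexity attack exploits.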

Chapter 3

CompactDFA: Generic State Machine Compression for Scalable Pattern Matching

In this chapter we propose a novel method to compress deterministic finite automata (DFAs), the common data structure for DPI. Compressing the DFA enables storing it in faster memory, which in turn yields a significant performance boost. Related background on pattern matching using DFAs is provided in Section 2.1. Related work is discussed in Section 1.3.1.

    3.1 The CompactDFA Scheme

In this section we explain our CompactDFA scheme. We begin by describing the scheme’s output, namely a compact encoding of the DFA, and continue with the algorithm and the intuition behind it.

    3.1.1 CompactDFA Output

    A straightforward encoding of the Aho-Corasick DFA is to store the set of rules (one

    rule for each transition) with the following fields:

    Current state field Symbol field Next state field

    The output of the CompactDFA scheme is a set of compressed rules, such that there

is only one rule per state. This is achieved by cleverly choosing the codes of the states.



    Unlike traditional AC-like algorithms, in our compact DFA each rule has the following

    structure:

    Set of current states Symbol Field Next state code

The set of current states of each rule is written in a prefix style, i.e., the rule captures all states whose code matches a specific prefix. Specifically, for each state s, let N(s) be the incoming neighborhood of s, namely all states that have an edge to s. For every state s ∈ S, we have one rule in which the current-state field is the common prefix of the codes of the states in N(s) and the next state is s. Note that the symbol that transfers each state in N(s) to state s is common to all the states in N(s), due to the properties of AC-like algorithms (see Property 2 in Section 3.1.3).

Fig. 3.1(c) shows the rules produced by CompactDFA on the DFA of Fig. 3.1(a).

    For example, Rule 5 in Fig. 3.1(c), which is 〈010**, D, 11010(s11)〉, is the compressed

    rule for next state s11 and it replaces three original rules: 〈01000(s3), D, 11010(s11)〉,

    〈01001(s5), D, 11010(s11)〉, and 〈01010(s10), D, 11010(s11)〉.

In the compressed set of rules, the code of a state may match multiple rules. Much like forwarding tables in IP networks, the rule with the Longest Prefix Match (LPM)

    determines the action. In our example, this is demonstrated by looking at Rules 6 and

    10 in Fig. 3.1(c). Suppose that the current state is s8, whose code is 00010, and the

symbol is A. Then, Rule 10 is matched, since the code of the current state matches the prefix 00***. In

    addition, Rule 6, with current state 000**, is also matched. According to the longest

    prefix match rule, Rule 6 determines the next state.
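The lookup logic can be sketched as follows (rules transcribed from Fig. 3.1(c); treating the last rule's symbol field as a wildcard default is an assumption of this sketch):

```python
# Subset of the compressed rules of Fig. 3.1(c): (state-code prefix, symbol, next).
RULES = [
    ("000**", "A", "11000"),   # Rule 6  -> s9
    ("00***", "A", "10100"),   # Rule 10 -> s7
    ("010**", "D", "11010"),   # Rule 5  -> s11
    ("*****", "*", "10000"),   # assumed default rule -> s0
]

def lpm_next_state(rules, state_code, symbol):
    """Return the next state of the longest-prefix-match rule for (state, symbol)."""
    best_len, best_next = -1, None
    for prefix, sym, nxt in rules:
        fixed = prefix.rstrip("*")             # non-wildcard part of the prefix
        if sym in (symbol, "*") and state_code.startswith(fixed):
            if len(fixed) > best_len:          # longest prefix match wins
                best_len, best_next = len(fixed), nxt
    return best_next
```

For current state s8 (code 00010) and symbol A, both Rule 6 (000**) and Rule 10 (00***) match, and LPM selects Rule 6, i.e., next state 11000 (s9), as in the example above.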

    3.1.2 CompactDFA Algorithm

    This section describes the encoding algorithm of CompactDFA and gives the intuition

    behind each of its three stages: State Grouping (Algorithm 1, Section 3.1.4), Common

    Suffix Tree Construction (Algorithm 2, Section 3.1.5), and State and Rule Encoding

    (Algorithm 3, Section 3.1.6).

The first stage of our algorithm is based on the following insight: Suppose that each state s is encoded with its label; our goal is to encode with a single rule the incoming neighborhood N(s), which should appear in the first field of the rule whose next state is s. Note that the labels of all states in N(s) share a common suffix, which is the label of s without its last symbol. Thus, by assigning code(N(s)) to be label(s) without its last symbol, padded with “don’t care” symbols at its beginning, and applying


[Figure 3.1 appears here. Panel (c), the rules of the compressed DFA, reads:

    Rule  Current state  Symbol  Next state
    1     00000          C       01001 (s5)
    2     00010          C       01000 (s3)
    3     00010          B       00000 (s4)
    4     10010          B       00010 (s2)
    5     010**          D       11010 (s11)
    6     000**          A       11000 (s9)
    7     01***          F       11100 (s13)
    8     00***          C       01010 (s10)
    9     00***          B       00001 (s8)
    10    00***          A       10100 (s7)
    11    *****          E       10010 (s1)
    12    *****          C       01100 (s12)
    13    *****          B       00011 (s6)
    14    *****          *       10000 (s0)]

Figure 3.1: A toy example. (a) Aho-Corasick DFA for the patterns {EBC, EBBC, BA, BBA, BCD, CF}. Failure and restartable transitions are omitted for clarity. (b) The Common Suffix Tree; (c) the rules of the compressed DFA.


a longest suffix match rule, one captures the transitions of the DFA correctly. For example, consider Fig. 3.1(a). The code of state s7 is BA. N(s7) = {s6, s2}, label(s6) = B and label(s2) = EB; their common suffix is B, and indeed the code of N(s7) is “***B”. On the other hand, code(N(s9)) = code({s4, s8}) = “**BB”; thus, if the current state is s4, whose label is EBB, and the symbol is A, the next state is s9, whose corresponding rule has a longer suffix than the rule corresponding to s7.

As demonstrated above, the longest suffix match rule should be applied to resolve conflicts when more than one rule is matched. Intuitively, this encoding is correct since all incoming edges to a state s (alternatively, all edges from N(s)) share the same suffix, which is code(N(s)). Moreover, a cross transition from a state s with symbol x always ends up at a state s′ whose label is the longest suffix (among all state labels) of the concatenation of label(s) with x.

However, this code is, first and foremost, extremely wasteful (and thus impractical), requiring a 32-bit code for the automaton of Fig. 3.1(a) (namely, to encode 4-byte labels) and hundreds of bits for Snort’s DFA. In addition, it uses a longest suffix match