©2008 secure computing corporation. all rights reserved. 1 10/20/2015 adaptive language parsing...

10
©2008 Secure Computing Corporation. All Rights Reserved. 1 07/20/22 Adaptive Language Parsing Teaching Parsers to Program Themselves J. Zdziarski [email protected] om

Upload: thomasine-bradford

Post on 02-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ©2008 Secure Computing Corporation. All Rights Reserved. 1 10/20/2015 Adaptive Language Parsing Teaching Parsers to Program Themselves J. Zdziarski z@securecomputing.com

©2008 Secure Computing Corporation. All Rights Reserved. 1

04/20/23

Adaptive Language ParsingAdaptive Language ParsingTeaching Parsers to Program ThemselvesTeaching Parsers to Program Themselves

J. Zdziarski

[email protected]

J. Zdziarski

[email protected]

Page 2: ©2008 Secure Computing Corporation. All Rights Reserved. 1 10/20/2015 Adaptive Language Parsing Teaching Parsers to Program Themselves J. Zdziarski z@securecomputing.com

2

The Problem

Adaptive spam filters have proven to workHeuristics have proven to decompose

So how is this a problem?

Page 3: ©2008 Secure Computing Corporation. All Rights Reserved. 1 10/20/2015 Adaptive Language Parsing Teaching Parsers to Program Themselves J. Zdziarski z@securecomputing.com

3

The Problem

It’s a problem because:

Adaptive spam filters still use heuristics!Well, most do anyway

Page 4: ©2008 Secure Computing Corporation. All Rights Reserved. 1 10/20/2015 Adaptive Language Parsing Teaching Parsers to Program Themselves J. Zdziarski z@securecomputing.com

4

Rules-Based Parsing

The Problem with Rules-Based Parsers• They make assumptions about language syntax

• Many languages have their own set of rules

• Requires foreknowledge of the languages being used

• Spammers can’t obey RFC, let alone proper English

• A machine can learn how to read better than you can

• Some languages don’t support whitespace…

THEREDWELLSAMISSLATE

THE RED WELL, SAM IS SLATE

THERE DWELLS A MISS, LATE

(G. Sinnamon)

ENDANGERSSPARSEAMANSWORDS

ENDANGER! SPAR, SEAMAN SWORDS!

END ANGERS; PARSE A MAN’S WORDS

(D. Higgs)

Parse-o-Grams Courtesy of Robert Craigen, Univ. of Manitoba

Page 5: ©2008 Secure Computing Corporation. All Rights Reserved. 1 10/20/2015 Adaptive Language Parsing Teaching Parsers to Program Themselves J. Zdziarski z@securecomputing.com

5

Adaptive Language Parsing

3 Steps to Adaptive Parsing

1. Build a statistical hypothesis space for all parsing optionsThis can be all ASCII chars, wchars, biGram separators, legacy heuristic rules

2. Calculate the probability that each parsing rule yields interesting dataFor each potential delimiter or rule, how often was it found in an uninteresting

token (LOW) vs. how often was it found in an interesting token (HI).

3. Use this data to reprogram the parserTake the most uninteresting N possible delimiters and use them to parse the

document differently; wash, rinse, and repeat.

* Counters are per-token, not per-message

LOWDELIM/ LOWTOTAL

(LOWDELIM/ LOWTOTAL) + (HIDELIM/ HITOTAL)PDELIM=

Page 6: ©2008 Secure Computing Corporation. All Rights Reserved. 1 10/20/2015 Adaptive Language Parsing Teaching Parsers to Program Themselves J. Zdziarski z@securecomputing.com

6

Adaptive Language Parsing

Some ExamplesFinal Delimiter Set for a SpamAssassin Corpus run:Header Delimiters: 378z049Y;mF:w"O@!^N\%$(>

Body Delimiters: T?N,I?OS.pEmroaicthldesn

Interesting Data Generated

[0.990000] ,+click (8s, 0i)

Click more interesting when with comma

[0.940828] igh (105s, 2i)

|-|igh, High, H-IGH Interest Mortgage

[0.990000] $888 (15s, 0i)

Yup, we knew about this one. So does Hal.

[0.990000] ional_Inc.+Now (6s, 0i)

[0.990000] s0r+C|ubs (12s, 0i)

Junk can be very useful to the machine

Foreign Character Sets

[0.990000] !+ESC(B (50s, 0i)

[0.990000] ESC$B(-ESC(B (19s, 0i)

[0.990000] ESC$B!!!!!!!!!!ESC(B (29s, 0i)

I have no idea what this means, but the machine says it’s Japanese spam.

Page 7: ©2008 Secure Computing Corporation. All Rights Reserved. 1 10/20/2015 Adaptive Language Parsing Teaching Parsers to Program Themselves J. Zdziarski z@securecomputing.com

7

Adaptive Language Parsing

Some TestsSpamAssassin Corpus

TP TN FP FN Precision Recall FScore

Whitespace 1640 4143 7 256 0.9957 0.8649 0.925

Static Defaults 1654 4145 5 242 0.9969 0.8723 0.930

Adaptive,W24 1760 4138 12 136 0.9932 0.9282 0.959

Adaptive,W32 1723 4140 10 173 0.9942 0.9087 0.949

Pure,W28 1756 4137 13 140 0.9926 0.9261 0.958

Pure,W36 1666 4142 8 230 0.9952 0.8786 0.933

Chinese ISP Corpus

TP TN FP FN Precision Recall Fscore

Whitespace 21085 9034 24 3699 0.9988 0.8507 0.918

Static Defaults 22787 8782 66 2109 0.9971 0.9152 0.954

Kakasi* 18745 8904 108 1438 0.9942 0.9287 0.960

Adaptive,W24 23952 9224 40 1104 0.9983 0.9559 0.976

Adaptive,W32 24068 4222 42 988 0.9982 0.9596 0.978

* Kakasi was not designed for Chinese, but it works pretty well anyway

Page 8: ©2008 Secure Computing Corporation. All Rights Reserved. 1 10/20/2015 Adaptive Language Parsing Teaching Parsers to Program Themselves J. Zdziarski z@securecomputing.com

8

Counter-Example

So can we break it too?SpamAssassin Corpus

TP TN FP FN Precision Recall FScore

Whitespace 1640 4143 7 256 0.9957 0.8649 0.925

Static Defaults 1654 4145 5 242 0.9969 0.8723 0.930

Adaptive,W24 1760 4138 12 136 0.9932 0.9282 0.959

Adaptive,W32 1723 4140 10 173 0.9942 0.9087 0.949

Pure,W28 1756 4137 13 140 0.9926 0.9261 0.958

Pure,W36 1666 4142 8 230 0.9952 0.8786 0.933

Counter- 1578 4146 4 318 0.9974 0.8322 0.9073

Example

More junk tokens = lower efficiency, and of course less ability to catch anything (good or bad)

Answer: To some degree

Page 9: ©2008 Secure Computing Corporation. All Rights Reserved. 1 10/20/2015 Adaptive Language Parsing Teaching Parsers to Program Themselves J. Zdziarski z@securecomputing.com

9

Future Work

What Else Could We Do With This?

• Extend support to statistically stem words and parse inflectionExtend the hypothesis space to include biGram and triGram delimiters, and position of split (before, after, or as delimiter).

• Character Set DetectionCertain parsing models will no doubt adhere to specific character sets

• Fuzzy Data MiningImprove text retrieval by parsing documents to be more machine-coherent

• Apply to binary parsing challengesParse executable files, forensic recovery of hard drives, pixel border detection, etc.

Page 10: ©2008 Secure Computing Corporation. All Rights Reserved. 1 10/20/2015 Adaptive Language Parsing Teaching Parsers to Program Themselves J. Zdziarski z@securecomputing.com

10

Questions

Questions?

Jonathan Zdziarski ([email protected])

http://www.securecomputing.com