regexp secrets

Secrets of Regexp Hiro Asari Red Hat, Inc.

Upload: hiro-asari

Post on 12-Nov-2014


Category:

Technology



DESCRIPTION

The Regexp class is in every Rubyist's toolbox. But do you know the theory behind it, and what goes on under the hood?

TRANSCRIPT

Page 1: Regexp secrets

Secrets of Regexp

Hiro Asari

Red Hat, Inc.

Page 2: Regexp secrets

Let's Talk About Regular Expressions

Page 3: Regexp secrets

Let's Talk About Regular Expressions

• There is no regular expression

Page 4: Regexp secrets

Let's Talk About Regular Expressions

• A good approximation as a name

Page 5: Regexp secrets

Let's Talk About Regexp

Page 6: Regexp secrets

Some people, when confronted with a problem, think, "I know, I'll use regular expressions."

Now they have two problems.

Jamie Zawinski, 12 Aug 1997

http://regex.info/blog/2006-09-15/247

http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

The point is not so much the evils of regular expressions as the evils of overusing them.

Page 7: Regexp secrets

Formal Language Theory

• The Language L

• Over Alphabet Σ

Page 8: Regexp secrets

Formal Language Theory

• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)

Page 9: Regexp secrets

Formal Language Theory

• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)

• Words over Σ: "a", "b", "ab", "aequafdhfad"

Page 10: Regexp secrets

Formal Language Theory

• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)

• Words over Σ: "a", "b", "ab", "aequafdhfad"

• Σ*: The set of all words over Σ

Page 11: Regexp secrets

Formal Language over Σ

• A subset L of Σ* (with various properties)

• L can be finite, in which case its well-formed words can simply be enumerated, but it is often infinite

Page 12: Regexp secrets

Example

• Language L over Σ = {a,b}

• 'a' is a word

• a word may be obtained by appending 'ab' to an existing word

• only words thus formed are legal

Page 13: Regexp secrets

a

aab

aabab

Well-formed words

Page 14: Regexp secrets

baaaababb

Ill-formed words

Page 15: Regexp secrets

Succinctly…

• a(ab)*
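The rule above can be checked directly in Ruby. This is a minimal sketch: the constant `WORD` and the helper `well_formed?` are names invented for this illustration, and `\A`/`\z` are added so that the whole string, not just part of it, must be a word in L. `match?` assumes Ruby 2.4 or later.

```ruby
# a(ab)*: an 'a' followed by zero or more copies of 'ab'.
# \A and \z anchor the pattern so the entire string has to be a word in L.
WORD = /\Aa(ab)*\z/

# Hypothetical helper: true when str is a well-formed word of L.
def well_formed?(str)
  WORD.match?(str)
end

%w[a aab aabab].each { |w| puts "#{w}: #{well_formed?(w)}" }   # all true
%w[b aa abab bb].each { |w| puts "#{w}: #{well_formed?(w)}" }  # all false
```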

Page 16: Regexp secrets

Expression

• A textual representation of a formal language, against which an input is tested to decide whether it is a well-formed word in that language

Page 17: Regexp secrets

Regular Languages

• ∅ (empty language) is regular

Page 18: Regexp secrets

Regular Languages

• ∅ (empty language) is regular

• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.

Page 19: Regexp secrets

Regular Languages

• ∅ (empty language) is regular

• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.

• If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages

Page 20: Regexp secrets

Regular Languages

• ∅ (empty language) is regular

• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.

• If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages

• No other languages over Σ are regular.
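The three closure operations map onto familiar Regexp syntax. The sketch below is only an illustration: `A`, `B`, and the helper `whole` are names made up here, and each pattern is anchored with `\A`/`\z` so it describes whole words rather than substrings.

```ruby
# Two base languages, each containing a single one-letter word.
A = /a/
B = /b/

# Hypothetical helper: wrap a pattern so it must match the whole string.
def whole(pattern)
  /\A(?:#{pattern})\z/
end

UNION  = whole(Regexp.union(A, B))   # A ∪ B : "a" or "b"
CONCAT = whole(/#{A}#{B}/)           # A•B   : "ab"
STAR   = whole(/(?:#{A})*/)          # A*    : "", "a", "aa", ...

puts UNION.match?('b')     # true
puts CONCAT.match?('ab')   # true
puts CONCAT.match?('ba')   # false
puts STAR.match?('')       # true
puts STAR.match?('aaaa')   # true
puts STAR.match?('aab')    # false
```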

Page 21: Regexp secrets

Regular Expressions

• Expressions of regular languages

Page 22: Regexp secrets

Regular Expressions

• Expressions of regular languages

Not

Page 23: Regexp secrets

Regular? Expressions

• It turns out that some expressions are more powerful and express non-regular languages

• Language of 'squares': (.*)\1

• aa, aaaa, WikiWiki
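A backreference is what takes the pattern beyond regular languages. As a rough sketch, the anchored variant below tests whether a whole string is some non-empty chunk repeated twice. The constant name `SQUARE` is invented here, and `.+` replaces the slide's `.*` so that the empty string is excluded.

```ruby
# \A(.+)\1\z: some non-empty string, immediately followed by a copy of itself.
SQUARE = /\A(.+)\1\z/

%w[aa aaaa WikiWiki].each { |w| puts "#{w}: #{SQUARE.match?(w)}" }  # all true
%w[Wiki aab aba].each { |w| puts "#{w}: #{SQUARE.match?(w)}" }      # all false
```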

Page 24: Regexp secrets

How does Regexp work?

• Build a finite state automaton representing a given regular expression

• Feed the String to the automaton and see if the match succeeds
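The two steps can be sketched by hand for a small pattern such as ab*. This is a toy model, not how Ruby's engine is implemented; the state names, the `TRANSITIONS` table, and the `accepts?` helper are all assumptions made for illustration.

```ruby
# Transition table for a deterministic automaton accepting ab*.
# :start is the initial state; :seen_a is the only accepting state.
TRANSITIONS = {
  [:start,  'a'] => :seen_a,
  [:seen_a, 'b'] => :seen_a
}.freeze
ACCEPTING = [:seen_a].freeze

# Feed the string to the automaton one character at a time.
def accepts?(str)
  state = :start
  str.each_char do |ch|
    state = TRANSITIONS[[state, ch]]
    return false if state.nil?   # no transition available: reject
  end
  ACCEPTING.include?(state)
end

puts accepts?('a')     # true
puts accepts?('abbb')  # true
puts accepts?('aba')   # false
```

Here the whole input must be consumed; a real matcher would also retry from later starting positions.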

Page 25: Regexp secrets

a

[State diagram; transition label: a]

Page 26: Regexp secrets

ab*

[State diagram; transition labels: a, b]

Page 27: Regexp secrets

.*

[State diagram; transition label: .]

Page 28: Regexp secrets

a$

[State diagram; transition labels: a, $]

Page 29: Regexp secrets

a?

[State diagram; transition labels: a, ε]

Page 30: Regexp secrets

a|b

[State diagram; transition labels: a, b]

Page 31: Regexp secrets

(ab|c)

[State diagram; transition labels: c, a, b]

Page 32: Regexp secrets

(ab+|c)

[State diagram; transition labels: c, a, b, b]

Page 33: Regexp secrets

Match is attempted at every character, left to right

Page 34: Regexp secrets

zyxwvutsrqponmlkjihgfedcba
^

/a$/

Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."

Page 35: Regexp secrets

zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^

/a$/

Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."

Page 36: Regexp secrets

zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
zyxwvutsrqponmlkjihgfedcba
  ^

/a$/

Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."

Page 37: Regexp secrets

zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
zyxwvutsrqponmlkjihgfedcba
  ^
zyxwvutsrqponmlkjihgfedcba
   ^

/a$/

Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."

Page 38: Regexp secrets

zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
zyxwvutsrqponmlkjihgfedcba
  ^
zyxwvutsrqponmlkjihgfedcba
   ^
⋮
zyxwvutsrqponmlkjihgfedcba
                         ^

/a$/

Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."
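The left-to-right scan can be imitated by hand. The sketch below is a simplified model rather than the engine's actual code: it tries the pattern at each offset in turn, using `\A` to pin every attempt to the start of the remaining substring, and counts the attempts.

```ruby
str = 'zyxwvutsrqponmlkjihgfedcba'

# Try the pattern at offset 0, then 1, then 2, ... and stop at the first success.
attempts = 0
offset = (0...str.size).find do |i|
  attempts += 1
  str[i..-1].match?(/\Aa$/)
end

puts "matched at offset #{offset} after #{attempts} attempts"
# prints "matched at offset 25 after 26 attempts"
```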

Page 39: Regexp secrets

Input: 'abc d a dfadg ' (four frames, each marking the current match position with ^)

# matches 'abc d a dfadg '

^\s*(.*)\s*$
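The capture keeps its trailing whitespace because the greedy `(.*)` runs to the end of the line and leaves nothing for the final `\s*`. A small sketch follows, assuming an input padded with two spaces on each side; the lazy `(.*?)` variant is one possible adjustment and is not from the slides.

```ruby
input = '  abc d a dfadg  '

# Greedy capture: .* swallows the trailing spaces before \s* gets a chance.
greedy = input.match(/^\s*(.*)\s*$/)[1]

# Lazy capture: .*? stops as early as possible, leaving the spaces to \s*.
lazy = input.match(/^\s*(.*?)\s*$/)[1]

p greedy  # => "abc d a dfadg  "
p lazy    # => "abc d a dfadg"
```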

Page 40: Regexp secrets

def pathological(n = 5)
  Regexp.new('a?' * n + 'a' * n)
end

1.upto(40) do |n|
  print n, ": "
  print Time.now, "\n" if 'a' * n =~ pathological(n)
end

a?a?a?…a?aaa…a

Page 41: Regexp secrets

aaa
^

a?a?a?aaa
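Why this pattern hurts a backtracking matcher can be seen by counting the choices it explores. What follows is a toy model written for this illustration, not Ruby's engine: `backtrack_steps` is a hypothetical helper that treats each `a?` as "try to consume an 'a' first, then fall back to skipping it" and counts every state visited while matching against 'a' * n.

```ruby
# Toy backtracking model of (a?)*n followed by (a)*n, matched against 'a' * n.
# Returns [matched, steps], where steps counts every state the search visits.
def backtrack_steps(n)
  steps = 0
  attempt = lambda do |optionals_done, pos|
    steps += 1
    # All n optional a's decided: the n mandatory a's need n characters left.
    return (n - pos) >= n if optionals_done == n
    # Greedy a?: first try consuming an 'a', then fall back to skipping it.
    (pos < n && attempt.call(optionals_done + 1, pos + 1)) ||
      attempt.call(optionals_done + 1, pos)
  end
  matched = attempt.call(0, 0)
  [matched, steps]
end

[5, 10, 15].each do |n|
  matched, steps = backtrack_steps(n)
  puts "n=#{n}: matched=#{matched}, steps=#{steps}"
end
```

In this model only the path that skips every optional `a` succeeds, and it is tried last, so the step count roughly doubles each time n grows by one.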

Page 42: Regexp secrets

Regexp tips

Page 43: Regexp secrets

UP_TO_256 = /\b(?:25[0-5]   # 250-255
|2[0-4][0-9]                # 200-249
|1[0-9][0-9]                # 100-199
|[1-9][0-9]                 # 2-digit numbers
|[0-9])                     # single-digit numbers
\b/x

IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

Use /x
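A short usage sketch for the patterns above, with the definitions repeated so the snippet runs on its own. The helper `ipv4?` and the added `\A`/`\z` anchors are assumptions for illustration; they require the whole string to be an address.

```ruby
UP_TO_256 = /\b(?:
    25[0-5]       # 250-255
  | 2[0-4][0-9]   # 200-249
  | 1[0-9][0-9]   # 100-199
  | [1-9][0-9]    # 2-digit numbers
  | [0-9]         # single-digit numbers
  )\b/x

# Interpolating a Regexp keeps its own flags, so /x still applies inside.
IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

# Hypothetical helper: the whole string must be a dotted-quad address.
def ipv4?(str)
  /\A#{IPV4_ADDRESS}\z/.match?(str)
end

puts ipv4?('192.168.0.1')  # true
puts ipv4?('10.0.0.255')   # true
puts ipv4?('256.1.1.1')    # false
puts ipv4?('1.2.3')        # false
```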

Page 44: Regexp secrets

\A, \z for strings

^, $ for lines

• \A: the beginning of the string

• \z: the end of the string

• ^: after \n

• $: before \n

Page 45: Regexp secrets

always in Ruby

\A, \z for strings

^, $ for lines

• \A: the beginning of the string

• \z: the end of the string

• ^: after \n

• $: before \n

Page 46: Regexp secrets

What's the problem?

also note the difference in what /m means

Page 47: Regexp secrets

#! /usr/bin/env perl
$a = "abc\ndef";
if ($a =~ /^d/) {
  print "yes\n";
}
if ($a =~ /^d/m) {
  print "yes now\n";
}
# prints 'yes now'

What's the problem?

also note the difference in what /m means

Page 48: Regexp secrets

#! /usr/bin/env ruby

a = "abc\ndef"
if a =~ /^d/
  p "yes"
end

What's the problem?

http://guides.rubyonrails.org/security.html#regular-expressions
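A quick way to see both points in Ruby: `^` matches after a newline without any flag, and `/m` affects what `.` matches rather than what `^` and `$` match. This is a minimal sketch with made-up input strings.

```ruby
a = "abc\ndef"

p(a =~ /^d/)    # => 4   (^ matches right after the newline)
p(a =~ /\Ad/)   # => nil (\A matches only at the start of the string)

# In Ruby, /m changes what . matches, not what ^ and $ match.
p("a\nc" =~ /a.c/)    # => nil (. does not match a newline by default)
p("a\nc" =~ /a.c/m)   # => 0   (with /m, . matches a newline too)
```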

Page 49: Regexp secrets

class File < ActiveRecord::Base
  validates :name, :format => /^[\w\.\-\+]+$/
end

Security Implications

http://guides.rubyonrails.org/security.html#regular-expressions

Page 50: Regexp secrets

file.txt%0A<script>alert('hello')</script>

Page 51: Regexp secrets

file.txt%0A<script>alert('hello')</script>

Page 52: Regexp secrets

file.txt\n<script>alert('hello')</script>

Page 53: Regexp secrets

file.txt\n<script>alert('hello')</script>

/^[\w\.\-\+]+$/

Page 54: Regexp secrets

file.txt\n<script>alert('hello')</script>

/^[\w\.\-\+]+$/

Match succeeds

ActiveRecord validation succeeds

Page 55: Regexp secrets

file.txt\n<script>alert('hello')</script>

/\A[\w\.\-\+]+\z/

Page 56: Regexp secrets

file.txt\n<script>alert('hello')</script>

/\A[\w\.\-\+]+\z/

Match fails

ActiveRecord validation fails
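The same comparison can be reproduced outside Rails with plain Ruby. In this sketch the constant names `UNSAFE` and `SAFE` are invented for illustration; the two patterns and the input are the ones shown on the slides.

```ruby
UNSAFE = /^[\w\.\-\+]+$/     # line anchors, as in the vulnerable validation
SAFE   = /\A[\w\.\-\+]+\z/   # string anchors

name = "file.txt\n<script>alert('hello')</script>"

puts UNSAFE.match?(name)      # true: the first line alone satisfies ^...$
puts SAFE.match?(name)        # false: the whole string must be valid
puts SAFE.match?('file.txt')  # true
```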

Page 57: Regexp secrets

require 'benchmark'

# simple benchmark for alternations and character class

n = 5_000

str = 'cafebabedeadbeef'*5_000

Benchmark.bmbm do |x|
  x.report('alternation') do
    str =~ /^(a|b|c|d|e|f)+$/
  end
  x.report('character class') do
    str =~ /^[a-f]+$/
  end
end

Prefer Character Class to Alternations

Page 58: Regexp secrets

Ruby 1.8.7
                       user     system      total        real
alternation        0.030000   0.010000   0.040000 (  0.036702)
character class    0.000000   0.000000   0.000000 (  0.004704)

Ruby 2.0.0
                       user     system      total        real
alternation        0.020000   0.010000   0.030000 (  0.023139)
character class    0.000000   0.000000   0.000000 (  0.009641)

JRuby 1.7.4.dev
                       user     system      total        real
alternation        0.030000   0.000000   0.030000 (  0.021000)
character class    0.010000   0.000000   0.010000 (  0.007000)

Benchmarks

Page 59: Regexp secrets

# case-insensitively match any non-word character…

# one is unlike the others
'r' =~ /(?i:[\W])/
's' =~ /(?i:[\W])/
't' =~ /(?i:[\W])/

Beware of Character Classes

's' matches, even though 's' is a word character

https://bugs.ruby-lang.org/issues/4044

Page 60: Regexp secrets

/^1?$|^(11+?)\1+$/

Page 61: Regexp secrets

/^1?$|^(11+?)\1+$/

Matches '1' or ''

Page 62: Regexp secrets

/^1?$|^(11+?)\1+$/

Non-greedily match 2 or more 1's

Page 63: Regexp secrets

/^1?$|^(11+?)\1+$/

1 or more additional times

Page 64: Regexp secrets

/^1?$|^(11+?)\1+$/

matches a composite number

Page 65: Regexp secrets

/^1?$|^(11+?)\1+$/

Matches a string of 1's if and only if the number of 1's is not prime

Page 66: Regexp secrets

class Integer
  def prime?
    "1" * self !~ /^1?$|^(11+?)\1+$/
  end
end

Integer#prime?

No performance guarantee

Attributed to the Perl hacker Abigail
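To try the pattern without reopening Integer, a standalone helper works just as well. `NOT_PRIME` and `unary_prime?` are hypothetical names used only in this sketch; the regular expression is the one from the slides.

```ruby
NOT_PRIME = /^1?$|^(11+?)\1+$/

# Hypothetical helper: write n in unary and test it against the pattern.
def unary_prime?(n)
  ('1' * n) !~ NOT_PRIME
end

primes = (0..20).select { |n| unary_prime?(n) }
p primes
# => [2, 3, 5, 7, 11, 13, 17, 19]
```

The cost grows quickly with n, since the engine has to try every candidate divisor by backtracking.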

Page 67: Regexp secrets

• @hiro_asari

• Github: BanzaiMan