regexp secrets

Secrets of Regexp

Hiro Asari

Red Hat, Inc.

Let's Talk About Regular Expressions


• There is no regular expression


• A good approximation as a name

Let's Talk About Regexp

Some people, when confronted with a problem, think, "I know, I'll use regular expressions."

Now they have two problems.

Jamie Zawinski, 12 Aug 1997

http://regex.info/blog/2006-09-15/247

http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

The point is not so much the evils of regular expressions as the evils of overusing them.

Formal Language Theory

• The Language L

• Over Alphabet Σ


• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)



• Words over Σ: "a", "b", "ab", "aequafdhfad"



• Σ*: The set of all words over Σ

Formal Language over Σ

• A subset L of Σ* (with various properties)

• L can be finite, simply enumerating its well-formed words, but it is often infinite

Example

• Language L over Σ = {a,b}

• 'a' is a word

• a word may be obtained by appending 'ab' to an existing word

• only words thus formed are legal

a, aab, aabab

Well-formed words

baaaababb

Ill-formed words

Succinctly…

• a(ab)*
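The expression a(ab)* can be tried directly in Ruby. This is a minimal sketch, assuming a hypothetical helper `well_formed?` (not from the slides) and anchoring with \A and \z so that the whole word is tested. The rejected sample words are illustrative choices.

```ruby
# Hypothetical helper: true when `word` belongs to the language a(ab)*.
LANGUAGE = /\Aa(ab)*\z/

def well_formed?(word)
  !(LANGUAGE =~ word).nil?
end

%w[a aab aabab].each { |w| puts "#{w}: #{well_formed?(w)}" }  # each prints true
%w[b aa ab].each     { |w| puts "#{w}: #{well_formed?(w)}" }  # each prints false
```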

Expression

• A textual representation of a formal language; an input is tested against it to decide whether the input is a well-formed word in that language

Regular Languages

• ∅ (empty language) is regular

• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.

• If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages

http://en.wikipedia.org/wiki/Kleene_star


• No other languages over Σ are regular.



Regular Expressions

• Expressions of regular languages

Not

Regular? Expressions

• It turns out that some expressions are more powerful and express non-regular languages

• Language of 'squares': (.*)\1

• aa, aaaa, WikiWiki
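A backreference like \1 is what takes the expression beyond regular languages: the second half must repeat whatever the group captured. A small sketch, assuming a hypothetical helper `square?` and adding \A and \z so the whole word is tested rather than a substring:

```ruby
# The slide's expression, anchored so the whole word must be a square.
SQUARE = /\A(.*)\1\z/

def square?(word)
  !(SQUARE =~ word).nil?
end

puts square?("aa")        # true
puts square?("WikiWiki")  # true: "Wiki" repeated
puts square?("aaa")       # false: odd length cannot split into two equal halves
```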

How does Regexp work?

• Build a finite state automaton representing a given regular expression

• Feed the string to the automaton and see whether the match succeeds
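To make the idea concrete, here is a hand-built automaton for the expression ab*. It is only a sketch: the state names, the `TRANSITIONS` table, and the `accepts?` helper are illustrative assumptions, and Ruby's own engine is a backtracking matcher rather than a table-driven machine like this one.

```ruby
# Hand-built deterministic automaton for ab* (whole-string acceptance).
# State names and the table layout are illustrative, not generated by a library.
TRANSITIONS = {
  start:  { 'a' => :seen_a },   # the first character must be 'a'
  seen_a: { 'b' => :seen_a }    # then any number of 'b'
}.freeze
ACCEPTING = [:seen_a].freeze

def accepts?(input)
  state = :start
  input.each_char do |ch|
    state = TRANSITIONS.fetch(state, {})[ch]
    return false if state.nil?  # no edge for this character: reject
  end
  ACCEPTING.include?(state)
end

puts accepts?("a")     # true
puts accepts?("abbb")  # true
puts accepts?("ba")    # false
```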

[State diagrams for ab*, a$, a?, a|b, (ab|c), and (ab+|c); edges are labeled with the characters they consume, with ε marking the empty transition in a?]

Match is attempted at every character, left to right

zyxwvutsrqponmlkjihgfedcba
^

/a$/

Regexp does not think, "'a$' can match only at the end of the line, so I should fast-forward to the end of the line."

zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
zyxwvutsrqponmlkjihgfedcba
  ^
zyxwvutsrqponmlkjihgfedcba
   ^
⋮
zyxwvutsrqponmlkjihgfedcba
                         ^

/a$/
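The left-to-right scan can be imitated in a few lines. This sketch simulates the model described on the slides, not the engine's actual internals; the helper name `first_match_with_attempts` is hypothetical, and the pattern is anchored with \A so that each offset is tried exactly once.

```ruby
# Simulates the slide's model: try the pattern at offset 0, then 1, and so on.
# Returns [offset of the first match or nil, number of offsets tried].
def first_match_with_attempts(str, anchored_pattern)
  (0...str.length).each do |pos|
    return [pos, pos + 1] if str[pos..-1] =~ anchored_pattern
  end
  [nil, str.length]
end

str = ('a'..'z').to_a.reverse.join   # "zyxwvutsrqponmlkjihgfedcba"
position, attempts = first_match_with_attempts(str, /\Aa$/)
puts "matched at offset #{position} after #{attempts} attempts"
# prints: matched at offset 25 after 26 attempts
```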


[Animation: the match position (^) advancing through the input 'abc d a dfadg ']

# matches 'abc d a dfadg '

^\s*(.*)\s*$
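The trailing space ends up inside the capture because the greedy (.*) consumes everything first, leaving the final \s* to match nothing. A short sketch of that behavior, with a lazy (.*?) variant added as an assumption about how one might trim instead:

```ruby
input = "abc d a dfadg "

greedy = input.match(/^\s*(.*)\s*$/)   # the slide's expression
lazy   = input.match(/^\s*(.*?)\s*$/)  # assumed alternative, not from the slides

p greedy[1]  # "abc d a dfadg " (trailing space kept)
p lazy[1]    # "abc d a dfadg"
```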

def pathological(n=5)
  Regexp.new('a?' * n + 'a' * n)
end

1.upto(40) do |n|
  print n, ": "
  print Time.now, "\n" if 'a'*n =~ pathological(n)
end

a?a?a?…a?aaa…a

aaa^

a?a?a?aaa
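Why this pattern hurts a backtracking matcher can be seen by counting steps in a toy matcher. The code below is an illustrative sketch, not Ruby's engine: `match_here` and `steps_for` are hypothetical names, and the matcher understands only `x?` and `x` tokens. Timings on a real Ruby depend on the version and its optimizations, so only the step counts of this toy are shown.

```ruby
# Toy backtracking matcher for patterns made only of `x?` and `x` tokens.
# Counts how many times it is called so the growth is visible.
def match_here(tokens, ti, str, si, counter)
  counter[:steps] += 1
  return si == str.length if ti == tokens.length

  kind, ch = tokens[ti]
  can_consume = si < str.length && str[si] == ch
  case kind
  when :opt
    # Greedy: first try to consume the character, then fall back to skipping it.
    (can_consume && match_here(tokens, ti + 1, str, si + 1, counter)) ||
      match_here(tokens, ti + 1, str, si, counter)
  when :lit
    can_consume && match_here(tokens, ti + 1, str, si + 1, counter)
  end
end

# Builds a?{n}a{n} as tokens and matches it against 'a' * n.
def steps_for(n)
  tokens = [[:opt, 'a']] * n + [[:lit, 'a']] * n
  counter = { steps: 0 }
  matched = match_here(tokens, 0, 'a' * n, 0, counter)
  [matched, counter[:steps]]
end

[4, 8, 12, 16].each do |n|
  matched, steps = steps_for(n)
  puts "n=#{n}: matched=#{matched}, steps=#{steps}"
end
```

The match always succeeds, but only after every combination of "consume" and "skip" for the optional tokens has been tried, so the step count grows roughly exponentially with n.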

Regexp tips

UP_TO_256 = /\b(?:25[0-5]   # 250-255
  |2[0-4][0-9]              # 200-249
  |1[0-9][0-9]              # 100-199
  |[1-9][0-9]               # 2-digit numbers
  |[0-9])                   # single-digit numbers
  \b/x

IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

Use /x
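A usage sketch for the two constants above. The definitions are repeated so the snippet runs on its own; the helper `ipv4?` and the \A...\z wrapper are assumptions added here to test whole strings.

```ruby
UP_TO_256 = /\b(?:25[0-5]   # 250-255
  |2[0-4][0-9]              # 200-249
  |1[0-9][0-9]              # 100-199
  |[1-9][0-9]               # 2-digit numbers
  |[0-9])                   # single-digit numbers
  \b/x

# Interpolating a Regexp keeps its own options, so /x still applies inside.
IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

# Hypothetical helper: true when the whole string is a dotted-quad address.
def ipv4?(text)
  !(/\A#{IPV4_ADDRESS}\z/ =~ text).nil?
end

puts ipv4?("192.168.0.1")  # true
puts ipv4?("10.0.0.255")   # true
puts ipv4?("256.1.1.1")    # false: 256 is out of range
```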

\A, \z for strings; ^, $ for lines

• \A: the beginning of the string

• \z: the end of the string

• ^: the beginning of a line (the start of the string or after \n)

• $: the end of a line (the end of the string or before \n)

always in Ruby

What's the problem?

also note the difference in what /m means

#!/usr/bin/env perl
$a = "abc\ndef";
if ($a =~ /^d/) {
  print "yes\n";
}
if ($a =~ /^d/m) {
  print "yes now\n";
}
# prints 'yes now'

#!/usr/bin/env ruby
a = "abc\ndef"
if a =~ /^d/
  p "yes"
end
# prints "yes"
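The difference can be checked side by side. In Perl, /m is what lets ^ and $ match at line boundaries; in Ruby they always do, and /m instead makes the dot match a newline. A minimal sketch (the variable name `text` is illustrative):

```ruby
text = "abc\ndef"

p text =~ /^d/    # 4   (^ matches right after the newline, no flag needed)
p text =~ /\Ad/   # nil (\A matches only at the very start of the string)
p text =~ /c.d/   # nil (without /m the dot does not match "\n")
p text =~ /c.d/m  # 2   (with /m the dot matches "\n")
```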

http://guides.rubyonrails.org/security.html#regular-expressions

class File < ActiveRecord::Base
  validates :name, :format => /^[\w\.\-\+]+$/
end

Security Implications

file.txt%0A<script>alert('hello')</script>

file.txt\n<script>alert('hello')</script>


/^[\w\.\-\+]+$/

Match succeeds

ActiveRecord validation succeeds


/\A[\w\.\-\+]+\z/

Match fails

ActiveRecord validation fails
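The two outcomes can be reproduced without Rails. This sketch applies both expressions to the decoded file name from the slides; the constant names `LINE_ANCHORED` and `STRING_ANCHORED` are illustrative.

```ruby
# The file name after %0A has been decoded to a newline.
name = "file.txt\n<script>alert('hello')</script>"

LINE_ANCHORED   = /^[\w\.\-\+]+$/     # ^ and $ match around the first line only
STRING_ANCHORED = /\A[\w\.\-\+]+\z/   # \A and \z cover the whole string

p name =~ LINE_ANCHORED    # 0   (the first line "file.txt" satisfies the pattern)
p name =~ STRING_ANCHORED  # nil (the newline and script tag are rejected)
```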

require 'benchmark'

# simple benchmark for alternations and character class

n = 5_000
str = 'cafebabedeadbeef' * n

Benchmark.bmbm do |x|
  x.report('alternation') do
    str =~ /^(a|b|c|d|e|f)+$/
  end
  x.report('character class') do
    str =~ /^[a-f]+$/
  end
end

Prefer Character Class to Alternations

Ruby 1.8.7
                      user     system      total        real
alternation       0.030000   0.010000   0.040000 (  0.036702)
character class   0.000000   0.000000   0.000000 (  0.004704)

Ruby 2.0.0
                      user     system      total        real
alternation       0.020000   0.010000   0.030000 (  0.023139)
character class   0.000000   0.000000   0.000000 (  0.009641)

JRuby 1.7.4.dev
                      user     system      total        real
alternation       0.030000   0.000000   0.030000 (  0.021000)
character class   0.010000   0.000000   0.010000 (  0.007000)

Benchmarks

# case-insensitively match any non-word character…

# one is unlike the others
'r' =~ /(?i:[\W])/
's' =~ /(?i:[\W])/
't' =~ /(?i:[\W])/

Beware of Character Classes

's' matches, even though 's' is a word character

https://bugs.ruby-lang.org/issues/4044

/^1?$|^(11+?)\1+$/

• ^1?$: matches '1' or ''

• (11+?): non-greedily matches 2 or more 1's

• \1+: the same group, 1 or more additional times

• ^(11+?)\1+$: matches a composite number of 1's

• The whole expression matches a string of 1's if and only if there is a non-prime number of 1's

class Integer
  def prime?
    "1" * self !~ /^1?$|^(11+?)\1+$/
  end
end

Integer#prime?

No performance guarantee

Attributed to the Perl hacker Abigail
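A quick way to try the expression without reopening Integer. The method name `unary_prime?` is a hypothetical stand-in for the slide's Integer#prime?, and the range 0..20 is an arbitrary sample.

```ruby
# Standalone version of the slide's check: n is prime when a string of
# n ones does NOT match the "0, 1, or composite" expression.
def unary_prime?(n)
  ("1" * n) !~ /^1?$|^(11+?)\1+$/
end

primes = (0..20).select { |n| unary_prime?(n) }
p primes  # [2, 3, 5, 7, 11, 13, 17, 19]
```

As the slides warn, there is no performance guarantee: the string grows with n and the expression backtracks heavily, so this is a curiosity rather than a practical primality test.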

• @hiro_asari

• Github: BanzaiMan
