regexp secrets
DESCRIPTION
Regexp class is in every Rubyist's toolbox. But do you know the theory behind it, and what goes on under the hood?
TRANSCRIPT
Secrets of Regexp
Hiro Asari
Red Hat, Inc.
Let's Talk About Regular Expressions
• There is no regular expression
• A good approximation as a name
Let's Talk About Regexp
Some people, when confronted with a problem, think, "I know, I'll use regular expressions."
Now they have two problems.
Jamie Zawinski, 12 Aug 1997
http://regex.info/blog/2006-09-15/247
http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
The point is not so much the evils of regular expressions as the evils of overusing them.
Formal Language Theory
• The Language L
• Over Alphabet Σ
• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)
• Words over Σ: "a", "b", "ab", "aequafdhfad"
• Σ*: The set of all words over Σ
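To make Σ* concrete, here is a minimal sketch that lists every word over a small alphabet up to a given length. It assumes the two-letter alphabet {a, b}; the names SIGMA and words_up_to are made up for this illustration. Σ* itself is infinite, so only a finite prefix of it can ever be printed.

```ruby
# Hypothetical names; a two-letter alphabet is assumed for brevity.
SIGMA = %w[a b].freeze

# Lists every word over the alphabet whose length is at most max_length.
# Length 0 contributes the empty word.
def words_up_to(alphabet, max_length)
  (0..max_length).flat_map do |length|
    alphabet.repeated_permutation(length).map(&:join)
  end
end

p words_up_to(SIGMA, 2)
# → ["", "a", "b", "aa", "ab", "ba", "bb"]
```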
Formal Language over Σ
• A subset L of Σ* (with various properties)
• L can be finite, in which case its well-formed words can simply be enumerated, but it is often infinite
Example
• Language L over Σ = {a,b}
• 'a' is a word
• a word may be obtained by appending 'ab' to an existing word
• only words thus formed are legal
a, aab, aabab
Well-formed words
baaaababb
Ill-formed words
Succinctly…
• a(ab)*
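The expression a(ab)* can be tried directly in Ruby. This is a small sketch: the constant name WELL_FORMED is made up here, and the \A and \z anchors are added so that the whole input must be a word of the language rather than merely contain one.

```ruby
# Hypothetical constant name; the pattern a(ab)* is the one from the slide.
# \A and \z force the entire string to match.
WELL_FORMED = /\Aa(ab)*\z/

%w[a aab aabab b abb ab].each do |word|
  puts "#{word}: #{WELL_FORMED.match?(word)}"
end
```

The first three inputs are accepted and the last three are rejected.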
Expression
• Textual representation of a formal language; an input is tested against it to decide whether the input is a well-formed word in that language
Regular Languages
• ∅ (empty language) is regular
• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.
• If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages
• No other languages over Σ are regular.
Regular Expressions
• Expressions of regular languages
Not
Regular? Expressions
• It turns out that some expressions are more powerful and express non-regular languages
• Language of 'squares': (.*)\1
• aa, aaaa, WikiWiki
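A quick sketch of the backreference in action. The constant name SQUARE is an assumption for illustration, and the pattern is anchored with \A and \z so that the whole input must consist of some string followed by an exact copy of itself.

```ruby
# Hypothetical constant name. (.*) captures some string and \1 demands
# the same string again; the anchors make that cover the whole input.
SQUARE = /\A(.*)\1\z/

%w[aa aaaa WikiWiki aba Wiki].each do |word|
  puts "#{word}: #{SQUARE.match?(word)}"
end
```

The first three inputs match; "aba" and "Wiki" do not, because neither splits into two identical halves.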
How does Regexp work?
• Build a finite state automaton representing a given regular expression
• Feed the String to the regular expression and see if the match succeeds
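The two steps above can be sketched by hand for one small pattern, ab*. This is only an illustration of the idea, not how Ruby's engine is implemented: the automaton is written out manually rather than compiled from the expression, it tests the whole input instead of searching for a match position, and every name in it (TRANSITIONS, ACCEPTING, accepts?) is made up for the example.

```ruby
# Hand-built automaton for ab* (all names are hypothetical).
# Keys are [current state, input character]; values are the next state.
TRANSITIONS = {
  [:start, "a"]  => :seen_a,
  [:seen_a, "b"] => :seen_a
}.freeze

# States in which the input read so far is a complete match.
ACCEPTING = [:seen_a].freeze

# Feeds the input through the automaton one character at a time and
# reports whether it ends in an accepting state.
def accepts?(input)
  state = :start
  input.each_char do |char|
    state = TRANSITIONS[[state, char]]
    return false if state.nil? # no transition: the input is rejected
  end
  ACCEPTING.include?(state)
end

["a", "abbb", "", "ba", "abab"].each do |input|
  puts "#{input.inspect}: #{accepts?(input)}"
end
```

Only "a" and "abbb" are accepted, which agrees with /\Aab*\z/.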
[State diagrams for the following patterns; only the edge labels survive]
a: a
ab*: a, b
.*: .
a$: a, $
a?: a, ε
a|b: a, b
(ab|c): c, a, b
(ab+|c): c, a, b, b
Match is attempted at every character, left to right
/a$/
zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
zyxwvutsrqponmlkjihgfedcba
  ^
zyxwvutsrqponmlkjihgfedcba
   ^
⋮
zyxwvutsrqponmlkjihgfedcba
                         ^
Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."
abc d a dfadg
^
# matches 'abc d a dfadg '
^\s*(.*)\s*$
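A small sketch of what the capture group ends up holding. The input string, with two spaces on each side, is an assumption (the slide's exact spacing did not survive), and the lazy variant (.*?) is added here only for contrast.

```ruby
# Assumed input: the slide's text with two spaces added on each side.
input = "  abc d a dfadg  "

# Greedy (.*) swallows the trailing spaces; the final \s* matches nothing.
greedy = input[/^\s*(.*)\s*$/, 1]

# Lazy (.*?) stops as early as possible, leaving the spaces to \s*.
lazy = input[/^\s*(.*?)\s*$/, 1]

p greedy # → "abc d a dfadg  "
p lazy   # → "abc d a dfadg"
```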
def pathological(n=5)
  Regexp.new('a?' * n + 'a' * n)
end

1.upto(40) do |n|
  print n, ": "
  print Time.now, "\n" if 'a'*n =~ pathological(n)
end
a?a?a?…a?aaa…a
aaa^
a?a?a?aaa
Regexp tips
UP_TO_256 = /\b(?:25[0-5]   # 250-255
  |2[0-4][0-9]              # 200-249
  |1[0-9][0-9]              # 100-199
  |[1-9][0-9]               # 2-digit numbers
  |[0-9])                   # single-digit numbers
  \b/x
IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/
Use /x
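A usage sketch for the two constants, repeated here so the example runs on its own. The name ANCHORED_IPV4 and the \A and \z anchors around it are additions for the example; they are not on the slide. Interpolating a Regexp into another keeps its own flags, so the /x comments inside UP_TO_256 stay valid within IPV4_ADDRESS.

```ruby
# The two constants from the slide, laid out one alternative per line.
UP_TO_256 = /\b(?:25[0-5]   # 250-255
  |2[0-4][0-9]              # 200-249
  |1[0-9][0-9]              # 100-199
  |[1-9][0-9]               # 2-digit numbers
  |[0-9])                   # single-digit numbers
  \b/x
IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

# Hypothetical addition: anchor the pattern to test a whole string.
ANCHORED_IPV4 = /\A#{IPV4_ADDRESS}\z/

p ANCHORED_IPV4.match?("192.168.0.1") # → true
p ANCHORED_IPV4.match?("256.1.1.1")   # → false
```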
\A, \z for strings; ^, $ for lines
• \A: the beginning of the string
• \z: the end of the string
• ^: after \n
• $: before \n
always in Ruby
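The difference shows up as soon as the string contains a newline. A minimal sketch, with the variable name text chosen for the example:

```ruby
text = "abc\ndef"

p text =~ /^d/   # → 4   (^ matches right after the newline)
p text =~ /\Ad/  # → nil (\A matches only at the start of the string)
p text =~ /c$/   # → 2   ($ matches right before the newline)
p text =~ /c\z/  # → nil (\z matches only at the end of the string)
```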
What's the problem?
also note the difference in what /m means
#! /usr/bin/env perl
$a = "abc\ndef";
if ($a =~ /^d/) {
  print "yes\n";
}
if ($a =~ /^d/m) {
  print "yes now\n";
}
# prints 'yes now'
#! /usr/bin/env ruby
a = "abc\ndef"
if a =~ /^d/
  p "yes"
end
What's the problem?
http://guides.rubyonrails.org/security.html#regular-expressions
class File < ActiveRecord::Base
  validates :name, :format => /^[\w\.\-\+]+$/
end
Security Implications
file.txt%0A<script>alert('hello')</script>
file.txt\n<script>alert('hello')</script>
/^[\w\.\-\+]+$/
Match succeeds; ActiveRecord validation succeeds
file.txt\n<script>alert('hello')</script>
/\A[\w\.\-\+]+\z/
Match fails; ActiveRecord validation fails
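The same outcome can be reproduced in plain Ruby, without Rails. In this sketch the constant and variable names are made up; the two patterns and the malicious input are the ones shown above.

```ruby
# Hypothetical names; the patterns are the two from the slides.
LINE_ANCHORED   = /^[\w\.\-\+]+$/
STRING_ANCHORED = /\A[\w\.\-\+]+\z/

# The decoded input: a harmless first line, then a script tag.
name = "file.txt\n<script>alert('hello')</script>"

p LINE_ANCHORED.match?(name)   # → true  (the first line alone satisfies ^...$)
p STRING_ANCHORED.match?(name) # → false (the newline is not in the character class)
```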
require 'benchmark'
# simple benchmark for alternations and character class
n = 5_000
str = 'cafebabedeadbeef' * n

Benchmark.bmbm do |x|
  x.report('alternation') do
    str =~ /^(a|b|c|d|e|f)+$/
  end
  x.report('character class') do
    str =~ /^[a-f]+$/
  end
end
Prefer Character Class to Alternations
Ruby 1.8.7
                      user     system      total        real
alternation       0.030000   0.010000   0.040000 (  0.036702)
character class   0.000000   0.000000   0.000000 (  0.004704)

Ruby 2.0.0
                      user     system      total        real
alternation       0.020000   0.010000   0.030000 (  0.023139)
character class   0.000000   0.000000   0.000000 (  0.009641)

JRuby 1.7.4.dev
                      user     system      total        real
alternation       0.030000   0.000000   0.030000 (  0.021000)
character class   0.010000   0.000000   0.010000 (  0.007000)
Benchmarks
# case-insensitively match any non-word character…
# one is unlike the others
'r' =~ /(?i:[\W])/
's' =~ /(?i:[\W])/
't' =~ /(?i:[\W])/
Beware of Character Classes
's' matches, even though 's' is a word character
https://bugs.ruby-lang.org/issues/4044
/^1?$|^(11+?)\1+$/
• ^1?$: matches '1' or ''
• (11+?): non-greedily matches 2 or more 1's
• \1+: 1 or more additional times
• ^(11+?)\1+$: matches a composite number
• Matches a string of 1's if and only if there is a non-prime number of 1's
class Integer
  def prime?
    "1" * self !~ /^1?$|^(11+?)\1+$/
  end
end
Integer#prime?
No performance guarantee
Attributed to the Perl hacker Abigail
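A usage sketch of the same regex. To avoid reopening Integer, it wraps the check in a standalone method; the names NOT_PRIME_IN_UNARY and unary_prime? are made up for this example.

```ruby
# Hypothetical names; the regex is the one from the slides.
NOT_PRIME_IN_UNARY = /^1?$|^(11+?)\1+$/

# Writes n as a string of n 1's; the number is prime exactly when the
# regex fails to match that string.
def unary_prime?(n)
  ("1" * n) !~ NOT_PRIME_IN_UNARY
end

primes = (0..20).select { |n| unary_prime?(n) }
p primes
# → [2, 3, 5, 7, 11, 13, 17, 19]
```

The backtracking over candidate divisors is why there is no performance guarantee: the work grows quickly as n increases.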
• @hiro_asari
• Github: BanzaiMan