12. Regular Expressions
2
Motto:
I don't play accurately - any one can play accurately - but I play with wonderful expression. As far as the piano is concerned, sentiment is my forte. I keep science for life.
- Oscar Wilde
3
Concepts
• Regular Expressions
  – allow searching for a pattern within a text string
  – the patterns can be rather complex
    • same idea as "wildcard" characters (compare SQL), but much more expressive
  – often abbreviated, e.g. as RegExp
  – RegExps match as much as possible
    • they are greedy
• Theoretical underpinnings
  – nondeterministic finite automata (NFA)
  – regular grammars
  – but some constructs extend the functionality further
    • even beyond CFG (context-free grammars)
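Greediness can be seen in a small JavaScript sketch. The sample text is made up for illustration, and the lazy quantifier *? is not covered on these slides; it appears here only as a contrast.

```javascript
// Sample text chosen for illustration only.
const text = "a1z a2z";

// Greedy: .* grabs as much as possible, so the match runs to the LAST z.
const greedy = text.match(/a.*z/)[0];

// Lazy (*? is an addition, not from the slides): stops at the FIRST z.
const lazy = text.match(/a.*?z/)[0];

console.log(greedy); // "a1z a2z"
console.log(lazy);   // "a1z"
```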
4
Support
• Popular, widely supported
• Directly in scripting languages
  – JavaScript
    • special syntax
  – PHP
    • functions
  – Ruby
  – Perl
• as libraries
  – Java's java.util.regex package
5
JavaScript RegExp
• Directly as argument of methods of String object
  – string.match(regexp)
    • returns an array of the substrings that matched the regexp pattern, null if there is no match
  – string.replace(regexp,by)
    • returns a new string where the first (or all) matched patterns were replaced with the by string
  – string.search(regexp)
    • returns the index of the first substring that matched the regexp pattern, -1 if there is no match
  – string.split(regexp)
    • returns an array of the substrings of string separated by regexp
• regexp argument
  – enclosed in /
    • e.g., /ex/ matches first occurrence of "ex"
  – optional modifiers placed as suffix
    • g (global); used in match() and replace()
      – e.g., /ex/g matches all occurrences of "ex"
    • i (ignore case)
      – e.g., /ex/i matches "ex", "EX", "Ex" and "eX"
    • m (multiline)
      – ^ and $ also match at the beginning and end of each line
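The four String methods and the g and i modifiers can be tried out with a sketch like the following. The sample string and variable names are invented for illustration.

```javascript
// Sample string chosen for illustration only.
const s = "text, Extra, EXIT";

// match(): with g, returns every match; i makes the match ignore case.
const lowerOnly = s.match(/ex/g);   // only the lower-case "ex" in "text"
const anyCase = s.match(/ex/gi);    // "ex", "Ex" and "EX"

// replace(): without g only the first match is replaced.
const firstReplaced = s.replace(/ex/, "_");
const allReplaced = s.replace(/ex/gi, "_");

// search(): index of the first match, -1 if there is none.
const found = s.search(/Ex/);
const missing = s.search(/zz/);

// split(): cuts the string wherever the pattern matches.
const parts = s.split(/, /);

console.log(lowerOnly, anyCase, firstReplaced, allReplaced, found, missing, parts);
```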
6
PHP RegExp
• functions with $regexp and $string arguments
  – ereg($regexp,$string [,&$matches])
    • returns length of matched string, false if there is no match
    • array reference &$matches, if given, will be filled with the matched string in $matches[0] and the captured substrings in subsequent elements
  – ereg_replace($regexp,$by,$string)
    • returns a string where the first (or all) matched patterns were replaced with $by string
  – split($regexp,$string [,$limit])
    • returns an array of substrings of $string that were separated by patterns matching $regexp
    • optional $limit determines how many substrings to return (the last one contains the remainder)
  – eregi(), eregi_replace(), spliti()
    • same as ereg(), ereg_replace() and split(), but ignore case
  – preg_match($regexp,$string)
    • similar to ereg(), see PHP documentation
    • the ereg family, incl. split(), was deprecated in PHP 5.3 and removed in PHP 7; the preg_ functions replace it
• if global search for all matches is to be performed, ereg() or ereg_replace() must be called in a loop
7
Syntax in JavaScript
• by "element" we mean a character or a group
• . any character (except line breaks)
• ? one occurrence of preceding element or nothing
• * any number of occurrences of preceding element, incl. none
  – e.g., a.*z matches the largest substring that starts with a and ends with z, incl. "az"
• + any number of occurrences of preceding element, but at least one
  – e.g., a.+z matches the largest substring that starts with a and ends with z, not including "az"
    • note that "azz" and "aaz" are matched
• {n} exactly n occurrences of preceding element
• {m,n} between m and n occurrences of preceding element
• ^ beginning of the string
• $ end of the string
• sequence of elements means that such sequence must be matched
  – e.g., a.z matches "axz", "a5z", "aQz", etc.
• [] alternative elements
  – e.g., [ab] means a or b
• [^ ] none of the alternative elements
  – e.g., [^ab] means not a and not b
• - range
  – e.g., [a-zA-Z] means a through z or A through Z, i.e. all lower-case and upper-case letters
• | or
  – e.g., ab|yz matches "ab" or "yz"
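A few of these constructs, checked in JavaScript. This is only a sketch; the test strings and variable names are made up for illustration.

```javascript
// ? : one occurrence of the preceding element or nothing
const optionalU = /colou?r/;
const matchesBoth = optionalU.test("color") && optionalU.test("colour");

// * allows nothing between a and z, + requires at least one character
const starOnAz = /a.*z/.test("az");
const plusOnAz = /a.+z/.test("az");
const plusOnAzz = /a.+z/.test("azz");

// {n} and {m,n}, anchored with ^ and $ so the whole string must fit
const threeDigits = /^[0-9]{3}$/;
const twoToFourDigits = /^[0-9]{2,4}$/;

// sets, negated sets and alternation
const firstNotAb = "abc".match(/[^ab]/)[0];
const eitherPair = "xyz".match(/ab|yz/)[0];

console.log(matchesBoth, starOnAz, plusOnAz, plusOnAzz);
console.log(threeDigits.test("123"), twoToFourDigits.test("12345"));
console.log(firstNotAb, eitherPair);
```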
8
Special Characters
• Denoted by \
  – \/: /
  – \b: word boundary
  – \t: tab character
  – \n: line feed
  – \r: carriage return
  – \f: form feed
  – \s: whitespace character, e.g. [ \t\r\n\f]
  – \d: digit, i.e. [0-9]
  – \w: word character, i.e. [a-zA-Z0-9_]
  – \S: not a whitespace character, i.e. [^\s]
  – \D: not a digit, i.e. [^\d]
  – \W: not a word character, i.e. [^\w]
  – any other character preceded by \ means the character itself
  – the "meta-characters" need to be escaped:
    • \\, \/, \., \?, \[, \], \|, \+, \*, \(, \), \^, \$, \-, \{, \}
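The character classes and escaping can be exercised with a short JavaScript sketch. The sample strings are invented. Note that in a JavaScript pattern \b is a word boundary, i.e. it matches a position, not a character.

```javascript
// \d, \w, \s and \D on made-up sample strings
const numbers = "Room 42, floor 3".match(/\d+/g);
const words = "a_b c-d".match(/\w+/g);
const tokens = "one  two\tthree\nfour".split(/\s+/);
const digitsOnly = "555-1234".replace(/\D/g, "");

// An unescaped . matches any character; \. matches only a literal dot.
const looseDot = /3.14/.test("3x14");
const strictDot = /3\.14/.test("3x14");

// \b keeps "cat" from matching inside "concat".
const wholeWord = "cat concat".match(/\bcat\b/g);

console.log(numbers, words, tokens, digitsOnly, looseDot, strictDot, wholeWord);
```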
9
RegExp Capturing
• If you enclose subpattern(s) in ( and ) within a RegExp, these are the pattern(s) that will be captured, i.e. returned or used
  – e.g., applied to an email address, \b(.*)@ will capture the part before the @
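A minimal JavaScript sketch of capturing: match() returns the whole match in element 0 and the captured groups after it. The address and the date string are made up, and the $1, $2, $3 references in replace() are standard JavaScript but not mentioned on the slide.

```javascript
// Capture the part before the @ (sample address is hypothetical).
const m = "joe@example.com".match(/\b(.*)@/);
const wholeMatch = m[0];  // includes the @
const localPart = m[1];   // only what the parentheses enclosed

// Captured groups can also be used in the replacement string of replace().
const reordered = "2024-05-17".replace(/(\d+)-(\d+)-(\d+)/, "$3.$2.$1");

console.log(wholeMatch, localPart, reordered);
```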
10
Sample RegExp
• hex digit:
  – [0-9a-fA-F]
• identifier:
  – [a-zA-Z_][a-zA-Z_0-9]*
• email address:
  – \b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b
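The sample patterns can be tried in JavaScript as sketched below. The ^ and $ anchors around the first two patterns are an assumption added here so that the whole string has to fit; the test strings are invented.

```javascript
// Anchors ^ and $ are added (not on the slide) to test whole strings.
const hexDigit = /^[0-9a-fA-F]$/;
const identifier = /^[a-zA-Z_][a-zA-Z_0-9]*$/;

// Email pattern exactly as on the slide, used to find an address in text.
const email = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/;

const foundAddress = "Write to joe.smith@example.com today".match(email)[0];

// {2,4} limits the last part to 2-4 letters, so longer endings do not match.
const longEnding = email.test("a@b.museum");

console.log(hexDigit.test("F"), hexDigit.test("g"));
console.log(identifier.test("_tmp1"), identifier.test("1abc"));
console.log(foundAddress, longEnding);
```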