12. Regular Expressions

Upload: lindsay-riley

Post on 05-Jan-2016


Page 1: 12. Regular Expressions. 2 Motto: I don't play accurately-any one can play accurately- but I play with wonderful expression. As far as the piano is concerned,

12. Regular Expressions


Motto:

I don't play accurately - any one can play accurately - but I play with wonderful expression. As far as the piano is concerned, sentiment is my forte. I keep science for life.

- Oscar Wilde


Concepts

• Regular Expressions
– allow searching for a pattern within a text string
– the patterns can be rather complex
  • same idea as "wildcard" characters (compare SQL), but much more expressive
– often abbreviated, e.g. as RegExp
– RegExps match as much as possible
  • they are greedy
• Theoretical underpinnings
– nondeterministic finite automata (NFA)
– regular grammars
– but some constructs extend the functionality further
  • even beyond CFG (context-free grammars)
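The greedy behaviour mentioned above can be seen in a minimal JavaScript sketch. The sample string and the variable names are made up for illustration and are not part of the original slides.

```javascript
// Greedy matching: the quantifier * grabs as much text as it can
// while still letting the overall pattern succeed.
const text = "<b>bold</b> and <i>italic</i>"; // hypothetical sample input

// <.*> starts at the first "<" and extends to the LAST ">" in the string.
const greedy = text.match(/<.*>/)[0];

// A character class that excludes ">" stops at the first closing bracket.
const bounded = text.match(/<[^>]*>/)[0];

console.log(greedy);  // <b>bold</b> and <i>italic</i>
console.log(bounded); // <b>
```

Restricting what the repeated element may match, as in the second pattern, is one common way to keep a greedy quantifier from running too far.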


Support

• Popular, widely supported
• Directly in scripting languages
– JavaScript
  • special syntax
– PHP
  • functions
– Ruby
– Perl
• As libraries
– Java's java.util.regex package


JavaScript RegExp

• Directly as argument of methods of the String object
– string.match(regexp)
  • returns an array of the matched substrings (all matches with the g modifier; otherwise the first match followed by its captured groups), or null if there is no match
– string.replace(regexp, by)
  • returns a new string in which the first match (or, with the g modifier, every match) is replaced with the by string
– string.search(regexp)
  • returns the index of the first substring that matched the regexp pattern, -1 if there is no match
– string.split(regexp)
  • returns an array of the substrings of string separated by matches of regexp
• regexp argument
– enclosed in /
  • e.g., /ex/ matches the first occurrence of "ex"
– optional modifiers placed as suffix
  • g (global); used in match() and replace()
    – e.g., /ex/g matches all occurrences of "ex"
  • i (ignore case)
    – e.g., /ex/i matches "ex", "EX", "Ex" or "eX"
  • m (multiline)
    – ^ and $ also match at the start and end of each line
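The four String methods and the three modifiers can be tried out with a short sketch like the following. The sample sentence and all variable names are assumptions chosen for illustration.

```javascript
const s = "Expect the next exam to be complex"; // hypothetical sample text

// match(): without g, the first match plus details; with g, all matches.
const first = s.match(/ex/);      // first[0] is "ex", first.index is 12
const all = s.match(/ex/g);       // the three lower-case "ex"
const anyCase = s.match(/ex/gi);  // also counts the leading "Ex"

// replace(): only the first match is replaced unless g is given.
const once = s.replace(/ex/, "EX");
const every = s.replace(/ex/g, "EX");

// search(): index of the first match, or -1 if there is none.
const where = s.search(/ex/);
const nowhere = s.search(/xyz/);

// split(): the regexp describes the separators.
const parts = "a, b;c".split(/[,;] */);

// m modifier: ^ and $ also match at line breaks.
const withoutM = /^two$/.test("one\ntwo");
const withM = /^two$/m.test("one\ntwo");

console.log(all.length, anyCase.length); // 3 4
console.log(once);                       // Expect the nEXt exam to be complex
console.log(every);                      // Expect the nEXt EXam to be complEX
console.log(where, nowhere);             // 12 -1
console.log(parts);                      // [ 'a', 'b', 'c' ]
console.log(withoutM, withM);            // false true
```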


PHP RegExp

• functions with $regexp and $string arguments
– ereg($regexp, $string [, &$matches])
  • returns the length of the matched string, false if there is no match
  • the array reference &$matches, if given, is filled with the complete matched string in $matches[0] and the captured substrings in the subsequent elements
– ereg_replace($regexp, $by, $string)
  • returns a string where the first (or all) matched patterns were replaced with the $by string
– split($regexp, $string [, $limit])
  • returns an array of the substrings of $string that were separated by patterns matching $regexp
  • the optional $limit determines how many substrings to return (the last one contains the remainder)
– eregi(), eregi_replace(), spliti()
  • same as ereg(), ereg_replace() and split(), but ignore case
– preg_match($regexp, $string)
  • Perl-compatible counterpart of ereg(); the pattern must be wrapped in delimiters, e.g. '/ex/'; see the PHP documentation
• if a global search for all matches is to be performed, ereg() or ereg_replace() must be called in a loop
• note: the ereg family (ereg, eregi, ereg_replace, eregi_replace, split, spliti) was deprecated in PHP 5.3 and removed in PHP 7.0; current code should use the preg_* functions instead


Syntax in JavaScript

• by "element" we mean a character or a group
• . any character (except line terminators)
• ? zero or one occurrence of the preceding element
• * any number of occurrences of the preceding element, incl. none
– e.g., a.*z matches the largest substring that starts with a and ends with z, incl. "az"
• + any number of occurrences of the preceding element, but at least one
– e.g., a.+z matches the largest substring that starts with a and ends with z, not including "az"
– note that "azz" and "aaz" are matched
• {n} exactly n occurrences of the preceding element
• {m,n} between m and n occurrences of the preceding element
• ^ beginning of the string
• $ end of the string
• a sequence of elements means that the elements must be matched in that order
– e.g., a.z matches "axz", "a5z", "aQz", etc.
• [] alternative elements
– e.g., [ab] means a or b
• [^ ] none of the alternative elements
– e.g., [^ab] means any character other than a or b
• - range
– e.g., [a-zA-Z] means a through z or A through Z, i.e. all lower-case and upper-case letters
• | or
– e.g., ab|yz matches "ab" or "yz"
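The syntax elements above can be checked one by one with a small table of cases. This is only a sketch: the inputs are invented, and firstMatch is a hypothetical helper, not a built-in function.

```javascript
// Each entry: [pattern, input, expected first match or null].
const cases = [
  [/colou?r/, "color", "color"],        // ? : zero or one "u"
  [/a.*z/, "xazbzx", "azbz"],           // * : greedy, up to the last z
  [/a.*z/, "az", "az"],                 // * : may match nothing
  [/a.+z/, "az", null],                 // + : needs at least one character
  [/a.+z/, "azz", "azz"],
  [/[0-9]{3}/, "12345", "123"],         // {n} : exactly three digits
  [/[0-9]{2,3}/, "12345", "123"],       // {m,n} : greedy, takes three
  [/^ab/, "abab", "ab"],                // ^ : only at the beginning
  [/ab$/, "abx", null],                 // $ : only at the end
  [/[^ab]/, "abc", "c"],                // [^ ] : anything but a or b
  [/[a-zA-Z]+/, "42 Regex!", "Regex"],  // - : ranges inside []
  [/ab|yz/, "xyz", "yz"],               // | : either alternative
];

// Hypothetical helper: the text of the first match, or null.
function firstMatch(pattern, input) {
  const m = input.match(pattern);
  return m === null ? null : m[0];
}

for (const [pattern, input, expected] of cases) {
  const actual = firstMatch(pattern, input);
  console.log(String(pattern), JSON.stringify(input), "->", actual,
              actual === expected ? "(as expected)" : "(unexpected)");
}
```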


Special Characters

• Denoted by \
– \/: /
– \b: word boundary, i.e. the position between a word character and a non-word character (inside [], [\b] is a backspace)
– \t: tab character
– \n: line feed
– \r: carriage return
– \f: form feed
– \s: whitespace character, e.g. [ \t\r\n\f]
– \d: digit, i.e. [0-9]
– \w: word character, i.e. [a-zA-Z0-9_]
– \S: not a whitespace character, i.e. [^\s]
– \D: not a digit, i.e. [^\d]
– \W: not a word character, i.e. [^\w]
– any other character preceded by \ means the character itself
– the "meta-characters" need to be escaped:
  • \\, \/, \[, \], \., \?, \|, \+, \*, \(, \), \^, \$, \-, \{, \}
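A brief sketch of the escapes in use. The sample strings are invented, and escapeRegExp is a hypothetical helper written for this example (it is not a built-in); it simply puts a backslash in front of each meta-character from the list above.

```javascript
// \b is a zero-width word boundary, not a space character.
const wholeWord = /\bcat\b/;
const inSentence = wholeWord.test("a cat sat");  // true
const inWord = wholeWord.test("concatenate");    // false

// Shorthand character classes.
const numbers = "Room 42, floor 7".match(/\d+/g);   // ["42", "7"]
const words = "tab\tand  spaces".split(/\s+/);      // ["tab", "and", "spaces"]
const names = "snake_case-name".match(/\w+/g);      // ["snake_case", "name"]

// Meta-characters must be escaped to be matched literally.
const price = "Total: $19.99".match(/\$\d+\.\d\d/)[0]; // "$19.99"

// Hypothetical helper: escape every meta-character in a plain string.
function escapeRegExp(text) {
  return text.replace(/[\\\/\[\].?|+*()^$\-{}]/g, "\\$&");
}

const literal = "1+1=2?";
const unescaped = new RegExp(literal).test(literal);             // false
const escaped = new RegExp(escapeRegExp(literal)).test(literal); // true

console.log(inSentence, inWord, numbers, words, names, price, unescaped, escaped);
```

Without escaping, the + and ? in "1+1=2?" act as quantifiers, so the pattern no longer matches its own text.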


RegExp Capturing

• If you enclose subpattern(s) in ( and ) within a RegExp, the text they match will be captured, i.e. returned or made available for later use
– e.g., \b(.*)@ will capture the first part of an email address (the text before the @)
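In JavaScript, captured groups show up as additional elements of the array returned by match() and as $1, $2, ... in the replacement string of replace(). The sketch below uses invented sample data.

```javascript
// The slide's pattern applied to a string that holds only an address.
const address = "user@example.org"; // hypothetical address
const m = address.match(/\b(.*)@/);
console.log(m[0]); // user@   (the whole match)
console.log(m[1]); // user    (the captured group)

// Because .* is greedy, surrounding text ends up in the capture as well.
const inText = "mail user@example.org".match(/\b(.*)@/)[1];
console.log(inText); // mail user

// Several groups are numbered from left to right.
const date = "2016-01-05".match(/(\d{4})-(\d{2})-(\d{2})/);
console.log(date[1], date[2], date[3]); // 2016 01 05

// In replace(), $1 and $2 refer to the captured groups.
const swapped = "Doe, Jane".replace(/(\w+), (\w+)/, "$2 $1");
console.log(swapped); // Jane Doe
```

The second case suggests using a narrower subpattern than .* when the address is embedded in longer text.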


Sample RegExp

• hex digit:
– [0-9a-fA-F]
• identifier:
– [a-zA-Z_][a-zA-Z_0-9]*
• email address:
– \b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b
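The sample patterns can be exercised as follows. This is a sketch under two assumptions: the hex-digit and identifier patterns are wrapped in ^ and $ so that they validate a whole string rather than find a substring, and all test inputs are invented.

```javascript
// ^ and $ added here (an assumption) to validate the entire input.
const hexDigit = /^[0-9a-fA-F]$/;
const identifier = /^[a-zA-Z_][a-zA-Z_0-9]*$/;
// The email pattern is used exactly as given on the slide.
const email = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/;

console.log(hexDigit.test("c"));       // true
console.log(hexDigit.test("g"));       // false
console.log(identifier.test("_tmp1")); // true
console.log(identifier.test("1abc"));  // false, must not start with a digit

console.log(email.test("user@example.org"));    // true
console.log(email.test("user@example"));        // false, no top-level domain
console.log(email.test("user@example.museum")); // false, {2,4} is too short
```

The last line shows that the email pattern is a simplification: top-level domains longer than four letters are rejected.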