CSC 4630 Meeting 21, April 4, 2007. Return to Perl. Where are we? What is confusing? What practice do...
Ray’s Problem
Given a string of the form:
1 b 2 b 3 b 4 b 5 b 6 b 7 b 8 b 9 = 100
replace the 8 b’s with
– one plus sign
– two minus signs
– five empty strings, signifying close up the spacing to make a number
and find which replacements yield a true statement.
Ray’s Problem (2)
Thoughts on the answer:
• 1234-56-78+9 = 100 is an example of a candidate string (it is not a true statement, since 1234-56-78+9 = 1109)
• How many possible strings are there?
• Proof by exhaustion may be the best
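A minimal sketch of the exhaustive search, assuming the rules stated in Ray's Problem (exactly one plus sign, two minus signs and five empty strings in the eight gaps). The sub names candidates and value are illustrative and do not come from the lecture.

```perl
use strict;
use warnings;

# Build every left-hand side: digits 1..9 with exactly one '+',
# two '-' and five empty strings in the eight gaps.
sub candidates {
    my @found;
    # Read $code as eight base-3 digits: 0 = empty, 1 = '+', 2 = '-'.
    for my $code (0 .. 3**8 - 1) {
        my $n = $code;
        my @ops;
        for (1 .. 8) {
            push @ops, ('', '+', '-')[$n % 3];
            $n = int($n / 3);
        }
        my $plus  = grep { $_ eq '+' } @ops;
        my $minus = grep { $_ eq '-' } @ops;
        next unless $plus == 1 && $minus == 2;
        my $expr = '1';
        $expr .= $ops[$_ - 2] . $_ for 2 .. 9;
        push @found, $expr;
    }
    return @found;
}

# Evaluate a string such as "12-3+4" by summing its signed numbers.
sub value {
    my ($expr) = @_;
    my $total = 0;
    while ($expr =~ /([+-]?)(\d+)/g) {
        $total += ($1 eq '-' ? -1 : 1) * $2;
    }
    return $total;
}

my @all = candidates();
print scalar(@all), " candidate strings\n";    # 168 = 8!/(1! 2! 5!)
print "$_ = 100\n" for grep { value($_) == 100 } @all;
```

Filtering all 3**8 gap assignments is wasteful but keeps the code short; with only 168 survivors, exhaustion is cheap.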
Regular Expressions Revisited
Returning to a fundamental structure
• Theoretically defined
• Implemented in grep, egrep,
• Implemented in awk, gawk, nawk
• Implemented in Perl
RE(2)
• Theoretically a RE defines a set of strings on an alphabet
• In implementation matching with a RE checks whether the current string is an element of a set of strings that is constructed from the strings defined theoretically.
RE(3)
• A single character c
• Theoretically defines the set of strings {c}
• Which generates the set of matching lines {ScT}, where S and T are arbitrary, possibly empty strings.
• In implementation,
– grep c somelines returns ______________
– awk "/c/" somelines returns ______________
– if (/c/) {print $_;} returns ______________
RE(4)
so grep c somelines is equivalent to
perl re1 <somelines
where re1 is the Perl program
while (<STDIN>) {
    if (/c/) {print $_;}
}
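The same filter can be tried without a separate input file. The following is a sketch only: the sub name grep_c and the sample text are made up, and an in-memory filehandle stands in for somelines.

```perl
use strict;
use warnings;

# Return the lines read from $fh that contain the character c,
# which is what "grep c" prints.
sub grep_c {
    my ($fh) = @_;
    my @hits;
    while (my $line = <$fh>) {
        push @hits, $line if $line =~ /c/;
    }
    return @hits;
}

# An in-memory file stands in for somelines (contents are made up).
my $text = "abacus\ndog\ncat\n\nzinc\n";
open my $fh, '<', \$text or die "cannot open: $!";
print grep_c($fh);    # prints the lines abacus, cat and zinc
close $fh;
```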
RE(5)
• Theoretically if r and s are regular expressions defining languages L and M respectively, then
– rs defines the language LM, meaning concatenate a string in L with a string in M
• Hence,
– grep abc somelines
– awk "/abc/" somelines
– while (<STDIN>) { if (/abc/) {print $_;}}
RE(6)
all return the lines that are contained in the set {SabcT} where S and T are arbitrary, possibly empty strings.
Details: /a/ defines {a}, /b/ defines {b}, /c/ defines {c}
/abc/ defines {abc} by concatenation
Lines matching /abc/ are in {SabcT}
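To make the set {SabcT} concrete, the sketch below recovers S and T for a given line. The sub name around_abc is hypothetical, and the line is assumed to have had its trailing newline removed.

```perl
use strict;
use warnings;

# Return (S, T) around the first "abc" in $line,
# or an empty list when $line is not in {SabcT}.
sub around_abc {
    my ($line) = @_;
    return $line =~ /^(.*?)abc(.*)$/s ? ($1, $2) : ();
}

for my $line ('xxabcyy', 'abc', 'ab c') {
    my @parts = around_abc($line);
    if (@parts) {
        print "'$line': S='$parts[0]' T='$parts[1]'\n";
    }
    else {
        print "'$line': no match\n";
    }
}
```

For the line abc both S and T are empty, which is why the slides allow "possibly empty strings".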
RE(7)
• The * operator shows that the previous simple regular expression is repeated 0 or more times.
• /ab*c/ defines the language formed as the union of the languages defined by /ac/, /abc/, /abbc/, /abbbc/, etc. This is the set {ab^nc | n = 0, 1, 2, …} (an infinite set), where b^n stands for n copies of b
• Hence /ab*c/ matches any string of the form Sab^ncT
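A short sketch of /ab*c/ in action. The sub name b_count is illustrative; it reports the n of the first a, n b's, c found in a string, or undef when nothing matches.

```perl
use strict;
use warnings;

# Number of b's between a and c in the first match of /ab*c/,
# or undef when the string does not match.
sub b_count {
    my ($s) = @_;
    return $s =~ /a(b*)c/ ? length($1) : undef;
}

for my $s ('ac', 'abc', 'abbbc', 'xxabbcyy', 'adc', 'ab') {
    my $n = b_count($s);
    print "$s: ", (defined $n ? "matches with n = $n" : 'no match'), "\n";
}
```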
RE(8)
• The symbol . designates any character in the alphabet (What is the alphabet we’re using?) except \n which stands for newline. (A Perl definition, check for the various shells and the various awks).
• Thus . defines the language A - {\n}
• And . matches any line that contains at least one character. Officially an empty line looks like \n and every line ends with \n
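The behavior of . on empty lines can be checked directly. This is a sketch with a made-up sub name (has_char); the /s modifier shown at the end is a Perl feature that lets . match \n as well.

```perl
use strict;
use warnings;

# True (1) when the line contains at least one character other than \n.
sub has_char {
    my ($line) = @_;
    return $line =~ /./ ? 1 : 0;
}

print has_char("\n"),   "\n";    # 0: an empty line holds only the newline
print has_char("x\n"),  "\n";    # 1
print has_char(" \n"),  "\n";    # 1: a space counts as a character
print "\n" =~ /./s ? 1 : 0, "\n";    # 1: with /s the dot also matches \n
```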
RE(9)
Exercise: Construct all possible lines of text that will not be matched by /a./
Exercise: Construct all possible lines of text that will be matched by /.a.b./
Exercise: Regardless of their content, what lines of text will not be matched by /.a.b./
RE(10)
Character Classes
• Any set of characters enclosed in brackets
– The vowels [aeiou]
• Any range of consecutive ASCII coded characters enclosed in brackets
– The lower case letters [a-z]
– The digits [0-9]
– The hex digits [0-9A-F]
RE(12)
• Including special characters in the set
– To get ], use \] or []a-z] (Think about reading this string character by character to learn its meaning.)
– To get -, use \- or [a-z-]
• Complementing (not complimenting) a set
– Use ^ as leading character, [^0-9] or [^aeiou]
• More special characters
– To get ^, use \^ or place it away from the first position [a-z^_]
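The placement rules for ], - and ^ can be verified with a small helper. This is a sketch; in_class is a hypothetical name, and it tests whether a single character belongs to the class written as a string.

```perl
use strict;
use warnings;

# True (1) when the one-character string $char is in the class $class,
# where $class is written as text, for example '[a-z-]'.
sub in_class {
    my ($char, $class) = @_;
    return $char =~ /\A$class\z/ ? 1 : 0;
}

print in_class(']', '[]a-z]'),  "\n";    # 1: ] first in the class is literal
print in_class('-', '[a-z-]'),  "\n";    # 1: - last in the class is literal
print in_class('^', '[a-z^_]'), "\n";    # 1: ^ away from the front is literal
print in_class('7', '[^0-9]'),  "\n";    # 0: leading ^ complements the set
print in_class('x', '[^0-9]'),  "\n";    # 1
```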
RE(13)
The Matching Game:
• [0123456789]
• [0-9]
• [0-9\-]
• [a-z0-9]
• [a-zA-Z0-9_]
• [^0-7]
• [^A-M.,;]
• [^\^]
• [0 - 9]
• [.]
RE(14)
Short character set names
• \d means [0-9]
• \D means [^0-9]
• \w means [a-zA-Z0-9_] (identifier characters)
• \W means [^a-zA-Z0-9_]
• \s means [ \r\t\n\f] (Perl 5.18 and later also include the vertical tab)
• \S means the complement of \s, [^ \r\t\n\f]
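One way to check these equivalences is to compare each short name with its bracketed form over the ASCII characters. This is a sketch: the sub name disagreements is made up, and only code points 0 to 127 are examined. For \s the result depends on the Perl version, because Perl 5.18 and later also treat the vertical tab (code 11) as whitespace.

```perl
use strict;
use warnings;

# ASCII code points on which the two patterns give different answers.
sub disagreements {
    my ($short, $long) = @_;
    return grep {
        my $ch = chr($_);
        ($ch =~ $short ? 1 : 0) != ($ch =~ $long ? 1 : 0);
    } 0 .. 127;
}

my @pairs = (
    [ '\d', qr/\d/, qr/[0-9]/ ],
    [ '\D', qr/\D/, qr/[^0-9]/ ],
    [ '\w', qr/\w/, qr/[a-zA-Z0-9_]/ ],
    [ '\s', qr/\s/, qr/[ \r\t\n\f]/ ],
);
for my $pair (@pairs) {
    my ($name, $short, $long) = @$pair;
    my @diff = disagreements($short, $long);
    print "$name differs at: ", (@diff ? "@diff" : 'nowhere'), "\n";
}
```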
RE(15)
More repetition symbols
• b* means zero or more repetitions of b, as does b{0,}
• b+ means one or more repetitions of b, as does b{1,}
• b? means zero or one repetitions of b, as does b{0,1}
• b{5,8} means five, six, seven or eight repetitions of b
• b{4} means exactly four repetitions of b
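The repetition counts can be tabulated by testing strings of 0 to 10 b's. This is a sketch with a hypothetical sub name (counts_matching). The patterns are anchored with ^ and $ so that the whole string must consist of the repetitions; the last line shows that an unanchored b{4} also matches longer runs, because they contain four b's.

```perl
use strict;
use warnings;

# The values n in 0..10 for which a string of n b's matches $re.
sub counts_matching {
    my ($re) = @_;
    return grep { ('b' x $_) =~ $re } 0 .. 10;
}

print 'b*      : ', join(' ', counts_matching(qr/^b*$/)),     "\n";
print 'b+      : ', join(' ', counts_matching(qr/^b+$/)),     "\n";
print 'b?      : ', join(' ', counts_matching(qr/^b?$/)),     "\n";
print 'b{5,8}  : ', join(' ', counts_matching(qr/^b{5,8}$/)), "\n";
print 'b{4}    : ', join(' ', counts_matching(qr/^b{4}$/)),   "\n";
print 'b{4} unanchored: ', join(' ', counts_matching(qr/b{4}/)), "\n";
```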
RE(16)
• Splitting a string
split(/:/,$line) divides $line into substrings at the colons and places the substrings in a list (array)
Note: Two adjacent colons :: produce an empty string.
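split(/:+/,$line) treats each run of one or more colons as a single separator, so adjacent colons produce no empty substrings (a colon at the very start of $line still yields an empty first substring)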
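A small sketch comparing the two separators on a made-up line. The variable names and the sample data are illustrative only.

```perl
use strict;
use warnings;

my $line = 'root::0:0::/root:/bin/sh';    # made-up sample with empty fields

my @single = split(/:/,  $line);    # every colon separates
my @run    = split(/:+/, $line);    # a run of colons separates once
my @lead   = split(/:+/, ':a:b');   # leading colon: empty first field

print scalar(@single), " fields: ", join('|', @single), "\n";
print scalar(@run),    " fields: ", join('|', @run),    "\n";
print scalar(@lead),   " fields: ", join('|', @lead),   "\n";
```

The first split yields seven fields, two of them empty; the second yields five nonempty fields; the third yields three fields, the first of which is empty.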
Andy’s Problem
Lines from a text file look like
• 105028|Adam Mrugalski|AJM Residential|1067 Shoecraft rd|Webster|NY|14580||||||[email protected]||No||No|||Thu Dec 21 21:23:23 2006|
• 105029|robert ritchey|robert industries|po box 472|crockett |ca|94525|510-787-7290|||||[email protected]||No||No|||Fri Dec 22 02:54:54 2006|
• 105030|Jack Still|WISE TV|PO BOX 280|Coeburn|VA|24230|2763959339|||||[email protected]||No||No||9feet 1inch floor to floor. Connects to balcony. Need oak 4 feet round with landing at top. Send me a quote. J. Still WISE TV |Fri Dec 22 03:18:19 2006|
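This excerpt does not say what Andy needs from these records. As a sketch of a likely first step, each line can be split on the | character. The sub name fields is hypothetical, and the meaning assigned to each position (id, name, date) is an assumption read off the sample data. The pipe must be escaped in the pattern, since an unescaped | means alternation.

```perl
use strict;
use warnings;

# Split one record into fields. The limit -1 keeps trailing empty
# fields, which split would otherwise discard.
sub fields {
    my ($line) = @_;
    chomp $line;
    return split(/\|/, $line, -1);
}

# First sample record from the slide, copied as shown there.
my $record = '105028|Adam Mrugalski|AJM Residential|1067 Shoecraft rd'
           . '|Webster|NY|14580||||||[email protected]||No||No|||'
           . 'Thu Dec 21 21:23:23 2006|';

my @f = fields($record);
print "id:   $f[0]\n";
print "name: $f[1]\n";
print "date: $f[-2]\n";    # the last field is empty: the line ends with |
```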