CSC 4630 Meeting 21 April 4, 2007

Upload: jordan-andrews

Post on 14-Jan-2016



Return to Perl

• Where are we?

• What is confusing?

• What practice do you need?

Ray’s Problem

Given a string of the form:

1 b 2 b 3 b 4 b 5 b 6 b 7 b 8 b 9 = 100

replace the 8 b’s with

– one plus sign
– two minus signs
– five empty strings, signifying close up the spacing to make a number

and find which replacements yield a true statement.

Ray’s Problem (2)

Thoughts on the answer:

• 1234-56-78+9 = 100 is an example of a candidate string (it is not a true statement, since the left side equals 1109)

• How many possible strings are there?

• Proof by exhaustion may be the best approach
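A minimal sketch of the exhaustive search, assuming the left-hand side is built by filling the 8 gaps between the digits 1 through 9. The sub names candidates and value are made up for this example. The plus sign can go in any of 8 gaps and the two minus signs in any 2 of the remaining 7, so there are 8 × 21 = 168 candidate strings to check.

```perl
use strict;
use warnings;

# Build every candidate left-hand side: 8 gaps between the digits 1..9,
# filled with exactly one '+', two '-', and five empty strings.
sub candidates {
    my @out;
    for my $plus (0 .. 7) {
        for my $m1 (0 .. 7) {
            next if $m1 == $plus;
            for my $m2 ($m1 + 1 .. 7) {
                next if $m2 == $plus;
                my @gap = ('') x 8;
                $gap[$plus] = '+';
                $gap[$m1]   = '-';
                $gap[$m2]   = '-';
                my $s = '1';
                $s .= $gap[$_] . ($_ + 2) for 0 .. 7;   # gap, then next digit
                push @out, $s;
            }
        }
    }
    return @out;
}

# Evaluate a string such as "1234-56-78+9" without eval:
# pull out the signed numbers one at a time and add them up.
sub value {
    my ($expr) = @_;
    my $sum = 0;
    $sum += $1 while $expr =~ /([+-]?\d+)/g;
    return $sum;
}

my @all  = candidates();
my @true = grep { value($_) == 100 } @all;
print scalar(@all), " candidates\n";
print "$_ = 100\n" for @true;
```

The true statements printed include 123-45-67+89 = 100.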

Regular Expressions Revisited

Returning to a fundamental structure

• Theoretically defined

• Implemented in grep, egrep

• Implemented in awk, gawk, nawk

• Implemented in Perl

RE(2)

• Theoretically, an RE defines a set of strings over an alphabet

• In implementation, matching with an RE checks whether the current string is an element of a set of strings that is constructed from the strings defined theoretically.

RE(3)

• A single character c

• Theoretically defines the set of strings {c}

• Which generates the set of matching lines {ScT}, where S and T are arbitrary, possibly empty strings.

• In implementation,
– grep c somelines returns ______________
– awk "/c/" somelines returns ______________
– if (/c/) {print $_;} returns ______________

RE(4)

so grep c somelines is equivalent to

perl re1 <somelines

where re1 is the Perl program

while (<STDIN>) {
    if (/c/) {print $_;}
}

RE(5)

• Theoretically, if r and s are regular expressions defining languages L and M respectively, then
– rs defines the language LM, meaning concatenate a string in L with a string in M

• Hence,
– grep abc somelines
– awk "/abc/" somelines
– while (<STDIN>) { if (/abc/) {print $_;}}

RE(6)

all return the lines that are contained in the set {SabcT} where S and T are arbitrary, possibly empty strings.

Details: /a/ defines {a}, /b/ defines {b}, /c/ defines {c}

/abc/ defines {abc} by concatenation

Lines matching /abc/ are in {SabcT}
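As a small illustration of the {SabcT} idea, the sketch below filters an in-memory list of lines the way grep abc somelines would filter a file. The sub name matching_lines and the sample lines are made up for this example.

```perl
use strict;
use warnings;

# Return the lines that contain a match for the pattern, as grep does.
sub matching_lines {
    my ($re, @lines) = @_;
    return grep { /$re/ } @lines;
}

my @somelines = ("abc\n", "xxabcyy\n", "ab c\n", "cba\n", "\n");

# Prints the first two lines only: each has the form S abc T.
print matching_lines(qr/abc/, @somelines);
```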

RE(7)

• The * operator shows that the previous simple regular expression is repeated 0 or more times.

• /ab*c/ defines the language formed as the union of the languages defined by /ac/, /abc/, /abbc/, /abbbc/, etc. This is the set {ab^n c | n = 0, 1, 2, …} (an infinite set), where b^n denotes n copies of b

• Hence /ab*c/ matches any string of the form S ab^n c T
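A quick way to check the claim about /ab*c/ is to try it on a few strings. This is only a sketch; the sub name matches_ab_star_c and the test strings are made up for this example.

```perl
use strict;
use warnings;

# True when the string contains a, then zero or more b's, then c.
sub matches_ab_star_c {
    my ($s) = @_;
    return $s =~ /ab*c/ ? 1 : 0;
}

for my $s ('ac', 'abc', 'abbbc', 'xxabbcyy', 'adc', 'ab', 'bc') {
    printf "%-10s %s\n", $s, matches_ab_star_c($s) ? 'matches' : 'no match';
}
```

The first four strings match (n = 0, 1, 3 and 2); the last three do not, because each lacks either the a or the c, or has another character between them.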

RE(8)

• The symbol . designates any character in the alphabet (What is the alphabet we’re using?) except \n, which stands for newline. (This is the Perl definition; check the various shells and the various awks.)

• Thus . defines the language A-{\n}

• And . matches any line that contains at least one character. Officially, an empty line looks like \n and every line ends with \n
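The behavior of . on empty and nonempty lines can be checked directly in Perl. A minimal sketch follows; the sub name has_a_character and the sample lines are made up for this example.

```perl
use strict;
use warnings;

# True when the line contains at least one character other than newline.
sub has_a_character {
    my ($line) = @_;
    return $line =~ /./ ? 1 : 0;
}

for my $line ("x\n", " \n", "\n", "") {
    (my $shown = $line) =~ s/\n/\\n/g;    # make the newline visible
    printf "%-6s %s\n", "\"$shown\"",
        has_a_character($line) ? 'matches /./' : 'no match';
}
```

The line holding only a blank still matches, since a blank is a character; the line holding only \n and the empty string do not.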

RE(9)

Exercise: Construct all possible lines of text that will not be matched by /a./

Exercise: Construct all possible lines of text that will be matched by /.a.b./

Exercise: Regardless of their content, what lines of text will not be matched by /.a.b./?

RE(10)

Character Classes

• Any set of characters enclosed in brackets
– The vowels [aeiou]

• Any range of consecutive ASCII coded characters enclosed in brackets
– The lower case letters [a-z]
– The digits [0-9]
– The hex digits [0-9A-F]

RE(12)

• Including special characters in the set
– To get ], use \] or []a-z] (Think about reading this string character by character to learn its meaning.)
– To get -, use \- or [a-z-]

• Complementing (not complimenting) a set
– Use ^ as leading character, [^0-9] or [^aeiou]

• More special characters
– To get ^, use \^ or place it away from the first position [a-z^_]
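The placement rules for ], - and ^ can be confirmed by testing single characters against each class. This is a sketch only; the sub name in_class and the list of checks are made up for this example.

```perl
use strict;
use warnings;

# True when the single character $ch is in the class written as $re.
sub in_class {
    my ($ch, $re) = @_;
    return $ch =~ $re ? 1 : 0;
}

my @checks = (
    [ ']', qr/[]a-z]/,  '[]a-z]'  ],   # ] first in the set is literal
    [ ']', qr/[\]a-z]/, '[\]a-z]' ],   # escaped ]
    [ '-', qr/[a-z-]/,  '[a-z-]'  ],   # - last in the set is literal
    [ '^', qr/[a-z^_]/, '[a-z^_]' ],   # ^ away from the first position
    [ '7', qr/[^0-9]/,  '[^0-9]'  ],   # complemented set rejects a digit
    [ 'x', qr/[^0-9]/,  '[^0-9]'  ],   # and accepts a letter
);
for my $c (@checks) {
    my ($ch, $re, $text) = @$c;
    printf "%s  %-8s %s\n", $ch, $text,
        in_class($ch, $re) ? 'in set' : 'not in set';
}
```

Every check reports "in set" except the digit 7 against [^0-9].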

RE(13)

The Matching Game:

• [0123456789]
• [0-9]
• [0-9\-]
• [a-z0-9]
• [a-zA-Z0-9_]
• [^0-7]
• [^A-M.,;]
• [^\^]
• [0 - 9]
• [.]

RE(14)

Short character set names

• \d means [0-9]

• \D means [^0-9]

• \w means [a-zA-Z0-9_] (identifier characters)

• \W means [^a-zA-Z0-9_]

• \s means [ \r\t\n\f] (since Perl 5.18, \s also matches the vertical tab)

• \S means [^ \r\t\n\f] (the complement of \s)
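The short names can be tried on a sample line to see which characters each one picks out. A minimal sketch follows; the sample line is made up, and /\w+/ uses the + repetition symbol to collect runs of identifier characters.

```perl
use strict;
use warnings;

my $line = "total_2 = 47 + x9\n";

my @digits     = $line =~ /\d/g;     # each digit character
my @non_digits = $line =~ /\D/g;     # everything else, newline included
my @words      = $line =~ /\w+/g;    # runs of identifier characters
my @spaces     = $line =~ /\s/g;     # blanks plus the final newline

print "digits: @digits\n";
print "words:  @words\n";
print scalar(@non_digits), " non-digits, ",
      scalar(@spaces), " whitespace characters\n";
```

The 18-character line splits into 4 digits and 14 non-digits, and the 5 whitespace characters are the four blanks plus the newline.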

RE(15)

More repetition symbols

• b* means zero or more repetitions of b, as does b{0,}

• b+ means one or more repetitions of b, as does b{1,}

• b? means zero or one repetitions of b, as does b{0,1}

• b{5,8} means five, six, seven or eight repetitions of b

• b{4} means exactly four repetitions of b
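To see the repetition counts in action, the sketch below places the b's between an a and a c, so that the number of b's in the test string is pinned down exactly. The sub name counts_matched and the range 0 to 10 are choices made for this example.

```perl
use strict;
use warnings;

# For n = 0..10, build "a" + n copies of "b" + "c" and keep the n that match.
sub counts_matched {
    my ($re) = @_;
    return grep { my $s = 'a' . ('b' x $_) . 'c'; $s =~ $re } 0 .. 10;
}

my @patterns = (
    [ 'ab*c',     qr/ab*c/     ],
    [ 'ab+c',     qr/ab+c/     ],
    [ 'ab?c',     qr/ab?c/     ],
    [ 'ab{5,8}c', qr/ab{5,8}c/ ],
    [ 'ab{4}c',   qr/ab{4}c/   ],
);
for my $p (@patterns) {
    my ($text, $re) = @$p;
    printf "%-10s n = %s\n", $text, join(',', counts_matched($re));
}
```

For example, the line for ab{5,8}c lists n = 5,6,7,8 and the line for ab?c lists n = 0,1.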

RE(16)

• Splitting a string

split(/:/,$line) divides $line into substrings at the colons and places the substrings in a list (array)

Note: Two adjacent colons :: produce an empty string.

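split(/:+/,$line) treats each run of colons as a single separator, so adjacent colons no longer produce empty substrings (a leading colon still yields an empty first element)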
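The difference between the two separators is easiest to see when empty strings are made visible. A minimal sketch follows; the sub name bracketed and the sample strings are made up for this example.

```perl
use strict;
use warnings;

# Show a list with every element bracketed, so empty strings stay visible.
sub bracketed {
    return join(' ', map { "[$_]" } @_);
}

my $line = "root::0:0:admin";
printf "%s\n", bracketed(split(/:/,  $line));    # [root] [] [0] [0] [admin]
printf "%s\n", bracketed(split(/:+/, $line));    # [root] [0] [0] [admin]

# Edge cases: a leading separator gives an empty first element,
# and trailing empty elements are dropped by default.
printf "%s\n", bracketed(split(/:+/, ":a:b"));   # [] [a] [b]
printf "%s\n", bracketed(split(/:/,  "a:b::"));  # [a] [b]
```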

Andy’s Problem

Lines from a text file look like

• 105028|Adam Mrugalski|AJM Residential|1067 Shoecraft rd|Webster|NY|14580||||||[email protected]||No||No|||Thu Dec 21 21:23:23 2006|

• 105029|robert ritchey|robert industries|po box 472|crockett |ca|94525|510-787-7290|||||[email protected]||No||No|||Fri Dec 22 02:54:54 2006|

• 105030|Jack Still|WISE TV|PO BOX 280|Coeburn|VA|24230|2763959339|||||[email protected]||No||No||9feet 1inch floor to floor. Connects to balcony. Need oak 4 feet round with landing at top. Send me a quote. J. Still WISE TV |Fri Dec 22 03:18:19 2006|

Andy (2)

The lines need to be cleaned and parsed into several reports:

• Phone contact information

• Email contact information

• Address labels

• Full database, checking for unique entries
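One possible starting point for the reports is sketched below. It rests on several assumptions: the field positions (name in field 1, phone in field 7, email in field 12, and so on) are inferred from the three sample lines, "unique" is taken to mean the same name and zip code regardless of letter case, and the sub names and sample records are made up for this example.

```perl
use strict;
use warnings;

# Assumed field positions, inferred from the three sample lines.
my %FIELD = (id => 0, name => 1, company => 2, street => 3, city => 4,
             state => 5, zip => 6, phone => 7, email => 12);

# Split one line on | and return a hash reference of trimmed fields.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    my @f = split /\|/, $line, -1;   # limit -1 keeps trailing empty fields
    s/^\s+|\s+$//g for @f;           # clean stray blanks such as "crockett "
    my %rec;
    for my $name (keys %FIELD) {
        my $value = $f[ $FIELD{$name} ];
        $rec{$name} = defined $value ? $value : '';
    }
    return \%rec;
}

# Keep the first record for each (name, zip) pair, ignoring letter case.
sub unique_records {
    my %seen;
    return grep { !$seen{ lc "$_->{name}|$_->{zip}" }++ } @_;
}

# Phone and email reports skip records where the field is empty.
sub phone_report {
    return map { $_->{name} . ': ' . $_->{phone} } grep { $_->{phone} ne '' } @_;
}

sub email_report {
    return map { $_->{name} . ' <' . $_->{email} . '>' } grep { $_->{email} ne '' } @_;
}

sub address_label {
    my ($r) = @_;
    return "$r->{name}\n$r->{street}\n$r->{city}, $r->{state} $r->{zip}\n";
}

# Hypothetical sample lines in the same layout as the slide.
my @lines = (
    '900001|Pat Example|Example Co|1 Main st|Springfield|NY|12345||||||pat@example.com||No||No|||Thu Dec 21 21:23:23 2006|',
    '900002|Lee Sample|Sample Industries|po box 1|Anytown |ca|54321|555-0100|||||lee@example.com||No||No|||Fri Dec 22 02:54:54 2006|',
    '900003|lee sample|Sample Industries|po box 1|Anytown|CA|54321|555-0100|||||lee@example.com||No||No|||Fri Dec 22 03:18:19 2006|',
);

my @records = unique_records(map { parse_record($_) } @lines);
print "$_\n" for phone_report(@records);
print "$_\n" for email_report(@records);
print address_label($_), "\n" for @records;
```

Splitting on /\|/ rather than /\|+/ matters here: the empty fields carry position information, so collapsing runs of separators would shift the email into the wrong column.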