ling/c sc/psyc 438/538 lecture 8 sandiway fong. adminstrivia homework 4 not yet graded …

42
LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong

Upload: griselda-richardson

Post on 17-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

LING/C SC/PSYC 438/538

Lecture 8Sandiway Fong

Page 2: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Adminstrivia

• Homework 4 not yet graded …

Page 3: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Today's Topics

• Homework 4 review• Perl regex

Page 4: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review• Sample data file:

• First try.. just try to detect a repeated word

Page 5: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review• Sample data file:

• Sample output:

Page 6: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review

• Key: think algorithmically… – think of a specific example first

w1 w2 w3 w4 w5

Compare w1 with w2

Compare w2 with w3

Compare w3 with w4

Compare w4 with w5

Page 7: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Array indices start from 0…

Homework 2 Review

• Generalize specific example, then code it up

Compare w1 with w1+1

Compare w2 with w2+1

Compare wn-2 with wn-2+1

Compare wn-1 with wn

“for” loop implementation

words

0 ,words

1 … w

ordsn-1

array @words

Array indices end just before $#words…

Page 8: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review

Page 9: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review

Page 10: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review

Page 11: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review

Page 12: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review

Page 13: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review

Page 14: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review

a decent first pass …

Page 15: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review

• Sample data file:

• Output:

Page 16: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review• Second try.. merging multiple occurrences

Page 17: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review• Second try.. merging multiple occurrences• Sample data file:

• Output:

Page 18: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review• Third try.. implementing a simple table of exceptions

Page 19: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Homework 2 Review

• Third try.. table of exceptions• Sample data file:

• Output:

Page 20: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Perl regex

• more powerful than simple wildcard matching, e.g. files– rm *.jpg, rm PIC000?.JPG

• Regular expression pattern matching:– regular expressions are patterns using operators:

• * (zero or more occurrences), • + (one or more occurrences), • ? (optional), • | (disjunction)

– widely used in many areas– theoretically equivalent to Type-3 languages in the Chomsky hierarchy

• less powerful than Context-free languages etc.

Page 21: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Perl regex

• Perl regular expression (re) matching:– $a =~ /foo/

– /…/ contains a regular expression– will evaluate to true/false depending on what’s

contained in $a

• Perl regular expression (re) match and substitute:– $a =~ s/foo/bar/– s/…match… /…substitute… /

– will modify $a by looking for a single occurrence of match

and replacing that with substitute

– s/…match… /…substitute… /g

– g = flag: global match and substitute

Page 22: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Perl regex

• Typically useful with the standard code template for reading in a file line-by-line:

open($txtfile,$ARGV[0]) or die "$ARGV[0] not found!\n";while ($line = <$txtfile>) {

if ($line =~ /..regex../) {do stuff…

}}

Page 23: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Chapter 2: JM

character class: Perl lingo

Page 24: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Chapter 2: JM

Page 25: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Chapter 2: JM

• Backslash lowercase letter for class• Uppercase variant for all but class

Page 26: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Unicode and \w

• \w is [0-9A-Za-z_]

Definition is expanded for Unicode:use utf8;use open qw(:std :utf8);

my $str = "school école École šola trường स्कू� ल škole โรงเร�ยน";@words = ($str =~ /(\w+)/g);foreach $word (@words) { print "$word\n" }

list

cont

ext

Page 27: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Chapter 2: JM

Page 28: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Chapter 2: JM

Sheeptalk

Page 29: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Chapter 2: JM

Page 30: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Chapter 2: JM• Precedence of operators

– Example: Column 1 Column 2 Column 3 …– /Column [0-9]+ */– /(Column [0-9]+ *)*/– /house(cat(s|)|)/

• Perl:– in a regular expression the pattern matched by within the pair of parentheses is

stored in designated variables $1 (and $2 and so on)

• Precedence Hierarchy:

space

Page 31: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Chapter 2: JM

A shortcut: list context for matching

http://perldoc.perl.org/perlretut.html

returns a list

returns 1 (true) or “” (empty if false)

Page 32: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Chapter 2: JM

• s/([0-9]+)/<\1>/ what does this do?

Backreferences give Perl regexps more expressive power than finite state automata (fsa)

Page 33: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Shortest vs. Greedy Matching

• default behavior – in Perl RE match:• take the longest possible matching string

– aka greedy matching• This behavior can be changed, see next slide

Page 34: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Shortest vs. Greedy Matchingfrom http://www.perl.com/doc/manual/html/pod/perlre.html

• Example:$_ = "The food is under the bar in the barn."; if ( /foo(.*?)bar/ ) { print ”matched <$1>\n"; }

• Output:– matched <d is under the >

• Notes:– ? immediately following a repetition operator like * (or +)

makes the operator work in non-greedy mode

Page 35: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Shortest vs. Greedy Matchingfrom http://www.perl.com/doc/manual/html/pod/perlre.html

• Example:$_ = "The food is under the bar in the barn."; if ( /foo(.*?)bar/ ) { print ”matched <$1>\n"; }

• Output:– greedy: matched <d is under the bar in the >

– shortest: matched <d is under the >

(.*?)(.*)

Page 36: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Shortest vs. Greedy Matching

• RE search is supposed to be fast– but searching is not necessarily proportional to the length of

the input being searched– in fact, Perl RE matching can can take exponential time (in

length)– non-deterministic

• may need to backtrack (revisit) if it matches incorrectly part of the way through

time

length

lineartime

length

exponential

Page 37: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Global Matching: scalar context

g flag in the condition of a while-loop

Page 38: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Global Matching: list contextg flag in list context

Page 39: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Split

• @array = split /re/, string– splits string into a list of substrings split by re. Each

substring is stored as an element of @array.• Examples (from perlrequick tutorial):

Page 40: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Split

Page 41: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Matched Positions

Page 42: LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …

Matched Positions