Regular expressions 1 Day 6 - 9/08/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University


Course organization

08-Sept-2014, NLP, Prof. Howard, Tulane University

http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/


Review

The quiz was the review.


Open Spyder


§4. Regular expressions


Regular expressions, or regex

>>> import re

re.findall(pattern, target_string)
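As a minimal sketch of the call pattern above: re.findall takes a pattern and a target string and returns a list of all non-overlapping matches, scanning left to right. The sample string and the variable names target and matches are illustrative assumptions, not part of the slides.

```python
import re

# Hypothetical target string, chosen only for illustration.
target = "the cat sat on the mat"

# re.findall returns every non-overlapping match, scanning left to right.
matches = re.findall("at", target)
print(matches)  # ['at', 'at', 'at']

# When nothing matches, the result is an empty list rather than an error.
no_matches = re.findall("dog", target)
print(no_matches)  # []
```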


4.2. Fixed-length matching


The test string

>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''


Strings as regular expressions

>>> re.findall(' be ', S)

[' be ', ' be ']
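The spaces in the pattern matter, because a plain string used as a regular expression matches wherever that exact sequence of characters occurs, including inside longer words. A small sketch with the test string, using 'the' as an illustrative pattern that is not in the slides:

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# With surrounding spaces, only the free-standing word matches.
with_spaces = re.findall(' the ', S)
print(with_spaces)  # [' the ', ' the ']

# Without them, the 'the' inside 'then' is matched as well.
without_spaces = re.findall('the', S)
print(without_spaces)  # ['the', 'the', 'the']
```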


Match one alternative of a disjunction with |

>>> re.findall(' to | be | it | as ', S)

[' to ', ' be ', ' it ', ' as ', ' be ', ' to ']

>>> set(re.findall(' to | be | it | as ', S))

set([' it ', ' as ', ' to ', ' be '])
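The set([...]) display above is Python 2 notation. Assuming a Python 3 interpreter, a set prints with braces and in no guaranteed order, so sorting the set is one way to get a stable list of the distinct matches. The names hits and unique are illustrative.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

hits = re.findall(' to | be | it | as ', S)

# set() removes duplicates; sorted() turns the set into an ordered list.
unique = sorted(set(hits))
print(unique)  # [' as ', ' be ', ' it ', ' to ']
```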


Match a group of characters with capturing or non-capturing parentheses, ()

>>> re.findall(' (to|be|it|as) ', S)

['to', 'be', 'it', 'as', 'be', 'to']

>>> re.findall(' (?:to|be|it|as) ', S)

[' to ', ' be ', ' it ', ' as ', ' be ', ' to ']

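By default, parentheses capture: when a pattern contains one capturing group, re.findall returns only the text matched inside the group rather than the whole match. The ?: prefix turns capturing off, so the whole match, spaces included, is returned. For the rest of this discussion, we use capturing parentheses to exclude the spaces from the output.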
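Parentheses also set the scope of |. The sketch below, which goes slightly beyond the slide, compares the grouped pattern with an ungrouped one: without parentheses, | splits the entire pattern, so the leading space attaches only to 'to' and the trailing space only to 'as'.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# Grouped: a space, then one of the four words, then a space.
grouped = re.findall(' (?:to|be|it|as) ', S)
print(grouped)  # [' to ', ' be ', ' it ', ' as ', ' be ', ' to ']

# Ungrouped: the alternatives are ' to', 'be', 'it' and 'as '.
ungrouped = re.findall(' to|be|it|as ', S)
print(ungrouped)  # [' to', 'be', 'it', 'as ', 'be', ' to']
```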


Match one character of a range with [] and its negation with [^]

>>> re.findall(' ([a-z][a-z]) ', S)

['to', 'be', 'it', 'as', 'be', 'to']

>>> re.findall(' ([^0-9][^0-9]) ', S)

['to', 'be', 'it', 'as', 'be', 'to']

>>> re.findall(' ([a-e][a-e]) ', S)

['be', 'be']

>>> re.findall(' ([^a-e][^a-e]) ', S)

['to', 'it', 'to']
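On the test string, [a-z] and [^0-9] give the same result only because every two-character token happens to be lowercase letters. A negated class matches anything outside its set, including punctuation and capitals. The string T below is a made-up example, not from the slides, chosen so that the two patterns differ:

```python
import re

# Hypothetical string containing a number, an emoticon and a capitalized word.
T = 'say hi now 42 then :) and Ok too'

# Only two lowercase letters between spaces.
lower = re.findall(' ([a-z][a-z]) ', T)
print(lower)  # ['hi']

# Any two non-digit characters between spaces.
non_digit = re.findall(' ([^0-9][^0-9]) ', T)
print(non_digit)  # ['hi', ':)', 'Ok']

# Several ranges can be listed inside one pair of brackets.
letters = re.findall(' ([A-Za-z][a-z]) ', T)
print(letters)  # ['hi', 'Ok']
```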


Match a number of repetitions of a character with {}

>>> re.findall(' ([a-z]{2}) ', S)

['to', 'be', 'it', 'as', 'be', 'to']
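A short sketch of the same idea: {2} is shorthand for writing the class twice, and changing the number changes the fixed length being matched. The three-letter search is an illustrative extension, not taken from the slides.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# [a-z]{2} is equivalent to [a-z][a-z].
two_braces = re.findall(' ([a-z]{2}) ', S)
two_spelled = re.findall(' ([a-z][a-z]) ', S)
print(two_braces == two_spelled)  # True

# {3} picks out lowercase three-letter words that have a space on both sides.
three = re.findall(' ([a-z]{3}) ', S)
print(three)  # ['own', 'the', 'the', 'not', 'any']
```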


Match any character with .

>>> re.findall(' (..) ', S)

['to', 'be', 'it', 'as', 'be', 'to']

>>> re.findall(' (.{2}) ', S)

['to', 'be', 'it', 'as', 'be', 'to']
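The dot is less selective than a character class: it matches digits and punctuation too, and by default it matches every character except a newline. The strings below are made-up examples for illustration; re.DOTALL is the standard flag that lets the dot match a newline as well.

```python
import re

# Hypothetical string with a number, an emoticon and a capitalized word.
T = 'say hi now 42 then :) and Ok too'

# Any two characters between spaces, whatever they are.
any_two = re.findall(' (..) ', T)
print(any_two)  # ['hi', '42', ':)', 'Ok']

# By default the dot does not match a newline ...
default_dot = re.findall('a.b', 'a\nb')
print(default_dot)  # []

# ... unless the re.DOTALL flag is passed.
dotall = re.findall('a.b', 'a\nb', re.DOTALL)
print(dotall)  # ['a\nb']
```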


Next time

4.2.7. and following
