lecture 2: lexical analysis. 2 lexical analysis input: sequence of characters output: sequence of...

26
Lecture 2: Lexical Analysis

Upload: charlene-anthony

Post on 13-Jan-2016

257 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

Lecture 2: Lexical Analysis

Page 2: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

2

Lexical Analysis

INPUT: sequence of characters

OUTPUT: sequence of tokens

A lexical analyzer is generally a subroutine of parser: Simpler design Efficient Portable

Input Scanner Parser

SymbolTable

Next_char()

character token

Next_token()

Page 3: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

3

Definitions

token – set of strings defining an atomic element with a defined meaning

pattern – a rule describing a set of string lexeme – a sequence of characters that

match some pattern

Page 4: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

4

Examples

Token Pattern Sample Lexeme

while while while

relation_op = | != | < | > <

integer (0-9)* 42

string Characters between “ “

“hello”

Page 5: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

5

Input string: size := r * 32 + c

<token,lexeme> pairs: <id, size> <assign, :=> <id, r> <arith_symbol, *> <integer, 32> <arith_symbol, +> <id, c>

Page 6: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

6

Implementing a Lexical Analyzer

Practical Issues: Input buffering Translating RE into executable form Must be able to capture a large number

of tokens with single machine Interface to parser Tools

Page 7: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

7

Capturing Multiple Tokens

Capturing keyword “begin”

Capturing variable names

What if both need to happen at the same time?

b e g i n WS

WS – white spaceA – alphabeticAN – alphanumericA

AN

WS

Page 8: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

8

Capturing Multiple Tokens

b e g i n WS

WS – white spaceA – alphabeticAN – alphanumeric

A-b

AN WS

AN

Machine is much more complicated – just for these two tokens!

Page 9: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

9

Lex – Lexical Analyzer Generator

Flex/Lex

C/C++ compiler

a.out

Lexspecification

lex.yy.c

input tokens

Page 10: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

10

Lex Specification

%{ int charCount=0, wordCount=0, lineCount=0;%}word [^ \t\n]*%%{word} {wordCount++; charCount += yyleng; }[\n] {charCount++; lineCount++;}. {charCount++;}%%main() { yylex(); printf(“Characters %d, Words: %d, Lines: %d\n”,charCount, wordCount, lineCount);}

Definitions – Code, RE

Rules – RE/Action pairs

User Routines

Page 11: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

11

Lex definitions section

C/C++ code: Surrounded by %{… %} delimiters Declare any variables used in actions

RE definitions: Define shorthand for patterns: digit [0-9] letter [a-z] ident {letter}({letter}|{digit})* Use shorthand in RE section: {ident}

%{ int charCount=0, wordCount=0, lineCount=0;%}word [^ \t\n]*

Page 12: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

12

Lex Regular Expressions

Match explicit character sequences integer, “+++”, \<\>

Character classes [abcd] [a-zA-Z] [^0-9] – matches non-numeric

{word} {wordCount++; charCount += yyleng; }[\n] {charCount++; lineCount++;}. {charCount++;}

Page 13: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

13

Alternation twelve | 12

Closure * - zero or more + - one or more ? – zero or one {number}, {number,number}

Lex Regular Expressions(cont.)

Page 14: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

14

Other operators . – matches any character except newline ^ - matches beginning of line $ - matches end of line / - trailing context () – grouping {} – RE definitions

Lex Regular Expressions(cont.)

Page 15: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

15

Lex Matching Rules

Lex always attempts to match the longest possible string.

If two rules are matched (and match strings are same length), the first rule in the specification is used.

Page 16: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

16

Lex Operators

Highest: closure concatenation alternation

Special lex characters: - \ / * + > “ { } . $ ( ) | % [ ] ^Special lex characters inside [ ]: - \ [ ] ^

Page 17: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

17

Examples

a.*z (ab)+ [0—9]{1,5} (ab|cd)?ef = abef,cdef,ef -?[0-9]\.[0-9]

Page 18: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

18

Lex Actions

Lex actions are C (C++) code to implement some required functionality

Default action is to echo to output Can ignore input (empty action) ECHO – macro that prints out matched

string yytext – matched string yyleng – length of matched string

Page 19: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

19

User Subroutines

C/C++ code Copied directly into the lexer code User can supply ‘main’ or use default

main() { yylex(); printf(“Characters %d, Words: %d, Lines: %d\n”,charCount, wordCount, lineCount);}

Page 20: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

20

Uses for Lex

Transforming Input – convert input from one form to another (example 1). yylex() is called once; return is not used in specification

Extracting Information – scan the text and return some information (example 2). yylex() is called once; return is not used in specification.

Extracting Tokens – standard use with compiler (example 3). Uses return to give the next token to the caller.

Page 21: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

21

•A regular expression is a kind of pattern that can be applied to text (Strings, in Java)•A regular expression either matches the text (or part of the text), or it fails to match.• Regular expressions are an extremely useful tool for manipulating text

– Regular expressions are heavily used in the automatic generation of Web pages

Regular expression

Page 22: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

22

• Scan for virus signatures• Process natural languages• Search for information using Google• Search and replace in word processors• Filter text( spam, malware )• Validate data-entry field (dates, email, url)

Pattern matching applications:

Page 23: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

23

Basic Operation

• Notation to specify a set of strings

Page 24: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

24

Regular expression : examples

• Notation is surprisingly expressive.

Regular Expression Yes No

a* | (a*ba*ba*ba*)*multiple of 3 b's

εbbbaaa

abbbaaabbbaababba

a

bbb

abbaaaabaabbba

a

a | a(a|b)*abegins and ends with a

aabaaa

abbaabba

εabba

(a|b)*abba(a|b)*contains the substring

abba

abbabbabbabbabbaabba

εabb

bbaaba

Page 25: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

25

Using Regular expression

• Built in to Java, Perl , PHP, Unix, .NET,…. • Additional operations typically aded for

convenience.• Ex. [a-e]+ is shorthand for (a|b|c|d|e) (a|b|c|d|e)*

Operation Regular Expression Yes No

Concatenation hello helloOthello

say helloHello

Any single character ..oo..oo.bloodrootspoonfood

cookbookchoochoo

Page 26: Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine

26

Using Regular expression

Operation Regular Expression Yes No

Replication a(bc)*deade

abcdeabcbcde

abc

One or more a(bc)+deabcde

abcbcdeadeabc

Once or not at all a(bc)?deade

abcdeabc

abcbcde

Character classes [a-m]*blackmailimbecile

abovebelow

Negation of character classes [^aeiou]bc

ae

Exactly N times [^aeiou]{6}rhythmsyzygy

rhythmsallowed

Between M and N times [a-z]{4,6}spidertiger

jellyfishcow

Whitespace characters [a-z\s]*hellohello

say helloOthello2hello