Compiler Design · 2017-07-31

Compiler Design Lexical Analysis

Upload: truongque

Post on 19-May-2018


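What is Lexical Analysis

[Figure: source program → lexical analyzer → token → parser; the parser sends "get next token" requests to the lexical analyzer, and both components use the symbol table.]

It is the phase in which the compiler reads the source text from the input device.

Reading has to proceed character by character, but buffering helps.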

It detects valid “tokens”, which are equivalent to “words” in a text. For example, from the following code segment,

if xval <= yval then result := yval

the lexical analyzer will detect the following unbreakable meaningful components:

“if”, “xval”, “<=”, “yval”, “then”, “result”, “:=”, “yval”

It then checks whether these meaningful components, which are like words, are valid from the point of view of the given language. If a component is valid according to the language, the lexical analyzer determines its type, or token. In the example, tokens are identified as follows:

“if” == keyword
“<=” == relational operator
“xval” == identifier
“:=” == assignment operator
“yval” == identifier
“result” == identifier

Question: What is a “valid” token?

Answer: There will be a set of rules that define valid tokens. For example, the rule for an identifier is usually something like “starts with a letter, followed by any combination of letters and digits, or by nothing”.

So, “xval” is a valid identifier. But if we write “9xval”, it won’t be a valid identifier. In fact, it generally won’t be a valid token of any kind: not a keyword, not an operator, and not a numeric value.

Valid Tokens

Valid operators: fixed strings “=”, “<”, “<=”, “>”, “>=”

Valid numeric: there is a pattern

• Starts with a digit
• Followed by more digits
• There may be a decimal point
• Digits after the decimal point
• Then there may be an “E” (exponent) followed by a signed or unsigned integer

Examples: 12.34, 12.34E10, 12.34E-5

File Reading is overhead

Lexical analysis is the most expensive phase of the compiler because it reads the text from the device, which is an extensive input operation.

Although it processes the input character by character to match the set patterns, it is better to read a block (say 1024 bytes) at a time, place it in a buffer, and afterwards process from the buffer.

Why? Every read involves a system call, and therefore a context switch. One system call per 1024 characters saves far more time than one system call per character.

So, without buffering, the following happens in a loop for each character:

1. Read a character from the disk (system call)
2. Compare it with a pattern (user mode)
3. Change state (user mode)
4. Go to 1

So we see that for every input character read there is a system call. That means the context changes from user to system. After the read, the context changes from system back to user.

Since a context change is an overhead, we incur 2 such overheads for every input character. If we could read, say, 1024 characters at once through a single system call, the number of context switches would drop by a factor of 1024.

Buffering

Lexical analyzer tasks: loop over steps 1 and 2 until end of file.

1. Read a character from disk. THIS IS A SYSTEM CALL: the CPU changes context to the OS, saving the PCB of the user process.
2. Match pattern. Here the CPU goes back to user mode, changing context again: it saves the PCB of the OS and retrieves the PCB of the user process.

So, if there is a file of 10,000 characters, a context change will take place 20,000 times.
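The saving can be checked with a small experiment. The sketch below is an illustration, not part of the slides: `count_reads` and `demo` are hypothetical helper names, and the number of `os.read` calls is used as a stand-in for the number of read system calls. It writes a 10,000-byte temporary file and reads it once with a block size of 1 and once with a block size of 1024.

```python
import os
import tempfile


def count_reads(path, block_size):
    """Return (read calls that delivered data, total bytes read)."""
    fd = os.open(path, os.O_RDONLY)
    calls = 0
    total = 0
    try:
        while True:
            chunk = os.read(fd, block_size)  # one read request per iteration
            if not chunk:                    # empty result means end of file
                break
            calls += 1
            total += len(chunk)
    finally:
        os.close(fd)
    return calls, total


def demo():
    """Compare character-at-a-time reading with 1024-byte block reading."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * 10_000)
        path = f.name
    try:
        return count_reads(path, 1), count_reads(path, 1024)
    finally:
        os.remove(path)
```

With a block size of 1 there is one read request per character; with a block size of 1024 the same file needs only ten requests that deliver data, plus the final one that reports end of file.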

[Figure: timeline of the user process and the OS without buffering. Every “Read a character” request switches context to the OS and back again before the next “Pattern match” can run in the user process.]

2048 context switches for 1024 characters!

[Figure: the same timeline with buffering. A single “Read 1024 characters” request costs one context switch to the OS and one back; pattern matching then runs in the user process.]

Only 2 context switches instead of 2048.

How it happens using a buffer

Buffer contents: newval5=oldval*12

[Animation, slides 13 to 22: the Base Pointer (BP) stays on the first character, “n”, while the Forward Pointer (FP) advances one character at a time. The scanned text grows as n, ne, new, newv, newva, newval, newval5, and then FP reads “=”.]

Retract; Return (Gettoken)

“=” cannot be part of the identifier, so FP is retracted by one character. The string between BP and FP is the next token (if it has adhered to any rule).

BP is sprung forward to FP; now the next token will be found.
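The pointer movement above can be sketched in a few lines. This is a minimal illustration, assuming only three token shapes (identifiers, unsigned integers, and single-character operators); `next_token` and `tokenize` are hypothetical names, and the buffer is modelled as an ordinary string.

```python
def next_token(buf, bp):
    """Scan one lexeme starting at index bp; return (lexeme, new bp)."""
    fp = bp
    ch = buf[fp]
    if ch.isalpha():
        # Identifier: advance FP while the character can still belong to it.
        while fp < len(buf) and buf[fp].isalnum():
            fp += 1
    elif ch.isdigit():
        # Unsigned integer: advance FP over the run of digits.
        while fp < len(buf) and buf[fp].isdigit():
            fp += 1
    else:
        # Anything else is treated as a one-character token.
        fp += 1
    # FP now rests on the first character that is NOT part of the lexeme,
    # which is the position reached after the "retract" step.
    return buf[bp:fp], fp


def tokenize(buf):
    """Repeatedly scan a lexeme, then spring BP forward to FP."""
    tokens = []
    bp = 0
    while bp < len(buf):
        lexeme, bp = next_token(buf, bp)
        tokens.append(lexeme)
    return tokens
```

For the buffer used in the slides, `tokenize("newval5=oldval*12")` yields the lexemes `newval5`, `=`, `oldval`, `*` and `12`.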

Okay, input characters are read in blocks and put in a buffer (an array) in main memory. The characters are scanned one by one. It follows that when the last character in the buffer has been read and processed, the buffer needs to be reloaded.

Consider a scenario where the buffer ends before the identifier oldval is complete.

[Figure: the buffer ends with “… newval = old”; the Base Pointer and Forward Pointer mark the partial lexeme “old” at the end of the buffer.]

Obviously, the next block has to be read into the buffer. As a result, the current buffer is overwritten.

[Figure: the buffer now begins with “val * 12 …”; the Forward Pointer is in the new content, while the Base Pointer refers to a position whose old content is gone.]

Therefore, the previous content of the buffer is lost. BP no longer points to the earlier content.

Way out: two buffers, or a split buffer whose halves are reloaded alternately

1. Initially, both buffers are empty.
2. Read the first block into the first buffer: “… newval = old”.
3. Keep on scanning and processing till FP reaches the last character. That means the lexical analyzer is inside the last potential token.
4. Read the next block and reload the second buffer, so that the pair holds “… newval = old” followed by “val * 12 …”.
5. Move the forward pointer by one character. It therefore enters the second buffer. Continue processing as if it were a single buffer.

[Animation, slides 25 to 31: BP stays in the first buffer at the start of the current lexeme while FP crosses into the second buffer, so the text between the two pointers remains intact.]

Necessity of buffering

(A) To read a block of characters at a time, thus reducing context switching time and I/O time as well. Otherwise, the process would be:

Loop: I/O request to read next character /* System call */
      execute processing logic /* User mode */
      go back to Loop.

There would be 1024 system calls for reading 1024 characters, and the disk controller would be asked separately to read each character.

With block read and buffering, one system call and one I/O request are issued per block of characters, for example, per 1024 characters.

(B) Sometimes, the lexical analyzer might have to look ahead in order to identify a token.

Example: there are two similar-looking FORTRAN statements, shown along with their meanings.

(i) DO 5 I = 1.25 /* DO5I is a variable. Set its value to 1.25 */
                  /* FORTRAN allows spaces in variable names */
(ii) DO 5 I = 1,25 /* Repeat the statements up to label 5 for I = 1 to 25 */

The difference between the two is in the “.” and “,” between 1 and 25. The lexical analyzer has to read forward (look ahead), holding FP in place, to detect the presence of a “,”. After that, it goes back to FP.

[Figure: BP and FP mark the current lexeme; the look-ahead extends to the right of FP.]
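The look-ahead decision can be illustrated as follows. This is a deliberately simplified sketch, not a FORTRAN lexer: `classify_do` is a hypothetical name, and the rule it applies (a comma after “=” and outside parentheses means a DO loop) is an assumption that is enough to separate the two statements above.

```python
def classify_do(statement):
    """Classify a statement beginning with DO as 'do-loop' or 'assignment'.

    Assumption: a comma after "=" that is not inside parentheses marks
    a DO loop; without such a comma the statement is an assignment.
    """
    text = statement.replace(" ", "").upper()  # blanks are not significant
    if not text.startswith("DO") or "=" not in text:
        return "other"
    depth = 0
    for ch in text[text.index("=") + 1:]:      # look ahead past the "="
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return "do-loop"
    return "assignment"
```

Only after the look-ahead does the analyzer know whether the leading characters form the keyword DO or the start of the identifier DO5I.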

Limitation of Buffer Pair

(1) The look-ahead character may lie beyond the end of the buffer, such as in PL/1:

DECLARE (ARG1,ARG2,………………….ARGn)

To determine whether DECLARE is an array name or a keyword, the analyzer has to look ahead till the closing “)”.

(2) With every character scanned, the lexical analyzer has to check whether it has reached the end of a block. Since there are two buffers, it has to check twice: whether it is at the end of buffer 1 or at the end of buffer 2.

The algorithm and its alternative are furnished in the following two slides.

Algorithm

With every character read, the program has to check for the end of a buffer.

Loop:
1. FWD = end of 1st half? YES: reload the 2nd half, then FWD := FWD + 1.
2. NO: FWD = end of 2nd half? YES: reload the 1st half and move FWD to the beginning of the 1st half.
3. NO to both: FWD := FWD + 1.
4. Back to loop.

Check the 1st buffer. If it is not at its end, then check the 2nd buffer.

Alternative (more efficient)

Use a sentinel value for the end of buffer. Put a $ at the end of each buffer half.

Loop:
1. FWD = $? NO: back to loop.
2. YES: FWD = end of 2nd half? YES: reload the 1st half, move FWD to the beginning of the 1st half, and go back to loop.
3. NO: it must be the end of the 1st half. Reload the 2nd half and go back to loop.

Though it looks like two checks, the second check comes only when FWD = $, which happens once for each buffer. If each buffer holds 1024 characters, then the second check does not happen for 1023 of them; it happens once out of 1024.

[Figure: the two buffer halves, each ending in $.]
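A minimal sketch of the sentinel scheme follows, under several assumptions that are not in the slides: the class name `BufferPair` and its methods are hypothetical, the sentinel is the NUL character rather than `$` (so that `$` may appear in the source text), and a sentinel found inside a half is taken to mean end of input. The sketch tracks only the forward pointer; it does not protect a lexeme whose base pointer is still in the half being reloaded.

```python
import io

SENTINEL = "\0"  # assumption: this character never occurs in the source text


class BufferPair:
    """Two buffer halves of n characters, each followed by a sentinel slot."""

    def __init__(self, stream, n=1024):
        self.stream = stream
        self.n = n
        self.buf = [SENTINEL] * (2 * (n + 1))
        self.fwd = 0
        self._reload(0)

    def _reload(self, start):
        """Fill the half that begins at index start and mark its end."""
        block = self.stream.read(self.n)
        self.buf[start:start + len(block)] = block
        # After a full block this is the slot closing the half; after a
        # short block it marks the end of the input instead.
        self.buf[start + len(block)] = SENTINEL

    def next_char(self):
        """Return the next character, or None at end of input."""
        ch = self.buf[self.fwd]
        if ch != SENTINEL:           # common case: one test per character
            self.fwd += 1
            return ch
        if self.fwd == self.n:       # sentinel closing the 1st half
            self._reload(self.n + 1)
            self.fwd = self.n + 1
            return self.next_char()
        if self.fwd == 2 * self.n + 1:   # sentinel closing the 2nd half
            self._reload(0)
            self.fwd = 0
            return self.next_char()
        return None                  # sentinel inside a half: end of input
```

The position tests run only when the character under the forward pointer is the sentinel, which is the saving the slide describes.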

Specification of Tokens

Regular expressions are used for recognizing patterns. Let’s understand how a regular pattern is represented. First of all, a single character, taken as a one-element set such as {A}, is a regular language in itself.

Operations on Regular Languages

(1) Union
The letter ‘A’ is itself a regular language. Likewise ‘B’, ‘C’, and so on. Therefore, {A, B, C, …, Z} is regular, since it is the union of regular languages.

Say, L’ = {A, B, C, …, Z, a, b, c, …, z} /* all the letters */
and D’ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} /* all the digits */

Then the UNION L’ U D’ = {A, B, C, …, Z, a, b, c, …, z, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9} is also a regular language, representing the set of all letters and digits.

(2) Concatenation
L’ . D’ consists of each element of L’ concatenated with each element of D’, resulting in the set {A0, A1, A2, …, A9, B0, B1, B2, …, B9, …, z9}, which is also regular.
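Because L’ and D’ are finite, both operations can be tried out directly with sets of strings. This is an illustrative sketch; the names `L`, `D`, `union` and `concat` are chosen here and are not part of the slides.

```python
import string

L = set(string.ascii_letters)  # {A, ..., Z, a, ..., z}
D = set(string.digits)         # {0, ..., 9}


def union(a, b):
    """Every string that is in a or in b."""
    return a | b


def concat(a, b):
    """Every string formed by one element of a followed by one element of b."""
    return {x + y for x in a for y in b}
```

`union(L, D)` has 62 elements (52 letters plus 10 digits), and `concat(L, D)` has 520 (52 times 10), including `A0` and `z9`.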

(3) Exponent / Kleene Closure
L’ is regular. So the concatenation L’L’, which can be written as (L’)^2, is also regular. Similarly, (L’)^3, (L’)^4, (L’)^5, etc. are all regular. Also, {ɛ} is regular, and (L’)^0 = {ɛ}. Therefore,

(L’)^0 + (L’)^1 + (L’)^2 + (L’)^3 + (L’)^4 + …

that is, the union over any number of concatenations, is regular. This is called the Kleene closure, (L’)*.
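The closure itself is infinite, but its first few terms can be enumerated. The sketch below uses hypothetical helper names (`power`, `kleene_upto`) and a two-letter language so that the sets stay small; it computes (L’)^k and the union of the powers from 0 up to a chosen bound.

```python
def power(lang, k):
    """Return lang concatenated with itself k times; power(lang, 0) is {""}."""
    result = {""}  # the empty string plays the role of epsilon
    for _ in range(k):
        result = {x + y for x in result for y in lang}
    return result


def kleene_upto(lang, max_k):
    """Union of lang^0 .. lang^max_k, a finite prefix of the Kleene closure."""
    out = set()
    for k in range(max_k + 1):
        out |= power(lang, k)
    return out
```

For the language {a, b}, the powers 0, 1 and 2 contain 1, 2 and 4 strings, so `kleene_upto({"a", "b"}, 2)` contains 7 strings.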

(4) Transpose / Reverse
The reverse of a regular string is also regular. The set of reverses of all strings in a regular language is also regular. So, {A0, A1, A2} being regular, {0A, 1A, 2A} is also regular.

Regular Expressions

A regular language is represented by a regular expression. As mentioned before, a single character is a regular expression. Like regular languages, union, concatenation, Kleene closure and reversal of a regular expression result in regular expressions. So, if R1 and R2 are two regular expressions, then the following are also regular expressions:

R1 + R2 (alternatively, R1 | R2)
R1 . R2
R1*

and any combination of the above three.

Regular Expressions Example: Identifiers in Pascal

An identifier starts with a letter, followed by any number of letters and digits in any order.

Letter → A | B | C | … | Z | a | b | c | … | z
Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Id → Letter . (Letter | Digit)*

Note:
(1) Letter and Digit have to be defined before Id.
(2) Character class notation can also be used to declare Letter and Digit:

Letter → [A-Za-z]
Digit → [0-9]
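The definition of Id translates almost symbol for symbol into Python's `re` syntax. This is a sketch for illustration; the names `LETTER`, `DIGIT`, `ID` and `is_identifier` are chosen here, and concatenation is written by juxtaposition instead of “.”.

```python
import re

LETTER = r"[A-Za-z]"
DIGIT = r"[0-9]"
# Id -> Letter . (Letter | Digit)*
ID = re.compile(rf"{LETTER}(?:{LETTER}|{DIGIT})*")


def is_identifier(s):
    """True if the whole string s matches the Id definition."""
    return ID.fullmatch(s) is not None
```

`fullmatch` is used so that the entire string has to fit the pattern: “xval” and “newval5” are accepted, while “9xval” is rejected.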

Regular Expressions Example: Floating point numbers

Rule: digits, optionally followed by a “.” and more digits, optionally followed by E and a signed or unsigned integer.

Digit → [0-9]
Digits → Digit . (Digit)*
Optional_Fraction → “.” Digits | ɛ
Optional_Exponent → (E (+ | - | ɛ) Digits) | ɛ
Floating_point_number → Digits . Optional_Fraction . Optional_Exponent
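The same definitions can be composed as Python `re` fragments. This is a sketch under the definitions above; the constant names and `is_number` are chosen here for illustration. An “X | ɛ” alternative is written as an optional group, `(?:X)?`.

```python
import re

DIGITS = r"[0-9]+"                           # Digits -> Digit . (Digit)*
OPTIONAL_FRACTION = rf"(?:\.{DIGITS})?"      # "." Digits | epsilon
OPTIONAL_EXPONENT = rf"(?:E[+-]?{DIGITS})?"  # E (+ | - | epsilon) Digits | epsilon
NUMBER = re.compile(DIGITS + OPTIONAL_FRACTION + OPTIONAL_EXPONENT)


def is_number(s):
    """True if the whole string s matches Floating_point_number."""
    return NUMBER.fullmatch(s) is not None
```

Because both the fraction and the exponent are optional, a plain run of digits such as 12 also matches this definition, while “12.” and “.5” do not.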