Compiler Design · 2017-07-31

Compiler Design Lexical Analysis

What is Lexical Analysis

[Figure: source program -> lexical analyzer -> parser. The parser issues "get next token" and the lexical analyzer returns a token. A symbol table accompanies both.]

It is the phase where the compiler reads the source text from the device.

Reading has to proceed character by character, but buffering helps.


It detects valid "tokens", which are equivalent to "words" in a text. For example, from the following code segment:

If xval <= yval then result := yval

the lexical analyzer will detect the following unbreakable meaningful components:

"if", "xval", "<=", "yval", "then", "result", ":=", "yval"


It will check whether these meaningful components, which are like words, are valid from the point of view of the given language. If a component is valid according to the language, then the lexical analyzer will determine its type, or token. In the example, tokens are identified as follows:

"if" == keyword
"then" == keyword
"<=" == relational operator
":=" == assignment operator
"xval" == identifier
"yval" == identifier
"result" == identifier


Question: What is a "valid" token?

Answer: There is a set of rules that define valid tokens. For example, the rule for an identifier is usually "starts with a letter, followed by any combination of letters and digits, or by nothing". So "xval" is a valid identifier. But if we write "9xval", it won't be a valid identifier. In fact, it will generally not be a valid token of any kind: not a keyword, an operator or a numeric value.
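The classification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword set, the operator list and the function names `classify` and `tokenize` are hypothetical, and keywords are matched case-insensitively because the example writes "If".

```python
import re

# Assumed for illustration: only the keywords and operators from the example.
KEYWORDS = {"if", "then"}
RELATIONAL = {"<=", ">=", "<", ">", "="}
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")  # letter, then letters/digits
# A lexeme is an operator, a run of letters/digits, or any other single symbol.
LEXEME = re.compile(r"\s*(:=|<=|>=|<|>|=|[A-Za-z0-9]+|\S)")

def classify(lexeme):
    """Return the token type of one lexeme, or "invalid" if no rule fits."""
    if lexeme.lower() in KEYWORDS:
        return "keyword"
    if lexeme == ":=":
        return "assignment operator"
    if lexeme in RELATIONAL:
        return "relational operator"
    if IDENTIFIER.fullmatch(lexeme):
        return "identifier"
    if lexeme.isdigit():
        return "number"
    return "invalid"

def tokenize(text):
    """Split text into (lexeme, token type) pairs."""
    return [(m.group(1), classify(m.group(1))) for m in LEXEME.finditer(text)]

for lexeme, kind in tokenize("If xval <= yval then result := yval"):
    print(f"{lexeme!r} == {kind}")
print(tokenize("9xval"))
```

A real lexical analyzer would normally be driven by a finite automaton rather than by a chain of `if` tests, but the outcome for these inputs is the same: "xval" is an identifier and "9xval" matches no rule.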

Valid Tokens

Valid operators: fixed strings "=", "<", "<=", ">", ">="

Valid numeric: there is a pattern

• Starts with a digit
• Followed by more digits
• There may be a decimal point
• Digits after the decimal point
• Then there may be an "E" (exponent) followed by a signed or unsigned integer

Examples: 12.34 12.34E10 12.34E-5

File Reading is overhead

This is the most expensive phase of the compiler because it reads the text from a device, which makes it input-intensive, even though the characters are then processed one by one to match set patterns.

It is better to read a block (say 1024 bytes) at a time and place it in a buffer, and afterwards process from the buffer.

Why? Every read involves a system call, and therefore a context switch. One system call per 1024 characters saves more time than one system call per character.

File Reading is overhead

Without buffering, the following happens in a loop for each character:
1. Read a character from the disk (system call)
2. Compare with a pattern (user mode)
3. Change state (user mode)
4. Go to 1

So for every input character read there is a system call. The context changes from user to system, and after the read it changes back from system to user.

A context change is an overhead, and every input character incurs two of them. If we could read, say, 1024 characters at once through a single system call, the number of context switches would drop by a factor of 1024.
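The arithmetic behind this claim can be checked with a small sketch. It assumes the simplified cost model used on these slides (exactly two context switches per read system call); the function name `context_switches` is hypothetical.

```python
import math

def context_switches(num_chars, block_size):
    """Context switches needed to read num_chars, block_size characters per call.

    Assumes the slide's model: each read is one system call, and each system
    call costs two context switches (user -> OS, then OS -> user).
    """
    system_calls = math.ceil(num_chars / block_size)
    return 2 * system_calls

print(context_switches(1024, 1))     # one character per system call
print(context_switches(1024, 1024))  # one 1024-character block per system call
```

Under this model, reading 1024 characters one at a time costs 2048 context switches, while a single block read costs 2.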

Buffering

Lexical analyzer tasks: loop over steps 1 and 2 until end of file.
1. Read a character from disk. THIS IS A SYSTEM CALL. The CPU changes context to the OS, saving the PCB of the user process.
2. Match pattern. Here the CPU goes back to user mode, changing context again and retrieving the PCB of the user process.

So, if there is a file of 10,000 characters, context changes will take place 20,000 times.

[Figure: timeline with two lanes, User Process and OS. Every "Read a character" crosses into the OS with a context switch and returns with another context switch before each "Pattern match".]

2048 context switches for 1024 characters!

Buffering

[Figure: the same two lanes. A single "Read 1024 characters" crosses into the OS and back, followed by "Pattern match" in the user process.]

Only 2 context switches instead of 2048.

Buffering

How it happens using a buffer

Buffer contents: newval5=oldval*12

[Figure sequence: Base Pointer (BP) and Forward Pointer (FP) both start at "n". BP stays there while FP advances one character per step, so the characters scanned so far grow as n, ne, new, newv, newva, newval, newval5, and finally FP reads "=".]

Retract; Return (Gettoken)

The string between BP and FP is the next token (if it has adhered to any rule).

BP is sprung forward to FP; now the next token, "=", will be found.
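The pointer movement can be sketched as follows. This is a simplified illustration, not the slides' exact algorithm: it assumes the whole input is already in one buffer, it treats any run of letters and digits as one lexeme, and the names `next_token` and `scan` are hypothetical.

```python
def next_token(buffer, bp):
    """Return (lexeme, new_bp) for the lexeme that starts at index bp."""
    fp = bp
    while fp < len(buffer):
        ch = buffer[fp]   # read the character under FP
        fp += 1           # FP moves forward
        if not ch.isalnum():
            fp -= 1       # retract: ch belongs to the next token
            break
    if fp == bp:          # nothing consumed: a single-character operator
        fp += 1
    return buffer[bp:fp], fp  # the string between BP and FP

def scan(buffer):
    """Collect all lexemes, springing BP forward to FP after each one."""
    lexemes, bp = [], 0
    while bp < len(buffer):
        lexeme, bp = next_token(buffer, bp)
        lexemes.append(lexeme)
    return lexemes

print(scan("newval5=oldval*12"))
```

The first call stops when FP reads "=", retracts by one position and returns "newval5"; the second call then starts with BP on "=".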

Buffering

Input characters are read in blocks and put in a buffer (an array in main memory). The characters are scanned one by one. It follows that when the last character in the buffer has been read and processed, the buffer needs to be reloaded.

Consider a scenario where the buffer ends before the identifier oldval is complete:

……newval = old

[Figure: Base Pointer at "o", Forward Pointer at "d", the last character in the buffer.]

Buffering

The next block has to be read into the buffer. As a result, the current buffer is overwritten:

val * 12 ……

[Figure: Forward Pointer at the start of the reloaded buffer; Base Pointer left behind.]

Therefore, the previous content of the buffer is lost. BP no longer points to the earlier content.

Buffering

Way out: two buffers, or a split buffer, reloaded alternately.

1. Initially, both buffers are empty. BP and FP are at the start of the first buffer.

2. Read the first block into the first buffer:

   | ……newval = old |              |

3. Keep on scanning and processing till FP reaches the last character. That means the lexical analyzer is inside the last potential token, and BP marks where it starts.

4. Read the next block and reload the second buffer. The first buffer is left untouched:

   | ……newval = old | val * 12 …… |

5. Move the forward pointer by one character. It therefore enters the second buffer. Continue processing as if it were a single buffer.

Buffering

Necessity of buffering (A)

To read a block of characters at a time, thus reducing context switching time and I/O time as well. Otherwise, the process would be:

Loop: I/O request to read next character /* System call */
      execute processing logic /* User mode */
      go back to Loop

There would be 1024 system calls for reading 1024 characters, and the disk controller would be requested separately for each character.

With block reads and buffering, one system call and one I/O request are issued per block of characters, for example per 1024 characters.

Buffering

Necessity of buffering (B)

Sometimes the lexical analyzer has to look ahead in order to identify a token.

Example: there are two similar-looking FORTRAN statements with different meanings.

(i) DO 5 I = 1.25 /* DO5I is a variable. Set its value to 1.25 */
                  /* FORTRAN allows spaces in variable names */
(ii) DO 5 I = 1,25 /* DO loop: execute the statements up to label 5 for I = 1 to 25 */

[Figure: BP at the start of the statement and FP ahead of it, labelled "Look ahead".]

The only difference between the two is the "." versus "," between 1 and 25. The lexical analyzer has to read forward (look ahead) with FP to detect the presence of a ",". After that, FP goes back to where the current token ends.
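The look-ahead decision can be illustrated with a deliberately simplified sketch. It assumes that a comma anywhere after the "=" signals a DO loop, which ignores cases such as commas inside parentheses; the function name `classify_do` is hypothetical and this is not how a production FORTRAN scanner is written.

```python
def classify_do(statement):
    """Guess whether a statement beginning with DO is a loop or an assignment."""
    text = statement.replace(" ", "")  # blanks are not significant here
    if not text.startswith("DO") or "=" not in text:
        return "other"
    # Look ahead: inspect everything after the "=" before committing.
    after_equals = text.split("=", 1)[1]
    return "do-loop" if "," in after_equals else "assignment"

print(classify_do("DO 5 I = 1.25"))  # DO5I is a variable
print(classify_do("DO 5 I = 1,25"))  # DO is a keyword
```

The point of the sketch is that the first two characters, "DO", cannot be classified until characters far to their right have been examined, which is why those characters must still be available in the buffer.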

Buffering

Limitation of Buffer Pair

(1) The look-ahead character may be beyond the end of the buffer, as in PL/1:

DECLARE (ARG1, ARG2, …, ARGn)

To determine whether DECLARE is an array or a keyword, the analyzer has to look ahead till the closing ")".

(2) With every character scanned, the lexical analyzer has to check whether it is at the end of a block. Since there are two buffers, it has to check twice: whether it is at the end of buffer 1 or of buffer 2. The algorithm and its alternative are furnished in the following two slides.

Buffering

Algorithm

With every character read, the program has to check for the end of a buffer: first the 1st half and, if that is not the end, then the 2nd half.

Loop:
  If FWD = end of 1st half (YES):
    reload 2nd half
    FWD := FWD + 1
  Else if FWD = end of 2nd half (YES):
    reload 1st half
    move FWD to the beginning of the 1st half
  Else (NO to both):
    FWD := FWD + 1
  Back to loop

Buffering

Alternative (more efficient): use a sentinel value for the end of buffer. Put a $ at the end of each buffer half.

[Figure: the two buffer halves, each ending in $.]

Loop:
  If the character at FWD = $ (YES):
    If FWD = end of 2nd half (YES):
      reload 1st half
      move FWD to the beginning of the 1st half
    Else (NO, so it must be the end of the 1st half):
      reload 2nd half
      advance FWD into the 2nd half
  Else (NO):
    continue scanning
  Back to loop

Though it looks like two checks, the second check happens only when FWD = $, which happens once for each buffer. If each buffer holds 1024 characters, the second check is skipped 1023 times out of 1024.
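The sentinel scheme can be sketched as below. Several things are assumptions made for the illustration: the half size `N` is 8 instead of 1024 so the reloads are easy to see, "$" is assumed never to occur in the source text, a sentinel found inside a half is taken to mean end of input (the slides do not cover end of input), and the names `BufferPair`, `next_char` and `read_all` are hypothetical. Lexeme recovery with BP is left out.

```python
import io

N = 8           # characters per half (1024 on the slides)
SENTINEL = "$"  # assumed not to appear in the source text

class BufferPair:
    def __init__(self, stream):
        self.stream = stream
        # Layout: half 1 at [0, N), sentinel at N,
        #         half 2 at [N+1, 2N+1), sentinel at 2N+1.
        self.buf = [SENTINEL] * (2 * N + 2)
        self.reloads = 0
        self._load(0)
        self.fwd = 0

    def _load(self, start):
        """Read one block into the half beginning at index start."""
        block = self.stream.read(N)
        self.reloads += 1
        for i in range(N):
            # A short block is padded with sentinels, which marks end of input.
            self.buf[start + i] = block[i] if i < len(block) else SENTINEL

    def next_char(self):
        ch = self.buf[self.fwd]
        if ch != SENTINEL:           # common case: a single comparison
            self.fwd += 1
            return ch
        if self.fwd == 2 * N + 1:    # sentinel at the end of the 2nd half
            self._load(0)
            self.fwd = 0
        elif self.fwd == N:          # it must be the end of the 1st half
            self._load(N + 1)
            self.fwd = N + 1
        else:
            return None              # sentinel inside a half: end of input
        return self.next_char()

def read_all(text):
    """Return the characters delivered by the buffer pair and the reload count."""
    reader = BufferPair(io.StringIO(text))
    chars = []
    while (ch := reader.next_char()) is not None:
        chars.append(ch)
    return "".join(chars), reader.reloads

print(read_all("newval = oldval * 12"))
```

The position tests run only after a "$" has been seen, so for ordinary characters the cost is one comparison per character, as the slide argues.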

Specification of Tokens

Regular expressions are used for recognizing patterns. Let's understand how a regular pattern is represented. First of all, a single character is a regular language in itself.

Operations on Regular Languages

(1) Union
The letter 'A' is itself a regular language. Likewise 'B', and also 'C', and so on. Therefore {A, B, C, …, Z} is regular, since it is the union of regular languages.

Say L' = {A, B, C, …, Z, a, b, c, …, z} /* all the letters */
and D' = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} /* all the digits */

Then UNION: L' U D' = {A, B, C, …, Z, a, b, c, …, z, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9} is also a regular language, representing the set of all letters and digits.

(2) Concatenation
L'.D' consists of each element of L' concatenated with each element of D', resulting in the set {A0, A1, A2, …, A9, B0, B1, B2, …, B9, …, z9}, which is also regular.
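Because both languages here are finite, union and concatenation can be tried out directly with Python sets. This is only a small illustration; the variable names are hypothetical.

```python
import string

L = set(string.ascii_letters)  # L': all the letters, A-Z and a-z
D = set(string.digits)         # D': all the digits, 0-9

# Union: every string that is in L' or in D'.
union = L | D

# Concatenation: each element of L' followed by each element of D'.
concatenation = {letter + digit for letter in L for digit in D}

print(len(union), len(concatenation))
print(sorted(concatenation)[:5])
```

The union has 52 + 10 = 62 elements and the concatenation has 52 × 10 = 520, since every letter is paired with every digit.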


(3) Exponent / Kleene Closure
L' is regular. So the concatenation L'L', which can be written as (L')^2, is also regular. Similarly, (L')^3, (L')^4, (L')^5, etc. are all regular. Also, {ɛ} is regular and (L')^0 = {ɛ}. Therefore the union (L')^0 + (L')^1 + (L')^2 + (L')^3 + (L')^4 + …, which allows any number of elements, is regular. This is called the Kleene closure, (L')*.
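The closure itself is infinite, but its first few layers can be enumerated. The sketch below builds the powers of a small language and their union up to a chosen bound; the function names `power` and `closure_up_to` and the two-letter example language are assumptions made for the illustration.

```python
from itertools import product

def power(language, k):
    """(L)^k: every concatenation of k strings drawn from the language."""
    return {"".join(parts) for parts in product(language, repeat=k)}

def closure_up_to(language, max_k):
    """Union of (L)^0 through (L)^max_k, a finite slice of the Kleene closure."""
    result = set()
    for k in range(max_k + 1):
        result |= power(language, k)
    return result

small = {"a", "b"}
print(power(small, 0))  # only the empty string
print(sorted(power(small, 2)))
print(len(closure_up_to(small, 3)))
```

For a two-symbol language the layers have 1, 2, 4 and 8 strings, so the union up to the third power has 15 elements.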


(4) Transpose / Reverse
The reverse of a regular string is also regular. The set of reverses of all strings in a regular language is also regular. So, {A0, A1, A2} being regular, {0A, 1A, 2A} is also regular.


Regular Expressions

A regular language is represented by a regular expression. As mentioned before, a single character is a regular expression. As with regular languages, the union, concatenation, Kleene closure and reversal of regular expressions result in regular expressions.

So, if R1 and R2 are two regular expressions, then the following are also regular expressions:

R1 + R2 (alternatively, R1 | R2)
R1.R2
R1*

and any combination of the above three.


Regular Expressions

Example: identifiers in Pascal

An identifier starts with a letter, followed by any number of letters and digits in any order.

Letter -> A | B | C | … | Z | a | b | c | … | z
Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Id -> Letter . (Letter | Digit)*

Note:
(1) Letter and Digit have to be defined before Id.
(2) Character class notation can also be used to declare Letter and Digit:

Letter -> [A-Za-z]
Digit -> [0-9]
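The same definition can be written with Python's `re` module, building Id from Letter and Digit in the order the note requires. The helper name `is_identifier` is hypothetical.

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"
# Id = Letter . (Letter | Digit)*
ID = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")

def is_identifier(text):
    """True if the whole string matches the Id definition."""
    return ID.fullmatch(text) is not None

for candidate in ["xval", "newval5", "9xval"]:
    print(candidate, is_identifier(candidate))
```

`fullmatch` is used so that the entire string has to fit the pattern; with a plain prefix match, "xval+" would be accepted because it starts with a valid identifier.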


Regular Expressions

Example: floating point numbers

Rule: digits, optionally followed by a "." and more digits, optionally followed by E and a signed or unsigned run of digits.

Digit -> [0-9]
Digits -> Digit . (Digit)*
Optional_Fraction -> "." Digits | ɛ
Optional_Exponent -> (E (+ | - | ɛ) Digits) | ɛ
Floating_point_number -> Digits . Optional_Fraction . Optional_Exponent
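These definitions translate almost one for one into Python's `re` syntax, where an "| ɛ" alternative becomes an optional group. The helper name `is_float` is hypothetical, and only an uppercase "E" is accepted because that is what the rule states.

```python
import re

DIGIT = "[0-9]"
DIGITS = f"{DIGIT}{DIGIT}*"                # Digit . (Digit)*
OPTIONAL_FRACTION = rf"(\.{DIGITS})?"      # "." Digits | epsilon
OPTIONAL_EXPONENT = rf"(E[+-]?{DIGITS})?"  # E (+ | - | epsilon) Digits | epsilon
FLOAT = re.compile(DIGITS + OPTIONAL_FRACTION + OPTIONAL_EXPONENT)

def is_float(text):
    """True if the whole string matches Floating_point_number."""
    return FLOAT.fullmatch(text) is not None

for candidate in ["12.34", "12.34E10", "12.34E-5", "12.", ".5"]:
    print(candidate, is_float(candidate))
```

Because both the fraction and the exponent are optional, a plain integer such as "12" also matches this definition, while "12." and ".5" do not.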
