cfg


CmSc 365 Theory of Computation

Context-free Languages, Context-free Grammars

(Chapter 3, Section 3.1)

FSAs recognize regular languages - they are devices that accept strings belonging to

some regular language.

FSAs can be viewed as generators of regular languages, assuming that the transition from

one state to another depends on the current state, and each time a transition is performed,

an output symbol is delivered.

FSAs are not the only devices that can recognize or generate languages. People are also

such "devices". We recognize quite well whether a sentence belongs to a given language

or not. We can also generate sentences (strings) belonging to a given language.

FSAs have one limitation - they deal with regular languages only, and we know that there

are other languages - non-regular. We would like to have a more powerful device -

capable of recognizing languages that are beyond the limits of FSAs.

We saw that in order to build an FSA to accept a given language, we used the language

representation as a regular expression. If we want to build a device that recognizes

non-regular languages, we also need a formal mechanism for representing such

languages.

Here we shall discuss a formal mechanism (formalism) to represent a class of languages,

called context-free languages. Context-free languages include the regular languages,

but they also contain languages that are not regular. The formalism is called

context-free grammars.

1. Grammars

Formal systems that describe languages by means of rules are called grammar

formalisms.

The basic idea of grammar formalisms is to capture the structure of strings by

a. using special symbols to stand for substrings of a particular structure

b. using rules to specify how the substrings are combined to form new

substrings.

The special symbols are called non-terminal symbols, while the letters in the alphabet

are called terminal symbols.

The rules are called grammar rules.

Regular languages can be described by regular expressions. They can also be described

by grammar rules, and the grammars are called regular grammars (see the examples


below). There are languages that are not regular, and for their description we need a

different representation.

Example 1 (a very simple example)

Consider the language generated by the regular expression: a*

It consists of the empty string and strings containing any number of a's

a is the terminal symbol.

Now we introduce one non-terminal symbol S to represent any string in the

language. S may be the empty string, so our first rule would be:

S → e

S may also consist of a leading a, followed by any number of a's including the

empty string. This is represented by the second rule:

S → aS

Thus our grammar, let's call it G1, will have two rules, and will generate the language L1

containing only strings of a's and the empty string.

Example 2 (still very simple)

Let's consider now the language L2 represented by the regular expression a* U b*

It contains all the strings in the first example, so we will write their representation, using

a non-terminal symbol A instead of S:

A → e

A → aA

Here we have also b*, and we can use another non-terminal symbol B to represent the

strings of b's:

B → e

B → bB

Any string in the language can be A or B, and to represent this we introduce another

non-terminal symbol S (stands for "string" or "sentence"):

S → A

S → B

Thus our grammar G2 will have six rules:

S → A (1)

S → B (2)

A → e (3)

A → aA (4)

B → e (5)

B → bB (6)


Rules (1) and (2) describe the structure of any string in the language.

Rules (3) and (4) describe strings of a's only, and rules (5) and (6) describe strings of b's

only.

We also say that rules (3) and (4) describe the structure of the non-terminal symbol A,

while rules (5) and (6) describe the non-terminal symbol B. These rules are recursive:

in each pair, the first rule gives the minimal (or starting) element, and the second

gives the recursive relation.

Note that in the rule A → aA, the A on the left does not stand for the same string as the

A on the right-hand side: the string derived from the left A contains one more a than

the string derived from the right A.

Example 3 (the one in the textbook)

Let L3 be the language generated by a(a* U b*)b

L3 consists of strings starting with an a, and ending with a b, with something in the

middle. Thus we can write:

S → aMb

The middle part is exactly the language L2 from the previous example, so we can just add

the rules (note that instead of S in G2, we will use M)

G3:

S → aMb (1)

M → A (2)

M → B (3)

A → e (4)

A → aA (5)

B → e (6)

B → bB (7)

Another way to write the same grammar, with one non-terminal symbol less, is:

S → aAb (1)

S → aBb (2)

A → e (3)

A → aA (4)

B → e (5)

B → bB (6)

As we shall see, one and the same context-free language can be represented by several

context-free grammars (just as one and the same regular language can

be accepted by several FSAs).
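
The claim that two grammars can generate the same language can be checked on small cases. The following is a minimal sketch, not taken from the textbook: it encodes both versions of G3 as lists of (left side, right side) pairs, writes e as the empty Python string, and uses a hypothetical helper words_up_to to collect every terminal string of bounded length. Agreement up to a bound is evidence of equivalence, not a proof.

```python
# Both versions of G3 from Example 3; "" stands for the empty string e.
G3_WITH_M = [("S", "aMb"), ("M", "A"), ("M", "B"),
             ("A", ""), ("A", "aA"), ("B", ""), ("B", "bB")]
G3_WITHOUT_M = [("S", "aAb"), ("S", "aBb"),
                ("A", ""), ("A", "aA"), ("B", ""), ("B", "bB")]

def words_up_to(rules, max_len, start="S"):
    """Collect all terminal strings of length <= max_len derivable from start.

    Assumes single-character symbols and that every non-terminal appears
    on the left side of some rule (true for both grammars above).
    """
    nonterminals = {lhs for lhs, _ in rules}
    seen, stack, words = {start}, [start], set()
    while stack:
        form = stack.pop()
        positions = [i for i, c in enumerate(form) if c in nonterminals]
        if not positions:
            words.add(form)  # only terminal symbols are left
            continue
        i = positions[0]  # always expand the leftmost non-terminal
        for lhs, rhs in rules:
            if lhs == form[i]:
                new = form[:i] + rhs + form[i + 1:]
                terminals = sum(c not in nonterminals for c in new)
                if terminals <= max_len and new not in seen:
                    seen.add(new)
                    stack.append(new)
    return words

print(sorted(words_up_to(G3_WITH_M, 4)))  # ['aaab', 'aab', 'ab', 'abb', 'abbb']
print(words_up_to(G3_WITH_M, 4) == words_up_to(G3_WITHOUT_M, 4))  # True
```

The search stops because no rule in these grammars removes a terminal symbol, so any string with more than max_len terminals can be discarded.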


Example 4

Consider now the language L4, represented by the regular expression a(aUb)*b

The first rule will be:

S → aMb (1)

Now we have to see how to elaborate M. M is any string consisting of a's and b's,

including the empty string. The empty string would be our first rule for M:

M → e

If not empty, any M would start either with a or with b, and the remaining part will still

be any combination of a's and b's, including the empty string.

M → aM

M → bM

The resulting grammar G4 would be:

S → aMb (1)

M → e (2)

M → aM (3)

M → bM (4)

Here M is defined by means of three rules: (2) is the "terminating rule", which

describes M without using any non-terminal symbol, and (3) and (4) are the recursive rules,

which describe how M is constructed.

2. Rule interpretation

2.1. If we want to generate a string

We build a sequence of strings (containing non-terminal and/or terminal symbols)

starting with the initial symbol S and aiming at obtaining a string of terminal symbols.

Let us generate a string in L4 (Example 4). We have the grammar:

S → aMb (1)

M → e (2)

M → aM (3)

M → bM (4)

We start from the start symbol, which is usually S.

Rule (1) says: S can be rewritten as aMb.

Thus we build the sequence: S => aMb


Then in the right side we look for non-terminal symbols, and rules that have those

non-terminal symbols in their left side. Here we have M , and three rules that expand M.

If we apply Rule (2), M → e, we get the string ab, consisting of terminal symbols only.

Thus we have generated a string in L4.

If instead of Rule (2) we choose Rule (3), we get aaMb. If next we choose Rule

(4), we get aabMb. If now we choose Rule (2), we get aabb, a string in L4.

Thus one sequence of strings built in the generation process is:

S => aMb => ab

Another sequence is:

S => aMb => aaMb => aabMb => aabb

This sequence is called a derivation, to be defined formally a little later.
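
The generation process above can be replayed mechanically. The following is a minimal sketch under stated assumptions: the rule numbers are those of G4, e is written as the empty Python string, and the helper names apply_rule and derive are hypothetical. Each step rewrites the leftmost occurrence of the rule's left side.

```python
# Rules of G4 from Example 4, keyed by rule number; "" stands for e.
RULES = {1: ("S", "aMb"), 2: ("M", ""), 3: ("M", "aM"), 4: ("M", "bM")}

def apply_rule(form, rule_number):
    """Rewrite the leftmost occurrence of the rule's left side in form."""
    lhs, rhs = RULES[rule_number]
    if lhs not in form:
        raise ValueError(f"rule ({rule_number}) does not apply to {form!r}")
    return form.replace(lhs, rhs, 1)

def derive(rule_numbers, start="S"):
    """Return every string produced while applying the rules in order."""
    forms = [start]
    for number in rule_numbers:
        forms.append(apply_rule(forms[-1], number))
    return forms

print(" => ".join(derive([1, 2])))        # S => aMb => ab
print(" => ".join(derive([1, 3, 4, 2])))  # S => aMb => aaMb => aabMb => aabb
```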

2.2. If we want to recognize a string

Let's say we want to see if abab belongs to L4.

We examine the right sides of the rules, trying to find a sequence of rules that will

take us from the string to the non-terminal symbol S. The process is called parsing, and

we'll discuss it later.
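
Until parsing is introduced, membership in L4 can at least be tested by brute force. The sketch below is not a parsing algorithm and is not the method discussed later: it searches forward from S through all derivable strings and discards any string with more terminal symbols than the target. The names successors and generates are hypothetical. The search is guaranteed to stop for G4 because no rule of G4 removes terminal symbols and every derived string holds at most one non-terminal symbol; other grammars may need a different bound.

```python
from collections import deque

# Rules of G4 from Example 4; "" stands for the empty string e.
RULES = [("S", "aMb"), ("M", ""), ("M", "aM"), ("M", "bM")]
NONTERMINALS = {"S", "M"}

def successors(form):
    """Yield every string obtained from form by applying one rule once."""
    for i, symbol in enumerate(form):
        for lhs, rhs in RULES:
            if symbol == lhs:
                yield form[:i] + rhs + form[i + 1:]

def generates(target, start="S"):
    """Breadth-first search from start for the terminal string target."""
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for nxt in successors(form):
            terminals = sum(1 for c in nxt if c not in NONTERMINALS)
            if terminals <= len(target) and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(generates("abab"))  # True
print(generates("ba"))    # False
```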

3. Context-Free Grammars (CFGs)

3.1. Definition: A Context-Free Grammar is a quadruple G = (V, ∑, R, S) where:

V is an alphabet (the alphabet of the grammar - contains both the terminal and

non-terminal symbols)

∑ is the set of terminal symbols (the alphabet of the language), ∑ ⊆ V

R is the set of rules: R ⊆ (V - ∑) x V*

R is a subset of the Cartesian product of the non-terminal symbols and

strings in V*.

For example, a rule of the form S → aMb

can be written as (S, aMb) -an element of (V - ∑ ) x V*.

S is in V - ∑ , aMb is in V*

S, the start symbol, is a designated non-terminal symbol from V - ∑

If we have:

A ∈ (V - ∑) (a non-terminal symbol), u ∈ V*, (A, u) ∈ R

we write A →_G u
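
The definition can be made concrete by writing G4 as a quadruple and checking the three conditions. The following is a minimal sketch, assuming single-character symbols and the empty Python string for e; the function name is_cfg is hypothetical.

```python
# G4 from Example 4 written as a quadruple (V, SIGMA, R, START).
V = {"a", "b", "S", "M"}
SIGMA = {"a", "b"}
R = {("S", "aMb"), ("M", ""), ("M", "aM"), ("M", "bM")}
START = "S"

def is_cfg(v, sigma, rules, start):
    """Check sigma is a subset of v, start is a non-terminal,
    and every rule lies in (v - sigma) x v*."""
    nonterminals = v - sigma
    return (
        sigma <= v
        and start in nonterminals
        and all(
            lhs in nonterminals and all(c in v for c in rhs)
            for lhs, rhs in rules
        )
    )

print(is_cfg(V, SIGMA, R, START))                      # True
print(is_cfg(V, SIGMA, R | {("bbM", "bbaM")}, START))  # False
```

The second call fails because the left side bbM is a string, not a single non-terminal symbol from V - ∑.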


3.2. Definition: For any two strings u and v in V*,

we say that v is immediately derived from u in G, written as u =>_G v, iff:

a. there exist two strings x and y in V* and a non-terminal symbol A in V - ∑

such that u = xAy;

b. there exists a rule A →_G v';

c. v = xv'y.

For example in G4 we have

aMb => aaMb, since there is a rule M → aM

aMb => ab, since there is a rule M → e
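
Definition 3.2 translates almost word for word into a check. The sketch below is illustrative only: it uses the rules of G4, writes e as the empty Python string, and the function name immediately_derives is hypothetical. It tries every split u = xAy and every rule A → v', and tests whether v = xv'y.

```python
# Rules of G4 from Example 4; "" stands for the empty string e.
R = [("S", "aMb"), ("M", ""), ("M", "aM"), ("M", "bM")]

def immediately_derives(u, v, rules=R):
    """True iff u = xAy, A -> v' is a rule, and v = x v' y."""
    for i in range(len(u)):
        x, a, y = u[:i], u[i], u[i + 1:]
        for lhs, v_prime in rules:
            if lhs == a and x + v_prime + y == v:
                return True
    return False

print(immediately_derives("aMb", "aaMb"))  # True, by M -> aM
print(immediately_derives("aMb", "ab"))    # True, by M -> e
print(immediately_derives("aMb", "aabb"))  # False, needs more than one step
```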

The reflexive transitive closure of the relation "immediately derived from" => is

=>* , pronounced "derived from". It is described in the next definition.

3.3. Definition: For any two strings u and v in V*, we say that

v is derived from u in G, written as u =>*_G v, iff

u = v or there exist strings v_1, v_2, …, v_{n-1} in V* such that

u => v_1 => v_2 => … => v_{n-1} => v.

The sequence u => v_1 => v_2 => … => v_{n-1} => v is called a derivation.

3.4. Definition: Context-free language: any language, generated by a context-free

grammar

Given a context-free grammar G, the language generated by G,

L(G) = {w ∈ ∑* | S =>*_G w}, is a context-free language.

4. Context-free grammars and regular grammars

The difference between grammars lies in the format of their rules.

All rules in a regular grammar can be represented in the form (right-regular or right-linear

grammars):

A → w ( a string of terminal symbols)

A → wB (terminal symbols followed by a non-terminal symbol)

Or in the form (left-regular or left-linear grammars):

A → w ( a string of terminal symbols)

A → Bw (a non-terminal symbol followed by terminal symbols)

Context-free grammar rules have the form:


A → a

A → A1 A2 … An

where A is a non-terminal symbol, a is a terminal symbol or e, and A1, …, An can be terminal

or non-terminal symbols.

Note, that all regular grammars are also context-free grammars, however there are

context-free grammars that are not regular grammars.
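
The rule formats can be tested directly. The following is a minimal sketch for the right-linear case only, assuming single-character symbols and the empty Python string for e; the function name is_right_linear is hypothetical. A rule passes if its right side holds no non-terminal symbol except possibly in the last position.

```python
def is_right_linear(rules, nonterminals):
    """True iff every rule has the form A -> w or A -> wB."""
    for lhs, rhs in rules:
        if lhs not in nonterminals:
            return False
        # Drop a trailing non-terminal, then require terminals only.
        body = rhs[:-1] if rhs and rhs[-1] in nonterminals else rhs
        if any(c in nonterminals for c in body):
            return False
    return True

G1 = [("S", ""), ("S", "aS")]                               # Example 1
G4 = [("S", "aMb"), ("M", ""), ("M", "aM"), ("M", "bM")]    # Example 4

print(is_right_linear(G1, {"S"}))       # True
print(is_right_linear(G4, {"S", "M"}))  # False: S -> aMb has M in the middle
```

Note that G4 fails the test even though L4 is regular: the test is about the shape of the rules of one particular grammar, not about the language it generates.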

The set of CF languages contains the set of regular languages, thus each regular

language is a context-free language too. This can be shown directly by giving a grammar

definition of the FSAs (see Example 3.1.5 on p. 119)

There are, however, CF languages that are not regular, and we have seen an

example: the language L = {a^n b^n | n ≥ 1}. It is a CF language and can be represented by a

CFG in the following way:

S → ab

S → aSb
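
A quick check that these two rules produce exactly the strings a^n b^n for small n is sketched below. It is an illustration under stated assumptions, not a proof: the function name language_up_to is hypothetical, and the search is cut off once a string holds more than max_len terminal symbols, which is safe here because both rules only add terminals.

```python
# The two rules given above for {a^n b^n | n >= 1}.
RULES = [("S", "ab"), ("S", "aSb")]

def language_up_to(max_len, start="S"):
    """Collect every terminal string of length <= max_len derivable from S."""
    words, frontier = set(), {start}
    while frontier:
        next_frontier = set()
        for form in frontier:
            if "S" not in form:
                words.add(form)  # derivation finished
                continue
            i = form.index("S")
            for _lhs, rhs in RULES:
                new = form[:i] + rhs + form[i + 1:]
                if len(new.replace("S", "")) <= max_len:
                    next_frontier.add(new)
        frontier = next_frontier
    return words

expected = {"a" * n + "b" * n for n in range(1, 4)}
print(language_up_to(6) == expected)  # True
```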

5. Why context-free

Each rule says that the non-terminal symbol on the left can be replaced by the string on the

right side. Nothing is said about the surrounding context of the non-terminal symbol. A

rule that is not context-free would be:

bbM → bbaM

It says that M can be replaced by aM when it is preceded by two b's.

6. Some applications

Natural language processing, compilers.


Problems:

3.1.1. Consider the grammar G = (V, ∑, R, S), where

V = { a,b,S,A}

∑ = { a,b}

R = { S → AA (1)

A → AAA (2)

A → a (3)

A → bA (4)

A → Ab } (5)

a. Which strings of L(G) can be produced by derivations of 4 or fewer steps?

S => AA => aA => aa

S => AA => bAA => bAa => baa

S => AA => AAb => aAb => aab

S => AA => AbA => abA => aba

…..

b. Give at least four distinct derivations for the string babbab

S => AA => bAA => baA => babA => babbA => babbAb => babbab

(1) (4) (3) (4) (4) (5) (3)

S => AA => bAA => bAbA => bAbbA => bAbbAb => babbAb => babbab

(1) (4) (4) (5) (5) (3) (3)

S => AA => bAA => bAbA => bAbbA => babbA => babbAb => babbab

(1) (4) (4) (5) (3) (5) (3)

S => AA => AbA => AbAb => bAbAb => bAbbAb => bAbbab => babbab

(1) (4) (5) (4) (4) (3) (3)

c. For any m, n, p > 0, describe a derivation in G of the string b^m a b^n a b^p

S => AA

=>^m b^m AA

=>^n b^m A b^n A

=>^p b^m A b^n A b^p

=> b^m a b^n A b^p

=> b^m a b^n a b^p
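
The derivation scheme in part c can be checked for concrete values of m, n and p. The sketch below is an illustration, with hypothetical helper names one_step and derivation: it builds the intermediate strings in the order used above (rule (4) applied m times to the first A, rule (4) applied n times to the second A, rule (5) applied p times to the second A, then rule (3) twice) and verifies that each string follows from the previous one by a single rule of G.

```python
# Rules (1)-(5) of the grammar G in Problem 3.1.1.
RULES = [("S", "AA"), ("A", "AAA"), ("A", "a"), ("A", "bA"), ("A", "Ab")]

def one_step(u, v):
    """True if v follows from u by rewriting one non-terminal with one rule."""
    return any(
        u[i] == lhs and u[:i] + rhs + u[i + 1:] == v
        for i in range(len(u))
        for lhs, rhs in RULES
    )

def derivation(m, n, p):
    """List the strings of the derivation of b^m a b^n a b^p."""
    forms = ["S", "AA"]
    forms += ["b" * k + "AA" for k in range(1, m + 1)]
    forms += ["b" * m + "A" + "b" * k + "A" for k in range(1, n + 1)]
    forms += ["b" * m + "A" + "b" * n + "A" + "b" * k for k in range(1, p + 1)]
    forms.append("b" * m + "a" + "b" * n + "A" + "b" * p)
    forms.append("b" * m + "a" + "b" * n + "a" + "b" * p)
    return forms

steps = derivation(1, 2, 1)
print(all(one_step(u, v) for u, v in zip(steps, steps[1:])))  # True
print(" => ".join(steps))
# S => AA => bAA => bAbA => bAbbA => bAbbAb => babbAb => babbab
```

With m = 1, n = 2, p = 1 the result is the string babbab from part b, reached in m + n + p + 3 = 7 steps.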
