CmSc 365 Theory of Computation
Context-free Languages, Context-free Grammars
(Chapter 3, Section 3.1)
FSAs recognize regular languages - they are devices that accept strings belonging to
some regular language.
FSAs can also be viewed as generators of regular languages, assuming that the transition from
one state to another depends on the current state, and that each time a transition is performed
an output symbol is delivered.
FSAs are not the only devices that can recognize or generate languages. People are also
such "devices". We recognize quite well whether a sentence belongs to a given language
or not. We can also generate sentences (strings) belonging to a given language.
FSAs have one limitation - they deal with regular languages only, and we know that there
are other languages - non-regular. We would like to have a more powerful device -
capable of recognizing languages that are beyond the limits of FSAs.
We saw that in order to build an FSA to accept a given language, we used the language
representation as a regular expression. If we want to build a device that recognizes
non-regular languages, we also need a formal mechanism for representing such
languages.
Here we shall discuss a formal mechanism (formalism) to represent a class of languages
called context-free languages. Context-free languages include the regular languages,
but they also contain languages that are not regular. The formalism is called
context-free grammars.
1. Grammars
The formal systems, used to describe languages by means of rules, are called grammar
formalisms.
The basic idea of grammar formalisms is to capture the structure of strings by
a. using special symbols to stand for substrings of a particular structure
b. using rules to specify how the substrings are combined to form new
substrings.
The special symbols are called non-terminal symbols, while the letters in the alphabet
are called terminal symbols.
The rules are called grammar rules.
Regular languages can be described by regular expressions. They can also be described
by grammar rules, and the grammars are called regular grammars (see the examples
below). There are languages that are not regular, and for their description we need a
different representation.
Example 1 (a very simple example)
Consider the language generated by the regular expression: a*
It consists of the empty string and strings containing any number of a's.
a is the only terminal symbol.
Now we introduce one non-terminal symbol S to represent any string in the
language. S may be the empty string (written e), so our first rule would be:
S → e
S may also consist of a leading a, followed by any number of a's including the
empty string. This is represented by the second rule:
S → aS
Thus our grammar, let's call it G1, will have two rules, and will generate the language L1
containing only strings of a's and the empty string.
Example 2 (still very simple)
Let's consider now the language L2 represented by the regular expression a* U b*
It contains all the strings in the first example, so we will write their representation, using
a non-terminal symbol A instead of S:
A → e
A → aA
Here we have also b*, and we can use another non-terminal symbol B to represent the
strings of b's:
B → e
B → bB
Any string in the language can be A or B, and to represent this we introduce another non-
terminal symbol S (stands for "string" or "sentence"):
S → A
S → B
Thus our grammar G2 will have six rules:
S → A (1)
S → B (2)
A → e (3)
A → aA (4)
B → e (5)
B → bB (6)
Rules (1) and (2) describe the structure of any string in the language.
Rules (3) and (4) describe strings of a's only, and rules (5) and (6) describe strings of b's
only.
We also say that rules (3) and (4) describe the structure of the non-terminal symbol A,
while rules (5) and (6) describe the non-terminal symbol B. These rules are recursive:
in each pair, the first rule gives the minimal (or starting) element, and the second gives
the recursive relation.
Note that in the rule A → aA, the A on the left-hand side and the A on the right-hand
side stand for different strings: the one on the left contains one more a than the one on the right.
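The recursive reading of these rules can be illustrated with a small Python sketch. It is not part of the course material: the function names matches_A, matches_B and matches_S are hypothetical, and the empty string e is represented by the empty Python string "". Each function mirrors one pair of rules, with the base case for the rule producing e and the recursive call for the rule that adds one symbol.

```python
def matches_A(s):
    """Check whether s has the structure described by A -> e and A -> aA."""
    if s == "":
        return True                              # rule A -> e
    return s[0] == "a" and matches_A(s[1:])      # rule A -> aA

def matches_B(s):
    """Check whether s has the structure described by B -> e and B -> bB."""
    if s == "":
        return True                              # rule B -> e
    return s[0] == "b" and matches_B(s[1:])      # rule B -> bB

def matches_S(s):
    """Rules S -> A and S -> B: a string of L2 is either an A or a B."""
    return matches_A(s) or matches_B(s)

print(matches_S("aaa"), matches_S("bb"), matches_S("ab"))
```

In the call matches_A(s[1:]), the argument is one a shorter than s, which is the sense in which the A on the left of A → aA contains one more a than the A on the right.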
Example 3 (the one in the textbook)
Let L3 be the language generated by a(a* U b*)b
L3 consists of strings starting with an a, and ending with a b, with something in the
middle. Thus we can write:
S → aMb
The middle part is exactly the language L2 from the previous example, so we can just add
the rules (note that instead of S in G2, we will use M)
G3:
S → aMb (1)
M → A (2)
M → B (3)
A → e (4)
A → aA (5)
B → e (6)
B → bB (7)
Another way to write the same grammar, with one non-terminal symbol less, is:
S → aAb (1)
S → aBb (2)
A → e (3)
A → aA (4)
B → e (5)
B → bB (6)
As we shall see, one and the same context-free language can be represented by several
context-free grammars (just as one and the same regular language can
be accepted by several FSAs).
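As a rough check that the two versions of G3 describe the same language, the following Python sketch enumerates all strings of bounded length that each grammar generates and compares the two sets. The helper words_up_to and the variable names are assumptions made for this illustration, e is represented by "", and the search relies on the fact that no rule of these grammars deletes a terminal symbol. Agreement up to a length bound is evidence of equivalence, not a proof.

```python
def words_up_to(rules, nonterminals, max_length, start="S"):
    """Collect every terminal string of length <= max_length derivable from start."""
    words, seen, stack = set(), {start}, [start]
    while stack:
        form = stack.pop()
        # Position of the leftmost non-terminal, or None if form is a word.
        position = next(
            (i for i, ch in enumerate(form) if ch in nonterminals), None
        )
        if position is None:
            words.add(form)
            continue
        for left, right in rules:
            if left != form[position]:
                continue
            new_form = form[:position] + right + form[position + 1:]
            terminals = sum(1 for ch in new_form if ch not in nonterminals)
            # Terminals are never removed, so overlong forms are dead ends.
            if terminals <= max_length and new_form not in seen:
                seen.add(new_form)
                stack.append(new_form)
    return words

# First version of G3 (seven rules, uses M).
G3_WITH_M = [("S", "aMb"), ("M", "A"), ("M", "B"),
             ("A", ""), ("A", "aA"), ("B", ""), ("B", "bB")]
# Second version of G3 (six rules, no M).
G3_WITHOUT_M = [("S", "aAb"), ("S", "aBb"),
                ("A", ""), ("A", "aA"), ("B", ""), ("B", "bB")]

first = words_up_to(G3_WITH_M, set("SMAB"), 5)
second = words_up_to(G3_WITHOUT_M, set("SAB"), 5)
print(first == second, sorted(first, key=lambda w: (len(w), w)))
```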
Example 4
Consider now the language L4, represented by the regular expression a(aUb)*b
The first rule will be:
S → aMb (1)
Now we have to see how to elaborate M. M is any string consisting of a's and b's,
including the empty string. The empty string gives our first rule for M:
M → e
If M is not empty, it starts either with a or with b, and the remaining part will still
be any combination of a's and b's, including the empty string.
M → aM
M → bM
The resulting grammar G4 would be:
S → aMb (1)
M → e (2)
M → aM (3)
M → bM (4)
Here M is defined by means of three rules: (2) is the "terminating rule",
describing M by means of terminal symbols only (here, the empty string), while (3) and (4) are the
recursive rules, describing how M is constructed.
2. Rule interpretation
2.1. If we want to generate a string
We build a sequence of strings (containing non-terminal and/or terminal symbols)
starting with the initial symbol S and aiming at obtaining a string of terminal symbols.
Let us generate a string in L4 (Example 4) We have the grammar:
S → aMb (1)
M → e (2)
M → aM (3)
M → bM (4)
We start from the starting non-terminal symbol, usually this is S.
Rule (1) says: S can be rewritten as aMb
Thus we build the sequence: S => aMb
Then, in the string just obtained, we look for non-terminal symbols and for rules that have
those non-terminal symbols on their left side. Here we have M, and three rules that expand M.
If we apply Rule (2), M → e, we get the string ab, consisting of terminal symbols only.
Thus we have generated a string in L4.
If instead of Rule (2) we choose Rule (3), we get aaMb. If next we choose Rule
(4), we get aabMb. If now we choose Rule (2), we get aabb, a string in L4.
Thus one sequence of strings built in the generation process is:
S => aMb => ab
Another sequence is:
S => aMb => aaMb => aabMb => aabb
This sequence is called a derivation, to be defined formally a little later.
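The generation process can be replayed mechanically. The following Python sketch is an illustration only: the names G4_RULES, apply_rule and derive are hypothetical, the empty string e is represented by "", and each rule is applied to the leftmost occurrence of its left-side symbol (in G4 a string never contains more than one non-terminal, so this choice does not matter).

```python
# Grammar G4 from Example 4, keyed by the rule numbers used in the text.
G4_RULES = {
    1: ("S", "aMb"),
    2: ("M", ""),      # M -> e
    3: ("M", "aM"),
    4: ("M", "bM"),
}

def apply_rule(form, rule_number, rules=G4_RULES):
    """Rewrite the leftmost occurrence of the rule's left side in form."""
    left, right = rules[rule_number]
    position = form.find(left)
    if position < 0:
        raise ValueError(f"rule ({rule_number}) does not apply to {form!r}")
    return form[:position] + right + form[position + len(left):]

def derive(rule_numbers, start="S"):
    """Return the sequence of strings obtained by applying the rules in order."""
    forms = [start]
    for number in rule_numbers:
        forms.append(apply_rule(forms[-1], number))
    return forms

print(" => ".join(derive([1, 2])))
print(" => ".join(derive([1, 3, 4, 2])))
```

Calling derive with the rule numbers 1, 3, 4, 2 reproduces the second sequence above, and 1, 2 reproduces the first.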
2.2. If we want to recognize a string
Let's say we want to see if abab belongs to L4.
We examine the right sides of the rules, trying to find a sequence of rules that will
take us from the string to the non-terminal symbol S. The process is called parsing, and
we'll discuss it later.
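Before parsing is introduced, the membership question can already be answered for a small grammar such as G4 by brute force. The sketch below is not the parsing procedure mentioned above: it is an assumed, naive alternative that searches forward from S (breadth first) instead of working backward from the string. The names RULES, NONTERMINALS and find_derivation are hypothetical, and the search terminates because no rule of G4 removes a terminal symbol.

```python
from collections import deque

RULES = [("S", "aMb"), ("M", ""), ("M", "aM"), ("M", "bM")]   # G4, with e as ""
NONTERMINALS = {"S", "M"}

def find_derivation(target, start="S"):
    """Breadth-first search for a derivation of target; None if there is none."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        form = path[-1]
        if form == target:
            return path
        for left, right in RULES:
            position = form.find(left)
            if position < 0:
                continue
            new_form = form[:position] + right + form[position + len(left):]
            terminals = sum(1 for ch in new_form if ch not in NONTERMINALS)
            # Terminals are never removed, so longer forms cannot reach target.
            if terminals <= len(target) and new_form not in seen:
                seen.add(new_form)
                queue.append(path + [new_form])
    return None

print(find_derivation("abab"))
print(find_derivation("ba"))
```

For abab the search returns a derivation, so abab belongs to L4; for ba it returns None.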
3. Context-Free Grammars (CFGs)
3.1. Definition: A Context-Free Grammar is a quadruple G = (V, ∑, R, S) where:
V is an alphabet (the alphabet of the grammar - contains both the terminal and
non-terminal symbols)
∑ is the set of terminal symbols (the alphabet of the language), ∑ ⊆ V
R is the set of rules: R ⊆ (V - ∑) x V*
R is a finite subset of the Cartesian product of the non-terminal symbols and
strings in V*.
For example, a rule of the form S → aMb
can be written as (S, aMb) - an element of (V - ∑) x V*.
S is in V - ∑, aMb is in V*
S - the start symbol, is a designated non-terminal symbol from V - ∑
If we have:
A ∈ (V - ∑) (a non-terminal symbol), u ∈ V*, (A, u) ∈ R
we write A →_G u
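The quadruple can be written down directly as a data structure. The following Python sketch is one possible encoding, not a standard one: the class name Grammar, the field name sigma and the method names are assumptions, symbols are single characters, and e is represented by "". The method is_well_formed checks the conditions stated in the definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    V: frozenset       # all symbols, terminal and non-terminal
    sigma: frozenset   # terminal symbols
    R: frozenset       # rules as pairs (A, u), with u a string over V
    S: str             # start symbol

    def nonterminals(self):
        """Return V - sigma."""
        return self.V - self.sigma

    def is_well_formed(self):
        """Check sigma is a subset of V, S is in V - sigma, R is in (V - sigma) x V*."""
        return (
            self.sigma <= self.V
            and self.S in self.nonterminals()
            and all(
                A in self.nonterminals() and all(ch in self.V for ch in u)
                for A, u in self.R
            )
        )

# G4 from Example 4 as a quadruple.
G4 = Grammar(
    V=frozenset("abSM"),
    sigma=frozenset("ab"),
    R=frozenset({("S", "aMb"), ("M", ""), ("M", "aM"), ("M", "bM")}),
    S="S",
)
print(G4.is_well_formed(), sorted(G4.nonterminals()))
```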
3.2. Definition: For any two strings u and v in V*,
we say that v is immediately derived from u in G, written as u =>_G v, iff:
a. there exist two strings x and y in V* and a non-terminal symbol A in V - ∑
such that u = xAy;
b. there exists a rule A →_G v';
c. v = xv'y.
For example in G4 we have
aMb => aaMb, since there is a rule M → aM
aMb => ab, since there is a rule M → e
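The relation "immediately derived from" translates almost word for word into code. In the Python sketch below (hypothetical names, single-character symbols, e written as ""), the variables x, A, y and v_prime play the same roles as x, A, y and v' in the definition.

```python
RULES = [("S", "aMb"), ("M", ""), ("M", "aM"), ("M", "bM")]   # G4

def immediate_derivations(u, rules=RULES):
    """Return the set of all v such that u => v in one step."""
    results = set()
    for A, v_prime in rules:                # a rule A -> v'
        for i, symbol in enumerate(u):
            if symbol == A:                 # u = xAy
                x, y = u[:i], u[i + 1:]
                results.add(x + v_prime + y)    # v = xv'y
    return results

print(sorted(immediate_derivations("aMb")))
```

For u = aMb the result contains aaMb and ab, the two examples given above, together with abMb, which comes from the rule M → bM.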
The reflexive transitive closure of the relation "immediately derived from" (=>) is
=>*, pronounced "derived from". It is described in the next definition.
3.3. Definition: For any two strings u and v in V*, we say that
v is derived from u in G, written as u =>*_G v, iff either u = v or
there exist strings v1, v2, …, vn-1 in V* such that
u => v1 => v2 => … => vn-1 => v.
The sequence u => v1 => v2 => … => vn-1 => v is called a derivation.
3.4. Definition: Context-free language: any language generated by a context-free
grammar.
Given a context-free grammar G, the language generated by G,
L(G) = {w ∈ ∑* | S =>*_G w}, is a context-free language.
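For a small grammar, a finite part of L(G) can be computed by following all derivations from S and keeping the strings that consist of terminal symbols only. The Python sketch below does this for G4 up to a length bound and compares the result with the regular expression of Example 4. The names are hypothetical, e is written as "", and the length bound is sound only because no rule of G4 removes a terminal symbol; a grammar with other kinds of rules may need a different stopping condition.

```python
import re
from collections import deque

RULES = [("S", "aMb"), ("M", ""), ("M", "aM"), ("M", "bM")]   # G4
NONTERMINALS = {"S", "M"}

def language_up_to(max_length, start="S"):
    """Return every w in L(G4) with len(w) <= max_length."""
    words = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost non-terminal, or None if form is a word.
        position = next(
            (i for i, ch in enumerate(form) if ch in NONTERMINALS), None
        )
        if position is None:
            words.add(form)
            continue
        for left, right in RULES:
            if left != form[position]:
                continue
            new_form = form[:position] + right + form[position + 1:]
            terminals = sum(1 for ch in new_form if ch not in NONTERMINALS)
            if terminals <= max_length and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return words

words = language_up_to(4)
# Every generated word should match a(a U b)*b, written a[ab]*b in Python syntax.
agrees = all(re.fullmatch(r"a[ab]*b", w) for w in words)
print(sorted(words, key=lambda w: (len(w), w)), agrees)
```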
4. Context-free grammars and regular grammars
The difference between grammars lies in the format of their rules.
All rules in a regular grammar can be represented in the form (right-regular or right-linear
grammars):
A → w (a string of terminal symbols)
A → wB (terminal symbols followed by a non-terminal symbol)
Or in the form (left-regular or left-linear grammars):
A → w (a string of terminal symbols)
A → Bw (a non-terminal symbol followed by terminal symbols)
Context-free grammar rules have the form:
A → a
A→ A1 A2 … An
where A is a non-terminal symbol, a is a terminal symbol or e, and A1, …, An can be terminal
or non-terminal symbols.
Note that all regular grammars are also context-free grammars; however, there are
context-free grammars that are not regular grammars.
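Since the difference lies only in the format of the rules, it can be checked rule by rule. The Python sketch below tests the right-linear format (A → w or A → wB); the function name is_right_linear and the grammar encodings are assumptions for illustration, with e written as "" and single-character symbols.

```python
def is_right_linear(rules, nonterminals):
    """True if every rule has the form A -> w or A -> wB, with w terminals only."""
    for left, right in rules:
        if left not in nonterminals:
            return False
        # Drop one trailing non-terminal, if any; the rest must be terminals.
        body = right[:-1] if right and right[-1] in nonterminals else right
        if any(ch in nonterminals for ch in body):
            return False
    return True

G1 = [("S", ""), ("S", "aS")]                                 # Example 1
G4 = [("S", "aMb"), ("M", ""), ("M", "aM"), ("M", "bM")]      # Example 4

print(is_right_linear(G1, {"S"}))
print(is_right_linear(G4, {"S", "M"}))
```

G1 passes the check, while G4 does not, because in S → aMb the non-terminal M is followed by a terminal. G4 is therefore a context-free grammar that is not written in right-linear form, even though the language L4 it generates is regular.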
The set of CF languages contains the set of regular languages, thus each regular
language is a context-free language too. This can be shown directly by giving a grammar
definition of the FSAs (see Example 3.1.5 on p. 119)
There are, however, CF languages that are not regular, and we have seen an
example: the language L = {a^n b^n | n ≥ 1}. It is a CF language and can be represented by a
CFG in the following way:
S → ab
S → aSb
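To see why these two rules produce exactly matching numbers of a's and b's, it helps to write out a derivation: the rule S → aSb is applied n - 1 times, and then S → ab ends the derivation. The Python sketch below builds this derivation for a given n; the function name derive_anbn is hypothetical.

```python
def derive_anbn(n):
    """Return the derivation of a^n b^n as a list of strings, for n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    forms = ["S"]
    for _ in range(n - 1):
        forms.append(forms[-1].replace("S", "aSb"))   # rule S -> aSb
    forms.append(forms[-1].replace("S", "ab"))        # rule S -> ab
    return forms

print(" => ".join(derive_anbn(3)))
```

Each application of S → aSb adds one a on the left and one b on the right of S at the same time, so the two counts always agree.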
5. Why context-free
Each rule says that the non-terminal symbol on the left can be replaced by the string on the
right. Nothing is said about the context surrounding the non-terminal symbol. A
rule that is not context-free would be:
bbM → bbaM
It says that M can be replaced by aM when it is preceded by two b's.
6. Some applications
Natural language processing, compilers.
Problems:
3.1.1. Consider the grammar G = (V, ∑, R, S), where
V = { a,b,S,A}
∑ = { a,b}
R = { S → AA (1)
A → AAA (2)
A → a (3)
A → bA (4)
A → Ab } (5)
a. Which strings of L(G) can be produced by derivations of 4 or fewer steps?
S => AA => aA => aa
S => AA => bAA => bAa => baa
S => AA => AAb => aAb => aab
S => AA => AbA => abA => aba
…..
b. Give at least four distinct derivations for the string babbab
S => AA => bAA => baA => babA => babbA => babbAb => babbab
(1) (4) (3) (4) (4) (5) (3)
S => AA => bAA => bAbA => bAbbA => bAbbAb => babbAb => babbab
(1) (4) (4) (5) (5) (3) (3)
S => AA => bAA => bAbA => bAbbA => babbA => babbAb => babbab
(1) (4) (4) (5) (3) (5) (3)
S => AA => AbA => AbAb => bAbAb => bAbbAb => bAbbab => babbab
(1) (4) (5) (4) (4) (3) (3)
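The four derivations can be checked mechanically: every step must replace one non-terminal by the right side of one of the rules (1)-(5). The Python sketch below does this check; the names RULES, is_step, is_derivation and DERIVATIONS are assumptions made for this illustration.

```python
# Rules (1)-(5) of the grammar in Problem 3.1.1.
RULES = [("S", "AA"), ("A", "AAA"), ("A", "a"), ("A", "bA"), ("A", "Ab")]

def is_step(u, v, rules=RULES):
    """True if v is immediately derived from u by some rule."""
    for left, right in rules:
        for i, symbol in enumerate(u):
            if symbol == left and u[:i] + right + u[i + 1:] == v:
                return True
    return False

def is_derivation(forms):
    """True if every consecutive pair of strings is a valid step."""
    return all(is_step(u, v) for u, v in zip(forms, forms[1:]))

# The four derivations of babbab given in the answer to part b.
DERIVATIONS = [
    "S AA bAA baA babA babbA babbAb babbab".split(),
    "S AA bAA bAbA bAbbA bAbbAb babbAb babbab".split(),
    "S AA bAA bAbA bAbbA babbA babbAb babbab".split(),
    "S AA AbA AbAb bAbAb bAbbAb bAbbab babbab".split(),
]

print([is_derivation(d) for d in DERIVATIONS])
print(len({tuple(d) for d in DERIVATIONS}))   # number of distinct derivations
```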
c. For any m, n, p > 0, describe a derivation in G of the string b^m a b^n a b^p
S => AA
=>^m b^m AA
=>^n b^m A b^n A
=>^p b^m A b^n A b^p
=> b^m a b^n A b^p
=> b^m a b^n a b^p
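The derivation in part c can be spelled out step by step for concrete values of m, n and p. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, rule (4) A → bA is applied m times to the first A and n times to the second A, rule (5) A → Ab is applied p times to the second A, and rule (3) A → a is applied twice at the end. Each step is then checked against the rules of G.

```python
RULES = [("S", "AA"), ("A", "AAA"), ("A", "a"), ("A", "bA"), ("A", "Ab")]

def is_step(u, v, rules=RULES):
    """True if v is immediately derived from u by some rule."""
    return any(
        symbol == left and u[:i] + right + u[i + 1:] == v
        for left, right in rules
        for i, symbol in enumerate(u)
    )

def derivation(m, n, p):
    """Return the derivation of b^m a b^n a b^p as a list of strings."""
    forms = ["S", "AA"]                                    # rule (1)
    for i in range(1, m + 1):
        forms.append("b" * i + "AA")                       # rule (4), first A
    for j in range(1, n + 1):
        forms.append("b" * m + "A" + "b" * j + "A")        # rule (4), second A
    for k in range(1, p + 1):
        forms.append("b" * m + "A" + "b" * n + "A" + "b" * k)   # rule (5)
    forms.append("b" * m + "a" + "b" * n + "A" + "b" * p)  # rule (3), first A
    forms.append("b" * m + "a" + "b" * n + "a" + "b" * p)  # rule (3), second A
    return forms

steps = derivation(1, 2, 1)
print(" => ".join(steps))
print(all(is_step(u, v) for u, v in zip(steps, steps[1:])))
```

The derivation has m + n + p + 3 steps in total.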