Formal Languages & Grammars, continued

CS F331 Programming Languages
CSCE A331 Programming Language Concepts
Lecture Slides
Friday, January 20, 2017

Glenn G. Chappell
Department of Computer Science
University of Alaska Fairbanks

© 2017 Glenn G. Chappell



20 Jan 2017 CS F331 / CSCE A331 Spring 2017

Review
Course Overview

In this class, we study programming languages with a view toward the following.
• How programming languages are specified, and how these specifications are used.
• How certain features differ between various programming languages.
• Categories of programming languages.

You will need to obtain access to the following programming languages (all are freely available on the web).
• Lua. Version 5.3 or later.
• Haskell. Install The Haskell Platform.
• Forth. Get the GNU version: Gforth.
• Scheme. Install DrRacket.
• Prolog. Get the GNU version: gprolog.


Review
Introduction to Syntax & Semantics

“Dynamic” means at runtime.
“Static” means before runtime.

Syntax is the correct structure of code.
Semantics is the meaning of code.

In programming languages like C++ and Java, which do static type checking, type errors lie in a gray area between syntax and semantics. We classify these under static semantics. What code does when executed involves dynamic semantics.

Coming up:
• How the syntax of a programming language is specified.
• How such specifications are used.
• Homework: write a parser, which checks syntactic correctness.
• Later, a brief study of semantics.


Review
Formal Languages & Grammars: Formal Languages [1/4]

A formal language (often just language) is a set of strings.

The characters in these strings lie in some alphabet. We talk about a language over an alphabet.

When we study formal languages as abstract objects, we often write strings without quote marks. That makes it tricky to represent the empty string, so we denote the empty string with a lower-case Greek epsilon (ε).

"abc" becomes abc
"" becomes ε

Note: A formal language is not the same as a programming language. This unfortunate terminology is, alas, very standard.


Review
Formal Languages & Grammars: Formal Languages [2/4]

Here are some examples of formal languages.
• {abc, xyz, q}
• {ε, 01, 0101, 010101, 01010101, …}
    • The above is a language over the alphabet {0, 1}.
• The set of all syntactically correct C++ identifiers.
    • These are strings that contain only letters, digits, and underscores (“_”), begin with a letter or underscore, and are not one of the C++ reserved words (for, class, if, const, private, virtual, delete, friend, throw, this, etc.).
• The set of all syntactically correct Lua programs.
    • We do not normally think of a whole program as a string. But it is.

The last two examples above illustrate why we are talking about formal languages in a class on programming languages.


Review
Formal Languages & Grammars: Formal Languages [3/4]

How do we describe a formal language?

We can use a generator or a recognizer.

A generator is something that can produce the strings in a formal language (all of them, and nothing else).

A recognizer is a way of determining whether a given string lies in the formal language.
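
To make the distinction concrete, here is a minimal Python sketch of a generator and a recognizer for the earlier example language {ε, 01, 0101, 010101, …}. The function names are hypothetical, chosen only for this illustration; the empty Python string "" plays the role of ε.

```python
from itertools import count, islice

def generate_01_language():
    """Generator: yield every string in {ε, 01, 0101, ...}, shortest first."""
    for k in count(0):
        yield "01" * k

def recognize_01_language(s):
    """Recognizer: return True exactly when s is zero or more copies of "01"."""
    if len(s) % 2 != 0:
        return False
    return all(s[i:i + 2] == "01" for i in range(0, len(s), 2))

# The generator never finishes on its own, so take only the first few strings.
first_four = list(islice(generate_01_language(), 4))
```

The generator can list members of the language forever but cannot directly answer a membership question; the recognizer answers the membership question but does not list anything.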


Review
Formal Languages & Grammars: Formal Languages [4/4]

An important question, when we are dealing with a formal language: Given a string, does it lie in the language?

To answer this question, we need a recognizer.
• Every compiler has to contain a recognizer.

But it is usually easier to construct a generator.

A common technique: Write a generator, and then have a program use it to produce a recognizer automatically.
• Example: the program Yacc, which inputs a kind of generator called a grammar, and outputs C code for a recognizer.

Over the next few days, we will have a lot more to say about generators and recognizers.


Formal Languages & Grammars, continued
Grammars: Definitions [1/2]

A phrase-structure grammar (often just grammar) is one kind of language generator.

To write a grammar, we need a collection of terminal symbols. This is our alphabet.

We also need a collection of nonterminal symbols. These are like variables that eventually turn into something else. One nonterminal symbol is the start symbol.

Our conventions, for now, will be that lower-case letters are terminal symbols, upper-case letters are nonterminal symbols, and “S” is the start symbol.

Some terminal symbols: a b x
Some nonterminal symbols: C Q S (S is the start symbol)


Formal Languages & Grammars
Grammars: Definitions [2/2]

A grammar is a list of one or more productions. A production is a rule for altering strings by substituting one substring for another. The strings are made of terminal and nonterminal symbols.

Here is a grammar with four productions.

S → AB
A → c
B → Bd
B → ε

Epsilon (ε) is neither terminal nor nonterminal. It is not a symbol at all. Rather, it represents a string containing no symbols.


Formal Languages & Grammars
Grammars: Derivations [1/4]

Grammar
1. S → AB
2. A → c
3. B → Bd
4. B → ε

This is the same grammar. Productions are numbered, to make it easy to refer to them.

Here is what we do with a grammar.
• Begin with the start symbol.
• Repeatedly apply productions. To apply a production, replace the left-hand side of the production (which must be a contiguous collection of symbols in the current string) with the right-hand side.
• We can stop only when there are no more nonterminals.

The result is a derivation of the final string.
• Below is a derivation of cdd based on the above grammar.

S
AB
ABd
ABdd
Add
cdd


Formal Languages & Grammars
Grammars: Derivations [2/4]

Below are the same grammar and derivation. I have annotated the derivation to show what is happening.
• The number indicates which production is being used.
• The symbol named after each production number is the substring being replaced, which is the left-hand side of the production being used. (On the slide, this substring is underlined.)

Grammar
1. S → AB
2. A → c
3. B → Bd
4. B → ε

Derivation of cdd
S
AB      (production 1 replaces S)
ABd     (production 3 replaces B)
ABdd    (production 3 replaces B)
Add     (production 4 replaces B)
cdd     (production 2 replaces A)

Note the use of production 4. No “ε” appears in the derivation.
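
The derivation of cdd can be replayed mechanically. The following Python sketch is only an illustration: the names PRODUCTIONS, apply_production, and derive are hypothetical, the empty string "" stands for ε, and the choice to replace the leftmost occurrence of a left-hand side is an assumption (in this grammar each nonterminal appears at most once in any string, so the choice never matters).

```python
# The four productions, keyed by production number.
# Each value is (left-hand side, right-hand side); "" stands for ε.
PRODUCTIONS = {
    1: ("S", "AB"),
    2: ("A", "c"),
    3: ("B", "Bd"),
    4: ("B", ""),
}

def apply_production(current, number):
    """Replace the leftmost occurrence of the production's left-hand side."""
    lhs, rhs = PRODUCTIONS[number]
    if lhs not in current:
        raise ValueError(f"production {number} does not apply to {current!r}")
    return current.replace(lhs, rhs, 1)

def derive(numbers, start="S"):
    """Apply the numbered productions in order; return every intermediate string."""
    steps = [start]
    for number in numbers:
        steps.append(apply_production(steps[-1], number))
    return steps

# Productions 1, 3, 3, 4, 2, in the order used for the derivation of cdd.
derivation = derive([1, 3, 3, 4, 2])
```

Applying production 4 simply deletes the B, which is why no ε shows up in any intermediate string.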


Formal Languages & Grammars
Grammars: Derivations [3/4]

Grammar
1. S → AB
2. A → c
3. B → Bd
4. B → ε

Derivation of cdd
S
AB      (production 1)
ABd     (production 3)
ABdd    (production 3)
Add     (production 4)
cdd     (production 2)

Recall: A grammar is a kind of generator.
The language generated by a grammar consists of all strings for which there is a derivation.
• So “cdd” lies in the language generated by the above grammar.

Q. What language does this grammar generate?
A. All strings that consist of a single c followed by zero or more d’s.

{c, cd, cdd, cddd, cdddd, cddddd, …}
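
One way to check the answer experimentally is to search over all derivations and collect the strings that contain no nonterminals. The Python sketch below is an illustration under stated assumptions, not a general tool: the names are hypothetical, upper-case letters are nonterminals and lower-case letters are terminals (the convention used in these slides), and the search is cut off by a bound on the number of terminals. That cutoff is enough for this grammar, in which no string ever holds more than two nonterminals; a grammar such as S → SS would need an additional bound.

```python
from collections import deque

# The grammar as (left-hand side, right-hand side) pairs; "" stands for ε.
GRAMMAR = [("S", "AB"), ("A", "c"), ("B", "Bd"), ("B", "")]

def count_terminals(form):
    """Count the lower-case (terminal) symbols in a string."""
    return sum(1 for symbol in form if symbol.islower())

def generated_strings(grammar, max_length, start="S"):
    """Return all terminal-only strings of length <= max_length that have a derivation."""
    seen = {start}
    queue = deque([start])
    results = set()
    while queue:
        current = queue.popleft()
        if count_terminals(current) == len(current):
            results.add(current)  # no nonterminals left: a derived string
            continue
        for lhs, rhs in grammar:
            index = current.find(lhs)
            while index != -1:
                successor = current[:index] + rhs + current[index + len(lhs):]
                # Terminals are never removed by these productions, so a string
                # with too many terminals can be discarded.
                if count_terminals(successor) <= max_length and successor not in seen:
                    seen.add(successor)
                    queue.append(successor)
                index = current.find(lhs, index + 1)
    return results

language_sample = sorted(generated_strings(GRAMMAR, 3), key=len)
```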


Formal Languages & Grammars
Grammars: Derivations [4/4]

Here is another example, involving a different grammar.

Grammar
1. S → xSy
2. S → ε

Derivation of xxxyyy
S
xSy       (production 1)
xxSyy     (production 1)
xxxSyyy   (production 1)
xxxyyy    (production 2)

Q. What language does this grammar generate?
A. All strings that consist of zero or more x’s followed by the same number of y’s.

{ε, xy, xxyy, xxxyyy, xxxxyyyy, …}

Here is another way to write this language: { x^k y^k | k ≥ 0 }.
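
A recognizer for this language can follow the two productions directly. The Python sketch below is a hypothetical illustration (the function name matches_S is not from the course materials): it undoes production 1 by stripping a leading x and a trailing y, and accepts when what remains is the empty string, as allowed by production 2.

```python
def matches_S(s):
    """Return True exactly when s can be derived from S, i.e. s = x^k y^k for some k >= 0."""
    if s == "":
        return True  # production 2: S → ε
    if len(s) >= 2 and s[0] == "x" and s[-1] == "y":
        return matches_S(s[1:-1])  # production 1: S → xSy
    return False
```

For example, matches_S("xxxyyy") peels off three x/y pairs and then accepts the empty string, mirroring the derivation read backwards.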


Formal Languages & Grammars
Grammars: Applications [1/2]

As the name suggests, phrase-structure grammars were first used in linguistics, as a way of formalizing the grammar of a natural language (examples of natural languages: English, French, Arabic).
• The start symbol could represent a sentence.
• Various other nonterminals might represent things like subject, predicate, or prepositional phrase.
• The terminal symbols would be the words of the natural language.


Formal Languages & Grammars
Grammars: Applications [2/2]

In computing, the most important application of phrase-structure grammars is specifying programming-language syntax.
• The start symbol represents a program.
• Other nonterminals might represent things like statement, for loop, or class definition.
• Terminal symbols are usually the lexemes* (words, roughly) of the programming language.

Since the 1970s, almost every programming language has had its syntax specified using a grammar.

*We discuss lexemes in detail later in the semester. For now, here are some examples of lexemes in C++.
• Keywords: for class const return
• Identifiers: mergeSort17 ARRAY_SIZE x
• Literals: "Hello" -42 3.47e-12 true
• Operators: += << ! ::
• Punctuation: { } ;


Formal Languages & Grammars
The Chomsky Hierarchy: Introduction

In the late 1950s, linguist Noam Chomsky described a hierarchy of categories of formal languages, defined in terms of the kinds of grammars that could describe them. Chomsky was developing a framework for studying natural languages; however, his hierarchy has proved to be useful in the theory of computation.

Chomsky’s hierarchy includes four categories of languages. He called them types 3, 2, 1, and 0. More modern names are regular, context-free, context-sensitive, and computably enumerable.


Formal Languages & Grammars
The Chomsky Hierarchy: The Hierarchy [1/2]

Here is Chomsky’s hierarchy. For each language category (number and name), the generator and the recognizer are listed.

Type 3: Regular
  Generator: Grammar in which each production has one of the following forms.
    • A → ε
    • A → b
    • A → bC
    Another kind of generator: regular expressions (covered later).
  Recognizer: Finite State Machine
    Think: Program that uses a small, fixed amount of memory.

Type 2: Context-Free
  Generator: Grammar in which the left-hand side of each production consists of a single nonterminal.
    • A → [anything]
  Recognizer: Nondeterministic Push-Down Automaton
    Think: Finite State Machine + Stack (roughly).

Type 1: Context-Sensitive
  Generator: Don’t worry about it.
  Recognizer: Don’t worry about it.

Type 0: Computably Enumerable
  Generator: Grammar (no restrictions).
  Recognizer: Turing Machine
    Think: Computer Program


Formal Languages & Grammars
The Chomsky Hierarchy: The Hierarchy [2/2]

Each category of languages in the Chomsky Hierarchy is contained in the next. So every regular language is context-free, etc.

Regular Languages ⊂ Context-Free Languages ⊂ Context-Sensitive Languages ⊂ Computably Enumerable Languages

Next we look at each category in the Chomsky Hierarchy, how it is defined, and why we care about it.


Formal Languages & Grammars
The Chomsky Hierarchy: Why We Care [1/5]

A regular language is one that can be generated by a grammar in which each production has one of the following forms.
• A → ε
• A → b
• A → bC

Alternative generator: regular expression (covered later).

A regular language can be recognized by a finite state machine.
• Like a computer program with a small, fixed amount of memory.

Regular languages generally describe lexeme categories.
• The set of all legal C++ identifiers is a regular language.
• The set of all legal C++ floating-point literals is a regular language.

Thus, these languages encompass the level of computation required for lexical analysis: breaking a program into lexemes.

Regular languages are also involved in text search/replace.
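
As an illustration of recognition with a small, fixed amount of memory, here is a Python sketch of a finite state machine for the shape of a C++ identifier: letters, digits, and underscores, beginning with a letter or underscore. The state names and the function name are hypothetical. The sketch assumes ASCII letters only and deliberately omits the reserved-word check; since the reserved words form a finite list, excluding them keeps the language regular.

```python
import string

LETTERS_OR_UNDERSCORE = set(string.ascii_letters + "_")
DIGITS = set(string.digits)

def is_identifier_shape(s):
    """Run a three-state machine over s; the only memory used is the current state."""
    state = "START"
    for ch in s:
        if state == "START":
            # The first character must be a letter or underscore.
            state = "IN_ID" if ch in LETTERS_OR_UNDERSCORE else "REJECT"
        elif state == "IN_ID":
            # Later characters may also be digits.
            if ch not in LETTERS_OR_UNDERSCORE and ch not in DIGITS:
                state = "REJECT"
        else:
            break  # REJECT is a trap state; nothing can leave it
    return state == "IN_ID"
```

The amount of memory does not grow with the length of the input, which is the defining feature of this kind of recognizer.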


Formal Languages & Grammars
The Chomsky Hierarchy: Why We Care [2/5]

A context-free language is one that can be generated by a grammar in which the left-hand side of each production consists of a single nonterminal.
• A → [anything]

A context-free language can be recognized by a nondeterministic push-down automaton.
• Roughly: a finite state machine with a memory that acts as a stack.

Context-free languages generally describe programming-language syntactic correctness.
• The set of all syntactically correct Lua programs is a context-free language.

Thus, these languages encompass the level of computation required for parsing: determining whether a program is syntactically correct, and, if so, how it is structured.
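
The "finite state machine plus stack" idea can be sketched in Python for the context-free language { x^k y^k | k ≥ 0 } from the earlier grammar S → xSy, S → ε. This is a hypothetical illustration: the names are invented, and the machine shown is deterministic, which happens to be enough for this particular language even though the general recognizer for context-free languages is nondeterministic.

```python
def pda_accepts(s):
    """Accept x^k y^k using two states and a stack that remembers unmatched x's."""
    stack = []
    state = "READ_X"
    for ch in s:
        if state == "READ_X" and ch == "x":
            stack.append("x")  # remember one more unmatched x
        elif ch == "y" and stack:
            state = "READ_Y"   # once a y is seen, no further x is allowed
            stack.pop()        # match this y against a remembered x
        else:
            return False
    return not stack  # accept only if every x was matched
```

The stack is what a plain finite state machine lacks: it lets the recognizer count an unbounded number of x's.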


Formal Languages & Grammars
The Chomsky Hierarchy: Why We Care [3/5]

As for context-sensitive languages: we generally do not care.
• I would call this category a mistake: an idea that Chomsky thought would be fruitful, but turned out not to be.
• I mention this category only for historical interest. You do not need to know anything about context-sensitive languages.

In case anyone is interested: The kind of grammar that describes a context-sensitive language allows restricting the expansion of a nonterminal to a particular context. For example, such a grammar might include the following production.

xAy → xBcy

So A can be expanded to Bc, as long as it lies between x and y.

And the recognizer for a context-sensitive language is called a linear bounded automaton. Google it, if you wish. Or don’t.
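
For anyone curious, the effect of the production xAy → xBcy can be sketched in a few lines of Python. The function name and its parameters are hypothetical, and replacing only the leftmost match is an assumption made for the illustration.

```python
def expand_in_context(current, left="x", nonterminal="A", right="y", replacement="Bc"):
    """Apply left+nonterminal+right → left+replacement+right to the leftmost match, if any."""
    pattern = left + nonterminal + right
    return current.replace(pattern, left + replacement + right, 1)
```

With the defaults, "xAy" becomes "xBcy", while "zAy" is returned unchanged because A is not in the required context.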


Formal Languages & Grammars
The Chomsky Hierarchy: Why We Care [4/5]

A computably enumerable language is one that can be described by a grammar. We place no restrictions on the productions in the grammar.

The recognizer for a computably enumerable language is a Turing machine, which is a formalization of a computer program.

We care about computably enumerable languages because they encompass the things that computers can do.
• If a language is computably enumerable, then we can write a computer program that is a recognizer for it.
• If a language is not computably enumerable, then no such program exists.

Note. This kind of language is also called a recursively enumerable language. This terminology comes from a branch of mathematics called recursive function theory.


Formal Languages & Grammars
The Chomsky Hierarchy: Why We Care [5/5]

Summary
• A lexeme category (e.g., C++ identifiers) generally forms a regular language. Recognition of a regular language is thus the level of computation required for lexical analysis, and for text search/replace.
• The set of all syntactically correct programs (in your favorite programming language) generally forms a context-free language. Recognition of context-free languages is thus the level of computation required for parsing.
• Context-sensitive languages are mostly a historical curiosity.
• Recognition of computably enumerable languages encompasses the tasks that computers are capable of. These languages are important in the theory of computation.

Our next topic is Regular Languages, where we will cover ideas to be used in lexical analysis.

After that, we study Context-Free Languages, covering ideas to be used in parsing.
