Programming Languages: Chapter 2:
Formal Languages and Grammars
Formal languages
 what is a formal language?
 set of sentences (strings) over some alphabet
 legal strings = sentences
 how do we define a formal language?
 grammar (defines the syntax of a language)
 any more?
 syntax and semantics
 syntax refers to structure of language
 semantics refers to meaning of language
 previously, both syntax and semantics were described only intuitively
 now, well-defined formal systems are available
 there are finite and infinite languages
 is C an infinite language?
 most interesting languages are infinite
Progressive stages of sentence validity
candidate sentence   | lexically valid? | syntactically valid? | semantically valid?
Socrates is a mann.  | no               | no                   | no
Man Socrates is a.   | yes              | no                   | no
Man is a Socrates.   | yes              | yes                  | no
Socrates is a man.   | yes              | yes                  | yes
Regular languages and grammars
Finite Automata and Regular Expressions (ref. Randal Nelson and Tom LeBlanc,
University of Rochester)
 lexemes can be formally described by regular grammars
 . (any single character)
 * (0 or more of the previous expression)
 + (or: matches either of the surrounding expressions)
 shorthand notation:
 [a-z] (one of the characters in this range)
 [^a-z] (any character but one in this range)
 examples: [0-9] (any single digit), [0-9][0-9]* (one or more digits)
 regular grammars (also called linear grammars)
are generative devices for regular languages
 regular grammars define regular languages
 any finite language is regular
 sentences from regular languages are recognized using finite state
automata (FSA)
 finite state automaton to accept positive integers: [1-9][0-9]*
 identifiers in C: [_a-zA-Z][_a-zA-Z0-9]*
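The two regular expressions above can be checked directly. A minimal sketch (not from the notes) in Python's re syntax, where fullmatch plays the role of an FSA that accepts only complete strings:

```python
import re

# The two regular expressions above, in Python's re syntax.
# re.fullmatch succeeds only if the entire string matches, mimicking
# an FSA that accepts a complete sentence of the language.
positive_integer = re.compile(r"[1-9][0-9]*")
c_identifier = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")

assert positive_integer.fullmatch("132")
assert not positive_integer.fullmatch("012")  # no leading zero
assert c_identifier.fullmatch("_count1")
assert not c_identifier.fullmatch("1count")   # cannot start with a digit
```

Note that Python writes union as | where the notes use +.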
formal language        | defined by/generator  | model of computation/recognizer
regular language (RL)  | regular grammar (RG)  | finite state automata (FSA)
Context-free grammars (in Backus-Naur form) and context-free languages
Generating and Recognizing Recursive Descriptions of Patterns with Context-Free Grammars (ref. Randal Nelson and Tom LeBlanc, University of Rochester)
 stream of tokens must conform to a grammar
(must be arranged in a particular order)
 grammars define how sentences are constructed
 defined using a metalanguage notation called Backus-Naur form (BNF)
 John Backus @ IBM for Algol 58 (1977 ACM A.M. Turing Award winner)
 Noam Chomsky
 Peter Naur for Algol 60 (2005 ACM A.M. Turing Award winner)
 simple grammar for English sentences
(r_{1})   <sentence> ::= <article> <noun> <verb> <adverb> .
(r_{2})   <article>  ::= a
(r_{3})   <article>  ::= an
(r_{4})   <article>  ::= the
(r_{5})   <noun>     ::= dog
(r_{6})   <noun>     ::= cat
(r_{7})   <noun>     ::= Socrates
(r_{8})   <verb>     ::= runs
(r_{9})   <verb>     ::= jumps
(r_{10})  <adverb>   ::= slowly
(r_{11})  <adverb>   ::= fastly
 elements:
 grammar = set of production rules,
 start symbol (<sentence>),
 nonterminals (e.g., <noun>, <verb>)
 terminals (e.g., cat, runs, .)
Language generation and recognition
 what can we use grammars for?
 language generation
 apply the rules in a topdown fashion
 construct a derivation
 deriving sentences from the above grammar;
derive "the dog runs fastly" (=> means `derive')
<sentence>  => <article> <noun> <verb> <adverb> .  (r_{1}) 
 => <article> <noun> <verb> fastly .  (r_{11}) 
 => <article> <noun> runs fastly .  (r_{8}) 
 => <article> dog runs fastly .  (r_{5}) 
 => the dog runs fastly .  (r_{4}) 
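Derivation can be automated. A minimal sketch (not part of the notes): the English grammar above as a Python dict, with a generator that expands nonterminals top-down; any expansion order yields a grammatical sentence:

```python
import random

# The English grammar above, as rules mapping a nonterminal to its
# alternative right-hand sides.  Terminals are any symbols with no rule.
grammar = {
    "<sentence>": [["<article>", "<noun>", "<verb>", "<adverb>", "."]],
    "<article>": [["a"], ["an"], ["the"]],
    "<noun>": [["dog"], ["cat"], ["Socrates"]],
    "<verb>": [["runs"], ["jumps"]],
    "<adverb>": [["slowly"], ["fastly"]],
}

def generate(symbol="<sentence>"):
    """Derive a sentence by expanding nonterminals top-down."""
    if symbol not in grammar:          # terminal: emit as-is
        return [symbol]
    rhs = random.choice(grammar[symbol])
    words = []
    for s in rhs:
        words.extend(generate(s))
    return words

print(" ".join(generate()))  # e.g., "the dog runs fastly ."
```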
 grammar for simple arithmetic expressions for
a simple four-function calculator
(r_{1})   <expr>   ::= <expr> + <expr>
(r_{2})   <expr>   ::= <expr> - <expr>
(r_{3})   <expr>   ::= <expr> * <expr>
(r_{4})   <expr>   ::= <expr> / <expr>
(r_{5})   <expr>   ::= <id>
(r_{6})   <id>     ::= x | y | z
(r_{7})   <expr>   ::= ( <expr> )
(r_{8})   <expr>   ::= <number>
(r_{9})   <number> ::= <number> <digit>
(r_{10})  <number> ::= <digit>
(r_{11})  <digit>  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
 there are leftmost and rightmost derivations
 some derivations are neither
 sample derivations of 132:
 leftmost derivation:
<expr>  => <number>  (r_{8})
 => <number> <digit>  (r_{9})
 => <number> <digit> <digit>  (r_{9})
 => <digit> <digit> <digit>  (r_{10})
 => 1 <digit> <digit>  (r_{11})
 => 13 <digit>  (r_{11})
 => 132  (r_{11})
 rightmost derivation:
<expr>  => <number>  (r_{8})
 => <number> <digit>  (r_{9})
 => <number> 2  (r_{11})
 => <number> <digit> 2  (r_{9})
 => <number> 32  (r_{11})
 => <digit> 32  (r_{10})
 => 132  (r_{11})
 neither rightmost nor leftmost derivation:
<expr>  => <number>  (r_{8})
 => <number> <digit>  (r_{9})
 => <number> <digit> <digit>  (r_{9})
 => <number> <digit> 2  (r_{11})
 => <number> 32  (r_{11})
 => <digit> 32  (r_{10})
 => 132  (r_{11})
 neither rightmost nor leftmost derivation:
<expr>  => <number>  (r_{8})
 => <number> <digit>  (r_{9})
 => <number> <digit> <digit>  (r_{9})
 => <number> 3 <digit>  (r_{11})
 => <digit> 3 <digit>  (r_{10})
 => 13 <digit>  (r_{11})
 => 132  (r_{11})
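The leftmost derivation of 132 can be replayed mechanically. A sketch (the helper name rewrite_leftmost is invented here): each step replaces the leftmost occurrence of the named nonterminal with a chosen right-hand side:

```python
def rewrite_leftmost(form, lhs, rhs):
    """Replace the first (leftmost) occurrence of nonterminal lhs with rhs."""
    i = form.index(lhs)
    return form[:i] + rhs + form[i + 1:]

# The leftmost derivation of 132, one (rule, right-hand side) per step,
# using rules r_{8}-r_{11} of the calculator grammar above.
form = ["<expr>"]
steps = [
    ("<expr>",   ["<number>"]),                # r_{8}
    ("<number>", ["<number>", "<digit>"]),     # r_{9}
    ("<number>", ["<number>", "<digit>"]),     # r_{9}
    ("<number>", ["<digit>"]),                 # r_{10}
    ("<digit>",  ["1"]),                       # r_{11}
    ("<digit>",  ["3"]),                       # r_{11}
    ("<digit>",  ["2"]),                       # r_{11}
]
for lhs, rhs in steps:
    form = rewrite_leftmost(form, lhs, rhs)
    print(" ".join(form))                      # each sentential form

assert form == ["1", "3", "2"]
```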
 derive "x + y * z"
<expr>  => <expr> + <expr>  (r_{1})
 => <expr> + <expr> * <expr>  (r_{3})
 => <expr> + <expr> * <id>  (r_{5})
 => <expr> + <expr> * z  (r_{6})
 => <expr> + <id> * z  (r_{5})
 => <expr> + y * z  (r_{6})
 => <id> + y * z  (r_{5})
 => x + y * z  (r_{6})
 is a grammar a generative device or recognition device?
one of the seminal discoveries in computer science
 language recognition; do the reverse
generation: grammar → sentence
recognition: sentence → grammar
 let's parse x + y * z (do the reverse)
. x + y * z  (shift) 
x . + y * z  (reduce r_{6}) 
<id> . + y * z  (reduce r_{5}) 
<expr> . + y * z  (shift) 
<expr> + . y * z  (shift) 
<expr> + y . * z  (reduce r_{6}) 
<expr> + <id> . * z  (reduce r_{5}) 
<expr> + <expr> . * z  (shift) ← why not reduce r_{1} here instead? 
<expr> + <expr> * . z  (shift) 
<expr> + <expr> * z .  (reduce r_{6}) 
<expr> + <expr> * <id> .  (reduce r_{5}) 
<expr> + <expr> * <expr> .  (reduce r_{3}; emit multiplication) 
<expr> + <expr> .  (reduce r_{1}; emit addition) 
<expr> .  (start symbol...hurray! this is a valid sentence) 
 . (dot) denotes the top of the stack
 the rhs is called the handle
 called bottom-up or shift-reduce parsing
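A minimal shift-reduce recognizer, sketched here for illustration (not from the notes) and restricted to identifiers, + and *: it resolves the shift/reduce choice the desired way by shifting whenever the incoming operator binds tighter than the operator already on the stack.

```python
PREC = {"+": 1, "*": 2}   # higher number = tighter binding

def shift_reduce(tokens):
    """Recognize an expression over x, y, z, + and *; return True if valid."""
    stack, rest = [], list(tokens)
    while True:
        top3 = stack[-3:]
        can_reduce = (len(top3) == 3 and top3[0] == "<expr>"
                      and top3[1] in PREC and top3[2] == "<expr>")
        # Reduce only if no input remains, or the next operator does not
        # bind tighter than the one on the stack (shift on *, as desired).
        if can_reduce and (not rest or PREC.get(rest[0], 0) <= PREC[stack[-2]]):
            op = stack[-2]
            del stack[-3:]
            stack.append("<expr>")
            print(f"reduce <expr> {op} <expr> -> <expr>   (emit {op})")
        elif rest:
            tok = rest.pop(0)
            if tok in ("x", "y", "z"):
                stack.append("<expr>")   # apply r_{6} then r_{5} right away
                print(f"shift {tok}, reduce to <expr>")
            else:
                stack.append(tok)
                print(f"shift {tok}")
        else:
            break
    return stack == ["<expr>"]

assert shift_reduce(["x", "+", "y", "*", "z"])
```

On x + y * z this reduces the multiplication first and the addition last, matching the trace above.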
 construct a parse tree
 parse trees for "x + y * z"
the above parse exhibits a shift-reduce conflict
 if we shift, multiplication will have higher precedence (desired)
 if we reduce, addition will have higher precedence (undesired)
there is also a reduce-reduce conflict (though the above
parse does not have one); consider the following:
(r_{1})  <expr> ::= <term>
(r_{2})  <expr> ::= <id>
(r_{3})  <term> ::= <id>
(r_{4})  <id>   ::= x | y | z
let's parse x
. x  (shift)
x .  (reduce r_{4})
<id> . ← reduce r_{2} or r_{3} here?
parse trees for "x"
the underlying source of a shift-reduce conflict and
a reduce-reduce conflict is an ambiguous grammar
formal language              | defined by/generator        | model of computation/recognizer
regular language (RL)        | regular grammar (RG)        | finite state automata (FSA)
context-free language (CFL)  | context-free grammar (CFG)  | pushdown automata (PDA)
Ambiguity
disambiguation is a mechanical process: take a compilers course
C still uses an ambiguous grammar. why?
unambiguous rules get lengthy and impractical to implement
Extended Backus-Naur Form (EBNF)
EBNF adds: |, [ ], { }^{*}, { }^{+}, and { }^{*(c)}
 | means alternation
 [ ] means the enclosed is optional
 { }^{*} means 0 or more of the enclosed
 { }^{+} means 1 or more of the enclosed
 {<expression>}^{*(c)} means 0 or more of <expression>, separated by c
 example:
<expr> ::= ( <list> )
<expr> ::= a
<list> ::= <expr>
<list> ::= <expr> <list>
 EBNF grammar which defines the same language
<expr> ::= ( <list> ) | a
<list> ::= <expr> [ <list> ]
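The EBNF form maps naturally onto a recursive-descent recognizer. A sketch (assuming the input is a plain string of a, ( and ) characters, an assumption the notes leave open): the optional [ <list> ] becomes a simple if-test.

```python
def recognize(s):
    """Recognize the language of <expr> ::= ( <list> ) | a,
    <list> ::= <expr> [ <list> ].  Raises SyntaxError on bad input."""
    pos = 0

    def expr():
        nonlocal pos
        if pos < len(s) and s[pos] == "a":
            pos += 1
        elif pos < len(s) and s[pos] == "(":
            pos += 1
            lst()
            if pos >= len(s) or s[pos] != ")":
                raise SyntaxError("missing )")
            pos += 1
        else:
            raise SyntaxError(f"unexpected input at position {pos}")

    def lst():
        nonlocal pos
        expr()
        if pos < len(s) and s[pos] in "a(":   # [ <list> ] is optional
            lst()

    expr()
    return pos == len(s)

assert recognize("(aa)")
assert recognize("((a)a)")
```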
 another example:
<term>   ::= <factor> + <factor>
<factor> ::= <term>
 EBNF grammar which defines the same language
<term> ::= <factor> + <factor> {+ <factor>}^{*}
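The repetition {+ <factor>}^{*} maps directly onto a loop in a parser. A sketch, assuming (since the notes leave <factor> undefined here) that a factor is a single alphabetic token:

```python
def parse_term(tokens):
    """Recognize <term> ::= <factor> + <factor> {+ <factor>}^{*},
    where a <factor> is assumed to be one alphabetic token."""
    tokens = list(tokens)

    def factor():
        tok = tokens.pop(0)
        assert tok.isalpha(), f"expected factor, got {tok}"

    factor()
    assert tokens.pop(0) == "+"
    factor()
    while tokens and tokens[0] == "+":    # {+ <factor>}^{*}: zero or more
        tokens.pop(0)
        factor()
    return not tokens                      # valid iff all input consumed

assert parse_term(["x", "+", "y", "+", "z"])
```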
Context-sensitivity
 an example of a property that is not context-free, or
what is an example of something that is context-sensitive?
 first letter of a sentence must be capitalized
 Socrates is the boy.
 The boy is Socrates.
 an example context-sensitive grammar (CSG) for this:
<beginning><article> ::= The | An | A
<article> ::= the | an | a
 exercise: try expressing this as a CFG (hint: it is possible)
 others:
 a variable must be declared before it is used
 the * operator in C (multiplication vs. pointer, depending on context)
 in this course we will not go beyond CFGs
 is C a context-free or context-sensitive language (CSL)?
it is a CSL implemented with a CFG
 solutions:
 use more powerful grammars (CSGs), or
 use attribute grammars (Knuth; 1974 ACM A.M.
Turing Award winner): CFGs decorated with rules
(see [COPL9] pp. 134141)
Chomsky hierarchy
(progressive classes of formal grammars)
(image created by Travis Z. Suel)
 phrase-structure (unrestricted) grammars
 generate recursively enumerable (unrestricted) languages
 include all formal grammars
 implemented with Turing machines
 context-sensitive grammars
 generate context-sensitive languages
 implemented with linear-bounded automata
 context-free grammars
 generate context-free languages
 single nonterminal on left
 nonterminals & terminals on right
 implemented with pushdown automata
 regular grammars
 generate regular languages
 implemented with finite state automata
formal language                        | defined by/generator             | model of computation/recognizer
regular language (RL)                  | regular grammar (RG)             | finite state automata (FSA)
context-free language (CFL)            | context-free grammar (CFG)       | pushdown automata (PDA)
context-sensitive language (CSL)       | context-sensitive grammar (CSG)  | linear-bounded automata (LBA)
recursively enumerable language (REL)  | unrestricted grammar (UG)        | Turing machine (TM)
Exercises
 express .*hw.* as a CFG
 express <S> ::= () | <S><S> | (<S>) as a regular
grammar
 generates strings of balanced parentheses (no dangling parentheses)
 of critical importance to programming languages
 e.g., (()), ()()
 a CSG can express context (which a CFG cannot).
what can a CFG express that a regular grammar cannot?
(hint: exercise above gives some clues)
Constructs and capabilities
References
[COPL9]  R.W. Sebesta. Concepts of Programming Languages. Addison-Wesley, Boston, MA, ninth edition, 2010.
[PLPP]  K.C. Louden. Programming Languages: Principles and Practice. Brooks/Cole, Pacific Grove, CA, second edition, 2002.