Programming Languages: Chapter 2: Formal Languages and Grammars



Formal languages

  • what is a formal language?
    • set of sentences (strings) over some alphabet
    • legal strings = sentences
  • how do we define a formal language?
    • grammar (define syntax of a language)
    • any more?
  • syntax and semantics
    • syntax refers to structure of language
    • semantics refers to meaning of language
  • previously, both syntax and semantics were described intuitively
    • now, well-defined formal systems are available
  • there are finite and infinite languages
    • is C an infinite language?
    • most interesting languages are infinite


Progressive stages of sentence validity

    candidate sentence     lexically valid?   syntactically valid?   semantically valid?
    Socrates is a mann.    no                 no                     no
    Man Socrates is a.     yes                no                     no
    Man is a Socrates.     yes                yes                    no
    Socrates is a man.     yes                yes                    yes


Regular languages and grammars

Finite Automata and Regular Expressions (ref. Randal Nelson and Tom LeBlanc, University of Rochester)
  • lexemes can be formally described by regular grammars
    • . (any single character)
    • * (0 or more of previous character)
    • + (or)
    • shorthand notation:
      • [a-z] (one of the characters in this range)
      • [^a-z] (any character but one in this range)
    • examples:
      • hw* (defines a set of sentences = {h, hw, hww, hwww, hwwww, ...})
      • hw[1-9][0-9]* (defines a set of sentences = {hw1, hw2, ..., hw9, hw10, hw11, ...})
      • social security numbers (SSNs): [0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]
      • matching any phrase of exactly three words separated by white space:
        This is a short sentence.
        ---------
             ----------
                ----------------
        
        [^ ][^ ]*[ ][ ]*[^ ][^ ]*[ ][ ]*[^ ][^ ]*
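
  • the patterns above can be tried out with Python's re module; this is a minimal sketch, assuming each pattern must match the whole string (re.fullmatch). The constant names and the helper matches are invented for this illustration. Note that Python writes alternation as | (in re, + means one or more), whereas these notes use + for "or".

```python
import re

# Patterns from the notes, written in Python's re syntax.
HW_STAR = r"hw*"
HW_NUM = r"hw[1-9][0-9]*"
SSN = r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]"
THREE_WORDS = r"[^ ][^ ]*[ ][ ]*[^ ][^ ]*[ ][ ]*[^ ][^ ]*"


def matches(pattern: str, text: str) -> bool:
    """Return True if the entire text is a sentence of the pattern's language."""
    return re.fullmatch(pattern, text) is not None


print(matches(HW_STAR, "hwww"))           # True
print(matches(HW_NUM, "hw10"))            # True
print(matches(HW_NUM, "hw01"))            # False: first digit must be 1-9
print(matches(SSN, "123-45-6789"))        # True
print(matches(THREE_WORDS, "This is a"))  # True
print(matches(THREE_WORDS, "This is"))    # False: only two words
```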
        
  • regular grammars (also called right-linear or left-linear grammars) are generative devices for regular languages
  • regular grammars define regular languages
  • any finite language is regular
  • sentences from regular languages are recognized using finite state automata (FSA)
  • finite state automaton to accept positive integers and identifiers in C: [1-9][0-9]* + [_a-zA-Z][_a-zA-Z0-9]*
  • formal language                    defined by/generator            model of computation/recognizer
    regular language (RL)              regular grammar (RG)            finite state automata (FSA)
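
  • a finite state automaton for [1-9][0-9]* + [_a-zA-Z][_a-zA-Z0-9]* can be sketched as a transition table. The state names (start, int, id), the character classes, and the function names below are labels invented for this sketch, not part of the notes.

```python
import string


def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch == "0":
        return "zero"
    if ch in "123456789":
        return "nonzero"
    if ch == "_" or ch in string.ascii_letters:
        return "letter"
    return "other"


# (state, input class) -> next state; a missing entry means "reject".
TRANSITIONS = {
    ("start", "nonzero"): "int",
    ("start", "letter"): "id",
    ("int", "zero"): "int",
    ("int", "nonzero"): "int",
    ("id", "letter"): "id",
    ("id", "zero"): "id",
    ("id", "nonzero"): "id",
}
ACCEPTING = {"int": "positive integer", "id": "identifier"}


def recognize(text):
    """Return the kind of lexeme accepted, or None if text is rejected."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return None
    return ACCEPTING.get(state)


print(recognize("42"))    # positive integer
print(recognize("_tmp"))  # identifier
print(recognize("007"))   # None
```

    • the automaton needs no memory beyond its current state, which is what makes it a finite state recognizer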


Context-free grammars (in Backus-Naur form) and context-free languages

Generating and Recognizing Recursive Descriptions of Patterns with Context-Free Grammars (ref. Randal Nelson and Tom LeBlanc, University of Rochester)
  • stream of tokens must conform to a grammar (must be arranged in a particular order)
  • grammars define how sentences are constructed
  • defined using a metalanguage notation called Backus-Naur form (BNF)
    • John Backus @ IBM for Algol 58 (1977 ACM A.M. Turing Award winner)
    • Noam Chomsky
    • Peter Naur for Algol 60 (2005 ACM A.M. Turing Award winner)
  • simple grammar for English sentences

    (r1) <sentence> ::= <article> <noun> <verb> <adverb> .
    (r2) <article> ::= a
    (r3) <article> ::= an
    (r4) <article> ::= the
    (r5) <noun> ::= dog
    (r6) <noun> ::= cat
    (r7) <noun> ::= Socrates
    (r8) <verb> ::= runs
    (r9) <verb> ::= jumps
    (r10) <adverb> ::= slowly
    (r11) <adverb> ::= fastly

  • elements:
    • grammar = set of production rules,
    • start symbol (<sentence>),
    • non-terminals (e.g., <noun>, <verb>)
    • terminals (e.g., cat, runs, .)
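
  • the elements above map directly onto a small data structure. The sketch below stores rules r1 to r11 in a dictionary (the names GRAMMAR, START, and sentences are invented here) and, because this grammar has no recursion, enumerates its entire finite language.

```python
from itertools import product

# Production rules r1-r11: non-terminal -> list of right-hand sides.
GRAMMAR = {
    "<sentence>": [["<article>", "<noun>", "<verb>", "<adverb>", "."]],
    "<article>": [["a"], ["an"], ["the"]],
    "<noun>": [["dog"], ["cat"], ["Socrates"]],
    "<verb>": [["runs"], ["jumps"]],
    "<adverb>": [["slowly"], ["fastly"]],
}
START = "<sentence>"


def sentences(symbol):
    """Yield every terminal string derivable from symbol.

    Terminates only because this grammar is not recursive.
    """
    if symbol not in GRAMMAR:  # a terminal derives only itself
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        for parts in product(*(list(sentences(s)) for s in rhs)):
            yield [tok for part in parts for tok in part]


language = [" ".join(s) for s in sentences(START)]
print(len(language))  # 36
print(language[0])    # a dog runs slowly .
```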


Language generation and recognition

  • what can we use grammars for?
  • language generation
    • apply the rules in a top-down fashion
    • construct a derivation
  • deriving sentences from the above grammar; derive "the dog runs fastly ." (=> means "derives")

    <sentence> => <article> <noun> <verb> <adverb> . (r1)
               => <article> <noun> <verb> fastly . (r11)
               => <article> <noun> runs fastly . (r8)
               => <article> dog runs fastly . (r5)
               => the dog runs fastly . (r4)
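
  • a derivation is just repeated rewriting of a non-terminal by the right-hand side of one of its rules. A minimal sketch that replays the rule sequence r1, r11, r8, r5, r4 (the names RULES and apply_rule are invented for this illustration, and only the five rules used are listed):

```python
# Only the rules used in this derivation: name -> (left-hand side, right-hand side).
RULES = {
    "r1": ("<sentence>", ["<article>", "<noun>", "<verb>", "<adverb>", "."]),
    "r4": ("<article>", ["the"]),
    "r5": ("<noun>", ["dog"]),
    "r8": ("<verb>", ["runs"]),
    "r11": ("<adverb>", ["fastly"]),
}


def apply_rule(form, rule_name):
    """Rewrite the first occurrence of the rule's left-hand side in form."""
    lhs, rhs = RULES[rule_name]
    i = form.index(lhs)  # raises ValueError if the rule does not apply
    return form[:i] + rhs + form[i + 1:]


form = ["<sentence>"]
for name in ["r1", "r11", "r8", "r5", "r4"]:
    form = apply_rule(form, name)
    print("=>", " ".join(form), f"({name})")
# last line printed: => the dog runs fastly . (r4)
```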

  • grammar for simple arithmetic expressions for a four-function calculator

    (r1) <expr> ::= <expr> + <expr>
    (r2) <expr> ::= <expr> - <expr>
    (r3) <expr> ::= <expr> * <expr>
    (r4) <expr> ::= <expr> / <expr>
    (r5) <expr> ::= <id>
    (r6) <id> ::= x | y | z
    (r7) <expr> ::= (<expr>)
    (r8) <expr> ::= <number>
    (r9) <number> ::= <number> <digit>
    (r10) <number> ::= <digit>
    (r11) <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

  • there are leftmost and rightmost derivations
  • some derivations are neither
  • sample derivations of 132:
    • leftmost derivation:
      <expr> => <number> (r8)
             => <number> <digit> (r9)
             => <number> <digit> <digit> (r9)
             => <digit> <digit> <digit> (r10)
             => 1 <digit> <digit> (r11)
             => 13 <digit> (r11)
             => 132 (r11)
    • rightmost derivation:
      <expr> => <number> (r8)
             => <number> <digit> (r9)
             => <number> 2 (r11)
             => <number> <digit> 2 (r9)
             => <number> 32 (r11)
             => <digit> 32 (r10)
             => 132 (r11)

    • neither rightmost nor leftmost derivation:
      <expr> => <number> (r8)
             => <number> <digit> (r9)
             => <number> <digit> <digit> (r9)
             => <number> <digit> 2 (r11)
             => <number> 32 (r11)
             => <digit> 32 (r10)
             => 132 (r11)

    • another derivation that is neither rightmost nor leftmost:
      <expr> => <number> (r8)
             => <number> <digit> (r9)
             => <number> <digit> <digit> (r9)
             => <number> 3 <digit> (r11)
             => <digit> 3 <digit> (r10)
             => 13 <digit> (r11)
             => 132 (r11)

    • derive "x + y * z"
      <expr> => <expr> + <expr> (r1)
             => <expr> + <expr> * <expr> (r3)
             => <expr> + <expr> * <id> (r5)
             => <expr> + <expr> * z (r6)
             => <expr> + <id> * z (r5)
             => <expr> + y * z (r6)
             => <id> + y * z (r5)
             => x + y * z (r6)
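
    • leftmost and rightmost derivations differ only in which non-terminal of the current sentential form is expanded next. The sketch below makes that concrete for digit strings such as 132: it builds the single parse tree under rules r8 to r11 and then lists the sentential forms in either order. The tree encoding (label, children) and the function names are assumptions made for this illustration.

```python
def number_tree(digits):
    """Parse tree for a digit string under <number> ::= <number> <digit> | <digit>."""
    tree = ("<number>", [("<digit>", [digits[0]])])
    for d in digits[1:]:
        tree = ("<number>", [tree, ("<digit>", [d])])
    return ("<expr>", [tree])


def render(form):
    """Show subtrees by their non-terminal label and terminals as themselves."""
    return " ".join(x[0] if isinstance(x, tuple) else x for x in form)


def derivation(tree, leftmost=True):
    """List sentential forms, always expanding the leftmost (or rightmost) non-terminal."""
    form = [tree]  # mix of unexpanded subtrees (tuples) and terminals (str)
    steps = [render(form)]
    while True:
        pending = [i for i, x in enumerate(form) if isinstance(x, tuple)]
        if not pending:
            return steps
        i = pending[0] if leftmost else pending[-1]
        form = form[:i] + form[i][1] + form[i + 1:]
        steps.append(render(form))


for step in derivation(number_tree("132"), leftmost=True):
    print("=>", step)
print()
for step in derivation(number_tree("132"), leftmost=False):
    print("=>", step)
```

    • both runs use the same seven rule applications and the same parse tree; only the order of expansion changes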

    • is a grammar a generative device or a recognition device? it can be used as both, one of the seminal discoveries in computer science
    • language recognition; do the reverse
        generation: grammar → sentence
        recognition: sentence → grammar
    • let's parse x + y * z (do the reverse)
      . x + y * z (shift)
      x . + y * z (reduce r6)
      <id> . + y * z (reduce r5)
      <expr> . + y * z (shift)
      <expr> + . y * z (shift)
      <expr> + y . * z (reduce r6)
      <expr> + <id> . * z (reduce r5)
      <expr> + <expr> . * z (shift) ← why not reduce r1 here instead?
      <expr> + <expr> * . z (shift)
      <expr> + <expr> * z . (reduce r6)
      <expr> + <expr> * <id> . (reduce r5)
      <expr> + <expr> * <expr> . (reduce r3; emit multiplication)
      <expr> + <expr> . (reduce r1; emit addition)
      <expr> . (start symbol...hurray! this is a valid sentence)
    • . (dot) denotes the top of the stack
    • the right-hand side (rhs) of the rule being reduced, found on top of the stack, is called the handle
    • called bottom-up or shift-reduce parsing
    • construct a parse tree
    • parse trees for "x + y * z"

                     

  • the above parse exhibits a shift-reduce conflict
    • if we shift, multiplication will have higher precedence (desired)
    • if we reduce, addition will have higher precedence (undesired)
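
  • one way to see the conflict resolved is a small shift-reduce parser that consults a precedence table whenever <expr> op <expr> is on top of the stack. This is a sketch under assumptions: the PRECEDENCE table and all names are invented here, every non-operator token is treated as an <id>, the two reductions r6 and r5 are folded into one step, and parentheses and numbers are left out.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def shift_reduce(tokens):
    """Parse tokens bottom-up; return (list of actions, parse tree).

    Operators are kept on the stack as str; every <expr> is a tuple.
    """
    stack, actions, pos = [], [], 0
    while True:
        lookahead = tokens[pos] if pos < len(tokens) else None
        handle = (len(stack) >= 3
                  and isinstance(stack[-1], tuple)
                  and isinstance(stack[-2], str) and stack[-2] in PRECEDENCE
                  and isinstance(stack[-3], tuple))
        if handle:
            # Shift-reduce conflict: shift only if the incoming operator
            # binds tighter; ties reduce, which gives left associativity.
            wants_shift = (lookahead in PRECEDENCE
                           and PRECEDENCE[lookahead] > PRECEDENCE[stack[-2]])
            if not wants_shift:
                right, op, left = stack.pop(), stack.pop(), stack.pop()
                stack.append((op, left, right))
                actions.append(f"reduce {op}")
                continue
        if lookahead is None:
            break
        pos += 1
        if lookahead in PRECEDENCE:
            stack.append(lookahead)
            actions.append(f"shift {lookahead}")
        else:
            stack.append(("id", lookahead))
            actions.append(f"shift {lookahead}, reduce to <expr>")
    if len(stack) != 1 or not isinstance(stack[0], tuple):
        raise SyntaxError("not a valid sentence")
    return actions, stack[0]


actions, tree = shift_reduce("x + y * z".split())
print(actions)
print(tree)  # ('+', ('id', 'x'), ('*', ('id', 'y'), ('id', 'z')))
```

    • the multiplication ends up lower in the tree than the addition, so it is evaluated first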

  • there is also a reduce-reduce conflict (though the above parse does not have one); consider the following:

    (r1) <expr> ::= <term>
    (r2) <expr> ::= <id>
    (r3) <term> ::= <id>
    (r4) <id> ::= x | y | z

    let's parse x
    . x (shift)
    x . (reduce r4)
    <id> . ← reduce r2 or r3 here?

    parse trees for "x"

                   

  • in both examples above, the underlying source of the shift-reduce conflict and the reduce-reduce conflict is an ambiguous grammar

  • formal language                    defined by/generator            model of computation/recognizer
    regular language (RL)              regular grammar (RG)            finite state automata (FSA)
    context-free language (CFL)        context-free grammar (CFG)      pushdown automata (PDA)


Ambiguity

    sentence   derivations   parse trees   meanings
    132        multiple      one           one (132)
    1+3+2      multiple      multiple      one (6)
    1+3*2      multiple      multiple      multiple (7 or 8)
    6-3-2      multiple      multiple      multiple (1 or 5)

  • last three cases make a grammar ambiguous; if a sentence from a language has more than one parse tree, then the grammar for the language is ambiguous
  • parse tree for "132"



  • parse trees for "1 + 3 + 2"

                   

  • parse trees for "1 + 3 * 2"

                   

  • parse trees for "6 - 3 - 2"
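
  • the parse-tree and meaning columns of the table can be checked by brute force. The sketch below enumerates every parse tree of a flat token list under <expr> ::= <expr> op <expr> | <number> and evaluates each one. It assumes numbers arrive as single tokens (so 132 has exactly one tree) and leaves out division; the names OPS and meanings are invented here.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def meanings(tokens):
    """Return one value per parse tree for tokens such as [1, '+', 3, '*', 2]."""
    if len(tokens) == 1:
        return [tokens[0]]
    values = []
    for i in range(1, len(tokens), 2):  # any operator may be the root
        for left in meanings(tokens[:i]):
            for right in meanings(tokens[i + 1:]):
                values.append(OPS[tokens[i]](left, right))
    return values


for sentence in ([132], [1, "+", 3, "+", 2], [1, "+", 3, "*", 2], [6, "-", 3, "-", 2]):
    vals = meanings(sentence)
    print(sentence, "trees:", len(vals), "meanings:", sorted(set(vals)))
# [132] trees: 1 meanings: [132]
# [1, '+', 3, '+', 2] trees: 2 meanings: [6]
# [1, '+', 3, '*', 2] trees: 2 meanings: [7, 8]
# [6, '-', 3, '-', 2] trees: 2 meanings: [1, 5]
```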

                   

  • let's parse "Time flies like an arrow"
    • four different meanings!
    • we say that the grammar is ambiguous!
    • how can we determine intended meaning? need context
  • let's parse "I shot the man on the mountain with the camera."
  • ambiguous grammar
    • a grammar is ambiguous if you can construct at least two parse trees for the same sentence in the language
    • trivial to prove above grammar is ambiguous
  • how to prove a grammar is ambiguous (steps)
    1. generate an expression from the grammar and show the expression
    2. give two parse trees "using the grammar" for that expression
    notes:
    • the expression must come from the grammar
    • a parse tree is fully expanded; it has no leaves which are non-terminals; they are all terminals
    • collected leaves in each parse tree must constitute the expression
    • you cannot change the grammar while building the parse trees
    • you cannot change the expression while building the parse trees
  • we would like part of the meaning (or semantics) to be determined from the grammar (or syntax)
  • desideratum: syntax should imply semantics (major complaint against systems like UNIX)
    • precedence
    • associativity
  • what does `have higher precedence' mean? occurs lower in the parse tree because expressions are evaluated bottom-up
  • solution
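    • either state a disambiguating rule (order of precedence), e.g., * has higher precedence than + (true in most languages; APL is an exception), or
    • revise the grammar (possible for expression grammars like the one above, though not for every context-free language: some are inherently ambiguous)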
  • grammar revision
    • introduce new steps (non-terminals) in the non-terminal cascade so that multiplications are always lower than additions in the parse tree
    • worked out solution for 2+3*4

      (r1) <expr> ::= <expr> + <expr>
      (r2) <expr> ::= <expr> - <expr>
      (r3) <expr> ::= <term>
      (r4) <term> ::= <term> * <term>
      (r5) <term> ::= <term> / <term>
      (r6) <term> ::= (<expr>)
      (r7) <term> ::= <number>

      • this is still ambiguous? yes. why?
      • how can we disambiguate it?
  • associativity
    • comes into play when dealing with operators with same precedence
      • 6-3-2 = (6-3)-2 = 1 (left associative)
      • - - - 6 = -(-(-6)) = -6 (right associative)
    • matters when adding floating-point numbers, or
    • with an operator such as subtraction (e.g., 6-3-2)
    • which operators in C are right-associative?
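
    • a quick check of why grouping matters, using Python floats (IEEE 754 doubles) and the subtraction example from above:

```python
# Floating-point addition is not associative: grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6

# Subtraction is not associative even with exact integers.
print((6 - 3) - 2)    # 1 (left associative reading)
print(6 - (3 - 2))    # 5 (right associative reading)
```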
  • overcoming ambiguity of associativity
    • left-recursive rules lead to left associativity
    • right-recursive rules lead to right associativity
  • grammar still ambiguous for 1 + 3 + 2 and 6 - 3 - 2?
    • let's fix it and make it left-associative

      (r1) <expr> ::= <expr> + <term>
      (r2) <expr> ::= <expr> - <term>
      (r3) <expr> ::= <term>
      (r4) <term> ::= <term> * <factor>
      (r5) <term> ::= <term> / <factor>
      (r6) <term> ::= <factor>
      (r7) <factor> ::= (<expr>)
      (r8) <factor> ::= <number>

    • theme: add another level of indirection by introducing a new non-terminal
    • notice rules get lengthy
    • why do we prefer a small rule set?
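
    • the <expr>/<term>/<factor> cascade translates almost rule-for-rule into a small evaluator. This is a sketch, not the notes' method: it uses recursive descent, and because a recursive-descent function cannot call itself first on a left-recursive rule, each left-recursive rule is replaced by a loop that still groups to the left. The tokenizer silently ignores characters it does not recognize, and / is Python's float division; both are simplifying assumptions.

```python
import re


def evaluate(text):
    """Evaluate text according to the <expr>/<term>/<factor> grammar."""
    tokens = re.findall(r"[0-9]+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():  # <expr> ::= <expr> + <term> | <expr> - <term> | <term>
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value = value + term()
            else:
                value = value - term()
        return value

    def term():  # <term> ::= <term> * <factor> | <term> / <factor> | <factor>
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value = value * factor()
            else:
                value = value / factor()
        return value

    def factor():  # <factor> ::= (<expr>) | <number>
        if peek() == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected )")
            advance()
            return value
        tok = peek()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected number, got {tok!r}")
        return int(advance())

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result


print(evaluate("6-3-2"))    # 1
print(evaluate("1+3*2"))    # 7
print(evaluate("2+3*4"))    # 14
print(evaluate("(1+3)*2"))  # 8
```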
  • another example of ambiguity; (<term>)-<term> in C can mean two different things
    • subtracting: (10) - 2
    • typecasting: (int) - 3.5
  • classical example of grammar ambiguity in PLs: the dangling else problem
    • disambiguating grammars
    • if (a < 2)
        if (b > 3)
          x
        else /* associates with which if above? */
          y
      
    • ambiguous grammar
              <stmt> ::= if <cond> <stmt>
              <stmt> ::= if <cond> <stmt> else <stmt>
      
    • parse trees for if (a < 2) if (b > 3) x else y

                     

    • exercise: develop an unambiguous version
  • disambiguation is a mechanical process: take a compilers course
  • C still uses an ambiguous grammar, why? rules get lengthy and impractical to implement
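
  • in practice the dangling else is settled by a disambiguating rule: in C, an else associates with the nearest preceding if that lacks one. A recursive parser gets this behavior simply by letting the innermost if consume the else. A minimal sketch, assuming conditions and simple statements are single pre-split tokens (the function name and tree encoding are invented here):

```python
def parse_stmt(tokens, pos=0):
    """<stmt> ::= if <cond> <stmt> [else <stmt>] | <simple>

    Returns (tree, next position). The innermost unfinished if
    consumes a following else, so else binds to the nearest if.
    """
    if tokens[pos] == "if":
        cond = tokens[pos + 1]
        then_branch, pos = parse_stmt(tokens, pos + 2)
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
            return ("if", cond, then_branch, else_branch), pos
        return ("if", cond, then_branch), pos
    return tokens[pos], pos + 1


tokens = ["if", "a<2", "if", "b>3", "x", "else", "y"]
tree, _ = parse_stmt(tokens)
print(tree)  # ('if', 'a<2', ('if', 'b>3', 'x', 'y'))
```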


Extended Backus-Naur Form (EBNF)

EBNF adds: |, [ ], { }*, { }+, and {}*(c)
  • | means alternation
  • [ ] means enclosed is optional
  • { }* means 0 or more of enclosed
  • { }+ means 1 or more of enclosed
  • {<expression>}*(c)
  • example:

    <expr> ::= ( <list> )
    <expr> ::= a
    <list> ::= <expr>
    <list> ::= <expr> <list>

  • EBNF grammar which defines the same language

    <expr> ::= ( <list> ) | a
    <list> ::= <expr> [ <list> ]

  • another example:

    <term> ::= <factor> + <factor>
    <factor> ::= <term>

  • EBNF grammar which defines the same language

    <term> ::= <factor> + <factor> {+ <factor>}*
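
  • EBNF constructs map naturally onto control flow in a hand-written recognizer: | becomes a choice between branches and [ ] becomes an optional attempt. The sketch below recognizes the first EBNF grammar of this section (<expr> ::= ( <list> ) | a and <list> ::= <expr> [ <list> ]). It assumes each character is one token with no white space; the function names are invented for this illustration.

```python
def parse_expr(s, pos=0):
    """<expr> ::= ( <list> ) | a. Return the position after the match, or -1."""
    if pos < len(s) and s[pos] == "a":
        return pos + 1
    if pos < len(s) and s[pos] == "(":
        pos = parse_list(s, pos + 1)
        if pos != -1 and pos < len(s) and s[pos] == ")":
            return pos + 1
    return -1


def parse_list(s, pos):
    """<list> ::= <expr> [ <list> ]: the optional part is simply attempted."""
    pos = parse_expr(s, pos)
    if pos == -1:
        return -1
    rest = parse_list(s, pos)  # try the optional <list>
    return rest if rest != -1 else pos


def is_sentence(s):
    """True if the whole string s is derivable from <expr>."""
    return parse_expr(s) == len(s)


print(is_sentence("a"))        # True
print(is_sentence("(aa(a))"))  # True
print(is_sentence("()"))       # False: <list> needs at least one <expr>
```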


Context-sensitivity

  • an example of a property that is not context-free, or what is an example of something that is context-sensitive?
    • first letter of a sentence must be capitalized
      • Socrates is the boy.
      • The boy is Socrates.
    • an example context-sensitive grammar (CSG) for this:

      <beginning> <article> ::= The | An | A
      <article> ::= the | an | a

    • exercise: try expressing this as CFG (hint: it is possible)
    • others:
    • a variable must be declared before it is used
    • * operator in C
  • in this course we will not go beyond CFGs
  • is C a context-free or context-sensitive language (CSL)? it is a CSL implemented with a CFG
  • solutions:
    • use more powerful grammars (CSGs), or
    • use attribute grammars (Knuth; 1974 ACM A.M. Turing Award winner): CFGs decorated with rules (see [COPL9] pp. 134-141)
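
  • in practice, a property such as "a variable must be declared before it is used" is often checked in a separate pass over the parsed program rather than by the grammar. A minimal sketch of such a check; the ("declare", name) / ("use", name) statement format is a hypothetical stand-in for whatever a parser would hand to the checker:

```python
def undeclared_uses(statements):
    """Return the names that are used before being declared.

    statements: list of ("declare", name) or ("use", name) pairs.
    """
    declared, errors = set(), []
    for kind, name in statements:
        if kind == "declare":
            declared.add(name)
        elif name not in declared:
            errors.append(name)
    return errors


program = [("declare", "x"), ("use", "x"), ("use", "y"), ("declare", "y")]
print(undeclared_uses(program))  # ['y']
```

    • the check depends on what appeared earlier in the program (the context), which a CFG rule alone does not capture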


Chomsky hierarchy

(progressive classes of formal grammars)



(image created by Travis Z. Suel)
  • phrase-structure (unrestricted) grammars
    • generate recursively enumerable (unrestricted) languages
    • include all formal grammars
    • implemented with Turing machines
  • context-sensitive grammars
    • generate context-sensitive languages
    • implemented with linear-bounded automata
  • context-free grammars
    • generate context-free languages
    • single non-terminal on left
    • non-terminals & terminals on right
    • implemented with pushdown automata
  • regular grammars
    • generate regular languages
    • implemented with finite state automata

  • formal language                        defined by/generator               model of computation/recognizer
    regular language (RL)                  regular grammar (RG)               finite state automata (FSA)
    context-free language (CFL)            context-free grammar (CFG)         pushdown automata (PDA)
    context-sensitive language (CSL)       context-sensitive grammar (CSG)    linear-bounded automata (LBA)
    recursively-enumerable language (REL)  unrestricted grammar (UG)          Turing machine (TM)


Exercises

  • express .*hw.* as a CFG
  • express <S> ::= () | <S><S> | (<S>) as a regular grammar
    • generates strings of balanced parentheses (no dangling parentheses)
    • of critical importance to programming languages
    • e.g., (()), ()()
  • a CSG can express context (which a CFG cannot). what can a CFG express that a regular grammar cannot? (hint: exercise above gives some clues)


Constructs and capabilities


References

    [COPL9] R.W. Sebesta. Concepts of Programming Languages. Addison-Wesley, Boston, MA, Ninth edition, 2010.
    [PLPP] K.C. Louden. Programming Languages: Principles and Practice. Brooks/Cole, Pacific Grove, CA, Second edition, 2002.
