UNIX/Linux & C Programming:
Chapter n: yacc



Coverage: [UPE] Chapter 8, [LYT], [CPL] § 6.8 (pp. 147-149), § 7.3 (pp. 155-156), § A8.3 (p. 212), and § B7 (p. 254)


Grammar warm up

    Is the following grammar ambiguous?
    E ::= E + E
    E ::= id
    
    How can we fix it?
    E ::= E + T | T
    T ::= id
    
    Now is it left- or right-associative? How about this?
    E ::= T + E | T
    T ::= id
    
    How can we disambiguate our running example?
    E ::= E + E
    E ::= E * E
    E ::= id
    
    How about?
    E ::= E + T | T
    T ::= T * F | F
    F ::= id
    
    Can we do it with only two non-terminals?
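
The associativity question above can be made concrete with a small C sketch. It swaps + for - (an assumption made only for illustration, since grouping does not change the result of addition), and the function names eval_left and eval_right are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Left-associative grouping, as produced by E ::= E - T | T:
   ((v[0] - v[1]) - v[2]) - ... */
int eval_left(const int *v, size_t n)
{
    assert(n > 0);
    int acc = v[0];
    for (size_t i = 1; i < n; i++)
        acc = acc - v[i];
    return acc;
}

/* Right-associative grouping, as produced by E ::= T - E | T:
   v[0] - (v[1] - (v[2] - ...)) */
int eval_right(const int *v, size_t n)
{
    assert(n > 0);
    if (n == 1)
        return v[0];
    return v[0] - eval_right(v + 1, n - 1);
}
```

For the operands 8, 4, 2 the left-recursive grouping gives (8 - 4) - 2 = 2, while the right-recursive grouping gives 8 - (4 - 2) = 6.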


Scanning and parsing







yacc conflicts

both kinds of conflict typically stem from an ambiguous grammar (or from one that is not LALR(1))
  • shift-reduce conflict: by default, yacc shifts
  • reduce-reduce conflict: by default, yacc reduces by the rule that appears first in the grammar
remedies
  • disambiguate the grammar by rewriting it (the best option)
  • choose a default action by declaring the precedence and associativity of operators as follows:
    %left '+' '-'
    %left '*' '/'
    
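    where precedence increases from top to bottom: operators declared later bind more tightly (i.e., * and / have higher precedence than + and - above), operators on the same line share one precedence level, and %left makes them left-associative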
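
To see what such a precedence table buys us, here is a minimal hand-written sketch in C. It uses precedence climbing, which is not how yacc's generated LALR(1) tables work internally; it only mimics the effect of the two %left lines. The names prec, skip, primary, climb, and eval_expr are hypothetical, and the input is assumed to contain only non-negative integers, spaces, and the four operators.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical precedence table mirroring
       %left '+' '-'
       %left '*' '/'
   A later declaration means a higher level; 0 means "not an operator". */
static int prec(char op)
{
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

static const char *skip(const char *s)
{
    while (*s == ' ')
        s++;
    return s;
}

/* Parse a non-negative integer literal and advance *s past it. */
static int primary(const char **s)
{
    int v = 0;
    *s = skip(*s);
    assert(isdigit((unsigned char)**s));
    while (isdigit((unsigned char)**s))
        v = v * 10 + (*(*s)++ - '0');
    return v;
}

/* Parse and evaluate operators whose level is at least min_prec. */
static int climb(const char **s, int min_prec)
{
    int lhs = primary(s);
    for (;;) {
        *s = skip(*s);
        char op = **s;
        int p = prec(op);
        if (p == 0 || p < min_prec)
            return lhs;
        (*s)++;
        /* left associativity: the right operand may only contain
           operators of strictly higher precedence */
        int rhs = climb(s, p + 1);
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': assert(rhs != 0); lhs /= rhs; break;
        }
    }
}

int eval_expr(const char *s)
{
    return climb(&s, 1);
}
```

Calling eval_expr("2 + 3 * 4") yields 14 and eval_expr("2 * 3 + 4") yields 10, because * sits on the later (higher) level; eval_expr("8 - 4 - 2") yields 2 because both levels are left-associative.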


Essential yacc

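  • the parsing technique used by yacc is LALR(1), or Look-Ahead LR: the L in LR means the input is scanned left to right, and the R means the parser constructs a rightmost derivation in reverse; (1) indicates that the lookahead is limited to one token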

  • how do lex and yacc communicate? through a global variable named yylval.

  • yacc maintains two internal stacks
    • parse stack
      • contains terminals and non-terminals
      • terminals (tokens) are returned by int yylex(); non-terminals are pushed when a reduction takes place
    • value stack
      • contains values of type YYSTYPE (int by default)
      • YYSTYPE is defined in <filename>.tab.h
      • pushed from variable yylval
      • $$ (top of stack after the reduction takes place), $1, $2, $3, ... $n reference items on the value stack corresponding to the items (from left to right) on the rhs of the production rule used in the reduction




    • these two stacks must always be synchronized

    %token INTEGER
    /* produces "#define INTEGER 258" in calc.tab.c on our system 
       because values 0-255 are reserved for character values, and
       yacc reserves several values for end-of-file and error processing and,
       therefore, token values typically start around 258 */
    
  • how to get values of different types on the value stack? use a union
  • 3rd generation languages (e.g., C) and 4th generation languages (yacc)


Marriage of lex and yacc






(adapted version of [LYT] Fig. 2, p. 5)


Running yacc (in conjunction with lex) to automatically generate a parser

$ flex tokens.l # produces lex.yy.c
$ bison -d gram.y # produces gram.tab.c and gram.tab.h
$ gcc -c gram.tab.c # produces gram.tab.o
$ gcc -c lex.yy.c # produces lex.yy.o
$ gcc -o parser gram.tab.o lex.yy.o # produces parser
$ ./parser < ...


$ flex calc1.l # produces lex.yy.c
$ bison -d calc1.y # produces calc1.tab.c and calc1.tab.h
$ gcc -c calc1.tab.c # produces calc1.tab.o
$ gcc -c lex.yy.c # produces lex.yy.o
$ gcc -o calc1 calc1.tab.o lex.yy.o # produces calc1
$ ./calc1 < ...


Scanning and parsing in flex and bison







Evaluating arithmetic expressions

  • expr
    $ expr 2 + 3
    5
    $ expr 2 + 3 \* 4
    14
    $ expr 2 \* 3 + 4
    10
    $ expr "2 + 3 * 4"
    2 + 3 * 4
    
  • bc -l (an arbitrary precision calculator)
    23+47
    70
    2 + 3
    5
    2 + 3 * 4
    14
    2 * 3 + 4
    10
    2 ^ 3
    8
    ^D
    


Makefile for simple calculator (version 1)

    SRC = calc
    CC = gcc
    LEX = flex
    LEX_FLAGS = -d
    YACC = bison
    YACC_FLAGS = -d -t
    
    all: $(SRC)
    
    $(SRC): lex.yy.o $(SRC).tab.o
            $(CC) lex.yy.o $(SRC).tab.o -o $(SRC)
    
    lex.yy.o: lex.yy.c $(SRC).tab.h
            $(CC) -c lex.yy.c
    
    lex.yy.c: $(SRC).l
            $(LEX) $(LEX_FLAGS) $(SRC).l
    
    $(SRC).tab.o: $(SRC).tab.c
            $(CC) -c $(SRC).tab.c
    
    $(SRC).tab.c: $(SRC).y
            $(YACC) $(YACC_FLAGS) $(SRC).y
    
    $(SRC).tab.h: $(SRC).y
            $(YACC) $(YACC_FLAGS) $(SRC).y
    
    clean:
            -rm *.[cho] $(SRC)
    


Programming exercise

Use lex and yacc to generate a parser for the language defined by the following grammar (akin to the parser we generated in class for the balanced, nested parentheses language).
<sentence> ::= <sentence> <expr> | <expr>
    <expr> ::= ( <list> ) | a
    <list> ::= <expr> | <expr> <list>
We said in class that grammars are also generative devices. Write a C program which utilizes this grammar to generate n sentences from the language. Use those sentences to evaluate the correctness of your parser.


Enhancements to the simple calculator (version 2)

  • multiplication and division: demonstrates setting precedence
  • parentheses to make precedence explicit
  • exponentiation operator
    • has highest precedence
    • right associative
  • unary minus
    • shares highest precedence with the exponentiation operator
    • requires the %prec directive to disambiguate it from binary minus
  • single character variables
    • requires building and indexing an environment or symbol table
    • symbol table is indexed by ints
      /* yields an integer in the range 0-25 */
      /* ascii code for character 'a' is 97 */
      /* ascii code for character 't' is 116 */
      yylval = *yytext - 'a';
      
    • requires a <statement> non-terminal
    • requires an assignment operator which is right associative and has lowest precedence
  • print statement
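
The single-character symbol table above can be sketched in a few lines of C. The names NSYMS, sym, sym_index, sym_set, and sym_get are hypothetical; the sketch assumes a character set such as ASCII in which 'a' through 'z' are contiguous.

```c
#include <assert.h>

#define NSYMS 26                 /* one slot per lowercase letter */

static int sym[NSYMS];           /* hypothetical environment: values of a..z */

/* Same arithmetic as  yylval = *yytext - 'a';  in the scanner. */
int sym_index(char name)
{
    assert(name >= 'a' && name <= 'z');
    return name - 'a';
}

void sym_set(char name, int value)
{
    sym[sym_index(name)] = value;
}

int sym_get(char name)
{
    return sym[sym_index(name)];
}
```

With ASCII codes, 't' maps to 116 - 97 = 19, so an assignment to t writes slot 19 of the table.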


Interpreting while parsing


Interpreting while parsing (in calculator 1 & 2)


Some conceptual exercises

  • in version 2 of the calculator, why is the string print -4 - 5 parsed as a sentence if unary minus has the highest precedence?
  • in version 2 of the calculator, will the '-' expr %prec '^' { $$ = $2*-1; } rule interfere with parsing the string print 2 ^ -3;?
  • is there a problem parsing print -2 ^ 3;? is there a problem parsing print -2 ^ 4;? Explain.


unions

  • see [CPL] § 6.8 (pp. 147-149) and § A8.3 (p. 212)
  • `A union is a variable that may hold (at different times) objects of different types and sizes, with the compiler keeping track of size and alignment requirements' [CPL] (p. 147).
  • `Unions provide a way to manipulate different kinds of data in a single area of storage, without embedding any machine-dependent information in the program' [CPL] (p. 147).
  • a union variable is large enough to hold the largest of its member types
  • a union is the ideal variable to use for the node in a syntax tree data structure
  • union {
        int i;
        float f;
        char s[16];
    } u;
    
  • a union is the C analog of a variant record in Pascal
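
Since a union does not record which member was stored last, it is usually paired with a tag, much like the discriminant of a variant record. A minimal sketch, assuming the hypothetical names value, value_kind, tagged_value, make_int, and make_string:

```c
#include <assert.h>
#include <string.h>

union value {
    int   i;
    float f;
    char  s[16];
};

/* The union does not record which member is live, so pair it with a tag. */
enum value_kind { VAL_INT, VAL_FLOAT, VAL_STRING };

struct tagged_value {
    enum value_kind kind;
    union value     u;
};

struct tagged_value make_int(int i)
{
    struct tagged_value v = { VAL_INT, { 0 } };
    v.u.i = i;
    return v;
}

struct tagged_value make_string(const char *s)
{
    struct tagged_value v = { VAL_STRING, { 0 } };
    assert(s != NULL);
    /* copy at most 15 characters and always terminate */
    strncpy(v.u.s, s, sizeof v.u.s - 1);
    v.u.s[sizeof v.u.s - 1] = '\0';
    return v;
}
```

Here sizeof(union value) is at least 16, the size of its largest member, and code that reads a tagged_value is expected to inspect kind before touching u.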


Variable argument lists

  • see [CPL] § 7.3 (pp. 155-156) and § B7 (p. 254)
  • use ellipses in prototype/header
    • int printf (char* fmt, ...)
    • means number and type may vary across calls
    • ellipses must come at the end
  • to step through arguments
    • declare variable of type va_list in function (e.g., va_list ap;)
    • initialize ap with va_start, passing ap and the last named parameter (e.g., va_start (ap, fmt);)
    • then call va_arg with ap and a datatype to retrieve a value of that type (e.g., va_arg (ap, int);)
    • call va_end with ap after all arguments have been processed, but before the function returns (e.g., va_end (ap);)
    • if the argument types vary, use a switch to control the particular call made to va_arg
  • typically pass the number of variable arguments as a parameter to the function
  • the necessary type (va_list) and macros (va_start, va_arg, and va_end) are defined in stdarg.h
  • another use: operator node in a parse tree (with a variable number of operands) (see [LYT])
  • #include <stdarg.h>

    void f(int nargs, ...) {
    /* the declaration ... can only appear at the end of an argument list */
    
       int i, tmp;
       
       va_list ap;                /* argument pointer */
    
       va_start(ap, nargs);       /* initializes ap to point to the first unnamed argument;
                                     va_start must be called once before ap can be used */
    
       for (i=0; i < nargs; i++)
          tmp = va_arg(ap, int);  /* returns one argument and steps ap to the next argument */
                                  /* the second argument to va_arg must be a type 
                                     name so that va_arg knows how big a step to take */
    
       va_end(ap);                /* clean-up; must be called before function returns */ 
    }
    


Interpretation

  • preprocessing (purges comments)
  • lexical analysis (scanning)
  • syntax analysis (parsing)
  • semantic analysis


Alternative View of Interpretation


Compilation


Low-level view of execution by compilation


Compiler vs. interpreter

  • both compilers and interpreters have a front end which consists of a scanner (lexical analyzer) and parser (syntactic analyzer)

  • compiler
    • a compiler is a program which translates a program in one language (the source language) to an equivalent program in another language (the target language)
    • a compiler is just a translator, nothing more
    • advantages to compilation:
      • fast execution: generates machine code which executes fast
      • compile once, execute multiple times
    • disadvantages to compilation:
      • slow development: vicious compile-run-debug-re-compile cycle
      • less flexibility: most choices in a program are fixed at compile-time (e.g., size of an array)

  • interpreter
    • an interpreter is a software simulation of a machine which natively (i.e., no translation involved) understands instructions in the source language [COPL6]
    • an interpreter provides a virtual machine for a programming language
    • advantages to interpretation:
      • interpreters provide direct support for source-level debugging (e.g., consider a run-time array out-of-bounds error)
      • interpreters lend themselves to late binding
        • dynamic typing (based on run-time data)
        • programming on-the-fly and then interpreting the code at run-time
    • disadvantages to interpretation:
      • slow: decoding high-level statements (which are more complex than machine instructions) is the bottleneck, as opposed to the connection between the processor and memory [COPL6]
      • source program usually occupies more space (i.e., the program manipulated by the interpreter is often stored in a representation which makes interpretation convenient, and this representation is typically not of minimal size)
  • hybrid (compilation-interpretation) systems: Perl, Java; why?
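
The contrast can be sketched in C over one shared tree: the interpreter walks the tree and computes a value, while the compiler walks the same tree and emits code for some target. Everything here is an assumption made for illustration: the Node type, the function names interpret and compile, and the three-instruction stack machine (push, add, mul) used as the target language.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical expression tree: a leaf holds a literal, an inner node an operator. */
typedef struct Node {
    char op;                    /* '+', '*', or 0 for a literal */
    int value;                  /* used when op == 0 */
    const struct Node *left, *right;
} Node;

/* Interpreter: walk the tree and compute the result directly. */
int interpret(const Node *n)
{
    assert(n != NULL);
    if (n->op == 0)
        return n->value;
    int l = interpret(n->left), r = interpret(n->right);
    return n->op == '+' ? l + r : l * r;
}

/* Compiler: walk the same tree, but append code for an assumed stack
   machine to out (a NUL-terminated buffer of capacity cap). */
void compile(const Node *n, char *out, size_t cap)
{
    assert(n != NULL);
    size_t len = strlen(out);
    if (n->op == 0) {
        snprintf(out + len, cap - len, "push %d\n", n->value);
        return;
    }
    compile(n->left, out, cap);
    compile(n->right, out, cap);
    len = strlen(out);
    snprintf(out + len, cap - len, "%s\n", n->op == '+' ? "add" : "mul");
}
```

For the tree of 2 + 3 * 4, interpret returns 14, whereas compile produces the text push 2, push 3, push 4, mul, add (one instruction per line), which still has to be executed by something else. The compiler is a translator only; the interpreter is the machine.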


Calculator (version 3)

Interpreter (evaluator)












Compiler (translator)




















Precedence and associativity in version 3

    /* value stack will be an array of these YYSTYPE's;
       has nothing to do with the union in calc.h */
    %union {
       int literal;       /* integer value */
       char environI;     /* environment index */
       PTnode* nodePtr;   /* node pointer */
    };
    /* generates the following:
    
       typedef union {
          int literal;
          char environI;
          PTnode* nodePtr;
       } YYSTYPE;
       extern YYSTYPE yylval; 
    
       in other words, literals, variables, and node pointers can
       be represented by yylval in the parser's value stack
    
       binds INTEGER to literal in the YYSTYPE union
       associates token names with correct component of the YYSTYPE union
       to generate following code
       yylval.nodePtr = newLiteralOrVariableNode(yyvsp[0].literal); */
    
    %token <literal> INTEGER
    %token <environI> VARIABLE
    %token WHILE IF PRINT
    %nonassoc IFX
    %nonassoc ELSE
    
    %left GE LE EQ NE '>' '<'
    %left '+' '-'
    %left '*' '/'
    %right '^'
    %nonassoc UMINUS
    
    /* binds expr to nodePtr in the YYSTYPE union */
    %type <nodePtr> stmt expr stmtlist
    
    As a language grows in size and complexity, resolving conflicts with precedence and associativity declarations like those above will not scale. It is always preferable to disambiguate the grammar itself.


structures for parse tree nodes in calculator (version 3)














Makefile dependency graph for calculator (version 3)


Makefile for calculator language (version 3)

    SRC = calc
    CC = gcc
    LEX = flex
    LEX_FLAGS =
    YACC = bison
    YACC_FLAGS = -d -t
    
    all: interpreter compiler parsetree
    
    interpreter: lex.yy.o $(SRC).tab.o interpreter.o
            $(CC) lex.yy.o $(SRC).tab.o interpreter.o -lm -o interpreter
    
    compiler: lex.yy.o $(SRC).tab.o compiler.o
            $(CC) lex.yy.o $(SRC).tab.o compiler.o -o compiler
    
    parsetree: lex.yy.o $(SRC).tab.o parsetree.o
            $(CC) lex.yy.o $(SRC).tab.o parsetree.o -o parsetree
    
    lex.yy.o: lex.yy.c $(SRC).tab.h $(SRC).h
            $(CC) -c lex.yy.c
    
    lex.yy.c: $(SRC).l
            $(LEX) $(LEX_FLAGS) $(SRC).l
    
    $(SRC).tab.o: $(SRC).tab.c $(SRC).h
            $(CC) -c $(SRC).tab.c
    
    $(SRC).tab.c: $(SRC).y
            $(YACC) $(YACC_FLAGS) $(SRC).y
    
    $(SRC).tab.h: $(SRC).y
            $(YACC) $(YACC_FLAGS) $(SRC).y
    
    interpreter.o: interpreter.c $(SRC).h $(SRC).tab.h
            $(CC) -c interpreter.c
    
    compiler.o: compiler.c $(SRC).h $(SRC).tab.h
            $(CC) -c compiler.c
    
    parsetree.o: parsetree.c $(SRC).h $(SRC).tab.h
            $(CC) -c parsetree.c
    
    clean:
            -rm *.o $(SRC).tab.h $(SRC).tab.c lex.yy.c interpreter compiler parsetree
    


Some more conceptual exercises

  • what is the difference between '-' expr %prec '^' { $$ = $2*-1; } (in version 2) and '-' expr %prec UMINUS { $$ = $2*-1; } (in version 3)?
  • when either works just as well, in yacc (bottom-up parsing), which is preferable: a left- or right-recursive grammar?
  • when either works just as well, in top-down parsing, which is preferable: a left- or right-recursive grammar?


Memory management questions to ponder

  • why is OperatorNode the last field of the union? will this approach work if it is not the last field?
  • are there any other approaches we can take to laying out the memory for these nodes of the parse tree? how about a union of structs? what are the implications?
  • moral of the story: since C is the lowest high-level language (with little type checking), we can manipulate the compiler into laying out memory in an advantageous way based on how we organize/overlap our memory structures in the program code


New versions of the calculator

  • add a do { .. } while ( ... ); loop
  • Re-instrument version 3 of the calculator so that the integer representing a literal or variable in the PTnode type is wrapped in a struct called LiteralOrVariableNode. Call this approach version 4 (see below).
  • Re-instrument version 4 of the calculator to factor the LiteralOrVariableNode struct into a LiteralNode struct and a VariableNode struct. Similarly, factor the newLiteralOrVariableNode function into newLiteralNode and newVariableNode functions. Call this approach version 5 (see below).
  • Re-instrument version 3 of the calculator to use a different design for the OperatorNode struct. Specifically, instead of a pointer to an array of type PTnode*, make the operands field of the OperatorNode struct be an array of size one of pointers of type PTnode* (as shown below) and dynamically expand it as needed in the newOperatorNode function. Call this approach version 6 (see below). Would this approach work if the union was the first field of the PTnode struct rather than the PTnodeFlag enum? Explain.
  • Re-instrument version 4 of the calculator to use the memory design of version 6. Call this approach version 7.
  • Re-instrument version 5 of the calculator to use the memory design of version 6. Call this approach version 8.
  • Re-instrument version 7 of the calculator to use the memory design depicted rather than a struct containing a union. Call this approach version 9 (a memory overlay approach; see below). Would this approach work if the nodeFlag enum type was not a member of both the LiteralOrVariableNode and OperatorNode struct types, in addition to being a member of the PTnode struct type? Explain. Would this approach work if the PTnodeFlag enum was the last member of the PTnode union? Explain.
  • Re-instrument version 8 of the calculator to use the memory design depicted in version 9. Call this approach version 10 (see below).
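
A rough sketch of the version 6 idea in C follows. The type and function names PTnode, PTnodeFlag, OperatorNode, newOperatorNode, and newLiteralOrVariableNode come from the notes; the field names, the enumerators, and the operand accessor are assumptions, since the original figures are not reproduced here. The size-one array idiom predates C99 flexible array members, and indexing past the declared bound is not blessed by the C standard even though the memory has been allocated, so treat this as an illustration of the layout rather than a definitive implementation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdlib.h>

typedef enum { LITERAL_OR_VARIABLE, OPERATOR } PTnodeFlag;

typedef struct PTnode PTnode;

typedef struct {
    int op;                     /* operator token */
    int nops;                   /* number of operands */
    PTnode *operands[1];        /* declared with one slot, over-allocated as needed */
} OperatorNode;

struct PTnode {
    PTnodeFlag flag;            /* which union member is live */
    union {
        int value;              /* literal value or variable index */
        OperatorNode opr;       /* operands is the trailing field of the node */
    } u;
};

PTnode *newLiteralOrVariableNode(int value)
{
    PTnode *p = malloc(sizeof *p);
    assert(p != NULL);
    p->flag = LITERAL_OR_VARIABLE;
    p->u.value = value;
    return p;
}

PTnode *newOperatorNode(int op, int nops, ...)
{
    va_list ap;
    assert(nops >= 1);
    /* room for the node plus (nops - 1) extra operand pointers */
    PTnode *p = malloc(sizeof *p + (size_t)(nops - 1) * sizeof(PTnode *));
    assert(p != NULL);
    p->flag = OPERATOR;
    p->u.opr.op = op;
    p->u.opr.nops = nops;
    va_start(ap, nops);
    for (int i = 0; i < nops; i++)
        p->u.opr.operands[i] = va_arg(ap, PTnode *);
    va_end(ap);
    return p;
}

/* Hypothetical accessor with a bounds check against nops. */
PTnode *operand(const PTnode *p, int i)
{
    assert(p != NULL && p->flag == OPERATOR);
    assert(i >= 0 && i < p->u.opr.nops);
    return p->u.opr.operands[i];
}
```

A call such as newOperatorNode('+', 2, a, b) allocates one block holding the flag, the operator, the count, and both operand pointers; the extra pointers occupy the bytes immediately after the declared end of the struct.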


structures for parse tree nodes in calculator (version 4)









structures for parse tree nodes in calculator (version 5)














structures for parse tree nodes in calculator (version 6)









structures for parse tree nodes in calculator (version 9)




structures for parse tree nodes in calculator (version 10)




References

    [LYT] T. Niemann. Lex and Yacc Tutorial. ePaperPress.
    [COPL6] R.W. Sebesta. Concepts of Programming Languages. Addison-Wesley, Boston, MA, Sixth edition, 2003.
    [CPL] B.W. Kernighan and D.M. Ritchie. The C Programming Language. Prentice Hall, Upper Saddle River, NJ, Second edition, 1988.
    [UPE] B.W. Kernighan and R. Pike. The UNIX Programming Environment. Prentice Hall, Upper Saddle River, NJ, 1984.
