Programming Languages: Chapter 3:
Scanning and Parsing
Scanning and parsing: the front end
Lexical analysis or scanning
- a lexical analyzer is a software system
that picks out the lexemes, validates
them, and returns a list of tokens
- it is often called a scanner or a lexer
- what does lexical analysis do?
- parcel characters into lexemes
- lexemes fall into token categories
- example: parceling int i = 20; into lexemes
int    reserved word
i      identifier
=      special symbol
20     number
;      special symbol
- principle of longest substring [PLPP] p. 79
- what delimiter did we use? whitespace?
- free-format language: formatting has no effect on program structure
[PLPP] p. 79
- fixed-format language: formatting has an effect on program structure;
early versions of FORTRAN were fixed format (e.g.,
DO 99 I = 1.10 (the assignment DO99I = 1.10 in C) is different from
DO 99 I = 1, 10 (the loop for (I=1; I<=10; I++) in C))
- others: Haskell and Python use layout-based (indentation-sensitive) syntax
- reserved words (cannot be used as a name, e.g., int
in C) vs.
keywords (only special in certain contexts, e.g., main in C)
- returns a stream of tokens
(why stream of tokens and not stream of lexemes?)
- example of coding up a lexical analyzer
- to distinguish between positive numbers
([1-9][0-9]*) and identifiers ([_a-zA-Z][_a-zA-Z0-9]*)
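A minimal hand-written sketch in C of such a lexical analyzer (an illustration, not the chapter's code): it classifies the next lexeme by its first character and applies the principle of longest substring, consuming characters as long as they extend the current token.

```c
#include <assert.h>
#include <ctype.h>

typedef enum { TOK_NUMBER, TOK_ID, TOK_ERROR } TokenType;

/* Scan one lexeme starting at s; store its length in *len and
   return its token category.  The first character decides which
   category we are in; the longest-substring principle then says
   to keep consuming while the lexeme can still be extended. */
TokenType next_token(const char *s, int *len) {
    int i;
    if (s[0] >= '1' && s[0] <= '9') {           /* [1-9][0-9]*          */
        for (i = 1; isdigit((unsigned char)s[i]); i++)
            ;
        *len = i;
        return TOK_NUMBER;
    }
    if (s[0] == '_' || isalpha((unsigned char)s[0])) {
        /* [_a-zA-Z][_a-zA-Z0-9]* */
        for (i = 1; s[i] == '_' || isalnum((unsigned char)s[i]); i++)
            ;
        *len = i;
        return TOK_ID;
    }
    *len = 1;
    return TOK_ERROR;
}
```

Note the longest-substring principle at work: `x20` is one identifier of length 3, not an identifier followed by a number.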
- lex is a UNIX tool which takes a set of regular expressions (in a .l file)
and generates a lexical analyzer in C for those;
each call to yylex() retrieves the next token
- flex is a free reimplementation of lex which also generates a scanner in C
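For instance, a minimal .l specification covering the two token categories above might look like the following sketch (the token codes NUMBER, ID, and ERROR are assumed to be defined elsewhere, e.g., in a header shared with the parser):

```
%%
[1-9][0-9]*             { return NUMBER; }
[_a-zA-Z][_a-zA-Z0-9]*  { return ID; }
[ \t\n]+                ;  /* skip whitespace */
.                       { return ERROR; }
%%
```

Each rule pairs a regular expression with a C action; the generated yylex() runs the action for the longest match and leaves the lexeme in yytext.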
Syntactic analysis or parsing
- parsing validates the order of the tokens in a list of
tokens and, if valid, organizes
this list of tokens into a parse tree
- the system that validates a program string and, if valid, converts it into
a parse tree is called a front end (scanner and parser)
- programs which process other programs, such as
interpreters and compilers, are syntax-directed
- building up a parse tree
- or just simply checking for validity
- need not always actually build the tree;
sometimes a traversal is enough, especially if you are not going on to
semantic analysis or code generation
- a syntactic analyzer is called a parser
- parsing is the process of determining if a string is a sentence
(in some language) and, if so, converting the concrete representation of that
sentence into an abstract representation (e.g., a parse tree) that
facilitates the intended subsequent processing of it.
- performed by a program called a parser
(or syntactic analyzer)
- independent of what you are going to do with the tree
- Scheme (read) facility
- building a shift-reduce parser is a well-studied area of computer science
and rarely done by hand anymore
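Top-down recursive-descent parsers, by contrast, are still routinely written by hand: each nonterminal becomes a function. A minimal recognizer sketch in C (an illustration, not the chapter's code) for a hypothetical toy grammar — it checks validity without building a tree:

```c
#include <assert.h>
#include <ctype.h>

/* Recursive-descent recognizer for the toy grammar
     expr -> term { '+' term }
     term -> DIGIT | '(' expr ')'
   Each nonterminal is a function; *s advances past what it consumes.
   Each function returns 1 on a successful match, 0 otherwise. */
static int expr(const char **s);

static int term(const char **s) {
    if (isdigit((unsigned char)**s)) { (*s)++; return 1; }
    if (**s == '(') {
        (*s)++;
        if (!expr(s) || **s != ')') return 0;
        (*s)++;
        return 1;
    }
    return 0;
}

static int expr(const char **s) {
    if (!term(s)) return 0;
    while (**s == '+') {
        (*s)++;
        if (!term(s)) return 0;
    }
    return 1;
}

/* A string is a sentence iff expr consumes all of it. */
int parse(const char *s) {
    return expr(&s) && *s == '\0';
}
```

Building a parse tree instead of a recognizer would only require each function to return a node rather than a flag.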
- a parser generator is a program that accepts lexical and syntactic
specifications and automatically generates a scanner and parser from them
(i.e., a front end).
- yacc is a UNIX tool which takes a BNF grammar (in a .y file)
and generates a parser in C for the language it defines
- ambiguous grammar: small and leads to a fast parser, but is ambiguous
- unambiguous grammar: large and leads to a slow parser, but has no ambiguity
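The classic expression grammar illustrates this trade-off (the standard textbook example, not taken from the chapter). The small grammar gives 2 + 3 * 4 two parse trees; the larger one encodes precedence and associativity in extra nonterminals, at the cost of more rules and deeper trees:

```
ambiguous:     E -> E + E | E * E | number

unambiguous:   E -> E + T | T
               T -> T * F | F
               F -> number
```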
- SLLGEN is a parser-generator system for Scheme (see [EOPL3] Appendix B)
- tokens are specified using regular expressions
- a context-free grammar is specified using a variation of EBNF
- the sllgen:make-string-parser procedure is used to
automatically generate the scanner and parser;
it returns a procedure that takes a string
and produces an abstract syntax representation (see Chapter
References
[COPL]  R.W. Sebesta. Concepts of Programming Languages.
        Addison-Wesley, Boston, MA, Ninth edition, 2010.
[EOPL3] D.P. Friedman and M. Wand. Essentials of Programming Languages.
        MIT Press, Cambridge, MA, Third edition, 2008.
[PLPP]  K.C. Louden. Programming Languages: Principles and Practice.
        Brooks/Cole, Pacific Grove, CA, Second edition, 2002.