Step 1: Scanner (tokenizer) -- Due: Friday, September 1st, 11:59pm

Please read this handout carefully before submitting the project.

Introduction

Your task in this step is to generate a scanner (also referred as tokenizer) for the project's grammar. A scanner is a program that takes a sequence of characters (the source file of the program) and produces a sequence of tokens that will be used to feed the compiler's parser.

The figure below shows the general structure of a simple compiler similarly to the one presented in the Fischer & LeBlanc's book. The picture highlights the part of the compiler that will be developed at this step and conceptually relate it to other parts that will be developed in the future.

Description of a simple compiler

Scanner Generation Tools

The scanner's source code is normally generated by a scanner generation tool. You can use one of the tools available to generate your scanner in this step. These tools normally work by taking the token definitions expressed by regular expressions and generates the source code for the scanner automatically (for this project, the tokens are specified in the language's grammar). The programmer has to add the code to handle the scanner's output. For example, at this step you will print all the tokens in the standard output. In the step 2 of this project you will modify your scanner to feed the parser replacing the print routines by calls to the parser and passing the tokens as parameters. In order to learn how to merge the code of the scanner generator with the rest of your source code, please, read the user's manual of the tool you decide to use. Some useful links are posted on the main page of the project

Some popular tools are:

Step 1: The compiler's tokenizer (scanner)

At the end of this step, the program developed (the very first part of our compiler) should be able to open a program's source file written in the LITTLE LANGUAGE and recognize its tokens. At this step, the output of your program should be the prints of each token's type and value (see below). In the next step the same output will be used to feed the parser.

Your scanner program should be able to open and read LITTLE source file and print the all the valid tokens within the source file and their respective type in the standard output. You might want to redirect the output to a file and compare your results with the output files provided for the testcases.

Testcases:

Outputs:

See grammar file for token definitions (Keywords, Operator, Literals)

Output Format

The program is expected to print each token in a predefined format. Here is an example

Token Type: INTLITERAL
Value: 20

Please read the output samples for more details.

Running your scanner

This is how we will test your scanner in Java:

java -cp classes/:lib/antlr.jar Micro fibonacci.micro > fibonacci.scanner

This is how we will test your scanner in C/C++:

Micro fibonacci.micro > fibonacci.scanner

Note that the input is being read in from a file passed as a command line argument, not redirected from STDIN, while the output is being redirected from STDOUT. If you are using a different language that C/C++ or Java, include a shell script in your submission called "Micro" that provides the same usage interface as in the C/C++ case above

Note: This step will be graded automatically and the outputs generated by your code will be directly compared with the expected outputs using "diff -b -B" command. Please make sure that your outputs are identical to the sample outputs provided (use diff -b -B yourself).

Project template

To help you start the project we are providing an ANTLR v4 template that has the latest version of antlr for step1. (A template with older version is here) If you are using C/C++, you should still download the template but modify the Makefile and directory structure according to this page. (You won't need the .jar, .java, and .g4 file, so delete them, and you will have to modify the Makefile)

Directory Structure

Carefully read the instructions about the directory structure here.

Readings

Several books introduce the theory necessary for a good understanding of lexical analysis and how a compiler tokenizer works, as well a good description of how a simple compiler is structured.  You may find it helpful to check some of the following references:

Alfred v. Aho, Ravi Sethi, Jeffrey D. Ullman, "Compilers: Principles, Techniques, and Tools", Addison-Wesley, 1986.
A simple one-pass compiler - Chapter 2
Lexical analysis - Chapter 3

Andrew W. Appel, "Modern Compiler Implementation in Java", Cambridge Univ. Press, 1998.
Lexical Analysis - Chapter 2

Charles Fischer, Richard LeBlanc, "Crafting a Compiler with C", The Benjamin/Commings Publishing Company, Inc, 1991.

A simple compiler - Chapter 2
Scanning - Theory and Practice - Chapter 3

Q&A

Submitting your project

First, make sure that your directory structure is correct and you are submitting only the required files. Then, follow the instructions about the turnin command and note that the step name for this step is step1 (use -p step1). Also, if you are working in a pair, submit from the same login (also called your group account) every time.