The first real step of the project is building the first phase of a compiler: the scanner (sometimes called a tokenizer). A scanner’s job is to convert a series of characters in an input file into a sequence of tokens – the “words” in the program. So, for example, the input
A := B + 4
Would translate into the following tokens:
IDENTIFIER (Value = "A") OPERATOR (Value = ":=") IDENTIFIER (Value = "B") OPERATOR (Value = "+") INTLITERAL (Value = "4")
The way that we define tokens in a programming language is with regular expressions. For example, a regular expression that defines an integer literal token looks like:
[0-9]+ (read: “1 or more digits”),
while a regular expression that defines a float literal token looks like:
[0-9]+\.[0-9]* | \.[0-9]+ (read: “Either 1 or more digits followed by a decimal followed by 0 or more digits; or a decimal followed by 1 or more digits”)
(See the notes for Lecture 2 for details).
While you can write a scanner by hand, it is very tedious. Instead, we typically use tools to help us automatically generate scanners. The tools we recommend you to use are either
flex (if you’re planning on writing your compiler in C or C++) or
ANTLR (if you’re planning on writing your compiler in Java).
flex is available on most Unixes/Linux (including the ecegrid machines), while ANTLR requires a download. If you want to use other tools to generate your scanner, feel free, but we will be able to provide less help.
If you use ANTLR, you will find its API documentation. If you use
flex, the manual is a good resource, and a working example using
flex and its accompanying tool
bison (which you will use in step 2, and possibly for the token definition on this step) is available here
We will be building a compiler for a simple language called MICRO in this class. The token definitions (written in plain English) are as follows:
an IDENTIFIER token will begin with a letter, and be followed by any number of letters and numbers. IDENTIFIERS are case sensitive. INTLITERAL: integer number ex) 0, 123, 678 FLOATLITERAL: floating point number available in two different format yyyy.xxxxxx or .xxxxxxx ex) 3.141592 , .1414 , .0001 , 456.98 STRINGLITERAL: any sequence of characters except '"' between '"' and '"' ex) "Hello world!" , "***********" , "this is a string" COMMENT: Starts with "--" and lasts till the end of line ex) -- this is a comment ex) -- any thing after the "--" is ignored Keywords PROGRAM,BEGIN,END,FUNCTION,READ,WRITE, IF,ELSE,ENDIF,WHILE,ENDWHILE,RETURN,INT, VOID,STRING,FLOAT,TRUE,FALSE 573 Students add the following keywords: FOR,ENDFOR,CONTINUE,BREAK Operators := + - * / = != < > ( ) ; , <= >=
What you need to do
You should build a scanner that will take an input file and output a list of all the tokens in the program. For each token, you should output the token type (e.g.,
OPERATOR) and its value (e.g.,
The inputs and outputs we will test your program can be found here. Your outputs need to match our outputs exactly (we will be comparing them using
diff, though we will ignore whitespace).
Note that even though our sample outputs combine together a bunch of different tokens as a single type (e.g., all keywords have the token type
KEYWORD), you will be better served by defining every keyword and operator as a different token type (so your scanner will have different tokens for, say,
<), and then writing a little bit of extra code to print the output we expect for that token type.
While it might seem weird, you will need to define a token that eats up any whitespace in your program (recall that your compiler really only sees a list of characters; it has no reason to think that a tab character isn’t an important character). Make sure that when you recognize a whitespace token, you just silently drop it, rather than printing it out.
What you need to submit
All of the necessary code for your compiler that you wrote yourself. You do not need to include the ANTLR jar files if you are using ANTLR.
A Makefile with the following targets:
compiler: this target will build your compiler
clean: this target will remove any intermediate files that were created to build the compiler
team: this target will print the same team information that you printed in step 0.
A shell script (this must be written in bash, which is located at
/bin/bashon the ecegrid machines) called
runmethat runs your scanner. This script should take in two arguments: first, the input file to the scanner and second, the filename where you want to put the scanner’s output. You can assume that we will have run
make compilerbefore running this script.
While you may create as many other directories as you would like to organize your code or any intermediate products of the compilation process, both your
Makefile and your
runme script should be in the root directory of your repository.
Do not submit any binaries. Your git repo should only contain source files; no products of compilation.
You should tag your step 1 submission as
This step is graded by using the
diff -b <our_output> <your_output> command