In this first phase, your job is to implement a scanner for the SnuPL/2 language as specified in the term project overview.
[[toc]]
The scanner takes a character stream as its input and produces a stream of attributed tokens as its output.
The scanner must correctly recognize and tokenize all keywords, identifiers, numbers, operators (assignment, binary and relational), comments, and the syntax elements of the SnuPL/2 language. The scanner should scan the input until the end of the input character stream is reached or an error is detected. The syntax of SnuPL/2 can be found here.
The scanner must scan but ignore white-space and single-line comments. Character constants (enclosed in single quotes) and strings (enclosed in double quotes) must be recognized as a whole and returned to the parser as a single token with the token value set to the single character/character sequence. Numbers are scanned as strings and returned to the parser as a single token. The conversion from string to number will later be performed in the parser, i.e., you need to return the lexeme (string) representing the number, not the number itself.
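To illustrate the last point, a number-scanning helper might look like the following sketch (the function name `scanNumber` and its interface are hypothetical, not part of the skeleton): it accumulates consecutive digits into a string and returns the lexeme without converting it.

```cpp
#include <cctype>
#include <string>

// Hypothetical sketch: accumulate consecutive digits starting at position
// 'pos' of 'input' and return the lexeme as a string. The conversion to an
// integer value is deliberately left to the parser.
std::string scanNumber(const std::string &input, size_t &pos) {
  std::string lexeme;
  while (pos < input.size() && isdigit((unsigned char)input[pos])) {
    lexeme.push_back(input[pos]);   // keep the character, do not convert
    ++pos;
  }
  return lexeme;
}
```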
We provide a skeleton for a scanner/parser framework so that you don't have to write the entire framework from scratch and can focus only on the interesting parts. The provided skeleton code implements lexical analysis for the language "SnuPL/-1". SnuPL/-1 is a stripped down version of SnuPL/2 but will help you to understand how to use the SnuPL framework. The syntax definition of SnuPL/-1 is provided here.
The scanner skeleton can be found in the files `snuplc/src/scanner.[h/cpp]`. The header file defines the token type `EToken` that represents all classes of lexemes we implement. You will need to add additional tokens to this data type to implement SnuPL/2. There are two corresponding data structures (`ETokenName` and `ETokenStr`) that are used to output the token names in human-readable form in `scanner.cpp`.
`ETokenName` contains the token type in string format, while `ETokenStr` can be used to print the token name along with its attribute (the lexeme). Since the elements of `ETokenStr` are fed to `printf`, you can print the lexeme by inserting a `%s` placeholder somewhere in that string.
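As an illustration of this pattern, here is a simplified, hypothetical mirror of how a token enum and the two parallel string tables interact (the real `EToken`, `ETokenName`, and `ETokenStr` in the skeleton contain many more entries and may differ in detail):

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical, simplified mirror of the skeleton's pattern: an enum of token
// classes plus two parallel string tables indexed by the enum value.
enum EToken { tNumber, tIdent, tUndefined };

static const char *ETokenName[] = {
  "tNumber",                 // token type as a plain string
  "tIdent",
  "tUndefined",
};

static const char *ETokenStr[] = {
  "tNumber (%s)",            // %s is replaced by the lexeme when printed
  "tIdent (%s)",
  "tUndefined (%s)",
};

// Render a token plus its attribute into a buffer, the way the skeleton
// feeds ETokenStr entries to printf-style formatting.
int formatToken(char *buf, size_t n, EToken t, const char *lexeme) {
  return snprintf(buf, n, ETokenStr[t], lexeme);
}
```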
The token class `CToken` is fully implemented and functional; you do not have to modify it. The scanner `CScanner` is also fully functional, but only recognizes SnuPL/-1. You will need to modify the function `CToken* CScanner::Scan()` to accept all possible tokens of SnuPL/2.
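A sketch of the kind of dispatch logic a `Scan()`-style routine performs might look like this (the token names and the helper's interface are illustrative, not the skeleton's API):

```cpp
#include <cctype>
#include <string>

// Illustrative token classes; the skeleton's EToken has many more.
enum EToken { tNumber, tIdent, tAssign, tUndefined };

// Hypothetical sketch: classify the token starting at 'pos', store its
// lexeme, and advance 'pos' past it.
EToken scanOne(const std::string &in, size_t &pos, std::string &lexeme) {
  lexeme.clear();
  char c = in[pos];
  if (isdigit((unsigned char)c)) {
    while (pos < in.size() && isdigit((unsigned char)in[pos])) lexeme += in[pos++];
    return tNumber;
  }
  if (isalpha((unsigned char)c)) {
    while (pos < in.size() && isalnum((unsigned char)in[pos])) lexeme += in[pos++];
    return tIdent;    // a real scanner would check a keyword table here
  }
  if (c == ':' && pos + 1 < in.size() && in[pos + 1] == '=') {
    lexeme = ":="; pos += 2;
    return tAssign;   // SnuPL/2 assignment operator
  }
  lexeme = c; ++pos;
  return tUndefined;  // error recovery: keep the bad lexeme as the attribute
}
```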
Properly scanning characters and strings is quite difficult. For that reason, the skeleton contains a method `CScanner::GetCharacter()` that can be used to parse the (possibly escaped) characters in a string or character constant.
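Escape handling along these lines could be sketched as follows (a hypothetical helper; the actual signature and supported escape set of `GetCharacter()` in the skeleton differ):

```cpp
#include <string>

// Hypothetical sketch: read one (possibly escaped) character from 's',
// advancing 'pos' past what was consumed.
char decodeEscaped(const std::string &s, size_t &pos) {
  char c = s[pos++];
  if (c != '\\') return c;          // ordinary character
  char e = s[pos++];                // character after the backslash
  switch (e) {
    case 'n':  return '\n';
    case 't':  return '\t';
    case '0':  return '\0';
    case '\\': return '\\';
    case '\'': return '\'';
    case '"':  return '"';
    default:   return e;            // unknown escape: pass through (assumption)
  }
}
```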
Last but not least, the sources also contain a simple test program for your lexical analyzer. The test program instantiates a `CScanner` object and repeatedly calls `CScanner::Get()` to retrieve and print the next token until the end of the file is reached. To build the test program, run the following command in the `snuplc/` directory:

```
snuplc $ make test_scanner
```
You can invoke `test_scanner` with a file name as an argument:

```
snuplc $ ./test_scanner ../test/scanner/test01.mod
```
In the directory `test/scanner/` you will find a number of test files for the scanner.
We advise you to create your own test cases to test special cases; we have our own set of test files to test (and grade) your scanner.
Hints:
The first phase is pretty straightforward to implement. Two points are noteworthy:
- Error recovery: unrecognized lexemes should be handled by returning a `tUndefined` token with the illegal lexeme or a custom error message as its attribute.
- Handling of comments and whitespace: consume all whitespace and comments in the scanner (i.e., do not return a token for whitespace or comments).
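The second hint can be sketched as a small helper that runs before each token is scanned (hypothetical code, assuming `//` introduces a single-line comment; the skeleton's actual implementation differs):

```cpp
#include <cctype>
#include <string>

// Hypothetical sketch: consume whitespace and "//" single-line comments so
// that 'pos' ends up at the first character of the next real token.
void skipWhitespaceAndComments(const std::string &in, size_t &pos) {
  while (pos < in.size()) {
    if (isspace((unsigned char)in[pos])) { ++pos; continue; }
    if (in[pos] == '/' && pos + 1 < in.size() && in[pos + 1] == '/') {
      while (pos < in.size() && in[pos] != '\n') ++pos;  // skip to end of line
      continue;
    }
    break;  // next character starts a real token
  }
}
```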
Our SnuPL/2 compiler and the skeleton code are fully documented with Doxygen. You can generate the documentation from your sources by running

```
snuplc $ make doc
```

from the `snuplc` directory. You will need to install Doxygen and Graphviz (dot) on your machine.
- Source code of the scanner. Document your code properly, including Doxygen comments for all new classes, member functions, and fields. Please do not include any generated files (documentation, relocatable object files, binaries) in your GitLab repository; we will compile your code ourselves.
- A brief report describing your implementation of the scanner in PDF format. The report must be stored as `reports/1.Lexical.Analysis.pdf`.
You can use the reports from the individual phases to compile your final report at the end of this semester. Note that the reports are almost as important as your code. Make sure to put sufficient effort into them!
Implementing a compiler is challenging and requires a bit of time. Do not hesitate to ask questions in class and on Slack. Also, start as soon as possible; if you wait until a few days before the deadline, we cannot help you much and you may not be able to finish in time.
Happy coding!