CUP2 User Manual

CUP2 is a system for generating parsers from concise specifications. CUP2 is completely written in Java and combines the algorithmic ideas from its predecessor CUP with the features of modern Java.

While CUP supports LALR(1) only, CUP2 includes LR(0), LR(1) and LALR(1) out of the box and allows easy extensions for other grammar classes.

Unlike many other traditional parser generators (also including CUP) CUP2 does not create parser source code based on a separate grammar definition language, but uses Java API calls combined with data structures to represent grammar-related information. This means, there is no special language to define grammars and semantic actions, but everything is done in pure Java. Even more, CUP2 disclaims of generating a parser as a source code file, but rather produces a stream of serialised Java code, which can then be loaded by the CUP2 runtime classes.

2. Short Guide: Introduction

Note: To understand the following short guide, you should have a basic understanding of context-free grammars and what a parser generator is used for. If not, please consider reading introductory texts like Aho, Sethi and Ullmann's "Compilers: principles, techniques, and tools", Andrew W. Appel's "Modern compiler implementation in Java" or the Wikipedia articles about context-free grammars and compiler-compilers before.

This short guide presents how to use CUP2 to create a parser for your language by means of a simple example. In this example we implement a grammar for arithmetic expressions by first setting up the CUP2 environment, and then specifying a CUP2 grammar based on a grammar for arithmetic expressions. Finally, we define the semantic actions of the parser, producing a syntax tree representation of expressions. Finally the parser is created and deployed. You can download the complete example from the repository.

Prerequisites: The only thing you need to download is the recent CUP2 release. It works on any Java 1.5 compatible development kit. For this introductory example, it is sufficient to download the plain cup2gen.jar file containing the pre-built binaries. Then, you only have to include the cup2gen.jar file into your classpath.

2.1. Defining the Grammar

CUP2 uses Java data structures for defining a grammar specification. In CUP2, a specification class must be created, extending the edu.tum.cup2.spec.CUPSpecification class.

15.public class SampleSpec

16. extends CUPSpecification

The classical grammar for simple arithmetic expressions over integers in BNF, obeying the operator precedence rules, might look like:

e → e + t | t
t → t * f | f
f → ( e ) | NUMBER

To specify a parser based on this grammar, our first step is to declare the symbols of the alphabet for our grammar in CUP2. For defining terminal and non-terminal symbols, we have to declare two enums within our class which must be called Terminals and NonTerminals, implementing the Terminal and NonTerminal interface from the edu.tum.cup2.grammar package.

//terminals (tokens returned by the scanner)
public enum Terminals implements Terminal
{
        PLUS, TIMES, LPAREN, RPAREN, NUMBER;
}
 
//non-terminals
public enum NonTerminals implements NonTerminal
{
        e, t, f;
}

In general a grammar consists of a list of productions. Such productions are created by calls to the prod method, which is provided by CUP2. The prod method expects a non-terminal (the left-hand side) and for each alternative right-hand side a list of symbols, grouped via the rhs method. Finally the productions, created in this way, are declared as grammar by passing them as parameters to a call of the grammar method from the constructor of our parser class:

grammar(
            prod(e,       rhs(e, PLUS, t),
                          rhs(t)          ),  
            prod(t,       rhs(t, TIMES, f),
                          rhs(f)          ),
            prod(f,       rhs(LPAREN, e, RPAREN),
                          rhs(NUMBER)     )
);

2.2. Semantic Actions

So far, our parser generated for the specification above only indicates success or failure of a parse, i.e. it accepts those words which can be created by this grammar and throws an exception if the word cannot be generated by this grammar. The next step consists in telling the parser, how to build a syntax tree from these productions. To achieve this, each production can be extended by semantic actions. These actions are called whenever the corresponding production is reduced. The return value of such an action is then associated with the production's left-hand side. For example, let the parser run on a sequence of symbols like e + t. We then want to obtain an instance like new BinaryExpression(e,Operators.PLUS,t). As there may exist different variants for a production's right hand sides, there may be different actions, providing different return values for a non-terminal. The type of a symbol thus needs to be general enough, to permit all possible variants. We need to specify this type for each symbol (with a value) in the body of our parser class:

public class NUMBER extends SymbolValue<Integer>{};
public class e      extends SymbolValue<Expression>{};
public class f      extends SymbolValue<Expression>{};
public class t      extends SymbolValue<Expression>{};

For each symbol with a value, create an instance of the edu.tum.cup2.semantics.SymbolValue class, parametrised with the appropriate value class. The name of this instance must be exactly the same as the symbol (case-sensitive). In our example, neither the terminals PLUS, TIMES nor the parentheses (LPAREN,RPAREN) carry any values, so they are omitted in the type declaration. We declare that NUMBER carries integer values (as line 32 illustrates). All our non-terminals and consequently our productions are based on a common class Expression (cf. lines 33-35). . For this consider the following listing:

grammar(
            prod(e,       rhs(e, PLUS, t),
                 new Action(){ public Expression a(Expression e, expression t) 
                         {return new BinaryExpression(e,Operators.PLUS,t); } }, 
                          rhs(t),  
                 new Action(){ public Expression a(Expression t) 
                         {return t; } }   ), 
            prod(t,       rhs(t, TIMES, f),
                 new Action(){ public Expression a(Expression t, expression f) 
                         {return new BinaryExpression(t,Operators.TIMES,f); } }, 
                          rhs(f),  
                 new Action(){ public Expression a(Expression f) 
                         {return f; } }   ), 
            prod(f,       rhs(LPAREN, e, RPAREN),
                 new Action(){ public Expression a(Expression e) 
                         {return e; } }, 
                          rhs(NUMBER),
                 new Action(){ public Expression a(Integer n) 
                         {return new IntegerConstExpression(n); } })
);

Let us look at lines 42 to 44 in more detail. We extend the parameters of the production method by instances of the edu.tum.cup2.semantics.Action class. This class allows to specify a method a, which provides the semantic action code, which is executed, when the previously specified rhs block is reduced. For each non-terminal of the corresponding rhs which has been assigned a type via the SymbolValue class, there must be a corresponding formal parameter of a. In our case, we have e.g. the non-terminals e and t, which lead to parameters of the same name and are of type Integer since they are specified as SymbolValue<Integer>. The PLUS terminal was not declared to carry a value and thus it is omitted in the parameter list of a.

2.3. Creating and Using the Parser

CUP2 is not really a "parser generator", since it does not generate parser source code. Instead, it generates a parsing table which is then processed by a parsing table interpreter, acting as a general parser-driver.

Thus, first we have to create the parsing table and then we are able to initialise the parser with it. This is shown in the following listing. Since we do not want to re-compute the parsing table each time before we initialise the parser, the parsing table can be serialised to a file, as shown in section 3.3..

1.LRParsingTable table = new LR1Generator(new SampleSpec()).getParsingTable(); //here, we use the LR(1) generator

2.LRParser parser = new LRParser(table);

Now, we can feed the parser with input data. In most cases, we will link a lexer generated by a scanner generator to the parser, like in the following listing (see section 3.2. for more information how to connect a scanner like JFlex to CUP2).

Expression result = (Expression)parser.parse(
    new GeneratedScanner(new FileReader("TestInput.txt"))); //contains "13+25*23+4/2*((4+3)*2)*5"
//result must be 728
assertEquals(result.evaluate(), 728);

3. CUP2 in Depth

Next we present a complete description of all parts of a CUP2 parser specification. A complete parser specification is a subclass of CUP2Specification which consists of

Symbol Declarations

classifying symbols as terminals or non-terminals
types for symbols

a Grammar Specification with

Production declarations
Action codes
Precedences

Installation: To use all features of CUP2 within your Java application, you only have to include the cup2gen.jar file into your classpath.

3.1. Specifying the Parsing Rules

3.1.1. Symbol Declarations

First, you have to define the terminal and non-terminal symbols occurring in the productions of your grammar. Terminal symbols are those tokens, from which the input of the parser is composed of and they are in generally produced by a lexical analyser. The non-terminal symbols are the inner nodes of the parse tree. As each subtree of depth 1 of the parse tree corresponds to a production of the grammar, we have to use a non-terminal symbol as a left-hand side symbol for each production. Before usage, each of these symbols has to be declared in Java. This is achieved by creating a new enum called Terminals (implementing the Terminal interface) and another enum called NonTerminals (implementing the NonTerminal interface) within Your CUP2 specification class, each of them containing the particular symbol names. It doesn't matter whether the symbols consist of uppercase or lowercase letters. Here is an example of such two enums:

//terminals (tokens returned by the scanner)
public enum Terminals implements Terminal
{
        PLUS, TIMES, LPAREN, RPAREN, NUMBER;
}
 
//non-terminals
public enum NonTerminals implements NonTerminal
{
        e, t, f;
}

Technical remark: It is mandatory to name the two enums precisely Terminals and NonTerminals. CUP2 uses reflections to search for enums of these names to learn about the symbol names of the grammar.

The specification of the symbols above provides names for symbols, which can be used within productions. Note that with CUP2 it is possible to associate a class type with each symbol. Terminal symbols (created by a scanner) can carry a value, while non-terminal symbols can be filled with information obtained during the parse run. To declare a type for a symbol, You have to declare inner classes within Your CUP2 specification, each of them named exactly the same as the enums in the symbol specifications above. Each such class must inherit from the SymbolValue class which is parametrised with the actual type of the symbol. You can see an example of such a type assignment here:

public class NUMBER extends SymbolValue<Integer>{};
public class e      extends SymbolValue<Expression>{};
public class f      extends SymbolValue<Expression>{};
public class t      extends SymbolValue<Expression>{};

Technical remark: Again, it is mandatory to use exactly the same names here, as You did when specifying the symbols within the two enums Terminals and NonTerminals. CUP2 uses reflection to recognise the names of the symbols.

3.1.2. Defining Grammar Rules

A grammar consists of a set of productions. Each production has a left-hand side non-terminal and a list of right-hand side symbols. This hierarchy is reflected in the class hierarchy of the CUP2 library classes. CUP2 provides the classes and interfaces Grammar, Production, Symbol (either Terminal or NonTerminal) to represent grammars. When defining the grammar, we create productions out of left- and right-hand sides and add them to the CUP2 grammar specification. You can instantiate these classes manually, using their constructors, or more conveniently use the provided builder methods of the CUP2 specification class, as described in the following. Next, we want to define the following grammar for arithmetical expressions in CUP2:

res → expr SEMI
expr → NUMBER
expr → expr PLUS expr
expr → expr TIMES expr
expr → LPAREN expr RPAREN

We now use the builder methods grammar, prod and rhs to first group the right-hand side symbols of a production and then pass this production to the grammar. The constructor of the CUP2 specification is as follows:

grammar(
    prod(res,         rhs(expr, SEMI)),  
    prod(expr,        rhs(NUMBER)),
    prod(expr,        rhs(expr, PLUS, expr)),
    prod(expr,        rhs(expr, TIMES, expr)),
    prod(expr,        rhs(LPAREN, expr, RPAREN))
);

Apart from this direct transfer from grammar to code, we may also join the productions with the same left-hand side to a single production with multiple variations of right hand sides:

grammar(
    prod(res,         rhs(expr, SEMI)),  
    prod(expr,        rhs(NUMBER),
                      rhs(expr, PLUS, expr),
                      rhs(expr, TIMES, expr),
                      rhs(LPAREN, expr, RPAREN))
);

3.1.3. Assigning Semantic Actions to Rules

A parser, constructed by the so far introduced productions is fully functional, i.e. indicates a successful parse run or a failure. In order to compute any value or to construct any representation of the parser's input, we need to equip the parser with semantic actions. Semantic actions in CUP2 are Java methods that contain code which is executed every time the parser decides to reduce the so far recognised symbols according to one of its rules. A semantic action for CUP2 is specified as an anonymous subclass of the Action class, providing the method a. The parameters of a represent the right-hand side symbols of a production, while its return type must match the type associated with the production's left-hand side. Such semantic actions are declared as parameter to the prod builder method, directly following the rhs they belong to. Consider the following example for a semantic action, which is provided for the production expr → expr PLUS expr :

// expr -> expr PLUS expr
prod(
  expr,rhs(expr,PLUS,expr),
  new Action(){
  public Integer a(Integer a, Integer b)
  {
      System.out.println("expr -> expr PLUS expr");
      System.out.println("  a = "+a);
      System.out.println("  b = "+b);
      return a+b;
  }}
)

Note that the types of the parameters a and b must match the types they have been associated with in the symbol declaration step. Also note that symbols without an associated type (like PLUS in the upper example) do not occur in the parameter list of an action.

It is not very common that there exists the need for semantic actions, before the sequence of symbols of a right-hand side of a production is recognised completely. In these cases, CUP2 permits the use of so-called inner actions. In contrast to ordinary actions, inner actions are not triggered when a production is reduced, and thus it has to be flagged, at which position within the sequence of right-hand side symbols this action has to be triggered. An inner action is marked with the special terminal $a. The type of this terminal is determined by the type of the production's left-hand side. It may be accessed in the action code of later occurring actions which are part of the production. Therefore consider the following example:

// expr -> expr PLUS [$a] expr
prod(
  S,rhs(expr,SpecialTerminals.$a,expr),
 
  // inner action modifying symbol a 
  new Action() {
  public Integer a(Integer a)
  {
      System.out.println("expr -> expr . PLUS expr");
      System.out.println("  a = "+a);
      return Math.abs(a);
  }},
 
  // semantic REDUCE action processing a,b
  // and the result of the inner action
  new Action() {
  public Integer a(Integer a, Integer resultOfInnerAction, Integer b)
  {
      System.out.println("expr -> expr PLUS expr.");
      System.out.println("  a                       = "+a);
      System.out.println("  result of inner-action  = "+resultOfInnerAction);
      System.out.println("  b                       = "+b);
      return resultOfInnerAction+ Math.abs(c); // returns "|a|+|b|"
   }}
)

For more examples have a look at edu.tum.cup2.test.semanticShiftActions.*

Technical remark: The parser generator does not create inner actions as special cases, but emulates them with ordinary reduce actions. For every semantic shift action X indicated by the special terminal $a, an auxiliary non-terminal HX is created which then replaces $a. An additional auxiliary production HX → ε is created and assigned to a semantic reduce action which contains the code of the semantic shift action X. Note that using inner actions, different variants of the right-hand side may lead to shift reduce conflicts.

Technical remark: CUP2 tries to verify that actions match the productions they are associated with. First, CUP2 compares the number of supplied arguments to the number of symbols which are associated with types (and thus can hold values). In more detail, CUP2 checks, whether the action parameters and return types match with the types, associated with the symbols. However, as the type information is checked at runtime, CUP2 cannot distinguish between generic types (i.e. List<String> == List == List<Integer>) in the current version.

3.1.4. Eliminating Ambiguities in Productions

Experienced users will have already noted a problem with the example grammar from section 3.1.2.: It is ambiguous. An ambiguous grammar is a grammar which, given a certain input, can reduce the parts of the input in two different ways such as to give two different answers. Take the grammar from section 3.1.2. for example, given the following input: 3 PLUS 4 TIMES 5. The grammar can either create an expression tree evaluating to 35 or to 23, depending on which sub-expression is evaluated first. Such ambiguities result in so-called shift-reduce or reduce-reduce conflicts. In our example, creating a parser with the naive grammar will yield a shift-reduce conflict. This results from: under the symbol TIMES, the parser is not able to decide, whether to shift the symbol TIMES onto the stack or whether to reduce 3 PLUS 4. Thus, when the parsing table has to be generated, an exception occurs: a ShiftReduceConflict is thrown.

However, CUP2 offers a construct for a useful parsing with ambiguous grammars, by specifying precedences and associativities of particular terminal symbols. These precedences can be seen as advice, which transition in the automaton has to be preferred when creating the parsing table. Precedences resolve shift reduce conflicts. In our example, specifying TIMES to have a higher precedence than PLUS would solve these conflicts, as the multiplication will be performed before the addition. Consequently, in case of a shift reduce conflict the parser determines whether the terminal to be shifted has a higher precedence or if the production to reduce has a higher precedence. In the first case the terminal is shifted, otherwise a reduce is performed. However, there is still a problem if the terminals have equal precedence. Let us take the input 3 PLUS 4 PLUS 5 as an example. We overcome this by specifying the associativity of a terminal symbol, which are also used to resolve shift reduce conflicts. When using left associativity, the leftmost occurrence of the symbol is reduced first, i.e. 3 PLUS 4 is preferred. When using right associativity 4 PLUS 5 is the preferred reduction, i.e. that it is shifted onto the stack, hence, the reductions will take place from right to left. nonassoc states that there is no associativity for terminals with a certain precedence (e.g. for a compare operation like 3 == 4 == 5) TODO: Does this make sense? If there are two consecutive occurrences of terminals with equal precedence and the associativity nonassoc a ConsecutiveNonAssocException is thrown. Summarising, we have three different kinds of associativity in CUP2:

left
right
nonassoc

In order to specify precedences and associativities in CUP2, the precedences method in the constructor of the specification class has to be called. The precedences methods gets passed (multiple) associativities of symbols. Their order in the parameter list determines the precedences of the symbols. Conveniently, associativities of symbols are obtained through return values of calls to left, right or nonassoc. These methods will internally create the appropriate associativity class, e.g. LeftAssociativity. For an example on how to specify precedences and associativities, consider the following code:

//precedences
precedences(
    left(TIMES),
    left(PLUS));

More experienced grammar authors may encouter situations, in which it is helpful to fine-tune precedences of specific production alternatives. For these cases, CUP or YACC permitted to annotate grammar alternatives with a %prec TERMINAL term, which adjusted this productions alternative precedence to the precedence of the specified terminal. CUP2 also supports the concept of explicitely defining the precedence of a production alternative. Consider the following grammar:

precedences(left(UMINUS),left(TIMES),left(MINUS,PLUS));
 
grammar(
    prod(res,         rhs(expr, SEMI)),  
    prod(expr,        rhs(NUMBER),
                      rhs(expr, TIMES, expr),
                      rhs(expr, PLUS, expr),
                      rhs(expr, MINUS, expr),
                 prec(rhs(MINUS, expr),UMINUS),
                      rhs(LPAREN, expr, RPAREN))
    );

We now introduced binary and unary minus to our expression language, resolving the pending conflicts by giving the UMINUS alternative the higher precedence compared to the ordinary MINUS.

Technical remark: The following diagram shows the classes used to implement precedences and associativities and their relation.

3.2. Connecting Syntactical and Lexical Analysis

CUP2 concentrates on syntactical analysis only. Thus, it does not read raw character input from files directly. Instead, CUP2 expects a lexical analyser to split the raw input into tokens (such as keywords, numbers and special symbols), annotated with further information, e.g. the value of an integral constant. This task is in general performed by a scanner. CUP2 offers a concise interface for scanners. This interface can either be implemented by hand or one can take one of the established scanner generators like JFlex. JFlex is a well-known scanner generator written in Java and works with CUP2 smoothly.

3.2.1. General Interface

Connecting a scanner to CUP2 is simple. The scanner class must implement the interface edu.tum.cup2.scanner.Scanner, which consists of the single method:

12.public ScannerToken<Object> readNextTerminal() throws java.io.IOException;

Each time CUP2 needs to process a new symbol, the readNextTerminal method of the scanner is called. The scanner has to return the next terminal in the input stream, possibly annotated with a value. This is dependent on whether the terminal symbol was declared to carry data or not within your grammar specification in section 3.1.1.. For instance, if You use a SymbolValue<Integer> for the NUMBER terminal from the above example, an integer value must be assigned to each NUMBER terminal which is returned to the parser.

Symbols are instances of the edu.tum.cup2.scanner.ScannerToken<T> class, parametrised by the class T of the value associated with the symbol (or Object for tokens with no value) GOON: create static factory methods. Consider the following example, assuming that NUMBER has a value but PLUS does not:

1.new ScannerToken<Integer>(NUMBER, 5); //create a token with value

2.new ScannerToken<Integer>(PLUS); //create a token with no value

CUP2 supports to annotate line and column information for each symbol. This allows to generate more verbose and informative error messages for the CUP2 framework when parsing fails:

4.new ScannerToken<Integer>(NUMBER, 5, 10, 25); //create a token with value, read on line 10 and column 25

5.new ScannerToken<Integer>(PLUS, 10, 28); //create a token with no value, read on line 10 and column 28

3.2.2. Generating Parsers with JFlex

The easiest way to create a scanner for CUP2 parsers is using the JFlex tool. For a general introduction please refer to the JFlex User's Manual.

Here is a sample JFlex specification for our example:

package my.project;
 
import static my.project.MySpec.NonTerminals.*;
import static my.project.MySpec.Terminals.*;
 
/**
 * MyScanner provides a scanner for the {@link MySpec} grammar.
 * 
 */
%%
%class MyScanner
%cup2
%%
"+" { return token(PLUS); }
"*" { return token(TIMES); }
";" { return token(SEMI); }
"(" { return token(LPAREN); }
")" { return token(RPAREN); }
[0-9]+ { return token(NUMBER, new Integer(yytext())); }
[ \t\r\n\f] { /* ignore white space. */ }
. { System.err.println("Illegal character: "+yytext()); }

There are basically three things you should notice when writing a specification for CUP2:

Before the first %% you should define your package and the needed imports. In most cases, these are just the static imports of your terminals and non-terminals. You can also add your class comment here.
In the second part, just name your class after %class and add the %cup2 option. This automatically adds support for Unicode, line numbering and column numbering as well as the correct interface to the CUP2 scanner and some convenience methods which we will use in the third part.
In the last part, define the token classes as described in the JFlex manual. You can use the token factory methods to create the terminal symbols for CUP2. There are two versions of token: The first version expects a terminal as its single parameter and is used for terminals with no semantic value. The other version, shown in line 19, has a second parameter which contains the semantic value of the terminal.

Using this specification, you can create your scanner class by simply calling java -jar JFlex.jar MyScanner.jflex or using the appropriate ANT task. Finally, add this class to your classpath and you're done.

3.3. Running the Parser

After the previous sections, You created a CUP2 specification ready to be transformed into a parser. A CUP2 parser works with a static code part which executes the main parser driver. This parser driver is in practice parametrised with the concrete parsing table of the particular parser and the class files, containing the grammar and action specifications. In order to parse a file with CUP2, You have to

Instantiate a parser generator with an algorithm of Your choice
Create a parse table from Your specification with this generator
Feed the parser driver with this parse table
Run the parse driver on the input

Since generating the parsing table is in general a costly operation, You may want to perform this task only once in advance and just use the generated parsing table. This is possible, if You serialise Your parsing table. Furthermore, You may not want to actively write a Java program just to compile the CUP2 specification into the parse table and save it. In this case, You can resort to the provided ANT task.

3.3.1. Manual Creation

CUP2 contains different generator algorithms based on LR-parsing:

edu.tum.cup2.generator.LR1Generator
edu.tum.cup2.generator.LR1toLALRGenerator
edu.tum.cup2.generator.LALR1Generator
edu.tum.cup2.generator.LR0Generator

They all produce LRParsingTable objects which are compatible with the parser driver edu.tum.cup2.parser.LRParser. This parser driver, equipped with a parsing table, can then be used to parse arbitrary inputs. The following example runs a parser based on your.path.SampleSpec on a token stream, provided by MyScanner (c.f. section 3.2.):

try
{
    LALR1Generator generator = new LALR1Generator(new your.path.SampleSpec()); //we want to use LALR(1)
    LRParsingTable table = generator.getParsingTable(); //get the resulting parsing table
    LRParser parser = new LRParser(table); //create a new LR parser using our table
    Object result = parser.parse(new MyScanner("25+17*2;")); //apply parser to a token stream
}
catch (GeneratorException ex)
{
    //creation failed. evaluate ex for more details.
}
catch(Exception ex)
{
    //parsing failed. evaluate ex for more details
}

In most applications, You will separate the parsing table generation process from the application of the finished parser. This can be achieved by using serialisation to e.g. dump the parse table into a file and reload it when using the parser. This can be realised by:

try
{
    LALR1Generator generator = new LALR1Generator(new your.path.SampleSpec()); //we want to use LALR(1)
    LRParsingTable table = generator.getParsingTable(); //get the resulting parsing table
    LRParser parser = new LRParser(table); //create a new LR parser using our table
    LRParserSerialization serial = new LRParserSerialization("myfilename"); //create serializer
    serial.saveParser(parser); //serialize parser
}
catch (GeneratorException ex)
{
    //creation failed. evaluate ex for more details.
}

You can now reload the parse table from the file and use it to parse arbitrary inputs:

//anywhere else (also on other machine, where the file "myfilename" is available)
LRParserSerialization serial = new LRParserSerialization("myfilename"); //create serializer
LRParser parser = serial.loadParser(); //have fun!
try{
  Object result = parser.parse(new MyScanner("34-2*13;"));
}
catch(Exception ex)
{
    //parsing failed. evaluate ex for more details
}

Technical remark: Since the specification is not compiled any more, the amount of classes delivered in the final form can be reduced from the whole package cup2gen.jar to just cup2parser.jar.

3.3.2. Integration in the Build Process

CUP2 can be integrated into the build process of a project via its integrated ANT task. This task allows to generate parsing tables from existing CUP2 specifications. At first, You have to register the CUP2 task in Your ANT script, which can be done as follows:

<taskdef name="cup2"
    classname="edu.tum.cup2.ant.Cup2AntTask"
    classpath="bin" />

You can integrate the CUP2 task inside an ANT target to produce a parse table, as in the following example:

<cup2
   cup2specification="my.project.MySpec"
   Cup2File="MyParseTable.serialised" >
   <classpath>
     <pathelement location="./bin/"/>
   </classpath>
</cup2>
         

TODO: This is going to be messy! All this is not implemented correctly yet!
The absolute minimum of information to be provided is:

Attribute "specification": The fully qualified name of the CUP2 specification for which to generate a parse table
Attribute "parsetable": The name of the file for the parse table
Nested tag "classpath": The path where to search for the CUP2 specification

The further possible options are:

Attribute "parsetableHtml": name of the file for the parse table dump as HTML
Attribute "automatonGraphviz": name of the file for the graphviz automaton dump
Attribute "verbose": values are "on" or "off"
Attribute "algorithm": values are: "lr1"|"lr0"|"lalr1" or an arbitrary fully qualified name of a generator algorithm class.

TODO:But this is still not implemented correctly.

3.4. Handling Input Errors

A final important aspect of building parsers with CUP2 is support for syntactic error recovery. The parsers we built so far work well until the parser reaches a symbol which it did not expect. In these cases an unprepared parser simply stops with an exception. It would be more useful if the parser tries to continue parsing at a point, where it is sure what is meant or even tries to give hints, how to correct the input.

3.4.1. Resume Parsing after Syntactical Errors

One approach to recover from an erroneous symbol sequence consists in error-handling productions. In particular, CUP2 supports a special error symbol (denoted simply by error). This symbol plays the role of a special non-terminal which, instead of being defined by productions, matches an erroneous input sequence.
The error symbol only comes into play, if a syntax error is detected. If a syntax error is detected, then the parser tries to replace some portion of the input token stream with error and then continue parsing. For example, we might have productions like:

s → ε 
s → expr SEMI s
s → error SEMI s
expr → expr PLUS expr 
expr → NUMBER 

This indicates that if none of the normal production for s can be matched by the input, then a syntax error should be declared, and recovery should be made by skipping erroneous tokens (equivalent with matching and replacing them with error) up to a point at which parsing can be continued with a semicolon (and additional context that legally follows a statement). An error is considered to be recovered from if and only if a sufficient number of tokens is past the error symbol can be successfully parsed. (The number of tokens required is determined by registering a ParserInterface as described in section 4.2.).

From the point of view of a grammar writer, there is even the possibility to react to the case when a syntax error occurred. Error productions can be also equipped with an action method, giving the program access to an edu.tum.cup2.semantics.ErrorInformation object, which contains information about the erroneous input fragment:

// NEW error-catching production : s -> error SEMI s
prod(s,       rhs(SpecialTerminals.Error, SEMI, s),
      new Action() {
          public String a(ErrorInformation x, String s)
          { 
              String location = x.getLine()+"/"+x.getColumn();
              return "some bad expression at "+location+";\n"+s; 
          }
      }
),

The symbol error is always associated with the class type edu.tum.cup2.semantics.ErrorInformation. It provides start- and end-position of the whole erroneous symbol sequence as well as a set of expected symbols. It distinguishes between correctly read tokens, bad tokens and the token which is responsible for getting the parser into an erroneous state (crash-token). The crash-token is the first bad token or alternatively the part of some correct input following the error. For example the erroneous input NUMBER PLUS NUMBER NUMBER SEMI (which has no bad tokens) leads to an error covering the expr as correctly read token and the crash-position at the NUMBER. Expected tokens: PLUS, SEMI.
As the computation of correctly read tokens requires some memory overhead, it has to be enabled specifically when parsing by using the saveTokens parameter:

1.parse( inputScanner, true )

3.4.2. Correcting Input by Inserting Tokens

Now, if the parser encounters an error, the only way to recover consists in special error productions (see section 3.4.1.) providing safe continue handles. This feature is called phrase-based error recovery and often widens the error in order to resume parsing because of missing symbols. Instead of giving up parsing, one may try to insert, replace or delete symbols in the input stream.

Repairing a file written in MiniJava: The original input is shown in black, orange tokens have been inserted, underlined text is covered by error productions.

The current version of CUP2 features an implementation of error recovery via the insertion of tokens. This approach yields more detailed error information while reducing the need for phrase-based error recovery. Since tokens may carry a value, and the parser cannot automatically guess what value would make sense for each particular token, the CUP2 specification provides a method to specify default values for freshly inserted tokens via the insert method. Let us have a look at the parameters of the method insert(terminal[, repeat], proposals[, value]):

terminal: Defines the terminals to be inserted. Non-terminals are not supported.
repeat: Optional parameter, which indicates how far this token must be away from another one of the same terminal. The algorithm for token-insertion does not know the semantic background of the insertion and its choice depends on the algorithm which generates the table. Although it tries to use as less tokens as possible, repetitions might occur. Use this parameter if repetitions occur for a token. Use this feature sparely.
proposals: An array of strings of significant proposals for the user, which e.g. might then be displayed as a drop-down menu in a GUI.
value: Optional parameter - if the token has a symbol-value-binding, a default value must be given.

Consider the following example illustrating the usage of method insert:

    /** Tokens to be inserted. Highest precedence : TYPE **/
insert( TYPE          , new String[] { "int" }, 0 );
insert( COMP          , new String[] { "<",">","==","<=",">=","!=" }, Constants.Unknown );
insert( COMMA         , new String[] { "," } );
insert( SEMICOLON, 1  , new String[] { ";" } ); // ";" must not be inserted one after another
insert( BINOP         , new String[] { "+","-","*","/" }, Constants.Unknown );
insert( ASSIGN        , new String[] {" = "} );
insert( IF            , new String[] { "if", "while" } ); // if/while differ mainly in semantics.
insert( RPAR          , new String[] { ")" } );
insert( LPAR, 1       , new String[] { "(" } ); // "(" must not be inserted one after another

However, the technique of token insertion cannot arbitrarily guess identifiers, or decide between alternative tokens without knowledge of the idea behind the program (missing semantics). Thus, token insertion does not always suffice to recover from an error, conserving the need for error-productions.

Attention: To use token insertion for error recovery, CUP2 relies on the programmer to insert a further attribute insert to the terminals in the following way:

public enum Terminals
    implements Terminal
{
    SEMICOLON,
    COMMA,
    LPAR,
...
    COMP,
    BBINOP,
    IDENT;
    public Insertable   insert  = null; // <- tweak
}

Technical remark: The class InsertionScanner is responsible for deciding which tokens to choose. It uses the original scanner and the parsing table in order to create tokens. More details on the algorithm and usage of the scanner can be found in chapter 4.4.. When parsing repaired input, all observers are notified automatically.

public Object repairAndParse(Scanner inputScanner) throws GeneratorException, IOException, LRParserException
{
    LRParsingTable table = new LR1toLALRGenerator(this).getParsingTable();
    Scanner inserter = new InsertionScanner(table,inputScanner);
    return (new LRParser(table)).parse(table,inserter);
}

4. Behind the Scenes

This chapter contains information for developers that want to modify or extend CUP2. If you have any questions or if you think that we should describe a certain topic in more detail, please contact us!

4.1. Adding New Generator Classes

In order to add your own parser generator class to CUP2 you have to insert your parser generator in the generator class hierarchy. (see section 7.2.) This means that your parser class has to be a subtype of the class Generator, either directly or by being derived from another class that is a Generator subtype. When creating a generator for an LR parser (or a parser based on a related algorithm) it is recommended to subclass the class LRGenerator.

4.1.1. Creating a generator class for an LR parser (or related parsing algorithms)

To create such a generator, derive your own generator class from LRGenerator. The main parts of the LRGenerator class are its constructor and the following template methods that a custom parser generator class should override. Those methods are used in the constructor of the LRGenerator class to generate the parsing table for the parser.

public Automaton<I, S> createAutomaton()
protected abstract S createStartState()
creates and returns the start state; This is the item which consists of the start production at position 0.
protected abstract void createReduceActions(LRActionTable actionTable, I item, LRParserState state)
fills the given LRActionTable with reduce actions, using the given non-shiftable item and its parent parser state;

The constructor of LRGenerator is passed to the CUP2 specification for which to generate a parser and furthermore the verbosity (how many debug messages are printed) and a boolean flag that is used to indicate whether the grammar should be extended have to be specified. It is recommended to provide a constructor in your own parser generator class that also gets passed this information and passes it further to the LRGenerator constructor. (Additionally it is useful to provide some kind of convenience constructors which pass default parameters to the LRGenerator constructor. Consider e.g. LALR1Generator.java.)

The method createAutomaton returns an automaton for the grammar. For a separation of concerns in CUP2 there is a separate class hierarchy for Automaton Factories. The root of this hierarchy is the class

AutomatonFactory<I extends Item, S extends State<I>> .

These classes contain code to create the automaton for an LRGenerator. The automaton is similar to a canonical automaton for the grammar. (The states of such an automaton contains the kernels of the states for the canonical automaton that can be extended with the method closure to obtain all the states of the canonical automaton.)

In the current version of CUP2 there are three different concrete Automaton Factories (LR0AutomatonFactory, LR1AutomatonFactory, LALR1AutomatonFactory). Each of them creates a different type of automaton for a certain kind of LR parser.

4.2. Error Recovery

Any parser must be able to cope with syntactically invalid input. It is normally unsatisfactory for the parser to just terminate on detecting an error. Errors must be recovered from in some way, producing an error message, and continuing the process of parsing until the end of input is reached.
CUP2 supports phrase-based error-recovery with dynamic synchronisation. It can skip erroneous tokens based on rules given in the grammar in order to resume safely. It is able to retrieve all correctly read information, which may be used to repair the error. The usage of error-handling productions is explained in chapter 3.4.1..

There may occur two exceptions when parsing erroneous input: If the specification of the grammar does not provide an appropriate error-recovery for an error X, the parser throws a MissingErrorRecoveryException if X occurs. If the end of the input stream is reached too early (while searching for an appropriate token to resume parsing), an EndOfInputstreamException is thrown. In order to deal with all problematic situations equally, a parser observer may be used.

//create the parser
LRParser parser = new LRParser(table);
 
// register an observer
parser.register(
    new ParserObserver()
    {
      public void syntax_error(ErrorInformation error) {
        if (error.isRecovered())
          System.err.println("A syntax error occured :\n"+error);
        else
          System.err.println("An unrecoverable syntax error occured :\n"+error);
      }
    }
);
 
// parse the input
Object result = parser.parse( input );

Picture of recovery-pyramid If the parser encounters an error, this implies that

maybe some tokens have already been read (a) and may have been reduced to non-terminals (b)
maybe some bad tokens follow the current parsing position (c)
parsing continues with a set of well-known tokens (d)

In order to supply the grammar writer with sufficient tokens (from a and c), the parser has to temporarily save correctly read tokens, although they might never be used, and save many tokens in the ErrorInformation object. Because this feature requires a lot of overhead for complex grammars with a big input, by default they are turned off. If this is the case, the error information lacks information about a and c, but its start-, end- and crash-positions are still available.
This feature can be turned on by starting the parser using the parameter saveTokens - i.e. parse( input, true ) in order to obtain instances of ErrorInformation with complete lists of token.

If a token of set d is found, the parser cannot always take the subsequent input as granted, because that token might occur randomly (e.g. due to misprint or as part of something else which can not be read correctly because of the error). In order to deal with such problems, the parser only continues with normal parsing if at least three correct tokens follow the error. This limit is called error-sync-size and it can be changed dynamically during parsing. In order to do so, register your own implementation of ParserInterface:

register( new ParserInterface() {
  public int getErrorSyncSize() {
    return 0; // recover immediately after error, i.e. do never combine errors.
  }
});
 
grammar( // your specification here

Note that too few error-handling productions are used, some errors might not be recovered from. Although if too much or bad error-handling productions are used , the parser is not able to easily recover from an error.
Consider the following grammar:

S → EXPR
EXPR → number
EXPR → EXPR plus EXPR
EXPR → if COND EXPR else EXPR
COND → BRACEL EXPR eq EXPR BRACER
COND → BRACEL EXPR le EXPR BRACER

After inserting an error-rule as the following for the condition rules, error-recovery tries to find a BRACER to continue parsing the condition.

1.COND → BRACEL error BRACER

However, what happens if the user forgot the BRACER? If the terminal following the error-terminal was one which occurred more often, the error-recovery would not assume the error to be that big. Even if the expected token was found, in case of an additional error shortly after it, the subsequent input is assumed to be part of the error until the next RBRACER occurs. If using sync-size correctly, severe expansions of errors may be reduced. Nevertheless the following rule seems more suitable, because it resumes at the next expression, although it extends an error inside the braces from BRACEL to BRACER.

1.COND → error

In order to understand error-recovery in more detail, it is necessary to understand how CUP2 parses. The parser driver maintains a stack to maintain all tokens which do not yet state a full rule as well as values created from other stack-items previously stating a full rule. The parse table contains the possible next action for every pair of stack-item and a terminal which may be read. Every action in the table may push the token onto the stack, reduce the stack-items which state a full rule, or indicate that this token cannot follow the previous input (error).

Error-recovery in CUP2 follows three steps:

If an error is encountered, items will be popped from the stack until a state A is found which is able to catch the error, i.e. it can make use of the error-terminal in order to change the stack to state B.
Tokens from the input stream are read until a terminal X is found which may resume parsing after the error-terminal. This means, that the action found for terminal X under state B in the action table is not an error-action. X is one of the expected terminals.
Parsing continues for error-sync-size tokens whilst all actions are not executed but saved in a list (dry-run). If an error occurs during dry-run, the saved actions are dismissed and the recovery resumes at step 2. Otherwise, the recovery is complete and the saved actions are being executed.

The manner in which the parser cuts back the stack to the place at which an error-terminal can be shifted, fixes the parser into trying to correct the error for a specific construct, which might not be appropriate. TODO : reference to Bruce Hutton - Computer Language Implementation 2006 Lecture Notes

4.3. Parsing Table Serialisation

4.3.1. How to use Serialisation in Cup2

In order to use serialisation an object of type LRParserSerialization has to be created and the file name of the file that is going to be used to store the parser data has to be passed to the constructor. When creating an LRParserSerialization the method saveParser can be used to save a certain parser state to the file. This is shown in the following figure.

LRParser parser = new LRParser(generator.getParsingTable());
 
//create LRParserSerialization object
LRParserSerialization serial = new LRParserSerialization(strFileName);
 
//serialize parser
serial.saveParser(parser);

To restore a parser use the method LRParserSerialization.loadParser.

4.3.2. How it works

The serialisation mechanism in CUP2 is based on the Java serialisation mechanism.

When serialising a parser the parser state (especially the LRParsingTable object) is saved to a file in order to be able to use the generated parser without having to generate the parser again later on.

In the following class diagram all the classes that get serialised are shown.

Semantic actions are treated in a special way when serialising the parser. As we want to keep the specification as simple as possible, the user should not have to care about serialisation when writing the CUP2 specification for a grammar. Therefore the Production class takes care of serialising the semantic actions.

When an action is serialised the Production class saves a class object of the anonymous subclass to the file. When deserialising the parser, the Action gets instantiated via reflections.

To provide good compatibility with different JVM versions, we explicitly configure the Java serialisation to use protocol version 1 of the Java serialisation.

4.4. Token Completion

This chapter goes more into detail on how to use CUP2 token insertion. It is recommended to read chapters 3.4.2. and 3.2.1. to understand the basics.

The class InsertionScanner is responsible for deciding which tokens to choose. It uses the original scanner and the parse table in order to create tokens of type InsertedScannerToken, which contain detailed error-information.

A TokenInsertionScanner may be used e.g. in a GUI to propose solutions to the user or to repair erroneous input and parse it. A call to scan4allInsertions() retrieves all inserted tokens.

public void displayRepairs(Scanner inputScanner) throws GeneratorException, IOException
{
    LRParsingTable table = new LR1toLALRGenerator(this).getParsingTable();
    InsertionScanner inserter = new InsertionScanner(table,inputScanner);
    for (InsertedScannerToken<?> tkn : inserter.scan4allInsertions())
      System.out.println(tkn.getErrorInformation());
}

The algorithm works on a buffer of bufSize tokens (default: 8) which may contain maxOversize additional tokens following the last error in the buffer (default: 8). During token insertion from left to right (into the buffer), the scanner identifies how much other possibilities could have been chosen in the past and when this number rises to maxPossibilities (default:10), another solution is chosen. This way, solutions with many insertions are refused if a shorter one exists.

public void displaySolution(Scanner inputScanner, int bufSize, int maxOversize, int maxPossibilities )
{
    LRParsingTable table = new LR1toLALRGenerator(this).getParsingTable();
    Scanner inserter = new InsertionScanner(table,inputScanner,bufSize,maxOversize,maxPossibilities);
    ScannerToken<Object> t;
    do {
        t = inserter.readNextTerminal();
        System.out.print(t+" ");
    } while ( t.getSymbol()!=SpecialTerminals.EndOfInputStream );
}

In order to find solutions for the parser whilst the buffer may change, two stacks are maintained: Operations for read (and inserted) tokens are performed on the readStack. The changeStack represents the contents of readStack at the time the tokens in the buffer have not been read yet. The algorithm works as follows:

Tokens are read from another scanner S if available until the buffer is full. The input is parsed according to the table on the readStack (step C).
If the input could be parsed this far, the first token of the buffer will be returned (step A).
If an error occurred, readStack is returned to changeStack and the contents of the buffer are cleared after pushing them back into S (step B).
A first solution is chosen : A fitting token shall be inserted directly before the crash-position.
1. Tokens will be read from S and inserted according to the chosen solution until at least an amount of error-sync-size tokens could be parsed successfully after the error (step C).
2. If the input could be parsed this far, the solution is suitable and all tokens from the buffer can be returned (step D).
3. Otherwise, if parsing the solution encounters an error or if checks on maxOversize or maxPossibilities fail, the TokenNumberGenerator chooses another solution. The contents of the buffer are cleared and the stacks returned once more before the algorithm continues at step 3.1.
4. Otherwise, if no more solutions are known, the original tokens from S are returned whilst the erroneous readStack is restored using a simple phrase-based recovery (step D). The changeStack has now of course again to be in the same state as the readStack.
Continue with step 1 or return if the input has been completely parsed.

5. Short Guide: CUP2 for CUP Users

5.1. From CUP to CUP2

First of all note that CUP2 is not an updated version of CUP, but a complete re-implementation with different design aspects in mind (v. section 7.1.). It may be hard to transform your working CUP specification to a working CUP2 specification in no time, but we hope that the advantages of CUP2 will be worth the effort. Furthermore, we are working on a tool (see section 5.2.) that helps you transform your CUP files into CUP2 classes.

5.1.1. Specification in Pure Java

As described in more detail in chapter 3., CUP2 does not use plain text specification files, but works with pure Java data structures. This means that the grammar, precedences and semantic actions all have to be defined in Java.

Consequently, the old JFlex specifications cannot be re-used in CUP2. But the required changes are minimal, and altogether less lines of code are needed. Refer to section 3.2.2. for more information.

5.1.2. Parsing Table instead of Parser Code

CUP2 does not generate parser source code. Instead, a parsing table (consisting of Java objects only) is generated, which is needed to initialise a predefined universal parser driver, as shown in section 2.3.. This parsing table can be serialised to a file, and later quickly deserialized back to a Java object, as shown in section 3.3..

5.1.3. Error Recovery

The error recovery mechanism of CUP2 behaves quite similar to the one in CUP. The error symbol is now given by the terminal Error which is a special terminal and thus should be imported:

1.import static edu.tum.cup2.grammar.SpecialTerminals.Error;

The method error_sync_size() can be implemented. This method is used for parsing a sufficient number of tokens past the error symbol, by creating your own parser interfaces:

register( new ParserInterface() {
  public int getErrorSyncSize() {
    return 0; // recover immediately after error, i.e. do never combine errors.
  }
});

In fact, there is a change in the way an error will be reported. CUP2 gathers detailed information inside an ErrorInformation object and hands it over to the corresponding semantic action:

prod(s,       rhs(SpecialTerminals.Error, SEMI, s),
      new Action() {
          public String a(ErrorInformation x, String s)
          { 
              String location = x.getLine()+"/"+x.getColumn();
              return "some bad expression at "+location+";\n"+s; 
          }
      }
),

The object of type ErrorInformation provides start- and end-position of the whole error as well as a set of expected tokens. It distinguishes between correctly read tokens, bad tokens and the token which is responsible for getting the parser into an erroneous state (crash-token). In addition, it contains the results of calculations already performed. The flag recovered indicates that the error has been caught by an error-handling production.
The crash-token is the first bad token or alternatively the part of some correct input following the error. As the calculation of correct tokens requires some memory overhead, it has to be enabled specifically when parsing by using the saveTokens parameter:

1.parse( inputScanner, true )

5.1.4. Customising the Parser

CUP provides a series of methods that can be used to customise the generated parser. Instead of supplying copying the given code and overwriting of parser methods, CUP2 uses interfaces and observers to communicate. The following methods for customisation can be implemented in CUP2:

user_init(), int error_sync_size()

by implementing the parser interface;
report_error(String message, Object info), report_fatal_error(String message, Object info), public void syntax_error(Symbol cur_token), public void unrecovered_syntax_error(Symbol cur_token)

by implementing and registering an error-observer and/or by catching exceptions;
Symbol scan(), Scanner getScanner(), setScanner(java_cup.runtime.Scanner s)

by using the scanner-interface of CUP2;

Examples for a simple parser interface and an error-observer are given here:

register( new ParserInterface() {
  public int getErrorSyncSize() {
    return 0; // recover immediately after error, i.e. do never combine errors.
  }
});

//create the parser
LRParser parser = new LRParser(table);
 
// register an observer
parser.register(
    new ParserObserver()
    {
      public void syntax_error(ErrorInformation error) {
        if (error.isRecovered())
          System.err.println("A syntax error occured :\n"+error);
        else
          System.err.println("An unrecoverable syntax error occured :\n"+error);
      }
    }
);

5.2. Using the CUPtoCUP2 tool

For this section an understanding of the syntax and semantics of the old CUP specification files and the new CUP2 specification files (Java classes) is required.

The CUPtoCUP2 tool translates old CUP grammars to new CUP2 grammars - with restrictions. First of all: The old CUP specification files generate classes that are derived from a certain super class and are able to make use of objects and classes which are visible there or somewhere in the old package. Especially the first aspect is responsible that it is not trivial to guarantee that all the Java code written in an old CUP specification will compile - or even run - when it is just transferred to a new environment. Except doing vast modifications to the code is an option.

However some of the semantic actions might run in the new environment with very few modifications. For example it is possible that semantic actions in the old CUP specification are just used to build up the parse tree and make no or little use of local or inherited methods.

There is particularly one trap to be aware of: The new CUP2 specification syntax does not have the %prec feature. So if your old specification file contains such syntax, you will have to find a work-around for yourself. It is considered to implement this feature in future releases of CUP2.

To illustrate the use of the CUPtoCUP2 tool, consider the following example, which implements a simple calculator:

package Example;
 
import java_cup.runtime.*;
 
parser code {:
    public static void main(String args[]) throws Exception {
        SymbolFactory sf = new DefaultSymbolFactory();
        if (args.length==0) new Parser(new Scanner(System.in,sf),sf).parse();
        else new Parser(new Scanner(new java.io.FileInputStream(args[0]),sf),sf).parse();
    }
:}
 
terminal SEMI, PLUS, TIMES, LPAREN, RPAREN;
terminal Integer NUMBER;
 
non terminal Integer expr_list, expr_part;
non terminal Integer expr;
 
precedence left PLUS;
precedence left TIMES;
 
expr_list ::= expr_list expr_part | expr_part;
expr_part ::= expr:e {: System.out.println(" = "+e+";"); :} SEMI ;
expr      ::= NUMBER:n
          {: RESULT=n; :}
            | expr:l PLUS expr:r
          {: RESULT=new Integer(l.intValue() + r.intValue()); :}
        | expr:l TIMES expr:r
          {: RESULT=new Integer(l.intValue() * r.intValue()); :}
        | LPAREN expr:e RPAREN
          {: RESULT=e; :}
        ;

In order to call the CUPtoCUP2 tool you have to type the following command in the root directory of the tool:

java -jar ./jar/Compiler.jar <on|off> <cup-file> <classname of the output class>

The classpath should of course include the classes in the directory of the tool.

The first argument tells the compiler to generate code for semantic actions or to leave out the actions. The second argument tells the compiler which *.cup file should be translated and the last argument specifies the name of the class that is generated. The output looks like this: TODO Way to big example res/MyCalculator.java The init with or the action code sections are translated by copy and paste. Comments are generated around these areas. Using this tool can save a lot of time when big grammars have to be transformed. As you can see the ordering of the precedences in line 55 is inverted, and of course shift actions like the one in line 87 work a little bit different compared to the old CUP specification - cf. line 23 in the old CUP file.

Most likely you would like to change the specified package in line 3 and do the corresponding changes in lines 15 and 16. The main method in line 24 is a little obsolete both in the old file (it is just an example) and in the new specification format, as the parser is usually called from a different class. Furthermore the import statement from the old CUP file was deleted.

Additionally the RESULT in line 91 is initialised to make the compiler work. These changes can be best seen when a diff or vimdiff is performed. Although a translation of a real CUP grammar is not that easy, this can be seen as a proof of concept, as the following code compiles.

TODO Way to big example "res/MyCalculator.java~"

6. Sample Project

We created a small yet expressive sample project to demonstrate many of the features CUP2 includes. This can be downloaded here.

It is a compiler for the language MiniJava (which is a subset of the Java programming language), which translates MiniJava source code into MiniJVM code (inspired by the JVM, yet much simpler) or a syntax tree. There is also a graphical user interface which looks like a small IDE, where you can see if your source code is correct (if not, the corresponding characters are underlined and also suggestions to fix that error are made).

Let us have a look at the source code of that compiler:

flex/MiniJavaSpec.flex: This is the JFlex specification of the MiniJava language that is used to build a scanner for our compiler. With JFlex, it is converted into the Java class src/compiler/MiniJavaLexer.java
src/compiler/MiniJavaSpec.java: This is the CUP2 specification of the MiniJava language which contains the grammar, precedences and all the semantic actions that are necessary to create a syntax tree from a MiniJava source.

7. Appendix

This chapter contains some background information about the design and implementation of CUP2.

7.1. Basic Design Principles

Here are the basic ideas why we decided to implement CUP2 from scratch instead of modifying the old CUP release.

7.1.1. Java and Object-Orientation

While CUP appears to be more or less a port of a C program, CUP2 was designed from scratch with a typical Java style in mind. This includes the usage of all kinds of object-oriented programming techniques and Java's special features like Generics and Reflection.

7.1.2. Reliability of the Algorithms

It is often not clear how algorithms work when reading source code, even if you developed it yourself. Therefore, if they are changed there is a danger of damaging the program, possibly at unexpected locations. When developing CUP2, the following steps were taken to minimise that risk:

References to books Though the algorithms are somewhat optimised for Java, the basic ideas are the same as described in the books from which they were taken. Where possible, there are comments explaining which part of the pseudo-code algorithm is implemented in the lines of Java code that follow.
JavaDoc Almost all classes and methods are documented using JavaDoc. The class documentation describes the basic idea about a class and important dependencies to referenced classes. The documentation above the methods tells the developer what a method does and what parameters are required.
JUnit tests Most semantic errors encountered during development were discovered using unit tests via the JUnit framework. These tests are included in the source distribution of CUP2, so that developers can use them to quickly check if the algorithms are still working and to write their own tests based on the existing ones. Of course these tests are no proofs in a mathematical sense, but at least they help to identify errors in the code more quickly and provide means to ensure that hitherto correct behaviour survives future changes.

7.1.3. Platform for Different Parsing Algorithms

One goal of CUP2 was to create a platform for many types of parser generators, even more than just LR(1) and LALR(1), based on a hierarchy of abstract classes for reusing as much code as possible, as shown in figure TODO.

TODO: Somebody created a nice UML diagram showing the current state of the CUP2 generator classes, can you insert it here? Thanks, Andi

At the moment, LR(0), LR(1), LALR(1) and LR(1)-to-LALR (that means generating a parsing table with LR(1) first and then merging states according to the LALR principle) are included in CUP2. However, the object-oriented structure of the project allows adding any parsing algorithms meeting the following requirements:

Input Any generator must accept a given grammar, and optionally precedence directives and a level of verbosity (useful for debugging or logging).
Output For any LR generator, an LR parsing table has to be returned, which can then be used by the built-in LR parser driver. For any other type of generator, the type of output is arbitrary as long as a suitable parser driver is developed, too.

This design has at least four advantages:

Adding new algorithms in the future is possible without having to rebuild the whole framework (defining what a grammar and what a production is, parsing a specification file, and so on). For example, when creating an LL parser generator, the developers can concentrate on their main task instead of dealing with the framework.
As all generators are based on the same platform and thus share the same general conditions, it is much easier to compare the performance of different parsing algorithms in practice.
If developers decide to switch to another parsing algorithm, they are not forced to switch to another tool, but can even reuse the same specification without any changes.
The class hierarchy reflects the connections between the different algorithms, which follows the didactic concept also used in many books (like \cite{appel02}), making the code easier to understand with step by step instructions for new developers.

7.1.4. Specification using Java Data Structures

A specification contains a list of all symbols, the grammar including its productions and precedence directives and semantic actions. From a plain text specification, all these objects have to be converted into Java-compatible structures when parsing the grammar. This leads to the idea of defining the specification using the Java grammar directly. The advantages of doing this are as follows:

Since everything is valid Java code, all tools such as syntax highlighting or auto-completion are fully supported. A Java IDE like Eclipse could dramatically speed up the definition of a specification since many kinds of errors are immediately detected and reported to the developer.
No specification parsing (conversion from plain text to Java structures) has to be done (only a few checks for correctness of the types). The data structures are already within the JVM and can be used immediately after loading the specification class.
The instances of symbols, productions, actions, and all other kinds of specification objects are used throughout the whole process of generating the parser without any kind of modification, so debugging is much easier and more performance optimisations are possible.
No Java source code has to be generated by the program. The user just adds the CUP2 library to the classpath, writes a specification class, and that is it.

There are also some disadvantages of course. Mainly the definition in Java code is usually more verbose, since it has to conform to the Java grammar, which makes for example the definition of symbols with values more complicated. Another drawback is the definition of semantic actions. For each action, an anonymous subclass of Action has to be created, instead of just writing the method body like in CUP.

7.2. Generator Class Hierarchy

Below you can see the hierarchy of the generator classes and how generators for different parsers are located in this hierarchy. Besides the actual generator classes there are also classes that generate an automaton for a certain parsing algorithm.

Generator Class Hierarchy as of August 18 2009