JavaLet Your Parser Go for the GOLD

Let Your Parser Go for the GOLD

Developer.com content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More.

A Parser Generator that Speaks YOUR Language

In a previous installment, you visited the inner workings of the Yet Another Compiler-Compiler (YACC) parsing system. This month, you go for the gold: the Grammar Oriented Language Developer (GOLD) Parser. Like most parsing systems, GOLD uses the LALR(1) state machine (algorithm) to analyze syntax and a Deterministic Finite Automaton (DFA) to identify different lexical units (tokenizer). Practically all common parser generators, such as YACC/Bison, use these algorithms. However, GOLD takes a different approach than common compiler-compilers. GOLD is freeware and uses the “zlib” style license so you can integrate it with your own apps without worry.

In GOLD, the LALR and DFA algorithms are merely simple automata, using lookup tables to determine both state transition and parsing actions. As a result, most of the computational work is performed before any input text is parsed. It is the computation of these tables that is both time-consuming and complex. Once created, these parsing tables are essentially independent of any programming language binding. As if to prove this point, the GOLD developers have implemented a dozen language bindings as of this writing (as either linkable .LIB import libraries or plug-ins), including:

  • ActiveX DLL
  • Assembly – Intel x86
  • C/C++
  • C#
  • D
  • Delphi
  • Java
  • Pascal
  • Python
  • Visual Basic .NET
  • wxWidgets

GOLD takes a unique approach of logically separating the development of the LALR and DFA parse tables from the algorithms that use them. Unlike what we saw with YACC/Bison, GOLD does not require you to embed your source code directly into the grammar. Grammars are completely independent of any implementation language or platform. Instead, the application analyzes the grammar and then saves the parse tables to a separate file. This file can be subsequently loaded by the actual parser engine and used.

How Does It Work?

The GOLD Parsing System is designed to aid in the development of compilers, interpreters, and translators while supporting multiple programming languages’ backends.

Because the LALR and DFA algorithms can be implemented with simple state transition graphs, the parsing can be easily hosted by any programming language because the logic simply looks up a value in a table and then acts accordingly. The creation of these lookup tables is where all the hard work takes place. It’s not surprising, then, that GOLD is composed of three pieces: Builder, parser Engine, and an output file that contains table information. In a nutshell, the Builder constructs the parse tables and the Engine executes the tables.

The Builder is a Win32 app that reads a source grammar written in the GOLD Meta-Language, produces the parse tables, and then writes them to the Compiled Grammar Table file. Builder does a couple unrelated things, such as creating enum lists and allowing you to interactively test a grammar. The Builder could theoretically be ported to other OSes, but is only available for Win32 at this time.

The Engine component performs the actual parsing and can be developed in any programming language necessary. Because all the complex work was already performed by the Builder, the Engine simply needs to read the Compiled Grammar Table (CGT) file and implement the LALR and DFA automata. Although Engine implementations vary in design, they all behave identically because they completely are table-driven with respect to the input.

Mechanically speaking, this means each time you modify the grammar file, you’ll run the Builder to regenerate the CGT file. Once the grammar is completely refined, the bulk of your work shifts to the implementation of semantics and your focus changes to the Engine.

The Calculator Revisited

For comparison’s sake, you will now reverse-engineer the simple calculator you developed in GNU Bison. Fortunately, the GOLD Builder includes an import tool for legacy YACC/Bison grammars to get you started. For those of you starting from scratch, the New Grammar Wizard will give you a leg up on some standard types of grammars by quizzing you on some basic language characteristics: Do newlines terminate statements or not? Does it have typical programming-language type identifiers? Does it contain string constants? And so on. The output from the File | Import A YACC/Bison Grammar menu selection is shown in Figure 1. I modified it slightly to follow the suggested practice of defining the terminal symbols using the “=” operator. Although you literally could write in ‘+’ for the plus operator, I feel that spelling out the terminals (“PLUS”) both provides abstraction while supporting maintenance and readability. The GOLD Meta-Language syntax is as close as anyone will ever get to notating in pure Backus-Naur Form (BNF), so if you have any prior experience in formal grammars it will be a cakewalk.

"Start Symbol" = <input>

NEWLINE = 'n'
PLUS = '+'
MINUS = '-'
SLASH = '/'
ASTERISK = '*'
LPAREN = '('
RPAREN = ')'


! ======================================= Rules

<input>
   ::=
    |  <input> <line>

<line>
   ::= NEWLINE
    |  <expr> NEWLINE

<expr>
   ::= <expr> PLUS <term>
    |  <expr> MINUS <term>
    |  <term>

<term>
   ::= <term> ASTERISK <factor>
    |  <term> SLASH <factor>
    |  <factor>

<factor>
   ::= LPAREN <expr> RPAREN
    |  NUMBER

Figure 1: Calc.grm (grammar file)

Schooling Your Grammar

Now, it’s time to test the grammar! Rather than write a lot of code to do this, you simply can choose Window | Test Grammar on the Builder menu. You’ll test the same expression you used in the previous article on YACC: 3 * ( 4 + 5 ). The results are shown in Figure 2:

Figure 2: Parse Actions Display

You also can have a look at the Parse Tree view that shows a nice treeview control so you can drill down to exactly the right rules in your grammar (see Figure 3).

Figure 3: Parse Tree Display

Building the GOLD Engine

If there is any downside to the GOLD parser, it is that the language-specific bindings for the GOLD Engine are neither documented online nor offline. Instead, you have to use Doxygen to generate the HTML, which I found a silly chore—especially considering that C++ is probably one of the most popular platforms and documentation for its GOLD Engine should have been readily available. In the end, I felt like I was more reverse-engineering the demo programs rather than actually learning something from the docs. This is not quite the same as a step-by-step tutorial.

I just improvised by compiling all the code in the SRC directory and shoving it into a library, like this:

cl /c /GX *.cpp /I..include
lib /out:cpp-gpenginelibgpengine.lib *.obj

However, there are Project files available for a variety of Visual Studio versions if you want to do that, even one that builds the Engine as a DLL. Again, I don’t see a reason why precompiled binaries weren’t provided for these common platforms. The final link step, then, looks like this:

cl /GX main.cpp /Id:cpp-gpengineinclude /link
d:cpp-gpeginelibgpengine.lib

An App Made of GOLD

Last, you’ll take a look at a simple program that uses the CALC.CGT written out by the Gold Builder in the earlier steps and proceeds to parse it using the C++ binding you’ve already prepared:

 1 #include <iostream>
 2 #include <stdlib.h>
 3
 4 #include "CGTFile.h"
 5 #include "simpleErrorRep.h"
 6 #include "calc.h"    // list of constants generated by Builder
 7
 8 char *load_source (char *filename);
 9 void printTokens (vector <Token*> &t);
10
11 int main(int argc, char *argv[])
12 {
13    CGTFile    cgtFile;
14    char       *source;
15    Symbol     *rdc;
16    DFA        *dfa;
17    LALR       *lalr;
18
19    ErrorTable       *myErrors;
20    SimpleErrorRep   myReporter;
21
22    // Load grammar file
23    if (cgtFile.load ("calc.cgt")) {
24       wprintf (L"%sn", "Grammar loaded succesfully");
25       cgtFile.printInfo ();
26    } else {
27       wprintf (L"%sn", "error loading file");
28       return -1;
29    }
30
31    // Make some source!
32    source = "3 * (4 + 5)r";
33
34    // Get DFA (Deterministic Finite Automata) from the cgt file
35    dfa = cgtFile.getScanner();
36
37    // Scan the source in search of tokens
38    dfa->scan(source);
39
40    // Get the error table
41    myErrors = dfa->getErrors();
42
43    // If there are errors report them
44    if (myErrors->errors.size() > 0) {
45       for (int i=0; i < myErrors->errors.size(); i++) {
46          cout << myReporter.composeErrorMsg
            (*myErrors->errors[i]) << endl;
47       }
48       return -1;
49    }
50
51    // Get the tokens to feed the LALR machine with them
52    vector <Token*> tokens = dfa->getTokens();
53    cout << "printing tokens..." << endl;
54    printTokens (tokens);
55
56    // Get the LALR (Look Ahead, Left-to-right parse,
      //               Rightmost-derivation)
57    lalr = cgtFile.getParser();
58
59    // Parse the tokens
60    rdc = lalr->parse (tokens, true, true);
61
62    myErrors = lalr->getErrors();
63    if (myErrors->errors.size() != 0) {
64       for (int i=0; i < myErrors->errors.size(); i++) {
65       cout << myReporter.composeErrorMsg
         (*myErrors->errors[i]) << endl;
66    }
67    return -1;
68    }
69
70    cout << "printing reduction tree..." << endl;
71    lalr->printReductionTree(rdc, 0);
72
73    delete rdc;
74
75    return EXIT_SUCCESS;
76 }
77
78
79 void printTokens (vector <Token*> &t) {
80    for (integer i = 0; i < t.size(); i++) {
81       wchar_t *pwe = t[i]->symbol;
82       wprintf (t[i]->symbol);
83       wprintf (L":");
84       wprintf (t[i]->image);
85       wprintf (L"n");
86    }
87 }
88

The program starts in main() on Line 23 where you load the externally generated Compiled Grammar Table (CGT), which is the binary output of the GOLD Builder. Thus, you can tweak the grammar as much as you like without necessarily needing a recompile.

On Line 38, you do the actual scanning of the source text (hardcoded for this simple example) using a DFA built from the stored image in the CGT file. The next steps are to get the tokens out of the DFA results and stuff them into the parser.

51    // Get the tokens to feed the LALR machine with them
52    vector <Token*> tokens = dfa->getTokens();
53    cout << "printing tokens..." << endl;
54    printTokens (tokens);
55
56    // Get the LALR (Look Ahead, Left-to-right parse,
      //               Rightmost-derivation)
57    lalr = cgtFile.getParser();
58
59    // Parse the tokens
60    rdc = lalr->parse (tokens, true, true);

The two arguments at the end of Line 60 designate trim reduction and whether to stop parsing after the first error is encountered. Trim reduction means to simplify rules of the form: NonTerminal1 := Nonterminal2 (intermediate productions). Last, to prove that your program did something, you print the tokens (see Line #53) and print the reduction tree (see Line #71). The printing of the tokens is mainly to show that you recognized each of tokens correctly and should read left-to-right along the input. The reduction tree shows a tree you would use to control the expression evaluation so that, in fact, it does “4 + 5” and then multiplies that result by 3. Below is the output from the demo program:

printing tokens...
NUMBER:3
ASTERISK:*
LPAREN:(
NUMBER:4
PLUS:+
NUMBER:5
RPAREN:)
NewLine:
EOF:EOF

printing reduction tree...
input
 term
  factor
   NUMBER:3
  ASTERISK:*
  factor
   LPAREN:(
   expr
    factor
     NUMBER:4
    PLUS:+
    factor
     NUMBER:5
   RPAREN:)
NewLine:

Conclusion

I’ve barely scratched the surface of what you can do with the GOLD parser system. In this article really only took you through the steps up to a complete parse tree. Adding in the semantics is the next step for you to take. There are lots of applications from compilers to XML translators that you can implement with this flexible tool. The added advantage of being able to test your grammar inside the Builder can save you time wasted recompiling and debugging your code just to test the quality of the grammar. This alone is a big step up from the traditional YACC/Bison solutions. Add in the incredible cross-platform support for everything from .NET to Java and you’ll be sure to be finding new ways to use GOLD every day.

About the Author

Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisor board member for Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries; just drop an e-mail to sysop@HAL9K.com.

Get the Free Newsletter!

Subscribe to Developer Insider for top news, trends & analysis

Latest Posts

Related Stories