August 12, 2020
Hot Topics:

# Classic Parsing with Flex and Bison

### A Simple Calculator Language

I chose a calculator language as the example for many reasons. First of all, nearly all programming applications need to parse expressions with plus, minus, multiply, and divide operators and parenthesis. Second, it's a universally understood problem and not specific to any particular language. Although many examples use postfix (RPN) notation, this one uses good old infix notation since everyone understands that. Specifically, the goal is for the calculator to be able to handle expressions like this:

`-1.1 + 2 * ( 4 / 3 )`

First off, you need to determine what the tokens of the language are. Basically, you will have numbers, four operators, parenthesis, and newline (standing in for the "=" key on your calculator). As such, you can write a straightforward Flex scanner (".l" file) to process this, as Listing 1 illustrates with lexer.l:

```  1 %{
2 // lexer.l -- From tcalc: a simple calculator program
3
4 #include <stdio.h>
5 #include <string.h>
6 #include <stdlib.h>
7 #include "calc.tab.h"
8 extern YYSTYPE yylval;
9
10 %}
11 %option noyywrap
12 delim         [ \t]
13 whitesp       {delim}+
14 digit         [0-9]
15 number        [-]?{digit}*[.]?{digit}+
16 %%
17 {number}  { sscanf(yytext, "%lf", &yylval); return NUMBER;}
18 "+"       { return PLUS; }
19 "-"       { return MINUS; }
20 "/"       { return SLASH; }
21 "*"       { return ASTERISK; }
22 "("       { return LPAREN; }
23 ")"       { return RPAREN; }
24 "\n"      { return NEWLINE; }
25 {whitesp} { /* No action and no return */}
```

Listing 1. lexer.l

Note a few things in lexer.l (Listing 1):

1. The definition of number on line 15 is a regular expression built up from instances of the earlier expression digit on line 14. The re-use of expressions is one of Flex's most powerful features.
2. It thoroughly converts single-character symbols into tokens even though you could have interpreted them as characters later on in Bison. I feel that it enhances maintainability not to hardcode character references in the grammar file later. Also, it would help to isolate internationalization issues, if any, in the input language.
3. Return values are used to designate the type of token, whereas the actual value of the token is stored in yytext.

Now, take a look at the grammar—the rules and actions that involve building the parse tree. Listing 2 shows these with tcalc.y:

```  1 /* tcalc.y - a four function calculator */
2 %{
3 #define YYSTYPE double      /* yyparse() stack type */
4 #include <malloc.h>
5 #include <stdlib.h>
6 %}
7 /* BISON Declarations */
8 %token NEWLINE NUMBER PLUS MINUS SLASH ASTERISK LPAREN RPAREN
9
10 /* Grammar follows */
11 %%
12 input:              /* empty string */
13     | input line
14     ;
15 line: NEWLINE
| expr NEWLINE           { printf("\t%.10g\n",\$1); }
17     ;
18 expr: expr PLUS term         { \$\$ = \$1 + \$3; }
19     | expr MINUS term        { \$\$ = \$1 - \$3; }
20     | term
21     ;
22 term: term ASTERISK factor   { \$\$ = \$1 * \$3; }
23     | term SLASH factor      { \$\$ = \$1 / \$3; }
24     | factor
25     ;
26 factor:  LPAREN expr RPAREN  { \$\$ = \$2; }
27       | NUMBER
28       ;
29 %%
30 /*--------------------------------------------------------*/
31 /* Additional C code */
32 /* Error processor for yyparse */
33 #include <stdio.h>
34 int yyerror(char *s)        /* called by yyparse on error */
35 {
36     printf("%s\n",s);
37     return(0);
38 }
39
40 /*--------------------------------------------------------*/
41 /* The controlling function */
42 int main(void)
43 {
44     yyparse();
45     exit(0);
46 }
```

Listing 2. tcalc.y

Like all good C programs, the program begins in the main() function on line 42. The instruction yyparse() begins the input parsing and returns when end-of-input is reached. The output of the calculator will be displayed after each NEWLINE token is reduced on line #16. Also, notice the yyerror() function called in times of parsing failure on lines 34-38. More sophisticated state management routines are available for error handling, but this is only a simple example!

By way of screen dumps, Figure 3 offers some proof that the tcalc application works.

Figure 3. TCALC Screen Dumps

Here is the batch file I used to build the project in a CMD window:

```set CCOPT=/c /Z7 /W3
bison calc.y
cl %CCOPT% calc.tab.c
bison -d calc.y
flex lexer.l
cl %CCOPT% lex.yy.c
```

### Rapidly Develop Useful Parsers

Yacc/Bison remains one of the top 10 tools in the compiler writer's toolbox, despite being over 30 years old. Part of its enduring appeal is the similarity of the grammar input (".y" files) to the actual BNF syntax of the language you are parsing. This contributes greatly to the maintainability of Bison-based parsers. This ease of use combined with its low footprint and high degree of portability make Bison and its friend Flex the clear winners for rapidly developing useful parsers for today's challenging translation applications.

### Book Recommendation

 Compilers: Principles, Techniques, and Tools (Hardcover) by Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman (ISBN 0201100886) The legendary "dragon book" is still the preferred text for learning compiler design methodologies. This book has a perfect mixture of theory and practice that is unequalled. Along the way, you'll use real-world tools such as Flex and Bison to help develop your projects and, of course, your understanding of compiler technologies.

Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisor board member for Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries; just drop an e-mail to sysop@HAL9K.com.

Page 2 of 2