What programming job is more common than reading a data file, doing some manipulations, and writing it out again? Probably none! In fact, a former manager of mine with 30 years’ experience in what used to be called data processing opined that all programs were but a slight variation on that theme. Perhaps more often today we read data files and write XML, but the core need is essentially the same: to act on and react to incoming character data streams accurately, in a way that is both predictable and debuggable. You can do this one file at a time, writing your own lexical analyzers for each special data file (or, worse, cloning them), or you can take a step back and let a lexical analyzer generator manage the work for you.
The mother of all lexical analyzer generators is, of course, the ubiquitous Unix tool Lex. Dating from the early 1970s, it is perhaps one of the oldest compiler tools still in use. (The best source of Lex documentation is still the 1975 paper Lex—A Lexical Analyzer Generator by M.E. Lesk and E. Schmidt.) The modern successor to Lex is GNU Flex: the Fast Lexical analyzer generator.
Flex is available for almost any platform you can imagine. However, because Flex is a tool for programmers only, the releases are very much do-it-yourself. If you are using Linux, BSD, or any POSIX-compliant version of Unix, you almost certainly have Flex installed and ready to go! If you are a Windows user, or you simply demand the latest and greatest, visit the SourceForge project.
Windows users will also need WinRAR, or must endure the headache of finding tar and gunzip or bzip2 to uncompress it. They will also need a working install of bash to run the configure script, plus the GNU gcc compiler. An easier alternative is simply to install the Cygwin tools. Cygwin is a port of common GNU and other Linux/Unix open-source tools to the Win32 platform. It includes tar, gunzip, bash, gcc, and flex in Win32 executable binary form, so you can simply download these and get going with little fuss. You can easily reach Cygwin mirror sites around the world with its download tool (setup.exe).
Taking the path of least resistance, I developed this article with Flex 2.5.4 pre-compiled binaries from Cygwin, although I did build a new Flex 2.5.33 with the Cygwin tools just to prove it could be done.
Introduction to Tokens and Lexemes
Suppose you’re not only reading data files but reading (and perhaps interpreting) a scripting-language input file, such as Perl or VB source code. Lexical analysis is the lowest-level translation activity. The purpose of a lexical analyzer, or scanner, is to convert an incoming stream of characters into an outgoing stream of tokens. The scanner operates by matching patterns of characters to lexemes. Each pattern describes what an instance of a particular token must match. For example, a common pattern for an identifier (a user-specified variable or constant, for example) in a script language is a letter followed by one or more occurrences of a letter or digit. Some lexemes that would match this pattern are index, sum, and i47.
Things that your input language treats as useless, such as white space and comments, are not lexemes and can be safely discarded by the scanner. Several classes of tokens appear in the definitions of most script languages. Table 1 lists some typical ones.
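In Flex, each such pattern is written as a regular expression paired with an action. The following is a minimal sketch, not from any shipped example: it recognizes the identifier pattern described above (written with `*` so a single letter also qualifies) and silently discards white space and a hypothetical #-to-end-of-line comment:

```lex
%option noyywrap
%%
[ \t\r\n]+              ;   /* white space is not a lexeme; discard it */
"#"[^\n]*               ;   /* likewise discard a #-to-end-of-line comment */
[A-Za-z][A-Za-z0-9]*    { printf("IDENTIFIER: %s\n", yytext); }
%%
int main(void) { return yylex(); }
```

Running flex on this file generates lex.yy.c, which any C compiler can build into a stand-alone scanner; fed the input index sum i47, it would report each word as an IDENTIFIER.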
Table 1: Typical Tokens
| Token class | Description |
|---|---|
| Keywords | Reserved words (such as procedure and return) that cannot be redefined |
| Operators | Typically strings (1–3 characters), such as /, >=, and >>=, used in expressions |
| Identifiers | User-specified objects, similar to keywords in form |
| Numbers | Integer, real, or double-precision, as specified |
| Character constants | Single characters, such as c |
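The token classes in Table 1 map directly onto Flex rules. A sketch for an assumed toy language with only the keywords and operators shown (the printf reporting is illustrative, standing in for returning token codes to a parser):

```lex
"procedure"|"return"    { printf("KEYWORD: %s\n", yytext); }
">>="|">="|"/"          { printf("OPERATOR: %s\n", yytext); }
[A-Za-z][A-Za-z0-9]*    { printf("IDENTIFIER: %s\n", yytext); }
[0-9]+("."[0-9]+)?      { printf("NUMBER: %s\n", yytext); }
'[^'\n]'                { printf("CHAR_CONST: %s\n", yytext); }
```

Rule order matters here: Flex always takes the longest match, and breaks ties in favor of the earlier rule, so the keyword rule must precede the identifier rule or return would be reported as an ordinary identifier.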