How To Write A Simple Lexical Analyzer or Parser

From Adobe Open Source Wiki
Revision as of 03:19, 2 May 2008 by SeanParent (Talk | contribs)

Jump to: navigation, search


This paper is a response to an e-mail question asking if anyone had to validate if a sequence of characters was a valid XML name which was not a reserved XML name (starting with x|X m|M l|L) or a qualified name (one containing :). See 2.3 of the XML Specification.

Languages are defined using a grammar. There are many notations for grammars but most are some variant of [1], XML is no exception to this rule. The variant of EBNF used for the XML notation is described in 6 of the XML Specification. Transforming an EBNF grammar into a lexical analyzer and/or parser is a relatively straight forward process. For simplicity here I'm going to define the lexical analyzer as the portion of code which transforms a stream of characters into tokens and the parser as something which reads the tokens and takes some action once it has recognized production rules. There are libraries, such as the Spirit Library and well as tools such as [2] and [3] which can be used to aid in writing lexical analyzers and parser but knowing how to write these simply and directly is an invaluable tool in any programmers tool chest - also experience with writing such systems will make you more productive when using the libraries and tools and give you some insight into when you need to use them.

When you look at a typical EBNF grammar it will often not be separated into which parts are the lexical analyzer production rules and which are the parser production rules - the structure of both a simple parser and simple lexer will be the same, and the separation becomes apparent once you start writing the grammar. Because the structure is the same I'm going to refer from here on to both parts as simply the parser.

A Simple Recursive Decent Parser

To solve our problem of validating a name we're going to write a simple recursive decent,or top-down, parser. The basic idea is we will be reading from a stream of character denoted by a pair of InputIterators - we will only need to read each character once without backtracking (a grammar which can be read without backtracking in this fashion is known as an[4] grammar.

To do this we are going to translate each production rule into a function which will return a bool with the value true if it recognizes the production and false if it does not. If it does recognize the production then the stream will be advanced to the end of the production and a value will be updated with the recognized production. Otherwise, the stream and value are left unchanged. Such a function will follow this pattern:

bool is_production(I& first, I last, T& value);

Rather than passing first and last through all of our functions we're going to create a simple struct which holds them and write the is_production() as a member function:

template <typename I> // I models InputIterator
struct parser {
    bool is_production(T& value);

    I f_m; // first
    I l_m; // last

Writing the body of the production is generally quite simple as we shall see - let's start one level down in our grammar from the XML specification with this production rule:

NameChar  ::=  Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender

The specification of the XML parser is in terms of UCS2 code points (as opposed to characters) and this production is only going to recognize a single code point - so we will write it as:

uint16_t code_point_t;

template <typename I> // I models InputIterator
// require value_type(I) == code_point_t
struct xml_parser {
    bool is_name_char(code_point_t& c) { /*...*/ }

    I f_m; // first
    I l_m; // last