How To Write A Simple Lexical Analyzer or Parser

From wiki
Revision as of 05:23, 2 May 2008 by SeanParent (Talk | contribs)

Jump to: navigation, search


This paper is a response to an e-mail question asking if anyone had to validate if a sequence of characters was a valid XML name which was not a reserved XML name (starting with 'x'|'X' 'm'|'M' 'l'|'L') or a qualified name (one containing ':'). See 2.3 of the XML Specification.

Languages are defined using a grammar. There are many notations for grammars but most are some variant of [1], XML is no exception to this rule. The variant of EBNF used for the XML notation is described in 6 of the XML Specification. Transforming an EBNF grammar into a lexical analyzer and/or parser is a relatively straight forward process. For simplicity here I'm going to define the lexical analyzer as the portion of code which transforms a stream of characters into tokens and the parser as something which reads the tokens and takes some action once it has recognized production rules. There are libraries, such as the Spirit Library and well as tools such as [2] and [3] which can be used to aid in writing lexical analyzers and parser but knowing how to write these simply and directly is an invaluable tool in any programmers tool chest - also experience with writing such systems will make you more productive when using the libraries and tools and give you some insight into when you need to use them.

When you look at a typical EBNF grammar it will often not be separated into which parts are the lexical analyzer production rules and which are the parser production rules - the structure of both a simple parser and simple lexer will be the same, and the separation becomes apparent once you start writing the grammar. Because the structure is the same I'm going to refer from here on to both parts as simply the parser.

A Simple Recursive Decent Parser

To solve our problem of validating a name we're going to write a simple recursive decent,or top-down, parser. The basic idea is we will be reading from a stream of character denoted by a pair of InputIterators - we will only need to read each character once without backtracking (a grammar which can be read without backtracking in this fashion is known as an[4] grammar.

To do this we are going to translate each production rule into a function which will return a bool with the value true if it recognizes the production and false if it does not. If it does recognize the production then the stream will be advanced to the end of the production and a value will be updated with the recognized production. Otherwise, the stream and value are left unchanged. Such a function will follow this pattern:

bool is_production(I& first, I last, T& value);

Rather than passing first and last through all of our functions we're going to create a simple struct which holds them and write the is_production() as a member function:

template <typename I> // I models InputIterator
struct parser {
    bool is_production(T& value);

    I f_m; // first
    I l_m; // last

Writing the body of the production is generally quite simple as we shall see - let's start one level down in our grammar from the XML specification with this production rule:

NameChar  ::=  Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender

Because we only want unqualified names we're going to use a production dropping the ':'.

NameCharUnqualified  ::=  Letter | Digit | '.' | '-' | '_'  | CombiningChar | Extender

The specification of the XML parser is in terms of UCS2 code points (as opposed to characters) and this production is only going to recognize a single code point - so we will write it as:

uint16_t code_point_t;

template <typename I> // I models InputIterator
// require value_type(I) == code_point_t
struct xml_parser {
    bool is_name_char_unqualified(code_point_t& c)
    { /*...*/ }

    I f_m; // first
    I l_m; // last

The body of the function is nearly identical to the production rule:

bool is_name_char_unqualified(code_point_t& c)
    return is_letter(c) || is_digit(c) || is_match('.', c) || is_match('-', c)
        || is_match('_', c) || is_combining_char(c) || is_extender(c);

The is_letter() function is even simpler:

Letter  ::= BaseChar | Ideographic
bool is_letter(code_point_t& c) { return is_base_char(c) || is_ideographic(c); }

The is_match() is a helper function for terminal productions. A terminal production is a production which is not defined in terms of another production. It is in the terminal productions where the interesting work is done. Here is the body.

bool is_match(code_point_t x, code_point_t& c)
    if (f_m == l_m || *f_m != x) return false;
    c = x; ++f_m; return true;

If we are at the end of our steam or the current character is not a match then return false. Otherwise we set our value to the matched character, increment the stream, and return true.

The remaining productions are all of a form which match one of a set of code points:

Ideographic  ::= 	[#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]

Some of the tables, such as the one for BaseChar are quite large making it prohibitively expensive to write an expression like (0x4E00U <= x && x <= 0x9FA5U) || (x == 0x3007U) || .... Instead we will create a simple lookup table. When dealing with a parser that works with a simple char type we could just build a table with 256 elements that contained an enumeration as to the character type. Our table for these would be a bit large (but we could do it - 64K elements) - but instead we'll create an array of ranges which contain the code points we want to match and search to find if our character is in one of the ranges. To do that we'll create a semi-open ranges for each character range of the form (f0, l0], (f1, l1], ... and then use lower bound to search the table, if we end up on an odd index (starting at 0) then the item is in the range, an even index it is not. This strategy means we can't represent a range which includes 0 but 0 is not a valid character in an XML production. The code for the table lookup for <Ideographic> is like so:

bool in_ideographic(code_point_t c)
    static const code_point_t table[] =
        0x4DFFU, 0x9FA5U, // [#x4E00-#x9FA5]
        0x3006U, 0x3007U, // #x3007
        0x3020U, 0x3029U  // [#x3021-#x3029]

    return (adobe::lower_bound(table, c) - boost::begin(table)) & 0x01;

To transform from the closed ranges to the semi-open ranges we subtract one from the first item in the range. Note that this code doesn't have to be a member function and is not a template - we can put these lookup tables in a .cpp file. We'll use this function in our is_ideographic() member function:

bool is_ideographic(code_point_t& c) { return is_match(in_ideographic, c); }

is_match() is another variant of our previous is_match which takes a predicate function instead of a value:

bool is_match(bool (*p)(code_point_t), code_point_t& c)
    if (f_m == l_m || !p(*f_m)) return false;
    c = *f_m; ++f_m; return true;

The remaining productions used by is_name_char_unqualified()are written the same way - just with larger tables.

The only production left to handle is the one for Name:

Name ::=  (Letter | '_' | ':') (NameChar)*

We drop the ':' from this production as well:

NameUnqualified ::=  (Letter | '_') (NameCharUnqualified)*

And we could write this production something like the following:

bool is_name_unqualified(string16_t& s)
    string16_t      result;
    code_point_t c;

    if (!is_letter(c) && !is_match('_', c)) return false;
    result += c;
    while(is_name_char_unqualified(c)) result += c;

    s = result;
    return c;

Here we see we are moving past a lexical analyzer and into a parser - for our needs this is relatively inefficient - we're constructing and copying a string twice for which we really don't care about the value except that we need to assure that it doesn't start with 'xml' (case insensitive). If we really needed to get into the parser level we would write a more efficient way of managing tokens - but here we really don't need to. Instead we'll check for 'xml' as we proceed - if we find it we'll return false. This will violate our rules on stream advancement