How To Write A Simple Lexical Analyzer or Parser

From Adobe Open Source Wiki
Latest revision as of 16:11, 10 October 2013

Introduction

This paper is a response to an e-mail question asking if anyone had code to validate that a sequence of characters was a valid XML name which was not a reserved XML name (starting with 'x'|'X' 'm'|'M' 'l'|'L') or a qualified name (one containing ':'). See Section 2.3 of the XML Specification.

Languages are defined using a grammar. There are many notations for grammars, but most use some variant of Extended Backus-Naur form (EBNF), and XML is no exception. The variant of EBNF used for the XML notation is described in Section 6 of the XML Specification. Transforming an EBNF grammar into a lexical analyzer and/or parser is a relatively straightforward process. For simplicity I define the lexical analyzer as the system which transforms a stream of characters into tokens, and the parser as the system which transforms a stream of tokens into a sequence of actions, one for each recognized production rule. There are libraries, such as the Boost Spirit Library, and tools, such as Lex and Yacc, which can aid in writing lexical analyzers and parsers. Still, knowing how to write these systems simply and directly is an invaluable tool in any programmer's tool chest; experience with writing them will make you more productive when using the libraries and tools, and give you some insight into when you should and should not use them. Some problems (including this one) can also be addressed with a regular expression, using a library such as the Boost Regex Library; however, the numerous ranges of values in the grammar productions for this problem are difficult to express in a regular expression. A finite automaton can also be constructed for many such problems (including this one); such systems have the advantage of inverting control flow - you push values into a finite automaton, as opposed to executing a parse function which pulls values from a stream.

When you look at a typical EBNF grammar it will frequently not distinguish which parts are the lexical analyzer production rules and which are the parser production rules - the structure of both a simple parser and simple lexer will be the same. The separation is the point where the value returned by the production rule is not a character from the stream, but some form of token. Because the structure is the same I'm going to refer from here on to both parts as simply the parser.

A Simple Recursive Descent Parser

To solve the problem of validating an XML name we're going to write a simple recursive descent, or top-down, parser. The idea is to read from a stream of characters denoted by a pair of InputIterators - we will only need to read each character once without backtracking. A grammar which can be read in this form without backtracking is known as an LL(1) grammar.

To do this we are going to translate each production rule into a function which will return a bool with the value true if it recognizes the production and false if it does not. If it does recognize the production then the stream will be advanced to the end of the production and a value will be updated with the recognized production. Otherwise, the stream and value are left unchanged. Such a function will follow this pattern:

bool is_production(I& first, I last, T& value);

The stream denoted by [first, last) is going to be UTF-16 code points because that is how the XML specification is defined. We'll define code_point_t to be a 16 bit integer:

typedef uint16_t code_point_t;

Rather than passing first and last through all of our functions we're going to create a simple struct which holds them and write the is_production() as a member function:

template <typename I> // I models InputIterator
// require value_type(I) == code_point_t
struct parser
{
    bool is_production(T& value);

    I f_m; // first
    I l_m; // last
};

Writing the body of the production is simple. We start at the top level production rule and work our way down - in this case we are interested in parsing Name. The Name production rule looks like this:

Name ::= (Letter | '_' | ':') (NameChar)*

And we could write this production something like the following:

bool is_name(string16_t& s)
{
    string16_t   result;
    code_point_t c;

    /* (Letter | '_' | ':') */
    if (is_letter(c) || is_match('_', c) || is_match(':', c)) ; else return false;

    result += c;

    /* (NameChar)* */
    while(is_name_char(c)) result += c;

    s = result;
    return true;
}

Here string16_t is our token type. For this problem, we don't actually care what the token value is - and constructing the token is relatively expensive. If we were writing a full-blown XML parser we would spend some energy on dealing with tokens efficiently. This code also doesn't tell us whether the name is qualified or reserved - those are attributes of our token value. We rewrite the function to drop the token and add some attributes. We'll deal with attributes in a similar fashion to the value of the production rule, except that when the production rule is recognized the attributes are accumulated (as opposed to overwritten). We don't need to pass the attribute to all production rules, only those that have attributes. Here is the definition of our attribute type:

enum xml_parse_attribute_t {
    xml_none        = 0,
    xml_qualified   = 1,
    xml_reserved    = 2
};

xml_parse_attribute_t& operator|=(xml_parse_attribute_t& r, xml_parse_attribute_t x)
{ r = xml_parse_attribute_t(r | x); return r; }

We define operator|=() on our type to make it simple to accumulate. xml_qualified will mean that the name contains a ':' and xml_reserved will mean it starts with the sequence xml. To check if the name is reserved we unroll the first couple of iterations of our loop:

bool is_name(xml_parse_attribute_t& a)
{
    code_point_t c;

    if (is_match(':', c)) { a |= xml_qualified; }
    else if (is_letter(c) || is_match('_', c)) ; else return false;
    bool reserved = c == code_point_t('X') || c == code_point_t('x');

    if (is_name_char(c, a)) ; else return true;
    reserved &= (c == code_point_t('M') || c == code_point_t('m'));

    if (is_name_char(c, a)) ; else return true;
    reserved &= (c == code_point_t('L') || c == code_point_t('l'));

    while (is_name_char(c, a)) ;

    if (reserved) a |= xml_reserved;

    return true;
}

Now we can write our remaining production rules. We'll tackle NameChar next. This is the only other production rule which needs to deal with attributes because it may contain a ':'.

NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender

The specification of the XML parser is in terms of UCS-2 code points (as opposed to characters) and this production is only going to recognize a single code point. Except for the minor complexity introduced to handle the attribute, the code is nearly identical to the production rule.

bool is_name_char(code_point_t& c, xml_parse_attribute_t& a)
{
    if (is_match(':', c)) { a |= xml_qualified; return true; }

    return is_letter(c) || is_digit(c) || is_match('.', c)
        || is_match('-', c) || is_match('_', c) || is_combining_char(c) || is_extender(c);
}

The is_letter() function is even simpler:

Letter ::= BaseChar | Ideographic
bool is_letter(code_point_t& c) { return is_base_char(c) || is_ideographic(c); }

is_match() is a helper function for terminal productions. A terminal production is one which is not defined in terms of another production; it is in the terminal productions that the interesting work is done. Here is the body:

bool is_match(code_point_t x, code_point_t& c)
{
    if (f_m == l_m || *f_m != x) return false;
    c = x; ++f_m; return true;
}

If we are at the end of our stream or the current character is not a match then return false. Otherwise we set our value to the matched character, increment the stream, and return true.

The remaining productions are all of a form which match one of a set of code points:

Ideographic ::= [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]

Some of the tables, such as the one for BaseChar, are quite large, making it prohibitively expensive to write an expression like (0x4E00U <= x && x <= 0x9FA5U) || (x == 0x3007U) || .... Instead we will create a simple lookup table. When dealing with a parser that works with a simple char type we could just build a table with 256 elements containing an enumeration indicating which production rule each character matches. A table covering all 16 bit values would be a bit large (though we could do it - 64K elements). Instead we'll create an array of values denoting the ranges which contain the code points we want to match, and search to find whether our character is in one of the ranges. To do that we'll store a semi-open range (f, l] for each character range, sorted in ascending order as (f0, l0], (f1, l1], ..., and then use lower_bound() to search the table; if we end up on an odd index (counting from 0) then the item is in a range, and on an even index it is not. This strategy means we can't represent a range which includes 0, but 0 is not a valid character in an XML production. The code for the table lookup for Ideographic is this:

bool in_ideographic(code_point_t c)
{
    static const code_point_t table[] = // ranges must be in ascending order for the binary search
    {
        0x3006U, 0x3007U, // #x3007
        0x3020U, 0x3029U, // [#x3021-#x3029]
        0x4DFFU, 0x9FA5U  // [#x4E00-#x9FA5]
    };

    return (adobe::lower_bound(table, c) - boost::begin(table)) & 0x01;
}

To transform from the closed ranges to the semi-open ranges we subtract one from the first item in the range. This code doesn't have to be a member function and is not a template - we can put these lookup tables in a .cpp file. We'll use this function in our is_ideographic() member function:

bool is_ideographic(code_point_t& c) { return is_match(in_ideographic, c); }

is_match() is another variant of our previous is_match which takes a predicate function instead of a value:

bool is_match(bool (*p)(code_point_t), code_point_t& c)
{
    if (f_m == l_m || !p(*f_m)) return false;
    c = *f_m; ++f_m; return true;
}

The remaining productions are written the same way - just with larger tables. Our parser is complete!

Finally - to make this a little simpler to invoke we write a wrapper function which just creates a temporary instance of the class and we add one last check to make sure that the valid name made up the entire range provided:

template <typename I> // I models InputIterator
// require value_type(I) == code_point_t
bool valid_name_unqualified_unreserved(I f, I l)
{
    xml_parser<I> parser = { f, l };
    xml_parse_attribute_t a = xml_none; 
    return parser.is_name(a) && parser.f_m == l && a == xml_none;
}

That's the complete code. Besides having a solution to our problem, we have a general framework for building other solutions to parsing problems and a pretty decent start that can be reused for other XML-related issues. Parsers written in this style are simple to extend even when packaged as a library: you can simply inherit from the struct and add additional productions. The code described here took very little time to write (the biggest hassle was building the larger lookup tables from the productions in the XML specification).

The complete example file can be found at http://stlab.adobe.com:8080/sandbox/papers/xml_parser_example/xml_parser_example.cpp?ac=64&rev1=3