stlab.adobe.com Adobe Systems Incorporated

A relatively lightweight and simple xml (subset) parser.

#include <adobe/xml_parser.hpp>

Public Types

typedef xml_element_proc_t callback_proc_t
typedef boost::function< bool(const token_range_t &)> preorder_predicate_t
typedef xml_lex_t::token_type token_type

Public Member Functions

const line_position_t & next_position ()
xml_parser_t & operator= (const xml_parser_t &rhs)
void parse_content ()
void parse_document ()
void parse_element_sequence ()
void set_preorder_predicate (preorder_predicate_t pred)
 xml_parser_t (const xml_parser_t &rhs)
 xml_parser_t (uchar_ptr_t first, uchar_ptr_t last, const line_position_t &position, preorder_predicate_t predicate, callback_proc_t callback, O output)
virtual ~xml_parser_t ()

Protected Member Functions

void content_callback (token_range_t &result_element, const token_range_t &old_element, const token_range_t &start_tag, const attribute_set_t attribute_set, const token_range_t &content, bool preorder_parent)
const token_type & get_token ()
bool is_attribute (token_range_t &name, token_range_t &value)
bool is_attribute_set (attribute_set_t &attribute_set)
bool is_bom (token_range_t &bom)
bool is_content (token_range_t &element)
bool is_e_tag (token_range_t &name, token_range_t &close_tag)
bool is_element (token_range_t &element)
bool is_prolog ()
bool is_token (xml_lex_token_set_t name)
bool is_token (xml_lex_token_set_t name, token_range_t &value)
bool is_xml_decl (token_range_t &xml_decl)
void putback ()
void require_token (xml_lex_token_set_t name)
void require_token (xml_lex_token_set_t name, token_range_t &value)
void throw_exception (const char *error_string)
void throw_exception (xml_lex_token_set_t found, xml_lex_token_set_t expected)

Protected Attributes

callback_proc_t callback_m
O output_m
preorder_predicate_t pred_m

Related Functions

(Note that these are not member functions.)


template<typename O >
xml_parser_t< O > make_xml_parser (uchar_ptr_t first, uchar_ptr_t last, const line_position_t &position, typename xml_parser_t< O >::preorder_predicate_t predicate, typename xml_parser_t< O >::callback_proc_t callback, O output)

Detailed Description

template<typename O>
class adobe::xml_parser_t< O >

Introduction
For an interesting number of xml applications, even the most simple conforming xml parsers are overkill. This is not a criticism of such parsers. Indeed, most such systems, expat <http://expat.sourceforge.net> for example, are excellent solutions for processing xml documents from generic or unknown sources. However, there exist a number of applications that apply strict preconditions to the xml documents they process and/or tightly control the source of their documents.
Examples of such applications include preference files (which are typically read and written by a program and not intended to be generically edited) and program configuration files (e.g. localization glossaries). These examples and the applications they represent have a number of characteristics in common:
  • The xml file is not intended to be generically editable or processed by other xml tools.
  • The application does not require a validating parser.
  • The application can require that documents are encoded with a specific character encoding (e.g. it can require utf8-encoding).
  • The application does not use "advanced" capabilities of xml documents.
For applications such as these, xml_parser_t delivers a simple and efficient solution.
xml_parser_t enforces well-formedness and provides a sax-like interface to the application. Although xml_parser_t implements much of the XML 1.1 specification <http://www.w3.org/TR/xml11>, it is not strictly a conforming parser because of these deviations:
  • Validation schemes (e.g. DTDs, XML Schemas) are not supported.
  • Documents must be encoded in an 8-bit character encoding for which US-ASCII is a proper subset (e.g. US-ASCII, UTF-8, ISO-Latin-1).
  • The parser recognizes both UTF-8 and UTF-16 BOMs at the start of a document stream, but only the UTF-8 BOM is accepted; the parser throws an exception if it encounters a UTF-16 BOM.
  • Namespaces are not supported.
  • The parser recognizes processing instructions but ignores them, neither handling processing instructions nor passing them to the application.
  • The parser recognizes comments but ignores them, neither handling comments nor passing them to the application.
In addition, an application can use xml_parser_t to parse some document fragments that may not be conforming xml documents in their own right.
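The byte-order-mark rule from the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the parser's actual code: `skip_bom` is a hypothetical helper name, and the only facts relied on are the standard BOM byte sequences (EF BB BF for UTF-8, FE FF or FF FE for UTF-16).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper, not part of xml_parser_t: returns the number of
// leading bytes to skip. A UTF-8 BOM (EF BB BF) is skipped; a UTF-16 BOM
// (FE FF or FF FE) is rejected because the data cannot be 8-bit encoded.
std::size_t skip_bom(const unsigned char* first, const unsigned char* last)
{
    assert(first <= last);
    const std::size_t size = static_cast<std::size_t>(last - first);

    if (size >= 3 && first[0] == 0xEF && first[1] == 0xBB && first[2] == 0xBF)
        return 3;

    if (size >= 2 && ((first[0] == 0xFE && first[1] == 0xFF) ||
                      (first[0] == 0xFF && first[1] == 0xFE)))
        throw std::runtime_error("UTF-16 encoded documents are not supported");

    return 0; // no BOM present
}
```

A caller would advance its start pointer by the returned count before handing the range to a parser.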
All strings xml_parser_t passes to the application are in the form of adobe::token_range_t, which are essentially const character ranges. These ranges refer to ranges within the xml data being parsed. This is a large part of how xml_parser_t delivers speed and memory efficiency, by eliminating data copying in the parser. The application is free to copy any data it requires.
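The zero-copy idea can be illustrated without the library. In the sketch below, `char_range_t` is a hypothetical stand-in for adobe::token_range_t, assumed to be a pair of pointers into the source buffer (consistent with the `.first` and `.second` usage elsewhere in this document); the helper names are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for adobe::token_range_t: two pointers into the
// buffer being parsed. The range owns nothing and copies nothing.
typedef std::pair<const char*, const char*> char_range_t;

// Build a range over a sub-span of an existing buffer.
char_range_t make_range(const char* buffer, std::size_t offset, std::size_t length)
{
    assert(buffer != 0);
    return char_range_t(buffer + offset, buffer + offset + length);
}

// Compare two ranges by content rather than by address.
bool range_equal(const char_range_t& a, const char_range_t& b)
{
    return (a.second - a.first) == (b.second - b.first) &&
           std::equal(a.first, a.second, b.first);
}

// The application copies only when it needs to keep the data.
std::string copy_to_string(const char_range_t& r)
{
    return std::string(r.first, r.second);
}
```

Because a range points into the original data, it is only valid while that buffer is alive; copying with something like `copy_to_string` is the application's job when the data must outlive the parse.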
For ease of use, applications are encouraged to use a helper function, adobe::make_xml_parser, to create xml_parser_t objects. Because xml_parser_t is a class template, parameterized on the type of its output iterator, spelling out the concrete type of a parser can be confusing or overly wordy. For most applications, using make_xml_parser is significantly easier, as in:
adobe::make_xml_parser(start_of_xml_document,
                       end_of_xml_document,
                       line_position_t("sample document"),
                       my_preorder_predicate,
                       my_content_callback,
                       my_output_iterator).parse_document();
Note that in this usage, the parser is never even stored in a local variable. Instead, the result of make_xml_parser is immediately told to parse the document. This is a very common coding pattern for creating and using xml_parser_t.
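Why a helper function is easier than naming the class directly can be shown with a toy. The sketch below assumes nothing about the real implementation: `toy_parser_t` and `make_toy_parser` are hypothetical names, and the "parser" merely copies its input. The point is only that the function template deduces the output iterator type so the caller never writes it.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Toy class templated on its output iterator, loosely mirroring the shape of
// xml_parser_t<O>. It is NOT the real parser: it just copies its input.
template <typename O>
class toy_parser_t
{
public:
    toy_parser_t(const char* first, const char* last, O output)
        : first_m(first), last_m(last), output_m(output) {}

    void parse_content()
    {
        for (const char* p = first_m; p != last_m; ++p)
            *output_m++ = *p;
    }

private:
    const char* first_m;
    const char* last_m;
    O           output_m;
};

// Helper in the spirit of make_xml_parser: O is deduced from the argument.
template <typename O>
toy_parser_t<O> make_toy_parser(const char* first, const char* last, O output)
{
    return toy_parser_t<O>(first, last, output);
}

std::string copy_through(const std::string& input)
{
    std::string result;
    const char* first = input.data();
    // Without the helper the type would have to be spelled out:
    //   toy_parser_t<std::back_insert_iterator<std::string> > parser(...);
    make_toy_parser(first, first + input.size(), std::back_inserter(result)).parse_content();
    return result;
}
```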
Basic Parsing Model
xml_parser_t uses two application callbacks and one app-provided object, an OutputIterator, to process xml data. One app callback is called the content callback and allows the application to process the data of and in a single element. The second callback, called the preorder predicate, allows the application to inform the parser that a given element needs to be processed by the app's content callback, as opposed to being processed by the parser itself. In the following discussion, we will refer to this sample xml document:
<parent id="node1">
	<simple-child id="node2"/>
	<complex-child id="node3">
		<grandchild id="node4"/>
	</complex-child>
</parent>
xml_parser_t identifies the range corresponding to a given element and its contents. In the example document above, the first element encountered is the parent element, which encompasses the entire document. The app's preorder predicate is called with the name of the element, "parent" in the example.
When the preorder predicate returns true for a given element, then the app's content callback is invoked, passing the name of the element, its attributes (collected into an adobe::attribute_set_t), and its contents. Continuing our example, the content callback would be given the name "parent", an attribute set corresponding to { ("id","node1") }, and a content range that begins after the closing '>' character in the start tag and ends at the '<' character in the "</parent>" tag.
When the preorder predicate returns false for a given element, xml_parser_t copies the element's tag to the output iterator if the tag is an empty tag (i.e. one closed with the "/>" sequence, as in <empty-tag/>). When the element contains content, xml_parser_t recursively creates a parser to process the contents of the element, reusing its same preorder predicate, content callback, and output iterator for the new parser.
When it is invoked, the application's content callback is responsible for processing the content of the indicated element and returning the token_range_t that should be copied to the output iterator. This range can be empty if nothing should be copied to the output iterator. Alternatively, the callback can return the original content range or a completely altered range, as appropriate.
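The control flow described above can be modelled on a tiny pre-parsed tree. This is a sketch assuming C++11 and a simplified model, not the parser's implementation: `element_t`, `predicate_t`, `callback_t`, and `process` are hypothetical names, strings replace token ranges, and re-emitting the start and end tags of a non-empty element is an assumption of the sketch (the text above only states that empty tags are copied).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical pre-parsed element, used only to illustrate the control flow.
struct element_t
{
    std::string            name;
    std::vector<element_t> children;
};

typedef std::function<bool(const std::string&)>      predicate_t;
typedef std::function<std::string(const element_t&)> callback_t;

// Predicate true  -> the callback owns the element and returns what to emit.
// Predicate false -> empty tags are copied; non-empty elements are processed
//                    recursively with the same predicate and callback.
void process(const element_t& e, const predicate_t& pred, const callback_t& callback,
             std::string& output)
{
    if (pred(e.name))
    {
        output += callback(e);
    }
    else if (e.children.empty())
    {
        output += "<" + e.name + "/>";
    }
    else
    {
        output += "<" + e.name + ">"; // assumption: enclosing tags are re-emitted
        for (std::size_t i = 0; i != e.children.size(); ++i)
            process(e.children[i], pred, callback, output);
        output += "</" + e.name + ">";
    }
}
```

Applied to the sample document with a predicate that accepts only "grandchild", the callback sees exactly one element and everything else flows through to the output.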
From this basic parsing model, a wide variety of simple xml applications can be written, although the combinatorics of applying the three concepts from which the parser is built can be daunting. Luckily, just about all parsers fall into one of two categories for which we can describe a basic usage pattern that makes the job significantly easier. Those two categories are Document Filters and Command Processors.
Building Document Filters
A document filter is fundamentally an application that consumes a stream of characters containing markup and produces a stream of characters derived from that input stream. Such an application could be as complex as an XSLT implementation or as simple as a string localization application. In fact, adobe::xstring is a good example of a document filter that uses xml_parser_t.
Document filters use the parser's output iterator as the location to which the output (or processed) data is written. As such, it is common to use std::ostream_iterator or std::back_inserter(my_string) as the output iterator when creating parsers for filter applications.
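Both iterator choices mentioned above can be seen side by side in a short sketch. The `emit` function is a hypothetical stand-in for whatever writes processed characters; only standard library facilities are used.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <sstream>
#include <string>

// Any OutputIterator that accepts char works as a destination.
template <typename O>
O emit(const char* first, const char* last, O output)
{
    return std::copy(first, last, output);
}

// Destination 1: append to a string through std::back_inserter.
std::string emit_to_string(const std::string& text)
{
    std::string result;
    emit(text.data(), text.data() + text.size(), std::back_inserter(result));
    return result;
}

// Destination 2: write to a stream through std::ostream_iterator.
std::string emit_to_stream(const std::string& text)
{
    std::ostringstream stream;
    emit(text.data(), text.data() + text.size(), std::ostream_iterator<char>(stream));
    return stream.str();
}
```

The writing code is identical in both cases; only the iterator passed in decides where the output lands.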
A filter's preorder predicate returns true for any element that the filter wishes to change in the output stream. The filter's content callback examines the content of each such element and provides replacement text to be output. For example, consider a case in which an attribute of certain tags is used as a key to look up replacement text. Such an application could be written like so:
static const token_range_t target_tag_k( static_token_range("replace-me") );

token_range_t lookup_replacement_text(const token_range_t&);

bool my_preorder_predicate(const token_range_t& tag_name)
{
	return token_range_equal(tag_name, target_tag_k);
}

token_range_t my_content_callback(
	const token_range_t&     /* entire_element_range */,
	const token_range_t&     /* name */,
	const attribute_set_t&   attribute_set,
	const token_range_t&     value)
{
	static const token_range_t id_attr_k(  static_token_range("id") );

	const token_range_t id( attribute_set[id_attr_k] );
	if (0 == boost::size(id))
		throw std::runtime_error("replace-me tags require an id attribute");

	return lookup_replacement_text(id);
}
With this application (and appropriate pre-population of a replacement dictionary), xml input data like this
Dear <replace-me id="their-name"/>,
Thank you for your recent letter of <replace-me id="date"/>. Yadda Yadda Yadda.
Sincerely,
<replace-me id="my-name"/>
might come out looking like this
Dear Mr. Smith,
Thank you for your recent letter of 17 June. Yadda Yadda Yadda.
Sincerely,
John Q. Public
An application like that described above might be initiated with a function written like this:
std::string perform_markup_replacement(const std::string& input)
{
	std::string result;
	
	make_xml_parser(
			input.begin(), input.end(),
			line_position_t("markup replacement string"),
			my_preorder_predicate,
			my_content_callback,
			std::back_inserter(result)).parse_content();

	return result;
}
Note that we use std::back_inserter(result) for the output iterator to which the processed output is written. This has the effect of storing the result in a new string to be returned from the function.
Note also that the function calls the parser's parse_content member function. This is because our source data is not an xml document. Specifically, it is not well-formed (e.g. it is missing a root element). However, the data is valid xml content (i.e. if one wrapped it with a start and end tag, the result could be a well-formed document). Thus, the data can be processed via parse_content.
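For readers who want to experiment with the letter example without the library, here is a self-contained toy that produces the same kind of substitution. It deliberately does not use xml_parser_t: `replace_markup` is a hypothetical function that only understands empty tags of the exact form `<replace-me id="key"/>`, and the dictionary is an ordinary std::map.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Toy stand-in for the filter above; it does NOT use xml_parser_t.
std::string replace_markup(const std::string& input,
                           const std::map<std::string, std::string>& dictionary)
{
    static const std::string open_k("<replace-me id=\"");
    static const std::string close_k("\"/>");

    std::string            result;
    std::string::size_type pos = 0;

    for (;;)
    {
        const std::string::size_type tag = input.find(open_k, pos);
        if (tag == std::string::npos)
            break;

        result.append(input, pos, tag - pos); // character data passes through

        const std::string::size_type key_begin = tag + open_k.size();
        const std::string::size_type key_end   = input.find(close_k, key_begin);
        if (key_end == std::string::npos)
            throw std::runtime_error("unterminated replace-me tag");

        const std::map<std::string, std::string>::const_iterator entry =
            dictionary.find(input.substr(key_begin, key_end - key_begin));
        if (entry == dictionary.end())
            throw std::runtime_error("no replacement text for id");

        result += entry->second; // the replacement text is emitted instead of the tag
        pos = key_end + close_k.size();
    }

    result.append(input, pos, std::string::npos); // trailing character data
    return result;
}
```

Unlike the real parser, this toy performs no well-formedness checking and ignores every other tag; it only mirrors the input-to-output behaviour of the example.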
Building Command Processors
A command processor is a more generic model of an xml application than a document filter in which the application performs some action in response to each element in a document. The net result is often the creation or annotation of some program structure (e.g. modifying application preferences) or performing some structured program function (e.g. drawing graphics).
By their nature, command processors are often unconcerned with the output iterator associated with the parser object. Similarly, command processors typically perform some action with each tag in the document. As it happens, helpers exist to trivialize each function: adobe::always_true and adobe::implementation::null_output_t. [The author concedes that clients are encouraged to avoid direct use of things in the adobe::implementation namespace, but null_output_t is just too damned useful to avoid in this case.] Using these helpers, command processors can trivially ignore both the output iterator and preorder predicate and can concentrate on the content callback, where all the action of a command processor typically takes place.
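To make the role of these two helpers concrete, the sketch below shows what such helpers might look like. These are hypothetical re-creations written for illustration (`accept_all`, `discard_output_t`, and `write_all` are invented names), not the definitions found in the library.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical predicate in the spirit of adobe::always_true.
template <typename T>
struct accept_all
{
    bool operator()(const T&) const { return true; }
};

// Hypothetical output iterator in the spirit of null_output_t:
// every value assigned through it is silently discarded.
struct discard_output_t
{
    typedef std::output_iterator_tag iterator_category;
    typedef void                     value_type;
    typedef void                     difference_type;
    typedef void                     pointer;
    typedef void                     reference;

    discard_output_t& operator*()     { return *this; }
    discard_output_t& operator++()    { return *this; }
    discard_output_t  operator++(int) { return *this; }

    template <typename T>
    discard_output_t& operator=(const T&) { return *this; }
};

// Writes a range through any output iterator and reports how many
// characters were offered to it.
template <typename O>
std::size_t write_all(const char* first, const char* last, O output)
{
    std::size_t written = 0;
    for (; first != last; ++first, ++written)
        *output++ = *first;
    return written;
}
```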
Consider a simple document that describes a piece of graphics:
<canvas>
	<rect sides="0 0 100 100"/>
	<circle center="5 5" radius="10"/>
	<polygon vertices="1 2 6 8 1 8"/>
</canvas>
Assuming that the parser's preorder predicate can be convinced to return true for all tags in the document (we'll do this later), this document can be parsed with a simple content callback.
token_range_t my_content_callback(
	const token_range_t&     /* entire_element_range */,
	const token_range_t&     name,
	const attribute_set_t&   attribute_set,
	const token_range_t&     value,
	graphic_context_t&		 graphics)
{
	static const token_range_t canvas_tag_k(  static_token_range("canvas") );
	static const token_range_t rect_tag_k(  static_token_range("rect") );
	static const token_range_t circle_tag_k(  static_token_range("circle") );
	static const token_range_t polygon_tag_k(  static_token_range("polygon") );

	if (token_range_equal(canvas_tag_k, name))
	{
		make_xml_parser(value.first, value.second,
			line_position_t("canvas"),
			adobe::always_true<token_range_t>(),
			boost::bind(my_content_callback, _1, _2, _3, _4, boost::ref(graphics)),
			adobe::null_output_t()).parse_content();
	}
	else if (token_range_equal(rect_tag_k, name))
	{
		if (0 != boost::size(value))
			throw std::runtime_error("rect elements must be empty");
			
		draw_rectangle(attribute_set, graphics);
	}
	else if (token_range_equal(circle_tag_k, name))
	{
		if (0 != boost::size(value))
			throw std::runtime_error("circle elements must be empty");

		draw_circle(attribute_set, graphics);
	}
	else if (token_range_equal(polygon_tag_k, name))
	{
		if (0 != boost::size(value))
			throw std::runtime_error("polygon elements must be empty");

		draw_polygon(attribute_set, graphics);
	}
	else
	{
		throw std::runtime_error("encountered unrecognized tag");
	}

	return token_range_t();
}

void draw_graphics(const std::string& xml_shape, graphic_context_t& graphics)
{
	make_xml_parser(xml_shape.begin(), xml_shape.end(),
		line_position_t("xml shape"),
		adobe::always_true<token_range_t>(),
		boost::bind(my_content_callback, _1, _2, _3, _4, boost::ref(graphics)),
		adobe::null_output_t()).parse_document();
}
With this content callback function and external entry function, we have implemented a simple system that draws graphics based on the contents of an xml document and provides a modicum of error checking on the document contents (e.g. tags that are not allowed to have complex content confirm this fact). One potential problem is that this application does not guarantee that the root element is a canvas. Indeed, a document with a single <rect ... /> element is conformant with the application, as written above. This may either be good or bad, depending on the design of your specific application.
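The drawing helpers in the example (draw_rectangle and friends) are left undefined, but each would need to turn an attribute value such as sides="0 0 100 100" into numbers. One way that step might look is sketched below; `parse_numbers` is a hypothetical helper and takes a std::string rather than a token range for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: split a whitespace-separated attribute value such as
// sides="0 0 100 100" into numbers, rejecting anything that is not numeric.
std::vector<double> parse_numbers(const std::string& attribute_value,
                                  std::size_t        expected_count)
{
    std::istringstream  stream(attribute_value);
    std::vector<double> numbers;
    double              value;

    while (stream >> value)
        numbers.push_back(value);

    // Extraction stops either at end of input or at a token it cannot read.
    if (!stream.eof())
        throw std::runtime_error("attribute value contains a non-numeric token");
    if (numbers.size() != expected_count)
        throw std::runtime_error("attribute value has the wrong number of entries");

    return numbers;
}
```

Validating the count here extends the error checking already performed on element content to the attribute values themselves.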
Although the implementation above is rather simple and straightforward, command processors are often more complex. Documents processed by command processors often contain sub-structures that must themselves be parsed. Consider this, slightly more complex, version of our example graphics document:
<canvas>
	<rect sides="0 0 100 100"/>
	<circle center="5 5" radius="10"/>
	<group translation="5 10">
		<polygon>
			<vertex xy="1 2"/>
			<vertex xy="6 8"/>
			<vertex xy="1 8"/>
		</polygon>
		<rect sides="3 3 10 10"/>
	</group>
</canvas>
Here, the polygon element's vertices are elements within the polygon's content instead of attributes. This requires polygon's content to be parsed to create an appropriate primitive to draw. Similarly, the document grammar has added a group element that groups primitives together within a coordinate transformation.
As it happens, these changes do not significantly complicate our code. Rather, they encourage us to refactor the code into sets of composable functions. As we modify our system to accommodate these changes, we will add a correctness check that the root element is what we expect.
token_range_t my_polygon_callback(
	const token_range_t&     /* entire_element_range */,
	const token_range_t&     name,
	const attribute_set_t&   attribute_set,
	const token_range_t&     /* value */,
	polygon_t&				 polygon)
{
	static const token_range_t vertex_tag_k(  static_token_range("vertex") );

	if (token_range_equal(vertex_tag_k, name))
	{
		polygon.add_vertex( make_vertex(attribute_set) );
	}
	else
	{
		throw std::runtime_error("encountered unexpected tag inside polygon content");
	}

	return token_range_t();
}

token_range_t my_group_callback(
	const token_range_t&     /* entire_element_range */,
	const token_range_t&     name,
	const attribute_set_t&   attribute_set,
	const token_range_t&     value,
	graphic_context_t&		 graphics)
{
	static const token_range_t rect_tag_k(  static_token_range("rect") );
	static const token_range_t circle_tag_k(  static_token_range("circle") );
	static const token_range_t group_tag_k(  static_token_range("group") );
	static const token_range_t polygon_tag_k(  static_token_range("polygon") );

	if (token_range_equal(group_tag_k, name))
	{
		graphic_context_t translated_graphics( graphics, attribute_set );

		make_xml_parser(value.first, value.second,
			line_position_t("group"),
			adobe::always_true<token_range_t>(),
			boost::bind(my_group_callback, _1, _2, _3, _4, boost::ref(translated_graphics)),
			adobe::null_output_t()).parse_content();
	}
	else if (token_range_equal(rect_tag_k, name))
	{
		if (0 != boost::size(value))
			throw std::runtime_error("rect elements must be empty");
			
		draw_rectangle(attribute_set, graphics);
	}
	else if (token_range_equal(circle_tag_k, name))
	{
		if (0 != boost::size(value))
			throw std::runtime_error("circle elements must be empty");

		draw_circle(attribute_set, graphics);
	}
	else if (token_range_equal(polygon_tag_k, name))
	{
		polygon_t polygon;

		make_xml_parser(value.first, value.second,
			line_position_t("polygon"),
			adobe::always_true<token_range_t>(),
			boost::bind(my_polygon_callback, _1, _2, _3, _4, boost::ref(polygon)),
			adobe::null_output_t()).parse_content();

		draw_polygon(polygon, graphics);
	}
	else
	{
		throw std::runtime_error("encountered unrecognized tag in group");
	}

	return token_range_t();
}

token_range_t my_canvas_callback(
	const token_range_t&     /* entire_element_range */,
	const token_range_t&     name,
	const attribute_set_t&   attribute_set,
	const token_range_t&     value,
	graphic_context_t&		 graphics)
{
	static const token_range_t canvas_tag_k(  static_token_range("canvas") );

	if (token_range_equal(canvas_tag_k, name))
	{
		make_xml_parser(value.first, value.second,
			line_position_t("canvas"),
			adobe::always_true<token_range_t>(),
			boost::bind(my_group_callback, _1, _2, _3, _4, boost::ref(graphics)),
			adobe::null_output_t()).parse_content();
	}
	else
	{
		throw std::runtime_error("encountered unrecognized tag in document");
	}

	return token_range_t();
}

void draw_graphics(const std::string& xml_shape, graphic_context_t& graphics)
{
	make_xml_parser(xml_shape.begin(), xml_shape.end(),
		line_position_t("xml shape"),
		adobe::always_true<token_range_t>(),
		boost::bind(my_canvas_callback, _1, _2, _3, _4, boost::ref(graphics)),
		adobe::null_output_t()).parse_document();
}
Note that this solution guarantees that the root element is a canvas, accommodates groups that contain graphic primitives and other groups, and parses elements within polygon content as vertices. An interesting side effect of the refactoring that produced this system is that the functions tend to be smaller and more composable than the single monolithic function in our first example.
And because xml_parser_t is an exceptionally lightweight object, the overhead of creating sub-parsers for each specific piece of content is trivial for all but the most stack-restricted programs.
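The callbacks in this section take a fifth context argument that is bound before the callback is handed to a parser. That binding step can be tried in isolation. The sketch below swaps in std::bind and std::ref (C++11) for the boost::bind and boost::ref used in the examples, and simplifies every range to std::string; `recorder_t`, `record_callback`, and `bind_recorder` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical context object, standing in for the graphics context.
struct recorder_t
{
    std::vector<std::string> commands;
};

// A five-argument callback: the first four mirror the shape used in this
// document (simplified to std::string); the fifth is the shared context.
std::string record_callback(const std::string& /* entire_element */,
                            const std::string& name,
                            const std::string& /* attributes */,
                            const std::string& /* value */,
                            recorder_t&        recorder)
{
    recorder.commands.push_back(name);
    return std::string(); // nothing to copy to the output
}

typedef std::function<std::string(const std::string&, const std::string&,
                                  const std::string&, const std::string&)>
    four_arg_callback_t;

// Bind the context by reference so the four-argument caller never sees it.
four_arg_callback_t bind_recorder(recorder_t& recorder)
{
    using namespace std::placeholders;
    return std::bind(record_callback, _1, _2, _3, _4, std::ref(recorder));
}
```

Binding by reference matters: without std::ref the context would be copied into the bound object, and changes made by the callback would be lost to the caller.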

Definition at line 442 of file xml_parser.hpp.


Member Typedef Documentation

typedef xml_element_proc_t callback_proc_t

Definition at line 445 of file xml_parser.hpp.

typedef boost::function<bool (const token_range_t&)> preorder_predicate_t

Definition at line 446 of file xml_parser.hpp.

typedef xml_lex_t::token_type token_type

Definition at line 447 of file xml_parser.hpp.


Constructor & Destructor Documentation

xml_parser_t ( uchar_ptr_t  first,
uchar_ptr_t  last,
const line_position_t &  position,
preorder_predicate_t  predicate,
callback_proc_t  callback,
O  output 
)

Definition at line 449 of file xml_parser.hpp.

xml_parser_t ( const xml_parser_t< O > &  rhs )

Definition at line 462 of file xml_parser.hpp.

virtual ~xml_parser_t (  ) [virtual]

Definition at line 481 of file xml_parser.hpp.


Member Function Documentation

void content_callback ( token_range_t &  result_element,
const token_range_t &  old_element,
const token_range_t &  start_tag,
const attribute_set_t  attribute_set,
const token_range_t &  content,
bool  preorder_parent 
) [protected]

Definition at line 735 of file xml_parser.hpp.

const token_type& get_token (  ) [protected]

Definition at line 578 of file xml_parser.hpp.

bool is_attribute ( token_range_t &  name,
token_range_t &  value 
) [protected]

Definition at line 1130 of file xml_parser.hpp.

bool is_attribute_set ( attribute_set_t &  attribute_set ) [protected]

Definition at line 1033 of file xml_parser.hpp.

bool is_bom ( token_range_t &  bom ) [protected]

Definition at line 1071 of file xml_parser.hpp.

bool is_content ( token_range_t &  element ) [protected]

Definition at line 908 of file xml_parser.hpp.

bool is_e_tag ( token_range_t &  name,
token_range_t &  close_tag 
) [protected]

Definition at line 1019 of file xml_parser.hpp.

bool is_element ( token_range_t &  element ) [protected]

Definition at line 788 of file xml_parser.hpp.

bool is_prolog (  ) [protected]

Definition at line 1047 of file xml_parser.hpp.

bool is_token ( xml_lex_token_set_t  name,
token_range_t &  value 
) [protected]

Definition at line 677 of file xml_parser.hpp.

bool is_token ( xml_lex_token_set_t  name ) [protected]

Definition at line 696 of file xml_parser.hpp.

bool is_xml_decl ( token_range_t &  xml_decl ) [protected]

Definition at line 1114 of file xml_parser.hpp.

const line_position_t& next_position (  )

Definition at line 484 of file xml_parser.hpp.

xml_parser_t& operator= ( const xml_parser_t< O > &  rhs )

Definition at line 470 of file xml_parser.hpp.

void parse_content (  )

Parses the content range as the content of an xml element.

This function is most useful when invoking a sub-parser within an application's content callback function.
Example:
Consider this document:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
    <content>sample document content</content>
</root>
An application that wished to enforce the strict structure of the document could use a content callback like the following to ensure that the root element of the document is "root", while also processing the content of the root element.
token_range_t top_level_callback(
                                  const token_range_t&     entire_element_range,
                                  const token_range_t&     name,
                                  const attribute_set_t&   attribute_set,
                                  const token_range_t&     value)
{
    assert(token_range_equal(name, static_token_range("root")));
    
    make_xml_parser(value.first, value.second,
                           line_position_t("top_level_callback"),
                           always_true<token_range_t>(),
                           root_callback,
                           null_output_t())
        .parse_content();
    
    return token_range_t();
}
In this example, the top level callback performs a sanity check that the element it encounters is the tag "root". It then creates a new parser to process the content of the root element.
A less strict application could choose to use a single callback function for all elements or annotate a more complex data structure as the document is processed.

Definition at line 1162 of file xml_parser.hpp.

void parse_document (  )

Parses the content range as a well-formed xml document.

Definition at line 1190 of file xml_parser.hpp.

void parse_element_sequence (  )

Parses the content range as a sequence of xml elements. Each element encountered in the content range is processed by the application. Character data between top-level elements in the content range is ignored by the parser and is not processed.

Example:
Consider this content range:
<top-level type="simple">element 1</top-level>

these characters are ignored

<top-level type="complex">element 2<embedded/></top-level>
Parsing this content range as an element sequence yields two top-level elements, one of which contains embedded elements. Each top-level element is processed by the application according to the application's preorder predicate and content callback functions. The content between the two top-level elements is ignored.
Element sequences can be useful for preference sets or other simple data that do not need significant structure (e.g. as a full document would contain).
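The "top-level elements only" behaviour can be imitated with a small scanner. The sketch below is a toy under stated assumptions, not how parse_element_sequence is implemented: `top_level_elements` is a hypothetical function that tracks nesting depth by looking at tags alone, and it does not handle comments, processing instructions, or '>' characters inside attribute values.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Toy scanner: returns the text of each top-level element and skips any
// character data found between them.
std::vector<std::string> top_level_elements(const std::string& input)
{
    std::vector<std::string> elements;
    std::string::size_type   start = std::string::npos;
    std::string::size_type   pos   = 0;
    int                      depth = 0;

    while ((pos = input.find('<', pos)) != std::string::npos)
    {
        const std::string::size_type end = input.find('>', pos);
        if (end == std::string::npos)
            throw std::runtime_error("unterminated tag");

        const bool is_end_tag   = input[pos + 1] == '/';
        const bool is_empty_tag = input[end - 1] == '/';

        if (is_end_tag)
        {
            if (depth == 0)
                throw std::runtime_error("end tag without a start tag");
            --depth;
        }
        else
        {
            if (depth == 0)
                start = pos; // a new top-level element begins here
            if (!is_empty_tag)
                ++depth;
        }

        if (depth == 0) // the current top-level element just closed
            elements.push_back(input.substr(start, end + 1 - start));

        pos = end + 1;
    }

    if (depth != 0)
        throw std::runtime_error("missing end tag");

    return elements;
}
```

Run on the content range shown above, this yields two strings, one per top-level element, with the embedded element left inside the second.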

Definition at line 1147 of file xml_parser.hpp.

void putback (  ) [protected]

Definition at line 580 of file xml_parser.hpp.

void require_token ( xml_lex_token_set_t  name,
token_range_t &  value 
) [protected]

Definition at line 711 of file xml_parser.hpp.

void require_token ( xml_lex_token_set_t  name ) [protected]

Definition at line 724 of file xml_parser.hpp.

void set_preorder_predicate ( preorder_predicate_t  pred )

Allows the client to specify a different preorder predicate after object instantiation.

Parameters:
[in] pred    predicate that indicates whether the client wants a given element to be parsed pre-order or in-order

Definition at line 492 of file xml_parser.hpp.

void throw_exception ( xml_lex_token_set_t  found,
xml_lex_token_set_t  expected 
) [protected]

Definition at line 592 of file xml_parser.hpp.

void throw_exception ( const char *  error_string ) [protected]

Definition at line 590 of file xml_parser.hpp.


Friends And Related Function Documentation

xml_parser_t< O > make_xml_parser ( uchar_ptr_t  first,
uchar_ptr_t  last,
const line_position_t &  position,
typename xml_parser_t< O >::preorder_predicate_t  predicate,
typename xml_parser_t< O >::callback_proc_t  callback,
O  output 
) [related]

Create an object that will parse the indicated content range using the preorder and content functions indicated.

Parameters:
[in] first       the start of the content range (analogous to a begin iterator)
[in] last        the end of the content range (analogous to an end iterator)
[in] position    an annotation of the line number at which the content range begins. Used when errors are encountered while parsing the content range.
[in] predicate   a predicate that indicates whether the application's content callback will be called pre-order or in-order for a given element
[in] callback    the application's content callback function
[in] output      an object that models OutputIterator to which the parser will insert the result of processing the content range
Returns:
an xml parser object that will process the indicated content range using the indicated application callbacks

Definition at line 1222 of file xml_parser.hpp.


Member Data Documentation

callback_proc_t callback_m [protected]

Definition at line 612 of file xml_parser.hpp.

O output_m [protected]

Definition at line 613 of file xml_parser.hpp.

preorder_predicate_t pred_m [protected]

Definition at line 611 of file xml_parser.hpp.

Copyright © 2006-2007 Adobe Systems Incorporated.
