Are you using a real XML parser

Recently there’s been a bunch of announcements of new XML parsers that claim to be very fast, very small, or both. I also see a lot of people get very enthusiastic about using them in their applications. Just the other day I got an email from a user asking if it was possible to use CodeSynthesis XSD with a light-weight XML parser that he had found instead of Xerces-C++. Out of curiosity I checked the parser’s description and saw the traits common to most new, fast, and small XML parsers these days: no support for the DTD (internal subset) and CDATA sections, and limited support for character and entity references.

Once I tell people about these problems with their choice of XML parser, some say they don’t care since nobody uses these features anyway. It is true that most of these features are seldom used. You can get away with using a non-conforming XML parser if you control the production of the XML you are planning to parse and can thus restrict the set of XML constructs that appear in your documents. This, however, is not a very common situation. If you control both the production and consumption of the data, then you might as well choose an exchange format that is more natural for your application and environment, and more efficient (which is presumably why you chose this new parser in the first place), than XML. A more common scenario is having to parse XML supplied by various third parties, and once you say your format is XML, all bets are off: it is only a matter of time before someone sends you a perfectly valid XML document that your application won’t be able to handle.

The W3C XML specification defines two types of conforming XML parsers: validating and non-validating (see Section 5.1, “Validating and Non-Validating Processors”). Any parser that claims to be capable of parsing XML 1.0 documents must at least satisfy the non-validating parser requirements. Besides the expected things, like being able to parse elements and attributes and making sure they are well-formed, a conforming non-validating XML parser must also support the following:

  • At least UTF-8 and UTF-16 character encodings
  • CDATA sections (<![CDATA[<greeting>Hello, world!</greeting>]]>)
  • Character references (&#x20;)
  • Entity references including predefined (&amp;) and user-defined in the internal DTD subset
  • Parse the internal DTD subset and check it for well-formedness
  • Normalize and supply default attribute values according to the internal DTD subset

The internal DTD subset consists of the DTD declarations found in the XML document itself, as opposed to the external subset, which consists of declarations placed in separate DTD files and referenced from XML documents. Here is a sample document that uses most of the above features (download it to test your favorite parser: test.xml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE hello [
<!ENTITY greeting-text "hello">
<!ATTLIST greeting lang NMTOKEN "en">
<!ATTLIST name lang NMTOKEN "en">
]>
<hello>
  <greeting>&greeting-text;</greeting>
  <name lang="  fr  ">tout&#x20;le&#x20;<![CDATA[monde]]></name>
</hello>

Parsing this document with a conforming non-validating XML parser should be equivalent to parsing the following document:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<hello>
  <greeting lang="en">hello</greeting>
  <name lang="fr">tout le monde</name>
</hello>

Here is the list of C and C++ XML parsers that are either conforming or strive to be conforming:

  • Xerces-C++: C++, DOM, SAX2, validating (DTD and XML Schema)
  • Libxml2: C, DOM, SAX2, Pull, validating (DTD)
  • Expat: C, SAX2, non-validating, small & fast
  • Faxpp: C, Pull, non-validating (no default/normalized attributes), small & fast

And here is a list of parsers that, while calling themselves XML parsers, are actually parsers for markup languages that are subsets of XML (based on their descriptions at the time of writing):

  • VTD-XML
  • Rapidxml
  • TinyXML
  • XML parser provided by applied-mathematics.net

8 Responses to “Are you using a real XML parser”

  1. John Snelson Says:

    Everyone (including me) thinks that writing an XML parser will be relatively simple when they start off. However, parsing elements and attributes is probably only a quarter of the work needed to get things like DTD parsing and entity replacement done. It took a long time for me to even understand how entity replacement was meant to work!

    http://snelson.org.uk/archives/2008/03/fast_xml_pull_p.php

    However, I now think that the greatest barrier to writing faster XML parsers is checking that every character in the document matches the “Char” grammar production:

    http://www.w3.org/TR/2006/REC-xml-20060816/#charsets

    I don’t see any way to do this other than a hash-lookup or binary search for every relevant character - and that’s not going to be fast.

  2. boris Says:

    Hm, not sure I understand what you mean here. The rule specifies three individual characters and three ranges which should be pretty straightforward and fast to check with simple if() statements. This will probably get a bit slower if the document (and application) encoding is, say, UTF-8 and the parser needs to decode each multi-byte sequence before it can check the rule.

    I’ve seen the NameChar rule implemented as a lookup table, though.
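    For reference, the simple-if() version of the check looks something like this (a sketch in Python over already-decoded code points):

```python
def is_xml_char(cp):
    """Check a code point against the XML 1.0 'Char' production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    Worst case: 9 comparisons (3 single characters + 3 ranges)."""
    return (cp == 0x9 or cp == 0xA or cp == 0xD
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

print(is_xml_char(ord("a")))  # True
print(is_xml_char(0x0B))      # False (vertical tab is not an XML character)
print(is_xml_char(0xFFFE))    # False (noncharacter)
```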

  3. John Snelson Says:

    When parsing XML documents, the performance bottleneck is usually the part which reads the character content of elements. The slowest operation that a conforming XML parser has to perform in that inner loop (besides the mandatory memory read to get the character) is the check against the “Char” grammar production.

    To check the “Char” production using simple if() statements, you need 9 comparisons (2 comparisons for each range, 1 for each single char). A binary search over the grammar production as sorted ranges only requires worst case 4 comparisons. Even so, 4 comparisons in the inner loop is a lot of extra overhead.

  4. Lars D Says:

    It doesn’t make sense to make an application that can parse any XML file, unless it’s a generic XML tool. It makes much more sense to specify that you use XML, and then some more information. For instance, you could specify that you’re using Google’s KML format.

    In the same way, many business applications can specify that they’re exporting XML data with specific tags and a specific character set. In that case, your application only needs to support those specific XML tags and that character set, and then it makes sense to use an XML parser with only certain features, if it performs significantly better than a standards-complying XML parser.

  5. Boris Kolpackov Says:

    Lars, the mechanisms I described (like entity references, CDATA sections, DTD) are available to any XML vocabulary. They are there to allow you to efficiently encode your data in XML.

    I have never seen an XML vocabulary specification that says something like “You cannot use entity references.” That’s also the reason why vocabulary specification languages (e.g., DTD, XML Schema) don’t have any mechanisms to control the use of these features.

  6. Lars D Says:

    I have, many times, in formats made by many different organizations.

  7. Boris Kolpackov Says:

    Lars, are any of these cases perhaps open standards that you can refer me to? I truly have never seen anybody restricting the physical XML representation mechanisms offered by the XML spec.

  8. Lars D Says:

    Google gives a few examples, but here is one from Apple, which strongly recommends not using CDATA:

    http://www.apple.com/dk/itunes/store/podcaststechspecs.html

    However, most examples that I have seen are unpublished standards used internally between vendors in large organizations. They’re basically open, but only relevant to two or three companies and therefore not published.