May 19th, 2008
Recently there’s been a bunch of announcements of new XML parsers that claim to be very fast, very small or both. I also see a lot of people get very enthusiastic about using them in their applications. Just the other day I got an email from a user asking if it was possible to use CodeSynthesis XSD with a light-weight XML parser that he found instead of Xerces-C++. Out of curiosity I checked the parser’s description and there I saw a number of common traits of most new, fast, and small XML parsers these days: no support for DTD (internal subset) and CDATA sections, limited support for character and entity references.
Once I tell people about these problems with their choice of XML parser, some say they don’t care since nobody uses these features anyway. It is true that most of these features are seldom used. You can get away with using a non-conforming XML parser if you control the production of XML you are planning to parse and thus restrict the set of XML constructs that can appear in your documents. This, however, is not a very common situation. If you control both the production and consumption of the data then you might as well choose a more natural (for your application and environment) and efficient (that’s why choose this new parser) exchange format than XML. A more common scenario is when you have to parse XML supplied by various third parties and once you say your format is XML then all bets are off; it is only a matter of time before someone sends you a perfectly valid XML document that your application won’t be able to handle.
The W3C XML specification defines two types of conforming XML parsers: validating and non-validating (see Section 5.1, “Validating and Non-Validating Processors”). Any parser that wants to be called capable of parsing XML 1.0 documents must at least satisfy the non-validating parser requirements. Besides the expected things like being able to parser elements and attributes as well as making sure they are well-formed, a conforming non-validating XML parser should also support the following:
- At least UTF-8 and UTF-16 character encodings
- CDATA sections (<![CDATA[<greeting>Hello, world!</greeting>]]>)
- Character references ( )
- Entity references including predefined (&) and user-defined in the internal DTD subset
- Parse and check for well-formedness the internal DTD subset
- Normalize and supply default attribute values according to the internal DTD subset
The internal DTD subset consist of the DTD declarations that are found in the XML document itself as opposed to the external subset which consists of the declarations placed into separate DTD files and referenced from the XML documents. Here is a sample document that uses most of the above features (download it to test your favorite parser: test.xml):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE hello [
<!ENTITY greeting-text "hello">
<!ATTLIST greeting lang NMTOKEN "en">
<!ATTLIST name lang NMTOKEN "en">
]>
<hello>
<greeting>&greeting-text;</greeting>
<name lang=" fr ">tout le <![CDATA[monde]]></name>
</hello>
Parsing this document with a conforming non-validating XML parser should be equivalent to parsing the following document:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<hello>
<greeting lang="en">hello</greeting>
<name lang="fr">tout le monde</name>
</hello>
Here is the list of C and C++ XML parsers that are either conforming or strive to be conforming:
- Xerces-C++: C++, DOM, SAX2, validating (DTD and XML Schema)
- Libxml2: C, DOM, SAX2, Pull, validating (DTD)
- Expat: C, SAX2, non-validating, small & fast
- Faxpp: C, Pull, non-validating (no default/normalized attributes), small & fast
And here is a list of parsers that while calling themselves XML parsers, are actually parsers for markup languages that are subsets of XML (based on their description at the time of writing):
- VTD-XML
- Rapidxml
- TinyXML
- XML parser provided by applied-mathematics.net
Posted in Development, XML | 8 Comments »
May 11th, 2008
Yesterday I went to see 21 where one of the scenes brings up the Monty Hall problem: there are three doors behind which there are a car and two goats. You choose a door with the goal of winning the car. Then the host opens one of the remaining doors which hides a goat. The question is whether it is to your advantage to switch your choice to the other door. The answer is yes (and yes, I thought it does not matter while watching the movie). When asked to explain why it is a good idea to change the door the main character utters some gibberish about all the variables being changed, etc. Afterwards I checked this problem out on Wikipedia (follow the link above) which gave a few strict proofs making it clear that indeed changing the door increases your chances of winning the car by 1/3. While the formal proofs do their jobs just fine I always prefer to have an intuitive feeling of why a seemingly counter-intuitive answer is actually correct. In this case I wanted to understand what changes once the host opens one of the doors, what extra information is added that makes the difference.
The part of the rule which says that the host has to reveal the other goat brings in the extra information. This happens in the case when you initially selected the door with a goat behind it. In this situation the host is forced to eliminate the other goat: he cannot open the door you have selected and he cannot reveal where the car is. In other words we have two possible outcomes:
- if you selected a goat then the remaining door hides the car
- if you selected the car then the remaining door hides a goat
The probability of initially selecting the goat is 2/3 (two doors out of three hide goats) and the car — 1/3. Thus it is more likely that you will first select a goat instead of the car. And in this more likely case the host is forced to single-out the door which hides the car. Thus changing your selection gives you a better chance of winning the car.
Note also that the probability of your initial choice being the car remains 1/3 even after the host opened one of the doors. It is the probability of the other remaining door hiding the car that has changed (from 1/3 to 2/3) due to the rules of the game forcing the host not to reveal the car.
Posted in Uncategorized | 1 Comment »
April 17th, 2008
By now every C++ engineer worth her salt knows that virtual inheritance is not free. It has object code, runtime (both CPU and memory), as well as compilation time and memory overheads (for an in-depth discussion on how virtual inheritance is implemented in C++ compilers see “Inside the C++ Object Model” by Stanley Lippman). In this post I would like to consider the object code as well as compilation time and memory overheads since in modern C++ implementations these are normally sacrificed for the runtime speed and can present major surprises. Unlike existing studies on this subject, I won’t bore you with “academic” metrics such as per class or per virtual function overhead or synthetic tests. Such metrics and tests have two main problems: they don’t give a feeling of the overhead experienced by real-world applications and they don’t factor in the extra code necessary to account for the lack of functionality otherwise provided by virtual inheritance.
It is hard to come by non-trivial applications that can provide the same functionality with and without virtual inheritance. I happened to have access to such an application and what follows is a quick description of the problem virtual inheritance was used to solve. I will then present some measurements of the overhead by comparing to the same functionality implemented without virtual inheritance.
The application in question is XSD/e, validating XML parser/serializer generator for embedded systems. Given a definition of an XML vocabulary in XML Schema it generates a parser skeleton (C++ class) for each type defined in that vocabulary. Types in XML Schema can derive from each other and if two types are related by inheritance then it is often desirable to be able to reuse the base parser implementation in the derived one. To support this requirement, the current implementation of XSD/e uses the C++ mixin idiom that relies on virtual inheritance:
// Parser skeletons. Generated by XSD/e.
//
struct base
{
virtual void
foo () = 0;
};
struct derived: virtual base
{
virtual void
bar () = 0;
};
// Parser implementations. Hand-written.
//
struct base_impl: virtual base
{
virtual void
foo ()
{
...
}
};
struct derived_impl: virtual derived,
base_impl
{
virtual void
bar ()
{
...
}
};
This approach works well but we quickly found out that for large vocabularies with hundreds of types the resulting object code produced by g++ was unacceptably large. Furthermore, on a schema with a little more than a thousand types, g++ with optimization turned on (-O2) runs out of memory on a machine with 2GB of RAM.
After some analysis we determined that virtual inheritance was to blame. To resolve this problem we have developed an alternative, delegation-based implementation reuse method (will appear in the next release of XSD/e) that is almost as convenient to use as mixin (this is the case because all the support code is automatically generated by the XSD/e compiler). The idea behind the delegation-based approach is illustrated in the following code fragment:
// Parser skeletons. Generated by XSD/e.
//
struct base
{
virtual void
foo () = 0;
};
struct derived: base
{
derived (base* impl)
: impl_ (impl)
{
}
virtual void
bar () = 0;
virtual void
foo ()
{
assert (impl_);
impl_->foo ();
}
private:
base* impl_;
};
// Parser implementations. Hand-written.
//
struct base_impl: base
{
virtual void
foo ()
{
...
}
};
struct derived_impl: derived
{
derived_impl ()
: derived (&base_impl_)
{
}
virtual void
bar ()
{
...
}
private:
base_impl base_impl_;
};
The optimized for size (-Os) and stripped test executable built for the above-mentioned thousand-types schema using virtual inheritance is 15MB in size. It also takes 19 minutes to build and peak memory usage of the C++ compiler is 1.6GB. For comparison, the same executable built using the delegation-based approach is 3.7MB in size, takes 14 minutes to build, and peak memory usage is 348MB. That’s right, the executable is 4 times smaller. Note also that the generated parser skeletons are not just a bunch of pure virtual function signatures. They include XML Schema validation, data conversion, and dispatch code. The measurements also showed that the runtime performance of the two reuse approaches is about the same (most likely because g++ performs a similar delegation under the hood except that it has to handle all possible use-cases thus the object code overhead).
Posted in GCC g++, C++ Compilers, C++ | 5 Comments »