[xsd-users] Poor performance of Unicode conversion

Wed Sep 26 14:12:44 EDT 2007

Ray,

Ray Lischner <rlischner at proteus-technologies.com> writes:

> DOM-to-object model stage. Some of our schemas are predominantly
> strings.

Ok, good. I am going to profile this and see if anything can be
optimized. I will get back to you with the results. BTW, have you
upgraded to Xerces-C++ 2.8.0? We've done some optimizations to
the XML-to-DOM parsing code which result in a 25-30% speedup.

> That's a show-stopper. I've noticed how libxml2 says an XML document
> is valid when Xerces says it isn't. With only one exception, Xerces
> has been correct and libxml2 has been wrong.

Yes, their XML Schema support is nowhere near production quality and
the person who started the development in that area is no longer
working on it. So there is little hope for the near future.

> The Xerces library is extremely C-like. (No destructors, so everything
> needs a formal release call. Test node type by checking node-type field
> that contains an enumerated value. Raw pointers passed everywhere. No
> UTF-16 string type, so raw XMLCh* pointers passed around. And so on.)

That's all true, but I think there is still a big gap between the
Xerces-C++ API and libxml2, especially if one is using custom smart
pointers.
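
To illustrate the smart pointer point, here is a minimal sketch of a
pointer that calls release() instead of delete. The names releaser,
release_ptr, and mock_document are made up for this example;
mock_document is only a stand-in for a library object that must be
freed with a formal release call, not a real Xerces-C++ class. It
assumes a C++11 compiler.

```cpp
#include <cassert>
#include <memory>

// Deleter that frees an object by calling its release() member
// instead of using delete.
struct releaser
{
  template <typename T>
  void operator() (T* p) const
  {
    p->release ();
  }
};

// Owning pointer for any type that provides release().
template <typename T>
using release_ptr = std::unique_ptr<T, releaser>;

// Hypothetical stand-in for a library object that requires a formal
// release call. The counter only exists so that the example can
// observe when objects are freed.
struct mock_document
{
  static int live;

  mock_document () { ++live; }

  void release ()
  {
    assert (live > 0);
    --live;
    delete this;
  }
};

int mock_document::live = 0;
```

With this in place, a declaration such as
release_ptr<mock_document> doc (new mock_document); releases the object
automatically at the end of the scope, including when an exception is
thrown.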

> I agree, but I also find the Apache Xerces-C++ documentation to be
> unusable. I spend a lot of time experimenting to understand what the
> library actually does.

At least there are examples and some basic introductory documentation
in the Programming Guide. Libxml2 has no examples (you are supposed
to study tests instead) and the only documentation is the API
reference.

> We need to manipulate document trees when we don't always have a
> schema. But if the interface is event-driven, I would need to invent
> my own DOM-like model to store the tree. Maybe that would be best,
> however.

DOM is so complex and often slow because it has to
support all cases, however insane they are. If your documents don't
use mixed content (and that's the case for probably 90% of all XML
vocabularies out there), a tree-like in-memory representation of XML
becomes very simple. I agree that it should be fairly straightforward
to add support for basic XML trees. We may even distribute something
like this as part of the XSD runtime if there is demand.
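
As a rough sketch of how simple such a tree can be once mixed content
is ruled out, consider the following. The element type and its members
are hypothetical names for this example and not part of the XSD
runtime; it assumes a C++11 compiler. Each element holds either text
or child elements, never both interleaved.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical element type for XML without mixed content.
struct element
{
  std::string name;
  std::map<std::string, std::string> attributes;
  std::string text;                                // leaf elements only
  std::vector<std::unique_ptr<element>> children;  // inner elements only

  explicit element (std::string n) : name (std::move (n)) {}

  // Append a child element and return a reference to it. An element
  // that already has text cannot get children (no mixed content).
  element& add (std::string child_name)
  {
    assert (text.empty ());
    children.push_back (
      std::unique_ptr<element> (new element (std::move (child_name))));
    return *children.back ();
  }

  // Return the first child with the given name or a null pointer if
  // there is no such child.
  const element* find (const std::string& child_name) const
  {
    for (const auto& c : children)
      if (c->name == child_name)
        return c.get ();

    return nullptr;
  }
};
```

A document is then built with calls like root.add ("name").text = "Ray";
and queried with root.find ("name"). Since children are owned by
unique_ptr, the whole tree is freed when the root goes out of scope.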

We don't want to base this mapping on a tree-like XML representation
because we want to be able to handle XML documents that are larger
than the available RAM.

Boris