[xsd-users] Large XSD-schema, speed and identity constraint validation

Tue May 12 07:11:57 EDT 2020

Stefan de Konink <stefan at konink.de> writes:

> One of the main problems that we face is the syntax validation of 100MB+
> XML-document with this schema, but especially: constraint validation.
> Practically I am looking for a better than libxml2/xmllint speed, where I
> notice that many - if not all - tools have a direct single threaded
> performance bottleneck. I am trying to find a generic form to overcome this,
> I am surprised that it is difficult to find one. Practically parallel syntax
> validation using sharding could work for us, but identity constraint
> validation needs all parts of the document, hence I would expect a "better
> way".

Based on your reference to identity constraints further down in your
email I am going to assume that by "constraint validation" above you
mean "identity constraint validation".

Over all these years, I can't remember seeing many cases where this was
an issue, which probably explains why there is no better/faster way.
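
For context, here is a rough sketch of why sharding helps with syntax
validation but only partially with identity constraints: a key can be
unique within every shard and still clash across shards, so the per-shard
results have to be merged in a global pass. This is an illustration only,
not something XSD or Xerces-C++ implements. The Shard type, scan_shard(),
and find_duplicates() are hypothetical names, and the key values are
assumed to have been extracted from the document already.

```cpp
#include <algorithm>
#include <future>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the key values (e.g., xs:key/xs:unique fields)
// that a validator would extract from one shard of the document.
using Shard = std::vector<std::string>;

struct ShardResult
{
  std::unordered_set<std::string> keys; // Distinct keys seen in this shard.
  std::vector<std::string> duplicates;  // Keys repeated within this shard.
};

// Per-shard pass: touches only local data, so shards can run in parallel.
ShardResult scan_shard (const Shard& s)
{
  ShardResult r;
  for (const std::string& k: s)
    if (!r.keys.insert (k).second)
      r.duplicates.push_back (k);
  return r;
}

// Global pass: merge the per-shard key sets sequentially to catch keys
// that clash across shards. Returns the duplicated keys in sorted order.
std::vector<std::string> find_duplicates (const std::vector<Shard>& shards)
{
  std::vector<std::future<ShardResult>> jobs;
  for (const Shard& s: shards)
    jobs.push_back (
      std::async (std::launch::async, [&s] {return scan_shard (s);}));

  std::unordered_set<std::string> all;
  std::vector<std::string> dups;
  for (std::future<ShardResult>& j: jobs)
  {
    ShardResult r (j.get ());
    dups.insert (dups.end (), r.duplicates.begin (), r.duplicates.end ());
    for (const std::string& k: r.keys)
      if (!all.insert (k).second)
        dups.push_back (k);
  }

  std::sort (dups.begin (), dups.end ());
  return dups;
}
```

With the shards {a, b}, {c, a}, and {d, d}, this reports a and d: d is
caught inside its own shard, while a only shows up in the merge. The merge
is sequential and has to hold every key in memory, which is the part that
sharding alone does not remove.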

Also, keep in mind that CodeSynthesis XSD delegates XML Schema validation,
including identity constraint validation, to Xerces-C++.

> Question 1;
> 
> My first question is concerning the c++ code generation. I am currently
> using the following command for generating the XML interface. How can I
> generate a single file instead of 4011 individual files? When I omit
> --file-per-type I don't get all the types in the single file.

There is no way to get a single header/source set from multiple schema
files. And, for a large schema, you probably wouldn't want to, since you
may not be able to compile the resulting source file (yes, we have run
into this and have a mechanism to split a single source file into multiple
parts; see the --parts* options).

But seeing that you depend on GML 3.2, file-per-type is probably your
only option (you could try to compile your schemas in the file-per-schema
mode to minimize the number of files, but getting that to work is more
of an art than a science). See these release notes for background on this
mode:

http://www.codesynthesis.com/~boris/blog/2008/02/13/codesynthesis-xsd-3-1-0-released/

> Question 2;
> 
> When I am comparing the cold performance of the following code, the mileage
> may vary. I would state the performance is similar to Xerces in Java. Which
> makes me wonder if the 'hot' performance would be much better? Or that I am
> trying to do something that even with the generated C++ code is not
> optimised. I am aware of the performance example in the source code, that
> could preload the schema once and run from it many times.
> 
> [...]
> 
> real	17m6.611s
> user	17m1.399s
> sys	0m3.917s
> 
> real	5m21.199s
> user	5m19.587s
> sys	0m1.450s

I am confused: what are these two results for? Hot vs cold?

Overall, if you know that the identity constraint validation is your
bottleneck, I wouldn't expect pre-loading the schemas to help much.

> Question 3;
> 
> One of the other things I noticed is that the CodeSynthesis identity
> constraint validation only reports the location of the end-tag (and
> therefore an ocean of duplicates), missing the exact location that xmllint
> does produce, for example:
> 
> [...]
> 
> Is there a way to get the invalid line, with the correct type?

That would most likely require improvements to Xerces-C++.