[xsd-users] Large XSD-schema, speed and identity constraint validation

Tue May 12 10:41:17 EDT 2020

Dear Boris,

Thanks for your in-depth reply.

On Tuesday, May 12, 2020 1:11:57 PM CEST, Boris Kolpackov wrote:
> Based on your reference to identity constraints further down in your
> email I am going to assume that by "constraint validation" above you
> mean "identity constraint validation".

Correct.

> Over all these years, I can't remember seeing many cases where this was
> an issue. Which probably explains why there is no better/faster way.

I don't mind getting my hands dirty on this subject.

> Also, keep in mind that CodeSynthesis XSD delegates XML Schema validation,
> including identity constraint validation, to Xerces-C++.

Does this mean in practice that if I only cared about XSD validation, 
there would be no net benefit in using the XSD toolset, because the 
generated code is not used to build a schema-specific parser that is 
employed during XSD validation? I am thinking in the direction of the XML 
Screamer research.

> There is no way to get a single header/source set from multiple schema
> files. And, for a large schema, you probably wouldn't want to, since you
> may not be able to compile the resulting source file (yes, we have run
> into this and have a mechanism to split single source file into multiple
> parts; see the --parts* options).
>
> But seeing that you depend on GML 3.2, file-per-type is probably your
> only option (you could try to compile your schemas in the file-per-schema
> mode to minimize the number of files, but getting that to work is more
> of an art than science). See these release notes for background on this
> mode:

Understood. As you may have seen, the "art" of designing light XSDs that 
define only a single profile (where the net effect is that a validator 
would complain about extra elements) is something that could greatly 
improve the performance of the validator. Obviously this is expected 
behavior, but not many XSD tools support cutting the unused bloat in a 
consistent manner, meaning that designing a smaller XSD typically ends up 
being a bottom-up exercise again.

>> real	17m6.611s
>> user	17m1.399s
>> sys	0m3.917s
>> 
>> real	5m21.199s
>> user	5m19.587s
>> sys	0m1.450s
>
> I am confused, what are these two results for? Hot vs cold?

Same machine, same data, same code, multiple runs; these are the minimum 
and maximum. From my benchmarking background I would consider both of them 
cold. I cannot explain (other than by hardware reasons; I tested on a 
laptop with a Ryzen 2500U) why the results show such huge outliers for both 
libxml2 and Xerces-C++. Nor can I exclude the initial loading (I/O) of the 
XSD schema.
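
To separate schema loading from validation in such measurements, each phase 
could be timed on its own over several runs. The following is only a 
minimal sketch using the C++ standard library; `time_runs` and `MinMax` are 
hypothetical names, and the callable passed in stands for whatever phase 
(schema loading, validation) is being measured.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>

// Fastest and slowest wall-clock time, in seconds, over a series of runs.
struct MinMax {
    double min_s;
    double max_s;
};

// Runs `work` exactly `runs` times and records the minimum and maximum
// elapsed time. steady_clock is used because it never jumps backwards.
inline MinMax time_runs(const std::function<void()>& work, int runs) {
    assert(runs > 0);
    MinMax r{std::numeric_limits<double>::max(), 0.0};
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        r.min_s = std::min(r.min_s, s);
        r.max_s = std::max(r.max_s, s);
    }
    return r;
}
```

Calling `time_runs` once with a callable that only loads the schema and 
once with a callable that only validates would show which phase produces 
the outliers. This measures wall-clock time only, so it corresponds to the 
"real" figures above rather than to "user" or "sys".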

> Overall, if you know that the identity constraint validation is your
> bottleneck, I wouldn't expect pre-loading the schemas to help much.

I am considering creating an alternative identity constraint validation 
mechanism. But I would first have to dive into the current mechanism to see 
whether a novel approach would actually improve anything, given the lack of 
work on the subject in the last 10 years.
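
As a rough illustration of one possible direction (not a description of how 
Xerces-C++ or libxml2 work internally), key values could be stored in a 
hash set per key and keyrefs resolved with one lookup each once the scoping 
element closes. Every name below (`IdentityIndex`, `KeyRefUse`, and so on) 
is hypothetical. The sketch also assumes that field values are compared as 
plain joined strings, which ignores the type-aware comparison that XML 
Schema requires.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// One keyref occurrence, with the location where it was seen so that a
// later report can name the key, the value and the position.
struct KeyRefUse {
    std::string refer;   // name of the key this keyref refers to
    std::string value;   // field values joined into one string (assumption)
    std::size_t line;
    std::size_t column;
};

class IdentityIndex {
public:
    // Records a key value; returns false if the value was already present,
    // which would be a uniqueness violation.
    bool add_key(const std::string& key_name, const std::string& value) {
        return keys_[key_name].insert(value).second;
    }

    // Records a keyref. Resolution is deferred because the referenced key
    // may appear later in the document.
    void add_keyref(KeyRefUse use) { refs_.push_back(std::move(use)); }

    // To be called once the scoping element is closed: returns every
    // keyref whose value has no matching key (one hash lookup per keyref).
    std::vector<KeyRefUse> dangling() const {
        std::vector<KeyRefUse> out;
        for (const auto& r : refs_) {
            const auto it = keys_.find(r.refer);
            if (it == keys_.end() || it->second.count(r.value) == 0)
                out.push_back(r);
        }
        return out;
    }

private:
    std::unordered_map<std::string, std::unordered_set<std::string>> keys_;
    std::vector<KeyRefUse> refs_;
};
```

Because each dangling entry keeps its key name, value and location, a 
report built from `dangling()` could include the key/value pair for every 
unresolved reference. Whether this would be faster than the existing 
implementations is untested.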

>> Question 3;
>> 
>> One of the other things I noticed is that the Codesynthesis identity
>> constraint validation only reports the location of the end-tag (and
>> therefore an ocean of duplicates), missing the exact location that xmllint
>> does produce, for example: ...
>
> That would most likely require improvements to Xerces-C++.

I think this was partially a wrong statement.

Within Xerces-Java the Line Number does represent the expected tag.

org.xml.sax.SAXParseException; lineNumber: 131; columnNumber: 27; Not 
enough values specified for <key name="VehicleType_AnyVersionedKey"> 
identity constraint specified for element "PublicationDelivery".

The output I get from the XSD validation, thus probably from Xerces-C++:

/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV-pushed.xml:131:27 error: element 
'PublicationDelivery' does not have enough values for identity constraint 
key 'VehicleType_AnyVersionedKey'

But specifically the Java version is capable of doing this:

org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 
'StopArea_KeyRef' with value 'SYNTUS:StopArea:60103,20200422' not found for 
identity constraint of element 'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 
'ScheduledStopPoint_KeyRef' with value 
'SYNTUS:ScheduledStoppoint:50203005,20200422' not found for identity 
constraint of element 'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 
'TransportAdministrativeZone_KeyRef' with value 
'NL:AdministrativeZone:AL,any' not found for identity constraint of element 
'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 
'Operator_KeyRef' with value 'SYNTUS,20200422' not found for identity 
constraint of element 'PublicationDelivery'.

While the C++ version does:

/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV-pushed.xml:1499081:23 error: 
identity constraint key for element 'PublicationDelivery' not found
(duplicated: 1196 times)

So I am missing the key/value report, but I get an ocean of duplicates for 
which I cannot find the reason. I'll drop the Xerces-C++ mailing list a 
line.

-- 
Stefan