[xsd-users] Large XSD-schema, speed and identity constraint validation

Mon May 11 08:58:42 EDT 2020

Hello,

I am part of the standardisation group that works on a public transport 
standard for network and timetable exchange. It is available as XSD on 
GitHub <https://github.com/NeTEx-CEN/NeTEx> under a GPL license. I noticed 
that RailML is part of the wiki; I hope we can do the same for NeTEx.

One of the main problems we face is validating 100MB+ XML documents 
against this schema: syntax validation, but especially identity constraint 
validation. I am looking for something faster than libxml2/xmllint, and I 
notice that many tools, if not all, are bottlenecked on single-threaded 
performance. I am trying to find a generic way around this, and I am 
surprised how difficult it is to find one. In practice, parallel syntax 
validation using sharding could work for us, but identity constraint 
validation needs all parts of the document, hence I would hope there is a 
"better way".
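
To make the sharding idea more concrete, here is a minimal sketch of the 
two-phase approach I have in mind. It is not tied to any existing tool. It 
assumes that each shard has already been syntax-validated in parallel and 
that the key and keyref values found in it were collected as plain strings. 
The names ShardIds, Report and check_identity are hypothetical, and a single 
flat key space is a simplification: real NeTEx constraints are scoped per 
constraint and use composite keys.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical per-shard result: the identity-constraint values that a
// worker extracted while syntax-validating one shard of the document.
struct ShardIds {
  std::vector<std::string> keys;     // values declared through xs:key
  std::vector<std::string> keyrefs;  // values referenced through xs:keyref
};

// Hypothetical report of what the global pass found.
struct Report {
  std::vector<std::string> duplicate_keys;
  std::vector<std::string> dangling_refs;
};

// Global pass: only this step needs to see "all parts of the document",
// and it only needs the collected values, not the XML itself.
Report check_identity(const std::vector<ShardIds>& shards) {
  Report report;

  // Merge the keys of all shards; a failed insert means a duplicate key.
  std::unordered_set<std::string> all_keys;
  for (const auto& shard : shards)
    for (const auto& key : shard.keys)
      if (!all_keys.insert(key).second)
        report.duplicate_keys.push_back(key);

  // Resolve every keyref against the merged key set. all_keys is only
  // read here, so this loop could be split per shard across threads.
  for (const auto& shard : shards)
    for (const auto& ref : shard.keyrefs)
      if (all_keys.find(ref) == all_keys.end())
        report.dangling_refs.push_back(ref);

  return report;
}
```

With this split, the expensive per-shard work can run in parallel, and the 
sequential part is reduced to set insertions and lookups over the collected 
values.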

Question 1;

My first question concerns the C++ code generation. I am currently 
using the command below to generate the XML interface. How can I 
generate a single file instead of 4011 individual files? When I omit 
--file-per-type I don't get all the types in the single file.

xsdcxx cxx-tree --file-per-type --generate-polymorphic --generate-wildcard \
  --namespace-map "http://www.opengis.net/gml/3.2=gml" \
  /home/skinkie/Sources/NeTEx/xsd/NeTEx_publication.xsd

In order to run this command successfully, our schema had to be 
modified. I assume the root cause lies in duplicated QNames, which 
should obviously be investigated. I also noticed some compilation errors 
in the generated code involving "any". I might file a bug report later.

Question 2;

When I compare the cold performance of the following code, my mileage 
varies. I would say the performance is similar to Xerces in Java, which 
makes me wonder whether the 'hot' performance would be much better, or 
whether I am trying to do something that is not optimised even with the 
generated C++ code. I am aware of the performance example in the source 
code, which preloads the schema once and then runs from it many times.

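// Point the NeTEx namespace at the local copy of the schema.
xml_schema::properties props;
props.schema_location ("http://www.netex.org.uk/netex", 
"file:///home/skinkie/Sources/NeTEx/xsd/NeTEx_publication.xsd");
// Parse the document; flags = 0 leaves validation enabled.
netex::PublicationDelivery ("/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml", 
0, props);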

This process takes:
time ./test 1>/tmp/xsd.txt 2>&1

real	17m6.611s
user	17m1.399s
sys	0m3.917s

real	5m21.199s
user	5m19.587s
sys	0m1.450s

As opposed to:

time xmllint --noout --schema 
/home/skinkie/Sources/NeTEx/xsd/NeTEx_publication.xsd 
/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml 1>/tmp/xmllint.txt 2>&1

real	18m13.272s
user	18m8.838s
sys	0m3.097s

real	9m3.236s
user	9m1.706s
sys	0m1.259s

If I change the XSD to a much lighter version, tailored to the information 
profile we exchange, validation completes within 8 seconds. Such an XSD can 
be found here: <https://github.com/BISONNL/NeTEx-NL/tree/master/xsd>

Question 3;

One of the other things I noticed is that the CodeSynthesis identity 
constraint validation only reports the location of the end tag (and 
therefore an ocean of duplicates), missing the exact location that xmllint 
does produce. For example, compare the first message below (CodeSynthesis 
XSD) with the second (xmllint):

/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml:1817356:42 error: element 
'PublicationDelivery' does not have enough values for identity constraint 
key 'Journey_AnyVersionedKey'

/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml:13495: Schemas validity error : 
Element '{http://www.netex.org.uk/netex}FromPointRef': No match found for 
key-sequence ['SYNTUS:RoutePoint:30000018'] of keyref 
'{http://www.netex.org.uk/netex}FromPointRef'.

Is there a way to get the invalid line, with the correct type?

I already found this discussion: 
<https://lists.w3.org/Archives/Public/xmlschema-dev/2007May/0020.html>

-- 
Stefan