From lmontmailler at edap-tms.com Sat May 9 06:09:47 2020 From: lmontmailler at edap-tms.com (Laurent MONTMAILLER) Date: Mon May 11 07:05:38 2020 Subject: [xsd-users] XSD codesynthesis code generator modification In-Reply-To: References: Message-ID: Hi Boris, When I just add a few comments in the XSD codesynthesis code generator, I can't handle indents properly. For example, if I modify 'tree-source.cxx' as is: void generate_tree_source (Context& ctx, size_t first, size_t last) { //>>>customization-debug ctx.os << "// dbg_45, " << __PRETTY_FUNCTION__ << endl; ctx.os << "/* here is my test xsd generator modification */" << endl; //<< [Facebook icon] [LinkedIn icon] [Twitter icon] [Twitter icon] [Logo] [Banner] From daniel.kaasa at gmail.com Sat May 9 11:40:13 2020 From: daniel.kaasa at gmail.com (=?UTF-8?B?RGFuaWVsIEvDpXNh?=) Date: Mon May 11 07:05:38 2020 Subject: [xsd-users] AnyType and the extraction of contents Message-ID: Hi, I might not understand fully how to automatically have the generated bindings extract the contents for AnyType. I expected that when parsing and passing the dom document, the AnyType elements would contain content and this could be accessible by invoking the dom_content() method, however I experience these to be empty "null_content()". Note that I have include the --generate-any-type option when generating the bindings, but this has not apparent effect. Is there something I might not have done correctly? Thanks in advance, Daniel From jeroennwitmond at gmail.com Sun May 10 15:49:41 2020 From: jeroennwitmond at gmail.com (Jeroen N. Witmond) Date: Mon May 11 07:05:38 2020 Subject: [xsd-users] std::ostream setprecision() causes invalid output for ::xml_schema::date_time Message-ID: Greetings! When using setprecision(std::numeric_limits::digits10 + 1) on the output stream, the ::xml_schema::date_time value written to it will fail validation when the number of seconds is less than 10. 
Note that the call to setprecision() need not be in the same statement as the output of the ::xml_schema::date_time value; it can even be in a different source file. For instance: The value "2020-05-01T06:06:04.000Z" will be written as "2020-05-01T06:06:4.0000000000000000000Z" which will result in error message "error: invalid character encountered" when parsed. Adding a zero between the colon and the four removes the error. I'm aware that careless use of setprecision() can be regarded as a user error; in that case this message will serve as a warning. I'm using C++/Tree version 4.0.0 with Xerces-C 3.1.4. A full testcase can be provided on request. Regards, Jeroen. From boris at codesynthesis.com Mon May 11 07:02:21 2020 From: boris at codesynthesis.com (Boris Kolpackov) Date: Mon May 11 07:13:18 2020 Subject: [xsd-users] XSD codesynthesis code generator modification In-Reply-To: References: Message-ID: Laurent MONTMAILLER writes: > The generated code I get is: > // dbg_45, void CXX::Tree::generate_tree_source(CXX::Tree::Context&, std::size_t, std::size_t) > /* here is my test xsd generator modification */ > > And not, as expected (2 lines starting at column 0): > // dbg_45, void CXX::Tree::generate_tree_source(CXX::Tree::Context&, std::size_t, std::size_t) > /* here is my test xsd generator modification */ > > Why this behavior ? Our ad hoc indenter is a bit of a voodoo: it tries to analyze the syntax of what's being written and indent things accordingly, which for C++ is not exactly trivial. My guess is it mis-analyzes your first line. Try to change it to: ctx.os << "// dbg_45: " << __PRETTY_FUNCTION__ << ';' << endl; Of course, you are also welcome to try to fix the indenter ;-). 
From stefan at konink.de Mon May 11 08:58:42 2020 From: stefan at konink.de (Stefan de Konink) Date: Mon May 11 09:09:46 2020 Subject: [xsd-users] Large XSD-schema, speed and identity constraint validation Message-ID: Hello, I am part of the standardisation group that works on a public transport standard for network and timetable exchange. It is available as XSD on github under a GPL license. I noticed that RailML is part of the wiki, I hope we can do the same for NeTEx. One of the main problems that we face is the syntax validation of 100MB+ XML-document with this schema, but especially: constraint validation. Practically I am looking for a better than libxml2/xmllint speed, where I notice that many - if not all - tools have a direct single threaded performance bottleneck. I am trying to find a generic form to overcome this, I am surprised that it is difficult to find one. Practically parallel syntax validation using sharding could work for us, but identity constraint validation needs all parts of the document, hence I would expect a "better way". Question 1; My first question is concerning the c++ code generation. I am currently using the following command for generating the XML interface. How can I generate a single file instead of 4011 individual files? When I omit --file-per-type I don't get all the types in the single file. In order to successfully run the above command, our schema had to be modified. I assume the root cause lies in duplicated QNames. This should obviously be investigated. I also noticed some compilation errors in the generated code with "any". Might file a bug report later. xsdcxx cxx-tree --file-per-type --generate-polymorphic --generate-wildcard --namespace-map "http://www.opengis.net/gml/3.2=gml" /home/skinkie/Sources/NeTEx/xsd/NeTEx_publication.xsd Question 2; When I am comparing the cold performance of the following code, the millage may vary. I would state the performance is similar to Xerces in Java. 
Which makes me wonder if he 'hot' performance would be much better? Or that I am trying to do something that even with the generated C++ code is not optimised. I am aware of the performance example in the source code, that could preload the schema once and run from it many times. xml_schema::properties props; props.schema_location ("http://www.netex.org.uk/netex", "file:///home/skinkie/Sources/NeTEx/xsd/NeTEx_publication.xsd"); netex::PublicationDelivery ("/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml", 0, props); This process takes: time ./test 1>/tmp/xsd.txt 2>&1 real 17m6.611s user 17m1.399s sys 0m3.917s real 5m21.199s user 5m19.587s sys 0m1.450s Opposed to: time xmllint --noout --schema /home/skinkie/Sources/NeTEx/xsd/NeTEx_publication.xsd /var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml 1>/tmp/xmllint.txt 2>&1 real 18m13.272s user 18m8.838s sys 0m3.097s real 9m3.236s user 9m1.706s sys 0m1.259s If I change the XSD to a much lighter version, tailored to the information profile we exchange, the validation occurs within 8 seconds, such XSD can be found here: Question 3; One of the other things I noticed is that the Codesynthesis identity constraint validation only reports the location of the end-tag (and therefore an ocean of duplicates), missing the exact location that xmllint does produce, for example: /var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml:1817356:42 error: element 'PublicationDelivery' does not have enough values for identity constraint key 'Journey_AnyVersionedKey' /var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml:13495: Schemas validity error : Element '{http://www.netex.org.uk/netex}FromPointRef': No match found for key-sequence ['SYNTUS:RoutePoint:30000018'] of keyref '{http://www.netex.org.uk/netex}FromPointRef'. Is there a way to get the invalid line, with the correct type? 
I already found this discussion:

--
Stefan

From boris at codesynthesis.com Mon May 11 09:06:48 2020
From: boris at codesynthesis.com (Boris Kolpackov)
Date: Mon May 11 09:17:44 2020
Subject: [xsd-users] AnyType and the extraction of contents
In-Reply-To:
References:
Message-ID:

Daniel Kåsa writes:

> I might not understand fully how to automatically have the generated
> bindings extract the contents for AnyType. I expected that when parsing and
> passing the dom document, the AnyType elements would contain content and
> this could be accessible by invoking the dom_content() method, however I
> experience these to be empty "null_content()". Note that I have include the
> --generate-any-type option when generating the bindings, but this has not
> apparent effect. Is there something I might not have done correctly?

It sounds like you've taken all the right steps. Can you make sure that
--generate-any-type is specified when compiling all your schemas (in
particular, it must be in effect when compiling elements of anyType)?

If this does not help, try to come up with a small test (i.e., schema,
test driver, and an XML file) that reproduces this issue and I will
take a look.

From boris at codesynthesis.com Mon May 11 09:20:18 2020
From: boris at codesynthesis.com (Boris Kolpackov)
Date: Mon May 11 09:31:14 2020
Subject: [xsd-users] std::ostream setprecision() causes invalid output for ::xml_schema::date_time
In-Reply-To:
References:
Message-ID:

Jeroen N. Witmond writes:

> I'm aware that careless use of setprecision() can be regarded as a user
> error; in that case this message will serve as a warning.

Yes, the only way to fix this in the current implementation is to call
setprecision() explicitly before every floating-point value serialization
(and probably again after it, to restore the original value), which could
have a performance impact.
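A minimal sketch of the set-before/restore-after approach described above, for illustration only. The helper name write_seconds() and the fixed three-digit fraction are assumptions made for this example; this is not how the XSD runtime is implemented.

```cpp
#include <iomanip>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper, not part of XSD or its runtime: writes the seconds
// component of a date/time as SS.sss, independent of any precision, fill
// or float format previously set on the stream.
inline void write_seconds (std::ostream& os, double seconds)
{
  // Save the formatting state we are about to change.
  const std::ios_base::fmtflags old_flags = os.flags ();
  const std::streamsize old_precision = os.precision ();
  const char old_fill = os.fill ();

  os << std::fixed << std::right << std::setprecision (3)
     << std::setw (6) << std::setfill ('0') << seconds;

  // Restore the caller's state (the width resets itself after each output).
  os.flags (old_flags);
  os.precision (old_precision);
  os.fill (old_fill);
}

// Demonstration: format a value on a stream whose precision was raised
// elsewhere, as in the report that started this thread.
inline std::string demo (double seconds)
{
  std::ostringstream os;
  os << std::setprecision (17); // The "careless" global setting.
  write_seconds (os, seconds);
  return os.str ();
}
```

With this sketch, demo (4.0) keeps the leading zero that the validator expects, and the precision set by the caller is still in effect afterwards. The save and restore calls are the extra per-value work that the performance remark refers to.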
Probably the correct way to fix this is to use C++11 snprintf() (when we
drop support for C++98) or even C++17 to_chars() (that API was made
specifically for these types of "serialization" use-cases).

Thanks for the warning!

From lmontmailler at edap-tms.com Mon May 11 08:30:00 2020
From: lmontmailler at edap-tms.com (Laurent MONTMAILLER)
Date: Tue May 12 03:53:45 2020
Subject: [xsd-users] XSD codesynthesis code generator modification
In-Reply-To:
References:
Message-ID:

Thank you Boris for your explanation.

I tried your solution, but it did not work. It seems that it is the use of
__PRETTY_FUNCTION__ that the indenter cannot handle. If I use __FUNCTION__
instead, all is OK. Unfortunately this macro is less informative, but at
least the indentation is correct and my code is readable.

I will try to improve this point later, just after having fixed the
indenter ;) as you suggested.

Thank you for your support. Where are you, USA?

Best regards,
Laurent (France)

-----Original Message-----
From: Boris Kolpackov
Sent: Monday, May 11, 2020 13:02
To: Laurent MONTMAILLER
Cc: xsd-users@codesynthesis.com
Subject: Re: [xsd-users] XSD codesynthesis code generator modification

Laurent MONTMAILLER writes:

> The generated code I get is:
> // dbg_45, void CXX::Tree::generate_tree_source(CXX::Tree::Context&, std::size_t, std::size_t)
> /* here is my test xsd generator modification */
>
> And not, as expected (2 lines starting at column 0):
> // dbg_45, void CXX::Tree::generate_tree_source(CXX::Tree::Context&, std::size_t, std::size_t)
> /* here is my test xsd generator modification */
>
> Why this behavior ?

Our ad hoc indenter is a bit of a voodoo: it tries to analyze the syntax
of what's being written and indent things accordingly, which for C++ is
not exactly trivial. My guess is it mis-analyzes your first line. Try to
change it to:

ctx.os << "// dbg_45: " << __PRETTY_FUNCTION__ << ';' << endl;

Of course, you are also welcome to try to fix the indenter ;-).
www.edap-tms.com

From daniel.kaasa at gmail.com Mon May 11 20:55:37 2020
From: daniel.kaasa at gmail.com (Daniel Kåsa)
Date: Tue May 12 03:53:45 2020
Subject: [xsd-users] AnyType and the extraction of contents
In-Reply-To:
References:
Message-ID:

Hi Boris,

I did as you asked and fabricated a test from the examples. However, due
to the complexity of the schemas I am currently working with, I have
mocked up a test schema that somewhat resembles the original I am having
issues with. The fabricated test works as expected.

When I look at the generated code for both cases, it is clear that I
haven't achieved exactly the same representation (shown below).

I am using more or less the following switches for both cases:

--generate-inline --generate-serialization --generate-wildcard --generate-any-type --import-maps --generate-polymorphic --root-element OM_Observation

Note that we are creating libraries for all our schemas and linking them
together based on their dependencies, where GML is one of many. We have
added --generate-any-type to all of them but see no difference for any
types.
Snip from the generated "observation.xsd" OGC Observation & Measurements schema: ---------------------------- snip ------------------------------ // result // { ::std::unique_ptr< ::xsd::cxx::tree::type > tmp ( ::xsd::cxx::tree::type_factory_map_instance< 0, char > ().create ( "result", "http://www.opengis.net/om/2.0", &::xsd::cxx::tree::factory_impl< result_type >, true, true, i, n, f, this)); if (tmp.get () != 0) { if (!result_.present ()) { ::std::unique_ptr< result_type > r ( dynamic_cast< result_type* > (tmp.get ())); if (r.get ()) tmp.release (); else throw ::xsd::cxx::tree::not_derived< char > (); this->result_.set (::std::move (r)); continue; } } ---------------------------- snip ------------------------------ Snip from the generated mockup test that works as expected: ---------------------------- snip ------------------------------ // result // if (n.name () == "result" && n.namespace_ () == " http://www.kongsberg.com/observation") { ::std::auto_ptr< result_type > r ( result_traits::create (i, f | ::xml_schema::flags::extract_content, this)); if (!result_.present ()) { this->result_.set (r); continue; } } ---------------------------- snip ------------------------------ Daniel On Mon, May 11, 2020 at 3:06 PM Boris Kolpackov wrote: > Daniel K?sa writes: > > > I might not understand fully how to automatically have the generated > > bindings extract the contents for AnyType. I expected that when parsing > and > > passing the dom document, the AnyType elements would contain content and > > this could be accessible by invoking the dom_content() method, however I > > experience these to be empty "null_content()". Note that I have include > the > > --generate-any-type option when generating the bindings, but this has not > > apparent effect. Is there something I might not have done correctly? > > It sounds like you've taken all the right steps. 
Can you make sure that > --generate-any-type is specified when compiling all your schemas (in > particular, it must be in effect when compiling elements of anyType). > > If this does not help, try to come up with a small test (i.e., schema, > test driver, and an XML file), that reproduces this issue and I will > take a look. > From boris at codesynthesis.com Tue May 12 07:11:57 2020 From: boris at codesynthesis.com (Boris Kolpackov) Date: Tue May 12 07:22:57 2020 Subject: [xsd-users] Large XSD-schema, speed and identity constraint validation In-Reply-To: References: Message-ID: Stefan de Konink writes: > One of the main problems that we face is the syntax validation of 100MB+ > XML-document with this schema, but especially: constraint validation. > Practically I am looking for a better than libxml2/xmllint speed, where I > notice that many - if not all - tools have a direct single threaded > performance bottleneck. I am trying to find a generic form to overcome this, > I am surprised that it is difficult to find one. Practically parallel syntax > validation using sharding could work for us, but identity constraint > validation needs all parts of the document, hence I would expect a "better > way". Based on your reference to identity constraints further down in your email I am going to assume that by "constraint validation" above you mean "identity constraint validation". Over all these years, I can't remember seeing many cases where this was an issue. Which probably explains why there is no better/faster way. Also, keep in mind that CodeSynthesis XSD delegates XML Schema validation, including identity constraint validation, to Xerces-C++. > Question 1; > > My first question is concerning the c++ code generation. I am currently > using the following command for generating the XML interface. How can I > generate a single file instead of 4011 individual files? When I omit > --file-per-type I don't get all the types in the single file. 
There is no way to get a single header/source set from multiple schema files. And, for a large schema, you probably wouldn't want to, since you may not be able to compile the resulting source file (yes, we have run into this and have a mechanism to split single source file into multiple parts; see the --parts* options). But seeing that you depend on GML 3.2, file-per-type is probably your only option (you could try to compile your schemas in the file-per-schema mode to minimize the number of files, but getting that to work is more of an art than science). See these release notes for background on this mode: http://www.codesynthesis.com/~boris/blog/2008/02/13/codesynthesis-xsd-3-1-0-released/ > Question 2; > > When I am comparing the cold performance of the following code, the millage > may vary. I would state the performance is similar to Xerces in Java. Which > makes me wonder if he 'hot' performance would be much better? Or that I am > trying to do something that even with the generated C++ code is not > optimised. I am aware of the performance example in the source code, that > could preload the schema once and run from it many times. > > [...] > > real 17m6.611s > user 17m1.399s > sys 0m3.917s > > real 5m21.199s > user 5m19.587s > sys 0m1.450s I am confused, what are these two results for? Hot vs cold? Overall, if you know that the identity constraint validation is your bottleneck, I wouldn't expect pre-loading the schemas to help much. > Question 3; > > One of the other things I noticed is that the Codesynthesis identity > constraint validation only reports the location of the end-tag (and > therefore an ocean of duplicates), missing the exact location that xmllint > does produce, for example: > > [...] > > Is there a way to get the invalid line, with the correct type? That would most likely require improvements to Xerces-C++. 
From boris at codesynthesis.com Tue May 12 07:16:19 2020 From: boris at codesynthesis.com (Boris Kolpackov) Date: Tue May 12 07:27:18 2020 Subject: [xsd-users] AnyType and the extraction of contents In-Reply-To: References: Message-ID: Daniel K?sa writes: > Snip from the generated "observation.xsd" OGC Observation & Measurements > schema: > > ---------------------------- snip ------------------------------ > // result > // > { > ::std::unique_ptr< ::xsd::cxx::tree::type > tmp ( > ::xsd::cxx::tree::type_factory_map_instance< 0, char > ().create ( > "result", > "http://www.opengis.net/om/2.0", > &::xsd::cxx::tree::factory_impl< result_type >, > true, true, i, n, f, this)); This suggests that anyType is treated as a polymorphic type. What is the semantics of this `result` element? Is it part of a substitution group? If so, perhaps what you are getting in your object model if not anyType but an instance of one of its derived types? That would explain why you are not seeing any DOM content. From stefan at konink.de Tue May 12 10:41:17 2020 From: stefan at konink.de (Stefan de Konink) Date: Tue May 12 10:52:26 2020 Subject: [xsd-users] Large XSD-schema, speed and identity constraint =?iso-8859-1?Q?validation?= In-Reply-To: References: Message-ID: Dear Boris, Thanks for your in depth reply. On Tuesday, May 12, 2020 1:11:57 PM CEST, Boris Kolpackov wrote: > Based on your reference to identity constraints further down in your > email I am going to assume that by "constraint validation" above you > mean "identity constraint validation". Correct. > Over all these years, I can't remember seeing many cases where this was > an issue. Which probably explains why there is no better/faster way. I don't mind to get my hands dirty on this subject. > Also, keep in mind that CodeSynthesis XSD delegates XML Schema validation, > including identity constraint validation, to Xerces-C++. 
Does this practically mean that if I would only care about XSD-validation, there would not be any net benefit to use the XSD toolset, because the resulting code is not used to generate a specific parser that is employed while doing a XSD validation? I am thinking in the direction of XML Screamer research. > There is no way to get a single header/source set from multiple schema > files. And, for a large schema, you probably wouldn't want to, since you > may not be able to compile the resulting source file (yes, we have run > into this and have a mechanism to split single source file into multiple > parts; see the --parts* options). > > But seeing that you depend on GML 3.2, file-per-type is probably your > only option (you could try to compile your schemas in the file-per-schema > mode to minimize the number of files, but getting that to work is more > of an art than science). See these release notes for background on this > mode: Understood. As you may have see the "art" of designing light XSD's, that only define a single profile (where the net effect is that a validator would complain about extra elements) is something that could greatly optimise the performance of the validator. Obviously this is expected behavior but not many XSD tools support cutting unused the bloat in a consistent matter. Meaning that designing a smaller XSD is typically bottom up again. >> real 17m6.611s >> user 17m1.399s >> sys 0m3.917s >> >> real 5m21.199s >> user 5m19.587s >> sys 0m1.450s > > I am confused, what are these two results for? Hot vs cold? Same machine, same data, multiple runs, same code, showing the min and max. >From my benchmarking background I would consider them both cold. I cannot explain (other than hardware reasons, tested it on a laptop Ryzen 2500U) why the results give huge outliers for both libxml2 and xerces-c. I cannot exclude the initial loading (i/o) of the XSD-schema either. 
> Overall, if you know that the identity constraint validation is your > bottleneck, I wouldn't expect pre-loading the schemas to help much. I am considering to create an alternative identity constraint validation mechanism. But I would have to dive into the current mechanism if a novel approach is actually improving anything over the lack of work on the subject in the last 10 years. >> Question 3; >> >> One of the other things I noticed is that the Codesynthesis identity >> constraint validation only reports the location of the end-tag (and >> therefore an ocean of duplicates), missing the exact location that xmllint >> does produce, for example: ... > > That would most likely require improvements to Xerces-C++. I think this was partially a wrong statement. Within Xerces-Java the Line Number does represent the expected tag. org.xml.sax.SAXParseException; lineNumber: 131; columnNumber: 27; Not enough values specified for identity constraint specified for element "PublicationDelivery". The ouput I get from the XSD-validation, thus probably Xerces-C++: /var/tmp/NeTEx_SYNTUS_20200422_New_NDOV-pushed.xml:131:27 error: element 'PublicationDelivery' does not have enough values for identity constraint key 'VehicleType_AnyVersionedKey' But specifically the Java version is capable of doing this: org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 'StopArea_KeyRef' with value 'SYNTUS:StopArea:60103,20200422' not found for identity constraint of element 'PublicationDelivery'. org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 'ScheduledStopPoint_KeyRef' with value 'SYNTUS:ScheduledStoppoint:50203005,20200422' not found for identity constraint of element 'PublicationDelivery'. org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 'TransportAdministrativeZone_KeyRef' with value 'NL:AdministrativeZone:AL,any' not found for identity constraint of element 'PublicationDelivery'. 
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 'Operator_KeyRef' with value 'SYNTUS,20200422' not found for identity constraint of element 'PublicationDelivery'. While the C++ version does: /var/tmp/NeTEx_SYNTUS_20200422_New_NDOV-pushed.xml:1499081:23 error: identity constraint key for element 'PublicationDelivery' not found (duplicated: 1196 times) So I am missing the "Key/Value" report but get an ocean of duplicates where I can't find out the reason. I'll drop the Xerces-C++ mailinglist a line. -- Stefan From boris at codesynthesis.com Wed May 13 06:36:34 2020 From: boris at codesynthesis.com (Boris Kolpackov) Date: Wed May 13 06:47:37 2020 Subject: [xsd-users] AnyType and the extraction of contents In-Reply-To: References:

Message-ID: Daniel K?sa writes: > The schema type which I am using is from the following > http://schemas.opengis.net/om/2.0/observation.xsd. Ok, assuming there is no xsi:type in the XML document, that element is a non-polymorphic anyType. > If I drop "--generate-polymorphic" then anyType is initialized with the > "flags:extract_contents" and I have no longer an issue. Besides --generate-polymorphic you should also have one or more of the --polymorphic-type* options and one of them (such as --polymorphic-type-all) forces treating anyType as polymorphic. You can try to "tighten" these options to only treat types that are truly polymorphic as such (of course, it's possible that somewhere else in your schema anyType is used as a polymorphic base in which case this workaround won't work). > Can we assume that these two modes may not be used simultaneously then? I think it's safe to say this is a bug. As described above, anyType can be used both ways in the same vocabulary and the generated code should be smart enough to handle it. From daniel.kaasa at gmail.com Tue May 12 15:14:10 2020 From: daniel.kaasa at gmail.com (=?UTF-8?B?RGFuaWVsIEvDpXNh?=) Date: Wed May 13 06:50:08 2020 Subject: [xsd-users] AnyType and the extraction of contents In-Reply-To: References: Message-ID: The schema type which I am using is from the following http://schemas.opengis.net/om/2.0/observation.xsd. The type that is returned and which is valid for the element "result" is anyType, however it is not initialized with the "flags::extract_contents". if I drop "--generate-polymorphic" then anyType is initialized with the "flags:extract_contents" and I have no longer an issue. Without generating polymorphic and importing of maps: 1. It is possible to access the content using dom_content(). 2. The parsed anyType is also written when serialized. With generating polymorphic and importing of maps: 1. It is *not* possible to access the content using dom_content(). 2. 
The parsed anyType is *not* written when serialized. Can we assume that these two modes may not be used simultaneously then? Thank you for your help! Daniel On Tue, May 12, 2020 at 1:16 PM Boris Kolpackov wrote: > Daniel K?sa writes: > > > Snip from the generated "observation.xsd" OGC Observation & Measurements > > schema: > > > > ---------------------------- snip ------------------------------ > > // result > > // > > { > > ::std::unique_ptr< ::xsd::cxx::tree::type > tmp ( > > ::xsd::cxx::tree::type_factory_map_instance< 0, char > ().create ( > > "result", > > "http://www.opengis.net/om/2.0", > > &::xsd::cxx::tree::factory_impl< result_type >, > > true, true, i, n, f, this)); > > This suggests that anyType is treated as a polymorphic type. What is the > semantics of this `result` element? Is it part of a substitution group? > If so, perhaps what you are getting in your object model if not anyType > but an instance of one of its derived types? That would explain why you > are not seeing any DOM content. > From boris at codesynthesis.com Thu May 14 08:43:28 2020 From: boris at codesynthesis.com (Boris Kolpackov) Date: Thu May 14 08:54:33 2020 Subject: [xsd-users] Large XSD-schema, speed and identity constraint validation In-Reply-To: References:

Message-ID: Stefan de Konink writes: > >Also, keep in mind that CodeSynthesis XSD delegates XML Schema validation, > >including identity constraint validation, to Xerces-C++. > > Does this practically mean that if I would only care about XSD-validation, > there would not be any net benefit to use the XSD toolset, because the > resulting code is not used to generate a specific parser that is employed > while doing a XSD validation? I am thinking in the direction of XML Screamer > research. Correct. Validation in generated code (also called "perfect parser") works well for smaller/simpler schemas (which is the reason why we went this way for XSD/e, our mobile/embedded version). But for schemas we are talking about (e.g., GML), the size of the generated code becomes impractical in many cases. > >>real 17m6.611s > >>user 17m1.399s > >>sys 0m3.917s > >> > >>real 5m21.199s > >>user 5m19.587s > >>sys 0m1.450s > > > >I am confused, what are these two results for? Hot vs cold? > > Same machine, same data, multiple runs, same code, showing the min and max. > From my benchmarking background I would consider them both cold. I cannot > explain (other than hardware reasons, tested it on a laptop Ryzen 2500U) why > the results give huge outliers for both libxml2 and xerces-c. I cannot > exclude the initial loading (i/o) of the XSD-schema either. Do you perhaps have remote (e.g., http://) schema references in (some of) your schemaLocation attributes? That would explain these results quite well. > So I am missing the "Key/Value" report but get an ocean of duplicates where > I can't find out the reason. I haven't looked into this in detail but maybe you can resolve the schema names referenced in the error message back to schema locations based on the loaded schema grammar. 
From stefan at konink.de Thu May 14 09:11:58 2020 From: stefan at konink.de (Stefan de Konink) Date: Thu May 14 09:23:12 2020 Subject: [xsd-users] Large XSD-schema, speed and identity constraint =?iso-8859-1?Q?validation?= In-Reply-To: References:

Message-ID: On Thursday, May 14, 2020 2:43:28 PM CEST, Boris Kolpackov wrote: > Correct. Validation in generated code (also called "perfect parser") works > well for smaller/simpler schemas (which is the reason why we went this way > for XSD/e, our mobile/embedded version). But for schemas we are talking > about (e.g., GML), the size of the generated code becomes impractical > in many cases. So XSD/e would be running on the generated C++ code without Xerces? As comparison I would still find this an interesting approach. > Do you perhaps have remote (e.g., http://) schema references in (some > of) your schemaLocation attributes? That would explain these results > quite well. Thanks for this tip. I'll try to run a wireshark session to validate if this happens. If this happens, is there any way of registering or caching "local" equivalents without changing the XSD? >> So I am missing the "Key/Value" report but get an ocean of >> duplicates where >> I can't find out the reason. > > I haven't looked into this in detail but maybe you can resolve the schema > names referenced in the error message back to schema locations based on > the loaded schema grammar. I have asked a questions in the Xerces C++ group, concerning this issue, but I didn't receive any input on it. I am also surprised about the number of duplications, either the Java version is not picking it up or the C++ is duplicating it. -- Stefan From stefan at konink.de Thu May 14 17:58:36 2020 From: stefan at konink.de (Stefan de Konink) Date: Fri May 15 09:29:29 2020 Subject: [xsd-users] Large XSD-schema, speed and identity constraint =?iso-8859-1?Q?validation?= In-Reply-To: References:

Message-ID: <99a5923b-b1f8-414c-8802-ca9ece26c653@konink.de> Hi Boris, On Thursday, May 14, 2020 2:43:28 PM CEST, Boris Kolpackov wrote: > I haven't looked into this in detail but maybe you can resolve the schema > names referenced in the error message back to schema locations based on > the loaded schema grammar. It took some cozy debugging setting, a couch, gdb, two people looking at a screen, you get the setting. Xerces luckely has only one place where the Identity Constraints are checked. So that suggests one place to improve; 302 { 303 FieldValueMap& valueMap = iter.nextElement(); 304 305 if (!keyValueStore->contains(&valueMap) && fDoReportError) { 306 307 fScanner->getValidator()->emitError(XMLValid::IC_KeyNotFound, 308 fIdentityConstraint->getElementName()); 309 } 310 } 311 } The getElementName resolves the elementName under which in the Schema the key-constraint has been placed. Personally I am more interested in: p fIdentityConstraint.fIdentityConstraintName $34 = (XMLCh *) 0x590070 u"ToPointRef" p fIdentityConstraint.fSelector.fXPath.fExpression $40 = (XMLCh *) 0x590390 u".//netex:ToPointRef" p fIdentityConstraint.fFields.fElemList[0].fXPath.fExpression $32 = (XMLCh *) 0x590930 u"@ref" print keyValueStore.fIdentityConstraint.fIdentityConstraintName $70 = (XMLCh *) 0x57cf10 u"ScheduledStopPointId" print *valueMap.fValues.fElemList $41 = 0x8ce9d90 u"CXX-ALL:RoutePoint:78210040" (gdb) p fIdentityConstraint->getElementName() $77 = (XMLCh *) 0x5902f0 u"ServiceFrame" I could envision an error message creating an error similar to the Java output. Key 'ToPointRef' with value 'CXX-ALL:RoutePoint:78210040' not found for identity constraint of element 'ServiceFrame'. But I would need to figure out how a compound key, but even a naive approach significantly improves the current output. And I would also love the line number of the actually checked object. (Help appreciated!) 
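As a rough illustration of the naive approach mentioned above, the following sketch builds a message in the same shape as the Xerces-Java output quoted earlier in this thread. It is self-contained and uses std::string rather than Xerces-C++'s XMLCh; the function name key_not_found_message() is hypothetical, and joining the fields of a compound key with commas is an assumption based on the Java messages, not on the Xerces-C++ sources.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (plain std::string instead of XMLCh): builds a
// Java-style "key not found" message. The values of a compound key are
// joined with commas, which is an assumption derived from the Xerces-Java
// output quoted in this thread.
inline std::string
key_not_found_message (const std::string& constraint,
                       const std::vector<std::string>& values,
                       const std::string& element)
{
  std::string joined;

  for (std::size_t i = 0; i != values.size (); ++i)
  {
    if (i != 0)
      joined += ',';

    joined += values[i];
  }

  return "Key '" + constraint + "' with value '" + joined +
         "' not found for identity constraint of element '" + element + "'.";
}
```

In the debugging session above, the three inputs would correspond to fIdentityConstraintName, the values held in valueMap, and the result of getElementName(). Converting those from XMLCh and attaching the line number of the checked object are left out of the sketch.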
Opened an issue here

--
Stefan

From Cihan-Kaya.Guenduez at l3harris.com Wed May 27 09:28:53 2020
From: Cihan-Kaya.Guenduez at l3harris.com (Cihan-Kaya.Guenduez)
Date: Thu May 28 10:26:25 2020
Subject: [xsd-users] How to get rid of "P0Y" as default zero duration value
Message-ID:

Dear Sir or Madam,

Is there any way to avoid "P0Y" (zero years) as the default lexical form
of a zero xs:duration value?

For example the definition of xs:dayTimeDuration uses a restriction pattern:

Using the generated serialization functions (cxx/tree/serialization into
output stream), a zero dayTimeDuration is always written as "P0Y", which
violates the restriction pattern.

I look forward to hearing from you soon.

Yours faithfully,
Cihan-Kaya Gündüz

From boris at codesynthesis.com Thu May 28 10:30:32 2020
From: boris at codesynthesis.com (Boris Kolpackov)
Date: Thu May 28 10:42:22 2020
Subject: [xsd-users] How to get rid of "P0Y" as default zero duration value
In-Reply-To:
References:
Message-ID:

Cihan-Kaya.Guenduez writes:

> For example the definition of xs:dayTimeDuration uses a restriction pattern:
>
>
>
>
>
>
>
> Using the generated serialization functions (cxx/tree/serialization into
> output stream), a zero dayTimDuration always yield in "P0Y" and violates the
> restriction pattern.

I think the best way to do this is to customize the generated C++ type for
dayTimeDuration and provide a custom serialization implementation that
does what you want.
You can read about type customization in C++/Tree here:

http://wiki.codesynthesis.com/Tree/Customization_guide

There are also a number of examples in the examples/cxx/tree/custom/
directory of the XSD distribution that show how to do this (the 'double'
and 'contacts' examples are probably the most relevant to your case).
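The formatting logic that such a custom serialization might delegate to could look roughly like the sketch below. It is illustrative only: the day_time_duration struct and to_lexical() are hypothetical names, not the XSD runtime's duration class or serialization operators, fractional seconds are left out, and the choice of "PT0S" for zero assumes the restriction pattern rejects year and month components.

```cpp
#include <string>

// Hypothetical value type for illustration; it is not the XSD runtime's
// duration class. Fractional seconds are omitted for brevity.
struct day_time_duration
{
  bool negative;
  unsigned long days, hours, minutes, seconds;
};

// Produce a lexical form that uses only day and time components,
// emitting "PT0S" rather than "P0Y" for a zero duration.
inline std::string to_lexical (const day_time_duration& d)
{
  if (d.days == 0 && d.hours == 0 && d.minutes == 0 && d.seconds == 0)
    return "PT0S";

  std::string r (d.negative ? "-P" : "P");

  if (d.days != 0)
    r += std::to_string (d.days) + 'D';

  // The 'T' designator is only written if a time component follows.
  if (d.hours != 0 || d.minutes != 0 || d.seconds != 0)
  {
    r += 'T';

    if (d.hours != 0)
      r += std::to_string (d.hours) + 'H';

    if (d.minutes != 0)
      r += std::to_string (d.minutes) + 'M';

    if (d.seconds != 0)
      r += std::to_string (d.seconds) + 'S';
  }

  return r;
}
```

A customized dayTimeDuration type would call something like this from its own serialization code, following the customization guide and the examples mentioned above.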