[xsd-users] FW: map files and extended types...

Boris Kolpackov boris at codesynthesis.com
Tue Jul 20 10:38:40 EDT 2010


Hi John,

Dingelstad, John <john.dingelstad at dhv.com> writes:

> Too many tools allow for simple things to be done in too many 
> different ways!

I don't think one can really expect that processing multi-gigabyte
XML files that use a fairly complex XML vocabulary like DATEXII
will be a simple task, regardless of which tool one chooses.


> I like your tools, but they also sometimes leave me wondering 
> whether I am doing things the right way, each time I discover 
> something new. There is nothing wrong with that, but sometimes 
> I feel I am spending more time understanding all the bells and 
> whistles of the xsd tools than working on my actual problem... 
> Guess it's all part of the learning curve.

Yes, it is hard to "see" the best way to apply an unfamiliar
tool to a fairly complex problem. Spending some time getting to
know what's available, while time-consuming, is, I am afraid,
the only way to solve this. Well, describing your problem and
asking for suggestions on the mailing list is probably another
alternative ;-).

For example, XSD includes a large number of examples that show 
how to solve some fairly tricky problems. The C++/Tree mapping 
has 24 such examples, and one of them ('streaming') addresses the
issue that you are trying to overcome.


> Anyway, maybe you could tell me your opinion on what the best 
> approach would be in my case:
> 
> Basically, I get 2 DATEX files delivered. One contains a 
> measured data publication and indirectly refers to the other
> file, which contains the measured site table publication. The
> measured data publication (approx. 1GB in size) contains all
> kinds of measurement data records, which I shall process and
> store into a database, and in order to do so, I need info from
> the measurement site table publication.
> 
> Due to the large size of the measured data publication, I chose
> the C++ parser approach. First I will parse the measurement site
> table publication and create an internal data structure of only
> that data in which I am actually interested. This is something
> I've nearly finished. The 2nd step would be to parse the measured
> data publication. Once I've collected/parsed a record, I could
> do the necessary processing in one of the callback functions.

I see three possible ways to approach this:

1. Use the C++/Tree mapping with the streaming extension (see the
   'streaming' example). This will allow you to parse the document
   a chunk at a time and handle the object model fragment for
   this chunk. This approach will probably require the least
   amount of work.

2. Use the C++/Hybrid mapping from XSD/e in the partially in-memory 
   mode (again, see the 'streaming' example, but this time in the
   XSD/e distribution). The idea is the same as in (1) above;
   however, here you will override C++/Parser skeletons to
   "intercept" object model fragments (C++/Hybrid is built on top
   of C++/Parser). While this will probably require slightly more
   work than (1), the advantage over C++/Tree is a more compact
   (in terms of runtime memory) object model.

3. Use C++/Parser as you are trying to do now. This will require a 
   lot more work than the above two approaches. However, it is the
   most flexible approach since you control how the object model
   will look (for example, if you don't need certain fields, then
   you can leave them out of your object model and save some
   memory).

Let me know if you have any questions on any of this.


> BTW, is there a way to pass some extra data to the parser? I.e., 
> I'll need to access the data I previously collected from the 
> measurement site table publication within the parser callback 
> functions when I've collected a measurement data record.

You can add member variables to your parser implementation classes 
and initialize them when the parsers are being created.

Boris


