[xsd-users] dealing with xml written/read on-the-fly
cerion at kestrel.ws
Mon Oct 19 12:50:43 EDT 2009
Boris Kolpackov wrote:
>> Boris Kolpackov wrote
>>> If the stream ends with EOF then the parser assumes there is no more
>>> data available. And if the document is incomplete, then you will get
>>> a parsing error. In your case, I guess, you will need to provide a
>>> custom std::istream implementation (or xercesc::InputSource -- that
>>> could actually be easier) that doesn't return on EOF but instead keeps
>>> polling the file for more data (e.g., you could save the offset of the
>>> last byte read, wait some time, re-open the file, seek to that saved
>>> offset, and see if there is more data). I assume you will need to
>>> implement this logic somewhere in the application in any case. With
>>> this approach it will just be in the stream.
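A minimal sketch of the polling stream described above, using only the standard library. The class name tailing_filebuf and the wait_for_more callback are hypothetical and not part of XSD or Xerces-C++. The callback is assumed to sleep (or otherwise wait) and to return false once the writer is known to be finished, so that a "real" EOF can still be reported.

```cpp
#include <fstream>
#include <functional>
#include <ios>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical helper: a read-only stream buffer that treats end-of-file as
// "no data yet" and polls the file again instead of reporting EOF.
class tailing_filebuf : public std::streambuf {
public:
  // wait_for_more is called whenever no new bytes are found. It should wait
  // for a while and return true to retry, or return false to report EOF.
  tailing_filebuf(std::string path, std::function<bool()> wait_for_more)
      : path_(std::move(path)), wait_(std::move(wait_for_more)) {}

protected:
  int_type underflow() override {
    for (;;) {
      // Re-open the file and seek to the offset just past the last byte read.
      std::ifstream in(path_, std::ios::binary);
      if (in) {
        in.seekg(offset_, std::ios::beg);
        in.read(buf_, sizeof buf_);
        const std::streamsize n = in.gcount();
        if (n > 0) {
          offset_ += n;                 // remember how far we have read
          setg(buf_, buf_, buf_ + n);   // expose the new bytes to the stream
          return traits_type::to_int_type(*gptr());
        }
      }
      if (!wait_())
        return traits_type::eof();      // caller says the data is complete
    }
  }

private:
  std::string path_;
  std::function<bool()> wait_;
  std::streamoff offset_ = 0;
  char buf_[4096];
};
```

A std::istream constructed on top of such a buffer could then be handed to the parser in place of a plain std::ifstream. Note that the read still blocks inside underflow() while waiting.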
>> I had a look at doing this, but I'm not happy about this direction.
>> Xerces buffers the file data, and if the buffer gets low, it reads
>> ahead. This means there may be data available to xerces (in its buffer),
>> but we're going to block on the file anyway.
> What actually happens is this: if the raw character buffer has less than
> 100 bytes when Xerces-C++ tries to transcode the next batch of characters,
> then it will try to read some more. There is actually a technical reason
> for this other than efficiency (it has to do with multi-byte encodings
> and the buffer containing only some of the bytes constituting a code point).
> Because Xerces-C++ won't keep trying to read more if the stream returned
> less than 100 bytes, one way to mitigate this would be to return the data
> from InputSource::readBytes() in small chunks. If you return it one byte
> at a time, there will be no buffering at all.
Eugh - that's horrible! :-)
>> Plus I would need to take a look at the data last read from the file (i.e.
>> in xerces buffer, or seek back in the file), to see if EOF has been reached
>> correctly (closing tag has been read in).
> You mean you will need to check if "real" EOF has been reached, not the
> "fake" one ;-)? This is what happens when you try to "reuse" the same
> concept for different things.
> I wonder if there is a better design for this? Can't you use a pipe or
> socket instead?
Moving to a pipe/socket might be the only way to go, indeed.
I would _really_ have liked to use a file for debugging purposes: I
don't trust the XML source (Valgrind) not to mess things up, and I
wanted to allow users of my program to send me the XML file so I could
reproduce the error.
>> If I can avoid it, I'd prefer not to work with separate threads at all
>> (the above blocking read solution would need that). I imagined my Qt app
>> could be the driver, with a loop to pull in the next (few) top level
>> tags, and then update the GUI, and so on. This simplifies the whole
>> setup, and keeps Qt in control.
> Hm, that's hard to achieve. You want to pass the data and query the next
> construct. Something like this:
>   parser p;
>   p.here_is_more_data (buf, n);
>   chunk c = p.give_me_next_construct ();
> The problem is that you may not pass enough data so there is no construct
> to return. While it is probably possible to implement an XML parser like
> this, it will complicate the design significantly since the parser must
> be prepared to stop parsing at any point, return control to the user and
> then resume parsing from that point again.
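To make the stop-and-resume requirement concrete, here is a toy push-style splitter that borrows the two function names from the pseudocode above. It is a hypothetical illustration, not an XML parser and not part of XSD or Xerces-C++: it only tracks element depth and returns each complete child of the root element as raw text. It assumes the input has no comments, no CDATA sections, and no '>' inside attribute values.

```cpp
#include <cstddef>
#include <string>

// Toy push-style splitter (hypothetical). Assumes no comments, no CDATA
// sections, and no '>' characters inside attribute values.
class push_splitter {
public:
  void here_is_more_data(const char* buf, std::size_t n) {
    data_.append(buf, n);
  }

  // Returns true and fills `out` when a complete child of the root element
  // is buffered; returns false when more data is needed. The scan position
  // and depth are kept between calls so scanning resumes where it stopped.
  bool give_me_next_construct(std::string& out) {
    for (;;) {
      const std::size_t lt = data_.find('<', pos_);
      if (lt == std::string::npos) { pos_ = data_.size(); return false; }
      const std::size_t gt = data_.find('>', lt);
      if (gt == std::string::npos) { pos_ = lt; return false; }  // partial tag
      pos_ = gt + 1;
      const char kind = data_[lt + 1];
      if (kind == '?' || kind == '!') continue;  // XML declaration, DOCTYPE
      if (kind == '/') {
        --depth_;                                // closing tag
      } else {
        if (depth_ == 1) start_ = lt;            // a child of the root begins
        if (data_[gt - 1] != '/') { ++depth_; continue; }  // opening tag
        // Self-closing tag: depth is unchanged.
      }
      if (depth_ == 1) {                         // a child of the root ended
        out = data_.substr(start_, pos_ - start_);
        data_.erase(0, pos_);                    // drop what was handed out
        pos_ = 0;
        return true;
      }
    }
  }

private:
  std::string data_;
  std::size_t pos_ = 0;    // where scanning resumes
  std::size_t start_ = 0;  // start of the current child of the root
  int depth_ = 0;          // number of currently open elements
};
```

Even this toy has to carry three pieces of state across calls. A validating parser that also handles encodings, entities, and namespaces would need to suspend far more, which is the complication described above.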
What I did before with Qt3 was fairly straightforward: SAX reader,
callbacks on the end-tags to construct a DOM model.
The Qt SAX parser gives 'parse' and 'parseContinue' functions, which
keep track of the file position and buffer the XML data until it's
handed off via the end-tag callback.
All works well, and is simple.
Unfortunately, there's just no binding, so updates to the XML protocol
are horrible to maintain :-(
>> Qt solves this EOF problem by returning an UnexpectedEOF error, but makes
>> this recoverable, so we can continue parsing.
>> From what I understand from the docs and source code, XSD / Xerces don't
>> (yet) support recovery from this?
> No, and probably never will. I don't think such "EOF overloading" is a
> very common practice (or good design, for that matter).
Fair enough, although I'm not sure you understand - Qt4 doesn't use EOF
overloading: just as Xerces does, the parser throws the error, but it
isn't _fatal_, and is easily recoverable from. One just needs to handle
that EOF exception, wait for more data, and continue parsing.
>> If they do, how is this possible, and is this a way forward?
> I think the way forward would be to lower the chunk size returned by
> readBytes() as suggested above. If Qt must be in control, then I don't
> see any way to achieve this other than using a separate thread.
> I would also suggest that you use something other than a file to
> communicate the data between the two processes so that you don't
> need to play this real/fake EOF game.
Ok, I will ponder upon this a little more.
>> P.S. Do you have plans to make a xml binder for the Qt parsers? ;-)
> We may implement the "Qt/Tree" mapping one day which will use the "Qt
> way of doing things", including XML parsers. But there are no immediate plans.