[xsd-users] CodeSynthesis/tree (version 4.0.0) and Nvidia NVCC compiler (CUDA version 7.0)

Wed Jul 15 18:26:22 EDT 2015

I will try to answer each of your questions:
"Specifically: what exactly exhausts this "global const memory"? Is it the header sizes? Number of inclusions?"
The short answer is that I really don't know what is causing it. The problem occurs when I include the *.hxx header files that CodeSynthesis generates from XSD schemas in *.cu source files that are compiled and linked with nvcc. For every *.cu source file that includes the *.hxx files, nvcc sets aside device-side storage in CUDA global constant memory, so each additional source file requires additional memory. The link fails once the cumulative global constant memory exceeds 64K bytes, and this limits the number of source files. The header file size does not seem to matter: if I take a very small header and include it many times, it fails, and if I take a larger header and include it just a few times, it also fails. So the number of inclusions does not seem to matter either. But I could be wrong, as this was just an informal experiment.
"Something specific in the headers?"
While we don't know the exact cause, it seems that NVCC is misinterpreting some of the symbols in the CodeSynthesis header files. Why?
It appears that the NVCC compiler encounters symbols in the libxsd/xsd/cxx/tree/ headers that it interprets as a request for constant global memory (CGM) space. Each compilation unit (i.e., each source file that includes these headers) requires a few bytes of CGM, and the total eventually exhausts the CGM cache.

I have attached a trivial source file that includes <xsd/cxx/tree/elements.hxx> and compiled it under nvcc for CUDA 7.0. The compiler output below shows that 16 bytes of CGM are required. The number of CGM bytes varies depending on which CodeSynthesis headers are included.

If we were to make copies of the attached file and recompile, we would see 16 bytes of CGM times the number of files. Eventually (at about 4K files!) we would exhaust the 64K CGM cache.
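The estimate above can be checked with a few lines of C++. This is only a sketch of the arithmetic: the names cgm_limit_bytes and max_translation_units are made up for illustration, and it assumes every source file contributes the same number of CGM bytes.

```cpp
#include <cstddef>

// The 64K constant memory limit described in this message.
constexpr std::size_t cgm_limit_bytes = 64 * 1024;

// Number of source files that fit within the limit, assuming each one
// contributes the same number of bytes (hypothetical helper).
constexpr std::size_t max_translation_units (std::size_t bytes_per_unit)
{
  return cgm_limit_bytes / bytes_per_unit;
}

// With the 16 bytes reported by ptxas for the attached file:
static_assert (max_translation_units (16) == 4096,
               "16 bytes per file fills 64K after 4096 files");
```

Headers that cost more bytes per file would lower the count proportionally.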

CGM is an NVIDIA cache for read-only memory. It should only be allocated when one defines a global constant using the __constant__ keyword (which appears NOWHERE in CodeSynthesis, Boost, or Xerces). The issue is that something in <elements.hxx> or a file it includes is confusing the compiler.

I have done some preliminary experiments and it appears that constructs like 

return x_ != 0 ? &self_::true_ : 0;

in <containers.hxx> that mix member function pointers and built-in types contribute to the problem.
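For readers who do not have the libxsd sources at hand, the kind of construct meant here can be sketched in isolation. The class below is a hypothetical stand-in written for this message, not the actual libxsd code; only the return statement is taken from <containers.hxx>. Whether this reduced form alone triggers the CGM allocation under nvcc has not been verified.

```cpp
// Hypothetical stand-in for a libxsd container (not the real source).
template <typename T>
class optional_like
{
  typedef optional_like self_;

  void true_ () {} // Only its address is used, as a "true" value.

public:
  // Pointer-to-member-function type used for boolean tests.
  typedef void (self_::*bool_convertible) ();

  explicit optional_like (T* x = 0): x_ (x) {}

  // One conditional expression mixes a pointer to member function
  // with the built-in literal 0.
  operator bool_convertible () const
  {
    return x_ != 0 ? &self_::true_ : 0;
  }

private:
  T* x_;
};

// Hypothetical helper showing how the conversion is used in a test.
template <typename T>
bool has_value (const optional_like<T>& o)
{
  return o ? true : false;
}
```

The conversion lets an object be tested in an if statement without making it implicitly convertible to an integer, which is why older C++ code uses it in place of operator bool.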

I realize that this is a bug in the NVCC compiler and not a limitation of CodeSynthesis ... but modifying CodeSynthesis might be the easiest way to resolve the problem. Every compiler has its unique quirks ... 

PS All compiler options are shown below. The include paths are required for Boost version 1.57.0 and Xerces 3.1.1.

make all 
Building file: ../boris.cu
Invoking: NVCC Compiler
/usr/local/cuda-7.0/bin/nvcc -DBOOST_NO_SFINAE_EXPR -I/home/main-server/cuda-workspace/3rd-party-libs/linux/codeSynthesis/xsd-4.0.0+dep/xsd/libxsd -I/usr/local/include -g -O0 -Xcompiler -fPIC -Xptxas -v -std=c++11 -gencode arch=compute_50,code=sm_50  -odir "." -M -o "boris.d" "../boris.cu"
/usr/local/cuda-7.0/bin/nvcc -DBOOST_NO_SFINAE_EXPR -I/home/main-server/cuda-workspace/3rd-party-libs/linux/codeSynthesis/xsd-4.0.0+dep/xsd/libxsd -I/usr/local/include -g -O0 -Xcompiler -fPIC -Xptxas -v -std=c++11 --compile --relocatable-device-code=false -gencode arch=compute_50,code=compute_50 -gencode arch=compute_50,code=sm_50  -x cu -o  "boris.o" "../boris.cu"
ptxas info    : 8 bytes gmem, 16 bytes cmem[3]
/home/main-server/cuda-workspace/3rd-party-libs/linux/codeSynthesis/xsd-4.0.0+dep/xsd/libxsd/xsd/cxx/tree/elements.hxx: In member function ‘virtual void xsd::cxx::tree::_type::_container(xsd::cxx::tree::container*)’:
/home/main-server/cuda-workspace/3rd-party-libs/linux/codeSynthesis/xsd-4.0.0+dep/xsd/libxsd/xsd/cxx/tree/elements.hxx:635:78: warning: ‘auto_ptr’ is deprecated (declared at /usr/include/c++/4.8/backward/auto_ptr.h:87) [-Wdeprecated-declarations]
           XSD_AUTO_PTR<map>& m (dr ? dr->map_ : map_);
                                                                              ^
Finished building: ../boris.cu

Building target: libboris.a
Invoking: NVCC Archiver
nvcc  -lib -o  "libboris.a"  ./boris.o   
Finished building target: libboris.a 

"does this memory somehow
persist over compile/link cycles? If so, can't you clear it before
each recompile?"

No. 

     From: Boris Kolpackov <boris at codesynthesis.com>
 To: All Herald <loredofilms1 at yahoo.com> 
Cc: "xsd-users at codesynthesis.com" <xsd-users at codesynthesis.com> 
 Sent: Wednesday, July 15, 2015 1:11 PM
 Subject: Re: [xsd-users] CodeSynthesis/tree (version 4.0.0) and Nvidia NVCC compiler (CUDA version 7.0)

All,

All Herald <loredofilms1 at yahoo.com> writes:

> Thanks for the response. Looked at XSD/e and still having the same
> issue. The problem is that the NVCC Compiler (CUDA 7.0) has this hard memory
> limitation. Even with a fairly small schema of just a few bytes if you make
> repeated calls the global const memory accumulates and very soon you hit
> that hard limit imposed by the NVCC Compiler. So the issue quite frankly is
> with the compiler, but that being said, the compiler is what it is. Would
> you be able to download the NVCC CUDA 7 compiler and take a look at this
> if I send you a small example?

While I can try to reproduce this myself by downloading the NVCC
compiler, etc., I would much prefer if you help "jump-start" my
understanding of the problem and possible solutions. And, to be
honest, I am just as clueless about what's going on as after your
first email. Specifically: what exactly exhausts this "global const
memory"? Is it the header sizes? Something specific in the headers?
Number of inclusions?

You say:

"Even with a fairly small schema of just a few bytes if you make
 repeated calls [...]"

What "calls" are you talking about? Calls to what exactly?

Then you say:

"[...] try and compile and link using a small schema repeatedly [...]"

This doesn't make any sense to me at all: does this memory somehow
persist over compile/link cycles? If so, can't you clear it before
each recompile?

Boris

-------------- next part --------------
A non-text attachment was scrubbed...
Name: boris.cu
Type: application/octet-stream
Size: 64 bytes
Desc: not available
Url : http://codesynthesis.com/pipermail/xsd-users/attachments/20150715/8ed37bf9/boris.obj