A profiling war story

The other day I ran across an XML schema that took the XSD compiler unusually long to translate to C++. While the schema itself was quite large, almost 2MB of definitions spread over 80 files, the compilation time on my machine was still excessive: about 160 seconds. We also had a report from several months earlier where XSD appeared to run indefinitely while compiling a really large schema (15K element declarations across about 200 files). Unfortunately, that reporter couldn’t provide us with the schema for debugging due to distribution restrictions. All we could do was give them a new version of the compiler and have them tell us whether it worked or not.

I set out to figure out what was going on using the schema I had, suspecting that it was the same problem as the one affecting the other, larger schema. The first step was to understand which part of the compilation process takes so much time. A good profiler can be very helpful in accomplishing this. I personally prefer OProfile, mainly because: 1) it is system-wide, so you can see the complete picture (your application, libraries, even the kernel); 2) it is transparent, in the sense that you don’t need a special build of your application or of the libraries it depends on; and 3) it has low overhead, which means your application doesn’t take 10 times as long to execute when profiling.

I ran OProfile while the XSD compiler was compiling the schema. My first step in analyzing the program execution was to get a flat symbol list that shows how much time is spent in each function. Here is the top of the listing for XSD:

samples  %        image name  symbol name
 
405527   14.4254  libc.so     _int_malloc
334761   11.9081  libc.so     _int_free
287232   10.2174  libc.so     free
285081   10.1409  xsd         Cult::MM::allocate
199565    7.0989  libc.so     malloc
124309    4.4219  xsd         Schema::find
119779    4.2608  xsd         operator delete
104893    3.7312  xsd         std::map<...>::find
94964     3.3780  libc.so     wmemcmp
85689     3.0481  xsd         Cult::MM::Key::size
80308     2.8567  xsd         operator new
62390     2.2193  libc.so     strlen
37687     1.3406  xsd         std::string::_S_create
35205     1.2523  libc.so     memcpy

When you read articles about profiling you are normally presented with a listing similar to the one above where one or two functions at the top account for 90% of the execution time. This, however, doesn’t happen very often in complex programs built by competent engineers. Complex programs perform several operations in a pipeline manner instead of just one algorithmic task (image conversion is a good example of an algorithmic application where you may see a single function at the top taking 90% of the time). Competent engineering means that there are no silly programming or architectural mistakes resulting in single-function bottlenecks. Instead, you normally see a listing like the one above: execution time is spread over many functions, with no single function standing out as unreasonably expensive. Such a flat listing is of little use in understanding what’s going on in the program.

What we need instead are the aggregate times for each function: that is, the time spent in the function itself plus the time spent in all the functions it calls. We also need to know which functions each function calls, along with their respective times. With this information we can find which operation in the program takes most of the execution time, as well as figure out which parts of that operation are responsible for the slowdown.

The best way to get this information from OProfile is to collect the samples with call graph information (see the --callgraph option), then convert them to the gprof format with the opgprof command, and finally generate the call graph with the gprof -q command. One limitation of the call graph information is that it is very inaccurate if your application is optimized. To get useful results you will need to recompile your application with optimization turned off.

The call graph output is quite a bit more complex and can be a lot more verbose compared to the flat symbol list presented above. The gprof call graph format is explained in detail in the gprof documentation.

One way to get a summary of the gprof call graph that just tells us which functions have the largest aggregate times is to run the following sed script on the call graph output (it keeps only the primary line of each call graph entry, which in the gprof output begins with the function’s index in square brackets):

sed -e '/^\[/p;d' callgraph >summary

Here is the top of the summary listing for XSD:

% time self children called   name
 
99.8      0  645554                   main
94.0  16021  591541  588759+3889477   Schema::find
82.6   1024  533243  512793+36104907  <cycle 4 as a whole>
82.4      0  532686  512443           Parser::parse
55.5      8  358965  465329           Node<Element>::trampoline
49.5  27896  291869  265608+36968     <cycle 1 as a whole>
47.9      0  309389  299838           Resolver::resolve_element
42.6  18427  256696  215619           operator new
38.6      1  249727  242001           Resolver::traverse

From this listing it is immediately apparent that the problem is most likely in the Schema::find operation. It takes up 94% of the program execution time and is called half a million times from other functions and almost four million times recursively.

The Schema::find() function looks up namespaces with the specified XML namespace name in the schema graph. To understand how it works one needs a basic understanding of how the schema graph is organized. Each schema file is represented by a Schema object which contains a list of references to the included and imported schema files (also represented as Schema objects) along with a Namespace object which contains the actual type, element, etc., declarations. Schema::find() returns a list of Namespace objects corresponding to the given namespace name. To collect this list, it traverses the include/import graph to make sure it visits every Schema object visible, directly or indirectly, from the initial schema.
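
To make this more concrete, here is a minimal sketch of what such a graph and a naive recursive lookup might look like. The names and signatures below are simplified illustrations, not the actual XSD compiler code:

#include <list>
#include <string>

// Simplified illustration of the schema graph, not the actual XSD classes.
struct Namespace
{
  std::string name; // XML namespace name; also holds the declarations
};

struct Schema
{
  Namespace ns;                // declarations contained in this schema file
  std::list<Schema*> includes; // included and imported schema files
  bool traversing = false;     // "context" flag used to detect recursive inclusion
};

// Naive lookup: collect every Namespace with the given name that is visible,
// directly or indirectly, from the initial schema.
void
find (Schema& s, const std::string& name, std::list<Namespace*>& result)
{
  if (s.traversing) // recursive include/import; stop here
    return;

  s.traversing = true;

  if (s.ns.name == name)
    result.push_back (&s.ns);

  for (Schema* i: s.includes)
    find (*i, name, result); // re-visits a schema every time it is reachable

  s.traversing = false;
}

Note that because the traversal flag is cleared on the way out, only true cycles are cut short; a schema that is reachable via several include/import paths is traversed once per path.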

There could be two general reasons why Schema::find() takes a disproportionate amount of time: it could be called too many times by other functions, or the problem could be in its implementation, or both. Seeing that Schema::find() is called more than six times as often recursively as from other functions, the suspicion is that it is the implementation. Here is the fragment of the call graph listing corresponding to this function:

self  children    called       name
 
16021  591541  588759+3889477  Schema::find
 8275  217657  208177/208268    Context::set
 3528  174629  158008/158008    std::list<...>::insert
 3190   45372  65030/65055      Context::remove
  346   47712  43879/43880      std::list<...>::clear
 3500   28218  32082/37821      Scope::find
 3560    9619  20461/20586      Context::count

As you can see, the time is dominated by two groups of operations. The first is the Context operations, which are used to detect recursive imports and inclusions (that is, a flag is set in the Schema’s context to indicate that it is being traversed). The second is the list insertion and removal operations. This tells us two things: a lot of Schema objects are being traversed and a lot of Namespace objects are being inserted into the resulting list.

The size of the schema, and in particular the number of files, which corresponds to the number of Schema objects, could not by itself explain the long compilation time, since there are comparable or even larger schemas that compile much faster. There had to be something else special about this schema. As it turned out, and this is something I had to figure out on my own since no profiling tool can help with this last step, the schema had a very expansive include/import tree. Most files in this schema included or imported a handful of other files, which in turn did the same. The end result was that each schema file was included or imported, directly or indirectly, hundreds of times: if a file is included by ten files, each of which is in turn included by ten more, it gets reached a hundred times. That meant that the Schema::find() implementation had to traverse the same files over and over again (hence all the Context operations) and insert the same Namespace objects into the resulting list over and over again (hence all the list operations).

The solution was fairly straightforward. I added a set which keeps track of all the Schema objects that have already been seen in the current execution of Schema::find(). The compilation time dropped from 160 seconds to 1.5 seconds, a more than 100-fold speedup. And, as it turned out, this was the same problem that affected the larger schema which we could not get.
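
In terms of the sketch above, the fix amounts to threading a set of already-visited schemas through the traversal (again, simplified names rather than the actual implementation):

#include <set>

// Uses the Schema and Namespace types from the earlier sketch.
void
find (Schema& s,
      const std::string& name,
      std::list<Namespace*>& result,
      std::set<const Schema*>& seen)
{
  // Skip schemas that were already handled in this execution of find().
  // This also takes care of recursive includes/imports.
  if (!seen.insert (&s).second)
    return;

  if (s.ns.name == name)
    result.push_back (&s.ns);

  for (Schema* i: s.includes)
    find (*i, name, result, seen);
}

With this change each Schema object is visited at most once per lookup, so the cost becomes proportional to the number of schema files rather than to the number of include/import paths between them.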
