Parsing C++ with GCC plugins, Part 2

By popular demand, here is the second installment in the series of posts on parsing C++ using the new GCC plugin architecture. In the previous post we concentrated on setting up the plugin infrastructure and identifying the point in the compilation sequence where we can perform our own processing. In this post we will see how to work with the GCC AST (abstract syntax tree) in order to access the parsed C++ representation. By the end of this post we will have a plugin implementation that prints the names, types, and source code locations of all the declarations in the translation unit.

First let’s cover a few general things about the GCC internals and AST that are useful to know. The GCC C++ compiler, cc1plus, can only process one file at a time (you can pass several files to the compiler driver, g++, but it simply invokes cc1plus separately for each file). As a result, GCC doesn’t bother with encapsulation and instead makes heavy use of global variables. In fact, most of the “data entry points” are accessible as global variables. We have already seen a few such variables in the previous post, notably, errorcount (number of compilation errors) and main_input_filename (name of the file being compiled). Perhaps the most commonly used such variable is global_namespace, which is the root of the AST.

The GCC AST itself is a curious data structure in that it is an implementation of the polymorphic data type idea in C (next time someone tells you that polymorphism works perfectly in C and they don’t need “bloated” C++ for that, show them the GCC AST). The base “handle” for all the AST nodes is the tree pointer type. Because the actual nodes can be of some “extended” types, access to the data stored in the AST nodes is done via macros. All such macros are spelled in capital letters and normally perform two operations: they check that the actual node type is compatible with the request and, if so, they return the data requested. A large number of macros defined for the AST are predicates. That is, they check for a certain condition and return true or false. Such macros normally end with _P.

Each tree node in the AST has a tree code of type int which identifies what kind of node it is. To get the tree code you use the TREE_CODE macro. Another useful global variable available to you is tree_code_name which is an array of strings containing human-readable tree code names. It is quite useful during development to see what kind of tree nodes you are getting, for example:

tree decl = ...;
int tc (TREE_CODE (decl));
cerr << "got " << tree_code_name[tc] << endl;

Each tree node type has a tree code constant defined for it, for example, TYPE_DECL (type declaration), VAR_DECL (variable declaration), ARRAY_TYPE (array type), and RECORD_TYPE (class/struct type). Macros that only apply to a specific kind of node often have names that start with the corresponding prefix; for example, the DECL_NAME macro can only be used on *_DECL nodes and the TYPE_NAME macro can only be used on *_TYPE nodes.
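To make the "check, then return" idea more concrete, here is a small standalone sketch of the same pattern. It is a toy model, not GCC code: toy_node, toy_code_name, TOY_DECL_P, and TOY_DECL_NAME are all hypothetical names invented for this illustration, and the failed check throws an exception purely so that the sketch is easy to experiment with.

```cpp
// Toy model only: none of these names exist in GCC.
#include <cassert>
#include <stdexcept>
#include <string>

enum toy_code
{
  TOY_VAR_DECL,
  TOY_TYPE_DECL,
  TOY_RECORD_TYPE,
  TOY_LAST_CODE
};

// Human-readable names indexed by code, in the spirit of tree_code_name.
static const char* const toy_code_name[TOY_LAST_CODE] = {
  "var_decl", "type_decl", "record_type"
};

struct toy_node
{
  toy_code code;
  const char* name; // Meaningful only for *_DECL nodes in this toy.
};

typedef toy_node* toy_tree;

#define TOY_CODE(n) ((n)->code)

// Predicate macro (note the _P suffix): true for declaration nodes.
#define TOY_DECL_P(n) \
  (TOY_CODE (n) == TOY_VAR_DECL || TOY_CODE (n) == TOY_TYPE_DECL)

// Check helper: returns the node unchanged or reports a mismatch.
// Throwing is an assumption of this sketch, chosen for convenience.
inline toy_tree
toy_decl_check (toy_tree n)
{
  if (!TOY_DECL_P (n))
    throw std::logic_error (
      std::string ("expected *_decl, got ") + toy_code_name[TOY_CODE (n)]);

  return n;
}

// Checked accessor: verify the node kind first, then return the data.
#define TOY_DECL_NAME(n) (toy_decl_check (n)->name)
```

With this in place, TOY_DECL_NAME applied to a TOY_VAR_DECL node returns its name, while applying it to a TOY_RECORD_TYPE node fails the check instead of silently reading the wrong field.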

To allow the construction of the AST out of the tree nodes, the tree type supports chaining nodes in linked lists. To traverse such lists you would use the TREE_CHAIN macro, for example:

tree decl = ...;
 
for (; decl != 0; decl = TREE_CHAIN (decl))
{
  ...
}

The AST type system also supports two dedicated container nodes: vector (TREE_VEC tree code) and two-value linked list (TREE_LIST tree code). However, these containers are used less often and will be covered as we encounter them.
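As a rough mental model of chaining and of the two-value list, consider the following standalone sketch. It is not GCC code: toy_list_node, TOY_CHAIN, and toy_list_entries are hypothetical names, and strings stand in for the tree nodes that a real TREE_LIST entry would hold (in GCC the two values are accessed with the TREE_PURPOSE and TREE_VALUE macros).

```cpp
// Toy model only: these types and functions are not part of GCC.
#include <cassert>
#include <string>
#include <vector>

struct toy_list_node
{
  std::string purpose;  // First value of a two-value list entry.
  std::string value;    // Second value of the entry.
  toy_list_node* chain; // Next entry, or 0 at the end of the list.
};

#define TOY_CHAIN(n) ((n)->chain)

// Walk the chain and collect "purpose=value" strings in list order.
inline std::vector<std::string>
toy_list_entries (toy_list_node* n)
{
  std::vector<std::string> r;

  for (; n != 0; n = TOY_CHAIN (n))
    r.push_back (n->purpose + "=" + n->value);

  return r;
}
```

The loop has the same shape as the TREE_CHAIN loop shown earlier: start from the head node and follow the chain until it becomes 0.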

One major class of nodes in the GCC AST is declarations. A declaration in C++ names an entity in a scope. Examples of declarations include a type declaration, a function declaration, a variable declaration, and a namespace declaration. To get to the declaration’s name we use the DECL_NAME macro. This macro returns a tree node of the IDENTIFIER_NODE type. To get the declaration’s name as const char* we can use the IDENTIFIER_POINTER macro. For example:

tree decl = ...;
tree id (DECL_NAME (decl));
const char* name (IDENTIFIER_POINTER (id));

While most declarations have names, there are certain cases, for example an unnamed namespace declaration, where DECL_NAME can return NULL.

Other macros that are useful when dealing with declarations include TREE_TYPE, DECL_SOURCE_FILE, and DECL_SOURCE_LINE. TREE_TYPE returns the tree node (with one of the *_TYPE tree codes) corresponding to the type of entity being declared. The DECL_SOURCE_FILE and DECL_SOURCE_LINE macros return the file and line information for the declaration.

Let’s now see how we can use all this information to traverse the AST and print some information about the declarations that we encounter. The first thing that we need is a way to get the list of declarations for a namespace. The GCC Internals documentation states that we can call the cp_namespace_decls function to get “the declarations contained in the namespace, including types, overloaded functions, other namespaces, and so forth.” However, this is not the case. With this function you can get to all the declarations except nested namespaces. This is because nested namespace declarations are stored in a different list in the cp_binding_level struct. If you want to know what the cp_binding_level is for, I suggest that you read its description in the GCC headers. Otherwise, you can just treat it as magic and use the following code to access all the declarations in a namespace:

void
traverse (tree ns)
{
  tree decl;
  cp_binding_level* level (NAMESPACE_LEVEL (ns));
 
  // Traverse declarations.
  //
  for (decl = level->names;
       decl != 0;
       decl = TREE_CHAIN (decl))
  {
    if (DECL_IS_BUILTIN (decl))
      continue;
 
    print_decl (decl);
  }
 
  // Traverse namespaces.
  //
  for (decl = level->namespaces;
       decl != 0;
       decl = TREE_CHAIN (decl))
  {
    if (DECL_IS_BUILTIN (decl))
      continue;
 
    print_decl (decl);
    traverse (decl);
  }
}

You may be wondering what the DECL_IS_BUILTIN checks are for. Besides the declarations that come from the file being compiled, the GCC AST also contains a number of implicit declarations for RTTI, exceptions, and static construction/destruction support code as well as compiler builtin declarations. Normally we would want to skip such declarations since we are not interested in them. But feel free to disable the above checks and see what happens.

The print_decl() function is shown below:

void
print_decl (tree decl)
{
  int tc (TREE_CODE (decl));
  tree id (DECL_NAME (decl));
  const char* name (id
                    ? IDENTIFIER_POINTER (id)
                    : "<unnamed>");
 
  cerr << tree_code_name[tc] << " " << name << " at "
       << DECL_SOURCE_FILE (decl) << ":"
       << DECL_SOURCE_LINE (decl) << endl;
}

Let’s now plug this code into the GCC plugin skeleton that we developed last time. All we need to do is add the traverse(global_namespace); call after the following statement in gate_callback():

  //
  // Process AST. Issue diagnostics and set r
  // to 1 in case of an error.
  //
  cerr << "processing " << main_input_filename << endl;

We can now try to process some C++ code with our plugin. Let’s try the following few declarations:

void f ();
 
namespace n
{
  class c {};
}
 
typedef n::c t;
int v;

The output from running our plugin on the above code will be something along these lines:

starting plugin
processing test.cxx
var_decl v at test.cxx:9
type_decl t at test.cxx:8
function_decl f at test.cxx:1
namespace_decl n at test.cxx:4
type_decl c at test.cxx:5

When I first started working with the GCC AST, I expected that I would be iterating over declarations in the same order as they were declared in the source code. As you can see from the above output, this is clearly not the case. While having multiple lists for declarations (for example, names and namespaces in the namespace node) would already not allow such ordered iteration, the order of declarations in the same list is not preserved either, as is evident from the above output. And it gets worse. Consider the following C++ fragment:

namespace n
{
  class a {};
}
 
void f ();
 
namespace n
{
  class b {};
}

The output from our plugin looks like this:

function_decl f at test.cxx:6
namespace_decl n at test.cxx:2
type_decl b at test.cxx:10
type_decl a at test.cxx:3

What happens is GCC merges all namespace declarations for the same namespace into a single AST node.

If you think about what GCC does with the AST, this organization is not really surprising. In the end, all GCC cares about are function bodies for which it needs to generate machine code. And for that the order of declarations is not important. However, if you are going to produce any kind of human-readable information from the AST, then you will probably want this information to be in the declaration order as found in the source code.

There is a way to iterate over declarations in the source code order; however, it requires a bit of extra effort. In a nutshell, the idea is to first collect all the declarations, then sort them according to the source code order, and finally traverse that sorted list of declarations. But how can we sort the declarations according to the source code order? We have seen how to get the file name and line information for a declaration; however, we cannot compare this information without complete knowledge of the #include hierarchy. To make this work we need to understand how GCC tracks location information in the AST.

Storing file/line/column information with each tree node would require too much memory so instead GCC stores an instance of the location_t type (currently defined as unsigned int) in tree nodes. The location_t values consist of three bit-fields: the index into the line map, line offset, and column number. The line map stores entries that represent continuous file fragments, that is, file fragments that are not interrupted by #include directives. Line map entries contain information such as the file name and start line position. Using the location_t value one can look up the line map entry and get the file name, line number (start line plus offset) and column number. One property of the location_t values that we are going to exploit is that values for locations further down in the translation unit have greater values. As a result we can create the following container that will automatically keep declarations that we insert into it in the source code order:

struct decl_comparator
{
  bool
  operator() (tree x, tree y) const
  {
    location_t xl (DECL_SOURCE_LOCATION (x));
    location_t yl (DECL_SOURCE_LOCATION (y));
 
    return xl < yl;
  }
};
 
typedef std::multiset<tree, decl_comparator> decl_set;
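If you want to experiment with this ordering idea outside of GCC, the following standalone sketch models it. Everything here is an assumption made for illustration: toy_location, toy_make_location, and toy_decl are hypothetical, and the bit widths are picked arbitrarily rather than taken from GCC. The only property that matters is the one we rely on above: a position further down in the translation unit packs to a greater value.

```cpp
// Toy model only: the bit widths and helper names below are assumptions
// made for illustration and do not reflect GCC's actual encoding.
#include <cassert>
#include <set>
#include <string>
#include <vector>

typedef unsigned int toy_location;

// Pack (line map index, line offset, column) so that later positions
// compare greater: map index in the high bits, then 13 bits of line
// offset, then 7 bits of column.
inline toy_location
toy_make_location (unsigned map, unsigned line, unsigned col)
{
  return (map << 20) | ((line & 0x1fffu) << 7) | (col & 0x7fu);
}

struct toy_decl
{
  std::string name;
  toy_location loc;
};

struct toy_decl_comparator
{
  bool
  operator() (const toy_decl* x, const toy_decl* y) const
  {
    return x->loc < y->loc;
  }
};

// A multiset keeps every inserted declaration, even if two of them
// happen to compare equal; a set would drop the second one.
typedef std::multiset<const toy_decl*, toy_decl_comparator> toy_decl_set;

// Return the declaration names in the set's iteration order.
inline std::vector<std::string>
toy_names (const toy_decl_set& s)
{
  std::vector<std::string> r;

  for (toy_decl_set::const_iterator i (s.begin ()); i != s.end (); ++i)
    r.push_back ((*i)->name);

  return r;
}
```

Inserting toy declarations in an arbitrary order and then calling toy_names() yields them sorted by position, which is the behavior the real decl_set relies on.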

Now we can implement the collect() function which adds all the declarations into the set:

void
collect (tree ns, decl_set& set)
{
  tree decl;
  cp_binding_level* level (NAMESPACE_LEVEL (ns));
 
  // Collect declarations.
  //
  for (decl = level->names;
       decl != 0;
       decl = TREE_CHAIN (decl))
  {
    if (DECL_IS_BUILTIN (decl))
      continue;
 
    set.insert (decl);
  }
 
  // Traverse namespaces.
  //
  for (decl = level->namespaces;
       decl != 0;
       decl = TREE_CHAIN (decl))
  {
    if (DECL_IS_BUILTIN (decl))
      continue;
 
    collect (decl, set);
  }
}

The new traverse() implementation will then look like this:

void
traverse (tree ns)
{
  decl_set set;
  collect (ns, set);
 
  for (decl_set::iterator i (set.begin ()),
       e (set.end ()); i != e; ++i)
  {
    print_decl (*i);
  }
}

If we now run this new implementation of our plugin on the C++ fragment presented earlier, we will get the following output:

function_decl f at test.cxx:1
type_decl c at test.cxx:5
type_decl t at test.cxx:8
var_decl v at test.cxx:9

Note that now we don’t track namespace declaration nodes since they are merged into one anyway. If you need to recreate the original namespace hierarchy, the best approach is to use the namespace information that can be inferred from declaration nodes using the CP_DECL_CONTEXT macro. For example, the following function returns the namespace name for a declaration:

std::string
decl_namespace (tree decl)
{
  std::string s, tmp;
 
  for (tree scope (CP_DECL_CONTEXT (decl));
       scope != global_namespace;
       scope = CP_DECL_CONTEXT (scope))
  {
    tree id (DECL_NAME (scope));
 
    tmp = "::";
    tmp += (id != 0
            ? IDENTIFIER_POINTER (id)
            : "<unnamed>");
    tmp += s;
    s.swap (tmp);
  }
 
  return s;
}
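The string-building logic in decl_namespace() can be tried in isolation with the following standalone sketch. It is a toy model: toy_scope and toy_scope_name are hypothetical names, a parent pointer stands in for CP_DECL_CONTEXT, and a scope without a parent plays the role of global_namespace.

```cpp
// Toy model only: toy_scope stands in for a namespace declaration node.
#include <cassert>
#include <string>

struct toy_scope
{
  const char* name;        // 0 for an unnamed namespace.
  const toy_scope* parent; // 0 for the global namespace.
};

// Build "::outer::inner" by walking from the innermost scope outwards
// and prepending each name, stopping at the global namespace.
inline std::string
toy_scope_name (const toy_scope* scope)
{
  std::string s;

  for (; scope != 0 && scope->parent != 0; scope = scope->parent)
  {
    std::string tmp ("::");
    tmp += (scope->name != 0 ? scope->name : "<unnamed>");
    tmp += s;
    s.swap (tmp);
  }

  return s;
}
```

As in decl_namespace(), a declaration that lives directly in the global namespace gets an empty string, and each enclosing namespace is prepended so that the outermost one ends up first.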

And that’s it for today. If you have any questions or comments, you are welcome to leave them below. The complete source code for the plugin we have developed in this post is available as the plugin-2.cxx file (it is fun to try to run it on some real C++ source files). In the next post we will talk about types (*_TYPE tree codes) and in particular how to traverse classes.

8 Responses to “Parsing C++ with GCC plugins, Part 2”

  1. Sebastien Binet Says:

    thanks a lot

  2. Philip Craig Says:

    It’s great to see the examples and the downloadable .cxx file

    Any chance you could also make available a script or makefile that will a) build it and b) given an instance of g++-4.5 will run it on something?

  3. Boris Kolpackov Says:

    Philip,

    I showed how to build the plugin and how to run it on a C++ file in the previous post. Here are the command lines for your reference:

    $ g++-4.5 -I`g++-4.5 -print-file-name=plugin`/include -fPIC -shared plugin.cxx -o plugin.so
    $ g++-4.5 -S -fplugin=./plugin.so test.cxx

    Also the GCC Internals documentation includes a sample makefile for building plugins (scroll all the way down):

    http://gcc.gnu.org/onlinedocs/gcc-4.5.0/gccint/Plugins.html#Plugins

  4. Yosh Says:

    Thx again

  5. Dmitriy V'jukov Says:

    Thank you!

  6. yoco Says:

    Thank you for the great job.

  7. Ben Says:

    Good job and thanks for the tutorial

    Where do you get the base SDK for making GCC plugins? It would be cool to make a plugin to modify the c++ language. One cool thing is to allow declaring of functions for a class/struct outside of it, without modifying the original source files.

  8. Boris Kolpackov Says:

    Ben,

    The plugin headers are installed along with GCC if plugin support was enabled during the GCC configuration (--enable-plugin configure option). To find out where these headers are, you can do:

    g++-4.5 -print-file-name=plugin