Saturday, August 1, 2009

A guide to pork, part 3

In the previous installment, I covered details on the patcher API and the basics of the Elsa AST API. If I were to write this as a reference manual, this is the point at which I would be spewing out a lot of verbose information on the C++ AST which would be hard to use for an introductory patch. One goal of this series of articles is to produce something close to a reference manual, so the parts will be numbered in that order. Instead, I will present the information in an order more likely to facilitate understanding.

In summary:
Step 1: Building and running your tool
Step 1.1: Running the patcher
Step 2: Using the patcher
Step 3: The structure of the Elsa AST

Aside 1: An introduction to porky

Having had more time to evaluate the new porky wrappers since my announcement of its commit last week, let me introduce you to it briefly. The tool takes in a list of rewrite specifications such as type PRCondVar** => mozilla::CondVar**, which will automatically change everything of the first type to the second type. It can also transform method calls into other method calls or even new/delete expressions. I would go into more detail, but that's for Chris Jones to discuss. If he ever blogs about it.

Step 3 (continued):

The following is an image representation of the C++ AST, excluding specific Expression and Statement subclasses:

To examine AST examples in detail, let us consider the example of a typical hello world program in C++:


#include <iostream>
 
int main() {
  std::cout << "Hello World!" << std::endl;
  return 0;
}

After being preprocessed, this small program becomes 19,153 lines of C++ goodness, thanks to the many header files being recursively included. Although we only define a single function, g++ defines for us namespaces (with GCC-specific attributes), templates, template specialization, typedefs, structs, classes, unions, function declarations, and half of the other C++ language features.

Let us look at each of these in turn.

Step 3.1.3: Function (Function definitions)

The Function AST node represents the definition of a function. It contains these variables:

DeclFlags dflags
TypeSpecifier retspec
Declarator nameAndParams
S_compound /* (Statement) */ body
FakeList<MemberInit> inits
FakeList<Handler> handlers

Unlike most AST nodes, the Function node does not have a SourceLoc loc member; instead, it has a getLoc() method which returns the location of the nameAndParams member, which would represent the location of the name. In our example, this would be line 3, column 5 (the beginning of `main').

Most of the members are self-explanatory: they represent, in order, the declaration flags (such as static or inline), the type of the return, the name and parameters aglomeration, and the statements of the body. The last two will be less common and NULL in the example I have here. The inits member represents the intializers of a constructor, while the handlers represents try-catch blocks that are scoped for the entire method. An example of such a block:


class Base {
  Base(const char *name)
    try : data(name) { } catch (...) { std::cerr << "Oops!" << std::endl; }
  const std::string data;
};

Strictly speaking, the enclosing try-catch is not limited to constructors, although I doubt it would be used outside of them except for stress-testing compilers.

There is little that you will want to do with these objects that does not fall under the use of other objects. Probably the most burning question you have is how to get the name of the function. These are the shortest ways: func->nameAndParams->var->fullyQualifiedName0().c_str() (going to a const char *) and func->nameAndParams->decl->getDeclaratorId()->toString() (going to an sm::string). The latter form will probably be more helpful if you are looking for specific methods, since it overrides the == operator.

Step 3.4.5: DeclFlags (declaration modifiers)

DeclFlags is an enum that specifiers certain flags about declarations. Each of the values is in the form DF_, always uppercase. The standard declarations--auto, register, static, extern, mutable, inline, virtual, explicit, friend, and typedef--have values in this manner. Other flags in the enum have uses for the Variable constructs, which I will not go into detail now.

Step 3.1.6: TypeSpecifier (first half of declarations)

A TypeSpecifier node represents a declaration of type, such as char. If you recall your C/C++ syntax, the declaration char *x, y; declares only one variable as a pointer--the * is matched with the variable name and not the type itself; in the AST therefore, TypeSpecifier does not receive these pointers (there come about in Declarator nodes).

By themselves, these nodes only have a CVFlags cv variable (representing the const- and volatile-ness of the type), as well as a SourceLoc loc location member. Instead, it has five subclasses with more specific attributes.

TS_name, TS_elaborated, TS_classSpec, and TS_enumSpec will be discussed in a future part.

TS_simple nodes represent the built-in simple types, like char. It has a single additional member, a SimpleTypeId id member.

Step 3.1.11: Declarator (the other half of declarations)

A Declarator node represents the non-type part of a declaration. It contains these variables and methods:

IDeclarator *decl
Initializer *init
PQName *getDeclaratorId() /*(And a const version)*/

Step 3.1.12: IDeclarator (the real declaration part)

The IDeclarator node does the heavy work in terms of annotating declarations. In some corner cases, the AST for a declarator can get very deep; the parallel type structure makes working with declarations much easier.

With the exception of the D_name and D_bitfield subclasses, all subclasses contain at least an IDeclarator *base member. In addition, IDeclarator holds these variables and methods:

SourceLoc loc
PQName *getDeclaratorId() /*(And a const version)*/
IDeclarator *getBase() /*(And a const version)*/
IDeclarator *skipGroups()
bool bottomIsDfunc()
D_func getD_func()

The skipGroups method skips through any excess groupings (i.e., parentheses layers). getBase returns either the base of the declaration or NULL if it is a leaf. getD_func returns the bottom-most D_func node, while bottomIsDfunc will tell you if the declaration is a function (but not a function pointer). getDeclaratorId obviously returns the name of the object at the very base of the declaration tree.

The first subclass is D_name, which is the typical leaf of the declaration. It has a single PQName *name attribute, which returns the name of the variable or function being declared. The other leaf class is D_bitfield, which has both the name attribute as well as an Expression *bits representing the number of bits in the declaration.

D_pointer and D_reference represent a pointer or reference indirection, respectively; D_pointer additionally has a CVFlags cv variable representing the qualifications of its pointer type.

D_ptrToMember is similar to D_pointer, but it adds another PQName *nestedName attribute to represent the construct whose member the pointer is pointing to. For those who don't recognize this feature, it can be demonstrated in this C++ snippet:


struct Foo { int Bar() { return 0; } };
typedef int (Foo::*ptrToMember)();
// The declarator tree (following base):
// D_func->D_grouping->D_ptrToMember->D_name
ptrToMember p = &Foo::Bar;

D_array represents an array declaration. In addition to its base, it also has the possibly null Expression *size member to represent the size of the array. Multidimensional arrays have these members arranged from the outside in.

D_grouping is a dummy node used mostly for the purposes of AST disambiguation during the parsing phase (it represents the use of parentheses in declarations). The skipGroups function can be used to pass these nodes.

D_func is necessarily the most complex declarator. Its base is typically the name of the function, although function pointers can have nested declarators. It has a FakeList<ASTTypeId> *params member for the parameters, CVFlags cv member for the const member functions, and ExceptionSpec *exnSpec for the exception specifiers.

Predicting declarator trees is not difficult. In general, you can apply standard rules to find the declarator: each * and & at the beginning creates the respective declarator indirection node; a [] or () at the end creates the array and function nodes, respectively; surrounding with paranthesis yields a grouping node. Pointers to members are created when the syntax is used (look for the ::*), and the choice between bitfield and name as the leaf comes from obvious decisions. One also needs to remember that the structures on the left are parsed before the ones on the right, unless overriden by parentheses, so the obscene int (*(*asdf)())[0] is yielded as array, grouping, pointer, function, grouping, pointer, name. If you're wondering, it's a zero-sized array named "asdf" that contains pointers to functions that return poiinters to integers.

There is much more to talk about in the world of ASTs. This will get you started on function bodies and declarations; the next part of the guide will cover more in-depth knowledge on declarations and begin to introduce classes.

1 comment:

Anonymous said...

You are effectively giving an high-level view of the software architecture in contrast to a lexicographic reference view. I appreciate your effort very much! Will you preserve this view also in the final document(s)? For instance, something like a reading trail? It would be sad if your effort of producing introductory information would get lost. A reference is good for, well, referencing -- but never for getting into a subject and learning (grabbing) it.