Parser

The parser takes the token soup produced by the tokenizer, performs some further simple transforms (for instance, replacing the general identifier `i` with tokens corresponding to specific C keywords) and produces a parse tree using the simpleparse module and a slightly simplified C grammar.

The parser adds several more tokens to the soup; they are all either uppercase letters or symbols:

Symbol	Description
`A`	GCC-like attribute (`__attribute__`), ignored together with all arguments
`B`	Built-in type (`char`, `int`, `float`, `double`), these can be modified by modifiers
`C`	Const keyword (`const`, `G_CONST_RETURN`, `__const`)
`D`	Ignored decorator (`G_GNUC_NO_INSTRUMENT`, `G_GNUC_MALLOC`, …)
`E`	Enum keyword (`enum`)
`H`	Decorator passed through, i.e. will be shown in the documentation
`I`	Ignorable (`G_BEGIN_DECLS`, …), treated similarly to comments
`M`	Type storage class (`static`, `extern`, `volatile`, …)
`P`	Prefix type modifier (`long`, `short`, `signed`, `unsigned`), of built-in types
`R`	Restrict keyword (`restrict`, `__restrict`)
`S`	Struct keyword (`struct`)
`T`	Typedef keyword (`typedef`)
`U`	Union keyword (`union`)
`X`	Suffix type modifier (`complex`), of built-in types
`=*&`	Direct characters (formerly `b` or `m`), each represents itself

Continuing the tokenizer example, the code

typedef int (*foo)(const char *a[],
                   /* Anonymous structs like this are not very useful */
                   struct { long p; double complex q; } b,
                   GtkWidget* (*c)(void*));

is transformed to

TB(*i)(CB*i[],S{Pi;BXi;}i,i*(*i)(i*));

before parsing.
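
The transformation can be imitated in a few lines of Python. The following is only a minimal sketch, not yagdoc's actual code: the names `KEYWORDS`, `COMMENT`, `LEXEME` and `to_soup` are hypothetical, the table covers only the keywords listed above, and the tokenizer and parser stages are collapsed into a single pass.

```python
import re

# Hypothetical keyword table, transcribed from the symbol table above.
KEYWORDS = {
    "char": "B", "int": "B", "float": "B", "double": "B",
    "const": "C", "G_CONST_RETURN": "C", "__const": "C",
    "enum": "E", "struct": "S", "typedef": "T", "union": "U",
    "static": "M", "extern": "M", "volatile": "M",
    "long": "P", "short": "P", "signed": "P", "unsigned": "P",
    "restrict": "R", "__restrict": "R",
    "complex": "X",
}

COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
# An identifier, or any single non-blank character.
LEXEME = re.compile(r"[A-Za-z_]\w*|\S")


def to_soup(code):
    """Turn a C declaration into token soup (unknown identifiers become `i`)."""
    soup = []
    for lexeme in LEXEME.findall(COMMENT.sub(" ", code)):
        if lexeme[0].isalpha() or lexeme[0] == "_":
            soup.append(KEYWORDS.get(lexeme, "i"))
        else:
            soup.append(lexeme)  # direct characters represent themselves
    return "".join(soup)


print(to_soup("extern const char *name;"))  # prints MCB*i;
```

Applied to the typedef above, this sketch yields the same soup as shown. Anything beyond that example (numbers, string literals, preprocessor lines) is outside its scope.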

Function bodies and non-atomic initializers are just skimmed: anything that has correctly grouped { and } passes.

Similarly, __attribute__ can contain all kinds of rubbish provided that it has correctly grouped ( and ).
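
Skimming of this kind amounts to counting nesting depth. A minimal sketch, assuming the soup is available as a plain string; the helper name `skip_group` is hypothetical and not part of yagdoc:

```python
def skip_group(soup, start, opener="{", closer="}"):
    """Return the index just past the group that opens at soup[start]."""
    if soup[start] != opener:
        raise ValueError("no group starts at position %d" % start)
    depth = 0
    for pos in range(start, len(soup)):
        if soup[pos] == opener:
            depth += 1
        elif soup[pos] == closer:
            depth -= 1
            if depth == 0:
                return pos + 1  # everything in between was ignored
    raise ValueError("group opened at position %d is never closed" % start)


# A function body: nested braces are fine as long as they pair up.
print(skip_group("{a{b}c}d", 0))  # prints 7

# The same helper handles attribute arguments when given parentheses.
print(skip_group("A((i),(i))i;", 1, "(", ")"))  # prints 10
```

The contents of the group are never inspected, which is why arbitrary rubbish passes as long as the delimiters pair up.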

The implicit int rule works for modified types such as unsigned, but not when the only type information is a qualifier or storage class. If you write

volatile i;
const foo();

you will get errors – and frankly, you deserve them.
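
In terms of the symbols from the table above, the rule can be pictured as a check on the specifiers that precede the declared name. The sketch below is an illustration only: how yagdoc actually enforces the rule is not shown here, and `TYPE_SYMBOLS` and `has_type_information` are hypothetical names.

```python
# Symbols that carry type information on their own: built-in types (B),
# prefix modifiers such as unsigned (P) and named types (i).
TYPE_SYMBOLS = set("BPi")


def has_type_information(specifiers):
    """Check the symbols that precede the declared name."""
    return any(symbol in TYPE_SYMBOLS for symbol in specifiers)


print(has_type_information("P"))   # unsigned x;       -> True
print(has_type_information("M"))   # volatile i;       -> False
print(has_type_information("C"))   # const foo();      -> False
print(has_type_information("CB"))  # const char name;  -> True
```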

C does not allow new symbols to be introduced; all syntax changes are done with macros. It is assumed that, for the purpose of declaration scanning, everything we need to recognize as a special token has one of the following forms:

word
word(...)

This is then transformed to a specified token (see method add_keywords).

Implementation: The soup following word is currently specified as a regular expression, which means parentheses are not assumed and, in addition, have to be quoted. This perhaps needs a rethink. Also, the current code assumes that a single token is produced; this is relatively easy to change if required.
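
To make the quoting issue concrete, here is a standalone sketch using plain source text and Python's `re` module. It does not reproduce the interface of add_keywords, which is not documented here; the function `replace_keyword` and its parameters are hypothetical.

```python
import re


def replace_keyword(text, word, tail, token):
    """Replace `word`, followed by text matching the regex `tail`, with `token`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b" + tail)
    return pattern.sub(token, text)


# Form "word": nothing follows, so the tail expression is empty.
print(replace_keyword("void *f(void) G_GNUC_MALLOC;",
                      "G_GNUC_MALLOC", "", "D"))
# prints: void *f(void) D;

# Form "word(...)": the parentheses belong to the regular expression
# and therefore have to be quoted with backslashes.
print(replace_keyword("void log(const char *fmt, ...) G_GNUC_PRINTF(1, 2);",
                      "G_GNUC_PRINTF", r"\([^()]*\)", "D"))
# prints: void log(const char *fmt, ...) D;
```

Both calls produce exactly one token, mirroring the single-token assumption mentioned above.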

The following top-level declarations are recognized/produced:

typedef – type definition (of any kind, plain, function, struct, …)
```
typedef unsigned int guint;
```
enum_def – plain enum definition
```
enum {
    ONE_PLUS_ONE = 3
};
```
struct_def – plain structure or union definition
```
struct S {
    int a;
    double b;
};
```
variable_def – variable declaration/definition
```
extern int major_version;
```
prototype – function prototype
```
void*
foo(int a, double b);
```

inline_function – any function with a body
```
void*
foo(int a, double b)
{
    return memcpy(malloc(sizeof(double) + a), &b, sizeof(double));
}
```
enum_forward – forward enum declaration
```
enum E;
```
struct_forward – structure or union forward declaration
```
struct S;
```

Non-pointer style typedefs of user functions

typedef void gdk_fb_draw_drawable_func (GdkDrawable *drawable,
                                        GdkGC       *gc,
                                        GdkPixmap   *src,
                                        GdkFBDrawingContext *dc,
                                        gint         start_y,
                                        gint         end_y,
                                        gint         start_x,
                                        gint         end_x,
                                        gint         src_x_off,
                                        gint         src_y_off,
                                        gint         draw_direction);

are not recognized. This should be easy to fix (as well as higher-level pointer typedefs).

Only typedefs defining a single name are recognized. For instance,

typedef struct _LineFace {
    double  xa, ya;
    int     dx, dy;
    int     x, y;
    double  k;
} LineFaceRec, *LineFacePtr;

fails.

Generally it is not possible to combine declarations as in

int a, foo(void);

The only recognized construct of this type is a multiple variable declaration (also in the form of multiple struct fields, but again function and non-function types cannot be mixed):

int a, *b, c[4];

The primary difficulty with combined declarations is not parsing them but documenting them in a sensible manner. It would probably be necessary to split and reformat them into separate declarations in yagdoc, which makes it quite reasonable to require users to write them as separate declarations in the first place.
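
For simple variable declarations, the splitting itself would be straightforward, as the following sketch suggests. It is not part of yagdoc; `split_declaration` is a hypothetical helper, and it assumes no initializers, function declarators or other constructs that contain commas of their own.

```python
import re


def split_declaration(declaration):
    """Split `int a, *b, c[4];` into one declaration per variable."""
    body = declaration.strip().rstrip(";")
    first, *rest = [part.strip() for part in body.split(",")]
    # Leading words followed by whitespace form the shared base type;
    # whatever remains is the first declarator.
    match = re.match(r"((?:\w+\s+)+)(.*)", first)
    if match is None:
        raise ValueError("cannot find a base type in %r" % first)
    base = match.group(1).strip()
    declarators = [match.group(2)] + rest
    return ["%s %s;" % (base, declarator) for declarator in declarators]


print(split_declaration("int a, *b, c[4];"))
# prints: ['int a;', 'int *b;', 'int c[4];']
```

Deciding how the documentation comments attached to the original declaration should be distributed among the results is the harder part, and the sketch does not address it.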

In certain places where C permits anonymous types, the parser requires an identifier. For instance, the function prototype

int atexit (void (*)(void));

is not recognized (most human programmers would have a hard time parsing this too). Since documentation requires identifiable parts that can be referred to, and these declarations lack them, they are not expected to appear in documented interfaces anyway. As internal or auxiliary declarations, they can be shielded from yagdoc if necessary using YAGDOC_FOR_PRESIDENT.

Functions returning function pointers and other oddities are not recognized – unless the function type is a named, typedefed type, of course. If you have no difficulty understanding

int *(*f())(int *(*())());

I admire you. Note, however, that most readers of the documentation would have difficulties.

Yagdoc/gtk-doc directives are syntactic, i.e. they are not allowed where they do not belong. A directive such as /*< private >*/ represents the C++ keyword private and thus can appear only where the keyword could.

After an error, recovery is performed by trying again from the next token that starts a line. This strategy works well unless top-level declarations in your headers are indented – in that case a lot more than necessary will be skipped.
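
The recovery strategy can be sketched as a search for the next token in column zero. The token representation used here, a list of `(text, column)` pairs, is an assumption made for the illustration, and `recovery_point` is a hypothetical name rather than a yagdoc function.

```python
def recovery_point(tokens, error_index):
    """Return the index of the first token after the error that starts a line.

    `tokens` is a list of (text, column) pairs; column 0 means the token
    sits at the very start of its line. Returns None if there is none.
    """
    for index in range(error_index + 1, len(tokens)):
        if tokens[index][1] == 0:
            return index
    return None


# Unindented header: parsing resumes at the very next declaration.
flat = [("int", 0), ("?", 4), ("int", 0), ("x", 4), (";", 5)]
print(recovery_point(flat, 1))  # prints 2

# Indented declarations are skipped until something starts in column 0.
indented = [("int", 0), ("?", 4), ("int", 4), ("x", 8), (";", 9), ("int", 0)]
print(recovery_point(indented, 1))  # prints 5
```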

The unparsable sequence is included in the error message in the form of token soup (see above).