The parser takes the token soup produced by the tokenizer, performs some further simple transforms (for instance, replacing the general identifier i with tokens corresponding to specific C keywords) and produces a parse tree using the simpleparse module and a slightly simplified C grammar.
The parser adds several more tokens to the soup; all of them are either uppercase letters or symbols:
Symbol | Description |
---|---|
A | GCC-like attribute (__attribute__), ignored together with all arguments |
B | Built-in type (char, int, float, double), these can be modified by modifiers |
C | Const keyword (const, G_CONST_RETURN, __const) |
D | Ignored decorator (G_GNUC_NO_INSTRUMENT, G_GNUC_MALLOC, …) |
E | Enum keyword (enum) |
H | Decorator passed through, i.e. will be shown in the documentation |
I | Ignorable (G_BEGIN_DECLS, …), treated similarly to comments |
M | Type storage class (static, extern, volatile, …) |
P | Prefix type modifier (long, short, signed, unsigned), of built-in types |
R | Restrict keyword (restrict, __restrict) |
S | Struct keyword (struct) |
T | Typedef keyword (typedef) |
U | Union keyword (union) |
X | Suffix type modifier (complex), of built-in types |
=*& | Direct character (formerly b or m), each represents itself |
Continuing the tokenizer example, the code
typedef int (*foo)(const char *a[], /* Anonymous structs like this are not very useful */ struct { long p; double complex q; } b, GtkWidget* (*c)(void*));
is transformed to
TB(*i)(CB*i[],S{Pi;BXi;}i,i*(*i)(i*));
before parsing.
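The keyword substitution can be illustrated with a rough Python sketch. This is not yagdoc's code: the names KEYWORD_TOKENS and to_soup are hypothetical, the mapping covers only part of the table above, and the sketch works directly on C source text instead of on tokenizer output. It also ignores numbers, string literals and preprocessor lines, which the real tokenizer has to handle.

```python
import re

# Subset of the keyword-to-token table above (hypothetical helper data).
KEYWORD_TOKENS = {
    "typedef": "T", "struct": "S", "union": "U", "enum": "E",
    "const": "C", "__const": "C", "G_CONST_RETURN": "C",
    "char": "B", "int": "B", "float": "B", "double": "B",
    "long": "P", "short": "P", "signed": "P", "unsigned": "P",
    "complex": "X",
    "restrict": "R", "__restrict": "R",
    "static": "M", "extern": "M", "volatile": "M",
}

COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def to_soup(code):
    """Turn a C declaration into a soup-like string (illustration only)."""
    # Comments carry no syntax, so drop them first.
    code = COMMENT_RE.sub(" ", code)
    # Known keywords become their uppercase token, everything else is 'i'.
    code = IDENT_RE.sub(lambda m: KEYWORD_TOKENS.get(m.group(0), "i"), code)
    # Whitespace is not significant in the soup.
    return re.sub(r"\s+", "", code)
```

Under these assumptions, passing the typedef above to to_soup reproduces the soup shown. Note that void maps to i here because only char, int, float and double are listed as built-in types.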
Function bodies and non-atomic initializers are just skimmed, anything that has correctly grouped { and } passes.
Similarly, __attribute__ can contain all kinds of rubbish provided that it has correctly grouped ( and ).
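Skimming a correctly grouped region amounts to counting nesting depth. The following is a minimal sketch of that idea, assuming the soup is a plain string in which braces and parentheses only ever appear as real grouping characters; skim_group is a hypothetical helper, not a yagdoc function.

```python
def skim_group(soup, start, opener="{", closer="}"):
    """Return the index just past the group that opens at soup[start].

    Nothing inside the group is inspected except the grouping characters.
    """
    if soup[start] != opener:
        raise ValueError("expected %r at position %d" % (opener, start))
    depth = 0
    for pos in range(start, len(soup)):
        char = soup[pos]
        if char == opener:
            depth += 1
        elif char == closer:
            depth -= 1
            if depth == 0:
                # The group opened at 'start' is closed here.
                return pos + 1
    raise ValueError("unbalanced %r starting at position %d" % (opener, start))
```

The same function covers both cases: the defaults skim a function body, and passing "(" and ")" skims the arguments of an attribute.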
The implicit int rule works for modified types such as unsigned, but not when the only type information is a qualifier or storage class. If you write
volatile i; const foo();
you will get errors – and frankly, you deserve them.
C does not allow new symbols to be introduced; all syntax changes are done with macros. It is assumed that, for the purpose of declaration scanning, everything we need to recognize as a special token has one of the following forms:
word
word(...)
This is then transformed to a specified token (see method add_keywords).
Implementation: The soup following word is actually specified as a regular expression now, which means parentheses are not assumed and, in addition, have to be quoted. This perhaps needs a rethink. Also, the current code assumes a single token is produced; this is relatively easy to change if required.
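To make the regular-expression point concrete, here is a small sketch of such a substitution. It does not reproduce the add_keywords method, whose signature is not described here: apply_keywords and the (word, following, token) tuples are hypothetical, and for readability the sketch matches C source text where the real parser matches the soup.

```python
import re


def apply_keywords(text, keywords):
    """Replace each 'word' or 'word(...)' form with its single token.

    keywords is a list of (word, following, token) tuples, where
    'following' is a regular expression for what comes after the word.
    """
    for word, following, token in keywords:
        pattern = re.compile(r"\b" + re.escape(word) + r"\b" + following)
        text = pattern.sub(token, text)
    return text


# Hypothetical registrations; both map to the ignored decorator token D.
KEYWORDS = [
    # Bare word: nothing is expected to follow.
    ("G_GNUC_MALLOC", r"", "D"),
    # word(...): the parentheses are part of the regular expression
    # and therefore have to be quoted with backslashes.
    ("G_GNUC_PRINTF", r"\s*\([^()]*\)", "D"),
]
```

For example, apply_keywords("void g(const char *fmt, ...) G_GNUC_PRINTF(1, 2);", KEYWORDS) replaces the whole macro call, arguments included, with the single token D.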
The following top-level declarations are recognized/produced:
typedef – type definition (of any kind: plain, function, struct, …)
typedef unsigned int guint;
enum_def – plain enum definition
enum { ONE_PLUS_ONE = 3 };
struct_def – plain structure or union definition
struct S { int a; double b; };
variable_def – variable declaration/definition
extern int major_version;
prototype – function prototype
void* foo(int a, double b);
inline_function – any function with a body
void* foo(int a, double b) { return memcpy(malloc(sizeof(double) + a), &b, sizeof(double)); }
enum_forward – forward enum declaration
enum E;
struct_forward – structure or union forward declaration
struct S;
Non-pointer style typedefs of user functions
typedef void gdk_fb_draw_drawable_func (GdkDrawable *drawable, GdkGC *gc, GdkPixmap *src, GdkFBDrawingContext *dc, gint start_y, gint end_y, gint start_x, gint end_x, gint src_x_off, gint src_y_off, gint draw_direction);
are not recognized. This should be easy to fix, as should higher-level pointer typedefs.
Only typedefs defining a single name are recognized, for instance
typedef struct _LineFace { double xa, ya; int dx, dy; int x, y; double k; } LineFaceRec, *LineFacePtr;
fails.
Generally it is not possible to combine declarations as in
int a, foo(void);
The only recognized construct of this type is a multiple variable declaration (also in the form of multiple struct fields, but again function and non-function types cannot be mixed):
int a, *b, c[4];
The primary difficulty with combined declarations is not parsing them but documenting them in a sensible manner. It would probably be necessary to split and reformat them into separate declarations in yagdoc, which makes it quite reasonable to require users to write them as separate declarations in the first place.
In certain places where C permits anonymous types, an identifier is required. For instance, the function prototype
int atexit (void (*)(void));
is not recognized (most human programmers would have a hard time parsing this too). Since documentation requires identifiable parts that can be referred to, and these declarations lack them, they are not expected to appear in documented interfaces anyway. As internal or auxiliary declarations, they can be shielded from yagdoc if necessary using YAGDOC_FOR_PRESIDENT.
Functions returning function pointers and other oddities are not recognized – unless the function type is a named, typedef'ed type, of course. If you have no difficulty understanding
int *(*f())(int *(*())());
I admire you. Note, however, that most readers of the documentation would have difficulties.
Yagdoc/gtk-doc directives are syntactical, i.e. they are not allowed where they do not belong. A directive such as /*< private >*/ represents the C++ keyword private and thus it can appear only where the keyword could.
After an error, recovery is performed by trying again from the next token that starts at the beginning of a line. This is a successful strategy unless the top-level declarations in your headers are indented – in that case a lot more than necessary will be skipped.
The unparsable sequence is included in the error message; see the token soup above.
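The restart rule used for error recovery can be sketched as follows. This is only an illustration under the assumption that positions are offsets into the header text; next_restart is a hypothetical name, and the real parser works with soup positions mapped back to source lines.

```python
def next_restart(source, error_pos):
    """Return the offset of the next line that starts with a token.

    Lines that begin with whitespace (or are empty) are skipped, which
    is why indented top-level declarations make recovery skip too much.
    """
    pos = source.find("\n", error_pos)
    while pos != -1:
        start = pos + 1
        if start < len(source) and not source[start].isspace():
            return start
        pos = source.find("\n", start)
    # No suitable line is left, so give up at the end of the input.
    return len(source)
```

With the input "int a;\n  broken (\n  more junk\nint b;\n" and an error at broken, parsing resumes at int b; because both intervening lines are indented.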