Tokenizer

Tokenizer fulfils the function of tokenizer and preprocessor (with a crooked idea of what such a preprocessor does). It takes input lines and produces the preprocessed text, a list of code tokens and a list of defined macros. Tokens are represented by single characters, which creates a concise and relatively human-readable representation (see “token soup” below) that is subsequently fed to the parser.


The token soup produced by the tokenizer consists of lowercase characters and symbols:

Symbol      Description
a           Access operator (., ->, ::, …)
b           Binary operator (%, +=, <, ##, …)
c           Character literal ('a', '\'', …)
e           Ellipsis (...)
i           Identifier (GtkWidget, enum, …), includes all keywords
m           Ambiguous operator (+, -, *, &), binary or unary
n           Numeric literal (1.0e-7, 0X3F, …), all kinds
r           Increment/decrement operator (++, --)
s           String literal ("a", "\\\"", …)
u           Unary operator (!, ~)
y           Yagdoc directive (/*< private >*/, …)
[](){};,?:  Direct character, each represents itself

These token-representing characters occur in the output; a few more are used internally for preprocessor features such as digraphs or the leading parts of multiline tokens.

For instance, the code

typedef int (*foo)(const char *a[],
                   /* Anonymous structs like this are not very useful */
                   struct { long p; double complex q; } b,
                   GtkWidget* (*c)(void*));

gives the following token soup:

ii(mi)(iimi[],oi{ii;iii;}i,im(mi)(im));
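
For illustration, the classification itself could be sketched in Python roughly as follows. This is a minimal sketch, not the actual implementation: the regular expressions below are simplified assumptions that ignore comments, preprocessor directives and the trickier literal forms.

import re

# Simplified token classification, roughly following the table above.
# The order matters: longer operators must be tried before their prefixes.
TOKEN_PATTERNS = [
    ('e', r'\.\.\.'),                # ellipsis
    ('a', r'->|::|\.'),              # access operators
    ('r', r'\+\+|--'),               # increment/decrement
    ('n', r'(?:0[xX][0-9a-fA-F]+|\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)\w*'),
    ('i', r'[A-Za-z_$]\w*'),         # identifiers (and keywords)
    ('s', r'"(?:\\.|[^"\\])*"'),     # string literals
    ('c', r"'(?:\\.|[^'\\])*'"),     # character literals
    ('b', r'<<=|>>=|##|[-+*/%&|^!=<>]=|&&|\|\||<<|>>|[/%^|=<>]'),
    ('m', r'[-+*&]'),                # ambiguous operators
    ('u', r'[!~]'),                  # unary operators
    ('.', r'[\[\](){};,?:]'),        # direct characters
]
MASTER = re.compile('|'.join('(?P<t%d>%s)' % (i, p)
                             for i, (_, p) in enumerate(TOKEN_PATTERNS)))

def soup(source):
    """Reduce C source to a token soup string."""
    out = []
    for m in MASTER.finditer(source):
        kind = TOKEN_PATTERNS[int(m.lastgroup[1:])][0]
        out.append(m.group() if kind == '.' else kind)
    return ''.join(out)

print(soup('typedef int (*foo)(const char *a[]);'))   # ii(mi)(iimi[]);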

If the comment

/*<private_header>*/

is encountered, the file is considered private and tokenization aborts (and the attribute private_header is set to True). As with all gtk-doc/yagdoc directives, whitespace does not matter. However, yagdoc also ignores its own directives inside preprocessor-excluded code, and it ignores directives in multi-line comments. Therefore

#if 0
/*< private_header >*/
#endif
/*
   <private_header>
 */

will not make the header private.
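
In code driving the tokenizer, this could be checked along these lines (a hypothetical sketch: only the private_header attribute is documented here; the construction and entry point below are assumptions):

tokenizer = Tokenizer()
tokenizer.tokenize(open('foo.h').readlines())   # assumed entry point
if tokenizer.private_header:
    pass   # the header was marked /*<private_header>*/, skip it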


Implementation of the multiline-comment/preprocessor interaction should be correct. All kinds of crap found, for instance, in Boost headers are parsed. For example

# /*
#  * Copyright J. Random Hacker
#  */
/* */ # /*
         */ define /*
                    */ a /*
                          */ b

is the same as

#define a b
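
A minimal sketch of why this works: if every /* … */ comment is replaced by a single space before the directive lines are recognized, the spliced directive reassembles itself. (A hypothetical helper; the real implementation must also respect strings and line continuations.)

import re

def strip_block_comments(text):
    # Replace each /* ... */ comment, possibly spanning lines, by a space,
    # so that the remainder reads as ordinary preprocessor text.
    return re.sub(r'/\*.*?\*/', ' ', text, flags=re.DOTALL)

text = '/* */ # /*\n   */ define /*\n    */ a /*\n     */ b\n'
print(' '.join(strip_block_comments(text).split()))   # prints: # define a b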

Non-tokenizable input causes errors. This means if you do

#ifdef _GENERATE
... awk code to generate some file...
#endif

you will get errors, possibly lots of them, which is annoying although harmless. Shield the non-code from yagdoc by adding a !_GENERATE assumption (see Tokenizer.assume()) that will make the tokenizer skip the block and consequently suppress the errors there.
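
A hypothetical invocation (the exact signature of assume() is not documented here; the assumption-string form below is a guess based on the text above):

tokenizer = Tokenizer()
# Assume _GENERATE is undefined, so the #ifdef _GENERATE block is
# skipped and no tokenization errors are reported inside it.
tokenizer.assume('!_GENERATE')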

More precisely, tokenization errors are suppressed in code skipped due to conditionals and inside #error and #warning. So this will pass silently:

#error Don't do this!

but this will be awarded with an error:

#ifdef __GNUC__
...
#elif defined(_MSC_VER)
...
#else
Don't know what to do!
#endif

In addition, as the yagdoc preprocessor always takes all branches after an #elif, it is not possible to skip the #else part with a simple assumption. You are encouraged to use #error to induce errors.


Anything tokenizable after # is accepted and cheerfully ignored if not recognized:

#}rubbish!{

Non-ASCII characters in identifiers are not recognized ($ is accepted though). Actually, non-ASCII characters written as \unnnn or \Unnnnnnnn are not recognized anywhere. I have yet to see them used in public headers. In string literals this is probably good, as they should appear in this expanded form in the documentation too.

Implementation: can be done on demand; it requires starting to care about character sets.


Long character literals are recognized (up to four bytes); wide character literals are not. The character constant

L'a'

forms two tokens, the identifier L and the character literal 'a', which are then likely rejected by the parser (though not necessarily, as macro bodies, compound initializers and function bodies are not examined thoroughly).

Implementation: can be done on demand; not sure whether it is worth it. This area is quite confusing, especially when __STDC_ISO_10646__ cannot be assumed. And almost everyone uses UTF-8 instead of wide characters on Unix anyway…


Hexadecimal floating point numbers such as

# define HUGE_VALF (__extension__ 0x1.0p255f)

are not recognized. They are hardly used anywhere besides gcc's semi-internal headers.


Of the other number formats, the standard decimal, hexadecimal and octal integers and floating point numbers are recognized. Integers can have various queer suffixes such as ui64. Words of the form 1fh are also identified as hexadecimal numbers (they occur in unquoted assembler), but note that deadbeefh will be classified as an identifier, not a number, since it does not start with a digit. In most cases you will probably want to exclude the code containing this kind of stuff using the preprocessing features anyway.


Identifiers cannot start with a number. Of course they cannot, but sometimes identifier-like tokens starting with a number occur in preprocessor macros, attached to something sane with the ## operator.


Macro names must be identifiers. A few header files do things such as

#define net-snmp-config_multilib_redirection_h

I wonder whether this is valid – regardless, it is not accepted.


If all code (i.e. everything non-whitespace and non-comment) is protected against repeated inclusions with

#ifndef FOO_H
#define FOO_H
...
#endif

then the definition of FOO_H (the name does not matter) is excluded from the list of macros. If there is any code outside this wrapper, or if the definition of FOO_H is not the first code after the #ifndef FOO_H (or #if !defined(FOO_H)), then FOO_H is included normally.

This behaviour can be switched off by setting Tokenizer.exclude_ifndef_wrapper to False.
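
The detection logic described above could be sketched as follows (a hypothetical helper working on directive lines with comments already removed; the real tokenizer naturally works on its own token stream and tracks nested conditionals):

import re

def ifndef_wrapper_guard(lines):
    # Return the guard macro name if the whole file is wrapped in
    # #ifndef GUARD / #define GUARD ... #endif, otherwise None.
    code = [line.strip() for line in lines if line.strip()]
    if len(code) < 3:
        return None
    m = re.match(r'#\s*(?:ifndef\s+(\w+)|if\s+!defined\s*\(\s*(\w+)\s*\))',
                 code[0])
    if not m:
        return None
    guard = m.group(1) or m.group(2)
    # The guard definition must be the very first code after the #ifndef,
    # and the matching #endif must be the last code in the file.
    if not re.match(r'#\s*define\s+%s\b' % re.escape(guard), code[1]):
        return None
    if not re.match(r'#\s*endif\b', code[-1]):
        return None
    return guard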


Trigraphs are recognized and substituted everywhere, digraphs outside string and character literals, as specified by ISO C 89 and C 99.

Trigraphs, however, skew the reported column numbers in error messages. In my opinion, people actually using this “feature” deserve it.


Rudimentary preprocessing. By default, preprocessor conditionals are just ignored (with the exception of #if 0 and YAGDOC_FOR_PRESIDENT, see below). This means declarations from both (all) branches get to the output.

It is however possible to request inclusion/exclusion based on preprocessor conditionals (with the assume() method). In this case the following constructs are recognized:

#ifdef FOO
#ifndef FOO
#if defined(FOO)
#if !defined(FOO)
#if 0
#if !0
#else
#endif

Note that specifically #elif and more complex variants of #if are not recognized, and both branches are taken as usual. In particular, the presence of #elif inhibits any further evaluation: the initial #if-conditioned block is possibly excluded according to the condition, but all the other branches (whether #elif or #else) are always taken. If #elif is not present and the #if is one of the simple conditions above, then the #else branch is included/excluded as expected (the opposite of the #if branch).
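
The recognized conditions are simple enough that their evaluation can be sketched in a few lines (a hypothetical helper; assumptions is taken to be a set of strings such as 'FOO' and '!FOO', matching the assumption-string form guessed above):

import re

SIMPLE_COND = re.compile(
    r'#\s*if(?:(?P<neg1>n)?def\s+(?P<name1>\w+)'
    r'|\s+(?P<neg2>!)?\s*(?:defined\s*\(\s*(?P<name2>\w+)\s*\)|(?P<zero>0)))')

def evaluate(directive, assumptions):
    # Return True/False if the condition is decided, or None if the
    # directive is not one of the simple recognized forms (in which
    # case both branches are taken).
    m = SIMPLE_COND.match(directive)
    if not m:
        return None
    negated = bool(m.group('neg1') or m.group('neg2'))
    if m.group('zero'):
        return negated                  # '#if 0' is False, '#if !0' is True
    name = m.group('name1') or m.group('name2')
    if name in assumptions:
        return not negated
    if '!' + name in assumptions:
        return negated
    return None                         # no assumption either way

For instance, evaluate('#ifdef _GENERATE', {'!_GENERATE'}) returns False, so the block is skipped, while evaluate('#if FOO > 2', ...) returns None, so both branches are taken.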

This preprocessing is intentionally very simple and will be kept so. It should be powerful enough to shield some code (and sometimes untokenizable input) from yagdoc. On the other hand, a gradual partial implementation of real C preprocessor conditional expression evaluation would change the set of blocks ignored in input files with every version, which is very undesirable. A full implementation of conditionals would require nothing less than a full C preprocessor implementation and would not be worth it. Finally, a supposedly fixed partial but relatively powerful implementation would generate a stream of extension requests, which is also undesirable.


It is possible to mark special ranges in the input file with preprocessor directives (see the add_range_type() method). The logic is exactly the same as in the previous item. However, both branches are taken (unless the same conditions are present among the assumptions, of course); the tokenizer just notes down the ranges of tokens corresponding to these conditions.

This feature is used to implement “deprecated guards” of gtkdoc-scan.
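
A hypothetical setup for such a guard (add_range_type()'s actual signature is not documented here; the arguments below are assumptions):

tokenizer = Tokenizer()
# Tokens inside #ifndef FOO_DISABLE_DEPRECATED ... #endif blocks, i.e.
# under the condition !FOO_DISABLE_DEPRECATED, are noted down as
# belonging to the 'deprecated' range.
tokenizer.add_range_type('deprecated', '!FOO_DISABLE_DEPRECATED')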


Preprocessing can also be used to make yagdoc see different declarations than the compiler does, with an effect similar to (though a bit more powerful than) gtk-doc overrides. To avoid name clashes, use a symbol starting with YAGDOC_ for this purpose; the officially recommended name is YAGDOC_FOR_PRESIDENT, which is also automatically assumed to be defined by the yagdoc preprocessor:

#ifndef YAGDOC_FOR_PRESIDENT
compiler sees this
#else
yagdoc sees this
#endif

If you use YAGDOC_FOR_PRESIDENT more than three times in a file, yagdoc will display a flying flag of your country in the terminal as animated ASCII-art and play the national anthem through the speaker (this feature is currently unimplemented).


If you for some reason don't use G_BEGIN_DECLS/G_END_DECLS for C++ declaration wrapping, and do

#ifdef __cplusplus
extern "C" {
#endif /* __cplusplus */
...
#ifdef __cplusplus
}
#endif /* __cplusplus */

instead, put !__cplusplus into the preprocessor assumptions to filter it out, as the parser cannot handle this.
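
For example (again using the assumption-string form of assume() guessed above):

# Pretend __cplusplus is never defined, so the extern "C" wrapper
# the parser cannot handle is filtered out.
tokenizer.assume('!__cplusplus')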


Expression tokenization is not 100% reliable. The most common failing case is

(guint)-1

where the minus sign becomes an m token instead of a part of the number. However, since yagdoc does not evaluate the expressions, this will usually pass through the parser.


After an error, recovery is performed by skipping the rest of the line. With line-continued strings this typically produces an error on each subsequent line of the string too. Preprocessor directives in such a string can wreak havoc.


There are two main causes of the tokenizer's slowness:

The Token object represents a token; it is a lot slower than the tuple originally used to represent tokens. Maybe objects can be competitive, but I don't know how to achieve this (it already uses __slots__).

