Tokenizer fulfils the function of a tokenizer and a preprocessor (with a crooked idea of what such a preprocessor does). It takes input lines and produces the preprocessed text, a list of code tokens and a list of defined macros. Tokens are represented with single characters, which creates a concise and relatively human-readable representation (see “token soup” below) that is subsequently fed to the parser.
Token soup produced by the tokenizer consists of lowercase characters and symbols:
| Symbol | Description |
|---|---|
| `a` | Access operator (`.`, `->`, `::`, …) |
| `b` | Binary operator (`%`, `+=`, `<`, `##`, …) |
| `c` | Character literal (`'a'`, `'\''`, …) |
| `e` | Ellipsis (`...`) |
| `i` | Identifier (`GtkWidget`, `enum`, …), includes all keywords |
| `m` | Ambiguous operator (`+`, `-`, `*`, `&`), binary or unary |
| `n` | Numeric literal (`1.0e-7`, `0X3F`, …), all kinds |
| `r` | Increment/decrement operator (`++`, `--`) |
| `s` | String literal (`"a"`, `"\\\""`, …) |
| `u` | Unary operator (`!`, `~`) |
| `y` | Yagdoc directive (`/*< private >*/`, …) |
| `[](){};,?:` | Direct character, each represents itself |
These token-representing characters occur in the output; a few more are used internally for preprocessor features such as digraphs or the leading parts of multiline tokens.
For instance, the code

```c
typedef int (*foo)(const char *a[],
                   /* Anonymous structs like this are not very useful */
                   struct { long p; double complex q; } b,
                   GtkWidget* (*c)(void*));
```

gives the following token soup:

```
ii(mi)(iimi[],oi{ii;iii;}i,im(mi)(im));
```
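For orientation, a hypothetical driver sketch follows. Only the class name `Tokenizer` and the members mentioned elsewhere in this section (`assume()`, `exclude_ifndef_wrapper`, `private_header`) come from the text; the module path, the `tokenize()` entry point and the result attributes are assumptions:

```python
from yagdoc import Tokenizer  # assumed module path

tokenizer = Tokenizer()       # assumed no-argument constructor

with open('gtkfoo.h') as header:
    # Assumed entry point; the text only says the tokenizer
    # "takes input lines".
    tokenizer.tokenize(header.readlines())

print(tokenizer.soup)            # assumed attribute: the token soup string
print(tokenizer.macros)          # assumed attribute: list of defined macros
print(tokenizer.private_header)  # documented attribute: True for private headers
```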
If the comment

```c
/*<private_header>*/
```

is encountered, the file is considered private and tokenization aborts (and the attribute `private_header` is set to `True`).
As with all gtk-doc/yagdoc directives, whitespace does not matter. However, yagdoc also ignores its own directives inside preprocessor-excluded code, and it ignores directives in multi-line comments. Therefore

```c
#if 0
/*< private_header >*/
#endif

/*
<private_header>
*/
```

will not make the header private.
Implementation of the multiline-comment/preprocessor interaction should be correct. All kinds of crap found for instance in Boost headers are parsed; for example

```c
# /*
# * Copyright J. Random Hacker
# */
/* */ # /* */ define /* */ a /* */ b
```

is the same as

```c
#define a b
```
Non-tokenizable input causes errors. This means if you do

```c
#ifdef _GENERATE
... awk code to generate some file ...
#endif
```

you will get errors, possibly lots of them, which is annoying although harmless. Shield the non-code from yagdoc by adding a `!_GENERATE` assumption (see `Tokenizer.assume()` and the sketch below) that will make the tokenizer skip the block and consequently suppress errors there.
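What such an assumption might look like as a call: `Tokenizer.assume()` appears in this text, but its exact signature is not given, so the single-string form with a leading `!` for “undefined” (matching the `!_GENERATE` notation above) is an assumption:

```python
from yagdoc import Tokenizer  # assumed module path

tokenizer = Tokenizer()
# Hypothetical call: declare _GENERATE undefined so the whole
# #ifdef _GENERATE ... #endif block is skipped, not tokenized,
# and errors inside it are suppressed.
tokenizer.assume('!_GENERATE')
```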
More precisely, tokenization errors are suppressed in code skipped due to conditionals and inside `#error` and `#warning`. So this will pass silently:

```c
#error Don't do this!
```

but this will be awarded with an error:

```c
#ifdef __GNUC__
...
#elif defined(_MSC_VER)
...
#else
Don't know what to do!
#endif
```
In addition, as the yagdoc preprocessor always takes all branches after an `#elif`, it is not possible to skip the `#else` part with a simple assumption. You are encouraged to use `#error` to induce errors.
Anything tokenizable after `#` is accepted and cheerfully ignored if not recognized:

```c
#}rubbish!{
```
Non-ASCII characters in identifiers are not recognized (`$` is accepted though). Actually, non-ASCII characters written `\unnnn` or `\Unnnnnnnn` are not recognized anywhere. I have yet to see them used in public headers. In string literals this is probably good, as they should appear in this expanded form in the documentation too.
Implementation: can be done on demand; it would require starting to care about character sets.
Long character literals are recognized (up to four bytes), wide character literals are not. The character constant `L'a'` forms two tokens, the identifier `L` and the character literal `'a'`, which are then likely rejected by the parser (though not necessarily, as macro bodies, compound initializers and function bodies are not examined thoroughly).
Implementation: can be done on demand; not sure whether it is worth it. This area is quite confusing, especially when `__STDC_ISO_10646__` cannot be assumed. And almost everyone uses UTF-8 instead of wide characters on Unix anyway…
Hexadecimal floating point numbers such as

```c
# define HUGE_VALF (__extension__ 0x1.0p255f)
```

are not recognized. They are hardly used anywhere besides gcc semi-internal headers.
Of other number formats, the standard decimal, hexadecimal and octal integers and floating point numbers are recognized. Integers can have various queer suffixes such as `ui64`. Also, words of the form `1fh` are identified as hexadecimal numbers (they occur in unquoted assembler), but note that `deafbeefh` will be classified as an identifier, not a number. In most cases you will probably want to exclude the code containing this kind of stuff using the preprocessing features anyway.
Identifiers cannot start with a number. Of course they cannot, but sometimes identifier-like tokens starting with a number occur in preprocessor macros, attached to something sane with the `##` operator.
Macro names must be identifiers. A few header files do things such as

```c
#define net-snmp-config_multilib_redirection_h
```

I wonder whether this is valid – regardless, it is not accepted.
If all code (i.e. everything non-whitespace and non-comment) is protected against repeated inclusion with

```c
#ifndef FOO_H
#define FOO_H
...
#endif
```

then the definition of `FOO_H` (the name does not matter) is excluded from the list of macros. If there is any code outside this wrapper, or the definition of `FOO_H` is not the first code after the `#ifndef FOO_H` (or `#if !defined(FOO_H)`), then `FOO_H` is included as normal.
This behaviour can be switched off by setting `Tokenizer.exclude_ifndef_wrapper` to `False`, as in the sketch below.
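The attribute itself is documented above; whether it is set per instance or on the class is not stated, so the instance-level form in this sketch is an assumption:

```python
from yagdoc import Tokenizer  # assumed module path

tokenizer = Tokenizer()
# With the wrapper exclusion off, a FOO_H-style guard macro shows up
# in the macro list like any other definition.
tokenizer.exclude_ifndef_wrapper = False  # documented attribute
```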
Trigraphs are recognized and substituted everywhere, digraphs outside string and character literals, as specified by ISO C89 and C99. Trigraphs however cause skewed column numbers in error messages. In my opinion people actually using this “feature” deserve it.
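For reference, the substitutions in question are the standard ISO C ones; the tables below restate general C knowledge rather than anything from the yagdoc source:

```python
# ISO C trigraphs, substituted everywhere in the input text.
TRIGRAPHS = {
    '??=': '#', '??(': '[', '??/': '\\',
    '??)': ']', "??'": '^', '??<': '{',
    '??!': '|', '??>': '}', '??-': '~',
}

# ISO C digraphs, recognized only outside string and character literals.
DIGRAPHS = {
    '<%': '{', '%>': '}',
    '<:': '[', ':>': ']',
    '%:': '#', '%:%:': '##',
}
```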
Rudimentary preprocessing. By default, preprocessor conditionals are just ignored (with the exception of `#if 0` and `YAGDOC_FOR_PRESIDENT`, see below). This means declarations from both (all) branches get to the output.

It is however possible to request inclusion/exclusion based on preprocessor conditionals (with the `assume()` method). In this case the following constructs are recognized:

```c
#ifdef FOO
#ifndef FOO
#if defined(FOO)
#if !defined(FOO)
#if 0
#if !0
#else
#endif
```
Note that specifically `#elif` and more complex variants of `#if` are not recognized, and both branches are taken as usual. In particular, the presence of `#elif` inhibits any further evaluation: the initial `#if`-conditioned block is possibly excluded according to the condition, while all the other branches (whether `#elif` or `#else`) are always taken. If no `#elif` is present and the `#if` is one of the simple conditions above, then the `#else` branch is included/excluded as expected (opposite of the `#if` branch).
This preprocessing is intentionally very simple and will be kept so. It should be powerful enough to shield some code (and sometimes untokenizable input) from yagdoc. On the other hand, gradual partial implementation of real C preprocessor conditional expression evaluation would change the set of blocks ignored in input files with every version, which is very undesirable. Full implementation of conditionals would require nothing less than a full C preprocessor implementation and is not worth it. Finally, a supposedly fixed, partial but relatively powerful implementation would generate a stream of extension requests, also undesirable.
It is possible to mark special ranges in the input file with preprocessor directives (see the `add_range_type()` method and the sketch below). The logic is exactly the same as in the previous item. However, both branches are taken (unless the same conditions are present among the assumptions, of course); the tokenizer just notes down the ranges of tokens corresponding to these conditions. This feature is used to implement the “deprecated guards” of `gtkdoc-scan`.
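`add_range_type()` is only named here, so the argument list below is an assumption, sketched after the deprecated-guard convention that `gtkdoc-scan` uses (`#ifndef GTK_DISABLE_DEPRECATED ... #endif` around deprecated declarations):

```python
from yagdoc import Tokenizer  # assumed module path

tokenizer = Tokenizer()
# Hypothetical signature: record token ranges guarded by
# #ifndef GTK_DISABLE_DEPRECATED ... #endif under the label
# "deprecated".  Both branches are still tokenized; the guarded
# ranges are merely noted down.
tokenizer.add_range_type('deprecated', '!GTK_DISABLE_DEPRECATED')
```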
Preprocessing can also be used to make yagdoc see different declarations than the compiler does, with an effect similar (though a bit more powerful) to gtk-doc overrides. To avoid name clashes, use a symbol starting with `YAGDOC_` for this purpose; the officially recommended name is `YAGDOC_FOR_PRESIDENT`, which is also automatically assumed defined by the yagdoc preprocessor:

```c
#ifndef YAGDOC_FOR_PRESIDENT
compiler sees this
#else
yagdoc sees this
#endif
```
If you use `YAGDOC_FOR_PRESIDENT` more than three times in a file, yagdoc will display a flying flag of your country in the terminal as animated ASCII-art and play the national anthem through the speaker (this feature is currently unimplemented).
If you for some reason don't use `G_BEGIN_DECLS`/`G_END_DECLS` for C++ declaration wrapping, and do

```c
#ifdef __cplusplus
extern "C" {
#endif /* __cplusplus */
...
#ifdef __cplusplus
}
#endif /* __cplusplus */
```

instead, put `!__cplusplus` into the preprocessor assumptions to filter it out, as the parser cannot handle this.
Expression tokenization is not 100% reliable. The most common failing case is `(guint)-1`, where the minus sign becomes an `m` token instead of a part of the number (the soup reads `(i)mn`). However, since Yagdoc does not evaluate the expressions, this will usually pass through the parser.
After an error, recovery is performed by skipping the rest of the line. With line-continued strings this typically produces an error on each subsequent line of the string too. Preprocessor directives in such a string can wreak havoc.
There are two main causes of the tokenizer slowness:

- **`Token` object use.** The `Token` object represents a token; it is a lot slower than the tuple originally used to represent tokens. Maybe objects can be competitive, but I don't know how to achieve this (already using `__slots__`).