Tru64 Unix man pages: flex (1)



NAME

       flex - Generates a C Language lexical analyzer

SYNOPSIS

       flex [-bcdfinpstvFILT8] -C[efmF] [-Sskeleton] [file...]

OPTIONS

       -b
              Generates backtracking information to lex.backtrack.
              This is a list of scanner states that require
              backtracking and the input characters on which they
              do so. By adding rules you can remove backtracking
              states. If all backtracking states are eliminated and
              -f or -F is used, the generated scanner will run
              faster.

       -d
              Makes the generated scanner run in debug mode.
              Whenever a pattern is recognized and the global
              yy_lex_debug is nonzero (which is the default), the
              scanner writes to stderr a line of the form:

              --accepting rule at line 53 ("the matched text")

              The line number refers to the location of the rule in
              the file defining the scanner (the input to lex).
              Messages are also generated when the scanner
              backtracks, accepts the default rule, reaches the end
              of its input buffer (or encounters a NULL), or reaches
              an End-of-File.

       -f
              Specifies full table (no table compression is done).
              The result is large but fast. This option is
              equivalent to -Cf.

       -i
              Instructs flex to generate a case-insensitive scanner.
              The case of letters given in the flex input patterns
              will be ignored, and tokens in the input will be
              matched regardless of case. The matched text given in
              yytext will have the original case (as read by the
              scanner).

       -p
              Generates a performance report to stderr. This
              identifies features of the flex input file that will
              cause a loss of performance in the resulting scanner.

       -s
              Causes the default rule (that unmatched scanner input
              is echoed to stdout) to be suppressed. If the scanner
              encounters input that does not match any of its rules,
              it aborts with an error.

       -t
              Instructs flex to write the scanner it generates to
              standard output instead of lex.yy.c.

       -v
              Specifies that flex should write to stderr a summary
              of statistics regarding the scanner it generates.

       -F
              Specifies that the fast scanner table representation
              should be used. This representation is about as fast
              as the full table representation (-f), and for some
              sets of patterns will be considerably smaller (and for
              others, larger). This option is equivalent to -CF.

       -I
              Instructs flex to generate an interactive scanner;
              that is, a scanner that stops immediately rather than
              looking ahead if it knows that the currently scanned
              text cannot be part of a longer rule's match. Note
              that -I cannot be used in conjunction with full or
              fast tables; that is, the -f, -F, -Cf, or -CF options.

       -L
              Instructs flex not to generate #line directives in
              lex.yy.c. The default is to generate such directives
              so error messages in the actions will be correctly
              located with respect to the original lex input file.

       -T
              Makes flex run in trace mode. It will generate a lot
              of messages to stdout concerning the form of the input
              and the resultant nondeterministic and deterministic
              finite automata. This option is mostly for use in
              maintaining flex.

       -8
              Instructs flex to generate an 8-bit scanner (which is
              the default).

       -C[efmF]
              Controls the degree of table compression. The default
              setting is -Cem, which provides the highest degree of
              table compression. Faster-executing scanners can be
              traded off at the cost of larger tables, with the
              following generally being true:

              Slowest and smallest

              -Cem -Cm -Ce -C -C{f,F}e -C{f,F}

              Fastest and largest

              The -C options are not cumulative; whenever the option
              is encountered, the previous -C settings are
              forgotten. The -f or -F and -Cm options do not make
              sense together; there is no opportunity for
              meta-equivalence classes if the table is not being
              compressed. Otherwise, the options may be freely
              mixed. A lone -C specifies that the scanner tables
              should be compressed and neither equivalence classes
              nor meta-equivalence classes should be used.

              -Ce
                     Directs flex to construct equivalence classes;
                     that is, sets of characters that have identical
                     lexical properties. Equivalence classes usually
                     give dramatic reductions in the final
                     table/object file sizes (typically a factor of
                     2 to 5) and are inexpensive performance-wise
                     (one array look-up per character scanned).

              -Cm
                     Directs flex to construct meta-equivalence
                     classes, which are sets of equivalence classes
                     (or characters, if equivalence classes are not
                     being used) that are commonly used together.
                     Meta-equivalence classes are often a big win
                     when using compressed tables, but they have a
                     moderate performance impact (one or two "if"
                     tests and one array look-up per character
                     scanned).

              -Cf
                     Specifies that the full scanner tables should
                     be generated; flex should not compress the
                     tables by taking advantage of similar
                     transition functions for different states.

              -CF
                     Specifies that the alternative fast scanner
                     representation should be used.

       -Sskeleton
              Overrides the default skeleton file from which flex
              constructs its scanners. This is useful for flex
              maintenance or development.

       -c
              Specifies table-compression options. (Obsolescent)

       -n
              Suppresses the statistics summaries that the -v option
              typically generates. (Obsolete)

DESCRIPTION

       The flex command is a tool for generating scanners: programs
       which recognize lexical patterns in text. The flex command
       reads the given input files, or its standard input if no
       filenames are given or if a file operand is - (dash), for a
       description of a scanner to generate. The description is in
       the form of pairs of regular expressions and C code, called
       rules. The flex command generates as output a C source file,
       lex.yy.c, which defines a routine yylex(). This file is
       compiled and linked with the -ll library to produce an
       executable. When the executable is run, it scans its input
       for occurrences of the regular expressions in its rules,
       looking for the best match (longest input). When it has
       selected a rule, it executes the associated C code, which has
       access to the matched input sequence (commonly referred to as
       a token). This process then repeats until input is exhausted.

       The flex command treats multiple input files as one.

   Syntax for Input
       This section contains a description of the flex input file,
       which is normally named with a .l suffix. The section also
       provides a listing of the special values, macros, and
       functions recognized by flex.

       The flex input file consists of three sections, separated by
       a line with just %% in it:

       [ definitions ]
       %%
       [ rules ]
       [ %%
       [ user functions ]]

       definitions
              Contains declarations to simplify the scanner
              specification, and declarations of start states, which
              are explained below.

       rules
              Describes what the scanner is to do.

       user functions
              Contains user-supplied functions that are copied
              straight through to lex.yy.c.

       With the exception of the first %% sequence, all sections are
       optional. The minimal scanner, %%, copies its input to
       standard output.

       Each line in the definitions section can be:

       name regexp
              Defines name to expand to regexp. name is a word
              beginning with a letter or an underscore (_) followed
              by zero or more letters, digits, underscores, or
              dashes (-). In the regular-expression parts of the
              rules section, flex substitutes regexp wherever you
              refer to {name} (name within braces).

       %s state [state...]
       %x state [state...]
              Defines names for states used in the rules section. A
              rule may be made conditionally active based on the
              current scanner state. Multiple lines defining states
              can appear, and each can contain multiple state names,
              separated by white space. The name of a state follows
              the same syntax as that of regexp names except that
              dashes (-) are not permitted. Unlike regexp names,
              state names share the C #define namespace. In the
              rules section, states are recognized as <state> (state
              within angle brackets).

              The %x directive names exclusive states. When a
              scanner is in an exclusive state, only rules prefixed
              with that state are active. Inclusive states are named
              with the %s directive.

       %{
       %}
              When placed on lines by themselves, these symbols
              enclose C code to be passed verbatim into the global
              definitions of the output file. Such lines commonly
              include preprocessor directives and declarations of
              external variables and functions.

       space or tab
              Lines beginning with a space or tab in the definitions
              section are passed directly into the lex.yy.c output
              file, as part of the initial global definitions.
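
       The difference between exclusive and inclusive states can be
       modeled in a few lines of C. This is only a sketch of the
       activation rule, not code taken from flex: struct state and
       rule_is_active() are hypothetical names, and the treatment of
       unprefixed rules in inclusive states follows standard flex
       behavior rather than anything stated on this page.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a start state; not a flex data structure. */
struct state {
    const char *name;
    bool exclusive;   /* true if declared with %x, false for %s or INITIAL */
};

/*
 * Returns true if a rule is active while the scanner is in `current`.
 * `rule_states` lists the states named in the rule's <...> prefix;
 * n_rule_states == 0 means the rule has no prefix.
 */
bool rule_is_active(const struct state *current,
                    const struct state *const *rule_states,
                    size_t n_rule_states)
{
    if (n_rule_states == 0)
        return !current->exclusive;  /* unprefixed rules stay active in inclusive states */
    for (size_t i = 0; i < n_rule_states; i++)
        if (rule_states[i] == current)
            return true;             /* rule is prefixed with the current state */
    return false;
}
```

       With a state declared through %x, an unprefixed rule is
       reported inactive, while the same rule stays active in the
       initial state and in any %s state.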

       The rules section follows the definitions, separated by  a
       line  consisting  of %%.  The rules section contains rules
       for matching input and taking actions,  in  the  following
       format: pattern [action]

       The  pattern  starts  in  the first column of the line and
       extends until the first non-escaped white space character.
       The flex command attempts to find the pattern that matches
       the longest input  sequence  and  execute  the  associated
       action.  If  two or more patterns match the same input the
       one which appears first in the rules section is chosen. If
       no  action  exists  the  matched input is discarded. If no
       pattern matches the input the default is  to  copy  it  to
       standard output.
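
       The selection rule described above (longest match wins, and
       the earliest rule wins a tie) can be sketched as a small C
       function. This is an illustration of the rule only, not the
       table-driven matching that a generated scanner performs;
       select_rule() and its match_len argument are hypothetical.

```c
#include <stddef.h>

/*
 * match_len[i] is the number of input characters that rule i matches at
 * the current position (0 means the rule does not match). Rules are
 * indexed in the order they appear in the rules section.
 * Returns the index of the selected rule, or -1 when no rule matches
 * and the default rule applies.
 */
int select_rule(const size_t *match_len, size_t n_rules)
{
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < n_rules; i++) {
        if (match_len[i] > best_len) {  /* strictly longer wins; a tie keeps the earlier rule */
            best = (int)i;
            best_len = match_len[i];
        }
    }
    return best;
}
```

       For match lengths {2, 5, 5} the function selects rule 1: it
       matches as much input as rule 2 but appears first.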

       All action code is placed in the yylex() function. Text (C
       code or declarations) placed at the beginning of the rules
       section is copied to the beginning of the yylex() function
       and may be used in actions. This text must begin with a space
       or a tab (to distinguish it from rules). In addition, any
       input (beginning with a space or within %{ and %} delimiter
       lines) appearing at the beginning of the rules section before
       any rules are specified will be written to lex.yy.c after the
       declarations of variables for the yylex() function and before
       the first line of code in yylex().
       Elements of each rule are:

       <state[,state...]>
              A pattern may begin with a comma-separated list of
              state names enclosed by angle brackets. These states
              are entered via the BEGIN statement. If a pattern
              begins with a state, the scanner can only recognize it
              when in that state. The initial state is 0 (zero).

       regexp
              A regular expression to match against the input
              stream. The regular expressions in flex provide a rich
              character matching syntax.

              The following characters, shown in order of decreasing
              precedence, have special meanings:

              x
                     Matches the character x.

              "..."
                     Encloses characters and treats them as a
                     literal string. For example, "*+" is treated as
                     the asterisk character followed by the plus
                     character.

              \str
                     If str is one of the characters a, b, f, n, r,
                     t, or v, then the ANSI C interpretation is
                     adopted (for example, \n is a newline). If str
                     is a string of octal digits, it is interpreted
                     as a character with octal value str. If str is
                     a string of hexadecimal digits with a leading
                     x, it is interpreted as a character with that
                     value. Otherwise, it is interpreted literally
                     with no special meaning. For example, x\*yz
                     represents the four characters x*yz.

              [ ]
                     Represents a character class in the enclosed
                     range ([.-.]) or the enclosed list ([...]). The
                     dash character is used to define a range of
                     characters from the ASCII value or the 8-bit
                     class of the character that comes before it to
                     the ASCII value or the 8-bit class of the
                     character that follows it. For example,
                     [abcx-z] matches a, b, c, x, y, or z.

                     The circumflex, when it appears as the first
                     character in a character class, indicates the
                     complement of the set of characters within that
                     class. For example, [^abc] matches any
                     character except a, b, or c, including special
                     characters like newline.

              ( )
                     Groups regular expressions. For example, (ab)
                     will be considered as a single regular
                     expression.

              { }
                     When enclosing numbers, indicates a number of
                     consecutive occurrences of the expression that
                     comes before it. For example, (ab){1,5}
                     indicates a match for from 1 to 5 occurrences
                     of the string ab.

                     When enclosing a name, the name represents a
                     regular expression defined in the definitions
                     section. For example, {digit} is replaced by
                     the defined regular expression for digit. Note
                     that the expansion takes place as if the
                     definition were enclosed in parentheses.

              .
                     Matches any single character except newline.

              ?
                     Matches zero or one of the preceding
                     expressions. For example, ab?c matches both ac
                     and abc.

              *
                     Matches zero or more of the preceding
                     expressions. For example, a* is zero or more
                     consecutive a characters. The utility of
                     matching zero occurrences is more obvious in
                     complicated expressions. For example, the
                     expression [A-Za-z][A-Za-z0-9]* indicates all
                     alphanumeric strings with a leading alphabetic
                     character, including strings that are only one
                     alphabetic character.

              +
                     Matches one or more of the preceding
                     expressions. For example, [a-z]+ is all strings
                     of lowercase letters.

              xy
                     Matches the expression x followed by the
                     expression y.

              x|y
                     Matches either the preceding expression or the
                     following expression. For example, ab|cd
                     matches either ab or cd.

              x/y
                     Matches expression x only if expression y
                     (trailing context) immediately follows it. For
                     example, ab/cd matches the string ab but only
                     if followed by cd. Only one trailing context is
                     permitted per pattern.

              ^
                     When it appears at the beginning of the
                     pattern, matches the beginning of a line. For
                     example, ^abc will match the string abc if it
                     is found at the beginning of a line.

              $
                     When it appears at the end of a pattern,
                     matches the end of a line. It is equivalent to
                     /\n. For example, abc$ will match the string
                     abc if it is found at the end of a line.

              <<EOF>>
                     Matches an End-of-File.

              <state>
                     Identifies a state name (see above) and may
                     only appear at the beginning of a pattern. For
                     example, <done><<EOF>> matches an End-of-File,
                     but only if it is in state done.
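
              The backslash rules in the list above (ANSI C letter,
              octal digits, x plus hexadecimal digits, otherwise
              literal) can be written out as a short C helper. This
              is a sketch for illustration, not flex's own parser;
              decode_escape() is a hypothetical name, and it takes
              the text that follows the backslash.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Decodes the text that follows a backslash in a pattern and returns
 * the character value it stands for.
 */
int decode_escape(const char *str)
{
    static const char letters[] = "abfnrtv";
    static const char values[]  = "\a\b\f\n\r\t\v";
    const char *p;

    /* A single letter from the ANSI C escape set. */
    if (str[0] != '\0' && str[1] == '\0' &&
        (p = strchr(letters, str[0])) != NULL)
        return values[p - letters];

    /* A string made up entirely of octal digits. */
    if (str[0] >= '0' && str[0] <= '7' &&
        str[strspn(str, "01234567")] == '\0')
        return (int)strtol(str, NULL, 8);

    /* A leading x followed only by hexadecimal digits. */
    if (str[0] == 'x' && str[1] != '\0' &&
        str[1 + strspn(str + 1, "0123456789abcdefABCDEF")] == '\0')
        return (int)strtol(str + 1, NULL, 16);

    /* Anything else stands for itself. */
    return (unsigned char)str[0];
}
```

              For instance, decode_escape("n") yields a newline,
              while decode_escape("*") yields the asterisk itself,
              which is why x\*yz is read as the four characters
              x*yz.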

              In addition, the following rules apply for bracket
              expressions:

              [= =]
                     These represent the set of collating elements
                     in an equivalence class and are enclosed within
                     bracket-equal delimiters. An equivalence class
                     generally is designed to deal with
                     primary-secondary sorting; that is, for
                     languages like French that define groups of
                     characters as sorting to the same primary
                     location, and then have a tie-breaking,
                     secondary sort. For example, if a, à, and â
                     belong to the same equivalence class, then
                     [[=a=]b], [[=à=]b], and [[=â=]b] are each
                     equivalent to [aàâb].

              [: :]
                     These represent the set of characters in the
                     current locale belonging to the named ctype
                     class. These are expressed as a ctype class
                     name enclosed in bracket-colon delimiters.

                     In the C or POSIX locale, this operating system
                     supports the following character class
                     expressions: [:alpha:], [:upper:], [:lower:],
                     [:digit:], [:alnum:], [:xdigit:], [:space:],
                     [:print:], [:punct:], [:graph:], [:cntrl:].

                     Other locales may define additional character
                     classes.
              Letters and digits never have special meanings. A
              character such as ^ or -, which has a special meaning
              in particular contexts, refers simply to itself when
              found outside that context. Spaces and tabs must be
              escaped to appear in a regular expression; otherwise
              they indicate the end of the expression.

       action
              Each pattern in a rule has a corresponding action,
              which can be any arbitrary C statement. The pattern
              ends at the first non-escaped white space character;
              the remainder of the line is its action. If the action
              is empty, then when the pattern is matched the input
              which matched it is discarded.

              If the action contains a {, then the action spans
              until the balancing } is found, and the action may
              cross multiple lines. Using a return statement in an
              action returns from yylex().

              An action consisting solely of a vertical bar (|)
              means "same as the action for the next rule."

              The flex variables which can be used within actions
              are:

              yytext
                     A string (char *) containing the current
                     matched input. It cannot be modified.

              yyleng
                     The length (int) of the current matched input.
                     It cannot be modified.

              yyin
                     A stream (FILE *) that flex reads from (stdin
                     by default). It may be changed, but because of
                     the buffering flex uses, this makes sense only
                     before scanning begins. Once scanning
                     terminates because an End-of-File was seen,
                     void yyrestart(FILE *new_file) may be called to
                     point yyin at a new input file. Alternatively,
                     yyin may be changed whenever a new or different
                     buffer is selected (see yy_switch_to_buffer()).

              yyout
                     A stream (FILE *) to which ECHO output is
                     written (stdout by default). It can be changed
                     by the user.

              YY_CURRENT_BUFFER
                     Returns the current buffer (YY_BUFFER_STATE)
                     used for scanner input.

              The flex command macros and functions that may be used
              within actions are:

              ECHO
                     Copies yytext to the scanner's output.

              BEGIN state
                     Changes the scanner state to be state. This
                     affects which rules are active. The state must
                     be defined in a %s or %x definition. The
                     initial state of the scanner is INITIAL or 0
                     (zero).

              REJECT
                     Directs the scanner to proceed immediately to
                     the next best pattern that matches the input
                     (which may be a prefix of the current match).
                     yytext and yyleng are reset appropriately. Note
                     that REJECT is a particularly expensive feature
                     in terms of scanner performance; if it is used
                     in any of the scanner's actions, it will slow
                     down all of the scanner's pattern matching
                     operations. REJECT cannot be used if flex is
                     invoked with either the -f or -F option.

              yymore()
                     Indicates that the next matched text should be
                     appended to the currently matched text in
                     yytext (rather than replace it).

              yyless(n)
                     Returns all but the first n characters of the
                     current token back to the input stream, where
                     they will be rescanned when the scanner looks
                     for the next match. yytext and yyleng are
                     adjusted accordingly.

              yywrap()
                     Returns 0 (zero) if there is more input to scan
                     or 1 if there is not. The default yywrap()
                     always returns 1. Currently it is implemented
                     as a macro; however, in future implementations
                     it may become a function.

              yyterminate()
                     Can be used in lieu of a return statement in an
                     action. It terminates the scanner and returns a
                     0 (zero) to the scanner's caller.

                     yyterminate() is automatically called when an
                     End-of-File is encountered. It is a macro and
                     may be redefined.

              yy_create_buffer(file, size)
                     Returns a YY_BUFFER_STATE handle to a new input
                     buffer large enough to accommodate size
                     characters and associated with the given file.
                     When in doubt, use YY_BUF_SIZE for the size.

              yy_switch_to_buffer(buffer)
                     Switches the scanner's processing to scan for
                     tokens from the given buffer, which must be a
                     YY_BUFFER_STATE.

              yy_delete_buffer(buffer)
                     Deletes the given buffer.

              YY_NEW_FILE
                     Enables scanning to continue after yyin has
                     been pointed at a new file to process.

              YY_DECL
                     Controls how the scanning function, yylex(), is
                     declared. By default, it is int yylex(), or, if
                     prototypes are being used, int yylex(void).
                     This definition may be changed by redefining
                     the YY_DECL macro. This macro is expanded
                     immediately before the {...} (braces) that
                     delimit the scanner function body.

              YY_INPUT(buf,result,max_size)
                     Controls scanner input. By default, YY_INPUT
                     reads from the file-pointer yyin. Its action is
                     to place up to max_size characters in the
                     character array buf and return in the integer
                     variable result either the number of characters
                     read or the constant YY_NULL to indicate EOF.
                     Following is a sample redefinition of YY_INPUT,
                     in the definitions section of the input file:

              %{
              #undef YY_INPUT
              #define YY_INPUT(buf,result,max_size) \
                  { \
                  int c = getchar(); \
                  result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
                  }
              %}
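
              The same contract (fill buf with at most max_size
              characters, then store the count or the End-of-File
              value in result) can also be met from an in-memory
              string. The following is a standalone sketch rather
              than a drop-in replacement: STRING_INPUT, set_source(),
              and SKETCH_NULL are hypothetical names, and SKETCH_NULL
              stands in for YY_NULL on the assumption that YY_NULL is
              0.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_NULL 0   /* stand-in for YY_NULL (assumed to be 0) */

/* Hypothetical input source: a string and a read position within it. */
static const char *src_text = "";
static size_t src_pos = 0;

void set_source(const char *text)
{
    src_text = text;
    src_pos = 0;
}

/* Copies up to max_size characters into buf and reports the count,
   or SKETCH_NULL once the string is exhausted. */
#define STRING_INPUT(buf, result, max_size) \
    do { \
        size_t left_ = strlen(src_text) - src_pos; \
        size_t n_ = left_ < (size_t)(max_size) ? left_ : (size_t)(max_size); \
        memcpy((buf), src_text + src_pos, n_); \
        src_pos += n_; \
        (result) = n_ ? (int)n_ : SKETCH_NULL; \
    } while (0)
```

              In a real input file the macro would be named YY_INPUT
              and placed between %{ and %} after an #undef, in the
              same way as the sample above.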

                     When the scanner receives an End-of-File
                     indication from YY_INPUT, it checks the
                     yywrap() function. If yywrap() returns zero, it
                     is assumed that yyin has been set up to point
                     to another input file, and scanning continues.
                     If it returns nonzero, then the scanner
                     terminates, returning zero to its caller.

              YY_USER_ACTION
                     Redefinable to provide an action which is
                     always executed prior to the matched pattern's
                     action.

              YY_USER_INIT
                     Redefinable to provide an action which is
                     always executed before the first scan.

              YY_BREAK
                     Is used in the scanner to separate different
                     actions. By default, it is simply a break, but
                     may be redefined if necessary.

       The user functions section consists of complete C functions,
       which are passed directly into the lex.yy.c output file (the
       effect is similar to defining the functions in separate files
       and linking them with lex.yy.c). This section is separated
       from the rules section by the %% delimiter.

       Comments, in C syntax, can appear anywhere in the user
       functions or definitions sections. In the rules section,
       comments can be embedded within actions. Empty lines or lines
       consisting of white space are ignored.

       The following macros are not normally called explicitly
       within an action, but are used internally by flex to handle
       the input and output streams.

       input()
              Reads the next character from the input stream. You
              cannot redefine input().

       output()
              Writes the next character to the output stream.

       unput(c)
              Puts the character c back onto the input stream. It
              will be the next character scanned. You cannot
              redefine unput().

       The libl.a library contains default functions to support
       testing or quick use of a flex program without yacc; these
       functions can be linked in through -ll. They can also be
       provided by the user.

       main()
              A simple wrapper that calls setlocale() and then calls
              the yylex() function.

       yywrap()
              The function called when the scanner reaches the end
              of an input stream. The default definition simply
              returns 1, which causes the scanner in turn to return
              0.
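
       A user-provided yywrap() might walk through a list of input
       files, returning 0 while another file can be opened and 1 when
       none remain. The sketch below is one possible shape, not the
       libl.a implementation: set_input_files() and the file list are
       hypothetical, and yyin is defined locally only so that the
       fragment compiles on its own (a generated scanner already
       defines it). Because this page notes that yywrap() is
       currently implemented as a macro, a real scanner would
       probably need #undef yywrap before such a definition.

```c
#include <stdio.h>

/* Defined by lex.yy.c in a real scanner; defined here only so the
   sketch is self-contained. */
FILE *yyin;

/* Hypothetical list of input files still to be scanned. */
static const char **remaining_files;
static int remaining_count;

void set_input_files(const char **names, int count)
{
    remaining_files = names;
    remaining_count = count;
}

/* Returns 0 after pointing yyin at the next readable file, or 1 when
   no files are left, which makes the scanner terminate. */
int yywrap(void)
{
    if (yyin != NULL && yyin != stdin)
        fclose(yyin);
    yyin = NULL;

    while (remaining_count > 0) {
        const char *name = *remaining_files++;
        remaining_count--;
        yyin = fopen(name, "r");
        if (yyin != NULL)
            return 0;      /* more input to scan */
        perror(name);      /* report and skip files that cannot be opened */
    }
    return 1;              /* no more input */
}
```

       A caller would typically pass the file operands from the
       command line to set_input_files() and call yywrap() once
       before the first yylex() call to open the first file.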

NOTES

       Some trailing context patterns cannot be properly matched and
       generate warning messages:

              Dangerous trailing context

       These are patterns where the ending of the first part of the
       rule matches the beginning of the second part, such as
       zx*/xy*, where the x* matches the x at the beginning of the
       trailing context.

       For some trailing context rules, parts that are actually
       fixed length are not recognized as such, leading to the
       previously mentioned performance loss. In particular,
       patterns using {n} (such as test{3}) are always considered
       variable length.

       Combining trailing context with the special | (vertical bar)
       action can result in fixed trailing context being turned into
       the more expensive variable trailing context. This happens in
       the following example:

              %%
              abc      |
              xyz/def

       Use of unput() invalidates the contents of yytext and yyleng
       within the current flex action.

       Use of unput() to push back more text than was matched can
       result in the pushed-back text matching a beginning-of-line
       (^) rule even though it did not come at the beginning of the
       line.

       Pattern matching of NULLs is substantially slower than
       matching other characters.

       The flex command does not generate correct #line directives
       for code internal to the scanner; thus, bugs in flex.skel
       yield invalid line numbers.

       Due to both buffering of input and read-ahead, you cannot
       intermix calls to <stdio.h> routines, such as getchar(), with
       flex rules and expect it to work. Call input() instead.

       The total table entries listed by the -v option excludes the
       number of table entries needed to determine what rule was
       matched. The number of entries is equal to the number of
       deterministic finite-state automaton (DFA) states if the
       scanner does not use REJECT, and somewhat greater than the
       number of states if it does.

       REJECT cannot be used with the -f or -F options.

EXAMPLES

       The following command processes the file lexcommands to
       produce the scanner file lex.yy.c:

              flex lexcommands

       This is then compiled and linked by the command:

              cc -o scanner lex.yy.c -ll

       This produces a program named scanner. The scanner program
       converts uppercase to lowercase letters, removes spaces at
       the end of a line, and replaces multiple spaces with single
       spaces. The lexcommands file contains:

              %%
              [A-Z]   putchar(tolower(yytext[0]));
              [ ]+$
              [ ]+    putchar(' ');
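
       To make the expected behavior of that scanner concrete, here
       is a hand-written C function that applies the same three
       transformations to a string. It is an illustrative equivalent
       only, not what flex generates; scan_like_example() is a
       hypothetical name, and it assumes the C locale so that
       isupper() agrees with [A-Z].

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Writes the transformed copy of `in` to `out`, which must have room
 * for at least strlen(in) + 1 bytes.
 */
void scan_like_example(const char *in, char *out)
{
    size_t i = 0, o = 0;

    while (in[i] != '\0') {
        if (in[i] == ' ') {
            size_t j = i;
            while (in[j] == ' ')          /* take the longest run of spaces */
                j++;
            if (in[j] != '\n')            /* like [ ]+ : keep a single space */
                out[o++] = ' ';
            /* like [ ]+$ : a run that ends the line is dropped */
            i = j;
        } else if (isupper((unsigned char)in[i])) {
            out[o++] = (char)tolower((unsigned char)in[i]);  /* like [A-Z] */
            i++;
        } else {
            out[o++] = in[i++];           /* default rule: copy unchanged */
        }
    }
    out[o] = '\0';
}
```

       Given the input "Hello   World  " followed by a newline, the
       function produces "hello world" followed by a newline.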

FILES

       flex.skel
              Skeleton scanner.

       lex.yy.c
              Generated scanner C source.

       lex.backtrack
              Backtracking information generated from -b option.

SEE ALSO

       Commands:  yacc(1), sed(1), awk(1)

       Files:  locale(4)
