

What Antelope Does (And What I Hope it Will Do)

In the early 1970s, our good buddy Stephen C. Johnson at AT&T created a program. The purpose of this program was to syntactically break down languages that had been defined by a grammar. This program was called Yacc, standing for Yet Another Compiler Compiler; when run, it would create an LALR parser that could parse the given language.

%{
#include <stdio.h>
%}

%token NAME NUMBER
%left '-' '+'
%left '*' '/'
%nonassoc UMINUS

%%
statement: NAME '=' expression
         | expression
    ;
expression: expression '+' expression   { $$ = $1 + $3; }
          | expression '-' expression   { $$ = $1 - $3; }
          | expression '*' expression   { $$ = $1 * $3; }
          | expression '/' expression   { $$ = $1 / $3; }
          |   '-' expression %prec UMINUS  { $$ = - $2; }
          |   '(' expression ')'        { $$ = $2;      }
          |   NUMBER                    { $$ = $1;      }
    ;
%%
int main (void) {
    return yyparse();
}

An example grammar file for Yacc.

Since then, more variants of Yacc have popped up - one of the more widely used ones is GNU Bison (heh, get it, because Yacc sounds like Yak? And Bison? Nevermind.), which is currently used for Ruby’s parser, as well as a few other things. Bison is maintained as part of the GNU project and is open source under the GPL, with a special exception for its generated parsers.

Antelope aspires to be like both Yacc and Bison. Antelope should be able to handle the grammar above and produce an LALR(1) parser (like Bison’s) that correctly parses valid input for the grammar. But in order to figure out what Antelope needs to do, we first have to look at the actual definition of an LALR(1) parser.

Parsers

Let’s start with the basics: what is the point of a parser? In terms of computer science, we want to take human input and turn it into a computer-understandable structure. To simplify the basic operation of a parser, it’s normally separated into two parts: the lexer and the actual parser. The lexer takes an input and turns it into a stream of tokens. The parser takes the tokens and gives them structure.
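To make that split concrete, here is a rough sketch of the pipeline in Ruby. The Lexer class and its token format are made up for illustration - they are not Antelope’s API - but they show the division of labour: the lexer produces a flat token stream, and the parser is what later gives it structure.

# A toy lexer (hypothetical class, not Antelope's API): it turns a
# string into a flat stream of token hashes.  A parser would take
# this stream and build a structure (e.g. an AST) out of it.
class Lexer
  def tokenize(input)
    input.scan(/\d+|[+\-*\/=]|\S/).map do |chunk|
      case chunk
      when /\A\d+\z/ then { identifier: :NUMBER, lexeme: chunk }
      when "+"       then { identifier: :PLUS }
      when "-"       then { identifier: :MINUS }
      else                { identifier: :UNKNOWN, lexeme: chunk }
      end
    end
  end
end

Lexer.new.tokenize("1 + 2")
# => [{ identifier: :NUMBER, lexeme: "1" }, { identifier: :PLUS },
#     { identifier: :NUMBER, lexeme: "2" }]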

Lexers

A lexer’s job is to take a string and turn it into a series of tokens. There are a few tools to do this already, like Flex and Ragel. A lexer can, in itself, be a parser. The lexer must make decisions at certain points: for example, given the string ifathenf, should the lexer split it into [if][a][then][f], or give only one token, [ifathenf]? Ideally, the lexer goes with the longest match – the maximal munch (Pacman!); a small sketch of this rule follows the Ragel listing below. Lexical tokens are most easily described by regular expressions (read up on them, you’ll need it for this). So, let’s define a basic language we can lex! I’m going to go with a basic calculator, using Ragel’s syntax:

# These are our basic machines; they are not implementations.
# Basically, when we actually use it in an implementation, Ragel
# will copy the values of these definitions straight into the
# implementation.

# Anything in quotes is an exact character.
plus     = '+';
minus    = '-';
multiply = '*';
divide   = '/';
equals   = '=';

# This definition is more regex-like; first, we have any digit
# one or more times, possibly followed by a period and one or
# more digits.
number   = [0-9]+ ('.' [0-9]+)?;

# This is our main implementation.  The lexer can match any of
# our definitions any number of times; when it matches the
# machine, it runs the blocks.  Blocks starting with %
# occur when the machine was fully matched; blocks starting with
# > are run just before the machine is started.
main := (
    plus     %{ emit(PLUS);     }
  | minus    %{ emit(MINUS);    }
  | multiply %{ emit(MULTIPLY); }
  | divide   %{ emit(DIVIDE);   }
  | equals   %{ emit(EQUALS);   }
  | number   >{ num = fpc;      } 
             %{ emit2(NUMBER);  }
  | space
)*;
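Here is the maximal munch rule from earlier as a tiny Ruby sketch. The IF/THEN/NAME token list is hypothetical (it isn’t part of the calculator above); the point is only that at each position every pattern is tried and the longest match wins, which is why ifathenf comes out as a single token.

# Toy longest-match (maximal munch) lexer with a hypothetical token
# list; each step keeps whichever pattern matches the most
# characters at the current position.
PATTERNS = {
  IF:   /\Aif/,
  THEN: /\Athen/,
  NAME: /\A[a-z]+/
}

def munch(input)
  tokens = []
  until input.empty?
    identifier, match = PATTERNS
      .map    { |id, re| [id, input[re]] }
      .reject { |_, m| m.nil? }
      .max_by { |_, m| m.length }
    break unless match  # nothing matched; error handling omitted
    tokens << { identifier: identifier, lexeme: match }
    input = input[match.length..-1]
  end
  tokens
end

munch("ifathenf")
# => [{ identifier: :NAME, lexeme: "ifathenf" }]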

To visualize the Ragel machine above, we can create an NFA - a nondeterministic finite automaton.

Our NFA.

This visualization shows every state our NFA can be in. The circles represent the states, with the double circles representing accepting states – states that the lexer may end in. Arrows connecting the circles represent transitions – we enter through the start transition. Then, depending on the next character, we transition to another state. If, however, a transition is marked with an ε, it is taken automagically, without consuming a character. However, using our NFA directly for lexing would cost \( O(mn^2) \) time in the worst case, with \( m \) being the length of the string and \( n \) being the number of states. Therefore, Ragel takes our NFA and turns it into a DFA - a deterministic finite automaton, which has at most one transition per state and input character. DFAs are slightly harder to visualize:

Our DFA, generated by Ragel
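To get a feel for what such a machine looks like under the hood, here is a hand-written sketch of a DFA for just the number machine, [0-9]+('.'[0-9]+)?, as a transition table. The state numbers and helper names are made up - Ragel’s generated tables are more compact - but the key property is the same: every state has at most one transition per input class, so lexing is a single linear pass over the string.

# A hand-written DFA sketch for the `number` machine above.
# States: 0 = start, 1 = integer part (accepting),
#         2 = just saw '.', 3 = fractional part (accepting).
DFA = {
  0 => { digit: 1 },
  1 => { digit: 1, dot: 2 },
  2 => { digit: 3 },
  3 => { digit: 3 }
}
ACCEPTING = [1, 3]

def classify(char)
  case char
  when /[0-9]/ then :digit
  when "."     then :dot
  end
end

def number?(string)
  state = 0
  string.each_char do |char|
    state = DFA[state][classify(char)]
    return false if state.nil?   # no transition: not a number
  end
  ACCEPTING.include?(state)
end

number?("3.14") # => true
number?("3.")   # => false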

So, now we’ve got the lexer working, but how do we generate tokens? The lexer doesn’t generate tokens for you; it’s up to the programmer to do so. Earlier, when I defined the implementation, I used the emit and emit2 functions. These functions create a lexical token - a token that contains an identifier, a lexeme (the original text the token is associated with), and attributes. The lexeme and attributes are optional. For example, let’s assume that the Ragel lexer is going to be implemented in Ruby, and so we define our methods as such:

def emit(token)
  @tokens << { identifier: token }
end

def emit2(token)
  # We're assuming @data is where the source string is; for some
  # reason, Ragel requires the source to be an array of characters.
  # @num comes from the `num = fpc` line in the Ragel file; @p is
  # the current position of the lexer, so the lexeme is everything
  # from @num up to (but not including) @p.
  @tokens << { identifier: token, lexeme: @data[@num...@p].join }
end

Then we run our lexer on the string 2 + 3 = 5. With a few modifications to the original Ragel code, we should get a @tokens array containing the following:

[
  { identifier: "NUMBER", lexeme: "2" },
  { identifier: "PLUS" },
  { identifier: "NUMBER", lexeme: "3" },
  { identifier: "EQUALS" },
  { identifier: "NUMBER", lexeme: "5" }
]

We keep the lexeme data of the numbers and not the operations because we need the values of the numbers – essentially, all the information we need about the operations is the type, which is given by the identifier. However, for numbers, we also need to know the value – which we can only get by looking at the lexeme.
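As a small, hypothetical illustration of that (not part of the lexer above), a later stage can fold each NUMBER lexeme back into a numeric attribute:

# Turn each NUMBER token's lexeme into a numeric value the parser
# can actually compute with; operator tokens need no extra data.
tokens = [
  { identifier: "NUMBER", lexeme: "2" },
  { identifier: "PLUS" }
]
tokens.each do |token|
  token[:value] = token[:lexeme].to_f if token[:identifier] == "NUMBER"
end
tokens.first # => { identifier: "NUMBER", lexeme: "2", value: 2.0 }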

These tokens can then be passed on to the main parser.

Part II can be found here!
