

Yacc is Not Dead

Posted on Monday, December 6, 2010.

The internet tubes have been filled with chatter recently about a paper posted on arXiv by Matthew Might and David Darais titled “Yacc is dead.” Unfortunately, much of the discussion seems to have centered on whether people think yacc is or should be dead and not on the technical merits of the paper.

Reports of yacc's death, at least in this case, are greatly exaggerated. The paper starts by arguing against the “cargo cult parsing” of putting together parsers by copying and pasting regular expressions from other sources until they work. Unfortunately, the paper ends up being an example of a much worse kind of cargo cult: claiming a theoretical grounding without having any real theoretical basis. That's a pretty strong criticism, but I think it's an important one, because the paper is repeating the same mistakes that led to widespread exponential time regular expression matching. I'm going to spend the rest of this post explaining what I mean. But to do so, we must first review some history.

Grep and yacc are the two canonical examples from Unix of good theory—regular expressions and LR parsing—producing useful software tools. The good theory part is important, and it's the reason that yacc isn't dead.

Regular Expressions

The first seminal paper about processing regular expressions was Janusz Brzozowski's 1964 JACM article “Derivatives of regular expressions.” That formulation was the foundation behind Ken Thompson's clever on-the-fly regular expression compiler, which he described in the 1968 CACM article “Regular expression search algorithm.” Thompson learned regular expressions from Brzozowski himself while both were at Berkeley, and he credits the method in the 1968 paper. The now-standard framing of Thompson's paper as an NFA construction and the addition of caching states to produce a DFA were both introduced later, by Aho and Ullman in their various textbooks. Like many primary sources, the original is different from the textbook presentations of it. (I myself am also guilty of this particular misrepresentation.)

The 1964 theory had more to it than the 1968 paper used, though. In particular, because each state computed by Thompson's code was ephemeral—it only lasted during the processing of a single byte—there was no need to apply the more aggressive transformations that Brzozowski's paper included, like rewriting x|x into x.

A much more recent paper by Owens, Reppy, and Turon, titled “Regular-expression derivatives reexamined,” returned to the original Brzozowski paper and considered the effect of using Brzozowski's simplification rules in a regular expression engine that cached states (what textbooks call a DFA). There, the simplification has the important benefit that it reduces the number of distinct states, so that a lexer generator like lex or a regular expression matcher like grep can reduce the necessary memory footprint for a given pattern. Because the implementation manipulates regular expressions symbolically instead of the usual sets of graph nodes, it has a very natural expression in most functional programming languages. So it's more efficient and easier to implement: win-win.
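To make this concrete, here is a minimal sketch of derivative-based matching with simplification, written in Python for this discussion rather than taken from either paper; the tuple encoding and helper names are my own. The smart constructors apply simplifications such as rewriting x|x into x, and a matcher that cached each distinct simplified derivative it encountered would be building exactly the DFA states described above, with the simplifications keeping that set small.

EMPTY, EPS = ('empty',), ('eps',)      # matches nothing / matches only the empty string

def lit(c):
    return ('chr', c)

def alt(r, s):                         # smart constructor for r|s
    if r == EMPTY: return s
    if s == EMPTY: return r
    if r == s: return r                # the x|x -> x rule from Brzozowski's paper
    return ('alt', r, s)

def cat(r, s):                         # smart constructor for rs
    if r == EMPTY or s == EMPTY: return EMPTY
    if r == EPS: return s
    if s == EPS: return r
    return ('cat', r, s)

def star(r):                           # smart constructor for r*
    return EPS if r in (EMPTY, EPS) else ('star', r)

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag in ('eps', 'star'): return True
    if tag in ('empty', 'chr'): return False
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])              # 'cat'

def deriv(r, c):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ('empty', 'eps'): return EMPTY
    if tag == 'chr': return EPS if r[1] == c else EMPTY
    if tag == 'alt': return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == 'cat':
        d = cat(deriv(r[1], c), r[2])
        return alt(d, deriv(r[2], c)) if nullable(r[1]) else d
    return cat(deriv(r[1], c), r)                         # 'star'

def match(r, s):
    # A DFA-building matcher would cache each distinct r computed here as a state.
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# (a|b)*abb, the usual textbook example
r = cat(star(alt(lit('a'), lit('b'))), cat(lit('a'), cat(lit('b'), lit('b'))))
print(match(r, 'ababb'), match(r, 'abab'))                # True False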

Yacc is dead?

That seems a bit of a digression, but it is relevant to the “Yacc is dead” paper. That paper, claiming inspiration from the “Regular-expression derivatives reexamined” paper, shows that since you can define derivatives of context free grammars, you can frame context free parsing as taking successive derivatives. That's definitely true, but not news. Just as each input transition of an NFA can be described as taking the derivative (with respect to the input character) of the regular expression corresponding to the originating NFA state, each shift transition in an LR(0) parsing table can be described as taking the derivative (with respect to the shift symbol) of the grammar corresponding to the originating LR(0) state. A key difference, though, is that NFA states carry no extra information with them and duplicates can simply be removed, while a parser associates a parse stack with each LR(0) state, making duplicate removal more involved.

Linear vs Exponential Runtime

The linear time guarantee of regular expression matching is due to the fact that, because you can eliminate duplicate states while processing, there is a finite bound on the number of NFA states that must be considered at any point in the input. If you can't eliminate duplicate LR(0) states there is no such bound and no such linear time guarantee.

Yacc and other parser generators guarantee linear time by requiring that the parsing automaton not have even a hint of ambiguity: there must never be a (state, input symbol) combination where multiple possible next states must be explored. In yacc, these possible ambiguities are called shift/reduce or reduce/reduce conflicts. The art of writing yacc grammars without these conflicts is undeniably arcane, and yacc in particular does a bad job of explaining what went wrong, but that is not inherent to the approach. When you do manage to write down a grammar that is free of these conflicts, yacc rewards you with two important guarantees. First, the language you have defined is unambiguous. There are no inputs that can be interpreted in multiple ways. Second, the generated parser will parse your input files in time linear in the length of the input, even if they are not syntactically valid. If you are, say, defining a programming language, these are both very important properties.

The approach outlined in the “Yacc is dead” paper gives up both of these properties. The grammar you give the parser can be ambiguous. If so, it can have exponentially many possible parses to consider: they might all succeed, they might all fail, or maybe all but the very last one will fail. This implies a worst-case running time exponential in the length of the input. The paper makes some hand-wavy claims about returning parses lazily, so that if a program only cares about the first one or two, it does not pay the cost of computing all the others. That's fine, but if there are no successful parses, the laziness doesn't help.

It's easy to construct plausible examples that take exponential time to reject an input. Consider this grammar for unary sums:

S → T
T → T + T | N
N → 1

This grammar is ambiguous: it doesn't say whether 1+1+1 should be parsed as (1+1)+1 or 1+(1+1). Worse, as the input gets longer there are exponentially many possible parses, approximately O(3ⁿ) of them. If the particular choice doesn't matter, then for a well-formed input you could take the first parse, whatever it is, and stop, not having wasted any time on the others. But suppose the input has a syntax error, like 1+1+1+...+1+1+1++1. The backtracking parser will try all the possible parses for the long initial prefix just in case one of them can be followed by the erroneous ++1. So when the user makes a typo, your program grinds to a halt.
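To watch this happen, here is a deliberately naive backtracking recognizer for the sum grammar, a toy Python sketch of my own and not the paper's derivative-based parser. It tries every split at a +, so it finds the first parse of well-formed input quickly but re-explores the same prefixes over and over before rejecting input that ends in ++1; the call counter shows the exponential growth.

calls = 0

def parse_T(tokens):
    """Can this token list be derived from T?  Tries T -> N and every split for T -> T + T."""
    global calls
    calls += 1
    if tokens == ['1']:                                  # T -> N -> 1
        return True
    for i, tok in enumerate(tokens):                     # T -> T + T, split at each '+'
        if tok == '+':
            if parse_T(tokens[:i]) and parse_T(tokens[i + 1:]):
                return True
    return False

def sum_tokens(n, typo=False):
    toks = list('+'.join(['1'] * n))                     # 1+1+...+1 with n ones
    return toks + ['+', '+', '1'] if typo else toks      # optionally append the erroneous ++1

calls = 0
parse_T(sum_tokens(40))                                  # well-formed: first parse found quickly
print('valid, 40 ones:', calls, 'calls')

for n in range(4, 15, 2):                                # with the typo: the counts blow up
    calls = 0
    parse_T(sum_tokens(n, typo=True))
    print('typo,', n, 'ones:', calls, 'calls')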

For an example that doesn't involve invalid input, we can use the fact that any regular expression can be translated directly into a context free grammar. Take a test case that makes Perl's backtracking matcher take exponential time before finding a match, translate it into a grammar, and the translated case will make the backtracking parsers in this paper take exponential time to find a match too.
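The translation itself is mechanical: give every subexpression its own nonterminal. Below is a small Python sketch of my own (the node classes, the fresh-name scheme, and the production encoding are all invented for illustration, not taken from the paper) applied to (a|aa)*, a commonly cited catastrophic-backtracking pattern.

from dataclasses import dataclass

@dataclass
class Lit:            # a single character
    c: str

@dataclass
class Alt:            # r1 | r2
    left: object
    right: object

@dataclass
class Cat:            # r1 r2
    left: object
    right: object

@dataclass
class Star:           # r*
    inner: object

def regex_to_cfg(regex):
    """Return (start_symbol, productions); each alternative is a list of symbols."""
    productions, counter = {}, [0]

    def fresh():
        counter[0] += 1
        return 'R%d' % counter[0]

    def walk(node):
        nt = fresh()
        if isinstance(node, Lit):
            productions[nt] = [[node.c]]                         # R -> c
        elif isinstance(node, Alt):
            productions[nt] = [[walk(node.left)], [walk(node.right)]]
        elif isinstance(node, Cat):
            productions[nt] = [[walk(node.left), walk(node.right)]]
        else:                                                    # Star
            productions[nt] = [[], [walk(node.inner), nt]]       # R -> ε | inner R
        return nt

    return walk(regex), productions

start, prods = regex_to_cfg(Star(Alt(Lit('a'), Cat(Lit('a'), Lit('a')))))
for nt, alts in prods.items():
    print(nt, '->', ' | '.join(' '.join(alt) or 'ε' for alt in alts))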

Exponential run time is a great name for a computer science lab's softball team but in other contexts is best avoided.

Approaching Ambiguity

The “Yacc is dead” paper observes correctly that ambiguity is a fact of life if you want to handle arbitrary context free grammars. I mentioned above the benefit that if yacc accepts your grammar, then (because yacc rejects all the ambiguous context free grammars, and then some) you know your grammar is unambiguous. I find this invaluable as a way to vet possible language syntaxes, but it's not necessary in all contexts. Sometimes you don't care, or you know the language is ambiguous and want all the possibilities. Either way, you can do better than exponential time parsing. Context free parsing algorithms like the CYK algorithm or Generalized LR parsing can build a tree representing all possible parses in only O(n³) time. That's not linear, but it's a far cry from exponential.
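For the unary sum grammar from earlier, the cubic bound is easy to see with the CYK algorithm. Here is a short Python sketch of my own, using a Chomsky normal form encoding of that grammar that I chose for illustration; rather than enumerating the exponentially many parse trees, it fills an n-by-n table in O(n³) time and counts how many parses each span admits.

from collections import defaultdict

# Chomsky-normal-form encoding of S -> T, T -> T + T | N, N -> 1:
#   T -> T X | '1'      X -> P T      P -> '+'
binary = {('T', 'X'): ['T'], ('P', 'T'): ['X']}
terminal = {'1': ['T'], '+': ['P']}

def cyk_parse_counts(tokens):
    n = len(tokens)
    # table[i][j][A] = number of distinct ways nonterminal A derives tokens[i..j]
    table = [[defaultdict(int) for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        for a in terminal.get(tok, []):
            table[i][i][a] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                        # split point: three nested loops, O(n^3)
                for b, nb in table[i][k].items():
                    for c, nc in table[k + 1][j].items():
                        for a in binary.get((b, c), []):
                            table[i][j][a] += nb * nc
    return table[0][n - 1]

for ones in (3, 4, 7):
    tokens = list('+'.join(['1'] * ones))
    print(ones, cyk_parse_counts(tokens)['T'])           # 2, 5, 132 parses: the Catalan numbers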

Another interesting approach to handling ambiguity is to give up on context free grammars entirely. Parsing expression grammars, which can be parsed efficiently using packrat parsing, replace the alternation of context free grammars with an ordered (preference) choice, removing any ambiguity from the language and achieving linear time parsing. This is in some ways the same approach yacc takes, but some people find it easier to write parsing expression grammars than yacc grammars.
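For comparison with the backtracking sketch above, here is a minimal packrat-style recognizer in Python for a toy PEG version of the sum language, T ← N '+' T / N and N ← '1'; the grammar and helper names are my own, not taken from any particular PEG tool. Memoizing each rule at each position is what gives the linear-time guarantee, and the ordered choice means there is never more than one parse to consider.

from functools import lru_cache

def make_parser(text):
    @lru_cache(maxsize=None)                  # the memo table is what makes it packrat
    def parse_N(pos):
        # N <- '1'
        return pos + 1 if pos < len(text) and text[pos] == '1' else None

    @lru_cache(maxsize=None)
    def parse_T(pos):
        # T <- N '+' T / N   (ordered choice: the first alternative that matches wins)
        after_n = parse_N(pos)
        if after_n is not None and after_n < len(text) and text[after_n] == '+':
            after_t = parse_T(after_n + 1)
            if after_t is not None:
                return after_t
        return after_n                        # fall back to the second alternative, plain N

    return parse_T

for text in ('1+1+1', '1+1++1'):
    parse = make_parser(text)
    print(text, parse(0) == len(text))        # True, then False: the typo no longer explodes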

These are basically the only two approaches to ambiguity in context free grammars: either handle it (possible in polynomial time) or restrict the possible input grammars to ensure linear time parsing and as a consequence eliminate ambiguity. The obvious third approach—accept all context free grammars but identify and reject just the ambiguous ones—is an undecidable problem, equivalent to solving the halting problem.

Cargo cults

The “Yacc is dead” paper begins with a scathing critique of the practice of parsing context free languages with (not-really-)regular expressions in languages like Perl, which Larry Wall apparently once referred to as “cargo cult parsing” due to the heavy incidence of cut and paste imitation and copying “magic” expressions. The paper says that people abuse regular expressions instead of turning to tools like yacc because “regular expressions are `WYSIWYG'—the language described is the language that gets matched—whereas parser-generators are WYSIWYGIYULR(k)—`what you see is what you get if you understand LR(k).' ”

The paper goes on:

To end cargo-cult parsing, we need a new approach to parsing that:

1. handles arbitrary context-free grammars;
2. parses efficiently on average; and,
3. can be implemented with ease.

Then the paper starts talking about the “Regular-expression derivatives reexamined” paper, which got me excited: I hoped that just as returning to derivatives led to an easy-to-implement and, oh-by-the-way, more efficient approach to regular expression matching, the same would happen for context free parsing. Instead, the paper takes one of the classic blunders in regular expression matching—search via exponential backtracking because the code is easier to write—and applies it to context free parsing.

Concretely, the paper fails at goal #2: “parses efficiently on average.” Most arbitrary context-free grammars are ambiguous, and most inputs are invalid, so most parses will take exponential time exploring all the ambiguities before giving up and declaring the input unparseable. (Worse, the parser couldn't even parse an unambiguous S-expression grammar efficiently without adding a special case.)

Instead of ending the supposed problem of cargo cult parsing (a problem that I suspect is endemic mainly to Perl), the paper ends up being a prime example of what Richard Feynman called “cargo cult science” (Larry Wall was riffing on Feynman's term), in which a successful line of research is imitated but without some key aspect that made the original succeed (in this case, the theoretical foundation that gave the efficiency results), leading to a failed result.

That criticism aside, I must note that the Scala code in the paper really does deliver on goals #1 and #3. It handles arbitrary context-free grammars, it takes little effort, and on top of that it's very elegant code. Like that Scala code, the C regular expression matcher in this article (also Chapter 9 of The Practice of Programming and Chapter 1 of Beautiful Code) is clean and elegant. Unfortunately, the use of backtracking in both, with its associated exponential run time, makes the code much nicer to study than to use.

Is yacc dead?

Yacc is an old tool, and it's showing its age, but as the Plan 9 manual says of lex(1), “The asteroid to kill this dinosaur is still in orbit.” Even if you think yacc itself is unused (which is far from the truth), the newer tools that provide compelling alternatives still embody its spirit. GNU Bison can optionally generate a GLR parser instead of an LALR(1) parser, though Bison still retains yacc's infuriating lack of detail in error messages. (I use an awk script to parse the bison.output file and tell me what really went wrong.) ANTLR is a more full-featured though less widespread take on the approach. These tools and many others all have the guarantee that if they tell you the grammar is unambiguous, they'll give you a linear-time parser, and if not, they'll give you at worst a cubic-time parser. Computer science theory doesn't know a better way. But any of these is better than an exponential time parser.

So no, yacc is not dead, even if it sometimes smells that way, and even if many people wish it were.
