Test-driven grammar-class understanding with ANTLR

Suppose you would like to understand some grammar classes such as LL(1) and LL(*). Then you might appreciate a test suite like the following, or you would actually try to discover such a test suite yourself. I used the following suite in a lab of my "Software Language Processing" class. It is very convenient means of illustrating things. Students can copy and paste these codes now and further investigate. It helps understanding formal notions and actual, operational parser behaviors.

The first ANTLR input specifies a trivial grammar with a fixed lookahead of 1.

grammar LL1;
options { k = 1; }
s : 'a' 'x' | 'b' 'x' ;

Let's apply ANTLR:

java -classpath ".:antlr-3.2.jar" org.antlr.Tool LL1.g

ANTLR likes that grammar just fine. (There are no complains.) We are allowed to infer hence that the grammar is indeed LL(1).

The next ANTLR input combines two alternatives that are clearly not LL(1). However, we still use a lookahead of 1.

grammar NotLL1;
options { k = 1; }
s : 'a' 'x' | 'a' 'y';

Let's apply ANTLR again:

java -classpath ".:antlr-3.2.jar" org.antlr.Tool NotLL1.g
warning(200): NotLL1.g:3:3: Decision can match input such as "'a'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
error(201): NotLL1.g:3:3: The following alternatives can never be matched: 2

That is, ANTLR complains that both alternatives are to be selected with the same lookahead, and hence, ANTLR announces effectively that it will (pragmatically) favor the first alternative. As a result, the second alternative can never be matched operationally.

The next ANTLR input is basically the same as before except that we use a lookahead of 2 now.

grammar LL2;
options { k = 2; }
s : 'a' 'x' | 'a' 'y';

The grammar is indeed LL(2). ANTLR likes the grammar just fine.

The next ANTLR input goes back to lookahead 1 but it enables backtracking.

grammar Backtracking;
options { backtrack = true; k = 1; }
s : 'a' 'x' | 'a' 'y';

With backtracking turned on, ANTLR just gracefully accepts the grammar even with the insufficient look ahead. It is easy to guess that the operational behavior will combine use of lookahead for the optimization of decisions with the use of backtracking for completeness. That is, whenever input cannot be matched on the grounds of lookahead-based decisions, then additional choices are to be explored through backtracking.

The next ANTLR input explores the ANTLR semantics of backtracking. That is, we retain the same grammar as before---except that we inject a semantic action for output into the first alternative---just to see whether semantic actions will be prematurely executed---even if an alternative will not be completed, and a different alternative is chosen by backtracking eventually.

grammar Delay;
options { backtrack = true; k = 1; }
s : 'a' { System.out.println(42); } 'b' 'x' | 'a' 'b' 'y';

When we apply the generated parser for the grammar to the input "aby", then no output will be printed. That is, the parser explores the alternative with the semantic action first, but it delays execution of semantic actions. Such delayed actions are a safe choice when backtracking is involved.

The next ANTLR input starts two alternatives with a kind of prefix that rules out finite lookahead for deterministic decisions.

grammar Not42;
options { k = 42; }
s : 'a'* 'x' | 'a'* 'y';

In particular, the grammar is not LL(42). This is how ANTLR reports the problem:

java -classpath ".:antlr-3.2.jar" org.antlr.Tool Not42.g
warning(200): Not42.g:3:3: Decision can match input such as "'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

Hence, a parser based on lookahead of 42 would not be able to make the right decision for longer inputs.

The next ANTLR input simply removes the LL(k) option. Thereby, ANTLR's infinite lookahead applies.

grammar LLstar;
s : 'a'* 'x' | 'a'* 'y';

Indeed, ANTLR likes the grammar just fine. The grammar is hence LL(*).

The next ANTLR input starts two alternatives with a kind of prefix that rules out even infinite (DFA-based) lookahead for deterministic decisions. A recursive nonterminal is used which clearly rules out DFAs for lookahead.

grammar NotLLstar;
s : p 'x' | p 'y';
p : 'a' p 'b' | ;

Indeed, ANTLR complains.

java -classpath ".:antlr-3.2.jar" org.antlr.Tool NotLLstar.g
error(211): NotLLstar.g:2:3: [fatal] rule s has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
make: [NotLLstar.gen] Error 1 (ignored)

ANTLR basically suggests a few options which are all worth investigating: factoring, syntactic predicates, and backtracking.

Let's try factoring first. It does not take a rocket grammar engineer to even guess that we get back to LL(1):

grammar Factoring;
options { k = 1; }
s : p ('x' | 'y');
p : 'a' p 'b' | ;

Indeed, ANTLR likes the grammar just fine. We quickly transitioned from non-LL(*) to LL(1).

Now let's try syntactic predicates. That is, we attach essentially grammar expressions as guards to the alternatives:

grammar SynPreds;
s : (p 'x') => p 'x'
| (p 'y') => p 'y';
p : 'a' p 'b' | ;

ANTLR also is Ok with that grammar. However, we should not expect too much. Essentially, we are using an operational semantics now such that the first alternative is tried first subject to the syntactic predicate to be satisfied; otherwise we continue with the second alternative which is guarded as well. No formal property for the predicates is to be checked. All reasoning is based on order. From a practical point of view, the predicates at hand are somewhat terrible because they actually run the complete alternative in the guard. (We simulate backtracking in a sense.) We should try to find a shorter prefix for the test, but this is impossible in the example at hand. However, even guards that test a lot of the actual alternative may be efficient if we could additionally leverage memoization, but we do not attempt this here.

It is reasonable to turn on backtracking again. In this case, the grammar itself does not need to be transformed:

grammar Desperate;
options { backtrack = true; }
s : p 'x' | p 'y';
p : 'a' p 'b' | ;

We can parse just fine with that grammar. Demo omitted. Of course, efficiency issues are to be expected whenever backtracking is involved, but, again, we could try to leverage memoization in some cases, or we should indeed try to factor---as it is clearly trivial and appropriate in the specific case at hand.

Happy grammar engineering!