a context-sensitive tale

It was recently suggested to me that I should give an example implementation of the grammar {a^m*b^m*c^m, where a b and c are terminals, * is concatenation, and m >= 0}, as this is a standard example of a grammar that is not context-free.

I agreed that this would be useful, but also pointed out my existing aibjaibj example, which deals with a similar case. We had some difficulties (changes to the D language, case-insensitve filenames in certain operating systems). All this was useful feedback - I wish someone had grumbled before - these problems have now been fixed.

one way of doing it

I provided one way of implementing this language. It isn't the standard way of doing it, but it recognises precisely the language in question. It works by ensuring that each 'a' provides a recogniser for a corresponding 'b' and 'c'. The grammar as specified generates (to use that jargon) the empty string. I added the requirement for ';' as a terminator, because it is otherwise logically impossible to distinguish between success and failure. The terminator is in effect part of the test harness. Here it is.

the proper way to do it

We encountered more glitches - useful feedback as ever. But it was suggested, and I agreed, that this was too far from the standard way of writing this grammar - the standard monotone grammar for the context-sensitive language a^nb^nc^n looks like this (I was told):

 S : | abc | aSBc.
 cB: Bc.
 bB: bb.

In fact, it appears from the wikipedia page about context-sensitive grammar that 'noncontracting grammars cannot generate any language that contains the empty string' (monotone grammars are also called noncontracting grammars).

My first approach to this grammar was wrong, in ways that can be instructive. If you blindly translate the grammar as given, using '_' where a symbol would otherwise be treated as a variable, you get this:

               <- _S;         // not correct LMN - empty left-side must start with 'that hyphen'
 'abc'         <- _S;         // ok
 'a' _S _B 'c' <- _S;         // ok
 _B 'c'        <- 'c' _B;     // terminal symbol as goal? left-initial nonterminal - maybe topdown? 
 'b' _B        <- 'b' 'b';    // terminal symbol as goal?

The comments reflect my experience. Also we need a test harness. So here are the rules with a minimal test harness, and modified to be accepted as valid LMN.

 - error       <- eof;        // H1: error case  
 - _S ';'      <- eof;        // H2: the test harness with ';' as terminator

 -             <- _S;         // S1: empty left-side must start with 'that hyphen'
 'abc'         <- _S;         // S2: ok
 'a' _S _B 'c' <- _S;         // S3: ok
 -   _B 'c'    <- 'c' _B;     // S4: terminal symbol as goal? left-initial was nonterminal- go topdown
 'b' _B        <- 'b' 'b';    // S5: terminal symbol as goal?

These rules don't work. It was clear that having a rule for 'c' didn't matter. But tracing showed that the second 'b' in 'aabbcc' was being matched against 'c'. In other words these rules don't get a chance to recurse on 'b'. The thing to do is to rewrite the rule S2 so as not to match 'b' until after trying recursion. You force a mismatch by using a nonterminal symbol. Should this be the existing nonterminal _B, or a new one? Occam's razor's a wonderful thing (but sharp) - so I tried using _B.

 -             <- _S;         // S1: empty left-side must start with 'that hyphen'
 'a' _B 'c'    <- _S;         // S2: 
 'a' _S _B 'c' <- _S;         // S3: ok
 - _B 'c'      <- 'c' _B;     // S4: terminal symbol as goal? left-initial was nonterminal - use topdown
 'b' _B        <- _B 'b';     // S5: 
 'b'           <- _B    ;     // S6: treat 'b' as _B

This grammar accepts correct cases. But my correspondent correctly deduced that it would also match 'aabcbc', 'aaabcbbcc' and similar cases - that was good. I had detected and dealt with one of these, but that wasn't enough, and I had in any case used the wrong kind of rule - an unrestricted rule (not included above).

In fact we do need a symbol to drive recursion on 'b' in rule S2, but it isn't the same symbol as _B. As I say Occam's razor is a wonderful thing (but sharp, very sharp).

 -             <- _S;         // S1: empty left-side must start with 'that hyphen'
 'a' _b 'c'    <- _S;         // S2: use _b to give recursion a chance
 'a' _S _B 'c' <- _S;         // S3: ok
 - _B 'c'      <- 'c' _B;     // S4: terminal symbol as goal? left-initial was nonterminal - use topdown
 'b' _b        <- _b  _B;     // S5: use _b in place of 'b' as goal for this rule
 'b'           <- _b;         // S6: treat 'b' as _b

These rules work as intended. In fact rule S4 doesn't need to go top down, as it is triggered by the _B substituted in rules S4 and/or S5. It helps to have the rules say "yes" or "no" - here they are.

So the differences from the 'standard' grammar are as follows:

rules are written the other way round, with each alternative as a separate rule
three additional rules provide a test harness and require a terminator
that hyphen is required in the null case
the nonterminal _b is used as a way of giving recursion a chance
one additional rule allows 'b' to be treated as _b

not pleased

My correspondent however was not pleased. Leaving aside unfortunate glitches, my first grammar looks too much like counting, I am accused of trying to hide the fact my second was wrong, I have added a terminator, I should have coped with the rules as they result directly from a crude translation of the original grammar, I introduce irreducible words, the grammar he presented was an RL-attributed grammar, I have transformed it to LR-attributed. I claim to have a toolkit for grammar, every other system is also Turing-complete, etc.

a reply

For some reason, the tone was no longer the cheerful song of open-minded enquiry. I never claim that the language machine will directly implement any conceivable representation of any possible grammar, just that it is a toolkit for language and grammar which is small, powerful and easy to use - but you have to learn how to use it - and I am still learning.

Although the 'standard' grammar is fine as a toy, it is a serious mistake in the general case to rely heavily on rules that are triggered by terminal symbols as goal - unless those rules are genuinely applicable wherever the terminal symbol may appear as the goal symbol in a mismatch. Generative grammar does not describe how analysis works, and will always be at one remove from practical use.

The language machine is certainly more than a toy, as can be seen from the j2d java to D language translator, which in less than two weeks was developed from an existing D-to-D translator to the point where it it makes sense of about 750k lines of Java source code (comments included).

The phrase 'RL-attributed grammar' produces no results when googled. As for 'counting', the same technique is used here to provide recognisers which also build lists, either in the forward or the reversed order. The so-called 'counting' technique scales easily, and easily generalises to much more complex cases. By contrast the wikipedia article goes on to say that of course you can write a grammar for {a^m*b^m*c^m*d^m} but it will be 'much more complex than the grammar above' - the 'standard' and in their case genuinely monotone grammar.

Leaving LR/RL-attribution aside, the language machine concentrates unashamedly on what actually happens: one symbol arrives at a time, and the future comes after the past. If there exist other tools that deal with unrestricted grammars and are reasonably efficient, I would of course be interested to know. It is of course true some extremely stupid systems are also Turing-complete - the interesting feature of the lambda experiment is that that it demonstrates the close relationship between the lambda calculus, functional languages and grammatical rules in the language machine.

This whole process was useful to me, and will I hope be useful to others. It uncovered some glitches, and helped me understand the 'standard' approach. Two different solutions result. Both of them work. One remains very close to the 'standard' grammar, the other demonstrates powerful and original features of the language machine, features that scale well and can be immensely useful. Both are extremely concise. As far as I know there is no generally accepted 'standard' implementation that actually works.

The following are example solutions and related rulesets:
* the non-standard implementation of {a^i*b^j*a^i*b^j}: aibjaibj
* an implementation of {a^m*b^n*a^m*b^n} that uses no variables: ambnambn
* a non-standard solution for {a^n*b^n*c^n}: anbncn_each
* an implementation of the 'standard' solution for {a^n*b^n*c^n}: anbncn
* an implementation for {a^m*b^m*c^m*d^m*e^m} that uses no variables: ambmcmdmem
* beyond 'counting' - exploring the non-standard approach: hippos

Peri Hankey, 15 April 2006