© Copyright 2005 Peri Hankey - documentation license Gnu FDL - code license Gnu GPL

stemming.lmn

stemming.lmn: (C) Copyright 2005 Peri Hankey (mpah@users.sourceforge.net). This source text is published under the terms of the Gnu General Public License. It comes with absolutely no warranty.


Note: this line and others that start in column 1 are treated as comments (you can also use C-style comments within the actual source). You can use a subset of the mediawiki markup notation within comments so that the source can be converted to html and viewed as a web page.

introduction

This is a little ruleset that shows how you can use associative arrays to find common stems of symbol sequences. Identical techniques can be used to track common subexpressions in a code generator for arithmetic expressions. Of course in most languages that is complicated by control-flow analysis, which is not required in this case.

The internal format is in effect the parse tree, but we have complete control over it, and we can use the same kind of rules to analyse it as were used to produce it.
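The core idea can be sketched outside lmn. The following is a minimal illustration in plain Python (the function and variable names are my own, not part of the ruleset): an associative array maps each (stem, element) pair to a unique id, so sequences that share a prefix share a stem id, exactly as the L table does in the rules below.

```python
# Minimal sketch of common-stem detection with associative arrays:
# each distinct (stem, element) pair gets a unique id the first time
# it is seen, so repeated prefixes share one id.
def find_stems(sequences):
    table = {}    # (stem_id, element) -> stem_id, like L[B][A] in the rules
    counter = 0
    ids = []
    for seq in sequences:
        stem = None               # the empty stem
        for elem in seq:
            key = (stem, elem)
            if key not in table:  # first occurrence: allocate a new stem id
                counter += 1
                table[key] = counter
            stem = table[key]
        ids.append(stem)
    return ids, table

ids, table = find_stems(["abc", "abd", "abc"])
# "abc" and "abd" share the stem for "ab"; identical sequences get identical ids
```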

getting started

.stemming()
   start var I = 1; var L = []; var R = []; var Ix = []; var Jx = []; var Nx = []; body  <- eof; 

We have finished when we get "eof". We apply code generator rules to anything wrapped between "generate" and "output". Anything that is not recognised as matching expr :X ";" is treated as an error - a line is skipped and analysis restarts on the next line.

   eof                                                  <- body   ;
   generate output                                      <- body - ;
   - line  var  Z;                                      <- body - generate sp "E " $(format("line %s", varLn(N))) nl eoc; 

The default analysis applies common stem rules to the left-side of a rule, but leaves the right-side unchanged. On the left, the common stem rules produce rules to recognise patterns as extensions of common stems - each distinct stem is represented by a unique symbol that takes the form @mN, and the rules are reductive - they go from the patterns to the stems.

There are two alternative styles of analysis. If the rule to be translated is prefixed with a '.', the right-side is treated as a pattern to be generated by expanding common stem symbols - each distinct stem is represented by a unique symbol that takes the form @rN, and the rules are generative - they go from the stems to the patterns.

If the analysis is prefixed by a ':', the same reductive common stem rules are generated for both sides. This is just to show how easy it is to do this kind of thing - on the face of it the results do not look very useful.

   -   initial :S expr :X "<-" initial :T expr  :Y ";"; <- body - generate cx :S :T df :X :Y eoc;
   '.' initial :S expr :X "<-" initial :T expr  :Y ";"; <- body - generate cx :S :T dr :X :Y eoc;
   ':' initial :S expr :X "<-" initial :T expr  :Y ";"; <- body - generate cx :S :T dx :X :Y eoc;
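The difference between the reductive and generative directions is just which side of the emitted rule carries the stem symbol. A hedged sketch in plain Python (names assumed, not from the ruleset):

```python
# Sketch of the two rule directions: reductive rules rewrite a pattern
# to its stem symbol (@mN); generative rules expand a stem symbol (@rN)
# back into the pattern it stands for.
def emit_rules(pairs):
    reductive = [f"{a} {b} <- @m{j}" for j, (a, b) in pairs.items()]
    generative = [f"@r{j} <- {a} {b}" for j, (a, b) in pairs.items()]
    return reductive, generative

red, gen = emit_rules({1: ("'a'", "'b'")})
# red == ["'a' 'b' <- @m1"], gen == ["@r1 <- 'a' 'b'"]
```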

the expression analyser

Rules to recognise sequence expressions: first we have an easy rule for bracketed expressions - it restarts the priority within the brackets at zero.

 .stemming(0B)
  '(' expr :X ')'                      <- term :{ br :X };
 .stemming()
   - term :A                           <- initial :A term :A;
   - term :A op                        <- expr - ;
   -                                   <- op expr  :A;
 .stemming(18L)
   - expr :B                           <- op term :{ fa :A   :B }; 

These rules very crudely turn a single-quoted string, which has been turned into a sequence of character symbols, into a sequence of operands so that each character is dealt with as a separate element in the stem. The rules need modifying to deal correctly with escape conditions. An equivalent problem is correctly dealt with in the lmn-to-D and lmn-to-C backend rules.

 .stemming(B)
   - anything % toSym :X <- char term :{ c :X } ch;
   ch char               <- term -;
   ch eos                <- - ;
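The effect of these rules, sketched in plain Python (an illustration only, with assumed names): the string body becomes one operand per character, so each character takes part in stemming separately.

```python
# Sketch: a single-quoted string is split into one "c" operand per
# character, so each character is a separate element in the stem.
# Escape sequences are not handled, as noted above.
def chars_as_operands(text):
    return [("c", ch) for ch in text]

ops = chars_as_operands("ab")
# ops == [("c", "a"), ("c", "b")]
```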

the code generator rules

We use an extra level of nesting to ensure that code generation takes place entirely within a high priority context so that spaces and newlines are not thrown away by the space deletion rules. An alternative approach is to place code generator rules in a different grammar, and have a rule that switches into that grammar to do code generation.

The code generator operates as a filter: anything that cannot be matched is sent to standard output. This applies until the eoc symbol is seen and matched; it provides code and gets us out of the code generator context.

 .stemming(20R)
   - code                                      <- output;
   eoc                                         <- code;
   - out                                       <- code - ;
   sp                                          <- code - ' '  ;
   nl                                          <- code - '\n' ;
   sc                                          <- code - ";"  ;
   mm :A :B :J                                 <- code - sp A sp B sp "<-" sp m :J  sc nl;
   rr :A :B :J                                 <- code - sp r :J sp "<-" sp A sp B  sc nl;
   r  :N                                       <- code - "@r" N;
   m  :N                                       <- code - "@m" N;    
   d  :N                                       <- code - "\"" N "\"";    
   s  :N                                       <- code - "\'" N "\'";    
   c  :N                                       <- code - "\'" N "\'";    
   a  :A                                       <- code - A;
   x  :A                                       <- code - A;
   n  :A                                       <- code - A;
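The filtering behaviour can be sketched in plain Python (names assumed; the real rules rewrite symbols like sp and nl to spaces and newlines, and eoc ends the context):

```python
# Sketch of the code generator as a filter: tokens with a rewrite rule
# are rewritten, unmatched tokens pass through unchanged, and the eoc
# marker ends the code-generation context.
def generate(tokens, rewrite):
    out = []
    for tok in tokens:
        if tok == "eoc":          # leave the code generator context
            break
        out.append(rewrite.get(tok, tok))
    return "".join(out)

text = generate(["sp", "x", "eoc", "ignored"], {"sp": " ", "nl": "\n"})
# text == " x"; "ignored" is never reached
```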

Different styles of rule compilation - stemming to the left, reductive stemming to the left with generative stemming to the right, and reductive stemming on both sides. The rules for u do reductive stemming; the rules for w do generative stemming. The trick here is to look inside elements that come from the frontend - this is done by :u and :w, which match the value of a binding against the symbols u and w.

   cx :S :T code                               <- code;
   df :u :X :P :A :Q                           <- code - X Y sp '(' S sp T ')' sp P sp "<-" sp Q sc nl ;
   dr :u :X :P :A :w :Y :Q :B                  <- code - X Y sp '(' S sp T ')' sp P sp "<-" sp Q sc nl ;
   dx :u :X :P :A :u :Y :Q :B                  <- code - X Y sp '(' S sp T ')' sp P sp "<-" sp Q sc nl ;
   fa :A :B                                    <- code - A sp B ;
   br :B                                       <- code - '(' B ')';
   fx :F :u :X :P :A  :u :Y :Q :B              <- code - X  Y sp F sp P sp Q;
   f2 :F :u :X :P :A  :u :Y :Q :B              <- code - X  Y sp F sp P Q ;
   f1 :F :u :X :P :A                           <- code - X    sp F sp P;

The rules for finding and dealing with common stems follow. First, brackets are represented in the internal format as br :X bindings. The stemming rules have to look inside the brackets:

   br :u :B                                                  <- u :B;
   br :w :B                                                  <- w :B;

The first time an individual element appears, it needs an entry in the stemming tables - the L table is used for left-side reductive stems, and the R table is used for right-side generative stems.

The components of u and w elements are: a code sequence that defines the entry, its representation in the output, and the value that is to be used as a key for extensions of the stem.

   c  :A if(!(Ix[A])) { Ix[A] = A; L[A] = []; R[A] = []; }   <- u :{}  :{ c :A } :A;  
   d  :A if(!(Ix[A])) { Ix[A] = A; L[A] = []; R[A] = []; }   <- u :{}  :{ d :A } :A;
   x  :A if(!(Ix[A])) { Ix[A] = A; L[A] = []; R[A] = []; }   <- u :{}  :{ x :A } :A;
   n  :A if(!(Nx[A])) { Nx[A] = A; L[A] = []; R[A] = [];}    <- u :{}  :{ n :A } :A;
   c  :A if(!(Ix[A])) { Ix[A] = A; L[A] = []; R[A] = []; }   <- w :{}  :{ c :A } :A;
   d  :A if(!(Ix[A])) { Ix[A] = A; L[A] = []; R[A] = []; }   <- w :{}  :{ d :A } :A;
   x  :A if(!(Ix[A])) { Ix[A] = A; L[A] = []; R[A] = [];}    <- w :{}  :{ x :A } :A;
   n  :A if(!(Nx[A])) { Nx[A] = A; L[A] = []; R[A] = [];}    <- w :{}  :{ n :A } :A;    
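The first-occurrence bookkeeping in these rules can be mirrored in plain Python (variable names copied from the rules above for illustration):

```python
# Sketch: the first time an element A is seen it gets an index entry and
# empty left/right stem tables, just as Ix[A], L[A] and R[A] are
# initialised in the rules above; a repeated entry is a no-op.
Ix, L, R = {}, {}, {}

def enter(a):
    if a not in Ix:      # corresponds to if(!(Ix[A])) in the lmn rules
        Ix[a] = a
        L[a] = {}        # left-side reductive stems
        R[a] = {}        # right-side generative stems
    return a

enter("x")
enter("x")
# Ix has a single entry for "x"
```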

Each element in a pattern is represented by an fa :A :B element - fa stands for function application, as the same principles apply to a succession of left-associative Haskell-style curried functions. So these need to create and define a new stem entry for each unique stem.

   fa  var J = 0; var C = {};    :u :X :P :A :u :Y :Q :B 
     if(L[B][A]) { J = L[B][A];} 
     else        { J = L[B][A] = ++I; L[J] = []; C = { X Y mm :P :Q :J }; } 
                                               <- u :{ C } :{ m :J } :J;
   fa  var J = 0; var C = {};    :w :X :P :A :w :Y :Q :B 
     if(R[B][A]) { J = R[B][A];} 
     else        { J = R[B][A] = ++I; R[J] = []; C = { X Y rr :P :Q :J }; } 
                                               <- w :{ C } :{ r :J } :J;
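The stem-extension step at the heart of the fa rules, sketched in plain Python (a simplified illustration with assumed names; the real rules also thread the defining code sequence through C and handle the generative R table symmetrically):

```python
# Sketch of stem extension: extending stem B by element A either reuses
# an existing stem id (L[B][A]) or allocates a new one, records it, and
# emits the defining rule (the mm output in the lmn rules above).
L = {None: {}}   # None stands for the empty stem
I = 0            # the stem id counter
rules = []       # emitted reductive rules

def extend(B, A):
    global I
    if A in L[B]:
        return L[B][A]           # stem already known: reuse its id
    I += 1
    L[B][A] = I
    L[I] = {}                    # the new stem can itself be extended
    lhs = f"@m{B} {A}" if B is not None else A
    rules.append(f"{lhs} <- @m{I}")
    return I

j1 = extend(None, "a")
j2 = extend(j1, "b")
j3 = extend(None, "a")
# j1 == j3: the stem for "a" is created once and shared
```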

the lexical rules

Lexical rules are at a higher left-associative (non-nesting) priority.

 .stemming(20L)

Ignore whitespace and recognise an end-of-command condition

   .[ \t\n]                            <- -   ;
   ';'                                 <- ";" ;
   '='                                 <- "=" ;
   '<-'                                <- "<-";
   '-'                                 <- "-" ;

Recognise identifiers and numbers

   '\'' var T; single '\'' var Str = toChars(T);               <- term :{ } op ch Str eos;
   '\"' var T; double '\"' var Sym = usym(T);                  <- term :{ d :Sym };
   .[a-zA-Z_]         % { repeat .[a-zA-Z_]   % }                                 toSym :N <- iden :{ x :N };
   .[a-zA-Z_]         % { repeat .[a-zA-Z_]   % }                                 toSym :N <- term :{ x :N };
   '0'                % { repeat .[0-7]       % } octal yes                       toOct :N <- term :{ n :N };
   .[0-9]             % { repeat .[0-9]       % } { option '.' % repeat [0-9] % } toNum :N <- term :{ n :N };
   '0x'  .[0-9a-zA-Z] % { repeat .[0-9a-zA-Z] % } { option '.' % repeat [0-9] % } toHex :N <- term :{ n :N };
   '0b'  .[01]        % { repeat .[01]        % }                                 toBin :N <- term :{ n :N };
 .stemming()
   -                                   <- octal yes;
   '.'                                 <- octal no ;
   'x'                                 <- octal no ;
   'b'                                 <- octal no ;
   '\''                                <- single '\'';
   '\\' escape                         <- single -   ;
   - (T)                               <- single -   ;
   '\"'                                <- double '\"';
   '\\' escape                         <- double -   ;
   - (T)                               <- double -   ;
   .[abfnrtv] :E                       <- escape "\\" E    ;
   '\\'                                <- escape "\\" "\\" ;
   '\''                                <- escape "\\" "\'" ;
   '\"'                                <- escape "\\" "\"" ;

Rules to skip the rest of a line

   - anything                          <- line - ;
   eof                                 <- line eof;
   '\n'                                <- line   ;