atom feed15 messages in org.perl.perl5-portersPattern matching in SNOBOL4 (long, di...
FromSent OnAttachments
Mark-Jason DominusApr 15, 1998 10:23 pm 
Ilya ZakharevichApr 15, 1998 11:34 pm 
Moore, PaulApr 16, 1998 2:16 am 
Moore, PaulApr 16, 1998 2:49 am 
Chaim FrenkelApr 16, 1998 6:50 am 
Mark-Jason DominusApr 16, 1998 7:20 am 
Ilya ZakharevichApr 16, 1998 9:53 am 
Ilya ZakharevichApr 16, 1998 10:08 am 
Larry WallApr 16, 1998 10:41 am 
Chaim FrenkelApr 16, 1998 11:03 am 
Ton HospelApr 16, 1998 3:18 pm 
kst...@chapin.eduApr 16, 1998 4:41 pm 
Peter PrymmerApr 16, 1998 4:55 pm 
Ton HospelApr 17, 1998 1:39 pm 
Ton HospelApr 17, 1998 2:20 pm 
Subject:Pattern matching in SNOBOL4 (long, digression)
From:Mark-Jason Dominus (mj@plover.com)
Date:Apr 15, 1998 10:23:31 pm
List:org.perl.perl5-porters

This note started out as an analysis of SNOBOL4's `FAIL' pattern, and turned into a huge ramble about SNOBOL in general. If it has a point, the point is only that SNOBOL's pattern matching is *still* a lot better than Perl's, and that it is worth studying, because we could learn a lot from it.

The canonical SNOBOL reference is

The SNOBOL4 Programming Language R.E. Griswold, J. F. Poage, I. P. Polonsky Prentice-Hall, 1971

----------------------------------------------------------------

Someone showed up in clpm today wanting to find the longest substring common to two strings $s and $t, and while I was thinking about that I got sidetracked into thinking about how to transform

"abc"

into

('abc', 'ab', 'a', 'bc', 'b', 'c', '')

One way is to write an explicit loop and use substr, of course. But it was late, and my mind went down the m//g path, and I realized that m/.*/g wouldn't do it, of course. But this reminded me of a feature in SNOBOL that was useful for similar purposes, and I dug out my SNOBOL book and got sucked in, as I always do, and I came to the same conclusion that I always come to, which is that SNOBOL4 was a remarkably usable language, especially for 1971, and that people should pay more attention to it. It had associative arrays, recursive functions with locally-scoped variables, pattern matching that was better than then Perl's is now. On the other hand, SNOBOL's control flow and syntax are hopelessly 1971.

SNOBOL4 has a backtracking pattern matcher very similar to Perl's. The feature that I remembered was a pattern-matching primitive called FAIL, which causes the pattern matcher to fail and backtrack. It doesn't cause a complete failure of the entire match (ABORT does that); rather, it just causes a failure of the current backtracking alternative, so that the pattern matcher backs up and tries the next alternative. The example from the SNOBOL book is

&ANCHOR = 0 'MISSISSIPPI' ('IS' | 'SI' | 'IP' | 'PI') $OUTPUT FAIL

Matching is SNOBOL is anchored at the beginning and the end unless you specify &ANCHOR=0; this enables matching substrings. 'MISSISSIPPI' is the string to be matched; the rest of the line is the pattern. ('IS' | 'SI' | 'IP' | 'PI') is just like (IS|SI|IP|PI) in Perl. $ is like a backreference; it says that the whatever was matched by the previous pattern component should be stored into the variable OUTPUT, even if the pattern match failed. In SNOBOL, the OUTPUT variable is special: Anything `stored' in OUTPUT is actually printed instead. Without the FAIL, the IS would match and be output, and then the pattern would succeed, and that would be the end of it; the FAIL causes a backtracking to try the next alternative, and so on, so that this pattern ends up printing:

IS SI IS SI IP PI

The Snobol manual has several examples of how this might be useful:

``In general, the behavior of the scanner during any pattern match may be observed using aa statement of the form

STR PAT $OUTPUT FAIL

''

This feature alone makes FAIL valuable.

You may want to stop reading at this point; an extended example follows, in which FAIL figures only peripherally.

----------------------------------------------------------------

There's another example, which unfortunately isn't very persuasive. So I looked through the book to find other uses of FAIL, and found this exercise:

``Exercise 4.3: Define a function that converts algebraic expressions into Polish notation. Let the definition of expression include the binary operator ! for exponentiation in addition to the operators +, -, *, and /. Assume that exponentiation associates to the right.''

Yeah, I *wish* I could do that with regexes. Let's see what the SNOBOL solution looks like. My comments and translations into Perlspeak are interspersed.

&ANCHOR = 1

This enables `anchored mode' for pattern matching. In Perlspeak, it turns on an implicit ^ and $ at the front and end of every pattern.

EXPAT = BAL .X '!' .OP REM .Y

BAL matches any string that contains balanced parentheses. .X is a backreference; it says that whatever matched the preceding expression should be assigned to variable X, but only if the entire pattern match is successful. REM is the SNOBOL version of /.*$/.

Thus EXPAT is a pattern for detecting and parsing expressions that contain the exponentiation operator '!'. If the pattern match succeeds, the longest possible initial sequence with balanced parentheses is stored into X; the literal '!' is stored into OP, and the rest of the string, presumably the exponent, is stored into Y. Note that pattern-matching in SNOBOL is usually nongreedy by default, so that in a string like this:

(..A..)!(..B..)!(..C..)

X gets '(..A..)', OP gets '!', and Y gets '(..B..)!(..C..)'. This will eventually result in eexponentiation being right-associative.

PMPAT = (ARBNO(BAL ANY('+-')) $X FAIL | *DIFFER(X) TAB(*(SIZE(X)-1))) .X LEN(1) .OP REM .Y

ARBNO(pat) is the SNOBOL version of /(pat)*?/. ANY(X) is the snobol version of [X]. $X is like .X, but the assignment occurs immediately, as soon as the expression on its left matches, even if the entire pattern match fails.

DIFFER(X) succeeds iff X is not the null string. * in a SNOBOL pattern signals a match-time evaluation of the pattern, so that where DIFFER(X) would check the value of X that held before the statement was executed, *DIFFER(X) in the second clause checks the value of X that was assigned by the $X in the first clause. Similarly, *(SIZE(X)-1) uses the value of X assigned in the $X in the first clause. TAB(N) is like /^.{N}/ in Perl. LEN(1) is like /./.

What's going on here? Snobol checks the alternatives in a P|Q clause left to right, so first it tries ARBNO(BAL ANY('+-')). In Perl, this is like /(\P[+-])*?/, where \P is a notional sequence that matches the longest possible string with balanced parentheses. For example, if the string in question is

(..A..)+(..B..)-(..C..)*(..D..)+(..E..)

then this part matches the (..A..)+(..B..)- part, regardless of what the ..A.. and ..B.. are. Having done that, it immedately stores the matched string into X. Then FAIL does two things: It forces Snobol to keep trying this first clause until the ARBNO, and the string assigned to X, is as long as it can be, and the it forces the first clause to fail anyway.

The first clause must fail, so SNOBOL tries the second clause. DIFFER makes sure that the string in X is nonempty; if it is empty, the entire pattern fails. Otherwise, TAB(..) skips over the (..A..)+(..B..) and assigns that to X. LEN(1) .OP assigns the next single character, the -, to OP. REM .Y assigns the remainder of the string to Y. The result of all this will be left-associativity for + and -.

Then there's a similar pattern for parsing multiplications and divisions. It's exactly the same except it has */ instead of +-.

MDPAT = (ARBNO(BAL ANY('*/')) $X FAIL | *DIFFER(X) TAB(*(SIZE(X)-1))) .X LEN(1) .OP REM .Y

DEFINE('POL(POL)X,Y,OP')

This declares POL as a new function with one formal parameter, also called POL, and three other local variables, X, Y, and OP. Here's the definition of the POL function:

STRIP = '(' BAL .POL ')' RPOS(0)

RPOS(0) is the SNOBOL idiom equivalent to Perl's /whatever$/. This pattern just strips off a pair of outer parentheses if there is one, turning (EXPRESSION) into EXPRESSION.

Here's the definition of the POL function:

POL POL STRIP :S(POL) POL PMPAT = OP '(' POL(X) ',' POL(Y) ')' :S(RETURN) POL MDPAT = OP '(' POL(X) ',' POL(Y) ')' :S(RETURN) POL EXPAT = OP '(' POL(X) ',' POL(Y) ')' :(RETURN)

This is a little tricky, because POL is a function name, a variable name, and a statement label. As a variable, POL is both the function's argument and also, when the function returns, POL is examined for the return value.

The first line strips off a pair of outer parentheses, if there is one, and if so, goes to line POL to try to do that again. When it can't do that, the pattern match fails and falls through to the second line.

The other three lines are very similar to one another. SNOBOL statements that have the form

STRING PATTERN = EXPRESSION :S(LABEL)

are nearly equivalent to Perl's

goto LABEL if $string =~ s/pattern/expression/e;

The POL function tries to match its argument against PMPAT. If this succeeds, the argument was parsed into three parts: The first operand is stored in X, the second in Y, and the operator, either '+' or '-', is in OP. POL is called recursively to transform X and Y into Polish notation, the results are formatted as OP(X,Y), and this result is returned.

If the argument didn't match PMPAT, POL tries matching MDPAT, and if that didn't work, it tries matching EXPAT. Trying the three patterns in this order associates the proper precedences to the five operators.

I was very impressed with this. The entire program is ten statements long. The complicated pattern assignment statements are no worse than similar regexes in Perl, and are rather more powerful. The POL function is entirely straightforward.

A quick checklist of SNOBOL features demonstrated here that Perl doesn't have:

BAL match string with balanced parentheses *(EXPR) for match-time expression evaluation $X for immediate backreference assignment backreference assignment to variables POS(N) jump to position N FAIL force backtracking at this point &ANCHOR force anchor mode by default (this would be bad) Arbitrary expressions in patterns

A typical use of * in Snobol is for recursively defined patterns. For example: If you didn't have BAL, you cuold create it like this:

BAL = SPAN('()') | '(' *BAL ')' | ARBNO(*BAL)

SPAN('()') is the SNOBOL quuivalent of Perl's /[^()]*?/.

There's been some talk about recursive patterns in Perl, but it doesn't have the full generality of SNOBOL's * operator. There's also been some talk about backreference assignment to named variables.

Aside from this, Perl has no equivalents for the other items on this list, all of which sound useful, except perhaps for &ANCHOR. Even &ANCHOR might benefit Perl if it were a regex modifier, with /X/a meaning the same as /^X$/.