Subject: Re: [JavaCC] multiple lexical states
From: Brian Goetz
Date: Aug 28, 2003 2:18:58 pm

You can use the curLexState variable and the SwitchTo method to do the state transition explicitly in Java code rather than with the JavaCC syntax. Something like:
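
As a minimal sketch of that idea (not code from the original message), the plain-Java class below stands in for a generated token manager. Only the names curLexState and SwitchTo come from JavaCC; the class names, the state constants DEFAULT and IN_TAG, and the afterToken method are hypothetical and exist only for illustration.

```java
class SwitchToSketch {
    // Hypothetical lexical states. JavaCC generates int constants like these
    // for the states declared in a grammar.
    static final int DEFAULT = 0;
    static final int IN_TAG = 1;

    /** Stand-in for a generated token manager, reduced to curLexState and SwitchTo. */
    static class TokenManagerStandIn {
        int curLexState = DEFAULT;

        void SwitchTo(int lexState) {
            if (lexState < DEFAULT || lexState > IN_TAG) {
                throw new IllegalArgumentException("Invalid lexical state: " + lexState);
            }
            curLexState = lexState;
        }

        /** Plays the role of a lexical action: Java code decides the transition. */
        void afterToken(String image) {
            if (curLexState == DEFAULT && image.equals("<")) {
                SwitchTo(IN_TAG);
            } else if (curLexState == IN_TAG && image.equals(">")) {
                SwitchTo(DEFAULT);
            }
        }
    }

    public static void main(String[] args) {
        TokenManagerStandIn tm = new TokenManagerStandIn();
        for (String image : new String[] {"text", "<", "name", ">"}) {
            tm.afterToken(image);
            System.out.println(image + " -> state " + tm.curLexState);
        }
    }
}
```

In a real grammar the same call would sit inside a lexical action attached to a token definition, where SwitchTo and curLexState are members of the generated token manager.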


If you use the BackupCharStream that I posted to the list recently (this has been in production for two years, so it works) and the associated backup code in your grammar, you can also execute state transitions from semantic actions. This enables a wide range of grammar simplifications.

I'm hoping that this code will make it into the next version of JavaCC, but in the meantime, download BackupCharStream and add the code below to your parser. Here's a repeat of my earlier post:

BackupCharStream can back up by as much as half of the char stream's buffer size; the default buffer size is 4000. It uses a double-buffered read approach to make backup possible, and that approach could probably also be tweaked for better performance.
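
The BackupCharStream source is not included in this message, so the following is only a sketch of why a buffer split into two halves can guarantee backup of half its size. The class name and all of its members are hypothetical, and it shifts the newer half down on refill, which is a simplification rather than true alternation between two halves.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class HalfBufferBackupSketch {
    private final Reader in;
    private final char[] buf;
    private final int half;
    private int pos = 0;     // index of the next char to hand out
    private int filled = 0;  // number of valid chars in buf

    HalfBufferBackupSketch(Reader in, int bufSize) {
        if (bufSize < 2) throw new IllegalArgumentException("bufSize must be at least 2");
        this.in = in;
        this.half = bufSize / 2;
        this.buf = new char[half * 2];
    }

    /** Returns the next char, or -1 at end of input. */
    int read() {
        if (pos == filled && !refill()) return -1;
        return buf[pos++];
    }

    /** Moves the read position back by n chars; only half the buffer is guaranteed. */
    void backup(int n) {
        if (n < 0 || n > pos) {
            throw new IllegalArgumentException("cannot back up " + n + " chars");
        }
        pos -= n;
    }

    private boolean refill() {
        if (filled == buf.length) {
            // Buffer is full: keep the newer half as backup history and
            // free the other half for fresh input.
            System.arraycopy(buf, half, buf, 0, half);
            pos = half;
            filled = half;
        }
        try {
            int n = in.read(buf, filled, Math.min(half, buf.length - filled));
            if (n <= 0) return false;
            filled += n;
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HalfBufferBackupSketch s = new HalfBufferBackupSketch(new StringReader("abcdefghij"), 4);
        for (int i = 0; i < 5; i++) s.read();
        s.backup(2);
        System.out.println((char) s.read());
    }
}
```

Because a refill never discards the most recent half of the buffer, a backup of up to half the buffer size succeeds once that many characters have been read.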

Here's what we put in our parser to support backup:

TOKEN_MGR_DECLS :
{
  // Required by SetState
  void backup(int n) { input_stream.backup(n); }
}

We added a method called SetState to the parser, so you can initiate state transitions from semantic actions. We use this in a number of places, some for convenience, some out of necessity. For example, we extend the JavaCC parser with table-driven extensions (you can add new statements dynamically to the parsed language through a table-driven mechanism). Here's an example where we do it out of convenience -- we want to eat whitespace, but we might be in another state where the whitespace tokens are different. So we 'push' the state, go to the "WM" state, eat and record the whitespace, and pop the state:

void EatWsNlOrSpace(BlockBuilder b) :
{
  int entryState = token_source.curLexState;
  Token w, n=null;
}
{
  { SetState(WM); }
  (
    LOOKAHEAD(<NEWLINE>) <NEWLINE>
  |
    LOOKAHEAD(<WS>)
    (
      w=<WS> [ LOOKAHEAD(<NEWLINE>) n=<NEWLINE> ]
      { if (n == null) b.addElement(w.image.substring(1)); }
    )
  )?
  { SetState(entryState); }
}

Here's another example: quotes. The token definition inside quoted strings is different from outside. Why don't we use static states? Because there are cases in our parser where a quote does NOT necessarily trigger a transition to the QS state.

Builder QuotedString() :
{
  int entryState = token_source.curLexState;
  Token t;
  QuotedStringBuilder qs = new QuotedStringBuilder();
  Object dr;
}
{
  <QUOTE> { SetState(QS); }
  (
    t=<QS_TEXT> { qs.addElement(t.image); }
  | t=<QCHAR> { qs.addElement(t.image.substring(1)); }
  | t=<SLASH> { qs.addElement(t.image); }
  | LOOKAHEAD(t=<DOLLAR>) dr=DollarReference() { qs.addElement(dr); }
  ) *
  <QUOTE> { SetState(entryState); }
  { return qs; }
}

And here's the meat of SetState, which goes in the .jj file. When you do a SetState(newState), it looks at any tokens already matched, including special tokens, gets their image strings, and backs them up on the input stream.

private void SetState(int state) {
  if (state != token_source.curLexState) {
    Token root = new Token(), last=root;
    root.next = null;

    // First, we build a list of tokens to push back, in backwards order
    while (token.next != null) {
      Token t = token;
      // Find the token whose token.next is the last in the chain
      while (t.next != null && t.next.next != null)
        t = t.next;

      // put it at the end of the new chain
      last.next = t.next;
      last = t.next;

      // If there are special tokens, these go before the regular tokens,
      // so we want to push them back onto the input stream in the order
      // we find them along the specialToken chain.

      if (t.next.specialToken != null) {
        Token tt = t.next.specialToken;
        while (tt != null) {
          last.next = tt;
          last = tt;
          tt.next = null;
          tt = tt.specialToken;
        }
      }
      t.next = null;
    }

    // Back up the input stream by the image length of each token on the chain
    while (root.next != null) {
      token_source.backup(root.next.image.length());
      root.next = root.next.next;
    }
    jj_ntk = -1;
    token_source.SwitchTo(state);
  }
}
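
To make the push-back order concrete, here is a small self-contained model of the same walk over the token chain. It is a sketch under assumptions: Tok is a hypothetical stand-in for the generated Token class, reduced to its image, next and specialToken fields, and the result is collected in a List rather than by relinking tokens as SetState does.

```java
import java.util.ArrayList;
import java.util.List;

class PushBackOrderSketch {
    /** Stand-in for JavaCC's Token, reduced to the three fields used here. */
    static class Tok {
        String image;
        Tok next;          // next regular token already read from the input
        Tok specialToken;  // special token (e.g. whitespace) just before this one
        Tok(String image) { this.image = image; }
    }

    /**
     * Returns the images of all tokens after `current`, last-matched first,
     * each followed by its special tokens, and detaches them from `current`.
     */
    static List<String> pushBackOrder(Tok current) {
        List<String> order = new ArrayList<>();
        while (current.next != null) {
            Tok t = current;
            // Walk to the predecessor of the last token in the chain.
            while (t.next.next != null) t = t.next;
            order.add(t.next.image);
            for (Tok s = t.next.specialToken; s != null; s = s.specialToken) {
                order.add(s.image);
            }
            t.next = null;  // drop the token we just scheduled for push-back
        }
        return order;
    }

    /** Total number of characters the input stream has to be backed up by. */
    static int charsToBackUp(List<String> order) {
        int total = 0;
        for (String image : order) total += image.length();
        return total;
    }

    public static void main(String[] args) {
        Tok a = new Tok("a"), bc = new Tok("bc"), def = new Tok("def");
        bc.specialToken = new Tok(" ");
        a.next = bc;
        bc.next = def;
        List<String> order = pushBackOrder(a);
        System.out.println(order + " -> back up " + charsToBackUp(order) + " chars");
    }
}
```

The walk restarts from the current token on every pass, so it is quadratic in the number of tokens already read ahead; that is normally a very short chain, bounded by the lookahead in effect.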