Flexible Parsing without Regexps
Posted: 2010/07/07 Filed under: FormLis, Forth, Parsing, Python | Tags: FormLis, Parsing, Python 2 Comments »I had trouble writing a wiki parser for FormLis because it had to feel natural. Some wiki’s require N spaces before the bullet asterisks, or exactly 4 spaces for an indented line. FormLis doesn’t care. The parser I’ve written has no lexer and no regexps.
(Feb 14: I’ve written an article about parsing CSV files — this one has Lisp code!)

But regexps are the building blocks of most parsers, so why use an alternatives? For starters regexps are blackboxs’, you can’t integrate your own logic into them. Regexps match the letter(s) and multiplicity you requested, but it can’t match:
- A specific substring having occurred an odd or even number of times before hand. (Can be used to pair ‘*’ which sets up italic regions).
- An upper-case flag, which was set by detecting a run of capitals. (Used to detect regions to make BOLD).
- A character being part of an html-entity character list without writing out the list dozens of times in dozens of regexps, and you can forget about marking the place. (Used to properly HTML escape characters like >, < and &).
- A whitespace character that can be SPACE or UNDERLINE depending on what is being parsed. (So I can reuse the parser in both plain text and inside field definitions).
If I wanted to use regexps I’d have to pick a prewritten regex based on those flags. The extra conditionals and regex variations make the parser unmaintainable.
A Custom DFA compiler is no better
I wrote a custom regex compiler (e.g. a deterministic finite automaton (DFA) compiler) that matched by function instead of character. It could detect bulleted items with:
(dfa (:single #'indentation-match)
(:single #'number)
(:many #'number)
(:single #'period)
(:or (:and (:single #'space) (:capture (:many *))) (:single #'newline)))
It was ugly: expressions were long, the dfa compiler was obtuse, and because functions replaced characters the temptation was to write simple reusable functions. Which was problematic because high level matchers can’t be built of low-level ones, not here nor in regexps: an email detecting regex can’t be embedded within another regex without writing it again.
Problems With Regexes & DFAs
- Require a custom language and compiler.
- They can’t use outside information (like checking or setting the status of a flag variable).
- They return just pass/fail and maybe some captured input.
- They don’t combine.
- They have limited domain (e.g. characters).
State Machines are the solution
The problems come from the regex compilers, not the underlying state machine that regexps and DFA’s ultimately compile too. So skip the compiler and hand-write the state machines; It’s not too hard if we borrow good ideas from Finite State Machines in Forth.
The Example Code
I’m going to write example code in Python (so I can try WordPress’ syntax highlighting). My parser is in SBCL Lisp, and was inspired by Forth; this technique is widely applicable, but nowhere as beautiful as in its native Forth.
I’ll be producing a parser that transforms UPPER CASE TEXT into bold text, things encased in ‘*’ will be turned into emphasized text. Emoticons, URLs, and some common HTML escaped characters are also picked out.
Implementation
The first ‘trick’ is using a 2D array (conceptually. I just simulate 2D with a 1D array) to map the input and current state to an action. The X-position is decided by the incoming character’s type and the Y-position by the current state number. The second ‘trick’ is to separate the state code from state transition. This permits reuse of the state code, as you’ll see in this example parser which collects the first 3 alpha characters in the input string.
collected = ""
state = 0
def input_class(c):
return 1 if ('a' <= c <= 'z') or ('A' <= c <= 'Z') else 0
def collect(c):
global collected
collected+=c
def skip(c):
return
# Other A-Za-z
sm = [ skip, 0, collect, 1, # 0 chars
skip, 1, collect, 2, # 1 char
skip, 2, collect, 3, ] # 2 chars
def process(incoming):
global collected, state, sm
collected = ""
state = 0
for c in incoming:
offset = (state * 4) + (2 * input_class(c))
statefunction = sm[offset]
statefunction(c)
state = sm[offset+1]
if (state > 2): break
return collected
>>> process(" 135yo u")
'you'
The Larger Parser
In Forth you’d put a string pointer on the top of the stack and have each action update that pointer. In Lisp you might try a system-area-pointer, but you’d have to explicitly pass arguments. In Python you’d probably need C pointers; I’ll keep it simple and use global string and offset-into-string variables.
I apologize in advance for the horizontal scrolling on the source code. I don’t think there is a way to break long python lines.
string = ""
position = 0
state = 0
collected = [ ["text", False, False, 0, 0] ]
italic_p = False
bold_p = False
def current_char():
global string, position
return string[position];
def next_char():
global string, position
return string[position+1] if position < len(string) else False;
def emote_p ():
return True if next_char() in [')', '(', 'P'] else False;
def url_p (): # Simple URL detector. Returns the position after the '.com'
global string, position
if string.startswith("www.", position) or string.startswith("http://", position):
compos = string.find(".com", position)
if compos > -1:
return compos + 4;
return False
def html_p ():
return True if current_char() in ['<', '>', '&'] else False;
def set_bold():
global bold_p
if 'A' <= current_char() <= 'Z' and 'A' <= next_char() <= 'Z':
bold_p = True
else:
bold_p = False
Now that we’ve defined some globals and some useful primitives, its time to define some words for use in the state machine. The idea is that I collect a set of ‘ranges’ into collected. Each ‘range’ has a label (text, emote, url, etc), bold and italic flags, and start and end position.
def emit():
global collected, position, bold_p, italic_p, position
if html_p():
collected.append(["escape", bold_p, italic_p, position, position+1])
collected.append(["text", bold_p, italic_p, position+1, position+1])
else:
collected[len(collected) - 1][4]+=1
position+=1
def start_word():
global bold_p, italic_p, position
set_bold()
previous = collected[len(collected) - 1]
if previous[0] == "text" and previous[1] == bold_p and previous[2] == italic_p and previous[4] == position:
emit()
else:
collected.append(["text", bold_p, italic_p, position, position+1])
position+=1
def maybe_emote():
global bold_p, italic_p, position
if emote_p():
collected.append(["emote", bold_p, italic_p, position, position+1])
collected.append(["text", bold_p, italic_p, position+2, position+2])
position+=2
else:
start_word()
def maybe_url():
global bold_p, italic_p, position
pos = url_p()
if (not pos):
start_word()
else:
collected.append(["url", bold_p, italic_p, position, pos])
collected.append(["text", bold_p, italic_p, pos, pos])
position=pos
def maybe_word():
return 0 if current_char() in ['\n', '*' ' ', '\t', '\r', '(', '[', '{', '*', '}', ']', ')'] else 1;
def t_i(): # toggle italic
global italic_p, position
italic_p = not italic_p
position+=1
collected.append(["text", bold_p, italic_p, position, position])
def zero():
return 0;
def one():
return 1;
I’ve changed transitions from constants to functions. This lets the transition ensure that open braces don’t count as word starts. The rest of the program is modified but familiar.
# Other : or ; w or h * whitespace
sm = [ start_word, maybe_word, maybe_emote, maybe_word, maybe_url, maybe_word, t_i, zero, emit, zero, #Start of word
emit, one, emit, one, emit, one, t_i, maybe_word, emit, zero] # Inside word
def input_class(c):
if c == ':' or c == ';':
return 1
elif c == 'w' or c == 'W' or c == 'h' or c == 'H':
return 2
elif c == '*':
return 3
elif c == ' ' or c == '\t' or c == '\n' or c == '\r':
return 4
else:
return 0
def process(incoming):
global position, string, collected, state, italic_p, bold_p
string = incoming
position = 0
state = 0
collected = [ ["text", False, False, 0, 0] ]
italic_p = False
bold_p = False
while position < len(incoming):
offset = (state * 10) + (2 * input_class(current_char()))
statefunction = sm[offset]
statefunction()
transfunction = sm[offset+1]
state=transfunction()
if (state > 1): break
return collected
>> process("This *IS a (test string ;( for* you) & YOUR FRIENDS AT http://www.formlis.com check it out!"). Which means:
[['text', False, False, 0, 5], ['text', False, True, 6, 6], ['text', True, True, 6, 9], ['text', False, True, 9, 24], ['emote', False, True, 24, 25], ['text', False, True, 26, 30], ['text', False, False, 31, 37], ['escape', False, False, 37, 38], ['text', False, False, 38, 39], ['text', True, False, 39, 55], ['url', True, False, 55, 70], ['text', True, False, 70, 71], ['text', False, False, 71, 84]]
- ‘This ‘ is regular text
- ‘IS ‘ is bold-italic
- ‘a (test string ‘ is italic-text
- ‘;(‘ is an emote
- ‘ for’ is italic-text
- ‘ you) ‘ is regular text
- ‘&’ is an escapable character
- ‘ ‘ is regular text
- ‘YOUR FRIENDS AT ‘ is bold text
- ‘www.formlis.com’ is a url (still bold)
- ‘ ‘ is bold text
- ‘check it out!’ is regular text
We’ve successfully picked out emotes, escapable HTML characters, bold, italic and urls, all in about 130 loc.
The text parser behind FormLis is a lot like this but with added rules for form fields and more robust url and bold handling. But it’s remarkable how little code can recognize so many details. Try that with your regexps!
Request For Comments
My previous post (Forth in Lisp) got views, but that doesn’t tell me if readers understood or enjoyed the article. I wrote this code in Python to try to make it more accessible, which I’m not convinced has worked.
Please tell me: which article was easier to read? Whats your favorite language for reading code? Were you able to understand both articles? Would you prefer simpler, but more limited, examples? What do you use for making diagrams for computer-science related things?
I came to the same conclusion once while fighting a regex (use a state machine!). I took advantage of tail calls to write mine.
http://uniquelyjoe.blogspot.com/2010/02/slide-title-extractor-part-3-of-3.html
problem definition:
http://uniquelyjoe.blogspot.com/2010/02/slide-title-extractor-part-1-of-3.html
excuse the poor writing.
I don’t think that you will please people by switching the language around, I would show the actual code that you are using rather than translating it.
You are right about switching the language. I didn’t enjoy writing python code & this entry got a dismal number of views compared to the previous Lisp one; and I don’t think subject matter was the problem.
My next post will probably recover this ground with Lisp code. Emacs htmlize will (hopefully) allow me to pretty print the code.