Flexible Parsing without Regexps

I had trouble writing a wiki parser for FormLis because it had to feel natural. Some wikis require exactly N spaces before a bullet asterisk, or exactly four spaces for an indented line; FormLis doesn’t care. The parser I’ve written has no lexer and no regexps.

(Feb 14: I’ve written an article about parsing CSV files — this one has Lisp code!)

[Screenshot: FormLis parser in action]

But regexps are the building blocks of most parsers, so why use an alternative? For starters, regexps are black boxes: you can’t integrate your own logic into them. A regexp matches the letters and multiplicities you requested, but it can’t condition its match on:

  • A specific substring having occurred an odd or even number of times beforehand. (Can be used to pair the ‘*’ characters that set up italic regions.)
  • An upper-case flag, which was set by detecting a run of capitals. (Used to detect regions to make BOLD).
  • A character belonging to a list of HTML-entity characters, without writing that list into dozens of regexps, and with no way to mark where the match occurred. (Used to properly HTML-escape characters like >, < and &.)
  • A whitespace character that can be SPACE or UNDERLINE depending on what is being parsed. (So I can reuse the parser in both plain text and inside field definitions).

If I wanted to use regexps I’d have to pick a prewritten regexp based on those flags, and the extra conditionals and regexp variations would make the parser unmaintainable.
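To make that concrete, here’s a hedged sketch of what flag-driven regexp selection looks like (the patterns and names are hypothetical, not FormLis code); every new flag multiplies the number of patterns to maintain:

import re

# Hypothetical patterns: one per whitespace style. Add the bold/italic/html
# flags and the table of patterns grows multiplicatively.
BULLET_PLAIN = re.compile(r"^ *\* (.*)$")   # SPACE counts as whitespace
BULLET_FIELD = re.compile(r"^_*\* (.*)$")   # UNDERLINE counts as whitespace

def match_bullet(line, in_field_definition):
    # Pick the prewritten pattern that matches the current parsing mode.
    pattern = BULLET_FIELD if in_field_definition else BULLET_PLAIN
    m = pattern.match(line)
    return m.group(1) if m else None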

A Custom DFA compiler is no better

I wrote a custom regexp-style compiler (i.e. a deterministic finite automaton (DFA) compiler) that matched by function instead of by character. It could detect bulleted items with:

(dfa (:single #'indentation-match)
     (:single #'number)
     (:many #'number)
     (:single #'period)
     (:or (:and (:single #'space) (:capture (:many *))) (:single #'newline)))

It was ugly: the expressions were long, the DFA compiler was obtuse, and because functions replaced characters the temptation was to write simple, reusable functions. That was problematic because high-level matchers can’t be built out of low-level ones, neither here nor in regexps: an email-detecting regexp can’t be embedded within another regexp without writing it out again.
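As a hedged illustration of that last point (the patterns here are hypothetical), the only way to ‘embed’ one regexp in another is to paste its source text, and doing so silently renumbers the capture groups of the enclosing pattern:

import re

EMAIL = r"(\w+)@(\w+\.\w+)"                 # groups 1 and 2
LINE  = r"From: " + EMAIL + r" \((\w+)\)"   # the sender's name is now group 3, not 1

m = re.match(LINE, "From: bob@example.com (Bob)")
print(m.group(3))   # prints 'Bob'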

Problems With Regexes & DFAs

  • They require a custom language and compiler.
  • They can’t use outside information (like checking or setting the status of a flag variable).
  • They return just pass/fail and maybe some captured input.
  • They don’t combine.
  • They have limited domain (e.g. characters).

State Machines are the solution

The problems come from the regexp compilers, not from the underlying state machines that regexps and DFAs ultimately compile to. So skip the compiler and hand-write the state machines; it’s not too hard if we borrow good ideas from Finite State Machines in Forth.

The Example Code

I’m going to write the example code in Python (so I can try WordPress’s syntax highlighting). My parser is written in SBCL Common Lisp and was inspired by Forth; the technique is widely applicable, but nowhere near as beautiful as in its native Forth.

I’ll be producing a parser that transforms UPPER CASE TEXT into bold text and turns anything enclosed in ‘*’ into emphasized text. Emoticons, URLs, and some common HTML-escaped characters are also picked out.

Implementation

The first ‘trick’ is to use a 2D array (conceptually; I actually simulate it with a 1D array) to map the current state and the incoming input to an action. The X position is decided by the incoming character’s class and the Y position by the current state number. The second ‘trick’ is to separate the state’s action from the state transition. This permits reuse of the action code, as you’ll see in this example parser, which collects the first three alphabetic characters of the input string.

collected = ""
state = 0

def input_class(c):
    # Classify the incoming character: 1 for a letter, 0 for anything else.
    return 1 if ('a' <= c <= 'z') or ('A' <= c <= 'Z') else 0

def collect(c):
    # Action: append the character to the result.
    global collected
    collected += c

def skip(c):
    # Action: ignore the character.
    return

# Each row is a state; each (action, next-state) pair is a column
# selected by input_class().
#      Other        A-Za-z
sm = [ skip, 0,     collect, 1,   # 0 chars collected
       skip, 1,     collect, 2,   # 1 char collected
       skip, 2,     collect, 3, ] # 2 chars collected

def process(incoming):
    global collected, state
    collected = ""
    state = 0
    for c in incoming:
        offset = (state * 4) + (2 * input_class(c))
        statefunction = sm[offset]   # the action for this (state, input) cell
        statefunction(c)
        state = sm[offset + 1]       # the transition for this cell
        if state > 2: break          # state 3 means done: three letters collected
    return collected


>>> process(" 135yo u")
'you'
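Tracing that call: the leading space and the digits fall into class 0, so skip runs and the state stays put; ‘y’ and ‘o’ are class 1, so collect appends them and the state counts up; the embedded space is skipped without resetting the count; and the ‘u’ pushes the state to 3, which ends the loop with “you” collected.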

The Larger Parser

In Forth you’d put a string pointer on top of the stack and have each action update that pointer. In Lisp you might try a system-area-pointer, but you’d have to pass arguments explicitly. Python has no pointers at all, so I’ll keep it simple and use a global string plus a global offset into that string.

I apologize in advance for the horizontal scrolling on the source code. I don’t think there is a way to break long python lines.

string = ""
position = 0
state = 0
collected = [ ["text", False, False, 0, 0] ]
italic_p = False
bold_p = False

def current_char():
    global string, position
    return string[position]

def next_char():
    # Return the next character, or False if we're at the end of the string.
    global string, position
    return string[position+1] if position + 1 < len(string) else False

def emote_p():
    # An emoticon is ':' or ';' followed by ')', '(' or 'P'.
    return next_char() in [')', '(', 'P']

def url_p():  # Simple URL detector. Returns the position after the '.com', or False.
    global string, position
    if string.startswith("www.", position) or string.startswith("http://", position):
        compos = string.find(".com", position)
        if compos > -1:
            return compos + 4
    return False

def html_p():
    # Characters that need HTML escaping.
    return current_char() in ['<', '>', '&']

def set_bold():
    # A word is bold when it starts with two consecutive capital letters.
    global bold_p
    nxt = next_char()
    bold_p = ('A' <= current_char() <= 'Z') and (nxt is not False) and ('A' <= nxt <= 'Z')

Now that we’ve defined some globals and some useful primitives, it’s time to define some words for use in the state machine. The idea is that I collect a set of ‘ranges’ into collected. Each ‘range’ has a label (text, emote, url, escape), bold and italic flags, and a start and end position.
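For example, in the sample run at the end of this article the entry ['escape', False, False, 37, 38] says that character 37 of the input (the ‘&’) needs HTML escaping and sits in a region that is neither bold nor italic; the end index is exclusive.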


def emit():
    # Action: extend the current range by one character, or give an
    # HTML-escapable character its own 'escape' range and start a fresh
    # text range after it.
    global collected, position, bold_p, italic_p
    if html_p():
        collected.append(["escape", bold_p, italic_p, position, position+1])
        collected.append(["text", bold_p, italic_p, position+1, position+1])
    else:
        collected[-1][4] += 1
    position += 1

def start_word():
    # Action: begin a new word, first recomputing the bold flag; reuse the
    # previous range when none of its flags would change.
    global bold_p, italic_p, position
    set_bold()
    previous = collected[-1]
    if previous[0] == "text" and previous[1] == bold_p and previous[2] == italic_p and previous[4] == position:
        emit()
    else:
        collected.append(["text", bold_p, italic_p, position, position+1])
        position += 1

def maybe_emote():
    # Action: ':' or ';' may start an emoticon; otherwise treat it like any
    # other word character.
    global bold_p, italic_p, position
    if emote_p():
        collected.append(["emote", bold_p, italic_p, position, position+1])
        collected.append(["text", bold_p, italic_p, position+2, position+2])
        position += 2
    else:
        start_word()

def maybe_url():
    # Action: 'w' or 'h' may start a URL; otherwise treat it like any other
    # word character.
    global bold_p, italic_p, position
    pos = url_p()
    if not pos:
        start_word()
    else:
        collected.append(["url", bold_p, italic_p, position, pos])
        collected.append(["text", bold_p, italic_p, pos, pos])
        position = pos

def maybe_word():
    # Transition: after the action has run, report whether we are now inside
    # a word (1) or between words (0).
    global string, position
    if position >= len(string):
        return 0
    return 0 if current_char() in ['\n', ' ', '\t', '\r', '*', '(', '[', '{', '}', ']', ')'] else 1

def t_i():  # toggle italic
    global italic_p, position
    italic_p = not italic_p
    position += 1
    collected.append(["text", bold_p, italic_p, position, position])

def zero():
    return 0

def one():
    return 1

I’ve changed transitions from constants to functions. This lets the transition ensure that open braces don’t count as word starts. The rest of the program is modified but familiar.

#      Other                   : or ;                   w or h                 *                whitespace
sm = [ start_word, maybe_word, maybe_emote, maybe_word, maybe_url, maybe_word, t_i, zero,       emit, zero, #Start of word
       emit, one,              emit, one,               emit, one,             t_i, maybe_word, emit, zero] # Inside word

def input_class(c):
    if c == ':' or c == ';':
        return 1
    elif c == 'w' or c == 'W' or c == 'h' or c == 'H':
        return 2
    elif c == '*':
        return 3
    elif c == ' ' or c == '\t' or c == '\n' or c == '\r':
        return 4
    else:
        return 0

def process(incoming):
    global position, string, collected, state, italic_p, bold_p
    string = incoming
    position = 0
    state = 0
    collected = [ ["text", False, False, 0, 0] ]
    italic_p = False
    bold_p = False
    while position < len(incoming):
        offset = (state * 10) + (2 * input_class(current_char()))
        statefunction = sm[offset]
        statefunction()
        transfunction = sm[offset+1]
        state=transfunction()
        if state > 1: break
    return collected

>>> process("This *IS a (test string ;( for* you) & YOUR FRIENDS AT http://www.formlis.com check it out!")
[['text', False, False, 0, 5], ['text', False, True, 6, 6], ['text', True, True, 6, 9], ['text', False, True, 9, 24], ['emote', False, True, 24, 25], ['text', False, True, 26, 30], ['text', False, False, 31, 37], ['escape', False, False, 37, 38], ['text', False, False, 38, 39], ['text', True, False, 39, 55], ['url', True, False, 55, 70], ['text', True, False, 70, 71], ['text', False, False, 71, 84]]
Which means:

  • ‘This ‘ is regular text
  • ‘IS ‘ is bold-italic
  • ‘a (test string ‘ is italic-text
  • ‘;(‘ is an emote
  • ‘ for’ is italic-text
  • ‘ you) ‘ is regular text
  • ‘&’ is an escapable character
  • ‘ ‘ is regular text
  • ‘YOUR FRIENDS AT ‘ is bold text
  • ‘www.formlis.com’ is a url (still bold)
  • ‘ ‘ is bold text
  • ‘check it out!’ is regular text

We’ve successfully picked out emotes, escapable HTML characters, bold, italics, and URLs, all in about 130 lines of code.
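The ranges are easy to turn into markup afterwards. Here is a minimal sketch of a renderer, my own assumption of how the output might be consumed rather than the FormLis code (the emote.png image name is a placeholder):

def escape(text):
    # Minimal HTML escaping for the characters html_p() recognizes.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def render(string, collected):
    # Walk the ranges produced by process() and wrap each in the right tags.
    # A hand-rolled sketch, not the FormLis renderer.
    out = []
    for label, bold, italic, start, end in collected:
        text = string[start:end]
        if not text:
            continue            # skip the empty placeholder ranges
        if label == "escape":
            text = escape(text)
        elif label == "url":
            text = '<a href="%s">%s</a>' % (text, text)
        elif label == "emote":
            text = '<img alt="%s" src="emote.png">' % escape(text)
        if bold:
            text = "<strong>%s</strong>" % text
        if italic:
            text = "<em>%s</em>" % text
        out.append(text)
    return "".join(out)

Feeding it the string and collected list from the run above should produce ordinary HTML, with the bold, italic, anchor, image, and escaped spans in the right places.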

The text parser behind FormLis is a lot like this, but with added rules for form fields and more robust URL and bold handling. It’s remarkable how little code can recognize so many details. Try that with your regexps!

Request For Comments

My previous post (Forth in Lisp) got views, but that doesn’t tell me if readers understood or enjoyed the article. I wrote this code in Python to try to make it more accessible, which I’m not convinced has worked.

Please tell me: which article was easier to read? What’s your favorite language for reading code? Were you able to understand both articles? Would you prefer simpler, but more limited, examples? What do you use for making diagrams for computer-science-related things?


3 Comments on “Flexible Parsing without Regexps”

  1. Joseph Iacobucci says:

    I came to the same conclusion once while fighting a regex (use a state machine!). I took advantage of tail calls to write mine.
    http://uniquelyjoe.blogspot.com/2010/02/slide-title-extractor-part-3-of-3.html
    problem definition:
    http://uniquelyjoe.blogspot.com/2010/02/slide-title-extractor-part-1-of-3.html
    excuse the poor writing.

    I don’t think that you will please people by switching the language around, I would show the actual code that you are using rather than translating it.

  2. formlis says:

    You are right about switching the language. I didn’t enjoy writing python code & this entry got a dismal number of views compared to the previous Lisp one; and I don’t think subject matter was the problem.

    My next post will probably recover this ground with Lisp code. Emacs htmlize will (hopefully) allow me to pretty print the code.

  3. Halley says:

    How is Sea Salt different from Salt?

