Parsing CSV files with Lisp.

In a previous blog post I talked about the parsing method I use here. It’s a hand made state machine, which (if you do it right) can be remarkably short. This parser parses CSV files. I haven’t widely tested it, but it should be compatible with excel & gmail exported csv files. CSV isn’t standard but generally:

  • Use Commas as the deliminator.
  • Any field that has a comma or a newline is treated as complex.
  • Complex values are wrapped in quotes.
  • Quotes in complex values are represented as double quotes.

This parser shoud support any line ending (cr+lf, lf or cr) and handles regular and complex values. I’ve heard of CSV files using C style escapes (like \n) but I’ve never seen one, and this this parser doesn’t support that.

p.s I’m also using Posterous to try posting this. Unfortunately posterous code embedder also doesn’t support lisp — so I continue to have to use emacs HTMLIZE.


Character Classes

  1. Space
  2. Return
  3. Linefeed
  4. Quote
  5. Seperator character (“,” by default)
  6. Other

States

  1. Start (eats spaces and look for quote (–> 3) or letters (–> 2)
  2. Start + CR. Like previous but seeing LF won’t emit another newline.
  3. Unquoted field. Take characters and look for “,” (–> 0) or newline (CR –> 1, LF –> 0).
  4. Quoted field. Take characters (incl. \n & “,”) and look for another quote (–>5)
  5. Quoted field + CR. Like previous but seeing LF won’t emit another newline.
  6. Quoted field + “. If next character is “, collect and continue (–> 3). If the next character is a “,” or newline than field is finished (CR –> 1, “,” and LF –> 0). If it’s space go to whitespace eating (–> 6). If the next character is text — it’s an error.
  7. Whitespace eating — this quoted field is done (but has spaces after it) anything but a newline or seperator or more spaces is an error.

(defun parse-csv (string &optional (sep #\,) &aux (state 0) (pos 0) (results (list (list ""))))
  (labels ((noop   (c) (declare (ignore c)))
           (ship   (c) (declare (ignore c)) (push (list "") results))
           (addc   (c) (setf (caar results) (concatenate 'string (caar results) (list c))))
           (next   (c) (declare (ignore c)) (setf results (cons (cons "" (car results)) (cdr results))))
           (addlf  (c) (declare (ignore c)) (addc #\Newline))
           (instlf (c) (declare (ignore c)) (addlf) (ship))
           (err    (c) (error "CSV Parse error, quoted value ended with ~a instead of ~a at position ~d" c sep pos)))
    (let ((fn (make-array (* (* 6 2) 7) :initial-contents
                          ;;    WHITE,    RETURN,     LF,         QUOTE,     SEP,       OTHER
                          (list #'noop 0  #'ship 1    #'ship 0    #'noop 3   #'next 0   #'addc 2 ;; Start
                                #'noop 0  #'ship 1    #'noop 0    #'noop 0   #'next 0   #'addc 2 ;; Seen Return
                                #'addc 2  #'ship 1    #'ship 0    #'addc 2   #'next 0   #'addc 2 ;; Unquoted
                                #'addc 3  #'addlf 4   #'addlf 3   #'noop 5   #'addc 3   #'addc 3 ;; Quoted
                                #'addc 3  #'addlf 4   #'noop 3    #'noop 5   #'addc 3   #'addc 3 ;; Quoted seen Return
                                #'noop 6  #'ship 1    #'ship 0    #'addc 3   #'next 0   #'err  0 ;; Quoted seen quote.
                                #'noop 6  #'ship 1    #'ship 0    #'err  0   #'next 0   #'err  0 ;; Quoted seen quote space
                                ))))
      (loop for char across string
            for class = (case char (#\Space 0) (#\Return 1) (#\Linefeed 2) (#\" 3) (otherwise (if (char= char sep) 4 5)))
            do (funcall (elt fn (+ (* class 2) (* state 12))) char)
            do (incf pos)
            do (setf state (elt fn (+ 1 (* class 2) (* state 12))))
            finally (return (remove '("") (nreverse (mapcar #'nreverse results)) :test #'equalp))))))

As before, check out Finite State Machines In Forth that’s where I learned this technique from.

But, what I’ve done is build 2D matrix. One axis is current state, and the other is input character class: either Whitespace, Return, Linefeed, quote, delimiter or ‘other’.

 

To every combination of state & character class I associate a function to run, and the state to transition to. It isn’t very difficult. You sketch it out in the margins, reason it through, and add states and character classes as you realize you need them.

 


About these ads

4 Comments on “Parsing CSV files with Lisp.”

  1. formlis says:

    Zach (of Planet Lisp) asked me if I’d looked at pre-existing libraries for CSV parsing. I actually hadn’t.

    Here are the couple I found:

    Fare http://common-lisp.net/gitweb?p=users/frideau/fare-csv.git.

    csv-parser http://members.optusnet.com.au/apicard/csv-parser.lisp

  2. Sohail says:

    I use csv-parser.lisp. Works good enough for the import I need.

  3. programmingpraxis says:

    I solved this exercise in a couple of exercises on my Programming Praxis blog. Here‘s a function to read csv records, and here‘s a framework for reading fielded input files, including csv and other record types. Code in Scheme, but should translate easily to Lisp.

  4. formlis says:

    Note: If you follow the programming Praxis links click the Page 2 link at the bottom to see the scheme solution code.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 33 other followers