Parsing CSV files with Lisp.Posted: 2011/02/14
In a previous blog post I talked about the parsing method I use here. It’s a hand made state machine, which (if you do it right) can be remarkably short. This parser parses CSV files. I haven’t widely tested it, but it should be compatible with excel & gmail exported csv files. CSV isn’t standard but generally:
- Use Commas as the deliminator.
- Any field that has a comma or a newline is treated as complex.
- Complex values are wrapped in quotes.
- Quotes in complex values are represented as double quotes.
This parser shoud support any line ending (cr+lf, lf or cr) and handles regular and complex values. I’ve heard of CSV files using C style escapes (like \n) but I’ve never seen one, and this this parser doesn’t support that.
p.s I’m also using Posterous to try posting this. Unfortunately posterous code embedder also doesn’t support lisp — so I continue to have to use emacs HTMLIZE.
- Seperator character (“,” by default)
- Start (eats spaces and look for quote (–> 3) or letters (–> 2)
- Start + CR. Like previous but seeing LF won’t emit another newline.
- Unquoted field. Take characters and look for “,” (–> 0) or newline (CR –> 1, LF –> 0).
- Quoted field. Take characters (incl. \n & “,”) and look for another quote (–>5)
- Quoted field + CR. Like previous but seeing LF won’t emit another newline.
- Quoted field + “. If next character is “, collect and continue (–> 3). If the next character is a “,” or newline than field is finished (CR –> 1, “,” and LF –> 0). If it’s space go to whitespace eating (–> 6). If the next character is text — it’s an error.
- Whitespace eating — this quoted field is done (but has spaces after it) anything but a newline or seperator or more spaces is an error.
(defun parse-csv (string &optional (sep #\,) &aux (state 0) (pos 0) (results (list (list "")))) (labels ((noop (c) (declare (ignore c))) (ship (c) (declare (ignore c)) (push (list "") results)) (addc (c) (setf (caar results) (concatenate 'string (caar results) (list c)))) (next (c) (declare (ignore c)) (setf results (cons (cons "" (car results)) (cdr results)))) (addlf (c) (declare (ignore c)) (addc #\Newline)) (instlf (c) (declare (ignore c)) (addlf) (ship)) (err (c) (error "CSV Parse error, quoted value ended with ~a instead of ~a at position ~d" c sep pos))) (let ((fn (make-array (* (* 6 2) 7) :initial-contents ;; WHITE, RETURN, LF, QUOTE, SEP, OTHER (list #'noop 0 #'ship 1 #'ship 0 #'noop 3 #'next 0 #'addc 2 ;; Start #'noop 0 #'ship 1 #'noop 0 #'noop 0 #'next 0 #'addc 2 ;; Seen Return #'addc 2 #'ship 1 #'ship 0 #'addc 2 #'next 0 #'addc 2 ;; Unquoted #'addc 3 #'addlf 4 #'addlf 3 #'noop 5 #'addc 3 #'addc 3 ;; Quoted #'addc 3 #'addlf 4 #'noop 3 #'noop 5 #'addc 3 #'addc 3 ;; Quoted seen Return #'noop 6 #'ship 1 #'ship 0 #'addc 3 #'next 0 #'err 0 ;; Quoted seen quote. #'noop 6 #'ship 1 #'ship 0 #'err 0 #'next 0 #'err 0 ;; Quoted seen quote space )))) (loop for char across string for class = (case char (#\Space 0) (#\Return 1) (#\Linefeed 2) (#\" 3) (otherwise (if (char= char sep) 4 5))) do (funcall (elt fn (+ (* class 2) (* state 12))) char) do (incf pos) do (setf state (elt fn (+ 1 (* class 2) (* state 12)))) finally (return (remove '("") (nreverse (mapcar #'nreverse results)) :test #'equalp))))))
As before, check out Finite State Machines In Forth that’s where I learned this technique from.
But, what I’ve done is build 2D matrix. One axis is current state, and the other is input character class: either Whitespace, Return, Linefeed, quote, delimiter or ‘other’.
To every combination of state & character class I associate a function to run, and the state to transition to. It isn’t very difficult. You sketch it out in the margins, reason it through, and add states and character classes as you realize you need them.