Parsing CSV files with Lisp.

In a previous blog post I talked about the parsing method I use here. It’s a hand made state machine, which (if you do it right) can be remarkably short. This parser parses CSV files. I haven’t widely tested it, but it should be compatible with excel & gmail exported csv files. CSV isn’t standard but generally:

  • Use Commas as the deliminator.
  • Any field that has a comma or a newline is treated as complex.
  • Complex values are wrapped in quotes.
  • Quotes in complex values are represented as double quotes.

This parser shoud support any line ending (cr+lf, lf or cr) and handles regular and complex values. I’ve heard of CSV files using C style escapes (like \n) but I’ve never seen one, and this this parser doesn’t support that.

Read the rest of this entry »


Lisp Forum: Its like sudoku but for lispers.

I thought I’d plug Lisp Forum, a forum I frequent that answers Lisp related questions; I go by the name “Warren Wilkinson”.

I go when I’m taking a break from my code. The problems posted by new lispers are often short and stand-alone, like a good puzzle. And you learn a lot too; In this post new user aaron4osu created a cypher to permutate the words in a file. His version of randomize-list was very clever.

Randomize List


(defun randomize-list (list)  
  (mapcar #'car (sort (mapcar (lambda (c) (cons c (random 1.0))) list)
                      #'> :key #'cdr)))

I’ve written randomize-list before, but mine worked by taking items from the list in random order to build up a new list:


(defun randomize-list (a)
  (let ((a (copy-list a))
        (b nil)
        (n (length a)))
    (loop repeat n
          for place = (random (length a))
          for item = (elt a place)
          do (push item b)
          do (setf a (delete item a :start place :count 1)))
    b))

Doing it aaron4osu’s way is about 3 times faster, and you can make it slightly faster by consing random fixnums and using a fixnum-only <.


(defun fixnum< (a b) (> (the fixnum a) (the fixnum b)))
(defun randomize-list (list)  
  (mapcar #'cdr (sort (mapcar (lambda (c) (cons (random #xFFFFFF) c)) list)
                      #'fixnum< :key #'car)))

Removing Duplicates

Surprisingly, it’s faster to remove duplicates after than to filter them before. Using remove-duplicates is about 20 times faster than checking them before hand (but using a hash table to test inclusion is much quicker, but still twice as slow as remove-duplicates). In SBCL, using delete-duplicates was really slow.

Replacing non-destructive routines with destructive ones (e.g. map vs map-into, remove-duplicates vs delete-duplicates, a new string vs replace) makes things slower.

Templating

In this post, I probably scared newbie indianerrostock away, when he asked for improvements to his templating program.

Given a text file that contained #SPORTNAME# in multiple places, his program replaced that with the name of a sport. The kicker was it did it several times in batch for many sport names.

His method was the obvious one: Loop through the string searching for #SPORTNAME# while making a copy of the string with #SPORTNAME# replaced.

My solution was to sort the sport names biggest to smallest and precompute where all the #SPORTNAMES# where, and then replace them in the source string (i.e. no duplicate string, no memory allocation, etc) with a sport name. Then I would repeat this replacing the previous sport name with the next one.

My super optimized no-allocation and no-searching method was 2.5 times slower. It boggles the mind.

How Fast is Sorting?

This post starts out about CLOS and ends with a discussion of sorting. SBCL sorts lists using merge sort, but vectors with heap-sort. stable-sort for both vectors and lists uses merge-sort.

Its faster to use STABLE-SORT for vectors.

Also, merge-sort works well without random access, so sorting singly linked lists is fast. There is no need to convert the list into a vector before sorting.

Why So Counter-intuitive?

Well, I think a lot of this has to do with memory allocation in Lisp. Lisp is garbage collected, most Lisps use a stop-and-copy garbage collector. This is like a constant defrag on your memory. One benefit of which is that allocation is VERY FAST — far faster than C’s malloc. In C you’d have to search a binary tree for available space , but in lisp, you just increment the memory-end pointer.

If you made a linked list a node at a time C would have you search a hundred times, while Lisp would increment a pointer 100 times. C’s linked list would be scattered in memory, while Lisp’s would be continuous.

This is why Lisp is so fast at functional programming. You create a lot of garbage, but you create it fast, and it goes away fast.


Co-Routines in RethinkDB

RethinkDB is a startup that is ‘Rethinking Databases’, they are making a new databases optimized for today’s hardware (like fast random-access disks) and research (like append only databases).

Lots of their blog posts are pretty interesting, especially to me since my project, FormLis, is powered by a database I wrote myself. So I contrast how they solve problems and how I did.

In this blog post, the guys at RethinkDB talk about their use of co-routines, a subject I’ve briefly covered before. In the co-routine spectrum they fall under the ‘separate stack’ co-routine model.

It sounds to me like they have a thread per cpu that accesses the database, and these threads request resources and block until they are ready (they probably wait on a disk-fetching thread). Multiple threads from a program talk to these database threads, so the database threads don’t like to wait for the disk, when they could be doing work for a different program thread.

Previously they implemented blocking and notification with call backs, but found it unsatisfactory, because you had to split your logic into several parts — the forward parts and the callback parts — implement them as methods and bundle them into a class. You also had to keep your variables as instance variables.

Using co-routines, you can do a more C-like approach. Instead of using objects-representing-logic, you make callbacks by cloning the stack and bundling it into a co-routine. Because you copied the stack, both the co-routine callback and the original caller can see and (safely) modify your existing local variables. The effect is that callback code is written inline with regular code, and can use local variables; its a lot like a closure, but more heavy-weight.

So, how does FormLis do it? Well for starters I only have 1 database, sitting in (essentially) unpartitioned space. RethinkDBs use case is multiple databases on multiple drives, operating concurrently with (say) apache that’s reading files almost at random. So for them it makes a lot of sense to not wait, but go on and access the next file. For me, I only have 1 databases that has structures to multiplex to multiple clients, It makes sense to wait.

RethinkDB also uses their co-routines for high level operations that do block. Here FormLis is different, because there is no (long term) blocking at the high level (the only waits are for mutexes). My use case is many, many more readers than writers (an assumption RethinkDB can’t make). I designed my database so that every transaction runs (successfully) to completion, and only when they try to commit do I throw an error that essentially says “Oh sorry, another transaction collided with you, and you’ve been toast for a while now.” At this point my transaction just does a GOTO back to where the transaction started and reruns it.

Thats another advantage to my database, because of my restricted domain, every transaction comes from only a few places in my Lisp code. I don’t have to handle transactions in a generic sense like a SQL db. So when a transaction fails, I don’t have to examine what it did to try it again, I just GOTO.


HACK: Co-routines in SBCL Lisp

There is a lot of mystery around co-routines; What are they? What can you do with them? Does language X have them? Experts agree they can implement multithreading, and they are the mechanism behind the yield operator in certain languages and are thus the driving force of ‘generators’. They are often mentioned along with continuations; “Continuations”, they say “can be used to implement co-routines”. What on earth does it mean! Read the rest of this entry »


A Handful Of Useful Lisp Programming Tricks

Here are a few useful Lisp tricks, some are applicable to other languages. Read the rest of this entry »


How I Debugged a Bad Race Condition

The last week and a half I’ve spent searching for race conditions in my database code. A race condition is a bug where two (or more) processes interact in an unforseen way causing bad things to happen. Read the rest of this entry »


Programming Language Syntax

In last weeks article I used Python; I didn’t like the experience. I don’t like its syntax which is just like C, C++, Basic, Java, C#, Fortran, &c. These languages maintain an invented separation between statements and expressions. Read the rest of this entry »


Forth in Lisp

In preparation of FormLis leaving open alpha and entering beta status, I though I’d write about some of the technology behind the system. One cool feature is the powerful view engine, which permits powerful yet readable view descriptions. Read the rest of this entry »


Follow

Get every new post delivered to your Inbox.

Join 33 other followers