aquiline ascension

published: 2010-11-25
author: Hans Nowak
tags: chicken, scheme

Deleting duplicates: a comparison

I am working on a pet project of mine (well, one of dozens :-), using Chicken Scheme. Without going into too much detail right now, it involves stripping the HTML from a web page and extracting the "words". (These can be words in the traditional sense, but some characters other than letters and digits are allowed as well.) These are returned as a list of strings. It's likely that many of the words will appear multiple times; however, I want a list in which each word only appears once. In other words, I need a unique function.

After some searching, it turned out that a variant of this function can be found in SRFI-1 (as delete-duplicates). I'm not sure if there are other alternatives available in Chicken, either out of the box or in an egg. I consulted Chickadee but didn't find anything that looked like it; maybe it's there but has a name that is non-obvious or simply didn't occur to my limited imagination, I don't know. Either way, I ended up writing my own version of said function.
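
For concreteness, here is the behavior I'm after (the input is a made-up example, not actual scraped data): each word kept exactly once, in order of first appearance (at least for the list-based versions below).

(unique '("lambda" "the" "ultimate" "the" "lambda" "goto"))
; => ("lambda" "the" "ultimate" "goto")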

First, the naive version:

(define (unique lst)
  (let loop ((lst lst) (acc '()))
    (if (null? lst)
        (reverse acc)
        (let ((x (car lst)))
          (if (member x acc)
              (loop (cdr lst) acc)
              (loop (cdr lst) (cons x acc)))))))

(This was the original version. It doesn't really need to reverse its accumulated results before returning them, but see below.)

This version is great for homework. It is not so great if you have to process a list of 12,000 strings (or more): member has to scan the entire accumulated list for every input element, so its performance is terrible (O(N²) if I'm not mistaken).

Before Python grew a set type, people often used a dictionary to get a list of unique items. You loop over the list, add each item as a key to the dictionary with a dummy value, then return the dictionary's keys. This works well, assuming the order of elements in the result list is unimportant.

Chicken has a hash table type built in, which is a good replacement for Python's dictionaries, so I used that, producing the following function:

(use srfi-69)   ; hash tables

(define (unique-2 lst)
  (let ((hash (make-hash-table size: 1000)))
    (for-each (lambda (word)
                (hash-table-set! hash word #t))
              lst)
    (hash-table-keys hash)))
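
(A small aside, not part of the benchmarks below: if the order of first appearance does matter, the hash table can still be used purely as a "seen" set while the results are accumulated in order. A quick sketch -- unique-2-ordered is just my name for it here, using the same SRFI-69 calls:)

(define (unique-2-ordered lst)
  (let ((seen (make-hash-table size: 1000)))
    (let loop ((lst lst) (acc '()))
      (cond ((null? lst) (reverse acc))
            ((hash-table-exists? seen (car lst))
             (loop (cdr lst) acc))
            (else
             (hash-table-set! seen (car lst) #t)
             (loop (cdr lst) (cons (car lst) acc)))))))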

This is much, much faster -- more than 80 times as fast as the naive version for my test list of ~12K strings (0.037s vs. 3.09s). Here are the results using the time macro, including a few other alternative implementations:

naive, list-based unique:
3.091s CPU time, 0.004s GC time (major), 16 mutations, 1/40489 GCs (major/minor)
unique-2:
0.037s CPU time, 4106 mutations, 0/437 GCs (major/minor)
unique-3:
2.751s CPU time, 0.07s GC time (major), 167365 mutations, 11/199730 GCs (major/minor)
unique-4:
3.123s CPU time, 0.006s GC time (major), 16 mutations, 1/40461 GCs (major/minor)
unique-5:
17.844s CPU time, 13.023s GC time (major), 10 mutations, 79/32092 GCs (major/minor)
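
(In case anyone wants to reproduce these numbers: the measurements are just calls to Chicken's time macro. My actual word list comes from the scraped page, but a synthetic stand-in with plenty of duplicates works the same way; something along these lines:)

(use srfi-1)   ; iota

;; synthetic stand-in for the real ~12K-word list
(define test-words
  (map (lambda (i) (number->string (modulo i 500)))
       (iota 12000)))

(time (unique test-words))
(time (unique-2 test-words))
;; ...and likewise for the other versions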

unique-3 is a version that sorts the list first, then loops over it, eliminating duplicates by comparing the first element of the list being processed to the first element of the accumulated results. I'll post it here for the sake of completeness:

(use data-structures srfi-13)   ; sort, and string< from SRFI-13 (string<? would work too)

(define (unique-3 lst)
  (let ((sorted-list (sort lst string<)))
    (let loop ((lst sorted-list) (acc '()))
      (if (null? lst)
          acc
          (if (and (not (null? acc))
                   (equal? (car lst) (car acc)))
              (loop (cdr lst) acc)
              (loop (cdr lst) (cons (car lst) acc)))))))

It is slightly faster than the naive version; its runtime mostly depends on the performance of sort. Still, it comes nowhere near the hash table version, performance-wise.

unique-4 is a version I found on StackOverflow, originally a reply to someone's homework question. :-) It's similar to the first version:

(define (unique-4 seq)
  (define (loop ans rest)
    (cond ((null? rest) ans)
          ((member (car rest) ans)  ; *new*
           (loop ans (cdr rest)))   ; *new*
          (else
           (loop (cons (car rest) ans)
                 (cdr rest)))))
  (loop '() seq))

Rather shockingly, unique-5 is SRFI-1's delete-duplicates function, which clocks in at an abysmal 17 seconds.
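
(For completeness, unique-5 is essentially just a thin wrapper, roughly like this. Note that delete-duplicates defaults to equal? and is documented in SRFI-1 as running in O(n²) time, so the quadratic behavior itself is expected; the constant factor is what surprises me.)

(use srfi-1)

(define (unique-5 lst)
  (delete-duplicates lst))   ; default comparison is equal?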

Anyway... the moral is that this old Python trick works well in Chicken too. ^_^

One more thing. This puzzles me a bit: the original unique and unique-4 have very similar running times, both around 3.1s on this machine. As I mentioned earlier, unique reverses its result list before returning it, while unique-4 does not, so apparently the reversing doesn't affect performance very much. However... when I replace the (reverse acc) in unique with just acc, both functions become faster! Not much faster, but the runtime of both drops to ~2.7s. These are great mysteries. I assume it has something to do with caching and maybe GC, but I don't know for sure what is going on.
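
(For reference, the variant in question is simply unique with the final reverse dropped, timed against the same word list as before:)

(define (unique-no-reverse lst)
  (let loop ((lst lst) (acc '()))
    (if (null? lst)
        acc
        (let ((x (car lst)))
          (if (member x acc)
              (loop (cdr lst) acc)
              (loop (cdr lst) (cons x acc)))))))

(time (unique test-words))
(time (unique-no-reverse test-words))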

Please use the comments below to point out e.g. that Chicken does have a function somewhere that does this better, or that there are errors in my implementation, etc. <0.5 wink> Of course, if someone can cook up a version that is faster, or if you have an explanation for the phenomenon described in the previous paragraph, that would be great too!
