Comments

Interesting idea but the slowness can be attributed to a couple of factors:

- Pickles are good for anything less than 80 megabytes. Growing beyond that cripples your system due to hard-disk activity. In a previous project, my pickle grew to over 100 megabytes, and my computer was essentially useless for about 4 minutes while it was being loaded.

- There isn't enough horsepower in today's average computer to support object databases. I didn't say they don't work. They just work too slowly for humans to accept.

- For a different spin on the same idea without having to deal with a database, take a look at DocIndexer (http://www.methods.co.nz/docindexer/). People have called it googling on the desktop.

posted by Hoang Do at 06:31:15 AM on March 25, 2004
Zope uses ZCatalog to do indexing for tasks like this. It isn't exactly factored well, having undergone many changes while retaining backwards compatibility over four years or so, but it works quite nicely. In particular, ZCTextIndex (written mainly by Tim Peters) is a really nice full-text index that ZCatalog makes use of, and it is probably factored in such a way that it could be used outside of ZCatalog. Its test suite is particularly good, and it can probably serve as the basis for a reasonable indexing system for your purposes.

As far as the previous commenter's observations go, the first applies to any serialization system (you need to be smarter about it than just having one pickle per file if it's a large item: having an 80MB string in RAM is almost never good). The second just really isn't true, although I agree that you might need to understand the fundamentals of how a given object database operates in order to query and store to it efficiently. In particular, ZODB does not do any indexing of data internally: you need to DIY. OTOH, with ZODB, you don't need to write any marshalling code (as you do all over the place with RDBs), so often it's a wash in the end.
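
To illustrate the "be smarter than one big pickle" point, here is a minimal sketch using the standard library's shelve module, which pickles each value under its own key so that a single item can be loaded without reading everything else. The function names and the sample notes are made up for illustration; this is not how ZODB or the original project stores data.

```python
import os
import shelve
import tempfile


def save_notes(path, notes):
    """Store each note under its own key, so each one is pickled separately."""
    with shelve.open(path) as db:
        for key, note in notes.items():
            db[key] = note


def load_note(path, key):
    """Load a single note without unpickling the others."""
    with shelve.open(path, flag="r") as db:
        return db[key]


# Example usage in a throwaway directory (hypothetical data).
folder = tempfile.mkdtemp()
store = os.path.join(folder, "notes")
save_notes(store, {"wax": "events via metaclasses", "zodb": "index it yourself"})
print(load_note(store, "wax"))
```

The trade-off is that anything spanning many items (a search, for instance) still has to visit each key, which is where a separate index comes in.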
posted by Chris McDonough at 10:10:55 AM on March 25, 2004
Interesting project Hans. But a file system already has all of these features. So, apart from being a curious hacker, what's the point? ...when there already exists grep, gzip, find and CVS.

I suppose one could have a little python module that "gates" the file system so that you can do::

    fs_db.addFile(r'C:\report.tex')  # raw string, so that \r is not read as a carriage return

which would analyse the file and stick it where it belongs. I.e. 'tex' files to /documents/texes/reports/cv/ by inspecting what the file contains. Similarly::

    for each in fs_db.findByContent("polymorphism"):
        print(each.location())

An added benefit is that you get a lovely browsing feature, e.g. Explorer.
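
A rough sketch of what such a "gate" module might look like follows. Everything here is hypothetical: the FileGate class, its method names, and the folder layout are invented for illustration, and it sorts files by extension only, which is a simplification of "inspecting what the file contains".

```python
import shutil
import tempfile
from pathlib import Path


class FileGate:
    """Hypothetical gate that files documents under a root folder."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def add_file(self, source):
        """Copy the file into documents/<extension>/ and return the new path."""
        source = Path(source)
        kind = source.suffix.lstrip(".").lower() or "misc"
        target_dir = self.root / "documents" / kind
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / source.name
        shutil.copy2(source, target)
        return target

    def find_by_content(self, text):
        """Yield every stored file whose text contains the given string."""
        for path in sorted(self.root.rglob("*")):
            if path.is_file() and text in path.read_text(errors="ignore"):
                yield path


# Example usage in a throwaway directory.
workdir = Path(tempfile.mkdtemp())
draft = workdir / "report.tex"
draft.write_text("A short note on polymorphism.")
gate = FileGate(workdir / "library")
stored = gate.add_file(draft)
for each in gate.find_by_content("polymorphism"):
    print(each)
```

Note that find_by_content here is just a grep over the tree, so it rereads every file on each search.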
posted by Peter Bengtsson at 11:03:08 AM on March 25, 2004
Chandler has a similar vision. Perhaps you should evaluate it for ideas.
posted by an anonymous coward at 11:31:33 AM on March 25, 2004
"""But a file system already has all of these features. So, apart from being a curious hacker, what's the point? ...when there already exists grep, gzip, find and CVS."""

It's very important that I can associate a "document" with keywords. Let's say I jot down an idea about, for example, automating the creation of events in Wax using metaclasses. :-) Appropriate keywords for that piece of text would then be 'python', 'wax', 'metaclasses', maybe also 'events' or 'programming'. I want to be able to do a search later on any (combination of) these keywords, and get this document back (together with other ones matching the keywords, of course).

A file system will not let me do this (easily). I'm sure it's possible to write some kind of metadata in a special file, associating a filename with certain keywords. The obvious problem with this approach is that if a file is moved, copied, etc, from outside of the program, then the file with metadata is not updated. It also means that my program now must take care of copying or moving, otherwise my document will always be stuck in the same spot.

Also, a file system is hierarchical... I could store the text mentioned above in python/wax, but not at the same time in python/metaclasses. (Well, I guess an alias would make this possible, but I'm on Windows...) The point is that I don't really want a hierarchical structure.
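
The keyword association described above can be sketched as a small inverted index: each keyword maps to the set of documents tagged with it, and a search intersects those sets. The KeywordIndex class and the document ids are made up for illustration, and treating a "combination" of keywords as "all of them must match" is an assumption.

```python
from collections import defaultdict


class KeywordIndex:
    """Hypothetical non-hierarchical store: documents are found by keyword."""

    def __init__(self):
        self._texts = {}
        self._by_keyword = defaultdict(set)

    def add(self, doc_id, text, keywords):
        """Store a document and file it under each of its keywords."""
        self._texts[doc_id] = text
        for keyword in keywords:
            self._by_keyword[keyword.lower()].add(doc_id)

    def search(self, *keywords):
        """Return the ids of documents tagged with ALL the given keywords."""
        if not keywords:
            return []
        matches = [self._by_keyword.get(k.lower(), set()) for k in keywords]
        return sorted(set.intersection(*matches))


index = KeywordIndex()
index.add("idea-1", "Automate Wax events using metaclasses.",
          ["python", "wax", "metaclasses", "events"])
index.add("idea-2", "Notes on ZODB indexing.", ["python", "zodb"])
print(index.search("python"))
print(index.search("python", "wax"))
```

The same document shows up under 'python' and under 'wax' without being stored twice, which is what a strict folder hierarchy makes awkward.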
posted by Hans at 12:31:59 PM on March 25, 2004
Why not use a real database? I've worked on something similar in the past, for storing notes and stuff. With MySQL and full-text searching, it works well.
posted by Bryan L. Fordham at 02:52:18 PM on March 25, 2004
Um, the ZODB *is* a real database... But if you mean "relational database", I tried that with the first version a few years ago, which used MS SQL Server, using Delphi's database-aware components etc. Like I said, the problem was that I had to traverse the whole table record by record to do the "Pythonic" search, which made things very slow and inefficient. -- Of course, I have to do the same now... hmm...
posted by Hans at 05:34:15 PM on March 25, 2004
Xerox PARC experimented with similar ideas some years ago: Placeless Documents. Worth a look.

And don't forget WinFS, of course.
posted by has at 10:07:42 AM on March 26, 2004
It seems like the really important factor is providing metadata and searchability for documents that aren't in the database -- like an ebook or other document.

Like, I'd love to have a search index of all the web pages I look at. Or the ability to specially mark those pages with keyword arguments. But I don't know that I want to keep the page locally (a Google-like cache is nice, but I'd rather look at the original if it's available).

The nice part of this in Python is that you should be able to define some sort of document interface that is common over several different storage mediums -- files, URLs, RDBMS data, ZODB data, etc. The value is in adding searchable metadata to all of these items.
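
One way such a common document interface might look, as a sketch only: an abstract base class carries the keywords, and each storage medium just says where its content lives. The class names and the find helper are invented for illustration, and nothing here actually reads a file or fetches a URL.

```python
from abc import ABC, abstractmethod


class Document(ABC):
    """Hypothetical common interface: metadata here, storage in subclasses."""

    def __init__(self, keywords=()):
        self.keywords = {k.lower() for k in keywords}

    @abstractmethod
    def location(self):
        """Return where the original content can be found."""

    def matches(self, keyword):
        return keyword.lower() in self.keywords


class FileDocument(Document):
    def __init__(self, path, keywords=()):
        super().__init__(keywords)
        self.path = path

    def location(self):
        return self.path


class UrlDocument(Document):
    def __init__(self, url, keywords=()):
        super().__init__(keywords)
        self.url = url

    def location(self):
        return self.url


def find(documents, keyword):
    """Search across all kinds of documents by keyword alone."""
    return [doc.location() for doc in documents if doc.matches(keyword)]


docs = [
    FileDocument("wax-ideas.txt", ["python", "wax"]),
    UrlDocument("http://plone.org", ["python", "zope"]),
]
print(find(docs, "python"))
```

The search code never needs to know which medium a document lives in; it only sees keywords and a location.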

Include emails, contact information, and ephemeral items like to-do items and calendar information, and you perhaps get Chandler... but not necessarily. Right now they are thinking a lot about storage, but that might be a distraction.

posted by Ian Bicking at 10:00:02 PM on March 27, 2004
The main point of the Chandler storage was to avoid making the user define a schema up front. Users are very poor at estimating the kind of data they will need to store. A pure relational application essentially precludes this by design. You should be able to pour 'your stuff' into it and derive a rough schema from the content and metadata after the fact.

I don't know if it's practical, but it's an interesting idea.
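
As a toy illustration of deriving a rough schema after the fact (not a description of how Chandler actually does it), the sketch below takes loosely structured items and reports which fields occur and what types their values have. The infer_schema function and the sample items are invented for the example.

```python
from collections import defaultdict


def infer_schema(items):
    """Map each field name to the sorted type names seen for its values."""
    seen = defaultdict(set)
    for item in items:
        for field, value in item.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


stuff = [
    {"title": "Wax events", "keywords": ["python", "wax"]},
    {"title": "Dentist", "when": "2004-04-02", "done": False},
    {"title": 42},
]
print(infer_schema(stuff))
```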
posted by Jeff Sacksteder at 05:31:18 AM on March 30, 2004
You've described the basics of DEVONthink (http://www.devon-technologies.com/products/devonthink.php). It's already written and ready to use! Of course if you're more interested in the project of writing the program than using it, you might not be interested. :) And it is a commercial product, but costs a not-unreasonable $40. There's a free demo and all that.
posted by Jim at 10:50:26 PM on March 30, 2004
The idea reminds me of Plone (http://plone.org) with very little customization.
posted by Sergey Konozenko at 12:49:57 PM on April 01, 2004