1. Like Chris wrote in http://zephyrfalcon.org/weblog/arch_d7_2004_03_20.html#e530 you can use a ZCatalog from Zope. And why would you want to store every word in every document?

    For my weblog, after I have written a new blog entry I click on "Keywords", which lists the unusual words in the text for me to edit. For example, "is" and "internet" are so common that I don't index them. Then I index the blog object into a ZCatalog. I wrote a little script that works out which words are common by feeding it loads and loads of English texts.
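
A minimal sketch of the kind of script described above, assuming a plain frequency count over a reference corpus. The function names, the tokenizing regex, and the `top_n` cutoff are all illustrative assumptions, not the commenter's actual code:

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters or apostrophes.
WORD_RE = re.compile(r"[a-z']+")

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return WORD_RE.findall(text.lower())

def common_words(corpus_texts, top_n=100):
    """Return the top_n most frequent words across a reference corpus."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(tokenize(text))
    return {word for word, _ in counts.most_common(top_n)}

def suggest_keywords(entry_text, stop_words):
    """List the entry's distinct uncommon words, in order of first appearance."""
    keywords = []
    for word in tokenize(entry_text):
        if word not in stop_words and word not in keywords:
            keywords.append(word)
    return keywords
```

The suggested list would then be edited by hand before the entry is indexed.
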
      posted by Peter Bengtsson at 03:42:23 AM on March 28, 2004  
  2. You should use an OOBTree instead of a PersistentMapping. I expect it would reduce the size of your Data.fs tremendously (the whole PersistentMapping is stored again whenever you change a single item). Also, if you indexed several documents per transaction, the database would grow more slowly.

    Also, it might be productive to look for existing ZODB indexing solutions instead of rolling your own.
      posted by Marius Gedminas at 09:19:49 AM on March 28, 2004  
  3. I tried the OOBTree for this purpose, but in some cases it seems to be a lot slower than PersistentMapping. Well, I can still go either way, since they behave the same. The index is not the cause of most of the database growth, though. I'd much rather get rid of the versioning/undo stuff.
      posted by Hans at 09:58:49 AM on March 28, 2004  
  4. OOBTrees really shouldn't be noticeably slower, from the perspective of a user of the system, than a dictionary or a PersistentMapping used in the same place. I'm surprised by that.

    In any case, as you probably know already, the quadratic growth of the data that you see is related to the nature of existing undoing storages (which essentially append repersisted data to a log file on every write transaction). But this is definitely aggravated by the fact that every time you update, add, or remove a key/value pair in your data PersistentMapping, the entire state of the mapping (which consists of a key and a pointer to another persistent object for each key in the mapping) needs to be repersisted. ZODB BTrees don't have this behavior because their state is broken apart into many persistent objects, so only a small part of the BTree needs to be repersisted on an update, addition, or deletion. If you use a BTree instead of a PersistentMapping to store your mapping from filename to file object, transactions which encapsulate additions, updates and deletions to/from that mapping will consume far less space than they are consuming now. It's not as good as having a non-undo storage, but each transaction would consume less space.
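
The difference in growth can be illustrated without ZODB at all. The following is only a rough model using the standard `pickle` module, not a measurement of FileStorage: it assumes an append-only log, keys that arrive in sorted order, and a hypothetical fixed `bucket_size` standing in for a BTree's buckets.

```python
import pickle

def bytes_written_whole_mapping(n_items):
    """Log growth when every insert re-pickles the entire mapping."""
    mapping, total = {}, 0
    for i in range(n_items):
        mapping["doc%05d" % i] = i
        total += len(pickle.dumps(mapping))  # whole state appended each time
    return total

def bytes_written_bucketed(n_items, bucket_size=50):
    """Log growth when only the bucket that changed is re-pickled."""
    buckets, total = [{}], 0
    for i in range(n_items):
        if len(buckets[-1]) >= bucket_size:
            buckets.append({})  # start a new bucket once the last one is full
        buckets[-1]["doc%05d" % i] = i
        total += len(pickle.dumps(buckets[-1]))  # only the touched bucket
    return total
```

In this model the first function grows roughly with the square of the number of items, while the second grows roughly linearly, which is the effect described above.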

    Another strategy that you might try, instead of committing a transaction after every document addition in a batch upload, would be to commit only once for a set of documents. The database only grows when you finalize a transaction via "get_transaction().commit()". If you commit frequently, you will be repersisting the same data over and over to disk. If you commit less frequently, you will repersist that data less frequently (and thus consume less disk space overall) at the expense of potentially consuming more memory during the course of the transaction. Judicious use of ZODB subtransactions and manual "ghosting" of persistent objects or cache clears can reduce memory consumption during the course of a single transaction, too. For an example of how to do this, see Zope's lib/python/Products/ZCatalog.py's "catalog_object" method, about halfway down, starting out "if self.threshold is not None:".
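
The commit-per-batch idea can be sketched as follows. `index_in_batches`, `index_one`, and `commit` are hypothetical names; with ZODB, `commit` would be whatever callable finalizes the transaction, and the batch size is an arbitrary assumption to tune against memory use.

```python
def index_in_batches(documents, index_one, commit, batch_size=100):
    """Index documents, committing once per batch instead of once per document.

    index_one and commit are caller-supplied callables (hypothetical names).
    Returns the number of commits performed.
    """
    commits = 0
    pending = 0
    for doc in documents:
        index_one(doc)
        pending += 1
        if pending >= batch_size:
            commit()
            commits += 1
            pending = 0
    if pending:  # flush the final, partially filled batch
        commit()
        commits += 1
    return commits
```

A larger `batch_size` means fewer commits and less repersisted data, at the cost of holding more uncommitted changes in memory.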

    Unfortunately, the only storage (other than those that store data only in RAM) that I can think of that doesn't do undo is the one you already tried (BDB minimal storage). I have no idea why you see a 167MB database after adding only a few K of files. Maybe this is just Berkeley filesize overhead, and each addition only consumes space more or less equivalent to the picklesize of the document?

    You're right, though, it's a shame there's no widely-used non-undoing storage for ZODB. Other storages aren't very widely used because FileStorage has the right profile for its main application's (Zope's) target purposes: many reads, few writes, no-nonsense setup, and good cross-platform compatibility. It would be nice to see someone come up with a storage that had the same characteristics but didn't continually grow. That said, if you are careful, using an undoing storage doesn't need to mean insane disk space usage.
      posted by Chris McDonough at 01:52:54 PM on March 28, 2004  
  5. You've got to try DirectoryStorage if you are concerned with packing times and memory usage of ZODB.

    http://dirstorage.sourceforge.net/

    And yes, ZCatalog and BTrees are the way to go. Even more, try BTreeFolder2 for simplicity's sake.

      posted by Sergey Konozenko at 03:59:52 PM on March 31, 2004  
  6. I'd probably simplify this:

    try:
        self.index[word].append(id)
    except KeyError:
        self.index[word] = [id]

    to this:

    self.index.setdefault(word, []).append(id)
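
A quick check that the two forms build the same index, using a plain dict in place of `self.index` (the helper function names are illustrative only):

```python
def add_with_try(index, word, doc_id):
    """Original form: append, falling back to creating the list."""
    try:
        index[word].append(doc_id)
    except KeyError:
        index[word] = [doc_id]

def add_with_setdefault(index, word, doc_id):
    """Simplified form: setdefault returns the existing or new list."""
    index.setdefault(word, []).append(doc_id)

postings = [("zope", 1), ("zodb", 1), ("zope", 2)]
index_a, index_b = {}, {}
for word, doc_id in postings:
    add_with_try(index_a, word, doc_id)
    add_with_setdefault(index_b, word, doc_id)
```
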
      posted by Ben Sizer at 09:42:46 AM on April 01, 2004