Preservation musings
There’s a debate in progress on the DSpace tech and development lists about the costs and benefits of modularizing the DSpace architecture and providing plugin hooks. The list gurus are doing quite a remarkable job of heading off unproductive arguments at the pass, I must say.
Gives me to think, though, about this Dan Cohen post which I promised to respond to a long time ago and never located sufficient round tuits for. Also gives me to think about library data silos and how lovely it would be to breach them, and which silos ought to be targeted first—and I had a huge post in the pipeline about this, but I trashed it because Roy Tennant and Lorcan Dempsey are saying everything I had to say better than I could have said it.
“Preservation,” like most other words, has multiple meanings. DSpace’s definition of the word is “once you’ve got it, hang onto it like grim death.” Which is a good definition. Dan Cohen’s definition is different, though, more like “grab it before it disappears!” Which is another good definition. And DSpace is horrible, horrible, horrible at that.
(I do find myself wondering about permissions issues with regard to projects like the Hurricane Digital Memory Bank. I’ve had to do all kinds of tapdancing I don’t especially care for to be sure I’m not exposing MPOW to copyright liability. I don’t know what CHNM is doing about the problem—but if they aren’t doing anything, I strongly recommend they talk to the nice university lawyers, or the Copyright Office.)
Something I would love to bolt onto DSpace is a mudroom, a front lobby, a sandbox. Someplace to stash a whole bunch of files and know they’re being looked after (checksums, backups, format-checking on upload, assigning a temporary identifier, et cetera) until I have time to do a proper workup on them. DSpace’s concept of “workflow” just doesn’t extend far enough back in the process: you can’t dump a file in without entering its metadata first, and sometimes the metadata entry just plain needs to wait. Not least because of permissions issues!
Such a thing would help solve Dan’s problem, I believe, and it’d solve a lot of my workflow problems, too.
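To make the mudroom idea a little more concrete, here is a minimal sketch of what intake might look like: copy the file in, give it a temporary identifier, record a checksum, and ask for no metadata at all. Everything in it is my own assumption for illustration (the function names `stage_file` and `still_intact`, the `payload` directory, the `record.json` file); none of it is DSpace's API, and it skips the backups and format-checking a real version would need.

```python
import hashlib
import json
import shutil
import uuid
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_file(source: Path, mudroom: Path) -> str:
    """Copy a file into the mudroom and record its checksum; no metadata required."""
    temp_id = uuid.uuid4().hex  # hypothetical temporary identifier, not a handle
    slot = mudroom / temp_id
    payload = slot / "payload"
    payload.mkdir(parents=True)
    stored = payload / source.name
    shutil.copy2(source, stored)
    record = {
        "temp_id": temp_id,
        "filename": source.name,
        "sha256": sha256_of(stored),
    }
    (slot / "record.json").write_text(json.dumps(record, indent=2))
    return temp_id


def still_intact(mudroom: Path, temp_id: str) -> bool:
    """Recompute the checksum and compare it with the one recorded at intake."""
    slot = mudroom / temp_id
    record = json.loads((slot / "record.json").read_text())
    stored = slot / "payload" / record["filename"]
    return sha256_of(stored) == record["sha256"]
```

The point of the design is that the only thing a depositor has to supply is the file itself; the proper workup can attach to the temporary identifier whenever there is time for it.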
As for modifying-to-preserve, I think DSpace has the right answer to that, frankly: don’t change and delete (as Dan suggests), change and add. DSpace’s media filter lets one build in automatic file transformations, with the result of the transformation added to an item as a new file. It would be entirely possible to bring in miscellaneous junk images and transform them all to (say) PNGs of the same bit depth and quality, without throwing the originals away. This is a good thing. You just never know when that original, however junky, will come in handy.
The problem at that point is that DSpace’s file addressing is godawful, as is its concept of the relationship between items and files (and, for that matter, between items and items—there’s “ispartofseries” in Dublin Core, and that’s about it). Here are some things I can’t do:
- Reliably address files in DSpace from outside. (It’s possible; it’s just ugly, and its URLs aren’t 404-proof the way handles are.) So I can’t, say, build a nice pretty whizzy photograph library that uses DSpace as its back-end. The only solution is to get all the files out of DSpace and store them elsewhere. Which is frankly rather wasteful, and leads to stupid pointless fights about who stores what where. (Yes, been-there-done-that. Don’t even ask.)
- Build a pretty whizzy photograph library into DSpace.
- Do content-negotiation with client software to pass the right file format over. This is even more a pity because DSpace’s media filter goes a long way toward solving nasty content-negotiation problems.
- Concatenate files for presentation to the client when that makes sense.
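The content-negotiation item in the list above is the easiest one to sketch. Suppose an item holds an original file plus a few derivatives of the kind the media filter produces; the client's HTTP Accept header could then decide which one goes over the wire. This is a deliberately simplified reading of Accept (it orders by q-value and handles wildcards, but ignores the finer specificity rules), and the names `parse_accept`, `matches`, and `choose_file` are hypothetical, not anything DSpace provides.

```python
from __future__ import annotations


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Split an Accept header into (media type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when the client does not state it
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's own order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def matches(pattern: str, media_type: str) -> bool:
    """True if an Accept pattern such as image/* covers a concrete media type."""
    if pattern == "*/*":
        return True
    if pattern.endswith("/*"):
        return media_type.startswith(pattern[:-1])
    return pattern == media_type


def choose_file(accept: str, available: dict[str, str]) -> str | None:
    """Pick a filename from available (media type -> filename), or None."""
    for pattern, q in parse_accept(accept):
        if q <= 0:
            continue
        for media_type, filename in available.items():
            if matches(pattern, media_type):
                return filename
    return None


# Hypothetical item: one preservation original plus two derivatives.
scan = {"image/tiff": "scan.tif", "image/png": "scan.png", "image/jpeg": "scan.jpg"}
```

A browser that asks for `image/webp, image/jpeg;q=0.8` would get `scan.jpg` here, and the TIFF original stays put for anyone who asks for it by name.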
I need DSpace to understand that multiple files in an item may have different relationships to each other. They may be different formats of the same content (as with my poetry collection, which contains a lossy mp3 plus either an AIFF or a WAV for the non-lossiness of it all). They may be components of an overall whole, as with websites. DSpace kinda-sorta groks websites, but things other than websites have that structure and ought to have it respected. They may be ordered sequences, as with the hundreds of page-image TIFFs in a couple of the books I have that were scanned for preservation. They may be slightly different views of the same intellectual object; I don’t have anything like this that I can think of, but photographs may want to be stored this way.
Or—and here’s the really fun part!—some combination of the above. Some of my poetry pieces have an introduction in separate soundfiles. So there’s two mp3s and two WAVs; each file stands in a format relationship to one file and a sequence relationship to another. And I cannot represent that in DSpace. At all. It’s insanely frustrating, because it locks DSpace’s files behind a wall of incomprehensibility such that it’s all but impossible to build kewl ’leet public-facing services and websites from them. Why are we locking out the remix culture we ought to be embracing?
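For what it's worth, the representation I'm asking for doesn't have to be elaborate. Here is a minimal sketch of the poetry case, with each relationship stored as a typed link between two files in one item. The class name `Item`, the relationship kinds `format_of` and `followed_by`, and the filenames are all invented for illustration; this is not how DSpace models anything today, which is rather the complaint.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Item:
    """One item whose files carry typed relationships to each other."""

    files: set[str] = field(default_factory=set)
    # Each relation is (subject, kind, object),
    # for example ("intro.mp3", "format_of", "intro.wav").
    relations: set[tuple[str, str, str]] = field(default_factory=set)

    def relate(self, subject: str, kind: str, obj: str) -> None:
        """Record a relationship and make sure both files belong to the item."""
        self.files.update((subject, obj))
        self.relations.add((subject, kind, obj))

    def formats_of(self, name: str) -> set[str]:
        """Files holding the same content as `name` in another format."""
        found = set()
        for subject, kind, obj in self.relations:
            if kind != "format_of":
                continue
            if subject == name:
                found.add(obj)
            elif obj == name:
                found.add(subject)
        return found

    def sequence_from(self, first: str) -> list[str]:
        """Follow followed_by links from `first`, returning files in order."""
        following = {s: o for s, k, o in self.relations if k == "followed_by"}
        ordered: list[str] = []
        current: str | None = first
        while current is not None and current not in ordered:
            ordered.append(current)
            current = following.get(current)
        return ordered


# The poetry case: an introduction and a poem, each as mp3 and WAV.
poem = Item()
poem.relate("intro.mp3", "format_of", "intro.wav")
poem.relate("poem.mp3", "format_of", "poem.wav")
poem.relate("intro.mp3", "followed_by", "poem.mp3")
poem.relate("intro.wav", "followed_by", "poem.wav")
```

With that in place, a public-facing service could ask for the lossy sequence (`poem.sequence_from("intro.mp3")`) to stream and the WAV equivalents (`poem.formats_of("poem.mp3")`) to offer for download, which is exactly the sort of thing that's out of reach now.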
The danger, of course, is that adding these relationships adds a layer of complexity to DSpace management. Fine, okay, I agree with you. But at least give me the option to enrich my stuff in this fashion!
Maybe when the plugin architecture arrives. In DSpace 1.4. Which we won’t see until March at the earliest. Sigh.