Caveat Lector » Batch-replacing items in DSpace

Dies Martis, 8 Ianuarii 2008

Batch-replacing items in DSpace

Something I ought to have mentioned in yesterday’s post is the -t flag. This does a “test run” of your import, catching many (though not all) problems. (It will not notice if something is wrong inside your dublin_core.xml file. If you don’t have a dublin_core.xml file, it will notice.) I always run an import command with -t, then if it runs clean, arrow-up to recall the command, delete the -t, and off it goes.

If an import does happen to choke and die in the middle, don’t panic; running the command again with the -R (capital R, for “resume”) flag will pick up the import where it left off. Mind the case: lowercase -r means “replace.”

Right. Now, moving on to the situation where an individual item or every item in a collection is seriously messed up, and would be much faster to correct outside DSpace. This can be done! I have done it. But it’s annoyingly error-prone.

Step one is to export the item or collection. This works a lot like importing. As the DSpace administrator user, go to DSpace’s bin directory and run the following:

  • dsrun org.dspace.app.itemexport.ItemExport: Command invocation.
  • --type=COLLECTION: Or ITEM, depending on which you’re exporting.
  • --id=0123/4567: The item or collection’s handle.
  • --dest=/home/me/stuff: The directory on the server where the exported items should end up. Make sure the DSpace administrator user can write to this directory!
  • --number=1: The exporter names the individual item directories with sequential numbers. Instead of peeking into the directory, finding the highest-numbered existing directory, and adding one (which would be the KIND way to handle this), DSpace insists that you give it a start number.

In toto: dsrun org.dspace.app.itemexport.ItemExport --type=COLLECTION --id=0123/4567 --dest=/home/me/stuff --number=1

Inside each exported item’s directory, you’ll see a “contents” file, a “dublin_core.xml” file, one or more license files, and the bitstreams, all of which should be fairly familiar territory. You will also see a file called “handle” which is a plain-text file containing (surprise!) the item’s handle. At this point you can download all the folders and fix whatever you need to.
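Since the whole process is error-prone, it can be worth confirming that every exported directory still has the files described above before you re-import. What follows is only a rough sketch of such a check: the function name and the list of required files are my own choices, based on the file names mentioned in this post, and it does not look at license files or bitstreams.

```python
import os

# File names taken from the description above; adjust if your export differs.
REQUIRED_FILES = ("contents", "dublin_core.xml", "handle")


def find_incomplete_items(export_dir):
    """Map each item directory name to the required files it is missing."""
    problems = {}
    for name in sorted(os.listdir(export_dir)):
        item_dir = os.path.join(export_dir, name)
        if not os.path.isdir(item_dir):
            continue  # ignore stray files sitting next to the item directories
        missing = [
            filename
            for filename in REQUIRED_FILES
            if not os.path.isfile(os.path.join(item_dir, filename))
        ]
        if missing:
            problems[name] = missing
    return problems
```

Calling find_incomplete_items("/home/me/stuff") returns an empty dictionary when every item directory has all three files; anything else deserves a look before you run the import.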

To re-import the items without losing their handles or duplicating them, you need to create a mapfile. This is just a plain-text file, with a folder name and the corresponding item’s handle on each line, separated by a space:

1 0123/4567
2 0123/8901
3 0123/2345

The way I do this, since the exporter isn’t smart enough to create a mapfile on its own, is with a little Python hack that runs through a directory of items and associates each item’s “handle” file with its directory name. (I meant to upload my Python hackery yesterday, but either WordPress or Apache was and is being extremely annoying about letting a Python source file load, so hang on while I sort that out—and with any luck I won’t break my blog permalinks this time!)
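For the impatient, here is a minimal sketch of what such a mapfile builder could look like. It is not the script mentioned above, just an illustration under the assumptions this post describes: each exported item sits in its own directory and contains a one-line “handle” file. The function names are made up for the example.

```python
import os


def _sort_key(name):
    # Numbered directories first, in numeric order (so 10 follows 9, not 1).
    return (0, int(name), "") if name.isdecimal() else (1, 0, name)


def build_mapfile_lines(export_dir):
    """Return one 'directory handle' line per exported item directory."""
    lines = []
    for name in sorted(os.listdir(export_dir), key=_sort_key):
        handle_path = os.path.join(export_dir, name, "handle")
        if not os.path.isfile(handle_path):
            continue  # not an exported item directory; skip it
        with open(handle_path, encoding="utf-8") as handle_file:
            handle = handle_file.read().strip()
        if handle:
            lines.append(f"{name} {handle}")
    return lines


def write_mapfile(export_dir, mapfile_path):
    """Write the mapfile for export_dir to mapfile_path."""
    lines = build_mapfile_lines(export_dir)
    with open(mapfile_path, "w", encoding="utf-8") as mapfile:
        for line in lines:
            mapfile.write(line + "\n")
```

Something like write_mapfile("/home/me/stuff", "/home/me/mapfile.txt") would then produce a file in the format shown above. Directories without a “handle” file are skipped silently, which you may or may not want.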

Now you need to keep DSpace from duplicating metadata on re-import. Yes, DSpace will do this if you let it. One way to deal with this is to run a script called ds-migrate from the bin folder on your items before you batch-import them.

I don’t like this solution, however, despite its being fast and easy. The script is intended for the not-uncommon situation where you mount a collection on a test server and then want to migrate it over to production, leaving no hint whatever that it was ever anywhere else. The script therefore wipes existing provenance and date information—which is bad for a collection you’re exporting from and re-importing to a production server. You’re losing important item history there!

So what I do—and you may well decide differently—is only kill the really troublesome extra metadata out of all the dublin_core.xml files: dc.format.extent, dc.format.mimetype, and dc.identifier.uri.

The first two are easy regular-expression replaces: <dcvalue element="format" qualifier="extent">[^<]+</dcvalue> (you can make the appropriate substitution for dc.format.mimetype without my help, I’m sure!). The last one is a tiny bit trickier, because you only want to get rid of identifier.uri when it’s the DSpace-assigned handle, not when someone has actually entered a different URI. Most people, then, will want this: <dcvalue element="identifier" qualifier="uri">http://hdl\.handle\.net/[^<]+</dcvalue> (If you run your own handle server instead of using CNRI’s, substitute its URL, of course.)
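If you would rather not do the replaces by hand in an editor, the same cleanup can be scripted. The sketch below assumes the exported dcvalue elements carry exactly the element and qualifier attributes shown above and nothing else, and that the files are UTF-8; the function names are invented for the example. Try it on a copy of your export first.

```python
import os
import re

# One pattern per troublesome field. Each also swallows the indentation
# and the line break around the element so no blank lines are left behind.
_TEMPLATE = r'[ \t]*<dcvalue element="{0}" qualifier="{1}">{2}</dcvalue>[ \t]*\n?'
PATTERNS = [
    re.compile(_TEMPLATE.format("format", "extent", r"[^<]+")),
    re.compile(_TEMPLATE.format("format", "mimetype", r"[^<]+")),
    # Only handle URIs; substitute your own handle server's URL if needed.
    re.compile(_TEMPLATE.format("identifier", "uri", r"http://hdl\.handle\.net/[^<]+")),
]


def strip_generated_metadata(xml_text):
    """Remove the extent, mimetype, and handle-URI elements from xml_text."""
    for pattern in PATTERNS:
        xml_text = pattern.sub("", xml_text)
    return xml_text


def clean_export(export_dir):
    """Rewrite every dublin_core.xml under export_dir; return changed paths."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(export_dir):
        if "dublin_core.xml" not in filenames:
            continue
        path = os.path.join(dirpath, "dublin_core.xml")
        with open(path, encoding="utf-8") as xml_file:
            original = xml_file.read()
        cleaned = strip_generated_metadata(original)
        if cleaned != original:
            with open(path, "w", encoding="utf-8") as xml_file:
                xml_file.write(cleaned)
            changed.append(path)
    return changed
```

A URI that someone entered by hand, such as one pointing at a publisher’s site, does not match the handle pattern and is left alone.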

The element dc.date.issued causes a slightly subtler problem, in that you may want to keep it if it was DSpace-assigned, but you want to get rid of it if it was user-assigned because it’ll be duplicated. I get rid of it, because DSpace-assigned issue dates are completely meaningless. Your call whether you do too.

I’m told that the event system going into 1.6 is already smart enough to check for duplicated metadata on import. This makes me very happy, because deleting duplicate URIs is a hassle. (Not that I—oh, never mind.)

At any rate, once you’ve taken care of all this, just import as normal, using the mapfile you created as the value for the -m flag and adding the --replace flag. Should work fine.
