A plea came in to the DSpace techlist for how to use the DSpace command-line batch importer. “RTFM!” was the immediate chorus.
Well, okay, it’s how I learned to use the batch importer, but that doesn’t mean everyone should have to learn that way. So forthwith, a nuts-and-bolts minimal-techspeak tutorial on getting stuff into DSpace through the back alley.
First, some vocabulary. A “bitstream” is what you and I, being normal folks, think of as a file. An “item” consists of one or more bitstreams, plus descriptive information (author, title, etc.) about those bitstreams, plus license information. A “bundle” is a DSpace-specific construct (you won’t even see it in the UI, really) that keeps license bitstreams separate from content bitstreams inside an item. An “eperson” is someone registered with the DSpace instance; s/he is usually referred to by his/her email address.
To import an item into DSpace, you need to give DSpace three things: the bitstreams, the item’s descriptive information, and (because DSpace is fairly brain-dead) a plain-text listing of the bitstreams. All these things need to be in a single folder. If you are importing more than one item at once, each item needs to be in its own folder. DSpace does not care how you name the folder or the bitstreams. It does care how you name the bitstream listing, the file containing descriptive information, and the license files if any, as I’ll explain in a moment.
License information is optional. If you do not provide it, DSpace simply doesn’t attach a license to the imported item. If you do provide a license for the item, it should be in the form of a plain-text file inside the item’s folder named “license.txt”. (I’m leaving Creative Commons licenses out of the picture for now; if you care, I have another post on the subject which you should read only after you read and understand this one.)
The plain-text listing of the bitstreams needs to be named “contents”. Each filename should be on its own line; order is irrelevant. If you are only importing content files (no license files), you’re done. If, however, you have license files, you need to tell DSpace to put them in a different bundle from the content files. Easier to demonstrate than explain:
contentfile1.txt	bundle:ORIGINAL
contentfile2.txt	bundle:ORIGINAL
license.txt	bundle:LICENSE
The whitespace between the filename and the bundle name must be a single tab character.
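If you would rather not trust your text editor to leave that tab alone, you can generate the listing with a few lines of script. What follows is only a sketch of the rules described above, not anything DSpace ships; the function name contents_text is my own invention, and the filenames are the sample ones from the listing.

```python
def contents_text(content_files, license_file=None):
    """Return the text of a 'contents' listing for one item folder.

    With no license file, each line is just a filename. With a license
    file, every line gets a bundle name after a single tab character.
    """
    lines = []
    for name in content_files:
        if license_file is None:
            lines.append(name)
        else:
            lines.append(name + "\tbundle:ORIGINAL")
    if license_file is not None:
        lines.append(license_file + "\tbundle:LICENSE")
    return "\n".join(lines) + "\n"


# Sample filenames from the listing above.
listing = contents_text(["contentfile1.txt", "contentfile2.txt"], "license.txt")
print(listing, end="")
```

Save the returned text as a file named “contents” inside the item’s folder.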
The descriptive information lives in a little XML file whose name must be “dublin_core.xml”. To keep this post to a manageable length, I am not going to go into heavy detail about Dublin Core metadata; the easiest way to bootstrap yourself is to look at existing items in a repository in full-listing view. A bare-bones dublin_core.xml file looks something like this:
<dublin_core>
  <dcvalue element="contributor" qualifier="author">Public, John Q.</dcvalue>
  <dcvalue element="language" qualifier="iso">en</dcvalue>
  <dcvalue element="subject" qualifier="none">Technology</dcvalue>
  <dcvalue element="title" qualifier="none">Sample Dublin Core record</dcvalue>
  <dcvalue element="type" qualifier="none">Article</dcvalue>
</dublin_core>
The order in which you place individual Dublin Core elements generally does not matter, although you should put authors in the correct order (first author first, second author second, etc.) because DSpace does respect that order, and if you don’t, angry faculty will come after you with long knives.
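If you have more than a handful of items, typing these files by hand gets old fast. Here is a minimal sketch of generating one with Python’s standard XML library. The function name build_dublin_core and the second author are hypothetical, there purely for illustration. The two things it gets you are that values come out in the order you supply them (so authors stay in order) and that awkward characters such as “&” are escaped for you.

```python
import xml.etree.ElementTree as ET


def build_dublin_core(fields):
    """Build dublin_core.xml text from (element, qualifier, value) tuples.

    Elements are written in the order given, so list the first author first.
    """
    root = ET.Element("dublin_core")
    for element, qualifier, value in fields:
        dcvalue = ET.SubElement(root, "dcvalue", element=element, qualifier=qualifier)
        dcvalue.text = value  # ElementTree escapes &, < and > when serializing
    return ET.tostring(root, encoding="unicode")


record = build_dublin_core([
    ("contributor", "author", "Public, John Q."),
    ("contributor", "author", "Doe, Jane"),  # hypothetical second author
    ("title", "none", "Sample Dublin Core record"),
    ("type", "none", "Article"),
])
print(record)
```

The output is all on one line, which is still well-formed XML. Save it as “dublin_core.xml” in the item’s folder.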
If you have all this together, you are now ready to use the batch importer. Put the item folder on the DSpace server somewhere that the DSpace administrator user has read and write privileges. As the DSpace administrator user, cd over to the bin folder inside the running DSpace instance (note: not the source-code folder that you run ant from when you recompile DSpace). I’m going to run through the command one bit at a time, and then put it all together at the end.
dsrun org.dspace.app.itemimport.ItemImport: Command invocation.
-a: Tells DSpace that you’re adding new items.
-e me@myu.edu: Eperson who should be held responsible for the submitted items. This need not necessarily be you! It does need to be someone the system knows about, so if you’re depositing on behalf of someone who’s never used the system, you need to use the DSpace administrative interface to add them as an eperson.
-c 0123/4567: Which collection the items should go into. Go to the collection’s home page and grab up its handle. (Note that the batch importer is deadly stupid about this; there is no way to do a single batch import of items that belong to different collections. Also, you can’t map items into additional collections via the batch importer.)
-s /home/me/stuff: The directory on the server where the item folders are. DSpace will error out if the admin user does not have read access to this directory!
-m /home/me/stuff/mapfiles/mapfile.txt: Where to put the dumb little “map file” that DSpace generates, telling you which item got assigned which handle. This is basically a throwaway (it’s easy to regenerate if you ever actually need it), but if you don’t let DSpace generate its dumb little map file, DSpace sulks and won’t import your items.
So, the full command looks something like this:
dsrun org.dspace.app.itemimport.ItemImport -a -e me@myu.edu -c 0123/4567 -s /home/me/stuff -m /home/me/stuff/mapfiles/mapfile.txt
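If you run imports often, it can help to assemble that command from its parts so you only ever change the bits that vary. A small sketch follows, using only the flags described above. The function name import_command is mine, and the values are the same placeholders used throughout this post.

```python
import shlex


def import_command(eperson, collection, source_dir, mapfile):
    """Return the batch-import command as a list of arguments."""
    return [
        "dsrun", "org.dspace.app.itemimport.ItemImport",
        "-a",                # add new items
        "-e", eperson,       # eperson responsible for the items
        "-c", collection,    # handle of the target collection
        "-s", source_dir,    # directory holding the item folders
        "-m", mapfile,       # where DSpace writes its map file
    ]


cmd = import_command(
    "me@myu.edu",
    "0123/4567",
    "/home/me/stuff",
    "/home/me/stuff/mapfiles/mapfile.txt",
)
print(shlex.join(cmd))  # a copy-and-paste-ready command line
```

The sketch only prints the command. I am assuming you will still paste it into a shell in the bin folder as the DSpace administrator user, as described above.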
For most items, you are now done. If your item was a website, you have one more step: setting the “primary bitstream” to the website’s home or entry page. Anyone with edit rights on the item can do this from the item’s edit page; there’s a column of radio buttons labeled “Primary bitstream?” beside the bitstream listings near the bottom. Alternatively, you can employ some SQL-fu in your database (instructions are for Postgres, not Oracle).
The batch importer can also replace items, so if you’ve completely hosed a collection in some reasonably fixable fashion, you can export it, fix it, and re-import it. Danger Will Robinson! There are several gotchas in this process. (Not that I know this by experience or anything—okay, I’m not fooling anyone here. I’ve run into all of them.) For this, you will need a mapfile, and you need to add --replace to the command line. I’ll reserve the other gotchas for a separate post, noting only that I have it on good authority that several of them will be going away in version 1.5 or 1.6.
And there you are. I hope.