Caveat Lector » 2008 » January

Dies Martis, 1 Ianuarii 2008

Busted

So, not a day after I ask the hypesters to leave my damn blogosphere the hell alone, I get another tout email. Do you morons not read? (Yes, okay, that one answers itself.)

Here’s my new policy. I’m publishing any of those I get. Sans links. With names. Call it my little gesture toward turning over the rock and watching the little grubs squirm.

Hi,

We just posted an article “[article name deleted] “( [article URL deleted] ). I thought I’d bring it to your attention just in case you think your readers would find it interesting.

Either way, thanks for your time! Happy New Year!

Amy S Quinn

Email address was from GMail; headers gave no immediate reason to doubt that provenance.

Bite me, Ms. Quinn.

Dies Mercurii, 2 Ianuarii 2008

Public Domain Day

In honor of Public Domain Day, I remind everyone that I granted Caveat Lector to the public domain quite some time ago.

When Creative Commons gets CC0 in gear finally, I’ll pop a badge and a license up, something I decided against years ago because CC0 didn’t exist.

Until then, take my word for it. I won’t sue. Honest.

Unwarranted optimism and unjustifiable assumptions

The after-holidays literature catch-up round this morning started with the ARL’s e-science report (PDF, found via Open Access News), which is well worth reading. This repository-rat, however, feels it incumbent upon her to note for the ARL’s edification the difference between the collective skillsets present in all of academic librarianship and the actual skillsets present at any given institution’s library.

Take page 13, “Current Library Capability to Support E-Science.” Research libraries have, quoth ARL:

  • Expert understanding of the policies and principles related to open exchange of scholarly information, as well as the roles that can be played by institutional repositories in assuring that exchange, and a demonstrated ability to offer and support both institutional repositories and domain repositories (e.g., arXiv)
  • Experience with developing and supporting integration and interoperability tools for information distribution and discovery, both within and across disciplines (e.g., SFX, metasearch, metadata standards, SIMILE)
  • Experience with developing and supporting both business and technical strategies for long-term archiving (e.g., archival support generally, Portico, grant research with NARA, NDIIPP)
  • Understanding of archival and life-cycle aspects of scientific information, including the importance of assuring access and usability over the long term (preservation, metadata)

As a statement of the collective capacity of academic librarianship in the United States, this is more than fair. As a statement of what’s present at any given research library—it’s ludicrous. Risible. Absurd. Except that I can’t even laugh, because I want to cry.

ARL gets it, actually; the report authors clearly realize libraries just plain aren’t there yet: “a limited amount of current activity in this area,” the next section says primly. (I am biting my tongue. I am biting it good and hard. I may yet bite it all the way through. Just sayin’.) But, ARL, from the bottom of my heart let me say: you do not do your cause any good when you puff up library administrators’ heads into thinking they’re better prepared and staffed than they are!

And let me also wish the ARL luck. They have an uphill climb in front of them, and from the look of their recommendations, they know it. (Disclaimer ho: MPOW is an ARL institution, but I personally have no contact with ARL. What I’m saying here is not the result either of discussions within MPOW or discussions directly with anyone at ARL.)

Part of the reason ARL is in for trouble is the investment many of its institutions have already made in institutional repositories. (I make this point, very briefly, in Roach Motel. This post is an expansion upon it.) But wait, you say, how could that possibly be a problem? Isn’t it a sign that libraries are awake, aware, and ready to contribute? The ARL report spins it that way (page 6), but undercuts itself a little bit later (page 11) by referring specifically to disciplinary repositories (not institutional ones) as collaboration drivers.

Why is research-library investment in IRs a problem for e-science initiatives?

  • IR software and service platforms are wholly incapable of meeting the needs of data-driven science. Here we run into the problem of collective versus individual resources, again. A bare handful of forward-thinking institutions do have services built on IR platforms (or, I should say, built on Fedora, because EPrints and DSpace are completely hopeless for this task) that do the job. So the capabilities exist, somewhere in the vast collectivity of librarianship. That’s a far cry from being widespread enough to make significant inroads into the problem.
  • Most research libraries that have opened IRs have been badly burned by them. Uptake has been minimal, expense considerable, and number of items collected embarrassing. So now the ARL is going to dangle vastly more expensive, complex, and un-library-like systems in front of library administrators, harping upon several of the same themes that convinced those same administrators to open IRs? I do not foresee much enthusiasm. I foresee a lot of “This is too expensive and too far outside our skillset for our library system even to consider. Go talk to campus IT.” (Incidentally, see my colleague Mike Simpson’s excellent report for why leaving this just to campus IT is a problem.)
  • Resources both human and financial that might go into e-science initiatives are being hurled down the IR black hole. I add, also, that “scholarly communication” may have become a similar boondoggle. There are a fair few “scholarly communication librarians” out there. I would surely love to see a catalogue of their accomplishments. Not, mind you, a list of their initiatives and efforts, as I know they’re all motivated and hardworking people—I mean accomplishments, visible evidence of change. Cards on the table: I don’t think there have been many.

Me, I freely admit I’m cynical: I think IRs have been written off as a failed experiment at the Big Thinker level. Over the past year I’ve seen Cliff Lynch shift from his “essential infrastructure” stance to avoiding mention of them whenever possible. The ARL report gives them no more than the barest of lip service. I’m also cynical enough to say that I don’t think this a wrong move by Lynch or by ARL. Heck, I’d love to be able to serve e-science. I’m hobbled by a software platform that can’t and a mission that doesn’t. If I want to serve e-science, I have no choice but to win free of both—and why should I be surprised that Lynch and ARL appear to have come to similar conclusions?

I hope ARL is making plans to reclaim and repurpose repository-rats such as myself, but I can’t pretend I’m sanguine about that, either. Certainly the report mentioned nothing of the kind. Missed opportunity, ARL. I know a fair few good rats who can already do the kind of jobs you’re envisioning, and who deserve better than being written off. And trust me, they need your help kicking over the IR traces.

How did this happen? How did IRs become such a tarpit?

I’m going to pick on what is probably the oftenest-cited success story in IR history: the famous Foster study on faculty behavior and institutional repositories. Actually, I love this study. I love its methodology to pieces, and I’m constantly disappointed that more ethnographic studies of faculty behavior aren’t being done. The problem with this study isn’t its methodology or the conclusions drawn therefrom.

The problem is that Foster, doubtless under pressure from libraries and library grant-funders, asked the wrong question. She asked how to goose faculty into putting stuff in IRs. The very title of the study says so! That immediately builds a very tiny box around the vast potential of library contribution to research processes. It assumes ab initio that IRs, since we have them already, are the answer—if we only find the right question. This assumption is wholly without foundation and unjustifiable, yet it’s colored nearly all subsequent investigation.

The far broader, more interesting, and more useful question she ought to have asked is, “What are researchers doing with computers and digital data these days, and where do their processes fall short of ideal in ways that libraries could profitably rectify?” The set of answers yielded by such a study, I have every confidence, would demonstrate starkly that IRs are a vanishingly small segment of any decent library response to researcher needs.

Libraries are still, to this very day, asking how we can goose faculty into putting stuff in IRs. I answered that question loudly and clearly in my NISO/PALINET presentation and in Roach Motel: we can’t, it is not possible, cope already! Until we move on from this question, IRs will remain a tarpit, and productive library involvement in e-science no more than a distant dream.

Dies Jovis, 3 Ianuarii 2008

Next at the NIH

Gavin Baker has a grounded and reasonable set of predictions for what the nuts and bolts of the NIH open-access policy will look like (remember, folks, law is just the bones; the executive branch puts the meat on ’em).

I’ll be pointing local folks at this, and I recommend others do likewise.

Dies Veneris, 4 Ianuarii 2008

Just when I was convinced they’re not losers

The Association of American Publishers just does not learn.

Here, guys. Have a free sledgehammer, on me. The NIH policy is a done deal. Weep, wail, mourn, and move the hell on. (Oh, and those first three? Should be done in private. The little temper tantrum that was PRISM didn’t get you very far, did it? Learn from that.)

If you bring a lawsuit, it is doomed. If you raise hell with faculty, faculty will raise hell right back, because the NIH is more important to them than you are. If you take it back to Congress, you will be met with incomprehension. The NIH itself shows every sign of being tired of your shenanigans.

Grow up, people. Smile, put on your grownup undies, and stop throwing good money down the rathole of an already-lost fight. Sheesh. Do you think I like calling you losers? I don’t.

Dies Saturni, 5 Ianuarii 2008

In praise of the blog

A couple-three things happened last week that (combined with another thing that happened some time past) have left me feeling vindicated on some of my less-happy opinions. I’m not exactly schadenfreudish about it; more a sense that finally, maybe, there will be some forward motion. That can only be good.

And I can say in all honesty that at least one such thing wouldn’t have happened at all if I hadn’t possessed a quasi-professional public soapbox firewalled off from my job strictly enough that third parties can’t easily get me in trouble for it. Because, the third parties in question? Have a history of getting people they find bothersome in trouble at work. In my case, they’re coming to the table instead, presumably having evaluated the available opportunity to dunk me in the soup and decided it either wasn’t possible or wasn’t worth the effort.

(Not that I trust them further than I could conveniently throw them, mind you. I’m optimistic, not stupid. Due self-protection measures are being taken.)

Over the last couple years I’ve learned that I can do professional writing, though it takes a hell of a lot out of me and I don’t think I will ever find it easy. Speaking is worlds easier, and whole universes more fun. (Combine Walt Crawford, to whom good writing comes as naturally as breathing, and me and you’d have one frighteningly effective public-figure librarian.)

I’ve also learned, though, that much professional publishing is limited-impact, especially when the goal is to motivate action (as my implicit professional-writing goal usually is). The thing I wrote for Library Journal wasn’t wholly bad, but it sank like a stone, to judge by the lack of reaction. My essay for Information Tomorrow—I was satisfied with it. It was solid if uninspired writing (and “solid but uninspired” is about the best I can do, folks). And when I wrote it, it broke some ground. When it was finally published, though… not so much with the groundbreaking. Kind of unfair.

If I’d waited for Roach Motel to be formally published, I suspect the same thing would have happened. I am not the only person saying some of what’s in Roach Motel (though I do, perhaps over-enthusiastically, think some of its observations and analysis are original). If I’d waited, why would anyone bother to read me? Or believe that I’d come up with this stuff off my own bat rather than reading it elsewhere?

As it is—I’ve laid my claim, with Roach Motel and with the NISO/PALINET talk, and people are listening, and wheels are slowly starting to turn.

There’s a taxonomy in all this, somewhere. (I am such a flippin’ librarian sometimes.) The blog is for open dissent and matters that won’t wait for my agonizingly slow formal-composition process. Speaking is for education and out-on-a-limb assertions. Professional writing is for persuasion, and open access to professional writing is for establishing primacy and expanding reach.

Perhaps it’s a sign of a hopelessly contrary nature that I need that open-dissent safe-space. Can’t imagine doing without it. Moreover, I’m just contrary enough to think that the blog’s helped my chosen profession as much as or more than anything else I’ve written for it. I’m satisfied with that.

Dies Solis, 6 Ianuarii 2008

The Monochromes

Nothing overcomes feline antipathy like cold weather. Behold.

three cats on couch

Dies Lunae, 7 Ianuarii 2008

Because people believe it

Disclaimer: I took the survey mentioned in this post. I do not intend to discuss my responses here, but CavLec readers will be able to divine most of them.

Peter Suber is annoyed that an ASIST survey contains some well-known canards against open access.

I’m not.

The simple fact is, people believe this garbage. But the extent of that belief is not known and not taken sufficiently seriously because surveys refuse to measure it. This in turn produces false “sure, they’ll all self-archive!” happytalk. Who gets left holding the bag for that happytalk? Me. And all the other repository-rats who damned well know better.

The other methodological peccadilloes Suber mentions I have no religion on, not being a statistician. I also recognize that framing a survey with explicitly false information creates a risk that people who didn’t believe the garbage believe it after taking the survey. I consider that acceptable risk.

Anything to cut the happytalk-crack with some reality.

The DSpace batch importer

A plea came in to the DSpace techlist for how to use the DSpace command-line batch importer. “RTFM!” was the immediate chorus.

Well, okay, it’s how I learned to use the batch importer, but that doesn’t mean everyone should have to learn that way. So forthwith, a nuts-and-bolts minimal-techspeak tutorial on getting stuff into DSpace through the back alley.

First, some vocabulary. A “bitstream” is what you and I, being normal folks, think of as a file. An “item” consists of one or more bitstreams, plus descriptive information (author, title, etc.) about those bitstreams, plus license information. A “bundle” is a DSpace-specific construct (you won’t even see it in the UI, really) that keeps license bitstreams separate from content bitstreams inside an item. An “eperson” is someone registered with the DSpace instance; s/he is usually referred to by his/her email address.

To import an item into DSpace, you need to give DSpace three things: the bitstreams, the item’s descriptive information, and (because DSpace is fairly brain-dead) a plain-text listing of the bitstreams. All these things need to be in a single folder. If you are importing more than one item at once, each item needs to be in its own folder. DSpace does not care how you name the folder or the bitstreams. It does care how you name the bitstream listing, the file containing descriptive information, and the license files if any, as I’ll explain in a moment.

License information is optional. If you do not provide it, DSpace simply doesn’t attach a license to the imported item. If you do provide a license for the item, it should be in the form of a plain-text file inside the item’s folder named “license.txt”. (I’m leaving Creative Commons licenses out of the picture for now; if you care, I have another post on the subject which you should read only after you read and understand this one.)

The plain-text listing of the bitstreams needs to be named “contents”. Each filename should be on its own line; order is irrelevant. If you are only importing content files (no license files), you’re done. If, however, you have license files, you need to tell DSpace to put them in a different bundle from the content files. Easier to demonstrate than explain:

contentfile1.txt    bundle:ORIGINAL
contentfile2.txt    bundle:ORIGINAL
license.txt    bundle:LICENSE

The whitespace between the filename and the bundle name must be a single tab character.
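Since a stray space in place of that tab is easy to type and hard to see, here is a minimal Python sketch that generates the listing instead. The function names are made up for illustration; it assumes the only license file is named license.txt, and that every other regular file in the folder (apart from the listing itself and dublin_core.xml) is a content bitstream.

```python
from pathlib import Path

# Assumption: the only license file is license.txt, as described above.
LICENSE_NAMES = {"license.txt"}
# Files the importer reads for itself; they are not bitstreams.
SKIP_NAMES = {"contents", "dublin_core.xml"}


def contents_lines(filenames, with_bundles=True):
    """Return the lines of a "contents" listing for the given file names."""
    lines = []
    for name in sorted(filenames):
        # Skip the importer's own files and hidden files such as .DS_Store.
        if name in SKIP_NAMES or name.startswith("."):
            continue
        if with_bundles:
            bundle = "LICENSE" if name in LICENSE_NAMES else "ORIGINAL"
            # Exactly one tab between the filename and the bundle name.
            lines.append(f"{name}\tbundle:{bundle}")
        else:
            lines.append(name)
    return lines


def write_contents(item_dir):
    """Write item_dir/contents, naming bundles only if a license is present."""
    item_dir = Path(item_dir)
    names = [p.name for p in item_dir.iterdir() if p.is_file()]
    has_license = any(name in LICENSE_NAMES for name in names)
    lines = contents_lines(names, with_bundles=has_license)
    (item_dir / "contents").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Calling write_contents on each item folder before an import gives a fresh listing; the sort is only for readability, since order is irrelevant to DSpace.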

The descriptive information lives in a little XML file whose name must be “dublin_core.xml”. To keep this post to a manageable length, I am not going to go heavily into detail about Dublin Core metadata; the easiest way to bootstrap yourself is to look at existing items in a repository in full-listing view. A bare-bones dublin_core.xml file looks something like this:

<dublin_core>
    <dcvalue element="contributor" qualifier="author">Public, John Q.</dcvalue>
    <dcvalue element="language" qualifier="iso">en</dcvalue>
    <dcvalue element="subject" qualifier="none">Technology</dcvalue>
    <dcvalue element="title" qualifier="none">Sample Dublin Core record</dcvalue>
    <dcvalue element="type" qualifier="none">Article</dcvalue>
</dublin_core>

The order in which you place individual Dublin Core elements generally does not matter, although you should put authors in the correct order (first author first, second author second, etc.) because DSpace does respect that order, and if you don’t angry faculty will come after you with long knives.
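If you are producing many of these files from a spreadsheet or database, hand-typing XML gets old fast. What follows is a hedged sketch using Python's standard xml.etree.ElementTree; build_dublin_core and write_dublin_core are hypothetical helper names, not anything DSpace ships. Fields are written in the order you pass them, so authors stay in author order, and special characters such as ampersands are escaped for you.

```python
import xml.etree.ElementTree as ET


def build_dublin_core(fields):
    """Build a <dublin_core> element from (element, qualifier, value) tuples.

    Tuples are written in the order given; a missing qualifier becomes "none".
    """
    root = ET.Element("dublin_core")
    for element, qualifier, value in fields:
        attributes = {"element": element, "qualifier": qualifier or "none"}
        dcvalue = ET.SubElement(root, "dcvalue", attributes)
        dcvalue.text = value
    return root


def write_dublin_core(fields, path):
    """Serialize the record to the given path as UTF-8 XML."""
    tree = ET.ElementTree(build_dublin_core(fields))
    tree.write(path, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    record = [
        ("contributor", "author", "Public, John Q."),
        ("language", "iso", "en"),
        ("title", None, "Sample Dublin Core record"),
        ("type", None, "Article"),
    ]
    write_dublin_core(record, "dublin_core.xml")
```

The output is not pretty-printed, which should not matter to an XML parser but makes the file harder to eyeball.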

If you have all this together, you are now ready to use the batch importer. Put the item folder on the DSpace server somewhere that the DSpace administrator user has read and write privileges. As the DSpace administrator user, cd over to the bin folder inside the running DSpace instance (note: not the source-code folder that you run ant from when you recompile DSpace). I’m going to run through the command one bit at a time, and then put it all together at the end.

  • dsrun org.dspace.app.itemimport.ItemImport: Command invocation.
  • -a: Tells DSpace that you’re adding new items.
  • -e me@myu.edu: Eperson who should be held responsible for the submitted items. This need not necessarily be you! It does need to be someone the system knows about, so if you’re depositing on behalf of someone who’s never used the system, you need to use the DSpace administrative interface to add them as an eperson.
  • -c 0123/4567: Which collection the items should go into. Go to the collection’s home page and grab up its handle. (Note that the batch importer is deadly stupid about this; there is no way to do a single batch import of items that belong to different collections. Also, you can’t map items into additional collections via the batch importer.)
  • -s /home/me/stuff: The directory on the server where the item folders are. DSpace will error out if the admin user does not have read access to this directory!
  • -m /home/me/stuff/mapfiles/mapfile.txt: Where to put the dumb little “map file” that DSpace generates, telling you which item got assigned which handle. This is basically a throwaway (it’s easy to regenerate if you ever actually need it), but if you don’t let DSpace generate its dumb little map file, DSpace sulks and won’t import your items.

So, the full command looks something like this:

dsrun org.dspace.app.itemimport.ItemImport -a -e me@myu.edu -c 0123/4567 -s /home/me/stuff -m /home/me/stuff/mapfiles/mapfile.txt

For most items, you are now done. If your item was a website, you have one more step: setting the “primary bitstream” to the website’s home or entry page. Anyone with edit rights on the item can do this from the item’s edit page; there’s a column of radio buttons labeled “Primary bitstream?” beside the bitstream listings near the bottom. Alternately, you can employ some SQL-fu in your database (instructions are for Postgres, not Oracle).

The batch importer can also replace items, so if you’ve completely hosed a collection in some reasonably fixable fashion, you can export it, fix it, and re-import it. Danger Will Robinson! There are several gotchas in this process. (Not that I know this by experience or anything—okay, I’m not fooling anyone here. I’ve run into all of them.) For this, you will need a mapfile, and you need to add --replace to the command line. I’ll reserve the other gotchas for a separate post, noting only that I have it on good authority that several of them will be going away in version 1.5 or 1.6.

And there you are. I hope.

Dies Martis, 8 Ianuarii 2008

Batch-replacing items in DSpace

Something I ought to have mentioned in yesterday’s post is the -t flag. This does a “test run” of your import, catching many (though not all) problems. (It will not notice if something is wrong inside your dublin_core.xml file. If you don’t have a dublin_core.xml file, it will notice.) I always run an import command with -t, then if it runs clean, arrow-up to recall the command, delete the -t, and off it goes.

If an import does happen to choke and die in the middle, don’t panic; running the command again with the -R flag (capital R, for “resume”; lowercase -r means “replace”) will pick up the import where it left off.

Right. Now, moving on to the situation where an individual item or every item in a collection is seriously messed up, and would be much faster to correct outside DSpace. This can be done! I have done it. But it’s annoyingly error-prone.

Step one is to export the item or collection. This works a lot like importing. As the DSpace administrator user, go to DSpace’s bin directory and run the following:

  • dsrun org.dspace.app.itemexport.ItemExport: Command invocation.
  • --type=COLLECTION: Or ITEM, depending on which you’re exporting.
  • --id=0123/4567: The item or collection’s handle.
  • --dest=/home/me/stuff: The directory on the server where the exported items should end up. Make sure the DSpace administrator user can write to this directory!
  • --number=1: The exporter names the individual item directories with sequential numbers. Instead of peeking into the directory, finding the highest-numbered existing directory, and adding one (which would be the KIND way to handle this), DSpace insists that you give it a start number.

In toto: dsrun org.dspace.app.itemexport.ItemExport --type=COLLECTION --id=0123/4567 --dest=/home/me/stuff --number=1

Inside each exported item’s directory, you’ll see a “contents” file, a “dublin_core.xml” file, one or more license files, and the bitstreams, all of which should be fairly familiar territory. You will also see a file called “handle” which is a plain-text file containing (surprise!) the item’s handle. At this point you can download all the folders and fix whatever you need to.

To re-import the items without losing their handles or duplicating them, you need to create a mapfile. This is just a plain-text file, with a folder name and the corresponding item’s handle on each line, separated by a space:

1 0123/4567
2 0123/8901
3 0123/2345

The way I do this, since the exporter isn’t smart enough to create a mapfile on its own, is with a little Python hack that runs through a directory of items and associates each item’s “handle” file with its directory name. (I meant to upload my Python hackery yesterday, but either WordPress or Apache was and is being extremely annoying about letting a Python source file load, so hang on while I sort that out—and with any luck I won’t break my blog permalinks this time!)
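In the meantime, the idea is simple enough to sketch. The following is not the script mentioned above, only a minimal illustration of the same approach with made-up function names. It assumes each exported item folder holds a plain-text file named handle containing nothing but the handle, and it skips any folder that lacks one.

```python
from pathlib import Path


def mapfile_lines(export_dir):
    """Return "folder handle" lines for every item folder with a handle file."""
    export_dir = Path(export_dir)
    lines = []
    for item_dir in sorted(p for p in export_dir.iterdir() if p.is_dir()):
        handle_file = item_dir / "handle"
        if not handle_file.is_file():
            continue  # not an exported item folder; leave it out
        handle = handle_file.read_text(encoding="utf-8").strip()
        if handle:
            # Folder name and handle separated by a single space.
            lines.append(f"{item_dir.name} {handle}")
    return lines


def write_mapfile(export_dir, mapfile_path):
    """Write the mapfile and return the number of items it lists."""
    lines = mapfile_lines(export_dir)
    Path(mapfile_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Compare the returned count against the number of folders you exported; a mismatch means some folder lost its handle file along the way.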

Now you need to keep DSpace from duplicating metadata on re-import. Yes, DSpace will do this if you let it. One way to deal with this is to run a script called ds-migrate from the bin folder on your items before you batch-import them.

I don’t like this solution, however, despite its being fast and easy. The script is intended for the not-uncommon situation where you mount a collection on a test server and then want to migrate it over to production, leaving no hint whatever that it was ever anywhere else. The script therefore wipes existing provenance and date information—which is bad for a collection you’re exporting from and re-importing to a production server. You’re losing important item history there!

So what I do—and you may well decide differently—is only kill the really troublesome extra metadata out of all the dublin_core.xml files: dc.format.extent, dc.format.mimetype, and dc.identifier.uri.

The first two are easy regular-expression replaces: <dcvalue element="format" qualifier="extent">[^<]+</dcvalue> (you can make the appropriate substitution for dc.format.mimetype without my help, I’m sure!). The last one is a tiny bit trickier, because you only want to get rid of identifier.uri when it’s the DSpace-assigned handle, not when someone has actually entered a different URI. Most people, then, will want this: <dcvalue element="identifier" qualifier="uri">http://hdl.handle.net/[^<]+</dcvalue> (If you run your own handle server instead of using CNRI’s, substitute its URL, of course.)
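For anyone who would rather script the cleanup than run it by hand in an editor, here is a hedged Python sketch of the three replacements. It assumes the exported files write their attributes exactly as in the sample record (element first, then qualifier, nothing else); if your exports carry extra attributes, the patterns will need loosening. The dc.identifier.uri pattern targets element="identifier" qualifier="uri" and only removes values that start with the handle-server prefix. The function name is hypothetical.

```python
import re


def strip_generated_metadata(xml_text, handle_prefix="http://hdl.handle.net/"):
    """Remove dc.format.extent, dc.format.mimetype and handle-style dc.identifier.uri."""
    patterns = [
        r'<dcvalue element="format" qualifier="extent">[^<]+</dcvalue>',
        r'<dcvalue element="format" qualifier="mimetype">[^<]+</dcvalue>',
        # Only URIs that begin with the handle prefix; other URIs are kept.
        r'<dcvalue element="identifier" qualifier="uri">'
        + re.escape(handle_prefix)
        + r'[^<]+</dcvalue>',
    ]
    for pattern in patterns:
        # Also swallow the indentation and line ending around the element.
        xml_text = re.sub(r'[ \t]*' + pattern + r'[ \t]*\r?\n?', '', xml_text)
    return xml_text
```

Read each dublin_core.xml, pass its text through the function, and write the result back; keep a copy of the export until the re-import has been checked.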

The element dc.date.issued causes a slightly subtler problem, in that you may want to keep it if it was DSpace-assigned, but you want to get rid of it if it was user-assigned because it’ll be duplicated. I get rid of it, because DSpace-assigned issue dates are completely meaningless. Your call whether you do too.

I’m told that the event system going into 1.6 is already smart enough to check for duplicated metadata on import. This makes me very happy, because deleting duplicate URIs is a hassle. (Not that I—oh, never mind.)

At any rate, once you’ve taken care of all this, just import as normal, using the mapfile you created as the value for the -m flag and adding the --replace flag. Should work fine.
