Archive for October, 2008

31 Octobris 2008

Miniature disasters and minor catastrophes

KT Tunstall’s wonderful song is playing on Pandora as I type this, and it’s just so fitting I have to use it as this post title!

This is a tale of beating DSpace and OS X with many, many rocks until they sorta-kinda work. I present it here in hopes of sparing someone else considerable annoyance.

One of my best clients emailed me with a “please fix this link in my HTML item” request. Simple enough, right?

The HTML item in question has its files nested in folders three deep. This means that DSpace’s regular exporter breaks, because it’s not smart enough to create intermediate folders. Joy.

So I kicked that up to the dspace-tech list, and got a kind response from Larry Stone of MIT: “use the METS packager export instead.” I did, and lo! it worked.

So I twiddled the file needing twiddling, zipped up the whole, and tried to put it back. First the METS ingester barfed because I’d zipped the folder containing all the files, not the files themselves. Okay, durrr, I felt stupid and zipped the files properly.

Then the METS ingester barfed because, unbeknownst to me, Mac OS X’s native zip utility adds OS X-specific junk to the zip file. Quite properly, the ingester said primly, “Your METS manifest doesn’t match your actual files. Go forth and fix it.” The solution to this little difficulty turned out to be YemuZip, which can emit a normal zip file.
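For anyone without YemuZip handy, here is a minimal Python sketch of the same idea: zip the files themselves (not the enclosing folder) and leave the Mac-specific extras out. The function names and the list of "junk" patterns (.DS_Store, ._ files, __MACOSX folders) are my assumptions for illustration, not anything DSpace or the ingester defines, so adjust them to whatever is actually tripping up your manifest check.

```python
import zipfile
from pathlib import Path

# Names treated as macOS debris. This list is an assumption; extend as needed.
MAC_JUNK_NAMES = {".DS_Store"}


def is_mac_junk(relative_path):
    """Return True for entries that macOS tools tend to add to archives."""
    return (
        "__MACOSX" in relative_path.parts
        or relative_path.name in MAC_JUNK_NAMES
        or relative_path.name.startswith("._")
    )


def zip_package_contents(source_dir, zip_path):
    """Zip the files inside source_dir, with paths relative to source_dir.

    The enclosing folder itself is not stored, so the manifest ends up at the
    top level of the archive. Write zip_path somewhere outside source_dir,
    or the archive will try to include itself.
    """
    source_dir = Path(source_dir)
    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if not path.is_file():
                continue
            relative = path.relative_to(source_dir)
            if is_mac_junk(relative):
                continue
            # Store forward-slash paths so nested folders survive intact.
            archive.write(path, relative.as_posix())
            written.append(relative.as_posix())
    return written
```

Calling zip_package_contents("exported_item", "item.zip") returns the list of paths it stored, which is handy for eyeballing against the METS manifest before trying the ingest again.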

Then the METS ingester barfed because the file I’d twiddled was a different size from what the METS file was claiming, logically enough. Helpfully, the ingester’s error message told me what size the file actually was, so I could pop into the METS file and fix the size in the several places it appears.

Then the METS ingester barfed because the checksums in the METS file didn’t match the checksum of the file I’d twiddled. There’s probably a quick and easy way to calculate a checksum from the command line, but CheckSumApp has a cute little GUI. Like the file size, the checksum appears in several places in the METS file, so I made sure I got all of them.
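If a GUI isn't your thing, the size and the checksum can both be had from a few lines of Python. This is only a sketch: it assumes the METS file records MD5 checksums, so check the CHECKSUMTYPE attribute in your own METS before trusting the default, and the function name is mine, not DSpace's.

```python
import hashlib
import os


def size_and_checksum(path, algorithm="md5", chunk_size=65536):
    """Return (size_in_bytes, hex_digest) for the file at path.

    algorithm defaults to MD5 on the assumption that this is what the
    METS file uses; pass "sha1" or similar if CHECKSUMTYPE says otherwise.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large bitstreams do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()
```

Run it on the edited file, then paste the two values into every spot in the METS file where the old size and checksum appear.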

Then the METS ingester actually worked. So now I have to go in and do database magic so that the item handle points to the new item, because the METS ingester doesn’t have a replace option the way the normal ingester does.

Anybody who thinks that a normal repository manager is going to go through all this to fix a link in an HTML file is as barking mad as I am. This is the ridiculousness that DSpace’s insistence on no-versioning, butterfly-pinned-to-wall “final archival” reduces me to. Yes, it’s funny—but it also cost me an entire hour to fix one link.

30 Octobris 2008

The dangers of reused code

So because the title of an item on display is shown at the top of item-display pages (imagine!) in my Manakin themes, I went and took the title out of the metadata listing to avoid redundancy and clutter.

One small problem with that. The same code gets called in the case when the item is being edited or checked over, but in that case the page title is “Item submission” or something similarly inane, so the item title doesn’t appear anywhere on the page. Doubleplusungood.

My current fix is to put the title back in if the string “workflow” appears in the URL anywhere. That’s… kinda hacky, I admit, and I’m not entirely sure it handles all edge cases, but it’s at least not as broken as before.

Stockholm-style OA

I didn’t go to ASIST this year, having SPARC-DR and Open Repositories 2009 to go to instead. So I missed the bit where my good twin had to deal with some creep spouting the Harnad-Poynder line about librarians not doing enough for open access and when are they gonna step up, anyway?

Honest to $DEITY, there are days I think that librarians’ continued involvement with the open-access movement is a sign of Stockholm Syndrome.

Open-access movement? Howsabout we shut these yahoos up, please? And howsabout we stop feeding their unjustified sense of entitlement and erroneous scapegoating by recognizing librarians’ contributions to the movement? If I see another book on open access where neither “libraries” nor “librarians” is an index term, I swear I’m done. I’m just done.

But before I go, I’m going to spearhead a publicity stunt. (You didn’t think I’d go quietly, did you?) Every single library-run digital repository, institutional or disciplinary, and library-sponsored open-access journal or similar project will go dark for a day. Then we’ll see just how much of the open-access movement is left without libraries.

29 Octobris 2008

The mark of Zotero

Well, if Thomson Reuters was expecting a craven retreat on the part of George Mason University, they didn’t get it. I rather suspected GMU would hold its ground, for various reasons that I won’t expound on here. I am pleased to be correct, however, whatever the reasoning behind GMU’s courage.

Thomson Reuters’s strategy here looks even more like a loser than it did when I first commented on it. They got themselves dissed in Nature. Ouchies. Bad PR aside, they’re not just losing the open-source developers and the network effects by now: they’re losing actual end-users, who are now unsure what they are allowed to do with the style files they have constructed. Reasonably enough, they aren’t happy about that, and I’m seeing noise about migration.

GMU cancelled its EndNote license, which makes sense because Thomson sued GMU on the basis of violation of license provisions, not anything overarching like copyright or trade secret law. Losing the GMU site license probably isn’t a big deal to Thomson Reuters. What is a big deal is that other universities, if they have the sense $DEITY gave a flea, are now scrutinizing their EndNote site license to see if there’s anything Thomson Reuters can gig them on. I’d expect that EndNote salesfolk will have some awfully squirmy questions to answer, next license-renewal go-round.

One of my students did her mini-job-talk last night on Zotero, which spurred a little discussion of competitors afterward. Consensus was that on quality of user interface alone, Zotero is light-years ahead of EndNote and RefWorks. This is a conclusion with which (no surprises here) I wholeheartedly agree. So what is Thomson Reuters doing instead of improving its product? Suing its users. Because that always ends well, yup yup.

Congrats, Thomson Reuters legal and strategic teams. Sure looks like you’ve succeeded in lawyering your own product to a slow and painful death. Heckuva job there.

But that’s easy!

Last night’s class session was a new one, not one I’d used last year. I taught them the basic idea behind a relational database along with a smidge of SQL, and the basic idea behind markup with a demo of validating an (intentionally broken) XHTML document.

They stuck right with me through elucidation of a few simple tables leading to a fairly complicated SQL query involving two nested subqueries. A couple-three bright lights were all “But that’s easy! And sensible! Why didn’t it happen until the 1970s?”

I love my students. I truly do.

Obviously there’s a lot more to SQL and databases than I could show them in a couple of hours. There’s a lot more to SQL and databases than I myself know anything about! Database optimization, query optimization, denormalization for performance—I am only an egg. I can’t do that stuff. Heck, joins still confuse me sometimes.

But my goal wasn’t to turn them into database and markup ninjas. My goal was to get across that neither databases nor markup is geek voodoo; they’re things that ordinary mortals can usefully work with. And in that sense, last night was a smashing success.

28 Octobris 2008

I am agog, I am aghast

So in just the last week, I’ve heard about at least four small non-doctoral US institutions on their way to opening library-sponsored institutional repositories.

One is an outlier. Two is a curiosity. Four in a week is a trend, and I want very badly to know what in the world is going on here.

If you’re in this situation or know someone at an institution who is, I’d appreciate an email, confidentiality guaranteed. Why is this happening? What do these institutions expect to gain? What workflows are they planning for? What content are they targeting? How are they planning to ingest it? Who are their role models in this space?

I don’t want to think these people are stupid or unaware. Neither stupidity nor cluelessness is typical of librarians. My only other hypothesis is that smaller institutions have different needs that they believe IR software can fill. Gosh, I hope that turns out to be correct.

27 Octobris 2008

eResearch room on FriendFeed

If FriendFeed is your social-networking poison of choice, I’ve just opened a room for eResearch there. Quality links showing up already; join the fun!

Fugly Manakin hack: User-friendly file descriptions

DSpace’s JSPUI had one pleasantly usable feature: instead of displaying the MIME type of a file, it displayed a short admin-editable file-type description. “PDF” is a vast improvement over “application/pdf.” For one thing, it’s shorter.

Unfortunately, Manakin only knows from MIME types, since METS isn’t very friendly to niceties like user-friendly file descriptors. Fixing that was on my to-do list. I was told by the DSpace developers that the right and proper way to fix that was to insert PREMIS metadata into the METS. To do this, I would have to figure out PREMIS and then write an Aspect (I think) to twiddle the METS.

People, I am too damn lazy for that. So you get this fugly hack instead. I don’t feel too bad about it; storing descriptions in the database is kind of a fugly hack too.

First, figure out what’s actually in your DSpace instance by way of MIME types by running this query on your database:

select mimetype, short_description from bitstreamformatregistry order by mimetype;

(You will probably immediately notice a potential problem with this hack: text/plain has two values, depending on whether it’s a content or license bitstream. I think this is not actually a problem, because this hack should only get called for content bitstreams.)

Then create a template in your theme as below, making one <xsl:when> for each MIME type you want a user-friendly description for.

<xsl:template name="getFileTypeDesc">
    <xsl:param name="mimetype"/>
    <xsl:choose>
        <xsl:when test="$mimetype='application/pdf'">
            <xsl:text>PDF</xsl:text>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$mimetype"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

The <xsl:otherwise> returns the MIME type as a last-ditch descriptor.

Next, go to the code that builds each bitstream row; I’d show you mine, but I have hacked the living daylights out of it because I cannot stand unnecessary HTML tables. Look for the MIME-type code (hint: it’s the contents of <xsl:with-param> below) and replace it with the following:

<xsl:call-template name="getFileTypeDesc">
    <xsl:with-param name="mimetype">
        <xsl:value-of select="substring-before($file/@MIMETYPE,'/')"/>
        <xsl:text>/</xsl:text>
        <xsl:value-of select="substring-after($file/@MIMETYPE,'/')"/>
    </xsl:with-param>
</xsl:call-template>

Whenever you add a new content-type to the repository, you’ll have to add a new <xsl:when>—but realistically, how often does that happen? I’ve done it once a year, maybe? Less?

It’s a fugly hack, but it works.

Yes, it’s about journals!

Peter Suber misses the mark about as often as I commit random acts of kindness and senseless beauty. For both of us, though, it does happen; and Peter missed the point of an anthropologist’s critique of open access when he blogged it the other day.

It’s just not true that open access isn’t about the journal literature. There are salient and cogent (if not necessarily good) reasons that it is, no matter the chosen road, no matter the rhetoric. What is it we’re asking faculty to self-archive? Theses and dissertations, yes; faculty are much happier mandating somebody else’s behavior than their own. It’s not faculty’s books, though, for economic and public-relations reasons. It’s not their learning objects; that’s Somebody Else’s Problem. It’s only secondarily their data or their gray literature; many OA dogmatists look down their aristocratic noses at that stuff, and others (myself included) question the technological fitness of the green road to accommodate such materials.

Yes, we’re talking about the journal literature here. Of course we are. The third sentence of the Very Brief Introduction starts in with “scholarly journals don’t pay authors.” If we weren’t talking about the journal literature, why would repository-rats get so much flak from the likes of Stevan Harnad when we take in other things?

So follow Dr. Marcus’s train of thought here: if the journal literature isn’t all it’s cracked up to be, why would he waste time fighting for open access to it? There’s a lot to fight for in the world!

I’m not unsympathetic to that argument, myself. I’ve got two journal articles in press, and I’ve written essays for essay anthologies. Scholarly and professional publishing is a pain in the nether regions. Reviews, permissions, style requirements, citations, last-minute changes, copyright-transfer agreements, spare me—and by the time it’s actually published, it’s out of date. Roach Motel was trenchant and prescient when I wrote it and posted its preprint. By the time it’s published, a year or more after the preprint, it’ll be a stale bagel, the golden age of its impact long since passed.

Why did I bother? Dr. Marcus has it right. For all I’ve made vastly more progress among those in the know with CavLec than anything I’ve gone through the publication wringer with, some people can’t see anything or anyone if they’re not formally published. It’s a game. It’s a stupid, slow, expensive game, but I play it because I have to. Dr. Marcus doesn’t have to any longer, and more power to him for it. Why shouldn’t he ignore those of us who are still enmeshed in it?

Honestly, if the open-access movement is suddenly waking up and finding that it doesn’t like being in bed with journals, it has no one to blame but itself, for blinkered vision, crippled rhetoric, and hobbled technology. Climbing out of the pit we have digged will be a tough business, because many of our allies still believe the line of hooey we were dishing out; read the RAND Corporation’s report on the subject of what librarians want from open access if you don’t believe me.

Still, it’s good for Peter Suber to come up against the real-world results of the aforementioned pit. He’s someone I trust to engineer a rhetoric reevaluation.

24 Octobris 2008

Ya think?

It’s like I’m psychic or something. The RAND Corporation folks just came out with a report on IRs in the UK. It reads like they cribbed off Roach Motel (and no, that is not an accusation; I’m quite sure they didn’t). Check out some of the funny-if-it-weren’t-so-sad:

Overall, the interviews seemed to validate the hypothesis of the EMBRACE project board that digital repositories are currently underutilised, and that there are significant barriers to a strategic commitment.

Ya think?

However, the findings revealed a complicated picture of disciplinary differences, departmental and institutional differences, and heterogeneity between and within stakeholder groups.

You don’t say.

Even if most of the barriers identified in this report – e.g., the lack of awareness, a technology that is in its infancy, risks of reputation damage, or the administrative burden of depositing – can be overcome, one major challenge remains for digital repositories, namely the lack of incentives for the wider institutional community to provide content for these repositories.

Imagine that!

While undertaking this study, a clear theme emerged. There appears a misalignment between the objectives of the repository and the needs of different groups of stakeholders.

No, really?

And that’s just the executive summary, people. There’s way more happy-fun reality-checking (right into the boards, too, ouch—okay, apologies for the hockey joke) in the report proper. I’m not all the way through it, so I’m still unclear whether the report will recommend retrenching or outright bailing.

We need reports like this. I’m glad this one was written, and I dearly hope it is widely read. I’m just annoyed that it wasn’t written two or three years ago, when it might have done more good.

I’m still hearing about teensy-tiny institutions in the US starting IRs. Makes me want to get out my stompy boots, that does, and put the Yaktrax on them for extra indelibility. You heard it here first: If you aren’t a doctoral institution, don’t bother with an IR. No, I don’t care what Harvard did; you aren’t Harvard, and you have many less-futile things to do with your precious library budget and staff.

Geesh. I really need not to read such things on a day I took off work.