11 Martii 2005

Googly-eyed

American Libraries, which is usually a yawner, had a couple of good bits this month.

The big article on Google repays attention, of course; I did my best to skip the Michael Gorman bits for the sake of my blood pressure, but the ones I couldn’t avoid assimilating (damn you, speedreading! damn you!) and the reactions I’ve seen indicate it was quintessential Gorman—and we librarians are in for a very, very long year.

(Me? Regretting my vote? Well, um—yeah. Yes, I rather think I am. I’m still going to wear my “One of the Blog People” button during Gorman’s visit, and I’m certainly not going to pretend I’m anything other than a text geek—but I fully expect that Gorman and I will dearly loathe each other before the day is out. I’m not a saint, a martyr, or an XSLT template; I don’t do instantaneous conversions.)

I got thoroughly annoyed at the (unnamed) interviewer’s “gatekeeper to the world’s knowledge” crack. Can we please, please manage to admit that Google is giving back files and associated use-rights to the libraries that own the books? Please?

Now, if the line is that the libraries won’t actually do anything with these files, leaving Google to play steamroller, fine—let’s discuss that so that we can light a fire under said libraries. But I don’t think that’s what’s going to happen, unless (as still seems possible to me) the files turn out to be thoroughly useless, so much so that the libraries throw up their hands and call the experiment a failure.

Why would I say such a heinous thing? Because I don’t know enough about Google’s methods, and Google hasn’t been forthcoming. I’ve been hearing bits and pieces about their wizzy new scanning/OCR process and how fast and cool and damageless and revolutionary it is—but nothing, nothing whatever, about post-processing.

Reality check. (Roll a d20, add your Wisdom modifier.) SCANNING/OCR IS THE EASY PART. Heck, Project Gutenberg’s had that licked for a couple decades. It’s everything else that’s hard, if you want to come out with a usable digital object on the back end.

Never mind non-Roman alphabets; Google has hemmed and hawed about that a bit (though I want to see them handle Chinese!). But how about math, both block equations and inline? Charts, photographs both color and grayscale, line art, maps (all of which require different scanning techniques for best results), not to mention their captions and legends? Footnotes/endnotes? Scannos and proofreading? (OCR has come a long way in fifteen years, as I have reason to know, but it’s nowhere near a hundred percent accurate, especially on older books with fading print.) Marginalia? Tables? Two-column typography? Weird fonts? End-of-line and page-crossing hyphenation? Tip-ins and other externalia, such as that one-meter-square table folded into a pocket in the back of my Ireland TEI book? Where is Google’s plan to deal with these common digitization problems?

Let’s be clear: some of these problems can be worked on algorithmically, and I pretty much trust Google to figure out decent algorithms. Hyphenation, for example, can be dealt with reasonably effectively by creating a concordance of the book once it’s digitized and comparing hyphenated words against it. (There will still be some outliers that someone will have to look at. That’s life in big, bad Digital City.) That same concordance can detect a lot of scannos; I built one for the Ireland book, and it helped quite a bit.
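The concordance approach to hyphenation can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tool used on the Ireland book and not anything Google has described. The function names (`build_concordance`, `rejoin_hyphens`) are hypothetical, and the sketch assumes plain-text input, one newline per printed line, and unaccented Latin letters only. It counts every word that is not split across a line break, then uses those counts to decide whether each end-of-line hyphen is a real hyphen or a typesetting artifact. Anything the counts cannot settle goes on a list for a human to look at.

```python
import re
from collections import Counter

# A word, optionally with internal hyphens (e.g. "well-known").
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")
# A fragment ending in a hyphen at end of line, plus its continuation
# at the start of the next line.
BROKEN = re.compile(r"([A-Za-z]+)-\n([A-Za-z]+)")


def build_concordance(text):
    """Count every word that is NOT split across a line break (lowercased)."""
    intact = BROKEN.sub(" ", text)
    return Counter(w.lower() for w in WORD.findall(intact))


def rejoin_hyphens(text):
    """Return (repaired_text, outliers).

    Each line-end break is resolved by asking which form the rest of the
    book uses more often: joined ("example") or hyphenated ("well-known").
    Ties, including words seen nowhere else, are left untouched and
    reported as outliers for human review.
    """
    concordance = build_concordance(text)
    outliers = []

    def repair(match):
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        hyphenated = head + "-" + tail
        n_joined = concordance[joined.lower()]
        n_hyphenated = concordance[hyphenated.lower()]
        if n_joined > n_hyphenated:
            return joined + "\n"       # hyphen was a line-break artifact
        if n_hyphenated > n_joined:
            return hyphenated + "\n"   # hyphen belongs to the word
        outliers.append(hyphenated)    # no evidence either way
        return match.group(0)

    return BROKEN.sub(repair, text), outliers
```

Calling `rejoin_hyphens(book_text)` returns the repaired text and the list of unresolved breaks. The same `Counter` could plausibly serve as a starting point for scanno hunting (for instance, by listing words that occur only once), but that heuristic is an assumption here and is not shown.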

Some of these problems can’t be solved with an algorithm, though: e.g. books where floats are nowhere explicitly called out, making linking to them from text a matter of human judgment. What’s Google going to do about that?

Maybe Google intends to do nothing, indeed. A plain text dump, while entirely useless as a digital object, is still more or less searchable. Good enough for government, er, Google work. In which case, the libraries will have decades (nay, centuries!) of work to do if they’ve any intention of making human-usable digital objects. Go-go Google. Or something.

And those are just the text problems. We haven’t even discussed any sort of markup or digital typography yet, never mind metadata. As was recently pointed out in the Chronicle of Higher Education (behind the firewall, sorry; it’s the March 11 Mark Y. Herring article), and as I have better reason to know than most, many have tried to make something usable out of factory-farmed digital text, and they’ve just about all failed, because (in my jaundiced-by-experience opinion) they ignored the intricacies of text artisanry.

Ergo much remains to be seen. Right now, we just don’t know—and a lot of people are nonetheless blithely assuming that Google’s end-product will be a human-usable digital object, and crying the Last Days therefore. Do dearly wish they wouldn’t.

I note, however, that ACRL has given a last-minute panel slot to a Google representative and the University of Michigan’s (very savvy, very worth listening to) digitization librarian John Wilkin. I expect the venue to be packed to the absolute gills, so I’m not sure I’ll even try to get in—but I hope somebody asks questions like mine. And I hope somebody asks the very simple “So what’s Michigan planning to do with the files Google gives it?”

(And I hope somebody blogs the panel. I do rather hate to miss it. Though I expect sufficient cluelessness on display that it wouldn’t be good for my blood pressure.)

Anyway, Joseph Janes had another good (and related, I promise) bit in his column this month:

Would unfettered, albeit prepaid, access obviate selection? Readers’ advisory would still be important, but does this imply that collection development would become passé—or even counterproductive—in such a world?

If we do put all the books on the Web, what do we need library catalogs for again? Do collection developers turn to making endless pathfinders?

I think we should recognize that this question has come up already. What is the Big Deal in serials if not unfettered, prepaid access to huge swathes of the literature? It didn’t turn out as unfettered as it appeared at first, of course. (Ruritania’s dean and I have agreed to disagree on this point; he’s happy with the Big Deal for the sake of his budget, but he nonetheless undermines it long-term with plans for an institutional repository, an approach I can respect even though I’m glad not everybody is doing it.) What are all these full-text databases, if not unfettered, prepaid access to gobs and gobs of stuff?

In other words, we’ve already given up a lot of collection development to our vendors. More doesn’t seem unlikely. It’s time and past time we sorted out how to cope with that, especially as it dovetails rather neatly with recent rumblings on the uselessness of OPACs and even ILSes in the face of the changing roles of libraries vis-à-vis their collections.