Taking what we can get
I hate PDF. I have plenty of PDF-hater cred. It is a wretched format for any digital object that has any conceivable future use other than perusal by a sighted human being. It’s terrible for the print-disabled. It’s terrible for text-mining. It’s terrible for transclusion or other reuse. It’s terrible for metadata-embedding. I hate it.
The ETD conference’s one serious misstep was letting in an Adobe shill. (Just repeating how much I loathe commercials masquerading as conference sessions, so we all know, right?) As shills do, this one talked about the whizbang new wonderfulness that the next version of PDF will have—the new wonderfulness being multimedia embedding. I dropped my head into my hands (because multimedia embedding makes my preservation job harder, not easier!), and then I sat up, raised my hand, and let the shill have it with both barrels. This embedding, it’ll be Adobe-proprietary, won’t it? I asked. Uhhhhhhhh, I dunno, the shill said. (Why is it that companies who send shills to conferences don’t prepare them at all for the questions they are likely to encounter from a given audience?) Well, I said, we’re librarians here, and we worry a lot about digital preservation, and proprietary formats are total non-starters with us. The shill winced. I nodded in satisfaction.
Plenty of other people despise PDF too, most of them far smarter and more influential than I’ll ever be. I have my own reasons, is all.
I was working on batch-import metadata for the repository last week. (I’m still working on batch-import metadata for the repository; 40 years of papers suddenly dropped in my lap, and even without the half I can’t use yet for lack of permissions, that’s a lot of papers! I got twelve years in last Friday. It’s a start.) Because I am a librarian, and they’ll yank my MLS if I don’t get obsessive about metadata, I’ve been pulling out abstracts whenever possible. Whatever tool produced the PDFs for the two-column ACM layout that a lot of the more recent articles are in, it produced PDFs that don’t grok columnar layouts. I can’t cut and paste the abstract without culling a lot of garbage from the second column.
(This is one reason PDF is terrible for text-mining. Most halfway-sophisticated text-mining apps pay attention to a word’s context. When PDF mucks up the context by not indicating where the logical line ends are, as opposed to the layout-driven ones, it mucks up text-mining engines.)
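For the curious, here is a minimal sketch of the column problem in Python. Everything in it is an assumption made for illustration: the `Fragment` class, the coordinates, the 612-point page width, and the sample text are all invented, and it is not how any particular PDF tool works. It only shows why reading positioned text straight across the page interleaves the two columns, and how splitting at the page midpoint and rejoining hyphenated line ends can recover something closer to the logical text.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One line of positioned text, as a PDF might store it (hypothetical)."""
    x: float   # left edge of the line on the page
    y: float   # distance from the top of the page
    text: str


def naive_order(fragments):
    """Read straight across the page: top to bottom, then left to right."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_order(fragments, page_width):
    """Read the left column top to bottom, then the right column.

    Assumes exactly two columns split at the page midpoint.
    """
    mid = page_width / 2
    left = sorted((f for f in fragments if f.x < mid), key=lambda f: f.y)
    right = sorted((f for f in fragments if f.x >= mid), key=lambda f: f.y)
    return [f.text for f in left + right]


def join_lines(lines):
    """Merge layout lines into running text.

    A line ending in a hyphen is glued to the next line. This is naive:
    it will also swallow hyphens that genuinely belong in a word.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if out.endswith("-"):
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out


PAGE_WIDTH = 612  # assumed page width in points

# Invented sample: an abstract in the left column, body text in the right.
fragments = [
    Fragment(72, 100, "Abstract. We describe a reposi-"),
    Fragment(320, 100, "1. Introduction"),
    Fragment(72, 112, "tory workflow for batch import."),
    Fragment(320, 112, "Repositories collect papers."),
]

interleaved = naive_order(fragments)
by_column = column_order(fragments, PAGE_WIDTH)
abstract = join_lines(by_column[:2])

print(interleaved)
print(abstract)
```

The naive pass puts "1. Introduction" between the two halves of the abstract, which is the second-column garbage in question. The midpoint split is a crude heuristic; full-width titles, figures, and three-column layouts would all defeat it.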
I have been retyping those abstracts into the metadata. I’m just that obsessive. Let’s not talk about how much time that cost me, hm? All because of a bad, bad, bad format. My able co-presenter Tim Donohue is working on a DSpace filter from Office docs to corresponding OpenOffice formats, and I can’t wait, I’m so sick of PDF.
I still take PDFs into the repository. I take PDFs that are nothing but mashed-up page scans. I take PDFs that look like warmed-over death. I even make PDFs for the repository, usually out of slideshows or Word documents. (Yes, yes, I ingest the original too. I’m not stupid about PDF.)
Why? Because I can’t afford to be picky, and more often than not, PDF is all there is. That’s reality. I’d love to be a format snob. I can’t, because format-snobbery means many things disappear forever, in any format. Capture first—then we can talk about preservation. Maybe we can even talk about pouring time into pulling something useful out of the PDF. But capture first.
More fundamentally, I can’t be a format snob because I don’t work in publishing any more. (I strongly believe I will again someday, because I think academic libraries will be the university press reborn, but my vindication on this point is still a decade or more away. I expect CavLec will still be here in 2016, so everyone can laugh at my hubris then.) I don’t control editing or typesetting practices. I triply don’t control author behavior. Until the rules of the game change, repository-rats are beggars, and beggars don’t get format menus.
A Portico rep dropped by to give us a presentation last week (you can read about it via my boss). Portico has been smart about this. For publishers that produce markup in some form or other, they’re grabbing it, NLM-DTDifying it, and storing that alongside the original.
Yet even Portico has to store PDF too. That’s just how it is. We do our best to preserve what we have, and what we have is PDF. Doesn’t make me like the format any better, of course, but I do wish folks would allow that I’m not an Adobe tool; I’m just being realistic.