Conversion versus scanning
It feels odd to be giving electronic-publishing advice again, I must say. Thought I’d left all that behind. Still… A specific group of journal publishers has even more reason to be wary of a Google deal than the usual run: any publisher that has kept accurate electronic files for its journal.
“Accurate” is a key word. A journal that makes its final corrections on plates, never feeding them back into the typesetting or archival files, does not have accurate electronic files. This may seem a stupid thing to have to say, but given the typical disconnect between publishing execs and peasants, as well as the disconnect between short-term workflow expediency and long-term electronic-file usefulness, it really does need saying. Back in the day, fellow conversion peasants and I used to commiserate regularly over publishers who thought they’d sent us accurate files but hadn’t.
Google has only one digitization method: scan and OCR. A lot of vendors in this space use this method; I’m not knocking it. Sometimes it’s all that can reasonably be done. A journal with accurate electronic files, however, is almost always better off converting those to desired output and archival formats rather than going the scan-and-OCR route.
Why?
- Accuracy. Even the best OCR leaves lots of scannos in its wake. I admit that we’re putting up with horrendous OCR accuracy in a good many article databases, but why put up with it for your journal if you don’t have to?
- Ceci n’est pas une lettre. A scan-to-PDF isn’t text, but a picture of text; computers can’t search pictures. When OCR is included in a scan-to-PDF (which it isn’t always; check a dissertation database near you), it’s buried somewhere in the PDF binary. This is very much not an ideal situation. It’s not ideal for full-text indexing and searching, it’s not ideal for preservation, and it’s even lousy for disk space. Building a PDF from a typesetting file is invariably a better (smaller, more searchable, more preservation-friendly) option.
- Retention of computer-accessible document structure. Okay, okay, “XML-friendliness.” Most of the time it’s easier and cheaper to convert a typesetting file to good, useful XML (please note the adjectives; not all XML is created equal) than to work with a scanned document. When it’s not easier, XML is most likely an impossible target either way.
Why does this matter? Well, consider searchability again. A search engine can only be told to add extra relevance weight to search terms in titles, abstracts, and section headings if it can reliably pick those out. It can reliably pick those out of a good XML document; they’re labeled. (I can pick them out reliably from a competently-produced typesetting file, because typesetters use styles. This is the secret to creating good XML from typesetting files!) To do so with a PDF (even a good PDF), it’s got to behave like a human eye and brain, detecting typography changes. Trust me, most times this doesn’t happen.
- Format evolution. Back in the day, I sent out a lot of journal articles in SGML, usually some variant of the ISO 12083 DTD. ISO 12083 has gone the way of the dodo; the NLM DTD suite now reigns supreme. Are the publishers I sent those SGML files to crying? Of course not. If they see a need, they can build an automated transform to NLM at quite low cost, and consider throwing in some up-migration (citation parsing, perhaps) as long as they’re touching the files anyhow. Try that with a scan-to-PDF.
It is true, mind you, that keeping typesetting files only is not ideal future-proofing. Good old Penta has also gone the way of the dodo. Quark is swiftly following in its wake. I’ve had to perform data-capture from ancient typesetting formats I’d never even heard of. Even so, it proved more cost-effective to hire the company I worked for to do data rescue than to go the scan-OCR-proof-markup route. Keep your eyes open with regard to your typesetting practices (it’s cheaper and easier to rescue data from Penta or weird variants of TeX than from Quark), but either way, keep your files, and keep them accurate!
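To make the searchability point from the list above concrete, here is a minimal Python sketch of relevance weighting over labeled XML. Everything in it is made up for illustration: the element names (`title`, `abstract`, `heading`), the weights, and the `score` function come from no real DTD or search engine. It is a toy, not how any production indexer works.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical article markup; the element names are illustrative only.
ARTICLE = """<article>
  <title>Scanning versus conversion for journal archives</title>
  <abstract>We compare OCR output with converted typesetting files.</abstract>
  <section>
    <heading>OCR accuracy</heading>
    <p>Scanning introduces errors that conversion avoids.</p>
  </section>
</article>"""

# Assumed weights: a hit in a title counts for more than a hit in body text.
FIELD_WEIGHTS = {"title": 5.0, "abstract": 3.0, "heading": 2.0}
DEFAULT_WEIGHT = 1.0


def score(xml_text, term):
    """Sum the weighted occurrences of `term` across every labeled element."""
    root = ET.fromstring(xml_text)
    term = term.lower()
    total = 0.0
    for elem in root.iter():
        # Only the element's own leading text is examined; that is enough
        # for this sample, which has no mixed content.
        words = re.findall(r"\w+", (elem.text or "").lower())
        total += words.count(term) * FIELD_WEIGHTS.get(elem.tag, DEFAULT_WEIGHT)
    return total


if __name__ == "__main__":
    for query in ("ocr", "conversion"):
        print(query, score(ARTICLE, query))
```

The weighting works only because each element announces what it is. A scanned page offers no such labels, so every hit would have to count the same.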
Any journal with a good run of electronic files is shooting itself in the foot if it goes with Google.