Kevin pointed me at a recent ebook apologia and asked my opinion. I’ll give it, naturally, but before I do I want to address what the apologia’s author says about PDF.
I wish PDF-for-ebooks would fold up and die, but it won’t. Therefore I need to explain why, except in certain limited circumstances, relying on it for electronic books is a terrible idea.
First, though, I need to explain that I’m not miffed enough at Adobe for its constant anti-XML and anti-OEB FUD to lambaste them merely on that basis. I’m certainly miffed, no question about that; I wish they’d cut it the heck out. But that would be expecting a company to do what’s best for its customers (publishers and readers alike) rather than itself, and what company does that these days?
Much has been said about PDF’s unfitness for onscreen reading. I need not and shall not reiterate it. I merely note that PDF is inextricably based on a “page” unit. Unchangeable page boundaries are stupid. They are artifacts of the paper medium, meaningless in themselves, subject to change even among different editions of the same book. Yes, many information-management techniques (e.g. indexing) are based on them, but that doesn’t mean the page is an unquestioned good; it means that the techniques need to evolve.
No, it isn’t the much-touted “reading experience” that frustrates me about PDF. Partly, it’s Adobe successfully creating the widespread illusion (mentioned in the apologia) that PDF is non-proprietary.
Of course it’s proprietary. It is completely owned by Adobe. “PDF” is an Adobe trademark, for Pete’s sake! (I think. It’s not listed in Adobe’s official trademark list, though the PDF logo is. Anybody got the real scoop on that?) Read my lips: PDF IS PROPRIETARY. I tell you three times: PDF IS PROPRIETARY. PDF IS PROPRIETARY. PDF IS PROPRIETARY. Don’t make me repeat myself on this again, please; I am enormously tired of saying it.
Yes, there are non-proprietary PDF writers available at the moment (and what Apple was thinking of when they built PDF into OS X I am sure I don’t understand; I think it’ll bite ’em one of these fine days), but they are available on Adobe’s sufferance only. Adobe can at any time come up with a new version of the format and refuse permission for non-proprietary implementations. I’m not sure, but I think that given their patent portfolio they could yank permission on existing non-proprietary implementations. Will they? Your guess is as good as mine. Do you want your ebooks held hostage to that guess?
(Moreover, no matter your guess on general PDF, I will lay any odds you care to name that the Adobe ebook-writer implementation will part ways from PDF one of these days, and it will be proprietary. Too good a chance to soak on-the-hook publishers and hanger-on implementors such as GoReader at the same time.)
A more serious flaw with PDF is its accessibility problems. Again, much has been written; I need not comment, except to say that Adobe has had plenty of time to solve these problems and just plain hasn’t.
My chief quarrel with PDF: it is a dead-end format; it, as well as its inputs, is utterly putrid for archival purposes. Once you have a PDF, you are not guaranteed to be able to back it out to anything useful. (If you’ve saved it right, you can get at the PostScript, admittedly. Sure you save your PDFs right? And PostScript is no treat to convert to anything useful, either.) It’s simply not futureproof, something that ebooks ought to be.
How do you get a PDF in the first place? Via typesetting. (Some people word-process; doesn’t affect my argument.) If your archived PDF dies, or Adobe comes out with a nifty new feature that you want to add to an already-published ebook, you have to return to your typesetting files. If your typesetting files are no longer usable, you are out of luck.
Typesetting systems come and go (except possibly for TeX, which seems to be forever). Penta died recently. If Quark can’t get its act together, Adobe is going to eat its lunch with InDesign. If you use Microsoft Word for typesetting, shame on you—and your files aren’t guaranteed to last either.
Allow me a digression, to tell my favorite archive story: Two jobs ago, I handled an XML conversion project for a two-volume English dictionary. Source material was archived typesetting files from a no-longer-extant typesetting system. I looked over the documentation (do you have documentation for your typesetting system, to aid in converting old files? you should!) and got started writing my usual ton of regular expressions. Something odd, though: regexes that should have worked didn’t; the conversion just wasn’t clean.
On closer inspection, we discovered that some evil little gremlin had gone through the files at some point (during archival?) and randomly deleted characters here and there, in content and in typesetting codes alike. The client had only one master copy of the data, and the typesetting system (along with the company that had employed it) was dead and buried, so new source data could not be generated.
I could work around missing characters in the typesetting codes, and I did. The client, however, had no choice but to proofread and correct the entire dictionary. That is non-future-proofing with a vengeance. That is what you are letting yourself in for if you rely on PDF and typesetting files as your future.
Someone out there has been howling while reading all this. Possibly many someones. “Microsoft Reader and Gemstar’s formats are just as proprietary as PDF! How dare you claim they’re an improvement, woman?”
Simple. As end-user formats, they aren’t. What feeds into MS Reader and Gemstar alike, however, isn’t (necessarily) proprietary. It’s material tied to a genuinely open specification, the Open eBook Publication Structure, and the open standards (e.g. XML and CSS) underneath it. If UnHappyBook bites the dust, and you made a heavy bet on UnHappyBook, you’re not sunk if you got to UnHappyBook via OEB. You can use those OEB files (you did make them, right? and you did keep them? if you didn’t, I have no patience for you, you moron; use PDF and welcome) to generate new ebooks for other platforms.
Speaking strictly about text, well-constructed OEB 1.2-compliant XML files are a reasonable archive. They’re typesettable. They’re easily adapted to HTML for the Web. They’re accessible. They’re plain-text, so if all else fails you can pull the text out of them (with a single regular expression, if you’re slick) and try again. “All else” is unlikely to fail, however; good XML is pretty much as futureproof as text gets.
(Note how careful I’m being here. Not all XML qualifies as “well-constructed.” If you create garbage XML, you’re just about as badly off as someone with garbage typesetting files.)
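For the curious, here is roughly what that “single regular expression” last resort might look like. This is a minimal sketch, not a real conversion tool: the sample markup is invented for illustration, and the pattern assumes no stray `>` characters inside comments, attribute values, or CDATA sections.

```python
import html
import re

# Hypothetical OEB-style fragment, invented for illustration only.
SAMPLE = """<?xml version="1.0"?>
<html>
  <body>
    <h1>Chapter 1</h1>
    <p>It was a dark &amp; stormy <em>night</em>.</p>
  </body>
</html>"""

# The "single regular expression": anything from "<" to the next ">".
# Assumption: no ">" appears inside comments, attributes, or CDATA.
TAG = re.compile(r"<[^>]+>")


def extract_text(xml_source: str) -> str:
    """Strip markup, decode character entities, and collapse whitespace."""
    no_tags = TAG.sub("", xml_source)   # drop tags first...
    decoded = html.unescape(no_tags)    # ...then decode, so &lt; survives as text
    return " ".join(decoded.split())


print(extract_text(SAMPLE))
# prints: Chapter 1 It was a dark & stormy night.
```

Replacing tags with nothing keeps inline markup such as `<em>` from leaving stray spaces before punctuation, at the price of gluing words together wherever two block elements share a line. A proper XML parser is the better tool whenever the files are still well-formed; the regex is for the day they aren’t.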
I’m less happy about JPEGs, which are vastly too lossy for print publishing. Keep your high-resolution TIFFs with your archive, definitely. But I digress.
When the latest hot OEB-based ebook reading system comes out, all you will need to do is grab its compiler and use it on your existing OEB files. No retypesetting, no data-crunching, nada. When the OEBPS comes out with its latest cool featureset (1.2 is the latest; much new CSS goodness), the likelihood that you will have to start over from scratch is—zero. Will you have work to do? Probably. But it’s incremental work. Fancying up a stylesheet. Redoing the package file. Creating better navigation. Minor tasks that simply aren’t remotely comparable to the effort involved in a complete retypesetting. And if you’re happy with what you’ve got, you may not need to change anything at all.
New edition? Changes in the text? No sweat. Change your XML (your package file and CSS are probably fine as is), pass it through the OEB compilers, bam! new ebook. Compare the cost of that to the cost of the total retypesetting you need for a new PDF, and then tell me that a PDF workflow is cheaper, long-term. Ha, I tell you. Ha bloody ha!
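To make the shape of that workflow concrete, here is a toy sketch of one XML source feeding several outputs. Everything in it is an assumption for illustration: the `<book>` markup is far simpler than a real OEB publication, and `to_plain_text` and `to_html` are stand-ins I made up, not actual OEB compilers.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified source (not a real OEB publication).
SOURCE = "<book><title>Example</title><p>First edition text.</p></book>"


def to_plain_text(root):
    """Stand-in 'compiler' 1: one line per element that carries text."""
    return "\n".join(
        el.text.strip() for el in root.iter() if el.text and el.text.strip()
    )


def to_html(root):
    """Stand-in 'compiler' 2: a minimal HTML page."""
    title = html.escape(root.findtext("title", default=""))
    paras = "".join(f"<p>{html.escape(p.text or '')}</p>" for p in root.iter("p"))
    return f"<html><head><title>{title}</title></head><body>{paras}</body></html>"


# Adding a new output platform means adding one entry here.
TARGETS = {"txt": to_plain_text, "html": to_html}


def build_all(xml_source):
    """Parse the source once and run every registered target over it."""
    root = ET.fromstring(xml_source)
    return {name: compile_target(root) for name, compile_target in TARGETS.items()}


first = build_all(SOURCE)
# New edition: change the source once, then rerun every target.
second = build_all(SOURCE.replace("First edition", "Second edition"))
print(second["txt"])
```

The point is where the editing happens: the text changes in exactly one place, and every output is regenerated from it rather than retypeset.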
Given all that, why would any sane person bet the farm on PDF?
Well, sane persons publishing scientific/technical/medical texts, foreign-language texts, or design-critical texts (e.g. high-design textbooks) may genuinely have no choice. They get a free pass from me, for now.
Sane persons may also make limited bets on PDF, just as they might on Microsoft Reader or Cytale or whatever. As long as they understand that these formats aren’t futureproof, and have plans in place to archive and migrate their typesetting data as needed, I can understand that decision also. If they have an XML-based typesetting workflow that produces PDF as one byproduct, or a non-XML-based typesetting workflow that both accepts XML as input and produces it as output (as well as outputting PDF), they’re golden. Good for them.
PDF archives, however, frost my britches. Honestly. Bad, bad, bad, bad idea. Could be worse; we do at least have documentation of the current format, so when Adobe abandons PDF however many years in the future (doesn’t take a crystal ball to predict that), we have a fighting chance to rescue the data. Still. Not good. Very not good.
Two closing thoughts. First, the calls for the OEBF to spec out an end-user binary format to compete with or supplant .lit, .rb, PDF, and the like are misguided. End-user binary formats are not long for this world; they can’t be. They’ll change as technology changes, just as they should—in fact, spec-ing out such a thing could prove a barrier to ebook innovation, which would unquestionably be bad.
A single end-user binary format also can’t easily accommodate different device specifications and different reading audiences (the visually-disabled being a cogent example). This is just plain awful. Unacceptable.
What the OEBF has done instead is set up an archivable system of human-readable and -editable files that can flexibly be used to generate plenty of end-user binary formats, now and in the future. Good call. Shame people don’t realize what a good call it was. (In their defense, I didn’t think it through at first either. Took some patient explaining before I got it.)
Second, data durability is a powerful argument for the kind of source-code escrow that people like Lawrence Lessig want to make a condition of copyright protection for software. Much (though by no means all; possibly not even most) of the data loss librarians and other information management experts currently wring their hands over is due to inability to decipher obsolete file formats. Source code escrow for the software that generates a given file format would save data from extinction.
Which has been my argument throughout this obscenely long post. Don’t settle on a system that will someday make your data extinct!



