Markup challenges for ebooks
As long as Leigh is sending eyes my way, I might as well give them something other than snark to read.
I implied previously that we over in OEBF-land have been butting our heads against the current technological ceiling in markup. Perhaps a discussion of one or two of our worst problems is in order.
Random vocabulary mixing
This is huge. This is huger than huge. This is freakin’ colossal, okay?
Far too many people believe that the OEBPS rests entirely on XHTML. This is not the case. The Authoring Group, to its immense and enduring credit (and I wasn’t on the Authoring Group, so I can say that), realized almost from the beginning that XHTML was not going to cut it as the end-all and be-all of book production vocabularies.
So they said, okay, let’s use XHTML as a base tagset for the unambitious, and allow other XML tagsets as long as their visual-display behavior is defined via CSS. (And then they made this immensely smart decision all but useless by not allowing hierarchical CSS selectors. But I digress—and we’re about to fix that error, I promise.)
But that only gets you so far, if you’re a content creator. STM publishers (that’s Scientific-Technical-Medical, y’all non-book-geeks) can’t very well style MathML tags with CSS. Lit-critters and linguists will want to do more with their TEI tagging than make it pretty. XLink could be highly useful juju for ebooks, but its usefulness has got diddly-squat to do with CSS.
The long and short of it is that ebooks need to be able to do much of the same stuff that XHTML Modularization is designed to address, but they need to be able to do it without necessarily involving XHTML at all. If I’ve rolled my own STM journal DTD based on ISO 12083 with the addition of ChemML and MathML data islands, I see no reason I should have to turn my ChemML and MathML into images to display them in an ebook reading system that’s supposed to be based on XML processing anyway.
In my more innocent days, I thought XML Namespaces solved this problem. How wrong I was. I can set apart my ChemML and MathML islands from the rest of my markup using namespaces, sure. That’s no guarantee at all that anything will be done with those islands. The namespace URI doesn’t have to mean anything. Often, namespace URIs (insofar as they mean anything at all) mean more than one thing, with XHTML itself being the biggest, stinkiest example. (XHTML 1.0 and 1.1 share the same namespace URI despite not being the same vocabulary.)
Worse, the namespace URI doesn’t say anything at all about proper handling of a particular data island. The MathML namespace URI, for instance, doesn’t tell a processor whether the markup in a MathML data island is to be displayed visually or fed to Mathematica for number-crunching. That’s a huge difference, and the only way to get at it is to examine the markup itself. So what’s the use of the namespace label, hm?
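For the code-minded, here is a rough sketch of the problem in Python, using only the standard library. The "urn:example:journal" vocabulary is made up, and the two sets of element names are my own crude heuristic for telling display-oriented MathML from computation-oriented MathML, not anything a spec blesses. Finding the island by namespace is trivial; working out what to do with it means poking at the element names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Hypothetical article; "urn:example:journal" is an invented vocabulary.
DOC = """\
<article xmlns="urn:example:journal"
         xmlns:m="http://www.w3.org/1998/Math/MathML">
  <para>Energy and mass are related by
    <m:math><m:mi>E</m:mi><m:mo>=</m:mo><m:mi>m</m:mi><m:msup><m:mi>c</m:mi><m:mn>2</m:mn></m:msup></m:math>
  </para>
</article>
"""

# Heuristic only (an assumption for this sketch): a few element names that
# suggest display markup versus markup meant to be computed with.
PRESENTATION_HINTS = {"mi", "mo", "mn", "mrow", "msup", "mfrac"}
CONTENT_HINTS = {"apply", "ci", "cn", "plus", "times", "eq"}


def split_tag(tag):
    """Split ElementTree's '{uri}local' tag form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


def count_namespaces(xml_text):
    """Count elements per namespace URI. This is all the namespace gives us."""
    root = ET.fromstring(xml_text)
    return Counter(split_tag(el.tag)[0] for el in root.iter())


def guess_mathml_flavor(xml_text):
    """Guess the intent of MathML islands by looking at element names."""
    root = ET.fromstring(xml_text)
    names = set()
    for el in root.iter():
        uri, local = split_tag(el.tag)
        if uri == MATHML_NS:
            names.add(local)
    if names & CONTENT_HINTS:
        return "content"
    if names & PRESENTATION_HINTS:
        return "presentation"
    return "unknown"


counts = count_namespaces(DOC)
flavor = guess_mathml_flavor(DOC)
print(dict(counts), flavor)
```

The namespace count tells a reading system that a MathML island exists. Only the second function, which ignores the URI label and reads the markup, says anything about handling.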
What we’re trying to come up with is namespace metadata. It’s an uphill battle. We can’t very well afford to become vocabulary arbiters, saying that some vocabularies are acceptable in ebooks and others aren’t. We’d spend all our time doing nothing but studying and debating vocabularies. Plus, inevitably, not all reading systems need to support all XML vocabularies. (Beach Blanket Bob doesn’t need his reading system to handle MathML before he can read his Ludlum novels.)
Cross this issue with…
MIME types
… and you’ve got a real lulu. Zen koan, markup style: What is the MIME type of a multiple-namespace XML document?
text/xml, you say? Not so fast. What if one of the namespaces is XHTML, which has its own MIME type? What if another is MathML, which also has its own MIME type?
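To make the koan concrete, here is a small Python sketch. The lookup table is my own illustration, pairing two namespace URIs with the media types I understand to be registered for them; the sample document is invented. Ask the function for "the" MIME type of a mixed document and it hands back a list, which is exactly the trouble.

```python
import xml.etree.ElementTree as ET

# Illustrative table, not exhaustive: namespace URI -> media type.
MIME_BY_NAMESPACE = {
    "http://www.w3.org/1999/xhtml": "application/xhtml+xml",
    "http://www.w3.org/1998/Math/MathML": "application/mathml+xml",
}

# Invented sample: an XHTML page with a MathML island in it.
MIXED = """\
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>The answer is
      <math xmlns="http://www.w3.org/1998/Math/MathML"><mn>42</mn></math>
    </p>
  </body>
</html>
"""


def candidate_mime_types(xml_text):
    """List every media type that some namespace in the document could claim."""
    root = ET.fromstring(xml_text)
    found = []
    for el in root.iter():
        # ElementTree spells namespaced tags as '{uri}local'.
        uri = el.tag[1:].split("}", 1)[0] if el.tag.startswith("{") else ""
        mime = MIME_BY_NAMESPACE.get(uri)
        if mime and mime not in found:
            found.append(mime)
    # With no recognized namespace, fall back to the generic XML type.
    return found or ["text/xml"]


candidates = candidate_mime_types(MIXED)
print(candidates)
```

Any rule for choosing a single winner from that list (root element wins, most elements wins, always say text/xml) throws away information that a reading system needs.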
The OEBPS uses MIME types in its “package file,” which was designed in part so that reading systems could figure out how to handle a publication without having to sniff every single file included. This system worked only moderately well to start with, and got creakier every time we re-examined it. At the moment, for XML files, it’s just plain broken. MIME types just don’t carry sufficient information.
Which leads us to…
Packaging and packaging metadata
A typical ebook has a lot of files in it. XML files, CSS files, JPEG and PNG files, possibly other files (there’s a core set of supported file types, and a way to include other file types as long as they fall back to a core file type—go read the spec for more). Once again, it’s not a good idea to force a reading system to figure out what to do with all these files based on file-sniffing, not if we want ebook gadgets smaller than laptops.
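Here is roughly what “describe the files so nobody has to sniff them” looks like from a reading system’s side, sketched in Python. Treat everything in it as an assumption for illustration: the manifest is a simplified stand-in loosely modeled on a package file, the element and attribute names are mine, and the core-type set is a placeholder to be checked against the spec. The reading system looks up an item, and if it can’t handle the declared type, it follows the fallback chain until it reaches something it can.

```python
import xml.etree.ElementTree as ET

# Placeholder core set (an assumption for this sketch; check the spec).
CORE_TYPES = {"text/x-oeb1-document", "text/x-oeb1-css", "image/jpeg", "image/png"}

# Simplified, hypothetical manifest: one item per file, with a declared type.
MANIFEST = """\
<manifest>
  <item id="ch1" href="chapter1.xml" media-type="text/x-oeb1-document"/>
  <item id="style" href="book.css" media-type="text/x-oeb1-css"/>
  <item id="fig1-svg" href="fig1.svg" media-type="image/svg+xml" fallback="fig1-png"/>
  <item id="fig1-png" href="fig1.png" media-type="image/png"/>
</manifest>
"""


def load_items(xml_text):
    """Index manifest items by id, keeping their attributes as plain dicts."""
    root = ET.fromstring(xml_text)
    return {el.get("id"): dict(el.attrib) for el in root.iter("item")}


def resolve(items, item_id):
    """Follow fallback links until an item with a core media type turns up.

    Returns the href to render, or None if the chain dead-ends or loops.
    """
    seen = set()
    while item_id is not None and item_id not in seen:
        seen.add(item_id)
        item = items.get(item_id)
        if item is None:
            return None
        if item["media-type"] in CORE_TYPES:
            return item["href"]
        item_id = item.get("fallback")
    return None


items = load_items(MANIFEST)
print(resolve(items, "fig1-svg"))
```

No file gets opened at any point: the decision rests entirely on the description, which is why the quality of that description matters so much.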
So we gotta come up with a way to describe those files. (My husband is glaring at me because he needs the ’puter, so let me be brief here.) This we have done, but as time passes, we discover more and more description that we want to include. A rigidly-defined XML document, such as that created for OEBPS 1.0, just won’t cut it any longer. What architecture should we adopt, then?
It’s looking like RDF, despite some internal dissension. Ye xml-dev denizens, keep an eye out for my colleague Garret Wilson, who has done an amazing lot of work on a packaging spec useful (we hope) not just to us, but to you as well.
Okay, dear, I’m done now.