Archive for March, 2002

30 Martii 2002

Graças a Deus…

A few years ago when I took Portuguese over the summer, we watched a documentary on Brazilian favelas. I haven’t managed to forget the woman who proclaimed, “Graças a Deus, não sou religiosa.”

Seems a strange thing to pass along the day before Easter, and I apologize to the gentle AKMA, but folks watching today’s news will perhaps understand why it sprang to mind.

Enough temptation to hatred and groupthink I have; religion I do not need to add to that devil’s brew. So I repeat: Graças a Deus, não sou religiosa.

29 Martii 2002

The ideal browser

The ideal web browser would have:

  • Mozilla’s support for standards (with a dash of Opera here and there).
  • Opera’s tabbed browsing.
  • IE5 for Mac’s ability to pull up URLs from history while I type, and go to them with a mere tap of the Enter key. (Opera makes you arrow-down to the page you want. I hate that.)
  • Opera’s disable-popups preference. Boy, is that nice.
  • IE6 for Windows’s support for P3P.
  • iCab’s instant-page-validation feature.

Just some musings as I work from the G4, since the laptop is still hosed. (Light at end of tunnel. Rescue CD-ROM arriving by mail shortly.) I’m going to be tweaking the Caveat Lector design once the laptop is functional; it looks like garbage on IE5/Mac. Fixed that (on my desktop machine); now need to be sure the fixes work on the other side of the ocean.

Some things, of course, are byproducts of inconsistent CSS support across platforms, and for the most part they’re minor enough that I don’t worry about them. The first paragraph of a block quotation should not be indented. Works in Opera, nowhere else. The first paragraph inside a list item shouldn’t indent either, but it does on IE5 Win (or maybe in Opera? Can’t check). Depending on your browser, entry headings may be all-small-caps (wrong) or caps-small-caps (correct). And I give up on consistent font sizes. Increase or decrease font size in your browser until you’re comfortable; it’s what I do, and it works right nice.

Microsoft, the White Knight?

Let’s see how much more trouble I can get myself into, hm? From Doc Weinberger’s weblog:

So, here’s my plea to Bill Gates: Be the white knight. Swing your mighty sword in favor of building the most vibrant marketplace for ideas and creativity the earth has ever seen. Storm the halls of Congress. Make it your personal campaign, Bill.

Sorry, Doc W, but my experience with one Microsoft ebookist suggests that you’ve got a lost cause on your hands there.

Disclaimers: The occasion I am about to recount occurred well over a year ago, and at least one ebook-team reorg has happened at Microsoft since then. Things may well have changed. I CATEGORICALLY REFUSE to disclose the identity of the Microsoft employee central to my tale. I won’t even disclose hir gender (and the ebook team I knew at the time was split pretty evenly between men and women), so you’ll have to cope with weird pronouns when I refer to er.

The only NDA I think I may have signed with Microsoft (and I may not have signed any; I honestly can’t remember) would have been signed well after this occasion, so I’m not exposing myself to legal ugliness as far as I can tell. Moreover, my conclusion could be suspected by any half-aware person; all I am really doing is offering anecdotal evidence in support of it.

And a prefatory note: Microsoft has been the single strongest supporter of the Open eBook Forum, committing considerable sums of money and considerable time from several brilliant techies. I frankly doubt the OEBF would have survived without Microsoft. I can also say honestly that I have never, not once, seen even the subtlest attempt by Microsoft to strongarm any OEBF action into something that would benefit Microsoft specifically, or cause specific harm to any of Microsoft’s ebookspace competitors. Whatever their private plans, they’ve played fair so far in ebookspace.

Enough disclaimer. Enough preface. On to our story.

But first… a bit of technical background. (I’ll try to make it painless.) “OEB” is often spoken of as an “ebook format.” This usage is somewhat dubious, because the OEBPS Publication is often (almost always, in fact) not the final digital form the ebook’s content takes. (When you consider that there’s no especially good way to add encryption to a miscellaneous collection of files, this will perhaps seem less surprising.)

The OEBPS says that a “reading system” is a black box that takes in the collection of files (text, images, etc.) that constitutes an OEBPS Publication and spits out a reading experience. How the reading system does this—the details of the black box, or even whether it’s one black box or several—is entirely up to the reading system designers.

Several reading systems, including Microsoft’s, have split the task into two tasks accomplished wholly separately: boiling down the OEBPS Publication to a single binary file, and presenting that binary file to the person reading. (In Microsoft’s case, the former task is accomplished by a software library called litgen.dll, slick GUI wrappers for which are made by my ex-employer. The latter task is what Microsoft Reader accomplishes.)

This is a perfectly acceptable solution vis-a-vis the OEBPS; it’s not the only acceptable solution, but it’s fine. Moreover, the OEBPS says nothing about what these intermediate binary files look like, act like—or how they compete in the marketplace. (How could it? The OEBPS doesn’t say these binary files have to exist!)

Unfortunately, these binary files (some call them “delivery formats,” a usage I can live with) have gained the status of “format” in a lot of people’s minds. Since people (especially journalists, in my experience) like a good fight, these delivery formats are perceived as competing, even though they can be generated from the same OEBPS Publication and are better thought of as implementation details of the reading systems, irrelevant in and of themselves to reading-system competition in the marketplace.

(I exaggerate the interoperability of OEBPS Publications slightly. Reading system quirks mean that an OEBPS Publication tailored for one reading system won’t look ideal on another. The differences are far less pervasive and difficult to work around than those in web browsers, however.)

So by its silence on the topic of delivery format, the OEBPS has tried to eliminate ebook-market fragmentation on the basis of initial coding of content (which was the point of the OEBPS in the first place), but has created that same fragmentation on the basis of delivery format!

(This is an enormously difficult problem to solve. I call it insoluble, myself, and think the OEBPS did the right thing by staying away from it. But that’s a topic for another blog.)

So. We now have Microsoft’s delivery format competing with others, just as we did at the NIST eBook 2000 conference in Washington DC…

A month or two before that conference, I had gotten a call out of the blue at work from a Microsoft recruiter. (I know who spurred that call now, but at the time I felt like a cartoon character flattened by a totally unexpected ten-ton anvil.) Shortly before I left for DC, some Microsoft ebookists—ones I didn’t already know from OEBF work—invited me out to dinner on arrival night.

“Self, you’re being sussed out for a job offer,” I told myself. “Be careful. You’d rather slay a fire-breathing dragon than work for the Evil Empire.” (Both my husband and a former co-worker can verify my fascinated horror at the recruitment call and the invitation, in case I am suspected of either sour grapes or undeserved self-praise here. I assure you, I tellin’ it like it wuz.)

I accepted the invitation, but found the first occasion I could at dinner to explain that I had a husband in graduate school and was therefore not mobile. As we parted at the Microsofties’ hotel (I was staying with my in-laws, who live in the DC area), the person who had done most of the talking said that if I were amenable, which e quite understood I was unable to be, e would “hire me on the spot.” I thanked er for hir good opinion of me and left, congratulating myself silently on how gracefully I had managed to extricate myself from the situation (social grace, especially under pressure, is not a hallmark of mine).

E hadn’t given up, though. E took me to the food court at the Reagan Center the next day for coffee (hot chocolate for me) and a talk. The talk was an ersatz job interview; e didn’t try to hide that, or even be subtle about it. E did, though, talk pretty freely about Microsoft’s strategy in the ebook arena.

Y’all won’t be surprised, after all the buildup I’ve done, to find out that that strategy consisted of pushing other reading systems out of the market by aggressive development and promotion of Microsoft’s delivery format. (Secondary strategy, which I got the impression hadn’t been very thoroughly thought out, involved bypassing the OEBPS by creating direct-to-MS-delivery-format filters for Word and Quark, two ubiquitous tools in print publishing.) To give er all due credit, though, e indicated firmly and repeatedly (and, I believe, truthfully) that making MS’s delivery format as rich and powerful as possible was the cornerstone of this strategy. Honey, not vinegar.

But I wouldn’t call on Bill Gates and crew to open up the creation and dissemination of content if I were Doc W. They weren’t thinking in those terms at eBook 2000, and I’d be plentifully surprised if they are now. Control, control, control, with a dash of lack of choice.

Caveat lector! Caveat lector! Caveat lector!

Hi, AKMA! (Ebooks and copyright)

I shall have to brush up my toes. AKMA kindly emailed me to tell me he’d linked to me.

He says he’s interested in the copyright morass shambling about in the USA right now. (A shambling morass. I don’t know where I’m getting my mental images this morning.) I confess rather shamefacedly that in my ebookist persona I have intentionally stayed well away from what are called “digital rights management” issues, not so much for lack of interest as in the awareness that I couldn’t possibly keep my temper in check, and would no doubt do more harm than good for my own side.

(How far away? I’ll tell a story on myself. At an OEBF working group summit last year in Chicago, attended by somewhere around 75 people, I put together an afternoon’s seminar on the Open eBook Publication Structure, intentionally aimed at the non-technical. My talk was up against an open forum on the DMCA. The minute I learned that, I said to myself in my best holdover Southern accent, “Self, ain’t nobody gonna be at your talk.” I was right, of course. Three attendees, one of whom didn’t need it.)

So what you’ll read here about ebooks is mostly the technical details of putting together content so that it both works as an ebook and lasts past any particular generation of ebook-reading gadgets. Those are the areas in which I possess some expertise. In the next few days, I’ll try to dig up some links that address AKMA’s concerns.

I sat up and took negative notice of copyright some years ago, when my husband first got involved with Tolkien’s languages. Not to bore you all with a long, sad, and often ugly history: the situation is that the Tolkien Estate has gifted a small number of people with access to JRRT’s unpublished linguistic writings, and over time both this group and the TE have done their best to discourage, threaten, and lawyer (yes, “lawyer” is definitely a verb in this case) other scholars and scholarship out of existence under the banner of protecting copyright.

My husband has a book-length, publishable grammar of Sindarin. The only other grammars of Elvish (Sindarin or Quenya) are at best outdated and at worst utterly inaccurate. (The stuff on Ardalambion is good as far as it goes.) I don’t know if his grammar will ever see the light of day. I hope so. It’s rocking good stuff.

Copyright preventing the dissemination of scholarship. For Pete’s sake, has anyone read a chap named Jefferson lately?

On a whole ’nother subject, let me return AKMA’s favor: AKMA says what I wish I had about Mideast politics, as well as many other things. I don’t often discuss politics, for the precise reasons AKMA cites. Perhaps my voice is no great loss; I am a truly lousy historian and a worse psychologist. I do not doubt, however, that other voices more valuable than mine are silenced for these reasons, and that is a tragedy.

28 Martii 2002

The right copyright

I agree wholeheartedly with this discussion of what’s wrong with copyright and how to fix it. I love working with ebooks, but my ethics bump has been bugging me for some time now, because I do not want to be even peripherally involved with any movement that further mangles the public domain and withholds power over art and knowledge from ordinary folks like myself.

Daily dose of XML

Check out xml.com this week for a pretty darn good basic definition of XML. Seems like a no-brainer, but try writing your own definition (I have) and see how hard it is to do well.

Noted a pointer in one article to an RDF primer. That goes on my reading list.

27 Martii 2002

Goth-kitty blues

Vet says the Goth-kitties have issues: high pH and crystals in their urine.

I’m not sure how seriously to take this, frankly. This particular vet hawked one of those expensive must-buy-at-vet cat foods the instant she saw the cats’ teeth. Now she’s hawking a different one based on the lab results. What is a cat owner supposed to think, other than that the vet’s cut of the profits might have, I don’t know, some sort of impact on the recommendation?

But where can I find a vet for whom this is not the case?

I’m going to try to do my homework, become informed on the health issues in question here, so I can make a decision that works for the Goth-kitties while not unduly assaulting my wallet. If they really need the special food, fine. If they don’t, I am so getting a new vet, one who doesn’t cater to the half-million-dollar-home Shorewood Hills crowd.

Working on it

So I got started on putting together the HTML deriving from the PowerPoint deriving from “Print and Screen: The Collision of Print and Electronic Publishing,” which is the quasi-famous talk I blogged about yesterday.

It is truly astonishing how long it takes to type out a lecture. I don’t see why college professors ever bother. When would they have time to do anything else?

I got through only six slides of eighteen, I’m afraid. I’ll try to hack through some more tomorrow.

PowerPoint HTML turns out to be less horrible than I had feared. (Word’s is dreadful, not that that’s news to anyone.) Get rid of the style attributes, put class attribute values in quotes, change div tags to something intelligible, and the result is downright usable. PP even exports Unicode rather than dumbing down special characters. I appreciate that; I spent enough time as a typesetter to value smart quotes and em dashes.
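For the curious, the three cleanup steps could be sketched in Python along these lines. This is a rough illustration, not the script I actually used: the function name, the sample markup, and the choice of p as the "something intelligible" are all assumptions, and regular expressions like these will trip over nested divs or over "class=" appearing in running text.

```python
import re

# Inline style attributes, quoted with either " or '.
STYLE_ATTR = re.compile(r'''\s+style\s*=\s*("[^"]*"|'[^']*')''', re.IGNORECASE)
# class attributes whose value is not quoted, e.g. class=O.
BARE_CLASS = re.compile(r'''\bclass\s*=\s*([^\s"'>]+)''', re.IGNORECASE)
DIV_OPEN = re.compile(r"<div\b", re.IGNORECASE)
DIV_CLOSE = re.compile(r"</div\s*>", re.IGNORECASE)


def clean_exported_html(html, replacement_tag="p"):
    """Apply the three cleanup steps to a string of exported HTML.

    Assumes divs are not nested; replacement_tag is a hypothetical
    default, pick whatever element suits the content.
    """
    html = STYLE_ATTR.sub("", html)                      # 1. drop style attributes
    html = BARE_CLASS.sub(r'class="\1"', html)           # 2. quote class values
    html = DIV_OPEN.sub("<" + replacement_tag, html)     # 3. rename div tags
    html = DIV_CLOSE.sub("</" + replacement_tag + ">", html)
    return html


sample = "<div class=O style='margin-top:0pt'>Print and Screen</div>"
print(clean_exported_html(sample))
```

A real cleanup would probably choose the replacement element per class value (headings for slide titles, list items for bullets) rather than one tag for everything.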

26 Martii 2002

Markup challenges for ebooks

As long as Leigh is sending eyes my way, I might as well give them something other than snark to read.

I implied previously that we over in OEBF-land have been butting our heads against the current technological ceiling in markup. Perhaps a discussion of one or two of our worst problems is in order.

Random vocabulary mixing

This is huge. This is huger than huge. This is freakin’ colossal, okay?

The OEBPS is believed by far too many people to rest entirely on XHTML. This is not the case. The Authoring Group, to its immense and enduring credit (and I wasn’t on the Authoring Group, so I can say that), realized almost from the beginning that XHTML was not going to cut it as the be-all and end-all of book production vocabularies.

So they said, okay, let’s use XHTML as a base tagset for the unambitious, and allow other XML tagsets as long as their visual-display behavior is defined via CSS. (And then they made this immensely smart decision all but useless by not allowing hierarchical CSS selectors. But I digress—and we’re about to fix that error, I promise.)

But that only gets you so far, if you’re a content creator. STM publishers (that’s Scientific-Technical-Medical, y’all non-book-geeks) can’t very well style MathML tags with CSS. Lit-critters and linguists will want to do more with their TEI tagging than make it pretty. XLink could be highly useful juju for ebooks, but its usefulness has got diddly-squat to do with CSS.

The long and short of it is that ebooks need to be able to do much of the same stuff that XHTML Modularization is designed to address—but they need to be able to do it without necessarily involving XHTML at all. If I’ve rolled my own STM journal DTD based on ISO-12083 with the addition of ChemML and MathML data islands, I see no reason I should have to turn my ChemML and MathML into images to display them in an ebook reading system that’s supposed to be based on XML processing anyway.

In my more innocent days, I thought XML Namespaces solved this problem. How wrong I was. I can set apart my ChemML and MathML islands from the rest of my markup using namespaces, sure. That’s no guarantee at all that anything will be done with those islands. The namespace URI doesn’t have to mean anything. Often, namespace URIs (insofar as they mean anything at all) mean more than one thing, with XHTML itself being the biggest, stinkiest example. (XHTML 1.0 and 1.1 share the same namespace URI despite not being the same vocabulary.)

Worse, the namespace URI doesn’t say anything at all about proper handling of a particular data island. The MathML namespace URI, for instance, doesn’t tell a processor whether the markup in a MathML data island is to be displayed visually or fed to Mathematica for number-crunching. Huge difference—that cannot be got at except by examining the markup. So what’s the use of the namespace label, hm?
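To make the complaint concrete, here is a small Python sketch using the standard library's ElementTree. The two sample fragments and the "content-only elements" heuristic are my own illustrative assumptions, not anything the MathML or OEBPS specs prescribe. Both fragments carry the identical namespace URI; telling the display-oriented one from the computation-oriented one takes a look at the element names themselves.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Two islands in the same namespace: one meant for display, one for computing.
PRESENTATION = f'<math xmlns="{MATHML_NS}"><mi>x</mi><mo>+</mo><mn>1</mn></math>'
CONTENT = f'<math xmlns="{MATHML_NS}"><apply><plus/><ci>x</ci><cn>1</cn></apply></math>'

# Rough heuristic (an assumption): these names only occur in content markup.
CONTENT_ONLY_ELEMENTS = {"apply", "ci", "cn"}


def namespace_of(tag):
    """Return the namespace URI from an ElementTree '{uri}local' tag."""
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def local_name(tag):
    """Return the local part of an ElementTree '{uri}local' tag."""
    return tag.rsplit("}", 1)[-1]


def guess_flavor(xml_text):
    """Guess 'content' or 'presentation' by examining the markup itself."""
    root = ET.fromstring(xml_text)
    names = {local_name(el.tag) for el in root.iter()}
    return "content" if names & CONTENT_ONLY_ELEMENTS else "presentation"


for island in (PRESENTATION, CONTENT):
    root = ET.fromstring(island)
    print(namespace_of(root.tag), "->", guess_flavor(island))
```

The namespace label comes out the same both times; only the sniffing function tells the two apart, which is exactly the work the label was supposed to save.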

What we’re trying to come up with is namespace metadata. It’s an uphill battle. We can’t very well afford to become vocabulary arbiters, saying that some vocabularies are acceptable in ebooks and others aren’t. We’d spend all our time doing nothing but studying and debating vocabularies. Plus, inevitably, not all reading systems need to support all XML vocabularies. (Beach Blanket Bob doesn’t need his reading system to handle MathML before he can read his Ludlum novels.)

Cross this issue with…

MIME types

… and you’ve got a real lulu. Zen koan, markup style: What is the MIME type of a multiple-namespace XML document?

text/xml, you say? Not so fast. What if one of the namespaces is XHTML, which has its own MIME type? What if another is MathML, which also has its own MIME type?

The OEBPS uses MIME types in its “package file,” which was designed in part so that reading systems could figure out how to handle a publication without having to sniff every single file included. This system worked only moderately well to start with, and got creakier every time we re-examined it. At the moment, for XML files, it’s just plain broken. MIME types just don’t carry sufficient information.
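The koan can be acted out in a few lines of Python. In this sketch the namespace-to-media-type table is a hand-built assumption (two entries, chosen for illustration), and the function name is made up. The point is only that a single file can honestly claim more than one type, while a package entry has room for one.

```python
import xml.etree.ElementTree as ET

# Illustrative table only: namespace URI -> media type associated with it.
MEDIA_TYPE_BY_NAMESPACE = {
    "http://www.w3.org/1999/xhtml": "application/xhtml+xml",
    "http://www.w3.org/1998/Math/MathML": "application/mathml+xml",
}

DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>A sum:</p>
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <mi>x</mi><mo>+</mo><mn>1</mn>
    </math>
  </body>
</html>"""


def candidate_media_types(xml_text):
    """List every media type the document could claim, in document order."""
    candidates = []
    for el in ET.fromstring(xml_text).iter():
        ns = el.tag[1:].split("}", 1)[0] if el.tag.startswith("{") else ""
        media_type = MEDIA_TYPE_BY_NAMESPACE.get(ns, "text/xml")
        if media_type not in candidates:
            candidates.append(media_type)
    return candidates


print(candidate_media_types(DOC))
```

Picking the root element's type is one possible tiebreaker, but it throws away precisely the information a reading system would need to know that MathML handling is required.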

Which leads us to…

Packaging and packaging metadata

A typical ebook has a lot of files in it. XML files, CSS files, JPEG and PNG files, possibly other files (there’s a core set of supported file types, and a way to include other file types as long as they fall back to a core file type—go read the spec for more). Once again, it’s not a good idea to force a reading system to figure out what to do with all these files based on file-sniffing, not if we want ebook gadgets smaller than laptops.
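The core-type-plus-fallback idea can be sketched as follows. Treat this as a simplified illustration rather than the actual package file schema: the manifest below is pared down, the core set is abbreviated, and the function names are hypothetical. Go read the spec for the real rules.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative core set; the spec's own list is authoritative.
CORE_TYPES = {"text/x-oeb1-document", "text/x-oeb1-css", "image/jpeg", "image/png"}

# Simplified manifest: each item declares its type up front, so nothing
# has to be sniffed; non-core items point at a fallback item by id.
MANIFEST = """<manifest>
  <item id="fig1-svg" href="fig1.svg" media-type="image/svg+xml" fallback="fig1-png"/>
  <item id="fig1-png" href="fig1.png" media-type="image/png"/>
  <item id="ch1" href="ch1.html" media-type="text/x-oeb1-document"/>
</manifest>"""


def load_items(xml_text):
    """Index manifest items by id."""
    return {el.get("id"): dict(el.attrib) for el in ET.fromstring(xml_text).iter("item")}


def resolve(item_id, items):
    """Follow the fallback chain until a core-type file turns up, else None."""
    seen = set()
    while item_id is not None and item_id not in seen:
        seen.add(item_id)  # guards against circular fallback chains
        item = items.get(item_id)
        if item is None:
            return None
        if item["media-type"] in CORE_TYPES:
            return item["href"]
        item_id = item.get("fallback")
    return None


items = load_items(MANIFEST)
print(resolve("fig1-svg", items))
print(resolve("ch1", items))
```

A gadget that cannot draw SVG never has to open fig1.svg at all; the description alone tells it to reach for the PNG.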

So we gotta come up with a way to describe those files. (My husband is glaring at me because he needs the ’puter, so let me be brief here.) This we have done, but as time passes, we discover more and more description that we want to include. A rigidly defined XML document, such as that created for OEBPS 1.0, just won’t cut it any longer. What architecture should we adopt, then?

It’s looking like RDF, despite some internal dissension. Ye xml-dev denizens, keep an eye out for my colleague Garret Wilson, who has done an amazing lot of work on a packaging spec useful (we hope) not just to us, but to you as well.

Okay, dear, I’m done now.

Portrait of the artisan as a young geek

So now Leigh is granting me another link and reposting my snarky comments to xml-dev (I should watch my mouth, as my mother would no doubt tell me). This is fine; I accept responsibility for my public snarking. (And no one seems to have responded to it anyway, so perhaps I’ll escape scot-free this time.)

I should probably explain just who the hell I am to be snarking, though. If nothing else, it provides a counter-datapoint to the endless stream of database gurus cited as XML’s main (sole?) audience.

It was all an accident. I burned out of graduate school in late 1998, and after a few months of pink-collar temping landed a job with Impressions Book and Journal Services as an Electronic Publishing Assistant.

Impressions thought I might have some luck learning SGML because of work I had done in graduate school keying medieval and early Renaissance manuscripts and books into a weird, homegrown, non-SGML-based markup system. As it turned out, they were more or less right. Within six to eight months, I could handle data analysis, write a DTD, edit and document already-written DTDs, or turn garbage into SGML via regular expressions and the beginnings of Desperate Python Hacking.

I shouldn’t (and I don’t) claim that all this prowess stemmed from personal brilliance. C’mon, we’re talking about a grad school dropout here, okay? I had excellent teachers. A good deal of their excellence as teachers, in fact, came from their having been in my shoes. None of them were computer scientists. None of them were trained programmers. There was plenty they didn’t know; there was plenty that we collectively didn’t know.

But they understood books and book production. All of them were typesetters; now that I come to think of it, I was the first non-typesetter in that department. (I understood texts, being an ex-lit-critter and ex-historical-linguist, but I had a lot to learn about books.) They did a pretty amazing job integrating what they knew about books with what they learned about markup.

As a publishing-services company, Impressions noticed the ebook hype pretty quickly. I don’t think I can take credit for that; I honestly don’t even remember who in the EP department first noticed the Open eBook Authoring Group and its new “Open eBook Publication Structure.” I can take credit for sussing things out quickly, though. (And here I go, about to get myself in legal trouble. Oh, well. Maintaining the fiction gets old anyway.) I wrote what is now the OEBPS FAQ and published it via ebooknet.com (no link; no longer extant) in late 1999.

When the Open eBook Forum held its first annual meeting in May 2000, I represented Impressions. I immediately joined the Publication Structure Working Group and in short order became its scribe, a position I still hold. I am and I’m not a typical PSWG member. Typical PSWG members, like me, are made, not born, markup artists; only our long-time chair Allen Renear (of the Text Encoding Initiative, among other things) has been around markup for a long time. Typical PSWG members, unlike me, are software and hardware designers, not content creators. (I was the only PSWG member who could make any claim at all to understanding publishers’ issues for a very long time. Made me nervous.)

I left Impressions rather badly, in May 2001, when a management shift sharply curtailed my on-the-job autonomy and I didn’t have the political savvy to resist effectively. I had planned to start my own tiny ebook conversion/consulting business, but landed at OverDrive instead. About that, I have said enough and more than enough in recent days.

All that said, what can I actually do? Well, aside from the skills already mentioned—not much, to tell the truth. I wrote an ISO-entity-to-Unicode translator in Python that I still use (props to John Cowan for the actual equivalence listings). I can fiddle a bit with XSLT, if I have a reference open on my lap. (I lost a lot of interest in XSLT when I discovered—the hard way, of course—that it did not solve one of the commonest problems I face, that of adding hierarchy to flat data. Data that comes out of typesetting systems is flat, flat, flat.) I can sort of read, but not at all write, an XML Schema. I understand more about XML namespaces than anyone should. I can follow some of RDF. I grok almost all of CSS2, enough to grumble at current web-browser implementations. I’m teaching myself basic SAX scripting via a project I’ll blog about some other time. (SAX makes more sense than DOM in this context; again, working with a very flat and simple structure, but potentially lots of data.)
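Since "adding hierarchy to flat data" may sound abstract, here is a minimal Python sketch of the general idea. It assumes the flat data arrives as (tag, text) pairs with h1, h2, and so on marking heading levels of 1 or greater; that input shape, the element names, and the function name are all invented for illustration and are not taken from any real typesetting export.

```python
import xml.etree.ElementTree as ET

# Flat, typesetter-style stream: nothing says which paragraph belongs where.
FLAT = [
    ("h1", "Chapter 1"),
    ("p", "Intro."),
    ("h2", "Section 1.1"),
    ("p", "Body."),
    ("h1", "Chapter 2"),
    ("p", "More."),
]


def add_hierarchy(flat):
    """Nest a flat (tag, text) stream into sections keyed on heading level."""
    root = ET.Element("book")
    stack = [(0, root)]  # (heading level, currently open container)
    for tag, text in flat:
        if tag[0] == "h" and tag[1:].isdigit():
            level = int(tag[1:])
            # Close every open section at this level or deeper.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section")
            ET.SubElement(section, "title").text = text
            stack.append((level, section))
        else:
            # Non-heading content goes into the innermost open section.
            ET.SubElement(stack[-1][1], tag).text = text
    return root


book = add_hierarchy(FLAT)
print(ET.tostring(book, encoding="unicode"))
```

The stack of open sections is what makes this easy in a general-purpose language: each heading decides how many containers to close before opening its own.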

I am about to start a new job completely unrelated to XML or ebooks. Whether I expend the additional effort to continue with the OEBF remains to be seen. I do not, however, expect to be gone from markup forever. Perhaps I’ll come back a database wonk, who knows?

What is to be learned from this topsy-turvy tale, if you, like Leigh Dodds, are interested in the underground XML world? Here’s what I think:

  • Markup isn’t rocket science. (This statement has reached the status of aphorism—all right, in-joke—in the OEBF, because I’ve said it too often.) You do not have to be a Ph.D in computer science to handle markup effectively.
  • Going it alone is bloody hard. Probably impossible. Even if markup isn’t rocket science, it contains plenty of gotchas for the uninitiated. Maybe there are geniuses somewhere who can learn XML out of books. I am sure as heck not one of them. Without tutelage, without having started out on ongoing projects (SGML journals, mostly) that more experienced people had worked the kinks out of, without coworkers able to grin and say, “Yeah, it’s stupid, but it just works like that,” I would never have made it as far as I have.
  • Too many vendors are selling learned helplessness. I don’t think I need to say more on that unpleasant subject.
  • Some markup standards are broken. I don’t know that everyone would agree on which ones and how, but I see the evidence often. (My own list of candidates: XML namespaces, XML Schema vis-a-vis character-entity handling and human-readability, CSS vis-a-vis conversion from print production tools, XSLT vis-a-vis adding hierarchy. XSL-FO is unfinished, but my impression is that it’s not broken.)
  • Markup standards are pitifully unusable in the vast majority of today’s book production contexts. There are lots of reasons for this, by no means all of them due to markup standards themselves. (In other words, improving the standards won’t necessarily solve the problems.)

I am known (insofar as I’m known at all) for pontificating on that last subject; I’ll try to get my most famous (ha!) pontification up in half-readable form soon.

If I can un-hose my laptop, that is.