28 November 2005

Word vs. OpenDocument XML smackdown

A blogger I read religiously helped write a smashing comparison of MS Word’s XML format with OpenDocument. Good stuff, though I admit I’m not sure XLink is a terribly impressive selling point.

The authors missed a detail, though, one I’m rather surprised they didn’t comment on. They mentioned, correctly, that OpenDocument’s mixed-content model looks very XHTML-ish and readable, whereas MSXML looks like a document pureed in a blender. (Okay, phrasing there is mine. I’ve had to cut down some MSXML documents for use in ordinary XHTML. Extremely not fun.)

What they don’t mention is that OpenDocument’s way of handling inline markup (such as bold or italic formatting) leads easily to well-formed XHTML (or other XML) output. MSXML’s doesn’t, necessarily. I don’t know whether MS has actually fixed Word so that the case I’m about to lay out can’t happen, but I do know that the underlying data model used to give my old VB-guru friend Damon fits, because of all the extra processing he had to do at paragraph marks to get anything even vaguely resembling well-formed output.

Anyway, I’m not going to try to write MSXML to make my point, because I loathe MSXML just that much. However, the basic idea is that MSXML will let you get away with this:

<p>Here’s some text in a paragraph, <start what="bold"/>and the end is boldfaced.</p>

<p>Whereas the beginning of this paragraph is boldfaced<end what="bold"/>, and the end is not.</p>

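That’s well-formed XML, yes, because the start and end markers are empty elements that happen to sit in different paragraphs. But do you see what happens if you try to boil that down to XHTML? Swap each marker for the matching opening or closing tag and your output won’t be well-formed, because the boldfaced span crosses a paragraph boundary: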

<p>Here’s some text in a paragraph, <b>and the end is boldfaced.</p>

<p>Whereas the beginning of this paragraph is boldfaced</b>, and the end is not.</p>

Don’t try this at home, kiddies, because your parser won’t like it, never mind your validator.
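For the skeptical, here is a quick sketch of that claim in Python, using the standard library's ElementTree parser. The `<doc>` wrapper, the shortened paragraph text, and the `is_well_formed` helper are my own inventions for illustration; the start/end markers are the same made-up stand-ins used above, not real MSXML.

```python
import xml.etree.ElementTree as ET

# The milestone version: inline formatting marked by empty elements.
# The <doc> root is a hypothetical wrapper so there is a single root element.
MILESTONES = (
    "<doc>"
    '<p>Some text in a paragraph, <start what="bold"/>and the end is boldfaced.</p>'
    '<p>The beginning is boldfaced<end what="bold"/>, and the end is not.</p>'
    "</doc>"
)

# The naive XHTML-ish version: each marker swapped for <b> or </b>.
NAIVE = (
    "<doc>"
    "<p>Some text in a paragraph, <b>and the end is boldfaced.</p>"
    "<p>The beginning is boldfaced</b>, and the end is not.</p>"
    "</doc>"
)


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(MILESTONES))
print(is_well_formed(NAIVE))
```

The first call should report success and the second failure, since the `<b>` opened in one paragraph is still open when that paragraph's `</p>` arrives.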

I had a conversation once with a Very Smart Person who does things like write citation parsers for grotty author input. He told me that there used to be much friction in word-processor-space between applications that enforce some notion of well-formedness (like OpenOffice) and those that don’t (like MS Word). The additional programming burden of well-formedness, added to users’ nonexistent understanding of the concept, annoyed both user and programmer enough that sloppy data models, in which inline formatting can overlap block formatting, won out.
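To give a flavor of that programming burden, here is a minimal sketch of the paragraph-boundary processing involved: carry the "bold is on" state from one paragraph to the next, closing the bold element at each paragraph end and reopening it at the next paragraph start. It assumes the made-up start/end markers from my example, a hypothetical `<doc>` wrapper, flat paragraphs with no other child elements, and a function name of my own choosing. Real Word data is far messier.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap(text, bold):
    """Escape a run of text and wrap it in <b> if bold is currently on."""
    if not text:
        return ""
    text = escape(text)
    return f"<b>{text}</b>" if bold else text


def milestones_to_xhtml(doc_xml):
    """Turn start/end bold markers into <b> elements nested inside each <p>."""
    root = ET.fromstring(doc_xml)
    bold = False  # state carried across paragraph boundaries
    paragraphs = []
    for para in root.findall("p"):
        # Text before the first marker inherits the state left by the
        # previous paragraph.
        pieces = [wrap(para.text, bold)]
        for child in para:
            if child.get("what") == "bold":
                if child.tag == "start":
                    bold = True
                elif child.tag == "end":
                    bold = False
            # Text following a marker lives in that marker's tail.
            pieces.append(wrap(child.tail, bold))
        paragraphs.append("<p>" + "".join(pieces) + "</p>")
    return "<div>" + "".join(paragraphs) + "</div>"


sample = (
    "<doc>"
    '<p>Plain, <start what="bold"/>bold to the end.</p>'
    '<p>Still bold<end what="bold"/>, then plain.</p>'
    "</doc>"
)
result = milestones_to_xhtml(sample)
print(result)
```

The one logical bold span comes out as two `<b>` elements, one per paragraph, which is exactly the information the milestone model let the application avoid thinking about.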

I’d be thrilled to pieces to see that pernicious little trend halted. I’m aware well-formedness causes problems: annotations like comments and change-tracking are the big use case, since their targets, though logically inline, can legitimately span blocks. Still, OpenDocument seems to be managing all right. I hope it keeps doing so.