‘Markup languages’ Archive

5 Aprili 2007

Blogging nekkid

It’s CSS Naked Day again. CavLec is a proud participant, in part because its owner is aware of the difficulty of making anything “naked” on the ’net halfway tasteful, and is deeply impressed that the organizer does actually manage it.

Also to remind herself that a markover for CavLec is well overdue.

17 Iulii 2006

em Considered Harmful

I had a grouchy weekend, filled with Googlebot blasting my DSpace installation into smithereens not once but twice (how much does Dorothea hate Java, boys and girls? A LOT, that’s how much) and a TAG markup project that led to the growl following.

People who understand books and book production understand that individual aspects of typography are overloaded. Overloaded in the programming sense, I mean—depending on context, a given typographical embellishment may have a different meaning. Overloaded, polysemous, ambiguous—whatever word floats your boat.

Take the humble italic font. It demonstrates emphasis. It sets off the titles of books and other extended-length works of art. It sets foreign terms apart from surrounding text. It sets biological genus-species names apart from surrounding text. It delineates ship names (but not, curiously, aircraft names).

It can also be used just because somebody thought italics was a good idea at the time. Colonial-era American typesetters were absolutely notorious for this. If you can extract rhyme or reason from their type choices, you’re a braver woman than I.

Italics, in other words, are a cue. They don’t unambiguously tell the reader the reason for their existence; the reader picks from a mental list of what she’s known italics to signify in past reading, and happily goes on from there.

The neat thing about markup is that it permits various uses of italics to be disambiguated behind the scenes, if desirable. If I’m writing a biology textbook, it’s probably not a bad idea to disambiguate genus-species names from other uses of italics—that makes it possible to create a handy-dandy index of organisms named in the book.

Understand, though, that this disambiguation doesn’t just happen. Somebody’s got to actually do it. Trust me, that somebody is not going to be anybody in standard book production. Italics is italics, end of story. (You might get a clueful copyeditor. I wouldn’t count on it, though—and the clueful copyeditor’s work is wiped off the slate when the book hits print anyway.)

This brings us to HTML, where back in the day, <i> was <i>, and that’s all she wrote. But this is bad! cried the semantic generation of HTML designers. <i> doesn’t mean anything! We have to have tags that mean things!

Which is a complete misunderstanding of the problem. The problem is not that <i> is meaningless. The problem is that it means too many things. The proper solution to this problem, given HTML’s problem domain, would have been to add tags for the commoner uses of italics on the Web and perhaps to insist that <i> be embellished with a class attribute for less-common uses that HTML cannot be expected to anticipate. (I don’t think many practicing biologists sit on W3C working groups, so a separate tag for genus-species names was probably never in the cards.)
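To make the payoff concrete, here is a minimal sketch of the handy-dandy organism index, assuming italics have been disambiguated with a class attribute on <i>. The class names "taxon" and "title" are made up for illustration; nothing in HTML defines them. The sketch uses Python's standard html.parser module.

```python
from html.parser import HTMLParser


class TaxonIndexer(HTMLParser):
    """Collect the text of every <i class="taxon"> element.

    "taxon" is a hypothetical class name, not anything HTML defines.
    Nested <i> elements inside a taxon are not handled.
    """

    def __init__(self):
        super().__init__()
        self.in_taxon = False
        self.buffer = []
        self.names = []

    def handle_starttag(self, tag, attrs):
        # The class value can be None (valueless attribute) or absent.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "i" and "taxon" in classes:
            self.in_taxon = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_taxon:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "i" and self.in_taxon:
            self.names.append("".join(self.buffer).strip())
            self.in_taxon = False


def taxon_index(html):
    """Return the sorted, de-duplicated organism names found in html."""
    parser = TaxonIndexer()
    parser.feed(html)
    parser.close()
    return sorted(set(parser.names))


SAMPLE = (
    '<p>We read <i class="title">Walden</i> while '
    '<i class="taxon">Quercus alba</i> and '
    '<i class="taxon">Acer rubrum</i> grew outside; '
    'it was <i>very</i> pleasant.</p>'
)

print(taxon_index(SAMPLE))  # ['Acer rubrum', 'Quercus alba']
```

The book title and the plain emphasis are skipped. A bare <i> (or a blanket <em>) gives a script like this nothing to distinguish them by.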

What happened instead? <i> was as good as deprecated (people were told not to use it!) in favor of <em>, which means “emphasis.” So let’s step back. Web folks used to tag things ambiguously. This is sometimes necessary (perhaps I don’t know why something is italic!), sometimes not great, but can always be lived with; we’ve lived with it in print for centuries. Now, with the blanket replacement of <i> by <em>, Web folks are demonstrably tagging many things incorrectly, because not every use of italics is for emphasis! This is an improvement? I think not.

I spent much of the weekend wincing at (and either fixing or actually performing) tag abuse of <em>, <strong>, and <q>. And checking my work email every hour or so to make sure DSpace hadn’t run out of memory again. No wonder I’m grouchy.

24 Aprili 2006

When not to fiddle

I am an inveterate markup and CSS fiddler. I am never happy with any out-of-the-box web app. Not infrequently, what I come up with is worlds worse than what came out of the box, but that is the price of fiddling. (And I’m still proud of what I did with the repository I run. That puppy looks good, yo.)

Sometimes, though? It pays not to fiddle with stuff. I spent most of my last workday fiddling with Open Journal Systems’ markup and CSS. I came in today, ruthlessly copied the out-of-the-box defaults over my work, and am starting over. Why? Because I want to work with (okay, okay, rip off and change) other people’s journal designs, and if I fiddle with the markup and the base CSS, I can’t. Fiddling’s more trouble than it’s worth in this case.

I’m fiddling. A little. Journals without a logo now get MPOW’s logo as a default, and I’ve messed with the sidebar some because I don’t agree with how it’s organized. But most of it I’m just leaving alone, and when I get the urge to fiddle with it, I’m just gonna slap my hand good and hard to stop myself.

I used to think that the stupid mechanical overhead involved in fiddling with DSpace designs (upload, ant update, copy .wars, kill Tomcat, restart Tomcat; have I mentioned lately I hate Java?) was a bug. I’m starting to think that for inveterate fiddlers, it’s actually a feature.

4 Aprili 2006

Nekkid

Well, this is about as naked as I get online. See why.

3 Aprili 2006

TEI is not metadata!

I was reading through Indiana University’s answer to the University of California’s future-of-cataloguing doc (grotty Word doc) when I ran into a widely-repeated nostrum that I think is completely bogus, or at any rate requires further thought.

“Catalogers will create metadata in formats such as MODS, EAD, VRA Core, and TEI,” it says (page 11). Um. Yeah. One of these things is not like the others. One of these things is not the same.

TEI contains metadata, as librarians and even cataloguers generally understand metadata, in the so-called “TEI header.” The bulk of TEI, however, is not metadata in that sense. It’s document markup, is what it is. It’s closer to editing or book design or typesetting than to cataloguing. You can’t do it just by examining the title page.

This isn’t to say that cataloguers can’t learn to do TEI. Of course they can; doubtless they can do TEI immensely better than I do MARC. But if they think working with TEI is going to be like working with MARC, they seriously need to rethink. It’s just a completely different beast.

11 Decembris 2005

The Microsoft Word Nobbling Council

I swear, if I ever in my life again have to clean up craptacular HTML expelled from the nether regions of Microsoft Word, I am going to have to follow the exalted example of the president of the Mid-Galactic Arts Nobbling Council and gnaw my own leg off.

This stuff is truly, madly, deeply vile, and it resists being cleaned up like a three-year-old making mud pies.

This rant has been brought to you by a TAG contract that I probably should never have signed and am incredibly thankful to say runs out at the end of this year. Tech writing, especially when it consists of gluing together bits and pieces from sixteen different sources? Is very not my thing. Rah-rah those who do it without going bananas. Me, I’m going bananas.

9 Decembris 2005

DIDL ordering?

This came up on #code4lib, and I tried my librarianly best to come up with a definitive answer, but the bloody DIDL spec is not helpful, so I’m throwing myself on the mercy of the LazyWeb.

Is there an ordering semantic in MPEG-21 DIDL? For example, I have a scanned book in which each page is a separate file. Can I create a DIDL file that will spit back the pages in the correct order?

My gut says “no,” based on no explicit language about ITEM ordering and no implicitly or explicitly ordered examples in the back of the spec. But I haven’t implemented this puppy. Does anybody know for sure?

28 Novembris 2005

Word vs. OpenDoc XML smackdown

A blogger I read religiously helped write a smashing comparison of MS Word’s XML format with OpenDocument. Good stuff, though I admit I’m not sure XLink is a terribly impressive selling point.

The authors missed a detail, though, one I’m rather surprised they didn’t comment on. They mentioned, correctly, that OpenDocument’s mixed-content model looks very XHTML-ish and readable, whereas MSXML looks like a document pureed in a blender. (Okay, phrasing there is mine. I’ve had to cut down some MSXML documents for use in ordinary XHTML. Extremely not fun.)

What they don’t mention is that OpenDocument’s manner of handling inline markup (such as bold or italic formatting) leads easily to well-formed XHTML (or other XML) output. MSXML’s doesn’t, necessarily. I don’t know whether MS has actually fixed Word to rule out the case I am about to lay out, but I do know that the underlying data model used to give my old VB-guru friend Damon fits, because of all the extra processing he had to do at paragraph marks to get anything even vaguely resembling well-formed output.

Anyway, I’m not going to try to write MSXML to make my point, because I loathe MSXML just that much. However, the basic idea is that MSXML will let you get away with this:

<p>Here’s some text in a paragraph, <start what="bold"/>and the end is boldfaced.</p>

<p>Whereas the beginning of this paragraph is boldfaced<end what="bold"/>, and the end is not.</p>

That’s well-formed XML, yes, but do you see what happens if you try to boil that down to XHTML? Your output won’t be well-formed, because of how MSXML treated that boldfaced text:

<p>Here’s some text in a paragraph, <b>and the end is boldfaced.</p>

<p>Whereas the beginning of this paragraph is boldfaced</b>, and the end is not.</p>

Don’t try this at home, kiddies, because your validator won’t like it.
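For the skeptical, a minimal sketch in Python of both halves of the problem. It checks the two snippets for well-formedness with the standard-library ElementTree parser, then does a bare-bones version of the extra processing at paragraph marks: close the bold at the end of each paragraph and reopen it at the start of the next. The <doc> wrapper and the function names are my own assumptions, as is the simplification that bold is the only kind of formatting in play.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# <doc> is an assumed wrapper so that each sample has a single root element.
MILESTONES = (
    "<doc>"
    "<p>Here's some text in a paragraph, <start what='bold'/>and the end is boldfaced.</p>"
    "<p>Whereas the beginning of this paragraph is boldfaced<end what='bold'/>, and the end is not.</p>"
    "</doc>"
)

OVERLAPPING = (
    "<doc>"
    "<p>Here's some text in a paragraph, <b>and the end is boldfaced.</p>"
    "<p>Whereas the beginning of this paragraph is boldfaced</b>, and the end is not.</p>"
    "</doc>"
)


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def milestones_to_xhtml(xml_text):
    """Turn <start/>/<end/> milestones into properly nested <b> elements.

    Simplification: every milestone is treated as bold, whatever its
    'what' attribute says.
    """
    bold = False
    out = []
    for p in ET.fromstring(xml_text).iter("p"):
        out.append("<p>")
        if bold:
            out.append("<b>")  # reopen bold carried over from the previous paragraph
        out.append(escape(p.text or ""))
        for marker in p:
            if marker.tag == "start" and not bold:
                out.append("<b>")
                bold = True
            elif marker.tag == "end" and bold:
                out.append("</b>")
                bold = False
            out.append(escape(marker.tail or ""))
        if bold:
            out.append("</b>")  # close bold before the paragraph ends
        out.append("</p>")
    return "".join(out)


print(is_well_formed(MILESTONES))   # True
print(is_well_formed(OVERLAPPING))  # False
print(milestones_to_xhtml(MILESTONES))
```

The converted output carries two <b> elements where the author made one selection. That is the price of nesting, and it gets steeper once several kinds of formatting overlap.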

I had a conversation once with a Very Smart Person who does things like write citation parsers for grotty author input. He told me that there used to be much friction in word-processor-space between applications that enforce some notion of well-formedness (like OpenOffice) and those that don’t (like MS Word). The additional programming burden of well-formedness, added to users’ nonexistent understanding of the concept, meant a significant enough annoyance load both for user and programmer that sloppy data models, in which inline formatting can overlap block formatting, won out.

I’d be thrilled to pieces to see that pernicious little trend halted. I’m aware well-formedness causes problems—annotations like comments and change-tracking are the big use-case; their targets, though logically inline, can legitimately span blocks. Still, OpenDocument seems to be managing all right. I hope they keep doing so.

18 Octobris 2005

A note to XML validator writers

Please, please, please, if you’re going to report unclosed-tag errors (and you are), tell me where the start-tag is in the error message!

XML Nanny, which I otherwise quite like, just tells me unhelpfully that a p tag remains unclosed as of the end of the file. Well, duh. Which p tag, dagnabbit? There’s hundreds of ’em in this file!

4XSLT is (understandably; it’s only a validator en passant) even less informative, but Saxon did the Right Thing and told me which line the errant start-tag was on.
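What I am asking for is not much work. Here is a minimal sketch using Python's standard xml.parsers.expat module: keep a stack of open start-tags with their line numbers, and when the parser gives up, report what is still open. The function name and the sample document are mine, and this is not how XML Nanny, 4XSLT, or Saxon actually do it.

```python
import xml.parsers.expat


def find_unclosed(xml_text):
    """Parse xml_text and, on a well-formedness error, return
    (error message, [(tag, start line), ...]) for the elements still open.
    Return None if the document parses cleanly.
    """
    open_tags = []
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        # CurrentLineNumber is the line where this start-tag begins.
        open_tags.append((name, parser.CurrentLineNumber))

    def end(name):
        open_tags.pop()

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    try:
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError as err:
        return str(err), list(open_tags)
    return None


BROKEN = "<doc>\n<p>first</p>\n<p>second\n<p>third</p>\n</doc>\n"

message, still_open = find_unclosed(BROKEN)
tag, line = still_open[-1]
print(message)
print(f"innermost unclosed start-tag: <{tag}> on line {line}")
```

The innermost open element is only the parser's best guess at the culprit, but it beats being told that some p, somewhere, never closed.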

I suppose nobody’s compiled SP for OS X? Boy, do I miss SP.

17 Octobris 2005

Wrong tool

I just spent a lot of time uselessly trying to create an XSLT stylesheet that would number sections in an HTML document based on their heading level, and build a nicely nested TOC to boot.

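Which doesn’t sound too complicated, but is, because XSLT assumes you’ve used nice sane nested markup, and HTML tends to be insanely nested when nested at all. Headings in particular sit flat alongside the paragraphs they introduce instead of wrapping them, so there is no section element to count.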

So tomorrow I shall chuck it all and do what I should have done in the first place: use SAX, which is the right tool.
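For the curious, a minimal sketch of the SAX approach, using Python's standard xml.sax module and assuming the input is well-formed XHTML. The class and function names are mine. The whole trick is one counter per heading level: bump the counter for the level you just saw, and zero out everything deeper.

```python
import xml.sax

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class HeadingNumberer(xml.sax.ContentHandler):
    """Number h1..h6 elements in document order and collect a TOC."""

    def __init__(self):
        super().__init__()
        self.counters = [0] * len(HEADINGS)
        self.current = None  # level of the heading being read, if any
        self.text = []
        self.toc = []        # (number, level, title) tuples

    def startElement(self, name, attrs):
        if name in HEADINGS:
            self.current = HEADINGS.index(name) + 1
            self.text = []

    def characters(self, content):
        if self.current is not None:
            self.text.append(content)

    def endElement(self, name):
        if name in HEADINGS and self.current is not None:
            level = self.current
            self.counters[level - 1] += 1
            for i in range(level, len(self.counters)):
                self.counters[i] = 0  # deeper levels restart
            number = ".".join(str(n) for n in self.counters[:level])
            self.toc.append((number, level, "".join(self.text).strip()))
            self.current = None


def number_headings(xhtml):
    """Return [(number, level, title), ...] for well-formed XHTML input."""
    handler = HeadingNumberer()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.toc


SAMPLE = (
    "<body><h1>Intro</h1><p>Text.</p><h2>Scope</h2><h2>Terms</h2>"
    "<h1>Method</h1><h2>Data</h2><h3>Cleaning</h3></body>"
)

for number, level, title in number_headings(SAMPLE):
    print("  " * (level - 1) + number + " " + title)
# 1 Intro
#   1.1 Scope
#   1.2 Terms
# 2 Method
#   2.1 Data
#     2.1.1 Cleaning
```

A document that skips a level (an h3 directly under an h1, say) comes out numbered 1.0.1, which is ugly but at least shows where the markup went wrong.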