‘Markup languages’ Archive

14 Octobris 2005

Get thee to an information architect

At a work meeting yesterday, I was told of this site, which was held up as an Object of Scorn. Whatever the other folks attending the meeting want (and that’s still up in the air), it’s Not That.

Looking at the site this morning, I saw why. It’s not that the content is bad; the content seems excellent. What’s bad? The total lack of information architecture.

The organization is bad. The navigation is bad. The search is bad (this site is crying out for fielded limit searching). The visual design is bad. The labelling is beyond abysmal. I would never send anyone here to find a resource; it’d be like asking them to find a specific snowflake in an avalanche. Honestly, I didn’t think professional sites with information architecture this bad still existed.

Hints of better site structure peek out from the piled-high badness avalanche; the “virtual bookshelves” aren’t a bad notion at all. But the whole is just bad, bad, bad, so bad I can’t even start taking it apart because I’m not sure what to rant about first.

Moral of the story: Spend some grant money on an information architect (or a librarian! or a librarian information architect!) in addition to your JSP coders, won’t you? Your users will bless you for it, and your competitors will have one less way to cut you out of the picture.

22 Septembris 2005

Doctor, doctor

The LazyWeb came through in fine style indeed; I had three offers of help and one “don’t use XSLT! use this!” message.

Turns out my stylesheet wasn’t the cow; 4XSLT was. The cure, as the doctor said to the patient, is “don’t do that.”

So I’m grabbing Python bindings to Saxon and Xalan and whathaveyou, and I’ll take care of business that way.

21 Septembris 2005

Calling the LazyWeb

If an XSLT guru out there would be willing to take a look at a stylesheet I’m working on with a view toward explaining to me why it runs like a dyspeptic cow on Valium, please let me know. Bonus points if you are familiar with 4Suite, which is what I’m using to do the transform (because at some point I expect to roll it into a Python workflow).

I’m trying to cobble together something workable out of two different EAD to HTML transforms. I’ve got something that works (more or less—output needs minor tweakage), but oh my gosh does it run slooooooow, and I’m not XSLT-savvy enough to understand why.

Help?

19 Septembris 2005

Gacky markup

There’s a dead easy way to prove that the discipline imposed by XHTML doesn’t automatically lead to good HTML practices.

Just read XSLT stylesheets whose intended target is HTML. Ugh.

11 Augusti 2005

I kill DSpace now

Right, so I start digging into the interior pages in DSpace, because something in there is making all the type in tables, like, two pixels tall in Safari… and I come across the Browse By Titles page.

That page in a default DSpace install is a table. (Everything in a default DSpace install is a table. If DSpace were a furniture store, good luck finding a chair or a mattress or a lamp. Tables. All tables.) Moreover, it’s a deathly stupid table, because the attribute that the site user has chosen as being salient—namely, the title—isn’t first in the table, and isn’t distinguished in any way typographically except by a bolded column header.

(What’s first? Date. Date? Who the hell cares about the date something was added to the repository, I ask you? Sure, it’s a great thing to sort by, but nobody cares about actually seeing it. Librarians and visible metadata. Sometimes I just want to strangle my own kind, which is sad.)

It’s even worse in the current state of my redesign, because I’ve gotten rid of the table colors. So, okay, I think to myself, time to put the sweat of my brow where my mouth is and redo this in a typographically sensible fashion.

Except. I. Can’t.

They HARD-CODED the table INTO THE JAVA LIBRARIES. Not into a JSP. There’s an entire fricking CLASS in the source code that ONLY EXISTS to write out that stupid bloody useless table.

Well. Looks like I’ll be messing with Java a lot sooner than I meant to. War has been declared. That table is history. History, do you hear me?

I am tempted to file a bug report pointing out that one can’t change all the HTML from the JSPs. To me, that’s a bug—and it ought to be to them, too, because the HTML that lives in the Java suffers from all the markup badness that the rest of DSpace is rife with.

Eh, well. Job security, I suppose. I’d rather have clean markup out of the box, though, and spend my time on other things.

3 Augusti 2005

More on DSpace

Every time I think “I can do this! And it won’t be hard!” about the DSpace redesign, something else hits.

We’ll leave out the 1995-era markup. I’ve dealt with that. (Mostly. I haven’t touched the admin pages yet, though I can tell I’ll have to.) Dealt with it: capitalized markup, unquoted attribute values, grotty tables and all.

(The unquoted attribute values really chap my hide, though. There are general markup best practices, even if you’re not chasing XHTML. Quoting your attribute values is one of ’em. I shouldn’t have to fix that kind of stuff, not in this day and age.)
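
For the curious, this class of cleanup is patterned enough to script. What follows is a minimal sketch, not what was actually run against DSpace: the function name `quote_attributes` and both regular expressions are illustrative assumptions, and the approach assumes no attribute value contains a literal `>`.

```python
import re

# A start tag: "<", a letter, then anything up to the next ">".
# Assumption: no attribute value contains a literal ">".
TAG = re.compile(r'<[A-Za-z][^<>]*>')

# An attribute inside a tag. The value is either already quoted
# (double or single) or a bare run of characters that needs quoting.
ATTR = re.compile(r'''(\s+[A-Za-z][\w:-]*)\s*=\s*("[^"]*"|'[^']*'|[^\s"'<>=]+)''')


def quote_attributes(html):
    """Wrap unquoted attribute values in double quotes; leave the rest alone."""

    def fix_attr(match):
        name, value = match.group(1), match.group(2)
        if value[0] in '"\'':
            return match.group(0)  # already quoted, do not touch
        return '%s="%s"' % (name, value)

    def fix_tag(match):
        return ATTR.sub(fix_attr, match.group(0))

    # Only text inside tags is rewritten; running prose is never touched.
    return TAG.sub(fix_tag, html)


print(quote_attributes('<TD WIDTH=100% ALIGN=left>'))
```

Matching quoted values first is the point of the design: it keeps something like `title="a b=c"` from being mangled by a rule that only knows about bare values.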

We’ll also leave out the browser-sniffing hacks. (They’re still trying to code to Netscape 4. Watch me toss that requirement right out the window. It really is 2005, folks. Honest.)

I was hoping I could just overlay a new stylesheet on theirs. Yesterday I looked at theirs.

THEY DID FONT SIZES IN POINTS. In POINTS!

Hell with it. I’m redoing the whole thing, no matter how much cussing it takes me. There is a level of CSS badness up with which I shall not put.

I’ve seen some uneasy awareness in the DSpace roadmaps that there’s too much Java code in the pages. Well, there is, no question of that. What interests me is why there is. My somewhat-uneducated opinion is that they thought about DSpace in terms of pages, and the page was too big a unit for an ideal design to come out the back end.

“We need a page for X. What goes on it?” they asked, and merrily did they code away until the page did what they wanted it to. Thing is, that led to them coding the same stuff on several pages, differently on each page, and it also blinkered them a bit, in that they didn’t necessarily see how one gizmo could have been useful on several different pages.

Compare this to how blog software is developed—first you figure out the tags, then you build pages from the tags; if you need new functionality, you create a new tag. Some tags permit context-sensitivity, even, behaving slightly differently depending on what page they’re on. That’s how it should have been done. It’s harder to code at first, but it’s easier to document, more flexible to work with, and less crufty in the long run.

(Documentation? Hoo, boy. Paraphrased: “You can customize the look of DSpace by messing with the JSPs or the CSS.” Plus a cursory description of the most commonly-used tags—if you scroll down a lot—sans examples. End documentation. For real documentation of the tags, you’re supposed to go into the .tld file and read the comments. Yeah, that’s friendly. Mm-hm. I haven’t done it yet, so I don’t know if the comments are any good.)

They tried to go tag-based, I think; they just didn’t commit to being draconian about it the way the Movable Type or WordPress people did. They are realizing their mistake now, as they consider how hard it’s going to be for people like me (who are doing some fairly hefty customizations—with DSpace architected and UI-designed the way it is, just cleaning up the markup or shifting the layout is a hefty customization!) to update their work when new DSpace versions come out.

I’ve heard Fedora people laughing up their sleeves at issues like this. I don’t know Fedora well enough to judge whether these particular issues were handled any better, frankly. (I suspect not, though. I don’t think any of these systems involved info-architects, HCI people, or markup experts from the get-go. Ronco Spray-on Usability, indeed.)

To some extent, this is life on the bleeding edge. I expect DSpace to follow the fabled software pattern of being really pretty decent by the 2.1 release. (It’s on 1.2 at the moment; 1.3 is around the corner, and 2.0 is being roadmapped.) Right now, though, it’s pretty frustrating.

24 Maii 2005

Digital medievalists need libraries

A month or so ago, the incredibly nifty Digital Medievalist launched its new journal, RSS feeds and all.

(In passing, I love journal-TOC RSS feeds. They’re the greatest thing ever, like having your very own pull service or academic librarian. I’m glad the Digital Medievalists were so forward-thinking. That said—folks? Fix your URLs, please. I guarantee you won’t be using Cold Fusion forever, and the sooner you make Cold Fusion emit futureproof URLs, the less hair-tearing you’ll have to do when you move off it.)

I ripped through the whole first issue as soon as I heard about it—and yes, I also am proof (if any more were needed) that medieval studies spawns more text geeks per capita than any other discipline there is. How does one not love an article on digital paleography?

Because I’m an armchair accessibility wonk, I also very much appreciated the article on ensuring accessibility of digitized medieval collections. Nothing earthshaking about the techniques or the methods, just a good, solid reminder that accessibility needs attention. Worth perusal, especially for digitization librarians.

The article that grabbed me by the throat and shook, however, talked about digital critical editions and why they’re going away. I really felt for the author; I’ve seen the Ivory Tower shoot down digital-edition projects again and again. Not to mention that I’ve now and again ranted on the subject of markup tools and why they’re horrible.

The story-in-brief here is that the article’s author, desirous of creating digital critical editions of various and sundry works, got frozen out by prestigious print publishers, when they discovered “that electronic editions cost no less than print editions to produce and require staff to be educated in the new possibilities.” (I might add that the revenue stream of such a work is uncertain at best.) He then built himself an academic fiefdom to put out such editions; he says he’s been moderately successful, but he’s not satisfied.

He blames the unwillingness of scholars to grant due professional credit to the authors of electronic editions (isn’t it just bloody funny how academia is falling all over itself to use digital resources, but doesn’t in the least want to produce them?) and the appalling state of production tools and workflows. (I have no quarrel with that last point, of course, but I confess to a mild amount of bogglement that he thinks TeX is somehow easier than your average XML editor.)

What I want to know is, where the seven hells are this guy’s librarians?

Of course no for-profit or even cost-recovery publisher is going to touch digital editions. Too much expense, not enough audience. Of course spare-bedroom fiefdoms can’t pick up the slack; nothing can that lives only as long as a spare bedroom.

But academic libraries are doing these jobs by the metric tonne. We know things about TEI that people who wrote the Guidelines don’t. (Such as, TEI-Lite is unusable because it doesn’t allow both <head> and <label> at the beginning of a text division. What blockhead decided that, pray?) A lot of us have humanities backgrounds, even humanities-computing backgrounds; I really wasn’t kidding about ex-medievalist text geeks.

And neither publishers nor fiefdoms tend to pay any attention at all to the long-term survival of the result. Libraries are different. Long-term survival of knowledge is our business.

There’s a lot of buzz buzzing about how to extend open-access into the humanities space. I submit that journals aren’t the place to start. Open-access journals, institutional repositories, and so forth didn’t get any legs until the current system started getting too messy for the sci-tech-med folks. To extend those gains, we academic digitization librarians need to keep an eye out for areas where the current publishing system breaks down for humanities scholars. This is one. So let’s get cracking.

15 Februarii 2005

Happy hacking

Poor Walt is trying to do a nice thing for me and people like me who read onscreen, and all I can do is kvetch. It’s bad of me, no doubt about it. So I’m going to turn it into an opportunity to talk a little more about desperate-hacking, hoping to bring a little good out of evil.

I’m using a loose definition of “hacking” here. It’s not necessarily scripting or heavy-duty programming—sometimes it’s wading through a program’s configuration options to make it behave in a particular desirable way. For me, it’s sometimes writing a regular expression search-and-replace instead of fixing several instances of the same problem by hand.

Just to be clear, I’m not trying to push Walt to go in any particular direction. He’ll do whatever works for him, and that’s fine. He just makes an extremely convenient example.

Walt’s problem is a not-atypical one in markup circles. He’s got stuff in Microsoft Word (version unknown, so don’t talk to me about WordML) that he wants to put on the Web in HTML. He wants to keep typographical niceties such as curly quotes and em dashes. He wants a certain amount of typographical attractiveness in the final result, but he isn’t horrendously picky about his layout. He is extremely unwilling to hand-tweak HTML code. He is also extremely unwilling to learn new tools (especially if he has to manually chain tools together) or futz around with the process once it’s working, because this is not a one-time conversion deal—he wants to keep doing it as he creates new articles.

I want a few additional things, as it happens. I want Walt’s web pages to look at least half as polished as his PDFs. I want links and some basic navigation. I want reasonably clean code under the hood, because who knows what Walt (or someone else) may want to do with these in the future? (No, I’m not insisting on XHTML 1.0 Strict. Valid HTML 4.01 Transitional will do fine.)

So can a single process make both of us happy?

Well, sine qua non: there’d be an upfront development cost. We’d have to build a CSS stylesheet that made both Walt and me happy, and we’d have to turn me loose for a while until whatever I hack actually works, including handling grotty edge cases. (If I were hacking for myself, I wouldn’t be as picky; a few hand tweaks are no big deal for me. Walt, however, shouldn’t have to mess with that.) The question is whether we’d save time in the long run.

If I had access to Walt’s desktop and could install a scripting language, sure. Then it’d be easy. I could knock something out in Python to beat Word’s horrible HTML to a pulp in a day or three—and it’d handle a whole issue at a time, too, no cut-and-pasting individual articles. (Because, yuck. Who wants to do that all day?) Would I? Oh, sure. Here’s my thought process:

  1. Is this problem patterned enough to be hackable? Yes, certainly. Process links and paragraph-internal spans of boldface or italics, chop up an issue into individual articles (found by their headings), dump each article into a canned HTML template containing links to CSS files and the basic navigation structure, write to disk, end of problem. (For extra points, take care of the FTP.) Same process for each zine issue, so no hand-tweak problems once the bugs are worked out.

    Lesson: Problems without patterns aren’t hackable. If you give me a random collection of Word documents from a random collection of people, I can’t swing a one-shot, no-tweaks conversion, because each document will be different from all other documents. Hacks need patterns to grab onto, because computers need patterns in order to work; it’s only human beings who are wired to impose patterns on what they see.

    Authors? When editors ask you to be consistent about your use of styles? THIS IS WHY. It isn’t petty authoritarianism, honest. Cataloguers? When programmers complain about MARC-tagging errors that don’t affect access? THIS IS WHY. You’ve obliterated a pattern; a human may be able to see past it, but the computer can’t.

  2. Is this problem ongoing or repetitive? Yes. Walt wants to do essentially the same thing many times.

    I don’t bother hacking something for a one-time project if hacking will take me more time than fixing the problem by hand. Just common sense.

    The book I’m currently working on for the Digital Content Group has a few nasty pieces of two-column print. The OCR engine blithely captured a line at a time ignoring the columning entirely, which of course scrambled the text like eggs. I might be able to script part of a fix—but it’s not worth my time; there isn’t enough two-column text and it’s a rather hard problem owing to scannos and a basic lack of patterning. So I’m fixing it by hand.

    Lesson: Ongoing or repetitive tasks are top hack candidates. If you’re doing something by hand over and over, don’t; figure out how to hack it. But don’t spend an hour hacking something that’s a one-time two-minute hand-fix. Over time, one develops a sense of how long it’ll take to hack something versus hand-fixing it.

  3. Is this problem so difficult (for the hacker in question, not for, say, Donald Knuth) that hacking it given the time constraints isn’t practical? Nah. I think I could swing this even in (ugh) VB, though I’d have to steal code from a friend of mine. (He won’t mind. He gave it to me.)

    This shouldn’t be a deal-breaker, though, or you’ll never hack anything. If it is a tough-but-solvable problem, ask yourself if what you’ll learn from it justifies the extra time it’ll take to hack it vis-a-vis hand-tweaking it. This, I think, is where a lot of librarians miss a lot of boats. You learn to hack by hacking; it’s not magic. I’m terribly rusty at the moment, in fact (not that I’ve ever exactly been expert), simply because I haven’t been doing much hacking lately. With any luck, whatever job I land will get me back into the swing of things.

    Lesson: Hack anything that’ll teach you how the system works. The knowledge will pay off down the road.

  4. If I hack it, am I likely to create more problems than I solve? In this case, no; the alternative is that Walt either goes back to his old method or quits putting out HTML altogether, which wouldn’t bother him.

    But this is a serious consideration. Hacking your .htaccess file to keep out referrer spammers can bork your site. Loosing a regular expression on your file can change things you didn’t mean to change. Until a hack is pretty thoroughly tested, make sure it doesn’t overwrite any files or make any other irrevocable changes. (My personal regex search-and-replace engine simply can’t overwrite original input files, I’m so paranoid about this.)

    Lesson: Hacking can be dangerous. Practice safe computing.
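
The chop-and-template step from item 1 can be sketched roughly as follows. Everything specific here is an assumption made for illustration: that each article opens with an `<h2>` heading, the names `split_issue` and `write_articles`, the stylesheet `zine.css`, and the output filenames. Link processing and the FTP step are left out.

```python
import re
from pathlib import Path
from string import Template

# Canned page template: stylesheet link plus bare-bones navigation.
# File names in it are placeholders, not anyone's real site layout.
PAGE = Template("""<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>$title</title>
<link rel="stylesheet" type="text/css" href="zine.css">
</head>
<body>
<p class="nav"><a href="index.html">Contents</a></p>
$body
</body>
</html>
""")

# Assumption: every article starts at an <h2> heading (any letter case).
HEADING = re.compile(r'(?i)<h2[^>]*>(.*?)</h2>', re.DOTALL)


def split_issue(issue_html):
    """Return (title, full_page) pairs, one per <h2>-headed article."""
    headings = list(HEADING.finditer(issue_html))
    articles = []
    for i, heading in enumerate(headings):
        # An article runs from its heading to the next heading (or the end).
        if i + 1 < len(headings):
            end = headings[i + 1].start()
        else:
            end = len(issue_html)
        title = re.sub(r'<[^>]+>', '', heading.group(1)).strip()
        body = issue_html[heading.start():end].strip()
        articles.append((title, PAGE.substitute(title=title, body=body)))
    return articles


def write_articles(articles, out_dir):
    """Write each page to out_dir as article01.html, article02.html, ..."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for number, (_title, page) in enumerate(articles, start=1):
        target = out_dir / ("article%02d.html" % number)
        # Mode "x" refuses to clobber a file that already exists.
        with open(target, "x", encoding="utf-8") as handle:
            handle.write(page)
```

Anything before the first heading (a masthead, say) is simply dropped, which is the sort of edge case that would need deciding before a sketch like this became a real tool.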

I don’t have access to Walt’s desktop to install Python, though, so his problem gets a little harder. Either I have to use (ugh) VB to bend Word to my bidding, or I have to find tools that do what I need them to do without tripping Walt’s annoyance meter.

I still think I could do it without resorting to (ugh) VB. Almost. The sticky point is chopping up the issue into articles. If Walt’s willing to do that much by hand (and he is doing so now, as I understand it), the rest is feasible. All it takes is a search-and-replace engine capable of running a number of regular-expression search-and-replaces on a number of files at once. Given that, Walt saves out an issue, hand-cuts it up into articles, batch-runs the resulting files through the engine with a list of searches I would hack together, and that’s that.

Are there such engines? Yes. Ye UNIX types can button it, because I know about sed, thank you. Such engines exist for Windows with pretty GUI front-ends, is what I’m saying. For free, yet. (ReplaceEm is the slickest one I know about, but there are others.)
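
For anyone who would rather script it than click it, here is a minimal sketch of such an engine in Python. It is not ReplaceEm and not my own tool; the function name `batch_replace`, the sample `RULES`, and the output-directory convention are all assumptions made for illustration. The one firm design choice is that results go to new files, so the originals cannot be overwritten.

```python
import re
from pathlib import Path

# Illustrative rules only: each is a (pattern, replacement) pair,
# applied in order to every file.
RULES = [
    (r'<b>(.*?)</b>', r'<strong>\1</strong>'),
    (r'<i>(.*?)</i>', r'<em>\1</em>'),
]


def batch_replace(paths, rules, out_dir, encoding="utf-8"):
    """Run every rule over every file, writing results into out_dir.

    Input files are only ever read. Output files are opened in "x"
    mode, which raises FileExistsError rather than replace anything.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    compiled = [(re.compile(pattern), repl) for pattern, repl in rules]

    written = []
    for path in map(Path, paths):
        text = path.read_text(encoding=encoding)
        for pattern, repl in compiled:
            text = pattern.sub(repl, text)
        target = out_dir / path.name
        with open(target, "x", encoding=encoding) as handle:
            handle.write(text)
        written.append(target)
    return written
```

Usage would be something like `batch_replace(Path("issue").glob("*.html"), RULES, "converted")`, where both directory names are placeholders. Running it twice against the same output directory fails loudly instead of quietly replacing the first run's files.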

All in all, though, the safest route would probably be (ugh) VB, since Walt works in Word anyway. But that’s not the point. The point is: Some jobs are more hackable than others. Hacking a hackable job costs time in the short run but saves it in the long run. Hack safely, and you’ll be happy.

15 Novembris 2004

Unclear on the concept

My prospective TAG client is having word-processing issues.

No, no, I don’t know what word-processor she uses, or even whether she uses one. But she’s clearly got a word-processing mental model of XML, and it’s starting to be a problem.

First she wanted to know what font I was going to use to enter her Greek. Um, I’m going to give her XML with Unicode. As long as she has a Unicode font and I have a Unicode font, it doesn’t even remotely matter if my font isn’t the same as hers; if I’ve entered an alpha, she’s going to see an alpha.
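
A small illustration of why the font is beside the point, using only the Python standard library. The element name `w` and the helper `describe` are made up for the example; the character facts are plain Unicode.

```python
import unicodedata
import xml.etree.ElementTree as ET

ALPHA = "\u03b1"  # Greek small letter alpha, written as a code point


def describe(char):
    """Return the code point and official Unicode name of a character."""
    return "U+%04X %s" % (ord(char), unicodedata.name(char))


# The same character, this time arriving as an XML numeric character
# reference. The parser hands back the identical code point.
doc = ET.fromstring("<w>&#x3B1;</w>")

print(describe(ALPHA))          # U+03B1 GREEK SMALL LETTER ALPHA
print(doc.text == ALPHA)        # True
print(ALPHA.encode("utf-8"))    # b'\xce\xb1'
```

Nothing in the file says how the alpha should look; it only says which character it is. Any Unicode-capable font on either end will draw an alpha.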

Now she wants to know what XML editor and validator I use. I told her outright that I use a text editor and shell out to nsgmls. But, look, it DOESN’T MATTER. XML is XML.

These things matter in word processors. You don’t have the right font? Watch your document implode. You don’t have the right word processor? Forget about even opening the document.

But XML isn’t a word-processing format. (Word-processing formats are being written in XML, mind you. But that’s not the same thing.) XML qua XML doesn’t give a damn what you edit it in.

Part of this is my fault. I’m familiar (from unfortunate experience) with “XML” that has Sooper Seekrit Bits inserted by the editing software. Mess with the Sooper Seekrit Bits (say, because you use a different XML editor), and bad things happen when the original user tries to use the file. I mentioned this as a possibility because I like to cover all possibilities (even remote ones). I am now kicking myself for getting embroiled in irrelevancies.

Oh, well. Guess that’s why they pay me the big bucks. (*snerk*)

19 Octobris 2004

Ideologues

Me, to Simon, on technological ideologues: “They’ve never seen a problem they can’t solve because they resolutely ignore problems they can’t solve.”

Which is, if I do say so myself, remarkably descriptive of more than a few people I know, talking about more than a few technologies, from RDBMSes to MARC to markup to the Semantic Web.

Why is it that folks have a hard time hearing “this is great, but it doesn’t solve my problem”?