Happy hacking
Poor Walt is trying to do a nice thing for me and people like me who read onscreen, and all I can do is kvetch. It’s bad of me, no doubt about it. So I’m going to turn it into an opportunity to talk a little more about desperate-hacking, hoping to bring a little good out of evil.
I’m using a loose definition of “hacking” here. It’s not necessarily scripting or heavy-duty programming—sometimes it’s wading through a program’s configuration options to make it behave in a particular desirable way. For me, it’s sometimes writing a regular expression search-and-replace instead of fixing several instances of the same problem by hand.
Just to be clear, I’m not trying to push Walt to go in any particular direction. He’ll do whatever works for him, and that’s fine. He just makes an extremely convenient example.
Walt’s problem is a not-atypical one in markup circles. He’s got stuff in Microsoft Word (version unknown, so don’t talk to me about WordML) that he wants to put on the Web in HTML. He wants to keep typographical niceties such as curly quotes and em dashes. He wants a certain amount of typographical attractiveness in the final result, but he isn’t horrendously picky about his layout. He is extremely unwilling to hand-tweak HTML code. He is also extremely unwilling to learn new tools (especially if he has to manually chain tools together) or futz around with the process once it’s working, because this is not a one-time conversion deal—he wants to keep doing it as he creates new articles.
I want a few additional things, as it happens. I want Walt’s web pages to look at least half as polished as his PDFs. I want links and some basic navigation. I want reasonably clean code under the hood, because who knows what Walt (or someone else) may want to do with these in the future? (No, I’m not insisting on XHTML 1.0 Strict. Valid HTML 4.01 Transitional will do fine.)
So can a single process make both of us happy?
Well, sine qua non: there’d be an upfront development cost. We’d have to build a CSS stylesheet that made both Walt and me happy, and we’d have to turn me loose for a while until whatever I hack actually works, including handling grotty edge cases. (If I were hacking for myself, I wouldn’t be as picky; a few hand tweaks are no big deal for me. Walt, however, shouldn’t have to mess with that.) The question is whether we’d save time in the long run.
If I had access to Walt’s desktop and could install a scripting language, sure. Then it’d be easy. I could knock something out in Python to beat Word’s horrible HTML to a pulp in a day or three—and it’d handle a whole issue at a time, too, no cut-and-pasting individual articles. (Because, yuck. Who wants to do that all day?) Would I? Oh, sure. Here’s my thought process:
-
Is this problem patterned enough to be hackable? Yes, certainly. Process links and paragraph-internal spans of boldface or italics, chop up an issue into individual articles (found by their headings), dump each article into a canned HTML template containing links to CSS files and the basic navigation structure, write to disk, end of problem. (For extra points, take care of the FTP.) Same process for each zine issue, so no hand-tweak problems once the bugs are worked out.
Lesson: Problems without patterns aren’t hackable. If you give me a random collection of Word documents from a random collection of people, I can’t swing a one-shot, no-tweaks conversion, because each document will be different from all other documents. Hacks need patterns to grab onto, because computers need patterns in order to work; it’s only human beings who are wired to impose patterns on what they see.
Authors? When editors ask you to be consistent about your use of styles? THIS IS WHY. It isn’t petty authoritarianism, honest. Cataloguers? When programmers complain about MARC-tagging errors that don’t affect access? THIS IS WHY. You’ve obliterated a pattern; a human may be able to see past it, but the computer can’t.
-
Is this problem ongoing or repetitive? Yes. Walt wants to do essentially the same thing many times.
I don’t bother hacking something for a one-time project if hacking will take me more time than fixing the problem by hand. Just common sense.
The book I’m currently working on for the Digital Content Group has a few nasty pieces of two-column print. The OCR engine blithely captured a line at a time, ignoring the columning entirely, which of course scrambled the text like eggs. I might be able to script part of a fix, but it’s not worth my time; there isn’t enough two-column text and it’s a rather hard problem owing to scannos and a basic lack of patterning. So I’m fixing it by hand.
Lesson: Ongoing or repetitive tasks are top hack candidates. If you’re doing something by hand over and over, don’t; figure out how to hack it. But don’t spend an hour hacking something that’s a one-time two-minute hand-fix. Over time, one develops a sense of how long it’ll take to hack something versus hand-fixing it.
-
Is this problem so difficult (for the hacker in question, not for, say, Donald Knuth) that hacking it given the time constraints isn’t practical? Nah. I think I could swing this even in (ugh) VB, though I’d have to steal code from a friend of mine. (He won’t mind. He gave it to me.)
This shouldn’t be a deal-breaker, though, or you’ll never hack anything. If it is a tough-but-solvable problem, ask yourself if what you’ll learn from it justifies the extra time it’ll take to hack it vis-a-vis hand-tweaking it. This, I think, is where a lot of librarians miss a lot of boats. You learn to hack by hacking; it’s not magic. I’m terribly rusty at the moment, in fact (not that I’ve ever exactly been expert), simply because I haven’t been doing much hacking lately. With any luck, whatever job I land will get me back into the swing of things.
Lesson: Hack anything that’ll teach you how the system works. The knowledge will pay off down the road.
-
If I hack it, am I likely to create more problems than I solve? In this case, no; the alternative is that Walt either goes back to his old method or quits putting out HTML altogether, which wouldn’t bother him.
But this is a serious consideration. Hacking your .htaccess file to keep out referrer spammers can bork your site. Loosing a regular expression on your file can change things you didn’t mean to change. Until a hack is pretty thoroughly tested, make sure it doesn’t overwrite any files or make any other irrevocable changes. (My personal regex search-and-replace engine simply can’t overwrite original input files, I’m so paranoid about this.)
Lesson: Hacking can be dangerous. Practice safe computing.
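For the curious, the pipeline from the first question above (chop an issue into articles by their headings, pour each one into a canned template, write to disk) might look roughly like this in Python. It is a minimal sketch, not a working converter: it assumes every article starts at an `<h1>` heading, and the template, the `zine.css` stylesheet name, the navigation link, and all function names are made up for illustration. Link and boldface cleanup and the FTP step are left out.

```python
import re
from pathlib import Path

# Hypothetical page template; the stylesheet name and nav link are placeholders.
TEMPLATE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>{title}</title>
<link rel="stylesheet" type="text/css" href="zine.css">
</head>
<body>
<p class="nav"><a href="index.html">Contents</a></p>
<h1>{title}</h1>
{body}
</body>
</html>
"""

# Assumption: each article begins with an <h1> heading.
HEADING = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)


def split_articles(issue_html):
    """Return (title, body) pairs, one per <h1> heading in the issue.

    Anything before the first heading is dropped.
    """
    matches = list(HEADING.finditer(issue_html))
    articles = []
    for i, m in enumerate(matches):
        # An article runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(issue_html)
        title = re.sub(r"<[^>]+>", "", m.group(1)).strip()  # strip inline tags
        body = issue_html[m.end():end].strip()
        articles.append((title, body))
    return articles


def slugify(title):
    """Turn a title into a lowercase, hyphenated file name stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def write_articles(issue_html, out_dir):
    """Write one templated HTML file per article into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for title, body in split_articles(issue_html):
        path = out / (slugify(title) + ".html")
        path.write_text(TEMPLATE.format(title=title, body=body), encoding="utf-8")
        written.append(path)
    return written
```

The whole thing hangs on the heading pattern. If the headings aren't consistent from issue to issue, the split falls apart, which is the first lesson all over again.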
I don’t have access to Walt’s desktop to install Python, though, so his problem gets a little harder. Either I have to use (ugh) VB to bend Word to my bidding, or I have to find tools that do what I need them to do without tripping Walt’s annoyance meter.
I still think I could do it without resorting to (ugh) VB. Almost. The sticky point is chopping up the issue into articles. If Walt’s willing to do that much by hand (and he is doing so now, as I understand it), the rest is feasible. All it takes is a search-and-replace engine capable of running a number of regular-expression search-and-replaces on a number of files at once. Given that, Walt saves out an issue, hand-cuts it up into articles, batch-runs the resulting files through the engine with a list of searches I would hack together, and that’s that.
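To make "a list of searches run over a number of files at once" concrete, here is a small Python sketch of such an engine. It is not how ReplaceEm or any other particular tool works. The three rules are illustrative guesses at the kind of cleanup Word's HTML needs, and the function names are hypothetical. In keeping with the safety lesson, it writes results to a separate output directory and refuses to overwrite anything.

```python
import re
from pathlib import Path

# Illustrative rules only: (pattern, replacement) pairs, applied in order.
RULES = [
    (re.compile(r"<b>(.*?)</b>", re.IGNORECASE | re.DOTALL), r"<strong>\1</strong>"),
    (re.compile(r"<i>(.*?)</i>", re.IGNORECASE | re.DOTALL), r"<em>\1</em>"),
    (re.compile(r"\s+style=\"[^\"]*\"", re.IGNORECASE), ""),  # drop inline styles
]


def apply_rules(text, rules=RULES):
    """Run every search-and-replace over the text, in list order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


def batch_replace(in_dir, out_dir, rules=RULES, glob="*.html"):
    """Apply the rules to each matching file in in_dir, writing copies to out_dir.

    Input files are only ever read. Existing output files are never overwritten.
    """
    src, dst = Path(in_dir).resolve(), Path(out_dir).resolve()
    if src == dst:
        raise ValueError("output directory must differ from input directory")
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob(glob)):
        target = dst / path.name
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        cleaned = apply_rules(path.read_text(encoding="utf-8"), rules)
        target.write_text(cleaned, encoding="utf-8")
        written.append(target)
    return written
```

Because the rules run in order, a later rule sees the output of the earlier ones. That is handy, but it is also exactly how a search changes things you didn't mean to change, so test on copies first.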
Are there such engines? Yes. Ye UNIX types can button it, because I know about sed, thank you. Such engines exist for Windows with pretty GUI front-ends, is what I’m saying. For free, yet. (ReplaceEm is the slickest one I know about, but there are others.)
All in all, though, the safest route would probably be (ugh) VB, since Walt works in Word anyway. But that’s not the point. The point is: Some jobs are more hackable than others. Hacking a hackable job costs time in the short run but saves it in the long run. Hack safely, and you’ll be happy.