25 March 2005

Scattered thoughts on meta-search

Lorcan Dempsey has a fine and pointed post on why the Googlish “spider it up and search it all locally” approach tends to work better in many situations than either the libraryish “offer a ton of databases and let the user sort ’em out” approach or the federated-search “send a single query hither and yon” approach.

Don’t miss the comments, which make a cogent point about excluding likely-irrelevant resources being highly helpful to successful searching. Remember the uproar when Google pondered excluding blogs from search results? If that offended you—consider the case of paid search-result placement, or “safe-search” controls on image searches. Some stuff I guarantee you don’t want to search.

The last comment (as of this posting) by Larry Campbell, though, got closer to my somewhat inchoate thoughts:

“Wouldn’t you expect to find (to mix metaphors) some specialized search boutiques along with the big boxes, as well as all manner of sizes and services in between?”

I’m not so much concerned with specialized search boutiques (which I take to mean individual targeted databases) as with specialized search parameters. If I’ve learned nothing else this semester in my How To Search Good class, I’ve learned the awesome power of database-specific search parameters. I can pop into Dun and Bradstreet’s and find you all the companies in a specific area with a specific market capitalization; try that with Google. I can find you mentions in the medical or chemistry literature of specific chemicals I’ve never before heard of. Honestly, I’d think this was magic if I hadn’t spent the semester learning about the search parameters behind the curtain.

But how is even Google supposed to give me this kind of niche-directed search power? I don’t see how they can, in large part because the search power isn’t a function of the search engine as much as the organization of the data being searched. It’s all about the metadata, again, and we all know what happens when we tell Web authors to use metadata—not to mention that it’s unclear how Web authors are supposed to know which sets of targeted metadata to use, as that amounts to telepathic awareness of how other people will find their content useful.

That said, even DIALOG has figured out the uses of federated search, and I have to say I think they do a pretty fair job with it. One could do a lot worse than examine how DIALOG’s OneSearch rips apart queries, discards the bits of them that don’t work in particular databases, and reports back to the user about what it could and couldn’t do.
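To make the "rip apart, discard, report back" idea concrete, here is a minimal sketch of it. It is not how OneSearch actually works inside (I have no idea); the `Clause`, `Database`, and `Plan` names, the database names, and the field sets are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    label: str   # field label, e.g. "AU" or "CO"
    value: str   # what the searcher typed for that field


@dataclass
class Database:
    name: str
    labels: set  # field labels this (hypothetical) database understands


@dataclass
class Plan:
    database: str
    sent: list     # clauses this database can actually run
    dropped: list  # clauses discarded for this database


def plan_federated_search(clauses, databases):
    """Split one query per database and record what had to be discarded."""
    plans = []
    for db in databases:
        sent = [c for c in clauses if c.label in db.labels]
        dropped = [c for c in clauses if c.label not in db.labels]
        plans.append(Plan(db.name, sent, dropped))
    return plans


if __name__ == "__main__":
    query = [Clause("AU", "dempsey"), Clause("CO", "acme")]
    dbs = [Database("articles", {"AU", "TI"}), Database("companies", {"CO"})]
    for plan in plan_federated_search(query, dbs):
        # Report back to the user what could and couldn't be done.
        print(plan.database, "sent:", plan.sent, "dropped:", plan.dropped)
```

The part worth copying is the `dropped` list: the searcher gets told which bits of the query each database ignored, instead of silently getting odd results.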

I don’t think DIALOG could do as well as it does, though, if it hadn’t converged on a more-or-less common set of labels for database fields. I can pretty much trust that if there’s an author field in the DIALOG database I’m searching, I get to it with AU=. (I know, I know, personal names and corporate authorship—work with me here, will you?) The /ti suffix never means anything but “title.”

Compare this to the chaotic stupidity of database-field labelling across the universe of scholarly-article databases, a substantial reason RefWorks is such a royal pain to enter data into. This is largely fixable and should be fixed. Time for vendors to sit down in a smoke-filled room and hammer out a standard or two.

Despite my hatred of edge cases, I would want to see such a standard allow specialized parameters to be passed from searcher to engine(s). I don’t want to lose the power of CO= in Dun and Bradstreet just because I’m using my favorite federated-search engine that (because it’s a general-purpose tool) doesn’t recognize it. I would hope this would be as simple as offering a key/value pair option, where the sophisticated user inputs both key and value… but I daresay there are wrinkles I’m not considering, as I haven’t yet thought about this very hard.
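A rough sketch of what that key/value option might look like, with plenty of assumptions: the `COMMON_FIELDS` table, the `build_query` function, and the `LABEL=value ... AND ...` output format are all made up here, not taken from any vendor's syntax. The point is only that a key the general-purpose engine doesn't know gets passed along untouched to any database that says it accepts it.

```python
# Hypothetical shared vocabulary the federated engine knows about.
COMMON_FIELDS = {"author": "AU"}


def build_query(pairs, db_labels):
    """Build a query string for one database from (key, value) pairs.

    Known keys are mapped to the common label; unknown keys are passed
    through as-is (uppercased), so a specialized label like CO survives.
    Returns the query string plus the pairs this database can't handle.
    """
    terms, skipped = [], []
    for key, value in pairs:
        label = COMMON_FIELDS.get(key.lower(), key.upper())
        if label in db_labels:
            terms.append(f"{label}={value}")
        else:
            skipped.append((key, value))
    return " AND ".join(terms), skipped


if __name__ == "__main__":
    pairs = [("author", "dempsey"), ("CO", "acme")]
    print(build_query(pairs, {"AU", "CO"}))  # database that accepts CO
    print(build_query(pairs, {"AU"}))        # database that doesn't
```

One of the wrinkles shows up immediately: two databases might both accept a `CO` key and mean different things by it, and nothing in this sketch can tell.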

While the vendors are in that smoke-filled room, by the way, I’d like to see a standard for search syntax. I don’t care that not all databases support truncation; that’s fine. (Speaking of which, when did Google start stemming? For some reason I only just noticed.) What I want is to use the same character for truncation in every single database I search. It is truly stunningly dumb that I can’t do that already.

If regular-expression engines hadn’t converged on Perl syntax, nobody would use regexes; they’re annoying enough even given their (mostly-)standard syntax! Search is the same way. Nobody should have to read the helpfiles for every database they search just for syntax peculiarities. No federated search should fail because of incompatible search syntaxes. That’s insane. It is past time to fix it.

I’m aware that the hard part of that problem is figuring out fallback behaviors for bits of syntax that a particular database doesn’t support. I’m willing to say the database folks ought to just work through that, though. Over time, it ought to be a unifying force on the industry, as laggards spruce up their search engines to support the new “standard” syntax.
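Here is one way the truncation case could go, as a sketch only. I am assuming `*` as the standard truncation character, a hypothetical `Dialect` record for each database, and one particular fallback (search the bare stem and tell the user) when a database has no truncation at all. A real standard would have to argue about whether that fallback is the right one.

```python
from dataclasses import dataclass
from typing import Optional

STANDARD_TRUNCATION = "*"  # assumed "standard" character, not a real spec


@dataclass
class Dialect:
    name: str
    truncation: Optional[str]  # native truncation character, or None if unsupported


def translate(query, dialect):
    """Rewrite standard-syntax truncation into one database's dialect.

    Returns the rewritten query and a list of notes describing any
    fallback that was applied.
    """
    out, notes = [], []
    for term in query.split():
        if len(term) > 1 and term.endswith(STANDARD_TRUNCATION):
            stem = term[:-1]
            if dialect.truncation is None:
                # Fallback: no truncation here, so search the stem as typed.
                out.append(stem)
                notes.append(
                    f"{dialect.name}: truncation unsupported, searched '{stem}' as-is"
                )
            else:
                out.append(stem + dialect.truncation)
        else:
            out.append(term)
    return " ".join(out), notes


if __name__ == "__main__":
    print(translate("librar* search", Dialect("db-one", "?")))
    print(translate("librar* search", Dialect("db-two", None)))
```

The searcher types the same character everywhere; the per-database weirdness lives in one small table instead of in everybody's head.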

And then we all win, librarians and novice searchers alike.