27 May 2004

Aggregating metadata

One of the librarian communiques I got about the trials and tribulations of metadata aggregation asked the very simple question, “What do you think aggregators ought to do faced with bad metadata?”

Well, I’m not sure. Let’s take that a piece at a time and see where we end up.

The first question, I suppose, is whether bad metadata is always better than no metadata at all. If it is, then aggregator policy should be to fix whatever can possibly be fixed, and only reject what is so malformed as to be uninterpretable. Sort of the “version-3 browser” method of coping.

I suspect the current situation is very close to this, because metadata aggregators are staggering-out-of-the-eggshell new, and if they’re to have any impact (not to mention further funding) they need to grab as much material as they can get their spider-claws on. No room to be choosy. The chaotic result is not (she said mildly) a state that commends itself very highly to librarians, however.

Frankly, if I were in this situation I’d plan to throw one database away. Accept everything for now, call it a learning experience, and junk the whole thing later, once best practices are clear, in favor of starting fresh with draconian rules about what one accepts. Really. I would do this. It’s easier and cheaper than trying to weed out the existing database.

To do it at all, though, there will have to be a substantial consciousness-raising about metadata quality. (I mean, really, non-valid XML? If it’s defined by a DTD and not an XML Schema — as EAD is — there just isn’t any excuse for that. Validate your stuff before it goes out, people!)
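For what it’s worth, even the cheapest pre-flight check catches a lot. Here is a minimal sketch in Python, using only the standard library, of a well-formedness test. Note the assumption: this is weaker than the DTD validation I’m actually asking for, because the standard-library parser does not validate; for that you would want a validating parser (for example, `xmllint --valid --noout yourfile.xml`). The function name and the toy EAD-ish snippets are mine, for illustration only, and the snippets are not valid EAD.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def is_well_formed(xml_text: str) -> Tuple[bool, str]:
    """Check well-formedness only (NOT validity against a DTD)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser's message includes the line and column of the problem.
        return False, str(err)
    return True, "well-formed"


# Toy snippets, made up for this example.
good = "<ead><eadheader/><archdesc level='collection'/></ead>"
bad = "<ead><eadheader></ead>"  # <eadheader> is never closed

for label, text in (("good", good), ("bad", bad)):
    ok, message = is_well_formed(text)
    print(label, ok, message)
```

A file that fails this test has no business leaving the building; a file that passes it still needs real validation against the DTD.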

Two ways to do that. One is the good old non-threatening educational thing. Conferences, workshops, publications, sample files, and so on. (Catching up to these issues in library school would help. If UW-SLIS is any indication, though, there’s considerable distance to go there.) Unfortunately, I guarantee you that the worst offenders will pay zero attention to these. I absolutely guarantee it.

There is nothing to do with such people except embarrass them. What’s more, the sooner you do it, the better. This actually militates against an accept-everything approach. Bad markup creators are like puppies; if you don’t catch ’em in the act and whack their noses, they don’t clue in.

Fortunately, an aggregator has a few sticks available. The first one is obvious: don’t accept crap content. Now, “crap content” depends on one’s perspective, so it’ll be a bit of a chore setting the rejection/acceptance threshold to the right place. Even so, the principle is clear enough.

The second one is public humiliation. Again, the definition of “public” may vary — but we all like to be competent, so when an aggregator says “Podunk U’s metadata is garbage” louder than a whisper, Podunk U is likely to sit up and take notice.

Neither of these is any use, however, without accompanying availability of training, education, and consultation. That’s unpalatable, no question, but it’s the truth. If well-intentioned people can’t even learn to do things right owing to lack of resources, well… watch me nail up the OEBPS as a Stern Warning.

I wonder whether some of the aggregator builders thought their results would end up something like OCLC’s WorldCat (a vast and surprisingly accurate union catalogue contributed to by thousands of libraries and librarians). Not possible, I’m afraid. WorldCat is fact-checked by its users, themselves librarians. Librarians will indeed use metadata aggregators (and should be able and encouraged to leave feedback!), but there just aren’t going to be enough qualified eyes hitting each entry for that to suffice as an error-checking mechanism; finding aids get a lot less use than catalogue entries. Like it or not, the aggregator owners must stand up and take some responsibility.

So, all this said, what do I think I would do, if I ran a metadata aggregator?

I think I would build a three-tier database. Everything spidered lands in a staging area and gets some kind of once-over. Stuff from people with a history of good production may only be mechanically checked before hitting the top tier of the database. Stuff from new providers gets checked pretty intensively. Problematic stuff may make it into the lower tier or may be rejected altogether; either way, notification is immediately sent to the source, along with an offer of help. A top-tier metadata provider can be demoted if quality declines, and obviously lower-tier providers who clean up their act can be promoted.
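To make the triage concrete, here is a rough sketch in Python of how a record might move out of staging. Everything in it is an assumption of mine rather than a description of any real aggregator: the `Provider` and `Aggregator` classes, the `trusted` flag standing in for “a history of good production,” and especially the rule that failing the mechanical check means rejection while failing only the intensive check means the lower tier. The two checks are passed in as plain functions because what counts as “problematic” is exactly the threshold each aggregator has to set for itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Check = Callable[[str], bool]


@dataclass
class Provider:
    name: str
    trusted: bool = False  # hypothetical flag: history of good records


@dataclass
class Aggregator:
    mechanical: Check  # cheap automated check, run on everything
    intensive: Check   # closer review, run on untrusted providers
    top: List[str] = field(default_factory=list)
    lower: List[str] = field(default_factory=list)
    notices: List[Tuple[str, str]] = field(default_factory=list)

    def triage(self, provider: Provider, record: str) -> str:
        """Move one staged record to a tier, or reject it."""
        if not self.mechanical(record):
            return self._flag(provider, "rejected")
        if provider.trusted or self.intensive(record):
            self.top.append(record)
            return "top"
        self.lower.append(record)
        return self._flag(provider, "lower")

    def _flag(self, provider: Provider, outcome: str) -> str:
        # Notify the source right away, with an offer of help.
        self.notices.append((provider.name, outcome + ": help is available"))
        return outcome


# Stand-in checks, far too crude for real use.
agg = Aggregator(
    mechanical=lambda r: r.startswith("<") and r.endswith(">"),
    intensive=lambda r: "title" in r,
)
veteran = Provider("Good State U", trusted=True)
newcomer = Provider("Podunk U")

results = [
    agg.triage(veteran, "<record/>"),
    agg.triage(newcomer, "<record><title/></record>"),
    agg.triage(newcomer, "<record/>"),
    agg.triage(newcomer, "garbage"),
]
print(results)
print(agg.notices)
```

Promotion and demotion would then be a matter of flipping `trusted` based on a provider’s recent outcomes; I’ve left that policy out because I have no idea what the right numbers are.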

How users see this is up to the aggregator, but if it were my aggregator, ordinary users would only see the top tier. If the entire corpus, including garbage, is available to users, that removes a major incentive for metadata providers to fix problems. I would be tempted to bury a whole-database search somewhere remote, however.

I would build a training program and lots of documentation, probably at least in part online. I would love these things, keep them up-to-date, and make sure lots of my staff knew how to write documentation and train.

But that’s me, and I’ve never, ever, ever done this or anything like it, so what do I know?

All this worry over metadata quality has implications for the Semantic Web too, incidentally. If an RDF spider accepts uncritically whatever it’s handed, muddying the waters becomes trivially easy. There’s lots of handwaving about reification creating trust metrics and authority, but to my eye it’s just handwaving: I can fake the reification dead easily (“Leigh Dodds said this! Really!”) and the only way to stop me is to peer outside the RDF. I think some of the trust-metric handwaving tries to address this issue, but again… show me something that isn’t handwaving and maybe I’ll play along.
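Here is a toy illustration of the complaint, in Python with plain tuples standing in for triples (no RDF library involved). The `rdf:Statement`, `rdf:subject`, `rdf:predicate`, and `rdf:object` terms are the standard reification vocabulary; using Dublin Core’s `creator` to say who made the statement is my own choice for the example, and the `example.org` and `example.net` URLs, the `reify` helper, and the naive merging `spider` are all invented.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"


def reify(stmt_id, subject, predicate, obj, claimed_author):
    """Describe a statement, plus a claim about who made it, as triples."""
    return {
        (stmt_id, RDF + "type", RDF + "Statement"),
        (stmt_id, RDF + "subject", subject),
        (stmt_id, RDF + "predicate", predicate),
        (stmt_id, RDF + "object", obj),
        (stmt_id, DC + "creator", claimed_author),
    }


def spider(sources):
    """Naive merge: union all triples, forgetting where each came from."""
    merged = set()
    for triples in sources.values():
        merged |= triples
    return merged


# Hypothetical sources: the second one is me, lying.
sources = {
    "http://example.org/leigh": reify(
        "_:s1", "http://example.org/doc", DC + "title", "A Real Title",
        "Leigh Dodds"),
    "http://example.net/me": reify(
        "_:s2", "http://example.org/doc", DC + "title", "Total Nonsense",
        "Leigh Dodds"),
}

merged = spider(sources)
claims = sorted(s for s, p, o in merged
                if p == DC + "creator" and o == "Leigh Dodds")
print(claims)  # both statements now claim the same author

# The only tell is outside the triples: which URL served each one.
origins = {url: sorted(s for s, p, o in triples if p == RDF + "type")
           for url, triples in sources.items()}
print(origins)
```

Inside the merged graph the two attributions look identical. Telling them apart takes the fetch URL, or a signature, or something else that lives outside the RDF itself.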