Caveat Lector » I <3 Cliff Lynch

Dies Lunae, 12 Iunii 2006

I <3 Cliff Lynch

Sitting in the JCDL 2006 plenary listening to Google and OCA talk about their mass scanning projects.

Then Cliff Lynch got up and nailed ’em to the wall. “What about the OCR piece?” he asked, and he talked about the same concerns I have about it. Human intervention versus error rates. Algorithms. Typesetting issues.

Google guy talked about “what the challenges are.” He says 98-99% accuracy is often good enough. (Yeah, for his purposes. But for a human?) Says it’s the most computationally expensive part of the process. (And what about non-Latin character sets? says the person next to me, cogently.)
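(A back-of-the-envelope sketch of what "98-99%" means to a human reader. Everything here is my assumption, not anything Google said: that the figure is per-character accuracy, that errors are independent, and that a page runs about 2,000 characters. Under those assumptions, 98% works out to roughly 40 wrong characters per page and 99% to roughly 20.)

```python
def expected_errors(chars_per_page: int, accuracy: float) -> float:
    """Expected number of misread characters on one page."""
    return chars_per_page * (1.0 - accuracy)


def clean_word_probability(word_length: int, accuracy: float) -> float:
    """Chance that every character of a word is read correctly,
    assuming character errors are independent of one another."""
    return accuracy ** word_length


# Hypothetical page size, chosen only for illustration.
CHARS_PER_PAGE = 2000

for accuracy in (0.98, 0.99):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    clean = clean_word_probability(6, accuracy)
    print(f"{accuracy:.0%}: about {errors:.0f} bad characters per page, "
          f"{clean:.1%} chance a 6-letter word comes through clean")
```

If the quoted figure is per-word rather than per-character, the numbers shift, but the point stands: a rate that is fine for search indexing still leaves visible damage on every page.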

Mentions long-s versus f as a “very embarrassing” mistake source.

They are doing language-specific computing on the scans. “More obscure” languages (Arabic is obscure? oh, please) cause even more problems.

Library guy: Not uncommon to have somebody come in looking for the one word you got wrong! (E.g. textual scholarship.)

Cliff: Multi-lingual issues (Chirac of France going ballistic), “likes the idea of nations competing to digitize” cultural heritage. Even just from five research libraries, you get plenty of multi-lingual, multi-national stuff. How to diversify the content base?

OCA guy: OCA is opportunistic; they digitize what they can grab onto. Latin scripts only! They want to do more (but can they?).

49% of the material from the Five Partners is English. Hundreds of other languages represented. Google can do Cyrillic, Greek, Latin scripts; working on Chinese, Japanese, Arabic. Google doesn’t want to play the cultural imperialist; “doing it for them” hubristically ignores that they want to do it themselves. (I agree.)

OCA guy: let’s provoke the French! and the Russians! and everybody! more digitization is good!

Cliff: Preservation. Cliff asks for confirmation that brittle materials are not being targeted for scanning. Preservation-worthy surrogates, or not? Is Google a disaster-recovery plan (floods, war, whatever) for libraries?

Library guy: Michigan has been pushing a digitization agenda. Only Michigan (of the five) is allowing access to brittle materials. Ongoing negotiation with Google.

Google: did not design a process for high-value, extremely brittle materials. Google is okay with that; other people can handle it with different techniques. Digitization quality: standards unclear. Some say Google scans are good enough; some don’t. Scalability of digital-preservation requirements; debate is changing from how to handle a specific book to how to handle whole collections.

OCA: think of it as part of collection-management decisions; for some highly redundant, low-use stuff, 98% accuracy is plenty good enough. The aim is to avoid costs on the redundant stuff, so that resources can be spent on the special stuff.

Google: preservation of artifact is different from preservation of the information content. Most users want the info content, not the artifact. (Yep. Book-smellers are a minority, and they need to get over themselves.)

Cliff Lynch: Google can do massive computation on its massive datastore that other people can’t. That could be a pretty serious business advantage, a serious research advantage. Why is the access to the datastore so private? How can it be opened up?

Google: Send email, tell Google what research you want to do! Hope is not to hinder the advancement of scholarship. Where is the balance? NYPL won’t be able to make that corpus available for research algorithms; too computationally expensive.

OCA: huge free-rider problem. Notes that UMich can do as it pleases with its own copy of the data (subject to restrictions e.g. from publishers). Will the restrictions relax? Too early to tell. It is a serious problem. How do we platform this stuff such that the research can happen? “A blessing and a curse” to be involved.

Google guy: SETI@Home as an example of how to make big computation on big datasets possible. Supporting that kind of environment is a non-trivial problem.

Opened to audience questions.

Post-1923 public domain works: OCA is not trying to research this. Orphan works are an even bigger headache. Google isn’t trying either, and is being very conservative about potentially copyright-governed stuff. Maybe a community effort to clear rights, but what evidence of out-of-copyrightness can Google accept, since there’s no central authority to sign off on it?

Who owns, preserves, manages annotations and other value-adds? Google: whoever does it and puts it on their servers owns it, but who owns a link? Can’t answer; can only restate question. NYPL: it’s being talked about. OCA: this is one of the problems (like copyright checking and copyediting) that’s “too big” to tackle.

Good session. The first Google-book-project session I’ve ever witnessed or heard about that didn’t devolve into brain-dead hysteria. Go Cliff Lynch!
