([personal profile] redbird Jun. 11th, 2001 02:37 pm)
It sounded plausible, last week: we have a bibliographic database that we're putting on the Web, and my bosses want me to look over the author names for coding problems, both actual errors (some doozies there, each of which has to be dealt with individually) and codes that aren't recognized if I run IE with "language" set to English only.

So I started scanning at A, with a separate (Netscape) window open to a table that offers numeric codes, like &#268; (for Č), to replace such things as š and ł.

Three hours, not counting my lunch break, later, I was up to "Ademovič" and starting to realize the enormity of the problem.

A database of more than 300,000 items, most of them with multiple authors, each of which has to be at least glanced at. It isn't much comfort that, for example, there are four entries under "Ademovič", not when they sometimes have different first names, which also need to be examined.

That Bernie sat me down and explained this, and that I asked questions like "what language should I set it to accept?" rather than saying "We couldn't do this in three weeks even if I did nothing else," shows that neither of us really grasped the problem as he was defining it.

I've left him voice mail--which he will retrieve on Wednesday--explaining the difficulty, and am wondering if there's any good way to automate this. I can't just scan the source for tags, because in most cases the tags aren't actually wrong; it's just that there are lots of different character sets out there, many of them overlapping, and with close to 20 years of data on material from most of the world, we're using characters in too many of them to just specify one and be done with it.
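
(If it can be automated at all, I suspect the first pass would look something like the sketch below: pull every numeric code out of the author names and see how many distinct ones we're actually dealing with. The file name and the assumption that the names can be dumped out one per line are both invented on my part.)

    # Sketch: collect every distinct numeric character reference (&#NNN;)
    # in a dump of the author names, so a person only has to judge each
    # code once instead of record by record.
    import re

    ref = re.compile(r'&#\d+;')           # numeric codes like &#268;
    found = set()

    with open('authors.txt') as dump:     # hypothetical export, one name per line
        for line in dump:
            for code in ref.findall(line):
                found.add(code)

    for code in sorted(found):
        print(code)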

(Further technical details suppressed, in the Lewis Carroll sense of the term; I'll be under my desk if you need me.)

From: [identity profile] darius.livejournal.com

automation


I don't understand what you're trying to do, but might you first filter out all the items that don't have tags, before eyeballing what's left? The usual next step after that would be grouping into equivalence classes depending on the details of your problem.
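
Something like the following, maybe; this is only a sketch, and the file names and the guess that the records can be dumped as text are invented:

    # Sketch: keep only the records that contain an entity code at all,
    # so the eyeballing starts from a much shorter list.
    import re

    has_code = re.compile(r'&#?\w+;')     # numeric (&#268;) or named (&eacute;) codes

    with open('authors.txt') as dump, open('flagged.txt', 'w') as out:
        for line in dump:
            if has_code.search(line):
                out.write(line)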

(Offered in the hope but not the expectation that this is helpful.)

From: [identity profile] darius.livejournal.com

Re: automation


I know it's annoying when some pushy ignorant person tries to tell me how to do my job; I hope this doesn't come across that way.

What I meant by grouping into equivalence classes was to have a program find every different potential tag in your database, so you can look at just a few instances of each possible error, just enough to decide how to classify it, then go on to the next tag. The hope is to find most of the error classes relatively quickly in the first pass before getting some poor human to look at every single item. This has the drawback that if you still have to eyeball the remainder, there's less to find, making it harder to stay alert. On the other hand, you can focus better on those filtered candidates than on the raw database, so it's not clear which gets a better result when cost is no object...
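
Concretely, I'm picturing something like the sketch below; the file name and the five-samples-per-code cutoff are made up, and I'm assuming the names can be dumped one per line:

    # Sketch of the equivalence-class idea: for every distinct code that
    # appears, keep a handful of example names, so each code only has to
    # be judged a few times instead of hundreds.
    import re
    from collections import defaultdict

    ref = re.compile(r'&#?\w+;')
    samples = defaultdict(list)

    with open('authors.txt') as dump:        # hypothetical export
        for line in dump:
            for code in ref.findall(line):
                if len(samples[code]) < 5:   # a few instances of each is enough
                    samples[code].append(line.strip())

    for code, names in sorted(samples.items()):
        print(code, names)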

The "other weirdness" is more trouble to get help with, though there are algorithms for finding likely mistakes that don't depend on knowledge of the language, based on statistics from a corpus -- they were developed for spellchecking back when computers were too small for big wordlists. I'm afraid they're probably too hard to apply to the kind of mishmash you're describing.
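
If you did want to experiment with that, the crude version of those statistics is something like this; again only a sketch, with the dump file and the rarity cutoff invented:

    # Rough sketch of the corpus-statistics idea: count character trigrams
    # across all the names, then flag names containing trigrams that almost
    # never occur anywhere else; those are the likeliest mistakes.
    from collections import Counter

    def trigrams(s):
        return [s[i:i+3] for i in range(len(s) - 2)]

    with open('authors.txt') as dump:        # hypothetical export
        names = [line.strip() for line in dump]

    counts = Counter()
    for name in names:
        counts.update(trigrams(name))

    for name in names:
        rare = [t for t in trigrams(name) if counts[t] <= 2]   # arbitrary cutoff
        if rare:
            print(name, rare)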

Er, hi, anyway. I found your journal through baldanders'.
