Corpora: Lexical confusions (was Suggestor algorithms)

Bruce Lambert (bruce@ludwig.pmad.uic.edu)
Tue, 7 Oct 97 14:29:34 -0600

Martin Kay said:

>It seems to me that what you are looking for is less a new algorithm than
>an embodiment of something along the following lines. You can design a
>finite-state transducer that will do a half-way decent job of mapping
>between spellings and their possible pronunciations. Composing this
>transducer with its own inverse gives something that carries spellings onto
>other spellings that share some possible pronunciation with the
>original---interestingly enough, without containing any explicit
>representation of pronunciations at all. Now what you probably want is a
>composition of two machines that are not exact inverses of one another.
>You probably want one that carries spellings onto *probable* pronunciations
>on the input side and one that maps pronunciations onto *possible* spellings
>on the output side.

This is an intriguing set of suggestions. I wonder if Martin could be a bit
more specific. For example, can we get some references that describe a
transducer that maps spellings to possible pronunciations? From my reading of
the psycholinguistic literature, at least, the orthography-to-phonology
mapping still appears to be a (relatively) open problem. Even though there are
text-to-speech systems, I thought they were pretty crude, stumbling on
irregular pronunciations, etc.
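
To check my own understanding of the composition Martin describes, here is a
toy sketch in Python. It is not a real finite-state transducer: the two
relations are small hand-written lookup tables, the names POSSIBLE, PROBABLE,
invert and compose are made up for illustration, and the choice of which
pronunciation counts as "probable" is my assumption. It only shows how
composing a spelling-to-pronunciation relation with an inverse carries
spellings onto spellings, and how using *probable* pronunciations on the input
side makes the result asymmetric.

```python
from collections import defaultdict

# Toy relations (assumed, hand-written): spelling -> set of pronunciations.
# Pronunciations are rough ARPAbet-style strings.
POSSIBLE = {
    "read": {"R IY D", "R EH D"},
    "reed": {"R IY D"},
    "red": {"R EH D"},
    "lead": {"L IY D", "L EH D"},
    "led": {"L EH D"},
}

# Assumption: only the first-choice reading of each spelling is "probable".
PROBABLE = {
    "read": {"R IY D"},
    "reed": {"R IY D"},
    "red": {"R EH D"},
    "lead": {"L IY D"},
    "led": {"L EH D"},
}


def invert(relation):
    """Turn spelling -> pronunciations into pronunciation -> spellings."""
    inverse = defaultdict(set)
    for spelling, pronunciations in relation.items():
        for pron in pronunciations:
            inverse[pron].add(spelling)
    return inverse


def compose(first, second):
    """Relational composition: x -> every z reachable through some shared y."""
    result = {}
    for x, middles in first.items():
        reachable = set()
        for y in middles:
            reachable |= second.get(y, set())
        result[x] = reachable
    return result


# A relation composed with its own inverse: spellings sharing ANY pronunciation.
symmetric = compose(POSSIBLE, invert(POSSIBLE))

# Probable pronunciations in, possible spellings out: no longer symmetric.
asymmetric = compose(PROBABLE, invert(POSSIBLE))

print(sorted(symmetric["read"]))
print(sorted(asymmetric["read"]))
print(sorted(asymmetric["red"]))
```

In this toy table "red" is carried onto "read", but "read" is not carried onto
"red", which is the kind of directionality I take Martin to be after. The
pronunciations themselves never appear in the composed result.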

A few more comments on the same problem: In this thread, we've been talking
about spelling errors as they occur primarily in typewritten text. To suggest
alternatives, one searches a database of correctly spelled words, returning to
the user a ranked set of orthographically or phonologically similar
'neighbors'. In my work on look-alike and sound-alike medication errors, the
problem we've been discussing arises in a very general form. In the domain of
medication errors, the task is to anticipate (and, if possible, prevent)
lexical confusions involving multiple perceptual modalities and multiple
communication media.
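
As a baseline for the "ranked set of neighbors" search, here is a minimal
sketch using the standard difflib module in Python. The word list is a tiny
illustrative sample rather than a real drug database, the function name
neighbors is made up, and the cutoff of 0.6 is simply difflib's default, not a
tuned value. The similarity used is difflib's generic string ratio, which knows
nothing about phonology, handwriting, or keyboards.

```python
import difflib

# Illustrative sample only; a real system would search a full drug-name list.
LEXICON = ["zantac", "xanax", "prozac", "zoloft"]


def neighbors(word, lexicon=LEXICON, n=3, cutoff=0.6):
    """Return up to n lexicon entries most similar to word, best match first."""
    return difflib.get_close_matches(word.lower(), lexicon, n=n, cutoff=cutoff)


print(neighbors("Zantax"))
```

A purely orthographic ranking like this is the starting point; the cases below
are the ones it does not cover.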

For example, sometimes medication errors occur because a handwritten
prescription is faxed to a pharmacy, and the blurred fax is misread. This is a
(visual) perceptual recognition error. To anticipate it, measures of
orthographic similarity would have to take into account OCR-type error patterns
for both handwriting and type (e.g., similarity between p and r, or m and n).
On the other hand, sometimes errors occur when a pharmacist misremembers (or
mishears) one word (e.g., Zantac) that sounds like another word (e.g., Xanax).
This is either a short-term memory error or an auditory perceptual error
involving phonological similarity. To anticipate it, one would need
accurate phonological representations of words, and a good way of computing
similarity between these representations. Still other errors can occur when a
drug name is mistyped into a computer. To anticipate these errors, one would
need to take into account typical insertions, deletions, transpositions, and
substitutions as they occur on a standard 'qwerty' keyboard. Finally, errors
occasionally occur because one drug is confused with another drug that shares
the same indication, dosage form, manufacturer, mechanism of action, color,
shape, etc. Here one needs a 'semantic' representation of each drug, not just a
phonological or orthographic one. On top of all of this, one needs frequency
data on all possible drug names, because the well-known word frequency effect
strongly influences the type and direction of confusions that may occur (i.e.,
rare words are more likely to be misperceived and misremembered if they have
high frequency neighbors).
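
For the typing case, one possible measure is an edit distance that counts
insertions, deletions, substitutions, and adjacent transpositions, with
substitutions between neighboring keys made cheaper. The Python sketch below is
only that, a sketch: the cost of 0.5 for neighboring keys is an arbitrary
assumption, the adjacency test is crude (it ignores the stagger between rows
and covers letters only), and the names typing_distance, sub_cost and
are_adjacent are made up. Real costs would have to be estimated from observed
typing errors.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each letter to a (row, column) position on the keyboard.
KEY_POS = {
    ch: (r, c)
    for r, row in enumerate(QWERTY_ROWS)
    for c, ch in enumerate(row)
}


def are_adjacent(a, b):
    """Crude adjacency: distinct keys at most one row and one column apart."""
    if a == b or a not in KEY_POS or b not in KEY_POS:
        return False
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    return abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1


def sub_cost(a, b, near=0.5):
    """Substitution cost: 0 if equal, `near` (assumed) for neighboring keys."""
    if a == b:
        return 0.0
    return near if are_adjacent(a, b) else 1.0


def typing_distance(s, t):
    """Keyboard-weighted edit distance with adjacent transpositions."""
    s, t = s.lower(), t.lower()
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)  # delete all of s[:i]
    for j in range(1, m + 1):
        d[0][j] = float(j)  # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitution
            )
            # Swap of two adjacent characters counts as a single error.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[n][m]


print(typing_distance("zantac", "zantsc"))  # neighboring-key slip (a/s)
print(typing_distance("zantac", "zantpc"))  # distant-key substitution (a/p)
print(typing_distance("zantac", "zanatc"))  # transposition
```

The same skeleton could take a different cost table for handwriting or OCR
confusions (p/r, m/n), but it says nothing about frequency or about the
'semantic' similarities mentioned above; those would need separate terms.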

Thus, what's needed are measures for predicting the likelihood of lexical
confusions across media and perceptual modalities. Also needed are strategies
for preventing these confusions when we predict they are likely to occur. I
offer these examples because I think they may shed some light on the general
problem of lexical confusion, and because I could use some feedback on how to
proceed.

Bruce Lambert, PhD
College of Pharmacy
University of Illinois at Chicago
Phone: 312-996-2411
Fax: 312-996-0868