Re: Corpora: Lexical confusions (was Suggestor algorithms)

Ted E. Dunning (ted@aptex.com)
Tue, 7 Oct 1997 13:50:07 -0700

>>>>> "b" == Bruce Lambert <bruce@ludwig.pmad.uic.edu> writes:

b> Martin Kay said:

>> ... spelling -> sound -> spelling ...

b> From my reading of the
b> psycholinguistic literature, at least, it appears that the
b> orthography-to-phonology mapping problem is still a
b> (relatively) open problem.

mapping to *correct* pronunciation is definitely a problem. mapping
to the set of *possible* or *likely* pronunciations is not nearly so
hard.
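(a minimal sketch of what i mean, using the old Soundex code as a crude stand-in for a real orthography-to-phonology mapping: two spellings are treated as likely-confusable when they collapse to the same sound key. the letter groupings are the standard Soundex ones; a serious system would use something richer.)

```python
# collapse a spelling onto a crude sound key (classic Soundex).
# spellings sharing a key are candidate "sounds alike" confusions.
def soundex(word):
    codes = {
        'b': '1', 'f': '1', 'p': '1', 'v': '1',
        'c': '2', 'g': '2', 'j': '2', 'k': '2',
        'q': '2', 's': '2', 'x': '2', 'z': '2',
        'd': '3', 't': '3',
        'l': '4',
        'm': '5', 'n': '5',
        'r': '6',
    }
    word = word.lower()
    head = word[0].upper()
    encoded = []
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        code = codes.get(ch, '')
        if code and code != prev:
            encoded.append(code)
        if ch not in 'hw':      # h and w do not break a run of like codes
            prev = code
    return (head + ''.join(encoded) + '000')[:4]
```

this deliberately throws away most of the pronunciation; the point is only that mapping to the set of *possible* sound classes is cheap.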

b> ... model the error mechanisms ...

modelling sources of errors is definitely a good idea for some
applications. i am not very convinced that accurate error modelling
is necessary for the original application, though.

another interesting parallel, though, is the matching of DNA
sequences. the similarity of two sequences is generally modelled
using multiple error models, at the level of the raw nucleotide
sequence as well as at the level of the encoded amino acids or even in
protein structure. for example, it is possible for there to be a
shift in reading frame or an insertion of non-coding material into a
gene without loss of function. these would constitute nucleotide
level changes. on the other hand, there are a variety of amino acid
substitutions which do not alter the structure and function of a
protein. finally, mutations in functional areas of a protein are much
more critical than mutations in regions with purely structural
purposes. the overall result is that sequences which superficially
appear quite different can be quite similar in function and this
similarity can (sometimes) be detected by computational means.
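(to make the computational side concrete: the standard tool here is global alignment in the Needleman-Wunsch style, where the substitution scores encode the error model. the "conservative pairs" below are an illustrative simplification i am inventing for the sketch, not a real scoring matrix like BLOSUM.)

```python
# toy global alignment over amino-acid strings. chemically similar
# residues score as near-matches, so a "conservative" substitution
# costs little while an arbitrary one costs more.
def align_score(a, b, gap=-2):
    # assumed illustrative pairs of similar residues, not real data
    similar = {('i', 'l'), ('l', 'i'), ('d', 'e'), ('e', 'd'),
               ('k', 'r'), ('r', 'k')}

    def sub(x, y):
        if x == y:
            return 2
        if (x, y) in similar:
            return 1        # conservative substitution: mild penalty
        return -1           # non-conservative substitution

    n, m = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] against b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    return dp[n][m]
```

with scores like these, two sequences that differ only by conservative substitutions align nearly as well as identical ones, which is the sense in which superficially different sequences can come out functionally similar.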

b> Thus, what's needed are measures for predicting the likelihood
b> of lexical confusions across media and perceptual
b> modalities.

seconded and passed. for some situations.

in others, very crude measures may suffice.
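(one crude measure in this spirit is plain edit distance between two words, ignoring sound and error mechanisms entirely; a sketch:)

```python
# plain Levenshtein edit distance as a crude confusability measure:
# the fewer edits between two words, the more confusable we call them.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```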

b> ... I could use some feedback on how to proceed.

hmmmm.... that is a much taller order!