Re: spell-checking corpora

vosse@ruls41.fsw.LeidenUniv.nl
Wed, 05 Apr 95 13:40:48 +0100

> I am not sure whether I should read the WHOLE corpus to check for mistakes,
> or whether it is enough to use a spell-checker.

I think that depends on the types of errors your particular scanner
makes. E.g., if it sometimes confuses 'in' with 'm', and you scan an
English text, no spell checker will draw your attention to words such
as 'inate' vs. 'mate', or 'rain' vs. 'ram', ... So, the answer to this
question strongly depends on:
1) the size of your corpus
2) the particular errors of the scanner (which may vary with the text,
the font, and even the skew of the scanned page)
3) the chance of a scanning error resulting in an existing word
4) the spelling quality of your input (Garbage In, Garbage Out).
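
To get a rough feel for point 3, one could run something like the
following over a word list. This is only a minimal sketch: the function
names, the tiny lexicon, and the single 'in'/'m' confusion pair are
made up for illustration, and a real scanner will have its own
confusion table.

```python
def confusion_variants(word, printed, read_as):
    """Yield each string obtained by replacing ONE occurrence of
    `printed` in `word` with `read_as` (one scanner slip per word)."""
    start = word.find(printed)
    while start != -1:
        yield word[:start] + read_as + word[start + len(printed):]
        start = word.find(printed, start + 1)


def undetectable_errors(lexicon, confusions):
    """Return sorted (intended, scanned) pairs where the scanned form
    is itself in the lexicon, so a spell checker would stay silent."""
    hits = set()
    for word in lexicon:
        for printed, read_as in confusions:
            for variant in confusion_variants(word, printed, read_as):
                if variant != word and variant in lexicon:
                    hits.add((word, variant))
    return sorted(hits)


# Hypothetical toy lexicon; in practice, load the spell checker's word list.
lexicon = {"rain", "ram", "train", "tram", "scanner", "corpus"}

# Assumed confusion table: the scanner may read 'in' as 'm' or 'm' as 'in'.
confusions = [("in", "m"), ("m", "in")]

pairs = undetectable_errors(lexicon, confusions)
for intended, scanned in pairs:
    print(f"{intended} -> {scanned}")
```

The number of such pairs relative to the size of the word list gives
only a crude indication; it ignores how frequent the words are in your
corpus and how often the scanner actually makes each confusion.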

> Does anyone know, even if for
> other languages, the percentage in which a scanner, when mistaken, produces
> new words from words and not only non-words from words?

You can look at work by Rice, Kanai and Nartker from the University of
Nevada; they have collected such data in 'The Third Annual Test of OCR
Accuracy'. The document (in PostScript) is obtainable at
ftp://ftp.isri.unlv.edu/AR-94.ps

Ciao,

Theo Vosse
----------
Unit for Experimental Psychology
University of Leiden
The Netherlands