Re: Corpora: software for sampling and analysing corpus

Alex Chengyu Fang (alex@phonetics.ucl.ac.uk)
Wed, 08 Oct 1997 16:49:30 +0100

At 03:23 PM 8/10/97 +0100, Jean Hudson wrote:
>Can anyone recommend software (apart from Wordsmith) or computational
>methods for doing vocabulary analysis of samples of text to control the
>balance within a large text corpus?
>
>I would like to be able to take a sample of, say, 100,000 words and
>see how many different word forms there are within it. Also, I would
>like to see how the word frequencies within the sample match up with
>a control list of frequencies taken from a larger mixed-text corpus.
>
>It would also be useful to have a list of the words that occur
>significantly more frequently within the sample than they do
>within the language as a whole.
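
The three checks asked for above (counting distinct word forms, comparing against a control list, and listing over-represented words) can be sketched roughly as follows. This is only an illustration under stated assumptions: the tokenizer is deliberately crude, the function names are hypothetical, and the log-likelihood (G2) statistic is a stand-in chosen for this sketch, not a method named anywhere in this thread. The 3.84 cut-off is the usual chi-square critical value for p < 0.05 with one degree of freedom.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a, b, size_a, size_b):
    """G2 for a word seen a times in a corpus of size_a and b times in one of size_b."""
    total = a + b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


def overused_words(sample_counts, control_counts, threshold=3.84):
    """Words relatively more frequent in the sample than in the control list,
    with G2 at or above the threshold, sorted by descending G2."""
    size_s = sum(sample_counts.values())
    size_c = sum(control_counts.values())
    flagged = []
    for word, a in sample_counts.items():
        b = control_counts.get(word, 0)
        if a / size_s <= b / size_c:
            continue  # not over-represented in the sample
        g2 = log_likelihood(a, b, size_s, size_c)
        if g2 >= threshold:
            flagged.append((word, g2))
    return sorted(flagged, key=lambda item: -item[1])


sample = Counter(tokenize("the cat sat on the mat the cat"))
print(len(sample), "different word forms in", sum(sample.values()), "tokens")
```

Here len(sample) gives the number of different word forms, and overused_words(sample, control) would give the over-represented list once control is a Counter built from the larger mixed-text corpus.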

I remember there was a similar discussion a couple of years ago.

Again, J.B. Carroll's F (Frequency), D (Dispersion), U (Adjusted Frequency),
and SFI (Standard Frequency Index) are very useful in this kind of
investigation. For a more detailed description, see my web page:

http://www.phon.ucl.ac.uk/home/alex/project/strata/strata.htm
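
As a rough illustration of the four measures, here is a minimal sketch assuming the commonly cited formulation of Carroll's statistics: D is an entropy-based dispersion over n corpus categories, U is the frequency per million tokens adjusted by D, and SFI = 10(log10 U + 4). The function name and the argument layout are hypothetical, and the formulas should be checked against the detailed description before being relied on.

```python
import math


def carroll_measures(freqs, sizes):
    """Return (F, D, U, SFI) for one word.

    freqs[i] is the word's count in category i and sizes[i] is the number of
    tokens in category i (an assumed input layout).
    """
    n = len(freqs)
    if n < 2 or n != len(sizes):
        raise ValueError("need matching counts and sizes for at least two categories")
    total_tokens = sum(sizes)  # N
    F = sum(freqs)
    if F == 0:
        raise ValueError("word does not occur in any category")

    # Proportion of each category taken up by the word.
    p = [f / s for f, s in zip(freqs, sizes)]
    p_sum = sum(p)

    # Dispersion: 1.0 when evenly spread, 0.0 when confined to one category.
    weighted_log = sum(pi * math.log2(pi) for pi in p if pi > 0) / p_sum
    D = (math.log2(p_sum) - weighted_log) / math.log2(n)

    # Adjusted frequency per million tokens.
    f_min = sum(f * s for f, s in zip(freqs, sizes)) / total_tokens
    U = (1_000_000 / total_tokens) * (F * D + (1 - D) * f_min)

    # Standard Frequency Index: U = 1 per million corresponds to SFI = 40.
    SFI = 10 * (math.log10(U) + 4)
    return F, D, U, SFI


# Same raw frequency, different spread across two equal-sized categories.
even = carroll_measures([10, 10], [1000, 1000])
skewed = carroll_measures([20, 0], [1000, 1000])
print(even)
print(skewed)
```

Both example words have F = 20, but the word confined to a single category gets D near 0 and a lower U and SFI than the evenly spread one, which is the property that makes these measures useful for checking balance across samples.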

---------------------------------------------
Alex Chengyu Fang
Senior Research Fellow
Department of Phonetics and Linguistics
University College London
Gower Street, London WC1E 6BT, U.K.

E-Mail: alex@phonetics.ucl.ac.uk
http://www.phon.ucl.ac.uk/home/alex/home.htm

Tel: 0171 388 4309
0171 387 7050 ext. 3169
Fax: 0171 383 4108
---------------------------------------------