In trying to build frequency lists for bigrams and trigrams, I find
that once the corpus size hits 100,000 words I run out of memory on
my computer. While I might be able to tweak my program and get that
number up to 200,000 or maybe 500,000 (I doubt it), I think the
system limitations here will prevent me from producing bigram and
trigram counts for a 1,000,000-word corpus.
So... if someone with much greater computing resources than I have
has produced bigram and trigram frequency lists, I'd love to hear
about it. It would be ideal if such counts were available for the
ACL/DCI WSJ corpus, as that is the corpus I've been working with.
Regards,
Ted
--
* Ted Pedersen                          pedersen@seas.smu.edu *
* http://www.seas.smu.edu/~pedersen/                           *
* Department of Computer Science and Engineering,              *
* Southern Methodist University, Dallas, TX 75275 (214) 768-3712 *