Put each object you want to sample (e.g., a sentence, a paragraph,
or an identifier for one) on a line of its own. (This will be the
difficult bit, but it depends entirely on the format/markup of the
corpus and the types of units you want to sample, so it's not possible
to give general help.)
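Purely as an illustration of that first step, here is a minimal sketch for one assumed case: a plain-text corpus whose paragraphs are separated by blank lines. That format is an assumption, and the file names corpus.txt and infile and the sample text are made up for the example; your markup may well need something quite different.

```shell
# Build a tiny example corpus (made-up content) with blank-line-separated paragraphs.
printf 'first paragraph,\nline two\n\nsecond paragraph\n\n\nthird paragraph,\nmore text\n' > corpus.txt

# RS = "" puts awk in paragraph mode: each run of non-blank lines is one record.
# Replace the newlines inside each record with spaces, then print one record per line.
awk 'BEGIN { RS = "" } { gsub(/\n/, " "); print }' corpus.txt > infile
```

After this, infile holds one paragraph per line and is ready for the shuffling step.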
Then, in Unix (mks-awk, gawk or nawk will do this, though the basic-grade awk
on my system won't; all Unixes come armed with nawk, I think):
gawk '{print rand(),$0}' infile | sort | gawk '{sub($1 " ", "");print}' > randomfile
and the file, sorted on the random numbers, is now randomly ordered, so, e.g.,
head -50 randomfile > outfile2
gives you a random sample of size 50, and
head -100 randomfile | tail -50 > outfile3
gives you another, with no lines in common with the first.
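The whole recipe can be sketched step by step as below. This is a sketch under stated assumptions, not a replacement for the commands above: it assumes the plain `awk` on your system is a POSIX awk with rand() and srand(); it calls srand() because gawk, at least, otherwise starts rand() from the same seed on every run; and it swaps the sub() call for a tab separator plus cut, which is just one alternative way to strip the key. The 200-line input is invented for the example.

```shell
# Example input (made up): 200 lines, one "unit" per line.
awk 'BEGIN { for (i = 1; i <= 200; i++) print "unit " i }' > infile

# 1. Decorate: prefix every line with a random key and a tab.
#    srand() seeds the generator from the clock, so each run gives a new order.
# 2. Sort on the key, which shuffles the lines.
# 3. Undecorate: cut away the key, keeping everything after the first tab.
awk 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' infile |
  sort |
  cut -f2- > randomfile

# Two samples of 50 that cannot overlap: lines 1-50 and lines 51-100.
head -n 50 randomfile > outfile2
sed -n '51,100p' randomfile > outfile3
```

If you want the same order again later, a fixed seed such as srand(42) in place of srand() should make the shuffle repeatable with the same awk.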
adam
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff                              tel: (44) 1273 642919
Research Fellow                                   (44) 1273 642900
Information Technology Research Institute    fax: (44) 1273 606653
University of Brighton
Lewes Road                                   email:
Brighton BN2 4AT                             ak28@itri.bton.ac.uk
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%