Corpora: Help please - downloading text from the Web

From: Geoff Wilkins (geoffw@cobuild.collins.co.uk)
Date: Thu Mar 23 2000 - 12:34:28 MET

Next in thread: Knut Hofland: "Re: Corpora: Help please - downloading text from the Web"
Reply: Knut Hofland: "Re: Corpora: Help please - downloading text from the Web"
Reply: Christian Coseru: "Re: Corpora: Help please - downloading text from the Web"
Reply: Andrew Harley: "Re: Corpora: Help please - downloading text from the Web"
Reply: Mark Lewellen: "RE: Corpora: Help please - downloading text from the Web"

Hi. Can anyone help me with the following:

I'm looking for software - preferably freeware or shareware - to
use to download text from Web sites, for use in a corpus.

This will be from large sites, with a lot of files, sub-directories
and internal links. At its simplest, the software would just download
HTML files from the site, following internal links from the home page.
I've tried various "bots" that do this, but have had problems with all
of them. So I'd welcome recommendations for software that others have
found reliable (and powerful and full-featured) for this purpose.
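
For illustration only, the basic behaviour described above (start at the home page, follow internal links, keep the HTML) might be sketched in Python with the standard library. This is a rough sketch under stated assumptions, not any tool recommended in this thread: the names `LinkParser`, `fetch_html` and `crawl`, the `max_pages` limit, and the rule that "internal" means "same host as the home page" are all hypothetical choices made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download one page; return its text, or None if it is not HTML."""
    with urlopen(url, timeout=30) as response:
        if response.headers.get_content_type() != "text/html":
            return None
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(home_url, fetch=fetch_html, max_pages=100):
    """Breadth-first crawl from home_url, staying on the same host.

    Returns a dict mapping each visited URL to its HTML text.
    `max_pages` is an assumed safety limit, not a requirement.
    """
    host = urlparse(home_url).netloc
    queue = deque([home_url])
    seen = {home_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # unreachable page or HTTP error: skip it
        if html is None:
            continue  # not an HTML document
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl("http://www.example.com/")` (a placeholder address) would return the pages it reached, ready to be written to disk. A crawler used on a real site should also honour the site's robots.txt (Python's `urllib.robotparser` can read it) and pause between requests; both are left out here to keep the sketch short.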

And if anyone knows of packages that are more specifically aimed at the
task I'm undertaking, that would be even better.

Also useful would be software that mapped out the structure of sites, giving
an idea of the size of the files.
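
As a rough illustration of that kind of site map, the sketch below turns a set of already-downloaded pages into an indented outline of directories and files with their sizes. It is only a sketch under assumptions: the function name `site_outline`, the input format (a dict mapping each URL to its page text), and the use of `index.html` as the name of a directory's default page are all hypothetical, and the size shown is that of the text re-encoded as UTF-8, which may differ from the size of the file on the server.

```python
from urllib.parse import urlparse


def site_outline(pages):
    """Return indented text lines showing directories, files and sizes.

    `pages` maps each URL to the page text already downloaded.
    """
    tree = {}
    for url, text in pages.items():
        path = urlparse(url).path
        if path == "" or path.endswith("/"):
            path += "index.html"  # assumed name for a directory's default page
        parts = path.strip("/").split("/")
        node = tree
        for directory in parts[:-1]:
            # Directory keys end in "/" so they never clash with file names.
            node = node.setdefault(directory + "/", {})
        node[parts[-1]] = len(text.encode("utf-8"))

    lines = []

    def walk(node, depth):
        for name in sorted(node):
            entry = node[name]
            if isinstance(entry, dict):
                lines.append("  " * depth + name)
                walk(entry, depth + 1)
            else:
                lines.append("  " * depth + f"{name} ({entry} bytes)")

    walk(tree, 0)
    return lines
```

Printing the result with `print("\n".join(site_outline(pages)))` gives one line per directory or file, indented by depth, which is enough to judge how a site is laid out and where the bulk of the text sits.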

Geoff Wilkins

This archive was generated by hypermail 2b29 : Mon Mar 27 2000 - 00:05:45 MET DST