Corpora: Help please - downloading text from the Web

From: Geoff Wilkins (geoffw@cobuild.collins.co.uk)
Date: Thu Mar 23 2000 - 12:34:28 MET

    Hi. Can anyone help me with the following:

    I'm looking for software - preferably freeware or shareware - for
    downloading text from Web sites, to be included in a corpus.

    The text will come from large sites with many files, sub-directories
    and internal links. At its simplest, the software would just download
    the HTML files from a site by following internal links from the home page.
    I've tried various "bots" that do this, but have had problems with all
    of them. So I'd welcome recommendations for software that others have
    found unproblematic (and powerful/multi-functioned) for this purpose.
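
    To make the basic behaviour concrete, here is a rough Python sketch of
    following internal links from a home page. It is not taken from any
    existing package: the names LinkParser, internal_links, fetch and crawl
    are invented for illustration, and it assumes that pages decode as UTF-8,
    that every linked file is worth fetching, and that errors, robots.txt and
    delays between requests are handled elsewhere.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return absolute, fragment-free URLs in html that share page_url's host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page, then drop any "#fragment".
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def fetch(url):
    """Download one page as text (assumption: UTF-8, no error handling)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(home_url, fetch_page=fetch, max_pages=100):
    """Breadth-first download from home_url; returns a dict of {url: html}."""
    pages = {}
    queue = deque([home_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in pages:
            continue  # already downloaded via another link
        pages[url] = fetch_page(url)
        queue.extend(
            link for link in internal_links(url, pages[url]) if link not in pages
        )
    return pages
```

    Calling crawl() with the home page URL would return the downloaded pages
    keyed by URL; the max_pages limit is there only to stop a sketch like
    this from running away on a large site.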

    And if anyone knows of packages that are more specifically aimed at the
    task I'm undertaking, that would be even better.

    Also useful would be software that maps out the structure of a site and
    gives an idea of the size of its files.

    Geoff Wilkins


