Re: [Corpora-List] Query on the use of Google for corpus research

From: Mark P. Line (mark@polymathix.com)
Date: Tue May 31 2005 - 19:08:10 MET DST

    Marco Baroni said:
    >> > How do you deal with spider traps?
    >>
    >> Why would spider traps be a concern (apart from knowing to give up on
    >> the site if my IP address has been blocked by their spider trap) when
    >> all I'm doing is constructing a sample of text data from the Web?
    >
    > First of all, your crawler has to understand that it fell into a trap.
    > Second, some spider traps generate dynamic pages containing random text
    > for you to follow -- now, that's a problem if you're trying to build a
    > linguistic corpus, isn't it?

    It's not much of a problem unless you presuppose that a corpus linguist
    would have difficulty finding a way to distinguish between a valid text in
    her target language and a random text generated by a spider trap.
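    (To make that concrete: one cheap filter, sketched below, is to check
    the density of high-frequency function words in the target language.
    This is just an illustrative sketch of my own -- the word list, the
    50-token minimum, and the 0.15 threshold are all assumptions, not
    anything from Marco's setup -- but random spider-trap text scores
    near zero on it, while genuine English scores well above.)

        # Illustrative sketch: reject pages whose function-word density
        # is too low to be plausible English. Thresholds are guesses.
        ENGLISH_FUNCTION_WORDS = {
            "the", "of", "and", "to", "a", "in", "that", "is", "was",
            "it", "for", "on", "with", "as", "at", "by", "this", "be",
        }

        def looks_like_english(text, min_ratio=0.15):
            """Accept text only if enough of its tokens are common
            English function words; random gibberish scores near zero."""
            tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
            tokens = [t for t in tokens if t]
            if len(tokens) < 50:   # too short to judge reliably
                return False
            hits = sum(1 for t in tokens if t in ENGLISH_FUNCTION_WORDS)
            return hits / len(tokens) >= min_ratio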

    > Incidentally, a "spider trap" query on google returns many more results
    > about crawlers, robots.txt files etc. than about how to capture
    > eight-legged arachnids... one good example of how one should be careful
    > when using the web as a way to gather knowledge about the world...

    I believe there's a huge difference between using the web as a way to
    gather knowledge about the world (especially if this is being done
    automatically) and using the web as a way to populate a corpus for
    linguistic research. The latter use is much less ambitious, and simply
    doesn't need to be weighed down by most of the concerns that web-mining
    and indexing applications face.

    Most corpus linguists who are constructing a dataset on the fly are just
    interested in being able to track their samples back to the underlying
    population, and are usually willing to add or change samples indefinitely
    until their corpus has the characteristics they need. If web-served HTML
    and plaintext are adequate to support their research questions, then a
    simple web crawler will work just fine.
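
    For what it's worth, here is the kind of minimal crawler I mean, as a
    Python sketch. The breadth-first queue, the page cap (which also keeps
    a crawler trap from eating the whole run), the politeness delay, and
    the lack of robots.txt handling are all illustrative simplifications
    on my part, not a recipe for production use.

        import time
        from collections import deque
        from html.parser import HTMLParser
        from urllib.parse import urljoin
        from urllib.request import urlopen

        class LinkParser(HTMLParser):
            """Collect href values from anchor tags."""
            def __init__(self):
                super().__init__()
                self.links = []

            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    for name, value in attrs:
                        if name == "href" and value:
                            self.links.append(value)

        def crawl(seed, max_pages=100, delay=1.0):
            """Breadth-first crawl from a seed URL, returning (url, html)
            pairs. The max_pages cap also bounds runaway trap crawls."""
            seen, queue, pages = {seed}, deque([seed]), []
            while queue and len(pages) < max_pages:
                url = queue.popleft()
                try:
                    with urlopen(url, timeout=10) as resp:
                        html = resp.read().decode("utf-8", errors="replace")
                except Exception:
                    continue          # dead link, timeout, blocked, etc.
                pages.append((url, html))
                parser = LinkParser()
                parser.feed(html)
                for href in parser.links:
                    absolute = urljoin(url, href)
                    if absolute.startswith("http") and absolute not in seen:
                        seen.add(absolute)
                        queue.append(absolute)
                time.sleep(delay)     # be polite to the server
            return pages

    A filter like the looks_like_english() sketch above plugs in naturally
    right after each page is fetched, so trap-generated gibberish never
    makes it into the corpus in the first place.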

    -- Mark

    Mark P. Line
    Polymathix
    San Antonio, TX


