RE: [Corpora-List] Developing and testing new similarity measures for word clustering

From: Adam Kilgarriff (adam@lexmasterclass.com)
Date: Sun Oct 10 2004 - 11:15:10 MET DST


    Normand,

    There's a growing literature on thesaurus evaluation, which has seen
    really interesting recent developments. It starts from thesis work by
    Sparck Jones (1960s), then Hindle (1990), Grefenstette (1994), Lillian
    Lee (eg ACL 99), Dekang Lin (eg COLING 1998), and more recently (eg
    2003-04) James Curran (Edinburgh/Sydney, who did very extensive
    experimentation, evaluating against Roget, WordNet, and other
    human-made thesauruses) and Julie Weeds (Sussex; see also her work with
    David Weir, which presents a nice theoretical analysis of various
    measures in terms of their precision vs recall properties).
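
    To make the idea of evaluating against a human-made thesaurus concrete,
    here is a minimal sketch in Python. It is not the protocol used in any
    of the work cited above; the function name, the toy neighbour list and
    the toy gold-standard entry are all made up for illustration. It scores
    the top-k neighbours proposed by a similarity measure against the
    synonyms a reference thesaurus lists for the same headword.

```python
def precision_recall_at_k(ranked_neighbours, gold_synonyms, k):
    """Score the k highest-ranked neighbours against a gold-standard entry.

    ranked_neighbours: words ordered from most to least similar (hypothetical
                       output of whichever similarity measure is under test).
    gold_synonyms:     set of words the reference thesaurus lists for the
                       same headword.
    """
    top_k = ranked_neighbours[:k]
    hits = sum(1 for word in top_k if word in gold_synonyms)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_synonyms) if gold_synonyms else 0.0
    return precision, recall


# Toy data, invented for illustration only.
neighbours_for_cat = ["dog", "kitten", "car", "feline"]
gold_for_cat = {"kitten", "feline", "tabby"}

p, r = precision_recall_at_k(neighbours_for_cat, gold_for_cat, k=3)
print(f"precision@3={p:.3f} recall@3={r:.3f}")
```

    Averaging such scores over many headwords gives one number per
    similarity measure, which makes measures directly comparable.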

    Sorry if you knew all this already.

    Adam

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Normand Peladeau
    Sent: 08 October 2004 13:47
    To: CORPORA@uib.no
    Subject: [Corpora-List] Developing and testing new similarity measures
    for word clustering

    I have been reviewing some of the similarity measures used to perform
    word clustering (Jaccard, Dice, Simple Matching, correlation, etc.) and
    I came to the conclusion that many of those measures had some metric
    problems that probably make them non-optimal for word clustering.
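
    For reference, three of the measures named above can be sketched as
    follows, using their standard textbook definitions on binary data. This
    is only an illustrative sketch: representing each word as the set of
    documents it occurs in, and the toy words and document ids, are
    assumptions made here, not part of the original question.

```python
def jaccard(a, b):
    """|A & B| / |A | B|: shared contexts over all contexts of either word."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a, b):
    """2|A & B| / (|A| + |B|): like Jaccard, but weights matches double."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def simple_matching(a, b, universe):
    """(joint presences + joint absences) / size of the universe.

    Unlike Jaccard and Dice, this also counts contexts where neither
    word occurs as agreement. Assumes a and b are subsets of universe.
    """
    if not universe:
        return 0.0
    joint_presence = len(a & b)
    joint_absence = len(universe - (a | b))
    return (joint_presence + joint_absence) / len(universe)


# Toy data, invented for illustration: documents in which each word occurs.
documents = {"d1", "d2", "d3", "d4", "d5", "d6"}
cat = {"d1", "d2", "d3"}
dog = {"d2", "d3", "d4"}

print(jaccard(cat, dog))                     # 2 shared / 4 in the union
print(dice(cat, dog))                        # 2*2 / (3 + 3)
print(simple_matching(cat, dog, documents))  # (2 + 2) / 6
```

    Note that Simple Matching grows with the number of documents containing
    neither word, so for sparse word-by-document data it tends to rate
    almost every pair as similar.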

    I am working now on some modified versions of those indices and I need
    some ways to benchmark those new similarity measures. I would like to
    have a series of benchmarks for several kinds of application (dimension
    reduction, automatic identification of themes, automatic taxonomy
    development, etc.).

    I would like suggestions for ways to benchmark those new measures and
    compare their performance with the more traditional ones. Any idea,
    reference, data set would be welcome.

    I am also looking for existing articles where those measures have been
    compared (either empirically or theoretically).

    Thanks,

    Normand Peladeau
    Provalis Research


