Corpora: Summary: Need for texts to evaluate named entity recognition software in En, Fr, De and Es

From: Ralf Steinberger (ralf.steinberger@jrc.it)
Date: Wed Mar 27 2002 - 19:34:43 MET

    Thanks to Hamish Cunningham, Antonio S. Valderrábanos, Gabriel Pereira
    Lopes, Fabio Ciravegna, Milena Slavcheva and Valerie Mapelli for replying to
    our query, which is repeated at the end of this message.

    In spite of receiving several replies to our query, we did not succeed in
    getting a collection of texts containing named entities that could be used
    for the evaluation of named entity recognition software. Most replies
    pointed to general purpose collections of parallel text, but we needed a
    specific test set of preferably short texts containing many named entities.

    Finally, we compiled a test collection of 19 short parallel documents
    ourselves. This was done by identifying a number of identical strings in
    English texts and their Spanish translations, by hand-picking those strings
    that were likely to be named entities, and by then choosing those short
    texts that contained many of these hand-picked strings. This small test
    corpus is available on request.
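
    As an illustration only, the selection steps described above could be
    sketched roughly as follows. This is not the procedure that was actually
    run: the function names, the capitalised-token heuristic and the
    max_tokens length limit are all assumptions made for the example, and the
    hand-picking step is represented simply by a set passed in by the user.

```python
import re

# Assumption: a "string" is a run of letters, optionally joined by - or '.
TOKEN = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def capitalised_strings(text):
    """Return the set of tokens in `text` that start with an upper-case letter."""
    return {t for t in TOKEN.findall(text) if t[0].isupper()}


def shared_strings(en_text, es_text):
    """Strings that occur identically in a text and its translation.

    Restricting the comparison to capitalised tokens is a heuristic assumed
    here to narrow the candidates; the result still needs hand-picking.
    """
    return capitalised_strings(en_text) & capitalised_strings(es_text)


def rank_documents(pairs, picked, max_tokens=500):
    """Order short documents by how many hand-picked strings they contain.

    pairs      -- iterable of (doc_id, en_text, es_text) tuples
    picked     -- set of strings a human has accepted as likely named entities
    max_tokens -- hypothetical cut-off for what counts as a "short" text
    """
    ranked = []
    for doc_id, en_text, _es_text in pairs:
        tokens = TOKEN.findall(en_text)
        if len(tokens) > max_tokens:
            continue  # skip texts that are too long
        hits = sum(1 for t in tokens if t in picked)
        ranked.append((hits, doc_id))
    # Most hits first; ties broken by document id for a stable order.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _hits, doc_id in ranked]


en = "Kofi Annan visited Madrid in March."
es = "Kofi Annan visitó Madrid en marzo."
candidates = shared_strings(en, es)
print(sorted(candidates))  # ['Annan', 'Kofi', 'Madrid']

docs = [("a", en, es), ("b", "The weather was fine.", "El tiempo era bueno.")]
print(rank_documents(docs, candidates))  # ['a', 'b']
```

    In practice the candidate list would be reviewed by hand before ranking,
    since sentence-initial words and other capitalised non-names also pass
    the filter.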

    Below, you will find the main answers:

    ----------------------------------
    Antonio S. Valderrábanos wrote:

    you could have a look at the following pages:
    MLCC corpora (second part). A parallel collection in nine EC languages
    Description at ELDA: http://www.elda.fr/cata/text/W0023.html
    ECI Corpus. Contains different multilingual text collections; some of them
    are parallel and may contain your language combinations. Description at
    ELDA: http://www.elda.fr/cata/text/W0004.html
    You may want to find a more detailed description than the one at ELDA.
    CRATER
    http://www.elda.fr/cata/text/W0003.html
    A good one, but it doesn't contain German.

    Besides, the UN site (www.un.org) contains large amounts of parallel texts
    (although not in German).

    ----------------------------------
    Gabriel Pereira Lopes wrote:

    Just pick up texts from European legislation in force:
    http://europa.eu.int/eur-lex/

    ----------------------------------
    Fabio Ciravegna wrote:

    Johannes Matiasek at OFAI, Vienna, (john@ai.univie.ac.at) developed a named
    entity recogniser for German as part of the Facile project. He annotated a
    corpus with named entities. Maybe you can contact him and ask for the
    corpus. He is a nice guy.
    ----------------------------------
    Hamish Cunningham wrote:

    we have work just starting on NE in french and german, and a system that
    currently runs in english, bulgarian and romanian. apart from that I don't
    know of anything other than for english (but I think there must be some
    stuff out there...)

    ----------------------------------
    Milena Slavcheva wrote:

    Below you can find information about a parallel German-French corpus. The
    style of the texts is, broadly speaking, administrative and you can find
    plenty of named entities. From the message below (that I forward to you), I
    can see that the corpus is already distributed by ELRA. You can also turn
    for information to Prof. Wolfgang Teubert at teubertw@hhs.bham.ac.uk, who was
    the leader of the project producing GeFRePaC at the Institut fuer deutsche
    Sprache (IDS) in Mannheim.

    ----------------------------------
    Valerie Mapelli wrote:

    I would like to refer you to the ELDA catalogue, where you may find corpora
    of interest. Our web site: http://www.elda.fr
    Please do not hesitate to contact me for any further information.

     -----Original Message-----
    From: Ralf Steinberger [mailto:ralf.steinberger@jrc.it]
    Sent: Monday, March 18, 2002 5:15 PM
    To: CORPORA@HD.UIB.NO
    Cc: 'Ralf STEINBERGER (JRC)'
    Subject: Need for texts to evaluate named entity recognition software in En,
    Fr, De and Es

    Hello,

    we are looking for texts containing many named entities such as people's
    names, company names, names of organisations/authorities and geographical
    places in the languages English, French, German and Spanish.

    The texts will be used for the evaluation of named entity recognition
    software. Parallel texts (texts and their translations) would be preferred
    as they would make the evaluation easier. It is not strictly necessary that
    the named entities be marked up in the text.

    The evaluation will be carried out by a student, who is writing her Master's
    thesis on this subject, in collaboration with the EC's Joint Research
    Centre. The thesis will be made publicly available.

    Any hints are welcome. Thanks in advance.

    Ralf Steinberger (ralf.steinberger@jrc.it)
    European Commission, Joint Research Centre (http://www.jrc.it/langtech/)
    Institute for the Protection and Security of the Citizen (IPSC)


