Corpora: linguistic database

Jörg Tiedemann (joerg@strindberg.ling.uu.se)
Tue, 7 Oct 1997 09:57:05 +0200 (DFT)

Hi there,

At the Department of Linguistics in Uppsala we want to build a linguistic
database for several text corpora, parallel as well as monolingual,
including lexical information.

I would like to get an overview of the standards and standard systems
currently used for storing this kind of data. Maybe somebody can help me
and share some information about their experience with different
systems.

The system should store the corpus information in an efficient way and
tokens should be linked to corresponding lexical information. It should be
possible to execute all kinds of queries to obtain information about the
stored data.
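
To make the requirement concrete, here is a minimal sketch of the kind of
token-to-lexicon linking meant above. It assumes a relational layout and
uses Python's built-in sqlite3 module purely for illustration; the table
names, column names and sample rows are hypothetical and not part of any
existing system of ours.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- One row per lexical entry (assumed minimal fields).
        CREATE TABLE lexicon (
            lex_id INTEGER PRIMARY KEY,
            lemma  TEXT NOT NULL,
            pos    TEXT NOT NULL
        );
        -- One row per running token, pointing at its lexical entry.
        CREATE TABLE token (
            tok_id   INTEGER PRIMARY KEY,
            corpus   TEXT NOT NULL,
            position INTEGER NOT NULL,
            form     TEXT NOT NULL,
            lex_id   INTEGER REFERENCES lexicon(lex_id)
        );
        CREATE INDEX token_lex ON token(lex_id);
    """)
    # Invented sample data, only to exercise the query below.
    conn.executemany(
        "INSERT INTO lexicon VALUES (?, ?, ?)",
        [(1, "house", "NOUN"), (2, "be", "VERB")],
    )
    conn.executemany(
        "INSERT INTO token (corpus, position, form, lex_id) VALUES (?, ?, ?, ?)",
        [("demo", 0, "houses", 1), ("demo", 1, "are", 2), ("demo", 2, "house", 1)],
    )
    conn.commit()
    return conn


def forms_of_lemma(conn, lemma):
    """Return all token forms linked to the given lemma, in corpus order."""
    rows = conn.execute(
        "SELECT t.form FROM token t "
        "JOIN lexicon l ON t.lex_id = l.lex_id "
        "WHERE l.lemma = ? ORDER BY t.position",
        (lemma,),
    )
    return [r[0] for r in rows]
```

A query such as forms_of_lemma(conn, "house") then goes from a lexical
entry back to every token linked to it; other queries (by corpus, by part
of speech) would follow the same join.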

As far as I can tell, corpus data are commonly stored in SGML format. Is
there a better approach that avoids this space-consuming storage
format?

Currently most of our corpus data are stored using a subset of TEI SGML.
Maybe there is a good way to combine this format with data stored in a
"real database system"?

Thanks for any help!

Joerg

***********/\/\/\/\/\/\/\/\/\/\/\************************************
** Joerg Tiedemann e-mail: **
** Sernanders vaeg 5:520 joerg@stp.ling.uu.se **
** 75261 Uppsala tiedeman@sunpool.cs.uni-magdeburg.de **
** SWEDEN http://stp.ling.uu.se/~joerg/ **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********