Multilingual Text Tools and Corpora

Existing tools for NLP and MT corpus-based research are typically embedded in large, non-adaptable systems which are fundamentally incompatible. Little effort has been made to develop software standards, and software reusability is virtually non-existent. As a result, there is a serious lack of generally usable, widely available tools for manipulating and analyzing text corpora in research, especially for multi-lingual applications.

At the same time, the availability of data is hampered by a lack of well-established standards for encoding corpora. Although the TEI has provided guidelines for text encoding, they are so far largely untested on real-scale data, especially multi-lingual data. Further, the TEI guidelines offer a broad range of text encoding solutions serving a variety of disciplines and applications, and are not intended to provide specific guidance for the purposes of NLP and MT corpus-based research.

MULTEXT proposes to tackle both of these problems. First, MULTEXT will work toward establishing a software standard, which we see as an essential step toward reusability, and publish the standard to enable future development by others. Second, MULTEXT will test and extend the TEI standards on real-size data, and ultimately develop TEI-based encoding conventions specifically suited to multi-lingual corpora and the needs of NLP and MT corpus-based research. These efforts will be accomplished in close collaboration with the relevant EAGLES sub-groups, and the results will serve as input to the EAGLES effort to establish operational standards to be adopted by ongoing and future European corpus projects.


Many MULTEXT applications will require the ability to perform various kinds of analysis on word tokens. For example, in some cases it will be necessary to abstract away from inflectional variation, so that e.g. walk, walks, walking, and walked are all treated as the same word type at the level of textual annotations. Conversely, it will sometimes be desirable to make use of richer information than that available in the raw text, so that e.g. walking can be identified as the present participle of 'walk'. In addition, it is easy to envisage a need for flexibility in the triangular relation between word tokens, textual annotations and lexical information; a single fixed linguistic analysis cannot fulfil the requirements of diverse text processing tasks. Mmorph provides the means by which lexicons can be constructed and modified, and texts annotated with lexical information.

Very generally, the program operates by relating the form of a word as found in text to an entry in a lexical database containing arbitrary information expressed in terms of attributes and values. Various modes of interaction with mmorph exist, depending on whether the user is developing, compiling, or exploiting a description. The lexical database is created from a set of initial lexical entries and a set of structural rules.
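The general idea of expanding initial lexical entries with structural rules, then relating word forms found in text to attribute-value entries, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the rule format, the attribute names, and the functions build_lexicon, lookup and lemmas are hypothetical and are not mmorph's own rule syntax or interface.

```python
# Toy illustration only: not mmorph's rule formalism or API.
# Hypothetical structural rules: each pairs a suffix with the
# attribute-value information contributed by that suffix.
RULES = [
    ("", {"cat": "verb", "form": "base"}),
    ("s", {"cat": "verb", "tense": "present", "person": "3", "number": "sg"}),
    ("ing", {"cat": "verb", "form": "present_participle"}),
    ("ed", {"cat": "verb", "tense": "past"}),
]


def build_lexicon(stems, rules=RULES):
    """Expand initial entries (stems) with every rule into a form -> entries map."""
    lexicon = {}
    for stem in stems:
        for suffix, features in rules:
            form = stem + suffix
            entry = {"lemma": stem, **features}
            lexicon.setdefault(form, []).append(entry)
    return lexicon


def lookup(lexicon, token):
    """Return all lexical entries for a word form as found in text."""
    return lexicon.get(token.lower(), [])


def lemmas(lexicon, token):
    """Abstract away from inflection: return the lemma(s) of a word form."""
    return sorted({entry["lemma"] for entry in lookup(lexicon, token)})


lexicon = build_lexicon(["walk", "talk"])
print(lookup(lexicon, "walking"))
print([lemmas(lexicon, w) for w in ["walk", "walks", "walking", "walked"]])
```

In this sketch the same lookup table serves both needs described above: lemmas collapses inflectional variants onto one word type, while lookup exposes the richer attribute-value information for a given form.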

MMorph (version 2.3.4)

POS Tagging

A set of tools for multilingual part-of-speech tagging based on a Hidden Markov Model.
Basic technology that has proven useful for monolingual processing tasks is adapted and extended to accommodate a range of natural languages. Emphasis is placed on facilities for experimenting with different tagsets and for helping the user evaluate and modify the results. Aside from the tagger, the tools include modules to prepare the text for training and tagging, define new tagsets or modify existing ones, declare linguistic preferences, and train or retrain with hand-annotated data, as well as facilities to compare results with hand-corrected data. These tools provide a flexible environment for experimenting with a range of tagsets in different languages.
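For readers unfamiliar with Hidden Markov Model tagging, the decoding step can be sketched as follows. This is a generic Viterbi decoder over a bigram HMM, not the implementation used in mtag or TATOO; the tagset, the probability tables and the function name viterbi are made up for illustration, and the probabilities are hand-set rather than trained.

```python
import math


def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for tokens under a bigram HMM.

    Missing or zero probabilities are replaced by a small floor value
    (an assumption of this sketch, standing in for proper smoothing).
    """
    if not tokens:
        return []

    def lp(p):
        return math.log(p if p > 0 else floor)

    # best[i][t] = (log score of the best path ending in tag t at i, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(tokens[0], 0)), None)
             for t in tags}]
    for tok in tokens[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + lp(emit_p[t].get(tok, 0)), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Hand-set toy model (hypothetical values, not trained on any corpus).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.4, "walk": 0.3, "walks": 0.3},
    "VERB": {"walks": 0.5, "walk": 0.4, "dog": 0.1},
}

print(viterbi(["the", "dog", "walks"], TAGS, START, TRANS, EMIT))
```

Because the tagset and the probability tables are plain data passed to the decoder, swapping in a different tagset or retraining the tables does not require changing the decoding code, which is the kind of flexibility the tools described above aim for.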

mtag - The MULTEXT version of the tagger (no longer available)

TATOO - The ISSCO TAgger TOOl (version 3.00)
