Multilingual Text Tools and Corpora
Existing tools for NLP and MT corpus-based research are typically
embedded in large, non-adaptable systems which are fundamentally
incompatible. Little effort has been made to develop software
standards, and software reusability is virtually non-existent. As a
result, there is a serious lack of generally usable tools to
manipulate and analyze text corpora that are widely available for
research, especially for multi-lingual applications.
At the same time, the availability of data is hampered by a lack of
well-established standards for encoding corpora. Although the TEI has
provided guidelines for text encoding, they are so far largely
untested on real-scale data, especially multi-lingual data. Further,
the TEI guidelines offer a broad range of text encoding solutions
serving a variety of disciplines and applications, and are not
intended to provide specific guidance for the purposes of NLP and MT
corpus-based research.
MULTEXT proposes to tackle both of these problems. First, MULTEXT will
work toward establishing a software standard, which we see as an
essential step toward reusability, and publish the standard to enable
future development by others. Second, MULTEXT will test and extend the
TEI standards on real-size data, and ultimately develop TEI-based
encoding conventions specifically suited to multi-lingual corpora and
the needs of NLP and MT corpus-based research. These efforts will be
accomplished in close collaboration with the relevant EAGLES sub-
groups, and the results will serve as input to the EAGLES effort to
establish operational standards to be adopted by ongoing and future
European corpus projects.
Many MULTEXT applications will
require the ability to perform various kinds of analysis on word
tokens. For example, in some cases it will be necessary to abstract
away from inflectional variation, so that e.g. walk, walks, walking,
and walked are all treated as the same word type at the level of
textual annotations. Conversely, it will sometimes be desirable to
make use of richer information than that available in the raw text, so
that e.g. walking can be identified as the present participle of
`walk'. In addition, it is easy to envisage a need for flexibility in
the triangular relation between word-token, textual annotations and
lexical information; a single fixed linguistic analysis cannot fulfil
the requirements of diverse text processing tasks. Mmorph provides the
means by which lexicons can be constructed and modified, and texts
annotated with lexical information.
Very generally, the program operates by relating the form of a word as
found in text to an entry in a lexical database containing arbitrary
information expressed in terms of attributes and values.
Various modes of interaction with mmorph exist, depending on
whether the user is developing, compiling, or exploiting a description.
The lexical database is created from a set of initial
lexical entries and a set of structural rules.
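The relation between word forms, lexical entries, and structural rules described above can be illustrated with a small sketch. It is not mmorph's rule formalism or file format: the entry layout, the suffix rules, and the names `build_database`, `analyze`, and `word_type` are all hypothetical, chosen only to show how a database of attribute-value analyses might be generated from initial entries plus rules and then consulted for a word found in text.

```python
from collections import defaultdict

# Hypothetical initial lexical entries: a stem plus attribute-value pairs.
LEXICAL_ENTRIES = [
    {"stem": "walk", "attrs": {"cat": "verb"}},
    {"stem": "talk", "attrs": {"cat": "verb"}},
]

# Hypothetical structural rules: a suffix to attach, a condition on the
# entry's attributes, and the attribute values the rule contributes.
SUFFIX_RULES = [
    {"suffix": "", "applies_to": {"cat": "verb"}, "adds": {"form": "base"}},
    {"suffix": "s", "applies_to": {"cat": "verb"}, "adds": {"form": "pres3sg"}},
    {"suffix": "ing", "applies_to": {"cat": "verb"}, "adds": {"form": "prespart"}},
    {"suffix": "ed", "applies_to": {"cat": "verb"}, "adds": {"form": "past"}},
]


def build_database(entries, rules):
    """Expand every entry with every applicable rule into a form -> analyses map."""
    database = defaultdict(list)
    for entry in entries:
        for rule in rules:
            applicable = all(
                entry["attrs"].get(attr) == value
                for attr, value in rule["applies_to"].items()
            )
            if applicable:
                form = entry["stem"] + rule["suffix"]
                analysis = {"lemma": entry["stem"], **entry["attrs"], **rule["adds"]}
                database[form].append(analysis)
    return dict(database)


def analyze(word, database):
    """Return all attribute-value analyses recorded for a word form."""
    return database.get(word.lower(), [])


def word_type(word, database):
    """Abstract away from inflection: map a token to its lemma if one is known."""
    analyses = analyze(word, database)
    return analyses[0]["lemma"] if analyses else word.lower()


DATABASE = build_database(LEXICAL_ENTRIES, SUFFIX_RULES)
```

With this toy database, `word_type` maps walk, walks, walking, and walked to the same type, while `analyze("walking", DATABASE)` returns the richer analysis that marks the form as a present participle. Which of the two views a task uses is left to the caller, which is one simple way to keep the relation between token, annotation, and lexical information flexible.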
- MMorph (version 2.3.4)
A set of tools for multilingual
part-of-speech tagging based on a Hidden Markov Model.
Basic technology that has proven useful for monolingual
processing tasks is adapted and extended to accommodate a
range of natural languages. Emphasis is placed on
facilities for experimenting with different tagsets and
aiding the user to evaluate and modify the results. Aside
from the tagger, the tools include modules to prepare the
text for training and tagging, define new tagsets or
modify existing sets, declare linguistic preferences,
train or retrain with hand-annotated data, and facilities
to compare results with hand-corrected data. These tools
provide a flexible environment to experiment with a range
of tagsets in different languages.
- mtag - The MULTEXT version of the tagger (no longer available)
- TATOO - The ISSCO TAgger TOOl (version 3.00)
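As an illustration of the general technique behind the tools listed above, the following is a minimal sketch of Hidden Markov Model tagging: counts are estimated from hand-annotated sentences, the most probable tag sequence is found with Viterbi search, and the result is compared with hand-corrected tags. It does not reproduce mtag or TATOO. The function names, the add-one smoothing, and the three-sentence training set are assumptions made for the example only.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # pseudo-tag marking the beginning of a sentence


def train(tagged_sentences):
    """Collect transition and emission counts from (word, tag) sentences."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    vocabulary = set()
    for sentence in tagged_sentences:
        previous = START
        for word, tag_ in sentence:
            transitions[previous][tag_] += 1
            emissions[tag_][word.lower()] += 1
            vocabulary.add(word.lower())
            previous = tag_
    return {
        "transitions": transitions,
        "emissions": emissions,
        "tags": sorted(emissions),
        "vocab_size": len(vocabulary) + 1,  # one extra slot for unknown words
    }


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumption of this sketch)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def tag(words, model):
    """Return the most probable tag sequence for words using Viterbi search."""
    if not words:
        return []
    tags = model["tags"]

    def emit(t, w):
        return log_prob(model["emissions"][t], w.lower(), model["vocab_size"])

    def trans(p, t):
        return log_prob(model["transitions"][p], t, len(tags))

    # Each column maps a tag to (best log score ending in that tag, backpointer).
    best = [{t: (trans(START, t) + emit(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + trans(p, t), p) for p in tags)
            column[t] = (score + emit(t, word), prev)
        best.append(column)

    # Recover the path by following backpointers from the best final tag.
    current = max(tags, key=lambda t: best[-1][t][0])
    path = [current]
    for column in reversed(best[1:]):
        current = column[current][1]
        path.append(current)
    return path[::-1]


def accuracy(predicted, gold):
    """Fraction of positions where predicted tags equal hand-corrected tags."""
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold) if gold else 0.0


# Toy hand-annotated data with a made-up three-tag tagset.
TRAINING = [
    [("the", "DET"), ("dog", "NOUN"), ("walks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("walks", "VERB")],
]
MODEL = train(TRAINING)
```

Calling `tag(["the", "cat", "walks"], MODEL)` returns a tag list that can be passed to `accuracy` together with the hand-corrected tags. Because the tagset is simply whatever labels appear in the training data, swapping in a different tagset or language only requires different annotated sentences, which is the kind of experimentation the tools are meant to support.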