Susan Armstrong, Pierrette Bouillon,
ISSCO, University of Geneva, Switzerland
The interest in part-of-speech (POS) tagging has increased considerably
over the past decade as witnessed by the numerous reports and programs
now publicly available ([Cutting and al.92], [Elworthy94], [Feldweg95],
[Sanchez95], [Schmid95], [Tzoukermann95], etc.). The focus has been
on attaining a high level of accuracy (at least 95%) with a given tagset
rather than on general purpose tools for different tagsets. These
programs have typically been written for use with one language, often incorporating
a number of language-specific assumptions regarding tokenization and segmentation.
The resources they use, such as abbreviations and lists of words associated
with the tags, are often embedded in the program and difficult to extend
or modify. Moreover, in the organization of the program itself, different
logically distinct subtasks may be collapsed, making it impossible to experiment
with other modules or different processing techniques for a given subtask.
Little attention has been paid to how these programs might be adapted to other languages, tagsets, or applications.
For experimentation with POS tagging, our goal was to develop an open
platform adequate for different users with varied requirements. The
core design principles embedded in the suite of tools can be summarized
as follows. These principles underlie the design choices in the development of the tagging
tools in view of experimentation over a range of languages. As previous
work has demonstrated, the number and choice of tags will vary, not only
across languages but also according to the needs of a given application
-- our concern is to provide the means for this type of experimentation.
The concrete implementation of these principles is exemplified below in
the presentation of the tagging modules.
Language-independence: Multilingual tagging tools require a
clear separation between language resources (lexicons and tagset
definitions) and the programs that use them. All language specific
information must be held independent of the core programs.
Modularity: Clear separation of subtasks and well defined interfaces
between modules allows for the importation of various external resources
in a variety of formats. Modularity also provides the basis
for users to exploit individual tools differently, perhaps replacing
one or more components.
Flexibility: Developing a model for different languages and
experimenting with varying annotation schemes argue for a design
that permits users to adapt the programs and resources according
to their needs. Independently defined language resources can
be imported and new annotation schemes can be defined.
Overview of POS Tagging
The core task of POS tagging (or disambiguation) is to choose the correct
tag for each word in context from a set of possible tags. Based
on a Hidden Markov Model [Rabiner89], this process is accomplished in two
basic steps commonly referred to as the training and tagging phase.
In the training phase, values are calculated for the probabilities of the
occurrence of each tag/word pair in a given context and stored in matrices.
The instantiation of the matrices, setting the values, is known as building
the language model. In the tagging phase, the language model is applied
to select the most probable tag(s) from the proposed set of ambiguous tags
for each word in the text.
The Hidden Markov Model relies on three parameters, commonly referred
to as the A, B and PI matrices. For a tagging application, the A
matrix records the probabilities of the transitions between any two tags
(or states), e.g. the probability that a determiner precedes
a noun. The B matrix records the relation between the occurrence
of a given tag (or state) and the set of ambiguous tags in which it occurs
(the set of ambiguous tags is often referred to as the ambiguity or equivalence
class). The PI matrix records the probability that a tag occurs in
the initial state (i.e. at the beginning of a sentence).
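The three parameters can be pictured with a toy model. The tagset, ambiguity classes and all probability values below are invented for illustration; this is a sketch of the data structures, not of the actual Multext implementation.

```python
# A minimal sketch of the three HMM parameters described above, for a
# hypothetical three-tag tagset (DET, NOUN, VERB).
tags = ["DET", "NOUN", "VERB"]

# A matrix: A[i][j] is the probability that tag j follows tag i.
A = [
    [0.05, 0.90, 0.05],   # after DET, a NOUN is very likely
    [0.10, 0.30, 0.60],
    [0.50, 0.40, 0.10],
]

# B matrix: probability that a tag is seen inside a given ambiguity
# class; here the classes are {DET} and {NOUN, VERB}.
classes = [("DET",), ("NOUN", "VERB")]
B = [
    [1.0, 0.0],   # DET only occurs in the unambiguous {DET} class
    [0.0, 1.0],   # NOUN occurs in the ambiguous class
    [0.0, 1.0],   # and so does VERB
]

# PI vector: probability of each tag in sentence-initial position.
PI = [0.6, 0.3, 0.1]

# every row of A, and PI itself, must define a probability distribution
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A)
assert abs(sum(PI) - 1.0) < 1e-9
```
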
There are a number of prerequisites to building a language model for
the tagging process. The text must be prepared according to the format
specifications of the program. This requires identifying the words and
other relevant tokens and segments in the text, marking sentence boundaries
and inserting the appropriate annotation. Programs for the segmentation
tasks under development in the Multext project are described in [Veronis94].
A set of potential annotations must then be assigned to each token or word.
The mmorph program [Russell and Petitpierre95], also developed as part
of the Multext toolkit, provides the means to define and assign an annotation
to each of the lexical words (and any other token) in the text. Another
necessary resource is the definition of the (initial) tagset in view of
experimenting with and building a language model. In our implementation,
the mapping between the word or token annotation and corpus tag(s) is described
independently of the lexicon to assure the principles of modularity and
flexibility; this separation offers a platform for experimenting with
different tagsets while using the same lexical resources.
During the training phase these resources are used (possibly in iterations)
to recalculate the values in the instantiated matrices to refine and correct
the language model. The resources required are a prepared text (each
word is replaced by its ambiguity class), a list of states, and a set of
instantiated matrices. The program uses the Baum-Welch algorithm
[Baum72] to reestimate the parameters in view of optimizing the model.
Various options to influence or manually readjust the values in the matrices
and thus improve the performance are described below. The tagging
program assigns to each word in the text the most probable tag (or tags);
it is based on the Viterbi algorithm [Viterbi67].
For experimentation on a range of languages and with different
tagsets, a number of utilities have been developed to automatically evaluate
the output of the tagger. The results of tagging output can be compared
with a manually corrected text. A corrected text can also be used
to readjust the parameters in the model. For development work (identifying
where the problems lie) and in view of final evaluation, facilities are
provided to give statistics on the number and type of errors. These
errors are presented according to the incorrect tags as well as according
to the context in which they occurred.
The Tagging Modules
We now turn to a description of the program modules we have developed
for tagging. The presentation is organized according to the four
main phases: text preparation, training, tagging, and evaluation.
We make a fundamental distinction between lexical analysis (tasks concerning
recognition and interpretation of lexical properties independent of context)
and the tagging process (based on the distributional properties of words
as observed in the text). This separation allows for greater flexibility
in the use of different base lexical resources and experimentation with
a range of corpus tagsets. The programs have been designed in a modular
way to facilitate evaluation and inspection of the results and subsequent refinement.
In preparation for tagging, the base resources must be prepared and
properly formatted. Tokenization and the identification of sentence boundaries constitute
the first step.
The format requirement for the program is that the text contains one
word per line, including annotations and sentence boundary markers. Command
line options are provided to specify the fields containing the relevant
information, thus assuring some flexibility in the range of input formats
accepted. The separators between the annotations can be user declared
with command line options. Any line beginning with ``#'' is considered
a comment line and thus ignored. A filter written for mpreptxt
automatically inserts ``#'' markers at the beginning of lines containing
only formatting information (e.g. paragraph and sentence initial
and final tags but no textual words). Other users may wish to comment
out text that might bias the tagger inappropriately, e.g. titles, reduced
list items, etc.
The first stage is the preparation of the data for tests. This means
two main sub-tasks: morphological analysis of the data and
disambiguation of a representative part of the data. They will
be examined successively. The morphological analysis relates to each
word identified by the segmenter all of its possible morphological
analyses. For building a language model it is desirable to provide
a correct and stable set of data. Once a good language model
is established, it is assumed that it will also be able to
handle data containing some mistakes and unknown words. After morphological
analysis, each word is therefore represented by at least one set
of feature structures. Ambiguous words receive as many sets as they
have categories and differing attribute-value pairs. BOS and EOS indicate
the beginning and the end of sentences.
||la\Noun[gen=m num=s]|le\Pron[gen=f num=s per=3]|le\Det[gen=f num=s]
||pouvoir\Verb[mode=ind tns=pres num=s per=3]
||elle\Pron[gen=f num=s per=3]
||=\Pron[gen=m num=s per=3]|=\Det[gen=m num=s]
||temporaire\Adj[gen=m num=pl]|temporaire\Adj[gen=f num=pl]
||son\Det[gen=m!f num=pl per=3]
The sequences are separated by tokens marking beginning and end of
sentences (BOS and EOS). The format required is a record/field format,
one word per line including annotations and sentence boundary tokens.
As exemplified in the first column, the text may contain information not
relevant to the tagging process (such as indexing information).
The user can specify which fields hold the relevant information, thus allowing
some flexibility in the range of input data the program can accommodate.
User defined options are also provided for specifying separators between
fields, and alternative lexical annotations. In this example, fields
are separated by a ``TAB'' and lexical descriptions by ``|''. Any
line beginning with ``#'' is considered a comment and ignored. This
latter facility offers the possibility to comment out text that might bias
the tagger inappropriately, e.g. titles, reduced list items, etc.,
or to insert comments.
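As a sketch, a reader for this record/field format might look as follows. The function name and defaults are hypothetical; only the conventions described above (user-declared separators, ``#'' comment lines, a selectable annotation field) are taken from the text.

```python
def read_tokens(lines, field=1, field_sep="\t", ann_sep="|"):
    """Parse the one-word-per-line record/field format sketched above.

    `field` selects the column holding the lexical annotations (other
    columns may carry indexing information irrelevant to tagging);
    lines beginning with '#' are comments and are skipped.  The
    separators mirror the user-declared command line options.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                       # comment or empty line: ignore
        fields = line.split(field_sep)
        annotations = [a for a in fields[field].split(ann_sep) if a]
        yield annotations                  # one list of annotations per word

# hypothetical input in the style of the French extract
sample = [
    "# formatting-only line, ignored",
    "idx1\tla\\Noun[num=s]|le\\Det[num=s]",
    "idx2\tpouvoir\\Verb[tns=pres]",
]
print(list(read_tokens(sample)))
```
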
This French extract shows examples of ambiguity over major categories.
The first word la, for example, has three analyses as it can be a
noun (the musical note), a pronoun or a determiner. The word de received
only two analyses, one for the determiner and one for the preposition.
An example of a disjunction in the attributes can be found in the coding
of ``gender'' for the words de and ses. An example of the English
output for the same text is given below. As would be expected, the
English morphology makes different distinctions in the attributes and values
assigned to the wordforms.
||can/N[num=m gen=n]|can/|can/V[tns=pres type=m!v]
||=/Det[typ=gen num=pl]|=/Pro[typ=gen per=3 num=pl gen=n]
||official/N[num=pl typ=c gen=m!f]
||be/V[tns=pres num=sg per=2 typ=a]|be/V[tns=pres num=pl typ=a]
In the French version of the corpus, about 40% of the words are ambiguous,
i.e. receive more than one morphological analysis. 10% receive
three tags; 28% two. The more frequent ambiguities (once the lexical annotations
have been mapped to corpus tags) are those shown below.
The most common ambiguity class in French, for example, is due to the
high frequency of de which can be a determiner or a preposition; the second
one is due to the frequency of words like la. In English, the ambiguities
are less frequent: about 27% of the corpus. It is interesting
to see that, contrary to French, the most frequent ambiguities for English are
between noun and verb.
[Table: French Tag Ambiguities]
[Table: English Tag Ambiguities]
Preparing the conversion files
The second task is the preparation of the conversion files, whose aim
is the establishment of a systematic relation between the set of feature
structures (the lexical annotations) and the corpus tags. Specific
words can also be assigned special tags. This conversion is necessary
for three reasons:
Some morphological distinctions might not be useful for the tagger.
The gender of the word, for example, might not help the tagger if it is
often ambiguous and thus does not show distinctive distributional behavior.
Such features can be ignored in the corpus tagset.
Specific words may have a very distinctive distribution that is not reflected
in the morpho-syntactic value assigned to the words. In this
case one might wish to assign special tags to take advantage of the lexical
classes. Examples in English are the auxiliaries be, do and have
and the negation particle in French.
Some wordforms represent ambiguities over major classes, but occur in similar
environments for each of the readings and thus are indistinguishable for
the tagger. In French, the word de (ambiguous as article or preposition)
and its contracted forms du, d', des, is such an example. In English,
the word that often also receives a special treatment (i.e. only
a subset of possible categories are coded).
Classification of words according to these three categories is
a matter for experimentation as other systems have demonstrated.
If accuracy is not a priority, then finer classifications might be desirable,
taking into account that not all solutions will be correct. A given
text type may also display different characteristics that can be exploited
in the tagset or the use of a specialized lexicon may reduce ambiguities
found in general lexica.
This preparation of the conversion file means two things: defining
the corpus tags we want to use and relating the lexical annotations to
these tags. Two separate files are foreseen to accommodate
the issues raised above. The tag conversion file specifies the general
mappings between lexical annotations and corpus tags and the word conversion
file specifies the tags to be assigned to specific words. The word
mappings take precedence over the general tag conversions.
For French, experimentation was done with different sets of tags.
One set was adapted from the Xerox set ([Chanod and Papanainen94])
and another from the AIX set (as supplied for Multext). The
figure below gives examples of the tag and word conversion files
for French (mapping from mmorph output to the Xerox corpus tagset).
Tag Conversion File
Word Conversion File
This sample from a Tag conversion file illustrates a reduction of the
features structures for French adjectives to three tags: ADJ-PL,
ADJ-SG and ADJ-INV. (The latter tag is used for adjectives which have the
same forms for singular and plural). In the Word conversion file
some specific words are assigned special tags, for example de into tags
PREP-DE, DET-SG and DET-PL.
Distinctions made in the morphology can be mapped one-to-one as in
the mapping to corpus tags for adjectives, or collapsed, as in the coding
of prepositions and conjunctions. The formalism also allows the assignment
of more than one corpus tag to a given lexical description.
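The combined effect of the two conversion files can be sketched as a lookup with word-level overrides. The mappings below are invented examples for illustration, not the actual conversion file syntax.

```python
# General annotation-to-tag mappings, in the spirit of the Tag
# conversion file (entries are invented examples).
tag_conv = {
    "Adj[num=pl]": "ADJ-PL",
    "Adj[num=s]": "ADJ-SG",
    "Det[num=s]": "DET-SG",
    "Prep": "PREP",
}

# Word-specific tags, in the spirit of the Word conversion file;
# these take precedence over the general tag conversions.
word_conv = {
    "de": ["PREP-DE", "DET-SG", "DET-PL"],
}

def corpus_tags(word, annotations):
    """Map a word and its lexical annotations to corpus tags."""
    if word in word_conv:                  # word mappings win
        return word_conv[word]
    return [tag_conv[a] for a in annotations if a in tag_conv]

assert corpus_tags("de", ["Det[num=s]", "Prep"]) == ["PREP-DE", "DET-SG", "DET-PL"]
assert corpus_tags("petit", ["Adj[num=s]"]) == ["ADJ-SG"]
```
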
After applying the conversion files to the output of morphological
analysis and lexical look-up, we get a file in the form given below, where
each feature structure has been replaced by the tag given in the conversion
file. The number of tags for one word depends on the number of feature
structures. As expected, la received three tags, DET-SG, NOUN-SG and PRON,
and d' three: DET-SG, DET-PL and PREP-DE.
The text after applying the conversions.
This will be used as an input for the tagger. If the results
are not felt to be adequate (after appropriate training and tuning), the
tagset can be modified to try and overcome persistent errors. Similarly,
a new user of the tools might wish to take advantage of the lexical resources
supplied, while defining a different tagset.
The training module takes as its input sequences of ambiguity classes
as defined in the mapping tables mentioned above and instantiated by the
data preparation program. It uses the Baum-Welch algorithm to produce
a trained Hidden Markov Model [Rabiner89]. Subsequent iterations
attempt to optimize the model parameters for the tagger program.
As introduced above, the values in the three matrices A, B and PI record
the probabilities used in training. Recall, the A matrix records
the transition probabilities from one state (or tag) to the next, the B
matrix records the probability of the relation between tags and classes
and the PI matrix records the probability of a given initial state (i.e.
for a given tag to occur at the beginning of a sentence). This method
of training does not require a pre-tagged corpus, though a small set of
hand corrected data can help readjust the parameters significantly.
This is confirmed by the experiments reported in [Feldweg95] using
the Xerox tagger: ``the performance of the resulting HMM is very
poor if no initial biases are used to help the training process find suitable [...]''.
The use of word equivalence classes was first introduced for POS tagging
in [Kupiec92] and is employed in the Xerox tagger described in [Cutting
and al.92]. This method simplifies the model by generalizing over classes
of words displaying the same set of ambiguous tags (instead of considering
the set of tags assigned to each individual word as a unique class), thus
reducing complexity and improving efficiency. In iterative passes the
parameters are reestimated and can be influenced by the user in a number of ways.
Facilities for instantiating the matrices are provided and can take
into account the observed frequencies over a text corpus. The program
to instantiate the transition matrix uses the following measure to assign initial values:
For a given word annotated with tags <S1,..,Sm> followed by a word
annotated with tags <T1,..,Tn> the probability of the sequence of the
pair of tags (Si;Tj) is calculated as
P(Si; Tj) = 1 / (n * m)
averaged over all occurrences of the pair (Si;Tj) in the entire text.
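A minimal sketch of this measure, assuming the text is represented as sentences of ambiguity classes (tuples of candidate tags); the function name is hypothetical.

```python
from collections import defaultdict

def init_transitions(sentences):
    """Instantiate transition weights from ambiguity classes, following
    the measure above: a word with m candidate tags followed by a word
    with n candidate tags contributes 1/(n*m) to each pair (Si, Tj);
    each pair's weight is then averaged over its occurrences."""
    totals = defaultdict(float)
    occurrences = defaultdict(int)
    for sentence in sentences:             # each sentence: list of tag tuples
        for left, right in zip(sentence, sentence[1:]):
            m, n = len(left), len(right)
            for s in left:
                for t in right:
                    totals[(s, t)] += 1.0 / (n * m)
                    occurrences[(s, t)] += 1
    return {pair: totals[pair] / occurrences[pair] for pair in totals}

# an unambiguous DET followed by a NOUN/VERB-ambiguous word
weights = init_transitions([[("DET",), ("NOUN", "VERB")]])
print(weights[("DET", "NOUN")])   # → 0.5
```
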
A pretagged corpus can be used to instantiate the matrices or applied
during training to adjust the model. This facility offers the means
to adapt a general language model to a specific (perhaps idiosyncratic)
set of texts.
Hand correcting a small portion of the training data can help readjust
the model to the new data.
The source of the training is the non-disambiguated text given above.
As introduced in the overview, an initial set of matrices and a list of
biases can also be given as input. The biases are typically written
after some experimentation and evaluation of tagging errors.
Examples of the biases written for French are given below. These biases
state user-defined preferences on transitions between tags. In this
example, a few of the possible transitions are strongly disfavored, i.e.
there is a very low probability that the category of the word following
a determiner (DET-SG and DET-PL) will be any type of verb (third person
singular: VERB-P3SG, third person plural: VERB-P3PL and second/first
person singular and plural: VERB-P1P2).
The values of the biases (ranging from +10 to -10) are used to recalculate
the values in the matrices A and PI. The last line illustrates the preference
for a sentence (the initial state) to begin with a determiner.
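How the bias values are turned into probabilities is not specified here, so the following sketch makes an assumption: each bias exponentially scales the corresponding transition probability, after which rows are renormalized. Only the observable effect (a bias of -10 makes a transition very unlikely) is taken from the text; the scaling rule and function name are invented.

```python
import math

def apply_biases(A, tag_index, biases):
    """Hedged stand-in for the bias recalculation: scale each biased
    transition probability by exp(bias), then renormalize each row so
    it remains a probability distribution.  A strong negative bias
    (e.g. -10) drives the transition probability toward zero."""
    A = [row[:] for row in A]              # work on a copy
    for (src, dst), value in biases.items():
        A[tag_index[src]][tag_index[dst]] *= math.exp(value)
    return [[p / sum(row) for p in row] for row in A]

# toy model: three tags, uniform transitions (invented numbers)
tags = {"DET-SG": 0, "VERB-P3SG": 1, "NOUN-SG": 2}
A = [[1 / 3] * 3 for _ in range(3)]
A2 = apply_biases(A, tags, {("DET-SG", "VERB-P3SG"): -10})
assert A2[0][1] < 0.001                    # transition now strongly disfavored
assert abs(sum(A2[0]) - 1.0) < 1e-9        # row still sums to 1
```
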
The core program for the training phase, mtrain, uses the files prepared
by mpreptxt and attempts to optimize the model parameters to be used by
the tagging program.
The number of iterations the mtrain
program should attempt is given with the ``-l'' option -- each iteration
implies a reestimation of the parameters in an attempt to optimize the
model. However, it is well-known that increasing the number of loops
will not necessarily lead to better results. Note that the results
of applying the model to a text can only be inspected by running the tagging
program which uses these parameters to calculate the most probable tag
sequences. According to the results (number and types of errors), a number
of options are available to readjust the values and retrain the model.
A useful facility provided as a command line option during training is
to take user defined ``transition biases'' into account.
The mbiases program takes a list
of biases and computes a new set of values for the matrices A and PI.
The tagger mtag uses the matrices
created by the training module to calculate the most probable sequence
of tags for a given text using the Viterbi algorithm [Viterbi67].
The output of this program is a human-readable version of the text with
one or more tags assigned to each word.
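The decoding step can be illustrated with a toy Viterbi implementation. The model below (two tags, two ambiguity classes) is invented for illustration; mtag's actual data formats differ.

```python
def viterbi(obs, A, B, PI):
    """Most probable tag sequence for a list of ambiguity-class indices
    `obs`, given the A, B and PI parameters -- a toy stand-in for the
    decoding step performed by mtag."""
    n = len(PI)
    delta = [PI[i] * B[i][obs[0]] for i in range(n)]   # initial scores
    back = []                                          # backpointers
    for o in obs[1:]:
        prev, step, delta = delta, [], []
        for j in range(n):
            scores = [prev[i] * A[i][j] * B[j][o] for i in range(n)]
            best = max(range(n), key=lambda i: scores[i])
            step.append(best)
            delta.append(scores[best])
        back.append(step)
    # follow the backpointers from the best final state
    path = [max(range(n), key=lambda j: delta[j])]
    for step in reversed(back):
        path.append(step[path[-1]])
    return path[::-1]

# toy model: two tags, two ambiguity classes (invented numbers)
A = [[0.1, 0.9], [0.8, 0.2]]
B = [[1.0, 0.0], [0.0, 1.0]]
PI = [0.5, 0.5]
print(viterbi([0, 1, 0], A, B, PI))   # → [0, 1, 0]
```
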
As for the mtrain program, the tagger
assumes that the text and associated data files have been prepared by the
mpreptxt module and that the compiled
matrices file has been instantiated by the mtrain
In the general case, mtag would simply
be used to tag a text. This case, however, relies on a well-performing
and stable language model. The tools presented here focus on the
experimentation and development phase in building just such a model.
In an attempt to move beyond current practices of a tagger with one model
for one language, we have focussed on providing a flexible environment
for different languages and users who may wish to adjust the model for
a given application. The latter options listed above provide the
basis for experimenting and modifying the results.
The precision factor (-P option) allows the solution set to be extended
to more than one tag. This can be useful to identify cases of very
closely competing solutions (which may thus often be incorrect).
Alternatively, this factor may also identify cases where the correct solution
is not being taken into account in the model.
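A sketch of how such a precision factor might work, under the assumption that it admits every tag whose probability lies within a given factor of the best one; the option's exact semantics, and the function name, are not taken from the program.

```python
def solutions(tag_probs, precision=1.0):
    """Return every tag whose probability, scaled by the precision
    factor, reaches the best probability.  With the default factor the
    set contains only the single most probable tag; a larger factor
    also admits closely competing alternatives."""
    best = max(tag_probs.values())
    return [t for t, p in tag_probs.items() if p * precision >= best]

# invented probabilities for one ambiguous word
probs = {"NOUN-SG": 0.48, "VERB-P3SG": 0.45, "ADJ-SG": 0.07}
assert solutions(probs) == ["NOUN-SG"]                    # best tag only
assert set(solutions(probs, precision=1.1)) == {"NOUN-SG", "VERB-P3SG"}
```

Closely competing solutions surfaced this way are exactly the cases the text flags as often incorrect, or as evidence that the correct tag is missing from the model.
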
If the solutions found by the tagger are not adequate, a manually corrected
list of tags can be used to automatically retrain and retag the text (the
matrices are adjusted according to the correct data). The number
of loops and the declaration of a new matrix file is only necessary in
such a case of retagging. The new matrices can then be used to tag
another text and evaluate the results.
A number of facilities have been foreseen to aid the user during development
of the language model to identify where adjustments are desirable.
These tools also provide the basis for global evaluation of the tagging results.
The program mtagfreq calculates
the state transitions as defined above for mcreate
and prints them out as probability values over tag sequences. This
tool can be useful for getting an initial overview (or profile) of a new
text. These figures can be compared with the values calculated for
a corrected set of data to identify potential problems or peculiarities
of a given corpus.
Two versions of a ``diff'' program for tagged text have been developed
as well as a program to inspect mistakes in the context of the sentence in
which they occur. mcontext supplies
the text as a list of tags and marks the errors the user has specified
in a command line option. The mdiff
program gives statistics on the errors between two lists of tags, where
it is assumed that one list would be the output of automatic tagging and
one would be a list of the correct tags.
The second version of the ``diff'' program, mdiffb,
provides not only statistics on the errors but also classifies them according
to the context in which they occur.
This facility can be a useful aid in view of readjusting the transition
biases. It can also indicate to the developer where important problems
lie in difficult cases of ambiguous tag sequences. Examples of the
latter case might suggest a change in the tagset to reduce the conflicts,
rather than simply adjusting the transition biases. A command line
option allows the user to specify which tags should be taken into account,
again, in view of focussing on specific problems during development of
a new language model adapted to a given corpus.
[Baum72] L.E. Baum An inequality and associated
maximization technique in statistical estimation for
probabilistic functions of Markov processes.
Inequalities, vol. 3, pp. 1-8, 1972.
[Chanod and Papanainen94] J. P. Chanod and P. Tapanainen
Statistical and Constraint-Based Taggers for French.
Technical Report MLTT-016, Rank Xerox, Grenoble, 1994.
[Cutting and al.92] D. Cutting, J. Kupiec, J. Pedersen, P.
Sibun A Practical Part-of-Speech Tagger. Proceedings of
the 3rd Conference on Applied Natural Language
Processing, Trento, March 31st--April 3rd, 1992.
[Elworthy94] D. Elworthy Does Baum-Welch Re-Estimation
Help Taggers? ACL Conference on Applied Natural Language
Processing, Stuttgart, October 1994.
[Feldweg95] H. Feldweg Implementation and evaluation of a
German HMM for POS disambiguation . EACL SIGDAT
workshop, Dublin, 1995.
[Kupiec92] J. Kupiec Robust Part-of-Speech Tagging Using a
Hidden Markov Model. Computer Speech and Language, vol.
6, pp. 225-242, 1992.
[Rabiner89] L.R. Rabiner A Tutorial on Hidden Markov
Models and Selected Applications in Speech Recognition.
In A. Waibel and K-F. Lee, eds., Readings in Speech
Recognition, Morgan Kaufmann, San Mateo, pp. 267--296.
[Russell and Petitpierre95] G. Russell and D. Petitpierre
MMORPH - The Multext Morphology. Version 2.0, March
1995, MULTEXT deliverable report for task 2.3.1.
[Sanchez95] F. Sanchez Development of a Spanish Version of
the Xerox Tagger. CRATER/WP6/FR1, May 19, 1995.
[Schmid95] H. Schmid Improvements in Part-Of-Speech
Tagging with an Application to German. EACL SIGDAT
workshop, Dublin, 1995.
[Tzoukermann95] E. Tzoukermann Combining Linguistic
Knowledge and Statistical Learning in French. EACL
SIGDAT workshop, Dublin, 1995.
[Veronis94] J. Veronis et al. MULTEXT: Segmentation Tool.
Version 2.0, March 1995, MULTEXT deliverable Report for
[Viterbi67] A.J. Viterbi Error bounds for convolutional
codes and an asymptotically optimal decoding algorithm.
IEEE Trans. Informat. Theory, vol. IT-13, pp. 260-269,
1967.