mtag - A practical Part-of-Speech Tagger 

SYNOPSIS

mtag
{-i <Input Text> }
{-o <Output Text> }
{-m <Compiled_matrices_file> }
{-P <Precision>}
{-C Nbr fields}
{-l Primary separator}
{-p secondary separator}
{-r <Correct tag list file>}
{-O Print the results with the original tag set}
{-n <Correct tag list file>}
{-L Loop Number}
{-M <Output Matrices file >}
{-t <file> }
{-v print version}
 

DESCRIPTION

A part-of-speech tagger that uses context to assign the most probable part-of-speech tag(s), drawn from a set of tags, to each word in a text. mtag employs the Viterbi algorithm and uses ambiguity classes in the model to reduce the number of parameters to be estimated. The tagger uses the values stored in the matrices created by the training program to calculate the optimal tag sequence. The precision criterion allows the solution set to be expanded to include more than one tag.
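
As a rough illustration of the decoding step described above, the following Python sketch runs the Viterbi algorithm over a toy model in which every word is mapped to an ambiguity class (the set of tags it may take), so that emission probabilities are shared by all words of a class. The tag names, lexicon, and probabilities are invented for the example; they are not taken from mtag's matrices, and mtag's actual model and file format may differ.

```python
import math

# Hypothetical toy model; none of these values come from mtag.
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
# P(ambiguity class | tag): one parameter per class, not per word.
EMIT = {
    ("DET",): {"DET": 1.0},
    ("NOUN",): {"NOUN": 0.3},
    ("NOUN", "VERB"): {"NOUN": 0.7, "VERB": 0.5},
}
# Each word is looked up only to find its ambiguity class.
LEXICON = {
    "the": ("DET",),
    "dog": ("NOUN",),
    "walks": ("NOUN", "VERB"),
    "barks": ("NOUN", "VERB"),
}


def viterbi(words):
    """Return the most probable tag sequence for a list of known words."""
    classes = [LEXICON[w] for w in words]
    first = classes[0]
    # best[tag] = (log-probability of the best path ending in tag, path)
    best = {
        t: (math.log(START[t]) + math.log(EMIT[first][t]), [t]) for t in first
    }
    for cls in classes[1:]:
        new_best = {}
        for t in cls:  # only tags allowed by the ambiguity class
            score, path = max(
                ((s + math.log(TRANS[p][t]), pth) for p, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            new_best[t] = (score + math.log(EMIT[cls][t]), path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


print(viterbi(["the", "dog", "barks"]))
```

Restricting each position to the tags of its ambiguity class keeps the search small, and sharing emission parameters across a class is what reduces the number of values to estimate.
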
An additional facility is provided to compare the output of the tagger with a pre-tagged version of the text. If the -r option is used with the list of correct tags (disambiguated manually with mhandtag), the tagger prints statistics on the errors. If the -n option is given (optionally with -L and -M), the tagger automatically readjusts the values in the matrices according to the correct solutions and retags the text.
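
The comparison with a pre-tagged text can be pictured with the short Python sketch below. It assumes two parallel lists of tags, one produced by a tagger and one hand-corrected, and counts how often each correct tag was replaced by a wrong one. The function name and the statistics shown are illustrative only; the statistics that mtag itself prints with -r are not specified here.

```python
from collections import Counter


def error_statistics(predicted, correct):
    """Compare two parallel tag lists.

    Returns (accuracy, confusions), where confusions counts
    (correct_tag, predicted_tag) pairs for every mismatch.
    """
    if len(predicted) != len(correct):
        raise ValueError("tag lists must have the same length")
    confusions = Counter(
        (c, p) for p, c in zip(predicted, correct) if p != c
    )
    errors = sum(confusions.values())
    accuracy = 1.0 - errors / len(correct) if correct else 0.0
    return accuracy, confusions


# Invented sample data: one NOUN/VERB confusion out of four tokens.
accuracy, confusions = error_statistics(
    ["DET", "NOUN", "NOUN", "VERB"],
    ["DET", "NOUN", "VERB", "VERB"],
)
print(accuracy, dict(confusions))
```
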

If no files are given, the standard input is used. The program returns the disambiguated text.

OPTIONS

mtag supports the following options:
-i Input text
Specifies the input text formatted by the mpreptxt program (default stdin).
-o Output text
Specifies the program output (default stdout).
-m compiled_matrices_file
Specifies the matrices file produced by the training program, including the definition of the tag and class sets (see also the mcreate program). The default is MM.cmp.
-C Nbr fields
Number of fields before the [BOS|EOS] field (the default is 1)
-P Precision
The precision criterion allows the solution set to be extended to more than one tag. With the default value 0, no interval is tolerated, so only one solution per word (the best-scoring tag) is permitted. Increasing the value widens the accepted interval between the probabilities of different tags and thus the number of solutions: any increase may produce more than one result, depending on how close the probabilities are for a given set of tags. Note that increasing this value increases processing time.
-r Correct tag list file
mtag tags the input text and prints to the standard output statistics on the transition values, obtained by comparing the tags found with the corrected tags. This information is useful for writing a biases file. With this option the precision is 0. The correct tag list file can be obtained with the mhandtag program, or can be created by hand as a list of correct tags in one column.
-n Correct tag list file
mtag tags the input text, compares the results with the hand-corrected tag list (corresponding to the text), and retags the text. This operation refines the transition matrix and normally improves accuracy. However, if the hand-corrected tag list is not large enough (~ 10-20% of the original text), performance can decrease. The correct tag list file can be obtained with the mhandtag program, or can be created by hand as a list of correct tags in one column. It is possible to specify a number of loops (-L option) before the final tagged text is printed to the output device. With this option the precision is 0.
-O Print result in the original tag set.
Prints the result of the tagging using the original tag set. Depending on your tag conversion list, converting back to the original tags may introduce some disjunctions.
-L Loop number
Number of loops performed by the tagger before printing the final tagged text to the output device (used with the -n option).
-M Output matrices file
Save the new matrices created with -n or -r options.
-l primary separator
Specifies the separator within [LEM,TAG] pairs (the default character is '\').
-p secondary separator
Specifies the separator between [LEM,TAG] pairs (default character is '|').
-t Print tag list in file
This option writes the list of correct tags into the file, in the last column. This list is useful for the mdiff and mdiffb programs.
-v version
Prints the program version.
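
To illustrate how the two separators set by -l and -p relate to each other, the Python sketch below splits a field that holds several [LEM,TAG] pairs: first on the secondary separator (between pairs), then on the primary separator (within a pair). The sample field and the function name are invented for the example, and the exact layout of the lines read and written by mtag is an assumption not confirmed by this page.

```python
def parse_pairs(field, primary="\\", secondary="|"):
    """Split a field such as 'walk\\NOUN|walk\\VERB' into (lemma, tag) pairs.

    primary   separates the lemma from the tag inside one pair (-l).
    secondary separates one pair from the next (-p).
    """
    pairs = []
    for chunk in field.split(secondary):
        lemma, sep, tag = chunk.partition(primary)
        if not sep:
            raise ValueError(f"missing primary separator in {chunk!r}")
        pairs.append((lemma, tag))
    return pairs


# Invented sample: an ambiguous word with two candidate analyses.
print(parse_pairs("walk\\NOUN|walk\\VERB"))
```
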

INPUT/OUTPUT

Description of the input and output files used by the program.

Input ==>

- Formatted initial text : [stdin]
- Compiled matrices : $MM.cmp

Output ==>

- Disambiguated text : [stdout]
 

COMMAND EXAMPLES

Tag the file text using the matrices file M1. A precision of 1 (-P option) extends the solution set (one or more solutions per word) to include those tags assigned to a given word with very close probabilities.

mtag -i text -o text.tag -m M1 -P 1 -C 3 -l '\' -p '|'
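
One way to picture the effect of the precision value is as a tolerance on the gap between the best-scoring tag and its competitors. The Python sketch below keeps every tag whose log-probability lies within that tolerance of the best one. The scale on which mtag measures the interval is not stated in this page, so the use of log-probability differences, the function name, and the sample probabilities are all assumptions made for illustration.

```python
import math


def select_tags(tag_probs, precision=0.0):
    """Return the tags whose score is within `precision` of the best score.

    tag_probs maps each candidate tag to a probability.
    The gap is measured in log-probability (an assumption, not mtag's
    documented behaviour). Tags are returned best first.
    """
    best = max(tag_probs.values())
    kept = [
        t for t, p in tag_probs.items()
        if p > 0 and math.log(best) - math.log(p) <= precision
    ]
    return sorted(kept, key=lambda t: -tag_probs[t])


# Invented probabilities for one ambiguous word.
probs = {"NOUN": 0.5, "VERB": 0.4, "ADJ": 0.1}
print(select_tags(probs, precision=0))
print(select_tags(probs, precision=1))
```

With a tolerance of 0 only the best tag survives (unless two tags tie exactly); a larger tolerance admits the close competitor while still excluding the unlikely one.
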

To print the results with the original tag set:
mtag -i text -o text.tag -m M1 -C 3 -l '\' -p '|' -O

Tag the file text using the corresponding hand-corrected tag list file HAND_TAGGED, with 5 loops, and save the final matrices file in M5.new:

mtag -i text -o text.tag -m M1 -P 1 -C 3 -l '\' -p '|' -n HAND_TAGGED -M M5.new -L 5
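
When mtag is driven from a script, the command lines above can be assembled programmatically. The following Python sketch builds an argument list using only options documented in this page; the helper name and its parameter names are hypothetical, and the sketch does not check that the option combination is one mtag accepts.

```python
import shlex


def build_mtag_command(text, output, matrices, precision=None, fields=None,
                       primary=None, secondary=None, correct_tags=None,
                       new_matrices=None, loops=None, original_tags=False):
    """Assemble an mtag argument list from the options described above."""
    cmd = ["mtag", "-i", text, "-o", output, "-m", matrices]
    optional = [
        ("-P", precision),     # precision criterion
        ("-C", fields),        # number of fields before the [BOS|EOS] field
        ("-l", primary),       # separator within [LEM,TAG] pairs
        ("-p", secondary),     # separator between [LEM,TAG] pairs
        ("-n", correct_tags),  # hand-corrected tag list used for retagging
        ("-M", new_matrices),  # where to save the new matrices
        ("-L", loops),         # number of loops (with -n)
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    if original_tags:
        cmd.append("-O")       # print results with the original tag set
    return cmd


cmd = build_mtag_command(
    "text", "text.tag", "M1", precision=1, fields=3,
    primary="\\", secondary="|",
    correct_tags="HAND_TAGGED", new_matrices="M5.new", loops=5,
)
# shlex.join quotes the separators so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a system where mtag is installed; passing a list rather than a shell string avoids having to quote the separator characters by hand.
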
 

SEE ALSO

mpreptxt(1)
mtrain(1)
mcreate(1)
mtagfreq(1)
mprint(1)
mdiff(1)
mdiffb(1)
mcontext(1)
mbiases(1)
mhandtag(1)

AUTHOR

Gilbert ROBERT
(Gilbert.Robert@issco.unige.ch)
ISSCO, 54 route des Acacias
1227 Geneva, Switzerland

Comments, suggestions, and bug reports are always welcome.