TATOO
ISSCO TAgger TOOl
Copyright (c) 1994-1998, Gilbert ROBERT
The main programs are mpreptxt, mtrain and mtag. Facilities to aid in building and modifying the language model include mtagfreq and mcreate. Programs for evaluating the results are mdiff and mdiffb.
The programs assume input text in a record/field format as given by the Multext segmenter with sentence boundaries marked and with words annotated by lexical look-up (either mtlexax or mmorph look-up). But the input is quite flexible and can be defined as 'n' columns before the Beginning or End of sentence field and the morphosyntactic annotation field. The 'n' number has to be defined in the commands with the -C option. The interface to the current Multext segmenter output format is guaranteed by the mpretag program.
For more information on the tagging technologies, read the overview of Part-of-Speech Tagging.
A short tutorial is provided here on how to use the various modules with examples of the necessary data files and sample output of the programs. A typical session is presented in terms of the basic steps from text preparation, training and tagging, to evaluating and modifying the results. A separate man page is provided with this release for each of the programs.
A user interface, xtag, is now available. This interface is written in expectk (v5-19 for Tcl 7.4 and Tk 4.0).
List of Programs
Program | Description |
---|---|
mpretag | Convert MTLEX output format into TAGGER input format |
mposttag | Convert TAGGER output into MTLEX format. |
mpreptxt | Prepare a text for the different steps of the tagger. |
mtrain | Training module for the Part-of-Speech Tagger. |
mtag | A practical Part-of-Speech Tagger based on a Hidden Markov Model. |
mcreate | Create a file containing the simple transition probabilities. |
mprint | Print in ASCII format the matrices instantiated by the training program. |
mdiff | Provides the statistics on the errors between different tags assigned to words. |
mdiffb | Print out errors of tagging program. |
mbiases | Take a list of biases and compute a new set of values for the matrices A[][] and PI[]. |
mcontext | Print error context for defined tags. |
mtagfreq | Print out the transition probabilities as a percentage. |
mhandtag | Manually disambiguate a text formatted in the TAGGER format. |
xtag | A user interface in Tcl/Tk/expectk |
gmake install # if you have source distribution
if you don't use gmake: edit Makefile.sun, adapt the macros and type:
make -f Makefile.sun install
(superuser status is only necessary if you install things in system directories).
Variables | Value | Comments |
---|---|---|
States max | 500 | Maximum number of states |
Classes max | 1000 | Maximum number of classes |
Conversion States max | 1000 | Maximum number of conversion states |
Conversion Words max | 500 | Maximum number of conversion words |
State Size | 150 | Maximum size of a state |
Class Size | 2000 | Maximum size of a class |
Sentence max time | 500 | Maximum number of items in a sentence |
What you need to start
** A Text file in the record/field format (TAGGER format)
The program assumes an input file of text in a record/field format
with one word per line, sentence boundaries and (a set of) lexical
annotations for each word. These fields must be separated by a TAB
character. Sentences beginning with the '#' character are considered as
comments.
n Fields | BOS/EOS Field | MSD Field | ||
---|---|---|---|---|
1.0 | le | TOK | BOS | =\det.m.sg|=\pro |
1.0 | usine | TOK | =\n.f.sg|usiner\v.imp.2.sg|usiner\v.ind.prs.1.sg|..... | |
1.1 | , | PUNCT | =\comma | |
1.2 | qui | TOK | =\pro | |
1.3 | devrait | TOK | =\v.con.3.sg | |
1.4 | être | TOK | =\n.m.sg|=\v.inf | |
1.5 | implantée | TOK | =\v.pps.f.sg | |
1.6 | à | TOK | =\p | |
1.7 | Eloyes | TOK | =\n | |
1.8 | Vosges | TOK | =\n | |
......... |
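The record/field layout described above can be read with a few lines of code. The following is a minimal Python sketch, not part of TATOO: the names `Record` and `parse_records` are hypothetical, the `n_fields` argument plays the role of the -C option, and the use of `|` to separate alternative analyses is an assumption taken from the sample rows.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Record:
    fields: List[str]    # the first n free-form columns (id, word, token class, ...)
    boundary: str        # "BOS", "EOS", or "" when the field is empty
    analyses: List[str]  # alternative lexical annotations for the word


def parse_records(lines: Iterable[str], n_fields: int,
                  primary: str = "\t", secondary: str = "|") -> List[Record]:
    """Parse TAB-separated records, skipping empty lines and '#' comment lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split(primary)
        # Pad short lines so the boundary and annotation fields always exist.
        cols += [""] * (n_fields + 2 - len(cols))
        analyses = [a for a in cols[n_fields + 1].split(secondary) if a]
        records.append(Record(cols[:n_fields], cols[n_fields], analyses))
    return records


sample = [
    "# a comment line",
    "1.0\tle\tTOK\tBOS\t=\\det.m.sg|=\\pro",
    "1.1\t,\tPUNCT\t\t=\\comma",
]
records = parse_records(sample, n_fields=3)
for rec in records:
    print(rec.fields[1], rec.boundary or "-", rec.analyses)
```

Here the word sits in the second of the three leading columns, which is the position that would be indicated to the programs with the -n option.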
** An optional MSD conversion file: states.cnv
Initial Annotation | Substitute Tag |
---|---|
N\[gen=m num=s\] | NOUN-SG |
N\[gen=m num=s!m\] | NOUN-SG|NOUN-PL |
Pro* | PRON |
Det\[gen=f num=s\] | DET-SG |
V*ten=pr*num=s*pers=3* | VERB-P3SG |
COLON | = |
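To illustrate how such a conversion table might be applied, here is a small Python sketch. It is not TATOO's implementation, and it rests on several assumptions that the table above only suggests: `*` matches any run of characters, a backslash makes the next character literal, a substitute tag of `=` keeps the annotation unchanged, and the first matching rule wins. The `!` notation in the second row is not handled. The function names and the sample annotations are hypothetical.

```python
import re
from typing import List, Tuple


def pattern_to_regex(pattern: str):
    """Translate a conversion pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Backslash: take the next character literally (assumed semantics).
            parts.append(re.escape(pattern[i + 1]))
            i += 2
        elif ch == "*":
            # Wildcard: any run of characters (assumed semantics).
            parts.append(".*")
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def convert(annotation: str, rules: List[Tuple[str, str]]) -> str:
    """Return the substitute tag of the first matching rule, else the annotation."""
    for pattern, tag in rules:
        if pattern_to_regex(pattern).fullmatch(annotation):
            return annotation if tag == "=" else tag
    return annotation


rules = [
    ("N\\[gen=m num=s\\]", "NOUN-SG"),
    ("Pro*", "PRON"),
    ("V*ten=pr*num=s*pers=3*", "VERB-P3SG"),
    ("COLON", "="),
]
for msd in ["N[gen=m num=s]", "Pro[type=pers]", "V[ten=pr num=s pers=3]", "COLON"]:
    print(msd, "->", convert(msd, rules))
```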
** An optional Lexical Conversion File: words.cnv
Transforms a word (first column) into a tag (second column). This word conversion
overrides the preceding tag conversion.
Word | Substitute Tag |
---|---|
que | =\CONJQUE|=\PRON|=\ADV |
qu' | =\CONJQUE|=\PRON|=\ADV |
de | de\PREP-DE|de\DET-SG|de\DET-PL |
d' | de\PREP-DE|de\DET-SG|de\DET-PL |
des | de\PREP-DE|de\DET-PL |
du | de\PREP-DE|de\DET-SG |
à | =\PREP-A |
.......... |
If no conversion files (Tag_conversion_file or Word_conversion_file)
are specified, no substitutions will be applied.
mpretag is the interface between the Multext segmenter or look-up
(version: mtseg.sh 3.6, 07 Nov 1995) and the tagger.
The next version of tatoo will be compatible with the new version
of the Multext segmenter (mtseg.sh 1.3.1 02/10/97), so
mpretag will be updated.
Notes: The first n columns can contain anything; just note the
position of the word. The next column, the [BOS|EOS] field, marks the
beginning and the end of the sentence and can be
empty. The last field is the morphosyntactic part.
For every sentence that needs to be ignored, the program puts a '#' character
in front of it.
Command |
---|
mpretag < text.lex > text |
0- You need
Command | |
---|---|
mpreptxt | {-i Input Text} |
| | {-o Output Text} |
| | {-m Matrices Output file} |
| | {-c Tag_conversion_file} |
| | {-w Word_conversion_file} |
| | {-n Lexical_column_number} |
| | {-C Nbr_fields} |
| | {-H} |
| | {-l Primary_separator} |
| | {-p Secondary_separator} |
| | {-v Print version} |
Example: mpreptxt -i text -o text.tr -m MMinit -c states.cnv -w words.cnv -n 1 -C 1 |
Files created | text.tr: training file |
| | MMinit: file with the definition of the tag set and class set |
Command | |
---|---|
mtrain | [-t Input_text] |
| | {-i Input_matrix_file} |
| | {-o Output_matrix_file} |
| | {-b Biases_file} |
| | {-l Nbr Loop} |
| | {-v print version} |
Example: mtrain -t text.tr -i MMinit -o MM1 |
File created: the matrices file MM1 |

With biases: |
---|
Example: mtrain -t text.tr -i MMinit -o MM1b -b biases.lst |
File created: the matrices file MM1b |
If no input or output files are given, standard input or standard output is
used. The program returns the disambiguated text.
0- You need:
Command |
---|
mpreptxt -i text -o text.prep -m MM1b -c states.cnv -w words.cnv -C 1 -n 1 |
Command | |
---|---|
mtag | {-i Input Text} |
| | {-o Output Text} |
| | {-m Matrices_file} |
| | {-P Precision} |
| | {-C Nbr fields} |
| | {-l Primary separator} |
| | {-p Secondary separator} |
| | {-r Correct tag list file} |
| | {-n Correct tag list file} |
| | {-L Loop Number} |
| | {-M Output Matrices file} |
| | {-t Tag list file} |
| | {-v print version} |
Example: mtag -m MM1b -i text.prep -o text.tag -C 1 -t text.tag.lst |
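Since mtag is described as a tagger based on a Hidden Markov Model, the disambiguation step can be pictured as a search for the most probable tag sequence among the candidate tags of each word. The Python sketch below is a simplified Viterbi-style decoder written for illustration only; it is not the mtag code. It uses only initial probabilities (`pi`) and transition probabilities (`a`), leaves out emission probabilities for brevity, and assumes every word has at least one candidate tag. The function name, the `floor` parameter and the probability values are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def viterbi(candidates: List[List[str]],
            pi: Dict[str, float],
            a: Dict[Tuple[str, str], float],
            floor: float = 1e-6) -> List[str]:
    """Return the most probable tag sequence, choosing one candidate tag per word."""
    # best[t][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{tag: (math.log(pi.get(tag, floor)), None) for tag in candidates[0]}]
    for tags in candidates[1:]:
        column = {}
        for tag in tags:
            # Extend every path of the previous column and keep the best one.
            column[tag] = max(
                (best[-1][prev][0] + math.log(a.get((prev, tag), floor)), prev)
                for prev in best[-1]
            )
        best.append(column)
    # Backtrack from the best final tag to the start of the sentence.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for t in range(len(best) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1]


# Two ambiguous words, each with two candidate tags (illustrative values only).
candidates = [["DET-SG", "PRON"], ["NOUN-SG", "VERB-P3SG"]]
pi = {"DET-SG": 0.6, "PRON": 0.4}
a = {
    ("DET-SG", "NOUN-SG"): 0.9, ("DET-SG", "VERB-P3SG"): 0.1,
    ("PRON", "NOUN-SG"): 0.3, ("PRON", "VERB-P3SG"): 0.7,
}
tags = viterbi(candidates, pi, a)
print(tags)
```

Working in log space avoids numeric underflow on long sentences, and the `floor` value stands in for transitions that were never seen.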
Command | |
---|---|
mhandtag | {-i Input_File} |
| | {-o Output_File} |
| | {-l Primary_separator} |
| | {-p Secondary_separator} |
| | {-C Nbr_fields} |
| | {-s Tag_List} |
| | {-n Lexical_column_number} |
| | {-c Concatenation option} |
Example: mhandtag -i text.sh -o text.sh.hd -C 1 -n 1 -s TAG_CORRECT |
Command |
---|
mpreptxt -m MM1b -i text.sh.hd -o text.sh.hd.pr -C 1 -n 1 -c states.cnv -w words.cnv -H |
Command |
---|
mhandtag -i text.sh.hd.pr -o text.sh.hd.pr.hd -s TAG1 -n 1 -C 1 |
Command |
---|
mpreptxt -m MM1b -i text.sh -o text.sh.pr -c states.cnv -w words.cnv -n 1 -C 1 |
mtag -m MM1b -i text.sh.pr -o text.sh.tag -t text.sh.list1 -C 1 -n 1 |
Two "diff" programs for tagging are available, mdiff and mdiffb.
They analyse the differences between the tags assigned to the words by
mtag and the list of pre-tagged data corresponding to this text. The input files
are simply the lists of tags corresponding to the words in the text. The
first can be produced by the mtag program
with the -t option, and the second is the correct list of tags prepared
with the mhandtag program.
mdiff prints out statistics on the errors between these two sets of tags.
Command | |
---|---|
mdiff | -f {tag_list} |
| | -g {correct_tag_list} |
| | -v {version} |
Example: mdiff -f text.sh.list1 -g TAG1 |
Previous Tag | Next Tag | Error Numbers | Error Percent |
---|---|---|---|
...... | |||
AJ0 | AV0 | --1 | 0.1 % |
AJ0 | NN1 | --1 | 0.1 % |
...... | |||
PNI | AV0 | --2 | 0.3 % |
PNI | DT0 | --4 | 0.5 % |
...... | |||
VVN | AJ0 | --1 | 0.1 % |
VVZ | NN2 | --3 | 0.4 % |
...... | |||
Not so bad: Welcome in the TOP 10
Error Total: 6.3% |
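The kind of statistics shown above can be reproduced from two aligned tag lists. The following Python sketch is an illustration, not the mdiff source: it assumes that each row pairs the tag assigned by the tagger with the correct tag, and that percentages are taken over the total number of words. The function name and the sample lists are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def confusion_stats(tagger_tags: List[str],
                    correct_tags: List[str]) -> Tuple[list, float]:
    """Count (tagger tag, correct tag) mismatches and the overall error rate in %."""
    if len(tagger_tags) != len(correct_tags):
        raise ValueError("tag lists must have the same length")
    total = len(correct_tags)
    errors = Counter(
        (got, want) for got, want in zip(tagger_tags, correct_tags) if got != want
    )
    rows = [
        (got, want, count, 100.0 * count / total)
        for (got, want), count in sorted(errors.items())
    ]
    error_rate = 100.0 * sum(errors.values()) / total if total else 0.0
    return rows, error_rate


tagger_list = ["AJ0", "NN1", "PNI", "VVZ"]
correct_list = ["AV0", "NN1", "DT0", "VVZ"]
rows, error_rate = confusion_stats(tagger_list, correct_list)
for got, want, count, percent in rows:
    print(f"{got} | {want} | {count} | {percent:.1f} %")
print(f"Error Total: {error_rate:.1f}%")
```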
** Print errors with context (previous tag).
mdiffb, like mdiff, prints out the statistics, together with the context
(previous tag) in which the errors occurred.
The following table is the output of mdiffb. It contains the most frequent
errors produced by the tagger when training with biases. ("Next Tag 1" is
assumed to be the output of the tagger and "Next Tag 2" is assumed to be the
correct tag.)
Command | |
---|---|
mdiffb | -f {tag_list} |
| | -g {correct_tag_list} |
| | {-p Secondary_separator} |
| | {Tags List...} |
Example: mdiffb -f txt.list1 -g TAG1 DT Subjonctive |
Previous Tag | Next Tag 1 | Next Tag 2 | Nbr | Mark |
---|---|---|---|---|
......... | | | | |
Verb | Adv | Nn | 4 | |
| | DT | NUM | 3 | ** |
| | DT | PR | 6 | ** |
| | DT | SubConj | 7 | ** ** |
| | INF | PREP | 25 | |
| | INTJ | PR | 7 | |
| | SubConj | Verb | 1 | ** |
| | Verb | Nn | 34 | |
| | WP | SubConj | 33 | ** |
........ | | | | |
Results:
Error Number: 54 Nbr Word:745 Error rate: %6.24 |
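A context-aware comparison of this kind can be sketched by keying each error on the tag that precedes it. The Python code below is an illustration under stated assumptions, not the mdiffb source: the previous tag is taken from the tagger output, an `EOS` tag resets the context to `[BeginOfSentence]`, and the function name and sample lists are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def context_errors(tagger_tags: List[str], correct_tags: List[str],
                   start: str = "[BeginOfSentence]",
                   eos: str = "EOS") -> Tuple[Counter, float]:
    """Count errors keyed by (previous tag, tagger tag, correct tag)."""
    errors = Counter()
    previous = start
    for got, want in zip(tagger_tags, correct_tags):
        if got != want:
            errors[(previous, got, want)] += 1
        # Assumption: the context is the tagger's own previous tag,
        # reset at each end-of-sentence marker.
        previous = start if got == eos else got
    total = len(correct_tags)
    rate = 100.0 * sum(errors.values()) / total if total else 0.0
    return errors, rate


tagger_out = ["DET-SG", "NOUN-SG", "VERB-P3SG", "EOS", "PREP-DE", "NOUN-PL", "EOS"]
reference = ["DET-SG", "ADJ-SG", "VERB-P3SG", "EOS", "DET-PL", "NOUN-PL", "EOS"]
ctx_errors, ctx_rate = context_errors(tagger_out, reference)
for (prev, got, want), count in sorted(ctx_errors.items()):
    print(f"{prev} | {got} | {want} | {count}")
print(f"Error Number: {sum(ctx_errors.values())}  Error rate: {ctx_rate:.2f} %")
```

Grouping by previous tag is what makes the output useful for writing biases: a frequent error after one particular tag points at a single transition to adjust.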
Command | |
---|---|
mbiases | {-i input compiled matrices file} |
| | {-f biases list file} |
| | {-o output compiled matrices file} |
Example: mbiases -i MM1b -o MM1b.new -f biases.lst |
Biases file | | |
---|---|---|
AJ0 | NN1 | +4 |
| | NN2 | +4 |
| | VVB | -4 |
| | VVZ | -4 |
AT0 | NN1 | =.5 |
| | NN2 | =.2 |
| | VVB | =.3 |
| | !OTHER | =0 |
!PI | PRP | +4 |
| | NN1 | +0 |
| | !OTHER | -2 |
..... | | |
Command |
---|
mtag -m MM1b.new -i text.sh.pr -o text.sh.tag -t tag.lst1 -C 1 -n 1 |
Command |
---|
mtrain -t text.tr -i MM -o MM.new.bias -b biases.lst |
Command |
---|
mtag -i text.sh.pr -o text.sh.tg -t text.tag.lst1 -m MM.new.bias -C 1 -P 1 |
Command |
---|
mdiffb -f text.tag.lst1 -g TAG1 -p '\|' |
Errors appearing in the same context | |||||
---|---|---|---|---|---|
Previous Tag | Next Tag 1 | Next Tag 2 | Nbr | Mark | |
ADJ-SG | CONJQUE | ADV | 1 | ||
CONN | PREP-DE | DET-PL | 2 | ||
DET-PL | VERB-P1P2 | ADV | 1 | ||
....... | |||||
WPUNCT | CONN | ADJ-SG | 1 | ||
NOUN-PL | ADJ-PL | PREP-DE | 2 | ||
NOUN-PL | PAP-PL | PREP-DE | 2 | ||
WPUNCT( | DET-SG | PAP-PL | 1 | ||
[BeginOfSentence] | DET-SG | PREP-DE | 2 | ||
Errors detected and corrected with the Precision flag | |||||
Prev Tag | Next Tag 1 | Next Tag 2 | Correct Tag | Nbr | Mark |
ADJ-SG | DET-PL | PREP-DE | PREP-DE | 4 | |
DET-PL|PRON | DET-PL | PRON | PRON | 1 | |
NOUN-PL | DET-PL | PREP-DE | PREP-DE | 4 | |
.... | |||||
PRON | DET-PL | PREP-DE | PREP-DE | 1 | |
VERB-INF | ADV | CONN | CONN | 1 | |
WPUNCT | DET-SG | PREP-DE | PREP-DE | 1 | |
.......... | |||||
Results: Error Number : 72
Error Detected : 15 Possible Error Rate --> 3.2 % Word Number : 1770 Error Rate : 4.06 % |
New Sentence | Tag from tagger | Correct Tag | |
---|---|---|---|
CONJQUE | CONJQUE | ||
PRON | PRON | ||
VERB-P1P2 **** | **** VERB-P3SG | ||
DET-PL ... | ... ADJ-PL | ||
DET-PL | DET-PL | ||
NOUN-PL | NOUN-PL | ||
ADJ-PL | ADJ-PL | ||
PREP-DE | PREP-DE | ||
NOUN-PL | NOUN-PL | ||
ADJ-PL | ADJ-PL | ||
COMMA | COMMA | ||
CONN ... | ... ?? | ||
DET-SG | DET-SG | ||
NOUN-SG | NOUN-SG | ||
NEG | NEG | ||
VAUX-P3SG | VAUX-P3SG | ||
ADV | ADV | ||
ADV | ADV | ||
PRON ... | ... PREP | ||
VERB-P3SG ... | ... NOUN-SG | ||
PREP-DE | PREP-DE | ||
VERB-INF | VERB-INF | ||
DET-PL | DET-PL | ||
ADJ-PL ... | ... NOUN-PL | ||
PERIOD | PERIOD | ||
EOS | EOS | ||
Final Results | |||
Context Frequency of error tags | |||
ADJ-PL ... | ... NOUN-PL | 1 | |
ADV ... | ... CONN | 1 | |
CONN ... | ... ?? | 1 | |
DET-PL ... | ... ADJ-PL | 2 | |
NUM ... | ... DET-SG | 1 | |
PAP-SG ... | ... NOUN-SG | 1 | |
PRON ... | ... PREP | 1 | |
VERB-P1P2 **** | **** VERB-P3SG | 3 | |
VERB-P3SG ... | ... NOUN-SG | 1 |
** Create corresponding matrices, which can be a good initial set
for the training
Command | |
---|---|
mcreate | {-t Input_text} |
| | {-i Input Matrices_file} |
| | {-o Output Matrices_file} |
| | {-v print version} |
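One plausible reading of "simple transition probabilities" is a relative-frequency estimate from a tagged text. The Python sketch below shows that estimate for the initial vector PI[] and the transition matrix A[][]. It is an assumption about the idea, not a description of what mcreate actually computes, and the function name and sample sentences are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def estimate_matrices(
    sentences: List[List[str]],
) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """Estimate PI (sentence-initial tags) and A (tag bigrams) by relative frequency."""
    initial = Counter()
    bigrams = defaultdict(Counter)
    for tags in sentences:
        if not tags:
            continue
        initial[tags[0]] += 1
        for prev, nxt in zip(tags, tags[1:]):
            bigrams[prev][nxt] += 1
    n_sentences = sum(initial.values())
    pi = {tag: count / n_sentences for tag, count in initial.items()}
    a = {
        prev: {nxt: count / sum(row.values()) for nxt, count in row.items()}
        for prev, row in bigrams.items()
    }
    return pi, a


tagged = [
    ["DET-SG", "NOUN-SG", "VERB-P3SG"],
    ["DET-SG", "ADJ-SG", "NOUN-SG"],
    ["PRON", "VERB-P3SG"],
]
pi_est, a_est = estimate_matrices(tagged)
# Print the transitions as percentages, in the spirit of mtagfreq.
for prev in sorted(a_est):
    for nxt in sorted(a_est[prev]):
        print(f"{prev} -> {nxt}: {100 * a_est[prev][nxt]:.1f} %")
```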
Command |
---|
mtag -i text.sh.pr -o text.sh.tag -m MM.new -C 1 -t tag.lst -n TAG1 -L 10 -M MM10 |
mdiffb -f tag.lst -g TAG1 |
Previous Tag | Next Tag 1 | Next Tag 2 | Nbr | Mark |
---|---|---|---|---|
............ | | | | |
PREP | ADJ-SG | NOUN-SG | 2 | |
| | DET-SG | PRON | 1 | |
| | NOUN-SG | NOUN-PL | 2 | |
| | VERB-P3SG | PREP | 2 | |
PRON | ABBR | PRON | 1 | |
| | ADJ-SG | PAP-SG | 1 | |
PUNCT | PRON | PREP | 1 | |
............ | | | | |
Results: | ||||
Error Number : 58
Word Number : 1770 Error Rate : 3.28 % |
Restrictions. No part of the Software may be incorporated into any other software or product which is distributed to other parties without prior permission from ISSCO. The above copyright notice must appear in all copies, and both that copyright notice and this permission notice must appear in supporting documentation. The name of ISSCO cannot be used in advertising or publicity pertaining to distribution of the software and data without specific, written prior permission.
No Liability. ISSCO makes no representations about the suitability of this software and data for any purpose. It is provided "as is" without express or implied warranty.
No Warranty. ISSCO disclaims all warranties with regard to this software and data, including all implied warranties of merchantability and fitness, in no event shall ISSCO be liable for any special, indirect or consequential damages or any damages whatsoever, action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software.