mpreptxt - Prepare a text for the tagger|trainer program.  



{-i Input Text}
{-o Output Text}
{-c Tag_conversion_file}
{-m Matrices_file}
{-w Word_conversion_file}
{-C Nbr_fields}
{-n Lexical_column_number}
{-l Primary_separator}
{-p Secondary_separator}
{-P Prepare text for tagging}
{-H Prepare text with tag conversion tables}
{-v Print version}
{-V Verbose mode}


Prepare a text for the trainer or tagger program.
The program assumes an input text file in a record/field format, with one word per line, sentence boundaries marked, and one or more lexical annotations for each word. The fields must be separated by a TAB character.
In preparation for training or tagging, the lexical annotations are replaced by the corpus tags specified in the Tag conversion file. A tag or set of tags can also be assigned to individual words in the Word conversion file. The tags assigned to individual words take precedence over the general mappings specified in the Tag conversion file.
BOS  Begin of Sentence.
EOS  End of Sentence.
MSD  Morphosyntactic description
classes  ambiguity classes based on tags.

The data is reformatted for the mtrain or mtag programs and two files are returned: a formatted version of the initial text data (with the MSD and word conversions applied) and a formatted file with the crucial information (default: MMinit), i.e. the set of states, the set of tags (the class) and the definitions of the matrices. A history file (with the suffix .history) is associated with each matrices file you create. This history file records the major actions performed on the matrices.
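
To make the notions of states and ambiguity classes more concrete, the following Python sketch collects both from words that have already been converted. It assumes that a state is a single corpus tag and that an ambiguity class is the set of tags attached to one word. The function name and the data layout are illustrative only and do not describe the actual MMinit file format.

```python
def collect_states_and_classes(tagged_words):
    """Return (states, classes) for an iterable of (word, tags) pairs.

    Assumption: a state is one corpus tag, and an ambiguity class is the
    set of tags that a word carries after conversion.
    """
    states = set()
    classes = set()
    for _word, tags in tagged_words:
        states.update(tags)            # every tag seen becomes a state
        classes.add(frozenset(tags))   # the tag set of one word is a class
    return states, classes


# Hypothetical converted words, using tags from the Tag Conversion File.
sample = [
    ("La", ["NOUN-SG", "PRON", "DET-SG"]),
    ("peut", ["VERB-P3SG"]),
    ("elle", ["PRON"]),
]
states, classes = collect_states_and_classes(sample)
```

Under these assumptions, a word with a previously unseen tag set would add a new class, which mirrors the way unknown states and classes are appended to the matrices file.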

Conversion: (-H option) Applies the conversion tables (words and states) to the input file without changing its format. This option is useful for the mhandtag program.
If some states, and consequently some classes, are unknown, the program appends them to the matrices file. mpreptxt can be used as a filter for the tagger program mtag. If no input (output) file is given, the standard input (output) is used.


mpreptxt supports the following options:
-i Input text
Specifies the input file (default stdin).
-o Output text
Specifies the output file (default stdout).
-c Tag_conversion_file
Specifies the tag conversion list. If this option is not specified then no tag substitutions will be applied.
-m Matrices_file
Specifies the matrices file, which holds the list of states and ambiguity classes and the initialization of the matrices (default MMinit).
-w Word_conversion_file
Specifies the word conversion list. If this option is not specified then no special tag substitutions are applied to individual words.
-n Lexical_column_number
Specifies the column that contains the word. The field following the word is assumed to hold the sentence boundary marker ("BOS|EOS"), and the field after that the lexical annotations. Each field must be separated by a TAB. Note that for words that are neither sentence-initial nor sentence-final the boundary field is empty, so the word and its lexical annotations are separated by two TABs. (The default lexical field is 1.)
-C Nbr_fields
Specifies the number of fields preceding the [BOS|EOS] field (the default is 1).
-l Primary_separator
Specifies the separator within a [LEM,ANNOT] pair (the default is '\').
-p Secondary_separator
Specifies the separator between the ambiguous [LEM,ANNOT] pairs assigned to a given word (the default is '|').
-P Prepare Text for tagging
Prepares the text for the tagging program. If this option is not specified, the text is prepared for the training program.
-H Prepare Text with Tag conversion tables
Applies the conversion tables (words and states) to the input file without changing its format.
-v version
Print the program version.
-V Verbose
Verbose mode.


Initial text


< Field 1>..<Field n> <BOS-EOS Field> <MSD Field>
12  La  BOS la\N[gen=m num=s]|le\Pro[gen=f num=s per=3]|le\Det[gen=f num=s]
14  Commission  commission\N[gen=f num=s]
18  peut pouvoir\V[mode=ind ten=pr num=s pers=3]
25  elle elle\Pro[gen=f num=s pers=3]
28  =\COLON
30  EOS =\M
In this example n=2. All fields must be separated by a TAB character.


The [BOS,EOS] field can be empty (NULL); in that case, two TAB characters must separate the word from the annotation.

The Tag field is composed of [LEM,ANNOT] pairs. The primary separator within [LEM,ANNOT] pairs in the example is the '\' character and the secondary separator between [LEM,ANNOT] pairs is the '|' character. The "=" indicates that the lemma is the same as the word form.
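
The record layout described above can be illustrated with a short Python sketch. It is not the mpreptxt implementation: the function name parse_record and its parameters are hypothetical, and it assumes, as stated for the -n option, that the boundary field directly follows the word and the annotation field follows the boundary field.

```python
def parse_record(line, word_column=1, primary="\\", secondary="|"):
    """Split one input line into (word, boundary, [(lemma, annotation), ...]).

    word_column is 1-based, like the -n option. The boundary field is
    "BOS", "EOS" or empty (two consecutive TABs in the input).
    """
    fields = line.rstrip("\n").split("\t")
    word = fields[word_column - 1]
    boundary = fields[word_column]
    pairs = []
    for pair in fields[word_column + 1].split(secondary):
        lemma, annotation = pair.split(primary, 1)
        if lemma == "=":
            lemma = word  # "=" means the lemma equals the word form
        pairs.append((lemma, annotation))
    return word, boundary, pairs


# Sentence-initial word with two ambiguous readings (word in column 2).
first = parse_record(
    "12\tLa\tBOS\tla\\N[gen=m num=s]|le\\Det[gen=f num=s]", word_column=2
)
# Sentence-internal word: empty boundary field, hence two TABs.
inner = parse_record(
    "14\tCommission\t\tcommission\\N[gen=f num=s]", word_column=2
)
```

Splitting on the secondary separator first and on the primary separator second mirrors the nesting of the format: a word carries several pairs, and each pair carries one lemma and one annotation.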

Tag Conversion File
<Initial Annotation> <Substitute Tag>
Initial Annotation Substitute Tag
N\[gen=m num=s\]  NOUN-SG
N\[gen=m num=s!m\]  NOUN-SG|NOUN-PL
Pro*  PRON
Det*num=s*  DET-SG
Det*num=m*  DET-PL
V\[mode=ind ten=pr num=s pers=3\]  VERB-P3SG
V*infinitive*present*  VERB-INF
V*participle*past number=pl*  VERB-PAP-PL
V*participle*past number=si*  VERB-PAP-SG
V*participle*tense=present*  VERB-PRP
V*plural*  VERB-PL
V*singular*  VERB-SG
The fields are separated by a TAB character. In the substitute tag, the '=' character stands for the initial annotation; in this case, the initial annotation serves as the tag. You can specify one or more new tags for each initial annotation. Shell-style pattern matching with the ?, \, [] and * characters is supported. Substitution is sequential, so the order of the patterns matters: for example, the pattern V* has to be placed after the pattern Verb*participle*. Put a backslash before any special character that should be interpreted as an ordinary one.

Example:

nez nez\N[gen=m num=s!g]  will be converted to  nez     nez\NOUN-SG|nez\NOUN-PL
If mpreptxt does not find a mapping in the conversion table for an annotation that occurs in the text, the annotation is returned unchanged.
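
The sequential, first-match behaviour of the tag conversion table can be sketched in Python as follows. This is an illustration under stated assumptions, not the mpreptxt code: the helper names are hypothetical, only the *, ? and backslash-escape parts of the pattern syntax are handled (bracket classes are left out), and the V* entry with its VERB tag is invented to show why ordering matters.

```python
import re


def compile_pattern(pattern):
    """Translate a simplified shell-style pattern into a compiled regex.

    Handles '*', '?' and backslash escapes; every other character,
    including an unescaped '[', is matched literally in this sketch.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))  # escaped character
            i += 2
            continue
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts))


def convert_annotation(annotation, table):
    """Return the tags of the first matching pattern, in table order."""
    for pattern, substitute in table:
        if compile_pattern(pattern).fullmatch(annotation):
            # Assumed reading of '=': keep the initial annotation as the tag.
            return [annotation if tag == "=" else tag
                    for tag in substitute.split("|")]
    return [annotation]  # no mapping: the annotation is kept as is


TABLE = [
    (r"N\[gen=m num=s\]", "NOUN-SG"),
    ("Pro*", "PRON"),
    ("V*participle*tense=present*", "VERB-PRP"),
    ("V*", "VERB"),  # hypothetical catch-all, placed last on purpose
]
```

With this table, convert_annotation("V[participle tense=present]", TABLE) stops at the specific participle pattern. If the V* entry were listed first, it would match every verb annotation and the more specific entries would never be reached.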
Lexical Conversion File
<Word> < Substitute Tag >
Word Substitute Tag
that  =\THAT
have  =\HV
be  =\BE
to  =\TO|=\PREP
The fields are separated by a TAB character; lines preceded with '#' are considered as comments and ignored.
Example:
be =\VB   will be converted to   be =\BE
If no conversion files (Tag_conversion_file or Word_conversion_file) are specified, no substitutions will be applied.
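
A minimal Python sketch of the word conversion step is given below. The function names are hypothetical, and the sketch assumes that words are compared by exact, case-sensitive match, which the description above does not specify.

```python
def load_word_table(lines):
    """Read word conversion entries of the form <Word> TAB <Substitute Tag>."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank lines and '#' comments are ignored
        word, substitute = line.split("\t", 1)
        table[word] = substitute
    return table


def apply_word_table(word, msd_field, table):
    """Return the substitute for a listed word, else the original field."""
    return table.get(word, msd_field)


WORDS = load_word_table([
    "# closed-class words",
    "be\t=\\BE",
    "to\t=\\TO|=\\PREP",
])
converted = apply_word_table("be", "=\\VB", WORDS)
```

Because the entries for individual words take precedence over the general mappings, a real pipeline would consult this table first and fall back to the tag conversion table only for words that are not listed.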





Gilbert ROBERT
Copyright (c) 1998 Issco, Geneva, Switzerland
ISSCO, 54 route des Acacias
1227 Geneva, Switzerland

Comments, suggestions, and bug reports are always welcome.