mpreptxt
- Prepare a text for the tagger|trainer program.
SYNOPSIS
mpreptxt
{-i Input_text}
{-o Output_text}
{-c Tag_conversion_file}
{-m Matrices_file}
{-w Word_conversion_file}
{-C Nbr_fields}
{-n Lexical_column_number}
{-P}
{-H}
{-l Primary_separator}
{-p Secondary_separator}
{-v}
{-V}
DESCRIPTION
Prepare a text for the trainer or tagger program.
The program assumes an input file of text in a record/field format
with one word per line, sentence boundaries indicated and (a set of) lexical
annotations for each of the words. These fields must be separated by a TAB
character.
In preparation for training or tagging, the lexical annotations are
replaced by the corpus tags specified in the Tag conversion file.
A tag or set of tags can also be assigned to individual words in the Word
conversion file. The tags assigned to individual words take precedence
over the general mappings specified in the Tag conversion file.
Terminology
BOS      Beginning of sentence.
EOS      End of sentence.
MSD      Morphosyntactic description.
state    Tag.
classes  Ambiguity classes based on tags.
The data is reformatted for the mtrain or mtag programs, and two files
are returned: a formatted version of the initial text data (with the MSD
and word conversions applied) and a formatted matrices file (default:
MMinit) holding the essential information, i.e. the set of states, the
set of classes built from the tags, and the definitions of the matrices.
A history file (with the suffix .history) is associated with each
matrices file you create. It records the major actions performed with
the matrices.
Conversion (-H option): applies the conversion tables (words and states)
to the input file without changing its format. This option is useful
for the mhandtag program.
If some states, and consequently some classes, are unknown, the program
appends them to the matrices file. mpreptxt can be used as a filter for
the tagger program mtag: if no input (output) file is given, the standard
input (output) is used.
OPTIONS
mpreptxt supports the following options:
-i Input_text
    Specifies the input file (default: stdin).
-o Output_text
    Specifies the output file (default: stdout).
-c Tag_conversion_file
    Specifies the tag conversion list. If this option is not specified,
    no tag substitutions are applied.
-m Matrices_file
    Specifies the matrices file (default: MMinit), which holds the list
    of states and ambiguity classes and the initialization of the
    matrices.
-w Word_conversion_file
    Specifies the word conversion list. If this option is not specified,
    no special tag substitutions are applied to individual words.
-n Lexical_column_number
    Specifies the column that contains the word (default: 1). The
    following field is assumed to indicate whether the word is sentence
    initial or final ("BOS|EOS"), and the field after that to contain the
    lexical annotations. Fields must be separated by a TAB. For words
    that are neither sentence initial nor sentence final, the BOS|EOS
    field is empty, so the word and its lexical annotations are separated
    by two TABs.
-C Nbr_fields
    Number of fields preceding the [BOS|EOS] field (default: 1).
-l Primary_separator
    Specifies the separator within a [LEM,ANNOT] pair (default: '\').
-p Secondary_separator
    Specifies the separator between the ambiguous [LEM,ANNOT] pairs
    assigned to a given word (default: '|').
-P
    Prepare the text for tagging. If this option is not specified, the
    text is prepared for the training program.
-H
    Apply the conversion tables (words and states) to the input file
    without changing its format.
-v
    Print the program version.
-V
    Verbose mode.
FORMAT
Initial text
Format:
<Field 1>..<Field n> <BOS-EOS Field> <MSD Field>
<Field 1>..<Field n> <BOS-EOS Field> <MSD Field>
Example:
12    La            BOS    la\N[gen=m num=s]|le\Pro[gen=f num=s per=3]|le\Det[gen=f num=s]
14    Commission           commission\N[gen=f num=s]
18    peut                 pouvoir\V[mode=ind ten=pr num=s pers=3]
25    elle                 elle\Pro[gen=f num=s pers=3]
28    :                    =\COLON
30    1             EOS    =\M
....
In this example n=2. All fields must be separated by a TAB character.
The [BOS,EOS] field can be empty, in which case two consecutive TAB
characters separate the word from the annotation.
The tag (MSD) field is composed of [LEM,ANNOT] pairs. In the example,
the primary separator within a [LEM,ANNOT] pair is the '\' character and
the secondary separator between [LEM,ANNOT] pairs is the '|' character.
A lemma written as "=" means that the lemma is the same as the word form.
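The record layout described above can be illustrated with a short Python sketch. It is not part of mpreptxt: the names Record and parse_record are hypothetical, and the way the nbr_fields and word_column parameters mirror the -C and -n options is an assumption based on the option descriptions.

```python
from typing import List, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    leading: List[str]               # the fields that precede the BOS/EOS field
    word: str                        # the word form
    boundary: Optional[str]          # "BOS", "EOS", or None when the field is empty
    readings: List[Tuple[str, str]]  # (lemma, annotation) pairs


def parse_record(line, nbr_fields=1, word_column=1, primary="\\", secondary="|"):
    """Split one TAB-separated input line into its parts (illustrative only)."""
    fields = line.rstrip("\n").split("\t")
    leading = fields[:nbr_fields]
    word = fields[word_column - 1]
    # An empty BOS/EOS field shows up as two consecutive TABs in the line.
    boundary = fields[nbr_fields] or None
    readings = []
    for pair in fields[nbr_fields + 1].split(secondary):
        lemma, annotation = pair.split(primary, 1)
        if lemma == "=":
            lemma = word  # "=" means the lemma equals the word form
        readings.append((lemma, annotation))
    return Record(leading, word, boundary, readings)


if __name__ == "__main__":
    sample = "28\t:\t\t=\\COLON"
    print(parse_record(sample, nbr_fields=2, word_column=2))
```

The sketch assumes well-formed lines; a real reader would also have to report lines with missing fields.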
Tag Conversion File
Format:
<Initial Annotation> <Substitute Tag>
Example:
Initial Annotation                    Substitute Tag
N\[gen=m num=s\]                      NOUN-SG
N\[gen=m num=s!m\]                    NOUN-SG|NOUN-PL
Pro*                                  PRON
Det*num=s*                            DET-SG
Det*num=m*                            DET-PL
Det*                                  DET-SG|DET-PL
V\[mode=ind ten=pr num=s pers=3\]     VERB-P3SG
V*infinitive*present*                 VERB-INF
V*participle*past number=pl*          VERB-PAP-PL
V*participle*past number=si*          VERB-PAP-SG
V*participle*tense=present*           VERB-PRP
V*plural*                             VERB-PL
V*singular*                           VERB-SG
V*                                    VERB-SG|VERB-PL
COLON                                 =
The fields are separated by a TAB character. A '=' in the substitute
column stands for the initial annotation itself; in that case the initial
annotation serves as the tag. You can specify one or more new tags for
each initial annotation. Shell-style pattern matching is supported with
the ?, \, [] and * characters. Substitution is sequential, so the order
of the patterns matters: for example, the pattern V* has to be placed
after the pattern V*participle*.
Put a backslash before any special character that should be interpreted
as a normal one.
Example:
nez nez\N[gen=m num=s!g] will be converted to nez nez\NOUN-SG|nez\NOUN-PL
If mpreptxt finds no mapping in the conversion table for an annotation
that occurs in the text, the annotation is returned unchanged as the tag.
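The ordering rule can be made concrete with a small Python sketch. It is only an illustration, not the mpreptxt implementation: glob_to_regex and convert_annotation are hypothetical names, the sketch assumes that the first matching pattern wins and that a pattern must match the whole annotation, and it handles only *, ? and backslash escapes (bracket expressions are left out for brevity).

```python
import re


def glob_to_regex(pattern):
    """Translate a shell-style pattern into a regex (supports *, ? and \\ escapes)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # A backslash makes the next character literal.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out), re.DOTALL)


def convert_annotation(annotation, rules):
    """Return the list of substitute tags; rules are tried in file order."""
    for pattern, substitute in rules:
        if glob_to_regex(pattern).fullmatch(annotation):
            if substitute == "=":
                return [annotation]       # '=' keeps the initial annotation
            return substitute.split("|")  # one or more new tags
    return [annotation]                   # no mapping: returned unchanged


if __name__ == "__main__":
    rules = [
        ("N\\[gen=m num=s\\]", "NOUN-SG"),
        ("Det*num=s*", "DET-SG"),
        ("Det*", "DET-SG|DET-PL"),
        ("COLON", "="),
    ]
    for annotation in ("N[gen=m num=s]", "Det[gen=f num=s]", "Det[gen=f]", "M"):
        print(annotation, "->", convert_annotation(annotation, rules))
```

Because the rules are scanned in order, placing Det* before Det*num=s* in this sketch would make the more specific rule unreachable, which is the situation the ordering advice warns about.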
Lexical Conversion File
Format:
<Word> <Substitute Tag>
Example:
Word    Substitute Tag
that    =\THAT
have    =\HV
be      =\BE
to      =\TO|=\PREP
The fields are separated by a TAB character; lines beginning with '#'
are treated as comments and ignored.
Example:
be =\VB will be converted to be =\BE
If neither conversion file (Tag_conversion_file or Word_conversion_file)
is specified, no substitutions are applied.
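As an illustration of the word conversion step, the following Python sketch loads a word table and applies it to one word. The function names are hypothetical, and the sketch assumes exact, case-sensitive word matching and that blank lines are skipped; neither point is stated for mpreptxt itself.

```python
def load_word_table(lines):
    """Read <Word> TAB <Substitute Tag> lines, skipping '#' comments."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # comment or blank line (skipping blanks is an assumption)
        word, substitute = line.split("\t", 1)
        table[word] = substitute
    return table


def convert_word(word, annotations, table):
    """Word-specific tags replace the annotations; other words are left alone."""
    return table.get(word, annotations)


if __name__ == "__main__":
    table = load_word_table(["# closed-class words", "be\t=\\BE", "to\t=\\TO|=\\PREP"])
    print(convert_word("be", "=\\VB", table))
    print(convert_word("nez", "nez\\N[gen=m num=s]", table))
```

In a full pipeline this lookup would run before the general tag conversion, since tags assigned to individual words take precedence over the general mappings.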
COMMAND EXAMPLES
Prepare text file for training:
mpreptxt -i text -o text.tr -w word.cnv -c tag.cnv -m MMatrix01 -n 1 -C 2 -p '|' -l '\'
Training sequence:
mtrain -t text.tr -i MMatrix01 -o MMatrix02 -C 2
Tagging sequence:
mtag -i text.tr -o text.tag -C 2 -p '|' -l '\'
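When the preparation step is driven from a script, the separators '|' and '\' are easy to get wrong in shell quoting. A minimal Python sketch, assuming mpreptxt is on the PATH and using only the options documented above, builds the argument list so that no shell quoting is needed. The helper name mpreptxt_argv is hypothetical.

```python
import subprocess


def mpreptxt_argv(input_path, output_path, matrices, tag_table=None,
                  word_table=None, nbr_fields=1, word_column=1,
                  primary="\\", secondary="|"):
    """Build the argument list for an mpreptxt training-preparation run."""
    argv = ["mpreptxt", "-i", input_path, "-o", output_path, "-m", matrices,
            "-n", str(word_column), "-C", str(nbr_fields),
            "-p", secondary, "-l", primary]
    if tag_table is not None:
        argv += ["-c", tag_table]
    if word_table is not None:
        argv += ["-w", word_table]
    return argv


def run_mpreptxt(**kwargs):
    """Run mpreptxt; passing a list bypasses the shell, so '|' and '\\' stay literal."""
    return subprocess.run(mpreptxt_argv(**kwargs), check=True)


if __name__ == "__main__":
    print(mpreptxt_argv("text", "text.tr", "MMatrix01", tag_table="tag.cnv",
                        word_table="word.cnv", nbr_fields=2, word_column=1))
```

Passing the arguments as a list rather than a single command string is the design choice here: the separator characters reach the program unchanged.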
SEE ALSO
mtrain(1)
mtag(1)
mcreate(1)
mtagfreq(1)
mprint(1)
mdiff(1)
mdiffb(1)
mcontext(1)
mbiases(1)
mhandtag(1)
AUTHOR
Gilbert ROBERT
(Gilbert.Robert@issco.unige.ch)
Copyright (c) 1998 Issco, Geneva, Switzerland
ISSCO, 54 route des Acacias
1227 Geneva, Switzerland
Comments, suggestions, and bug reports are always welcome.