NAME
mtrain - Training module for the Part-of-Speech Tagger.
SYNOPSIS
-
mtrain
{-t <Input_text>}
{-i <compiled_input_matrix_file>}
{-o <compiled_output_matrix_file>}
{-l <Nbr_loop>}
{-C <Nbr_fields>}
{-b <biases_file>}
{-v}
DESCRIPTION
The training module takes as input a text represented as a sequence of
ambiguity classes. It uses the Baum-Welch algorithm to train a
Hidden Markov model (HMM) for use by the tagger module.
Over a number of iterations (set with the -l option) it attempts
to optimize the model parameters for the tagging program.
In the training phase, probabilities are calculated for tag sequences.
The probabilities are stored in matrices and can be readjusted according
to new (external) input, or by the program itself when more than one
iteration (-l option) is specified.
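The re-estimation described above can be illustrated with a textbook
Baum-Welch pass. This is a minimal sketch, not mtrain's actual code: the
function names, the in-memory list-of-lists layout for A, B and PI, and the
toy tag set and ambiguity classes are all assumptions made for illustration.
Observations are indices of ambiguity classes; a tag can only emit a class
that contains it.

```python
import math

def baum_welch_step(A, B, PI, obs):
    """One re-estimation pass; returns (A, B, PI, log-likelihood of old model)."""
    n, T, m = len(PI), len(obs), len(B[0])

    # Forward pass, rescaled at every position to avoid underflow.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [PI[i] * B[i][obs[0]] for i in range(n)]
        else:
            prev = alpha[t - 1]
            row = [sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                   for j in range(n)]
        s = sum(row)
        scale.append(s)
        alpha.append([x / s for x in row])

    # Backward pass, using the same scaling factors.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]

    # gamma[t][i]: probability of tag i at position t given the whole text.
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]

    # Expected number of i -> j transitions, summed over positions.
    xi_sum = [[0.0] * n for _ in range(n)]
    for t in range(T - 1):
        for i in range(n):
            for j in range(n):
                xi_sum[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                 * beta[t + 1][j] / scale[t + 1])

    new_PI = gamma[0][:]
    new_A, new_B = [], []
    for i in range(n):
        out = sum(gamma[t][i] for t in range(T - 1))
        new_A.append([x / out for x in xi_sum[i]] if out > 0 else A[i][:])
        tot = sum(gamma[t][i] for t in range(T))
        emit = [0.0] * m
        for t in range(T):
            emit[obs[t]] += gamma[t][i]
        new_B.append([x / tot for x in emit] if tot > 0 else B[i][:])
    return new_A, new_B, new_PI, sum(math.log(s) for s in scale)

def train(A, B, PI, obs, loops=1):
    """Repeat the re-estimation 'loops' times (compare the -l option)."""
    history = []
    for _ in range(loops):
        A, B, PI, loglik = baum_welch_step(A, B, PI, obs)
        history.append(loglik)
    return A, B, PI, history

# Toy model (assumed): tags DET, NOUN, VERB; ambiguity classes
# 0={DET}, 1={NOUN}, 2={VERB}, 3={NOUN,VERB}; equi-probable start.
A0 = [[1 / 3] * 3 for _ in range(3)]
B0 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.5], [0.0, 0.0, 0.5, 0.5]]
PI0 = [1 / 3] * 3
OBS = [0, 1, 2, 0, 3, 3, 0, 1, 3]

if __name__ == "__main__":
    _, _, _, history = train(A0, B0, PI0, OBS, loops=5)
    print(history)  # log-likelihood before each pass
```

Each pass can only keep or raise the likelihood of the training text, which
is why additional loops may assign different probabilities.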
The input is a text where each word and special token is annotated
with one or more tags. The text must be formatted with mpreptxt.
The matrix file can be created by mtrain
which initializes the matrix with equi-probable values based on the tags
found in the corpus. The values can be readjusted to reflect user-defined
preferences as stated in the biases file. This training phase can be repeated
for any number of iterations where each iteration may assign different
probabilities. The matrices are used by the tagging program mtag
to calculate the most probable tag for each word in a text.
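The equi-probable starting point mentioned above can be pictured as follows.
This is only a sketch under stated assumptions: the name init_matrices, the
in-memory layout, and the reading of "equi-probable" (uniform A and PI, and
for B an even spread over the ambiguity classes that contain each tag) are
illustrative and do not describe the actual matrices file format.

```python
def init_matrices(tags, classes):
    """Build equi-probable A, B and PI (hypothetical helper, not part of mtrain).

    tags    -- list of tag names found in the corpus
    classes -- list of ambiguity classes, each a set of tag names
    """
    n = len(tags)
    # Every tag is equally likely to start a sequence or follow any tag.
    PI = [1.0 / n] * n
    A = [[1.0 / n] * n for _ in range(n)]
    # A tag can only produce the ambiguity classes it belongs to,
    # and all of those classes start out equally likely.
    B = []
    for tag in tags:
        member = [1.0 if tag in cls else 0.0 for cls in classes]
        total = sum(member)
        B.append([x / total for x in member] if total else member)
    return A, B, PI

if __name__ == "__main__":
    tags = ["DET", "NOUN", "VERB"]
    classes = [{"DET"}, {"NOUN"}, {"VERB"}, {"NOUN", "VERB"}]
    A, B, PI = init_matrices(tags, classes)
    print(B)
```

Zero entries in B encode which tags are impossible for a given ambiguity
class; biases would then shift the remaining values before training.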
An input matrices file, prepared by mpreptxt,
has to be specified (-i option) to set the initial probabilities
(in the matrices A, B and PI) used for the training.
mtrain readjusts the parameters and writes the new values to a
compiled matrices file.
OPTIONS
-
mtrain supports the following options:
-
-t Input_text
-
Specifies the input text, formatted by the mpreptxt program.
-
-i Input_matrices_file
-
Specifies the input matrices file, which is either the output of a previous
training run, a file created by mcreate, or the initial file produced by
mpreptxt (the default is MMinit).
-
-o Output_matrices_file
-
Specifies the output matrices file (the default is MM.cmp).
-
-l Loop_Number
-
Specifies the number of passes the training program makes over the text
(the default is 1).
-
-b biases
-
Specifies a set of transition biases to be applied to the input matrices
(see mbiases(1)).
-
-C Nbr_fields
-
Specifies the number of fields preceding the [BOS|EOS] field (the default is 1).
-
-v
-
Print the program version.
INPUT/OUTPUT
Description of the input and output files used by this program.
Input ==>
- Text formatted with mpreptxt : $TEXT.tr
- Matrices file : $MMinit
Output ==>
- The compiled matrices file : $MM.cmp
COMMAND EXAMPLES
Training on a prepared text with an input matrices file and biases:
mtrain -t text.tr -i M1 -o M2 -b biases.lst
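When mtrain is driven from a script, the options listed in OPTIONS map
directly onto an argument list. The sketch below assumes mtrain is on the
PATH; the helper name mtrain_argv and its parameter names are hypothetical,
and only the flags -t, -i, -o, -l, -b and -C are taken from this page.
Options left unset are omitted so that mtrain applies its own defaults.

```python
def mtrain_argv(text, matrices_in=None, matrices_out=None,
                loops=None, biases=None, nbr_fields=None):
    """Assemble an mtrain command line (hypothetical helper)."""
    argv = ["mtrain", "-t", text]
    optional = (
        ("-i", matrices_in),   # input matrices file
        ("-o", matrices_out),  # output matrices file
        ("-l", loops),         # number of training passes
        ("-b", biases),        # biases file
        ("-C", nbr_fields),    # fields preceding the [BOS|EOS] field
    )
    for flag, value in optional:
        if value is not None:
            argv += [flag, str(value)]
    return argv

if __name__ == "__main__":
    # Same invocation as the command example, as an argument list.
    print(mtrain_argv("text.tr", "M1", "M2", biases="biases.lst"))
```

The resulting list can be handed to subprocess.run(argv, check=True) to run
the training; feeding one run's output file back in as the next run's input
file continues training from the previous result.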
SEE ALSO
mpreptxt(1)
mtag(1)
mcreate(1)
mtagfreq(1)
mprint(1)
mdiff(1)
mdiffb(1)
mcontext(1)
mbiases(1)
mhandtag(1)
AUTHOR
Gilbert ROBERT
(Gilbert.Robert@issco.unige.ch)
ISSCO, 54 route des Acacias
1227 Geneva, Switzerland
Comments, suggestions, and bug reports are always welcome.