mtrain - Training module for the Part-of-Speech Tagger.  

 

SYNOPSIS

mtrain
{-t <Input_text>}
{-i <compiled_input_matrix_file>}
{-o <compiled_output_matrix_file>}
{-l <Nbr_loop>}
{-C <Nbr_fields>}
{-b <biases_file>}
{-v}
 

DESCRIPTION

The training module takes as input a text represented as a sequence of ambiguity classes. It uses the Baum-Welch algorithm to train a Hidden Markov Model (HMM) for use by the tagger module.
Over the number of iterations specified with the -l option, it attempts to optimize the model parameters for the tagging program.
In the training phase, probabilities are calculated for tag sequences. The probabilities are stored in matrices and can be readjusted according to new (external) input, or by the program itself when more than one iteration (-l option) is specified.
The input is a text in which each word and special token is annotated with one or more tags. The text must be formatted with mpreptxt. The matrix file can be created by mtrain, which initializes the matrix with equi-probable values based on the tags found in the corpus. The values can be readjusted to reflect user-defined preferences as stated in the biases file. The training phase can be repeated for any number of iterations, and each iteration may assign different probabilities. The matrices are used by the tagging program mtag to calculate the most probable tag for each word in a text.
An input matrices file, prepared by mpreptxt, has to be specified (-i option) to set the initial probabilities (in the matrices A, B and PI) used for the training.
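
The equi-probable initialization described above can be illustrated with a short Python sketch. It is not the code used by mtrain or mpreptxt. It assumes that the HMM states are the tags, that the observations are the ambiguity classes, and that PI, A and B are stored as nested dictionaries; the function name init_uniform and the sample tags are hypothetical.

```python
def init_uniform(ambiguity_classes):
    """Build equi-probable PI, A and B from the ambiguity classes of a corpus.

    Assumption: states are tags, observations are ambiguity classes
    (each class is the set of tags a word may take).
    """
    classes = sorted({tuple(sorted(c)) for c in ambiguity_classes})
    tags = sorted({t for c in classes for t in c})
    n = len(tags)

    # PI: every tag is equally likely to start a sequence.
    pi = {t: 1.0 / n for t in tags}
    # A: every tag-to-tag transition is equally likely.
    a = {s: {t: 1.0 / n for t in tags} for s in tags}
    # B: a tag can only emit the classes that contain it, all equally likely.
    b = {}
    for t in tags:
        allowed = [c for c in classes if t in c]
        b[t] = {c: (1.0 / len(allowed) if t in c else 0.0) for c in classes}
    return tags, classes, pi, a, b


# Hypothetical corpus: one ambiguity class per token.
corpus = [("DET",), ("NOUN", "VERB"), ("VERB",), ("DET",), ("NOUN",)]
tags, classes, pi, a, b = init_uniform(corpus)
```

Each row of A and B, and PI itself, sums to 1, so the result is a valid starting point for re-estimation.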

mtrain readjusts the parameters and writes the new values to a compiled matrices file.
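
To make the re-estimation concrete, here is a minimal Python sketch of textbook Baum-Welch for a discrete HMM, with a loop count that plays the role of the -l option. It is an illustration only, not mtrain's implementation. It assumes that A, B and PI are plain lists, that observations are integer symbol indices, and that the sequence is short enough to need no scaling. All function names and sample values are hypothetical.

```python
def forward_backward(obs, pi, a, b):
    """Return forward and backward variables and the sequence likelihood."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[0.0] * n for _ in range(T)]
    for i in range(n):
        alpha[0][i] = pi[i] * b[i][obs[0]]
        beta[T - 1][i] = 1.0
    for t in range(1, T):
        for j in range(n):
            alpha[t][j] = b[j][obs[t]] * sum(
                alpha[t - 1][i] * a[i][j] for i in range(n))
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(
                a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
    return alpha, beta, sum(alpha[T - 1])


def baum_welch_step(obs, pi, a, b):
    """One re-estimation pass; returns new PI, A, B and the old likelihood."""
    n, T, m = len(pi), len(obs), len(b[0])
    alpha, beta, like = forward_backward(obs, pi, a, b)
    # gamma[t][i]: probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(n)]
             for t in range(T)]
    new_pi = gamma[0][:]
    new_a = [[0.0] * n for _ in range(n)]
    new_b = [[0.0] * m for _ in range(n)]
    for i in range(n):
        denom_a = sum(gamma[t][i] for t in range(T - 1))
        for j in range(n):
            # Expected number of i -> j transitions.
            num = sum(alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j]
                      for t in range(T - 1)) / like
            new_a[i][j] = num / denom_a if denom_a > 0 else a[i][j]
        denom_b = sum(gamma[t][i] for t in range(T))
        for k in range(m):
            num = sum(gamma[t][i] for t in range(T) if obs[t] == k)
            new_b[i][k] = num / denom_b if denom_b > 0 else b[i][k]
    return new_pi, new_a, new_b, like


def train(obs, pi, a, b, loops=1):
    """Repeat the re-estimation pass `loops` times (compare the -l option)."""
    history = []
    for _ in range(loops):
        pi, a, b, like = baum_welch_step(obs, pi, a, b)
        history.append(like)
    return pi, a, b, history


# Hypothetical two-state, two-symbol model and observation sequence.
pi0 = [0.6, 0.4]
a0 = [[0.7, 0.3], [0.4, 0.6]]
b0 = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1, 0, 1, 1, 0, 0]
pi1, a1, b1, history = train(obs, pi0, a0, b0, loops=5)
```

The likelihood recorded in history never decreases from one pass to the next, which is why additional iterations can only keep or improve the fit to the training text.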
 
 

OPTIONS

mtrain supports the following options:
-t Input_text
Specifies the input text formatted by the mpreptxt program.
-i Input_matrices_file
Specifies the matrices file, which is either the output of a previous training run, a file created by mcreate, or the initial file defined by mpreptxt (the default is MMinit).
-o Output_matrices_file
Specifies the output matrices file (the default is MM.cmp).
-l Loop_Number
Specifies the number of passes the training program makes over the text (the default is 1).
-b biases
Specifies a set of transition biases to be applied to the input matrices (see mbiases(1)).
-C Nbr_fields
Specifies the number of fields preceding the [BOS|EOS] field (the default is 1).
-v
Prints the program version.

INPUT/OUTPUT

Description of the input and output files used by this program.

Input ==>

- Text formatted with mpreptxt : $TEXT.tr

- Matrices file : $MMinit

Output ==>

- The formatted Matrices : $MM.cmp
 

COMMAND EXAMPLES

Training on a prepared text with an input matrices file and biases:
mtrain -t text.tr -i M1 -o M2 -b biases.lst
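
When mtrain is driven from a script, the command line above can be assembled programmatically. The Python sketch below uses only the options documented on this page. The helper name build_mtrain_argv is hypothetical, and treating the input text as the one required argument is an assumption.

```python
def build_mtrain_argv(text, in_matrices=None, out_matrices=None,
                      loops=None, biases=None, nbr_fields=None):
    """Assemble an mtrain command line from the documented options.

    Options left as None are omitted, so mtrain falls back on its defaults.
    """
    argv = ["mtrain", "-t", text]
    if in_matrices is not None:
        argv += ["-i", in_matrices]
    if out_matrices is not None:
        argv += ["-o", out_matrices]
    if loops is not None:
        argv += ["-l", str(loops)]
    if biases is not None:
        argv += ["-b", biases]
    if nbr_fields is not None:
        argv += ["-C", str(nbr_fields)]
    return argv


# Same invocation as the example above.
first = build_mtrain_argv("text.tr", "M1", "M2", biases="biases.lst")
print(" ".join(first))

# A follow-up run that reuses the previous output matrices as input
# and makes three passes over the text (file name M3 is hypothetical).
second = build_mtrain_argv("text.tr", "M2", "M3", loops=3)
print(" ".join(second))

# To execute a command: subprocess.run(first, check=True)
```

The second call reflects the -i description: the output of one training run may serve as the input matrices file of the next.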
 

SEE ALSO

mpreptxt(1)
mtag(1)
mcreate(1)
mtagfreq(1)
mprint(1)
mdiff(1)
mdiffb(1)
mcontext(1)
mbiases(1)
mhandtag(1)

AUTHOR

Gilbert ROBERT
(Gilbert.Robert@issco.unige.ch)
ISSCO, 54 route des Acacias
1227 Geneva, Switzerland

Comments, suggestions, and bug reports are always welcome.