TAGGER Overview
Susan Armstrong, Pierrette Bouillon,
Gilbert Robert
The Hidden Markov Model relies on three parameters, commonly referred to as the A, B and PI matrices. For a tagging application, the A matrix records the probabilities of transitions between any two tags (or states), e.g. the probability that a determiner precedes a noun. The B matrix records the relation between the occurrence of a given tag (or state) and the set of ambiguous tags in which it occurs (this set is often referred to as the ambiguity class or equivalence class). The PI matrix records the probability that a tag occurs in the initial state (i.e. at the beginning of a sentence).
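To make the roles of the three matrices concrete, here is a minimal Python sketch. The tag names are borrowed from the tagset used in this document, but every probability, the dictionary layout and the helper `sequence_probability` are assumptions made for illustration; they are not taken from the tagger itself.

```python
# Toy HMM parameters for a three-tag tagger. All probabilities are invented.

# PI: probability that a tag occurs at the beginning of a sentence.
PI = {"DET-SG": 0.6, "NOUN-SG": 0.3, "VERB-P3SG": 0.1}

# A: A[prev][nxt] is the probability that tag `nxt` follows tag `prev`.
A = {
    "DET-SG": {"DET-SG": 0.0, "NOUN-SG": 0.9, "VERB-P3SG": 0.1},
    "NOUN-SG": {"DET-SG": 0.1, "NOUN-SG": 0.2, "VERB-P3SG": 0.7},
    "VERB-P3SG": {"DET-SG": 0.5, "NOUN-SG": 0.3, "VERB-P3SG": 0.2},
}

# B: B[tag][cls] is the probability of observing ambiguity class `cls`
# when the true tag is `tag`. Classes are written as colon-joined tag names.
B = {
    "DET-SG": {"DET-SG": 0.5, "DET-SG:NOUN-SG": 0.5},
    "NOUN-SG": {"NOUN-SG": 0.5, "DET-SG:NOUN-SG": 0.2, "NOUN-SG:VERB-P3SG": 0.3},
    "VERB-P3SG": {"VERB-P3SG": 0.75, "NOUN-SG:VERB-P3SG": 0.25},
}


def sequence_probability(tags, classes):
    """Joint probability of a tag sequence and the observed ambiguity classes."""
    # Start with the initial-state probability and the first observation.
    prob = PI[tags[0]] * B[tags[0]].get(classes[0], 0.0)
    # Then multiply in one transition and one observation per following word.
    for prev, cur, cls in zip(tags, tags[1:], classes[1:]):
        prob *= A[prev][cur] * B[cur].get(cls, 0.0)
    return prob
```

For example, `sequence_probability(["DET-SG", "NOUN-SG"], ["DET-SG:NOUN-SG", "NOUN-SG"])` multiplies the initial probability of a determiner, the determiner-to-noun transition, and the two class probabilities. A tagger would compare such scores across all candidate tag sequences.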
The program requires that the text contain one word per line, including annotations and sentence boundary markers. Command line options specify the fields containing the relevant information, which gives some flexibility in the range of input formats accepted. The separators between the annotations can also be declared by the user with command line options. Any line beginning with ``#'' is considered a comment line and is ignored. A filter written for mpreptxt automatically inserts ``#'' markers at the beginning of lines containing only formatting information (e.g. paragraph and sentence initial and final tags but no textual words). Users may also wish to comment out text that might bias the tagger inappropriately, e.g. titles, reduced list items, etc.
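The input conventions described above can be sketched as a small reader. This is only an illustration: the function name `read_tokens`, the tab separator and the field positions are hypothetical stand-ins for whatever the command line options select, and skipping blank lines is an extra assumption not stated in the text.

```python
def read_tokens(lines, sep="\t", word_field=0, annotation_field=1):
    """Yield (word, annotation) pairs from one-word-per-line input.

    `sep`, `word_field` and `annotation_field` are hypothetical parameters
    standing in for the tagger's command line options.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Lines beginning with "#" are comments; blank lines are skipped too
        # (the latter is an assumption of this sketch).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(sep)
        yield fields[word_field], fields[annotation_field]


sample = [
    "# <p>\n",
    "La\tle\\DET-SG:la\\NOUN-SG:le\\PRON\n",
    "Commission\tcommission\\NOUN-SG\n",
]
tokens = list(read_tokens(sample))
```

Here the comment line is dropped and `tokens` holds one pair per remaining line, each consisting of the word form and its unparsed annotation string.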
| 2\1 | La | BOS | la\Noun[gen=m num=s]|le\Pron[gen=f num=s per=3]|le\Det[gen=f num=s] |
| 2\5 | Commission | | commission\Noun[gen=f num=s] |
| 2\9 | peut | | pouvoir\Verb[mode=ind tns=pres num=s per=3] |
| 2\20 | -elle | | elle\Pron[gen=f num=s per=3] |
| 2\23 | : | | =\COLON |
| 2\24 | 1 | | =\M |
| 2\25 | ) | | =\CPARENTH |
| 2\26 | indiquer | | =\Verb[mode=inf tns=pres] |
| 2\33 | le | | =\Pron[gen=m num=s per=3]|=\Det[gen=m num=s] |
| 2\40 | nombre | | =\Noun[gen=m num=s] |
| 2\42 | d' | | un\Det[gen=m!f num=pl]|de\Prep[form=surface] |
| 2\44 | agents | | agent\Noun[gen=m num=pl] |
| 2\50 | temporaires | | temporaire\Adj[gen=m num=pl]|temporaire\Adj[gen=f num=pl] |
| 2\63 | travaillant | | travailler\Verb[mode=part tns=pres] |
| 2\73 | dans | | dan\Noun[gen=m num=pl]|=\Prep[form=surface] |
| 2\78 | ses | | son\Det[gen=m!f num=pl per=3] |
| 2\81 | services | EOS | service\Noun[gen=m num=pl] |
| Can | BOS | can/N[num=m gen=n]|can/|can/V[tns=pres type=m!v] |
| the | | =/Det[typ=def] |
| Commission | | =/N[typ=p1!p2] |
| say | | =/V[vfm=bse typ=v] |
| : | | =/COLON |
| 1. | | =/ENUM |
| how | | =/Adv[deg=pos wh=q] |
| many | | =/Det[typ=gen num=pl]|=/Pro[typ=gen per=3 num=pl gen=n] |
| temporary | | =/A[deg=pos] |
| officials | | official/N[num=pl typ=c gen=m!f] |
| are | | be/V[tns=pres num=sg per=2 typ=a]|be/V[tns=pres num=pl typ=a] |
| working | | work/V[vfm=prp typ=v] |
| at | | =/Adp[pos=pre] |
| the | | =/Det[typ=def] |
| commission | EOS | =/N[typ=p1!p2] |
| Frequency | French Tag Ambiguities | Frequency | English Tag Ambiguities |
| 2032 | DET-PL:DET-SG:PREP-DE | 531 | NN1:VVB |
| 1133 | DET-SG:NOUN-SG:PRON | 384 | NN2:VVZ |
| 717 | DET-PL:PREP-DE | 334 | PRP:TO0 |
| 670 | DET-SG:PRON | 373 | AJ0:NN1 |
| 579 | DET-PL:PRON | 285 | VVD:VVN |
| 501 | ADJ-SG:NOUN-SG | 162 | AV0:PRP |
| 456 | NOUN-SG:VERB-P1P2:VERB-P3SG | 135 | DT0:PNI |
| 379 | ADJ-PL:NOUN-PL | 131 | CJS:PRP |
| 376 | DET-SG:NUM:PRON | 127 | ART:ZZ0 |
| Adj[gen=f num=pl] | ADJ-PL |
| Adj[gen=f num=s] | ADJ-SG |
| Adj[gen=m num=s] | ADJ-SG |
| Adj[gen=m num=pl] | ADJ-PL |
| Adj[gen=m!f num=s] | ADJ-SG |
| Adj[gen=m!f num=s!pl] | ADJ-INV |
| ne | =/NEG |
| n' | =/NEG |
| comme | =/COMME |
| que | =/COMME |
| qu' | =/CONJQUE |
| de | de/PREP-DE:de/DET-SG:de/DET-PL |
| d' | de/PREP-DE:de/DET-SG:de/DET-PL |
| La | BOS | le\DET-SG:la\NOUN-SG:le\PRON |
| Commission | | commission\NOUN-SG |
| peut | | pouvoir\VERB-P3SG |
| -elle | | elle\PRON |
| : | | =\COLON |
| 1 | | =\M |
| ) | | =\CPARENTH |
| indiquer | | =\VERB-INF |
| le | | =\DET-SG:=\PRON |
| nombre | | =\NOUN-SG |
| d' | | de\PREP-DE:de\DET-SG:de\DET-PL |
| agents | | agent\NOUN-PL |
| temporaires | | temporaire\ADJ-PL |
| travaillant | | travailler\PAP-SG |
| dans | | dan\NOUN-PL:dans\PREP |
| ses | | son\DET-PL |
| services | EOS | service\NOUN-PL |
This is confirmed by the experiments reported in [Feldweg95] using the Xerox tagger: ``the performance of the resulting HMM is very poor if no initial biases are used to help the training process find suitable parameters''.
The use of word equivalence classes was first introduced for POS tagging in [Kupiec92] and is employed in the Xerox tagger described in [Cutting and al.92]. This method simplifies the model by generalizing over classes of words displaying the same set of ambiguous tags (instead of treating the set of tags assigned to each individual word as a unique class), thus reducing complexity and improving efficiency. In iterative passes, the parameters are reestimated and can be influenced by the user in a number of ways.
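The grouping of words into equivalence classes can be illustrated with a short Python sketch. The function `ambiguity_classes` and the four-word lexicon are hypothetical and chosen only to show the idea; the colon-joined class names mirror the notation used in the tables of this document.

```python
from collections import defaultdict


def ambiguity_classes(lexicon):
    """Group words by the set of tags they can take.

    `lexicon` maps each word to its candidate tags. Words sharing the same
    tag set fall into the same class, so the model needs one set of
    parameters per class rather than per word.
    """
    classes = defaultdict(list)
    for word, tags in lexicon.items():
        # Sort so that the class name does not depend on tag order.
        key = ":".join(sorted(set(tags)))
        classes[key].append(word)
    return dict(classes)


# Illustrative lexicon (an assumption, not extracted from the tagger's data).
lexicon = {
    "le": ["DET-SG", "PRON"],
    "la": ["DET-SG", "NOUN-SG", "PRON"],
    "l'": ["PRON", "DET-SG"],
    "nombre": ["NOUN-SG"],
}
classes = ambiguity_classes(lexicon)
```

In this toy lexicon, "le" and "l'" end up in the same class `DET-SG:PRON`, so four words are covered by three classes. On a full lexicon the reduction is much larger, which is where the gain in complexity and efficiency comes from.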
| DET-SG | VERB-P1P2 | =0 |
| | VERB-P3SG | =0 |
| | VERB-P3PL | =0 |
| | NOUN-SG | +3 |
| | ADJ-SG | +3 |
| DET-PL | VERB-P1P2 | =0 |
| | VERB-P3PL | =0 |
| | VERB-P3SG | =0 |
| | NOUN-PL | +3 |
| | ADJ-PL | +3 |
| !PI | DET-SG | +5 |
| | DET-PL | +5 |
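One way such biases could act on a row of the transition matrix is sketched below. The semantics are an assumption, since this section does not define them: the sketch reads ``=0'' as forcing a probability to zero and ``+n'' as multiplying the current weight by n, after which the row is renormalised. The function `apply_biases` and the uniform starting row are hypothetical.

```python
def apply_biases(row, biases):
    """Return a renormalised copy of one probability row after applying biases.

    Assumed (hypothetical) semantics: "=x" sets the entry to x,
    "+n" multiplies the current entry by n.
    """
    biased = dict(row)
    for tag, bias in biases.items():
        if bias.startswith("="):
            biased[tag] = float(bias[1:])
        elif bias.startswith("+"):
            biased[tag] *= float(bias[1:])
        else:
            raise ValueError(f"unknown bias {bias!r}")
    total = sum(biased.values())
    if total == 0:
        raise ValueError("biases leave no probability mass in this row")
    return {tag: weight / total for tag, weight in biased.items()}


# Transitions out of DET-SG, starting from a uniform row (invented values).
row = {
    "VERB-P1P2": 0.2,
    "VERB-P3SG": 0.2,
    "VERB-P3PL": 0.2,
    "NOUN-SG": 0.2,
    "ADJ-SG": 0.2,
}
det_sg_biases = {
    "VERB-P1P2": "=0",
    "VERB-P3SG": "=0",
    "VERB-P3PL": "=0",
    "NOUN-SG": "+3",
    "ADJ-SG": "+3",
}
biased_row = apply_biases(row, det_sg_biases)
```

Under these assumptions the verb transitions out of DET-SG are ruled out and the remaining mass is shared between NOUN-SG and ADJ-SG. The same function could be applied to the PI vector for the ``!PI'' entries.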
As for the mtrain program, the tagger assumes that the text and associated data files have been prepared by the mpreptxt module and that the compiled matrices file has been instantiated by the mtrain program.
[Cutting and al.92] D. Cutting, J. Kupiec, J. Pedersen, P.
Sibun A Practical Part-of-Speech Tagger. Proceedings of
the 3rd Conference on Applied Natural Language
Processing, Trento, March 31st--April 3rd, 1992,
133--140.
[Elworthy94] D. Elworthy Does Baum-Welch Re-Estimation
Help Taggers? ACL Conference on Applied Natural Language
Processing. Stuttgart, October 1994.
[Feldweg95] H. Feldweg Implementation and evaluation of a
German HMM for POS disambiguation. EACL SIGDAT
workshop, Dublin, 1995.
[Kupiec92] J. Kupiec Robust Part-of-Speech Tagging Using a
Hidden Markov Model. Computer Speech and Language, vol.
6, pp. 225--242, 1992.
[Rabiner89] L.R. Rabiner A Tutorial on Hidden Markov
Models and Selected Applications in Speech Recognition.
A. Waibel and K-F. Lee, eds., Readings in Speech
Recognition, Morgan Kaufmann, San Mateo, 267--296.
[Russell and Petitpierre95] G. Russell and D. Petitpierre
MMORPH - The Multext Morphology. Version 2.0, March
1995, MULTEXT deliverable report for task 2.3.1.
[Sanchez95] F. Sanchez Development of a Spanish Version of
the Xerox Tagger. CRATER/WP6/FR1, May 19, 1995.
[Schmid95] H. Schmid Improvements in Part-Of-Speech
Tagging with an Application to German. EACL SIGDAT
workshop, Dublin, 1995.
[Tzoukermann95] E. Tzoukermann Combining Linguistic
Knowledge and Statistical Learning in French. EACL
SIGDAT workshop, Dublin, 1995.
[Veronis94] J. Veronis et al. MULTEXT: Segmentation Tool.
Version 2.0, March 1995, MULTEXT deliverable Report for
Task 2.2.
[Viterbi67] A.J. Viterbi Error bounds for convolutional
codes and an asymptotically optimum decoding algorithm.
IEEE Trans. Informat. Theory, vol. IT-13, pp. 260-269,
Apr. 1967.