Tagging and segmentation

The most labour-intensive part of marking up a piece of LARA text is adding the tags that indicate the root forms of inflected words. Doing this manually is possible, but it is both slow and error-prone: it is very easy to miss a word. LARA offers you two ways to speed things up. If you are working in a language that is supported by TreeTagger or one of the other taggers we have interfaced to LARA, the best option is to use that tagger to do your tagging automatically, and then edit the result.

If no tagger is available for your language, use the spreadsheet-based tagger. This still involves doing the job by hand, but you’ll only have to tag words once, and doing it with a spreadsheet means you’re less likely to make careless mistakes.

Automatic segmentation

Before you do tagging, you will probably want to segment your text automatically using the NLTK Punkt sentence tokenizer. Punkt will try to split the text into segments at sentence boundaries; you may want to clean up the result by hand, but even so it should save you quite a lot of work. Your config file needs to define values for “unsegmented_corpus” and “untagged_corpus”: segmentation reads the first file and writes the segmented result to the second. You invoke segmentation with a command-line call of the form:

python3 $LARA/Code/Python/lara_run.py segment [ <local-config-file>* ]

for example

python3 $LARA/Code/Python/lara_run.py segment $LARA/Content/kallocain/corpus/local_config.json
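
As a rough illustration, the following Python sketch writes a minimal config file for the segmentation step. Only the parameter names “id”, “unsegmented_corpus” and “untagged_corpus” come from this documentation; the helper name, the default id and any file paths you pass in are hypothetical, and a real config file will normally contain further parameters.

```python
import json
from pathlib import Path


def write_segmentation_config(config_path, unsegmented, untagged, text_id="my_text"):
    """Write a minimal config for the segment step (hypothetical helper, not part of LARA)."""
    config = {
        "id": text_id,                      # illustrative id, replace with your own
        "unsegmented_corpus": unsegmented,  # plain text before segmentation
        "untagged_corpus": untagged,        # segmented text, not yet tagged
    }
    Path(config_path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

You would call it with the path of your local config file and the two corpus paths, and then pass the resulting file to the segment command shown above.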

Automatic tagging using TreeTagger

For some languages (so far, Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Middle High German, Greek, Italian, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swahili, Swedish, Russian), you can use LARA’s interface to TreeTagger to tag a plain corpus automatically. Automatically tagged text will need human postediting, but the tagger does a large proportion of the work, perhaps around 90-95% depending on the language. You need to install TreeTagger and the relevant TreeTagger parameters file first.

You invoke the interface as:

python3 $LARA/Code/Python/lara_run.py treetagger [ <local-config-file>* ]

for example

python3 $LARA/Code/Python/lara_run.py treetagger $LARA/Content/dante/corpus/local_config.json

The local config file needs to supply values for the parameters “untagged_corpus”, “language” and “corpus”. It’s probably also a good idea to supply a value for “tagged_corpus”. If you do this, the result of doing automatic tagging is put in the “tagged_corpus” file, so you don’t risk accidentally overwriting the cleaned-up tagged version in the “corpus” file.
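
If you want to check a config file before running the tagger, a small script along the following lines may help. It is only a sketch: the function name is hypothetical and not part of LARA, and the checks simply mirror the parameters named in the paragraph above.

```python
REQUIRED_PARAMETERS = ("untagged_corpus", "language", "corpus")


def check_treetagger_config(config):
    """Return a list of problems found in a config dict (hypothetical helper)."""
    problems = [
        f"missing required parameter: {key}"
        for key in REQUIRED_PARAMETERS
        if not config.get(key)
    ]
    if not config.get("tagged_corpus"):
        # Not required, but recommended so tagger output is kept apart from "corpus".
        problems.append("tagged_corpus is not set")
    return problems
```

An empty list means that all four parameters are present. The sketch does not check that the files exist or that the language is one TreeTagger supports.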

Automatic tagging using Google Cloud

If the language is one of those supported (currently Mandarin, English, French, German, Italian, Japanese, Korean, Portuguese, Spanish, Russian), LARA can perform tagging using Google Cloud Natural Language Morphology and Dependency Trees. To use this functionality, you need to do the following:

  • Install the Google Cloud SDK and the google-cloud-language Python package.

  • Create a valid Google Cloud account and put your credentials (service account key) file in the location referenced by the environment variable GOOGLE_APPLICATION_CREDENTIALS.

  • Set the config-file parameter tag_using_google_cloud to yes (default is no).
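
The last two prerequisites can be checked with a few lines of Python. This is a minimal sketch, assuming the config file has already been loaded into a dict; the function name is hypothetical, and the sketch does not verify that the Google Cloud SDK or the google-cloud-language package is installed.

```python
import os


def google_cloud_tagging_ready(config, environ=os.environ):
    """Return (ready, reasons) for the Google Cloud prerequisites (hypothetical helper)."""
    reasons = []
    credentials = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not credentials:
        reasons.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
    elif not os.path.isfile(credentials):
        reasons.append(f"credentials file not found: {credentials}")
    # The documented default for tag_using_google_cloud is "no".
    if config.get("tag_using_google_cloud", "no") != "yes":
        reasons.append("tag_using_google_cloud is not set to yes")
    return (not reasons, reasons)
```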

Automatic tagging using other taggers

For some other languages (so far, Farsi, Icelandic, Turkish and Polish), it is possible to use a different tagger to do automatic tagging. The command-line call is exactly the same as for the languages supported by TreeTagger, but the processing is different:

  • For Farsi, a tagger based on the hazm package is run on the local machine. This package only runs under Python 2, so you need to have installed both Python 2 and hazm.

  • For Icelandic, LARA will invoke a web service to run the University of Iceland’s ABLTagger/Nefnir pipeline.

  • For Turkish, LARA will invoke a web service to run Istanbul Technical University’s Turkish NLP pipeline.

  • For Polish, LARA will invoke the Morfeusz2/Concraft pipeline if the components are installed and running.

Spreadsheet-based tagging

The idea behind the spreadsheet-based tagger is simple. For most words, you’ll tag the word the same way each time it occurs: so LARA makes you a list of all the words in your document and puts them in a spreadsheet which you fill in. It then uses the spreadsheet to tag your text. If you have several documents in the same language, you’ll probably want to share your spreadsheet between them.

There will of course be some words which do need to be tagged in more than one way. For example, in English, “thought” can either be the past tense of the verb think, or a noun. In the first case, you want to tag it as thought#think#; in the second, as thought#thought#. In the spreadsheet, you can only fill in one possibility, so you write something like “think OR thought”. Then every time “thought” occurs, you’ll get it tagged as thought#think OR thought#, and it will be easy to do a query-replace later on when you edit the result.
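
When you come to edit the tagged file, it can be useful to list the tags that still contain OR. The following sketch does this with a regular expression. It assumes the word#lemma# notation used in the examples in this document; the function name is hypothetical and the script is not part of LARA.

```python
import re

# Matches a tag such as thought#think OR thought# (word, then candidate lemmas).
AMBIGUOUS_TAG = re.compile(r"([^\s#]+)#([^#\n]*\s+OR\s+[^#\n]*)#")


def find_ambiguous_tags(tagged_text):
    """Return (word, [candidate lemmas]) for every tag that still contains OR."""
    return [
        (match.group(1), [lemma.strip() for lemma in re.split(r"\s+OR\s+", match.group(2))])
        for match in AMBIGUOUS_TAG.finditer(tagged_text)
    ]


print(find_ambiguous_tags("I thought#think OR thought# about it"))
# [('thought', ['think', 'thought'])]
```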

If you want to use spreadsheet-based tagging, you need to include a line in your config file giving a value to “lemma_dictionary_spreadsheet”; you also need definitions for “untagged_corpus” and “tagged_corpus”. So the relevant part of your config file will look something like this:

  "id": "hur_gick_det_sen",
  "untagged_corpus": "$LARA/Content/hur_gick_det_sen/corpus/hur_gick_det_sen_untagged.txt",
  "lemma_dictionary_spreadsheet": "$LARA/Content/swedish/dict/swedish.csv",
  "tagged_corpus": "$LARA/Content/hur_gick_det_sen/corpus/hur_gick_det_sen_tagged.txt",
  "corpus": "$LARA/Content/hur_gick_det_sen/corpus/hur_gick_det_sen.txt",

Here, the untagged corpus (plain text) is $LARA/Content/hur_gick_det_sen/corpus/hur_gick_det_sen_untagged.txt, the spreadsheet defining the lemmas is $LARA/Content/swedish/dict/swedish.csv, the result of performing tagging is put in $LARA/Content/hur_gick_det_sen/corpus/hur_gick_det_sen_tagged.txt, and the edited version which you’ll use for later processing is $LARA/Content/hur_gick_det_sen/corpus/hur_gick_det_sen.txt.

Once you’ve set up your config file, you can create an updated version of the tagging spreadsheet with the command:

python3 $LARA/Code/Python/lara_run.py minimaltagger_spreadsheet [ <local-config-file>* ]

for example

python3 $LARA/Code/Python/lara_run.py minimaltagger_spreadsheet $LARA/Content/hur_gick_det_sen/corpus/local_config.json

This will tell you where it’s put the temporary spreadsheet file. When you edit the spreadsheet to fill in blank entries, you can save time by marking the lemma as * for uninflected words, i.e. words where the lemma and the surface word are the same. Note also that you can add multiwords to the spreadsheet. The following Swedish example illustrates:


Here, the words kanhända, kanna, knopp, krypa and kuslig have been marked as uninflected. There is an entry marking klev as an inflected form of kliva, and also one marking klev ut as an inflected form of kliva ut.

When you have edited the temporary lemma spreadsheet, you need to copy it to the place where you have declared that the real lemma spreadsheet is kept, here $LARA/Content/swedish/dict/swedish.csv. You can then use it to perform spreadsheet-based tagging, using a command of the form:

python3 $LARA/Code/Python/lara_run.py minimaltagger [ <local-config-file>* ]

for example

python3 $LARA/Code/Python/lara_run.py minimaltagger $LARA/Content/hur_gick_det_sen/corpus/local_config.json
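
To make the idea concrete, here is a minimal sketch of dictionary-based tagging. It is not LARA’s implementation: it assumes the spreadsheet has already been read into a dict from lowercase surface words to lemmas, it leaves unknown and uninflected words untagged, and it ignores multiwords. The function name and the sample sentence are hypothetical.

```python
import re

# A word is a run of letters, optionally with internal apostrophes (e.g. didn't).
WORD = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def tag_with_dictionary(text, lemmas):
    """Append #lemma# to each word whose lemma differs from the word itself."""

    def tag(match):
        word = match.group(0)
        lemma = lemmas.get(word.lower())
        # "*" marks an uninflected word; unknown words are left for manual tagging.
        if lemma is None or lemma == "*" or lemma == word.lower():
            return word
        return f"{word}#{lemma}#"

    return WORD.sub(tag, text)


print(tag_with_dictionary("Han klev ut i en kuslig natt.", {"klev": "kliva", "kuslig": "*"}))
# Han klev#kliva# ut i en kuslig natt.
```

An entry such as “think OR thought” is passed through unchanged, so “thought” comes out as thought#think OR thought#.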

Tagging multi-word expressions

One of the most time-consuming tasks in tagging is dealing with multi-word expressions (MWEs). For example, each of the following English sentences contains an expression that can reasonably be considered an MWE:

  • He didn’t like it at all.

  • We went round and round.

  • You seem to be looking forward to it.

  • I think they have given in.

  • He shook his head.

  • They decided to blow it up.

  • She threw the whole thing away.

Note that MWEs can include inflected forms: for example, looking forward to is an inflected form of look forward to, and given in is an inflected form of give in. Note also that the MWEs do not have to be continuous. The first five examples (at all, round and round, look forward to, give in and shake one’s head) are continuous, but blow up and throw away have words in the middle that do not belong to the MWE.

When an MWE is continuous, you can tag it using the @ ... @ construct, for example:

He didn't like it @at all@.
I think they have @given in@#give in#.

But this doesn’t work for discontinuous constituents, and the last two sentences have to be tagged roughly as follows:

They decided#decide# to blow#blow up# it up#blow up#.
She threw#throw away# the whole thing away#throw away#.
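
The tagging pattern for a discontinuous MWE is mechanical once you know which words belong to it, as the following sketch shows. The helper is hypothetical and only illustrates the notation; it does not tag other inflected words in the sentence.

```python
def tag_discontinuous_mwe(words, indices, mwe_lemma):
    """Attach #mwe_lemma# to the words at the given zero-based positions."""
    positions = set(indices)
    return " ".join(
        f"{word}#{mwe_lemma}#" if index in positions else word
        for index, word in enumerate(words)
    )


sentence = ["She", "threw", "the", "whole", "thing", "away"]
print(tag_discontinuous_mwe(sentence, [1, 5], "throw away") + ".")
# She threw#throw away# the whole thing away#throw away#.
```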

LARA provides functionality to make tagging of MWEs more systematic. There are four steps:

  • Create a file defining the MWEs.

  • Run a script that produces a list of possible MWEs found in the tagged text file.

  • Edit the file of possible matches to say which ones are correct.

  • Run a second script to insert tags for the MWEs that have been marked as correct in the tagged text file.

The details are as follows.

Define the MWEs

MWEs are put in a plain text file referenced by the config file parameter mwe_file. There is one definition per line. Empty lines and lines starting with a hash (#) are ignored.

The simplest kind of MWE definition is a fixed phrase. Here, all the words are written in lowercase, for example:

at all
round and round

If one or more of the words can be inflected, then these words must be marked as such. You can do this either by writing the words that can be inflected in uppercase, for example:

LOOK forward to
THROW away

or by placing asterisks at the beginnings and ends of these words, i.e.:

*look* forward to
*give* in
*blow* up
*throw* away

So the MWE definition LOOK forward to or *look* forward to will match “I look forward to”, “she is looking forward to”, etc.

An MWE may contain other words that can vary. In English, a common case is a possessive pronoun, e.g. “shake one’s head” (“I shook my head”, “He shook his head”, etc). You can handle MWEs of this kind by adding a line to define a class of words, and then using the name of the class in the MWE definition, for example:

# Class definitions
class: one's my your his her its our their
class: oneself myself yourself himself herself itself ourselves themselves

SHAKE one's head
TAKE one's time
ENJOY oneself
BRACE oneself
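
The following sketch shows one way the definition format described so far could be read into a data structure. It is an illustration of the file format only, not LARA’s parser: the function names and the labels “inflected”, “class” and “fixed” are hypothetical, and any other directive lines (such as transform lines) are simply skipped.

```python
def is_starred(word):
    """True for a word written between asterisks, e.g. *look*."""
    return len(word) > 2 and word.startswith("*") and word.endswith("*")


def parse_mwe_definitions(lines):
    """Return (classes, mwes) from the lines of an MWE definition file."""
    stripped = [line.strip() for line in lines]
    # First pass: collect class definitions, wherever they appear in the file.
    classes = {}
    for line in stripped:
        if line.startswith("class:"):
            parts = line[len("class:"):].split()
            if parts:
                classes[parts[0]] = parts[1:]
    # Second pass: turn every remaining definition line into a list of (word, kind).
    mwes = []
    for line in stripped:
        if not line or line.startswith(("#", "class:", "transform:")):
            continue
        entry = []
        for word in line.split():
            if is_starred(word):
                entry.append((word[1:-1].lower(), "inflected"))
            elif word.isupper():
                entry.append((word.lower(), "inflected"))
            elif word in classes:
                entry.append((word, "class"))
            else:
                entry.append((word, "fixed"))
        mwes.append(entry)
    return classes, mwes
```

For the line SHAKE one's head, with the class one's defined as above, this gives [("shake", "inflected"), ("one's", "class"), ("head", "fixed")].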

In some languages, the words in an MWE may occur in more than one order. A common case in French is reflexive verbs, where the reflexive pronoun usually comes before the verb (“Je me repose”) but comes after it in an imperative clause (“Reposez-vous”). To be able to handle this systematically, it is also possible to add transform lines to a file of MWE definitions. A transform line starts with transform: and maps a left hand side to a right hand side. “Variables”, i.e. words which are the same on both sides, are indicated by enclosing them in asterisks. So the French example is handled as follows:

transform: se *verb* -> *verb* toi
class: toi toi vous

This means that for any MWE entry matching the left hand side, e.g. se REPOSER, a second entry is automatically added, here REPOSER toi.
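
The effect of a transform line can be sketched as a simple pattern match over the words of an MWE entry. The code below is an illustration under the assumption that the two sides are separated by -> and that matching is done word by word; the function names are hypothetical and this is not LARA’s own code.

```python
def parse_transform(line):
    """Split 'transform: LHS -> RHS' into two lists of tokens."""
    body = line[len("transform:"):]
    left, right = body.split("->")
    return left.split(), right.split()


def is_variable(token):
    """True for a variable such as *verb*."""
    return len(token) > 2 and token.startswith("*") and token.endswith("*")


def apply_transform(transform, entry):
    """Return the extra MWE entry produced by the transform, or None if it does not match."""
    left, right = transform
    words = entry.split()
    if len(words) != len(left):
        return None
    bindings = {}
    for pattern, word in zip(left, words):
        if is_variable(pattern):
            bindings[pattern] = word  # a variable matches any single word
        elif pattern != word:
            return None               # a literal word must match exactly
    return " ".join(bindings.get(token, token) for token in right)


rule = parse_transform("transform: se *verb* -> *verb* toi")
print(apply_transform(rule, "se REPOSER"))
# REPOSER toi
```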

Find possible MWE matches

Once you have a file of MWE definitions, you can apply them to your text to get a file of possible matches. You do this with a call of the form

python3 $LARA/Code/Python/lara_run.py mwe_annotate <ConfigFile>

for example:

python3 $LARA/Code/Python/lara_run.py mwe_annotate local_config.json

Here, <ConfigFile> is as usual the config file. LARA will match the MWE definitions against the file defined by the corpus parameter, and then write out a JSON-formatted trace file of possible matches to the tmp_resources directory. The name of the file will be <Id>_tmp_mwe_annotations.json.
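
To give a feel for what a possible match is, the sketch below looks for the words of one MWE in order, allowing other words in between, and counts the words that were skipped. It is a simplified illustration and not the algorithm LARA uses: the input format (a list of surface word and lemma pairs), the zero-based indices and the way skipped words are counted are all assumptions made for the example.

```python
def matches_at(tokens, position, word, inflected):
    """Compare an MWE word with the lemma (if inflectable) or the surface word."""
    surface, lemma = tokens[position]
    return (lemma if inflected else surface.lower()) == word


def find_mwe_matches(tokens, mwe):
    """tokens: list of (surface, lemma); mwe: list of (word, inflected) pairs."""
    results = []
    for start in range(len(tokens)):
        if not matches_at(tokens, start, *mwe[0]):
            continue
        indices = [start]
        position = start + 1
        for word, inflected in mwe[1:]:
            # Move forward until the next MWE word is found or the text ends.
            while position < len(tokens) and not matches_at(tokens, position, word, inflected):
                position += 1
            if position == len(tokens):
                break
            indices.append(position)
            position += 1
        if len(indices) == len(mwe):
            # Words inside the matched span that do not belong to the MWE.
            skipped = (indices[-1] - indices[0] + 1) - len(indices)
            results.append({"word_index_list": indices, "skipped": skipped})
    return results


tokens = [("They", "they"), ("decided", "decide"), ("to", "to"),
          ("blow", "blow"), ("it", "it"), ("up", "up")]
print(find_mwe_matches(tokens, [("blow", True), ("up", False)]))
# [{'word_index_list': [3, 5], 'skipped': 1}]
```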

Edit the file of possible matches

The next step is to edit the trace file to mark the correct matches. A record in the file is of the form:

{"match": ...,
 "mwe": ...,
 "ok": ...,
 "skipped": ...,
 "word_index_list": ...,
 "words": ...}

for example:

    "match": "Whatever *goes* *upon* four legs or has wings is a friend",
    "mwe": "go upon",
    "ok": "mwe_status_unknown",
    "skipped": 0,
    "word_index_list": [
    "words": [

Here, the first two lines show you the possible match, and the task is to edit the third line. If the match is correct, you need to change mwe_status_unknown to mwe_okay, and if it is not, you need to change it to mwe_not_okay.

The items are ordered by the value of skipped, the number of words skipped when performing the match. When skipped is zero, the matches are usually correct. When it is three or more, the matches are usually incorrect. Values of one or two are intermediate. So at the beginning of the file, you can accept most of the matches, and at the end you can reject most of them.
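
While you are working through the file, a short script can tell you how many records are still undecided at each value of skipped, and catch misspelled status values. The sketch below assumes that the trace file is a JSON list of records of the form shown above; that top-level structure is an assumption, and the function names are hypothetical.

```python
import json
from collections import Counter

VALID_STATUSES = {"mwe_status_unknown", "mwe_okay", "mwe_not_okay"}


def load_records(path):
    """Load the trace file (assumed here to be a JSON list of records)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarise_annotations(records):
    """Return counts per (skipped, ok) pair and the records with an unrecognised status."""
    counts = Counter((record["skipped"], record["ok"]) for record in records)
    invalid = [record for record in records if record["ok"] not in VALID_STATUSES]
    return counts, invalid
```

Any record returned in the second value has a status other than mwe_status_unknown, mwe_okay or mwe_not_okay, and probably contains a typing mistake.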

Insert the MWEs marked as correct

When you have edited your file of possible matches, you can insert them into the text file. You do this by copying your annotated trace file to the file referenced by the config file parameter mwe_annotations_file, and then making a call of the form:

python3 $LARA/Code/Python/lara_run.py apply_mwe_annotations <ConfigFile>

for example:

python3 $LARA/Code/Python/lara_run.py apply_mwe_annotations local_config.json

This will process the file referenced by corpus, inserting the MWE matches marked as correct in the file referenced by mwe_annotations_file. Two new files will be written to the tmp_resources directory: <Id>_mwe_processed_corpus.txt (the new version of the corpus), and <Id>_mwe_trace.html (a human-readable trace file showing the changes made).