Phonetic texts

“Phonetic texts” are a new piece of functionality currently at an experimental stage. Normal LARA texts have three levels of structure: pages, segments and words. In contrast, a phonetic LARA text is divided into pages, words, and letter groups. The intention is to illustrate the relationship between words and sounds. An example of a phonetic text posted on the LARA examples page is the famous pronunciation poem The Chaos.

The notation for phonetic texts is the same as for normal texts, except that the hashtags are not used to associate words with lemmas, but rather letter groups with phonetic values. A typical line in a phonetic text looks like this:

“||@Wh@#w#|@a@#ɒ#|@t@#t#|| ||@are@#ɑː#|| ||@y@#j#|@ou@#uː#|| ||@d@#d#|@o@#uː#|@i@#ɪ#|@ng@#ŋ#|| ||@h@#h#|@ere@#i‍ə#||?” ||@h@#h#|@e@#iː#|| ||@a@#ɑː#|@s@#s#|@k@#k#|@ed@#t#|| ||@th@#ð#|@e@#ə#|| ||@d@#d#|@r@#ɹ#|@u@#ʌ#|@n@#ŋ#|@k@#k#||.

This represents the sentence ““What are you doing here?” he asked the drunk”.
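To make the notation concrete, here is a minimal Python sketch that reads a line in this format back into words and letter-group/phonetic-value pairs. The function name `parse_phonetic_line` and the two regular expressions are illustrative assumptions, not part of the LARA code base; the sketch only relies on the conventions visible in the example above (words wrapped in `||`, groups written as `@letters@#value#`).

```python
import re

# One letter group: @letters@#phonetic value#
GROUP_RE = re.compile(r"@([^@]*)@#([^#]*)#")
# One word: everything between a pair of double bars
WORD_RE = re.compile(r"\|\|(.+?)\|\|")


def parse_phonetic_line(line):
    """Return a list of words, each a list of (letters, phonetic_value) pairs."""
    words = []
    for match in WORD_RE.finditer(line):
        words.append(GROUP_RE.findall(match.group(1)))
    return words


line = "||@Wh@#w#|@a@#ɒ#|@t@#t#|| ||@are@#ɑː#|| ||@y@#j#|@ou@#uː#||"
for groups in parse_phonetic_line(line):
    word = "".join(letters for letters, _ in groups)
    sounds = "".join(value for _, value in groups)
    print(word, "->", sounds)
```

Punctuation and quotation marks sit outside the `||` delimiters, so this sketch simply skips them.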

It would evidently be very tedious to construct phonetic texts by hand, so a script is provided which converts a normal LARA text into a phonetic version, keeping all the formatting unchanged.

Using the conversion script

In order to invoke the conversion script, your config file in most cases needs to include declarations for the following parameters:

  • “phonetic_lexicon_plain”: Points to a file containing a plain phonetic lexicon.

  • “phonetic_lexicon_aligned”: Points to a file containing an aligned phonetic lexicon.

The plain phonetic lexicon should be a JSON file which consists of a dict whose keys are words and whose values are phonetic strings. For example, the English phonetic lexicon starts:

  "a": "æ",
  "aah": "ˈɑː",
  "aardvark": "ˈɑːdvɑːk",
  "aardvarks": "ˈɑːdvɑːks",
  "aardwolf": "ˈɑːdwʊlf",

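As a rough illustration of how a plain phonetic lexicon of this shape can be used, the sketch below wraps the entries shown above in braces so that they form a complete JSON dict, looks up one word, and lists words that have no entry. The helper `missing_words` and the choice to lowercase words before lookup are assumptions made for the example, not a description of what the LARA script does.

```python
import json

# The entries shown above, wrapped in braces to form a complete JSON dict.
lexicon_json = """
{
  "a": "æ",
  "aah": "ˈɑː",
  "aardvark": "ˈɑːdvɑːk",
  "aardvarks": "ˈɑːdvɑːks",
  "aardwolf": "ˈɑːdwʊlf"
}
"""
plain_lexicon = json.loads(lexicon_json)


def missing_words(words, lexicon):
    """Return the words (lowercased, in order, no repeats) with no lexicon entry."""
    missing = []
    for word in words:
        key = word.lower()  # assumption: lexicon keys are lowercase
        if key not in lexicon and key not in missing:
            missing.append(key)
    return missing


print(plain_lexicon["aardvark"])
print(missing_words(["An", "aardvark", "aah"], plain_lexicon))
```
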
The aligned phonetic lexicon should be a JSON file which consists of a dict whose keys are words and whose values are aligned word/phonetic string pairs. Typical entries for English look like this:

"a": [
"about": [
"active": [

You invoke the conversion script as:

python3 $LARA/Code/Python/ make_phonetic_corpus <local-config-file>

The result is to create the following two files in the tmp_resources directory:

  • Phonetic version of the corpus in <Id>_phonetic_corpus.txt

  • List of missing aligned phonetic lexicon entries in <Id>_tmp_phonetic_aligned_lexicon.json

The intention is that the annotator should review and edit the list of missing aligned phonetic lexicon entries, add the corrected entries to the aligned phonetic lexicon, and rerun the script, possibly repeating the cycle several times.
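The "add the corrected entries" step of this cycle could be scripted along the following lines. This is only a sketch: it assumes that the reviewed file is a JSON dict with the same structure as the aligned phonetic lexicon, and the function name `merge_reviewed_entries` is hypothetical.

```python
import json
from pathlib import Path


def merge_reviewed_entries(aligned_path, reviewed_path):
    """Add reviewed entries to the aligned lexicon file; return how many were new."""
    aligned_path, reviewed_path = Path(aligned_path), Path(reviewed_path)
    aligned = json.loads(aligned_path.read_text(encoding="utf-8"))
    reviewed = json.loads(reviewed_path.read_text(encoding="utf-8"))
    new_count = sum(1 for word in reviewed if word not in aligned)
    aligned.update(reviewed)  # reviewed entries override existing ones
    aligned_path.write_text(
        # ensure_ascii=False keeps IPA characters readable in the file
        json.dumps(aligned, ensure_ascii=False, indent=2, sort_keys=True),
        encoding="utf-8",
    )
    return new_count
```

It would be called with the path of the aligned phonetic lexicon and the path of the edited copy of the missing-entries file, after which the conversion script can be rerun.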

The script guesses new aligned entries using a method which looks at the letter-group/phoneme-group alignments it has already seen in the aligned lexicon and tries to align words against phonemes so that as many alignments as possible are ones that have been seen before. It also prints out a list of words which do not appear in the plain phonetic lexicon.
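One possible way to realise such a search is a small dynamic program over positions in the word and in the phonetic string. The sketch below is not the LARA implementation: the function `guess_alignment`, the `max_len` limit on group size, the treatment of the phonetic string as a sequence of single characters, and the decision to allow silent letter groups are all assumptions made for illustration.

```python
from functools import lru_cache


def guess_alignment(word, phonemes, seen_pairs, max_len=3):
    """Split word and phonemes into the same number of groups, maximising the
    number of (letters, phonemes) pairs that occur in seen_pairs.
    Returns a list of pairs, or None if no alignment exists."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best (score, alignment) for word[i:] against phonemes[j:]
        if i == len(word) and j == len(phonemes):
            return 0, ()
        result = None
        for i2 in range(i + 1, min(len(word), i + max_len) + 1):
            # A letter group may be silent, so the phoneme group may be empty
            for j2 in range(j, min(len(phonemes), j + max_len) + 1):
                rest = best(i2, j2)
                if rest is None:
                    continue
                pair = (word[i:i2], phonemes[j:j2])
                score = rest[0] + (1 if pair in seen_pairs else 0)
                if result is None or score > result[0]:
                    result = (score, (pair,) + rest[1])
        return result

    found = best(0, 0)
    return None if found is None else list(found[1])


# Pairs taken from the example line earlier in this section
seen = {("th", "ð"), ("e", "ə"), ("d", "d"), ("u", "ʌ"), ("k", "k")}
print(guess_alignment("the", "ðə", seen))
```

A search of this kind only ranks candidates by how familiar their pairs are, which is why the guessed entries still need to be reviewed by the annotator.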

Completely phonetic languages

If the language in question is written completely phonetically, it is possible to run the conversion script without specifying phonetic lexicon files. In this case, the phonetically meaningful letter groups can be specified in the file $LARA/Code/Python/, in the constant _phonetically_spelled_languages (see below under “Phonetic lexicon resources”).


Heteronyms

Many languages contain heteronyms: words that are spelled the same but pronounced differently depending on their intended meaning. For example, “tear”, “minute” and “wind” are common heteronyms in English.

If your text includes heteronyms, you can distinguish them by turning them into multiword expressions that include a disambiguating phrase inside double square brackets. This line from the poem “The Chaos” illustrates:

@Tear [[in eye]]@#tear# in eye, your#you# dress will @tear [[rip]]@#tear#.

Note that it is necessary to add a lemma tag after each item.

The disambiguating phrase will not be printed in the final version of the text. It can however be included in phonetic lexicon entries, e.g.:

"tear [[in eye]]": [
"tear [[rip]]": [

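Since the disambiguating phrase appears in lexicon keys but not in the displayed text, the relationship between the two forms can be sketched with a single regular expression. The name `displayed_form` is hypothetical, and the sketch assumes that the phrase is always written as optional whitespace followed by `[[...]]`, as in the examples above.

```python
import re

# A disambiguating phrase: optional whitespace followed by [[...]]
DISAMBIGUATION_RE = re.compile(r"\s*\[\[.*?\]\]")


def displayed_form(entry):
    """Return the entry without its disambiguating phrase."""
    return DISAMBIGUATION_RE.sub("", entry)


print(displayed_form("Tear [[in eye]]"))  # Tear
print(displayed_form("tear [[rip]]"))     # tear
```
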
Phonetic lexicon resources

Preliminary phonetic lexicon resources are available for English, French and Barngarla.


The following files are available for English:

  • $LARA/Code/LinguisticData/en_UK_pronunciation_dict.json. Phonetic lexicon derived from the one in IPADict. This lexicon generally omits entries for heteronyms, though a few have been added.

  • $LARA/Code/LinguisticData/en_UK_heteronyms.txt. List of heteronyms.

  • $LARA/Code/LinguisticData/en_UK_pronunciation_dict_aligned.json. Preliminary aligned phonetic lexicon.


The following files are available for French:

  • $LARA/Code/LinguisticData/fr_FR_pronunciation_dict.json. Phonetic lexicon derived from the one in IPADict.

  • $LARA/Code/LinguisticData/fr_FR_pronunciation_dict_aligned.json. Preliminary aligned phonetic lexicon.


The revived Australian Aboriginal language Barngarla is written phonetically. A list of phonetic letter groups is defined in $LARA/Code/Python/, in the constant _phonetically_spelled_languages:

_phonetically_spelled_languages = {
    'barngarla': [
        'a', 'ai', 'aw', 'b', 'd',
        'dy', 'dh', 'g', 'i',
        'ii', 'l', 'ly', 'm',
        'n', 'ng', 'nh', 'ny',
        'oo', 'r', 'rr', 'rd',
        'rl', 'rn', 'w', 'y',
    ],
}
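Given a list of letter groups like this one, a word can be divided into groups without any lexicon. The sketch below uses a greedy longest-match strategy; that strategy, the function name `split_into_letter_groups`, and the pass-through of unknown characters are assumptions for illustration and may differ from what the LARA script actually does.

```python
barngarla_groups = [
    'a', 'ai', 'aw', 'b', 'd', 'dy', 'dh', 'g', 'i',
    'ii', 'l', 'ly', 'm', 'n', 'ng', 'nh', 'ny',
    'oo', 'r', 'rr', 'rd', 'rl', 'rn', 'w', 'y',
]


def split_into_letter_groups(word, letter_groups):
    """Greedy longest-match split of word into known letter groups.
    Characters that match no group are kept as single-character groups."""
    longest = max(len(group) for group in letter_groups)
    known = set(letter_groups)
    groups, i = [], 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            if word[i:i + size].lower() in known:
                break
        else:
            size = 1  # unknown character: pass it through unchanged
        groups.append(word[i:i + size])
        i += size
    return groups


print(split_into_letter_groups("Barngarla", barngarla_groups))
# ['B', 'a', 'rn', 'g', 'a', 'rl', 'a']
```

Note that the greedy choice is not always unambiguous: in the example, the letters "rng" are split as "rn" + "g", although "r" + "ng" is also possible with the same list, so a real implementation may need a different strategy or manual review.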

CSS files for phonetic texts

For phonetic LARA texts, it seems useful to include lines similar to the following in a CSS file, so that letter groups are clearly highlighted when hovered over:

a:hover {
          color: red;
          border: 1px solid black;
}