Local content

This section describes how to create LARA local content. Remember that this means a self-contained directory of LARA files which you can upload to a webserver.

The process consists of a number of steps. Many of these are optional and depend on what you’re going to include in your LARA content and what language you’re working in. We present the steps in order. Basically, there are three parts to the process:

  • Preparing your text corpus. It’s possible to use plain text as input to LARA, but you’ll get much better results if you spend some time marking it up. The most important thing is to tag each word by its associated headword, so that different forms of a word are grouped together in the example pages. For example, in Peter Rabbit you want to group together “rabbit”, “rabbits” and “rabbit-” (as in “rabbit-hole”). For many languages, the initial tagging can be done automatically, and you then only need to clean up the results. This is much faster than doing it all by hand.

  • Creating other resources. Usually, you’ll want to be recording audio and maybe also adding translations. LARA helps you do this in an efficient way. You run your marked-up corpus through the compiler, and it produces recording scripts and spreadsheets. When you use these, you can do the audio and translations quickly.

  • Creating the LARA pages. After you’ve done the first two parts, you invoke the LARA compiler a second time and it will automatically link everything together. You can then copy your LARA pages to a webserver, so that they are generally available.

We’ll now look at the details.

Directory structure

If you are constructing LARA local content, there are no constraints on where you put your source files. However, if you intend to make your material available as distributed LARA content, you will save yourself a lot of trouble if you use the distributed LARA directory structure from the beginning. You’ll find the details in the next section.

Writing a local config file

You need to provide a config file. This is a JSON-formatted file which tells the LARA compiler where to find the resources it’s going to use, and how to process them. Here is the config file for Peter Rabbit, which is a typical example of a simple config file that uses no special features:

{
    "id": "peter_rabbit",
    "corpus": "$LARA/Content/peter_rabbit/corpus/peter_rabbit.txt",
    "image_directory": "$LARA/Content/peter_rabbit/images",
    "segment_audio_directory": "$LARA/Content/peter_rabbit/audio/cathy",
    "word_audio_directory": "$LARA/Content/english/audio/cathy",
    "translation_spreadsheet": "$LARA/Content/english/translations/english_french.csv",
    "segment_translation_spreadsheet": "$LARA/Content/peter_rabbit/translations/english_french.csv",
    "translation_mouseover": "yes",
    "audio_mouseover": "no",
    "segment_translation_mouseover": "yes",
    "max_examples_per_word_page": 10
}

The meaning of the lines is as follows:

  • “id”: String that will be used as an identifier for creating tmp files etc.

  • “corpus”: Location of the tagged corpus file.

  • “image_directory”: Location of the directory containing images (JPEGs etc) used in the text.

  • “segment_audio_directory”: Location of the directory containing audio files for the text segments. This will be produced by LiteDevTools (see below).

  • “word_audio_directory”: Location of the directory containing audio files for individual words. This will be produced by LiteDevTools (see below).

  • “translation_spreadsheet”: CSV spreadsheet giving translations for words.

  • “segment_translation_spreadsheet”: CSV spreadsheet giving translations for segments.

  • “translation_mouseover”: Set to “yes” if you want to show translations when mousing over a word.

  • “audio_mouseover”: Set to “yes” if you want to play audio when mousing over a word.

  • “segment_translation_mouseover”: Set to “yes” if you want to show translations when mousing over the loudspeaker icon at the end of a segment.

  • “max_examples_per_word_page”: Maximum number of examples to put on a concordance page. A typical value is 10.

The following section gives a full list of available config file parameters.

Config file parameters

  • “aligned_phonetic_annotations_file”: Points to a file containing aligned phonetic annotations. May be used when creating a phonetic version of a text.

  • “aligned_segments_file”: Points to a file where automatic segment alignment writes out aligned results.

  • “aligned_segments_file_evaluate”: Points to a file where automatic segment alignment writes out aligned results for evaluation.

  • “abstract_html_format”: Set to json_only to create abstract HTML in plain JSON format instead of gzipped pickled format (useful for debugging).

  • “add_postags_to_lemma”: If using TreeTagger, set to “yes” to include part of speech information in the names of lemmas appearing in word pages.

  • “allow_bookmark”: Set to “yes” to get a bookmark in the compiled pages.

  • “allow_table_of_contents”: Set to “yes” to get a table of contents in the compiled pages. The headings are taken from <h1> and <h2> tags in the text.

  • “audio_alignment_corpus”: Corpus to align against when doing audio alignment.

  • “audio_tracking_file”: Include data to synchronise dynamic highlighting of lines with embedded audio.

  • “audio_cutting_up_parameters”: Data specifying MP3s and Audacity label files to cut up audio.

  • “audio_mouseover”: Set to “yes” if you want to play audio when clicking/hovering on a word.

  • “audio_on_click”: Set to “no” if you do not want the default behaviour where word audio is associated with clicking (i.e. associate with hovering instead).

  • “audio_words_in_colour”: Set to “red” if you want to mark words in colour that have audio recordings. You also need to set “coloured_words” to “no”.

  • “author_email”: Specify author email (will be used for social network).

  • “author_name”: Specify author name (will be used for social network).

  • “author_url”: Specify author URL (will be used for social network).

  • “chinese_tokeniser”: Set to “sharoff” to use Sharoff tokeniser (needs to be installed first).

  • “coloured_words”: Set to “no” to switch off use of colour to mark frequency of occurrence.

  • “comments_by_default”: Set to “yes” if you only want text to be treated by LARA when you enclose it inside double curly brackets, {{ ... }}. See “Including non-L2 text” below.

  • “compiled_directory”: Specify the directory under which compiled output will be placed. Default is $LARA/compiled.

  • “corpus”: Location of the tagged corpus file. This will be used as the input to the “resources” operation.

  • “css_file”: Specify a default CSS file to use.

  • “extra_page_info”: If value is “yes” (default), specifies that the concordance pages should contain links to designated online resources when they are available. So far, only meaningful for Icelandic (“Beygingarlýsing íslensks nútímamáls”, an Icelandic morphology resource) and Japanese (jisho.org, a versatile Japanese resource).

  • “font”: Specify the font. Default is “serif”.

  • “font_size”: Specify the font size. Possible values are “xx-small”, “x-small”, “small”, “medium”, “large”, “x-large”, “xx-large”. Default is “medium”.

  • “frequency_lists_in_main_text_page”: Set to “yes” to display the frequency lists in the pane used by the main text page, i.e. on the left in a left-to-right language.

  • “hide_images”: Set to “yes” to replace images with “alt” text.

  • “html_style”: Controls details of the way HTML is formatted. Options are “old” (default), “new” and “social_network”. Use the third option to create incomplete HTML suitable for embedding in the social network layer.

  • “google_asr_language_code”: Language code for doing Google Cloud speech recognition.

  • “id”: String that will be used as an identifier for creating tmp files etc.

  • “id_on_examples”: Set to “yes” to add annotations on concordance page examples saying which text they come from. Only useful with combined texts.

  • “id_printform”: String used to create annotation for this text if “id_on_examples” is enabled.

  • “image_dict_spreadsheet”: CSV spreadsheet giving associated images for (some) lemmas.

  • “image_dict_words_in_colour”: Set to “yes” if you want to mark words in colour that have an associated image. You also need to set “coloured_words” to “no”.

  • “image_directory”: Location of the directory containing images (JPEGs etc) used in the text.

  • “keep_comments”: Set to “yes” to include comment text in examples.

  • “labelled_source_corpus”: File to use for creating a version of the source corpus with labelled segment breaks, needed when cutting up audio.

  • “language”: You need this if you want to use the TreeTagger interface. The value should be a string identifying the language that TreeTagger will use. The currently supported values are “english”, “french”, “german”, “italian”, “spanish”, “dutch”, “russian” and “middle-high-german”.

  • “language_ui”: Specify the language used for labels. Currently supported values are “english” (default) and “german”.

  • “lara_tmp_directory”: Specify the directory in which generated LARA resource files will be placed. Default is $LARA/tmp_resources.

  • “linguistics_article_comments”: Set to “yes” if you have “comments_by_default” set and also want each marked passage to be treated as a single phrase. See “Including non-L2 text” below.

  • “max_examples_per_word_page”: Maximum number of examples to put on a concordance page. A typical value is 10.

  • “mwe_annotations_file”: File containing Multi Word Expression annotations, generated by the mwe_annotate command.

  • “mwe_file”: File containing Multi Word Expression definitions.

  • “mwe_words_in_colour”: Set to “yes” if you want to mark words in colour that are part of a multiword expression. You also need to set “coloured_words” to “no”.

  • “notes_spreadsheet”: CSV spreadsheet giving associated notes for (some) lemmas.

  • “note_words_in_colour”: Set to “yes” if you want to mark words in colour that have an associated note. You also need to set “coloured_words” to “no”.

  • “parallel_version_id”: LARA ID for parallel version of text.

  • “parallel_version_id2”: LARA ID for second parallel version of text.

  • “parallel_version_label”: Name on link to parallel version of text.

  • “parallel_version_label2”: Name on link to second parallel version of text.

  • “phonetic_lexicon_aligned”: Points to a file containing an aligned phonetic lexicon. Required for creating a phonetic version of a text.

  • “phonetic_lexicon_plain”: Points to a file containing a plain phonetic lexicon. Required for creating a phonetic version of a text.

  • “phonetic_text”: Set to “yes” for phonetic texts.

  • “picturebook”: Set to “yes” for picturebook texts.

  • “picturebook_word_locations_file”: For picturebook texts, points to a JSON file containing locations of words and other components.

  • “pinyin_corpus”: For Chinese texts, points to a utf8-encoded file which contains a pinyin-annotated version of the corpus. This is constructed by pasting the segmented corpus into the tool at https://www.chineseconverter.com/en/convert/chinese-to-pinyin, using the option 我(wǒ).

  • “play_parts”: If your text is a play, you can set this to a list of strings to define the names of the different parts.

  • “play_combine_parts”: If your text is a play, you can set this to a dict which associates the names of parts defined by “play_parts” with the names of directories used to hold audio for that part. Multiple parts can be associated with the same directory.

  • “postags_file”: If using “add_postags_to_lemma”, specify how to render the part of speech tags.

  • “postags_colours_file”: If using “add_postags_to_lemma”, optionally specify a CSV file which associates POS tags with colours. The colours need to be members of the list ‘red’, ‘light_red’, ‘pink’, ‘green’, ‘light_green’, ‘blue’, ‘light_blue’, ‘burgundy’, ‘yellow’, ‘orange’, ‘purple’, ‘violet’, ‘brown’, ‘gray’, ‘black’. Words whose POS tag does not have an associated colour will be shown in black.

  • “preferred_translator”: Use translations from specified translator if possible. (Only relevant in distributed LARA).

  • “preferred_voice”: Use audio from specified voice if possible. (Only relevant in distributed LARA).

  • “relative_compiled_directory”: Specify the relative directory for incomplete HTML files that will be embedded in the social network layer. Thus for example a multimedia file will be referenced as “<Dir>/multimedia/<File>”, where <Dir> is the value of this parameter and <File> is the name of the file. Should be used together with “html_style” = “social_network”.

  • “reading_history”: List of reading history items in a distributed config file.

  • “resource_file”: Name of resource file in a distributed config file.

  • “script_file”: Specify a script file to use.

  • “segment_audio_directory”: Location of the directory containing audio files for the text segments. This will be produced by LiteDevTools (see below).

  • “segment_audio_keep_duplicates”: Set this to “yes” if you want to record separate pieces of audio for different occurrences of the same string. This will in general be appropriate for literary works, particularly for plays. Default is “no”.

  • “segment_audio_word_breakpoint_csv”: If extracting word audio from sentence audio, set this to point to the CSV file holding the breakpoints for the words.

  • “segment_audio_word_offset”: If extracting word audio from sentence audio, add this offset to the numbers in the file referenced by “segment_audio_word_breakpoint_csv”. Default is 0.0.

  • “segment_translation_as_popup”: Set to “no” to show segment translations in the context window instead of displaying them as popups. Default value is “yes”.

  • “segment_translation_character”: If this is set to a one-character string, the character is used to produce a separate control for showing segment translations. Default is to attach the segment translations to the segment audio control.

  • “segment_translation_mouseover”: Set to “yes” if you want to show translations when mousing over the loudspeaker icon at the end of a segment.

  • “segment_translation_spreadsheet”: CSV spreadsheet giving translations for segments.

  • “tagged_corpus”: You should include this if you want to use the TreeTagger interface. TreeTagger will process the file specified by “untagged_corpus” and put tagged output into the file designated by the “tagged_corpus” parameter.

  • “tag_using_google_cloud”: Set to “yes” to perform tagging using Google Cloud Natural Language Morphology and Dependency Trees. This requires the necessary packages to be installed and a valid license file. The details are given under “Tagging”.

  • “text_direction”: Set to “rtl” for right-to-left languages like Arabic, Farsi and Hebrew.

  • “toggle_translation_button”: Set to “yes” to include a button which allows optional display of segment translations inside the text.

  • “translated_words_in_colour”: Set to “yes” to mark in red words which have a translation. Needs to be combined with “coloured_words” = “no”.

  • “translation_label”: Set this to a string representing an HTML tag to change the text on the label for the “Translation” line on concordance pages. For example, the local config file for Tina has the value "<b><u>Translation: English</u></b>". If you look at the concordance pages for Tina, you will see this text appearing at the top. You can mouse over it to get a translation.

  • “translation_mouseover”: Set to “yes” if you want to show translations when mousing over a word.

  • “translation_spreadsheet”: CSV spreadsheet giving translations for lemmas. Required if the value of “word_translations_on” is “lemma” (default).

  • “translation_spreadsheet_surface”: CSV spreadsheet giving translations for surface word types. Required if the value of “word_translations_on” is “surface_word_type”, and may optionally be used if the value is “surface_word_token”.

  • “translation_spreadsheet_tokens”: CSV spreadsheet giving translations for surface word tokens. Required if the value of “word_translations_on” is “surface_word_token”.

  • “tts_engine”: TTS engine to use when invoking the create_tts_audio command. So far, the only supported values are "readspeaker" and "abair".

  • “tts_url”: URL to send request to when invoking the create_tts_audio command. Default is to use the URL defined in the relevant list from lara_python._tts_info.

  • “tts_word_substitution_spreadsheet”: Two-column spreadsheet specifying that single words in the left-hand column should be pronounced by TTS as though they were the corresponding word in the right-hand column.

  • “tts_voice”: TTS voice to use when invoking the create_tts_audio command. Default is to use the first voice defined in the relevant list from lara_python._tts_info.

  • “untagged_corpus”: You need this if you want to use the TreeTagger interface. The value should be a string giving the location of the untagged corpus file. TreeTagger will process this and put tagged output into the file designated by the “tagged_corpus” parameter if it is specified (recommended), otherwise the “corpus” parameter.

  • “video_annotations”: Set to “yes” if you want to use video instead of audio annotations.

  • “video_annotations_from_translation”: Set to “yes” if you want to record video annotations from translations of the source text. The translations will be performed using the entries in the files “translation_spreadsheet” and “segment_translation_spreadsheet”. (This is intended for the case where video annotations are in a signed language whose closest aural/oral language is not the text language of the document).

  • “word_audio_directory”: Location of the directory containing audio files for individual words. This will be produced by LiteDevTools (see below).

  • “working_tmp_directory”: Specify the directory in which temporary working files will be placed. Default is $LARA/tmp.

  • “word_translations_on”: Specify how word translation is done: options are “lemma” (default), “surface_word_type” and “surface_word_token”.

Nearly all the parameters are optional, except as follows:

  • If the config file is used for local content, you need to define “id”.

  • If the config file is used for distributed content, you need to define “id”, “reading_history” and “resource_file”.
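
As an illustration of how the optional parameters combine, here is a hypothetical fragment (the id and corpus path are invented for the example) which switches on a table of contents, increases the font size, marks the text as right-to-left, and enables mouseover translations:

{
    "id": "my_arabic_text",
    "corpus": "$LARA/Content/my_arabic_text/corpus/my_arabic_text.txt",
    "allow_table_of_contents": "yes",
    "font_size": "large",
    "text_direction": "rtl",
    "translation_mouseover": "yes"
}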

Format of tagged LARA text

Here is an example of what tagged LARA data looks like:

<page>
@Once upon a time@ there were#be# four little Rabbits#rabbit#, and their names#name#
were#be#--
          Flopsy#Flopsy#,
       Mopsy#Mopsy#,
   Cotton-tail#Cottontail#,
and Peter#Peter#.||

They lived#live# with their Mother in a sand-|bank, underneath the root of a
very big fir-|tree.||
<img src="01VeryBigFirTree.jpg" width="271" height="317"/>

<page>
'Now my dears#dear#,' said#say# old Mrs. Rabbit one morning, 'you may go into
the fields or down the lane, but don't go into Mr. McGregor#McGregor#|'s garden:
your Father had#have# an accident there; he was#be# put in a pie by Mrs.
McGregor#McGregor#.'||
<img src="02YourFatherHadAnAccident.jpg" width="279" height="302"/>
'Now run along, and don't get into mischief. I am going#go out# out.'||

Things to note:

  • The text is divided into pages using the <page> tag. Each page is presented separately in the compiled output.

  • The text is further divided into segments using the double vertical bar (||). Each segment is associated with a piece of recorded audio, and can be used as an example on a concordance page.

  • Words need to be tagged by headword. Tags are enclosed in a pair of hash signs (# ... #). For example, looking at the top line, “were” is tagged as a form of “be”, and “Rabbits” is tagged as a form of “rabbit”. If you don’t add a tag, the headword is assumed to be the lowercased form of the word.

  • If you want to break a compound word into smaller pieces which are indexed separately, you can do this using the single vertical bar (|). For example, “sand-bank” is split into the two pieces “sand-” and “bank”.

  • Conversely, you can mark a phrase as a single unit by putting a pair of at-signs round it (@ ... @). For example, the conventional phrase “Once upon a time” has been marked in this way.

  • Formatting, using spaces, is kept in the compiled LARA pages. For example, look at the indented names “Flopsy”, “Mopsy” and “Cotton-tail” in the online text.

  • You can insert images using the standard HTML <img> tag.

Adding HTML formatting to LARA text

LARA supports simple use of HTML tags for text formatting. So far, the only tags which have been tested reasonably well are <h1> (main heading), <h2> (subheading), <i> (italic) and <b> (boldface). You can see examples of the first three in the LARA version of Alice in Wonderland. Compare the source text with the online version.

To illustrate the use of HTML markup, here are the first few paragraphs from Alice:

<h2>CHAPTER I. Down the Rabbit-|Hole</h2>||

Alice was#be# beginning#begin# to get very tired#tire# of sitting#sit# by her sister on the
bank, and of having#have# nothing to do:|| once or twice she had#have# peeped#peep# into the
book her sister was#be# reading#read#, but it had#have# no pictures#picture# or conversations#conversation# in
it, ‘and what is#be# the use of a book,’ thought#think# Alice ‘without pictures#picture# or
conversations#conversation#?’||

So she was#be# considering#consider# in her own mind (as well as she could, for the
hot day made#make# her feel very sleepy and stupid), whether the pleasure
of making#make# a daisy-|chain would be worth the trouble of getting#get# up and
picking#pick# the daisies#daisy#,|| when suddenly a White Rabbit with pink eyes#eye# ran#run#
close by her.||

There was#be# nothing so <i>very</i> remarkable in that; nor did#do# Alice think it so
<i>very</i> much out of the way to hear the Rabbit say to itself, ‘@Oh dear@!||
@Oh dear@!|| I shall be late!’|| (when she thought#think# it over afterwards, it
occurred#occur# to her that she ought to have wondered#wonder# at this, but at the time
it all seemed#seem# quite natural);|| but when the Rabbit actually <i>took#take# a watch
out of its waistcoat-|pocket</i>, and looked#look# at it, and then hurried#hurry# on,
Alice started#start# to her feet#foot#,|| for it flashed#flash# across her mind that she had#have#
never before seen#see# a rabbit with either a waistcoat-|pocket, or a watch
to take out of it,|| and burning#burn# with curiosity, she ran#run# across the field
after it, and fortunately was#be# just in time to see it pop down a large
rabbit-|hole under the hedge.
<img src="01TookAWatchOutOfHisWaistcoatPocket.jpg" width="204" height="304"/>
||

Text inside HTML tags can be marked up in the same way as any other LARA text. For example, the text inside the <h2> tag at the beginning contains a vertical bar | to indicate that the two words “rabbit” and “hole” should be indexed separately (try clicking on each one), and the italicized passage <i>took#take# a watch out of its waistcoat-|pocket</i> has one word split with a vertical bar and one tagged with hash signs.

HTML formatting can be used inside words. For example, in the later passage

They were#be#
just beginning#begin# to write this down on their slates#slate#, when the White Rabbit
interrupted#interrupt#: ‘<i>Un</i>important#unimportant#, your Majesty means#mean#, @of course@,’ he said#say#

the phrase <i>Un</i>important correctly links to the word page for “unimportant”. Similarly, HTML formatting can be used inside multiword expressions marked by the @ ... @ notation:

‘I can’t go no lower#low#,’ said#say# the Hatter: ‘I|’m#be# on the floor, as it is#be#.’||

‘Then you may @<i>sit</i> down@,’ the King replied#reply#.||

Here @<i>sit</i> down@ links to the word page for “sit down”.

Special characters

The characters #, @ and < are treated specially by LARA. If you want to include them in your text as ordinary characters, you need to escape them by adding a backslash before the character, so that they become \#, \@ and \<.

If you use the “segment” functionality, this will automatically add backslashes before these characters.

Occasionally, you may want to include an obscure character that you can’t easily put in a Unicode text file. In this case, you can write it as an HTML character code. For example, you can include a Hebrew segol by writing the sequence &#1462;.
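
For example, a hypothetical line containing all three special characters as ordinary text would be written like this in the tagged corpus:

Her email address#address# contains#contain# an \@ sign, a \# sign and a \< sign.||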

Heteronyms

Some texts contain heteronyms, words that are pronounced differently depending on their intended meaning. For example, “bow”, “minute” and “wind” are common heteronyms in English.

You can handle heteronyms by making them into multiword items ending in a disambiguating phrase presented inside double square brackets. This line from the poem “The Chaos” illustrates:

@Tear [[in eye]]@#tear# in eye, your#you# dress will @tear [[rip]]@#tear#.

Note that it is necessary to add a lemma tag after each item.

The disambiguating phrases are presented for recording audio and entering translations, but automatically removed in the compiled text. Thus when someone is recording audio for words, they will see the items

dress
eye
in
Tear [[in eye]]
tear [[rip]]
will
your

but the compiled version of the line will be

Tear in eye, your dress will tear.

Including non-L2 text

Sometimes it can be useful to include text that will not be processed by LARA, typically comments or annotations. You can do this by enclosing the non-L2 text in the sequence /* ... */. The material inside the comment will stay as plain text in the LARA document; it will not be associated with concordance pages, audio files or translations. Here is an Italian poetry example, where the non-L2 text is the attribution for the poem.

<h2>Soldati</h2>||
/*A Selection of Modern Italian Poetry in Translation.
Translation. Soldiers. Roberta L. Payne,
McGill-Queen's University Press, 2004, pp. 114-115*/||

<audio src="Soldati.mp3"/>

Si sta#stare# come||
d#di#'autunno||
sugli alberi#albero#||
le#la# foglie#foglia#.||

There are cases where you want most of the text to be ignored by LARA. The most obvious example is if your document is some kind of linguistics text. Probably most of it will be comments and explanations, and you’ll only want to use LARA to mark the foreign words and phrases, maybe because you want people to be able to listen to them. You can do this efficiently if you add the line:

"comments_by_default": "yes",

to your config file. Then you can mark the pieces of text that are to be processed by LARA using double sets of curly brackets, {{ ... }}. Here is an example taken from a passage in a forthcoming book by Ghil’ad Zuckermann:

Recent research has proved that some languages are harder than others for the dyslexic.|| If
you have to be dyslexic, make sure you are born in Spain or Germany, rather than England.|| You
should definitely avoid present-day Israel.|| There is no doubt in my mind that Israeli is much
more problematic than Hebrew, the reason being that while Israeli’s phonetic system is primarily
European, it still uses the Hebrew orthography.|| As aforementioned, there is no one-to-one
correlation between signs and sounds: {{כ (<i>k</i>)}} and {{ק (<i>q</i>)}} are both pronounced [k],
{{ ת (<i>t</i>)}} and {{ ט (ţ)}} – [t], while more and more Israeli children use interchangeably
{{ ע (<i>ʕ</i>)}}, {{ א (<i>ʔ</i>)}} and {{ ה (<i>h</i>)}}.||

The result is that there is usually no phonetic difference between {{ מכבסים <i>meχabsím</i>}}
‘doing laundry (masculine plural)’ and {{ מחפשׂים <i>meħapsím</i>}} ‘looking for (masculine plural)’
– both are pronounced {{<i>mekhapsím</i>}} [mehapˈsim] (note the anticipatory assimilation of the
voiced b to the voiceless s in the former, resulting in the voiceless [p]).|| Similarly,{{ ידע <i>yada‘</i>}}
‘(he) knew’ is pronounced like {{ ידעה <i>yad‘a</i>}} ‘(she) knew’ and like {{ ידהּ <i>yadah</i>}}
‘her hand’ – all {{<i>yadá</i>}} [jaˈda].|| Israeli {{ קריאה <i>qri’a</i>}} ‘reading’,
{{ קריעה <i>qri‘a</i>}} ‘tearing’, {{ כריה <i>kriya</i>}} ‘mining’, and {{ כריעה <i>kri‘a</i>}}
‘kneeling’ are all pronounced {{<i>kriá</i>}} [kʁiˈa].|| So, do not be too surprised to see an Israeli
child spelling {{ עקבותיו (pronounced <i>ikvotáv</i>)}} ‘his traces’ as{{ אכּווטב }}(cf. Hopkins 1990: 315).||
In Yiddish one would say that this child spells {{ נח מיט זיבן גרייזן <i>nóyekh mit zíbn gráyzn</i>}}
‘{{ נח (“Noah”)}} with seven errors’ (e.g. {{ נאייעך <i>nóyekh</i>}}) – cf. {{ נח מיט זיבן קרייזן
<i>nóyekh mit zíbn kráyzn</i>}}, ‘“Noah” with seven circles’:||

If you include the additional line in the config file:

"linguistics_article_comments": "yes",

you will also add an implicit @ ... @ inside each {{ ... }}, so each marked passage will be treated as a single phrase, i.e. not split up into its component words. This has been done in the passage above.
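
So for a linguistics-style document like the one above, the config file would typically contain both lines:

"comments_by_default": "yes",
"linguistics_article_comments": "yes",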

Adding <img> and <video> tags

You can include image and video files in your text using the <img> and <video> tags. In both cases, the files will be taken from the images subdirectory, and the format is similar: you need to specify the name of the file (src), and the dimensions of the image (width, height), measured in pixels. You can optionally include an alt tag. Here are two examples:

<page>
'Now run along, and don't get into mischief. I am @going out@#go out#.'||
<img src="03DontGetIntoMischief.jpg" width="279" height="277" alt="Mrs Rabbit and Peter"/>

<page>
<h2>/*German*/||</h2>
<video src="german-stfh-2.mp4" width="352" height="640"/>
/*<b>Deutsch</b>*/
Bleibt verdammt nochmal @zu Hause@!||

Adding <audio> tags

It is possible to insert <audio> tags, to include links to independent pieces of audio content. Most texts will not require this functionality. It is however useful for poems, where you may want to be able to hear larger parts of the poem than a single segment read aloud. The following example shows how to do it:

<h1>Giuseppe Ungaretti (1888-1970)</h1>||

<h2>Soldati</h2> ||
A Selection of Modern Italian Poetry in Translation. Translation. Soldiers. Roberta L. Payne, McGill-Queen’s University Press, 2004, pp. 114-115||

<audio src="Soldati.mp3"/>

Si sta come||
d'autunno||
sugli alberi||
le foglie.||

Here, the element <audio src="Soldati.mp3"/> says to insert an audio control to play the file Soldati.mp3, which contains a reading of the entire poem. The file is placed in the segment audio directory, and the metadata file needs to include a line listing the file. The format of the line is as follows:

NonLDTAudioFile Soldati.mp3

The compiled LARA content will look something like this:

_images/AudioTags.jpg

Adding a combined audio file for a whole page

If the src field of an embedded audio tag has the special value this page, the audio used is created by concatenating all the mp3 files in the page where the tag appears. The concatenation is performed automatically as part of the second stage of compilation.
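
For example, placing the tag at the top of a page (as in the cutting-up-audio example later in this section) adds a control that plays the concatenated audio for that page:

<page><audio src="this page"/>||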

Note that the concatenation will only work if all the mp3s have the same sampling rate. This will almost always be true. If for some reason your mp3s have a mixture of sampling rates, you can create a copy of the audio directory containing them using a command of the form:

python3 lara_run_for_portal.py copy_audio_dir_with_uniform_sampling_rate <Dir> <Dir1> <ConfigFile>

where <Dir> is the original audio directory, <Dir1> is the new directory with uniform sampling rate, and <ConfigFile> is the config file for the project in question.

Presenting segment audio as embedded audio

If the src field of an embedded audio tag has the special value this segment, the audio used is the segment audio for the segment in which the audio file appears. This is particularly useful in conjunction with the “audio tracking” feature immediately below.

Including audio tracking

It is possible to add audio tracking to an embedded audio file. This means that the file is associated with a number of lines in the text, each of which is highlighted in turn when the file is played. For the highlighting to synchronise correctly with the sound, the annotator needs to supply the appropriate timings for each file used. This feature is most obviously appropriate for poetry.

In order to include audio tracking, you have to format your tagged corpus file appropriately. You need to use tabular HTML layout; each <audio> tag which uses audio tracking has to have the attribute tracking="yes" added, and timings need to be added to the matching lines using the attribute end_time="<Time>". The following example illustrates:

<table>
<tr><td><audio tracking="yes" src="this segment"/><br/></td></tr>
<tr><td end_time="3.1">Mary had#have# a little lamb</td></tr>
<tr><td end_time="6.4">Its fleece was#be# white as snow,</tr>
<tr><td end_time="9.3">And everwhere that Mary went#go#</td></tr>
<tr><td end_time="12.7">That lamb was#be# sure to go||</td></tr>
<tr><td></td></tr>
<tr><td><audio tracking="yes" src="this segment"/><br/></td></tr>
<tr><td end_time="2.9">It followed her to school one day</td></tr>
<tr><td end_time="5.8">Which was against the rule,</tr>
<tr><td end_time="9.0">It made the children laugh and play</td></tr>
<tr><td end_time="11.9">To see a lamb at school||</td></tr>
<tr><td></td></tr>
</table>

This presents the segment audio for the first and second segments as embedded audio files and associates each one with the four immediately following lines, transitioning the highlighting at the marked times.

The tricky part is getting the timings: you have to listen to each file carefully and figure out where each line break happens. A good tool to use here is Audacity.

Using colours to mark parts of speech (POS)

For some languages, currently German, English, French, Icelandic, Japanese and Turkish, you can use colours to mark parts of speech (POS) in a LARA text. You do this by adding lines of the following form in the config file:

"add_postags_to_lemma": "yes",
"postags_colours_file": "<PostagsToColoursSpreadsheet>",

where <PostagsToColoursSpreadsheet> is a UTF-8 encoded tab-separated two-column CSV file in which the first column contains the POS tags you wish to colour and the second column gives the colour for the POS tag in question. The colours need to be members of the list ‘red’, ‘light_red’, ‘pink’, ‘green’, ‘light_green’, ‘blue’, ‘light_blue’, ‘burgundy’, ‘yellow’, ‘orange’, ‘purple’, ‘violet’, ‘brown’, ‘gray’, ‘black’. Words whose POS tag does not have an associated colour will be shown in black.

For example, the following spreadsheet could be used for English:

_images/POSColoursSpreadsheet.jpg

A text marked with the above colours will look like this:

_images/POSColoursExample.jpg

Note that the tagset will depend on the language. Look at the tagged file generated by your tagger to find out what the relevant tags are.

You may want to manually edit the tags to add distinctions not made by the tagger, for example to distinguish nouns by gender. If you do this, your tag/colour file needs to list colours for the new tags you have added.
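
If you cannot view the images, the contents of a tag/colour file might look something like the following sketch. The tags shown are common English-style POS tags and are purely illustrative; use whatever tags your tagger actually produces:

NN	blue
NNS	blue
VB	red
VBD	red
JJ	green
RB	orange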

Special support for plays

We have begun adding special support for creating LARA versions of plays. So far this has only been used in two texts, the English and French editions of Antigone, which are available in the LARA content repository in the folders antigone_en and antigone. The English version is complete. The French version is complete except for audio, which is about 40% done.

For plays, you should include the line

"segment_audio_keep_duplicates": "yes",

in your config file. This is necessary to produce correctly formatted recording scripts, where in particular the names of parts will occur many times. You also need to format the text using the following conventions:

  • Use bold-face markup, <b>, ONLY to mark names of characters when introducing their lines. Make the name a segment on its own, with the exception that it can optionally end in a comma.

  • Use italic markup, <i>, ONLY to mark stage directions. Make the stage direction one or more complete segments, split off from part names and spoken lines.

The following example illustrates:

<i>The Nurse looks at her.|| She shakes her head.</i>||

<b>THE NURSE</b>|| /*—*/ You have a lover?||

<b>ANTIGONE</b>,|| <i>in a strange voice, after a silence</i>.|| /*—*/ Yes nurse, yes, the poor guy. I have a lover.||

When your play has been formatted in this way, you will be able to use the following command-line options to lara_run.py:

  • python3 $LARA/Code/Python/lara_run.py list_unrecorded_play_lines <ConfigFile>. Produce a file listing all unrecorded audio segments, labelled by play part.

  • python3 $LARA/Code/Python/lara_run.py combine_play_segment_audio <ConfigFile>. You may find it convenient to post separate recording tasks for the different parts. You can do this by making a copy of the config file for each recording account, using a different value of segment_audio_directory in each one. Then create a “master” version of the config file which defines values for play_parts and play_combine_parts, listing the segment audio directory to use for each part, and you will be able to combine the audio. The following example illustrates, where audio for THE NURSE and THE MESSENGER is associated with the directory cathyc, and audio for all the other parts is associated with the default directory laraantigone:

    "play_parts": [
    "ANTIGONE",
    "THE NURSE",
    "ISMENE",
    "HAEMON",
    "CREON",
    "THE CHORUS",
    "THE PROLOGUE",
    "THE GUARD",
    "THE SECOND GUARD",
    "THE THIRD GUARD",
    "THE MESSENGER",
    "THE PAGE"
    ],
    
    "play_combine_parts": {
    "ANTIGONE": "laraantigone",
    "THE NURSE": "cathyc",
    "ISMENE": "laraantigone",
    "HAEMON": "laraantigone",
    "CREON": "laraantigone",
    "THE CHORUS": "laraantigone",
    "THE PROLOGUE": "laraantigone",
    "THE GUARD": "laraantigone",
    "THE SECOND GUARD": "laraantigone",
    "THE THIRD GUARD": "laraantigone",
    "THE MESSENGER": "cathyc",
    "THE PAGE": "laraantigone"
    },
    

Picturebook mode

It is possible to create LARA documents in “picturebook mode” by setting the config file parameter picturebook to "yes". In a picturebook document, the text must also be supplied in image form, with one image per LARA page, and the user is responsible for defining the locations of words and other elements in each image. The compiled LARA document only shows the images. Words in the image are associated with audio and translations in the usual way. A config file for a toy picturebook text looks like this:

{
  "id": "mary_manuscript",
  "language": "english",
  "picturebook": "yes",
  "corpus": "$LARA/Content/mary_manuscript/corpus/mary.txt",
  "picturebook_word_locations_file": "$LARA/Content/mary_manuscript/corpus/mary_word_locations.json",
  "word_translations_on": "surface_word_token",
  "translation_spreadsheet_tokens": "$LARA/Content/mary_manuscript/translations/token_english_french.csv",
  "translation_mouseover": "yes",
  "segment_translation_spreadsheet": "$LARA/Content/mary_manuscript/translations/english_french.csv",
  "segment_translation_mouseover": "yes",
  "segment_audio_directory": "$LARA/Content/mary_manuscript/audio/mannyrayner",
  "word_audio_directory": "$LARA/Content/english/audio/mannyrayner",
  "image_directory": "$LARA/Content/mary_manuscript/images",
  "audio_mouseover": "yes",
  "max_examples_per_word_page": 10,
  "coloured_words": "no"
}

and the text itself is the following:

<page>
<img src="page1.jpg" width="717" height="791"/>

Mary#Mary# had#have# a little lamb||
Its#it# fleece was#be# white as snow||


<page>
<img src="page2.jpg" width="717" height="791"/>

And everywhere that Mary#Mary# went#go#||
That lamb was#be# sure to go||

Here, page1.jpg and page2.jpg are images showing the appearance of the pages.

The locations in a picturebook document need to be put in the file defined by the config file parameter picturebook_word_locations_file. The contents of this file must be a JSON dict, indexed by the names of the image files used to supply the pages. Each page is associated with a list of lists, one for each segment on the page. Each list associates the words in the segment with their location, and in addition allows specification of the special elements SPEAKER-CONTROL and TRANSLATION-CONTROL. If used, these will define the locations of controls to show the audio and translation for the segment. Locations are lists of two-element [ X, Y ] coordinate pairs. If the list has two coordinate pairs, they are treated as the opposite corners of a rectangle. If there are three or more coordinate pairs, they are treated as the vertices of a polygon.

During the “resources” stage of LARA compilation, a version of the word locations file is generated, initially with the location coordinates uninstantiated. The “resources” stage also generates a zipfile containing the word locations file and all the images it references. These two files are created as

<TmpResourcesDirectory>/<Id>_tmp_word_locations.json
<TmpResourcesDirectory>/<Id>_tmp_word_locations_zipfile.zip

where <TmpResourcesDirectory> is the tmp resources directory (by default $LARA/tmp_resources) and <Id> is the project’s id.

The initial part of a toy word locations file is shown below:

{
    "page1.jpg": [
        [
            [
                "Mary",
                [
                    32,
                    101
                ],
                [
                    215,
                    217
                ]
            ],
            [
                "had",
                [
                    230,
                    124
                ],
                [
                    342,
                    186
                ]
            ],
            [
                "a",
                [
                    360,
                    147
                ],
                [
                    406,
                    187
                ]
            ],
            [
                "little",
                [
                    426,
                    119
                ],
                [
                    556,
                    207
                ]
            ],
            [
                "lamb",
                [
                    578,
                    142
                ],
                [
                    668,
                    211
                ]
            ],
            [
                "SPEAKER-CONTROL",
                [
                    613,
                    233
                ],
                [
                    650,
                    263
                ]
            ],
            [
                "TRANSLATION-CONTROL",
                [
                    653,
                    232
                ],
                [
                    689,
                    267
                ]
            ]
        ],
        ...

All elements in the segment, plus SPEAKER-CONTROL and TRANSLATION-CONTROL, must be included. If you do not wish to define a location for an element, the relevant coordinates are replaced by empty strings. For example, if you wish to omit a location for a translation control, the last element in the example above would be replaced by:

[
    "TRANSLATION-CONTROL",
    [
        "",
        ""
    ],
    [
        "",
        ""
    ]
]

Phonetic mode

A normal LARA text is composed of segments, which in turn are composed of words where each word is associated with a lemma. A phonetic LARA text, in contrast, is composed of words, which in turn are composed of letter groups where each letter group is associated with a phonetic value. To use the phonetic text mode, include the line

"phonetic_text": "yes",

in the config file.

A typical paragraph from a phonetic text will look like this:

||@À@#a#|| ||@L@#l#|@É@#e#|@ON@#ɔ̃#|| ||@W@#u#|@E@#e#|@R@#r#|@TH@#t#||
||@QU@#k#|@AND@#ɑ̃#|| ||@I@#i#|@L@#l#|| ||@É@#e#|@T@#t#|@AI@#ɛ#|@T@#(silent)#|| ||@P@#p#|@E@#ə#|@T@#t#|@I@#i#|@T@#(silent)#|| ||@G@#g#|@A@#a#|@R@#ʁ#|@Ç@#s#|@ON@#ɔ̃#||

Note that words are separated by double vertical bars (||), letter groups are separated by single vertical bars (|), and letter groups are enclosed in at-signs (@) with the phonetic value after enclosed in hash signs (#).

This format is obviously very difficult to write manually. In practice, you will almost certainly want to create your phonetic text by first writing it as a normal text and then automatically converting it into phonetic text form. To do this, you need to add declarations to the config file for the normal version of the text that declare phonetic lexicon files, and then use an invocation of the form

python3 $LARA/Code/Python/lara_run.py make_phonetic_corpus [ <local-config-file>* ]

This will create a phonetic version of the text in the tmp_resources directory and also create material in the same directory that may be added to the phonetic lexicon files. We now describe what these are.

The primary resource is a plain phonetic lexicon for the language in question. This is declared using the config file parameter phonetic_lexicon_plain. The file needs to be in JSON format and contain a dict structure indexed on lowercase words, where the associated values are phonetic strings. For example, the English phonetic lexicon starts like this:

{
  "aah": "ˈɑː",
  "aardvark": "ˈɑːdvɑːk",
  "aardvarks": "ˈɑːdvɑːks",
  "aardwolf": "ˈɑːdwʊlf",
  ...

Free phonetic lexicon resources for many languages can be downloaded from https://github.com/open-dict-data/ipa-dict

In order to perform the conversion from normal LARA text to phonetic LARA text, an aligned phonetic lexicon is also needed. This splits up the words into letter groups associated with phonetic values. The aligned phonetic lexicon is declared using the config file entry phonetic_lexicon_aligned. The format is as illustrated in the following extract from the French aligned lexicon:

"afin": [
    "a|f|in",
    "a|f|ɛ̃"
],
"ai": [
    "ai",
    "ɛ"
],
"ainsi": [
    "ain|s|i",
    "ɛ̃|s|i"
],
"alors": [
    "a|l|o|r|s",
    "a|l|ɔ|ʁ|"
],

As with the plain phonetic lexicon, the keys are lowercase words, but the values are aligned pairs, where vertical bars are used to mark the alignment. Note that a letter group can be aligned with an empty phonetic value. An example is the word alors, where the final s is silent, i.e. aligned with an empty phonetic string.

At the moment, all aligned entries need to be added explicitly. We plan to add a component soon which will guess other alignments based on the ones that already exist.
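
Putting this together, the declarations added to the config file of the normal (non-phonetic) version of the text might look like this (the paths are hypothetical):

"phonetic_lexicon_plain": "$LARA/Content/french/phonetic/french_ipa_lexicon.json",
"phonetic_lexicon_aligned": "$LARA/Content/french/phonetic/french_aligned_lexicon.json",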

Parallel LARA texts

Sometimes it may be useful to create a LARA text in parallel versions. The phonetic texts immediately above are one example, where the phonetic text is parallel to the plain text. Another example is a plain version of a text parallel to a manuscript version.

When you have parallel LARA texts, you can link them to each other by adding declarations to the config files of the form

"parallel_version_id": "<Id>",
"parallel_version_label": "<Label>",

for example

"parallel_version_id": "le_petit_prince_abc2_phonetic",
"parallel_version_label": "Phonetic version",

If you do this, the navigation bar on each compiled page will include a link to the same page in the parallel version, with the label specified by parallel_version_label. It is possible to link to two parallel versions by including similar declarations of the form

"parallel_version_id2": "<Id>",
"parallel_version_label2": "<Label>",

For obvious reasons, declaring parallel versions of LARA texts only makes sense if the parallel texts have the same number of LARA pages, as defined by <page> tags in the source text, and if pages with the same number actually do correspond to each other in some meaningful way.

First invocation of LARA compiler (“resources”)

The first time you invoke the LARA compiler, you will only have the corpus and the image directory (if your corpus uses images). You perform the invocation as:

python3 $LARA/Code/Python/lara_run.py resources [ <local-config-file>* ]

This will first check that the tagging is consistent. If there are errors in the format, typically missing # signs, they are written to the trace output and also to the “tagging feedback” file (this is useful for languages with non-European character sets, which may not print properly). Errors will be reported like this:

*** Error in segment: Once upon a time there were#be# four little Rabbits#rabbit#, and their names#name
were#be#--
          Flopsy#Flopsy#,
       Mopsy#Mopsy#,
   Cotton-tail#Cottontail#,
and Peter#Peter#.
*** Error in segment:
<img src="06SqueezedUnderTheGate.jpg" width="252" height="279"/>
First he ate#eat# some lettuces#lettuce# and some French beans#bean#; and then he ate#eat
some radishes#radish#;
<img src="07ThenHeAteSomeRadishes.jpg" width="278" height="304"/>
And then, feeling rather sick, he went#go# to look for some parsley.

In the above, the errors are missing # signs after “name” in the first segment, and “eat” in the second.

Once the tagging is consistent, the compiler should write warnings and statistics to the “tagging feedback” file. Typical output will now look like this:

*** Warning: Inconsistent tags for "going":
"go"                After a time he began to wander about, going lippity--lippity--not very fast, and looking all round.
"go out"            'Now run along, and don't get into mischief. I am going out.'

*** Warning: Inconsistent tags for "got":
"get"               Peter never stopped running or looked behind him till he got home to the big fir-tree.
"get away"          After losing them, he ran on four legs and went faster, so that I think he might have got away altogether if he had not unfortunately run into a gooseberry net, and got caught by the large buttons on his jacket. It was a blue jacket with brass buttons, quite new.

*** Warning: Inconsistent tags for "running":
"run"               Peter never stopped running or looked behind him till he got home to the big fir-tree.
"run after"         And tried to put his foot upon Peter, who jumped out of a window, upsetting three plants. The window was too small for Mr. McGregor, and he was tired of running after Peter. He went back to his work.

--- 40 segments
--- 964 words
--- 353 different tags
--- Longest segments:
After losing them, he ran on four legs and went faster, so that I think he might have got away altogether if he had not unfortunately run into a gooseberry net, and got caught by the large buttons on his jacket. It was a blue jacket with brass buttons, quite new.

'Now my dears,' said old Mrs. Rabbit one morning, 'you may go into the fields or down the lane, but don't go into Mr. McGregor's garden: your Father had an accident there; he was put in a pie by Mrs. McGregor.'

And tried to put his foot upon Peter, who jumped out of a window, upsetting three plants. The window was too small for Mr. McGregor, and he was tired of running after Peter. He went back to his work.

Peter sat down to rest; he was out of breath and trembling with fright, and he had not the least idea which way to go. Also he was very damp with sitting in that can.

Peter asked her the way to the gate, but she had such a large pea in her mouth that she could not answer. She only shook her head at him. Peter began to cry.

The trace output will also list several other files which LARA has generated that you will need to use in the following stages. You will find these files in the directory $LARA/tmp_resources. They will have names like the following:

  • peter_rabbit_record_segments.txt. Needed for recording audio for segments using LiteDevTools.

  • peter_rabbit_record_words.txt. Needed for recording audio for words using LiteDevTools.

  • peter_rabbit_tmp_segment_translations.csv. Needed for filling in segment translations.

  • peter_rabbit_tmp_translations.csv. Needed for filling in word translations.

Now that you have the tmp directory files, you use them to create the other resources.
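
The exact column layout of the translation spreadsheets is defined by the generated files, so check them before editing; conceptually, you simply fill in a target-language translation next to each source item. For example, filled-in rows in a word translation spreadsheet for an English–French project might look like this (illustrative content only):

rabbit,lapin
garden,jardin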

Recording LARA audio using LiteDevTools

You create the audio files using the recording tool on the LiteDevTools site. To do this, you need a LiteDevTools account that allows you to create recording jobs. Log into LiteDevTools, and use the “record_words” and “record_segments” files to post jobs. When you download the results from LiteDevTools, you will get a zipfile containing both the audio files and a metadata file in the format required by LARA. You need to put them all in the same directory, possibly combining them with data that is already there. Remember that you will use different directories for segment audio and word audio. The segment audio will be in a corpus-specific directory (e.g. $LARA/Content/peter_rabbit/audio/cathy), while the word audio will be in a language-specific directory (e.g. $LARA/Content/english/audio/cathy).

Another complication is that LiteDevTools currently creates files in wav format. LARA will work better if you use ffmpeg to convert these to mp3, which occupies about a tenth of the space.
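
The install_ldt_zipfile command described below does this conversion for you, but if you ever need to convert a file by hand, a standard ffmpeg invocation of the following form will do it:

ffmpeg -i recording.wav recording.mp3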

The easiest way to unpack, convert and install an LDT zipfile is to call lara_run.py with the install_ldt_zipfile option:

python3 $LARA/Code/Python/lara_run.py install_ldt_zipfile <Zipfile> <RecordingScript> <Type> <ConfigFile> [ <BadMetadataFile> ]

e.g.

python3 $LARA/Code/Python/lara_run.py install_ldt_zipfile "Peter Rabbit words.zip" $LARA/tmp_resources/peter_rabbit_record_words.json words $LARA/Content/peter_rabbit/corpus/local_config.json

The second argument is the recording script submitted to LDT, and is used to check the validity of the data returned. The optional last argument is a file that will be used to list any audio files that failed to unpack correctly.

You should get a lot of trace output from ffmpeg followed by a summary.

Creating LARA audio using a TTS engine

There is preliminary support for using TTS engines to create LARA audio. So far, the only engines supported are ReadSpeaker and ABAIR. In order to use ReadSpeaker, you need to have a valid ReadSpeaker license key in the UTF-8 text file $LARA/Code/Python/readspeaker_license_key.txt.

You can create TTS audio using the same “record_words” and “record_segments” files as for human audio recorded using LiteDevTools (see preceding section). The command-line call is of the form:

python3 $LARA/Code/Python/lara_run.py create_tts_audio <RecordingScriptFile> <ConfigFile> <Zipfile>

for example

python3 $LARA/Code/Python/lara_run.py create_tts_audio $LARA/tmp_resources/sample_english_surface_tokens_tts_record_segments.json local_config_tts.json readspeaker_segments.zip

The config file must specify the TTS engine using the parameter tts_engine, e.g.

"tts_engine": "abair",

If there is more than one voice available for the TTS engine and language, the voice can be specified using the parameter tts_voice, e.g.

"tts_voice": "ga_UL_anb_nnmnkwii",

The default is to use the first voice in the relevant list from lara_config._tts_info.

The last argument in the command-line call is the zipfile of TTS audio files to be produced. It will be in the same format as the ones downloaded from LDT, and will in particular contain similarly formatted metadata. It is consequently possible to install it using the install_ldt_zipfile option described immediately above, e.g.

python3 $LARA/Code/Python/lara_run.py install_ldt_zipfile readspeaker_segments.zip $LARA/tmp_resources/sample_english_surface_tokens_tts_record_segments.json segments local_config_tts.json

Adjusting TTS pronunciation of single words

Sometimes it is useful to be able to adjust the TTS engine’s pronunciation of single words. For example, French ReadSpeaker pronounces “j’” as though it were the letter J, but it is better to pronounce it as “je”.

You can effect this kind of adjustment using the config file parameter tts_word_substitution_spreadsheet. This should point to a two-column spreadsheet specifying that single words in the left-hand column should be pronounced by TTS as though they were the corresponding word in the right-hand column.
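
For example, a substitution spreadsheet covering the French case above might contain rows like the following (the second row is an invented additional example), with the word to replace in the left-hand column and the word whose pronunciation should be used in the right-hand column:

j'	je
qu'	que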

Creating segment audio by cutting up MP3s

There is support for creating segment audio by cutting up MP3s. This can be a good way to add high-quality audio to a LARA document. You basically do it by adding labels to the MP3 to say where to make the cuts, using the free Audacity tool. Once you’ve got some practice with using Audacity, the process is simple and fairly quick; I can typically annotate a 5 minute MP3 in about 15-20 minutes. The process works as follows.

First, you need to add a labelled_source_corpus element to your config file to specify where you will create the text file you’ll use to get your labels. This file can go anywhere, but an obvious place is the corpus folder. A typical line will look like this:

"labelled_source_corpus": "$LARA/Content/thieving_boy/corpus/thieving_boy_segmented_for_audio.txt",

If you have already made your corpus file, you can now create the labelled source corpus file with an invocation of the form

python3 $LARA/Code/Python/lara_run.py make_labelled_corpus <ConfigFile>

For example, if this is the annotated corpus file

<page><audio src="this page"/>||
<h1>/*Thieving Boy*/||</h1>
<b>/*Cleo Laine*/</b>||

(Intro)||

All my#i# sadness, all my#i# joy, came#come# from loving#love# a thieving boy
All my#i# sadness, all my#i# joy, came#come# from loving#love# a thieving boy||
Came#come# from loving
Came#come# from joying
Came#come# from holding came#come# from toying#toy#, swift his#he# hand and deft my#i# love could steal the down right off a dove||
All my#i# sadness, all my#i# joy, came#come# from loving#love# a thieving boy
All my#i# sadness, all my#i# joy, came#come# from loving#love# a thieving boy||
They prisoned#prison# him#he# for it is#be# true that if you steal they come for you
So watch you ladies#lady# how I wait right outside the prison gate||
All my#i# sadness, all my#i# joy, came#come# from loving#love# a thieving boy
All my#i# sadness, all my#i# joy, came#come# from loving#love# a thieving boy||
Came#come# from loving#love# a thieving boy||

then the automatically generated labelled source file will look like this:

|1|
|2|
|3|

(Intro)|4|

All my sadness, all my joy, came from loving a thieving boy
All my sadness, all my joy, came from loving a thieving boy|5|
Came from loving
Came from joying
Came from holding came from toying, swift his hand and deft my love could steal the down right off a dove|6|
All my sadness, all my joy, came from loving a thieving boy
All my sadness, all my joy, came from loving a thieving boy|7|
They prisoned him for it is true that if you steal they come for you
So watch you ladies how I wait right outside the prison gate|8|
All my sadness, all my joy, came from loving a thieving boy
All my sadness, all my joy, came from loving a thieving boy|9|
Came from loving a thieving boy|10|
|11|

You now use Audacity to specify where the labelled segment breaks go in the MP3. Audacity lets you add a label to the ‘Label track’ using the Ctrl-B command. The result for the running example looks like this:

_images/AudacityThievingBoy.jpg

When you are finished, save your labels to a file using the Audacity “Export Labels” function. Now add an audio_cutting_up_parameters entry to your config file. This tells LARA where to find the data it will need to cut up the audio. In our example, the entry looks like this:

"audio_cutting_up_parameters":
        [ { "audio_file": "$LARA/Content/thieving_boy/audio/cleo_laine_src/ThievingBoy.mp3",
            "audio_labels_file": "$LARA/Content/thieving_boy/corpus/LabelTrack.txt",
            "start_label": 1,
            "end_label": 10
                }
        ],

Here, audio_file is the original MP3, audio_labels_file is the labels file you exported from Audacity, start_label is the first label to use, and end_label is the last one.
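For reference, a labels file exported from Audacity is a plain tab-separated text file with one line per label, giving the start time, the end time and the label text. For the running example the first few lines might look something like this (the times here are purely illustrative):

0.000000	0.000000	1
14.532000	14.532000	2
21.874000	21.874000	3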

You now perform the actual cutting using a command of this form:

python3 $LARA/Code/Python/lara_run.py cut_up_audio <ConfigFile>

If everything is correct, this will invoke ffmpeg to create segment audio in the directory specified by your config file.

Creating segmented text by aligning against cut-up audio

The drawback of the above method is that it involves manually aligning the segmented text against the audio. An alternative method, which can be considerably faster, is to cut up the audio first and then automatically align it against the text to add corresponding segmentation marks. Here, we will use Baudelaire’s poem Recueillement as the running example, and assume that the original audio is available in two files, for the first and second halves of the poem respectively. This is artificial for such a short text, but longer texts (e.g. ones from Librivox) often do have the audio in multiple pieces, perhaps one per chapter. There are five parts to the process: preparing the data, cutting up the audio, performing speech recognition, performing automatic alignment and (optionally) manually post-editing.

Preparing the data

The first step is to use Audacity to create a label track showing where the audio is to be cut up. In contrast to the manual method, the labels can have any values: a simple choice is to make all of them “x”. The result will look like this:

_images/AudacityRecuillement.jpg

Next, you add an audio_cutting_up_parameters entry to your config file to say where the audio and labels files are:

"audio_cutting_up_parameters":
        [ { "id": "part1",
                "audio_file": "$LARA/Content/recueillement/audio/litteratureaudio_src/Charles_Baudelaire_-_Les_Fleurs_du_mal_P8_104_Recueillement_part1.mp3",
                "audio_labels_file": "$LARA/Content/recueillement/corpus/LabelTrack_part1.txt"
                },
          { "id": "part2",
                "audio_file": "$LARA/Content/recueillement/audio/litteratureaudio_src/Charles_Baudelaire_-_Les_Fleurs_du_mal_P8_104_Recueillement_part2.mp3",
                "audio_labels_file": "$LARA/Content/recueillement/corpus/LabelTrack_part2.txt"
                }
        ],

Corresponding to this, you create a version of the text marked up to show which portions correspond to the original audio files you have available, and declare it in the config file as "audio_alignment_corpus". Here, the declaration will be

"audio_alignment_corpus": "$LARA/Content/recueillement/corpus/recueillement_for_alignment.txt",

and the corpus itself will look like this:

<file id="part1">
Recueillement

Sois sage, ô ma Douleur, et tiens-toi plus tranquille.
Tu réclamais le Soir; il descend; le voici:
Une atmosphère obscure enveloppe la ville,
Aux uns portant la paix, aux autres le souci.
Pendant que des mortels la multitude vile,
Sous le fouet du Plaisir, ce bourreau sans merci,
Va cueillir des remords dans la fête servile,
<file id="part2">Ma Douleur, donne-moi la main; viens par ici,
Loin d'eux. Vois se pencher les défuntes Années,
Sur les balcons du ciel, en robes surannées;
Surgir du fond des eaux le Regret souriant;
Le soleil moribond s'endormir sous une arche,
Et, comme un long linceul traînant à l'Orient,
Entends, ma chère, entends la douce Nuit qui marche.

As you can see, the two parts of the text are marked with tags of the form <file id="(name of part)">.

Cutting up the audio

You can now cut up the audio into files defined by the labels using a command of the form

python3 $LARA/Code/Python/lara_run.py cut_up_audio_without_text <ConfigFile> <Id>

which here will be

python3 $LARA/Code/Python/lara_run.py cut_up_audio_without_text local_config.json part1

The cut-up audio will be put in the directory defined by "segment_audio_directory".

Performing speech recognition

The next step is to perform speech recognition on the cut-up audio, using Google Cloud speech-to-text. To be able to do this, you need to have set up a Google Cloud account with appropriate permissions and put your Google account key in the file $LARA/Code/Python/callector-lara-google-account-key.json. You can then do recognition with a command of the form

python3 $LARA/Code/Python/lara_run.py recognise_segment_audio <ConfigFile> <NFiles>

where <NFiles> is a positive integer or all. Here, an appropriate command is

python3 $LARA/Code/Python/lara_run.py recognise_segment_audio local_config.json all

The results will be put in the directory defined by "segment_audio_directory", in a file called recognition_results.json.

Performing automatic alignment

Finally, you can align the recognised audio against the text in the file referenced by "audio_alignment_corpus" using a command of the form

python3 $LARA/Code/Python/lara_run.py align_segment_audio <ConfigFile> <Id> <MatchFunction> <Mode>

where <Id> is the ID passed to cut_up_audio_without_text, <MatchFunction> is the ID of a matching function, and <Mode> is create or evaluate, so here

python3 $LARA/Code/Python/lara_run.py align_segment_audio local_config.json part1 binary create

The matching function needs to be a function that takes two strings as arguments and returns a float between 0 and 1. At the moment, the matching functions and associated IDs are defined in the file $LARA/Code/Python/lara_align_from_audio.py, and the only matching function is binary, which returns 0 if the strings are different and 1 if they are the same.

If <Mode> is create, the segmented text is put in the file referenced by "untagged_corpus", corresponding metadata is put in "segment_audio_directory", and aligned data is put in the file referenced by "aligned_segments_file".

If <Mode> is evaluate, aligned data is put in the file referenced by "aligned_segments_file_evaluate".
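The files mentioned here are all declared in the config file; a typical set of entries might look something like this (the pathnames are illustrative):

"untagged_corpus": "$LARA/Content/recueillement/corpus/recueillement_untagged.txt",
"aligned_segments_file": "$LARA/Content/recueillement/corpus/aligned_segments.json",
"aligned_segments_file_evaluate": "$LARA/Content/recueillement/corpus/aligned_segments_evaluate.json",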

Manually post-editing

Once you have created "aligned_segments_file", you can manually post-edit it and then recreate the segmented text and metadata. The format of the data is like this:

{
  "part1": [
      {
          "edit_distance": 0,
          "file": "$LARA/Content/recueillement_tmp/audio/litteratureaudio/extracted_file_part1_2.mp3",
          "recognised": "recueillement",
          "status": "fully_correct",
          "text_aligned": "\nRecueillement",
          "text_aligned_reference": "Recueillement"
      },
      {
          "edit_distance": 16,
          "file": "$LARA/Content/recueillement_tmp/audio/litteratureaudio/extracted_file_part1_3.mp3",
          "recognised": false,
          "status": "wrong",
          "text_aligned": false,
          "text_aligned_reference": "*no_text* part1 2"
      },
      {
          "edit_distance": 0,
          "file": "$LARA/Content/recueillement_tmp/audio/litteratureaudio/extracted_file_part1_4.mp3",
          "recognised": "sois sage ô ma douleur et tiens-toi plus tranquille",
          "status": "fully_correct",
          "text_aligned": "\n\nSois sage, ô ma Douleur, et tiens-toi plus tranquille.",
          "text_aligned_reference": "Sois sage, ô ma Douleur, et tiens-toi plus tranquille."
      (...)

You can recreate from edited aligned data using a command of the form

python3 $LARA/Code/Python/lara_run.py update_from_aligned_file <ConfigFile> <Id>

so here

python3 $LARA/Code/Python/lara_run.py update_from_aligned_file local_config.json part1

Extracting word token audio from sentence audio

A related capability, currently under development, is extracting word audio from sentence audio. By default, word audio is specified for word types using the directory referenced by word_audio_directory. You can also specify audio for word tokens, as follows:

  • Define a value for the config file parameter segment_audio_word_breakpoint_csv. This should be a file with a .csv extension.

  • Perform the “resources” stage of LARA processing. This should create a file in the tmp_resources directory with a name of the form <Id>_tmp_segment_audio_word_breakpoint_file.csv.

  • This file should contain a group of three lines for each text segment which has an associated audio file. The first line is the name of the file; the second is the words in the segment, plus the special token *end*; the third is a list of numbers, one for each word in the second line. The numbers indicate the number of seconds after the start of the file where the matching word begins, and the number under *end* indicates where the final word ends (a sketch is given after this list).

  • Fill in some or all of the lines of numbers. At the moment, you need to use an audio editing tool like Audacity to find the word boundaries and do it by hand. The image below shows what the result should look like.

_images/WordAudioBreakpointsCSV.jpg

  • For some reason, the timing numbers given by audio editors are systematically different from the ones used by the JavaScript utility which plays the audio. To correct this, you need to set the parameter segment_audio_word_offset in the config file. The appropriate value for Audacity appears to be about -0.100, i.e. the Audacity numbers are 0.100 seconds greater than the ones needed for playing the audio in a LARA document.

  • Copy the filled-in version of <Id>_tmp_segment_audio_word_breakpoint_file.csv to the file referenced by segment_audio_word_breakpoint_csv and perform the word_pages stage of LARA processing. Segments where audio breakpoints are defined should have word audio associated with the relevant pieces of segment audio. Elsewhere, word audio will as usual be taken from the directory named by word_audio_directory.
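As a concrete sketch, the relevant config entries and one filled-in three-line group from the breakpoint file might look something like this (the pathname, filename, words and timings are all purely illustrative):

"segment_audio_word_breakpoint_csv": "$LARA/Content/mycontent/corpus/word_breakpoints.csv",
"segment_audio_word_offset": -0.100,

thieving_boy_segment_5.mp3
Came	from	loving	a	thieving	boy	*end*
0.00	0.42	0.78	1.35	1.52	1.98	2.60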

The intention is that the audio word breakpoints file will later be instantiated using a forced alignment package, probably the Montreal Forced Aligner.

Filling in LARA translation spreadsheets

It is possible to attach translations to both segments and words. The “segments” part is straightforward. The file identified in the config file by the parameter segment_translation_spreadsheet is a two-column CSV spreadsheet which associates each segment with its translation. During the “resources” compilation step, LARA will produce a file in the tmp_resources directory which contains the current information and leaves blanks for the new translations that need to be filled in. The name of the file will be <Id>_tmp_segment_translations.csv where <Id> is the identifier for the corpus.

The way in which word translations are filled in is similar but more complicated. LARA supports three different ways of attaching word translations, controlled by the config parameter word_translations_on. This can have the values “lemma” (default), “surface_word_type” and “surface_word_token”. The word translation file is identified in the config file by one of the following parameters:

  • translation_spreadsheet if word_translations_on = “lemma”

  • translation_spreadsheet_surface if word_translations_on = “surface_word_type”

  • translation_spreadsheet_tokens if word_translations_on = “surface_word_token”

“lemma” translation model

The default model is to attach translations to lemma types; the same lemma is always associated with the same translation. So for example if we are translating from English into French, then “be”, “is” and “was” will all be translated as “être” (infinitive of “to be”). The translation spreadsheet will again have two columns, one for source and one for target.

This model requires the smallest amount of work, but also gives the least satisfactory result.
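For example, for an English text translated into French, the filled-in lemma spreadsheet would contain rows like these (tab-separated, one column for the lemma and one for its translation; the entries are just illustrative):

be	être
rabbit	lapin
sadness	tristesse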

“surface_word_type” translation model

The second model is to attach translations to surface word types; the same surface word type is always associated with the same translation. If we are translating English into French, then “be”, “is” and “was” will all be translated as different words. Probably “be” will be “être” and “is” will be “est”. It is not clear what to put as the translation of “was”, since “I was” is “j’étais” while “he was” is “il était”. Perhaps one will want to write “étais/était”. But in general, since the model is more fine-grained, it will give better results. However, it also requires more work, since it will be necessary to supply translations for several inflected forms of the same lemma.

The form of the translation spreadsheet is the same; it will again have two columns, one for source and one for target.

“surface_word_token” translation model

The third model is to attach translations to surface word tokens, which means that each individual word in the text can be associated with a different translation. This means that the annotator has to do a great deal more work, but has full control.

If the “surface_word_token” model is used, the annotator must start by filling in the segment translation spreadsheet. The blank word translation spreadsheet produced by the next invocation of the “resources” compilation step will then consist of a set of three-line groups, one for each segment, where the first line contains the words in the source segment, the second line is blank, and the third line consists of the words in the target segment. The annotator fills in the second line by looking at the words in the third one. The following example illustrates.

_images/WordTokenBlankFile.jpg

The filled-in file will look like this:

_images/WordTokenFilledInFile.jpg

Combined “surface_word_type” and “surface_word_token”

The problem with the “surface_word_token” model is that filling in a translation for each word in the text takes a while. You can work much more quickly if you combine the “surface_word_type” and “surface_word_token” models. You do this by adding an extra declaration in the config file with the key translation_spreadsheet_surface, pointing to a “surface_word_type” spreadsheet. The idea is that you start by filling in the “surface_word_type” spreadsheet, so that you have defined a translation for each surface word in the text. This information is then used by LARA to fill in the “surface_word_token” spreadsheet with default values. Some of these will of course be wrong (otherwise, there would be no point to having the “surface_word_token” sheet!), but correcting them is far easier than entering everything from scratch. In detail, the workflow is as follows:

  • Add entries for translation_spreadsheet_tokens and translation_spreadsheet_surface to your config file. translation_spreadsheet_surface should point to a file that can also be used as a normal spreadsheet for the “surface_word_type” translation model (see the example config entries after this list).

  • Invoke the “resources” compilation step.

  • You should have blank files in the “tmp_resources” directory called <Id>_tmp_segment_translations.csv, <Id>_tmp_translations_surface_type.csv and <Id>_tmp_translations_token.csv.

  • Fill in some or all of <Id>_tmp_segment_translations.csv and <Id>_tmp_translations_surface_type.csv and copy them to the files referenced by the config file entries segment_translation_spreadsheet and translation_spreadsheet_surface.

  • Invoke the “resources” compilation step again.

  • Fill in some or all of <Id>_tmp_translations_token.csv and copy it to the file referenced by the config file entry translation_spreadsheet_tokens.

  • Invoke the “resources” compilation step again and repeat the cycle as many times as necessary.
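For reference, the two config entries added in the first step might look something like this, following the naming conventions described in the next subsection (the content name mycontent is illustrative, for an English text translated into French):

"translation_spreadsheet_surface": "$LARA/Content/english/translations/type_english_french.csv",
"translation_spreadsheet_tokens": "$LARA/Content/mycontent/translations/token_english_french.csv",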

All of this is considerably easier to do in the portal, which takes care of the workflow automatically.

Pathnames for translation files

When you have added the missing translations, you copy the filled-in “tmp” translation files to replace the real translation files defined by the config file. If you intend to use the results as “distributed” content, you need to conform to the following naming conventions, where <Id> is the corpus identifier, <CorpusLang> is the language in which the corpus text is written, and <TransLang> is the language in which the translations will appear:

  • Segment translation file: This needs to be called <CorpusLang>_<TransLang>.csv and be placed in the translations subdirectory of the corpus directory.

  • lemma word translation file: This needs to be called <CorpusLang>_<TransLang>.csv and be placed in the translations subdirectory of the language directory.

  • surface word type translation file: This needs to be called type_<CorpusLang>_<TransLang>.csv and be placed in the translations subdirectory of the language directory.

  • surface word token translation file: This needs to be called token_<CorpusLang>_<TransLang>.csv and be placed in the translations subdirectory of the corpus directory.

Note that the word translation file is in the language directory if you are using a type-based model, but in the corpus directory if you are using a token-based model. This is logical: type-based models are shared between multiple corpora, but token-based models are unique to one corpus.

For example, Peter Rabbit will have its segment translation file in peter_rabbit/translations/english_french.csv. If we use a lemma-based word translation model, the word translation file will be in english/translations/english_french.csv. If we are using a surface word token translation model, the word translation file will be in peter_rabbit/translations/token_english_french.csv.

Adding notes to words

It can be useful to attach notes to selected words in a text. For example, you may want to add biographical/geographical information to a name, or grammar information to a function word. LARA includes a straightforward mechanism, similar in form to the one used for adding translations, that allows you to do this. There are three steps:

  • The “resources” call (first step of compilation) produces a CSV spreadsheet in the tmp_resources directory with a name of the form <Id>_tmp_notes.csv. The format of the file is exactly the same as that of the lemma translation spreadsheet: there are two columns, one for lemmas and one for notes, and a row for each lemma. Fill in notes for all relevant lemmas.

  • Copy the file to a place in the project directory (probably the corpus subdirectory is most suitable).

  • Add an entry to the config file with the key notes_spreadsheet specifying where the new notes spreadsheet has been placed.
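For example, the entry might look like this (the pathname is illustrative):

"notes_spreadsheet": "$LARA/Content/mycontent/corpus/notes.csv",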

If a notes spreadsheet is defined, generated example pages will include a note if one exists for the lemma in question, and there will be a link at the bottom of each example page pointing to a file which lists all the notes in alphabetical order.

Adding images to words

Similarly, it can be useful to attach images to selected words. This is formally similar to adding translations or notes:

  • The “resources” call (first step of compilation) produces a CSV spreadsheet in the tmp_resources directory with a name of the form <Id>_tmp_image_dict.csv. The format of the file is again the same as that of the lemma translation spreadsheet: there are two columns, one for lemmas and one for images, and a row for each lemma. Fill in pathnames for the lemmas where you wish to associate an image. The spreadsheet should look something like this:

_images/ImageDict.jpg

  • Copy the file to a place in the project directory (probably the corpus subdirectory is most suitable).

  • Add an entry to the config file with the key image_dict_spreadsheet specifying where the image dict spreadsheet has been placed.

  • Copy the actual image files (jpgs, pngs, etc) to the images subdirectory.

When an image is defined, it will be shown at the top of the concordance file for the lemma in question, and can be viewed by clicking on the word.

If the config file specifies image_dict_words_in_colour as yes and coloured_words as no, words with associated images will be shown in red.
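Putting this together, the relevant config entries might look something like this (the pathname is illustrative):

"image_dict_spreadsheet": "$LARA/Content/mycontent/corpus/image_dict.csv",
"image_dict_words_in_colour": "yes",
"coloured_words": "no",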

Second invocation of LARA compiler (“word_pages”)

When you have completed the above steps, invoke the second (“word_pages”) step of the LARA compilation:

python3 $LARA/Code/Python/lara_run.py word_pages [ <local-config-file>* ]

This should create a directory of LARA pages in the $LARA/compiled directory.

Editing a file from the content directory

To edit a file from a content directory, specify the type of file and start LARA as follows:

python3 $LARA/Code/Python/lara_run.py edit <file-id> [ <local-config-file>* ]

The given <file-id> must be one of:

  • config_file or cf, in which case the local config file will be opened in the text editor.

  • corpus or c, in which case the tagged corpus file will be opened in the text editor.

  • untagged_corpus or uc, in which case the untagged corpus file will be opened in the text editor.

  • translation_spreadsheet or ts, in which case the word translation spreadsheet file will be opened in the editor for CSV spreadsheets.

  • segment_translation_spreadsheet or sts, in which case the segment translation spreadsheet file will be opened in the editor for CSV spreadsheets.

Before you can edit these files, you have to define the editors for text/CSV in the environment variables LARA_TEXTEDITOR and LARA_CSVEDITOR. Use the full path to the respective program.

For example, on my machine these variables are set as follows

$ echo $LARA_TEXTEDITOR
C:/Program Files (x86)/Vim/vim80/gvim.exe

$ echo $LARA_CSVEDITOR
C:/Program Files/LibreOffice/program/scalc.exe
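If you are working in a Cygwin or bash shell, one way to set these variables is to add export lines like the following to your .bashrc, using the full paths appropriate for your own machine:

export LARA_TEXTEDITOR="C:/Program Files (x86)/Vim/vim80/gvim.exe"
export LARA_CSVEDITOR="C:/Program Files/LibreOffice/program/scalc.exe"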

Opening the compiled HTML file in the browser

In order to open the HTML file that was created by the second invocation of the LARA compiler (“word_pages”), you can either go to the vocabpages directory for your content under $LARA/compiled and double-click the file _hyperlinked_text_.html, or issue this command directly from your content directory:

python3 $LARA/Code/Python/lara_run.py open_in_browser [ <local-config-file>* ]

Creating a new content from a template

To create a new (empty) content from a template, start LARA as follows:

python3 $LARA/Code/Python/lara_run.py newcontent <content-id>

A new content directory named <content-id> will be created under $LARA/Content along with the subdirectories corpus, translations, images, and audio. In the corpus directory the local configuration file <content-id>.json and a dummy corpus file <content-id>.txt will also be created. This serves as a starting point for your new LARA content.
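Based on the description above, the resulting directory structure should look something like this:

$LARA/Content/<content-id>/
    corpus/
        <content-id>.json
        <content-id>.txt
    translations/
    images/
    audio/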


(*) NB: Any of the commands that require a local config file (“treetagger”, “resources”, “word_pages” or “edit”) can be called from a subdirectory of $LARA/Content without specifying the name of the file, in which case the config file from the corpus subdirectory will be used, unless there is more than one candidate file available.


Making your LARA pages accessible

You will probably need to go round the cycle several times as you fix problems. When you think your LARA pages look good enough, use scp to copy them to a webserver and make them generally accessible. If you are using the Geneva web space, you will use a command that looks something like:

scp -r peter_rabbitvocabpages manny@isslnx1.unige.ch:/export/data/www/issco-site/en/research/projects/callector

Summary

Here’s a summary of the steps you need to follow to create a basic piece of LARA content using TreeTagger. Let’s call your content mycontent:

  • Make your content directory. Look in the $LARA/Content directory, and find the folder for a similar piece of content - let’s call it $LARA/Content/oldcontent. Copy $LARA/Content/oldcontent and rename it to make your new directory $LARA/Content/mycontent.

  • Make your config file. Go to the directory $LARA/Content/mycontent/corpus. Start by editing the file local_config.json. This tells LARA where to find everything it will need to build your resource. It’ll contain many references to the content you copied from, oldcontent. Do a global replace of oldcontent with mycontent.

  • Make your plain text file. The directory $LARA/Content/mycontent/corpus should also contain the corpus file for oldcontent, i.e. the file with the original text. It’ll probably be called something like oldcontent.txt. Rename it to mycontent.txt. Open it in an editor and replace all the old text with your new text. Imitate the formatting if necessary for things like headings and images. In particular, make sure that you break up the text into segments by adding a || at the end of every segment. Most often this will be after a period.

  • Run TreeTagger. Now you’re ready to do the TreeTagger step. Open a Cygwin window, terminal or similar, and do:

    cd $LARA/Content/mycontent/corpus
    python3 $LARA/Code/Python/lara_run.py treetagger local_config.json
    

    This should create a file in $LARA/Content/mycontent/corpus called something like mycontent_tagged_and_cleaned.txt.

  • Correct the tagging. If the preceding step worked, open mycontent_tagged_and_cleaned.txt in an editor and correct the tagging in the places where TreeTagger got it wrong.

  • First version of your LARA content. It’s a good idea to create a very rough first version of your LARA content at this point, to check the tagging. Go back to your Cygwin/terminal window and do:

    cd $LARA/Content/mycontent/corpus
    python3 $LARA/Code/Python/lara_run.py resources local_config.json
    python3 $LARA/Code/Python/lara_run.py word_pages local_config.json
    

    The “resources” step creates the resources you’ll need for audio and translation, and the “word_pages” step creates the actual LARA content. If all of this worked, you’ll see the LARA content in the new directory $LARA/compiled/mycontentvocabpages. Open it and then open the file _hyperlinked_text_.html in Chrome. You should be able to see a first version of your LARA content. If this shows you mistakes in the tagging (looking at the alphabetical index is often helpful), correct them and repeat this step.

  • Post your recording tasks. Once your tagging looks more or less okay, it’s time to move on to the resources. Go to the LiteDevTools site and click on Recording > Manage tasks. Now post two recording tasks, one for the word audio and one for the segment audio. The files you’ll need to upload are going to be in $LARA/tmp_resources, and they’ll be called mycontent_record_words.txt and mycontent_record_segments.txt. Assign each task to yourself, or whoever is doing the recording.

  • Do the recording. Go back to the LiteDevTools top level and enter Recording > Available tasks. You should see the two tasks you just assigned to yourself. Open them in turn and do the recording.

  • Download the segment audio. Now go back to Recording > Manage tasks. You’re going to add all your recorded data to your content. Start with the segment recordings. Click on Results and then on Download complete results. You’ll probably have to wait a couple of minutes while LiteDevTools creates a large zipfile for you to download. You’re going to put this in the audio directory, which will be called something like $LARA/Content/mycontent/audio/myself, where myself is your name or the name of the person doing the recording. Delete all the old content from this directory. Then download the file from LDT to $LARA/Content/mycontent/audio/myself and unzip it. (This may happen automatically). There should be a bunch of audio files and a metadata file called metadata_help.txt. Make sure you keep the metadata file. This is absolutely essential, otherwise LARA won’t be able to find any of your audio files!

  • Download the word audio. Next, download the word audio. This is similar, but not exactly the same, because usually you’ll combine the word audio from all your content in one place so that you only record things once. If the language you’ve used is called “mylanguage”, you’re going to put your downloaded content in the directory $LARA/Content/mylanguage/audio/myself. Since you’re planning to combine the downloaded content with the existing content, you’ll probably find it easiest if you make a temporary directory (call it tmp) and start by unzipping your file there. If you’re being careful, you may first want to convert the contents of tmp to mp3 format by going to the directory above tmp and doing:

    python3 $LARA/Code/Python/lara_run.py audio_dir_to_mp3 tmp
    

    This will create a similar directory called tmp_mp3 with everything converted; you don’t have to do this, but mp3 files are a lot smaller and your content will be more responsive. Now copy your files from tmp or tmp_mp3 as follows. First, copy all the audio files to $LARA/Content/mylanguage/audio/myself; second, open $LARA/Content/mylanguage/audio/myself/help_metadata.txt in an editor, then copy the metadata from the metadata file in tmp or tmp_mp3 and put it at the end. You’re updating the metadata with the new content. If your directory is originally empty, just copy over the metadata from tmp or tmp_mp3.

  • Fill in the translations. You also need to fill in translations. Look in $LARA/tmp_resources, where you should have files called mycontent_tmp_translations.csv and mycontent_tmp_segment_translations.csv. Open each one - you will probably find it easiest if you use Open Office Calc, and you will need to select UTF-8 as the encoding, and Tab as the separator. Fill in the translations and save. You should save using the Save As menu. Check “Edit filter settings” and again set the encoding to UTF-8 and the separator to Tab. You save the segment translation file in $LARA/Content/mycontent/translations/mylanguage_studentlanguage.csv, and the word translations in $LARA/Content/mylanguage/translations/mylanguage_studentlanguage.csv.

  • Make the LARA pages. Finally, remake the LARA pages by once again going to your Cygwin/terminal and doing:

    cd $LARA/Content/mycontent/corpus
    python3 $LARA/Code/Python/lara_run.py word_pages local_config.json