Distributed content

This section describes how to create and use distributed LARA content. Local content, described in the previous section, is a self-contained set of LARA files for a single corpus. Distributed content instead assumes that LARA content for different corpora and languages is spread over multiple servers. We create a different set of LARA pages for each reader based on their “reading history”, a record of which corpora the reader has accessed. LARA builds the pages by downloading only the data relevant to the reading history and inserting links to the remote servers for the bulky audio and image files.

You can see an example of LARA pages for a reading history here. The reading history in question consists of the full text of Peter Rabbit followed by the beginning of Alice in Wonderland. This time, if you click on the word “Rabbits” in the top line, you will see examples from both texts, so both Peter Rabbit and the White Rabbit.

As with local content, the user specifies how to compile a reading history using a config file. The file has a structure that is very similar to the one used for local content. The most important difference is that, instead of specifying a corpus, it specifies a reading history. A typical reading history config file looks like this:

{
  "id": "reader1_english",
  "l1": "french",
  "resource_file": "$LARA/Content/all_resources.json",
  "reading_history": [[ "peter_rabbit", "english_geneva", [ 1, 5 ]],
                      [ "alice_in_wonderland", "english_geneva", [ 1, 5 ]]],
  "preferred_voice": "cathy",
  "audio_mouseover": "yes",
  "translation_mouseover": "yes",
  "segment_translation_mouseover": "yes",
  "max_examples_per_word_page": 10
}
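As a rough illustration of the structure above, the following Python sketch checks that a reading history config has the shape shown in the example. It is not part of LARA: the function name and the choice of which keys count as required are our assumptions, and the meaning of the two integers in each entry is not checked.

```python
import json

# Assumed minimum set of keys; the real compiler may require more or fewer.
REQUIRED_KEYS = ("id", "l1", "resource_file", "reading_history")

def check_reading_history_config(config):
    """Return a list of problems found in a reading history config dict."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")
    for i, entry in enumerate(config.get("reading_history", [])):
        # Each entry in the example is [corpus id, language id, [int, int]].
        ok = (
            isinstance(entry, list)
            and len(entry) == 3
            and isinstance(entry[0], str)
            and isinstance(entry[1], str)
            and isinstance(entry[2], list)
            and len(entry[2]) == 2
            and all(isinstance(n, int) for n in entry[2])
        )
        if not ok:
            problems.append(f"reading_history[{i}] has an unexpected shape")
    return problems

example = json.loads("""
{
  "id": "reader1_english",
  "l1": "french",
  "resource_file": "$LARA/Content/all_resources.json",
  "reading_history": [[ "peter_rabbit", "english_geneva", [ 1, 5 ]],
                      [ "alice_in_wonderland", "english_geneva", [ 1, 5 ]]],
  "preferred_voice": "cathy"
}
""")

print(check_reading_history_config(example))  # prints []
```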

The “resource file” referred to is a JSON file which associates the name of each resource with the information needed to use it. For a corpus resource, this is the resource’s URL and the Id of the associated language resource. A language resource just needs a URL and the string "LanguageResource" to show that it is a language resource. The format is as follows:

{
    "<CorpusResourceId>": [ "<CorpusURL>", "<LanguageResourceId>" ],
    ...
    "<LanguageResourceId>": [ "<LanguageURL>", "LanguageResource" ],
    ...
}

for example:

{
 "peter_rabbit": [ "https://www.issco.unige.ch/en/research/projects/callector/peter_rabbit",
                   "english_geneva" ],
 "alice_in_wonderland": [ "https://www.issco.unige.ch/en/research/projects/callector/alice_in_wonderland",
                          "english_geneva" ],
 "ogden_nash" : ["https://www.issco.unige.ch/en/research/projects/callector/LaraResourceContent/21_ogden_nash2",
                 "english_geneva"],
 "four_little_children" : ["https://www.issco.unige.ch/en/research/projects/callector/LaraResourceContent/35_four_little_children",
                           "english_geneva"],

 "tina_fer_i_fri": [ "https://www.issco.unige.ch/en/research/projects/callector/tina_fer_i_fri",
                     "icelandic_reykjavik" ],

 "the_boy_who_cried_wolf": [ "https://www.issco.unige.ch/en/research/projects/callector/the_boy_who_cried_wolf",
                             "farsi_geneva"],
 "bozboz_ghandi": [ "https://www.issco.unige.ch/en/research/projects/callector/bozboz_ghandi",
                    "farsi_geneva"],
 "EbneSina": [ "https://www.issco.unige.ch/en/research/projects/callector/EbneSina",
               "farsi_geneva"],
 "Arash": [ "https://www.issco.unige.ch/en/research/projects/callector/Arash",
            "farsi_geneva"],

 "hyakumankai_ikita_neko": [ "https://www.issco.unige.ch/en/research/projects/callector/hyakumankai_ikita_neko",
                             "japanese_canberra"],

 "dante": [ "https://www.issco.unige.ch/en/research/projects/callector/dante",
            "italian_swinburne"],
 "ungaretti": [ "https://www.issco.unige.ch/en/research/projects/callector/ungaretti",
                "italian_swinburne" ],

 "english_geneva": [ "https://www.issco.unige.ch/en/research/projects/callector/english",
                     "LanguageResource" ],
 "farsi_geneva": [ "https://www.issco.unige.ch/en/research/projects/callector/farsi",
                   "LanguageResource" ],
 "icelandic_reykjavik": [ "https://www.issco.unige.ch/en/research/projects/callector/icelandic",
                          "LanguageResource" ],
 "italian_swinburne": [ "https://www.issco.unige.ch/en/research/projects/callector/italian",
                        "LanguageResource" ],
 "japanese_canberra": [ "https://www.issco.unige.ch/en/research/projects/callector/japanese",
                        "LanguageResource" ]
}
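Since every corpus resource points to a language resource by Id, it can be useful to check that each Id it names is itself registered. The following is a minimal sketch of such a check, assuming the resource file has already been loaded into a Python dict; the function names are ours and are not part of LARA.

```python
def split_resources(resources):
    """Separate a resource-file dict into corpus and language resources."""
    corpora, languages = {}, {}
    for resource_id, (url, second) in resources.items():
        if second == "LanguageResource":
            languages[resource_id] = url
        else:
            # For a corpus, the second element is the language resource Id.
            corpora[resource_id] = (url, second)
    return corpora, languages

def unregistered_languages(resources):
    """List language Ids used by corpora but missing from the resource file."""
    corpora, languages = split_resources(resources)
    return sorted({lang for _url, lang in corpora.values() if lang not in languages})

resources = {
    "peter_rabbit": ["https://www.issco.unige.ch/en/research/projects/callector/peter_rabbit",
                     "english_geneva"],
    "tina_fer_i_fri": ["https://www.issco.unige.ch/en/research/projects/callector/tina_fer_i_fri",
                       "icelandic_reykjavik"],
    "english_geneva": ["https://www.issco.unige.ch/en/research/projects/callector/english",
                       "LanguageResource"],
}

# icelandic_reykjavik is deliberately left out of this small example.
print(unregistered_languages(resources))  # prints ['icelandic_reykjavik']
```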

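It’s essential to use the right directory structure for your resources. The directory structure and the format of the metadata are described in the sections “Directory structure for distributed LARA” and “Metadata for distributed LARA” below. There is also a script that adds the metadata automatically and checks your directory structure as it does so.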

Publishing distributed content

Before trying to make LARA resources available in distributed form, you should first be sure you can compile them as local resources. If you have organised the material according to the standard directory structure expected for distributed LARA resources (see “Directory structure for distributed LARA” below), you will then only need to do the following:

  • Run a script to add metadata to your resource directory.

  • Upload the result to a webserver.

  • Register your new resource in the resource file you are using.

Here are the details.

Adding metadata to your resource directory

First, you need to make sure that your resource directory has the right format. That’s described in the section “Directory structure for distributed LARA” below. You can then add the metadata with an invocation of the following form:

python3 $LARA/Python/lara_run.py add_metadata <ResourceDir> <CorpusOrLanguage>

where <ResourceDir> is the directory with the resource and <CorpusOrLanguage> is either corpus or language. So for example, to add metadata to peter_rabbit, a corpus resource, you would do:

python3 $LARA/Python/lara_run.py add_metadata $LARA/Content/peter_rabbit corpus

while to add metadata to english, a language resource, you would do:

python3 $LARA/Python/lara_run.py add_metadata $LARA/Content/english language

Typical output looks like this (Python 3):

$ python3 $LARA/Python/LARA/lara_run.py add_metadata peter_rabbit corpus
--- Written JSON file peter_rabbit/audio/metadata.json
--- Checked audio directory peter_rabbit/audio/cathy. The 39 audio files match the metadata.
--- Single .txt file in corpus directory, peter_rabbit.txt. Assuming it is the corpus.
--- Written JSON file peter_rabbit/corpus/metadata.json
--- Written images metadata file for 26 images, peter_rabbit/images/metadata.txt
--- 2 translation files found in peter_rabbit/translations
--- Written JSON file peter_rabbit/translations/metadata.json

The script should give you informative error messages if the directory structure is incorrect.

Uploading the resource

Once you’ve added the metadata, you can upload the resource directory to any webserver. For example, in Geneva, we would probably do something like this:

scp -r peter_rabbit manny@isslnx1.unige.ch:/export/data/www/issco-site/en/research/projects/callector
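The remaining step, registering the resource, amounts to adding an entry to the resource file in the format shown earlier. You can do this by hand in a text editor. If you prefer to script it, a minimal sketch might look like the following; the function name is hypothetical and this is not a LARA command. The demonstration works on a throwaway file rather than on your real resource file.

```python
import json
import tempfile
from pathlib import Path

def register_corpus(resource_file, corpus_id, corpus_url, language_id):
    """Add a corpus entry to a resource file, refusing to overwrite silently."""
    path = Path(resource_file)
    resources = json.loads(path.read_text(encoding="utf-8"))
    if language_id not in resources:
        raise ValueError(f"language resource {language_id!r} is not registered")
    if corpus_id in resources:
        raise ValueError(f"{corpus_id!r} is already registered")
    resources[corpus_id] = [corpus_url, language_id]
    path.write_text(json.dumps(resources, indent=1), encoding="utf-8")
    return resources

# Demonstration on a temporary resource file containing one language resource.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "all_resources.json"
    demo_file.write_text(json.dumps({
        "english_geneva": ["https://www.issco.unige.ch/en/research/projects/callector/english",
                           "LanguageResource"],
    }), encoding="utf-8")
    updated = register_corpus(
        demo_file,
        "peter_rabbit",
        "https://www.issco.unige.ch/en/research/projects/callector/peter_rabbit",
        "english_geneva",
    )
    reloaded = json.loads(demo_file.read_text(encoding="utf-8"))
```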

Compiling LARA pages for a reading history

If all the resources exist and are registered, you can compile LARA pages for a reading history with a single call of the form:

python3 $LARA/Python/LARA/lara_run.py distributed <ConfigFile>

e.g.

python3 $LARA/Python/LARA/lara_run.py distributed $LARA/Content/reader1_english/distributed_config.json

The resulting LARA pages are as usual put in the $LARA/Content/compiled directory. They are much smaller than local content pages, since they do not contain any audio or image files, only links to remote multimedia content.

Directory structure for distributed LARA

In order to be able to distribute LARA content over many servers, the content for each individual resource needs to be organised using a uniform directory structure. This lets the LARA compiler find things when it is putting together pages for a reading history.

We have two kinds of LARA resource: corpus-specific and language-specific. A corpus-specific resource collects together all the data relevant to a particular corpus, e.g. Peter Rabbit. At a minimum, this will be the corpus text itself. It can also optionally contain segment audio, segment translations, and/or images. A language-specific resource collects together the data relevant to a particular language, e.g. English. This consists of word audio and/or word translations.

The directory structure for a corpus resource is as follows. The directories and metadata files need to have exactly these names:

                                                    <CorpusId>
                                                        |
            --------------------------------------------------------------------------------------------
            |                          |                           |                                    |
         corpus                     images                       audio                            translations
            |                          |                           |                                    |
      -------------               -----------------           ---------------------        -----------------------------
      |           |               |       |       |           |          |        |        |            |              |
metadata.json <corpus>.txt metadata.txt <image1> ...    metadata.json <voice1>   ...  metadata.json <translation>.csv ...
                                                                         |
                                                          ------------------------------
                                                          |            |       |       |
                                                 metadata_help.txt <audio1> <audio2>  ...

For an example of a corpus resource directory, look at $LARA/Content/peter_rabbit.
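The add_metadata script checks the directory structure for you, but a quick pre-check can be sketched in a few lines of Python. This is only an illustration under our own assumptions: it requires a corpus directory containing at least one .txt file, accepts the three optional directories from the tree above, and flags any other subdirectory. The function name is hypothetical.

```python
import tempfile
from pathlib import Path

OPTIONAL_DIRS = {"images", "audio", "translations"}

def check_corpus_resource_layout(resource_dir):
    """Return a list of layout problems for a corpus resource directory."""
    root = Path(resource_dir)
    if not root.is_dir():
        return [f"not a directory: {root}"]
    problems = []
    corpus_dir = root / "corpus"
    if not corpus_dir.is_dir():
        problems.append("missing directory: corpus")
    elif not list(corpus_dir.glob("*.txt")):
        problems.append("no .txt corpus file in corpus")
    for child in sorted(root.iterdir()):
        if child.is_dir() and child.name != "corpus" and child.name not in OPTIONAL_DIRS:
            problems.append(f"unexpected directory: {child.name}")
    return problems

# Demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "peter_rabbit"
    (root / "corpus").mkdir(parents=True)
    (root / "corpus" / "peter_rabbit.txt").write_text("text", encoding="utf-8")
    (root / "audio" / "cathy").mkdir(parents=True)
    good = check_corpus_resource_layout(root)   # no problems expected
    (root / "pictures").mkdir()                 # misnamed images directory
    bad = check_corpus_resource_layout(root)

print(good)
print(bad)
```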

The directory structure for a language resource is as follows. Again, the directories and metadata files need to have exactly these names:

                                <LanguageId>
                                     |
                   ----------------------------------------------
                   |                                             |
              translations                                     audio
                   |                                             |
    --------------------------------                      --------------------
    |              |                |                     |         |        |
metadata.json <translation>.csv    ...             metadata.json <voice1>   ...
                                                          |
                                              ------------------------------
                                              |          |        |        |
                                    metadata_help.txt <audio1> <audio2>  ...

For an example of a language resource directory, look at $LARA/Content/english.

Metadata for distributed LARA

You’ll have noticed that the directory trees above contain several metadata files. These are essential; without them, the LARA compiler will not be able to find the material it needs to download when it builds pages for a reading history. You shouldn’t normally need to know what the metadata looks like, since the “add_metadata” script does all this work for you if it is working correctly, but here are the details in case something goes wrong.

There are five kinds of metadata files. We again illustrate with Peter Rabbit.

Corpus metadata file

A corpus resource’s corpus directory needs to have a JSON-formatted metadata file called metadata.json. It should contain a list with one element giving the name of the corpus file. For Peter Rabbit, the content of the file is

[ "peter_rabbit.txt" ]

since the actual corpus file in the same directory is called peter_rabbit.txt.

Image metadata file

A corpus resource’s images directory needs to have a plain text metadata file called metadata.txt. It should contain the names of all the images in the directory, one per line. For Peter Rabbit, the content of the file is

01VeryBigFirTree.jpg
02YourFatherHadAnAccident.jpg
03DontGetIntoMischief.jpg
04OffToTheBaker.jpg
05GatheringBlackberries.jpg
06SqueezedUnderTheGate.jpg
07ThenHeAteSomeRadishes.jpg
08FeelingRatherSick.jpg
09WhomShouldHeMeet.jpg
10WavingARake.jpg
11AmongThePotatoes.jpg
12ABlueJacketWithBrassButtons.jpg
13FriendlySparrows.jpg
14WithASieve.jpg
15JumpedIntoACan.jpg
16TurnedThemOverCarefully.jpg
17UpsettingThreePlants.jpg
18VeryDamp.jpg
19ItWasLocked.jpg
20TheTipOfHerTailTwitched.jpg
21ClimbedUponAWheelbarrow.jpg
22BehindSomeBlackcurrantBushes.jpg
23AScarecrowToFrightenTheBlackbirds.jpg
24FloppedDownUponTheNiceSoftSand.jpg
25CamomileTea.jpg
26BreadAndMilkAndBlackberries.jpg
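If you ever need to regenerate this file by hand, the idea can be sketched as follows. This is not the add_metadata implementation; we assume here that every file in the images directory other than metadata.txt itself is an image, and the function name is ours.

```python
import tempfile
from pathlib import Path

def write_image_metadata(images_dir):
    """Write metadata.txt listing the files in images_dir, one name per line."""
    images_dir = Path(images_dir)
    names = sorted(
        p.name for p in images_dir.iterdir()
        if p.is_file() and p.name != "metadata.txt"
    )
    (images_dir / "metadata.txt").write_text(
        "".join(name + "\n" for name in names), encoding="utf-8"
    )
    return names

# Demonstration with two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "images"
    images.mkdir()
    (images / "02YourFatherHadAnAccident.jpg").touch()
    (images / "01VeryBigFirTree.jpg").touch()
    image_names = write_image_metadata(images)
    image_metadata = (images / "metadata.txt").read_text(encoding="utf-8")
```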

Translations metadata file

The translations directory for both a corpus resource and a language resource needs a JSON-formatted metadata file called metadata.json. It should contain a key/value list associating L1s with translation CSV files. For Peter Rabbit, the file looks like this:

{  "french":"english_french.csv" }

since english_french.csv is the segment translation file for French.
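To make the role of this file concrete, here is a small sketch of how the reader’s L1 (the "l1" value in the reading history config) could be used to pick a translation file. How the LARA compiler actually behaves when no file is listed for an L1 is not covered here; the function below is hypothetical and simply returns None in that case.

```python
import json

def translation_file_for(metadata_text, l1):
    """Return the CSV file listed for l1 in a translations metadata.json, or None."""
    metadata = json.loads(metadata_text)
    return metadata.get(l1)

peter_rabbit_metadata = '{  "french":"english_french.csv" }'

print(translation_file_for(peter_rabbit_metadata, "french"))  # prints english_french.csv
print(translation_file_for(peter_rabbit_metadata, "german"))  # prints None
```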

Top-level audio metadata file

The audio directory for both a corpus resource and a language resource needs a JSON-formatted metadata file called metadata.json. It should contain a list of the directories for the different speakers who have provided audio files. For Peter Rabbit, the file looks like this:

[ "cathy" ]

since there is just one directory of audio files, cathy.

For the Tína fer í frí resource, where two different people have contributed recordings, the metadata file looks like this:

[
"branislav",
"svavar_voice"
]
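The content of this file mirrors the subdirectories of the audio directory, so it can be rebuilt from them if necessary. The following is a minimal sketch of that idea, not the add_metadata implementation; the function name is ours, and we assume every subdirectory of the audio directory is a voice directory.

```python
import json
import tempfile
from pathlib import Path

def write_audio_metadata(audio_dir):
    """Write metadata.json listing the voice subdirectories of audio_dir."""
    audio_dir = Path(audio_dir)
    voices = sorted(p.name for p in audio_dir.iterdir() if p.is_dir())
    (audio_dir / "metadata.json").write_text(json.dumps(voices), encoding="utf-8")
    return voices

# Demonstration with the two voice directories from the Tína fer í frí example.
with tempfile.TemporaryDirectory() as tmp:
    audio = Path(tmp) / "audio"
    (audio / "svavar_voice").mkdir(parents=True)
    (audio / "branislav").mkdir()
    voices = write_audio_metadata(audio)
    audio_metadata = json.loads((audio / "metadata.json").read_text(encoding="utf-8"))

print(voices)  # prints ['branislav', 'svavar_voice']
```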

LDT audio metadata file

Each subdirectory in an audio directory also contains a plain-text metadata file called metadata_help.txt. The data from these files comes from LiteDevTools, and associates audio files with pieces of text. For example, the file for peter_rabbit/audio/cathy starts like this:

AudioOutput help any_speaker help/50768_181219203839.wav Once upon a time there were four little Rabbits, and their names were-- Flopsy, Mopsy, Cotton-tail, and Peter.# |
AudioOutput help any_speaker help/50769_181219201554.wav They lived with their Mother in a sand-bank, underneath the root of a very big fir-tree.# |
AudioOutput help any_speaker help/50770_181221174953.wav 'Now my dears,' said old Mrs. Rabbit one morning, 'you may go into the fields or down the lane, but don't go into Mr. McGregor's garden: your Father had an accident there; he was put in a pie by Mrs. McGregor.'# |
AudioOutput help any_speaker help/50771_181219201652.wav 'Now run along, and don't get into mischief. I am going out.'# |
AudioOutput help any_speaker help/50772_181219201712.wav Then old Mrs. Rabbit took a basket and her umbrella, and went through the wood to the baker's. She bought a loaf of brown bread and five currant buns.# |
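Judging only from the sample lines above, each line has four space-separated fields followed by the text, which ends with the marker "# |". Under that assumption, a line can be split into its audio file path and its text as sketched below. The function name is hypothetical, and the real LiteDevTools format may have cases this sketch does not handle.

```python
def parse_ldt_line(line):
    """Split one metadata_help.txt line into (audio path, text), or return None."""
    fields = line.rstrip("\n").split(" ", 4)
    if len(fields) != 5 or fields[0] != "AudioOutput":
        return None
    audio_path, text = fields[3], fields[4].strip()
    # Drop the trailing "# |" marker seen at the end of each sample line.
    if text.endswith("|"):
        text = text[:-1].rstrip()
    if text.endswith("#"):
        text = text[:-1]
    return audio_path, text

sample = ("AudioOutput help any_speaker help/50771_181219201652.wav "
          "'Now run along, and don't get into mischief. I am going out.'# |")

print(parse_ldt_line(sample))
```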