Internal documentation: top-level calls to Python

This section describes how to perform the command-line calls to LARA used by the LARA portal. Calls are of the general form:

python3 $LARA/Code/Python/lara_run_for_portal.py <CommandName> <Arg1> <Arg2> ...

where <CommandName> is the name of the command, and <Arg1> etc are the arguments. In the following, we will only give the command name and the arguments. So for example when we write:

add_metadata <ResourceDir> <Type>

this corresponds to the actual command:

python3 $LARA/Code/Python/lara_run_for_portal.py add_metadata <ResourceDir> <Type>

The following arguments should be interpreted thus:

  • <ConfigFile>. Pathname of a local config file, e.g. $LARA/Content/peter_rabbit/corpus/local_config.json

  • <ReaderId>. String used by distributed LARA to identify a reader, e.g. reader1.

  • <ResourceId>. String used by distributed LARA to identify a corpus resource, e.g. peter_rabbit.

  • <LanguageResourceId>. String used by distributed LARA to identify a language resource, e.g. english_geneva.

  • <L2>. String used by distributed LARA to identify a language, e.g. english.

  • <ReplyFile>. Pathname specifying the file used to pass back information, e.g. $LARA/tmp/reply1.json.

Segmenting text

Turn text into segmented text with the following call:

segment_file <UnsegmentedFile> <SegmentedFile> <L2>

The value of <L2> is used to choose the segmenter.
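
For example, a call might look like this (file names are illustrative):

segment_file $LARA/Content/peter_rabbit/corpus/unsegmented.txt $LARA/tmp/segmented.txt english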

For backward compatibility, the earlier form of the call

segment_file <UnsegmentedFile> <SegmentedFile>

is also supported, and will call the Punkt segmenter. However, this will give bad results for languages where Punkt is inappropriate, in particular varieties of Chinese.

Invoking TreeTagger

Commands for invoking TreeTagger. In the first version, the information about language, input file and output file is specified in the config file:

treetagger <ConfigFile>

In the second, it is specified explicitly:

treetagger_basic <Language> <InFile> <OutFile>

In both cases, the version of TreeTagger for <Language> is called to add tags to <InFile> and produce <OutFile>.
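
For example, a hypothetical call tagging an English file might be:

treetagger_basic english corpus_untagged.txt corpus_tagged.txt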

Performing multi-word expression annotation

You can perform multi-word expression (MWE) annotation on the corpus file referenced by <ConfigFile> using the command

mwe_annotate <ConfigFile>

The config file needs to contain an entry for mwe_file, pointing to a file of MWE definitions. This will produce a file of candidate MWE matches called <Id>_tmp_mwe_annotations.json. If you want to copy the file of candidate MWE matches to a specific location, use the variant

mwe_annotate_and_copy <ConfigFile> <MWEAnnotationsFile>

You then need to edit the generated file to mark the correct matches: if a match is correct, change mwe_status_unknown to mwe_okay, and if it is not, change it to mwe_not_okay. Then copy the annotated file to the file referenced by mwe_annotations_file. The matches marked as correct can then be inserted into the corpus file with a call of the form:

apply_mwe_annotations <ConfigFile>

Two new files will be written to the tmp_resources directory: <Id>_mwe_processed_corpus.txt/docx (the new version of the corpus), and <Id>_mwe_trace.html (a human-readable trace file showing the changes made). Again, if you want to copy the transformed file and the trace files to named locations, use the variant

apply_mwe_annotations_and_copy <ConfigFile> <TransformedCorpusFile> <TraceFile>

You can check the well-formedness of a txt-format file of MWE definitions with a call of the form

check_mwe_defs <MWEDefsFile>
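
For example (file name illustrative):

check_mwe_defs mwe_defs.txt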

You can convert a txt-format file of MWE definitions to JSON format with a call of the form

mwe_txt_defs_to_json <MWEDefsFileTxt> <MWEDefsFileJSON>

You can convert a JSON-format file of MWE definitions to txt format with a call of the form

mwe_json_defs_to_txt <MWEDefsFileJSON> <MWEDefsFileTxt>
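
For example, the following illustrative pair of calls converts a txt-format definitions file to JSON and back:

mwe_txt_defs_to_json mwe_defs.txt mwe_defs.json
mwe_json_defs_to_txt mwe_defs.json mwe_defs_roundtrip.txt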

The JSON format generated by mwe_txt_defs_to_json is as exemplified here:

{
  "classes": {
      "di": [ "di", "d'" ],
      (...)
  },
  "mwes": {
      "ACCENDERE si": { "name": "accendersi", "pos": "V" },
      (...)
      }
  },
  "transforms": [
      "*verb* si -> si *verb*",
      (...)
  ]
}

Note that the values of "mwes" and "classes" are dicts, but the value of "transforms" is a list.

To update an MWE lexicon file by merging it with another MWE lexicon file, use a call of the form

merge_update_mwe_defs <MWEDefsFileMain> <MWEDefsFileUpdateMaterial> <ConfigFile>

<ConfigFile> is only required to define the tmp directories.
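
Example (file names illustrative):

merge_update_mwe_defs mwe_defs_main.txt mwe_defs_new_material.txt local_config.json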

Performing the “resources” step

Commands for performing the “resources” step of LARA compilation, which creates the recording and translation files. In the first version, these are put in the directory $LARA/tmp_resources:

resources <ConfigFile>

In the second version, the txt versions of the two recording files and the CSV versions of the word and segment translation files are then copied to the designated locations:

resources_basic <ConfigFile> <WordRecordingFile> <SegmentRecordingFile> <WordTranslationCSV> <SegmentTranslationCSV>

If you are using the combined “surface_word_type” and “surface_word_token” translation model (see the section on LARA translation spreadsheets), <WordTranslationCSV> in the above call will be the “surface_word_type” spreadsheet. You will then also need to perform the call:

resources_basic_tokens <ConfigFile> <WordTranslationCSV>

to produce the word-token based word translation spreadsheet.

In the third version, all the generated files are copied to locations which add suffixes to a designated base file name. The call is of the form:

resources_and_copy <ConfigFile> <BaseFileName>

for example:

resources_and_copy local_config.json $LARA/tmp/sample_english_tokens/

and the suffixes used are as follows:

  • ldt_word_recording_full.txt. Text form of word recording file (all items, including ones already recorded).

  • ldt_word_recording_full.json. JSON form of word recording file (all items, including ones already recorded).

  • ldt_segment_recording_full.txt. Text form of segment recording file (all items, including ones already recorded).

  • ldt_segment_recording_full.json. JSON form of segment recording file (all items, including ones already recorded).

  • segment_translations.csv. CSV form of segment translation file.

  • segment_translations.json. JSON form of segment translation file.

  • word_translations.csv. CSV form of lemma translations file.

  • word_translations.json. JSON form of lemma translations file (includes examples for each lemma).

  • word_translations_surface_type.csv. CSV form of surface word translations file.

  • word_translations_surface_type.json. JSON form of surface word translations file (includes examples for each surface word).

  • word_translations_tokens.csv. CSV form of word tokens translation file.

  • word_translations_tokens.json. JSON form of word tokens translation file.

  • notes.csv. CSV form of lemma notes file.

  • notes.json. JSON form of lemma notes file (includes examples for each lemma).

  • split.json. Internalised JSON form of text.

The fourth version is like the third one, except that a timestamped zipfile containing all the copied files is also produced and saved to a log directory:

resources_and_copy_and_log_zipfile <ConfigFile> <BaseFileName> <LogDir>
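
For example, extending the resources_and_copy example above with a hypothetical log directory:

resources_and_copy_and_log_zipfile local_config.json $LARA/tmp/sample_english_tokens/ $LARA/tmp/logs/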

Creating word token translation files from surface word translation files

There is a specialised form of resources_and_copy above which only creates the word tokens translation files, constructing them by filling in the values from the word type file. Values in the current word token file are ignored.

The call is of the form:

tokens_from_types_and_copy <ConfigFile> <BaseFileName>

for example:

tokens_from_types_and_copy local_config.json $LARA/tmp/

Performing the “word pages” step

Command for performing the “word pages” step of LARA compilation. All the arguments are taken from the config file. The resulting pages are put in the $LARA/compiled directory:

word_pages <ConfigFile>
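
For example:

word_pages $LARA/Content/peter_rabbit/corpus/local_config.json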

Processing LDT output

Commands for processing zipfiles downloaded from LiteDevTools. The simplest version is the following:

install_ldt_zipfile <Zipfile> <RecordingScript> <Type> <ConfigFile> [ <BadMetadataFile> ]

where <Zipfile> is the downloaded zipfile, <RecordingScript> is the JSON-formatted recording script passed to LDT to create the recordings, and <Type> is either words or segments. The zipfile is unzipped and the contents converted to mp3 format, after which the files are copied to the places specified by the config file.

The optional argument <BadMetadataFile> should be the name of a file with a .json extension, which will be used to pass back metadata for any files which failed to unpack correctly. The format will be a list of dicts, each of which will have the fields file and text. If this argument is not supplied, the list of bad files will be written out to a file in the tmp_resources directory.

For backward compatibility, we also have the call:

audio_dir_to_mp3 <AudioDir>

This assumes that the zipfile has already been unzipped to <AudioDir>. It converts the data and metadata to mp3 form, putting it in the directory <AudioDir>_mp3.

Creating LARA audio using a TTS engine

There is support for using TTS engines to create LARA audio. So far, engines supported are ReadSpeaker, Google TTS and ABAIR. In order to use ReadSpeaker, you need to have a valid ReadSpeaker license key in the UTF-8 text file $LARA/Code/Python/readspeaker_license_key.txt.

You can create TTS audio using the same “record_words” and “record_segments” files as for human audio recorded using LiteDevTools (see preceding section). The command-line call is of the form:

create_tts_audio <RecordingScriptFile> <ConfigFile> <Zipfile>

for example

create_tts_audio $LARA/tmp_resources/sample_english_surface_tokens_tts_record_segments.json local_config_tts.json readspeaker_segments.zip

The config file must specify the TTS engine using the parameter tts_engine, e.g.

"tts_engine": "abair",

If there is more than one voice available for the TTS engine and language, the voice can be specified using the parameter tts_voice, e.g.

"tts_voice": "ga_UL_anb_nnmnkwii",

The default is to use the first voice in the relevant list from lara_config._tts_info.

The last argument in the command-line call is the zipfile of TTS audio files to be produced. It will be in the same format as the ones downloaded from LDT, and will in particular contain similarly formatted metadata. It is consequently possible to install it using the install_ldt_zipfile command immediately above, e.g.:

install_ldt_zipfile readspeaker_segments.zip $LARA/tmp_resources/sample_english_surface_tokens_tts_record_segments.json segments local_config_tts.json

You can find out which TTS engines and voices are available for a language with a command-line call of the form:

get_tts_engines <Lang> <ResultFile>

for example

$ python3 $LARA/Code/Python/lara_run_for_portal.py get_tts_engines english tts_engines.txt
--- Written JSON file tts_engines.txt
--- lara_run_for_portal command executed (0.0253 secs)

--- All timings:
--- lara_run_for_portal command executed (0.0253 secs)

$ cat tts_engines.txt
[
  {
      "engine": "readspeaker",
      "voices": [
          "Alice-DNN"
      ]
  },
  {
      "engine": "google_tts",
      "voices": [
          "default"
      ]
  }
]

Adding metadata for distributed LARA

The call:

add_metadata <ResourceDir> <Type>

where <Type> is “corpus” or “language”, processes a resource directory to add the metadata needed for distributed LARA. Example:

add_metadata $LARA/Content/peter_rabbit corpus

The directory with added metadata can then be uploaded to a webserver.

Merging two language resource directories

The call:

merge_language_resources <Dir1> <Dir2> <DirMerged> <ConfigFile>

merges the two language resource directories <Dir1> and <Dir2> to create <DirMerged>. <ConfigFile> is needed to specify where tmp files will be created. Example:

merge_language_resources $LARA/Content/english $LARA/Content/english2 $LARA/Content/english3 $LARA/Content/peter_rabbit/corpus/local_config.json

Merging two translation spreadsheets

The call:

merge_update_translation_spreadsheet <CSVMain> <CSVNewMaterial>

updates the translation spreadsheet <CSVMain> using the contents of <CSVNewMaterial> and then overwrites <CSVMain> with the result. It should not be possible for two concurrent calls to result in one set of updates overwriting another; <CSVMain> is locked by the first call until it has completed, and the second call cannot be made until the lock is released.

Example:

merge_update_translation_spreadsheet english_french.csv english_french_new.csv

Finding deleted lines in a pair of translation spreadsheets

When checking to see if data has accidentally been deleted, it can be useful to take two related translation spreadsheets and find the number of lines in the second one that have the same key as a line in the first, but a null value. There are two versions, one for two-column spreadsheets (lemma, types, segments, notes), and one for word token spreadsheets. The calls have the forms:

find_deleted_lines_in_translation_spreadsheets <CSVOld> <CSVNew> <AnswerFile>

find_deleted_lines_in_word_token_spreadsheets <CSVOld> <CSVNew> <AnswerFile>

Typical calls look like this:

$ python3 $LARA/Code/Python/lara_run_for_portal.py find_deleted_lines_in_translation_spreadsheets le_petit_prince_mwe_tmp_translations_surface_type.csv le_petit_prince_mwe_tmp_translations_surface_type_deleted_lines.csv answer1.json
--- Read CSV spreadsheet as utf-8-sig (2442 lines) C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp_resources/le_petit_prince_mwe_tmp_translations_surface_type.csv
--- Read CSV spreadsheet as utf-8-sig (2441 lines) C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp_resources/le_petit_prince_mwe_tmp_translations_surface_type_deleted_lines.csv
--- Written JSON file answer1.json
--- Written answer (6 lines deleted) to answer1.json
--- lara_run_for_portal command executed (0.0119 secs)

--- All timings:
--- lara_run_for_portal command executed (0.0119 secs)

$  python3 $LARA/Code/Python/lara_run_for_portal.py find_deleted_lines_in_word_token_spreadsheets  le_petit_prince_mwe_tmp_translations_token.csv  le_petit_prince_mwe_tmp_translations_token_deleted_lines.csv answer1.json
--- Read CSV spreadsheet as utf-8-sig (5584 lines) C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp_resources/le_petit_prince_mwe_tmp_translations_token.csv
--- Read CSV spreadsheet as utf-8-sig (5583 lines) C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp_resources/le_petit_prince_mwe_tmp_translations_token_deleted_lines.csv
--- Written JSON file answer1.json
--- Written answer (4 lines deleted) to answer1.json
--- lara_run_for_portal command executed (0.173 secs)

--- All timings:
--- lara_run_for_portal command executed (0.173 secs)

Getting voices and L1s for a resource

The call:

get_voices_and_l1s_for_resource <ResourceId> <ConfigFile> <ReplyFile>

downloads voice and L1 metadata from the corpus or language resource <ResourceId> defined by the resource file in <ConfigFile> and writes out the result to <ReplyFile>. Example:

get_voices_and_l1s_for_resource peter_rabbit distributed_config.json reply1.json

The contents of <ReplyFile> will look something like this:

{
  "l1s": [
      "french",
      "russian"
  ],
  "voices": [
      "cathyc"
  ]
}

Getting voices and L1s for a resource file

The call:

get_voices_and_l1s_for_resource_file <ConfigFile> <ReplyFile>

downloads voice and L1 metadata for all the resources defined by the resource file in <ConfigFile> and writes out the result to <ReplyFile>. Example:

get_voices_and_l1s_for_resource_file distributed_config.json reply1.json

The contents of <ReplyFile> will look something like this:

{
  "Arash": {
      "l1s": [],
      "voices": [
          "hanieh"
      ]
  },
  "EbneSina": {
      "l1s": [],
      "voices": [
          "hanieh"
      ]
  },
  "alice_in_wonderland": {
      "l1s": [],
      "voices": [
          "cathyc"
      ]
  },
  "bozboz_ghandi": {
      "l1s": [],
      "voices": [
          "hanieh"
      ]
  },
  "dante": {
      "l1s": [
          "english"
      ],
      "voices": [
          "sabina"
      ]
  },
  (...)
}

Counting the audio and translation files for a corpus resource

The call:

count_audio_and_translation_files <ConfigFile> <ReplyFile>

extracts figures for available and missing audio and translation files for the resource defined by <ConfigFile> and writes them out to <ReplyFile>. Example:

$ python3 $LARA/Code/Python/lara_run_for_portal.py count_audio_and_translation_files $LARA/Content/peter_rabbit/corpus/local_config.json result1.json
--- Environment variables and working directories look okay
--- Read LARA text file as utf-8-sig (41 lines) C:/cygwin64/home/sf/callector-lara-svn/trunk/Content/peter_rabbit/audio/cathyc/metadata_help.txt
--- Read LARA text file as utf-8-sig (2736 lines) C:/cygwin64/home/sf/callector-lara-svn/trunk/Content/english/audio/cathyc/metadata_help.txt
--- Read CSV spreadsheet as utf-8-sig (40 lines) C:/cygwin64/home/sf/callector-lara-svn/trunk/Content/peter_rabbit/translations/english_french.csv
--- Loaded segment translation spreadsheet (39 records) $LARA/Content/peter_rabbit/translations/english_french.csv
--- Read CSV spreadsheet as utf-8-sig (1209 lines) C:/cygwin64/home/sf/callector-lara-svn/trunk/Content/english/translations/english_french.csv
--- Loaded word translation spreadsheet $LARA/Content/english/translations/english_french.csv
--- Written JSON file result1.json
--- lara_run_for_portal command executed (0.0871 secs)

--- All timings:
--- lara_run_for_portal command executed (0.0871 secs)

$ more result1.json
{
    "segments": {
        "not_recorded": 2,
        "not_translated": 2,
        "recorded": 39,
        "translated": 39
    },
    "words": {
        "not_recorded": 0,
        "not_translated": 0,
        "recorded": 384,
        "translated": 356
    }
}

Downloading a resource

The call:

download_resource <URL> <Dir>

downloads the corpus or language resource at <URL> and puts the contents in the directory <Dir>, which is deleted first. Example:

download_resource https://www.issco.unige.ch/en/research/projects/callector/peter_rabbit $LARA/tmp/peter_rabbit_downloaded

Trace output is printed showing the status of each file.

Exporting a corpus resource as a zipfile

The call:

make_export_zipfile <SourceConfigFile> <TargetZipfile>

creates <TargetZipfile>, which should contain all the data necessary for the corpus resource defined by <SourceConfigFile>. Example of a successful call:

$ python3 $LARA/Code/Python/lara_run_for_portal.py make_export_zipfile local_config.json $LARA/tmp/peter_rabbit_export.zip
--- Environment variables and working directories look okay
--- Created directory C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9673357461669683
--- Created directory C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9673357461669683/peter_rabbit
--- Created directory C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9673357461669683/peter_rabbit/audio
--- Created directory C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9673357461669683/peter_rabbit/corpus
--- Created directory C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9673357461669683/peter_rabbit/images
--- Created directory C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9673357461669683/peter_rabbit/translations
--- Copying config file
--- Copying segment audio directory
--- Created directory C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9673357461669683/peter_rabbit/audio/cathyc
--- Copied 40 files
--- Copying segment translation spreadsheet
--- Processing word audio directory
--- Processing word translation spreadsheet
--- Copying image directory
--- Copied 26 files
--- Creating config file
--- Written JSON file C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9673357461669683/peter_rabbit/corpus/local_config.json
--- Copying CSS and JS files
--- Copied 0 files
--- Making zipfile
--- Zipped up C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9673357461669683 as C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/peter_rabbit_export.zip
--- Deleted directory C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9673357461669683
--- lara_run_for_portal command executed (0.773 secs)

--- All timings:
--- lara_run_for_portal command executed (0.773 secs)

Importing a corpus resource from a zipfile

The call:

import_zipfile <Zipfile> <CorpusDir> <LanguageRootDir> <ConfigFile>

unpacks <Zipfile>, created by the make_export_zipfile operation immediately above, and installs it as the new directory <CorpusDir>. <LanguageRootDir> specifies the root directory for the associated language resource, i.e. the directory above all the language resources: it is assumed that the language resource directory will be <LanguageRootDir>/<Language>, where <Language> is the value of the parameter "language" in the config file from <Zipfile>. <ConfigFile> is a local config file needed only to define the tmp directories; it does not supply any other information. Example of a successful call:

$ python3 $LARA/Code/Python/lara_run_for_portal.py import_zipfile $LARA/tmp/peter_rabbit_export.zip $LARA/tmp/peter_rabbit_imported $LARA/Content $LARA/Content/peter_rabbit/corpus/local_config.json
--- Created directory C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9702242416447959
--- Unzipped C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/peter_rabbit_export.zip to C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9702242416447959
--- Updated config file to match corpus dir C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/peter_rabbit_imported and language dir C:/cygwin64/home/sf/callector-lara-svn/trunk/Content/english
--- Written JSON file C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/peter_rabbit_imported/corpus/local_config.json
--- Deleted directory C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_9702242416447959
--- Unpacked and installed zipfile to C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/peter_rabbit_imported
--- lara_run_for_portal command executed (0.455 secs)

--- All timings:
--- lara_run_for_portal command executed (0.455 secs)

Checking well-formedness of a config file

The call:

check_config_file <ConfigFile> <Type> <ReplyFile>

checks the well-formedness of <ConfigFile> as a config file of type <Type>, where <Type> is either local or distributed, and writes JSON-formatted information to <ReplyFile>. Examples:

$ python3 $LARA/Code/Python/lara_run_for_portal.py check_config_file $LARA/Content/peter_rabbit/corpus/local_config.json local reply.json

$ cat reply.json
{
    "status": "good"
}

$ python3 $LARA/Code/Python/lara_run_for_portal.py check_config_file $LARA/Content/dante/corpus/local_config.json local reply.json

$ cat reply.json
{
    "status": "bad",
    "unknown_keys": [
        "popup_audio",
        "max_word_pages"
    ]
}

$ python3 $LARA/Code/Python/lara_run_for_portal.py check_config_file $LARA/Content/reader1_english/distributed_config.json distributed reply.json

$ cat reply.json
{
    "status": "good"
}

$ python3 $LARA/Code/Python/lara_run_for_portal.py check_config_file $LARA/Content/peter_rabbit/corpus/local_config.json distributed reply.json

$ cat reply.json
{
    "missing_items": [
        "resource_file",
        "reading_history"
    ],
    "status": "bad"
}

Checking validity of a LARA ID

The call:

check_lara_id <Id> <ReplyFile>

checks whether <Id> is a valid LARA ID and writes feedback to <ReplyFile>. Examples:

$ python3 $LARA/Code/Python/lara_run_for_portal.py check_lara_id "test123" reply.json
--- Written JSON file reply.json
--- lara_run_for_portal command executed (0.00199 secs)

--- All timings:
--- lara_run_for_portal command executed (0.00199 secs)

$ cat reply.json
true

$ python3 $LARA/Code/Python/lara_run_for_portal.py check_lara_id "test12 3" reply.json
--- Written JSON file reply.json
--- lara_run_for_portal command executed (0.00599 secs)

--- All timings:
--- lara_run_for_portal command executed (0.00599 secs)

$ cat reply.json
"Incorrect LARA ID \"test12 3\". A LARA ID can only include letters, numbers, - and _, and must start with a letter"

Creating flashcards

The call:

make_flashcards <ConfigFile> <FlashcardType> <NCards> <Level> <POS> <UserId> <Strategy> <OutFile>

makes a set of <NCards> flashcards for the text defined by <ConfigFile> and puts the result in the JSON file <OutFile>.

  • <FlashcardType> must be one of the flashcard types returned by get_possible_flashcard_types below.

  • <Level> must be one of [ beginner, advanced, intermediate, multiword_expressions ].

  • <POS> must be one of [ nouns, pronouns, adjectives, verbs, adverbs, prepositions, numerals, any ].

  • <UserId> is a portal userid.

  • <Strategy> must be one of [ default, retry_failed_questions ].
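
For example, a call might look like this (the user ID and file names are illustrative):

make_flashcards local_config.json lemma_translation_ask_l2 10 beginner any user123 default flashcards.json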

The content of <OutFile> is a list of records of the form:

{ 'question': <Question>,
  'answer': <Answer>,
  'distractors': <ListOfDistractors> }
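
A purely illustrative instance, for a hypothetical english/french lemma translation card, might be:

{ 'question': 'rabbit',
  'answer': 'lapin',
  'distractors': [ 'chat', 'chien', 'souris' ] }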

Getting possible flashcard types

The call:

get_possible_flashcard_types <ReplyFile>

writes out a JSON-formatted list of the possible flashcard types to <ReplyFile>. Example:

$ python3 $LARA/Code/Python/lara_run_for_portal.py get_possible_flashcard_types reply.json
--- Written JSON file reply.json
--- lara_run_for_portal command executed (0.0489 secs)

--- All timings:
--- lara_run_for_portal command executed (0.0489 secs)

$ cat reply.json
[
  "lemma_translation_ask_l2",
  "token_translation_ask_l2",
  "token_translation_ask_l2_audio",
  "signed_video_ask_l2",
  "sentence_with_gap"
]

Structured diff on tagged corpus

The call:

diff_tagged_corpus <OldTaggedCorpus> <ConfigFile> <Zipfile>

does a structured diff between an old version of the tagged corpus in <ConfigFile> and the current one, and writes out the results to <Zipfile>. Example:

$ python3 $LARA/Code/Python/lara_run_for_portal.py diff_tagged_corpus Tagged_nasrettin_v1.txt local_config.json diff.zip
--- Environment variables and working directories look okay
--- Written JSON file C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_4651045318227005.json
--- Environment variables and working directories look okay
--- Read LARA text file as utf-8-sig (619 lines) C:/cygwin64/home/sf/callector-lara-svn/trunk/Content/nasrettin_large/corpus/Tagged_nasrettin_v1.txt
(... something in an encoding that can't be written out ...)
--- Written JSON file $LARA/tmp_resources/nasrettin_large_old_split.json
(... something in an encoding that can't be written out ...)
--- Written LARA text file C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp_resources/nasrettin_large_old_tagging_feedback.txt
--- Environment variables and working directories look okay
--- Read LARA text file as utf-8-sig (617 lines) C:/cygwin64/home/sf/callector-lara-svn/trunk/Content/nasrettin_large/corpus/Tagged_nasrettin_v2.txt
(... something in an encoding that can't be written out ...)
--- Written JSON file $LARA/tmp_resources/nasrettin_large_split.json
(... something in an encoding that can't be written out ...)
--- Written LARA text file C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp_resources/nasrettin_large_tagging_feedback.txt
--- Created directory C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_2977082621110262
--- Found 2575 items in $LARA/tmp_resources/nasrettin_large_old_split.json
--- Found 2523 items in $LARA/tmp_resources/nasrettin_large_split.json
--- Performing diff (surface)...
... done
--- 162 lines in diff
--- Written LARA text file C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_2977082621110262/surface_diff.txt
--- Found 2575 items in $LARA/tmp_resources/nasrettin_large_old_split.json
--- Found 2523 items in $LARA/tmp_resources/nasrettin_large_split.json
--- Performing diff (lemma)...
... done
--- 326 lines in diff
--- Written LARA text file C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_2977082621110262/lemma_diff.txt
--- Found 2575 items in $LARA/tmp_resources/nasrettin_large_old_split.json
--- Found 2523 items in $LARA/tmp_resources/nasrettin_large_split.json
--- Performing diff (surface_and_lemma)...
... done
--- 326 lines in diff
--- Written LARA text file C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_2977082621110262/surface_and_lemma_diff.txt
--- Written JSON file C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_2977082621110262/summary.json
--- Zipped up C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_2977082621110262 as diff.zip
--- Deleted directory C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/lara_tmp_dir_2977082621110262

SUMMARY

Surface words changed: 107
Lemmas changed:        189
Full details:          diff.zip
--- lara_run_for_portal command executed (1.93 secs)

--- All timings:
--- lara_run_for_portal command executed (1.93 secs)

Unzipping a file

The call:

unzip <Zipfile> <Dir>

unzips <Zipfile> and puts the result in <Dir>. Example of a successful call:

$ python3 $LARA/Code/Python/lara_run_for_portal.py unzip ziptest.zip ziptarget
--- Unzipped ziptest.zip to ziptarget
--- lara_run_for_portal command executed (0.0156 secs)

--- All timings:
--- lara_run_for_portal command executed (0.0156 secs)

Example of an unsuccessful call:

$ python3 $LARA/Code/Python/lara_run_for_portal.py unzip local_config_hanieh.json ziptarget
*** Error: something went wrong when trying to unzip local_config_hanieh.json to ziptarget
File is not a zip file
--- lara_run_for_portal command executed (0.0 secs)

--- All timings:
--- lara_run_for_portal command executed (0.0 secs)

(This is provided to give the portal easy access to Python library functionality.)

Streamed download of binary files

The call:

streamed_download_binary_file <URL> <Pathname>

performs a streamed download of the binary file at <URL> and puts the result in <Pathname>. Example of a successful call:

$ python3 $LARA/Code/Python/lara_run_for_portal.py streamed_download_binary_file https://www.issco.unige.ch/en/research/projects/callector/peter_rabbit/images/01VeryBigFirTree.jpg $LARA/tmp/01VeryBigFirTree.jpg
--- Downloaded file C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/01VeryBigFirTree.jpg
--- lara_run_for_portal command executed (2.86 secs)

--- All timings:
--- lara_run_for_portal command executed (2.86 secs)

Example of an unsuccessful call:

$ python3 $LARA/Code/Python/lara_run_for_portal.py streamed_download_binary_file https://www.issco.unige.ch/en/research/projects/callector/peter_rabbit/images/NotAFile.jpg $LARA/tmp/NotAFile.jpg
*** Error: unable to download from https://www.issco.unige.ch/en/research/projects/callector/peter_rabbit/images/NotAFile.jpg
--- lara_run_for_portal command executed (1.46 secs)

--- All timings:
--- lara_run_for_portal command executed (1.46 secs)

Note that this will in general produce inconsistent results for text files, but you will not receive an error message. The call is intended for efficient download of large binary files, typically zipfiles.

(This is provided to give the portal easy access to Python library functionality.)

Converting a CSV file into a JSON file

The call:

csv_to_json <CSVFile> <JSONFile>

converts <CSVFile> into <JSONFile>. <JSONFile> will be a list of lists. Example:

$ more toy1.csv
a       1
b       2

$ python3 $LARA/Code/Python/lara_run_for_portal.py csv_to_json toy1.csv toy1.json
--- Read CSV spreadsheet as utf-8-sig (2 lines) C:/cygwin64/home/sf/callector-lara-svn/trunk/tmp/toy1.csv
--- Written JSON file toy1.json
--- lara_run_for_portal command executed (0.00295 secs)

--- All timings:
--- lara_run_for_portal command executed (0.00295 secs)

$ more toy1.json
[
   [
      "a",
      "1"
   ],
   [
      "b",
      "2"
   ]
]

Converting a word token CSV file into a JSON file

The call:

word_token_csv_to_json <CSVFile> <JSONFile>

converts <CSVFile>, a word token CSV file, into <JSONFile>. <JSONFile> will be a list of three-element lists.
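
For example (the file names follow the suffix conventions listed earlier, but any names can be used):

word_token_csv_to_json word_translations_tokens.csv word_translations_tokens.json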

Converting a JSON file into a CSV file

The call:

json_to_csv <JSONFile> <CSVFile>

converts <JSONFile> into <CSVFile>. <JSONFile> must be a list of lists, all of whose elements are strings or numbers. Example:

$ more toy1.json
[
   [
      "a",
      "1"
   ],
   [
      "b",
      "2"
   ]
]

$ python3 $LARA/Code/Python/lara_run_for_portal.py json_to_csv toy1.json toy2.csv
--- Written CSV spreadsheet (2 lines) toy2.csv
--- lara_run_for_portal command executed (0.0 secs)

--- All timings:
--- lara_run_for_portal command executed (0.0 secs)

$ more toy2.csv
a       1
b       2

Converting a word token JSON file into a CSV file

The call:

word_token_json_to_csv <JSONFile> <CSVFile>

converts <JSONFile>, a word token translation file, into <CSVFile>. <JSONFile> must be a list of three-element lists, and all the elements of these lists must be strings. <CSVFile> will be a word token translation file in the usual format, containing groups of three lines with intervening separators.

Converting a type or lemma JSON file into a CSV file

The call:

word_type_or_lemma_json_to_csv <JSONFile> <CSVFile>

converts <JSONFile>, a word type or lemma translation file, into <CSVFile>. <JSONFile> must be a list of four-element lists, and the first two elements of these lists must be strings. <CSVFile> will be a word type or lemma translation file in the usual two-column format.
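
For example (file names illustrative):

word_type_or_lemma_json_to_csv word_translations.json word_translations.csv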

Checking the status of the Concraft server for Polish

The call:

check_concraft_server_status <ConfigFile> <ReplyFile>

checks the status of the Concraft Polish tagging server and writes it to <ReplyFile>. The possible values are “Okay”, “*** Error: Concraft server appears to be down”, “*** Error: unable to run Morfeusz2”, “*** Error: unable to initialise” and “*** Error: not installed”.

<ConfigFile> is as usual needed to say where to put tmp files.

Example, running on LARA server machine (note that the environment variable MORFEUSZPYTHON needs to be set):

[Concraft server is up]

manny@isslnx1:~/tmp$ export LARA="/export/data/www/issco-site/en/research/projects/LARA-portal-stage/trunk"
manny@isslnx1:~/tmp$ export MORFEUSZPYTHON="python3.5"
manny@isslnx1:~/tmp$ python3.7 $LARA/Code/Python/lara_run_for_portal.py check_concraft_server_status /export/data/www/LARA-data-stage/63_Genesis_1_in_French/corpus/local_config.json reply.txt
--- Morpheusz will be run under python3.5
--- Environment variables and working directories look okay
--- Written LARA text file /export/data/www/LARA-data-stage/WorkingTmpDirectory/lara_tmp_6418943845418531.txt
--- Read LARA text file as utf-8-sig: /export/data/www/LARA-data-stage/WorkingTmpDirectory/lara_tmp_6418943845418531.txt
--- Written JSON file /export/data/www/LARA-data-stage/WorkingTmpDirectory/lara_tmp_9989410359646069.json
--- Successfully called Morfeusz through python3.5
--- Written LARA text file /home/manny/tmp/reply.txt
--- lara_run_for_portal command executed (0.17 secs)

--- All timings:
--- lara_run_for_portal command executed (0.17 secs)
manny@isslnx1:~/tmp$ more reply.txt
Okay

[Concraft server is down]

manny@isslnx1:~/tmp$  python3.7 $LARA/Code/Python/lara_run_for_portal.py check_concraft_server_status /export/data/www/LARA-data-stage/63_Genesis_1_in_French/corpus/local_config.json reply.txt
--- Morpheusz will be run under python3.5
--- Environment variables and working directories look okay
--- Written LARA text file /export/data/www/LARA-data-stage/WorkingTmpDirectory/lara_tmp_1242017235091182.txt
--- Read LARA text file as utf-8-sig: /export/data/www/LARA-data-stage/WorkingTmpDirectory/lara_tmp_1242017235091182.txt
--- Written JSON file /export/data/www/LARA-data-stage/WorkingTmpDirectory/lara_tmp_9447915553448311.json
--- Successfully called Morfeusz through python3.5
--- Written LARA text file /home/manny/tmp/reply.txt
--- lara_run_for_portal command executed (0.177 secs)

--- All timings:
--- lara_run_for_portal command executed (0.177 secs)
manny@isslnx1:~/tmp$ more reply.txt
*** Error: Concraft server appears to be down

Crowdsourcing a project

The call:

cut_up_project <ConfigFile> <ZipfileForCrowdsourcing>

takes as input the project defined by <ConfigFile>, whose corpus file should contain at least one instance of the string <cut>. It creates <ZipfileForCrowdsourcing>, a zipfile of export zipfiles, with one export zipfile for each component of the cut-up corpus file.
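
Example (zipfile name illustrative):

cut_up_project local_config.json $LARA/tmp/peter_rabbit_components.zip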

The call:

stick_together_projects <FileWithListOfConfigFiles> <TargetConfigFile>

performs the converse operation. It takes as input <FileWithListOfConfigFiles>, which should be a JSON file containing an ordered list of the config files which together represent the components of a cut-up project, and the project defined by <TargetConfigFile>. The resources in the projects listed in <FileWithListOfConfigFiles> will be combined and inserted into the project defined by <TargetConfigFile>. The following resources are collected:

  • Corpus. The corpus files are concatenated in the order specified by <FileWithListOfConfigFiles>.

  • Segment translation files. The segment translation files are concatenated in the order specified by <FileWithListOfConfigFiles>.

  • Word token translation files. The word token translation files are concatenated in the order specified by <FileWithListOfConfigFiles>.

  • Segment audio files. The segment audio files for the component projects are copied to the segment audio directory for the target project, and the metadata is combined.

  • MWE annotations. The MWE annotations for the component projects are combined and copied to the target project.

Note that shared resources (lemma and word type translations, and word audio files) are not processed by this operation.
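
For example, a hypothetical sticking-together call might be:

stick_together_projects component_config_files.json $LARA/Content/peter_rabbit/corpus/local_config.json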

Downloading Picturebook information from the Selector Tool

The call:

get_and_store_selector_tool_data <ConfigFile>

tries to download Selector Tool data and copy it to the appropriate place.

If there is word location data stored for the project in the Selector Tool database, it is copied to a file referenced from the config file’s picturebook_word_locations_file parameter. If necessary, a new file name is generated and added to the config file. Also, the Selector Tool DB ID for the project is stored in the config file’s selector_tool_id parameter.

Uploading Picturebook information to the Selector Tool

The call:

copy_tmp_picturebook_data_into_place <ConfigFile> <TmpWordLocationsFile> <TmpWordLocationsZipfile> <Dir>

copies information from the <TmpWordLocationsFile> and <TmpWordLocationsZipfile> files produced by the Resources phase to places where the Selector Tool needs them.

The images and word location data from <TmpWordLocationsZipfile> are copied to <Dir>, which should be the web directory where the Selector Tool expects to find data. The metadata file in the directory is updated.

The word location data from <TmpWordLocationsFile> is uploaded to the Selector Tool DB.
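
For example (all file and directory names illustrative):

copy_tmp_picturebook_data_into_place local_config.json tmp_word_locations.json tmp_word_locations.zip $LARA/selector_tool_data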

Creating CSV files for human vs TTS questionnaires

The portal has functionality for running questionnaires that compare human audio and TTS audio derived from LARA projects. A study using questionnaires of this kind is described in the paper Assessing the quality of TTS audio in the LARA learning-by-reading platform, published in the proceedings of EUROCALL 2021.

To set up a questionnaire, the first step is to create matched pairs of LARA projects containing the text and audio files that will be used. The two projects in a pair must be identical, except that one has human audio and the other TTS audio. The projects need to be compiled into LARA form, with the compiled LARA files located in the standard web directory on the server.

Next, you create a JSON-formatted metadata file that lists your pairs of projects, indexing each one by a language ID, and for each one specifying a) what data is to be used in the questionnaire and b) the name of a CSV file that will be created containing the information in a format that can be entered into the questionnaire database.

An item in the metadata file is of the form:

"<Language>": {
            "file": "<OutputFile>",
            "data": [
                        {   "id": "<Id>",
                            "human": "<HumanAudioConfigFile>",
                            "tts": "<TTSAudioConfigFile>",
                            "word_time": <MinutesOfWordAudio>,
                            "segment_time": <MinutesOfSegmentAudio>
                        }
                    ]
    },

where

  • <Language> is the name of the language

  • <Id> is an ID

  • <HumanAudioConfigFile> is the LARA config file for the human audio project

  • <TTSAudioConfigFile> is the LARA config file for the TTS audio project, which must have the same text as the human audio project

  • <MinutesOfWordAudio> is the approximate total number of minutes of word audio to select. Items will be randomly selected.

  • <MinutesOfSegmentAudio> is the approximate total number of minutes of segment audio to select (a consecutive sequence of segments will be randomly selected), or “all” to use all segment audio.

A typical example looks like this:

"english": {
        "file": "$LARA/tmp/lrec_2022_english_shortened.csv",
        "data": [
                                {  "id": "lpp_english",
                                        "human": "$LARA/Content/the_little_prince_lrec2022/corpus/local_config_shortened.json",
                                        "tts": "$LARA/Content/the_little_prince_lrec2022/corpus/local_config_tts_shortened.json",
                                        "word_time": 0.0,
                                        "segment_time": "all"
                                }
                        ]
},

You can then create the CSV files using a call of the form:

make_human_tts_evaluation_forms <MetadataFile>
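
For example (metadata file name illustrative):

make_human_tts_evaluation_forms human_tts_metadata.json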

Formatting data from a human vs TTS questionnaire

Data from a human vs TTS questionnaire can be formatted using a call of the form:

format_human_tts_data <SegmentFile> <OverallFile> <VoiceRatingFile> <ResultsDir>

where

<SegmentFile> is a semicolon-separated UTF-8 CSV file with format and example:

Lang    UserID Sex  DoB  Education    Level      Teacher Hearing  Reading V1  Text     Score
English 26     male 1997 postgraduate nearNative yes     no       no      tts lettuces 4

<VoiceRatingFile> is a semicolon-separated UTF-8 CSV file with format and example:

Lang    UserID    Version  Question  Score
English 19        1        6         5

<OverallFile> is a semicolon-separated UTF-8 CSV file with format and example:

Lang    UserID Sex  DoB  Education    Level  T  H  R  VersionFirst V1Comment            V2Comment                              Comment StartTime             EndTime
English 45     male 1959 postgraduate native no no no tts          "cold and monotone." "more natural, clearer pronunciation"  "NULL"  "2021-05-24 02:48:03" "2021-05-24 02:57:28"

<ResultsDir> is the directory in which to place results.
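
For example, a hypothetical call might be:

format_human_tts_data segment_responses.csv overall_responses.csv voice_ratings.csv $LARA/tmp/human_tts_results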

Compiling a reading history

You can compile a reading history defined by a distributed LARA config file using a call of the form:

compile_reading_history <ConfigFile> <ReplyFile>

Example:

compile_reading_history $LARA/Content/reader1_english/distributed_config.json reply1.json

The config file specifies the reading history. A typical file of this kind is as follows:

{
  "id": "reader1_english",
  "l1": "french",
  "resource_file": "$LARA/Content/all_resources.json",
  "reading_history": [[ "peter_rabbit", "english_geneva", [ 1, 5 ]],
                      [ "alice_in_wonderland", "english_geneva", [ 1, 5 ]]],
  "preferred_voice": "cathy",
  "audio_mouseover": "yes",
  "translation_mouseover": "yes",
  "segment_translation_mouseover": "yes",
  "max_examples_per_word_page": 10
}

A JSON structure listing the generated HTML pages is written out to <ReplyFile>.

Incremental compilation of a reading history

After performing a compile_reading_history, it is possible to perform an incremental compilation of the history which only adds the next page for a specified resource. This is usually much quicker than a full compile. The call is of the form:

compile_next_page_in_history <DistributedResourceId> <DistributedLanguageResourceId> <ConfigFile> <ReplyFile>

Here, <DistributedResourceId> and <DistributedLanguageResourceId> are the names of a corpus resource and a language resource defined in the resource_file listed in <ConfigFile>.

Example:

compile_next_page_in_history peter_rabbit english_geneva $LARA/Content/reader1_english/distributed_config.json reply1.json

The new page is written out to <ReplyFile> in the same format as for compile_reading_history.

Getting a list of pages for a resource

This downloads the resource and processes the corpus file to get the list of pages, writing out the result to <ReplyFile>. It caches the answer and looks it up next time without repeating the download. The pages are returned in three forms: page number, base file name and full file name. The call is of the form:

get_page_names_for_resource <DistributedResourceId> <ConfigFile> <ReplyFile>

Here, <DistributedResourceId> is the name of a corpus resource defined in the resource_file listed in <ConfigFile>.

Example:

get_page_names_for_resource peter_rabbit $LARA/Content/reader1_english/distributed_config.json reply1.json

After carrying out the operation, the contents of reply1.json will be something like:

[
  {
      "base_file": "_main_text_peter_rabbit_1_.html",
      "html_file": "$LARA/compiled/_main_text_peter_rabbit_1_.html",
      "page_number": 1
  },
  {
      "base_file": "_main_text_peter_rabbit_2_.html",
      "html_file": "$LARA/compiled/_main_text_peter_rabbit_2_.html",
      "page_number": 2
  },
  {
      "base_file": "_main_text_peter_rabbit_3_.html",
      "html_file": "$LARA/compiled/_main_text_peter_rabbit_3_.html",
      "page_number": 3
  },
  (...)
]

If an error occurred, for example because the resource is not defined, the response will be:

false

Cleaning the reading history cache

You can delete the reading history cache files as follows (you need to pass a distributed LARA config file to say where they are):

clean_reading_portal_cache <ConfigFile>

Example:

clean_reading_portal_cache $LARA/Content/reader1_english/distributed_config.json