Internal documentation: overview of architecture

This section presents an overview of the LARA codebase’s architecture. You will probably only want to read it if you are planning to modify the code.

The LARA code is divided into two parts. The core engine, written in Python, performs the main processing and can be run standalone from the command line as a desktop app. Because it is designed to be used this way, input, output and intermediate results are all kept in files.

LARA is more commonly accessed through the portal, which is written in PHP. Like most PHP apps, the portal keeps most of its information in a database. When it calls the core engine, it typically converts database information into file form, then unpacks the returned files into its database.
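The round trip described above (database rows written to a file, a file-based engine call, result file read back) can be sketched as follows. This is only an illustration of the pattern, written in Python for consistency with the core engine; the real portal does this in PHP. The function names, the CSV column names and the stand-in engine are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["source", "target"]  # hypothetical column names


def export_rows(rows, path):
    """Write database-style rows (a list of dicts) to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def import_rows(path):
    """Read a CSV file back into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return [dict(row) for row in csv.DictReader(f)]


def fake_engine(in_path, out_path):
    """Stand-in for a core engine call: reads one file, writes another."""
    rows = import_rows(in_path)
    for row in rows:
        if not row["target"]:
            row["target"] = "TODO"  # mark entries that still need filling in
    export_rows(rows, out_path)


def round_trip(rows):
    """Rows -> input file -> engine -> output file -> rows."""
    with tempfile.TemporaryDirectory() as tmp:
        in_path = Path(tmp) / "in.csv"
        out_path = Path(tmp) / "out.csv"
        export_rows(rows, in_path)
        fake_engine(in_path, out_path)
        return import_rows(out_path)
```

The point of the sketch is that the engine only ever sees files, so the same call works whether it comes from the portal or from the command line.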

If you are new to LARA, you should start by using the portal for a while to get a feel for its functionality. Try reading a few LARA texts, then construct a small document of your own.

As you will see when you use the portal, constructing a LARA text involves the following steps:

  • Uploading a text file.

  • (In most cases) performing automatic tagging.

  • Generating a set of resources that need to be filled in to annotate the text.

  • Performing the annotation (adding translations, audio, etc.).

  • Putting everything together into the final LARA document.

Most of this is done by making calls to the core engine. Below, we briefly summarise how this works.

Core engine (Python)

The Python code for the core engine is in the directory Code/Python. If you want to experiment with using it on your own machine, there are instructions for installation here.

The top-level calls the portal makes to the core engine are defined here. The following calls are particularly important:

As part of generating the final set of HTML pages, the Python code also creates a file in “abstract HTML” form. This is probably the representation you will want to use if you are planning to manipulate the LARA content in some way. The abstract HTML form is described here.

Config files

Each LARA project is defined by a config file. This is a JSON-formatted file which specifies the resources used to build the project and the relevant parameter settings. In particular, it specifies the following:

  • The corpus file containing the text, in different versions: plain, segmented, tagged, tagged and edited.

  • The directories containing audio files associated with words and segments.

  • The directories where temporary and final results are written out.

A full list of config file parameters and their meanings is presented in this section.
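As a rough illustration of what such a file might look like, the sketch below parses a small JSON config in Python. Only lara_tmp_directory, working_tmp_directory and compiled_directory are parameter names taken from this document. The other key names and all the path values are hypothetical placeholders, so consult the full parameter list for the real ones.

```python
import json

# Hypothetical example config. Only the last three keys are real parameter
# names mentioned in this document; the first three are placeholders.
EXAMPLE_CONFIG_TEXT = """
{
    "corpus_file": "my_project/corpus/my_text.txt",
    "segment_audio_directory": "my_project/audio",
    "word_audio_directory": "my_language/audio",
    "lara_tmp_directory": "$LARA/tmp_resources",
    "working_tmp_directory": "$LARA/tmp",
    "compiled_directory": "$LARA/compiled"
}
"""


def load_config(text):
    """Parse config text and check that it is a JSON object."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("a config file must contain a JSON object")
    return config


config = load_config(EXAMPLE_CONFIG_TEXT)
for key in sorted(config):
    print(f"{key}: {config[key]}")
```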

Directory structure

The code requires the following file structure:

  • Every project has its own directory.

  • Every language has its own directory.

A project directory has the following subdirectories:

  • corpus. This contains all versions of the corpus files, along with some other files (CSS, scripts, “notes”).

  • audio. Audio files in mp3 format specifically related to this project: audio files associated with segments of the corpus and embedded audio files. In the case of a sign language project, this directory will instead contain video files in webm format.

  • images. Image files in any format specifically related to this project.

  • translations. Translation files specifically related to this project, in CSV form: translations of segments and token translations of words. For an explanation of what this means, see here and here.
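When setting up a project by hand, it can be useful to check that the layout above is in place. The following is a minimal sketch, not part of the LARA code: the function name is hypothetical, and it assumes all four subdirectories are expected to exist, which may be stricter than what the core engine actually requires.

```python
from pathlib import Path

# Subdirectory names as listed above for a project directory.
PROJECT_SUBDIRECTORIES = ("corpus", "audio", "images", "translations")


def missing_subdirectories(project_dir, expected=PROJECT_SUBDIRECTORIES):
    """Return the expected subdirectory names not found under project_dir."""
    root = Path(project_dir)
    return [name for name in expected if not (root / name).is_dir()]
```

An empty list means the directory has the full layout. Passing a different `expected` tuple lets the same check be reused for a language directory.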

Every project needs to be associated with the following directories, which will not in general be under the project directory:

  • Directory for writing out temporary resource files, i.e. partly uninstantiated files containing translation, audio or other data. By default, this is $LARA/tmp_resources. A different directory can be specified using the config file parameter lara_tmp_directory.

  • Directory for writing out other tmp files. By default, this is $LARA/tmp. A different directory can be specified using the config file parameter working_tmp_directory.

  • Directory for writing out LARA documents in HTML form. By default, this is $LARA/compiled. A different directory can be specified using the config file parameter compiled_directory.
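The default-plus-override behaviour for these three directories can be sketched as below. This is an assumption-laden illustration rather than the engine's own logic: it treats $LARA as an environment variable, and the function name and the `lara_root` argument are hypothetical. The three parameter names and default subdirectory names are the ones given above.

```python
import os

# Config parameter name -> default subdirectory under $LARA (from the list above).
DEFAULT_SUBDIRECTORIES = {
    "lara_tmp_directory": "tmp_resources",
    "working_tmp_directory": "tmp",
    "compiled_directory": "compiled",
}


def resolve_directories(config, lara_root=None):
    """Return the directory for each parameter, preferring the config value."""
    if lara_root is None:
        # Assumption: $LARA is available as an environment variable.
        lara_root = os.environ.get("LARA", "")
    resolved = {}
    for key, default_subdir in DEFAULT_SUBDIRECTORIES.items():
        value = config.get(key)
        resolved[key] = value if value else os.path.join(lara_root, default_subdir)
    return resolved
```

For example, a config that sets only compiled_directory keeps that value and falls back to the defaults under the LARA root for the other two.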

A language directory has the following subdirectories:

  • corpus. This contains all versions of the corpus files, along with some other files (CSS, scripts, “notes”).

  • audio. Audio files in mp3 format specifically related to this project: audio files associated with segments of the corpus and embedded audio files. In the case of a sign language project, this directory will instead contain video files in webm format.

  • translations. Translation files specifically related to this language, in CSV form: surface or lemma translations of words. Again, see here and here for details.

Portal (PHP)

The PHP, JS and CSS code for the portal is in the directory Code/PHP/lara-portal. It uses a standard model-view-controller architecture.

Most of the time, the portal takes requests from the user interface and produces external calls which do the main processing. Most often the calls are to the Python code, but the portal also has important interfaces to the LiteDevTools (LDT) platform, which is used for recording, and to the packages which upload and download zipfiles.

The easiest way to understand the portal’s operation is to look through the view subdirectory to find HTML fragments matching labels visible in the interface, look through the data subdirectory to find the corresponding external calls, and then look through the code to see how one is connected to the other. Usually, the connection is quite simple. This approach works best if you first familiarise yourself with the operation of the portal and the Python command-line functionality from the user's point of view.

Database

The portal uses an SQL database. The following tables are particularly important:

  • ContentConfig. Corresponds to a config file.

  • Contents. Corresponds to a config file. [WHAT IS THE DIFFERENCE BETWEEN THIS AND ContentConfig?]

  • ContentRecordingTasks. Corresponds to an LDT recording task.

  • ContentSegments. Corresponds to a segment translation line.

  • ContentLemma. Corresponds to a lemma translation line.

  • ContentTokens. Corresponds to a token translation line.

  • ContentTypes. Corresponds to a word type translation line.

  • ExternalCommandsLogs. Corresponds to one occurrence of an external command (Python, LDT, etc.).