The LARA repository

LARA is an open source project hosted on SourceForge. You can freely download both code and LARA content.

If you want to contribute to code development and/or make content available to the LARA community, please contact Manny Rayner (Emmanuel.Rayner@unige.ch), Hanieh Habibi (Hanieh.Habibi@unige.ch) or Matt Butterweck (matthias@butterweck.de) to obtain the necessary privileges.

The repository contains the main directories Code, Content and Doc. The Code directory contains the two subdirectories Python (core LARA engine) and PHP (LARA portal).

Python code (core LARA engine)

The core LARA engine is a set of Python scripts that turn text into LARA pages. They can be run standalone from the command line using the top-level file Code/Python/lara_run.py, or through the LARA portal, which accesses the Python code through the file Code/Python/lara_run_for_portal.py.
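Before running either top-level file, it can be useful to confirm that a local checkout is set up as described above. The following is a minimal sketch, not the contents of check_lara_env.py: it checks that the environment variable LARA is defined and, on the assumption (not stated in this documentation) that LARA points to the root of the repository checkout, reports which of the paths named above are missing. The names lara_root, missing_paths and EXPECTED_PATHS are hypothetical and used only for illustration.

```python
import os
from pathlib import Path

# Paths named in this documentation, relative to the repository root.
EXPECTED_PATHS = [
    "Code/Python/lara_run.py",
    "Code/Python/lara_run_for_portal.py",
    "Code/PHP",
    "Content",
    "Doc",
]


def lara_root(environ=os.environ):
    """Return the value of LARA as a Path, or raise if it is not defined."""
    value = environ.get("LARA")
    if not value:
        raise RuntimeError("Environment variable LARA is not defined")
    # Assumption: LARA holds the path to the root of the checkout.
    return Path(value)


def missing_paths(root):
    """List the expected paths that do not exist under root."""
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    try:
        root = lara_root()
    except RuntimeError as error:
        print(error)
    else:
        print("Missing:", missing_paths(root) or "nothing")
```

If LARA in fact refers to some other location in your installation, adjust EXPECTED_PATHS accordingly.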

As of Aug 10 2020, the Code/Python directory contained the following files:

  • check_lara_env.py Check that the environment variable LARA is defined.

  • lara.py Main code for converting text into LARA pages.

  • lara_add_metadata.py Add metadata to LARA content directory for “distributed LARA”.

  • lara_audio.py Processing audio and signed language video files.

  • lara_clitics.py Processing clitics (used by tagging interface).

  • lara_compare_split_files.py Compare two different versions of a “split file” (internalised format).

  • lara_config.py Internalise a LARA config file.

  • lara_distributed.py “Distributed LARA”: create pages for reading histories.

  • lara_download_metadata.py Download the metadata for “distributed LARA” from the web.

  • lara_download_resource.py Download a LARA resource from the web.

  • lara_edda.py Processing the University of Iceland hand-tagged Edda into LARA form.

  • lara_edit.py Edit LARA files.

  • lara_extra_info.py Add audio, translations, links to external resources, etc. to LARA pages.

  • lara_farsi_tagging.py Interface to hazm tagger/lemmatiser for Farsi.

  • lara_fifty_words.py Create LARA resource used in Australian aboriginal languages “Fifty Words” project.

  • lara_flashcards.py Processing for LARA flashcards.

  • lara_flashcards_toy.py Initial toy version of LARA flashcard code (not used).

  • lara_generic_tagging.py Tagging code shared between different tagger/lemmatisers.

  • lara_gui.py LARA GUI.

  • lara_html.py HTML formatting of LARA pages.

  • lara_icelandic.py Interface to Icelandic tagging and lemmatising resources.

  • lara_icelandic_minimal.py Code for testing interface to Icelandic tagging and lemmatising resources (not used).

  • lara_images.py Processing images and embedded videos.

  • lara_import_export.py Convert between LARA projects and zipfiles.

  • lara_install_audio_zipfile.py Installing downloaded zipfiles from LiteDevTools in LARA projects.

  • lara_merge_resources.py Merge two versions of a LARA language resource directory.

  • lara_minimal_tag.py Lightweight spreadsheet-based lemmatiser for languages that don’t have a real tagger/lemmatiser.

  • lara_mt.py Initial interface to machine translation engines (not yet usable).

  • lara_mwe.py Processing for multiword expressions.

  • lara_mwe_ml.py Processing to create machine-learning data for multiword expressions.

  • lara_new_content.py Create config file and directory for new LARA resource.

  • lara_offline_test.py Regression testing for LARA processing.

  • lara_onp.py Processing for adding links to Old Norse online resources.

  • lara_open_in_browser.py Open LARA files in browser (used by LARA GUI).

  • lara_oxford_lists.py Create statistics based on Oxford 5000 vocabulary list.

  • lara_parallel_text.py Create LARA text in two-column format using two aligned versions.

  • lara_parse_utils.py Utilities for low-level parsing, in particular used for internalising texts.

  • lara_postags.py Add POS tags (used by tagging).

  • lara_reader_gui.py Stub for desktop interface to distributed LARA (not used).

  • lara_reading_portal.py Top-level operations for reading histories.

  • lara_replace_chars.py Handling reserved characters in LARA documents.

  • lara_run.py Top-level file.

  • lara_run_convert_tex.py Top-level file to invoke script for converting “Tex’s French grammar” to LARA form.

  • lara_run_farsi_tagging.py Top-level file to invoke hazm parser, which needs to run under Python 2.

  • lara_run_for_portal.py Top-level file for portal operations.

  • lara_segment.py Wrapper to invoke NLTK Punkt sentence segmenter and import segmentation into LARA documents.

  • lara_signed.py Scripts for processing sign language documents.

  • lara_spell_correct.py Simple spelling correction code used by lara_onp.py.

  • lara_split_and_clean.py Converting LARA texts into internalised form.

  • lara_tagging.py Wrapper for NLTK tagger (not used).

  • lara_tex_french_course.py Script for converting “Tex’s French grammar” to LARA form.

  • lara_text_to_sign.py Script for converting LARA document with sign language annotations into pure sign language document.

  • lara_tmp_utils.py Miscellaneous utilities probably without general value.

  • lara_top.py Top-level LARA calls (level under lara_run.py and lara_run_for_portal.py).

  • lara_transform_tagged_file.py Carry out transformations on internalised files produced by lara_split_and_clean.py.

  • lara_translations.py Processing for translation annotations.

  • lara_treetagger.py Top-level wrapper for tagging and lemmatising operations. Originally only TreeTagger, now generalised to handle other tagger/lemmatisers as well.

  • lara_turkish.py Wrapper for Turkish tagger/lemmatiser.

  • lara_utils.py Basic utility functions.

  • run_lara_download_metadata.py Top-level file to invoke lara_download_metadata.py.

  • run_lara_tagging.py Top-level file to invoke lara_tagging.py (not used).

  • wagnerfischer.py Open source implementation of Wagner-Fischer edit distance calculation, imported from a third-party source.
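For readers unfamiliar with the Wagner-Fischer calculation mentioned in the last entry, the following is a minimal sketch of the standard dynamic-programming algorithm for edit distance with unit costs. It is not the code in wagnerfischer.py, whose interface is not described here; the function name edit_distance is hypothetical.

```python
def edit_distance(source, target):
    """Wagner-Fischer edit distance with unit costs (illustrative sketch)."""
    # previous[j] holds the distance between the prefix of source processed
    # so far and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning the first i characters of source into the empty string
        # takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))  # prints 3
```

Only two rows of the table are kept at a time, which is enough when just the distance is needed and not the sequence of edits.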

PHP code (LARA portal)

(To be added).

Content

The Content directory contains a large variety of LARA projects in different stages of completion. We briefly describe their status as of Aug 10 2020, listing the reading language and the state of completion for tagging, audio, translation and images. “Complete” means all of these are present if relevant.

  • aladdin Dutch. Short children’s story. Complete.

  • alice_in_wonderland English. Full-length children’s book (~30K words). Tagging, audio and images complete.

  • animal_farm English. Full-length novel (~35K words). Tagging and audio complete.

  • Arash Farsi. Short children’s story. Complete.

  • barngarla_alphabet Barngarla (Australian aboriginal language). Alphabet book. Complete.

  • barngarla_fifty_words Barngarla (Australian aboriginal language). Contribution to Aboriginal languages “Fifty Words” project.

  • bozboz_ghandi Farsi. Short children’s story. Complete.

  • dante Medieval Italian. Extracts from Dante’s Inferno. Complete.

  • daughters_doll English. Short magazine article. Complete.

  • de_jongen_en_de_spreeuw Dutch. Short children’s story. Complete.

  • EbneSina Farsi. Short children’s story. Complete.

  • edward_lear English. Short poems. Complete.

  • enoch_soames English. Novella (~10K words). Tagging and audio complete, translations in progress.

  • four_little_children English. Children’s story (~3K words). Tagging and images only.

  • hur_gick_det_sen Swedish. Children’s poem. Complete.

  • hyakumankai_ikita_neko Japanese. Short children’s book. Complete.

  • hávamál Old Norse. Epic poem with English translations. Complete.

  • hávamál_is Old Norse. Epic poem with Icelandic translations. Complete.

  • il_piccolo_principe Italian. Edition of “Le petit prince” (~16K words). Tagging and images complete.

  • kallocain Swedish. Full-length novel (55K words). Tagging except for MWEs complete, translation complete.

  • karlsson_pa_taket Swedish. First chapter of children’s book. Complete.

  • kejserens_nye_klæder Danish. Short children’s story. Tagging and translation complete.

  • le_bonheur French. Short story (~2K words). Tagging and translation complete.

  • le_chien_jaune French. Full-length novel (~35K words). Tagging complete.

  • le_petit_prince French. Children’s novel (~16K words). Tagging, images and translations complete, audio about 60% complete.

  • litli_prinsinn Icelandic. Edition of “Le petit prince” (~16K words). Tagging and images complete.

  • lorem_ipsum Pseudo-Latin. Short text. Tagging and audio complete.

  • molana Farsi. Short children’s story. Complete.

  • nibelungenlied Middle High German. Epic poem (~80K words). Tagging and images complete, partial translations, some audio.

  • peter_rabbit English. Children’s story (~1K words). Complete.

  • picture_dictionary_farsi Farsi. Sample pages from picture dictionary. Complete.

  • picture_dictionary_french French. Sample pages from picture dictionary. Complete.

  • revivalistics Multilingual. Sample pages from OUP linguistics book. Complete.

  • sample_catalan Catalan. Short sample document. Incomplete.

  • sample_czech Czech. Short sample document. Incomplete.

  • sample_danish Danish. Short sample document. Incomplete.

  • sample_dutch Dutch. Short sample document. Incomplete.

  • sample_english English. Short sample document. Complete.

  • sample_english_surface English. Short sample document. Complete.

  • sample_english_tokens English. Short sample document. Complete.

  • sample_finnish Finnish. Short sample document. Incomplete.

  • sample_french French. Short sample document. Incomplete.

  • sample_german German. Short sample document. Incomplete.

  • sample_greek Greek. Short sample document. Incomplete.

  • sample_icelandic Icelandic. Short sample document. Incomplete.

  • sample_italian Italian. Short sample document. Incomplete.

  • sample_japanese Japanese. Short sample document. Incomplete.

  • sample_norwegian Norwegian. Short sample document. Incomplete.

  • sample_polish Polish. Short sample document. Incomplete.

  • sample_portuguese Portuguese. Short sample document. Incomplete.

  • sample_romanian Romanian. Short sample document. Incomplete.

  • sample_slovak Slovak. Short sample document. Incomplete.

  • sample_slovenian Slovenian. Short sample document. Incomplete.

  • sample_spanish Spanish. Short sample document. Incomplete.

  • sample_swahili Swahili. Short sample document. Incomplete.

  • sample_swedish Swedish. Short sample document. Incomplete.

  • sentient_meat Short story (~1K words). Tagging and audio complete.

  • shakespeare_sonnets Four Shakespeare sonnets. Tagging and audio complete.

  • shakespeare_sonnets_kirsten Shakespeare sonnet. Tagging and audio complete.

  • staythefuckhome Multilingual. Document illustrating use of embedded video. Complete.

  • swedish_grammar Swedish. Sample linguistics document. Complete.

  • texs_french_course French. Online language course (~11K words). Tagging and audio complete.

  • the_boy_who_cried_wolf Farsi. Short children’s story. Complete.

  • the_little_black_bag English. Short story (~10K words). Tagging complete.

  • the_little_prince English. Edition of “Le petit prince” (~16K words). Tagging and images complete.

  • the_rime_of_the_ancient_mariner English. Long poem (~4K words). Tagging, images and audio complete.

  • tickertape_of_misery English. Extract from novel, read by author. Tagging and audio complete.

  • tina_fer_i_fri Icelandic. Children’s book (~2.5K words). Complete.

  • tina_signed Icelandic with sign language annotations. Children’s book (~2.5K words). Complete.

  • völuspá Old Norse. Epic poem with English translations. Complete.

  • völuspá_is Old Norse. Epic poem with Icelandic translations. Complete.

  • why_read English. Short magazine article. Complete.

  • wilhelmbusch German. Humorous poems. Tagging and translations complete.

  • zweig_episode German. Short story (~2.5K words). Tagging complete.

Documentation

The Doc directory contains this documentation.