Internal documentation: abstract HTML¶
This section describes the abstract HTML layer, which provides a level of representation designed for people who are interested in rendering LARA text themselves. Abstract HTML is implemented on top of the standard JSON format.
In the rest of the section, we describe how to create abstract HTML, explain where to find and how to use sample Python code that illustrates the rendering process, and specify the format of the abstract HTML representation.
Creating abstract HTML¶
The most direct way to create abstract HTML for a LARA project is to use a call of the following form:
python3 $LARA/Code/Python/lara_run.py abstract_html <ConfigFile>
It is necessary to perform the resources step first. Usually, you will also want to fill in some of the generated resources.
By default, a call of the form:
python3 $LARA/Code/Python/lara_run.py word_pages <ConfigFile>
also produces abstract HTML as an intermediate step before producing word pages.
The abstract HTML file is by default created in pickled gzipped form. This is a compact encoding which can be read efficiently using a Pickle library. It is also possible to create an abstract representation file as plain JSON. Abstract HTML files are created in the tmp_resources
directory defined by the config file. If <Id>
is the project ID, the pathname of a pickled gzipped abstract HTML file is:
<Id>_abstract_html.data.gz
and the pathname of a plain JSON abstract HTML file is:
<Id>_abstract_html.json
The following config file parameter controls production of abstract HTML:
"abstract_html_format". Determines whether to produce pickled gzipped abstract HTML, plain JSON abstract HTML, or both. Possible values are "pickle_only", "json_only" and "pickle_and_json". Default is "pickle_only".
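As an illustration of how a renderer might read these files, here is a minimal Python sketch. The function name load_abstract_html is hypothetical and not part of LARA, and the sketch assumes that the pickled gzipped file is a standard gzip-compressed pickle of the top-level dict. It prefers the pickled file when both exist.

```python
import gzip
import json
import pickle
from pathlib import Path


def load_abstract_html(tmp_resources_dir, project_id):
    """Load abstract HTML, preferring the pickled gzipped file over plain JSON."""
    base = Path(tmp_resources_dir)
    pickled = base / f"{project_id}_abstract_html.data.gz"
    plain = base / f"{project_id}_abstract_html.json"
    if pickled.exists():
        # Assumption: the file is a gzip-compressed standard pickle.
        with gzip.open(pickled, "rb") as f:
            return pickle.load(f)
    if plain.exists():
        with open(plain, encoding="utf-8") as f:
            return json.load(f)
    raise FileNotFoundError(f"No abstract HTML file for {project_id} in {base}")
```

Since unpickling can execute arbitrary code, a loader like this should only be pointed at files you created yourself; for files from other sources the plain JSON form is the safer choice.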
Creating an abstract HTML zipfile¶
You can create a zipfile containing both the abstract HTML and all the multimedia files it references using a call of the following form:
python3 $LARA/Code/Python/lara_run.py abstract_html_zipfile <ConfigFile> <Format> <Zipfile>
where <Format> is pickle or json. This may be convenient if you are planning to render the abstract HTML yourself.
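If you want to inspect such a zipfile before rendering, something like the following Python sketch may help. The internal layout of the zipfile is not specified here, so the sketch assumes that the abstract HTML member keeps the _abstract_html.json or _abstract_html.data.gz file name and treats every other member as multimedia; split_zip_members is a hypothetical helper.

```python
import zipfile


def split_zip_members(zip_path):
    """Return (abstract_html_members, other_members) for an abstract HTML zipfile."""
    with zipfile.ZipFile(zip_path) as zf:
        # Skip directory entries, which end with a slash.
        names = [n for n in zf.namelist() if not n.endswith("/")]
    # Assumption: the abstract HTML member keeps the usual file name pattern.
    suffixes = ("_abstract_html.json", "_abstract_html.data.gz")
    abstract = [n for n in names if n.endswith(suffixes)]
    others = [n for n in names if not n.endswith(suffixes)]
    return abstract, others
```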
Python code for rendering abstract HTML¶
There is Python code in $LARA/Code/Python/lara_abstract_html.py for rendering abstract HTML into concrete HTML. It produces HTML which should be very similar to the HTML produced by the normal process.
The sample abstract HTML rendering code can be invoked from the command-line using a call of the form:
python3 $LARA/Code/Python/lara_run.py word_pages_from_abstract_html <ConfigFile>
The abstract HTML is taken from a previously created abstract HTML file (see "Creating abstract HTML" above). If there is both a pickled gzipped file and a plain JSON file, the pickled gzipped file takes precedence.
Sticking together abstract HTML for several texts¶
The abstract HTML format has been designed so that it is easy to stick together several pieces of abstract HTML into a larger piece and then render it as a combined LARA text. You can do this with a call of the form:
python3 $LARA/Code/Python/lara_run.py word_pages_from_abstract_html_multiple <MasterConfigFile> <ConfigFile1> <ConfigFile2> ...
The abstract HTML files associated with <ConfigFile1>, <ConfigFile2> and so on will be combined as specified in <MasterConfigFile>. You need to have created the abstract HTML files first.
You will probably want <MasterConfigFile> to include the line:
"id_on_examples": "yes",
This says to add annotations after the examples on concordance pages to show which component text each example has been taken from.
You can make the annotations look better if you put in lines in the component config files of the form:
"id_printform": "<Formatted ID>",
for example:
"id_printform": "Alice in Wonderland",
This says how to print the name of the component text in the annotation. The default is to use the value of the id parameter.
A typical example of a <MasterConfigFile> is the following. Note that most of the parameters do not need to be set.
{
    "id": "hc_andersen_combined",
    "id_on_examples": "yes",
    "language": "danish",
    "abstract_html_format": "json_only",
    "corpus": "$LARA/Content/hc_andersen_combined/corpus/placeholder.txt",
    "translation_mouseover": "yes",
    "segment_translation_mouseover": "yes",
    "segment_translation_character": "✎",
    "audio_mouseover": "yes",
    "max_examples_per_word_page": 20,
    "coloured_words": "no",
    "allow_table_of_contents": "yes"
}
Format of abstract HTML¶
Abstract HTML is a JSON-based format. We specify this format hierarchically, mostly using the toy project $LARA/Content/mary_had_a_little_lamb as the running example.
The top-level structure of a piece of abstract HTML is a dict with the following keys:
"segments". Representations of segments (sentences).
"pages". Representations of main text pages.
"word_pages". Representation of concordance pages for words.
"alphabetical_index". Alphabetically ordered lemma index.
"frequency_index". Frequency ordered lemma index.
"notes". Optional list of notes attached to lemmas.
"image_lexicon". Optional list of images attached to lemmas.
"toc". Optional table of contents.
"css". Default CSS file.
"custom_css". Optional custom CSS file.
"script". Default JS script file.
"custom_script". Optional custom script file.
"audio_tracking_data". Optional data for allowing karaoke-style highlighting of audio.
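A renderer may want to sanity-check the top-level dict before using it. The following Python sketch does this under the assumption that every key not marked as optional in the list above is always present; whether optional keys are omitted or present with empty values is not specified here, so the check accepts both. The names REQUIRED_KEYS, OPTIONAL_KEYS and check_top_level are illustrative only.

```python
REQUIRED_KEYS = {"segments", "pages", "word_pages", "alphabetical_index",
                 "frequency_index", "css", "script"}
OPTIONAL_KEYS = {"notes", "image_lexicon", "toc", "custom_css",
                 "custom_script", "audio_tracking_data"}


def check_top_level(abstract_html):
    """Return (missing_required_keys, unexpected_keys), each as a sorted list."""
    keys = set(abstract_html)
    missing = sorted(REQUIRED_KEYS - keys)
    unexpected = sorted(keys - REQUIRED_KEYS - OPTIONAL_KEYS)
    return missing, unexpected
```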
“segments”¶
The value of the "segments" key is a dict of segment representations indexed by their anchor values. A segment representation is a dict with the following keys:
"anchor". Unique identifier for segment, can be used to construct an anchor in the HTML.
"audio". Optional representation of audio file associated with segment.
"corpus_name". Identifier for corpus where segment occurs.
"page". Identifier for text page where segment occurs.
"plain_text". Plain text version of segment.
"translation". Optional translation for segment.
"words". List of words and other text/multimedia items in segment.
A typical entry in the "segments" dict looks like this:
"Mary_had_a_little_lamb_page_1_segment_1": {
    "anchor": "Mary_had_a_little_lamb_page_1_segment_1",
    "audio": {
        "corpus_name": "Mary_had_a_little_lamb",
        "file": "491462_200311_022328758.mp3"
    },
    "corpus_name": "Mary_had_a_little_lamb",
    "page": 1,
    "plain_text": "Mary Had a Little Lamb",
    "translation": "Marie hade ett litet lamm",
    "words": [
        {
            "word": "<h1>"
        },
        {
            "audio": {
                "corpus_name": "Mary_had_a_little_lamb",
                "file": "118812_190719_183034441.mp3"
            },
            "lemma": "mary",
            "translation": "Marie",
            "word": "Mary"
        },
        {
            "word": " "
        },
        {
            "audio": {
                "corpus_name": "Mary_had_a_little_lamb",
                "file": "118808_190719_183019768.mp3"
            },
            "lemma": "have",
            "translation": "hade ",
            "word": "Had"
        },
        ( ... more items ... )
    ]
}
Items in the "words" list are all dicts. They can be of the following kinds:
Word. Representation of a word with associated multimedia information. The dict will contain the keys "word", "lemma", "audio" (optional) and "translation" (optional). A typical example looks like this:
{
    "audio": {
        "corpus_name": "Mary_had_a_little_lamb",
        "file": "118812_190719_183034441.mp3"
    },
    "lemma": "mary",
    "translation": "Marie",
    "word": "Mary"
},
Plain text. Representation of a piece of text, possibly including HTML markup but with no other associated information. The dict will have the single key "word". A typical example looks like this:
{
    "word": "<h1>"
},
Image. Representation of an embedded image file. The dict will contain the keys "corpus_name", "file", "width", "height" and "multimedia", with the value of the "multimedia" key being "img". A typical example looks like this:
{
    "corpus_name": "Mary_had_a_little_lamb",
    "file": "MaryAndLamb.jpg",
    "height": 292,
    "multimedia": "img",
    "width": 517
},
Audio. Representation of an embedded audio file. The dict will contain the keys "corpus_name", "file" and "multimedia", with the value of the "multimedia" key being "audio". A typical example looks like this:
{
    "corpus_name": "Mary_had_a_little_lamb",
    "file": "MaryVerse1.mp3",
    "multimedia": "audio"
}
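The four item kinds above can be told apart by the keys they contain. The following Python sketch shows one way a renderer might dispatch on them; item_kind and render_segment are hypothetical names, and the HTML they emit is deliberately simplified and is not the markup produced by lara_abstract_html.py. In particular, multimedia file names are used directly as URLs, which is an assumption.

```python
import html


def item_kind(item):
    """Classify an item from a segment's "words" list."""
    if item.get("multimedia") == "img":
        return "image"
    if item.get("multimedia") == "audio":
        return "audio"
    if "lemma" in item:
        return "word"
    return "plain_text"


def render_segment(segment):
    """Render a segment to a simple HTML string (illustrative markup only)."""
    parts = []
    for item in segment["words"]:
        kind = item_kind(item)
        if kind == "word":
            # Show the optional translation as a mouseover title.
            title = html.escape(item.get("translation", ""), quote=True)
            parts.append(f'<span title="{title}">{html.escape(item["word"])}</span>')
        elif kind == "plain_text":
            # Plain text items may already contain HTML markup, so pass through.
            parts.append(item["word"])
        elif kind == "image":
            parts.append(f'<img src="{item["file"]}" '
                         f'width="{item["width"]}" height="{item["height"]}">')
        else:
            parts.append(f'<audio controls src="{item["file"]}"></audio>')
    return "".join(parts)
```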
“pages”¶
The value of the "pages" key is a list of page representations. A page representation is a dict with the following keys:
"corpus_name". Identifier for corpus.
"page_name". Number serving as unique identifier for page in corpus.
"segments". List of segment IDs.
"play_all". Dict containing information required to create a single audio file that combines all the segment audio files in the page. This is used if one of the segments in the page contains an audio file with the special value play all.
"custom_css_file". Name of custom CSS file for page, if any.
"custom_script_file". Name of custom script file for page, if any.
A typical item in the "pages" list looks like this:
{
    "corpus_name": "Mary_had_a_little_lamb",
    "custom_css_file": null,
    "custom_script_file": null,
    "page_name": 1,
    "play_all": {
        "corpus_name": "Mary_had_a_little_lamb",
        "file_name": "play_all_Mary_had_a_little_lamb_1.mp3",
        "page_name": 1,
        "segment_audio_files": false
    },
    "segments": [
        "Mary_had_a_little_lamb_page_1_segment_1",
        "Mary_had_a_little_lamb_page_1_segment_2",
        "Mary_had_a_little_lamb_page_1_segment_3"
    ]
},
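Since a page only lists segment IDs, a renderer has to look each ID up in the top-level "segments" dict. A minimal Python sketch of that lookup follows; page_plain_text is a hypothetical helper, and it assumes every ID listed in a page is a key of "segments".

```python
def page_plain_text(abstract_html, page):
    """Return the plain text of each segment on a page, in page order."""
    segments = abstract_html["segments"]
    return [segments[seg_id]["plain_text"] for seg_id in page["segments"]]
```

For example, calling page_plain_text(abstract_html, abstract_html["pages"][0]) would give the plain text of the segments on the first page.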
“word_pages”¶
The value of the "word_pages" key is a dict of concordance page/word page representations indexed by their lemma values. A word page representation is a dict with the following keys:
"lemma". The lemma for which this is a concordance page.
"examples". List of segment IDs for examples of this lemma.
"extra_info". Possibly empty list of other information, for example notes or links to external resources.
"images". Possibly empty list of image representations, used if the lemma is associated with one or more images. Each image representation is a dict with the keys "corpus", "file" and "multimedia". The value of "multimedia" is "img".
A typical entry in the "word_pages" dict looks like this:
"lamb": {
    "examples": [
        "Mary_had_a_little_lamb_page_1_segment_1",
        "Mary_had_a_little_lamb_page_2_segment_3",
        "Mary_had_a_little_lamb_page_2_segment_6",
        "Mary_had_a_little_lamb_page_3_segment_6"
    ],
    "extra_info": [
        "<p>✎ Could be Jesus.</p>",
        ""
    ],
    "images": [
        {
            "corpus": "Mary_had_a_little_lamb",
            "file": "lamb.jpg",
            "multimedia": "img"
        }
    ],
    "lemma": "lamb"
}
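To build a concordance page, the example segment IDs can be resolved against the top-level "segments" dict. The Python sketch below shows the idea; concordance is a hypothetical helper, and skipping IDs that are missing from "segments" is a defensive choice of this sketch rather than documented behaviour.

```python
def concordance(abstract_html, lemma):
    """Return (segment_id, plain_text) pairs for the examples of a lemma."""
    word_page = abstract_html["word_pages"][lemma]
    segments = abstract_html["segments"]
    return [(seg_id, segments[seg_id]["plain_text"])
            for seg_id in word_page["examples"]
            if seg_id in segments]
```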
“alphabetical_index”¶
The value of the "alphabetical_index" key is a list of entries that can be used to construct an alphabetically ordered index. Each entry is a dict with the keys "count" and "word". The value of "count" is a frequency count, and the value of "word" is a list of dicts, each of which has the two keys "lemma" and "word".
A typical item in the "alphabetical_index" list looks like this:
{
    "count": 3,
    "word": [
        {
            "lemma": "mary",
            "word": "mary"
        }
    ]
},
“frequency_index”¶
The value of the "frequency_index" key is a list of entries that can be used to construct a frequency ordered index. Each entry is a dict with the keys "count", "cumulative_percentage" and "word". The value of "count" is a frequency count, the value of "cumulative_percentage" is a cumulative frequency count expressed as a percentage, and the value of "word" is a list of dicts, each of which has the two keys "lemma" and "word".
A typical item in the "frequency_index" list looks like this:
{
    "count": 3,
    "cumulative_percentage": "12.28%",
    "word": [
        {
            "lemma": "mary",
            "word": "mary"
        }
    ]
},
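Note that "cumulative_percentage" is a string with a trailing percent sign, so it has to be parsed before any numeric use. The Python sketch below shows the parsing, together with one plausible way such values could be derived from the counts. The derivation assumes the denominator is the sum of all counts in the index, which is not confirmed here; both function names are hypothetical.

```python
def parse_percentage(text):
    """Convert a string such as "12.28%" to a float."""
    return float(text.rstrip("%"))


def with_cumulative_percentage(counts):
    """Given counts in index order, return cumulative percentage strings."""
    # Assumption: percentages are taken relative to the sum of all counts.
    total = sum(counts)
    running = 0
    result = []
    for count in counts:
        running += count
        result.append(f"{100 * running / total:.2f}%")
    return result
```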
“notes”¶
The value of the "notes" key is a list of entries giving all the lemmas which have an associated note. Each entry is a dict with the keys "note" and "word". The value of "note" is the text of the note, and the value of "word" is a list of dicts, each of which has the two keys "lemma" and "word".
A typical item in the "notes" list looks like this:
{
    "note": "Possibly the Virgin Mary.",
    "word": [
        {
            "lemma": "mary",
            "word": "mary"
        }
    ]
}
“image_lexicon”¶
The value of the "image_lexicon" key is a list of entries giving all the lemmas which have associated images. Each entry is a dict with the keys "lemma" and "images". The value of "images" is a list of dicts with the keys "corpus_name" and "file".
A typical item in the "image_lexicon" list looks like this:
{
    "images": [
        {
            "corpus_name": "Mary_had_a_little_lamb",
            "file": "lamb.jpg"
        }
    ],
    "lemma": "lamb"
},
“toc”¶
The value of the "toc" key is a list of entries that can be used to create a table of contents. Each entry is a dict with the keys "anchor", "corpus_name", "page_name", "plain_text" and "tag". The value of "anchor" is the ID of the segment where the section in question starts, "corpus_name" is the identifier of the corpus, "page_name" is the ID of the page, and "plain_text" is the text of the TOC entry. "tag" indicates the level of nesting and should be either "h1" or "h2".
A typical item in the "toc" list looks like this:
{
    "anchor": "Mary_had_a_little_lamb_page_1_segment_1",
    "corpus_name": "Mary_had_a_little_lamb",
    "page_name": 1,
    "plain_text": " Mary Had a Little Lamb ",
    "tag": "h1"
},
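As the example shows, "plain_text" may carry surrounding whitespace, and the nesting level comes from "tag". A minimal Python sketch of turning the entries into an indented text outline follows; render_toc is a hypothetical helper and the output layout is purely illustrative.

```python
def render_toc(toc):
    """Render TOC entries as text lines, indenting "h2" entries under "h1"."""
    lines = []
    for entry in toc:
        indent = "    " if entry["tag"] == "h2" else ""
        text = entry["plain_text"].strip()
        lines.append(f'{indent}{text} (page {entry["page_name"]})')
    return lines
```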
“css”¶
The value of the "css" key is a list of lines giving the contents of the default CSS file generated by the normal LARA make_pages operation.
“custom_css”¶
The value of the "custom_css" key is a list of lines giving the contents of the custom CSS file generated by the normal LARA make_pages operation, if any.
“script”¶
The value of the "script" key is a list of lines giving the contents of the default JS script file generated by the normal LARA make_pages operation.
“custom_script”¶
The value of the "custom_script" key is a list of lines giving the contents of the custom JS script file generated by the normal LARA make_pages operation, if any.
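Because these four values are lists of lines, a renderer needs to join them before writing a CSS or script file. It is not stated whether each line keeps its trailing newline, so the Python sketch below normalises both cases; write_lines is a hypothetical helper.

```python
from pathlib import Path


def write_lines(lines, path):
    """Write a list of lines to a text file and return the text written."""
    # Assumption: lines may or may not end with a newline, so strip and rejoin.
    text = "\n".join(line.rstrip("\n") for line in lines) + "\n"
    Path(path).write_text(text, encoding="utf-8")
    return text
```

A call such as write_lines(abstract_html["css"], "lara.css") would then produce the default stylesheet; the output file name here is only an example.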
“audio_tracking_data”¶
The value of the "audio_tracking_data" key is a dict which associates audio tracking IDs with lists of timings for segments where audio tracking is to be used.
A typical entry in the "audio_tracking_data" dict looks like this:
"audio_völuspá_1": [
    0.0,
    2.3,
    3.2,
    4.34,
    5.2,
    7.9,
    9.1,
    10.4,
    12.7
],
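For karaoke-style highlighting, a player needs to map the current playback time to a position in the timing list. The Python sketch below does this with a binary search. It assumes the timings are ascending start times in seconds of successive units to be highlighted, which is an interpretation not confirmed here; current_index is a hypothetical helper.

```python
from bisect import bisect_right


def current_index(timings, t):
    """Return the index of the last timing at or before t, or None if t is earlier."""
    i = bisect_right(timings, t) - 1
    return i if i >= 0 else None
```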