DICO Project
ISSCO - University of Geneva
DICO is a tool, developed at ISSCO, for consulting multiple dictionaries
or other structured data over a computer network.
It is designed to accommodate various dictionary
representations. In addition to a set of basic searching mechanisms, it allows
easy addition of custom indexes.
DICO was demonstrated at CeBIT'95 on the Swiss Technology Stand,
and it has been accessible on the network of the University of Geneva since 1992.
- simplicity: it should enable non-specialist users to look up
entries in dictionaries.
- accessibility: it should work in an environment of networked
heterogeneous computers.
- expandability: it should be easy to add data and indexes.
- robustness: it should be resistant to the hazards inherent in networks
and inexperienced users.
- General Design
- Communication Protocol
- Advantages of a client-server design
- Interfaces
- Indexes
- Display
- Transducer
- Dictionary source representation
- Alphabetical order
- Approximate matching
- Special characters
- Summary
- HotDico
The tool is split into three parts that communicate through the network:
- A Dictionary Information Service (DIS)
- Dictionary servers
- A client user interface

- A server can handle many clients.
- There can be many servers on the network.
The communication is driven by the client:
the client sends a request to the server,
and waits until the server processes it and sends back an answer.
For example:
- Ask for the list of available dictionaries.
- Change the current dictionary.
- Ask for the list of entries corresponding to a given search key.
- Ask for the content of an entry.
- Ask for the next or previous entry in alphabetical order.
- Change the format for displaying entries.
- Ask for a help or information page.
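The request/answer cycle listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the request names (`LIST_DICTIONARIES`, `SET_DICTIONARY`, `SEARCH`, `GET_ENTRY`), the `DictionaryServer` class and the sample data are hypothetical, and the network transport is replaced by a direct method call. It does not reproduce the actual DICO protocol.

```python
class DictionaryServer:
    """Toy stand-in for a dictionary server (hypothetical, not the DICO API)."""

    def __init__(self, dictionaries):
        # dictionaries: name -> {headword: entry text}
        self.dictionaries = dictionaries
        self.current = sorted(dictionaries)[0]

    def handle(self, request, argument=None):
        """Process one client request and return the answer."""
        entries = self.dictionaries[self.current]
        if request == "LIST_DICTIONARIES":
            return sorted(self.dictionaries)
        if request == "SET_DICTIONARY":
            if argument not in self.dictionaries:
                return "ERROR unknown dictionary"
            self.current = argument
            return "OK"
        if request == "SEARCH":
            # Here a search key is treated as a headword prefix.
            return sorted(h for h in entries if h.startswith(argument))
        if request == "GET_ENTRY":
            return entries.get(argument, "ERROR no such entry")
        return "ERROR unknown request"


server = DictionaryServer({
    "english-french": {"house": "maison", "horse": "cheval"},
    "french-english": {"maison": "house"},
})
# The client drives the exchange: one request, then one answer.
print(server.handle("LIST_DICTIONARIES"))
print(server.handle("SEARCH", "ho"))
print(server.handle("GET_ENTRY", "house"))
```

Because every exchange is a single request followed by a single answer, the server in this sketch needs no state beyond the current dictionary.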
- Access through the network
- The work is split in two
- The user interface handles query editing, display of the entries, local
configuration.
- The server handles database operations
- Expandability: different user interfaces can be developed for
different purposes or computers:
- X-windows (xdico)
- full screen character terminal (tdico)
- java user interface (HOTDico)
- One copy of the dictionary data
- saves disk space and maintenance time
- better protection of data
Java Interface
- This interface has been redesigned to adopt the new concept of Virtual
language Reference Library and has been completely
recoded in the Java programming language.
- The main benefit of choosing an interpreted language like this one is
that it is widely available on all common platforms and operating
systems.
X-Windows client
|
- Allows typographical display of entries (various font styles, etc.)
and allows select-paste.
- Adds an extra stretch of network access because the interface can be
displayed on a remote screen.
- Customizable via the X resources
- Uses basic X11R5 tools for portability
Terminal client
|
- Mimics windows using the full screen capabilities of the character
terminal
- Keyboard keys for buttons and menus
- Typographical aspect rendition with character attributes (underlined,
inverse video, standout mode)
- Accommodates almost all terminals with an addressable cursor (thanks to
the UNIX curses package).
- Adds an extra stretch of network access when using a terminal
emulator and connecting with telnet or a modem
In DICO, a dictionary is a collection of textual entries. Each entry
is associated with at least one main access key (headword or headphrase) and
possibly some secondary access keys (text).
Main index implementation
The main index is an N-to-N mapping of headwords to entries:
- Homographs correspond to different entries
- Spelling variants correspond to the same entry
The sorted list of headwords is kept in
memory to allow for fast searches:
- By full headword (binary search)
- By a prefix of headword (binary search)
- By a regular expression pattern (sequential search).
Headwords are kept in a displayable form (including punctuation and spaces).
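The three search modes above can be illustrated with a short Python sketch. It assumes a plain sorted list of strings compared by code point; the function names and sample headwords are hypothetical, and the real index also maps each headword to its entries, which is omitted here.

```python
import re
from bisect import bisect_left

# Sample headwords (illustrative only), kept sorted in memory.
headwords = sorted(["mail", "main", "mais", "maison", "major"])

def find_exact(words, key):
    """Binary search for a full headword."""
    i = bisect_left(words, key)
    return i < len(words) and words[i] == key

def find_prefix(words, prefix):
    """Binary search for the range of headwords starting with prefix."""
    lo = bisect_left(words, prefix)
    # The highest code point sorts after any continuation of the prefix.
    hi = bisect_left(words, prefix + "\U0010FFFF")
    return words[lo:hi]

def find_pattern(words, pattern):
    """Sequential search: every headword is tested against the pattern."""
    rx = re.compile(pattern)
    return [w for w in words if rx.fullmatch(w)]

print(find_exact(headwords, "mais"))
print(find_prefix(headwords, "mai"))
print(find_pattern(headwords, r"ma.s.*"))
```

The exact and prefix lookups touch only a logarithmic number of headwords, while the pattern search has to visit the whole list.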
Secondary index implementation
The secondary index is an N-to-N mapping of access keys to entries.
- Arbitrary keys: a key does not have to appear in the content of an entry
- Creation of an index is done off-line, and may involve external data and
heavy processing
Search is done with the full access key (by hashcode). References to the
entries are kept in lists or bitmaps.
Examples:
- Domain information codes (BIOLOGY, THEATRE)
- Reverse index: all words appearing in the definition or translation
part of an entry (possibly lemmatized), synonyms, or a thesaurus.
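A minimal sketch of such a secondary index, assuming a Python dict as the hash table and an integer used as a bitmap of entry numbers. The function names, the sample entries and the word-splitting key function are hypothetical and only stand in for the off-line processing described above.

```python
from collections import defaultdict

def build_index(entries, key_function):
    """Off-line step: map each access key to a bitmap of entry numbers."""
    index = defaultdict(int)
    for entry_id, text in enumerate(entries):
        for key in key_function(text):
            index[key] |= 1 << entry_id  # set the bit for this entry
    return dict(index)

def lookup(index, key):
    """On-line step: hash lookup by full key, then decode the bitmap."""
    bitmap = index.get(key, 0)
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

entries = [
    "cell: basic unit of life",
    "stage: raised platform in a theatre",
    "theatre: building for plays",
]

def words_of(text):
    # Reverse-index keys: every word of the entry, lower-cased.
    return set(text.lower().replace(":", "").split())

reverse_index = build_index(entries, words_of)
print(lookup(reverse_index, "theatre"))
```

The key function is the only part that changes from one index to another, so a domain-code index could reuse `build_index` with a function that returns codes such as BIOLOGY or THEATRE instead of words.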
Advantages of this index design
- Simplicity: few constraints on the dictionary representation:
- The content of an entry is contiguous
- Entries in one file, in any order, not necessarily contiguous
- No significant limits (other than those of the hardware and software
environment) on the size of an entry or the number of entries.
- Expandability: new secondary indexes can be added easily
- Fast: the main index is sorted and kept in shared memory; secondary
indexes are hashcoded (the real work is done off-line, only once).
Entries are formatted on the screen dynamically, according to the current
dictionary, display format and screen width.
Transducer rules are interpreted.
Display Representation
- Few constraints on dictionary source format: entries should be
uniform so that one set of rules can convert them all. The automaton is
powerful enough to accommodate complex coding.
- Choice of display format, for example:
- Variants of layout
- Outline source
- Expandability: new display formats
(i.e. new set of rules) can be added easily.
- Formatting is done by the client:
- This offloads the server
- It allows compact source entries, which reduces network load
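The idea of interpreted rule sets, one per display format, applied by the client at display time can be sketched as follows. This is not the DICO transducer: regular-expression substitutions stand in for the automaton, and the `<hw>`, `<pos>` and `<def>` source tags, the format names and the `format_entry` function are all hypothetical.

```python
import re
import textwrap

# One hypothetical rule set per display format: (pattern, replacement) pairs.
RULES = {
    "full": [
        (r"<hw>(.*?)</hw>", r"\1: "),
        (r"<pos>(.*?)</pos>", r"(\1) "),
        (r"<def>(.*?)</def>", r"\1"),
    ],
    "brief": [
        (r"<pos>.*?</pos>", ""),  # drop the part of speech
        (r"<hw>(.*?)</hw>", r"\1: "),
        (r"<def>(.*?)</def>", r"\1"),
    ],
}

def format_entry(source, display_format, width):
    """Apply the rules of the chosen format, then wrap to the screen width."""
    text = source
    for pattern, replacement in RULES[display_format]:
        text = re.sub(pattern, replacement, text)
    return textwrap.fill(" ".join(text.split()), width=width)

source = "<hw>maison</hw><pos>n.f.</pos><def>house</def>"
print(format_entry(source, "full", 40))
print(format_entry(source, "brief", 40))
```

Adding a display format in this sketch means adding one more rule list. The compact source entry is what travels over the network, and the same entry can be reformatted for any screen width.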
Dictionary on electronic media:
- Word processing format
- Optical character recognition
- Hand typed from printed version
Parsing is difficult because
- There are errors in the representation (typographical, file corruption)
- The structure of similar entries is not consistent
(different lexicographers at different times)
- Easy clues are few: you need language understanding to
disambiguate some constructions.
The Text Encoding Initiative will give guidelines on how to code printed
dictionaries for exchange purposes.
Alphabetical order is apparent when displaying a list of matching headwords
and when asking for the next or previous entry. Sorting rules are specific
to each language:
- Different alphabets: in Danish æ, ø, å are sorted after z
- Grouped letters: in Spanish ch is sorted after cz and ll after lz
- Secondary sort: in French the accent on a character is used as a secondary
key. Example:
    word      primary key   secondary key   meaning
    mais      mais          i               (but)
    maïs      mais          ï               (corn)
    maison    maison        i               (house)
In DICO a description of the alphabetical order has to be associated
to each dictionary.
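The French example can be reproduced with a two-level sort key in Python. This is a simplified sketch, not the DICO order description format: the primary key is the word with accents removed, and the secondary key is the original spelling compared by code point. The function names are hypothetical.

```python
import unicodedata

def strip_accents(word):
    """Remove combining accents after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def sort_key(word):
    # Primary key: unaccented letters; secondary key: the accented spelling.
    return (strip_accents(word), word)

words = ["maison", "maïs", "mais"]
print(sorted(words, key=sort_key))
```

Language-specific rules such as the Danish letters after z or the Spanish grouped letters would need their own key function, which fits with the order being described per dictionary.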
To allow for easy and forgiving entry of
search keys, matching can:
- Ignore case (optional): a matches a or A
- Ignore accents (optional): e matches e, é, è, ë, ê
- Ignore punctuation and spaces:
- bb or b&b matches b.&b.
- crowsnest matches crow's nest
It also allows for easy cut and paste from another document (or the
definition part of an entry) when editing the search key.
This is described together with the alphabetical order rules:
- Dictionary specific
- Easily modified
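One way to picture this matching is to reduce both the search key and the headword to a normalized form before comparing them. The sketch below assumes Unicode strings and a hypothetical `normalize_key` function with optional case and accent folding. In DICO itself these rules are part of each dictionary's description.

```python
import unicodedata

def normalize_key(key, ignore_case=True, ignore_accents=True):
    """Reduce a search key or headword to its comparison form."""
    if ignore_accents:
        decomposed = unicodedata.normalize("NFD", key)
        key = "".join(c for c in decomposed if not unicodedata.combining(c))
    if ignore_case:
        key = key.lower()
    # Punctuation and spaces are always ignored.
    return "".join(c for c in key if c.isalnum())

def matches(search_key, headword, **options):
    return normalize_key(search_key, **options) == normalize_key(headword, **options)

print(matches("bb", "b.&b."))
print(matches("crowsnest", "crow's nest"))
print(matches("e", "é"))
print(matches("e", "é", ignore_accents=False))
```

Normalizing both sides the same way is also what makes pasted text usable as a search key, since stray punctuation and capitalization drop out.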
Currently DICO supports ISO-Latin1 alphabet and the IPA phonetic
alphabet.
Sometimes the user's equipment or inexperience does not allow typing or
displaying special characters.
- Input: an option allows typing two characters to compose one.
    pair   character
    a`     à
    n~     ñ
- Output: an option controls how special characters are displayed.
    char   bad   long name   pair
    é      i     e-acute     e'
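The pair convention can be sketched with Unicode composition in Python. The `MARKS` table below is an assumption that covers only a few accents, and `compose` and `to_pair` are hypothetical names. The actual DICO tables are not reproduced here.

```python
import unicodedata

# Hypothetical table: ASCII mark typed after a letter -> combining accent.
MARKS = {"`": "\u0300", "'": "\u0301", "^": "\u0302", "~": "\u0303"}
ACCENTS = {combining: mark for mark, combining in MARKS.items()}

def compose(pair):
    """Input side: turn a two-character pair such as a` into one character."""
    letter, mark = pair
    return unicodedata.normalize("NFC", letter + MARKS[mark])

def to_pair(char):
    """Output side: show an accented character as a letter plus an ASCII mark."""
    decomposed = unicodedata.normalize("NFD", char)
    if len(decomposed) == 2 and decomposed[1] in ACCENTS:
        return decomposed[0] + ACCENTS[decomposed[1]]
    return char  # no pair known: leave the character unchanged

print(compose("a`"), compose("n~"))
print(to_pair("é"))
```

Characters without an entry in the table pass through `to_pair` unchanged, so the option never loses information on terminals that can display them.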
- Attention to detail makes the client interfaces usable by people with
different skills and with diverse computer equipment.
- Good accessibility through the network thanks to the client-server
design.
- Remarkable design features are:
- Client-server
- Transducer
- Access to many dictionaries through the same interface
- Language independent
DICO has been accessible on the network of the University of Geneva since 1992.
Eight dictionaries are installed:
- Collins GEM
- English ↔ French (6972, 5808)
- English ↔ German (707, 763)
- French ↔ German (443, 585)
- Dictionary of Current Idiomatic English (Volume 1: verbs and Volume 2: phrases)
- French-SwissGerman
Information & Copyright
DICO (Copyright (c) ISSCO 1994,1995,1996),
Dominique PETITPIERRE
Dominique.Petitpierre@unige.ch
Gilbert ROBERT
Gilbert.Robert@issco.unige.ch
Intellectual property of ISSCO (University of Geneva).