Lexical ambiguity is a fundamental problem in Information Retrieval (IR), especially in the medical domain. Many systems represent the content of a document by a subset of the words it contains. Such systems are faced with two main problems ([Salton and McGill, 1983]; [Krovetz, 1995]; [Krovetz and Croft, 1992]). Firstly, words are ambiguous out of context, and this ambiguity causes documents that are not pertinent to be retrieved; secondly, the user is less interested in retrieving documents that contain exactly the same words than in retrieving those that contain words with a similar meaning. Retrieval programs generally address these problems by expanding the query words with related terms from a thesaurus. But again, this is only possible if the meaning of the word is known unambiguously [Towell and Voorhees, 1998].
If we accept the hypothesis that resolving ambiguity is essential and
will lead to an improvement in the performance of these IR systems,
the question is how to disambiguate the words. In this research, we
propose a method based on existing medical terminological resources on
the one hand, and statistical tools for linguistic annotation on the
other, in order to develop more satisfactory indexing techniques for
French patient reports. The main hypotheses guiding the project are
that (i) syntax can help to distinguish the meanings of polyfunctional
words, that is, words that can fill more than one syntactic function
(see also [Wilks and Stevenson, 1996]; [Yarowsky, 1992];
[Ceusters et al., 1996]); (ii) syntactic analysis can be performed by a
probabilistic tagger based on a Hidden Markov Model (HMM;
[Rabiner, 1989]; [Kupiec, 1992]; etc.); and, more daringly, (iii) the
remaining semantic ambiguity can also be resolved (mutatis mutandis) by
an HMM tagger.
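To make hypotheses (i) and (ii) concrete, the following is a minimal sketch of how an HMM tagger can choose between two readings of a polyfunctional word using Viterbi decoding. It is not the tagger used in this project: the tag set, the French toy vocabulary and all probabilities are hypothetical values chosen by hand for illustration, whereas a real tagger would estimate them from an annotated corpus. Under hypothesis (iii), the same decoding would apply with semantic classes in place of part-of-speech tags.

```python
import math

# Hypothetical tag set and hand-picked probabilities, for illustration only.
TAGS = ["DET", "PRON", "NOUN", "VERB"]

# P(tag) at the start of a sentence.
START = {"DET": 0.5, "PRON": 0.4, "NOUN": 0.05, "VERB": 0.05}

# P(next tag | previous tag).
TRANS = {
    "DET":  {"DET": 0.01, "PRON": 0.01, "NOUN": 0.97, "VERB": 0.01},
    "PRON": {"DET": 0.05, "PRON": 0.05, "NOUN": 0.05, "VERB": 0.85},
    "NOUN": {"DET": 0.10, "PRON": 0.10, "NOUN": 0.20, "VERB": 0.60},
    "VERB": {"DET": 0.50, "PRON": 0.10, "NOUN": 0.30, "VERB": 0.10},
}

# P(word | tag). "porte" is polyfunctional: noun ("door") or verb ("carries").
EMIT = {
    "DET":  {"la": 1.0},
    "PRON": {"il": 1.0},
    "NOUN": {"porte": 0.6, "patient": 0.4},
    "VERB": {"porte": 0.7, "souffre": 0.3},
}

FLOOR = 1e-12  # stands in for zero probabilities so that log() is defined


def log_p(table, key):
    """Log-probability of key in table, floored for unseen events."""
    return math.log(max(table.get(key, 0.0), FLOOR))


def viterbi(words):
    """Return the most probable tag sequence for words under the toy HMM."""
    if not words:
        return []
    # Each column maps tag -> (best log-score of a path ending here, previous tag).
    columns = [{
        tag: (log_p(START, tag) + log_p(EMIT[tag], words[0]), None)
        for tag in TAGS
    }]
    for word in words[1:]:
        last = columns[-1]
        column = {}
        for tag in TAGS:
            # Best predecessor for this tag, given the previous column.
            prev = max(TAGS, key=lambda p: last[p][0] + log_p(TRANS[p], tag))
            score = last[prev][0] + log_p(TRANS[prev], tag) + log_p(EMIT[tag], word)
            column[tag] = (score, prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["la", "porte"]))  # "porte" after a determiner
print(viterbi(["il", "porte"]))  # "porte" after a pronoun
```

With these toy parameters, "porte" is tagged as a noun after "la" and as a verb after "il": the syntactic context alone selects the reading, which is the effect hypothesis (i) relies on.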
These hypotheses have been tested in the following way. The text is first annotated with ISSCO's corpus annotation tools ([Armstrong et al., 1995]; [Armstrong, 1996]), which assign a syntactic and semantic analysis (tag) to each word. This information is then used to index the text and to improve the performance of the search engine. In this paper, we describe the first phase of the project, namely the method of linguistic annotation (section 2), its evaluation (section 3) and how the annotation is used for indexing (section 4). The evaluation of the search engine itself is planned for the second phase of the project and will not be described here.