Next: Linguistic annotation Up: Indexing by statistical tagging Previous: Indexing by statistical tagging

Introduction

Lexical ambiguity is a fundamental problem in Information Retrieval (IR), especially in the medical domain. Many systems use a subset of the words contained in a document to represent its content. Such systems face two main problems ([Salton and McGill, 1983]; [Krovetz, 1995]; [Krovetz and Croft, 1992]). Firstly, words are ambiguous out of context, and this ambiguity causes documents to be retrieved that are not pertinent; secondly, the user is not so much interested in retrieving documents with exactly the same words as in retrieving those containing words with a similar meaning. Retrieval programs generally address these problems by expanding the query words with related terms from a thesaurus. But again, this is only possible if the meaning of the word is unambiguously known [Towell and Voorhees, 1998].

If we accept the hypothesis that resolving ambiguity is essential and will improve the performance of these IR systems, the question is how to disambiguate the words. In this research, we propose a method that combines existing medical terminological resources with statistical tools for linguistic annotation, in order to develop more satisfactory indexing techniques for French patient reports. The main hypotheses guiding the project are that: (i) syntax can help to distinguish meanings of words that are polyfunctional (see also [Wilks and Stevenson, 1996]; [Yarowsky, 1992]; [Ceusters et al., 1996]); (ii) syntactic analysis can be done by a probabilistic tagger (HMM, Hidden Markov Model; [Rabiner, 1989]; [Kupiec, 1992]; etc.); and, more daringly, (iii) remaining semantic ambiguity can also be resolved (mutatis mutandis) by an HMM tagger.
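To make hypothesis (ii) concrete, the following is a minimal sketch of how a bigram HMM tagger picks the most probable tag sequence with the Viterbi algorithm. It is not the ISSCO tagger: the function name `viterbi`, the four-tag set, the French example words and every probability are assumptions invented for illustration only.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    Missing probabilities are replaced by `floor` so that logs stay finite.
    """
    def lp(p):
        return math.log(max(p, floor))

    # best[i][t] = (log-probability of the best path ending in tag t, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p[p].get(t, 0)), p) for p in tags
            )
            column[t] = (score + lp(emit_p[t].get(words[i], 0)), prev)
        best.append(column)

    # Backtrack from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

# Toy model (all numbers are made up): "la" is a determiner or a pronoun,
# "porte" is a noun ("door") or a verb ("carries").
TAGS = ["DET", "PRON", "NOUN", "VERB"]
START = {"DET": 0.5, "PRON": 0.4, "NOUN": 0.05, "VERB": 0.05}
TRANS = {
    "DET":  {"NOUN": 0.9, "PRON": 0.08, "VERB": 0.01, "DET": 0.01},
    "PRON": {"VERB": 0.6, "PRON": 0.3, "NOUN": 0.05, "DET": 0.05},
    "NOUN": {"VERB": 0.5, "DET": 0.2, "PRON": 0.2, "NOUN": 0.1},
    "VERB": {"DET": 0.5, "NOUN": 0.2, "PRON": 0.2, "VERB": 0.1},
}
EMIT = {
    "DET":  {"la": 0.5},
    "PRON": {"il": 0.5, "la": 0.2},
    "NOUN": {"porte": 0.3},
    "VERB": {"porte": 0.3},
}

print(viterbi(["la", "porte"], TAGS, START, TRANS, EMIT))
print(viterbi(["il", "la", "porte"], TAGS, START, TRANS, EMIT))
```

In this toy model the same word form "porte" receives a different tag depending on its left context, which is the kind of syntactic disambiguation the hypothesis relies on. A real tagger would estimate the probabilities from an annotated corpus.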

These hypotheses have been tested in the following way. The text is first annotated with ISSCO's corpus annotation tools ([Armstrong et al., 1995]; [Armstrong, 1996]), which assign a syntactic and semantic analysis (tag) to each word. This information is then used to index the text and to improve the performance of the search engine. In this paper, we describe the first phase of the project, namely the method of linguistic annotation (section 2), its evaluation (section 3) and how the annotation is used for indexing (section 4). The evaluation of the search engine itself is planned for the second phase of the project and will not be described here.
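As a rough illustration of why tags can help indexing, the sketch below builds an inverted index keyed on (word, tag) pairs instead of bare word forms. The actual indexing scheme is the subject of section 4; the function name `build_index`, the report identifiers and the tagged tokens here are hypothetical.

```python
from collections import defaultdict

def build_index(tagged_docs):
    """Map each (lowercased word, tag) pair to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in tagged_docs.items():
        for word, tag in tokens:
            index[(word.lower(), tag)].add(doc_id)
    return index

# Hypothetical tagged reports; in practice the tags would come from the annotation step.
tagged_docs = {
    "report1": [("la", "DET"), ("porte", "NOUN")],
    "report2": [("il", "PRON"), ("la", "PRON"), ("porte", "VERB")],
}
index = build_index(tagged_docs)

# A query for the noun reading no longer retrieves the verb occurrence.
print(sorted(index[("porte", "NOUN")]))
```

Keying on the pair rather than the word form means an ambiguous form is split into separate index entries, one per reading, so a query that specifies a reading matches only documents tagged with that reading.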



Sabine Lehmann
Thursday, 22 June 2000, 11:35:42 MET DST