Free-Text Medical Document Retrieval Using Terminological Resources and Statistical Linguistics

This project is funded by the Swiss National Science Foundation (grant no. 3200-049832.96). It started in April 1998 and will finish in 2001.

The work is done in collaboration with the Division d'Informatique générale of the Hôpital cantonal universitaire.

Summary of the project

Lexical ambiguity is a fundamental problem in Information Retrieval (IR), especially in the medical domain. Many systems represent the content of a document by a subset of the words it contains. Such systems face two main problems (Salton, 1983; Krovetz, 1995; Krovetz and Croft, 1995). First, words are ambiguous out of context, and this ambiguity causes documents to be retrieved that are not pertinent. Second, the user is less interested in retrieving documents with exactly the same words than in retrieving those containing words with a similar meaning. Retrieval programs generally address these problems by expanding the query words with related terms from a thesaurus. But this, again, is only possible if the meaning of the word is known unambiguously.
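The dependence of query expansion on a known word sense can be illustrated with a minimal sketch. The thesaurus below, its sense labels and the function `expand_query` are hypothetical and invented for illustration only; they do not describe the terminological resources used in the project.

```python
# Hypothetical toy thesaurus: word -> sense label -> related terms.
THESAURUS = {
    "discharge": {
        "release_from_hospital": ["release", "dismissal"],
        "bodily_fluid": ["secretion", "exudate"],
    },
    "fever": {
        "raised_temperature": ["pyrexia"],
    },
}


def expand_query(words, senses=None):
    """Expand each query word with related terms from the thesaurus.

    `senses` optionally maps a word to its known sense. Without it,
    every sense of an ambiguous word contributes its related terms.
    """
    senses = senses or {}
    expanded = []
    for word in words:
        expanded.append(word)
        entry = THESAURUS.get(word, {})
        if word in senses:
            expanded.extend(entry.get(senses[word], []))
        else:
            for related in entry.values():
                expanded.extend(related)
    return expanded


blind = expand_query(["discharge", "fever"])
informed = expand_query(["discharge", "fever"], {"discharge": "bodily_fluid"})
print(blind)
print(informed)
```

Without a sense, the ambiguous word pulls in terms from both readings, which is likely to retrieve documents that are not pertinent; with the sense known, only the relevant terms are added.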

If we accept the hypothesis that resolving ambiguity is essential and will lead to an improvement in the performance of these systems, the question is how to disambiguate the words. In this project, we propose a method based on existing medical terminological resources on the one hand, and statistical tools for linguistic annotation on the other, in order to develop more satisfactory indexing techniques for patient reports. The main hypotheses guiding the project are that: (i) syntax can help to distinguish the meanings of words that are polyfunctional; (ii) syntactic analysis can be done by a probabilistic tagger based on a hidden Markov model (HMM); and, more daringly, (iii) the remaining semantic ambiguity can also be resolved (mutatis mutandis) by an HMM tagger.
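As an illustration of hypothesis (ii), the following is a minimal sketch of how an HMM tagger can choose a tag sequence with Viterbi decoding. The tag set, all probabilities and the `floor` smoothing value are invented for this example and are assumptions; they are not taken from the project's tagger or training data.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `words` under an HMM.

    Scores are summed in log space. Unseen emissions fall back to
    `floor`, a crude smoothing choice made only for this sketch.
    """
    def log_emit(tag, word):
        return math.log(emit_p[tag].get(word, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at word i,
    #               tag chosen for word i-1 on that path)
    best = [{t: (math.log(start_p[t]) + log_emit(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][t]), p) for p in tags
            )
            column[t] = (score + log_emit(t, word), prev)
        best.append(column)

    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Toy model (all numbers are illustrative assumptions).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.4, "NOUN": 0.5, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 1.0},
    "NOUN": {"patient": 0.4, "reports": 0.3, "pain": 0.3},
    "VERB": {"reports": 0.8, "pain": 0.2},
}

print(viterbi(["the", "patient", "reports", "pain"], TAGS, START, TRANS, EMIT))
print(viterbi(["the", "reports"], TAGS, START, TRANS, EMIT))
```

In this toy model the polyfunctional word "reports" is tagged as a verb after "patient" and as a noun after "the". Under hypothesis (iii), the same decoding scheme would be applied with semantic labels in place of syntactic tags; whether that works in practice is exactly what the project sets out to test.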

These hypotheses have been tested in the following way. The text is first annotated with ISSCO's corpus annotation tools, which assign a syntactic and semantic analysis (tag) to each word. This information is then used to index the text and to improve the search results.
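One way such tags could be used at indexing time is to key the index on (word, tag) pairs rather than on words alone. The sketch below assumes this design; the function names, the tag labels and the hand-tagged toy reports are hypothetical and do not reproduce the output format of ISSCO's annotation tools.

```python
from collections import defaultdict


def build_index(tagged_docs):
    """Map each (word, tag) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, tokens in tagged_docs.items():
        for word, tag in tokens:
            index[(word.lower(), tag)].add(doc_id)
    return index


def search(index, word, tag=None):
    """Return sorted document ids; without `tag`, match the word under any tag."""
    word = word.lower()
    if tag is not None:
        return sorted(index.get((word, tag), set()))
    hits = set()
    for (indexed_word, _), docs in index.items():
        if indexed_word == word:
            hits |= docs
    return sorted(hits)


# Hypothetical, hand-tagged toy reports.
tagged_docs = {
    "r1": [("The", "DET"), ("patient", "NOUN"), ("reports", "VERB"), ("pain", "NOUN")],
    "r2": [("The", "DET"), ("reports", "NOUN"), ("were", "VERB"), ("filed", "VERB")],
}
index = build_index(tagged_docs)

print(search(index, "reports"))
print(search(index, "reports", "NOUN"))
```

A plain word search for "reports" returns both toy reports, whereas restricting the search to the noun reading returns only the second, which is the kind of precision gain the annotation is meant to provide.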
