Paper Abstracts - Violeta Seretan


Violeta Seretan, Eric Wehrli, Luka Nerima, and Amalia Todirascu (2015)

A collocation extraction tool for Romanian. In PARSEME 5th general meeting, Iaşi, Romania.

Abstract

Lexical knowledge, and in particular knowledge of multi-word expressions, is a cornerstone of language applications such as syntactic parsing or machine translation. Corpus-driven lexical acquisition is one of the major means of creating such knowledge, in order to build or consolidate dictionaries and similar types of lexical resources. We describe ongoing work devoted to the corpus-based extraction of multi-word expressions – in particular, collocations – for the Romanian language.

Download (pdf)


Violeta Seretan (2015)

Multi-word expressions in user-generated content: How many and how well translated? Evidence from a post-editing experiment. In Proceedings of the Second Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2015), Malaga, Spain.

Abstract

According to theoretical claims, multi-word expressions are pervasive in all genres and domains, and, because of their idiosyncratic nature, they are particularly prone to automatic translation errors. We tested these claims empirically in the user-generated content domain and found that, while multi-word expressions are indeed common in this domain, their automatic translation is actually often correct, and only a modest amount – about one fifth – of the post-editing effort is devoted to fixing their translation. We also found that the upper bound for the increase in translation quality expected from perfectly handling multi-word expressions is 9 BLEU points, much higher than what is currently achieved. These results suggest that the translation of multi-word expressions is nowadays largely correct, but there is still a way to go towards their perfect translation.

Download (pdf)


Pierrette Bouillon, Johanna Gerlach, Asheesh Gulati, Victoria Porro, and Violeta Seretan (2015)

The ACCEPT Academic Portal: Bringing together pre-editing, MT and post-editing into a learning environment. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation (EAMT 2015), Antalya, Turkey.

Abstract

The ACCEPT Academic Portal is a user-centred online platform specifically designed to offer a complete machine translation workflow including pre-editing and post-editing steps for teaching purposes. The platform leverages technology developed in the ACCEPT European Project (2012–2014) devoted to improving the translatability of user-generated content. Originally available as a series of plug-ins and demonstrators on the ACCEPT portal, the various software components have been interconnected into an easy-to-use platform reproducing all phases of a real MT workflow. The platform provides a unique environment to study the interaction between MT-related processes and to assess the contribution of new technologies to translation. It will be useful for research and teaching purposes alike.

Download (pdf)


Asheesh Gulati, Pierrette Bouillon, Johanna Gerlach, Victoria Porro, and Violeta Seretan (2014)

The ACCEPT Academic Portal: A user-centred online platform for pre-editing and post-editing. In Proceedings of the 7th International Conference of the Iberian Association of Translation and Interpreting Studies (AIETI), Malaga, Spain.

Abstract

The advance of machine translation in recent years is placing new demands on professional translators. This entails new requirements for university-level translation curricula and exacerbates the need for dedicated software for teaching students how to leverage the technologies involved in a machine translation workflow. In this paper, we introduce the ACCEPT Academic Portal, a user-centred online platform which implements a complete machine translation workflow and is specifically designed for teaching purposes. Its ultimate objective is to increase the understanding of pre-editing, post-editing and evaluation of machine translation. The platform is built around three main modules, the Pre-editing, Translation and Post-editing modules, and currently supports three language combinations: French > English, English > French and English > German. The pre-editing module provides checking resources to verify the compliance of the input data with automatic and interactive pre-editing rules. The translation module translates the raw and pre-edited versions of the input text using a phrase-based Moses system, and highlights the differences between the two translations for easy identification of the impact of pre-editing on translation. The post-editing module allows users to improve translations by freely post-editing the text with the help of interactive and automatic rules. Finally, at the end of the workflow, a summary and statistics on the whole process are made available to users for evaluation and description purposes. Through its simple and user-friendly interface, as well as its pedagogically motivated functionalities that enable experimentation, visual comparison, and documentation, this academic platform provides a unique tool to study the interactions between processes and to assess the contribution of new technologies to translation.

Download (pdf)


Victoria Porro, Johanna Gerlach, Pierrette Bouillon, and Violeta Seretan (2014)

Rule-based automatic post-processing of SMT output to reduce human post-editing effort. In Proceedings of the Translating and the Computer Conference, London, U.K.

Abstract

To enhance sharing of knowledge across the language barrier, the ACCEPT project focuses on improving machine translation of user-generated content by investigating pre- and post-editing strategies. Within this context, we have developed automatic monolingual post-editing rules for French, aimed at correcting frequent errors automatically. The rules were developed using the AcrolinxIQ technology, which relies on shallow linguistic analysis. In this paper, we present an evaluation of these rules, considering their impact on the readability of MT output and their usefulness for subsequent manual post-editing. Results show that the readability of a high proportion of the data is indeed improved when automatic post-editing rules are applied. Their usefulness is confirmed by the fact that a large share of the edits brought about by the rules are in fact kept by human post-editors. Moreover, results reveal that edits which improve readability are not necessarily the same as those preserved by post-editors in the final output, hence the importance of considering both readability and post-editing effort in the evaluation of post-editing strategies.

Download (pdf)


Violeta Seretan, Pierrette Bouillon, and Johanna Gerlach (2014)

A Large-Scale Evaluation of Pre-editing Strategies for Improving User-Generated Content Translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1793–1799, Reykjavik, Iceland, May 2014.

Abstract

User-generated content represents an increasing share of the information available today. To make this type of content instantly accessible in another language, the ACCEPT project focuses on developing pre-editing technologies for correcting the source text in order to increase its translatability. Linguistically-informed pre-editing rules have been developed for English and French for the two domains considered by the project, namely, the technical domain and the healthcare domain. In this paper, we present the evaluation experiments carried out to assess the impact of the proposed pre-editing rules on translation quality. Results from a large-scale evaluation campaign show that pre-editing does indeed help attain a better translation quality for a high proportion of the data, the difference with respect to the number of cases where an adverse effect is observed being statistically significant. The ACCEPT pre-editing technology is freely available online and can be used in any Web-based environment to enhance the translatability of user-generated content so that it reaches a broader audience.

Download (pdf)


Violeta Seretan, Johann Roturier, David Silva, and Pierrette Bouillon (2014)

The ACCEPT Portal: An online framework for the pre-editing and post-editing of user-generated content. In Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation (HaCat 2014), Gothenburg, Sweden, 2014.

Abstract

With the development of Web 2.0, a large amount of content is nowadays generated online by users. Due to its characteristics (e.g., use of jargon and abbreviations, typos, grammatical and style errors), user-generated content poses specific challenges to machine translation. This paper presents an online platform devoted to the pre-editing of user-generated content and its post-editing, two main types of human assistance strategies which are combined with domain adaptation and other techniques in order to improve the translation of this type of content. The platform has recently been released publicly and is being tested by two main types of user communities, namely, technical forum users and volunteer translators.

Download (pdf)


Violeta Seretan (2013)

On collocations and their interaction with parsing and translation. Informatics, 1(1):11–31.

Abstract

We address the problem of automatically processing collocations – a subclass of multi-word expressions characterized by a high degree of morphosyntactic flexibility – in the context of two major applications, namely, syntactic parsing and machine translation. We show that parsing and collocation identification are processes that are interrelated and that benefit from each other, inasmuch as syntactic information is crucial for acquiring collocations from corpora and, vice versa, collocational information can be used to improve parsing performance. Similarly, we focus on the interrelation between collocations and machine translation, highlighting the use of translation information for multilingual collocation identification, as well as the use of collocational knowledge for improving translation. We give a panorama of the existing relevant work, and we parallel the literature surveys with our own experiments involving a symbolic parser and a rule-based translation system. The results show a significant improvement over approaches in which the corresponding tasks are decoupled.

Download (pdf)


Violeta Seretan and Eric Wehrli (2013)

Syntactic concordancing and multi-word expression detection. International Journal of Data Mining, Modelling and Management, 5(2):158–181.

Abstract

Concordancers are tools that display the contexts of a given word in a corpus. Also called key word in context (KWIC), these tools are nowadays indispensable in the work of lexicographers, linguists, and translators. We present an enhanced type of concordancer that integrates syntactic information on sentence structure as well as statistical information on word cooccurrence in order to detect and display those words from the context that are most strongly related to the word under investigation. This tool considerably alleviates the users’ task, by highlighting syntactically well-formed word combinations that are likely to form complex lexical units, i.e., multi-word expressions. One of the key distinctive features of the tool is its multilingualism, as syntax-based multi-word expression detection is available for multiple languages and parallel concordancing enables users to consult the version of a source context in another language, when multilingual parallel corpora are available. In this article, we describe the underlying methodology and resources used by the system, its architecture, and its recently developed online version. We also provide relevant performance evaluation results for the main system components, focusing on the comparison between syntax-based and syntax-free approaches.

Download (pdf)


Violeta Seretan and Eric Wehrli (2013)

Context-sensitive look-up in electronic dictionaries. In Rufus H. Gouws, Ulrich Heid, Wolfgang Schweickard, Herbert E. Wiegand, editors, Dictionaries. An International Encyclopedia of Lexicography. Supplementary volume: Recent Developments with Focus on Electronic and Computational Lexicography, volume 5/4 of Handbooks of Linguistics and Communication Science (HSK), pages 1046–1052. De Gruyter Mouton, Berlin, Boston.

Abstract

While the access to the content of electronic dictionaries is traditionally made via a simple headword-based search, in the past decade a new generation of electronic dictionaries has emerged, which are aimed at providing a more sophisticated look-up functionality that takes into account the actual users’ needs. Since users are most likely to consult a dictionary while reading a text (and do not necessarily know how to relate an inflected word to the corresponding headword or to recognize the multi-word unit this word may be part of), it is apparent that the inflected word and its context should play a key role in accessing the dictionary information. Context-sensitive look-up methodologies rely on linguistic analysis tools, such as morphological analysers, syntactic parsers or semantic taggers, in order to enable the match between text and dictionary and to narrow the information to be displayed to users according to the clues provided by the word context. This chapter introduces the problematics of the context-based dictionary look-up, presents the available context-sensitive dictionaries by discussing their underlying approaches as well as their technical options and challenges, and indicates the development perspectives for this look-up approach.

Download (pdf)


Violeta Seretan (2013)

A multilingual integrated framework for processing lexical collocations. In Adam Przepiórkowski, Maciej Piasecki, Krzysztof Jassem, and Piotr Fuglewicz, editors, Computational Linguistics, volume 458 of Studies in Computational Intelligence, pages 87–108. Springer Berlin Heidelberg.

Abstract

Lexical collocations are typical combinations of words, such as heavy rain, close collaboration, or to meet a deadline. Pervasive in language, they are a key issue for NLP systems since, like other types of multi-word expressions such as idioms, they do not allow for word-by-word processing. We present a multilingual framework that lays emphasis on the accurate acquisition of collocational knowledge from corpora and its exploitation in two large-scale applications (parsing and machine translation), as well as for lexicographic support and for reading assistance. The underlying methodology departs from mainstream approaches by relying on deep parsing to cope with the high morphosyntactic flexibility of collocations. We review theoretical claims and contrast them with practical work, showing our efforts to model collocations in an adequate and comprehensive way. Experimental results show the efficiency of our approach and the impact of collocational knowledge on the performance of parsing and machine translation.

Download (pdf)


Violeta Seretan (2012)

Acquisition of syntactic simplification rules for French. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey.

Abstract

Text simplification is the process of reducing the lexical and syntactic complexity of a text while attempting to preserve (most of) its information content. It has recently emerged as an important research area, which holds promise for enhancing text readability for the benefit of a broader audience as well as for increasing the performance of other applications. Our work focuses on syntactic complexity reduction and deals with the task of corpus-based acquisition of syntactic simplification rules for the French language. We show that the data-driven manual acquisition of simplification rules can be complemented by the semi-automatic detection of syntactic constructions requiring simplification. We provide the first comprehensive set of syntactic simplification rules for French, whose size is comparable to that of similar resources existing for English and Brazilian Portuguese. Unlike these manually-built resources, our resource integrates larger lists of lexical cues signaling simplifiable constructions, which are useful for informing practical systems.

Download (pdf)


Violeta Seretan (2011)

A collocation-driven approach to text summarization. In Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2011), pages 9–14, Montpellier, France.

Résumé

Dans cet article, nous décrivons une nouvelle approche pour la création de résumés extractifs – tâche qui consiste à créer automatiquement un résumé pour un document en sélectionnant un sous-ensemble de ses phrases – qui exploite des informations collocationnelles spécifiques à un domaine, acquises préalablement à partir d’un corpus de développement. Un extracteur de collocations fondé sur l’analyse syntaxique est utilisé afin d’inférer un modèle de contenu qui est ensuite appliqué au document à résumer. Cette approche a été utilisée pour la création des versions simples pour les articles de Wikipedia en anglais, dans le cadre d’un projet visant la création automatique d’articles simplifiées, similaires aux articles recensées dans Simple English Wikipedia. Une évaluation du système développé reste encore à faire. Toutefois, les résultats préalables obtenus pour les articles sur des villes montrent le potentiel de cette approche guidée par collocations pour la sélection des phrases pertinentes.

Abstract

We present a novel approach to extractive summarization – the task of producing an abstract for an input document by selecting a subset of the original sentences – which relies on domain-specific collocation information automatically acquired from a development corpus. A syntax-based collocation extractor is used to infer a content template and then to match this template against the document to summarize. The approach has been applied to generate simplified versions of Wikipedia articles in English, as part of a larger project on automatically generating Simple English Wikipedia articles starting from their standard counterparts. An evaluation of the developed system has yet to be performed; nonetheless, the preliminary results obtained in summarizing Wikipedia articles on cities already indicate the potential of our collocation-driven method to select relevant sentences.

Download (pdf)


Rodolfo Delmonte, Vincenzo Pallotta, Violeta Seretan, Lammert Vrieling, and David Walker (2011)

An interaction mining suite based on natural language understanding. In Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2011), Montpellier, France.

Abstract

We introduce the Interanalytics™ Interaction Mining Suite, a collection of tools that performs the analysis, summarization and visualization of conversational content. Interaction Mining is an emerging Business Intelligence (BI) application whose main goal is the discovery and automatic extraction of useful information from human conversational interactions for analytical purposes. Turning conversational data into meaningful information leads to better business decisions through appropriate visualization and navigation techniques. Interanalytics™ leverages advanced Natural Language Understanding (NLU) technology in a BI tool, enabling analysts to understand and generate insights from conversational content in selected business applications such as Speech Analytics, Social Media monitoring, and Market Research.

Download (pdf)


Violeta Seretan and Eric Wehrli (2011)

FipsCoView: On-line visualisation of collocations extracted from multilingual parallel corpora. In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World, pages 125–127, Portland, Oregon, USA.

Abstract

We introduce FipsCoView, an on-line interface for dictionary-like visualisation of collocations detected from parallel corpora using a syntactically-informed extraction method.

Download (pdf)


Eric Wehrli, Violeta Seretan, and Luka Nerima (2010)

Sentence analysis and collocation identification. In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), pages 27–35, Beijing, China.

Abstract

Identifying collocations in a sentence, in order to ensure their proper processing in subsequent applications, and performing the syntactic analysis of the sentence are interrelated processes. Syntactic information is crucial for detecting collocations, and vice versa, collocational information is useful for parsing. This article describes an original approach in which collocations are identified in a sentence as soon as possible during the analysis of that sentence, rather than at the end of the analysis, as in our previous work. In this way, priority is given to parsing alternatives involving collocations, and collocational information guides the parser through the maze of alternatives. This solution was shown to lead to substantial improvements in the performance of both tasks (collocation identification and parsing), as well as in that of a subsequent task (machine translation).

Download (pdf)


Violeta Seretan and Eric Wehrli (2010)

Tools for syntactic concordancing. In Proceedings of the International Multiconference on Computer Science and Information Technology, pages 493–500, Wisła, Poland.

Abstract

Concordancers are tools that display the immediate context for the occurrences of a given word in a corpus. Also called KWIC – Key Word in Context tools, they are essential in the work of lexicographers, corpus linguists, and translators alike. We present an enhanced type of concordancer, which relies on a syntactic parser and on statistical association measures in order to detect those words in the context that are syntactically related to the sought word and are the most relevant for it, because together they may participate in multi-word expressions (MWEs). Our syntax-based concordancer highlights the MWEs in a corpus, groups them into syntactically homogeneous classes (e.g., verb-object, adjective-noun), ranks MWEs according to the strength of association with the given word, and for each MWE occurrence displays the whole source sentence as a context. In addition, parallel sentence alignment and MWE translation techniques are used to display the translation of the source sentence in another language, and to automatically find a translation for the identified MWEs. The tool also offers functionalities for building an MWE database, and is available both off-line and online for a number of languages (among them English, French, Spanish, Italian, German, Greek and Romanian).

Download (pdf)


Violeta Seretan and Eric Wehrli (2010)

Extending a multilingual symbolic parser to Romanian. In Dan Tufiş and Corina Forăscu, editors, Multilinguality and Interoperability in Language Processing with Emphasis on Romanian, Romanian Academy Publishing House, Bucharest, Romania.

Abstract

A syntactic parser (a system that analyses the structure of natural language sentences) is a fundamental tool for any language, providing information that is essential for virtually any language application. With a single exception (Călăcean & Nivre 2009), such a tool was missing from the otherwise vast repertory of language tools available for Romanian. In this paper, we report on ongoing work aimed at developing a symbolic syntactic parser able to fully analyse unrestricted Romanian text – in contrast, the existing parser provides an analysis in terms of dependency relations, is data-driven, and was only trained on simple sentences. Our parser is based on the Fips multilingual parsing architecture (Wehrli, 2007). We present the preliminary tasks that enabled the implementation of the Romanian version, i.e., lexicon compilation and grammar specification. We describe the current status of the parser and present experimental results, both on parsing a collection of journalistic text, and on using the parsed data in a collocation extraction application.

Download (pdf)


Violeta Seretan, Eric Wehrli, Luka Nerima, and Gabriela Soare (2010)

FipsRomanian: Towards a Romanian version of the Fips syntactic parser. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta.

Abstract

We describe work in progress on the development of a full syntactic parser for Romanian. This work is part of a larger project of multilingual extension of the Fips parser (Wehrli, 2007), already available for French, English, German, Spanish, Italian, and Greek, to four new languages (Romanian, Romansh, Russian and Japanese). The Romanian version was built by starting with the Fips generic parsing architecture for the Romance languages and customising the grammatical component, in close relation to the development of the lexical component. We describe this process and report on preliminary results obtained for journalistic texts.

Download (pdf)


Luka Nerima, Eric Wehrli, and Violeta Seretan (2010)

A recursive treatment of collocations. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta.

Abstract

This article discusses the treatment of collocations in the context of a long-term project on the development of multilingual NLP tools. Besides "classical" two-word collocations, we focus on the case of complex collocations (3 words or more), for which a recursive design is presented in the form of collocation of collocations. Although comparatively less numerous than two-word collocations, complex collocations pose important challenges for NLP. The article discusses how these collocations are retrieved from corpora, how they are inserted and stored in a lexical database, how the parser uses such knowledge, and what advantages a recursive approach to complex collocations offers.
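
To illustrate the recursive "collocation of collocations" design described in the abstract, here is a minimal, hypothetical sketch of one possible representation; the class and field names are illustrative only and do not reproduce the representation used in the authors' lexical database.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Collocation:
    """A collocation whose parts may themselves be collocations,
    so complex (3+ word) collocations can be stored recursively."""
    head: Union[str, "Collocation"]
    dependent: Union[str, "Collocation"]
    relation: str  # e.g. "verb-object", "adj-noun"

# "key role" is a two-word collocation; "play (a) key role" embeds it recursively.
key_role = Collocation("key", "role", "adj-noun")
play_key_role = Collocation("play", key_role, "verb-object")
print(play_key_role)
```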

Download (pdf)


Eric Wehrli, Luka Nerima, Violeta Seretan, and Yves Scherrer (2009)

On-line and off-line translation aids for non-native readers. In Proceedings of the International Multiconference on Computer Science and Information Technology, pages 299–303, Mrągowo, Poland.

Abstract

Twic and TwicPen are reading aid systems for readers of material in foreign languages. Although they include a sentence translation engine, both systems are primarily conceived to give word and expression translations to readers with a basic knowledge of the language they read. Twic has been designed for on-line material and consists of a plug-in for internet browsers communicating with our server. TwicPen offers similar assistance for readers of printed material. It consists of a hand-held scanner connected to a laptop (or desktop) computer running our parsing and translation software. Both systems provide readers with a limited number of translations selected on the basis of a linguistic analysis of the whole scanned text fragment (a phrase, part of the sentence, etc.). The use of a morphological and syntactic parser makes it possible (i) to disambiguate to a large extent the word selected by the user (and hence to drastically reduce the noise in the response), and (ii) to handle expressions (compounds, collocations, idioms), often a major source of difficulty for non-native readers. The systems are available for the following language pairs: English-French, French-English, German-French, German-English, Italian-French, Spanish-French. Several other pairs are under development.

Download (pdf)


Violeta Seretan (2009)

Extraction de collocations et leurs équivalents de traduction à partir de corpus parallèles. TAL, 50(1):305–332.

Résumé

Identifier les collocations dans le texte source (par exemple, break record) et les traduire correctement (battre record contre *casser record) constituent un réel défi pour la traduction automatique, d’autant plus que ces expressions sont très nombreuses et très flexibles du point de vue syntaxique. Cet article présente une méthode permettant de repérer des équivalents de traduction pour les collocations à partir de corpus parallèles, qui sera utilisée pour augmenter la base de données lexicales d’un système de traduction. La méthode est fondée sur une approche syntaxique « profonde », dans laquelle les collocations et leurs équivalents potentiels sont extraits à partir de phrases alignées à l’aide d’un analyseur multilingue. L’article présente également les outils qui sont utilisés par cette méthode. Il se concentre en particulier sur les efforts déployés afin de rendre compte des divergences structurelles entre les langues et d’optimiser la performance de la méthode, notamment en ce qui concerne la couverture.

Abstract

Identifying collocations in a text (e.g., break record) and correctly translating them (battre record vs. *casser record) represent key issues in machine translation, notably because of their prevalence in language and their syntactic flexibility. This article describes a method for discovering translation equivalents for collocations from parallel corpora, aimed at increasing the lexical coverage of a machine translation system. The method is based on a “deep” syntactic approach, in which collocations and candidate translations are identified from sentence-aligned text with the help of a multilingual parser. The article also introduces the tools on which this method relies. It focuses in particular on the efforts made to account for structural divergences between languages and to improve the method’s performance in terms of coverage.

Download (pdf)


Violeta Seretan (2009)

An integrated environment for extracting and translating collocations. In Proceedings of the Fifth Corpus Linguistics Conference, Liverpool, U.K.

Abstract

This paper describes the way collocations, which constitute an important part of the multi-word lexicon of a language, are integrated into a multilingual parser and into a machine translation system. Different processing modules concur to ensure an appropriate treatment for collocations, from their automatic acquisition to their actual use in parsing and translation. The main concerns are, first, to cope with the syntactic flexibility characterising collocations, and second, to make sure that the collocation phenomenon is modelled in a rather comprehensive manner. The paper discusses, in particular, issues such as the necessity to extract collocations from syntactically parsed text (rather than from raw text), the identification of collocations consisting of more than two words, the detection of translation equivalents in parallel texts, and the issue of representing collocational information in a lexical database. The processing framework built represents an unprecedented environment that provides an advanced and comprehensive treatment for collocations.

Download (pdf)


Eric Wehrli, Violeta Seretan, Luka Nerima, and Lorenza Russo (2009)

Collocations in a rule-based MT system: A case study evaluation of their translation adequacy. In Proceedings of the 13th Annual Meeting of the European Association for Machine Translation, pages 128–135, Barcelona, Spain.

Abstract

Collocations constitute a subclass of multi-word expressions that are particularly problematic for machine translation, due (1) to their omnipresence in texts and (2) to their morpho-syntactic properties, which allow virtually unlimited variation and lead to long-distance dependencies. Since existing MT systems incorporate mostly local information, they are arguably ill-suited for handling those collocations whose items are not found in close proximity. In this article, we describe an integrated environment in which collocations (and possibly their translation equivalents) are first identified from text corpora and stored in the lexical database of a translation system; they are then employed by this system, which is capable of dealing with syntactic transformations as it is based on a deep linguistic approach. We compare the performance of our system (in terms of collocation translation adequacy) with that of two major MT systems, one statistical and the other rule-based. Our results confirm that syntactic variation affects translation quality and show that a deep syntactic approach is more robust in this sense, especially for languages with freer word order (e.g., German) and richer morphology (e.g., Italian) than English.

Download (pdf)


Athina Michou and Violeta Seretan (2009)

A tool for multi-word expression extraction in Modern Greek using syntactic parsing. In Proceedings of the Demonstrations Session at EACL 2009, pages 45–48, Athens, Greece.

Abstract

This paper presents a tool for extracting multi-word expressions from corpora in Modern Greek, which is used together with a parallel concordancer to augment the lexicon of a rule-based machine translation system. The tool is part of a larger extraction system that relies, in turn, on a multilingual parser developed over the past decade in our laboratory. The paper reviews the various NLP modules and resources which enable the retrieval of Greek multi-word expressions and their translations: the Greek parser, its lexical database, the extraction and concordancing system.

Download (pdf)


Violeta Seretan and Eric Wehrli (2008)

Multilingual collocation extraction with a syntactic parser. Language Resources and Evaluation, 43(1):71–85. The original publication is available at www.springerlink.com.

Abstract

An impressive amount of work has been devoted over the past few decades to collocation extraction. The state of the art shows that there is a sustained interest in the morphosyntactic preprocessing of texts in order to better identify candidate expressions; however, the treatment performed is, in most cases, limited (lemmatization, POS-tagging, or shallow parsing). This article presents a collocation extraction system based on the full parsing of source corpora, which supports four languages: English, French, Spanish, and Italian. The performance of the system is compared against that of the standard mobile-window method. The evaluation experiment investigates several levels of the significance lists, uses a fine-grained annotation schema, and covers all the languages supported. Consistent results were obtained for these languages: parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision (between 16.4 and 29.7%, depending on the language; 20.1% overall), MWE precision (between 19.9 and 35.8%; 26.1% overall), and grammatical precision (between 47.3 and 67.4%; 55.6% overall). This positive result is of high importance, especially from the perspective of the subsequent integration of extraction results into other NLP applications.

Download (pdf)


Violeta Seretan (2008)

Collocation Extraction Based on Syntactic Parsing. Ph.D. thesis, University of Geneva.

Abstract

Pervasive across texts of different genres and domains, collocations (typical lexical associations like to wreak havoc, to meet a condition, to believe firmly, a deep concern, highly controversial) constitute a large proportion of the multi-word expressions in a language. Due to their encoding idiomaticity, collocations are of paramount importance to text production tasks. Their recognition and appropriate usage is essential, for instance, in Foreign Language Learning or in Natural Language Processing applications such as machine translation and natural language generation. At the same time, collocations have a wide applicability to tasks concerned with the opposite process of text analysis.

The problem that is tackled in this thesis is the automatic acquisition of accurate collocational information from text corpora. More specifically, the thesis provides a methodological framework for the syntax-based identification of collocation candidates in the source text, prior to the statistical computation step. The development of syntax-based approaches to collocation extraction, which has traditionally been hindered by the absence of appropriate linguistic tools, is nowadays possible thanks to the advances achieved in parsing. Until now, the absence of sufficiently robust parsers was typically circumvented by applying linear proximity constraints in order to detect syntactic links between words. This method is relatively successful for English, but for languages with a richer morphology and a freer word order, parsing is a prerequisite for a good performance.
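
For readers unfamiliar with the linear-proximity baseline mentioned above, the following is a minimal, hypothetical sketch of window-based candidate collection; the window size and tokenisation are illustrative and do not reproduce the procedure evaluated in the thesis.

```python
from collections import Counter

def window_candidates(tokens, window=5):
    """Collect co-occurring word pairs within a fixed-size sliding window,
    the proximity-based baseline that syntax-based extraction replaces
    with parser-derived links between (possibly distant) words."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + window]:
            counts[(w1, w2)] += 1
    return counts

tokens = "the committee will take an important decision tomorrow".split()
print(window_candidates(tokens).most_common(3))
```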

The thesis proposes (and fully evaluates on data in four different languages, English, French, Spanish and Italian) a core extraction procedure for discovering binary collocations, which is based on imposing syntactic constraints on the component items instead of linear proximity constraints. This procedure is further employed in several methods of advanced extraction, whose aim is to cover a broader spectrum of collocational phenomena in text. Three distinct but complementary extension directions have been considered in this thesis: extraction of n-ary collocations (n > 2), data-driven induction of collocationally relevant syntactic configurations, and collocation mining from an alternative source corpus, the World Wide Web. The possibility to abstract away from the surface text form and to recover, thanks to parsing, the syntactic links between discontinuous elements in text plays a crucial role in achieving highly efficient results.

The methods proposed in this study were adopted in the development of an integrated system of collocation extraction and visualization in parallel corpora, a system which was intended to enrich the workbench of translators or other users (e.g., terminologists, lexicographers, language learners) wanting to exploit their text archives. Finally, the thesis gives an example of a practical application that builds on this system in order to further process the extracted collocations, by automatically translating them when parallel corpora are available.

Download (pdf)


Violeta Seretan and Eric Wehrli (2007)

Collocation translation based on sentence alignment and parsing. In Actes de la 14e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2007), pages 401–410, Toulouse, France. Best Paper Award.

Résumé

Bien que de nombreux efforts aient été déployés pour extraire des collocations à partir de corpus de textes, seule une minorité de travaux se préoccupent aussi de rendre le résultat de l’extraction prêt à être utilisé dans les applications TAL qui pourraient en bénéficier, telles que la traduction automatique. Cet article décrit une méthode précise d’identification de la traduction des collocations dans un corpus parallèle, qui présente les avantages suivants : elle peut traiter des collocation flexibles (et pas seulement figées) ; elle a besoin de ressources limitées et d’un pouvoir de calcul raisonnable (pas d’alignement complet, pas d’entraînement) ; elle peut être appliquée à plusieurs paires des langues et fonctionne même en l’absence de dictionnaires bilingues. La méthode est basée sur l’information syntaxique provenant du parseur multilingue Fips. L’évaluation effectuée sur 4000 collocations de type verbe-objet correspondant à plusieurs paires de langues a montré une précision moyenne de 89.8% et une couverture satisfaisante (70.9%). Ces résultats sont supérieurs à ceux enregistrés dans l’évaluation d’autres méthodes de traduction de collocations.

Abstract

To date, substantial efforts have been devoted to the extraction of collocations from text corpora. However, only a few works deal with the subsequent processing of results in order for these to be successfully integrated into the NLP applications that could benefit from them (e.g., machine translation). This paper presents an accurate method for identifying translation equivalents of collocations in parallel text, whose main strengths are the following: it can handle flexible (not only rigid) collocations; it requires only limited resources and computation (no full alignment, no training needed); it deals with several language pairs; and it can even work when no bilingual dictionary is available. The method relies heavily on syntactic information provided by the Fips multilingual parser. Evaluation performed on 4,000 verb-object collocations for different language pairs showed an average accuracy of 89.8% and a reasonable coverage (70.9%). These figures are higher than those reported in the evaluation of related work in collocation translation.

Download (pdf)


Vincenzo Pallotta, Violeta Seretan, and Marita Ailomaa (2007)

User requirements analysis for Meeting Information Retrieval based on query elicitation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), pages 1008–1015, Prague, Czech Republic.

Abstract

We present a user requirements study for Question Answering on meeting records that assesses the difficulty of users' questions in terms of what type of knowledge is required in order to provide the correct answer. We grounded our work on the empirical analysis of elicited user queries. We found that the majority of elicited queries (around 60%) pertain to argumentative processes and outcomes. Our analysis also suggests that standard keyword-based Information Retrieval can only deal successfully with less than 20% of the queries, and that it must be complemented with other types of metadata and inference.

Download (pdf)


Vincenzo Pallotta, Violeta Seretan, Marita Ailomaa, Hatem Ghorbel, and Martin Rajman (2007)

Towards an argumentative coding scheme for annotating meeting dialogue data. In Proceedings of the 10th International Pragmatics Association Conference (IPrA), Göteborg, Sweden.

Abstract

This paper reports on the main issues that arose during the development and testing of a coding scheme for the argumentative annotation of meeting discussions. A corpus of meeting discussions has been collected in the framework of a research project on multimodal dialogue analysis, and a coding scheme has been proposed. Annotations have been gathered from a set of annotators with different skills in argumentative discourse analysis, and the reliability of the coding scheme has been assessed with standard statistical measures.

Download (pdf)


Luka Nerima, Violeta Seretan, and Eric Wehrli (2006)

Le problème des collocations en TAL. Nouveaux cahiers de linguistique française, 27(2006):95–115.

Abstract

This article presents the model for processing multi-word expressions implemented in the NLP work of the LATL. It discusses the automatic identification of these expressions, their storage in the lexicon, and their handling in the Fips parser, the Its-2 translation system and the TWiC terminological assistance system. Focusing on collocations, the most flexible and the most frequent of these expressions, it highlights the need for a detailed syntactic analysis of the text in order to ensure the appropriate treatment of these expressions and to guarantee better parsing and translation performance.

Download (pdf)


Violeta Seretan and Eric Wehrli (2006)

Accurate collocation extraction using a multilingual parser. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 953–960, Sydney, Australia.

Abstract

This paper focuses on the use of advanced techniques of text analysis as support for collocation extraction. A hybrid system is presented that combines statistical methods and multilingual parsing for detecting accurate collocational information from English, French, Spanish and Italian corpora. The advantage of relying on full parsing over using a traditional window method (which ignores the syntactic information) is first theoretically motivated, then empirically validated by a comparative evaluation experiment.

Download (pdf)


Violeta Seretan and Eric Wehrli (2006)

Multilingual collocation extraction: Issues and solutions. In Proceedings of the Workshop on Multilingual Language Resources and Interoperability, pages 40–49, Sydney, Australia.

Abstract

Although traditionally seen as a language-independent task, collocation extraction relies nowadays more and more on the linguistic preprocessing of texts (e.g., lemmatization, POS tagging, chunking or parsing) prior to the application of statistical measures. This paper provides a language-oriented review of the existing extraction work. It points out several language-specific issues related to extraction and proposes a strategy for coping with them. It then describes a hybrid extraction system based on a multilingual parser. Finally, it presents a case-study on the performance of an association measure across a number of languages.

Download (pdf)


Violeta Seretan (2005)

Induction of syntactic collocation patterns from generic syntactic relations. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI 2005), pages 1698–1699, Edinburgh, Scotland.

Abstract

Syntactic configurations used in collocation extraction are highly divergent from one system to another, thus questioning the validity of results and making comparative evaluation difficult. We describe a corpus-driven approach for inferring an exhaustive set of configurations from actual data by finding, with a parser, all the productive syntactic associations, and then appealing to human expertise for relevance judgements.

Download (pdf)


Violeta Seretan, Luka Nerima, and Eric Wehrli (2004)

Multi-word collocation extraction by syntactic composition of collocation bigrams. In Nicolas Nicolov, Kalina Bontcheva, Galia Angelova, and Ruslan Mitkov, editors, Recent Advances in Natural Language Processing III: Selected Papers from RANLP 2003, Current Issues in Linguistic Theory, pages 91–100. John Benjamins, Amsterdam/Philadelphia. The original publication is available at www.benjamins.com.

Abstract

This paper presents a method of multi-word collocation extraction which is based on the syntactic composition of two-word collocations previously identified in text. We describe a procedure of word linking that iteratively builds up longer expressions, which constitute multi-word collocation candidates. We then present several measures used for ranking candidates according to their collocational strength, and show the results of a trigram extraction experiment. The methodology used is particularly suited for the extraction of flexible collocations, which can undergo complex syntactic transformations such as passivization, relativization and dislocation.

Download (pdf)


Violeta Seretan, Luka Nerima, and Eric Wehrli (2004)

A tool for multi-word collocation extraction and visualization in multilingual corpora. In Proceedings of the Eleventh EURALEX International Congress (EURALEX 2004), pages 755–766, Lorient, France.

Abstract

This document describes an implemented system of collocation extraction which is designed as an aid to translation and which will be used in a real translation environment. Its main functionalities are: retrieving multi-word collocations from an existing corpus of documents in a given language (only French and English are supported for the time being); visualizing the list of extracted terms and their contexts by using a concordance tool; retrieving the translation equivalents of the sentences containing the collocations in the existing parallel corpora; and enabling the user to create a sublist of validated collocations to be further used as a reference in translation. The approach underlying this system is hybrid, as the extraction method combines the syntactic analysis of texts (for selecting the collocation candidates) with a statistics-based measure for the relevance test (i.e., for ranking candidates according to their collocational strength). We present the underlying approach and methodology and the architecture of the system, describe the main system components, and provide several experimental results.

Download (pdf)


Violeta Seretan, Luka Nerima, and Eric Wehrli (2004)

Using the Web as a corpus for the syntactic-based collocation identification. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2004), pages 1871–1874, Lisbon, Portugal.

Abstract

This paper presents an experiment that uses a Web search engine and a robust parser for the Web-based identification of collocations (statistically significant word associations representing "a conventional way of saying things" (Manning and Schütze, 1999)). We identify the possible collocates of a given word by parsing the text snippets returned by the search engine when querying that word. Then, we rank the list of retrieved syntactic co-occurrences according to the collocational strength of each pair, using different statistical measures.
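
As a concrete illustration of the ranking step mentioned in the abstract, the sketch below computes one widely used association measure (pointwise mutual information) from raw co-occurrence counts; the counts and the function are invented for illustration and do not reproduce the paper's implementation or the specific measures it compares.

```python
import math

def pmi(pair_count, w1_count, w2_count, total_pairs):
    """Pointwise mutual information of a syntactic co-occurrence pair,
    estimated from raw counts over the set of extracted pairs."""
    p_pair = pair_count / total_pairs
    p_w1 = w1_count / total_pairs
    p_w2 = w2_count / total_pairs
    return math.log2(p_pair / (p_w1 * p_w2))

# Toy counts for the verb-object pair ("break", "record"):
print(round(pmi(pair_count=25, w1_count=300, w2_count=120, total_pairs=100_000), 2))
```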

Download (pdf)


Violeta Seretan, Luka Nerima, and Eric Wehrli (2003)

Extraction of multi-word collocations using syntactic bigram composition. In Proceedings of the International Conference on Recent Advances in NLP (RANLP-2003), pages 424–431, Borovets, Bulgaria.

Abstract

This paper presents a method for extracting multi-word collocations (MWCs) from text corpora, which is based on the previous extraction of syntactically bound collocation bigrams. We describe an iterative word linking procedure which relies on a syntactic criterion and aims at building up arbitrarily long expressions that represent multi-word collocation candidates. We propose several measures to rank candidates according to the collocational strength, and we present the results of a trigram extraction experiment. The methodology used is particularly well-suited for the identification of those collocations whose terms are arbitrarily distant, due to syntactic processes (passivization, relativization, dislocation, topicalization).
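
The composition step can be pictured with the following minimal sketch, which links a verb-object bigram to an adjective-noun bigram sharing the same noun; the data and the linking criterion are illustrative only, not the syntax-driven procedure implemented over the parser output.

```python
def compose_trigrams(verb_object, adj_noun):
    """Link two syntactically bound bigrams that share a noun into a
    three-word collocation candidate, e.g. 'take (an) important decision'."""
    return {(v, a, n) for (v, o) in verb_object for (a, n) in adj_noun if o == n}

verb_object = {("take", "decision"), ("break", "record")}
adj_noun = {("important", "decision"), ("world", "record")}
print(compose_trigrams(verb_object, adj_noun))
```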

Download (pdf)


Luka Nerima, Violeta Seretan, and Eric Wehrli (2003)

Creating a multilingual collocation dictionary from large text corpora. In Proceedings of the Research Notes Session of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL'03), pages 131–134, Budapest, Hungary.

Abstract

This paper describes a system of terminological extraction capable of handling multi-word expressions, using a powerful syntactic parser. The system includes a concordancing tool enabling the user to display the context of the collocation, i.e. the sentence or the whole document where the collocation occurs. Since the corpora are multilingual, the system also offers an alignment mechanism for the corresponding translated documents.

Download (pdf)


Violeta Seretan (2002)

Discourse analysis correction using anaphoric cues. In Proceedings of the 2nd Workshop on RObust Methods in Analysis of Natural Language Data (ROMAND 2002), pages 79–86, Frascati, Italy.

Abstract

This paper presents a method for correcting arbitrary (possibly wrong or inadequate) discourse structure analyses, based on the constraints imposed on the discourse structure configuration by the anaphoric relations in the text. The approach taken relies on concepts and ideas from Veins Theory (VT; Cristea, Ide, Romary, 1998) as a theory of global discourse cohesion. The method, currently under implementation, will be evaluated against a corpus of texts previously corrected manually.

Download (pdf)


Violeta Seretan and Dan Cristea (2002)

The use of referential constraints in structuring discourse. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pages 1231–1238, Las Palmas, Spain.

Abstract

The quality of discourse structure annotations is negatively influenced by the numerous difficulties that occur in the analysis process. In contrast, referential annotation resources are considerably more reliable, given the high precision of existing anaphora resolution systems. We present an approach based on Veins Theory (Cristea, Ide, Romary, 1998), in which successful reference annotations of texts are exploited in order to improve arbitrary structural analyses; in this way, the large amount of corpora annotated at the reference level can be used for the acquisition of discourse structure annotation resources.

Download (pdf)