Domain Adaptation for Dependency Parsing

The domain adaptation task investigates techniques for adapting state-of-the-art dependency parsing systems to domains other than those of the data on which they were trained or developed. This is the first time that such a task has been proposed within the EVALITA campaign and for the Italian language. The goal of the task is to learn how to increase the accuracy of a parsing system when dealing with out-of-domain texts. In particular, the task consists of learning how to derive labeled dependency relations for Italian by means of a parser developed for general language.

The following data sets (in CoNLL format) will be distributed:

- for the source domain:

  • a training set represented by the ISST-TANL corpus jointly developed by the Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC-CNR) and the University of Pisa (UniPi) and already used in the dependency parsing track of EVALITA 2009 (pilot sub-task);
  • a development set of about 5,000 tokens;

- for the target domain:

  • a target corpus drawn from an Italian legislative corpus, gathering laws enacted by different releasing agencies (European Commission, Italian State and Regions) and regulating a variety of domains, ranging from the environment, human rights and disability rights to freedom of expression. The target corpus comes with automatically generated sentence splitting, tokenization and PoS tagging;
  • a manually annotated development set of about 5,000 tokens, also including labeled dependency relations.
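
As an illustration of how the distributed files might be read, here is a minimal Python sketch. It assumes the data follow the ten-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences; the task guidelines remain the authority on the actual format. The names `Token`, `read_conll` and `SAMPLE`, as well as the sample sentences and their tags, are invented for this example.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    """The subset of columns needed for dependency parsing (hypothetical helper)."""
    id: int
    form: str
    postag: str
    head: Optional[int]    # None when the token carries no dependency annotation
    deprel: Optional[str]  # None when the token carries no dependency annotation


def read_conll(text: str) -> List[List[Token]]:
    """Split CoNLL-X style text into sentences, each a list of Token objects."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        # Assumed column positions: 0=ID, 1=FORM, 4=POSTAG, 6=HEAD, 7=DEPREL.
        head = None if cols[6] == "_" else int(cols[6])
        deprel = None if cols[7] == "_" else cols[7]
        current.append(Token(int(cols[0]), cols[1], cols[4], head, deprel))
    if current:
        sentences.append(current)
    return sentences


# Invented sample: one annotated sentence and one without dependency annotation.
SAMPLE = (
    "1\tIl\til\tR\tRD\t_\t2\tdet\t_\t_\n"
    "2\tcane\tcane\tS\tS\t_\t3\tsubj\t_\t_\n"
    "3\tdorme\tdormire\tV\tV\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tPiove\tpiovere\tV\tV\t_\t_\t_\t_\t_\n"
)

sentences = read_conll(SAMPLE)
for sentence in sentences:
    print([(tok.form, tok.head, tok.deprel) for tok in sentence])
```

Treating an underscore in the HEAD and DEPREL columns as "no annotation" lets the same reader handle both manually annotated sets and a target corpus that carries only automatic tokenization and PoS tags.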

Evaluation will be carried out using standard dependency parsing accuracy measures (labeled attachment score, unlabeled attachment score, label accuracy) on a test set of about 5,000 tokens of target-domain text with manually revised PoS tags. Participating systems may only exploit resources (data) provided by the organizers. This also means that the use of additional components trained on any other data set is prohibited.
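
For reference, the three measures can be sketched in a few lines of Python. This is an illustration under the usual definitions, not the official scorer: labeled attachment score (LAS) is the share of tokens with both the correct head and the correct label, unlabeled attachment score (UAS) the share with the correct head, and label accuracy (LA) the share with the correct label. The function name `attachment_scores` and the toy data are invented here, and the sketch scores every token, whereas an official scorer may treat some tokens (for example punctuation) differently.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Dict[str, float]:
    """Compute LAS, UAS and LA over two token-aligned lists of arcs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")
    total = len(gold)
    if total == 0:
        raise ValueError("cannot score an empty token list")
    # UAS: head matches; LA: label matches; LAS: both match.
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / total
    la = sum(g[1] == p[1] for g, p in zip(gold, predicted)) / total
    las = sum(g == p for g, p in zip(gold, predicted)) / total
    return {"LAS": las, "UAS": uas, "LA": la}


# Toy four-token example (invented): token 2 has a wrong label, token 4 a wrong head.
GOLD = [(2, "det"), (3, "subj"), (0, "ROOT"), (3, "obj")]
PREDICTED = [(2, "det"), (3, "obj"), (0, "ROOT"), (1, "obj")]

scores = attachment_scores(GOLD, PREDICTED)
print(scores)  # {'LAS': 0.5, 'UAS': 0.75, 'LA': 0.75}
```

In the toy example three of four heads and three of four labels are correct, but only two tokens have both right, which is why LAS is lower than either UAS or LA.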

Task materials: Detailed guidelines can be found at the dedicated web page.

Data Distribution: Test data are available [4/10/2011]. Training data can be downloaded from the dedicated web page.

Organizers

Felice Dell’Orletta (ILC-CNR, Pisa), Simonetta Montemagni (ILC-CNR, Pisa), Giulia Venturi (ILC-CNR, Pisa), Tommaso Agnoloni (ITTIG-CNR, Firenze), Enrico Francesconi (ITTIG-CNR, Firenze), and Simone Marchi (ILC-CNR, Pisa)