Parsing

The parsing task of EVALITA aims to assess and extend the current state of the art in parsing Italian by encouraging the application of existing models to this language.

As in Evalita 2007, both statistical and rule-based approaches are allowed, and, in order to account for different parsing paradigms, the task is articulated into two tracks: dependency parsing and constituency parsing.

Dependency parsing track

The dependency parsing track is articulated into two subtasks, which differ in the training and test corpora used. The main subtask uses as its development set the Turin University Treebank (TUT), developed by the University of Torino; a second, pilot subtask uses the ISST-CoNLL corpus, jointly developed by the Istituto di Linguistica Computazionale (ILC-CNR) and the University of Pisa (UniPi) and already used for Italian in the multilingual track of the CoNLL 2007 Shared Task on Dependency Parsing. The pilot subtask is optional, but all participants are strongly encouraged to take part in both.

The dependency parsing track will give participants the possibility of testing their parsers across development sets that differ in size, composition, granularity and annotation scheme (both tagsets and annotation criteria). Note, however, that despite these differences the data format is the same for both data sets, namely the format used in the CoNLL shared tasks.
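To make the shared data format concrete, the sketch below reads the ten-column, tab-separated CoNLL dependency format (one token per line, blank line between sentences). The reader function and the toy Italian sentence are illustrative assumptions for this page, not actual TUT or ISST data.

```python
# Minimal reader for the CoNLL-X dependency format: ten tab-separated
# fields per token (ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL
# PHEAD PDEPREL), with a blank line separating sentences.
# The sample sentence and labels below are invented for illustration.

def read_conll(lines):
    """Yield sentences as lists of (id, form, head, deprel) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                 # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if sentence:                     # flush a final sentence with no
        yield sentence               # trailing blank line

sample = (
    "1\tIl\til\tR\tRD\t_\t2\tdet\t_\t_\n"
    "2\tparser\tparser\tS\tS\t_\t3\tsubj\t_\t_\n"
    "3\tfunziona\tfunzionare\tV\tV\t_\t0\tROOT\t_\t_\n"
)

sentences = list(read_conll(sample.splitlines()))
print(sentences)
```

The same reader works unchanged on both subtasks, since only the tagsets and annotation criteria differ, not the column layout.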

There are several reasons for this novelty: on the one hand, the main dependency subtask guarantees comparability of the results with both the previous EVALITA initiative and the parallel constituency parsing track; on the other hand, the pilot subtask creates the prerequisites for comparing EVALITA 2009 results with the state of the art in dependency parsing established at CoNLL 2007.

Last but not least, comparing results achieved on different corpora will yield interesting insights into whether and how the annotation features of a treebank influence parser performance. Starting from this comparison, we are planning to create a larger unified resource for Italian, in which the individual dependency-annotated corpora will be combined and which will hopefully be used in future EVALITA editions.

COLLABORATION WITH PASSAGE

To further extend the possibilities for both cross-format and cross-linguistic comparison, we have started a collaboration on the parsing task with Passage, the evaluation campaign on parsing for French. For the Main Dependency Parsing subtask of Evalita 2009, a portion of the data is therefore shared with Passage, in both the development and the test set (namely 200 sentences for development and 40 for test). The shared data, extracted from the JRC corpus, are annotated in the French version according to the EASY format and have been annotated in the Italian version according to the TUT format especially for Evalita. This has been made possible thanks to the cooperation of Patrick Paroubek from LIMSI.

Further details are available in the guidelines for the parsing task.

Constituency parsing track

The constituency parsing track will consist of a single task based on the Turin University Treebank annotated in a Penn-like format (TUT-Penn), developed by the University of Torino.

The results of Evalita 2007, which showed a larger gap from the state of the art for constituency than for dependency parsing of Italian, confirmed the hypothesis known in the literature that dependency structures are more adequate for representing Italian, even though the task is based on the Penn treebank format, the most widespread and most widely parsed format in the world. The constituency parsing track will give participants the possibility of again testing their parsers across paradigms, since it exploits the same linguistic data as the dependency parsing track, annotated according to a constituency-based format.

The quantitative evaluation of the results will be performed according to standard measures: LAS (labelled attachment score) for the dependency track and the EvalB measures (labelled bracketing scores and crossing brackets) for the constituency track.
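As a reference for the dependency-track scoring, the sketch below computes LAS (and, for comparison, the unlabelled attachment score UAS) over per-token (head, label) pairs. The function and the toy gold/predicted analyses are illustrative assumptions, not the official scoring script.

```python
# LAS: fraction of tokens whose predicted head AND dependency label both
# match the gold annotation. UAS counts head matches only.
# Gold and predicted analyses below are invented for illustration.

def attachment_scores(gold, pred):
    """Return (LAS, UAS) over parallel lists of (head, deprel) pairs."""
    assert len(gold) == len(pred)
    n = len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / n
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    return las, uas

gold = [(2, "det"), (3, "subj"), (0, "ROOT")]
pred = [(2, "det"), (3, "obj"), (0, "ROOT")]   # one label error, heads all correct

las, uas = attachment_scores(gold, pred)
print(las, uas)   # LAS 2/3, UAS 3/3
```

EvalB, used for the constituency track, instead compares the labelled bracket spans of predicted and gold trees, reporting bracketing precision/recall and the number of crossing brackets.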

Task materials

Detailed Guidelines [04/09/2009]

Data Distribution

Training and test data are covered by a Creative Commons “Attribution-NonCommercial-ShareAlike 2.5 Italy” license.

  1. TUT data (for Dependency Parsing Main Task and for Constituency Parsing Track) can be downloaded from the dedicated web page. For any problem please contact: Cristina Bosco, bosco[at]di.unito.it
  2. TANL Dependency annotated corpus (for Dependency Parsing Pilot task) can be downloaded from the dedicated web page. For any problem please contact: Simonetta Montemagni, simonetta.montemagni[at]ilc.cnr.it

Organizers

Dependency parsing main task:

  • Cristina Bosco (Uni. Torino)
  • Alessandro Mazzei (Uni. Torino)
  • Vincenzo Lombardo (Uni. Torino)

Dependency parsing pilot task:

  • Felice dell’Orletta (Uni. Pisa)
  • Alessandro Lenci (Uni. Pisa)
  • Simonetta Montemagni (ILC-CNR, Pisa)

Constituency parsing:

  • Cristina Bosco (Uni. Torino)
  • Alessandro Mazzei (Uni. Torino)
  • Vincenzo Lombardo (Uni. Torino)