Named Entity Recognition on Transcribed Broadcast News

The deadline to submit the runs has been postponed to October 21. In the Named Entity Recognition (NER) task, systems are required to recognize the Named Entities occurring in a text. As in previous editions of EVALITA, we distinguish four types of entities: Person, Organization, Location and Geo-Political Entity (see the EVALITA 2009 annotation report for more details). The novelty introduced for this edition is that the task is based on spoken news broadcasts, kindly provided by the local broadcaster RTTR. More specifically, participants can choose either one or both of the following subtasks:

Subtasks

  • full task: participants will perform both automatic transcription (using an Automatic Speech Recognition system of their choice) and Named Entity Recognition
  • NER only: participants will perform Named Entity Recognition on the automatic transcription provided by the organizers (using a state-of-the-art ASR system)

The final ranking will be based on the F-measure score obtained in Named Entity Recognition (regardless of the performance of the ASR system), computed after an automatic optimal alignment between the manual and the automatic transcription. The performance of the ASR systems used by participants in the full task will be computed, but it will not be considered for the official ranking. After submitting their system results, participants will be provided with the manual transcription of the test data and will be asked to run exactly the same NER systems on it; these results will be used to analyze the impact of transcription errors and NOT to rank the systems.
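
For clarity, the F-measure used for the ranking is the standard harmonic mean of precision and recall, computed over the entities that match between the system output and the gold standard after alignment. The Python sketch below is purely illustrative and is not the official scoring script; in particular, representing entities as (type, start, end) tuples is an assumption made here for the sake of the example:

    # Minimal, illustrative entity-level scoring sketch (NOT the official scorer).
    # Entities are assumed to be comparable as (type, start, end) tuples after the
    # organizers' alignment between manual and automatic transcription.
    def f_measure(gold, predicted):
        gold, predicted = set(gold), set(predicted)
        correct = len(gold & predicted)  # entities with exact type and span match
        precision = correct / len(predicted) if predicted else 0.0
        recall = correct / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # Example: two gold entities, two system entities, one of which is correct.
    print(f_measure({("PER", 3, 4), ("GPE", 10, 10)},
                    {("PER", 3, 4), ("ORG", 7, 8)}))  # (0.5, 0.5, 0.5)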

Test data

The test data will consist of news broadcasts recorded and automatically transcribed. In particular, we will provide:

  • audio files (one for each program)
  • one text file with the automatic transcription of all programs, one token per line, as illustrated below (for the NER only task)
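
Purely as an illustration of the one-token-per-line layout (the actual content, tokenization and casing of the distributed file are not reproduced here), the transcription file might look like:

    il
    sindaco
    di
    trento
    ha
    annunciato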

For training, the following data will be available:

  • news broadcasts manually transcribed and annotated with Named Entities (a hypothetical annotation example is sketched after this list)
  • both automatic transcription and audio files of the same news
  • I-CAB, a corpus of (written) news stories annotated with Named Entities
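
As a purely hypothetical illustration of the manually annotated training material mentioned above (the actual annotation format is defined in the official data documentation and in the EVALITA 2009 annotation report; the IOB2-style column layout shown here is an assumption, not the distributed format):

    mario     B-PER
    rossi     I-PER
    ha        O
    visitato  O
    trento    B-GPE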

Participants can submit two runs for each subtask (for a maximum total of four runs). The first run must be produced according to the ‘closed’ modality: only the data we distribute, and no additional resources, may be used for training and tuning the system. By external resources we mean resources used to acquire knowledge, such as gazetteers, NE dictionaries, ontologies or Wikipedia; the only source of knowledge allowed is the material provided by the organizers (i.e. I-CAB plus the RTTR data). As for tools, complex NLP toolkits (e.g. TextPro, GATE, OpenNLP) are forbidden in the closed run because they embed NE dictionaries, while simple POS taggers or lemmatization tools are allowed. The second run will be produced according to the ‘open’ modality: any type of data can be used, provided it is described in the final report. The ‘closed’ run is compulsory, while the ‘open’ run is optional. Systems with embedded resources, for which it is impossible to produce a run according to the closed modality, may submit only the ‘open’ run, provided that participants contact us in advance explaining the motivation for their request.

Task materials

Data Distribution

Training and test data are available for research purposes upon acceptance of a license agreement:

  • If you work for a non-profit research organization, you can obtain an unlimited Research License

I-CAB is also available as part of the training data:

  • If you work for a non-profit research organization, you can obtain an unlimited Research License

Contact

Manuela Speranza, manspera[at]fbk.eu

Organizers

  • Valentina Bartalesi Lenzi (CELCT, Trento)
  • Manuela Speranza (FBK, Trento)
  • Rachele Sprugnoli (CELCT, Trento)