Speaker Identity Verification (Application & Forensic)

Speaker Identity Verification (SIV) is the automatic process of recognizing the identity of an individual by analyzing his or her voice. The related EVALITA task aims to gather state-of-the-art knowledge and to promote research in this area using Italian language resources.

The 2009 evaluation will focus on two widespread application areas of speaker recognition technology:

  1. Customer authentication (for services such as remote banking, e-commerce, etc.) and other consumer-oriented applications;
  2. Forensic purposes.

To cover these two areas, the task is divided into two tracks: "Application" and "Forensic". Evaluation protocols and data are adapted to the target application of each track.

1) "Application" track

Systems submitted for this track will be evaluated on a "remote authentication by telephone" use case. Given an input speech signal, the system will be required to accept or reject the identity claimed by the speaker. This task is commonly referred to as "Speaker Verification".
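The accept/reject decision described above can be sketched as a comparison between a score and a threshold. This is only a minimal illustration: the `Trial` class, the `decide` function, the identifier "client_01", the scores and the threshold value are all hypothetical and are not prescribed by the evaluation.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    claimed_id: str  # identity the speaker claims (hypothetical identifier)
    score: float     # similarity between the speech signal and the claimed speaker's model


def decide(trial: Trial, threshold: float) -> str:
    """Accept the claimed identity when the score reaches the threshold."""
    return "accept" if trial.score >= threshold else "reject"


# Two made-up access trials against the same claimed identity.
trials = [Trial("client_01", 1.7), Trial("client_01", -0.4)]
decisions = [decide(t, threshold=0.5) for t in trials]
```

How the score is computed and where the threshold is set are left to each participating system; the sketch only shows the shape of the decision.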

Inspired by larger world-wide evaluation campaigns, the proposed task is a first attempt to spread common practices and evaluation protocols for Speaker Verification through the Italian research community. To this end, foreign institutions doing research in this field are strongly invited to participate in EVALITA 2009.

Participants have been provided with 3 sets of data:

  • The "training" (or "enrollment") set. This data, reproducing realistic client enrollment, should be used to build models of all genuine users of the system.
  • The "testing" set. This data set, used to mimic the actual operation of the system, will contain client and impostor access trials. Evaluation scores will be calculated on the testing set, according to realistic "decision cost" models.
  • The "UBM" (Universal Background Model) development set. This additional speech data (from speakers not included in the "enrollment" or "testing" sets) is generally used to train a background speaker-independent model (also known as "world model") for the verification system.
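As a rough illustration of how the three data sets fit together, the sketch below builds a client model from enrollment data, a background model from UBM data, scores test trials with a log-likelihood ratio, and computes a weighted decision cost. It rests on several assumptions: a single 1-D Gaussian stands in for the much richer models real systems use, the feature values are invented, and the cost parameters (`c_miss`, `c_fa`, `p_target`) are placeholders rather than the values adopted by the evaluation.

```python
import math
from statistics import mean, pstdev


def fit_gaussian(values):
    """Fit a 1-D Gaussian (mean, std) to scalar features; a toy stand-in for a real speaker model."""
    return mean(values), max(pstdev(values), 1e-3)


def avg_log_likelihood(values, model):
    """Average per-frame log-likelihood of the values under a 1-D Gaussian."""
    mu, sigma = model
    return mean([
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in values
    ])


def llr_score(values, client_model, ubm):
    """Log-likelihood ratio between the claimed client's model and the background model."""
    return avg_log_likelihood(values, client_model) - avg_log_likelihood(values, ubm)


def detection_cost(client_scores, impostor_scores, threshold,
                   c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Weighted cost of false rejections and false acceptances (placeholder weights)."""
    p_miss = sum(s < threshold for s in client_scores) / len(client_scores)
    p_fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


# Invented toy data: one scalar "feature" per frame.
ubm = fit_gaussian([0.0, 1.0, 2.0, 3.0, 4.0])   # "UBM" development set
client_model = fit_gaussian([3.0, 3.5, 4.0])    # "enrollment" set, one client

# "Testing" set: every trial claims to be the enrolled client.
client_trials = [[3.4, 3.6], [3.1, 3.9]]
impostor_trials = [[0.5, 1.0], [2.0, 2.2]]

client_scores = [llr_score(t, client_model, ubm) for t in client_trials]
impostor_scores = [llr_score(t, client_model, ubm) for t in impostor_trials]
cost = detection_cost(client_scores, impostor_scores, threshold=0.0)
```

With this toy data the client trials score above zero and the impostor trials below it, so a threshold of 0.0 produces no errors; on real telephone speech the two score distributions overlap and the cost depends on where the threshold is placed.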

As system performance is normally affected by the telephone channel used, evaluation data will contain recordings from both fixed and mobile telephone networks. Special focus will be put on "mixed" and "cross-channel" evaluation scenarios.

2) "Forensic" track

Forensic Speaker Identity Verification is characterized by two main points. First, the individuals involved are suspects who usually aim not to be recognized and are therefore unwilling to collaborate. Second, it requires a specific balance of the "decision costs", i.e. between wrong identifications and failed identifications.
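One way to see how the balance of decision costs affects a system is through the standard Bayes decision rule, under which the minimum-expected-cost threshold on a likelihood ratio depends on the two costs and on the prior. The sketch below assumes that a "wrong identification" corresponds to a false acceptance and a "failed identification" to a miss; the function name, the parameter names and the numeric values are hypothetical, and nothing here states which error this track weights more heavily.

```python
import math


def bayes_llr_threshold(c_wrong_id: float, c_failed_id: float, p_same_speaker: float) -> float:
    """Log-likelihood-ratio threshold that minimizes expected cost under the Bayes decision rule.

    Identify the suspect only if the log-likelihood ratio exceeds this value.
    """
    return math.log(
        (c_wrong_id * (1.0 - p_same_speaker)) / (c_failed_id * p_same_speaker)
    )


# Equal costs and an even prior: the threshold sits at zero.
balanced = bayes_llr_threshold(c_wrong_id=1.0, c_failed_id=1.0, p_same_speaker=0.5)

# Illustrative case where a wrong identification is 100 times as costly:
# the threshold rises, so stronger evidence is needed before identifying.
cautious = bayes_llr_threshold(c_wrong_id=100.0, c_failed_id=1.0, p_same_speaker=0.5)
```

Raising the cost of one error type moves the threshold so that this error becomes rarer, at the price of more errors of the other type.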

In this first evaluation campaign, participants applying for this track are allowed to use any of the currently available methods/models (automatic, semiautomatic or manual), as well as new methods or models not yet tested or verified. The aim is to stimulate and test new trends in Forensic SIV by having experts work on the same sound material in a situation reproducing a "classic" forensic case study.

For this track a specific corpus will be made available to all the applicants. The corpus reproduces characteristics and instruments usually found in legal cases.

Participants have been provided with 3 sets of data:

  • the "training" data set will contain clear, normal speech consisting of sentences and spontaneous speech;
  • the "testing" data set will contain two subsets:
    1. the first "testing" data set is composed of identical material (e.g. spontaneous speech, phonetically balanced sentences) recorded in different conditions and through different recording channels (e.g. with speakers inside and outside a car, with voices recorded through mobile and fixed telephone networks, and with speakers in different noise environments such as a street, a classroom or a bus station);
    2. the second "testing" data set will contain an untagged conversation with different voices and different noises in the same recorded file.

