Speech Activity detection and Speaker LOcalization in DOMestic environments (SASLODOM)
The DIRHA project investigates the adoption of distributed microphone networks and related processing for the development of voice-enabled automated home environments based on distant-speech interaction. Its main feature is that microphones are installed in different rooms, which are acoustically coupled with one another since all the doors are open.
In this context, a basic but fundamental front-end task is the detection and localization of speech events generated by users without constraints on their position or orientation within the various rooms.
We invite researchers working in the field of multi-microphone signal processing to develop and test their techniques on the DIRHA corpora.
The scenario addressed in the DIRHA project refers to an apartment monitored by 40 microphones, distributed over the walls and ceiling of its five rooms, as depicted in the figure below. It encompasses situations typical of domestic contexts, in terms of speech input as well as other acoustic events and background noise.
The multi-room speech activity detection and speaker localization task combines traditional speech/non-speech detection with speaker localization. Given the multi-room domestic scenario addressed in the DIRHA project, for each speech event the goal is to:
- provide the corresponding time boundaries,
- determine the room where it was generated,
- derive the spatial coordinates of the speaker.
Any other acoustic event must be ignored. Note that for those not interested in source localization, we have defined a sub-task consisting of only the first two points.
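For illustration, localization front-ends in distributed microphone networks commonly start by estimating the time difference of arrival (TDOA) between microphone pairs. The sketch below shows the standard GCC-PHAT method; it is not part of the official evaluation tools, and all names and parameters are illustrative assumptions.

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the delay (in seconds) of `sig` relative to `ref`
    using the Generalized Cross-Correlation with PHAT weighting."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Synthetic check: a white-noise burst delayed by 20 samples at 16 kHz
np.random.seed(0)
fs = 16000
x = np.random.randn(fs)
y = np.concatenate((np.zeros(20), x))[:fs]
tau = gcc_phat(y, x, fs)              # expected to be close to 20 / 16000 s
```

In a real multi-room system, such pairwise TDOA estimates would be combined across the microphone network to derive the speaker's coordinates and the room of origin.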
The DIRHA consortium created both simulated and real data sets to train and test a variety of signal processing algorithms.
Simulated data were obtained by reverberating close-talking sentences (commands, keywords, phonetic sentences, etc.) with room impulse responses measured in the DIRHA apartment. Several one-minute scenes were generated using a probabilistic framework. Each scene consists of a set of utterances and other acoustic events produced in different rooms and positions. A variety of background noises, typical of the domestic scenario, are added to each scene. The development and test sets each consist of 40 scenes in Italian.
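The contamination process described above can be sketched as a convolution with a measured room impulse response followed by noise addition at a chosen SNR. The snippet below is a minimal illustration with synthetic stand-in signals, not the actual DIRHA generation tools; the function name, toy impulse response, and SNR value are all assumptions.

```python
import numpy as np

def simulate_distant_speech(dry, rir, noise, snr_db):
    """Reverberate a close-talking signal with a room impulse response,
    then add background noise scaled to the target SNR (in dB).
    Illustrative sketch only; not the official DIRHA simulation tool."""
    rev = np.convolve(dry, rir)[: len(dry)]
    noise = noise[: len(rev)]
    p_sig = np.mean(rev ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return rev + gain * noise

np.random.seed(0)
fs = 16000
dry = np.random.randn(fs)                                        # stand-in for a close-talk utterance
rir = np.exp(-np.arange(4000) / 800.0) * np.random.randn(4000)   # toy exponentially decaying RIR
noise = np.random.randn(fs)                                      # stand-in for domestic background noise
scene = simulate_distant_speech(dry, rir, noise, snr_db=10)
```

Repeating this for several utterances, rooms, and microphone positions, and mixing the results on a common timeline, yields a multi-room scene of the kind distributed with the corpora.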
Real data were extracted from Wizard-of-Oz (WOZ) sessions. Each session consisted of a real interaction between a user and the Wizard, with the Wizard's output reproduced through a loudspeaker installed on the ceiling of the room. The training/development set includes 12 scenes, while 10 recordings will be distributed as test material.
Detailed guidelines, evaluation tools, and details about the rooms and microphones are available through the task website and in the documents accompanying the development and test data.
Alessio Brutti (Fondazione Bruno Kessler)
Maurizio Omologo (Fondazione Bruno Kessler)
Mirco Ravanelli (Fondazione Bruno Kessler)