Results of EVALITA 2009

Part-of-Speech Tagging - Open Task

Participants POS TA CPOS TA POS UWTA CPOS UWTA Rank
SemaWiki 2 96.75% 97.03% 94.62% 95.30% 1
SemaWiki 1 96.44% 96.73% 94.27% 95.07% 2
SemaWiki 4 96.38% 96.67% 93.13% 93.81% 3
SemaWiki 3 96.14% 96.42% 92.55% 93.24% 4
Pianta 96.06% 96.36% 92.21% 93.24% 5
Lesmo 95.95% 96.26% 92.33% 93.01% 6
Tamburini 1 95.93% 96.40% 90.95% 92.67% 7
Tamburini 2 95.63% 96.16% 91.07% 92.78% 8

Part-of-Speech Tagging - Closed Task

Participants POS TA CPOS TA POS UWTA CPOS UWTA Rank
Felice_ILC 96.34% 96.91% 91.07% 93.36% 1
Gesmundo 95.85% 96.48% 91.41% 93.81% 2
SemaWiki 2 95.73% 96.52% 90.15% 93.47% 3
SemaWiki 1 95.24% 96.00% 87.40% 90.72% 4
Pianta 93.54% 94.10% 85.45% 87.74% 5
Rigutini 2 93.37% 94.15% 86.03% 88.43% 6
Rigutini 3 93.31% 94.15% 86.03% 88.55% 7
Rigutini 4 93.29% 94.17% 85.34% 88.09% 8
Rigutini 1 93.10% 93.76% 84.54% 87.06% 9
CSTSøgaard 1 91.90% 93.21% 86.03% 89.58% 10
CSTSøgaard 2 91.64% 93.21% 86.14% 89.92% 11

 

Dependency Parsing Track

Dependency parsing MDS: evaluation on the whole test set (240 sentences)

Participants LAS UAS p-value
UniTo Lesmo DPAR 88.73 92.28 0.472
UniPi Attardi DPAR 88.67 92.72 0.0001
FBKirst Lavelli DPAR 86.50 90.96 0.005
UniAmsterdam Sangati DPAR 84.98 89.07 0.0001
UniCopenhagen Soegaard DPAR 80.42 89.05 0.0001
CELI Dini DPAR 68.00 77.95 -
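
LAS and UAS in these tables are presumably the usual labeled and unlabeled attachment scores. As a minimal sketch under that assumption, the two scores can be computed per token as shown here. The function name `attachment_scores` and the toy data are hypothetical and are not taken from the official scorer, which may also treat punctuation differently.

```python
def attachment_scores(gold, pred):
    """Return (LAS, UAS) as percentages.

    gold, pred: aligned lists with one (head_index, relation_label)
    pair per token. This is an illustrative definition, not the
    official EVALITA scorer.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and aligned")
    # UAS counts tokens whose head is correct, ignoring the label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head and label are both correct.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * las_hits / n, 100.0 * uas_hits / n


# Made-up four-token sentence, for illustration only.
gold = [(2, "SUBJ"), (0, "ROOT"), (2, "OBJ"), (3, "MOD")]
pred = [(2, "SUBJ"), (0, "ROOT"), (2, "MOD"), (2, "MOD")]
las, uas = attachment_scores(gold, pred)
print(f"LAS={las:.2f} UAS={uas:.2f}")
```

Because a correct label only counts when the head is also correct, LAS can never exceed UAS for the same system.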

 

Dependency parsing PDS: evaluation on the whole test set (260 sentences)

Participants LAS UAS p-value
UniPi Attardi DPAR 83.38 87.71 0.0001
FBKirst Lavelli DPAR 80.54 84.85 0.0012
UniCopenhagen Soegaard DPAR 78.51 85.81 0.0001
UniTo Lesmo DPAR 73.44 80.80 0.0001
CELI Dini DPAR 57.81 64.10 -

 

Dependency parsing MDS: separate evaluation on the shared test set (100 sentences from newspapers), the civil law test set (100 sentences) and the passage test set (40 sentences)

Participants shared civillaw passage
  LAS UAS LAS UAS LAS UAS
UniPi Attardi DPAR 82.60 95.02 92.63 95.38 90.10 92.90
UniTo Lesmo DPAR 84.68 89.73 91.54 94.64 89.36 91.58
FBKirst Lavelli DPAR 79.91 87.15 90.23 93.33 89.11 91.75
UniAmsterdam Sangati DPAR 76.66 87.99 89.93 95.51 87.87 93.89
UniCopenhagen Soegaard DPAR 72.84 81.93 86.04 90.27 80.94 85.31
CELI Dini DPAR 63.86 70.15 70.74 74.97 68.89 73.35

 

Dependency parsing PDS: separate evaluation on the shared test set (100 sentences from newspapers) and the remaining test corpus (160 sentences)

Participants shared rest
  LAS UAS LAS UAS
UniPi Attardi DPAR 84.67 88.99 82.70 87.04
FBKirst Lavelli DPAR 81.12 85.02 80.24 84.76
UniCopenhagen Soegaard DPAR 78.61 85.26 78.45 86.10
UniTo Lesmo DPAR 75.12 82.58 72.56 79.88
CELI Dini DPAR 60.78 67.07 56.27 62.55

 

Constituency Parsing Track

Constituency parsing: evaluation on the whole test set (200 sentences)

Participants LF LR LP P for LR P for LP
FBKirst Lavelli CPAR 78.73 80.02 77.48 0.1592 0.0021
UniAmsterdam Sangati CPAR 75.79 78.53 73.24 - -

 

Constituency parsing: separate evaluation on the newspaper (100 sentences) and civil law (100 sentences) test sets

Participants newspaper civillaw
  LF LR LP LF LR LP
FBKirst Lavelli CPAR 76.21 76.08 76.34 80.66 83.15 78.33
UniAmsterdam Sangati CPAR 74.33 76.08 72.65 76.93 80.47 73.69

 

Lexical Substitution

Results obtained using the scoring type best

Participants Prec. Rec. F mode P mode R
uniba2 8.16 7.18 7.64 10.58 10.58
baroniCutugnoLenciPucci 6.26 6.01 6.13 11.28 10.84
uniba1 6.80 5.53 6.10 8.90 8.90
uniba3 6.28 5.46 5.84 8.13 8.13
decao3 3.95 3.21 3.54 6.58 6.58
decao2 3.90 3.17 3.50 6.71 6.71
decao1 3.16 3.16 3.16 6.97 6.97
decao4 3.52 2.80 3.12 5.03 5.03
           
baseline psc 10.86 9.06 9.88 13.94 13.94
baseline iwn psc 9.71 8.19 8.89 13.16 13.16
baseline iwn 2.72 1.78 2.15 2.19 2.19

 

Results obtained using the scoring type oot

Participants Prec. Rec. F mode P mode R
uniba2 41.46 36.50 38.82 47.23 47.23
uniba1 37.74 30.69 33.85 34.84 34.84
uniba3 28.54 24.79 26.53 34.58 34.58
decao3 23.48 19.11 21.07 26.58 26.58
decao2 23.00 18.72 20.64 26.32 26.32
decao1 20.09 20.09 20.09 27.74 27.74
decao4 18.62 14.78 16.48 20.52 20.52
baroniCutugnoLenciPucci 16.65 16.00 16.32 24.97 24.00
           
baseline iwn psc 27.52 23.23 25.19 37.24 32.39
baseline psc 23.00 19.20 20.93 26.97 26.97
baseline iwn 14.51 9.51 11.49 12.77 12.77

 

Named Entity Recognition

Systems’ results in terms of F-Measure, Precision and Recall

Participants Overall Overall Overall F1 F1 F1 F1
  FB1 Prec. Rec. GPE LOC ORG PER
FBK_ZanoliPianta 82.00 84.07 80.02 85.13 51.24 70.56 88.31
UniGen_Gesmundo_r2 81.46 86.06 77.33 83.36 50.81 71.08 87.41
UniTN-FBK-RGB_r2 81.09 83.20 79.08 85.25 52.24 69.61 86.69
UniTN-FBK-RGB_r1 80.90 83.05 78.86 85.19 54.62 69.41 86.30
UniTN_Nguyen_r1 79.77 82.26 77.43 82.85 42.34 67.89 86.44
UniTN_Nguyen_r2 79.61 81.65 77.67 82.49 50.85 67.38 86.25
UniGen_Gesmundo_r1 76.21 83.92 69.79 79.07 47.06 64.67 82.04
UniTN_Rigo_r2 74.98 81.08 69.73 75.96 38.32 60.36 83.18
UniTN_Rigo_r1 74.34 80.71 68.91 75.77 31.16 59.87 82.38
UniPI-ILC-CNR_r2 69.67 75.42 64.74 71.42 38.91 58.37 76.38
UniPI-ILC-CNR_r1 67.98 73.65 63.11 71.66 27.45 57.02 73.85
ECNU_Cai 61.03 65.55 57.09 69.25 28.72 51.49 63.49
BASELINE 43.99 42.80 45.25 69.00 37.07 45.54 32.06
BASELINE –u 39.14 40.58 37.80 52.75 28.57 44.23 32.10
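
The overall FB1 column is consistent with the balanced F-measure, that is, the harmonic mean of overall precision and recall. A minimal sketch, assuming that definition (the helper name `f1` is hypothetical), recomputes it for two rows of the table:

```python
def f1(precision, recall):
    """Balanced F-measure (harmonic mean) of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (overall precision, overall recall) copied from the table above.
rows = {
    "FBK_ZanoliPianta": (84.07, 80.02),
    "BASELINE": (42.80, 45.25),
}
for name, (p, r) in rows.items():
    print(f"{name}: FB1 = {f1(p, r):.2f}")
```

Small differences in the last digit are possible, since the published precision and recall are themselves rounded.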

 

Local Entity Detection and Recognition

Percentages for Value, Precision, Recall and F-measure of the participating system

Participants LEDR EMD
  Value Prec. Rec. F Value Prec. Rec. F
FBKirst_UNITN 36.7% 78.5% 61.1% 68.7% 65.7% 78.1% 74.1% 76.1%

 

Textual Entailment

Run Correct Accuracy
FBKirst_run1.txt 285 0.71
FBKirst_run2.txt 282 0.71
ofe_semTypes_1.txt 257 0.64
ofe_semTypes_2.txt 228 0.57
ofe_lexical_2.txt 230 0.58
ofe_lexical_1.txt 225 0.56
FBKirst_run4.txt 202 0.51
FBKirst_run3.txt 199 0.50

 

Connected Digits Recognition

Clean ASR task. Results are ordered by Word Accuracy. In the last column, T means that non-official training was used and L means that the results were delivered late.

Systems Sentence Acc. Word Acc. Words Corr. Err Del+Ins+Sub  
ISTC-SONIC_2 96.44% 99.45% 2360 2353 13 7+6+0  
ISTC-SONIC_1 96.44% 99.45% 2360 2350 13 8+3+2  
ISTC-SPHINX_1 96.16% 99.32% 2360 2352 16 4+8+4  
ABLA-NUANCE 95.89% 99.28% 2360 2345 17 6+2+9 T
ISTC-OGI_1 95.62% 99.19% 2360 2346 19 6+5+8  
ISTC-OGI_2 94.25% 98.94% 2360 2342 25 11+7+7  
ISTC-SPHINX_2 93.70% 98.77% 2360 2345 29 6+14+9  
CEDAT85 89.59% 98.05% 2360 2333 46 5+19+22 T
ABLA-TSPEECH 81.64% 96.06% 2360 2270 93 34+3+56 T
UNINA 18.36% 77.84% 2360 1941 523 116+104+303 L
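
The columns of this table are related by simple arithmetic. As a sketch, assuming the usual definitions Corr = Words - Del - Sub, Err = Del + Ins + Sub and Word Acc. = (Words - Err) / Words, which match the rows above, the derived columns can be recomputed from the error counts. The function name `asr_scores` is hypothetical.

```python
def asr_scores(n_words, deletions, insertions, substitutions):
    """Return (correct words, total errors, word accuracy in percent).

    Assumed definitions, not an official scorer:
      correct  = words - deletions - substitutions
      errors   = deletions + insertions + substitutions
      accuracy = (words - errors) / words
    """
    correct = n_words - deletions - substitutions
    errors = deletions + insertions + substitutions
    word_acc = 100.0 * (n_words - errors) / n_words
    return correct, errors, word_acc


# ISTC-SONIC_2 row: 2360 words, Del+Ins+Sub = 7+6+0.
correct, errors, acc = asr_scores(2360, 7, 6, 0)
print(correct, errors, f"{acc:.2f}%")
```

Insertions lower Word Accuracy but not the Corr. count, which is why two systems with the same accuracy can report different numbers of correct words.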

 

Noisy ASR task. Results are ordered by Word Accuracy. In the last column, T means that non-official training was used.

Participants Sentence Acc. Word Acc. Words Corr. Err Del+Ins+Sub  
ISTC-SONIC_2 87.77% 96.21% 4036 3896 153 104+13+36  
ISTC-SONIC_1 86.45% 95.91% 4036 3882 165 105+11+49  
ISTC-OGI_2 81.82% 93.95% 4036 3821 244 121+29+94  
ISTC-SPHINX_1 79.17% 93.06% 4036 3807 280 136+51+93  
ISTC-OGI_1 81.65% 92.42% 4036 3767 306 135+37+134  
ISTC-SPHINX_2 72.56% 91.63% 4036 3779 338 133+81+124  
CEDAT85 78.02% 91.03% 4036 3710 362 255+36+71 T
ABLA-NUANCE 77.69% 88.65% 4036 3604 458 268+26+164 T
ABLA-TSPEECH 69.09% 82.23% 4036 3375 717 467+56+194 T

 

Spoken Dialogue Systems Evaluation

Dialog level statistics

Participants Duration (sec) Duration (# Turns)
UniNA 145.8±72.7 11.0±5.7
Loquendo 182.2±84.7 18.9±8.9
UniTN 206.4±81.7 24.4±10.1

 

Task durations (#turns: mean±std.dev.) and success rates

Task UniNA Loquendo UniTN
  Duration (turns) Tsr (corr/req) Duration (turns) Tsr (corr/req) Duration (turns) Tsr (corr/req)
Identify representative 1.9 ± 0.4 100.0% (19/19) 2.4 ± 0.8 95.0% (19/20) 3.1 ± 0.5 90.5% (19/21)
Ask customer detail 2.0 ± 0.0 83.3% (5/6) 2.3 ± 0.5 88.9% (8/9) 3.4 ± 1.6 54.5% (12/22)
List orders 2.5 ± 1.5 0.0% (0/8) 2.0 ± 0.0 80.0% (4/5) 3.0 ± 0.0 75.0% (3/4)
Show last order 2.0 ± 0.0 100.0% (1/1) - - - -
List customers 2.0 ± 0.0 50.0% (2/4) 2.0 ± 0.0 0.0% (0/8) 3.0 ± 0.0 66.7% (2/3)
New order 4.6 ± 1.5 36.4% (4/11) 4.3 ± 1.8 42.9% (9/21) 7.5 ± 2.8 63.2% (12/19)
List products by category 3.0 ± 1.0 14.3% (1/7) - - 3.0 ± 0.0 100.0% (3/3)
List products by brand - - - - 3.0 ± 0.0 50.0% (1/2)
List products – other 2.0 ± 0.0 0.0% (0/4) 3.0 ± 0.8 25.0% (2/8) 3.8 ± 1.6 44.4% (4/9)
Search single product 2.3 ± 0.4 55.6% (5/9) 2.8 ± 1.6 77.8% (14/18) 3.5 ± 2.5 78.6% (11/14)
Ask for help 2.0 ± 0.0 100.0% (3/3) - - 2.0 ± 0.0 100.0% (2/2)
Exit application 2.5 ± 0.5 100.0% (5/5) - 0.0% (0/1) 2.4 ± 0.8 25.0% (4/16)
OVERALL (corr/req) - 58.4% (45/77) - 62.2% (56/90) - 63.5% (73/115)
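
The OVERALL row appears to pool the per-task counts rather than average the per-task percentages. A minimal sketch under that assumption (the function name `task_success_rate` is hypothetical) sums the corr/req pairs of the UniNA column:

```python
def task_success_rate(results):
    """Pool (correct, requested) pairs and return (corr, req, Tsr in percent)."""
    corr = sum(c for c, _ in results)
    req = sum(r for _, r in results)
    if req == 0:
        raise ValueError("no task requests")
    return corr, req, 100.0 * corr / req


# (corr, req) pairs for UniNA, copied row by row from the table above;
# tasks with no requests ("-") are left out.
unina = [(19, 19), (5, 6), (0, 8), (1, 1), (2, 4), (4, 11),
         (1, 7), (0, 4), (5, 9), (3, 3), (5, 5)]
corr, req, tsr = task_success_rate(unina)
print(f"{tsr:.1f}% ({corr}/{req})")
```

Pooling weights each task by how often it was requested, so frequent tasks such as "Identify representative" dominate the overall rate.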

 

Speaker Identity Verification

Application Track.

DCF analysis for TS1. Axes are log base 10 of normalized DCF.

[Figure: speaker identity verification 1]

DCF analysis for TS2. Axes are log base 10 of normalized DCF.

[Figure: speaker identity verification 2]

Forensic Track.

For detailed information about the results, please see the report.