The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is not clear how precise these correlation estimates are, nor whether differences between two metrics’ correlations reflect a true difference or if it is due to random chance. In this talk, I will address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, I will analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. In this research, we find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are. Further, although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings. This work is accepted to TACL2021: https://arxiv.org/abs/2104.00054
I will also present an ongoing study that identifies two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate summarization systems in practice and propose changes to rectify this disconnect. The results from these analyses point to the need for future research to focus on developing more consistent and reliable human evaluations of summaries.
This work was done in collaboration with Daniel Deutsch, a PhD student from the Cognitive Computation Group at the Department of Computer and Information Science, University of Pennsylvania.