Date(s) - 26/06/2013
2:00 pm - 3:00 pm
In this talk I will survey the state-of-the-art techniques for multi-document news summarization and for automatic evaluation of summarization. I will outline three open problems with significant implications for future progress in the field. First, I will discuss the ability to control the level of specificity in the summary, an aspect of summarization in which the operation of systems still differs dramatically from that of people performing the same task. Human-authored summaries are more general than the original documents, while machine summaries are markedly more specific, highlighting the need to incorporate measures of specificity into approaches for content selection and sentence compression. Next, I will present two relatively recent findings in summarization evaluation: the best systems perform equally well, with only a few significant differences among them according to manual evaluation, and the combined content of their summaries forms an informative gold standard, such that automatic evaluation can be performed with high accuracy using only that gold standard. Taken together, these findings indicate an untapped potential for system combination in summarization.
Finally, I will present a recent assessment of the accuracy of ROUGE metrics which draws attention to the need for more careful use of automatic evaluation. Using ROUGE to compare systems leads to an incorrect conclusion in one of every three comparisons. Moreover, there is a considerable difference in the ability of different ROUGE variants to support correct inferences about the statistical significance of differences between systems.
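For readers unfamiliar with the metric family discussed above: ROUGE variants score a system summary by its n-gram overlap with human reference summaries. The following is a minimal sketch of ROUGE-N recall, not the official implementation (which adds options such as stemming, stopword removal, and multi-reference aggregation); function and variable names are illustrative.

```python
from collections import Counter

def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    """ROUGE-N recall: the fraction of the reference's n-grams
    that also appear in the candidate summary (clipped counts)."""
    def ngrams(text: str, n: int) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts = ngrams(reference, n)
    cand_counts = ngrams(candidate, n)
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate.
    overlap = sum(min(count, cand_counts[g]) for g, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0
```

A quick usage example: `rouge_n_recall("the cat sat on the mat", "the cat is on the mat")` matches five of the six reference unigrams, giving a recall of about 0.83.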