Stephanie Condon is a senior staff writer for Red Ventures based in Portland, Oregon, covering business technology for ZDNet.

Google on Wednesday announced that it is adding automatically generated summaries to its Workspace tools, so users can quickly get up to speed on their workplace content. The new AI-powered feature will first roll out in Google Docs. "Wouldn't life be better if more things had a TL;DR?" Google CEO Sundar Pichai said during the keynote address at the Google I/O conference, referencing the abbreviation for "too long; didn't read." The new feature would be useful, for example, for someone who has a 25-page document to read ahead of a meeting that starts in just five minutes, Pichai said.
Like many people in the early weeks of a new year, you may well be seeking ways to improve efficiency in your company over the coming 12 months. Time wasted by office workers trawling through emails and documents could be one area to examine. Here's why: a recent report by McKinsey management consultants estimated that employees typically spend 28% of their working week reading and answering email, which works out to about 2.5 hours per day per staffer. Other studies suggest there is an additional "interruption effect": it can take as long as 23 minutes from the moment a staffer hits the send button until they return to their original task. Beyond the distraction of that deluge of emails and documents, the environmental impact of the fossil-fuel energy used by data centers and the internet to ferry emails to their digital destinations is becoming a major concern.
Summarization of speech is a difficult problem due to the spontaneity of the flow, disfluencies, and other issues not usually encountered in written text. Our work presents the first application of the BERTSum model to conversational language. We generate abstractive summaries of narrated instructional videos across a wide variety of topics, from gardening and cooking to software configuration and sports. To enrich the vocabulary, we use transfer learning and pretrain the model on several large cross-domain datasets in both written and spoken English. We also preprocess transcripts to restore sentence segmentation and punctuation in the output of an ASR system. The results are evaluated with ROUGE and Content-F1 scoring on the How2 and WikiHow datasets. We engage human judges to score a set of summaries randomly selected from a dataset curated from HowTo100M and YouTube. Based on blind evaluation, we achieve a level of textual fluency and utility close to that of summaries written by human content creators. The model beats the current state of the art when applied to WikiHow articles that vary widely in style and topic, while showing no performance regression on the canonical CNN/DailyMail dataset. Due to the model's high generalizability across styles and domains, it has great potential to improve the accessibility and discoverability of internet content. We envision this model integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
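For intuition about the ROUGE scoring mentioned above, here is a minimal sketch of ROUGE-1 F1, the unigram-overlap variant. This is an illustrative toy implementation, not the evaluation tooling the authors used; the example sentences are invented.

```python
# Minimal ROUGE-1 F1: F-measure over clipped unigram overlap between a
# candidate summary and a reference summary (illustrative sketch only).
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped matches per unigram
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("plant the seeds in spring",
                      "seeds should be planted in early spring"), 3))
# → 0.5
```

Production evaluations additionally use ROUGE-2 (bigrams) and ROUGE-L (longest common subsequence), with stemming and bootstrap confidence intervals.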
We present ClidSum, a benchmark dataset for building cross-lingual summarization systems on dialogue documents. It consists of 67k+ dialogue documents from two subsets (i.e., SAMSum and MediaSum) and 112k+ annotated summaries in different target languages. Based on ClidSum, we introduce two benchmark settings, for supervised and semi-supervised scenarios respectively. We then build various baseline systems in different paradigms (pipeline and end-to-end) and conduct extensive experiments on ClidSum to provide deeper analyses. Furthermore, we propose mDialBART, which extends mBART-50 (a multilingual BART) via further pre-training. The multiple objectives used in the further pre-training stage help the pre-trained model capture the structural characteristics and important content of dialogues, as well as the transformation from the source to the target language. Experimental results show the superiority of mDialBART: as an end-to-end model, it outperforms strong pipeline models on ClidSum. Finally, we discuss the specific challenges that current approaches face in this task and suggest multiple promising directions for future research. We have released the dataset and code at https://github.com/krystalan/ClidSum.
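The pipeline and end-to-end paradigms contrasted above can be sketched as two call patterns. The `summarize`, `translate`, and `model` callables below are hypothetical stand-ins for trained components, not ClidSum's actual systems; the demo lambdas are toy placeholders.

```python
# Sketch of the two cross-lingual summarization paradigms, assuming
# generic callables in place of real trained models.
from typing import Callable

def pipeline_xls(dialogue: str,
                 summarize: Callable[[str], str],
                 translate: Callable[[str], str]) -> str:
    """Pipeline paradigm: monolingual summarization, then translation.
    Errors from each stage can compound across the pipeline."""
    return translate(summarize(dialogue))

def end_to_end_xls(dialogue: str,
                   model: Callable[[str], str]) -> str:
    """End-to-end paradigm: one model (e.g. mDialBART) maps the
    source-language dialogue directly to a target-language summary."""
    return model(dialogue)

# Toy stand-ins to show the call pattern only:
demo_summarize = lambda text: text.split(".")[0]  # keep first sentence
demo_translate = lambda text: f"[de] {text}"      # fake translation marker
print(pipeline_xls("Alice booked the room. Bob agreed.",
                   demo_summarize, demo_translate))
# → [de] Alice booked the room
```

An end-to-end model avoids the error propagation of the two-stage pipeline, which is one motivation the abstract gives for mDialBART.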
Multi-document summarization is a challenging task for which few large-scale datasets exist. We propose Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references. Our work is inspired by extreme summarization, a dataset-construction protocol that favours abstractive modeling approaches. Descriptive statistics and empirical results, obtained using several state-of-the-art models trained on the Multi-XScience dataset, reveal that Multi-XScience is well suited for abstractive models.