Media
MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents
Wolfson, Tomer, Trivedi, Harsh, Geva, Mor, Goldberg, Yoav, Roth, Dan, Khot, Tushar, Sabharwal, Ashish, Tsarfaty, Reut
Automated agents, powered by large language models (LLMs), are emerging as the go-to tool for querying information. However, evaluation benchmarks for LLM agents rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. To address this gap we introduce MoNaCo, a benchmark of 1,315 natural and time-consuming questions that require dozens, and at times hundreds, of intermediate steps to solve -- far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer real-world time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the limitations of LLM-powered agents in handling the complexity and sheer breadth of real-world information-seeking tasks -- with MoNaCo providing an effective resource for tracking such progress. The MoNaCo benchmark, codebase, prompts and model predictions are all publicly available at: https://tomerwolgithub.github.io/monaco
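As a rough illustration of how an F1 score reflects both low recall and hallucinated answers, the sketch below scores a predicted answer list against a gold list using exact set overlap. This is an assumption made for illustration only: the function name `set_f1` and the exact-match comparison are hypothetical and are not taken from the MoNaCo evaluation code.

```python
def set_f1(predicted, gold):
    """Return (precision, recall, f1) for two answer lists, compared as sets.

    Hypothetical helper: exact string match is assumed here, which may
    differ from the scoring actually used by the benchmark.
    """
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        # Both empty: treat as a perfect match.
        return 1.0, 1.0, 1.0
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Two of four gold answers are found (low recall) and one extra
# answer is made up (a hallucination that lowers precision).
gold = ["Paris", "Rome", "Madrid", "Lisbon"]
predicted = ["Paris", "Rome", "Atlantis"]
precision, recall, f1 = set_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Missing answers reduce recall while invented answers reduce precision, so both failure modes named in the abstract pull the F1 score down.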
Beyond Words: Interjection Classification for Improved Human-Computer Interaction
Goren, Yaniv, Cohen, Yuval, Apartsin, Alexander, Aperstein, Yehudit
In the realm of human-computer interaction, fostering a natural dialogue between humans and machines is paramount. A key, often overlooked, component of this dialogue is the use of interjections such as "mmm" and "hmm". Despite their frequent use to express agreement, hesitation, or requests for information, these interjections are typically dismissed as "non-words" by Automatic Speech Recognition (ASR) engines. Addressing this gap, we introduce a novel task dedicated to interjection classification, the first of its kind to our knowledge. This task is challenging due to the short duration of interjection signals and significant inter- and intra-speaker variability. In this work, we present and publish a dataset of interjection signals collected specifically for interjection classification. We employ this dataset to train and evaluate a baseline deep learning model. To enhance performance, we augment the training dataset using techniques such as tempo and pitch transformation, which significantly improve classification accuracy and make models more robust. The interjection dataset, a Python library for the augmentation pipeline, a baseline model, and evaluation scripts are available to the research community.
Can Media Act as a Soft Regulator of Safe AI Development? A Game Theoretical Analysis
da Fonseca, Henrique Correia, Fernandes, António, Song, Zhao, Cimpeanu, Theodor, Balabanova, Nataliya, Bashir, Adeela, Bova, Paolo, Buscemi, Alessio, Di Stefano, Alessandro, Duong, Manh Hong, Domingos, Elias Fernandez, Ogbo, Ndidi Bianca, Powers, Simon T., Proverbio, Daniele, Shamszaman, Zia Ush, Santos, Fernando P., Han, The Anh, Krellner, Marcus
When developers of artificial intelligence (AI) products need to decide between profit and safety for the users, they likely choose profit. Untrustworthy AI technology must come packaged with tangible negative consequences. Here, we envisage those consequences as the loss of reputation caused by media coverage of their misdeeds, disseminated to the public. We explore whether media coverage has the potential to push AI creators into the production of safe products, enabling widespread adoption of AI technology. We created artificial populations of self-interested creators and users and studied them through the lens of evolutionary game theory. Our results reveal that media is indeed able to foster cooperation between creators and users, but not always. Cooperation does not evolve if the quality of the information provided by the media is not reliable enough, or if the costs of either accessing media or ensuring safety are too high. By shaping public perception and holding developers accountable, media emerges as a powerful soft regulator -- guiding AI safety even in the absence of formal government oversight.
AI video tech fast-tracks humanoid robot training
One of the biggest hurdles in developing humanoid robots is the sheer amount of training data required. Teaching machines to act like humans demands massive video datasets. Collecting that data is expensive, time-consuming and difficult to scale.
It's Always Been Our Meanest Sci-Fi Franchise--and Our Most Honest
Alien: Earth begins where most Alien stories end: with a crew of blue-collar workers realizing that they are, and have always been, doomed. Deemed expendable by their employers over the monsters in the cargo hold (at least the crew of the USCSS Maginot, unlike the Nostromo, knew the monsters were the mission), they are made mortally aware of their place at the bottom of several food chains at once. With the FX show's fifth episode, cheekily titled "In Space, No One …," creator Noah Hawley takes us back to the Maginot's corridors to give viewers a rendition of Alien in miniature, retrofitting the sturdy bones of Ridley Scott's seminal film to his own ends. This may sound like a cynical enterprise, but it's par for the course for Alien. As Slate's own Sam Adams has noted, the series is Hollywood's greatest non-franchise, a collection of films (and comic books and video games) constantly remixing a few primary colors into compelling new shades.
Russia-Ukraine war: List of key events, day 1,287
Russian drone attacks and shelling killed three people and injured five others in Ukraine's Dnipropetrovsk region, Governor Serhiy Lysak wrote on Telegram. Two people were killed in Russian attacks on the Polohivskyi district, as Russian forces launched 578 attacks on 18 settlements in Ukraine's Zaporizhia region, Governor Ivan Fedorov said. Separate Russian attacks also killed one person in Kherson, one person in the Kyiv region and one person in Donetsk, local officials reported, according to the Kyiv Independent news outlet. A Ukrainian drone injured three people in the village of Proletarsky, in Russia's Belgorod region, Governor Vyacheslav Gladkov said. Russian forces seized the Ukrainian settlement of Fedorivka in the Donetsk region, Russian state news agency TASS reported, citing the Russian Ministry of Defence.
MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning
Truong, Quang-Trung, Wong, Yuk-Kwan, Dang, Vo Hoang Kim Tuyen, Gotama, Rinaldi, Nguyen, Duc Thanh, Yeung, Sai-Kit
Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment or to yield insights about marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis, and marine video generation. Additionally, we highlight the effectiveness of video splitting for detecting salient object transitions in scene changes, which significantly enriches the semantics of the captioning content. Our dataset and code have been released at https://msc.hkustvgd.com.
Deep Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions
Kang, Minwoo, Moon, Suhong, Lee, Seung Hyeong, Raj, Ayush, Suh, Joseph, Chan, David M., Canny, John
Large language models (LLMs) are increasingly capable of simulating human behavior, offering cost-effective ways to estimate user responses to various surveys and polls. However, the questions in these surveys usually reflect socially understood attitudes: the patterns of attitudes of old/young, liberal/conservative, as understood by both members and non-members of those groups. It is not clear whether the LLM binding is deep, meaning the LLM answers as a member of a particular in-group would, or shallow, meaning the LLM responds as an out-group member believes an in-group member would. To explore this difference, we use questions that expose known in-group/out-group biases. This level of fidelity is critical for applying LLMs to various political science studies, including timely topics on polarization dynamics, inter-group conflict, and democratic backsliding. To this end, we propose a novel methodology for constructing virtual personas with synthetic user "backstories" generated as extended, multi-turn interview transcripts. This approach is justified by the theory of narrative identity, which argues that personality at the highest level is constructed from self-narratives. Our generated backstories are longer, rich in detail, and consistent in authentically describing a singular individual, compared to previous methods. We show that virtual personas conditioned on our backstories closely replicate human response distributions (up to an 87% improvement as measured by Wasserstein Distance) and produce effect sizes that closely match those observed in the original studies of in-group/out-group biases. Altogether, our work extends the applicability of LLMs beyond estimating socially understood responses, enabling their use in a broader range of human studies.
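To make the Wasserstein Distance comparison concrete, here is a minimal sketch of the 1-D Wasserstein-1 distance between two response distributions over the same ordinal answer scale. It assumes unit spacing between adjacent answer options, and the example numbers are invented for illustration; the function name `wasserstein_1d` is hypothetical and the paper's exact computation may differ.

```python
from itertools import accumulate


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two histograms on the same ordinal scale.

    Assumes adjacent answer options are one unit apart, so the distance
    equals the summed absolute difference between the two CDFs.
    """
    if len(p) != len(q):
        raise ValueError("both distributions need the same number of options")
    total_p, total_q = sum(p), sum(q)
    # Normalise so that raw counts and probabilities are both accepted.
    cdf_p = accumulate(x / total_p for x in p)
    cdf_q = accumulate(x / total_q for x in q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Invented 5-point survey responses (illustrative only).
human = [0.1, 0.2, 0.4, 0.2, 0.1]
persona = [0.2, 0.2, 0.2, 0.2, 0.2]
print(round(wasserstein_1d(human, persona), 3))
```

A smaller value means the simulated responses sit closer to the human distribution, so an improvement corresponds to a reduction in this distance.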
SaRoHead: Detecting Satire in a Multi-Domain Romanian News Headline Dataset
Vîrlan, Mihnea-Alexandru, Smădu, Răzvan-Alexandru, Cercel, Dumitru-Clementin, Pop, Florin, Cercel, Mihaela-Claudia
The primary goal of a news headline is to summarize an event in as few words as possible. Depending on the media outlet, a headline can serve as a means to objectively deliver a summary or improve its visibility. For the latter, specific publications may employ stylistic approaches that incorporate the use of sarcasm, irony, and exaggeration, key elements of a satirical approach. As such, even the headline must reflect the tone of the satirical main content. Current approaches for the Romanian language tend to detect the non-conventional tone (i.e., satire and clickbait) of the news content by combining both the main article and the headline. Because we consider a headline to be merely a brief summary of the main article, we investigate in this paper the presence of satirical tone in headlines alone, testing multiple baselines ranging from standard machine learning algorithms to deep learning models. Our experiments show that Bidirectional Transformer models outperform both standard machine-learning approaches and Large Language Models (LLMs), particularly when the meta-learning Reptile approach is employed.
Not All Data Are Unlearned Equally
Krishnan, Aravind, Reddy, Siva, Mosbach, Marius
Machine unlearning is concerned with the task of removing knowledge learned from particular data points from a trained model. In the context of large language models (LLMs), unlearning has recently received increased attention, particularly for removing knowledge about named entities from models for privacy purposes. While various approaches have been proposed to address the unlearning problem, most existing approaches treat all data points to be unlearned equally, i.e., unlearning that Montreal is a city in Canada is treated exactly the same as unlearning the phone number of the first author of this paper. In this work, we show that this "all data is equal" assumption does not hold for LLM unlearning. We study how the success of unlearning depends on the frequency of the knowledge we want to unlearn in the pre-training data of a model and find that frequency strongly affects unlearning, i.e., more frequent knowledge is harder to unlearn. Additionally, we uncover a misalignment between probability-based and generation-based evaluations of unlearning and show that this problem worsens as models become larger. Overall, our experiments highlight the need for better evaluation practices and novel methods for LLM unlearning that take the training data of models into account.