Media
London AI firm says Getty copyright case poses 'overt threat' to industry
Stability allows users to generate images using text prompts, and its directors include James Cameron, the Oscar-winning film director of Avatar and Titanic. But Getty called the people who were training the AI system "a bunch of tech geeks" and claimed they were indifferent to the problems their innovation might create. Stability countered by alleging that Getty was using "fanciful" legal routes and spending approximately 10m to fight a technology it feared was "an existential threat" to its business. As a result the program, called Stable Diffusion, outputs images with Getty Images watermarks still on them. Getty alleges that Stability was "completely indifferent to what they fed into the training data".
The Download: an inspiring toy robot arm, and why AM radio matters
As a child of an electronic engineer, I spent a lot of time in our local Radio Shack as a kid. While my dad was locating capacitors and resistors, I was in the toy section. It was there, in 1984, that I discovered the best toy of my childhood: the Armatron robotic arm. Described as a "robot-like arm to aid young masterminds in scientific and laboratory experiments," it was a legit robotic arm. And the bold look and function of Armatron made quite an impression on many young kids who would one day have a career in robotics.
Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding
Zaranis, Emmanouil, Farinhas, António, Santos, Saul, Canaverde, Beatriz, Ramos, Miguel Moura, Surikuchi, Aditya K, Viveiros, André, Liao, Baohao, Bueno-Benito, Elena, Sivakumaran, Nithin, Vasylenko, Pavlo, Yu, Shoubin, Sannigrahi, Sonal, Mohammed, Wafaa, Peters, Ben, Villegas, Danae Sánchez, Stengel-Eskin, Elias, Attanasio, Giuseppe, Yoon, Jaehong, Frank, Stella, Suglia, Alessandro, Zerva, Chrysoula, Elliott, Desmond, Dimiccoli, Mariella, Bansal, Mohit, Lanz, Oswald, Bernardi, Raffaella, Fernández, Raquel, Pezzelle, Sandro, Niculae, Vlad, Martins, André F. T.
Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, "needle-in-a-haystack" details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce MF$^2$, a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies (50-170 minutes long). MF$^2$ includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs -- one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Instead of multiple-choice formats, we adopt a binary claim evaluation protocol: for each pair, models must correctly identify both the true and false claims. This reduces biases like answer ordering and enables a more precise assessment of reasoning. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance, underscoring the relative ease of the task for humans and their superior ability to retain and reason over critical narrative information -- an ability current VLMs lack.
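The paired protocol described above can be illustrated with a short scoring sketch. This is not the benchmark's released evaluation code: the function names and the (fact verdict, fib verdict) input layout are assumptions made for illustration. It shows why pair-level scoring is stricter than per-claim scoring: a model that calls every claim true gets half the individual claims right but no pairs.

```python
from typing import List, Tuple

# Each item is (verdict_on_fact, verdict_on_fib); True means the model
# judged that claim to be true. This input layout is an assumption.
Verdicts = List[Tuple[bool, bool]]


def pairwise_accuracy(verdicts: Verdicts) -> float:
    """Fraction of pairs where the fact is accepted AND the fib is rejected."""
    if not verdicts:
        raise ValueError("no claim pairs to score")
    correct = sum(1 for fact_v, fib_v in verdicts if fact_v and not fib_v)
    return correct / len(verdicts)


def per_claim_accuracy(verdicts: Verdicts) -> float:
    """Fraction of individual claims judged correctly, ignoring pairing."""
    if not verdicts:
        raise ValueError("no claim pairs to score")
    correct = sum(int(fact_v) + int(not fib_v) for fact_v, fib_v in verdicts)
    return correct / (2 * len(verdicts))


# Toy verdicts: pairs 1 and 4 are fully correct, pair 2 accepts both
# claims, pair 3 rejects both claims.
toy = [(True, False), (True, True), (False, False), (True, False)]
pair_score = pairwise_accuracy(toy)
claim_score = per_claim_accuracy(toy)
always_true_score = pairwise_accuracy([(True, True)] * 4)
```

On the toy verdicts the per-claim score is 0.75 while the pair-level score is 0.5, and the always-true strategy scores 0.0 at the pair level.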
PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time
Zhang, Weizhi, Zhang, Xinyang, Zhang, Chenwei, Yang, Liangwei, Shang, Jingbo, Wei, Zhepei, Zou, Henry Peng, Huang, Zijie, Wang, Zhengyang, Gao, Yifan, Pan, Xiaoman, Xiong, Lian, Liu, Jingguo, Yu, Philip S., Li, Xian
Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users' varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components: a personalized memory module that includes episodic and semantic memory mechanisms, and a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as a unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test-time user-preference alignment strategy that simulates the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.
Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction
Van Gysel, Christophe, Wu, Maggie, Verwimp, Lyan, Tirkaz, Caglar, Bertola, Marco, Lei, Zhihong, Oualil, Youssef
End-to-end (E2E) Automatic Speech Recognition (ASR) models are trained using paired audio-text samples that are expensive to obtain, since high-quality ground-truth data requires human annotators. Voice search applications, such as digital media players, leverage ASR to allow users to search by voice as opposed to an on-screen keyboard. However, recent or infrequent movie titles may not be sufficiently represented in the E2E ASR system's training data, and hence, may suffer poor recognition. In this paper, we propose a phonetic correction system that consists of (a) a phonetic search based on the ASR model's output that generates phonetic alternatives that may not be considered by the E2E system, and (b) a rescorer component that combines the ASR model recognition and the phonetic alternatives, and selects a final system output. We find that our approach improves word error rate by between 4.4% and 7.6% relative on benchmarks of popular movie titles over a series of competitive baselines.
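To make the "relative" word error rate figure concrete, here is a minimal sketch of how WER and a relative WER reduction are commonly computed. It uses the standard word-level edit distance and is not the authors' evaluation code; the movie-title transcripts are invented toy data and the function names are assumptions.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level WER: total word edits divided by total reference words."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return edits / words


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative improvement: the share of the baseline error that was removed."""
    return (baseline - improved) / baseline


# Toy data (invented): the rescored output fixes the first title only.
refs = ["the godfather part two", "dune"]
asr_only = ["the god father part two", "doom"]
rescored = ["the godfather part two", "doom"]

baseline_wer = wer(refs, asr_only)   # 3 edits over 5 reference words
rescored_wer = wer(refs, rescored)   # 1 edit over 5 reference words
rel = relative_reduction(baseline_wer, rescored_wer)
```

A relative figure is measured against the baseline's own error, so a 4.4% to 7.6% relative improvement means that fraction of the baseline WER is removed, not that WER drops by that many absolute points.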
Generating Grounded Responses to Counter Misinformation via Learning Efficient Fine-Grained Critiques
Xu, Xiaofei, Zhang, Xiuzhen, Deng, Ke
Fake news and misinformation pose a significant threat to society, making efficient mitigation essential. However, manual fact-checking is costly and lacks scalability. Large Language Models (LLMs) offer promise in automating counter-response generation to mitigate misinformation, but a critical challenge lies in their tendency to hallucinate non-factual information. Existing models mainly rely on LLM self-feedback to reduce hallucination, but this approach is computationally expensive. In this paper, we propose MisMitiFact, Misinformation Mitigation grounded in Facts, an efficient framework for generating fact-grounded counter-responses at scale. MisMitiFact generates simple critique feedback to refine LLM outputs, ensuring responses are grounded in evidence. We develop lightweight, fine-grained critique models trained on data sourced from readily available fact-checking sites to identify and correct errors in key elements such as numerals, entities, and topics in LLM generations. Experiments show that MisMitiFact generates counter-responses of comparable quality to LLMs' self-feedback while using significantly smaller critique models. Importantly, it achieves a ~5x increase in feedback generation throughput, making it highly suitable for cost-effective, large-scale misinformation mitigation. Code and LLM prompt templates are at https://github.com/xxfwin/MisMitiFact.
WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction
Emon, Jakaria Islam, Alam, Kazi Tamanna, Salek, Md. Abu
Mean Opinion Score (MOS) prediction for text-to-music systems requires evaluating both overall musical quality (OMQ) and text-prompt alignment (TA). This paper introduces WhisQ, a multimodal architecture that addresses this dual-assessment challenge through sequence-level co-attention and optimal transport regularization. WhisQ employs the Whisper-Base pretrained model for temporal audio encoding and Qwen-3, a 0.6B Small Language Model (SLM), for text encoding, with both maintaining sequence structure for fine-grained cross-modal modeling. The architecture features specialized prediction pathways: OMQ is predicted from pooled audio embeddings, while TA leverages bidirectional sequence co-attention between audio and text. Sinkhorn optimal transport loss further enforces semantic alignment in the shared embedding space. On the MusicEval Track-1 dataset, WhisQ achieves substantial improvements over the baseline: 7% improvement in Spearman correlation for OMQ and 14% for TA. Ablation studies reveal that optimal transport regularization provides the largest performance gain (10% SRCC improvement), demonstrating the importance of explicit cross-modal alignment for text-to-music evaluation.
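For readers unfamiliar with the Sinkhorn optimal transport term mentioned above, the following is a generic, dependency-free sketch of entropy-regularized optimal transport via Sinkhorn iterations. It is not WhisQ's implementation: the cost matrix, the uniform marginals, the regularization strength `eps` and the iteration count are all assumptions chosen for illustration. In a cross-modal setting the cost would typically be a distance between audio and text embeddings.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinkhorn_plan(cost: Matrix, a: List[float], b: List[float],
                  eps: float = 0.1, iters: int = 200) -> Matrix:
    """Entropy-regularized transport plan between marginals a (rows) and b (columns)."""
    n, m = len(a), len(b)
    # Gibbs kernel: low-cost pairings receive high weight.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(plan: Matrix, cost: Matrix) -> float:
    """Total cost of moving mass according to the plan; usable as an alignment loss."""
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))


# Toy case: two "audio" items and two "text" items where matching
# indices are cheap to pair (cost 0) and mismatches are expensive (cost 1).
toy_cost = [[0.0, 1.0], [1.0, 0.0]]
uniform = [0.5, 0.5]
plan = sinkhorn_plan(toy_cost, uniform, uniform)
loss = transport_cost(plan, toy_cost)
```

In the toy case nearly all mass ends up on the diagonal, so the transport cost is close to zero; a poorly aligned embedding space would yield a higher cost, which is what makes this quantity usable as a regularizer.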
DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection
Klemt, Marcel, Segna, Carlotta, Rohrbach, Anna
Generative AI advances rapidly, allowing the creation of very realistic manipulated video and audio. This progress presents a significant security and ethical threat, as malicious users can exploit DeepFake techniques to spread misinformation. Recent DeepFake detection approaches explore the multimodal (audio-video) threat scenario. In particular, there is a lack of reproducibility and critical issues with existing datasets - such as the recently uncovered silence shortcut in the widely used FakeAVCeleb dataset. Considering the importance of this topic, we aim to gain a deeper understanding of the key issues affecting benchmarking in audio-video DeepFake detection. We examine these challenges through the lens of the three core benchmarking pillars: datasets, detection methods, and evaluation protocols. To address these issues, we spotlight the recent DeepSpeak v1 dataset and are the first to propose an evaluation protocol and benchmark it using SOTA models. We introduce SImple Multimodal BAseline (SIMBA), a competitive yet minimalistic approach that enables the exploration of diverse design choices. We also deepen insights into the issue of audio shortcuts and present a promising mitigation strategy. Finally, we analyze and enhance the evaluation scheme on the widely used FakeAVCeleb dataset. Our findings offer a way forward in the complex area of audio-video DeepFake detection.
Combating Misinformation in the Arab World: Challenges & Opportunities
Abouzied, Azza, Alam, Firoj, Ali, Raian, Papotti, Paolo
Misinformation and disinformation pose significant risks globally, with the Arab region facing unique vulnerabilities due to geopolitical instabilities, linguistic diversity, and cultural nuances. We explore these challenges through the key facets of combating misinformation: detection, tracking, mitigation, and community engagement. We shed light on how connecting with grass-roots fact-checking organizations, understanding cultural norms, promoting social correction, and creating strong collaborative information networks can create opportunities for a more resilient information ecosystem in the Arab world.
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
Cai, Zikui, Wang, Andrew, Satheesh, Anirudh, Nakhawa, Ankit, Jae, Hyunwoo, Powell, Keenan, Liu, Minghui, Jay, Neel, Oh, Sungbin, Wang, Xiyao, Liang, Yongyuan, Goldstein, Tom, Huang, Furong
Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills -- including abstract, physical, planning, spatial, and temporal capabilities -- required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics -- enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems -- including Gemini 2.5 Pro and OpenAI o3, which represented the strongest available at the time, alongside strong open-source models -- reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.
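The idea of deterministic, difficulty-controllable generation can be sketched without any rendering library. The example below is purely hypothetical and is not taken from the MORSE-500 generation scripts: `ClipSpec`, `make_counting_clip`, the counting task, and the way `difficulty` maps to distractors and clip length are all invented for illustration. It only produces a scene description and ground-truth answer; a real pipeline would hand such a spec to a renderer.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    """Hypothetical description of one scripted clip and its embedded question."""
    seed: int
    n_targets: int       # objects the question asks about
    n_distractors: int   # irrelevant objects added to the scene
    n_frames: int        # clip length
    question: str
    answer: int


def make_counting_clip(seed: int, difficulty: int) -> ClipSpec:
    """Build a clip spec; the same (seed, difficulty) always yields the same spec."""
    if difficulty < 0:
        raise ValueError("difficulty must be non-negative")
    rng = random.Random(seed)  # local RNG keeps generation deterministic
    n_targets = rng.randint(1, 3 + difficulty)
    return ClipSpec(
        seed=seed,
        n_targets=n_targets,
        n_distractors=2 * difficulty,    # assumed knob: distractor density
        n_frames=30 * (1 + difficulty),  # assumed knob: temporal length
        question="How many red circles appear in the clip?",
        answer=n_targets,
    )


easy = make_counting_clip(seed=7, difficulty=0)
hard = make_counting_clip(seed=7, difficulty=5)
```

Because every instance is a pure function of its seed and difficulty setting, harder variants can be regenerated on demand and the ground-truth answer is known by construction.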