Media
ConDABench: Interactive Evaluation of Language Models for Data Analysis
Dutta, Avik, Gupta, Priyanshu, Hasanbeig, Hosein, Singh, Rahul Pratap, Nigam, Harshit, Gulwani, Sumit, Radhakrishna, Arjun, Soares, Gustavo, Tiwari, Ashish
Real-world data analysis tasks often come with under-specified goals and unclean data. User interaction is necessary to understand and disambiguate a user's intent, and hence, essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first-class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. ConDABench consists of (a) a multi-agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and (c) an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state-of-the-art LLMs on the benchmarks reveals that while the new generation of models are better at solving more instances, they are not necessarily better at solving tasks that require sustained, long-form engagement. ConDABench is an avenue for model builders to measure progress towards truly collaborative models that can complete complex interactive tasks.
Comparing Human and Language Models Sentence Processing Difficulties on Complex Structures
Amouyal, Samuel Joseph, Meltzer-Asscher, Aya, Berant, Jonathan
Large language models (LLMs) that fluently converse with humans are a reality - but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. We collect sentence comprehension data from humans and five families of state-of-the-art LLMs, varying in size and training procedure in a unified experimental framework. Our results show LLMs overall struggle on the target structures, but especially on garden path (GP) sentences. Indeed, while the strongest models achieve near perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when ranking structures based on average performance, rank correlation between humans and models increases with parameter count. For each target structure, we also collect data for their matched baseline without the difficult structure. Comparing performance on the target vs. baseline sentences, the performance gap observed in humans holds for LLMs, with two exceptions: for models that are too weak performance is uniformly low across both sentence types, and for models that are too strong the performance is uniformly high. Together, these reveal convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.
Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline
Li, Haiyang, Wang, Yaxiong, Tang, Shengeng, Wu, Lianwei, Cheng, Lechao, Zhong, Zhun
In recent years, detecting fake multimodal content on social media has drawn increasing attention. Two major forms of deception dominate: human-crafted misinformation (e.g., rumors and misleading posts) and AI-generated content produced by image synthesis models or vision-language models (VLMs). Although both share deceptive intent, they are typically studied in isolation. NLP research focuses on human-written misinformation, while the CV community targets AI-generated artifacts. As a result, existing models are often specialized for only one type of fake content. In real-world scenarios, however, the type of a multimodal post is usually unknown, limiting the effectiveness of such specialized systems. To bridge this gap, we construct the Omnibus Dataset for Multimodal News Deception (OmniFake), a comprehensive benchmark of 127K samples that integrates human-curated misinformation from existing resources with newly synthesized AI-generated examples. Based on this dataset, we propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception. UMFDet leverages a VLM backbone augmented with a Category-aware Mixture-of-Experts (MoE) Adapter to capture category-specific cues, and an attribution chain-of-thought mechanism that provides implicit reasoning guidance for locating salient deceptive signals. Extensive experiments demonstrate that UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines and offering a practical solution for real-world multimodal deception detection.
EviNote-RAG: Enhancing RAG Models via Answer-Supportive Evidence Notes
Dai, Yuqin, Wang, Guoqing, Wang, Yuan, Dou, Kairan, Zhou, Kaichen, Zhang, Zhanwei, Yang, Shuo, Tang, Fei, Yin, Jun, Zeng, Pengyu, Ying, Zhenzhe, Yi, Can, Meng, Changhua, Zhou, Yuchen, Shen, Yongliang, Lu, Shuai
Retrieval-Augmented Generation (RAG) has advanced open-domain question answering by incorporating external information into model reasoning. However, effectively leveraging external information to enhance reasoning presents the following challenges: (1) low signal-to-noise ratio, where answer-supportive external information is diluted by irrelevant material, and (2) error accumulation, which arises in multi-hop reasoning when incomplete or misleading information is incorporated. To address these challenges, we introduce EviNote-RAG, a framework that follows a retrieve-note-answer workflow. Instead of reasoning directly over raw external information, the model first produces Supportive-Evidence Notes (SENs), which concisely preserve answer-critical information and explicitly mark key and uncertainty information to improve accuracy. We further design an entailment-based Evidence Quality Reward (EQR) to ensure that SENs are logically sufficient to derive the final answer, thereby enhancing SENs' quality. Experiments on both in-domain and out-of-domain QA benchmarks show that EviNote-RAG achieves state-of-the-art performance, improving answer accuracy, training stability, robustness, and efficiency. In particular, it yields relative F1 gains of 20% on HotpotQA (+0.093), 40% on Bamboogle (+0.151), and 91% on 2Wiki (+0.256), benefiting from improvements in the reasoning process.
Guillermo del Toro's Frankenstein Is a Lavish Epic Decades in the Making
Sam Fender wins 2025 Mercury Prize for album of the year
Sam Fender has won the 2025 Mercury Prize for his third album, People Watching, a steely-eyed dissection of working-class life in the north of England. The singer looked stunned when his name was announced. "I didn't think that was going to happen at all," he told the BBC as he came off stage. "I've spent the last 10 minutes crying." Fender beat the likes of Pulp and Wolf Alice, both former winners of the £25,000 prize for the best British or Irish album of the year, at a star-studded ceremony in Newcastle's Utilita Arena.
Sharon Osbourne backs naming airport after Ozzy
Sharon Osbourne has said it would be "amazing" if Birmingham Airport was renamed in honour of her late husband, rock legend Ozzy Osbourne. The TV personality has given her support to a campaign to call the airport Ozzy Osbourne International, which was launched by podcaster and comedian Dan Hudson after the Black Sabbath singer died at the age of 76 in July. More than 70,000 people have signed a petition backing the idea, which Hudson said was inspired by airports being named after famous figures such as John Lennon. "It would be amazing," Osbourne said of a potential rebrand. "It's just a dream right now, but sometimes dreams come true."
Green sea turtle no longer Endangered
These gentle, 400-pound giants are splashing back from the brink of extinction. In an ocean conservation victory, green sea turtles have been brought back from the brink of extinction. The International Union for Conservation of Nature (IUCN) reclassified the keystone species from Endangered to Least Concern. The global conservation organization moves species between categories once new data indicates changes in their population, threat levels, or habitat.
Is this why aliens haven't contacted us yet? Extraterrestrials are BORED of trying to find us - and have simply stopped looking, scientist claims
It's one of the biggest unanswered questions in science: if there's life beyond Earth, why hasn't it contacted us yet?