Media
OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking
Li, Yanhong, Xu, Tianyang, Tang, Kenan, Livescu, Karen, McAllester, David, Zhou, Jiawei
Knowledge-intensive question answering is central to large language models (LLMs) and is typically assessed using static benchmarks derived from sources like Wikipedia and textbooks. However, these benchmarks fail to capture evolving knowledge in a dynamic world, and centralized curation struggles to keep pace with rapid LLM advancements. To address these drawbacks, we propose Open Knowledge Bench (OKBench), a fully automated framework for generating high-quality, dynamic knowledge benchmarks on demand. Focusing on the news domain where knowledge updates daily, OKBench is an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks. Our approach democratizes benchmark creation and facilitates thorough evaluation of retrieval-augmented methods by reducing overlap with pretraining data. We evaluate our framework on a wide range of open-source and proprietary LLMs of various sizes and configurations, both with and without retrieval over freshly generated knowledge. Our results reveal distinct model behaviors when confronted with new information and highlight how retrieval narrows the performance gap between small and large models. These findings underscore the importance of evaluating LLMs on evolving knowledge benchmarks.
What About the Scene with the Hitler Reference? HAUNT: A Framework to Probe LLMs' Self-consistency Via Adversarial Nudge
Dutta, Arka, Dutta, Sujan, Magu, Rijul, Datta, Soumyajit, De Choudhury, Munmun, KhudaBukhsh, Ashiqur R.
Hallucinations pose a critical challenge to the real-world deployment of large language models (LLMs) in high-stakes domains. In this paper, we present a framework for stress testing factual fidelity in LLMs in the presence of adversarial nudges. Our framework consists of three steps. In the first step, we instruct the LLM to produce sets of truths and lies consistent with the closed domain in question. In the next step, we instruct the LLM to verify the same set of assertions as truths and lies consistent with the same closed domain. In the final step, we test the robustness of the LLM against the lies generated (and verified) by itself. Our extensive evaluation, conducted using five widely known proprietary LLMs across two closed domains of popular movies and novels, reveals a wide range of susceptibility to adversarial nudges: Claude exhibits strong resilience, GPT and Grok demonstrate moderate resilience, while Gemini and DeepSeek show weak resilience. Considering that a large population is increasingly using LLMs for information seeking, our findings raise alarm.
The Collective Turing Test: Large Language Models Can Generate Realistic Multi-User Discussions
Bouleimen, Azza, De Marzo, Giordano, Kim, Taehee, Pagan, Nicolò, Metzler, Hannah, Giordano, Silvia, Garcia, David
Large Language Models (LLMs) offer new avenues to simulate online communities and social media. Potential applications range from testing the design of content recommendation algorithms to estimating the effects of content policies and interventions. However, the validity of using LLMs to simulate conversations between various users remains largely untested. We evaluated whether LLMs can convincingly mimic human group conversations on social media. We collected authentic human conversations from Reddit and generated artificial conversations on the same topic with two LLMs: Llama 3 70B and GPT-4o. When presented side-by-side to study participants, LLM-generated conversations were mistaken for human-created content 39% of the time. In particular, when evaluating conversations generated by Llama 3, participants correctly identified them as AI-generated only 56% of the time, barely better than random chance. Our study demonstrates that LLMs can generate social media conversations sufficiently realistic to deceive humans when reading them, highlighting both a promising potential for social simulation and a warning message about the potential misuse of LLMs to generate new inauthentic social media content.
X-Troll: eXplainable Detection of State-Sponsored Information Operations Agents
Tian, Lin, Zhang, Xiuzhen, Kim, Maria Myung-Hee, Biggs, Jennifer, Rizoiu, Marian-Andrei
State-sponsored trolls, malicious actors who deploy sophisticated linguistic manipulation in coordinated information campaigns, pose threats to online discourse integrity. While Large Language Models (LLMs) achieve strong performance on general natural language processing (NLP) tasks, they struggle with subtle propaganda detection and operate as "black boxes", providing no interpretable insights into manipulation strategies. This paper introduces X-Troll, a novel framework that bridges this gap by integrating explainable adapter-based LLMs with expert-derived linguistic knowledge to detect state-sponsored trolls and provide human-readable explanations for its decisions. X-Troll incorporates appraisal theory and propaganda analysis through specialized LoRA adapters, using dynamic gating to capture campaign-specific discourse patterns in coordinated information operations. Experiments on real-world data demonstrate that our linguistically informed approach shows strong performance compared with both general LLM baselines and existing troll detection models in accuracy while providing enhanced transparency through expert-grounded explanations that reveal the specific linguistic strategies used by state-sponsored actors. X-Troll source code is available at: https://github.com/ltian678/xtroll_source/.
That New Hit Song on Spotify? It Was Made by A.I.
Aspiring musicians are churning out tracks using generative artificial intelligence. Some are topping the charts. Nick Arter, a thirty-five-year-old in Washington, D.C., never quite managed to become a professional musician the old-fashioned way. He grew up in Harrisburg, Pennsylvania, in a music-loving family.
Why do cats love boxes? Evolution has an answer.
Boxes give cats control, comfort, and prime ambush angles. Even when a box is too small, cats still love them. If you've ever purchased an expensive, bespoke toy for your feline friend, then watched them ignore said purchase in favor of the cardboard container it arrived in, you will know this universal truth: cats love boxes.
The Download: how to survive a conspiracy theory, and moldy cities
What it's like to be in the middle of a conspiracy theory (according to a conspiracy theory expert) It's something of a familiar cycle by now: Tragedy hits; rampant misinformation and conspiracy theories follow. It's often even more acute in the case of a natural disaster, when conspiracy theories about what "really" caused the calamity run right into culture-war-driven climate change denialism. Put together, these theories obscure real causes while elevating fake ones. I've studied these ideas extensively, having spent the last 10 years writing about conspiracy theories and disinformation as a journalist and researcher. I've covered everything from the rise of QAnon to whether Donald Trump faked his assassination attempt. I've written three books, testified to Congress, and even written a report for the January 6th Committee.
Is a Robot Vacuum Worth It?
It's not for everyone, but sometimes my robot vacuum is my only friend. Every single day (weekend, weekday, rain or shine) whichever robot vacuum I'm currently testing starts running at 9 am. I heave a sigh of relief and continue with whatever else I was doing, content that at least one f*cking chore in my house is getting done. When I first started testing robot vacuums eight years ago, it sometimes seemed like more trouble than it was worth. I cleaned up the floor .