- North America > United States > Texas > Tarrant County > Fort Worth (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government > North America Government > United States Government (0.70)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
- Europe (0.14)
- (5 more...)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.46)
What if Readers Like A.I.-Generated Fiction?
Finally, he gave the summaries to his fine-tuned model, and he asked it to compose passages "in the style of Vauhini Vara." Going into all this, I was self-assured, even smug. I'd always felt that my style was original and, more important, that my books were totally distinct from one another. I figured that, even if the A.I. model could imitate my past books, it couldn't predict the style of the novel in progress. So, when Chakrabarty sent me the A.I.-generated imitations, I was genuinely confused.
- South America (0.04)
- North America > United States > New York > Suffolk County > Stony Brook (0.04)
- North America > United States > Michigan (0.04)
- (7 more...)
- Personal (1.00)
- Research Report > New Finding (0.46)
- Media > News (0.46)
- Education > Educational Setting > K-12 Education (0.46)
FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers
Xi, Sarina, Rao, Vishisht, Payan, Justin, Shah, Nihar B.
The identification and localization of errors is a core task in peer review, yet the exponential growth of scientific output has made it increasingly difficult for human reviewers to reliably detect errors given the limited pool of experts. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks, from academic peer review to automated scientific assessment. However, despite the growing use of LLMs in review systems, their capabilities to pinpoint errors remain underexplored. In this work, we introduce Fault Localization Across Writing in Science (FLAWS), an automated benchmark consisting of 713 paper-error pairs designed to evaluate how effectively LLMs detect errors that undermine key claims in research papers. We construct the benchmark by systematically inserting claim-invalidating errors into peer-reviewed papers using LLMs, paired with an automated evaluation metric that measures whether models can identify and localize these errors. Developing such a benchmark presents unique challenges that we overcome: ensuring that the inserted errors are well-defined, challenging, and relevant to the content of the paper, avoiding artifacts that would make identification trivial, and designing a scalable, automated evaluation metric. On the resulting benchmark, we evaluate five frontier LLMs: Claude Sonnet 4.5, DeepSeek Reasoner v3.1, Gemini 2.5 Pro, GPT 5, and Grok 4. Among these, GPT 5 is the top-performing model, achieving 39.1% identification accuracy when k=10, where k is the number of top-ranked error text candidates generated by the LLM.
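The identification accuracy at k described in the abstract above can be illustrated with a minimal sketch. The abstract does not specify the matching rule, so the substring-overlap check here is a hypothetical stand-in for the benchmark's automated metric, and the function names and sample data are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def is_match(candidate: str, error_text: str) -> bool:
    """Hypothetical matching rule (an assumption, not the FLAWS metric):
    case-insensitive substring overlap in either direction."""
    c = candidate.strip().lower()
    e = error_text.strip().lower()
    return bool(c) and bool(e) and (c in e or e in c)


def identification_accuracy_at_k(
    ranked_candidates: Sequence[List[str]],
    error_texts: Sequence[str],
    k: int,
) -> float:
    """Fraction of paper-error pairs where any of the top-k ranked
    candidates matches the inserted error text."""
    if len(ranked_candidates) != len(error_texts):
        raise ValueError("one candidate list is required per paper-error pair")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not error_texts:
        return 0.0
    hits = sum(
        any(is_match(c, gold) for c in candidates[:k])
        for candidates, gold in zip(ranked_candidates, error_texts)
    )
    return hits / len(error_texts)


# Illustrative data only: three paper-error pairs with ranked model outputs.
predictions = [
    ["we use the Adam optimizer", "the sample size was 40"],
    ["results are statistically significant"],
    [],
]
gold_errors = [
    "sample size was 40",
    "learning rate of 0.1",
    "we train for 5 epochs",
]

acc_at_1 = identification_accuracy_at_k(predictions, gold_errors, k=1)
acc_at_2 = identification_accuracy_at_k(predictions, gold_errors, k=2)
print(f"accuracy@1 = {acc_at_1:.3f}, accuracy@2 = {acc_at_2:.3f}")
```

Raising k can only keep the score the same or increase it, because a larger k considers a superset of the candidates. This is why the abstract's reported accuracy is tied to a specific k.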
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (5 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- Asia > Middle East > Jordan (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (5 more...)
Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
Chakrabarty, Tuhin, Ginsburg, Jane C., Dhillon, Paramveer
The use of copyrighted books for training AI models has led to numerous lawsuits from authors concerned about AI's ability to generate derivative content. Yet it is unclear whether these models can generate high-quality literary text while emulating authors' styles. To answer this, we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models (ChatGPT, Claude, and Gemini) in writing excerpts of up to 450 words emulating 50 award-winning authors' diverse styles. In blind pairwise evaluations by 159 representative expert and lay readers, AI-generated text from in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR=0.16, p<10^-8) and writing quality (OR=0.13, p<10^-7) but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors' complete works completely reversed these findings: experts now favored AI-generated text for stylistic fidelity (OR=8.16, p<10^-13) and writing quality (OR=1.87, p=0.010), with lay readers showing similar shifts. These effects generalize across authors and styles. The fine-tuned outputs were rarely flagged as AI-generated (3% rate vs. 97% for in-context prompting) by the best AI detectors. Mediation analysis shows this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliche density) that penalize in-context outputs. While we do not account for additional costs of human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning and inference cost of $81 per author represents a dramatic 99.7% reduction compared to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright's fourth fair-use factor, the "effect upon the potential market or value" of the source works.
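The cost claim in the abstract above gives only the $81 median cost and the 99.7% reduction, not the baseline compensation figure. As a rough sketch, the implied baseline can be back-calculated. The rounding interval below is an assumption (that 99.7% was rounded to one decimal place), and the resulting dollar figures are derived estimates, not numbers reported in the abstract.

```python
def implied_baseline(cost: float, reduction_pct: float) -> float:
    """Baseline cost implied by a cost and a percentage reduction:
    cost = baseline * (1 - reduction), so baseline = cost / (1 - reduction)."""
    remaining = 1.0 - reduction_pct / 100.0
    if remaining <= 0.0:
        raise ValueError("reduction must be below 100%")
    return cost / remaining


median_cost = 81.0  # stated in the abstract (USD per author)

# Point estimate, taking the stated 99.7% at face value.
point = implied_baseline(median_cost, 99.7)

# Assumption: 99.7% is rounded to one decimal, so the true value
# lies somewhere in [99.65%, 99.75%).
low = implied_baseline(median_cost, 99.65)
high = implied_baseline(median_cost, 99.75)

print(f"implied baseline: about ${point:,.0f}")
print(f"range consistent with rounding: ${low:,.0f} to ${high:,.0f}")
```

The wide range shows how sensitive the implied baseline is to the last digit of the percentage, so the point estimate should be read as an order-of-magnitude figure only.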
- North America > United States > Michigan (0.04)
- North America > United States > California (0.04)
- North America > United States > Virginia (0.04)
- (4 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Law > Litigation (1.00)
- Law > Intellectual Property & Technology Law (1.00)
- Government > Regional Government > North America Government > United States Government (0.68)
- Education > Curriculum > Subject-Specific Education (0.46)
Evaluating Multimodal Large Language Models on Core Music Perception Tasks
Carone, Brandon James, Roman, Iran R., Ripollés, Pablo
Multimodal Large Language Models (LLMs) claim "musical understanding" via evaluations that conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Moreover, we separate three sources of variability: (i) perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples (zero- vs. few-shot manipulations), and (iii) reasoning strategies (Standalone, CoT, LogicLM). For the latter we adapt LogicLM, a framework combining LLMs with symbolic solvers to perform structured reasoning, to music. Results reveal a clear perceptual gap: models perform near ceiling on MIDI but show accuracy drops on audio. Reasoning and few-shot prompting offer minimal gains. This is expected for MIDI, where performance reaches saturation, but more surprising for audio, where LogicLM, despite near-perfect MIDI accuracy, remains notably brittle. Among models, Gemini Pro achieves the highest performance across most conditions. Overall, current systems reason well over symbols (MIDI) but do not yet "listen" reliably from audio. Our method and dataset make the perception-reasoning boundary explicit and offer actionable guidance for building robust, audio-first music systems.
- Asia > Middle East > Iran (0.05)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- (2 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding
Jubair, Sheikh, Omayrah, Arwa, Alshammari, Amal, Althnian, Alhanoof, Alothaimen, Abdulhamed, Alzahrani, Norah A., Alzaidi, Shahad D., Al-Twairesh, Nora, Al-Thubaity, Abdulmohsen
Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding. In this paper, we present LC-Eval, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of their performance across different text genres. Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges. Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.
- Asia > China > Beijing > Beijing (0.04)
- Europe > France (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (7 more...)
- Research Report (0.64)
- Questionnaire & Opinion Survey (0.49)
AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance
Marino, Bill, Hunter, Rosco, Jamali, Zubair, Kalpakos, Marinos Emmanouil, Kashyap, Mudra, Hinton, Isaiah, Hanson, Alexa, Nazir, Maahum, Schnabl, Christoph, Steffek, Felix, Wen, Hongkai, Lane, Nicholas D.
As governments move to regulate AI, there is growing interest in using Large Language Models (LLMs) to assess whether or not an AI system complies with a given AI Regulation (AIR). However, there is presently no way to benchmark the performance of LLMs at this task. To fill this void, we introduce AIReg-Bench: the first benchmark dataset designed to test how well LLMs can assess compliance with the EU AI Act (AIA). We created this dataset through a two-step process: (1) by prompting an LLM with carefully structured instructions, we generated 120 technical documentation excerpts (samples), each depicting a fictional, albeit plausible, AI system - of the kind an AI provider might produce to demonstrate their compliance with AIR; (2) legal experts then reviewed and annotated each sample to indicate whether, and in what way, the AI system described therein violates specific Articles of the AIA. The resulting dataset, together with our evaluation of whether frontier LLMs can reproduce the experts' compliance labels, provides a starting point to understand the opportunities and limitations of LLM-based AIR compliance assessment tools and establishes a benchmark against which subsequent LLMs can be compared. The dataset and evaluation code are available at https://github.com/camlsys/aireg-bench.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- (13 more...)
- Overview (0.67)
- Research Report (0.66)
- Workflow (0.48)
- Law (1.00)
- Education (1.00)
- Government > Regional Government > Europe Government (0.46)