AITopics | start

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Grammar-Aligned Decoding

Neural Information Processing Systems Feb-10-2026, 04:06:57 GMT

Specifically, in grammar-constrained decoding (GCD), the LLM's output must follow a given grammar. Our algorithm uses prior sample outputs to soundly overapproximate the future grammaticality of different output prefixes.
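To make the constraint concrete, here is a minimal sketch of grammar-constrained decoding on a toy grammar (balanced parentheses). It is not the paper's algorithm: the names `allowed`, `toy_logits`, and `constrained_greedy` are hypothetical, the "model" is a fixed score table standing in for an LLM, and the length budget `max_len` is an assumption added so the toy always terminates. At each step, tokens that cannot lead to a grammatical completion are masked out and the remaining probabilities are renormalized.

```python
import math

VOCAB = ["(", ")", "<eos>"]


def allowed(prefix, max_len):
    """Tokens that keep `prefix` extendable to a balanced string of at most max_len tokens."""
    depth = prefix.count("(") - prefix.count(")")
    remaining = max_len - len(prefix)
    legal = []
    # Opening costs one token and then requires depth + 1 closing tokens.
    if remaining - 1 >= depth + 1:
        legal.append("(")
    # Closing is only grammatical when something is open.
    if depth > 0 and remaining >= 1:
        legal.append(")")
    # A non-empty balanced prefix may stop.
    if depth == 0 and prefix:
        legal.append("<eos>")
    return legal


def toy_logits(prefix):
    """Hypothetical stand-in for an LLM: it always prefers '(' regardless of context."""
    return {"(": 2.0, ")": 1.0, "<eos>": 0.0}


def constrained_greedy(max_len=6):
    """Greedy decoding where each step is restricted to grammar-legal tokens (needs max_len >= 2)."""
    prefix = []
    while True:
        logits = toy_logits(prefix)
        legal = allowed(prefix, max_len)
        # Mask illegal tokens, then renormalize over the legal ones.
        z = sum(math.exp(logits[t]) for t in legal)
        probs = {t: math.exp(logits[t]) / z for t in legal}
        token = max(probs, key=probs.get)
        if token == "<eos>":
            return "".join(prefix)
        prefix.append(token)


print(constrained_greedy(6))
```

Without the mask, this toy model would emit "(" forever; with it, the output is always a balanced string. The per-step renormalization is purely local and looks only at the current prefix.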

artificial intelligence, large language model, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > Washington > King County > Seattle (0.04)
Asia > Singapore (0.04)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.37)

Supplementary Material

Neural Information Processing Systems Feb-9-2026, 10:54:09 GMT

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), National Research Foundation of Korea (NRF) grant (NRF2020H1D3A2A03100945) and Data Voucher grant (2021-DV-I-P-00114), funded by the Korea government (MSIT). The dataset contains question-SQL pairs if the question is answerable. Are relationships between individual instances made explicit (e.g., users' movie ratings, social network links)? N/A. Are there any errors, sources of noise, or redundancies in the dataset? Question templates are created to have slots that are later filled with pre-defined values and records from the database. EHRSQL is based on patients in MIMIC-III and eICU.
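The slot-filling idea described above can be sketched as follows. This is an illustrative assumption, not the EHRSQL generation code: the template wording, the `admissions` table, its columns, and the helper names `fill_templates` and `run_on_toy_db` are all invented here, and the toy rows do not come from MIMIC-III or eICU.

```python
import sqlite3
from string import Template

# Hypothetical template pair; table and column names are invented for illustration.
QUESTION_TEMPLATE = Template("How many admissions did patient $patient_id have since $year?")
SQL_TEMPLATE = Template(
    "SELECT COUNT(*) FROM admissions "
    "WHERE patient_id = $patient_id AND admit_year >= $year"
)


def fill_templates(slots):
    """Fill the same slot values into the question and its paired SQL query."""
    return {
        "question": QUESTION_TEMPLATE.substitute(slots),
        "sql": SQL_TEMPLATE.substitute(slots),
    }


def run_on_toy_db(sql):
    """Execute the generated SQL against a tiny in-memory table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE admissions (patient_id INTEGER, admit_year INTEGER)")
    conn.executemany(
        "INSERT INTO admissions VALUES (?, ?)",
        [(7, 2101), (7, 2104), (8, 2104)],
    )
    (count,) = conn.execute(sql).fetchone()
    conn.close()
    return count


pair = fill_templates({"patient_id": 7, "year": 2103})
print(pair["question"])
print(pair["sql"], "->", run_on_toy_db(pair["sql"]))
```

String substitution is used here only because the slot values are trusted, pre-defined integers; queries built from user input would normally use bound parameters instead.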

admission, artificial intelligence, datetime, (17 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Israel (0.04)

Industry: Health & Medicine (0.72)

Technology: Information Technology > Artificial Intelligence (1.00)

Opponent Shaping in LLM Agents

Segura, Marta Emili Garcia, Hailes, Stephen, Musolesi, Mirco

arXiv.org Artificial Intelligence Oct-10-2025

Large Language Models (LLMs) are increasingly being deployed as autonomous agents in real-world environments. As these deployments scale, multi-agent interactions become inevitable, making it essential to understand strategic behavior in such systems. A central open question is whether LLM agents, like reinforcement learning agents, can shape the learning dynamics and influence the behavior of others through interaction alone. In this paper, we present the first investigation of opponent shaping (OS) with LLM-based agents. Existing OS algorithms cannot be directly applied to LLMs, as they require higher-order derivatives, face scalability constraints, or depend on architectural components that are absent in transformers. To address this gap, we introduce ShapeLLM, an adaptation of model-free OS methods tailored for transformer-based agents. Using ShapeLLM, we examine whether LLM agents can influence co-players' learning dynamics across diverse game-theoretic environments. We demonstrate that LLM agents can successfully guide opponents toward exploitable equilibria in competitive games (Iterated Prisoner's Dilemma, Matching Pennies, and Chicken) and promote coordination and improve collective welfare in cooperative games (Iterated Stag Hunt and a cooperative version of the Prisoner's Dilemma). Our findings show that LLM agents can both shape and be shaped through interaction, establishing opponent shaping as a key dimension of multi-agent LLM research.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2510.08255

Country:

Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment > Games (0.68)
Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Deconstructing Self-Bias in LLM-generated Translation Benchmarks

Xu, Wenda, Agrawal, Sweta, Zouhar, Vilém, Freitag, Markus, Deutsch, Daniel

arXiv.org Artificial Intelligence Oct-1-2025

As large language models (LLMs) begin to saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) has emerged as a scalable alternative to slow and costly human curation. While these generated test sets have the potential to cheaply rank models, we demonstrate a critical flaw. LLM-generated benchmarks systematically favor the model that created the benchmark: they exhibit self-bias on low-resource-language-to-English translation tasks. We show three key findings on automatic benchmarking of LLMs for translation. First, this bias originates from two sources: the generated test data (LLM-as-a-testset) and the evaluation method (LLM-as-an-evaluator), with their combination amplifying the effect. Second, self-bias in LLM-as-a-benchmark is heavily influenced by the model's generation capabilities in the source language. For instance, we observe more pronounced bias in into-English translation, where the model's generation system is developed, than in out-of-English translation tasks. Third, we observe that low diversity in source text is one contributor to self-bias. Our results suggest that improving the diversity of these generated source texts can mitigate some of the observed self-bias. The rapid advancements in Large Language Models (LLMs) have led to an unprecedented saturation of existing, meticulously human-curated benchmarks. This phenomenon exposes two critical, intertwined challenges: traditional benchmark creation is too laborious and expensive to keep pace with rapid model development, and this challenge is compounded by the inherent difficulty of constructing high-quality benchmarks for low-resource languages, even with human labor, which further strains existing benchmark resources.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2509.266

Country:

South America > Brazil > São Paulo (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom (0.04)
(14 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (1.00)
Banking & Finance > Trading (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

643e347250cf9289e5a2a6c1ed5ee42e-Supplemental-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems Aug-15-2025, 08:48:00 GMT

The following section contains answers to the questions listed in datasheets for datasets. A.1 Motivation For what purpose was the dataset created? Who created the dataset (e.g., which team, research group) and on behalf of which entity? Who funded the creation of the dataset? This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), National Research Foundation of Korea (NRF) grant (NRF-2020H1D3A2A03100945) and Data Voucher grant (2021-DV-I-P-00114), funded by the Korea government (MSIT). A.2 Composition What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? EHRSQL contains natural questions and their corresponding SQL queries (text). How many instances are there in total (of each type, if appropriate)? There are about 24.4K instances (22.5K answerable; 1.9K unanswerable). We conducted a poll at a university hospital and collected a wide range of questions frequently asked on the structured EHR data. What data does each instance consist of? The dataset contains question-SQL pairs if the question is answerable.

admission, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.68)

Industry: Health & Medicine > Health Care Providers & Services (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Fast and Simplex: 2-Simplicial Attention in Triton

Roy, Aurko, Chou, Timothy, Duvvuri, Sai Surya, Chen, Sijia, Yu, Jiecao, Wang, Xiaodong, Zaheer, Manzil, Anil, Rohan

arXiv.org Artificial Intelligence Jul-4-2025

Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic. We quantify these gains by demonstrating that 2-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks compared to dot-product attention.
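As a rough illustration of what "generalizing dot-product attention to trilinear functions" can mean, the sketch below scores each query against pairs of keys with a three-way product and normalizes over all pairs. It is a plain-Python toy, not the Triton kernel: the function names are hypothetical, scaling factors and masking are omitted, and combining the two value vectors with an elementwise product is an assumption made for this example.

```python
import math


def trilinear_logit(q, k1, k2):
    """Trilinear form: sum over feature dimension of q[d] * k1[d] * k2[d]."""
    return sum(a * b * c for a, b, c in zip(q, k1, k2))


def two_simplicial_attention(Q, K1, K2, V1, V2):
    """For each query i, attend over all key pairs (j, k) instead of single keys j."""
    n, d = len(K1), len(V1[0])
    outputs = []
    for q in Q:
        logits = [[trilinear_logit(q, K1[j], K2[k]) for k in range(n)] for j in range(n)]
        # Numerically stable softmax over the whole (j, k) grid.
        m = max(max(row) for row in logits)
        weights = [[math.exp(x - m) for x in row] for row in logits]
        z = sum(sum(row) for row in weights)
        out = [0.0] * d
        for j in range(n):
            for k in range(n):
                w = weights[j][k] / z
                for t in range(d):
                    # Assumed value combination: elementwise product of V1[j] and V2[k].
                    out[t] += w * V1[j][t] * V2[k][t]
        outputs.append(out)
    return outputs
```

Under these assumptions, setting every entry of `K2` and `V2` to 1 makes the logit collapse to the ordinary dot product `q · K1[j]` and the output collapse to standard softmax attention over `V1`, which is one way to see the trilinear form as a generalization. The cost grows with the square of the sequence length per query, which is presumably why an efficient kernel matters.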

dtype, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2507.02754

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > San Mateo County > Menlo Park (0.05)
Asia > Middle East > Jordan (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

START: Self-taught Reasoner with Tools

Li, Chengpeng, Xue, Mingfeng, Zhang, Zhenru, Yang, Jiaxi, Zhang, Beichen, Wang, Xiang, Yu, Bowen, Hui, Binyuan, Lin, Junyang, Liu, Dayiheng

arXiv.org Artificial Intelligence Mar-7-2025

Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the use of long chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of an LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by an LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.

qwq-32b-preview, start, zhang, (17 more...)

arXiv.org Artificial Intelligence

2503.04625

Country:

Europe > Austria > Vienna (0.14)
North America > United States (0.14)

Genre: Research Report (1.00)

Industry:

Education (0.46)
Health & Medicine > Therapeutic Area (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding

Krumdick, Michael, Lovering, Charles, Reddy, Varshini, Ebner, Seth, Tanner, Chris

arXiv.org Artificial Intelligence Mar-6-2025

LLM-as-a-Judge is a framework that uses an LLM (large language model) to evaluate the quality of natural language text, typically text that is also generated by an LLM. This framework holds great promise due to its relatively low cost, ease of use, and strong correlations with human stylistic preferences. However, LLM Judges have been shown to exhibit biases that can distort their judgments. We evaluate how well LLM Judges can grade whether a given response to a conversational question is correct, an ability crucial to soundly estimating the overall response quality. To do so, we create and publicly release a human-annotated dataset with labels of correctness for 1,200 LLM responses. We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis. We demonstrate a strong connection between an LLM's ability to correctly answer a question and its ability to grade responses to that question. Although aggregate-level statistics might imply a judge has high agreement with human annotators, it will struggle on the subset of questions it could not answer. To address this issue, we recommend a simple solution: provide the judge with a correct, human-written reference answer. We perform an in-depth analysis of how reference quality can affect the performance of an LLM Judge. We show that providing a weaker judge (e.g., Qwen 2.5 7B) with higher-quality references reaches better agreement with human annotators than a stronger judge (e.g., GPT-4o) with synthetic references.
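The point about aggregate statistics hiding a weak subset can be shown with a small calculation. The sketch below is only an illustration under stated assumptions: the record fields (`human_label`, `judge_verdict`, `judge_solved_question`) and the eight records are made up for this example and are not taken from the released dataset.

```python
def agreement(records):
    """Fraction of records where the judge's verdict matches the human correctness label."""
    if not records:
        return float("nan")
    matches = sum(r["judge_verdict"] == r["human_label"] for r in records)
    return matches / len(records)


def split_by_judge_ability(records):
    """Separate responses by whether the judge itself answered the underlying question correctly."""
    solved = [r for r in records if r["judge_solved_question"]]
    unsolved = [r for r in records if not r["judge_solved_question"]]
    return solved, unsolved


# Made-up toy data: (human_label, judge_verdict, judge_solved_question).
RAW = [
    (True, True, True), (True, True, True), (True, True, True),
    (False, False, True), (False, False, True), (False, False, True),
    (False, True, False), (True, True, False),
]
RECORDS = [
    {"human_label": h, "judge_verdict": j, "judge_solved_question": s}
    for h, j, s in RAW
]

solved, unsolved = split_by_judge_ability(RECORDS)
print("overall agreement:", agreement(RECORDS))
print("judge could answer:", agreement(solved))
print("judge could not answer:", agreement(unsolved))
```

Reporting agreement separately for the two subsets, rather than only the overall figure, is the kind of breakdown that would surface the failure mode described in the abstract.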

agreement, correctness, evaluation, (16 more...)

arXiv.org Artificial Intelligence

2503.05061

Country:

North America > United States > Massachusetts (0.14)
North America > Mexico > Mexico City (0.14)
Europe > Spain (0.14)
Asia > Thailand (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

AirIO: Learning Inertial Odometry with Enhanced IMU Feature Observability

Qiu, Yuheng, Xu, Can, Chen, Yutian, Zhao, Shibo, Geng, Junyi, Scherer, Sebastian

arXiv.org Artificial Intelligence Jan-26-2025

Inertial odometry (IO) using only Inertial Measurement Units (IMUs) offers a lightweight and cost-effective solution for Unmanned Aerial Vehicle (UAV) applications, yet existing learning-based IO models often fail to generalize to UAVs due to highly dynamic, non-linear flight patterns that differ from pedestrian motion. In this work, we identify that the conventional practice of transforming raw IMU data to global coordinates undermines the observability of critical kinematic information in UAVs. By preserving the body-frame representation, our method achieves substantial performance improvements, with a 66.7% average increase in accuracy across three datasets. Furthermore, explicitly encoding attitude information into the motion network results in an additional 23.8% improvement over prior results. Combined with a data-driven IMU correction model (AirIMU) and an uncertainty-aware Extended Kalman Filter (EKF), our approach ensures robust state estimation under aggressive UAV maneuvers without relying on external sensors or control inputs. Notably, our method also demonstrates strong generalizability to unseen data not included in the training set, underscoring its potential for real-world UAV applications.
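To picture the body-frame versus global-frame distinction, the sketch below rotates one body-frame acceleration into global coordinates for several headings. It is a simplified assumption-laden toy, not the paper's pipeline: it is restricted to planar motion with yaw only, ignores gravity and sensor bias, and the function names are hypothetical.

```python
import math


def yaw_rotation(yaw):
    """2D rotation matrix taking body-frame vectors to the global frame for a given yaw (radians)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s], [s, c]]


def to_global(yaw, a_body):
    """Express a body-frame acceleration in global coordinates."""
    R = yaw_rotation(yaw)
    return [
        R[0][0] * a_body[0] + R[0][1] * a_body[1],
        R[1][0] * a_body[0] + R[1][1] * a_body[1],
    ]


# The same maneuver (accelerating forward) seen under three different headings.
a_body = [1.0, 0.0]
for yaw_deg in (0, 90, 180):
    a_global = to_global(math.radians(yaw_deg), a_body)
    print(yaw_deg, "body:", a_body, "global:", [round(x, 3) for x in a_global])
```

In this toy, the body-frame input is identical for every heading while the global-frame input changes with yaw, so a network fed global-frame data sees one maneuver as several different patterns. Whether this is the mechanism behind the reported gains is for the paper to establish; the sketch only illustrates the two representations.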

artificial intelligence, dataset, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2501.15659

Country: North America > United States > Pennsylvania (0.28)

Genre: Research Report (0.82)

Industry:

Information Technology > Robotics & Automation (0.88)
Aerospace & Defense > Aircraft (0.66)
Materials > Chemicals > Industrial Gases > Liquified Gas (0.46)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines

Wang, ZiDong, Lu, Zeyu, Huang, Di, He, Tong, Liu, Xihui, Ouyang, Wanli, Bai, Lei

arXiv.org Artificial Intelligence Jul-11-2024

In this paper, we introduce PredBench, a benchmark tailored for the holistic evaluation of spatio-temporal prediction networks. Despite significant progress in this field, there remains a lack of a standardized framework for a detailed and comparative analysis of various prediction network architectures. PredBench addresses this gap by conducting large-scale experiments, upholding standardized and appropriate experimental settings, and implementing multi-dimensional evaluations. This benchmark integrates 12 widely adopted methods with 15 diverse datasets across multiple application domains, offering extensive evaluation of contemporary spatio-temporal prediction networks. Through meticulous calibration of prediction settings across various applications, PredBench ensures evaluations relevant to their intended use and enables fair comparisons. Moreover, its multi-dimensional evaluation framework broadens the analysis with a comprehensive set of metrics, providing deep insights into the capabilities of models. The findings from our research offer strategic directions for future developments in the field. Our codebase is available at https://github.com/OpenEarthLab/PredBench.

dataset, prediction, sequence, (13 more...)

arXiv.org Artificial Intelligence

2407.08418

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Hong Kong (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry:

Transportation (0.46)
Automobiles & Trucks (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
