A Clarinetist, a High School Student, and Some Climate Deniers Write a Science Paper

Mother Jones






Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews

Vasu, Sai Suresh Macharla, Sheth, Ivaxi, Wang, Hui-Po, Binkyte, Ruta, Fritz, Mario

arXiv.org Artificial Intelligence

The adoption of large language models (LLMs) is transforming the peer review process, from assisting reviewers in writing more detailed evaluations to generating entire reviews automatically. While these capabilities offer exciting opportunities, they also raise critical concerns about fairness and reliability. In this paper, we investigate bias in LLM-generated peer reviews by conducting controlled experiments on sensitive metadata, including author affiliation and gender. Our analysis consistently shows affiliation bias favoring institutions highly ranked on common academic rankings. Additionally, we find some gender preferences, which, even though subtle in magnitude, have the potential to compound over time. Notably, we uncover implicit biases that become more evident with token-based soft ratings.
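The "token-based soft ratings" mentioned above can be illustrated with a small sketch: instead of taking the model's single most likely rating token, one computes the expectation over the probability distribution the model assigns to the rating tokens. The function and inputs below are hypothetical, not the paper's actual implementation; the idea is only that a continuous expected score can expose small distributional shifts that a hard argmax rating would hide.

```python
import math

def soft_rating(token_logprobs):
    """Expected (soft) review score from an LLM's log-probabilities
    over discrete rating tokens.

    token_logprobs maps rating strings (e.g. "1".."10") to the log-probs
    at the position where the model emits its score.
    """
    # Convert log-probabilities to a normalized distribution (softmax).
    weights = {tok: math.exp(lp) for tok, lp in token_logprobs.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    # Expected value over the numeric ratings: a continuous score that
    # moves even when the argmax rating stays the same.
    return sum(int(tok) * p for tok, p in probs.items())

# Two reviews whose argmax rating is identical (7) can still differ
# in their soft rating if probability mass shifts toward 6 or 8.
score = soft_rating({"6": -1.2, "7": -0.4, "8": -2.0})
```

A bias that never flips the top-ranked rating token can still shift this expectation systematically, which is why the implicit biases become more evident under the soft-rating view.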



AGENTIQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation

Heidari, Omid Reza, Reid, Siobhan, Yaakoubi, Yassine

arXiv.org Artificial Intelligence

LLMs have advanced text-to-SQL generation, yet monolithic architectures struggle with complex reasoning and schema diversity. We propose AGENTIQL, an agent-inspired multi-expert framework that combines a reasoning agent for question decomposition, a coding agent for sub-query generation, and a refinement step for column selection. An adaptive router further balances efficiency and accuracy by selecting between our modular pipeline and a baseline parser. Several steps in the pipeline can be executed in parallel, making the framework scalable to larger workloads. Evaluated on the Spider benchmark, AGENTIQL improves execution accuracy and interpretability and achieves up to 86.07% EX with 14B models using the Planner&Executor merging strategy. The attained performance depends on the efficacy of the routing mechanism, narrowing the gap to GPT-4-based SOTA (89.65% EX) while using much smaller open-source LLMs. Beyond accuracy, AGENTIQL enhances transparency by exposing intermediate reasoning steps, offering a robust, scalable, and interpretable approach to semantic parsing.
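The adaptive router described above can be sketched as a simple dispatch decision: cheap questions go to the baseline parser, complex ones to the multi-expert pipeline. The heuristic below (keyword markers plus schema width) is purely illustrative and not AGENTIQL's actual routing criterion.

```python
def route_query(question: str, schema_tables: int,
                complexity_threshold: int = 2) -> str:
    """Hypothetical adaptive router: send complex questions to the
    multi-expert pipeline and simple ones to a cheaper baseline parser.

    The complexity estimate here is a stand-in heuristic, not the
    paper's learned or engineered routing mechanism.
    """
    markers = ("group by", "for each", "average", "more than", "per ")
    complexity = sum(m in question.lower() for m in markers)
    complexity += schema_tables // 5  # wide schemas favor decomposition
    if complexity >= complexity_threshold:
        return "multi_expert_pipeline"
    return "baseline_parser"
```

The design point is that routing lets the system pay the cost of decomposition and refinement only when the question warrants it, which is why overall accuracy is contingent on how well the router separates the two cases.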



Emergent evaluation hubs in a decentralizing large language model ecosystem

Cebrian, Manuel, Kito, Tomomi, Fernandez, Raul Castro

arXiv.org Artificial Intelligence

Large language models are proliferating, and so are the benchmarks that serve as their common yardsticks. We ask how the agglomeration patterns of these two layers compare: do they evolve in tandem or diverge? Drawing on two curated proxies for the ecosystem, the Stanford Foundation-Model Ecosystem Graph and the Evidently AI benchmark registry, we find complementary but contrasting dynamics. Model creation has broadened across countries and organizations and diversified in modality, licensing, and access. Benchmark influence, by contrast, displays centralizing patterns: in the inferred benchmark-author-institution network, the top 15% of nodes account for over 80% of high-betweenness paths, three countries produce 83% of benchmark outputs, and the global Gini for inferred benchmark authority reaches 0.89. An agent-based simulation highlights three mechanisms: higher entry of new benchmarks reduces concentration; rapid inflows can temporarily complicate coordination in evaluation; and stronger penalties against over-fitting have limited effect. Taken together, these results suggest that concentrated benchmark influence functions as coordination infrastructure that supports standardization, comparability, and reproducibility amid rising heterogeneity in model production, while also introducing trade-offs such as path dependence, selective visibility, and diminishing discriminative power as leaderboards saturate.
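The Gini coefficient cited above (0.89 for inferred benchmark authority) is a standard concentration measure; a minimal implementation, using the mean-absolute-difference form over a sorted distribution, looks like this. The input here is toy data, not the paper's dataset.

```python
def gini(values):
    """Gini coefficient of a non-negative distribution:
    0 = perfectly equal, approaching 1 = fully concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # Closed form over sorted values with 1-based rank i:
    # G = (2 * sum(i * x_i)) / (n * sum(x)) - (n + 1) / n
    rank_weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * rank_weighted) / (n * total) - (n + 1) / n

# A few dominant hubs among many minor contributors push the
# coefficient toward 1, the centralizing pattern described above.
```

Values near 0.89 indicate that benchmark authority is distributed about as unevenly as income in the most unequal national economies, which is what motivates reading the hubs as coordination infrastructure rather than incidental clustering.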



Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus

Alzetta, Chiara, Auriemma, Serena, Bondielli, Alessandro, Dini, Luca, Fazzone, Chiara, Miaschi, Alessio, Miliani, Martina, Sartor, Marta

arXiv.org Artificial Intelligence

Over the past decade, Computational Linguistics (CL) and Natural Language Processing (NLP) have evolved rapidly, especially with the advent of Transformer-based Large Language Models (LLMs). This shift has transformed research goals and priorities, from Lexical and Semantic Resources to Language Modelling and Multimodality. In this study, we track the research trends of the Italian CL and NLP community through an analysis of the contributions to CLiC-it, arguably the leading Italian conference in the field. We compile the proceedings from the first 10 editions of the CLiC-it conference (from 2014 to 2024) into the CLiC-it Corpus, providing a comprehensive analysis of both its metadata, including author provenance, gender, affiliations, and more, as well as the content of the papers themselves, which address various topics. Our goal is to provide the Italian and international research communities with valuable insights into emerging trends and key developments over time, supporting informed decisions and future directions in the field.
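The trend analysis described above (topic shifts across ten editions) reduces to aggregating per-paper metadata by year. A minimal sketch, with a hypothetical `(year, topic)` record format standing in for the corpus's richer metadata:

```python
from collections import Counter, defaultdict

def topic_trends(papers):
    """Share of each topic within each conference edition.

    `papers` is an iterable of (year, topic) pairs, a simplified
    stand-in for the per-paper metadata in the CLiC-it Corpus.
    """
    by_year = defaultdict(Counter)
    for year, topic in papers:
        by_year[year][topic] += 1
    # Normalize counts to within-edition shares so editions of
    # different sizes can be compared directly.
    return {
        year: {t: c / sum(counts.values()) for t, c in counts.items()}
        for year, counts in by_year.items()
    }
```

Comparing the resulting shares between early and late editions is what surfaces shifts such as the one from Lexical and Semantic Resources toward Language Modelling and Multimodality.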