Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies

Angermeir, Florian, Amougou, Maximilian, Kreitz, Mark, Bauer, Andreas, Linhuber, Matthias, Fucci, Davide, C., Fabiola Moyón, Mendez, Daniel, Gorschek, Tony

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have attracted remarkable interest in industry and academia. The growing interest in LLMs in academia is also reflected in the number of publications on the topic in recent years: 78 of the roughly 425 publications at ICSE 2024 alone performed experiments with LLMs. Conducting empirical studies with LLMs remains challenging and raises questions, for both researchers and practitioners, about how to achieve reproducible results. One important step towards excelling in empirical research on LLMs and their applications is to first understand to what extent current research results are reproducible and what factors may impede reproducibility. This investigation is the scope of our work. We contribute an analysis of the reproducibility of LLM-centric studies, provide insights into the factors impeding reproducibility, and discuss suggestions for improving the current state. In particular, we studied the 85 articles describing LLM-centric studies published at ICSE 2024 and ASE 2024. Of the 85 articles, 18 provided research artefacts and used OpenAI models. We attempted to replicate those 18 studies; only five were sufficiently complete and executable. For none of the five studies were we able to fully reproduce the results: two appeared partially reproducible, and three did not appear reproducible at all. Our results highlight the need not only for stricter research-artefact evaluations but also for more robust study designs to ensure the reproducibility of future publications.


A Derivation of Backdoor Adjustment

Neural Information Processing Systems

Given a causal graph G, we use the following rule of the do-calculus. Rule 1 (action/observation exchange): P(y | do(x), do(z)) = P(y | do(x), z), if (Y ⊥ Z | X) holds in the subgraph of G obtained by removing edges into X and edges out of Z. We derive the variational context adjustment in Eq. Figure 8 gives an intuitive explanation of how CaseQ generalizes to novel contexts. The Kronecker product is a generalization of the outer product from vectors to matrices. In this way, we build links between majority (seen) contexts and minority (unseen) contexts. We provide an example in Figure 1 for further illustration.
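To make the Kronecker-product remark above concrete, here is a minimal NumPy sketch (illustrative only; the matrices and variable names are not from the paper):

```python
import numpy as np

# For 2x2 matrices A and B, the Kronecker product A ⊗ B is the 4x4
# block matrix whose (i, j) block is A[i, j] * B.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

K = np.kron(A, B)
print(K.shape)  # (4, 4)

# For vectors a and b, np.kron(a, b) lists all pairwise products
# a[i] * b[j], i.e. the flattened outer product -- the sense in which
# the Kronecker product generalizes the outer product to matrices.
a = np.array([1, 2])
b = np.array([3, 4])
print(np.kron(a, b))  # [3 4 6 8]
```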


#ICML2025 outstanding position paper: Interview with Jaeho Kim on addressing the problems with conference reviewing

AIHub

At this year's International Conference on Machine Learning (ICML 2025), Jaeho Kim, Yunseok Lee and Seulki Lee won an outstanding position paper award for their work Position: The AI Conference Peer Review Crisis Demands Author Feedback and Reviewer Rewards. We hear from Jaeho about the problems they were trying to address, and their proposed author feedback mechanism and reviewer reward system. Our position paper addresses the problems plaguing current AI conference peer review systems, while also raising questions about the future direction of peer review. The most pressing problem with the current peer review system at AI conferences is the exponential growth in paper submissions, driven by increasing interest in AI. To put this in numbers, NeurIPS received over 30,000 submissions this year, while ICLR saw a 59.8% increase in submissions in just one year.



Introducing Answered with Evidence -- a framework for evaluating whether LLM responses to biomedical questions are founded in evidence

Baldwin, Julian D, Dinh, Christina, Mukerji, Arjun, Sanghavi, Neil, Gombar, Saurabh

arXiv.org Artificial Intelligence

The growing use of large language models (LLMs) for biomedical question answering raises concerns about the accuracy and evidentiary support of their responses. To address this, we present Answered with Evidence, a framework for evaluating whether LLM-generated answers are grounded in scientific literature. We analyzed thousands of physician-submitted questions using a comparative pipeline that included: (1) Alexandria, formerly known as the Atropos Evidence Library, a retrieval-augmented generation (RAG) system based on novel observational studies, and (2) two PubMed-based retrieval-augmented systems (System and Perplexity). We found that PubMed-based systems provided evidence-supported answers for approximately 44% of questions, while the novel evidence source did so for about 50%. Combined, these sources enabled reliable answers to over 70% of biomedical queries. As LLMs become increasingly capable of summarizing scientific content, maximizing their value will require systems that can accurately retrieve both published and custom-generated evidence--or generate such evidence in real time.
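The combined-coverage claim above (two partially overlapping evidence sources answering more questions together than either alone) can be sketched with a toy computation. Everything here is hypothetical: the questions and the per-source support flags are made up, not data from the paper.

```python
# Toy illustration of combining two evidence sources: a question counts
# as "answered with evidence" if at least one source supports it.
questions = ["q1", "q2", "q3", "q4", "q5"]

# Hypothetical per-question flags: did each source return an
# evidence-supported answer?
pubmed_supported = {"q1": True, "q2": False, "q3": True, "q4": False, "q5": False}
custom_supported = {"q1": True, "q2": True, "q3": False, "q4": True, "q5": False}

def coverage(supported):
    """Fraction of questions with an evidence-supported answer."""
    return sum(supported[q] for q in questions) / len(questions)

# Union of the two sources: supported if either source supports it.
combined = {q: pubmed_supported[q] or custom_supported[q] for q in questions}

print(coverage(pubmed_supported))  # 0.4
print(coverage(custom_supported))  # 0.6
print(coverage(combined))          # 0.8
```

Because the sources support partly different questions, the union covers more than either alone, mirroring the paper's observation that roughly 44% and 50% individual coverage combine to over 70%.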


Entrepreneur 'humiliated' after London Tech Week turns her and baby away

The Guardian

An entrepreneur has told how she was left feeling "humiliated" after being turned away from London Tech Week, an annual corporate event, because she was with her baby daughter. Davina Schonle was prevented from entering the event on Monday after travelling for three hours with her eight-month-old and had to cancel meetings with potential suppliers to her tech startup. Schonle told TheBusinessDesk.com that as she went to the entrance with her daughter in her pram: "I was asked if I was a VIP. I was then told I wasn't allowed in with a baby. I went to get my badge, but was then taken over to the organisers from Informa, who told me they weren't insured. But they asked again if I was a VIP or speaker, and later another lady came over and twisted my badge around to see, clearly checking to see if I was a VIP."


Rotten Tomatoes further dilutes its utility with 'Verified Hot' badge

Engadget

Rotten Tomatoes just added a new "Verified Hot" badge, indicating an overall positive user score; it joins the "Certified Fresh" badge for critic scores. To qualify for the designation, a movie or show needs a Verified Audience Score of 90 percent or higher. At the other end, any show or movie that falls beneath 60 percent will be slapped with a "Stale" badge. Rotten Tomatoes is trying to get around review bombing here by mandating that user reviews come from people who actually saw the movie in question. There are a couple of little problems with this. It verifies that a consumer saw the movie via the ticketing firm Fandango, and there are plenty of other ways to buy a ticket, including, you know, the theater cashier.


Meta changes its labels for AI-generated images after complaints from photographers

Engadget

Meta is updating its "Made with AI" labels after widespread complaints from photographers that the company was mistakenly flagging non-AI-generated content. In an update, the company said that it will change the wording to "AI info" because the current labels "weren't always aligned with people's expectations and didn't always provide enough context." The company introduced the "Made with AI" labels earlier this year after criticism from the Oversight Board about its "manipulated media" policy. Meta said that, like many of its peers, it would rely on "industry standard" signals to determine when generative AI had been used to create an image. However, it wasn't long before photographers began noticing that Facebook and Instagram were applying the badge on images that hadn't actually been created with AI.


How gamification took over the world

MIT Technology Review

For some, this phenomenon leads to an interest in flow states and immersion. For others, it's simply a reason to play more games. For a handful of consultants, startup gurus, and game designers in the late 2000s, it became the key to unlocking our true human potential. In her 2010 TED Talk, "Gaming Can Make a Better World," the game designer Jane McGonigal called this engaged state "blissful productivity." "There's a reason why the average World of Warcraft gamer plays for 22 hours a week," she said.


Composite Active Learning: Towards Multi-Domain Active Learning with Theoretical Guarantees

Hao, Guang-Yuan, Huang, Hengguan, Wang, Haotian, Gao, Jie, Wang, Hao

arXiv.org Artificial Intelligence

Active learning (AL) aims to improve model performance within a fixed labeling budget by choosing the most informative data points to label. Existing AL focuses on the single-domain setting, where all data come from the same domain (e.g., the same dataset). However, many real-world tasks often involve multiple domains. For example, in visual recognition, it is often desirable to train an image classifier that works across different environments (e.g., different backgrounds), where images from each environment constitute one domain. Such a multi-domain AL setting is challenging for prior methods because they (1) ignore the similarity among different domains when assigning labeling budget and (2) fail to handle distribution shift of data across different domains. In this paper, we propose the first general method, dubbed composite active learning (CAL), for multi-domain AL. Our approach explicitly considers the domain-level and instance-level information in the problem; CAL first assigns domain-level budgets according to domain-level importance, which is estimated by optimizing an upper error bound that we develop; with the domain-level budgets, CAL then leverages a certain instance-level query strategy to select samples to label from each domain. Our theoretical analysis shows that our method achieves a better error bound compared to current AL methods. Our empirical results demonstrate that our approach significantly outperforms the state-of-the-art AL methods on both synthetic and real-world multi-domain datasets. Code is available at https://github.com/Wang-ML-Lab/multi-domain-active-learning.
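The two-stage idea described above (domain-level budget allocation, then instance-level querying within each domain) can be sketched as follows. This is a hedged illustration, not CAL itself: the importance weights are fixed stand-ins for the bound-derived weights, and the query strategy is plain entropy-based uncertainty sampling on synthetic scores.

```python
import numpy as np

rng = np.random.default_rng(0)

total_budget = 10
# Stand-in for domain-level importance weights (CAL derives these by
# optimizing an error bound; here they are just fixed numbers).
domain_importance = np.array([5.0, 3.0, 2.0])

# Stage 1: split the labeling budget across domains in proportion to
# importance, using largest-remainder rounding so budgets sum exactly.
raw = domain_importance / domain_importance.sum() * total_budget
budgets = np.floor(raw).astype(int)
remainder = total_budget - budgets.sum()
order = np.argsort(raw - budgets)[::-1]
budgets[order[:remainder]] += 1
print(budgets)  # [5 3 2]

# Stage 2: within each domain, query the unlabeled samples with the
# highest predictive uncertainty (binary-entropy of fake probabilities).
for d, budget in enumerate(budgets):
    probs = rng.uniform(0.05, 0.95, size=20)  # fake P(y=1) per sample
    entropy = -(probs * np.log(probs) + (1 - probs) * np.log(1 - probs))
    queried = np.argsort(entropy)[::-1][:budget]  # indices to label
    print(d, sorted(queried.tolist()))
```

The rounding step matters in practice: naive proportional allocation can over- or under-spend the budget by a few labels when weights do not divide evenly.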