


Invisible Load: Uncovering the Challenges of Neurodivergent Women in Software Engineering

Zaib, Munazza, Wang, Wei, Hidellaarachchi, Dulaji, Siddiqui, Isma Farah

arXiv.org Artificial Intelligence

Neurodivergent women in Software Engineering (SE) encounter distinctive challenges at the intersection of gender bias and neurological differences. To the best of our knowledge, no prior work in SE research has systematically examined this group, despite increasing recognition of neurodiversity in the workplace. Underdiagnosis, masking, and male-centric workplace cultures continue to exacerbate barriers that contribute to stress, burnout, and attrition. In response, we propose a hybrid methodological approach that integrates InclusiveMag's inclusivity framework with the GenderMag walkthrough process, tailored to the context of neurodivergent women in SE. The overarching design unfolds across three stages: scoping through a literature review, deriving personas and analytic processes, and applying the method in collaborative workshops. We present a targeted literature review that synthesizes the challenges neurodivergent women face in SE into cognitive, social, organizational, structural, and career-progression categories, including how under- and late diagnosis and masking intensify exclusion. These findings lay the groundwork for subsequent stages that will develop and apply inclusive analytic methods to support actionable change.


Interview with Frida Hartman: Studying bias in AI-based recruitment tools

AIHub

In a new series of interviews, we're meeting some of the PhD students that were selected to take part in the Doctoral Consortium at the European Conference on Artificial Intelligence (ECAI-2025). In the second interview of the series, we caught up with Frida Hartman to find out how her PhD is going so far, and her plans for the next steps in her investigations. Frida, along with co-authors Mario Mirabile and Michele Dusi, was also the winner of the ECAI-2025 Diversity & Inclusion Competition, for work entitled . This award was presented at the closing ceremony of the conference. Could you start by giving us a quick introduction to yourself and the topic that you're working on?


Artificial Intelligence and Accounting Research: A Framework and Agenda

Stratopoulos, Theophanis C., Wang, Victor Xiaoqi

arXiv.org Artificial Intelligence

Recent advances in artificial intelligence, particularly generative AI (GenAI) and large language models (LLMs), are fundamentally transforming accounting research, creating both opportunities and competitive threats for scholars. This paper proposes a framework that classifies AI-accounting research along two dimensions: research focus (accounting-centric versus AI-centric) and methodological approach (AI-based versus traditional methods). We apply this framework to papers from the IJAIS special issue and recent AI-accounting research published in leading accounting journals to map existing studies and identify research opportunities. Using this same framework, we analyze how accounting researchers can leverage their expertise through strategic positioning and collaboration, revealing where accounting scholars' strengths create the most value. We further examine how GenAI and LLMs transform the research process itself, comparing the capabilities of human researchers and AI agents across the entire research workflow. This analysis reveals that while GenAI democratizes certain research capabilities, it simultaneously intensifies competition by raising expectations for higher-order contributions where human judgment, creativity, and theoretical depth remain valuable. These shifts call for reforming doctoral education to cultivate comparative advantages while building AI fluency.



Interview with Mario Mirabile: trust in multi-agent systems

AIHub

In a new series of interviews, we're meeting some of the PhD students that were selected to take part in the Doctoral Consortium at the European Conference on Artificial Intelligence (ECAI 2025). During the conference in Bologna, we caught up with Mario Mirabile, who is studying for his PhD in trustworthy AI and multi-agent systems at the University of Santiago de Compostela and is a Research Fellow in human-AI interaction at the University of Bologna. Mario, along with co-authors Frida Hartman and Michele Dusi, was also the winner of the ECAI-2025 Diversity & Inclusion Competition, for work entitled . This award was presented at the closing ceremony of the conference. Could you start by giving us an introduction to the topic you are working on?


LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews

Madeyski, Lech, Kitchenham, Barbara, Shepperd, Martin

arXiv.org Artificial Intelligence

Context: Large language models (LLMs) are released faster than users' ability to evaluate them rigorously. When LLMs underpin research, such as identifying relevant literature for systematic reviews (SRs), robust empirical assessment is essential. Objective: We identify and discuss key challenges in assessing LLM performance for selecting relevant literature, identify good (evaluation) practices, and propose recommendations. Method: Using a recent large-scale study as an example, we identify problems with the use of traditional metrics for assessing the performance of GenAI tools for identifying relevant literature in SRs. We analyzed 27 additional papers investigating this issue, extracted the performance metrics, and found both good practices and widespread problems, especially with the use and reporting of performance metrics for SR screening. Results: Major weaknesses included: i) a failure to use metrics that are robust to imbalanced data and directly indicate whether results are better than chance (e.g., the use of Accuracy), ii) a failure to consider the impact of lost evidence when making claims concerning workload savings, and iii) a pervasive failure to report the full confusion matrix (or performance metrics from which it can be reconstructed), which is essential for future meta-analyses. On the positive side, we extract good (evaluation) practices on which our recommendations for researchers and practitioners, as well as policymakers, are built. Conclusions: SR screening evaluations should prioritize lost evidence/recall alongside the chance-anchored, cost-sensitive Weighted MCC (WMCC) metric, report complete confusion matrices, treat unclassifiable outputs as referred-back positives for assessment, adopt leakage-aware designs with non-LLM baselines and open artifacts, and ground conclusions in cost-benefit analysis where false negatives (FNs) carry higher penalties than false positives (FPs).
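The abstract's point about Accuracy versus chance-anchored metrics can be illustrated with a small numerical sketch. The confusion-matrix counts below are invented for illustration (they are not from the paper, and this uses the standard MCC rather than the paper's WMCC variant): on imbalanced SR screening data, a screener that misses most relevant papers can still post a high Accuracy, while MCC stays near zero for chance-level agreement and exposes the weakness.

```python
# Illustrative comparison of Accuracy vs. MCC on imbalanced screening data.
# Counts are hypothetical: 40 relevant papers among 1000 candidates.
import math

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def mcc(tp, fp, fn, tn):
    # Matthews correlation coefficient: 0 = chance level, 1 = perfect.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# A screener that excludes almost everything: only 5 of 40 relevant
# papers found (35 false negatives = lost evidence), 950 true negatives.
tp, fp, fn, tn = 5, 10, 35, 950

print(f"Accuracy: {accuracy(tp, fp, fn, tn):.3f}")   # 0.955 -- looks strong
print(f"MCC:      {mcc(tp, fp, fn, tn):.3f}")        # ~0.185 -- weak agreement
print(f"Recall:   {tp / (tp + fn):.3f}")             # 0.125 -- most evidence lost
```

This is why the authors argue for reporting the full confusion matrix: recall (lost evidence) and chance-anchored metrics can always be recomputed from it, while a headline Accuracy figure cannot be unpacked.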


Appendix overview

Neural Information Processing Systems

We further focus on measuring the overlap between datasets. We check for overlap at the level of systematic reviews, based on each review's Cochrane ID.
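The ID-based overlap check described above amounts to a set intersection. A minimal sketch, using hypothetical Cochrane IDs and dataset names (not taken from the appendix):

```python
# Hypothetical Cochrane review IDs for two screening datasets.
dataset_a = {"CD001234", "CD005678", "CD009012"}
dataset_b = {"CD005678", "CD009012", "CD003456"}

# Reviews present in both datasets, and a Jaccard similarity for scale.
overlap = dataset_a & dataset_b
jaccard = len(overlap) / len(dataset_a | dataset_b)

print(sorted(overlap))            # shared reviews
print(f"Jaccard: {jaccard:.2f}")  # 0.50 for these toy sets
```

Matching on a stable identifier like the Cochrane ID avoids false matches from title variations between dataset releases.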



Quality Assurance of LLM-generated Code: Addressing Non-Functional Quality Characteristics

Sun, Xin, Ståhl, Daniel, Sandahl, Kristian, Kessler, Christoph

arXiv.org Artificial Intelligence

In recent years, LLMs have been widely integrated into software engineering workflows, supporting tasks like code generation. However, while these models often generate functionally correct outputs, we still lack a systematic understanding and evaluation of their non-functional qualities. Existing studies focus mainly on whether generated code passes the tests rather than on whether it passes with quality. Guided by the ISO/IEC 25010 quality model, this study conducted three complementary investigations: a systematic review of 108 papers, two industry workshops with practitioners from multiple organizations, and an empirical analysis of patching real-world software issues using three LLMs. Motivated by insights from both the literature and practitioners, the empirical study examined the quality of generated patches on security, maintainability, and performance efficiency. Across the literature, we found that security and performance efficiency dominate academic attention, while maintainability and other qualities are understudied. In contrast, industry experts prioritize maintainability and readability, warning that generated code may accelerate the accumulation of technical debt. In our evaluation of functionally correct patches generated by three LLMs, improvements in one quality dimension often come at the cost of others. Runtime and memory results further show high variance across models and optimization strategies. Overall, our findings reveal a mismatch between academic focus, industry priorities, and model performance, highlighting the urgent need to integrate quality assurance mechanisms into LLM code generation pipelines to ensure that future generated code not only passes tests but truly passes with quality.