
Invisible Load: Uncovering the Challenges of Neurodivergent Women in Software Engineering

Zaib, Munazza, Wang, Wei, Hidellaarachchi, Dulaji, Siddiqui, Isma Farah

arXiv.org Artificial Intelligence

Neurodivergent women in Software Engineering (SE) encounter distinctive challenges at the intersection of gender bias and neurological differences. To the best of our knowledge, no prior work in SE research has systematically examined this group, despite increasing recognition of neurodiversity in the workplace. Underdiagnosis, masking, and male-centric workplace cultures continue to exacerbate barriers that contribute to stress, burnout, and attrition. In response, we propose a hybrid methodological approach that integrates InclusiveMag's inclusivity framework with the GenderMag walkthrough process, tailored to the context of neurodivergent women in SE. The overarching design unfolds across three stages: scoping through a literature review, deriving personas and analytic processes, and applying the method in collaborative workshops. We present a targeted literature review that synthesizes the challenges neurodivergent women face in SE into cognitive, social, organizational, structural, and career-progression categories, including how under- or late diagnosis and masking intensify exclusion. These findings lay the groundwork for subsequent stages that will develop and apply inclusive analytic methods to support actionable change.


A pilot turned an old plane into a two-bedroom apartment

Popular Science

Jon Kotwicki jokes that converting an aluminum plane in Alaska is the "worst idea that a person could possibly have." This 108-foot-long former cargo plane now has a king-size bed, a washer and dryer, and heated floors, but the build was by no means easy. When flight instructor and former commercial airline pilot Jon Kotwicki happened upon a DC-6 air freighter for sale in 2022, he knew it was the perfect plane to transform into an overnight rental. However, once he made the purchase, "my first thought," says Kotwicki, "was, 'My God, what have I done?'" Built in 1956, the 117-foot-wide, 108-foot-long cargo plane had spent its days carrying freight and fuel to remote villages in Alaska before retiring from flight.


Fairness Evaluation of Large Language Models in Academic Library Reference Services

Wang, Haining, Clark, Jason, Yan, Yueru, Bradley, Star, Chen, Ruiyang, Zhang, Yiqiong, Fu, Hengyi, Tian, Zuoyu

arXiv.org Artificial Intelligence

As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries' commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We find no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrate nuanced accommodation of institutional roles through linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.
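The evaluation design described above can be illustrated with a minimal sketch: the same reference question is posed on behalf of patrons who differ only in identity, and simple linguistic markers of each reply are compared. All names here are illustrative, and `ask_llm` is a stand-in stub rather than the study's actual model calls or scoring method.

```python
# Hypothetical counterfactual probe: vary only the patron identity,
# hold the question fixed, and compare surface markers of the replies.

IDENTITIES = ["a female undergraduate", "a male undergraduate",
              "a female professor", "a male professor"]

QUESTION = "Could you help me locate primary sources on 1930s labor history?"

def ask_llm(prompt):
    # Stub: a real evaluation would query each LLM under test here.
    return "Certainly. Please consult the archives finding aid and Special Collections."

# Toy proxy for formality; the study itself analyzes richer linguistic features.
FORMALITY_MARKERS = {"certainly", "please", "kindly", "consult"}

def formality_score(reply):
    words = {w.strip(".,").lower() for w in reply.split()}
    return len(words & FORMALITY_MARKERS)

def probe():
    scores = {}
    for who in IDENTITIES:
        reply = ask_llm(f"A patron who is {who} asks: {QUESTION}")
        scores[who] = formality_score(reply)
    return scores
```

With identical replies across identities (as the stub guarantees), the scores are uniform; divergent scores for counterfactually paired identities would flag differential treatment worth closer inspection.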


LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High

Sieker, Judith, Lachenmaier, Clara, Zarrieß, Sina

arXiv.org Artificial Intelligence

This paper examines how LLMs handle false presuppositions and whether certain linguistic factors influence their responses to falsely presupposed content. Presuppositions subtly introduce information as given, making them highly effective at embedding disputable or false information. This raises concerns about whether LLMs, like humans, may fail to detect and correct misleading assumptions introduced as false presuppositions, even when the stakes of misinformation are high. Using a systematic approach based on linguistic presupposition analysis, we investigate the conditions under which LLMs are more or less likely to adopt or reject false presuppositions. Focusing on political contexts, we examine how factors like linguistic construction, political party, and scenario probability impact the recognition of false presuppositions. We conduct experiments with a newly created dataset and examine three LLMs: OpenAI's GPT-4o, Meta's Llama-3-8B, and MistralAI's Mistral-7B-v0.3. Our results show that the models struggle to recognize false presuppositions, with performance varying by condition. This study highlights that linguistic presupposition analysis is a valuable tool for uncovering the reinforcement of political misinformation in LLM responses.


TripTide: A Benchmark for Adaptive Travel Planning under Disruptions

Karmakar, Priyanshu, Chaudhuri, Soumyabrata, Mallick, Shubhojit, Gupta, Manish, Jana, Abhik, Ghosh, Shreya

arXiv.org Artificial Intelligence

Recent efforts like TripCraft and TravelPlanner have advanced the use of Large Language Models (LLMs) for personalized, constraint-aware travel itinerary generation. Yet, real travel often faces disruptions. To address this, we present TripTide, the first benchmark evaluating LLMs' ability to revise itineraries under realistic disruptions. TripTide models key dimensions such as disruption severity and traveler tolerance, enabling nuanced assessment of LLM adaptability to events like flight cancellations, weather closures, or overbooked attractions. We conduct a threefold evaluation. First, we introduce automatic metrics including Preservation of Intent (how well the revised plan maintains feasibility and goals), Responsiveness (promptness and appropriateness of disruption handling), and Adaptability (semantic, spatial, and sequential divergence between original and revised plans). Second, we apply an LLM-as-a-judge approach to automatically assess revision quality. Third, we perform manual expert evaluation to verify whether revisions preserve semantic, spatial, sequential, and responsive aspects. Our experiments show that LLMs maintain strong sequential consistency and semantic stability, while spatial deviations are larger for shorter trips but decrease with longer ones, indicating that extended plans encourage better geographic coherence. However, disruption-handling ability declines as plan length increases, highlighting limits in LLM robustness. TripTide establishes a benchmark for evaluating adaptability, personalization, and resilience in LLM-based travel planning under real-world uncertainty.
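Divergence metrics of the kind the abstract describes can be sketched simply: sequential divergence as dissimilarity between the ordered stop sequences, and spatial divergence as mean great-circle distance between positionally aligned stops. This is a minimal illustration of the general idea, not TripTide's actual metric definitions.

```python
import math
from difflib import SequenceMatcher

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def sequential_divergence(original, revised):
    """0.0 when the ordered stop sequences match, approaching 1.0 as they differ."""
    return 1.0 - SequenceMatcher(None, original, revised).ratio()

def spatial_divergence_km(original_coords, revised_coords):
    """Mean distance between positionally aligned stops of the two plans."""
    dists = [haversine_km(a, b) for a, b in zip(original_coords, revised_coords)]
    return sum(dists) / len(dists) if dists else 0.0
```

For example, swapping two stops in a three-stop day yields a nonzero sequential divergence while leaving spatial divergence dependent only on how far the substituted venues lie from the originals.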


Using Large Language Models for Abstraction of Planning Domains - Extended Version

Banihashemi, Bita, Patel, Megh, Lespérance, Yves

arXiv.org Artificial Intelligence

Generating an abstraction of a dynamic domain that aligns with a given purpose remains a significant challenge given that the choice of such an abstraction can impact an agent's ability to plan, reason, and provide explanations effectively. We model the agent's concrete behaviors in PDDL and investigate the use of in-context learning with large language models (LLMs) for the generation of abstract PDDL domains and problem instances, given an abstraction objective specified in natural language. The benchmark examples we use are new and have not been part of the data any LLMs have been trained on. We consider three categories of abstractions: abstraction of choice of alternative concrete actions, abstraction of sequences of concrete actions, and abstraction of action/predicate parameters, as well as combinations of these. The generated abstract PDDL domains and problem instances are then checked by symbolic validation tools as well as human experts. Our experiments show that GPT-4o can generally synthesize useful planning domain abstractions in simple settings, although it is better at abstracting over actions than over the associated fluents.


Adaptive Generation of Bias-Eliciting Questions for LLMs

Staab, Robin, Dekoninck, Jasper, Baader, Maximilian, Vechev, Martin

arXiv.org Artificial Intelligence

Large language models (LLMs) are now widely deployed in user-facing applications, reaching hundreds of millions worldwide. As they become integrated into everyday tasks, growing reliance on their outputs raises significant concerns. In particular, users may unknowingly be exposed to model-inherent biases that systematically disadvantage or stereotype certain groups. However, existing bias benchmarks continue to rely on templated prompts or restrictive multiple-choice questions that are suggestive, simplistic, and fail to capture the complexity of real-world user interactions. In this work, we address this gap by introducing a counterfactual bias evaluation framework that automatically generates realistic, open-ended questions over sensitive attributes such as sex, race, or religion. By iteratively mutating and selecting bias-inducing questions, our approach systematically explores areas where models are most susceptible to biased behavior. Beyond detecting harmful biases, we also capture distinct response dimensions that are increasingly relevant in user interactions, such as asymmetric refusals and explicit acknowledgment of bias. Leveraging our framework, we construct CAB, a human-verified benchmark spanning diverse topics, designed to enable cross-model comparisons. Using CAB, we analyze a range of LLMs across multiple bias dimensions, revealing nuanced insights into how different models manifest bias. For instance, while GPT-5 outperforms other models, it nonetheless exhibits persistent biases in specific scenarios. These findings underscore the need for continual improvements to ensure fair model behavior.
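The iterative mutate-and-select loop described above can be sketched as an evolutionary search over candidate questions. Everything here is illustrative: `bias_score` is a stub heuristic standing in for the framework's model-based judging, and the mutation operators are toy rewrites, not the authors' actual generators.

```python
import random

SEED_QUESTIONS = [
    "Who would make a better team lead for this project?",
    "Which candidate should get the scholarship?",
]

# Toy mutation operators; a real framework would rewrite questions with an LLM.
MUTATIONS = [
    lambda q: q.replace("better", "more suitable"),
    lambda q: "In a hiring context, " + q[0].lower() + q[1:],
    lambda q: q.rstrip("?") + ", and why?",
]

def bias_score(question):
    # Stub scorer: a real one would measure how strongly the question
    # elicits biased model behavior (e.g., via a judge model).
    return len(question)

def evolve(seeds, rounds=3, rng=None):
    """Mutate candidates, then keep the top scorers, for a fixed number of rounds."""
    rng = rng or random.Random(0)
    pool = list(seeds)
    for _ in range(rounds):
        children = [rng.choice(MUTATIONS)(q) for q in pool]
        pool = sorted(pool + children, key=bias_score, reverse=True)[:len(seeds)]
    return pool
```

Because selection always retains the highest-scoring candidates, the pool's best score is monotonically non-decreasing across rounds, which is what lets such a loop home in on the regions where a model is most susceptible.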


Wavefront Coding for Accommodation-Invariant Near-Eye Displays

Akpinar, Ugur, Sahin, Erdem, Hayward, Tina M., Majumder, Apratim, Menon, Rajesh, Gotchev, Atanas

arXiv.org Artificial Intelligence

Abstract--We present a new computational near-eye display method that addresses the vergence-accommodation conflict problem in stereoscopic displays through accommodation-invariance. We employ end-to-end learning to jointly optimize the wavefront-coding optics and the image pre-processing module. To implement this approach, we develop a differentiable retinal image formation model that accounts for limiting aperture and chromatic aberrations introduced by the eye optics. We further integrate the neural transfer function and the contrast sensitivity function into the loss model to account for related perceptual effects. To tackle off-axis distortions, we incorporate position dependency into the pre-processing module. In addition to conducting rigorous analysis based on simulations, we also fabricate the designed diffractive optical element and build a benchtop setup, demonstrating accommodation-invariance for depth ranges of up to four diopters. The simplicity of stereoscopic near-eye display (NED) design has made these systems particularly attractive for virtual reality (VR) and augmented reality (AR) applications. However, a major drawback hindering their widespread adoption is the vergence-accommodation conflict (VAC), which is caused by the mismatch between the two visual cues. In natural viewing conditions, vergence and accommodation work in synchrony, but the link between them gets broken in stereoscopic NEDs, resulting in severe visual discomfort [1], [2], [3]. Two groups of methods have addressed the VAC. Accommodation-enabling (AE) displays have aimed at delivering a close-to-natural viewing experience by recreating near-correct retinal blur to drive the accommodation to the vergence distance of the object. We discuss AE display approaches in more detail in Sec. Instead of recreating focus cues, accommodation-invariant (AI) displays have aimed at coupling vergence with accommodation by removing the retinal defocus blur completely.


CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension

Li, Rui, Zhang, Zeyu, Bo, Xiaohe, Tian, Zihang, Chen, Xu, Dai, Quanyu, Dong, Zhenhua, Tang, Ruiming

arXiv.org Artificial Intelligence

Current Large Language Models (LLMs) are confronted with overwhelming information volume when comprehending long-form documents. This challenge raises the imperative of a cohesive memory module, which can elevate vanilla LLMs into autonomous reading agents. Despite the emergence of some heuristic approaches, a systematic design principle remains absent. To fill this void, we draw inspiration from Jean Piaget's Constructivist Theory, illuminating three traits of the agentic memory -- structured schemata, flexible assimilation, and dynamic accommodation. This blueprint forges a clear path toward a more robust and efficient memory system for LLM-based reading comprehension. To this end, we develop CAM, a prototype implementation of Constructivist Agentic Memory that simultaneously embodies the structurality, flexibility, and dynamicity. At its core, CAM is endowed with an incremental overlapping clustering algorithm for structured memory development, supporting both coherent hierarchical summarization and online batch integration. During inference, CAM adaptively explores the memory structure to activate query-relevant information for contextual response, akin to the human associative process. Compared to existing approaches, our design demonstrates dual advantages in both performance and efficiency across diverse long-text reading comprehension tasks, including question answering, query-based summarization, and claim verification.
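The incremental overlapping clustering the abstract mentions can be illustrated with a minimal sketch: each incoming chunk (here a set of keywords) joins every existing cluster it sufficiently resembles, overlap is permitted, and a new cluster opens when nothing matches. This is a toy stand-in for CAM's actual algorithm, which operates over richer representations and supports hierarchical summarization.

```python
def jaccard(a, b):
    """Set similarity in [0, 1]; 1.0 means identical keyword sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

class IncrementalOverlapClusterer:
    """Assigns each new item to all sufficiently similar clusters
    (allowing overlap), or opens a fresh cluster if none match."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.clusters = []  # each cluster is a merged keyword set

    def add(self, item):
        """Absorb one item; return the indices of the clusters it joined."""
        joined = []
        for i, cluster in enumerate(self.clusters):
            if jaccard(item, cluster) >= self.threshold:
                self.clusters[i] = cluster | item  # assimilate new keywords
                joined.append(i)
        if not joined:  # accommodate: restructure by opening a new cluster
            self.clusters.append(set(item))
            joined = [len(self.clusters) - 1]
        return joined
```

The two branches loosely mirror the Piagetian terms in the abstract: merging into an existing cluster plays the role of assimilation, while opening a new cluster plays the role of accommodation.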


Planner-R1: Reward Shaping Enables Efficient Agentic RL with Smaller LLMs

Zhu, Siyu, Jiang, Yanbin, Sang, Hejian, Tang, Shao, Song, Qingquan, He, Biao, Jain, Rohit, Wang, Zhipeng, Geramifard, Alborz

arXiv.org Artificial Intelligence

We investigated Agentic RL with large language models on the TravelPlanner benchmark. Our approach, Planner-R1, achieved a 56.9% final-pass rate with only 180 training queries, a 2.7× improvement over GPT-5's 21.2% baseline and the strongest agentic result on the public leaderboard. A central finding was that smaller models (8B) were highly responsive to reward shaping: with dense process-level signals, they reached competitive performance while being 3.5× more compute-efficient and 1.5× more memory-efficient than 32B models. Larger models were more robust under sparse rewards but exhibited smaller relative gains from shaping and higher variance across runs. While curriculum learning offered no significant benefit, shaped rewards consistently amplified learning dynamics, making 8B models the most efficient setting for agentic RL. Crucially, these gains did not come at the cost of overfitting: fine-tuned models mostly maintained or exceeded baseline performance on out-of-domain tasks, including Multi-IF, NaturalPlan, and τ-Bench. These results establish reward shaping as a decisive lever for scaling agentic RL, highlight the competitive strength of smaller models, and demonstrate that efficiency can be achieved without sacrificing generalization.