Personal
Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases
Gao, Rena, Wu, Xuetong, Kuribayashi, Tatsuki, Ye, Mingrui, Qi, Siya, Roever, Carsten, Liu, Yuanxing, Yuan, Zheng, Lau, Jey Han
This study evaluates Large Language Models' (LLMs) ability to simulate non-native-like English use observed in human second language (L2) learners interfered with by their native first language (L1). In dialogue-based interviews, we prompt LLMs to mimic L2 English learners with specific L1s (e.g., Japanese, Thai, Urdu) across seven languages, comparing their outputs to real L2 learner data. Our analysis examines L1-driven linguistic biases, such as reference word usage and avoidance behaviors, using information-theoretic and distributional density measures. Results show that modern LLMs (e.g., Qwen2.5, LLAMA3.3, DeepseekV3, GPT-4o) replicate L1-dependent patterns observed in human L2 data, with distinct influences from various languages (e.g., Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu influences noun-verb collocations). Our results reveal the potential of LLMs for L2 dialogue generation and evaluation for future educational applications.
CrossFuse: Learning Infrared and Visible Image Fusion by Cross-Sensor Top-K Vision Alignment and Beyond
Shi, Yukai, Shi, Cidan, Weng, Zhipeng, Tian, Yin, Xian, Xiaoyu, Lin, Liang
Infrared and visible image fusion (IVIF) is increasingly applied in critical fields such as video surveillance and autonomous driving systems. Significant progress has been made in deep learning-based fusion methods. However, these models frequently encounter out-of-distribution (OOD) scenes in real-world applications, which severely impact their performance and reliability. Therefore, addressing the challenge of OOD data is crucial for the safe deployment of these models in open-world environments. Unlike existing research, our focus is on the challenges posed by OOD data in real-world applications and on enhancing the robustness and generalization of models. In this paper, we propose an infrared-visible fusion framework based on Multi-View Augmentation. For external data augmentation, Top-k Selective Vision Alignment is employed to mitigate distribution shifts between datasets by performing RGB-wise transformations on visible images. This strategy effectively introduces augmented samples, enhancing the adaptability of the model to complex real-world scenarios. Additionally, for internal data augmentation, self-supervised learning is established using Weak-Aggressive Augmentation. This enables the model to learn more robust and general feature representations during the fusion process, thereby improving robustness and generalization. Extensive experiments demonstrate that the proposed method exhibits superior performance and robustness across various conditions and environments. Our approach significantly enhances the reliability and stability of IVIF tasks in practical applications.
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh
Laiyk, Nurkhan, Orel, Daniil, Joshi, Rituraj, Goloburda, Maiya, Wang, Yuxia, Nakov, Preslav, Koto, Fajri
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entity of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language?
Alabdulmohsin, Ibrahim, Steiner, Andreas
Language exhibits a fractal structure in its information-theoretic complexity (i.e. bits per token), with self-similarity across scales and long-range dependence (LRD). In this work, we investigate whether large language models (LLMs) can replicate such fractal characteristics and identify conditions-such as temperature setting and prompting method-under which they may fail. Moreover, we find that the fractal parameters observed in natural language are contained within a narrow range, whereas those of LLMs' output vary widely, suggesting that fractal parameters might prove helpful in detecting a non-trivial portion of LLM-generated texts. Notably, these findings, and many others reported in this work, are robust to the choice of the architecture; e.g. Gemini 1.0 Pro, Mistral-7B and Gemma-2B. We also release a dataset comprising of over 240,000 articles generated by various LLMs (both pretrained and instruction-tuned) with different decoding temperatures and prompting methods, along with their corresponding human-generated texts. We hope that this work highlights the complex interplay between fractal properties, prompting, and statistical mimicry in LLMs, offering insights for generating, evaluating and detecting synthetic texts.
Inside China's electric-vehicle-to-humanoid-robot pivot
Now our intrepid China reporter, Caiwei Chen, has identified a new trend unfolding within China's tech scene: Companies that were dominant in electric vehicles are betting big on translating that success into developing humanoid robots. I spoke with her about what she found out and what it might mean for Trump's policies and the rest of the globe. James: Before we talk about robots, let's talk about DeepSeek. The frenzy for the AI model peaked a couple of weeks ago. What are you hearing from other Chinese AI companies?
Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning
Li, Ningke, Song, Yahui, Wang, Kailong, Li, Yuekang, Shi, Ling, Liu, Yi, Wang, Haoyu
Abstract--Large language models (LLMs) face the challenge of hallucinations - outputs that seem coherent but are actually incorrect. A particularly damaging type is fact-conflicting hallucination (FCH), where generated content contradicts established facts. Addressing FCH presents three main challenges: 1) Automatically constructing and maintaining large-scale benchmark datasets is difficult and resource-intensive; 2) Generating complex and efficient test cases that the LLM has not been trained on - especially those involving intricate temporal features - is challenging, yet crucial for eliciting hallucinations; and 3) Validating the reasoning behind LLM outputs is inherently difficult, particularly with complex logical relationships, as it requires transparency in the model's decision-making process. LLMs are tested using these cases through template-based prompts, which require them to generate both answers and reasoning steps. T o validate the reasoning, we propose two semantic-aware oracles that compare the semantic structure of LLM outputs to the ground truths. Key insights reveal that LLMs struggle with out-of-distribution knowledge and logical reasoning. These findings highlight the importance of continued efforts to detect and mitigate hallucinations in LLMs. Large Language Models (LLMs) have revolutionized language processing, demonstrating impressive text generation and comprehension capabilities with diverse applications. However, despite their growing use, LLMs face significant security and privacy challenges [1], [2], [3], [4], [5], which affect their overall effectiveness and reliability . A critical issue is the phenomenon of hallucination, where LLMs generate outputs that are coherent but factually incorrect or irrelevant. This tendency to produce misleading information compromises the safety and usability of LLM-based systems. This paper focuses on fact-conflicting hallucina tion (FCH), the most prominent form of hallucination in LLMs. FCH occurs when LLMs generate content that directly contradicts established facts. For instance, as illustrated in Figure 1, an LLM incorrectly asserts that " Haruki Murakami won the Nobel Prize in Literature in 2016 ", whereas the fact is that "Haruki Murakami has not won the Nobel Prize, though he has received numerous other literary awards ". Such inaccuracies can significantly lead to user confusion and undermine the trust and reliability that are crucial for LLM applications. N. Li, K. Wang, and H. Wang are with Huazhong University of Science and T echnology, China. Song is with the National University of Singapore, Singapore. Li is with the University of New South Wales, Australia.
Talking About the Assumption in the Room
Mothilal, Ramaravind Kommiya, Lalani, Faisal M., Ahmed, Syed Ishtiaque, Guha, Shion, Sultana, Sharifa
The reference to assumptions in how practitioners use or interact with machine learning (ML) systems is ubiquitous in HCI and responsible ML discourse. However, what remains unclear from prior works is the conceptualization of assumptions and how practitioners identify and handle assumptions throughout their workflows. This leads to confusion about what assumptions are and what needs to be done with them. We use the concept of an argument from Informal Logic, a branch of Philosophy, to offer a new perspective to understand and explicate the confusions surrounding assumptions. Through semi-structured interviews with 22 ML practitioners, we find what contributes most to these confusions is how independently assumptions are constructed, how reactively and reflectively they are handled, and how nebulously they are recorded. Our study brings the peripheral discussion of assumptions in ML to the center and presents recommendations for practitioners to better think about and work with assumptions.
Can you pass that tool?: Implications of Indirect Speech in Physical Human-Robot Collaboration
Zhang, Yan, Ratnayake, Tharaka Sachintha, Sew, Cherie, Knibbe, Jarrod, Goncalves, Jorge, Johal, Wafa
Indirect speech acts (ISAs) are a natural pragmatic feature of human communication, allowing requests to be conveyed implicitly while maintaining subtlety and flexibility. Although advancements in speech recognition have enabled natural language interactions with robots through direct, explicit commands--providing clarity in communication--the rise of large language models presents the potential for robots to interpret ISAs. However, empirical evidence on the effects of ISAs on human-robot collaboration (HRC) remains limited. To address this, we conducted a Wizard-of-Oz study (N=36), engaging a participant and a robot in collaborative physical tasks. Our findings indicate that robots capable of understanding ISAs significantly improve human's perceived robot anthropomorphism, team performance, and trust. However, the effectiveness of ISAs is task- and context-dependent, thus requiring careful use. These results highlight the importance of appropriately integrating direct and indirect requests in HRC to enhance collaborative experiences and task performance.
Personality Structured Interview for Large Language Model Simulation in Personality Research
Wang, Pengda, Zou, Huiqi, Chen, Hanjie, Sun, Tianjun, Xiao, Ziang, Oswald, Frederick L.
Although psychometrics researchers have recently explored the use of large language models (LLMs) as proxies for human participants, LLMs often fail to generate heterogeneous data with human-like diversity, which diminishes their value in advancing social science research. To address these challenges, we explored the potential of the theory-informed Personality Structured Interview (PSI) as a tool for simulating human responses in personality research. In this approach, the simulation is grounded in nuanced real-human interview transcripts that target the personality construct of interest. We have provided a growing set of 357 structured interview transcripts from a representative sample, each containing an individual's response to 32 open-ended questions carefully designed to gather theory-based personality evidence. Additionally, grounded in psychometric research, we have summarized an evaluation framework to systematically validate LLM-generated psychometric data. Results from three experiments demonstrate that well-designed structured interviews could improve human-like heterogeneity in LLM-simulated personality data and predict personality-related behavioral outcomes (i.e., organizational citizenship behaviors and counterproductive work behavior). We further discuss the role of theory-informed structured interviews in LLM-based simulation and outline a general framework for designing structured interviews to simulate human-like data for psychometric research.
Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control
Su, Jinyan, Healey, Jennifer, Nakov, Preslav, Cardie, Claire
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to mitigate large language model (LLM) hallucinations by incorporating external knowledge retrieval. However, existing RAG frameworks often apply retrieval indiscriminately,leading to inefficiencies-over-retrieving when unnecessary or failing to retrieve iteratively when required for complex reasoning. Recent adaptive retrieval strategies, though adaptively navigates these retrieval strategies, predict only based on query complexity and lacks user-driven flexibility, making them infeasible for diverse user application needs. In this paper, we introduce a novel user-controllable RAG framework that enables dynamic adjustment of the accuracy-cost trade-off. Our approach leverages two classifiers: one trained to prioritize accuracy and another to prioritize retrieval efficiency. Via an interpretable control parameter $\alpha$, users can seamlessly navigate between minimal-cost retrieval and high-accuracy retrieval based on their specific requirements. We empirically demonstrate that our approach effectively balances accuracy, retrieval cost, and user controllability, making it a practical and adaptable solution for real-world applications.