Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity

Nudo, Jacopo, Pandolfo, Mario Edoardo, Loru, Edoardo, Samory, Mattia, Cinelli, Matteo, Quattrociocchi, Walter

arXiv.org Artificial Intelligence

We investigate how Large Language Models (LLMs) behave when simulating political discourse on social media. Leveraging 21 million interactions on X during the 2024 U.S. presidential election, we construct LLM agents based on 1,186 real users, prompting them to reply to politically salient tweets under controlled conditions. Agents are initialized either with minimal ideological cues (Zero Shot) or recent tweet history (Few Shot), allowing one-to-one comparisons with human replies. We evaluate three model families (Gemini, Mistral, and DeepSeek) across linguistic style, ideological consistency, and toxicity. We find that richer contextualization improves internal consistency but also amplifies polarization, stylized signals, and harmful language. We observe an emergent distortion that we call "generative exaggeration": a systematic amplification of salient traits beyond empirical baselines. Our analysis shows that LLMs do not emulate users; they reconstruct them. Indeed, their outputs reflect internal optimization dynamics more than observed behavior, introducing structural biases that compromise their reliability as social proxies. This challenges their use in content moderation, deliberative simulations, and policy modeling.


The Impact of Prompt Programming on Function-Level Code Generation

Khojah, Ranim, Neto, Francisco Gomes de Oliveira, Mohamad, Mazen, Leitner, Philipp

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly used by software engineers for code generation. However, limitations of LLMs such as irrelevant or incorrect code have highlighted the need for prompt programming (or prompt engineering) where engineers apply specific prompt techniques (e.g., chain-of-thought or input-output examples) to improve the generated code. Despite this, the impact of different prompt techniques -- and their combinations -- on code generation remains underexplored. In this study, we introduce CodePromptEval, a dataset of 7072 prompts designed to evaluate five prompt techniques (few-shot, persona, chain-of-thought, function signature, list of packages) and their effect on the correctness, similarity, and quality of complete functions generated by three LLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt techniques significantly influence the generated code, combining multiple techniques does not necessarily improve the outcome. Additionally, we observed a trade-off between correctness and quality when using prompt techniques. Our dataset and replication package enable future research on improving LLM-generated code and evaluating new prompt techniques.
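The five techniques the abstract names can be composed into a single prompt. The sketch below is illustrative only, not the CodePromptEval implementation; all function names and prompt wording are assumptions.

```python
# Illustrative sketch (not CodePromptEval's code): composing the five prompt
# techniques named above -- persona, few-shot examples, chain-of-thought,
# function signature, and list of packages -- into one prompt string.

def build_prompt(task, signature=None, packages=None, persona=False,
                 few_shot_examples=None, chain_of_thought=False):
    """Assemble a code-generation prompt from optional technique blocks."""
    parts = []
    if persona:
        parts.append("You are an expert Python developer.")
    if few_shot_examples:
        for inp, out in few_shot_examples:
            parts.append(f"Example input:\n{inp}\nExample output:\n{out}")
    parts.append(f"Task: {task}")
    if signature:
        parts.append(f"Implement this function signature:\n{signature}")
    if packages:
        parts.append("You may use these packages: " + ", ".join(packages))
    if chain_of_thought:
        parts.append("Think through the steps before writing the code.")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Return the n-th Fibonacci number.",
    signature="def fib(n: int) -> int:",
    packages=["functools"],
    persona=True,
    chain_of_thought=True,
)
print(prompt)
```

Toggling each keyword argument on or off is what makes a full factorial evaluation over technique combinations possible.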


Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve?

Öncel, Fırat, Bethge, Matthias, Ermis, Beyza, Ravanelli, Mirco, Subakan, Cem, Yıldız, Çağatay

arXiv.org Artificial Intelligence

In the last decade, the generalization and adaptation abilities of deep learning models were typically evaluated on fixed training and test distributions. Contrary to traditional deep learning, large language models (LLMs) are (i) even more overparameterized, (ii) trained on unlabeled text corpora curated from the Internet with minimal human intervention, and (iii) trained in an online fashion. These stark contrasts prevent researchers from transferring lessons learned on model generalization and adaptation in deep learning contexts to LLMs. To this end, our short paper introduces empirical observations that aim to shed light on further training of already pretrained language models. Specifically, we demonstrate that training a model on a text domain can degrade its perplexity on the test portion of the same domain. We observe with our subsequent analysis that the performance degradation is positively correlated with the similarity between the additional and the original pretraining dataset of the LLM. Our further token-level perplexity observations reveal that the perplexity degradation is due to a handful of tokens that are not informative about the domain. We hope these findings will guide us in determining when to adapt a model versus when to rely on its foundational capabilities.
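The token-level analysis described above rests on a simple fact: perplexity is the exponential of the mean per-token negative log-likelihood, so a handful of high-loss tokens can dominate the domain-level number. A toy sketch with made-up log-probabilities:

```python
import math

# Perplexity = exp(mean negative log-likelihood over tokens). The log-probs
# below are invented for illustration: 18 well-predicted tokens plus two
# outliers (e.g. rare, uninformative formatting tokens).

def perplexity(token_logprobs):
    nll = [-lp for lp in token_logprobs]
    return math.exp(sum(nll) / len(nll))

logprobs = [-0.1] * 18 + [-9.0, -11.0]

print(perplexity(logprobs))        # inflated by the two outlier tokens
print(perplexity([-0.1] * 18))     # close to 1 without them

# Rank tokens by loss to locate the culprits, as in the paper's analysis.
worst = sorted(range(len(logprobs)), key=lambda i: logprobs[i])[:2]
print(worst)
```

Sorting tokens by loss in this way is how a domain-level perplexity regression can be traced back to a few uninformative tokens.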


THaLLE: Text Hyperlocally Augmented Large Language Extension -- Technical Report

Labs, KBTG, Khamnuansin, Danupat, Petchsod, Atthakorn, Lertpiya, Anuruth, Balee, Pornchanan, Lodkaew, Thanawat, Chalothorn, Tawunrat, Pongthawornkamol, Thadpong, Lertsutthiwong, Monchai

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have emerged as leading tools in Natural Language Processing (NLP) due to their exceptional performance across various tasks. The advent of open-source models such as Llama [1] from Meta, Gemma [2] from Google, and Qwen [3] from Alibaba has significantly enhanced public access to advanced LLMs. Additionally, low-cost techniques for LLM fine-tuning, such as Low-rank Adaptation (LoRA) [4], have enabled the fine-tuning of these models on consumer-grade hardware, thereby accelerating their development and adoption. LLMs are now utilized in a wide array of applications, ranging from personal assistants, e.g., ChatGPT, to specialized tasks in diverse domains. In the financial sector, BloombergGPT [5], a proprietary LLM trained from the ground up with an infusion of financial data, has demonstrated superior performance on financial benchmarks compared to other models in the market.
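The low cost of LoRA mentioned above comes from a parameter-count argument: instead of updating a full d x k weight matrix W, LoRA trains a low-rank delta B @ A, with B of shape (d, r) and A of shape (r, k) for a small rank r. A back-of-the-envelope sketch (the matrix sizes are illustrative, not taken from any particular model):

```python
# Why LoRA fine-tuning fits on consumer hardware: compare trainable
# parameter counts for a full update of a d x k matrix versus a rank-r
# low-rank delta B @ A (B is d x r, A is r x k, with r << min(d, k)).

def full_update_params(d, k):
    return d * k

def lora_params(d, k, r):
    return d * r + r * k

d = k = 4096   # a plausible attention-projection size; illustrative only
r = 8          # LoRA rank

print(full_update_params(d, k))                          # 16777216
print(lora_params(d, k, r))                              # 65536
print(full_update_params(d, k) // lora_params(d, k, r))  # 256x fewer
```

At rank 8, the trainable footprint of this one matrix drops by a factor of 256, which is why adapters for billion-parameter models fit in consumer GPU memory.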


RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

Chan, Chi-Min, Xu, Chunpu, Yuan, Ruibin, Luo, Hongyin, Xue, Wei, Guo, Yike, Fu, Jie

arXiv.org Artificial Intelligence

Large Language Models (LLMs) exhibit remarkable capabilities but are prone to generating inaccurate or hallucinatory responses. This limitation stems from their reliance on vast pretraining datasets, making them susceptible to errors in unseen scenarios. Retrieval-Augmented Generation (RAG) addresses these challenges by incorporating external, relevant documents into the response generation process, thus leveraging non-parametric knowledge alongside LLMs' in-context learning abilities. However, existing RAG implementations primarily focus on the initial input for context retrieval, overlooking the nuances of ambiguous or complex queries that necessitate further clarification or decomposition for accurate responses. To this end, we propose learning to Refine Query for Retrieval Augmented Generation (RQ-RAG), endeavoring to enhance the model by equipping it with capabilities for explicit rewriting, decomposition, and disambiguation. Our experimental results indicate that our method, when applied to a 7B Llama2 model, surpasses the previous state-of-the-art (SOTA) by an average of 1.9% across three single-hop QA datasets, and also demonstrates enhanced performance in handling complex, multi-hop QA datasets. Our code is available at https://github.com/chanchimin/RQ-RAG.
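The refinement step the abstract describes sits between the raw query and retrieval: a query is rewritten, decomposed into sub-queries, or flagged for disambiguation. The rule-based `refine` below is a hypothetical stand-in for the trained 7B model, which learns these operations end to end; the heuristics are invented purely to show the interface.

```python
# Hypothetical sketch of RQ-RAG's query-refinement interface. The real
# system uses a fine-tuned LLM to choose and perform the operation; these
# string heuristics are illustrative stand-ins only.

def refine(query):
    """Return (operation, refined queries) for a raw user query."""
    if " and " in query:
        # Complex multi-part question: decompose into sub-queries,
        # each of which is retrieved for separately.
        return "decompose", [q.strip() + "?"
                             for q in query.rstrip("?").split(" and ")]
    if any(w in query.lower().split() for w in ("it", "they")):
        # Unresolved pronoun: flag the query for disambiguation.
        return "disambiguate", [query]
    # Otherwise, lightly rewrite (here: just normalize whitespace).
    return "rewrite", [query.strip()]

op, subqueries = refine("Who founded Intel and when was it incorporated?")
print(op, subqueries)
```

Each refined query then drives its own retrieval pass, so multi-hop questions get evidence for every hop rather than a single retrieval over the original wording.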


VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following

Lu, Yujie, Li, Xiujun, Wang, William Yang, Choi, Yejin

arXiv.org Artificial Intelligence

We introduce VISUAL EMBEDDED INSTRUCTION (VIM), a new framework designed to evaluate the visual instruction following capability of Multimodal Large Language Models (MLLMs). As illustrated in Figure 2, VIM challenges the MLLMs by embedding the instructions into the visual scenes, demanding strong visual interpretative skills for instruction following. We adapt VIM to various benchmarks, including VQAv2, MME, MM-Vet, and the RefCOCO series, compose a VIM bench, and probe diverse MLLMs across three distinct in-context learning settings: Zero Shot, One Shot, and Pair Shot. We observe a significant performance disparity between the open-source MLLMs and GPT-4V, implying that the open-source models' proficiency in visual instruction comprehension is not up to par. Our results highlight a promising direction for enhancing MLLM capabilities on instruction following. We intend VIM to serve as a useful benchmark for advancing the state of the art and driving further progress in the field.


Mixed Formal Learning: A Path to Transparent Machine Learning

Carrico, Sandra

arXiv.org Artificial Intelligence

This paper presents Mixed Formal Learning, a new architecture that learns models based on formal mathematical representations of the domain of interest and exposes latent variables. The second element in the architecture learns a particular skill, typically by using traditional prediction or classification mechanisms. Our key findings are that this architecture: (1) facilitates transparency by exposing key latent variables based on a learned mathematical model; (2) enables Low Shot and Zero Shot training of machine learning models without sacrificing accuracy or recall.