Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
Raimondi, Bianca, Dalbagno, Daniela, Gabbrielli, Maurizio
Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a Layer-Patching analysis across three open-weight LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.
- Europe > Italy (0.14)
- North America > United States (0.14)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
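A minimal sketch of the layer-patching idea described in the abstract above: capture activations from the pretrained model at a few layers and write them into the finetuned model's forward pass. The Gemma model pair, the layer indices, and the probe prompt are illustrative assumptions, not the paper's actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, TUNED = "google/gemma-2b", "google/gemma-2b-it"  # illustrative model pair
PATCH_LAYERS = [10, 11, 12]                            # hypothetical "critical" layers

tok = AutoTokenizer.from_pretrained(TUNED)
base = AutoModelForCausalLM.from_pretrained(BASE)
tuned = AutoModelForCausalLM.from_pretrained(TUNED)

prompt = ("The chairman knew the program would harm the environment. "
          "Did he harm the environment intentionally? Answer:")
inputs = tok(prompt, return_tensors="pt")

# 1) Cache the pretrained model's hidden states at the layers of interest.
cache = {}
def save_hook(idx):
    def hook(module, args, output):
        cache[idx] = output[0].detach()  # output[0] is the hidden-state tensor
    return hook

handles = [base.model.layers[i].register_forward_hook(save_hook(i)) for i in PATCH_LAYERS]
with torch.no_grad():
    base(**inputs)
for h in handles:
    h.remove()

# 2) Overwrite the finetuned model's activations at those layers.
def patch_hook(idx):
    def hook(module, args, output):
        return (cache[idx],) + tuple(output[1:])  # swap in pretrained activations
    return hook

handles = [tuned.model.layers[i].register_forward_hook(patch_hook(i)) for i in PATCH_LAYERS]
with torch.no_grad():
    patched_logits = tuned(**inputs).logits  # compare against unpatched logits
for h in handles:
    h.remove()
```

Comparing the probability the finetuned model assigns to "Yes" before and after patching would indicate whether the targeted layers carry the bias.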
Mind Reading or Misreading? LLMs on the Big Five Personality Test
Di Cursi, Francesco, Boldrini, Chiara, Conti, Marco, Passarella, Andrea
We evaluate large language models (LLMs) for automatic personality prediction from text (APPT) under a binary formulation of the Five Factor Model (Big Five). Five models -- including GPT-4 and lightweight open-source alternatives -- are tested across three heterogeneous datasets (Essays, MyPersonality, Pandora) and two prompting strategies (minimal vs. enriched with linguistic and psychological cues). Enriched prompts reduce invalid outputs and improve class balance, but also introduce a systematic bias toward predicting trait presence. Performance varies substantially: Openness and Agreeableness are relatively easier to detect, while Extraversion and Neuroticism remain challenging. Although open-source models sometimes approach GPT-4 and prior benchmarks, no configuration yields consistently reliable predictions in zero-shot binary settings. Moreover, aggregate metrics such as accuracy and macro-F1 mask significant asymmetries, with per-class recall offering clearer diagnostic value. These findings show that current out-of-the-box LLMs are not yet suitable for APPT, and that careful coordination of prompt design, trait framing, and evaluation metrics is essential for interpretable results.
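To make the two prompting regimes concrete, here is a hedged sketch of what "minimal" versus "enriched" zero-shot prompts could look like; the wording and the cue lists are assumptions, not the authors' templates.

```python
# Two prompting regimes for binary trait prediction (illustrative only).
MINIMAL = (
    "Based on the following text, answer YES or NO: does the author show "
    "high {trait}?\n\nText: {text}\nAnswer:"
)

ENRICHED = (
    "You are a personality psychologist using the Five Factor Model. "
    "{trait} is associated with cues such as {cues}. "
    "Considering word choice, tone, and topics, answer YES or NO: does the "
    "author show high {trait}?\n\nText: {text}\nAnswer:"
)

# Hypothetical linguistic/psychological cues per trait.
CUES = {
    "Extraversion": "social words, positive emotion, talk of people and parties",
    "Neuroticism": "negative emotion words, anxiety, first-person focus",
}

def build_prompt(trait: str, text: str, enriched: bool = False) -> str:
    if enriched:
        return ENRICHED.format(trait=trait, cues=CUES[trait], text=text)
    return MINIMAL.format(trait=trait, text=text)

print(build_prompt("Extraversion", "I love meeting new people!", enriched=True))
```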
Enhancing Breast Cancer Prediction with LLM-Inferred Confounders
Wheeler High School, Marietta, GA

This study enhances breast cancer prediction by using large language models to infer the likelihood of confounding diseases, namely diabetes, obesity, and cardiovascular disease, from routine clinical data. These AI-generated features improved Random Forest model performance, particularly for LLMs like Gemma (3.9%) and Llama (6.4%). The approach shows promise for noninvasive prescreening and clinical integration, supporting improved early detection and shared decision-making in breast cancer diagnosis.

Breast cancer (BC) is a leading cause of death among women in the U.S., with most cases having unknown causes despite known risk factors [1]. Researchers have identified correlations between BC and various clinical features and biomarkers, such as body mass index, glucose, insulin, leptin, adiponectin, resistin, MCP-1, and HOMA, that can be measured through routine blood tests.
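A sketch of the feature-augmentation pipeline this abstract suggests, under stated assumptions: an LLM call (stubbed here with a simple heuristic) scores the likelihood of each confounder from routine measurements, and the scores are appended as extra columns before Random Forest training. The data, thresholds, and column choices are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def llm_confounder_scores(row):
    """Placeholder for an LLM call returning P(diabetes), P(obesity),
    P(cardiovascular disease) from a patient's routine blood panel.
    Stubbed with toy heuristics for illustration only."""
    glucose, bmi = row[0], row[1]
    return [min(glucose / 200, 1.0), min(bmi / 40, 1.0), 0.5]

rng = np.random.default_rng(0)
X = rng.normal([100, 27], [20, 5], size=(200, 2))  # synthetic glucose, BMI
y = rng.integers(0, 2, size=200)                   # synthetic BC labels

# Append the LLM-inferred confounder likelihoods as extra features.
X_aug = np.hstack([X, np.apply_along_axis(llm_confounder_scores, 1, X)])
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X_aug, y, cv=5).mean())
```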
Balancing Natural Language Processing Accuracy and Normalisation in Extracting Medical Insights
Tworek, Paulina, Bargieł, Miłosz, Khan, Yousef, Pełech-Pilichowski, Tomasz, Mikołajczyk, Marek, Lewandowski, Roman, Sousa, Jose
Extracting structured medical insights from unstructured clinical text using Natural Language Processing (NLP) remains an open challenge in healthcare, particularly in non-English contexts where resources are scarce. This study presents a comparative analysis of low-compute rule-based NLP methods and Large Language Models (LLMs) for information extraction from electronic health records (EHR) obtained from the Voivodeship Rehabilitation Hospital for Children in Ameryka, Poland. We evaluate both approaches by extracting patient demographics, clinical findings, and prescribed medications, while examining the effects of missing text normalisation and of translation-induced information loss. Results demonstrate that rule-based methods provide higher accuracy in information retrieval tasks, particularly for age and sex extraction, whereas LLMs offer greater adaptability and scalability, excelling in drug name recognition. LLM effectiveness was further compared on texts originally in Polish and on their English translations, to assess the impact of translation. These findings highlight the trade-offs between accuracy, normalisation, and computational cost when deploying NLP in healthcare settings. We argue for hybrid approaches that combine the precision of rule-based systems with the adaptability of LLMs, offering a practical path toward more reliable and resource-efficient clinical NLP in real-world hospitals.
- Health & Medicine > Therapeutic Area > Immunology (1.00)
- Health & Medicine > Health Care Technology > Medical Record (0.87)
- Health & Medicine > Health Care Providers & Services (0.69)
- Health & Medicine > Diagnostic Medicine (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
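The contrast between the two extraction styles can be sketched as follows; the Polish patterns and the JSON prompt are illustrative assumptions rather than the study's actual rules.

```python
import re

def rule_based_extract(text: str) -> dict:
    """Low-compute extraction of age and sex with hand-written patterns."""
    age = re.search(r"(\d{1,2})\s*(?:lat|lata|roku życia|y\.?o\.?)", text)
    sex = re.search(r"\b(chłopiec|dziewczynka|boy|girl)\b", text, re.I)
    return {
        "age": int(age.group(1)) if age else None,
        "sex": sex.group(1).lower() if sex else None,
    }

# The LLM alternative trades hand-written rules for a single prompt.
LLM_PROMPT = (
    "Extract the patient's age, sex, and prescribed drugs from the note "
    "below. Reply as JSON with keys age, sex, drugs.\n\nNote: {note}"
)

print(rule_based_extract("Chłopiec, 7 lat, przyjęty na rehabilitację."))
# {'age': 7, 'sex': 'chłopiec'}
```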
Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
Pedashenko, Vladislav, Kushnareva, Laida, Nibal, Yana Khassan, Tulchinskii, Eduard, Kuznetsov, Kristian, Zharchinskii, Vladislav, Maximov, Yury, Piontkovskaya, Irina
Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
- North America > United States (0.46)
- Europe > Austria (0.28)
- Asia > Middle East > UAE (0.28)
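The abstract does not name its ID estimator, so as an assumption for illustration the sketch below uses one common choice, TwoNN (Facco et al., 2017), applied to synthetic embeddings: each point's two nearest-neighbor distances give a ratio mu = r2/r1, and the maximum-likelihood estimate is ID = N / sum(log mu).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_id(X: np.ndarray) -> float:
    """TwoNN intrinsic-dimension estimate from nearest-neighbor ratios."""
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dists[:, 2] / dists[:, 1]  # column 0 is the point itself (distance 0)
    return len(X) / np.sum(np.log(mu))

rng = np.random.default_rng(0)
# Linear images of 3-dim and 10-dim Gaussians embedded in 64 dimensions.
low_id = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 64))
high_id = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 64))
print(two_nn_id(low_id), two_nn_id(high_id))  # expect roughly 3 vs 10
```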
Preference Learning from Physics-Based Feedback: Tuning Language Models to Design BCC/B2 Superalloys
Ghosh, Satanu, Holgate, Collin, Brodnik, Neal R., Downey, Doug, Daly, Samantha, Pollock, Tresa M., Carton, Samuel
We apply preference learning to the task of language model-guided design of novel structural alloys. In contrast to prior work that focuses on generating stable inorganic crystals, our approach targets the synthesizability of a specific structural class: BCC/B2 superalloys, an underexplored family of materials with potential applications in extreme environments. Using three open-weight models (LLaMA-3.1, Gemma-2, and OLMo-2), we demonstrate that language models can be optimized for multiple design objectives using a single, unified reward signal through Direct Preference Optimization (DPO). Unlike prior approaches that rely on costly heuristic or human-in-the-loop feedback, our reward signal is derived from thermodynamic phase calculations, offering a scientifically grounded criterion for model tuning. To our knowledge, this is the first demonstration of preference-tuning a language model using physics-grounded feedback for structural alloy design. The resulting framework is general and extensible, providing a path forward for intelligent design-space exploration across a range of physical science domains.
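A hedged sketch of the pipeline shape this abstract implies, using the trl library (the >= 0.12 API with processing_class is assumed): candidate compositions are ranked by a physics-based score, stubbed below in place of real thermodynamic phase calculations, paired into chosen/rejected examples, and used for DPO. All names, compositions, and hyperparameters are placeholders.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

def phase_score(composition: str) -> float:
    """Stub for the physics-based reward; in the paper's setting this would
    come from thermodynamic phase calculations (e.g., predicted BCC/B2
    phase fraction). Here it is a placeholder constant."""
    return 0.0

prompt = "Propose a BCC/B2 superalloy composition for 1200 C service:"
candidates = ["Al30 Ni20 Fe50", "Al10 Cu40 Zn50"]  # hypothetical model outputs
ranked = sorted(candidates, key=phase_score, reverse=True)

# Higher physics score becomes "chosen", lower becomes "rejected".
pairs = Dataset.from_dict({
    "prompt": [prompt],
    "chosen": [ranked[0]],
    "rejected": [ranked[-1]],
})

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")  # illustrative
tok = AutoTokenizer.from_pretrained("google/gemma-2-2b")
trainer = DPOTrainer(model=model, args=DPOConfig(output_dir="dpo-alloys"),
                     train_dataset=pairs, processing_class=tok)
trainer.train()
```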
Beyond the Surface: Probing the Ideological Depth of Large Language Models
Kabir, Shariar, Esterling, Kevin, Dong, Yue
Large language models (LLMs) display recognizable political leanings, yet they vary significantly in their ability to represent a political orientation consistently. In this paper, we define ideological depth as (i) a model's ability to follow political instructions without failure (steerability), and (ii) the feature richness of its internal political representations measured with sparse autoencoders (SAEs), an unsupervised sparse dictionary learning (SDL) approach. Using Llama-3.1-8B-Instruct and Gemma-2-9B-IT as candidates, we compare prompt-based and activation-steering interventions and probe political features with publicly available SAEs. We find large, systematic differences: Gemma is more steerable in both directions and activates approximately 7.3x more distinct political features than Llama. Furthermore, causally ablating a small, targeted set of Gemma's political features to create a comparably feature-poor setting induces consistent shifts in its behavior, including increased rates of refusal across topics. Together, these results indicate that refusals on benign political instructions or prompts can arise from capability deficits rather than safety guardrails. Ideological depth thus emerges as a measurable property of LLMs, and steerability serves as a window into their latent political architecture.
- Asia (0.46)
- North America > United States (0.28)
- Law (1.00)
- Government (1.00)
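Schematically, the causal ablation could be implemented as below: decode the residual stream through an SAE, zero a chosen set of latents, and write the reconstruction back. The toy SAE, its dimensions (Gemma-2-2B width used for concreteness), the feature indices, and the hook location are all placeholders; public SAEs such as Gemma Scope would supply real weights and feature labels.

```python
import torch

POLITICAL_FEATURES = [1234, 5678]  # hypothetical "political" latent indices

class ToySAE(torch.nn.Module):
    """Stand-in for a pretrained sparse autoencoder (random weights here)."""
    def __init__(self, d_model=2304, d_sae=16384):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_sae)
        self.dec = torch.nn.Linear(d_sae, d_model)

    def ablate(self, resid, feature_ids):
        acts = torch.relu(self.enc(resid))  # sparse feature activations
        acts[..., feature_ids] = 0.0        # causal ablation of chosen latents
        return self.dec(acts)               # reconstruction without them

sae = ToySAE()

def ablation_hook(module, args, output):
    """Forward hook for a decoder layer: replace hidden states with the
    SAE reconstruction minus the ablated features."""
    hidden = output[0]
    return (sae.ablate(hidden, POLITICAL_FEATURES),) + tuple(output[1:])

# model.model.layers[20].register_forward_hook(ablation_hook)  # illustrative layer
```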
GEMMA-SQL: A Novel Text-to-SQL Model Based on Large Language Models
Pandey, Hari Mohan, Gupta, Anshul, Sarkar, Subham, Tomer, Minakshi, Schneider, Johannes, Gong, Yan
Text-to-SQL systems enable users to interact with structured databases using natural language, eliminating the need for specialized programming knowledge. In this work, we introduce GEMMA-SQL, a lightweight and efficient text-to-SQL model built upon the open-source Gemma 2B architecture. Unlike many large language models (LLMs), GEMMA-SQL is fine-tuned in a resource-efficient, iterative manner and can be deployed on low-cost hardware. Leveraging the Spider benchmark for training and evaluation, GEMMA-SQL combines multiple prompting strategies, including few-shot learning, to enhance SQL query generation accuracy. The instruction-tuned variant, GEMMA-SQL Instruct, achieves 66.8% Test-Suite accuracy and 63.3% Exact Set Match accuracy, outperforming several state-of-the-art baselines such as IRNet, RYANSQL, and CodeXDavinci. The proposed approach demonstrates that effective prompt design and targeted instruction tuning can significantly boost performance while maintaining high scalability and adaptability. These results position GEMMA-SQL as a practical, open-source alternative for robust and accessible text-to-SQL systems.
- Europe > United Kingdom > England > Dorset > Bournemouth (0.04)
- North America > United States > Wyoming (0.04)
- Europe > Liechtenstein (0.04)
- (3 more...)
- Workflow (1.00)
- Research Report > New Finding (1.00)
- Overview (1.00)
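An illustrative few-shot prompt in the style the abstract describes; the exemplars (Spider-like schemas) and the schema serialization are assumptions, not GEMMA-SQL's actual template.

```python
FEW_SHOT = """\
Translate the question into a SQL query for the given schema.

Schema: singer(singer_id, name, country, age)
Question: How many singers are there?
SQL: SELECT count(*) FROM singer

Schema: concert(concert_id, name, year, stadium_id)
Question: List concert names held in 2014.
SQL: SELECT name FROM concert WHERE year = 2014

Schema: {schema}
Question: {question}
SQL:"""

prompt = FEW_SHOT.format(
    schema="employee(id, name, salary, dept)",
    question="What is the average salary per department?",
)
# Expected completion: SELECT dept, avg(salary) FROM employee GROUP BY dept
```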
XBreaking: Understanding how LLMs' security alignment can be broken
Arazzi, Marco, Kembu, Vignesh Kumar, Nocera, Antonino, P, Vinod
Large Language Models are fundamental actors in the modern IT landscape dominated by AI solutions. However, security threats associated with them might prevent their reliable adoption in critical application scenarios such as government organizations and medical institutions. For this reason, commercial LLMs typically undergo a sophisticated censoring mechanism to eliminate any harmful output they could possibly produce. These mechanisms maintain the integrity of LLM alignment by guaranteeing that the models respond safely and ethically. Attacks on LLMs are therefore a significant threat to such protections, and many previous approaches have already demonstrated their effectiveness across diverse domains. Existing LLM attacks mostly adopt a generate-and-test strategy to craft malicious input. To improve the comprehension of censoring mechanisms and design a targeted attack, we propose an Explainable-AI solution that comparatively analyzes the behavior of censored and uncensored models to derive unique exploitable alignment patterns. We then propose XBreaking, a novel approach that exploits these patterns to break the security and alignment constraints of LLMs through targeted noise injection. Our thorough experimental campaign yields important insights about the censoring mechanisms and demonstrates the effectiveness and performance of our approach.

Large Language Models (LLMs) represent the most promising and relevant advancement in the field of Artificial Intelligence. These complex deep learning models are trained on massive datasets that cover almost all aspects of people's daily lives, granting them the capability of generating, understanding, and processing human language. For this reason, their integration as support tools is becoming pervasive, with applications spanning from text editing and proofreading to virtual assistants and personalized text generation. However, the diffusion of this technology, especially in critical domains such as government organizations and medical institutions, demands an assessment of its security and privacy characteristics.
- Law > Civil Rights & Constitutional Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
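At an abstract level, targeted noise injection might look like the sketch below; in the paper the layer selection comes from the explainability analysis, whereas the layer indices and noise scale here are placeholders.

```python
import torch

@torch.no_grad()
def inject_noise(model, layer_ids, scale=0.01):
    """Perturb the MLP weights of selected transformer layers with
    Gaussian noise scaled by each weight tensor's standard deviation."""
    for i in layer_ids:
        for p in model.model.layers[i].mlp.parameters():
            p.add_(torch.randn_like(p) * scale * p.std())

# inject_noise(model, layer_ids=[12, 13], scale=0.02)  # illustrative call
```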
Efficiency vs. Alignment: Investigating Safety and Fairness Risks in Parameter-Efficient Fine-Tuning of LLMs
Taraghi, Mina, Pequignot, Yann, Nikanjam, Amin, Merzouk, Mohamed Amine, Khomh, Foutse
Organizations are increasingly adopting and adapting Large Language Models (LLMs) hosted on public repositories such as HuggingFace. Although these adaptations often improve performance on specialized downstream tasks, recent evidence indicates that they can also degrade a model's safety or fairness. Since different fine-tuning techniques may exert distinct effects on these critical dimensions, this study undertakes a systematic assessment of their trade-offs. Four widely used Parameter-Efficient Fine-Tuning methods, LoRA, IA3, Prompt-Tuning, and P-Tuning, are applied to four instruction-tuned model families (Meta-Llama-3-8B, Qwen2.5-7B, Mistral-7B, and Gemma-7B). In total, 235 fine-tuned variants are evaluated across eleven safety hazard categories and nine demographic fairness dimensions. The results show that adapter-based approaches (LoRA, IA3) tend to improve safety scores and are the least disruptive to fairness, retaining higher accuracy and lower bias scores. In contrast, prompt-based methods (Prompt-Tuning and P-Tuning) generally reduce safety and cause larger fairness regressions, with decreased accuracy and increased bias. Alignment shifts are strongly moderated by base model type: LLaMA remains stable, Qwen records modest gains, Gemma experiences the steepest safety decline, and Mistral, which is released without an internal moderation layer, displays the greatest variance. Improvements in safety do not necessarily translate into improvements in fairness, and no single configuration optimizes all fairness metrics simultaneously, indicating an inherent trade-off between these objectives. These findings suggest a practical guideline for safety-critical deployments: begin with a well-aligned base model, favour adapter-based PEFT, and conduct category-specific audits of both safety and fairness.
- North America > Canada > Quebec > Montreal (0.14)
- North America > Canada > Quebec > Capitale-Nationale Region > Québec (0.04)
- North America > Canada > Quebec > Capitale-Nationale Region > Quebec City (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
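For reference, the adapter-based setup the study favours takes only a few lines with the peft library; the base model and hyperparameters below are illustrative, not the paper's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of weights train
```

Because only the low-rank adapters update while the base weights stay frozen, this style of tuning perturbs the original alignment less than prompt-based methods, consistent with the safety and fairness pattern the study reports.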