Building Trust Workshop
The Steganographic Potentials of Language Models
Karpov, Artem, Adeleke, Tinuade, Cho, Seong Hah, Perez-Campanero, Natalia
The potential for large language models (LLMs) to hide messages within plain text (steganography) poses a challenge to detecting and thwarting unaligned AI agents, and undermines the faithfulness of LLM reasoning. We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL) to: (1) develop covert encoding schemes, (2) engage in steganography when prompted, and (3) utilize steganography in realistic scenarios where hidden reasoning is likely but not prompted. In these scenarios, we measure both the models' intent to hide their reasoning and their steganographic performance. Our findings from the fine-tuning experiments, as well as from behavioral evaluations without fine-tuning, reveal that while current models exhibit only rudimentary steganographic abilities in terms of security and capacity, explicit algorithmic guidance markedly enhances their capacity for information concealment.
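As a concrete illustration of what "explicit algorithmic guidance" for concealment can look like, the sketch below implements a simple acrostic scheme that hides a payload in the first letters of cover sentences. It is a hypothetical example for exposition only, not one of the encoding schemes learned in the paper's RL experiments; the sentence bank and function names are assumptions.

```python
# Illustrative acrostic encoding: hide a payload in sentence-initial letters.
# Hypothetical example, not the paper's learned encoding schemes.

def encode_acrostic(payload: str, sentences_by_letter: dict[str, list[str]]) -> str:
    """Build a cover text whose sentence-initial letters spell the payload."""
    cover = []
    for ch in payload.lower():
        candidates = sentences_by_letter.get(ch)
        if not candidates:
            raise ValueError(f"no cover sentence starting with {ch!r}")
        cover.append(candidates[0])
    return " ".join(cover)

def decode_acrostic(cover_text: str) -> str:
    """Recover the payload from the first letter of each sentence."""
    sentences = [s.strip() for s in cover_text.split(".") if s.strip()]
    return "".join(s[0].lower() for s in sentences)

if __name__ == "__main__":
    bank = {
        "h": ["Harvests came early this year."],
        "i": ["In the valley, the river ran low."],
    }
    cover = encode_acrostic("hi", bank)
    assert decode_acrostic(cover) == "hi"
    print(cover)
```

A scheme this simple has low capacity and is easy to detect; the point is only to show the kind of explicit procedure that, per the abstract, markedly improves concealment over what models devise on their own.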
An Empirical Study on Prompt Compression for Large Language Models
Zhang, Zheng, Li, Jinyi, Lan, Yihuai, Wang, Xiang, Wang, Hao
Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, we study six prompt compression methods for LLMs, aiming to reduce prompt length while maintaining LLM response quality. In this paper, we present a comprehensive analysis covering aspects such as generation performance, model hallucinations, efficacy in multimodal tasks, word omission analysis, and more. We evaluate these methods across 13 datasets, including news, scientific articles, commonsense QA, math QA, long-context QA, and VQA datasets. Our experiments reveal that prompt compression has a greater impact on LLM performance in long contexts than in short ones. In the LongBench evaluation, moderate compression even enhances LLM performance. Our code and data are available at https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression.
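One common family of prompt compression methods prunes tokens whose self-information under a small language model is low, keeping only the most surprising (most informative) tokens. The sketch below is a minimal, assumed illustration of that idea, not the API of the paper's toolkit; GPT-2 is a stand-in scorer and `keep_ratio` is a hypothetical knob.

```python
# Minimal sketch of self-information-based token pruning for prompt compression.
# Assumptions: GPT-2 as the scoring model, keep_ratio as the compression knob.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compress_prompt(prompt: str, keep_ratio: float = 0.5) -> str:
    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Self-information of each token given its left context (first token has no context).
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    surprisal = -logprobs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    k = max(1, int(keep_ratio * surprisal.numel()))
    # Keep the most informative tokens, restoring their original order.
    keep = torch.topk(surprisal, k).indices.sort().values + 1
    kept_ids = torch.cat([ids[0, :1], ids[0][keep]])
    return tok.decode(kept_ids)

if __name__ == "__main__":
    text = "The quick brown fox jumps over the lazy dog near the quiet river bank."
    print(compress_prompt(text, keep_ratio=0.5))
```

Dropping tokens this way preserves rare, content-bearing words while discarding predictable function words, which is one reason compression tends to hurt less (or even help) on long, redundant contexts than on short, information-dense ones.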
Self-Ablating Transformers: More Interpretability, Less Sparsity
Ferrao, Jeremias, Mikaelson, Luhan, Pepper, Keenan, Antolin, Natalia Perez-Campanero
A growing intuition in machine learning suggests a link between sparsity and interpretability. We introduce a novel self-ablation mechanism to investigate this connection ante-hoc in the context of language transformers. Our approach dynamically enforces a k-winner-takes-all constraint, forcing the model to demonstrate selective activation across neuron and attention units. Unlike post-hoc methods that analyze already-trained models, our approach integrates interpretability directly into model training, promoting feature localization from inception. Training small models on the TinyStories dataset and employing interpretability tests, we find that self-ablation leads to more localized circuits, concentrated feature representations, and increased neuron specialization without compromising language modelling performance. Surprisingly, our method also decreased overall sparsity, indicating that self-ablation promotes specialization rather than widespread inactivity. This reveals a complex interplay between sparsity and interpretability, where decreased global sparsity can coexist with increased local specialization, leading to enhanced interpretability. To facilitate reproducibility, we make our code available at https://github.com/keenanpepper/

As machine learning systems are entrusted with increasingly complex tasks, our ability to understand their decision-making processes lags behind their growing capabilities (OpenAI, 2024; Gemini Team, Google, 2024). Much of the current research in interpretability for LLMs focuses on developing post-hoc methods, attempting to explain the behaviour of already-trained models (Ribeiro et al., 2016; Conmy et al., 2023; Foote et al., 2023b; Bills et al., 2023; Huben et al., 2024). While valuable, these approaches often provide only an approximate or incomplete understanding of the underlying mechanisms (Rudin, 2019). A more fundamental yet less studied approach involves designing models to be inherently more interpretable, an ante-hoc approach, where transparency is woven into the architecture itself (Slavin et al., 2018; Tamkin et al., 2023; Cloud et al., 2024; Liu et al., 2024).
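The core mechanism is a k-winner-takes-all constraint on a layer's activations. The sketch below is a minimal PyTorch illustration of such a mask applied to per-token hidden activations; the module name, tensor shapes, and choice of k are assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of a k-winner-takes-all ablation mask over per-token activations.
# Assumed shapes and module name; not the paper's exact self-ablation architecture.
import torch
import torch.nn as nn

class KWinnerAblation(nn.Module):
    """Keep the k largest activations per token; zero (ablate) the rest."""
    def __init__(self, k: int):
        super().__init__()
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, hidden)
        topk = torch.topk(x, self.k, dim=-1)
        mask = torch.zeros_like(x).scatter_(-1, topk.indices, 1.0)
        return x * mask

if __name__ == "__main__":
    ablate = KWinnerAblation(k=4)
    hidden = torch.randn(2, 8, 32)        # 2 sequences, 8 tokens, 32 hidden units
    out = ablate(hidden)
    print((out != 0).sum(dim=-1))          # surviving units per token (at most k)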
Latent Adversarial Training Improves the Representation of Refusal
Abbas, Alexandra, Petrova, Nora, Lyons, Helios Ael, Perez-Campanero, Natalia
Recent work has shown that language models' refusal behavior is primarily encoded in a single direction in their latent space, making it vulnerable to targeted attacks. Although Latent Adversarial Training (LAT) attempts to improve robustness by introducing noise during training, a key question remains: how does this noise-based training affect the underlying representation of refusal behavior? Understanding this encoding is crucial for evaluating LAT's effectiveness and limitations, just as the discovery of linear refusal directions revealed vulnerabilities in traditional supervised safety fine-tuning (SSFT). Through analysis of Llama 2 7B, we examine how LAT reorganizes refusal behavior in the model's latent space compared to SSFT and embedding-space adversarial training (AT). By computing activation differences between harmful and harmless instruction pairs and applying Singular Value Decomposition (SVD), we find that LAT significantly alters the refusal representation, concentrating it in the first two SVD components, which explain approximately 75 percent of the variance in activation differences, significantly more than in reference models. This concentrated representation leads to more effective and transferable refusal vectors for ablation attacks: LAT models show improved robustness when attacked with vectors from reference models but become more vulnerable to self-generated vectors compared to SSFT and AT. Our findings suggest that LAT's training perturbations enable a more comprehensive representation of refusal behavior, highlighting both its potential strengths and vulnerabilities for improving model safety.
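The analysis described above, differencing activations on harmful versus harmless instruction pairs and decomposing the differences with SVD, can be sketched in a few lines. The snippet below is a hedged illustration under assumed inputs: `harmful_acts` and `harmless_acts` are cached residual-stream activations at one layer with shape (num_pairs, hidden_dim), and the variable names are illustrative rather than taken from the paper.

```python
# Hedged sketch of the activation-difference + SVD analysis; assumed cached
# activations of shape (num_pairs, hidden_dim), illustrative names.
import numpy as np

def refusal_directions(harmful_acts: np.ndarray, harmless_acts: np.ndarray, top_k: int = 2):
    """SVD of per-pair activation differences; top directions and explained variance."""
    diffs = harmful_acts - harmless_acts                 # (num_pairs, hidden_dim)
    diffs = diffs - diffs.mean(axis=0, keepdims=True)    # center before SVD
    _, s, vt = np.linalg.svd(diffs, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()
    return vt[:top_k], explained[:top_k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    harmful = rng.normal(size=(128, 512))
    harmless = rng.normal(size=(128, 512))
    dirs, var = refusal_directions(harmful, harmless)
    print(dirs.shape, var)   # (2, 512) directions and their variance fractions
```

In this framing, the paper's headline result is that the leading components of `explained` are far larger for LAT-trained models than for SSFT or AT baselines, and the rows of `dirs` serve as the refusal vectors used for ablation attacks.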
Enhancing CBMs Through Binary Distillation with Applications to Test-Time Intervention
Shen, Matthew, Hsu, Aliyah, Agarwal, Abhineet, Yu, Bin
Concept bottleneck models (CBMs) aim to improve model interpretability by predicting human-level "concepts" in a bottleneck within a deep learning model architecture. However, how the predicted concepts are used to predict the target either remains a black box or is simplified to maintain interpretability at the cost of prediction performance. We propose to use Fast Interpretable Greedy Sum-Trees (FIGS) to obtain Binary Distillation (BD). This new method, called FIGS-BD, distills a binary-augmented concept-to-target portion of the CBM into an interpretable tree-based model, while mimicking the competitive prediction performance of the CBM teacher. FIGS-BD can be used in downstream tasks to explain and decompose CBM predictions into interpretable binary-concept-interaction attributions and to guide adaptive test-time intervention. Across 4 datasets, we demonstrate that adaptive test-time intervention identifies key concepts that significantly improve performance in realistic human-in-the-loop settings that allow for limited concept interventions.
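The distillation step can be pictured as fitting an interpretable surrogate on the CBM's (binarized) concept predictions to mimic its target predictions. The sketch below is a minimal illustration under stated assumptions: the paper distills into FIGS, whereas a shallow scikit-learn tree stands in here, and `cbm_concept_probs` / `cbm_predictions` are hypothetical cached outputs of a trained CBM.

```python
# Minimal sketch of binary distillation of a CBM's concept-to-target head.
# A shallow sklearn tree stands in for FIGS; inputs are assumed cached CBM outputs.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def distill_concept_head(cbm_concept_probs: np.ndarray, cbm_predictions: np.ndarray,
                         threshold: float = 0.5, max_depth: int = 3):
    """Fit an interpretable surrogate on binary-augmented concepts to mimic the CBM teacher."""
    binary_concepts = (cbm_concept_probs >= threshold).astype(float)   # binary augmentation
    features = np.hstack([cbm_concept_probs, binary_concepts])
    surrogate = DecisionTreeRegressor(max_depth=max_depth)
    surrogate.fit(features, cbm_predictions)                            # distill teacher outputs
    return surrogate

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    concepts = rng.uniform(size=(200, 6))              # stand-in predicted concept probabilities
    teacher = concepts @ rng.normal(size=6)            # stand-in CBM target predictions
    model = distill_concept_head(concepts, teacher)
    features = np.hstack([concepts, (concepts >= 0.5).astype(float)])
    print(model.predict(features[:3]))
```

Because the surrogate's splits are over individual concepts and their binarized versions, its paths read as concept-interaction rules, which is what makes it usable for attribution and for choosing which concepts to query a human about at test time.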
Towards Understanding Distilled Reasoning Models: A Representational Approach
In this paper, we investigate how model distillation impacts the development of reasoning features in large language models (LLMs). To explore this, we train a crosscoder on Qwen-series models and their fine-tuned variants. Our results suggest that the crosscoder learns features corresponding to various types of reasoning, including self-reflection and computation verification. Moreover, we observe that distilled models contain unique reasoning feature directions, which could be used to steer the model into over-thinking or incisive-thinking mode. In particular, we perform analysis on four specific reasoning categories: (a) self-reflection, (b) deductive reasoning, (c) alternative reasoning, and (d) contrastive reasoning. Finally, we examine the changes in feature geometry resulting from the distillation process and find indications that larger distilled models may develop more structured representations, which correlate with enhanced distillation performance. By providing insights into how distillation modifies the model, our study contributes to enhancing the transparency and reliability of AI systems.
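Steering with a reasoning-feature direction, as described above, typically amounts to shifting a layer's activations along that direction during the forward pass. The sketch below is a hedged illustration of such an intervention via a PyTorch forward hook; the hook point, scale, and `feature_direction` are assumptions for exposition, not the paper's exact crosscoder-derived setup.

```python
# Hedged sketch of activation steering along a feature direction via a forward hook.
# The chosen layer, scale, and feature_direction are illustrative assumptions.
import torch

def add_steering_hook(layer: torch.nn.Module, feature_direction: torch.Tensor,
                      scale: float = 4.0):
    """Register a forward hook that shifts the layer's output along a unit direction."""
    direction = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)

if __name__ == "__main__":
    toy_layer = torch.nn.Linear(16, 16)                     # stand-in for a transformer block
    handle = add_steering_hook(toy_layer, torch.randn(16), scale=2.0)
    print(toy_layer(torch.randn(1, 16)).shape)              # steering applied during forward
    handle.remove()
```

In the paper's terms, choosing the sign and magnitude of `scale` for a distillation-specific reasoning direction is what nudges the model toward over-thinking or incisive-thinking behavior.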