AITopics | unaligned model

Collaborating Authors

unaligned model

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Who's the Evil Twin? Differential Auditing for Undesired Behavior

Balappanawar, Ishwar, Vattikuti, Venkata Hasith, Kintzley, Greta, Azimi-Mancel, Ronan, Golechha, Satvik

arXiv.org Artificial IntelligenceAug-25-2025

Detecting hidden behaviors in neural networks poses a significant challenge due to minimal prior knowledge and potential adversarial obfuscation. We explore this problem by framing detection as an adversarial game between two teams: the red team trains two similar models, one trained solely on benign data and the other trained on data containing hidden harmful behavior, with the performance of both being nearly indistinguishable on the benign dataset. The blue team, with limited to no information about the harmful behaviour, tries to identify the compromised model. We experiment using CNNs and try various blue team strategies, including Gaussian noise analysis, model diffing, integrated gradients, and adversarial attacks under different levels of hints provided by the red team. Results show high accuracy for adversarial-attack-based methods (100\% correct prediction, using hints), which is very promising, whilst the other techniques yield more varied performance. During our LLM-focused rounds, we find that there are not many parallel methods that we could apply from our study with CNNs. Instead, we find that effective LLM auditing methods require some hints about the undesired distribution, which can then used in standard black-box and open-weight methods to probe the models further and reveal their misalignment. We open-source our auditing games (with the model and data) and hope that our findings contribute to designing better audits.

blue team, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2508.06827

Country: North America > United States (0.92)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)
(3 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

Add feedback

Decoupled Alignment for Robust Plug-and-Play Adaptation

Luo, Haozheng, Yu, Jiahao, Zhang, Wenxin, Li, Jialong, Hu, Jerry Yao-Chieh, Xing, Xinyu, Liu, Han

arXiv.org Artificial IntelligenceJun-6-2024

This innovation is practically urgent and important. LLMs have been widely adopted in various applications recently, demonstrating their ability to generate high-quality human-like texts [Team et al., 2024, Touvron et al., 2023, Ivison et al., 2023]. However, the security of these models has become a significant concern due to the potential risks of generating harmful content [Wu et al., 2024a, Yu et al., 2024, 2023a, Chao et al., 2023, Deng et al., 2023]. To align the LLMs with ethical guidelines, researchers have developed various methods to enhance their safety. For example, the Llama-2-Chat [Touvron et al., 2023] and Gemma-it [Team et al., 2024] models have been extensively fine-tuned to improve their alignment performance. However, these methods often require extensive computational resources or manual red-teaming, which can be costly and time-consuming [Team et al., 2024, OpenAI, 2024, Bai et al., 2022, Ganguli et al., 2022]. Thus, most of the LLMs finetuned from the pre-trained models by third-party developers do not undergo the alignment process [Xu et al., 2024a, Chiang et al., 2023, Ivison et al., 2023], leaving them vulnerable to generating harmful content by users with malicious intent. To combat these issues, we seek motivations from knowledge distillation technologies [Xu et al., 2024b, Hahn and Choi, 2019], where a teacher model's knowledge is transferred to a student model. Specifically, through numerical experiments Figure 3 and Figure 4, we make two key detections: MLP Alignment.

arxiv preprint arxiv, experiment, language model, (13 more...)

arXiv.org Artificial Intelligence

2406.01514

Country: North America > United States > Illinois > Cook County > Evanston (0.04)

Genre: Research Report > New Finding (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A Thorough Examination of Decoding Methods in the Era of LLMs

Shi, Chufan, Yang, Haoran, Cai, Deng, Zhang, Zhisong, Wang, Yifan, Yang, Yujiu, Lam, Wai

arXiv.org Artificial IntelligenceFeb-10-2024

Decoding methods play an indispensable role in converting language models from next-token predictors into practical task solvers. Prior research on decoding methods, primarily focusing on task-specific models, may not extend to the current era of general-purpose large language models (LLMs). Moreover, the recent influx of decoding strategies has further complicated this landscape. This paper provides a comprehensive and multifaceted analysis of various decoding methods within the context of LLMs, evaluating their performance, robustness to hyperparameter changes, and decoding speeds across a wide range of tasks, models, and deployment environments. Our findings reveal that decoding method performance is notably task-dependent and influenced by factors such as alignment, model size, and quantization. Intriguingly, sensitivity analysis exposes that certain methods achieve superior performance at the cost of extensive hyperparameter tuning, highlighting the trade-off between attaining optimal results and the practicality of implementation in varying contexts.

gsm8k, hyperparameter, unaligned model, (15 more...)

arXiv.org Artificial Intelligence

2402.06925

Country:

North America > United States > Pennsylvania (0.04)
Europe > United Kingdom > Northern Ireland > County Down > Belfast (0.04)
Europe > United Kingdom > Northern Ireland > County Antrim > Belfast (0.04)
(12 more...)

Genre: Research Report > New Finding (0.48)

Industry: Leisure & Entertainment > Sports (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback