
Collaborating Authors: Yin, Xuwang


The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

arXiv.org Artificial Intelligence

As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To address these concerns, a body of work has emerged around the notion of "honesty" in LLMs, along with interventions aimed at mitigating deceptive behaviors. However, evaluations of honesty are currently highly limited, with no benchmark combining large scale and applicability to all models. Moreover, many benchmarks claiming to measure honesty in fact simply measure accuracy--the correctness of a model's beliefs--in disguise. In this work, we introduce a large-scale human-collected dataset for measuring honesty directly, allowing us to disentangle accuracy from honesty for the first time. Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest. Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured to do so, resulting in low honesty scores on our benchmark. We find that simple methods, such as representation engineering interventions, can improve honesty. These results underscore the growing need for robust evaluations and effective interventions to ensure LLMs remain trustworthy.
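
As a toy illustration of the accuracy/honesty distinction drawn above, the sketch below scores the two properties separately: accuracy compares a model's belief against ground truth, while honesty compares what the model says under pressure against its own belief. The function names and exact-string matching are illustrative simplifications, not the MASK benchmark's actual scoring pipeline.

```python
# Hypothetical sketch of how honesty and accuracy come apart; none of these
# names come from the MASK benchmark itself.

def accuracy(belief: str, ground_truth: str) -> bool:
    # Accuracy: does the model's belief match the world?
    return belief == ground_truth

def honesty(statement_under_pressure: str, belief: str) -> bool:
    # Honesty: does the model's statement match its own belief,
    # regardless of whether that belief is correct?
    return statement_under_pressure == belief

if __name__ == "__main__":
    belief, truth, statement = "Paris", "Paris", "Lyon"
    print(accuracy(belief, truth))      # True: the belief is correct
    print(honesty(statement, belief))   # False: the model lied under pressure
```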


Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

arXiv.org Artificial Intelligence

As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
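
A minimal sketch of the utility-function framing, assuming pairwise preference data sampled from a model: scalar utilities are fit with a Bradley-Terry-style likelihood, and the quality of such a fit is one way to operationalize how internally coherent the preferences are. The data, hyperparameters, and fitting procedure here are illustrative rather than the paper's.

```python
# Fit scalar utilities to pairwise preferences (Bradley-Terry-style sketch).
import numpy as np

# Each tuple (i, j) means outcome i was preferred to outcome j in one sample.
preferences = [(0, 1), (0, 1), (1, 2), (0, 2), (2, 1)]
n_outcomes = 3

u = np.zeros(n_outcomes)              # utilities to be learned
lr = 0.1
for _ in range(2000):
    grad = np.zeros(n_outcomes)
    for i, j in preferences:
        p = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))  # P(i preferred to j)
        grad[i] += 1.0 - p
        grad[j] -= 1.0 - p
    u += lr * grad
    u -= u.mean()                     # utilities are only defined up to a shift

print(u)  # higher value = more preferred; coherence shows up as a good fit
```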


HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

arXiv.org Artificial Intelligence

Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables the co-development of attacks and defenses. We open-source HarmBench at https://github.com/centerforaisafety/HarmBench.
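
The sketch below is a schematic red-teaming evaluation loop of the kind such a framework standardizes, not the actual HarmBench API: an attack proposes a test case for each harmful behavior, the target model responds, a judge decides whether the behavior was elicited, and the aggregate is an attack success rate. All callables are toy stand-ins.

```python
# Schematic attack-success-rate evaluation loop (illustrative, not HarmBench's API).

def attack_success_rate(behaviors, attack, target_model, judge):
    successes = 0
    for behavior in behaviors:
        test_case = attack(behavior)          # adversarial prompt for this behavior
        completion = target_model(test_case)  # target LLM's response
        if judge(behavior, completion):       # did the response exhibit the behavior?
            successes += 1
    return successes / len(behaviors)

# Toy stand-ins so the sketch runs end to end.
behaviors = ["behavior A", "behavior B"]
attack = lambda b: f"Please do: {b}"
target_model = lambda prompt: "I cannot help with that."
judge = lambda behavior, completion: "cannot" not in completion

print(attack_success_rate(behaviors, attack, target_model, judge))  # 0.0
```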


Learning Globally Optimized Language Structure via Adversarial Training

arXiv.org Artificial Intelligence

Recent work has explored integrating autoregressive language models with energy-based models (EBMs) to enhance text generation capabilities. However, learning effective EBMs for text is challenged by the discrete nature of language. This work proposes an adversarial training strategy to address limitations in prior efforts. Specifically, an iterative adversarial attack algorithm is presented to generate negative samples for training the EBM by perturbing text from the autoregressive model. This aims to enable the EBM to suppress spurious modes outside the support of the data distribution. Experiments on an arithmetic sequence generation task demonstrate that the proposed adversarial training approach can substantially enhance the quality of generated sequences compared to prior methods. The results highlight the promise of adversarial techniques to improve discrete EBM training. Key contributions include: (1) an adversarial attack strategy tailored to text to generate negative samples, circumventing MCMC limitations; (2) an adversarial training algorithm for EBMs leveraging these attacks; (3) empirical validation of performance improvements on a sequence generation task.
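
A minimal sketch, under toy assumptions, of the training loop the abstract describes: a greedy token-substitution attack lowers the EBM's energy on sequences drawn from a base sampler (searching for spurious modes), and the EBM is then trained to raise the energy of these negatives relative to data. The architecture, attack schedule, and loss below are illustrative choices, not the paper's exact setup.

```python
# Toy sequence EBM trained with adversarially perturbed negatives.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN = 10, 6

class SeqEBM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 16)
        self.mlp = nn.Sequential(nn.Linear(16 * SEQ_LEN, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, x):                                    # x: (batch, SEQ_LEN) token ids
        return self.mlp(self.emb(x).flatten(1)).squeeze(-1)  # energy per sequence

def greedy_attack(ebm, x, n_steps=3):
    """Swap tokens in x to minimize its energy (search for spurious low-energy modes)."""
    x = x.clone()
    for _ in range(n_steps):
        pos = torch.randint(SEQ_LEN, (1,)).item()
        candidates = x.repeat(VOCAB, 1)
        candidates[:, pos] = torch.arange(VOCAB)
        with torch.no_grad():
            x = candidates[ebm(candidates).argmin()].unsqueeze(0)
    return x

ebm = SeqEBM()
opt = torch.optim.Adam(ebm.parameters(), lr=1e-3)
data = torch.randint(VOCAB, (1, SEQ_LEN))                 # stand-in for real sequences
for step in range(50):
    base_sample = torch.randint(VOCAB, (1, SEQ_LEN))      # stand-in for autoregressive samples
    negative = greedy_attack(ebm, base_sample)
    # Contrastive objective: lower energy on data, raise it on adversarial negatives.
    loss = ebm(data).mean() - ebm(negative).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```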


Representation Engineering: A Top-Down Approach to AI Transparency

arXiv.org Artificial Intelligence

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.
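
A minimal sketch of the population-level flavor of RepE, assuming we already have hidden-state activations collected under two contrasting conditions (e.g., honest vs. dishonest completions): a "reading" direction is extracted from activation differences and then used to monitor or steer new activations. Real RepE pipelines (e.g., PCA over contrast pairs at selected layers) are more involved; the data here is synthetic.

```python
# Difference-of-means reading vector over placeholder activations.
import numpy as np

rng = np.random.default_rng(0)
d = 128
honest_acts = rng.normal(0.0, 1.0, size=(200, d)) + 0.5     # placeholder activations
dishonest_acts = rng.normal(0.0, 1.0, size=(200, d)) - 0.5

# "Reading vector": direction separating the two conditions.
direction = honest_acts.mean(0) - dishonest_acts.mean(0)
direction /= np.linalg.norm(direction)

# Monitoring: project a new activation onto the direction.
new_act = rng.normal(size=d)
honesty_score = new_act @ direction

# Control: nudge the hidden state along the direction before it is passed on.
alpha = 2.0
steered_act = new_act + alpha * direction
print(round(float(honesty_score), 3), round(float(steered_act @ direction), 3))
```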


Learning Energy-Based Models With Adversarial Training

arXiv.org Artificial Intelligence

We study a new approach to learning energy-based models (EBMs) based on adversarial training (AT). We show that (binary) AT learns a special kind of energy function that models the support of the data distribution, and that the learning process is closely related to MCMC-based maximum likelihood learning of EBMs. We further propose improved techniques for generative modeling with AT, and demonstrate that this new approach is capable of generating diverse and realistic images. Aside from having image generation performance competitive with explicit EBMs, the studied approach is stable to train, is well-suited for image translation tasks, and exhibits strong out-of-distribution adversarial robustness. Our results demonstrate the viability of the AT approach to generative modeling, suggesting that AT is a competitive alternative for learning EBMs.
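
One way to read the binary-AT view described above is sketched below on toy 2-D data: a classifier is trained to separate data from adversarially perturbed contrast samples, and its logit can be treated as a negative energy that carves out the support of the data distribution. The contrast distribution, PGD settings, and architecture are illustrative assumptions, not the paper's configuration.

```python
# Binary adversarial training read as learning an energy over 2-D toy data.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # logit = -energy
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def pgd_toward_data(x, steps=5, step_size=0.1, eps=0.5):
    """Perturb contrast samples to maximize the 'data' logit (the attack)."""
    x0 = x.detach()
    x = x0.clone().requires_grad_(True)
    for _ in range(steps):
        grad, = torch.autograd.grad(net(x).sum(), x)
        x = torch.max(torch.min(x + step_size * grad.sign(), x0 + eps), x0 - eps)
        x = x.detach().requires_grad_(True)
    return x.detach()

for step in range(200):
    data = torch.randn(64, 2) * 0.3 + torch.tensor([1.0, 1.0])   # toy data cluster
    contrast = torch.rand(64, 2) * 4 - 2                          # uniform contrast samples
    adv = pgd_toward_data(contrast)                               # adversarial negatives
    logits = torch.cat([net(data), net(adv)]).squeeze(-1)
    labels = torch.cat([torch.ones(64), torch.zeros(64)])
    loss = bce(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, -net(x) behaves like an energy: low on the data cluster, high elsewhere.
print(net(torch.tensor([[1.0, 1.0], [-2.0, -2.0]])).detach())
```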


Generative Robust Classification

arXiv.org Artificial Intelligence

Training an adversarially robust discriminative (i.e., softmax) classifier has been the dominant approach to robust classification. Building on recent work on adversarial training (AT)-based generative models, we investigate using AT to learn unnormalized class-conditional density models and then performing generative robust classification. Our results show that, under the condition of similar model capacities, the generative robust classifier achieves performance comparable to a baseline softmax robust classifier when the test data is clean or when the test perturbation is of limited size, and much better performance when the test perturbation size exceeds the training perturbation size. The generative classifier is also able to generate samples or counterfactuals that more closely resemble the training data, suggesting that it can better capture the class-conditional distributions. In contrast to standard discriminative adversarial training, where advanced data augmentation techniques are only effective when combined with weight averaging, we find it straightforward to apply advanced data augmentation to achieve better robustness in our approach. Our results suggest that the generative classifier is a competitive alternative for robust classification, especially for problems with a limited number of classes.
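
The decision rule implied above can be sketched as follows: with one unnormalized class-conditional density (energy) model per class, classification is an argmin over per-class energies evaluated on the input. The energy networks below are untrained placeholders; in the paper they would be learned with adversarial training.

```python
# Generative classification as an argmin over per-class energies (placeholder models).
import torch
import torch.nn as nn

n_classes, d = 3, 32
class_energies = [nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
                  for _ in range(n_classes)]

def generative_classify(x):
    # Stack E_y(x) for every class y and pick the lowest-energy class.
    energies = torch.stack([E(x).squeeze(-1) for E in class_energies], dim=-1)
    return energies.argmin(dim=-1)

x = torch.randn(5, d)
print(generative_classify(x))   # predicted class index per example
```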


Neural Networks, Hypersurfaces, and Radon Transforms

arXiv.org Machine Learning

Connections between integration along hypersurfaces, Radon transforms, and neural networks are exploited to highlight an integral-geometric interpretation of neural networks. By analyzing the properties of neural networks as operators on probability distributions of observed data, we show that the distribution of outputs for any node in a neural network can be interpreted as a nonlinear projection along hypersurfaces defined by level surfaces over the input data space. We utilize these descriptions to provide new interpretations of phenomena such as nonlinearity, pooling, activation functions, and adversarial examples in neural network-based learning problems.
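
In rough notation (mine, not the paper's), the interpretation alluded to above is that the output distribution of a node is a pushforward of the data distribution along the node's level surfaces, which for an affine pre-activation reduces to a slice of the classical Radon transform:

```latex
% Output distribution of a node f as integration along its level surfaces
% (notation is illustrative, not taken from the paper).
\[
  p_Z(t) \;=\; \int_{\mathbb{R}^d} p_X(x)\,\delta\bigl(t - f(x)\bigr)\,dx ,
\]
% which, for an affine pre-activation f(x) = w^\top x + b, coincides with the
% classical Radon transform of p_X evaluated at (t - b, w):
\[
  (\mathcal{R} p_X)(t, w) \;=\; \int_{\mathbb{R}^d} p_X(x)\,\delta\bigl(t - w^\top x\bigr)\,dx .
\]
```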


Divide-and-Conquer Adversarial Detection

arXiv.org Machine Learning

The vulnerability of deep neural networks to adversarial examples has become a major concern for deploying these models in sensitive domains. Devising a definitive defense against such attacks has proven challenging, and methods that rely on detecting adversarial samples have been shown to be effective only when the attacker is oblivious to the detection mechanism, i.e., in non-adaptive attacks. In this paper, we propose an effective and practical method for detecting adaptive/dynamic adversaries. In short, we train adversary-robust auxiliary detectors to discriminate in-class natural examples from adversarially crafted out-of-class examples. To identify a potential adversary, we first obtain the estimated class of the input using the classification system, and then use the corresponding detector to verify whether the input is a natural example of that class or an adversarially manipulated one. Experimental results on the MNIST and CIFAR10 datasets show that our method can withstand adaptive PGD attacks. Furthermore, we demonstrate that with our novel training scheme our models learn significantly more robust representations than ordinary adversarial training.
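
A schematic sketch of the classify-then-verify pipeline described above: the main classifier proposes a class, and that class's detector checks whether the input looks like a natural example of it; otherwise the input is flagged as adversarial. The classifier and detectors below are toy stand-ins, not trained robust models.

```python
# Classify-then-verify detection pipeline with toy stand-in components.

def detect(x, classifier, detectors):
    predicted_class = classifier(x)              # step 1: tentative class
    is_natural = detectors[predicted_class](x)   # step 2: per-class verification
    return predicted_class, (not is_natural)     # flag as adversarial if rejected

# Toy stand-ins so the sketch runs.
classifier = lambda x: 0 if sum(x) > 0 else 1
detectors = {0: lambda x: all(v >= 0 for v in x),
             1: lambda x: all(v <= 0 for v in x)}

print(detect([0.2, 0.3], classifier, detectors))   # (0, False): accepted as natural
print(detect([0.9, -0.2], classifier, detectors))  # (0, True): flagged as adversarial
```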