
Appendix A Related Work: A.1 Multimodal Large Language Models; A.2 Trustworthiness of LLMs

Neural Information Processing Systems

Building on the foundational capabilities of groundbreaking Large Language Models (LLMs) such as GPT [3], PaLM [6], Mistral [49], and LLaMA [108], which excel in language understanding and reasoning, recent innovations have integrated these models with other modalities (especially vision), leading to the development of Multimodal Large Language Models (MLLMs). These advanced MLLMs combine and process visual and textual data, demonstrating enhanced versatility in addressing both traditional vision tasks [21, 40, 42, 133] and complex multimodal challenges [34, 70, 136]. Among MLLMs, proprietary models consistently perform well. OpenAI's GPT-4-Vision [82] pioneered this space by adeptly handling both text and image content. Anthropic's Claude 3 series [7] integrates advanced vision capabilities and multilingual support, enhancing its application across diverse cognitive and real-time tasks.


A. About Equation 16

Neural Information Processing Systems

We only consider the feasible cases. Consider the sigm(·) function in Equation 1. For instance, suppose x is the parameter to be trained and we want x to satisfy sigm(x) = 1 or sigm(x) = 0; applying gradient descent would then drive x toward +∞ or −∞, respectively, since the sigmoid only saturates at these limits. Two basic blocks of fully connected + ReLU layers are used as the backbone for all benchmark datasets, generating instance-level representations with a dimensionality of 512. We use a single branch for binary-classification problems.
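
To make the two points above concrete, here is a minimal PyTorch sketch (all dimensions and names are our own assumptions, not taken from the paper): the first part shows gradient descent on a sigmoid target of 1 driving the parameter without bound, and the second builds two fully connected + ReLU blocks with 512-dimensional outputs and a single binary-classification branch.

```python
import torch
import torch.nn as nn

# Illustration 1: gradient descent on a sigmoid target of 1 drives the
# parameter x toward +infinity (sigm(x) saturates but never reaches 1).
x = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([x], lr=1.0)
for _ in range(200):
    loss = ((torch.sigmoid(x) - 1.0) ** 2).sum()  # want sigm(x) = 1
    opt.zero_grad()
    loss.backward()
    opt.step()
print(x.item())  # grows without bound; sigmoid(x) approaches 1

# Illustration 2: two fully connected + ReLU blocks as the backbone,
# yielding 512-dimensional instance-level representations, with a single
# branch for binary classification (input width 1024 is an assumption).
backbone = nn.Sequential(
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
)
branch = nn.Linear(512, 1)             # one branch: a single sigmoid logit
feats = backbone(torch.randn(8, 1024))  # (batch, 512)
probs = torch.sigmoid(branch(feats))    # binary scores in (0, 1)
```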


Ligang He

Neural Information Processing Systems

Anomaly detection in time series data is fundamental to the design, deployment, and evaluation of industrial control systems. Temporal modeling has been the natural focus of anomaly detection approaches for time series data. However, the focus on temporal modeling can obscure or dilute the spatial information that can be used to capture complex interactions in multivariate time series. In this paper, we propose SARAD, an approach that leverages spatial information beyond data autoencoding errors to improve the detection and diagnosis of anomalies. SARAD trains a Transformer to learn the spatial associations, the pairwise inter-feature relationships which ubiquitously characterize such feedback-controlled systems. As new associations form and old ones dissolve, SARAD applies subseries division to capture their changes over time. Anomalies exhibit association-descending patterns, a key phenomenon we observe and attribute to the disruptive nature of anomalies, which detaches anomalous features from the others. To exploit this phenomenon while dismissing non-anomalous descent, SARAD performs anomaly detection via autoencoding in the association space. We present experimental results demonstrating that SARAD achieves state-of-the-art performance, providing robust anomaly detection and a nuanced understanding of anomalous events.
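
As a rough illustration of the idea (not SARAD's actual implementation), the sketch below replaces the Transformer-learned associations with absolute Pearson correlations and the association-space autoencoder with a simple difference between consecutive subseries, just to show how "association descent" can be scored; the injected anomaly and all parameters are our own assumptions.

```python
import numpy as np

def association_matrix(window: np.ndarray) -> np.ndarray:
    """Pairwise inter-feature associations for one subseries. A stand-in
    for SARAD's Transformer-learned associations: absolute correlation."""
    c = np.corrcoef(window.T)  # (features, features)
    return np.abs(np.nan_to_num(c))

def association_descent_score(x: np.ndarray, sub_len: int) -> np.ndarray:
    """Split a multivariate series (time, features) into subseries and
    score each by how much its associations drop relative to the previous
    one; anomalies tend to detach anomalous features from the others."""
    subs = [x[i:i + sub_len] for i in range(0, len(x) - sub_len + 1, sub_len)]
    assoc = [association_matrix(s) for s in subs]
    scores = [0.0]
    for prev, cur in zip(assoc, assoc[1:]):
        # positive values = associations dissolving (descending pattern)
        scores.append(np.clip(prev - cur, 0, None).mean())
    return np.array(scores)

rng = np.random.default_rng(0)
t = np.linspace(0, 20, 600)
x = np.stack([np.sin(t), np.sin(t) + 0.1 * rng.standard_normal(600)], axis=1)
x[400:430, 1] = rng.standard_normal(30)  # anomaly: feature 1 detaches
print(association_descent_score(x, sub_len=60).round(3))
```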


Towards Text Generation with Adversarially Learned Neural Outlines

Neural Information Processing Systems

Recent progress in deep generative models has been fueled by two paradigms: autoregressive and adversarial models. We propose a combination of both approaches with the goal of learning generative models of text. Our method first produces a high-level sentence outline and then generates words sequentially, conditioning on both the outline and the previous outputs. We generate outlines with an adversarial model trained to approximate the distribution of sentences in a latent space induced by general-purpose sentence encoders. This provides strong, informative conditioning for the autoregressive stage. Our quantitative evaluations suggest that conditioning information from generated outlines is able to guide the autoregressive model to produce realistic samples, comparable to maximum-likelihood-trained language models, even at high temperatures with multinomial sampling. Qualitative results also demonstrate that this generative procedure yields natural-looking sentences and interpolations.
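
A minimal sketch of the two-stage pipeline, under our own assumptions (the embedding sizes, the MLP generator/discriminator pair, and a GRU decoder that takes the outline as its initial hidden state are illustrative choices, not necessarily the paper's architecture):

```python
import torch
import torch.nn as nn

EMB = 512  # assumed dimensionality of the sentence-embedding (outline) space

# Stage 1: adversarial model over the latent space of a sentence encoder.
# The generator maps noise to a plausible sentence embedding (the outline).
generator = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, EMB))
discriminator = nn.Sequential(nn.Linear(EMB, 512), nn.ReLU(), nn.Linear(512, 1))

# Stage 2: an autoregressive decoder conditioned on the outline; here the
# outline is injected as the initial hidden state of a GRU.
class OutlineConditionedDecoder(nn.Module):
    def __init__(self, vocab: int, emb: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.GRU(emb, EMB, batch_first=True)
        self.out = nn.Linear(EMB, vocab)

    def forward(self, tokens: torch.Tensor, outline: torch.Tensor):
        h0 = outline.unsqueeze(0)              # (1, batch, EMB)
        h, _ = self.rnn(self.embed(tokens), h0)
        return self.out(h)                     # next-token logits

decoder = OutlineConditionedDecoder(vocab=10000)
outline = generator(torch.randn(4, 128))       # sampled outlines
logits = decoder(torch.randint(0, 10000, (4, 12)), outline)
print(logits.shape)  # torch.Size([4, 12, 10000])
```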


Fox News AI Newsletter: Expert warns just 20 cloud images can make an AI deepfake video of your child

FOX News

Texas high school student Elliston Berry joins 'Fox & Friends' to discuss the House's passage of a new bill that criminalizes the sharing of non-consensual intimate images, including content created with artificial intelligence. Welcome to Fox News' Artificial Intelligence newsletter with the latest AI technology advancements.

IN TODAY'S NEWSLETTER:
- Peek-a-boo, big tech sees you: Expert warns just 20 cloud images can make an AI deepfake video of your child
- 5 AI terms you keep hearing and what they actually mean
- AI to monitor NYC subway safety as crime concerns rise

First Lady Melania Trump, joined by U.S. President Donald Trump, delivers remarks before President Trump signed the TAKE IT DOWN Act into law in the Rose Garden of the White House on May 19, 2025 in Washington, DC. The first lady made the Tools to Address Known Exploitation by Immobilizing Technological Deepfakes on Websites and Networks (TAKE IT DOWN) Act a priority, traveling to Capitol Hill to lobby lawmakers and show her support for the legislation, which addresses non-consensual intimate imagery, or "revenge porn," and artificial intelligence deepfakes posted online and to social media. DEEPFAKE DANGERS: Parents love capturing their kids' big moments, from first steps to birthday candles.


Enhancing Consistency-Based Image Generation via Adversarially-Trained Classification and Energy-Based Discrimination

Neural Information Processing Systems

The recently introduced Consistency models pose an efficient alternative to diffusion algorithms, enabling rapid and good-quality image synthesis. These methods overcome the slowness of diffusion models by directly mapping noise to data, while maintaining a (relatively) simpler training procedure. Consistency models enable fast one- or few-step generation, but they typically fall somewhat short in sample quality when compared to their diffusion origins. In this work we propose a novel and highly effective technique for post-processing Consistency-based generated images, enhancing their perceptual quality. Our approach utilizes a joint classifier-discriminator model, in which both portions are trained adversarially. While the classifier aims to grade an image based on its assignment to a designated class, the discriminator portion of the very same network leverages the softmax values to assess the proximity of the input image to the targeted data manifold, thereby serving as an Energy-based Model. By employing example-specific projected gradient iterations under the guidance of this joint machine, we refine synthesized images and achieve improved FID scores on the ImageNet 64x64 dataset for both Consistency-Training and Consistency-Distillation techniques.
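
The refinement step can be sketched as follows; the energy E(x) = -logsumexp(f(x)) is a common softmax-derived energy used here as a stand-in for the paper's discriminator head, and the step size, iteration count, and projection radius are our own assumptions.

```python
import torch
import torch.nn as nn

def refine(x0: torch.Tensor, model: nn.Module, target: torch.Tensor,
           steps: int = 10, lr: float = 0.01, eps: float = 0.1) -> torch.Tensor:
    """Example-specific projected gradient refinement (a sketch).
    Descends a softmax-derived energy plus the target-class score, and
    projects each iterate back to an eps-ball around the original sample."""
    x = x0.clone().detach().requires_grad_(True)
    for _ in range(steps):
        logits = model(x)
        energy = -torch.logsumexp(logits, dim=1)           # low = near manifold
        cls = logits.gather(1, target[:, None]).squeeze(1)  # class assignment
        loss = (energy - cls).sum()
        (grad,) = torch.autograd.grad(loss, x)
        with torch.no_grad():
            x -= lr * grad                                # descend the energy
            x.copy_(x0 + (x - x0).clamp(-eps, eps))       # project to eps-ball
            x.clamp_(-1, 1)                               # keep valid pixels
    return x.detach()

# Usage with a toy classifier over flattened 64x64 RGB images:
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1000))
x0 = torch.rand(2, 3, 64, 64) * 2 - 1
refined = refine(x0, model, target=torch.tensor([3, 7]))
```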


Jiashi Gao

Neural Information Processing Systems

Federated learning (FL) offers a machine learning paradigm that protects privacy, allowing multiple clients to collaboratively train a global model while only accessing their local data. Recent research in FL has increasingly focused on improving the uniformity of model performance across clients, a fairness principle known as egalitarian fairness. However, achieving egalitarian fairness in FL may sacrifice the model performance for data-rich clients to benefit those with less data. This tradeoff raises concerns about the stability of FL, as data-rich clients may opt to leave the current coalition and join another that is more closely aligned with their expected high performance. In this context, our work rigorously addresses the critical question: does egalitarian fairness lead to instability? Drawing from game theory and social choice theory, we first characterize fair FL systems as altruism coalition formation games (ACFGs) and reveal that the instability issues emerging from the pursuit of egalitarian fairness are significantly related to the clients' altruism within the coalition and the configuration of the friends-relationship networks among the clients. We then theoretically derive the optimal egalitarian fairness bounds that an FL coalition can achieve while maintaining core stability under various types of altruistic behaviors. These theoretical contributions clarify the quantitative relationships between achievable egalitarian fairness and the disparities in the sizes of local datasets, disproving the misconception that egalitarian fairness inevitably leads to instability. Finally, we conduct experiments to evaluate the consistency of our theoretically derived egalitarian fairness bounds with the empirically achieved egalitarian fairness in fair FL settings.
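
To make the core-stability notion concrete, here is a toy coalition game in the spirit of ACFGs (the payoff function, fairness cap, and dataset sizes are purely illustrative, not the paper's model): with a sufficiently tight egalitarian fairness cap, the data-rich clients form a blocking coalition and leave.

```python
from itertools import combinations

# Toy payoffs: a client's accuracy grows with the coalition's pooled data,
# while an egalitarian fairness cap limits how far anyone may rise above
# the worst-off member. All numbers and functional forms are illustrative.
SIZES = {"A": 10_000, "B": 6_000, "C": 800}
FAIR_CAP = 0.02  # tight cap: strong egalitarian fairness

def payoffs(coalition):
    pooled = sum(SIZES[c] for c in coalition)
    raw = {c: 0.6 + 0.3 * (0.5 * pooled + 0.5 * SIZES[c]) / 20_000
           for c in coalition}
    floor = min(raw.values())
    return {c: min(v, floor + FAIR_CAP) for c, v in raw.items()}

def blocking_coalition(clients):
    """Core stability: no subset may deviate and strictly improve all members."""
    grand = payoffs(clients)
    for r in range(1, len(clients)):
        for S in combinations(clients, r):
            dev = payoffs(S)
            if all(dev[c] > grand[c] for c in S):
                return S  # data-rich clients leave the coalition
    return None

print(blocking_coalition(("A", "B", "C")))  # ('A', 'B') under the tight cap
```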


Surround Modulation: A Bio-inspired Connectivity Structure for Convolutional Neural Networks

Neural Information Processing Systems

Numerous neurophysiological studies have revealed that a large number of primary visual cortex neurons operate in a regime called surround modulation. Surround modulation has a substantial effect on various perceptual tasks, and it also plays a crucial role in the efficient neural coding of the visual cortex. Inspired by the notion of surround modulation, we designed new excitatory-inhibitory connections between a unit and its surrounding units in the convolutional neural network (CNN) to achieve a more biologically plausible network. Our experiments show that this simple mechanism can considerably improve both the performance and training speed of traditional CNNs in visual tasks. We further explore additional outcomes of the proposed structure. We first evaluate the model under several visual challenges, such as the presence of clutter or changes in lighting conditions, and show its superior generalization capability in handling these challenging situations. We then study possible changes in the statistics of neural activities, such as sparsity and decorrelation, and provide further insight into the underlying efficiencies of surround modulation. Experimental results show that importing surround modulation into the convolutional layers induces various effects analogous to those derived from surround modulation in the visual cortex.
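
One plausible way to realize such excitatory-inhibitory surround connections, sketched below under our own assumptions, is a fixed difference-of-Gaussians kernel (excitatory center, inhibitory surround) applied channel-wise to the activations of a convolutional layer; the kernel size and sigmas are illustrative, not the paper's exact connectivity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dog_kernel(size: int = 5, sigma_c: float = 1.0, sigma_s: float = 2.5):
    """Difference-of-Gaussians: excitatory center minus inhibitory surround,
    the classical model of surround modulation (parameters are assumptions)."""
    ax = torch.arange(size) - size // 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    d2 = (xx**2 + yy**2).float()
    center = torch.exp(-d2 / (2 * sigma_c**2))
    surround = torch.exp(-d2 / (2 * sigma_s**2))
    return center / center.sum() - surround / surround.sum()

class SurroundModulation(nn.Module):
    """Fixed excitatory-inhibitory lateral connections between a unit and
    its spatial neighbors, applied channel-wise after a conv layer."""
    def __init__(self, channels: int, size: int = 5):
        super().__init__()
        k = dog_kernel(size).expand(channels, 1, size, size).clone()
        self.register_buffer("kernel", k)
        self.pad = size // 2

    def forward(self, x):
        mod = F.conv2d(x, self.kernel, padding=self.pad, groups=x.shape[1])
        return F.relu(x + mod)  # modulated activations

layer = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), SurroundModulation(16))
print(layer(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```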



Grounded Answers for Multi-agent Decision-making Problem through Generative World Model

Neural Information Processing Systems

Recent progress in generative models has stimulated significant innovations in many fields, such as image generation and chatbots. Despite their success, these models often produce sketchy and misleading solutions for complex multi-agent decision-making problems because they lack the trial-and-error experience and reasoning of humans. To address this limitation, we explore a paradigm that integrates a language-guided simulator into the multi-agent reinforcement learning pipeline to enhance the generated answer. The simulator is a world model that separately learns dynamics and reward, where the dynamics model comprises an image tokenizer as well as a causal transformer to generate interaction transitions autoregressively, and the reward model is a bidirectional transformer learned by maximizing the likelihood of trajectories in the expert demonstrations under language guidance. Given an image of the current state and the task description, we use the world model to train the joint policy and produce the image sequence as the answer by running the converged policy on the dynamics model. The empirical results demonstrate that this framework can improve the answers for multi-agent decision-making problems by showing superior performance on the training and unseen tasks of the StarCraft Multi-Agent Challenge benchmark. In particular, it can generate consistent interaction sequences and explainable reward functions at interaction states, opening the path for training generative models of the future.
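
Schematically, the train-inside-the-world-model loop looks like the sketch below, with every component reduced to a stub (linear layers stand in for the image tokenizer, causal-transformer dynamics, and bidirectional-transformer reward; all sizes are our own assumptions).

```python
import torch
import torch.nn as nn

STATE, N_AGENTS, N_ACTIONS = 64, 3, 5

dynamics = nn.Linear(STATE + N_AGENTS * N_ACTIONS, STATE)  # stub transition
reward = nn.Linear(STATE + N_AGENTS * N_ACTIONS, 1)         # stub reward
policy = nn.Linear(STATE, N_AGENTS * N_ACTIONS)             # joint policy

def imagine(state, horizon=10):
    """Roll out the policy inside the learned world model, collecting the
    imagined trajectory used both for policy training and, once the policy
    has converged, as the interaction-sequence 'answer'."""
    traj = []
    for _ in range(horizon):
        logits = policy(state).view(N_AGENTS, N_ACTIONS)
        actions = torch.distributions.Categorical(logits=logits).sample()
        onehot = nn.functional.one_hot(actions, N_ACTIONS).float().flatten()
        inp = torch.cat([state, onehot])
        r = reward(inp)       # learned reward, not the environment's
        state = dynamics(inp)  # learned transition, rolled autoregressively
        traj.append((state, actions, r))
    return traj

traj = imagine(torch.randn(STATE))
print(len(traj), traj[0][1])  # 10 imagined steps; first joint action
```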