Reinforcement Learning with Adaptive Regularization for Safe Control of Critical Systems
Reinforcement Learning (RL) is a powerful method for controlling dynamic systems, but its learning mechanism can lead to unpredictable actions that undermine the safety of critical systems. Here, we propose RL with Adaptive Regularization (RL-AR), an algorithm that enables safe RL exploration by combining the RL policy with a policy regularizer that hard-codes the safety constraints. RL-AR performs policy combination via a "focus module," which determines the appropriate combination depending on the state: relying more on the safe policy regularizer for less-exploited states while allowing unbiased convergence for well-exploited states. In a series of critical control applications, we demonstrate that RL-AR not only ensures safety during training but also achieves a return competitive with that of standard model-free RL, which disregards safety.
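As a rough illustration of the state-dependent combination idea (not the paper's actual focus module, which is learned; the names `focus_weight`, `visit_counts`, and the exponential decay schedule are all hypothetical), a weight that decays with how often a state has been exploited can interpolate between the safe regularizer's action and the RL action:

```python
import math

# Hypothetical sketch: RL-AR-style combination of a safe regularizer policy
# with a learned RL policy via a state-dependent weight. For rarely visited
# states the weight is near 1 (trust the safe policy); as a state becomes
# well-exploited it decays toward 0 (trust the RL policy).

visit_counts = {}  # how often each (discretized) state has been visited

def focus_weight(state, scale=10.0):
    """Return a weight in (0, 1] that decays with the state's visit count."""
    n = visit_counts.get(state, 0)
    return math.exp(-n / scale)

def combined_action(state, safe_action, rl_action):
    """Convex combination of the two policies' actions (continuous control)."""
    beta = focus_weight(state)
    visit_counts[state] = visit_counts.get(state, 0) + 1
    return beta * safe_action + (1.0 - beta) * rl_action
```

Under this sketch, early visits to a state act almost entirely on the safe policy, and the combination converges to the unmodified RL action as the state becomes well-exploited, mirroring the "unbiased convergence for well-exploited states" behavior described above.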
A OpenXLand Components
In this environment, the agent receives reward when the orange sphere makes contact with the blue pyramid. We see that the orange sphere is elevated, and therefore the agent must find it and use the ramps to access it. As for the blue pyramid, we do not see it because it is not there: the agent must first bring the orange sphere near the black rounded cube to spawn one. This environment also contains a grey pyramid that serves as a distraction. Importantly, if the agent brings the grey pyramid near the black rounded cube, both will disappear, making it impossible for the agent to spawn a blue pyramid and subsequently obtain its reward.
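The production rules described above can be sketched as a toy state machine (the object names, dictionary layout, and `step_rules` helper are illustrative only, not the toolkit's API):

```python
# Toy sketch of the spawn/destroy rules in the environment described above.
# The world is a plain dict; "near_black_cube" names the object currently
# adjacent to the black rounded cube, if any.

def step_rules(world):
    """Apply one round of production rules.

    - Grey (distractor) pyramid near the cube: both disappear; the task
      becomes unsolvable because the blue pyramid can no longer spawn.
    - Orange sphere near the cube (cube still present): spawn the blue
      pyramid, i.e. the reward object.
    """
    near = world.get("near_black_cube")
    if near == "grey_pyramid" and "black_cube" in world["objects"]:
        world["objects"].discard("grey_pyramid")
        world["objects"].discard("black_cube")
    elif near == "orange_sphere" and "black_cube" in world["objects"]:
        world["objects"].add("blue_pyramid")
    return world
```

With this sketch, touching the cube with the distractor first permanently removes the only way to spawn the goal object, which is exactly the trap the environment sets.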
Using Unity to Help Solve Reinforcement Learning
Connor Brennan, Andrew Robert Williams
Leveraging the depth and flexibility of XLand as well as the rapid prototyping features of the Unity engine, we present the United Unity Universe, an open-source toolkit designed to accelerate the creation of innovative reinforcement learning environments. This toolkit includes a robust implementation of OpenXLand, a framework for meta-RL based on XLand 2.0 [23], complemented by a user-friendly interface which allows users to modify the details of procedurally generated terrains and task rules with ease. Along with a ready-to-use implementation of OpenXLand, we provide a curated selection of terrains and rule sets, accompanied by implementations of reinforcement learning baselines to facilitate quick experimentation with novel architectural designs for adaptive agents. Furthermore, we illustrate how the United Unity Universe serves as a high-level language that enables researchers to develop diverse and endlessly variable 3D environments within a unified framework. This functionality establishes the United Unity Universe (U3) as an essential tool for advancing the field of reinforcement learning, especially in the development of adaptive and generalizable learning systems.
A reference family is PT-suitable if:
1. (Full support): supp(π)
2. (Regularity): the log-likelihood ratio between π
B.1 Conditional convergence in distribution

Lemma B.1. Suppose (X, d) is a metric space. The proof of this lemma is identical to that of the portmanteau lemma for weak convergence, replacing probabilities/expectations with conditional probabilities/expectations (see, e.g., [38, Section 2.1]).

Lemma B.2. Suppose X_m converges in distribution to X and A_m converges in distribution to A as m tends to infinity, where A is a constant a.s.; then A_m X_m converges in distribution to AX. Proof sketch: we can exchange the expectation and limit by the dominated convergence theorem, and the result follows by taking ε to 0. Since A is a.s. constant, for any K > 0 the map x to min(x, K) is bounded and continuous, which allows truncation. The conclusion follows by applying the continuous mapping theorem with the function (x, A) mapping to Ax.

B.2 Model assumptions

The following assumptions are used only to prove the large-data limit results of Proposition 3.1, Proposition 3.2, and Proposition 3.3. Throughout, a subscript m indicates that the quantity depends on the data. For the remainder of this section we assume the following regularity conditions.
AdjointDEIS: Efficient Gradients for Diffusion Models
The optimization of the latents and parameters of diffusion models with respect to some differentiable metric defined on the output of the model is a challenging and complex problem. Sampling from diffusion models is done by solving either the probability flow ODE or the diffusion SDE, wherein a neural network approximates the score function, allowing a numerical ODE/SDE solver to be used. However, naïve backpropagation techniques are memory intensive, requiring the storage of all intermediate states, and face additional complexity in handling the injected noise from the diffusion term of the diffusion SDE. We propose a novel family of bespoke ODE solvers for the continuous adjoint equations of diffusion models, which we call AdjointDEIS. We exploit the unique construction of diffusion SDEs to further simplify the formulation of the continuous adjoint equations using exponential integrators. Moreover, we provide convergence order guarantees for our bespoke solvers. Significantly, we show that the continuous adjoint equations for diffusion SDEs actually simplify to a simple ODE. Lastly, we demonstrate the effectiveness of AdjointDEIS for guided generation with an adversarial attack in the form of the face morphing problem.
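A minimal sketch of the continuous adjoint idea the method builds on, for a scalar linear ODE with plain Euler steps (this is only the textbook adjoint method, not the paper's exponential-integrator solvers; all names are illustrative):

```python
import math

# Toy continuous adjoint method: for dx/dt = theta * x with loss L = x(T),
# the adjoint a(t) = dL/dx(t) solves da/dt = -theta * a backward from
# a(T) = 1, and a(0) is the gradient of the loss w.r.t. the initial state.

def gradient_via_adjoint(theta=0.5, x0=2.0, T=1.0, n=10000):
    dt = T / n
    # Forward solve dx/dt = theta * x with Euler. (For this linear f the
    # adjoint does not need x(t); a nonlinear f would require storing or
    # recomputing the forward states during the backward pass.)
    x = x0
    for _ in range(n):
        x = x + dt * theta * x
    # Backward solve da/dt = -theta * a from a(T) = dL/dx(T) = 1.
    # Stepping from t to t - dt: a(t - dt) = a(t) + dt * theta * a(t).
    a = 1.0
    for _ in range(n):
        a = a + dt * theta * a
    return a  # approximately dL/dx(0) = exp(theta * T)
```

For this toy problem the analytic gradient is d/dx0 of x0 * exp(theta * T), i.e. exp(0.5) for the defaults, and the adjoint pass reaches it with O(1) memory, which is the efficiency motivation behind adjoint-based solvers.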
Data Distribution Valuation
Data valuation is a class of techniques for quantitatively assessing the value of data for applications like pricing in data marketplaces. Existing data valuation methods define a value for a discrete dataset. However, in many use cases, users are interested not only in the value of a dataset, but also in the value of the distribution from which the dataset was sampled. For example, consider a buyer trying to evaluate whether to purchase data from different vendors. The buyer may observe (and compare) only a small preview sample from each vendor to decide which vendor's data distribution is most useful, and purchase accordingly. The core question is: how should we compare the values of data distributions from their samples? Under a Huber characterization of the data heterogeneity across vendors, we propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies for comparing data distributions from samples. We empirically demonstrate that our method is sample-efficient and effective in identifying valuable data distributions against several existing baselines, on multiple real-world datasets (e.g., network intrusion detection, credit card fraud detection) and downstream applications (classification, regression).
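For intuition, the core discrepancy computation can be sketched in a few lines, assuming an RBF kernel and one-dimensional samples (this is only the plug-in MMD estimator, not the paper's full valuation policy; `rbf` and `mmd2` are hypothetical names):

```python
import math, random

# Biased plug-in estimator of squared MMD between two samples under an
# RBF kernel. In a preview-sample setting, a vendor whose sample has a
# smaller MMD to a trusted reference sample would look more valuable.

def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma=1.0):
    """Squared MMD estimate: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy
```

A sample drawn from the same distribution as the reference yields a near-zero estimate, while a shifted distribution yields a clearly larger one, which is what makes sample-based comparison of distributions actionable.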
Multi-Label Learning with Stronger Consistency Guarantees
We present a detailed study of surrogate losses and algorithms for multi-label learning, supported by H-consistency bounds. We first show that, for the simplest form of multi-label loss (the popular Hamming loss), the well-known consistent binary relevance surrogate suffers from a sub-optimal dependency on the number of labels in terms of H-consistency bounds, when using smooth losses such as logistic losses. Furthermore, this loss function fails to account for label correlations. To address these drawbacks, we introduce a novel surrogate loss, multi-label logistic loss, that accounts for label correlations and benefits from label-independent H-consistency bounds. We then broaden our analysis to cover a more extensive family of multi-label losses, including all common ones and a new extension defined based on linear-fractional functions with respect to the confusion matrix.
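To make the baseline concrete, here is a small sketch of the Hamming loss and the binary relevance logistic surrogate discussed above (labels encoded as +1/-1; this is the baseline with label-independent losses, not the proposed multi-label logistic loss):

```python
import math

# Hamming loss and its binary relevance logistic surrogate. Binary
# relevance scores each label independently, which is why it cannot
# account for label correlations.

def hamming_loss(y_true, y_pred):
    """Fraction of the k labels that are predicted incorrectly."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_relevance_logistic(scores, y_true):
    """Sum over labels of the logistic loss log(1 + exp(-y * score)).
    Each label contributes its own term, independently of the others."""
    return sum(math.log1p(math.exp(-y * s)) for s, y in zip(scores, y_true))
```

Because the surrogate decomposes into a sum of per-label terms, any dependency structure between labels is invisible to it; the multi-label logistic loss introduced above is designed to remove exactly this limitation.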
Japan backs AI chip startup EdgeCortix in boost to defense tech
EdgeCortix, a Tokyo-based artificial intelligence (AI) chip startup, is riding a wave of interest in fostering Japanese semiconductors with defense applications. EdgeCortix, which has won a contract tied to the U.S. Department of Defense, on Wednesday secured government subsidies of ¥3 billion ($21 million) to develop energy-efficient AI chiplets for commercialization in 2027. The contract may help revenue more than double this year, founder Sakyasingha Dasgupta said. The products, designed to help robots make real-time decisions and fill the country's labor shortages, target mass production at Taiwan Semiconductor Manufacturing Co.'s plant in Japan. The subsidies are on top of ¥4 billion in support the semiconductor designer won in November to make chips for next-generation communication systems.
Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning
In recent advancements in unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations have been identified: the inability of the Exponential Moving Average (EMA) used in I-JEPA to prevent complete collapse, and the inadequacy of the I-JEPA prediction objective in accurately learning the mean of patch representations.
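For context, the EMA target-encoder update referenced above can be sketched as follows (plain parameter lists stand in for network weights; `tau` is an assumed momentum coefficient, not a value from the paper):

```python
# EMA update for a target encoder, as used in I-JEPA-style training:
# target <- tau * target + (1 - tau) * online, applied elementwise.
# The slowly moving target is meant to keep the predictor from collapsing
# to a trivial constant representation.

def ema_update(target_params, online_params, tau=0.996):
    """Return the new target parameters after one EMA step."""
    return [tau * t + (1 - tau) * o
            for t, o in zip(target_params, online_params)]
```

With tau close to 1, the target drifts only slightly toward the online encoder at each step; the limitation noted above is that this slow drift alone does not guarantee collapse is avoided.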