

EPO: Diverse and Realistic Protein Ensemble Generation via Energy Preference Optimization

Sun, Yuancheng, Ren, Yuxuan, Chen, Zhaoming, Han, Xu, Liu, Kang, Ye, Qiwei

arXiv.org Artificial Intelligence

Accurate exploration of protein conformational ensembles is essential for uncovering function but remains hard because molecular-dynamics (MD) simulations suffer from high computational costs and energy-barrier trapping. This paper presents Energy Preference Optimization (EPO), an online refinement algorithm that turns a pretrained protein ensemble generator into an energy-aware sampler without extra MD trajectories. Specifically, EPO leverages stochastic differential equation sampling to explore the conformational landscape and incorporates a novel energy-ranking mechanism based on list-wise preference optimization. Crucially, EPO introduces a practical upper bound to efficiently approximate the intractable probability of long sampling trajectories in continuous-time generative models, making it easily adaptable to existing pretrained generators. On the Tetrapeptides, ATLAS, and Fast-Folding benchmarks, EPO successfully generates diverse and physically realistic ensembles, establishing a new state of the art on nine evaluation metrics. These results demonstrate that energy-only preference signals can efficiently steer generative models toward thermodynamically consistent conformational ensembles, providing an alternative to long MD simulations and widening the applicability of learned potentials in structural biology and drug discovery.
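The abstract's energy-ranking idea can be illustrated with a Plackett-Luce listwise preference loss over a batch of sampled conformations. The sketch below is a minimal stand-in, not the paper's implementation: `logp` assumes access to an approximation of the model's trajectory log-probabilities (which EPO obtains via its upper bound), and the function name and `beta` temperature are illustrative.

```python
import torch

def listwise_energy_preference_loss(logp, energies, beta=1.0):
    """Plackett-Luce listwise ranking loss: conformations are ordered by
    potential energy (lowest first) and the generator is pushed to assign
    higher likelihood to lower-energy samples.

    logp     : (B,) approximate model log-probabilities (stand-in for the
               paper's trajectory-probability upper bound)
    energies : (B,) potential energies of the sampled conformations
    """
    order = torch.argsort(energies)            # ascending energy = preferred first
    s = beta * logp[order]                     # scores in preference order
    # log P(ranking) = sum_i [ s_i - logsumexp(s_i, ..., s_{B-1}) ]
    rev_lse = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    return -(s - rev_lse).sum()

# toy usage: gradients flow into whatever produced logp
logp = torch.randn(8, requires_grad=True)
listwise_energy_preference_loss(logp, torch.randn(8)).backward()
```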


Exchange Policy Optimization Algorithm for Semi-Infinite Safe Reinforcement Learning

Zhang, Jiaming, Yang, Yujie, Wang, Haoning, Zhang, Liping, Li, Shengbo Eben

arXiv.org Artificial Intelligence

Safe reinforcement learning (safe RL) aims to respect safety requirements while optimizing long-term performance. In many practical applications, however, the problem involves an infinite number of constraints, known as semi-infinite safe RL (SI-safe RL). Such constraints typically appear when safety conditions must be enforced across an entire continuous parameter space, such as ensuring adequate resource distribution at every spatial location. In this paper, we propose exchange policy optimization (EPO), an algorithmic framework that achieves optimal policy performance and deterministic bounded safety. EPO works by iteratively solving safe RL subproblems with finite constraint sets and adaptively adjusting the active set through constraint expansion and deletion. At each iteration, constraints with violations exceeding the predefined tolerance are added to refine the policy, while those with zero Lagrange multipliers are removed after the policy update. This exchange rule prevents uncontrolled growth of the working set and supports effective policy training. Our theoretical analysis demonstrates that, under mild assumptions, strategies trained via EPO achieve performance comparable to optimal solutions with global constraint violations strictly remaining within a prescribed bound.
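The expansion/deletion loop can be demonstrated on a toy semi-infinite program rather than a full safe-RL problem. In the hedged sketch below (all names illustrative), the "policy" is a single scalar theta maximized subject to theta <= c(p) for every p in a continuous interval; the subproblem is analytic, so only binding constraints survive the deletion step, mirroring the zero-multiplier rule in the abstract.

```python
import numpy as np

def c(p):                       # constraint boundary: theta <= c(p) for all p in [0, 2*pi]
    return 2.0 + np.cos(p)

def exchange_loop(tol=1e-3, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    active = []                 # finite working set of constraint parameters
    theta = 10.0                # unconstrained maximizer of the toy objective max theta
    for _ in range(iters):
        # expansion: sample the continuous index set, add violations above tol
        ps = rng.uniform(0.0, 2.0 * np.pi, size=256)
        active += [p for p in ps if theta - c(p) > tol]
        if not active:
            break
        # subproblem: maximize theta subject to the active constraints only
        theta = min(c(p) for p in active)
        # deletion: non-binding constraints carry zero multipliers, drop them
        active = [p for p in active if abs(c(p) - theta) < 1e-9]
    return theta

print(f"theta = {exchange_loop():.3f}  (semi-infinite optimum: 1.0)")
```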


ContextBench: Modifying Contexts for Targeted Latent Activation

Graham, Robert, Stevinson, Edward, Richter, Leo, Chia, Alexander, Miller, Joseph, Bloom, Joseph Isaac

arXiv.org Machine Learning

Identifying inputs that trigger specific behaviours or latent features in language models could have a wide range of safety use cases. We investigate a class of methods capable of generating targeted, linguistically fluent inputs that activate specific latent features or elicit model behaviours. We formalise this approach as context modification and present ContextBench -- a benchmark with tasks assessing core method capabilities and potential safety applications. Our evaluation framework measures both elicitation strength (activation of latent features or behaviours) and linguistic fluency, highlighting how current state-of-the-art methods struggle to balance these objectives. We enhance Evolutionary Prompt Optimisation (EPO) with LLM-assistance and diffusion model inpainting, and demonstrate that these variants achieve state-of-the-art performance in balancing elicitation effectiveness and fluency.
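A minimal way to see the two competing objectives is to score a prompt for both a target latent activation and fluency under a small causal LM. The sketch below is not ContextBench's scoring code; it assumes GPT-2 via Hugging Face transformers, and the layer and neuron indices are arbitrary placeholders for a latent feature of interest.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def fluency(prompt):
    """Negative mean token NLL under the model (higher = more fluent)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()

def elicitation(prompt, layer=6, neuron=373):
    """Mean activation of one MLP output unit (a stand-in latent feature)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    acts = {}
    hook = model.transformer.h[layer].mlp.register_forward_hook(
        lambda m, i, o: acts.setdefault("a", o[0, :, neuron].mean().item()))
    with torch.no_grad():
        model(ids)
    hook.remove()
    return acts["a"]

p = "The museum's new exhibit drew record crowds."
print(fluency(p), elicitation(p))
```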


Evolutionary Policy Optimization

Wang, Jianren, Su, Yifan, Gupta, Abhinav, Pathak, Deepak

arXiv.org Artificial Intelligence

Despite its extreme sample inefficiency, on-policy reinforcement learning has become a fundamental tool in real-world applications. With recent advances in GPU-driven simulation, the ability to collect vast amounts of data for RL training has scaled exponentially. However, studies show that current on-policy methods, such as PPO, fail to fully leverage the benefits of parallelized environments, leading to performance saturation beyond a certain scale. In contrast, Evolutionary Algorithms (EAs) excel at increasing diversity through randomization, making them a natural complement to RL. However, existing EvoRL methods have struggled to gain widespread adoption due to their extreme sample inefficiency. To address these challenges, we introduce Evolutionary Policy Optimization (EPO), a novel policy gradient algorithm that combines the strengths of EA and policy gradients. We show that EPO significantly improves performance across diverse and challenging environments, demonstrating superior scalability with parallelized simulations.
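The hybrid recipe, gradient steps inside an evolutionary outer loop, can be sketched on a toy objective. Below, a smoothed finite-difference gradient estimate stands in for the per-member policy-gradient update, and truncation selection plus Gaussian mutation plays the EA role; this illustrates the general idea, not the paper's algorithm.

```python
import numpy as np

def hybrid_epo(fitness, dim=8, pop=16, elites=4, gens=50, lr=0.1, sigma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    population = rng.normal(size=(pop, dim))
    for _ in range(gens):
        # inner "policy gradient" step: smoothed finite-difference estimate
        for i in range(pop):
            g = np.zeros(dim)
            for _ in range(8):
                eps = rng.normal(size=dim)
                g += (fitness(population[i] + sigma * eps)
                      - fitness(population[i] - sigma * eps)) / (2 * sigma) * eps / 8
            population[i] += lr * g
        # outer evolutionary step: keep elites, refill by mutating them
        order = np.argsort([-fitness(x) for x in population])
        parents = population[order[:elites]]
        children = (parents[rng.integers(elites, size=pop - elites)]
                    + sigma * rng.normal(size=(pop - elites, dim)))
        population = np.vstack([parents, children])
    return max(population, key=fitness)

best = hybrid_epo(lambda x: -float(np.sum(x ** 2)))   # optimum at the origin
print(np.round(best, 3))
```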


EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning

Liu, Xiaoqian, Wang, Ke, Li, Yongbin, Wu, Yuchuan, Ma, Wentao, Kong, Aobo, Huang, Fei, Jiao, Jianbin, Zhang, Junge

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown impressive reasoning capabilities in well-defined problems with clear solutions, such as mathematics and coding. However, they still struggle with complex real-world scenarios like business negotiations, which require strategic reasoning: the ability to navigate dynamic environments and align long-term goals amid uncertainty. Existing methods for strategic reasoning face challenges in adaptability, scalability, and transferring strategies to new contexts. To address these issues, we propose explicit policy optimization (EPO) for strategic reasoning, featuring an LLM that provides strategies in an open-ended action space and can be plugged into arbitrary LLM agents to motivate goal-directed behavior. To improve adaptability and policy transferability, we train the strategic reasoning model via multi-turn reinforcement learning (RL) using process rewards and iterative self-play, without supervised fine-tuning (SFT) as a preliminary step. Experiments across social and physical domains demonstrate EPO's ability to achieve long-term goal alignment through enhanced strategic reasoning, reaching state-of-the-art performance on social dialogue and web navigation tasks. Our findings reveal various collaborative reasoning mechanisms that emerge in EPO and its effectiveness in generating novel strategies, underscoring its potential for strategic reasoning in real-world applications.
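The plug-in architecture the abstract describes can be sketched as a control loop: a trainable strategy model emits an open-ended textual strategy, a frozen agent LLM conditions its reply on it, and only the strategy model's transitions are kept for the multi-turn RL update. Everything below (environment, reward, model stubs) is a hypothetical stand-in, not the paper's code.

```python
import random

class ToyEnv:
    """Stand-in negotiation environment emitting per-turn process rewards."""
    def reset(self):
        self.t = 0
        return "opening state"
    def step(self, reply):
        self.t += 1
        reward = random.random()              # placeholder process reward
        return f"state after turn {self.t}", reward, self.t >= 4

def episode(strategy_fn, agent_fn, env):
    obs, transitions, done = env.reset(), [], False
    while not done:
        strategy = strategy_fn(obs)           # open-ended action from the trainable model
        reply = agent_fn(obs, strategy)       # arbitrary plug-in agent, not trained
        obs, r, done = env.step(reply)
        transitions.append((obs, strategy, r))
    return transitions                        # input to the policy-gradient update

trans = episode(lambda o: "probe the counterpart's constraints",
                lambda o, s: f"[reply following: {s}]", ToyEnv())
print(len(trans), "transitions collected for the strategy model")
```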


EPO: Hierarchical LLM Agents with Environment Preference Optimization

Zhao, Qi, Fu, Haotian, Sun, Chen, Konidaris, George

arXiv.org Artificial Intelligence

Long-horizon decision-making tasks present significant challenges for LLM-based agents due to the need for extensive planning over multiple steps. In this paper, we propose a hierarchical framework that decomposes complex tasks into manageable subgoals, utilizing separate LLMs for subgoal prediction and low-level action generation. To address the challenge of creating training signals for unannotated datasets, we develop a reward model that leverages multimodal environment feedback to automatically generate reward signals. We introduce Environment Preference Optimization (EPO), a novel method that generates preference signals from the environment's feedback and uses them to train LLM-based agents. Extensive experiments on ALFRED demonstrate the state-of-the-art performance of our framework, achieving first place on the ALFRED public leaderboard and showcasing its potential to improve long-horizon decision-making in diverse environments.
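Given preference pairs derived from environment feedback, one natural training objective is a DPO-style pairwise loss; the abstract does not specify EPO's exact form, so the sketch below is a plausible stand-in in which the environment-preferred ("winning") trajectory is favored relative to a frozen reference model.

```python
import torch
import torch.nn.functional as F

def environment_preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-form pairwise objective. All arguments are (B,) sequence
    log-probabilities; 'w' is the trajectory the environment feedback
    scored higher, 'l' the one it scored lower."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# toy usage with random log-probabilities
args = [torch.randn(4) for _ in range(4)]
print(environment_preference_loss(*args))
```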


Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints

Gao, Shiqing, Ding, Jiaxin, Fu, Luoyi, Wang, Xinbing, Zhou, Chenghu

arXiv.org Artificial Intelligence

In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints. The penalty function method has recently been studied as an effective approach for handling constraints: it imposes constraint penalties on the objective to transform the constrained problem into an unconstrained one. However, it is challenging to choose penalties that balance policy performance and constraint satisfaction efficiently. In this paper, we propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN). The PMN responds appropriately to varying degrees of constraint violation, enabling efficient constraint satisfaction and safe exploration. We theoretically prove that EPO consistently improves constraint satisfaction with a convergence guarantee. We also propose a new surrogate function and bound its worst-case constraint violation and approximation error. In practice, we propose an effective smooth penalty function that can be easily implemented with a first-order optimizer. Extensive experiments show that EPO outperforms the baselines in both policy performance and constraint satisfaction with a stable training process, particularly on complex tasks.
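The penalty transformation itself fits in a few lines. The sketch below uses a smooth quadratic hinge as the exterior penalty, with per-constraint weights supplied externally (in the paper, by the PMN); the exact penalty form and network are not reproduced here.

```python
import torch

def exterior_penalty_objective(reward, costs, limits, kappa):
    """Unconstrained surrogate: expected return minus smooth exterior
    penalties, suitable for a first-order optimizer.

    reward : scalar estimate of the policy return
    costs  : (K,) estimated constraint returns J_c
    limits : (K,) constraint thresholds d
    kappa  : (K,) adaptive penalty weights (e.g. from a penalty network)
    """
    violation = torch.relu(costs - limits)     # zero inside the feasible region
    return reward - (kappa * violation ** 2).sum()

obj = exterior_penalty_objective(torch.tensor(3.0),
                                 torch.tensor([1.2, 0.4]),
                                 torch.tensor([1.0, 1.0]),
                                 torch.tensor([5.0, 5.0]))
print(obj)   # 3.0 - 5.0 * 0.2**2 = 2.8
```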


Fluent dreaming for language models

Thompson, T. Ben, Straznickas, Zygimantas, Sklar, Michael

arXiv.org Artificial Intelligence

Feature visualization, also known as "dreaming", offers insights into vision models by optimizing the inputs to maximize a neuron's activation or other internal component. However, dreaming has not been successfully applied to language models because the input space is discrete. We extend Greedy Coordinate Gradient, a method from the language model adversarial attack literature, to design the Evolutionary Prompt Optimization (EPO) algorithm. EPO optimizes the input prompt to trace out the Pareto frontier between a chosen internal feature's activation and prompt fluency, enabling fluent dreaming for language models. We demonstrate dreaming with neurons, output logits, and arbitrary directions in activation space. We measure the fluency of the resulting prompts and compare language model dreaming with max-activating dataset examples. Critically, fluent dreaming allows automatically exploring the behavior of model internals in reaction to mildly out-of-distribution prompts. Code for running EPO is available at https://github.com/Confirm-Solutions/dreamy. A companion page demonstrating code usage is at https://confirmlabs.org/posts/dreamy.html
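The Pareto-frontier bookkeeping at EPO's core can be demonstrated with placeholder scorers. In the toy below (not the released code at the repository above), `act` stands in for the target feature's activation and `flu` for fluency; single-character mutations play the role of GCG-style token swaps, and only non-dominated prompts are retained.

```python
import random

def score(prompt):
    act = prompt.count("x")                                    # stand-in activation
    flu = -sum(a == b for a, b in zip(prompt, prompt[1:]))     # penalize repeats
    return {"prompt": prompt, "act": act, "flu": flu}

def pareto_front(cands):
    """Keep candidates not weakly dominated on (act, flu)."""
    return [c for c in cands
            if not any(o["act"] >= c["act"] and o["flu"] >= c["flu"]
                       and (o["act"], o["flu"]) != (c["act"], c["flu"])
                       for o in cands)]

def epo_toy(iters=200, seed=0):
    rng = random.Random(seed)
    pop = [score("a" * 10)]
    for _ in range(iters):
        parent = rng.choice(pop)["prompt"]
        i = rng.randrange(len(parent))         # mutate one position
        child = parent[:i] + rng.choice("ax") + parent[i + 1:]
        pop = pareto_front(list({c["prompt"]: c
                                 for c in pop + [score(child)]}.values()))
    return pop

for c in sorted(epo_toy(), key=lambda c: c["act"]):
    print(c)
```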


EPOS and Mindtree Expand Strategic Digital Engineering Partnership

#artificialintelligence

Mindtree, a global technology services and digital transformation company, announced that it has extended its relationship with the premium audio and video solutions brand EPOS as a digital engineering partner, helping to augment and accelerate the brand's development of audio technologies and solutions. As part of the multiyear engagement, Mindtree will work as an integrated part of EPOS' development organisation and help strengthen its product innovation, time-to-market, and customer satisfaction, especially in EPOS' high-growth enterprise audio and video segment. Mindtree will provide a broad range of competencies and knowledge within development, maintenance, and quality assurance services to support and innovate all EPOS product categories within the Enterprise Solutions and Gaming segments. "This collaboration is important for EPOS to ensure and further develop our portfolio of best-in-class solutions and technologies," said Jeppe Dalberg-Larsen, President at EPOS. "I am confident that Mindtree's extensive product engineering and testing capabilities, coupled with its flexible, transparent, and collaborative approach, will strengthen and support our ability to deliver differentiated audio and video technology, and sound experiences." "We are pleased to partner with an acclaimed audio solutions leader such as EPOS in advancing state-of-the-art digital technologies," said Venu Lambu, Executive Director and President of Global Markets at Mindtree.


Webinar invite: The Power of AI in Audio

#artificialintelligence

Behavioural Strategy was founded as a management consultancy in Copenhagen in 2014 to help corporations be more valuable to both shareholders and society by using a broader range of research. The company combines behavioural economics with psychology, decision theory, game theory, economics, statistics (lots of statistics), technology, and a keen understanding of commercial operations to solve difficult problems. Behavioural Strategy is a hybrid between a service company and a technology company, offering both traditional consultancy services and a suite of automated tools that help businesses. EPOS is an audio and video solution company that develops and sells devices for business professionals and the gaming community. Building on leading and advanced technologies, the Danish-founded company delivers high-end audio and video solutions with design, technology, and performance as paramount parameters.