A Additional Benchmark Information
A.1 Offline
[Figure 5: Normalized performance of the best trained policy on D4RL, averaged over 4 random seeds.]
[Figure 15: Normalized performance of the last trained policy on D4RL after online tuning, averaged over 4 random seeds.]
Our codebase is released under the Apache License 2.0. For most of the algorithms and datasets, we use the default hyperparameters where available. Decision Transformer (DT) training is split into epochs, each comprising a full pass over the dataset.
Eval-PPO: Building an Efficient Threat Evaluator Using Proximal Policy Optimization
Sun, Wuzhou, Li, Siyi, Zou, Qingxiang, Liao, Zixing
In various game scenarios, selecting a fixed number of targets from among many enemy units is an extremely challenging task. The difficulty stems from the complex relationship between enemy units' threat levels and their features, which complicates the design of rule-based evaluators. Moreover, traditional supervised learning methods lack explicit labels for this threat evaluation problem during training. In this study, we reformulate threat evaluation as a reinforcement learning task and introduce an efficient evaluator training algorithm, Eval-PPO, based on Proximal Policy Optimization (PPO). Eval-PPO integrates multidimensional enemy features with the state information of friendly units through systematic training, thereby achieving precise threat assessment. Compared with rule-based methods, Eval-PPO improves the average success rate by 17.84%.
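Since the abstract leaves the formulation at a high level, here is a minimal, hypothetical sketch of how a PPO-trained threat evaluator of this kind might look in PyTorch: a network scores each enemy unit from its features together with the friendly-unit state, and a clipped PPO surrogate trains those scores from episode reward. All names, dimensions, and the placeholder advantage are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ThreatScorer(nn.Module):
    def __init__(self, enemy_dim: int, friendly_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enemy_dim + friendly_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, enemy_feats, friendly_state):
        # enemy_feats: (n_enemies, enemy_dim); friendly_state: (friendly_dim,)
        ctx = friendly_state.expand(enemy_feats.size(0), -1)
        logits = self.net(torch.cat([enemy_feats, ctx], dim=-1)).squeeze(-1)
        # A categorical over enemy units; top-k sampling would generalize this
        # to selecting a fixed number of targets.
        return torch.distributions.Categorical(logits=logits)

def ppo_loss(dist, actions, old_log_probs, advantages, clip_eps: float = 0.2):
    # Standard PPO clipped surrogate objective.
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Toy update: score 5 enemies with 4 features each against a 3-dim friendly state.
scorer = ThreatScorer(enemy_dim=4, friendly_dim=3)
opt = torch.optim.Adam(scorer.parameters(), lr=3e-4)
enemies, friendly = torch.randn(5, 4), torch.randn(3)
dist = scorer(enemies, friendly)
action = dist.sample()
old_lp = dist.log_prob(action).detach()
advantage = torch.tensor(1.0)  # stand-in for a computed advantage estimate
opt.zero_grad()
ppo_loss(scorer(enemies, friendly), action, old_lp, advantage).backward()
opt.step()
```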
YARE-GAN: Yet Another Resting State EEG-GAN
Farahzadi, Yeganeh, Ansarinia, Morteza, Kekecs, Zoltan
Generative Adversarial Networks (GANs) have shown promise in synthesising realistic neural data, yet their potential for unsupervised representation learning in resting-state EEG remains underexplored. In this study, we implement a Wasserstein GAN with Gradient Penalty (WGAN-GP) to generate multi-channel resting-state EEG data and assess the quality of the synthesised signals through both visual and feature-based evaluations. Our results indicate that the model effectively captures the statistical and spectral characteristics of real EEG data, although challenges remain in replicating high-frequency oscillations in the frontal region. Additionally, we demonstrate that the Critic's learned representations can be fine-tuned for age group classification, achieving an out-of-sample accuracy significantly better than a shuffled-label baseline. These findings suggest that generative models can serve not only as EEG data generators but also as unsupervised feature extractors, reducing the need for manual feature engineering. This study highlights the potential of GAN-based unsupervised learning for EEG analysis, suggesting avenues for more data-efficient deep learning applications in neuroscience.
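For readers unfamiliar with WGAN-GP, the gradient-penalty term this abstract relies on looks roughly like the following; the trivial critic and the (batch, channels, time) EEG-shaped tensors are placeholder assumptions, not the paper's model.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp: float = 10.0):
    # Interpolate between real and generated samples, then penalize the
    # critic's gradient norm for deviating from 1 (Gulrajani et al., 2017).
    eps = torch.rand(real.size(0), 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads, = torch.autograd.grad(
        outputs=scores.sum(), inputs=interp, create_graph=True
    )
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

# Toy usage with a trivial critic on 32 samples of 19-channel, 256-step EEG.
critic = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(19 * 256, 1))
real, fake = torch.randn(32, 19, 256), torch.randn(32, 19, 256)
gp = gradient_penalty(critic, real, fake)
gp.backward()
```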
Demystifying Long Chain-of-Thought Reasoning in LLMs
Yeo, Edward, Tong, Yuxuan, Niu, Morry, Neubig, Graham, Yue, Xiang
Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Afghanistan > Parwan Province > Charikar (0.04)
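Finding (2) of the long-CoT paper above argues that reward shaping is needed to stabilize CoT length growth. As a purely illustrative sketch of such shaping (the cosine schedule and every constant are my assumptions, not necessarily the paper's exact formulation), one can interpolate the reward between a short-length and a max-length value, separately for correct and incorrect answers, and hard-penalize generations past the length budget:

```python
import math

def shaped_reward(correct: bool, length: int, max_len: int = 4096,
                  r_correct=(2.0, 1.0), r_wrong=(-2.0, 0.0)) -> float:
    # r_* gives (reward at length 0, reward at max_len); values in between
    # follow a cosine ramp, so correct answers keep most of their reward as
    # chains grow while quick wrong answers are penalized hardest.
    hi, lo = r_correct if correct else r_wrong
    t = min(length, max_len) / max_len
    reward = lo + 0.5 * (hi - lo) * (1 + math.cos(math.pi * t))
    if length > max_len:  # hard penalty past the length budget
        reward = min(reward, -1.0)
    return reward

for n in (64, 1024, 4096, 8192):
    print(n, round(shaped_reward(True, n), 3), round(shaped_reward(False, n), 3))
```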
HARP: Human-Assisted Regrouping with Permutation Invariant Critic for Multi-Agent Reinforcement Learning
Hu, Huawen, Shi, Enze, Yue, Chenxi, Yang, Shuocun, Wu, Zihao, Li, Yiwei, Zhong, Tianyang, Zhang, Tuo, Liu, Tianming, Zhang, Shu
Human-in-the-loop reinforcement learning integrates human expertise to accelerate agent learning and provide critical guidance and feedback in complex fields. However, many existing approaches focus on single-agent tasks and require continuous human involvement during the training process, significantly increasing the human workload and limiting scalability. In this paper, we propose HARP (Human-Assisted Regrouping with Permutation Invariant Critic), a multi-agent reinforcement learning framework designed for group-oriented tasks. HARP integrates automatic agent regrouping with strategic human assistance during deployment, enabling non-experts to offer effective guidance with minimal intervention. During training, agents dynamically adjust their groupings to optimize collaborative task completion. When deployed, they actively seek human assistance and use the Permutation Invariant Group Critic to evaluate and refine human-proposed groupings, allowing non-expert users to contribute valuable suggestions. Across multiple collaboration scenarios, our approach leverages limited guidance from non-experts to enhance performance. The project can be found at https://github.com/huawen-hu/HARP.
- North America > United States > Georgia > Clarke County > Athens (0.14)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
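The permutation-invariant group critic HARP mentions can be sketched with a Deep Sets-style construction: embed each agent's observation and mean-pool, so the value estimate is unchanged under any reordering of agents within a group. The architecture and dimensions below are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PermutationInvariantCritic(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))

    def forward(self, agent_obs):
        # agent_obs: (n_agents, obs_dim); mean-pooling over the agent axis
        # removes any dependence on agent ordering.
        return self.value(self.embed(agent_obs).mean(dim=0))

critic = PermutationInvariantCritic(obs_dim=8)
obs = torch.randn(5, 8)
perm = obs[torch.randperm(5)]
# Same value for any permutation of the same group, up to float error.
assert torch.allclose(critic(obs), critic(perm), atol=1e-6)
```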
How to train your draGAN: A task oriented solution to imbalanced classification
Guertler, Leon O., Ashfahani, Andri, Luu, Anh Tuan
The long-standing challenge of building effective classification models for small and imbalanced datasets has seen little improvement since the creation of the Synthetic Minority Over-sampling Technique (SMOTE) over 20 years ago. Though GAN-based models seem promising, there has been a lack of purpose-built architectures for this problem, as most previous studies focus on applying existing models. This paper proposes a unique, performance-oriented, data-generating strategy that utilizes a new architecture, coined draGAN, to generate both minority and majority samples. The samples are generated with the objective of optimizing the classification model's performance, rather than similarity to the real data. We benchmark our approach against state-of-the-art methods from the SMOTE family and competitive GAN-based approaches on 94 tabular datasets with varying degrees of imbalance and linearity. Empirically, we show the superiority of draGAN, but also highlight some of its shortcomings. All code is available at: https://github.com/LeonGuertler/draGAN.
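As I read the draGAN abstract, the key twist is that the generator is scored on downstream classifier performance rather than on realism. A hedged toy version of that objective: train a logistic classifier for one differentiable gradient step on synthetic samples, evaluate it on real data, and backpropagate the evaluation loss into the generator. Every dimension, the single inner step, and the optimizer choice here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

d, noise_dim = 4, 8
gen = torch.nn.Sequential(torch.nn.Linear(noise_dim, 32), torch.nn.ReLU(),
                          torch.nn.Linear(32, d + 1))  # d features + 1 label logit
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

def logits(w, b, x):
    return x @ w + b

# A real (imbalanced) batch the inner classifier is evaluated on.
x_real = torch.randn(64, d)
y_real = (torch.rand(64) < 0.1).float()  # ~10% minority class

for step in range(100):
    out = gen(torch.randn(32, noise_dim))
    x_syn, y_syn = out[:, :d], torch.sigmoid(out[:, d])  # soft synthetic labels
    # Fresh classifier, trained for one differentiable SGD step on synthetic
    # data (create_graph=True keeps the step in the autograd graph).
    w = torch.zeros(d, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    inner = F.binary_cross_entropy_with_logits(logits(w, b, x_syn), y_syn)
    gw, gb = torch.autograd.grad(inner, (w, b), create_graph=True)
    w1, b1 = w - 1.0 * gw, b - 1.0 * gb
    # Generator is updated to make the trained classifier do well on real data.
    outer = F.binary_cross_entropy_with_logits(logits(w1, b1, x_real), y_real)
    opt.zero_grad()
    outer.backward()
    opt.step()
```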
Dealing with the Routing Problem, Part 1 (Computer Science)
Abstract: This paper attempts to solve the well-known Vehicle Routing Problem under multiple constraints, including capacitated vehicles, a single depot, and distance, using two approaches: a cluster-first, route-second algorithm, and integer linear programming. A set of nodes is provided as input to the system, and a feasible route is generated as output, giving clusters of nodes and the route to be traveled within each cluster. For clustering the nodes we adopt the DBSCAN algorithm, and routing is done with the approximation method known as Christofides' algorithm.

Abstract: Recently, applying Reinforcement Learning (RL) methodologies to NP-hard combinatorial optimization problems has become a popular topic. This is essentially due to the nature of traditional combinatorial algorithms, which are often based on a trial-and-error process.
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.77)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.72)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.57)
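The cluster-first, route-second pipeline from the first abstract above can be prototyped in a few lines with off-the-shelf components: scikit-learn's DBSCAN for clustering and the Christofides 1.5-approximation shipped in recent networkx for per-cluster routing. Depot handling, capacities, and all parameters are simplified assumptions in this sketch.

```python
import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN
from networkx.algorithms.approximation import christofides

rng = np.random.default_rng(0)
points = rng.random((30, 2)) * 100             # customer coordinates
labels = DBSCAN(eps=20, min_samples=2).fit_predict(points)

for cluster in sorted(set(labels) - {-1}):     # -1 marks DBSCAN noise points
    idx = np.flatnonzero(labels == cluster)
    if len(idx) < 3:                           # Christofides needs >= 3 nodes
        continue
    # Build a complete graph over the cluster with Euclidean edge weights,
    # since Christofides requires a complete metric instance.
    g = nx.complete_graph(len(idx))
    for i, j in g.edges:
        g[i][j]["weight"] = float(np.linalg.norm(points[idx[i]] - points[idx[j]]))
    tour = christofides(g, weight="weight")    # closed tour within the cluster
    print(f"cluster {cluster}: route {[int(idx[n]) for n in tour]}")
```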
Karaoker: Alignment-free singing voice synthesis with speech training data
Kakoulidis, Panos, Ellinas, Nikolaos, Vamvoukakis, Georgios, Markopoulos, Konstantinos, Sung, June Sig, Jho, Gunu, Tsiakoulis, Pirros, Chalamandaris, Aimilios
Existing singing voice synthesis (SVS) models are usually trained on singing data and depend either on error-prone time-alignment and duration features or on explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice-characteristic features, which is trained exclusively on spoken data without requiring time-alignments. Karaoker synthesizes singing voice and transfers style following a multi-dimensional template extracted from a source waveform of an unseen singer/speaker. The model is jointly conditioned with a single deep convolutional encoder on continuous data, including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model toward an accurate result. In addition to multitasking, we also employ a Wasserstein GAN training scheme, as well as new losses on the acoustic model's output, to further refine the quality of the model.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > New York (0.04)
- Europe > Greece (0.04)
- Asia (0.04)
- Education > Focused Education > Special Education > Speech Therapy (0.61)
- Media (0.49)
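To make the Karaoker conditioning scheme above concrete, here is an invented sketch of a single convolutional encoder over the stacked continuous feature tracks (pitch, intensity, harmonicity, formants, cepstral peak prominence, octave), with auxiliary heads for the feature-reconstruction and speaker-identification tasks the abstract lists. Channel counts, the speaker count, and the loss weights are made up for the example.

```python
import torch
import torch.nn as nn

class ConditioningEncoder(nn.Module):
    def __init__(self, n_feats: int = 6, emb: int = 128, n_speakers: int = 100):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_feats, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, emb, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.recon_head = nn.Conv1d(emb, n_feats, kernel_size=1)
        self.spk_head = nn.Linear(emb, n_speakers)

    def forward(self, feats):                  # feats: (B, n_feats, T)
        h = self.conv(feats)                   # (B, emb, T) conditioning signal
        recon = self.recon_head(h)             # feature-reconstruction task
        spk = self.spk_head(h.mean(dim=-1))    # utterance-level speaker ID task
        return h, recon, spk

enc = ConditioningEncoder()
feats = torch.randn(4, 6, 200)
speaker = torch.randint(0, 100, (4,))
h, recon, spk = enc(feats)
# Auxiliary multitask terms added to the main acoustic-model objective.
loss = (nn.functional.mse_loss(recon, feats)
        + 0.1 * nn.functional.cross_entropy(spk, speaker))
loss.backward()
```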