radam
Representation Meets Optimization: Training PINNs and PIKANs for Gray-Box Discovery in Systems Pharmacology
Daryakenari, Nazanin Ahmadi, Shukla, Khemraj, Karniadakis, George Em
Physics-Informed Kolmogorov-Arnold Networks (PIKANs) are gaining attention as an effective counterpart to the original multilayer perceptron-based Physics-Informed Neural Networks (PINNs). Both representation models can address inverse problems and facilitate gray-box system identification. However, a comprehensive understanding of their performance in terms of accuracy and speed remains underexplored. In particular, we introduce a modified PIKAN architecture, tanh-cPIKAN, which is based on Chebyshev polynomials for parametrization of the univariate functions with an extra nonlinearity for enhanced performance. We then present a systematic investigation of how choices of the optimizer, representation, and training configuration influence the performance of PINNs and PIKANs in the context of systems pharmacology modeling. We benchmark a wide range of first-order, second-order, and hybrid optimizers, including various learning rate schedulers. We use the new Optax library to identify the most effective combinations for learning gray-boxes under ill-posed, non-unique, and data-sparse conditions. We examine the influence of model architecture (MLP vs. KAN), numerical precision (single vs. double), the need for warm-up phases for second-order methods, and sensitivity to the initial learning rate. We also assess the optimizer scalability for larger models and analyze the trade-offs introduced by JAX in terms of computational efficiency and numerical accuracy. Using two representative systems pharmacology case studies - a pharmacokinetics model and a chemotherapy drug-response model - we offer practical guidance on selecting optimizers and representation models/architectures for robust and efficient gray-box discovery. Our findings provide actionable insights for improving the training of physics-informed networks in biomedical applications and beyond.
RADAMS: Resilient and Adaptive Alert and Attention Management Strategy against Informational Denial-of-Service (IDoS) Attacks
Attacks exploiting human attentional vulnerability have posed severe threats to cybersecurity. In this work, we identify and formally define a new type of proactive attentional attacks called Informational Denial-of-Service (IDoS) attacks that generate a large volume of feint attacks to overload human operators and hide real attacks among feints. We incorporate human factors (e.g., levels of expertise, stress, and efficiency) and empirical results (e.g., the Yerkes-Dodson law and the sunk cost fallacy) to model the operators' attention dynamics and their decision-making processes along with the real-time alert monitoring and inspection. To assist human operators in timely and accurately dismissing the feints and escalating the real attacks, we develop a Resilient and Adaptive Data-driven alert and Attention Management Strategy (RADAMS) that de-emphasizes alerts selectively based on the alerts' observable features. RADAMS uses reinforcement learning to achieve a customized and transferable design for various human operators and evolving IDoS attacks. The integrated modeling and theoretical analysis lead to the Product Principle of Attention (PPoA), fundamental limits, and the tradeoff among crucial human and economic factors. Experimental results corroborate that the proposed strategy outperforms the default strategy and can reduce the IDoS risk by as much as 20%. Besides, the strategy is resilient to large variations of costs, attack frequencies, and human attention capacities. We have recognized interesting phenomena such as attentional risk equivalency, attacker's dilemma, and the half-truth optimal attack strategy.
EAdam Optimizer: How $\epsilon$ Impact Adam
Many adaptive optimization methods have been proposed and used in deep learning, in which Adam is regarded as the default algorithm and widely used in many deep learning frameworks. Recently, many variants of Adam, such as Adabound, RAdam and Adabelief, have been proposed and show better performance than Adam. However, these variants mainly focus on changing the stepsize by making differences on the gradient or the square of it. Motivated by the fact that suitable damping is important for the success of powerful second-order optimizers, we discuss the impact of the constant $\epsilon$ for Adam in this paper. Surprisingly, we can obtain better performance than Adam simply changing the position of $\epsilon$. Based on this finding, we propose a new variant of Adam called EAdam, which doesn't need extra hyper-parameters or computational costs. We also discuss the relationships and differences between our method and Adam. Finally, we conduct extensive experiments on various popular tasks and models. Experimental results show that our method can bring significant improvement compared with Adam. Our code is available at https://github.com/yuanwei2019/EAdam-optimizer.
Descending through a Crowded Valley -- Benchmarking Deep Learning Optimizers
Schmidt, Robin M., Schneider, Frank, Hennig, Philipp
Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of more than a dozen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing almost 35,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. This subset includes popular favorites and some lesser-known contenders. We have open-sourced all our experimental results, making them directly available as challenging and well-tuned baselines. This allows for more meaningful comparisons when evaluating novel optimization methods without requiring any further computational efforts. Large-scale stochastic optimization drives a wide variety of machine learning tasks. Because choosing the right optimization algorithm and effectively tuning its hyperparameters heavily influences the training speed and final performance of the learned model, doing so is an important, everyday challenge to practitioners. Hence, stochastic optimization methods have been a focal point of research (cf. Figure 1), engendering an ever-growing list of algorithms, many of them specifically targeted towards deep learning.
On the adequacy of untuned warmup for adaptive optimization
Adaptive optimization algorithms such as Adam (Kingma & Ba, 2014) are widely used in deep learning. The stability of such algorithms is often improved with a warmup schedule for the learning rate. Motivated by the difficulty of choosing and tuning warmup schedules, Liu et al. (2019) propose automatic variance rectification of Adam's adaptive learning rate, claiming that this rectified approach ("RAdam") surpasses the vanilla Adam algorithm and reduces the need for expensive tuning of Adam with warmup. In this work, we point out various shortcomings of this analysis. We then provide an alternative explanation for the necessity of warmup based on the magnitude of the update term, which is of greater relevance to training stability. Finally, we provide some "rule-of-thumb" warmup schedules, and we demonstrate that simple untuned warmup of Adam performs more-or-less identically to RAdam in typical practical settings. We conclude by suggesting that practitioners stick to linear warmup with Adam, with a sensible default being linear warmup over $2 / (1 - \beta_2)$ training iterations.
Is Rectified Adam actually *better* than Adam? - PyImageSearch
Is the Rectified Adam (RAdam) optimizer actually better than the standard Adam optimizer? According to my 24 experiments, the answer is no, typically not (but there are cases where you do want to use it instead of Adam). In Liu et al.'s 2018 paper, On the Variance of the Adaptive Learning Rate and Beyond, the authors claim that Rectified Adam can obtain: The authors tested their hypothesis on three different datasets, including one NLP dataset and two computer vision datasets (ImageNet and CIFAR-10). In each case Rectified Adam outperformed standard Adam…but failed to outperform standard Stochastic Gradient Descent (SGD)! The Rectified Adam optimizer has some strong theoretical justifications -- but as a deep learning practitioner, you need more than just theory -- you need to see empirical results applied to a variety of datasets. And perhaps more importantly, you need to obtain a mastery level experience operating/driving the optimizer (or a small subset of optimizers) as well. If you haven't yet, go ahead and read part one to ensure you have a good understanding of how the Rectified Adam optimizer works. From there, read today's post to help you understand how to design, code, and run experiments used to compare deep learning optimizers. To learn how to compare Rectified Adam to standard Adam, just keep reading! In the first part of this tutorial, we'll briefly discuss the Rectified Adam optimizer, including how it works and why it's interesting to us as deep learning practitioners.
New State of the Art AI Optimizer: Rectified Adam (RAdam). Improve your AI accuracy instantly versus Adam, and why it works.
As you can see, RAdam provides a dynamic heuristic to provide automated variance reduction and thus removes the need and manual tuning involved with a warmup during training. In addition, RAdam is shown to be more robust to learning rate variations (the most important hyperparameter) and provides better training accuracy and generalization on a variety of datasets and within a variety of AI architectures. In short, I'd highly recommend you drop RAdam into your AI architecture and see if you don't get an immediate benefit. I'd offer a money back guarantee but since the cost for it is $0.00…:) RAdam is available for PyTorch at their official github here.
On the Variance of the Adaptive Learning Rate and Beyond
Liu, Liyuan, Jiang, Haoming, He, Pengcheng, Chen, Weizhu, Liu, Xiaodong, Gao, Jianfeng, Han, Jiawei
The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in details. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the effectiveness and robustness of our proposed method. All implementations are available at: https://github.com/LiyuanLucasLiu/RAdam.