rate scheduler
Appendix: On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them
Suppose we have a non-zero solution θ that is a stationary point of f(θ, t) at the t-th step, and SGD finds θ_t = θ at the t-th step. Theorem 2.2 of Shapiro and Wardi [9] tells us that the learning rate should be small enough for convergence. Obviously, we have η < in practice. As η_t = η_{t+1} does not hold, SGD cannot converge to any non-zero stationary point. The proof is now complete.
Representation Meets Optimization: Training PINNs and PIKANs for Gray-Box Discovery in Systems Pharmacology
Daryakenari, Nazanin Ahmadi, Shukla, Khemraj, Karniadakis, George Em
Physics-Informed Kolmogorov-Arnold Networks (PIKANs) are gaining attention as an effective counterpart to the original multilayer perceptron-based Physics-Informed Neural Networks (PINNs). Both representation models can address inverse problems and facilitate gray-box system identification. However, their performance in terms of accuracy and speed remains underexplored. In particular, we introduce a modified PIKAN architecture, tanh-cPIKAN, which is based on Chebyshev polynomials for parametrization of the univariate functions with an extra nonlinearity for enhanced performance. We then present a systematic investigation of how choices of the optimizer, representation, and training configuration influence the performance of PINNs and PIKANs in the context of systems pharmacology modeling. We benchmark a wide range of first-order, second-order, and hybrid optimizers, including various learning rate schedulers. We use the new Optax library to identify the most effective combinations for learning gray-boxes under ill-posed, non-unique, and data-sparse conditions. We examine the influence of model architecture (MLP vs. KAN), numerical precision (single vs. double), the need for warm-up phases for second-order methods, and sensitivity to the initial learning rate. We also assess the optimizer scalability for larger models and analyze the trade-offs introduced by JAX in terms of computational efficiency and numerical accuracy. Using two representative systems pharmacology case studies - a pharmacokinetics model and a chemotherapy drug-response model - we offer practical guidance on selecting optimizers and representation models/architectures for robust and efficient gray-box discovery. Our findings provide actionable insights for improving the training of physics-informed networks in biomedical applications and beyond.
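To make the idea of Chebyshev-parametrized univariate functions concrete, the following is a minimal pure-Python sketch of a single Chebyshev-based KAN layer. It is an illustration only, not the authors' implementation: the function names `chebyshev_basis` and `cheby_kan_layer`, the coefficient layout `coeffs[j][i][k]`, and in particular the placement of the extra nonlinearity (here assumed to be a tanh applied to each layer output, toggled by the hypothetical flag `outer_tanh`) are all assumptions.

```python
import math


def chebyshev_basis(z, degree):
    """Return [T_0(z), ..., T_degree(z)] via T_{k+1}(z) = 2 z T_k(z) - T_{k-1}(z)."""
    values = [1.0, z]
    for _ in range(2, degree + 1):
        values.append(2.0 * z * values[-1] - values[-2])
    return values[: degree + 1]


def cheby_kan_layer(x, coeffs, outer_tanh=True):
    """Apply one Chebyshev KAN layer to a list of input floats.

    coeffs[j][i][k] is the weight of T_k on the edge from input i to output j
    (hypothetical layout chosen for this sketch).
    """
    degree = len(coeffs[0][0]) - 1
    # Squash each input into (-1, 1), the natural domain of Chebyshev polynomials.
    basis = [chebyshev_basis(math.tanh(xi), degree) for xi in x]
    outputs = []
    for edge_coeffs in coeffs:
        # Sum the learned univariate functions over all incoming edges.
        total = sum(
            c * t
            for per_input, per_basis in zip(edge_coeffs, basis)
            for c, t in zip(per_input, per_basis)
        )
        # Assumed location of the extra nonlinearity: a tanh on the layer output.
        outputs.append(math.tanh(total) if outer_tanh else total)
    return outputs
```

For example, `cheby_kan_layer([0.0], [[[1.0, 2.0, 3.0]]], outer_tanh=False)` evaluates 1·T_0(0) + 2·T_1(0) + 3·T_2(0) = 1 + 0 − 3 = −2. In a real PIKAN the coefficients would be trainable parameters updated by the chosen optimizer.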
32fcc8cfe1fa4c77b5c58dafd36d1a98-AuthorFeedback.pdf
We thank the reviewers for their detailed comments. Please see our response below.
"... common implementation of weight decay [1] will usually multiply the amount of weight decay by the learning " The same holds in our setup: We have an
"How do different learning rate schedules affect the conclusion?": We address LR schedule questions below.
"It would be great if the authors can provide more experiments on ... AUTOL2" We ran additional experiments
"((1)) If I could have access to the test set... ". We reject the claim that our submission "violates the ethics of
"((2)) I have concerns on comparing AutoL2... ". Experiments with lr decay and AutoL2 are presented in the SM.
"((3)) The practically of the proposed work...
"... more insights on the relation between learning rate scheduler and AutoL2... " We address this point in the
"... the lambda update refractory period is not detailed ... " The refractory period lasts for
"It would be interesting to see on the same graph, training with learning rate scheduler ... " In the SM we have the
"In Figure 1a and 1b, how is the best test accuracy determined?... " In Figs.