Energy
MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models
Chitty-Venkata, Krishna Teja, Howland, Sylvia, Azar, Golara, Soboleva, Daria, Vassilieva, Natalia, Raskar, Siddhisanket, Emani, Murali, Vishwanath, Venkatram
Mixture of Experts (MoE) models have enabled the scaling of Large Language Models (LLMs) and Vision Language Models (VLMs) by achieving massive parameter counts while maintaining computational efficiency. However, MoEs introduce several inference-time challenges, including load imbalance across experts and the added computational overhead of routing. To address these challenges and fully harness the benefits of MoE, a systematic evaluation of hardware acceleration techniques is essential. We present MoE-Inference-Bench, a comprehensive study to evaluate MoE performance across diverse scenarios. We analyze the impact of batch size, sequence length, and critical MoE hyperparameters such as FFN dimensions and number of experts on throughput. We evaluate several optimization techniques on Nvidia H100 GPUs, including pruning, Fused MoE operations, speculative decoding, quantization, and various parallelization strategies. Our evaluation includes MoEs from the Mixtral, DeepSeek, OLMoE and Qwen families. The results reveal performance differences across configurations and provide insights for the efficient deployment of MoEs.
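The two inference-time costs named in the abstract, routing overhead and expert load imbalance, can be made concrete with a small sketch. The code below is an illustrative top-k softmax router in plain Python, not the benchmark's implementation; the expert count, the `bias` vector that makes two experts more popular, and all function names are assumptions chosen for the example.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_id, weight), ...] for the k highest-scoring experts.

    Weights are renormalized so the selected experts' weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def expert_load(batch_logits, k):
    """Count how many token assignments each expert receives in a batch."""
    counts = Counter()
    for logits in batch_logits:
        for expert_id, _ in route_top_k(logits, k):
            counts[expert_id] += 1
    return counts


random.seed(0)
NUM_EXPERTS, TOP_K, NUM_TOKENS = 8, 2, 1000

# Hypothetical router bias: experts 0 and 1 are systematically preferred.
bias = [1.0 if e < 2 else 0.0 for e in range(NUM_EXPERTS)]
batch = [
    [random.gauss(0.0, 1.0) + bias[e] for e in range(NUM_EXPERTS)]
    for _ in range(NUM_TOKENS)
]

load = expert_load(batch, TOP_K)
# Ratio of the busiest expert's load to a perfectly even split.
imbalance = max(load.values()) / (NUM_TOKENS * TOP_K / NUM_EXPERTS)
print(dict(sorted(load.items())), round(imbalance, 2))
```

A ratio well above 1 means the busiest expert becomes the straggler when experts are sharded across devices, which is one reason batch size and expert count interact with throughput.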
Is the Frequency Principle always valid?
We investigate the learning dynamics of shallow ReLU neural networks on the unit sphere \(S^2\subset\mathbb{R}^3\) in polar coordinates \((τ,ϕ)\), considering both fixed and trainable neuron directions \(\{w_i\}\). For fixed weights, spherical harmonic expansions reveal an intrinsic low-frequency preference with coefficients decaying as \(O(\ell^{5/2}/2^\ell)\), typically leading to the Frequency Principle (FP) of lower-frequency-first learning. However, this principle can be violated under specific initial conditions or error distributions. With trainable weights, an additional rotation term in the harmonic evolution equations preserves exponential decay, now of order \(O(\ell^{7/2}/2^\ell)\), again leading to the FP of lower-frequency-first learning. As in the fixed-weight case, however, the principle can be violated under specific initial conditions or error distributions. Our numerical results demonstrate that trainable directions increase learning complexity and can either maintain a low-frequency advantage or enable faster high-frequency emergence. This analysis suggests the FP should be viewed as a tendency rather than a rule on curved domains like \(S^2\), providing insights into how direction updates and harmonic expansions shape frequency-dependent learning.
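As a quick check on the fixed-weight bound quoted in the abstract (a worked ratio added for illustration, not a result taken from the paper), write \(a_\ell = \ell^{5/2}/2^\ell\) for the bounding sequence. Consecutive terms then satisfy

\[
\frac{a_{\ell+1}}{a_\ell} = \frac{1}{2}\left(1+\frac{1}{\ell}\right)^{5/2} \longrightarrow \frac{1}{2} \quad (\ell\to\infty),
\]

and the ratio drops below 1 as soon as \((1+1/\ell)^{5/2} < 2\), that is, for \(\ell \ge 4\). The polynomial prefactor therefore only delays the geometric decay; replacing the exponent \(5/2\) by \(7/2\) for the trainable-weight bound leaves the limiting ratio at \(1/2\).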
OVITA: Open-Vocabulary Interpretable Trajectory Adaptations
Maurya, Anurag, Ghosh, Tashmoy, Nguyen, Anh, Prakash, Ravi
Adapting trajectories to dynamic situations and user preferences is crucial for robot operation in unstructured environments with non-expert users. Natural language enables users to express these adjustments in an interactive manner. We introduce OVITA, an interpretable, open-vocabulary, language-driven framework designed for adapting robot trajectories in dynamic and novel situations based on human instructions. OVITA leverages multiple pre-trained Large Language Models (LLMs) to integrate user commands into trajectories generated by motion planners or those learned through demonstrations. OVITA employs code as an adaptation policy generated by an LLM, enabling users to adjust individual waypoints, thus providing flexible control. Another LLM, which acts as a code explainer, removes the need for expert users, enabling intuitive interactions. The efficacy and significance of the proposed OVITA framework is demonstrated through extensive simulations and real-world environments with diverse tasks involving spatiotemporal variations on heterogeneous robotic platforms such as a KUKA IIWA robot manipulator, Clearpath Jackal ground robot, and CrazyFlie drone.
I. INTRODUCTION Robotic systems have increasingly permeated diverse domains, from industrial automation to service robotics, demanding efficient trajectory generation and adaptation techniques. A fundamental challenge in this context lies in enabling robots to generalize in dynamic and unstructured environments.
Optimizing Neural Networks with Learnable Non-Linear Activation Functions via Lookup-Based FPGA Acceleration
Yin, Mengyuan, Choong, Benjamin Chen Ming, Qu, Chuping, Goh, Rick Siow Mong, Wong, Weng-Fai, Luo, Tao
Learned activation functions in models like Kolmogorov-Arnold Networks (KANs) outperform fixed-activation architectures in terms of accuracy and interpretability; however, their computational complexity poses critical challenges for energy-constrained edge AI deployments. Conventional CPUs/GPUs incur prohibitive latency and power costs when evaluating higher-order activations, limiting deployability under ultra-tight energy budgets. We address this via a reconfigurable lookup architecture with edge FPGAs. FPGA reconfigurability enables dynamic hardware specialization for learned functions, a key advantage for edge systems that require post-deployment adaptability. This breakthrough positions our approach as a practical enabler for energy-critical edge AI, where computational intensity and power constraints traditionally preclude the use of adaptive activation networks. The development of effective activation functions has long been a central focus in machine learning research to enhance neural network capabilities. Neural networks with trainable activation functions represent an important and actively explored class of models, attracting growing research interest due to their potential to enhance model expressivity and adaptability to specific tasks [1] - complementing models with traditional fixed functions such as ReLU [2] and Leaky ReLU [3]. Learnable activation functions can be classified into two main categories: parameterized standard activation functions and ensemble-based activation functions [4].
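The general idea of replacing an expensive activation with a table lookup can be sketched in software. The code below is a minimal illustration, assuming a uniformly spaced table with linear interpolation and input clamping; the abstract does not describe the paper's actual table layout, precision, or interpolation scheme, and SiLU is used only as a stand-in for a learned function.

```python
import math


def build_lut(fn, lo, hi, entries):
    """Sample fn at `entries` evenly spaced points on [lo, hi]."""
    step = (hi - lo) / (entries - 1)
    table = [fn(lo + i * step) for i in range(entries)]
    return table, lo, step


def lut_eval(table, lo, step, x):
    """Approximate fn(x) by linear interpolation between table entries.

    Inputs outside the sampled range are clamped to the nearest endpoint.
    """
    pos = (x - lo) / step
    pos = min(max(pos, 0.0), float(len(table) - 1))
    i = min(int(pos), len(table) - 2)  # index of the left neighbour
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])


def silu(x):
    """Stand-in for a learned activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


table, lo, step = build_lut(silu, -4.0, 4.0, 256)
approx = lut_eval(table, lo, step, 0.5)
print(approx, silu(0.5))
```

After the table is built, each evaluation costs one index computation, two reads, and one multiply-add, regardless of how complicated the original function is. That fixed cost is what makes the approach attractive for hardware, and re-sampling the table is enough to change the function.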
A Dataset and Benchmark for Robotic Cloth Unfolding Grasp Selection: The ICRA 2024 Cloth Competition
De Gusseme, Victor-Louis, Lips, Thomas, Proesmans, Remko, Hietala, Julius, Lee, Giwan, Choi, Jiyoung, Choi, Jeongil, Kim, Geon, Yonrith, Phayuth, Tabernik, Domen, Gams, Andrej, Nimac, Peter, Urbas, Matej, Muhovič, Jon, Skočaj, Danijel, Mavsar, Matija, Yu, Hyojeong, Kwon, Minseo, Kim, Young J., Cong, Yang, Chen, Ronghan, Ren, Yu, Diao, Supeng, Weng, Jiawei, Liu, Jiayue, Sun, Haoran, Yang, Linhan, Zhang, Zeqing, Guo, Ning, Yang, Lei, Wan, Fang, Song, Chaoyang, Pan, Jia, Jin, Yixiang, A, Yong, Shi, Jun, Li, Dingzhe, Yang, Yong, Yamasaki, Kakeru, Kajiwara, Takumi, Nakadera, Yuki, Saxena, Krati, Shibata, Tomohiro, Xia, Chongkun, Mo, Kai, Yu, Yanzhao, Lin, Qihao, Ma, Binqiang, Sagong, Uihun, Choi, JungHyun, Park, JeongHyun, Lee, Dongwoo, Kim, Yeongmin, Hwang, Myun Joong, Kuribayashi, Yusuke, Hiratsuka, Naoki, Tanaka, Daisuke, Arnold, Solvi, Yamazaki, Kimitoshi, Mateo-Agullo, Carlos, Verleysen, Andreas, Wyffels, Francis
Robotic cloth manipulation suffers from a lack of standardized benchmarks and shared datasets for evaluating and comparing different approaches. To address this, we created a benchmark and organized the ICRA 2024 Cloth Competition, a unique head-to-head evaluation focused on grasp pose selection for in-air robotic cloth unfolding. Eleven diverse teams participated in the competition, utilizing our publicly released dataset of real-world robotic cloth unfolding attempts and a variety of methods to design their unfolding approaches. Afterwards, we also expanded our dataset with 176 competition evaluation trials, resulting in a dataset of 679 unfolding demonstrations across 34 garments. Analysis of the competition results revealed insights about the trade-off between grasp success and coverage, the surprisingly strong achievements of hand-engineered methods and a significant discrepancy between competition performance and prior work, underscoring the importance of independent, out-of-the-lab evaluation in robotic cloth manipulation. The associated dataset is a valuable resource for developing and evaluating grasp selection methods, particularly for learning-based approaches. We hope that our benchmark, dataset and competition results can serve as a foundation for future benchmarks and drive further progress in data-driven robotic cloth manipulation. The dataset and benchmarking code are available at https://airo.ugent.be/cloth_competition.
CelloAI: Leveraging Large Language Models for HPC Software Development in High Energy Physics
Atif, Mohammad, Chopra, Kriti, Kilic, Ozgur, Wang, Tianle, Dong, Zhihua, Leggett, Charles, Lin, Meifeng, Calafiura, Paolo, Habib, Salman
Next-generation High Energy Physics (HEP) experiments will generate unprecedented data volumes, necessitating High Performance Computing (HPC) integration alongside traditional high-throughput computing. However, HPC adoption in HEP is hindered by the challenge of porting legacy software to heterogeneous architectures and the sparse documentation of these complex scientific codebases. We present CelloAI, a locally hosted coding assistant that leverages Large Language Models (LLMs) with retrieval-augmented generation (RAG) to support HEP code documentation and generation. This local deployment ensures data privacy, eliminates recurring costs and provides access to large context windows without external dependencies. CelloAI addresses two primary use cases, code documentation and code generation, through specialized components. For code documentation, the assistant provides: (a) Doxygen-style comment generation for all functions and classes by retrieving relevant information from RAG sources (papers, posters, presentations), (b) file-level summary generation, and (c) an interactive chatbot for code comprehension queries. For code generation, CelloAI employs syntax-aware chunking strategies that preserve syntactic boundaries during embedding, improving retrieval accuracy in large codebases. The system integrates call-graph knowledge to maintain dependency awareness during code modifications and provides AI-generated suggestions for performance optimization and accurate refactoring. We evaluate CelloAI using real-world HEP applications from the ATLAS, CMS, and DUNE experiments, comparing different embedding models for code retrieval effectiveness. Our results demonstrate the AI assistant's capability to enhance code understanding and support reliable code generation while maintaining the transparency and safety requirements essential for scientific computing environments.
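Syntax-aware chunking, as opposed to splitting a file every N characters, can be illustrated with a short sketch. The example below uses Python's standard `ast` module as a stand-in parser and emits one chunk per top-level function or class; this is an assumption made for a self-contained example, since the abstract does not say which parser or chunk granularity CelloAI uses, and a HEP codebase would typically need a C++ parser instead.

```python
import ast


def chunk_by_definition(source):
    """Split Python source into (name, text) chunks, one per top-level
    function or class, so no chunk cuts through a definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno
            # Keep decorators attached to the definition they modify.
            if node.decorator_list:
                start = min(d.lineno for d in node.decorator_list)
            text = "\n".join(lines[start - 1:node.end_lineno])
            chunks.append((node.name, text))
    return chunks


SAMPLE = '''import math

def area(r):
    return math.pi * r * r

class Track:
    def length(self):
        return 0.0
'''

for name, text in chunk_by_definition(SAMPLE):
    print(f"--- {name} ---")
    print(text)
```

Each chunk is a complete syntactic unit, so an embedding computed from it describes one function or class rather than the tail of one and the head of the next.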
Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective
Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their heavy resource demands make quantization (reducing precision to lower-bit formats) critical for efficient serving. While many quantization methods exist, a systematic understanding of their performance, energy, and quality tradeoffs in realistic serving conditions remains a gap. In this work, we first develop a fully automated online characterization framework, qMeter, and then conduct an in-depth characterization of 11 post-training LLM quantization methods across 4 model sizes (7B-70B) and two GPU architectures (A100, H100). We evaluate quantization at the application, workload, parallelism, and hardware levels under online serving conditions. Our study reveals highly task- and method-dependent tradeoffs, strong sensitivity to workload characteristics, and complex interactions with parallelism and GPU architecture. We further present three optimization case studies illustrating deployment challenges in capacity planning, energy-efficient scheduling, and multi-objective tuning. To the best of our knowledge, this is one of the first comprehensive application-, system-, and hardware-level characterizations of LLM quantization from a joint performance, energy, and quality perspective.
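For readers unfamiliar with what "reducing precision to lower-bit formats" means in practice, the sketch below shows the simplest post-training scheme: symmetric, per-tensor, round-to-nearest INT8 quantization. It is a generic textbook baseline chosen for illustration and is not claimed to be one of the 11 methods the study evaluates; the weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.003, 1.0]  # made-up example values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(codes, scale, max(errors))
```

The round-trip error of each weight is at most half a quantization step (`scale / 2`), so a single large outlier inflates `scale` and degrades every other weight. Many more elaborate methods differ mainly in how they limit that effect.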
WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling
Li, Jiacheng, Tan, Jianchao, Yang, Zhidong, Sun, Pingwei, Huo, Feiye, Qin, Jiayu, Sun, Yerui, Xie, Yuchen, Cai, Xunliang, Zhang, Xiangyu, He, Maoxin, Tan, Guangming, Jia, Weile, Zhao, Tong
The Transformer architecture has come to dominate the LLM field. Recent advances in training optimization for Transformer-based large language models (LLMs) primarily focus on architectural modifications or optimizer adjustments. However, these approaches lack systematic optimization of weight patterns during training. Weight pattern refers to the distribution and relative magnitudes of weight parameters in a neural network. To address this issue, we propose a Weight Scaling method called WISCA to enhance training efficiency and model quality by strategically improving neural network weight patterns without changing network structures. By rescaling weights while preserving model outputs, WISCA indirectly optimizes the model's training trajectory. Experiments demonstrate that WISCA significantly improves convergence quality (measured by generalization capability and loss reduction), particularly in LLMs with Grouped Query Attention (GQA) architectures and LoRA fine-tuning tasks. Empirical results show 5.6% average improvement on zero-shot validation tasks and 2.12% average reduction in training perplexity across multiple architectures.
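"Rescaling weights while preserving model outputs" relies on the fact that some pairs of weight matrices only ever appear as a product. A minimal sketch of one such pair follows, assuming a single attention logit computed from hypothetical query and key projections `Wq` and `Wk`: multiplying `Wq` by a factor and dividing `Wk` by the same factor leaves the logit unchanged while changing the relative magnitudes of the two matrices. The abstract does not say which weights WISCA rescales or how it chooses the factor, so the matrices and `alpha` here are illustrative only.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def scaled(W, factor):
    """Return a copy of W with every entry multiplied by factor."""
    return [[w * factor for w in row] for row in W]


def attention_logit(Wq, Wk, x_query, x_key):
    """Unnormalized attention score between one query and one key token."""
    return dot(matvec(Wq, x_query), matvec(Wk, x_key))


# Illustrative 2x3 projections and 3-dimensional token embeddings.
Wq = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.5]]
Wk = [[1.5, 0.5, -2.0], [-1.0, 2.5, 0.5]]
x_query = [1.0, 2.0, -1.0]
x_key = [0.5, -1.0, 2.0]

alpha = 4.0  # hypothetical rescaling factor
before = attention_logit(Wq, Wk, x_query, x_key)
after = attention_logit(scaled(Wq, alpha), scaled(Wk, 1.0 / alpha), x_query, x_key)
print(before, after)
```

The function computed by the layer is identical before and after, but gradients with respect to `Wq` and `Wk` are not. This is why an output-preserving rescaling can still alter the subsequent training trajectory.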
Situational Awareness as the Imperative Capability for Disaster Resilience in the Era of Complex Hazards and Artificial Intelligence
Disasters frequently exceed established hazard models, revealing blind spots where unforeseen impacts and vulnerabilities hamper effective response. This perspective paper contends that situational awareness (SA), the ability to perceive, interpret, and project dynamic crisis conditions, is an often overlooked yet vital capability for disaster resilience. While risk mitigation measures can reduce known threats, not all hazards can be neutralized; truly adaptive resilience hinges on whether organizations rapidly detect emerging failures, reconcile diverse data sources, and direct interventions where they matter most. We present a technology-process-people roadmap, demonstrating how real-time hazard nowcasting, interoperable workflows, and empowered teams collectively transform raw data into actionable insight. A system-of-systems approach enables federated data ownership and modular analytics, so multiple agencies can share timely updates without sacrificing their distinct operational models. Equally crucial, structured sense-making routines and cognitive load safeguards help humans remain effective decision-makers amid data abundance. By framing SA as a socio-technical linchpin rather than a peripheral add-on, this paper spotlights the urgency of elevating SA to a core disaster resilience objective. We conclude with recommendations for further research (developing SA metrics, designing trustworthy human-AI collaboration, and strengthening inclusive data governance) to ensure that communities are equipped to cope with both expected and unexpected crises.
Constrained Diffusion Models for Synthesizing Representative Power Flow Datasets
Hoseinpour, Milad, Dvorkin, Vladimir
High-quality power flow datasets are essential for training machine learning models in power systems. However, security and privacy concerns restrict access to real-world data, making statistically accurate and physically consistent synthetic datasets a viable alternative. We develop a diffusion model for generating synthetic power flow datasets from real-world power grids that both replicate the statistical properties of the real-world data and ensure AC power flow feasibility. To enforce the constraints, we incorporate gradient guidance based on the power flow constraints to steer diffusion sampling toward feasible samples. For computational efficiency, we further leverage insights from the fast decoupled power flow method and propose a variable decoupling strategy for the training and sampling of the diffusion model. These solutions lead to a physics-informed diffusion model, generating power flow datasets that outperform those from the standard diffusion in terms of feasibility and statistical similarity, as shown in experiments across IEEE benchmark systems.
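The guidance mechanism mentioned in the abstract can be illustrated on a toy problem. The sketch below applies repeated gradient steps on the squared residual of a constraint, which is the correction a guided sampler would interleave with its denoising updates. Everything specific is an assumption made for the example: the denoiser is omitted entirely, and the nonlinear AC power flow equations are replaced by a linear, lossless balance between two generator outputs and one demand.

```python
def residual(x):
    """Toy lossless power balance: generation minus demand (0 if feasible)."""
    g1, g2, demand = x
    return g1 + g2 - demand


def residual_grad(x):
    """Gradient of the residual with respect to (g1, g2, demand)."""
    return [1.0, 1.0, -1.0]


def guidance_step(x, step_size):
    """One gradient step on 0.5 * residual(x)**2, pulling x toward feasibility."""
    h = residual(x)
    grad = residual_grad(x)
    return [xi - step_size * h * gi for xi, gi in zip(x, grad)]


def guide(x, steps=20, step_size=0.1):
    """Apply repeated guidance steps and record |residual| after each one."""
    history = [abs(residual(x))]
    for _ in range(steps):
        x = guidance_step(x, step_size)
        history.append(abs(residual(x)))
    return x, history


sample = [0.4, 0.3, 1.0]  # made-up infeasible sample: generation < demand
guided, history = guide(sample)
print(guided, history[0], history[-1])
```

With this linear residual each step shrinks the violation by a fixed factor of `1 - 3 * step_size`. A nonlinear constraint set such as AC power flow gives no such guarantee, which is one reason the step size and the interplay with denoising matter in practice.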