Laptop makers embraced AI. Then Microsoft left them hanging

PCWorld

PCWorld reports that laptop manufacturers rushed to meet Microsoft's 40 TOPS NPU requirement for Copilot+ PCs, only to see Microsoft shift strategy away from NPU dependence. Intel, AMD, and Qualcomm now offer powerful NPUs (48-80 TOPS) in new processors, but limited software applications currently utilize this specialized AI hardware.


Quant-Trim in Practice: Improved Cross-Platform Low-Bit Deployment on Edge NPUs

Dhahri, Rayen, Urban, Steffen

arXiv.org Artificial Intelligence

Specialized edge accelerators rely on low-bit quantization, but vendor compilers differ in scaling, clipping, and kernel support, often as black boxes. The same floating-point (FP) checkpoint can therefore yield inconsistent accuracy across backends, forcing practitioners to tweak flags or refactor models to vendor-friendly operator subsets. We introduce Quant-Trim, a training-phase method that produces a hardware-neutral checkpoint robust to backend and precision choices. It combines progressive fake quantization to align training with the deployed integer grid and reverse pruning to tame outlier-driven scale inflation while preserving learnability. Quant-Trim is agnostic to quantization schemes (symmetric/asymmetric, per-tensor/per-channel, INT8/INT4) and requires no vendor-specific graph changes. Across models and tasks, it narrows the FP-to-low-bit gap, reduces dependence on compiler heuristics/calibration, and avoids per-backend retraining. We report accuracy and edge metrics (latency, throughput, energy per inference, and cost) under static/dynamic activation scaling and varying operator coverage.
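The core idea of fake quantization can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes symmetric per-tensor INT8 scaling, and the `alpha` ramp standing in for "progressive" is a hypothetical schedule parameter:

```python
import numpy as np

def fake_quantize(w, n_bits=8):
    """Symmetric per-tensor fake quantization: round weights onto the
    integer grid, then map back to float so training sees the same
    discretization error the deployed integer kernel will introduce."""
    qmax = 2 ** (n_bits - 1) - 1                  # e.g. 127 for INT8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def progressive_fake_quantize(w, alpha, n_bits=8):
    """Blend FP weights with their fake-quantized version; ramping
    alpha from 0 to 1 over training eases the model onto the grid."""
    return (1 - alpha) * w + alpha * fake_quantize(w, n_bits)

w = np.array([0.03, -1.27, 0.5, 0.9])
print(progressive_fake_quantize(w, alpha=1.0))    # fully on the INT8 grid
```

Note how the scale is set by the largest-magnitude weight: a single outlier inflates `scale` and coarsens the grid for every other weight, which is the failure mode the paper's reverse pruning targets.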


The great NPU failure: Two years later, local AI is still all about GPUs

PCWorld

When you purchase through links in our articles, we may earn a small commission. Local AI tools are more powerful than ever, but most of the magic ain't happening on NPUs--much to Microsoft's disappointment, I'm sure. For the last few years, the term "AI PC" has basically meant little more than "a lightweight portable laptop with a neural processing unit (NPU) ." Today, two years after the glitzy launch of NPUs with Intel's Meteor Lake hardware, these AI PCs still feel like glorified tech demos. But local AI is here!


Enhancing Learned Knowledge in LoRA Adapters Through Efficient Contrastive Decoding on Ascend NPUs

Heisler, Morgan Lindsay, Xing, Linzi, Shi, Ge, Sadri, Hanieh, Singh, Gursimran, Zhang, Weiwei, Ye, Tao, Xiong, Ying, Zhang, Yong, Fan, Zhenan

arXiv.org Artificial Intelligence

Huawei Cloud users leverage LoRA (Low-Rank Adaptation) as an efficient and scalable method to fine-tune and customize large language models (LLMs) for application-specific needs. However, tasks that require complex reasoning or deep contextual understanding are often hindered by biases or interference from the base model when using typical decoding methods like greedy or beam search. These biases can lead to generic or task-agnostic responses from the base model instead of leveraging the LoRA-specific adaptations. In this paper, we introduce Contrastive LoRA Decoding (CoLD), a novel decoding framework designed to maximize the use of task-specific knowledge in LoRA-adapted models, resulting in better downstream performance. CoLD uses contrastive decoding by scoring candidate tokens based on the divergence between the probability distributions of a LoRA-adapted expert model and the corresponding base model. This approach prioritizes tokens that better align with the LoRA's learned representations, enhancing performance for specialized tasks. While effective, a naive implementation of CoLD is computationally expensive because each decoding step requires evaluating multiple token candidates across both models. To address this, we developed an optimized kernel for Huawei's Ascend NPU. CoLD achieves up to a 5.54% increase in task accuracy while reducing end-to-end latency by 28% compared to greedy decoding. This work provides practical and efficient decoding strategies for fine-tuned LLMs in resource-constrained environments and has broad implications for applied data science in both cloud and on-premises settings.
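The scoring step can be sketched with standard contrastive decoding, which the abstract's description closely resembles. This is a hedged illustration, not CoLD's exact formulation: the `alpha` plausibility cutoff and the 5-token vocabulary are assumptions:

```python
import numpy as np

def contrastive_scores(logp_expert, logp_base, alpha=0.1):
    """Contrastive decoding sketch: score each candidate token by how
    much the LoRA-adapted expert prefers it over the base model,
    restricted to a plausibility set (tokens whose expert probability
    is within a factor alpha of the expert's best token)."""
    cutoff = np.log(alpha) + logp_expert.max()
    scores = logp_expert - logp_base            # divergence-based score
    scores[logp_expert < cutoff] = -np.inf      # prune implausible tokens
    return scores

# Hypothetical 5-token vocabulary
logp_expert = np.log(np.array([0.5, 0.3, 0.1, 0.05, 0.05]))
logp_base   = np.log(np.array([0.4, 0.4, 0.1, 0.05, 0.05]))
print(int(np.argmax(contrastive_scores(logp_expert, logp_base))))  # -> 0
```

The plausibility cutoff matters: without it, tokens the expert barely considers but the base model actively avoids would get inflated scores. The double forward pass per step is exactly the cost the paper's Ascend kernel optimizes away.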


Scaling LLM Test-Time Compute with Mobile NPU on Smartphones

Hao, Zixu, Wei, Jianyu, Wang, Tuowei, Huang, Minxing, Jiang, Huiqiang, Jiang, Shiqi, Cao, Ting, Ren, Ju

arXiv.org Artificial Intelligence

Deploying Large Language Models (LLMs) on mobile devices faces the challenge of insufficient performance in smaller models and excessive resource consumption in larger ones. This paper highlights that mobile Neural Processing Units (NPUs) have underutilized computational resources, particularly their matrix multiplication units, during typical LLM inference. To leverage this wasted compute capacity, we propose applying parallel test-time scaling techniques on mobile NPUs to enhance the performance of smaller LLMs. However, this approach confronts inherent NPU challenges, including inadequate hardware support for fine-grained quantization and low efficiency in general-purpose computations. To overcome these, we introduce two key techniques: a hardware-aware tile quantization scheme that aligns group quantization with NPU memory access patterns, and efficient LUT-based replacements for complex operations such as Softmax and dequantization. We design and implement an end-to-end inference system that leverages the NPU's compute capability to support test-time scaling on Qualcomm Snapdragon platforms. Experiments show our approach brings significant speedups: up to 19.0× for mixed-precision GEMM and 2.2× for Softmax. More importantly, we demonstrate that smaller models using test-time scaling can match or exceed the accuracy of larger models, achieving a new performance-cost Pareto frontier.
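The LUT idea can be shown with a minimal sketch of a table-based softmax. This is an illustration of the general technique, not the paper's kernel; the table size, input range, and max-subtraction convention are assumptions:

```python
import numpy as np

# LUT-based exp: precompute exp() over a quantized input range so that
# a transcendental evaluation becomes a table lookup. After subtracting
# the row max, all inputs lie in (-inf, 0]; values below X_MIN round to
# the smallest table entry, whose contribution is negligible.
X_MIN, N_ENTRIES = -10.0, 1024
LUT = np.exp(np.linspace(X_MIN, 0.0, N_ENTRIES))

def lut_softmax(x):
    x = x - x.max()                          # shift inputs into [X_MIN, 0]
    idx = np.clip(((x - X_MIN) / -X_MIN * (N_ENTRIES - 1)).astype(int),
                  0, N_ENTRIES - 1)
    e = LUT[idx]                             # lookup instead of exp()
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
print(lut_softmax(x))                        # close to the exact softmax
```

With 1024 entries over a 10-unit range, the quantization error per input is under 0.01 in the exponent, so the resulting probabilities track the exact softmax to roughly 1%: a typical accuracy/speed trade-off for lookup-based replacements.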


Evaluating the Energy Efficiency of NPU-Accelerated Machine Learning Inference on Embedded Microcontrollers

Fanariotis, Anastasios, Orphanoudakis, Theofanis, Fotopoulos, Vasilis

arXiv.org Artificial Intelligence

The deployment of machine learning (ML) models on microcontrollers (MCUs) is constrained by strict energy, latency, and memory requirements, particularly in battery-operated and real-time edge devices. While software-level optimizations such as quantization and pruning reduce model size and computation, hardware acceleration has emerged as a decisive enabler for efficient embedded inference. This paper evaluates the impact of Neural Processing Units (NPUs) on MCU-based ML execution, using the ARM Cortex-M55 core combined with the Ethos-U55 NPU on the Alif Semiconductor Ensemble E7 development board as a representative platform. A rigorous measurement methodology was employed, incorporating per-inference net energy accounting via GPIO-triggered high-resolution digital multimeter synchronization and idle-state subtraction, ensuring accurate attribution of energy costs. Experimental results across six representative ML models -- including MiniResNet, MobileNetV2, FD-MobileNet, MNIST, TinyYolo, and SSD-MobileNet -- demonstrate substantial efficiency gains when inference is offloaded to the NPU. For moderate to large networks, latency improvements ranged from 7× to over 125×, with per-inference net energy reductions of up to 143×. Notably, the NPU enabled execution of models unsupported on CPU-only paths, such as SSD-MobileNet, highlighting its functional as well as efficiency advantages. These findings establish NPUs as a cornerstone of energy-aware embedded AI, enabling real-time, power-constrained ML inference at the MCU level.
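The net energy accounting described above reduces to simple arithmetic: integrate power over the GPIO-marked inference window, then subtract the idle baseline for that same window. A minimal sketch, with the sample trace and idle figure being hypothetical values:

```python
# Per-inference net energy: integrate measured power over the inference
# window (marked by a GPIO trigger in the paper's setup), then subtract
# the idle-state power drawn over the same interval.
def net_energy_mj(power_mw, dt_s, idle_mw):
    """power_mw: power samples (mW) covering the inference window,
    dt_s: sampling interval (s), idle_mw: measured idle power (mW)."""
    window_s = len(power_mw) * dt_s
    gross_mj = sum(p * dt_s for p in power_mw)   # mW * s = mJ
    return gross_mj - idle_mw * window_s         # strip the idle floor

# Hypothetical trace: 5 ms inference at 80 mW over a 20 mW idle floor
samples = [80.0] * 50                            # 50 samples, 0.1 ms apart
print(round(net_energy_mj(samples, dt_s=1e-4, idle_mw=20.0), 6))  # -> 0.3
```

Subtracting the idle floor matters for fair CPU-vs-NPU comparisons: a faster NPU inference holds the system at active power for a shorter window, so gross energy alone would understate the accelerator's advantage.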


19 AI-infused apps that prove NPUs are already changing how we work

PCWorld

When you purchase through links in our articles, we may earn a small commission. More and more applications support the use of NPUs in modern AI notebooks. Since Intel integrated a dedicated Neural Processing Unit (NPU) into modern notebooks with the Core Ultra processors, and AMD with the Ryzen AI series, the software landscape has visibly changed. Applications from various fields, such as image processing, video production, communication, and document processing, are increasingly using this specialized hardware to execute AI functions locally, becoming faster and more energy-efficient. Frameworks help developers build applications that offer NPU support.


Serving Large Language Models on Huawei CloudMatrix384

Zuo, Pengfei, Lin, Huimin, Deng, Junbo, Zou, Nan, Yang, Xingkun, Diao, Yingyu, Gao, Weifeng, Xu, Ke, Chen, Zhangyu, Lu, Shirui, Qiu, Zhao, Li, Peiyang, Chang, Xianyu, Yu, Zhengzhong, Miao, Fangzheng, Zheng, Jia, Li, Ying, Feng, Yuan, Wang, Bei, Zong, Zaijian, Zhou, Mosong, Zhou, Wenli, Chen, Houjiang, Liao, Xingyu, Li, Yipeng, Zhang, Wenxiao, Zhu, Ping, Wang, Yinggang, Xiao, Chuanjie, Liang, Depeng, Cao, Dong, Liu, Juncheng, Yang, Yongqiang, Bai, Xiaolong, Li, Yi, Xie, Huaguo, Wu, Huatao, Yu, Zhibin, Chen, Lv, Liu, Hu, Ding, Yujun, Zhu, Haipei, Xia, Jing, Xiong, Yi, Yu, Zhou, Liao, Heng

arXiv.org Artificial Intelligence

The rapid evolution of large language models (LLMs), driven by growing parameter scales, adoption of mixture-of-experts (MoE) architectures, and expanding context lengths, imposes unprecedented demands on AI infrastructure. Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency, compounded by variable workloads and strict service-level objectives. Addressing these issues requires fundamentally redesigned hardware-software integration. This paper introduces Huawei CloudMatrix, a next-generation AI datacenter architecture, realized in the production-grade CloudMatrix384 supernode. It integrates 384 Ascend 910 NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth Unified Bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources. These features optimize performance for communication-intensive operations, such as large-scale MoE expert parallelism and distributed key-value cache access. To fully leverage CloudMatrix384, we propose CloudMatrix-Infer, an advanced LLM serving solution incorporating three core innovations: a peer-to-peer serving architecture that independently scales prefill, decode, and caching; a large-scale expert parallelism strategy supporting EP320 via efficient UB-based token dispatch; and hardware-aware optimizations including specialized operators, microbatch-based pipelining, and INT8 quantization. Evaluation with the DeepSeek-R1 model shows CloudMatrix-Infer achieves state-of-the-art efficiency: prefill throughput of 6,688 tokens/s per NPU and decode throughput of 1,943 tokens/s per NPU (<50 ms TPOT). It effectively balances throughput and latency, sustaining 538 tokens/s per NPU even under stringent 15 ms latency constraints, while INT8 quantization maintains model accuracy across benchmarks.
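The reported throughput and latency figures can be sanity-checked with a back-of-envelope relation: aggregate decode throughput per NPU times the time-per-output-token (TPOT) approximates the number of sequences being decoded concurrently on that NPU. The quoted TPOTs are upper bounds, so these are upper-bound estimates, not figures from the paper:

```python
# Little's-law-style estimate: tokens/s per NPU * seconds per token
# per stream ~= number of streams decoded concurrently per NPU.
def concurrent_streams(tokens_per_s, tpot_s):
    return tokens_per_s * tpot_s

print(round(concurrent_streams(1943, 0.050)))  # <50 ms TPOT regime -> 97
print(round(concurrent_streams(538, 0.015)))   # strict 15 ms regime -> 8
```

The estimate shows why the strict 15 ms constraint costs throughput: tighter per-token latency caps the batch of concurrent sequences each NPU can amortize its compute over.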


Microsoft's Copilot gamble is a bust. But AI PCs still feel inevitable

PCWorld

A year ago, Microsoft hyped Copilot PCs as the next big thing. Twelve months later, it's hard not to see them as one of the tech industry's more significant flops. The question is whether they'll stay that way. Many Copilot PCs began shipping on June 18, 2024, about a month after Microsoft announced the program at the company's headquarters. Acer, Asus, Dell, HP, Lenovo, Samsung, and Microsoft's own Surface division committed to shipping Copilot PCs, whose centerpiece was a processor with an embedded Neural Processing Unit -- the engine of AI -- capable of 40 trillion operations per second, or TOPS.


Microsoft is still ignoring the AI PCs that actually matter

PCWorld

Should Microsoft and the PC industry have paid more attention to the GPU during the development of AI and Copilot PCs? After a year of waiting for Copilot PCs (and their newfangled "Neural Processing Units") to take off, I can't help but wonder. Microsoft launched the Copilot PC initiative on May 20, 2024, and began shipping them on June 18. Since then, Microsoft has supported Copilot PCs with a handful of features, rolling them out first for PCs with Qualcomm Snapdragon chips inside and then later for PCs powered by Intel Core Ultra Series 2 chips and the AMD Ryzen AI 300 processor. Qualcomm is essentially blameless, delivering a potent PC processor with strong AI capabilities and long battery life.