Zhang, Sai Qian
Speculative Decoding and Beyond: An In-Depth Survey of Techniques
Hu, Yunhai, Liu, Zining, Dong, Zhenyuan, Peng, Tianfan, McDanel, Bradley, Zhang, Sai Qian
Sequential dependencies present a fundamental bottleneck in deploying large-scale autoregressive models, particularly for real-time applications. While traditional optimization approaches like pruning and quantization often compromise model quality, recent advances in generation-refinement frameworks demonstrate that this trade-off can be significantly mitigated. This survey presents a comprehensive taxonomy of generation-refinement frameworks, analyzing methods across autoregressive sequence tasks. We categorize methods based on their generation strategies (from simple n-gram prediction to sophisticated draft models) and refinement mechanisms (including single-pass verification and iterative approaches). Through systematic analysis of both algorithmic innovations and system-level implementations, we examine deployment strategies across computing environments and explore applications spanning text, images, and speech generation. This systematic examination of both theoretical frameworks and practical implementations provides a foundation for future research in efficient autoregressive decoding.
Index Terms -- Large Language Model, Speculative Decoding, Computer System, Distributed System.
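To make the generation-refinement pattern concrete, here is a minimal greedy speculative-decoding step in PyTorch. It is a sketch, not any specific method from the survey: `draft_model` and `target_model` are assumed callables mapping a 1-D token tensor to per-position logits, and verification is deterministic (accept while the target's greedy choice matches the draft).

```python
import torch

def speculative_decode_step(target_model, draft_model, prefix, k=4):
    """One generate-refine step: the draft model proposes k tokens,
    then the target model verifies them in a single forward pass (greedy)."""
    # Generation: the small draft model proposes k tokens autoregressively.
    draft = prefix.clone()
    for _ in range(k):
        logits = draft_model(draft)            # assumed shape: (seq, vocab)
        next_tok = logits[-1].argmax().view(1)
        draft = torch.cat([draft, next_tok])

    # Refinement: one target forward pass scores all proposed positions at once.
    target_logits = target_model(draft)        # assumed shape: (seq, vocab)
    accepted = prefix.clone()
    for i in range(len(prefix), len(draft)):
        target_tok = target_logits[i - 1].argmax()
        if target_tok == draft[i]:             # draft token matches target's choice
            accepted = torch.cat([accepted, draft[i].view(1)])
        else:                                  # first mismatch: take target token, stop
            accepted = torch.cat([accepted, target_tok.view(1)])
            break
    return accepted
```

Because the target model scores all k proposed positions in one forward pass, every accepted token beyond the first arrives at a fraction of the usual sequential decoding cost.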
Huff-LLM: End-to-End Lossless Compression for Efficient LLM Inference
Yubeaton, Patrick, Mahmoud, Tareq, Naga, Shehab, Taheri, Pooria, Xia, Tianhua, George, Arun, Khalil, Yasmein, Zhang, Sai Qian, Joshi, Siddharth, Hegde, Chinmay, Garg, Siddharth
As they become more capable, large language models (LLMs) have continued to grow rapidly in size, exacerbating the difficulty of running state-of-the-art LLMs on small edge devices. Standard approaches address this problem through lossy compression techniques such as quantization or pruning, which have been shown to change model behavior in unpredictable ways. We propose Huff-LLM, an \emph{end-to-end, lossless} model compression method that lets users store LLM weights in compressed format \emph{everywhere} -- cloud, disk, main memory, and even on-chip memory/buffers. This allows us not only to load larger models into main memory, but also to reduce the bandwidth required to load weights on chip and to make more efficient use of on-chip weight buffers. In addition to the memory savings achieved via compression, we also show latency and energy-efficiency improvements when performing inference with the compressed model.
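As a toy illustration of the lossless idea, the sketch below Huffman-codes the raw bytes of a synthetic FP16 weight tensor. The actual Huff-LLM format, and in particular its hardware-friendly decoder for on-chip buffers, is more involved; none of the names below come from the paper.

```python
import heapq
from collections import Counter
import numpy as np

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bitstring) from a byte sequence."""
    freq = Counter(symbols)
    # Heap entries: (frequency, tiebreak id, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}   # left subtree gets prefix 0
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, uid, merged))
        uid += 1
    return heap[0][2]

weights = np.random.randn(4096).astype(np.float16).tobytes()  # synthetic stand-in
code = huffman_code(weights)
bits = sum(len(code[b]) for b in weights)
print(f"compressed to {bits / (8 * len(weights)):.2%} of original size")
```

Since decoding reproduces the exact input bytes, the model's behavior is bit-for-bit unchanged, unlike quantization or pruning.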
FovealNet: Advancing AI-Driven Gaze Tracking Solutions for Optimized Foveated Rendering System Performance in Virtual Reality
Liu, Wenxuan, Duinkharjav, Monde, Sun, Qi, Zhang, Sai Qian
Leveraging real-time eye tracking, foveated rendering optimizes hardware efficiency and enhances visual quality in virtual reality (VR). This approach uses eye-tracking techniques to determine where the user is looking, allowing the system to render high-resolution graphics only in the foveal region (the small area of the retina where visual acuity is highest) while rendering the peripheral view at lower resolution. However, modern deep learning-based gaze-tracking solutions often exhibit a long-tail distribution of tracking errors, which can degrade the user experience and reduce the benefits of foveated rendering by causing misalignment and decreased visual quality. This paper introduces \textit{FovealNet}, an AI-driven gaze-tracking framework designed to optimize system performance by strategically enhancing gaze-tracking accuracy. To reduce the implementation cost of the gaze-tracking algorithm, FovealNet employs an event-based cropping method that eliminates over $64.8\%$ of irrelevant pixels from the input image. It also incorporates a simple yet effective token-pruning strategy that dynamically removes tokens on the fly without compromising tracking accuracy. Finally, to support different runtime rendering configurations, we propose a system-performance-aware multi-resolution training strategy, allowing the gaze-tracking DNN to adapt and optimize overall system performance more effectively. Evaluation results demonstrate that FovealNet achieves at least a $1.42\times$ speedup compared to previous methods and a 13\% increase in the perceptual quality of the foveated output.
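The abstract does not spell out FovealNet's pruning criterion, so the sketch below uses a stand-in score (the L2 norm of each token embedding) purely to illustrate dynamic token pruning: rank tokens, keep the top fraction, and preserve their original order.

```python
import torch

def prune_tokens(tokens, keep_ratio=0.5):
    """Drop low-importance tokens on the fly; importance here is a stand-in
    (L2 norm of each token embedding), not FovealNet's actual criterion."""
    # tokens: (batch, num_tokens, dim)
    scores = tokens.norm(dim=-1)                            # (batch, num_tokens)
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values  # keep original order
    batch_idx = torch.arange(tokens.shape[0]).unsqueeze(1)
    return tokens[batch_idx, idx]                           # (batch, k, dim)

x = torch.randn(2, 196, 64)         # e.g. 14x14 patch tokens from an eye image
print(prune_tokens(x, 0.35).shape)  # -> torch.Size([2, 68, 64])
```

Downstream transformer layers then process only the surviving tokens, so compute shrinks roughly in proportion to the pruning ratio.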
Unlocking Visual Secrets: Inverting Features with Diffusion Priors for Image Reconstruction
Zhang, Sai Qian, Li, Ziyun, Guo, Chuan, Mahloujifar, Saeed, Dangwal, Deeksha, Suh, Edward, De Salvo, Barbara, Liu, Chiao
Inverting visual representations within deep neural networks (DNNs) presents a challenging and important problem for security and privacy in deep learning. The goal is to reconstruct an unidentified target image from the features that a pre-trained DNN has extracted from it. Feature inversion is particularly significant for understanding the privacy leakage inherent in contemporary split DNN execution techniques, as well as in various applications based on extracted DNN features. In this paper, we explore the use of diffusion models, a promising technique for image synthesis, to enhance feature-inversion quality. We also investigate the potential of incorporating alternative forms of prior knowledge, such as textual prompts and cross-frame temporal correlations, to further improve the quality of inverted features. Our findings reveal that diffusion models can effectively leverage hidden information in DNN features, resulting in superior reconstruction performance compared to previous methods. This research offers valuable insights into how diffusion models can enhance privacy and security within applications that rely on DNN features.
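For orientation, the classical optimization view of feature inversion looks like the sketch below: start from noise and follow gradients until the candidate's features match the target's, with an image prior keeping the result natural. Here the prior is a simple total-variation penalty; the paper's contribution is, in effect, replacing such weak hand-crafted priors with a diffusion model (plus optional text and temporal cues).

```python
import torch

def invert_features(feature_extractor, target_feats, shape, prior_penalty,
                    steps=500, lr=0.05):
    """Baseline feature inversion: optimize an image so its features match
    the target's. The paper swaps the simple prior for a diffusion model."""
    x = torch.randn(shape, requires_grad=True)  # start from noise
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (feature_extractor(x) - target_feats).pow(2).mean()
        loss = loss + prior_penalty(x)          # image prior keeps x natural-looking
        loss.backward()
        opt.step()
    return x.detach()

def tv_penalty(x, weight=1e-4):
    """Total variation: a classical hand-crafted stand-in for a learned prior."""
    return weight * ((x[..., 1:, :] - x[..., :-1, :]).abs().mean()
                     + (x[..., :, 1:] - x[..., :, :-1]).abs().mean())
```

The stronger the prior, the more plausible detail the inversion can recover from the same features, which is exactly why diffusion priors amplify the privacy risk.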
DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation
Xiang, Jingyang, Zhang, Sai Qian
Rotating the activation and weight matrices to reduce the influence of outliers in large language models (LLMs) has recently attracted significant attention, particularly in the context of model quantization. Prior studies have shown that in low-precision quantization scenarios, such as 4-bit weights and 4-bit activations (W4A4), randomized Hadamard transforms can achieve significantly higher accuracy than randomized orthogonal transforms, yet the reason behind this phenomenon has remained unknown. In this paper, we find that both transforms are comparably effective at eliminating outliers for common tokens and yield similar quantization error there. The accuracy difference instead stems from tokens with massive activations: randomized Hadamard transforms slightly reduce their quantization error, while randomized orthogonal transforms increase it. Because these tokens are extremely rare yet critical to model accuracy, we frame this as a long-tail optimization problem and construct a simple yet effective method: a weighted loss function. Additionally, we propose an optimization strategy for the rotation matrix that alternates between optimizing the quantization parameters and refining the rotation matrix with orthogonal Procrustes transforms. This makes the distribution of the rotated activation values more conducive to quantization, especially for tokens with massive activations. By achieving this dual freedom, Outlier-Free and Massive-Activation-Free, our method is dubbed DFRot. Extensive experiments demonstrate the effectiveness and efficiency of DFRot: by tuning the rotation matrix with just a single sample, it achieves perplexity improvements of 0.25 and 0.21 on W4A4KV4 and W4A4KV16, respectively, for LLaMA3-8B, a model known for its quantization challenges.
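The Procrustes refinement step mentioned above has a textbook closed form: the orthogonal matrix minimizing $\|AR - B\|_F$ is $R = UV^\top$, where $U\Sigma V^\top$ is the SVD of $A^\top B$. The sketch below alternates it with a crude stand-in quantizer; the paper's actual quantizer, weighted loss, and optimization targets differ.

```python
import torch

def procrustes_refine(A, B):
    """Orthogonal Procrustes: the rotation R (R^T R = I) minimizing
    ||A @ R - B||_F, obtained in closed form from the SVD of A^T B."""
    U, _, Vh = torch.linalg.svd(A.T @ B)
    return U @ Vh

# Alternating refinement, loosely following the loop in the abstract.
# B stands in for the quantization-friendly target re-derived each round
# after updating the quantization parameters.
A = torch.randn(1024, 256)             # activations (tokens x hidden)
R = torch.eye(256)
for _ in range(10):
    B = torch.round((A @ R) * 4) / 4   # crude quantize-dequantize stand-in
    R = procrustes_refine(A, B)
```

Each iteration rotates the activations toward a configuration that the quantizer can represent with less error, which matters most for the rare massive-activation tokens.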
PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers in a resource-limited Context
Augustin, Maximilian, Sarwar, Syed Shakib, Elhoushi, Mostafa, Zhang, Sai Qian, Li, Yuecheng, De Salvo, Barbara
Following their success in natural language processing (NLP), there has been a shift towards transformer models in computer vision. While transformers perform well and offer promising multi-tasking performance, their high compute requirements mean that many resource-constrained applications still rely on convolutional or hybrid models that combine the benefits of convolution and attention layers, which achieve the best results in the sub-100M-parameter range. At the same time, task-adaptation techniques that allow a single shared transformer backbone to serve multiple downstream tasks, yielding large storage savings at negligible cost in performance, have not yet been adopted for hybrid transformers. In this work, we investigate how to achieve the best task-adaptation performance and introduce PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers. We further combine PETAH adaptation with pruning to achieve highly performant, storage-friendly models for multi-tasking. In our extensive evaluation on classification and other vision tasks, we demonstrate that our PETAH-adapted hybrid models outperform established task-adaptation techniques for ViTs while requiring fewer parameters and being more efficient on mobile hardware.
Neural Architecture Search of Hybrid Models for NPU-CIM Heterogeneous AR/VR Devices
Zhao, Yiwei, Li, Ziyun, Khwa, Win-San, Sun, Xiaoyu, Zhang, Sai Qian, Sarwar, Syed Shakib, Stangherlin, Kleber Hugo, Lu, Yi-Lun, Gomez, Jorge Tomas, Seo, Jae-Sun, Gibbons, Phillip B., De Salvo, Barbara, Liu, Chiao
Low-latency and low-power edge AI is essential for virtual reality and augmented reality applications. Recent advances show that hybrid models, combining convolution layers (CNNs) and transformers (ViTs), often achieve a superior accuracy/performance tradeoff on various computer vision and machine learning (ML) tasks. However, hybrid ML models can pose system challenges for latency and energy efficiency due to their diverse dataflow and memory access patterns. In this work, we leverage the architectural heterogeneity of Neural Processing Units (NPUs) and Compute-In-Memory (CIM) and use diverse execution schemes to run these hybrid models efficiently. We also introduce H4H-NAS, a neural architecture search framework that designs efficient hybrid CNN/ViT models for heterogeneous edge systems with both NPU and CIM. Our H4H-NAS approach is powered by a performance estimator built with NPU performance results measured on real silicon and CIM performance based on industry IPs. H4H-NAS searches hybrid CNN/ViT models at fine granularity and achieves a significant (up to 1.34%) top-1 accuracy improvement on the ImageNet dataset. Moreover, results from our Algo/HW co-design reveal up to 56.08% overall latency and 41.72% energy improvements from introducing such heterogeneous computing over baseline solutions. The framework guides the design of hybrid network architectures and system architectures for NPU+CIM heterogeneous systems.
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
Han, Zeyu, Gao, Chao, Liu, Jinyang, Zhang, Jeff, Zhang, Sai Qian
Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs: these models, often consisting of billions of parameters, require vast computational resources to execute. The expansive scale and computational demands pose considerable challenges in particular when customizing them for specific downstream tasks, especially on hardware platforms constrained by computational capabilities. Parameter-Efficient Fine-Tuning (PEFT) provides a practical solution by efficiently adapting large models to various downstream tasks. Specifically, PEFT refers to adjusting the parameters of a pre-trained large model to adapt it to a specific task or domain while minimizing the number of additional parameters introduced or the computational resources required. This approach is particularly important for large-scale language models with high parameter counts, as fine-tuning such models from scratch is computationally expensive and resource-intensive, posing considerable challenges for the supporting system platform design. In this survey, we present comprehensive studies of various PEFT algorithms, examining their performance and computational overhead. Moreover, we provide an overview of applications developed using different PEFT algorithms and discuss common techniques employed to mitigate the computation costs of PEFT. In addition to an extensive survey from the algorithmic standpoint, we also examine various real-world system designs to investigate the implementation costs associated with different PEFT approaches. This survey serves as an indispensable resource for researchers aiming to understand both PEFT algorithms and their system implementation, offering detailed ......
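As a concrete reference point, LoRA, one of the most widely used PEFT methods, freezes the pre-trained weight $W$ and learns only a low-rank update $\Delta W = BA$. A minimal PyTorch sketch, with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA, a canonical PEFT method: freeze W, learn a low-rank update BA,
    so the effective weight is W + (alpha / r) * B @ A."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pre-trained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init:
        self.scale = alpha / r               # the update starts at exactly zero

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 trainable vs 590592 frozen parameters
```

For the 768-dimensional layer above, the adapter trains 12,288 parameters against 590,592 in the frozen base layer, about 2%, which is the kind of saving that makes fine-tuning feasible on constrained hardware.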
HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization
Liu, Wenxuan, Zhang, Sai Qian
Diffusion Transformers (DiTs) have recently gained substantial attention in both industry and academia for their superior visual generation capabilities, outperforming traditional diffusion models that use U-Net. However, the enhanced performance of DiTs comes with high parameter counts and implementation costs, seriously restricting their use on resource-limited devices such as mobile phones. To address these challenges, we introduce Hybrid Floating-point Quantization for DiT (HQ-DiT), an efficient post-training quantization method that uses 4-bit floating-point (FP) precision for both weights and activations during DiT inference. Compared to fixed-point quantization (e.g., INT8), FP quantization, complemented by our proposed clipping-range selection mechanism, naturally aligns with the data distribution within DiT, resulting in minimal quantization error. Furthermore, HQ-DiT also implements a universal identity mathematical transform to mitigate the serious quantization error caused by outliers. The experimental results demonstrate that DiT can achieve extremely low-precision quantization (i.e., 4 bits) with negligible impact on performance. Our approach marks the first instance where both weights and activations in DiTs are quantized to just 4 bits, with only a 0.12 increase in sFID on ImageNet $256\times256$.
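To illustrate why a 4-bit floating-point grid suits bell-shaped weight and activation distributions, the sketch below fake-quantizes a tensor to the common E2M1 FP4 value set: the nonuniform spacing places more levels near zero, where most values lie. The clipping range here is a naive max, only a stand-in for HQ-DiT's selection mechanism.

```python
import torch

# E2M1 FP4 grid (one common 4-bit floating-point format): 8 magnitudes x sign.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_quantize(x, clip=None):
    """Fake-quantize x to FP4 by snapping to the nearest grid value.
    `clip` mimics a clipping-range choice; HQ-DiT selects it adaptively."""
    amax = clip if clip is not None else x.abs().max()
    scale = amax / 6.0                      # map the clip range onto the grid
    mag = (x.abs() / scale).clamp(max=6.0)
    # nearest grid point for every element
    idx = (mag.unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return torch.sign(x) * FP4_GRID[idx] * scale

w = torch.randn(256, 256)
err = (fp4_quantize(w) - w).pow(2).mean()
print(f"MSE: {err:.5f}")
```

Against a uniform INT4 grid over the same clip range, the denser levels near zero typically yield lower mean squared error on Gaussian-like data, which is the distributional fit the abstract refers to.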
DLoRA: Distributed Parameter-Efficient Fine-Tuning Solution for Large Language Model
Gao, Chao, Zhang, Sai Qian
To enhance the performance of large language models (LLMs) on downstream tasks, one solution is to fine-tune certain LLM parameters so that the model better aligns with the characteristics of the training dataset. This process is commonly known as parameter-efficient fine-tuning (PEFT). Due to the scale of LLMs, PEFT operations are usually executed in a public environment (e.g., a cloud server). This necessitates sharing sensitive user data across public environments, raising potential privacy concerns. To tackle these challenges, we propose a distributed PEFT framework called DLoRA, which enables scalable PEFT operations to be performed collaboratively between the cloud and user devices. Coupled with the proposed Kill and Revive algorithm, our evaluation results demonstrate that DLoRA can significantly reduce the computation and communication workload on user devices while achieving superior accuracy and privacy protection.
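Below is a minimal sketch of the cloud/device split the abstract describes (an illustration of the idea, not the paper's code): the frozen backbone runs in the cloud, the trainable low-rank adapters stay on the user device, and only activations cross the boundary. The Kill and Revive scheduling of adapters is not shown.

```python
import torch
import torch.nn as nn

class CloudBackbone(nn.Module):
    """Frozen pre-trained layer, hosted in the cloud."""
    def __init__(self, d=768):
        super().__init__()
        self.layer = nn.Linear(d, d)
        for p in self.parameters():
            p.requires_grad = False          # never updated during PEFT

    def forward(self, h):
        return self.layer(h)

class DeviceAdapter(nn.Module):
    """Trainable low-rank adapter that stays on the user device,
    so gradients over private data never leave it."""
    def __init__(self, d=768, r=8):
        super().__init__()
        self.A = nn.Linear(d, r, bias=False)
        self.B = nn.Linear(r, d, bias=False)

    def forward(self, h_cloud, h_local):
        return h_cloud + self.B(self.A(h_local))   # add low-rank correction

cloud, device = CloudBackbone(), DeviceAdapter()
h = torch.randn(4, 768)
out = device(cloud(h), h)   # only activations cross the cloud/device boundary
```

Because the device holds only the small adapter matrices, its compute and communication costs scale with the rank r rather than with the full model size.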