Yang, Jianchao
HWPQ: Hessian-free Weight Pruning-Quantization For LLM Compression And Acceleration
Kang, Yuhan, Luo, Zhongdi, Wen, Mei, Shi, Yang, He, Jun, Yang, Jianchao, Xue, Zeyu, Feng, Jing, Liu, Xinwang
Large Language Models (LLMs) have achieved remarkable success across numerous domains. However, the high time complexity of existing pruning and quantization methods significantly hinders their effective deployment on resource-constrained consumer or edge devices. In this study, we propose a novel Hessian-free Weight Pruning-Quantization (HWPQ) method. HWPQ eliminates the need for computationally intensive Hessian matrix calculations by introducing a contribution-based weight metric, which evaluates the importance of weights without relying on second-order derivatives. Additionally, we employ the Exponentially Weighted Moving Average (EWMA) technique to bypass weight sorting, enabling the selection of weights that contribute most to LLM accuracy and further reducing time complexity. Our approach is extended to support 2:4 structured sparsity pruning, facilitating efficient execution on modern hardware accelerators. Experimental results demonstrate that HWPQ significantly enhances the compression performance of LLaMA2. Compared to state-of-the-art quantization and pruning frameworks, HWPQ achieves average speedups of 5.97x (up to 20.75x) in quantization time and 12.29x (up to 56.02x) in pruning time, while largely preserving model accuracy. Furthermore, we observe a 1.50x inference speedup compared to the baseline.
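The abstract does not spell out the contribution-based metric, but the mechanics it names (a sorting-free, EWMA-tracked importance score and 2:4 structured sparsity) can be illustrated with a minimal NumPy sketch. This is an illustrative reconstruction, not HWPQ's actual implementation: the saliency here is a stand-in (plain weight magnitude), the function names are hypothetical, and the weight count is assumed divisible by four.

import numpy as np

def prune_2_4(weights, saliency=None):
    """Illustrative 2:4 structured pruning: in every group of four
    consecutive weights, keep the two with the highest saliency and zero
    the other two. Saliency defaults to plain magnitude here; HWPQ's
    contribution-based metric is not reproduced."""
    w = np.asarray(weights, dtype=float)
    s = np.abs(w) if saliency is None else np.asarray(saliency, dtype=float)
    groups_w = w.reshape(-1, 4)        # assumes w.size is divisible by 4
    groups_s = s.reshape(-1, 4)
    drop = np.argsort(groups_s, axis=1)[:, :2]   # two least important per group
    mask = np.ones_like(groups_w)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups_w * mask).reshape(w.shape)

def ewma_threshold(saliencies, alpha=0.1):
    """Illustrative sorting-free selection: an exponentially weighted
    moving average over a stream of saliency scores, usable as a running
    keep/drop threshold in place of a global sort (alpha is assumed)."""
    mu = saliencies[0]
    for s in saliencies[1:]:
        mu = alpha * s + (1.0 - alpha) * mu
    return mu

In the 2:4 pattern, every group of four consecutive weights keeps its two most salient entries, which is the layout that recent GPU sparse tensor cores can execute efficiently.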
Slimmable Neural Networks
Yu, Jiahui, Yang, Linjie, Xu, Ning, Yang, Jianchao, Huang, Thomas
Instead of training individual networks with different width configurations, we train a shared network with switchable batch normalization. At runtime, the network can adjust its width on the fly according to on-device benchmarks and resource constraints, rather than downloading and offloading different models. Our trained networks, named slimmable neural networks, achieve ImageNet classification accuracy similar to (and in many cases better than) individually trained models of MobileNet v1, MobileNet v2, ShuffleNet and ResNet-50 at the corresponding widths. We also demonstrate better performance of slimmable models compared with individual ones across a wide range of applications, including COCO bounding-box object detection, instance segmentation, and person keypoint detection, without tuning hyper-parameters. Lastly, we visualize and discuss the learned features of slimmable networks. Deep neural networks are increasingly deployed in applications on mobile phones, augmented reality devices, and autonomous cars, many of which require short response times. Towards this goal, manually designed lightweight networks (Howard et al., 2017; Zhang et al., 2017; Sandler et al., 2018) have been proposed with low computational complexity and small memory footprints.
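The core mechanism, switchable batch normalization over a weight-shared layer, can be sketched in a few lines of PyTorch. This is an illustrative reconstruction rather than the authors' code; the layer name and the width list are assumptions.

import torch
import torch.nn as nn

class SlimmableConvBN(nn.Module):
    """Illustrative slimmable conv + switchable BN: the layer is built at
    the widest configuration, and at runtime only the first out_ch filters
    are evaluated, with each width keeping its own BatchNorm statistics."""
    def __init__(self, in_ch_max, out_ch_max, width_mults=(0.25, 0.5, 0.75, 1.0)):
        super().__init__()
        self.width_mults = width_mults
        self.conv = nn.Conv2d(in_ch_max, out_ch_max, kernel_size=3, padding=1)
        # one BatchNorm per switchable width
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(int(out_ch_max * m)) for m in width_mults
        )

    def forward(self, x, width_mult):
        idx = self.width_mults.index(width_mult)
        out_ch = self.bns[idx].num_features
        in_ch = x.shape[1]
        # slice the shared weights to the active width
        w = self.conv.weight[:out_ch, :in_ch]
        b = self.conv.bias[:out_ch]
        y = nn.functional.conv2d(x, w, b, padding=1)
        return self.bns[idx](y)

Because every width shares the same convolution weights but keeps its own batch-norm statistics, switching widths at runtime only changes which slice of the filters is evaluated.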
YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark
Xu, Ning, Yang, Linjie, Fan, Yuchen, Yue, Dingcheng, Liang, Yuchen, Yang, Jianchao, Huang, Thomas
Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods that capture temporal dependency for segmentation have to depend on pretrained optical flow models, leading to suboptimal solutions for the problem. End-to-end sequential learning to explore spatial-temporal features for video segmentation is largely limited by the scale of available video segmentation datasets; even the largest existing dataset contains only 90 short video clips. To solve this problem, we build a new large-scale video object segmentation dataset called the YouTube Video Object Segmentation dataset (YouTube-VOS). Our dataset contains 4,453 YouTube video clips and 94 object categories. To our knowledge, this is by far the largest video object segmentation dataset, and it has been released at http://youtube-vos.org. We further evaluate several existing state-of-the-art video object segmentation algorithms on this dataset to establish baselines for the development of new algorithms in the future. Keywords: Video object segmentation, Large-scale dataset, Benchmark.
Learning from Noisy Labels with Distillation
Li, Yuncheng, Yang, Jianchao, Song, Yale, Cao, Liangliang, Luo, Jiebo, Li, Li-Jia
The ability to learn from noisy labels is very useful in many visual recognition tasks, as a vast amount of data with noisy labels is relatively easy to obtain. Traditionally, label noise has been treated as statistical outliers, and approaches such as importance re-weighting and bootstrapping have been proposed to alleviate the problem. According to our observation, real-world noisy labels exhibit multi-mode characteristics like the true labels, rather than behaving like independent random outliers. In this work, we propose a unified distillation framework that uses side information, including a small clean dataset and label relations in a knowledge graph, to "hedge the risk" of learning from noisy labels. Furthermore, unlike traditional approaches evaluated on simulated label noise, we propose a suite of new benchmark datasets, in the Sports, Species and Artifacts domains, to evaluate the task of learning from noisy labels in a practical setting. The empirical study demonstrates the effectiveness of our proposed method in all the domains.
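One way to read the "hedge the risk" idea is as a convex combination of the noisy label with the soft prediction of a teacher trained on the small clean set, with the student trained against that combined target. The sketch below is an assumption-laden illustration: the mixing weight and function names are hypothetical, and the knowledge-graph side information is simplified away.

import torch
import torch.nn.functional as F

def distilled_target(noisy_onehot, teacher_logits, lam=0.7):
    """Illustrative hedged target: a convex combination of the noisy label
    and the soft prediction of a teacher trained on the small clean set
    (lam is an assumed hyper-parameter, not the paper's value)."""
    soft = F.softmax(teacher_logits, dim=-1)
    return lam * noisy_onehot + (1.0 - lam) * soft

def distillation_loss(student_logits, noisy_onehot, teacher_logits, lam=0.7):
    """Cross-entropy of the student against the hedged target."""
    target = distilled_target(noisy_onehot, teacher_logits, lam)
    log_probs = F.log_softmax(student_logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()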
Building a Large Scale Dataset for Image Emotion Recognition: The Fine Print and The Benchmark
You, Quanzeng (University of Rochester) | Luo, Jiebo (University of Rochester) | Jin, Hailin (Adobe Research) | Yang, Jianchao (Snapchat Inc)
Psychological research has confirmed that people can have different emotional reactions to different visual stimuli. Several papers have been published on the problem of visual emotion analysis. In particular, attempts have been made to analyze and predict people's emotional reactions towards images. To this end, different kinds of hand-tuned features have been proposed. The results reported on several carefully selected and labeled small image data sets have confirmed the promise of such features. While the recent successes of many computer vision tasks are due to the adoption of Convolutional Neural Networks (CNNs), visual emotion analysis has not achieved the same level of success. This may be primarily due to the unavailability of confidently labeled, relatively large image data sets for visual emotion analysis. In this work, we introduce a new data set, which started from 3+ million weakly labeled images of different emotions and ended up being 30 times as large as the current largest publicly available visual emotion data set. We hope that this data set encourages further research on visual emotion analysis. We also perform extensive benchmarking analyses on this large data set using state-of-the-art methods, including CNNs.
Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks
You, Quanzeng (University of Rochester) | Luo, Jiebo (University of Rochester) | Jin, Hailin (Adobe Research) | Yang, Jianchao (Adobe Research)
Sentiment analysis of online user-generated content is important for many social media analytics tasks. Researchers have largely relied on textual sentiment analysis to develop systems to predict political elections, measure economic indicators, and so on. Recently, social media users have increasingly been using images and videos to express their opinions and share their experiences. Sentiment analysis of such large-scale visual content can help better extract user sentiments toward events or topics, such as those in image tweets, so that prediction of sentiment from visual content is complementary to textual sentiment analysis. Motivated by the need to leverage large-scale yet noisy training data to solve the extremely challenging problem of image sentiment analysis, we employ Convolutional Neural Networks (CNNs). We first design a suitable CNN architecture for image sentiment analysis. We then obtain half a million training samples by using a baseline sentiment algorithm to label Flickr images. To make use of such noisy machine-labeled data, we employ a progressive strategy to fine-tune the deep network. Furthermore, we improve the performance on Twitter images by inducing domain transfer with a small number of manually labeled Twitter images. We have conducted extensive experiments on manually labeled Twitter images, and the results show that the proposed CNN achieves better performance in image sentiment analysis than competing algorithms.
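The progressive fine-tuning strategy is described only at a high level; one plausible reading is to keep, at each stage, the machine-labeled samples the current model already agrees with at high confidence and fine-tune on that cleaner subset. The sketch below is a hypothetical illustration of that reading, not the paper's exact selection rule, and the threshold is assumed.

import torch
import torch.nn.functional as F

def select_confident_samples(model, images, machine_labels, thresh=0.9):
    """Illustrative progressive-training step: keep only the samples on
    which the current model agrees with the machine-generated (noisy)
    label at high confidence; the next fine-tuning round then uses this
    cleaner subset. Selection rule and threshold are assumptions."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(images), dim=-1)
    conf, pred = probs.max(dim=-1)
    keep = (pred == machine_labels) & (conf >= thresh)
    return images[keep], machine_labels[keep]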
Scale Adaptive Blind Deblurring
Zhang, Haichao, Yang, Jianchao
The presence of noise and small-scale structures empirically leads to large kernel estimation errors in blind image deblurring, if not total failure. We present a scale-space perspective on blind deblurring algorithms and introduce a cascaded scale-space formulation for blind deblurring. This new formulation suggests a natural approach that is robust to noise and small-scale structures, by tying the estimation across multiple scales and balancing the contributions of different scales automatically through learning from data. The proposed formulation also handles non-uniform blur with a straightforward extension. Experiments are conducted on both a benchmark dataset and real-world images to validate the effectiveness of the proposed method. One surprising finding based on our approach is that blur kernel estimation is not necessarily best at the finest scale.
Data Clustering by Laplacian Regularized L1-Graph
Yang, Yingzhen (University of Illinois at Urbana-Champaign) | Wang, Zhangyang (University of Illinois at Urbana-Champaign) | Yang, Jianchao (Adobe Research) | Wang, Jiangping (University of Illinois at Urbana-Champaign) | Chang, Shiyu (University of Illinois at Urbana-Champaign) | Huang, Thomas S (University of Illinois at Urbana-Champaign)
L1-Graph has been proven effective in data clustering; it partitions the data space by using the sparse representation of the data as the similarity measure. However, the sparse representation is computed for each datum separately, without taking into account the geometric structure of the data. Motivated by L1-Graph and manifold learning, we propose Laplacian Regularized L1-Graph (LRℓ1-Graph) for data clustering. The sparse representations of LRℓ1-Graph are regularized by the geometric information of the data through the graph Laplacian, so that, in accordance with the manifold assumption, they vary smoothly along the geodesics of the data manifold. Moreover, we propose an iterative regularization scheme, in which the sparse representation obtained from the previous iteration is used to build the graph Laplacian for the current iteration of regularization. Experimental results on real data sets demonstrate the superiority of our algorithm compared to L1-Graph and other competing clustering methods.
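The description above matches the general form of Laplacian-regularized sparse coding; as a hedged illustration (the notation and the exact constraint are ours, not copied from the paper), the objective over the sparse codes can be written as

\min_{A}\; \sum_{i=1}^{n} \left( \lVert x_i - X\alpha_i \rVert_2^2 + \lambda \lVert \alpha_i \rVert_1 \right) + \gamma\, \operatorname{tr}\!\left( A L A^{\top} \right), \qquad \text{s.t. } \alpha_{ii} = 0,

where A = [\alpha_1, \dots, \alpha_n] stacks the sparse codes, X is the data matrix, and L is the graph Laplacian. The trace term encourages codes of nearby points on the data manifold to stay close, and in the iterative scheme L itself is rebuilt from the previous iteration's codes.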