Supplementary Material for A Benchmark Dataset for Event-Guided Human Pose Estimation and Tracking in Extreme Conditions
This supplementary material contains the details we could not include in the main paper. Section A covers the implementation details, Section B presents additional experiments, and Section C describes the detailed annotation process. Lastly, we describe the license and ethical considerations in Section D.
A Benchmark Dataset for Event-Guided Human Pose Estimation and Tracking in Extreme Conditions
Multi-person pose estimation and tracking have been actively researched by the computer vision community due to their practical applicability. However, existing human pose estimation and tracking datasets have only been successful in typical scenarios, such as those without motion blur and with good lighting. These RGB-based datasets provide little supervision for extreme motion blur or poor lighting conditions, making models trained on them inherently vulnerable to such scenarios.
We implement a pipeline between data downloading and data ingestion to accelerate training. After computing its gradients, a worker directly sends the gradients, together with its token, back to the PS in a non-blocking way. Consequently, fast workers ingest much more data than straggling workers. When a worker recovers from a failure, it drops its previous state (e.g., the data in its batch buffer and its token) and proceeds to the next batch. The disappearance of a specific token does not affect the correctness or efficiency of GBA.
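The non-blocking, token-tagged push described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class names (`ParameterServer`, `Worker`), the use of a queue as the PS inbox, and the toy gradient computation are all assumptions made for clarity.

```python
import queue
import uuid

class ParameterServer:
    """Minimal sketch of a PS that accepts token-tagged gradient pushes."""
    def __init__(self):
        self.inbox = queue.Queue()
        self.applied = []

    def push(self, token, grad):
        # Non-blocking from the worker's perspective: just enqueue and return.
        self.inbox.put((token, grad))

    def apply_pending(self):
        # Drain whatever has arrived so far. A token that vanished (its
        # worker failed and re-registered) simply never shows up here,
        # so its disappearance cannot affect correctness.
        while not self.inbox.empty():
            token, grad = self.inbox.get()
            self.applied.append((token, grad))

class Worker:
    def __init__(self, ps):
        self.ps = ps
        self.token = uuid.uuid4().hex   # identifies this worker's in-flight state
        self.batch_buffer = []

    def train_step(self, batch):
        grad = sum(batch) * 0.01        # stand-in for real gradient computation
        self.ps.push(self.token, grad)  # returns immediately (non-blocking)

    def recover(self):
        # On failure recovery: drop buffered data and the old token,
        # then continue with fresh batches under a new token.
        self.batch_buffer.clear()
        self.token = uuid.uuid4().hex
```

Because `push` only enqueues, a fast worker can immediately fetch its next batch while stragglers are still computing, which is what lets fast workers ingest more data.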
GBA: A Tuning-free Approach to Switch between Synchronous and Asynchronous Training for Recommendation Models
High-concurrency asynchronous training upon parameter server (PS) architecture and high-performance synchronous training upon all-reduce (AR) architecture are the most commonly deployed distributed training modes for recommendation models. Although synchronous AR training is designed for higher training efficiency, asynchronous PS training is the better choice for training speed when there are stragglers (slow workers) in the shared cluster, especially under limited computing resources. An ideal way to take full advantage of these two training modes is to switch between them according to the cluster status. However, switching training modes often requires tuning hyper-parameters, which is extremely time- and resource-consuming. We find two obstacles to a tuning-free approach: the different distributions of the gradient values and the stale gradients from the stragglers.
To Err Like Human: Affective Bias-Inspired Measures for Visual Emotion Recognition Evaluation Jufeng Yang
Accuracy is a commonly adopted performance metric in various classification tasks, which measures the proportion of correctly classified samples among all samples. It assumes equal importance for all classes, hence equal severity for all misclassifications. However, in emotion classification, due to the psychological similarities between emotions, misclassifying a certain emotion into one class may be more severe than into another; e.g., misclassifying 'excitement' as 'anger' is apparently more severe than as 'awe'. Although highly meaningful for many applications, metrics capable of measuring these cases of misclassification in visual emotion recognition tasks have yet to be explored. In this paper, based on Mikel's emotion wheel from psychology, we propose a novel approach for evaluating performance in visual emotion recognition, which takes into account the distance on the emotion wheel between different emotions to mimic the psychological nuances of emotions. Experimental results in semi-supervised learning on emotion recognition and a user study show that our proposed metric is more effective than accuracy for assessing performance and conforms to the cognitive laws of human emotions.
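A distance-aware evaluation of the kind described above can be sketched as follows. This is an illustrative toy, not the paper's metric: the exact arrangement of the eight emotions follows Mikel's emotion wheel in the paper and may differ from the ordering assumed here, and the linear penalty is an assumption.

```python
# Hypothetical ordering of eight emotions around a wheel; the true
# arrangement is given by Mikel's emotion wheel and may differ.
WHEEL = ["amusement", "awe", "contentment", "excitement",
         "anger", "disgust", "fear", "sadness"]

def wheel_distance(a, b):
    """Shortest circular distance between two emotions on the wheel."""
    i, j = WHEEL.index(a), WHEEL.index(b)
    d = abs(i - j)
    return min(d, len(WHEEL) - d)

def distance_aware_score(y_true, y_pred):
    """Score in [0, 1]: each prediction is penalized in proportion to
    its wheel distance from the ground truth, so 'near misses' between
    psychologically similar emotions cost less than distant ones."""
    max_d = len(WHEEL) // 2  # largest possible circular distance
    penalties = [wheel_distance(t, p) / max_d for t, p in zip(y_true, y_pred)]
    return 1.0 - sum(penalties) / len(penalties)
```

Under this scheme, plain accuracy corresponds to replacing the graded penalty with a 0/1 penalty; the wheel-based version instead mirrors the intuition that confusing 'excitement' with a nearby emotion is less severe than with a distant one.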
Supplementary Material of A Unified Conditional Framework for Diffusion-based Image Restoration
For all tasks, we adopt a UNet architecture similar to the one described in DvSR [4]. The input feature map is expanded to 64 channels. There are five stages in both the encoder and decoder, and each stage contains two diffusion model blocks. Between each pair of encoder stages, the input resolution is downsampled by a convolution layer with stride 2 and the channels are expanded by a factor of 2. Conversely, in each decoder stage, the feature-map resolution and the channel count are restored by nearest-neighbor upsampling and a convolution layer, respectively. During training, we use a linear noise schedule with a total of T = 2000 steps.
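The per-stage channel and resolution schedule implied by this description can be sketched as follows. This is a minimal sketch under stated assumptions: the input resolution of 256 is hypothetical (the text does not specify it), and only the shape bookkeeping is shown, not the diffusion model blocks themselves.

```python
def unet_stage_shapes(base_channels=64, input_res=256, num_stages=5):
    """Per-stage (channels, resolution) of the encoder described above:
    the input is first expanded to `base_channels`, then between each
    pair of stages a stride-2 convolution halves the resolution while
    the channel count doubles. The decoder mirrors this schedule in
    reverse via nearest-neighbor upsampling plus a convolution."""
    shapes = []
    c, r = base_channels, input_res
    for _ in range(num_stages):
        shapes.append((c, r))
        c, r = c * 2, r // 2  # stride-2 downsample, 2x channel expansion
    return shapes
```

For a (hypothetical) 256-pixel input this yields encoder stages of (64, 256), (128, 128), (256, 64), (512, 32), and (1024, 16); the decoder traverses the same list backwards.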
A Unified Conditional Framework for Diffusion-based Image Restoration 1
Diffusion Probabilistic Models (DPMs) have recently shown remarkable performance in image generation tasks and are capable of generating highly realistic images. When adopting DPMs for image restoration tasks, the crucial aspect lies in how to integrate the conditional information to guide the DPMs toward accurate and natural output, which has been largely overlooked in existing works. In this paper, we present a unified conditional framework based on diffusion models for image restoration. We leverage a lightweight UNet to predict initial guidance and the diffusion model to learn the residual of that guidance. By carefully designing the basic module and integration module for the diffusion model block, we integrate the guidance and other auxiliary conditional information into every block of the diffusion model to achieve spatially adaptive generation conditioning. To handle high-resolution images, we propose a simple yet effective inter-step patch-splitting strategy to produce arbitrary-resolution images without grid artifacts. We evaluate our conditional framework on three challenging tasks: extreme low-light denoising, deblurring, and JPEG restoration, demonstrating significant improvements in perceptual quality and generalization across restoration tasks.