Advancing Video Anomaly Detection: A Concise Review and a New Dataset
Arjun Raj

Neural Information Processing Systems

Video Anomaly Detection (VAD) finds widespread applications in security surveillance, traffic monitoring, industrial monitoring, and healthcare. Despite extensive research efforts, there remains a lack of concise reviews that provide insightful guidance for researchers. Such reviews would serve as quick references to grasp current challenges, research trends, and future directions.


Supplementary Material of HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Neural Information Processing Systems

Details of the Model Architecture. The detailed architecture of the generator and MPD is depicted in Figure 4. The configurations of the three generator variants are listed in Table 5. In the ResBlock of V1 and V2, 2 convolution layers and 1 residual connection are stacked 3 times; in the ResBlock of V3, 1 convolution layer and 1 residual connection are stacked 2 times. V3 therefore consists of far fewer layers than V1 and V2. Periodic Signal Discrimination Experiments. To verify the ability of MPD to discriminate periodic signals, we conducted additional experiments in which a discriminator was trained on a simple dataset.
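The core mechanism behind MPD's periodic discrimination is reshaping the 1D waveform into a 2D grid whose width equals the period, so that samples that are one period apart line up in the same column. A minimal numpy sketch of that reshaping step (the padding scheme and period value here are illustrative simplifications, not the paper's exact implementation):

```python
import numpy as np

def reshape_for_period(wav: np.ndarray, period: int) -> np.ndarray:
    """Reshape a 1D waveform of length T into a 2D (T/p, p) array so
    that a 2D discriminator sees samples spaced `period` steps apart
    stacked in the same column. Zero-pads the tail if T is not
    divisible by the period (a simplification; other padding modes
    are possible)."""
    t = len(wav)
    if t % period != 0:
        wav = np.pad(wav, (0, period - (t % period)))
    return wav.reshape(-1, period)

# A sine with period 8 produces identical rows after reshaping
# with p = 8, making the periodicity trivially visible in 2D.
x = np.sin(2 * np.pi * np.arange(64) / 8)
grid = reshape_for_period(x, 8)
```

Because each sub-discriminator uses a different prime period, the set of reshaped views jointly covers many periodic structures in the audio.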


HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Neural Information Processing Systems

Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU.


c5d736809766d46260d816d8dbc9eb44-AuthorFeedback.pdf

Neural Information Processing Systems

We thank all the reviewers for their valuable comments. R1: We would like to clarify the behavior of training when the model was trained without the mel-spectrogram loss. We also think that applying the L1/L2 loss incurs no disadvantage in a one-to-one mapping setting such as ours. We will clarify the details of the experiments in Section 3. Table 1: Mean Opinion Scores. Specifically, we trained MelGAN up to 500k steps and compared it with our models, all of which were likewise trained up to 500k steps. MOS evaluation results are shown in Table 1.


We thank all reviewers for their positive reception of our paper and for their constructive feedback

Neural Information Processing Systems

We thank all reviewers for their positive reception of our paper and for their constructive feedback. On dual norms and prior work: Thank you for pointing us to the relevant prior work of Demontis et al. and Xu et al., which we had missed. We will discuss the connections between our work and theirs in the revision. Nevertheless, although MNIST is the only vision dataset for which we have been able to train models to high levels of robustness, MNIST is clearly not solved from an adversarial robustness perspective. We think this is an interesting open problem for the community to consider.


ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation

Neural Information Processing Systems

Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos. Despite their apparent relevance and potential complementarity, these two problems have been investigated as separate and distinct tasks. In this work, we tackle these two problems, action segmentation and action anticipation, jointly using a unified diffusion model dubbed ActFusion. The key idea of the unification is to train the model to handle both the visible and invisible parts of the sequence in an integrated manner: the visible part is for temporal segmentation, and the invisible part is for future anticipation. To this end, we introduce a new anticipative masking strategy during training in which the late part of the video frames is masked as invisible, and learnable tokens replace these frames so the model learns to predict the invisible future. Experimental results demonstrate the bi-directional benefits between action segmentation and anticipation. ActFusion achieves state-of-the-art performance across the standard benchmarks of 50 Salads, Breakfast, and GTEA, outperforming task-specific models on both tasks with a single unified model through joint learning.
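The anticipative masking strategy can be sketched as a simple sequence operation: the trailing fraction of frame features is replaced by a mask token, and a boolean mask records which frames are invisible. A minimal numpy illustration (the masking ratio and the fixed token here are stand-ins; the paper uses learnable tokens inside a diffusion model):

```python
import numpy as np

def anticipative_mask(frames: np.ndarray, mask_token: np.ndarray,
                      ratio: float) -> tuple:
    """Mask the last `ratio` fraction of frames as the invisible future.
    Returns the masked sequence and a boolean mask over frames
    (True = invisible, to be predicted by the model)."""
    t = frames.shape[0]
    n_visible = int(round(t * (1.0 - ratio)))
    invisible = np.zeros(t, dtype=bool)
    invisible[n_visible:] = True
    out = frames.copy()
    out[invisible] = mask_token  # broadcast the token over masked frames
    return out, invisible

rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 4))  # 10 frames, 4-dim features
token = np.zeros(4)                # stand-in for a learnable token
masked, invis = anticipative_mask(frames, token, ratio=0.3)
```

At a masking ratio of 0, the task reduces to pure segmentation of a fully visible sequence; as the ratio grows, the objective shifts toward anticipation, which is how one model covers both tasks.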


A Theoretical Understanding of Self-Correction through In-context Alignment Zeming Wei

Neural Information Processing Systems

Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, as seen in models like OpenAI o1. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way.


Copula Multi-label Learning

Neural Information Processing Systems

A formidable challenge in multi-label learning is to model the interdependencies between labels and features. Unfortunately, the statistical properties of existing multi-label dependency models are still not well understood. Copulas are a powerful tool for modeling dependence in multivariate data and have achieved great success in a wide range of applications, such as finance, econometrics, and systems neuroscience. This inspires us to develop a novel copula multi-label learning paradigm for modeling label and feature dependencies. The copula-based paradigm enables us to reveal new statistical insights in multi-label learning. In particular, the paper first leverages the kernel trick to construct a continuous distribution in the output space, and then estimates our proposed model semiparametrically: the copula is modeled parametrically, while the marginal distributions are modeled nonparametrically. Theoretically, we show that our estimator is unbiased and consistent and follows an asymptotically normal distribution. Moreover, we bound the mean squared error of the estimator. Experimental results from various domains validate the superiority of our proposed approach.
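The semiparametric recipe described above (nonparametric marginals, parametric copula) can be illustrated with the simplest case, a bivariate Gaussian copula: the marginals are replaced by rank-based pseudo-observations, and only the copula's correlation parameter is estimated parametrically. This is a generic sketch of the idea, not the paper's multi-label estimator:

```python
import numpy as np
from statistics import NormalDist

def gaussian_copula_corr(x: np.ndarray, y: np.ndarray) -> float:
    """Semiparametric Gaussian copula dependence estimate:
    marginals handled nonparametrically via rank-based
    pseudo-observations, copula correlation estimated
    parametrically on the resulting normal scores."""
    n = len(x)
    # empirical-CDF pseudo-observations, rescaled into (0, 1)
    u = (np.argsort(np.argsort(x)) + 1) / (n + 1)
    v = (np.argsort(np.argsort(y)) + 1) / (n + 1)
    nd = NormalDist()
    zu = np.array([nd.inv_cdf(p) for p in u])
    zv = np.array([nd.inv_cdf(p) for p in v])
    return float(np.corrcoef(zu, zv)[0, 1])

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + 0.6 * rng.normal(size=500)   # correlated pair
rho = gaussian_copula_corr(np.exp(x), y)    # exp is monotone: ranks unchanged
```

Because the marginals enter only through ranks, the estimate is invariant to monotone transformations of each variable, which is exactly the robustness that motivates modeling marginals nonparametrically.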


A Additional Implementation Details

Neural Information Processing Systems

We comprehensively discuss the GNN model specifications in Section 5, the task specifications in Section 6, and the key experimental setup in Section 7.2. We use L2 normalization after the final GNN layer to stabilize training, and do not use Laplacian normalization during message passing. We use a cosine learning rate schedule (annealed to 0, no restarts) and an L2 weight decay of 0.0005 for regularization. The SGD optimizer uses a momentum of 0.9.


Design Space for Graph Neural Networks

Neural Information Processing Systems

The rapid evolution of Graph Neural Networks (GNNs) has led to a growing number of new architectures as well as novel applications. However, current research focuses on proposing and evaluating specific architectural designs of GNNs, such as GCN, GIN, or GAT, as opposed to studying the more general design space of GNNs, which consists of a Cartesian product of different design dimensions, such as the number of layers or the type of aggregation function. Additionally, GNN designs are often specialized to a single task, and few efforts have been made to understand how to quickly find the best GNN design for a novel task or dataset. Here we define and systematically study the architectural design space for GNNs, which consists of 315,000 different designs over 32 different predictive tasks. Our approach features three key innovations: (1) a general GNN design space; (2) a GNN task space with a similarity metric, so that for a given novel task/dataset, we can quickly identify and transfer the best-performing architecture; (3) an efficient and effective design space evaluation method that allows insights to be distilled from a huge number of model-task combinations. Our key results include: (1) a comprehensive set of guidelines for designing well-performing GNNs; (2) while the best GNN designs for different tasks vary significantly, the GNN task space allows for transferring the best designs across different tasks; (3) models discovered using our design space achieve state-of-the-art performance. Overall, our work offers a principled and scalable approach to transition from studying individual GNN designs for specific tasks to systematically studying the GNN design space and the task space. Finally, we release GraphGym, a powerful platform for exploring different GNN designs and tasks.
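The "Cartesian product of design dimensions" framing is concrete enough to sketch directly: each design is one combination of choices across all dimensions, and the design space size is the product of the dimension cardinalities. The dimensions and values below are illustrative placeholders, not the paper's full 315,000-design space:

```python
from itertools import product

# Hypothetical design dimensions (illustrative values only; the
# paper's space has more dimensions, e.g. dropout, skip connections,
# layer connectivity, training hyperparameters).
design_dims = {
    "num_layers":  [2, 4, 6, 8],
    "aggregation": ["mean", "max", "sum"],
    "activation":  ["relu", "prelu", "swish"],
    "batch_norm":  [True, False],
}

# Every GNN design in this toy space is one dict of choices.
designs = [dict(zip(design_dims, combo))
           for combo in product(*design_dims.values())]
```

Enumerating the product makes the scale argument obvious: even four small dimensions yield 4 × 3 × 3 × 2 = 72 designs, so a realistic space quickly reaches hundreds of thousands of combinations, which is why the paper needs an efficient evaluation method rather than exhaustive training.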