Reformulating Zero-shot Action Recognition for Multi-label Actions (Supplementary Material)

Neural Information Processing Systems

Standard video models expect frames with equal height and width, so we crop a square region around the actor and resize it to the network-specific dimensions (112 × 112). We present some examples of AVA video frames with their annotations as well as the generated crops in Figure 1. This square crop can cause multiple actors to appear within one clip, as seen in the second example, but it ensures the aspect ratio of the person is not altered, which is necessary because this is the manner in which the video model is trained.

Figure 1: Example of original ground-truth bounding boxes (left) in the AVA dataset, with the cropped actors on the right.

For PS-ZSAR, prediction confidences are obtained from the softmax probabilities output by our pair-wise similarity function.
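The square-crop step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `square_crop_box`, the choice of the longer box side as the crop size, and the policy of shifting the crop back inside the frame are all assumptions, since the excerpt does not say how boxes near the frame border are handled. The subsequent resize to 112 × 112 is left to whatever image library is in use.

```python
def square_crop_box(box, frame_w, frame_h):
    """Return a square (left, top, right, bottom) region around an actor box.

    `box` is (x1, y1, x2, y2) in pixels. The square side is the longer box
    side (assumption), capped so that the square still fits in the frame.
    """
    x1, y1, x2, y2 = box
    side = min(max(x2 - x1, y2 - y1), frame_w, frame_h)

    # Center the square on the actor box.
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    left = cx - side / 2
    top = cy - side / 2

    # Shift the square back inside the frame instead of padding (assumption).
    left = min(max(left, 0), frame_w - side)
    top = min(max(top, 0), frame_h - side)
    return (left, top, left + side, top + side)


# A tall, narrow actor box in a 320 x 240 frame.
crop = square_crop_box((100, 50, 140, 250), frame_w=320, frame_h=240)
print(crop)
```

Because the crop is square before resizing, the resize scales both axes by the same factor and the person's aspect ratio is preserved; the price is that neighbouring actors can fall inside the same square.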


Bringing Image Structure to Video via Frame-Clip Consistency of Object Tokens

Neural Information Processing Systems

Recent action recognition models have achieved impressive results by integrating objects, their locations and interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, making these methods expensive to train and less scalable. On the other hand, one does often have access to a small set of annotated images, either within or outside the domain of interest. Here we ask how such images can be leveraged for downstream video understanding tasks. We propose a learning framework StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images only available during training can improve a video model.


clarify that B-RAI [24] is a recently proposed algorithm for estimating the posterior probability of causal relations among observed

Neural Information Processing Systems

We would like to sincerely thank you for your important ideas and constructive comments. It is not related to the deep learning domain. We will clearly state these contributions in the paper. As you suggest, we will define B2N, RAI, and GGT in the paper. An ensemble of 15 (last point on the curve, Figure 1), having a total of 3.6M parameters, is Optimizing for a specific loss hinders other objectives, e.g., accuracy and calibration.


Efficient Algorithms for Smooth Minimax Optimization

Neural Information Processing Systems

In terms of g(·, y), we consider two settings - strongly convex and nonconvex - and improve upon the best known rates in both. For strongly-convex g(·, y), ∀y, we propose a new direct optimal algorithm combining Mirror-Prox and Nesterov's AGD, and show that it can find global optimum in Õ(1/k


strongly-convex-concave minimax problems first, which we will add in the final revision

Neural Information Processing Systems

We thank all the reviewers for their constructive comments. Conceptual DIAG: The intuition behind Algorithm 1 stems from a "conceptual" version of DIAG (also specified in Algorithm 1, Step 4), which is inspired by the conceptual version of Mirror-Prox (MP) (cf. Thus the overall speed of Imp-STEP is O()) steps. Response to reviewer 1: We agree with, and will include, the reviewer's comment that the non-smoothness of We will devote more space to explaining the DIAG algorithm and discussing more related works. We will add a precise justification (which was omitted due to the lack of space) in the next revision.
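For readers unfamiliar with Mirror-Prox, the sketch below shows its Euclidean special case (the extragradient method) on a toy saddle-point problem. It is not DIAG and not the authors' Algorithm 1; the objective g(x, y) = 0.5·x² + x·y − 0.5·y², the step size, and the function names are all assumptions chosen only to make the two-step "look ahead, then update" structure concrete.

```python
def grad_x(x, y):
    # Partial derivative of g(x, y) = 0.5*x**2 + x*y - 0.5*y**2 in x.
    return x + y


def grad_y(x, y):
    # Partial derivative of the same toy g in y.
    return x - y


def extragradient(x, y, step=0.1, iters=200):
    """Euclidean Mirror-Prox (extragradient) for min over x, max over y."""
    for _ in range(iters):
        # Look-ahead point: one plain descent/ascent step.
        x_half = x - step * grad_x(x, y)
        y_half = y + step * grad_y(x, y)
        # Actual update: start from (x, y) but use gradients at the look-ahead point.
        x, y = x - step * grad_x(x_half, y_half), y + step * grad_y(x_half, y_half)
    return x, y


x_star, y_star = extragradient(1.0, 1.0)
print(x_star, y_star)  # both should be very close to the saddle point (0, 0)
```

The toy g is strongly convex in x and strongly concave in y, so its unique saddle point is the origin and the iterates contract toward it.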


DARE: Disentanglement-Augmented Rationale Extraction

Neural Information Processing Systems

Rationale extraction can be considered as a straightforward method of improving the model explainability, where rationales are a subsequence of the original inputs, and can be extracted to support the prediction results. Existing methods are mainly cascaded with the selector which extracts the rationale tokens, and the predictor which makes the prediction based on selected tokens. Since previous works fail to fully exploit the original input, where the information of non-selected tokens is ignored, in this paper, we propose a Disentanglement-Augmented Rationale Extraction (DARE) method, which encapsulates more information from the input to extract rationales. Specifically, it first disentangles the input into the rationale representations and the non-rationale ones, and then learns more comprehensive rationale representations for extracting by minimizing the mutual information (MI) between the two disentangled representations. Besides, to improve the performance of MI minimization, we develop a new MI estimator by exploring existing MI estimation methods. Extensive experimental results on three real-world datasets and simulation studies clearly validate the effectiveness of our proposed method. Code is released at https://github.com/yuelinan/DARE.


The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Neural Information Processing Systems

The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb.


Appendix: Not All Low-Pass Filters are Robust in Graph Convolutional Networks (B Broader Impact; C Additional Related Work; D Additional Preliminaries on Graph Signal Filtering)

Neural Information Processing Systems

For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Graph Convolutional Networks (GCNs) could be crucial tools for a broad range of applications, including social networks, computer vision, natural language processing, traffic prediction, chemistry, protein design, recommender systems, and so on [64, 58]. Each of these applications may have a different social effect. The use of GCNs could improve protein design efficiency and lead to the development of new medicines, but it could also result in job losses.


7 Appendix A Limitations

Neural Information Processing Systems

Table 6 provides summary statistics of domain coverage. Overall, the benchmark covers 8,637 biology images and 8,678 pathology images across 12 subdomains. Similarly, Table 7 shows summary statistics of microscopy modalities covered by Micro-Bench perception, including 10,864 images for light microscopy, 5,618 for fluorescence microscopy, and 833 images for electron microscopy across 8 microscopy imaging submodalities and 25 unique microscopy staining techniques (see Table 8). Micro-Bench Perception (Coarse-grained): Hierarchical metadata for each of the 17,235 perception images and task-specific templates (shown in Table 23) are used to create 5 coarse-grained questions and captions regarding microscopy modality, submodality, domain, subdomain, and staining technique. The use of hierarchical metadata enables the generation of options within each hierarchical level.


Topological Attention for Time Series Forecasting

Neural Information Processing Systems

The problem of (point) forecasting univariate time series is considered. Most approaches, ranging from traditional statistical methods to recent learning-based techniques with neural networks, directly operate on raw time series observations. As an extension, we study whether local topological properties, as captured via persistent homology, can serve as a reliable signal that provides complementary information for learning to forecast. To this end, we propose topological attention, which allows attending to local topological features within a time horizon of historical data. Our approach easily integrates into existing end-to-end trainable forecasting models, such as N-BEATS, and, in combination with the latter, exhibits state-of-the-art performance on the large-scale M4 benchmark dataset of 100,000 diverse time series from different domains. Ablation experiments, as well as a comparison to a broad range of forecasting methods in a setting where only a single time series is available for training, corroborate the beneficial nature of including local topological information through an attention mechanism.
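To make "local topological features within a time horizon" more concrete, the sketch below computes 0-dimensional persistence pairs of a sublevel-set filtration for each sliding window of a series. The abstract does not specify the filtration, the window scheme, or how barcodes are vectorized before attention, so all of these are assumptions, as are the names `sublevel_persistence` and `window_barcodes`.

```python
def sublevel_persistence(values):
    """0-dim persistence pairs (birth, death) of a 1-D sequence.

    Points enter in order of increasing value; when two components meet,
    the one with the larger birth value dies (elder rule). Pairs with zero
    persistence are dropped; the global minimum never dies.
    """
    n = len(values)
    parent = [None] * n          # None means "not yet in the filtration"
    birth = {}                   # component root -> birth value
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in sorted(range(n), key=lambda k: values[k]):
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                old, young = (ri, rj) if birth[ri] <= birth[rj] else (rj, ri)
                if values[i] > birth[young]:
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    pairs.append((min(values), float("inf")))
    return sorted(pairs)


def window_barcodes(series, width, step=1):
    """One barcode (list of pairs) per sliding window of the series."""
    return [sublevel_persistence(series[s:s + width])
            for s in range(0, len(series) - width + 1, step)]


print(sublevel_persistence([0, 2, 1, 3, 0.5]))
print(len(window_barcodes([0, 2, 1, 3, 0.5], width=3)))
```

Each window yields a variable-length set of (birth, death) pairs, which is the kind of set-valued input an attention mechanism can consume once the pairs are embedded as vectors; that embedding step is not shown here.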