On the Convergence of Step Decay Step-Size for Stochastic Optimization

Neural Information Processing Systems

The convergence of stochastic gradient descent is highly dependent on the step-size, especially on non-convex problems such as neural network training. Step decay step-size schedules (constant and then cut) are widely used in practice because of their excellent convergence and generalization qualities, but their theoretical properties are not yet well understood. We provide convergence results for step decay in the non-convex regime, ensuring that the gradient norm vanishes at an O(ln T/T) rate. We also provide near-optimal (and sometimes provably tight) convergence guarantees for general, possibly non-smooth, convex and strongly convex problems. The practical efficiency of the step decay step-size is demonstrated in several large-scale deep neural network training tasks.
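
As an illustration of the "constant and then cut" schedule discussed above, a minimal Python sketch is given below; the decay factor, cut interval, and function name are assumptions for illustration, not the exact schedule analysed in the paper.

def step_decay_lr(initial_lr, decay_factor, epochs_per_cut, epoch):
    """Step decay: keep the step-size constant within a phase and divide it
    by `decay_factor` after every `epochs_per_cut` epochs (illustrative)."""
    num_cuts = epoch // epochs_per_cut
    return initial_lr / (decay_factor ** num_cuts)

# Example: start at 0.1 and divide by 10 every 30 epochs -> 0.1, 0.01, 0.001.
for epoch in (0, 30, 60):
    print(epoch, step_decay_lr(0.1, 10, 30, epoch))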


Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Neural Information Processing Systems

Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data. In this paper, we propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Instead of relying on the multimodal LLM to directly annotate data, which we found to be suboptimal, we prompt it to reason about potential candidate entity labels by accessing additional contextually relevant information (such as Wikipedia), resulting in more accurate annotations. We further use the multimodal LLM to enrich the dataset by generating question-answer pairs and a grounded, fine-grained textual description (referred to as a "rationale") that explains the connection between images and their assigned entities. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks (e.g.
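
The verification step described above can be pictured with a short, hypothetical Python sketch: the multimodal LLM is not asked to label the image directly, but to reason over candidate entities together with retrieved context. `call_multimodal_llm` and `fetch_wikipedia_summary` are placeholder functions supplied by the caller, not APIs from the paper.

def verify_entity_label(image, candidate_entities, call_multimodal_llm, fetch_wikipedia_summary):
    """Prompt a multimodal LLM to choose among candidate entity labels,
    given per-candidate context retrieved from Wikipedia (illustrative)."""
    context = {e: fetch_wikipedia_summary(e) for e in candidate_entities}
    prompt = (
        "Given the image and the Wikipedia context below, pick the entity that best "
        "matches the image, explain your reasoning, and answer 'none' if no candidate fits.\n\n"
        + "\n".join(f"- {e}: {context[e]}" for e in candidate_entities)
    )
    return call_multimodal_llm(image=image, prompt=prompt)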


Grounded Reinforcement Learning: Learning to Win the Game under Human Commands - Supplementary Materials

Neural Information Processing Systems

In this section, we describe the details of the MiniRTS environment and the human dataset. The data do not contain any personally identifiable information or offensive content.

Figure 1: MiniRTS [2] implements the rock-paper-scissors attack graph: each army type has some units it is effective against and vulnerable to. For example, "swordman" restrains "spearman" but is restrained by "cavalry".

Figure 2: Building units can produce different army units using resources. "workshop" can produce "archer", "dragon" and "catapult", while other buildings can build one unit type. Only "peasant" ...

Game Units: There are 3 kinds of units in MiniRTS, including resource units, building units, and army units.

Resource Units: Resource units are stationary and neutral. They cannot be constructed by anyone and are created at the beginning of a game. A mine action gathers resources from resource units, and the mined resources are necessary to build new building units or army units.


Grounded Reinforcement Learning: Learning to Win the Game under Human Commands

Neural Information Processing Systems

We consider the problem of building a reinforcement learning (RL) agent that can both accomplish non-trivial tasks, like winning a real-time strategy game, and strictly follow high-level language commands from humans, like "attack", even if a command is sub-optimal. We call this novel yet important problem Grounded Reinforcement Learning (GRL). Compared with other language grounding tasks, GRL is particularly non-trivial and cannot be simply solved by pure RL or behavior cloning (BC). From the RL perspective, it is extremely challenging to derive a precise reward function for human preferences since the commands are abstract and the valid behaviors are highly complicated and multi-modal. From the BC perspective, it is impossible to obtain perfect demonstrations since human strategies in complex games are typically sub-optimal. We tackle GRL via a simple, tractable, and practical constrained RL objective and develop an iterative RL algorithm, REinforced demonstration Distillation (RED), to obtain a strong GRL policy. We evaluate the policies derived by RED, BC and pure RL methods on a simplified real-time strategy game, MiniRTS. Experimental results and human studies show that the RED policy is able to consistently follow human commands and, at the same time, achieve a higher win rate than the baselines. We release our code and present more examples at https://sites.google.com/view/grounded-rl.
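
One simple way to picture a constrained objective of this kind is a Lagrangian-penalised loss that combines a policy-gradient term (win the game) with a behaviour-cloning term on command-conditioned human demonstrations (follow the command). The sketch below is illustrative only and is not the authors' RED algorithm; all tensor and parameter names are assumptions.

import torch

def constrained_rl_loss(log_probs, advantages, demo_log_probs, lagrange_multiplier, bc_threshold):
    """log_probs/advantages come from rollouts of the current policy;
    demo_log_probs is the log-likelihood of human demonstrations under the
    current policy; the multiplier penalises exceeding a BC-loss budget."""
    rl_loss = -(log_probs * advantages).mean()   # maximise expected return
    bc_loss = -demo_log_probs.mean()             # stay close to human behaviour
    return rl_loss + lagrange_multiplier * (bc_loss - bc_threshold)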


Attention Bottlenecks for Multimodal Fusion - Supplementary Materials
Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Neural Information Processing Systems

Here we provide additional ablation results on mini-Audioset (Sec. ...). We then provide results on two additional datasets, Moments in Time and Kinetics, in Sec. C, and perform some preliminary transfer learning experiments in Sec. E. Finally, we provide details on the AS-500K split. In this section we expand on the ablations provided in Sec. ...


Attention Bottlenecks for Multimodal Fusion

Neural Information Processing Systems

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense relevant information in each modality and share what is necessary. We find that such a strategy improves fusion performance while reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
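
A minimal PyTorch sketch of the bottleneck idea follows, assuming per-modality transformer encoder layers and a small set of shared bottleneck tokens; the dimensions, layer choices, and the averaging rule for the bottlenecks are illustrative assumptions, not the released architecture.

import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, num_bottlenecks=4, num_heads=8):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottlenecks, dim))
        self.video_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.audio_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        b, n = video_tokens.shape[0], self.bottleneck.shape[1]
        btl = self.bottleneck.expand(b, -1, -1)
        # Each modality attends only to its own tokens plus the shared bottlenecks.
        v = self.video_layer(torch.cat([video_tokens, btl], dim=1))
        a = self.audio_layer(torch.cat([audio_tokens, btl], dim=1))
        # Cross-modal information is exchanged only through the bottleneck tokens.
        new_btl = 0.5 * (v[:, -n:] + a[:, -n:])
        return v[:, :-n], a[:, :-n], new_btl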


Introducing Routing Uncertainty in Capsule Networks
Fabio De Sousa Ribeiro

Neural Information Processing Systems

Rather than performing inefficient local iterative routing between adjacent capsule layers, we propose an alternative global view based on representing the inherent uncertainty in part-object assignment. In our formulation, the local routing iterations are replaced with variational inference of part-object connections in a probabilistic capsule network, leading to a significant speedup without sacrificing performance. In this way, global context is also considered when routing capsules by introducing global latent variables that have direct influence on the objective function, and are updated discriminatively in accordance with the minimum description length (MDL) principle. We focus on enhancing capsule network properties, and perform a thorough evaluation on pose-aware tasks, observing improvements in performance over previous approaches whilst being more computationally efficient.