Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization
James Oldfield
The Mixture of Experts (MoE) paradigm provides a powerful way to decompose dense layers into smaller, modular computations that are often more amenable to human interpretation, debugging, and editability. However, a major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization. In this paper, we propose the Multilinear Mixture of Experts (µMoE) layer to address this, focusing on vision models.
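To make the factorization idea concrete, here is a minimal numpy sketch contrasting a dense soft-MoE linear layer with a CP-factorized counterpart that never materializes the full expert weight tensor. This is an illustrative reconstruction, not the paper's implementation; the sizes, the gating, and the CP layout are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_out, R = 8, 16, 32, 4  # experts, in/out dims, CP rank (illustrative)

x = rng.standard_normal(d_in)
gate = np.exp(rng.standard_normal(N))
gate /= gate.sum()  # soft (dense) expert weights summing to 1

# Dense formulation: one weight matrix per expert; the output is the
# gate-weighted sum of all N expert outputs. Cost grows with N * d_in * d_out.
W = rng.standard_normal((N, d_in, d_out))
y_dense = np.einsum("n,i,nio->o", gate, x, W)

# CP-factorized formulation: W[n, i, o] = sum_r U[n, r] * A[i, r] * B[o, r].
# The forward pass contracts each factor separately, so the full
# N x d_in x d_out tensor is never built.
U = rng.standard_normal((N, R))
A = rng.standard_normal((d_in, R))
B = rng.standard_normal((d_out, R))
y_cp = ((gate @ U) * (x @ A)) @ B.T
```

The factorized forward pass is algebraically exact for a CP-structured weight tensor: contracting the gate with `U` and the input with `A` rank-by-rank, then mixing through `B`, reproduces the dense result for that tensor.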
Data Diversification: A Simple Strategy For Neural Machine Translation
Xuan-Phi Nguyen, Wu Kui, Ai Ti Aw
We introduce Data Diversification: a simple but effective strategy to boost neural machine translation (NMT) performance. It diversifies the training data by using the predictions of multiple forward and backward models and then merging them with the original dataset, on which the final NMT model is trained. Our method is applicable to all NMT models. It does not require extra monolingual data like back-translation, nor does it add more computation and parameters like ensembles of models. Our method achieves state-of-the-art BLEU scores of 30.7 and 43.7 on the WMT'14 English-German and English-French translation tasks, respectively. It also substantially improves results on 8 other translation tasks: 4 IWSLT tasks (English-German and English-French) and 4 low-resource translation tasks (English-Nepali and English-Sinhala). We demonstrate that our method is more effective than knowledge distillation and dual learning, that it exhibits a strong correlation with ensembles of models, and that it trades off perplexity for a better BLEU score.
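The recipe above can be sketched as a short data-flow function. This is a schematic reconstruction under stated assumptions: `train` and `translate` are hypothetical placeholders standing in for a full NMT training/inference pipeline, and only the corpus-building logic is shown.

```python
# Hypothetical sketch of the Data Diversification data flow.
# `train(src, tgt, seed)` returns a model translating src-side to tgt-side;
# `translate(model, sentences)` returns its predictions. Both are placeholders.

def diversify(src, tgt, k, train, translate):
    """Build a diversified parallel corpus from an original (src, tgt) corpus
    using k forward and k backward models."""
    pairs = [(src, tgt)]  # keep the original data
    for seed in range(k):
        fwd = train(src, tgt, seed=seed)          # forward model: src -> tgt
        bwd = train(tgt, src, seed=seed)          # backward model: tgt -> src
        pairs.append((src, translate(fwd, src)))  # synthetic target sides
        pairs.append((translate(bwd, tgt), tgt))  # synthetic source sides
    # Merge all (2k + 1) corpora; the final NMT model trains on this.
    merged_src = [s for p in pairs for s in p[0]]
    merged_tgt = [t for p in pairs for t in p[1]]
    return merged_src, merged_tgt
```

With k rounds, the merged corpus is (2k + 1) times the original size, which matches the abstract's claim of needing no extra monolingual data: every synthetic pair reuses one side of the original parallel corpus.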
Optimal Best-arm Identification in Linear Bandits
We study the problem of best-arm identification with fixed confidence in stochastic linear bandits. The objective is to identify the best arm with a given level of certainty while minimizing the sampling budget. We devise a simple algorithm whose sample complexity matches known instance-specific lower bounds, asymptotically almost surely and in expectation. The algorithm relies on an arm sampling rule that tracks an optimal proportion of arm draws, and that, remarkably, can be updated as rarely as we wish without compromising its theoretical guarantees. Moreover, unlike existing best-arm identification strategies, our algorithm uses a stopping rule that does not depend on the number of arms. Experimental results suggest that our algorithm significantly outperforms existing algorithms. The paper further provides a first analysis of the best-arm identification problem in linear bandits with a continuous set of arms.
FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models
Nine image pairs are generated by personalized text-to-image models, each fine-tuned on a single style reference image, shown in the corner of the left image of each pair. Fine-grained concepts are written above the images for comparison, showing the nuanced compositionality encompassing color, foreground object, background, and textures. Full prompts are available in Appendix A.1.
Appendix
The appendix is organized as follows. In Appendix A, we first discuss the relationship of our work to prior art. In Appendix B, we provide some preliminary tools for analyzing our manifold optimization problem. Building on these, the proofs of Theorem 1 and Theorem 2 are provided in Appendix C and Appendix D, respectively. Finally, our experimental setup and additional experimental results are provided in Appendix E. Notations. Before we proceed, let us first introduce the notations that will be used throughout the appendix.
Graph Convolutions Enrich the Self-Attention in Transformers!
Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective.
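The GSP view can be illustrated in a few lines: a row-stochastic attention matrix acts as a one-hop averaging (low-pass) graph filter, and a polynomial in that matrix yields a more general filter that can retain higher-frequency components. This is a generic sketch of the interpretation, not the paper's proposed layer; the filter coefficients below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                         # tokens, feature dim (illustrative)
X = rng.standard_normal((n, d))

# Softmax attention matrix: row-stochastic, i.e. each output token is a
# convex combination (an average) of all tokens -- a one-hop graph filter.
scores = X @ X.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# Plain self-attention: repeated application of this averaging operator
# smooths representations, which is the oversmoothing mechanism.
Y_attn = A @ X

# A degree-2 polynomial graph filter in A mixes the identity, one-hop, and
# two-hop terms; with a negative higher-order coefficient it can act
# high-pass rather than purely low-pass (coefficients are illustrative).
w0, w1, w2 = 0.3, 1.0, -0.3
Y_filter = w0 * X + w1 * (A @ X) + w2 * (A @ (A @ X))
```

Plain attention is recovered as the special case (w0, w1, w2) = (0, 1, 0), so the polynomial filter strictly generalizes the original self-attention operator.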
A Supplemental Figures
Supplementary Material for "What shapes feature representations?" Figure A.1: Feature decodability in models with a ResNet-50 architecture trained on the Navon dataset. Accuracy decoding features (shape, texture) from an untrained model (left) versus from shape- (center) and texture-trained (right) models. Results for trained models are the mean across models trained on 5 cross-validation splits. Target features are enhanced relative to the untrained model, whereas non-target features are suppressed. Figure A.2: Non-target features are suppressed in the post-pool layer of models with a ResNet-50 architecture trained on the Trifeature dataset.
Katherine L. Hermann Andrew K. Lampinen
In naturalistic learning problems, a model's input contains a wide range of features, some useful for the task at hand, and others not. Of the useful features, which ones does the model use? Of the task-irrelevant features, which ones does the model represent? Answers to these questions are important for understanding the basis of models' decisions, as well as for building models that learn versatile, adaptable representations useful beyond the original training task. We study these questions using synthetic datasets in which the task-relevance of input features can be controlled directly.
We thank the reviewers for their helpful comments. Feature difficulty (R3): "I hope that the authors have a grasp of manually designed image features and their …" We agree that color is an easier feature than shape or texture. We performed experiments using both vision and non-vision datasets. Indeed, we found that feature difficulty was not the sole determinant of feature use or representation (Figs. 5 & 6). The joint image feature-label statistics of ImageNet are unknown and uncontrolled.