Appendix: Training Transitive and Commutative Multimodal Transformers with LoReTTa

Manuel Tran

Neural Information Processing Systems 

In our SVL-MNIST experiments, we freeze the backbone and train a linear classifier on top. The initial learning rate is 0.1, and we do not use weight decay. We divide the SVL-MNIST dataset into training, validation, and test sets (Figure A1). The first dataset (I, T) consists of 12,000 paired samples from MNIST and WineReviews.
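The linear-probing setup described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `train_linear_probe`, the feature shapes, and the use of plain gradient descent on a softmax classifier are all assumptions; the frozen backbone is modeled by treating the features as fixed, precomputed inputs. Only the learning rate of 0.1 and the absence of weight decay are taken from the text.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=200, seed=0):
    """Train a softmax linear classifier on frozen features.

    The backbone is frozen, so `features` are treated as fixed inputs
    and only the linear layer's weights W and bias b are updated.
    Plain gradient descent with lr=0.1 and no weight decay, per the text.
    (Hypothetical sketch; shapes and optimizer are assumptions.)
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]

    for _ in range(epochs):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy loss w.r.t. the logits.
        grad_logits = (probs - onehot) / n
        W -= lr * (features.T @ grad_logits)   # no weight-decay term added
        b -= lr * grad_logits.sum(axis=0)
    return W, b
```

In practice the frozen backbone would be run once over the dataset to produce the feature matrix, after which only this lightweight classifier is optimized.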