All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
Vayani, Ashmal, Dissanayake, Dinura, Watawana, Hasindri, Ahsan, Noor, Sasikumar, Nevasini, Thawakar, Omkar, Ademtew, Henok Biadglign, Hmaiti, Yahya, Kumar, Amandeep, Kuckreja, Kartik, Maslych, Mykola, Ghallabi, Wafa Al, Mihaylov, Mihail, Qin, Chao, Shaker, Abdelrahman M, Zhang, Mike, Ihsani, Mahardika Krisna, Esplana, Amiel, Gokani, Monil, Mirkin, Shachar, Singh, Harsh, Srivastava, Ashay, Hamerlik, Endre, Izzati, Fathinah Asma, Maani, Fadillah Adamsyah, Cavada, Sebastian, Chim, Jenny, Gupta, Rohit, Manjunath, Sanjay, Zhumakhanova, Kamila, Rabevohitra, Feno Heriniaina, Amirudin, Azril, Ridzuan, Muhammad, Kareem, Daniya, More, Ketan, Li, Kunyang, Shakya, Pramesh, Saad, Muhammad, Ghasemaghaei, Amirpouya, Djanibekov, Amirbek, Azizov, Dilshod, Jankovic, Branislava, Bhatia, Naman, Cabrera, Alvaro, Obando-Ceron, Johan, Otieno, Olympiah, Farestam, Fabian, Rabbani, Muztoba, Baliah, Sanoojan, Sanjeev, Santosh, Shtanchaev, Abduragim, Fatima, Maheen, Nguyen, Thao, Kareem, Amrin, Aremu, Toluwani, Xavier, Nathan, Bhatkal, Amit, Toyin, Hawau, Chadha, Aman, Cholakkal, Hisham, Anwer, Rao Muhammad, Felsberg, Michael, Laaksonen, Jorma, Solorio, Thamar, Choudhury, Monojit, Laptev, Ivan, Shah, Mubarak, Khan, Salman, Khan, Fahad
Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring a variety of question formats: true/false, multiple-choice, and open-ended questions, with the open-ended questions further divided into short- and long-answer categories. This design ensures a comprehensive assessment of a model's ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open- and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark is publicly available.
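To make the question formats concrete, here is a hypothetical sketch of what a single evaluation record might look like; the field names and types are our own illustration, not ALM-bench's released schema.

```python
# Hypothetical record layout for a culturally diverse VQA benchmark item;
# all field names here are illustrative, not ALM-bench's actual schema.
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class ALMBenchItem:
    image_path: str
    language: str                  # one of the 100 evaluated languages
    cultural_aspect: str           # one of the 13 aspects, e.g. "traditions"
    question: str
    question_type: Literal["true_false", "multiple_choice",
                           "open_ended_short", "open_ended_long"]
    choices: Optional[list[str]]   # populated for multiple choice only
    reference_answer: str
```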
Neuroplastic Expansion in Deep Reinforcement Learning
Liu, Jiashun, Obando-Ceron, Johan, Courville, Aaron, Pan, Ling
In the realm of neuroscience, it has been observed that biological agents often experience a diminishing ability to adapt over time, analogous to the gradual solidification of neural pathways in the brain (Livingston, 1966). This phenomenon, typically known as the loss of plasticity (Mateos-Aparicio and Rodríguez-Moreno, 2019), significantly affects an agent's capacity to learn continually, and is especially pronounced when agents learn by trial and error in deep reinforcement learning (deep RL), owing to the inherently non-stationary nature of the learning problem. The declining adaptability throughout the learning process can severely hinder the agent's ability to effectively learn and respond to complex or non-stationary scenarios (Abbas et al., 2023). This limitation presents a fundamental obstacle to achieving sustained learning and adaptability in artificial agents, echoing the plasticity-stability dilemma (Abraham and Robins, 2005) observed in biological neural networks. Several recent studies have highlighted a significant loss of plasticity in deep RL (Kumar et al., 2020; Lyle et al., 2022), which substantially restricts the agent's ability to learn from subsequent experiences (Lyle et al., 2023; Ma et al., 2023). The identification of primacy bias (Nikishin et al., 2022) further illustrates how agents may overfit to early experiences, inhibiting learning from subsequent new data. The consequences of plasticity loss further impede deep RL in continual learning scenarios, where the agent struggles to learn sequentially across a series of different tasks (Dohare et al., 2024).
Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL
Sokar, Ghada, Obando-Ceron, Johan, Courville, Aaron, Larochelle, Hugo, Castro, Pablo Samuel
The use of deep neural networks in reinforcement learning (RL) often suffers from performance degradation as model size increases. While soft mixtures of experts (SoftMoEs) have recently shown promise in mitigating this issue for online RL, the reasons behind their effectiveness remain largely unknown. In this work we provide an in-depth analysis identifying the key factors driving this performance gain. We discover the surprising result that tokenizing the encoder output, rather than the use of multiple experts, is what underlies the efficacy of SoftMoEs. Indeed, we demonstrate that even with an appropriately scaled single expert, we are able to maintain the performance gains, largely thanks to tokenization.
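The contrast at the heart of this finding is easy to state in code. Below is a minimal sketch, not the paper's implementation, of the two ways a convolutional encoder's output can be handed to the downstream value network; all shapes are illustrative.

```python
# Minimal sketch (illustrative, not the paper's code) of flattening vs.
# tokenizing a conv encoder's output before the downstream value network.
import jax.numpy as jnp

def flatten_encoder_output(feats):
    # feats: (H, W, C) conv features -> a single (H*W*C,) vector,
    # the conventional choice in value-based deep RL.
    return feats.reshape(-1)

def tokenize_encoder_output(feats):
    # feats: (H, W, C) -> (H*W, C): each spatial position becomes a token,
    # as in SoftMoE-style architectures. The paper finds this tokenization,
    # rather than expert multiplicity, is what drives the gains.
    h, w, c = feats.shape
    return feats.reshape(h * w, c)

feats = jnp.ones((11, 11, 64))               # e.g. Atari-scale conv features
print(flatten_encoder_output(feats).shape)   # (7744,)
print(tokenize_encoder_output(feats).shape)  # (121, 64)
```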
Mixture of Experts in a Mixture of RL settings
Willi, Timon, Obando-Ceron, Johan, Foerster, Jakob, Dziugaite, Karolina, Castro, Pablo Samuel
Mixtures of Experts (MoEs) have gained prominence in (self-)supervised learning due to their enhanced inference efficiency, adaptability to distributed training, and modularity. Previous research has illustrated that MoEs can significantly boost Deep Reinforcement Learning (DRL) performance by expanding the network's parameter count while reducing dormant neurons, thereby enhancing the model's learning capacity and ability to deal with non-stationarity. In this work, we shed more light on MoEs' ability to deal with non-stationarity and investigate MoEs in DRL settings with "amplified" non-stationarity via multi-task training, providing further evidence that MoEs improve learning capacity. In contrast to previous work, our multi-task results allow us to better understand the underlying causes for the beneficial effect of MoE in DRL training, the impact of the various MoE components, and insights into how best to incorporate them in actor-critic-based DRL networks. Finally, we also confirm results from previous work.
Mixtures of Experts Unlock Parameter Scaling for Deep RL
Obando-Ceron, Johan, Sokar, Ghada, Willi, Timon, Lyle, Clare, Farebrother, Jesse, Foerster, Jakob, Dziugaite, Gintare Karolina, Precup, Doina, Castro, Pablo Samuel
The recent rapid progress in (self-)supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Experts (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
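For readers unfamiliar with the mechanism, here is a minimal sketch of the Soft MoE computation of Puigcerver et al. (2023) that this line of work plugs into value-based networks. Experts are reduced to single linear maps for brevity, and all shapes and initializations are illustrative assumptions.

```python
# Minimal Soft MoE sketch: each slot takes a softmax-weighted average of
# input tokens, experts process their slots, and each token takes a
# softmax-weighted mixture of slot outputs.
import jax
import jax.numpy as jnp

def soft_moe(x, phi, expert_params):
    # x: (num_tokens, d); phi: (d, num_experts * slots_per_expert).
    num_experts, d, d_out = expert_params.shape
    logits = x @ phi                           # (tokens, slots)
    dispatch = jax.nn.softmax(logits, axis=0)  # each slot averages tokens
    combine = jax.nn.softmax(logits, axis=1)   # each token mixes slot outputs
    slots = dispatch.T @ x                     # (slots, d) slot inputs
    slots = slots.reshape(num_experts, -1, d)
    # Each expert is a single linear map here; real experts are MLPs.
    outs = jnp.einsum('esd,edk->esk', slots, expert_params)
    return combine @ outs.reshape(-1, d_out)   # (tokens, d_out)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (121, 64))             # e.g. tokenized conv features
phi = 0.02 * jax.random.normal(key, (64, 4 * 2))  # 4 experts, 2 slots each
experts = 0.02 * jax.random.normal(key, (4, 64, 64))
print(soft_moe(x, phi, experts).shape)            # (121, 64)
```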
JaxPruner: A concise library for sparsity research
Lee, Joo Hyung, Park, Wonpyo, Mitchell, Nicole, Pilault, Jonathan, Obando-Ceron, Johan, Kim, Han-Byul, Lee, Namhoon, Frantar, Elias, Long, Yun, Yazdanbakhsh, Amir, Agrawal, Shivani, Subramanian, Suvinay, Wang, Xin, Kao, Sheng-Chun, Zhang, Xingyao, Gale, Trevor, Bik, Aart, Han, Woohyun, Ferev, Milen, Han, Zhonglin, Kim, Hong-Seok, Dauphin, Yann, Dziugaite, Gintare Karolina, Castro, Pablo Samuel, Evci, Utku
This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research. JaxPruner aims to accelerate research on sparse neural networks by providing concise implementations of popular pruning and sparse training algorithms with minimal memory and latency overhead. Algorithms implemented in JaxPruner use a common API and work seamlessly with the popular optimization library Optax, which in turn enables easy integration with existing JAX-based libraries. We demonstrate this ease of integration with examples in four different codebases: Scenic, t5x, Dopamine, and FedJAX, and provide baseline experiments on popular benchmarks.
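As a rough illustration of the Optax integration the abstract describes, the sketch below follows the wrap-the-optimizer pattern from the library's documentation; the constructor arguments are elided, and the exact configuration API should be taken from the library itself rather than from this sketch.

```python
# Sketch of the integration pattern described above; configuration
# arguments are elided and the defaults assumed here are illustrative.
import jaxpruner
import optax

pruner = jaxpruner.MagnitudePruning()      # one of the bundled algorithms
tx = pruner.wrap_optax(optax.adam(1e-3))   # still an Optax GradientTransformation
# From here, `tx` is used like any Optax optimizer:
#   opt_state = tx.init(params)
#   updates, opt_state = tx.update(grads, opt_state, params)
```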
Bigger, Better, Faster: Human-level Atari with human-level efficiency
Schwarzer, Max, Obando-Ceron, Johan, Courville, Aaron, Bellemare, Marc, Agarwal, Rishabh, Castro, Pablo Samuel
We introduce a value-based RL agent, which we call BBF, that achieves super-human performance in the Atari 100K benchmark. BBF relies on scaling the neural networks used for value estimation, as well as a number of other design choices that enable this scaling in a sample-efficient manner. We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about updating the goalposts for sample-efficient RL research on the ALE. We make our code and data publicly available.
Small batch deep reinforcement learning
Obando-Ceron, Johan, Bellemare, Marc G., Castro, Pablo Samuel
In value-based deep reinforcement learning with replay memories, the batch size parameter specifies how many transitions to sample for each gradient update. Although critical to the learning process, this value is typically not adjusted when proposing new algorithms. In this work we present a broad empirical study that suggests reducing the batch size can result in a number of significant performance gains; this is surprising, as the general tendency when training neural networks is towards larger batch sizes for improved performance. We complement our experimental findings with a set of empirical analyses towards better understanding this phenomenon.
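To pin down the parameter in question, here is a minimal sketch (illustrative, not the paper's code) of where the batch size enters a replay-based value update; the linear Q-function is a stand-in for the usual deep network.

```python
# Minimal sketch of a replay-based gradient step, highlighting where the
# batch_size parameter enters; everything here is a toy stand-in.
import jax
import jax.numpy as jnp

def td_loss(params, states, targets):
    # Stand-in linear Q-function; the paper studies deep value networks.
    return jnp.mean((states @ params - targets) ** 2)

def update(key, params, replay_states, replay_targets, batch_size, lr=1e-3):
    # `batch_size` controls how many stored transitions feed each gradient
    # step; the paper's finding is that reducing it can improve performance.
    idx = jax.random.randint(key, (batch_size,), 0, replay_states.shape[0])
    grads = jax.grad(td_loss)(params, replay_states[idx], replay_targets[idx])
    return params - lr * grads

key = jax.random.PRNGKey(0)
states = jax.random.normal(key, (10_000, 8))   # toy replay memory
targets = jax.random.normal(key, (10_000,))
params = update(key, jnp.zeros(8), states, targets, batch_size=32)
```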