BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

Neural Information Processing Systems

The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance compared to dense models. However, training MoEs from scratch in a large-scale regime is prohibitively expensive.



Appendix to: Conformal Frequency Estimation with Sketched Data

Neural Information Processing Systems

Output: deterministic upper bound for the frequency of z in the data set: $\hat{f}^{\mathrm{CMS}}_{\mathrm{up}}(z)$. The CMS-CU algorithm. Algorithm A2 (CMS-CU). Input: Data set $Z_1, \ldots, Z_m$. Output: deterministic upper bound for the frequency of z in the data set: $\hat{f}^{\mathrm{CMS\text{-}CU}}_{\mathrm{up}}(z)$. Input: A (trainable) rule for computing nested intervals $[\hat{L}_{m,\alpha}(\cdot; t), \hat{U}_{m,\alpha}(\cdot; t)]$, $t \in T$. Input: Number of data points mtrain0
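As a rough illustration of the two sketching schemes named in this excerpt, the following Python sketch implements a textbook count-min sketch (CMS) and its conservative-update variant (CMS-CU). It is not taken from the paper: the class names, the salted SHA-256 hashing, and the `width`/`depth` parameters are assumptions made for this example. Both sketches return a deterministic upper bound on the frequency of a queried item, and the conservative-update bound is never looser than the vanilla one when both use the same hash functions.

```python
import hashlib


class CountMinSketch:
    """Textbook count-min sketch: `depth` hash rows with `width` counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        # Assumption for this example: salt SHA-256 with the row index
        # to obtain one hash function per row.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        # Vanilla update: increment the item's counter in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += 1

    def upper_bound(self, item):
        # Collisions can only inflate counters, so the minimum over rows
        # is a deterministic upper bound on the true frequency.
        return min(
            self.table[row][self._bucket(row, item)]
            for row in range(self.depth)
        )


class ConservativeUpdateSketch(CountMinSketch):
    """Count-min sketch with conservative updates (CMS-CU)."""

    def add(self, item):
        buckets = [self._bucket(row, item) for row in range(self.depth)]
        current = min(self.table[r][b] for r, b in enumerate(buckets))
        for r, b in enumerate(buckets):
            # Only raise counters that fall below the new minimum estimate.
            self.table[r][b] = max(self.table[r][b], current + 1)


if __name__ == "__main__":
    data = ["a", "b", "a", "c", "a", "b", "d", "e", "a"]
    cms = CountMinSketch(width=4, depth=3)
    cu = ConservativeUpdateSketch(width=4, depth=3)
    for z in data:
        cms.add(z)
        cu.add(z)
    for z in sorted(set(data)):
        print(z, data.count(z), cu.upper_bound(z), cms.upper_bound(z))
```

A deliberately small `width` is used in the demo so that hash collisions occur and the gap between the true count and the two upper bounds becomes visible.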


Example where multi-step outperforms one-step

Neural Information Processing Systems

As explained in the main text, this section presents an example that is only a slight modification of the one in Figure 4, but where a multi-step approach is clearly preferred over just one step. The data-generating and learning processes are exactly the same (100 trajectories of length 100, discount 0.9, α = 0.1 for reverse KL regularization). The only difference is that rather than using a behavior that is a mixture of optimal and uniform, we use a behavior that is a mixture of maximally suboptimal and uniform. If we call the suboptimal policy π (which always goes down and left in our gridworld), then the behavior for the modified example is β = 0.2 π + 0.8 u, where u is uniform. Results are shown in Figure 7. Figure 7: A gridworld example with modified behavior where multi-step is much better than one-step.
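The behavior mixture described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-action set, the reading of "always goes down and left" as equal probability on `down` and `left`, and the helper names `mix` and `sample_action` are all hypothetical and not taken from the paper.

```python
import random

ACTIONS = ["up", "down", "left", "right"]


def mix(policy_a, policy_b, weight_a):
    """Return the mixture weight_a * policy_a + (1 - weight_a) * policy_b."""
    return {
        a: weight_a * policy_a[a] + (1.0 - weight_a) * policy_b[a]
        for a in ACTIONS
    }


# Uniform policy u over the four actions.
uniform = {a: 1.0 / len(ACTIONS) for a in ACTIONS}

# Assumed form of the suboptimal policy: half the mass on "down",
# half on "left" (the excerpt does not give the exact split).
suboptimal = {"up": 0.0, "down": 0.5, "left": 0.5, "right": 0.0}

# Behavior policy: 0.2 * suboptimal + 0.8 * uniform.
behavior = mix(suboptimal, uniform, weight_a=0.2)


def sample_action(policy, rng):
    """Draw one action according to the given state-independent policy."""
    weights = [policy[a] for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(behavior)
    print([sample_action(behavior, rng) for _ in range(5)])
```

Under these assumptions the mixture places probability 0.3 on each of `down` and `left` and 0.2 on each of `up` and `right`, so the logged data is only mildly biased toward the suboptimal directions.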


A Architectures, Hyper-parameters and Algorithms

Neural Information Processing Systems

Our approach, named ORDER, uses a three-step training process. In the next parts of this section, we'll explain the methods, structures, and settings we use in each of ... After that, we'll talk about how we set up and carried out our experiments. In this section, we'll break down the design of the state encoder, how we decided on the best ... We used a grid search strategy to find the optimal hyper-parameters for our experiments. This allowed each observation dimension to match up with a state factor. We summarize the training process in Algorithm 1.


Checklist

Neural Information Processing Systems

Do the main claims made in the abstract and introduction accurately reflect the paper's ... Did you describe the limitations of your work? Did you specify all the training details (e.g., data splits, hyperparameters, how they ... Did you report error bars (e.g., with respect to the random seed after running experi... Did you include the total amount of compute and the type of resources used (e.g., type ... If your work uses existing assets, did you cite the creators? Did you mention the license of the assets? Did you include any new assets either in the supplemental material or as a URL? [Yes] Did you discuss whether and how consent was obtained from people whose data you're ... We thereby state that we bear all responsibility in case of violation of rights, etc., and confirmation of ... For what purpose was the dataset created? - For the novel task of data analysis as explained ... Who created the dataset and on behalf of which entity? - This dataset is created during a ... Who funded the creation of the dataset? What do the instances that comprise the dataset represent?





Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments

Neural Information Processing Systems

The different roles of c and P in this lower bound inspire us to design algorithms that estimate costs and transitions separately. Specifically, assuming the knowledge of c and P, we develop a simple but sub-optimal algorithm and another more involved minimax optimal algorithm (up to logarithmic terms). These algorithms combine the ideas of finite-horizon approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP algorithm [Zhang et al., 2020], adaptive confidence widening [Wei and Luo, 2021], as well as some new techniques such as properly penalizing long-horizon policies. Finally, when c and P are unknown, we develop a variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the aforementioned ideas into it to achieve O(min{B?S