Mixtape: Breaking the Softmax Bottleneck Efficiently

Neural Information Processing Systems

The softmax bottleneck has been shown to limit the expressiveness of neural language models. Mixture of Softmaxes (MoS) is an effective approach to addressing this theoretical limitation, but it is expensive compared to softmax in terms of both memory and time. We propose Mixtape, an output layer that breaks the softmax bottleneck more efficiently with three novel techniques: logit space vector gating, sigmoid tree decomposition, and gate sharing. On four benchmarks covering language modeling and machine translation, the Mixtape layer substantially improves efficiency over the MoS layer, by 3.5x to 10.5x, while obtaining similar performance. A network equipped with Mixtape is only 20% to 34% slower than a softmax-based network at vocabulary sizes of 10K to 30K, and outperforms softmax in perplexity and translation quality.
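The abstract's contrast between MoS and Mixtape can be illustrated with a small numerical sketch. This is a toy illustration under made-up assumptions, not the paper's implementation: the shapes, the per-component projections `Hk`, and the randomly drawn gate activations are all hypothetical. It shows only the structural point: MoS mixes K full softmaxes over the vocabulary (hence the cost), while a sigmoid tree produces K mixture weights from K-1 sigmoids without a softmax over components.

```python
# Toy sketch: single softmax vs. Mixture of Softmaxes (MoS),
# plus a sigmoid-tree decomposition of the mixture weights.
# All shapes and parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, V, K = 8, 50, 4            # hidden size, vocab size, mixture components

W = rng.normal(size=(V, d))   # shared output embedding
h = rng.normal(size=d)        # context vector from the network

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Single softmax: log-probabilities are confined to a low-rank
# (roughly d-dimensional) space -- the "softmax bottleneck".
p_softmax = softmax(W @ h)

# MoS: mix K full softmaxes in probability space. This breaks the
# bottleneck but computes K vocabulary-sized softmaxes per token.
Hk = rng.normal(size=(K, d, d))        # hypothetical per-component projections
prior = softmax(rng.normal(size=K))    # mixture weights (from h in the real model)
p_mos = sum(prior[k] * softmax(W @ np.tanh(Hk[k] @ h)) for k in range(K))

# Sigmoid tree decomposition: K=4 gate probabilities from K-1=3 sigmoids,
# avoiding a softmax over the components entirely.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

g1, g2, g3 = sigmoid(rng.normal(size=3))   # tree-node activations
pi = np.array([g1 * g2,                    # left-left leaf
               g1 * (1 - g2),              # left-right leaf
               (1 - g1) * g3,              # right-left leaf
               (1 - g1) * (1 - g3)])       # right-right leaf

# Both constructions yield valid distributions.
assert np.isclose(p_mos.sum(), 1.0) and np.isclose(pi.sum(), 1.0)
```

By construction the four leaf products telescope to g1 + (1 - g1) = 1, so the tree gates always form a valid mixture without the normalization pass a softmax over components would require.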


From Mixtape to Pro Jank Footy: the most exciting Australian indie games at SXSW Sydney 2025

The Guardian

'There were, frankly, too many indie games to play in a day - a nice problem to have': gamers playing at the SXSW Sydney Games Showcase 2025. Hyperkinetic shooters, gorgeous animal adventures and even a charming puzzler where you play a postie: Australia's developers are punching above their weight. There's no escaping the fact that SXSW Sydney - Australia's iteration of Austin's tech, music and film event, now in its third year - is absolutely beset by brands. In Tumbalong Park on Saturday, families who had arrived for a free concert for kids meandered around the garish yellow CommBank Tour zone, as a line wound its way into the giant L'Oréal tent. But metres away at the International Convention Centre, inside the halls dedicated to gaming, the corporate influence was more muted.




Reviews: Mixtape: Breaking the Softmax Bottleneck Efficiently

Neural Information Processing Systems

POST-AUTHOR FEEDBACK: I thank the authors for their feedback and clarifications. I have increased my score based on those answers, trusting that the promised modifications will appear in the final version. I would strongly encourage the authors to make the released code as easy to use as possible, ideally with plugins for major platforms. This would not only increase citations but also have a direct impact on a number of use cases. ORIGINAL REVIEW: This paper addresses the softmax bottleneck problem: resolving it has been shown to significantly improve results when the output is over a large space (e.g., NLP). However, current solutions are very costly.


Reviews: Mixtape: Breaking the Softmax Bottleneck Efficiently

Neural Information Processing Systems

This paper proposes techniques to deal with the softmax bottleneck problem.

Pros
• Experimental results show strong performance in language modeling and machine translation.

Cons
• The writing could be further improved by making the paper self-contained.

The paper represents solid work, though there are clarity issues pointed out by the reviewers.



Mixtape: Breaking the Softmax Bottleneck Efficiently

Yang, Zhilin; Luong, Thang; Salakhutdinov, Russ R.; Le, Quoc V.

Neural Information Processing Systems