Goto

Collaborating Authors

 dtype




5b4e9aa703d0bfa11041debaa2d1b633-Supplemental.pdf

Neural Information Processing Systems

For the results shown in Figure 5 (bottom), the function codes and signatures were trained for 10 epochs on 8192 samples with Shampoo [25]. We again used 6 random seeds, and for each set of trainable parameters, we grid-searched the learning rate.


EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Xie, Tianxin, Yang, Shan, Li, Chenxing, Yu, Dong, Liu, Li

arXiv.org Artificial Intelligence

Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS. Demo samples are available at https://emosteer-tts-demo.pages.dev/.




Towards a Large Physics Benchmark

Barman, Kristian G., Caron, Sascha, Hasibi, Faegheh, Shalugin, Eugene, Marcet, Yoris, Otte, Johannes, de Regt, Henk W., Moody, Merijn

arXiv.org Artificial Intelligence

We introduce a benchmark framework developed by and for the scientific community to evaluate, monitor and steer large language model development in fundamental physics. Building on philosophical concepts of scientific understanding and creativity, we develop a scoring system in which each question is scored by an expert for its correctness, difficulty, and surprise. The questions are of three forms: (i) multiple-choice questions for conceptual understanding, (ii) analytical problems requiring mathematical derivation, and (iii) openended tasks requiring complex problem solving. Our current dataset contains diverse set of examples, including a machine learning challenge to classify high-energy physics events, such as the four top quark signal. To ensure continued relevance, we propose a living benchmark, where physicists contribute questions, for instance alongside new publications. We invite contributions via: http://www.physicsbenchmarks.org/. We hope that this benchmark will enable a targeted AI development that can make a meaningful contribution to fundamental physics research.


Fast and Simplex: 2-Simplicial Attention in Triton

Roy, Aurko, Chou, Timothy, Duvvuri, Sai Surya, Chen, Sijia, Yu, Jiecao, Wang, Xiaodong, Zaheer, Manzil, Anil, Rohan

arXiv.org Artificial Intelligence

Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic. We quantify these gains by demonstrating that $2$-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks compared to dot product attention.


One RL to See Them All: Visual Triple Unified Reinforcement Learning

Ma, Yan, Du, Linge, Shen, Xuyang, Chen, Shaoxiang, Li, Pengfei, Ren, Qibing, Ma, Lizhuang, Dai, Yuchao, Liu, Pengfei, Yan, Junjie

arXiv.org Artificial Intelligence

Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perceptionintensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises triple complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers) , and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Subsequently, Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.


TileLang: A Composable Tiled Programming Model for AI Systems

Wang, Lei, Cheng, Yu, Shi, Yining, Tang, Zhengju, Mo, Zhiwen, Xie, Wenhao, Ma, Lingxiao, Xia, Yuqing, Xue, Jilong, Yang, Fan, Yang, Zhi

arXiv.org Artificial Intelligence

Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined data-flow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations on those tiles. However, writing high-performance kernels remains complex despite the clarity of these patterns. Achieving peak performance requires careful, hardware-centric optimizations to fully leverage modern accelerators. While domain-specific compilers attempt to reduce the burden of writing high-performance kernels, they often struggle with usability and expressiveness gaps. In this paper, we present TileLang, a generalized tiled programming model for more efficient AI Kernel programming. TileLang decouples scheduling space (thread binding, layout, tensorize and pipeline) from dataflow, and encapsulated them as a set of customization annotations and primitives. This approach allows users to focus on the kernel's data-flow itself, while leaving most other optimizations to compilers. We conduct comprehensive experiments on commonly-used devices, across numerous experiments, our evaluation shows that TileLang can achieve state-of-the-art performance in key kernels, demonstrating that its unified block-and-thread paradigm and transparent scheduling capabilities deliver both the power and flexibility demanded by modern AI system development.