maximum length
It's Not That Simple. An Analysis of Simple Test-Time Scaling
Prior work proposed simple test-time scaling, a method for replicating this scaling behavior with models distilled from o1-like models by manually controlling test-time compute: either scaling down by enforcing a maximum length or scaling up by iteratively appending "Wait" when the model is about to terminate its generation. This paper presents an analysis of simple test-time scaling and finds that the scaling behavior is largely attributed to scaling down by enforcing a maximum length. In contrast, fine-tuning on long CoT data distilled from o1-like models has no significant impact on scaling behavior, and scaling up by appending "Wait" leads to inconsistencies, as the model may oscillate between solutions. A key distinction exists between scaling down by enforcing a maximum length and scaling up test-time compute in o1-like models, such as DeepSeek-R1\@. These models are typically allowed to utilize as much compute as needed, with the only constraint being the model's maximum supported length. By learning to naturally scale up test-time compute during reinforcement learning, o1-like models surpass their peak performance when scaling up. In contrast, simple test-time scaling progressively imposes a lower upper limit on model performance as it scales down. While replicating the test-time scaling behavior of o1 models can be straightforward by scaling down, it is crucial to recognize that the goal of scaling test-time compute is to unlock higher performance -- beyond what the model could originally achieve -- rather than merely reproducing the appearance of scaling behavior.
HuixiangDou2: A Robustly Optimized GraphRAG Approach
Kong, Huanjun, Wang, Zhefan, Wang, Chenyang, Ma, Zhe, Dong, Nanqing
Large Language Models (LLMs) perform well on familiar queries but struggle with specialized or emerging topics. Graph-based Retrieval-Augmented Generation (GraphRAG) addresses this by structuring domain knowledge as a graph for dynamic retrieval. However, existing pipelines involve complex engineering workflows, making it difficult to isolate the impact of individual components. Evaluating retrieval effectiveness is also challenging due to dataset overlap with LLM pretraining data. In this work, we introduce HuixiangDou2, a robustly optimized GraphRAG framework. Specifically, we leverage the effectiveness of dual-level retrieval and optimize its performance in a 32k context for maximum precision, and compare logic-based retrieval and dual-level retrieval to enhance overall functionality. Our implementation includes comparative experiments on a test set, where Qwen2.5-7B-Instruct initially underperformed. With our approach, the score improved significantly from 60 to 74.5, as illustrated in the Figure. Experiments on domain-specific datasets reveal that dual-level retrieval enhances fuzzy matching, while logic-form retrieval improves structured reasoning. Furthermore, we propose a multi-stage verification mechanism to improve retrieval robustness without increasing computational cost. Empirical results show significant accuracy gains over baselines, highlighting the importance of adaptive retrieval. To support research and adoption, we release HuixiangDou2 as an open-source resource https://github.com/tpoisonooo/huixiangdou2.
Random Walk in Random Permutation Set Theory
Zhou, Jiefeng, Li, Zhen, Deng, Yong
Random walk is an explainable approach for modeling natural processes at the molecular level. The Random Permutation Set Theory (RPST) serves as a framework for uncertainty reasoning, extending the applicability of Dempster-Shafer Theory. Recent explorations indicate a promising link between RPST and random walk. In this study, we conduct an analysis and construct a random walk model based on the properties of RPST, with Monte Carlo simulations of such random walk. Our findings reveal that the random walk generated through RPST exhibits characteristics similar to those of a Gaussian random walk and can be transformed into a Wiener process through a specific limiting scaling procedure. This investigation establishes a novel connection between RPST and random walk theory, thereby not only expanding the applicability of RPST, but also demonstrating the potential for combining the strengths of both approaches to improve problem-solving abilities.
The Framework of a Design Process Language
The thesis develops a view of design in a concept formation framework and outlines a language to describe both the object of the design and the process of designing. The unknown object at the outset of the design work may be seen as an unknown concept that the designer is to define. Throughout the process, she develops a description of this object by relating it to known concepts. The search stops when the designer is satisfied that the design specification is complete enough to satisfy the requirements from it once built. It is then a collection of propositions that all contribute towards defining the design object - a collection of sentences describing relationships between the object and known concepts. Also, the design process itself may be described by relating known concepts - by organizing known abilities into particular patterns of activation, or mobilization. In view of the demands posed to a language to use in this concept formation process, the framework of a Design Process Language (DPL) is developed. The basis for the language are linguistic categories that act as classes of relations used to combine concepts, containing relations used for describing process and object within the same general system, with some relations being process specific, others being object specific, and with the bulk being used both for process and object description. Another outcome is the distinction of modal relations, or relations describing futurity, possibility, willingness, hypothetical events, and the like. The design process almost always includes aspects such as these, and it is thus necessary for a language facilitating design process description to support such relationships to be constructed. The DPL is argued to be a foundation whereupon to build a language that can be used for enabling computers to be more useful - act more intelligently - in the design process.
Addressing the Length Bias Problem in Document-Level Neural Machine Translation
Zhang, Zhuocheng, Gu, Shuhao, Zhang, Min, Feng, Yang
Document-level neural machine translation (DNMT) has shown promising results by incorporating more context information. However, this approach also introduces a length bias problem, whereby DNMT suffers from significant translation quality degradation when decoding documents that are much shorter or longer than the maximum sequence length during training. %i.e., the length bias problem. To solve the length bias problem, we propose to improve the DNMT model in training method, attention mechanism, and decoding strategy. Firstly, we propose to sample the training data dynamically to ensure a more uniform distribution across different sequence lengths. Then, we introduce a length-normalized attention mechanism to aid the model in focusing on target information, mitigating the issue of attention divergence when processing longer sequences. Lastly, we propose a sliding window strategy during decoding that integrates as much context information as possible without exceeding the maximum sequence length. The experimental results indicate that our method can bring significant improvements on several open datasets, and further analysis shows that our method can significantly alleviate the length bias problem.
Optimizing transformer-based machine translation model for single GPU training: a hyperparameter ablation study
Verma, Luv, Kolhatkar, Ketaki N.
In machine translation tasks, the relationship between model complexity and performance is often presumed to be linear, driving an increase in the number of parameters and consequent demands for computational resources like multiple GPUs. To explore this assumption, this study systematically investigates the effects of hyperparameters through ablation on a sequence-to-sequence machine translation pipeline, utilizing a single NVIDIA A100 GPU. Contrary to expectations, our experiments reveal that combinations with the most parameters were not necessarily the most effective. This unexpected insight prompted a careful reduction in parameter sizes, uncovering "sweet spots" that enable training sophisticated models on a single GPU without compromising translation quality. The findings demonstrate an intricate relationship between hyperparameter selection, model size, and computational resource needs. The insights from this study contribute to the ongoing efforts to make machine translation more accessible and cost-effective, emphasizing the importance of precise hyperparameter tuning over mere scaling.
HiPool: Modeling Long Documents Using Graph Neural Networks
Li, Irene, Feng, Aosong, Radev, Dragomir, Ying, Rex
Encoding long sequences in Natural Language Processing (NLP) is a challenging problem. Though recent pretraining language models achieve satisfying performances in many NLP tasks, they are still restricted by a pre-defined maximum length, making them challenging to be extended to longer sequences. So some recent works utilize hierarchies to model long sequences. However, most of them apply sequential models for upper hierarchies, suffering from long dependency issues. In this paper, we alleviate these issues through a graph-based method. We first chunk the sequence with a fixed length to model the sentence-level information. We then leverage graphs to model intra- and cross-sentence correlations with a new attention mechanism. Additionally, due to limited standard benchmarks for long document classification (LDC), we propose a new challenging benchmark, totaling six datasets with up to 53k samples and 4034 average tokens' length. Evaluation shows our model surpasses competitive baselines by 2.6% in F1 score, and 4.8% on the longest sequence dataset. Our method is shown to outperform hierarchical sequential models with better performance and scalability, especially for longer sequences.