 newline


The Hidden Cost of Readability: How Code Formatting Silently Consumes Your LLM Budget

arXiv.org Artificial Intelligence

Source code is usually formatted with elements like indentation and newlines to improve readability for human developers. However, these visual aids do not seem to be beneficial for large language models (LLMs) in the same way since the code is processed as a linear sequence of tokens. Furthermore, these additional tokens can lead to increased computational costs and longer response times for LLMs. If such formatting elements are non-essential to LLMs, we can reduce such costs by removing them from the code. To figure out the role played by formatting elements, we conduct a comprehensive empirical study to evaluate the impact of code formatting on LLM performance and efficiency. Through large-scale experiments on Fill-in-the-Middle Code Completion tasks across four programming languages (Java, Python, C++, C#) and ten LLMs, including both commercial and open-source models, we systematically analyze token count and performance when formatting elements are removed. Key findings indicate that LLMs can maintain performance across formatted code and unformatted code, achieving an average input token reduction of 24.5% with negligible output token reductions. This makes code format removal a practical optimization strategy for improving LLM efficiency. Further exploration reveals that both prompting and fine-tuning LLMs can lead to significant reductions (up to 36.1%) in output code length without compromising correctness. To facilitate practical applications, we develop a bidirectional code transformation tool for format processing, which can be seamlessly integrated into existing LLM inference workflows, ensuring both human readability and LLM efficiency.
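The idea of removing formatting before sending code to a model can be sketched in a few lines of Python. This is an illustration only, not the paper's transformation tool: `strip_formatting` and `reduction` are hypothetical helper names, the Java snippet is made up, and the sketch reports a character-count reduction as a rough stand-in, since real token counts depend on the tokenizer of the model in question.

```python
import re


def strip_formatting(code: str) -> str:
    """Collapse every run of whitespace (indentation, newlines) to one space.

    Assumption: the language does not treat whitespace as syntax (Java, C++,
    C#), and the code has no `//` line comments or string literals whose
    meaning depends on exact spacing. Python source would need a different,
    syntax-aware transformation.
    """
    return re.sub(r"\s+", " ", code).strip()


def reduction(before: int, after: int) -> float:
    """Percentage saved when going from `before` units to `after` units."""
    return 100.0 * (before - after) / before


# Hypothetical input, formatted the way a human would write it.
java_snippet = """\
public int sum(int[] xs) {
    int total = 0;
    for (int x : xs) {
        total += x;
    }
    return total;
}
"""

compact = strip_formatting(java_snippet)
print(compact)
print(f"{reduction(len(java_snippet), len(compact)):.1f}% fewer characters")
```

Restoring readability afterwards would be the job of an ordinary code formatter, which is roughly what the abstract's "bidirectional" tool implies.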


Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

arXiv.org Artificial Intelligence

In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it's possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the ultimate RM for reinforcement learning, resulting in the Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model's performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade school level to math competition problems.


Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers

arXiv.org Artificial Intelligence

Large language models have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinctions between machine- and human-authored source code, causing integrity and authenticity issues of software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated texts, but they do not identify and harness the unique patterns of machine-generated code. Thus, their applicability falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine- and human-authored code. Through a rigorous analysis of code attributes such as length, lexical diversity, and naturalness, we expose unique patterns inherent to each source. We particularly notice that the structural segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose a novel machine-generated code detection method called DetectCodeGPT, which improves DetectGPT by capturing the distinct structural patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experiment results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
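A whitespace-only perturbation of the kind described above might look like the following Python sketch. It is not DetectCodeGPT's actual procedure, whose details the abstract does not give: the function name `perturb_whitespace`, its parameters, and the uniform random choice of positions are all assumptions made for illustration.

```python
import random


def perturb_whitespace(code: str, n_spaces: int = 2, n_newlines: int = 1,
                       seed: int = 0) -> str:
    """Return `code` with a few extra spaces and blank lines inserted.

    Assumption: doubling an interior space or adding a blank line does not
    change program meaning. That holds for typical code but not inside
    string literals or after a backslash line continuation.
    """
    rng = random.Random(seed)

    # Interior spaces only: a space that follows a non-whitespace character,
    # so leading indentation (significant in Python) is never touched.
    space_pos = [i for i, ch in enumerate(code)
                 if ch == " " and i > 0 and not code[i - 1].isspace()]
    newline_pos = [i for i, ch in enumerate(code) if ch == "\n"]

    inserts = {i: " " for i in rng.sample(space_pos, min(n_spaces, len(space_pos)))}
    inserts.update({i: "\n" for i in rng.sample(newline_pos, min(n_newlines, len(newline_pos)))})

    out = []
    for i, ch in enumerate(code):
        out.append(ch)
        if i in inserts:
            out.append(inserts[i])  # duplicate the whitespace character here
    return "".join(out)


sample = "def add(a, b):\n    return a + b\n"
perturbed = perturb_whitespace(sample)
print(repr(perturbed))
```

Because only whitespace is added, the perturbed text has the same sequence of non-whitespace tokens as the original, so any change in a model's score reflects formatting alone.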


BashPitfalls - Greg's Wiki

#artificialintelligence

This page is a compilation of common mistakes made by bash users. Each example is flawed in some way. Yes, it would be great if you could just treat the output of ls or find as a list of filenames and iterate over it. This entire approach is fatally flawed, and there is no trick that can make it work. You must use an entirely different approach. If a filename contains whitespace, it undergoes WordSplitting. Assuming we have a file named 01 - Don't Eat the Yellow Snow.mp3 in the current directory, the for loop will iterate over each word in the resulting file name: 01, -, Don't, Eat, etc. If a filename contains glob characters, it undergoes filename expansion ("globbing"). If ls produces any output containing a * character, the word containing it will become recognized as a pattern and substituted with a list of all filenames that match it. If the command substitution returns multiple filenames, there is no way to tell where the first one ends and the second one begins. Pathnames may contain any character except NUL. Depending on which platform you're on, which arguments you used (or didn't use), and whether its standard output is pointing to a terminal or not, ls may randomly decide to replace certain characters in a filename with "?", or simply not print them at all. Never try to parse the output of ls. It's an external command whose output is intended specifically to be read by a human, not parsed by a script. That may seem desirable since ls adds a newline, but if the last filename in the list ends with a newline, `...` or $() will remove that one also. In the ls examples, if the first filename starts with a hyphen, it may lead to pitfall #3. This causes the entire output of ls to be treated as a single word. Instead of iterating over each file name, the loop will only execute once, assigning to f a string with all the filenames rammed together. Nor can you simply change IFS to a newline.
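The delimiter problem described above is not specific to bash and can be reproduced with plain strings. The Python sketch below only mimics what the shell does (it never runs ls); the second and third filenames are invented for the demonstration, and `str.split()` is used as an approximation of word splitting with the default IFS.

```python
# Three legal filenames: one with spaces, one with an embedded newline.
names = ["01 - Don't Eat the Yellow Snow.mp3", "odd\nname.txt", "plain.txt"]

# Roughly what ls writes to a pipe: one name per line.
listing = "\n".join(names) + "\n"

# Unquoted command substitution: split on any whitespace.
words = listing.split()

# "Just set IFS to a newline": split on newlines only.
lines = listing.rstrip("\n").split("\n")

# NUL is the one byte a pathname cannot contain, so it is a safe terminator.
nul_stream = "".join(name + "\0" for name in names)
recovered = nul_stream.split("\0")[:-1]  # drop the empty piece after the last NUL

print(len(names), "files ->", len(words), "words,", len(lines), "lines")
print("NUL round-trip intact:", recovered == names)
```

Only the NUL-terminated stream recovers the original list. In bash itself, the usual way to avoid the problem altogether is to let the shell expand a glob (for example `for f in ./*.mp3`) and quote every use of `"$f"`.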


Logical Segmentation of Source Code

arXiv.org Machine Learning

Many software analysis methods have come to rely on machine learning approaches. Code segmentation - the process of decomposing source code into meaningful blocks - can augment these methods by featurizing code, reducing noise, and limiting the problem space. Traditionally, code segmentation has been done using syntactic cues; current approaches do not intentionally capture logical content. We develop a novel deep learning approach to generate logical code segments regardless of the language or syntactic correctness of the code. Due to the lack of logically segmented source code, we introduce a unique data set construction technique to approximate ground truth for logically segmented code. Logical code segmentation can improve tasks such as automatically commenting code, detecting software vulnerabilities, repairing bugs, labeling code functionality, and synthesizing new code.


Hierarchical Text Generation using an Outline

arXiv.org Machine Learning

Many challenges in natural language processing require generating text, including language translation, dialogue generation, and speech recognition. For all of these problems, text generation becomes more difficult as the text becomes longer. Current language models often struggle to keep track of coherence for long pieces of text. Here, we attempt to have the model construct and use an outline of the text it generates to keep it focused. We find that the usage of an outline improves perplexity. We do not find that using the outline improves human evaluation over a simpler baseline, revealing a discrepancy in perplexity and human perception. Similarly, hierarchical generation is not found to improve human evaluation scores.


Integrating Programming by Example and Natural Language Programming

AAAI Conferences

We motivate the integration of programming by example and natural language programming by developing a system for specifying programs for simple text editing operations based on regular expressions. The programs are described with unconstrained natural language instructions, together with one or more input/output examples. We show that natural language allows the system to deduce the correct program much more often and much faster than is possible with the input/output example(s) alone, showing that natural language programming and programming by example can be combined in a way that overcomes the ambiguities that both methods suffer from individually, while providing a more natural interface to the user.