palindrome
HAI-Eval: Measuring Human-AI Synergy in Collaborative Coding
Luo, Hanjun, Ni, Chiming, Wen, Jiaheng, Huang, Zhimu, Wang, Yiran, Liao, Bingduo, Chung, Sylvia, Jin, Yingbin, Li, Xinfeng, Xu, Wenyuan, Wang, XiaoFeng, Salam, Hanan
LLM-powered coding agents are reshaping the development paradigm. However, existing evaluation systems, whether traditional tests for humans or benchmarks for LLMs, fail to capture this shift. They remain focused on well-defined algorithmic problems, which excludes problems where success depends on human-AI collaboration. Such collaborative problems not only require human reasoning to interpret complex contexts and guide solution strategies, but also demand AI efficiency for implementation. To bridge this gap, we introduce HAI-Eval, a unified benchmark designed to measure the synergy of human-AI partnership in coding. HAI-Eval's core innovation is its "Collaboration-Necessary" problem templates, which are intractable for both standalone LLMs and unaided humans, but solvable through effective collaboration. Specifically, HAI-Eval uses 45 templates to dynamically create tasks. It also provides a standardized IDE for human participants and a reproducible toolkit with 450 task instances for LLMs, ensuring an ecologically valid evaluation. We conduct a within-subject study with 45 participants and benchmark their performance against 5 state-of-the-art LLMs under 4 different levels of human intervention. Results show that standalone LLMs and unaided participants achieve poor pass rates (0.67% and 18.89%, respectively), whereas human-AI collaboration significantly improves performance to 31.11%. Our analysis reveals an emerging co-reasoning partnership. This finding challenges the traditional human-tool hierarchy by showing that strategic breakthroughs can originate from either humans or AI. HAI-Eval establishes not only a challenging benchmark for next-generation coding agents but also a grounded, scalable framework for assessing core developer competencies in the AI era. Our benchmark and interactive demo will be openly accessible.
- Asia > China > Hong Kong (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Security & Privacy (1.00)
- Education (1.00)
When are Kalman-Filter Restless Bandits Indexable?
Christopher R. Dance, Tomi Silander
We study the restless bandit associated with an extremely simple scalar Kalman filter model in discrete time. Under certain assumptions, we prove that the problem is indexable in the sense that the Whittle index is a non-decreasing function of the relevant belief state. In spite of the long history of this problem, this appears to be the first such proof. We use results about Schur-convexity and mechanical words, which are particular binary strings intimately related to palindromes.
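The link between mechanical words and palindromes mentioned in the abstract can be illustrated with a small sketch. It assumes the standard combinatorics-on-words definition of a lower Christoffel word (a finite mechanical word of rational slope); the paper's own definitions and notation may differ, and the function names `christoffel_word` and `is_palindrome` are hypothetical. Removing the first and last letters of such a word leaves a palindrome.

```python
from math import gcd


def christoffel_word(p: int, q: int) -> str:
    """Lower Christoffel word with p ones and q zeros (p and q coprime)."""
    if p < 1 or q < 1 or gcd(p, q) != 1:
        raise ValueError("p and q must be positive and coprime")
    n = p + q
    # Letter i is the increment of floor(i * p / n).
    # Integer division keeps the computation exact.
    return "".join(str((i + 1) * p // n - i * p // n) for i in range(n))


def is_palindrome(s: str) -> bool:
    """True if s reads the same forwards and backwards."""
    return s == s[::-1]


word = christoffel_word(2, 3)
central = word[1:-1]  # drop the first and last letter
print(word, central, is_palindrome(central))  # prints: 00101 010 True
```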
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
Shi, Yuling, Wang, Songsong, Wan, Chengcheng, Gu, Xiaodong
While large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked by subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
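The bottom-up order described in the abstract can be sketched as a post-order walk over a tree of subfunctions. This is only an illustration of the traversal, not the MGDebugger implementation: the `SubFunction` class, the `debug_bottom_up` function, and the `repair` callback are hypothetical names, and the callback stands in for whatever LLM-based repair step is applied to each subfunction.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SubFunction:
    """A node in the (hypothetical) subfunction tree."""
    name: str
    children: List["SubFunction"] = field(default_factory=list)


def debug_bottom_up(
    node: SubFunction,
    repair: Callable[[SubFunction], None],
    order: Optional[List[str]] = None,
) -> List[str]:
    """Repair all children before their parent and return the visit order."""
    if order is None:
        order = []
    for child in node.children:
        debug_bottom_up(child, repair, order)
    repair(node)  # placeholder for the actual debugging step
    order.append(node.name)
    return order


tree = SubFunction("main", [
    SubFunction("parse", [SubFunction("tokenize")]),
    SubFunction("evaluate"),
])
print(debug_bottom_up(tree, repair=lambda n: None))
# prints: ['tokenize', 'parse', 'evaluate', 'main']
```

Visiting leaves first means a parent is only examined after the helpers it calls have already been checked.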
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Workflow (0.93)
- Research Report > New Finding (0.46)
Is Self-Repair a Silver Bullet for Code Generation?
Olausson, Theo X., Inala, Jeevana Priya, Wang, Chenglong, Gao, Jianfeng, Solar-Lezama, Armando
Large language models have shown remarkable aptitude in code generation, but still struggle to perform complex tasks. Self-repair -- in which the model debugs and repairs its own code -- has recently become a popular way to boost performance in these settings. However, despite its increasing popularity, existing studies of self-repair have been limited in scope; in many settings, its efficacy thus remains poorly understood. In this paper, we analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS. We find that when the cost of carrying out repair is taken into account, performance gains are often modest, vary a lot between subsets of the data, and are sometimes not present at all. We hypothesize that this is because self-repair is bottlenecked by the model's ability to provide feedback on its own code; using a stronger model to artificially boost the quality of the feedback, we observe substantially larger performance gains. Similarly, a small-scale study in which we provide GPT-4 with feedback from human participants suggests that even for the strongest models, self-repair still lags far behind what can be achieved with human-level debugging.
- Education > Educational Setting (1.00)
- Government (0.93)
LLM-Assisted Code Cleaning For Training Accurate Code Generators
Jain, Naman, Zhang, Tianjun, Chiang, Wei-Lin, Gonzalez, Joseph E., Sen, Koushik, Stoica, Ion
Natural language to code generation is an important application area of LLMs and has received wide attention from the community. The majority of relevant studies have exclusively concentrated on increasing the quantity and functional correctness of training sets while disregarding other stylistic elements of programs. More recently, data quality has garnered a lot of interest and multiple works have showcased its importance for improving performance. In this work, we investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs by (1) renaming variables, (2) modularizing and decomposing complex code into smaller helper sub-functions, and (3) inserting natural-language-based plans via LLM-based transformations. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on our transformed modularized programs improves the performance by up to 30% compared to fine-tuning on the original dataset. Additionally, we demonstrate improved performance from using a smaller amount of higher-quality data, finding that a model fine-tuned on the entire original dataset is outperformed by a model trained on 15% of our cleaned dataset. Even in comparison to closed-source models, our models outperform the much larger AlphaCoder models.
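To make the variable-renaming step concrete, here is a minimal sketch that applies a fixed renaming map with Python's standard `ast` module (Python 3.9+ for `ast.unparse`). This is a rule-based stand-in, not the paper's method: the abstract describes LLM-based transformations, and in that setting a model would propose the new names. `RenameVariables` and `rename` are hypothetical helpers.

```python
import ast
from typing import Dict


class RenameVariables(ast.NodeTransformer):
    """Rename identifiers according to a fixed old-name -> new-name map."""

    def __init__(self, mapping: Dict[str, str]) -> None:
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Variable reads and writes.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        # Function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def rename(source: str, mapping: Dict[str, str]) -> str:
    """Return source with the mapped identifiers renamed."""
    tree = RenameVariables(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


original = "def f(a, b):\n    return a * b"
print(rename(original, {"a": "width", "b": "height"}))
```

The sketch ignores scoping, so a name that is reused with different meanings in different functions would be renamed everywhere.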
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report (1.00)
- Overview (0.67)
A blind spot for large language models: Supradiegetic linguistic information
Zimmerman, Julia Witte, Hudon, Denis, Cramer, Kathryn, Onge, Jonathan St., Fudolig, Mikaela, Trujillo, Milo Z., Danforth, Christopher M., Dodds, Peter Sheridan
Large Language Models (LLMs) like ChatGPT reflect profound changes in the field of Artificial Intelligence, achieving a linguistic fluency that is impressively, even shockingly, human-like. The extent of their current and potential capabilities is an active area of investigation by no means limited to scientific researchers. It is common for people to frame the training data for LLMs as "text" or even "language". We examine the details of this framing using ideas from several areas, including linguistics, embodied cognition, cognitive science, mathematics, and history. We propose that considering what it is like to be an LLM like ChatGPT, as Nagel might have put it, can help us gain insight into its capabilities in general, and in particular, that its exposure to linguistic training data can be productively reframed as exposure to the diegetic information encoded in language, and its deficits can be reframed as ignorance of extradiegetic information, including supradiegetic linguistic information. Supradiegetic linguistic information consists of those arbitrary aspects of the physical form of language that are not derivable from the one-dimensional relations of context -- frequency, adjacency, proximity, co-occurrence -- that LLMs like ChatGPT have access to. Roughly speaking, the diegetic portion of a word can be thought of as its function, its meaning, as the information in a theoretical vector in a word embedding, while the supradiegetic portion of the word can be thought of as its form, like the shapes of its letters or the sounds of its syllables. We use these concepts to investigate why LLMs like ChatGPT have trouble handling palindromes, the visual characteristics of symbols, translating Sumerian cuneiform, and continuing integer sequences.
- North America > United States > Vermont > Chittenden County > Burlington (0.14)
- North America > United States > Ohio (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Energy (0.67)
- Health & Medicine > Therapeutic Area > Neurology (0.46)
Measuring Quality of DNA Sequence Data via Degradation
Karr, Alan F., Hauzel, Jason, Porter, Adam A., Schaefer, Marcel
As public genome databases proliferate, their immense scientific power is tempered by skepticism about their quality. The skepticism is not merely anecdotal: there are documented instances and implications (Commichaux et al., 2021; Langdon, 2014; Steinegger and Salzberg, 2020). Although we argue in Appendix A that data quality should not be construed as comprising only errors in data, the principal contribution of the paper is a novel paradigm for measuring quality of genome sequences by deliberately introducing errors that reduce quality, a process we term degradation. The errors are single nucleotide polymorphisms (SNPs), insertions and deletions that both occur naturally as mutations and arise in next generation sequencing. Our reasoning is that higher quality data are more fragile: the higher the initial quality, the greater the effect of the same amount of degradation.
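The degradation idea can be sketched as a function that injects a chosen number of random single-base substitutions, insertions, and deletions into a sequence. This is an illustrative assumption about how such errors might be simulated, not the authors' procedure: the `degrade` function, the uniform choice among the three error types, and the four-letter alphabet are all hypothetical.

```python
import random

BASES = "ACGT"


def degrade(seq: str, n_errors: int, rng: random.Random) -> str:
    """Apply n_errors random substitutions, insertions, or deletions to seq."""
    s = list(seq)
    for _ in range(n_errors):
        op = rng.choice(("substitute", "insert", "delete"))
        if op == "substitute" and s:
            i = rng.randrange(len(s))
            # Replace the base with a different one.
            s[i] = rng.choice([b for b in BASES if b != s[i]])
        elif op == "insert":
            s.insert(rng.randrange(len(s) + 1), rng.choice(BASES))
        elif op == "delete" and s:
            del s[rng.randrange(len(s))]
    return "".join(s)


rng = random.Random(0)  # fixed seed so runs are repeatable
print(degrade("ACGTACGTACGT", n_errors=3, rng=rng))
```

A quality measure could then be tracked as `n_errors` grows, which is the kind of comparison the abstract's fragility argument relies on.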
- Europe > Austria > Vienna (0.14)
- North America > United States > New York (0.05)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
Preprocessing in Inductive Logic Programming
Inductive logic programming is a type of machine learning in which logic programs are learned from examples. This learning typically occurs relative to some background knowledge provided as a logic program. This dissertation introduces bottom preprocessing, a method for generating initial constraints on the programs an ILP system must consider. Bottom preprocessing applies ideas from inverse entailment to modern ILP systems. Inverse entailment is an influential early ILP approach introduced with Progol. This dissertation also presents $\bot$-Popper, an implementation of bottom preprocessing for the modern ILP system Popper. It is shown experimentally that bottom preprocessing can reduce learning times of ILP systems on hard problems. This reduction can be especially significant when the amount of background knowledge in the problem is large.
- Research Report (1.00)
- Summary/Review (0.68)
When are Kalman-Filter Restless Bandits Indexable?
Dance, Christopher R., Silander, Tomi
We study the restless bandit associated with an extremely simple scalar Kalman filter model in discrete time. Under certain assumptions, we prove that the problem is indexable in the sense that the Whittle index is a non-decreasing function of the relevant belief state. In spite of the long history of this problem, this appears to be the first such proof. We use results about Schur-convexity and mechanical words, which are particular binary strings intimately related to palindromes.