AITopics | maintainability

Collaborating Authors

maintainability

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Beyond Prototyping: Autonomous, Enterprise-Grade Frontend Development from Pixel to Production via a Specialized Multi-Agent Framework

Ganesaraja, Ramprasath, N, Swathika, AP, Saravanan, Rathinasamy, Kamalkumar, Amancharla, Chetana, Das, Rahul, Panse, Sahil Dilip, Batwe, Aditya, Vijayan, Dileep, Ashok, Veena, P, Thanushree A, Rao, Kausthubh J, Olivero, Alden, Roshan, null, Manthena, Rajeshwar Reddy, A, Asmitha Yuga Sre, Tripathi, Harsh, Selvaraj, Suganya, Chin, Vito, Bhaskar, Kasthuri Rangan, Bhaskar, Kasthuri Rangan, R, Venkatraman, Vijayakumar, Sajit

arXiv.org Artificial IntelligenceDec-9-2025

We present AI4UI, a framework of autonomous front-end development agents purpose-built to meet the rigorous requirements of enterprise-grade application delivery. Unlike general-purpose code assistants designed for rapid prototyping, AI4UI focuses on production readiness delivering secure, scalable, compliant, and maintainable UI code integrated seamlessly into enterprise workflows. AI4UI operates with targeted human-in-the-loop involvement: at the design stage, developers embed a Gen-AI-friendly grammar into Figma prototypes to encode requirements for precise interpretation; and at the post processing stage, domain experts refine outputs for nuanced design adjustments, domain-specific optimizations, and compliance needs. Between these stages, AI4UI runs fully autonomously, converting designs into engineering-ready UI code. Technical contributions include a Figma grammar for autonomous interpretation, domain-aware knowledge graphs, a secure abstract/package code integration strategy, expertise driven architecture templates, and a change-oriented workflow coordinated by specialized agent roles. In large-scale benchmarks against industry baselines and leading competitor systems, AI4UI achieved 97.24% platform compatibility, 87.10% compilation success, 86.98% security compliance, 78.00% feature implementation success, 73.50% code-review quality, and 73.36% UI/UX consistency. In blind preference studies with 200 expert evaluators, AI4UI emerged as one of the leaders demonstrating strong competitive standing among leading solutions. Operating asynchronously, AI4UI generates thousands of validated UI screens in weeks rather than months, compressing delivery timeline

agent, ai4ui, artificial intelligence, (17 more...)

arXiv.org Artificial Intelligence

2512.06046

Genre:

Workflow (1.00)
Research Report > Experimental Study (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Add feedback

Show and Tell: Prompt Strategies for Style Control in Multi-Turn LLM Code Generation

Bohr, Jeremiah

arXiv.org Artificial IntelligenceNov-19-2025

Language models generate functionally correct code that tends toward excessive verbosity, with elaborate documentation and defensive patterns that diverge from human baselines. Two prompting mechanisms have emerged for stylistic control: instruction based prompts that articulate abstract directives, and example based prompts that provide concrete code demonstrations. The core problem is whether stylistic constraints persist when models enhance initial implementations with additional features while maintaining high functional accuracy. Here we show that instruction-based, example-based, and combined prompts produce distinct patterns of initial control and expansion discipline over one enhancement turn. We manipulated system prompts across four conditions in a paired two-turn protocol where models first generated solutions to an intermediate Python task, then revised their code under general improvement directives, holding the user task fixed (N = 160 paired programs). Combined prompts produced the strongest initial compression and greatest expansion discipline. Instructions showed large initial effects and moderate expansion discipline. Examples showed modest initial effects with no expansion discipline. These results show that initial prompt effectiveness and expansion discipline are separate aspects of prompt design, and that combined approaches provide the most stable stylistic control in this two-turn workflow.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.13972

Country: North America > United States > Wisconsin (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Quality Assurance of LLM-generated Code: Addressing Non-Functional Quality Characteristics

Sun, Xin, Ståhl, Daniel, Sandahl, Kristian, Kessler, Christoph

arXiv.org Artificial IntelligenceNov-14-2025

In recent years, LLMs have been widely integrated into software engineering workflows, supporting tasks like code generation. However, while these models often generate functionally correct outputs, we still lack a systematic understanding and evaluation of their non-functional qualities. Existing studies focus mainly on whether generated code passes the tests rather than whether it passes with quality. Guided by the ISO/IEC 25010 quality model, this study conducted three complementary investigations: a systematic review of 108 papers, two industry workshops with practitioners from multiple organizations, and an empirical analysis of patching real-world software issues using three LLMs. Motivated by insights from both the literature and practitioners, the empirical study examined the quality of generated patches on security, maintainability, and performance efficiency. Across the literature, we found that security and performance efficiency dominate academic attention, while maintainability and other qualities are understudied. In contrast, industry experts prioritize maintainability and readability, warning that generated code may accelerate the accumulation of technical debt. In our evaluation of functionally correct patches generated by three LLMs, improvements in one quality dimension often come at the cost of others. Runtime and memory results further show high variance across models and optimization strategies. Overall, our findings reveal a mismatch between academic focus, industry priorities, and model performance, highlighting the urgent need to integrate quality assurance mechanisms into LLM code generation pipelines to ensure that future generated code not only passes tests but truly passes with quality.

large language model, machine learning, maintainability, (19 more...)

arXiv.org Artificial Intelligence

2511.10271

Country:

Europe (1.00)
North America > United States (0.93)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Bridging the Prototype-Production Gap: A Multi-Agent System for Notebooks Transformation

Elhashemy, Hanya, Lotfy, Youssef, Tang, Yongjian

arXiv.org Artificial IntelligenceNov-11-2025

The increasing adoption of Jupyter notebooks in data science and machine learning workflows has created a gap between exploratory code development and production-ready software systems. While notebooks excel at iterative development and visualization, they often lack proper software engineering principles, making their transition to production environments challenging. This paper presents Codelevate, a novel multi-agent system that automatically transforms Jupyter notebooks into well-structured, maintainable Python code repositories. Our system employs three specialized agents - Architect, Developer, and Structure - working in concert through a shared dependency tree to ensure architectural coherence and code quality. Our experimental results validate Codelevate's capability to bridge the prototype-to-production gap through autonomous code transformation, yielding quantifiable improvements in code quality metrics while preserving computational semantics.

artificial intelligence, dependency tree, transformation, (14 more...)

arXiv.org Artificial Intelligence

2511.07257

Country: Europe > Germany (0.16)

Genre: Research Report (0.42)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Add feedback

Teaching Code Refactoring Using LLMs

Khairnar, Anshul, Rajoju, Aarya, Gehringer, Edward F.

arXiv.org Artificial IntelligenceAug-14-2025

--This Innovative Practice full paper explores how Large Language Models (LLMs) can enhance the teaching of code refactoring in software engineering courses through real-time, context-aware feedback. Refactoring improves code quality but is difficult to teach, especially with complex, real-world codebases. Traditional methods like code reviews and static analysis tools offer limited, inconsistent feedback. Our approach integrates LLM-assisted refactoring into a course project using structured prompts to help students identify and address code smells such as long methods and low cohesion. Implemented in Spring 2025 in a long-lived OSS project, the intervention is evaluated through student feedback and planned analysis of code quality improvements. Findings suggest that LLMs can bridge theoretical and practical learning, supporting a deeper understanding of maintainability and refactoring principles. Despite the importance of refactoring, teaching effective techniques remains challenging, particularly when students encounter real-world, complex codebases rather than contrived examples [2]. Students often struggle with identifying refactoring opportunities in unfamiliar code and implementing appropriate transformations that preserve functionality while enhancing quality. Open Source Software (OSS) projects offer an authentic environment for students to practice refactoring skills.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2508.09332

Country: North America > United States > North Carolina (0.15)

Genre: Research Report > New Finding (1.00)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

ReCatcher: Towards LLMs Regression Testing for Code Generation

Abbassi, Altaf Allah, Da Silva, Leuson, Nikanjam, Amin, Khomh, Foutse

arXiv.org Artificial IntelligenceJul-28-2025

Large Language Models (LLMs) for code generation evolve rapidly through fine-tuning, merging, or new model releases. However, such updates can introduce regressions, not only in correctness but also in code quality and performance. To address this, we present ReCatcher, a regression testing framework for Python code generation. ReCatcher systematically compares two LLMs, typically a current model and a candidate update, across three dimensions: logical correctness, static code quality, and execution performance. We apply ReCatcher to assess regressions across three update scenarios, fine-tuning, merging, and model release, using CodeLlama, DeepSeek-Coder, and GPT-4o. Our evaluation shows that fine-tuning with cross-language datasets increases syntax errors by up to 12%. Merging with general-purpose models like Llama2 leads to regressions in correctness by up to 18%. GPT-4o introduces regressions of up to 50% in handling missing imports compared to GPT-3.5-turbo, while GPT-4o-mini suffers up to 80% performance degradation in execution time versus GPT-4o. Overall, logical correctness, performance, and error handling (e.g., syntax errors and missing imports) are the most regression-prone areas. Comparing ReCatcher with baseline solutions, it presents better and consistent accuracy across logical and performance aspects. ReCatcher highlights the importance of systematic regression evaluation before adopting new models, while assisting researchers and practitioners in making more informed update decisions.

large language model, machine learning, manuscriptsubmittedtoacm, (20 more...)

arXiv.org Artificial Intelligence

2507.1939

Country:

Europe (0.46)
North America > Canada > Quebec (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Training Language Models to Generate Quality Code with Program Analysis Feedback

Yao, Feng, Wang, Zilong, Liu, Liyuan, Cui, Junxia, Zhong, Li, Fu, Xiaohan, Mai, Haohui, Krishnan, Vish, Gao, Jianfeng, Shang, Jingbo

arXiv.org Artificial IntelligenceMay-30-2025

Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.

large language model, natural language, vulnerability, (17 more...)

arXiv.org Artificial Intelligence

2505.22704

Country:

North America > United States (0.46)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Do Prompt Patterns Affect Code Quality? A First Empirical Assessment of ChatGPT-Generated Code

Della Porta, Antonio, Lambiase, Stefano, Palomba, Fabio

arXiv.org Artificial IntelligenceApr-21-2025

Large Language Models (LLMs) have rapidly transformed software development, especially in code generation. However, their inconsistent performance, prone to hallucinations and quality issues, complicates program comprehension and hinders maintainability. Research indicates that prompt engineering-the practice of designing inputs to direct LLMs toward generating relevant outputs-may help address these challenges. In this regard, researchers have introduced prompt patterns, structured templates intended to guide users in formulating their requests. However, the influence of prompt patterns on code quality has yet to be thoroughly investigated. An improved understanding of this relationship would be essential to advancing our collective knowledge on how to effectively use LLMs for code generation, thereby enhancing their understandability in contemporary software development. This paper empirically investigates the impact of prompt patterns on code quality, specifically maintainability, security, and reliability, using the Dev-GPT dataset. Results show that Zero-Shot prompting is most common, followed by Zero-Shot with Chain-of-Thought and Few-Shot. Analysis of 7583 code files across quality metrics revealed minimal issues, with Kruskal-Wallis tests indicating no significant differences among patterns, suggesting that prompt structure may not substantially impact these quality metrics in ChatGPT-assisted code generation.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2504.13656

Country:

Europe (0.29)
Asia > Middle East > Republic of Türkiye (0.16)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.49)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

MaintainCoder: Maintainable Code Generation Under Dynamic Requirements

Wang, Zhengren, Ling, Rui, Wang, Chufan, Yu, Yongan, Li, Zhiyu, Xiong, Feiyu, Zhang, Wentao

arXiv.org Artificial IntelligenceMar-31-2025

Modern code generation has made significant strides in functional correctness and execution efficiency. However, these systems often overlook a critical dimension in real-world software development: maintainability. To handle dynamic requirements with minimal rework, we propose MaintainCoder as a pioneering solution. It integrates Waterfall model, design patterns, and multi-agent collaboration to systematically enhance cohesion, reduce coupling, and improve adaptability. We also introduce MaintainBench, a benchmark comprising requirement changes and corresponding dynamic metrics on maintainance effort. Experiments demonstrate that existing code generation methods struggle to meet maintainability standards when requirements evolve. In contrast, MaintainCoder improves maintainability metrics by 14-30% with even higher correctness, i.e. pass@k. Our work not only provides the foundation of maintainable code generation, but also highlights the need for more holistic code quality research. Resources: https://github.com/IAAR-Shanghai/MaintainCoder.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.2426

Country:

Asia > China > Shanghai > Shanghai (0.24)
North America > Canada > Quebec > Montreal (0.04)
Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
Asia > South Korea (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Architecture for a Trustworthy Quantum Chatbot

Aragonés-Soria, Yaiza, Oriol, Manuel

arXiv.org Artificial IntelligenceMar-6-2025

Large language model (LLM)-based tools such as ChatGPT seem useful for classical programming assignments. The more specialized the field, the more likely they lack reliability because of the lack of data to train them. In the case of quantum computing, the quality of answers of generic chatbots is low. C4Q is a chatbot focused on quantum programs that addresses this challenge through a software architecture that integrates specialized LLMs to classify requests and specialized question answering modules with a deterministic logical engine to provide trustworthy quantum computing support. This article describes the latest version (2.0) of C4Q, which delivers several enhancements: ready-to-run Qiskit code for gate definitions and circuit operations, expanded features to solve software engineering tasks such as the travelling salesperson problem and the knapsack problem, and a feedback mechanism for iterative improvement. Extensive testing of the backend confirms the system's reliability, while empirical evaluations show that C4Q 2.0's classification LLM reaches near-perfect accuracy. The evaluation of the result consists in a comparative study with three existing chatbots highlighting C4Q 2.0's maintainability and correctness, reflecting on how software architecture decisions, such as separating deterministic logic from probabilistic text generation impact the quality of the results.

architecture, chatbot, llm, (16 more...)

arXiv.org Artificial Intelligence

2503.04875

Country:

Europe > Switzerland (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Oceania > Australia > Queensland (0.04)
(4 more...)

Genre:

Research Report (0.50)
Overview (0.46)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback