maintainability
Beyond Prototyping: Autonomous, Enterprise-Grade Frontend Development from Pixel to Production via a Specialized Multi-Agent Framework
Ganesaraja, Ramprasath, N, Swathika, AP, Saravanan, Rathinasamy, Kamalkumar, Amancharla, Chetana, Das, Rahul, Panse, Sahil Dilip, Batwe, Aditya, Vijayan, Dileep, Ashok, Veena, P, Thanushree A, Rao, Kausthubh J, Olivero, Alden, Roshan, Manthena, Rajeshwar Reddy, A, Asmitha Yuga Sre, Tripathi, Harsh, Selvaraj, Suganya, Chin, Vito, Bhaskar, Kasthuri Rangan, R, Venkatraman, Vijayakumar, Sajit
We present AI4UI, a framework of autonomous front-end development agents purpose-built to meet the rigorous requirements of enterprise-grade application delivery. Unlike general-purpose code assistants designed for rapid prototyping, AI4UI focuses on production readiness, delivering secure, scalable, compliant, and maintainable UI code integrated seamlessly into enterprise workflows. AI4UI operates with targeted human-in-the-loop involvement: at the design stage, developers embed a Gen-AI-friendly grammar into Figma prototypes to encode requirements for precise interpretation; and at the post-processing stage, domain experts refine outputs for nuanced design adjustments, domain-specific optimizations, and compliance needs. Between these stages, AI4UI runs fully autonomously, converting designs into engineering-ready UI code. Technical contributions include a Figma grammar for autonomous interpretation, domain-aware knowledge graphs, a secure abstract/package code integration strategy, expertise-driven architecture templates, and a change-oriented workflow coordinated by specialized agent roles. In large-scale benchmarks against industry baselines and leading competitor systems, AI4UI achieved 97.24% platform compatibility, 87.10% compilation success, 86.98% security compliance, 78.00% feature implementation success, 73.50% code-review quality, and 73.36% UI/UX consistency. In blind preference studies with 200 expert evaluators, AI4UI emerged as one of the leaders, demonstrating strong competitive standing among leading solutions. Operating asynchronously, AI4UI generates thousands of validated UI screens in weeks rather than months, compressing delivery timeline
- North America > United States > Colorado > Boulder County > Boulder (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
- Asia > Middle East > Jordan (0.04)
- Workflow (1.00)
- Research Report > Experimental Study (0.34)
Show and Tell: Prompt Strategies for Style Control in Multi-Turn LLM Code Generation
Language models generate functionally correct code that tends toward excessive verbosity, with elaborate documentation and defensive patterns that diverge from human baselines. Two prompting mechanisms have emerged for stylistic control: instruction-based prompts that articulate abstract directives, and example-based prompts that provide concrete code demonstrations. The core problem is whether stylistic constraints persist when models enhance initial implementations with additional features while maintaining high functional accuracy. Here we show that instruction-based, example-based, and combined prompts produce distinct patterns of initial control and expansion discipline over one enhancement turn. We manipulated system prompts across four conditions in a paired two-turn protocol where models first generated solutions to an intermediate Python task, then revised their code under general improvement directives, holding the user task fixed (N = 160 paired programs). Combined prompts produced the strongest initial compression and greatest expansion discipline. Instructions showed large initial effects and moderate expansion discipline. Examples showed modest initial effects with no expansion discipline. These results show that initial prompt effectiveness and expansion discipline are separate aspects of prompt design, and that combined approaches provide the most stable stylistic control in this two-turn workflow.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
Quality Assurance of LLM-generated Code: Addressing Non-Functional Quality Characteristics
Sun, Xin, Ståhl, Daniel, Sandahl, Kristian, Kessler, Christoph
In recent years, LLMs have been widely integrated into software engineering workflows, supporting tasks like code generation. However, while these models often generate functionally correct outputs, we still lack a systematic understanding and evaluation of their non-functional qualities. Existing studies focus mainly on whether generated code passes the tests rather than whether it passes with quality. Guided by the ISO/IEC 25010 quality model, this study conducted three complementary investigations: a systematic review of 108 papers, two industry workshops with practitioners from multiple organizations, and an empirical analysis of patching real-world software issues using three LLMs. Motivated by insights from both the literature and practitioners, the empirical study examined the quality of generated patches on security, maintainability, and performance efficiency. Across the literature, we found that security and performance efficiency dominate academic attention, while maintainability and other qualities are understudied. In contrast, industry experts prioritize maintainability and readability, warning that generated code may accelerate the accumulation of technical debt. In our evaluation of functionally correct patches generated by three LLMs, improvements in one quality dimension often come at the cost of others. Runtime and memory results further show high variance across models and optimization strategies. Overall, our findings reveal a mismatch between academic focus, industry priorities, and model performance, highlighting the urgent need to integrate quality assurance mechanisms into LLM code generation pipelines to ensure that future generated code not only passes tests but truly passes with quality.
- Europe > Austria > Vienna (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.14)
- Europe > Portugal > Lisbon > Lisbon (0.04)
Bridging the Prototype-Production Gap: A Multi-Agent System for Notebooks Transformation
Elhashemy, Hanya, Lotfy, Youssef, Tang, Yongjian
The increasing adoption of Jupyter notebooks in data science and machine learning workflows has created a gap between exploratory code development and production-ready software systems. While notebooks excel at iterative development and visualization, they often lack proper software engineering principles, making their transition to production environments challenging. This paper presents Codelevate, a novel multi-agent system that automatically transforms Jupyter notebooks into well-structured, maintainable Python code repositories. Our system employs three specialized agents - Architect, Developer, and Structure - working in concert through a shared dependency tree to ensure architectural coherence and code quality. Our experimental results validate Codelevate's capability to bridge the prototype-to-production gap through autonomous code transformation, yielding quantifiable improvements in code quality metrics while preserving computational semantics.
Teaching Code Refactoring Using LLMs
Khairnar, Anshul, Rajoju, Aarya, Gehringer, Edward F.
This Innovative Practice full paper explores how Large Language Models (LLMs) can enhance the teaching of code refactoring in software engineering courses through real-time, context-aware feedback. Refactoring improves code quality but is difficult to teach, especially with complex, real-world codebases. Traditional methods like code reviews and static analysis tools offer limited, inconsistent feedback. Our approach integrates LLM-assisted refactoring into a course project using structured prompts to help students identify and address code smells such as long methods and low cohesion. Implemented in Spring 2025 in a long-lived OSS project, the intervention is evaluated through student feedback and planned analysis of code quality improvements. Findings suggest that LLMs can bridge theoretical and practical learning, supporting a deeper understanding of maintainability and refactoring principles. Despite the importance of refactoring, teaching effective techniques remains challenging, particularly when students encounter real-world, complex codebases rather than contrived examples [2]. Students often struggle with identifying refactoring opportunities in unfamiliar code and implementing appropriate transformations that preserve functionality while enhancing quality. Open Source Software (OSS) projects offer an authentic environment for students to practice refactoring skills.
- North America > United States > North Carolina > Wake County > Raleigh (0.05)
- Atlantic Ocean > North Atlantic Ocean > Baltic Sea (0.04)
- Asia > China > Hebei Province > Shijiazhuang (0.04)
Training Language Models to Generate Quality Code with Program Analysis Feedback
Yao, Feng, Wang, Zilong, Liu, Liyuan, Cui, Junxia, Zhong, Li, Fu, Xiaohan, Mai, Haohui, Krishnan, Vish, Gao, Jianfeng, Shang, Jingbo
Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
- Europe > Austria > Vienna (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
Do Prompt Patterns Affect Code Quality? A First Empirical Assessment of ChatGPT-Generated Code
Della Porta, Antonio, Lambiase, Stefano, Palomba, Fabio
Large Language Models (LLMs) have rapidly transformed software development, especially in code generation. However, their inconsistent performance, prone to hallucinations and quality issues, complicates program comprehension and hinders maintainability. Research indicates that prompt engineering, the practice of designing inputs to direct LLMs toward generating relevant outputs, may help address these challenges. In this regard, researchers have introduced prompt patterns, structured templates intended to guide users in formulating their requests. However, the influence of prompt patterns on code quality has yet to be thoroughly investigated. An improved understanding of this relationship would be essential to advancing our collective knowledge on how to effectively use LLMs for code generation, thereby enhancing their understandability in contemporary software development. This paper empirically investigates the impact of prompt patterns on code quality, specifically maintainability, security, and reliability, using the Dev-GPT dataset. Results show that Zero-Shot prompting is most common, followed by Zero-Shot with Chain-of-Thought and Few-Shot. Analysis of 7583 code files across quality metrics revealed minimal issues, with Kruskal-Wallis tests indicating no significant differences among patterns, suggesting that prompt structure may not substantially impact these quality metrics in ChatGPT-assisted code generation.
- Europe > Italy (0.40)
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.49)
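The prompt-pattern study above reports Kruskal-Wallis tests across pattern groups. As a rough illustration of what that test computes, here is a standard-library sketch of the H statistic with the usual tie correction. The per-file issue counts are made-up placeholders, not data from the paper, and the function name `kruskal_wallis_h` is hypothetical. In practice one would more likely call `scipy.stats.kruskal`, which also returns a p-value.

```python
from collections import Counter
from typing import Sequence


def kruskal_wallis_h(*groups: Sequence[float]) -> float:
    """Kruskal-Wallis H statistic with tie correction, using average ranks."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")

    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)

    # Assign each distinct value the mean of the 1-based ranks it occupies.
    rank_of: dict = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    # H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n_total * (n_total + 1)) * rank_term - 3.0 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    ties = Counter(pooled)
    correction = 1.0 - sum(t**3 - t for t in ties.values()) / (n_total**3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


if __name__ == "__main__":
    # Placeholder issue counts per file for three prompt patterns (not real data).
    zero_shot = [0, 1, 0, 2, 1]
    zero_shot_cot = [1, 0, 1, 1, 0]
    few_shot = [0, 0, 2, 1, 1]
    print(kruskal_wallis_h(zero_shot, zero_shot_cot, few_shot))
```

A small H relative to the chi-square distribution with (number of groups - 1) degrees of freedom corresponds to the "no significant differences" outcome the abstract describes.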
MaintainCoder: Maintainable Code Generation Under Dynamic Requirements
Wang, Zhengren, Ling, Rui, Wang, Chufan, Yu, Yongan, Li, Zhiyu, Xiong, Feiyu, Zhang, Wentao
Modern code generation has made significant strides in functional correctness and execution efficiency. However, these systems often overlook a critical dimension in real-world software development: maintainability. To handle dynamic requirements with minimal rework, we propose MaintainCoder as a pioneering solution. It integrates the Waterfall model, design patterns, and multi-agent collaboration to systematically enhance cohesion, reduce coupling, and improve adaptability. We also introduce MaintainBench, a benchmark comprising requirement changes and corresponding dynamic metrics on maintenance effort. Experiments demonstrate that existing code generation methods struggle to meet maintainability standards when requirements evolve. In contrast, MaintainCoder improves maintainability metrics by 14-30% with even higher correctness, i.e. pass@k. Our work not only provides the foundation of maintainable code generation, but also highlights the need for more holistic code quality research. Resources: https://github.com/IAAR-Shanghai/MaintainCoder.
- Asia > China > Shanghai > Shanghai (0.24)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
- Asia > South Korea (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)
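The MaintainCoder abstract above measures correctness with pass@k. The abstract does not say which estimator was used; the sketch below assumes the commonly used unbiased estimator, pass@k = 1 - C(n - c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes, given c passing."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples for one problem, 3 of which pass the tests.
    for k in (1, 5, 10):
        print(k, pass_at_k(n=10, c=3, k=k))
```

A benchmark-level score is then typically the mean of this value over all problems.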
Architecture for a Trustworthy Quantum Chatbot
Aragonés-Soria, Yaiza, Oriol, Manuel
Large language model (LLM)-based tools such as ChatGPT seem useful for classical programming assignments. The more specialized the field, the more likely they are to lack reliability because of the lack of data to train them. In the case of quantum computing, the quality of answers of generic chatbots is low. C4Q is a chatbot focused on quantum programs that addresses this challenge through a software architecture that integrates specialized LLMs to classify requests and specialized question answering modules with a deterministic logical engine to provide trustworthy quantum computing support. This article describes the latest version (2.0) of C4Q, which delivers several enhancements: ready-to-run Qiskit code for gate definitions and circuit operations, expanded features to solve software engineering tasks such as the travelling salesperson problem and the knapsack problem, and a feedback mechanism for iterative improvement. Extensive testing of the backend confirms the system's reliability, while empirical evaluations show that C4Q 2.0's classification LLM reaches near-perfect accuracy. The evaluation of the result consists of a comparative study with three existing chatbots, highlighting C4Q 2.0's maintainability and correctness and reflecting on how software architecture decisions, such as separating deterministic logic from probabilistic text generation, impact the quality of the results.
- North America > United States (0.28)
- Europe > Portugal (0.14)
- Research Report (0.50)
- Overview (0.46)
From PowerPoint UI Sketches to Web-Based Applications: Pattern-Driven Code Generation for GIS Dashboard Development Using Knowledge-Augmented LLMs, Context-Aware Visual Prompting, and the React Framework
Developing web-based GIS applications, commonly known as CyberGIS dashboards, for querying and visualizing GIS data in environmental research often demands repetitive and resource-intensive efforts. While Generative AI offers automation potential for code generation, it struggles with complex scientific applications due to challenges in integrating domain knowledge, software engineering principles, and UI design best practices. This paper introduces a knowledge-augmented code generation framework that retrieves software engineering best practices, domain expertise, and advanced technology stacks from a specialized knowledge base to enhance Generative Pre-trained Transformers (GPT) for front-end development. The framework automates the creation of GIS-based web applications (e.g., dashboards, interfaces) from user-defined UI wireframes sketched in tools like PowerPoint or Adobe Illustrator. A novel Context-Aware Visual Prompting method, implemented in Python, extracts layouts and interface features from these wireframes to guide code generation. Our approach leverages Large Language Models (LLMs) to generate front-end code by integrating structured reasoning, software engineering principles, and domain knowledge, drawing inspiration from Chain-of-Thought (CoT) prompting and Retrieval-Augmented Generation (RAG). A case study demonstrates the framework's capability to generate a modular, maintainable web platform hosting multiple dashboards for visualizing environmental and energy data (e.g., time-series, shapefiles, rasters) from user-sketched wireframes. By employing a knowledge-driven approach, the framework produces scalable, industry-standard front-end code using design patterns such as Model-View-ViewModel (MVVM) and frameworks like React. This significantly reduces manual effort in design and coding, pioneering an automated and efficient method for developing smart city software.
- North America > United States > Tennessee > Anderson County > Oak Ridge (0.04)
- Europe > Germany > Hesse > Darmstadt Region > Wiesbaden (0.04)
- North America > United States > Indiana (0.04)
- Government > Regional Government > North America Government > United States Government (1.00)
- Energy > Renewable (1.00)
- Information Technology (0.93)
- Transportation (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.66)