lloc
Quantitative Analysis of Technical Debt and Pattern Violation in Large Language Model Architectures
As Large Language Models (LLMs) transition from code completion tools to autonomous system architects, their impact on long-term software maintainability remains unquantified. While existing research benchmarks functional correctness (pass@k), this study presents the first empirical framework to measure "Architectural Erosion" and the accumulation of Technical Debt in AI-synthesized microservices. We conducted a comparative pilot study of three state-of-the-art models (GPT-5.1, Claude 4.5 Sonnet, and Llama 3 8B) by prompting them to implement a standardized Book Lending Microservice under strict Hexagonal Architecture constraints. Utilizing Abstract Syntax Tree (AST) parsing, we find that while proprietary models achieve high architectural conformance (0% violation rate for GPT-5.1), open-weights models exhibit critical divergence. Specifically, Llama 3 demonstrated an 80% Architectural Violation Rate, frequently bypassing interface adapters to create illegal circular dependencies between Domain and Infrastructure layers. Furthermore, we identified a phenomenon of "Implementation Laziness," where open-weights models generated 60% fewer Logical Lines of Code (LLOC) than their proprietary counterparts, effectively omitting complex business logic to satisfy token constraints. These findings suggest that without automated architectural linting, utilizing smaller open-weights models for system scaffolding accelerates the accumulation of structural technical debt.
Evaluation of large language models for assessing code maintainability
Dillmann, Marc, Siebert, Julien, Trendowicz, Adam
Increased availability of open-source software repositories and recent advances in code analysis using large language models (LLMs) has triggered a wave of new work to automate software engineering tasks that were previously very difficult to automate. In this paper, we investigate a recent line of work that hypothesises that comparing the probability of code generated by LLMs with the probability the current code would have had can indicate potential quality problems. We investigate the association between the cross-entropy of code generated by ten different models (based on GPT2 and Llama2) and the following quality aspects: readability, understandability, complexity, modularisation, and overall maintainability assessed by experts and available in an benchmark dataset. Our results show that, controlling for the number of logical lines of codes (LLOC), cross-entropy computed by LLMs is indeed a predictor of maintainability on a class level (the higher the cross-entropy the lower the maintainability). However, this relation is reversed when one does not control for LLOC (e.g., comparing small classes with longer ones). Furthermore, while the complexity of LLMs affects the range of cross-entropy (smaller models tend to have a wider range of cross-entropy), this plays a significant role in predicting maintainability aspects. Our study limits itself on ten different pretrained models (based on GPT2 and Llama2) and on maintainability aspects collected by Schnappinger et al. When controlling for logical lines of code (LLOC), cross-entropy is a predictor of maintainability. However, while related work has shown the potential usefulness of cross-entropy at the level of tokens or short sequences, at the class level this criterion alone may prove insufficient to predict maintainability and further research is needed to make best use of this information in practice.