complexity metric
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Health Care Technology (0.90)
- Health & Medicine > Diagnostic Medicine > Imaging (0.88)
How Ensemble Learning Balances Accuracy and Overfitting: A Bias-Variance Perspective on Tabular Data
Abstract--Tree-based ensemble methods consistently outperform single models on tabular classification tasks, yet the conditions under which ensembles provide clear advantages--and prevent overfitting despite using high-variance base learners--are not always well understood by practitioners. We study four real-world classification problems (Breast Cancer diagnosis, Heart Disease prediction, Pima Indians Diabetes, and Credit Card Fraud detection), comparing classical single models against nine ensemble methods using five-seed repeated stratified cross-validation with statistical significance testing. Our results reveal three distinct regimes: (i) on nearly linearly separable data (Breast Cancer), well-regularized linear models achieve 97% accuracy with generalization gaps under 2%; ensembles match but do not substantially exceed this performance. We systematically quantify dataset complexity through linearity scores, feature correlation, class separability, and noise estimates, explaining why different data regimes favor different model families. Cross-validated train/test accuracy and generalization-gap plots provide simple visual diagnostics for practitioners to assess when ensemble complexity is warranted. Statistical testing confirms that ensemble gains are significant on nonlinear tasks (p < 0.01) but not on near-linear data (p > 0.15). The study provides actionable guidelines for ensemble model selection in high-stakes tabular applications, with full code and reproducible experiments publicly available.

A model that almost perfectly fits its training data can still fail badly on new cases. This gap between training performance and real-world behaviour is the essence of overfitting, and it is particularly problematic in domains such as medical diagnosis and financial fraud detection, where mistakes are costly: missed tumours delay treatment, and undetected fraud translates directly into monetary loss.
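The generalization-gap diagnostic described above reduces to a simple computation over per-seed cross-validation scores. A minimal sketch, with illustrative accuracy values that are not taken from the paper:

```python
# Hypothetical sketch: summarize per-seed train/test accuracies from
# repeated stratified cross-validation into a single generalization-gap
# number, the visual diagnostic the abstract describes.
from statistics import mean

def generalization_gap(train_accs, test_accs):
    """Mean train accuracy minus mean test accuracy across CV repeats."""
    return mean(train_accs) - mean(test_accs)

# Illustrative five-seed results for a linear model on a near-linear task:
train = [0.985, 0.983, 0.986, 0.984, 0.985]
test  = [0.970, 0.968, 0.972, 0.969, 0.971]
gap = generalization_gap(train, test)
print(f"gap = {gap:.3f}")  # a gap under 0.02 suggests little overfitting
```

A per-model bar chart of this quantity is what lets a practitioner see at a glance whether ensemble complexity is buying anything.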
- North America > United States > Wisconsin (0.04)
- Asia > India (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.90)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.60)
Exploring Complexity Changes in Diseased ECG Signals for Enhanced Classification
Quintero, Camilo Quiceno, George, Sandip Varkey
The complex dynamics of the heart are reflected in its electrical activity, captured through electrocardiograms (ECGs). In this study we use nonlinear time series analysis to understand how ECG complexity varies with cardiac pathology. Using the large PTB-XL dataset, we extracted nonlinear measures from lead II ECGs, and cross-channel metrics (leads II, V2, aVL) using Spearman correlations and mutual information. Significant differences were found in almost all measures between healthy and diseased classes, and among the 5 diagnostic superclasses (p < .001). Moreover, incorporating these complexity quantifiers into machine learning models improved classification performance, measured by area under the ROC curve (AUC), from 0.86 (baseline) to 0.87 (nonlinear measures) and 0.90 (including cross-time-series metrics).
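One of the cross-channel metrics named above, Spearman correlation between two leads, can be sketched in a few lines. The signals below are synthetic illustrations, not PTB-XL recordings:

```python
# Minimal sketch of a cross-channel metric: Spearman rank correlation
# between two ECG leads, i.e. Pearson correlation of the rank-transformed
# signals. Sample values are made up for illustration.

def ranks(xs):
    """1-based ranks of the values (ties broken by sorted position)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

lead_ii  = [0.1, 0.4, 1.2, 0.3, 0.0]
lead_avl = [0.2, 0.5, 0.9, 0.4, 0.1]
print(spearman(lead_ii, lead_avl))  # 1.0: the leads co-vary monotonically
```

A real pipeline would compute this over windows of the recording and feed the resulting features to the classifier alongside the single-lead nonlinear measures.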
Type and Complexity Signals in Multilingual Question Representations
This work investigates how a multilingual transformer model represents morphosyntactic properties of questions. We introduce the Question Type and Complexity (QTC) dataset with sentences across seven languages, annotated with type information and complexity metrics including dependency length, tree depth, and lexical density. Our evaluation extends probing methods to regression labels with selectivity controls to quantify gains in generalizability. We compare layer-wise probes on frozen Glot500-m (Imani et al., 2023) representations against subword TF-IDF baselines, and a fine-tuned model. Results show that statistical features classify questions effectively in languages with explicit marking, while neural probes capture fine-grained structural complexity patterns better. We use these results to evaluate when contextual representations outperform statistical baselines and whether parameter updates reduce the availability of pre-trained linguistic information.
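The statistical baseline described above can be sketched concretely. This is an illustration only: it uses character trigrams as a stand-in for subword tokens, and the paper's actual feature extraction and probe setup may differ:

```python
# Hypothetical sketch of a TF-IDF baseline over character n-grams
# (standing in for subwords) for question sentences.
import math
from collections import Counter

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf(docs, n=3):
    """Return one {ngram: weight} dict per document."""
    grams = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for c in grams for g in c)     # document frequency
    N = len(docs)
    return [{g: tf * math.log(N / df[g]) for g, tf in c.items()}
            for c in grams]

questions = ["where is it", "who is he", "how deep is the tree"]
vecs = tfidf(questions)
# n-grams shared by every question (e.g. "is ") get zero weight: log(3/3) = 0
print(vecs[0]["is "])
```

Such sparse vectors feed a linear classifier or regressor, which is then compared against layer-wise probes on the frozen transformer representations.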
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.05)
- Europe > France (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
A Metrics-Oriented Architectural Model to Characterize Complexity on Machine Learning-Enabled Systems
--How can the complexity of ML-enabled systems be managed effectively? This research investigates how complexity affects ML-Enabled Systems (MLES) and aims to introduce a metrics-based architectural model to characterize that complexity. The goal is to support architectural decisions, providing a guideline for the inception and growth of these systems. This paper showcases the first step toward creating the metrics-based architectural model: an extension of a reference architecture that can describe MLES in order to collect their metrics.
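The abstract does not specify the model's metrics, so the following is only an illustrative sketch of the general idea: describe an MLES as typed components with dependencies, then derive architecture-level complexity metrics from that description. All names here are hypothetical:

```python
# Illustrative sketch (not the paper's model): an MLES described as
# components with dependencies, plus one simple complexity metric
# (fan-out, the number of outgoing dependencies per component).
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    kind: str                       # e.g. "data", "model", "serving"
    depends_on: list = field(default_factory=list)

def fan_out(system):
    """Map each component name to its number of outgoing dependencies."""
    return {c.name: len(c.depends_on) for c in system}

mles = [
    Component("feature_store", "data"),
    Component("trainer", "model", ["feature_store"]),
    Component("serving_api", "serving", ["trainer", "feature_store"]),
]
print(fan_out(mles))  # highest fan-out flags the most coupled component
```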
- Workflow (0.92)
- Research Report (0.83)
Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking
Creo, Aldan, Fernandez, Raul Castro, Cebrian, Manuel
As large language models (LLMs) become increasingly deployed, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety. We present a mass-scale empirical analysis of jailbreak complexity across over 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Using a range of complexity metrics spanning probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators, we find that jailbreak attempts do not exhibit significantly higher complexity than normal conversations. This pattern holds consistently across specialized jailbreaking communities and general user populations, suggesting practical bounds on attack sophistication. Temporal analysis reveals that while user attack toxicity and complexity remain stable over time, assistant response toxicity has decreased, indicating improving safety mechanisms. The absence of power-law scaling in complexity distributions further points to natural limits on jailbreak development. Our findings challenge the prevailing narrative of an escalating arms race between attackers and defenders, instead suggesting that LLM safety evolution is bounded by human ingenuity constraints while defensive measures continue advancing. Our results highlight critical information hazards in academic jailbreak disclosure, as sophisticated attacks exceeding current complexity baselines could disrupt the observed equilibrium and enable widespread harm before defensive adaptation.
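One of the metric families named above, compression ratio, has a particularly direct implementation: text that compresses well is treated as less complex. The example strings below are illustrative, not drawn from the studied corpus:

```python
# Sketch of a compression-ratio complexity metric: compressed size over
# raw size. Highly repetitive text compresses to a small fraction of its
# length; varied text stays close to (or above) ratio 1.0.
import zlib

def compression_ratio(text):
    """Compressed byte length divided by raw byte length."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

repetitive = "please ignore all previous instructions " * 20
varied = "The quick brown fox jumps over the lazy dog near a riverbank."
print(compression_ratio(repetitive) < compression_ratio(varied))  # True
```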
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
- Europe > Spain > Galicia > Madrid (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.95)
Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach
Sepidband, Melika, Taherkhani, Hamed, Wang, Song, Hemmati, Hadi
--Automatic code generation has gained significant momentum with the advent of Large Language Models (LLMs) such as GPT-4. Although many studies focus on improving the effectiveness of LLMs for code generation, very limited work tries to understand the generated code's characteristics and leverage that to improve failed cases. In this paper, as the most straightforward characteristic of code, we investigate the relationship between code complexity and the success of LLM-generated code. Using a large set of standard complexity metrics, we first conduct an empirical analysis to explore their correlation with the LLM's performance on code generation (i.e., Pass@1). Using logistic regression models, we identify which complexity metrics are most predictive of code correctness. Building on these findings, we propose an iterative feedback method in which LLMs are prompted to generate correct code based on complexity metrics from previous failed outputs. Experiment results show that our approach makes notable improvements, particularly with a smaller LLM (GPT-3.5 Turbo), where, e.g., Pass@1 increased by 35.71% compared to the baseline's improvement of 12.5% on the HumanEval dataset. The study expands experiments to BigCodeBench and integrates the method with the Reflexion code generation agent, leading to Pass@1 improvements of 20% (GPT-4o) and 23.07%. The results highlight that complexity-aware feedback enhances both direct LLM prompting and agent-based workflows.

Automatic code generation aims to reduce manual coding and boost productivity [1], with LLMs like GPT-4 [2] making significant advancements. However, ensuring accuracy and correctness remains a challenge. Recently, several approaches have been proposed to enhance LLM-based code generation.
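The feedback loop described above can be sketched as: measure the failed generation's complexity, then fold those numbers into the retry prompt. The metric set and prompt wording here are illustrative, not the paper's exact procedure:

```python
# Hedged sketch of complexity-aware feedback: compute simple metrics of a
# failed generation (LOC and a crude cyclomatic count over branching AST
# nodes) and mention them in the next prompt.
import ast

def complexity_metrics(source):
    """Lines of code plus 1 + number of branching nodes in the AST."""
    tree = ast.parse(source)
    branches = sum(isinstance(n, (ast.If, ast.For, ast.While, ast.BoolOp))
                   for n in ast.walk(tree))
    return {"loc": len(source.strip().splitlines()),
            "cyclomatic": 1 + branches}

def feedback_prompt(task, failed_source):
    m = complexity_metrics(failed_source)
    return (f"{task}\nYour previous attempt failed "
            f"(LOC={m['loc']}, cyclomatic={m['cyclomatic']}). "
            "Try a simpler control flow.")

failed = ("def f(xs):\n"
          "    for x in xs:\n"
          "        if x > 0:\n"
          "            return x\n")
print(feedback_prompt("Return the first positive number.", failed))
```

In the paper's setting this prompt would be sent back to the LLM for another generation attempt; the sketch stops at prompt construction.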
- North America > Canada > Ontario > Toronto (0.04)
- Oceania > New Zealand > North Island > Waikato (0.04)
C2RUST-BENCH: A Minimized, Representative Dataset for C-to-Rust Transpilation Evaluation
Sirlanci, Melih, Yagemann, Carter, Lin, Zhiqiang
Despite the effort in vulnerability detection over the last two decades, memory safety vulnerabilities continue to be a critical problem. Recent reports suggest that the key solution is to migrate to memory-safe languages, and C-to-Rust transpilation has accordingly become popular for resolving memory-safety issues in C programs. Recent works propose C-to-Rust transpilation frameworks; however, a comprehensive evaluation dataset is missing. Although one solution is to put together a large enough dataset, doing so increases the analysis time of automated frameworks and, in some cases, the manual effort involved. In this work, we build a method to select functions from a large set in order to construct a minimized yet representative dataset for evaluating C-to-Rust transpilation. We propose C2RUST-BENCH, which contains 2,905 functions that are representative of C-to-Rust transpilation, selected from 15,503 functions of real-world programs.
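The abstract does not detail the selection method, so the following is only a generic sketch of representative-subset selection: embed each function as a feature vector and greedily pick points that spread over the feature space (farthest-point sampling). The feature choices are hypothetical:

```python
# Illustrative sketch (not the paper's method): pick a small representative
# subset of functions via farthest-point sampling over simple features.
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def farthest_point_sample(feats, k):
    """Greedily pick k indices that spread out over the feature space."""
    chosen = [0]
    while len(chosen) < k:
        best = max((i for i in range(len(feats)) if i not in chosen),
                   key=lambda i: min(dist(feats[i], feats[j])
                                     for j in chosen))
        chosen.append(best)
    return chosen

# hypothetical feature vectors: (lines of code, pointer derefs, call count)
functions = [(10, 0, 1), (12, 1, 1), (200, 40, 25), (11, 0, 2), (95, 10, 9)]
print(farthest_point_sample(functions, 3))  # [0, 2, 4]: small, huge, medium
```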
- North America > United States > Ohio (0.04)
- Asia > Middle East > Oman (0.04)
- Information Technology > Software (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.54)
Uncovering Fairness through Data Complexity as an Early Indicator
Ferreira, Juliett Suárez, Slavkovik, Marija, Casillas, Jorge
Fairness constitutes a concern within machine learning (ML) applications. To date, no study has examined how disparities in classification complexity between privileged and unprivileged groups could influence the fairness of solutions, even though such disparities can serve as a preliminary indicator of potential unfairness. In this work, we investigate this gap. Specifically, we focus on synthetic datasets designed to capture a variety of biases, ranging from historical bias to measurement and representational bias, and evaluate how differences in various complexity metrics correlate with group fairness metrics. We then apply association rule mining to identify patterns that link disproportionate complexity differences between groups with fairness-related outcomes, offering data-centric indicators to guide bias mitigation. Our findings are validated by applying them to real-world problems, providing evidence that quantifying group-wise classification complexity can uncover early indicators of potential fairness challenges. This investigation helps practitioners to proactively address bias in classification tasks.
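The group-wise idea above can be made concrete with any per-group complexity proxy. A minimal sketch, assuming leave-one-out 1-NN error as the proxy (one common classification-complexity measure; the paper's metric set may differ), on made-up data:

```python
# Sketch: compute a classification-complexity proxy (leave-one-out 1-NN
# error) separately for each group; a large gap between groups is the kind
# of early fairness indicator discussed above. Data is illustrative.
def one_nn_error(points):
    """points: list of (feature_vector, label). Leave-one-out 1-NN error."""
    errors = 0
    for i, (x, y) in enumerate(points):
        nearest = min((j for j in range(len(points)) if j != i),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(x, points[j][0])))
        errors += points[nearest][1] != y
    return errors / len(points)

privileged   = [((0.0,), 0), ((0.1,), 0), ((1.0,), 1), ((1.1,), 1)]
unprivileged = [((0.0,), 0), ((0.1,), 1), ((0.2,), 0), ((0.3,), 1)]
gap = one_nn_error(unprivileged) - one_nn_error(privileged)
print(gap)  # positive: the unprivileged group is harder to classify
```

Here the privileged group is cleanly separable while the unprivileged group's classes interleave, so the complexity gap flags a dataset where equal model treatment may still yield unequal error rates.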
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Switzerland (0.04)
- Europe > Spain > Catalonia (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Education (1.00)
- Health & Medicine > Therapeutic Area (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)