error 0
- Asia > China > Guangdong Province > Shenzhen (0.04)
- North America > United States > New York > Erie County > Amherst (0.04)
- Asia > India > West Bengal > Kharagpur (0.04)
- Asia > China > Heilongjiang Province > Harbin (0.04)
- Information Technology > Security & Privacy (1.00)
- Law (0.67)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > Canada > Quebec > Montreal (0.04)
WAREX: Web Agent Reliability Evaluation on Existing Benchmarks
Kara, Su, Faisal, Fazle, Nath, Suman
Recent advances in browser-based LLM agents have shown promise for automating tasks ranging from simple form filling to hotel booking or online shopping. Current benchmarks measure agent performance in controlled environments, such as containers or stable networks, where websites behave deterministically. However, in the real world, users access websites over networks and HTTPS connections that introduce instability from multiple sources: client-side, server-side issues or broader system failures. Moreover, live websites are prone to web attacks such Cross-Site Scripting, as well as general site modifications which can cause unexpected or malicious pop-ups or improper functionality. Our experiments show that introducing WAREX leads to significant drops in task success rates, highlighting the limited robustness of state-of-the-art agents. W eb agents are leaving the lab and entering the wild, but benchmarks give a false sense of reliability. Web agents have emerged as a promising paradigm for automating complex online tasks, attracting significant attention across academia and industry. Recent advances have produced state-of-the-art web agents with diverse designs, ranging from variations in prompting and observation spaces to reinforcement learning-based action policies. Notable examples include SteP (Sodhi et al., 2024), WebNaviX (Shlomov et al., 2024), Agent Q (Putta et al., 2024), and GUI-Owl (Y e et al., 2025), among a myriad others. Large technology companies have also begun deploying production-grade agents, such as OpenAI (2025); Perplexity (2025) and TinyFish (2025).
- North America > United States > West Virginia (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- Information Technology > Communications (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports
Kyung, Sunggu, Park, Hyungbin, Seo, Jinyoung, Sung, Jimin, Kim, Jihyun, Kim, Dongyeong, Jo, Wooyoung, Nam, Yoojin, Park, Sangah, Kwon, Taehee, Lee, Sang Min, Kim, Namkug
Computed T omography (CT) plays a crucial role in clinical diagnosis, but the growing demand for CT examinations has raised concerns about diagnostic errors. While Multimodal Large Language Models (MLLMs) demonstrate promising comprehension of medical knowledge, their tendency to produce inaccurate information highlights the need for rigorous validation. However, existing medical visual question answering (VQA) benchmarks primarily focus on simple visual recognition tasks, lacking clinical relevance and failing to assess expert-level knowledge. W e introduce MedErr-CT, a novel benchmark for evaluating medical MLLMs' ability to identify and correct errors in CT reports through a VQA framework. The benchmark includes six error categories--four vision-centric errors (Omission, Insertion, Direction, Size) and two lexical error types (Unit, Typo)--and is organized into three task levels: classification, detection, and correction. Using this benchmark, we quantitatively assess the performance of state-of-the-art 3D medical MLLMs, revealing substantial variation in their capabilities across different error types. Our benchmark contributes to the development of more reliable and clinically applicable MLLMs, ultimately helping reduce diagnostic errors and improve accuracy in clinical practice.
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- (3 more...)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Health & Medicine > Nuclear Medicine (0.96)
- Health & Medicine > Health Care Technology (0.68)
Model Merging is Secretly Certifiable: Non-Vacuous Generalisation Bounds for Low-Shot Learning
Kim, Taehoon, Gouk, Henry, Kim, Minyoung, Hospedales, Timothy
Certifying the IID generalisation ability of deep networks is the first of many requirements for trusting AI in high-stakes applications from medicine to security. However, when instantiating generalisation bounds for deep networks it remains challenging to obtain non-vacuous guarantees, especially when applying contemporary large models on the small scale data prevalent in such high-stakes fields. In this paper, we draw a novel connection between a family of learning methods based on model fusion and generalisation certificates, and surprisingly show that with minor adjustment several existing learning strategies already provide non-trivial generalisation guarantees. Essentially, by focusing on data-driven learning of downstream tasks by fusion rather than fine-tuning, the certified generalisation gap becomes tiny and independent of the base network size, facilitating its certification. Our results show for the first time non-trivial generalisation guarantees for learning with as low as 100 examples, while using vision models such as VIT-B and language models such as mistral-7B. This observation is significant as it has immediate implications for facilitating the certification of existing systems as trustworthy, and opens up new directions for research at the intersection of practice and theory.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Randomized based restricted kernel machine for hyperspectral image classification
In recent years, the random vector functional link (RVFL) network has gained significant popularity in hyperspectral image (HSI) classification due to its simplicity, speed, and strong generalization performance. However, despite these advantages, RVFL models face several limitations, particularly in handling non-linear relationships and complex data structures. The random initialization of input-to-hidden weights can lead to instability, and the model struggles with determining the optimal number of hidden nodes, affecting its performance on more challenging datasets. To address these issues, we propose a novel randomized based restricted kernel machine ($R^2KM$) model that combines the strehyperngths of RVFL and restricted kernel machines (RKM). $R^2KM$ introduces a layered structure that represents kernel methods using both visible and hidden variables, analogous to the energy function in restricted Boltzmann machines (RBM). This structure enables $R^2KM$ to capture complex data interactions and non-linear relationships more effectively, improving both interpretability and model robustness. A key contribution of $R^2KM$ is the introduction of a novel conjugate feature duality based on the Fenchel-Young inequality, which expresses the problem in terms of conjugate dual variables and provides an upper bound on the objective function. This duality enhances the model's flexibility and scalability, offering a more efficient and flexible solution for complex data analysis tasks. Extensive experiments on hyperspectral image datasets and real-world data from the UCI and KEEL repositories show that $R^2KM$ outperforms baseline models, demonstrating its effectiveness in classification and regression tasks.
- North America > United States > California (0.04)
- North America > United States > Indiana (0.04)
- North America > United States > Florida > Brevard County (0.04)
- Asia > India > Madhya Pradesh (0.04)
TRKM: Twin Restricted Kernel Machines for Classification and Regression
Restricted kernel machines (RKMs) have considerably improved generalization in machine learning. Recent advancements explored various techniques within the RKM framework, integrating kernel functions with least squares support vector machines (LSSVM) to mirror the energy function of restricted Boltzmann machines (RBM), leading to enhanced performance. However, RKMs may face challenges in generalization when dealing with unevenly distributed or complexly clustered data. Additionally, as the dataset size increases, the computational burden of managing high-dimensional feature spaces can become substantial, potentially hindering performance in large-scale datasets. To address these challenges, we propose twin restricted kernel machine (TRKM). TRKM combines the benefits of twin models with the robustness of the RKM framework to enhance classification and regression tasks. By leveraging the Fenchel-Young inequality, we introduce a novel conjugate feature duality, allowing the formulation of classification and regression problems in terms of dual variables. This duality provides an upper bound to the objective function of the TRKM problem, resulting in a new methodology under the RKM framework. The model uses an energy function similar to that of RBM, incorporating both visible and hidden variables corresponding to both classes. Additionally, the kernel trick is employed to map data into a high-dimensional feature space, where the model identifies an optimal separating hyperplane using a regularized least squares approach. Experiments on UCI and KEEL datasets confirm TRKM's superiority over baselines, showcasing its robustness and efficiency in handling complex data. Furthermore, We implemented the TRKM model on the brain age dataset, demonstrating its efficacy in predicting brain age.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > India > Madhya Pradesh (0.04)
- Research Report > Experimental Study (0.46)
- Research Report > New Finding (0.46)
Multi-Class Deep Boosting
Our algorithms can use as a base classifier set a family of deep decision trees or other rich or complex families and yet benefit from strong generalization guarantees. We give new data-dependent learning bounds for convex ensembles in the multiclass classification setting expressed in terms of the Rademacher complexities of the sub-families composing the base classifier set, and the mixture weight assigned to each sub-family. These bounds are finer than existing ones both thanks to an improved dependency on the number of classes and, more crucially, by virtue of a more favorable complexity term expressed as an average of the Rademacher complexities based on the ensemble's mixture weights.
Finite-Sample and Distribution-Free Fair Classification: Optimal Trade-off Between Excess Risk and Fairness, and the Cost of Group-Blindness
Algorithmic fairness in machine learning has recently garnered significant attention. However, two pressing challenges remain: (1) The fairness guarantees of existing fair classification methods often rely on specific data distribution assumptions and large sample sizes, which can lead to fairness violations when the sample size is moderate-a common situation in practice. (2) Due to legal and societal considerations, using sensitive group attributes during decision-making (referred to as the group-blind setting) may not always be feasible. In this work, we quantify the impact of enforcing algorithmic fairness and group-blindness in binary classification under group fairness constraints. Specifically, we propose a unified framework for fair classification that provides distribution-free and finite-sample fairness guarantees with controlled excess risk. This framework is applicable to various group fairness notions in both group-aware and group-blind scenarios. Furthermore, we establish a minimax lower bound on the excess risk, showing the minimax optimality of our proposed algorithm up to logarithmic factors. Through extensive simulation studies and real data analysis, we further demonstrate the superior performance of our algorithm compared to existing methods, and provide empirical support for our theoretical findings.
- South America > Paraguay > Asunción > Asunción (0.04)
- North America > United States > California (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (4 more...)