A Proof of Theorems
We still need to demonstrate that the properties in the PAC-Bayes analysis hold for both the margin operator and the robust margin operator. Then we complete the proof of Lemma 6.1. The proofs of Lemmas 7.1 and 7.2 are similar. We provide the proof of Lemma 7.2 below; Lemma 7.1 follows the proof of Lemma 7.2 by replacing the robust margin operator with the margin operator. Since the above bound holds for any x in the domain X, the following holds almost surely. The second inequality is the tail bound above.
Spectral Bias Outside the Training Set for Deep Networks in the Kernel Regime
We provide quantitative bounds measuring the $L^2$ difference in function space between the trajectory of a finite-width network trained on finitely many samples and the idealized kernel dynamics of infinite width and infinite data. An implication of the bounds is that the network is biased to learn the top eigenfunctions of the Neural Tangent Kernel not just on the training set but over the entire input space. This bias depends only on the model architecture and the input distribution, and thus does not depend on the target function, which does not need to be in the RKHS of the kernel. The result is valid for deep architectures with fully connected, convolutional, and residual layers. Furthermore, the width does not need to grow polynomially with the number of samples in order to obtain high-probability bounds up to a stopping time. The proof exploits the low-effective-rank property of the Fisher Information Matrix at initialization, which implies a low effective dimension of the model (far smaller than the number of parameters). We conclude that local capacity control from the low effective rank of the Fisher Information Matrix is still underexplored theoretically.
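The spectral bias described above can be sketched with the standard kernel-regime decomposition (a textbook idealization, not the paper's finite-width bound): under gradient flow on squared loss with a fixed NTK, the residual decays independently along each kernel eigenfunction, so components with large eigenvalues are learned first.

```latex
% Idealized kernel dynamics: K has eigendecomposition
K = \sum_i \lambda_i \, \varphi_i \otimes \varphi_i ,
% and the residual at time t decays coordinate-wise:
f_t - f^* = \sum_i e^{-\lambda_i t} \,
            \langle f_0 - f^*,\, \varphi_i \rangle_{L^2}\, \varphi_i .
```

Since the decay rate along $\varphi_i$ is $\lambda_i$, the top eigenfunctions of the NTK dominate early training, which is the bias the bounds transfer from the training set to the whole input space.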
A Vector Symbolic Approach to Multiple Instance Learning
Dhrubo, Ehsan Ahmed, Alam, Mohammad Mahmudul, Raff, Edward, Oates, Tim, Holt, James
Multiple Instance Learning (MIL) tasks impose a strict logical constraint: a bag is labeled positive if and only if at least one instance within it is positive. While this iff constraint aligns with many real-world applications, recent work has shown that most deep learning-based MIL approaches violate it, leading to inflated performance metrics and poor generalization. We propose a novel MIL framework based on Vector Symbolic Architectures (VSAs), which provide a differentiable mechanism for performing symbolic operations in high-dimensional space. Our method encodes the MIL assumption directly into the model's structure by representing instances and concepts as nearly orthogonal high-dimensional vectors and using algebraic operations to enforce the iff constraint during classification. To bridge the gap between raw data and VSA representations, we design a learned encoder that transforms input instances into VSA-compatible vectors while preserving key distributional properties. Our approach, which includes a VSA-driven MaxNetwork classifier, achieves state-of-the-art results for a valid MIL model on standard MIL benchmarks and medical imaging datasets, outperforming existing methods while maintaining strict adherence to the MIL formulation. This work offers a principled, interpretable, and effective alternative to existing MIL approaches that rely on learned heuristics.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Oceania > New Zealand > North Island > Waikato (0.04)
- North America > United States > Maryland > Baltimore County (0.04)
- (4 more...)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (0.88)
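The algebraic enforcement of the iff constraint can be illustrated with a minimal numpy sketch (the names `positive_concept`, `encode`, and `bag_score` are hypothetical stand-ins, not the paper's API): in high dimension, random vectors are nearly orthogonal, so an instance's similarity to a concept vector is near zero unless the instance actually encodes that concept, and taking the max over instances realizes "the bag is positive iff at least one instance is positive".

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096  # high dimension: independent random vectors are nearly orthogonal

# hypothetical concept vector for "positive instance"
positive_concept = rng.standard_normal(D)

def encode(instance_seed):
    # stand-in for the learned encoder: a deterministic random vector
    return np.random.default_rng(instance_seed).standard_normal(D)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bag_score(instances):
    # max over instance similarities: high iff at least one instance
    # matches the positive concept, which mirrors the MIL constraint
    return max(cosine(v, positive_concept) for v in instances)

# model a "positive" instance as the concept plus noise
pos = positive_concept + 0.3 * rng.standard_normal(D)
neg1, neg2 = encode(1), encode(2)

print(bag_score([neg1, pos, neg2]))  # close to 1
print(bag_score([neg1, neg2]))       # close to 0
```

The max acts as a differentiable-in-practice surrogate for the existential quantifier, which is what the paper's MaxNetwork classifier makes explicit.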
Trustworthiness Preservation by Copies of Machine Learning Systems
Ceragioli, Leonardo, Primiero, Giuseppe
A common practice in ML systems development concerns the training of the same model under different data sets, and the use of the same (training and test) sets for different learning models. The first case is a desirable practice for identifying high-quality and unbiased training conditions. The latter coincides with the search for optimal models under a common dataset for training. These differently obtained systems have been considered akin to copies. In the quest for responsible AI, a legitimate but hardly investigated question is how to verify that trustworthiness is preserved by copies. In this paper we introduce a calculus to model and verify probabilistic complex queries over data and define four distinct notions: Justifiably, Equally, Weakly and Almost Trustworthy, which can be checked by analysing the (partial) behaviour of the copy with respect to its original. We provide a study of the relations between these notions of trustworthiness, and of how they compose with each other and under logical operations. The aim is to offer a computational tool to check the trustworthiness of possibly complex systems copied from an original whose behaviour is known.
- Europe > Italy > Lazio > Rome (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Estonia > Harju County > Tallinn (0.04)
- Asia > Japan (0.04)
- Health & Medicine (0.73)
- Banking & Finance (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
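The flavor of checking a copy against its original can be illustrated with a toy tolerance test (an illustrative stand-in only, not the paper's formal calculus or definitions): compare the probability each system assigns to a shared set of queries and accept the copy when every query agrees within a bound.

```python
# illustrative stand-in: a "system" is a map from query id to probability
def within_tolerance(original, copy, eps):
    """True iff the copy's probability for every query of the
    original lies within eps of the original's probability."""
    return all(abs(original[q] - copy[q]) <= eps for q in original)

orig   = {"q1": 0.90, "q2": 0.75, "q3": 0.10}
copy_a = {"q1": 0.88, "q2": 0.77, "q3": 0.12}   # close on all queries
copy_b = {"q1": 0.60, "q2": 0.75, "q3": 0.10}   # deviates on q1

print(within_tolerance(orig, copy_a, 0.05))  # True
print(within_tolerance(orig, copy_b, 0.05))  # False
```

The paper's four notions refine this kind of comparison by distinguishing, among other things, which part of the copy's behaviour is observed and how deviations compose under logical operations.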
CLIMB: Data Foundations for Large Scale Multimodal Clinical Foundation Models
Dai, Wei, Chen, Peilin, Lu, Malinda, Li, Daniel, Wei, Haowen, Cui, Hejie, Liang, Paul Pu
Recent advances in clinical AI have enabled remarkable progress across many clinical domains. However, existing benchmarks and models are primarily limited to a small set of modalities and tasks, which hinders the development of large-scale multimodal methods that can make holistic assessments of patient health and well-being. To bridge this gap, we introduce Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB), a comprehensive clinical benchmark unifying diverse clinical data across imaging, language, temporal, and graph modalities. CLIMB comprises 4.51 million patient samples totaling 19.01 terabytes distributed across 2D imaging, 3D video, time series, graphs, and multimodal data. Through extensive empirical evaluation, we demonstrate that multitask pretraining significantly improves performance on understudied domains, achieving up to 29% improvement in ultrasound and 23% in ECG analysis over single-task learning. Pretraining on CLIMB also effectively improves models' generalization capability to new tasks, and strong unimodal encoder performance translates well to multimodal performance when paired with task-appropriate fusion strategies. Our findings provide a foundation for new architecture designs and pretraining strategies to advance clinical AI research. Code is released at https://github.com/DDVD233/climb.
- South America > Brazil (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Vietnam > Hanoi > Hanoi (0.04)
- (23 more...)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Therapeutic Area > Dermatology (1.00)
- (7 more...)
Feature Fusion Attention Network with CycleGAN for Image Dehazing, De-Snowing and De-Raining
This paper presents a novel approach to image dehazing by combining Feature Fusion Attention (FFA) networks with a CycleGAN architecture. Our method leverages both supervised and unsupervised learning techniques to effectively remove haze from images while preserving crucial image details. The proposed hybrid architecture demonstrates significant improvements in image quality metrics, achieving superior PSNR and SSIM scores compared to traditional dehazing methods. Through extensive experimentation on the RESIDE and Dense-Haze CVPR 2019 datasets, we show that our approach effectively handles both synthetic and real-world hazy images. CycleGAN handles the unpaired nature of hazy and clean images, enabling the model to learn mappings even without paired data.
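The mechanism that lets CycleGAN learn from unpaired data is the cycle-consistency loss: mapping hazy to clean and back should reproduce the input. A minimal numpy sketch (the toy generators `G` and `F` are stand-ins for the learned networks):

```python
import numpy as np

def cycle_consistency_loss(x, G, F):
    # L1 distance || F(G(x)) - x ||_1: penalizes the hazy->clean->hazy
    # round trip for drifting away from the original image
    return float(np.mean(np.abs(F(G(x)) - x)))

# toy generators: G maps hazy->clean, F maps clean->hazy (stand-ins)
G = lambda img: img * 1.2 - 0.1
F = lambda img: (img + 0.1) / 1.2

x = np.random.default_rng(0).random((8, 8))
print(cycle_consistency_loss(x, G, F))  # ~0, since F inverts G exactly
```

In training, this loss is added to the adversarial losses of both generators, which is what removes the need for paired hazy/clean images.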
Data Programming: Creating Large Training Sets, Quickly
Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users provide a set of labeling functions, which are programs that heuristically label subsets of the data, but that are noisy and may conflict. By viewing these labeling functions as implicitly describing a generative model for this noise, we show that we can recover the parameters of this model to "denoise" the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings. We then show how to modify a discriminative loss function to make it noise-aware, and demonstrate our method over a range of discriminative models including logistic regression and LSTMs.
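The core idea can be sketched in a few lines (the labeling functions and the fixed accuracies below are hypothetical; the paper's generative model instead *estimates* accuracies from agreements and conflicts among the labeling functions):

```python
import numpy as np

# three hypothetical labeling functions: return +1, -1, or 0 (abstain)
def lf_keyword(x): return 1 if "refund" in x else 0
def lf_length(x):  return -1 if len(x) < 10 else 0
def lf_exclaim(x): return 1 if "!" in x else 0

LFS = [lf_keyword, lf_length, lf_exclaim]

def label_matrix(examples):
    # one row per example, one column per labeling function
    return np.array([[lf(x) for lf in LFS] for x in examples])

def probabilistic_labels(L, accuracies):
    # weighted vote with log-odds weights, squashed to a probability --
    # a simple stand-in for inference under the learned generative model
    w = np.log(accuracies / (1 - accuracies))
    return 1 / (1 + np.exp(-2 * (L @ w)))  # P(y = +1 | L)

docs = ["please issue a refund!", "ok", "great service!"]
L = label_matrix(docs)
probs = probabilistic_labels(L, np.array([0.9, 0.6, 0.7]))
```

These probabilistic labels then feed the noise-aware discriminative loss: instead of training on hard labels, the downstream model's loss is weighted by `probs`.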
Reviews: Data Programming: Creating Large Training Sets, Quickly
This is an interesting work. The author motivates the problem of how the limited availability of large labeled training sets can be a hindrance to many supervised ML systems and deep learning techniques, and data programming can be an interesting approach here. The user study, indicating that researchers find it easier to write labeling heuristics than to generate ground truth through crowdsourcing or otherwise, is a good indication of the utility of this technique. The writing is clear and easy to follow, and the experiments are thorough.