 cdi


Dimension-Uniform Discretization Analysis of Preconditioned Annealed Langevin Dynamics for Multimodal Gaussian Mixtures

arXiv.org Machine Learning

Obtaining stable diffusion-based samplers in high- and infinite-dimensional settings is challenging because errors can accumulate across high-frequency coordinates and make the dynamics unstable under refinement of the finite-dimensional approximation of the underlying function-space problem. Discretization is a typical source of such errors, and preconditioning with a suitable spectral decay is one way to control their accumulation. In this paper, we study this problem for preconditioned annealed Langevin dynamics (ALD) applied to Gaussian mixtures. We first show that Euler-Maruyama (EM) discretization, by treating the stiff linear part of the annealed score with a forward Euler step, imposes a stability constraint coupling the preconditioner with the annealed covariance scale. Together with the conditions ensuring dimension-uniform control of the annealed dynamics, this constraint forces the initial smoothed law to remain uniformly close to the target across dimensions. We then consider an exponential-integrator scheme that integrates the stiff linear part of the annealed score exactly. Under explicit spectral summability conditions coupling the smoothing covariance, the component covariance spectra, and the preconditioner, we prove a dimension-uniform Kullback-Leibler (KL) bound for this scheme. This bound can be made arbitrarily small, uniformly in dimension, by allowing enough time for annealing and then refining the time mesh accordingly. Importantly, these conditions allow regimes in which the KL divergence between the target and the initial smoothed law diverges with dimension, showing that the restrictions imposed by EM are scheme-dependent rather than intrinsic to ALD.
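The stability contrast described in this abstract can be illustrated on a scalar linear test equation. The sketch below is not the paper's scheme: it assumes a single coordinate whose stiff linear drift is dx/dt = -rate * x, where the hypothetical parameter `rate` stands in for the ratio of a preconditioner eigenvalue to an annealed covariance scale, and it ignores the noise term. It only shows why a forward Euler step imposes a step-size constraint that an exact (exponential) integration of the linear part does not.

```python
import math

def em_factor(rate, h):
    # One forward Euler step on dx/dt = -rate * x multiplies x by (1 - rate * h).
    return 1.0 - rate * h

def exp_factor(rate, h):
    # Integrating the same linear part exactly multiplies x by exp(-rate * h).
    return math.exp(-rate * h)

def iterate(factor, x0, n_steps):
    # Apply the per-step multiplier n_steps times.
    x = x0
    for _ in range(n_steps):
        x *= factor
    return x

h = 0.1                      # fixed time step (illustrative value)
rates = [1.0, 10.0, 30.0]    # illustrative stiffness values, one per coordinate

em_final = [iterate(em_factor(r, h), 1.0, 50) for r in rates]
exp_final = [iterate(exp_factor(r, h), 1.0, 50) for r in rates]

for r, e, x in zip(rates, em_final, exp_final):
    print(f"rate={r:5.1f}  forward Euler: {e: .3e}  exponential: {x: .3e}")
```

In this toy setting forward Euler is stable only when `rate * h < 2`, so the stiffest coordinate dictates the step size, while the exponential multiplier stays below one for every positive `rate`.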


Automatic coherence-driven inference on arguments

arXiv.org Artificial Intelligence

CDI also offers a plausible approach for automatically making sense of competing arguments in a way that accords with the features enumerated here. This paper is part of an argument that it is now feasible to computationally instantiate a reasonable approximation of a coherence theory of truth [64]: the recent benchmark [12] provides additional quantitative evidence in this direction. By "hard-coding" acceptance of conclusively established propositions, this theory can furthermore be anchored in a correspondence theory of truth [65]. In other words, coherence computations can be required to incorporate privileged information that also coheres with observed reality. While it is easy to imagine attempts to try the same thing with privileged information that does not cohere with observed reality, lies cannot persist when they can easily be unraveled. Even with flawless technology (which this will not be), obstacles will be manifold. For example, in a pluralistic society, legal coherence may actually require sacrificing fairness in some ways [66]. Ultimately, people must decide matters for themselves. It is only reasonable to hope that technology can serve as a reliable tool to help people make their decisions more coherent.


Coherence-driven inference for cybersecurity

arXiv.org Artificial Intelligence

Large language models (LLMs) can compile weighted graphs on natural language data to enable automatic coherence-driven inference (CDI) relevant to red and blue team operations in cybersecurity. This represents an early application of automatic CDI that holds near- to medium-term promise for decision-making in cybersecurity and eventually also for autonomous blue team operations.


The Einstein Test: Towards a Practical Test of a Machine's Ability to Exhibit Superintelligence

arXiv.org Artificial Intelligence

Creative and disruptive insights (CDIs), such as the development of the theory of relativity, have punctuated human history, marking pivotal shifts in our intellectual trajectory. Recent advancements in artificial intelligence (AI) have sparked debates over whether state-of-the-art models possess the capacity to generate CDIs. We argue that the ability to create CDIs should be regarded as a significant feature of machine superintelligence (SI). To this end, we propose a practical test to evaluate whether an approach to AI targeting SI can yield novel insights of this kind. We propose the Einstein test: given the data available prior to the emergence of a known CDI, can an AI independently reproduce that insight (or one that is formally equivalent)? By achieving such a milestone, a machine can be considered to at least match humanity's past top intellectual achievements, and therefore to have the potential to surpass them.


CDI: Copyrighted Data Identification in Diffusion Models

arXiv.org Artificial Intelligence

Diffusion Models (DMs) benefit from large and diverse datasets for their training. Since this data is often scraped from the Internet without permission from the data owners, this raises concerns about copyright and intellectual property protections. While (illicit) use of data is easily detected for training samples perfectly re-created by a DM at inference time, it is much harder for data owners to verify if their data was used for training when the outputs from the suspect DM are not close replicas. Conceptually, membership inference attacks (MIAs), which detect if a given data point was used during training, present themselves as a suitable tool to address this challenge. However, we demonstrate that existing MIAs are not strong enough to reliably determine the membership of individual images in large, state-of-the-art DMs. To overcome this limitation, we propose CDI, a framework for data owners to identify whether their dataset was used to train a given DM. CDI relies on dataset inference techniques, i.e., instead of using the membership signal from a single data point, CDI leverages the fact that most data owners, such as providers of stock photography, visual media companies, or even individual artists, own datasets with multiple publicly exposed data points which might all be included in the training of a given DM. By selectively aggregating signals from existing MIAs and using new handcrafted methods to extract features for these datasets, feeding them to a scoring model, and applying rigorous statistical testing, CDI allows data owners with as little as 70 data points to identify with a confidence of more than 99% whether their data was used to train a given DM. Thereby, CDI represents a valuable tool for data owners to claim illegitimate use of their copyrighted data.
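The idea of aggregating per-sample membership signals into a dataset-level statistical test can be sketched as follows. This is a minimal illustration and not the CDI implementation: the abstract does not say which test is used, so a one-sided Welch t-statistic with a normal approximation for the p-value is assumed here, and the score lists are synthetic placeholders standing in for the output of a hypothetical scoring model.

```python
import math
import statistics

def welch_one_sided(suspect_scores, reference_scores):
    # Welch t-statistic for H1: mean(suspect) > mean(reference).
    m1 = statistics.mean(suspect_scores)
    m2 = statistics.mean(reference_scores)
    v1 = statistics.variance(suspect_scores)
    v2 = statistics.variance(reference_scores)
    se = math.sqrt(v1 / len(suspect_scores) + v2 / len(reference_scores))
    t = (m1 - m2) / se
    # Normal approximation to the t distribution (an assumption; it is
    # reasonable only for moderately large samples).
    p_value = 1.0 - statistics.NormalDist().cdf(t)
    return t, p_value

# Synthetic placeholder scores, NOT real measurements: 70 samples per group.
suspect = [0.60 + 0.01 * (i % 5) for i in range(70)]
reference = [0.50 + 0.01 * (i % 5) for i in range(70)]

t_stat, p_val = welch_one_sided(suspect, reference)
print(f"t = {t_stat:.2f}, approximate one-sided p = {p_val:.3g}")
```

A small p-value would suggest that the suspect samples score systematically higher than samples known to be outside the training set. Testing a whole dataset at once is what lets weak per-sample signals add up.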


Limits to classification performance by relating Kullback-Leibler divergence to Cohen's Kappa

arXiv.org Machine Learning

The performance of machine learning classification algorithms is evaluated by estimating metrics, often from the confusion matrix, using training data and cross-validation. However, these do not prove that the best possible performance has been achieved. Fundamental limits to error rates can be estimated using information distance measures. To this end, the confusion matrix has been formulated to comply with the Chernoff-Stein Lemma. This links the error rates to the Kullback-Leibler divergences between the probability density functions describing the two classes. This leads to a key result that relates Cohen's Kappa to the Resistor Average Distance, which is the parallel resistor combination of the two Kullback-Leibler divergences. The Resistor Average Distance has units of bits and is estimated from the same training data used by the classification algorithm, using kNN estimates of the Kullback-Leibler divergences. The classification algorithm gives the confusion matrix and Kappa. Theory and methods are discussed in detail and then applied to Monte Carlo data and real datasets. Four very different real datasets - Breast Cancer, Coronary Heart Disease, Bankruptcy, and Particle Identification - are analysed, with both continuous and discrete values, and their classification performance compared to the expected theoretical limit. In all cases this analysis shows that the algorithms could not have performed any better due to the underlying probability density functions for the two classes. Important lessons are learnt on how to predict the performance of algorithms for imbalanced data using training datasets that are approximately balanced. Machine learning is very powerful but classification performance ultimately depends on the quality of the data and the relevance of the variables to the problem.
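The two quantities named in this abstract can be computed directly in a simple setting. The sketch below assumes discrete class distributions with strictly positive probabilities, whereas the paper estimates the divergences from data with kNN methods. It shows the Resistor Average Distance as the parallel combination of the two Kullback-Leibler divergences (in bits) and the standard Cohen's Kappa from a confusion matrix. The example distributions and confusion matrix are illustrative placeholders, and the paper's formula linking the two quantities is not reproduced here.

```python
import math

def kl_bits(p, q):
    # Kullback-Leibler divergence D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def resistor_average_distance(p, q):
    # Parallel-resistor combination: 1/R = 1/D(p||q) + 1/D(q||p).
    d_pq = kl_bits(p, q)
    d_qp = kl_bits(q, p)
    if d_pq == 0.0 or d_qp == 0.0:
        return 0.0
    return (d_pq * d_qp) / (d_pq + d_qp)

def cohens_kappa(confusion):
    # Kappa = (observed agreement - chance agreement) / (1 - chance agreement).
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n)) / total
    expected = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(n)
    )
    return (observed - expected) / (1.0 - expected)

# Illustrative placeholder inputs, not data from the paper.
p = [0.5, 0.5]
q = [0.25, 0.75]
rad = resistor_average_distance(p, q)
kappa = cohens_kappa([[40, 10], [10, 40]])
print(f"RAD = {rad:.4f} bits, kappa = {kappa:.2f}")
```

As with resistors in parallel, the combined value is always smaller than either divergence, and it is symmetric in the two classes even though each Kullback-Leibler divergence is not.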


Viewpoint: Regulatory Interest in Big Data, AI More Than a Carrier Problem - Carrier Management

#artificialintelligence

The California Insurance Commissioner and the California Department of Insurance (CDI) recently issued a bulletin regarding industry bias and discrimination. The bulletin acknowledged allegations of bias and discrimination in the industry and gave notice to insurance players that the CDI is watching and that "bias and discrimination in any form will be investigated and will not be tolerated." The bulletin is addressed to "All Admitted and Non-Admitted Insurance Companies, Licensees, and Other Interested Parties" -- clearly intending to cause awareness and attention beyond the carrier ecosystem. So, what does this mean? California has been a leader in following Europe regarding consumer protection laws.


Pivoting CDI: The World of Healthcare Watches

#artificialintelligence

Is CDI about to embark on a long journey to reinvent itself? There is no arguing that artificial intelligence (AI) and natural language processing (NLP) are making inroads in the healthcare revenue cycle, creating better efficiencies with the automation of a multitude of historically manually performed tasks, thereby reducing positions that were once performed by staff. AI is clearly beginning to take hold and make significant inroads in the clinical documentation integrity (CDI) space. I have noticed several posts on LinkedIn, as well as in Becker's Healthcare e-newsletters, discussing the role of AI in the revenue cycle. Just recently, there was a blog post published in KevinMD titled "How an AI bot transformed my EHR experience (KevinMD blog)" centering on how AI streamlined the provider's documentation and charting in the electronic health record (EHR) by scanning through the documentation as the note is being completed, providing suggested diagnoses with associated ICD-10 codes.


Job: CDD (6 months), Linguist, Yseop, 6 academic posts, Job: CDI, Young doctor in data science / ML / DL / NLP, Post-doc (CEA List and LISN), CIFRE thesis proposal

#artificialintelligence

Scientific context: The ambition of the CATCH project is to propose artificial intelligence and deep learning tools to take into account and automatically exploit the multitude of human testimonies related to an industrial accident and its consequences on the environment and health. By involving the population in the collection and analysis of data, particularly through social networks, and by providing effective means for interpreting this data, the proposed solution should contribute to providing answers to the worrying problem of industrial accidents and their consequences.


Industry Voices--Not all automation is created equally for clinical documentation improvement

#artificialintelligence

Healthcare system survival pivots on many metrics, but the ability to generate revenue and to evidence high quality of care are two of the most essential. At the center of both metrics is the clinical documentation process, where an accurate representation of every patient's clinical experience while in a provider's care must be recorded. As simple as it may sound, achieving that accurate reflection of diagnoses, interventions and the clinical picture is anything but simple. Medicine is as much science as it is art, and complex definitions, levels of specificity and complex medical terminology mean that most hospitals struggle to document everything properly, leading to significant lost revenues and under-reporting on quality metrics. Health systems have answered this challenge by standing up clinical documentation integrity (CDI) programs, staffed with clinicians.