Plotting

A Unified Approach to Analysis and Design of Denoising Markov Models

arXiv.org Machine Learning

Probabilistic generative models based on measure transport, such as diffusion and flow-based models, are often formulated in the language of Markovian stochastic dynamics, where the choice of the underlying process impacts both algorithmic design choices and theoretical analysis. In this paper, we aim to establish a rigorous mathematical foundation for denoising Markov models, a broad class of generative models that postulate a forward process transitioning from the target distribution to a simple, easy-to-sample distribution, alongside a backward process particularly constructed to enable efficient sampling in the reverse direction. Leveraging deep connections with nonequilibrium statistical mechanics and generalized Doob's $h$-transform, we propose a minimal set of assumptions that ensure: (1) explicit construction of the backward generator, (2) a unified variational objective directly minimizing the measure transport discrepancy, and (3) adaptations of the classical score-matching approach across diverse dynamics. Our framework unifies existing formulations of continuous and discrete diffusion models, identifies the most general form of denoising Markov models under certain regularity assumptions on forward generators, and provides a systematic recipe for designing denoising Markov models driven by arbitrary L\'evy-type processes. We illustrate the versatility and practical effectiveness of our approach through novel denoising Markov models employing geometric Brownian motion and jump processes as forward dynamics, highlighting the framework's potential flexibility and capability in modeling complex distributions.


Adversarial Curriculum Graph-Free Knowledge Distillation for Graph Neural Networks

arXiv.org Artificial Intelligence

Data-free Knowledge Distillation (DFKD) is a method that constructs pseudo-samples using a generator without real data, and transfers knowledge from a teacher model to a student by enforcing the student to overcome dimensional differences and learn to mimic the teacher's outputs on these pseudo-samples. In recent years, various studies in the vision domain have made notable advancements in this area. However, the varying topological structures and non-grid nature of graph data render the methods from the vision domain ineffective. Building upon prior research into differentiable methods for graph neural networks, we propose a fast and high-quality data-free knowledge distillation approach in this paper. Without compromising distillation quality, the proposed graph-free KD method (ACGKD) significantly reduces the spatial complexity of pseudo-graphs by leveraging the Binary Concrete distribution to model the graph structure and introducing a spatial complexity tuning parameter. This approach enables efficient gradient computation for the graph structure, thereby accelerating the overall distillation process. Additionally, ACGKD eliminates the dimensional ambiguity between the student and teacher models by increasing the student's dimensions and reusing the teacher's classifier. Moreover, it equips graph knowledge distillation with a CL-based strategy to ensure the student learns graph structures progressively. Extensive experiments demonstrate that ACGKD achieves state-of-the-art performance in distilling knowledge from GNNs without training data.


On the Role of Priors in Bayesian Causal Learning

arXiv.org Machine Learning

--In this work, we investigate causal learning of independent causal mechanisms from a Bayesian perspective. Confirming previous claims from the literature, we show in a didactically accessible manner that unlabeled data (i.e., cause realizations) do not improve the estimation of the parameters defining the mechanism. Furthermore, we observe the importance of choosing an appropriate prior for the cause and mechanism parameters, respectively. Specifically, we show that a factorized prior results in a factorized posterior, which resonates with Janz-ing and Sch olkopf's definition of independent causal mechanisms via the Kolmogorov complexity of the involved distributions and with the concept of parameter independence of Heckerman et al. Impact Statement --Learning the effect from a given cause is an important problem in many engineering disciplines, specifically in the field of surrogate modeling, which aims to reduce the computational cost of numerical simulations. Causal learning, however, cannot make use of unlabeled data - i.e., cause realizations - if the mechanism that produces the effect is independent from the cause. In this work, we recover this well-known fact from a Bayesian perspective.


Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources

arXiv.org Artificial Intelligence

The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 220 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. The Open-Qwen2VL pre-training is conducted on academic level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms partially-open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks of MMBench, SEEDBench, MMstar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.


Detecting Localized Density Anomalies in Multivariate Data via Coin-Flip Statistics

arXiv.org Machine Learning

Detecting localized density differences in multivariate data is a crucial task in computational science. Such anomalies can indicate a critical system failure, lead to a groundbreaking scientific discovery, or reveal unexpected changes in data distribution. We introduce EagleEye, an anomaly detection method to compare two multivariate datasets with the aim of identifying local density anomalies, namely over- or under-densities affecting only localised regions of the feature space. Anomalies are detected by modelling, for each point, the ordered sequence of its neighbours' membership label as a coin-flipping process and monitoring deviations from the expected behaviour of such process. A unique advantage of our method is its ability to provide an accurate, entirely unsupervised estimate of the local signal purity. We demonstrate its effectiveness through experiments on both synthetic and real-world datasets. In synthetic data, EagleEye accurately detects anomalies in multiple dimensions even when they affect a tiny fraction of the data. When applied to a challenging resonant anomaly detection benchmark task in simulated Large Hadron Collider data, EagleEye successfully identifies particle decay events present in just 0.3% of the dataset. In global temperature data, EagleEye uncovers previously unidentified, geographically localised changes in temperature fields that occurred in the most recent years. Thanks to its key advantages of conceptual simplicity, computational efficiency, trivial parallelisation, and scalability, EagleEye is widely applicable across many fields.


Barrier Certificates for Unknown Systems with Latent States and Polynomial Dynamics using Bayesian Inference

arXiv.org Machine Learning

-- Certifying safety in dynamical systems is crucial, but barrier certificates -- widely used to verify that system trajectories remain within a safe region -- typically require explicit system models. When dynamics are unknown, data-driven methods can be used instead, yet obtaining a valid certificate requires rigorous uncertainty quantification. For this purpose, existing methods usually rely on full-state measurements, limiting their applicability. This paper proposes a novel approach for synthesizing barrier certificates for unknown systems with latent states and polynomial dynamics. A Bayesian framework is employed, where a prior in state-space representation is updated using input-output data via a targeted marginal Metropolis-Hastings sampler . The resulting samples are used to construct a candidate barrier certificate through a sum-of-squares program. It is shown that if the candidate satisfies the required conditions on a test set of additional samples, it is also valid for the true, unknown system with high probability. The approach and its probabilistic guarantees are illustrated through a numerical simulation. Ensuring the safety of dynamical systems is a critical concern in applications such as human-robot interaction, autonomous driving, and medical devices, where failures can lead to severe consequences. In such scenarios, safety constraints typically mandate that the system state remains within a predefined allowable region. Barrier certificates [1] provide a systematic framework for verifying safety by establishing mathematical conditions that guarantee that system trajectories remain within these regions.


Identifying Obfuscated Code through Graph-Based Semantic Analysis of Binary Code

arXiv.org Machine Learning

Protecting sensitive program content is a critical issue in various situations, ranging from legitimate use cases to unethical contexts. Obfuscation is one of the most used techniques to ensure such protection. Consequently, attackers must first detect and characterize obfuscation before launching any attack against it. This paper investigates the problem of function-level obfuscation detection using graph-based approaches, comparing algorithms, from elementary baselines to promising techniques like GNN (Graph Neural Networks), on different feature choices. We consider various obfuscation types and obfuscators, resulting in two complex datasets. Our findings demonstrate that GNNs need meaningful features that capture aspects of function semantics to outperform baselines. Our approach shows satisfactory results, especially in a challenging 11-class classification task and in a practical malware analysis example.


On Model Protection in Federated Learning against Eavesdropping Attacks

arXiv.org Machine Learning

-- In this study, we investigate the protection offered by Federated Learning algorithms against eavesdropping adversaries. In our model, the adversary is capable of intercepting model updates transmitted from clients to the server, enabling it to create its own estimate of the model. Unlike previous research, which predominantly focuses on safeguarding client data, our work shifts attention to protecting the client model itself. Through a theoretical analysis, we examine how various factors--such as the probability of client selection, the structure of local objective functions, global aggregation at the server, and the eavesdropper's capabilities--impact the overall level of protection. We further validate our findings through numerical experiments, assessing the protection by evaluating the model accuracy achieved by the adversary. Finally, we compare our results with methods based on differential privacy, underscoring their limitations in this specific context. Traditionally, deep learning techniques require centralized data collection and processing that may be infeasible in collaborative scenarios, such as healthcare, credit scoring, vehicle fleet learning, internet-of-things, e-commerce, and natural language processing, due to the high scalability of modern networks, growing sensitive data privacy concerns, and legal regulations such as GDPR [1]-[3]. In these domains, data is often distributed among multiple parties of interest, with no single trusted authority. Federated Learning (FL) has emerged as a distributed collaborative learning paradigm, which allows coordination among multiple clients to perform training without sharing raw data. Instead, they participate in the learning process by training models locally and sharing only the model parameters with a central server. This server aggregates the updates and redistributes the improved model to all participants [4], [5]. Based on the distribution/partition of data among the clients, FL can be classified into horizontal (HFL), vertical (VFL), and transfer (TFL) federated learning [1], [6].


On the Geometry of Receiver Operating Characteristic and Precision-Recall Curves

arXiv.org Machine Learning

We study the geometry of Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves in binary classification problems. The key finding is that many of the most commonly used binary classification metrics are merely functions of the composition function $G := F_p \circ F_n^{-1}$, where $F_p(\cdot)$ and $F_n(\cdot)$ are the class-conditional cumulative distribution functions of the classifier scores in the positive and negative classes, respectively. This geometric perspective facilitates the selection of operating points, understanding the effect of decision thresholds, and comparison between classifiers. It also helps explain how the shapes and geometry of ROC/PR curves reflect classifier behavior, providing objective tools for building classifiers optimized for specific applications with context-specific constraints. We further explore the conditions for classifier dominance, present analytical and numerical examples demonstrating the effects of class separability and variance on ROC and PR geometries, and derive a link between the positive-to-negative class leakage function $G(\cdot)$ and the Kullback--Leibler divergence. The framework highlights practical considerations, such as model calibration, cost-sensitive optimization, and operating point selection under real-world capacity constraints, enabling more informed approaches to classifier deployment and decision-making.


UK needs to relax AI laws or risk transatlantic ties, thinktank warns

The Guardian

To enforce a strict licensing model, the UK would also need to restrict access to models that have been trained on such content, which could include US-owned AI systems. With the Trump administration signalling it will not pursue strict AI regulations and China pursuing AI growth at "breakneck speed", the UK could weaken its economic and national security interests by lagging in the AI race, said TBI. "If the UK imposes laws that are too strict, it risks falling behind in the AI-driven economy and weakening its capacity to protect national security interests," said TBI. The report said arguing that commercial AI models cannot be trained on content from the open web was close to saying knowledge workers – a broad category of professionals ranging from lawyers to researchers – cannot profit from insights they get when reading the same content. Rather than fighting to uphold outdated regulations, said TBI, rights holders and policymakers should help build a future where creativity is valued alongside AI innovation. Fernando Garibay, a record producer who has worked with artists including Lady Gaga and U2, said history has been dotted with "end-of-time claims" related to technological breakthroughs, from the printing press to music streaming.