Plotting

Proper scoring rules for estimation and forecast evaluation

arXiv.org Machine Learning

In recent years, proper scoring rules have emerged as a power ful general approach for estimating probability distributions. In addition to significantly ex panding the range of modeling techniques that can be applied in practice, this has also substantially broadened the conceptual understanding of estimation methods. Originally, proper scoring rules we re conceived in meteorology as summary statistics for describing the performance of probabilisti c forecasts ( Murphy and Winkler, 1984), but they also play an important role in economics as tools for bel ief elicitation ( Schotter and Trevino, 2014). A probabilistic forecast is a probability distribution ove r the space of the possible outcomes of the future event that is stated by the forecaster. The simple st and most popular case of probabilistic forecasts arises when the outcome is binary, so the probabilistic forecast reduces to issuing a predictive probability of success. Brier ( 1950) was the first to consider the problem of devising a scoring rule which could not be "played" by a dishonest fore casting agent. He introduced the quadratic scoring rule and showed that it incentivizes a for ecasting agent to state his most accurate probability estimate when faced with uncertainty.


PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis

arXiv.org Artificial Intelligence

Early detection, accurate segmentation, classification and tracking of polyps during colonoscopy are critical for preventing colorectal cancer. Many existing deep-learning-based methods for analyzing colonoscopic videos either require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. In this paper, we introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification and unsupervised tracking in colonoscopic videos. Our approach leverages a novel conditional mask loss, enabling flexible training across datasets with either pixel-level segmentation masks or bounding box annotations, allowing us to bypass task-specific fine-tuning. Our unsupervised tracking module reliably associates polyp instances across frames using object queries, without relying on any heuristics. We leverage a robust vision foundation model backbone that is pre-trained unsupervisedly on natural images, thereby removing the need for domain-specific pre-training. Extensive experiments on multiple polyp benchmarks demonstrate that our method significantly outperforms existing state-of-the-art approaches in detection, segmentation, classification, and tracking.


A Unified Approach to Analysis and Design of Denoising Markov Models

arXiv.org Machine Learning

Probabilistic generative models based on measure transport, such as diffusion and flow-based models, are often formulated in the language of Markovian stochastic dynamics, where the choice of the underlying process impacts both algorithmic design choices and theoretical analysis. In this paper, we aim to establish a rigorous mathematical foundation for denoising Markov models, a broad class of generative models that postulate a forward process transitioning from the target distribution to a simple, easy-to-sample distribution, alongside a backward process particularly constructed to enable efficient sampling in the reverse direction. Leveraging deep connections with nonequilibrium statistical mechanics and generalized Doob's $h$-transform, we propose a minimal set of assumptions that ensure: (1) explicit construction of the backward generator, (2) a unified variational objective directly minimizing the measure transport discrepancy, and (3) adaptations of the classical score-matching approach across diverse dynamics. Our framework unifies existing formulations of continuous and discrete diffusion models, identifies the most general form of denoising Markov models under certain regularity assumptions on forward generators, and provides a systematic recipe for designing denoising Markov models driven by arbitrary L\'evy-type processes. We illustrate the versatility and practical effectiveness of our approach through novel denoising Markov models employing geometric Brownian motion and jump processes as forward dynamics, highlighting the framework's potential flexibility and capability in modeling complex distributions.


Adversarial Curriculum Graph-Free Knowledge Distillation for Graph Neural Networks

arXiv.org Artificial Intelligence

Data-free Knowledge Distillation (DFKD) is a method that constructs pseudo-samples using a generator without real data, and transfers knowledge from a teacher model to a student by enforcing the student to overcome dimensional differences and learn to mimic the teacher's outputs on these pseudo-samples. In recent years, various studies in the vision domain have made notable advancements in this area. However, the varying topological structures and non-grid nature of graph data render the methods from the vision domain ineffective. Building upon prior research into differentiable methods for graph neural networks, we propose a fast and high-quality data-free knowledge distillation approach in this paper. Without compromising distillation quality, the proposed graph-free KD method (ACGKD) significantly reduces the spatial complexity of pseudo-graphs by leveraging the Binary Concrete distribution to model the graph structure and introducing a spatial complexity tuning parameter. This approach enables efficient gradient computation for the graph structure, thereby accelerating the overall distillation process. Additionally, ACGKD eliminates the dimensional ambiguity between the student and teacher models by increasing the student's dimensions and reusing the teacher's classifier. Moreover, it equips graph knowledge distillation with a CL-based strategy to ensure the student learns graph structures progressively. Extensive experiments demonstrate that ACGKD achieves state-of-the-art performance in distilling knowledge from GNNs without training data.


On the Role of Priors in Bayesian Causal Learning

arXiv.org Machine Learning

--In this work, we investigate causal learning of independent causal mechanisms from a Bayesian perspective. Confirming previous claims from the literature, we show in a didactically accessible manner that unlabeled data (i.e., cause realizations) do not improve the estimation of the parameters defining the mechanism. Furthermore, we observe the importance of choosing an appropriate prior for the cause and mechanism parameters, respectively. Specifically, we show that a factorized prior results in a factorized posterior, which resonates with Janz-ing and Sch olkopf's definition of independent causal mechanisms via the Kolmogorov complexity of the involved distributions and with the concept of parameter independence of Heckerman et al. Impact Statement --Learning the effect from a given cause is an important problem in many engineering disciplines, specifically in the field of surrogate modeling, which aims to reduce the computational cost of numerical simulations. Causal learning, however, cannot make use of unlabeled data - i.e., cause realizations - if the mechanism that produces the effect is independent from the cause. In this work, we recover this well-known fact from a Bayesian perspective.


Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources

arXiv.org Artificial Intelligence

The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 220 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. The Open-Qwen2VL pre-training is conducted on academic level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms partially-open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks of MMBench, SEEDBench, MMstar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.


Detecting Localized Density Anomalies in Multivariate Data via Coin-Flip Statistics

arXiv.org Machine Learning

Detecting localized density differences in multivariate data is a crucial task in computational science. Such anomalies can indicate a critical system failure, lead to a groundbreaking scientific discovery, or reveal unexpected changes in data distribution. We introduce EagleEye, an anomaly detection method to compare two multivariate datasets with the aim of identifying local density anomalies, namely over- or under-densities affecting only localised regions of the feature space. Anomalies are detected by modelling, for each point, the ordered sequence of its neighbours' membership label as a coin-flipping process and monitoring deviations from the expected behaviour of such process. A unique advantage of our method is its ability to provide an accurate, entirely unsupervised estimate of the local signal purity. We demonstrate its effectiveness through experiments on both synthetic and real-world datasets. In synthetic data, EagleEye accurately detects anomalies in multiple dimensions even when they affect a tiny fraction of the data. When applied to a challenging resonant anomaly detection benchmark task in simulated Large Hadron Collider data, EagleEye successfully identifies particle decay events present in just 0.3% of the dataset. In global temperature data, EagleEye uncovers previously unidentified, geographically localised changes in temperature fields that occurred in the most recent years. Thanks to its key advantages of conceptual simplicity, computational efficiency, trivial parallelisation, and scalability, EagleEye is widely applicable across many fields.


Barrier Certificates for Unknown Systems with Latent States and Polynomial Dynamics using Bayesian Inference

arXiv.org Machine Learning

-- Certifying safety in dynamical systems is crucial, but barrier certificates -- widely used to verify that system trajectories remain within a safe region -- typically require explicit system models. When dynamics are unknown, data-driven methods can be used instead, yet obtaining a valid certificate requires rigorous uncertainty quantification. For this purpose, existing methods usually rely on full-state measurements, limiting their applicability. This paper proposes a novel approach for synthesizing barrier certificates for unknown systems with latent states and polynomial dynamics. A Bayesian framework is employed, where a prior in state-space representation is updated using input-output data via a targeted marginal Metropolis-Hastings sampler . The resulting samples are used to construct a candidate barrier certificate through a sum-of-squares program. It is shown that if the candidate satisfies the required conditions on a test set of additional samples, it is also valid for the true, unknown system with high probability. The approach and its probabilistic guarantees are illustrated through a numerical simulation. Ensuring the safety of dynamical systems is a critical concern in applications such as human-robot interaction, autonomous driving, and medical devices, where failures can lead to severe consequences. In such scenarios, safety constraints typically mandate that the system state remains within a predefined allowable region. Barrier certificates [1] provide a systematic framework for verifying safety by establishing mathematical conditions that guarantee that system trajectories remain within these regions.


Identifying Obfuscated Code through Graph-Based Semantic Analysis of Binary Code

arXiv.org Machine Learning

Protecting sensitive program content is a critical issue in various situations, ranging from legitimate use cases to unethical contexts. Obfuscation is one of the most used techniques to ensure such protection. Consequently, attackers must first detect and characterize obfuscation before launching any attack against it. This paper investigates the problem of function-level obfuscation detection using graph-based approaches, comparing algorithms, from elementary baselines to promising techniques like GNN (Graph Neural Networks), on different feature choices. We consider various obfuscation types and obfuscators, resulting in two complex datasets. Our findings demonstrate that GNNs need meaningful features that capture aspects of function semantics to outperform baselines. Our approach shows satisfactory results, especially in a challenging 11-class classification task and in a practical malware analysis example.


On Model Protection in Federated Learning against Eavesdropping Attacks

arXiv.org Machine Learning

-- In this study, we investigate the protection offered by Federated Learning algorithms against eavesdropping adversaries. In our model, the adversary is capable of intercepting model updates transmitted from clients to the server, enabling it to create its own estimate of the model. Unlike previous research, which predominantly focuses on safeguarding client data, our work shifts attention to protecting the client model itself. Through a theoretical analysis, we examine how various factors--such as the probability of client selection, the structure of local objective functions, global aggregation at the server, and the eavesdropper's capabilities--impact the overall level of protection. We further validate our findings through numerical experiments, assessing the protection by evaluating the model accuracy achieved by the adversary. Finally, we compare our results with methods based on differential privacy, underscoring their limitations in this specific context. Traditionally, deep learning techniques require centralized data collection and processing that may be infeasible in collaborative scenarios, such as healthcare, credit scoring, vehicle fleet learning, internet-of-things, e-commerce, and natural language processing, due to the high scalability of modern networks, growing sensitive data privacy concerns, and legal regulations such as GDPR [1]-[3]. In these domains, data is often distributed among multiple parties of interest, with no single trusted authority. Federated Learning (FL) has emerged as a distributed collaborative learning paradigm, which allows coordination among multiple clients to perform training without sharing raw data. Instead, they participate in the learning process by training models locally and sharing only the model parameters with a central server. This server aggregates the updates and redistributes the improved model to all participants [4], [5]. Based on the distribution/partition of data among the clients, FL can be classified into horizontal (HFL), vertical (VFL), and transfer (TFL) federated learning [1], [6].