pmi
Appendix for Task-Free Continual Learning Via Online Discrepancy Distance Learning
Theorem 1. Let P_i represent the distribution of all seen training samples (including all previous …). Since P_i would involve several underlying data distributions as the number of training steps (i) increases, the diversity in the memory plays an important role in ensuring a tight GB in Eq. (15). A good trade-off between the model's complexity and generalization performance, observed from Eq. (12), is allowing each component to learn the underlying data distribution of a unique target set. By satisfying the ideal selection process (Eq. (22) of the paper) and also considering that each component G_t finished training on M_k^t at T_k^t, we assume that the dynamic expansion model G can be seen as a single model h trained on all previously learnt memories. Maximal Interfered Retrieval (MIR) [1] is one of the most popular memory-based approaches, which uses a memory buffer with a sample selection criterion. Let G be a single model consisting of a classifier h ∈ H and a VAE model v, and let M be a memory buffer updated at the training step T_i. Figure 1: The learning process of the proposed ODDL-S, which consists of three phases.
Appendices: A Sketch of Theoretical Analyses
Theorem B.1 (Performance difference bound for model-based RL). Let ϵ_{M_i} denote the inconsistency between the learned dynamics P_{M_i} and the true dynamics. For L_1–L_3, with the performance gap approximation of M_1 and π_1, we apply Lemma C.2. Here, d^π_{M_i} denotes the distribution of state-action pairs induced by policy π under the dynamical model M_i. Theorem B.3 (Refined bound with constraints). Let µ and ν be two probability distributions on the configuration space X; according to Lemma C.1, we then have a bound on D_TV(µ, ν). Under these definitions, we can obtain the following intermediate outcome by applying the results from B.2 and B.1. Here, we take the time-varying linear quadratic regulator as an instance to illustrate the rationality of our assumption on α.
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity
Yoshida, Naoki, Hayakawa, Satoshi, Takida, Yuhta, Uesaka, Toshimitsu, Wakaki, Hiromi, Mitsufuji, Yuki
In this study, we propose an enhancement to the similarity computation mechanism in multi-modal contrastive pretraining frameworks such as CLIP. Prior theoretical research has demonstrated that the optimal similarity metrics between paired modalities should correspond to the pointwise mutual information (PMI) between the two modalities. However, the current implementations of CLIP and its variants fail to fully utilize the underlying linear structure of PMI. We therefore propose KME-CLIP, which leverages this structure through the inner product in a reproducing kernel Hilbert space. We theoretically prove that our method can approximate PMI with arbitrary accuracy and empirically demonstrate that our approach overall outperforms the standard CLIP formulation across several retrieval and classification tasks.
- North America > Canada > Ontario > Toronto (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
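The KME-CLIP abstract rests on the result that the optimal cross-modal similarity equals the pointwise mutual information (PMI) between the two modalities. As a minimal sketch of what that target quantity looks like, the toy code below computes a PMI matrix from a hypothetical discrete joint distribution over image/text concepts (the numbers and the 2×2 setup are invented for illustration; the paper's actual construction works with kernel mean embeddings in an RKHS, not discrete tables).

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two image concepts (rows)
# and two text concepts (columns). PMI(x, y) = log p(x, y) / (p(x) p(y)).
joint = np.array([[0.30, 0.05],
                  [0.05, 0.60]])
px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
pmi = np.log(joint / (px * py))         # pointwise mutual information matrix

print(np.round(pmi, 3))
```

Matched pairs (the diagonal) get positive PMI and mismatched pairs negative PMI, which is the structure a similarity head is meant to recover.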
Beyond independent component analysis: identifiability and algorithms
Ribot, Alvaro, Seigal, Anna, Zwiernik, Piotr
Independent Component Analysis (ICA) is a classical method for recovering latent variables with useful identifiability properties. For independent variables, cumulant tensors are diagonal; relaxing independence yields tensors whose zero structure generalizes diagonality. These models have been the subject of recent work in non-independent component analysis. We show that pairwise mean independence answers the question of how much one can relax independence: it is identifiable, any weaker notion is non-identifiable, and it contains the models previously studied as special cases. Our results apply to distributions with the required zero pattern at any cumulant tensor. We propose an algebraic recovery algorithm based on least-squares optimization over the orthogonal group. Simulations highlight robustness: enforcing full independence can harm estimation, while pairwise mean independence enables more stable recovery. These findings extend the classical ICA framework and provide a rigorous basis for blind source separation beyond independence.
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > New York (0.04)
- (3 more...)
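The ICA abstract's starting point is that independent sources have diagonal cumulant tensors, so cross-cumulants vanish. The sketch below checks this numerically at fourth order for two independent sources (the closed-form cross-cumulant formula for zero-mean variables is standard; the uniform sources and sample size are arbitrary toy choices, and this is not the paper's recovery algorithm).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(-1, 1, n)          # independent zero-mean sources
y = rng.uniform(-1, 1, n)

def cross_cumulant_xxyy(x, y):
    # Fourth-order cross-cumulant cum(x, x, y, y) for zero-mean x, y:
    # E[x^2 y^2] - E[x^2] E[y^2] - 2 E[x y]^2.
    return (np.mean(x * x * y * y)
            - np.mean(x * x) * np.mean(y * y)
            - 2 * np.mean(x * y) ** 2)

print(cross_cumulant_xxyy(x, y))   # near zero for independent sources
print(cross_cumulant_xxyy(x, x))   # on-diagonal entry, clearly nonzero
```

For a uniform source the diagonal entry estimates the excess kurtosis E[x⁴] − 3E[x²]² = −2/15, while the cross term concentrates around zero; relaxing full independence, as the paper does, amounts to requiring only part of this zero pattern.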
Directed Information $γ$-covering: An Information-Theoretic Framework for Context Engineering
We introduce \textbf{Directed Information $γ$-covering}, a simple but general framework for redundancy-aware context engineering. Directed information (DI), a causal analogue of mutual information, measures asymmetric predictiveness between chunks. If $\operatorname{DI}_{i \to j} \ge H(C_j) - γ$, then $C_i$ suffices to represent $C_j$ up to $γ$ bits. Building on this criterion, we formulate context selection as a $γ$-cover problem and propose a greedy algorithm with provable guarantees: it preserves query information within bounded slack, inherits $(1+\ln n)$ and $(1-1/e)$ approximations from submodular set cover, and enforces a diversity margin. Importantly, building the $γ$-cover is \emph{query-agnostic}: it incurs no online cost and can be computed once offline and amortized across all queries. Experiments on HotpotQA show that $γ$-covering consistently improves over BM25, a competitive baseline, and provides clear advantages in hard-decision regimes such as context compression and single-slot prompt selection. These results establish DI $γ$-covering as a principled, self-organizing backbone for modern LLM pipelines.
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > Illinois (0.04)
- North America > United States > Hawaii (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
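The γ-covering abstract defines a concrete selection rule: chunk C_i covers C_j when DI_{i→j} ≥ H(C_j) − γ, and a greedy algorithm builds a cover. A minimal sketch of that greedy loop follows, with an invented 4-chunk DI matrix and the simplifying toy assumption that the diagonal serves as the chunk entropies (the paper estimates DI and entropies from data).

```python
import numpy as np

# Toy directed-information matrix DI[i, j] and entropies H(C_j)
# (illustrative numbers only).
DI = np.array([[9.0, 7.5, 7.0, 1.0],
               [7.0, 9.5, 6.0, 1.5],
               [6.5, 6.0, 8.0, 1.0],
               [1.0, 1.0, 1.0, 5.0]])
H = DI.diagonal().copy()
gamma = 2.0

# C_i covers C_j iff DI_{i->j} >= H(C_j) - gamma
covers = DI >= (H[None, :] - gamma)

def greedy_gamma_cover(covers):
    n = covers.shape[0]
    uncovered, selected = set(range(n)), []
    while uncovered:
        # pick the chunk that covers the most still-uncovered chunks
        best = max(range(n),
                   key=lambda i: len(uncovered & {j for j in range(n) if covers[i, j]}))
        selected.append(best)
        uncovered -= {j for j in range(n) if covers[best, j]}
    return selected

print(greedy_gamma_cover(covers))
```

Here chunk 0 covers chunks 0–2 within the γ slack, and chunk 3 is informationally isolated, so the greedy cover keeps both; the (1 + ln n) guarantee quoted in the abstract is the usual greedy set-cover bound applied to this cover problem.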
Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning
Chanda, Prateek, Sureka, Saral, Chatterjee, Parth Pratim, Killamsetty, Krishnateja, Nayak, Nikhil Shivakumar, Ramakrishnan, Ganesh
The performance of finetuned large language models (LLMs) hinges critically on the composition of the training mixture. However, selecting an optimal blend of task datasets remains a largely manual, heuristic-driven process, with practitioners often relying on uniform or size-based sampling strategies. We introduce TASKPGM, a principled and scalable framework for mixture optimization that selects continuous task proportions by minimizing an energy function over a Markov Random Field (MRF). Task relationships are modeled using behavioral divergences such as Jensen-Shannon Divergence and Pointwise Mutual Information computed from the predictive distributions of single-task finetuned models. Our method yields a closed-form solution under simplex constraints and provably balances representativeness and diversity among tasks. We provide theoretical guarantees, including weak submodularity for budgeted variants, and demonstrate consistent empirical improvements on Llama 2 and Mistral across evaluation suites such as MMLU and BIGBench. Beyond performance, TASKPGM offers interpretable insights into task influence and mixture composition, making it a powerful tool for efficient and robust LLM finetuning.
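The TASKPGM abstract describes minimizing an energy over task proportions constrained to the probability simplex. As a hedged sketch of that shape of problem, the code below minimizes a generic quadratic energy E(w) = wᵀSw − bᵀw by projected gradient descent with a Euclidean simplex projection (S, b, and the solver are invented stand-ins; the paper's actual energy and its closed-form solution are defined there).

```python
import numpy as np

# Toy pairwise task-divergence matrix S and per-task utility vector b.
S = np.array([[0.0, 0.8, 0.2],
              [0.8, 0.0, 0.7],
              [0.2, 0.7, 0.0]])
b = np.array([0.5, 0.4, 0.6])

def project_simplex(v):
    # Euclidean projection onto {w : w >= 0, sum(w) = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

w = np.full(3, 1 / 3)                 # start from the uniform mixture
for _ in range(500):                  # projected gradient descent
    grad = 2 * S @ w - b
    w = project_simplex(w - 0.05 * grad)

print(np.round(w, 3))
```

The point of the sketch is the constraint handling: every iterate stays a valid mixture, and the final energy is no worse than the uniform mixture's.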
Impact of Pretraining Word Co-occurrence on Compositional Generalization in Multimodal Models
CLIP and large multimodal models (LMMs) have better accuracy on examples involving concepts that are highly represented in the training data. However, the role of concept combinations in the training data on compositional generalization is largely unclear -- for instance, how does accuracy vary when a common object appears in an uncommon pairing with another object? In this paper, we investigate how word co-occurrence statistics in the pretraining dataset (a proxy for co-occurrence of visual concepts) impact CLIP/LMM performance. To disentangle the effects of word co-occurrence frequencies from single-word frequencies, we measure co-occurrence with pointwise mutual information (PMI), which normalizes the joint probability of two words co-occurring by the probability they would co-occur if independent (the product of their marginal probabilities). Using synthetically generated images with a variety of concept pairs, we show a strong correlation between PMI in the CLIP pretraining data and zero-shot accuracy in CLIP models trained on LAION-400M (r=0.97 and 14% accuracy gap between images in the top and bottom 5% of PMI values), demonstrating that even accuracy on common concepts is affected by the combination of concepts in the image. Leveraging this finding, we reproduce this effect in natural images by editing them to contain pairs with varying PMI, resulting in a correlation of r=0.75. Finally, we demonstrate that this behavior in CLIP transfers to LMMs built on top of CLIP (r=0.70 for TextVQA, r=0.62 for VQAv2). Our findings highlight the need for algorithms and architectures that improve compositional generalization in multimodal models without scaling the training data combinatorially. Our code is available at https://github.com/helenqu/multimodal-pretraining-pmi.
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
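The word co-occurrence PMI this abstract measures can be sketched on a toy caption corpus (the captions and caption-level counting convention below are invented for illustration; the paper computes PMI over the actual CLIP pretraining text).

```python
import math
from collections import Counter
from itertools import combinations

# pmi(a, b) = log [ p(a, b) / (p(a) p(b)) ], with probabilities estimated
# as caption-level occurrence frequencies over a toy corpus.
captions = [
    "dog on grass", "dog with ball", "dog on beach",
    "cat on grass", "cat with ball", "surfboard on beach",
]
docs = [set(c.split()) for c in captions]
n = len(docs)
word = Counter(w for d in docs for w in d)
pair = Counter(frozenset(p) for d in docs for p in combinations(sorted(d), 2))

def pmi(a, b):
    p_ab = pair[frozenset((a, b))] / n
    return math.log(p_ab / ((word[a] / n) * (word[b] / n)))

print(round(pmi("dog", "ball"), 3), round(pmi("surfboard", "beach"), 3))
```

This illustrates the normalization the abstract relies on: "dog" is a frequent word, but dog/ball co-occur exactly at chance (PMI 0), whereas the rare surfboard/beach pairing co-occurs well above chance and gets a large positive PMI.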
UProp: Investigating the Uncertainty Propagation of LLMs in Multi-Step Agentic Decision-Making
Duan, Jinhao, Diffenderfer, James, Madireddy, Sandeep, Chen, Tianlong, Kailkhura, Bhavya, Xu, Kaidi
As Large Language Models (LLMs) are integrated into safety-critical applications involving sequential decision-making in the real world, it is essential to know when to trust LLM decisions. Existing LLM Uncertainty Quantification (UQ) methods are primarily designed for single-turn question-answering formats, leaving multi-step decision-making scenarios, e.g., LLM agentic systems, underexplored. In this paper, we introduce a principled, information-theoretic framework that decomposes LLM sequential decision uncertainty into two parts: (i) internal uncertainty intrinsic to the current decision, which is the focus of existing UQ methods, and (ii) extrinsic uncertainty, a Mutual-Information (MI) quantity describing how much uncertainty should be inherited from preceding decisions. We then propose UProp, an efficient and effective extrinsic uncertainty estimator that converts the direct estimation of MI to the estimation of Pointwise Mutual Information (PMI) over multiple Trajectory-Dependent Decision Processes (TDPs). UProp is evaluated on extensive multi-step decision-making benchmarks, e.g., AgentBench and HotpotQA, with state-of-the-art LLMs, e.g., GPT-4.1 and DeepSeek-V3. Experimental results demonstrate that UProp significantly outperforms existing single-turn UQ baselines equipped with thoughtful aggregation strategies. Moreover, we provide a comprehensive analysis of UProp, including sampling efficiency, potential applications, and intermediate uncertainty propagation, to demonstrate its effectiveness. Code will be available at https://github.com/jinhaoduan/UProp.
- North America > United States > Colorado (0.05)
- North America > United States > Texas > Yoakum County > Plains (0.04)
- Asia > Middle East > Jordan (0.04)
- Energy (0.68)
- Leisure & Entertainment (0.47)
- Government > Regional Government (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
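The decomposition in the UProp abstract (intrinsic uncertainty plus an MI term inherited from earlier decisions) can be illustrated on a two-step toy chain, using the identity H(D2) = H(D2 | D1) + I(D1; D2) and writing the MI as an expectation of PMI terms (the joint distribution below is made up, and this is only the identity behind the estimator, not the TDP sampling scheme UProp actually uses).

```python
import numpy as np

# Toy joint distribution p(d1, d2) over two binary decisions.
joint = np.array([[0.35, 0.05],
                  [0.10, 0.50]])
p1 = joint.sum(axis=1)               # p(d1)
p2 = joint.sum(axis=0)               # p(d2)

H2 = -np.sum(p2 * np.log(p2))                  # total uncertainty H(D2)
pmi = np.log(joint / np.outer(p1, p2))         # PMI(d1, d2)
extrinsic = np.sum(joint * pmi)                # I(D1; D2) = E[PMI]
intrinsic = H2 - extrinsic                     # H(D2 | D1)

print(round(intrinsic, 4), round(extrinsic, 4), round(H2, 4))
```

The two parts sum exactly to H(D2), and the extrinsic share is positive whenever the second decision depends on the first.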
Modeling human decomposition: a Bayesian approach
Smith, D. Hudson, Nisbet, Noah, Ehrett, Carl, Tica, Cristina I., Atwell, Madeline M., Weisensee, Katherine E.
Environmental and individualistic variables affect the rate of human decomposition in complex ways. These effects complicate the estimation of the postmortem interval (PMI) based on observed decomposition characteristics. In this work, we develop a generative probabilistic model for decomposing human remains based on PMI and a wide range of environmental and individualistic variables. This model explicitly represents the effect of each variable, including PMI, on the appearance of each decomposition characteristic, allowing for direct interpretation of model effects and enabling the use of the model for PMI inference and optimal experimental design. In addition, the probabilistic nature of the model allows for the integration of expert knowledge in the form of prior distributions. We fit this model to a diverse set of 2,529 cases from the GeoFOR dataset. We demonstrate that the model accurately predicts 24 decomposition characteristics with an ROC AUC score of 0.85. Using Bayesian inference techniques, we invert the decomposition model to predict PMI as a function of the observed decomposition characteristics and environmental and individualistic variables, producing an R-squared measure of 71%. Finally, we demonstrate how to use the fitted model to design future experiments that maximize the expected amount of new information about the mechanisms of decomposition using the Expected Information Gain formalism.
- North America > United States > New Mexico (0.04)
- North America > United States > Texas (0.04)
- Europe > Netherlands (0.04)
- Europe > Italy > Molise > Campobasso Province > Campobasso (0.04)
- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.68)
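The inversion step in the decomposition-modeling abstract (predicting PMI, here the postmortem interval, from observed characteristics) can be sketched with grid-based Bayesian inference. Everything numerical below is a hypothetical stand-in: the exponential prior, the logistic onset likelihood, and the two observations are invented, unlike the paper's fitted generative model over 24 characteristics.

```python
import numpy as np

pmi_grid = np.linspace(0, 100, 1001)          # candidate PMI values (days)
prior = np.exp(-pmi_grid / 30)                # hypothetical exponential prior
prior /= prior.sum()

def p_char(pmi, onset, scale=5.0):
    # P(characteristic present | PMI): a toy logistic onset curve.
    return 1 / (1 + np.exp(-(pmi - onset) / scale))

# Observations: the day-20 characteristic is present, the day-45 one is not.
observed = {20.0: 1, 45.0: 0}
likelihood = np.ones_like(pmi_grid)
for onset, present in observed.items():
    p = p_char(pmi_grid, onset)
    likelihood *= p if present else (1 - p)

posterior = prior * likelihood
posterior /= posterior.sum()
pmi_map = pmi_grid[np.argmax(posterior)]
print(round(pmi_map, 1))
```

The posterior concentrates between the onset of the observed characteristic and that of the absent one, with the prior pulling the estimate toward shorter intervals; this is the qualitative behavior the paper obtains by inverting its fitted model.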