

Appendix A Variational Paragraph Embedder A.1 Selection of substitution rate p

Neural Information Processing Systems

Figure 4: Impact of the proportion of injected noise for learning Paragraph Embeddings on XSum dataset. PPLint and the PPL of the generation obtained from training PLANNER on the corresponding z at different noise levels.

We observed when the value of p is within (0, 0.7), there [...]

Performing a grid search on each task using diffusion models is an expensive process. However, it has been observed that an increase in the value of p leads to a deviation between the two. This could be attributed to a higher conversion error that occurs when p is excessively large.

A.2 Selection of number of latent codes k

The parameter k determines the number of latent codes used to represent a paragraph and therefore controls the compression level. Latent codes with smaller values of k are easier to model using the diffusion model, but may struggle to accurately preserve all the information in the original text. Additionally, smaller values of k offer computational efficiency, as the sequence length for the diffusion model is k.

To determine the best set of latent codes, we conducted experiments using three different methods: 1) selecting the first k hidden vectors, 2) selecting the last k hidden vectors, and 3) selecting interleaving hidden vectors, one for every L/k hidden vectors. The results of the ablation study are presented in Table 5. Based on our findings, we observed no significant difference among the different choices, so we opted for option 1). Furthermore, we discovered that increasing the value of k does not lead to a dramatic improvement in performance. To balance efficiency and performance, in most of our study we use only k = 16.

Setup             BLEU_clean   BLEU_robust
First k (k=16)    79.59        43.17

A.3 Reconstruction, denoising and interpolation examples

In Table 6, we present examples that demonstrate the adeptness of the trained Variational Paragraph Embedder in providing clean and denoised reconstructions. Additionally, we showcase interpolation results (Tables 7 and 8) derived from two random sentences in the hotel review dataset. The interpolated paragraph is usually coherent and incorporates inputs from both sentences, characterizing the distributional smoothness of the latent space.

Reconstructed text: complaints: after two nights stay, i asked the maid to clean our room (empty the wastebasket & make the bed).

Denoising reconstruction (hotel review), noise level 0.3

Original text: * * * check out the bathroom picture * * * i was in nyc by myself to watch some friends participate in the us olympic marathon trials.

Corrupted text: * * [unused697] check exams the bathroom picture * * slams i was in nyc mead myself yankee 2016 some scotch ruin in the outfielder olympicnca trials.
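The three latent-code selection strategies listed in A.2 (first k, last k, interleaved) can be illustrated with a minimal sketch. The function name `select_latent_codes`, the `mode` argument, and the integer stride `L // k` are assumptions made for illustration; the source does not say how a non-integer L/k is handled or how the hidden vectors are produced.

```python
import numpy as np


def select_latent_codes(hidden, k, mode="first"):
    """Pick k of the L hidden vectors in `hidden` (shape (L, d)).

    Hypothetical helper: the modes mirror the three options in the text.
    """
    L = hidden.shape[0]
    if not 1 <= k <= L:
        raise ValueError("k must satisfy 1 <= k <= L")
    if mode == "first":
        return hidden[:k]
    if mode == "last":
        return hidden[L - k:]
    if mode == "interleave":
        # Assumption: integer stride L // k, truncated to exactly k rows.
        stride = L // k
        return hidden[::stride][:k]
    raise ValueError(f"unknown mode: {mode}")


# Toy encoder output: L = 64 hidden vectors of dimension d = 4.
hidden = np.arange(64 * 4, dtype=float).reshape(64, 4)
first = select_latent_codes(hidden, k=16, mode="first")
last = select_latent_codes(hidden, k=16, mode="last")
inter = select_latent_codes(hidden, k=16, mode="interleave")
```

All three calls return a (16, 4) array, so the diffusion model would see a sequence of length k = 16 regardless of which option is chosen.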


Appendix A Variational Paragraph Embedder A.1 Selection of substitution rate p

Neural Information Processing Systems

Figure 4: Impact of the proportion of injected noise for learning Paragraph Embeddings on XSum dataset. (Figure 4). The results of the ablation study are presented in Table 5. Embedder in providing clean and denoised reconstructions. In general, it has been observed that generations progress in a coarse-to-fine manner. The early time step, which is close to 1, tends to be less fluent and generic. This was the nicest stay we have ever had. Turtle Bay was a great resort. This was the nicest stay we have ever had.



Solving Interpretable Kernel Dimension Reduction

Neural Information Processing Systems

Kernel dimensionality reduction (KDR) algorithms find a low dimensional representation of the original data by optimizing kernel dependency measures that are capable of capturing nonlinear relationships.


Notes 1 A special event x0 is sometimes given at time 0 to mark the beginning of the sequence; the model then generates the rest of the sequence conditioned on x0

Neural Information Processing Systems

NHP is a thoughtfully designed framework that has been shown to be effective on temporal data, but our method can also be used for other models with parametric intensity functions. In this section, we prove the claim in section 2.2 that argmaxθ JLL(θ) = Θ. When we take the expectation under p, each summand gets weighted by the probability that x[0,t) and x[t,t+dt) would take on the values in that summand. Therefore, we have Gθ(t, x[0,t)) < 0 since the distributions in equation (9) are distinct for the given history x[0,t). This lemma says: if θ and θ are meaningfully different in that they predict different intensities at time t for some history, then they actually do so for a set of histories of non-zero measure, making this difference visible in the objective functions like JLL(θ) (see above) and JNC(θ) (see Appendix B). We use d to denote the maximal difference between the intensities over (t', t''), i.e., d [...] If x[0,t) doesn't have any event, then its probability p(x[0,t)) = exp( [...] Suppose that t1 has been shifted by R. Recall that we need order-(1dt)I many such histories.


Feature Learning for Interpretable, Performant Decision Trees Supplementary Material 1 Experiment Specification

Neural Information Processing Systems

Here we cover the full specification of the experiments. Some details were omitted from the main text. If there were separate training and test sets, they were combined before creating the random 10-fold split. All attributes are normalized to mean 0 and standard deviation 1. Additional details for each model type follow.
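The preprocessing described above (merge the provided splits, normalize each attribute, then draw a random 10-fold split) might look like the following minimal sketch. The helper names, the seed, and the toy data are hypothetical, and normalizing over the whole combined dataset rather than per training fold is an assumption, since the text does not say which was done.

```python
import numpy as np


def standardize(X):
    """Scale every column to mean 0 and standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (X - mean) / std


def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Randomly partition the row indices into n_folds disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)


# Toy stand-ins for a dataset that ships with separate train/test files.
rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(70, 3))
X_test = rng.normal(5.0, 2.0, size=(30, 3))

# Combine first, then normalize and split, as the text describes.
X_all = standardize(np.vstack([X_train, X_test]))
folds = ten_fold_indices(len(X_all))
```

Each element of `folds` can serve once as the held-out set while the remaining nine form the training set.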


Cache-to-Cache: Direct Semantic Communication Between Large Language Models

arXiv.org Artificial Intelligence

Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
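As a rough illustration of the project, fuse, and gate idea in the abstract, the toy NumPy sketch below treats one layer's cache as a (tokens, dim) array. It is not the authors' architecture: the linear projection `W_proj`, the concatenate-then-linear fusion `W_fuse`, the scalar sigmoid gate, and the assumption that both caches cover the same token positions are all hypothetical simplifications.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse_layer_cache(tgt, src, W_proj, W_fuse, gate_logit):
    """Blend one layer of a source cache into the target cache.

    tgt: (T, d_tgt) target-model cache entries for this layer
    src: (T, d_src) source-model cache entries, assumed token-aligned
    """
    projected = src @ W_proj  # map source features into the target width
    fused = np.concatenate([tgt, projected], axis=-1) @ W_fuse
    g = sigmoid(gate_logit)  # per-layer gate: near 0 keeps the target cache
    return (1.0 - g) * tgt + g * fused


rng = np.random.default_rng(0)
T, d_src, d_tgt = 5, 8, 6
tgt = rng.normal(size=(T, d_tgt))
src = rng.normal(size=(T, d_src))
W_proj = rng.normal(size=(d_src, d_tgt)) * 0.1
W_fuse = rng.normal(size=(2 * d_tgt, d_tgt)) * 0.1

open_gate = fuse_layer_cache(tgt, src, W_proj, W_fuse, gate_logit=2.0)
closed_gate = fuse_layer_cache(tgt, src, W_proj, W_fuse, gate_logit=-50.0)
```

The output has the same shape as the target cache, which is consistent with the abstract's point that the cache size does not grow; a strongly negative gate logit leaves the layer essentially unchanged.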



Vecchia-Inducing-Points Full-Scale Approximations for Gaussian Processes

arXiv.org Machine Learning

Gaussian processes are flexible, probabilistic, non-parametric models widely used in machine learning and statistics. However, their scalability to large data sets is limited by computational constraints. To overcome these challenges, we propose Vecchia-inducing-points full-scale (VIF) approximations combining the strengths of global inducing points and local Vecchia approximations. Vecchia approximations excel in settings with low-dimensional inputs and moderately smooth covariance functions, while inducing point methods are better suited to high-dimensional inputs and smoother covariance functions. Our VIF approach bridges these two regimes by using an efficient correlation-based neighbor-finding strategy for the Vecchia approximation of the residual process, implemented via a modified cover tree algorithm. We further extend our framework to non-Gaussian likelihoods by introducing iterative methods that reduce computational costs for training and prediction by several orders of magnitude compared to Cholesky-based computations when using a Laplace approximation. In particular, we propose and compare novel preconditioners and provide theoretical convergence results. Extensive numerical experiments on simulated and real-world data sets show that VIF approximations are computationally efficient as well as more accurate and numerically stable than state-of-the-art alternatives. All methods are implemented in the open source C++ library GPBoost with high-level Python and R interfaces.
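For readers unfamiliar with the Vecchia building block, the sketch below computes a plain Vecchia log-likelihood for a zero-mean GP by conditioning each observation on at most m previously ordered neighbors. It is a generic textbook-style illustration, not the VIF method: it uses Euclidean nearest neighbors rather than the correlation-based search on a residual process, has no inducing points, and does not use the GPBoost API. The kernel, its parameters, and all function names are assumptions.

```python
import numpy as np


def sq_exp_kernel(X, lengthscale=0.3, variance=1.0, nugget=1e-2):
    """Squared-exponential covariance plus a nugget on the diagonal."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2) + nugget * np.eye(len(X))


def exact_loglik(y, K):
    """Exact zero-mean Gaussian log-likelihood via a Cholesky factor."""
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, y)
    return -0.5 * alpha @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)


def vecchia_loglik(y, X, K, m):
    """Sum of conditionals p(y_i | y_N(i)), N(i) = up to m nearest earlier points."""
    total = 0.0
    for i in range(len(y)):
        if i == 0:
            mean, var = 0.0, K[0, 0]
        else:
            dist = np.linalg.norm(X[:i] - X[i], axis=1)
            nb = np.argsort(dist)[:m]  # neighbors among points 0..i-1
            w = np.linalg.solve(K[np.ix_(nb, nb)], K[i, nb])
            mean = w @ y[nb]
            var = K[i, i] - w @ K[i, nb]
        total += -0.5 * np.log(2 * np.pi * var) - 0.5 * (y[i] - mean) ** 2 / var
    return total


rng = np.random.default_rng(0)
n = 30
X = rng.uniform(size=(n, 2))
K = sq_exp_kernel(X)
y = np.linalg.cholesky(K) @ rng.normal(size=n)

ll_exact = exact_loglik(y, K)
ll_vecchia = vecchia_loglik(y, X, K, m=5)
ll_full = vecchia_loglik(y, X, K, m=n - 1)
```

When m is large enough that every point conditions on all earlier points, the product of conditionals is the exact joint density, so `ll_full` matches `ll_exact` up to rounding; smaller m trades accuracy for cost, since each linear solve involves only an m-by-m matrix.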


Unlearning Works Better Than You Think: Local Reinforcement-Based Selection of Auxiliary Objectives

arXiv.org Machine Learning

We introduce Local Reinforcement-Based Selection of Auxiliary Objectives (LRSAO), a novel approach that selects auxiliary objectives using reinforcement learning (RL) to support the optimization process of an evolutionary algorithm (EA) as in the EA+RL framework and furthermore incorporates the ability to unlearn previously used objectives. By modifying the reward mechanism to penalize moves that do not increase the fitness value and relying on the local auxiliary objectives, LRSAO dynamically adapts its selection strategy to optimize performance according to the landscape and unlearn previous objectives when necessary. We analyze and evaluate LRSAO on the black-box complexity version of the non-monotonic Jump function, with gap parameter $\ell$, where each auxiliary objective is beneficial at specific stages of optimization. The Jump function is hard to optimize for evolutionary-based algorithms and the best-known complexity for reinforcement-based selection on Jump was $O(n^2 \log(n) / \ell)$. Our approach improves over this result to achieve a complexity of $\Theta(n^2 / \ell^2 + n \log(n))$ resulting in a significant improvement, which demonstrates the efficiency and adaptability of LRSAO, highlighting its potential to outperform traditional methods in complex optimization scenarios.
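To make the benchmark concrete, here is a minimal sketch of the Jump function in its commonly used textbook form, where the fitness rises with the number of one-bits, drops across a gap of width $\ell$ just before the optimum, and peaks at the all-ones string. This standard definition is an assumption; the abstract does not spell out the exact variant or the auxiliary objectives used in the analysis.

```python
def jump(x, ell):
    """Standard Jump fitness of a bit string x with gap parameter ell."""
    n = len(x)
    ones = sum(x)
    if ones <= n - ell or ones == n:
        return ell + ones  # increasing slope, and the global optimum
    return n - ones  # the gap: fitness decreases toward the optimum


n, ell = 10, 3
optimum = jump([1] * n, ell)
edge_of_gap = jump([1] * 7 + [0] * 3, ell)
inside_gap = jump([1] * 8 + [0] * 2, ell)
```

The non-monotonicity is visible at the gap: a string with 7 ones scores higher than one with 8 or 9 ones, so an algorithm that only accepts improvements must flip $\ell$ bits at once to reach the optimum.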