
 Chen, Zaiyi


In-Context Former: Lightning-fast Compressing Context for Large Language Model

arXiv.org Artificial Intelligence

With the rising popularity of Transformer-based large language models (LLMs), reducing their high inference costs has become a significant research focus. One effective approach is to compress the long input contexts. Existing methods typically leverage the self-attention mechanism of the LLM itself for context compression. While these methods have achieved notable results, the compression process still incurs quadratic time complexity, which limits their applicability. To mitigate this limitation, we propose the In-Context Former (IC-Former). Unlike previous methods, IC-Former does not depend on the target LLM. Instead, it leverages a cross-attention mechanism and a small number of learnable digest tokens to directly condense information from the contextual word embeddings. This approach significantly reduces inference time, with time complexity growing only linearly in the compression range. Experimental results indicate that our method requires only 1/32 of the baseline's floating-point operations during compression and improves processing speed by 68 to 112 times, while achieving over 90% of the baseline performance on evaluation metrics. Overall, our model effectively reduces compression costs and makes real-time compression scenarios feasible.
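
A minimal PyTorch sketch of the core idea described above (not the authors' code): a small set of learnable digest tokens cross-attends over the context word embeddings, so the cost is linear in the context length for a fixed number of digests. The dimensions, digest count, and layer structure below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionCompressor(nn.Module):
    """Learnable digest tokens attend over context embeddings via cross-attention."""
    def __init__(self, d_model=768, n_heads=12, n_digest=64):
        super().__init__()
        self.digest = nn.Parameter(torch.randn(n_digest, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, context_emb):                       # (batch, seq_len, d_model)
        b = context_emb.size(0)
        q = self.digest.unsqueeze(0).expand(b, -1, -1)    # queries: digest tokens
        out, _ = self.attn(q, context_emb, context_emb)   # digests attend to the context
        return out + self.ffn(out)                        # (batch, n_digest, d_model)

# Example: condense a 2048-token context into 64 digest vectors.
compressor = CrossAttentionCompressor()
ctx = torch.randn(2, 2048, 768)
digests = compressor(ctx)
print(digests.shape)  # torch.Size([2, 64, 768])
```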


Boosting Gradient Ascent for Continuous DR-submodular Maximization

arXiv.org Artificial Intelligence

Projected Gradient Ascent (PGA) is one of the most commonly used optimization schemes in machine learning and operations research. Nevertheless, numerous studies and examples have shown that PGA methods may fail to achieve a tight approximation ratio for continuous DR-submodular maximization problems. To address this challenge, we present a boosting technique that efficiently improves the approximation guarantee of standard PGA to \emph{optimal} with only small modifications to the objective function. The fundamental idea of our boosting technique is to exploit non-oblivious search to derive a novel auxiliary function $F$ whose stationary points are excellent approximations to the global maximum of the original DR-submodular objective $f$. Specifically, when $f$ is monotone and $\gamma$-weakly DR-submodular, we propose an auxiliary function $F$ whose stationary points provide a better $(1-e^{-\gamma})$-approximation than the $(\gamma^2/(1+\gamma^2))$-approximation guaranteed by the stationary points of $f$ itself. Similarly, for the non-monotone case, we devise another auxiliary function $F$ whose stationary points achieve an optimal $\frac{1-\min_{\boldsymbol{x}\in\mathcal{C}}\|\boldsymbol{x}\|_{\infty}}{4}$-approximation guarantee, where $\mathcal{C}$ is a convex constraint set. In contrast, the stationary points of the original non-monotone DR-submodular function can be arbitrarily bad~\citep{chen2023continuous}. Furthermore, we demonstrate the scalability of our boosting technique on four problems. In all four, our boosted PGA variants improve upon standard PGA in several aspects, such as approximation ratio and efficiency. Finally, we corroborate our theoretical findings with numerical experiments, which demonstrate the effectiveness of our boosted PGA methods.
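
A minimal NumPy sketch of projected gradient ascent with a pluggable gradient oracle (not the paper's exact construction): passing the gradient of $f$ gives the standard PGA baseline that the paper improves upon, while passing the gradient of an auxiliary function $F$, built as in the paper, gives the boosted variant. The toy objective $f(x)=\sum_i \log(1+x_i)$ (monotone DR-submodular on the nonnegative orthant) and the box constraint are illustrative assumptions.

```python
import numpy as np

def project_box(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^n (the convex constraint set here)."""
    return np.clip(x, lo, hi)

def pga(grad_oracle, x0, step=0.1, iters=200, project=project_box):
    """Projected gradient ascent; grad_oracle may be grad f or a boosted surrogate grad F."""
    x = project(x0.copy())
    for _ in range(iters):
        x = project(x + step * grad_oracle(x))
    return x

# Toy monotone DR-submodular objective: f(x) = sum(log(1 + x)).
grad_f = lambda x: 1.0 / (1.0 + x)

x_out = pga(grad_f, x0=np.zeros(5))
print(x_out)  # for this monotone toy objective, PGA moves to the box's upper corner
```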


An Online Algorithm for Chance Constrained Resource Allocation

arXiv.org Artificial Intelligence

This paper studies the online stochastic resource allocation problem (RAP) with chance constraints. The online RAP is a 0-1 integer linear programming problem in which the resource consumption coefficients are revealed column by column along with the corresponding revenue coefficients. When a column is revealed, the corresponding decision variables must be determined instantaneously, without future information. Moreover, in online applications, the resource consumption coefficients are often obtained by prediction. To model their uncertainty, we take chance constraints into consideration. To the best of our knowledge, this is the first time chance constraints have been introduced into the online RAP. Assuming that the uncertain variables have known Gaussian distributions, the stochastic RAP can be transformed into a deterministic but nonlinear problem with integer second-order cone constraints. Next, we linearize this nonlinear problem and analyze the performance of the vanilla online primal-dual algorithm for solving the linearized stochastic RAP. Under mild technical assumptions, the optimality gap and constraint violation are both on the order of $\sqrt{n}$. Then, to further improve the performance of the algorithm, several modified online primal-dual algorithms with heuristic corrections are proposed. Finally, extensive numerical experiments on both synthetic and real data demonstrate the applicability and effectiveness of our methods.
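
A small illustration of the standard reformulation behind the Gaussian assumption above (the numbers are made up): if the consumption coefficients satisfy $a \sim \mathcal{N}(\mu, \Sigma)$, then the chance constraint $\Pr(a^\top x \le b) \ge 1-\epsilon$ is equivalent, for $\epsilon \le 1/2$, to the second-order cone constraint $\mu^\top x + \Phi^{-1}(1-\epsilon)\,\|\Sigma^{1/2} x\|_2 \le b$.

```python
import numpy as np
from scipy.stats import norm

def chance_constraint_satisfied(x, mu, Sigma, b, eps=0.05):
    """Check mu^T x + Phi^{-1}(1 - eps) * sqrt(x^T Sigma x) <= b."""
    L = np.linalg.cholesky(Sigma)               # Sigma = L L^T, so sqrt(x^T Sigma x) = ||L^T x||
    lhs = mu @ x + norm.ppf(1 - eps) * np.linalg.norm(L.T @ x)
    return lhs <= b

# Toy example: two uncertain consumption coefficients and a binary decision x.
mu = np.array([0.5, 0.3])
Sigma = np.array([[0.04, 0.01], [0.01, 0.09]])
x = np.array([1.0, 1.0])
print(chance_constraint_satisfied(x, mu, Sigma, b=1.5, eps=0.05))  # True
```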


Online Allocation Problem with Two-sided Resource Constraints

arXiv.org Artificial Intelligence

Online resource allocation is a prominent paradigm for sequential decision making over a finite horizon subject to resource constraints, and it has attracted wide attention from researchers and practitioners in theoretical computer science (Mehta et al., 2007; Devanur and Jain, 2012; Devanur et al., 2019), operations research (Agrawal et al., 2014; Li and Ye, 2021), and machine learning (Balseiro et al., 2020; Li et al., 2020). In these settings, requests arrive online and each request must be served via one of the available channels, which consumes a certain amount of resources and generates a corresponding service charge. The objective of the decision maker is to maximize the cumulative revenue subject to the resource capacity constraints. Such problems frequently appear in many applications, including online advertising (Mehta et al., 2007; Buchbinder et al., 2007), online combinatorial auctions (Chawla et al., 2010), online linear programming (Agrawal et al., 2014; Buchbinder and Naor, 2009), online routing (Buchbinder and Naor, 2006), and online multi-leg flight seat and hotel room allocation (Talluri et al., 2004). The aforementioned online resource allocation framework only considers capacity (upper bound) constraints on resources.


Symmetric Cross Entropy for Robust Learning with Noisy Labels

arXiv.org Machine Learning

Training accurate deep neural networks (DNNs) in the presence of noisy labels is an important and challenging task. Though a number of approaches have been proposed for learning with noisy labels, many open issues remain. In this paper, we show that DNN learning with Cross Entropy (CE) exhibits overfitting to noisy labels on some classes ("easy" classes) but, more surprisingly, also suffers from significant under-learning on other classes ("hard" classes). Intuitively, CE requires an extra term to facilitate learning of hard classes, and more importantly, this term should be noise tolerant so as to avoid overfitting to noisy labels. Inspired by the symmetric KL-divergence, we propose \textbf{Symmetric cross entropy Learning} (SL), which boosts CE symmetrically with a noise-robust counterpart, Reverse Cross Entropy (RCE). Our proposed SL approach simultaneously addresses both the under-learning and overfitting problems of CE in the presence of noisy labels. We provide a theoretical analysis of SL and also show empirically, on a range of benchmark and real-world datasets, that SL outperforms state-of-the-art methods. We also show that SL can be easily incorporated into existing methods to further enhance their performance.
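
A minimal PyTorch sketch of the SL idea: combine standard cross entropy with a Reverse Cross Entropy term in which the roles of predictions and labels are swapped, replacing $\log 0$ on the one-hot labels with a finite constant $A$. The weights alpha, beta and the clamp A below are illustrative choices, not the paper's tuned values.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=0.1, beta=1.0, A=-4.0):
    """SL loss: alpha * CE(p, q) + beta * RCE(p, q), with log(0) on labels clamped to A."""
    ce = F.cross_entropy(logits, targets)

    pred = F.softmax(logits, dim=1)
    onehot = F.one_hot(targets, num_classes=logits.size(1)).float()
    # log of the one-hot label distribution: 0 for the labeled class, A in place of log(0).
    log_labels = torch.where(onehot > 0, torch.zeros_like(onehot), torch.full_like(onehot, A))
    rce = -(pred * log_labels).sum(dim=1).mean()

    return alpha * ce + beta * rce

logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
print(symmetric_cross_entropy(logits, targets))
```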


Joint Semantic Domain Alignment and Target Classifier Learning for Unsupervised Domain Adaptation

arXiv.org Machine Learning

Unsupervised domain adaptation aims to transfer a classifier learned on the source domain to the target domain in an unsupervised manner. With the help of target pseudo-labels, aligning class-level distributions and learning a classifier in the target domain are two widely used objectives. Existing methods often optimize these two objectives separately, so each suffers from neglecting the other. However, optimizing the two together is not trivial. To alleviate these issues, we propose a novel method that jointly optimizes semantic domain alignment and target classifier learning in a holistic way. The joint optimization mechanism can not only eliminate their weaknesses but also complement their strengths. Our theoretical analysis also verifies the benefit of the joint optimization mechanism. Extensive experiments on benchmark datasets show that the proposed method yields the best performance in comparison with state-of-the-art unsupervised domain adaptation methods.
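
A hedged PyTorch sketch (not the paper's exact objective) of what joint optimization of the two objectives can look like: a class-level alignment term matching source and target feature centroids computed with target pseudo-labels, plus a classification loss on pseudo-labeled target samples, minimized together in a single backward pass. The feature dimension, class count, and weighting lam are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(feat_s, y_s, feat_t, logits_t, num_classes, lam=0.1):
    pseudo_t = logits_t.argmax(dim=1)                      # target pseudo-labels
    cls_t = F.cross_entropy(logits_t, pseudo_t)            # target classifier learning term

    align = feat_s.new_zeros(())                           # semantic (class-level) alignment term
    for c in range(num_classes):
        s_mask, t_mask = (y_s == c), (pseudo_t == c)
        if s_mask.any() and t_mask.any():
            align = align + (feat_s[s_mask].mean(0) - feat_t[t_mask].mean(0)).pow(2).sum()

    return cls_t + lam * align

feat_s, y_s = torch.randn(16, 64), torch.randint(0, 5, (16,))
feat_t, logits_t = torch.randn(16, 64), torch.randn(16, 5)
print(joint_loss(feat_s, y_s, feat_t, logits_t, num_classes=5))
```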


Efficient Rank Minimization via Solving Non-convex Penalties by Iterative Shrinkage-Thresholding Algorithm

arXiv.org Machine Learning

Rank minimization (RM) is a widely investigated task of finding solutions by exploiting the low-rank structure of parameter matrices. Recently, solving the RM problem by leveraging non-convex relaxations has received significant attention. Theoretical and experimental work has demonstrated that non-convex relaxations, e.g., Truncated Nuclear Norm Regularization (TNNR) and Reweighted Nuclear Norm Regularization (RNNR), can provide a better approximation of the original problem than convex relaxations. However, designing an efficient algorithm with a theoretical guarantee remains a challenging problem. In this paper, we propose a simple but efficient proximal-type method, namely the Iterative Shrinkage-Thresholding Algorithm (ISTA), with a concrete analysis for solving rank minimization problems with both non-convex weighted and reweighted nuclear norms as low-rank regularizers. Theoretically, the proposed method converges to a critical point under very mild assumptions at a rate on the order of $O(1/T)$. Moreover, experimental results on both synthetic and real-world data sets show that the proposed algorithm outperforms the state of the art in both efficiency and accuracy.
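
A hedged NumPy sketch of one ISTA-style iteration for a low-rank problem: a gradient step on the smooth data-fit term followed by weighted singular value thresholding, i.e., the proximal step associated with a weighted nuclear norm. The matrix-completion observation model, the weight pattern, and the step size are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def weighted_svt(X, weights):
    """Shrink each singular value by its own threshold (proximal step of a weighted nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s = np.maximum(s - weights, 0.0)
    return (U * s) @ Vt

def ista_step(X, M, mask, weights, step=1.0):
    grad = mask * (X - M)                       # gradient of 0.5 * ||mask * (X - M)||_F^2
    return weighted_svt(X - step * grad, step * weights)

# Toy matrix completion: recover a rank-2 matrix from ~60% observed entries.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 20))
mask = rng.random((20, 20)) < 0.6
# TNNR-style weights: leave the two leading singular values unpenalized.
weights = np.concatenate([np.zeros(2), np.full(18, 0.5)])

X = np.zeros_like(M)
for _ in range(100):
    X = ista_step(X, M, mask, weights)
print(np.linalg.norm(mask * (X - M)) / np.linalg.norm(mask * M))  # small relative residual
```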


Universal Stagewise Learning for Non-Convex Problems with Convergence on Averaged Solutions

arXiv.org Machine Learning

Although the stochastic gradient descent (SGD) method and its variants (e.g., stochastic momentum methods, AdaGrad) are the algorithms of choice for solving non-convex problems (especially deep learning), large gaps remain between theory and practice, with many questions unresolved. For example, there is still no convergence theory for SGD and its variants that use a stagewise step size and return an averaged solution, as is done in practice. In addition, theoretical insight into why the adaptive step size of AdaGrad can improve upon the non-adaptive step size of SGD is still missing for non-convex optimization. This paper aims to address these questions and fill the gap between theory and practice. We propose a universal stagewise optimization framework for a broad family of {\bf non-smooth non-convex} (namely, weakly convex) problems with the following key features: (i) at each stage, any suitable stochastic convex optimization algorithm (e.g., SGD or AdaGrad) that returns an averaged solution can be employed to minimize a regularized convex problem; (ii) the step size is decreased in a stagewise manner; (iii) an averaged solution is returned as the final solution, selected from all stagewise averaged solutions with sampling probabilities {\it increasing} with the stage number. Our theoretical results for stagewise AdaGrad exhibit its adaptive convergence and therefore shed light on its faster convergence, relative to stagewise SGD, for problems with sparse stochastic gradients. To the best of our knowledge, these new results are the first of their kind to address the unresolved issues of the existing theories mentioned above.
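
A hedged NumPy sketch (illustrative, not the paper's exact algorithm) of the stagewise framework with plain SGD as the inner solver: each stage runs averaged SGD on the regularized problem $f(x) + \frac{1}{2\gamma}\|x - x_{\mathrm{ref}}\|^2$, the step size shrinks across stages, and the final output is sampled from the stagewise averages with probabilities increasing in the stage index. The toy objective, step-size schedule, and regularization parameter are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_grad(x):
    """Noisy gradient of a toy objective f(x) = ||x||^2 (stand-in for a weakly convex loss)."""
    return 2.0 * x + 0.1 * rng.standard_normal(x.shape)

def stagewise_sgd(x0, stages=5, iters=200, eta0=0.5, gamma=1.0):
    x_ref, averages = x0.copy(), []
    for s in range(stages):
        eta = eta0 / (s + 1)                               # stagewise decreasing step size
        x, x_avg = x_ref.copy(), np.zeros_like(x_ref)
        for t in range(iters):
            g = stochastic_grad(x) + (x - x_ref) / gamma   # gradient of the regularized problem
            x = x - eta * g
            x_avg += (x - x_avg) / (t + 1)                 # running average of inner iterates
        x_ref = x_avg                                      # next stage is centered at this average
        averages.append(x_avg)
    probs = np.arange(1, stages + 1, dtype=float)
    probs /= probs.sum()                                   # later stages are sampled more often
    return averages[rng.choice(stages, p=probs)]

print(stagewise_sgd(np.ones(3)))
```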