Collaborating Authors

Li, Shengxi


InstructOCR: Instruction Boosting Scene Text Spotting

arXiv.org Artificial Intelligence

In the field of scene text spotting, previous OCR methods primarily relied on image encoders and pre-trained text information, but they often overlooked the advantages of incorporating human language instructions. To address this gap, we propose InstructOCR, an innovative instruction-based scene text spotting model that leverages human language instructions to enhance the understanding of text within images. Our framework employs both text and image encoders during training and inference, along with instructions meticulously designed based on text attributes. This approach enables the model to interpret text more accurately and flexibly. Extensive experiments demonstrate the effectiveness of our model, and we achieve state-of-the-art results on widely used benchmarks. Furthermore, the proposed framework can be seamlessly applied to scene text VQA tasks. By leveraging instruction strategies during pre-training, performance on downstream VQA tasks can be significantly improved, with a 2.6% increase on the TextVQA dataset and a 2.1% increase on the ST-VQA dataset. These experimental results provide insights into the benefits of incorporating human language instructions for OCR-related tasks.
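The dual-encoder idea above can be illustrated with a deliberately minimal toy, not the authors' model: an instruction embedding and an image feature are produced by separate encoders and fused into one joint representation. All names here (`encode_instruction`, `encode_image`, `fuse`) and the bag-of-words/mean-intensity stand-ins are hypothetical simplifications.

```python
# Toy sketch (assumed, not InstructOCR's code): fusing an instruction embedding
# with an image feature by concatenation, as in generic dual-encoder designs.

def encode_instruction(tokens):
    """Stand-in text encoder: bag-of-words counts over a tiny, made-up vocab."""
    vocab = ["read", "rotated", "blurred", "text"]
    return [float(tokens.count(w)) for w in vocab]

def encode_image(pixels):
    """Stand-in image encoder: global mean intensity as a 1-D feature."""
    return [sum(pixels) / len(pixels)]

def fuse(instruction_feat, image_feat):
    """Concatenate the two modalities into one joint feature vector."""
    return instruction_feat + image_feat

joint = fuse(encode_instruction(["read", "rotated", "text"]),
             encode_image([0.2, 0.4, 0.6]))
print(joint)  # a 5-dimensional joint feature
```

In the real model both encoders are deep networks and the fusion is learned, but the structural point is the same: the instruction enters the pipeline as a first-class input alongside the image.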


QVD: Post-training Quantization for Video Diffusion Models

arXiv.org Artificial Intelligence

Recently, video diffusion models (VDMs) have garnered significant attention due to their notable advancements in generating coherent and realistic video content. However, processing multiple frame features concurrently, coupled with the considerable model size, results in high latency and extensive memory consumption, hindering their broader application. Post-training quantization (PTQ) is an effective technique to reduce the memory footprint and improve computational efficiency. Unlike in image diffusion, we observe that the temporal features, which are integrated into all frame features, exhibit pronounced skewness. Furthermore, we find significant inter-channel disparities and asymmetries in the activations of video diffusion models, which result in low coverage of the quantization levels by individual channels and make quantization more challenging. To address these issues, we introduce the first PTQ strategy tailored for video diffusion models, dubbed QVD. Specifically, we propose the High Temporal Discriminability Quantization (HTDQ) method, designed for temporal features, which retains the high discriminability of quantized features, providing precise temporal guidance for all video frames. In addition, we present the Scattered Channel Range Integration (SCRI) method, which aims to improve the coverage of quantization levels across individual channels. Experimental validation across various models, datasets, and bit-width settings demonstrates the effectiveness of QVD on diverse metrics. In particular, we achieve near-lossless performance on W8A8, outperforming current methods by 205.12 in FVD.
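The inter-channel disparity problem that motivates SCRI can be made concrete with a small sketch (this is an illustration of the general issue, not the QVD method itself): under a single per-tensor scale, a channel with a small dynamic range uses only a handful of the available quantization levels, whereas a per-channel scale restores coverage.

```python
# Illustrative sketch (assumed, not QVD's code): why large inter-channel range
# disparities hurt per-tensor uniform quantization.

def quantize(xs, scale, bits=8):
    """Symmetric uniform quantization: round(x / scale), clamped to the int range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(x / scale))) for x in xs]

def dequantize(qs, scale):
    return [q * scale for q in qs]

def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Two channels with very different dynamic ranges.
ch_small = [0.001 * i for i in range(-50, 51)]   # range about [-0.05, 0.05]
ch_large = [0.1 * i for i in range(-50, 51)]     # range about [-5.0, 5.0]

# Per-tensor: one scale sized for the widest channel.
scale_t = 5.0 / 127
err_small_t = mse(ch_small, dequantize(quantize(ch_small, scale_t), scale_t))

# Per-channel: each channel gets a scale matched to its own range.
scale_s = 0.05 / 127
err_small_s = mse(ch_small, dequantize(quantize(ch_small, scale_s), scale_s))

print(err_small_t > 100 * err_small_s)  # True: the shared scale wastes levels
```

With the shared scale, every value in the narrow channel maps to one of only three integer levels, so its reconstruction error is orders of magnitude larger than with a channel-matched scale.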


Von Mises-Fisher Elliptical Distribution

arXiv.org Machine Learning

A large class of modern probabilistic learning systems assumes symmetric distributions; however, real-world data tend to obey skewed distributions and are thus not always adequately modelled by symmetric distributions. To address this issue, elliptical distributions are increasingly used to generalise symmetric distributions, and further extensions to skewed elliptical distributions have recently attracted much attention. However, existing approaches are either hard to estimate or have complicated and abstract representations. To this end, we propose to employ the von Mises-Fisher (vMF) distribution to obtain an explicit and simple probability representation of the skewed elliptical distribution. This is shown not only to allow us to deal with non-symmetric learning systems, but also to provide a physically meaningful way of generalising skewed distributions. For rigour, our extension is proved to share important and desirable properties with its symmetric counterpart. We also demonstrate that the proposed vMF distribution is both easy to generate and stable to estimate, both theoretically and through examples.
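The "easy to generate" claim can be illustrated for the classical case the paper builds on: on the sphere S^2, the vMF marginal of w = mu . x has a closed-form inverse CDF, so exact samples need only uniform random numbers. This is a standard textbook sampler, not the paper's own construction, and the mean direction is fixed at the north pole for simplicity.

```python
# Sketch (standard method, assumed here): exact vMF sampling on S^2 with mean
# direction (0, 0, 1). The marginal of w under vMF(kappa) has density
# proportional to exp(kappa * w) on [-1, 1], whose inverse CDF is closed-form.

import math, random

def sample_vmf_north(kappa):
    """One vMF sample on the unit sphere S^2 with mean direction (0, 0, 1)."""
    u = random.random()
    # Inverse CDF of w for dimension p = 3 (the Fisher distribution).
    w = 1.0 + math.log(u + (1.0 - u) * math.exp(-2.0 * kappa)) / kappa
    theta = 2.0 * math.pi * random.random()
    s = math.sqrt(max(0.0, 1.0 - w * w))
    return (s * math.cos(theta), s * math.sin(theta), w)

random.seed(0)
samples = [sample_vmf_north(kappa=50.0) for _ in range(2000)]
mean_w = sum(z for _, _, z in samples) / len(samples)
print(mean_w)  # close to 1 for large kappa: samples concentrate around mu
```

For an arbitrary mean direction one would rotate these samples; for general dimension p the radial part requires a rejection step (Wood's algorithm) rather than this closed form.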


Reciprocal Adversarial Learning via Characteristic Functions

arXiv.org Machine Learning

Generative adversarial nets (GANs) have become a preferred tool for tasks involving complicated distributions. To stabilise the training and reduce the mode collapse of GANs, one of their main variants employs the integral probability metric (IPM) as the loss function. This provides extensive IPM-GANs with theoretical support for essentially comparing moments in an embedded domain of the critic. We generalise this by comparing the distributions rather than their moments via a powerful tool, the characteristic function (CF), which uniquely and universally comprises all the information about a distribution. For rigour, we first establish the physical meaning of the phase and amplitude of the CF, and show that this provides a feasible way of balancing the accuracy and diversity of generation. We then develop an efficient sampling strategy to calculate the CFs. Within this framework, we further prove an equivalence between the embedded and data domains when a reciprocal exists, which allows us to naturally develop the GAN in an auto-encoder structure, comparing everything in the embedded space (a semantically meaningful manifold). This efficient structure uses only two modules, together with a simple training strategy, to achieve bi-directional generation of clear images, and is referred to as the reciprocal CF GAN (RCF-GAN). Experimental results demonstrate the superior performance of the proposed RCF-GAN in terms of both generation and reconstruction.
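The core idea of comparing distributions through CFs can be sketched in a few lines (a scalar toy, not the RCF-GAN loss): estimate the empirical CF phi(t) = E[exp(i t x)] of each sample set at a few random frequencies and measure the squared modulus of their difference. Because the CF uniquely determines a distribution, the distance vanishes only when the distributions match.

```python
# Sketch (assumed, not the RCF-GAN code): a toy characteristic-function
# distance between two scalar sample sets, evaluated at random frequencies.

import cmath, random

def empirical_cf(samples, t):
    """phi_hat(t) = (1/n) * sum_j exp(i * t * x_j) for scalar samples."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

def cf_distance(xs, ys, freqs):
    """Mean squared modulus of the CF difference over sampled frequencies."""
    return sum(abs(empirical_cf(xs, t) - empirical_cf(ys, t)) ** 2
               for t in freqs) / len(freqs)

random.seed(1)
freqs = [random.gauss(0.0, 1.0) for _ in range(16)]
a = [random.gauss(0.0, 1.0) for _ in range(1000)]   # N(0, 1)
b = [random.gauss(0.0, 1.0) for _ in range(1000)]   # same distribution
c = [random.gauss(3.0, 1.0) for _ in range(1000)]   # shifted distribution
print(cf_distance(a, b, freqs) < cf_distance(a, c, freqs))  # True
```

In the paper this comparison happens between multivariate embeddings produced by the critic, with the frequency-sampling distribution itself part of the design; the amplitude of the CF difference relates to diversity and the phase to accuracy of generation.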


Meta-learning for Multi-variable Non-convex Optimization Problems: Iterating Non-optimums Makes Optimum Possible

arXiv.org Machine Learning

In this paper, we address the problem of solving a non-convex optimization problem over an intersection of multiple variable sets. Such problems are typically solved using an alternating minimization (AM) strategy, which splits the overall problem into a set of sub-problems corresponding to each variable and then iteratively performs minimization over each sub-problem using a fixed updating rule. However, due to the intrinsic non-convexity of the overall problem, the optimization can be trapped in a bad local minimum even when each sub-problem is globally optimized at each iteration. To tackle this problem, we propose a meta-learning based Global Scope Optimization (GSO) method. It adaptively generates optimizers for the sub-problems via meta-learners and constantly updates these meta-learners with respect to the global loss information of the overall problem. The sub-problems are thus optimized with the specific objective of minimizing the global loss. We evaluate the proposed model on a number of simulations, including bi-linear inverse problems (matrix completion) and non-linear problems (Gaussian mixture models). The experimental results show that our approach outperforms AM-based methods in standard settings and achieves effective optimization in challenging cases where other methods typically fail.
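The AM baseline described above has a very simple structure, sketched here on rank-1 matrix completion (one of the paper's test problems, though this toy instance and its numbers are mine): with v fixed, the objective is convex in u and has a closed-form minimizer, and vice versa, so the two exact sub-problem solves alternate.

```python
# Sketch of the alternating-minimization (AM) baseline: rank-1 matrix
# completion, min over (u, v) of sum over observed (i, j) of (u_i*v_j - M_ij)^2.
# Each sub-problem is convex with a closed-form solution; the joint problem is not.

# Observed entries of a 3x3 rank-1 matrix M = u* v*^T with u* = v* = (1, 2, 3).
observed = {(0, 0): 1, (0, 2): 3, (1, 1): 4, (2, 0): 3, (2, 2): 9}

u, v = [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]
for _ in range(100):
    # Exactly minimize over each u_i with v fixed (rows decouple).
    for i in range(3):
        num = sum(M_ij * v[j] for (r, j), M_ij in observed.items() if r == i)
        den = sum(v[j] ** 2 for (r, j), _ in observed.items() if r == i)
        u[i] = num / den
    # Exactly minimize over each v_j with u fixed (columns decouple).
    for j in range(3):
        num = sum(M_ij * u[i] for (i, c), M_ij in observed.items() if c == j)
        den = sum(u[i] ** 2 for (i, c), _ in observed.items() if c == j)
        v[j] = num / den

loss = sum((u[i] * v[j] - M_ij) ** 2 for (i, j), M_ij in observed.items())
print(loss)  # AM succeeds on this benign instance
```

On benign instances like this one AM converges quickly; GSO targets the harder settings where these fixed per-sub-problem updates, each blind to the global landscape, drive the iterates into a bad local minimum.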


A general solver to the elliptical mixture model through an approximate Wasserstein manifold

arXiv.org Machine Learning

This paper studies the problem of estimation for general finite mixture models, with a particular focus on elliptical mixture models (EMMs). Instead of using the widely adopted Kullback-Leibler divergence, we provide a stable solution to the EMMs that is robust to initialisations and attains a superior local optimum by adaptively optimising along a manifold of an approximate Wasserstein distance. More specifically, we first summarise the computable and identifiable EMMs, in order to formulate the optimisation problem. Owing to the probability constraint, solving this problem directly is cumbersome and unstable, especially under the Wasserstein distance. We thus resort to an efficient optimisation on a statistical manifold defined under an approximate Wasserstein distance, which allows for explicit metrics and operations. This is shown to significantly stabilise and improve the EMM estimation. We also propose an adaptive method to further accelerate the convergence. Experimental results demonstrate the excellent performance of the proposed solver.
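To see why the probability constraint is the crux, consider the simplest alternative to a manifold method (this is a generic illustration, not the paper's solver): reparameterize the mixture weights as pi = softmax(alpha) so that unconstrained gradient steps on alpha always yield a valid point of the simplex. The toy objective below just matches target weights.

```python
# Sketch (assumed, not the paper's method): handling the simplex constraint on
# mixture weights by softmax reparameterization, so plain gradient descent on
# the unconstrained logits alpha never leaves the probability simplex.

import math

def softmax(alpha):
    m = max(alpha)
    e = [math.exp(a - m) for a in alpha]
    s = sum(e)
    return [x / s for x in e]

alpha = [0.0, 0.0, 0.0]
target = [0.7, 0.2, 0.1]             # toy weights to recover
for _ in range(2000):
    pi = softmax(alpha)
    # Gradient of 0.5 * sum_k (pi_k - target_k)^2 w.r.t. alpha, via the
    # softmax Jacobian d pi_k / d alpha_j = pi_k * (delta_kj - pi_j).
    g = [pi[k] - target[k] for k in range(3)]
    dot = sum(g[k] * pi[k] for k in range(3))
    grad_alpha = [pi[k] * (g[k] - dot) for k in range(3)]
    alpha = [alpha[k] - 1.0 * grad_alpha[k] for k in range(3)]

pi = softmax(alpha)
print(pi)  # close to the target weights, and always a valid simplex point
```

Reparameterization keeps iterates feasible but distorts the geometry of the search; the paper's approach instead equips the parameter space with an explicit (approximate Wasserstein) metric, so the optimisation respects the statistical geometry rather than an arbitrary coordinate choice.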


A universal framework for learning based on the elliptical mixture model (EMM)

arXiv.org Machine Learning

The increasing prominence of unbalanced and noisy data highlights the importance of elliptical mixture models (EMMs), which exhibit enhanced robustness, flexibility and stability over the widely applied Gaussian mixture model (GMM). However, existing studies of the EMM are typically of an ad hoc nature, without a universal analysis framework or existence and uniqueness considerations. To this end, we propose a general framework for estimating the EMM, which makes use of Riemannian manifold optimisation to convert the original constrained optimisation paradigm into an unconstrained one. We first revisit the statistics of elliptical distributions, to give a rationale for the use of Riemannian metrics and for the reformulation of the problem in the Riemannian space. We then derive the EMM learning framework, based on Riemannian gradient descent, which attains the same optimum as the original problem but accelerates the convergence. We also unify the treatment of the existing elliptical distributions to build a universal EMM, providing a simple and intuitive way to deal with the non-convex nature of this optimisation problem. Numerical results demonstrate the ability of the proposed framework to accommodate EMMs with different properties of the individual functions, and verify the robustness and flexibility of the proposed framework over the standard GMM.
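The constrained-to-unconstrained conversion can be illustrated in the smallest SPD manifold there is: the positive reals, i.e. a 1x1 covariance (this 1-D analogue is my simplification, not the paper's derivation). A Riemannian exponential-map step multiplies rather than adds, so the iterate can never leave the positive cone and no projection or constraint handling is needed.

```python
# Sketch (assumed, 1-D analogue of Riemannian gradient descent on SPD
# matrices): fit a Gaussian variance s by minimizing the average negative
# log-likelihood 0.5 * (log s + m2 / s). The multiplicative update
# s <- s * exp(-eta * s * g) follows the exponential map of the positive reals
# under the affine-invariant metric, so s stays positive by construction.

import math, random

random.seed(2)
data = [random.gauss(0.0, 2.0) for _ in range(5000)]   # true variance = 4
m2 = sum(x * x for x in data) / len(data)              # ML estimate to recover

s = 0.5                                   # poor initialization, still positive
for _ in range(200):
    g = 0.5 * (1.0 / s - m2 / s ** 2)     # Euclidean gradient of the loss
    s = s * math.exp(-1.0 * s * g)        # Riemannian step with eta = 1

print(s, m2)  # the iterate converges to the maximum-likelihood variance m2
```

In the full framework the same idea applies matrix-wise: the Riemannian gradient rescales the Euclidean one by the current covariance, and the retraction keeps every iterate a valid SPD matrix, which is precisely what removes the constraints from the original problem.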