Bui, Hung
PhoGPT: Generative Pre-training for Vietnamese
Nguyen, Dat Quoc, Nguyen, Linh The, Tran, Chi, Nguyen, Dung Ngoc, Phung, Dinh, Bui, Hung
We open-source a state-of-the-art 4B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat. The base model, PhoGPT-4B, with exactly 3.7B parameters, is pre-trained from scratch on a Vietnamese corpus of 102B tokens, with an 8192 context length and a vocabulary of 20480 token types. The chat variant, PhoGPT-4B-Chat, is obtained by fine-tuning PhoGPT-4B on a dataset of 70K instructional prompts and their responses, along with an additional 290K conversations. We demonstrate its strong performance compared to previous closed-source and open-source 7B-parameter models. Our PhoGPT models are available at: https://github.com/VinAIResearch/PhoGPT
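For illustration, a minimal sketch of loading the chat checkpoint with the Hugging Face transformers library follows; the model identifier and prompt template used here are assumptions, and the GitHub repository above is the authoritative reference.

# Minimal sketch: loading a PhoGPT chat checkpoint with Hugging Face transformers.
# The model id "vinai/PhoGPT-4B-Chat" and the prompt format are assumptions; see
# https://github.com/VinAIResearch/PhoGPT for the authoritative instructions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vinai/PhoGPT-4B-Chat"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

prompt = "### Câu hỏi: Viết một đoạn văn ngắn về Hà Nội.\n### Trả lời:"  # assumed template
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))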
Benchmarking with MIMIC-IV, an irregular, sparse clinical time series dataset
Bui, Hung, Warrier, Harikrishna, Gupta, Yogesh
Irregularly sampled time series data occur in multiple scientific and industrial domains, including finance, climate science, and healthcare. In healthcare, electronic health records (EHR) have been widely adopted with the hope that they would save time and improve the quality of patient care. The role of Artificial Intelligence (AI) in EHR is rapidly transforming the healthcare landscape, offering new opportunities to improve patient care, enhance decision-making, and optimize healthcare operations (Shukla and Marlin, 2022). Time-series data is routinely collected in various healthcare settings, where different measurements are recorded for patients throughout their course of stay. Predicting clinical outcomes like mortality, decompensation, length of stay, and disease risk from such complex multivariate time-series data can facilitate both effective management of critical care units and automatic personalized treatment recommendations for patients (Tipirneni et al., 2022). However, modeling time series data subject to irregular sampling poses a significant challenge to machine learning models that assume fully observed, fixed-size feature representations (Shukla and Marlin, 2021).
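As an illustrative aside (not the paper's pipeline), irregularly sampled multivariate clinical records are often represented as (time, variable, value) triplets rather than being forced onto a fixed sampling grid; a minimal sketch of this representation follows, with the variable names assumed.

# Illustrative sketch (not the paper's pipeline): representing an irregularly
# sampled multivariate clinical time series as (time, variable, value) triplets,
# a common input format for models that do not assume a fixed sampling grid.
import numpy as np

# Raw observations: each row is (hours since admission, variable name, value).
observations = [
    (0.5, "heart_rate", 92.0),
    (0.5, "sbp", 118.0),
    (2.0, "heart_rate", 97.0),
    (5.5, "lactate", 1.8),
]

variables = sorted({name for _, name, _ in observations})
var_index = {name: i for i, name in enumerate(variables)}

# Triplet encoding: arrays of times, variable indices, and values per stay.
times = np.array([t for t, _, _ in observations], dtype=np.float32)
var_ids = np.array([var_index[name] for _, name, _ in observations], dtype=np.int64)
values = np.array([v for _, _, v in observations], dtype=np.float32)

print(times, var_ids, values)  # no imputation or resampling is required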
On Cross-Layer Alignment for Model Fusion of Heterogeneous Neural Networks
Nguyen, Dang, Nguyen, Trang, Nguyen, Khai, Phung, Dinh, Bui, Hung, Ho, Nhat
Layer-wise model fusion via optimal transport, named OTFusion, applies soft neuron association to unify different pre-trained networks and save computational resources. Despite its success, OTFusion requires the input networks to have the same number of layers. To address this issue, we propose a novel model fusion framework, named CLAFusion, to fuse neural networks with different numbers of layers, which we refer to as heterogeneous neural networks, via cross-layer alignment. The cross-layer alignment problem, which is an unbalanced assignment problem, can be solved efficiently using dynamic programming. Based on the cross-layer alignment, our framework balances the number of layers of the neural networks before applying layer-wise model fusion. Our experiments indicate that CLAFusion, with an extra fine-tuning step, improves the accuracy of residual networks on the CIFAR10, CIFAR100, and Tiny-ImageNet datasets. Furthermore, we explore its practical usage for model compression and knowledge distillation when applied to the teacher-student setting.
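For illustration, the sketch below casts cross-layer alignment as an order-preserving assignment solved by dynamic programming; the layer dissimilarity used here (difference in layer widths) is only a placeholder for whatever cost CLAFusion actually employs.

# A minimal sketch of cross-layer alignment as an order-preserving assignment
# solved by dynamic programming. The cost (difference in layer widths) is only a
# placeholder for the layer dissimilarity used in the actual method.
import numpy as np

def align_layers(widths_shallow, widths_deep):
    m, n = len(widths_shallow), len(widths_deep)
    assert m <= n
    cost = np.abs(np.subtract.outer(widths_shallow, widths_deep)).astype(float)

    INF = float("inf")
    dp = np.full((m + 1, n + 1), INF)
    dp[0, :] = 0.0
    for i in range(1, m + 1):
        for j in range(i, n + 1):
            # Either skip deep layer j, or match shallow layer i to deep layer j.
            dp[i, j] = min(dp[i, j - 1], dp[i - 1, j - 1] + cost[i - 1, j - 1])

    # Backtrack to recover which deep layer each shallow layer is matched to.
    matches, j = [], n
    for i in range(m, 0, -1):
        while j > i and dp[i, j] == dp[i, j - 1]:
            j -= 1
        matches.append((i - 1, j - 1))
        j -= 1
    return dp[m, n], sorted(matches)

print(align_layers([64, 128, 256], [64, 64, 128, 128, 256]))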
On Learning Domain-Invariant Representations for Transfer Learning with Multiple Sources
Phung, Trung, Le, Trung, Vuong, Long, Tran, Toan, Tran, Anh, Bui, Hung, Phung, Dinh
Domain adaptation (DA) benefits from rigorous theoretical works that study its insightful characteristics and various aspects, e.g., learning domain-invariant representations and the associated trade-offs. However, the same cannot be said for the multiple-source DA and domain generalization (DG) settings, which are considerably more complicated due to the involvement of multiple source domains and the potential unavailability of the target domain during training. In this paper, we develop novel upper bounds for the target general loss, which motivate us to define two kinds of domain-invariant representations. We further study the pros and cons, as well as the trade-offs, of enforcing each kind of domain-invariant representation. Finally, we conduct experiments to inspect the trade-offs of these representations, offering practical hints on how to use them in practice, and to explore other interesting properties of our developed theory.
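For background only, the classic single-source target-loss bound of Ben-David et al. illustrates the kind of upper bound being generalized here; the paper's multi-source bounds are not reproduced.

% Classic single-source domain adaptation bound (Ben-David et al.), shown only
% as background for the kind of target-loss upper bound discussed above.
\varepsilon_T(h) \;\le\; \varepsilon_S(h) \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \;+\; \lambda^{*},
\qquad
\lambda^{*} \;=\; \min_{h' \in \mathcal{H}} \big[ \varepsilon_S(h') + \varepsilon_T(h') \big].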
Temporal Predictive Coding For Model-Based Planning In Latent Space
Nguyen, Tung, Shu, Rui, Pham, Tuan, Bui, Hung, Ermon, Stefano
High-dimensional observations are a major challenge in the application of model-based reinforcement learning (MBRL) to real-world environments. To handle high-dimensional sensory inputs, existing approaches use representation learning to map high-dimensional observations into a lower-dimensional latent space that is more amenable to dynamics estimation and planning. In this work, we present an information-theoretic approach that employs temporal predictive coding to encode elements in the environment that can be predicted across time. Since this approach focuses on encoding temporally predictable information, we implicitly prioritize the encoding of task-relevant components over nuisance components of the environment that are provably task-irrelevant. By learning this representation in conjunction with a recurrent state-space model, we can then perform planning in latent space. We evaluate our model on a challenging modification of standard DMControl tasks in which the background is replaced with natural videos that contain complex but irrelevant information for the planning task. Our experiments show that our model is superior to existing methods in the challenging complex-background setting while remaining competitive with current state-of-the-art models in the standard setting.
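As a rough sketch of the flavor of objective involved, the snippet below shows a contrastive temporal-predictive-coding style loss; the encoder, forward predictor, and InfoNCE-style bound here are assumptions for illustration, not the paper's exact information-theoretic objective or architecture.

# A rough sketch of a contrastive temporal-predictive-coding style loss in PyTorch.
# The encoder, forward predictor, and InfoNCE-style bound are illustrative assumptions.
import torch
import torch.nn.functional as F

def tpc_infonce_loss(encoder, predictor, obs_t, obs_tp1):
    """obs_t, obs_tp1: batches of consecutive observations, shape (B, ...)."""
    z_t = encoder(obs_t)          # (B, D) latent at time t
    z_tp1 = encoder(obs_tp1)      # (B, D) latent at time t+1
    z_pred = predictor(z_t)       # (B, D) predicted next latent

    # Scores between every predicted latent and every actual next latent;
    # the diagonal holds the positive (temporally matched) pairs.
    logits = z_pred @ z_tp1.t()   # (B, B)
    labels = torch.arange(obs_t.shape[0], device=logits.device)
    return F.cross_entropy(logits, labels)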
Improving Bayesian Inference in Deep Neural Networks with Variational Structured Dropout
Nguyen, Son, Nguyen, Duong, Nguyen, Khai, Ho, Nhat, Than, Khoat, Bui, Hung
Bayesian Neural Networks (BNNs) [37, 47] offer a probabilistic interpretation of deep learning models by imposing a prior distribution on the weight parameters and aiming to obtain a posterior distribution instead of only point estimates. By marginalizing over this posterior for prediction, BNNs perform a form of ensemble learning. These principles help the model improve generalization and robustness and allow for uncertainty quantification. However, computing the posterior of non-linear Bayesian networks exactly is infeasible, and approximate inference methods have been devised. The core challenge is how to construct an expressive approximation to the true posterior while maintaining computational efficiency and scalability, especially for modern deep learning architectures. Variational inference is a popular deterministic approximation approach to deal with this challenge. The first practical methods were proposed in [15, 5, 28], in which the approximate posterior is assumed to be a fully factorized distribution, also called mean-field variational inference. Generally, the mean-field approximation family offers several advantages for inference, including computational tractability and effective optimization with stochastic gradient-based methods. However, it ignores strong statistical dependencies among the random weights of the neural network, which leads to an inability to capture the complicated structure of the true posterior and to estimate the true model uncertainty.
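For concreteness, a minimal sketch of mean-field variational inference for a single Bayesian linear layer, in the spirit of Bayes by Backprop, is given below; the shapes, prior, and initialization are illustrative.

# A minimal sketch of mean-field variational inference for one Bayesian linear
# layer: a fully factorized Gaussian posterior over the weights, trained with the
# reparameterization trick plus a KL term to a standard-normal prior.
import torch
import torch.nn as nn

class MeanFieldLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -5.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -5.0))

    def forward(self, x):
        w_sigma = torch.nn.functional.softplus(self.w_rho)
        b_sigma = torch.nn.functional.softplus(self.b_rho)
        # Reparameterized sample of the weights for this forward pass.
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        return x @ w.t() + b

    def kl_to_standard_normal(self):
        # Closed-form KL between the factorized Gaussian posterior and N(0, I).
        kl = 0.0
        for mu, rho in [(self.w_mu, self.w_rho), (self.b_mu, self.b_rho)]:
            sigma = torch.nn.functional.softplus(rho)
            kl = kl + 0.5 * torch.sum(sigma**2 + mu**2 - 1.0 - 2.0 * torch.log(sigma))
        return kl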
On Robust Optimal Transport: Computational Complexity, Low-rank Approximation, and Barycenter Computation
Le, Khang, Nguyen, Huy, Nguyen, Quang, Ho, Nhat, Pham, Tung, Bui, Hung
Recent advances in computation for the optimal transport (OT) problem [12, 3, 13, 7, 20, 23, 17, 18] have led to a surge of interest in using this tool in various domains of machine learning and statistics. The range of its applications is broad, including deep generative models [4, 14, 32], scalable Bayes [29, 30], mixture and hierarchical models [21], and other applications [28, 25, 10, 15, 33, 31, 8]. The goal of optimal transport is to find a minimal cost of moving masses between (supports of) probability distributions. It is known that the estimation of the transport cost is not robust when there are outliers. To deal with this issue, [34] proposed a trimmed version of optimal transport. In particular, they search for the truncated probability distributions such that the optimal transport cost between them is minimized. However, their trimmed optimal transport is non-trivial to compute, which hinders its usage in practical applications. Another line of work proposes using unbalanced optimal transport (UOT) to address the sensitivity of optimal transport to outliers [5, 26]. More specifically, the idea is to assign as little mass as possible to outliers by relaxing the marginal constraints of OT through a penalty function such as the KL divergence.
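For illustration, a sketch of the generalized Sinkhorn scaling iterations for entropic UOT with KL-relaxed marginals is given below; the parameter values and toy data are assumptions, not the algorithm studied in the paper.

# A sketch of the generalized Sinkhorn scaling iterations for entropic unbalanced
# optimal transport, where the marginal constraints are relaxed with KL penalties
# of weight rho (following the standard scaling-algorithm form; step counts and
# parameter values here are illustrative).
import numpy as np

def uot_sinkhorn(a, b, C, eps=0.1, rho=1.0, n_iters=500):
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    exponent = rho / (rho + eps)  # comes from the KL marginal relaxation
    for _ in range(n_iters):
        u = (a / (K @ v)) ** exponent
        v = (b / (K.T @ u)) ** exponent
    plan = u[:, None] * K * v[None, :]
    return np.sum(plan * C), plan

# Toy example: an outlier point in the first measure receives little mass.
x = np.array([0.0, 1.0, 5.0])    # third point is an outlier
y = np.array([0.0, 1.0])
a = np.full(3, 1.0 / 3.0)
b = np.full(2, 1.0 / 2.0)
C = (x[:, None] - y[None, :]) ** 2
cost, plan = uot_sinkhorn(a, b, C)
print(plan.sum(axis=1))  # the outlier's row sums to much less than 1/3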
BoMb-OT: On Batch of Mini-batches Optimal Transport
Nguyen, Khai, Nguyen, Quoc, Ho, Nhat, Pham, Tung, Bui, Hung, Phung, Dinh, Le, Trung
Mini-batch optimal transport (m-OT) has been successfully used in practical applications that involve probability measures with intractable density or probability measures with a very large number of supports. The m-OT solves several sparser optimal transport problems and then returns the average of their costs and transportation plans. Despite its scalability advantage, m-OT is not a proper metric between probability measures since it does not satisfy the identity property. To address this problem, we propose a novel mini-batching scheme for optimal transport, named Batch of Mini-batches Optimal Transport (BoMb-OT), that can be formulated as a well-defined distance on the space of probability measures. Furthermore, we show that m-OT is a limit of the entropic regularized version of the proposed BoMb-OT when the regularization parameter goes to infinity. We carry out extensive experiments to show that the new mini-batching scheme can estimate a better transportation plan between two original measures than m-OT, which leads to a favorable performance of BoMb-OT in matching and color transfer tasks. Furthermore, we observe that BoMb-OT also provides a better objective loss than m-OT for doing approximate Bayesian computation, estimating parameters of interest in parametric generative models, and learning non-parametric generative models with gradient flow.
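As a sketch of the plain m-OT estimator described above (batch size, number of batch pairs, and the ground cost are illustrative), each small OT problem between uniform mini-batches of equal size reduces to an assignment problem:

# A sketch of the plain mini-batch OT (m-OT) estimator: draw pairs of
# mini-batches, solve a small OT problem per pair, and average the costs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def minibatch_ot_cost(X, Y, batch_size=64, n_pairs=10, rng=None):
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_pairs):
        xb = X[rng.choice(len(X), batch_size, replace=False)]
        yb = Y[rng.choice(len(Y), batch_size, replace=False)]
        # Squared Euclidean ground cost between the two mini-batches.
        C = np.sum((xb[:, None, :] - yb[None, :, :]) ** 2, axis=-1)
        rows, cols = linear_sum_assignment(C)   # exact OT for uniform weights
        total += C[rows, cols].mean()
    return total / n_pairs

X = np.random.default_rng(0).normal(size=(1000, 2))
Y = np.random.default_rng(1).normal(loc=1.0, size=(1000, 2))
print(minibatch_ot_cost(X, Y))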
Learning Compositional Sparse Gaussian Processes with a Shrinkage Prior
Tong, Anh, Tran, Toan, Bui, Hung, Choi, Jaesik
Choosing a proper set of kernel functions is an important problem in learning Gaussian Process (GP) models, since each kernel structure has different model complexity and data fitness. Recently, automatic kernel composition methods have provided not only accurate prediction but also attractive interpretability through search-based methods. However, existing methods suffer from slow kernel composition learning. To tackle large-scale data, we propose a new sparse approximate posterior for GPs, MultiSVGP, constructed from groups of inducing points associated with the individual additive kernels in compositional kernels. We demonstrate that this approximation provides a better fit for learning compositional kernels from empirical observations. We also provide a theoretical justification via an error bound compared to the traditional sparse GP. In contrast to the search-based approach, we present a novel probabilistic algorithm that learns a kernel composition by handling the sparsity in the kernel selection with a Horseshoe prior. We demonstrate that our model can capture characteristics of time series with significant reductions in computational time and has competitive regression performance on real-world data sets.
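For illustration, a sketch of the kind of additive kernel composition referred to above is given below, using standard RBF, periodic, and linear base kernels; the chosen bases, weights, and hyperparameters are assumptions, not the learned composition.

# A sketch of an additive kernel composition built from standard base kernels
# (RBF, periodic, linear). The bases and hyperparameters are illustrative.
import numpy as np

def rbf(x, y, lengthscale=1.0):
    d2 = (x[:, None] - y[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def periodic(x, y, period=1.0, lengthscale=1.0):
    d = np.abs(x[:, None] - y[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale**2)

def linear(x, y):
    return x[:, None] * y[None, :]

def additive_kernel(x, y, weights=(1.0, 1.0, 1.0)):
    # In a compositional GP, each additive component could own its own group of
    # inducing points, and a shrinkage (e.g. Horseshoe) prior on the weights
    # would switch unneeded components off.
    w_rbf, w_per, w_lin = weights
    return w_rbf * rbf(x, y) + w_per * periodic(x, y) + w_lin * linear(x, y)

x = np.linspace(0.0, 5.0, 50)
K = additive_kernel(x, x) + 1e-6 * np.eye(len(x))  # jitter for stability
sample = np.random.default_rng(0).multivariate_normal(np.zeros(len(x)), K)
print(K.shape, sample.shape)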
Improving Relational Regularized Autoencoders with Spherical Sliced Fused Gromov Wasserstein
Nguyen, Khai, Nguyen, Son, Ho, Nhat, Pham, Tung, Bui, Hung
Relational regularized autoencoder (RAE) is a framework to learn the distribution of data by minimizing a reconstruction loss together with a relational regularization on the latent space. A recent attempt to reduce the inner discrepancy between the prior and aggregated posterior distributions is to incorporate sliced fused Gromov-Wasserstein (SFG) between these distributions. However, that approach has a weakness: it treats every slicing direction equally, even though many directions are not useful for the discriminative task. To improve the discrepancy, and consequently the relational regularization, we propose a new relational discrepancy, named spherical sliced fused Gromov Wasserstein (SSFG), that can find an important area of projections characterized by a von Mises-Fisher distribution. We then introduce two variants of SSFG to improve its performance. The first variant, named mixture spherical sliced fused Gromov Wasserstein (MSSFG), replaces the vMF distribution by a mixture of von Mises-Fisher distributions to capture multiple important areas of directions that are far from each other. The second variant, named power spherical sliced fused Gromov Wasserstein (PSSFG), replaces the vMF distribution by a power spherical distribution to improve the sampling time in high-dimensional settings. We then apply the new discrepancies to the RAE framework to obtain new variants of it. Finally, we conduct extensive experiments to show that the new proposed autoencoders have favorable performance in learning latent manifold structure, image generation, and reconstruction.
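For illustration, the sketch below computes a sliced Wasserstein-style discrepancy with caller-supplied projection directions; in the spherical variants above, those directions would be drawn from a vMF or power spherical distribution rather than uniformly, and the fused Gromov-Wasserstein component of SSFG is not reproduced.

# A sketch of a sliced Wasserstein-style discrepancy with externally supplied
# projection directions. The uniform direction sampler is only a placeholder for
# a vMF or power spherical sampler concentrated on informative directions.
import torch

def sliced_wasserstein(x, y, directions):
    """x, y: (N, D) samples; directions: (L, D) unit vectors."""
    proj_x = x @ directions.t()            # (N, L) one-dimensional projections
    proj_y = y @ directions.t()
    # 1-D Wasserstein-2 between projections via sorted samples.
    sq_diff = (proj_x.sort(dim=0).values - proj_y.sort(dim=0).values) ** 2
    return sq_diff.mean().sqrt()

def uniform_directions(n_dirs, dim):
    d = torch.randn(n_dirs, dim)
    return d / d.norm(dim=1, keepdim=True)

x = torch.randn(256, 8)
y = torch.randn(256, 8) + 0.5
print(sliced_wasserstein(x, y, uniform_directions(50, 8)))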