
Collaborating Authors

 Yau, Chung-Yiu


RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models

arXiv.org Artificial Intelligence

Supervised fine-tuning is a standard method for adapting pre-trained large language models (LLMs) to downstream tasks. Quantization has recently been studied as a post-training technique for efficient LLM deployment. To obtain quantized fine-tuned LLMs, conventional pipelines first fine-tune the pre-trained model and then apply post-training quantization. This often yields suboptimal performance as it fails to leverage the synergy between fine-tuning and quantization. To effectively realize low-bit quantization of weights, activations, and KV caches in LLMs, we propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that identifies an effective rotation configuration to reduce activation outliers. We provide theoretical insights on RoSTE by analyzing its prediction error when applied to an overparameterized least-squares quantized training problem. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration. Experiments on Pythia and Llama models of different sizes demonstrate the effectiveness of RoSTE. Compared to existing post-SFT quantization baselines, our method consistently achieves superior performance across various tasks and LLM architectures.
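
To make the two ingredients named above concrete, here is a minimal PyTorch sketch of (a) uniform fake quantization with a straight-through estimator and (b) a random orthogonal rotation applied to activations and weights before quantization. The function names, bit-width, and the random (rather than optimized) rotation are illustrative assumptions, not the paper's implementation.

```python
import torch

def fake_quant_ste(x, n_bits=4):
    """Uniform fake quantization with a straight-through estimator (STE):
    the forward pass rounds onto the quantization grid, the backward pass
    treats quantization as the identity so gradients flow through."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.detach().abs().max() / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return x + (q - x).detach()        # value of q, gradient of x

def random_rotation(d, seed=0):
    """Random orthogonal matrix (QR of a Gaussian), standing in for the
    optimized rotation configuration that RoSTE searches for."""
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    return q

# Toy linear layer: rotate activations and weights before quantizing.
# Since R is orthogonal, (x R)(w R)^T = x w^T, so the layer's function
# is preserved while outliers tend to be spread across coordinates.
d = 8
w = torch.randn(d, d, requires_grad=True)
x = torch.randn(4, d)
R = random_rotation(d)
y = fake_quant_ste(x @ R) @ fake_quant_ste(w @ R).T
y.sum().backward()                     # gradients reach w through the STE
print(w.grad.shape)
```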


Fully Stochastic Primal-dual Gradient Algorithm for Non-convex Optimization on Random Graphs

arXiv.org Artificial Intelligence

Stochastic decentralized optimization algorithms often suffer from issues such as synchronization overhead and intermittent communication. This paper proposes a $\underline{\rm F}$ully $\underline{\rm S}$tochastic $\underline{\rm P}$rimal $\underline{\rm D}$ual gradient $\underline{\rm A}$lgorithm (FSPDA), an asynchronous decentralized procedure with (i) sparsified non-blocking communication on random undirected graphs and (ii) local stochastic gradient updates. FSPDA allows multiple local gradient steps to accelerate convergence to stationarity while finding a consensual solution with stochastic primal-dual updates. For problems with a smooth (possibly non-convex) objective function, we show that FSPDA converges to an $\mathcal{O}(\sigma/\sqrt{nT})$-stationary solution after $T$ iterations without assuming data heterogeneity. The performance of FSPDA is on par with state-of-the-art algorithms whose convergence depends on static graphs and synchronous updates. To the best of our knowledge, FSPDA is the first asynchronous algorithm that converges exactly under the non-convex setting. Numerical experiments are presented to show the benefits of FSPDA.
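
The following toy NumPy sketch illustrates ingredients (i)-(ii) on a decentralized least-squares problem: one random edge is activated per iteration, its endpoints exchange a top-k-sparsified difference and update per-agent dual variables, while every agent takes local stochastic gradient steps. The step sizes, the top-k compressor, and the exact update rules are illustrative assumptions and do not reproduce FSPDA's stated guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 20                           # agents, dimension
A = rng.standard_normal((n, 50, d))
b = rng.standard_normal((n, 50))       # agent i holds least-squares data (A[i], b[i])

def stoch_grad(i, x, batch=5):
    """Mini-batch gradient of agent i's local least-squares loss."""
    idx = rng.choice(50, size=batch, replace=False)
    Ai, bi = A[i][idx], b[i][idx]
    return Ai.T @ (Ai @ x - bi) / batch

def sparsify(v, k=5):
    """Keep the k largest-magnitude coordinates (sparsified communication)."""
    out = np.zeros_like(v)
    top = np.argsort(np.abs(v))[-k:]
    out[top] = v[top]
    return out

x = np.zeros((n, d))                   # primal iterates
lam = np.zeros((n, d))                 # dual variables enforcing consensus
eta, gamma, beta = 0.01, 0.05, 0.1

for t in range(2000):
    # (ii) local stochastic gradient step with a dual correction at every agent
    for i in range(n):
        x[i] -= eta * (stoch_grad(i, x[i]) + lam[i])
    # (i) one random edge wakes up and exchanges a sparsified difference
    i, j = rng.choice(n, size=2, replace=False)
    diff = sparsify(x[i] - x[j])
    x[i] -= gamma * diff
    x[j] += gamma * diff
    lam[i] += beta * diff              # primal-dual consensus update
    lam[j] -= beta * diff

print("consensus error:", np.linalg.norm(x - x.mean(0)))
```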


EMC$^2$: Efficient MCMC Negative Sampling for Contrastive Learning with Global Convergence

arXiv.org Artificial Intelligence

Contrastive representation learning has been instrumental in self-supervised learning for large-scale pretraining of foundation models Radford et al. (2021); Cherti et al. (2023), as well as in the fine-tuning stage on downstream tasks Xiong et al. (2020); Lindgren et al. (2021). It helps encode real-world data into low-dimensional feature vectors that abstract the important attributes of the data and generalize well outside of the training distribution. More recently, contrastive learning with multi-modal data has helped embed different data modalities into the same feature space Li et al. (2023), such as in studies on visual-language models Radford et al. (2021); Alayrac et al. (2022); Cherti et al. (2023) and document understanding Xu et al. (2020); Lee et al. (2023). Contrastive learning uses pairwise comparisons of representations in the training objective, with the goal of learning representations of data where positive pairs are drawn closer while negative pairs move apart in the representation space. It is well known that generating a large dataset of pairwise samples, such as image-text pairs with the same semantics, costs much less than manual labeling; e.g., the WebImageText dataset used for training CLIP originates from Wikipedia articles Radford et al. (2021).
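
As a concrete reference point for the pairwise objective described above, the NumPy sketch below pairs a generic InfoNCE-style contrastive loss with a toy Metropolis-Hastings walk over a candidate pool for drawing negatives. It only illustrates the idea of MCMC negative sampling; it is not the EMC$^2$ procedure or its convergence-bearing estimator, and all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def info_nce(anchor, positive, negatives, tau=0.1):
    """Generic InfoNCE-style contrastive loss: pull the positive pair
    together, push the sampled negatives apart (unit-norm vectors)."""
    pos = anchor @ positive / tau
    neg = negatives @ anchor / tau
    logits = np.concatenate(([pos], neg))
    return -pos + np.log(np.exp(logits).sum())

def mh_negative_sampler(anchor, pool, steps=10, tau=0.1):
    """Toy Metropolis-Hastings walk over a candidate pool, targeting a
    softmax distribution over similarities; a stand-in for the idea of
    MCMC negative sampling, not the paper's estimator."""
    idx = rng.integers(len(pool))
    for _ in range(steps):
        prop = rng.integers(len(pool))
        ratio = np.exp((anchor @ pool[prop] - anchor @ pool[idx]) / tau)
        if rng.random() < min(1.0, ratio):
            idx = prop
    return idx

# Toy usage with random unit vectors standing in for encoder outputs.
def unit(v): return v / np.linalg.norm(v, axis=-1, keepdims=True)
anchor, positive = unit(rng.standard_normal(16)), unit(rng.standard_normal(16))
pool = unit(rng.standard_normal((100, 16)))
negs = pool[[mh_negative_sampler(anchor, pool) for _ in range(8)]]
print("loss:", info_nce(anchor, positive, negs))
```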


DoCoM: Compressed Decentralized Optimization with Near-Optimal Sample Complexity

arXiv.org Artificial Intelligence

This paper proposes the Doubly Compressed Momentum-assisted stochastic gradient tracking algorithm $\texttt{DoCoM}$ for communication-efficient decentralized optimization. The algorithm features two main ingredients to achieve a near-optimal sample complexity while allowing for communication compression. First, the algorithm tracks both the averaged iterate and the stochastic gradient using compressed gossiping consensus. Second, a momentum step is incorporated for adaptive variance reduction with the local gradient estimates. We show that $\texttt{DoCoM}$ finds a near-stationary solution at all participating agents satisfying $\mathbb{E}[ \| \nabla f( \theta ) \|^2 ] = \mathcal{O}( 1 / T^{2/3} )$ in $T$ iterations, where $f(\theta)$ is a smooth (possibly non-convex) objective function. The proof is achieved by analytically designing a new potential function that tightly tracks the one-iteration progress of $\texttt{DoCoM}$. As a corollary, our analysis also establishes the linear convergence of $\texttt{DoCoM}$ to a globally optimal solution for objective functions satisfying the Polyak-{\L}ojasiewicz condition. Numerical experiments demonstrate that our algorithm outperforms several state-of-the-art algorithms in practice.
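
A minimal NumPy sketch of the ingredients named above: momentum-corrected local gradient estimates, a gradient tracker, and top-k-compressed gossip of both the iterates and the tracker on a ring. It omits the auxiliary reference variables a full compressed-gossip scheme maintains, and all constants are illustrative, so it should be read as a schematic of $\texttt{DoCoM}$'s structure rather than the algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 10
A = rng.standard_normal((n, 40, d))
b = rng.standard_normal((n, 40))       # agent i holds least-squares data (A[i], b[i])

def stoch_grad(i, x, batch=4):
    """Mini-batch gradient of agent i's local least-squares loss."""
    idx = rng.choice(40, size=batch, replace=False)
    Ai, bi = A[i][idx], b[i][idx]
    return Ai.T @ (Ai @ x - bi) / batch

def topk(v, k=3):
    """Top-k compression operator applied to the gossip messages."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

# Ring mixing matrix (doubly stochastic).
W = np.eye(n) * 0.5 + np.roll(np.eye(n), 1, 0) * 0.25 + np.roll(np.eye(n), -1, 0) * 0.25

x = np.zeros((n, d))                                       # iterates
v = np.array([stoch_grad(i, x[i]) for i in range(n)])      # momentum gradient estimates
g = v.copy()                                               # gradient trackers
eta, beta, gamma = 0.02, 0.2, 0.5

for t in range(500):
    # Momentum-assisted local gradient estimates (adaptive variance reduction).
    v_new = np.array([(1 - beta) * v[i] + beta * stoch_grad(i, x[i]) for i in range(n)])
    # "Doubly compressed" gossip: only compressed consensus differences of
    # both the iterates and the gradient trackers are communicated.
    x_gossip = np.array([topk((W @ x)[i] - x[i]) for i in range(n)])
    g_gossip = np.array([topk((W @ g)[i] - g[i]) for i in range(n)])
    x = x + gamma * x_gossip - eta * g
    g = g + gamma * g_gossip + v_new - v                   # gradient tracking update
    v = v_new

# Rough stationarity proxy: norm of the average full local gradient.
full_grads = [A[i].T @ (A[i] @ x[i] - b[i]) / 40 for i in range(n)]
print("avg grad norm:", np.linalg.norm(np.mean(full_grads, axis=0)))
```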