Xu, Chenwei
AlignAb: Pareto-Optimal Energy Alignment for Designing Nature-Like Antibodies
Wen, Yibo, Xu, Chenwei, Hu, Jerry Yao-Chieh, Liu, Han
We present a three-stage framework for training deep learning models specializing in antibody sequence-structure co-design. We first pre-train a language model on millions of antibody sequences. We then use the learned representations to guide the training of a diffusion model for joint optimization over both the sequence and structure of antibodies. During the final alignment stage, we optimize the model to favor antibodies with low repulsion and high attraction to the antigen binding site, enhancing the rationality and functionality of the designs. To mitigate conflicting energy preferences, we extend AbDPO (Antibody Direct Preference Optimization) to guide the model towards Pareto optimality under multiple energy-based alignment objectives. Furthermore, we adopt an iterative learning paradigm with temperature scaling, enabling the model to benefit from diverse online datasets without requiring additional data. In practice, our proposed methods achieve high stability and efficiency, producing a better Pareto front of antibody designs than the top samples generated by baselines and previous alignment techniques. Through extensive experiments, we show that our methods consistently generate nature-like antibodies with high binding affinity.
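To make the multi-objective alignment idea concrete, here is a minimal sketch of a DPO-style preference loss together with a linear scalarization over several energy margins (e.g., attraction and repulsion); sweeping the weights traces out a Pareto front. The scalarization, weights, and function names are illustrative assumptions, not AbDPO's or AlignAb's actual objective:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, beta=0.1):
    """Standard DPO-style loss on a log-probability margin:
    -log sigmoid(beta * (logp_chosen - logp_rejected))."""
    margin = beta * (logp_chosen - logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def multi_energy_dpo_loss(margins, weights, beta=0.1):
    """Illustrative scalarization: combine per-energy preference margins
    (one per alignment objective) with fixed weights before the DPO loss.
    Different weight vectors favor different points on the Pareto front."""
    assert len(margins) == len(weights)
    total = sum(w * m for w, m in zip(weights, margins))
    return -math.log(1.0 / (1.0 + math.exp(-beta * total)))
```

Preferring the chosen sample (positive margin) lowers the loss, as expected of a preference objective.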
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
Lau, Tim Tsz-Kit, Li, Weijian, Xu, Chenwei, Liu, Han, Kolar, Mladen
An appropriate choice of batch sizes in large-scale model training is crucial, yet it involves an intrinsic and inevitable dilemma: large-batch training improves training efficiency in terms of memory utilization, while generalization performance often deteriorates due to the reduced gradient noise. Despite this dilemma, the common practice of choosing batch sizes in language model training often prioritizes training efficiency -- employing either constant large sizes with data parallelism or implementing batch size warmup schedules. However, such batch size schedule designs remain heuristic and often fail to adapt to training dynamics, presenting the challenge of designing adaptive batch size schedules. Given the abundance of available datasets and the data-hungry nature of language models, data parallelism has become an indispensable distributed training paradigm, enabling the use of larger batch sizes for gradient computation. However, vanilla data parallelism requires replicas of model parameters, gradients, and optimizer states at each worker, which prohibits training larger models with billions of parameters. To optimize memory usage, more advanced parallelism strategies must be employed. In this work, we propose general-purpose and theoretically principled adaptive batch size schedules compatible with data parallelism and model parallelism. We develop a practical implementation with PyTorch Fully Sharded Data Parallel, facilitating the pretraining of language models of different sizes. We empirically demonstrate that our proposed approaches outperform constant batch sizes and heuristic batch size warmup schedules in the pretraining of models in the Llama family, with particular focus on smaller models with up to 3 billion parameters. We also establish theoretical convergence guarantees for such adaptive batch size schedules with Adam for general smooth nonconvex objectives.
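As a toy illustration of the adaptive idea (not the paper's actual criterion or implementation), a norm-test-style rule grows the batch once the minibatch gradient variance dominates the squared gradient mean; the threshold `theta`, growth factor, and cap below are hypothetical, and gradients are scalars for brevity:

```python
import statistics

def should_increase_batch(per_sample_grads, theta=0.5):
    """Norm-test-style check: grow the batch when the variance of the
    minibatch gradient estimate exceeds theta^2 times the squared mean."""
    mean = sum(per_sample_grads) / len(per_sample_grads)
    var = statistics.variance(per_sample_grads)  # sample variance
    return var / len(per_sample_grads) > (theta ** 2) * mean ** 2

def next_batch_size(current, per_sample_grads, growth=2, cap=4096):
    """Double the batch size (up to a cap) whenever the test fires."""
    if should_increase_batch(per_sample_grads):
        return min(current * growth, cap)
    return current
```

Noisy, sign-flipping gradients trigger growth; consistent gradients leave the batch size unchanged.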
Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods
Lau, Tim Tsz-Kit, Li, Weijian, Xu, Chenwei, Liu, Han, Kolar, Mladen
Modern deep neural networks often require distributed training with many workers due to their large size. As worker numbers increase, communication overheads become the main bottleneck in data-parallel minibatch stochastic gradient methods with per-iteration gradient synchronization. Local gradient methods like Local SGD reduce communication by only syncing after several local steps. Although their convergence in i.i.d. and heterogeneous settings is well understood, and batch sizes are known to matter for both efficiency and generalization, optimal local batch sizes remain difficult to determine. We introduce adaptive batch size strategies for local gradient methods that increase batch sizes adaptively to reduce minibatch gradient variance. We provide convergence guarantees under homogeneous data conditions and support our claims with image classification experiments, demonstrating the effectiveness of our strategies in training and generalization.
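The communication pattern the abstract describes — several local SGD steps per worker, then a single parameter average — can be sketched on a scalar least-squares problem; the loss, learning rate, and step counts are all illustrative, not the paper's setup:

```python
def local_sgd(workers_data, w0=0.0, lr=0.1, local_steps=4, rounds=3):
    """Minimal scalar Local SGD: each worker runs `local_steps` passes of
    SGD on the loss (w - x)^2 / 2 over its own data, then the parameters
    are averaged -- the only communication per round."""
    w = w0
    for _ in range(rounds):
        local_params = []
        for data in workers_data:
            wi = w
            for _ in range(local_steps):
                for x in data:
                    wi -= lr * (wi - x)  # gradient of (wi - x)^2 / 2
            local_params.append(wi)
        w = sum(local_params) / len(local_params)  # synchronization step
    return w
```

With one datum per worker (1.0 and 3.0), the average iterate converges to the global optimum 2.0 despite workers only communicating once per round.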
BiSHop: Bi-Directional Cellular Learning for Tabular Data with Generalized Sparse Modern Hopfield Model
Xu, Chenwei, Huang, Yu-Chao, Hu, Jerry Yao-Chieh, Li, Weijian, Gilani, Ammar, Goan, Hsi-Sheng, Liu, Han
The field of developing deep learning architectures for tabular data is experiencing rapid advancements [Arik and Pfister, 2021, Gorishniy et al., 2021, Huang et al., 2020, Somepalli et al., 2021]. The primary driving force behind this trend is the limitations of the currently dominant methods for tabular data: tree-based methods. Specifically, while tree-based methods excel in tabular learning, they lack the capability to integrate with deep learning architectures. Therefore, the pursuit of deep tabular learning is not just a matter of enhancing performance but is also crucial to bridge this gap. However, a recent tabular benchmark study [Grinsztajn et al., 2022] reveals that tree-based methods still surpass deep learning models, underscoring two main challenges for deep tabular learning, as highlighted by Grinsztajn et al. [2022, Sections 5.3 & 5.4]: (C1) Non-Rotationally Invariant Data Structure: The non-rotationally invariant structure of tabular data weakens the effectiveness of deep learning models that have rotationally invariant learning procedures.
SMUTF: Schema Matching Using Generative Tags and Hybrid Features
Zhang, Yu, Di, Mei, Luo, Haozheng, Xu, Chenwei, Tsai, Richard Tzong-Han
We introduce SMUTF, a unique approach for large-scale tabular data schema matching (SM), which assumes that supervised learning does not affect performance in open-domain tasks, thereby enabling effective cross-domain matching. This system uniquely combines rule-based feature engineering, pre-trained language models, and generative large language models. In an innovative adaptation inspired by the Humanitarian Exchange Language, we deploy 'generative tags' for each data column, enhancing the effectiveness of SM. SMUTF exhibits extensive versatility, working seamlessly with any pre-existing pre-trained embeddings, classification methods, and generative models. Recognizing the lack of extensive, publicly available datasets for SM, we have created and open-sourced the HDXSM dataset from the public humanitarian data. We believe this to be the most exhaustive SM dataset currently available. In evaluations across various public datasets and the novel HDXSM dataset, SMUTF demonstrated exceptional performance, surpassing existing state-of-the-art models in terms of accuracy and efficiency, and improving the F1 score by 11.84% and the AUC of ROC by 5.08%.
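The flavor of combining rule-based features with tag information can be shown in a deliberately simplified scoring sketch; the weights, the choice of name similarity plus tag-overlap features, and the function name are assumptions for illustration, not SMUTF's actual hybrid feature set:

```python
import difflib

def column_match_score(name_a, name_b, tags_a, tags_b,
                       w_name=0.5, w_tags=0.5):
    """Toy hybrid schema-matching score: a rule-based string similarity
    of column names plus Jaccard overlap of their generative tags,
    combined with fixed (illustrative) weights."""
    name_sim = difflib.SequenceMatcher(
        None, name_a.lower(), name_b.lower()).ratio()
    ta, tb = set(tags_a), set(tags_b)
    tag_sim = len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0
    return w_name * name_sim + w_tags * tag_sim
```

A real system would replace both components with learned embeddings and a trained classifier; the point here is only how heterogeneous signals fold into one matching score.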
Beyond PID Controllers: PPO with Neuralized PID Policy for Proton Beam Intensity Control in Mu2e
Xu, Chenwei, Hu, Jerry Yao-Chieh, Narayanan, Aakaash, Thieme, Mattson, Nagaslaev, Vladimir, Austin, Mark, Arnold, Jeremy, Berlioz, Jose, Hanlet, Pierrick, Ibrahim, Aisha, Nicklaus, Dennis, Mitrevski, Jovan, John, Jason Michael St., Pradhan, Gauri, Saewert, Andrea, Seiya, Kiyomi, Schupbach, Brian, Thurman-Keup, Randy, Tran, Nhan, Shi, Rui, Ogrenci, Seda, Shuping, Alexis Maya-Isabelle, Hazelwood, Kyle, Liu, Han
We introduce a novel Proximal Policy Optimization (PPO) algorithm aimed at addressing the challenge of maintaining a uniform proton beam intensity delivery in the Muon to Electron Conversion Experiment (Mu2e) at Fermi National Accelerator Laboratory (Fermilab). Our primary objective is to regulate the spill process to ensure a consistent intensity profile, with the ultimate goal of creating an automated controller capable of providing real-time feedback and calibration of the Spill Regulation System (SRS) parameters on a millisecond timescale. We treat the Mu2e accelerator system as a Markov Decision Process suitable for Reinforcement Learning (RL), utilizing PPO to reduce bias and enhance training stability. A key innovation in our approach is the integration of a neuralized Proportional-Integral-Derivative (PID) controller into the policy function, resulting in a significant improvement in the Spill Duty Factor (SDF) by 13.6%, surpassing the performance of the current PID controller baseline by an additional 1.6%. This paper presents the preliminary offline results based on a differentiable simulator of the Mu2e accelerator. It lays the groundwork for real-time implementations and applications, representing a crucial step towards automated proton beam intensity control for the Mu2e experiment.
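For readers unfamiliar with PID control, here is a textbook PID law wrapped as a stateful policy; in a PPO setting its output would serve as the mean of a stochastic action distribution and the gains would be learnable. The class name, gain values, and structure are assumptions for illustration, not the Mu2e implementation:

```python
class NeuralizedPID:
    """Illustrative PID-structured policy: the action follows the classic
    PID law u = kp*e + ki*integral(e) + kd*de/dt, where the gains
    (kp, ki, kd) are the parameters a policy-gradient method would tune."""

    def __init__(self, kp=1.0, ki=0.1, kd=0.01, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def act(self, setpoint, measurement):
        """Return the control action for the current tracking error."""
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        # In PPO, this deterministic output would parameterize the policy.
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)
```

On the first step with a unit error, the action is kp + ki + kd = 1.11 under the default gains.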
On Sparse Modern Hopfield Model
Hu, Jerry Yao-Chieh, Yang, Donglin, Wu, Dennis, Xu, Chenwei, Chen, Bo-Yu, Liu, Han
We introduce the sparse modern Hopfield model as a sparse extension of the modern Hopfield model. Like its dense counterpart, the sparse modern Hopfield model is equipped with memory-retrieval dynamics whose one-step approximation corresponds to the sparse attention mechanism. Theoretically, our key contribution is a principled derivation of a closed-form sparse Hopfield energy using the convex conjugate of the sparse entropic regularizer. Building upon this, we derive the sparse memory retrieval dynamics from the sparse energy function and show that its one-step approximation is equivalent to sparse-structured attention. Importantly, we provide a sparsity-dependent memory retrieval error bound which is provably tighter than its dense analog. The conditions under which the benefits of sparsity arise are therefore identified and discussed. In addition, we show that the sparse modern Hopfield model maintains the robust theoretical properties of its dense counterpart, including rapid fixed point convergence and exponential memory capacity. Empirically, we use both synthetic and real-world datasets to demonstrate that the sparse Hopfield model outperforms its dense counterpart in many situations.
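The retrieval/attention correspondence can be made concrete with sparsemax, the simplest sparse counterpart of softmax: one retrieval step computes X · sparsemax(β · Xᵀq), which zeroes out weakly matching memories exactly. This is a generic sketch of that mechanism, not the paper's code:

```python
def sparsemax(z):
    """Sparsemax: Euclidean projection of z onto the probability simplex.
    Unlike softmax, low-scoring entries receive exactly zero weight."""
    zs = sorted(z, reverse=True)
    cssv, s = [], 0.0
    for v in zs:
        s += v
        cssv.append(s)
    # Largest k with 1 + k * z_(k) > sum of top-k entries.
    k = max(j for j in range(1, len(zs) + 1) if 1 + j * zs[j - 1] > cssv[j - 1])
    tau = (cssv[k - 1] - 1.0) / k
    return [max(v - tau, 0.0) for v in z]

def retrieve(patterns, query, beta=1.0):
    """One-step sparse retrieval: weight stored patterns by
    sparsemax of their (scaled) inner products with the query."""
    scores = [beta * sum(p * q for p, q in zip(pat, query)) for pat in patterns]
    probs = sparsemax(scores)
    dim = len(query)
    return [sum(w * pat[i] for w, pat in zip(probs, patterns)) for i in range(dim)]
```

With well-separated patterns and a moderate β, the sparse weights collapse onto the matching memory, so retrieval is exact in one step.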
Feature Programming for Multivariate Time Series Prediction
Reneau, Alex, Hu, Jerry Yao-Chieh, Xu, Chenwei, Li, Weijian, Gilani, Ammar, Liu, Han
We introduce the concept of programmable feature engineering for time series modeling and propose a feature programming framework. This newly debuted framework generates large amounts of predictive features for noisy multivariate time series while allowing users to incorporate their inductive bias with minimal effort. The key motivation of our framework is to view any multivariate time series as a cumulative sum of fine-grained trajectory increments, with each increment governed by a novel spin-gas dynamical Ising model. Our key motivation comes from a novel dynamical Ising-like model, the spin-gas Glauber dynamics, originated from a gas-like interaction that includes momentum and acceleration information. By using spin-gas Glauber dynamics as the fundamental model for time series generating processes at the smallest time scale, we explore the potential of treating time series as the path-sum of infinitesimal increments generated by a series of Markovian coin tosses following the spin-gas Glauber dynamics. From such a fine-grained perspective, a set of operators is motivated for
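The cumulative-sum view of a series lends itself to a short sketch, assuming simple first-difference increments (the spin-gas Glauber dynamics itself is beyond this illustration); the momentum/acceleration remark suggests second differences as a further derived feature:

```python
def increments(series):
    """Fine-grained view: decompose a series into trajectory increments."""
    return [b - a for a, b in zip(series, series[1:])]

def reconstruct(x0, incs):
    """Any series is the cumulative sum of its increments from x0."""
    out = [x0]
    for d in incs:
        out.append(out[-1] + d)
    return out

def acceleration(series):
    """Second differences: an acceleration-like feature over increments."""
    return increments(increments(series))
```

Decomposing and re-accumulating round-trips the original series exactly, which is what makes increment-level operators a lossless starting point for feature generation.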