AITopics

2502.17494

Country: North America > United States > Illinois (0.14)

Genre: Research Report (1.00)

Industry:

Education (0.94)
Marketing (0.70)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceJan-9-2024

Fine-Grained Embedding Dimension Optimization During Training for Recommender Systems

Luo, Qinyi, Wang, Penghan, Zhang, Wei, Lai, Fan, Mao, Jiachen, Wei, Xiaohan, Song, Jun, Tsai, Wei-Yu, Yang, Shuai, Hu, Yuxi, Qian, Xuehai

Huge embedding tables in modern Deep Learning Recommender Models (DLRM) require prohibitively large memory during training and inference. Aiming to reduce the memory footprint of training, this paper proposes FIne-grained In-Training Embedding Dimension optimization (FIITED). Given the observation that embedding vectors are not equally important, FIITED adjusts the dimension of each individual embedding vector continuously during training, assigning longer dimensions to more important embeddings while adapting to dynamic changes in data. A novel embedding storage system based on virtually-hashed physically-indexed hash tables is designed to efficiently implement the embedding dimension adjustment and effectively enable memory saving. Experiments on two industry models show that FIITED is able to reduce the size of embeddings by more than 65% while maintaining the trained model's quality, saving significantly more memory than a state-of-the-art in-training embedding pruning method. On public click-through rate prediction datasets, FIITED is able to prune up to 93.75%-99.75% embeddings without significant accuracy loss. Huge embedding tables in modern Deep Learning Recommendation Models (DLRM) reach terabytes in size (Lian et al., 2022). Training DLRMs usually requires model parallelism (Ivchenko et al., 2022; Sethi et al., 2023), but even with embedding tables distributed over multiple compute nodes, memory still proves a scarce resource (Lian et al., 2022). Reducing the memory cost of embedding tables is crucial to enable efficient model training and deployment of DLRM and allow for sustainable model development. The size of an embedding table is determined by the number of rows (i.e., hash size), the number of columns (i.e., embedding dimension), and the size of each value in the embedding.

artificial intelligence, deep learning, machine learning, (18 more...)

2401.04408

Country:

North America > United States > Michigan (0.28)
North America > United States > California (0.28)
North America > United States > Indiana > Tippecanoe County (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.53)

arXiv.org Artificial IntelligenceMay-31-2023

Provably Efficient Generalized Lagrangian Policy Optimization for Safe Multi-Agent Reinforcement Learning

Ding, Dongsheng, Wei, Xiaohan, Yang, Zhuoran, Wang, Zhaoran, Jovanović, Mihailo R.

We examine online safe multi-agent reinforcement learning using constrained Markov games in which agents compete by maximizing their expected total rewards under a constraint on expected total utilities. Our focus is confined to an episodic two-player zero-sum constrained Markov game with independent transition functions that are unknown to agents, adversarial reward functions, and stochastic utility functions. For such a Markov game, we employ an approach based on the occupancy measure to formulate it as an online constrained saddle-point problem with an explicit constraint. We extend the Lagrange multiplier method in constrained optimization to handle the constraint by creating a generalized Lagrangian with minimax decision primal variables and a dual variable. Next, we develop an upper confidence reinforcement learning algorithm to solve this Lagrangian problem while balancing exploration and exploitation. Our algorithm updates the minimax decision primal variables via online mirror descent and the dual variable via projected gradient step and we prove that it enjoys sublinear rate $ O((|X|+|Y|) L \sqrt{T(|A|+|B|)}))$ for both regret and constraint violation after playing $T$ episodes of the game. Here, $L$ is the horizon of each episode, $(|X|,|A|)$ and $(|Y|,|B|)$ are the state/action space sizes of the min-player and the max-player, respectively. To the best of our knowledge, we provide the first provably efficient online safe reinforcement learning algorithm in constrained Markov games.

constraint, machine learning, reinforcement learning, (14 more...)

2306.00212

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre: Research Report (0.40)

Industry: Energy > Oil & Gas > Upstream (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)

arXiv.org Artificial IntelligenceJul-25-2022

Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions

Qiu, Shuang, Wei, Xiaohan, Ye, Jieping, Wang, Zhaoran, Yang, Zhuoran

While single-agent policy optimization in a fixed environment has attracted a lot of research attention recently in the reinforcement learning community, much less is known theoretically when there are multiple agents playing in a potentially competitive environment. We take steps forward by proposing and analyzing new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures: factored independent transition and single-controller transition. For both scenarios, we prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario. The regret of each agent is measured against a potentially adversarial opponent who can choose a single best policy in hindsight after observing the full policy sequence. Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization in a non-stationary environment. When both players adopt the proposed algorithms, their overall optimality gap is $\widetilde{\mathcal{O}}(\sqrt{K})$.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

2207.12463

Genre: Research Report (0.49)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.66)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)

arXiv.org Machine LearningOct-17-2020

Fast Distributed Training of Deep Neural Networks: Dynamic Communication Thresholding for Model and Data Parallelism

Gupta, Vipul, Choudhary, Dhruv, Tang, Ping Tak Peter, Wei, Xiaohan, Wang, Xing, Huang, Yuzhen, Kejariwal, Arun, Ramchandran, Kannan, Mahoney, Michael W.

Data Parallelism (DP) and Model Parallelism (MP) are two common paradigms to enable large-scale distributed training of neural networks. Recent trends, such as the improved model performance of deeper and wider neural networks when trained with billions of data points, have prompted the use of hybrid parallelism---a paradigm that employs both DP and MP to scale further parallelization for machine learning. Hybrid training allows compute power to increase, but it runs up against the key bottleneck of communication overhead that hinders scalability. In this paper, we propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training. DCT filters the entities to be communicated across the network through a simple hard-thresholding function, allowing only the most relevant information to pass through. For communication efficient DP, DCT compresses the parameter gradients sent to the parameter server during model synchronization, while compensating for the introduced errors with known techniques. For communication efficient MP, DCT incorporates a novel technique to compress the activations and gradients sent across the network during the forward and backward propagation, respectively. This is done by identifying and updating only the most relevant neurons of the neural network for each training sample in the data. Under modest assumptions, we show that the convergence of training is maintained with DCT. We evaluate DCT on natural language processing and recommender system models. DCT reduces overall communication by 20x, improving end-to-end training time on industry scale models by 37%. Moreover, we observe an improvement in the trained model performance, as the induced sparsity is possibly acting as an implicit sparsity based regularization.

deep learning, dynamic communication thresholding, neural network, (14 more...)

2010.08899

Country: North America > United States > New York (0.14)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.84)

arXiv.org Machine LearningAug-23-2020

Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning

Qiu, Shuang, Yang, Zhuoran, Wei, Xiaohan, Ye, Jieping, Wang, Zhaoran

Temporal-Difference (TD) learning with nonlinear smooth function approximation for policy evaluation has achieved great success in modern reinforcement learning. It is shown that such a problem can be reformulated as a stochastic nonconvex-strongly-concave optimization problem, which is challenging as naive stochastic gradient descent-ascent algorithm suffers from slow convergence. Existing approaches for this problem are based on two-timescale or double-loop stochastic gradient algorithms, which may also require sampling large-batch data. However, in practice, a single-timescale single-loop stochastic algorithm is preferred due to its simplicity and also because its step-size is easier to tune. In this paper, we propose two single-timescale single-loop algorithms which require only one data point each step. Our first algorithm implements momentum updates on both primal and dual variables achieving an $O(\varepsilon^{-4})$ sample complexity, which shows the important role of momentum in obtaining a single-timescale algorithm. Our second algorithm improves upon the first one by applying variance reduction on top of momentum, which matches the best known $O(\varepsilon^{-3})$ sample complexity in existing works. Furthermore, our variance-reduction algorithm does not require a large-batch checkpoint. Moreover, our theoretical results for both algorithms are expressed in a tighter form of simultaneous primal and dual side convergence.

artificial intelligence, inequality, reinforcement learning, (16 more...)

2008.10103

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

arXiv.org Machine LearningJun-22-2020

Beyond $\mathcal{O}(\sqrt{T})$ Regret for Constrained Online Optimization: Gradual Variations and Mirror Prox

Qiu, Shuang, Wei, Xiaohan

We study constrained online convex optimization, where the constraints consist of a relatively simple constraint set (e.g. a Euclidean ball) and multiple functional constraints. Projections onto such decision sets are usually computationally challenging. So instead of enforcing all constraints over each slot, we allow decisions to violate these functional constraints but aim at achieving a low regret and a low cumulative constraint violation over a horizon of $T$ time slot. The best known bound for solving this problem is $\mathcal{O}(\sqrt{T})$ regret and $\mathcal{O}(1)$ constraint violation, whose algorithms and analysis are restricted to Euclidean spaces. In this paper, we propose a new online primal-dual mirror prox algorithm whose regret is measured via a total gradient variation $V_*(T)$ over a sequence of $T$ loss functions. Specifically, we show that the proposed algorithm can achieve an $\mathcal{O}(\sqrt{V_*(T)})$ regret and $\mathcal{O}(1)$ constraint violation simultaneously. Such a bound holds in general non-Euclidean spaces, is never worse than the previously known $\big( \mathcal{O}(\sqrt{T}), \mathcal{O}(1) \big)$ result, and can be much better on regret when the variation is small. Furthermore, our algorithm is computationally efficient in that only two mirror descent steps are required during each slot instead of solving a general Lagrangian minimization problem. Along the way, our bounds also improve upon those of previous attempts using mirror-prox-type algorithms solving this problem, which yield a relatively worse $\mathcal{O}(T^{2/3})$ regret and $\mathcal{O}(T^{2/3})$ constraint violation.

constraint-based reasoning, inequality, optimization problem, (15 more...)

2006.12455

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)

Neural Information Processing SystemsFeb-14-2020, 13:44:00 GMT

Solving Non-smooth Constrained Programs with Lower Complexity than \mathcal{O}(1/\varepsilon): A Primal-Dual Homotopy Smoothing Approach

Wei, Xiaohan, Yu, Hao, Ling, Qing, Neely, Michael

artificial intelligence, primal-dual homotopy, varepsilon, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.68)

arXiv.org Machine LearningAug-14-2019

Robust One-Bit Recovery via ReLU Generative Networks: Improved Statistical Rates and Global Landscape Analysis

Qiu, Shuang, Wei, Xiaohan, Yang, Zhuoran

We study the robust one-bit compressed sensing problem whose goal is to design an algorithm that faithfully recovers any sparse target vector $\theta_0\in\mathbb{R}^d$ uniformly $m$ quantized noisy measurements. Under the assumption that the measurements are sub-Gaussian random vectors, to recover any $k$-sparse $\theta_0$ ($k\ll d$) uniformly up to an error $\varepsilon$ with high probability, the best known computationally tractable algorithm requires $m\geq\tilde{\mathcal{O}}(k\log d/\varepsilon^4)$ measurements. In this paper, we consider a new framework for the one-bit sensing problem where the sparsity is implicitly enforced via mapping a low dimensional representation $x_0 \in \mathbb{R}^k$ through a known $n$-layer ReLU generative network $G:\mathbb{R}^k\rightarrow\mathbb{R}^d$. Such a framework poses low-dimensional priors on $\theta_0$ without a known basis. We propose to recover the target $G(x_0)$ via an unconstrained empirical risk minimization (ERM) problem under a much weaker sub-exponential measurement assumption. For such a problem, we establish a joint statistical and computational analysis. In particular, we prove that the ERM estimator in this new framework achieves an improved statistical rate of $m=\tilde{\mathcal{O}}(kn \log d /\varepsilon^2)$ recovering any $G(x_0)$ uniformly up to an error $\varepsilon$. Moreover, from the lens of computation, despite non-convexity, we prove that the objective of our ERM problem has no spurious stationary point, that is, any stationary point are equally good for recovering the true target up to scaling with a certain accuracy. Furthermore, our analysis also shed lights on the possibility of inverting a deep generative model under partial and quantized measurements, complementing the recent success of using deep generative models for inverse problems.

deep learning, neural network, null 2, (18 more...)

1908.05368

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.44)

Neural Information Processing SystemsDec-31-2018

Solving Non-smooth Constrained Programs with Lower Complexity than $\mathcal{O}(1/\varepsilon)$: A Primal-Dual Homotopy Smoothing Approach

Wei, Xiaohan, Yu, Hao, Ling, Qing, Neely, Michael

We propose a new primal-dual homotopy smoothing algorithm for a linearly constrained convex program, where neither the primal nor the dual function has to be smooth or strongly convex. The best known iteration complexity solving such a non-smooth problem is $\mathcal{O}(\varepsilon^{-1})$. In this paper, we show that by leveraging a local error bound condition on the dual function, the proposed algorithm can achieve a better primal convergence time of $\mathcal{O}\l(\varepsilon^{-2/(2+\beta)}\log_2(\varepsilon^{-1})\r)$, where $\beta\in(0,1]$ is a local error bound parameter. As an example application, we show that the distributed geometric median problem, which can be formulated as a constrained convex program, has its dual function non-smooth but satisfying the aforementioned local error bound condition with $\beta=1/2$, therefore enjoying a convergence time of $\mathcal{O}\l(\varepsilon^{-4/5}\log_2(\varepsilon^{-1})\r)$. This result improves upon the $\mathcal{O}(\varepsilon^{-1})$ convergence time bound achieved by existing distributed optimization algorithms. Simulation experiments also demonstrate the performance of our proposed algorithm.

algorithm, artificial intelligence, optimization problem, (15 more...)

Neural Information Processing Systems

Country: North America > United States > California > Los Angeles County (0.14)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Networks (0.93)