Data-Prep-Kit: getting your data ready for LLM application development
Wood, David, Lublinsky, Boris, Roytman, Alexy, Singh, Shivdeep, Adam, Constantin, Adebayo, Abdulhamid, An, Sungeun, Chang, Yuan Chi, Dang, Xuan-Hong, Desai, Nirmit, Dolfi, Michele, Emami-Gohari, Hajar, Eres, Revital, Goto, Takuya, Joshi, Dhiraj, Koyfman, Yan, Nassar, Mohammad, Patel, Hima, Selvam, Paramesvaran, Shah, Yousaf, Surendran, Saptha, Tsuzuku, Daiki, Zerfos, Petros, Daijavad, Shahrokh
Data preparation is the first and one of the most important steps in any Large Language Model (LLM) development. This paper introduces an easy-to-use, extensible, and scale-flexible open-source data preparation toolkit called Data Prep Kit (DPK). DPK is architected and designed to enable users to scale their data preparation to their needs. With DPK they can prepare data on a local machine or effortlessly scale to run on a cluster with thousands of CPU cores. DPK comes with a highly scalable yet extensible set of modules that transform natural language and code data. If users need additional transforms, these can be easily developed using DPK's extensive support for transform creation. These modules can be used independently or pipelined to perform a series of operations. In this paper, we describe the DPK architecture and show its performance from a small scale to a very large number of CPUs. The modules from DPK have been used for the preparation of Granite Models [1] [2]. We believe DPK is a valuable contribution to the AI community to easily prepare data to enhance the performance of their LLMs or to fine-tune models with Retrieval-Augmented Generation (RAG).
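The abstract's "used independently or pipelined" idea can be sketched as a chain of document transforms, where each transform may rewrite or filter a document. This is a minimal illustrative sketch only; the names and signatures here are hypothetical and are not DPK's actual API.

```python
# Hypothetical sketch of pipelined data-prep transforms (not DPK's real API).
from typing import Callable, List, Optional

Doc = dict  # in this sketch, a document is just a dict of fields


def normalize_whitespace(doc: Doc) -> Optional[Doc]:
    # Rewrite step: collapse runs of whitespace in the text field.
    doc = dict(doc)
    doc["text"] = " ".join(doc["text"].split())
    return doc


def drop_short(min_chars: int) -> Callable[[Doc], Optional[Doc]]:
    # Filter step: returning None removes the document from the stream.
    def transform(doc: Doc) -> Optional[Doc]:
        return doc if len(doc["text"]) >= min_chars else None
    return transform


def run_pipeline(docs: List[Doc],
                 transforms: List[Callable[[Doc], Optional[Doc]]]) -> List[Doc]:
    # Apply each transform in order; a None result short-circuits the chain.
    out = []
    for doc in docs:
        for t in transforms:
            result = t(doc)
            if result is None:
                break
            doc = result
        else:
            out.append(doc)
    return out


corpus = [{"text": "  hello   world  "}, {"text": "hi"}]
cleaned = run_pipeline(corpus, [normalize_whitespace, drop_short(5)])
```

The same chaining works whether the loop runs on one machine or is distributed across a cluster, which is the scale-flexibility the abstract describes.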
Minimal Communication-Cost Statistical Learning
Sefidgaran, Milad, Zaidi, Abdellatif, Krasnowski, Piotr
A client device that has access to $n$ training data samples needs to obtain a statistical hypothesis or model $W$ and then send it to a remote server. The client and the server devices share some common randomness sequence as well as a prior on the hypothesis space. In this problem, a suitable hypothesis or model $W$ should meet two distinct design criteria simultaneously: (i) small (population) risk during the inference phase and (ii) small 'complexity' for it to be conveyed to the server with minimum communication cost. In this paper, we propose a joint training and source coding scheme with provable in-expectation guarantees, where the expectation is over the encoder's output message. Specifically, we show that by imposing a constraint on a suitable Kullback-Leibler divergence between the conditional distribution induced by a compressed learning model $\widehat{W}$ given $W$ and the prior, one guarantees simultaneously small average empirical risk (aka training loss), small average generalization error, and small average communication cost. We also consider a one-shot scenario in which the guarantees on the empirical risk and generalization error are obtained for every encoder's output message.
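The "empirical risk plus KL-to-prior constraint" structure can be illustrated with a scalar toy version: a 1-D Gaussian posterior penalized toward a Gaussian prior via the closed-form KL divergence. This is only a sketch of the flavor of such an objective, not the paper's actual scheme; all names and the penalty form are assumptions.

```python
# Toy illustration (assumed form, not the paper's scheme): an objective that
# trades empirical risk against a KL divergence to a shared prior, which also
# controls how cheaply the model can be communicated.
import math


def kl_gaussians(m1: float, s1: float, m2: float, s2: float) -> float:
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) ) for 1-D Gaussians."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5


def objective(empirical_risk: float,
              m_post: float, s_post: float,
              m_prior: float, s_prior: float,
              lam: float) -> float:
    # Training loss plus a KL penalty; a small KL to the shared prior keeps
    # both the generalization gap and the encoding cost small.
    return empirical_risk + lam * kl_gaussians(m_post, s_post, m_prior, s_prior)
```

When the posterior equals the prior, the penalty vanishes and the objective reduces to the empirical risk alone; the penalty is always non-negative.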
Packing Privacy Budget Efficiently
Tholoniat, Pierre, Kostopoulou, Kelly, Chowdhury, Mosharaf, Cidon, Asaf, Geambasu, Roxana, Lécuyer, Mathias, Yang, Junfeng
Machine learning (ML) models can leak information about users, and differential privacy (DP) provides a rigorous way to bound that leakage under a given budget. This DP budget can be regarded as a new type of compute resource in workloads of multiple ML models training on user data. Once it is used, the DP budget is forever consumed. Therefore, it is crucial to allocate it most efficiently to train as many models as possible. This paper presents a privacy-budget scheduler that optimizes for efficiency. We formulate privacy scheduling as a new type of multidimensional knapsack problem, called privacy knapsack, which maximizes DP budget efficiency. We show that privacy knapsack is NP-hard, hence practical algorithms are necessarily approximate. We develop an approximation algorithm for privacy knapsack, DPK, and evaluate it on microbenchmarks and on a new, synthetic private-ML workload we developed from the Alibaba ML cluster trace. We show that DPK: (1) often approaches the efficiency-optimal schedule, (2) consistently schedules more tasks compared to a state-of-the-art privacy scheduling algorithm that focuses on fairness (1.3-1.7x in Alibaba, 1.0-2.6x in microbenchmarks), but (3) sacrifices some level of fairness for efficiency. Therefore, using DPK, DP ML operators should be able to train more models on the same amount of user data while offering the same privacy guarantee to their users.
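The multidimensional-knapsack framing can be made concrete with a generic greedy heuristic: each task demands some privacy budget (epsilon) from one or more data blocks, and tasks are admitted in order of profit per unit of budget demanded. This is a sketch of the problem's flavor under assumed names, not the paper's DPK approximation algorithm.

```python
# Generic greedy sketch of privacy-budget scheduling as a multidimensional
# knapsack (illustrative only; NOT the paper's DPK algorithm).
from typing import Dict, List, Tuple

# A task is (name, per-block epsilon demand, profit).
Task = Tuple[str, Dict[str, float], float]


def greedy_schedule(tasks: List[Task], budget: Dict[str, float]) -> List[str]:
    remaining = dict(budget)
    # Rank tasks by profit per unit of total epsilon demanded.
    ranked = sorted(tasks, key=lambda t: t[2] / sum(t[1].values()), reverse=True)
    scheduled = []
    for name, demand, _profit in ranked:
        # Admit a task only if every data block it touches still has budget;
        # consumed budget is gone forever, as the abstract notes.
        if all(remaining.get(block, 0.0) >= eps for block, eps in demand.items()):
            for block, eps in demand.items():
                remaining[block] -= eps
            scheduled.append(name)
    return scheduled


order = greedy_schedule(
    [("A", {"b1": 0.8}, 4.0),
     ("B", {"b1": 0.5, "b2": 0.5}, 3.0),
     ("C", {"b2": 0.6}, 1.2)],
    {"b1": 1.0, "b2": 1.0},
)
```

In the example, task A consumes most of block b1, so B no longer fits, and the greedy order admits A and C; such greedy choices are exactly where an approximation can trade fairness for efficiency.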
Koopman-theoretic Approach for Identification of Exogenous Anomalies in Nonstationary Time-series Data
Mallen, Alex, Keller, Christoph A., Kutz, J. Nathan
Time-series analysis is used for extracting meaningful statistics and characteristics of temporal sequences of data [1], and is among the most ubiquitous mathematical methods. Indeed, time-series are universal in signal processing and pattern recognition applications, dominating the characterization of econometrics and finance along with almost any scientific and engineering application. Time-series methods can be broadly divided into time-domain and frequency-domain methods, the former of which uses a variety of statistical techniques to characterize a sequence, and the latter of which uses spectral methods. Traditional statistical methods include the time-domain methods, such as the family of autoregressive (AR) models and their many variants, including ARMA (AR moving average), ARIMA (AR integrated moving average), SARIMA (seasonal ARIMA), etc. [1]. Such models use a diversity of optimization techniques to estimate the parameters of a linear model with its history dependence. Traditional frequency-domain methods use the properties of short-time Fourier transforms [9] and/or wavelet transforms [10] in order to characterize the signal in a joint time-frequency representation. More recently, there have been efforts to model time-series data as arising from a dynamical system.
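The simplest member of the AR family described above, an AR(1) model x_t = phi * x_{t-1} + noise, can be fit by ordinary least squares in a few lines; the closed-form estimator is just a ratio of two sums. A minimal self-contained sketch:

```python
# Fitting an AR(1) model x_t = phi * x_{t-1} + noise by least squares,
# using only the standard library. The closed-form estimate is
#   phi_hat = sum_t x_t * x_{t-1} / sum_t x_{t-1}^2.
import random

random.seed(0)

# Simulate an AR(1) process with a known coefficient.
phi_true = 0.7
x = [0.0]
for _ in range(5000):
    x.append(phi_true * x[-1] + random.gauss(0.0, 1.0))

# Least-squares estimate of the autoregressive coefficient.
num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
phi_hat = num / den
```

With 5000 samples the estimate lands close to the true coefficient of 0.7; the higher-order variants (ARMA, ARIMA, SARIMA) extend this same history-dependent linear structure with moving-average, differencing, and seasonal terms.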