Data-Prep-Kit: getting your data ready for LLM application development
Wood, David, Lublinsky, Boris, Roytman, Alexy, Singh, Shivdeep, Adam, Constantin, Adebayo, Abdulhamid, An, Sungeun, Chang, Yuan Chi, Dang, Xuan-Hong, Desai, Nirmit, Dolfi, Michele, Emami-Gohari, Hajar, Eres, Revital, Goto, Takuya, Joshi, Dhiraj, Koyfman, Yan, Nassar, Mohammad, Patel, Hima, Selvam, Paramesvaran, Shah, Yousaf, Surendran, Saptha, Tsuzuku, Daiki, Zerfos, Petros, Daijavad, Shahrokh
Data preparation is the first and one of the most important steps in any Large Language Model (LLM) development. This paper introduces an easy-to-use, extensible, and scale-flexible open-source data preparation toolkit called Data Prep Kit (DPK). DPK is architected and designed to enable users to scale their data preparation to their needs. With DPK they can prepare data on a local machine or effortlessly scale to run on a cluster with thousands of CPU cores. DPK comes with a highly scalable yet extensible set of modules that transform natural language and code data. If users need additional transforms, these can be easily developed using DPK's extensive support for transform creation. These modules can be used independently or pipelined to perform a series of operations. In this paper, we describe the DPK architecture and show its performance from a small scale to a very large number of CPUs. The modules from DPK have been used for the preparation of Granite Models [1] [2]. We believe DPK is a valuable contribution to the AI community to easily prepare data to enhance the performance of their LLMs or to fine-tune models with Retrieval-Augmented Generation (RAG).
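The abstract's "used independently or pipelined" idea can be sketched as a chain of document transforms, where each transform may rewrite or filter a document. This is a minimal illustrative sketch only; the names and signatures here are hypothetical and are not DPK's actual API.

```python
# Hypothetical sketch of pipelined data-prep transforms (not DPK's real API).
from typing import Callable, List, Optional

Doc = dict  # in this sketch, a document is just a dict of fields


def normalize_whitespace(doc: Doc) -> Optional[Doc]:
    # Rewrite step: collapse runs of whitespace in the text field.
    doc = dict(doc)
    doc["text"] = " ".join(doc["text"].split())
    return doc


def drop_short(min_chars: int) -> Callable[[Doc], Optional[Doc]]:
    # Filter step: returning None removes the document from the stream.
    def transform(doc: Doc) -> Optional[Doc]:
        return doc if len(doc["text"]) >= min_chars else None
    return transform


def run_pipeline(docs: List[Doc],
                 transforms: List[Callable[[Doc], Optional[Doc]]]) -> List[Doc]:
    # Apply each transform in order; a None result short-circuits the chain.
    out = []
    for doc in docs:
        for t in transforms:
            result = t(doc)
            if result is None:
                break
            doc = result
        else:
            out.append(doc)
    return out


corpus = [{"text": "  hello   world  "}, {"text": "hi"}]
cleaned = run_pipeline(corpus, [normalize_whitespace, drop_short(5)])
```

The same chaining works whether the loop runs on one machine or is distributed across a cluster, which is the scale-flexibility the abstract describes.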
Minimal Communication-Cost Statistical Learning
Sefidgaran, Milad, Zaidi, Abdellatif, Krasnowski, Piotr
A client device that has access to $n$ training data samples needs to obtain a statistical hypothesis or model $W$ and then send it to a remote server. The client and the server devices share some common randomness sequence as well as a prior on the hypothesis space. In this problem, a suitable hypothesis or model $W$ should meet two distinct design criteria simultaneously: (i) small (population) risk during the inference phase and (ii) small 'complexity' for it to be conveyed to the server with minimum communication cost. In this paper, we propose a joint training and source coding scheme with provable in-expectation guarantees, where the expectation is over the encoder's output message. Specifically, we show that by imposing a constraint on a suitable Kullback-Leibler divergence between the conditional distribution induced by a compressed learning model $\widehat{W}$ given $W$ and the prior, one guarantees simultaneously small average empirical risk (aka training loss), small average generalization error, and small average communication cost. We also consider a one-shot scenario in which the guarantees on the empirical risk and generalization error are obtained for every encoder's output message.
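The "empirical risk plus KL-to-prior constraint" structure can be illustrated with a scalar toy version: a 1-D Gaussian posterior penalized toward a Gaussian prior via the closed-form KL divergence. This is only a sketch of the flavor of such an objective, not the paper's actual scheme; all names and the penalty form are assumptions.

```python
# Toy illustration (assumed form, not the paper's scheme): an objective that
# trades empirical risk against a KL divergence to a shared prior, which also
# controls how cheaply the model can be communicated.
import math


def kl_gaussians(m1: float, s1: float, m2: float, s2: float) -> float:
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) ) for 1-D Gaussians."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5


def objective(empirical_risk: float,
              m_post: float, s_post: float,
              m_prior: float, s_prior: float,
              lam: float) -> float:
    # Training loss plus a KL penalty; a small KL to the shared prior keeps
    # both the generalization gap and the encoding cost small.
    return empirical_risk + lam * kl_gaussians(m_post, s_post, m_prior, s_prior)
```

When the posterior equals the prior, the penalty vanishes and the objective reduces to the empirical risk alone; the penalty is always non-negative.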
Packing Privacy Budget Efficiently
Tholoniat, Pierre, Kostopoulou, Kelly, Chowdhury, Mosharaf, Cidon, Asaf, Geambasu, Roxana, Lécuyer, Mathias, Yang, Junfeng
Machine learning (ML) models can leak information about users, and differential privacy (DP) provides a rigorous way to bound that leakage under a given budget. This DP budget can be regarded as a new type of compute resource in workloads of multiple ML models training on user data. Once it is used, the DP budget is forever consumed. Therefore, it is crucial to allocate it most efficiently to train as many models as possible. This paper presents a privacy-budget scheduler that optimizes for efficiency. We formulate privacy scheduling as a new type of multidimensional knapsack problem, called privacy knapsack, which maximizes DP budget efficiency. We show that privacy knapsack is NP-hard, hence practical algorithms are necessarily approximate. We develop an approximation algorithm for privacy knapsack, DPK, and evaluate it on microbenchmarks and on a new, synthetic private-ML workload we developed from the Alibaba ML cluster trace. We show that DPK: (1) often approaches the efficiency-optimal schedule, (2) consistently schedules more tasks compared to a state-of-the-art privacy scheduling algorithm that focuses on fairness (1.3-1.7x in Alibaba, 1.0-2.6x in microbenchmarks), but (3) sacrifices some level of fairness for efficiency. Therefore, using DPK, DP ML operators should be able to train more models on the same amount of user data while offering the same privacy guarantee to their users.
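The multidimensional-knapsack framing can be made concrete with a generic greedy heuristic: each task demands some privacy budget (epsilon) from one or more data blocks, and tasks are admitted in order of profit per unit of budget demanded. This is a sketch of the problem's flavor under assumed names, not the paper's DPK approximation algorithm.

```python
# Generic greedy sketch of privacy-budget scheduling as a multidimensional
# knapsack (illustrative only; NOT the paper's DPK algorithm).
from typing import Dict, List, Tuple

# A task is (name, per-block epsilon demand, profit).
Task = Tuple[str, Dict[str, float], float]


def greedy_schedule(tasks: List[Task], budget: Dict[str, float]) -> List[str]:
    remaining = dict(budget)
    # Rank tasks by profit per unit of total epsilon demanded.
    ranked = sorted(tasks, key=lambda t: t[2] / sum(t[1].values()), reverse=True)
    scheduled = []
    for name, demand, _profit in ranked:
        # Admit a task only if every data block it touches still has budget;
        # consumed budget is gone forever, as the abstract notes.
        if all(remaining.get(block, 0.0) >= eps for block, eps in demand.items()):
            for block, eps in demand.items():
                remaining[block] -= eps
            scheduled.append(name)
    return scheduled


order = greedy_schedule(
    [("A", {"b1": 0.8}, 4.0),
     ("B", {"b1": 0.5, "b2": 0.5}, 3.0),
     ("C", {"b2": 0.6}, 1.2)],
    {"b1": 1.0, "b2": 1.0},
)
```

In the example, task A consumes most of block b1, so B no longer fits, and the greedy order admits A and C; such greedy choices are exactly where an approximation can trade fairness for efficiency.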
Koopman-theoretic Approach for Identification of Exogenous Anomalies in Nonstationary Time-series Data
Mallen, Alex, Keller, Christoph A., Kutz, J. Nathan
Time-series analysis is used for extracting meaningful statistics and characteristics of temporal sequences of data [1], and is among the most ubiquitous mathematical methods. Indeed, time-series are universal in signal processing and pattern recognition applications, dominating the characterization of econometrics and finance along with almost any scientific and engineering application. Time-series methods can be broadly divided into time-domain and frequency-domain methods, the former of which uses a variety of statistical techniques to characterize a sequence, and the latter of which uses spectral methods. Traditional statistical methods include the time-domain methods, such as the family of autoregressive (AR) models and their many variants, including ARMA (AR moving average), ARIMA (AR integrated moving average), SARIMA (seasonal ARIMA), etc. [1]. Such models use a diversity of optimization techniques to estimate the parameters of a linear model with its history dependence. Traditional frequency-domain methods use the properties of short-time Fourier transforms [9] and/or wavelet transforms [10] in order to characterize the signal in a joint time-frequency representation. More recently, there have been efforts to model time-series data as arising from a dynamical system.
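The simplest member of the AR family described above, an AR(1) model x_t = phi * x_{t-1} + noise, can be fit by ordinary least squares in a few lines; the closed-form estimator is just a ratio of two sums. A minimal self-contained sketch:

```python
# Fitting an AR(1) model x_t = phi * x_{t-1} + noise by least squares,
# using only the standard library. The closed-form estimate is
#   phi_hat = sum_t x_t * x_{t-1} / sum_t x_{t-1}^2.
import random

random.seed(0)

# Simulate an AR(1) process with a known coefficient.
phi_true = 0.7
x = [0.0]
for _ in range(5000):
    x.append(phi_true * x[-1] + random.gauss(0.0, 1.0))

# Least-squares estimate of the autoregressive coefficient.
num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
phi_hat = num / den
```

With 5000 samples the estimate lands close to the true coefficient of 0.7; the higher-order variants (ARMA, ARIMA, SARIMA) extend this same history-dependent linear structure with moving-average, differencing, and seasonal terms.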