Dang, Xuan-Hong
GneissWeb: Preparing High Quality Data for LLMs at Scale
Gohari, Hajar Emami, Kadhe, Swanand Ravindra, Shah, Syed Yousaf, Adam, Constantin, Adebayo, Abdulhamid, Adusumilli, Praneet, Ahmed, Farhan, Angel, Nathalie Baracaldo, Borse, Santosh, Chang, Yuan-Chi, Dang, Xuan-Hong, Desai, Nirmit, Eres, Revital, Iwamoto, Ran, Karve, Alexei, Koyfman, Yan, Lee, Wei-Han, Liu, Changchang, Lublinsky, Boris, Ohko, Takuyo, Pesce, Pablo, Touma, Maroun, Wang, Shiqiang, Witherspoon, Shalisha, Woisetschlager, Herbert, Wood, David, Wu, Kun-Lung, Yoshida, Issei, Zawad, Syed, Zerfos, Petros, Zhou, Yi, Bhattacharjee, Bishwaranjan
Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM's ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. The GneissWeb recipe used to produce the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained on the GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of the average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained on GneissWeb still achieve a 1.75 percentage point advantage over those trained on FineWeb-V1.1.0.
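The exact filters and thresholds of the GneissWeb recipe are described in the paper; purely as a rough illustration of the "ensemble of quality filters" idea, the sketch below (with made-up toy filters and thresholds, not those used for GneissWeb) keeps a document only when a minimum number of heuristic filters accept it:

```python
# Toy ensemble of quality filters (illustrative only; not the GneissWeb filters).
from typing import Callable, List

def alpha_ratio_filter(doc: str, min_ratio: float = 0.6) -> bool:
    """Keep documents whose share of alphabetic characters is high enough."""
    return bool(doc) and sum(c.isalpha() for c in doc) / len(doc) >= min_ratio

def length_filter(doc: str, min_words: int = 50) -> bool:
    """Keep documents with a minimum number of whitespace-separated words."""
    return len(doc.split()) >= min_words

def repetition_filter(doc: str, max_dup_line_frac: float = 0.3) -> bool:
    """Reject documents dominated by duplicated lines."""
    lines = [l.strip() for l in doc.splitlines() if l.strip()]
    return bool(lines) and (len(lines) - len(set(lines))) / len(lines) <= max_dup_line_frac

def ensemble_keep(doc: str, filters: List[Callable[[str], bool]], min_votes: int = 2) -> bool:
    """Keep a document if at least `min_votes` filters accept it."""
    return sum(f(doc) for f in filters) >= min_votes

filters = [alpha_ratio_filter, length_filter, repetition_filter]
corpus = ["a long, well-formed paragraph " * 20, "### ###\n### ###\n### ###"]
kept = [doc for doc in corpus if ensemble_keep(doc, filters)]
```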
Data-Prep-Kit: getting your data ready for LLM application development
Wood, David, Lublinsky, Boris, Roytman, Alexy, Singh, Shivdeep, Adam, Constantin, Adebayo, Abdulhamid, An, Sungeun, Chang, Yuan Chi, Dang, Xuan-Hong, Desai, Nirmit, Dolfi, Michele, Emami-Gohari, Hajar, Eres, Revital, Goto, Takuya, Joshi, Dhiraj, Koyfman, Yan, Nassar, Mohammad, Patel, Hima, Selvam, Paramesvaran, Shah, Yousaf, Surendran, Saptha, Tsuzuku, Daiki, Zerfos, Petros, Daijavad, Shahrokh
Data preparation is the first and one of the most important steps in any Large Language Model (LLM) development effort. This paper introduces an easy-to-use, extensible, and scale-flexible open-source data preparation toolkit called Data Prep Kit (DPK). DPK is architected and designed to enable users to scale their data preparation to their needs. With DPK, users can prepare data on a local machine or effortlessly scale to run on a cluster with thousands of CPU cores. DPK comes with a highly scalable yet extensible set of modules that transform natural language and code data. If users need additional transforms, these can be easily developed using DPK's extensive support for transform creation. The modules can be used independently or pipelined to perform a series of operations. In this paper, we describe the DPK architecture and show its performance from a small scale up to a very large number of CPUs. DPK modules have been used for the preparation of the Granite models [1] [2]. We believe DPK is a valuable contribution to the AI community, making it easy to prepare data to enhance the performance of LLMs or to fine-tune models with Retrieval-Augmented Generation (RAG).
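As a schematic of what "independently usable or pipelined transforms" can look like, the sketch below defines a minimal transform interface and chains two toy stages; it is not the actual DPK API (see the Data Prep Kit repository for the real interfaces), and all class and function names here are illustrative assumptions:

```python
# Illustrative transform pipeline (NOT the actual Data Prep Kit API).
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Doc:
    doc_id: str
    text: str

class Transform:
    """A stage that maps a batch of documents to a (possibly smaller) batch."""
    def apply(self, docs: Iterable[Doc]) -> List[Doc]:
        raise NotImplementedError

class Normalize(Transform):
    def apply(self, docs: Iterable[Doc]) -> List[Doc]:
        return [Doc(d.doc_id, " ".join(d.text.split())) for d in docs]

class DropShort(Transform):
    def __init__(self, min_chars: int = 200) -> None:
        self.min_chars = min_chars
    def apply(self, docs: Iterable[Doc]) -> List[Doc]:
        return [d for d in docs if len(d.text) >= self.min_chars]

def run_pipeline(docs: List[Doc], transforms: List[Transform]) -> List[Doc]:
    # Each stage could run locally or be dispatched to a cluster of workers.
    for t in transforms:
        docs = t.apply(docs)
    return docs

cleaned = run_pipeline([Doc("1", "  some   raw   text  " * 40)], [Normalize(), DropShort()])
```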
Scaling Granite Code Models to 128K Context
Stallone, Matt, Saxena, Vaibhav, Karlinsky, Leonid, McGinn, Bridget, Bula, Tim, Mishra, Mayank, Soria, Adriana Meza, Zhang, Gaoyuan, Prasad, Aditya, Shen, Yikang, Surendran, Saptha, Guttula, Shanmukha, Patel, Hima, Selvam, Parameswaran, Dang, Xuan-Hong, Koyfman, Yan, Sood, Atin, Feris, Rogerio, Desai, Nirmit, Cox, David D., Puri, Ruchir, Panda, Rameswar
This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling the context length of the Granite 3B/8B code models from 2K/4K to 128K consists of lightweight continual pretraining that gradually increases the RoPE base frequency, combined with repository-level file packing and length-upsampled long-context data. We also release instruction-tuned models with long-context support, derived by further finetuning the long-context base models on a mix of permissively licensed short and long-context instruction-response pairs. Compared to the original short-context Granite code models, our long-context models achieve significant improvements on long-context tasks without any noticeable performance degradation on regular code completion benchmarks (e.g., HumanEval). We release all our long-context Granite code models under an Apache 2.0 license for both research and commercial use.
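To illustrate the role of the RoPE base frequency in context extension, the sketch below computes rotary-embedding angles for a small base and for a much larger one; the specific base-frequency schedule, model dimensions, and training procedure used for the Granite models are not reproduced here:

```python
# RoPE frequency sketch: a larger base theta slows per-dimension rotation,
# which is the knob behind extending the usable context window.
import numpy as np

def rope_inv_freq(head_dim: int, base: float) -> np.ndarray:
    """Inverse frequencies 1 / base^(2i/d) for the even dimensions of one head."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

def rope_angles(positions: np.ndarray, head_dim: int, base: float) -> np.ndarray:
    """Rotation angles, shape (num_positions, head_dim // 2), applied to query/key pairs."""
    return np.outer(positions, rope_inv_freq(head_dim, base))

short_ctx = rope_angles(np.arange(4096), head_dim=128, base=10_000.0)        # original window
long_ctx = rope_angles(np.arange(131_072), head_dim=128, base=10_000_000.0)  # larger base for 128K
```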
Granite Code Models: A Family of Open Foundation Models for Code Intelligence
Mishra, Mayank, Stallone, Matt, Zhang, Gaoyuan, Shen, Yikang, Prasad, Aditya, Soria, Adriana Meza, Merler, Michele, Selvam, Parameswaran, Surendran, Saptha, Singh, Shivdeep, Sethi, Manish, Dang, Xuan-Hong, Li, Pengyuan, Wu, Kun-Lung, Zawad, Syed, Coleman, Andrew, White, Matthew, Lewis, Mark, Pavuluri, Raju, Koyfman, Yan, Lublinsky, Boris, de Bayser, Maximilien, Abdelaziz, Ibrahim, Basu, Kinjal, Agarwal, Mayank, Zhou, Yi, Johnson, Chris, Goyal, Aanchal, Patel, Hima, Shah, Yousaf, Zerfos, Petros, Ludwig, Heiko, Munawar, Asim, Crouse, Maxwell, Kapanipathi, Pavan, Salaria, Shweta, Calio, Bob, Wen, Sophia, Seelam, Seetharami, Belgodere, Brian, Fonseca, Carlos, Singhee, Amith, Desai, Nirmit, Cox, David D., Puri, Ruchir, Panda, Rameswar
Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code model family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device memory-constrained use cases. Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reach state-of-the-art performance among available open-source code LLMs. The Granite Code model family was optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g., code generation, fixing, and explanation), making it a versatile, all-around code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use.
Modality-aware Transformer for Time series Forecasting
Emami, Hajar, Dang, Xuan-Hong, Shah, Yousaf, Zerfos, Petros
Time series forecasting presents a significant challenge, particularly when its accuracy relies on external data sources rather than solely on historical values. This issue is prevalent in the financial sector, where the future behavior of time series is often intricately linked to information derived from various textual reports and a multitude of economic indicators. In practice, the key challenge lies in constructing a reliable time series forecasting model capable of harnessing data from diverse sources and extracting valuable insights to predict the target time series accurately. In this work, we tackle this challenging problem and introduce a novel multimodal transformer-based model named the Modality-aware Transformer. Our model excels at exploiting the power of both categorical text and numerical time series to forecast the target time series effectively while providing insights through its neural attention mechanism. To achieve this, we develop feature-level attention layers that encourage the model to focus on the most relevant features within each data modality. By incorporating the proposed feature-level attention, we develop novel intra-modal multi-head attention (MHA), inter-modal MHA, and modality-target MHA in such a way that both feature and temporal attention are incorporated into the MHAs. This enables the MHAs to generate temporal attention that accounts for modality and feature importance, leading to more informative embeddings. The proposed modality-aware structure enables the model to effectively exploit information within each modality as well as foster cross-modal understanding. Our extensive experiments on financial datasets demonstrate that the Modality-aware Transformer outperforms existing methods, offering a novel and practical solution to the complex challenges of multi-modality time series forecasting.
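As a loose, simplified illustration of feature-level attention feeding a temporal attention layer (not the paper's intra-/inter-modal or modality-target MHA design, and with made-up dimensions), one might sketch the idea in PyTorch as follows:

```python
# Toy feature-level attention feeding a standard temporal attention layer (PyTorch).
import torch
import torch.nn as nn

class FeatureLevelAttention(nn.Module):
    """Learn per-feature weights and embed the re-weighted features."""
    def __init__(self, num_features: int, d_model: int) -> None:
        super().__init__()
        self.score = nn.Linear(num_features, num_features)  # per-feature attention scores
        self.proj = nn.Linear(num_features, d_model)         # embed the weighted features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, num_features)
        weights = torch.softmax(self.score(x), dim=-1)        # emphasize relevant features
        return self.proj(x * weights)                         # (batch, time, d_model)

numeric = torch.randn(8, 96, 12)                              # e.g., 12 numeric indicators
temporal_mha = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
out = temporal_mha(FeatureLevelAttention(12, 64)(numeric))    # temporal attention on weighted features
```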
AutoAI-TS: AutoAI for Time Series Forecasting
Shah, Syed Yousaf, Patel, Dhaval, Vu, Long, Dang, Xuan-Hong, Chen, Bei, Kirchner, Peter, Samulowitz, Horst, Wood, David, Bramble, Gregory, Gifford, Wesley M., Ganapavarapu, Giridhar, Vaculin, Roman, Zerfos, Petros
A large number of time series forecasting models, including traditional statistical models, machine learning models, and more recently deep learning models, have been proposed in the literature. However, choosing the right model, along with good parameter values, that performs well on a given dataset is still challenging. Automatically providing a good set of models to users for a given dataset saves both time and effort compared with trial-and-error approaches over the wide variety of available models and their parameter optimization. We present AutoAI for Time Series Forecasting (AutoAI-TS), which provides users with a zero-configuration (zero-conf) system to efficiently train, optimize, and choose the best forecasting model among various classes of models for a given dataset. With its flexible zero-conf design, AutoAI-TS automatically performs all the data preparation, model creation, parameter optimization, training, and model selection for users and provides a trained model that is ready to use. For given data, AutoAI-TS utilizes a wide variety of models, including classical statistical models, Machine Learning (ML) models, statistical-ML hybrid models, and deep learning models, along with various transformations to create forecasting pipelines. It then evaluates and ranks pipelines using the proposed T-Daub mechanism to choose the best pipeline. The paper describes in detail all the technical aspects of AutoAI-TS along with extensive benchmarking on a variety of real-world datasets for various use cases. Benchmark results show that AutoAI-TS, with no manual configuration from the user, automatically trains and selects pipelines that on average outperform existing state-of-the-art time series forecasting toolkits.
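As a rough schematic of incremental data-allocation ranking in the spirit of T-Daub (the exact allocation schedule, scoring, and pipeline set from the paper are not reproduced, and the forecasters below are toy placeholders), candidate pipelines can be scored on growing slices of the series and pruned before full training:

```python
# Toy incremental-allocation ranking of forecasting pipelines (illustrative only).
import numpy as np

def naive_forecast(history):           # predict the last observed value
    return history[-1]

def mean_forecast(history):            # predict the historical mean
    return float(np.mean(history))

def backtest(pipeline, series):
    """One-step-ahead absolute error over the last few points of the series."""
    errs = [abs(pipeline(series[:t]) - series[t]) for t in range(len(series) - 5, len(series))]
    return float(np.mean(errs))

def rank_pipelines(pipelines, series, allocations=(64, 256), keep_top=1):
    candidates = list(pipelines)
    for n in allocations:                              # grow the data allocation
        tail = series[-min(n, len(series)):]
        candidates.sort(key=lambda p: backtest(p, tail))
        candidates = candidates[:max(keep_top, len(candidates) // 2)]
    return candidates                                   # finalists for full training

series = np.sin(np.arange(500) / 10.0) + 0.05 * np.random.randn(500)
best = rank_pipelines([naive_forecast, mean_forecast], series)
```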
seq2graph: Discovering Dynamic Dependencies from Multivariate Time Series with Multi-level Attention
Dang, Xuan-Hong, Shah, Syed Yousaf, Zerfos, Petros
Discovering temporal lagged dependencies and inter-dependencies in multivariate time series data is an important task. However, in many real-world applications, such as commercial cloud management, manufacturing predictive maintenance, and portfolio performance analysis, such dependencies can be non-linear and time-variant, which makes it more challenging to extract them through traditional methods such as Granger causality or clustering. In this work, we present a novel deep learning model that uses multiple layers of customized gated recurrent units (GRUs) for discovering both time-lagged behaviors and inter-time-series dependencies in the form of directed weighted graphs. We introduce a key component, a dual-purpose recurrent neural network, that decodes information in the temporal domain to discover lagged dependencies within each time series, and encodes them into a set of vectors which, collected from all component time series, form the informative inputs for discovering inter-dependencies. Though the discovery of the two types of dependencies is separated at different hierarchical levels, they are tightly connected and jointly trained in an end-to-end manner. With this joint training, learning one type of dependency immediately impacts learning the other, leading to overall accurate dependency discovery. We empirically test our model on synthetic time series data in which the exact form of the (non-linear) dependencies is known. We also evaluate its performance on two real-world applications: (i) performance monitoring data from a commercial cloud provider, which exhibits highly dynamic, non-linear, and volatile behavior, and (ii) sensor data from a manufacturing plant. We further show how our approach is able to capture these dependency behaviors via intuitive and interpretable dependency graphs and use them to generate highly accurate forecasts.
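As a highly simplified sketch of reading an inter-series dependency graph off attention weights (omitting the dual-purpose RNN, lagged-dependency decoding, and joint training described in the paper; all names and sizes below are illustrative), per-series GRU embeddings can be scored pairwise to form a directed, weighted adjacency matrix:

```python
# Toy dependency-graph readout from attention over per-series GRU embeddings (PyTorch).
import torch
import torch.nn as nn

class DependencyGraphSketch(nn.Module):
    def __init__(self, num_series: int, hidden: int = 32) -> None:
        super().__init__()
        self.encoders = nn.ModuleList([nn.GRU(1, hidden, batch_first=True) for _ in range(num_series)])
        self.query = nn.Linear(hidden, hidden)
        self.key = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, num_series); returns (batch, num_series, num_series)
        embs = []
        for i, enc in enumerate(self.encoders):
            _, h = enc(x[:, :, i : i + 1])        # final hidden state of series i
            embs.append(h.squeeze(0))
        e = torch.stack(embs, dim=1)              # (batch, num_series, hidden)
        scores = self.query(e) @ self.key(e).transpose(1, 2)
        return torch.softmax(scores, dim=-1)      # row i: weights of series i's dependence on others

graph = DependencyGraphSketch(num_series=4)(torch.randn(2, 100, 4))
```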