Dang, Xuan-Hong
GneissWeb: Preparing High Quality Data for LLMs at Scale
Gohari, Hajar Emami, Kadhe, Swanand Ravindra, Shah, Syed Yousaf, Adam, Constantin, Adebayo, Abdulhamid, Adusumilli, Praneet, Ahmed, Farhan, Angel, Nathalie Baracaldo, Borse, Santosh, Chang, Yuan-Chi, Dang, Xuan-Hong, Desai, Nirmit, Eres, Revital, Iwamoto, Ran, Karve, Alexei, Koyfman, Yan, Lee, Wei-Han, Liu, Changchang, Lublinsky, Boris, Ohko, Takuyo, Pesce, Pablo, Touma, Maroun, Wang, Shiqiang, Witherspoon, Shalisha, Woisetschlager, Herbert, Wood, David, Wu, Kun-Lung, Yoshida, Issei, Zawad, Syed, Zerfos, Petros, Zhou, Yi, Bhattacharjee, Bishwaranjan
Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM's ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. The GneissWeb recipe used to produce the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained on the GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of the average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained on GneissWeb still achieve a 1.75 percentage point advantage over those trained on FineWeb-V1.1.0.
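The exact filters and thresholds of the GneissWeb recipe are described in the paper; purely as a rough illustration of the "ensemble of quality filters" idea, the sketch below (with made-up toy filters and thresholds, not those used for GneissWeb) keeps a document only when a minimum number of heuristic filters accept it:

```python
# Toy ensemble of quality filters (illustrative only; not the GneissWeb filters).
from typing import Callable, List

def alpha_ratio_filter(doc: str, min_ratio: float = 0.6) -> bool:
    """Keep documents whose share of alphabetic characters is high enough."""
    return bool(doc) and sum(c.isalpha() for c in doc) / len(doc) >= min_ratio

def length_filter(doc: str, min_words: int = 50) -> bool:
    """Keep documents with a minimum number of whitespace-separated words."""
    return len(doc.split()) >= min_words

def repetition_filter(doc: str, max_dup_line_frac: float = 0.3) -> bool:
    """Reject documents dominated by duplicated lines."""
    lines = [l.strip() for l in doc.splitlines() if l.strip()]
    return bool(lines) and (len(lines) - len(set(lines))) / len(lines) <= max_dup_line_frac

def ensemble_keep(doc: str, filters: List[Callable[[str], bool]], min_votes: int = 2) -> bool:
    """Keep a document if at least `min_votes` filters accept it."""
    return sum(f(doc) for f in filters) >= min_votes

filters = [alpha_ratio_filter, length_filter, repetition_filter]
corpus = ["a long, well-formed paragraph " * 20, "### ###\n### ###\n### ###"]
kept = [doc for doc in corpus if ensemble_keep(doc, filters)]
```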
Data-Prep-Kit: getting your data ready for LLM application development
Wood, David, Lublinsky, Boris, Roytman, Alexy, Singh, Shivdeep, Adam, Constantin, Adebayo, Abdulhamid, An, Sungeun, Chang, Yuan Chi, Dang, Xuan-Hong, Desai, Nirmit, Dolfi, Michele, Emami-Gohari, Hajar, Eres, Revital, Goto, Takuya, Joshi, Dhiraj, Koyfman, Yan, Nassar, Mohammad, Patel, Hima, Selvam, Paramesvaran, Shah, Yousaf, Surendran, Saptha, Tsuzuku, Daiki, Zerfos, Petros, Daijavad, Shahrokh
Data preparation is the first and one of the most important steps in any Large Language Model (LLM) development effort. This paper introduces an easy-to-use, extensible, and scale-flexible open-source data preparation toolkit called Data Prep Kit (DPK). DPK is architected and designed to enable users to scale their data preparation to their needs. With DPK, users can prepare data on a local machine or effortlessly scale to run on a cluster with thousands of CPU cores. DPK comes with a highly scalable yet extensible set of modules that transform natural language and code data. If users need additional transforms, these can be easily developed using DPK's extensive support for transform creation. The modules can be used independently or pipelined to perform a series of operations. In this paper, we describe the DPK architecture and show its performance from a small scale up to a very large number of CPUs. DPK modules have been used for the preparation of the Granite models [1] [2]. We believe DPK is a valuable contribution to the AI community, making it easy to prepare data to enhance the performance of LLMs or to fine-tune models with Retrieval-Augmented Generation (RAG).
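As a schematic of what "independently usable or pipelined transforms" can look like, the sketch below defines a minimal transform interface and chains two toy stages; it is not the actual DPK API (see the Data Prep Kit repository for the real interfaces), and all class and function names here are illustrative assumptions:

```python
# Illustrative transform pipeline (NOT the actual Data Prep Kit API).
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Doc:
    doc_id: str
    text: str

class Transform:
    """A stage that maps a batch of documents to a (possibly smaller) batch."""
    def apply(self, docs: Iterable[Doc]) -> List[Doc]:
        raise NotImplementedError

class Normalize(Transform):
    def apply(self, docs: Iterable[Doc]) -> List[Doc]:
        return [Doc(d.doc_id, " ".join(d.text.split())) for d in docs]

class DropShort(Transform):
    def __init__(self, min_chars: int = 200) -> None:
        self.min_chars = min_chars
    def apply(self, docs: Iterable[Doc]) -> List[Doc]:
        return [d for d in docs if len(d.text) >= self.min_chars]

def run_pipeline(docs: List[Doc], transforms: List[Transform]) -> List[Doc]:
    # Each stage could run locally or be dispatched to a cluster of workers.
    for t in transforms:
        docs = t.apply(docs)
    return docs

cleaned = run_pipeline([Doc("1", "  some   raw   text  " * 40)], [Normalize(), DropShort()])
```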
Scaling Granite Code Models to 128K Context
Stallone, Matt, Saxena, Vaibhav, Karlinsky, Leonid, McGinn, Bridget, Bula, Tim, Mishra, Mayank, Soria, Adriana Meza, Zhang, Gaoyuan, Prasad, Aditya, Shen, Yikang, Surendran, Saptha, Guttula, Shanmukha, Patel, Hima, Selvam, Parameswaran, Dang, Xuan-Hong, Koyfman, Yan, Sood, Atin, Feris, Rogerio, Desai, Nirmit, Cox, David D., Puri, Ruchir, Panda, Rameswar
This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling the context length of the Granite 3B/8B code models from 2K/4K to 128K consists of lightweight continual pretraining that gradually increases the RoPE base frequency, combined with repository-level file packing and length-upsampled long-context data. We also release instruction-tuned models with long-context support, derived by further finetuning the long-context base models on a mix of permissively licensed short and long-context instruction-response pairs. Compared to the original short-context Granite code models, our long-context models achieve significant improvements on long-context tasks without any noticeable performance degradation on regular code completion benchmarks (e.g., HumanEval). We release all our long-context Granite code models under an Apache 2.0 license for both research and commercial use.
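To illustrate the role of the RoPE base frequency in context extension, the sketch below computes rotary-embedding angles for a small base and for a much larger one; the specific base-frequency schedule, model dimensions, and training procedure used for the Granite models are not reproduced here:

```python
# RoPE frequency sketch: a larger base theta slows per-dimension rotation,
# which is the knob behind extending the usable context window.
import numpy as np

def rope_inv_freq(head_dim: int, base: float) -> np.ndarray:
    """Inverse frequencies 1 / base^(2i/d) for the even dimensions of one head."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

def rope_angles(positions: np.ndarray, head_dim: int, base: float) -> np.ndarray:
    """Rotation angles, shape (num_positions, head_dim // 2), applied to query/key pairs."""
    return np.outer(positions, rope_inv_freq(head_dim, base))

short_ctx = rope_angles(np.arange(4096), head_dim=128, base=10_000.0)        # original window
long_ctx = rope_angles(np.arange(131_072), head_dim=128, base=10_000_000.0)  # larger base for 128K
```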
Granite Code Models: A Family of Open Foundation Models for Code Intelligence
Mishra, Mayank, Stallone, Matt, Zhang, Gaoyuan, Shen, Yikang, Prasad, Aditya, Soria, Adriana Meza, Merler, Michele, Selvam, Parameswaran, Surendran, Saptha, Singh, Shivdeep, Sethi, Manish, Dang, Xuan-Hong, Li, Pengyuan, Wu, Kun-Lung, Zawad, Syed, Coleman, Andrew, White, Matthew, Lewis, Mark, Pavuluri, Raju, Koyfman, Yan, Lublinsky, Boris, de Bayser, Maximilien, Abdelaziz, Ibrahim, Basu, Kinjal, Agarwal, Mayank, Zhou, Yi, Johnson, Chris, Goyal, Aanchal, Patel, Hima, Shah, Yousaf, Zerfos, Petros, Ludwig, Heiko, Munawar, Asim, Crouse, Maxwell, Kapanipathi, Pavan, Salaria, Shweta, Calio, Bob, Wen, Sophia, Seelam, Seetharami, Belgodere, Brian, Fonseca, Carlos, Singhee, Amith, Desai, Nirmit, Cox, David D., Puri, Ruchir, Panda, Rameswar
Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code model family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device memory-constrained use cases. Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reach state-of-the-art performance among available open-source code LLMs. The Granite Code model family was optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g., code generation, fixing, and explanation), making it a versatile, all-around code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use.
Modality-aware Transformer for Time series Forecasting
Emami, Hajar, Dang, Xuan-Hong, Shah, Yousaf, Zerfos, Petros
Time series forecasting presents a significant challenge, particularly when its accuracy relies on external data sources rather than solely on historical values. This issue is prevalent in the financial sector, where the future behavior of time series is often intricately linked to information derived from various textual reports and a multitude of economic indicators. In practice, the key challenge lies in constructing a reliable time series forecasting model capable of harnessing data from diverse sources and extracting valuable insights to predict the target time series accurately. In this work, we tackle this challenging problem and introduce a novel multimodal transformer-based model named the Modality-aware Transformer. Our model excels at exploiting the power of both categorical text and numerical time series to forecast the target time series effectively while providing insights through its neural attention mechanism. To achieve this, we develop feature-level attention layers that encourage the model to focus on the most relevant features within each data modality. By incorporating the proposed feature-level attention, we develop novel intra-modal multi-head attention (MHA), inter-modal MHA, and modality-target MHA in such a way that both feature and temporal attention are incorporated into the MHAs. This enables the MHAs to generate temporal attention that accounts for modality and feature importance, leading to more informative embeddings. The proposed modality-aware structure enables the model to effectively exploit information within each modality as well as foster cross-modal understanding. Our extensive experiments on financial datasets demonstrate that the Modality-aware Transformer outperforms existing methods, offering a novel and practical solution to the complex challenges of multi-modality time series forecasting.
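As a loose, simplified illustration of feature-level attention feeding a temporal attention layer (not the paper's intra-/inter-modal or modality-target MHA design, and with made-up dimensions), one might sketch the idea in PyTorch as follows:

```python
# Toy feature-level attention feeding a standard temporal attention layer (PyTorch).
import torch
import torch.nn as nn

class FeatureLevelAttention(nn.Module):
    """Learn per-feature weights and embed the re-weighted features."""
    def __init__(self, num_features: int, d_model: int) -> None:
        super().__init__()
        self.score = nn.Linear(num_features, num_features)  # per-feature attention scores
        self.proj = nn.Linear(num_features, d_model)         # embed the weighted features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, num_features)
        weights = torch.softmax(self.score(x), dim=-1)        # emphasize relevant features
        return self.proj(x * weights)                         # (batch, time, d_model)

numeric = torch.randn(8, 96, 12)                              # e.g., 12 numeric indicators
temporal_mha = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
out = temporal_mha(FeatureLevelAttention(12, 64)(numeric))    # temporal attention on weighted features
```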
AutoAI-TS: AutoAI for Time Series Forecasting
Shah, Syed Yousaf, Patel, Dhaval, Vu, Long, Dang, Xuan-Hong, Chen, Bei, Kirchner, Peter, Samulowitz, Horst, Wood, David, Bramble, Gregory, Gifford, Wesley M., Ganapavarapu, Giridhar, Vaculin, Roman, Zerfos, Petros
A large number of time series forecasting models, including traditional statistical models, machine learning models, and more recently deep learning models, have been proposed in the literature. However, choosing the right model, along with good parameter values, that performs well on a given dataset is still challenging. Automatically providing a good set of models to users for a given dataset saves both time and effort compared with trial-and-error approaches over the wide variety of available models and their parameter optimization. We present AutoAI for Time Series Forecasting (AutoAI-TS), which provides users with a zero-configuration (zero-conf) system to efficiently train, optimize, and choose the best forecasting model among various classes of models for a given dataset. With its flexible zero-conf design, AutoAI-TS automatically performs all the data preparation, model creation, parameter optimization, training, and model selection for users and provides a trained model that is ready to use. For given data, AutoAI-TS utilizes a wide variety of models, including classical statistical models, Machine Learning (ML) models, statistical-ML hybrid models, and deep learning models, along with various transformations to create forecasting pipelines. It then evaluates and ranks pipelines using the proposed T-Daub mechanism to choose the best pipeline. The paper describes in detail all the technical aspects of AutoAI-TS along with extensive benchmarking on a variety of real-world datasets for various use cases. Benchmark results show that AutoAI-TS, with no manual configuration from the user, automatically trains and selects pipelines that on average outperform existing state-of-the-art time series forecasting toolkits.
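As a rough schematic of incremental data-allocation ranking in the spirit of T-Daub (the exact allocation schedule, scoring, and pipeline set from the paper are not reproduced, and the forecasters below are toy placeholders), candidate pipelines can be scored on growing slices of the series and pruned before full training:

```python
# Toy incremental-allocation ranking of forecasting pipelines (illustrative only).
import numpy as np

def naive_forecast(history):           # predict the last observed value
    return history[-1]

def mean_forecast(history):            # predict the historical mean
    return float(np.mean(history))

def backtest(pipeline, series):
    """One-step-ahead absolute error over the last few points of the series."""
    errs = [abs(pipeline(series[:t]) - series[t]) for t in range(len(series) - 5, len(series))]
    return float(np.mean(errs))

def rank_pipelines(pipelines, series, allocations=(64, 256), keep_top=1):
    candidates = list(pipelines)
    for n in allocations:                              # grow the data allocation
        tail = series[-min(n, len(series)):]
        candidates.sort(key=lambda p: backtest(p, tail))
        candidates = candidates[:max(keep_top, len(candidates) // 2)]
    return candidates                                   # finalists for full training

series = np.sin(np.arange(500) / 10.0) + 0.05 * np.random.randn(500)
best = rank_pipelines([naive_forecast, mean_forecast], series)
```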
seq2graph: Discovering Dynamic Dependencies from Multivariate Time Series with Multi-level Attention
Dang, Xuan-Hong, Shah, Syed Yousaf, Zerfos, Petros
Discovering temporal lagged dependencies and inter-dependencies in multivariate time series data is an important task. However, in many real-world applications, such as commercial cloud management, manufacturing predictive maintenance, and portfolio performance analysis, such dependencies can be non-linear and time-variant, which makes it more challenging to extract them through traditional methods such as Granger causality or clustering. In this work, we present a novel deep learning model that uses multiple layers of customized gated recurrent units (GRUs) for discovering both time-lagged behaviors and inter-time-series dependencies in the form of directed weighted graphs. We introduce a key component, a dual-purpose recurrent neural network, that decodes information in the temporal domain to discover lagged dependencies within each time series, and encodes them into a set of vectors which, collected from all component time series, form the informative inputs for discovering inter-dependencies. Though the discovery of the two types of dependencies is separated at different hierarchical levels, they are tightly connected and jointly trained in an end-to-end manner. With this joint training, learning one type of dependency immediately impacts learning the other, leading to overall accurate dependency discovery. We empirically test our model on synthetic time series data in which the exact form of the (non-linear) dependencies is known. We also evaluate its performance on two real-world applications: (i) performance monitoring data from a commercial cloud provider, which exhibits highly dynamic, non-linear, and volatile behavior, and (ii) sensor data from a manufacturing plant. We further show how our approach is able to capture these dependency behaviors via intuitive and interpretable dependency graphs and use them to generate highly accurate forecasts.
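As a highly simplified sketch of reading an inter-series dependency graph off attention weights (omitting the dual-purpose RNN, lagged-dependency decoding, and joint training described in the paper; all names and sizes below are illustrative), per-series GRU embeddings can be scored pairwise to form a directed, weighted adjacency matrix:

```python
# Toy dependency-graph readout from attention over per-series GRU embeddings (PyTorch).
import torch
import torch.nn as nn

class DependencyGraphSketch(nn.Module):
    def __init__(self, num_series: int, hidden: int = 32) -> None:
        super().__init__()
        self.encoders = nn.ModuleList([nn.GRU(1, hidden, batch_first=True) for _ in range(num_series)])
        self.query = nn.Linear(hidden, hidden)
        self.key = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, num_series); returns (batch, num_series, num_series)
        embs = []
        for i, enc in enumerate(self.encoders):
            _, h = enc(x[:, :, i : i + 1])        # final hidden state of series i
            embs.append(h.squeeze(0))
        e = torch.stack(embs, dim=1)              # (batch, num_series, hidden)
        scores = self.query(e) @ self.key(e).transpose(1, 2)
        return torch.softmax(scores, dim=-1)      # row i: weights of series i's dependence on others

graph = DependencyGraphSketch(num_series=4)(torch.randn(2, 100, 4))
```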