AITopics | Cheng, Derek Zhiyuan

Collaborating Authors

Cheng, Derek Zhiyuan

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation

Hou, Yupeng, Ni, Jianmo, He, Zhankui, Sachdeva, Noveen, Kang, Wang-Cheng, Chi, Ed H., McAuley, Julian, Cheng, Derek Zhiyuan

arXiv.org Artificial IntelligenceFeb-19-2025

Generative recommendation (GR) is an emerging paradigm where user actions are tokenized into discrete token patterns and autoregressively generated as predictions. However, existing GR models tokenize each action independently, assigning the same fixed tokens to identical actions across all sequences without considering contextual relationships. This lack of context-awareness can lead to suboptimal performance, as the same action may hold different meanings depending on its surrounding context. To address this issue, we propose ActionPiece to explicitly incorporate context when tokenizing action sequences. In ActionPiece, each action is represented as a set of item features, which serve as the initial tokens. Given the action sequence corpora, we construct the vocabulary by merging feature patterns as new tokens, based on their co-occurrence frequency both within individual sets and across adjacent sets. Considering the unordered nature of feature sets, we further introduce set permutation regularization, which produces multiple segmentations of action sequences with the same semantics. Experiments on public datasets demonstrate that ActionPiece consistently outperforms existing action tokenization methods, improving NDCG@$10$ by $6.00\%$ to $12.82\%$.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2502.13581

Country: North America > United States > California (0.14)

Genre:

Workflow (1.00)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.69)
(3 more...)

Add feedback

How to Train Data-Efficient LLMs

Sachdeva, Noveen, Coleman, Benjamin, Kang, Wang-Cheng, Ni, Jianmo, Hong, Lichan, Chi, Ed H., Caverlee, James, McAuley, Julian, Cheng, Derek Zhiyuan

arXiv.org Artificial IntelligenceFeb-14-2024

The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data, while models trained on Ask-LLM data consistently outperform full-data training -- even when we reject 90% of the original dataset, while converging up to 70% faster.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2402.09668

Country: North America > United States > California (0.28)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.65)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Health & Medicine (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems

Coleman, Benjamin, Kang, Wang-Cheng, Fahrbach, Matthew, Wang, Ruoxi, Hong, Lichan, Chi, Ed H., Cheng, Derek Zhiyuan

arXiv.org Artificial IntelligenceNov-14-2023

Learning high-quality feature embeddings efficiently and effectively is critical for the performance of web-scale machine learning systems. A typical model ingests hundreds of features with vocabularies on the order of millions to billions of tokens. The standard approach is to represent each feature value as a d-dimensional embedding, introducing hundreds of billions of parameters for extremely high-cardinality features. This bottleneck has led to substantial progress in alternative embedding algorithms. Many of these methods, however, make the assumption that each feature uses an independent embedding table. This work introduces a simple yet highly effective framework, Feature Multiplexing, where one single representation space is used across many different categorical features. Our theoretical and empirical analysis reveals that multiplexed embeddings can be decomposed into components from each constituent feature, allowing models to distinguish between features. We show that multiplexed representations lead to Pareto-optimal parameter-accuracy tradeoffs for three public benchmark datasets. Further, we propose a highly practical approach called Unified Embedding with three major benefits: simplified feature configuration, strong adaptation to dynamic data distributions, and compatibility with modern hardware. Unified embedding gives significant improvements in offline and online metrics compared to highly competitive baselines across five web-scale search, ads, and recommender systems, where it serves billions of users across the world in industry-leading products.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2305.12102

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)

Add feedback

Farzi Data: Autoregressive Data Distillation

Sachdeva, Noveen, He, Zexue, Kang, Wang-Cheng, Ni, Jianmo, Cheng, Derek Zhiyuan, McAuley, Julian

arXiv.org Artificial IntelligenceOct-15-2023

We study data distillation for auto-regressive machine learning tasks, where the input and output have a strict left-to-right causal structure. More specifically, we propose Farzi, which summarizes an event sequence dataset into a small number of synthetic sequences -- Farzi Data -- which are optimized to maintain (if not improve) model performance compared to training on the full dataset. Under the hood, Farzi conducts memory-efficient data distillation by (i) deriving efficient reverse-mode differentiation of the Adam optimizer by leveraging Hessian-Vector Products; and (ii) factorizing the high-dimensional discrete event-space into a latent-space which provably promotes implicit regularization. Empirically, for sequential recommendation and language modeling tasks, we are able to achieve 98-120% of downstream full-data performance when training state-of-the-art models on Farzi Data of size as little as 0.1% of the original dataset. Notably, being able to train better models with significantly less data sheds light on the design of future large auto-regressive models, and opens up new opportunities to further scale up model and data sizes.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2310.09983

Country:

North America > United States > New York (0.14)
North America > United States > California (0.14)

Genre: Research Report (1.00)

Industry: Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Add feedback

HyperFormer: Learning Expressive Sparse Feature Representations via Hypergraph Transformer

Ding, Kaize, Liang, Albert Jiongqian, Perrozi, Bryan, Chen, Ting, Wang, Ruoxi, Hong, Lichan, Chi, Ed H., Liu, Huan, Cheng, Derek Zhiyuan

arXiv.org Artificial IntelligenceMay-27-2023

Learning expressive representations for high-dimensional yet sparse features has been a longstanding problem in information retrieval. Though recent deep learning methods can partially solve the problem, they often fail to handle the numerous sparse features, particularly those tail feature values with infrequent occurrences in the training data. Worse still, existing methods cannot explicitly leverage the correlations among different instances to help further improve the representation learning on sparse features since such relational prior knowledge is not provided. To address these challenges, in this paper, we tackle the problem of representation learning on feature-sparse data from a graph learning perspective. Specifically, we propose to model the sparse features of different instances using hypergraphs where each node represents a data instance and each hyperedge denotes a distinct feature value. By passing messages on the constructed hypergraphs based on our Hypergraph Transformer (HyperFormer), the learned feature representations capture not only the correlations among different instances but also the correlations among features. Our experiments demonstrate that the proposed approach can effectively improve feature representation learning on sparse features.

artificial intelligence, machine learning, representation, (16 more...)

arXiv.org Artificial Intelligence

2305.17386

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Add feedback

Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction

Kang, Wang-Cheng, Ni, Jianmo, Mehta, Nikhil, Sathiamoorthy, Maheswaran, Hong, Lichan, Chi, Ed, Cheng, Derek Zhiyuan

arXiv.org Artificial IntelligenceMay-10-2023

Large Language Models (LLMs) have demonstrated exceptional capabilities in generalizing to new tasks in a zero-shot or few-shot manner. However, the extent to which LLMs can comprehend user preferences based on their previous behavior remains an emerging and still unclear research question. Traditionally, Collaborative Filtering (CF) has been the most effective method for these tasks, predominantly relying on the extensive volume of rating data. In contrast, LLMs typically demand considerably less data while maintaining an exhaustive world knowledge about each item, such as movies or products. In this paper, we conduct a thorough examination of both CF and LLMs within the classic task of user rating prediction, which involves predicting a user's rating for a candidate item based on their past ratings. We investigate various LLMs in different sizes, ranging from 250M to 540B parameters and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios. We conduct comprehensive analysis to compare between LLMs and strong CF methods, and find that zero-shot LLMs lag behind traditional recommender models that have the access to user interaction data, indicating the importance of user interaction data. However, through fine-tuning, LLMs achieve comparable or even better performance with only a small fraction of the training data, demonstrating their potential through data efficiency.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2305.06474

Country:

North America > United States (0.16)
Oceania > Australia (0.14)
North America > Canada (0.14)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

Add feedback

Self-supervised Learning for Large-scale Item Recommendations

Yao, Tiansheng, Yi, Xinyang, Cheng, Derek Zhiyuan, Yu, Felix, Chen, Ting, Menon, Aditya, Hong, Lichan, Chi, Ed H., Tjoa, Steve, Kang, Jieqi, Ettinger, Evan

arXiv.org Machine LearningOct-22-2020

Large scale recommender models find most relevant items from huge catalogs, and they play a critical role in modern search and recommendation systems. To model the input space with large-vocab categorical features, a typical recommender model learns a joint embedding space through neural networks for both queries and items from user feedback data. However, with millions to billions of items, the power-law user feedback makes labels very sparse for a large amount of long-tail items. Inspired by the recent success in self-supervised representation learning research in both computer vision and natural language understanding, we propose a multi-task self-supervised learning (SSL) framework for large-scale item recommendations. The framework is designed to tackle the label sparsity problem by learning more robust item representations. Furthermore, we propose two self-supervised tasks applicable to models with categorical features within the proposed framework: (i) Feature Masking (FM) and (ii) Feature Dropout (FD). We evaluate our framework using two large-scale datasets with 500M and 1B training examples respectively. Our results demonstrate that the proposed framework outperforms traditional supervised learning only models and state-of-the-art regularization techniques in the context of item recommendations. The SSL framework shows larger improvement with less supervision compared to the counterparts. We also apply the proposed techniques to a web-scale commercial app-to-app recommendation system, and significantly improve top-tier business metrics via A/B experiments on live traffic. Our online results also verify our hypothesis that our framework indeed improves model performance on slices that lack supervision.

deep learning, neural network, representation, (18 more...)

arXiv.org Machine Learning

2007.12865

Country:

Europe (1.00)
North America > United States > California > San Francisco County > San Francisco (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology (0.67)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Beyond Point Estimate: Inferring Ensemble Prediction Variation from Neuron Activation Strength in Recommender Systems

Chen, Zhe, Wang, Yuyan, Lin, Dong, Cheng, Derek Zhiyuan, Hong, Lichan, Chi, Ed H., Cui, Claire

arXiv.org Machine LearningAug-16-2020

Despite deep neural network (DNN)'s impressive prediction performance in various domains, it is well known now that a set of DNN models trained with the same model specification and the same data can produce very different prediction results. Ensemble method is one state-of-the-art benchmark for prediction uncertainty estimation. However, ensembles are expensive to train and serve for web-scale traffic. In this paper, we seek to advance the understanding of prediction variation estimated by the ensemble method. Through empirical experiments on two widely used benchmark datasets MovieLens and Criteo in recommender systems, we observe that prediction variations come from various randomness sources, including training data shuffling, and parameter random initialization. By introducing more randomness into model training, we notice that ensemble's mean predictions tend to be more accurate while the prediction variations tend to be higher. Moreover, we propose to infer prediction variation from neuron activation strength and demonstrate the strong prediction power from activation strength features. Our experiment results show that the average R squared on MovieLens is as high as 0.56 and on Criteo is 0.81. Our method performs especially well when detecting the lowest and highest variation buckets, with 0.92 AUC and 0.89 AUC respectively. Our approach provides a simple way for prediction variation estimation, which opens up new opportunities for future work in many interesting areas (e.g.,model-based reinforcement learning) without relying on serving expensive ensemble models.

bayesian inference, neural network, prediction variation, (15 more...)

arXiv.org Machine Learning

2008.07032

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)

Add feedback