Just, Hoang Anh
Optimizing Product Provenance Verification using Data Valuation Methods
Yousuf, Raquib Bin, Just, Hoang Anh, Xu, Shengzhe, Mayer, Brian, Deklerck, Victor, Truszkowski, Jakub, Simeone, John C., Saunders, Jade, Lu, Chang-Tien, Jia, Ruoxi, Ramakrishnan, Naren
Determining and verifying product provenance remains a critical challenge in global supply chains, particularly as geopolitical conflicts and shifting borders create new incentives for misrepresentation of commodities, such as hiding the origin of illegally harvested timber or agriculture grown on illegally cleared land. Stable Isotope Ratio Analysis (SIRA), combined with Gaussian process regression-based isoscapes, has emerged as a powerful tool for geographic origin verification. However, the effectiveness of these models is often constrained by data scarcity and suboptimal dataset selection. In this work, we introduce a novel data valuation framework designed to enhance the selection and utilization of training data for machine learning models applied in SIRA. By prioritizing high-informative samples, our approach improves model robustness and predictive accuracy.
Determining and verifying product provenance is a challenge in global supply chains, as geopolitics and the lure of "don't ask, don't tell" with respect to the ecological and social cost create incentives for misrepresentation of commodities, such as hiding the origin of illegally harvested timber or agriculture grown on illegally cleared land. Product identification and provenance verification of traded natural resources have emerged as promising research areas, with various combinations of methods used based on the specific natural resource sector and the level of granularity of species identification and origin-provenance determination. For example, for wood and forest products, determining species identification and geographic harvest provenance requires utilizing multiple testing methods and tools [5, 8, 20].
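As a rough illustration of the kind of model involved (not code from the paper), the sketch below fits a Gaussian process isoscape that predicts an isotope ratio from geographic coordinates with scikit-learn and flags samples whose measured ratio is inconsistent with their claimed origin; the coordinates, ratios, and threshold are hypothetical toy values.

```python
# Hypothetical sketch: a Gaussian-process isoscape mapping coordinates to an
# isotope ratio, used to check whether a sample matches its claimed origin.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy reference data: (latitude, longitude) -> isotope ratio (illustrative values).
coords = np.array([[52.1, 21.0], [48.8, 2.3], [40.4, -3.7], [60.2, 24.9]])
ratios = np.array([-8.1, -6.5, -5.2, -10.3])

kernel = 1.0 * RBF(length_scale=5.0) + WhiteKernel(noise_level=0.1)
isoscape = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
isoscape.fit(coords, ratios)

def consistent_with_claim(claimed_coord, measured_ratio, z=2.0):
    """Flag a sample whose measured ratio falls outside z standard deviations
    of the isoscape prediction at the claimed harvest location."""
    mean, std = isoscape.predict(np.array([claimed_coord]), return_std=True)
    return abs(measured_ratio - mean[0]) <= z * std[0]

print(consistent_with_claim([50.0, 14.4], measured_ratio=-7.0))
```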
DiPT: Enhancing LLM reasoning through diversified perspective-taking
Just, Hoang Anh, Dabas, Mahavir, Huang, Lifu, Jin, Ming, Jia, Ruoxi
Correct reasoning steps are important for language models to achieve high performance on many tasks, such as commonsense reasoning, question answering, and mathematical problem-solving [Wei et al., 2022, Kojima et al., 2022, Suzgun et al., 2022]. One way to elicit reasoning is through the chain-of-thought (CoT) method [Wei et al., 2022, Kojima et al., 2022], which asks the model to provide step-by-step reasoning. Another approach asks the model to generate problems similar to the query [Yasunaga et al., 2024], indirectly compelling the model to first understand the original query. Similarly, repeating and rephrasing the query [Deng et al., 2023, Mekala et al., 2023] requires the model to first understand the problem and then restate the query in its own words. This rephrasing might help simplify the problem for the model. Additionally, reasoning can be elicited indirectly by providing reasoning examples in demonstrations, referred to as in-context learning (ICL) [Brown et al., 2020, Min et al., 2022, Xie et al., 2021]. While these methods have demonstrated significant performance improvements, language models are still prone to errors due to incorrect context understanding or flawed analytical steps. Furthermore, they are subject to instability when requests are paraphrased. This instability is particularly concerning in the context of adversarial prompts, where recent research [Zou et al., 2023, Zeng et al., 2024] has shown that adversaries can intentionally rewrite prompts to coax safety-aligned language models into generating objectionable content that they would not generate otherwise.
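For concreteness, the snippet below shows the two elicitation strategies discussed above, chain-of-thought prompting and rephrase-then-answer, as plain prompt templates; it is not the DiPT method itself, and the prompt wording and the commented-out LLM call are illustrative placeholders.

```python
# Illustrative prompt templates for chain-of-thought and rephrase-then-answer
# prompting; the LLM client call is a placeholder, not a specific API.
def cot_prompt(question: str) -> str:
    return f"{question}\nLet's think step by step."

def rephrase_prompt(question: str) -> str:
    return ("First, rephrase the following question in your own words, "
            f"then answer the rephrased question.\nQuestion: {question}")

question = "If a train travels 60 km in 45 minutes, what is its average speed in km/h?"
for build in (cot_prompt, rephrase_prompt):
    prompt = build(question)
    print(prompt)
    # response = llm.generate(prompt)  # hypothetical LLM call
```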
Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs
Kang, Feiyang, Just, Hoang Anh, Sun, Yifan, Jahagirdar, Himanshu, Zhang, Yuanzhi, Du, Rongxing, Sahu, Anit Kumar, Jia, Ruoxi
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. Many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, although some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks under certain conditions. We demonstrate the efficacy of our methodology across a diverse array of tasks (NLU, NLG, zero-shot) with models of up to 2.7B parameters, showing that it consistently surpasses other selection methods. Moreover, our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour. Our code is open-sourced (code repository: https://anonymous.4open.science/r/DV4LLM-D761/). While fine-tuning offers significant potential for enhancing performance across diverse tasks, its associated costs often limit its widespread adoption; with this work, we hope to lay the groundwork for cost-effective fine-tuning, making its benefits more accessible.
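The sketch below is a simplified heuristic in the spirit of the idea above, not the paper's algorithm: candidates are scored by their optimal-transport cost to the target data relative to the pre-training data, so that selected samples pull the overall training distribution toward the target. The embeddings and sizes are synthetic stand-ins.

```python
# Simplified distribution-nudging heuristic (illustrative, not the paper's method).
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
candidates = rng.normal(0.0, 1.0, size=(500, 32))   # embeddings of open, unlabeled data
pretrain   = rng.normal(0.0, 1.0, size=(300, 32))   # embeddings of a pre-training sample
target     = rng.normal(0.5, 1.0, size=(100, 32))   # embeddings of target-domain data

def per_candidate_cost(src, dst):
    """Transport cost attributed to each source point under the optimal plan."""
    a = np.full(len(src), 1.0 / len(src))
    b = np.full(len(dst), 1.0 / len(dst))
    M = ot.dist(src, dst)          # squared Euclidean cost matrix
    plan = ot.emd(a, b, M)         # exact optimal-transport plan
    return (plan * M).sum(axis=1) / a

# Low score: close to the target relative to the pre-training distribution.
score = per_candidate_cost(candidates, target) - per_candidate_cost(candidates, pretrain)
selected = np.argsort(score)[:100]
```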
LAVA: Data Valuation without Pre-Specified Learning Algorithms
Just, Hoang Anh, Kang, Feiyang, Wang, Jiachen T., Zeng, Yi, Ko, Myeongseob, Jin, Ming, Jia, Ruoxi
Traditionally, data valuation (DV) is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many DV use cases, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis, and the choice of the learning algorithm is often still undetermined at that point. Another side-effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computational burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. (1) We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. (2) We develop a novel method to value individual data based on the sensitivity analysis of the class-wise Wasserstein distance. Importantly, these values can be directly obtained for free from the output of off-the-shelf optimization solvers when computing the distance. (3) We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over SOTA performance while being orders of magnitude faster.
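To make point (2) concrete, the sketch below shows how per-point quantities fall out of an off-the-shelf optimal-transport solver: the training-side dual potentials returned alongside the transport plan can be read as unnormalized data values. It is a minimal illustration that omits the class-wise label distance and the calibration used in LAVA, and the features are synthetic.

```python
# Minimal sketch: dual potentials from the OT solver as unnormalized data values.
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(1)
train_feats = rng.normal(size=(200, 16))
val_feats   = rng.normal(size=(50, 16))

a = np.full(len(train_feats), 1.0 / len(train_feats))
b = np.full(len(val_feats), 1.0 / len(val_feats))
M = ot.dist(train_feats, val_feats)

_, log = ot.emd(a, b, M, log=True)      # exact solver also returns dual potentials
values = -log['u']                       # higher value ~ smaller contribution to the distance
worst_points = np.argsort(values)[:10]   # candidates for low-quality data
```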
2D-Shapley: A Framework for Fragmented Data Valuation
Liu, Zhihong, Just, Hoang Anh, Chang, Xiangyu, Chen, Xi, Jia, Ruoxi
Data valuation -- quantifying the contribution of individual data sources to certain predictive behaviors of a model -- is of great importance to enhancing the transparency of machine learning and designing incentive systems for data sharing. Existing work has focused on evaluating data sources that share a feature or sample space. How to value fragmented data sources, each of which contains only partial features and samples, remains an open question. We start by presenting a method to calculate the counterfactual of removing a fragment from the aggregated data matrix. Based on the counterfactual calculation, we further propose 2D-Shapley, a theoretical framework for fragmented data valuation that uniquely satisfies some appealing axioms in the fragmented data context. 2D-Shapley empowers a range of new use cases, such as selecting useful data fragments, providing interpretation for sample-wise data values, and fine-grained data issue diagnosis.
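The following sketch only conveys the two-dimensional structure of the problem: it scores a (row-block, column-block) fragment by its average marginal contribution to a utility function over random orderings of row- and column-blocks. It is a simplified Monte-Carlo illustration, not the paper's counterfactual construction or its axiomatic 2D-Shapley estimator, and the toy utility is hypothetical.

```python
# Simplified Monte-Carlo fragment scoring over a grid of row- and column-blocks.
import numpy as np

def mc_fragment_values(n_row_blocks, n_col_blocks, utility, n_perms=200, seed=0):
    """utility(rows, cols) should score the sub-matrix formed by the given
    row- and column-block index sets, e.g. validation accuracy of a model
    trained on that sub-matrix."""
    rng = np.random.default_rng(seed)
    values = np.zeros((n_row_blocks, n_col_blocks))
    for _ in range(n_perms):
        row_order = rng.permutation(n_row_blocks)
        col_order = rng.permutation(n_col_blocks)
        for ri, i in enumerate(row_order):
            for ci, j in enumerate(col_order):
                rows = set(row_order[:ri + 1])
                cols = set(col_order[:ci + 1])
                gain = utility(rows, cols) - utility(rows - {i}, cols - {j})
                values[i, j] += gain / n_perms
    return values

# Toy usage: utility is the fraction of covered cells in a 3x4 block grid.
vals = mc_fragment_values(3, 4, lambda r, c: len(r) * len(c) / 12.0)
```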
Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources
Kang, Feiyang, Just, Hoang Anh, Sahu, Anit Kumar, Jia, Ruoxi
Traditionally, data selection has been studied in settings where all samples from prospective sources are fully revealed to a machine learning developer. However, in practical data exchange scenarios, data providers often reveal only a limited subset of samples before an acquisition decision is made. Recently, there have been efforts to fit scaling laws that predict model performance at any size and data source composition using the limited available samples. However, these scaling functions are black-box, computationally expensive to fit, highly susceptible to overfitting, and/or difficult to optimize for data selection. This paper proposes a framework that addresses these limitations by modeling performance scaling via optimal transport, enabling data selection from partially revealed sources.
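As a generic illustration of the scaling-law setup (not the paper's optimal-transport-based predictor), the snippet below fits a simple saturating power law to performance measured on small pilot subsets and extrapolates to a larger acquisition size; the subset sizes and scores are made up.

```python
# Generic scaling-law extrapolation from pilot subsets (illustrative only).
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, b, c):
    # Saturating power law: performance approaches a as data size n grows.
    return a - b * np.power(n, -c)

subset_sizes = np.array([100, 200, 400, 800, 1600])
subset_scores = np.array([0.61, 0.66, 0.70, 0.73, 0.75])  # hypothetical pilot results

params, _ = curve_fit(scaling_law, subset_sizes, subset_scores,
                      p0=[0.8, 1.0, 0.5], maxfev=10000)
print("predicted accuracy at 10k samples:", scaling_law(10_000, *params))
```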
ModelPred: A Framework for Predicting Trained Model from Training Data
Zeng, Yingyan, Wang, Jiachen T., Chen, Si, Just, Hoang Anh, Jin, Ran, Jia, Ruoxi
In this work, we propose ModelPred, a framework that helps to understand the impact of changes in training data on a trained model. This is critical for building trust in various stages of a machine learning pipeline: from cleaning poor-quality samples and tracking important ones to be collected during data preparation, to calibrating uncertainty of model prediction, to interpreting why certain behaviors of a model emerge during deployment. Specifically, ModelPred learns a parameterized function that takes a dataset $S$ as the input and predicts the model obtained by training on $S$. Our work differs from the recent work of Datamodels [1] in that we aim to predict the trained model parameters directly instead of the trained model behaviors. We demonstrate that a neural network-based set function class is capable of learning the complex relationships between the training data and model parameters. We introduce novel global and local regularization techniques to prevent overfitting, and we rigorously characterize the expressive power of neural networks (NN) in approximating the end-to-end training process. Through extensive empirical investigations, we show that ModelPred enables a variety of applications that boost the interpretability and accountability of machine learning (ML), such as data valuation, data selection, memorization quantification, and model calibration.
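To illustrate the general idea of a set-function predictor (not the paper's exact architecture, regularization, or training procedure), the sketch below defines a permutation-invariant, DeepSets-style network that encodes a dataset $S$ and outputs a flattened parameter vector for the model that would result from training on $S$; all dimensions are hypothetical.

```python
# Minimal DeepSets-style sketch: dataset in, predicted model parameters out.
import torch
import torch.nn as nn

class SetToParams(nn.Module):
    def __init__(self, point_dim: int, hidden: int, n_params: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(point_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_params))

    def forward(self, dataset: torch.Tensor) -> torch.Tensor:
        # dataset: (num_points, point_dim); mean-pooling makes the prediction
        # invariant to the order of the training points.
        pooled = self.encoder(dataset).mean(dim=0)
        return self.head(pooled)

# Hypothetical sizes: 11-dim points (features + label), 12 target parameters
# (e.g. the weights and bias of a small logistic-regression model).
predictor = SetToParams(point_dim=11, hidden=64, n_params=12)
predicted_theta = predictor(torch.randn(200, 11))
```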