Just, Hoang Anh
Optimizing Product Provenance Verification using Data Valuation Methods
Yousuf, Raquib Bin, Just, Hoang Anh, Xu, Shengzhe, Mayer, Brian, Deklerck, Victor, Truszkowski, Jakub, Simeone, John C., Saunders, Jade, Lu, Chang-Tien, Jia, Ruoxi, Ramakrishnan, Naren
Determining and verifying product provenance remains a critical challenge in global supply chains, particularly as geopolitical conflicts and shifting borders create new incentives for misrepresentation of commodities, such as hiding the origin of illegally harvested timber or agriculture grown on illegally cleared land. Stable Isotope Ratio Analysis (SIRA), combined with Gaussian process regression-based isoscapes, has emerged as a powerful tool for geographic origin verification. However, the effectiveness of these models is often constrained by data scarcity and suboptimal dataset selection. In this work, we introduce a novel data valuation framework designed to enhance the selection and utilization of training data for machine learning models applied in SIRA. By prioritizing high-informative samples, our approach improves model robustness and predictive accuracy.
Determining and verifying product provenance is a challenge in global supply chains, as geopolitics and the lure of "don't ask, don't tell" with respect to the ecological and social cost create incentives for misrepresentation of commodities, such as hiding the origin of illegally harvested timber or agriculture grown on illegally cleared land. Product identification and provenance verification of traded natural resources have emerged as promising research areas, with various combinations of methods used based on the specific natural resource sector and the level of granularity of species identification and origin-provenance determination. For example, for wood and forest products, determining species identification and geographic harvest provenance requires utilizing multiple testing methods and tools [5, 8, 20].
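As a rough illustration of the kind of model involved (not code from the paper), the sketch below fits a Gaussian process isoscape that predicts an isotope ratio from geographic coordinates with scikit-learn and flags samples whose measured ratio is inconsistent with their claimed origin; the coordinates, ratios, and threshold are hypothetical toy values.

```python
# Hypothetical sketch: a Gaussian-process isoscape mapping coordinates to an
# isotope ratio, used to check whether a sample matches its claimed origin.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy reference data: (latitude, longitude) -> isotope ratio (illustrative values).
coords = np.array([[52.1, 21.0], [48.8, 2.3], [40.4, -3.7], [60.2, 24.9]])
ratios = np.array([-8.1, -6.5, -5.2, -10.3])

kernel = 1.0 * RBF(length_scale=5.0) + WhiteKernel(noise_level=0.1)
isoscape = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
isoscape.fit(coords, ratios)

def consistent_with_claim(claimed_coord, measured_ratio, z=2.0):
    """Flag a sample whose measured ratio falls outside z standard deviations
    of the isoscape prediction at the claimed harvest location."""
    mean, std = isoscape.predict(np.array([claimed_coord]), return_std=True)
    return abs(measured_ratio - mean[0]) <= z * std[0]

print(consistent_with_claim([50.0, 14.4], measured_ratio=-7.0))
```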
DiPT: Enhancing LLM reasoning through diversified perspective-taking
Just, Hoang Anh, Dabas, Mahavir, Huang, Lifu, Jin, Ming, Jia, Ruoxi
Correct reasoning steps are important for language models to achieve high performance on many tasks, such as commonsense reasoning, question answering, and mathematical problem-solving [Wei et al., 2022, Kojima et al., 2022, Suzgun et al., 2022]. One way to elicit reasoning is through the chain-of-thought (CoT) method [Wei et al., 2022, Kojima et al., 2022], which asks the model to provide step-by-step reasoning. Another approach asks the model to generate problems similar to the query [Yasunaga et al., 2024], indirectly compelling the model to first understand the original query. Similarly, repeating and rephrasing the query [Deng et al., 2023, Mekala et al., 2023] requires the model to first understand the problem and then restate the query in its own words. This rephrasing might help simplify the problem for the model. Additionally, reasoning can be elicited indirectly by providing reasoning examples in demonstrations, referred to as in-context learning (ICL) [Brown et al., 2020, Min et al., 2022, Xie et al., 2021]. While these methods have demonstrated significant performance improvements, language models are still prone to errors due to incorrect context understanding or flawed analytical steps. Furthermore, they are subject to instability when requests are paraphrased. This instability is particularly concerning in the context of adversarial prompts, where recent research [Zou et al., 2023, Zeng et al., 2024] has shown that adversaries can intentionally rewrite prompts to coax safety-aligned language models into generating objectionable content that they would not generate otherwise.
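For concreteness, the snippet below shows the two elicitation strategies discussed above, chain-of-thought prompting and rephrase-then-answer, as plain prompt templates; it is not the DiPT method itself, and the prompt wording and the commented-out LLM call are illustrative placeholders.

```python
# Illustrative prompt templates for chain-of-thought and rephrase-then-answer
# prompting; the LLM client call is a placeholder, not a specific API.
def cot_prompt(question: str) -> str:
    return f"{question}\nLet's think step by step."

def rephrase_prompt(question: str) -> str:
    return ("First, rephrase the following question in your own words, "
            f"then answer the rephrased question.\nQuestion: {question}")

question = "If a train travels 60 km in 45 minutes, what is its average speed in km/h?"
for build in (cot_prompt, rephrase_prompt):
    prompt = build(question)
    print(prompt)
    # response = llm.generate(prompt)  # hypothetical LLM call
```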
Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs
Kang, Feiyang, Just, Hoang Anh, Sun, Yifan, Jahagirdar, Himanshu, Zhang, Yuanzhi, Du, Rongxing, Sahu, Anit Kumar, Jia, Ruoxi
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. Many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, although some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks under certain conditions. We demonstrate the efficacy of our methodology across a diverse array of tasks (NLU, NLG, zero-shot) with models of up to 2.7B parameters, showing that it consistently surpasses other selection methods. Moreover, our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour. Our code is open-sourced (code repository: https://anonymous.4open.science/r/DV4LLM-D761/). While fine-tuning offers significant potential for enhancing performance across diverse tasks, its associated costs often limit its widespread adoption; with this work, we hope to lay the groundwork for cost-effective fine-tuning, making its benefits more accessible.
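The sketch below is a simplified heuristic in the spirit of the idea above, not the paper's algorithm: candidates are scored by their optimal-transport cost to the target data relative to the pre-training data, so that selected samples pull the overall training distribution toward the target. The embeddings and sizes are synthetic stand-ins.

```python
# Simplified distribution-nudging heuristic (illustrative, not the paper's method).
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
candidates = rng.normal(0.0, 1.0, size=(500, 32))   # embeddings of open, unlabeled data
pretrain   = rng.normal(0.0, 1.0, size=(300, 32))   # embeddings of a pre-training sample
target     = rng.normal(0.5, 1.0, size=(100, 32))   # embeddings of target-domain data

def per_candidate_cost(src, dst):
    """Transport cost attributed to each source point under the optimal plan."""
    a = np.full(len(src), 1.0 / len(src))
    b = np.full(len(dst), 1.0 / len(dst))
    M = ot.dist(src, dst)          # squared Euclidean cost matrix
    plan = ot.emd(a, b, M)         # exact optimal-transport plan
    return (plan * M).sum(axis=1) / a

# Low score: close to the target relative to the pre-training distribution.
score = per_candidate_cost(candidates, target) - per_candidate_cost(candidates, pretrain)
selected = np.argsort(score)[:100]
```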
LAVA: Data Valuation without Pre-Specified Learning Algorithms
Just, Hoang Anh, Kang, Feiyang, Wang, Jiachen T., Zeng, Yi, Ko, Myeongseob, Jin, Ming, Jia, Ruoxi
Traditionally, data valuation (DV) is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many DV use cases, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis, and the choice of the learning algorithm is often still undetermined at that point. Another side-effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computational burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. (1) We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. (2) We develop a novel method to value individual data based on the sensitivity analysis of the class-wise Wasserstein distance. Importantly, these values can be directly obtained for free from the output of off-the-shelf optimization solvers when computing the distance. (3) We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over SOTA performance while being orders of magnitude faster.
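To make point (2) concrete, the sketch below shows how per-point quantities fall out of an off-the-shelf optimal-transport solver: the training-side dual potentials returned alongside the transport plan can be read as unnormalized data values. It is a minimal illustration that omits the class-wise label distance and the calibration used in LAVA, and the features are synthetic.

```python
# Minimal sketch: dual potentials from the OT solver as unnormalized data values.
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(1)
train_feats = rng.normal(size=(200, 16))
val_feats   = rng.normal(size=(50, 16))

a = np.full(len(train_feats), 1.0 / len(train_feats))
b = np.full(len(val_feats), 1.0 / len(val_feats))
M = ot.dist(train_feats, val_feats)

_, log = ot.emd(a, b, M, log=True)      # exact solver also returns dual potentials
values = -log['u']                       # higher value ~ smaller contribution to the distance
worst_points = np.argsort(values)[:10]   # candidates for low-quality data
```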
2D-Shapley: A Framework for Fragmented Data Valuation
Liu, Zhihong, Just, Hoang Anh, Chang, Xiangyu, Chen, Xi, Jia, Ruoxi
Data valuation -- quantifying the contribution of individual data sources to certain predictive behaviors of a model -- is of great importance to enhancing the transparency of machine learning and designing incentive systems for data sharing. Existing work has focused on evaluating data sources that share a feature or sample space. How to value fragmented data sources, each of which contains only partial features and samples, remains an open question. We start by presenting a method to calculate the counterfactual of removing a fragment from the aggregated data matrix. Based on the counterfactual calculation, we further propose 2D-Shapley, a theoretical framework for fragmented data valuation that uniquely satisfies some appealing axioms in the fragmented data context. 2D-Shapley empowers a range of new use cases, such as selecting useful data fragments, providing interpretation for sample-wise data values, and fine-grained data issue diagnosis.
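The following sketch only conveys the two-dimensional structure of the problem: it scores a (row-block, column-block) fragment by its average marginal contribution to a utility function over random orderings of row- and column-blocks. It is a simplified Monte-Carlo illustration, not the paper's counterfactual construction or its axiomatic 2D-Shapley estimator, and the toy utility is hypothetical.

```python
# Simplified Monte-Carlo fragment scoring over a grid of row- and column-blocks.
import numpy as np

def mc_fragment_values(n_row_blocks, n_col_blocks, utility, n_perms=200, seed=0):
    """utility(rows, cols) should score the sub-matrix formed by the given
    row- and column-block index sets, e.g. validation accuracy of a model
    trained on that sub-matrix."""
    rng = np.random.default_rng(seed)
    values = np.zeros((n_row_blocks, n_col_blocks))
    for _ in range(n_perms):
        row_order = rng.permutation(n_row_blocks)
        col_order = rng.permutation(n_col_blocks)
        for ri, i in enumerate(row_order):
            for ci, j in enumerate(col_order):
                rows = set(row_order[:ri + 1])
                cols = set(col_order[:ci + 1])
                gain = utility(rows, cols) - utility(rows - {i}, cols - {j})
                values[i, j] += gain / n_perms
    return values

# Toy usage: utility is the fraction of covered cells in a 3x4 block grid.
vals = mc_fragment_values(3, 4, lambda r, c: len(r) * len(c) / 12.0)
```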
Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources
Kang, Feiyang, Just, Hoang Anh, Sahu, Anit Kumar, Jia, Ruoxi
Traditionally, data selection has been studied in settings where all samples from prospective sources are fully revealed to a machine learning developer. However, in practical data exchange scenarios, data providers often reveal only a limited subset of samples before an acquisition decision is made. Recently, there have been efforts to fit scaling laws that predict model performance at any size and data source composition using the limited available samples. However, these scaling functions are black-box, computationally expensive to fit, highly susceptible to overfitting, and/or difficult to optimize for data selection. This paper proposes a framework that addresses these limitations by modeling performance scaling via optimal transport, enabling data selection from partially revealed sources.
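As a generic illustration of the scaling-law setup (not the paper's optimal-transport-based predictor), the snippet below fits a simple saturating power law to performance measured on small pilot subsets and extrapolates to a larger acquisition size; the subset sizes and scores are made up.

```python
# Generic scaling-law extrapolation from pilot subsets (illustrative only).
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, b, c):
    # Saturating power law: performance approaches a as data size n grows.
    return a - b * np.power(n, -c)

subset_sizes = np.array([100, 200, 400, 800, 1600])
subset_scores = np.array([0.61, 0.66, 0.70, 0.73, 0.75])  # hypothetical pilot results

params, _ = curve_fit(scaling_law, subset_sizes, subset_scores,
                      p0=[0.8, 1.0, 0.5], maxfev=10000)
print("predicted accuracy at 10k samples:", scaling_law(10_000, *params))
```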
ModelPred: A Framework for Predicting Trained Model from Training Data
Zeng, Yingyan, Wang, Jiachen T., Chen, Si, Just, Hoang Anh, Jin, Ran, Jia, Ruoxi
In this work, we propose ModelPred, a framework that helps to understand the impact of changes in training data on a trained model. This is critical for building trust in various stages of a machine learning pipeline: from cleaning poor-quality samples and tracking important ones to be collected during data preparation, to calibrating uncertainty of model prediction, to interpreting why certain behaviors of a model emerge during deployment. Specifically, ModelPred learns a parameterized function that takes a dataset $S$ as the input and predicts the model obtained by training on $S$. Our work differs from the recent work of Datamodels [1] in that we aim to predict the trained model parameters directly instead of the trained model behaviors. We demonstrate that a neural network-based set function class is capable of learning the complex relationships between the training data and model parameters. We introduce novel global and local regularization techniques to prevent overfitting, and we rigorously characterize the expressive power of neural networks (NN) in approximating the end-to-end training process. Through extensive empirical investigations, we show that ModelPred enables a variety of applications that boost the interpretability and accountability of machine learning (ML), such as data valuation, data selection, memorization quantification, and model calibration.
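To illustrate the general idea of a set-function predictor (not the paper's exact architecture, regularization, or training procedure), the sketch below defines a permutation-invariant, DeepSets-style network that encodes a dataset $S$ and outputs a flattened parameter vector for the model that would result from training on $S$; all dimensions are hypothetical.

```python
# Minimal DeepSets-style sketch: dataset in, predicted model parameters out.
import torch
import torch.nn as nn

class SetToParams(nn.Module):
    def __init__(self, point_dim: int, hidden: int, n_params: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(point_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_params))

    def forward(self, dataset: torch.Tensor) -> torch.Tensor:
        # dataset: (num_points, point_dim); mean-pooling makes the prediction
        # invariant to the order of the training points.
        pooled = self.encoder(dataset).mean(dim=0)
        return self.head(pooled)

# Hypothetical sizes: 11-dim points (features + label), 12 target parameters
# (e.g. the weights and bias of a small logistic-regression model).
predictor = SetToParams(point_dim=11, hidden=64, n_params=12)
predicted_theta = predictor(torch.randn(200, 11))
```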