AITopics

2311.16831

Country:

North America > United States > Kansas (0.05)
North America > United States > Ohio (0.04)
North America > United States > Mississippi (0.04)
(6 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Obstetrics/Gynecology (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

arXiv.org Machine Learning, Feb-22-2025

Transfer Learning through Enhanced Sufficient Representation: Enriching Source Domain Knowledge with Target Data

Ge, Yeheng, Zhou, Xueyu, Huang, Jian

Transfer learning is an important approach for addressing the challenges posed by limited data availability in various applications. It accomplishes this by transferring knowledge from well-established source domains to a less familiar target domain. However, traditional transfer learning methods often face difficulties due to rigid model assumptions and the need for a high degree of similarity between source and target domain models. In this paper, we introduce a novel method for transfer learning called Transfer learning through Enhanced Sufficient Representation (TESR). Our approach begins by estimating a sufficient and invariant representation from the source domains. This representation is then enhanced with an independent component derived from the target data, ensuring that it is sufficient for the target domain and adaptable to its specific characteristics. A notable advantage of TESR is that it does not rely on assuming similar model structures across different tasks. For example, the source domain models can be regression models, while the target domain task can be classification. This flexibility makes TESR applicable to a wide range of supervised learning problems. We explore the theoretical properties of TESR and validate its performance through simulation studies and real-world data applications, demonstrating its effectiveness in finite sample settings.

artificial intelligence, machine learning, representation, (18 more...)

arXiv.org Machine Learning

2502.20414

Country:

Asia > China > Hong Kong (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.35)

Lawless, Connor, Weng, Tsui-Wei, Ustun, Berk, Udell, Madeleine

Understanding Fixed Predictions via Confined Regions

Machine learning models are designed to predict outcomes using features of an individual, but fail to take into account how individuals can change those features. Consequently, models can assign fixed predictions that deny individuals recourse to change their outcome. This work develops a new paradigm to identify fixed predictions by finding confined regions in which all individuals receive fixed predictions. We introduce the first method, ReVer, for this task, using tools from mixed-integer quadratically constrained programming. Our approach certifies recourse for out-of-sample data, provides interpretable descriptions of confined regions, and runs in seconds on real-world datasets. We conduct a comprehensive empirical study of confined regions across diverse applications. Our results highlight that existing point-wise verification methods fail to discover confined regions, while ReVer provably succeeds.

artificial intelligence, constraint, machine learning, (16 more...)

2502.1638

Country:

North America > United States > Pennsylvania (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report > New Finding (0.88)

Industry:

Information Technology (0.68)
Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

TabGen-ICL: Residual-Aware In-Context Example Selection for Tabular Data Generation

Fang, Liancheng, Liu, Aiwei, Zhang, Hengrui, Zou, Henry Peng, Zhang, Weizhi, Yu, Philip S.

Large language models (LLMs) have achieved encouraging results in tabular data generation. However, existing approaches require fine-tuning, which is computationally expensive. This paper explores an alternative: prompting a fixed LLM with in-context examples. We observe that using randomly selected in-context examples hampers the LLM's performance, resulting in sub-optimal generation quality. To address this, we propose a novel in-context learning framework, TabGen-ICL, to enhance the in-context learning ability of LLMs for tabular data generation. TabGen-ICL operates iteratively, retrieving a subset of real samples that represent the residual between currently generated samples and true data distributions. This approach serves two purposes: locally, it provides more effective in-context learning examples for the LLM in each iteration; globally, it progressively narrows the gap between generated and real data. Extensive experiments on five real-world tabular datasets demonstrate that TabGen-ICL significantly outperforms the random selection strategy. Specifically, it reduces the error rate by a margin of $3.5\%-42.2\%$ on fidelity metrics. We demonstrate for the first time that prompting a fixed LLM can yield high-quality synthetic tabular data. The code is available at https://github.com/fangliancheng/TabGEN-ICL.

dataset, in-context example, llm, (15 more...)

2502.16414

Country:

North America > United States > California (0.05)
North America > United States > New York (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.66)

Practical programming research of Linear DML model based on the simplest Python code: From the standpoint of novice researchers

Yao, Shunxin

This paper presents linear DML models for causal inference using the simplest Python code in a Jupyter notebook on the Anaconda platform and compares the performance of different DML models. The results show that current library API technology is not yet sufficient to enable novice Python users to build qualified, high-quality DML models with the simplest coding approach. Novice users attempting to perform DML causal inference using Python still have to improve their mathematical and computer knowledge to adapt to more flexible DML programming. Additionally, the issue of mismatched outcome variable dimensions is widespread when building linear DML models in a Jupyter notebook.

causal inference, sklearn, train, (12 more...)

2502.16172

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.30)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.47)

Pham, Kieu Thao Nguyen, Sim, Rachael Hwee Ling, Nguyen, Quoc Phong, Ng, See Kiong, Low, Bryan Kian Hsiang

DUPRE: Data Utility Prediction for Efficient Data Valuation

Data valuation is increasingly used in machine learning (ML) to decide the fair compensation for data owners and identify valuable or harmful data for improving ML models. Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility (e.g., validation accuracy) and retraining the ML model for multiple data subsets. While most existing works on efficient estimation of the Shapley values have focused on reducing the number of subsets to evaluate, our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset. Our key contribution lies in the design of our GP kernel based on the sliced Wasserstein distance between empirical data distributions. In particular, we show that the kernel is valid and positive semi-definite, encodes prior knowledge of similarities between different data subsets, and can be efficiently computed. We empirically verify that DUPRE introduces low prediction error and speeds up data valuation for various ML models, datasets, and utility functions.
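To make the distance underlying the kernel concrete, here is a minimal sketch of a Monte Carlo sliced Wasserstein estimate between two data subsets. It is not the DUPRE implementation: the function names `sliced_wasserstein` and `sw_kernel`, the equal-size restriction, the use of the 1-Wasserstein distance, and the exponential kernel form with a `lengthscale` parameter are all assumptions made for illustration.

```python
import math
import random


def sliced_wasserstein(xs, ys, n_projections=50, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance.

    xs and ys are equally sized lists of equal-length tuples (an assumption
    of this sketch; unequal sizes would need quantile interpolation).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized subsets")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction and normalise it onto the unit sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project both subsets onto the direction and sort the projections.
        px = sorted(sum(a * d for a, d in zip(p, direction)) for p in xs)
        py = sorted(sum(a * d for a, d in zip(p, direction)) for p in ys)
        # In one dimension, W1 between equal-size samples is the mean
        # absolute difference of the sorted values.
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections


def sw_kernel(xs, ys, lengthscale=1.0, **kwargs):
    """Hypothetical similarity between subsets: exp(-SW / lengthscale)."""
    return math.exp(-sliced_wasserstein(xs, ys, **kwargs) / lengthscale)
```

Each projection costs only a sort, which is why distances of this kind are cheap to compute compared with retraining a model on every subset.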

coalition, dataset, shapley value, (16 more...)

2502.16152

Country:

Europe > Austria > Vienna (0.14)
North America > Canada > Ontario > Toronto (0.14)
Asia > Singapore > Central Region > Singapore (0.04)
(14 more...)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment (0.34)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

arXiv.org Machine Learning, Feb-22-2025

Rectifying Conformity Scores for Better Conditional Coverage

Plassier, Vincent, Fishkov, Alexander, Dheur, Victor, Guizani, Mohsen, Taieb, Souhaib Ben, Panov, Maxim, Moulines, Eric

We present a new method for generating confidence sets within the split conformal prediction framework. Our method performs a trainable transformation of any given conformity score to improve conditional coverage while ensuring exact marginal coverage. The transformation is based on an estimate of the conditional quantile of conformity scores. The resulting method is particularly beneficial for constructing adaptive confidence sets in multi-output problems where standard conformal quantile regression approaches have limited applicability. We develop a theoretical bound that captures the influence of the accuracy of the quantile estimate on the approximate conditional validity, unlike classical bounds for conformal prediction methods that only offer marginal coverage. We experimentally show that our method is highly adaptive to the local data structure and outperforms existing methods in terms of conditional coverage, improving the reliability of statistical inference in various applications.
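For context, the split conformal baseline that this abstract builds on can be sketched in a few lines. The sketch below shows only the standard calibration step with a generic conformity score; it does not implement the rectified score transformation described above. The helper names `conformal_quantile` and `prediction_interval`, and the choice of absolute residuals as the score, are assumptions for illustration.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    With exchangeable data this threshold yields marginal coverage of at
    least 1 - alpha. If the rank exceeds n, the set must be unbounded.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def prediction_interval(predict, x, q):
    """Interval for the absolute-residual score |y - predict(x)| <= q."""
    center = predict(x)
    return center - q, center + q
```

A typical use is to compute `scores` as absolute residuals of a fitted model on a held-out calibration set, call `conformal_quantile(scores, alpha)`, and pass the result as `q`. The interval width is then the same for every `x`, which is the lack of adaptivity that methods targeting conditional coverage try to address.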

conditional coverage, prediction, rectifying conformity score, (12 more...)

arXiv.org Machine Learning

2502.16336

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > New Finding (0.92)
Instructional Material > Course Syllabus & Notes (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Sharifi, Abbas, Naeini, Hajar Kazemi, Ahmadi, Mohsen, Asadi, Saeed, Varmaghani, Abbas

Multi-Objective Optimization of Water Resource Allocation for Groundwater Recharge and Surface Runoff Management in Watershed Systems

arXiv.org Artificial Intelligence, Feb-21-2025

Land degradation and air pollution are primarily caused by soil salinization and desertification, which follow from the drying of saline lakes and the release of dust from their dried beds into the atmosphere. The complete drying up of a lake has caused an environmental catastrophe for the community. In this study, we present an optimization problem to determine the total surface runoff needed to maintain the level of a saline lake (Urmia Lake). The proposed process has two key stages: identifying the factors that most influence the lake water level, using sensitivity analysis approaches based on historical data, and optimizing the effective variable to stabilize the lake water level under changing design variables. According to the Sobol'-Jansen and Morris techniques, the groundwater level and the total surface runoff flow strongly affect the lake water level, with nonlinear and interacting impacts. The sensitivity analysis suggests that it may be possible to manage lake levels effectively by adjusting total surface runoff. We used genetic algorithms, non-linear optimization, and pattern search techniques to solve the optimization problem. Furthermore, the lake level constraint is established based on a pattern, as a constant number for each month. To maintain a consistent pattern of lake levels, surface runoff must increase by approximately 8.7 times during the filling season and by 33.5 times during the draining season. In the future, the results may serve as a guide for the rehabilitation of the lake.

evolutionary algorithm, machine learning, water level, (18 more...)

2502.15953

Country:

North America > United States (0.46)
Asia > Middle East > Iran (0.15)

Genre:

Research Report > New Finding (0.66)
Research Report > Experimental Study (0.46)

Industry:

Food & Agriculture > Agriculture (1.00)
Energy > Oil & Gas > Upstream (0.67)
Water & Waste Management > Water Management > Water Supplies & Services (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Mazumdar, Arya, Sangwan, Neha

Exact Recovery of Sparse Binary Vectors from Generalized Linear Measurements

arXiv.org Machine Learning, Feb-21-2025

We consider the problem of exact recovery of a $k$-sparse binary vector from generalized linear measurements (such as logistic regression). We analyze the linear estimation algorithm (Plan, Vershynin, Yudovina, 2017), and also show information-theoretic lower bounds on the number of required measurements. As a consequence of our results, for noisy one-bit quantized linear measurements ($\mathsf{1bCSbinary}$), we obtain a sample complexity of $O((k+\sigma^2)\log{n})$, where $\sigma^2$ is the noise variance. This is shown to be optimal due to the information-theoretic lower bound. We also obtain a tight sample complexity characterization for logistic regression. Since $\mathsf{1bCSbinary}$ is a strictly harder problem than noisy linear measurements ($\mathsf{SparseLinearReg}$) because of the added quantization, the same sample complexity is achievable for $\mathsf{SparseLinearReg}$. While this sample complexity can be obtained via the popular lasso algorithm, linear estimation is computationally more efficient. Our lower bound holds for any set of measurements for $\mathsf{SparseLinearReg}$ (a similar bound was known for Gaussian measurement matrices) and is closely matched by the maximum-likelihood upper bound. For $\mathsf{SparseLinearReg}$, it was conjectured in Gamarnik and Zadik, 2017 that there is a statistical-computational gap and that the number of measurements should be at least $(2k+\sigma^2)\log{n}$ for efficient algorithms to exist. It is worth noting that our results imply that there is no such statistical-computational gap for $\mathsf{1bCSbinary}$ and logistic regression.
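The idea behind linear estimation can be illustrated with a small simulation: correlate each coordinate of the measurement vectors with the observed signs and keep the $k$ largest correlations. This is a minimal sketch under assumed settings (Gaussian measurement vectors, additive Gaussian noise before quantization, known $k$); the function names and default parameters are hypothetical, and the paper's exact algorithm and analysis may differ.

```python
import random


def linear_estimate_support(measurements, signs, k):
    """Estimate the support as the k coordinates with the largest sum_i y_i * a_ij."""
    n = len(measurements[0])
    corr = [0.0] * n
    for a, y in zip(measurements, signs):
        for j in range(n):
            corr[j] += y * a[j]
    # Keep the k coordinates most positively correlated with the signs.
    top = sorted(range(n), key=lambda j: corr[j], reverse=True)[:k]
    return sorted(top)


def simulate(n=20, k=3, m=2000, sigma=0.5, seed=0):
    """Generate noisy one-bit measurements of a k-sparse binary vector."""
    rng = random.Random(seed)
    support = sorted(rng.sample(range(n), k))
    measurements, signs = [], []
    for _ in range(m):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Inner product with the binary vector, plus noise, then quantize.
        z = sum(a[j] for j in support) + rng.gauss(0.0, sigma)
        measurements.append(a)
        signs.append(1 if z >= 0 else -1)
    return support, measurements, signs
```

Calling `simulate()` and passing its measurements and signs to `linear_estimate_support` with the same `k` recovers the planted support when the number of measurements is large relative to $(k+\sigma^2)\log n$. The estimator needs a single pass over the data, which is the computational advantage over lasso mentioned above.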

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Machine Learning

2502.16008

Country:

North America > United States > New Jersey > Mercer County > Princeton (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > California > San Diego County > La Jolla (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Lane, Benjamin, Sayer, Simeon

News Sentiment as a Predictor for American Domestic Migration

arXiv.org Artificial Intelligence, Feb-21-2025

This paper examines the effect that US news sentiment from national newspapers has on US interstate migration trends. Using data from the New York Times between 2010 and 2020, an average sentiment score was calculated, allowing the data to be entered into a neural network. Then a logistic regression model was used to predict interstate migration. The results indicate the model was highly accurate, as the mean margin of error was +/- 900 citizens. The predictions from the model were compared with the US Census data from 2010 to 2020 that was used to train the model. Since the input for the model was not exposed to any migration data, the model clearly demonstrated that its results were drawn from sentiment data alone. These findings are significant because they indicate that press coverage could be used as a predictor for domestic migration, which can help the government and businesses better understand what influences people to move to certain places.

migration, news sentiment, sentiment score, (13 more...)

2502.15998

Country:

North America > United States > Washington (0.04)
North America > United States > Rhode Island (0.04)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
North America > United States > District of Columbia > Washington (0.04)

Genre:

Research Report > Experimental Study (0.49)
Research Report > New Finding (0.35)

Industry:

Media > News (0.52)
Government > Regional Government (0.48)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.35)