Goto

Collaborating Authors

 Regression


Polarized Online Discourse on Abortion: Frames and Hostile Expressions among Liberals and Conservatives

arXiv.org Artificial Intelligence

Abortion has been one of the most divisive issues in the United States. Yet, missing is comprehensive longitudinal evidence on how political divides on abortion are reflected in public discourse over time, on a national scale, and in response to key events before and after the overturn of Roe v Wade. We analyze a corpus of over 3.5M tweets related to abortion over the span of one year (January 2022 to January 2023) from over 1.1M users. We estimate users' ideology and rely on state-of-the-art transformer-based classifiers to identify expressions of hostility and extract five prominent frames surrounding abortion. We use those data to examine (a) how prevalent were expressions of hostility (i.e., anger, toxic speech, insults, obscenities, and hate speech), (b) what frames liberals and conservatives used to articulate their positions on abortion, and (c) the prevalence of hostile expressions in liberals and conservative discussions of these frames. We show that liberals and conservatives largely mirrored each other's use of hostile expressions: as liberals used more hostile rhetoric, so did conservatives, especially in response to key events. In addition, the two groups used distinct frames and discussed them in vastly distinct contexts, suggesting that liberals and conservatives have differing perspectives on abortion. Lastly, frames favored by one side provoked hostile reactions from the other: liberals use more hostile expressions when addressing religion, fetal personhood, and exceptions to abortion bans, whereas conservatives use more hostile language when addressing bodily autonomy and women's health. This signals disrespect and derogation, which may further preclude understanding and exacerbate polarization.


Transfer Learning through Enhanced Sufficient Representation: Enriching Source Domain Knowledge with Target Data

arXiv.org Machine Learning

Transfer learning is an important approach for addressing the challenges posed by limited data availability in various applications. It accomplishes this by transferring knowledge from well-established source domains to a less familiar target domain. However, traditional transfer learning methods often face difficulties due to rigid model assumptions and the need for a high degree of similarity between source and target domain models. In this paper, we introduce a novel method for transfer learning called Transfer learning through Enhanced Sufficient Representation (TESR). Our approach begins by estimating a sufficient and invariant representation from the source domains. This representation is then enhanced with an independent component derived from the target data, ensuring that it is sufficient for the target domain and adaptable to its specific characteristics. A notable advantage of TESR is that it does not rely on assuming similar model structures across different tasks. For example, the source domain models can be regression models, while the target domain task can be classification. This flexibility makes TESR applicable to a wide range of supervised learning problems. We explore the theoretical properties of TESR and validate its performance through simulation studies and real-world data applications, demonstrating its effectiveness in finite sample settings.


Understanding Fixed Predictions via Confined Regions

arXiv.org Artificial Intelligence

Machine learning models are designed to predict outcomes using features about an individual, but fail to take into account how individuals can change them. Consequently, models can assign fixed predictions that deny individuals recourse to change their outcome. This work develops a new paradigm to identify fixed predictions by finding confined regions in which all individuals receive fixed predictions. We introduce the first method, ReVer, for this task, using tools from mixed-integer quadratically constrained programming. Our approach certifies recourse for out-of-sample data, provides interpretable descriptions of confined regions, and runs in seconds on real world datasets. We conduct a comprehensive empirical study of confined regions across diverse applications. Our results highlight that existing point-wise verification methods fail to discover confined regions, while ReVer provably succeeds.


TabGen-ICL: Residual-Aware In-Context Example Selection for Tabular Data Generation

arXiv.org Artificial Intelligence

Large Language models (LLMs) have achieved encouraging results in tabular data generation. However, existing approaches require fine-tuning, which is computationally expensive. This paper explores an alternative: prompting a fixed LLM with in-context examples. We observe that using randomly selected in-context examples hampers the LLM's performance, resulting in sub-optimal generation quality. To address this, we propose a novel in-context learning framework: TabGen-ICL, to enhance the in-context learning ability of LLMs for tabular data generation. TabGen-ICL operates iteratively, retrieving a subset of real samples that represent the residual between currently generated samples and true data distributions. This approach serves two purposes: locally, it provides more effective in-context learning examples for the LLM in each iteration; globally, it progressively narrows the gap between generated and real data. Extensive experiments on five real-world tabular datasets demonstrate that TabGen-ICL significantly outperforms the random selection strategy. Specifically, it reduces the error rate by a margin of $3.5\%-42.2\%$ on fidelity metrics. We demonstrate for the first time that prompting a fixed LLM can yield high-quality synthetic tabular data. The code is provided in the \href{https://github.com/fangliancheng/TabGEN-ICL}{link}.


Practical programming research of Linear DML model based on the simplest Python code: From the standpoint of novice researchers

arXiv.org Artificial Intelligence

This paper presents linear DML models for causal inference using the simplest Python code on a Jupyter notebook based on an Anaconda platform and compares the performance of different DML models. The results show that current Library API technology is not yet sufficient to enable novice Python users to build qualified and high-quality DML models with the simplest coding approach. Novice users attempting to perform DML causal inference using Python still have to improve their mathematical and computer knowledge to adapt to more flexible DML programming. Additionally, the issue of mismatched outcome variable dimensions is also widespread when building linear DML models in Jupyter notebook.


DUPRE: Data Utility Prediction for Efficient Data Valuation

arXiv.org Artificial Intelligence

Data valuation is increasingly used in machine learning (ML) to decide the fair compensation for data owners and identify valuable or harmful data for improving ML models. Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility (e.g., validation accuracy) and retraining the ML model for multiple data subsets. While most existing works on efficient estimation of the Shapley values have focused on reducing the number of subsets to evaluate, our framework, \texttt{DUPRE}, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, \texttt{DUPRE} fits a \emph{Gaussian process} (GP) regression model to predict the utility of every other data subset. Our key contribution lies in the design of our GP kernel based on the sliced Wasserstein distance between empirical data distributions. In particular, we show that the kernel is valid and positive semi-definite, encodes prior knowledge of similarities between different data subsets, and can be efficiently computed. We empirically verify that \texttt{DUPRE} introduces low prediction error and speeds up data valuation for various ML models, datasets, and utility functions.


Rectifying Conformity Scores for Better Conditional Coverage

arXiv.org Machine Learning

We present a new method for generating confidence sets within the split conformal prediction framework. Our method performs a trainable transformation of any given conformity score to improve conditional coverage while ensuring exact marginal coverage. The transformation is based on an estimate of the conditional quantile of conformity scores. The resulting method is particularly beneficial for constructing adaptive confidence sets in multi-output problems where standard conformal quantile regression approaches have limited applicability. We develop a theoretical bound that captures the influence of the accuracy of the quantile estimate on the approximate conditional validity, unlike classical bounds for conformal prediction methods that only offer marginal coverage. We experimentally show that our method is highly adaptive to the local data structure and outperforms existing methods in terms of conditional coverage, improving the reliability of statistical inference in various applications.


Multi-Objective Optimization of Water Resource Allocation for Groundwater Recharge and Surface Runoff Management in Watershed Systems

arXiv.org Artificial Intelligence

Land degradation and air pollution are primarily caused by the salinization of soil and desertification that occurs from the drying of salinity lakes and the release of dust into the atmosphere because of their dried bottom. The complete drying up of a lake has caused a community environmental catastrophe. In this study, we presented an optimization problem to determine the total surface runoff to maintain the level of salinity lake (Urmia Lake). The proposed process has two key stages: identifying the influential factors in determining the lake water level using sensitivity analysis approaches based upon historical data and optimizing the effective variable to stabilize the lake water level under changing design variables. Based upon the Sobol'-Jansen and Morris techniques, the groundwater level and total surface runoff flow are highly effective with nonlinear and interacting impacts of the lake water level. As a result of the sensitivity analysis, we found that it may be possible to effectively manage lake levels by adjusting total surface runoff. We used genetic algorithms, non-linear optimization, and pattern search techniques to solve the optimization problem. Furthermore, the lake level constraint is established based on a pattern as a constant number every month. In order to maintain a consistent pattern of lake levels, it is necessary to increase surface runoff by approximately 8.7 times during filling season. It is necessary to increase this quantity by 33.5 times during the draining season. In the future, the results may serve as a guide for the rehabilitation of the lake.


Exact Recovery of Sparse Binary Vectors from Generalized Linear Measurements

arXiv.org Machine Learning

We consider the problem of exact recovery of a $k$-sparse binary vector from generalized linear measurements (such as logistic regression). We analyze the linear estimation algorithm (Plan, Vershynin, Yudovina, 2017), and also show information theoretic lower bounds on the number of required measurements. As a consequence of our results, for noisy one bit quantized linear measurements ($\mathsf{1bCSbinary}$), we obtain a sample complexity of $O((k+\sigma^2)\log{n})$, where $\sigma^2$ is the noise variance. This is shown to be optimal due to the information theoretic lower bound. We also obtain tight sample complexity characterization for logistic regression. Since $\mathsf{1bCSbinary}$ is a strictly harder problem than noisy linear measurements ($\mathsf{SparseLinearReg}$) because of added quantization, the same sample complexity is achievable for $\mathsf{SparseLinearReg}$. While this sample complexity can be obtained via the popular lasso algorithm, linear estimation is computationally more efficient. Our lower bound holds for any set of measurements for $\mathsf{SparseLinearReg}$, (similar bound was known for Gaussian measurement matrices) and is closely matched by the maximum-likelihood upper bound. For $\mathsf{SparseLinearReg}$, it was conjectured in Gamarnik and Zadik, 2017 that there is a statistical-computational gap and the number of measurements should be at least $(2k+\sigma^2)\log{n}$ for efficient algorithms to exist. It is worth noting that our results imply that there is no such statistical-computational gap for $\mathsf{1bCSbinary}$ and logistic regression.


News Sentiment as a Predictor for American Domestic Migration

arXiv.org Artificial Intelligence

This paper goes into depth on the effect that US News Sentiment from national newspapers has on US interstate migration trends. Through harnessing data from the New York Times between 2010 and 2020, an average sentiment score was calculated, allowing for data to be entered into a neural network. Then a logistic regression model was used to predict interstate migration. The results indicate the model was highly accurate as the mean margin of error was +/- 900 citizens. The predictions from the model were compared with the US Census data from 2010 to 2020 that was used to train the model. Since the input for the model was not exposed to any migration data, the model clearly demonstrated that its results were drawn from sentiment data alone. These findings are significant as they indicate that the role of the press could be used as a predictor for domestic migration which can help the government and businesses understand better what is influencing people to move to certain places.