AITopics

2402.01258

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > New York (0.04)
(2 more...)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

arXiv.org Machine LearningFeb-1-2024

On the design-dependent suboptimality of the Lasso

Pathak, Reese, Ma, Cong

This paper investigates the effect of the design matrix on the ability (or inability) to estimate a sparse parameter in linear regression. More specifically, we characterize the optimal rate of estimation when the smallest singular value of the design matrix is bounded away from zero. In addition to this information-theoretic result, we provide and analyze a procedure which is simultaneously statistically optimal and computationally efficient, based on soft thresholding the ordinary least squares estimator. Most surprisingly, we show that the Lasso estimator -- despite its widespread adoption for sparse linear regression -- is provably minimax rate-suboptimal when the minimum singular value is small. We present a family of design matrices and sparse parameters for which we can guarantee that the Lasso with any choice of regularization parameter -- including those which are data-dependent and randomized -- would fail in the sense that its estimation rate is suboptimal by polynomial factors in the sample size. Our lower bound is strong enough to preclude the statistical optimality of all forms of the Lasso, including its highly popular penalized, norm-constrained, and cross-validated variants.

estimator, lasso, regression, (15 more...)

arXiv.org Machine Learning

2402.00382

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.56)

Bahmani, Bahador, Suh, Hyoung Suk, Sun, WaiChing

Discovering interpretable elastoplasticity models via the neural polynomial method enabled symbolic regressions

Conventional neural network elastoplasticity models are often perceived as lacking interpretability. This paper introduces a two-step machine learning approach that returns mathematical models interpretable by human experts. In particular, we introduce a surrogate model where yield surfaces are expressed in terms of a set of single-variable feature mappings obtained from supervised learning. A post-processing step is then used to re-interpret the set of single-variable neural network mapping functions into mathematical form through symbolic regression. This divide-and-conquer approach provides several important advantages. First, it enables us to overcome the scaling issue of symbolic regression algorithms. From a practical perspective, it enhances the portability of learned models for partial differential equation solvers written in different programming languages. Finally, it enables us to have a concrete understanding of the attributes of the materials, such as convexity and symmetries of models, through automated derivations and reasoning. Numerical examples have been provided, along with an open-source code to enable third-party validation.

latexit sha1, neural network, shape function, (16 more...)

doi: 10.1016/j.cma.2024.116827

2307.13149

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Illinois > Cook County > Evanston (0.04)
(2 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Energy (0.92)
Government > Regional Government (0.46)
Government > Military (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.49)

Tan, Zhiquan, Li, Chenghai, Huang, Weiran

The Information of Large Language Model Geometry

This paper investigates the information encoded in the embeddings of large language models (LLMs). We conduct simulations to analyze the representation entropy and discover a power law relationship with model sizes. Building upon this observation, we propose a theory based on (conditional) entropy to elucidate the scaling law phenomenon. Furthermore, we delve into the auto-regressive structure of LLMs and examine the relationship between the last token and previous context tokens using information theory and regression techniques. Specifically, we establish a theoretical connection between the information gain of new tokens and ridge regression. Additionally, we explore the effectiveness of Lasso regression in selecting meaningful tokens, which sometimes outperforms the closely related attention weights. Finally, we conduct controlled experiments, and find that information is distributed across tokens, rather than being concentrated in specific "meaningful" tokens alone.

arxiv preprint arxiv, entropy, language model, (15 more...)

2402.03471

Country:

Europe > France (0.04)
Asia > China > Beijing > Beijing (0.04)
Asia > Russia (0.04)
(17 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Scalable Higher-Order Tensor Product Spline Models

Rügamer, David

In the current era of vast data and transparent machine learning, it is essential for techniques to operate at a large scale while providing a clear mathematical comprehension of the internal workings of the method. Although there already exist interpretable semi-parametric regression methods for large-scale applications that take into account non-linearity in the data, the complexity of the models is still often limited. One of the main challenges is the absence of interactions in these models, which are left out for the sake of better interpretability but also due to impractical computational costs. To overcome this limitation, we propose a new approach using a factorization method to derive a highly scalable higher-order tensor product spline model. Our method allows for the incorporation of all (higher-order) interactions of non-linear feature effects while having computational costs proportional to a model without interactions. We further develop a meaningful penalization scheme and examine the induced optimization problem. We conclude by evaluating the predictive and estimation performance of our method.

ahofm, gam, interaction, (16 more...)

2402.0109

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Spain (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)

Benefits of Transformer: In-Context Learning in Linear Regression Tasks with Unstructured Data

Xing, Yue, Lin, Xiaofeng, Suh, Namjoon, Song, Qifan, Cheng, Guang

In practice, it is observed that transformer-based models can learn concepts in context in the inference stage. While existing literature, e.g., \citet{zhang2023trained,huang2023context}, provide theoretical explanations on this in-context learning ability, they assume the input $x_i$ and the output $y_i$ for each sample are embedded in the same token (i.e., structured data). However, in reality, they are presented in two tokens (i.e., unstructured data \cite{wibisono2023role}). In this case, this paper conducts experiments in linear regression tasks to study the benefits of the architecture of transformers and provides some corresponding theoretical intuitions to explain why the transformer can learn from unstructured data. We study the exact components in a transformer that facilitate the in-context learning. In particular, we observe that (1) a transformer with two layers of softmax (self-)attentions with look-ahead attention mask can learn from the prompt if $y_i$ is in the token next to $x_i$ for each example; (2) positional encoding can further improve the performance; and (3) multi-head attention with a high input embedding dimension has a better prediction performance than single-head attention.

in-context learning, linear regression task, transformer, (12 more...)

2402.00743

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Michigan (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Lemberger, Pirmin, Saillenfest, Antoine

Explaining Text Classifiers with Counterfactual Representations

One well motivated explanation method for classifiers leverages counterfactuals which are hypothetical events identical to real observations in all aspects except for one categorical feature. Constructing such counterfactual poses specific challenges for texts, however, as some attribute values may not necessarily align with plausible real-world events. In this paper we propose a simple method for generating counterfactuals by intervening in the space of text representations which bypasses this limitation. We argue that our interventions are minimally disruptive and that they are theoretically sound as they align with counterfactuals as defined in Pearl's causal inference framework. To validate our method, we first conduct experiments on a synthetic dataset of counterfactuals, allowing for a direct comparison between classifier predictions based on ground truth counterfactuals (obtained through explicit text interventions) and our counterfactuals, derived through interventions in the representation space. Second, we study a real world scenario where our counterfactuals can be leveraged both for explaining a classifier and for bias mitigation.

classifier, counterfactual, information, (16 more...)

2402.00711

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France > Île-de-France > Paris > Paris (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
(4 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Blanquero, Rafael, Carrizosa, Emilio, Ramírez-Cobo, Pepa, Sillero-Denamiel, M. Remedios

A cost-sensitive constrained Lasso

arXiv.org Machine LearningJan-31-2024

The Lasso has become a benchmark data analysis procedure, and numerous variants have been proposed in the literature. Although the Lasso formulations are stated so that overall prediction error is optimized, no full control over the accuracy prediction on certain individuals of interest is allowed. In this work we propose a novel version of the Lasso in which quadratic performance constraints are added to Lasso-based objective functions, in such a way that threshold values are set to bound the prediction errors in the different groups of interest (not necessarily disjoint). As a result, a constrained sparse regression model is defined by a nonlinear optimization problem. This cost-sensitive constrained Lasso has a direct application in heterogeneous samples where data are collected from distinct sources, as it is standard in many biomedical contexts. Both theoretical properties and empirical studies concerning the new method are explored in this paper. In addition, two illustrations of the method on biomedical and sociological contexts are considered.

csclasso, improv, lasso, (16 more...)

arXiv.org Machine Learning

doi: 10.1007/s11634-020-00389-5

2401.18023

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
North America > United States > District of Columbia > Washington (0.04)
North America > United States > New York (0.04)
Europe > Spain > Andalusia > Seville Province > Seville (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Law Enforcement & Public Safety (0.93)
Law (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.88)

Islam, Raisa, Mazumdar, Subhasish, Islam, Rakibul

An Experiment on Feature Selection using Logistic Regression

arXiv.org Artificial IntelligenceJan-31-2024

In supervised machine learning, feature selection plays a very important role by potentially enhancing explainability and performance as measured by computing time and accuracy-related metrics. In this paper, we investigate a method for feature selection based on the well-known L1 and L2 regularization strategies associated with logistic regression (LR). It is well known that the learned coefficients, which serve as weights, can be used to rank the features. Our approach is to synthesize the findings of L1 and L2 regularization. For our experiment, we chose the CIC-IDS2018 dataset owing partly to its size and also to the existence of two problematic classes that are hard to separate. We report first with the exclusion of one of them and then with its inclusion. We ranked features first with L1 and then with L2, and then compared logistic regression with L1 (LR+L1) against that with L2 (LR+L2) by varying the sizes of the feature sets for each of the two rankings. We found no significant difference in accuracy between the two methods once the feature set is selected. We chose a synthesis, i.e., only those features that were present in both the sets obtained from L1 and that from L2, and experimented with it on more complex models like Decision Tree and Random Forest and observed that the accuracy was very close in spite of the small size of the feature set. Additionally, we also report on the standard metrics: accuracy, precision, recall, and f1-score.

accuracy, experiment, regularization, (15 more...)

2402.00201

Country: North America > United States > New Mexico > Socorro County > Socorro (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology > Security & Privacy (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.57)

arXiv.org Artificial IntelligenceJan-31-2024

Continuous Treatment Effects with Surrogate Outcomes

Zeng, Zhenghao, Arbour, David, Feller, Avi, Addanki, Raghavendra, Rossi, Ryan, Sinha, Ritwik, Kennedy, Edward H.

In many causal inference applications, the primary outcomes are missing for a non-trivial number of observations. For instance, in studies on long-term health effects of medical interventions, some measurements require expensive testing and a loss to follow-up is common (Hogan et al., 2004). In evaluating commercial online ad effectiveness, some individuals may drop out from the panel because they use multiple devices (Shankar et al., 2023), leading to missing revenue measures. In many of these studies, however, there often exist short-term outcomes that are easier and faster to measure, e.g., short-term health measures or an online ad's click-through rate, that are observed for a greater share of the sample. These outcomes, which are typically informative about the primary outcomes themselves, are refered to as surrogate outcomes or surrogates. There is a rich causal inference literature addressing missing outcome data. Simply restricting to data with observed primary outcomes may induce strong bias (Hernán and Robins, 2010). Ignoring unlabeled data also reduces the effective sample size for estimating the treatment effects and inflates the variance. Chakrabortty et al. (2022) considered the missing completely at random (MCAR) setting and showed that incorporating unlabeled data reduces variance.

doubly robust estimator, estimation, estimator, (14 more...)

2402.00168

Country: North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Law (0.68)
Education (0.68)
Health & Medicine (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)