AITopics | Regression

Collaborating Authors

Regression

News Overviews Instructional Materials AI-Alerts Classics

GPT in Data Science: A Practical Exploration of Model Selection

Nascimento, Nathalia, Tavares, Cristina, Alencar, Paulo, Cowan, Donald

arXiv.org Artificial IntelligenceNov-19-2023

There is an increasing interest in leveraging Large Language Models (LLMs) for managing structured data and enhancing data science processes. Despite the potential benefits, this integration poses significant questions regarding their reliability and decision-making methodologies. It highlights the importance of various factors in the model selection process, including the nature of the data, problem type, performance metrics, computational resources, interpretability vs accuracy, assumptions about data, and ethical considerations. Our objective is to elucidate and express the factors and assumptions guiding GPT-4's model selection recommendations. We employ a variability model to depict these factors and use toy datasets to evaluate both the model and the implementation of the identified heuristics. By contrasting these outcomes with heuristics from other platforms, our aim is to determine the effectiveness and distinctiveness of GPT-4's methodology. This research is committed to advancing our comprehension of AI decision-making processes, especially in the realm of model selection within data science. Our efforts are directed towards creating AI systems that are more transparent and comprehensible, contributing to a more responsible and efficient practice in data science.

accuracy, dataset, model selection, (16 more...)

arXiv.org Artificial Intelligence

2311.11516

Country: North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)

Genre:

Research Report > Experimental Study (0.48)
Research Report > New Finding (0.30)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.49)
Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.71)

Add feedback

Interpretability in Machine Learning: on the Interplay with Explainability, Predictive Performances and Models

Leblanc, Benjamin, Germain, Pascal

arXiv.org Artificial IntelligenceNov-19-2023

In some areas such as the medical field, ML-assisted predictions or decisions can drastically impact human life. For example, breast cancer [131] can be devastating if not diagnosed in time (or at all). The use of black-box predictors in these crucial cases has deceived more than once: a classical example of which is the use of the COMPAS system by the USA judiciary system for predicting criminal recidivism [133]. Other cases where fairness has been jeopardized by the use of black-boxes are numerous: job and loan applications biased toward men [40]; mortgage-approval biased toward white applicants [122]; higher credit card limits for men [172]; etc. With time, it became clear that interpretability is crucial when it comes to understanding how a predictor behaves and thus preventing unfortunate events; as pointed out by Goodman and Flaxman [70]: "If we do not know how ML [predictors] work, we cannot check or regulate them to ensure that they do not encode discrimination against minorities [...], we will not be able to learn from instances in which it is mistaken."

explanation, interpretability, predictor, (13 more...)

arXiv.org Artificial Intelligence

2311.11491

Country:

North America > Canada (0.04)
North America > United States > Virginia (0.04)
North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
(2 more...)

Genre:

Overview (1.00)
Research Report (0.64)

Industry:

Banking & Finance (1.00)
Information Technology > Security & Privacy (0.93)
Health & Medicine > Therapeutic Area > Oncology (0.34)

Technology:

Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)
(3 more...)

Add feedback

Data Selection for Language Models via Importance Resampling

Xie, Sang Michael, Santurkar, Shibani, Ma, Tengyu, Liang, Percy

arXiv.org Artificial IntelligenceNov-18-2023

Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution given unlabeled target samples. Due to the scale and dimensionality of the raw text data, existing methods use simple heuristics or require human experts to manually curate data. Instead, we extend the classic importance resampling approach used in low-dimensions for LM data selection. We propose Data Selection with Importance Resampling (DSIR), an efficient and scalable framework that estimates importance weights in a reduced feature space for tractability and selects data with importance resampling according to these weights. We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents from the full Pile dataset in 4.5 hours. To measure whether hashed n-gram features preserve the aspects of the data that are relevant to the target, we define KL reduction, a data metric that measures the proximity between the selected pretraining data and the target on some feature space. Across 8 data selection methods (including expert selection), KL reduction on hashed n-gram features highly correlates with average downstream accuracy (r=0.82). When selecting data for continued pretraining on a specific domain, DSIR performs comparably to expert curation across 8 target distributions. When pretraining general-domain models (target is Wikipedia and books), DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark. Code is available at https://github.com/p-lambda/dsir.

dataset, dsir, selection, (15 more...)

arXiv.org Artificial Intelligence

2302.03169

Country:

Asia > Middle East > Jordan (0.04)
Europe > Italy > Tuscany > Florence (0.04)
North America > United States > Indiana (0.04)
(7 more...)

Genre: Research Report (1.00)

Industry: Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Add feedback

Exponentially Convergent Algorithms for Supervised Matrix Factorization

Lee, Joowon, Lyu, Hanbaek, Yao, Weixin

arXiv.org Machine LearningNov-18-2023

Supervised matrix factorization (SMF) is a classical machine learning method that simultaneously seeks feature extraction and classification tasks, which are not necessarily a priori aligned objectives. Our goal is to use SMF to learn low-rank latent factors that offer interpretable, data-reconstructive, and class-discriminative features, addressing challenges posed by high-dimensional data. Training SMF model involves solving a nonconvex and possibly constrained optimization with at least three blocks of parameters. Known algorithms are either heuristic or provide weak convergence guarantees for special cases. In this paper, we provide a novel framework that 'lifts' SMF as a low-rank matrix estimation problem in a combined factor space and propose an efficient algorithm that provably converges exponentially fast to a global minimizer of the objective with arbitrary initialization under mild assumptions. Our framework applies to a wide range of SMF-type problems for multi-class classification with auxiliary features. To showcase an application, we demonstrate that our algorithm successfully identified well-known cancer-associated gene groups for various cancers.

algorithm, artificial intelligence, machine learning, (19 more...)

arXiv.org Machine Learning

2311.11182

Country:

North America > United States > Wisconsin > Dane County > Madison (0.14)
North America > United States > California > Riverside County > Riverside (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.69)

Add feedback

Short-term Volatility Estimation for High Frequency Trades using Gaussian processes (GPs)

Mushunje, Leonard, Mashasha, Maxwell, Chandiwana, Edina

arXiv.org Artificial IntelligenceNov-17-2023

The fundamental theorem behind financial markets is that stock prices are intrinsically complex and stochastic. One of the complexities is the volatility associated with stock prices. Volatility is a tendency for prices to change unexpectedly [1]. Price volatility is often detrimental to the return economics, and thus, investors should factor it in whenever making investment decisions, choices, and temporal or permanent moves. It is, therefore, crucial to make necessary and regular short and long-term stock price volatility forecasts for the safety and economics of investors returns. These forecasts should be accurate and not misleading. Different models and methods, such as ARCH GARCH models, have been intuitively implemented to make such forecasts. However, such traditional means fail to capture the short-term volatility forecasts effectively. This paper, therefore, investigates and implements a combination of numeric and probabilistic models for short-term volatility and return forecasting for high-frequency trades. The essence is that one-day-ahead volatility forecasts were made with Gaussian Processes (GPs) applied to the outputs of a Numerical market prediction (NMP) model. Firstly, the stock price data from NMP was corrected by a GP. Since it is not easy to set price limits in a market due to its free nature and randomness, a Censored GP was used to model the relationship between the corrected stock prices and returns. Forecasting errors were evaluated using the implied and estimated data.

stock price, stock return, volatility, (14 more...)

arXiv.org Artificial Intelligence

2311.10935

Country:

Africa > South Africa > Gauteng > Johannesburg (0.05)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Add feedback

DUET: 2D Structured and Approximately Equivariant Representations

Suau, Xavier, Danieli, Federico, Keller, T. Anderson, Blaas, Arno, Huang, Chen, Ramapuram, Jason, Busbridge, Dan, Zappella, Luca

arXiv.org Artificial IntelligenceNov-17-2023

Multiview Self-Supervised Learning (MSSL) is based on learning invariances with respect to a set of input transformations. However, invariance partially or totally removes transformation-related information from the representations, which might harm performance for specific downstream tasks that require such information. We propose 2D strUctured and EquivarianT representations (coined DUET), which are 2d representations organized in a matrix structure, and equivariant with respect to transformations acting on the input data. DUET representations maintain information about an input transformation, while remaining semantically expressive. Compared to SimCLR (Chen et al., 2020) (unstructured and invariant) and ESSL (Dangovski et al., 2022) (unstructured and equivariant), the structured and equivariant nature of DUET representations enables controlled generation with lower reconstruction error, while controllability is not possible with SimCLR or ESSL. DUET also achieves higher accuracy for several discriminative tasks, and improves transfer learning.

duet, representation, transformation, (16 more...)

arXiv.org Artificial Intelligence

2306.16058

Country:

North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Add feedback

Multiscale Hodge Scattering Networks for Data Analysis

Saito, Naoki, Schonsheck, Stefan C., Shvarts, Eugene

arXiv.org Machine LearningNov-16-2023

We propose new scattering networks for signals measured on simplicial complexes, which we call \emph{Multiscale Hodge Scattering Networks} (MHSNs). Our construction is based on multiscale basis dictionaries on simplicial complexes, i.e., the $\kappa$-GHWT and $\kappa$-HGLET, which we recently developed for simplices of dimension $\kappa \in \N$ in a given simplicial complex by generalizing the node-based Generalized Haar-Walsh Transform (GHWT) and Hierarchical Graph Laplacian Eigen Transform (HGLET). The $\kappa$-GHWT and the $\kk$-HGLET both form redundant sets (i.e., dictionaries) of multiscale basis vectors and the corresponding expansion coefficients of a given signal. Our MHSNs use a layered structure analogous to a convolutional neural network (CNN) to cascade the moments of the modulus of the dictionary coefficients. The resulting features are invariant to reordering of the simplices (i.e., node permutation of the underlying graphs). Importantly, the use of multiscale basis dictionaries in our MHSNs admits a natural pooling operation that is akin to local pooling in CNNs, and which may be performed either locally or per-scale. These pooling operations are harder to define in both traditional scattering networks based on Morlet wavelets, and geometric scattering networks based on Diffusion Wavelets. As a result, we are able to extract a rich set of descriptive yet robust features that can be used along with very simple machine learning methods (i.e., logistic regression or support vector machines) to achieve high-accuracy classification systems with far fewer parameters to train than most modern graph neural networks. Finally, we demonstrate the usefulness of our MHSNs in three distinct types of problems: signal classification, domain (i.e., graph/simplex) classification, and molecular dynamics prediction.

artificial intelligence, graph, machine learning, (16 more...)

arXiv.org Machine Learning

2311.1027

Country:

North America > United States > California > Yolo County > Davis (0.04)
North America > United States > Massachusetts > Middlesex County > Burlington (0.04)

Genre:

Research Report > New Finding (0.34)
Research Report > Experimental Study (0.34)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)

Add feedback

Co-data Learning for Bayesian Additive Regression Trees

Goedhart, Jeroen M., Klausch, Thomas, Janssen, Jurriaan, van de Wiel, Mark A.

arXiv.org Machine LearningNov-16-2023

Medical prediction applications often need to deal with small sample sizes compared to the number of covariates. Such data pose problems for prediction and variable selection, especially when the covariate-response relationship is complicated. To address these challenges, we propose to incorporate co-data, i.e. external information on the covariates, into Bayesian additive regression trees (BART), a sum-of-trees prediction model that utilizes priors on the tree parameters to prevent overfitting. To incorporate co-data, an empirical Bayes (EB) framework is developed that estimates, assisted by a co-data model, prior covariate weights in the BART model. The proposed method can handle multiple types of co-data simultaneously. Furthermore, the proposed EB framework enables the estimation of the other hyperparameters of BART as well, rendering an appealing alternative to cross-validation. We show that the method finds relevant covariates and that it improves prediction compared to default BART in simulations. If the covariate-response relationship is nonlinear, the method benefits from the flexibility of BART to outperform regression-based co-data learners. Finally, the use of co-data enhances prediction in an application to diffuse large B-cell lymphoma prognosis based on clinical covariates, gene mutations, DNA translocations, and DNA copy number data. Keywords: Bayesian additive regression trees; Empirical Bayes; Co-data; High-dimensional data; Omics; Prediction

artificial intelligence, covariate, machine learning, (20 more...)

arXiv.org Machine Learning

2311.09997

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
North America > United States (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.66)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.87)
Health & Medicine > Therapeutic Area > Oncology > Lymphoma (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback

Biomembrane-based Memcapacitive Reservoir Computing System for Energy Efficient Temporal Data Processing

Hossain, Md Razuan, Mohamed, Ahmed Salah, Armendarez, Nicholas Xavier, Najem, Joseph S., Hasan, Md Sakib

arXiv.org Artificial IntelligenceNov-15-2023

Reservoir computing is a highly efficient machine learning framework for processing temporal data by extracting features from the input signal and mapping them into higher dimensional spaces. Physical reservoir layers have been realized using spintronic oscillators, atomic switch networks, silicon photonic modules, ferroelectric transistors, and volatile memristors. However, these devices are intrinsically energy-dissipative due to their resistive nature, which leads to increased power consumption. Therefore, capacitive memory devices can provide a more energy-efficient approach. Here, we leverage volatile biomembrane-based memcapacitors that closely mimic certain short-term synaptic plasticity functions as reservoirs to solve classification tasks and analyze time-series data in simulation and experimentally. Our system achieves a 99.6% accuracy rate for spoken digit classification and a normalized mean square error of 7.81*10^{-4} in a second-order non-linear regression task. Furthermore, to showcase the device's real-time temporal data processing capability, we achieve 100% accuracy for a real-time epilepsy detection problem from an inputted electroencephalography (EEG) signal. Most importantly, we demonstrate that each memcapacitor consumes an average of 41.5 fJ of energy per spike, regardless of the selected input voltage pulse width, while maintaining an average power of 415 fW for a pulse width of 100 ms. These values are orders of magnitude lower than those achieved by state-of-the-art memristors used as reservoirs. Lastly, we believe the biocompatible, soft nature of our memcapacitor makes it highly suitable for computing and signal-processing applications in biological environments.

artificial intelligence, machine learning, memcapacitor, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1002/aisy.202300346

2305.12025

Country:

North America > United States > Mississippi > Lafayette County > Oxford (0.14)
Europe > Germany (0.14)
North America > United States > Pennsylvania > Centre County (0.14)
(2 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Energy > Oil & Gas > Upstream (1.00)
Health & Medicine > Therapeutic Area > Neurology > Epilepsy (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.86)

Add feedback

Challenges for Predictive Modeling with Neural Network Techniques using Error-Prone Dietary Intake Data

Spicker, Dylan, Nazemi, Amir, Hutchinson, Joy, Fieguth, Paul, Kirkpatrick, Sharon I., Wallace, Michael, Dodd, Kevin W.

arXiv.org Artificial IntelligenceNov-15-2023

Dietary intake data are routinely drawn upon to explore diet-health relationships. However, these data are often subject to measurement error, distorting the true relationships. Beyond measurement error, there are likely complex synergistic and sometimes antagonistic interactions between different dietary components, complicating the relationships between diet and health outcomes. Flexible models are required to capture the nuance that these complex interactions introduce. This complexity makes research on diet-health relationships an appealing candidate for the application of machine learning techniques, and in particular, neural networks. Neural networks are computational models that are able to capture highly complex, nonlinear relationships so long as sufficient data are available. While these models have been applied in many domains, the impacts of measurement error on the performance of predictive modeling has not been systematically investigated. However, dietary intake data are typically collected using self-report methods and are prone to large amounts of measurement error. In this work, we demonstrate the ways in which measurement error erodes the performance of neural networks, and illustrate the care that is required for leveraging these models in the presence of error. We demonstrate the role that sample size and replicate measurements play on model performance, indicate a motivation for the investigation of transformations to additivity, and illustrate the caution required to prevent model overfitting. While the past performance of neural networks across various domains make them an attractive candidate for examining diet-health relationships, our work demonstrates that substantial care and further methodological development are both required to observe increased predictive performance when applying these techniques, compared to more traditional statistical procedures.

measurement error, neural network, predictor, (15 more...)

arXiv.org Artificial Intelligence

2311.09338

Country:

North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Maryland (0.04)
(2 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Consumer Health (1.00)
Education > Health & Safety > School Nutrition (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.49)

Add feedback