AITopics

2411.18085

Country:

Asia > China > Beijing > Beijing (0.06)
Asia > China > Shanghai > Shanghai (0.05)
Asia > China > Guangdong Province > Shenzhen (0.05)
(5 more...)

Genre: Research Report > New Finding (0.66)

Industry: Banking & Finance > Real Estate (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)

Zabaleta, Miguel, Lehman, Joel

Simulating Tabular Datasets through LLMs to Rapidly Explore Hypotheses about Real-World Entities

arXiv.org Artificial IntelligenceNov-27-2024

Do horror writers have worse childhoods than other writers? Though biographical details are known about many writers, quantitatively exploring such a qualitative hypothesis requires significant human effort, e.g. to sift through many biographies and interviews of writers and to iteratively search for quantitative features that reflect what is qualitatively of interest. This paper explores the potential to quickly prototype these kinds of hypotheses through (1) applying LLMs to estimate properties of concrete entities like specific people, companies, books, kinds of animals, and countries; (2) performing off-the-shelf analysis methods to reveal possible relationships among such properties (e.g. linear regression); and towards further automation, (3) applying LLMs to suggest the quantitative properties themselves that could help ground a particular qualitative hypothesis (e.g. number of adverse childhood events, in the context of the running example). The hope is to allow sifting through hypotheses more quickly through collaboration between human and machine. Our experiments highlight that indeed, LLMs can serve as useful estimators of tabular data about specific entities across a range of domains, and that such estimations improve with model scale. Further, initial experiments demonstrate the potential of LLMs to map a qualitative hypothesis of interest to relevant concrete variables that the LLM can then estimate. The conclusion is that LLMs offer intriguing potential to help illuminate scientifically interesting patterns latent within the internet-scale data they are trained upon.

dataset, hypothesis, population age, (13 more...)

2411.18071

Country:

Africa > Namibia (0.05)
South America > Uruguay (0.04)
South America > Suriname (0.04)
(148 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Public Health (1.00)
Leisure & Entertainment > Sports (0.69)
Health & Medicine > Therapeutic Area > Pediatrics/Neonatology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

arXiv.org Machine LearningNov-27-2024

Ridge Regression for Manifold-valued Time-Series with Application to Meteorological Forecast

Nava-Yazdani, Esfandiar

We propose a natural intrinsic extension of the ridge regression from Euclidean spaces to general manifolds, which relies on Riemannian least-squares fitting, empirical covariance, and Mahalanobis distance. We utilize it for time-series prediction and apply the approach to forecast hurricane tracks and their wind speeds.

manifold, regression, trajectory, (15 more...)

arXiv.org Machine Learning

2411.18339

Country:

North America > United States > New Jersey > Mercer County > Princeton (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Germany > Berlin (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.47)

Khan, Mohammad Zubair, Li, David

Dynamic Logistic Ensembles with Recursive Probability and Automatic Subset Splitting for Enhanced Binary Classification

arXiv.org Artificial IntelligenceNov-26-2024

This paper presents a novel approach to binary classification using dynamic logistic ensemble models. The proposed method addresses the challenges posed by datasets containing inherent internal clusters that lack explicit feature-based separations. By extending traditional logistic regression, we develop an algorithm that automatically partitions the dataset into multiple subsets, constructing an ensemble of logistic models to enhance classification accuracy. A key innovation in this work is the recursive probability calculation, derived through algebraic manipulation and mathematical induction, which enables scalable and efficient model construction. Compared to traditional ensemble methods such as Bagging and Boosting, our approach maintains interpretability while offering competitive performance. Furthermore, we systematically employ maximum likelihood and cost functions to facilitate the analytical derivation of recursive gradients as functions of ensemble depth. The effectiveness of the proposed approach is validated on a custom dataset created by introducing noise and shifting data to simulate group structures, resulting in significant performance improvements with layers. Implemented in Python, this work balances computational efficiency with theoretical rigor, providing a robust and interpretable solution for complex classification tasks with broad implications for machine learning applications. Code at https://github.com/ensemble-art/Dynamic-Logistic-Ensembles

artificial intelligence, dataset, machine learning, (17 more...)

doi: 10.1109/UEMCON62879.2024.10754761

2411.18649

Country: North America > United States > New York > New York County > New York City (0.04)

Genre:

Research Report > New Finding (0.72)
Research Report > Experimental Study (0.53)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.35)

arXiv.org Machine LearningNov-26-2024

Optimized Conformal Selection: Powerful Selective Inference After Conformity Score Optimization

Bai, Tian, Jin, Ying

Model selection/optimization in conformal inference is challenging, since it may break the exchangeability between labeled and unlabeled data. We study this problem in the context of conformal selection, which uses conformal p-values to select ``interesting'' instances with large unobserved labels from a pool of unlabeled data, while controlling the FDR in finite sample. For validity, existing solutions require the model choice to be independent of the data used to construct the p-values and calibrate the selection set. However, when presented with many model choices and limited labeled data, it is desirable to (i) select the best model in a data-driven manner, and (ii) mitigate power loss due to sample splitting. This paper presents OptCS, a general framework that allows valid statistical testing (selection) after flexible data-driven model optimization. We introduce general conditions under which OptCS constructs valid conformal p-values despite substantial data reuse and handles complex p-value dependencies to maintain finite-sample FDR control via a novel multiple testing procedure. We instantiate this general recipe to propose three FDR-controlling procedures, each optimizing the models differently: (i) selecting the most powerful one among multiple pre-trained candidate models, (ii) using all data for model fitting without sample splitting, and (iii) combining full-sample model fitting and selection. We demonstrate the efficacy of our methods via simulation studies and real applications in drug discovery and alignment of large language models in radiology report generation.

model selection, procedure, selection, (13 more...)

arXiv.org Machine Learning

2411.17983

Country: North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.67)

Nyangon, Joseph, Akintunde, Ruth

Anomaly Detection in California Electricity Price Forecasting: Enhancing Accuracy and Reliability Using Principal Component Analysis

Accurate and reliable electricity price forecasting has significant practical implications for grid management, renewable energy integration, power system planning, and price volatility management. This study focuses on enhancing electricity price forecasting in California's grid, addressing challenges from complex generation data and heteroskedasticity. Utilizing principal component analysis (PCA), we analyze CAISO's hourly electricity prices and demand from 2016-2021 to improve day-ahead forecasting accuracy. Initially, we apply traditional outlier analysis with the interquartile range method, followed by robust PCA (RPCA) for more effective outlier elimination. This approach improves data symmetry and reduces skewness. We then construct multiple linear regression models using both raw and PCA-transformed features. The model with transformed features, refined through traditional and SAS Sparse Matrix outlier removal methods, shows superior forecasting performance. The SAS Sparse Matrix method, in particular, significantly enhances model accuracy. Our findings demonstrate that PCA-based methods are key in advancing electricity price forecasting, supporting renewable integration and grid management in day-ahead markets. Keywords: Electricity price forecasting, principal component analysis (PCA), power system planning, heteroskedasticity, renewable energy integration.

data mining, forecasting, machine learning, (16 more...)

doi: 10.1002/wene.504

2412.07787

Country:

North America > United States > North Carolina > Wake County > Cary (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > District of Columbia > Washington (0.04)
(5 more...)

Genre: Research Report > New Finding (0.86)

Industry:

Energy > Renewable > Solar (1.00)
Energy > Power Industry > Utilities (0.93)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Principal Component Analysis (0.82)

Machine learning for cerebral blood vessels' malformations

Topal, Irem, Cherevko, Alexander, Bugay, Yuri, Shishlenin, Maxim, Barbier, Jean, Eroglu, Deniz, Roldán, Édgar, Belousov, Roman

Cerebral aneurysms and arteriovenous malformations are life-threatening hemodynamic pathologies of the brain. While surgical intervention is often essential to prevent fatal outcomes, it carries significant risks both during the procedure and in the postoperative period, making the management of these conditions highly challenging. Parameters of cerebral blood flow, routinely monitored during medical interventions, could potentially be utilized in machine learning-assisted protocols for risk assessment and therapeutic prognosis. To this end, we developed a linear oscillatory model of blood velocity and pressure for clinical data acquired from neurosurgical operations. Using the method of Sparse Identification of Nonlinear Dynamics (SINDy), the parameters of our model can be reconstructed online within milliseconds from a short time series of the hemodynamic variables. The identified parameter values enable automated classification of the blood-flow pathologies by means of logistic regression, achieving an accuracy of 73 %. Our results demonstrate the potential of this model for both diagnostic and prognostic applications, providing a robust and interpretable framework for assessing cerebral blood vessel conditions.

equation, malformation, time sery, (16 more...)

2411.16349

Country:

Europe > Russia (0.04)
Asia > Russia > Siberian Federal District > Novosibirsk Oblast > Novosibirsk (0.04)
North America > United States > Illinois > Champaign County > Champaign (0.04)
(6 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Surgery (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.35)

LDACP: Long-Delayed Ad Conversions Prediction Model for Bidding Strategy

Cui, Peng, Yang, Yiming, Jin, Fusheng, Tang, Siyuan, Wang, Yunli, Yang, Fukang, Jia, Yalong, Cai, Qingpeng, Pan, Fei, Li, Changcheng, Jiang, Peng

In online advertising, once an ad campaign is deployed, the automated bidding system dynamically adjusts the bidding strategy to optimize Cost Per Action (CPA) based on the number of ad conversions. For ads with a long conversion delay, relying solely on the real-time tracked conversion number as a signal for bidding strategy can significantly overestimate the current CPA, leading to conservative bidding strategies. Therefore, it is crucial to predict the number of long-delayed conversions. Nonetheless, it is challenging to predict ad conversion numbers through traditional regression methods due to the wide range of ad conversion numbers. Previous regression works have addressed this challenge by transforming regression problems into bucket classification problems, achieving success in various scenarios. However, specific challenges arise when predicting the number of ad conversions: 1) The integer nature of ad conversion numbers exacerbates the discontinuity issue in one-hot hard labels; 2) The long-tail distribution of ad conversion numbers complicates tail data prediction. In this paper, we propose the Long-Delayed Ad Conversions Prediction model for bidding strategy (LDACP), which consists of two sub-modules. To alleviate the issue of discontinuity in one-hot hard labels, the Bucket Classification Module with label Smoothing method (BCMS) converts one-hot hard labels into non-normalized soft labels, then fits these soft labels by minimizing classification loss and regression loss. To address the challenge of predicting tail data, the Value Regression Module with Proxy labels (VRMP) uses the prediction bias of aggregated pCTCVR as proxy labels. Finally, a Mixture of Experts (MoE) structure integrates the predictions from BCMS and VRMP to obtain the final predicted ad conversion number.

ad conversion, conversion, conversion number, (15 more...)

2411.16095

Country:

Asia > China > Beijing > Beijing (0.06)
Oceania > Australia > New South Wales > Sydney (0.05)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.48)

Industry:

Marketing (0.88)
Information Technology > Services (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.50)

Diagnosing Hate Speech Classification: Where Do Humans and Machines Disagree, and Why?

Yang, Xilin

This study uses the cosine similarity ratio, embedding regression, and manual re-annotation to diagnose hate speech classification. We begin by computing cosine similarity ratio on a dataset "Measuring Hate Speech" that contains 135,556 annotated comments on social media. This way, we show a basic use of cosine similarity as a description of hate speech content. We then diagnose hate speech classification starting from understanding the inconsistency of human annotation from the dataset. Using embedding regression as a basic diagnostic, we found that female annotators are more sensitive to racial slurs that target the black population. We perform with a more complicated diagnostic by training a hate speech classifier using a SoTA pre-trained large language model, NV-Embed-v2, to convert texts to embeddings and run a logistic regression. This classifier achieves a testing accuracy of 94%. In diagnosing where machines disagree with human annotators, we found that machines make fewer mistakes than humans despite the fact that human annotations are treated as ground truth in the training set. Machines perform better in correctly labeling long statements of facts, but perform worse in labeling short instances of swear words. We hypothesize that this is due to model alignment - while curating models at their creation prevents the models from producing obvious hate speech, it also reduces the model's ability to detect such content.

annotator, regression, speech, (13 more...)

2410.10153

Country: Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)

Genre: Research Report > Experimental Study (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.90)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.89)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)

Valentin, Six, Alexandre, Chidiac, Arkin, Worlikar

Evaluating Rank-N-Contrast: Continuous and Robust Representations for Regression

arXiv.org Machine LearningNov-25-2024

This document is a replication of the original "Rank-N-Contrast" (arXiv:2210.01189v2) paper published in 2023. This evaluation is done for academic purposes. Deep regression models often fail to capture the continuous nature of sample orders, creating fragmented representations and suboptimal performance. To address this, we reproduced the Rank-N-Contrast (RNC) framework, which learns continuous representations by contrasting samples by their rankings in the target space. Our study validates RNC's theoretical and empirical benefits, including improved performance and robustness. We extended the evaluation to an additional regression dataset and conducted robustness tests using a holdout method, where a specific range of continuous data was excluded from the training set. This approach assessed the model's ability to generalise to unseen data and achieve state-of-the-art performance. This replication study validates the original findings and broadens the understanding of RNC's applicability and robustness.

dataset, regression task, representation, (13 more...)

arXiv.org Machine Learning

2411.16298

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.50)