Regression
MONOPOLY: Learning to Price Public Facilities for Revaluing Private Properties with Large-Scale Urban Data
Fan, Miao, Huang, Jizhou, Zhuo, An, Li, Ying, Li, Ping, Wang, Haifeng
The value assessment of private properties is an attractive but challenging task which is widely concerned by a majority of people around the world. A prolonged topic among us is ``\textit{how much is my house worth?}''. To answer this question, most experienced agencies would like to price a property given the factors of its attributes as well as the demographics and the public facilities around it. However, no one knows the exact prices of these factors, especially the values of public facilities which may help assess private properties. In this paper, we introduce our newly launched project ``Monopoly'' (named after a classic board game) in which we propose a distributed approach for revaluing private properties by learning to price public facilities (such as hospitals etc.) with the large-scale urban data we have accumulated via Baidu Maps. To be specific, our method organizes many points of interest (POIs) into an undirected weighted graph and formulates multiple factors including the virtual prices of surrounding public facilities as adaptive variables to parallelly estimate the housing prices we know. Then the prices of both public facilities and private properties can be iteratively updated according to the loss of prediction until convergence. We have conducted extensive experiments with the large-scale urban data of several metropolises in China. Results show that our approach outperforms several mainstream methods with significant margins. Further insights from more in-depth discussions demonstrate that the ``Monopoly'' is an innovative application in the interdisciplinary field of business intelligence and urban computing, and it will be beneficial to tens of millions of our users for investments and to the governments for urban planning as well as taxation.
Simulating Tabular Datasets through LLMs to Rapidly Explore Hypotheses about Real-World Entities
Zabaleta, Miguel, Lehman, Joel
Do horror writers have worse childhoods than other writers? Though biographical details are known about many writers, quantitatively exploring such a qualitative hypothesis requires significant human effort, e.g. to sift through many biographies and interviews of writers and to iteratively search for quantitative features that reflect what is qualitatively of interest. This paper explores the potential to quickly prototype these kinds of hypotheses through (1) applying LLMs to estimate properties of concrete entities like specific people, companies, books, kinds of animals, and countries; (2) performing off-the-shelf analysis methods to reveal possible relationships among such properties (e.g. linear regression); and towards further automation, (3) applying LLMs to suggest the quantitative properties themselves that could help ground a particular qualitative hypothesis (e.g. number of adverse childhood events, in the context of the running example). The hope is to allow sifting through hypotheses more quickly through collaboration between human and machine. Our experiments highlight that indeed, LLMs can serve as useful estimators of tabular data about specific entities across a range of domains, and that such estimations improve with model scale. Further, initial experiments demonstrate the potential of LLMs to map a qualitative hypothesis of interest to relevant concrete variables that the LLM can then estimate. The conclusion is that LLMs offer intriguing potential to help illuminate scientifically interesting patterns latent within the internet-scale data they are trained upon.
Ridge Regression for Manifold-valued Time-Series with Application to Meteorological Forecast
We propose a natural intrinsic extension of the ridge regression from Euclidean spaces to general manifolds, which relies on Riemannian least-squares fitting, empirical covariance, and Mahalanobis distance. We utilize it for time-series prediction and apply the approach to forecast hurricane tracks and their wind speeds.
Dynamic Logistic Ensembles with Recursive Probability and Automatic Subset Splitting for Enhanced Binary Classification
Khan, Mohammad Zubair, Li, David
This paper presents a novel approach to binary classification using dynamic logistic ensemble models. The proposed method addresses the challenges posed by datasets containing inherent internal clusters that lack explicit feature-based separations. By extending traditional logistic regression, we develop an algorithm that automatically partitions the dataset into multiple subsets, constructing an ensemble of logistic models to enhance classification accuracy. A key innovation in this work is the recursive probability calculation, derived through algebraic manipulation and mathematical induction, which enables scalable and efficient model construction. Compared to traditional ensemble methods such as Bagging and Boosting, our approach maintains interpretability while offering competitive performance. Furthermore, we systematically employ maximum likelihood and cost functions to facilitate the analytical derivation of recursive gradients as functions of ensemble depth. The effectiveness of the proposed approach is validated on a custom dataset created by introducing noise and shifting data to simulate group structures, resulting in significant performance improvements with layers. Implemented in Python, this work balances computational efficiency with theoretical rigor, providing a robust and interpretable solution for complex classification tasks with broad implications for machine learning applications. Code at https://github.com/ensemble-art/Dynamic-Logistic-Ensembles
Optimized Conformal Selection: Powerful Selective Inference After Conformity Score Optimization
Model selection/optimization in conformal inference is challenging, since it may break the exchangeability between labeled and unlabeled data. We study this problem in the context of conformal selection, which uses conformal p-values to select ``interesting'' instances with large unobserved labels from a pool of unlabeled data, while controlling the FDR in finite sample. For validity, existing solutions require the model choice to be independent of the data used to construct the p-values and calibrate the selection set. However, when presented with many model choices and limited labeled data, it is desirable to (i) select the best model in a data-driven manner, and (ii) mitigate power loss due to sample splitting. This paper presents OptCS, a general framework that allows valid statistical testing (selection) after flexible data-driven model optimization. We introduce general conditions under which OptCS constructs valid conformal p-values despite substantial data reuse and handles complex p-value dependencies to maintain finite-sample FDR control via a novel multiple testing procedure. We instantiate this general recipe to propose three FDR-controlling procedures, each optimizing the models differently: (i) selecting the most powerful one among multiple pre-trained candidate models, (ii) using all data for model fitting without sample splitting, and (iii) combining full-sample model fitting and selection. We demonstrate the efficacy of our methods via simulation studies and real applications in drug discovery and alignment of large language models in radiology report generation.
Anomaly Detection in California Electricity Price Forecasting: Enhancing Accuracy and Reliability Using Principal Component Analysis
Nyangon, Joseph, Akintunde, Ruth
Accurate and reliable electricity price forecasting has significant practical implications for grid management, renewable energy integration, power system planning, and price volatility management. This study focuses on enhancing electricity price forecasting in California's grid, addressing challenges from complex generation data and heteroskedasticity. Utilizing principal component analysis (PCA), we analyze CAISO's hourly electricity prices and demand from 2016-2021 to improve day-ahead forecasting accuracy. Initially, we apply traditional outlier analysis with the interquartile range method, followed by robust PCA (RPCA) for more effective outlier elimination. This approach improves data symmetry and reduces skewness. We then construct multiple linear regression models using both raw and PCA-transformed features. The model with transformed features, refined through traditional and SAS Sparse Matrix outlier removal methods, shows superior forecasting performance. The SAS Sparse Matrix method, in particular, significantly enhances model accuracy. Our findings demonstrate that PCA-based methods are key in advancing electricity price forecasting, supporting renewable integration and grid management in day-ahead markets. Keywords: Electricity price forecasting, principal component analysis (PCA), power system planning, heteroskedasticity, renewable energy integration.
Machine learning for cerebral blood vessels' malformations
Topal, Irem, Cherevko, Alexander, Bugay, Yuri, Shishlenin, Maxim, Barbier, Jean, Eroglu, Deniz, Roldรกn, รdgar, Belousov, Roman
Cerebral aneurysms and arteriovenous malformations are life-threatening hemodynamic pathologies of the brain. While surgical intervention is often essential to prevent fatal outcomes, it carries significant risks both during the procedure and in the postoperative period, making the management of these conditions highly challenging. Parameters of cerebral blood flow, routinely monitored during medical interventions, could potentially be utilized in machine learning-assisted protocols for risk assessment and therapeutic prognosis. To this end, we developed a linear oscillatory model of blood velocity and pressure for clinical data acquired from neurosurgical operations. Using the method of Sparse Identification of Nonlinear Dynamics (SINDy), the parameters of our model can be reconstructed online within milliseconds from a short time series of the hemodynamic variables. The identified parameter values enable automated classification of the blood-flow pathologies by means of logistic regression, achieving an accuracy of 73 %. Our results demonstrate the potential of this model for both diagnostic and prognostic applications, providing a robust and interpretable framework for assessing cerebral blood vessel conditions.
LDACP: Long-Delayed Ad Conversions Prediction Model for Bidding Strategy
Cui, Peng, Yang, Yiming, Jin, Fusheng, Tang, Siyuan, Wang, Yunli, Yang, Fukang, Jia, Yalong, Cai, Qingpeng, Pan, Fei, Li, Changcheng, Jiang, Peng
In online advertising, once an ad campaign is deployed, the automated bidding system dynamically adjusts the bidding strategy to optimize Cost Per Action (CPA) based on the number of ad conversions. For ads with a long conversion delay, relying solely on the real-time tracked conversion number as a signal for bidding strategy can significantly overestimate the current CPA, leading to conservative bidding strategies. Therefore, it is crucial to predict the number of long-delayed conversions. Nonetheless, it is challenging to predict ad conversion numbers through traditional regression methods due to the wide range of ad conversion numbers. Previous regression works have addressed this challenge by transforming regression problems into bucket classification problems, achieving success in various scenarios. However, specific challenges arise when predicting the number of ad conversions: 1) The integer nature of ad conversion numbers exacerbates the discontinuity issue in one-hot hard labels; 2) The long-tail distribution of ad conversion numbers complicates tail data prediction. In this paper, we propose the Long-Delayed Ad Conversions Prediction model for bidding strategy (LDACP), which consists of two sub-modules. To alleviate the issue of discontinuity in one-hot hard labels, the Bucket Classification Module with label Smoothing method (BCMS) converts one-hot hard labels into non-normalized soft labels, then fits these soft labels by minimizing classification loss and regression loss. To address the challenge of predicting tail data, the Value Regression Module with Proxy labels (VRMP) uses the prediction bias of aggregated pCTCVR as proxy labels. Finally, a Mixture of Experts (MoE) structure integrates the predictions from BCMS and VRMP to obtain the final predicted ad conversion number.
Diagnosing Hate Speech Classification: Where Do Humans and Machines Disagree, and Why?
This study uses the cosine similarity ratio, embedding regression, and manual re-annotation to diagnose hate speech classification. We begin by computing cosine similarity ratio on a dataset "Measuring Hate Speech" that contains 135,556 annotated comments on social media. This way, we show a basic use of cosine similarity as a description of hate speech content. We then diagnose hate speech classification starting from understanding the inconsistency of human annotation from the dataset. Using embedding regression as a basic diagnostic, we found that female annotators are more sensitive to racial slurs that target the black population. We perform with a more complicated diagnostic by training a hate speech classifier using a SoTA pre-trained large language model, NV-Embed-v2, to convert texts to embeddings and run a logistic regression. This classifier achieves a testing accuracy of 94%. In diagnosing where machines disagree with human annotators, we found that machines make fewer mistakes than humans despite the fact that human annotations are treated as ground truth in the training set. Machines perform better in correctly labeling long statements of facts, but perform worse in labeling short instances of swear words. We hypothesize that this is due to model alignment - while curating models at their creation prevents the models from producing obvious hate speech, it also reduces the model's ability to detect such content.
Evaluating Rank-N-Contrast: Continuous and Robust Representations for Regression
Valentin, Six, Alexandre, Chidiac, Arkin, Worlikar
This document is a replication of the original "Rank-N-Contrast" (arXiv:2210.01189v2) paper published in 2023. This evaluation is done for academic purposes. Deep regression models often fail to capture the continuous nature of sample orders, creating fragmented representations and suboptimal performance. To address this, we reproduced the Rank-N-Contrast (RNC) framework, which learns continuous representations by contrasting samples by their rankings in the target space. Our study validates RNC's theoretical and empirical benefits, including improved performance and robustness. We extended the evaluation to an additional regression dataset and conducted robustness tests using a holdout method, where a specific range of continuous data was excluded from the training set. This approach assessed the model's ability to generalise to unseen data and achieve state-of-the-art performance. This replication study validates the original findings and broadens the understanding of RNC's applicability and robustness.