Regression
Debiased Regression for Root-N-Consistent Conditional Mean Estimation
This study introduces a debiasing method for regression estimators, including high-dimensional and nonparametric regression estimators. For example, nonparametric regression methods allow for the estimation of regression functions in a data-driven manner with minimal assumptions; however, these methods typically fail to achieve $\sqrt{n}$-consistency in their convergence rates, and many, including those in machine learning, lack guarantees that their estimators asymptotically follow a normal distribution. To address these challenges, we propose a debiasing technique for nonparametric estimators by adding a bias-correction term to the original estimators, extending the conventional one-step estimator used in semiparametric analysis. Specifically, for each data point, we estimate the conditional expected residual of the original nonparametric estimator, which can, for instance, be computed using kernel (Nadaraya-Watson) regression, and incorporate it as a bias-reduction term. Our theoretical analysis demonstrates that the proposed estimator achieves $\sqrt{n}$-consistency and asymptotic normality under a mild convergence rate condition for both the original nonparametric estimator and the conditional expected residual estimator. Notably, this approach remains model-free as long as the original estimator and the conditional expected residual estimator satisfy the convergence rate condition. The proposed method offers several advantages, including improved estimation accuracy and simplified construction of confidence intervals.
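The bias-correction idea in this abstract can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the simulated data, the deliberately misspecified linear initial estimator, the Gaussian kernel, the bandwidth of 0.2, and the two-way sample split are all hypothetical choices, not the paper's setup.

```python
import numpy as np

def nw_regress(x_train, r_train, x_query, bandwidth):
    """Nadaraya-Watson estimate of E[r | x] with a Gaussian kernel (1-D x)."""
    diffs = (x_query[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs**2)
    return (weights @ r_train) / weights.sum(axis=1)

# Simulated data (assumption): a nonlinear regression function plus noise.
rng = np.random.default_rng(0)
n = 400
x = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(2.0 * x) + 0.1 * rng.normal(size=n)

# Split the sample (assumption): one half fits the initial estimator,
# the other half is used to estimate the conditional expected residual.
half = n // 2
x_a, y_a = x[:half], y[:half]
x_b, y_b = x[half:], y[half:]

# Initial estimator: a straight-line fit, intentionally biased here.
coef = np.polyfit(x_a, y_a, deg=1)

def f_hat(t):
    return np.polyval(coef, t)

# Bias-correction term: kernel regression of the residuals on x.
residuals_b = y_b - f_hat(x_b)
x_grid = np.linspace(-1.5, 1.5, 50)
correction = nw_regress(x_b, residuals_b, x_grid, bandwidth=0.2)
f_debiased = f_hat(x_grid) + correction

truth = np.sin(2.0 * x_grid)
mse_initial = np.mean((f_hat(x_grid) - truth) ** 2)
mse_debiased = np.mean((f_debiased - truth) ** 2)
print(f"initial MSE:  {mse_initial:.4f}")
print(f"debiased MSE: {mse_debiased:.4f}")
```

In this toy setting the correction absorbs most of the error of the misspecified fit. The abstract's theoretical guarantees depend on rate conditions that this sketch does not check.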
A review on Machine Learning based User-Centric Multimedia Streaming Techniques
Ghosh, Monalisa, Singhal, Chetna
Multimedia content and streaming are a major means of information exchange in the modern era, and demand for such services is increasing. This, coupled with the advancement of future wireless networks (B5G/6G) and the proliferation of intelligent handheld mobile devices, has made multimedia content available to heterogeneous mobile users. Apart from conventional video, 360$^\circ$ videos have gained popularity with emerging virtual reality applications. All video formats (conventional and 360$^\circ$) undergo processing, compression, and transmission across dynamic wireless channels with restricted bandwidth to facilitate streaming services. This causes video impairments, leading to quality degradation, and poses challenges in delivering good Quality-of-Experience (QoE) to viewers. QoE is a prominent subjective quality measure for assessing multimedia services, and it requires end-to-end evaluation. Efficient multimedia streaming techniques can improve service quality while dealing with dynamic network and end-user challenges. A paradigm shift in user-centric multimedia services is envisioned, with a focus on Machine Learning (ML) based QoE modeling and streaming strategies. This survey presents a comprehensive overview of overall and continuous, time-varying QoE modeling for the purpose of QoE management in multimedia services. It also examines recent research on intelligent and adaptive multimedia streaming strategies, with special emphasis on ML-based techniques for video (conventional and 360$^\circ$) streaming. The paper discusses overall and continuous QoE modeling to optimize the end-user viewing experience, efficient video streaming with a focus on user-centric strategies, and associated datasets for modeling and streaming, along with existing shortcomings and open challenges.
Selective Inference for Time-Varying Effect Moderation
Bakshi, Soham, Dempsey, Walter, Panigrahi, Snigdha
Causal effect moderation investigates how the effect of interventions (or treatments) on outcome variables changes based on observed characteristics of individuals, known as potential effect moderators. With advances in data collection, datasets containing many observed features as potential moderators have become increasingly common. High-dimensional analyses often lack interpretability, with important moderators masked by noise, while low-dimensional, marginal analyses yield many false positives due to strong correlations with true moderators. In this paper, we propose a two-step method for selective inference on time-varying causal effect moderation that addresses the limitations of both high-dimensional and marginal analyses. Our method first selects a relatively smaller, more interpretable model to estimate a linear causal effect moderation using a Gaussian randomization approach. We then condition on the selection event to construct a pivot, enabling uniformly asymptotic semi-parametric inference in the selected model. Through simulations and real data analyses, we show that our method consistently achieves valid coverage rates, even when existing conditional methods and common sample splitting techniques fail. Moreover, our method yields shorter, bounded intervals, unlike existing methods that may produce infinitely long intervals.
Assumption-Lean Post-Integrated Inference with Negative Control Outcomes
Du, Jin-Hong, Roeder, Kathryn, Wasserman, Larry
In the big data era, integrating information from multiple heterogeneous sources has become increasingly crucial for achieving larger sample sizes and more diverse study populations. Data integration is applied in a variety of fields, including but not limited to causal inference on heterogeneous populations (Shi et al., 2023), survey sampling (Yang et al., 2020), health policy (Paddock et al., 2024), retrospective psychometrics (Howe and Brown, 2023), and multi-omics biological science (Du et al., 2022). Data integration methods have been proposed to mitigate the unwanted effects of heterogeneous datasets and unmeasured covariates, recovering the common variation across datasets. However, a critical and often overlooked question is whether reliable statistical inference can be made from integrated data. Directly performing statistical inference on integrated outcomes and covariates of interest fails to account for the complex correlation structures introduced by the data integration process, often leading to improper analyses that incorrectly assume the corrected data points are independent (Li et al., 2023). While data integration is broadly utilized in various fields, our paper focuses on a challenging scenario involving high-dimensional outcomes.
Can a Large Language Model Learn Matrix Functions In Context?
Goulart, Paimon, Papalexakis, Evangelos E.
Large Language Models (LLMs) have demonstrated the ability to solve complex tasks through In-Context Learning (ICL), where models learn from a few input-output pairs without explicit fine-tuning. In this paper, we explore the capacity of LLMs to solve non-linear numerical computations, with specific emphasis on functions of the Singular Value Decomposition. Our experiments show that while LLMs perform comparably to traditional models such as Stochastic Gradient Descent (SGD) based Linear Regression and Neural Networks (NN) for simpler tasks, they outperform these models on more complex tasks, particularly in the case of top-k Singular Values. Furthermore, LLMs demonstrate strong scalability, maintaining high accuracy even as the matrix size increases. Additionally, we found that LLMs can achieve high accuracy with minimal prior examples, converging quickly and avoiding the overfitting seen in classical models. These results suggest that LLMs could provide an efficient alternative to classical methods for solving high-dimensional problems. Future work will focus on extending these findings to larger matrices and more complex matrix operations while exploring the effect of using different numerical representations in ICL.
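To make the task concrete, the following Python sketch shows how input-output pairs for a top-k singular value task could be generated and laid out as an in-context prompt. The prompt layout, the two-decimal rounding, the Gaussian random matrices, and the helper names (`top_k_singular_values`, `format_matrix`, `build_prompt`) are hypothetical; the abstract does not specify the authors' format or numerical representation.

```python
import numpy as np

def top_k_singular_values(matrix, k):
    """Return the k largest singular values, in descending order."""
    return np.linalg.svd(matrix, compute_uv=False)[:k]

def format_matrix(matrix):
    """Serialize a matrix row by row with two decimals (assumed format)."""
    return "; ".join(" ".join(f"{v:.2f}" for v in row) for row in matrix)

def build_prompt(n_examples, size, k, seed=0):
    """Build a few-shot prompt ending with an unanswered query matrix."""
    rng = np.random.default_rng(seed)
    lines = []
    for _ in range(n_examples):
        m = rng.normal(size=(size, size))
        target = " ".join(f"{v:.2f}" for v in top_k_singular_values(m, k))
        lines.append(f"Input: [{format_matrix(m)}] Output: [{target}]")
    query = rng.normal(size=(size, size))
    lines.append(f"Input: [{format_matrix(query)}] Output:")
    return "\n".join(lines), query

prompt, query = build_prompt(n_examples=3, size=2, k=1)
print(prompt)
# Ground truth for scoring whatever a model returns for the query.
print(top_k_singular_values(query, 1))
```

The reference values from `np.linalg.svd` serve as the ground truth against which a model's completion for the final line would be scored.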
Understanding the Impact of News Articles on the Movement of Market Index: A Case on Nifty 50
Dasgupta, Subhasis, Satpati, Pratik, Choudhary, Ishika, Sen, Jaydip
In the recent past, several works have addressed the prediction of stock prices using different methods. Sentiment analysis of news and tweets, and its relation to the movement of stock prices, has already been explored. However, news can cover several topics, such as politics, markets, and sports. Most prior analyses dealt only with news or comments associated with particular stock prices, or only with overall sentiment scores. Yet it is quite possible that different topics have different levels of impact on the movement of a stock price or an index. The current study focuses on bridging this gap by analysing the movement of the Nifty 50 index with respect to the sentiments associated with news items on various topics, such as sports, politics, and markets. The study establishes that sentiment scores of news items on topics other than markets also have a significant impact on the movement of the index.
Influence functions and regularity tangents for efficient active learning
In this paper we describe an efficient method for providing a regression model with a sense of curiosity about its data. In the field of machine learning, our framework for representing curiosity is called active learning, which means automatically choosing data points for which to query labels in the semisupervised setting. The methods we propose are based on computing a "regularity tangent" vector that can be calculated (with only a constant slow-down) together with the model's parameter vector during training. We then take the inner product of this tangent vector with the gradient vector of the model's loss at a given data point to obtain a measure of the influence of that point on the complexity of the model. There is only a single regularity tangent vector, of the same dimension as the parameter vector. Thus, in the proposed technique, once training is complete, evaluating our "curiosity" about a potential query data point can be done as quickly as calculating the model's loss gradient at that point. The new vector only doubles the amount of storage required by the model. We show that the quantity computed by our technique is an example of an "influence function", and that it measures the expected squared change in model complexity incurred by up-weighting a given data point. We propose a number of ways for using this quantity to choose new training data for a model in the framework of active learning.
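As a rough illustration of the inner-product computation described above, the sketch below uses ridge regression, where everything is available in closed form. It assumes that the "regularity tangent" can be read as the derivative of the fitted parameters with respect to the regularization strength `lam`. That reading, the ridge model, the known label for the candidate point, and the function names are assumptions made for illustration, and the resulting score is not claimed to be the exact quantity defined in the paper.

```python
import numpy as np

def fit_ridge_with_tangent(X, y, lam):
    """Fit ridge regression and return (theta, d theta / d lam).

    theta minimizes 0.5*||X theta - y||^2 + 0.5*lam*||theta||^2.
    Differentiating the optimality condition with respect to lam gives
    d theta / d lam = -(X^T X + lam I)^{-1} theta.
    """
    A = X.T @ X + lam * np.eye(X.shape[1])
    theta = np.linalg.solve(A, X.T @ y)
    tangent = -np.linalg.solve(A, theta)
    return theta, tangent

def curiosity_score(theta, tangent, x_new, y_new):
    """Inner product of the tangent with the squared-loss gradient at one point."""
    grad = (x_new @ theta - y_new) * x_new
    return float(grad @ tangent)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

theta, tangent = fit_ridge_with_tangent(X, y, lam=1.0)
# Hypothetical candidate point with a known label, for illustration only.
score = curiosity_score(theta, tangent, x_new=np.array([0.5, 0.5, 0.5]), y_new=0.0)
print(score)
```

Once `tangent` is stored next to `theta`, scoring a candidate costs one loss-gradient evaluation and one dot product. This matches the cost profile the abstract describes.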
Detecting Distributed Denial of Service Attacks Using Logistic Regression and SVM Methods
Ullah, Mohammad Arafat, Anjum, Arthy, Tuhin, Rashedul Amin, Akhter, Shamim
A distributed denial-of-service (DDoS) attack is an attempt to generate an enormous volume of traffic within a network, overwhelming a targeted server or its neighboring infrastructure with a ceaseless flood of service requests from multiple remotely controlled, malware-infected computers or network-connected devices. Thus, exploring DDoS attacks by recognizing their functionalities and differentiating them from normal traffic are primary concerns in network security, particularly for online businesses. In modern networks, most DDoS attacks occur in the network and application layers, including HTTP flood, UDP flood, SIDDOS, SMURF, SNMP flood, IP NULL, etc. The goal of this paper is to detect DDoS attacks among all service requests and classify them according to DDoS classes. To this end, a standard dataset is collected from the internet, containing several network-related attributes and the corresponding DDoS attack class name. Two machine learning approaches, SVM and Logistic Regression, are applied to the dataset for detecting and classifying DDoS attacks, and they are compared in terms of accuracy, precision, and recall. Logistic Regression and SVM both achieve 98.65% classification accuracy, which is the highest accuracy reported among previous experiments on the same dataset.
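The three comparison metrics named in this abstract can be computed from predicted and true class labels as in the following Python sketch. The label lists are made-up toy data (class names borrowed from the abstract), and the helper names are hypothetical; nothing here reproduces the paper's dataset or results.

```python
def accuracy(y_true, y_pred):
    """Fraction of requests whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive):
    """One-vs-rest precision and recall for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels (assumption): true vs. predicted class per service request.
y_true = ["normal", "smurf", "smurf", "udp_flood", "normal", "smurf"]
y_pred = ["normal", "smurf", "udp_flood", "udp_flood", "smurf", "smurf"]

acc = accuracy(y_true, y_pred)
smurf_precision, smurf_recall = precision_recall(y_true, y_pred, "smurf")
print(acc, smurf_precision, smurf_recall)
```

Reporting precision and recall per class matters for multi-class attack data, since a high overall accuracy can hide poor detection of a rare attack class.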
Social Media Algorithms Can Shape Affective Polarization via Exposure to Antidemocratic Attitudes and Partisan Animosity
Piccardi, Tiziano, Saveski, Martin, Jia, Chenyan, Hancock, Jeffrey T., Tsai, Jeanne L., Bernstein, Michael
There is widespread concern about the negative impacts of social media feed ranking algorithms on political polarization. Leveraging advancements in large language models (LLMs), we develop an approach to re-rank feeds in real-time to test the effects of content that is likely to polarize: expressions of antidemocratic attitudes and partisan animosity (AAPA). In a preregistered 10-day field experiment on X/Twitter with 1,256 consented participants, we increase or decrease participants' exposure to AAPA in their algorithmically curated feeds. We observe more positive outparty feelings when AAPA exposure is decreased and more negative outparty feelings when AAPA exposure is increased. Exposure to AAPA content also results in an immediate increase in negative emotions, such as sadness and anger. The interventions do not significantly impact traditional engagement metrics such as re-post and favorite rates. These findings highlight a potential pathway for developing feed algorithms that mitigate affective polarization by addressing content that undermines the shared values required for a healthy democracy.
Gradient-based optimization for variational empirical Bayes multiple regression
Banerjee, Saikat, Carbonetto, Peter, Stephens, Matthew
Multiple linear regression provides a simple, but widely used, method to find associations between outcomes (responses) and a set of predictors (explanatory variables). It has been actively studied for more than a century, and there is a rich and vast literature on the subject [1]. In practical situations the number of predictor variables is often large, and it becomes desirable to induce sparsity in the regression coefficients to avoid overfitting [2, 3]. Sparse linear regression also serves as the foundation for non-linear techniques, such as trend filtering [4, 5], which can estimate an underlying non-linear trend from time series data. Sparse multiple linear regression and trend filtering are applied across a wide range of fields in modern science and engineering, including astronomy [6], atmospheric sciences [7], biology [8], economics [9, 10], genetics [11-15], geophysics [16], medical sciences [17, 18], social sciences [19] and text analysis [20]. Approaches to sparse linear regression can be broadly classified into two groups: (a) penalized linear regressions (PLR), which add a penalty term to the likelihood to penalize the magnitude of its parameters [21-23], and (b) Bayesian approaches [11-14, 24-29], which use a prior probability distribution on the model parameters to induce sparsity.
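The two groups can be written side by side in generic notation. The symbols below ($y$, $X$, $b$, $\sigma^2$, $\rho$, $\lambda$, $g$) are assumed here for illustration and are not taken from the paper; the lasso penalty is shown as one familiar instance of (a).

```latex
% (a) Penalized linear regression: Gaussian negative log-likelihood plus a penalty
\hat{b}_{\mathrm{PLR}}
  = \arg\min_{b \in \mathbb{R}^p}
    \frac{1}{2\sigma^2} \lVert y - X b \rVert_2^2 + \sum_{j=1}^{p} \rho(b_j),
\qquad \text{e.g. the lasso: } \rho(b_j) = \lambda \lvert b_j \rvert .

% (b) Bayesian approach: a sparsity-inducing prior on the coefficients
y \mid b \sim \mathcal{N}(X b, \sigma^2 I_n),
\qquad b_j \overset{\text{i.i.d.}}{\sim} g(\cdot), \quad j = 1, \dots, p .
```

When the posterior mode is used as the estimate in (b), the negative log-prior $-\log g(b_j)$ plays the role of the penalty $\rho(b_j)$ in (a).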