AITopics

2210.06327

Country:

Europe > United Kingdom > England > Devon > Exeter (0.04)
Europe > Portugal > Porto > Porto (0.04)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports > Soccer (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.55)

#artificialintelligence Jan-16-2023, 04:15:25 GMT

Feature Transformation for Multiple Linear Regression in Python

Data processing and transformation is an iterative process and, in a way, it can never be 'perfect'. As we gain more understanding of the dataset, such as the relationships between the target variable and the features, or the business context, we think of new ways to deal with them. Recently I started working on media mix models and some predictive models that use multiple linear regression. In this post, I will introduce the thought process and different ways to handle variables for modeling purposes. I will use the King County house price data set (a modified version, for more fun) as an example.

feature transformation, multiple linear regression, python, (3 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.81)
Information Technology > Modeling & Simulation (0.79)

Tran, Christopher, Burghardt, Keith, Lerman, Kristina, Zheleva, Elena

Data-Driven Estimation of Heterogeneous Treatment Effects

arXiv.org Artificial Intelligence Jan-16-2023

Estimating the effect of a treatment on an outcome is a fundamental problem in many fields such as medicine [33, 34, 61], public policy [20] and more [2, 37]. For example, doctors might be interested in how a treatment, such as a drug, affects the recovery of patients [18], economists may be interested in how a job training program affects employment prospects [35], and advertisers may want to model the average effect an advertisement has on sales [36]. However, individuals may react differently to the treatment of interest, and knowing only the average treatment effect in the population is insufficient. For example, a drug may have adverse effects on some individuals but not others [61], or a person's education and background may affect how much they benefit from job training [35, 50]. Measuring the extent to which different individuals react differently to treatment is known as heterogeneous treatment effect (HTE) estimation. Traditionally, HTE estimation has been done through subgroup analysis [9, 19]. However, this can lead to cherry-picking since the practitioner is the one who identifies subgroups for estimating effects. Recently, there has been more focus on data-driven estimation of heterogeneous treatment effects by letting the data identify which features are important for treatment effect estimation using machine learning techniques [28, 39, 61, 69]. A straightforward approach is to create interaction terms between all covariates and use them in a regression [6].
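The interaction-term approach mentioned at the end of the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes a binary treatment `t`, restricts the interactions to treatment-by-covariate terms, and fits by ordinary least squares on synthetic data. The function name `fit_interaction_model` and the simulated data are hypothetical.

```python
import numpy as np

def fit_interaction_model(X, t, y):
    """Least-squares fit of y ~ 1 + X + t + t*X (assumed model form).

    Returns the coefficient vector and a function that evaluates the
    estimated treatment effect for new covariate values.
    """
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    # Columns: intercept, covariates, treatment, treatment-by-covariate terms.
    design = np.hstack([np.ones((n, 1)), X, t, t * X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    tau0 = coef[1 + d]   # main treatment coefficient
    tau = coef[2 + d:]   # interaction coefficients

    def cate(x_new):
        # Under this linear model, the effect of treatment at x is tau0 + x @ tau.
        return tau0 + np.asarray(x_new, dtype=float) @ tau

    return coef, cate

# Synthetic data (hypothetical): the effect depends on the second covariate.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)
y = 1.0 + 0.5 * X[:, 0] + t * (2.0 + 1.5 * X[:, 1]) + rng.normal(scale=0.1, size=n)

coef, cate = fit_interaction_model(X, t, y)
print(cate(np.array([0.0, 0.0])), cate(np.array([0.0, 1.0])))
```

In this simulation the true effect is 2.0 + 1.5 * x2, so the two printed estimates should land close to 2.0 and 3.5. With many covariates the number of interaction terms grows quickly, which is one reason a data-driven choice of features becomes attractive.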

artificial intelligence, estimation, machine learning, (16 more...)

2301.06615

Country:

North America > United States > California (0.14)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report > Experimental Study (1.00)
Overview (1.00)
Research Report > Strength High (0.68)

Industry:

Education (0.86)
Health & Medicine > Pharmaceuticals & Biotechnology (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

arXiv.org Artificial Intelligence Jan-16-2023

Computational Assessment of Hyperpartisanship in News Titles

Lyu, Hanjia, Pan, Jinsheng, Wang, Zichen, Luo, Jiebo

We first adopt a human-guided machine learning framework to develop, in an active learning manner, a new dataset for hyperpartisan news title detection with 2,200 manually labeled and 1.8 million machine-labeled titles that were posted from 2014 to the present by nine representative media organizations across three media bias groups: Left, Central, and Right. The fine-tuned transformer-based language model achieves an overall accuracy of 0.84 and an F1 score of 0.78 on an external validation set. Next, we conduct a computational analysis to quantify the extent and dynamics of partisanship in news titles. While some aspects are as expected, our study reveals new or nuanced differences between the three media groups. We find that overall the Right media tends to use proportionally more hyperpartisan titles. Around the 2016 Presidential Election, the proportions of hyperpartisan titles increased in all media bias groups, with the Left media showing the largest relative increase. We identify three major topics including foreign issues, political systems, and societal issues that are suggestive of hyperpartisanship in news titles using logistic regression models and the Shapley values. Through an analysis of the topic distribution, we find that societal issues gradually receive more attention from all media groups. We further apply a lexicon-based language analysis tool to the titles of each topic and quantify the linguistic distance between any pairs of the three media groups. Three distinct patterns are discovered. The Left media is linguistically more different from Central and Right in terms of foreign issues. The linguistic distance between the three media groups becomes smaller over recent years. In addition, a seasonal pattern where linguistic difference is associated with elections is observed for societal issues.

artificial intelligence, machine learning, natural language, (19 more...)

2301.0627

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
Europe > Ukraine (0.14)
North America > United States > Texas (0.04)
(9 more...)

Genre: Research Report > New Finding (0.66)

Industry:

Media > News (1.00)
Government > Voting & Elections (1.00)
Government > Regional Government > North America Government > United States Government (0.47)
Health & Medicine > Therapeutic Area > Immunology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

#artificialintelligence Jan-15-2023, 00:35:32 GMT

ML Basics (Part-1): REGRESSION -- A Gateway Method to Machine Learning

There has been growing interest in introductory posts on elementary topics in Machine Learning, so I will be writing on such topics in the coming posts, starting with this one. This article is mostly self-contained; however, it requires a basic understanding of linear algebra and calculus. Regression is the process of estimating the relationship of a dependent variable (Y) with one or more independent variables (Xi). It is used primarily for finding patterns in a given set of data samples and for forecasting the value of a variable given a set of values of other variables.
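As a minimal sketch of the definition above, the following fits a linear relationship between Y and two independent variables by ordinary least squares and then forecasts Y for a new sample. The helper names `fit_linear_regression` and `predict` and the toy data are assumptions made for illustration, not taken from the article.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Estimate [intercept, slope_1, ..., slope_k] by least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict(beta, X):
    """Forecast Y for new rows of independent variables."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]

# Toy data generated exactly from Y = 1 + 2*X1 - 1*X2 (hypothetical example).
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

beta = fit_linear_regression(X, y)
print(beta)                         # close to [1, 2, -1]
print(predict(beta, [[5.0, 1.0]]))  # close to [10]
```

Because the toy data are noise-free, the fit recovers the generating coefficients up to floating-point error; with real data the estimates would only approximate the underlying relationship.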

artificial intelligence, equation, machine learning, (18 more...)

#artificialintelligence

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.81)

Mou, Wenlong, Ding, Peng, Wainwright, Martin J., Bartlett, Peter L.

Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency

arXiv.org Machine Learning Jan-15-2023

We study optimal procedures for estimating a linear functional based on observational data. In many problems of this kind, a widely used assumption is strict overlap, i.e., uniform boundedness of the importance ratio, which measures how well the observational data covers the directions of interest. When it is violated, the classical semi-parametric efficiency bound can easily become infinite, so that the instance-optimal risk depends on the function class used to model the regression function. For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error in estimating a broad class of linear functionals. This lower bound refines the classical semi-parametric one, and makes connections to moduli of continuity in functional estimation. When $\mathcal{F}$ is a reproducing kernel Hilbert space, we prove that this lower bound can be achieved up to a constant factor by analyzing a computationally simple regression estimator. We apply our general results to various families of examples, thereby uncovering a spectrum of rates that interpolate between the classical theories of semi-parametric efficiency (with $\sqrt{n}$-consistency) and the slower minimax rates associated with non-parametric function estimation.

artificial intelligence, estimator, machine learning, (17 more...)

arXiv.org Machine Learning

2301.0624

Country:

North America > United States > Massachusetts (0.04)
North America > United States > California (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report (0.82)
Workflow (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.35)

Kim, Kwangho, Kennedy, Edward H., Zubizarreta, José R.

Doubly Robust Counterfactual Classification

We study counterfactual classification as a new tool for decision-making under hypothetical (contrary to fact) scenarios. We propose a doubly-robust nonparametric estimator for a general counterfactual classifier, where we can incorporate flexible constraints by casting the classification problem as a nonlinear mathematical program involving counterfactuals. We go on to analyze the rates of convergence of the estimator and provide a closed-form expression for its asymptotic distribution. Our analysis shows that the proposed estimator is robust against nuisance model misspecification, and can attain fast $\sqrt{n}$ rates with tractable inference even when using nonparametric machine learning approaches. We study the empirical performance of our methods by simulation and apply them for recidivism risk prediction.

artificial intelligence, estimator, machine learning, (16 more...)

2301.06199

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Greenland (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > Strength High (0.68)
Research Report > New Finding (0.68)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)

Interpretable and Scalable Graphical Models for Complex Spatio-temporal Processes

Wang, Yu

This thesis focuses on data that has complex spatio-temporal structure and on probabilistic graphical models that learn the structure in an interpretable and scalable manner. We target two research areas of interest: Gaussian graphical models for tensor-variate data and summarization of complex time-varying texts using topic models. This work advances the state-of-the-art in several directions. First, it introduces a new class of tensor-variate Gaussian graphical models via the Sylvester tensor equation. Second, it develops an optimization technique based on a fast-converging proximal alternating linearized minimization method, which scales tensor-variate Gaussian graphical model estimations to modern big-data settings. Third, it connects Kronecker-structured (inverse) covariance models with spatio-temporal partial differential equations (PDEs) and introduces a new framework for ensemble Kalman filtering that is capable of tracking chaotic physical systems. Fourth, it proposes a modular and interpretable framework for unsupervised and weakly-supervised probabilistic topic modeling of time-varying data that combines generative statistical models with computational geometric methods. Throughout, practical applications of the methodology are considered using real datasets. This includes brain-connectivity analysis using EEG data, space weather forecasting using solar imaging data, longitudinal analysis of public opinions using Twitter data, and mining of mental health related issues using TalkLife data. We show in each case that the graphical modeling framework introduced here leads to improved interpretability, accuracy, and scalability.

data mining, machine learning, natural language, (25 more...)

2301.06021

Country:

North America > United States > California > San Francisco County > San Francisco (0.13)
North America > United States > Texas (0.04)
North America > United States > Illinois (0.04)
(21 more...)

Genre:

Research Report > New Finding (1.00)
Overview (0.92)
Research Report > Experimental Study (0.67)

Industry:

Information Technology > Services (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
(3 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Systems & Languages (1.00)
(8 more...)

A Coreset Learning Reality Check

Lu, Fred, Raff, Edward, Holt, James

Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets. In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification. While these works are supported by theory and limited experiments, to date there has not been a comprehensive evaluation of these methods. In our work, we directly compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness. In many cases, methods do not outperform simple uniform subsampling.

artificial intelligence, dataset, machine learning, (18 more...)

2301.06163

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Maryland > Baltimore County (0.04)
North America > United States > Maryland > Baltimore (0.04)
North America > United States > California (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.36)

Bradley, Taylor, Alhajjar, Elie, Bastian, Nathaniel

Novelty Detection in Network Traffic: Using Survival Analysis for Feature Identification

Intrusion Detection Systems are an important component of many organizations' cyber defense and resiliency strategies. However, one downside of these systems is their reliance on known attack signatures for detection of malicious network events. When it comes to unknown attack types and zero-day exploits, modern Intrusion Detection Systems often fall short. In this paper, we introduce an unconventional approach to identifying network traffic features that influence novelty detection based on survival analysis techniques. Specifically, we combine several Cox proportional hazards models and implement Kaplan-Meier estimates to predict the probability that a classifier identifies novelty after the injection of an unknown network attack at any given time. The proposed model is successful at pinpointing PSH Flag Count, ACK Flag Count, URG Flag Count, and Down/Up Ratio as the main features to impact novelty detection via Random Forest, Bayesian Ridge, and Linear Support Vector Regression classifiers.
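For readers unfamiliar with the Kaplan-Meier estimate named in the abstract, here is a minimal pure-Python sketch of the standard product-limit estimator. It is not the authors' pipeline. Reading the "event" as a classifier flagging novelty, and a censored run as one that ended without a flag, is an assumption made for illustration; the function name `kaplan_meier` and the sample data are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Product-limit survival estimate.

    durations: time until the event or until censoring.
    observed:  1 if the event was observed, 0 if the record was censored.
    Returns a list of (time, survival_probability) at each event time.
    """
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        time = pairs[i][0]
        events = 0
        removed = 0
        # Group all records that share the same time.
        while i < len(pairs) and pairs[i][0] == time:
            events += 1 if pairs[i][1] else 0
            removed += 1
            i += 1
        if events:
            # Multiply by the fraction of at-risk records that survive this time.
            survival *= 1.0 - events / at_risk
            curve.append((time, survival))
        # Events and censored records both leave the risk set.
        at_risk -= removed
    return curve

# Hypothetical data: five runs, two of them censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
for time, s in curve:
    print(time, round(s, 3))
```

Under the assumed reading, one minus the survival value at time t estimates the probability that novelty has been detected by time t.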

data mining, detection, machine learning, (18 more...)

2301.06229

Country:

North America > United States > New York (0.04)
North America > United States > Maryland > Baltimore (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)