AITopics

2304.13342

Country:

Asia > Japan > Kyūshū & Okinawa > Kyūshū > Fukuoka Prefecture > Fukuoka (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report (0.83)

Industry:

Health & Medicine (0.93)
Education (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.91)

van Loon, Wouter, Fokkema, Marjolein, de Rooij, Mark

Imputation of missing values in multi-view data

Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This leads to very large quantities of missing data which, especially when combined with high-dimensionality, makes the application of conditional imputation methods computationally infeasible. We introduce a new imputation method based on the existing stacked penalized logistic regression (StaPLR) algorithm for multi-view learning. It performs imputation in a dimension-reduced space to address computational challenges inherent to the multi-view context. We compare the performance of the new imputation method with several existing imputation algorithms in simulated data sets. The results show that the new imputation method leads to competitive results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible.

artificial intelligence, imputation, machine learning, (16 more...)

2210.14484

Country:

Europe > Austria > Vienna (0.14)
Europe > Netherlands > South Holland > Leiden (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.88)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.94)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.88)

Momeni, Fakhri, Dietze, Stefan, Mayr, Philipp, Biesenbender, Kristin, Peters, Isabella

Which Factors are associated with Open Access Publishing? A Springer Nature Case Study

Open Access (OA) facilitates access to articles. But, authors or funders often must pay the publishing costs preventing authors who do not receive financial support from participating in OA publishing and citation advantage for OA articles. OA may exacerbate existing inequalities in the publication system rather than overcome them. To investigate this, we studied 522,411 articles published by Springer Nature. Employing correlation and regression analyses, we describe the relationship between authors affiliated with countries from different income levels, their choice of publishing model, and the citation impact of their papers. A machine learning classification method helped us to explore the importance of different features in predicting the publishing model. The results show that authors eligible for APC waivers publish more in gold-OA journals than others. In contrast, authors eligible for an APC discount have the lowest ratio of OA publications, leading to the assumption that this discount insufficiently motivates authors to publish in gold-OA journals. We found a strong correlation between the journal rank and the publishing model in gold-OA journals, whereas the OA option is mostly avoided in hybrid journals. Also, results show that the countries' income level, seniority, and experience with OA publications are the most predictive factors for OA publishing in hybrid journals.

artificial intelligence, machine learning, publishing model, (15 more...)

doi: 10.1162/qss_a_00253

2208.08221

Country:

North America > United States (0.04)
Europe > Germany > North Rhine-Westphalia > Düsseldorf Region > Düsseldorf (0.04)
Europe > Sweden (0.04)
(10 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine (1.00)
Media > Publishing (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

Zouhar, Vilém, Dhuliawala, Shehzaad, Zhou, Wangchunshu, Daheim, Nico, Kocmi, Tom, Jiang, Yuchen Eleanor, Sachan, Mrinmaya

Poor Man's Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference

Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference. State-of-the-art QE systems based on pretrained language models have been achieving remarkable correlations with human judgements yet they are computationally heavy and require human annotations, which are slow and expensive to create. To address these limitations, we define the problem of metric estimation (ME) where one predicts the automated metric scores also without the reference. We show that even without access to the reference, our model can estimate automated metrics ($\rho$=60% for BLEU, $\rho$=51% for other metrics) at the sentence-level. Because automated metrics correlate with human judgements, we can leverage the ME task for pre-training a QE model. For the QE task, we find that pre-training on TER is better ($\rho$=23%) than training for scratch ($\rho$=20%).

artificial intelligence, machine learning, natural language, (15 more...)

2301.09008

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Finland > Uusimaa > Helsinki (0.04)
North America > United States > Michigan (0.04)
(7 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

A Novel Dual of Shannon Information and Weighting Scheme

Zhang, Arthur Jun

Shannon Information theory has achieved great success in not only communication technology where it was originally developed for but also many other science and engineering fields such as machine learning and artificial intelligence. Inspired by the famous weighting scheme TF-IDF, we discovered that information entropy has a natural dual. We complement the classical Shannon information theory by proposing a novel quantity, namely troenpy. Troenpy measures the certainty, commonness and similarity of the underlying distribution. To demonstrate its usefulness, we propose a troenpy based weighting scheme for document with class labels, namely positive class frequency (PCF). On a collection of public datasets we show the PCF based weighting scheme outperforms the classical TF-IDF and a popular Optimal Transportation based word moving distance algorithm in a kNN setting. We further developed a new odds-ratio type feature, namely Expected Class Information Bias(ECIB), which can be regarded as the expected odds ratio of the information quantity entropy and troenpy. In the experiments we observe that including the new ECIB features and simple binary term features in a simple logistic regression model can further significantly improve the performance. The simple new weighting scheme and ECIB features are very effective and can be computed with linear order complexity.

artificial intelligence, machine learning, natural language, (18 more...)

2304.12814

Country:

North America > United States > Massachusetts > Middlesex County > Waltham (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report > Experimental Study (0.36)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.88)

Cenita, Jonelle Angelo S., Asuncion, Paul Richie F., Victoriano, Jayson M.

Performance Evaluation of Regression Models in Predicting the Cost of Medical Insurance

The study aimed to evaluate the regression models' performance in predicting the cost of medical insurance. The Three (3) Regression Models in Machine Learning namely Linear Regression, Gradient Boosting, and Support Vector Machine were used. The performance will be evaluated using the metrics RMSE (Root Mean Square), r2 (R Square), and K-Fold Cross-validation. The study also sought to pinpoint the feature that would be most important in predicting the cost of medical insurance.The study is anchored on the knowledge discovery in databases (KDD) process. (KDD) process refers to the overall process of discovering useful knowledge from data. It show the performance evaluation results reveal that among the three (3) Regression models, Gradient boosting received the highest r2 (R Square) 0.892 and the lowest RMSE (Root Mean Square) 1336.594. Furthermore, the 10-Fold Cross-validation weighted mean findings are not significantly different from the r2 (R Square) results of the three (3) regression models. In addition, Exploratory Data Analysis (EDA) using a box plot of descriptive statistics observed that in the charges and smoker features the median of one group lies outside of the box of the other group, so there is a difference between the two groups. It concludes that Gradient boosting appears to perform better among the three (3) regression models. K-Fold Cross-Validation concluded that the three (3) regression models are good. Moreover, Exploratory Data Analysis (EDA) using a box plot of descriptive statistics ceases that the highest charges are due to the smoker feature.

artificial intelligence, machine learning, regression model, (12 more...)

doi: 10.25147/ijcsr.2017.001.1.146

2304.12605

Country:

South America > Paraguay > Asunción > Asunción (0.05)
Asia > Middle East > Jordan (0.04)
Asia > Philippines > Luzon > National Capital Region > City of Manila (0.04)
(3 more...)

Genre: Research Report (0.84)

Industry:

Banking & Finance > Insurance (1.00)
Health & Medicine > Health Care Providers & Services > Reimbursement (0.86)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

arXiv.org Machine LearningApr-24-2023

Theory of Posterior Concentration for Generalized Bayesian Additive Regression Trees

Saha, Enakshi

Bayesian Additive Regression Trees (BART) are a powerful semiparametric ensemble learning technique for modeling nonlinear regression functions. Although initially BART was proposed for predicting only continuous and binary response variables, over the years multiple extensions have emerged that are suitable for estimating a wider class of response variables (e.g. categorical and count data) in a multitude of application areas. In this paper we describe a Generalized framework for Bayesian trees and their additive ensembles where the response variable comes from an exponential family distribution and hence encompasses a majority of these variants of BART. We derive sufficient conditions on the response distribution, under which the posterior concentrates at a minimax rate, up to a logarithmic factor. In this regard our results provide theoretical justification for the empirical success of BART and its variants.

artificial intelligence, machine learning, posterior concentration rate, (17 more...)

2304.12505

Country: North America > United States > Massachusetts > Suffolk County > Boston (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Wundervald, Bruna, Parnell, Andrew, Domijan, Katarina

Hierarchical Embedded Bayesian Additive Regression Trees

arXiv.org Artificial IntelligenceApr-24-2023

We propose a simple yet powerful extension of Bayesian Additive Regression Trees which we name Hierarchical Embedded BART (HE-BART). The model allows for random effects to be included at the terminal node level of a set of regression trees, making HE-BART a non-parametric alternative to mixed effects models which avoids the need for the user to specify the structure of the random effects in the model, whilst maintaining the prediction and uncertainty calibration properties of standard BART. Using simulated and real-world examples, we demonstrate that this new extension yields superior predictions for many of the standard mixed effects models' example data sets, and yet still provides consistent estimates of the random effect variances. In a future version of this paper, we outline its use in larger, more advanced data sets and structures.

artificial intelligence, machine learning, prediction, (14 more...)

2204.07207

Country:

Europe > Ireland (0.14)
Europe > Austria > Vienna (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.83)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

arXiv.org Machine LearningApr-24-2023

Estimation of sparse linear regression coefficients under $L$-subexponential covariates

Sasai, Takeyuki

We address a task of estimating sparse coefficients in linear regression when the covariates are drawn from an $L$-subexponential random vector, which belongs to a class of distributions having heavier tails than a Gaussian random vector. Prior works have tackled this issue by assuming that the covariates are drawn from an $L$-subexponential random vector and have established error bounds that resemble those derived for Gaussian random vectors. However, these previous methods require stronger conditions to derive error bounds than those employed for Gaussian random vectors. In the present paper, we present an error bound identical to that obtained for Gaussian random vectors, up to constant factors, without requiring stronger conditions, even when the covariates are drawn from an $L$-subexponential random vector. Somewhat interestingly, we utilize an $\ell_1$-penalized Huber regression, that is recognized for its robustness to heavy-tailed random noises, not covariates. We believe that the present paper reveals a new aspect of the $\ell_1$-penalized Huber regression.

artificial intelligence, machine learning, random vector, (17 more...)

2304.11958

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

arXiv.org Machine LearningApr-24-2023

Nonlinear Sufficient Dimension Reduction for Distribution-on-Distribution Regression

Zhang, Qi, Li, Bing, Xue, Lingzhou

We introduce a new approach to nonlinear sufficient dimension reduction in cases where both the predictor and the response are distributional data, modeled as members of a metric space. Our key step is to build universal kernels (cc-universal) on the metric spaces, which results in reproducing kernel Hilbert spaces for the predictor and response that are rich enough to characterize the conditional independence that determines sufficient dimension reduction. For univariate distributions, we construct the universal kernel using the Wasserstein distance, while for multivariate distributions, we resort to the sliced Wasserstein distance. The sliced Wasserstein distance ensures that the metric space possesses similar topological properties to the Wasserstein space while also offering significant computation benefits. Numerical results based on synthetic data show that our method outperforms possible competing methods. The method is also applied to several data sets, including fertility and mortality data and Calgary temperature data.

artificial intelligence, machine learning, predictor, (17 more...)

2207.04613

Country:

North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.24)
Asia > Middle East > Jordan (0.04)
North America > United States > Pennsylvania (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning in High Dimensional Spaces (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)