AITopics

2409.03962

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.45)

Industry: Health & Medicine (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

arXiv.org Machine Learning, Sep-4-2024

Introduction to Machine Learning

Younes, Laurent

This book introduces the mathematical foundations and techniques that lead to the development and analysis of many of the algorithms used in machine learning. It starts with an introductory chapter that describes the notation used throughout the book and serves as a reminder of basic concepts in calculus, linear algebra and probability. That chapter also introduces some measure-theoretic terminology, which can be used as a reading guide for the sections that use these tools. The introductory chapters also provide background material on matrix analysis and optimization. The latter chapter provides theoretical support for many algorithms that are used in the book, including stochastic gradient descent and proximal methods. After discussing basic concepts for statistical prediction, the book includes an introduction to reproducing kernel theory and Hilbert space techniques, which are used in many places, before describing various algorithms for supervised statistical learning, including linear methods, support vector machines, decision trees, boosting, and neural networks. The subject then switches to generative methods, starting with a chapter that presents sampling methods and an introduction to the theory of Markov chains. The following chapters describe the theory of graphical models, introduce variational methods for models with latent variables, and cover deep-learning-based generative models. The next chapters focus on unsupervised learning methods for clustering, factor analysis and manifold learning. The final chapter of the book is theory-oriented and discusses concentration inequalities and generalization bounds.

bayesian information criterion, complementary slackness condition, independent component analysis, (17 more...)

2409.02668

Genre:

Workflow (1.00)
Summary/Review (1.00)
Instructional Material (0.92)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
(6 more...)

Divol, Vincent, Gaucher, Solenne

Demographic parity in regression and classification within the unawareness framework

arXiv.org Machine Learning, Sep-4-2024

This paper explores the theoretical foundations of fair regression under the constraint of demographic parity within the unawareness framework, where disparate treatment is prohibited, extending existing results where such treatment is permitted. Specifically, we aim to characterize the optimal fair regression function when minimizing the quadratic loss. Our results reveal that this function is given by the solution to a barycenter problem with optimal transport costs. Additionally, we study the connection between optimal fair cost-sensitive classification, and optimal fair regression. We demonstrate that nestedness of the decision sets of the classifiers is both necessary and sufficient to establish a form of equivalence between classification and regression. Under this nestedness assumption, the optimal classifiers can be derived by applying thresholds to the optimal fair regression function; conversely, the optimal fair regression function is characterized by the family of cost-sensitive classifiers.
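The link between thresholded regression functions and nested decision sets can be illustrated with a minimal sketch. It assumes a list of hypothetical regression scores and is not the paper's construction: `threshold_classifier` and `is_nested` are illustrative names, and the scores are made up.

```python
def threshold_classifier(scores, t):
    """Decision set of the classifier that predicts 1 when the score exceeds t."""
    return {i for i, s in enumerate(scores) if s > t}


def is_nested(scores, thresholds):
    """Check that decision sets shrink (weakly) as the threshold grows."""
    sets = [threshold_classifier(scores, t) for t in sorted(thresholds)]
    # set <= set is the subset test in Python
    return all(later <= earlier for earlier, later in zip(sets, sets[1:]))


# Hypothetical scores from some regression function (illustration only).
scores = [0.2, 0.7, 0.5, 0.9]
print(threshold_classifier(scores, 0.4))
print(is_nested(scores, [0.1, 0.4, 0.8]))
```

Classifiers obtained by thresholding a single function always have nested decision sets; the abstract's result concerns the converse direction, which this sketch does not attempt to show.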

classifier, regression function, unawareness framework, (11 more...)

2409.02471

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.77)

Khuat, Thanh Tung, Bassett, Robert, Otte, Ellen, Gabrys, Bogdan

Uncertainty Quantification Using Ensemble Learning and Monte Carlo Sampling for Performance Prediction and Monitoring in Cell Culture Processes

arXiv.org Artificial Intelligence, Sep-3-2024

Biopharmaceutical products, particularly monoclonal antibodies (mAbs), have gained prominence in the pharmaceutical market due to their high specificity and efficacy. As these products are projected to constitute a substantial portion of global pharmaceutical sales, the application of machine learning models in mAb development and manufacturing is gaining momentum. This paper addresses the critical need for uncertainty quantification in machine learning predictions, particularly in scenarios with limited training data. Leveraging ensemble learning and Monte Carlo simulations, our proposed method generates additional input samples to enhance the robustness of the model in small training datasets. We evaluate the efficacy of our approach through two case studies: predicting antibody concentrations in advance and real-time monitoring of glucose concentrations during bioreactor runs using Raman spectra data. Our findings demonstrate the effectiveness of the proposed method in estimating the uncertainty levels associated with process performance predictions and facilitating real-time decision-making in biopharmaceutical manufacturing. This contribution not only introduces a novel approach for uncertainty quantification but also provides insights into overcoming challenges posed by small training datasets in bioprocess development. The evaluation demonstrates the effectiveness of our method in addressing key challenges related to uncertainty estimation within upstream cell cultivation, illustrating its potential impact on enhancing process control and product quality in the dynamic field of biopharmaceuticals.
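The abstract gives no implementation details, so the following is only a minimal sketch of the general idea. It assumes a one-feature linear model, bootstrap resampling to build the ensemble, and Gaussian jitter as the Monte Carlo input sampling. The names `fit_line` and `ensemble_predict` and the `input_noise` parameter are hypothetical and not taken from the paper.

```python
import random
from statistics import fmean, stdev


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def ensemble_predict(xs, ys, x_new, n_models=200, input_noise=0.05, seed=0):
    """Return the ensemble mean prediction and its spread at x_new."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Bootstrap resample of the small training set.
        idx = [rng.randrange(len(xs)) for _ in xs]
        # Monte Carlo step (assumed form): jitter the resampled inputs.
        xb = [xs[i] + rng.gauss(0.0, input_noise) for i in idx]
        yb = [ys[i] for i in idx]
        slope, intercept = fit_line(xb, yb)
        preds.append(slope * x_new + intercept)
    return fmean(preds), stdev(preds)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]
mean, spread = ensemble_predict(xs, ys, 2.5)
print(f"prediction {mean:.3f} with ensemble spread {spread:.3f}")
```

The spread across ensemble members serves as a rough uncertainty estimate; a real bioprocess model would use multivariate inputs such as Raman spectra and a stronger base learner.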

concentration, ensemble model, prediction, (14 more...)

2409.02149

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Immunology (0.89)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.75)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Architecture > Real Time Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

arXiv.org Artificial Intelligence, Sep-3-2024

Federated Prediction-Powered Inference from Decentralized Data

Luo, Ping, Deng, Xiaoge, Wen, Ziqing, Sun, Tao, Li, Dongsheng

In various domains, the increasing application of machine learning allows researchers to access inexpensive predictive data, which can be utilized as auxiliary data for statistical inference. Although such data are often unreliable compared to gold-standard datasets, Prediction-Powered Inference (PPI) has been proposed to ensure statistical validity despite the unreliability. However, the challenge of 'data silos' arises when the private gold-standard datasets are non-shareable for model training, leading to less accurate predictive models and invalid inferences. In this paper, we introduce the Federated Prediction-Powered Inference (Fed-PPI) framework, which addresses this challenge by enabling decentralized experimental data to contribute to statistically valid conclusions without sharing private information. The Fed-PPI framework involves training local models on private data, aggregating them through Federated Learning (FL), and deriving confidence intervals using PPI computation. The proposed framework is evaluated through experiments, demonstrating its effectiveness in producing valid confidence intervals.
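The PPI computation mentioned in the abstract can be sketched for the simplest case, estimating a population mean. This is the standard prediction-powered mean estimator with a normal-approximation interval, not the Fed-PPI procedure itself; the federated training and aggregation steps are omitted. The function name `ppi_mean_ci` and the toy data are assumptions.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(y_labeled, pred_labeled, pred_unlabeled, alpha=0.05):
    """Prediction-powered confidence interval for a mean.

    y_labeled / pred_labeled: gold-standard labels and model predictions
    on the same small labeled set. pred_unlabeled: model predictions on a
    large unlabeled set.
    """
    n, big_n = len(y_labeled), len(pred_unlabeled)
    # Rectifier: how far the predictions are off on the labeled data.
    rectifier = [y - f for y, f in zip(y_labeled, pred_labeled)]
    estimate = fmean(pred_unlabeled) + fmean(rectifier)
    std_err = math.sqrt(variance(pred_unlabeled) / big_n + variance(rectifier) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_err, estimate + z * std_err


# Toy data (illustration only): a biased, noisy predictor of y.
rng = random.Random(0)
truth = [rng.gauss(5.0, 1.0) for _ in range(2100)]
preds = [t + 0.3 + rng.gauss(0.0, 0.2) for t in truth]
low, high = ppi_mean_ci(truth[:100], preds[:100], preds[100:])
print(f"95% PPI interval for the mean: [{low:.3f}, {high:.3f}]")
```

The rectifier term corrects the bias of the predictions, which is why the interval can remain valid even when the predictive model is inaccurate.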

confidence interval, dataset, prediction, (15 more...)

2409.0173

Country:

North America > United States (0.48)
Asia > Middle East > Jordan (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Information Technology > Security & Privacy (0.87)
Government > Regional Government > North America Government > United States Government (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.47)

La Morgia, Massimo, Mei, Alessandro, Sassi, Francesco, Stefa, Julinda

Pump and Dumps in the Bitcoin Era: Real Time Detection of Cryptocurrency Market Manipulations

arXiv.org Artificial Intelligence, Sep-2-2024

In recent years, cryptocurrencies have become increasingly popular. Even people who are not experts have started to invest in these securities, and cryptocurrency exchanges now process transactions worth over 100 billion US dollars per month. However, many cryptocurrencies have low liquidity and are therefore highly prone to market manipulation schemes. In this paper, we perform an in-depth analysis of pump and dump schemes organized by communities over the Internet. We observe how these communities are organized and how they carry out the fraud. Then, we report on two case studies related to pump and dump groups. Lastly, we introduce an approach to detect the fraud in real time that outperforms the current state of the art, helping investors stay out of the market when a pump and dump scheme is in action.

cryptocurrency, operation, pump and dump, (15 more...)

doi: 10.1109/ICCCN49398.2020.9209660

2005.0661

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Italy > Lazio > Rome (0.04)
Europe > Austria > Tyrol > Innsbruck (0.04)
Asia > Pakistan > Sindh > Karachi Division > Karachi (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > e-Commerce > Financial Technology (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)

Fuhr, Jonathan, Papies, Dominik

Double Machine Learning meets Panel Data -- Promises, Pitfalls, and Potential Solutions

arXiv.org Machine Learning, Sep-2-2024

Estimating causal effects using machine learning (ML) algorithms can help to relax functional form assumptions if used within appropriate frameworks. However, most of these frameworks assume settings with cross-sectional data, whereas researchers often have access to panel data, which in traditional methods helps to deal with unobserved heterogeneity between units. In this paper, we explore how we can adapt double/debiased machine learning (DML) (Chernozhukov et al., 2018) for panel data in the presence of unobserved heterogeneity. This adaptation is challenging because DML's cross-fitting procedure assumes independent data and the unobserved heterogeneity is not necessarily additively separable in settings with nonlinear observed confounding. We assess the performance of several intuitively appealing estimators in a variety of simulations. While we find violations of the cross-fitting assumptions to be largely inconsequential for the accuracy of the effect estimates, many of the considered methods fail to adequately account for the presence of unobserved heterogeneity. However, we find that using predictive models based on the correlated random effects approach (Mundlak, 1978) within DML leads to accurate coefficient estimates across settings, given a sample size that is large relative to the number of observed confounders. We also show that the influence of the unobserved heterogeneity on the observed confounders plays a significant role in the performance of most alternative methods.
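The correlated random effects idea can be sketched as a feature-augmentation step: each observation's covariate is paired with the average of that covariate over all periods for the same unit, so that a downstream learner can pick up heterogeneity correlated with the covariates. The sketch below shows only this augmentation for a single covariate. How the paper feeds such features into the DML nuisance models is not stated in the abstract, and `add_unit_means` is a hypothetical name.

```python
from collections import defaultdict
from statistics import fmean


def add_unit_means(units, x):
    """Pair each observation x_it with its unit-level mean over time."""
    by_unit = defaultdict(list)
    for unit, value in zip(units, x):
        by_unit[unit].append(value)
    means = {unit: fmean(values) for unit, values in by_unit.items()}
    return [(value, means[unit]) for unit, value in zip(units, x)]


# Toy panel: two units observed twice each (illustration only).
units = ["a", "a", "b", "b"]
x = [1.0, 3.0, 10.0, 20.0]
print(add_unit_means(units, x))
```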

confounder, dml, unobserved heterogeneity, (16 more...)

2409.01266

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
North America > United States > Ohio > Warren County > Mason (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre:

Research Report > Promising Solution (0.64)
Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

arXiv.org Machine Learning, Sep-1-2024

Global Public Sentiment on Decentralized Finance: A Spatiotemporal Analysis of Geo-tagged Tweets from 150 Countries

Chen, Yuqi, Li, Yifan, Zhou, Kyrie Zhixuan, Fu, Xiaokang, Liu, Lingbo, Bao, Shuming, Sui, Daniel, Zhang, Luyao

In the digital era, blockchain technology, cryptocurrencies, and non-fungible tokens (NFTs) have transformed financial and decentralized systems. However, existing research often neglects the spatiotemporal variations in public sentiment toward these technologies, limiting macro-level insights into their global impact. This study leverages Twitter data to explore public attention and sentiment across 150 countries, analyzing over 150 million geotagged tweets from 2012 to 2022. Sentiment scores were derived using a BERT-based multilingual sentiment model trained on 7.4 billion tweets. The analysis integrates global cryptocurrency regulations and economic indicators from the World Development Indicators database. Results reveal significant global sentiment variations influenced by economic factors, with more developed nations engaging more in discussions, while less developed countries show higher sentiment levels. Geographically weighted regression indicates that GDP-tweet engagement correlation intensifies following Bitcoin price surges. Topic modeling shows that countries within similar economic clusters share discussion trends, while different clusters focus on distinct topics. This study highlights global disparities in sentiment toward decentralized finance, shaped by economic and regional factors, with implications for poverty alleviation, cryptocurrency crime, and sustainable development. The dataset and code are publicly available on GitHub.

decentralized finance, sentiment, tweet, (14 more...)

2409.00843

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > South Korea (0.04)
Asia > Bangladesh (0.04)
(142 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Industry:

Banking & Finance > Trading (1.00)
Information Technology > Services > e-Commerce Services (0.34)

Technology:

Information Technology > e-Commerce > Financial Technology (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Nau, Anna-Maria, Ditto, Phillip, Steadman, Dawnie Wolfe, Mockus, Audris

Identifying Factors to Help Improve Existing Decomposition-Based PMI Estimation Methods

arXiv.org Artificial Intelligence, Aug-31-2024

Accurately assessing the postmortem interval (PMI) is an important task in forensic science. Some existing techniques use regression models that take a decomposition score as input to predict the PMI or accumulated degree days (ADD); however, the provided formulas are based on very small samples and their accuracy is low. With the advent of Big Data, much larger samples can be used to improve PMI estimation methods. We therefore aim to investigate ways to improve PMI prediction accuracy by (a) using a much larger sample size, (b) employing more advanced linear models, and (c) enhancing models with factors known to affect the human decay process. Specifically, this study involved the curation of a sample of 249 human subjects from a large-scale decomposition dataset, followed by evaluating pre-existing PMI/ADD formulas and fitting increasingly sophisticated models to estimate the PMI/ADD. Results showed that including the total decomposition score (TDS), demographic factors (age, biological sex, and BMI), and weather-related factors (season of discovery, temperature history, and humidity history) increased the accuracy of the PMI/ADD models. Furthermore, the best performing PMI estimation model using the TDS, demographic, and weather-related features as predictors resulted in an adjusted R-squared of 0.34 and an RMSE of 0.95. It had a 7% lower RMSE than a model using only the TDS to predict the PMI and a 48% lower RMSE than the pre-existing PMI formula. The best ADD estimation model, also using the TDS, demographic, and weather-related features as predictors, resulted in an adjusted R-squared of 0.52 and an RMSE of 0.89. It had an 11% lower RMSE than the model using only the TDS to predict the ADD and a 52% lower RMSE than the pre-existing ADD formula. This work demonstrates the need (and way) to incorporate demographic and environmental factors into PMI/ADD estimation models.
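For readers unfamiliar with the two reported metrics, the following sketch computes RMSE and adjusted R-squared using their textbook definitions. The function names and the toy numbers are illustrative and unrelated to the study's data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def adjusted_r2(y_true, y_pred, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Toy values (illustration only).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.0, 2.0, 3.0, 4.0, 6.0]
print(rmse(y_true, y_pred), adjusted_r2(y_true, y_pred, n_predictors=1))
```

Because the adjustment penalizes each added predictor, a model that adds demographic and weather features only improves adjusted R-squared if those features explain enough extra variance.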

decomposition, formula, regression analysis, (17 more...)

2409.09056

Country:

North America > United States > Tennessee > Knox County > Knoxville (0.29)
Europe > North Sea (0.04)
Europe > Netherlands (0.04)
(2 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.55)

arXiv.org Machine Learning, Aug-31-2024

Multi-Output Distributional Fairness via Post-Processing

Li, Gang, Lin, Qihang, Ghosh, Ayush, Yang, Tianbao

Post-processing approaches are becoming prominent techniques for enhancing the fairness of machine learning models because of their intuitiveness, low computational cost, and excellent scalability. However, most existing post-processing methods are designed for task-specific fairness measures and are limited to single-output models. In this paper, we introduce a post-processing method for multi-output models, such as the ones used for multi-task/multi-class classification and representation learning, to enhance a model's distributional parity, a task-agnostic fairness measure. Existing techniques to achieve distributional parity are based on the (inverse) cumulative distribution function of a model's output, which limits them to single-output models. Extending previous works, our method employs an optimal transport mapping to move a model's outputs across different groups towards their empirical Wasserstein barycenter. An approximation technique is applied to reduce the complexity of computing the exact barycenter, and a kernel regression method is proposed for extending this process to out-of-sample data. Our empirical studies, which compare our method to existing post-processing baselines on multi-task/multi-class classification and representation learning tasks, demonstrate the effectiveness of the proposed approach.
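The single-output baseline that the abstract contrasts against can be sketched in a few lines. For scalar outputs, the Wasserstein-2 barycenter of two empirical distributions with equal weights is obtained by averaging their sorted values (quantile averaging), and each output is then moved to the barycenter value at its own rank. The sketch assumes two groups of equal size and is not the paper's multi-output method; `barycenter_map_1d` is a hypothetical name.

```python
def barycenter_map_1d(group_a, group_b):
    """Map scalar outputs of two equal-sized groups onto their 1-D barycenter."""
    if len(group_a) != len(group_b):
        raise ValueError("this sketch assumes equal group sizes")
    # Quantile averaging: the k-th smallest barycenter point is the mean
    # of the k-th smallest outputs of the two groups.
    bary = [(a + b) / 2 for a, b in zip(sorted(group_a), sorted(group_b))]

    def remap(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        out = [0.0] * len(values)
        for rank, idx in enumerate(order):
            out[idx] = bary[rank]
        return out

    return remap(group_a), remap(group_b)


# Toy model outputs for two groups (illustration only).
new_a, new_b = barycenter_map_1d([0.1, 0.9, 0.5], [0.4, 0.2, 0.6])
print(sorted(new_a) == sorted(new_b))
```

After the mapping both groups share the same empirical output distribution while the ranking within each group is preserved. Sorting has no direct analogue for vector-valued outputs, which is the gap the optimal transport mapping in the abstract is meant to fill.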

classification, fairness, post-processing method, (13 more...)

2409.00553

Country:

North America > United States > Texas > Brazos County > College Station (0.04)
North America > United States > Iowa (0.04)
North America > United States > California (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine > Diagnostic Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)