Regression
Correlating Medi-Claim Service by Deep Learning Neural Networks
Vajiram, Jayanthi, Senthil, Negha, P, Nean Adhith.
Organized crime is a continuous issue, and predicting it is always under research. Medical insurance claim fraud is one form of organized crime involving patients, physicians, diagnostic centers, and insurance providers, forming a chain reaction that must be monitored constantly. Such fraud affects the financial growth of both insured people and health insurance companies. A Convolutional Neural Network architecture is used to detect fraudulent claims through a correlation study of regression models, which helps to detect money laundering on different claims given by different providers. Supervised and unsupervised classifiers are used to detect fraud and non-fraud claims. Using attributes of patient case studies, diagnostic reports, and service provider reimbursement claim codes as control variables and as attributes of the target class, this paper highlights the top reason for organized crime in the public dataset. The claims are filed by the provider, so the fraud can be organized crime. The performance metrics of accuracy, sensitivity, specificity, recall, precision, AUC, and F1-score are calculated.
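As a reading aid, the threshold-based metrics listed in the abstract can be computed from the four confusion-matrix counts. The sketch below is a minimal illustration, not the paper's evaluation code: the function name and the counts are hypothetical, and AUC is left out because it needs ranked scores rather than counts. Note that sensitivity and recall are the same quantity.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return common binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)            # identical to sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "sensitivity": recall,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not results from the paper):
# "positive" means a claim flagged as fraudulent.
metrics = classification_metrics(tp=80, fp=10, tn=890, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```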
Accurate, Explainable, and Private Models: Providing Recourse While Minimizing Training Data Leakage
Huang, Catherine, Swoopes, Chelse, Xiao, Christina, Ma, Jiaqi, Lakkaraju, Himabindu
Machine learning models are increasingly utilized across impactful domains to predict individual outcomes. As such, many models provide algorithmic recourse to individuals who receive negative outcomes. However, recourse can be leveraged by adversaries to disclose private information. This work presents the first attempt at mitigating such attacks. We present two novel methods to generate differentially private recourse: Differentially Private Model (DPM) and Laplace Recourse (LR). Using logistic regression classifiers and real-world and synthetic datasets, we find that DPM and LR perform well in reducing what an adversary can infer, especially at low FPR. When the training dataset is large enough, we find particular success in preventing privacy leakage while maintaining model and recourse accuracy with our novel LR method.
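The abstract does not describe the internals of DPM or LR. As background only, the sketch below shows the generic Laplace mechanism from differential privacy: adding independent Laplace noise with scale sensitivity/epsilon to each coordinate of a released vector, where the sensitivity is the L1 sensitivity of the whole vector. The function names, the example vector, and the parameter values are all assumptions made for illustration; this is not the authors' Laplace Recourse method.

```python
import random


def laplace_noise(scale, rng):
    """Draw one zero-mean Laplace sample with the given scale.

    The difference of two i.i.d. Exponential(rate=1/scale) draws
    follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(values, l1_sensitivity, epsilon, rng):
    """Release `values` with independent Laplace noise on every coordinate."""
    scale = l1_sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


# Hypothetical recourse vector (suggested feature changes); values are made up.
rng = random.Random(0)
recourse = [0.4, -1.2, 3.0]
noisy_recourse = laplace_mechanism(recourse, l1_sensitivity=1.0, epsilon=0.5, rng=rng)
print(noisy_recourse)
```

A smaller epsilon gives a larger noise scale, which is the usual privacy versus accuracy trade-off.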
Varying-coefficients for regional quantile via KNN-based LASSO with applications to health outcome study
Park, Seyoung, Lee, Eun Ryung, Hong, Hyokyoung G.
Health outcomes, such as body mass index and cholesterol levels, are known to be dependent on age and exhibit varying effects with their associated risk factors. In this paper, we propose a novel framework for dynamic modeling of the associations between health outcomes and risk factors using varying-coefficients (VC) regional quantile regression via K-nearest neighbors (KNN) fused Lasso, which captures the time-varying effects of age. The proposed method has strong theoretical properties, including a tight estimation error bound and the ability to detect exact clustered patterns under certain regularity conditions. To efficiently solve the resulting optimization problem, we develop an alternating direction method of multipliers (ADMM) algorithm. Our empirical results demonstrate the efficacy of the proposed method in capturing the complex age-dependent associations between health outcomes and their risk factors.
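Quantile regression in general is built on the check (pinball) loss. The sketch below shows only that standard building block on made-up numbers. It does not implement the varying-coefficient model, the KNN fused Lasso penalty, or the ADMM solver from the paper, and the helper names are hypothetical.

```python
def pinball_loss(y, pred, tau):
    """Average check (pinball) loss at quantile level tau in (0, 1)."""
    total = 0.0
    for yi, pi in zip(y, pred):
        residual = yi - pi
        # Under-predictions cost tau per unit, over-predictions cost (1 - tau).
        total += tau * residual if residual >= 0 else (tau - 1.0) * residual
    return total / len(y)


def best_constant(y, tau):
    """Constant prediction minimizing the pinball loss, searched over the data points."""
    return min(y, key=lambda c: pinball_loss(y, [c] * len(y), tau))


# Made-up outcome values with one outlier.
y = [1.0, 2.0, 3.0, 4.0, 100.0]
print(best_constant(y, 0.5))   # minimizer at the median level
print(best_constant(y, 0.9))   # minimizer at the 0.9 quantile level
```

Minimizing this loss over a constant recovers an empirical quantile, which is why replacing squared error with the pinball loss turns a mean regression into a quantile regression.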
Iterative Sketching for Secure Coded Regression
Charalambides, Neophytos, Mahdavifar, Hessam, Pilanci, Mert, Hero, Alfred O. III
In this work, we propose methods for speeding up linear regression distributively, while ensuring security. We leverage randomized sketching techniques, and improve straggler resilience in asynchronous systems. Specifically, we apply a random orthonormal matrix and then subsample blocks, to simultaneously secure the information and reduce the dimension of the regression problem. In our setup, the transformation corresponds to an encoded encryption in an approximate gradient coding scheme, and the subsampling corresponds to the responses of the non-straggling workers in a centralized coded computing network. This results in a distributive iterative sketching approach for an $\ell_2$-subspace embedding, i.e., a new sketch is considered at each iteration. We also focus on the special case of the Subsampled Randomized Hadamard Transform, which we generalize to block sampling, and discuss how it can be modified in order to secure the data.
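The pipeline described above (rotate once with a random orthonormal matrix, then use a fresh block subsample at each iteration) can be sketched in a few lines of NumPy. This is a single-machine toy under stated assumptions, not the authors' scheme: the helper names, problem sizes, uniform block sampling without replacement, the step size, and the noiseless synthetic data are all choices made here for illustration, and the security modification is not shown.

```python
import numpy as np


def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H


def randomized_hadamard(n, rng):
    """Orthonormal matrix H D / sqrt(n), with D a random diagonal sign matrix."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) * signs / np.sqrt(n)


def sample_blocks(M, block_size, num_blocks, rng):
    """Keep `num_blocks` contiguous row blocks chosen uniformly, rescaled so E[S^T S] = I."""
    total_blocks = M.shape[0] // block_size
    chosen = rng.choice(total_blocks, size=num_blocks, replace=False)
    rows = np.concatenate(
        [np.arange(k * block_size, (k + 1) * block_size) for k in chosen]
    )
    return np.sqrt(total_blocks / num_blocks) * M[rows]


rng = np.random.default_rng(0)
n, d, block_size, num_blocks = 64, 3, 8, 4
A = rng.standard_normal((n, d))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                                  # noiseless synthetic responses

Pi = randomized_hadamard(n, rng)
encoded = Pi @ np.column_stack([A, b])          # rotation is applied once, up front

oversample = (n // block_size) / num_blocks
step = 1.0 / (oversample * np.linalg.norm(A, 2) ** 2)   # conservative step size
x = np.zeros(d)
for _ in range(300):
    # A new sketch at every iteration: the sampled blocks stand in for the
    # workers that responded in time.
    S = sample_blocks(encoded, block_size, num_blocks, rng)
    SA, Sb = S[:, :d], S[:, d]
    x -= step * SA.T @ (SA @ x - Sb)

print(x)
```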
Inherently Interpretable Multi-Label Classification Using Class-Specific Counterfactuals
Sun, Susu, Woerner, Stefano, Maier, Andreas, Koch, Lisa M., Baumgartner, Christian F.
Interpretability is essential for machine learning algorithms in high-stakes application fields such as medical image analysis. However, high-performing black-box neural networks do not provide explanations for their predictions, which can lead to mistrust and suboptimal human-ML collaboration. Post-hoc explanation techniques, which are widely used in practice, have been shown to suffer from severe conceptual problems. Furthermore, as we show in this paper, current explanation techniques do not perform adequately in the multi-label scenario, in which multiple medical findings may co-occur in a single image. We propose Attri-Net, an inherently interpretable model for multi-label classification. Attri-Net is a powerful classifier that provides transparent, trustworthy, and human-understandable explanations. The model first generates class-specific attribution maps based on counterfactuals to identify which image regions correspond to certain medical findings. Then a simple logistic regression classifier is used to make predictions based solely on these attribution maps. We compare Attri-Net to five post-hoc explanation techniques and one inherently interpretable classifier on three chest X-ray datasets. We find that Attri-Net produces high-quality multi-label explanations consistent with clinical knowledge and has comparable classification performance to state-of-the-art classification models.
A Meta-learning based Stacked Regression Approach for Customer Lifetime Value Prediction
Gadgil, Karan, Gill, Sukhpal Singh, Abdelmoniem, Ahmed M.
Companies across the globe are keen on targeting potential high-value customers in an attempt to expand revenue, and this can be achieved only by understanding the customers better. Customer Lifetime Value (CLV) is the total monetary value of transactions/purchases made by a customer with the business over an intended period of time and is used as a means to estimate future customer interactions. CLV finds application in a number of distinct business domains such as Banking, Insurance, Online-entertainment, Gaming, and E-Commerce. The existing distribution-based models and basic recency, frequency, and monetary (RFM) models are limited in handling a wide variety of input features. Moreover, the more advanced deep learning approaches could be superfluous and add an undesirable element of complexity in certain application areas. We, therefore, propose a system that is effective and comprehensive yet simple and interpretable. With that in mind, we develop a meta-learning-based stacked regression model which combines the predictions from bagging and boosting models, each of which is found to perform well individually. Empirical tests have been carried out on an openly available Online Retail dataset to evaluate various models and show the efficacy of the proposed approach. The key to flourishing businesses lies in understanding the customers using various aspects of their interactions with the businesses.
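A stacked regression of one bagging model and one boosting model can be sketched with scikit-learn's `StackingRegressor`. The abstract does not name the base learners or the meta-learner, so the random forest, gradient boosting, and linear meta-learner below are assumptions, and the data is synthetic rather than the Online Retail dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for customer features and a CLV-like target.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

stack = StackingRegressor(
    estimators=[
        ("bagging", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("boosting", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ],
    # The meta-learner is trained on out-of-fold predictions of the base models.
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
predictions = stack.predict(X_test)
r2 = stack.score(X_test, y_test)
print(f"held-out R^2: {r2:.3f}")
```

A linear meta-learner keeps the combination step easy to inspect: its coefficients show how much weight each base model receives.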
Generalization bound for estimating causal effects from observational network data
Cai, Ruichu, Yang, Zeqin, Chen, Weilin, Yan, Yuguang, Hao, Zhifeng
Estimating causal effects from observational network data is a significant but challenging problem. Existing works in causal inference for observational network data lack an analysis of the generalization bound, which can theoretically provide support for alleviating the complex confounding bias and practically guide the design of learning objectives in a principled manner. To fill this gap, we derive a generalization bound for causal effect estimation in network scenarios by exploiting 1) the reweighting schema based on joint propensity score and 2) the representation learning schema based on Integral Probability Metric (IPM). We provide two perspectives on the generalization bound in terms of reweighting and representation learning, respectively. Motivated by the analysis of the bound, we propose a weighting regression method based on the joint propensity score augmented with representation learning. Extensive experimental studies on two real-world networks with semi-synthetic data demonstrate the effectiveness of our algorithm.
A data-driven approach to predict decision point choice during normal and evacuation wayfinding in multi-story buildings
Feng, Yan, Krishnakumari, Panchamy
Understanding pedestrian route choice behavior in complex buildings is important to ensure pedestrian safety. Previous studies have mostly used traditional data collection methods and discrete choice modeling to understand the influence of different factors on pedestrian route and exit choice, particularly in simple indoor environments. However, research on pedestrian route choice in complex buildings is still limited. This paper presents a data-driven approach for understanding and predicting the pedestrian decision point choice during normal and emergency wayfinding in a multi-story building. For this, we first built an indoor network representation and proposed a data mapping technique to map VR coordinates to the indoor representation. We then used a well-established machine learning algorithm, namely the random forest (RF) model to predict pedestrian decision point choice along a route during four wayfinding tasks in a multi-story building. Pedestrian behavioral data in a multi-story building was collected by a Virtual Reality experiment. The results show a much higher prediction accuracy of decision points using the RF model (i.e., 93% on average) compared to the logistic regression model. The highest prediction accuracy was 96% for task 3. Additionally, we tested the model performance combining personal characteristics and we found that personal characteristics did not affect decision point choice. This paper demonstrates the potential of applying a machine learning algorithm to study pedestrian route choice behavior in complex indoor buildings.
Batches Stabilize the Minimum Norm Risk in High Dimensional Overparameterized Linear Regression
Ioushua, Shahar Stein, Hasidim, Inbar, Shayevitz, Ofer, Feder, Meir
Learning algorithms that divide the data into batches are prevalent in many machine-learning applications, typically offering useful trade-offs between computational efficiency and performance. In this paper, we examine the benefits of batch-partitioning through the lens of a minimum-norm overparameterized linear regression model with isotropic Gaussian features. We suggest a natural small-batch version of the minimum-norm estimator, and derive an upper bound on its quadratic risk, showing it is inversely proportional to the noise level as well as to the overparameterization ratio, for the optimal choice of batch size. In contrast to minimum-norm, our estimator admits a stable risk behavior that is monotonically increasing in the overparameterization ratio, eliminating both the blowup at the interpolation point and the double-descent phenomenon. Interestingly, we observe that this implicit regularization offered by the batch partition is partially explained by feature overlap between the batches. Our bound is derived via a novel combination of techniques, in particular normal approximation in the Wasserstein metric of noisy projections over random subspaces.
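For orientation, the minimum-norm estimator in the overparameterized regime (more features than samples) is the pseudoinverse solution, which interpolates the training data. The sketch below computes it and, as one possible reading of a "small-batch version", averages the minimum-norm solutions of disjoint batches. The plain average, the batch size, and the synthetic Gaussian data are assumptions made here; the abstract does not specify how the paper combines the batches.

```python
import numpy as np


def min_norm(X, y):
    """Minimum-l2-norm least-squares solution via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(X) @ y


def batch_min_norm(X, y, batch_size):
    """Average of per-batch minimum-norm solutions (plain averaging is an assumption)."""
    n = X.shape[0]
    estimates = [
        min_norm(X[i:i + batch_size], y[i:i + batch_size])
        for i in range(0, n, batch_size)
    ]
    return np.mean(estimates, axis=0)


rng = np.random.default_rng(0)
n, p = 20, 100                              # overparameterized: p > n
X = rng.standard_normal((n, p))             # isotropic Gaussian features
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + 0.1 * rng.standard_normal(n)

beta_mn = min_norm(X, y)
beta_batch = batch_min_norm(X, y, batch_size=5)

print("training residual (min-norm):", np.linalg.norm(X @ beta_mn - y))
print("parameter error (min-norm):  ", np.linalg.norm(beta_mn - beta))
print("parameter error (batch avg): ", np.linalg.norm(beta_batch - beta))
```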
Comparative Analysis of Epileptic Seizure Prediction: Exploring Diverse Pre-Processing Techniques and Machine Learning Models
Talukder, Md. Simul Hasan, Sulaiman, Rejwan Bin
Epilepsy is a prevalent neurological disorder characterized by recurrent and unpredictable seizures, necessitating accurate prediction for effective management and patient care. Application of machine learning (ML) on electroencephalogram (EEG) recordings, along with its ability to provide valuable insights into brain activity during seizures, is able to make accurate and robust seizure prediction an indispensable component in relevant studies. In this research, we present a comprehensive comparative analysis of five machine learning models - Random Forest (RF), Decision Tree (DT), Extra Trees (ET), Logistic Regression (LR), and Gradient Boosting (GB) - for the prediction of epileptic seizures using EEG data. The dataset underwent meticulous preprocessing, including cleaning, normalization, outlier handling, and oversampling, ensuring data quality and facilitating accurate model training. These preprocessing techniques played a crucial role in enhancing the models' performance. The results of our analysis demonstrate the performance of each model in terms of accuracy. The LR classifier achieved an accuracy of 56.95%, while GB and DT both attained 97.17% accuracy. RF achieved a higher accuracy of 98.99%, while the ET model exhibited the best performance with an accuracy of 99.29%. Our findings reveal that the ET model not only outperformed the other models in the comparative analysis but also surpassed the state-of-the-art results from previous research. The superior performance of the ET model makes it a compelling choice for accurate and robust epileptic seizure prediction using EEG data.