Goto

Collaborating Authors

 Regression


Discovering Significant Topics from Legal Decisions with Selective Inference

arXiv.org Artificial Intelligence

We propose and evaluate an automated pipeline for discovering significant topics from legal decision texts by passing features synthesized with topic models through penalised regressions and post-selection significance tests. The method identifies case topics significantly correlated with outcomes, topic-word distributions which can be manually-interpreted to gain insights about significant topics, and case-topic weights which can be used to identify representative cases for each topic. We demonstrate the method on a new dataset of domain name disputes and a canonical dataset of European Court of Human Rights violation cases. Topic models based on latent semantic analysis as well as language model embeddings are evaluated. We show that topics derived by the pipeline are consistent with legal doctrines in both areas and can be useful in other related legal analysis tasks.


Coefficient Shape Alignment in Multivariate Functional Regression

arXiv.org Machine Learning

In multivariate functional data analysis, different functional covariates can be homogeneous. The hidden homogeneity structure is informative about the connectivity or association of different covariates. The covariates with pronounced homogeneity can be analyzed jointly within the same group, which gives rise to a way of parsimoniously modeling multivariate functional data. In this paper, a novel grouped multivariate functional regression model with a new regularization approach termed "coefficient shape alignment" is developed to tackle the potential homogeneity of different functional covariates. The modeling procedure includes two main steps: first detect the unknown grouping structure with the new regularization approach to aggregate covariates into disjoint groups; and then the grouped multivariate functional regression model is established based on the detected grouping structure. In this new grouped model, the coefficient functions of covariates in the same homogeneous group share the same shape invariant to scaling. The new regularization approach builds on penalizing the discrepancy of coefficient shape. The consistency property of the detected grouping structure is thoroughly investigated, and the conditions that guarantee uncovering the underlying true grouping structure are developed. The asymptotic properties of the model estimates are also developed. Extensive simulation studies are conducted to investigate the finite-sample properties of the developed methods. The practical utility of the proposed methods is illustrated in the real data analysis on sugar quality evaluation. This work provides a novel means for analyzing the underlying homogeneity of functional covariates and developing parsimonious model structures for multivariate functional data.


Revisiting inference after prediction

arXiv.org Machine Learning

Recent work has focused on the very common practice of prediction-based inference: that is, (i) using a pre-trained machine learning model to predict an unobserved response variable, and then (ii) conducting inference on the association between that predicted response and some covariates. As pointed out by Wang et al. (2020), applying a standard inferential approach in (ii) does not accurately quantify the association between the unobserved (as opposed to the predicted) response and the covariates. In recent work, Wang et al. (2020) and Angelopoulos et al. (2023) propose corrections to step (ii) in order to enable valid inference on the association between the unobserved response and the covariates. Here, we show that the method proposed by Angelopoulos et al. (2023) successfully controls the type 1 error rate and provides confidence intervals with correct nominal coverage, regardless of the quality of the pre-trained machine learning model used to predict the unobserved response. However, the method proposed by Wang et al. (2020) provides valid inference only under very strong conditions that rarely hold in practice: for instance, if the machine learning model perfectly estimates the true regression function in the study population of interest.


Machine Learning Classification of Alzheimer's Disease Stages Using Cerebrospinal Fluid Biomarkers Alone

arXiv.org Artificial Intelligence

Early diagnosis of Alzheimer's disease is a challenge because the existing methodologies do not identify the patients in their preclinical stage, which can last up to a decade prior to the onset of clinical symptoms. Several research studies demonstrate the potential of cerebrospinal fluid biomarkers, amyloid beta 1-42, T-tau, and P-tau, in early diagnosis of Alzheimer's disease stages. In this work, we used machine learning models to classify different stages of Alzheimer's disease based on the cerebrospinal fluid biomarker levels alone. An electronic health record of patients from the National Alzheimer's Coordinating Centre database was analyzed and the patients were subdivided based on mini-mental state scores and clinical dementia ratings. Statistical and correlation analyses were performed to identify significant differences between the Alzheimer's stages. Afterward, machine learning classifiers including K-Nearest Neighbors, Ensemble Boosted Tree, Ensemble Bagged Tree, Support Vector Machine, Logistic Regression, and Naรฏve Bayes classifiers were employed to classify the Alzheimer's disease stages. The results demonstrate that Ensemble Boosted Tree (84.4%) and Logistic Regression (73.4%) provide the highest accuracy for binary classification, while Ensemble Bagged Tree (75.4%) demonstrates better accuracy for multiclassification. The findings from this research are expected to help clinicians in making an informed decision regarding the early diagnosis of Alzheimer's from the cerebrospinal fluid biomarkers alone, monitoring of the disease progression, and implementation of appropriate intervention measures.


Automated Model Selection for Tabular Data

arXiv.org Artificial Intelligence

Structured data in the form of tabular datasets contain features that are distinct and discrete, with varying individual and relative importances to the target. Combinations of one or more features may be more predictive and meaningful than simple individual feature contributions. R's mixed effect linear models library allows users to provide such interactive feature combinations in the model design. However, given many features and possible interactions to select from, model selection becomes an exponentially difficult task. We aim to automate the model selection process for predictions on tabular datasets incorporating feature interactions while keeping computational costs small. The framework includes two distinct approaches for feature selection: a Priority-based Random Grid Search and a Greedy Search method. The Priority-based approach efficiently explores feature combinations using prior probabilities to guide the search. The Greedy method builds the solution iteratively by adding or removing features based on their impact. Experiments on synthetic demonstrate the ability to effectively capture predictive feature combinations.


Application of Machine Learning in Stock Market Forecasting: A Case Study of Disney Stock

arXiv.org Artificial Intelligence

This document presents a stock market analysis conducted on a dataset consisting of 750 instances and 16 attributes donated in 2014-10-23. The analysis includes an exploratory data analysis (EDA) section, feature engineering, data preparation, model selection, and insights from the analysis. The Fama French 3-factor model is also utilized in the analysis. The results of the analysis are presented, with linear regression being the best-performing model.


Exploring AI-Generated Text in Student Writing: How Does AI Help?

arXiv.org Artificial Intelligence

English as foreign language_EFL_students' use of text generated from artificial intelligence_AI_natural language generation_NLG_tools may improve their writing quality. However, it remains unclear to what extent AI-generated text in these students' writing might lead to higher-quality writing. We explored 23 Hong Kong secondary school students' attempts to write stories comprising their own words and AI-generated text. Human experts scored the stories for dimensions of content, language and organization. We analyzed the basic organization and structure and syntactic complexity of the stories' AI-generated text and performed multiple linear regression and cluster analyses. The results show the number of human words and the number of AI-generated words contribute significantly to scores. Besides, students can be grouped into competent and less competent writers who use more AI-generated text or less AI-generated text compared to their peers. Comparisons of clusters reveal some benefit of AI-generated text in improving the quality of both high-scoring students' and low-scoring students' writing. The findings can inform pedagogical strategies to use AI-generated text for EFL students' writing and to address digital divides. This study contributes designs of NLG tools and writing activities to implement AI-generated text in schools.


Cluster-based Regression using Variational Inference and Applications in Financial Forecasting

arXiv.org Machine Learning

This paper describes an approach to simultaneously identify clusters and estimate cluster-specific regression parameters from the given data. Such an approach can be useful in learning the relationship between input and output when the regression parameters for estimating output are different in different regions of the input space. Variational Inference (VI), a machine learning approach to obtain posterior probability densities using optimization techniques, is used to identify clusters of explanatory variables and regression parameters for each cluster. From these results, one can obtain both the expected value and the full distribution of predicted output. Other advantages of the proposed approach include the elegant theoretical solution and clear interpretability of results. The proposed approach is well-suited for financial forecasting where markets have different regimes (or clusters) with different patterns and correlations of market changes in each regime. In financial applications, knowledge about such clusters can provide useful insights about portfolio performance and identify the relative importance of variables in different market regimes. An illustrative example of predicting one-day S&P change is considered to illustrate the approach and compare the performance of the proposed approach with standard regression without clusters. Due to the broad applicability of the problem, its elegant theoretical solution, and the computational efficiency of the proposed algorithm, the approach may be useful in a number of areas extending beyond the financial domain.


Conditional Density Estimations from Privacy-Protected Data

arXiv.org Machine Learning

Many modern statistical analysis and machine learning applications require training models on sensitive user data. Differential privacy provides a formal guarantee that individual-level information about users does not leak. In this framework, randomized algorithms inject calibrated noise into the confidential data, resulting in privacy-protected datasets or queries. However, restricting access to only privatized data during statistical analysis makes it computationally challenging to make valid inferences on the parameters underlying the confidential data. In this work, we propose simulation-based inference methods from privacy-protected datasets. In addition to sequential Monte Carlo approximate Bayesian computation, we use neural conditional density estimators as a flexible family of distributions to approximate the posterior distribution of model parameters given the observed private query results. We illustrate our methods on discrete time-series data under an infectious disease model and with ordinary linear regression models. Illustrating the privacy-utility trade-off, our experiments and analysis demonstrate the necessity and feasibility of designing valid statistical inference procedures to correct for biases introduced by the privacy-protection mechanisms.


KAXAI: An Integrated Environment for Knowledge Analysis and Explainable AI

arXiv.org Artificial Intelligence

In order to fully harness the potential of machine learning, it is crucial to establish a system that renders the field more accessible and less daunting for individuals who may not possess a comprehensive understanding of its intricacies. The paper describes the design of a system that integrates AutoML, XAI, and synthetic data generation to provide a great UX design for users. The system allows users to navigate and harness the power of machine learning while abstracting its complexities and providing high usability. The paper proposes two novel classifiers, Logistic Regression Forest and Support Vector Tree, for enhanced model performance, achieving 96\% accuracy on a diabetes dataset and 93\% on a survey dataset. The paper also introduces a model-dependent local interpreter called MEDLEY and evaluates its interpretation against LIME, Greedy, and Parzen. Additionally, the paper introduces LLM-based synthetic data generation, library-based data generation, and enhancing the original dataset with GAN. The findings on synthetic data suggest that enhancing the original dataset with GAN is the most reliable way to generate synthetic data, as evidenced by KS tests, standard deviation, and feature importance. The authors also found that GAN works best for quantitative datasets.