Goto

Collaborating Authors

 testing data


Covariance-Driven Regression Trees: Reducing Overfitting in CART

Zhang, Likun, Ma, Wei

arXiv.org Machine Learning

Decision trees are powerful machine learning algorithms, widely used in fields such as economics and medicine for their simplicity and interpretability. However, decision trees such as CART are prone to overfitting, especially when grown deep or the sample size is small. Conventional methods to reduce overfitting include pre-pruning and post-pruning, which constrain the growth of uninformative branches. In this paper, we propose a complementary approach by introducing a covariance-driven splitting criterion for regression trees (CovRT). This method is more robust to overfitting than the empirical risk minimization criterion used in CART, as it produces more balanced and stable splits and more effectively identifies covariates with true signals. We establish an oracle inequality of CovRT and prove that its predictive accuracy is comparable to that of CART in high-dimensional settings. We find that CovRT achieves superior prediction accuracy compared to CART in both simulations and real-world tasks.


Predicting COVID-19 Prevalence Using Wastewater RNA Surveillance: A Semi-Supervised Learning Approach with Temporal Feature Trust

Chen, Yifei, Liang, Eric

arXiv.org Artificial Intelligence

As COVID-19 transitions into an endemic disease that remains constantly present in the population at a stable level, monitoring its prevalence without invasive measures becomes increasingly important. In this paper, we present a deep neural network estimator for the COVID-19 daily case count based on wastewater surveillance data and other confounding factors. This work builds upon the study by Jiang, Kolozsvary, and Li (2024), which connects the COVID-19 case counts with testing data collected early in the pandemic. Using the COVID-19 testing data and the wastewater surveillance data during the period when both data were highly reliable, one can train an artificial neural network that learns the nonlinear relation between the COVID-19 daily case count and the wastewater viral RNA concentration. From a machine learning perspective, the main challenge lies in addressing temporal feature reliability, as the training data has different reliability over different time periods.



Learning Multi-resolution Functional Maps with Spectral Attention for Robust Shape Matching: Supplementary Material

Neural Information Processing Systems

Appendix C. We give running time analysis in Appendix D. At last, we show more qualitative results In Section 4.2 of the main manuscript, we introduce spectral alignment residuals {r It is proved in [2] that the solution to Eq. 8 is given in closed form by: [C Figure 4: Architecture of our spectral attention network based on PointNet. In this section, we provide more implementation details of our model to complement Sec 4.3. The dimensionality of WKS is 128. The SMAL dataset has eight species of four-legged animal shapes. The training data is composed of five species, including cow (8 shapes), dog (9), fox (4), lion (5), and wolf (3).


Appendix A The Proof of Prop. 1 Denote P

Neural Information Processing Systems

Compared with Eq. 25 and Eq. 29, we find We can see the approximate solutions. The tasks are binary classification. We use the default train/val/test split with ratio 8:1:1. The experiments are easily influenced by the parameter initialization. The AUC-ROC results are reported in Tab. 2. H.4 We use the official splits and evaluation is conducted on validation set.



Measuring Moral LLM Responses in Multilingual Capacities

Basu, Kimaya, Kolari, Savi, Yu, Allison

arXiv.org Artificial Intelligence

With LLM usage becoming widespread across countries, languages, and humanity more broadly, the need to understand and guardrail their multilingual responses increases. Large-scale datasets for testing and benchmarking have been created to evaluate and facilitate LLM responses across multiple dimensions. In this study, we evaluate the responses of frontier and leading open-source models in five dimensions across low and high-resource languages to measure LLM accuracy and consistency across multilingual contexts. We evaluate the responses using a five-point grading rubric and a judge LLM. Our study shows that GPT -5 performed the best on average in each category, while other models displayed more inconsistency across language and category. Most notably, in the Consent & Autonomy and Harm Prevention & Safety categories, GPT scored the highest with averages of 3.56 and 4.73, while Gemini 2.5 Pro scored the lowest with averages of 1.39 and 1.98, respectively. These findings emphasize the need for further testing on how linguistic shifts impact LLM responses across various categories and improvement in these areas.


Enhancing Bankruptcy Prediction of Banks through Advanced Machine Learning Techniques: An Innovative Approach and Analysis

Rustam, Zuherman, Hartini, Sri, Islam, Sardar M. N., Novkaniza, Fevi, Aszhari, Fiftitah R., Rifqi, Muhammad

arXiv.org Artificial Intelligence

Context: Financial system stability is determined by the condition of the banking system. A bank failure can destroy the stability of the financial system, as banks are subject to systemic risk, affecting not only individual banks but also segments or the entire financial system. Calculating the probability of a bank going bankrupt is one way to ensure the banking system is safe and sound. Existing literature and limitations: Statistical models, such as Altman's Z-Score, are one of the common techniques for developing a bankruptcy prediction model. However, statistical methods rely on rigid and sometimes irrelevant assumptions, which can result in low forecast accuracy. New approaches are necessary. Objective of the research: Bankruptcy models are developed using machine learning techniques, such as logistic regression (LR), random forest (RF), and support vector machines (SVM). According to several studies, machine learning is also more accurate and effective than statistical methods for categorising and forecasting banking risk management. Present Research: The commercial bank data are derived from the annual financial statements of 44 active banks and 21 bankrupt banks in Turkey from 1994 to 2004, and the rural bank data are derived from the quarterly financial reports of 43 active and 43 bankrupt rural banks in Indonesia between 2013 and 2019. Five rural banks in Indonesia have also been selected to demonstrate the feasibility of analysing bank bankruptcy trends. Findings and implications: The results of the research experiments show that RF can forecast data from commercial banks with a 90% accuracy rate. Furthermore, the three machine learning methods proposed accurately predict the likelihood of rural bank bankruptcy. Contribution and Conclusion: The proposed innovative machine learning approach help to implement policies that reduce the costs of bankruptcy.



High Cycle S-N curve prediction for Al 7075-T6 alloy using Recurrent Neural Networks (RNNs)

Patel, Aryan

arXiv.org Artificial Intelligence

Aluminum is a widely used alloy, which is susceptible to fatigue failure. Characterizing fatigue performance for materials is extremely time and cost demanding, especially for high cycle data. To help mitigate this, a transfer learning based framework has been developed using Long short-term memory networks (LSTMs) in which a source LSTM model is trained based on pure axial fatigue data for Aluminum 7075-T6 alloy which is then transferred to predict high cycle torsional S-N curves. The framework was able to accurately predict Al torsional S-N curves for a much higher cycle range. It is the belief that this framework will help to drastically mitigate the cost of gathering fatigue characteristics for different materials and help prioritize tests with better cost and time constraints.