Data Science
RelDenClu: A Relative Density based Biclustering Method for identifying non-linear feature relations
Jain, Namita, Ghosh, Susmita, Murthy, C. A.
Existing biclustering algorithms for finding feature-relation-based biclusters often depend on assumptions such as monotonicity or linearity. Though a few algorithms overcome this problem by using density-based methods, they tend to miss many biclusters because they use global criteria for identifying dense regions. The proposed method, RelDenClu, uses the local variations in marginal and joint densities for each pair of features to find the subset of observations that forms the basis of the relation between them. It then finds the set of features connected by a common set of observations, resulting in a bicluster. To show the effectiveness of the proposed methodology, experimentation has been carried out on fifteen types of simulated datasets. Further, it has been applied to six real-life datasets. For three of these real-life datasets, the proposed method is used for unsupervised learning, while for the other three it is used as an aid to supervised learning. For all the datasets, the performance of the proposed method is compared with that of seven different state-of-the-art algorithms, and the proposed algorithm is seen to produce better results. The efficacy of the proposed algorithm is also seen in its use on a COVID-19 dataset for identifying some features (genetic, demographic and others) that are likely to affect the spread of COVID-19.
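The idea of comparing joint and marginal densities for a pair of features can be illustrated with a small sketch. This is not the RelDenClu algorithm itself: the function name `dense_observations`, the equal-width grid, and the `bins` and `ratio` parameters are all assumptions made for illustration. The sketch keeps the observations that fall in grid cells where the joint count exceeds what the two marginal counts would predict if the features were independent.

```python
from collections import Counter


def dense_observations(x, y, bins=4, ratio=1.5):
    """Indices of points in cells whose joint count exceeds `ratio` times
    the count expected from the marginals under independence.
    Hypothetical helper, not the published method."""
    n = len(x)

    def to_bin(v, lo, hi):
        # Equal-width binning; the maximum value goes into the last bin.
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    bx = [to_bin(v, min(x), max(x)) for v in x]
    by = [to_bin(v, min(y), max(y)) for v in y]
    joint = Counter(zip(bx, by))          # joint counts per 2-D cell
    mx, my = Counter(bx), Counter(by)     # marginal counts per 1-D bin

    selected = []
    for i, (cx, cy) in enumerate(zip(bx, by)):
        expected = mx[cx] * my[cy] / n    # count expected under independence
        if joint[(cx, cy)] > ratio * expected:
            selected.append(i)
    return selected


# Points on a diagonal are strongly related, so every point is kept.
related = dense_observations([0, 1, 2, 3], [0, 1, 2, 3])

# A full 4 x 4 grid has no relation between x and y, so nothing is kept.
grid_x = [i for i in range(4) for _ in range(4)]
grid_y = [j for _ in range(4) for j in range(4)]
unrelated = dense_observations(grid_x, grid_y)
```

Because the comparison is made cell by cell against the local marginals, a region can be selected even when it is sparse in absolute terms, which is the contrast with a single global density threshold that the abstract draws.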
GroundHog: Revolutionizing GLDAS Groundwater Storage Downscaling for Enhanced Recharge Estimation in Bangladesh
Ahmed, Saleh Sakib, Zzaman, Rashed Uz, Jony, Saifur Rahman, Himel, Faizur Rahman, Sharmin, Afroza, Rahman, A. H. M. Khalequr, Rahman, M. Sohel, Nowreen, Sara
Long-term groundwater level (GWL) measurement is vital for effective policymaking and recharge estimation using annual maxima and minima. However, current methods prioritize short-term predictions and lack multi-year applicability, limiting their utility. Moreover, sparse in-situ measurements lead to reliance on low-resolution satellite data like GLDAS as the ground truth for Machine Learning models, further constraining accuracy. To overcome these challenges, we first develop an ML model to mitigate data gaps, achieving $R^2$ scores of 0.855 and 0.963 for maximum and minimum GWL predictions, respectively. Subsequently, using these predictions and well observations as ground truth, we train an Upsampling Model that uses low-resolution (25 km) GLDAS data as input to produce high-resolution (2 km) GWLs, achieving an excellent $R^2$ score of 0.96. Our approach successfully upscales GLDAS data for 2003-2024, allowing high-resolution recharge estimations and revealing critical trends for proactive resource management. Our method allows upsampling of groundwater storage (GWS) from GLDAS to high-resolution GWLs at any point, independently of officially curated piezometer data, making it a valuable tool for decision-making.
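The $R^2$ scores quoted above are coefficients of determination. As a reminder of what that number measures, here is a minimal sketch of the standard definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name and the toy values are illustrative assumptions and are not taken from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Assumes y_true is not constant (otherwise SS_tot is zero)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values only: a perfect prediction scores 1.0, and always
# predicting the mean of the observations scores 0.0.
perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A score of 0.96 therefore means the residual error is about 4% of the variance in the reference values.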
TS: A Unified Multi-Task Time Series Model
Although pre-trained transformers and reprogrammed text-based LLMs have shown strong performance on time series tasks, the best-performing architectures vary widely across tasks, with most models narrowly focused on specific areas, such as time series forecasting. Unifying predictive and generative time series tasks within a single model remains challenging.
On the Powerfulness of Textual Outlier Exposure for Visual OoD Detection
Successful detection of Out-of-Distribution (OoD) data is becoming increasingly important to ensure safe deployment of neural networks. One of the main challenges in OoD detection is that neural networks output overconfident predictions on OoD data, making it difficult to determine the OoD-ness of data solely based on their predictions. Outlier exposure addresses this issue by introducing an additional loss that encourages low-confidence predictions on OoD data during training. While outlier exposure has shown promising potential in improving OoD detection performance, all previous studies on outlier exposure have been limited to utilizing visual outliers.
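The additional loss described above can be sketched as follows. This assumes the commonly used formulation of outlier exposure, in which the extra term is the cross-entropy between the model's prediction on an outlier and the uniform distribution over classes; the abstract does not state which variant the paper uses, and the function names and the `weight` parameter are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def outlier_exposure_loss(in_logits, in_labels, out_logits, weight=0.5):
    """Classification loss on in-distribution data plus a weighted term
    that pushes predictions on outliers toward the uniform distribution."""
    # Standard cross-entropy on labelled in-distribution samples.
    ce = sum(-log_softmax(z)[y] for z, y in zip(in_logits, in_labels))
    ce /= len(in_logits)
    # Cross-entropy to uniform on outliers: -(1/K) * sum_k log p_k.
    uniform = sum(-sum(log_softmax(z)) / len(z) for z in out_logits)
    uniform /= len(out_logits)
    return ce + weight * uniform


# An outlier with a flat prediction is penalised less than one
# on which the model is confident.
flat = outlier_exposure_loss([[0.0, 0.0, 0.0]], [0], [[0.0, 0.0, 0.0]])
confident = outlier_exposure_loss([[0.0, 0.0, 0.0]], [0], [[10.0, 0.0, 0.0]])
```

The outlier term is smallest when the predicted distribution is uniform, so minimising it lowers the model's confidence on outliers without needing labels for them.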
Achievable Fairness on Your Data With Utility Guarantees
In machine learning fairness, training models that minimize disparity across different sensitive groups often leads to diminished accuracy, a phenomenon known as the fairness-accuracy trade-off. The severity of this trade-off inherently depends on dataset characteristics such as dataset imbalances or biases; therefore, using a uniform fairness requirement across diverse datasets remains questionable. To address this, we present a computationally efficient approach to approximate the fairness-accuracy trade-off curve tailored to individual datasets, backed by rigorous statistical guarantees. By utilizing the You-Only-Train-Once (YOTO) framework, our approach mitigates the computational burden of having to train multiple models when approximating the trade-off curve. Crucially, we introduce a novel methodology for quantifying uncertainty in our estimates, thereby providing practitioners with a robust framework for auditing model fairness while avoiding false conclusions due to estimation errors. Our experiments spanning tabular (e.g., Adult), image (CelebA), and language (Jigsaw) datasets underscore that our approach not only reliably quantifies the optimum achievable trade-offs across various data modalities but also helps detect suboptimality in SOTA fairness methods.
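A point on a fairness-accuracy trade-off curve is a pair of numbers: an accuracy and a disparity. The sketch below computes one such pair for a set of predictions, using the demographic parity gap as the disparity measure. That choice of metric, the function names, and the toy data are assumptions for illustration; the abstract does not say which fairness measures the paper uses, and this sketch does not implement YOTO or the uncertainty estimates.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rate between
    the two sensitive groups, encoded as 0 and 1."""
    rates = []
    for g in (0, 1):
        preds = [p for p, s in zip(y_pred, group) if s == g]
        rates.append(sum(preds) / len(preds))
    return abs(rates[0] - rates[1])


# Toy data in which the label is perfectly correlated with the group,
# so an accurate classifier is necessarily unequal across groups.
y_true = [1, 1, 0, 0]
group = [0, 0, 1, 1]

accurate = [1, 1, 0, 0]   # matches the labels exactly
balanced = [1, 0, 1, 0]   # same positive rate in both groups

point_a = (accuracy(y_true, accurate), demographic_parity_gap(accurate, group))
point_b = (accuracy(y_true, balanced), demographic_parity_gap(balanced, group))
```

On this toy data the two candidates sit at opposite ends of the curve, which illustrates why the achievable trade-off depends on the dataset rather than on the method alone.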
case, please provide a description
This document is based on Datasheets for Datasets. Please see the most updated version here. For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Link prediction on this dataset is a multi-instance prediction task [3]. How many instances are there in total (of each type, ... and edges)? The instances of this graph-based dataset comprise ...
Diverse Community Data for Benchmarking Data Privacy Algorithms
The Collaborative Research Cycle (CRC) is a National Institute of Standards and Technology (NIST) benchmarking program intended to strengthen understanding of tabular data deidentification technologies. Deidentification algorithms are vulnerable to the same bias and privacy issues that impact other data analytics and machine learning applications, and they can even amplify those issues by contaminating downstream applications. This paper summarizes four CRC contributions: theoretical work on the relationship between diverse populations and challenges for equitable deidentification; public benchmark data focused on diverse populations and challenging features; a comprehensive open source suite of evaluation metrology for deidentified datasets; and an archive of more than 450 deidentified data samples from a broad range of techniques. The initial set of evaluation results demonstrates the value of the CRC tools for investigations in this field.