dfr
How Do LLM-Generated Texts Impact Term-Based Retrieval Models?
Huang, Wei, Bi, Keping, Cai, Yinqiong, Chen, Wei, Guo, Jiafeng, Cheng, Xueqi
As more content generated by large language models (LLMs) floods into the Internet, information retrieval (IR) systems now face the challenge of distinguishing and handling a blend of human-authored and machine-generated texts. Recent studies suggest that neural retrievers may exhibit a preferential inclination toward LLM-generated content, while classic term-based retrievers like BM25 tend to favor human-written documents. This paper investigates the influence of LLM-generated content on term-based retrieval models, which are valued for their efficiency and robust generalization across domains. Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes, higher term specificity, and greater document-level diversity. These traits are aligned with LLMs being trained to optimize reader experience through diverse and precise expressions. Our study further explores whether term-based retrieval models demonstrate source bias, concluding that these models prioritize documents whose term distributions closely correspond to those of the queries, rather than displaying an inherent source bias. This work provides a foundation for understanding and addressing potential biases in term-based IR systems managing mixed-source content.
Hardware-Friendly Delayed-Feedback Reservoir for Multivariate Time-Series Classification
Ikeda, Sosei, Awano, Hiromitsu, Sato, Takashi
Reservoir computing (RC) is attracting attention as a machine-learning technique for edge computing. In time-series classification tasks, the number of features obtained using a reservoir depends on the length of the input series. Therefore, the features must be converted to a constant-length intermediate representation (IR), such that they can be processed by an output layer. Existing conversion methods involve computationally expensive matrix inversion that significantly increases the circuit size and requires processing power when implemented in hardware. In this article, we propose a simple but effective IR, namely, dot-product-based reservoir representation (DPRR), for RC based on the dot product of data features. Additionally, we propose a hardware-friendly delayed-feedback reservoir (DFR) consisting of a nonlinear element and delayed feedback loop with DPRR. The proposed DFR successfully classified multivariate time series data that has been considered particularly difficult to implement efficiently in hardware. In contrast to conventional DFR models that require analog circuits, the proposed model can be implemented in a fully digital manner suitable for high-level syntheses. A comparison with existing machine-learning methods via field-programmable gate array implementation using 12 multivariate time-series classification tasks confirmed the superior accuracy and small circuit size of the proposed method.
Distributionally robust self-supervised learning for tabular data
Ghosh, Shantanu, Xie, Tiankang, Kuznetsov, Mikhail
Machine learning (ML) models trained using Empirical Risk Minimization (ERM) often exhibit systematic errors on specific subpopulations of tabular data, known as error slices. Learning robust representation in presence of error slices is challenging, especially in self-supervised settings during the feature reconstruction phase, due to high cardinality features and the complexity of constructing error sets. Traditional robust representation learning methods are largely focused on improving worst group performance in supervised setting in computer vision, leaving a gap in approaches tailored for tabular data. We address this gap by developing a framework to learn robust representation in tabular data during self-supervised pre-training. Our approach utilizes an encoder-decoder model trained with Masked Language Modeling (MLM) loss to learn robust latent representations. This paper applies the Just Train Twice (JTT) and Deep Feature Reweighting (DFR) methods during the pre-training phase for tabular data. These methods fine-tune the ERM pre-trained model by up-weighting error-prone samples or creating balanced datasets for specific categorical features. This results in specialized models for each feature, which are then used in an ensemble approach to enhance downstream classification performance. This methodology improves robustness across slices, thus enhancing overall generalization performance. Extensive experiments across various datasets demonstrate the efficacy of our approach. The code is available: \url{https://github.com/amazon-science/distributionally-robust-self-supervised-learning-for-tabular-data}.
ExMap: Leveraging Explainability Heatmaps for Unsupervised Group Robustness to Spurious Correlations
Chakraborty, Rwiddhi, Sletten, Adrian, Kampffmeyer, Michael
Group robustness strategies aim to mitigate learned biases in deep learning models that arise from spurious correlations present in their training datasets. However, most existing methods rely on the access to the label distribution of the groups, which is time-consuming and expensive to obtain. As a result, unsupervised group robustness strategies are sought. Based on the insight that a trained model's classification strategies can be inferred accurately based on explainability heatmaps, we introduce ExMap, an unsupervised two stage mechanism designed to enhance group robustness in traditional classifiers. ExMap utilizes a clustering module to infer pseudo-labels based on a model's explainability heatmaps, which are then used during training in lieu of actual labels. Our empirical studies validate the efficacy of ExMap - We demonstrate that it bridges the performance gap with its supervised counterparts and outperforms existing partially supervised and unsupervised methods. Additionally, ExMap can be seamlessly integrated with existing group robustness learning strategies. Finally, we demonstrate its potential in tackling the emerging issue of multiple shortcut mitigation\footnote{Code available at \url{https://github.com/rwchakra/exmap}}.
Is Last Layer Re-Training Truly Sufficient for Robustness to Spurious Correlations?
Le, Phuong Quynh, Schlรถtterer, Jรถrg, Seifert, Christin
Models trained with empirical risk minimization (ERM) are known to learn to rely on spurious features, i.e., their prediction is based on undesired auxiliary features which are strongly correlated with class labels but lack causal reasoning. This behavior particularly degrades accuracy in groups of samples of the correlated class that are missing the spurious feature or samples of the opposite class but with the spurious feature present. The recently proposed Deep Feature Reweighting (DFR) method improves accuracy of these worst groups. Based on the main argument that ERM mods can learn core features sufficiently well, DFR only needs to retrain the last layer of the classification model with a small group-balanced data set. In this work, we examine the applicability of DFR to realistic data in the medical domain. Furthermore, we investigate the reasoning behind the effectiveness of last-layer retraining and show that even though DFR has the potential to improve the accuracy of the worst group, it remains susceptible to spurious correlations.
Modular DFR: Digital Delayed Feedback Reservoir Model for Enhancing Design Flexibility
Ikeda, Sosei, Awano, Hiromitsu, Sato, Takashi
In RC, the reservoir weights are not altered and the weights of the output layer that follows the reservoir are the target of learning [13], which allows efficient training. RC is considered suitable for time series processing because of its recurrent structure, which reflects past inputs. Because the weights of a reservoir are fixed, it can be implemented in hardware utilizing various physical phenomena. A delayed feedback reservoir (DFR) [2] is a specific type of RC system. It is particularly suitable for hardware implementations because it can be compactly constructed with a single nonlinear element and a feedback loop [18]. Until now, hardware implementations of DFRs have been of two types: analog and digital [2]. In an analog implementation, only the nonlinear element of the reservoir or the nonlinear element and the feedback loop are implemented in an analog manner. However, the inputs and outputs are generally processed digitally, requiring a digital-toanalog converter (DAC) and an analog-to-digital converter (ADC). In addition, the time required for signal propagation through the feedback loop reduces the throughput.
Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations
Kirichenko, Polina, Izmailov, Pavel, Wilson, Andrew Gordon
Neural network classifiers can largely rely on simple spurious features, such as backgrounds, to make predictions. However, even in these cases, we show that they still often learn core features associated with the desired attributes of the data, contrary to recent findings. Inspired by this insight, we demonstrate that simple last layer retraining can match or outperform state-of-the-art approaches on spurious correlation benchmarks, but with profoundly lower complexity and computational expenses. Moreover, we show that last layer retraining on large ImageNet-trained models can also significantly reduce reliance on background and texture information, improving robustness to covariate shift, after only minutes of training on a single GPU.
Multi-Point Integrated Sensing and Communication: Fusion Model and Functionality Selection
Li, Guoliang, Wang, Shuai, Ye, Kejiang, Wen, Miaowen, Ng, Derrick Wing Kwan, Di Renzo, Marco
Integrated sensing and communication (ISAC) represents a paradigm shift, where previously competing wireless transmissions are jointly designed to operate in harmony via the shared use of the hardware platform for improving the spectral and energy efficiencies. However, due to adversarial factors such as fading and interference, ISAC may suffer from high sensing uncertainties. This paper presents a multi-point ISAC (MPISAC) system that fuses the outputs from multiple ISAC devices for achieving higher sensing performance by exploiting multi-view data redundancy. Furthermore, we propose to effectively explore the performance trade-off between sensing and communication via a functionality selection module that adaptively determines the working state (i.e., sensing or communication) of an ISAC device. The crux of our approach is to derive a fusion model that predicts the fusion accuracy via hypothesis testing and optimal voting analysis. Simulation results demonstrate the superiority of MPISAC over various benchmark schemes and show that the proposed approach can effectively span the trade-off region in ISAC systems.
Various Machine learning methods in predicting rainfall - Tutors India Blog
The term machine learning (ML) stands for "making it easier for machines," i.e., reviewing data without having to programme them explicitly. The major aspect of the machine learning process is performance evaluation. Four commonly used machine learning algorithms (BK1) are Supervised, semi-supervised, unsupervised and reinforcement learning methods. The variation between supervised and unsupervised learning is that supervised learning already has the expert knowledge to developed the input/output [2]. On the other hand, unsupervised learning takes only the input and uses it for data distribution or learn the hidden structure to produce the output as a cluster or feature [3].