Inductive Learning
Sentiment Classification of Thai Central Bank Press Releases Using Supervised Learning
Central bank communication plays a critical role in shaping economic expectations and monetary policy effectiveness. This study applies supervised machine learning techniques to classify the sentiment of press releases from the Bank of Thailand, addressing gaps in research that primarily focus on lexicon-based approaches. My findings show that supervised learning can be an effective method, even with smaller datasets, and serves as a starting point for further automation. However, achieving higher accuracy and better generalization requires a substantial amount of labeled data, which is time-consuming and demands expertise. Using models such as Na\"ive Bayes, Random Forest and SVM, this study demonstrates the applicability of machine learning for central bank sentiment analysis, with English-language communications from the Thai Central Bank as a case study.
Graph-Based Uncertainty-Aware Self-Training with Stochastic Node Labeling
Self-training has become a popular semi-supervised learning technique for leveraging unlabeled data. However, the over-confidence of pseudo-labels remains a key challenge. In this paper, we propose a novel \emph{graph-based uncertainty-aware self-training} (GUST) framework to combat over-confidence in node classification. Drawing inspiration from the uncertainty integration idea introduced by Wang \emph{et al.}~\cite{wang2024uncertainty}, our method largely diverges from previous self-training approaches by focusing on \emph{stochastic node labeling} grounded in the graph topology. Specifically, we deploy a Bayesian-inspired module to estimate node-level uncertainty, incorporate these estimates into the pseudo-label generation process via an expectation-maximization (EM)-like step, and iteratively update both node embeddings and adjacency-based transformations. Experimental results on several benchmark graph datasets demonstrate that our GUST framework achieves state-of-the-art performance, especially in settings where labeled data is extremely sparse.
ProtoBERT-LoRA: Parameter-Efficient Prototypical Finetuning for Immunotherapy Study Identification
Zhang, Shijia, Ding, Xiyu, Ding, Kai, Zhang, Jacob, Galinsky, Kevin, Wang, Mengrui, Mayers, Ryan P., Wang, Zheyu, Kharrazi, Hadi
Identifying immune checkpoint inhibitor (ICI) studies in genomic repositories like Gene Expression Omnibus (GEO) is vital for cancer research yet remains challenging due to semantic ambiguity, extreme class imbalance, and limited labeled data in low-resource settings. We present ProtoBERT-LoRA, a hybrid framework that combines PubMedBERT with prototypical networks and Low-Rank Adaptation (LoRA) for efficient fine-tuning. The model enforces class-separable embeddings via episodic prototype training while preserving biomedical domain knowledge. Our dataset was divided as: Training (20 positive, 20 negative), Prototype Set (10 positive, 10 negative), Validation (20 positive, 200 negative), and Test (71 positive, 765 negative). Evaluated on test dataset, ProtoBERT-LoRA achieved F1-score of 0.624 (precision: 0.481, recall: 0.887), outperforming the rule-based system, machine learning baselines and finetuned PubMedBERT. Application to 44,287 unlabeled studies reduced manual review efforts by 82%. Ablation studies confirmed that combining prototypes with LoRA improved performance by 29% over stand-alone LoRA.
Interpretable Feature Interaction via Statistical Self-supervised Learning on Tabular Data
In high-dimensional and high-stakes contexts, ensuring both rigorous statistical guarantees and interpretability in feature extraction from complex tabular data remains a formidable challenge. Traditional methods such as Principal Component Analysis (PCA) reduce dimensionality and identify key features that explain the most variance, but are constrained by their reliance on linear assumptions. In contrast, neural networks offer assumption-free feature extraction through self-supervised learning techniques such as autoencoders, though their interpretability remains a challenge in fields requiring transparency. To address this gap, this paper introduces Spofe, a novel self-supervised machine learning pipeline that marries the power of kernel principal components for capturing nonlinear dependencies with a sparse and principled polynomial representation to achieve clear interpretability with statistical rigor. Underpinning our approach is a robust theoretical framework that delivers precise error bounds and rigorous false discovery rate (FDR) control via a multi-objective knockoff selection procedure; it effectively bridges the gap between data-driven complexity and statistical reliability via three stages: (1) generating self-supervised signals using kernel principal components to model complex patterns, (2) distilling these signals into sparse polynomial functions for improved interpretability, and (3) applying a multi-objective knockoff selection procedure with significance testing to rigorously identify important features. Extensive experiments on diverse real-world datasets demonstrate the effectiveness of Spofe, consistently surpassing KPCA, SKPCA, and other methods in feature selection for regression and classification tasks. Visualization and case studies highlight its ability to uncover key insights, enhancing interpretability and practical utility.
Meta-Representational Predictive Coding: Biomimetic Self-Supervised Learning
Ororbia, Alexander, Friston, Karl, Rao, Rajesh P. N.
Self-supervised learning has become an increasingly important paradigm in the domain of machine intelligence. Furthermore, evidence for self-supervised adaptation, such as contrastive formulations, has emerged in recent computational neuroscience and brain-inspired research. Nevertheless, current work on self-supervised learning relies on biologically implausible credit assignment -- in the form of backpropagation of errors -- and feedforward inference, typically a forward-locked pass. Predictive coding, in its mechanistic form, offers a biologically plausible means to sidestep these backprop-specific limitations. However, unsupervised predictive coding rests on learning a generative model of raw pixel input (akin to ``generative AI'' approaches), which entails predicting a potentially high dimensional input; on the other hand, supervised predictive coding, which learns a mapping between inputs to target labels, requires human annotation, and thus incurs the drawbacks of supervised learning. In this work, we present a scheme for self-supervised learning within a neurobiologically plausible framework that appeals to the free energy principle, constructing a new form of predictive coding that we call meta-representational predictive coding (MPC). MPC sidesteps the need for learning a generative model of sensory input (e.g., pixel-level features) by learning to predict representations of sensory input across parallel streams, resulting in an encoder-only learning and inference scheme. This formulation rests on active inference (in the form of sensory glimpsing) to drive the learning of representations, i.e., the representational dynamics are driven by sequences of decisions made by the model to sample informative portions of its sensorium.
Understanding Social Support Needs in Questions: A Hybrid Approach Integrating Semi-Supervised Learning and LLM-based Data Augmentation
Kuang, Junwei, Yang, Liang, Cui, Shaoze, Fan, Weiguo
Patients are increasingly turning to online health Q&A communities for social support to improve their well-being. However, when this support received does not align with their specific needs, it may prove ineffective or even detrimental. This necessitates a model capable of identifying the social support needs in questions. However, training such a model is challenging due to the scarcity and class imbalance issues of labeled data. To overcome these challenges, we follow the computational design science paradigm to develop a novel framework, Hybrid Approach for SOcial Support need classification (HA-SOS). HA-SOS integrates an answer-enhanced semi-supervised learning approach, a text data augmentation technique leveraging large language models (LLMs) with reliability- and diversity-aware sample selection mechanism, and a unified training process to automatically label social support needs in questions. Extensive empirical evaluations demonstrate that HA-SOS significantly outperforms existing question classification models and alternative semi-supervised learning approaches. This research contributes to the literature on social support, question classification, semi-supervised learning, and text data augmentation. In practice, our HA-SOS framework facilitates online Q&A platform managers and answerers to better understand users' social support needs, enabling them to provide timely, personalized answers and interventions.
Interpretable Machine Learning for Oral Lesion Diagnosis through Prototypical Instances Identification
Cascione, Alessio, Setzu, Mattia, Galatolo, Federico A., Cimino, Mario G. C. A., Guidotti, Riccardo
Decision-making processes in healthcare can be highly complex and challenging. Machine Learning tools offer significant potential to assist in these processes. However, many current methodologies rely on complex models that are not easily interpretable by experts. This underscores the need to develop interpretable models that can provide meaningful support in clinical decision-making. When approaching such tasks, humans typically compare the situation at hand to a few key examples and representative cases imprinted in their memory. Using an approach which selects such exemplary cases and grounds its predictions on them could contribute to obtaining high-performing interpretable solutions to such problems. To this end, we evaluate PivotTree, an interpretable prototype selection model, on an oral lesion detection problem, specifically trying to detect the presence of neoplastic, aphthous and traumatic ulcerated lesions from oral cavity images. We demonstrate the efficacy of using such method in terms of performance and offer a qualitative and quantitative comparison between exemplary cases and ground-truth prototypes selected by experts.
Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs
Roux, Nicolas Le, Bellemare, Marc G., Lebensold, Jonathan, Bergeron, Arnaud, Greaves, Joshua, Fréchette, Alex, Pelletier, Carolyne, Thibodeau-Laufer, Eric, Toth, Sándor, Work, Sam
We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an asymmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without the use of KL regularization. TOPR can be applied in a fully offline fashion, allows the handling of positive and negative examples in a unified framework, and benefits from the implementational simplicity that is typical of Monte Carlo algorithms. We demonstrate the effectiveness of our approach with a series of experiments on the GSM8K and MATH reasoning benchmarks, finding performance gains for training both a model for solution generation and as a generative verifier. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all the while avoiding the ``wasted inference'' that comes with discarding negative examples. We find that this advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, enabling us to match 70B-parameter model performance with 8B language models. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.
Deep Contrastive Unlearning for Language Models
He, Estrid, Sarwar, Tabinda, Khalil, Ibrahim, Yi, Xun, Wang, Ke
X, XX 2025 1 Deep Contrastive Unlearning for Language Models Estrid He, Tabinda Sarwar, Ibrahim Khalil, Xun Yi, and Ke Wang Abstract --The past a few years have witnessed the great success of large language models, demonstrating powerful capabilities in comprehending textual data and generating humanlike languages. Thus, to safeguard individuals' "right to be forgotten", there has been increasing interests in machine unlearning - the process of removing information carried by particular training samples from a model while not deteriorating its predictive quality. This is a challenging task due to the black-box nature of language models. Most existing studies focus on mitigating the impact of those forgot samples upon a model's outputs, and do not explicitly consider the geometric distributions of samples in the latent space of a model. T o address this issue, we propose a machine unlearning framework, named Deep C ontrastive U nlearning for fine-T uning (DeepCUT) language models. Our proposed model achieves machine unlearning by directly optimizing the latent space of a model. Comprehensive experiments on real-world datasets demonstrate the effectiveness and efficiency of DeepCUT with consistent and significant improvement over baseline methods. I NTRODUCTION I N the existing digital era, the availability of user-contributed data has increased exponentially. The rich and diverse data has been the engine of the significant advancements in the development of natural language processing (NLP) models. In the past a few years, the introduction of Transformer architecture [1] has revolutionized NLP, enabling language models such as BERT [2], RoBERTa [3].
EEG-CLIP : Learning EEG representations from natural language descriptions
N'dir, Tidiane Camaret, Schirrmeister, Robin Tibor
Deep networks for electroencephalogram (EEG) decoding are currently often trained to only solve a specific task like pathology or gender decoding. A more general approach leveraging the medical reports of clinical EEG recordings is to learn mappings between medical reports and EEG recordings. This approach was pioneered in the computer vision domain matching images and their text captions and subsequently allowed to do successful zero-shot decoding using textual class prompts. In this work, we follow this approach and develop a contrastive learning framework EEG-CLIP that aligns EEG time series and their corresponding clinical text descriptions in a shared embedding space. We investigate its potential for versatile EEG decoding, assessing performance on a range of few-shot and zero-shot settings. Overall, results show that EEG-CLIP manages to nontrivially align text and EEG representations. Our work presents a promising approach to learn general EEG representations, which could enable easier analyses of diverse decoding questions through zero shot decoding or training task-specific models from fewer training examples.