Oceania
A Survey on Archetypal Analysis
Alcacer, Aleix, Epifanio, Irene, Mair, Sebastian, Mørup, Morten
Archetypal analysis (AA) was originally proposed in 1994 by Adele Cutler and Leo Breiman as a computational procedure to extract distinct aspects, called archetypes, from observations, with each observational record approximated as a mixture (i.e., convex combination) of these archetypes. AA thereby provides straightforward, interpretable, and explainable representations for feature extraction and dimensionality reduction, facilitating the understanding of the structure of high-dimensional data with wide applications throughout the sciences. However, AA also faces challenges, particularly as the associated optimization problem is non-convex. This survey provides researchers and data mining practitioners with an overview of the methodologies and opportunities that AA has to offer, surveying the many applications of AA across disparate fields of science, as well as best practices for modeling data using AA and its limitations. The survey concludes by explaining important future research directions concerning AA.
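For readers unfamiliar with the "mixture of archetypes" idea, the classical AA objective can be written as follows. This is a sketch in commonly used notation, not taken from the abstract: the symbols X (an n-by-d data matrix), A (n-by-k mixing coefficients), B (k-by-n archetype coefficients), and k (the number of archetypes) are assumptions introduced here for illustration.

```latex
\min_{A,\,B}\ \lVert X - A B X \rVert_F^2
\quad \text{s.t.} \quad
A \in \mathbb{R}_{\ge 0}^{n \times k},\ A \mathbf{1}_k = \mathbf{1}_n,
\qquad
B \in \mathbb{R}_{\ge 0}^{k \times n},\ B \mathbf{1}_n = \mathbf{1}_k
```

Under this formulation the archetypes are the rows of Z = BX, so each archetype is a convex combination of observations and each observation is approximated by a convex combination of archetypes. The objective is convex in A for fixed B and in B for fixed A, but not jointly, which is one way to see the non-convexity the abstract mentions.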
Differentially Private Selection using Smooth Sensitivity
Chaves, Iago, Farias, Victor, Perez, Amanda, Mesquita, Diego, Machado, Javam
Differentially private selection mechanisms offer strong privacy guarantees for queries aiming to identify the top-scoring element r from a finite set R, based on a dataset-dependent utility function. While selection queries are fundamental in data science, few mechanisms effectively ensure their privacy. Furthermore, most approaches rely on global sensitivity to achieve differential privacy (DP), which can introduce excessive noise and impair downstream inferences. To address this limitation, we propose the Smooth Noisy Max (SNM) mechanism, which leverages smooth sensitivity to yield provably tighter (upper bounds on) expected errors compared to global sensitivity-based methods. Empirical results demonstrate that SNM is more accurate than state-of-the-art differentially private selection methods in three applications: percentile selection, greedy decision trees, and random forests.
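To make the global-sensitivity baseline concrete, the following is a minimal sketch of a report-noisy-max selection with Laplace noise. It is not the SNM mechanism from the abstract, whose smooth-sensitivity calibration is not described here. The function names, the conservative noise scale of 2 * sensitivity / epsilon, and the example scores are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def report_noisy_max(scores, epsilon, global_sensitivity, rng=None):
    """Return the index of the highest score after adding Laplace noise.

    Assumption: noise scale 2 * global_sensitivity / epsilon, a conservative
    calibration for a general utility function. This is the global-sensitivity
    baseline, not the smooth-sensitivity mechanism proposed in the paper.
    """
    rng = rng or random.Random()
    scale = 2.0 * global_sensitivity / epsilon
    noisy = [s + laplace_noise(scale, rng) for s in scores]
    # Only the winning index is released, never the noisy scores themselves.
    return max(range(len(noisy)), key=noisy.__getitem__)


if __name__ == "__main__":
    utilities = [10.0, 12.0, 11.5]  # hypothetical utility of each candidate
    print(report_noisy_max(utilities, epsilon=1.0, global_sensitivity=1.0))
```

Because the noise scale grows with the global sensitivity, a utility function with a large worst-case sensitivity can swamp the true score gaps, which is the limitation the abstract attributes to global-sensitivity approaches.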
Preference-based Learning with Retrieval Augmented Generation for Conversational Question Answering
Kaiser, Magdalena, Weikum, Gerhard
Conversational Question Answering (ConvQA) involves multiple subtasks: i) understanding incomplete questions in their context, ii) retrieving relevant information, and iii) generating answers. This work presents PRAISE, a pipeline-based approach for ConvQA that trains LLM adapters for each of the three subtasks. As labeled training data for individual subtasks is unavailable in practice, PRAISE learns from its own generations, using the final answering performance as a feedback signal without human intervention, and treats intermediate information, like relevant evidence, as weakly labeled data. We apply Direct Preference Optimization by contrasting successful and unsuccessful samples for each subtask. In our experiments, we show the effectiveness of this training paradigm: PRAISE shows improvements per subtask and achieves new state-of-the-art performance on a popular ConvQA benchmark, gaining 15.5 percentage points in precision over baselines.
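The contrast between successful and unsuccessful samples can be illustrated with the standard Direct Preference Optimization loss for a single preference pair. This is a sketch of the generic DPO objective only; how PRAISE builds its pairs, and which hyperparameters it uses, is not stated in the abstract. The function name, argument names, and the default `beta=0.1` are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy (logp_*)
    and under a frozen reference model (ref_logp_*). All names are illustrative.
    """
    # Scaled difference of the policy-vs-reference log-ratios.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical log-probabilities: the policy already prefers the chosen sample.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss shrinks as the policy assigns relatively more probability to the successful sample than the reference model does, and equals log 2 when policy and reference agree.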
Langformers: Unified NLP Pipelines for Language Models
Lamsal, Rabindra, Read, Maria Rodriguez, Karunasekera, Shanika
Transformer-based language models have revolutionized the field of natural language processing (NLP). However, using these models often involves navigating multiple frameworks and tools, as well as writing repetitive boilerplate code. This complexity can discourage non-programmers and beginners, and even slow down prototyping for experienced developers. To address these challenges, we introduce Langformers, an open-source Python library designed to streamline NLP pipelines through a unified, factory-based interface for large language model (LLM) and masked language model (MLM) tasks. Langformers integrates conversational AI, MLM pretraining, text classification, sentence embedding/reranking, data labelling, semantic search, and knowledge distillation into a cohesive API, supporting popular platforms such as Hugging Face and Ollama. Key innovations include: (1) task-specific factories that abstract training, inference, and deployment complexities; (2) built-in memory and streaming for conversational agents; and (3) lightweight, modular design that prioritizes ease of use. Documentation: https://langformers.com
Personalizing Federated Learning for Hierarchical Edge Networks with Non-IID Data
Lee, Seunghyun, Tavallaie, Omid, Chen, Shuaijun, Thilakarathna, Kanchana, Seneviratne, Suranga, Toosi, Adel Nadjaran, Zomaya, Albert Y.
Affiliations: School of Computer Science, The University of Sydney, Australia; Department of Engineering Science, University of Oxford, United Kingdom; School of Computing and Information Systems, The University of Melbourne, Australia.
Abstract: Accommodating edge networks between IoT devices and the cloud server in Hierarchical Federated Learning (HFL) enhances communication efficiency without compromising data privacy. However, devices connected to the same edge often share geographic or contextual similarities, leading to varying edge-level data heterogeneity with different subsets of labels per edge, on top of device-level heterogeneity. This hierarchical non-Independent and Identically Distributed (non-IID) nature, which implies that each edge has its own optimization goal, has been overlooked in HFL research. Therefore, existing edge-accommodated HFL demonstrates inconsistent performance across edges in various hierarchical non-IID scenarios. To ensure robust performance with diverse edge-level non-IID data, we propose a Personalized Hierarchical Edge-enabled Federated Learning (PHE-FL), which personalizes each edge model to perform well on the unique class distributions specific to each edge. We evaluated PHE-FL across 4 scenarios with varying levels of edge-level non-IIDness, with extreme IoT device-level non-IIDness. To accurately assess the effectiveness of our personalization approach, we deployed test sets on each edge server instead of the cloud server, and used both balanced and imbalanced test sets. Extensive experiments show that PHE-FL achieves up to 83% higher accuracy compared to existing federated learning approaches that incorporate edge networks, given the same number of training rounds. Moreover, PHE-FL exhibits improved stability, as evidenced by reduced accuracy fluctuations relative to the state-of-the-art FedAvg with two-level (edge and cloud) aggregation.
Introduction: Federated Learning (FL) is an emerging Machine Learning (ML) framework that achieves high accuracy without requiring the sharing of local data with a centralized server. A two-level FL aggregation framework, involving IoT devices and a central cloud server, was first proposed as the FederatedAveraging (FedAvg) algorithm [1]. In FedAvg, IoT devices train models individually and then transmit the model weights to the cloud server. The server then averages these weights to create an aggregated global model that performs well and can therefore be deployed across all participating devices.
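The server-side averaging step described above can be sketched in a few lines. This is a minimal illustration, assuming each client's model is flattened into a list of floats and that the average is weighted by local sample count, as in the usual FedAvg formulation. The function name `fedavg` and the example numbers are hypothetical, and the edge-level personalization of PHE-FL is not shown.

```python
def fedavg(client_weights, client_sizes):
    """Average client model weights, weighting each client by its sample count.

    client_weights: list of equally long lists of floats (one per client).
    client_sizes:   number of local training samples per client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


if __name__ == "__main__":
    # Two hypothetical clients with 1 and 3 local samples.
    print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In a hierarchical setting the same routine could in principle be applied twice, once at each edge server over its devices and once at the cloud over the edge models, though the exact aggregation used by the paper is not given in this excerpt.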
Diachronic and synchronic variation in the performance of adaptive machine learning systems: The ethical challenges
Hatherley, Joshua, Sparrow, Robert
Leveraging this 'adaptive' potential of medical ML could generate significant benefits for patient health and well-being. Recent engagements with the ethical issues generated by the use of adaptive ML systems in medicine have typically been limited to discussions of 'the update problem': how should systems that continue to change and evolve post-regulatory approval be regulated? In this paper, we draw attention to an important set of ethical issues raised by the use of adaptive machine learning systems in medicine that have, thus far, been neglected and are highly deserving of further attention. Discussions of adaptive machine learning systems to date have overlooked the distinction between two sorts of variance that such systems may exhibit -- diachronic evolution (change over time) and synchronic variation (difference between cotemporaneous instantiations of the algorithmic system at different sites) -- and underestimated the significance of the latter. Both diachronic evolution and synchronic variation will complicate the hermeneutic task of clinicians in interpreting the outputs of AI systems, and will therefore pose significant challenges to the process of securing informed consent to treatment. Equity issues may occur where synchronic variation is permitted, as the quality of care may vary significantly across patients or between hospitals. However, the decision as to whether to allow or eliminate synchronic variation involves complex trade-offs between accuracy and generalisability, as well as a number of other values, including justice and non-maleficence. In some contexts, preventing synchronic variation from emerging may only be possible at the expense of the wellbeing, and the quality of care available to, particular patients or classes of patients. Designers and regulators of adaptive ML systems will need to confront these issues if the potential benefits of adaptive ML in medical care are to be realised.
Generative AI in Collaborative Academic Report Writing: Advantages, Disadvantages, and Ethical Considerations
Sadeghpour, Mahshid, Arakala, Arathi, Rao, Asha
The availability and abundance of GenAI tools to perform tasks traditionally managed by people have raised concerns, particularly within the education and academic sectors, as some students may rely heavily on these tools to complete assignments designed to enable learning. This article focuses on informing students about the significance of investing their time during their studies in developing essential life-long learning skills using their own critical thinking, rather than depending on AI models that are susceptible to misinformation, hallucination, and bias. As we transition to an AI-centric era, it is important to educate students on how these models work, their pitfalls, and the ethical concerns associated with feeding data to such tools.
Keywords: GenAI in Academic Writing, GenAI's Ethics, GenAI's Privacy Concerns.
1 Introduction. Writing academic reports and papers has been instrumental in assisting students and researchers in shaping their ideas, organising their methods, and practising their communication skills, particularly when this process is combined with receiving constant feedback from experts. With the launch of OpenAI's first publicly available Large Language Model, namely ChatGPT (GPT-3.5), significant concern arose within the academic and research community about the reliability of academic and research output. Evidence suggests that as individuals began discovering the availability and efficiency of Generative Artificial Intelligence tools in late 2022, there was a significant surge in retracted research articles, resulting in more than 10,000 retracted papers [1]. The over-reliance of individuals on various Generative Artificial Intelligence (GenAI) tools for completing tasks that require a human's critical thinking has raised concerns.
Exploring Gradient-Guided Masked Language Model to Detect Textual Adversarial Attacks
Zhang, Xiaomei, Zhang, Zhaoxi, Zhang, Yanjun, Zheng, Xufei, Zhang, Leo Yu, Hu, Shengshan, Pan, Shirui
Textual adversarial examples pose serious threats to the reliability of natural language processing systems. Recent studies suggest that adversarial examples tend to deviate from the underlying manifold of normal texts, whereas pre-trained masked language models can approximate the manifold of normal data. These findings inspire the exploration of masked language models for detecting textual adversarial attacks. We first introduce Masked Language Model-based Detection (MLMD), leveraging the mask and unmask operations of the masked language modeling (MLM) objective to induce the difference in manifold changes between normal and adversarial texts. Although MLMD achieves competitive detection performance, its exhaustive one-by-one masking strategy introduces significant computational overhead. Our posterior analysis reveals that a significant number of non-keywords in the input are not important for detection but consume resources. Building on this, we introduce Gradient-guided MLMD (GradMLMD), which leverages gradient information to identify and skip non-keywords during detection, significantly reducing resource consumption without compromising detection performance. Extensive experiments show that GradMLMD maintains comparable or better performance than MLMD and outperforms existing detectors. Among defenses based on the off-manifold conjecture, GradMLMD presents a novel method for capturing manifold changes and provides a practical solution for real-world application challenges.
Index Terms: NLP, adversarial attack, adversarial defense, masked language model.
Although advanced deep neural networks have the potential to revolutionize the performance of myriad natural language processing (NLP) tasks [1-3], they are highly vulnerable to adversarial attacks [4-7]. Through carefully manipulated inputs, attackers can drive models to produce erroneous outputs to their advantage.
Many researchers have focused on introducing adversarial perturbations into the input by altering entire sentences. However, predominant efforts have been made to develop attacks at the word-level and character-level [8-14].
Correspondence to Dr. L. Zhang and Prof. X. Zheng. Xiaomei Zhang, Leo Yu Zhang and Shirui Pan are with the School of Information and Communication Technology, Griffith University, Queensland, Australia (e-mail: xiaomei.zhang@griffithuni.edu.au). Zhaoxi Zhang and Yanjun Zhang are with the School of Computer Science, University of Technology Sydney, Sydney, New South Wales, Australia (e-mail: Zhaoxi.Zhang-1@student.uts.edu.au). Xufei Zheng is with the College of Computer and Information Science, Southwest University, Chongqing, China (e-mail: zxufei@swu.edu.cn).
AdaptRec: A Self-Adaptive Framework for Sequential Recommendations with Large Language Models
The recent advancements in Large Language Models (LLMs) have generated considerable interest in their utilization for sequential recommendation tasks. While collaborative signals from similar users are central to recommendation modeling, effectively transforming these signals into a format that LLMs can understand and utilize remains challenging. The critical challenges include selecting relevant demonstrations from large-scale user interactions and ensuring their alignment with LLMs' reasoning process. To address these challenges, we introduce AdaptRec, a self-adaptive framework that leverages LLMs for sequential recommendations by incorporating explicit collaborative signals. AdaptRec employs a two-phase user selection mechanism -- User Similarity Retrieval and Self-Adaptive User Selection -- to efficiently identify relevant user sequences in large-scale datasets from multi-metric evaluation. We also develop a User-Based Similarity Retrieval Prompt, enabling the model to actively select similar users and continuously refine its selection criteria during training. Using the collaborative signals from similar users, we construct a User-Contextualized Recommendation Prompt that translates their behavior sequences into natural language, explicitly integrating this information into the recommendation process. Experiments demonstrate AdaptRec's superior performance, with significant improvements in HitRatio@1 scores of 7.13%, 18.16%, and 10.41% across real-world datasets with full fine-tuning, and even higher gains of 23.00%, 15.97%, and 17.98% in few-shot scenarios.
Towards Accurate Forecasting of Renewable Energy: Building Datasets and Benchmarking Machine Learning Models for Solar and Wind Power in France
Lindas, Eloi, Goude, Yannig, Ciais, Philippe
Accurate prediction of non-dispatchable renewable energy sources is essential for grid stability and price prediction. Regional power supply forecasts are usually indirect through a bottom-up approach of plant-level forecasts, incorporate lagged power values, and do not use the potential of spatially resolved data. This study presents a comprehensive methodology for predicting solar and wind power production at country scale in France using machine learning models trained with spatially explicit weather data combined with spatial information about production sites capacity. A dataset is built spanning from 2012 to 2023, using daily power production data from RTE (the national grid operator) as the target variable, with daily weather data from ERA5, production sites capacity and location, and electricity prices as input features. Three modeling approaches are explored to handle spatially resolved weather data: spatial averaging over the country, dimension reduction through principal component analysis, and a computer vision architecture to exploit complex spatial relationships. The study benchmarks state-of-the-art machine learning models as well as hyperparameter tuning approaches based on cross-validation methods on daily power production data. Results indicate that cross-validation tailored to time series is best suited to reach low error. We found that neural networks tend to outperform traditional tree-based models, which face challenges in extrapolation due to the increasing renewable capacity over time. Model performance ranges from 4% to 10% in nRMSE for midterm horizon, achieving similar error metrics to local models established at a single-plant level, highlighting the potential of these methods for regional power supply forecasting.
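The nRMSE figures quoted above depend on how the error is normalized, which the abstract does not specify. The sketch below assumes normalization by the mean of the observed values; normalizing by installed capacity or by the observed range are equally plausible conventions and would give different numbers. The function name and the sample values are hypothetical.

```python
import math


def nrmse(y_true, y_pred):
    """Root-mean-square error divided by the mean of the observations.

    Assumption: mean normalization. The paper may use another denominator
    (for example installed capacity), which this sketch does not cover.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    mean_obs = sum(y_true) / len(y_true)
    return math.sqrt(mse) / mean_obs


if __name__ == "__main__":
    observed = [100.0, 120.0, 90.0]   # hypothetical daily production values
    forecast = [95.0, 125.0, 92.0]    # hypothetical model output
    print(f"nRMSE = {nrmse(observed, forecast):.3f}")
```

Multiplying the result by 100 gives a percentage comparable in form to the 4% to 10% range reported, provided the same normalization is used.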