Oceania
Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition
Yao, Zheng, Wang, Shuai, Zuccon, Guido
Dense retrievers utilize pre-trained backbone language models (e.g., BERT, LLaMA) that are fine-tuned via contrastive learning to encode text into dense representations that can then be compared via a shallow similarity operation, e.g. inner product. Recent research has questioned the role of fine-tuning vs. that of pre-training within dense retrievers, specifically arguing that retrieval knowledge is primarily gained during pre-training, meaning knowledge not acquired during pre-training cannot be subsequently acquired via fine-tuning. We revisit this idea here, as the claim was only studied in the context of a BERT-based encoder using DPR as the representative dense retriever. We extend the previous analysis by testing other representation approaches (comparing the use of CLS tokens with that of mean pooling), backbone architectures (encoder-only BERT vs. decoder-only LLaMA), and additional datasets (MSMARCO in addition to Natural Questions). Our study confirms that in DPR tuning, pre-trained knowledge underpins retrieval performance, with fine-tuning primarily adjusting neuron activation rather than reorganizing knowledge. However, this pattern does not hold universally: it breaks down, for example, in mean-pooled (Contriever) and decoder-based (LLaMA) models. We ensure full reproducibility and make our implementation publicly available at https://github.com/ielab/DenseRetriever-Knowledge-Acquisition.
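The two representation approaches compared in the abstract above, CLS-token pooling and mean pooling, together with inner-product scoring, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the toy embedding values are assumptions made purely for illustration, and a real retriever would obtain token embeddings from a fine-tuned backbone model.

```python
from typing import List

Vector = List[float]


def cls_pool(token_embeddings: List[Vector]) -> Vector:
    """Use the embedding at the first position (the [CLS] token) as the text vector."""
    return list(token_embeddings[0])


def mean_pool(token_embeddings: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the embeddings of the non-padding tokens (mask value 1)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def inner_product(a: Vector, b: Vector) -> float:
    """Shallow similarity operation used to compare two dense vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy token embeddings: three positions, the last one is padding (made-up values).
query_tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
passage_vec = [0.5, 1.0]

cls_score = inner_product(cls_pool(query_tokens), passage_vec)
mean_score = inner_product(mean_pool(query_tokens, query_mask), passage_vec)
print(cls_score, mean_score)
```

The same tokens yield different scores under the two pooling choices, which is one reason the pooling strategy is a meaningful variable to study.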
A Formally Verified Robustness Certifier for Neural Networks (Extended Version)
Tobler, James, Syeda, Hira Taqdees, Murray, Toby
Neural networks are often susceptible to minor perturbations in input that cause them to misclassify. A recent solution to this problem is the use of globally-robust neural networks, which employ a function to certify that the classification of an input cannot be altered by such a perturbation. Outputs that pass this test are called certified robust. However, to the authors' knowledge, these certification functions have not yet been verified at the implementation level. We demonstrate how previous unverified implementations are exploitably unsound in certain circumstances. Moreover, they often rely on approximation-based algorithms, such as power iteration, that (perhaps surprisingly) do not guarantee soundness. To provide assurance that a given output is robust, we implemented and formally verified a certification function for globally-robust neural networks in Dafny. We describe the program, its specifications, and the important design decisions taken for its implementation and verification, as well as our experience applying it in practice.
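Why power iteration does not guarantee soundness can be seen in a small, framework-free sketch. Power iteration approaches the spectral norm of a weight matrix from below, so stopping early can yield an underestimate, and a certificate built on an underestimated Lipschitz bound may be unsound. The Frobenius norm used for comparison here is an illustrative assumption (a simple, loose upper bound), not the technique used in the paper's Dafny certifier, and all names and values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def power_iteration_norm(w: Matrix, start: Vector, steps: int) -> float:
    """Estimate the spectral norm of w by power iteration on w^T w."""
    wt = transpose(w)
    v = list(start)
    for _ in range(steps):
        v = matvec(wt, matvec(w, v))
        length = norm(v)
        v = [x / length for x in v]
    # The ratio |w v| / |v| never exceeds the true spectral norm.
    return norm(matvec(w, v)) / norm(v)


def frobenius_norm(w: Matrix) -> float:
    """A simple upper bound on the spectral norm (sound but loose)."""
    return math.sqrt(sum(x * x for row in w for x in row))


# Diagonal matrix whose true spectral norm is 2.0.
weights = [[1.0, 0.0], [0.0, 2.0]]
estimate = power_iteration_norm(weights, start=[1.0, 0.001], steps=3)
upper_bound = frobenius_norm(weights)
print(estimate, upper_bound)
```

With an unlucky starting vector and few steps, the estimate stays close to 1 even though the true norm is 2, whereas the upper bound errs only on the safe side.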
Utilizing LLMs to Investigate the Disputed Role of Evidence in Electronic Cigarette Health Policy Formation in Australia and the UK
Curran, Damian, Chapman, Brian, Conway, Mike
Australia and the UK have developed contrasting approaches to the regulation of electronic cigarettes, with Australia, broadly speaking, adopting a relatively restrictive approach and the UK adopting a more permissive approach. Notably, these divergent policies were developed from the same broad evidence base. In this paper, to investigate differences in how the two jurisdictions manage and present evidence, we developed and evaluated a Large Language Model-based sentence classifier to perform automated analyses of electronic cigarette-related policy documents drawn from official Australian and UK legislative processes (109 documents in total). Specifically, we utilized GPT-4 to automatically classify sentences based on whether they contained claims that e-cigarettes were broadly helpful or harmful for public health. Our LLM-based classifier achieved an F-score of 0.9. Further, when applying the classifier to our entire sentence-level corpus, we found that Australian legislative documents show a much higher proportion of harmful statements, and a lower proportion of helpful statements, compared to the expected values, with the opposite holding for the UK. In conclusion, this work utilized an LLM-based approach to provide evidence to support the contention that, drawing on the same evidence base, Australian policy documents related to electronic nicotine delivery systems (ENDS) emphasize the harms associated with ENDS products and UK policy documents emphasize the benefits. Further, our approach provides a starting point for using LLM-based methods to investigate the complex relationship between evidence and health policy formation.
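The two quantities reported in the abstract above, an F-score and a comparison of observed proportions against expected values, can be illustrated with a short worked sketch. All counts below are invented for illustration and are not taken from the paper; the sketch also assumes the balanced F1 variant and the usual independence-based expected counts for a contingency table, neither of which the abstract specifies.

```python
from typing import List


def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def expected_counts(table: List[List[int]]) -> List[List[float]]:
    """Expected cell counts if row and column categories were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


# Hypothetical classifier evaluation counts.
f1 = f1_score(true_pos=90, false_pos=10, false_neg=10)

# Hypothetical sentence counts: rows = jurisdiction, columns = (harmful, helpful).
observed = [[60, 40], [40, 60]]
expected = expected_counts(observed)
print(f1, expected)
```

In this made-up table the first row has more "harmful" sentences (60) than the 50 expected under independence, which is the kind of deviation the abstract describes.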
Domain-Adversarial Anatomical Graph Networks for Cross-User Human Activity Recognition
Ye, Xiaozhou, Wang, Kevin I-Kai
Cross-user variability in Human Activity Recognition (HAR) remains a critical challenge due to differences in sensor placement, body dynamics, and behavioral patterns. Traditional methods often fail to capture biomechanical invariants that persist across users, limiting their generalization capability. We propose an Edge-Enhanced Graph-Based Adversarial Domain Generalization (EEG-ADG) framework that integrates anatomical correlation knowledge into a unified graph neural network (GNN) architecture. By jointly modeling three biomechanically motivated relationships (Interconnected Units, Analogous Units, and Lateral Units), our method encodes domain-invariant features while addressing user-specific variability through a Variational Edge Feature Extractor. A Gradient Reversal Layer (GRL) enforces adversarial domain generalization, ensuring robustness to unseen users. Extensive experiments on the OPPORTUNITY and DSADS datasets demonstrate state-of-the-art performance.
Introduction
Human Activity Recognition (HAR) using wearable sensors has transformative applications in healthcare, sports, and smart environments. However, deploying HAR systems across diverse users faces a fundamental challenge: cross-user variability. Differences in body morphology (e.g., limb length, muscle mass) and movement styles (e.g., gait patterns) lead to significant distribution shifts in sensor data. For instance, accelerometer readings from a wrist sensor during "drinking from a cup" can vary substantially between users due to differences in arm motion and grip style. Traditional machine learning models, which assume identical training and testing distributions, often fail to generalize under such shifts. Conventional HAR methods typically involve feature extraction followed by classification using models such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) [1, 2].
Corresponding author. Email addresses: xye685@aucklanduni.ac.nz (Xiaozhou Ye), kevin.wang@auckland.ac.nz (Kevin I-Kai Wang). Preprint submitted to Information Fusion, May 13, 2025.
To address this limitation, recent research has explored domain adaptation and transfer learning techniques [3]. However, these approaches rely heavily on labeled target-user data, which is often impractical to obtain in real-world scenarios. Domain generalization offers a promising alternative by handling scenarios where no data from the target user(s) is available during training [4, 5]. Despite its potential, most current domain generalization methods focus primarily on aligning user-specific features without considering shared biomechanical patterns that persist across users [6]. Even though users differ in attributes like gender, weight, and height, certain anatomical correlations between body parts remain consistent across individuals.
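The Gradient Reversal Layer named in the abstract of this paper can be sketched without any deep learning framework. The layer passes features through unchanged on the forward pass and flips the sign of the gradient on the backward pass, so the feature extractor is pushed to make the user (domain) classifier's job harder. The class below is a hypothetical illustration, not the authors' code: the `lam` scaling factor and the example values are assumptions, and real implementations typically hook into a framework's automatic differentiation instead.

```python
from typing import List


class GradientReversal:
    """Identity on the forward pass, negated (and scaled) gradient on the backward pass."""

    def __init__(self, lam: float = 1.0) -> None:
        # lam is an assumed hyperparameter controlling the strength of the reversal.
        self.lam = lam

    def forward(self, features: List[float]) -> List[float]:
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_output: List[float]) -> List[float]:
        # The gradient flowing back to the feature extractor is reversed.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=1.0)
features = [0.2, -1.0]
passed_through = grl.forward(features)

# Hypothetical gradient of the domain-classification loss w.r.t. the features.
domain_grad = [0.5, -0.25]
reversed_grad = grl.backward(domain_grad)
print(passed_through, reversed_grad)
```

Because the sign is flipped, a gradient step that would help the domain classifier tell users apart instead moves the features toward being indistinguishable across users.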
Bielik 11B v2 Technical Report
Ociepa, Krzysztof, Flis, Łukasz, Wróbel, Krzysztof, Gwoździej, Adrian, Kinas, Remigiusz
We present Bielik 11B v2, a state-of-the-art language model optimized for Polish text processing. Built on the Mistral 7B v0.2 architecture and scaled to 11B parameters using depth up-scaling, this model demonstrates exceptional performance across Polish language benchmarks while maintaining strong cross-lingual capabilities. We introduce two key technical innovations: Weighted Instruction Cross-Entropy Loss, which optimizes learning across diverse instruction types by assigning quality-based weights to training examples, and Adaptive Learning Rate, which dynamically adjusts based on context length. Comprehensive evaluation across multiple benchmarks demonstrates that Bielik 11B v2 outperforms many larger models, including those with 2-6 times more parameters, and significantly surpasses other specialized Polish language models on tasks ranging from linguistic understanding to complex reasoning. The model's parameter efficiency and extensive quantization options enable deployment across various hardware configurations, advancing Polish language AI capabilities and establishing new benchmarks for resource-efficient language modeling in less-represented languages.
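The idea behind a weighted instruction cross-entropy loss, assigning quality-based weights to training examples, can be sketched as follows. The abstract does not give the exact formulation, so this is only one plausible reading: per-example token-averaged negative log-likelihood, scaled by a quality weight and normalized by the sum of weights. The function name, the normalization, and the toy probabilities are all assumptions.

```python
import math
from typing import List


def weighted_cross_entropy(
    correct_token_probs: List[List[float]], weights: List[float]
) -> float:
    """Quality-weighted average of per-example cross-entropy.

    correct_token_probs[i] holds the probabilities the model assigned to the
    correct tokens of example i; weights[i] is that example's quality weight.
    """
    total_loss = 0.0
    total_weight = 0.0
    for probs, weight in zip(correct_token_probs, weights):
        example_loss = -sum(math.log(p) for p in probs) / len(probs)
        total_loss += weight * example_loss
        total_weight += weight
    return total_loss / total_weight


# Two toy examples: the second is judged low quality and is given weight 0.
probs = [[0.5, 0.5], [0.25]]
loss_first_only = weighted_cross_entropy(probs, weights=[1.0, 0.0])
loss_equal = weighted_cross_entropy(probs, weights=[1.0, 1.0])
print(loss_first_only, loss_equal)
```

Lowering an example's weight reduces its influence on the loss, so noisier instruction data contributes less to the gradient than carefully curated data.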
Towards a Unified Representation Evaluation Framework Beyond Downstream Tasks
Plachouras, Christos, Guinot, Julien, Fazekas, George, Quinton, Elio, Benetos, Emmanouil, Pauwels, Johan
Downstream probing has been the dominant method for evaluating model representations, an important process given the increasing prominence of self-supervised learning and foundation models. However, downstream probing primarily assesses the availability of task-relevant information in the model's latent space, overlooking attributes such as equivariance, invariance, and disentanglement, which contribute to the interpretability, adaptability, and utility of representations in real-world applications. While some attempts have been made to measure these qualities in representations, no unified evaluation framework with modular, generalizable, and interpretable metrics exists. In this paper, we argue for the importance of representation evaluation beyond downstream probing. We introduce a standardized protocol to quantify informativeness, equivariance, invariance, and disentanglement of factors of variation in model representations. We use it to evaluate representations from a variety of models in the image and speech domains using different architectures and pretraining approaches on identified controllable factors of variation. We find that representations from models with similar downstream performance can behave substantially differently with regard to these attributes. This hints that the respective mechanisms underlying their downstream performance are functionally different, prompting new research directions to understand and improve representations.
Representation learning has become popular across many fields due to its effectiveness, computational efficiency, and the relative simplicity of using representations from pretrained models as features for various downstream tasks. Many architectures, training paradigms, and modalities have been used to learn representations that are effective in a variety of tasks, such as retrieval, classification, and generation.
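To make the notion of invariance concrete, here is a minimal sketch of one way it could be quantified: the average cosine similarity between the representation of an input and the representation of a transformed copy. This is an illustrative assumption and not the metric defined in the paper; the encoders, the transformation (list reversal), and the inputs are toy stand-ins.

```python
import math
from typing import Callable, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def invariance_score(
    encode: Callable[[Vector], Vector],
    transform: Callable[[Vector], Vector],
    inputs: List[Vector],
) -> float:
    """Mean similarity between representations before and after the transform."""
    sims = [cosine(encode(x), encode(transform(x))) for x in inputs]
    return sum(sims) / len(sims)


def reverse(x: Vector) -> Vector:
    return list(reversed(x))


def order_free_encoder(x: Vector) -> Vector:
    # Depends only on the set of values, so reversal cannot change it.
    return [sum(x), max(x)]


def order_sensitive_encoder(x: Vector) -> Vector:
    # Depends on position, so reversal changes the representation.
    return [x[0], x[-1]]


inputs = [[1.0, 2.0], [3.0, 1.0]]
score_invariant = invariance_score(order_free_encoder, reverse, inputs)
score_sensitive = invariance_score(order_sensitive_encoder, reverse, inputs)
print(score_invariant, score_sensitive)
```

Two encoders could carry the same task-relevant information and still differ sharply on a score like this, which is the kind of difference downstream probing alone would not reveal.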
Do Not Change Me: On Transferring Entities Without Modification in Neural Machine Translation -- a Multilingual Perspective
Wisniewski, Dawid, Pokrywka, Mikolaj, Rostek, Zofia
Current machine translation models provide us with high-quality outputs in most scenarios. However, they still face some specific problems, such as detecting which entities should not be changed during translation. In this paper, we explore the abilities of popular NMT models, including models from the OPUS project, Google Translate, MADLAD, and EuroLLM, to preserve entities such as URL addresses, IBAN numbers, or emails when producing translations between four languages: English, German, Polish, and Ukrainian. We investigate the quality of popular NMT models in terms of accuracy, discuss errors made by the models, and examine the reasons for errors. Our analysis highlights specific categories, such as emojis, that pose significant challenges for many of the models considered. In addition to the analysis, we propose a new multilingual synthetic dataset of 36,000 sentences that can help assess the quality of entity transfer across nine categories and the four aforementioned languages.
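The kind of accuracy check described in the abstract above, whether an entity survives translation unchanged, can be sketched in a few lines. The exact matching criterion used in the paper is not stated in the abstract, so the exact-substring test below is an assumption, and the example sentences and entities are invented.

```python
from typing import List, Tuple


def entity_preserved(entity: str, translation: str) -> bool:
    """True if the entity appears character-for-character in the translation."""
    return entity in translation


def transfer_accuracy(cases: List[Tuple[str, str]]) -> float:
    """Share of (entity, translation) pairs in which the entity was kept intact."""
    kept = sum(1 for entity, translation in cases if entity_preserved(entity, translation))
    return kept / len(cases)


# Hypothetical model outputs: the URL is kept, the email domain was "translated".
cases = [
    ("https://example.com/a?b=1", "Besuchen Sie https://example.com/a?b=1 für Details."),
    ("jan@example.org", "Napisz do jan@przykład.org w razie pytań."),
]
accuracy = transfer_accuracy(cases)
print(accuracy)
```

An exact-match criterion is deliberately strict: for entities such as IBAN numbers or URLs, even a single altered character makes the output unusable.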
Australia has been hesitant – but could robots soon be delivering your pizza?
Robots zipping down footpaths may sound futuristic, but they are increasingly being put to work making deliveries around the world – though a legal minefield and a cautious approach to new tech mean they are largely absent in Australia. Retail and food businesses have been using robots for a variety of reasons, with hazard detection robots popping up in certain Woolworths stores and virtual waiters taking dishes from kitchens in understaffed restaurants to hungry diners in recent years. Overseas, in jurisdictions such as California, robots are far more visible in everyday life. Following on from the first wave of self-driving car trials in cities such as San Francisco, humans now also share footpaths with robots. Companies including Serve Robotics and Coco have partnered with Uber Eats and DoorDash to send armies of robots, likened to lockers on wheels, along footpaths in Los Angeles, delivering takeaway meals and groceries.
Does video game monetisation harm children – and what is Australia doing about it?
Over the last decade, Dean has amassed a healthy collection of video games, from smash hits to cult classics. His digital library is like a modern day Blockbuster, all readily accessible with just a click or two. But his son, Sam, has eyes for only one video game: Roblox, the behemoth virtual universe-slash-video game that's among the most popular on the planet. The company reports that more than 97 million people log on to Roblox every day. Around 40% of those are, like Sam, under 13 years of age.
Our favourite science fiction books of all time (the ones we forgot)
Is your favourite sci-fi novel included here, or have we forgotten it? Almost exactly a year ago, I asked our team of expert science writers here at New Scientist to name their favourite science fiction novels. Personal tastes meant we ended up with a wonderfully eclectic list, ranging from classics by the likes of Margaret Atwood and Octavia Butler to titles I'd not previously read (Jon Bois's 17776 was a particularly wild suggestion, from our US editor Chelsea Whyte – but it's well worth your time). We New Scientist staffers tend to be sci-fi nerds, and we realised we hadn't quite got all the greats yet. So here, for your reading pleasure, is our second take on our favourite sci-fi novels of all time, otherwise known as the ones we forgot. Again, we're not claiming this is a definitive list. It's just our top sci-fi reads, in no particular order, and we hope you'll discover some new favourites of your own in this line-up. And if we still haven't got them all, then come and tell us about it on Facebook.