Solothurn
Towards an intelligent assessment system for evaluating the development of algorithmic thinking skills: An exploratory study in Swiss compulsory schools
The rapid digitalisation of contemporary society has profoundly impacted various facets of our lives, including healthcare, communication, business, and education. The ability to engage with new technologies and solve problems has become crucial, making CT skills, such as pattern recognition, decomposition, and algorithm design, essential competencies. In response, Switzerland is conducting research and initiatives to integrate CT into its educational system. This study aims to develop a comprehensive framework for large-scale assessment of CT skills, particularly focusing on AT, the ability to design algorithms. To achieve this, we first developed a competence model capturing the situated and developmental nature of CT, guiding the design of activities tailored to cognitive abilities, age, and context. This framework clarifies how activity characteristics influence CT development and how to assess these competencies. Additionally, we developed an activity for large-scale assessment of AT skills, offered in two variants: one based on non-digital artefacts (unplugged) and manual expert assessment, and the other based on digital artefacts (virtual) and automatic assessment. To provide a more comprehensive evaluation of students' competencies, we developed an IAS based on BNs with noisy gates, which offers real-time probabilistic assessment for each skill rather than a single overall score. The results indicate that the proposed instrument can measure AT competencies across different age groups and educational contexts in Switzerland, demonstrating its applicability for large-scale use. AT competencies exhibit a progressive development, with no overall gender differences, though variations are observed at the school level, significantly influenced by the artefact-based environment and its context, underscoring the importance of creating accessible and adaptable assessment tools.
SwiLTra-Bench: The Swiss Legal Translation Benchmark
Niklaus, Joel, Merane, Jakob, Nenadic, Luka, Ahmadi, Sina, Gao, Yingqiang, Chevalley, Cyrill A. H., Humbel, Claude, Gösken, Christophe, Tanzi, Lorenzo, Lüthi, Thomas, Palombo, Stefan, Poff, Spencer, Yang, Boling, Wu, Nan, Guillod, Matthew, Mamié, Robin, Brunner, Daniel, Pereyra, Julio, Grupen, Niko
In Switzerland legal translation is uniquely important due to the country's four official languages and requirements for multilingual legal documentation. However, this process traditionally relies on professionals who must be both legal experts and skilled translators -- creating bottlenecks and impacting effective access to justice. To address this challenge, we introduce SwiLTra-Bench, a comprehensive multilingual benchmark of over 180K aligned Swiss legal translation pairs comprising laws, headnotes, and press releases across all Swiss languages along with English, designed to evaluate LLM-based translation systems. Our systematic evaluation reveals that frontier models achieve superior translation performance across all document types, while specialized translation systems excel specifically in laws but under-perform in headnotes. Through rigorous testing and human expert validation, we demonstrate that while fine-tuning open SLMs significantly improves their translation quality, they still lag behind the best zero-shot prompted frontier models such as Claude-3.5-Sonnet. Additionally, we present SwiLTra-Judge, a specialized LLM evaluation system that aligns best with human expert assessments.
G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models
Jia, Pengyue, Liu, Yiding, Li, Xiaopeng, Zhao, Xiangyu, Wang, Yuhao, Du, Yantong, Han, Xiao, Wei, Xuetao, Wang, Shuaiqiang, Yin, Dawei
Worldwide geolocalization aims to locate the precise location at the coordinate level of photos taken anywhere on the Earth. It is very challenging due to 1) the difficulty of capturing subtle location-aware visual semantics, and 2) the heterogeneous geographical distribution of image data. As a result, existing studies have clear limitations when scaled to a worldwide context. They may easily confuse distant images with similar visual contents, or cannot adapt to various locations worldwide with different amounts of relevant data. To resolve these limitations, we propose G3, a novel framework based on Retrieval-Augmented Generation (RAG). In particular, G3 consists of three steps, i.e., Geo-alignment, Geo-diversification, and Geo-verification to optimize both retrieval and generation phases of worldwide geolocalization. During Geo-alignment, our solution jointly learns expressive multi-modal representations for images, GPS and textual descriptions, which allows us to capture location-aware semantics for retrieving nearby images for a given query. During Geo-diversification, we leverage a prompt ensembling method that is robust to inconsistent retrieval performance for different image queries. Finally, we combine both retrieved and generated GPS candidates in Geo-verification for location prediction. Experiments on two well-established datasets IM2GPS3k and YFCC4k verify the superiority of G3 compared to other state-of-the-art methods.
Dialect Transfer for Swiss German Speech Translation
Paonessa, Claudio, Schraner, Yanick, Deriu, Jan, Hürlimann, Manuela, Vogel, Manfred, Cieliebak, Mark
This paper investigates the challenges in building Swiss German speech translation systems, specifically focusing on the impact of dialect diversity and differences between Swiss German and Standard German. Swiss German is a spoken language with no formal writing system, it comprises many diverse dialects and is a low-resource language with only around 5 million speakers. The study is guided by two key research questions: how does the inclusion and exclusion of dialects during the training of speech translation models for Swiss German impact the performance on specific dialects, and how do the differences between Swiss German and Standard German impact the performance of the systems? We show that dialect diversity and linguistic differences pose significant challenges to Swiss German speech translation, which is in line with linguistic hypotheses derived from empirical investigations.
SCALE: Scaling up the Complexity for Advanced Language Model Evaluation
Rasiah, Vishvaksenan, Stern, Ronja, Matoshi, Veton, Stürmer, Matthias, Chalkidis, Ilias, Ho, Daniel E., Niklaus, Joel
Recent strides in Large Language Models (LLMs) have saturated many NLP benchmarks (even professional domain-specific ones), emphasizing the need for novel, more challenging novel ones to properly assess LLM capabilities. In this paper, we introduce a novel NLP benchmark that poses challenges to current LLMs across four key dimensions: processing long documents (up to 50K tokens), utilizing domain specific knowledge (embodied in legal texts), multilingual understanding (covering five languages), and multitasking (comprising legal document to document Information Retrieval, Court View Generation, Leading Decision Summarization, Citation Extraction, and eight challenging Text Classification tasks). Our benchmark comprises diverse legal NLP datasets from the Swiss legal system, allowing for a comprehensive study of the underlying Non-English, inherently multilingual, federal legal system. Despite recent advances, efficiently processing long documents for intense review/analysis tasks remains an open challenge for language models. Also, comprehensive, domain-specific benchmarks requiring high expertise to develop are rare, as are multilingual benchmarks. This scarcity underscores our contribution's value, considering most public models are trained predominantly on English corpora, while other languages remain understudied, particularly for practical domain-specific NLP tasks. Our benchmark allows for testing and advancing the state-of-the-art LLMs. As part of our study, we evaluate several pre-trained multilingual language models on our benchmark to establish strong baselines as a point of reference. Despite the large size of our datasets (tens to hundreds of thousands of examples), existing publicly available models struggle with most tasks, even after in-domain pretraining. We publish all resources (benchmark suite, pre-trained models, code) under a fully permissive open CC BY-SA license.
CODET: A Benchmark for Contrastive Dialectal Evaluation of Machine Translation
Alam, Md Mahfuz Ibn, Ahmadi, Sina, Anastasopoulos, Antonios
Neural machine translation (NMT) systems exhibit limited robustness in handling source-side linguistic variations. Their performance tends to degrade when faced with even slight deviations in language usage, such as different domains or variations introduced by second-language speakers. It is intuitive to extend this observation to encompass dialectal variations as well, but the work allowing the community to evaluate MT systems on this dimension is limited. To alleviate this issue, we compile and release \dataset, a contrastive dialectal benchmark encompassing 882 different variations from nine different languages. We also quantitatively demonstrate the challenges large MT models face in effectively translating dialectal variants. We are releasing all code and data.
Improving decision-making via risk-based active learning: Probabilistic discriminative classifiers
Hughes, Aidan J., Gardner, Paul, Bull, Lawrence A., Dervilis, Nikolaos, Worden, Keith
Gaining the ability to make informed decisions on operation and maintenance of structures provides motivation for the implementation of structural health monitoring (SHM) systems. However, descriptive labels for measured data corresponding to health-states of the monitored system are often unavailable. This issue limits the applicability of fully-supervised machine learning paradigms for the development of statistical classifiers to be used in decision-support in SHM systems. One approach to dealing with this problem is risk-based active learning. In such an approach, data-label querying is guided according to the expected value of perfect information for incipient data points. For risk-based active learning in SHM, the value of information is evaluated with respect to a maintenance decision process, and the data-label querying corresponds to the inspection of a structure to determine its health state. In the context of SHM, risk-based active learning has only been considered for generative classifiers. The current paper demonstrates several advantages of using an alternative type of classifier -- discriminative models. Using the Z24 Bridge dataset as a case study, it is shown that discriminative classifiers have benefits, in the context of SHM decision-support, including improved robustness to sampling bias, and reduced expenditure on structural inspections.
On robust risk-based active-learning algorithms for enhanced decision support
Hughes, Aidan J., Bull, Lawrence A., Gardner, Paul, Dervilis, Nikolaos, Worden, Keith
Classification models are a fundamental component of physical-asset management technologies such as structural health monitoring (SHM) systems and digital twins. Previous work introduced \textit{risk-based active learning}, an online approach for the development of statistical classifiers that takes into account the decision-support context in which they are applied. Decision-making is considered by preferentially querying data labels according to \textit{expected value of perfect information} (EVPI). Although several benefits are gained by adopting a risk-based active learning approach, including improved decision-making performance, the algorithms suffer from issues relating to sampling bias as a result of the guided querying process. This sampling bias ultimately manifests as a decline in decision-making performance during the later stages of active learning, which in turn corresponds to lost resource/utility. The current paper proposes two novel approaches to counteract the effects of sampling bias: \textit{semi-supervised learning}, and \textit{discriminative classification models}. These approaches are first visualised using a synthetic dataset, then subsequently applied to an experimental case study, specifically, the Z24 Bridge dataset. The semi-supervised learning approach is shown to have variable performance; with robustness to sampling bias dependent on the suitability of the generative distributions selected for the model with respect to each dataset. In contrast, the discriminative classifiers are shown to have excellent robustness to the effects of sampling bias. Moreover, it was found that the number of inspections made during a monitoring campaign, and therefore resource expenditure, could be reduced with the careful selection of the statistical classifiers used within a decision-supporting monitoring system.