Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling
Do, Xuan Long
Safety of Large Language Models (LLMs) has become a critical issue given their rapid progress. Greedy Coordinate Gradient (GCG) has been shown to be effective in constructing adversarial prompts that break aligned LLMs, but its optimization is time-consuming. To reduce the time cost of GCG and enable more comprehensive studies of LLM safety, in this work we study a new algorithm called probe sampling. At the core of the algorithm is a mechanism that dynamically determines how similar a smaller draft model's predictions are to the target model's predictions for prompt candidates. When the target model's predictions are similar to the draft model's, we rely heavily on the draft model to filter out a large number of potential prompt candidates. Probe sampling achieves a speedup of up to 5.6 times with Llama2-7b-chat and leads to an equal or improved attack success rate (ASR) on AdvBench. Furthermore, probe sampling can also accelerate other prompt optimization techniques and adversarial methods, yielding speedups of 1.8 times for AutoPrompt, 2.4 times for APE, and 2.4 times for AutoDAN.
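The filtering mechanism described in this abstract can be illustrated with a short sketch. The snippet below is a minimal, simplified rendering of the probe-sampling idea, not the authors' implementation; the probe size, the use of Spearman rank correlation as the agreement measure, and the `candidate_loss_draft`/`candidate_loss_target` helpers are assumptions.

```python
# Hedged sketch of probe sampling: the draft model pre-filters candidates,
# and the amount of filtering is set by how well it agrees with the target
# model on a small probe set. Helper names and constants are assumptions.
import random
from scipy.stats import spearmanr

def probe_sampling_filter(candidates, candidate_loss_draft, candidate_loss_target,
                          probe_size=16, min_keep=8):
    # 1. Score a small random probe set with BOTH models.
    probe = random.sample(candidates, min(probe_size, len(candidates)))
    draft_scores = [candidate_loss_draft(c) for c in probe]
    target_scores = [candidate_loss_target(c) for c in probe]

    # 2. Measure agreement between draft and target model (rank correlation).
    agreement, _ = spearmanr(draft_scores, target_scores)
    agreement = max(0.0, float(agreement))  # negative correlation -> no trust

    # 3. The higher the agreement, the more aggressively the cheap draft model
    #    filters candidates before the expensive target-model evaluation.
    keep_frac = 1.0 - 0.9 * agreement
    keep_n = max(min_keep, int(len(candidates) * keep_frac))

    ranked = sorted(candidates, key=candidate_loss_draft)  # lower loss = better
    return ranked[:keep_n]  # survivors are then scored by the target model
```

In this reading, high agreement lets the expensive target model score only a small fraction of candidates per optimization step, which is where the reported speedup would come from.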
Mixture of Experts for Recognizing Depression from Interview and Reading Tasks
Ilias, Loukas, Askounis, Dimitris
Depression is a mental disorder that can cause a variety of psychological, physical, and social symptoms. Speech has proved to be an objective marker for the early recognition of depression, and many studies have therefore aimed to recognize depression through speech. However, existing methods rely only on spontaneous speech and neglect information obtained via read speech, use transcripts that are often difficult to obtain (manual) or come with high word-error rates (automatic), and do not focus on input-conditional computation methods. To resolve these limitations, this is the first study in the depression recognition task to obtain representations of both spontaneous and read speech, utilize multimodal fusion methods, and employ Mixture of Experts (MoE) models in a single deep neural network. Specifically, we use audio files corresponding to both interview and reading tasks and convert each audio file into log-Mel spectrogram, delta, and delta-delta representations. Next, the image representations of the two tasks pass through shared AlexNet models. The outputs of the AlexNet models are given as input to a multimodal fusion method. The resulting vector is passed through an MoE module. In this study, we employ variants of MoE, including sparsely-gated MoE and multilinear MoE based on factorization. Findings suggest that our proposed approach yields an Accuracy and F1-score of 87.00% and 86.66% respectively on the Androids corpus.
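As a rough illustration of the pipeline sketched in this abstract (shared AlexNet encoders over spectrogram images from the two speech tasks, a fusion step, and an MoE head), here is a minimal PyTorch sketch. Concatenation as the fusion method, the feature dimensions, and the number of experts and routed experts are assumptions rather than the paper's configuration.

```python
# Hedged PyTorch sketch: shared AlexNet encoders for interview and reading
# spectrograms, concatenation fusion, and a sparsely-gated MoE classifier.
import torch
import torch.nn as nn
from torchvision.models import alexnet

class SparseMoEHead(nn.Module):
    def __init__(self, dim, n_experts=4, top_k=2, n_classes=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, n_classes) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)
        self.top_k = top_k

    def forward(self, x):
        gate_logits = self.gate(x)                                 # (B, E)
        topv, topi = gate_logits.topk(self.top_k, dim=-1)          # route to top-k experts
        weights = torch.softmax(topv, dim=-1)                      # (B, k)
        expert_out = torch.stack([e(x) for e in self.experts], 1)  # (B, E, C)
        picked = expert_out.gather(
            1, topi.unsqueeze(-1).expand(-1, -1, expert_out.size(-1)))  # (B, k, C)
        return (weights.unsqueeze(-1) * picked).sum(1)              # (B, C)

class DepressionMoE(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = alexnet(weights=None)
        backbone.classifier[-1] = nn.Identity()   # expose 4096-d features
        self.encoder = backbone                   # shared between the two tasks
        self.head = SparseMoEHead(dim=2 * 4096)

    def forward(self, interview_spec, reading_spec):
        # each input: (B, 3, H, W) stack of log-Mel, delta, delta-delta images
        f_int = self.encoder(interview_spec)
        f_read = self.encoder(reading_spec)
        fused = torch.cat([f_int, f_read], dim=-1)  # simple concatenation fusion
        return self.head(fused)
```

For clarity the MoE head evaluates all experts and keeps only the top-k outputs; a production implementation would dispatch inputs only to the selected experts.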
World of ScoreCraft: Novel Multi Scorer Experiment on the Impact of a Decision Support System in Sleep Staging
Holm, Benedikt, Óskarsson, Arnar, Þorleifsson, Björn Elvar, Hafsteinsson, Hörður Þór, Sigurðardóttir, Sigríður, Grétarsdóttir, Heiður, Hoelke, Kenan, Jouan, Gabriel Marc Marie, Penzel, Thomas, Arnardottir, Erna Sif, Óskarsdóttir, María
Manual scoring of polysomnography (PSG) is a time-intensive task, prone to inter-scorer variability that can impact diagnostic reliability. This study investigates the integration of decision support systems (DSS) into PSG scoring workflows, focusing on their effects on accuracy, scoring time, and potential biases toward recommendations from artificial intelligence (AI) compared to human-generated recommendations. Using a novel online scoring platform, we conducted a repeated-measures study with sleep technologists, who scored traditional and self-applied PSGs. Participants were occasionally presented with recommendations labeled as either human- or AI-generated. We found that traditional PSGs tended to be scored slightly more accurately than self-applied PSGs, but this difference was not statistically significant. Correct recommendations significantly improved scoring accuracy for both PSG types, while incorrect recommendations reduced accuracy. No significant bias was observed toward or against AI-generated recommendations compared to human-generated recommendations. These findings highlight the potential of AI to enhance PSG scoring reliability. However, ensuring the accuracy of AI outputs is critical to maximizing its benefits. Future research should explore the long-term impacts of DSS on scoring workflows and strategies for integrating AI into clinical practice.
LuxBank: The First Universal Dependency Treebank for Luxembourgish
Plum, Alistair, Döhmer, Caroline, Milano, Emilia, Lutgen, Anne-Marie, Purschke, Christoph
The Universal Dependencies (UD) project has significantly expanded linguistic coverage across 161 languages, yet Luxembourgish, a West Germanic language spoken by approximately 400,000 people, has remained absent until now. In this paper, we introduce LuxBank, the first UD treebank for Luxembourgish, addressing the gap in syntactic annotation and analysis for this 'low-research' language. We establish formal guidelines for Luxembourgish language annotation, providing the foundation for the first large-scale quantitative analysis of its syntax. LuxBank serves not only as a resource for linguists and language learners but also as a tool for developing spell checkers and grammar checkers, organising existing text archives and even training large language models. By incorporating Luxembourgish into the UD framework, we aim to enhance the understanding of syntactic variation within West Germanic languages and offer a model for documenting smaller, semi-standardised languages. This work positions Luxembourgish as a valuable resource in the broader linguistic and NLP communities, contributing to the study of languages with limited research and resources.
Multi-Session Client-Centered Treatment Outcome Evaluation in Psychotherapy
Na, Hongbin, Shen, Tao, Yu, Shumao, Chen, Ling
In psychotherapy, therapeutic outcome assessment, or treatment outcome evaluation, is essential for enhancing mental health care by systematically evaluating therapeutic processes and outcomes. Existing large language model approaches often focus on therapist-centered, single-session evaluations, neglecting the client's subjective experience and longitudinal progress across multiple sessions. To address these limitations, we propose IPAEval, a client-Informed Psychological Assessment-based Evaluation framework that automates treatment outcome evaluations from the client's perspective using clinical interviews. IPAEval integrates cross-session client-contextual assessment and session-focused client-dynamics assessment to provide a comprehensive understanding of therapeutic progress. Experiments on our newly developed TheraPhase dataset demonstrate that IPAEval effectively tracks symptom severity and treatment outcomes over multiple sessions, outperforming previous single-session models and validating the benefits of items-aware reasoning mechanisms.

In psychotherapy, therapeutic outcome assessment, a.k.a. treatment outcome evaluation in clinical settings, refers to the systematic evaluation of therapeutic processes and outcomes (Groth-Marnat, 2009), focusing on factors such as therapist effectiveness (Johns et al., 2019) and treatment efficacy (Jensen-Doss et al., 2018) to improve mental health care delivery. It plays a significant role in enhancing the quality and effectiveness of mental health care by providing actionable insights that guide therapists in refining their treatment approaches (Wampold & Imel, 2015), ultimately leading to better client outcomes and improved therapeutic relationships in real-world clinical practice (Maruish & Leahy, 2000). Over the last couple of years, large language models (LLMs) have demonstrated their effectiveness in automatic evaluations, showing a high degree of alignment with human judgment when provided with proper instruction and contextual guidance (Liu et al., 2023; Li et al., 2024b; Kim et al., 2024). This aligns with the "LLMs-as-a-judge" paradigm, where LLMs are employed to simulate human evaluators by providing assessments based on natural language input (Zheng et al., 2023; Wang et al., 2024b). This paradigm has been extended to therapeutic outcome assessment by harnessing LLMs' ability to model complex therapeutic procedures and interactions, offering a novel pathway for automating the assessment of therapeutic efficacy (Chiu et al., 2024; Lee et al., 2024; Li et al., 2024a).
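The "LLMs-as-a-judge" idea underlying this line of work can be sketched in a few lines. The snippet below is only a hedged illustration of multi-session, item-based assessment under assumed details: the `query_llm` helper, the prompt wording, and the 0-4 rating scale are hypothetical and are not part of the IPAEval framework.

```python
# Hedged sketch of item-based, multi-session LLM-as-a-judge assessment.
# `query_llm`, the prompt text, and the rating scale are assumptions.
from typing import Callable, Dict, List

def assess_client_outcomes(sessions: List[str],
                           items: List[str],
                           query_llm: Callable[[str], str]) -> Dict[str, List[str]]:
    """Rate each assessment item per session, keeping earlier sessions as
    context so the judge can track longitudinal change from the client's view."""
    history = ""
    ratings: Dict[str, List[str]] = {item: [] for item in items}
    for i, transcript in enumerate(sessions, start=1):
        for item in items:
            prompt = (
                f"Previous sessions (context):\n{history}\n\n"
                f"Session {i} transcript:\n{transcript}\n\n"
                "From the client's perspective, rate the item below on a 0-4 "
                f"scale and briefly justify the rating.\nItem: {item}"
            )
            ratings[item].append(query_llm(prompt))
        history += f"\n[Session {i}]\n{transcript}"
    return ratings
```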
Killing Two Flies with One Stone: An Attempt to Break LLMs Using English->Icelandic Idioms and Proper Names
Ármannsson, Bjarki, Hafsteinsson, Hinrik, Jasonarson, Atli, Steingrímsson, Steinþór
This paper presents the submission of the Árni Magnússon Institute's team to the WMT24 test suite subtask, focusing on idiomatic expressions and proper names for the English->Icelandic translation direction. Intuitively and empirically, idioms and proper names are known to be a significant challenge for modern translation models. We create two different test suites. The first evaluates the competency of MT systems in translating common English idiomatic expressions, as well as testing whether systems can distinguish between those expressions and the same phrases when used in a literal context. The second test suite consists of place names that should be translated into their Icelandic exonyms (and correctly inflected) and pairs of Icelandic names that share a surface form between the male and female variants, so that incorrect translations impact meaning as well as readability. The scores reported are relatively low, especially for idiomatic expressions and place names, and indicate considerable room for improvement.
AdaptEval: Evaluating Large Language Models on Domain Adaptation for Text Summarization
Afzal, Anum, Chalumattu, Ribin, Matthes, Florian, Espuny, Laura Mascarell
Despite the advances in the abstractive summarization task using Large Language Models (LLMs), there is a lack of research that assesses their ability to adapt easily to different domains. We evaluate the domain adaptation abilities of a wide range of LLMs on the summarization task across various domains in both fine-tuning and in-context learning settings. We also present AdaptEval, the first domain adaptation evaluation suite. AdaptEval includes a domain benchmark and a set of metrics to facilitate the analysis of domain adaptation. Our results demonstrate that LLMs exhibit comparable performance in the in-context learning setting, regardless of their parameter scale.
LLM Questionnaire Completion for Automatic Psychiatric Assessment
Rosenman, Gony, Wolf, Lior, Hendler, Talma
Psychiatric evaluation today depends heavily on the patient's verbal report about disturbed feelings, thoughts, behaviors, and their changes over time. Accordingly, evaluation hinges on two main components: unstructured interviews, which allow patients to express themselves freely under the guidance of open questions, and structured questionnaires, aimed at standardizing the assessment. These methods are outlined in the Diagnostic and Statistical Manual of Mental Disorders (DSM) series, which attempts to assign universal scores to individual experiences of mental disorders [1]. However, the inherent complexity of mental health conditions, characterized by a known positive manifold of symptoms and compounded by the subjective nature and potential unreliability of self-reported data (especially from one session to another), along with interviewer biases, makes accurate diagnosis challenging. Overlapping symptoms and the instability of mental state, especially in pathological conditions, further complicate efforts at precision, precluding an objective and quantitative account of a critical element in the psychiatric evaluation process: the subjective self-experience [2, 3, 4]. The evolution of psychiatric practice is increasingly shaped by the integration of Natural Language Processing (NLP) and machine learning into traditional diagnostic approaches.
Exploring Activation Patterns of Parameters in Language Models
Wang, Yudong, Dai, Damai, Sui, Zhifang
Most work treats large language models (LLMs) as black boxes without an in-depth understanding of their internal working mechanisms. To explain the internal representations of LLMs, we propose a gradient-based metric to assess the activation level of model parameters. Based on this metric, we obtain three preliminary findings. (1) When the inputs are from the same domain, parameters in the shallow layers are activated densely, meaning a larger portion of parameters have a strong impact on the outputs; in contrast, parameters in the deep layers are activated sparsely. (2) When the inputs are from different domains, parameters in shallow layers exhibit higher similarity in activation behavior than those in deep layers. (3) In deep layers, the similarity of the distributions of activated parameters is positively correlated with the empirical data relevance. We further develop three validation experiments to solidify these findings. (1) Starting from the first finding, we configure different prune ratios for different layers and find that this benefits model pruning. (2) We find that a model pruned with one calibration set handles tasks related to the calibration task better than unrelated ones, which validates the second finding. (3) Based on the STS-B and SICK benchmarks, we find that two sentences with consistent semantics tend to share similar parameter activation patterns in deep layers, which aligns with our third finding. Our work sheds light on the behavior of parameter activation in LLMs, and we hope these findings will inspire more practical applications.
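A gradient-based activation metric of the kind described here can be approximated in a few lines. The sketch below is an assumption-laden stand-in, not the authors' exact metric: it scores each parameter by |w · ∂L/∂w| and reports, per parameter tensor, the fraction of entries above a threshold as an "activation density".

```python
# Hedged sketch of a gradient-based per-layer activation density.
# The |param * grad| importance score and the threshold are assumptions.
import torch

def layer_activation_density(model, loss, threshold=1e-4):
    """Return, for each parameter tensor, the fraction of entries whose
    gradient-based importance |w * dL/dw| exceeds a threshold."""
    model.zero_grad()
    loss.backward()
    density = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        importance = (p.detach() * p.grad.detach()).abs()
        density[name] = (importance > threshold).float().mean().item()
    return density
```

Comparing these densities between shallow and deep layers, or between inputs drawn from different domains, mirrors the kind of analysis the abstract describes.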
Lowering Barriers to Entry for Fully-Integrated Custom Payloads on a DJI Matrice
Springer, Joshua, Guðmundsson, Gylfi Þór, Kyas, Marcel
Consumer-grade drones have become effective multimedia collection tools, spring-boarded by rapid development in embedded CPUs, GPUs, and cameras. They are best known for their ability to cheaply collect high-quality aerial video, 3D terrain scans, infrared imagery, etc., compared to manned aircraft. However, users can also create and attach custom sensors, actuators, or computers, so the drone can collect different data, generate composite data, or interact intelligently with its environment, e.g., autonomously changing behavior to land safely or choosing further data collection sites. Unfortunately, developing custom payloads is prohibitively difficult for many researchers outside of engineering. We provide guidelines for creating a sophisticated computational payload that integrates a Raspberry Pi 5 into a DJI Matrice 350. The payload fits into the Matrice's case like a typical DJI payload (but is much cheaper), is easy to build and expand (3D-printed), uses the drone's power and telemetry, can control the drone and its other payloads, can access the drone's sensors and camera feeds, and can process video and stream it to the operator via the controller in real time. We describe the difficulties and proprietary quirks we encountered and how we worked through them, and we provide setup scripts and a known-working configuration for others to use.