South America
FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference
Du, Hongchao, Wu, Shangyu, Kharlamova, Arina, Guan, Nan, Xue, Chun Jason
Although these approaches can improve models' Large Language Models (LLMs) face challenges for on-device memory efficiency, they inevitably impact the generality inference due to high memory demands. Traditional methods performance and still suffer in extreme resource-constrained to reduce memory usage often compromise performance scenarios [4, 9, 12]. Furthermore, these methods lack the flexibility and lack adaptability. We propose FlexInfer, an optimized to vary memory budgets or deployment constraints, offloading framework for on-device inference, addressing requiring adjusting the hyper-parameters, such as quantization these issues with techniques like asynchronous prefetching, or sparsity levels, offering limited choices, and imposing balanced memory locking, and flexible tensor preservation.
Periodontal Bone Loss Analysis via Keypoint Detection With Heuristic Post-Processing
Banks, Ryan, Thengane, Vishal, Guerrero, Marรญa Eugenia, Garcรญa-Madueรฑo, Nelly Maria, Li, Yunpeng, Tang, Hongying, Chaurasia, Akhilanand
Calculating percentage bone loss is a critical test for periodontal disease staging but is sometimes imprecise and time consuming when manually calculated. This study evaluates the application of a deep learning keypoint and object detection model, YOLOv8-pose, for the automatic identification of localised periodontal bone loss landmarks, conditions and staging. YOLOv8-pose was fine-tuned on 193 annotated periapical radiographs. We propose a keypoint detection metric, Percentage of Relative Correct Keypoints (PRCK), which normalises the metric to the average tooth size of teeth in the image. We propose a heuristic post-processing module that adjusts certain keypoint predictions to align with the edge of the related tooth, using a supporting instance segmentation model trained on an open source auxiliary dataset. The model can sufficiently detect bone loss keypoints, tooth boxes, and alveolar ridge resorption, but has insufficient performance at detecting detached periodontal ligament and furcation involvement. The model with post-processing demonstrated a PRCK 0.25 of 0.726 and PRCK 0.05 of 0.401 for keypoint detection, mAP 0.5 of 0.715 for tooth object detection, mesial dice score of 0.593 for periodontal staging, and dice score of 0.280 for furcation involvement. Our annotation methodology provides a stage agnostic approach to periodontal disease detection, by ensuring most keypoints are present for each tooth in the image, allowing small imbalanced datasets. Our PRCK metric allows accurate evaluation of keypoints in dental domains. Our post-processing module adjusts predicted keypoints correctly but is dependent on a minimum quality of prediction by the pose detection and segmentation models. Code: https:// anonymous.4open.science/r/Bone-Loss-Keypoint-Detection-Code. Dataset: https://bit.ly/4hJ3aE7.
Generative Active Adaptation for Drifting and Imbalanced Network Intrusion Detection
Gupta, Ragini, Liu, Shinan, Zhang, Ruixiao, Hu, Xinyue, Kommaraju, Pranav, Wang, Xiaoyang, Benkraouda, Hadjer, Feamster, Nick, Nahrstedt, Klara
Machine learning has shown promise in network intrusion detection systems, yet its performance often degrades due to concept drift and imbalanced data. These challenges are compounded by the labor-intensive process of labeling network traffic, especially when dealing with evolving and rare attack types, which makes selecting the right data for adaptation difficult. To address these issues, we propose a generative active adaptation framework that minimizes labeling effort while enhancing model robustness. Our approach employs density-aware active sampling to identify the most informative samples for annotation and leverages deep generative models to synthesize diverse samples, thereby augmenting the training set and mitigating the effects of concept drift. We evaluate our end-to-end framework on both simulated IDS data and a real-world ISP dataset, demonstrating significant improvements in intrusion detection performance. Our method boosts the overall F1-score from 0.60 (without adaptation) to 0.86. Rare attacks such as Infiltration, Web Attack, and FTP-BruteForce, which originally achieve F1 scores of 0.001, 0.04, and 0.00, improve to 0.30, 0.50, and 0.71, respectively, with generative active adaptation in the CIC-IDS 2018 dataset. Our framework effectively enhances rare attack detection while reducing labeling costs, making it a scalable and adaptive solution for real-world intrusion detection.
Zero-Shot Multi-Label Classification of Bangla Documents: Large Decoders Vs. Classic Encoders
Sarkar, Souvika, Hasan, Md. Najib, Karmaker, Santu
Bangla, a language spoken by over 300 million native speakers and ranked as the sixth most spoken language worldwide, presents unique challenges in natural language processing (NLP) due to its complex morphological characteristics and limited resources. While recent Large Decoder Based models (LLMs), such as GPT, LLaMA, and DeepSeek, have demonstrated excellent performance across many NLP tasks, their effectiveness in Bangla remains largely unexplored. In this paper, we establish the first benchmark comparing decoder-based LLMs with classic encoder-based models for Zero-Shot Multi-Label Classification (Zero-Shot-MLC) task in Bangla. Our evaluation of 32 state-of-the-art models reveals that, existing so-called powerful encoders and decoders still struggle to achieve high accuracy on the Bangla Zero-Shot-MLC task, suggesting a need for more research and resources for Bangla NLP.
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Betley, Jan, Tan, Daniel, Warncke, Niels, Sztyber-Betley, Anna, Bao, Xuchan, Soto, Martรญn, Labenz, Nathan, Evans, Owain
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.
Multilingualism, Transnationality, and K-pop in the Online #StopAsianHate Movement
Masis, Tessa, Duan, Zhangqi, Xu, Weiai Wayne, Zuckerman, Ethan, Pyo, Jane Yeahin, O'Connor, Brendan
The #StopAsianHate (SAH) movement is a broad social movement against violence targeting Asians and Asian Americans, beginning in 2021 in response to racial discrimination related to COVID-19 and sparking worldwide conversation about anti-Asian hate. However, research on the online SAH movement has focused on English-speaking participants so the spread of the movement outside of the United States is largely unknown. In addition, there have been no long-term studies of SAH so the extent to which it has been successfully sustained over time is not well understood. We present an analysis of 6.5 million "#StopAsianHate" tweets from 2.2 million users all over the globe and spanning 60 different languages, constituting the first study of the non-English and transnational component of the online SAH movement. Using a combination of topic modeling, user modeling, and hand annotation, we identify and characterize the dominant discussions and users participating in the movement and draw comparisons of English versus non-English topics and users. We discover clear differences in events driving topics, where spikes in English tweets are driven by violent crimes in the US but spikes in non-English tweets are driven by transnational incidents of anti-Asian sentiment towards symbolic representatives of Asian nations. We also find that global K-pop fans were quick to adopt the SAH movement and, in fact, sustained it for longer than any other user group. Our work contributes to understanding the transnationality and evolution of the SAH movement, and more generally to exploring upward scale shift and public attention in large-scale multilingual online activism.
Generative Tools for Graphical Assets: Empirical Guidelines based on Game Designers' and Developers' Preferences
Fukaya, Kaisei, Daylamani-Zad, Damon, Agius, Harry
Graphical assets play an important role in the design and development of games. There is potential in the use of generative tools, to aid in creating graphical assets, thus improving game design and development pipelines. However, there is little research to address how the generative methods can fit into the wider pipeline. We conducted a user study with 16 game designers and developers to examine their preferences regarding generative tools for graphical assets. The findings highlight that early design stage is preferred by all participants (mean values above 0.67 and p < .001 for early stages). Designers and developers prefer to use such tools for creating large amounts of variations at the cost of quality as they can improve the quality of the artefacts once they generate a suitable asset (mean value 0.17 where 1 is high quality, p < .001). They also strongly (mean value .78, p < .001) raised the need for better integration of such tools in existing design and development environments and the need for the outputs to be in common data formats, to be manipulatable and integrate smoothly into existing environments (mean 3.5 out of 5, p = .004). The study also highlights the requirement for further emphasis on the needs of the users to incorporate these tools effectively in existing pipelines. Informed by these results, we provide a set of guidelines for creating tools that meet the expectations and needs of game designers and developers.
Use Me Wisely: AI-Driven Assessment for LLM Prompting Skills Development
Ognibene, Dimitri, Donabauer, Gregor, Theophilou, Emily, Koyuturk, Cansu, Yavari, Mona, Bursic, Sathya, Telari, Alessia, Testa, Alessia, Boiano, Raffaele, Taibi, Davide, Hernandez-Leo, Davinia, Kruschwitz, Udo, Ruskov, Martin
The use of large language model (LLM)-powered chatbots, such as ChatGPT, has become popular across various domains, supporting a range of tasks and processes. However, due to the intrinsic complexity of LLMs, effective prompting is more challenging than it may seem. This highlights the need for innovative educational and support strategies that are both widely accessible and seamlessly integrated into task workflows. Yet, LLM prompting is highly task- and domain-dependent, limiting the effectiveness of generic approaches. In this study, we explore whether LLM-based methods can facilitate learning assessments by using ad-hoc guidelines and a minimal number of annotated prompt samples. Our framework transforms these guidelines into features that can be identified within learners' prompts. Using these feature descriptions and annotated examples, we create few-shot learning detectors. We then evaluate different configurations of these detectors, testing three state-of-the-art LLMs and ensembles. We run experiments with cross-validation on a sample of original prompts, as well as tests on prompts collected from task-naive learners. Our results show how LLMs perform on feature detection. Notably, GPT- 4 demonstrates strong performance on most features, while closely related models, such as GPT-3 and GPT-3.5 Turbo (Instruct), show inconsistent behaviors in feature classification. These differences highlight the need for further research into how design choices impact feature selection and prompt detection. Our findings contribute to the fields of generative AI literacy and computer-supported learning assessment, offering valuable insights for both researchers and practitioners.
Magic in Human-Robot Interaction (HRI)
"Magic" is referred to here and there in the robotics literature, from "magical moments" afforded by a mobile bubble machine, to "spells" intended to entertain and motivate children--but what exactly could this concept mean for designers? Here, we present (1) some theoretical discussion on how magic could inform interaction designs based on reviewing the literature, followed by (2) a practical description of using such ideas to develop a simplified prototype, which received an award in an international robot magic competition. Although this topic can be considered unusual and some negative connotations exist (e.g., unrealistic thinking can be referred to as magical), our results seem to suggest that magic, in the experiential, supernatural, and illusory senses of the term, could be useful to consider in various robot design contexts, also for artifacts like home assistants and autonomous vehicles--thus, inviting further discussion and exploration.
AILS-NTUA at SemEval-2025 Task 4: Parameter-Efficient Unlearning for Large Language Models using Data Chunking
Premptis, Iraklis, Lymperaiou, Maria, Filandrianos, Giorgos, Mastromichalakis, Orfeas Menis, Voulodimos, Athanasios, Stamou, Giorgos
The Unlearning Sensitive Content from Large Language Models task aims to remove targeted datapoints from trained models while minimally affecting their general knowledge. In our work, we leverage parameter-efficient, gradient-based unlearning using low-rank (LoRA) adaptation and layer-focused fine-tuning. To further enhance unlearning effectiveness, we employ data chunking, splitting forget data into disjoint partitions and merging them with cyclically sampled retain samples at a pre-defined ratio. Our task-agnostic method achieves an outstanding forget-retain balance, ranking first on leaderboards and significantly outperforming baselines and competing systems.