Education
Lightweight and Generalizable Acoustic Scene Representations via Contrastive Fine-Tuning and Distillation
Yuan, Kuang, Gao, Yang, Li, Xilin, Mei, Xinhao, Zadissa, Syavosh, Pruthi, Tarun, Sereshki, Saeed Bagheri
ABSTRACT Acoustic scene classification (ASC) models on edge devices typically operate under fixed class assumptions and lack the transferability needed for real-world applications that require adaptation to new or refined acoustic categories. We propose ContrastASC, which learns generalizable acoustic scene representations by structuring the embedding space to preserve semantic relationships between scenes, enabling adaptation to unseen categories without retraining. Our approach combines supervised contrastive fine-tuning of pre-trained models with contrastive representation distillation to transfer this structured knowledge to compact student models. Our evaluation shows that ContrastASC improves few-shot adaptation to unseen categories while maintaining strong closed-set performance. Index Terms: Acoustic Scene Classification, Contrastive Learning, Knowledge Distillation, Model Fine-tuning 1. INTRODUCTION Acoustic scene classification (ASC) has attracted significant research attention as a crucial capability for context-aware AI systems on edge devices [1, 2].
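The supervised contrastive objective named in the abstract above can be illustrated with a small sketch. The abstract does not give the exact loss used in ContrastASC, so the code below assumes the commonly used supervised contrastive (SupCon) formulation: each anchor is pulled toward samples with the same label and pushed away from all others. The function names, the temperature value, and the toy scene labels are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have a positive.

    Assumed formulation (generic SupCon), not necessarily the paper's.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class sample contributes nothing
        # Temperature-scaled cosine similarity of anchor i to every other sample.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        # Log of the softmax denominator, shifted by the max for stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability assigned to the positives.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy 2-D embeddings: two tight clusters.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(toy, ["park", "park", "metro", "metro"])
misaligned = supcon_loss(toy, ["park", "metro", "park", "metro"])
```

With labels that match the clusters the loss is small; with labels that cut across the clusters it is much larger, which is the pressure that shapes the embedding space during fine-tuning.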
Bridging the Gap Between Multimodal Foundation Models and World Models
Humans understand the world through the integration of multiple sensory modalities, enabling them to perceive, reason about, and imagine dynamic physical processes. Inspired by this capability, multimodal foundation models (MFMs) have emerged as powerful tools for multimodal understanding and generation. However, today's MFMs fall short of serving as effective world models. They lack essential abilities such as performing counterfactual reasoning, simulating dynamics, understanding spatiotemporal information, controlling generated visual outcomes, and performing multifaceted reasoning. We investigate what it takes to bridge the gap between multimodal foundation models and world models. We begin by improving the reasoning capabilities of MFMs through discriminative tasks and equipping MFMs with structured reasoning skills, such as causal inference, counterfactual thinking, and spatiotemporal reasoning, enabling them to go beyond surface correlations and understand deeper relationships within visual and textual data. Next, we explore the generative capabilities of multimodal foundation models across both image and video modalities, introducing new frameworks for structured and controllable generation. Our approaches incorporate scene graphs, multimodal conditioning, and multimodal alignment strategies to guide the generation process, ensuring consistency with high-level semantics and fine-grained user intent. We further extend these techniques to controllable 4D generation, enabling interactive, editable, and morphable object synthesis over time and space.
Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models
Girrbach, Leander, Alaniz, Stephan, Smith, Genevieve, Darrell, Trevor, Akata, Zeynep
Vision-language models trained on large-scale multimodal datasets show strong demographic biases, but the role of training data in producing these biases remains unclear. A major barrier has been the lack of demographic annotations in web-scale datasets such as LAION-400M. We address this gap by creating person-centric annotations for the full dataset, including over 276 million bounding boxes, perceived gender and race/ethnicity labels, and automatically generated captions. These annotations are produced through validated automatic labeling pipelines combining object detection, multimodal captioning, and finetuned classifiers. Using them, we uncover demographic imbalances and harmful associations, such as the disproportionate linking of men and individuals perceived as Black or Middle Eastern with crime-related and negative content. We also show that 60-70% of gender bias in CLIP and Stable Diffusion can be linearly explained by direct co-occurrences in the data. Our resources establish the first large-scale empirical link between dataset composition and downstream model bias.
Deep Reinforcement Learning for Multi-Agent Coordination
We address the challenge of coordinating multiple robots in narrow and confined environments, where congestion and interference often hinder collective task performance. Drawing inspiration from insect colonies, which achieve robust coordination through stigmergy -- modifying and interpreting environmental traces -- we propose a Stigmergic Multi-Agent Deep Reinforcement Learning (S-MADRL) framework that leverages virtual pheromones to model local and social interactions, enabling decentralized emergent coordination without explicit communication. To overcome the convergence and scalability limitations of existing algorithms such as MADQN, MADDPG, and MAPPO, we leverage curriculum learning, which decomposes complex tasks into progressively harder sub-problems. Simulation results show that our framework achieves the most effective coordination of up to eight agents, where robots self-organize into asymmetric workload distributions that reduce congestion and modulate group performance. This emergent behavior, analogous to strategies observed in nature, demonstrates a scalable solution for decentralized multi-agent coordination in crowded environments with communication constraints.
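The stigmergy idea described above (agents leave traces in the environment and read the traces left by others) can be sketched with a simple grid. This is a minimal illustration under stated assumptions, not the S-MADRL implementation: the class name, the evaporation rate, and the choice to treat pheromone as a congestion signal that agents avoid are all hypothetical.

```python
class PheromoneField:
    """Grid of virtual pheromone that agents write to and read from."""

    DIRECTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def __init__(self, width, height, evaporation=0.1):
        self.width, self.height = width, height
        self.evaporation = evaporation  # fraction lost per step (assumed value)
        self.grid = [[0.0] * width for _ in range(height)]

    def deposit(self, x, y, amount=1.0):
        """An agent marks the cell it occupies."""
        self.grid[y][x] += amount

    def evaporate(self):
        """Decay every cell so that old traces fade over time."""
        keep = 1.0 - self.evaporation
        self.grid = [[cell * keep for cell in row] for row in self.grid]

    def sense(self, x, y):
        """Return pheromone levels in the in-bounds 4-neighbourhood of (x, y)."""
        readings = {}
        for name, (dx, dy) in self.DIRECTIONS.items():
            nx, ny = x + dx, y + dy
            if 0 <= nx < self.width and 0 <= ny < self.height:
                readings[name] = self.grid[ny][nx]
        return readings

def least_crowded_move(field, x, y):
    """Pick the neighbouring direction with the weakest pheromone trace."""
    readings = field.sense(x, y)
    return min(readings, key=readings.get)

field = PheromoneField(3, 3)
for cell in [(1, 0), (0, 1), (2, 1)]:  # other agents marked up, left, right
    field.deposit(*cell)
move = least_crowded_move(field, 1, 1)
field.evaporate()
```

In a learning setup, the local readings returned by `sense` would typically be part of each agent's observation, so coordination can emerge without agents exchanging messages directly.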
A Qualitative Comparative Evaluation of Cognitive and Generative Theories
Evaluation is a critical activity associated with any theory. Yet this has proven to be an exceptionally challenging activity for theories based on cognitive architectures. For an overlapping set of reasons, evaluation can also be challenging for theories based on generative neural architectures. This dual challenge is approached here by leveraging a broad perspective on theory evaluation to yield a wide-ranging, albeit qualitative, comparison of whole-mind-oriented cognitive and generative architectures and the full systems that are based on these architectures.
Can an AI-Powered Presentation Platform Based On The Game "Just a Minute" Be Used To Improve Students' Public Speaking Skills?
This study explores the effectiveness of integrating AI and gamification into a presentation platform aimed at university students who want to improve their public speaking skills in their native tongue. Specifically, a platform based on the radio show Just a Minute (JAM) is explored. In this game, players are challenged to speak fluently on a topic for 60 seconds without repeating themselves, hesitating, or deviating from the topic. JAM has proposed benefits such as allowing students to improve their spontaneous speaking skills and reduce their use of speech disfluencies ("um", "uh", etc.). Previous research has highlighted the difficulties students face when speaking publicly, the main one being anxiety. AI-Powered Presentation Platforms (AI-PPPs), where students can speak to an immersive AI audience and receive real-time feedback, have been explored as a method to improve students' speaking skills and confidence. So far they have shown promising results, which this study aims to build upon. A group of students from the University of York was enlisted to evaluate the effectiveness of the JAM platform. They were asked to fill in a questionnaire, play through the game twice, and then complete a final questionnaire about their experiences playing the game. Various statistics were gathered during gameplay, such as the number of points gained and the number of rules broken. The results showed that students found the game promising and believed that their speaking skills could improve if they played the game for longer. More work is needed to establish the effectiveness of the game beyond the short term.
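Two of the JAM rules described above (no repetition, no hesitation) lend themselves to a simple transcript check. The sketch below is only an illustration: the abstract does not describe how the platform detects rule breaks, and the filler list, the ignored-word list, and the function name are hypothetical. Detecting deviation from the topic would need a semantic model and is left out.

```python
import re
from collections import Counter

# Hypothetical word lists; a real system would tune these.
FILLERS = {"um", "uh", "er", "erm"}
IGNORED = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "i"}

def check_transcript(transcript):
    """Count filler words and list content words used more than once."""
    words = re.findall(r"[a-z']+", transcript.lower())
    hesitations = sum(1 for w in words if w in FILLERS)
    counts = Counter(w for w in words if w not in FILLERS and w not in IGNORED)
    repetitions = sorted(w for w, c in counts.items() if c > 1)
    return {"hesitations": hesitations, "repetitions": repetitions}

report = check_transcript("Um, the seaside is lovely and, uh, the seaside is busy")
print(report)  # {'hesitations': 2, 'repetitions': ['seaside']}
```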
Conditional Pseudo-Supervised Contrast for Data-Free Knowledge Distillation
Shao, Renrong, Zhang, Wei, Wang, Jun
Data-free knowledge distillation (DFKD) is an effective way to address model compression and transmission restrictions while preserving privacy, and it has attracted extensive attention in recent years. Currently, the majority of existing methods use a generator to synthesize images to support the distillation. Although current methods have achieved great success, many issues remain to be explored. Firstly, the outstanding performance of supervised learning in deep learning motivates us to explore a pseudo-supervised paradigm for DFKD. Secondly, current synthesis methods cannot distinguish the distributions of different categories of samples, thus producing ambiguous samples that may lead to an incorrect evaluation by the teacher. In addition, current methods cannot optimize the category-wise diversity of samples, which hinders the student model from learning from diverse samples and achieving better performance. In this paper, to address the above limitations, we propose a novel learning paradigm, i.e., conditional pseudo-supervised contrast for data-free knowledge distillation (CPSC-DFKD). The primary innovations of CPSC-DFKD are: (1) introducing a conditional generative adversarial network to synthesize category-specific diverse images for pseudo-supervised learning, (2) improving the modules of the generator to distinguish the distributions of different categories, and (3) proposing pseudo-supervised contrastive learning based on teacher and student views to enhance diversity. Comprehensive experiments on three commonly used datasets validate the performance gains of both the student and the generator brought by CPSC-DFKD. The code is available at https://github.com/RoryShao/CPSC-DFKD.git
Keywords: model compression, knowledge distillation, representation learning, contrastive learning, privacy protection
1. Introduction With the development of artificial intelligence, deep convolutional neural networks (DCNNs) have been widely applied in various computer vision tasks, such as image classification [1], object detection [2], and semantic segmentation [3], and have achieved remarkable success. Nevertheless, in practical applications, DCNNs suffer from some serious issues. Firstly, DCNNs typically require heavy computation and storage. For example, to process a single image, a VGG network commonly requires more than 500MB of memory, which makes it hard to deploy on resource-constrained embedded or edge devices such as mobile phones and autonomous cars.
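The memory figure quoted above can be sanity-checked with a short calculation. The sketch assumes the VGG-16 variant with roughly 138 million parameters stored as 32-bit floats; it counts weights only, so activations and framework overhead would add to the total. The function name is hypothetical.

```python
# Assumption: VGG-16, about 138 million parameters, float32 storage.
VGG16_PARAMS = 138_000_000
BYTES_PER_FLOAT32 = 4

def weight_memory_mb(num_params, bytes_per_param=BYTES_PER_FLOAT32):
    """Memory needed for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000

full_precision = weight_memory_mb(VGG16_PARAMS)   # float32 weights
int8_quantized = weight_memory_mb(VGG16_PARAMS, 1)  # 1 byte per weight
print(full_precision, int8_quantized)  # 552.0 138.0
```

Under these assumptions the weights alone exceed 500 MB, which is consistent with the claim and shows why compression methods such as distillation into a smaller student are attractive for edge deployment.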
Lightweight Prompt Engineering for Cognitive Alignment in Educational AI: A OneClickQuiz Case Study
Yaacoub, Antoun, Assaghir, Zainab, Da-Rugna, Jérôme
The rapid integration of Artificial Intelligence (AI) into educational technology promises to revolutionize content creation and assessment. However, the quality and pedagogical alignment of AI-generated content remain critical challenges. This paper investigates the impact of lightweight prompt engineering strategies on the cognitive alignment of AI-generated questions within OneClickQuiz, a Moodle plugin leveraging generative AI. We evaluate three prompt variants (a detailed baseline, a simpler version, and a persona-based approach) across the Knowledge, Application, and Analysis levels of Bloom's Taxonomy. Using an automated classification model (from prior work) and human review, we find that explicit, detailed prompts are crucial for precise cognitive alignment. While simpler and persona-based prompts yield clear and relevant questions, they frequently misalign with the intended Bloom's levels, generating outputs that are either too complex or deviate from the desired cognitive objective. This study underscores the importance of strategic prompt engineering in fostering pedagogically sound AI-driven educational solutions and advises on optimizing AI for quality content generation in learning analytics and smart learning environments.
Defining a Strategic Action Plan for AI in Higher Education
We start by reviewing the normative actions of international organizations and the concerns expressed about the current technical landscape. We then propose a framework that comprises five key dimensions covering the main challenges of AI in higher education institutions, followed by five key strategic actions that the main stakeholders need to take in order to address the current developments. We map these actions to the main stakeholders of higher education and propose a deployment plan. This defines a framework along the dimensions Challenges, Actions, Stakeholders, Deployment (CASD). Examples of AI-specific actions at the institutional and individual course level are also provided and discussed.
NS-Pep: De novo Peptide Design with Non-Standard Amino Acids
Guo, Tao, Yin, Junbo, Wang, Yu, Gao, Xin
Peptide drugs incorporating non-standard amino acids (NSAAs) offer improved binding affinity and pharmacological properties. However, existing peptide design methods are limited to standard amino acids, leaving NSAA-aware design largely unexplored. We introduce NS-Pep, a unified framework for co-designing peptide sequences and structures with NSAAs. The main challenge is that NSAAs are extremely underrepresented (even the most frequent one, SEP, accounts for less than 0.4% of residues), resulting in a severe long-tailed distribution. To improve generalization to rare amino acids, we propose Residue Frequency-Guided Modification (RFGM), which mitigates over-penalization through frequency-aware logit calibration, supported by both theoretical and empirical analysis. Furthermore, we identify that insufficient side-chain modeling limits the geometric representation of NSAAs. To address this, we introduce Progressive Side-chain Perception (PSP) for coarse-to-fine torsion and location prediction, and Interaction-Aware Weighting (IAW) to emphasize pocket-proximal residues. Moreover, NS-Pep generalizes naturally to the peptide folding task with NSAAs, addressing a major limitation of current tools. Experiments show that NS-Pep improves sequence recovery rate and binding affinity by 6.23% and 5.12%, respectively, and outperforms AlphaFold3 by 17.76% in peptide folding success rate.
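To make the idea of frequency-aware logit calibration concrete, the sketch below uses the generic logit-adjustment technique for long-tailed classification: each class logit is shifted by the log of its training frequency so that rare classes are not systematically suppressed. This is a stand-in for illustration, not the RFGM formulation, which the abstract does not spell out. The residue counts, the `tau` parameter, and the function names are hypothetical.

```python
import math

def class_priors(counts):
    """Convert raw residue counts into empirical frequencies."""
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()}

def adjust_logits(logits, priors, tau=1.0):
    """Generic post-hoc logit adjustment: subtract tau * log(prior).

    Rare classes have a very negative log-prior, so they receive a boost.
    tau = 0 leaves the logits unchanged.
    """
    return {name: logits[name] - tau * math.log(priors[name]) for name in logits}

def argmax(scores):
    return max(scores, key=scores.get)

# Hypothetical counts: a common standard residue vs. the rare NSAA "SEP".
priors = class_priors({"ALA": 990, "SEP": 10})
raw_logits = {"ALA": 2.0, "SEP": 1.0}

before = argmax(raw_logits)
after = argmax(adjust_logits(raw_logits, priors))
print(before, after)  # ALA SEP
```

The example shows the intended effect on a single prediction: a rare residue with a moderately lower raw score can win once the frequency imbalance is accounted for. How strongly to correct (here `tau`) is a design choice that trades rare-class recall against precision on common classes.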