LLAMAPIE: Proactive In-Ear Conversation Assistants
Chen, Tuochao, Batchelder, Nicholas, Liu, Alisa, Smith, Noah, Gollakota, Shyamnath
We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations. We address several challenges, including determining when to respond, crafting concise responses that enhance conversations, leveraging knowledge of the user for context-aware assistance, and real-time, on-device processing. To achieve this, we construct a semi-synthetic dialogue dataset and propose a two-model pipeline: a small model that decides when to respond and a larger model that generates the response. We evaluate our approach on real-world datasets, demonstrating its effectiveness in providing helpful, unobtrusive assistance. User studies with our assistant, implemented on Apple Silicon M2 hardware, show a strong preference for the proactive assistant over both a baseline with no assistance and a reactive model, highlighting the potential of LlamaPIE to enhance live conversations.
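The two-model idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the word cap, and the toy gate and generator are all hypothetical stand-ins for the small and large language models the abstract describes.

```python
from typing import Callable, List, Optional


def proactive_step(
    transcript: List[str],
    should_respond: Callable[[List[str]], bool],
    generate: Callable[[List[str]], str],
    max_words: int = 8,
) -> Optional[str]:
    """Run one step of a two-stage proactive pipeline (illustrative only)."""
    # Stage 1: a cheap gate runs on every transcript update.
    if not should_respond(transcript):
        return None  # stay silent, do not interrupt the conversation
    # Stage 2: the expensive generator runs only when the gate fires.
    words = generate(transcript).split()
    # Keep the whisper short; the cap of 8 words is an assumed value.
    return " ".join(words[:max_words])


# Toy stand-ins (hypothetical): a real system would call language models here.
def toy_gate(transcript: List[str]) -> bool:
    return transcript[-1].rstrip().endswith("?")


def toy_generator(transcript: List[str]) -> str:
    return "Suggested reminder based on what the user already knows about this topic"
```

Splitting the decision from the generation means the costly model is invoked only on the turns where the gate fires, which is one plausible way to keep an always-on assistant within an on-device compute budget.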
My Life in Artificial Intelligence: People, anecdotes, and some lessons learnt
In this very personal workography, I relate my 40 years of experience as a researcher and educator in and around Artificial Intelligence (AI), more specifically Natural Language Processing. I describe how curiosity, and the circumstances of the day, led me to work in both industry and academia, and in various countries, including The Netherlands (Amsterdam, Eindhoven, and Utrecht), the USA (Stanford), England (Brighton), Scotland (Aberdeen), and China (Beijing and Harbin). People and anecdotes play a large role in my story; the history of AI forms its backdrop. I focus on things that might be of interest to (even) younger colleagues, given the choices they face in their own work and life at a time when AI is finally emerging from the shadows.
Overview of ADoBo at IberLEF 2025: Automatic Detection of Anglicisms in Spanish
Alvarez-Mellado, Elena, Porta-Zamorano, Jordi, Lignos, Constantine, Gonzalo, Julio
Linguistic borrowing is the process of reproducing in one language elements and patterns that come from another language (Haugen, 1950). Linguistic borrowing therefore involves the exchange between two languages and has been widely studied within the field of contact linguistics (Weinreich, 1963). Lexical borrowing in particular is the process of importing words from one language into another (Poplack, Sankoff, and Miller, 1988; Onysko, 2007). Lexical borrowing is a phenomenon that occurs in all languages and is a prolific source of new words and meanings (Gerding et al., 2014). In recent decades, English in particular has produced numerous lexical borrowings (often called anglicisms) in many European languages (Furiassi, Pulcini, and González, 2012). Previous work estimated that a reader of French newspapers encounters a new lexical borrowing every 1,000 words (Chesley and Baayen, 2010), English borrowings outnumbering all other borrowings combined (Chesley, 2010). In Chilean newspapers, lexical borrowings account for approximately 30% of neologisms, 80% of those corresponding to anglicisms (Gerding et al., 2014). In European Spanish, it was estimated that anglicisms could account for 2% of the vocabulary used in Spanish newspaper El País in 1991 (Rodríguez González, 2002), a number that is likely to be higher today. As a result, the usage of lexical borrowings in Spanish (and particularly anglicisms) has attracted lots of attention, both in linguistic studies and among the general public.
Progressive Homeostatic and Plastic Prompt Tuning for Audio-Visual Multi-Task Incremental Learning
Yin, Jiong, Li, Liang, Zhang, Jiehua, Gao, Yuhan, Yan, Chenggang, Sheng, Xichun
Audio-visual multi-task incremental learning aims to continuously learn from multiple audio-visual tasks without the need for joint training on all tasks. The challenge of the problem is how to preserve the old task knowledge while facilitating the learning of new tasks with previous experiences. To address these challenges, we introduce a three-stage Progressive Homeostatic and Plastic audio-visual prompt (PHP) method. In the shallow phase, we design the task-shared modality aggregating adapter to foster cross-task and cross-modal audio-visual representation learning to enhance shared understanding between tasks. In the middle phase, we propose the task-specific modality-shared dynamic generating adapter, which constructs prompts that are tailored to individual tasks while remaining general across modalities, which balances the model's ability to retain knowledge against forgetting with its potential for versatile multi-task transferability. In the deep phase, we introduce the task-specific modality-independent prompts to further refine the understanding ability by targeting individual information for each task and modality. By incorporating these three phases, PHP retains task-specific prompts while adapting shared parameters for new tasks to effectively balance knowledge sharing and specificity. Our method achieves SOTA performance in different orders of four tasks (AVE, AVVP, AVS and AVQA).
MORNING GLORY: Has President Trump ordered the big re-think?
Neither President Franklin Delano Roosevelt nor British Prime Minister Winston Churchill, nor any of their senior military or political advisors, saw the Japanese attacks of late 1941 coming. The forces of Imperial Japan achieved total surprise across the Pacific. The intelligence failures in the U.S. leading up to Pearl Harbor were catastrophic. So was Great Britain's general underestimation of the threat from Imperial Japan. The U.K.'s fortress outpost in the Pacific at Singapore was thought to be, if not impregnable, then as close to it as possible.
JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment
Liu, Renhang, Hung, Chia-Yu, Majumder, Navonil, Gautreaux, Taylor, Bagherzadeh, Amir Ali, Li, Chuan, Herremans, Dorien, Poria, Soujanya
Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high-quality and faithful audio outputs pertaining to speech and acoustic events. However, there is still much room for improvement in creative audio generation that primarily involves music and songs. Recent open lyrics-to-song models, such as DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing word-level timing and duration control in song generation, allowing fine-grained vocal control. To enhance the quality of generated songs to better align with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need for manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms the existing models in terms of the music-specific attributes.
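For readers unfamiliar with Direct Preference Optimization, the sketch below computes the standard DPO loss for a single preference pair from log-likelihoods. This is the generic textbook objective, not JAM's training code: a flow-matching model does not expose exact log-likelihoods in this form, so the authors' objective presumably uses an adapted variant. The function name, argument names, and the default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair."""
    # How much more the policy favors each sample than the frozen reference.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and reference agree, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the preferred sample relative to the reference.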
When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification
Shcharbakova, Hanna, Anikina, Tatiana, Skachkova, Natalia, van Genabith, Josef
The rapid spread of multilingual misinformation requires robust automated fact verification systems capable of handling fine-grained veracity assessments across diverse languages. While large language models have shown remarkable capabilities across many NLP tasks, their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We conduct a comprehensive evaluation of five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Our experiments compare small language models (encoder-based XLM-R and mT5) with recent decoder-only LLMs (Llama 3.1, Qwen 2.5, Mistral Nemo) using both prompting and fine-tuning approaches. Surprisingly, we find that XLM-R (270M parameters) substantially outperforms all tested LLMs (7-12B parameters), achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%. This represents a 15.8 percentage-point improvement over the previous state of the art (41.9%), establishing new performance benchmarks for multilingual fact verification. Our analysis reveals problematic patterns in LLM behavior, including systematic difficulties in leveraging evidence and pronounced biases toward frequent categories in imbalanced data settings. These findings suggest that for fine-grained multilingual fact verification, smaller specialized models may be more effective than general-purpose large models, with important implications for practical deployment of fact-checking systems.
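Macro-F1 averages per-class F1 scores with equal weight, so a model that collapses onto frequent categories is penalized even when its accuracy looks reasonable. The sketch below is a plain-Python illustration of that metric on made-up labels. It is not the paper's evaluation code, and the toy data is purely hypothetical.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred), key=str)
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced data: a model that always predicts the majority class.
gold = ["true"] * 8 + ["false"] * 2
always_majority = ["true"] * 10
accuracy = sum(t == p for t, p in zip(gold, always_majority)) / len(gold)
score = macro_f1(gold, always_majority)

# The reported gap over the earlier best result, in percentage points.
gap_points = round(57.7 - 41.9, 1)
```

In this toy case accuracy is 0.8 while macro-F1 is about 0.44, because the minority class contributes an F1 of zero. The same mechanism would explain why a bias toward frequent categories depresses macro-F1 on a seven-class, imbalanced dataset.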
Controllable Video-to-Music Generation with Multiple Time-Varying Conditions
Wu, Junxian, You, Weitao, Zuo, Heda, Zhang, Dengming, Chen, Pei, Sun, Lingyun
Music enhances video narratives and emotions, driving demand for automatic video-to-music (V2M) generation. However, existing V2M methods relying solely on visual features or supplementary textual inputs generate music in a black-box manner, often failing to meet user expectations. To address this challenge, we propose a novel multi-condition guided V2M generation framework that incorporates multiple time-varying conditions for enhanced control over music generation. Our method uses a two-stage training strategy that enables learning of V2M fundamentals and audiovisual temporal synchronization while meeting users' needs for multi-condition control. In the first stage, we introduce a fine-grained feature selection module and a progressive temporal alignment attention mechanism to ensure flexible feature alignment. For the second stage, we develop a dynamic conditional fusion module and a control-guided decoder module to integrate multiple conditions and accurately guide the music composition process. Extensive experiments demonstrate that our method outperforms existing V2M pipelines in both subjective and objective evaluations, significantly enhancing control and alignment with user expectations.
Before the Outrage: Challenges and Advances in Predicting Online Antisocial Behavior
Antisocial behavior (ASB) on social media, including hate speech, harassment, and trolling, poses growing challenges for platform safety and societal wellbeing. While prior work has primarily focused on detecting harmful content after it appears, predictive approaches aim to forecast future harmful behaviors (such as hate speech propagation, conversation derailment, or user recidivism) before they fully unfold. Despite increasing interest, the field remains fragmented, lacking a unified taxonomy or clear synthesis of existing methods. This paper presents a systematic review of over 49 studies on ASB prediction, offering a structured taxonomy of five core task types: early harm detection, harm emergence prediction, harm propagation prediction, behavioral risk prediction, and proactive moderation support. We analyze how these tasks differ by temporal framing, prediction granularity, and operational goals. In addition, we examine trends in modeling techniques, from classical machine learning to pre-trained language models, and assess the influence of dataset characteristics on task feasibility and generalization. Our review highlights methodological challenges, such as dataset scarcity, temporal drift, and limited benchmarks, while outlining emerging research directions including multilingual modeling, cross-platform generalization, and human-in-the-loop systems. By organizing the field around a coherent framework, this survey aims to guide future work toward more robust and socially responsible ASB prediction.
Video Forgery Detection for Surveillance Cameras: A Review
Tayfor, Noor B., Rashid, Tarik A., Qader, Shko M., Hassan, Bryar A., Abdalla, Mohammed H., Majidpour, Jafar, Ahmed, Aram M., Ali, Hussein M., Aladdin, Aso M., Abdullah, Abdulhady A., Shamsaldin, Ahmed S., Sidqi, Haval M., Salih, Abdulrahman, Yaseen, Zaher M., Ameen, Azad A., Nayak, Janmenjoy, Hamza, Mahmood Yashar
The widespread availability of video recording through smartphones and digital devices has made video-based evidence more accessible than ever. Surveillance footage plays a crucial role in security, law enforcement, and judicial processes. However, with the rise of advanced video editing tools, tampering with digital recordings has become increasingly easy, raising concerns about their authenticity. Ensuring the integrity of surveillance videos is essential, as manipulated footage can lead to misinformation and undermine judicial decisions. This paper provides a comprehensive review of existing forensic techniques used to detect video forgery, focusing on their effectiveness in verifying the authenticity of surveillance recordings. Various methods, including compression-based analysis, frame duplication detection, and machine learning-based approaches, are explored. The findings highlight the growing necessity for more robust forensic techniques to counteract evolving forgery methods. Strengthening video forensic capabilities will ensure that surveillance recordings remain credible and admissible as legal evidence.
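Of the method families named above, frame duplication detection is the simplest to illustrate. The sketch below flags frames whose raw bytes exactly match an earlier frame. It is a deliberately simplified assumption-laden example, not a technique taken from the review: practical forensic detectors must tolerate recompression noise and typically compare robust features rather than exact hashes, and the function name and byte-string frame representation are hypothetical.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple


def find_duplicate_frames(frames: Sequence[bytes]) -> List[Tuple[int, int]]:
    """Return (first_index, repeat_index) pairs for byte-identical frames."""
    first_seen: Dict[str, int] = {}
    duplicates: List[Tuple[int, int]] = []
    for index, frame in enumerate(frames):
        # Hash the raw frame bytes so comparison is a dictionary lookup.
        digest = hashlib.sha256(frame).hexdigest()
        if digest in first_seen:
            duplicates.append((first_seen[digest], index))
        else:
            first_seen[digest] = index
    return duplicates


# Toy "video": frames 2 and 4 are copies of frames 0 and 1.
toy_frames = [b"frame-a", b"frame-b", b"frame-a", b"frame-c", b"frame-b"]
suspect_pairs = find_duplicate_frames(toy_frames)
```

A match is only a hint, not proof of tampering: a static surveillance scene could legitimately produce identical frames, so a real pipeline would need further evidence such as timestamps or long runs of repeated frames.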