Machine Translation
CaMDN: Enhancing Cache Efficiency for Multi-tenant DNNs on Integrated NPUs
Cai, Tianhao, Wang, Liang, Xiao, Limin, Han, Meng, Wang, Zeyu, Sun, Lin, Liao, Xiaojian
With the rapid development of DNN applications, multi-tenant execution, where multiple DNNs are co-located on a single SoC, is becoming a prevailing trend. Although many methods are proposed in prior works to improve multi-tenant performance, the impact of shared cache is not well studied. This paper proposes CaMDN, an architecture-scheduling co-design to enhance cache efficiency for multi-tenant DNNs on integrated NPUs. Specifically, a lightweight architecture is proposed to support model-exclusive, NPU-controlled regions inside shared cache to eliminate unexpected cache contention. Moreover, a cache scheduling method is proposed to improve shared cache utilization. In particular, it includes a cache-aware mapping method for adaptability to the varying available cache capacity and a dynamic allocation algorithm to adjust the usage among co-located DNNs at runtime. Compared to prior works, CaMDN reduces the memory access by 33.4% on average and achieves a model speedup of up to 2.56$\times$ (1.88$\times$ on average).
Aya Vision: Advancing the Frontier of Multilingual Multimodality
Dash, Saurabh, Nan, Yiyang, Dang, John, Ahmadian, Arash, Singh, Shivalika, Smith, Madeline, Venkitesh, Bharat, Shmyhlo, Vlad, Aryabumi, Viraat, Beller-Morales, Walter, Pekmez, Jeremy, Ozuzu, Jason, Richemond, Pierre, Locatelli, Acyr, Frosst, Nick, Blunsom, Phil, Gomez, Aidan, Zhang, Ivan, Fadaee, Marzieh, Govindassamy, Manoj, Roy, Sudip, Gallé, Matthias, Ermis, Beyza, Üstün, Ahmet, Hooker, Sara
Building multimodal language models is fundamentally challenging: it requires aligning vision and language modalities, curating high-quality instruction data, and avoiding the degradation of existing text-only capabilities once vision is introduced. These difficulties are further magnified in the multilingual setting, where the need for multimodal data in different languages exacerbates existing data scarcity, machine translation often distorts meaning, and catastrophic forgetting is more pronounced. To address these challenges, we introduce novel techniques spanning both data and modeling. First, we develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, enabling Aya Vision models to produce natural, human-preferred responses to multimodal inputs across many languages. Complementing this, we propose a cross-modal model merging technique that mitigates catastrophic forgetting, effectively preserving text-only capabilities while simultaneously enhancing multimodal generative performance. Aya-Vision-8B achieves best-in-class performance compared to strong multimodal models such as Qwen-2.5-VL-7B, Pixtral-12B, and even the much larger Llama-3.2-90B-Vision. We further scale this approach with Aya-Vision-32B, which outperforms models more than twice its size, such as Molmo-72B and Llama-3.2-90B-Vision. Our work advances multilingual progress on the multimodal frontier, and provides insights into techniques that effectively bend the need for compute while delivering extremely high performance.
Graph Laplacian Wavelet Transformer via Learnable Spectral Decomposition
Kiruluta, Andrew, Lundy, Eric, Burity, Priscilla
We introduce the Graph Wavelet Transformer (GWT), a novel architecture that replaces this bottleneck with a learnable, multi-scale wavelet transform defined over an explicit graph Laplacian derived from syntactic or semantic parses. By parameterizing K N bandpass filters in the graph Fourier domain, GWT achieves a linear-time mixing operator that simultaneously captures local syntactic dependencies and global semantic context. We provide a rigorous mathematical formulation of the spectral filtering and mixing process, integrate GWT modules into a standard Graph Transformer backbone, and evaluate on the WMT14 English-German translation benchmark. Empirical results demonstrate that GWT outperforms the baseline Graph Transformer by 0.8 BLEU, reduces parameter count by 7%, and speeds up inference by 15%. Our analysis shows that multi-scale spectral decomposition offers an interpretable, efficient, and expressive alternative to quadratic self-attention for graph-structured sequence modeling.
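For orientation, the spectral filtering described above can be written in the textbook spectral graph wavelet form. This is a generic sketch, not the paper's own parameterization: the symbols $A$, $D$, $U$, $\Lambda$, $g_k$, $X$, and $W_k$ are assumptions introduced here, and the abstract does not say how the linear-time operator is obtained.

```latex
\begin{aligned}
L &= D - A = U \Lambda U^{\top},
  \qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_N), \\
\hat{X} &= U^{\top} X
  \qquad \text{(graph Fourier transform of node features } X \in \mathbb{R}^{N \times d}\text{)}, \\
\Psi_k X &= U \, g_k(\Lambda) \, U^{\top} X,
  \qquad k = 1, \dots, K
  \qquad \text{(learnable bandpass filter } g_k \text{ at scale } k\text{)}, \\
Y &= \sum_{k=1}^{K} \bigl(\Psi_k X\bigr) W_k,
  \qquad W_k \in \mathbb{R}^{d \times d}
  \qquad \text{(mixing across scales)}.
\end{aligned}
```

Here $g_k(\Lambda)$ applies a scalar filter to each eigenvalue, so each $\Psi_k$ passes one frequency band of the graph signal. Low-frequency bands vary smoothly over the whole graph, while high-frequency bands capture differences between neighbouring nodes.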
Are We Paying Attention to Her? Investigating Gender Disambiguation and Attention in Machine Translation
Manna, Chiara, Alishahi, Afra, Blain, Frédéric, Vanmassenhove, Eva
While gender bias in modern Neural Machine Translation (NMT) systems has received much attention, traditional evaluation metrics do not fully capture the extent to which these systems integrate contextual gender cues. We propose a novel evaluation metric called Minimal Pair Accuracy (MPA), which measures the reliance of models on gender cues for gender disambiguation. MPA is designed to go beyond surface-level gender accuracy metrics by focusing on whether models adapt to gender cues in minimal pairs -- sentence pairs that differ solely in the gendered pronoun, namely the explicit indicator of the target entity's gender in the source language (EN). We evaluate a number of NMT models on the English-Italian (EN--IT) language pair using this metric and show that they ignore available gender cues in most cases in favor of (statistical) stereotypical gender interpretation. We further show that in anti-stereotypical cases, these models tend to more consistently take masculine gender cues into account while ignoring the feminine cues. Furthermore, we analyze the attention head weights in the encoder component and show that while all models encode gender information to some extent, masculine cues elicit a more diffused response compared to the more concentrated and specialized responses to feminine gender cues.
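A minimal sketch of how a minimal-pair metric of this kind could be scored. The abstract does not give the exact formula, so the rule used here (a pair counts only if both translations follow their own pronoun cue) is an assumption, as are the names `MinimalPair` and `minimal_pair_accuracy` and the `"M"`/`"F"` labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalPair:
    """Gender realized in the translation of each sentence of one pair (hypothetical schema)."""
    pred_for_masculine_cue: str  # target-side gender when the source pronoun is masculine
    pred_for_feminine_cue: str   # target-side gender when the source pronoun is feminine


def minimal_pair_accuracy(pairs):
    """Share of pairs in which BOTH translations follow their own gender cue (assumed rule)."""
    if not pairs:
        raise ValueError("need at least one minimal pair")
    adapted = sum(
        1
        for p in pairs
        if p.pred_for_masculine_cue == "M" and p.pred_for_feminine_cue == "F"
    )
    return adapted / len(pairs)


pairs = [
    MinimalPair("M", "F"),  # model adapts to the cue in both sentences
    MinimalPair("M", "M"),  # feminine cue ignored
    MinimalPair("F", "F"),  # masculine cue ignored
    MinimalPair("M", "F"),
]
mpa = minimal_pair_accuracy(pairs)
```

Under this rule, a model that always outputs one gender scores zero, even though a per-sentence gender accuracy would credit it with half of the sentences.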
Translating the Grievance Dictionary: a psychometric evaluation of Dutch, German, and Italian versions
van der Vegt, Isabelle, Kleinberg, Bennett, Miotto, Marilu, Festor, Jonas
This paper introduces and evaluates three translations of the Grievance Dictionary, a psycholinguistic dictionary for the analysis of violent, threatening or grievance-fuelled texts. Considering the relevance of these themes in languages beyond English, we translated the Grievance Dictionary to Dutch, German, and Italian. We describe the process of automated translation supplemented by human annotation. Psychometric analyses are performed, including internal reliability of dictionary categories and correlations with the LIWC dictionary. The Dutch and German translations perform similarly to the original English version, whereas the Italian dictionary shows low reliability for some categories. Finally, we make suggestions for further validation and application of the dictionary, as well as for future dictionary translations following a similar approach.
Do Not Change Me: On Transferring Entities Without Modification in Neural Machine Translation -- a Multilingual Perspective
Wisniewski, Dawid, Pokrywka, Mikolaj, Rostek, Zofia
Current machine translation models provide us with high-quality outputs in most scenarios. However, they still face some specific problems, such as detecting which entities should not be changed during translation. In this paper, we explore the abilities of popular NMT models, including models from the OPUS project, Google Translate, MADLAD, and EuroLLM, to preserve entities such as URL addresses, IBAN numbers, or emails when producing translations between four languages: English, German, Polish, and Ukrainian. We investigate the quality of popular NMT models in terms of accuracy, discuss errors made by the models, and examine the reasons for errors. Our analysis highlights specific categories, such as emojis, that pose significant challenges for many models considered. In addition to the analysis, we propose a new multilingual synthetic dataset of 36,000 sentences that can help assess the quality of entity transfer across nine categories and four aforementioned languages.
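A minimal sketch of a per-category entity-transfer check, assuming that "preserved" means the entity string appears unchanged in the translation. The abstract does not specify the matching rule. The function names and example sentences below are hypothetical and are not taken from the dataset.

```python
def entity_preserved(entity: str, translation: str) -> bool:
    """True when the entity appears character-for-character in the translation."""
    return entity in translation


def preservation_accuracy(examples):
    """examples: iterable of (category, entity, translation) triples.

    Returns {category: share of translations that kept the entity intact}.
    """
    totals, kept = {}, {}
    for category, entity, translation in examples:
        totals[category] = totals.get(category, 0) + 1
        kept[category] = kept.get(category, 0) + int(entity_preserved(entity, translation))
    return {category: kept[category] / totals[category] for category in totals}


# Illustrative data only.
examples = [
    ("url", "https://example.com/a?b=1", "Besuchen Sie https://example.com/a?b=1 heute."),
    ("url", "https://example.com/a?b=1", "Besuchen Sie https://example.com/a heute."),
    ("email", "jan@example.org", "Napisz do jan@example.org."),
]
scores = preservation_accuracy(examples)
```

Exact substring matching is deliberately strict: a URL that loses its query string counts as an error, which is the kind of silent modification the paper is concerned with.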
TopicVD: A Topic-Based Dataset of Video-Guided Multimodal Machine Translation for Documentaries
Lv, Jinze, Chen, Jian, Long, Zi, Fu, Xianghua, Chen, Yin
Most existing multimodal machine translation (MMT) datasets are predominantly composed of static images or short video clips, lacking extensive video data across diverse domains and topics. As a result, they fail to meet the demands of real-world MMT tasks, such as documentary translation. In this study, we developed TopicVD, a topic-based dataset for video-supported multimodal machine translation of documentaries, aiming to advance research in this field. We collected video-subtitle pairs from documentaries and categorized them into eight topics, such as economy and nature, to facilitate research on domain adaptation in video-guided MMT. Additionally, we preserved their contextual information to support research on leveraging the global context of documentaries in video-guided MMT. To better capture the shared semantics between text and video, we propose an MMT model based on a cross-modal bidirectional attention module. Extensive experiments on the TopicVD dataset demonstrate that visual information consistently improves the performance of the NMT model in documentary translation. However, the MMT model's performance significantly declines in out-of-domain scenarios, highlighting the need for effective domain adaptation methods. Additionally, experiments demonstrate that global context can effectively improve translation performance. The dataset and our implementations are available at https://github.com/JinzeLv/TopicVD
A new AI translation system for headphones clones multiple voices simultaneously
"There are so many smart people across the world, and the language barrier prevents them from having the confidence to communicate," says Shyam Gollakota, a professor at the University of Washington, who worked on the project. "My mom has such incredible ideas when she's speaking in Telugu, but it's so hard for her to communicate with people in the US when she visits from India. We think this kind of system could be transformative for people like her." While there are plenty of other live AI translation systems out there, such as the one running on Meta's Ray-Ban smart glasses, they focus on a single speaker, not multiple people speaking at once, and deliver robotic-sounding automated translations. The new system is designed to work with existing, off-the-shelf noise-canceling headphones that have microphones, plugged into a laptop powered by Apple's M2 silicon chip, which can support neural networks.
Language translation, and change of accent for speech-to-speech task using diffusion model
Mishra, Abhishek, Chowdhury, Ritesh Sur, Bahuguna, Vartul, Pandey, Isha, Ramakrishnan, Ganesh
Speech-to-speech translation (S2ST) aims to convert spoken input in one language to spoken output in another, typically focusing on either language translation or accent adaptation. However, effective cross-cultural communication requires handling both aspects simultaneously -- translating content while adapting the speaker's accent to match the target language context. In this work, we propose a unified approach for simultaneous speech translation and change of accent, a task that remains underexplored in current literature. Our method reformulates the problem as a conditional generation task, where target speech is generated based on phonemes and guided by target speech features. Leveraging the power of diffusion models, known for high-fidelity generative capabilities, we adapt text-to-image diffusion strategies by conditioning on source speech transcriptions and generating Mel spectrograms representing the target speech with desired linguistic and accentual attributes. This integrated framework enables joint optimization of translation and accent adaptation, offering a more parameter-efficient and effective model compared to traditional pipelines.
Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
Zhang, Hang, Shi, Jiuchen, Wang, Yixiao, Chen, Quan, Shan, Yizhou, Guo, Minyi
Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For multi-LoRA serving, caching hot KV caches and LoRA adapters in the high bandwidth memory (HBM) of accelerators can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving metrics such as Time-To-First-Token (TTFT), because they neglect usage dependencies when caching LoRAs and KVs. We therefore propose FASTLIBRA, a Multi-LoRA caching system to optimize the serving performance. FASTLIBRA comprises a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during inference with a unified caching pool. The cache swapper determines the swap-in or swap-out of LoRAs and KV caches based on a unified cost model, when the HBM is idle or busy, respectively. Experimental results show that FASTLIBRA reduces the TTFT by 63.4% on average, compared to state-of-the-art works.
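A toy sketch of the dependency idea in a unified pool: a KV cache entry names the LoRA it depends on, and a LoRA is never evicted while a dependent KV entry is still resident. This is not FASTLIBRA's implementation. The class `UnifiedCachePool` and its methods are hypothetical, and plain least-recently-used ordering stands in for the unified cost model, which the abstract does not describe.

```python
class UnifiedCachePool:
    """Toy unified pool holding LoRA adapters and the KV caches that depend on them."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.clock = 0
        self.entries = {}  # key -> {"size": int, "parent": key or None, "last_used": int}

    def _touch(self, key):
        self.clock += 1
        self.entries[key]["last_used"] = self.clock

    def _has_resident_children(self, key):
        return any(entry["parent"] == key for entry in self.entries.values())

    def _evict_one(self, protected):
        # Only entries without resident dependents may leave the pool, so a
        # LoRA is never swapped out while a KV cache that needs it remains.
        candidates = [
            key for key in self.entries
            if key not in protected and not self._has_resident_children(key)
        ]
        if not candidates:
            raise MemoryError("no evictable entry")
        # ASSUMPTION: LRU replaces the paper's cost model here.
        victim = min(candidates, key=lambda k: self.entries[k]["last_used"])
        self.used -= self.entries.pop(victim)["size"]
        return victim

    def insert(self, key, size, parent=None):
        """Insert an entry, evicting as needed; returns the list of evicted keys."""
        if key in self.entries:
            self._touch(key)
            return []
        if parent is not None and parent not in self.entries:
            raise KeyError(f"dependency {parent!r} must be resident first")
        evicted = []
        while self.used + size > self.capacity:
            evicted.append(self._evict_one(protected={key, parent}))
        self.entries[key] = {"size": size, "parent": parent, "last_used": 0}
        self.used += size
        self._touch(key)
        return evicted


pool = UnifiedCachePool(capacity=10)
pool.insert("lora_a", 4)
pool.insert("kv_a1", 3, parent="lora_a")
pool.insert("lora_b", 3)
evicted = pool.insert("kv_b1", 3, parent="lora_b")
```

In this run, plain LRU would pick `lora_a` (the oldest entry) and leave `kv_a1` resident without its adapter. The dependency check evicts `kv_a1` instead.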