Improving Language Plasticity via Pretraining with Active Forgetting

Neural Information Processing Systems

Pretrained language models (PLMs) are today the primary models for natural language processing. Despite their impressive downstream performance, it can be difficult to apply PLMs to new languages, a barrier to making their capabilities universally accessible. While prior work has shown it is possible to address this issue by learning a new embedding layer for the new language, doing so is both data- and compute-inefficient. We propose to use an active forgetting mechanism during pretraining as a simple way of creating PLMs that can quickly adapt to new languages. Concretely, by resetting the embedding layer every K updates during pretraining, we encourage the PLM to improve its ability to learn new embeddings within a limited number of updates, similar to a meta-learning effect. Experiments with RoBERTa show that models pretrained with our forgetting mechanism not only demonstrate faster convergence during language adaptation, but also outperform standard ones in a low-data regime, particularly for languages that are distant from English.
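To make the mechanism concrete, the loop below resets the token embedding layer every K optimizer updates while pretraining otherwise proceeds as usual. This is a minimal PyTorch-style sketch, assuming a Hugging Face-style model that exposes get_input_embeddings(); the value of K, the initialization scale, and the training-loop names are illustrative, not the paper's exact setup.

```python
import torch.nn as nn

K = 1000  # reset interval in optimizer updates (illustrative hyperparameter)

def reset_embeddings(model: nn.Module) -> None:
    """Re-initialize the input embedding weights, leaving the transformer
    body untouched -- the core of the active-forgetting mechanism."""
    emb = model.get_input_embeddings()
    nn.init.normal_(emb.weight, mean=0.0, std=0.02)

def pretrain_with_active_forgetting(model, optimizer, dataloader, num_steps):
    model.train()
    step = 0
    while step < num_steps:
        for batch in dataloader:
            loss = model(**batch).loss  # e.g. masked-LM loss, as in RoBERTa
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            # Periodically wipe the embeddings so the transformer body learns
            # to cooperate with freshly initialized embeddings.
            if step % K == 0:
                reset_embeddings(model)
            if step >= num_steps:
                break
```

At adaptation time, only the embedding layer needs to be relearned for the new language, which is where the faster convergence reported above shows up.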


Persian-Phi: Efficient Cross-Lingual Adaptation of Compact LLMs via Curriculum Learning

Akhlaghi, Amir Mohammad, Shabani, Amirhossein, Abdolmaleki, Mostafa, Kheradpisheh, Saeed Reza

arXiv.org Artificial Intelligence

The democratization of AI is currently hindered by the immense computational costs required to train Large Language Models (LLMs) for low-resource languages. This paper presents Persian-Phi, a 3.8B-parameter model that challenges the assumption that robust multilingual capabilities require massive model sizes or multilingual baselines. We demonstrate how Microsoft Phi-3 Mini -- originally a monolingual English model -- can be effectively adapted to Persian through a novel, resource-efficient curriculum learning pipeline. Our approach employs a unique "warm-up" stage using bilingual narratives (Tiny Stories) to align embeddings prior to heavy training, followed by continual pretraining and instruction tuning via Parameter-Efficient Fine-Tuning (PEFT). Despite its compact size, Persian-Phi achieves competitive results on the Open Persian LLM Leaderboard on Hugging Face. Our findings provide a validated, scalable framework for extending the reach of state-of-the-art LLMs to underrepresented languages with minimal hardware resources. The Persian-Phi model is publicly available at https://huggingface.co/amirakhlaghiqqq/PersianPhi.
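As a hedged sketch of what the PEFT stage could look like, the snippet below wires LoRA adapters onto the Phi-3 Mini checkpoint with the Hugging Face transformers and peft libraries. The rank, alpha, dropout, and target modules are assumptions for illustration, not the authors' published configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

CHECKPOINT = "microsoft/Phi-3-mini-4k-instruct"  # public base model

base = AutoModelForCausalLM.from_pretrained(CHECKPOINT)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

lora_cfg = LoraConfig(
    r=16,                                   # low-rank update size (assumed)
    lora_alpha=32,
    target_modules=["qkv_proj", "o_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters train
```

The same adapter setup can then be reused across the curriculum stages described above: the bilingual "warm-up" on Tiny Stories, continual pretraining on Persian text, and instruction tuning.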


FastPOS: Language-Agnostic Scalable POS Tagging Framework for a Low-Resource Use Case

Kafi, Md Abdullah Al, Banshal, Sumit Kumar

arXiv.org Artificial Intelligence

This study proposes a language-agnostic, transformer-based POS tagging framework designed for low-resource languages, using Bangla and Hindi as case studies. With only three lines of framework-specific code, the model was adapted from Bangla to Hindi, demonstrating effective portability with minimal modification. The framework achieves 96.85 percent and 97 percent token-level accuracy across POS categories in Bangla and Hindi, respectively, while sustaining strong F1 scores despite dataset imbalance and linguistic overlap. A performance discrepancy in a specific POS category underscores ongoing challenges in dataset curation. The strong results stem from the underlying transformer architecture, which can be swapped out with limited code adjustments. Its modular, open-source design enables rapid cross-lingual adaptation while reducing model design and tuning overhead, allowing researchers to focus on linguistic preprocessing and dataset refinement, which are essential for advancing NLP in underrepresented languages.
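The "three lines" claim suggests that porting the framework is essentially a checkpoint-and-dataset swap. The sketch below illustrates that shape of adaptation using Hugging Face transformers; the checkpoint names and label count are assumptions for demonstration, not the paper's actual code.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

NUM_POS_TAGS = 17  # e.g. the Universal POS tag set (assumption)

def build_tagger(checkpoint: str):
    """Instantiate a transformer-based POS tagger for any language by
    swapping the underlying pretrained checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=NUM_POS_TAGS
    )
    return tokenizer, model

# Adapting from Bangla to Hindi reduces to changing the checkpoint (and the
# training data); the rest of the pipeline stays untouched.
bn_tokenizer, bn_model = build_tagger("sagorsarker/bangla-bert-base")
hi_tokenizer, hi_model = build_tagger("google/muril-base-cased")
```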


Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization

Hussain, Shehzeen, Neekhara, Paarth, Yang, Xuesong, Casanova, Edresson, Ghosh, Subhankar, Fejgin, Roy, Langman, Ryan, Desta, Mikyas, Tavabi, Leili, Li, Jason

arXiv.org Artificial Intelligence

Developing high-quality text-to-speech (TTS) systems for low-resource languages is challenging due to the scarcity of paired text and speech data. In contrast, automatic speech recognition (ASR) models for such languages are often more accessible, owing to large-scale multilingual pretraining efforts. We propose a framework based on Group Relative Policy Optimization (GRPO) to adapt an autoregressive, multilingual TTS model to new languages. Our method first establishes a language-agnostic foundation for TTS synthesis by training a multilingual baseline with International Phonetic Alphabet (IPA) tokens. Next, we fine-tune this model on limited paired data from the new languages to capture each target language's prosodic features. Finally, we apply GRPO to optimize the model using only unpaired text and speaker prompts, guided by a multi-objective reward from pretrained ASR, speaker verification, and audio quality estimation models. Experiments demonstrate that this pipeline produces intelligible and speaker-consistent speech in low-resource languages, substantially outperforming fine-tuning alone. Furthermore, our GRPO-based framework also improves TTS performance in high-resource languages, surpassing offline alignment methods such as Direct Preference Optimization (DPO) and yielding superior intelligibility, speaker similarity, and audio quality.
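One plausible reading of the multi-objective reward is a weighted combination of ASR intelligibility, speaker similarity, and estimated audio quality, with advantages standardized within each group of candidate generations as GRPO prescribes. The sketch below is hypothetical: the scorer interface, value ranges, and weights are placeholders standing in for the pretrained judge models, not the paper's implementation.

```python
import statistics
from dataclasses import dataclass

@dataclass
class RewardWeights:
    asr: float = 1.0      # intelligibility (lower ASR error -> higher reward)
    speaker: float = 0.5  # similarity to the speaker prompt
    quality: float = 0.5  # estimated audio quality

def tts_reward(audio, text, speaker_prompt, scorers, w=RewardWeights()):
    """Combine three pretrained judges into one scalar reward.
    `scorers` is a hypothetical wrapper around ASR, speaker-verification,
    and quality-estimation models."""
    cer = scorers.character_error_rate(audio, text)          # in [0, 1]
    sim = scorers.speaker_similarity(audio, speaker_prompt)  # in [-1, 1]
    mos = scorers.quality_mos(audio)                         # in [1, 5]
    return (w.asr * (1.0 - cer)
            + w.speaker * sim
            + w.quality * (mos - 1.0) / 4.0)  # MOS rescaled to [0, 1]

def grpo_advantages(rewards):
    """GRPO advantage: standardize rewards within a group of candidate
    generations for the same (text, speaker prompt) pair."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sd for r in rewards]
```

Because the reward needs only text and a speaker prompt, this stage can run on unpaired data, which is exactly what makes it attractive for low-resource languages.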