Machine Translation
Transformers for molecular property prediction: Domain adaptation efficiently improves performance
Sultan, Afnan, Rausch-Dupont, Max, Khan, Shahrukh, Kalinina, Olga, Volkamer, Andrea, Klakow, Dietrich
Most of the current transformer-based chemical language models are pre-trained on millions to billions of molecules. However, the improvement from such scaling in dataset size is not confidently linked to improved molecular property prediction. The aim of this study is to investigate and overcome some of the limitations of transformer models in predicting molecular properties. Specifically, we examine the impact of pre-training dataset size and diversity on the performance of transformer models and investigate the use of domain adaptation as a technique for improving model performance. First, our findings indicate that increasing pretraining dataset size beyond 400K molecules from the GuacaMol dataset does not result in a significant improvement on four ADME endpoints, namely, solubility, permeability, microsomal stability, and plasma protein binding. Second, our results demonstrate that using domain adaptation by further training the transformer model on a small set of domain-relevant molecules, i.e., a few hundred to a few thousand, using multi-task regression of physicochemical properties was sufficient to significantly improve performance for three out of the four investigated ADME endpoints (P-value < 0.001). Finally, we observe that a model pre-trained on 400K molecules and domain adopted on a few hundred/thousand molecules performs similarly (P-value > 0.05) to more complicated transformer models like MolBERT(pre-trained on 1.3M molecules) and MolFormer (pre-trained on 100M molecules). A comparison to a random forest model trained on basic physicochemical properties showed similar performance to the examined transformer models. We believe that current transformer models can be improved through further systematic analysis of pre-training and downstream data, pre-training objectives, and scaling laws, ultimately leading to better and more helpful models.
Large-Scale AI in Telecom: Charting the Roadmap for Innovation, Scalability, and Enhanced Digital Experiences
Shahid, Adnan, Kliks, Adrian, Al-Tahmeesschi, Ahmed, Elbakary, Ahmed, Nikou, Alexandros, Maatouk, Ali, Mokh, Ali, Kazemi, Amirreza, De Domenico, Antonio, Karapantelakis, Athanasios, Cheng, Bo, Yang, Bo, Wang, Bohao, Fischione, Carlo, Zhang, Chao, Issaid, Chaouki Ben, Yuen, Chau, Peng, Chenghui, Huang, Chongwen, Chaccour, Christina, Thomas, Christo Kurisummoottil, Sharma, Dheeraj, Kalogiros, Dimitris, Niyato, Dusit, De Poorter, Eli, Mhanna, Elissa, Strinati, Emilio Calvanese, Bader, Faouzi, Abdeldayem, Fathi, Wang, Fei, Zhu, Fenghao, Fontanesi, Gianluca, Geraci, Giovanni, Zhou, Haibo, Purmehdi, Hakimeh, Ahmadi, Hamed, Zou, Hang, Du, Hongyang, Lee, Hoon, Yang, Howard H., Poli, Iacopo, Carron, Igor, Chatzistefanidis, Ilias, Lee, Inkyu, Pitsiorlas, Ioannis, Fontaine, Jaron, Wu, Jiajun, Zeng, Jie, Li, Jinan, Karam, Jinane, Gemayel, Johny, Deng, Juan, Frison, Julien, Huang, Kaibin, Qiu, Kehai, Ball, Keith, Wang, Kezhi, Guo, Kun, Tassiulas, Leandros, Gwenole, Lecorve, Yue, Liexiang, Bariah, Lina, Powell, Louis, Dryjanski, Marcin, Galdon, Maria Amparo Canaveras, Kountouris, Marios, Hafeez, Maryam, Elkael, Maxime, Bennis, Mehdi, Boudjelli, Mehdi, Dai, Meiling, Debbah, Merouane, Polese, Michele, Assaad, Mohamad, Benzaghta, Mohamed, Refai, Mohammad Al, Djerrab, Moussab, Syed, Mubeen, Amir, Muhammad, Yan, Na, Alkaabi, Najla, Li, Nan, Sehad, Nassim, Nikaein, Navid, Hashash, Omar, Sroka, Pawel, Yang, Qianqian, Zhao, Qiyang, Silab, Rasoul Nikbakht, Ying, Rex, Morabito, Roberto, Li, Rongpeng, Madi, Ryad, Ayoubi, Salah Eddine El, D'Oro, Salvatore, Lasaulce, Samson, Shalmashi, Serveh, Liu, Sige, Cherrared, Sihem, Chetty, Swarna Bindu, Dutta, Swastika, Zaidi, Syed A. R., Chen, Tianjiao, Murphy, Timothy, Melodia, Tommaso, Quek, Tony Q. S., Ram, Vishnu, Saad, Walid, Hamidouche, Wassim, Chen, Weilong, Liu, Xiaoou, Yu, Xiaoxue, Wang, Xijun, Shang, Xingyu, Wang, Xinquan, Cao, Xuelin, Su, Yang, Liang, Yanping, Deng, Yansha, Yang, Yifan, Cui, Yingping, Sun, Yu, Chen, Yuxuan, Pointurier, Yvan, Nehme, Zeinab, Nezami, Zeinab, Yang, Zhaohui, Zhang, Zhaoyang, Liu, Zhe, Yang, Zhenyu, Han, Zhu, Zhou, Zhuang, Chen, Zihan, Chen, Zirui, Shuai, Zitao
The rise of generative artificial intelligence (AI) as a novel frontier that uniquely merges advanced levels of intelligence with revolutionary user experiences is redefining the AI landscape for future cellular networks. In particular, the transition towards 6G systems has introduced a myriad of challenges inherent to their AI-native network design, requiring innovative solutions to enable real-time network orchestration, intelligent decision-making, and adaptive dynamic configurations. Meanwhile, the envisioned user experiences for 6G are growing increasingly complex, exceeding the capabilities offered by vintage wireless technologies and conventional AI solutions to satisfy their advanced demands. With its disruptive impact evident across diverse fields, generative AI possesses immense potential to tackle these challenges, leveraging its exceptional capabilities to manage complex tasks, operate autonomously, and adapt seamlessly to scenarios beyond its training domain. Remarkably, generative AI provides a transformative opportunity for telecom and cellular networks to bridge this defined gap in 6G systems, thereby shifting towards a new era with cutting-edge AI innovations across the different system and user levels.
Assumed Identities: Quantifying Gender Bias in Machine Translation of Ambiguous Occupational Terms
Mastromichalakis, Orfeas Menis, Filandrianos, Giorgos, Symeonaki, Maria, Stamou, Giorgos
Machine Translation (MT) systems frequently encounter ambiguous scenarios where they must assign gender to certain occupations when translating without explicit guidance or contextual cues. While individual translations in such cases may not be inherently biased, systematic patterns-such as the repeated association of certain professions with specific genders-can emerge, reflecting and perpetuating societal stereotypes. This ambiguity challenges traditional instance-level single-answer evaluation approaches, as no single gold standard translation exists. To address this, we propose an approach that evaluates gender bias through aggregated model responses. Specifically, we introduce a methodology to detect gender imbalances between source texts and translations, a benchmarking dataset with ambiguous English inputs, and probability-based metrics to quantify a model's divergence from normative standards or reference distributions.
Leveraging Domain Knowledge at Inference Time for LLM Translation: Retrieval versus Generation
Li, Bryan, Luo, Jiaming, Briakou, Eleftheria, Cherry, Colin
While large language models (LLMs) have been increasingly adopted for machine translation (MT), their performance for specialist domains such as medicine and law remains an open challenge. Prior work has shown that LLMs can be domain-adapted at test-time by retrieving targeted few-shot demonstrations or terminologies for inclusion in the prompt. Meanwhile, for general-purpose LLM MT, recent studies have found some success in generating similarly useful domain knowledge from an LLM itself, prior to translation. Our work studies domain-adapted MT with LLMs through a careful prompting setup, finding that demonstrations consistently outperform terminology, and retrieval consistently outperforms generation. We find that generating demonstrations with weaker models can close the gap with larger model's zero-shot performance. Given the effectiveness of demonstrations, we perform detailed analyses to understand their value. We find that domain-specificity is particularly important, and that the popular multi-domain benchmark is testing adaptation to a particular writing style more so than to a specific domain.
Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation
Zebaze, Armel, Sagot, Benoît, Bawden, Rachel
The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. Machine Translation (MT) has been shown to benefit from in-context examples, in particular when they are semantically similar to the sentence to translate. In this paper, we propose a new LLM-based translation paradigm, compositional translation, to replace naive few-shot MT with similarity-based demonstrations. An LLM is used to decompose a sentence into simpler phrases, and then to translate each phrase with the help of retrieved demonstrations. Finally, the LLM is prompted to translate the initial sentence with the help of the self-generated phrase-translation pairs. Our intuition is that this approach should improve translation because these shorter phrases should be intrinsically easier to translate and easier to match with relevant examples. This is especially beneficial in low-resource scenarios, and more generally whenever the selection pool is small or out of domain. We show that compositional translation boosts LLM translation performance on a wide range of popular MT benchmarks, including FLORES 200, NTREX 128 and TICO-19. Code and outputs are available at https://github.com/ArmelRandy/compositional-translation
Comparative Study of Zero-Shot Cross-Lingual Transfer for Bodo POS and NER Tagging Using Gemini 2.0 Flash Thinking Experimental Model
Narzary, Sanjib, Brahma, Bihung, Mahilary, Haradip, Brahma, Mahananda, Som, Bidisha, Nandi, Sukumar
Part-of-Speech (POS) tagging and Named Entity Recognition (NER) are fundamental tasks within the field of Natural Language Processing (NLP), serving as essential prerequisites for a multitude of downstream applications. POS tagging, the process of assigning grammatical categories to individual words within a sentence (e.g., noun, verb, adjective, adverb), provides crucial syntactic information that underpins higher-level language understanding. NER, on the contrary, focuses on identifying and classifying named entities - real-world objects that are designated with a proper name - into predefined semantic categories such as persons, organizations, locations, dates, times, and quantities [1, 2]. The synergy of POS and NER tagging empowers a wide spectrum of NLP applications. In information extraction, NER helps to pinpoint key entities, while POS tags help to understand the relationships between these entities and other words in the text, facilitating the extraction of structured information from unstructured text [3]. Machine translation systems benefit from POS tagging to improve syntactic analysis and word order prediction, and NER to ensure accurate translation of named entities in languages [4]. Question-answer systems rely on both NER and POS to understand the question's intent, identify relevant entities and relationships in the knowledge base, and formulate accurate answers. Text summarization algorithms leverage NER to identify salient entities and POS tags to preserve grammatical coherence and readability in summaries.
NaijaNLP: A Survey of Nigerian Low-Resource Languages
With over 500 languages in Nigeria, three languages -- Hausa, Yor\`ub\'a and Igbo -- spoken by over 175 million people, account for about 60% of the spoken languages. However, these languages are categorised as low-resource due to insufficient resources to support tasks in computational linguistics. Several research efforts and initiatives have been presented, however, a coherent understanding of the state of Natural Language Processing (NLP) - from grammatical formalisation to linguistic resources that support complex tasks such as language understanding and generation is lacking. This study presents the first comprehensive review of advancements in low-resource NLP (LR-NLP) research across the three major Nigerian languages (NaijaNLP). We quantitatively assess the available linguistic resources and identify key challenges. Although a growing body of literature addresses various NLP downstream tasks in Hausa, Igbo, and Yor\`ub\'a, only about 25.1% of the reviewed studies contribute new linguistic resources. This finding highlights a persistent reliance on repurposing existing data rather than generating novel, high-quality resources. Additionally, language-specific challenges, such as the accurate representation of diacritics, remain under-explored. To advance NaijaNLP and LR-NLP more broadly, we emphasise the need for intensified efforts in resource enrichment, comprehensive annotation, and the development of open collaborative initiatives.
The Serendipity of Claude AI: Case of the 13 Low-Resource National Languages of Mali
Dembele, Alou, Coulibaly, Nouhoum Souleymane, Leventhal, Michael
However, most of the world's languages, often referred to as "low-resource languages", still remain either not supported or insufficiently supported due to the limited availability of data and language resources, and market, economic, and global inequality factors. Mali, a multilingual country with 13 official languages, including Bamanankan (Bambara), Bomu, Bozo, Dɔgɔsɔ (Dogon), Fulfulde (Fula), Hassaniya Arabic, Mamara (Minyanka), Maninka, Soninke, Sɔõɔy (Songhay), Senara, Tàmàsàyt (Tamasheq) and Xaasongaxanno (Kassonke), faces severe challenges in digital inclusion limiting economic development, educational advancement, and preservation of cultural heritage (Bird, 2020; Nekoto et al., 2020). These languages share in common a penury of language resources needed to train AI and NLP systems which could play a role in lessening the digital divide (Hammarström et al., 2018). This penury extends from severe in the case of a language like Bambara which has very limited resources to catastrophic for languages like Bomu and Bozo with an almost complete absence of language resources. The need for innovative methods for low-resource languages has spawned varied strategies, such as transfer learning, zero-shot learning, and pre-trained models in related languages (Ruder, 2021; Adelani et al., 2022).
The Box is in the Pen: Evaluating Commonsense Reasoning in Neural Machine Translation
He, Jie, Wang, Tao, Xiong, Deyi, Liu, Qun
Does neural machine translation yield translations that are congenial with common sense? In this paper, we present a test suite to evaluate the commonsense reasoning capability of neural machine translation. The test suite consists of three test sets, covering lexical and contextless/contextual syntactic ambiguity that requires commonsense knowledge to resolve. We manually create 1,200 triples, each of which contain a source sentence and two contrastive translations, involving 7 different common sense types. Language models pretrained on large-scale corpora, such as BERT, GPT-2, achieve a commonsense reasoning accuracy of lower than 72% on target translations of this test suite. We conduct extensive experiments on the test suite to evaluate commonsense reasoning in neural machine translation and investigate factors that have impact on this capability. Our experiments and analyses demonstrate that neural machine translation performs poorly on commonsense reasoning of the three ambiguity types in terms of both reasoning accuracy (60.1%) and reasoning consistency (31%). The built commonsense test suite is available at https://github.com/tjunlp-lab/CommonMT.
FourierNAT: A Fourier-Mixing-Based Non-Autoregressive Transformer for Parallel Sequence Generation
Kiruluta, Andrew, Lundy, Eric, Lemos, Andreas
We present FourierNAT, a novel non-autoregressive Transformer (NAT) architecture that employs Fourier-based mixing in the decoder to generate output sequences in parallel. While traditional NAT approaches often face challenges with capturing global dependencies, our method leverages a discrete Fourier transform to mix token embeddings across the entire sequence dimension, coupled with learned frequency-domain gating. This allows the model to efficiently propagate context without explicit autoregressive steps. Empirically, FourierNAT achieves competitive results against leading NAT baselines on standard benchmarks like WMT machine translation and CNN/DailyMail summarization, providing significant speed advantages over autoregressive Transformers. We further demonstrate that learned frequency-domain parameters allow the model to adaptively focus on long-range or short-range dependencies, partially mitigating the well-known coherence gaps in one-pass NAT generation. Overall, FourierNAT highlights the potential of integrating spectral-domain operations to accelerate and improve parallel text generation. This approach can potentially provide great computational and time savings in inference tasks LLMs.