AITopics | cross-lingual retrieval

Appendix of Modeling

Neural Information Processing Systems | Apr-25-2026, 13:47:11 GMT

To create a passage representation, the passage title and text are concatenated ([CLS] title [SEP] passage [SEP]), following common practice (Karpukhin et al., 2020). We retrieve the top 10 passages and use them as input to mGEN. We differentiate those paragraphs from the question using special tokens (

vs. He graduated with a B.S. degree in Biology in 1957. As in the case of machine translation, we found that the language code does not need to be specified during inference as our model learns the question language automatically. Yet, we found that training with language codes is particularly useful to augment training data for Ltarget without any question data in Ltarget.
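The passage format and top-10 selection described in this snippet can be sketched as follows. This is a plain-string illustration only: the function names `format_passage` and `top_k` are hypothetical, and in a real system the tokenizer would normally insert the special tokens itself.

```python
def format_passage(title: str, text: str, cls: str = "[CLS]", sep: str = "[SEP]") -> str:
    """Concatenate title and passage text as: [CLS] title [SEP] passage [SEP]."""
    return f"{cls} {title.strip()} {sep} {text.strip()} {sep}"


def top_k(scored_passages, k=10):
    """Keep the k highest-scoring (score, passage) pairs, best first."""
    return sorted(scored_passages, key=lambda pair: pair[0], reverse=True)[:k]


if __name__ == "__main__":
    # Toy retrieval scores; the scores and passages here are made up for illustration.
    candidates = [(0.2, format_passage("Title A", "First passage.")),
                  (0.7, format_passage("Title B", "Second passage."))]
    for score, passage in top_k(candidates, k=10):
        print(score, passage)
```

The retrieved passages would then be passed to the generator together with the question; how they are joined is not shown in the snippet, so it is left out here.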

artificial intelligence, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country: North America > United States > New York (0.14)

Industry:

Leisure & Entertainment (0.93)
Media > Film (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.89)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.34)


3df07fdae1ab273a967aaa1d355b8bb6-Paper.pdf

Neural Information Processing Systems | Apr-25-2026, 13:47:08 GMT

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: Asia (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications (0.70)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)


3df07fdae1ab273a967aaa1d355b8bb6-Paper.pdf

Neural Information Processing Systems | Feb-8-2026, 08:06:02 GMT

dataset, target language, training data, (16 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Hong Kong (0.04)
Asia > British Indian Ocean Territory > Diego Garcia (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.66)


Cross-lingual Retrieval for Iterative Self-Supervised Training (supplementary materials) 1 Experiment details

Neural Information Processing Systems | Feb-7-2026, 14:57:07 GMT

Because of the file size limit, we will release the source code and pretrained checkpoints after the anonymity period. To make a fair comparison, we followed the same preprocessing steps as described in [13]. In each iteration, we mine all 90 language pairs in parallel, using 8 GPUs for each pair, with each pair taking about 15-30 hours to finish. We lightly tune the margin score threshold using validation BLEU (using a threshold score between 1.04 and 1.07). For all experiments, we use a Transformer with 12 encoder layers and 12 decoder layers, with a model dimension of 1024 and 16 heads (680M parameters). We trained for a maximum of 20,000 steps using label-smoothed cross-entropy loss with 0.2 label smoothing, 0.3
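To make the role of the margin score threshold concrete, here is a minimal sketch of margin-based pair mining. It assumes the ratio-margin criterion commonly used for bitext mining (cosine similarity divided by the average similarity to each side's k nearest neighbours); the snippet above does not spell out the scoring function, so the formula, the function names, and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_top_k(sims, k):
    """Average of the k largest similarities."""
    top = sorted(sims, reverse=True)[:k]
    return sum(top) / len(top)


def mine_pairs(src_vecs, tgt_vecs, k=2, threshold=1.05):
    """Return (src_index, tgt_index, score) for best matches whose margin score passes the threshold."""
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    src_avg = [mean_top_k(row, k) for row in sim]
    tgt_avg = [mean_top_k([sim[i][j] for i in range(len(src_vecs))], k)
               for j in range(len(tgt_vecs))]

    def margin(i, j):
        # Ratio margin: similarity relative to the neighbourhood average of both sides.
        return sim[i][j] / ((src_avg[i] + tgt_avg[j]) / 2)

    mined = []
    for i in range(len(src_vecs)):
        j = max(range(len(tgt_vecs)), key=lambda cand: margin(i, cand))
        score = margin(i, j)
        if score >= threshold:
            mined.append((i, j, score))
    return mined


if __name__ == "__main__":
    src = [[1.0, 0.0], [0.0, 1.0]]   # toy "encoder outputs", not real embeddings
    tgt = [[1.0, 0.1], [0.1, 1.0]]
    print(mine_pairs(src, tgt, k=2, threshold=1.05))
```

Raising `threshold` keeps fewer but cleaner pairs, which is the trade-off that tuning against validation BLEU would explore. The sketch is brute force and does not guard against zero vectors or non-positive neighbourhood averages; mining at scale would use an approximate nearest-neighbour index.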

artificial intelligence, machine translation, natural language, (12 more...)

Neural Information Processing Systems

Country:

Europe > Bulgaria > Sofia City Province > Sofia (0.05)
Europe > Belgium (0.05)
Asia > Middle East > Saudi Arabia > Northern Borders Province > Arar (0.05)
Asia > China > Hong Kong (0.05)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)


Cross-lingual Retrieval for Iterative Self-Supervised Training

Neural Information Processing Systems | Dec-23-2025, 19:26:05 GMT

Recent studies have demonstrated the cross-lingual alignment ability of multilingual pretrained language models. In this work, we found that the cross-lingual alignment can be further improved by training seq2seq models on sentence pairs mined using their own encoder outputs. We utilized these findings to develop a new approach --- cross-lingual retrieval for iterative self-supervised training (CRISS), where mining and training processes are applied iteratively, improving cross-lingual alignment and translation ability at the same time. Using this method, we achieved state-of-the-art unsupervised machine translation results on 9 language directions with an average improvement of 2.4 BLEU, and on the Tatoeba sentence retrieval task in the XTREME benchmark on 16 languages with an average improvement of 21.5% in absolute accuracy. Furthermore, CRISS also brings an additional 1.8 BLEU improvement on average compared to mBART, when finetuned on supervised machine translation downstream tasks.
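The sentence retrieval accuracy mentioned in this abstract can be illustrated with a small sketch. It assumes a Tatoeba-style setup in which source sentence i is a translation of target sentence i and retrieval is nearest neighbour by cosine similarity; the function names and toy vectors are hypothetical, and the benchmark's actual evaluation script may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieval_accuracy(src_vecs, tgt_vecs):
    """Fraction of source sentences whose nearest target is the gold one at the same index."""
    correct = 0
    for i, s in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(s, tgt_vecs[j]))
        if best == i:
            correct += 1
    return correct / len(src_vecs)


if __name__ == "__main__":
    src = [[1.0, 0.0], [0.0, 1.0]]   # toy sentence embeddings
    tgt = [[1.0, 0.1], [0.1, 1.0]]
    print(retrieval_accuracy(src, tgt))
```

Better cross-lingual alignment of the encoder outputs shows up directly as a higher value of this metric, which is why it is reported alongside BLEU.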

cross-lingual retrieval, iterative self-supervised training, name change, (3 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.85)


What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models

Goworek, Roksana, Macmillan-Scott, Olivia, Özyiğit, Eda B.

arXiv.org Artificial Intelligence | Nov-25-2025

Cross-lingual information retrieval (CLIR) enables access to multilingual knowledge but remains challenging due to disparities in resources, scripts, and weak cross-lingual semantic alignment in embedding models. Existing pipelines often rely on translation and monolingual retrieval heuristics, which add computational overhead and noise, degrading performance. This work systematically evaluates four intervention types, namely document translation, multilingual dense retrieval with pretrained encoders, contrastive learning at word, phrase, and query-document levels, and cross-encoder re-ranking, across three benchmark datasets. We find that dense retrieval models trained specifically for CLIR consistently outperform lexical matching methods and derive little benefit from document translation. Contrastive learning mitigates language biases and yields substantial improvements for encoders with weak initial alignment, and re-ranking can be effective, but depends on the quality of the cross-encoder training data. Although high-resource languages still dominate overall performance, gains over lexical and document-translated baselines are most pronounced for low-resource and cross-script pairs. These findings indicate that cross-lingual search systems should prioritise semantic multilingual embeddings and targeted learning-based alignment over translation-based pipelines, particularly for cross-script and under-resourced languages.
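As an illustration of contrastive learning at the query-document level, the sketch below computes an in-batch InfoNCE-style loss from a similarity matrix. The abstract does not state the exact objective used, so the loss form, the temperature value, and the function name `info_nce` are assumptions chosen because they are common in dense retrieval training.

```python
import math


def info_nce(sim_matrix, temperature=0.05):
    """Mean in-batch contrastive loss.

    sim_matrix[i][j] is the similarity between query i and document j;
    the matching (positive) document for query i is assumed to be document i.
    """
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-probability of the positive
    return total / len(sim_matrix)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]      # queries closest to their own documents
    misaligned = [[0.0, 1.0], [1.0, 0.0]]   # queries closest to the wrong documents
    print(info_nce(aligned), info_nce(misaligned))
```

The loss is small when each query is most similar to its own document and large otherwise, so minimising it pulls translated or cross-script query-document pairs together in the embedding space.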

information retrieval, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2511.19324

Country:

Asia (1.00)
Europe > United Kingdom > England (0.46)
North America > United States > Minnesota (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)


Cross-lingual Retrieval for Iterative Self-Supervised Training (supplementary materials) 1 Experiment details

Neural Information Processing Systems | Oct-2-2025, 06:16:51 GMT

In this section, we describe our experimental procedures in more detail, including hyperparameters and intermediate results. For the unsupervised machine translation task, we evaluate BLEU scores using multi-bleu.perl

artificial intelligence, machine translation, natural language, (14 more...)

Neural Information Processing Systems

Country:

Europe > Bulgaria (0.14)
Asia > China (0.14)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)


Milco: Learned Sparse Retrieval Across Languages via a Multilingual Connector

Nguyen, Thong, Lei, Yibin, Ju, Jia-Huei, Yang, Eugene, Yates, Andrew

arXiv.org Artificial Intelligence | Oct-2-2025

Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching, but existing approaches struggle to scale beyond English. We introduce MILCO, an LSR architecture that maps queries and documents from different languages into a shared English lexical space via a multilingual connector. MILCO is trained with a specialized two-stage regime that combines Sparse Alignment Pretraining with contrastive training to provide representation transparency and effectiveness while mitigating semantic collapse. MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed on standard multilingual benchmarks, while supporting dynamic efficiency through post-hoc pruning. Notably, when using mass-based pruning to reduce document representations to only 30 active dimensions on average, MILCO 560M outperforms the similarly-sized Qwen3-Embed 0.6B with 1024 dimensions. Learned Sparse Retrieval (LSR) (MacAvaney et al., 2020; Formal et al., 2021; Nguyen et al., 2023) represents queries and documents as sparse lexical embeddings and retains the scalability benefits of bi-encoders. Unlike dense methods, LSR aligns representation with a natural language vocabulary, yielding transparent representations that facilitate error tracing and bias inspection. LSR naturally supports dynamic post-hoc pruning at inference time (Bruch et al., 2024), providing Matryoshka-like latency control (Kusupati et al., 2022) without requiring auxiliary training objectives. Empirically, LSR (Lassance et al., 2024; Lei et al., 2025) is competitive on benchmarks like BEIR (Thakur et al., 2021) and MTEB (Enevoldsen et al., 2025).
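Post-hoc pruning of a sparse lexical representation can be sketched as below. The snippet does not define "mass-based pruning", so this example assumes one plausible reading: keep the highest-weight terms until they cover a chosen fraction of the total weight. The function names, the `mass` parameter, and the toy term weights are hypothetical.

```python
def prune_by_mass(weights: dict, mass: float = 0.9) -> dict:
    """Keep the highest-weight terms until they cover `mass` of the total weight."""
    total = sum(weights.values())
    if total <= 0:
        return {}
    kept, covered = {}, 0.0
    for term, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= mass * total:
            break
        kept[term] = w
        covered += w
    return kept


def sparse_dot(query: dict, doc: dict) -> float:
    """Relevance score: dot product over the terms the query and document share."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)


if __name__ == "__main__":
    # Toy document representation over an English vocabulary (weights are made up).
    doc = {"bank": 4.0, "river": 3.0, "water": 2.0, "the": 1.0}
    pruned = prune_by_mass(doc, mass=0.5)
    print(pruned, sparse_dot({"river": 1.0, "boat": 2.0}, pruned))
```

Because the representation is a term-to-weight map, pruning needs no retraining: lowering `mass` shrinks the index and speeds up scoring at some cost in recall, which is the latency control the abstract refers to.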

large language model, machine learning, milco, (17 more...)

arXiv.org Artificial Intelligence

2510.00671

Genre: Research Report (0.50)

Industry: Law (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)


Review for NeurIPS paper: Cross-lingual Retrieval for Iterative Self-Supervised Training

Neural Information Processing Systems | Jan-22-2025, 02:01:45 GMT

The paper proposes a novel approach for unsupervised parallel corpus mining and unsupervised machine translation, improving on the SoTA on both tasks by significant margins. Experiments are conducted on the Tatoeba retrieval task and a 25-language translation task based on a combination of a few academic benchmark datasets. Careful experiments demonstrate how using parallel data from just one language pair significantly improves the cross-lingual embedding alignment in a multilingual de-noising auto-encoder. All reviewers support acceptance, as does the AC. Please make sure to incorporate the clarifications from the author response in the final version of the paper.

cross-lingual retrieval, iterative self-supervised training, neurips paper

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.73)


Cross-lingual Retrieval for Iterative Self-Supervised Training

Neural Information Processing Systems | Oct-9-2024, 15:14:25 GMT

Recent studies have demonstrated the cross-lingual alignment ability of multilingual pretrained language models. In this work, we found that the cross-lingual alignment can be further improved by training seq2seq models on sentence pairs mined using their own encoder outputs. We utilized these findings to develop a new approach --- cross-lingual retrieval for iterative self-supervised training (CRISS), where mining and training processes are applied iteratively, improving cross-lingual alignment and translation ability at the same time. Using this method, we achieved state-of-the-art unsupervised machine translation results on 9 language directions with an average improvement of 2.4 BLEU, and on the Tatoeba sentence retrieval task in the XTREME benchmark on 16 languages with an average improvement of 21.5% in absolute accuracy. Furthermore, CRISS also brings an additional 1.8 BLEU improvement on average compared to mBART, when finetuned on supervised machine translation downstream tasks.

average improvement, cross-lingual retrieval, iterative self-supervised training

Neural Information Processing Systems

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.91)

