 Machine Translation



Language Models are Few-Shot Learners

Neural Information Processing Systems

Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting.


Appendix of Prophet Attention

Neural Information Processing Systems

CIDEr-c40, which is the default ranking score in the leaderboard, and rank 1st. Compared with image captioning, the target of video captioning is the video clip, i.e., an ordered The dataset contains 10,000 video clips, and each video is paired with 20 annotated sentences. We use the official splits to report our results. CIDEr, which is built on n-gram matching, is used in our tests for performance evaluation. All re-implementations and our experiments were run on V100 GPUs.




proposed idea to be impactful (all reviewers), clear (all reviewers), novel (R1,R2), principled (R3,R4) and applicable to

Neural Information Processing Systems

We thank all reviewers for their thorough reviews and insightful feedback! We will incorporate all suggested improvements in the final version. We did not compare to Zhang et al. (2019) because (1) our method is independent of We missed Zhang et al. (2020) since it was published at ACL '20, which was one month after our But we will include both, along with the relevant multilingual MT references within it, in the final version. It is the standard error after running with different seeds. In Table 4, we compared 12/100 (24.16 BLEU) to 12/24 (23.7 BLEU) to isolate the effect of increased encoder depth.
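
The response above reports a standard error over runs with different seeds. As a rough illustration only, the usual standard error of the mean (sample standard deviation divided by the square root of the number of runs) can be sketched as follows. The function name `standard_error` and the BLEU values are hypothetical and are not taken from the paper; the authors may have computed their figure differently.

```python
import math
import statistics


def standard_error(scores):
    """Standard error of the mean: sample std dev divided by sqrt(n)."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return statistics.stdev(scores) / math.sqrt(len(scores))


# Hypothetical BLEU scores from runs with different random seeds
# (illustrative values only, not results from the paper).
bleu_by_seed = [23.5, 23.9, 24.1, 23.7]

mean_bleu = statistics.mean(bleu_by_seed)
sem = standard_error(bleu_by_seed)
print(f"{mean_bleu:.2f} +/- {sem:.2f} BLEU")
```

With only a handful of seeds the estimate is itself noisy, so small differences between systems should be read with that in mind.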



TASER: Translation Assessment via Systematic Evaluation and Reasoning

arXiv.org Artificial Intelligence

We introduce TASER (Translation Assessment via Systematic Evaluation and Reasoning), a metric that uses Large Reasoning Models (LRMs) for automated translation quality assessment. TASER harnesses the explicit reasoning capabilities of LRMs to conduct systematic, step-by-step evaluation of translation quality. We evaluate TASER on the WMT24 Metrics Shared Task across both reference-based and reference-free scenarios, demonstrating state-of-the-art performance. In system-level evaluation, TASER achieves the highest soft pairwise accuracy in both reference-based and reference-free settings, outperforming all existing metrics. At the segment level, TASER maintains competitive performance with our reference-free variant ranking as the top-performing metric among all reference-free approaches. Our experiments reveal that structured prompting templates yield superior results with LRMs compared to the open-ended approaches that proved optimal for traditional LLMs. We evaluate o3, a large reasoning model from OpenAI, with varying reasoning efforts, providing insights into the relationship between reasoning depth and evaluation quality. The explicit reasoning process in LRMs offers interpretability and visibility, addressing a key limitation of existing automated metrics. Our results demonstrate that Large Reasoning Models show a measurable advancement in translation quality assessment, combining improved accuracy with transparent evaluation across diverse language pairs.


Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data

arXiv.org Artificial Intelligence

Conventional retrieval-augmented neural machine translation (RANMT) systems leverage bilingual corpora, e.g., translation memories (TMs). Yet, in many settings, monolingual corpora in the target language are also available. This work explores ways to take advantage of such resources by directly retrieving relevant target-language segments, based on a source-side query. For this, we design improved cross-lingual retrieval systems, trained with both sentence-level and word-level matching objectives. In our experiments with three RANMT architectures, we assess such cross-lingual objectives in a controlled setting, reaching performance that matches that of standard TM-based models. We also showcase our method in a real-world setting, using much larger monolingual corpora, and observe strong improvements over both the baseline setting and general-purpose cross-lingual retrievers.


06964dce9addb1c5cb5d6e3d9838f733-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their feedback. We will reflect the reviewers' comments and our responses in the revision. Reviewers raised concerns about the novelty and the accuracy. DA is more effective when the task is more challenging. On the other hand, we find DA effective as well when the amount of labeled data is small.