wmt
1325cdae3b6f0f91a1b629307bf2d498-Supplemental.pdf
C.1 Dataset description For the WMT'16 English-German experiment, we used the same preprocessed data provided by [31], including the same validation (newstest2013) and test (newstest2014) splits. The data volume for the train, validation, and test splits is 4,500,966, 3,000, and 3,003 sentence pairs respectively. When using LayerDrop we use a 50% dropout probability. Similarly, we use beam search with beam size 5 and length penalty 1.0 for decoding. First, we show that adding the auxiliary loss L_K discretizes the samples and achieves the pruning purpose by enforcing sparsity of the resulting model.
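The snippet above only gives the decoding hyperparameters (beam size 5, length penalty 1.0), not the decoding stack itself. As a minimal sketch of how a length penalty interacts with beam scores, the code below assumes the common GNMT-style penalty ((5 + |Y|) / 6)^alpha, which is one frequently used formulation and not necessarily the one used in that work; `length_penalty` and `rescore_beam` are hypothetical helper names.

```python
def length_penalty(length: int, alpha: float = 1.0) -> float:
    # Assumed GNMT-style length penalty: ((5 + |Y|) / 6) ** alpha.
    # With alpha = 1.0, longer hypotheses are penalized roughly linearly.
    return ((5.0 + length) / 6.0) ** alpha

def rescore_beam(hypotheses, alpha: float = 1.0):
    # hypotheses: list of (tokens, sum_log_prob) pairs from the beam.
    # Returns hypotheses sorted by length-normalized score, best first.
    scored = [
        (tokens, logp / length_penalty(len(tokens), alpha))
        for tokens, logp in hypotheses
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy usage: a short hypothesis vs. a longer one from a beam of size 5.
beam = [
    (["das", "ist", "gut"], -2.1),
    (["das", "ist", "sehr", "gut", "."], -3.0),
]
print(rescore_beam(beam, alpha=1.0))
```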
LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging
Lee, Seungeon, Das, Soumi, Gupta, Manish, Gummadi, Krishna P.
Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models. However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs to improve performance on diverse tasks, but they usually require labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters to identify the most relevant adapters and determine their contributions on-the-fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks by a margin of up to 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.
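The abstract does not specify which forward-pass signal LoGo actually uses. The sketch below assumes, purely for illustration, a confidence signal (negative predictive entropy of each adapter's output distribution) and a softmax-weighted merge of the selected adapters' low-rank updates; `relevance_from_logits` and `select_and_merge` are hypothetical helpers, not the paper's recipe.

```python
import numpy as np

def relevance_from_logits(logits: np.ndarray) -> float:
    # Assumed per-instance signal: negative predictive entropy of the
    # adapter's output distribution (higher = more confident/relevant).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(np.sum(probs * np.log(probs + 1e-12)))  # = -entropy

def select_and_merge(adapter_logits: dict, adapter_deltas: dict, top_k: int = 2):
    # adapter_logits: name -> logits from one forward pass with that adapter.
    # adapter_deltas: name -> low-rank weight update (B @ A) of that adapter.
    # Returns a single merged weight update for this input instance.
    scores = {name: relevance_from_logits(l) for name, l in adapter_logits.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    # Softmax over the selected adapters' scores gives merging coefficients.
    raw = np.array([scores[n] for n in top])
    weights = np.exp(raw - raw.max())
    weights /= weights.sum()
    return sum(w * adapter_deltas[n] for w, n in zip(weights, top))

# Toy usage: three task adapters, one input instance.
rng = np.random.default_rng(0)
logits = {t: rng.normal(size=32) for t in ("qa", "nli", "summ")}
deltas = {t: rng.normal(size=(8, 8)) for t in ("qa", "nli", "summ")}
print(select_and_merge(logits, deltas, top_k=2).shape)  # (8, 8)
```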
A of Main Results
B.1 Additional Variants We also conducted ablations on several variants of GAPX. Specifically, GAPX(neg-log) modifies Eqn. 5 and Eqn. C.2 Interpreting the Results Figure 6: An example from QQP illustrating how to interpret the result of our method, by OODP. For GAP, we can use the score defined in Eqn. 4, split on each word, namely: In all three models, higher scores represent a higher chance of being non-paraphrases. For GAP, the threshold is 0, while for OODP the threshold is 3. Its reliance on the word 'a' might be due to the error. The metrics are calculated as follows: 1. We use the RoBERTa model described in Section 4.2
Don't Sweat the Small Stuff: Segment-Level Meta-Evaluation Based on Pairwise Difference Correlation
DiIanni, Colten, Deutsch, Daniel
This paper introduces Pairwise Difference Pearson (PDP), a novel segment-level meta-evaluation metric for Machine Translation (MT) that addresses limitations in previous Pearson's $\rho$-based and Kendall's $\tau$-based meta-evaluation approaches. PDP is a correlation-based metric that utilizes pairwise differences rather than raw scores. It draws on information from all segments for a more robust understanding of score distributions and uses segment-wise pairwise differences to refine Global Pearson to intra-segment score comparisons. Analysis on the WMT'24 shared task shows PDP properly ranks sentinel evaluation metrics and better aligns with human error weightings than previous work. Noise injection analysis demonstrates PDP's robustness to random noise, segment bias, and system bias while highlighting its sensitivity to extreme outliers.
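The abstract states the idea behind PDP (Pearson correlation computed over segment-wise pairwise score differences) but not the exact formula. The following is a minimal sketch of that idea, assuming differences are taken between systems within each segment and pooled across segments before correlating; `pairwise_difference_pearson` is a hypothetical function, not the authors' reference implementation.

```python
import numpy as np

def pairwise_difference_pearson(metric: np.ndarray, human: np.ndarray) -> float:
    # metric, human: (n_segments, n_systems) score matrices.
    # For every segment, take the score difference for every pair of systems,
    # pool these differences across segments, and correlate metric vs. human
    # differences with Pearson's r.
    n_seg, n_sys = metric.shape
    m_diffs, h_diffs = [], []
    for s in range(n_seg):
        for i in range(n_sys):
            for j in range(i + 1, n_sys):
                m_diffs.append(metric[s, i] - metric[s, j])
                h_diffs.append(human[s, i] - human[s, j])
    return float(np.corrcoef(np.asarray(m_diffs), np.asarray(h_diffs))[0, 1])

# Toy example: 4 segments scored for 3 systems, metric = noisy human scores.
rng = np.random.default_rng(1)
human = rng.normal(size=(4, 3))
metric = human + 0.3 * rng.normal(size=(4, 3))
print(round(pairwise_difference_pearson(metric, human), 3))
```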
Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination's Impact on Machine Translation
Kocyigit, Muhammed Yusuf, Briakou, Eleftheria, Deutsch, Daniel, Luo, Jiaming, Cherry, Colin, Freitag, Markus
Data contamination -- the accidental consumption of evaluation examples within the pre-training data -- can undermine the validity of evaluation benchmarks. In this paper, we present a rigorous analysis of the effects of contamination on language models at 1B and 8B scales on the machine translation task. Starting from a carefully decontaminated train-test split, we systematically introduce contamination at various stages, scales, and data formats to isolate its effect and measure its impact on performance metrics. Our experiments reveal that contamination with both source and target substantially inflates BLEU scores, and this inflation is 2.5 times larger (up to 30 BLEU points) for 8B compared to 1B models. In contrast, source-only and target-only contamination generally produce smaller, less consistent over-estimations. Finally, we study how the temporal distribution and frequency of contaminated samples influence performance over-estimation across languages with varying degrees of data resources.
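The paper's decontamination procedure is not detailed in the abstract. As a rough sketch of the kind of n-gram overlap check commonly used to build a decontaminated train-test split, the code below assumes an 8-gram window and a 50% overlap threshold chosen purely for illustration; `is_contaminated` is a hypothetical helper, not the authors' pipeline.

```python
def ngrams(text: str, n: int = 8):
    # Set of whitespace-tokenized n-grams of the text.
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc: str, test_examples, n: int = 8, threshold: float = 0.5) -> bool:
    # Flags a pre-training document if it shares a large fraction of n-grams
    # with any test example (source, target, or both concatenated).
    doc_grams = ngrams(train_doc, n)
    if not doc_grams:
        return False
    for ex in test_examples:
        ex_grams = ngrams(ex, n)
        if ex_grams and len(ex_grams & doc_grams) / len(ex_grams) >= threshold:
            return True
    return False

test_set = ["the cat sat on the mat and then it slept all afternoon in the sun"]
clean_doc = "a completely unrelated sentence about model training pipelines"
leaky_doc = "notes: the cat sat on the mat and then it slept all afternoon in the sun today"
print(is_contaminated(clean_doc, test_set), is_contaminated(leaky_doc, test_set))
```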
From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set
Finkelstein, Mara, Deutsch, Dan, Riley, Parker, Juraska, Juraj, Kovacs, Geza, Freitag, Markus
As LLMs continue to become more powerful and versatile, human evaluation has quickly become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure certain capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method for designing automatic metrics across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks.
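The abstract describes the Specialist recipe only at a high level: historical ratings on a test set are turned into in-context learning examples for a prompted Autorater. The sketch below shows one plausible way to assemble such a prompt; the example-selection rule (most recent k), the scalar 0-100 rating format, and the wording are all assumptions, and `build_specialist_prompt` is a hypothetical helper rather than the paper's actual setup for fine-grained MT evaluation.

```python
def build_specialist_prompt(history, source, hypothesis, k=4):
    # history: list of dicts with 'source', 'hypothesis', 'rating' collected
    # from past evaluations on the same test set.
    lines = ["You are an expert machine translation rater.",
             "Rate the translation quality from 0 (worst) to 100 (best).", ""]
    for ex in history[-k:]:  # assumed selection rule: most recent k examples
        lines += [f"Source: {ex['source']}",
                  f"Translation: {ex['hypothesis']}",
                  f"Rating: {ex['rating']}", ""]
    lines += [f"Source: {source}", f"Translation: {hypothesis}", "Rating:"]
    return "\n".join(lines)

history = [
    {"source": "Guten Morgen.", "hypothesis": "Good morning.", "rating": 95},
    {"source": "Wie geht es dir?", "hypothesis": "How goes it you?", "rating": 40},
]
print(build_specialist_prompt(history, "Danke schoen.", "Thank you very much."))
```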