iwslt
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Singapore (0.05)
- North America > United States > Texas > Travis County > Austin (0.04)
- (8 more...)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > China > Hong Kong (0.04)
- (2 more...)
Findings of the IWSLT 2024 Evaluation Campaign
Ahmad, Ibrahim Said, Anastasopoulos, Antonios, Bojar, Ondřej, Borg, Claudia, Carpuat, Marine, Cattoni, Roldano, Cettolo, Mauro, Chen, William, Dong, Qianqian, Federico, Marcello, Haddow, Barry, Javorský, Dávid, Krubiński, Mateusz, Lam, Tsz Kin, Ma, Xutai, Mathur, Prashant, Matusov, Evgeny, Maurya, Chandresh, McCrae, John, Murray, Kenton, Nakamura, Satoshi, Negri, Matteo, Niehues, Jan, Niu, Xing, Ojha, Atul Kr., Ortega, John, Papi, Sara, Polák, Peter, Pospíšil, Adam, Pecina, Pavel, Salesky, Elizabeth, Sethiya, Nivedita, Sarkar, Balaram, Shi, Jiatong, Sikasote, Claytone, Sperber, Matthias, Stüker, Sebastian, Sudoh, Katsuhito, Thompson, Brian, Turchi, Marco, Waibel, Alex, Watanabe, Shinji, Wilken, Patrick, Zemánek, Petr, Zevallos, Rodolfo
This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address 7 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attracted 18 teams whose submissions are documented in 26 system papers. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Czechia (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- (50 more...)
- Leisure & Entertainment (0.94)
- Education (0.68)
- Media > Television (0.47)
- Government > Regional Government (0.45)
Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks
Zhao, Jialin, Zhang, Yingtao, Li, Xinghang, Liu, Huaping, Cannistraci, Carlo Vittorio
The growing computational demands posed by increasingly number of neural network's parameters necessitate low-memory-consumption training approaches. Previous memory reduction techniques, such as Low-Rank Adaptation (LoRA) and ReLoRA, suffer from the limitation of low rank and saddle point issues, particularly during intensive tasks like pre-training. In this paper, we propose Sparse Spectral Training (SST), an advanced training methodology that updates all singular values and selectively updates singular vectors of network weights, thereby optimizing resource usage while closely approximating full-rank training. SST refines the training process by employing a targeted updating strategy for singular vectors, which is determined by a multinomial sampling method weighted by the significance of the singular values, ensuring both high performance and memory reduction. Through comprehensive testing on both Euclidean and hyperbolic neural networks across various tasks, including natural language generation, machine translation, node classification and link prediction, SST demonstrates its capability to outperform existing memory reduction training methods and is comparable with full-rank training in some cases. On OPT-125M, with rank equating to 8.3% of embedding dimension, SST reduces the perplexity gap to full-rank training by 67.6%, demonstrating a significant reduction of the performance loss with prevalent low-rank methods. This approach offers a strong alternative to traditional training techniques, paving the way for more efficient and scalable neural network training solutions.
- South America > Colombia > Meta Department > Villavicencio (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > Dominican Republic (0.04)
- (2 more...)
Condensing Multilingual Knowledge with Lightweight Language-Specific Modules
Xu, Haoran, Tan, Weiting, Li, Shuyue Stella, Chen, Yunmo, Van Durme, Benjamin, Koehn, Philipp, Murray, Kenton
Incorporating language-specific (LS) modules is a proven method to boost performance in multilingual machine translation. This approach bears similarity to Mixture-of-Experts (MoE) because it does not inflate FLOPs. However, the scalability of this approach to hundreds of languages (experts) tends to be unmanageable due to the prohibitive number of parameters introduced by full-rank matrices in fully-connected layers. In this work, we introduce the Language-Specific Matrix Synthesis (LMS) method. This approach constructs LS modules by generating low-rank matrices from two significantly smaller matrices to approximate the full-rank matrix. Furthermore, we condense multilingual knowledge from multiple LS modules into a single shared module with the Fuse Distillation (FD) technique to improve the efficiency of inference and model serialization. We show that our LMS method significantly outperforms previous LS methods and MoE methods with the same amount of extra parameters, e.g., 1.73 BLEU points over the Switch Transformer on many-to-many multilingual machine translation. Importantly, LMS is able to have comparable translation performance with much fewer parameters.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (6 more...)
Finding the Pillars of Strength for Multi-Head Attention
Ni, Jinjie, Mao, Rui, Yang, Zonglin, Lei, Han, Cambria, Erik
Recent studies have revealed some issues of Multi-Head Attention (MHA), e.g., redundancy and over-parameterization. Specifically, the heads of MHA were originally designed to attend to information from different representation subspaces, whereas prior studies found that some attention heads likely learn similar features and can be pruned without harming performance. Inspired by the minimum-redundancy feature selection, we assume that focusing on the most representative and distinctive features with minimum resources can mitigate the above issues and lead to more effective and efficient MHAs. In particular, we propose Grouped Head Attention, trained with a self-supervised group constraint that group attention heads, where each group focuses on an essential but distinctive feature subset. We additionally propose a Voting-to-Stay procedure to remove redundant heads, thus achieving a transformer with lighter weights. Moreover, our method achieves significant performance gains on three well-established tasks while considerably compressing parameters.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (14 more...)
GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution
Lu, Yining, Yu, Haoping, Khashabi, Daniel
Augmenting large language models (LLM) to use external tools enhances their performance across a variety of tasks. However, prior works over-rely on task-specific demonstration of tool use that limits their generalizability and computational cost due to making many calls to large-scale LLMs. We introduce GEAR, a computationally efficient query-tool grounding algorithm that is generalizable to various tasks that require tool use while not relying on task-specific demonstrations. GEAR achieves better efficiency by delegating tool grounding and execution to small language models (SLM) and LLM, respectively; while leveraging semantic and pattern-based evaluation at both question and answer levels for generalizable tool grounding. We evaluate GEAR on 14 datasets across 6 downstream tasks, demonstrating its strong generalizability to novel tasks, tools and different SLMs. Despite offering more efficiency, GEAR achieves higher precision in tool grounding compared to prior strategies using LLM prompting, thus improving downstream accuracy at a reduced computational cost. For example, we demonstrate that GEAR-augmented GPT-J and GPT-3 outperform counterpart tool-augmented baselines because of better tool use.
Forming Trees with Treeformers
Patel, Nilay, Flanigan, Jeffrey
Human language is known to exhibit a nested, hierarchical structure, allowing us to form complex sentences out of smaller pieces. However, many state-of-the-art neural networks models such as Transformers have no explicit hierarchical structure in its architecture -- that is, they don't have an inductive bias toward hierarchical structure. Additionally, Transformers are known to perform poorly on compositional generalization tasks which require such structures. In this paper, we introduce Treeformer, a general-purpose encoder module inspired by the CKY algorithm which learns a composition operator and pooling function to construct hierarchical encodings for phrases and sentences. Our extensive experiments demonstrate the benefits of incorporating hierarchical structure into the Transformer and show significant improvements in compositional generalization as well as in downstream tasks such as machine translation, abstractive summarization, and various natural language understanding tasks.
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
- North America > United States > Pennsylvania (0.04)
- (11 more...)
Progressive Translation: Improving Domain Robustness of Neural Machine Translation with Intermediate Sequences
Wang, Chaojun, Liu, Yang, Lam, Wai
Previous studies show that intermediate supervision signals benefit various Natural Language Processing tasks. However, it is not clear whether there exist intermediate signals that benefit Neural Machine Translation (NMT). Borrowing techniques from Statistical Machine Translation, we propose intermediate signals which are intermediate sequences from the "source-like" structure to the "target-like" structure. Such intermediate sequences introduce an inductive bias that reflects a domain-agnostic principle of translation, which reduces spurious correlations that are harmful to out-of-domain generalisation. Furthermore, we introduce a full-permutation multi-task learning to alleviate the spurious causal relations from intermediate sequences to the target, which results from exposure bias. The Minimum Bayes Risk decoding algorithm is used to pick the best candidate translation from all permutations to further improve the performance. Experiments show that the introduced intermediate signals can effectively improve the domain robustness of NMT and reduces the amount of hallucinations on out-of-domain translation. Further analysis shows that our methods are especially promising in low-resource scenarios.
- North America > Dominican Republic (0.04)
- Europe > Czechia > Prague (0.04)
- Asia > China > Hong Kong (0.04)
- (14 more...)
LABO: Towards Learning Optimal Label Regularization via Bi-level Optimization
Lu, Peng, Rashid, Ahmad, Kobyzev, Ivan, Rezagholizadeh, Mehdi, Langlais, Philippe
Regularization techniques are crucial to improving the generalization performance and training efficiency of deep neural networks. Many deep learning algorithms rely on weight decay, dropout, batch/layer normalization to converge faster and generalize. Label Smoothing (LS) is another simple, versatile and efficient regularization which can be applied to various supervised classification tasks. Conventional LS, however, regardless of the training instance assumes that each non-target class is equally likely. In this work, we present a general framework for training with label regularization, which includes conventional LS but can also model instance-specific variants. Based on this formulation, we propose an efficient way of learning LAbel regularization by devising a Bi-level Optimization (LABO) problem. We derive a deterministic and interpretable solution of the inner loop as the optimal label smoothing without the need to store the parameters or the output of a trained model. Finally, we conduct extensive experiments and demonstrate our LABO consistently yields improvement over conventional label regularization on various fields, including seven machine translation and three image classification tasks across various
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- (22 more...)