Goto

Collaborating Authors

 Yamada, Ikuya


LEIA: Facilitating Cross-lingual Knowledge Transfer in Language Models with Entity-based Data Augmentation

arXiv.org Artificial Intelligence

Adapting English-based large language models (LLMs) to other languages has become increasingly popular due to the efficiency and potential of cross-lingual transfer. However, existing language adaptation methods often overlook the benefits of cross-lingual supervision. In this study, we introduce LEIA, a language adaptation tuning method that utilizes Wikipedia entity names aligned across languages. This method involves augmenting the target language corpus with English entity names and training the model using left-to-right language modeling. We assess LEIA on diverse question answering datasets using 7B-parameter LLMs, demonstrating significant performance gains across various non-English languages. The source code is available at https://github.com/studio-ousia/leia.
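The augmentation step can be pictured with a short sketch. The snippet below is an illustration only, not code from the linked repository: the span format, the helper names, and the marker tokens around the English entity name are assumptions made for this example.

```python
# Minimal sketch of the entity-based augmentation idea: the English name of each
# Wikipedia entity mentioned in the target-language text is inserted next to the
# mention before left-to-right language modeling. The span format and the marker
# tokens are illustrative assumptions, not the actual LEIA implementation.

from dataclasses import dataclass

@dataclass
class EntitySpan:
    start: int      # character offsets of the mention in the target-language text
    end: int
    en_name: str    # English Wikipedia title aligned via inter-language links

def augment_with_english_names(text: str, spans: list[EntitySpan]) -> str:
    """Insert the aligned English entity name right after each mention."""
    out, cursor = [], 0
    for span in sorted(spans, key=lambda s: s.start):
        out.append(text[cursor:span.end])
        out.append(f" <en_entity> {span.en_name} </en_entity>")
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)

# Example on a Japanese sentence mentioning Tokyo:
print(augment_with_english_names("東京は日本の首都です。", [EntitySpan(0, 2, "Tokyo")]))
# -> 東京 <en_entity> Tokyo </en_entity>は日本の首都です。
```

The augmented corpus is then used for ordinary left-to-right language modeling, so the model sees the target-language mention and its aligned English name side by side.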


Arukikata Travelogue Dataset with Geographic Entity Mention, Coreference, and Link Annotation

arXiv.org Artificial Intelligence

Geoparsing is a fundamental technique for analyzing geo-entity information in text. We focus on document-level geoparsing, which considers geographic relatedness among geo-entity mentions, and present a Japanese travelogue dataset designed for evaluating document-level geoparsing systems. Our dataset comprises 200 travelogue documents with rich geo-entity information: 12,171 mentions, 6,339 coreference clusters, and 2,551 geo-entities linked to geo-database entries.
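As an illustration of the annotation layers described above, the following sketch shows one plausible in-memory representation of mentions, coreference clusters, and links to a geo-database; the field names and schema are assumptions, not the dataset's actual format.

```python
# One plausible in-memory representation (field names and schema are assumptions)
# of the three annotation layers described above: geo-entity mentions, coreference
# clusters over those mentions, and links from clusters to geo-database entries.

from dataclasses import dataclass, field

@dataclass
class GeoMention:
    doc_id: str
    start: int          # character offsets of the mention in the travelogue text
    end: int
    surface: str        # mention string as it appears in the document

@dataclass
class CoreferenceCluster:
    mentions: list[GeoMention] = field(default_factory=list)
    geo_entity_id: str | None = None    # ID of the linked geo-database entry, if resolvable

def annotation_stats(clusters: list[CoreferenceCluster]) -> dict[str, int]:
    """Aggregate counts of the kind quoted above (mentions, clusters, linked geo-entities)."""
    return {
        "mentions": sum(len(c.mentions) for c in clusters),
        "coreference_clusters": len(clusters),
        "linked_geo_entities": sum(c.geo_entity_id is not None for c in clusters),
    }
```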


A Multilingual Bag-of-Entities Model for Zero-Shot Cross-Lingual Text Classification

arXiv.org Artificial Intelligence

In cross-lingual transfer learning, models are trained on annotated data in a resource-rich language (the source language) and then applied to another language (the target language) without any training. Substantial progress in cross-lingual transfer learning has been made using multilingual pre-trained language models (PLMs), such as multilingual BERT (M-BERT), jointly trained on massive corpora in multiple languages (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020a). However, recent empirical studies have found that cross-lingual transfer [...]. Inspired by previous work (Yamada and Shindo, 2019; Peters et al., 2019), we compute the weights using an attention mechanism that selects the entities relevant to the given document. We then compute the sum of the entity-based document representation and the text-based document representation computed using the PLM and feed it into a linear classifier. Since the entity vocabulary and entity embedding are shared across languages, a model trained on entity features in the source language can be directly transferred to multiple target languages.
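The described architecture can be sketched roughly as follows. This is not the authors' implementation: the attention parameterization and the dimensions are assumptions, but the sketch shows the shared entity embedding table, the document-conditioned attention over detected entities, and the linear classifier applied to the sum of the entity-based and PLM-based document representations.

```python
# Rough sketch (assumptions, not the authors' code) of a multilingual
# bag-of-entities classifier: a language-independent entity embedding table,
# attention that weights entities by their relevance to the document, and a
# linear classifier over the sum of the entity-based and text-based representations.

import torch
import torch.nn as nn

class BagOfEntitiesClassifier(nn.Module):
    def __init__(self, num_entities: int, hidden_size: int, num_labels: int):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, hidden_size, padding_idx=0)
        self.query_proj = nn.Linear(hidden_size, hidden_size)   # assumed attention parameterization
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, text_repr: torch.Tensor, entity_ids: torch.Tensor) -> torch.Tensor:
        # text_repr: (batch, hidden) document representation from a multilingual PLM (e.g. M-BERT)
        # entity_ids: (batch, max_entities) entities detected in each document; 0 is padding,
        #             and at least one real entity per document is assumed here.
        emb = self.entity_emb(entity_ids)                        # (batch, max_entities, hidden)
        query = self.query_proj(text_repr).unsqueeze(-1)         # (batch, hidden, 1)
        scores = torch.bmm(emb, query).squeeze(-1)               # relevance of each entity to the document
        scores = scores.masked_fill(entity_ids == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                  # attention over detected entities
        entity_repr = (weights.unsqueeze(-1) * emb).sum(dim=1)   # entity-based document representation
        return self.classifier(text_repr + entity_repr)          # sum of both representations -> classifier
```

Because the entity vocabulary and its embedding table are shared across languages, the same classifier can be applied unchanged to documents in other languages, which is the zero-shot transfer behaviour the excerpt describes.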


NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned

arXiv.org Artificial Intelligence

We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage contestants to explore the trade-off between storing large, redundant retrieval corpora and storing the parameters of large learned models. In this report, we describe the motivation and organization of the competition, review the best submissions, and analyze system predictions to inform a discussion of evaluation for open-domain QA.