Building Efficient and Effective OpenQA Systems for Low-Resource Languages
Budur, Emrah, Özçelik, Rıza, Soylu, Dilara, Khattab, Omar, Güngör, Tunga, Potts, Christopher
arXiv.org Artificial Intelligence
Question answering (QA) is the task of answering questions posed in natural language with free-form natural language answers extracted from a given passage. In the OpenQA variant, only a question text is given, and the system must retrieve relevant passages from an unstructured knowledge source and use them to provide answers, as is the case in mainstream QA systems on the Web. QA systems are currently mostly limited to the English language due to the lack of large-scale labeled QA datasets in non-English languages. In this paper, we show that effective, low-cost OpenQA systems can be developed for low-resource languages. The key ingredients are (1) weak supervision using machine-translated labeled datasets and (2) a relevant unstructured knowledge source in the target language. Furthermore, we show that only a few hundred gold assessment examples are needed to reliably evaluate these systems. We apply our method to Turkish as a challenging case study, since English and Turkish are typologically very distinct. We present SQuAD-TR, a machine translation of SQuAD2.0, and we build our OpenQA system by adapting ColBERT-QA for Turkish. We obtain a performance improvement of 9-34% in the EM score and 13-33% in the F1 score compared to the BM25-based and DPR-based baseline QA reader models by using two versions of Wikipedia dumps spanning two years. Our results show that SQuAD-TR makes OpenQA feasible for Turkish, which we hope encourages researchers to build OpenQA systems in other low-resource languages. We make all the code, models, and the dataset publicly available.
Jan-7-2024
- Country:
- South America
- Oceania > Australia
- North America
- Dominican Republic (0.04)
- United States
- Washington > King County
- Seattle (0.14)
- Texas > Travis County
- Austin (0.04)
- New York > New York County
- New York City (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.04)
- Maryland > Montgomery County
- Gaithersburg (0.04)
- California > Santa Clara County
- Palo Alto (0.04)
- Puerto Rico > San Juan
- San Juan (0.04)
- Canada
- Ontario > Toronto (0.04)
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Europe
- United Kingdom (0.14)
- Italy > Tuscany
- Florence (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Middle East > Republic of Türkiye
- Istanbul Province > Istanbul (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Netherlands > North Brabant
- Eindhoven (0.04)
- Asia
- China > Hong Kong (0.04)
- India (0.04)
- Vietnam > Hanoi
- Hanoi (0.04)
- Middle East
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Africa
- Zambia (0.04)
- Nigeria (0.04)
- Middle East > Tunisia
- Tunis Governorate > Tunis (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Media (0.93)
- Leisure & Entertainment (0.93)
- Health & Medicine (0.67)
- Government > Regional Government (0.67)
- Education > Educational Setting (0.67)
- Technology: