english question
Text2Cypher Across Languages: Evaluating and Finetuning LLMs
Ozsoy, Makbule Gulcin, Tai, William
Recent advances in large language models (LLMs) have enabled natural language interfaces that translate user questions into database queries, such as Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database accessibility, most research today focuses on English, with limited evaluation in other languages. This paper investigates the performance of both foundational and finetuned LLMs on the Text2Cypher task across multiple languages. We create and release a multilingual dataset by translating English questions into Spanish and Turkish while preserving the original Cypher queries, enabling fair cross-lingual comparison. Using standardized prompts and metrics, we evaluate several foundational models and observe a consistent performance pattern: highest on English, followed by Spanish, and lowest on Turkish. We attribute this to differences in training data availability and linguistic features. We also examine the impact of translating task prompts into Spanish and Turkish. Results show little to no change in evaluation metrics, suggesting prompt translation has minor impact. Furthermore, we finetune a foundational model on two datasets: one in English only, and one multilingual. Finetuning on English improves overall accuracy but widens the performance gap between languages. In contrast, multilingual finetuning narrows the gap, resulting in more balanced performance. Our findings highlight the importance of multilingual evaluation and training for building more inclusive and robust query generation systems.
Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models
Xu, Zixiang, Wang, Yanbo, Huang, Yue, Chen, Xiuying, Zhao, Jieyu, Jiang, Meng, Zhang, Xiangliang
Large Language Models (LLMs) have achieved remarkable success in Natural Language Processing (NLP), yet their cross-lingual performance consistency remains a significant challenge. This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. Our approach leverages beam search and LLM-based simulation to generate bilingual question pairs that expose performance discrepancies between English and target languages. We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. Extensive experiments demonstrate that our method precisely and cost-effectively pinpoints cross-lingual weaknesses, consistently revealing over 50% accuracy drops in target languages across a wide range of models. Moreover, further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns and benefit from targeted post-training. Code is available at https://github.com/xzx34/Cross-Lingual-Pitfalls.
Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching
Kim, Seoyeon, Kim, Huiseo, Park, Chanjun, Yeo, Jinyoung, Lee, Dongha
Code-switching (CS), a phenomenon where multilingual speakers alternate between languages in a discourse, can convey subtle cultural and linguistic nuances that can otherwise be lost in translation. Recent state-of-the-art multilingual large language models (LLMs) demonstrate excellent multilingual abilities in various aspects including understanding CS, but the power of CS in eliciting language-specific knowledge is yet to be discovered. Therefore, we investigate the effectiveness of code-switching on a wide range of multilingual LLMs in terms of knowledge activation, or the act of identifying and leveraging knowledge for reasoning. To facilitate the research, we first present EnKoQA, a synthetic English-Korean CS question-answering dataset. We provide a comprehensive analysis of a variety of multilingual LLMs by subdividing the activation process into knowledge identification and knowledge leveraging. Our experiments demonstrate that, compared to English text, CS can faithfully activate knowledge inside LLMs, especially in language-specific domains. In addition, the performance gap between CS and English is larger in models that show excellent monolingual abilities, suggesting that there exists a correlation between CS and Korean proficiency.
An Empirical Study of NetOps Capability of Pre-Trained Large Language Models
Miao, Yukai, Bai, Yu, Chen, Li, Li, Dan, Sun, Haifeng, Wang, Xizheng, Luo, Ziqiu, Ren, Yanyu, Sun, Dapeng, Xu, Xiuting, Zhang, Qi, Xiang, Chao, Li, Xinchi
Nowadays, the versatile capabilities of Pre-trained Large Language Models (LLMs) have attracted much attention from the industry. However, some vertical domains are more interested in the in-domain capabilities of LLMs. For the Networks domain, we present NetEval, an evaluation set for measuring the comprehensive capabilities of LLMs in Network Operations (NetOps). NetEval is designed for evaluating the commonsense knowledge and inference ability in NetOps in a multi-lingual context. NetEval consists of 5,732 questions about NetOps, covering five different sub-domains of NetOps. With NetEval, we systematically evaluate the NetOps capability of 26 publicly available LLMs. The results show that only GPT-4 can achieve a performance competitive to humans. However, some open models like LLaMA 2 demonstrate significant potential.
An approach toward answering English questions from text
Simmons, R. F., Burger, J. F., Long, R. E.
Research on question answering by Raphael, Black, and Elliott, and our own work on Protosynthex II have shown that question-answering algorithms can be most easily written if the text source is in the form of simple, explicitly structured sets of subject-verb-nominal strings. Question-answering algorithms that have thus far been developed include word- and structure-matching operations and a few logical inference functions. All of the systems cited have in some fashion limited their input language to simple subject-verb-nominal strings, thus eliminating many problems of syntactic analysis and providing a normalized form for language data.
Indexing and dependency logic for answering English questions
Simmons, R. F., Klein, S., McConlogue, K.
This paper describes a computer system which uses a combination of coordinate indexing and structure matching techniques to extract from English questions many criteria which can be used for selecting and recognizing answers. A complete index of all content words in text is first searched to find information-rich statements which may be answers to the question. Each of these statements is then dependency analyzed to determine if the words (or synonyms) which correspond to question words maintain the dependency relations holding in the question. A simple semantic evaluation of structurally acceptable answers follows. A human editor working with the computer system helps to resolve syntactic ambiguities which are otherwise a major stumbling block in question-answering systems.
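The first stage described in this abstract, searching an index of content words to find candidate statements, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word list, the function names (`content_words`, `build_index`, `candidate_statements`), and the ranking by number of shared content words are all assumptions made for illustration, and the dependency-analysis and semantic-evaluation stages are not modeled.

```python
from collections import defaultdict

# Hypothetical stop list; the abstract does not say how content words are chosen.
STOP_WORDS = {"a", "an", "the", "is", "are", "do", "does",
              "what", "who", "which", "of", "in", "to"}


def content_words(text):
    """Return the set of lowercased words in `text` that are not stop words."""
    words = (w.strip(".,?!;:").lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}


def build_index(statements):
    """Map every content word to the set of statement numbers containing it."""
    index = defaultdict(set)
    for number, statement in enumerate(statements):
        for word in content_words(statement):
            index[word].add(number)
    return index


def candidate_statements(question, statements, index):
    """Rank statements by how many content words they share with the question.

    Ties are broken by original statement order. Ranking by overlap count is
    an assumption standing in for "information-rich" in the abstract.
    """
    overlap = defaultdict(int)
    for word in content_words(question):
        for number in index.get(word, ()):
            overlap[number] += 1
    ranked = sorted(overlap, key=lambda n: (-overlap[n], n))
    return [(statements[n], overlap[n]) for n in ranked]
```

For the statements "Worms eat grass.", "Birds eat worms.", and "Horses eat grass in the field." and the question "What do worms eat?", the first two statements tie with two shared content words each. Word overlap alone cannot tell them apart, which is the gap a dependency check of the kind the abstract describes is meant to close.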