Pan, Lin
PRACTIQ: A Practical Conversational Text-to-SQL dataset with Ambiguous and Unanswerable Queries
Dong, Mingwen, Kumar, Nischal Ashok, Hu, Yiqun, Chauhan, Anuj, Hang, Chung-Wei, Chang, Shuaichen, Pan, Lin, Lan, Wuwei, Zhu, Henghui, Jiang, Jiarong, Ng, Patrick, Wang, Zhiguo
Previous text-to-SQL datasets and systems have primarily focused on user questions with clear intentions that can be answered. However, real user questions are often ambiguous, with multiple possible interpretations, or unanswerable due to a lack of relevant data. In this work, we construct a practical conversational text-to-SQL dataset called PRACTIQ, consisting of ambiguous and unanswerable questions inspired by real-world user questions. We first identify four categories of ambiguous questions and four categories of unanswerable questions by studying existing text-to-SQL datasets. We then generate conversations with four turns: the initial user question, an assistant response seeking clarification, the user's clarification, and the assistant's clarified SQL response with a natural language explanation of the execution results. For some ambiguous queries, we also directly generate helpful SQL responses that consider multiple aspects of the ambiguity, instead of requesting user clarification. To benchmark performance on ambiguous, unanswerable, and answerable questions, we implement large language model (LLM)-based baselines using various LLMs. Our approach involves two steps: question category classification and clarification SQL prediction. Our experiments reveal that state-of-the-art systems struggle to handle ambiguous and unanswerable questions effectively. We will release our code for data generation and experiments on GitHub.
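As a rough illustration of the two-step setup the abstract describes (category classification followed by clarification or SQL prediction), the sketch below shows one way such an LLM-based baseline could be wired together. The prompts, category labels, and the `call_llm` helper are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a two-step LLM baseline for PRACTIQ-style questions:
# step 1 classifies the question, step 2 either asks for clarification or emits SQL.
# `call_llm` is a placeholder for any chat-completion API.

CATEGORIES = ["answerable", "ambiguous", "unanswerable"]  # assumed label set

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def classify_question(question: str, schema: str) -> str:
    prompt = (
        "Given the database schema and a user question, label the question as one of "
        f"{CATEGORIES}.\n\nSchema:\n{schema}\n\nQuestion: {question}\nLabel:"
    )
    label = call_llm(prompt).strip().lower()
    return label if label in CATEGORIES else "answerable"

def respond(question: str, schema: str) -> dict:
    label = classify_question(question, schema)
    if label == "unanswerable":
        return {"type": "no_answer", "message": "The schema has no data to answer this."}
    if label == "ambiguous":
        clarification = call_llm(
            f"Schema:\n{schema}\n\nQuestion: {question}\n"
            "Ask one short clarification question that resolves the ambiguity:"
        )
        return {"type": "clarify", "message": clarification}
    sql = call_llm(f"Schema:\n{schema}\n\nWrite a SQL query answering: {question}\nSQL:")
    return {"type": "sql", "sql": sql}
```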
Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall
Yuan, Jiaqing, Pan, Lin, Hang, Chung-Wei, Guo, Jiang, Jiang, Jiarong, Min, Bonan, Ng, Patrick, Wang, Zhiguo
Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and that model scaling helps, as larger models outperform smaller ones in all model families. However, even the best performance, from GPT-4, still leaves a large gap to the upper bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which lead to significant degradation of factual knowledge recall for large models. By further decoupling a model's known and unknown knowledge, we find that the degradation is attributed to exemplars that contradict the model's known knowledge, as well as to the number of such exemplars. Lastly, we fine-tune LLaMA-7B in different settings of known and unknown knowledge. In particular, fine-tuning on a model's known knowledge is beneficial and consistently outperforms fine-tuning on unknown and mixed knowledge. We will make our benchmark publicly available.
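To make the counterfactual-demonstration study concrete, here is a minimal sketch of how one might probe factual recall with deliberately wrong in-context exemplars and measure the resulting accuracy degradation. The prompt format and function names are assumptions for illustration, not the FACT-BENCH code.

```python
# Illustrative sketch of probing factual recall with counterfactual in-context
# exemplars: exemplar answers are deliberately wrong, and we compare accuracy
# against the same prompt built with correct exemplars.

def build_prompt(exemplars, question, counterfactual=False):
    lines = []
    for q, true_answer, wrong_answer in exemplars:
        answer = wrong_answer if counterfactual else true_answer
        lines.append(f"Q: {q}\nA: {answer}")
    lines.append(f"Q: {question}\nA:")
    return "\n\n".join(lines)

def recall_accuracy(model_fn, dataset, exemplars, counterfactual=False):
    """`model_fn` maps a prompt string to the model's text completion."""
    correct = 0
    for question, gold in dataset:
        prediction = model_fn(build_prompt(exemplars, question, counterfactual)).strip()
        correct += int(gold.lower() in prediction.lower())
    return correct / len(dataset)
```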
Curvilinear object segmentation in medical images based on ODoS filter and deep learning network
Peng, Yuanyuan, Pan, Lin, Luan, Pengpeng, Tu, Hongbin, Li, Xiong
Automatic segmentation of curvilinear objects in medical images plays an important role in the diagnosis and evaluation of human diseases, yet it remains a challenging task due to issues such as varied image appearances, low contrast between curvilinear objects and their surrounding backgrounds, thin and uneven curvilinear structures, and improper background illumination conditions. To overcome these challenges, we present a unique curvilinear structure segmentation framework based on an oriented derivative of stick (ODoS) filter and a deep learning network for curvilinear object segmentation in medical images. Currently, a large number of deep learning models emphasize developing deep architectures while ignoring the structural features of curvilinear objects, which may lead to unsatisfactory results. Consequently, we present a new approach that incorporates an ODoS filter into a deep learning network to improve spatial attention on curvilinear objects. Specifically, the input image is transformed into a four-channel image constructed by the ODoS filter, in which the original image is kept as the principal part to describe image appearance and complex background illumination conditions, a multi-step strategy is used to enhance the contrast between curvilinear objects and their surrounding backgrounds, and a vector field is applied to discriminate thin and uneven curvilinear structures. A deep learning framework is then employed to extract various structural features for curvilinear object segmentation in medical images. The performance of the model is validated in experiments on the publicly available DRIVE, STARE, and CHASEDB1 datasets. The experimental results indicate that the presented model yields promising results compared with some state-of-the-art methods.
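The sketch below shows one plausible way the four-channel input described above could be assembled (original image, a contrast-enhanced map, and a two-channel vector field) before being fed to a segmentation network. The split into these particular channels and the stubbed-out ODoS computation are assumptions, not a reproduction of the paper's filter.

```python
# Rough sketch of assembling a four-channel input for a segmentation network:
# the original image plus three ODoS-derived maps. The ODoS responses themselves
# are stubbed out here.
import numpy as np

def odos_responses(gray: np.ndarray):
    """Placeholder for the oriented-derivative-of-stick filter outputs:
    a contrast-enhanced map and a 2-channel vector field (not implemented)."""
    raise NotImplementedError

def build_four_channel_input(gray: np.ndarray) -> np.ndarray:
    enhanced, vector_field = odos_responses(gray)        # (H, W), (H, W, 2)
    channels = np.stack(
        [gray, enhanced, vector_field[..., 0], vector_field[..., 1]], axis=0
    )                                                     # (4, H, W)
    # Normalize each channel to [0, 1] before feeding the segmentation network.
    mins = channels.min(axis=(1, 2), keepdims=True)
    maxs = channels.max(axis=(1, 2), keepdims=True)
    return (channels - mins) / np.maximum(maxs - mins, 1e-8)
```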
UNITE: A Unified Benchmark for Text-to-SQL Evaluation
Lan, Wuwei, Wang, Zhiguo, Chauhan, Anuj, Zhu, Henghui, Li, Alexander, Guo, Jiang, Zhang, Sheng, Hang, Chung-Wei, Lilien, Joseph, Hu, Yiqun, Pan, Lin, Dong, Mingwen, Wang, Jun, Jiang, Jiarong, Ash, Stephen, Castelli, Vittorio, Ng, Patrick, Xiang, Bing
A practical text-to-SQL system should generalize well on a wide variety of natural language questions, unseen database schemas, and novel SQL query structures. To comprehensively evaluate text-to-SQL systems, we introduce a UNIfied benchmark for Text-to-SQL Evaluation (UNITE). It is composed of publicly available text-to-SQL datasets, containing natural language questions from more than 12 domains, SQL queries from more than 3.9K patterns, and 29K databases. Compared to the widely used Spider benchmark, we introduce ~120K additional examples and a threefold increase in SQL patterns, such as comparative and boolean questions. We conduct a systematic study of six state-of-the-art (SOTA) text-to-SQL parsers on our new benchmark and show that: 1) Codex performs surprisingly well on out-of-domain datasets; 2) specially designed decoding methods (e.g., constrained beam search) can improve performance in both in-domain and out-of-domain settings; 3) explicitly modeling the relationship between questions and schemas further improves Seq2Seq models. More importantly, our benchmark presents key challenges in compositional generalization and robustness that these SOTA models cannot address well. Our code and data processing scripts are available at https://github.com/awslabs/unified-text2sql-benchmark
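Since UNITE is assembled from heterogeneous public datasets, a natural implementation question is how examples are normalized into one record format. The sketch below shows one way that could look for Spider-style JSON files; the field names and dataclass are assumptions for illustration, not the released benchmark schema.

```python
# Hypothetical sketch of normalizing heterogeneous text-to-SQL datasets into a
# single unified record format before evaluation.
import json
from dataclasses import dataclass, asdict

@dataclass
class UnifiedExample:
    dataset: str   # source dataset name, e.g. "spider"
    db_id: str     # database identifier
    question: str  # natural language question
    sql: str       # gold SQL query

def load_spider_style(path: str, name: str):
    """Yield unified examples from a Spider-style JSON file."""
    with open(path) as f:
        for item in json.load(f):
            yield UnifiedExample(name, item["db_id"], item["question"], item["query"])

def merge(sources):
    """Flatten several example generators into one list of plain dicts."""
    return [asdict(ex) for src in sources for ex in src]
```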
Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness
Chang, Shuaichen, Wang, Jun, Dong, Mingwen, Pan, Lin, Zhu, Henghui, Li, Alexander Hanbo, Lan, Wuwei, Zhang, Sheng, Jiang, Jiarong, Lilien, Joseph, Ash, Steve, Wang, William Yang, Wang, Zhiguo, Castelli, Vittorio, Ng, Patrick, Xiang, Bing
Neural text-to-SQL models have achieved remarkable performance in translating natural language questions into SQL queries. However, recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations, and previously curated robustness test sets usually focus on individual phenomena. In this paper, we propose a comprehensive robustness benchmark based on Spider, a cross-domain text-to-SQL benchmark, to diagnose model robustness. We design 17 perturbations on databases, natural language questions, and SQL queries to measure robustness from different angles. To collect more diversified natural question perturbations, we utilize large pretrained language models (PLMs) to simulate human behaviors in creating natural questions. We conduct a diagnostic study of state-of-the-art models on the robustness set. Experimental results reveal that even the most robust model suffers from a 14.0% performance drop overall and a 50.7% performance drop on the most challenging perturbation. We also present a breakdown analysis of text-to-SQL model designs and provide insights for improving model robustness.
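The robustness numbers quoted above come from comparing a model's accuracy on original versus perturbed evaluation sets. The snippet below is an illustrative sketch of that measurement (not the Dr.Spider tooling); the `predict_sql` and `execute` callables and the example fields are assumptions.

```python
# Illustrative sketch of measuring a robustness gap: execution accuracy on an
# original Spider-style set versus a perturbed variant of the same set.

def execution_accuracy(predict_sql, examples, execute):
    """`predict_sql(question, db_id)` returns SQL; `execute(sql, db_id)` returns results."""
    correct = 0
    for ex in examples:
        pred = predict_sql(ex["question"], ex["db_id"])
        correct += int(execute(pred, ex["db_id"]) == execute(ex["gold_sql"], ex["db_id"]))
    return correct / len(examples)

def robustness_drop(predict_sql, original, perturbed, execute):
    acc_orig = execution_accuracy(predict_sql, original, execute)
    acc_pert = execution_accuracy(predict_sql, perturbed, execute)
    return acc_orig - acc_pert  # e.g. a drop of 0.14 corresponds to 14 accuracy points
```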
Importance of Synthesizing High-quality Data for Text-to-SQL Parsing
Zhao, Yiyun, Jiang, Jiarong, Hu, Yiqun, Lan, Wuwei, Zhu, Henry, Chauhan, Anuj, Li, Alexander, Pan, Lin, Wang, Jun, Hang, Chung-Wei, Zhang, Sheng, Dong, Marvin, Lilien, Joe, Ng, Patrick, Wang, Zhiguo, Castelli, Vittorio, Xiang, Bing
Recently, there has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we first examine existing synthesized datasets and discover that state-of-the-art text-to-SQL algorithms do not further improve on popular benchmarks when trained with augmented synthetic data. We observe two shortcomings: illogical synthetic SQL queries from independent column sampling, and arbitrary table joins. To address these issues, we propose a novel synthesis framework that incorporates key relationships from the schema, imposes strong typing, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated natural language questions. When existing powerful semantic parsers are pre-finetuned on our high-quality synthesized data, our experiments show that these models achieve significant accuracy boosts on popular benchmarks, including new state-of-the-art performance on Spider.
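To give a sense of what schema-distance-weighted column sampling could look like in practice, here is a hedged sketch: when adding a column to a synthetic query, columns from tables that are close (in join hops) to the tables already used are preferred, which discourages arbitrary joins. The inverse-distance weighting and data structures are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of schema-distance-weighted column sampling over a foreign-key graph.
import random
from collections import deque

def join_distances(schema, start_table):
    """BFS over the foreign-key graph; `schema[table]` lists directly joinable tables."""
    dist, queue = {start_table: 0}, deque([start_table])
    while queue:
        t = queue.popleft()
        for nb in schema.get(t, []):
            if nb not in dist:
                dist[nb] = dist[t] + 1
                queue.append(nb)
    return dist

def sample_column(schema, columns_by_table, anchor_table):
    dist = join_distances(schema, anchor_table)
    candidates, weights = [], []
    for table, cols in columns_by_table.items():
        d = dist.get(table)
        if d is None:
            continue  # tables unreachable via joins are never sampled
        for col in cols:
            candidates.append((table, col))
            weights.append(1.0 / (1 + d))  # closer tables get higher weight
    return random.choices(candidates, weights=weights, k=1)[0]
```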
Span Selection Pre-training for Question Answering
Glass, Michael, Gliozzo, Alfio, Chakravarti, Rishav, Ferritto, Anthony, Pan, Lin, Bhargav, G P Shrivatsa, Garg, Dinesh, Sil, Avirup
BERT (Bidirectional Encoder Representations from Transformers) and related pre-trained Transformers have provided large gains across many language understanding tasks, achieving a new state-of-the-art (SOTA). BERT is pre-trained on two auxiliary tasks: Masked Language Model and Next Sentence Prediction. In this paper, we introduce a new pre-training task inspired by reading comprehension and by an effort to avoid encoding general knowledge in the transformer network itself. We find significant and consistent improvements over both BERT-BASE and BERT-LARGE on multiple machine reading comprehension (MRC) and paraphrasing datasets. Specifically, our proposed model obtains SOTA results on Natural Questions, a new benchmark MRC dataset, outperforming BERT-LARGE by 3 F1 points on short answer prediction. We also establish a new SOTA on HotpotQA, improving answer prediction by 4 F1 points and supporting fact prediction by 1 F1 point. Moreover, we show that our pre-training approach is particularly effective when training data is limited, improving the learning curve by a large amount.
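As a conceptual illustration of a reading-comprehension-inspired pre-training task, the sketch below builds an instance by blanking a span in a sentence and pairing it with a passage that contains that span, so the model must select the answer from the passage rather than store it in its parameters. The construction heuristics here are assumptions for illustration, not the paper's data pipeline.

```python
# Conceptual sketch of building a span-selection pre-training instance.
import random

BLANK = "[BLANK]"

def make_span_selection_example(sentence: str, passage: str, max_span_len: int = 3):
    tokens = sentence.split()
    for _ in range(20):  # retry until the sampled span also occurs in the passage
        span_len = random.randint(1, max_span_len)
        start = random.randint(0, max(0, len(tokens) - span_len))
        span = " ".join(tokens[start:start + span_len])
        if span and span in passage:
            query = " ".join(tokens[:start] + [BLANK] + tokens[start + span_len:])
            return {"query": query, "passage": passage, "answer": span}
    return None  # no suitable span found
```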