Large Language Model
Can ChatGPT be used to generate scientific hypotheses?
Park, Yang Jeong, Kaplan, Daniel, Ren, Zhichu, Hsu, Chia-Wei, Li, Changhao, Xu, Haowei, Li, Sipei, Li, Ju
We investigate whether large language models can perform the creative hypothesis generation that human researchers regularly do. While the error rate is high, generative AI seems to be able to effectively structure vast amounts of scientific knowledge and provide interesting and testable hypotheses. The future scientific enterprise may include synergistic efforts with a swarm of "hypothesis machines", challenged by automated experimentation and adversarial peer reviews. In a university or research institute, a significant portion of fresh ideas arises out of discussions.
Unified Text Structuralization with Instruction-tuned Language Models
Ni, Xuanfan, Li, Piji, Li, Huayang
Text structuralization is one of the important fields of natural language processing (NLP) consists of information extraction (IE) and structure formalization. However, current studies of text structuralization suffer from a shortage of manually annotated high-quality datasets from different domains and languages, which require specialized professional knowledge. In addition, most IE methods are designed for a specific type of structured data, e.g., entities, relations, and events, making them hard to generalize to others. In this work, we propose a simple and efficient approach to instruct large language model (LLM) to extract a variety of structures from texts. More concretely, we add a prefix and a suffix instruction to indicate the desired IE task and structure type, respectively, before feeding the text into a LLM. Experiments on two LLMs show that this approach can enable language models to perform comparable with other state-of-the-art methods on datasets of a variety of languages and knowledge, and can generalize to other IE sub-tasks via changing the content of instruction. Another benefit of our approach is that it can help researchers to build datasets in low-source and domain-specific scenarios, e.g., fields in finance and law, with low cost.
From Natural Language to Simulations: Applying GPT-3 Codex to Automate Simulation Modeling of Logistics Systems
Jackson, Ilya, Saenz, Maria Jesus
Our work is the first attempt to apply Natural Language Processing to automate the development of simulation models of systems vitally important for logistics. We demonstrated that the framework built on top of the fine-tuned GPT-3 Codex, a Transformer-based language model, could produce functionally valid simulations of queuing and inventory control systems given the verbal description. In conducted experiments, GPT-3 Codex demonstrated convincing expertise in Python as well as an understanding of the domain-specific vocabulary. As a result, the language model could produce simulations of a single-product inventory-control system and single-server queuing system given the domain-specific context, a detailed description of the process, and a list of variables with the corresponding values. The demonstrated results, along with the rapid improvement of language models, open the door for significant simplification of the workflow behind the simulation model development, which will allow experts to focus on the high-level consideration of the problem and holistic thinking.
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Mei, Xinhao, Meng, Chutong, Liu, Haohe, Kong, Qiuqiang, Ko, Tom, Zhao, Chengqi, Plumbley, Mark D., Zou, Yuexian, Wang, Wenwu
The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. Our aspiration is for the WavCaps dataset we have proposed to facilitate research in audio-language multimodal learning and demonstrate the potential of utilizing ChatGPT to enhance academic research. Our dataset and codes are available at https://github.com/XinhaoMei/WavCaps.
Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence
Zhang, Jiaxing, Gan, Ruyi, Wang, Junjie, Zhang, Yuxiang, Zhang, Lin, Yang, Ping, Gao, Xinyu, Wu, Ziwei, Dong, Xiaoqun, He, Junqing, Zhuo, Jianheng, Yang, Qi, Huang, Yongfeng, Li, Xiayu, Wu, Yanghan, Lu, Junyu, Zhu, Xinyu, Chen, Weifeng, Han, Ting, Pan, Kunhao, Wang, Rui, Wang, Hao, Wu, Xiaojun, Zeng, Zhongshen, Chen, Chongpei
Nowadays, foundation models become one of fundamental infrastructures in artificial intelligence, paving ways to the general intelligence. However, the reality presents two urgent challenges: existing foundation models are dominated by the English-language community; users are often given limited resources and thus cannot always use foundation models. To support the development of the Chinese-language community, we introduce an open-source project, called Fengshenbang, which leads by the research center for Cognitive Computing and Natural Language (CCNL). Our project has comprehensive capabilities, including large pre-trained models, user-friendly APIs, benchmarks, datasets, and others. We wrap all these in three sub-projects: the Fengshenbang Model, the Fengshen Framework, and the Fengshen Benchmark. An open-source roadmap, Fengshenbang, aims to re-evaluate the open-source community of Chinese pre-trained large-scale models, prompting the development of the entire Chinese large-scale model community. We also want to build a user-centered open-source ecosystem to allow individuals to access the desired models to match their computing resources. Furthermore, we invite companies, colleges, and research institutions to collaborate with us to build the large-scale open-source model-based ecosystem. We hope that this project will be the foundation of Chinese cognitive intelligence.
Elastic Weight Removal for Faithful and Abstractive Dialogue Generation
Daheim, Nico, Dziri, Nouha, Sachan, Mrinmaya, Gurevych, Iryna, Ponti, Edoardo M.
Ideally, dialogue systems should generate responses that are faithful to the knowledge contained in relevant documents. However, many models generate hallucinated responses instead that contradict it or contain unverifiable information. To mitigate such undesirable behaviour, it has been proposed to fine-tune a `negative expert' on negative examples and subtract its parameters from those of a pre-trained model. However, intuitively, this does not take into account that some parameters are more responsible than others in causing hallucinations. Thus, we propose to weigh their individual importance via (an approximation of) the Fisher Information matrix, which measures the uncertainty of their estimate. We call this method Elastic Weight Removal (EWR). We evaluate our method -- using different variants of Flan-T5 as a backbone language model -- on multiple datasets for information-seeking dialogue generation and compare our method with state-of-the-art techniques for faithfulness, such as CTRL, Quark, DExperts, and Noisy Channel reranking. Extensive automatic and human evaluation shows that EWR systematically increases faithfulness at minor costs in terms of other metrics. However, we notice that only discouraging hallucinations may increase extractiveness, i.e. shallow copy-pasting of document spans, which can be undesirable. Hence, as a second main contribution, we show that our method can be extended to simultaneously discourage hallucinations and extractive responses. We publicly release the code for reproducing EWR and all baselines.
Yes but.. Can ChatGPT Identify Entities in Historical Documents?
González-Gallardo, Carlos-Emiliano, Boros, Emanuela, Girdhar, Nancy, Hamdi, Ahmed, Moreno, Jose G., Doucet, Antoine
Large language models (LLMs) have been leveraged for several years now, obtaining state-of-the-art performance in recognizing entities from modern documents. For the last few months, the conversational agent ChatGPT has "prompted" a lot of interest in the scientific community and public due to its capacity of generating plausible-sounding answers. In this paper, we explore this ability by probing it in the named entity recognition and classification (NERC) task in primary sources (e.g., historical newspapers and classical commentaries) in a zero-shot manner and by comparing it with state-of-the-art LM-based systems. Our findings indicate several shortcomings in identifying entities in historical text that range from the consistency of entity annotation guidelines, entity complexity, and code-switching, to the specificity of prompting. Moreover, as expected, the inaccessibility of historical archives to the public (and thus on the Internet) also impacts its performance.
Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure
Koralus, Philipp, Wang-Maścianica, Vincent
Increase in computational scale and fine-tuning has seen a dramatic improvement in the quality of outputs of large language models (LLMs) like GPT. Given that both GPT-3 and GPT-4 were trained on large quantities of human-generated text, we might ask to what extent their outputs reflect patterns of human thinking, both for correct and incorrect cases. The Erotetic Theory of Reason (ETR) provides a symbolic generative model of both human success and failure in thinking, across propositional, quantified, and probabilistic reasoning, as well as decision-making. We presented GPT-3, GPT-3.5, and GPT-4 with 61 central inference and judgment problems from a recent book-length presentation of ETR, consisting of experimentally verified data-points on human judgment and extrapolated data-points predicted by ETR, with correct inference patterns as well as fallacies and framing effects (the ETR61 benchmark). ETR61 includes classics like Wason's card task, illusory inferences, the decoy effect, and opportunity-cost neglect, among others. GPT-3 showed evidence of ETR-predicted outputs for 59% of these examples, rising to 77% in GPT-3.5 and 75% in GPT-4. Remarkably, the production of human-like fallacious judgments increased from 18% in GPT-3 to 33% in GPT-3.5 and 34% in GPT-4. This suggests that larger and more advanced LLMs may develop a tendency toward more human-like mistakes, as relevant thought patterns are inherent in human-produced training data. According to ETR, the same fundamental patterns are involved both in successful and unsuccessful ordinary reasoning, so that the "bad" cases could paradoxically be learned from the "good" cases. We further present preliminary evidence that ETR-inspired prompt engineering could reduce instances of these mistakes.
Matrix diagonalization and singular value decomposition: Static SageMath and dynamic ChatGPT juxtaposed
We investigated some difficulties that students often face when studying linear algebra at the undergraduate level, and identified some common mistakes and difficulties they often encountered when dealing with topics that require algorithmic thinking skills such as matrix factorization. In particular, we focused on (orthogonal) diagonalization and singular value decomposition (SVD). We also offered the possibility of exploring these topics using SageMath, a Python-based free open software computer algebra system (CAS) that has been identified to be useful for assisting many students in the computational process even though its output is static by nature. We then explored dynamic ChatGPT by inquiring the chatbot about the topic, either by asking to provide an example or to solve a problem, that is by constructing an (orthogonal) diagonalization or SVD from a particular matrix. By consolidating essential concepts in linear algebra and improving computational skills through effective practice, mastering these topics would become easier and mistakes could be minimized. Static SageMath, in particular, is a great aid for calculation confirmation and handling tedious computations. Although dynamic ChatGPT is relatively unreliable for solving problems in linear algebra, the mistakes it produces could become a valuable tool for improving critical thinking skills.
Whose Opinions Do Language Models Reflect?
Santurkar, Shibani, Durmus, Esin, Ladhak, Faisal, Lee, Cinoo, Liang, Percy, Hashimoto, Tatsunori
Language models (LMs) are increasingly being used in open-ended contexts, where the opinions reflected by LMs in response to subjective queries can have a profound impact, both on user satisfaction, as well as shaping the views of society at large. In this work, we put forth a quantitative framework to investigate the opinions reflected by LMs -- by leveraging high-quality public opinion polls and their associated human responses. Using this framework, we create OpinionsQA, a new dataset for evaluating the alignment of LM opinions with those of 60 US demographic groups over topics ranging from abortion to automation. Across topics, we find substantial misalignment between the views reflected by current LMs and those of US demographic groups: on par with the Democrat-Republican divide on climate change. Notably, this misalignment persists even after explicitly steering the LMs towards particular demographic groups. Our analysis not only confirms prior observations about the left-leaning tendencies of some human feedback-tuned LMs, but also surfaces groups whose opinions are poorly reflected by current LMs (e.g., 65+ and widowed individuals). Our code and data are available at https://github.com/tatsu-lab/opinions_qa.