WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences
Liu, Xiao, Lai, Hanyu, Yu, Hao, Xu, Yifan, Zeng, Aohan, Du, Zhengxiao, Zhang, Peng, Dong, Yuxiao, Tang, Jie
We present WebGLM, a web-enhanced question-answering system based on the General Language Model (GLM). Its goal is to augment a pre-trained large language model (LLM) with web search and retrieval capabilities while remaining efficient for real-world deployment. To achieve this, we develop WebGLM with strategies for an LLM-augmented retriever, a bootstrapped generator, and a human preference-aware scorer. Specifically, we identify and address the limitations of WebGPT (OpenAI); addressing them gives WebGLM advantages in accuracy, efficiency, and cost-effectiveness. In addition, we propose systematic criteria for evaluating web-enhanced QA systems. Multi-dimensional human evaluation and quantitative ablation studies show that the proposed WebGLM designs outperform existing systems. In human evaluation, WebGLM with the 10-billion-parameter GLM (10B) performs better than the similarly sized WebGPT (13B) and even comparably to WebGPT (175B). The code, demo, and data are at \url{https://github.com/THUDM/WebGLM}.
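The three components named in the abstract (LLM-augmented retriever, bootstrapped generator, human preference-aware scorer) can be wired together as a simple retrieve-generate-rank pipeline. Below is a minimal, runnable sketch of that flow; every name and stand-in function is hypothetical, not the actual WebGLM API.

```python
# Illustrative pipeline in the spirit of WebGLM's three components:
# retriever -> generator -> preference scorer. All names below are
# hypothetical stand-ins, not the actual WebGLM code.
from typing import Callable, List

def answer_question(
    question: str,
    retrieve: Callable[[str], List[str]],             # LLM-augmented retriever
    generate: Callable[[str, List[str]], List[str]],  # bootstrapped generator
    score: Callable[[str, str], float],               # human preference-aware scorer
) -> str:
    """Retrieve web references, draft candidate answers, return the preferred one."""
    references = retrieve(question)
    candidates = generate(question, references)
    return max(candidates, key=lambda ans: score(question, ans))

# Toy stand-ins so the sketch runs end to end:
refs = lambda q: ["ref A", "ref B"]
gen = lambda q, r: [f"answer citing {x}" for x in r]
sc = lambda q, a: a.count("B")  # dummy preference score
print(answer_question("What is WebGLM?", refs, gen, sc))  # prints "answer citing ref B"
```

The point of the structure is that each stage is swappable: the scorer trained on human preferences ranks whatever the generator produces, so improving any one component does not require retraining the others.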
Language models that can search the web hold promise -- but also raise concerns
Language models -- AI systems that can be prompted to write essays and emails, answer questions, and more -- remain flawed in many ways. Because they "learn" to write from examples on the web, including problematic social media posts, they're prone to generating misinformation, conspiracy theories, and racist, sexist, or otherwise toxic language. Another major limitation of many of today's language models is that they're "stuck in time": because they're trained once on a large collection of text from the web, the knowledge of the world they gain from that collection can quickly become outdated depending on when they were deployed.
WebGPT: Browser-assisted question-answering with human feedback
Nakano, Reiichiro, Hilton, Jacob, Balaji, Suchir, Wu, Jeff, Ouyang, Long, Kim, Christina, Hesse, Christopher, Jain, Shantanu, Kosaraju, Vineet, Saunders, William, Jiang, Xu, Cobbe, Karl, Eloundou, Tyna, Krueger, Gretchen, Button, Kevin, Knight, Matthew, Chess, Benjamin, Schulman, John
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
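The "rejection sampling against a reward model" step described in this abstract is a best-of-n selection: draw several candidate answers, then keep the one the reward model scores highest. A minimal sketch follows; the sampler and reward model here are deterministic toy stand-ins, not OpenAI's models.

```python
# Minimal best-of-n rejection sampling sketch, per the WebGPT abstract:
# sample n candidate answers, keep the one the reward model prefers.
# The sampler and reward model below are hypothetical toy stand-ins.
from typing import Callable

def best_of_n(
    question: str,
    sample_answer: Callable[[str, int], str],  # draws the i-th candidate answer
    reward: Callable[[str, str], float],       # reward model trained on human preferences
    n: int = 4,
) -> str:
    candidates = [sample_answer(question, i) for i in range(n)]
    return max(candidates, key=lambda a: reward(question, a))

# Deterministic toy stand-ins:
drafts = ["short", "a much longer, referenced draft", "middling answer", "ok"]
sample = lambda q, i: drafts[i]
reward = lambda q, a: float(len(a))  # dummy reward: prefer longer answers
print(best_of_n("Why is the sky blue?", sample, reward, n=4))
# prints "a much longer, referenced draft"
```

In the actual system the reward model is learned from human preference comparisons, so best-of-n trades extra inference compute (n samples instead of one) for answers humans are more likely to prefer.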