Li, Xingxuan
Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents
Li, Long, Xu, Weiwen, Guo, Jiayan, Zhao, Ruochen, Li, Xingxuan, Yuan, Yuqian, Zhang, Boqiang, Jiang, Yuming, Xin, Yifei, Dang, Ronghao, Zhao, Deli, Rong, Yu, Feng, Tian, Bing, Lidong
Effective research ideation is a critical step for scientific research. However, the exponential increase in scientific literature makes it challenging for researchers to stay current with recent advances and identify meaningful research directions. Recent developments in large language models (LLMs) suggest a promising avenue for automating the generation of novel research ideas. However, existing methods for idea generation either trivially prompt LLMs or directly expose LLMs to extensive literature without indicating useful information. Inspired by the research process of human researchers, we propose a Chain-of-Ideas (CoI) agent, an LLM-based agent that organizes relevant literature in a chain structure to effectively mirror the progressive development in a research domain. This organization enables LLMs to capture the current advancements in research, thereby enhancing their ideation capabilities. Furthermore, we propose Idea Arena, an evaluation protocol that comprehensively evaluates idea-generation methods from different perspectives, aligning closely with the preferences of human researchers. Experimental results indicate that the CoI agent consistently outperforms other methods and generates ideas of quality comparable to that of human researchers. Moreover, our CoI agent is budget-friendly, with a minimum cost of $0.50 to generate a candidate idea and its corresponding experimental design.
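To make the chain organization concrete, here is a minimal, hypothetical sketch (not the released CoI implementation) of how retrieved papers might be ordered into a progressive chain and used to prompt an LLM for a next-step idea; the `llm` callable and the paper fields are assumptions made for this example.

```python
# Hypothetical sketch of a chain-of-ideas workflow (illustrative only; not the
# authors' released code). `llm` stands for any chat-completion call; `papers`
# is assumed to be a list of dicts with "title", "abstract", and "year" fields.

def build_chain(papers, anchor_topic, llm):
    """Order papers into a chain that mirrors the topic's progression."""
    # Sort chronologically so each paper can be framed as building on the last.
    chain = sorted(papers, key=lambda p: p["year"])
    summaries = []
    for prev, curr in zip(chain, chain[1:]):
        prompt = (
            f"Topic: {anchor_topic}\n"
            f"Earlier work: {prev['title']} -- {prev['abstract']}\n"
            f"Later work: {curr['title']} -- {curr['abstract']}\n"
            "In one sentence, state how the later work advances the earlier one."
        )
        summaries.append(llm(prompt))
    return chain, summaries

def generate_idea(chain, summaries, llm):
    """Prompt the LLM with the chain's trajectory to propose a next step."""
    trajectory = "\n".join(summaries)
    return llm(
        "Given this development trajectory:\n"
        f"{trajectory}\n"
        "Propose a novel, concrete research idea that extends this trend."
    )
```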
Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks
Li, Xingxuan, Xu, Weiwen, Zhao, Ruochen, Jiao, Fangkai, Joty, Shafiq, Bing, Lidong
State-of-the-art large language models (LLMs) exhibit impressive problem-solving capabilities but may struggle with complex reasoning and factual correctness. Existing methods harness the strengths of chain-of-thought (CoT) and retrieval-augmented generation (RAG) to decompose a complex problem into simpler steps and apply retrieval to improve factual correctness. These methods work well on straightforward reasoning tasks but often falter on challenging tasks such as competitive programming and mathematics, due to frequent reasoning errors and irrelevant knowledge retrieval. To address this, we introduce Critic-guided planning with Retrieval-augmentation, CR-Planner, a novel framework that leverages fine-tuned critic models to guide both reasoning and retrieval processes through planning. CR-Planner solves a problem by iteratively selecting and executing sub-goals. Initially, it identifies the most promising sub-goal from reasoning, query generation, and retrieval, guided by rewards given by a critic model named the sub-goal critic. It then executes this sub-goal through sampling and selecting the optimal output based on evaluations from another critic model named the execution critic. This iterative process, informed by retrieved information and critic models, enables CR-Planner to effectively navigate the solution space towards the final answer. We employ Monte Carlo Tree Search (MCTS) to collect the data for training the critic models, allowing for a systematic exploration of action sequences and their long-term impacts. Our experiments demonstrate that CR-Planner significantly outperforms baselines, highlighting its effectiveness in addressing challenging problems by improving both reasoning and retrieval.
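The select-then-execute loop described above can be illustrated with a short sketch; the critic scorers, `llm`, and `retrieve` callables below are assumed placeholders for this example, not the authors' released code.

```python
# Illustrative sketch of a critic-guided select-and-execute loop in the spirit
# of CR-Planner (not the authors' implementation). `sub_goal_critic`,
# `execution_critic`, `llm`, and `retrieve` are assumed callables.

SUB_GOALS = ["reason", "generate_query", "retrieve"]

def cr_planner_step(state, llm, retrieve, sub_goal_critic, execution_critic, k=4):
    # 1) Sub-goal selection: pick the action the sub-goal critic scores highest.
    goal = max(SUB_GOALS, key=lambda g: sub_goal_critic(state, g))

    # 2) Sub-goal execution: sample k candidate outputs and keep the one the
    #    execution critic prefers.
    if goal == "retrieve":
        candidates = [retrieve(state["query"]) for _ in range(k)]
    else:
        candidates = [llm(state, goal) for _ in range(k)]
    best = max(candidates, key=lambda c: execution_critic(state, goal, c))

    # 3) Fold the result back into the running solution state.
    state["history"].append((goal, best))
    if goal == "generate_query":
        state["query"] = best
    return state

def solve(question, llm, retrieve, sub_goal_critic, execution_critic, max_steps=8):
    state = {"question": question, "query": question, "history": []}
    for _ in range(max_steps):
        state = cr_planner_step(state, llm, retrieve,
                                sub_goal_critic, execution_critic)
    return llm(state, "final_answer")
```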
ParaICL: Towards Robust Parallel In-Context Learning
Li, Xingxuan, Nguyen, Xuan-Phi, Joty, Shafiq, Bing, Lidong
Large language models (LLMs) have become the norm in natural language processing (NLP), excelling at few-shot in-context learning (ICL) thanks to their remarkable abilities. Nonetheless, the success of ICL largely hinges on the choice of few-shot demonstration examples, making the selection process increasingly crucial. Existing methods have delved into optimizing the quantity and semantic similarity of these examples to improve ICL performance. However, our preliminary experiments indicate that the effectiveness of ICL is limited by the length of the input context. Moreover, varying combinations of few-shot demonstration examples can significantly boost accuracy across different test samples. To address this, we propose a novel method named parallel in-context learning (ParaICL) that effectively utilizes all demonstration examples without exceeding the manageable input context length. ParaICL employs parallel batching to distribute demonstration examples into different batches according to the semantic similarities of the questions in the demonstrations to the test question. It then computes normalized batch semantic scores for each batch. A weighted average semantic objective, constrained by adaptive plausibility, is applied to select the most appropriate tokens. Through extensive experiments, we validate the effectiveness of ParaICL and conduct ablation studies to underscore its design rationale. We further demonstrate that ParaICL can seamlessly integrate with existing methods.
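A rough illustration of weighted parallel-batch decoding in the spirit of ParaICL is sketched below; the exact adaptive-plausibility criterion, the shape conventions, and the `alpha` threshold are assumptions made for this example, not the paper's specification.

```python
import numpy as np

# Illustrative sketch of combining per-batch next-token distributions with
# normalized batch semantic scores (not the released ParaICL implementation).
# `logprobs_per_batch`: next-token log-probabilities from running the model
# once per demonstration batch, shape (num_batches, vocab).
# `batch_sims`: each batch's semantic similarity score to the test question.

def select_next_token(logprobs_per_batch, batch_sims, alpha=0.1):
    probs = np.exp(np.asarray(logprobs_per_batch))       # (num_batches, vocab)
    weights = np.asarray(batch_sims, dtype=float)
    weights = weights / weights.sum()                     # normalized batch scores

    # Weighted average semantic objective over the parallel batches.
    mixture = (weights[:, None] * probs).sum(axis=0)

    # Adaptive plausibility (assumed form): keep only tokens that at least one
    # batch already finds reasonably likely relative to the overall maximum.
    plausible = probs.max(axis=0) >= alpha * probs.max()
    mixture = np.where(plausible, mixture, 0.0)
    return int(mixture.argmax())
```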
How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library
Ravaut, Mathieu, Ding, Bosheng, Jiao, Fangkai, Chen, Hailin, Li, Xingxuan, Zhao, Ruochen, Qin, Chengwei, Xiong, Caiming, Joty, Shafiq
With the rise of Large Language Models (LLMs) in recent years, new opportunities are emerging, but also new challenges, and data contamination is quickly becoming a critical concern. Business applications and fundraising in AI have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into dozens of millions of dollars, placing high pressure on model integrity. At the same time, it is becoming harder and harder, if not impossible, to keep track of the data that LLMs have seen, as closed-source models like GPT-4 and Claude-3 divulge no information about their training sets. As a result, contamination becomes a critical issue: LLMs' performance may no longer be reliable, as their high performance may be at least partly due to previous exposure to the data. This limitation jeopardizes progress in the field of NLP, yet there remains a lack of methods to efficiently address contamination and no clear consensus on its prevention, mitigation, and classification.
ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?
Chen, Hailin, Jiao, Fangkai, Li, Xingxuan, Qin, Chengwei, Ravaut, Mathieu, Zhao, Ruochen, Xiong, Caiming, Joty, Shafiq
Since its release in late 2022, ChatGPT has brought a seismic shift to the entire landscape of AI, both in research and commerce. Through instruction-tuning a large language model (LLM) with supervised fine-tuning and reinforcement learning from human feedback, it showed that a model could answer human questions and follow instructions on a broad panel of tasks. Following this success, interest in LLMs has intensified, with new LLMs flourishing at frequent intervals across academia and industry, including many start-ups focused on LLMs. While closed-source LLMs (e.g., OpenAI's GPT, Anthropic's Claude) generally outperform their open-source counterparts, progress on the latter has been rapid, with claims of achieving parity or even superiority on certain tasks. This has crucial implications not only for research but also for business. In this work, on the first anniversary of ChatGPT, we provide an exhaustive overview of this success, surveying all tasks where an open-source LLM has claimed to be on par with or better than ChatGPT.
Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources
Li, Xingxuan, Zhao, Ruochen, Chia, Yew Ken, Ding, Bosheng, Joty, Shafiq, Poria, Soujanya, Bing, Lidong
We present chain-of-knowledge (CoK), a novel framework that augments large language models (LLMs) by dynamically incorporating grounding information from heterogeneous sources. This results in more factual rationales and reduced hallucination in generation. Specifically, CoK consists of three stages: reasoning preparation, dynamic knowledge adapting, and answer consolidation. Given a knowledge-intensive question, CoK first prepares several preliminary rationales and answers while identifying the relevant knowledge domains. If there is no majority consensus among the sampled answers, CoK corrects the rationales step by step by adapting knowledge from the identified domains. These corrected rationales can plausibly serve as a better foundation for the final answer consolidation. Unlike prior studies that primarily use unstructured data, CoK also leverages structured knowledge sources such as Wikidata and tables that provide more reliable factual information. To access both unstructured and structured knowledge sources in the dynamic knowledge adapting stage, we propose an adaptive query generator that allows the generation of queries for various types of query languages, including SPARQL, SQL, and natural sentences. Moreover, to minimize error propagation between rationales, CoK corrects the rationales progressively, using preceding corrected rationales to generate and correct subsequent rationales. Extensive experiments show that CoK consistently improves the performance of LLMs on knowledge-intensive tasks across different domains.
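The three stages can be outlined in a short, hypothetical sketch (not the authors' implementation); `sample_rationale`, `identify_domains`, `query_knowledge`, and `llm` are assumed helpers supplied by the user.

```python
from collections import Counter

# Illustrative sketch of the three CoK stages (not the released code).
# `sample_rationale(question)` is assumed to return (rationale_steps, answer);
# `query_knowledge(query, domains)` is assumed to dispatch the query to the
# right source (e.g. SPARQL over Wikidata, SQL over tables, or text search).

def chain_of_knowledge(question, llm, sample_rationale, identify_domains,
                       query_knowledge, n_samples=5):
    # Stage 1: reasoning preparation -- sample preliminary rationales/answers
    # and identify the relevant knowledge domains.
    samples = [sample_rationale(question) for _ in range(n_samples)]
    answers = [answer for _, answer in samples]
    majority, count = Counter(answers).most_common(1)[0]
    if count > n_samples // 2:
        return majority                      # consensus reached: no correction
    domains = identify_domains(question)
    steps, _ = samples[0]

    # Stage 2: dynamic knowledge adapting -- correct rationales progressively,
    # conditioning each correction on the steps already corrected.
    corrected = []
    for step in steps:
        query = llm(f"Write a query (SPARQL, SQL, or natural language) to verify: {step}")
        evidence = query_knowledge(query, domains)
        corrected.append(llm(
            f"Corrected steps so far: {corrected}\nCurrent step: {step}\n"
            f"Evidence: {evidence}\nRewrite the current step to agree with the evidence."
        ))

    # Stage 3: answer consolidation from the corrected rationale chain.
    return llm(f"Question: {question}\nRationale: {' '.join(corrected)}\nFinal answer:")
```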
Retrieving Multimodal Information for Augmented Generation: A Survey
Zhao, Ruochen, Chen, Hailin, Wang, Weishi, Jiao, Fangkai, Do, Xuan Long, Qin, Chengwei, Ding, Bosheng, Guo, Xiaobao, Li, Minzhi, Li, Xingxuan, Joty, Shafiq
As Large Language Models (LLMs) become popular, an important trend has emerged of using multimodality to augment LLMs' generation ability, which enables LLMs to better interact with the world. However, there is no unified understanding of at which stage, and how, to incorporate different modalities. In this survey, we review methods that assist and augment generative models by retrieving multimodal knowledge, whose formats range from images, code, tables, and graphs to audio. Such methods offer a promising solution to important concerns such as factuality, reasoning, interpretability, and robustness. By providing an in-depth review, this survey is expected to give scholars a deeper understanding of the methods' applications and encourage them to adapt existing techniques to the fast-growing field of LLMs.
Is GPT-4 a Good Data Analyst?
Cheng, Liying, Li, Xingxuan, Bing, Lidong
As large language models (LLMs) have demonstrated powerful capabilities across numerous domains and tasks, including context understanding, code generation, language generation, and data storytelling, many data analysts may worry that their jobs will be replaced by artificial intelligence (AI). This controversial topic has drawn great public attention. However, opinions remain divided, and no definitive conclusion has been reached. Motivated by this, we raise the research question "is GPT-4 a good data analyst?" in this work and aim to answer it by conducting head-to-head comparative studies. In detail, we regard GPT-4 as a data analyst performing end-to-end data analysis with databases from a wide range of domains. We propose a framework that carefully designs prompts for GPT-4 to conduct the experiments. We also design several task-specific evaluation metrics to systematically compare the performance of several professional human data analysts and GPT-4. Experimental results show that GPT-4 can achieve performance comparable to humans. We also provide in-depth discussion of our results to shed light on the further studies needed before concluding that GPT-4 can replace data analysts.
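As a rough illustration of such an end-to-end pipeline, the sketch below outlines one plausible decomposition (SQL generation, charting, analysis writing); the prompts, the `llm` and `run_sql` callables, and the two-column data assumption are hypothetical, not the paper's exact setup.

```python
import matplotlib.pyplot as plt

# Hypothetical outline of an "LLM as data analyst" pipeline (illustrative only;
# not the paper's prompts or code). `llm` is any chat-completion call and
# `run_sql` executes a query against the database, returning rows.

def analyze(question, schema, llm, run_sql):
    # Step 1: ask the model for a SQL query grounded in the database schema.
    sql = llm(f"Schema:\n{schema}\nWrite one SQL query to answer: {question}")
    rows = run_sql(sql)

    # Step 2: chart the extracted data (assumes two-column label/value rows).
    labels, values = zip(*rows)
    plt.bar(labels, values)
    plt.savefig("figure.png")

    # Step 3: ask the model for written insights about the extracted data.
    return llm(f"Question: {question}\nData: {rows}\nWrite five analysis bullet points.")
```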
Towards Robust Low-Resource Fine-Tuning with Multi-View Compressed Representations
Liu, Linlin, Li, Xingxuan, Thakkar, Megh, Li, Xin, Joty, Shafiq, Si, Luo, Bing, Lidong
Due to their huge number of parameters, fine-tuning of pretrained language models (PLMs) is prone to overfitting in low-resource scenarios. In this work, we present a novel method that operates on the hidden representations of a PLM to reduce overfitting. During fine-tuning, our method inserts random autoencoders between the hidden layers of a PLM, which transform activations from the previous layers into multi-view compressed representations before feeding them into the upper layers. The autoencoders are plugged out after fine-tuning, so our method does not add extra parameters or increase computation cost during inference. Our method demonstrates promising performance improvements across a wide range of sequence- and token-level low-resource NLP tasks.
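A minimal PyTorch sketch of inserting a bottleneck autoencoder between transformer layers during fine-tuning, in the spirit of the method above, is shown below; the dimensions, the single-view setup, and the HuggingFace-style tuple outputs are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch (illustrative assumptions; not the authors' implementation):
# a bottleneck autoencoder compresses and reconstructs hidden states during
# training only, and is skipped (i.e. "plugged out") at inference time.

class BottleneckAE(nn.Module):
    def __init__(self, hidden_dim=768, compressed_dim=128):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, compressed_dim)
        self.decoder = nn.Linear(compressed_dim, hidden_dim)

    def forward(self, hidden_states):
        # Compress then reconstruct so upper layers see a compressed view.
        return self.decoder(torch.relu(self.encoder(hidden_states)))

class AEWrappedLayer(nn.Module):
    """Wraps one transformer layer; applies the autoencoder only in training."""
    def __init__(self, layer, autoencoder):
        super().__init__()
        self.layer = layer
        self.autoencoder = autoencoder

    def forward(self, hidden_states, *args, **kwargs):
        outputs = self.layer(hidden_states, *args, **kwargs)
        if not self.training:                  # plugged out at inference time
            return outputs
        if isinstance(outputs, tuple):         # HuggingFace layers return tuples
            return (self.autoencoder(outputs[0]),) + outputs[1:]
        return self.autoencoder(outputs)
```

Because the wrapper applies the autoencoder only when the module is in training mode, removing it after fine-tuning leaves inference cost unchanged, matching the plug-out behavior described in the abstract.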
Unlocking Temporal Question Answering for Large Language Models Using Code Execution
Li, Xingxuan, Cheng, Liying, Tan, Qingyu, Ng, Hwee Tou, Joty, Shafiq, Bing, Lidong
Large language models (LLMs) have made significant progress in natural language processing (NLP) and are utilized extensively in various applications. Recent works, such as chain-of-thought (CoT) prompting, have shown that intermediate reasoning steps can improve the performance of LLMs on complex reasoning tasks, such as math problems and symbolic question answering. However, we observe that LLMs face challenges with temporal reasoning. Our preliminary experiments show that generating intermediate reasoning steps does not always boost performance on complex temporal question-answering tasks. Therefore, we propose a novel framework that combines the extraction capability of LLMs with the logical reasoning capability of a Python solver to tackle this issue. Extensive experiments and analysis demonstrate the effectiveness of our framework in handling intricate time-bound reasoning tasks.
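The division of labor between LLM extraction and programmatic date arithmetic can be illustrated with a small sketch; the prompt, the JSON schema, and the `llm` callable are assumptions made for this example, not the framework's actual interface.

```python
import datetime
import json

# Illustrative sketch (not the framework's code): the LLM only extracts time
# expressions as structured JSON, and Python performs the temporal arithmetic.

def answer_duration_question(context, question, llm):
    # Step 1: let the LLM extract the relevant time spans as JSON.
    extraction = llm(
        f"Context: {context}\nQuestion: {question}\n"
        'Return JSON like {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"}.'
    )
    spans = json.loads(extraction)

    # Step 2: do the actual date arithmetic in Python, not in the LLM.
    start = datetime.date.fromisoformat(spans["start"])
    end = datetime.date.fromisoformat(spans["end"])
    return (end - start).days

# Example usage with a stubbed LLM call:
fake_llm = lambda prompt: '{"start": "2008-05-01", "end": "2012-09-15"}'
print(answer_duration_question("...", "How many days did X hold office?", fake_llm))
```

Delegating the subtraction to `datetime` keeps the arithmetic exact, which is the motivation for pairing the LLM's extraction ability with a programmatic solver.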