
Collaborating Authors

 Ding, Bosheng


StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

arXiv.org Artificial Intelligence

The rapid development of large language models (LLMs) necessitates robust, unbiased, and scalable methods for evaluating their capabilities. However, human annotation is expensive to scale, model-based evaluation is prone to answer-style biases, and target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs on their ability to produce compositionally specified structured outputs as an unbiased, cheap-to-run, and difficult-to-cheat measure. The evaluation is performed deterministically by a rule-based evaluator, which can be easily extended to new tasks. By testing structured outputs across diverse task domains, including summarization, code, HTML, and math, we demonstrate that StructTest serves as a good proxy for general reasoning ability, since producing structured outputs often requires internal logical reasoning. We believe that StructTest offers a critical, complementary approach to objective and robust model evaluation.
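The deterministic evaluation idea is easy to illustrate. Below is a minimal, hypothetical rule-based checker in the spirit of StructTest, not the benchmark's actual code; the format instruction being verified, the function name, and the parameters are all invented for illustration:

```python
def check_bullet_summary(output: str, n_bullets: int = 3, max_words: int = 20) -> bool:
    """Deterministically verify a compositional format instruction:
    exactly `n_bullets` lines, each a '- ' bullet of at most `max_words` words."""
    lines = [line for line in output.strip().splitlines() if line.strip()]
    if len(lines) != n_bullets:
        return False
    for line in lines:
        # Every bullet must use the requested prefix and respect the word cap.
        if not line.startswith("- ") or len(line[2:].split()) > max_words:
            return False
    return True

print(check_bullet_summary("- one\n- two\n- three"))  # True
print(check_bullet_summary("1. one\n2. two"))         # False
```

Because the check is a pure string-level pass/fail, it needs no judge model, which is what makes the metric unbiased, cheap to run, and hard to game.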


Relevant or Random: Can LLMs Truly Perform Analogical Reasoning?

arXiv.org Artificial Intelligence

Analogical reasoning is a unique human ability to address unfamiliar challenges by transferring strategies from relevant past experiences. One key finding in psychology is that, compared with irrelevant past experiences, recalling relevant ones helps humans better handle new tasks. Coincidentally, the NLP community has recently found that self-generating relevant examples in the context can help large language models (LLMs) solve a given problem better than hand-crafted prompts. However, it is not yet clear whether relevance is the key factor eliciting this capability, i.e., can LLMs benefit more from self-generated relevant examples than from irrelevant ones? In this work, we systematically explore whether LLMs can truly perform analogical reasoning on a diverse set of reasoning tasks. With extensive experiments and analysis, we show that self-generated random examples can surprisingly achieve comparable or even better performance, e.g., a 4% performance boost on GSM8K with random biological examples. We find that the accuracy of self-generated examples is the key factor and subsequently design two improved methods with significantly reduced inference costs. Overall, we aim to advance a deeper understanding of LLM analogical reasoning and hope this work stimulates further research into the design of self-generated contexts.
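Operationally, the comparison the paper runs reduces to changing one field of a two-step prompt. A minimal sketch, assuming a plain-text prompting interface; the function name and prompt wording below are illustrative, not the paper's exact prompts:

```python
def build_prompt(problem: str, example_topic: str) -> str:
    """Two-step 'analogical' prompt: the model first self-generates worked
    examples on `example_topic`, then solves the target problem."""
    return (
        f"Step 1: Recall three worked examples of {example_topic} problems, "
        "writing out each solution in full.\n"
        f"Step 2: Using the same style, solve this problem step by step:\n{problem}"
    )

question = "A train travels 60 km in 1.5 hours. What is its average speed?"
relevant_prompt = build_prompt(question, "math word")  # topically relevant examples
random_prompt = build_prompt(question, "biology")      # random-domain examples
```

Swapping `example_topic` between relevant and random domains while holding everything else fixed is what isolates relevance as the variable under test.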


How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library

arXiv.org Artificial Intelligence

With the rise of Large Language Models (LLMs) in recent years, new opportunities are emerging, but also new challenges, and contamination is quickly becoming critical. Business applications and fundraising in AI have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into tens of millions of dollars, placing high pressure on model integrity. At the same time, it is becoming harder and harder, if not impossible, to keep track of the data that LLMs have seen, since closed-source models like GPT-4 and Claude-3 divulge no information about their training sets. As a result, contamination becomes a critical issue: LLMs' performance may no longer be reliable, as their high performance may be at least partly due to previous exposure to the evaluation data. This limitation jeopardizes overall progress in the field of NLP, yet there remains a lack of methods to efficiently address contamination and no clear consensus on its prevention, mitigation, and classification.
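As a concrete illustration, here is one common string-matching contamination check of the kind such surveys cover: flag a benchmark item that shares a long n-gram with a training document. This is a generic sketch of the heuristic, not the LLMSanitize library's API:

```python
def ngrams(text: str, n: int) -> set:
    """All word-level n-grams of `text`, lowercased."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shares_long_ngram(benchmark_item: str, training_doc: str, n: int = 8) -> bool:
    """Verbatim-leakage heuristic: any shared long n-gram is strong evidence
    that the benchmark text appeared in the training data."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))
```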


Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources

arXiv.org Artificial Intelligence

We present Chain-of-Knowledge (CoK), a framework that grounds large language models by dynamically adapting knowledge from heterogeneous sources. It results in more factual rationales and reduced hallucination in generation. Specifically, CoK consists of three stages: reasoning preparation, dynamic knowledge adapting, and answer consolidation. Given a knowledge-intensive question, CoK first prepares several preliminary rationales and answers while identifying the relevant knowledge domains. If there is no majority consensus among the sampled answers, CoK corrects the rationales step by step by adapting knowledge from the identified domains. These corrected rationales can plausibly serve as a better foundation for the final answer consolidation. Unlike prior studies that primarily use unstructured data, CoK also leverages structured knowledge sources such as Wikidata and tables, which provide more reliable factual information. To access both unstructured and structured knowledge sources in the dynamic knowledge adapting stage, we propose an adaptive query generator that supports multiple query languages, including SPARQL, SQL, and natural sentences. Moreover, to minimize error propagation between rationales, CoK corrects the rationales progressively, using preceding corrected rationales to generate and correct subsequent ones. Extensive experiments show that CoK consistently improves the performance of LLMs on knowledge-intensive tasks across different domains. In recent years, large language models (LLMs) such as ChatGPT (OpenAI, 2023) have demonstrated impressive language generation capabilities (Cheng et al., 2023; Ding et al., 2023). However, one major challenge of LLMs lies in hallucination: their tendency to confidently generate plausible but factually incorrect texts (Ji et al., 2023). As shown in Figure 1, given a question that requires factual knowledge to answer, such as "What year was the Argentine actor who directed El Tio Disparate born?", even the most advanced LLMs often provide an incorrect answer. While LLMs have the remarkable capability to recall information from their training data, effectively updating or controlling the factual knowledge within these models remains challenging (Luo et al., 2023). A promising direction for addressing hallucination is to augment LLMs with external knowledge (Mialon et al., 2023). These methods pair the LLM with a retrieval system that uses external factual knowledge to guide the generation process: instead of relying solely on the internal knowledge acquired during training, the model can fetch relevant information from external sources. We will make our code and data publicly available.
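The three-stage pipeline can be summarized in a short skeleton. This is a paraphrase of the stages described above, with hypothetical helpers `llm(prompt)` (returns text) and `retrieve(domain, claim)` (returns facts via SPARQL, SQL, or natural-language queries); it is not the authors' released code:

```python
from collections import Counter

def chain_of_knowledge(question: str, llm, retrieve, k: int = 5) -> str:
    # Stage 1: reasoning preparation -- sample k rationale/answer pairs
    # (domain identification is elided in this sketch).
    samples = [llm(f"Answer step by step:\n{question}") for _ in range(k)]
    answers = [s.strip().splitlines()[-1] for s in samples]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes > k // 2:
        return answer  # majority consensus: no knowledge adapting needed

    # Stage 2: dynamic knowledge adapting -- correct the rationale step by
    # step, conditioning each correction on the previously corrected steps.
    corrected = []
    for step in samples[0].strip().splitlines():
        facts = retrieve("identified_domain", step)
        prior = "\n".join(corrected)  # progressive correction against prior steps
        corrected.append(llm(f"Previously corrected steps:\n{prior}\n"
                             f"Revise this step using the facts {facts}:\n{step}"))

    # Stage 3: answer consolidation from the fully corrected rationale.
    return llm("Rationale:\n" + "\n".join(corrected) +
               f"\nFinal answer to: {question}")
```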


Retrieving Multimodal Information for Augmented Generation: A Survey

arXiv.org Artificial Intelligence

As Large Language Models (LLMs) become popular, an important trend has emerged of using multimodality to augment LLMs' generation ability, enabling LLMs to better interact with the world. However, there is no unified understanding of at which stage and how different modalities should be incorporated. In this survey, we review methods that assist and augment generative models by retrieving multimodal knowledge, whose formats range from images, code, tables, and graphs to audio. Such methods offer a promising solution to important concerns such as factuality, reasoning, interpretability, and robustness. By providing an in-depth review, this survey is expected to give scholars a deeper understanding of the methods' applications and encourage them to adapt existing techniques to the fast-growing field of LLMs.
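Many of the surveyed methods share a retrieve-then-generate skeleton in which non-text items are serialized to text (captions for images, linearized rows for tables) before prompting. A toy sketch under that assumption, with word overlap standing in for a real retriever and `llm` for any generator:

```python
from dataclasses import dataclass

@dataclass
class Item:
    modality: str  # "image", "code", "table", "graph", "audio", ...
    text: str      # caption or linearization used for retrieval and prompting

def rag_answer(query: str, items: list[Item], llm, k: int = 3) -> str:
    """Retrieve the k items most lexically similar to the query and
    condition generation on their text forms."""
    def overlap(item: Item) -> int:
        return len(set(query.lower().split()) & set(item.text.lower().split()))
    top = sorted(items, key=overlap, reverse=True)[:k]
    context = "\n".join(f"[{item.modality}] {item.text}" for item in top)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```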


Is GPT-3 a Good Data Annotator?

arXiv.org Artificial Intelligence

The democratization of artificial intelligence (AI) (Garvey, 2018; Rubeis et al., 2022) aims to provide access to AI technologies to all members of society, including individuals, small- and medium-sized enterprises (SMEs), academic research labs, and nonprofit organizations. Achieving this goal is crucial for the promotion of innovation, economic growth, and fairness and equality. As typical AI models are usually data-hungry, one significant obstacle to AI democratization is the preparation of high-quality labeled data. Evaluations show that GPT-3 has gained, through pretraining, a surprisingly wide range of knowledge, which can be transferred to downstream tasks through knowledge distillation (Kim et al., 2022); we present some examples in Appendix A.12. Due to the model architecture and pretraining tasks designed for auto-regressive generation, GPT-3 is capable of generating human-like text and performing a broad array of NLP tasks, such as machine translation, summarization, and question answering.
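The annotation setup the paper studies boils down to a labeling loop over unlabeled text. A minimal sketch, with `llm(prompt)` standing in for a GPT-3 completion call; the prompt wording and function names are illustrative, not the paper's:

```python
def annotate(texts: list[str], labels: list[str], llm) -> list[str]:
    """Zero-shot annotation: ask the model to choose one label per text,
    producing synthetic training data for a downstream model."""
    annotations = []
    for text in texts:
        prompt = (f"Classify the text into one of {labels}.\n"
                  f"Text: {text}\nLabel:")
        annotations.append(llm(prompt).strip())
    return annotations
```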


LogicLLM: Exploring Self-supervised Logic-enhanced Training for Large Language Models

arXiv.org Artificial Intelligence

Existing efforts to improve the logical reasoning ability of language models have predominantly relied on supervised fine-tuning, hindering generalization to new domains and/or tasks. The development of Large Language Models (LLMs) has demonstrated the capacity to compress abundant knowledge into a single proxy, enabling them to tackle multiple tasks effectively. Our preliminary experiments, nevertheless, show that LLMs do not exhibit strong logical reasoning capability: their performance on logical reasoning benchmarks is far behind existing state-of-the-art baselines. In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training and activating it via in-context learning, which we term LogicLLM. Specifically, we devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter sizes ranging from 3 billion to 13 billion. The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM. Besides, we conduct extensive ablation studies to analyze the key factors in designing logic-oriented proxy tasks.
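Mechanically, the post-training step is just continued auto-regressive training on logically linked text; the logic-specific part (how MERIt-style training instances are mined) is elided here. A minimal sketch using Hugging Face transformers, with GPT-2 as a small causal-LM stand-in for the FLAN-T5/LLaMA models actually used:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# One self-supervised instance: the model learns to continue a premise
# with its logical consequence.
premise = "All metals conduct electricity. Copper is a metal."
consequence = "Therefore, copper conducts electricity."

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer(premise + " " + consequence, return_tensors="pt").input_ids
loss = model(input_ids=ids, labels=ids).loss  # standard causal-LM loss
loss.backward()  # one post-training step; optimizer update omitted
```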


Panda LLM: Training Data and Evaluation for Open-Sourced Chinese Instruction-Following Large Language Models

arXiv.org Artificial Intelligence

Large language models have gained immense popularity due to their exceptional versatility in various natural language processing tasks such as code writing and article editing, making them ubiquitous in various industries and significantly enhancing people's productivity (Ding et al., 2022; Zhao et al., 2023). However, there are limitations to current off-the-shelf instruction-following large language models, including a lack of trustworthiness in generated results, a lack of transparency in the model used, which raises concerns about data security, and an unknown training recipe. Our Panda LLM has been trained on the Chinese-Wiki-2019, Chinese-News-2016, Chinese-Baike-2018, Chinese-Webtext-2019, and Translation-2019 (Xu, 2019) and COIG (Zhang et al., 2023) datasets with instruction tuning (Wei et al., 2021), based on the LLaMA model (Touvron et al., 2023). It is also the first released LLM of the Dandelion Project. Anticipated future releases include progressively larger models such as Panda-13B and Panda-33B, with expected release dates in the near future.
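Instruction tuning of this kind typically starts by flattening each (instruction, input, output) record into a single training string. A sketch of one plausible Alpaca-style template; the report's exact prompt format is not given here, so the field names and wording are assumptions:

```python
def format_example(example: dict) -> str:
    """Flatten an instruction-tuning record into one training string."""
    if example.get("input"):
        return (f"Instruction: {example['instruction']}\n"
                f"Input: {example['input']}\n"
                f"Response: {example['output']}")
    return (f"Instruction: {example['instruction']}\n"
            f"Response: {example['output']}")

print(format_example({
    "instruction": "Translate the text to Chinese.",
    "input": "Good morning.",
    "output": "早上好。",
}))
```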


Can ChatGPT-like Generative Models Guarantee Factual Accuracy? On the Mistakes of New Generation Search Engines

arXiv.org Artificial Intelligence

Although large conversational AI models such as OpenAI's ChatGPT have demonstrated great potential, we question whether such models can guarantee factual accuracy. Recently, technology companies such as Microsoft and Google have announced new services which aim to combine search engines with conversational AI. However, we have found numerous mistakes in the public demonstrations that suggest we should not easily trust the factual claims of the AI models. Rather than criticizing specific models or companies, we hope to call on researchers and developers to improve AI models' transparency and factual correctness.


DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks

arXiv.org Artificial Intelligence

Data augmentation techniques have been widely used to improve machine learning performance, as they enhance the generalization capability of models. In this work, to generate high-quality synthetic data for low-resource tagging tasks, we propose a novel augmentation method based on language models trained on linearized labeled sentences. Our method is applicable to both supervised and semi-supervised settings. For the supervised settings, we conduct extensive experiments on named entity recognition (NER), part-of-speech (POS) tagging, and end-to-end target-based sentiment analysis (E2E-TBSA) tasks. For the semi-supervised settings, we evaluate our method on the NER task under the conditions of unlabeled data only and unlabeled data plus a knowledge base. The results show that our method consistently outperforms the baselines, particularly when the available gold training data are scarce.
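The core trick is the linearization: label tokens are spliced into the sentence so that an ordinary language model learns words and tags jointly. A minimal sketch of one such scheme (inserting each non-O tag before its word), which follows the paper's description but simplifies the details:

```python
def linearize(tokens: list[str], tags: list[str]) -> str:
    """Insert each non-O tag immediately before its word, turning a labeled
    sentence into a plain token sequence a language model can be trained on."""
    sequence = []
    for token, tag in zip(tokens, tags):
        if tag != "O":
            sequence.append(tag)
        sequence.append(token)
    return " ".join(sequence)

print(linearize(["John", "lives", "in", "New", "York"],
                ["B-PER", "O", "O", "B-LOC", "I-LOC"]))
# -> "B-PER John lives in B-LOC New I-LOC York"
```

Sampling from a language model trained on such sequences and then reversing the transformation yields new synthetic labeled sentences.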