Paolini, Giovanni
Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
Havrilla, Alex, Dai, Andrew, O'Mahony, Laura, Oostermeijer, Koen, Zisler, Vera, Albalak, Alon, Milo, Fabrizio, Raparthy, Sharath Chandra, Gandhi, Kanishk, Abbasi, Baber, Phung, Duy, Iyer, Maia, Mahan, Dakota, Blagden, Chase, Gureja, Srishti, Hamdy, Mohammed, Li, Wen-Ding, Paolini, Giovanni, Ammanamanchi, Pawan Sasanka, Meyerson, Elliot
Synthetic data generation with Large Language Models is a promising paradigm for augmenting natural data over a nearly infinite range of tasks. Given this variety, direct comparisons among synthetic data generation algorithms are scarce, making it difficult to understand where improvement comes from and what bottlenecks exist. We propose to evaluate algorithms via the makeup of the synthetic data each generates, in terms of data quality, diversity, and complexity (QDC). We choose these three characteristics for their significance in open-ended processes and the impact each has on the capabilities of downstream models. We find quality to be essential for in-distribution model generalization, diversity to be essential for out-of-distribution generalization, and complexity to be beneficial for both. Further, we emphasize the existence of quality-diversity trade-offs in training data and their downstream effects on model performance. We then examine the effect of various components in the synthetic data pipeline on each data characteristic. This examination allows us to taxonomize and compare synthetic data generation algorithms through the components they utilize and the resulting effects on data QDC composition. This analysis extends into a discussion on the importance of balancing QDC in synthetic data for efficient reinforcement learning and self-improvement algorithms. Analogous to the QD trade-offs in training data, there often exist trade-offs between model output quality and output diversity that impact the composition of synthetic data. We observe that many models are currently evaluated and optimized only for output quality, thereby limiting output diversity and the potential for self-improvement. We argue that balancing these trade-offs is essential to the development of future self-improvement algorithms, and we highlight a number of works making progress in this direction.
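The survey frames quality, diversity, and complexity as measurable properties of a synthetic corpus. As one illustrative proxy for the diversity axis (the metric choice here is an assumption for illustration, not a prescription from the paper), a distinct-n score counts the fraction of unique n-grams in the generated data:

```python
from collections import Counter

def distinct_n(texts, n=2):
    """Fraction of unique n-grams in a corpus: a common lexical
    proxy for diversity (closer to 1 = more diverse)."""
    ngrams = Counter()
    for text in texts:
        tokens = text.split()
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

samples = [
    "the cat sat on the mat",
    "the dog sat on the mat",  # near-duplicate lowers the score
    "quantum tunneling dominates at low temperatures",
]
print(f"distinct-2: {distinct_n(samples, n=2):.2f}")
```

Quality and complexity typically require model-based judges or task-specific proxies rather than a closed-form count, consistent with the abstract's point that each axis affects downstream capability differently.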
Learning to Play 7 Wonders Duel Without Human Supervision
Paolini, Giovanni, Moreschini, Lorenzo, Veneziano, Francesco, Iraci, Alessandro
This paper introduces ZeusAI, an artificial intelligence system developed to play the board game 7 Wonders Duel. Inspired by the AlphaZero reinforcement learning algorithm, ZeusAI relies on a combination of Monte Carlo Tree Search and a Transformer Neural Network to learn the game without human supervision. ZeusAI competes at the level of top human players, develops both known and novel strategies, and allows us to test rule variants to improve the game's balance. This work demonstrates how AI can help in understanding and enhancing board games.
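The abstract names the two standard AlphaZero ingredients: Monte Carlo Tree Search guided by a learned network. Below is a minimal sketch of the selection rule such systems typically use (the PUCT formula); ZeusAI's actual node structure and hyperparameters are not given in the abstract, so the class layout and the c_puct value are assumptions:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    P: float                 # prior probability from the policy network
    N: int = 0               # visit count
    W: float = 0.0           # accumulated value from simulations
    children: dict = field(default_factory=dict)  # action -> Node

def puct_select(node: Node, c_puct: float = 1.5):
    """AlphaZero-style selection: descend to the child maximizing
    Q + U, where Q = W/N and U = c_puct * P * sqrt(N_parent) / (1 + N_child)."""
    n_parent = sum(child.N for child in node.children.values())
    def score(child: Node) -> float:
        q = child.W / child.N if child.N else 0.0
        u = c_puct * child.P * math.sqrt(n_parent + 1) / (1 + child.N)
        return q + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

root = Node(P=1.0, children={0: Node(P=0.6), 1: Node(P=0.4)})
print(puct_select(root))  # with zero visits, the higher-prior action wins
```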
Fewer Truncations Improve Language Modeling
Ding, Hantian, Wang, Zijian, Paolini, Giovanni, Kumar, Varun, Deoras, Anoop, Roth, Dan, Soatto, Stefano
In large language model training, input documents are typically concatenated and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, this concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content grounded in the complete context. To address this issue, we propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. Our method completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation. Empirical results from both text and code pre-training show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension, +16.8% on context following, and +9.2% on program synthesis) and effectively reduces closed-domain hallucination by up to 58.3%.
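The abstract describes Best-fit Packing only at a high level. The sketch below conveys the underlying idea with a plain Best-Fit-Decreasing loop over document token lengths: only documents longer than the context window are split, and every resulting chunk is placed in the open sequence with the least leftover room that still fits. This is a simplified O(n·m) scan; the paper's production implementation is a more scalable variant, so treat this as an illustration of the packing objective rather than the authors' exact algorithm:

```python
def best_fit_packing(doc_lens, max_len=2048):
    """Best-Fit-Decreasing sketch: no document shorter than max_len
    is ever split, eliminating unnecessary truncations."""
    # Split only overlong documents into max_len-sized chunks.
    chunks = []
    for length in doc_lens:
        while length > max_len:
            chunks.append(max_len)
            length -= max_len
        if length:
            chunks.append(length)
    capacities = []   # remaining room per training sequence
    sequences = []    # chunk lengths assigned to each sequence
    for length in sorted(chunks, reverse=True):
        best = min((i for i, cap in enumerate(capacities) if cap >= length),
                   key=lambda i: capacities[i], default=None)
        if best is None:
            capacities.append(max_len - length)
            sequences.append([length])
        else:
            capacities[best] -= length
            sequences[best].append(length)
    return sequences

# A 3000-token document yields one full chunk plus a 952-token remainder;
# everything else is packed whole.
print(best_fit_packing([3000, 1500, 600, 500, 2048], max_len=2048))
```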
General Purpose Verification for Chain of Thought Prompting
Vacareanu, Robert, Pratik, Anurag, Spiliopoulou, Evangelia, Qi, Zheng, Paolini, Giovanni, John, Neha Anna, Ma, Jie, Benajiba, Yassine, Ballesteros, Miguel
Many of the recent capabilities demonstrated by Large Language Models (LLMs) arise primarily from their ability to exploit contextual information. In this paper, we explore ways to improve the reasoning capabilities of LLMs through (1) exploration of different chains of thought and (2) validation of the individual steps of the reasoning process. We propose three general principles that a model should adhere to while reasoning: (i) Relevance, (ii) Mathematical Accuracy, and (iii) Logical Consistency. We apply these constraints to the reasoning steps generated by the LLM to improve the accuracy of the final generation. The constraints are applied in the form of verifiers: the model itself is asked to verify whether the generated steps satisfy each constraint. To further steer the generations towards high-quality solutions, we use the perplexity of the reasoning steps as an additional verifier. We evaluate our method on 4 distinct types of reasoning tasks, spanning a total of 9 different datasets. Experiments show that our method consistently outperforms vanilla generation and, on 6 of the 9 datasets, outperforms best-of-N sampling, which samples N reasoning chains and picks the one with the lowest perplexity.
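A minimal sketch of how verifier-guided chain selection could be wired up is below. Here `verify(step, criterion) -> [0, 1]` and `perplexity(step) -> float` stand in for calls to the model itself; the prompt wording behind them, the equal weighting of the three principles, and the 0.01 perplexity coefficient are all assumptions, not the paper's exact aggregation:

```python
def score_chain(steps, verify, perplexity):
    """Average per-step score: self-verification of the three
    principles (relevance, mathematical accuracy, logical
    consistency) minus a perplexity penalty."""
    criteria = ("relevance", "mathematical accuracy", "logical consistency")
    total = 0.0
    for step in steps:
        checks = sum(verify(step, c) for c in criteria) / len(criteria)
        total += checks - 0.01 * perplexity(step)  # weight is arbitrary here
    return total / len(steps)

def pick_best(chains, verify, perplexity):
    """Explore several sampled chains of thought, keep the best-scoring one."""
    return max(chains, key=lambda c: score_chain(c, verify, perplexity))

# Dummy wrappers in place of real model calls:
chains = [["compute 2+2=4", "therefore x=4"], ["guess x=5"]]
best = pick_best(chains,
                 verify=lambda step, crit: 0.9,
                 perplexity=lambda step: 12.0)
print(best)
```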
A Weak Supervision Approach for Few-Shot Aspect Based Sentiment Analysis
Vacareanu, Robert, Varia, Siddharth, Halder, Kishaloy, Wang, Shuai, Paolini, Giovanni, John, Neha Anna, Ballesteros, Miguel, Muresan, Smaranda
We explore how weak supervision on abundant unlabeled data can be leveraged to improve few-shot performance in aspect-based sentiment analysis (ABSA) tasks. We propose a pipeline approach to construct a noisy ABSA dataset and use it to adapt a pre-trained sequence-to-sequence model to the ABSA tasks. We test the resulting model on three widely used ABSA datasets, before and after fine-tuning. Our proposed method preserves the full fine-tuning performance while showing significant improvements (15.84% absolute F1) in the few-shot learning scenario for the harder tasks. In zero-shot settings (i.e., without fine-tuning), our method outperforms the previous state of the art on the aspect extraction and sentiment classification (AESC) task and is, additionally, capable of performing the harder aspect sentiment triplet extraction (ASTE) task.
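For a sequence-to-sequence model, the noisy ABSA annotations ultimately have to be serialized into flat target strings. A hypothetical linearization of (aspect, opinion, sentiment) triplets might look like the following; the tag scheme is purely illustrative and is not claimed to match the paper's format:

```python
def linearize_triplets(triplets):
    """Flatten (aspect, opinion, sentiment) triplets into one target
    string for seq2seq training; the tag vocabulary is an assumption."""
    return " | ".join(f"<aspect> {a} <opinion> {o} <sentiment> {s}"
                      for a, o, s in triplets)

noisy_example = {
    "text": "The battery life is great but the screen scratches easily.",
    "triplets": [("battery life", "great", "positive"),
                 ("screen", "scratches easily", "negative")],
}
print(linearize_triplets(noisy_example["triplets"]))
```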
À-la-carte Prompt Tuning (APT): Combining Distinct Data Via Composable Prompting
Bowman, Benjamin, Achille, Alessandro, Zancato, Luca, Trager, Matthew, Perera, Pramuditha, Paolini, Giovanni, Soatto, Stefano
We introduce À-la-carte Prompt Tuning (APT), a transformer-based scheme to tune prompts on distinct data so that they can be arbitrarily composed at inference time. The individual prompts can be trained in isolation, possibly on different devices, at different times, and on different distributions or domains. Furthermore, each prompt only contains information about the subset of data it was exposed to during training. During inference, models can be assembled based on arbitrary selections of data sources, which we call "à-la-carte learning". À-la-carte learning enables constructing bespoke models specific to each user's individual access rights and preferences. We can add or remove information from the model by simply adding or removing the corresponding prompts, without retraining from scratch. We demonstrate that à-la-carte built models achieve accuracy within 5% of models trained on the union of the respective sources, with comparable cost in terms of training and inference time. On the continual learning benchmarks Split CIFAR-100 and CORe50, we achieve state-of-the-art performance.
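A minimal sketch of the composition step, assuming each independently trained soft prompt is stored as an embedding matrix that gets concatenated ahead of the input at inference. The shapes, the prompt-bank layout, and the bare concatenation are assumptions; the full method likely involves further details (e.g., how attention across composed prompts is handled) beyond this sketch:

```python
import torch

def compose_prompts(prompt_bank, selected, token_embeds):
    """À-la-carte inference sketch: prepend the soft prompts of the
    data sources the user is entitled to. Adding or removing a source
    is just adding or removing its entry from `selected`."""
    prompts = torch.cat([prompt_bank[name] for name in selected], dim=0)
    return torch.cat([prompts, token_embeds], dim=0)  # (n_prompt + seq, d)

d = 16
prompt_bank = {"source_A": torch.randn(5, d),   # trained on source A only
               "source_B": torch.randn(5, d)}   # trained on source B only
x = torch.randn(12, d)                          # embedded input sequence
h = compose_prompts(prompt_bank, ["source_A", "source_B"], x)
print(h.shape)  # torch.Size([22, 16])
```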
Estimating informativeness of samples with Smooth Unique Information
Harutyunyan, Hrayr, Achille, Alessandro, Paolini, Giovanni, Majumder, Orchid, Ravichandran, Avinash, Bhotika, Rahul, Soatto, Stefano
We define a notion of information that an individual sample provides to the training of a neural network, and we specialize it to measure both how much a sample informs the final weights and how much it informs the function computed by the weights. Though related, we show that these quantities have a qualitatively different behavior. We give efficient approximations of these quantities using a linearized network and demonstrate empirically that the approximation is accurate for real-world architectures, such as pre-trained ResNets. We apply these measures to several problems, such as dataset summarization, analysis of under-sampled classes, comparison of informativeness of different data sources, and detection of adversarial and corrupted examples. Our work generalizes existing frameworks but enjoys better computational properties for heavily overparametrized models, which makes it possible to apply it to real-world networks.

Training a deep neural network (DNN) entails extracting information from samples in a dataset and storing it in the weights of the network, so that it may be used in future inference or prediction. But how much information does a particular sample contribute to the trained model? The answer can be used to provide strong generalization bounds (if no information is used, the network is not memorizing the sample), privacy bounds (how much information the network can leak about a particular sample), and enable better interpretation of the training process and its outcome. To determine the information content of samples, we need to define and compute information. In the classical sense, information is a property of random variables, which may be degenerate for the deterministic process of computing the output of a trained DNN in response to a given input (inference). So, even posing the problem presents some technical challenges.
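The paper's measure is defined for a linearized network under a smoothed training process; as a toy stand-in for "how much does sample i inform the final weights," the sketch below compares closed-form ridge-regression solutions trained with and without the sample. Everything here is an illustrative assumption that only conveys the leave-one-out flavor of the question, not the paper's estimator:

```python
import numpy as np

def weight_information_proxy(X, y, i, reg=1e-2):
    """Toy leave-one-out proxy: distance between ridge weights fit
    with and without sample i. A linear model stands in for the
    linearized network used by the actual method."""
    def fit(Xs, ys):
        d = Xs.shape[1]
        return np.linalg.solve(Xs.T @ Xs + reg * np.eye(d), Xs.T @ ys)
    w_full = fit(X, y)
    mask = np.arange(len(y)) != i
    w_loo = fit(X[mask], y[mask])
    return np.linalg.norm(w_full - w_loo)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 8)), rng.normal(size=50)
print(weight_information_proxy(X, y, i=3))  # larger = more informative sample
```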
The Information Complexity of Learning Tasks, their Structure and their Distance
Achille, Alessandro, Paolini, Giovanni, Mbeng, Glen, Soatto, Stefano
We introduce an asymmetric distance in the space of learning tasks, together with a framework to compute their complexity. These concepts are foundational to the practice of transfer learning, ubiquitous in Deep Learning, whereby a parametric model is pre-trained on one task and then fine-tuned for another. The framework we develop is intrinsically non-asymptotic, capturing the finite nature of the training dataset, yet it allows distinguishing learning from memorization. It encompasses, as special cases, classical notions from Kolmogorov complexity, Shannon information, and Fisher information. However, unlike some of those frameworks, it can be applied easily to large-scale models and real-world datasets. It is the first framework to explicitly account, when measuring complexity and information, for the optimization scheme, which plays a crucial role in Deep Learning.