Goto

Collaborating Authors

 Large Language Model


Heterogeneous Value Evaluation for Large Language Models

arXiv.org Artificial Intelligence

The emergent capabilities of Large Language Models (LLMs) have made it crucial to align their values with those of humans. Current methodologies typically attempt alignment with a homogeneous human value and requires human verification, yet lack consensus on the desired aspect and depth of alignment and resulting human biases. In this paper, we propose A2EHV, an Automated Alignment Evaluation with a Heterogeneous Value system that (1) is automated to minimize individual human biases, and (2) allows assessments against various target values to foster heterogeneous agents. Our approach pivots on the concept of value rationality, which represents the ability for agents to execute behaviors that satisfy a target value the most. The quantification of value rationality is facilitated by the Social Value Orientation framework from social psychology, which partitions the value space into four categories to assess social preferences from agents' behaviors. We evaluate the value rationality of eight mainstream LLMs and observe that large models are more inclined to align neutral values compared to those with strong personal values. By examining the behavior of these LLMs, we contribute to a deeper understanding of value alignment within a heterogeneous value system.


KEYword based Sampling (KEYS) for Large Language Models

arXiv.org Artificial Intelligence

Question answering (Q/A) can be formulated as a generative task (Mitra, 2017) where the task is to generate an answer given the question and the passage (knowledge, if available). Recent advances in QA task is focused a lot on language model advancements and less on other areas such as sampling(Krishna et al., 2021), (Nakano et al., 2021). Keywords play very important role for humans in language generation. (Humans formulate keywords and use grammar to connect those keywords and work). In the research community, very little focus is on how humans generate answers to a question and how this behavior can be incorporated in a language model. In this paper, we want to explore these two areas combined, i.e., how sampling can be to used generate answers which are close to human-like behavior and factually correct. Hence, the type of decoding algorithm we think should be used for Q/A tasks should also depend on the keywords. These keywords can be obtained from the question, passage or internet results. We use knowledge distillation techniques to extract keywords and sample using these extracted keywords on top of vanilla decoding algorithms when formulating the answer to generate a human-like answer. In this paper, we show that our decoding method outperforms most commonly used decoding methods for Q/A task


Level Generation Through Large Language Models

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are powerful tools, capable of leveraging their training on natural language to write stories, generate code, and answer questions. But can they generate functional video game levels? Game levels, with their complex functional constraints and spatial relationships in more than one dimension, are very different from the kinds of data an LLM typically sees during training. Datasets of game levels are also hard to come by, potentially taxing the abilities of these data-hungry models. We investigate the use of LLMs to generate levels for the game Sokoban, finding that LLMs are indeed capable of doing so, and that their performance scales dramatically with dataset size. We also perform preliminary experiments on controlling LLM level generators and discuss promising areas for future work.


Teaching Small Language Models to Reason

arXiv.org Artificial Intelligence

Chain of thought prompting successfully improves the reasoning capabilities of large language models, achieving state of the art results on a range of datasets. However, these reasoning capabilities only appear to emerge in models with a size of over 100 billion parameters. In this paper, we explore the transfer of such reasoning capabilities to models with less than 100 billion parameters via knowledge distillation. Specifically, we finetune a student model on the chain of thought outputs generated by a larger teacher model. Our experiments show that the proposed method improves task performance across arithmetic, commonsense and symbolic reasoning datasets. For example, the accuracy of T5 XXL on GSM8K improves from 8.11% to 21.99% when finetuned on PaLM-540B generated chains of thought.


Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models

arXiv.org Artificial Intelligence

Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize. Alternatively, methods such as temperature sampling and its modifications (top-k sampling, nucleus sampling, typical decoding, and others), are embarrassingly parallel, but have no guarantees about duplicate samples. We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model, compatible with common sampling variations, with provable beam diversity under certain conditions, as well as being embarrassingly parallel and providing unbiased and consistent expectations from the original model. We demonstrate the effectiveness of Figure 1: Sequence model over sequences of length two our approach on WMT machine translation, and a vocabulary of three symbols mapping points in the more than halving the standard deviation when unit interval to each sequence. An even lattice of code estimating expected BLEU score reward, and points parallelizes decoding into diverse high-probability closing the BLEU score gap between independent sequences.


A Survey on In-context Learning

arXiv.org Artificial Intelligence

With the increasing ability of large language models (LLMs), in-context learning (ICL) has become a new paradigm for natural language processing (NLP), where LLMs make predictions only based on contexts augmented with a few examples. It has been a new trend to explore ICL to evaluate and extrapolate the ability of LLMs. In this paper, we aim to survey and summarize the progress and challenges of ICL. We first present a formal definition of ICL and clarify its correlation to related studies. Then, we organize and discuss advanced techniques, including training strategies, demonstration designing strategies, as well as related analysis. Finally, we discuss the challenges of ICL and provide potential directions for further research. We hope that our work can encourage more research on uncovering how ICL works and improving ICL.


Responsibility Perspective Transfer for Italian Femicide News

arXiv.org Artificial Intelligence

Different ways of linguistically expressing the same real-world event can lead to different perceptions of what happened. Previous work has shown that different descriptions of gender-based violence (GBV) influence the reader's perception of who is to blame for the violence, possibly reinforcing stereotypes which see the victim as partly responsible, too. As a contribution to raise awareness on perspective-based writing, and to facilitate access to alternative perspectives, we introduce the novel task of automatically rewriting GBV descriptions as a means to alter the perceived level of responsibility on the perpetrator. We present a quasi-parallel dataset of sentences with low and high perceived responsibility levels for the perpetrator, and experiment with unsupervised (mBART-based), zero-shot and few-shot (GPT3-based) methods for rewriting sentences. We evaluate our models using a questionnaire study and a suite of automatic metrics.


Cook-Gen: Robust Generative Modeling of Cooking Actions from Recipes

arXiv.org Artificial Intelligence

As people become more aware of their food choices, food computation models have become increasingly popular in assisting people in maintaining healthy eating habits. For example, food recommendation systems analyze recipe instructions to assess nutritional contents and provide recipe recommendations. The recent and remarkable successes of generative AI methods, such as auto-regressive large language models, can lead to robust methods for a more comprehensive understanding of recipes for healthy food recommendations beyond surface-level nutrition content assessments. In this study, we explore the use of generative AI methods to extend current food computation models, primarily involving the analysis of nutrition and ingredients, to also incorporate cooking actions (e.g., add salt, fry the meat, boil the vegetables, etc.). Cooking actions are notoriously hard to model using statistical learning methods due to irregular data patterns - significantly varying natural language descriptions for the same action (e.g., marinate the meat vs. marinate the meat and leave overnight) and infrequently occurring patterns (e.g., add salt occurs far more frequently than marinating the meat). The prototypical approach to handling irregular data patterns is to increase the volume of data that the model ingests by orders of magnitude. Unfortunately, in the cooking domain, these problems are further compounded with larger data volumes presenting a unique challenge that is not easily handled by simply scaling up. In this work, we propose novel aggregation-based generative AI methods, Cook-Gen, that reliably generate cooking actions from recipes, despite difficulties with irregular data patterns, while also outperforming Large Language Models and other strong baselines.


Exploring EFL students' prompt engineering in human-AI story writing: an Activity Theory perspective

arXiv.org Artificial Intelligence

The integration of NLG tools in education has generated many questions (Rospigliosi, 2023) and a growing interest among researchers. For instance, Gero et al. (2022) found NLG tools could support science writing by inspiring writers with sentences about scientific concepts; and Guo et al. (2023) found students could interact with chatbots to better prepare for classroom debates. Particularly in the context of language learning (Haristiani, 2019) NLG tools might provide language learners with real-time feedback and support in various language tasks (Chen et al., 2021). Besides, researchers have found NLG tool-based activities can positively influence English as a foreign language (EFL) students' willingness to engage in English language (Tai & Chen, 2020; Lee et al., 2023). However, individual EFL students may perceive the affordances of using NLG tools differently and some may even perceive affordances as constraints (Jeon, 2022). For EFL students to effectively interact with NLG tools to complete language tasks, it appears students will not only need strategies but also the right NLG tools (Woo et al., 2023). Activity theory (AT; Engeström, 1987) provides a framework to analyze how language learners interact with NLG tools as a mediated activity system. The present qualitative study applies AT to explore the rules governing the use of NLG tools by EFL students to write short stories. By analyzing EFL students' written reflections for the rules they have developed to interact with NLG tools, the study can provide insights into human-AI collaboration in education, improving pedagogy and tool design.


Is Model Attention Aligned with Human Attention? An Empirical Study on Large Language Models for Code Generation

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have been demonstrated effective for code generation. Due to the complexity and opacity of LLMs, little is known about how these models generate code. To deepen our understanding, we investigate whether LLMs attend to the same parts of a natural language description as human programmers during code generation. An analysis of five LLMs on a popular benchmark, HumanEval, revealed a consistent misalignment between LLMs' and programmers' attention. Furthermore, we found that there is no correlation between the code generation accuracy of LLMs and their alignment with human programmers. Through a quantitative experiment and a user study, we confirmed that, among twelve different attention computation methods, attention computed by the perturbation-based method is most aligned with human attention and is constantly favored by human programmers. Our findings highlight the need for human-aligned LLMs for better interpretability and programmer trust.