Large Language Model
Aligning Offline Metrics and Human Judgments of Value for Code Generation Models
Dibia, Victor, Fourney, Adam, Bansal, Gagan, Poursabzi-Sangdeh, Forough, Liu, Han, Amershi, Saleema
Large language models have demonstrated great potential to assist programmers in generating code. For such human-AI pair programming scenarios, we empirically demonstrate that while generated code is most often evaluated in terms of its functional correctness (i.e., whether generations pass available unit tests), correctness does not fully capture (e.g., may underestimate) the productivity gains these models may provide. Through a user study with N = 49 experienced programmers, we show that while correctness captures high-value generations, programmers still rate code that fails unit tests as valuable if it reduces the overall effort needed to complete a coding task. Finally, we propose a hybrid metric that combines functional correctness and syntactic similarity and show that it achieves a 14% stronger correlation with value and can therefore better represent real-world gains when evaluating and comparing models.
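To make the idea of a hybrid metric concrete, here is a minimal sketch of one way to combine a pass/fail test result with a similarity score. The abstract does not give the formula, so the weighted average, the `weight` parameter, the function names, and the use of `difflib.SequenceMatcher` as a stand-in for syntactic similarity are all assumptions for illustration, not the authors' metric.

```python
from difflib import SequenceMatcher


def syntactic_similarity(generated: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for syntactic similarity."""
    return SequenceMatcher(None, generated, reference).ratio()


def hybrid_score(passes_tests: bool, generated: str, reference: str,
                 weight: float = 0.5) -> float:
    """Hypothetical hybrid: weighted average of correctness and similarity.

    `weight` is an assumed parameter controlling how much the unit-test
    outcome counts relative to closeness to a reference solution.
    """
    correctness = 1.0 if passes_tests else 0.0
    similarity = syntactic_similarity(generated, reference)
    return weight * correctness + (1.0 - weight) * similarity


if __name__ == "__main__":
    reference = "def add(a, b):\n    return a + b\n"
    near_miss = "def add(a, b):\n    return a - b\n"  # fails tests, one character off
    unrelated = "print('hello')\n"                      # fails tests, little overlap
    print(hybrid_score(False, near_miss, reference))
    print(hybrid_score(False, unrelated, reference))
```

Under this sketch, a generation that fails its tests but sits close to a working solution still receives partial credit, which mirrors the observation that such code can reduce the effort needed to finish a task.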
Google, OpenAI will share AI models with the UK government
The UK's AI oversight will include chances to directly study some companies' technology. In a speech at London Tech Week, Prime Minister Rishi Sunak revealed that Google DeepMind, OpenAI and Anthropic have pledged to provide "early or priority access" to AI models for the sake of research and safety. This will ideally improve inspections of these models and help the government recognize the "opportunities and risks," Sunak says. It's not clear just what data the tech firms will share with the UK government. We've asked Google, OpenAI and Anthropic for comment.
It's time to talk about the real AI risks
Unsurprisingly, everyone was talking about AI and the recent rush to deploy large language models. Ahead of the conference, the United Nations put out a statement, encouraging RightsCon attendees to focus on AI oversight and transparency. I was surprised, however, by how different the conversations about the risks of generative AI were at RightsCon from all the warnings from big Silicon Valley voices that I've been reading in the news. Throughout the last few weeks, tech luminaries like OpenAI CEO Sam Altman, ex-Googler Geoff Hinton, top AI researcher Yoshua Bengio, Elon Musk, and many others have been calling for regulation and urgent action to address the "existential risks"--even including extinction--that AI poses to humanity. Certainly, the rapid deployment of large language models without risk assessments, disclosures about training data and processes, or seemingly much attention paid to how the tech could be misused is concerning.
The case for bottom up AI
ChatGPT and other generative artificial intelligence tools are rising in popularity. If you have ever used these tools, you might have realised that you are revealing your thoughts (and possibly emotions) through your questions and interactions with the AI platforms. You can therefore imagine the huge amount of data these AI tools are gathering and the patterns that they are able to extract from the way we think. The impact of these business practices is crystal clear: a new AI economy is emerging through collecting, codifying, and monetising the patterns derived from our thoughts and feelings. Intrusions into our intimacy and cognition will be much greater than with existing social media and tech platforms.
Augmenting Zero-Shot Detection Training with Image Labels
Kornmeier, Katharina, Scheler, Ulla, Herrmann, Pascal
Zero-shot detection (ZSD), i.e., detection on classes not seen during training, is essential for real world detection use-cases, but remains a difficult task. Recent research attempts ZSD with detection models that output embeddings instead of direct class labels. To this aim, the output of the detection model must be aligned to a learned embedding space such as CLIP. However, this alignment is hindered by detection data sets which are expensive to produce compared to image classification annotations, and the resulting lack of category diversity in the training data. We address this challenge by leveraging the CLIP embedding space in combination with image labels from ImageNet. Our results show that image labels are able to better align the detector output to the embedding space and thus have a high potential for ZSD. Compared to only training on detection data, we see a significant gain by adding image label data of 3.3 mAP for the 65/15 split on COCO on the unseen classes, i.e., we more than double the gain of related work.
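The core mechanism described above, a detector that outputs embeddings which are then matched against class embeddings, can be sketched as a nearest-neighbour lookup by cosine similarity. The three-dimensional vectors and class names below are toy values invented for illustration; in practice the class vectors would come from a text encoder such as CLIP's, and the region vector from the detector head.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(region_embedding, class_embeddings):
    """Return the class whose embedding is most similar to the region's."""
    return max(class_embeddings,
               key=lambda name: cosine(region_embedding, class_embeddings[name]))


if __name__ == "__main__":
    # Toy class embeddings (assumed values). "zebra" plays the role of a
    # class with no detection annotations during training.
    class_embeddings = {
        "cat": [1.0, 0.0, 0.0],
        "dog": [0.0, 1.0, 0.0],
        "zebra": [0.0, 0.0, 1.0],
    }
    region = [0.1, 0.2, 0.9]  # hypothetical detector output for one box
    print(classify(region, class_embeddings))
```

Because the label set is just a dictionary of embeddings, an unseen class can be added at inference time without retraining the detector, provided the detector output is well aligned with the embedding space.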
Your Prompt is My Command: On Assessing the Human-Centred Generality of Multimodal Models
Schellaert, Wout, Martínez-Plumed, Fernando, Vold, Karina, Burden, John, A. M. Casares, Pablo, Sheng Loe, Bao, Reichart, Roi, Ó hÉigeartaigh, Sean, Korhonen, Anna, Hernández-Orallo, José
Even with obvious deficiencies, large prompt-commanded multimodal models are proving to be flexible cognitive tools representing an unprecedented generality. But the directness, diversity, and degree of user interaction create a distinctive “human-centred generality” (HCG), rather than a fully autonomous one. HCG implies that —for a specific user— a system is only as general as it is effective for the user’s relevant tasks and their prevalent ways of prompting. A human-centred evaluation of general-purpose AI systems therefore needs to reflect the personal nature of interaction, tasks and cognition. We argue that the best way to understand these systems is as highly-coupled cognitive extenders, and to analyse the bidirectional cognitive adaptations between them and humans. In this paper, we give a formulation of HCG, as well as a high-level overview of the elements and trade-offs involved in the prompting process. We end the paper by outlining some essential research questions and suggestions for improving evaluation practices, which we envision as characteristic for the evaluation of general artificial intelligence in the future. This paper appears in the AI & Society track.
On the Amplification of Linguistic Bias through Unintentional Self-reinforcement Learning by Generative Language Models -- A Perspective
Generative Language Models (GLMs) have the potential to significantly shape our linguistic landscape due to their expansive use in various digital applications. However, this widespread adoption might inadvertently trigger a self-reinforcement learning cycle that can amplify existing linguistic biases. This paper explores the possibility of such a phenomenon, where the initial biases in GLMs, reflected in their generated text, can feed into the learning material of subsequent models, thereby reinforcing and amplifying these biases. Moreover, the paper highlights how the pervasive nature of GLMs might influence the linguistic and cognitive development of future generations, as they may unconsciously learn and reproduce these biases. The implications of this potential self-reinforcement cycle extend beyond the models themselves, impacting human language and discourse. The advantages and disadvantages of this bias amplification are weighed, considering educational benefits and ease of future GLM learning against threats to linguistic diversity and dependence on initial GLMs. This paper underscores the need for rigorous research to understand and address these issues. It advocates for improved model transparency, bias-aware training techniques, development of methods to distinguish between human and GLM-generated text, and robust measures for fairness and bias evaluation in GLMs. The aim is to ensure the effective, safe, and equitable use of these powerful technologies, while preserving the richness and diversity of human language.
Mitigating Prior Errors in Causal Structure Learning: Towards LLM driven Prior Knowledge
Chen, Lyuzhou, Ban, Taiyu, Wang, Xiangyu, Lyu, Derui, Chen, Huanhuan
Causal structure learning is a prominent technique for encoding cause-and-effect relationships among variables through Bayesian Networks (BNs). Merely recovering causal structures from real-world observed data lacks precision, while the development of Large Language Models (LLMs) is opening a new frontier of causality. LLMs show a strong capability for discovering causal relationships between variables from "text" inputs that define the investigated variables, leading to a potential new hierarchy and new ladder of causality. We address a critical issue in the emerging topic of LLM-based causal structure learning: tackling erroneous prior causal statements from LLMs, which is seldom considered in the current context of expert-dominated prior resources. As a pioneering attempt, we propose a BN learning strategy that is resilient to prior errors without the need for human intervention. Focusing on the edge-level prior, we classify the possible prior errors into three types: order-consistent, order-reversed, and irrelevant, and provide their theoretical impact on the Structural Hamming Distance (SHD) under the presumption of sufficient data. Intriguingly, we discover and prove that only the order-reversed error contributes to an increase in a unique acyclic closed structure, defined as a "quasi-circle". Leveraging this insight, a post-hoc strategy is employed to identify the order-reversed prior error by its impact on the increment of "quasi-circles". Through empirical evaluation on both real and synthetic datasets, we demonstrate our strategy's robustness against prior errors. Specifically, we highlight its substantial ability to resist order-reversed errors while maintaining the majority of correct prior knowledge.
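For readers unfamiliar with the Structural Hamming Distance mentioned above, the following is a minimal sketch of how it can be computed between a true and a learned directed graph. It assumes graphs are given as sets of `(parent, child)` tuples without self-loops, and it counts a reversed edge as a single error; some definitions count a reversal as two, and the abstract does not say which convention the authors use.

```python
def shd(true_edges, learned_edges):
    """Structural Hamming Distance between two directed graphs.

    Each node pair whose edge status differs (missing, extra, or
    reversed) adds 1. Counting a reversal as 1 is an assumed convention.
    """
    true_edges = set(true_edges)
    learned_edges = set(learned_edges)
    # Every unordered node pair that is connected in at least one graph.
    pairs = {frozenset(edge) for edge in true_edges | learned_edges}
    distance = 0
    for pair in pairs:
        a, b = tuple(pair)
        both_directions = {(a, b), (b, a)}
        if both_directions & true_edges != both_directions & learned_edges:
            distance += 1
    return distance


if __name__ == "__main__":
    true_graph = {("A", "B"), ("B", "C")}
    # The learned graph reverses A -> B and adds a spurious A -> C.
    learned_graph = {("B", "A"), ("B", "C"), ("A", "C")}
    print(shd(true_graph, learned_graph))  # prints 2
```

In this toy case the reversed edge and the spurious edge each contribute one unit, so a prior that forces a wrong direction is penalised in the same way as one that adds an unsupported edge.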
On the Viability of using LLMs for SW/HW Co-Design: An Example in Designing CiM DNN Accelerators
Yan, Zheyu, Qin, Yifan, Hu, Xiaobo Sharon, Shi, Yiyu
Deep Neural Networks (DNNs) have demonstrated impressive performance across a wide range of tasks. However, deploying DNNs on edge devices poses significant challenges due to stringent power and computational budgets. An effective solution to this issue is software-hardware (SW-HW) co-design, which allows for the tailored creation of DNN models and hardware architectures that optimally utilize available resources. However, SW-HW co-design traditionally suffers from slow optimization speeds because its optimizers do not make use of heuristic knowledge, also known as the "cold start" problem. In this study, we present a novel approach that leverages Large Language Models (LLMs) to address this issue. By utilizing the abundant knowledge of pre-trained LLMs in the co-design optimization process, we effectively bypass the cold start problem, substantially accelerating the design process. The proposed method achieves a significant speedup of 25x. This advancement paves the way for the rapid and efficient deployment of DNNs on edge devices.
Lost in Translation: Large Language Models in Non-English Content Analysis
Nicholas, Gabriel, Bhatia, Aliya
In recent years, large language models (e.g., OpenAI's GPT-4, Meta's LLaMA, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.