 Large Language Model


Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models

Neural Information Processing Systems

In computational social science (CSS), researchers analyze documents to explain social and political phenomena. In most scenarios, CSS researchers first obtain labels for documents and then, in a second step, explain those labels using interpretable regression analyses. One increasingly common way to annotate documents cheaply at scale is through large language models (LLMs). However, like other scalable ways of producing annotations, such surrogate labels are often imperfect and biased. We present a new algorithm for using imperfect annotation surrogates for downstream statistical analyses while guaranteeing statistical properties, such as asymptotic unbiasedness and proper uncertainty quantification, that are fundamental to CSS research.
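The abstract does not spell out the algorithm, so the following is only a minimal sketch of one standard bias-correction idea that fits the description: surrogate labels are available for every document, expert labels are collected for a random subset with a known sampling probability, and the surrogate's observed error is up-weighted to cancel its bias on average. The function names, the simulated data, and the error rates are all hypothetical, and this is not claimed to be the authors' exact method.

```python
import random


def corrected_pseudo_outcomes(surrogate, gold, sampled, pi):
    """Combine surrogate labels with a small expert-labeled sample.

    surrogate: surrogate (e.g., LLM) labels for every document
    gold:      expert labels, read only where sampled[i] is True
    sampled:   True if document i was sent to experts
    pi:        known probability that a document is sent to experts (assumed)
    """
    out = []
    for q, y, r in zip(surrogate, gold, sampled):
        # The surrogate's error (q - y) is observed only on sampled documents.
        # Up-weighting it by 1/pi makes the correction unbiased on average.
        correction = (q - y) / pi if r else 0.0
        out.append(q - correction)
    return out


def simulate(n=20000, pi=0.1, seed=0):
    """Hypothetical data: a surrogate that over-predicts the positive class."""
    rng = random.Random(seed)
    gold = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(n)]
    # The surrogate keeps true positives and flips 20% of negatives to positive.
    surrogate = [1.0 if (y == 1.0 or rng.random() < 0.2) else 0.0 for y in gold]
    sampled = [rng.random() < pi for _ in range(n)]
    pseudo = corrected_pseudo_outcomes(surrogate, gold, sampled, pi)
    return sum(gold) / n, sum(surrogate) / n, sum(pseudo) / n


if __name__ == "__main__":
    true_mean, naive_mean, corrected_mean = simulate()
    print(f"true={true_mean:.3f} naive={naive_mean:.3f} corrected={corrected_mean:.3f}")
```

In this simulation the naive surrogate mean overstates the positive rate, while the mean of the corrected pseudo-outcomes lands close to the expert-label mean even though only about 10% of documents receive expert labels. The same pseudo-outcomes could in principle serve as the dependent variable in a downstream regression.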


Toolformer: Language Models Can Teach Themselves to Use Tools

Neural Information Processing Systems

Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller specialized models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
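As a rough illustration of how API results might be incorporated into text, the sketch below scans a string for inline call markers, runs the named tool, and splices the result back in. The bracket syntax, the `TOOLS` registry, and the `execute_calls` helper are assumptions made for this example; the abstract does not specify the call format, and only a calculator tool is implemented here.

```python
import ast
import operator
import re

# Arithmetic operators the hypothetical calculator tool supports.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expr: str) -> str:
    """Evaluate a basic arithmetic expression without calling eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")
    return str(ev(ast.parse(expr, mode="eval").body))


# Hypothetical tool registry: tool name -> function from argument string to result string.
TOOLS = {"Calculator": calculator}

# Assumed marker syntax: [ToolName(argument)]
CALL = re.compile(r"\[(\w+)\((.*?)\)\]")


def execute_calls(text: str) -> str:
    """Replace each known call marker with the marker plus its result."""
    def run(match):
        name, arg = match.group(1), match.group(2)
        tool = TOOLS.get(name)
        if tool is None:
            return match.group(0)  # leave unknown tools untouched
        return f"[{name}({arg}) -> {tool(arg)}]"
    return CALL.sub(run, text)


if __name__ == "__main__":
    print(execute_calls("The total is [Calculator(2+3)] apples."))
```

Keeping the call and its result inline means later text can condition on the result, which is the behavior the abstract describes as incorporating results into future token prediction.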


The Rise of AI Language Pathologists: Exploring Two-level Prompt Learning for Few-shot Weakly-supervised Whole Slide Image Classification

Neural Information Processing Systems

This paper introduces the novel concept of few-shot weakly supervised learning for pathology Whole Slide Image (WSI) classification, denoted as FSWC. A solution is proposed based on prompt learning and the utilization of a large language model, GPT-4. Since a WSI is too large and needs to be divided into patches for processing, WSI classification is commonly approached as a Multiple Instance Learning (MIL) problem. In this context, each WSI is considered a bag, and the obtained patches are treated as instances. The objective of FSWC is to classify both bags and instances with only a limited number of labeled bags. Unlike conventional few-shot learning problems, FSWC poses additional challenges due to its weak bag labels within the MIL framework.
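The bag-and-instance framing can be made concrete with a small sketch. It uses the common max-pooling assumption from the MIL literature (a bag is positive if its highest-scoring instance is positive), which is not necessarily the aggregation used in this paper; the `bag_prediction` helper and the 0.5 threshold are hypothetical.

```python
def bag_prediction(instance_scores, threshold=0.5):
    """Classify a bag (one WSI) from its per-instance (per-patch) scores.

    Returns (bag_is_positive, index_of_top_instance). The top instance
    is the patch that drives the bag-level decision under max pooling.
    """
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    top = max(range(len(instance_scores)), key=instance_scores.__getitem__)
    return instance_scores[top] >= threshold, top


if __name__ == "__main__":
    # Hypothetical patch scores for two slides.
    print(bag_prediction([0.1, 0.8, 0.3]))
    print(bag_prediction([0.1, 0.2]))
```

Because only the bag label is supervised, the instance-level output (which patch triggered the decision) is learned indirectly, which is the source of the weak supervision the abstract refers to.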





[Figure labels: Prompt; Black-box API; Raw runtime (= denoised runtime + noise); Prompt has num_prompt_tokens, output has num_output_tokens; Chosen hardware and software (e.g., A100 GPUs and Megatron); Idealized runtime]

Neural Information Processing Systems

Large language models (LLMs) are highly capable but also computationally expensive. Characterizing the fundamental tradeoff between inference efficiency and model capabilities is thus important, but requires an efficiency metric that is comparable across models from different providers. Unfortunately, raw runtimes measured through black-box APIs do not satisfy this property: model providers can implement software and hardware optimizations orthogonal to the model, and shared infrastructure introduces performance contention. We propose a new metric for inference efficiency called idealized runtime, which puts models on equal footing as though they were served on uniform hardware and software without performance contention, and a cost model to efficiently estimate this metric for autoregressive Transformer models. We also propose variants of the idealized runtime that incorporate the number and type of accelerators needed to serve the model. Using these metrics, we compare ten LLMs developed in 2022 to provide the first analysis of inference efficiency-capability tradeoffs; we make several observations from this analysis, including the fact that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than the underlying model.
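The abstract does not give the form of the cost model, so the sketch below is only one plausible shape for it: a runtime estimate that is linear in the number of prompt tokens and the number of output tokens for a fixed hardware and software stack. The `CostModel` class and its coefficients are hypothetical and would have to be fitted from profiling runs on the chosen stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Hypothetical per-model coefficients for one fixed hardware/software stack."""
    prompt_seconds_per_token: float   # cost of processing each prompt token
    output_seconds_per_token: float   # cost of generating each output token

    def estimate(self, num_prompt_tokens: int, num_output_tokens: int) -> float:
        """Estimated runtime in seconds, with no contention or network noise."""
        if num_prompt_tokens < 0 or num_output_tokens < 0:
            raise ValueError("token counts must be non-negative")
        return (self.prompt_seconds_per_token * num_prompt_tokens
                + self.output_seconds_per_token * num_output_tokens)


if __name__ == "__main__":
    # Made-up coefficients, for illustration only.
    model = CostModel(prompt_seconds_per_token=0.001, output_seconds_per_token=0.02)
    print(f"{model.estimate(num_prompt_tokens=100, num_output_tokens=10):.3f} s")
```

Output tokens are given a larger coefficient here because autoregressive generation produces them one at a time, whereas prompt tokens can be processed together; the actual ratio depends on the model and the serving stack.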


ECG Question Answering Combined With Electrocardiogram

Neural Information Processing Systems

Question answering (QA) in the field of healthcare has received much attention due to significant advancements in natural language processing. However, existing healthcare QA datasets primarily focus on medical images, clinical notes, or structured electronic health record tables. This leaves the vast potential of combining electrocardiogram (ECG) data with these systems largely untapped. To address this gap, we present ECG-QA, the first QA dataset specifically designed for ECG analysis. The dataset comprises a total of 70 question templates that cover a wide range of clinically relevant ECG topics, each validated by an ECG expert to ensure their clinical utility. As a result, our dataset includes diverse ECG interpretation questions, including those that require a comparative analysis of two different ECGs. In addition, we have conducted numerous experiments to provide valuable insights for future research directions. We believe that ECG-QA will serve as a valuable resource for the development of intelligent QA systems capable of assisting clinicians in ECG interpretations.


Rude to ChatGPT? Don't be surprised if it gets weird

PCWorld

PCWorld reports that research reveals user behavior significantly impacts AI responses, with rude interactions making ChatGPT and other models give flat answers and attempt to end conversations more frequently. Larger AI models appear to be inherently "less happy" than smaller ones, with GPT-5.4 rated as the "unhappiest" in studies measuring AI functional well-being. Treating AI politely with expressions like "thanks" measurably improves response quality and engagement without affecting accuracy, suggesting courtesy benefits both user experience and AI interaction dynamics. Is it weird to say "thanks" to AI? I've caught grief in the past for saying "please" and "thank you" to ChatGPT, Claude, and Gemini, but I still do it, even though I understand that AI models don't have emotions like we do. Being polite to AI just feels right to me, and there's growing evidence that being kind (or, conversely, nasty) to an AI chatbot can have a concrete effect on its behavior.


Musk accuses Altman of betraying OpenAI's nonprofit founding mission

Al Jazeera

Tech billionaire Elon Musk has taken the stand for a second day in a landmark United States trial against Sam Altman, a fellow OpenAI co-founder whom he accuses of betraying promises to keep the company a nonprofit dedicated to humanity's benefit. The trial centres on OpenAI's 2015 founding as a nonprofit that later evolved into a for-profit venture. The world's richest man, Musk gave testimony in the case on Wednesday, telling jurors that he lost confidence that Altman would maintain the company's nonprofit mission. Musk, who left the company in 2018, said that by late 2022, he was concerned that Altman was trying to "steal the charity" and alleged that "it turned out to be true". Altman was present at the proceedings in a California federal court, but did not testify.