language model work
Does In-IDE Calibration of Large Language Models work at Scale?
Koohestani, Roham, Sergeyuk, Agnia, Gros, David, Spiess, Claudio, Titov, Sergey, Devanbu, Prem, Izadi, Maliheh
The introduction of large language models into integrated development environments (IDEs) is revolutionizing software engineering, yet it poses challenges to the usefulness and reliability of AI-generated code. Post-hoc calibration of internal model confidences aims to align probabilities with an acceptability measure. Prior work suggests calibration can improve alignment, but at-scale evidence is limited. In this work, we investigate the feasibility of applying calibration of code models to an in-IDE context. We study two aspects of the problem: (1) the technical method for implementing confidence calibration and improving the reliability of code generation models, and (2) the human-centered design principles for effectively communicating reliability signals to developers. First, we develop a scalable and flexible calibration framework which can be used to obtain calibration weights for open-source models using any dataset, and evaluate whether calibrators improve the alignment between model confidence and developer acceptance behavior. Through a large-scale analysis of over 24 million real-world developer interactions across multiple programming languages, we find that a general, post-hoc calibration model based on Platt scaling does not, on average, improve the reliability of model confidence signals. We also find that while dynamically personalizing calibration to individual users can be effective, its effectiveness is highly dependent on the volume of user interaction data. Second, we conduct a multi-phase design study with 3 expert designers and 153 professional developers, combining scenario-based design, semi-structured interviews, and survey validation, revealing a clear preference for presenting reliability signals via non-numerical, color-coded indicators within the in-editor code generation workflow.
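For readers unfamiliar with the technique named in the abstract, the following is a minimal sketch of Platt scaling. It is not the authors' framework. It assumes the model exposes a raw confidence logit per suggestion and that each suggestion has a binary accepted/rejected label. All function names (`fit_platt`, `expected_calibration_error`, `demo`) and the synthetic data are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_platt(scores, labels, lr=0.5, steps=500):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between average confidence and acceptance rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    ece = 0.0
    for items in bins:
        if items:
            conf = sum(p for p, _ in items) / len(items)
            acc = sum(y for _, y in items) / len(items)
            ece += len(items) / total * abs(conf - acc)
    return ece


def demo(seed=0, n=2000):
    """Synthetic example only: an overconfident 'model' built by assumption."""
    rng = random.Random(seed)
    logits = [rng.gauss(0.0, 2.0) for _ in range(n)]
    # Assumed ground truth: acceptance probability is sigmoid(0.5 * z - 1).
    labels = [1 if rng.random() < sigmoid(0.5 * z - 1.0) else 0 for z in logits]
    a, b = fit_platt(logits, labels)
    raw = [sigmoid(z) for z in logits]
    calibrated = [sigmoid(a * z + b) for z in logits]
    return (a, b,
            expected_calibration_error(raw, labels),
            expected_calibration_error(calibrated, labels))
```

Calling `demo()` returns the fitted parameters together with the calibration error before and after scaling. On this synthetic data the error drops by construction, because the miscalibration is exactly the kind a two-parameter sigmoid can correct. That outcome says nothing about real developer interactions, where the abstract reports no average improvement. A realistic evaluation would also fit on one split and measure on a held-out one.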
We know remarkably little about how AI language models work
A growing number of experts have called for these tests to be ditched, saying they boost AI hype and create "the illusion that [AI language models] have greater capabilities than what truly exists." What stood out to me in Will's story is that we know remarkably little about how AI language models work and why they generate the things they do. With these tests, we're trying to measure and glorify their "intelligence" based on their outputs, without fully understanding how they function under the hood. Our tendency to anthropomorphize makes this messy: "People have been giving human intelligence tests--IQ tests and so on--to machines since the very beginning of AI," says Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico. "The issue throughout has been what it means when you test a machine like this. It doesn't mean the same thing that it means for a human."
Meta AI Giving Away Its New Large Language Model
AI researchers at Meta have created a massive new language model to rival OpenAI's GPT-3 and advance our understanding of large language models. And the company is giving it away as part of its effort to democratize AI. Open Pretrained Transformer (OPT-175B) is a language model with 175 billion parameters trained on publicly available data sets. According to Meta, 992 Nvidia A100 GPUs, each equipped with 80GB of onboard memory, were used over a training period of two months. To facilitate "community engagement", the release includes the pre-trained model, extensive notes about its development, a logbook detailing the training process, and the code needed to train and use the model.
Learning Data Science from Real-World Projects
Mixed-integer programming saves the day. Taking a cue from consumer supply chains and the data-driven advances that have revolutionized them in recent decades, Gabe Verzino walks us through a scheduling program that would empower both patients and healthcare providers to use their time more efficiently. Bayes' Theorem might sound, well, theoretical. As Khuyen Tran shows in her recent tutorial (based on the traffic patterns of her own website), it can also be a powerful tool for detecting and analyzing change points in your data. The road to the perfect shot of espresso passes through a lot of data.
Generating Beatles' Lyrics with Machine Learning - Towards Data Science
The Beatles were a huge cultural phenomenon. Their timeless music still resonates with people today, both young and old. In my humble opinion, they are the greatest band to have ever lived¹. Their songs are full of interesting lyrics and deep ideas. "When you've seen beyond yourself / Then you may find peace of mind is waiting there"² However, the thing that made the Beatles great was their versatility.