underperform
Robust Heuristic Algorithm Design with LLMs
Karimi, Pantea, Rouhana, Dany, Namyar, Pooria, Kakarla, Siva Kesava Reddy, Arun, Venkat, Arzani, Behnaz
We posit that we can generate more robust and performant heuristics if we augment approaches using LLMs for heuristic design with tools that explain why heuristics underperform and suggestions about how to fix them. We find even simple ideas that (1) expose the LLM to instances where the heuristic underperforms; (2) explain why they occur; and (3) specialize design to regions in the input space, can produce more robust algorithms compared to existing techniques~ -- ~the heuristics we produce have a $\sim28\times$ better worst-case performance compared to FunSearch, improve average performance, and maintain the runtime.
- North America > United States > California (0.86)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- (2 more...)
Reviews: Training Language GANs from Scratch
I've raised my score accordingly, but I still think that there needs to be more solid results. In particular, while the rebuttal notes that ScratchGAN can almost match the MLE baseline, I am not sure how strong the MLE baseline itself is. Based on sample quality, I suspect that the MLE baseline itself is quite weak and does not use more modern LM approaches (e.g. Of course, I am not saying that the authors deliberately used weak baselines, but it would be helpful to compare against stronger MLE baselines too. Weaknesses: - The main weakness is empirical---scratchGAN appreciably underperforms an MLE model in terms of LM score and reverse LM score.
New Tests Reveal AI's Capacity for Deception
The myth of King Midas is about a man who wishes for everything he touches to turn to gold. This does not go well: Midas finds himself unable to eat or drink, with even his loved ones transmuted. The myth is sometimes invoked to illustrate the challenge of ensuring AI systems do what we want, particularly as they grow more powerful. As Stuart Russell--who coauthored AI's standard textbook--tells TIME over email, the concern is that "what seem to be reasonable goals, such as fixing climate change, lead to catastrophic consequences, such as eliminating the human race as a way to fix climate change." On Dec. 5, a paper released by AI safety nonprofit Apollo Research found that in certain contrived scenarios, today's cutting-edge AI systems, including OpenAI's o1 and Anthropic's Claude 3.5 Sonnet, can engage in deceptive behavior in pursuit of their goals--providing empirical evidence to support a concern that to date has been largely theoretical.
Rethinking Thinking Tokens: Understanding Why They Underperform in Practice
Vennam, Sreeram, Valente, David, Herel, David, Kumaraguru, Ponnurangam
Thinking Tokens (TT) have been proposed as an unsupervised method to facilitate reasoning in language models. However, despite their conceptual appeal, our findings show that TTs marginally improves performance and consistently underperforms compared to Chain-of-Thought (CoT) reasoning across multiple benchmarks. We hypothesize that this underperformance stems from the reliance on a single embedding for TTs, which results in inconsistent learning signals and introduces noisy gradients. This paper provides a comprehensive empirical analysis to validate this hypothesis and discusses the implications for future research on unsupervised reasoning in LLMs.
Towards Safer Heuristics With XPlain
Karimi, Pantea, Pirelli, Solal, Kakarla, Siva Kesava Reddy, Beckett, Ryan, Segarra, Santiago, Li, Beibin, Namyar, Pooria, Arzani, Behnaz
Many problems that cloud operators solve are computationally expensive, and operators often use heuristic algorithms (that are faster and scale better than optimal) to solve them more efficiently. Heuristic analyzers enable operators to find when and by how much their heuristics underperform. However, these tools do not provide enough detail for operators to mitigate the heuristic's impact in practice: they only discover a single input instance that causes the heuristic to underperform (and not the full set), and they do not explain why. We propose XPlain, a tool that extends these analyzers and helps operators understand when and why their heuristics underperform. We present promising initial results that show such an extension is viable.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > California > Orange County > Irvine (0.05)
- North America > United States > New York > New York County > New York City (0.05)
- (12 more...)
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
van der Weij, Teun, Hofstätter, Felix, Jaffe, Ollie, Brown, Samuel F., Ward, Francis Rhys
Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI's actual capability. These conflicting interests lead to the problem of sandbagging $\unicode{x2013}$ which we define as "strategic underperformance on an evaluation". In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted, or password-locked, to target specific scores on a capability evaluation. Even more, we found that a capable password-locked model (Llama 3 70b) is reasonably able to emulate a less capable model (Llama 2 7b). Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems.
- North America > United States (0.46)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- Asia > Middle East > Israel (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- (4 more...)
Stable Diffusion Benchmarked: Which GPU Runs AI Fastest
Artificial Intelligence and deep learning are constantly in the headlines these days, whether it be ChatGPT generating poor advice, self-driving cars, artists being accused of using AI, medical advice from AI, and more. Most of these tools rely on complex servers with lots of hardware for training, but using the trained network via inference can be done on your PC, using its graphics card. But how fast are consumer GPUs for doing AI inference? We've benchmarked Stable Diffusion, a popular AI image creator, on the latest Nvidia, AMD, and even Intel GPUs to see how they stack up. If you've by chance tried to get Stable Diffusion up and running on your own PC, you may have some inkling of how complex -- or simple!
Discovering the systematic errors made by machine learning models
In this blog post, we introduce Domino, a new approach for discovering systematic errors made by machine learning models. We also discuss a framework for quantitatively evaluating methods like Domino. Machine learning models that achieve high overall accuracy often make systematic errors on coherent slices of validation data. A slice is a set of data samples that share a common characteristic. As an example, in large image datasets, photos of vintage cars comprise a slice (i.e.