single number
Trading Inference-Time Compute for Adversarial Robustness
Zaremba, Wojciech, Nitishinskaya, Evgenia, Barak, Boaz, Lin, Stephanie, Toyer, Sam, Yu, Yaodong, Dias, Rachel, Wallace, Eric, Xiao, Kai, Heidecke, Johannes, Glaese, Amelia
We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve reliability, and speculate on the reasons for these as well as ways to address them.
Changing a single number among billions can destroy an AI model
An artificial intelligence model can be made to spout gibberish if a single one of the many billions of numbers that compose it is altered. Large language models (LLMs) like the one behind OpenAI's ChatGPT contain billions of parameters or weights, which are the numerical values used to represent each "neuron" of their neural network. These are what get tuned and tweaked during training so the AI can learn abilities such as generating text. Input is passed through these weights, which determine the most statistically likely output.…
Self-Reflection Outcome is Sensitive to Prompt Construction
Liu, Fengyuan, AlDahoul, Nouar, Eady, Gregory, Zaki, Yasir, AlShebli, Bedoor, Rahwan, Talal
Large language models (LLMs) demonstrate impressive zero-shot and few-shot reasoning capabilities. Some propose that such capabilities can be improved through self-reflection, i.e., letting LLMs reflect on their own output to identify and correct mistakes in the initial responses. However, despite some evidence showing the benefits of self-reflection, recent studies offer mixed results. Here, we aim to reconcile these conflicting findings by first demonstrating that the outcome of self-reflection is sensitive to prompt wording; e.g., LLMs are more likely to conclude that they have made a mistake when explicitly prompted to find mistakes. Consequently, idiosyncrasies in reflection prompts may lead LLMs to change correct responses unnecessarily. We show that most prompts used in the self-reflection literature are prone to this bias. We then propose different ways of constructing prompts that are conservative in identifying mistakes and show that self-reflection using such prompts results in higher accuracy. Our findings highlight the importance of prompt engineering in self-reflection tasks. We release our code at https://github.com/Michael98Liu/mixture-of-prompts.
Chain of Thoughtlessness? An Analysis of CoT in Planning
Stechly, Kaya, Valmeekam, Karthik, Kambhampati, Subbarao
Large language model (LLM) performance on reasoning problems typically does not generalize out of distribution. Previous work has claimed that this can be mitigated with chain of thought prompting (a method of demonstrating solution procedures), with the intuition that it is possible to in-context teach an LLM an algorithm for solving the problem. This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs across two axes: generality of examples given in prompt, and complexity of problems queried with each prompt. While our problems are very simple, we only find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class, and those improvements quickly deteriorate as the size n of the query-specified stack grows past the size of stacks shown in the examples. We also create scalable variants of three domains commonly studied in previous CoT papers and demonstrate the existence of similar failure modes. Our results hint that, contrary to previous claims in the literature, CoT's performance improvements do not stem from the model learning general algorithmic procedures via demonstrations but depend on carefully engineering highly problem-specific prompts. This spotlights drawbacks of chain of thought, especially the sharp tradeoff between possible performance gains and the amount of human labor necessary to generate examples with correct reasoning traces.
Using Machine Learning to Get the Most Out of Electric Vehicle Batteries
With the uptake of electric vehicles (EVs) increasing across the automotive market, there is a need to ensure optimized function and reliability of the battery that powers the vehicle. Across many industries and markets, lithium-ion (Li-ion) batteries are crucial components of devices and machinery, including smartphones, solar power storage, and power supplies. Maintaining good battery health is therefore vital. A group of researchers from the University of Cambridge has recently developed a new algorithm that uses machine learning to help preserve good battery health in EVs. The algorithm is able to use pattern recognition and predictability models to see how various driving styles influence the performance of the vehicle's battery.
Linear Algebra for Data Science With Python - Analytics Vidhya
This article was published as a part of the Data Science Blogathon. Linear Algebra, a branch of mathematics, is very useful in Data Science. We can mathematically operate on large amounts of data by using Linear Algebra. Most algorithms used in ML use Linear Algebra, especially matrices, as most of the data is represented in matrix form.
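To make the point about matrix-shaped data concrete, here is a minimal sketch in plain Python. It assumes a tiny dataset stored as a matrix (one row per sample, one column per feature) and scores every row with a matrix-vector product. The names `matvec`, `features`, and `weights` are illustrative and do not come from the article.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("each row must have as many entries as the vector")
    # Each output entry is the dot product of one row with the vector.
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Hypothetical data: 3 samples, 2 features each.
features = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
# Hypothetical linear-model weights, one per feature.
weights = [0.5, -1.0]

scores = matvec(features, weights)
print(scores)
```

In practice a library such as NumPy would perform the same operation on much larger matrices; the loop form is used here only to show what the operation does.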
Embrace Uncertainty in Machine Learning Models to Maximize Business Value - Covail
'All models are wrong, but some are useful.' As this famous quote by George Box (known as the Box Theorem) shows, no model is ever going to be 100% accurate. If one is, run for the hills! Rather, models should be evaluated by their impact on the bottom line, or how useful they are to the business. In this blog post, we will explore a way in which models can be more useful, by embracing and leveraging uncertainty to maximize business results. Much of the time, business users want a single number to represent the 'goodness' of a model, but machine learning models can tell us so much more than just a single number (like accuracy).
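One way to see why a single number such as accuracy can mislead is the sketch below. It is an illustration on made-up labels, not necessarily the approach the post goes on to describe: a hypothetical classifier that always predicts the majority class scores high on accuracy while finding none of the positive cases. The function `summarize` and the label data are assumptions.

```python
def summarize(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up, imbalanced data: 5 positives out of 100 cases.
y_true = [1] * 5 + [0] * 95
# A hypothetical model that always predicts the negative class.
y_pred = [0] * 100

report = summarize(y_true, y_pred)
print(report)
```

Here accuracy alone looks strong, while recall shows that every positive case was missed, which is the kind of detail a single summary number hides.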
WTF is a Tensor?!?
When we represent data for machine learning, this generally needs to be done numerically. Especially when referring specifically to neural network data representation, this is accomplished via a data repository known as the tensor. A tensor is a container which can house data in N dimensions. Often and erroneously used interchangeably with the matrix (which is specifically a 2-dimensional tensor), tensors are generalizations of matrices to N-dimensional space. Mathematically speaking, tensors are more than simply a data container, however.
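The container view of a tensor can be sketched with nested Python lists, assuming every level of nesting is regular (all sublists at a given depth have the same length). The helper names `tensor_shape` and `tensor_rank` are illustrative; libraries such as NumPy expose the same information through an array's `shape` and `ndim` attributes.

```python
def tensor_shape(data):
    """Return the shape of a regularly nested list as a tuple of dimension sizes."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        # Descend into the first element; stop if the list is empty.
        data = data[0] if data else None
    return tuple(shape)


def tensor_rank(data):
    """Number of dimensions: 0 for a scalar, 1 for a vector, 2 for a matrix."""
    return len(tensor_shape(data))


scalar = 7.0                                               # 0 dimensions
vector = [1.0, 2.0, 3.0]                                   # 1 dimension
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]                # 2 dimensions
cube = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]   # 3 dimensions

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("cube", cube)]:
    print(name, tensor_rank(value), tensor_shape(value))
```

The matrix is just the 2-dimensional case; adding another level of nesting yields a 3-dimensional tensor, and so on for N dimensions.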