Collaborating Authors

 Morawiecki, Paweł


Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

arXiv.org Artificial Intelligence

We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. The model is provided with a dataset description in a prompt and asked to generate code transforming it. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the modified dataset compared to the original data. Through an extensive evaluation of state-of-the-art models and a comparison with well-established benchmarks, we demonstrate that our proposal, FeatEng, can cheaply and efficiently assess the broad capabilities of LLMs, in contrast to existing methods. The reference implementation is available at https://github.com/FeatEng/FeatEng. The rapid evolution of LLMs has significantly expanded their capabilities in processing and generating human-like text. As these models become increasingly sophisticated, defining what constitutes a meaningful benchmark is becoming harder and harder, as it is much easier to distinguish between bad and good models than between good and better ones. Today, the limitations of LLMs are predominantly assessed using benchmarks focused on language understanding, world knowledge, code generation, or mathematical reasoning in isolation. This setup, however, overlooks critical capabilities that can only be measured in scenarios requiring the integration of skills and the verification of their instrumental value in complex, real-world problems. We argue that well-designed LLM benchmarks should embody the following qualities, each reflecting a fundamental aspect of problem-solving ability: 1. Practical Usability. We demand that tasks are grounded in real-world problems where solutions have high functional value. This ensures that improvements in the observed performance translate into tangible benefits, aligning with the pragmatist view on the instrumental value of knowledge and truth, meaning that the validity of an idea depends on its practical utility in achieving desired outcomes (James, 1907). We value an LLM's knowledge for its role in enabling reasoning, decision-making, and problem-solving. The benchmark should be designed to evaluate not only the breadth of a model's knowledge base but also, more importantly, its capacity to dynamically and effectively apply this knowledge within various functional contexts, similarly to how functionalism frames it (Block, 1980). We therefore opt for assessing models on their ability to seamlessly combine various competencies, in contrast to measuring each of them in isolation.
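
For illustration, the scoring loop described above can be sketched as follows. This is a minimal sketch of the idea, not the reference implementation linked above: the helper name `llm_transform`, the XGBoost settings, and the use of held-out accuracy as the metric are all assumptions made for the example.

```python
# Sketch: score an LLM-written feature-engineering function by the accuracy
# gain of an XGBoost model on the transformed vs. the original data.
# `llm_transform` is a hypothetical callable obtained from the model's code.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

def xgb_accuracy(X: pd.DataFrame, y: pd.Series) -> float:
    """Fit XGBoost with fixed settings and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

def score_feature_engineering(df: pd.DataFrame, target: str, llm_transform) -> float:
    """Return the accuracy improvement achieved by the generated transform."""
    y = df[target]
    X_raw = df.drop(columns=[target])
    baseline = xgb_accuracy(X_raw, y)
    X_eng = llm_transform(X_raw.copy())  # apply the LLM-generated code
    improved = xgb_accuracy(X_eng, y)
    return improved - baseline           # positive = useful features
```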


Towards More Realistic Membership Inference Attacks on Large Diffusion Models

arXiv.org Artificial Intelligence

Generative diffusion models, including Stable Diffusion and Midjourney, can generate visually appealing, diverse, and high-resolution images for various applications. These models are trained on billions of internet-sourced images, raising significant concerns about the potential unauthorized use of copyright-protected images. In this paper, we examine whether it is possible to determine if a specific image was used in the training set, a problem known in the cybersecurity community as a membership inference attack. Our focus is on Stable Diffusion, and we address the challenge of designing a fair evaluation framework to answer this membership question. We propose a methodology to establish a fair evaluation setup and apply it to Stable Diffusion, enabling potential extensions to other generative models. Utilizing this evaluation setup, we execute membership attacks (both known and newly introduced). Our research reveals that previously proposed evaluation setups do not provide a full understanding of the effectiveness of membership inference attacks. We conclude that the membership inference attack remains a significant challenge for large diffusion models (often deployed as black-box systems), indicating that related privacy and copyright issues will persist in the foreseeable future.
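
To make the membership question concrete, a bare-bones loss-threshold attack (the building block behind many membership inference methods) can be sketched as below. This is not one of the specific attacks evaluated in the paper; `per_example_loss` is a hypothetical callable that, for a diffusion model, could return the denoising loss of a candidate image at a fixed timestep.

```python
# Sketch: generic loss-threshold membership inference, scored by how well
# the loss separates known training members from held-out non-members.
import numpy as np
from sklearn.metrics import roc_auc_score

def membership_scores(images, per_example_loss):
    """Lower loss on a sample is treated as weak evidence of membership."""
    return np.array([-per_example_loss(img) for img in images])

def evaluate_attack(member_imgs, nonmember_imgs, per_example_loss):
    """AUC of separating members from non-members.

    A fair evaluation requires members and non-members drawn from the same
    distribution, which is exactly the setup issue the paper studies.
    """
    scores = np.concatenate([
        membership_scores(member_imgs, per_example_loss),
        membership_scores(nonmember_imgs, per_example_loss),
    ])
    labels = np.concatenate([
        np.ones(len(member_imgs)),      # 1 = was in the training set
        np.zeros(len(nonmember_imgs)),  # 0 = held out
    ])
    return roc_auc_score(labels, scores)
```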


Adversarial Examples Detection and Analysis with Layer-wise Autoencoders

arXiv.org Machine Learning

We present a mechanism for detecting adversarial examples based on data representations taken from the hidden layers of the target network. For this purpose, we train individual autoencoders at intermediate layers of the target network. This allows us to describe the manifold of true data and, in consequence, decide whether a given example has the same characteristics as true data. It also gives us insight into the behavior of adversarial examples and their flow through the layers of a deep neural network. Experimental results show that our method outperforms the state of the art in supervised and unsupervised settings.
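
The detection mechanism can be illustrated with a short PyTorch sketch: train a small autoencoder on clean-data activations from one hidden layer, then flag inputs whose activations reconstruct poorly. The architecture sizes, training loop, and threshold rule below are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: one per-layer autoencoder modeling the manifold of clean activations;
# high reconstruction error indicates an off-manifold (possibly adversarial) input.
import torch
import torch.nn as nn

class LayerAutoencoder(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU())
        self.decoder = nn.Linear(bottleneck, dim)

    def forward(self, h):
        return self.decoder(self.encoder(h))

def fit_autoencoder(ae, clean_activation_batches, epochs=10, lr=1e-3):
    """Train the autoencoder on batches of clean hidden activations."""
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for h in clean_activation_batches:
            opt.zero_grad()
            loss_fn(ae(h), h).backward()
            opt.step()
    return ae

def reconstruction_error(ae, h):
    """Per-example squared reconstruction error of hidden activations h."""
    with torch.no_grad():
        return ((ae(h) - h) ** 2).mean(dim=1)

# Detection rule (illustrative): flag an input as adversarial when its error
# exceeds a threshold calibrated on clean validation data at that layer,
# e.g. the 95th percentile of clean reconstruction errors.
```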


Fast and Stable Interval Bounds Propagation for Training Verifiably Robust Models

arXiv.org Machine Learning

We present an efficient technique that allows us to train classification networks that are verifiably robust against norm-bounded adversarial attacks. This framework builds upon the work of Gowal et al., who apply interval arithmetic to bound the activations at each layer and keep the prediction invariant to input perturbations. While that method is faster than competing approaches, it requires careful tuning of hyper-parameters and a large number of epochs to converge. To speed up and stabilize training, we supply the cost function with an additional term that encourages the model to keep the interval bounds at hidden layers small. Experimental results demonstrate that we can achieve comparable (or even better) results using a smaller number of training iterations, in a more stable fashion. Moreover, the proposed model is less sensitive to the exact specification of the training process, which makes it easier for practitioners to use.
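
The core computation, interval bound propagation (IBP) through a fully connected ReLU network plus an auxiliary penalty on hidden-layer interval widths, can be sketched as follows. The penalty weight `beta`, the network shape, and the worst-case-logit construction (taken from the standard IBP recipe of Gowal et al.) are assumptions for the example, not the paper's exact training objective.

```python
# Sketch: propagate an eps-ball through linear/ReLU layers, build a worst-case
# loss from the output bounds, and add a penalty on hidden interval widths.
import torch
import torch.nn as nn
import torch.nn.functional as F

def ibp_linear(layer: nn.Linear, lb: torch.Tensor, ub: torch.Tensor):
    """Propagate an interval [lb, ub] exactly through a linear layer."""
    mid, rad = (lb + ub) / 2, (ub - lb) / 2
    new_mid = F.linear(mid, layer.weight, layer.bias)
    new_rad = F.linear(rad, layer.weight.abs())
    return new_mid - new_rad, new_mid + new_rad

def ibp_forward(layers, x, eps):
    """Propagate the eps-ball around x; collect interval widths at hidden layers."""
    lb, ub = x - eps, x + eps
    widths = []
    for i, layer in enumerate(layers):
        lb, ub = ibp_linear(layer, lb, ub)
        if i < len(layers) - 1:                  # hidden layer
            widths.append((ub - lb).mean())      # width term to be penalized
            lb, ub = F.relu(lb), F.relu(ub)      # ReLU is monotone, bounds pass through
    return lb, ub, widths

def robust_loss(layers, x, y, eps, beta=1e-3):
    """Worst-case cross-entropy from output bounds plus the width penalty."""
    lb, ub, widths = ibp_forward(layers, x, eps)
    # Worst-case logits: lower bound for the true class, upper bound elsewhere.
    onehot = F.one_hot(y, lb.size(1)).bool()
    worst_logits = torch.where(onehot, lb, ub)
    return F.cross_entropy(worst_logits, y) + beta * sum(widths)
```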