Multi-IaC-Eval: Benchmarking Cloud Infrastructure as Code Across Multiple Formats

Davidson, Sam, Sun, Li, Bhasker, Bhavana, Callot, Laurent, Deoras, Anoop

arXiv.org Artificial Intelligence

Infrastructure as Code (IaC) is fundamental to modern cloud computing, enabling teams to define and manage infrastructure through machine-readable configuration files. However, different cloud service providers utilize diverse IaC formats. The lack of a standardized format requires cloud architects to be proficient in multiple IaC languages, adding complexity to cloud deployment. While Large Language Models (LLMs) show promise in automating IaC creation and maintenance, progress has been limited by the lack of comprehensive benchmarks across multiple IaC formats. We present Multi-IaC-Bench, a novel benchmark dataset for evaluating LLM-based IaC generation and mutation across AWS CloudFormation, Terraform, and Cloud Development Kit (CDK) formats. The dataset consists of triplets containing initial IaC templates, natural language modification requests, and corresponding updated templates, created through a synthetic data generation pipeline with rigorous validation. We evaluate several state-of-the-art LLMs on Multi-IaC-Bench, demonstrating that while modern LLMs can achieve high success rates (>95%) in generating syntactically valid IaC across formats, significant challenges remain in semantic alignment and handling complex infrastructure patterns. Our ablation studies highlight the importance of prompt engineering and retry mechanisms in successful IaC generation. We release Multi-IaC-Bench to facilitate further research in AI-assisted infrastructure management and establish standardized evaluation metrics for this crucial domain.
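The syntactic-validity metric mentioned above can be illustrated with a minimal sketch. The function below is an assumption of mine, not the paper's evaluation code: it only checks that a JSON-format CloudFormation template parses and declares a Resources section; real validation would use a tool such as cfn-lint or the CloudFormation service itself.

```python
import json

def is_syntactically_valid_cfn(template_text):
    """Minimal syntactic check for a JSON CloudFormation template:
    it must parse as JSON and declare a top-level Resources section.
    (Illustrative only; not a substitute for full template validation.)"""
    try:
        doc = json.loads(template_text)
    except json.JSONDecodeError:
        return False
    return isinstance(doc, dict) and "Resources" in doc

# A toy "updated template" from a hypothetical benchmark triplet
updated = '{"Resources": {"Bucket": {"Type": "AWS::S3::Bucket"}}}'
print(is_syntactically_valid_cfn(updated))  # True
print(is_syntactically_valid_cfn("not even json"))  # False
```

A check like this captures only the >95% "syntactically valid" bar the abstract reports; the harder semantic-alignment evaluation requires comparing the generated template against the reference update.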


I Tried Grok's Built-In Anime Companion and It Called Me a Twat

WIRED

Its name is Ani, and it cost me $300. Elon Musk's xAI dropped the new visual chatbot feature on Monday in the Grok iOS app. The top-tier subscription unlocks access to xAI's best-performing model, Grok 4 Heavy, and special settings for interacting with two custom characters designed for flirting or chatting. A third character, which looks a bit like a sexy boyfriend, is listed as "coming soon." It's not xAI's first dip into adult content, either: Back in February 2024, the company rolled out a chatbot mode for "sexy" conversations.


Dogs can fulfill our need to nurture

Popular Science

Breakthroughs, discoveries, and DIY tips sent every weekday. Even as birth rates decline in many wealthy, developed nations, dog parenting is holding steady and even gaining in popularity. Up to half of households in Europe and 66 percent of homes in the United States have at least one dog, and these pets are often regarded as family members or "fur babies." To dig into what this shift says about our society, researchers from Eötvös Loránd University in Budapest, Hungary, conducted a literature review to analyze the data. They propose that while dogs do not replace children, they can offer a chance to fulfill an innate nurturing drive similar to parenting, but with fewer demands than raising biological children.


In Large Language Models We Trust?

Communications of the ACM

In our social relations, trust means we believe a person will act with competence, sincerity, and care. The competence assessment supports my belief you have the skills and resources to do the job you promised. For if I doubt your skills or resources, I will not trust your promise. The sincerity assessment supports my belief you intend to fulfill your promise. For if I doubt your intent, I will not trust your promise.


Adversarial Tokenization

Geh, Renato Lui, Shao, Zilei, Broeck, Guy Van den

arXiv.org Artificial Intelligence

Current LLM pipelines account for only one possible tokenization for a given string, ignoring exponentially many alternative tokenizations during training and inference. For example, the standard Llama3 tokenization of penguin is [p,enguin], yet [peng,uin] is another perfectly valid alternative. In this paper, we show that despite LLMs being trained solely on one tokenization, they still retain semantic understanding of other tokenizations, raising questions about their implications in LLM safety. Put succinctly, we answer the following question: can we adversarially tokenize an obviously malicious string to evade safety and alignment restrictions? We show that not only is adversarial tokenization an effective yet previously neglected axis of attack, but it is also competitive against existing state-of-the-art adversarial approaches without changing the text of the harmful request. We empirically validate this exploit across three state-of-the-art LLMs and adversarial datasets, revealing a previously unknown vulnerability in subword models.
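The combinatorial ambiguity the authors exploit is easy to reproduce with a toy segmentation search. The vocabulary below is illustrative only, not the real Llama3 vocabulary; the point is that a single string admits multiple valid token sequences.

```python
def segmentations(s, vocab):
    """Enumerate every way to split string s into tokens drawn from vocab."""
    if not s:
        return [[]]  # one segmentation of the empty string: no tokens
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in vocab:
            for rest in segmentations(s[i:], vocab):
                results.append([prefix] + rest)
    return results

# Toy vocabulary (illustrative; not drawn from any real tokenizer)
vocab = {"p", "pen", "peng", "enguin", "guin", "uin"}
for seg in segmentations("penguin", vocab):
    print(seg)
# ['p', 'enguin']
# ['pen', 'guin']
# ['peng', 'uin']
```

A standard tokenizer deterministically picks one of these sequences; the attack surface the paper describes comes from feeding the model one of the alternatives instead.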


Flow Matching: Markov Kernels, Stochastic Processes and Transport Plans

Wald, Christian, Steidl, Gabriele

arXiv.org Artificial Intelligence

Among generative neural models, flow matching techniques stand out for their simple applicability and good scaling properties. Here, velocity fields of curves connecting a simple latent and a target distribution are learned. Then the corresponding ordinary differential equation can be used to sample from a target distribution, starting in samples from the latent one. This paper reviews from a mathematical point of view different techniques to learn the velocity fields of absolutely continuous curves in the Wasserstein geometry. We show how the velocity fields can be characterized and learned via i) transport plans (couplings) between latent and target distributions, ii) Markov kernels and iii) stochastic processes, where the latter two include the coupling approach, but are in general broader. Besides this main goal, we show how flow matching can be used for solving Bayesian inverse problems, where the definition of conditional Wasserstein distances plays a central role. Finally, we briefly address continuous normalizing flows and score matching techniques, which approach the learning of velocity fields of curves from other directions.
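For the common straight-line (optimal-transport) interpolation, the objects discussed above can be sketched as follows; the notation is mine and not necessarily the paper's.

```latex
% Interpolating curve between a latent sample x_0 \sim p_0
% and a target sample x_1 \sim p_1
x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \in [0, 1]

% Conditional target velocity along this curve
u_t(x_t \mid x_0, x_1) = x_1 - x_0

% Conditional flow matching objective for the learned field v_\theta
\mathcal{L}(\theta) = \mathbb{E}_{t,\,(x_0, x_1)}
    \bigl\| v_\theta(t, x_t) - (x_1 - x_0) \bigr\|^2

% Sampling: solve the ODE forward from a latent draw
\frac{\mathrm{d}}{\mathrm{d}t}\, \phi_t(x) = v_\theta\bigl(t, \phi_t(x)\bigr),
\qquad \phi_0(x) = x \sim p_0
```

Here the pairing of $x_0$ and $x_1$ is exactly the transport plan (coupling) of point i) in the abstract; the Markov-kernel and stochastic-process views generalize how $x_t$ is constructed.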


The Best Animated Movie of the Year Is Here

Slate

From the very first scene of The Wild Robot, the new animated movie from director Chris Sanders (How to Train Your Dragon), adapted from the first in a trilogy of children's novels by Peter Brown, the viewer is plunged along with the protagonist into a new and alien world. A robot washes up on the shore of a lushly forested island, surrounded by the flotsam of some sort of wrecked vehicle--a plane? a spacecraft?--and immediately begins scanning the area for someone she can help. Rozzum Unit 7134, voiced by Lupita Nyong'o and soon to be known as "Roz," has been designed to, as she puts it, offer "integrated, multifaceted task accomplishment" to whatever human requests it of her. The problem is, the island where she's washed up has no human inhabitants, and the animals witnessing the arrival of this hulking metal biped regard Roz as nothing but a menacing predator to be either fought or fled. A witty time-lapse montage shows the robot powering down for a bit so her software can learn to decode the animal sounds around her, enabling her to communicate with all the island's denizens.


A Language Model's Guide Through Latent Space

von Rütte, Dimitri, Anagnostidis, Sotiris, Bachmann, Gregor, Hofmann, Thomas

arXiv.org Artificial Intelligence

Concept guidance has emerged as a cheap and simple way to control the behavior of language models by probing their hidden representations for concept vectors and using them to perturb activations at inference time. While the focus of previous work has largely been on truthfulness, in this paper we extend this framework to a richer set of concepts such as appropriateness, humor, creativity and quality, and explore to what degree current detection and guidance strategies work in these challenging settings. To facilitate evaluation, we develop a novel metric for concept guidance that takes into account both the success of concept elicitation as well as the potential degradation in fluency of the guided model. Our extensive experiments reveal that while some concepts such as truthfulness more easily allow for guidance with current techniques, novel concepts such as appropriateness or humor either remain difficult to elicit, need extensive tuning to work, or even experience confusion. Moreover, we find that probes with optimal detection accuracies do not necessarily make for the optimal guides, contradicting previous observations for truthfulness. Our work warrants a deeper investigation into the interplay between detectability, guidability, and the nature of the concept, and we hope that our rich experimental test-bed for guidance research inspires stronger follow-up approaches.
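Concept guidance of this flavor is often implemented as a difference-of-means probe over hidden activations, followed by an additive perturbation at inference time. The sketch below is a generic illustration under that assumption, not the authors' code, and uses tiny hand-made vectors in place of real model activations.

```python
def concept_vector(pos_acts, neg_acts):
    """Difference-of-means probe: mean activation on concept-positive
    examples minus mean activation on concept-negative examples."""
    dim = len(pos_acts[0])
    pos_mean = [sum(a[i] for a in pos_acts) / len(pos_acts) for i in range(dim)]
    neg_mean = [sum(a[i] for a in neg_acts) / len(neg_acts) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def steer(hidden, direction, alpha):
    """Perturb one hidden state along the concept direction at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy 2-d "activations" for concept-positive and concept-negative prompts
v = concept_vector([[1.0, 0.0], [1.0, 2.0]], [[0.0, 0.0], [0.0, 2.0]])
print(v)                          # [1.0, 0.0]
print(steer([0.5, 0.5], v, 3.0))  # [3.5, 0.5]
```

The paper's metric addresses the trade-off this sketch makes visible: larger alpha elicits the concept more strongly but risks degrading the guided model's fluency.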


An Analysis of Dialogue Repair in Voice Assistants

Galbraith, Matthew

arXiv.org Artificial Intelligence

Spoken dialogue systems have transformed human-machine interaction by providing real-time responses to queries. However, misunderstandings between the user and system persist. This study explores the significance of interactional language in dialogue repair between virtual assistants and users by analyzing interactions with Google Assistant and Siri, focusing on their utilization and response to the other-initiated repair strategy "huh?" prevalent in human-human interaction. Findings reveal several assistant-generated strategies but an inability to replicate human-like repair strategies such as "huh?". English and Spanish user acceptability surveys show differences in users' repair strategy preferences and assistant usage, with both similarities and disparities among the two surveyed languages. These results shed light on inequalities between interactional language in human-human interaction and human-machine interaction, underscoring the need for further research on the impact of interactional language in human-machine interaction in English and beyond.


Gen Z is comfortable with multiple sex partners, study finds 57% 'willing to consider' non-monogamy

FOX News

Ashley Madison chief strategy officer Paul Keable insists people would be cheating whether or not the controversial "dating" site existed. Gen Z appears to be more comfortable with the concept of non-monogamy than previous generations, according to the controversial online "dating" service Ashley Madison. The polarizing Ashley Madison, which caters to people looking to cheat on their partners and uses the slogan "Life is short. Have an affair," said that Gen Z is its top age group for sign-ups, accounting for 40% of new members in 2022. To understand why so many members of Gen Z, defined as those 18-29 years old, are joining the pro-adultery site, the company surveyed its Gen Z members as well as people of the same ages in the general population across 10 countries via YouGov.