Energy
Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Vykopal, Ivan, Pikuliak, Matúš, Ostermann, Simon, Šimko, Marián
Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants' web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences between the assistants: Perplexity achieves the highest source credibility, whereas GPT-4o cites low-credibility sources at an elevated rate on sensitive topics. This work provides the first systematic comparison of the fact-checking behavior of commonly used chat assistants, offering a foundation for evaluating AI systems in high-stakes information environments.
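As a rough illustration of the source-credibility side of such an evaluation, the sketch below computes the share of an assistant's citations that come from low-credibility domains. The `RATINGS` table, its domains, and the "high"/"low" labels are hypothetical placeholders; the paper's actual credibility source and scoring scheme are not described in this abstract.

```python
from typing import List, Optional
from urllib.parse import urlparse

# Hypothetical credibility ratings keyed by domain (illustrative placeholders only).
RATINGS = {
    "who.int": "high",
    "nature.com": "high",
    "example-conspiracy.net": "low",
}


def domain_of(url: str) -> str:
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def low_credibility_share(cited_urls: List[str]) -> Optional[float]:
    """Fraction of rated citations whose domain is rated 'low'.

    Citations to unrated domains are skipped; returns None if nothing is rated.
    """
    labels = [RATINGS.get(domain_of(u)) for u in cited_urls]
    rated = [label for label in labels if label is not None]
    if not rated:
        return None
    return sum(label == "low" for label in rated) / len(rated)


citations = [
    "https://www.who.int/news/item/1",
    "https://example-conspiracy.net/post",
    "https://unrated-blog.org/x",
    "https://nature.com/articles/abc",
]
print(low_credibility_share(citations))
```

Skipping unrated domains keeps the ratio interpretable, but in practice the number of unrated citations is worth reporting alongside it.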
Assessing the Geographic Generalization and Physical Consistency of Generative Models for Climate Downscaling
Saccardi, Carlo, Pierzyna, Maximilian, Borde, Haitz Sáez de Ocáriz, Monaco, Simone, Meo, Cristian, Liò, Pietro, Saathof, Rudolf, Joseph, Geethu, Dauwels, Justin
Kilometer-scale weather data is crucial for real-world applications but remains computationally intensive to produce using traditional weather simulations. An emerging solution is to use deep learning models, which offer a faster alternative for climate downscaling. However, their reliability is still in question, as they are often evaluated using standard machine learning metrics rather than insights from atmospheric and weather physics. This paper benchmarks recent state-of-the-art deep learning models and introduces physics-inspired diagnostics to evaluate their performance and reliability, with a particular focus on geographic generalization and physical consistency. Our experiments show that, despite the seemingly strong performance of models such as CorrDiff, when trained on a limited set of European geographies (e.g., central Europe), they struggle to generalize to other regions such as Iberia, Morocco in the south, or Scandinavia in the north. They also fail to accurately capture second-order variables such as divergence and vorticity derived from predicted velocity fields. These deficiencies appear even in in-distribution geographies, indicating challenges in producing physically consistent predictions. We propose a simple initial solution: introducing a power spectral density loss function that empirically improves geographic generalization by encouraging the reconstruction of small-scale physical structures. The code for reproducing the experimental results can be found at https://github.com/CarloSaccardi/PSD-Downscaling
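One plausible form of a power spectral density (PSD) loss is sketched below: it compares the radially averaged power spectra of a predicted and a target 2-D field in log space, so a blurry prediction that lacks small-scale power is penalized even when its pixel-wise error is small. The radial binning, the log transform, and the `eps` constant are assumptions of this sketch; the paper's exact formulation may differ, and in training the same computation would be written in a differentiable framework rather than NumPy.

```python
import numpy as np


def radial_psd(field: np.ndarray) -> np.ndarray:
    """Radially averaged power spectral density of a square 2-D field."""
    n = field.shape[0]
    if field.shape != (n, n):
        raise ValueError("this sketch assumes a square grid")
    power = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    # Integer radial wavenumber of every (shifted) Fourier coefficient.
    ky, kx = np.indices((n, n)) - n // 2
    k = np.hypot(kx, ky).astype(int)
    # Average power within each radial bin, keeping wavenumbers up to Nyquist.
    sums = np.bincount(k.ravel(), weights=power.ravel())[: n // 2 + 1]
    counts = np.bincount(k.ravel())[: n // 2 + 1]
    return sums / counts


def psd_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-12) -> float:
    """Mean squared difference of log-PSDs; penalizes missing small-scale power."""
    diff = np.log(radial_psd(pred) + eps) - np.log(radial_psd(target) + eps)
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
target = rng.standard_normal((64, 64))
# A smoothed copy loses high-wavenumber power, mimicking an over-smooth model.
smooth = (target + np.roll(target, 1, 0) + np.roll(target, 1, 1)
          + np.roll(target, 1, (0, 1))) / 4
print(psd_loss(smooth, target))
```

Working in log space keeps the weak high-wavenumber bins from being swamped by the energetic large scales, which is the part of the spectrum the abstract identifies as hard to reconstruct.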
Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents
Obando-Ceron, Johan, Mayor, Walter, Lavoie, Samuel, Fujimoto, Scott, Courville, Aaron, Castro, Pablo Samuel
Recent works have proposed accelerating the wall-clock training time of actor-critic methods via large-scale environment parallelization; unfortunately, these methods can sometimes still require a large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients. When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.
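A minimal sketch of a simplicial embedding layer, assuming the common formulation in which a feature vector is split into groups and each group is passed through a softmax with temperature `tau`, so every group lies on a probability simplex. The group count, `tau`, and the NumPy forward pass are illustrative assumptions; how the layer is placed inside FastTD3, FastSAC, or PPO is not shown here.

```python
import numpy as np


def simplicial_embedding(h: np.ndarray, num_simplices: int, tau: float = 1.0) -> np.ndarray:
    """Project features onto a product of simplices.

    h: (batch, num_simplices * simplex_dim) pre-activations, e.g. from a linear layer.
    Each consecutive block of simplex_dim entries goes through a softmax, so every
    block is non-negative and sums to one.
    """
    batch, width = h.shape
    if width % num_simplices:
        raise ValueError("feature width must be divisible by num_simplices")
    z = h.reshape(batch, num_simplices, width // num_simplices) / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return (e / e.sum(axis=-1, keepdims=True)).reshape(batch, width)


rng = np.random.default_rng(0)
h = rng.standard_normal((4, 32))
out = simplicial_embedding(h, num_simplices=8, tau=0.5)
print(out.reshape(4, 8, 4).sum(axis=-1))  # every group sums to 1
```

Lowering `tau` pushes each group toward a one-hot vector, which is one way to read the "sparse and discrete features" described in the abstract.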
Time Series Foundation Models: Benchmarking Challenges and Requirements
Meyer, Marcel, Kaltenpoth, Sascha, Zalipski, Kevin, Müller, Oliver
Time Series Foundation Models (TSFMs) represent a new paradigm for time series forecasting, offering zero-shot forecasting capabilities without the need for domain-specific pre-training or fine-tuning. However, as with Large Language Models (LLMs), evaluating TSFMs is tricky: as training sets grow ever larger, it becomes increasingly difficult to ensure the integrity of benchmarking data. Our investigation of existing TSFM evaluation highlights multiple challenges, ranging from the representativeness of the benchmark datasets and the lack of spatiotemporal evaluation to risks of information leakage due to overlapping and obscure datasets, and the memorization of global patterns caused by external shocks such as economic crises or pandemics. Our findings reveal widespread confusion regarding data partitions, risking inflated performance estimates and incorrect transfer of global knowledge to local time series. We argue for the development of robust evaluation methodologies to prevent pitfalls already observed in LLM and classical time series benchmarking, and call upon the research community to design new, principled approaches, such as evaluations on truly out-of-sample future data, to safeguard the integrity of TSFM assessment.
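Two of the safeguards mentioned above can be sketched in a few lines: restricting evaluation to observations after a model's training-data cutoff, and flagging benchmark datasets that also appear in the pretraining corpus. The function names, dataset names, and dates are hypothetical. Name matching is deliberately naive; it will not catch renamed or derived copies, which is exactly the risk posed by obscure, overlapping datasets.

```python
from datetime import date
from typing import List, Set, Tuple

Series = List[Tuple[date, float]]


def out_of_sample_window(series: Series, training_cutoff: date) -> Series:
    """Keep only observations strictly after the model's training-data cutoff."""
    return [(t, y) for t, y in series if t > training_cutoff]


def overlapping_datasets(benchmark: Set[str], pretraining_corpus: Set[str]) -> Set[str]:
    """Benchmark datasets that also appear, by name, in the pretraining corpus."""
    return benchmark & pretraining_corpus


series = [(date(2023, 12, 31), 1.0), (date(2024, 1, 1), 2.0), (date(2024, 6, 1), 3.0)]
print(out_of_sample_window(series, training_cutoff=date(2023, 12, 31)))
print(overlapping_datasets({"dataset_a", "dataset_b", "dataset_c"},
                           {"dataset_b", "dataset_c", "dataset_d"}))
```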
In-Browser LLM-Guided Fuzzing for Real-Time Prompt Injection Testing in Agentic AI Browsers
AI-powered browser assistants (also known as autonomous browsing agents or agentic AI browsers) are emerging tools that use LLMs to help users navigate and interact with web content. For example, an AI agent can be instructed to summarize a webpage or perform actions like clicking links and filling forms on behalf of the user. While these agents promise enhanced productivity, they also introduce new security risks. One major risk is prompt injection, where an attacker embeds malicious instructions into web content that the agent will process [5]. Crucially, such instructions can be hidden from the human user (e.g., invisible text, HTML comments) yet still parsed by the LLM, causing it to alter its behavior in unintended ways [10]. In effect, the agent can be tricked into executing the attacker's commands rather than the user's, leading to potentially severe consequences [2]. Indirect prompt injections have been demonstrated in real-world scenarios.
Narrow Operator Models of Stellarator Equilibria in Fourier Zernike Basis
Thun, Timo, Conlin, Rory, Panici, Dario, Böckenhoff, Daniel
Stellarators are inherently steady-state plasma confinement devices, which is among the key reasons behind their renaissance as promising candidates for fusion power plants. Ideal MHD equilibria are central to optimising the complex, three-dimensional plasma shapes that are a necessary condition for steady-state operation of such devices. The equilibrium magnetic field is required not only in optimisation but also in future real-time control algorithms and simulation frameworks (Schissel et al. 2025). Solving the three-dimensional MHD equations requires numerical approaches, because no analytical solutions of ideal MHD equilibria with nested magnetic topology throughout the full volume exist yet (Bruno & Laurence 1996). Recent work has advanced analytical models for the Fourier components of the equilibrium magnetic field in a subset of reactor-relevant magnetic fields, and analytical expansions close to the magnetic axis are used extensively in research (Nikulsin et al. 2024; Sengupta et al. 2024). These analytical solutions and the numerical solvers discussed below assume nested magnetic topology, or integrability throughout the volume, and computing chaotic regions or magnetic islands takes considerably more effort (Hudson et al. 2012). The accuracy of numerical PDE solutions is inherently connected to the representation that defines gradients, and commonly used ideal MHD equilibrium solvers with nested magnetic field topology can be distinguished accordingly: VMEC (Hirshman & Whitson 1983), a widely used finite-difference solver, was employed in the design of currently operating stellarator devices; DESC (Dudt & Kolemen 2020) is a pseudo-spectral solver; and GVEC (Hindenlang et al. 2025) abstracts the notion of basis functions, which has enabled the computation of plasmas with a figure-8 shape (Plunk et al. 2025).
Email address for correspondence: timo.thun@ipp.mpg.de
Assessing LLM Reasoning Through Implicit Causal Chain Discovery in Climate Discourse
Allein, Liesbeth, Pineda-Castañeda, Nataly, Rocci, Andrea, Moens, Marie-Francine
How does a cause lead to an effect, and which intermediate causal steps explain their connection? This work scrutinizes the mechanistic causal reasoning capabilities of large language models (LLMs) by having them answer these questions through the task of implicit causal chain discovery. In a diagnostic evaluation framework, we instruct nine LLMs to generate all possible intermediate causal steps linking given cause-effect pairs in causal chain structures. These pairs are drawn from recent resources in argumentation studies featuring polarized discussions on climate change. Our analysis reveals that LLMs vary in the number and granularity of the causal steps they produce. Although they are generally self-consistent and confident about the intermediate causal connections in the generated chains, their judgments are mainly driven by associative pattern matching rather than genuine causal reasoning. Nonetheless, human evaluations confirmed the logical coherence and integrity of the generated chains. Our baseline causal chain discovery approach, the insights from our diagnostic evaluation, and our benchmark dataset of causal chains lay a solid foundation for future work on implicit, mechanistic causal reasoning in argumentation settings.
Semantic Communication Enabled Holographic Video Processing and Transmission
Ying, Jingkai, Qi, Zhiyuan, Feng, Yulong, Qin, Zhijin, Han, Zhu, Tafazolli, Rahim, Eldar, Yonina C.
Holographic video communication is considered a paradigm shift in visual communications and is becoming increasingly popular for its ability to offer immersive experiences. This article provides an overview of holographic video communication and outlines the requirements of a holographic video communication system. In particular, following a brief review of semantic communication, an architecture for a semantic-enabled holographic video communication system is presented. Key technologies, including semantic sampling, joint semantic-channel coding, and semantic-aware transmission, are designed based on the proposed architecture. Two related use cases are presented to demonstrate the performance gains of the proposed methods. Finally, potential research topics are discussed to pave the way for the realization of semantic-enabled holographic video communications.
Holographic video is a revolutionary information modality that provides panoramic video content and an immersive experience based on three-dimensional views and high-resolution holograms [1]. Holographic video communication (HVC) is regarded as the dominant paradigm for future visual communications. It is considered a potential means of realizing the metaverse and enabling numerous applications, such as holographic conferencing, education, and entertainment.
Prediction Markets with Intermittent Contributions
Vitali, Michael, Pinson, Pierre
Although both data availability and the demand for accurate forecasts are increasing, collaboration between stakeholders is often constrained by data ownership and competitive interests. In contrast to recent proposals within cooperative game-theoretical frameworks, we place ourselves in a more general framework based on prediction markets, in which independent agents trade forecasts of uncertain future events in exchange for rewards. We introduce and analyse a prediction market that (i) accounts for the historical performance of the agents, (ii) adapts to time-varying conditions, and (iii) permits agents to enter and exit the market at will. The proposed design employs robust regression models to learn the optimal combination of forecasts while handling missing submissions. Moreover, we introduce a pay-off allocation mechanism that considers both in-sample and out-of-sample performance while satisfying several desirable economic properties. Case studies using simulated and real-world data demonstrate the effectiveness and adaptability of the proposed market design.
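To make properties (i) to (iii) concrete, the sketch below uses a deliberately simpler technique than the paper's robust regression: inverse-MSE weighting of the submitted forecasts, with an exponentially forgetting MSE for adaptation and renormalization over whichever agents submitted in a round. The function names, `forgetting`, and `prior_mse` (the assumed error of a newly entering agent) are hypothetical, and the pay-off allocation mechanism is not modeled.

```python
from typing import Dict, Optional


def combine(forecasts: Dict[str, Optional[float]], mse: Dict[str, float],
            prior_mse: float = 1.0) -> float:
    """Inverse-MSE weighted average over agents that submitted this round.

    A None forecast means the agent is absent; the remaining weights are
    renormalized so they sum to one. Unseen agents get prior_mse.
    """
    present = {a: f for a, f in forecasts.items() if f is not None}
    if not present:
        raise ValueError("no submissions this round")
    weights = {a: 1.0 / mse.get(a, prior_mse) for a in present}
    total = sum(weights.values())
    return sum(weights[a] * f for a, f in present.items()) / total


def update_mse(mse: Dict[str, float], forecasts: Dict[str, Optional[float]],
               outcome: float, forgetting: float = 0.95, prior_mse: float = 1.0) -> None:
    """Exponentially forgetting MSE update, applied only to agents that submitted."""
    for a, f in forecasts.items():
        if f is not None:
            previous = mse.get(a, prior_mse)
            mse[a] = forgetting * previous + (1 - forgetting) * (f - outcome) ** 2


mse = {"a": 1.0, "b": 4.0}
round_forecasts = {"a": 10.0, "b": 20.0, "c": None}  # agent "c" skipped this round
print(combine(round_forecasts, mse))  # agent "a" gets four times the weight of "b"
```

The forgetting factor trades responsiveness to changing conditions against noise in the weights; a regression-based combination, as in the paper, can additionally learn correlations between agents that inverse-error weighting ignores.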
Km-scale dynamical downscaling through conformalized latent diffusion models
Brusaferri, Alessandro, Ballarino, Andrea
Dynamical downscaling is crucial for deriving high-resolution meteorological fields from coarse-scale simulations, enabling detailed analysis for critical applications such as weather forecasting and renewable energy modeling. Generative diffusion models (DMs) have recently emerged as powerful data-driven tools for this task, offering high reconstruction fidelity and more scalable sampling that supports uncertainty quantification. However, DMs provide no formal guarantee that the uncertainty estimates derived from their samples are well calibrated. In this work, we tackle this issue by augmenting the downscaling pipeline with a conformal prediction framework. Specifically, the DM's samples are post-processed to derive conditional quantile estimates, which are incorporated into a conformalized quantile regression procedure targeting locally adaptive prediction intervals with finite-sample marginal validity. The proposed approach is evaluated on ERA5 reanalysis data over Italy, downscaled to a 2-km grid. Results demonstrate grid-point-level uncertainty estimates with markedly improved coverage and stable probabilistic scores relative to the DM baseline, highlighting the potential of conformalized generative models for more trustworthy probabilistic downscaling to high-resolution meteorological fields.
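The conformalized quantile regression step can be sketched as follows, assuming the per-grid-point lower and upper quantiles are taken as empirical quantiles of the DM's samples at miscoverage level `alpha` (the paper's exact quantile estimator may differ). Conformity scores on a held-out calibration set measure how far observations fall outside the raw quantile band; a single correction `q` computed from them then widens (or narrows) every interval to `[lo - q, hi + q]`. Because `lo` and `hi` vary per grid point, the resulting intervals stay locally adaptive.

```python
import math

import numpy as np


def sample_quantiles(samples: np.ndarray, alpha: float):
    """Per-point lower/upper quantiles from generative samples.

    samples: (num_samples, num_points) draws from the generative model.
    """
    lo = np.quantile(samples, alpha / 2, axis=0)
    hi = np.quantile(samples, 1 - alpha / 2, axis=0)
    return lo, hi


def cqr_correction(lo_cal: np.ndarray, hi_cal: np.ndarray,
                   y_cal: np.ndarray, alpha: float) -> float:
    """Split-conformal correction used by conformalized quantile regression."""
    # Positive score: observation lies outside the band; negative: inside it.
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return float(np.sort(scores)[k - 1])


# Toy check: a deliberately too-narrow band [-0.5, 0.5] for standard-normal data.
rng = np.random.default_rng(0)
alpha = 0.1
y_cal = rng.standard_normal(2000)
q = cqr_correction(np.full(2000, -0.5), np.full(2000, 0.5), y_cal, alpha)
y_test = rng.standard_normal(2000)
coverage = np.mean((y_test >= -0.5 - q) & (y_test <= 0.5 + q))
print(q, coverage)  # q widens the band so that coverage is close to 1 - alpha
```

The guarantee is marginal: averaged over calibration and test draws, coverage is at least `1 - alpha`, which matches the "finite-sample marginal validity" targeted in the abstract, but it does not hold separately at each grid point.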