Thuringia
MassSpecGym: A benchmark for the discovery and identification of molecules Roman Bushuiev
Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym - the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data.
Medieval plague victims likely found in mass grave in Germany
Archaeologists say they located a Black Death burial site containing some of a village's 12,000 dead. Breakthroughs, discoveries, and DIY tips sent six days a week. The Black Death () killed as much as half of Europe's total population between 1346 and 1353, so there are a of bodies buried across the continent. For example, contemporary accounts from Thuringia--a state in central Germany--report that about 12,000 plague victims died around Erfurt amid the city's outbreak in 1350. But despite multiple accounts attesting to this devastation, none of the 11 mass graves could be pinpointed for centuries.
CausalRivers -- Scaling up benchmarking of causal discovery for real-world time-series
Stein, Gideon, Shadaydeh, Maha, Blunk, Jan, Penzel, Niklas, Denzler, Joachim
Causal discovery, or identifying causal relationships from observational data, is a notoriously challenging task, with numerous methods proposed to tackle it. Despite this, in-the-wild evaluation of these methods is still lacking, as works frequently rely on synthetic data evaluation and sparse real-world examples under critical theoretical assumptions. Real-world causal structures, however, are often complex, making it hard to decide on a proper causal discovery strategy. To bridge this gap, we introduce CausalRivers, the largest in-the-wild causal discovery benchmarking kit for time-series data to date. CausalRivers features an extensive dataset on river discharge that covers the eastern German territory (666 measurement stations) and the state of Bavaria (494 measurement stations). It spans the years 2019 to 2023 with a 15-minute temporal resolution. Further, we provide additional data from a flood around the Elbe River, as an event with a pronounced distributional shift. Leveraging multiple sources of information and time-series meta-data, we constructed two distinct causal ground truth graphs (Bavaria and eastern Germany). These graphs can be sampled to generate thousands of subgraphs to benchmark causal discovery across diverse and challenging settings. To demonstrate the utility of CausalRivers, we evaluate several causal discovery approaches through a set of experiments to identify areas for improvement. CausalRivers has the potential to facilitate robust evaluations and comparisons of causal discovery methods. Besides this primary purpose, we also expect that this dataset will be relevant for connected areas of research, such as time-series forecasting and anomaly detection. Based on this, we hope to push benchmark-driven method development that fosters advanced techniques for causal discovery, as is the case for many other areas of machine learning.
Early Detection of Forest Calamities in Homogeneous Stands -- Deep Learning Applied to Bark-Beetle Outbreaks
Kirsch, Maximilian, Wernicke, Jakob, Datta, Pawan, Preisach, Christine
Climate change has increased the vulnerability of forests to insect-related damage, resulting in widespread forest loss in Central Europe and highlighting the need for effective, continuous monitoring systems. Remote sensing based forest health monitoring, oftentimes, relies on supervised machine learning algorithms that require labeled training data. Monitoring temporal patterns through time series analysis offers a potential alternative for earlier detection of disturbance but requires substantial storage resources. This study investigates the potential of a Deep Learning algorithm based on a Long Short Term Memory (LSTM) Autoencoder for the detection of anomalies in forest health (e.g. bark beetle outbreaks), utilizing Sentinel-2 time series data. This approach is an alternative to supervised machine learning methods, avoiding the necessity for labeled training data. Furthermore, it is more memory-efficient than other time series analysis approaches, as a robust model can be created using only a 26-week-long time series as input. In this study, we monitored pure stands of spruce in Thuringia, Germany, over a 7-year period from 2018 to the end of 2024. Our best model achieved a detection accuracy of 87% on test data and was able to detect 61% of all anomalies at a very early stage (more than a month before visible signs of forest degradation). Compared to another widely used time series break detection algorithm - BFAST (Breaks For Additive Season and Trend), our approach consistently detected higher percentage of anomalies at an earlier stage. These findings suggest that LSTM-based Autoencoders could provide a promising, resource-efficient approach to forest health monitoring, enabling more timely responses to emerging threats.
Teleology-Driven Affective Computing: A Causal Framework for Sustained Well-Being
Yin, Bin, Liu, Chong-Yi, Fu, Liya, Zhang, Jinkun
Affective computing has made significant strides in emotion recognition and generation, yet current approaches mainly focus on short-term pattern recognition and lack a comprehensive framework to guide affective agents toward long-term human well-being. To address this, we propose a teleology-driven affective computing framework that unifies major emotion theories (basic emotion, appraisal, and constructivist approaches) under the premise that affect is an adaptive, goal-directed process that facilitates survival and development. Our framework emphasizes aligning agent responses with both personal/individual and group/collective well-being over extended timescales. We advocate for creating a "dataverse" of personal affective events, capturing the interplay between beliefs, goals, actions, and outcomes through real-world experience sampling and immersive virtual reality. By leveraging causal modeling, this "dataverse" enables AI systems to infer individuals' unique affective concerns and provide tailored interventions for sustained well-being. Additionally, we introduce a meta-reinforcement learning paradigm to train agents in simulated environments, allowing them to adapt to evolving affective concerns and balance hierarchical goals - from immediate emotional needs to long-term self-actualization. This framework shifts the focus from statistical correlations to causal reasoning, enhancing agents' ability to predict and respond proactively to emotional challenges, and offers a foundation for developing personalized, ethically aligned affective systems that promote meaningful human-AI interactions and societal well-being.
A Cautionary Tale About "Neutrally" Informative AI Tools Ahead of the 2025 Federal Elections in Germany
Dormuth, Ina, Franke, Sven, Hafer, Marlies, Katzke, Tim, Marx, Alexander, Müller, Emmanuel, Neider, Daniel, Pauly, Markus, Rutinowski, Jérôme
In this study, we examine the reliability of AI-based Voting Advice Applications (VAAs) and large language models (LLMs) in providing objective political information. Our analysis is based upon a comparison with party responses to 38 statements of the Wahl-O-Mat, a well-established German online tool that helps inform voters by comparing their views with political party positions. For the LLMs, we identify significant biases. They exhibit a strong alignment (over 75% on average) with left-wing parties and a substantially lower alignment with center-right (smaller 50%) and right-wing parties (around 30%). Furthermore, for the VAAs, intended to objectively inform voters, we found substantial deviations from the parties' stated positions in Wahl-O-Mat: While one VAA deviated in 25% of cases, another VAA showed deviations in more than 50% of cases. For the latter, we even observed that simple prompt injections led to severe hallucinations, including false claims such as non-existent connections between political parties and right-wing extremist ties.
Noncooperative Equilibrium Selection via a Trading-based Auction
Im, Jaehan, Fotiadis, Filippos, Delahaye, Daniel, Topcu, Ufuk, Fridovich-Keil, David
Noncooperative multi-agent systems often face coordination challenges due to conflicting preferences among agents. In particular, agents acting in their own self-interest can settle on different equilibria, leading to suboptimal outcomes or even safety concerns. We propose an algorithm named trading auction for consensus (TACo), a decentralized approach that enables noncooperative agents to reach consensus without communicating directly or disclosing private valuations. TACo facilitates coordination through a structured trading-based auction, where agents iteratively select choices of interest and provably reach an agreement within an a priori bounded number of steps. A series of numerical experiments validate that the termination guarantees of TACo hold in practice, and show that TACo achieves a median performance that minimizes the total cost across all agents, while allocating resources significantly more fairly than baseline approaches.
Predicting Air Temperature from Volumetric Urban Morphology with Machine Learning
Kıvılcım, Berk, Bradley, Patrick Erik
In this study, we firstly introduce a method that converts CityGML data into voxels which works efficiently and fast in high resolution for large scale datasets such as cities but by sacrificing some building details to overcome the limitations of previous voxelization methodologies that have been computationally intensive and inefficient at transforming large-scale urban areas into voxel representations for high resolution. Those voxelized 3D city data from multiple cities and corresponding air temperature data are used to develop a machine learning model. Before the model training, Gaussian blurring is implemented on input data to consider spatial relationships, as a result the correlation rate between air temperature and volumetric building morphology is also increased after the Gaussian blurring. After the model training, the prediction results are not just evaluated with Mean Square Error (MSE) but some image similarity metrics such as Structural Similarity Index Measure (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS) that are able to detect and consider spatial relations during the evaluation process. This trained model is capable of predicting the spatial distribution of air temperature by using building volume information of corresponding pixel as input. By doing so, this research aims to assist urban planners in incorporating environmental parameters into their planning strategies, thereby facilitating more sustainable and inhabitable urban environments.
LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
Ye, Xi, Yin, Fangcong, He, Yinghui, Zhang, Joie, Yen, Howard, Gao, Tianyu, Durrett, Greg, Chen, Danqi
Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured outputs, they enable reliable rule-based evaluation. We evaluate 17 LCLMs on LongProc across three difficulty levels, with maximum numbers of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Further analysis reveals that LCLMs struggle to maintain long-range coherence in long-form generations. These findings highlight critical limitations in current LCLMs and suggest substantial room for improvement. Data and code available at: https://princeton-pli.github.io/LongProc