A new classification system of beer categories and styles based on large-scale data mining and self-organizing maps of beer recipes
A data-driven quantitative approach was used to develop a novel classification system for beer categories and styles. Sixty-two thousand one hundred twenty-one beer recipes were mined and analyzed, considering ingredient profiles, fermentation parameters, and recipe vital statistics. Statistical analyses combined with self-organizing maps (SOMs) identified four major superclusters that showed distinctive malt and hop usage patterns, style characteristics, and historical brewing traditions. Cold fermented styles showed a conservative grain and hop composition, whereas hot fermented beers exhibited high heterogeneity, reflecting regional preferences and innovation. This new taxonomy offers a reproducible and objective framework beyond traditional sensory-based classifications, providing brewers, researchers, and educators with a scalable tool for recipe analysis and beer development. The findings in this work provide an understanding of beer diversity and open avenues for linking ingredient usage with fermentation profiles and flavor outcomes.
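As a rough illustration of how a self-organizing map groups recipes, the sketch below trains a tiny one-dimensional SOM on toy data. Everything here is an assumption made for illustration: the two features (a scaled fermentation temperature and a scaled hopping rate), the six invented recipes, the 1-D map with four nodes, the linear initialization, and the function names `init_linear`, `bmu`, and `train_som`. It is not the authors' pipeline, which works on far richer ingredient profiles.

```python
import math

def init_linear(data, n_nodes):
    """Place nodes evenly between the per-feature min and max of the data."""
    lo = [min(col) for col in zip(*data)]
    hi = [max(col) for col in zip(*data)]
    return [[l + (h - l) * i / (n_nodes - 1) for l, h in zip(lo, hi)]
            for i in range(n_nodes)]

def bmu(weights, x):
    """Index of the best-matching unit (nearest node by squared distance)."""
    dists = [sum((w - v) ** 2 for w, v in zip(node, x)) for node in weights]
    return dists.index(min(dists))

def train_som(data, n_nodes=4, epochs=50, lr0=0.5, sigma0=1.0):
    """Train a 1-D SOM with linearly decaying learning rate and neighborhood."""
    weights = init_linear(data, n_nodes)
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        sigma = max(sigma0 * frac, 0.1)  # floor keeps the Gaussian well defined
        for x in data:
            b = bmu(weights, x)
            for j, node in enumerate(weights):
                # Gaussian neighborhood on the map grid, centered on the BMU.
                h = math.exp(-((j - b) ** 2) / (2 * sigma ** 2))
                for d in range(len(node)):
                    node[d] += lr * h * (x[d] - node[d])
    return weights

# Invented toy recipes: [scaled fermentation temperature, scaled hopping rate].
cold = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15]]
warm = [[0.8, 0.9], [0.9, 0.8], [0.85, 0.85]]
som = train_som(cold + warm)
print([bmu(som, x) for x in cold], [bmu(som, x) for x in warm])
```

On this toy input the two groups land on different map nodes, which is the property a clustering step on top of the map would exploit.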
Constrained Non-negative Matrix Factorization for Guided Topic Modeling of Minority Topics
Ebrahimi, Seyedeh Fatemeh, Peltonen, Jaakko
Topic models often fail to capture low-prevalence, domain-critical themes, so-called minority topics, such as mental health themes in online comments. While some existing methods can incorporate domain knowledge, such as expected topical content, methods allowing guidance may require overly detailed expected topics, hindering the discovery of topic divisions and variation. We propose a topic modeling solution via a specially constrained NMF. We incorporate a seed word list characterizing minority content of interest, but we do not require experts to pre-specify their division across minority topics. Through prevalence constraints on minority topics and seed word content across topics, we learn distinct data-driven minority topics as well as majority topics. The constrained NMF is fitted via Karush-Kuhn-Tucker (KKT) conditions with multiplicative updates. We outperform several baselines on synthetic data in terms of topic purity and normalized mutual information, and we also evaluate topic quality using Jensen-Shannon divergence (JSD). We conduct a case study on YouTube vlog comments, analyzing viewer discussion of mental health content; our model successfully identifies and reveals this domain-relevant minority content.
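For readers unfamiliar with multiplicative updates, the following is a minimal sketch of the classic unconstrained NMF updates for the squared Frobenius objective (Lee and Seung). It deliberately omits the seed-word and prevalence constraints described in the abstract, so it should not be read as the authors' algorithm. The function names, the toy document-word matrix, and the choice of two topics are all assumptions made for illustration.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_b] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf_multiplicative(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factor V (n x m, non-negative) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H

# Toy document-word counts with two visible blocks of co-occurring words.
V = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 4, 1], [0, 0, 1, 4]]
W, H = nmf_multiplicative(V, k=2)
print(round(frobenius_error(V, W, H), 3))
```

Because each update only multiplies by non-negative ratios, the factors stay non-negative without any projection step. A constrained variant would change the ratios rather than the overall scheme.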
PoseX: AI Defeats Physics Approaches on Protein-Ligand Cross Docking
Jiang, Yize, Li, Xinze, Zhang, Yuanyuan, Han, Jin, Xu, Youjun, Pandit, Ayush, Zhang, Zaixi, Wang, Mengdi, Wang, Mengyang, Liu, Chong, Yang, Guang, Choi, Yejin, Li, Wu-Jun, Fu, Tianfan, Wu, Fang, Liu, Junhong
Existing protein-ligand docking studies typically focus on the self-docking scenario, which is less practical in real applications. Moreover, some studies involve heavy frameworks requiring extensive training, posing challenges for convenient and efficient assessment of docking methods. To fill these gaps, we design PoseX, an open-source benchmark to evaluate both self-docking and cross-docking, enabling a practical and comprehensive assessment of algorithmic advances. Specifically, first, we curated a novel dataset comprising 718 entries for self-docking and 1,312 entries for cross-docking; second, we incorporated 23 docking methods in three methodological categories, including physics-based methods (e.g., Schrödinger Glide), AI docking methods (e.g., DiffDock) and AI co-folding methods (e.g., AlphaFold3); third, we developed a relaxation method for post-processing to minimize conformational energy and refine binding poses; fourth, we built a leaderboard to rank submitted models in real time. We derived some key insights and conclusions from extensive experiments: (1) AI approaches have consistently outperformed physics-based methods in overall docking success rate. (2) Most intra- and intermolecular clashes of AI approaches can be greatly alleviated with relaxation, which means combining AI modeling with physics-based post-processing could achieve excellent performance. (3) AI co-folding methods exhibit ligand chirality issues, except for Boltz-1x, which introduced physics-inspired potentials to fix hallucinations, suggesting that modeling stereochemistry markedly improves structural plausibility. (4) Specifying binding pockets significantly improves docking performance, indicating that pocket information can be leveraged effectively, particularly for AI co-folding methods, in future modeling efforts. The code, dataset, and leaderboard are released at https://github.com/CataAI/PoseX.
ClickSight: Interpreting Student Clickstreams to Reveal Insights on Learning Strategies via LLMs
Radmehr, Bahar, Shved, Ekaterina, Güreş, Fatma Betül, Singla, Adish, Käser, Tanja
Clickstream data from digital learning environments offer valuable insights into students' learning behaviors, but are challenging to interpret due to their high dimensionality and granularity. Prior approaches have relied mainly on handcrafted features, expert labeling, clustering, or supervised models, therefore often lacking generalizability and scalability. In this work, we introduce ClickSight, an in-context Large Language Model (LLM)-based pipeline that interprets student clickstreams to reveal their learning strategies. ClickSight takes raw clickstreams and a list of learning strategies as input and generates textual interpretations of students' behaviors during interaction. We evaluate four different prompting strategies and investigate the impact of self-refinement on interpretation quality. Our evaluation spans two open-ended learning environments and uses a rubric-based domain-expert evaluation. Results show that while LLMs can reasonably interpret learning strategies from clickstreams, interpretation quality varies by prompting strategy, and self-refinement offers limited improvement. ClickSight demonstrates the potential of LLMs to generate theory-driven insights from educational interaction data.
Neural Conditional Transport Maps
Rodriguez-Pardo, Carlos, Chiani, Leonardo, Borgonovo, Emanuele, Tavoni, Massimo
We present a neural framework for learning conditional optimal transport (OT) maps between probability distributions. Our approach introduces a conditioning mechanism capable of processing both categorical and continuous conditioning variables simultaneously. At the core of our method lies a hypernetwork that generates transport layer parameters based on these inputs, creating adaptive mappings that outperform simpler conditioning methods. Comprehensive ablation studies demonstrate the superior performance of our method over baseline configurations. Furthermore, we showcase an application to global sensitivity analysis, offering high performance in computing OT-based sensitivity indices. This work advances the state-of-the-art in conditional optimal transport, enabling broader application of optimal transport principles to complex, high-dimensional domains such as generative modeling and black-box model explainability.
Clustering and Pruning in Causal Data Fusion
Tabell, Otto, Tikka, Santtu, Karvanen, Juha
Data fusion--the process of combining observational and experimental data--can enable the identification of causal effects that would otherwise remain non-identifiable. Although identification algorithms have been developed for specific scenarios, do-calculus remains the only general-purpose tool for causal data fusion, particularly when variables are present in some data sources but not others. However, approaches based on do-calculus may encounter computational challenges as the number of variables increases and the causal graph grows in complexity. Consequently, there exists a need to reduce the size of such models while preserving the essential features. For this purpose, we propose pruning (removing unnecessary variables) and clustering (combining variables) as preprocessing operations for causal data fusion. We generalize earlier results on a single data source and derive conditions for applying pruning and clustering in the case of multiple data sources. We give sufficient conditions for inferring the identifiability or non-identifiability of a causal effect in a larger graph based on a smaller graph and show how to obtain the corresponding identifying functional for identifiable causal effects. Examples from epidemiology and social science demonstrate the use of the results.
Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning
Santos, Pedro P., Sardinha, Alberto, Melo, Francisco S.
In this work, we contribute the first approach to solve infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated based on a single trajectory. First, we provide some fundamental results regarding policy optimization in the single-trial regime, investigating which class of policies suffices for optimality, casting our problem as a particular MDP that is equivalent to our original problem, as well as studying the computational hardness of policy optimization in the single-trial regime. Second, we show how we can leverage online planning techniques, in particular a Monte-Carlo tree search algorithm, to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.
Data Augmentation and Resolution Enhancement using GANs and Diffusion Models for Tree Segmentation
Ferreira, Alessandro dos Santos, Ramos, Ana Paula Marques, Junior, José Marcato, Gonçalves, Wesley Nunes
Urban forests play a key role in enhancing environmental quality and supporting biodiversity in cities. Mapping and monitoring these green spaces are crucial for urban planning and conservation, yet accurately detecting trees is challenging due to complex landscapes and the variability in image resolution caused by different satellite sensors or UAV flight altitudes. While deep learning architectures have shown promise in addressing these challenges, their effectiveness remains strongly dependent on the availability of large and manually labeled datasets, which are often expensive and difficult to obtain in sufficient quantity. In this work, we propose a novel pipeline that integrates domain adaptation with GANs and Diffusion models to enhance the quality of low-resolution aerial images. Our proposed pipeline enhances low-resolution imagery while preserving semantic content, enabling effective tree segmentation without requiring large volumes of manually annotated data. Leveraging models such as pix2pix, Real-ESRGAN, Latent Diffusion, and Stable Diffusion, we generate realistic and structurally consistent synthetic samples that expand the training dataset and unify scale across domains. This approach not only improves the robustness of segmentation models across different acquisition conditions but also provides a scalable and replicable solution for remote sensing scenarios with scarce annotation resources. Experimental results demonstrated an improvement of over 50% in IoU for low-resolution images, highlighting the effectiveness of our method compared to traditional pipelines.
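Since the headline result is reported in IoU (intersection over union), a short sketch of the metric for binary segmentation masks may help. The function name `iou`, the convention of returning 1.0 when both masks are empty, and the toy masks are assumptions for illustration. The abstract does not say how the authors aggregate IoU across images, nor whether the reported improvement is relative or absolute.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as lists of rows."""
    inter = union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0   # pixel is tree in both masks
            union += 1 if (p or t) else 0    # pixel is tree in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0

# Toy 2 x 3 masks: 1 marks a tree pixel, 0 marks background.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(iou(pred, target))  # 2 shared pixels out of 4 in the union -> 0.5
```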
What AI Thinks It Knows About You
Large language models such as GPT, Llama, Claude, and DeepSeek can be so fluent that people feel it as a "you," and it answers encouragingly as an "I." The models can write poetry in nearly any given form, read a set of political speeches and promptly sift out and share all the jokes, draw a chart, code a website. How do they do these and so many other things that were just recently the sole realm of humans? Practitioners are left explaining jaw-dropping conversational rabbit-from-a-hat extractions with arm-waving that the models are just predicting one word at a time from an unthinkably large training set scraped from every recorded written or spoken human utterance that can be found--fair enough--or with a small shrug and a cryptic utterance of "fine-tuning" or "transformers!" These aren't very satisfying answers for how these models can converse so intelligently, and how they sometimes err so weirdly.
'Every person that clashed with him has left': the rise, fall and spectacular comeback of Sam Altman
The short-lived firing of Sam Altman, the CEO of possibly the world's most important AI company, was sensational. When he was sacked by OpenAI's board members, some of them believed the stakes could not have been higher – the future of humanity – if the organisation continued under Altman. Imagine Succession, with added apocalypse vibes. In early November 2023, after three weeks of secret calls and varying degrees of paranoia, the OpenAI board agreed: Altman had to go. After his removal, Altman's most loyal staff resigned, and others signed an open letter calling for his reinstatement.