Technology
Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment
Despite Contrastive Language-Image Pre-training (CLIP)'s remarkable capability to retrieve content across modalities, a substantial modality gap persists in its feature space. Intriguingly, we discover that off-the-shelf MLLMs (Multimodal Large Language Models) demonstrate powerful inherent modality alignment properties. While recent MLLM-based retrievers with unified architectures partially mitigate this gap, their reliance on coarse modality alignment mechanisms fundamentally limits their potential. In this work, We introduce MAPLE (Modality-Aligned Preference Learning for Embeddings), a novel framework that leverages the finegrained alignment priors inherent in MLLM to guide cross-modal representation learning. MAPLE formulates the learning process as reinforcement learning with two key components: (1) Automatic preference data construction using off-theshelf MLLM, and (2) a new Relative Preference Alignment (RPA) loss, which adapts Direct Preference Optimization (DPO) to the embedding learning setting. Experimental results show that our preference-guided alignment achieves substantial gains in fine-grained cross-modal retrieval, underscoring its effectiveness in handling nuanced semantic distinctions.
Less is More: Local Intrinsic Dimensions of Contextual Language Models
Benjamin Matthias Ruppik, Julius von Rohrscheidt, Carel van Niekerk, Michael Heck, Renato Vukovic, Shutong Feng, Hsien-chin Lin, Nurul Lubis, Bastian Rieck, Marcus Zibrowius, Milica Gasic
Understanding the internal mechanisms of large language models (LLMs) remains a challenging and complex endeavor. Even fundamental questions, such as how fine-tuning affects model behavior, often require extensive empirical evaluation. In this paper, we introduce a novel perspective based on the geometric properties of contextual latent embeddings to study the effects of training and fine-tuning. To that end, we measure the local dimensions of a contextual language model's latent space and analyze their shifts during training and fine-tuning. We show that the local dimensions provide insights into the model's training dynamics and generalization ability. Specifically, the mean of the local dimensions predicts when the model's training capabilities are exhausted, as exemplified in a dialogue state tracking task, overfitting, as demonstrated in an emotion recognition task, and grokking, as illustrated with an arithmetic task. Furthermore, our experiments suggest a practical heuristic: reductions in the mean local dimension tend to accompany and predict subsequent performance gains. Through this exploration, we aim to provide practitioners with a deeper understanding of the implications of fine-tuning on embedding spaces, facilitating informed decisions when configuring models for specific applications. The results of this work contribute to the ongoing discourse on the interpretability, adaptability, and generalizability of LLMs by bridging the gap between intrinsic model mechanisms and geometric properties in embeddings.
Anthropic's design assistant now works better with its coding agent
Anthropic's design assistant now works better with its coding agent Anthropic's design assistant now works better with its coding agent Exactly two months after releasing a preview of Claude Design to subscribers, Anthropic has begun rolling out a major update for its design assistant that brings better integration with its other apps. To start, Claude Design can now begin working from a local codebase, meaning any assets it generates will contain elements that already exist in your front-facing products. From there, the app can hand off a design to Claude Code, allowing the coding agent to program an interface without the need to start from scratch. You also don't need to provide it with screenshots to give it an idea of your intent. And if you want to skip Claude Design, you can do that too, with Anthropic adding the option to create and edit designs directly from Claude Code. Outside of more robust integration with Anthropic's other apps, today's update brings with it quality of life improvements, starting with a more flexible import tool that can build entire design systems from GitHub and raw files.
Automatic Auxiliary Task Selection and Adaptive Weighting Boost Molecular Property Prediction
Recent studies in Machine Learning (ML) for biological research focus on investigating molecular properties to accelerate drug discovery. However, limited labeled molecular data often hampers the performance of ML models. A common strategy to mitigate data scarcity is leveraging auxiliary learning tasks to provide additional supervision, but selecting effective auxiliary tasks requires substantial domain expertise and manual effort, and their inclusion does not always guarantee performance gains. To overcome these challenges, we introduce Automatic Auxiliary Task Selection (AUTAUT), a fully automated framework that seamlessly retrieves auxiliary tasks using large language models and adaptively integrates them through a novel gradient alignment weighting mechanism. By automatically emphasizing auxiliary tasks aligned with the primary objective, AUTAUT significantly enhances predictive accuracy while reducing negative impacts from irrelevant tasks. Extensive evaluations demonstrate that AUTAUT outperforms 10 auxiliary task-based approaches and 18 advanced molecular property prediction models.
Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking
Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets--including WebQSPand CWQ--we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-inthe-loop framework that systematically resolves these pitfalls. KGQAGencombines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a 10K-scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation 1.
Text-to-Code Generation for Modular Building Layouts in Building Information Modeling
We present Text2MBL, a text-to-code generation framework that generates executable Building Information Modeling (BIM) code directly from textual descriptions of modular building layout (MBL) design. Unlike conventional layout generation approaches that operate in 2D space, Text2MBL produces fully parametric, semantically rich BIM layouts through on-the-fly code instantiation. To address MBLs' unique challenges due to their hierarchical three-tier structure: modules (physical building blocks), units (self-contained dwellings), and rooms (functional spaces), we developed an object-oriented code architecture and fine-tuned large language models to output structured action sequences in code format. To train and evaluate the framework, we curated a dataset of paired descriptions and ground truth layouts drawn from real-world modular housing projects. Performance was assessed using metrics for executable validity, semantic fidelity, and geometric consistency. By tightly unifying natural language understanding with BIM code generation, Text2MBL establishes a scalable pipeline from high-level conceptual design to automation-ready modular construction workflows.
Do Large Language Models Really
Recent studies have demonstrated the feasibility of modeling single-cell data as natural languages and the potential of leveraging powerful large language models (LLMs) for understanding cell biology. However, a comprehensive evaluation of LLMs' performance on language-driven single-cell analysis tasks remains unexplored. Motivated by this challenge, we introduce CELLVERSE, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data and encompasses three hierarchical levels of single-cell analysis tasks: cell type annotation (cell-level), drug response prediction (drug-level), and perturbation analysis (gene-level). Going beyond this, we systematically evaluate the performance across 14 open-source and closed-source LLMs ranging 160M 671B on CELLVERSE. Remarkably, the experimental results reveal: Existing specialist models (e.g., C2S-Pythia) fail to make reasonable decisions across all sub-tasks within CELLVERSE, while generalist models such as Qwen, Llama, GPT, and DeepSeekfamily models exhibit preliminary understanding capabilities within the realm of cell biology. The performance of current LLMs falls short of expectations and has substantial room for improvement. Notably, in the widely studied drug response prediction task, none of the evaluated LLMs demonstrate significant performance improvement over random guessing. CELLVERSE offers the first large-scale empirical demonstration that significant challenges still remain in applying LLMs to cell biology. By introducing CELLVERSE, we lay the foundation for advancing cell biology through natural languages and hope this paradigm could facilitate next-generation single-cell analysis.
MultiNet: Adaptive Multi-Viewed Subgraph Convolutional Networks for Graph Classification
The problem of over-smoothing has emerged as a fundamental issue for Graph Convolutional Networks (GCNs). While existing efforts primarily focus on enhancing the discriminability of node representations for node classification, they tend to overlook the over-smoothing at the graph level, significantly influencing the performance of graph classification. In this paper, we provide an explanation of the graph-level over-smoothing phenomenon and propose a novel Adaptive MultiViewed Subgraph Convolutional Network (MultiNet) to address this challenge. Specifically, the MultiNet introduces a local subgraph convolution module that adaptively divides each input graph into multiple subgraph views. Then a number of subgraph-based view-specific convolution operations are applied to constrain the extent of node information propagation over the original global graph structure, not only mitigating the over-smoothing issue but also generating more discriminative local node representations. Moreover, we develop an alignment-based readout that establishes correspondences between nodes over different graphs, thereby effectively preserving the local node-level structure information and improving the discriminative ability of the resulting graph-level representations. Theoretical analysis and empirical studies show that the MultiNet mitigates the graph-level over-smoothing and achieves excellent performance for graph classification.