visualization
nvBench 2.0: Resolving Ambiguity in Text-to-Visualization through Stepwise Reasoning
Text-to-Visualization (Text2VIS) enables users to create visualizations from natural language queries, making data insights more accessible. However, Text2VIS faces challenges in interpreting ambiguous queries, as users often express their visualization needs in imprecise language. To address this challenge, we introduce nvBench 2.0, a new benchmark designed to evaluate Text2VIS systems in scenarios involving ambiguous queries.
SimSort: AData-Driven Framework for Spike Sorting by Large-Scale Electrophysiology Simulation
Spike sorting is an essential process in neural recording, which identifies and separates electrical signals from individual neurons recorded by electrodes in the brain, enabling researchers to study how specific neurons communicate and process information. Although there exist a number of spike sorting methods which have contributed to significant neuroscientific breakthroughs, many are heuristically designed, making it challenging to verify their correctness due to the difficulty of obtaining ground truth labels from real-world neural recordings. In this work, we explore a data-driven, deep learning-based approach. We begin by creating a largescale dataset through electrophysiology simulations using biologically realistic computational models.
Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data
Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains. In practical scenarios, existing methods based on modeling the mixture proportions of data composition often struggle with data whose domain labels are missing, imprecise or nonnormalized, while methods based on data selection usually encounter difficulties in balancing multi-domain performance. To address these challenges, in this work, we investigate the role of data diversity in enhancing the overall abilities of LLMs by empirically constructing contrastive data pools and theoretically deriving explanations. Building upon the insights gained, we propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data. Extensive experiments show that the proposed method notably boosts performance across domain-undetermined data and a series of foundational downstream tasks when applied to various advanced LLMs. We release our code and hope this study can shed light on the understanding of data diversity and advance feedback-driven data-model co-design for LLMs.
9 Claude tips and tricks to get more out of the AI chatbot
More information Adding us as a Preferred Source in Google by using this link indicates that you would like to see more of our content in Google News results. You can access Claude on mobile and on the desktop. Breakthroughs, discoveries, and DIY tips sent six days a week. By signing up, you confirm you are 16+, will receive newsletters and promotional content and agree to our Terms of Use and acknowledge the data practices in our Privacy Policy . If you're discussing the best AI chatbots, Anthropic's Claude is going to get a mention.
Random Forest Autoencoders for Guided Representation Learning
Extensive research has produced robust methods for unsupervised data visualization. Yet supervised visualization--where expert labels guide representations--remains underexplored, as most supervised approaches prioritize classification over visualization. Recently, RF-PHATE, a diffusion-based manifold learning method leveraging random forests and information geometry, marked significant progress in supervised visualization.
TopER: Topological Embeddings in Graph Representation Learning
Graph embeddings play a critical role in graph representation learning, allowing machine learning models to explore and interpret graph-structured data. However, existing methods often rely on opaque, high-dimensional embeddings, limiting interpretability and practical visualization. In this work, we introduce Topological Evolution Rate (TopER), a novel, lowdimensional embedding approach grounded in topological data analysis.
gAttention Sinks: A ' Catch, Tag, Release ' Mechanism for Embeddings
Large language models (LLMs) often concentrate their attention on a few specific tokens referred to as attention sinks. Common examples include the first token, a prompt-independent sink, and punctuation tokens, which are prompt-dependent. While the tokens causing the sinks often lack direct semantic meaning, the presence of the sinks is critical for model performance, particularly under model compression and KV-caching. Despite their ubiquity, the function, semantic role, and origin of attention sinks--especially those beyond the first token--remain poorly understood. In this work, we conduct a comprehensive investigation demonstrating that attention sinks: catch a sequence of tokens, tag them using a common direction in embedding space, and release them back into the residual stream, where tokens are later retrieved based on the tags they have acquired. Probing experiments reveal these tags carry semantically meaningful information, such as the truth of a statement. These findings extend to reasoning models, where the mechanism spans more heads and explains greater variance in embeddings, or recent models with querykey normalization, where sinks remain just as prevalent. To encourage future theoretical analysis, we introduce a minimal problem which can be solved through the'catch, tag, release' mechanism, and where it emerges through training.
World Central Banks (Supplementary Material)
Kaggle3 Publishing our dataset on Kaggle offers distinct advantages over platforms like HuggingFace, par-4 ticularly in terms of usability and community engagement. Kaggle provides integrated tools for data5 visualization, version control, and collaborative discussion, streamlining the research workflow. Unlike Hug-7 gingFace, which is primarily model-focused, Kaggle is optimized for dataset-driven experimenta-8 tion, making it a more practical platform for sharing, validating, and improving data-centric work.9 Hosting the dataset on Kaggle thus ensures greater transparency, accessibility, and impact across10 both academic and applied research communities.11 Website12 Our World Central Banks website offers a structured and accessible overview of our research. It13 features a task-specific model leaderboard with direct links to download and explore the best per-14 forming model.
TRACE: Contrastive learning for multi-trial time-series data in neuroscience
Modern neural recording techniques such as two-photon imaging or Neuropixel probes allow to acquire vast time-series datasets with responses of hundreds or thousands of neurons. Contrastive learning is a powerful self-supervised framework for learning representations of complex datasets. Existing applications for neural time series rely on generic data augmentations and do not exploit the multi-trial data structure inherent in many neural datasets. Here we present TRACE, a new contrastive learning framework that averages across different subsets of trials to generate positive pairs. TRACE allows to directly learn a two-dimensional embedding, combining ideas from contrastive learning and neighbor embeddings. We show that TRACE outperforms other methods, resolving fine response differences in simulated data. Further, using in vivo recordings, we show that the representations learned by TRACE capture both biologically relevant continuous variation, cell-type-related cluster structure, and can assist data quality control.
PartNeXt: ANext-Generation Dataset for Fine-Grained and Hierarchical 3DPart Understanding
Understanding objects at the level of their constituent parts is fundamental to advancing computer vision, graphics, and robotics. While datasets like PartNet have driven progress in 3D part understanding, their reliance on untextured geometries and expert-dependent annotation limits scalability and usability. We introduce PartNeXt, a next-generation dataset addressing these gaps with over 23,000 highquality, textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. We benchmark PartNeXt on two tasks: (1) class-agnostic part segmentation, where state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with fine-grained and leaf-level parts, and (2) 3D part-centric question answering, a new benchmark for 3D-LLMs that reveals significant gaps in open-vocabulary part grounding. Additionally, training Point-SAM on PartNeXt yields substantial gains over PartNet, underscoring the dataset's superior quality and diversity.