Goto

Collaborating Authors

 Technology


I Found 22 Early Prime Day Deals That Are Worth Shopping Now

WIRED

We've trawled the depths of Amazon to find the best deals on gear we've tested. Amazon Prime Day is just around the corner. This year the deals officially kick off June 23 and ru through midnight Friday, June 26, but there are already some good early deals going on. Whether you need a new laptop, an Alexa speaker, or some noise-canceling earbuds, you can shop today. Microsoft just announced an update to the Surface line, and yes, it'll be slightly faster, but it's also significantly more expensive, especially with this deal happening now.


59d2eaa5842fa641ff9b8e4c7ff0f6ee-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

While text-to-image models like GPT-4o-Image and FLUX are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges frequently undergo inadequate evaluation of their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-BENCH, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across six key perspectives: alignment, safety, image quality, bias, composition, and visualization. Specifically, we evaluate a large variety of multimodal judges including smaller-sized CLIP-based scoring models, open-source VLMs, and close-source VLMs on each decomposed subcategory of our preference dataset. Experiments reveal that close-source VLMs generally provide better feedback, with GPT-4o outperforming other judges in average. Compared with open-source VLMs, smaller-sized scoring models can provide better feedback regarding textimage alignment and image quality, while VLMs provide more accurate feedback regarding safety and generation bias due to their stronger reasoning capabilities. Further studies in feedback scale reveal that VLM judges can generally provide more accurate and stable feedback in natural language than numerical scales. Notably, human evaluations on end-to-end fine-tuned models using separate feedback from these multimodal judges provide similar conclusions, further confirming the effectiveness of MJ-BENCH.


No-Regret Online Autobidding Algorithms in First-price Auctions

Neural Information Processing Systems

ROI constraints and budget constraints, is widely adopted by advertisers. A key challenge lies in designing algorithms for non-truthful mechanisms with ROI constraints. While prior work has addressed truthful auctions or non-truthful auctions with weaker benchmarks, this paper provides a significant improvement: We develop online bidding algorithms for repeated first-price auctions with ROI constraints, benchmarking against the optimal randomized strategy in hindsight. In the full feedback setting, where the maximum competing bid is observed, our algorithm achieves a near-optimal eO( T)regret bound, and in the bandit feedback setting (where the bidder only observes whether the bidder wins each auction), our algorithm attains eO(T3/4)regret bound.


Eluder dimension: localise it!

Neural Information Processing Systems

We establish a lower bound on the eluder dimension in generalised linear model classes, showing that standard eluder dimension-based analysis cannot lead to first-order regret bounds. To address this, we introduce a localisation method for the eluder dimension; our analysis immediately recovers and improves on classic results for Bernoulli bandits, and allows for the first genuine first-order bounds for finite-horizon reinforcement learning tasks with bounded cumulative returns.


Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models

Neural Information Processing Systems

Previous methods for image geo-localization have typically treated the task as either classification or retrieval, often relying on black-box decisions that lack interpretability. The rise of large vision-language models (LVLMs) has enabled a rethinking of geo-localization as a reasoning-driven task grounded in visual cues. However, two major challenges persist. On the data side, existing reasoningfocused datasets are primarily based on street-view imagery, offering limited scene diversity and constrained viewpoints. On the modeling side, current approaches predominantly rely on supervised fine-tuning, which yields only marginal improvements in reasoning capabilities. To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, MP16Reason, using diverse social media images. We introduce GLOBE, Group-relative policy optimization for Localizability assessment and Optimized visual-cue reasoning, yielding Bi-objective geo-Enhancement for the VLM in recognition and reasoning. GLOBE incorporates task-specific rewards that jointly enhance localizability assessment, visual-cue reasoning, and geolocation accuracy. Both qualitative and quantitative results demonstrate that GLOBE outperforms state-of-the-art opensource LVLMs on geo-localization tasks, particularly in diverse visual scenes, while also generating more insightful and interpretable reasoning trajectories.


See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model

Neural Information Processing Systems

We introduce SEE&TREK, the first training-free prompting framework tailored to enhance the spatial understanding of Multimodal Large Language Models (MLLMS) under vision-only constraints. While prior efforts have incorporated modalities like depth or point clouds to improve spatial reasoning, purely visualspatial understanding remains underexplored.


The Download: a reality check for geoengineering and the science of interoception

MIT Technology Review

Plus: SpaceX is now valued higher than Amazon. Solar geoengineering, the controversial idea that we could deliberately intervene in the climate system to counteract global warming, is moving beyond computer simulations and into the practical engineering challenges required to make it real. Researchers are now working on aircraft, materials, and other systems for solar geoengineering. But as they delve into these details, they're finding that even early deployment would require significant new infrastructure, time, and investment. Find out what happens when solar geoengineering encounters the realities of trying to cool the planet . Scientists have a word for how we sense ourselves from the inside: interoception.


Continual Gaussian Mixture Distribution Modeling for Class Incremental Semantic Segmentation

Neural Information Processing Systems

Class incremental semantic segmentation (CISS) enables a model to continually segment new classes from non-stationary data while preserving previously learned knowledge. Recent top-performing approaches are prototype-based methods that assign a prototype to each learned class to reproduce previous knowledge. However, modeling each class distribution relying on only a single prototype, which remains fixed throughout the incremental process, presents two key limitations: (i) a single prototype is insufficient to accurately represent the complete class distribution when incoming data stream for a class is naturally multimodal; (ii) the features of old classes may exhibit anisotropy during the incremental process, preventing fixed prototypes from faithfully reproducing the matched distribution. To address the aforementioned limitations, we propose a Continual Gaussian Mixture Distribution (CoGaMiD) modeling method. Specifically, the means and covariance matrices of the Gaussian Mixture Models (GMMs) are estimated to model the complete feature distributions of learned classes.


5975754c7650dfee0682e06e1fec0522-Paper-Conference.pdf

Neural Information Processing Systems

Predicting the 3D conformation of small molecules within protein binding sites is a key challenge in drug design. When a crystallized reference ligand (template) is available, it provides geometric priors that can guide 3D pose prediction. We present a two-stage method for ligand conformation generation guided by such templates. In the first stage, we introduce a molecular alignment approach based on flow-matching to generate 3D coordinates for the ligand, using the template structure as a reference. In the second stage, a differentiable pose optimization procedure refines this conformation based on shape and pharmacophore similarities, internal energy, and, optionally, the protein binding pocket. We introduce a new benchmark of ligand pairs co-crystallized with the same target to evaluate our approach and show that it outperforms standard docking tools and open-access alignment methods, especially in cases involving low similarity to the template or high ligand flexibility.


Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

Neural Information Processing Systems

Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the Hierarchical Semantic-Augmented Navigation (HSAN) framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations--from objects to regions to zones--enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich's duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.