Goto

Collaborating Authors

 Industry


RFMPose: Generative Category-level Object Pose Estimation via Riemannian Flow Matching

Neural Information Processing Systems

We introduce RFMPose, a novel generative framework for category-level 6D object pose estimation that learns deterministic pose trajectories through Riemannian Flow Matching (RFM). Existing discriminative approaches struggle with multihypothesis predictions (e.g., symmetry ambiguities) and often require specialized network architectures. RFMPose advances this paradigm through three key innovations: (1) Ensuring geometric consistency via geodesic interpolation on Riemannian manifolds combined with bi-invariant metric constraints; (2) Alleviating symmetryinduced ambiguities through Riemannian Optimal Transport for probability mass redistribution without ad-hoc design; (3) Enabling end-to-end likelihood estimation through Hutchinson trace approximation, thereby eliminating auxiliary model dependencies. Extensive experiments on the Omni6DPose demonstrate state-ofthe-art performance of the proposed method, with significant improvements of +4.1 in IoU25 and +2.4 in 5 2cm metrics compared to prior generative approaches. Furthermore, the proposed RFM framework exhibits robust sim-to-real transfer capabilities and facilitates pose tracking extensions with minimal architectural adaptation.


Why You're Seeing a PA or NP--But Not a Doctor

TIME - Tech

Follow this section to personalize your feed and get instant alerts. Follow Go to your personalized feed WHY FOLLOW? Smart Alerts: Get notified about major news as it happens. Follow this tag to personalize your feed and get instant alerts. Follow Go to your personalized feed WHY FOLLOW?


C-LoRA: Contextual Low-Rank Adaptation for Uncertainty Estimation in Large Language Models

Neural Information Processing Systems

Low-Rank Adaptation (LoRA) offers a cost-effective solution for fine-tuning large language models (LLMs), but it often produces overconfident predictions in datascarce few-shot settings. To address this issue, several classical statistical learning approaches have been repurposed for scalable uncertainty-aware LoRA fine-tuning. However, these approaches neglect how input characteristics affect the predictive uncertainty estimates. To address this limitation, we propose Contextual Low-Rank Adaptation (C-LoRA) as a novel uncertainty-aware and parameter efficient finetuning approach, by developing new lightweight LoRA modules contextualized to each input data sample to dynamically adapt uncertainty estimates. Incorporating data-driven contexts into the parameter posteriors, C-LoRA mitigates overfitting, achieves well-calibrated uncertainties, and yields robust predictions.


No, I Don't Want to Watch Your Straight Hockey Show

WIRED

From Amazon's to Netflix's upcoming, the recent spate of hetero hockey romances shows Hollywood learned the wrong lessons from The streaming industry has gotten a lot of flak over the past few years, but there is one thing that Hollywood studios are undeniably good at: recycling the same idea, over and over and over again until the world ends (or until everyone finally decides they're sick of, whichever comes first). This tried-and-true formula is now playing out in real time with Prime Video's and Netflix's upcoming series Icebreaker shows that, like are hockey-themed romances about polar opposites who just can't seem to keep their hands off each other. But there's one key difference: and are about heterosexual romances, while is about a secret gay relationship. And considering how much queerness played a role in's explosive popularity, it seems like the clamor for straight horny hockey content is another example of Hollywood just not getting the message. The forthcoming which Netflix announced this week, is about a figure skater who falls in love with a hockey player after they're forced to practice on the same rink.


FINERS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

Neural Information Processing Systems

Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities across a wide range of vision-language tasks. However, due to the restricted input resolutions, MLLMs face significant challenges in precisely understanding and localizing visual details in high-resolution images--particularly when dealing with extra-small objects embedded in cluttered contexts. To address this issue, we propose FINERS, a two-stage MLLM-based reinforcement learning framework for jointly reasoning and segmenting extremely small objects within high-resolution scenes. FINERS adopts a coarse-to-fine pipeline comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR). Specifically, GSE performs instruction-guided reasoning to generate a textural response and a coarse target region, while LPR refines this region to produce an accurate bounding box and segmentation mask. To couple the two stages, we introduce a locate-informed retrospective reward, where LPR's outputs are used to optimize GSE for more robust coarse region exploration.


Right Question is Already Half the Answer: Fully Unsupervised LLMReasoning Incentivization

Neural Information Processing Systems

Existing methods to enhance the reasoning capability of large language models predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data. These approaches critically depend on external supervisions-such as labeled reasoning traces, verified golden answers, or pre-trained reward models. In this work, we propose Entropy Minimized Policy Optimization (EMPO), which makes an early attempt at fully unsupervised LLM reasoning incentivization. By minimizing the semantic entropy of LLMs on unlabeled questions, EMPO achieves competitive performance compared to supervised counterparts. Specifically, without any external supervision, EMPO boosts the accuracy of Qwen2.5-Math-7BBase from 33.7% to 51.6% on math benchmarks and improves the accuracy of Qwen2.5-7BBase from 32.1% to 50.1% on MMLU-Pro. Primary analysis are also provided to interpret the effectiveness of EMPO.


FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Neural Information Processing Systems

Vision-Language-Action (VLA) models are increasingly used for end-to-end driving due to their world knowledge and reasoning ability. Most prior work, however, inserts textual chains-of-thought (CoT) as intermediate steps tailored to the current scene. Such symbolic compressions can blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. The model first acts as a world model to generate a unified future frame that overlays coarse but physically-plausible priors--future lane dividers and 3D boxes--on the predicted future image. This unified frame serves as the visual CoT, capturing both spatial structure and temporal evolution.


3DPE-Gaze: Unlocking the Potential of 3DFacial Priors for Generalized Gaze Estimation

Neural Information Processing Systems

In recent years, face-based deep-learning gaze estimation methods have achieved significant advancements. However, while face images provide supplementary information beneficial for gaze inference, the substantial extraneous information they contain also increases the risk of overfitting during model training and compromises generalization capability. To alleviate this problem, we propose the 3DPE-Gaze framework, explicitly modeling 3D facial priors for feature decoupling and generalized gaze estimation. The 3DPE-Gaze framework consists of two core modules: the 3DGeometric Prior Module (3DGP) incorporating the FLAME model to parameterize facial structures and gaze-irrelevant facial appearances while extracting gaze features; the Semantic Concept Alignment Module (SCAM) separates gaze-related and unrelated concepts through CLIP-guided contrastive learning. Finally, the 3DPE-Gaze framework combines 3D facial landmark as prior for generalized gaze estimation. Experimental results show that 3DPE-Gaze outperforms existing state-of-the-art methods on four major cross-domain tasks, with particularly outstanding performance in challenging scenarios such as lighting variations, extreme head poses, and glasses occlusion.


See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction

Neural Information Processing Systems

Occupancy prediction aims to estimate the 3D spatial distribution of occupied regions along with their corresponding semantic labels. Existing vision-based methods perform well on daytime benchmarks but struggle in nighttime scenarios due to limited visibility and challenging lighting conditions. To address these challenges, we propose LIAR, a novel framework that learns illumination-affined representations. LIAR first introduces Selective Low-light Image Enhancement (SLLIE), which leverages the illumination priors from daytime scenes to adaptively determine whether a nighttime image is genuinely dark or sufficiently well-lit, enabling more targeted global enhancement. Building on the illumination maps generated by SLLIE, LIAR further incorporates two illumination-aware components: 2DIllumination-guided Sampling (2D-IGS) and 3DIllumination-driven Projection (3D-IDP), to respectively tackle local underexposure and overexposure. Specifically, 2D-IGS modulates feature sampling positions according to illumination maps, assigning larger offsets to darker regions and smaller ones to brighter regions, thereby alleviating feature degradation in underexposed areas. Subsequently, 3D-IDP enhances semantic understanding in overexposed regions by constructing illumination intensity fields and supplying refined residual queries to the BEV context refinement process. Extensive experiments on both real and synthetic datasets demonstrate the superior performance of LIAR under challenging nighttime scenarios. The source code and pretrained models are available here.


New Scientist recommends an excellent look at the future of work

New Scientist

Sarah O'Connor's We Are Not Machines explores how we are contorting ourselves to fit AI into our working lives - and what to do about it, finds Tom Knowles Employers wanting staff to be more like machines isn't new, says O'Connor If you are a fan of translated films, you may have noticed the subtitles on streaming platforms have changed in recent years. They aren't wrong exactly, but they can come across as a bit, well, flat. "You get the meaning, but the language? It's not as rich," Petr Čermoch, a translator in the Czech Republic, tells Sarah O'Connor in We Are Not Machines, which explores how artificial intelligence is changing the way we work. That lack of richness is usually because the streaming platform has used AI to translate a script, then had a professional translator like Čermoch finesse it.