AITopics | supervision

supervision

Dynamic Bundling with Large Language Models for Zero-Shot Inference on Text-Attributed Graphs

Neural Information Processing Systems, Jun-23-2026, 12:24:30 GMT

Large language models (LLMs) have been used in many zero-shot learning problems, owing to their strong generalization ability. Recently, adopting LLMs in text-attributed graphs (TAGs) has drawn increasing attention. However, the adoption of LLMs faces two major challenges: limited information on graph structure and unreliable responses. LLMs struggle with text attributes isolated from the graph topology. Worse still, they yield unreliable predictions due to both information insufficiency and the inherent weakness of LLMs (e.g., hallucination). To this end, this paper proposes a novel method named Dynamic Text Bundling Supervision (DENSE) that queries LLMs with bundles of texts to obtain bundle-level labels and uses these labels to supervise graph neural networks.
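
The bundle-level querying idea in this abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the consecutive chunking in `make_bundles`, the `keyword_stub` function standing in for a real LLM call, and the label names. The abstract does not say how bundles are formed or how the LLM is prompted, so this should not be read as the DENSE procedure itself.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def make_bundles(node_ids: Sequence[int], bundle_size: int) -> List[List[int]]:
    """Split node ids into consecutive bundles of at most bundle_size nodes.

    Assumption: consecutive chunking is a placeholder for the paper's
    actual bundling rule, which the abstract does not describe.
    """
    if bundle_size <= 0:
        raise ValueError("bundle_size must be positive")
    return [list(node_ids[i:i + bundle_size])
            for i in range(0, len(node_ids), bundle_size)]


def label_bundles(
    bundles: List[List[int]],
    texts: Dict[int, str],
    query_llm: Callable[[List[str]], str],
) -> List[str]:
    """Ask the (stand-in) LLM for one label per bundle of node texts."""
    return [query_llm([texts[n] for n in bundle]) for bundle in bundles]


def keyword_stub(bundle_texts: List[str]) -> str:
    """Hypothetical stand-in for an LLM call: majority vote of a keyword rule."""
    votes = Counter(
        "ml" if "learning" in text.lower() else "other" for text in bundle_texts
    )
    return votes.most_common(1)[0][0]


texts = {
    0: "Deep learning for graphs",
    1: "Learning to rank",
    2: "A history of chess",
    3: "Cooking pasta",
    4: "Gardening tips",
}
bundles = make_bundles(sorted(texts), bundle_size=3)
bundle_labels = label_bundles(bundles, texts, keyword_stub)
```

Each entry of `bundle_labels` is one coarse label for a whole bundle, which is the kind of signal the abstract says is then used to supervise a graph neural network.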

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry:

Education (0.48)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects

Neural Information Processing Systems, Jun-23-2026, 07:04:00 GMT

Free-moving object reconstruction from monocular video remains challenging, particularly without reliable pose or depth cues and under arbitrary object motion.

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Robots (0.67)
(3 more...)

Stitch and Tell Data Augmentation Method for Spatial Understanding

Neural Information Processing Systems, Jun-23-2026, 06:23:01 GMT

Existing vision-language models often suffer from spatial hallucinations, i.e., generating incorrect descriptions about the relative positions of objects in an image. We argue that this problem mainly stems from the asymmetric properties between images and text. To enrich the spatial understanding ability of vision-language models, we propose a simple, annotation-free, plug-and-play method named Stitch and Tell (abbreviated as SiTe), which injects structured spatial supervision into multimodal data. It constructs stitched image-text pairs by stitching images along a spatial axis and generating spatially-aware captions or question-answer pairs based on the layout of the stitched image, without relying on costly advanced models or human involvement. We evaluate SiTe across three architectures including LLaVA-v1.5-7B,
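
The stitching step described in this abstract can be sketched on toy data. This is an illustrative sketch only: images are represented as nested lists of pixel values, and the caption template in `spatial_caption` is a hypothetical choice, not the wording used by SiTe.

```python
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: a list of pixel rows


def stitch_horizontal(left: Image, right: Image) -> Image:
    """Place two equal-height images side by side along the horizontal axis."""
    if len(left) != len(right):
        raise ValueError("images must have the same height")
    return [row_l + row_r for row_l, row_r in zip(left, right)]


def spatial_caption(left_caption: str, right_caption: str) -> str:
    """Compose a caption encoding the known layout (hypothetical template)."""
    return f"On the left: {left_caption}. On the right: {right_caption}."


def stitch_pair(
    left: Tuple[Image, str], right: Tuple[Image, str]
) -> Tuple[Image, str]:
    """Build one stitched image-text pair from two captioned images."""
    (img_l, cap_l), (img_r, cap_r) = left, right
    return stitch_horizontal(img_l, img_r), spatial_caption(cap_l, cap_r)


cat = ([[1, 1], [1, 1]], "a cat")
dog = ([[2], [2]], "a dog")
stitched_image, stitched_caption = stitch_pair(cat, dog)
```

Because the layout is fixed by construction, the spatial relation in the caption is correct without any annotation, which is the property the abstract highlights.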

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States (0.67)
Asia (0.67)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.67)

Industry: Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.95)
(2 more...)

Shortcuts and Identifiability in Concept-based Models from a Neuro-Symbolic Lens

Neural Information Processing Systems, Jun-23-2026, 04:10:17 GMT

Concept-based Models are neural networks that learn a concept extractor to map inputs to high-level concepts and an inference layer to translate these into predictions. Ensuring these modules produce interpretable concepts and behave reliably in out-of-distribution settings is crucial, yet the conditions for achieving this remain unclear. We study this problem by establishing a novel connection between Concept-based Models and reasoning shortcuts (RSs), a common issue where models achieve high accuracy by learning low-quality concepts, even when the inference layer is fixed and provided upfront. Specifically, we extend RSs to the more complex setting of Concept-based Models and derive theoretical conditions for identifying both the concepts and the inference layer. Our empirical results highlight the impact of RSs and show that existing methods, even combined with multiple natural mitigation strategies, often fail to meet these conditions in practice.

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country: Europe (0.28)

Genre:

Research Report > Experimental Study (0.67)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

OmniZoom: A Universal Plug-and-Play Paradigm for Cross-Device Smooth Zoom Interpolation

Neural Information Processing Systems, Jun-23-2026, 01:09:26 GMT

Dual-camera smartphones suffer from geometric and photometric inconsistencies during zoom transitions, primarily due to disparities in intrinsic/extrinsic parameters and divergent image processing pipelines between the two cameras. Existing interpolation methods struggle to effectively address this issue, constrained by the lack of ground-truth datasets and motion ambiguity in dynamic scenarios. To overcome these challenges, we propose OmniZoom, a universal plug-and-play paradigm for cross-device smooth zoom interpolation. Specifically, we present a novel cross-device virtual data generation method utilizing 3D Gaussian Splatting. This method tackles data scarcity by decoupling geometric features via spatial transition modeling and correcting photometric variations with dynamic color adaptation. It is further enhanced by cross-domain consistency learning for device-agnostic semantic alignment.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

Europe (0.28)
North America > United States (0.28)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
(4 more...)

Distil-E2D: Distilling Image-to-Depth Priors for Event-Based Monocular Depth Estimation

Neural Information Processing Systems, Jun-22-2026, 05:28:03 GMT

Event cameras are neuromorphic vision sensors that asynchronously capture pixel-level intensity changes with high temporal resolution and dynamic range. These properties make them well suited for monocular depth estimation under challenging lighting conditions. However, progress in event-based monocular depth estimation remains constrained by the quality of supervision: LiDAR-based depth labels are inherently sparse, spatially incomplete, and prone to artifacts. Consequently, these signals are suboptimal for learning dense depth from sparse events. To address this problem, we propose Distil-E2D, a framework that distills depth priors from the image domain into the event domain by generating dense synthetic pseudolabels from co-recorded APS or RGB frames using foundational depth models. These pseudolabels complement sparse LiDAR depths with dense, semantically rich supervision informed by large-scale image-depth datasets. To reconcile discrepancies between synthetic and real depths, we introduce a Confidence-Guided Calibrated Depth Loss that learns nonlinear depth alignment and adaptively weights supervision by alignment confidence. Additionally, our architecture integrates past predictions via a Context Transformer and employs a Dual-Decoder Training scheme that enhances encoder representations by jointly learning metric and relative depth abstractions. Experiments on benchmark datasets show that Distil-E2D achieves state-of-the-art performance in event-based monocular depth estimation across both event-only and event+APS settings.

artificial intelligence, image understanding, machine learning, (19 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Overview (0.67)

Industry: Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision > Image Understanding (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

UniMotion: A Unified Motion Framework for Simulation, Prediction and Planning

Neural Information Processing Systems, Jun-21-2026, 21:50:31 GMT

Motion simulation, prediction and planning are foundational tasks in autonomous driving, each essential for modeling and reasoning about dynamic traffic scenarios. While often addressed in isolation due to their differing objectives, such as generating diverse motion states or estimating optimal trajectories, these tasks inherently depend on shared capabilities: understanding multi-agent interactions, modeling motion behaviors, and reasoning over temporal and spatial dynamics. Despite this underlying commonality, existing approaches typically adopt specialized model designs, which hinders cross-task generalization and system scalability. More critically, this separation overlooks the potential mutual benefits among tasks. Motivated by these observations, we propose UniMotion, a unified motion framework that captures shared structures across motion tasks while accommodating their individual requirements. Built on a decoder-only Transformer architecture, UniMotion employs dedicated interaction modes and tailored training strategies to simultaneously support these motion tasks. This unified design not only enables joint optimization and representation sharing but also allows for targeted fine-tuning to specialize in individual tasks when needed. Extensive experiments on the Waymo Open Motion Dataset demonstrate that joint training leads to robust generalization and effective task integration. With further fine-tuning, UniMotion achieves state-of-the-art performance across a range of motion tasks, establishing it as a versatile and scalable solution for autonomous driving.

machine learning, natural language, prediction, (21 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology (0.70)
Transportation > Ground > Road (0.56)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Balanced Conic Rectified Flow

Neural Information Processing Systems, Jun-21-2026, 14:42:36 GMT

Rectified flow is a generative model that learns smooth transport mappings between two distributions through an ordinary differential equation (ODE). The model learns a straight ODE by reflow steps which iteratively update the supervisory flow. It allows for a relatively simple and efficient generation of high-quality images. However, rectified flow still faces several challenges. 1) The reflow process is slow because it requires a large number of generated pairs to model the target distribution.
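
The straight-path idea behind rectified flow, as summarized above, can be shown with a toy one-dimensional sketch. It makes several assumptions: samples are scalars, the velocity field is supplied in closed form rather than learned by a network, and explicit Euler is used as the ODE solver. It covers only the basic straight-line interpolation and says nothing about the balanced conic variant this paper proposes.

```python
from typing import Callable


def interpolate(x0: float, x1: float, t: float) -> float:
    """Point at time t in [0, 1] on the straight line from x0 to x1."""
    return (1.0 - t) * x0 + t * x1


def velocity_target(x0: float, x1: float) -> float:
    """Regression target for the velocity along a straight path: x1 - x0."""
    return x1 - x0


def euler_sample(
    x0: float, velocity: Callable[[float, float], float], steps: int
) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with explicit Euler."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    dt = 1.0 / steps
    x = x0
    for k in range(steps):
        x += dt * velocity(x, k * dt)
    return x


# Toy pair (x0, x1): the ideal velocity is constant along the straight path.
x0, x1 = 1.0, 3.0
v = velocity_target(x0, x1)
one_step = euler_sample(x0, lambda x, t: v, steps=1)
four_steps = euler_sample(x0, lambda x, t: v, steps=4)
```

When the velocity is constant along the path, a single Euler step lands on the same endpoint as four steps, which is why straighter flows permit sampling with few solver steps.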

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: North America > Canada (0.28)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

SuperCLIP: CLIP with Simple Classification Supervision

Neural Information Processing Systems, Jun-21-2026, 14:16:54 GMT

Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP's training objective, which optimizes only global image-text similarity and overlooks token-level supervision, limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment, with just a 0.077% increase in total FLOPs and no need for additional annotated data. Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. These gains hold regardless of whether the model is trained on original web data or rich re-captioned data, demonstrating SuperCLIP's ability to recover textual supervision in both cases. Furthermore, SuperCLIP alleviates CLIP's small-batch performance drop through classification-based supervision that avoids reliance on large batch sizes.

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country: Europe (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation

Neural Information Processing Systems, Jun-21-2026, 08:47:10 GMT

Multimodal Dataset Distillation (MDD) seeks to condense large-scale image-text datasets into compact surrogates while retaining their effectiveness for cross-modal learning. Despite recent progress, existing MDD approaches often suffer from Modality Collapse, characterized by over-concentrated intra-modal representations and an enlarged distributional gap across modalities. In this paper, for the first time, we identify this issue as stemming from a fundamental conflict between the over-compression behavior inherent in dataset distillation and the cross-modal supervision imposed by contrastive objectives. To alleviate modality collapse, we introduce RepBlend, a novel MDD framework that weakens overdominant cross-modal supervision via representation blending, thereby significantly enhancing intra-modal diversity. Additionally, we observe that current MDD methods impose asymmetric supervision across modalities, resulting in biased optimization. To address this, we propose symmetric projection trajectory matching, which synchronizes the optimization dynamics using modality-specific projection heads, thereby promoting balanced supervision and enhancing cross-modal alignment. Experiments on Flickr-30K and MS-COCO show that RepBlend consistently outperforms prior state-of-the-art MDD methods, achieving significant gains in retrieval performance (e.g., +9.4 IR@10, +6.3 TR@10 under the 100-pair setting) and offering up to 6.7× distillation speedup.

distillation, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: Asia (0.68)

Genre: