AITopics | camera pose

OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects

Neural Information Processing Systems | Jun-23-2026, 07:04:00 GMT

Free-moving object reconstruction from monocular video remains challenging, particularly without reliable pose or depth cues and under arbitrary object motion.

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Robots (0.67)
(3 more...)

Rooms from Motion: Un-posed Indoor 3D Object Detection as Localization and Mapping

Neural Information Processing Systems | Jun-23-2026, 01:00:17 GMT

We revisit scene-level 3D object detection as the output of an object-centric framework capable of both localization and mapping using 3D oriented boxes as the underlying geometric primitive. While existing 3D object detection approaches operate globally and implicitly rely on the a priori existence of metric camera poses, our method, Rooms from Motion (RfM), operates on a collection of un-posed images. By replacing the standard 2D keypoint-based matcher of structure-from-motion with an object-centric matcher based on image-derived 3D boxes, we estimate metric camera poses and object tracks, and finally produce a global, semantic 3D object map. When a priori pose is available, we can significantly improve map quality through optimization of global 3D boxes against individual observations. RfM shows strong localization performance and subsequently produces maps of higher quality than leading point-based and multi-view 3D object detection methods on CA-1M and ScanNet++, despite these global methods relying on overparameterization through point clouds or dense volumes. Rooms from Motion achieves an object-centric representation which allows for inherently sparse localization and parametric mapping proportional to the number of objects in a scene.

artificial intelligence, detection, machine learning, (18 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

e3a0db7c0a191854c176af1d20cdec80-Supplemental-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Jun-23-2026, 00:03:15 GMT

The descriptions of each task are as follows:

Single-view tasks. Single-view tasks test a model's ability to infer spatial properties from a single image. These tasks include:
Depth estimation (OC, OO, NA): Predicting absolute or relative depth values for objects.
Distance prediction (OC, OO, NA): Estimating the Euclidean distance between objects or from an object to the camera.
Object center distance inference (OO, MCA): Given objects A, B and C, determine which of B and C is farther or closer to A.
Object spatial relation (OO, MCA): Determining relative positioning (e.g., left, right, in ...)
Spatial imagination (OC, OO, MCA): Predicting unseen spatial relationships based on limited visual information.

Multi-view tasks. Multi-view tasks require reasoning across multiple images to infer spatial relationships. These tasks include:
Viewpoint change inference (NA): Given two perspectives, output how the camera should be moved to see the second perspective.
Multi-view distance prediction (OC, OO, NA): Estimating object distances across different views.
Multi-view object matching (MCA): Identifying the same object across multiple views.

artificial intelligence, image understanding, spatial reasoning, (19 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.48)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.34)

Tracking World: World-centric Monocular 3D Tracking of Almost All Pixels

Neural Information Processing Systems | Jun-22-2026, 19:57:11 GMT

Monocular 3D tracking aims to capture the long-term motion of pixels in 3D space from a single monocular video and has witnessed rapid progress in recent years. However, we argue that the existing monocular 3D tracking methods still fall short in separating the camera motion from foreground dynamic motion and cannot densely track newly emerging dynamic subjects in the videos. To address these two limitations, we propose TrackingWorld, a novel pipeline for dense 3D tracking of almost all pixels within a world-centric 3D coordinate system. First, we introduce a tracking upsampler that efficiently lifts the arbitrary sparse 2D tracks into dense 2D tracks. Then, to generalize the current tracking methods to newly emerging objects, we apply the upsampler to all frames and reduce the redundancy of 2D tracks by eliminating the tracks in overlapped regions. Finally, we present an efficient optimization-based framework to back-project dense 2D tracks into world-centric 3D trajectories by estimating the camera poses and the 3D coordinates of these 2D tracks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our system achieves accurate and dense 3D tracking in a world-centric coordinate frame.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry:

Media > Television (0.35)
Media > Photography (0.35)
Media > Film (0.35)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Tracking Any Point in Persistent 3D Geometry

Neural Information Processing Systems | Jun-22-2026, 16:56:04 GMT

We introduce TAPIP3D, a novel approach for long-term 3D point tracking in monocular RGB and RGB-D videos. TAPIP3D represents videos as camera-stabilized spatio-temporal feature clouds, leveraging depth and camera motion information to lift 2D video features into a 3D world space where camera movement is effectively canceled out. Within this stabilized 3D representation, TAPIP3D iteratively refines multi-frame motion estimates, enabling robust point tracking over long time horizons.
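
The idea of lifting pixels into a world space where camera movement cancels out can be illustrated with a minimal sketch. This is not TAPIP3D's implementation; it only assumes a pinhole camera with known intrinsics and camera-to-world poses. The function names, the intrinsics tuple `K`, and the two example poses are all hypothetical choices made for illustration.

```python
import math

def unproject_to_world(u, v, depth, K, R_c2w, t_c2w):
    """Lift pixel (u, v) with metric depth into world coordinates."""
    fx, fy, cx, cy = K
    # Pinhole back-projection into the camera frame.
    x_cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Rigid transform camera -> world: X_w = R @ X_c + t.
    return tuple(
        sum(R_c2w[i][j] * x_cam[j] for j in range(3)) + t_c2w[i]
        for i in range(3)
    )

def project_from_world(X_w, K, R_c2w, t_c2w):
    """Project a world point to (u, v, depth) for a camera-to-world pose."""
    fx, fy, cx, cy = K
    d = [X_w[i] - t_c2w[i] for i in range(3)]
    # Inverse rigid transform: X_c = R^T @ (X_w - t).
    x_cam = [sum(R_c2w[j][i] * d[j] for j in range(3)) for i in range(3)]
    return (fx * x_cam[0] / x_cam[2] + cx,
            fy * x_cam[1] / x_cam[2] + cy,
            x_cam[2])

# Hypothetical intrinsics (fx, fy, cx, cy) and a static scene point.
K = (500.0, 500.0, 320.0, 240.0)
static_point = (1.0, 0.5, 4.0)

# Two hypothetical camera-to-world poses: identity, and a camera that has
# moved sideways and rotated about the vertical axis.
angle = 0.1
pose_a = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])
pose_b = ([[math.cos(angle), 0.0, math.sin(angle)],
           [0.0, 1.0, 0.0],
           [-math.sin(angle), 0.0, math.cos(angle)]], [0.5, 0.0, 0.0])

# The same static point lands on different pixels in the two frames...
obs_a = project_from_world(static_point, K, *pose_a)
obs_b = project_from_world(static_point, K, *pose_b)

# ...but lifts back to the same world coordinate in both.
world_a = unproject_to_world(*obs_a, K, *pose_a)
world_b = unproject_to_world(*obs_b, K, *pose_b)
```

In this sketch a static point has zero motion in world space even though its pixel location changes between frames, so any residual motion measured there would belong to the scene rather than to the camera.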

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.93)

Industry:

Media > Television (0.55)
Media > Photography (0.55)
Media > Film (0.55)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

ViDAR: Video Diffusion-Aware 4D Reconstruction From Monocular Inputs

Neural Information Processing Systems | Jun-22-2026, 13:47:58 GMT

Dynamic Novel View Synthesis aims to generate photorealistic views of moving subjects from arbitrary viewpoints. This task is particularly challenging when relying on monocular video, where disentangling structure from motion is ill-posed and supervision is scarce. We introduce Video Diffusion-Aware Reconstruction (ViDAR), a novel 4D reconstruction framework that leverages personalised image diffusion models to synthesise pseudo multi-view supervision signals for training a Gaussian splatting representation. By conditioning on scene-specific features, ViDAR recovers fine-grained appearance details while mitigating artefacts introduced by monocular ambiguity. To address the spatio-temporal inconsistency of diffusion-based supervision, we propose a diffusion-aware loss function and a camera pose optimisation strategy that aligns synthetic views with the underlying scene geometry. Experiments on DyCheck, a challenging benchmark with extreme viewpoint variation, show that ViDAR outperforms all state-of-the-art baselines in visual quality and geometric consistency. We further highlight ViDAR's strong improvement over baselines on dynamic regions and provide a new benchmark to compare performance in reconstructing motion-rich parts of the scene.

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry:

Law (0.34)
Government (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction

Neural Information Processing Systems | Jun-17-2026, 13:12:56 GMT

Geometric models like DUSt3R have shown great advances in understanding the geometry of a scene from pairs of photos. However, they fail when the inputs are from vastly different viewpoints (e.g., aerial vs. ground) or modalities (e.g., photos vs. abstract drawings) compared to what was observed during training. This paper addresses a challenging version of this problem: predicting correspondences between ground-level photos and floor plans. Current datasets for joint photo-floor plan reasoning are limited, either lacking in varying modalities (VIGOR) or lacking in correspondences (WAFFLE). To address these limitations, we introduce a new dataset, C3, created by first reconstructing a number of scenes in 3D from Internet photo collections via structure-from-motion, then manually registering the reconstructions to floor plans gathered from the Internet, from which we can derive correspondences between images and floor plans.

artificial intelligence, floor plan, machine learning, (17 more...)

Neural Information Processing Systems

Country: Europe (0.93)

Genre: Research Report > Experimental Study (1.00)

Industry: Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

NopeRoomGS: Indoor 3D Gaussian Splatting Optimization without Camera Pose Input

Neural Information Processing Systems | Jun-16-2026, 22:57:21 GMT

Recent advances in 3D Gaussian Splatting (3DGS) have enabled real-time, high-fidelity view synthesis, but remain critically dependent on camera poses estimated by Structure-from-Motion (SfM), which is notoriously unreliable in textureless indoor environments. To eliminate this dependency, recent pose-free variants have been proposed, yet they often fail under abrupt camera motion due to unstable initialization and purely photometric objectives. In this work, we introduce NopeRoomGS, an optimization framework with no need for camera pose inputs, which effectively addresses the textureless regions and abrupt camera motion in indoor room environments through a local-to-global optimization paradigm for 3DGS reconstruction. In the local stage, we propose a lightweight local neural geometric representation to bootstrap a set of reliable local 3D Gaussians for separated short video clips, regularized by multi-frame tracking constraints and foundation model depth priors. This enables reliable initialization even in textureless regions or under abrupt camera motions. In the global stage, we fuse local 3D Gaussians into a unified 3DGS representation through an alternating optimization strategy that jointly refines camera poses and Gaussian parameters, effectively mitigating gradient interference between them. Furthermore, we decompose camera pose optimization based on a piecewise planarity assumption, further enhancing robustness under abrupt camera motion.

artificial intelligence, machine learning, representation, (15 more...)

Neural Information Processing Systems

Country: Asia (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry: Media (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Graphics (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

NoPo-Avatar

Neural Information Processing Systems | Jun-16-2026, 19:15:14 GMT

We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate "ground-truth" camera poses and human poses as input to guide reconstruction at test-time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Sensing and Signal Processing > Image Processing (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

VisualSync: Multi-Camera Synchronization via Cross-View Object Motion

Neural Information Processing Systems | Jun-15-2026, 14:10:59 GMT

Today, people can easily record memorable moments, from concerts, sports events, and lectures to family gatherings and birthday parties, with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized.
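
The key insight can be illustrated with a toy sketch. This is not VisualSync's pipeline: it assumes the relative pose between the two cameras is already known (the paper works with unposed videos), searches only integer frame shifts, and uses a synthetic trajectory. All names, the pose `R`, `T`, and the constants `DT` and `TRUE_OFFSET` are hypothetical. The sketch scores each candidate shift by the mean epipolar residual |x2^T E x1| and keeps the shift with the lowest score.

```python
import math

def cross_matrix(t):
    """Skew-symmetric matrix [t]_x, so that [t]_x @ v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]

def normalize(X):
    """Perspective division to normalized homogeneous image coordinates."""
    return (X[0] / X[2], X[1] / X[2], 1.0)

def epipolar_residual(E, x1, x2):
    """|x2^T E x1|, which vanishes when x1 and x2 image the same 3D point."""
    Ex1 = matvec(E, x1)
    return abs(sum(x2[i] * Ex1[i] for i in range(3)))

# Assumed known relative pose: a point X1 in camera 1 maps to X2 = R @ X1 + T.
ANGLE = 0.2
R = [[math.cos(ANGLE), 0.0, math.sin(ANGLE)],
     [0.0, 1.0, 0.0],
     [-math.sin(ANGLE), 0.0, math.cos(ANGLE)]]
T = [-1.0, 0.0, 0.1]
E = matmul(cross_matrix(T), R)  # essential matrix E = [T]_x R

DT = 0.1         # hypothetical frame interval in seconds
TRUE_OFFSET = 3  # camera 2 starts this many frames late (demo ground truth)

def point_at(frame):
    """Synthetic moving 3D point, expressed in camera 1 coordinates."""
    s = frame * DT
    return (math.sin(s), 0.5 * math.cos(s), 5.0 + 0.3 * s)

# Camera 1 sees frame i at time i * DT; camera 2 frame k is captured at
# time (k + TRUE_OFFSET) * DT.
obs1 = [normalize(point_at(i)) for i in range(40)]
obs2 = [normalize([a + b for a, b in
                   zip(matvec(R, point_at(k + TRUE_OFFSET)), T)])
        for k in range(20)]

def sync_cost(shift):
    """Mean epipolar residual when camera 1 frame k + shift pairs with camera 2 frame k."""
    residuals = [epipolar_residual(E, obs1[k + shift], obs2[k])
                 for k in range(len(obs2))]
    return sum(residuals) / len(residuals)

# Exhaustive search over candidate integer shifts.
best_shift = min(range(11), key=sync_cost)
```

Here `best_shift` recovers `TRUE_OFFSET`, because only the correct pairing shows the same 3D point at the same instant in both cameras. A moving point is what makes the offset observable: for a static point the residual would vanish at every shift.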

artificial intelligence, machine learning, synchronization, (16 more...)

Neural Information Processing Systems

Country: Europe (0.93)

Genre: Research Report > Experimental Study (1.00)

Industry: Leisure & Entertainment (0.87)

Technology: