Gevers, Theo
LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting
Xing, Xiaoyan, Groh, Konrad, Karaoglu, Sezer, Gevers, Theo, Bhattad, Anand
We introduce LumiNet, a novel architecture that leverages generative models and latent intrinsic representations for effective lighting transfer. Given a source image and a target lighting image, LumiNet synthesizes a relit version of the source scene that captures the target's lighting. Our approach makes two key contributions: a data curation strategy based on a StyleGAN relighting model for our training, and a modified diffusion-based ControlNet that processes both latent intrinsic properties from the source image and latent extrinsic properties from the target image. We further improve lighting transfer through a learned adaptor (MLP) that injects the target's latent extrinsic properties via cross-attention and fine-tuning. Unlike a traditional ControlNet, which generates images from conditional maps of a single scene, LumiNet processes latent representations from two different images: it preserves geometry and albedo from the source while transferring lighting characteristics from the target. Experiments demonstrate that our method successfully transfers complex lighting phenomena, including specular highlights and indirect illumination, across scenes with varying spatial layouts and materials, outperforming existing approaches on challenging indoor scenes using only images as input.
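The adaptor-based conditioning lends itself to a compact sketch. The PyTorch fragment below shows one way the described pieces could fit together: an MLP adaptor that maps the target's latent extrinsic (lighting) code to cross-attention tokens, and a ControlNet-style branch that fuses that code with the source's spatial intrinsic features. All module names, dimensions, and the fusion scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (PyTorch) of the conditioning path described above.
# Module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ExtrinsicAdaptor(nn.Module):
    """MLP that maps a target latent extrinsic code to cross-attention tokens."""
    def __init__(self, extrinsic_dim=512, context_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.mlp = nn.Sequential(
            nn.Linear(extrinsic_dim, context_dim * num_tokens),
            nn.GELU(),
            nn.Linear(context_dim * num_tokens, context_dim * num_tokens),
        )

    def forward(self, z_extrinsic):              # (B, extrinsic_dim)
        tokens = self.mlp(z_extrinsic)           # (B, context_dim * num_tokens)
        return tokens.view(z_extrinsic.size(0), self.num_tokens, -1)

class RelightControl(nn.Module):
    """Fuses source intrinsics and target extrinsics into a control signal."""
    def __init__(self, intrinsic_ch=320, extrinsic_dim=512, control_ch=320):
        super().__init__()
        self.adaptor = ExtrinsicAdaptor(extrinsic_dim)
        self.fuse = nn.Conv2d(intrinsic_ch + extrinsic_dim, control_ch, 1)

    def forward(self, z_intrinsic, z_extrinsic):
        # Broadcast the global lighting code over the spatial intrinsic map.
        b, _, h, w = z_intrinsic.shape
        lighting = z_extrinsic[:, :, None, None].expand(b, -1, h, w)
        control = self.fuse(torch.cat([z_intrinsic, lighting], dim=1))
        context = self.adaptor(z_extrinsic)      # cross-attention tokens
        return control, context
```

Under this reading, the control tensor would be added to the diffusion U-Net's encoder features in the usual ControlNet fashion, while the context tokens would feed its cross-attention layers.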
MAGiC-SLAM: Multi-Agent Gaussian Globally Consistent SLAM
Yugay, Vladimir, Gevers, Theo, Oswald, Martin R.
Simultaneous localization and mapping (SLAM) systems with novel view synthesis capabilities are widely used in computer vision, with applications in augmented reality, robotics, and autonomous driving. However, existing approaches are limited to single-agent operation. Recent work has addressed this problem using a distributed neural scene representation. Unfortunately, existing methods are slow, cannot accurately render real-world data, are restricted to two agents, and have limited tracking accuracy. In contrast, we propose a rigidly deformable 3D Gaussian-based scene representation that dramatically speeds up the system. However, improving tracking accuracy and reconstructing a globally consistent map from multiple agents remains challenging due to trajectory drift and discrepancies across agents' observations. Therefore, we propose new tracking and map-merging mechanisms and integrate loop closure in the Gaussian-based SLAM pipeline. We evaluate MAGiC-SLAM on synthetic and real-world datasets and find it more accurate and faster than the state of the art.
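The "rigidly deformable" representation suggests a simple mental model for map merging: once loop closure yields a corrective rigid pose per sub-map, each agent's Gaussians are transported as a rigid body and concatenated into one globally consistent map. Below is a minimal PyTorch sketch under that reading; the per-sub-map data layout and the merge-by-concatenation step are assumptions, not the paper's exact pipeline.

```python
# A minimal sketch of rigid sub-map correction and merging.
# Data layout (per-sub-map means and covariances) is an assumption.
import torch

def transform_gaussians(means, covs, R, t):
    """Apply a rigid transform (R, t) to 3D Gaussian means and covariances."""
    new_means = means @ R.T + t          # (N, 3)
    new_covs = R @ covs @ R.T            # (N, 3, 3): Sigma' = R Sigma R^T
    return new_means, new_covs

def merge_submaps(submaps, corrected_poses):
    """Bring every agent's sub-map into a common frame and concatenate."""
    all_means, all_covs = [], []
    for (means, covs), (R, t) in zip(submaps, corrected_poses):
        m, c = transform_gaussians(means, covs, R, t)
        all_means.append(m)
        all_covs.append(c)
    return torch.cat(all_means), torch.cat(all_covs)
```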
Gaussian-SLAM: Photo-realistic Dense SLAM with Gaussian Splatting
Yugay, Vladimir, Li, Yue, Gevers, Theo, Oswald, Martin R.
We present a dense SLAM method that uses 3D Gaussian splats as a scene representation. The new representation enables interactive-time reconstruction and photo-realistic rendering of real-world and synthetic scenes. We propose novel strategies for seeding and optimizing Gaussian splats to extend their use from multiview offline scenarios to sequential monocular RGBD input data setups. In addition, we extend Gaussian splats to encode geometry and experiment with tracking against this scene representation.

Earlier works focus on tracking using various scene representations like feature point clouds [15, 26, 40], surfels [53, 71], depth maps [43, 58], or implicit representations [14, 42, 44]. Later works focused more on map quality and density. With the advent of powerful neural scene representations like neural radiance fields [38] that allow for high-fidelity view synthesis, a rapidly growing body of dense neural SLAM methods [19, 34, 51, 60, 62, 64, 81, 84] has been developed.
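A minimal sketch of the seeding idea, assuming a pinhole camera model: pixels of each incoming RGBD frame are back-projected into world space, and new Gaussians are spawned at those points with splat scales tied to depth. The stride-based sub-sampling and the scale heuristic are simplifications for illustration, not the paper's densification criteria.

```python
# Back-project an RGBD frame and seed Gaussians at the resulting 3D points.
# Pinhole model, sub-sampling, and scale heuristic are assumptions.
import torch

def seed_gaussians(rgb, depth, K, cam_to_world, stride=4):
    """rgb: (H, W, 3), depth: (H, W), K: (3, 3) intrinsics, cam_to_world: (4, 4)."""
    H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(0, H, stride), torch.arange(0, W, stride), indexing="ij")
    z = depth[ys, xs]
    valid = z > 0                                             # skip invalid depth
    x = (xs[valid] - K[0, 2]) * z[valid] / K[0, 0]
    y = (ys[valid] - K[1, 2]) * z[valid] / K[1, 1]
    pts_cam = torch.stack([x, y, z[valid]], dim=-1)           # (N, 3)
    pts_world = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    colors = rgb[ys, xs][valid]                               # (N, 3)
    scales = (z[valid] / K[0, 0])[:, None].expand(-1, 3)      # roughly pixel-sized
    return pts_world, colors, scales
```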
HaarNet: Large-scale Linear-Morphological Hybrid Network for RGB-D Semantic Segmentation
Groenendijk, Rick, Dorst, Leo, Gevers, Theo
Signals from different modalities each have their own combination algebra, which affects their sampling and processing. RGB is mostly linear; depth is a geometric signal following the operations of mathematical morphology. If a network that receives RGB-D input has both kinds of operators available in its layers, it should be able to produce effective output with fewer parameters. In this paper, morphological elements are used in conjunction with more familiar linear modules to construct a mixed linear-morphological network called HaarNet. This is the first large-scale linear-morphological hybrid, evaluated on a set of sizeable real-world datasets. In the network, morphological Haar sampling is applied to both feature channels in several layers, which separates extreme values from high-frequency information such that both can be processed to the benefit of the two modalities. Moreover, a morphologically parameterised ReLU is used, and morphologically sound up-sampling is applied to obtain a full-resolution output. Experiments show that HaarNet is competitive with a state-of-the-art CNN, implying that morphological networks are a promising research direction for geometry-based learning tasks.
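The morphological Haar sampling admits a compact illustration: a flat 2x2 max pool (a dilation) keeps extreme values, while its gap to the corresponding min pool (an erosion) isolates high-frequency detail, and both half-resolution channels can then be processed further. The flat structuring element below is a simplifying assumption, not the paper's learned parameterisation.

```python
# Morphological Haar-style down-sampling with a flat 2x2 structuring element.
import torch
import torch.nn.functional as F

def morphological_haar_sample(x):
    """x: (B, C, H, W) -> extreme-value and detail channels at half resolution."""
    dilated = F.max_pool2d(x, kernel_size=2)       # morphological dilation
    eroded = -F.max_pool2d(-x, kernel_size=2)      # erosion via negated max
    approximation = dilated                        # extreme-value channel
    detail = dilated - eroded                      # high-frequency channel
    return approximation, detail
```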
MorphPool: Efficient Non-linear Pooling & Unpooling in CNNs
Groenendijk, Rick, Dorst, Leo, Gevers, Theo
Contemporary deep learning architectures exploit pooling operations for two reasons: to filter impactful activation values from feature maps, and to reduce spatial feature size [28]. The most widely used pooling operation is the max pool, which appears in nearly all common network architectures such as ResNet [14], VGGNet [32], and DenseNet [16]. These architectures can be applied to pixel-level prediction tasks, such as semantic segmentation. To do so, inputs are down-sampled to a set of latent features of small spatial size, after which they are up-sampled to full resolution again. Up-sampling from pooled feature sets most often happens through a combination of unpooling and deconvolution [41, 42], as in seminal works such as [3, 22, 26]. As will be shown in this paper, down-sampling using max pooling can be formalised and improved using mathematical morphology, the mathematics of contact. Ever since the works of Serra [29], the underlying algebraic structure of data acquired through probing contact (e.g. LiDAR and radar) has been known to the computer vision community [5, 11, 25, 33]. It is different from the algebra of linear diffusion that underlies convolutional neural networks (CNNs).
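The connection to morphology is direct: max pooling is a dilation with a flat structuring element, so making that element a learnable per-channel tensor yields a parameterised morphological pool. The unfold-based PyTorch sketch below illustrates the idea; it is written for readability, not the optimised kernels one would use in practice.

```python
# Parameterised morphological pooling: max over a window of (x + w), where
# w is a learnable structuring element. Zero-initialising w recovers max pool.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MorphPool2d(nn.Module):
    def __init__(self, channels, kernel_size=2, stride=2):
        super().__init__()
        self.k, self.s = kernel_size, stride
        # One flat-by-default structuring element per channel.
        self.w = nn.Parameter(torch.zeros(channels, kernel_size * kernel_size))

    def forward(self, x):                                     # x: (B, C, H, W)
        patches = F.unfold(x, self.k, stride=self.s)          # (B, C*k*k, L)
        B, _, L = patches.shape
        patches = patches.view(B, x.size(1), self.k * self.k, L)
        out, _ = (patches + self.w[None, :, :, None]).max(dim=2)
        H_out = (x.size(2) - self.k) // self.s + 1
        return out.view(B, x.size(1), H_out, -1)
```

Because zeros for the structuring element reproduce ordinary max pooling exactly, such a layer can act as a drop-in replacement that only deviates where training finds it useful.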
Multi-Loss Weighting with Coefficient of Variations
Groenendijk, Rick, Karaoglu, Sezer, Gevers, Theo, Mensink, Thomas
Many interesting tasks in machine learning and computer vision are learned by optimising an objective function defined as a weighted linear combination of multiple losses. The final performance is sensitive to choosing the correct (relative) weights for these losses. Finding a good set of weights is often done by treating them as hyper-parameters and setting them via an extensive grid search, which is computationally expensive. In this paper, the weights are instead defined based on properties observed while training the model, including the specific batch loss, the average loss, and the variance of each of the losses. An additional advantage is that these weights evolve during training, rather than remaining static. In the literature, loss weighting is mostly used in a multi-task learning setting, where the different tasks obtain different weights. However, there is a plethora of single-task multi-loss problems that could benefit from automatic loss weighting. In this paper, it is shown that these multi-task approaches do not work on single tasks. Instead, a method is proposed that automatically and dynamically tunes loss weights throughout training, specifically for single-task multi-loss problems. The method incorporates a measure of uncertainty to balance the losses. The validity of the approach is shown empirically for different tasks on multiple datasets.
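The idea can be sketched in a few lines: maintain running statistics per loss and weight each loss by its coefficient of variation (standard deviation over mean), renormalised every step, so that losses with higher relative variability receive more weight. The version below tracks statistics of the raw losses rather than the loss ratios used in the paper, so treat it as a simplified illustration.

```python
# Simplified coefficient-of-variation loss weighting with running statistics.
import torch

class CoVWeighting:
    def __init__(self, num_losses, momentum=0.99, eps=1e-8):
        self.mean = torch.zeros(num_losses)
        self.var = torch.zeros(num_losses)
        self.m, self.eps = momentum, eps

    def __call__(self, losses):                    # losses: (num_losses,) tensor
        detached = losses.detach()
        # Exponential moving averages of each loss and its squared deviation.
        self.mean = self.m * self.mean + (1 - self.m) * detached
        self.var = self.m * self.var + (1 - self.m) * (detached - self.mean) ** 2
        cov = self.var.sqrt() / (self.mean + self.eps)   # coefficient of variation
        weights = cov / (cov.sum() + self.eps)           # normalise to sum to 1
        return (weights * losses).sum()                  # weighted total loss
```

Usage would look like `total = weighter(torch.stack([loss_a, loss_b]))` followed by `total.backward()`, with the weights recomputed from the running statistics at every step.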
Towards Personalised Gaming via Facial Expression Recognition
Blom, Paris Mavromoustakos (University of Amsterdam) | Bakkes, Sander (University of Amsterdam) | Tan, Chek Tien (University of Technology Sydney) | Whiteson, Shimon (University of Amsterdam) | Roijers, Diederik (University of Amsterdam) | Valenti, Roberto (University of Amsterdam) | Gevers, Theo (University of Amsterdam)
In this paper we propose an approach for personalising the space in which a game is played (i.e., its levels) based on classifications of the user's facial expressions, with the aim of tailoring the affective game experience to the individual user. Our approach targets online game personalisation, i.e., the game experience is personalised during actual play. A key insight of this paper is that game personalisation techniques can leverage novel computer vision-based techniques to unobtrusively and automatically infer player experiences from facial expression analysis. Specifically, we (1) leverage the proven InSight facial expression recognition SDK as a model of the user's affective state, and (2) employ this model to guide the online game personalisation process. User studies that validate the game personalisation approach in the actual video game Infinite Mario Bros. reveal that it provides an effective basis for converging to an appropriate affective state for the individual human player.
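The online personalisation process described above amounts to a feedback loop: periodically infer the player's affective state from webcam frames and nudge level-generation parameters toward a target state. The sketch below makes this concrete; the affect-model interface, the score range, and the single difficulty parameter are hypothetical placeholders for the paper's InSight-based setup.

```python
# Hypothetical online personalisation loop: all interfaces are placeholders.
def personalise(game, affect_model, target_arousal=0.6, gain=0.1):
    while game.running():
        frame = game.capture_webcam_frame()
        arousal = affect_model.predict(frame)      # assumed score in [0, 1]
        error = target_arousal - arousal
        # Too calm -> harder levels; too stressed -> easier levels.
        game.difficulty = min(1.0, max(0.0, game.difficulty + gain * error))
        game.generate_next_segment(difficulty=game.difficulty)
```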