698d51a19d8a121ce581499d7b701668-Supplemental.pdf

Neural Information Processing Systems

Section 2, Section 3 and Section 4 provide more visualization results on a number of 3D modeling tasks, including shape reconstruction, generation and interpolation. Note that all the hierarchical aggregators {Ei} share the same network parameters. The hierarchical decoder D includes an implicit octant decoder and several hierarchical local decoders {Di}, which map the decoded shape code and a sample point (x, y, z) to the local geometry and the local latent feature, respectively. IN denotes the instance normalization 3D operator [9]. We provide two tables (Table 1 and Table 2) to detail the network structures of the 3D voxel encoder and implicit decoder, respectively.
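The core interface of such an implicit decoder can be sketched in a few lines: a latent shape code and a 3D query point go in, a local occupancy value comes out. This is a minimal illustrative sketch, not the paper's architecture; the layer sizes, the simple concatenation of code and point, and the random weights (standing in for trained parameters) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 32   # assumed size of the shape code
HIDDEN_DIM = 64   # assumed hidden width

# Randomly initialized weights stand in for trained parameters.
W1 = rng.standard_normal((LATENT_DIM + 3, HIDDEN_DIM)) * 0.1
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.standard_normal((HIDDEN_DIM, 1)) * 0.1
b2 = np.zeros(1)

def decode(shape_code, point):
    """Map (latent shape code, 3D query point) -> occupancy in (0, 1)."""
    h = np.concatenate([shape_code, point])   # condition on both inputs
    h = np.maximum(h @ W1 + b1, 0.0)          # ReLU hidden layer
    logit = h @ W2 + b2                       # scalar occupancy logit
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid

code = rng.standard_normal(LATENT_DIM)
occ = decode(code, np.array([0.1, -0.2, 0.3]))
```

Querying `decode` over a dense grid of points and thresholding the occupancies is the usual way such a decoder is turned back into an explicit shape.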





Diff-Ensembler: Learning to Ensemble 2D Diffusion Models for Volume-to-Volume Medical Image Translation

Zhu, Xiyue, Kwark, Dou Hoon, Zhu, Ruike, Hong, Kaiwen, Tao, Yiqi, Luo, Shirui, Li, Yudu, Liang, Zhi-Pei, Kindratenko, Volodymyr

arXiv.org Artificial Intelligence

Despite success in volume-to-volume translations in medical images, most existing models struggle to effectively capture the inherent volumetric distribution using 3D representations. The current state-of-the-art approach combines multiple 2D-based networks through weighted averaging, thereby neglecting the 3D spatial structures. Directly training 3D models in medical imaging presents significant challenges due to high computational demands and the need for large-scale datasets. To address these challenges, we introduce Diff-Ensembler, a novel hybrid 2D-3D model for efficient and effective volumetric translations by ensembling perpendicularly trained 2D diffusion models with a 3D network in each diffusion step. Moreover, our model can naturally be used to ensemble diffusion models conditioned on different modalities, allowing flexible and accurate fusion of input conditions. Extensive experiments demonstrate that Diff-Ensembler attains superior accuracy and volumetric realism in 3D medical image super-resolution and modality translation. We further demonstrate the strength of our model's volumetric realism using tumor segmentation as a downstream task.
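The perpendicular-ensembling idea can be sketched numerically: run a 2D model slice-wise along each of the three volume axes, then fuse the three predictions. Everything here is a stand-in: `slice_model` is a placeholder for a trained 2D diffusion denoiser, and the plain average replaces the learned 3D fusion network applied at each diffusion step.

```python
import numpy as np

def slice_model(img2d):
    """Stand-in for a trained 2D model applied to one slice."""
    return img2d * 0.5  # placeholder operation

def apply_along_axis(vol, axis):
    """Run the 2D model on every slice perpendicular to `axis`."""
    moved = np.moveaxis(vol, axis, 0)
    out = np.stack([slice_model(s) for s in moved])
    return np.moveaxis(out, 0, axis)

def ensemble(vol):
    """Fuse axial, coronal, and sagittal 2D predictions.

    A learned 3D network would replace this mean in the actual method.
    """
    preds = [apply_along_axis(vol, ax) for ax in range(3)]
    return np.mean(preds, axis=0)

vol = np.ones((4, 4, 4))
fused = ensemble(vol)
```

The point of fusing in 3D rather than averaging per-slice outputs independently is that the fusion step can see, and correct, inconsistencies between the three stacks of slices.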


Local Spatiotemporal Representation Learning for Longitudinally-consistent Neuroimage Analysis

Ren, Mengwei, Dey, Neel, Styner, Martin A., Botteron, Kelly, Gerig, Guido

arXiv.org Artificial Intelligence

Recent self-supervised advances in medical computer vision exploit global and local anatomical self-similarity for pretraining prior to downstream tasks such as segmentation. However, current methods assume i.i.d. image acquisition, which is invalid in clinical study designs where follow-up longitudinal scans track subject-specific temporal changes. Further, existing self-supervised methods for medically-relevant image-to-image architectures exploit only spatial or temporal self-similarity and only do so via a loss applied at a single image-scale, with naive multi-scale spatiotemporal extensions collapsing to degenerate solutions. To these ends, this paper makes two contributions: (1) It presents a local and multi-scale spatiotemporal representation learning method for image-to-image architectures trained on longitudinal images. It exploits the spatiotemporal self-similarity of learned multi-scale intra-subject features for pretraining and develops several feature-wise regularizations that avoid collapsed identity representations; (2) During finetuning, it proposes a surprisingly simple self-supervised segmentation consistency regularization to exploit intra-subject correlation. Benchmarked in the one-shot segmentation setting, the proposed framework outperforms both well-tuned randomly-initialized baselines and current self-supervised techniques designed for both i.i.d. and longitudinal datasets. These improvements are demonstrated across both longitudinal neurodegenerative adult MRI and developing infant brain MRI and yield both higher performance and longitudinal consistency.
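One ingredient of such an objective, aligning spatially corresponding features of two intra-subject timepoints, can be written as a cosine-similarity loss over feature maps. This is only a sketch of that one ingredient under assumed (C, H, W) feature shapes; the paper's multi-scale losses and anti-collapse regularizations are not reproduced here.

```python
import numpy as np

def cosine_sim_loss(feat_t0, feat_t1, eps=1e-8):
    """1 - mean cosine similarity over spatial locations.

    feat_*: (C, H, W) feature maps from two scans of the same subject.
    """
    num = np.sum(feat_t0 * feat_t1, axis=0)
    den = (np.linalg.norm(feat_t0, axis=0) *
           np.linalg.norm(feat_t1, axis=0) + eps)
    return float(1.0 - np.mean(num / den))

rng = np.random.default_rng(1)
f = rng.standard_normal((8, 16, 16))
loss_same = cosine_sim_loss(f, f)    # identical features -> near 0
loss_opp = cosine_sim_loss(f, -f)    # opposite features  -> near 2
```

A loss like this is exactly the kind that admits a collapsed solution (constant features score perfectly), which is why the paper pairs it with feature-wise regularizations.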


FourierNets enable the design of highly non-local optical encoders for computational imaging

Deb, Diptodip, Jiao, Zhenfei, Sims, Ruth, Chen, Alex B., Broxton, Michael, Ahrens, Misha B., Podgorski, Kaspar, Turaga, Srinivas C.

arXiv.org Artificial Intelligence

Differentiable simulations of optical systems can be combined with deep learning-based reconstruction networks to enable high performance computational imaging via end-to-end (E2E) optimization of both the optical encoder and the deep decoder. This has enabled imaging applications such as 3D localization microscopy, depth estimation, and lensless photography via the optimization of local optical encoders. More challenging computational imaging applications, such as 3D snapshot microscopy which compresses 3D volumes into single 2D images, require a highly non-local optical encoder. We show that existing deep network decoders have a locality bias which prevents the optimization of such highly non-local optical encoders. We address this with a decoder based on a shallow neural network architecture using global kernel Fourier convolutional neural networks (FourierNets). We show that FourierNets surpass existing deep network based decoders at reconstructing photographs captured by the highly non-local DiffuserCam optical encoder. Further, we show that FourierNets enable E2E optimization of highly non-local optical encoders for 3D snapshot microscopy. By combining FourierNets with a large-scale multi-GPU differentiable optical simulation, we are able to optimize non-local optical encoders 170$\times$ to 7372$\times$ larger than prior state of the art, and demonstrate the potential for ROI-type specific optical encoding with a programmable microscope.
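The global-kernel idea at the heart of a Fourier convolutional layer can be sketched directly: a convolution whose kernel spans the entire image, computed as a pointwise product in the Fourier domain. The random kernel below stands in for a learned parameter; the network's nonlinearities and layer stacking are omitted.

```python
import numpy as np

def global_fourier_conv(image, kernel):
    """Circular convolution with a full-image-size kernel via FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel)))

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
ker = rng.standard_normal((32, 32))  # stand-in for a learned global kernel
out = global_fourier_conv(img, ker)
```

Because the kernel is as large as the input, every output pixel depends on every input pixel in a single layer, which is the non-local receptive field that small spatial kernels in conventional decoders lack.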


From Patterson Maps to Atomic Coordinates: Training a Deep Neural Network to Solve the Phase Problem for a Simplified Case

Hurwitz, David

arXiv.org Machine Learning

This work demonstrates that, for a simple case of 10 randomly positioned atoms, a neural network can be trained to infer atomic coordinates from Patterson maps. The network was trained entirely on synthetic data. For the training set, the network outputs were 3D maps of randomly positioned atoms. From each output map, a Patterson map was generated and used as input to the network. The network generalized to cases not in the training set, inferring atom positions from Patterson maps. A key finding in this work is that the Patterson maps presented at the network input during training must uniquely describe the atomic coordinates they are paired with at the network output, or the network will not train and it will not generalize. The network cannot train on conflicting data. Avoiding conflicts is handled in three ways: 1. Patterson maps are invariant to translation. To remove this degree of freedom, output maps are centered on the average of their atom positions. 2. Patterson maps are invariant to centrosymmetric inversion. This conflict is removed by presenting the network output with both the atoms used to make the Patterson map and their centrosymmetry-related counterparts simultaneously. 3. The Patterson map does not uniquely describe a set of coordinates because the origin for each vector in the Patterson map is ambiguous. By adding empty space around the atoms in the output map, this ambiguity is removed. Forcing output atoms to be closer than half the output box edge dimension means the origin of each peak in the Patterson map must be the origin to which it is closest.
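The synthetic training pair described above is easy to reproduce, using the standard identity that the Patterson map is the autocorrelation of the density (the inverse FFT of the squared structure-factor moduli). The grid size, atom count, and point-atom rendering below are illustrative assumptions, not the paper's exact data pipeline; the snippet also demonstrates the centroid-centering step and the centrosymmetric-inversion invariance mentioned in the abstract.

```python
import numpy as np

N = 16  # assumed grid edge length

def place_atoms(coords, n=N):
    """Render point atoms on a grid, centered on their mean position
    (removing the translation degree of freedom noted above)."""
    coords = coords - coords.mean(axis=0)   # center on centroid
    grid = np.zeros((n, n, n))
    for c in np.round(coords).astype(int) % n:
        grid[tuple(c)] += 1.0
    return grid

def patterson(density):
    """Patterson map: autocorrelation of the density via FFT."""
    F = np.fft.fftn(density)
    return np.real(np.fft.ifftn(np.abs(F) ** 2))

rng = np.random.default_rng(0)
atoms = rng.integers(4, 12, size=(10, 3)).astype(float)
density = place_atoms(atoms)
pmap = patterson(density)
# The Patterson map is unchanged by centrosymmetric inversion of the atoms:
pmap_inv = patterson(place_atoms(-atoms))
```

The equality of `pmap` and `pmap_inv` is exactly the ambiguity that motivates presenting both the atoms and their inverted counterparts at the network output.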


See and Think: Disentangling Semantic Scene Completion

Liu, Shice, Hu, Yu, Zeng, Yiming, Tang, Qiankun, Jin, Beibei, Han, Yinhe, Li, Xiaowei

Neural Information Processing Systems

Semantic scene completion predicts volumetric occupancy and object category of a 3D scene, which helps intelligent agents to understand and interact with their surroundings. In this work, we propose a disentangled framework that sequentially carries out 2D semantic segmentation, 2D-3D reprojection, and 3D semantic scene completion. This three-stage framework has three advantages: (1) explicit semantic segmentation significantly boosts performance; (2) flexible fusion of sensor data brings good extensibility; (3) progress in any subtask promotes the holistic performance. Experimental results show that whether the input is a single depth map or RGB-D, our framework generates high-quality semantic scene completion and outperforms state-of-the-art approaches on both synthetic and real datasets.
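The middle stage, 2D-3D reprojection, can be sketched with a pinhole camera model: each labeled pixel is back-projected using its depth and written into the voxel it lands in. The intrinsics (`fx`, `fy`, `cx`, `cy`), voxel size, and grid resolution below are made-up illustrative values; the paper's actual projection and fusion details are not reproduced here.

```python
import numpy as np

fx = fy = 50.0   # focal lengths in pixels (assumed)
cx = cy = 16.0   # principal point (assumed)
VOXEL = 0.1      # voxel edge length in meters (assumed)
GRID = (32, 32, 32)

def reproject(depth, labels):
    """Write each labeled pixel's semantic class into the voxel that its
    back-projected 3D point falls into."""
    vol = np.zeros(GRID, dtype=np.int64)      # 0 = empty
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    x = (us - cx) * z / fx                    # pinhole back-projection
    y = (vs - cy) * z / fy
    idx = np.stack([x, y, z], axis=-1) / VOXEL
    idx = np.round(idx).astype(int) + np.array(GRID) // 2  # shift into grid
    valid = np.all((idx >= 0) & (idx < np.array(GRID)), axis=-1)
    valid &= depth > 0
    for (i, j, k), lab in zip(idx[valid], labels[valid]):
        vol[i, j, k] = lab
    return vol

depth = np.full((32, 32), 1.0)   # a flat surface one meter away
labels = np.full((32, 32), 3)    # every pixel labeled class 3
vol = reproject(depth, labels)
```

The resulting sparse, semantically labeled volume is what the third stage (3D semantic scene completion) densifies and refines.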