Sensing and Signal Processing
Quality-Improved and Property-Preserved Polarimetric Imaging via Complementarily Fusing
Chu Zhou
Polarimetric imaging is a challenging problem in the field of polarization-based vision: a short exposure time reduces the signal-to-noise ratio, severely degrading the degree of polarization (DoP) and the angle of polarization (AoP), while a relatively long exposure time tends to over-smooth the DoP and AoP due to frequently-occurring motion blur. This work proposes a polarimetric imaging framework that produces clean and clear polarized snapshots by complementarily fusing a degraded pair of noisy and blurry ones. By adopting a neural network-based three-phase fusing scheme with specially-designed modules tailored to each phase, our framework not only improves image quality but also preserves the polarization properties. Experimental results show that our framework achieves state-of-the-art performance.
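For reference, the polarization properties mentioned above are conventionally obtained from four polarized captures via the linear Stokes parameters; the short sketch below (plain NumPy, with assumed function and variable names, not the authors' framework) shows how the DoP and AoP follow from such a measurement.

```python
import numpy as np

def dop_aop(i0, i45, i90, i135):
    """Compute DoP and AoP from intensities captured behind polarizers
    at 0, 45, 90, and 135 degrees (standard linear Stokes formulation)."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)   # total intensity
    s1 = i0 - i90                        # horizontal vs. vertical component
    s2 = i45 - i135                      # diagonal component
    dop = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-8)  # degree of polarization
    aop = 0.5 * np.arctan2(s2, s1)                        # angle of polarization (radians)
    return dop, aop
```

Noise in the four captures propagates directly into s1 and s2, which is why low-SNR (short-exposure) inputs degrade the DoP and AoP so strongly.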
AirSketch: Generative Motion to Sketch
Illustration is a fundamental mode of human expression and communication. Certain types of motion that accompany speech can provide this illustrative mode of communication. While Augmented and Virtual Reality technologies (AR/VR) have introduced tools for producing drawings with hand motions (air drawing), they typically require costly hardware and additional digital markers, thereby limiting their accessibility and portability. Furthermore, air drawing demands considerable skill to achieve aesthetic results. To address these challenges, we introduce the concept of AirSketch, aimed at generating faithful and visually coherent sketches directly from hand motions, eliminating the need for complicated headsets or markers. We devise a simple augmentation-based self-supervised training procedure, enabling a controllable image diffusion model to learn to translate from highly noisy hand tracking images to clean, aesthetically pleasing sketches, while preserving the essential visual cues from the original tracking data. We present two air drawing datasets to study this problem. Our findings demonstrate that beyond producing photo-realistic images from precise spatial inputs, controllable image diffusion can effectively produce a refined, clear sketch from a noisy input. Our work serves as an initial step towards marker-less air drawing and reveals distinct applications of controllable diffusion models to AirSketch and AR/VR in general.
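To make the augmentation-based self-supervised idea concrete, the sketch below shows one plausible way to corrupt a clean stroke trajectory into a noisy "hand tracking" image and pair it with its clean rasterization; the function names, noise parameters, and rasterization here are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def make_training_pair(stroke_points, jitter_std=2.0, drop_prob=0.05, canvas=256):
    """Corrupt a clean stroke to imitate noisy hand tracking, yielding a
    (noisy condition image, clean target sketch) pair for training."""
    pts = np.asarray(stroke_points, dtype=np.float32)           # (N, 2) clean trajectory
    noisy = pts + np.random.normal(0, jitter_std, pts.shape)    # simulate hand tremor
    noisy = noisy[np.random.rand(len(noisy)) > drop_prob]       # simulate tracking dropouts

    def rasterize(p):
        img = np.zeros((canvas, canvas), dtype=np.uint8)
        ij = np.clip(np.round(p).astype(int), 0, canvas - 1)
        img[ij[:, 1], ij[:, 0]] = 255
        return img

    return rasterize(noisy), rasterize(pts)
```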
Class-Aware Adversarial Transformers for Medical Image Segmentation
Synapse: The Synapse multi-organ segmentation dataset includes 30 abdominal CT scans with 3779 axial contrast-enhanced abdominal clinical CT images. Each CT volume consists of 85 to 198 slices of 512×512 pixels, with a voxel spatial resolution of ([0.54~0.54] × [0.98~0.98] × [2.5~5.0]) mm³. For each case, the 8 annotated anatomical structures are the aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach. LiTS: The MICCAI 2017 Liver Tumor Segmentation Challenge (LiTS) includes 131 contrast-enhanced 3D abdominal CT volumes for training and testing. The dataset is assembled from different scanners and protocols at seven hospitals and research institutions. It is randomly divided into 100 volumes for training and 31 for testing.
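As a minimal illustration of the LiTS split described above (assuming a simple seeded random partition; the challenge's actual assignment of volumes may differ):

```python
import random

def split_lits(volume_ids, n_train=100, seed=0):
    """Randomly split the 131 LiTS volumes into 100 training and 31 test volumes."""
    ids = list(volume_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]

train_ids, test_ids = split_lits(range(131))
assert len(train_ids) == 100 and len(test_ids) == 31
```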
SeeClear: Semantic Distillation Enhances Pixel Condensation for Video Super-Resolution
Qi Tang
Diffusion-based Video Super-Resolution (VSR) is renowned for generating perceptually realistic videos, yet it struggles to maintain detail consistency across frames due to stochastic fluctuations. The traditional approach of pixel-level alignment is ineffective for diffusion-processed frames because of iterative disruptions. To overcome this, we introduce SeeClear, a novel VSR framework leveraging conditional video generation, orchestrated by instance-centric and channel-wise semantic controls. This framework integrates a Semantic Distiller and a Pixel Condenser, which synergize to extract and upscale semantic details from low-resolution frames. The Instance-Centric Alignment Module (InCAM) utilizes video-clip-wise tokens to dynamically relate pixels within and across frames, enhancing coherency.
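As a rough sketch of how clip-wise tokens could mediate pixel relations within and across frames (an assumed wiring for illustration only, not the released InCAM implementation):

```python
import torch
import torch.nn.functional as F

def token_guided_cross_frame_attention(feats, tokens):
    """Relate pixels across a clip through a small set of clip-level tokens.

    feats:  (T, C, H, W) per-frame features of a video clip
    tokens: (K, C) learned clip-wise tokens
    """
    T, C, H, W = feats.shape
    pixels = feats.permute(0, 2, 3, 1).reshape(T * H * W, C)   # flatten all pixels in the clip
    attn = F.softmax(pixels @ tokens.t() / C ** 0.5, dim=-1)   # pixel-to-token affinities
    out = attn @ tokens                                         # aggregate shared token context
    return out.reshape(T, H, W, C).permute(0, 3, 1, 2)          # back to (T, C, H, W)
```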