patch size
SHF: Symmetrical Hierarchical Forest with Pretrained Vision Transformer Encoder for High-Resolution Medical Segmentation
This paper presents a novel approach to addressing the long-sequence problem in high-resolution medical images for Vision Transformers (ViTs). Using smaller patches as tokens can enhance ViT performance, but quadratically increases computation and memory requirements. Therefore, the common practice for applying ViTs to high-resolution images is either to: (a) employ complex sub-quadratic attention schemes or (b) use large to medium-sized patches and rely on additional mechanisms within the model to capture the spatial hierarchy of details. We propose Symmetrical Hierarchical Forest (SHF), a lightweight approach that adaptively patches the input image to increase token information density and encode hierarchical spatial structures into the input embedding. We then apply a reverse depatching scheme to the output embeddings of the transformer encoder, eliminating the need for convolution-based decoders. Unlike previous methods that modify attention mechanisms or use a complex hierarchy of interacting models, SHFcan be retrofitted to any ViT model to allow it to learn the hierarchical structure of details in high-resolution images without requiring architectural changes. Experimental results demonstrate significant gains in computational efficiency and performance: on the PAIPWSI dataset, we achieved a 3 32 speedup or a 2.95% 7.03% increase in accuracy (measured by Dice score) at a 64K2 resolution with the same computational budget, compared to state-of-the-art production models. On the 3D medical datasets BTCV and KiTS, training was 6 faster, with accuracy gains of 6.93% and 5.9%, respectively, compared to models without SHF.
HEROFILTER: Adaptive Spectral Graph Filter for Varying Heterophilic Relations
Graph heterophily, where connected nodes have different labels, has attracted significant interest recently. Most existing works adopt a simplified approach using low-pass filters for homophilic graphs and high-pass filters for heterophilic graphs. However, we discover that the relationship between graph heterophily and spectral filters is more complex - the optimal filter response varies across frequency components and does not follow a strict monotonic correlation with heterophily degree. This finding challenges conventional fixed filter designs and suggests the need for adaptive filtering to preserve expressiveness in graph embeddings. Formally, natural questions arise: Given a heterophilic graph G, how and to what extent will the varying heterophily degree of G affect the performance of GNNs? How can we design adaptive filters to fit those varying heterophilic connections? Our theoretical analysis reveals that the average frequency response of GNNs and graph heterophily degree do not follow a strict monotonic correlation, necessitating adaptive graph filters to guarantee good generalization performance. Hence, we propose HEROFILTER, a simple yet powerful GNN, which extracts information across the heterophily spectrum and combines salient representations through adaptive mixing. HEROFILTER's superior performance achieves up to 9.2% accuracy improvement over leading baselines across homophilic and heterophilic graphs.
Masked Generative Adversarial Networks are Data-Efficient Generation Learners Supplemental Materials
Prior studies [18, 12] show that GAN often experiences generation failures with severely degraded generation performance when only limited training data is available. Specifically, with limited training data, the discriminator tends to discriminate via meaningless shortcuts by merely focusing on easy-to-discriminate image locations and spectra instead of holistic understanding of images. This can be viewed clearly in Figure 1, where the Gini Coefficient [4] of discriminator's spatial attentions increases quickly along the training iteration (when only limited training data is available). Note that the Gini coefficient [4] is negatively correlated with equality, i.e., the discriminator will pay more unevenly distributed attention to each spatial location while the Gini coefficient increases from '0' to '1'. For image generation with GAN, the large Gini coefficient (of discriminator's spatial attentions) thus means that the discriminator starts to focus on certain spatial locations (easy to discriminate) while ignoring other spatial locations (hard to discriminate), ultimately leading to an over-confident discriminator and training collapse. In another word, the Gini coefficient [4] of '0' expresses perfect equality where all values are the same (i.e., where the discriminator pays the same attention to every spatial location) while '1' expresses maximal inequality among values (i.e., the discriminator focuses on only one location while all others are ignored).
01c561df365429f33fcd7a7faa44c985-Supplemental-Conference.pdf
A.1 Datasets fMoWRGBFunctional Map of the World (fMoW) [17] is a dataset of high-resolution satellite image time series across the world, with a task of classification among 62 architecture categories such as airport, shipyard, and zoo. The license is provided here 2. Co-located images of different timestamps, or sequences, are provided in fMoW. They are of different length, and around 60% of the samples have length larger than 2. Readers can refer to the fMoW paper [17] for statistics on the distribution of sequence lengths. We construct a temporal version of fMoW by randomly associating every single image with two images of the same location but of different timestamps if possible. For a given spatial location loc, we define Tloc as the number of temporally distinct snapshots present in the dataset. We crop surface reflectance images from the Sentinel-2 (ESA) satellite (courtesy of the U.S. Geological Survey), consisting of 90-day composites of images at the same locations as fMoW images (to reduce the impacts of cloud coverage). At each fMoW datapoint location, we collect a time series of Sentinel-2 images, using the provided geo-coordinate bounding boxes.