Locally Hierarchical Auto-Regressive Modeling for Image Generation
Supplementary Document

A Implementation Details

A.1 HQ-VAE

Our implementation is based on PyTorch 1.10. The detailed architecture of HQ-TVAE is presented in Table B. When learning the resizing operations, we apply two different loss functions.

Figure C: Examples of reconstructed images using HQ-VAE with the learnable down- and up-sampling layers.
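A minimal sketch of such learnable down- and up-sampling layers is given below, assuming a strided convolution paired with a transposed convolution; the channel width, kernel size, and scale factor are illustrative assumptions, and the actual configuration is the one specified in Table B.

```python
import torch
import torch.nn as nn

class LearnableResize(nn.Module):
    """Sketch of a learnable down-/up-sampling pair (hypothetical
    configuration; see Table B for the actual architecture)."""

    def __init__(self, channels: int = 256, factor: int = 2):
        super().__init__()
        # Down-sampling: strided convolution reduces resolution by `factor`.
        self.down = nn.Conv2d(channels, channels, kernel_size=2 * factor,
                              stride=factor, padding=factor // 2)
        # Up-sampling: transposed convolution restores the resolution.
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=2 * factor,
                                     stride=factor, padding=factor // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


x = torch.randn(1, 256, 32, 32)
assert LearnableResize()(x).shape == x.shape  # original resolution is recovered
```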
A.2 HQ-Transformer

The input of the main transformer starts with the start-of-sentence (SOS) token. We set the number of self-attention blocks in the IET to 1 or 2.

B Ablation Study

We use the smallest model, HQ-Transformer (S), to verify our architectural choices. In the PHT, we propose locally hierarchical decoding, in contrast to the standard sequential approach, by assuming conditional independence among the bottom codes given their top code. The ablation study in Table C(b) demonstrates the benefit of this decoding strategy with respect to image generation quality.

Table C: Ablation study on architectural choices with HQ-Transformer (S).

      Input embedding   Decoding policy                     Label type      (top-k, t)    FID     Precision   Recall
  (a) Addition          Locally hierarchical conditioning   One-hot label   (2048, 0.9)   11.03   0.70        0.55
      IET               …

B.4 Soft-Labeling in HQ-Transformer

Table C(c) shows that soft-labeling improves FID compared to one-hot labeling.
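A minimal sketch of such a soft-label objective follows: the one-hot cross-entropy target is replaced by a soft distribution over codebook entries. Deriving the targets from encoder-to-codebook distances (`soft_targets_from_distances`, with a temperature `tau`) is an illustrative assumption, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    # Cross-entropy against a soft target distribution instead of a one-hot label.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

def soft_targets_from_distances(z: torch.Tensor, codebook: torch.Tensor,
                                tau: float = 1.0) -> torch.Tensor:
    # Hypothetical construction: codes closer to the encoder feature `z`
    # receive more probability mass.
    d2 = torch.cdist(z, codebook).pow(2)   # (N, V) squared distances
    return F.softmax(-d2 / tau, dim=-1)    # soft distribution over the V codes
```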
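Returning to the locally hierarchical decoding of the PHT, the sketch below samples top codes auto-regressively starting from the SOS token, then draws all bottom codes of each cell in parallel under the conditional-independence assumption. The modules (`backbone`, `top_head`, `bottom_head`, `top_emb`) are hypothetical stand-ins for the actual HQ-Transformer components; the sampler mirrors the (top-k, t) = (2048, 0.9) setting from Table C.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_top_k(logits: torch.Tensor, k: int = 2048, t: float = 0.9) -> torch.Tensor:
    # Top-k sampling with temperature, as in (top-k, t) = (2048, 0.9) of Table C.
    logits = logits / t
    k = min(k, logits.size(-1))
    kth = torch.topk(logits, k, dim=-1).values[..., -1:]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    flat = probs.reshape(-1, probs.size(-1))
    return torch.multinomial(flat, 1).reshape(probs.shape[:-1])

@torch.no_grad()
def locally_hierarchical_decode(backbone, top_head, bottom_head, top_emb,
                                sos: torch.Tensor, steps: int, num_bottom: int = 4):
    x = sos.view(1, 1, -1)                        # sequence starts with the SOS embedding
    tops, bottoms = [], []
    for _ in range(steps):
        # Only already-generated tokens are in `x`, and only the last
        # position's state is read, so no causal mask is needed here.
        h = backbone(x)[:, -1]                    # (1, d) context at the last position
        top = sample_top_k(top_head(h))           # next top code, sampled sequentially
        b_logits = bottom_head(h + top_emb(top))  # logits for all bottom codes at once
        b_logits = b_logits.view(1, num_bottom, -1)
        bottoms.append(sample_top_k(b_logits))    # bottom codes sampled in parallel
        tops.append(top)
        x = torch.cat([x, top_emb(top).unsqueeze(1)], dim=1)
    return torch.stack(tops, dim=1), torch.stack(bottoms, dim=1)

# Toy usage with stand-in modules (d = width, V = codebook size).
d, V = 64, 512
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=1)
tops, bottoms = locally_hierarchical_decode(
    backbone, nn.Linear(d, V), nn.Linear(d, 4 * V), nn.Embedding(V, d),
    sos=torch.zeros(d), steps=8)
print(tops.shape, bottoms.shape)  # (1, 8) and (1, 8, 4)
```

Drawing all bottom codes of a cell in a single step is what shortens the effective sequence length relative to fully sequential decoding; the conditional-independence assumption given the top code is what makes this parallel draw valid.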