LuxDiT: Lighting Estimation with Video Diffusion Transformer
Neural Information Processing Systems
Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation fine-tuning strategy using a collected dataset of HDR panoramas.
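For readers unfamiliar with low-rank adaptation, the generic idea can be sketched as follows. This is a minimal illustration of the standard LoRA update, W + (alpha / r) * B A, in plain Python. It is not the authors' implementation: the abstract does not say which layers are adapted, nor the rank or scaling, so the shapes, the `alpha` value, and the helper names `matmul` and `lora_weight` are all assumptions made for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the shared inner rank."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    delta = matmul(b, a)  # low-rank update with shape (d_out, d_in)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Hypothetical sizes chosen only to keep the example small.
random.seed(0)
d_out, d_in, rank = 4, 6, 2

# Frozen pretrained weight (stands in for one layer of the base model).
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors; B starts at zero so the adapted weight
# initially equals the pretrained weight.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]

W_adapted = lora_weight(W, A, B, alpha=4.0)

# Only A and B would be trained: rank * (d_in + d_out) values
# instead of d_in * d_out for the full matrix.
trainable = rank * (d_in + d_out)
full = d_in * d_out
```

Because `B` is initialised to zero, `W_adapted` matches `W` before any training step. The pretrained behaviour is preserved at the start of fine-tuning, and only the small factors are updated.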
Jun-14-2026, 12:12:23 GMT
- Genre:
- Research Report > Experimental Study (1.00)
- Industry:
- Media (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning (1.00)
- Natural Language (1.00)
- Machine Learning > Neural Networks (1.00)