Goto

Collaborating Authors

 resolution


Overall Counting Anomaly Detection and Interpretation

Neural Information Processing Systems

Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but pose challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg.


SHF: Symmetrical Hierarchical Forest with Pretrained Vision Transformer Encoder for High-Resolution Medical Segmentation

Neural Information Processing Systems

This paper presents a novel approach to addressing the long-sequence problem in high-resolution medical images for Vision Transformers (ViTs). Using smaller patches as tokens can enhance ViT performance, but quadratically increases computation and memory requirements. Therefore, the common practice for applying ViTs to high-resolution images is either to: (a) employ complex sub-quadratic attention schemes or (b) use large to medium-sized patches and rely on additional mechanisms within the model to capture the spatial hierarchy of details. We propose Symmetrical Hierarchical Forest (SHF), a lightweight approach that adaptively patches the input image to increase token information density and encode hierarchical spatial structures into the input embedding. We then apply a reverse depatching scheme to the output embeddings of the transformer encoder, eliminating the need for convolution-based decoders. Unlike previous methods that modify attention mechanisms or use a complex hierarchy of interacting models, SHFcan be retrofitted to any ViT model to allow it to learn the hierarchical structure of details in high-resolution images without requiring architectural changes. Experimental results demonstrate significant gains in computational efficiency and performance: on the PAIPWSI dataset, we achieved a 3 32 speedup or a 2.95% 7.03% increase in accuracy (measured by Dice score) at a 64K2 resolution with the same computational budget, compared to state-of-the-art production models. On the 3D medical datasets BTCV and KiTS, training was 6 faster, with accuracy gains of 6.93% and 5.9%, respectively, compared to models without SHF.


PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors

Neural Information Processing Systems

Synthetic image detectors (SIDs) are a key defense against the risks posed by the growing realism of images from text-to-image (T2I) models. Red teaming improves SID's effectiveness by identifying and exploiting their failure modes via misclassified synthetic images. However, existing red-teaming solutions (i) require white-box access to SIDs, which is infeasible for proprietary state-of-the-art detectors, and (ii) generate image-specific attacks through expensive online optimization. To address these limitations, we propose PolyJuice, the first black-box, imageagnostic red-teaming method for SIDs, based on an observed distribution shift in the T2I latent space between samples correctly and incorrectly classified by the SID. PolyJuice generates attacks by (i) identifying the direction of this shift through a lightweight offline process that only requires black-box access to the SID, and (ii) exploiting this direction by universally steering all generated images towards the SID's failure modes. PolyJuice-steered T2I models are significantly more effective at deceiving SIDs (up to 84%) compared to their unsteered counterparts. We also show that the steering directions can be estimated efficiently at lower resolutions and transferred to higher resolutions using simple interpolation, reducing computational overhead. Finally, tuning SID models on PolyJuice-augmented datasets notably enhances the performance of the detectors (up to 30%).


VITRIX-UniViTAR: Unified Vision Transformer with Native Resolution

Neural Information Processing Systems

While preliminary explorations have superficially investigated native resolution modeling, existing works still lack systematic training recipe from the visual representation perspective. To bridge this gap, we introduce Unified Vision Transformer with NAtive Resolution, i.e. UniViTAR, a family of homogeneous vision foundation models tailored for unified visual modality and native resolution scenario in the era of multimodal. Our framework first conducts architectural upgrades to the vanilla paradigm by integrating multiple advanced components. Building upon these improvements, a progressive training paradigm is introduced, which strategically combines two core mechanisms: (1) resolution curriculum learning, transitioning from fixedresolution pretraining to native resolution tuning, thereby leveraging ViT's inherent adaptability to variable-length sequences, and (2) visual modality adaptation via inter-batch image-video switching, which balances computational efficiency with enhanced temporal reasoning. In parallel, a hybrid training framework further synergizes sigmoid-based contrastive loss with feature distillation from a frozen teacher model, thereby accelerating early-stage convergence. Finally, trained exclusively on public accessible image-caption data, our UniViTAR family across multiple model scales from 0.3B to 1.4B achieves state-of-the-art performance on a wide variety of visual-related tasks. The code and models are available here. Figure 1: The figure presents: (left) a systematic overview of model scaling performance across downstream tasks when increasing parameter size from 0.3B to 1B, and (right) a comprehensive comparison of multimodal capabilities against SOTA baselines on diversified benchmarks.


Multiresolution Analysis and Statistical Thresholding on Dynamic Networks

Neural Information Processing Systems

Detecting structural change in dynamic network data has wide-ranging applications. Existing approaches typically divide the data into time bins, extract network features within each bin, and then compare these features over time. This introduces an inherent tradeoff between temporal resolution and statistical stability of the extracted features. Despite this tradeoff, reminiscent of time-frequency tradeoffs in signal processing, most methods rely on a fixed temporal resolution. Choosing an appropriate resolution parameter is typically difficult, and can be especially problematic in domains like cybersecurity, where anomalous behavior may emerge at multiple time scales.


Sparc3D: Sparse Representation and Construction for High-Resolution 3DShapes Modeling

Neural Information Processing Systems

High-fidelity 3D object synthesis remains significantly more challenging than 2D image generation due to the unstructured nature of mesh data and the cubic complexity of dense volumetric grids. Existing two-stage pipelines--compressing meshes with a VAE (using either 2D or 3D supervision), followed by latent diffusion sampling--often suffer from severe detail loss caused by inefficient representations and modality mismatches introduced in VAE.


Chilling predictions from 1997 suggest a 'crisis' that reshapes America peaks this year

Daily Mail - Science & tech

High-stakes Iran war talks get underway as JD Vance is joined by pregnant wife Usha in leading familiar team of American negotiators... but fresh conflict threatens to derail peace plan Angelina Jolie's son Pax, 22, surfaces in LA after bombshell revelation about his relationship to Brad Pitt'Media-obsessed' Anna Paulina Luna reveals secret to her rising power as she turns into Republicans' 'favorite headache' Call me cynical, but the real reason Gruesome Twosome Harry and Meghan are returning to the UK is just so obvious... and highly humiliating: MAUREEN CALLAHAN No one can see the real reason Jelly Roll divorced Bunnie XO. Royals wish Prince William happy birthday and Father's Day with sweet photo of him and Charlotte after King's Trooping the Colour - as Charles pays tribute to Philip Mortifying truth about Clavicular's'botched' nose job: Infertile influencer's'trans' admission to friends... as insider reveals what's said behind closed doors - and twisted secrets that'll leave fans floored Lauren Sanchez swaps sultry dresses for leggings as she flaunts her all-natural beauty in rare glam-free photo in honor of Father's Day'I married a death row double killer': London single mother tells the full, incredible story of how she wed a Texas convict, then watched his execution Americans are flocking to'Goldilocks' city that has all the perks of major southern hubs... but homes are a fraction of the price The four mistakes that led to bungee tragedy on Skeleton Bridge: FRED KELLY saw the scene for himself, now he retraces the prelude to disaster. So was it really an accident? Ashen-faced minister admits Starmer is facing'political reality' as PM is expected to quit tomorrow Trump says algae-infested Reflecting Pool must be EMPTIED for repairs as knife-wielding'vandals' tear hole in facade and destroy $16 million renovation Inside America's new fattest town: Burgers are the size of your head, gyms lie empty and custom mobility scooters carry 800lb loads... as we investigate why Ozempic just DOESN'T work Horrific'womb-raider' murder that shocked US: Boyfriend and family's astonishing account of how woman butchered her friend to steal her unborn child after lying SHE was pregnant I lost 50lb without jabs using this easy but overlooked method. But I still felt dowdy - until I discovered these expert anti-ageing fashion and beauty tips.


MAPLE: Multi-scale Attribute-enhanced Prompt Learning for Few-shot Whole Slide Image Classification

Neural Information Processing Systems

Prompt learning has emerged as a promising paradigm for adapting pre-trained vision-language models (VLMs) to few-shot whole slide image (WSI) classification by aligning visual features with textual representations, thereby reducing annotation cost and enhancing model generalization. Nevertheless, existing methods typically rely on slide-level prompts and fail to capture the subtype-specific phenotypic variations of histological entities (e.g., nuclei, glands) that are critical for cancer diagnosis. To address this gap, we propose Multi-scale Attribute-enhanced Prompt Learning (MAPLE), a hierarchical framework for few-shot WSI classification that jointly integrates multi-scale visual semantics and performs prediction at both the entity and slide levels. Specifically, we first leverage large language models (LLMs) to generate entity-level prompts that can help identify multi-scale histological entities and their phenotypic attributes, as well as slide-level prompts to capture global visual descriptions. Then, an entity-guided cross-attention module is proposed to generate entity-level features, followed by aligning with their corresponding subtype-specific attributes for fine-grained entity-level prediction. To enrich entity representations, we further develop a cross-scale entity graph learning module that can update these representations by capturing their semantic correlations within and across scales. The refined representations are then aggregated into a slide-level representation and aligned with the corresponding prompts for slide-level prediction. Finally, we combine both entity-level and slide-level outputs to produce the final prediction results. Results on three cancer cohorts confirm the effectiveness of our approach in addressing few-shot pathology diagnosis tasks.


DCA: Graph-Guided Deep Embedding Clustering for Brain Atlases

Neural Information Processing Systems

Brain atlases are essential for reducing the dimensionality of neuroimaging data and enabling interpretable analysis. However, most existing atlases are predefined, group-level templates with limited flexibility and resolution.


BitMark: Watermarking Bitwise Autoregressive Image Generative Models

Neural Information Processing Systems

State-of-the-art text-to-image models generate photorealistic images at an unprecedented speed. This work focuses on models that operate in a bitwise autoregressive manner over a discrete set of tokens that is practically infinite in size. However, their impressive generative power comes with a growing risk: as their outputs increasingly populate the Internet, they are likely to be scraped and reused as training data--potentially by the very same models. This phenomenon has been shown to lead to model collapse, where repeated training on generated content, especially from the models' own previous versions, causes a gradual degradation in performance. A promising mitigation strategy is watermarking, which embeds human-imperceptible yet detectable signals into generated images--enabling the identification of generated content. In this work, we introduce BitMark, a robust bitwise watermarking framework.