3DILG: Irregular Latent Grids for 3D Generative Modeling

Neural Information Processing Systems

We propose a new representation for encoding 3D shapes as neural fields. The representation is designed to be compatible with the transformer architecture and to benefit both shape reconstruction and shape generation. Existing neural-field methods use grid-based representations, with latents defined on a regular grid. In contrast, we define latents on irregular grids, which allows our representation to be sparse and adaptive. In the context of shape reconstruction from point clouds, our shape representation built on irregular grids improves upon grid-based methods in terms of reconstruction accuracy.





Chest X-ray Classification using Deep Convolution Models on Low-resolution images with Uncertain Labels

Agarwal, Snigdha, Sinha, Neelam

arXiv.org Artificial Intelligence

Deep Convolutional Neural Networks have consistently achieved state-of-the-art results on many imaging tasks over the past years, the majority of which involve high-quality data. However, it is important to work on low-resolution images, since they could be a cheaper alternative for remote healthcare access, where the primary need for automated pathology identification models occurs. Medical diagnosis using low-resolution images is challenging since critical details may not be easily identifiable. In this paper, we report classification results obtained by feeding Chest X-rays of different input image sizes to deep CNN models, and we discuss the feasibility of classification on varying image sizes. We also leverage the noisy labels in the dataset by proposing a Randomized Flipping of labels technique. We use an ensemble of multi-label classification models on frontal and lateral studies. Our models are trained on 5 of the 14 chest pathologies of the publicly available CheXpert dataset. We incorporate techniques such as augmentation and regularization for model improvement and use class activation maps to visualize the neural network's decision making. We present a comparison with the classification results reported in the original CheXpert paper, which were obtained on the corresponding high-resolution images for data from 200 subjects. For the pathologies Cardiomegaly, Consolidation and Edema, we obtain 3% higher accuracy with our model architecture.


Leveraging ChatGPT's Multimodal Vision Capabilities to Rank Satellite Images by Poverty Level: Advancing Tools for Social Science Research

Sarmadi, Hamid, Hall, Ola, Rögnvaldsson, Thorsteinn, Ohlsson, Mattias

arXiv.org Artificial Intelligence

This paper investigates the novel application of Large Language Models (LLMs) with vision capabilities to analyze satellite imagery for village-level poverty prediction. Although LLMs were originally designed for natural language understanding, their adaptability to multimodal tasks, including geospatial analysis, has opened new frontiers in data-driven research. By leveraging advancements in vision-enabled LLMs, we assess their ability to provide interpretable, scalable, and reliable insights into human poverty from satellite images. Using a pairwise comparison approach, we demonstrate that ChatGPT can rank satellite images based on poverty levels with accuracy comparable to domain experts. These findings highlight both the promise and the limitations of LLMs in socioeconomic research, providing a foundation for their integration into poverty assessment workflows. This study contributes to the ongoing exploration of unconventional data sources for welfare analysis and opens pathways for cost-effective, large-scale poverty monitoring.
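The pairwise comparison approach mentioned above can be illustrated with a small sketch. The abstract does not say how individual comparisons are aggregated into a ranking, so counting wins is an assumption made here for illustration. The names `rank_by_pairwise_wins`, `prefer`, `toy_judge`, and `toy_scores` are hypothetical, and the toy scores are placeholders standing in for a vision-enabled LLM's judgment, not data from the paper.

```python
from itertools import combinations


def rank_by_pairwise_wins(items, prefer):
    """Rank items by the number of pairwise comparisons each one wins.

    `prefer(a, b)` is a hypothetical judge that returns whichever of the
    two items it considers poorer. Win-counting is an assumed aggregation
    rule, not one taken from the paper.
    """
    wins = {item: 0 for item in items}
    for a, b in combinations(items, 2):
        wins[prefer(a, b)] += 1
    # sorted() is stable, so tied items keep their input order.
    return sorted(items, key=lambda item: wins[item], reverse=True)


# Placeholder scores standing in for the judge's opinion (not real data).
toy_scores = {"village_a": 0.2, "village_b": 0.9, "village_c": 0.5}


def toy_judge(a, b):
    """Stand-in judge: picks the item with the higher placeholder score."""
    return a if toy_scores[a] >= toy_scores[b] else b


ranking = rank_by_pairwise_wins(list(toy_scores), toy_judge)
print(ranking)
```

Comparing every pair needs n(n-1)/2 judge calls for n images, so a real workflow would likely sample pairs rather than enumerate them all.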


Review for NeurIPS paper: Wavelet Flow: Fast Training of High Resolution Normalizing Flows

Neural Information Processing Systems

Summary and Contributions: This paper introduces a hierarchical structure for normalizing flows for density estimation and data generation based on wavelet transforms, allowing for a natural factorization of the data distribution based on different resolutions of the data. For density estimation, each image is fed into a sequence of wavelet transforms. Each wavelet transform takes an image and outputs a lower resolution image (obtained by a low-pass filter) and a tensor of detail coefficients (obtained by a high-pass filter). Repeatedly applying wavelet transforms to the output images leads to a set of detail coefficient tensors for each scale and a final 1x1x3 "image" representing the average intensity per channel. The original representation can be recovered from this representation with a sequence of inverse wavelet transforms.
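The multi-scale decomposition described in this summary can be sketched with a Haar-style transform in NumPy. This is only an illustration, under the assumptions that images are square with a power-of-two side length and that the low-pass filter is a plain 2x2 block average; the paper's exact wavelet and normalization may differ. The function names are hypothetical, and the normalizing flows that model each scale are not shown.

```python
import numpy as np


def haar_step(img):
    """One level: (H, W, C) -> low-pass (H/2, W/2, C), details (H/2, W/2, 3, C)."""
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 4.0  # block average (low-pass)
    details = np.stack(
        [
            (a - b + c - d) / 4.0,  # difference between left and right columns
            (a + b - c - d) / 4.0,  # difference between top and bottom rows
            (a - b - c + d) / 4.0,  # diagonal difference
        ],
        axis=2,
    )
    return low, details


def haar_inverse_step(low, details):
    """Invert haar_step exactly, recovering the higher-resolution image."""
    h, v, dg = details[:, :, 0], details[:, :, 1], details[:, :, 2]
    rows, cols, channels = low.shape
    img = np.empty((2 * rows, 2 * cols, channels), dtype=low.dtype)
    img[0::2, 0::2] = low + h + v + dg
    img[0::2, 1::2] = low - h + v - dg
    img[1::2, 0::2] = low + h - v - dg
    img[1::2, 1::2] = low - h - v + dg
    return img


def decompose(img):
    """Apply haar_step repeatedly until a 1x1xC base image remains."""
    low = img.astype(np.float64)
    details = []
    while low.shape[0] > 1:
        low, d = haar_step(low)
        details.append(d)
    return low, details


def reconstruct(low, details):
    """Rebuild the image by inverting the levels from coarsest to finest."""
    for d in reversed(details):
        low = haar_inverse_step(low, d)
    return low


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
base, details = decompose(image)
restored = reconstruct(base, details)
```

With block averaging as the low-pass filter, the final 1x1x3 base equals the per-channel mean of the input, which matches the "average intensity per channel" in the summary.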


Reviews: Pose Guided Person Image Generation

Neural Information Processing Systems

The paper proposes a human image generator conditioned on appearance and human pose. The proposed generator is based on an adversarial training architecture in which a two-stage generative network produces a high-resolution image that is fed into a discriminator. In the generator, the first stage produces a coarse image with a U-shaped network given the appearance and pose map; the second stage then takes the coarse image together with the original appearance and predicts a residual that refines the coarse image. The paper proposes a few important ideas. Conditioned on appearance and pose information, the proposed generator stacks two networks to adopt a coarse-to-fine strategy.


Towards Optimal Trade-offs in Knowledge Distillation for CNNs and Vision Transformers at the Edge

Violos, John, Papadopoulos, Symeon, Kompatsiaris, Ioannis

arXiv.org Artificial Intelligence

This paper discusses four facets of the Knowledge Distillation (KD) process for Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) architectures, particularly when executed on edge devices with constrained processing capabilities. First, we conduct a comparative analysis of the KD process between CNNs and ViT architectures, aiming to elucidate the feasibility and efficacy of employing different architectural configurations for the teacher and student, while assessing their performance and efficiency. Second, we explore the impact of varying the size of the student model on accuracy and inference speed, while maintaining a constant KD duration. Third, we examine the effects of employing higher resolution images on the accuracy, memory footprint and computational workload. Last, we examine the performance improvements obtained by fine-tuning the student model after KD to specific downstream tasks. Through empirical evaluations and analyses, this research provides AI practitioners with insights into optimal strategies for maximizing the effectiveness of the KD process on edge devices.