Hyperspectral Image Analysis in Single-Modal and Multimodal Settings Using Deep Learning Techniques
Pande, Shivam
Hyperspectral imaging (HSI) enables precise land use and land cover classification owing to its exceptional spectral resolution. However, its high dimensionality and limited spatial resolution hinder its effectiveness. This study addresses these challenges by employing deep learning techniques that process data, extract features, and classify in an integrated manner. To enhance spatial resolution, we integrate information from complementary modalities such as LiDAR and SAR through multimodal learning. Moreover, adversarial learning and knowledge distillation are utilized to overcome issues stemming from domain disparities and missing modalities. We also tailor deep learning architectures to the unique characteristics of HSI data, utilizing 1D convolutional and recurrent neural networks to handle its continuous spectral dimension; techniques such as visual attention and feedback connections within the architecture bolster the robustness of feature extraction. Additionally, we tackle the issue of limited training samples through self-supervised learning methods, employing autoencoders for dimensionality reduction, and we explore semi-supervised learning techniques that leverage unlabeled data. Our proposed approaches are evaluated across various HSI datasets and consistently outperform existing state-of-the-art techniques.
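As a concrete illustration of treating the spectrum as a continuous 1-D signal, the following is a minimal PyTorch sketch of a per-pixel 1D convolutional classifier of the kind the abstract describes. It is not the thesis architecture; the band count, class count, and layer sizes are placeholders.

import torch
import torch.nn as nn

class Spectral1DCNN(nn.Module):
    """Per-pixel classifier that treats each spectrum as a 1-D signal."""
    def __init__(self, num_bands=200, num_classes=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3),  # local spectral filters
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                     # collapse the spectral axis
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                # x: (batch, num_bands)
        x = x.unsqueeze(1)               # -> (batch, 1, num_bands)
        x = self.features(x).squeeze(-1) # -> (batch, 64)
        return self.classifier(x)

# One pixel spectrum per row; 200 bands and 16 classes are illustrative.
logits = Spectral1DCNN()(torch.randn(8, 200))

A recurrent variant would replace the convolutional stack with an LSTM/GRU stepping over the bands; the attention and feedback mechanisms mentioned above would be added on top of such a backbone.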
Land use/land cover classification of fused Sentinel-1 and Sentinel-2 imageries using ensembles of Random Forests
Pande, Shivam
The study explores the synergistic combination of Synthetic Aperture Radar (SAR) and Visible-Near Infrared-Short Wave Infrared (VNIR-SWIR) imagery for land use/land cover (LULC) classification. Bayesian fusion is employed to merge SAR texture bands with VNIR-SWIR imagery, and the research investigates the impact of this fusion on LULC classification. Despite the popularity of random forests for supervised classification, they suffer from limitations such as suboptimal performance with fewer features and accuracy stagnation. To overcome these issues, ensembles of random forests (RFE) are created by introducing random rotations using the Forest-RC algorithm. Three rotation approaches are employed: principal component analysis (PCA), the sparse random rotation (SRP) matrix, and the complete random rotation (CRP) matrix. Sentinel-1 SAR data and Sentinel-2 VNIR-SWIR data from the IIT-Kanpur region constitute the five training datasets: SAR, SAR with texture, VNIR-SWIR, VNIR-SWIR with texture, and fused VNIR-SWIR with texture. The study evaluates classifier efficacy, explores the impact of SAR and VNIR-SWIR fusion on classification, and significantly enhances the execution speed of the Bayesian fusion code. The SRP-based RFE outperforms the other ensembles on the first two datasets, yielding average overall kappa values of 61.80% and 68.18%, while the CRP-based RFE excels on the last three datasets with average overall kappa values of 95.99%, 96.93%, and 96.30%; the fourth dataset achieves the highest overall kappa of 96.93%. Furthermore, incorporating texture with the SAR bands raises the overall kappa by up to 10.00%, while adding texture to the VNIR-SWIR bands yields a maximum increment of approximately 3.45%.
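The rotation-ensemble idea lends itself to a compact sketch. The following generic Python illustration (not the paper's Forest-RC implementation) trains several random forests on randomly rotated copies of the feature space, in the spirit of the CRP variant, and combines them by majority vote; PCA- or sparse-projection-based rotations could be substituted via scikit-learn's PCA or SparseRandomProjection. All class names and parameters are illustrative.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

class RotationForestEnsemble:
    """Ensemble of random forests, each fit on a randomly rotated
    copy of the feature space (CRP-style rotations sketched here)."""
    def __init__(self, n_forests=5, n_trees=100, seed=0):
        self.n_forests, self.n_trees = n_forests, n_trees
        self.rng = np.random.default_rng(seed)
        self.models = []

    def _rotation(self, dim):
        # Random orthonormal matrix via QR decomposition of a Gaussian matrix.
        q, _ = np.linalg.qr(self.rng.standard_normal((dim, dim)))
        return q

    def fit(self, X, y):
        for _ in range(self.n_forests):
            R = self._rotation(X.shape[1])
            rf = RandomForestClassifier(n_estimators=self.n_trees).fit(X @ R, y)
            self.models.append((R, rf))
        return self

    def predict(self, X):
        # Majority vote over per-forest predictions (integer labels assumed).
        votes = np.stack([rf.predict(X @ R) for R, rf in self.models])
        return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)

Rotating the feature space before each forest diversifies the axis-aligned splits of the individual trees, which is what lets the ensemble recover accuracy where a single forest stagnates.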
Visual Question Answering in Remote Sensing with Cross-Attention and Multimodal Information Bottleneck
Songara, Jayesh, Pande, Shivam, Choudhury, Shabnam, Banerjee, Biplab, Velmurugan, Rajbabu
In this research, we address the problem of visual question answering (VQA) in remote sensing. While remotely sensed images contain information significant for identification and object detection, their high dimensionality, volume, and redundancy make them challenging to process. Furthermore, processing image information jointly with language features adds constraints, such as mapping the corresponding image and language features to each other. To handle this problem, we propose a cross-attention-based approach combined with information maximization. The CNN-LSTM-based cross-attention highlights the information in the image and language modalities and establishes a connection between the two, while information maximization learns a low-dimensional bottleneck layer that retains all the information relevant to the VQA task. We evaluate our method on two remote sensing VQA datasets of different resolutions. On the high-resolution dataset, we achieve overall accuracies of 79.11% and 73.87% on the two test sets, while on the low-resolution dataset we achieve an overall accuracy of 85.98%.
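To make the fusion step concrete, here is a minimal PyTorch sketch of cross-attention between CNN image-region features and LSTM question features, followed by a low-dimensional bottleneck ahead of the answer classifier. It is an illustrative approximation rather than the paper's model: the information-maximization training objective is omitted, and all dimensions and names are placeholders.

import torch
import torch.nn as nn

class CrossAttentionVQA(nn.Module):
    """Toy cross-attention fusion of image and question features,
    compressed through a low-dimensional bottleneck."""
    def __init__(self, img_dim=512, q_dim=256, d=128, bottleneck=32, n_answers=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d)
        self.q_proj = nn.Linear(q_dim, d)
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.bottleneck = nn.Linear(d, bottleneck)  # compressed joint code
        self.classifier = nn.Linear(bottleneck, n_answers)

    def forward(self, img_feats, q_feats):
        # img_feats: (B, regions, img_dim); q_feats: (B, tokens, q_dim)
        img = self.img_proj(img_feats)
        q = self.q_proj(q_feats)
        # Question tokens attend over image regions (cross-attention).
        fused, _ = self.attn(query=q, key=img, value=img)
        z = torch.tanh(self.bottleneck(fused.mean(dim=1)))  # pooled bottleneck
        return self.classifier(z)

# 49 image regions and 12 question tokens are illustrative.
logits = CrossAttentionVQA()(torch.randn(2, 49, 512), torch.randn(2, 12, 256))

In the information-bottleneck view, the compressed code z would additionally be trained to maximize mutual information with the answer while discarding the rest; only the architectural bottleneck is shown here.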