Hasan, Zahid
CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation
Ahmed, Masud, Hasan, Zahid, Haque, Syed Arefinul, Faridee, Abu Zaher Md, Purushotham, Sanjay, You, Suya, Roy, Nirmalya
Traditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation masks with quantized embeddings (e.g., VQ-VAE) is 8% lower than with continuous-valued embeddings (e.g., KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution is a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework is a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. This setting facilitates zero-shot domain adaptation, enabled by the continuity of the embedding space. Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance ($\approx$ 95% AP relative to the baseline) under Gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact ($\approx$ 90% AP relative to the baseline) from 50% salt-and-pepper noise and from saturation and hue shifts. Code available: https://github.com/mahmed10/CAMSS.git
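The pipeline described in the abstract (VAE encoder, diffusion-guided transformer, VAE decoder) can be summarized with a minimal PyTorch-style sketch. Module and class names below are hypothetical placeholders, not the authors' released implementation; see the linked repository for the actual code.

# Minimal sketch of the three-stage pipeline: continuous VAE encoder ->
# diffusion-guided transformer -> VAE decoder. All names here are illustrative.
import torch
import torch.nn as nn

class CAMSegSketch(nn.Module):
    def __init__(self, vae_encoder, transformer, vae_decoder):
        super().__init__()
        self.vae_encoder = vae_encoder  # image -> continuous latent (KL-VAE style, no quantization)
        self.transformer = transformer  # diffusion-guided transformer operating in the latent space
        self.vae_decoder = vae_decoder  # continuous latent -> semantic segmentation mask

    def forward(self, image):
        z_img = self.vae_encoder(image)   # continuous-valued image embedding
        z_mask = self.transformer(z_img)  # mask embedding generated conditioned on the image embedding
        return self.vae_decoder(z_mask)   # decode the semantic mask from the continuous embedding

# Toy usage with stand-in 1x1 convolutions; the real encoder/transformer/decoder are learned modules.
model = CAMSegSketch(nn.Conv2d(3, 16, 1), nn.Conv2d(16, 16, 1), nn.Conv2d(16, 19, 1))
print(model(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 19, 256, 256]), 19 Cityscapes classes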
A Systematic Study on Object Recognition Using Millimeter-wave Radar
Devnath, Maloy Kumar, Chakma, Avijoy, Anwar, Mohammad Saeid, Dey, Emon, Hasan, Zahid, Conn, Marc, Pal, Biplab, Roy, Nirmalya
Millimeter-wave (MMW) radar is becoming an essential sensing technology in smart environments due to its light- and weather-independent sensing capability. Such capabilities have been widely explored and integrated with intelligent vehicle systems, often deployed with industry-grade MMW radars. However, industry-grade MMW radars are often expensive and difficult to obtain for deployable, community-purpose smart environment applications. On the other hand, commercially available MMW radars pose hidden underpinning challenges that are yet to be well investigated for tasks such as object and activity recognition, real-time person tracking, and object localization. Such tasks are frequently accompanied by image and video data, which are relatively easy for an individual to obtain, interpret, and annotate. However, image and video data are light- and weather-dependent, vulnerable to occlusion, and inherently raise privacy concerns for individuals. It is therefore crucial to investigate whether commercially available MMW radars can serve as a viable alternative sensing mechanism that removes these dependencies and preserves privacy. Before championing MMW radar, several questions need to be answered regarding its practical feasibility and performance under different operating environments. To address these concerns, we have collected a dataset using a commercially available MMW radar, the Automotive mmWave Radar (AWR2944) from Texas Instruments, and report the optimal experimental settings for object recognition performance using several deep learning algorithms in this study. Moreover, our robust data collection procedure allows us to systematically study and identify potential challenges in the object recognition task under a cross-ambience scenario.
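For illustration only, a small CNN classifier over a 2-D radar representation (e.g., a range-azimuth heatmap) shows the kind of deep learning pipeline the abstract refers to. The input representation, tensor shapes, and class count are assumptions for the sketch, not the settings or models reported in the paper.

# Illustrative sketch: classify objects from a single-channel radar heatmap.
import torch
import torch.nn as nn

class RadarObjectCNN(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes)
        )

    def forward(self, heatmap):
        return self.classifier(self.features(heatmap))

logits = RadarObjectCNN()(torch.randn(8, 1, 64, 64))  # batch of 8 assumed 64x64 heatmaps
print(logits.shape)  # torch.Size([8, 5])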
Where were my keys? -- Aggregating Spatial-Temporal Instances of Objects for Efficient Retrieval over Long Periods of Time
Idrees, Ifrah, Hasan, Zahid, Reiss, Steven P., Tellex, Stefanie
Robots equipped with situational awareness can help humans efficiently find their lost objects by leveraging spatial and temporal structure. Existing approaches to video and image retrieval do not take into account the unique constraints imposed by a moving camera with a partial view of the environment. We present a Detection-based 3-level hierarchical Association approach, D3A, to create an efficient, queryable spatial-temporal representation of unique object instances in an environment. D3A performs online incremental and hierarchical learning to identify keyframes that best represent the unique objects in the environment. These keyframes are learned from both spatial and temporal features, and once identified, their corresponding spatial-temporal information is organized in a key-value database. D3A allows for a variety of query patterns, such as querying for objects with/without the following: 1) specific attributes, 2) spatial relationships with other objects, and 3) time slices. For a given set of 150 queries, D3A returns a small set of candidate keyframes (which occupy only 0.17% of the total sensory data) with 81.98% mean accuracy in 11.7 ms. This is 47x faster and 33% more accurate than a baseline that naively stores the object matches (detections) in the database without associating spatial-temporal information.
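A toy sketch below illustrates the kind of key-value, queryable spatial-temporal store the abstract describes, with queries by attributes and time slices. The field names and query API are hypothetical; D3A's actual hierarchical association and keyframe selection are not reproduced here.

# Toy key-value store of object instances indexed by representative keyframes.
from dataclasses import dataclass, field

@dataclass
class ObjectInstance:
    label: str          # e.g., "keys"
    keyframe_id: int    # keyframe that best represents this instance
    position: tuple     # 3-D position in the map frame
    timestamp: float    # seconds since the start of recording
    attributes: set = field(default_factory=set)  # e.g., {"silver", "on_table"}

class SpatialTemporalStore:
    def __init__(self):
        self._instances = {}  # instance id -> ObjectInstance

    def insert(self, instance_id, instance):
        self._instances[instance_id] = instance

    def query(self, label=None, attributes=None, time_slice=None):
        """Return keyframe ids of instances matching label, attributes, and time slice."""
        results = []
        for inst in self._instances.values():
            if label and inst.label != label:
                continue
            if attributes and not set(attributes) <= inst.attributes:
                continue
            if time_slice and not (time_slice[0] <= inst.timestamp <= time_slice[1]):
                continue
            results.append(inst.keyframe_id)
        return results

store = SpatialTemporalStore()
store.insert(0, ObjectInstance("keys", keyframe_id=42, position=(1.0, 0.3, 0.8),
                               timestamp=12.5, attributes={"silver"}))
print(store.query(label="keys", time_slice=(0, 60)))  # -> [42]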