
Collaborating Authors

 Tran, Minh-Triet


The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding

arXiv.org Artificial Intelligence

Egocentric video has seen increased interest in recent years, as it is used in a range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a fixed location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring, such …

Multi-perspective datasets that combine first-person and third-person views are rare and typically include only a limited number of activities and do not last long enough to capture the full range of interactions and social dynamics characteristic of everyday life. In this paper, we introduce the CASTLE 2024 dataset, a multimodal multi-perspective collection of ego-centric (first-person) and exo-centric (third-person) high-resolution video recordings, augmented with additional sensor streams, designed to capture the complexity of daily human experiences. The dataset captures the experience and daily interaction of ten volunteer participants over the course of four days. It shows a broad range of domestic and social activities, including cooking, eating, cleaning, meeting and leisure activities, capturing authentic interactions among participants.


Improving Resistance to Noisy Label Fitting by Reweighting Gradient in SAM

arXiv.org Artificial Intelligence

Noisy labels pose a substantial challenge in machine learning, often resulting in overfitting and poor generalization. Sharpness-Aware Minimization (SAM), as demonstrated by Foret et al. (2021), improves generalization over traditional Stochastic Gradient Descent (SGD) in classification tasks with noisy labels by implicitly slowing the learning of noisy examples. While SAM's ability to generalize in noisy environments has been studied in several simplified settings, its full potential in more realistic training settings remains underexplored. In this work, we analyze SAM's behavior at each iteration, identifying specific components of the gradient vector that contribute significantly to its robustness against noisy labels. Based on these insights, we propose SANER (Sharpness-Aware Noise-Explicit Reweighting), an effective variant that enhances SAM's ability to manage the rate at which noisy labels are fitted. Our experiments on CIFAR-10, CIFAR-100, and Mini-WebVision demonstrate that SANER consistently outperforms SAM, achieving up to an 8% increase on CIFAR-100 with 50% label noise.

The issue of noisy labels due to human annotation error has been commonly observed in many large-scale datasets such as CIFAR-10N, CIFAR-100N (Wei et al., 2022), Clothing1M (Xiao et al., 2015), and WebVision (Li et al., 2017). Over-parameterized deep neural networks, which have enough capacity to memorize entire large datasets, can easily overfit such noisy-label data, leading to poor generalization performance (Zhang et al., 2021). Moreover, the lottery ticket hypothesis (Frankle & Carbin, 2019) indicates that only a subset of the network's parameters is crucial for generalization. This highlights the importance of noise-robust learning, where the goal is to train a robust classifier despite the presence of inaccurate or noisy labels in the training dataset. Sharpness-Aware Minimization (SAM), introduced by Foret et al. (2021), is an optimizer designed to find better generalization by searching for flat minima. It has shown superior performance over SGD in various tasks, especially in classification tasks involving noisy labels (Baek et al., 2024). Understanding the mechanisms behind the success of SAM is crucial for further improvements in handling label noise.
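
The abstract does not spell out the reweighting rule, so the following is only a minimal PyTorch sketch of the standard two-step SAM update from Foret et al. (2021), with a marked placeholder where a SANER-style reweighting of gradient components would be applied; the function name, rho, base_opt, and the placeholder are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumption-laden, not the authors' code) of one SAM update step,
# with a placeholder marking where a SANER-style reweighting of gradient components
# could be inserted. base_opt is any standard optimizer, e.g. torch.optim.SGD.
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    # 1) gradient at the current weights w
    loss_fn(model(x), y).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))

    # 2) perturb to the "worst-case" neighbour w + rho * g / ||g||
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    model.zero_grad()

    # 3) gradient at the perturbed weights
    loss_fn(model(x), y).backward()

    # 4) undo the perturbation; a SANER-style rule would reweight p.grad here
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
            # placeholder: p.grad is used unchanged (identity reweighting)
    base_opt.step()
    base_opt.zero_grad()
```

In practice base_opt would be an ordinary optimizer constructed over model.parameters(), and sam_step would be called once per batch in place of the usual forward/backward/step loop.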


Improving Referring Image Segmentation using Vision-Aware Text Features

arXiv.org Artificial Intelligence

Referring image segmentation is a challenging task that involves generating pixel-wise segmentation masks based on natural language descriptions. Existing methods have relied mostly on visual features to generate the segmentation masks while treating text features as supporting components. This over-reliance on visual features can lead to suboptimal results, especially in complex scenarios where text prompts are ambiguous or context-dependent. To overcome these challenges, we present VATEX, a novel framework that improves referring image segmentation by enhancing object and context understanding with vision-aware text features. Our method uses CLIP to derive a CLIP Prior that integrates an object-centric visual heatmap with the text description, which can be used as the initial query in a DETR-based architecture for the segmentation task. Furthermore, observing that there are multiple ways to describe an instance in an image, we enforce feature similarity between text variations referring to the same visual input through two components: a novel Contextual Multimodal Decoder that turns text embeddings into vision-aware text features, and a Meaning Consistency Constraint to further ensure a coherent and consistent interpretation of language expressions given the context understanding obtained from the image. Our method achieves a significant performance improvement on three benchmark datasets: RefCOCO, RefCOCO+, and G-Ref.
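
As a rough illustration of the CLIP Prior idea described above (an object-centric heatmap from text-image similarity used to seed the initial query), the hedged sketch below uses plain tensors with assumed shapes; the softmax normalization and the pooling into a query vector are illustrative choices, not details taken from the paper.

```python
# Hedged sketch of a CLIP-Prior-style heatmap: per-patch text-image cosine similarity
# pooled into an initial query vector. Shapes and pooling are assumptions for illustration.
import torch
import torch.nn.functional as F

def clip_prior_heatmap(patch_feats: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """patch_feats: (N, D) patch embeddings; text_feat: (D,) text embedding."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sim = patch_feats @ text_feat          # (N,) cosine similarity per patch
    return sim.softmax(dim=0)              # normalized heatmap over patches

# Hypothetical usage: a 16x16 grid of 512-d patch embeddings and one text embedding.
patches = torch.randn(16 * 16, 512)
text = torch.randn(512)
heat = clip_prior_heatmap(patches, text)                 # (256,)
init_query = (heat.unsqueeze(-1) * patches).sum(dim=0)   # (512,) heatmap-weighted pooling
```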


Enhancing Video Summarization with Context Awareness

arXiv.org Artificial Intelligence

Video summarization is a crucial research area that aims to efficiently browse and retrieve relevant information from the vast amount of video content available today. With the exponential growth of multimedia data, the ability to extract meaningful representations from videos has become essential. Video summarization techniques automatically generate concise summaries by selecting keyframes, shots, or segments that capture the video's essence. This process improves the efficiency and accuracy of various applications, including video surveillance, education, entertainment, and social media. Despite the importance of video summarization, there is a lack of diverse and representative datasets, hindering comprehensive evaluation and benchmarking of algorithms. Existing evaluation metrics also fail to fully capture the complexities of video summarization, limiting accurate algorithm assessment and hindering the field's progress. To overcome data scarcity challenges and improve evaluation, we propose an unsupervised approach that leverages video data structure and information for generating informative summaries. By moving away from fixed annotations, our framework can produce representative summaries effectively. Moreover, we introduce an innovative evaluation pipeline tailored specifically for video summarization. Human participants are involved in the evaluation, comparing our generated summaries to ground truth summaries and assessing their informativeness. This human-centric approach provides valuable insights into the effectiveness of our proposed techniques. Experimental results demonstrate that our training-free framework outperforms existing unsupervised approaches and achieves competitive results compared to state-of-the-art supervised methods.


Cluster-based Video Summarization with Temporal Context Awareness

arXiv.org Artificial Intelligence

In this paper, we present TAC-SUM, a novel and efficient training-free approach for video summarization that addresses the limitations of existing cluster-based models by incorporating temporal context. Our method partitions the input video into temporally consecutive segments with clustering information, enabling the injection of temporal awareness into the clustering process, setting it apart from prior cluster-based summarization methods. The resulting temporal-aware clusters are then utilized to compute the final summary, using simple rules for keyframe selection and frame importance scoring. Experimental results on the SumMe dataset demonstrate the effectiveness of our proposed approach, outperforming existing unsupervised methods and achieving comparable performance to state-of-the-art supervised summarization techniques. Our source code is available for reference at https://github.com/hcmus-thesis-gulu/TAC-SUM.
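
A simplified, hedged sketch of the cluster-then-segment idea described above: frame features are clustered, the video is split at cluster-label changes into temporally consecutive segments, and one keyframe per segment is kept. The keyframe rule and parameters are assumptions for illustration only; the actual procedure is in the repository linked above.

```python
# Simplified sketch of cluster-based summarization with temporal context. Not the
# TAC-SUM implementation: the clustering choice and keyframe rule are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def summarize(frame_feats: np.ndarray, n_clusters: int = 8) -> list:
    """frame_feats: (T, D) per-frame features in temporal order. Returns keyframe indices."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(frame_feats)
    # temporally consecutive segments with a constant cluster label
    cuts = [0] + [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]] + [len(labels)]
    keyframes = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        seg = frame_feats[start:end]
        center = seg.mean(axis=0)
        # keyframe = frame closest to the segment's mean feature
        keyframes.append(start + int(np.argmin(np.linalg.norm(seg - center, axis=1))))
    return keyframes
```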


Ensemble Learning for Vietnamese Scene Text Spotting in Urban Environments

arXiv.org Artificial Intelligence

This paper presents a simple yet efficient ensemble learning framework for Vietnamese scene text spotting. Leveraging the power of ensemble learning, which combines multiple models to yield more accurate predictions, our approach aims to significantly enhance the performance of scene text spotting in challenging urban settings. Through experimental evaluations on the VinText dataset, our proposed method achieves a significant improvement over existing methods, with an impressive accuracy gain of 5%. These results unequivocally demonstrate the efficacy of ensemble learning for Vietnamese scene text spotting in urban environments, highlighting its potential for real-world applications, such as text detection and recognition in urban signage, advertisements, and various text-rich urban scenes.
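
As a generic illustration of how detections from several spotters can be combined (not the paper's exact scheme), the sketch below pools boxes from all models and greedily keeps the highest-scoring box among mutually overlapping ones; the Box layout and the IoU threshold are assumptions.

```python
# Generic illustration of ensembling text detections across models via greedy,
# score-ordered overlap suppression. Not the paper's scheme; layout is assumed.
from typing import List, Tuple

Box = Tuple[float, float, float, float, float, str]  # x1, y1, x2, y2, score, transcription

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ensemble(detections_per_model: List[List[Box]], iou_thr: float = 0.5) -> List[Box]:
    pool = sorted((d for dets in detections_per_model for d in dets),
                  key=lambda d: d[4], reverse=True)
    kept: List[Box] = []
    for det in pool:
        if all(iou(det, k) < iou_thr for k in kept):
            kept.append(det)
    return kept
```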


TSRNet: Simple Framework for Real-time ECG Anomaly Detection with Multimodal Time and Spectrogram Restoration Network

arXiv.org Artificial Intelligence

The electrocardiogram (ECG) is a valuable signal used to assess various aspects of heart health, such as heart rate and rhythm. It plays a crucial role in identifying cardiac conditions and detecting anomalies in ECG data. However, distinguishing between normal and abnormal ECG signals can be a challenging task. In this paper, we propose an approach that leverages anomaly detection to identify unhealthy conditions using solely normal ECG data for training. Furthermore, to enhance the information available and build a robust system, we suggest considering both the time series and time-frequency domain aspects of the ECG signal. As a result, we introduce a specialized network called the Multimodal Time and Spectrogram Restoration Network (TSRNet) designed specifically for detecting anomalies in ECG signals. TSRNet falls into the category of restoration-based anomaly detection and draws inspiration from both the time series and spectrogram domains. By extracting representations from both domains, TSRNet effectively captures the comprehensive characteristics of the ECG signal. This approach enables the network to learn robust representations with superior discrimination abilities, allowing it to distinguish between normal and abnormal ECG patterns more effectively. Furthermore, we introduce a novel inference method, termed Peak-based Error, that specifically focuses on ECG peaks, a critical component in detecting abnormalities. Experimental results on the large-scale PTB-XL dataset demonstrate the effectiveness of our approach in ECG anomaly detection, while also prioritizing efficiency by minimizing the number of trainable parameters. Our code is available at https://github.com/UARK-AICV/TSRNet.
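
In the spirit of the restoration-based scoring and the Peak-based Error described above, the hedged sketch below measures reconstruction error only around detected peaks; the peak detector, window size, and aggregation are illustrative assumptions, and the restoration model itself (trained on normal ECGs) is presumed to exist elsewhere.

```python
# Hedged sketch of restoration-based ECG anomaly scoring with a peak-focused error.
# Not the paper's implementation: window size and peak detection are assumptions.
import numpy as np
from scipy.signal import find_peaks

def peak_based_error(signal: np.ndarray, restored: np.ndarray, window: int = 10) -> float:
    """Mean absolute restoration error measured only around detected peaks (e.g. R-peaks)."""
    peaks, _ = find_peaks(signal, distance=window)
    if len(peaks) == 0:  # fall back to a global error if no peaks are found
        return float(np.mean(np.abs(signal - restored)))
    errors = []
    for p in peaks:
        lo, hi = max(0, p - window), min(len(signal), p + window)
        errors.append(np.mean(np.abs(signal[lo:hi] - restored[lo:hi])))
    return float(np.mean(errors))  # higher error -> more likely an abnormal ECG
```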