Multi-modal Co-learning for Earth Observation: Enhancing single-modality models via modality collaboration

Mena, Francisco, Ienco, Dino, Dantas, Cassio F., Interdonato, Roberto, Dengel, Andreas

arXiv.org Artificial Intelligence

Multi-modal co-learning is emerging as an effective paradigm in machine learning, enabling models to collaboratively learn from different modalities to enhance single-modality predictions. Earth Observation (EO) represents a quintessential domain for multi-modal data analysis, wherein diverse remote sensors collect data to sense our planet. This unprecedented volume of data introduces novel challenges. Specifically, access to the same sensor modalities at both training and inference stages becomes increasingly complex due to real-world constraints affecting remote sensing platforms. In this context, multi-modal co-learning presents a promising strategy to leverage the vast amount of sensor-derived data available at the training stage to improve single-modality models for inference-time deployment. Most current research efforts focus on designing customized solutions for either particular downstream tasks or specific modalities available at the inference stage. To address this, we propose a novel multi-modal co-learning framework capable of generalizing across various tasks without targeting a specific modality for inference. Our approach combines contrastive and modality-discriminative learning to guide single-modality models to structure their internal manifold into modality-shared and modality-specific information. We evaluate our framework on four EO benchmarks spanning classification and regression tasks across different sensor modalities, where only one of the modalities available during training is accessible at inference time. Our results demonstrate consistent predictive improvements over state-of-the-art approaches from the recent machine learning and computer vision literature, as well as EO-specific methods. These findings validate our framework in single-modality inference scenarios across a diverse range of EO applications.
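The combination of contrastive alignment with a split into modality-shared and modality-specific information can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names (`contrastive_loss`, `split_manifold`), the InfoNCE-style formulation, and the convention of splitting an embedding into shared and specific halves are all assumptions for demonstration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: each anchor (e.g. an optical embedding)
    should be most similar to its paired positive (e.g. the radar
    embedding of the same scene) among all candidates in the batch."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / len(anchors)

def split_manifold(z):
    """Illustrative convention: first half of the embedding holds
    modality-shared information, second half modality-specific."""
    half = len(z) // 2
    return z[:half], z[half:]
```

Well-aligned paired embeddings yield a lower contrastive loss than mismatched pairs; a modality-discriminative term (not shown) would then act only on the specific part of the representation.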


Advancing Vision-based Human Action Recognition: Exploring Vision-Language CLIP Model for Generalisation in Domain-Independent Tasks

Shandilya, Utkarsh, Kappan, Marsha Mariya, Jain, Sanyam, Sharma, Vijeta

arXiv.org Artificial Intelligence

Human action recognition plays a critical role in healthcare and medicine, supporting applications such as patient behavior monitoring, fall detection, surgical robot supervision, and procedural skill assessment. While traditional models like CNNs and RNNs have achieved moderate success, they often struggle to generalize across diverse and complex actions. Recent advancements in vision-language models, especially the transformer-based CLIP model, offer promising capabilities for generalizing action recognition from video data. In this work, we evaluate CLIP on the UCF-101 dataset and systematically analyze its performance under three masking strategies: (1) percentage-based and shape-based black masking at 10%, 30%, and 50%, (2) feature-specific masking to suppress bias-inducing elements, and (3) isolation masking that retains only class-specific regions. Our results reveal that CLIP exhibits inconsistent behavior and frequent misclassifications, particularly when essential visual cues are obscured. To overcome these limitations, we propose incorporating class-specific noise, learned via a custom loss function, to reinforce attention to class-defining features. This enhancement improves classification accuracy and model confidence while reducing bias. We conclude with a discussion on the challenges of applying such models in clinical domains and outline directions for future work to improve generalizability across domain-independent healthcare scenarios.
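Percentage-based black masking, the first strategy evaluated above, can be sketched as follows. This is a hedged illustration: the centered-square shape, the function name `percentage_mask`, and the grayscale nested-list image format are assumptions for demonstration, not details from the paper.

```python
def percentage_mask(image, fraction):
    """Black out roughly `fraction` of the pixels with a centered
    square patch (one shape-based variant of percentage masking)."""
    h, w = len(image), len(image[0])
    # side length of a square whose area is `fraction` of the image
    side = max(1, round((fraction * h * w) ** 0.5))
    top, left = (h - side) // 2, (w - side) // 2
    out = [row[:] for row in image]  # leave the input untouched
    for r in range(top, min(h, top + side)):
        for c in range(left, min(w, left + side)):
            out[r][c] = 0
    return out
```

Running the masked frames through the recognizer and comparing accuracy against the unmasked baseline is then what exposes which visual cues the model actually relies on.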


Disentangling and Generating Modalities for Recommendation in Missing Modality Scenarios

Kim, Jiwan, Kang, Hongseok, Kim, Sein, Kim, Kibum, Park, Chanyoung

arXiv.org Artificial Intelligence

Multi-modal recommender systems (MRSs) have achieved notable success in improving personalization by leveraging diverse modalities such as images, text, and audio. However, two key challenges remain insufficiently addressed: (1) insufficient consideration of missing-modality scenarios and (2) neglect of the unique characteristics of modality features. These challenges result in significant performance degradation in realistic situations where modalities are missing. To address these issues, we propose Disentangling and Generating Modality Recommender (DGMRec), a novel framework tailored for missing-modality scenarios. DGMRec disentangles modality features into general and specific modality features from an information-based perspective, enabling richer representations for recommendation. Building on this, it generates missing modality features by integrating aligned features from other modalities and leveraging user modality preferences. Extensive experiments show that DGMRec consistently outperforms state-of-the-art MRSs in challenging scenarios, including missing modalities and new-item settings, as well as diverse missing ratios and varying levels of missing modalities. Moreover, DGMRec's generation-based approach enables cross-modal retrieval, a task existing MRSs cannot perform, highlighting its adaptability and potential for real-world applications. Our code is available at https://github.com/ptkjw1997/DGMRec.
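The generation step, imputing a missing modality from the aligned features of the available ones weighted by user modality preference, might look roughly like this sketch. The function name `generate_missing` and the simple weighted average are illustrative assumptions; DGMRec's actual generator is learned, not a fixed average.

```python
def generate_missing(available, prefs):
    """Impute a missing modality feature as a preference-weighted
    average of aligned features from the modalities that are present.

    available: list of aligned feature vectors (one per present modality)
    prefs:     user preference weight for each available modality
    """
    total = sum(prefs) or 1.0
    weights = [p / total for p in prefs]
    dim = len(available[0])
    return [sum(w * v[d] for w, v in zip(weights, available))
            for d in range(dim)]
```

For a user who attends mostly to images, the imputed vector sits closer to the image feature than to the text feature, which is the intuition behind preference-guided generation.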


Multi-Modality Collaborative Learning for Sentiment Analysis

Wang, Shanmin, Liu, Chengguang, Liu, Qingshan

arXiv.org Artificial Intelligence

Multimodal sentiment analysis (MSA) identifies individuals' sentiment states in videos by integrating visual, audio, and text modalities. Despite progress in existing methods, the inherent modality heterogeneity limits the effective capture of interactive sentiment features across modalities. In this paper, by introducing a Multi-Modality Collaborative Learning (MMCL) framework, we facilitate cross-modal interactions and capture enhanced and complementary features from modality-common and modality-specific representations, respectively. Specifically, we design a parameter-free decoupling module that separates uni-modal representations into modality-common and modality-specific components through semantics assessment of cross-modal elements. For modality-specific representations, inspired by the act-reward mechanism in reinforcement learning, we design policy models to adaptively mine complementary sentiment features under the guidance of a joint reward. For modality-common representations, intra-modal attention is employed to highlight crucial components, which play enhanced roles across modalities. Experimental results on four databases, covering comparisons with prior methods, effectiveness verification of each module, and assessment of the complementary features, demonstrate that MMCL successfully learns collaborative features across modalities and significantly improves performance. The code is available at https://github.com/smwanghhh/MMCL.
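One common parameter-free way to decouple a feature into common and specific parts is projection onto a shared reference direction, keeping the residual as modality-specific. This sketch is an assumption for illustration; the paper's module relies on semantics assessment of cross-modal elements, which need not reduce to this projection.

```python
def decouple(feature, shared_dir):
    """Split `feature` into a component along `shared_dir`
    (modality-common) and the orthogonal residual (modality-specific).
    No learned parameters are involved, only a projection."""
    dot = sum(f * s for f, s in zip(feature, shared_dir))
    norm2 = sum(s * s for s in shared_dir)
    coef = dot / norm2
    common = [coef * s for s in shared_dir]
    specific = [f - c for f, c in zip(feature, common)]
    return common, specific
```

By construction the two parts sum back to the original feature, so no information is discarded by the split; the downstream modules then treat each part differently.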


Reliable Feature Selection for Adversarially Robust Cyber-Attack Detection

Vitorino, João, Silva, Miguel, Maia, Eva, Praça, Isabel

arXiv.org Artificial Intelligence

The growing cybersecurity threats make it essential to train Machine Learning (ML) models for network traffic analysis on high-quality data, free of noise and missing values. By selecting the most relevant features for cyber-attack detection, it is possible to improve both the robustness and the computational efficiency of the models used in a cybersecurity system. This work presents a feature selection and consensus process that combines multiple methods and applies them to several network datasets. Two different feature sets were selected and used to train multiple ML models with regular and adversarial training. Finally, an adversarial evasion robustness benchmark was performed to analyze the reliability of the different feature sets and their impact on the susceptibility of the models to adversarial examples. By using an improved dataset with more data diversity, selecting the best time-related features and a more specific feature set, and performing adversarial training, the ML models achieved better adversarially robust generalization. The robustness of the models was significantly improved without affecting their generalization to regular traffic flows, without increasing false alarms, and without requiring excessive computational resources, enabling reliable detection of suspicious activity and perturbed traffic flows in enterprise computer networks.
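The consensus step, combining the outputs of several feature selection methods, can be sketched as a simple vote. The function name `consensus_features` and the top-k voting rule are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def consensus_features(rankings, top_k, min_votes):
    """Each selection method contributes its top-k features;
    keep the features chosen by at least `min_votes` methods.

    rankings: list of per-method feature rankings (best first)
    """
    votes = Counter()
    for ranking in rankings:
        votes.update(ranking[:top_k])
    return sorted(f for f, v in votes.items() if v >= min_votes)
```

Features that survive the vote are, by definition, those that multiple independent criteria agree on, which is the reliability argument behind a consensus process.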


Deep Learning or classical Machine Learning -- which one to use for your project?

#artificialintelligence

During the last decade, Deep Learning has received a lot of attention around the globe. As Andrew Ng famously put it: "Deep Learning is a superpower. With it, you can make a computer see, synthesise novel art, translate languages, render a medical diagnosis, or build pieces of a car that can drive itself. If that isn't a superpower, I don't know what is." As with any superpower, one needs to choose carefully when to use it.


An Artificial Intelligence Rant: Neural Networks Are Not Magic, They're Code

#artificialintelligence

I was reading yet another document about artificial intelligence (AI). The introduction covered the basics and the history of the subject. The authors mentioned expert systems and the real flaws of that tactic. Then they said that, luckily, there was an alternative called "machine learning." Yet another case of people assuming that anything older than they are cannot belong to the same category as the things they know.


How AI *Understands* Images in Simple Terms

#artificialintelligence

This article aims to explain one of the most used artificial intelligence models in the world. I will try to make it very simple, so anyone can understand how it works. AI surrounds our daily lives, and it will only become more present, so you need to understand how it works, where we are at, and what's to come. The more you learn about AI, the more you will realize that it is not as advanced as most think, due to its narrow intelligence, yet it has powerful applications for individuals and companies. Knowing how it works will help you better understand the possible applications and limitations, and communicate better with your tech employees and colleagues.


Face Anonymization Pipeline in Pytorch

#artificialintelligence

Protecting data privacy is critical to preserving customer trust and is also gaining increasing attention from policy makers. Staying ahead of these expectations requires continual improvements to AI toolchains. Anonymizing image data is particularly challenging without badly degrading the quality of the image samples. We developed the capability to anonymize images while preserving the image distribution, giving us an excellent way to maintain the anonymity of the persons in the images while still performing data augmentation tasks. Our approach is based on the paper, "DeepPrivacy: A Generative Adversarial Network for Face Anonymization," published in 2019 at the International Symposium on Visual Computing.
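The overall pipeline shape, locating the face region and then replacing its contents, can be illustrated with a toy stand-in. Here simple block pixelation takes the place of the DeepPrivacy GAN generator; the function name `pixelate_region` and the nested-list grayscale format are assumptions for demonstration, not the actual pipeline code.

```python
def pixelate_region(image, box, block=2):
    """Replace the region `box` = (top, left, height, width) with
    block-averaged values, a toy stand-in for the GAN-generated face.
    The rest of the image is left untouched, mirroring a pipeline that
    only rewrites the detected face region."""
    top, left, h, w = box
    out = [row[:] for row in image]
    for r0 in range(top, top + h, block):
        for c0 in range(left, left + w, block):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, top + h))
                     for c in range(c0, min(c0 + block, left + w))]
            avg = sum(out[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = avg
    return out
```

The advantage of a generative replacement over this kind of pixelation, as the DeepPrivacy paper argues, is that the output stays on the natural-image distribution, so downstream data augmentation still works.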


Introduction to Computer Vision

#artificialintelligence

Computer vision is a field of AI that focuses on giving computers the ability to see and interpret the world around them in the same way that humans do. Computer vision involves teaching computers to observe the physical world, analyze data, and extract insights from visual inputs. It is one of the most promising areas of research in artificial intelligence and computer science, and it offers great benefits to businesses today. A core building block is image processing: altering one image in order to produce a new image with improved characteristics. The image might be resized, its brightness and contrast adjusted, or it might be cropped, blurred, or transformed in any number of other digital ways.
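The brightness and contrast adjustment mentioned above is a linear point operation, new_pixel = contrast * old_pixel + brightness, clamped to the valid pixel range. A minimal sketch on a grayscale image stored as nested lists (the function name `adjust` is illustrative):

```python
def adjust(image, contrast=1.0, brightness=0):
    """Apply the linear point operation
    new = clamp(contrast * old + brightness, 0, 255)
    to every pixel of a grayscale image."""
    return [[max(0, min(255, round(contrast * p + brightness)))
             for p in row] for row in image]
```

Contrast values above 1 spread the pixel values apart, brightness shifts them uniformly, and the clamp keeps every result inside the 8-bit range.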