Maybe you are looking for CroQS: Cross-modal Query Suggestion for Text-to-Image Retrieval
Pacini, Giacomo, Carrara, Fabio, Messina, Nicola, Tonellotto, Nicola, Amato, Giuseppe, Falchi, Fabrizio
Query suggestion, a technique widely adopted in information retrieval, enhances system interactivity and the browsing experience of document collections. In cross-modal retrieval, many works have focused on retrieving relevant items from natural language queries, while few have explored query suggestion solutions. In this work, we address query suggestion in cross-modal retrieval, introducing a novel task that focuses on suggesting minimal textual modifications needed to explore visually consistent subsets of the collection, following the premise of "Maybe you are looking for". To facilitate the evaluation and development of methods, we present a tailored benchmark named CroQS. This dataset comprises initial queries, grouped result sets, and human-defined suggested queries for each group. We establish dedicated metrics to rigorously evaluate the performance of various methods on this task, measuring representativeness, cluster specificity, and similarity of the suggested queries to the original ones. Baseline methods from related fields, such as image captioning and content summarization, are adapted for this task to provide reference performance scores. Although relatively far from human performance, our experiments reveal that both LLM-based and captioning-based methods achieve competitive results on CroQS, improving the recall on cluster specificity by more than 115% and representativeness mAP by more than 52% with respect to the initial query. The dataset, the implementation of the baseline methods, and the notebooks containing our experiments are available here: paciosoft.com/CroQS-benchmark/
Adversarial Magnification to Deceive Deepfake Detection through Super Resolution
Coccomini, Davide Alessandro, Caldelli, Roberto, Amato, Giuseppe, Falchi, Fabrizio, Gennaro, Claudio
Deepfake technology is rapidly advancing, posing significant challenges to the detection of manipulated media content. In parallel, adversarial attack techniques have been developed to fool deepfake detectors and make deepfakes even harder to detect. This paper explores the application of super resolution techniques as a possible adversarial attack in deepfake detection. Through our experiments, we demonstrate that minimal changes made by these methods to the visual appearance of images can have a profound impact on the performance of deepfake detection systems. We propose a novel attack using super resolution as a quick, black-box, and effective method to camouflage fake images and/or generate false alarms on pristine images. Our results indicate that the usage of super resolution can significantly impair the accuracy of deepfake detectors, thereby highlighting the vulnerability of such systems to adversarial attacks.
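The attack described above amounts to passing an image through a super-resolution round trip so that subtle pixel changes disrupt the detector. A minimal sketch of the idea is below; the paper uses learned SR models, while this stand-in degrades and re-expands the image with block averaging and nearest-neighbour upscaling, and all names are illustrative:

```python
import numpy as np

def sr_roundtrip_attack(image, scale=2):
    """Illustrative stand-in for a super-resolution round trip:
    downscale by block averaging, then upscale back to the original
    size. A learned SR model would replace the upscaling step."""
    h, w = image.shape
    # downscale: average non-overlapping scale x scale blocks
    low = image.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))
    # upscale: repeat each low-resolution pixel back to full resolution
    return np.repeat(np.repeat(low, scale, axis=0), scale, axis=1)

rng = np.random.default_rng(0)
img = rng.random((8, 8))
adv = sr_roundtrip_attack(img)
# perturbation is visually minimal but nonzero, which is what
# can flip a detector's decision
perturbation = float(np.abs(adv - img).mean())
```

The point is that the perturbation is small and model-agnostic: the attack is black-box because it never queries the detector.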
Deepfake Detection without Deepfakes: Generalization via Synthetic Frequency Patterns Injection
Coccomini, Davide Alessandro, Caldelli, Roberto, Gennaro, Claudio, Fiameni, Giuseppe, Amato, Giuseppe, Falchi, Fabrizio
Deepfake detectors are typically trained on large sets of pristine and generated images, resulting in limited generalization capacity; they excel at identifying deepfakes created through methods encountered during training but struggle with those generated by unknown techniques. This paper introduces a learning approach aimed at significantly enhancing the generalization capabilities of deepfake detectors. Our method takes inspiration from the unique "fingerprints" that image generation processes consistently introduce into the frequency domain. These fingerprints manifest as structured and distinctly recognizable frequency patterns. We propose to train detectors using only pristine images, injecting crafted frequency patterns into a portion of them to simulate the effects of various deepfake generation techniques without being specific to any. These synthetic patterns are based on generic shapes, grids, or auras. We evaluated our approach using diverse architectures across 25 different generation methods. The models trained with our approach achieved state-of-the-art deepfake detection and demonstrated superior generalization capabilities compared with previous methods. Indeed, they are untied to any specific generation technique and can effectively identify deepfakes regardless of how they were made.
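The frequency-injection idea can be sketched in a few lines: add synthetic peaks to the Fourier spectrum of a pristine image and label the result as "fake" during training. This is only an illustrative stand-in for the paper's crafted shapes, grids, and auras; the pattern and parameter names below are hypothetical:

```python
import numpy as np

def inject_frequency_pattern(image, strength=0.05):
    """Inject a synthetic grid of peaks into the (shifted) Fourier
    spectrum of a pristine image, mimicking a generator fingerprint."""
    spec = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    mask = np.zeros((h, w))
    mask[::4, ::4] = 1.0  # regular grid of frequency peaks
    spec = spec + strength * np.abs(spec).max() * mask
    # back to the spatial domain; the injected pattern is subtle
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec)))

rng = np.random.default_rng(0)
pristine = rng.random((8, 8))
pseudo_fake = inject_frequency_pattern(pristine)  # trained with label "fake"
```

Because the injected pattern is generic rather than tied to one generator, a detector trained this way is not bound to any specific deepfake technique.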
Synaptic Plasticity Models and Bio-Inspired Unsupervised Deep Learning: A Survey
Lagani, Gabriele, Falchi, Fabrizio, Gennaro, Claudio, Amato, Giuseppe
Recently emerged technologies based on Deep Learning (DL) have achieved outstanding results on a variety of tasks in the field of Artificial Intelligence (AI). However, these technologies face several challenges related to robustness against adversarial inputs, ecological impact, and the need for huge amounts of training data. In response, researchers are devoting increasing attention to biologically grounded mechanisms, which are appealing given the impressive capabilities exhibited by biological brains. This survey explores a range of these biologically inspired models of synaptic plasticity, their application in DL scenarios, and the connections with models of plasticity in Spiking Neural Networks (SNNs). Overall, Bio-Inspired Deep Learning (BIDL) represents an exciting research direction, aiming to advance not only our current technologies but also our understanding of intelligence.
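A canonical example of the synaptic plasticity models the survey covers is Oja's rule, a Hebbian update with a decay term that keeps the weight norm bounded. The sketch below is a generic textbook formulation, not an implementation from the survey, and the names are illustrative:

```python
import numpy as np

def oja_update(w, x, lr=0.01):
    """One step of Oja's rule: Hebbian growth (lr * y * x) plus a
    decay term (-lr * y**2 * w) that normalizes the weight vector."""
    y = float(w @ x)          # postsynaptic activation
    return w + lr * y * (x - y * w)

rng = np.random.default_rng(0)
w = rng.normal(size=4)
w /= np.linalg.norm(w)
for _ in range(200):
    x = rng.normal(size=4)    # stream of input patterns
    w = oja_update(w, x)
norm = float(np.linalg.norm(w))  # stays bounded near 1
```

Unlike plain Hebbian learning, which lets weights grow without bound, the decay term makes the rule stable and drives `w` toward a principal component of the input statistics.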
Spiking Neural Networks and Bio-Inspired Supervised Deep Learning: A Survey
Lagani, Gabriele, Falchi, Fabrizio, Gennaro, Claudio, Amato, Giuseppe
Indeed, biological brains exhibit extraordinary capabilities in terms of energy efficiency, supporting advanced cognitive functions while consuming only 20W [89]. It is believed that the key to the energy-efficient computation of biological neurons lies in their particular coding paradigm based on short pulses, or spikes [61]. SNN models aim at simulating the behavior of biological neurons more realistically than traditional DNNs. As a result, SNNs are well suited for energy-efficient implementations in neuromorphic [84, 174, 186, 190, 229] or biological [92, 111, 176] hardware. This makes SNNs a promising direction toward energy-efficient DL. Unfortunately, training SNNs is not trivial, as traditional optimization based on the backpropagation algorithm (backprop) is not directly applicable [165]. In fact, the biological plausibility of backprop - the workhorse of DL - is questioned by neuroscientists [73, 113, 130, 157, 172]. Therefore, researchers have once again turned to biology for new learning solutions as alternatives to backprop. The goal was not only to address the problem of SNN training [33, 148], but also to discover novel approaches to the learning problem [77, 139, 182], and possibly more data-efficient strategies [69, 90, 105-107, 110].
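The spike-based coding paradigm mentioned above is usually introduced through the leaky integrate-and-fire (LIF) neuron: the membrane potential leaks toward rest, integrates input current, and emits a spike when it crosses a threshold. This is a standard textbook model, not code from the survey, and the parameter names are illustrative:

```python
def lif_simulate(inputs, tau=10.0, v_th=1.0, v_reset=0.0, dt=1.0):
    """Discrete-time leaky integrate-and-fire neuron. Returns the
    binary spike train produced by the input current sequence."""
    v, spikes = 0.0, []
    for i in inputs:
        v += dt * (-v / tau + i)   # leak toward rest + integrate input
        if v >= v_th:
            spikes.append(1)       # threshold crossed: emit a spike
            v = v_reset            # and reset the membrane potential
        else:
            spikes.append(0)
    return spikes

# a constant drive produces a periodic spike train
train = lif_simulate([0.3] * 20)
```

The binary, event-driven nature of this output is what neuromorphic hardware exploits for energy efficiency, and also what makes backprop hard to apply directly: the spike is a non-differentiable step.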
ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
Messina, Nicola, Stefanini, Matteo, Cornia, Marcella, Baraldi, Lorenzo, Falchi, Fabrizio, Amato, Giuseppe, Cucchiara, Rita
Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In the literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists in finding images related to a given query text or vice-versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods have proposed effective solutions to the image-text matching problem, mostly using large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we bridge the gap between effectiveness and efficiency with an ALign And DIstill Network (ALADIN). ALADIN first produces highly effective scores by aligning images and texts at a fine-grained level. Then, it learns a shared embedding space - where an efficient kNN search can be performed - by distilling the relevance scores obtained from the fine-grained alignments. We obtained remarkable results on MS-COCO, showing that our method can compete with state-of-the-art VL Transformers while being almost 90 times faster. The code for reproducing our results is available at https://github.com/mesnico/ALADIN.
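The distillation step described above can be sketched as a regression objective: push the cheap dot-product similarities of the shared embedding space toward the expensive fine-grained alignment scores. The MSE formulation and all names below are illustrative, not the paper's exact loss:

```python
import numpy as np

def distill_loss(img_emb, txt_emb, fine_scores):
    """Distillation sketch: 'coarse' scores are plain dot products in a
    shared embedding space (kNN-searchable at inference time); the loss
    pulls them toward the teacher's fine-grained alignment scores."""
    coarse = img_emb @ txt_emb.T
    return float(((coarse - fine_scores) ** 2).mean())

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))   # 4 images, 8-d shared space
txt_emb = rng.normal(size=(4, 8))   # 4 captions
fine = img_emb @ txt_emb.T          # stand-in for teacher scores
loss = distill_loss(img_emb, txt_emb, fine)
```

Once trained, only the embeddings are needed at query time, which is why retrieval reduces to an efficient kNN search instead of a full Transformer forward pass per image-text pair.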
A Leap among Entanglement and Neural Networks: A Quantum Survey
Massoli, Fabio Valerio, Vadicamo, Lucia, Amato, Giuseppe, Falchi, Fabrizio
Perhaps the modern definition of AI - as the ensemble of computer systems empowered with the ability to learn from data through statistical techniques - can be dated back to 1959. Machine Learning (ML), a subclass of AI, is a discipline that studies algorithms able to learn from experience and data to perform tasks without following explicit instructions. Often, these algorithms are based on a computational model belonging to the family of differentiable programming techniques: Neural Networks (NNs). The success of such algorithms resides in their ability to learn to achieve a specific goal [95, 121], i.e., they learn to discover hidden patterns and relations among data to fulfill the task at hand [89, 120]. Mathematically, NNs are made of a sequence of transformations, called layers, composed of affine operators and elementwise nonlinearities. The goal of learning is then to modify the transformations' parameters to fulfill a task successfully. Whenever a model comprises more than a couple of such layers, it is called a Deep Learning (DL) model or a Deep Neural Network (DNN). Thanks to their enormous representation power and the development of new technologies and training algorithms, DL models have obtained astonishing results in the last two decades, achieving superhuman performance on certain tasks [177].
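The definition of an NN as a stack of affine operators and elementwise nonlinearities translates directly into code. A minimal forward pass (with ReLU as the nonlinearity; all names are illustrative) looks like this:

```python
import numpy as np

def mlp_forward(x, layers):
    """Apply a sequence of layers, each an affine map (W @ x + b)
    followed by an elementwise nonlinearity (ReLU here)."""
    for W, b in layers:
        x = np.maximum(W @ x + b, 0.0)
    return x

# two toy layers: identity, then scale-by-2 with bias 1
layers = [(np.eye(3), np.zeros(3)), (2 * np.eye(3), np.ones(3))]
out = mlp_forward(np.array([1.0, -1.0, 2.0]), layers)  # -> [3., 1., 5.]
```

Learning then means adjusting every `W` and `b` so that the output matches a target, typically by gradient descent on a differentiable loss, which is exactly why NNs are described as differentiable programs.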
A Multi-resolution Approach to Expression Recognition in the Wild
Massoli, Fabio Valerio, Cafarelli, Donato, Amato, Giuseppe, Falchi, Fabrizio
Facial expressions play a fundamental role in human communication. Indeed, they typically reveal the real emotional status of people beyond the spoken language. Moreover, the comprehension of human affect based on visual patterns is a key ingredient for any human-machine interaction system and, for such reasons, the task of Facial Expression Recognition (FER) draws both scientific and industrial interest. In recent years, Deep Learning techniques have reached very high performance on FER by exploiting different architectures and learning paradigms. In such a context, we propose a multi-resolution approach to solve the FER task. We ground our intuition on the observation that face images are often acquired at different resolutions. Thus, directly accounting for this property while training a model can help achieve higher performance in recognizing facial expressions. To this aim, we use a ResNet-like architecture, equipped with Squeeze-and-Excitation blocks, trained on the Affect-in-the-Wild 2 dataset. Since a test set is not available, we conduct testing and model selection using only the validation set, on which we achieve more than 90% accuracy in classifying the seven expressions the dataset comprises.
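One common way to expose a model to varying acquisition resolutions during training is a degradation augmentation: randomly downscale a face crop and re-expand it to the input size. The sketch below is a generic illustration of that idea, not the paper's training pipeline; the scale factors and names are hypothetical:

```python
import numpy as np

def random_resolution(image, rng, scales=(1, 2, 4)):
    """Randomly degrade a (square) face crop: block-average at a random
    scale, then re-expand to the original size, mimicking low-res input."""
    s = int(rng.choice(scales))
    if s == 1:
        return image               # keep full resolution
    h, w = image.shape
    low = image.reshape(h // s, s, w // s, s).mean(axis=(1, 3))
    return np.repeat(np.repeat(low, s, axis=0), s, axis=1)

rng = np.random.default_rng(0)
face = rng.random((8, 8))
augmented = random_resolution(face, rng)
```

Because every batch then mixes sharp and degraded crops, the network cannot rely on fine detail alone, which is the intuition behind training at multiple resolutions.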
MOCCA: Multi-Layer One-Class Classification for Anomaly Detection
Massoli, Fabio Valerio, Falchi, Fabrizio, Kantarci, Alperen, Akti, Şeymanur, Ekenel, Hazim Kemal, Amato, Giuseppe
Anomalies are ubiquitous in all scientific fields and can express an unexpected event due to incomplete knowledge about the data distribution or an unknown process that suddenly comes into play and distorts the observations. Due to such events' rarity, it is common to train deep learning models on "normal", i.e. non-anomalous, datasets only, thus letting the neural network model the distribution beneath the input data. In this context, we propose our deep learning approach to the anomaly detection problem, named Multi-Layer One-Class Classification (MOCCA). We explicitly leverage the piece-wise nature of deep neural networks by exploiting information extracted at different depths to detect abnormal data instances. We show how combining the representations extracted from multiple layers of a model leads to higher discrimination performance than typical approaches proposed in the literature, which are based on neural networks' final output only. We propose to train the model by minimizing the $L_2$ distance between the input representation and a reference point, the anomaly-free training data centroid, at each considered layer. We conduct extensive experiments on publicly available datasets for anomaly detection, namely CIFAR10, MVTec AD, and ShanghaiTech, considering both the single-image and video-based scenarios. We show that our method achieves superior performance compared to the state-of-the-art approaches available in the literature. Moreover, we provide a model analysis to give insight into how our approach works.
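The multi-layer objective above suggests a simple scoring rule at test time: sum, over the considered layers, the squared L2 distance between a sample's representation and the normal-data centroid for that layer. This is a sketch of the scoring idea only (feature extraction and training are elided, and all names are illustrative):

```python
import numpy as np

def mocca_score(layer_feats, centroids):
    """Anomaly score: sum over layers of the squared L2 distance
    between a sample's per-layer features and the centroid computed
    on anomaly-free training data."""
    return sum(float(np.sum((f - c) ** 2))
               for f, c in zip(layer_feats, centroids))

# toy example: two layers with 4-d and 8-d features, centroids at zero
centroids = [np.zeros(4), np.zeros(8)]
normal = [np.full(4, 0.1), np.full(8, 0.1)]   # close to the centroids
anomalous = [np.full(4, 2.0), np.full(8, 2.0)]  # far from the centroids
```

Aggregating distances across depths is the key design choice: a sample that looks normal at the final layer but drifts at intermediate layers still receives a high score.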
Combining GANs and AutoEncoders for Efficient Anomaly Detection
Carrara, Fabio, Amato, Giuseppe, Brombin, Luca, Falchi, Fabrizio, Gennaro, Claudio
In this work, we propose CBiGAN -- a novel method for anomaly detection in images, where a consistency constraint is introduced as a regularization term in both the encoder and decoder of a BiGAN. Our model exhibits fairly good modeling power and reconstruction consistency capability. We evaluate the proposed method on MVTec AD -- a real-world benchmark for unsupervised anomaly detection on high-resolution images -- and compare against standard baselines and state-of-the-art approaches. Experiments show that the proposed method improves the performance of BiGAN formulations by a large margin and performs comparably to expensive state-of-the-art iterative methods while reducing the computational cost. We also observe that our model is particularly effective in texture-type anomaly detection, as it sets a new state of the art in this category. Our code is available at https://github.com/fabiocarrara/cbigan-ad/.
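The consistency constraint mentioned above pairs a BiGAN's encoder E and decoder G so that both round trips should reconstruct their input: G(E(x)) ≈ x for images and E(G(z)) ≈ z for latents. The sketch below only illustrates these residual terms; the full CBiGAN loss also includes the adversarial objective, and all names are hypothetical:

```python
import numpy as np

def consistency_terms(x, z, E, G):
    """CBiGAN-style consistency residuals (L1 here, for illustration):
    image round trip G(E(x)) vs. x and latent round trip E(G(z)) vs. z.
    At test time, the image residual can serve as the anomaly score."""
    img_term = float(np.abs(G(E(x)) - x).mean())
    lat_term = float(np.abs(E(G(z)) - z).mean())
    return img_term, lat_term

# with ideal (identity) encoder/decoder both residuals vanish
E = G = lambda v: v
x, z = np.ones(4), np.zeros(4)
terms = consistency_terms(x, z, E, G)
```

Since the model is trained on anomaly-free data only, anomalous regions reconstruct poorly, so a large image residual flags the sample as anomalous.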