In real-world applications, data often come with multiple modalities. Previous works assumed that each modality contains sufficient information for the target and can be treated with equal importance. However, different modalities are often of varying importance in real tasks; e.g., the facial feature is a weak modality and the fingerprint feature is a strong modality in ID recognition. In this paper, we point out that different modalities should be treated with different strategies and propose the Auxiliary information Regularized Machine (ARM), which works by extracting the most discriminative feature subspace of the weak modality while regularizing the

It is notable that strong modal features can lead to better performance but are more expensive; therefore, a group of serialized feature extraction methods was proposed. These methods extract weak modal features first and then gradually extract stronger modal features, improving performance while reducing the overall cost. Marcialis et al. proposed a serial fusion technique for multiple biometric modalities, extracting gait information and face information step by step; Zhang et al. addressed serialized multi-modal learning in a semi-supervised learning scenario. These methods handle strong and weak modalities independently, leaving the unsatisfactory performance on the weak modality unexplained.
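The serialized extraction idea can be sketched as a two-stage cascade: decide from the cheap weak modality when it is confident, and pay for the expensive strong modality only on ambiguous samples. This is a toy illustration under assumed score conventions, not the ARM algorithm or any cited method; the `accept`/`reject` thresholds are hypothetical.

```python
def serial_fusion(weak_score, strong_extractor, accept=0.9, reject=0.1):
    """Cascade over modalities ordered by cost.

    weak_score: confidence in [0, 1] from the cheap (weak) modality.
    strong_extractor: callable invoked only when the weak modality is
    ambiguous; returns a confidence in [0, 1] from the strong modality.
    Returns (predicted label, which modality decided).
    """
    if weak_score >= accept:          # weak modality alone is confident
        return 1, "weak"
    if weak_score <= reject:          # confidently negative: early reject
        return 0, "weak"
    strong_score = strong_extractor() # fall back to the costly modality
    return int(strong_score >= 0.5), "strong"

# Toy usage: a confident weak score skips the strong extractor entirely.
label, used = serial_fusion(0.95, strong_extractor=lambda: 0.2)
```

The cost saving comes from how rarely `strong_extractor` is called, which depends on how many samples fall between the two thresholds.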
Combining complementary information from multiple modalities is intuitively appealing for improving the performance of learning-based approaches. However, it is challenging to fully leverage different modalities due to practical issues such as varying levels of noise and conflicts between modalities. Existing methods do not jointly capture synergies between the modalities while simultaneously filtering noise and resolving conflicts on a per-sample basis. In this work, we propose a novel deep neural network based technique that multiplicatively combines information from different source modalities. The training process thus automatically focuses on information from the more reliable modalities while reducing emphasis on the less reliable ones. Furthermore, we propose an extension that multiplicatively combines not only the single-source modalities but also mixtures of source modalities, to better capture cross-modal signal correlations. We demonstrate the effectiveness of the proposed technique with empirical results on three multimodal classification tasks from different domains; the results show consistent accuracy improvements on all three tasks.
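To see why a multiplicative combination de-emphasizes unreliable modalities, consider a plain product-of-distributions sketch: a modality whose predictions are near-uniform (noisy) scales all classes roughly equally and so barely affects the result. This is a minimal illustration of the general idea, not the paper's network or training procedure.

```python
import numpy as np

def multiplicative_fusion(prob_list):
    """Combine per-modality class probabilities by elementwise product.

    An uncertain modality (near-uniform probabilities) multiplies every
    class by roughly the same factor, so confident modalities dominate.
    """
    combined = np.ones_like(prob_list[0])
    for p in prob_list:
        combined = combined * p
    return combined / combined.sum()   # renormalize to a distribution

# Toy usage: modality A is confident, modality B is nearly uniform (noisy).
a = np.array([0.8, 0.1, 0.1])
b = np.array([0.34, 0.33, 0.33])
fused = multiplicative_fusion([a, b])  # dominated by modality A
```

Contrast with additive (averaging) fusion, where a noisy modality pulls the combined distribution toward uniform with full weight.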
One of the challenges in affect recognition is accurate estimation of the emotion intensity level. This research proposes an affect intensity estimation model based on a weighted sum of classification confidence levels, the displacement of feature points, and the speed of feature point motion. The parameters of the model were calculated from data captured across multiple modalities: face, body posture, hand movement, and speech. A preliminary study compared the model's estimates with annotated intensity levels, using an emotion intensity scale ranging from 0 to 1 along the arousal dimension of the emotion space. Results indicated that the speech and hand modalities contributed significantly to improving the accuracy of emotion intensity estimation with the proposed model.
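The weighted-sum model described above can be sketched as follows. The weights here are hypothetical placeholders (the study calculates its parameters from captured data), and the inputs are assumed to be pre-normalized to [0, 1].

```python
def emotion_intensity(confidence, displacement, speed,
                      weights=(0.5, 0.3, 0.2)):
    """Weighted sum of classifier confidence, feature-point displacement,
    and feature-point speed, clipped to the [0, 1] arousal scale.

    weights: hypothetical (w_confidence, w_displacement, w_speed).
    """
    w_c, w_d, w_s = weights
    raw = w_c * confidence + w_d * displacement + w_s * speed
    return min(max(raw, 0.0), 1.0)

# Toy usage with normalized per-modality measurements.
intensity = emotion_intensity(confidence=0.9, displacement=0.6, speed=0.4)
```

In practice the three inputs would themselves be aggregated over the face, body-posture, hand, and speech channels before entering the sum.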
Mental imagery is the ability to imagine perceptual qualities that are not objectively there, such as sights and sounds, and it engages the same brain regions involved in perception. We live in an uncertain and hostile world, so the brain uses all the information at its disposal to reduce uncertainty and increase the odds of survival. One such possibility comes in the form of cross-modal perception: if information in one modality is more certain than in another, it is known to change how we perceive events in the other modality. For instance, a subtle "click" at the right time can dramatically shift our perception of whether two balls bounce away from each other or slide past each other; try this without sound first, to see what I mean.
Humans can learn and reason under substantial uncertainty in a space of infinitely many concepts, including structured relational concepts ("a scene with objects that have the same color") and ad-hoc categories defined through goals ("objects that could fall on one's head"). In contrast, standard classification benchmarks: 1) consider only a fixed set of category labels, 2) do not evaluate compositional concept learning, and 3) do not explicitly capture a notion of reasoning under uncertainty. We introduce a new few-shot, meta-learning benchmark, Compositional Reasoning Under Uncertainty (CURI), to bridge this gap. CURI evaluates different aspects of productive and systematic generalization, including disentangling, productive generalization, learning boolean operations, and variable binding. Importantly, it also defines a model-independent "compositionality gap" to evaluate the difficulty of generalizing out-of-distribution along each of these axes. Extensive evaluations across a range of modeling choices (different modalities: image, schemas, and sounds; splits; privileged auxiliary concept information; and choices of negatives) reveal substantial scope for modeling advances on the proposed task. All code and datasets will be available online.
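As a concrete illustration, a relational concept like "a scene with objects that have the same color" can be expressed as a boolean function over schema-style object descriptions. This is a minimal sketch of the idea, not CURI's actual concept grammar or data format.

```python
def same_color(scene):
    """Concept 'a scene with objects that have the same color',
    over a scene given as a list of attribute dictionaries."""
    colors = {obj["color"] for obj in scene}
    return len(colors) == 1

# A positive and a negative example of the concept.
scene_pos = [{"color": "red", "shape": "cube"},
             {"color": "red", "shape": "sphere"}]
scene_neg = [{"color": "red", "shape": "cube"},
             {"color": "blue", "shape": "sphere"}]
```

A few-shot learner in this setting must infer such a rule from a handful of positive and negative scenes, under uncertainty about which of many candidate boolean rules generated them.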