In this paper we introduce vSTS, a new dataset for measuring the textual similarity of sentences using multimodal information. The dataset comprises images along with their respective textual captions. We describe the dataset both quantitatively and qualitatively, and argue that it is a valid gold standard for evaluating automatic multimodal textual similarity systems. We also describe initial experiments that combine the multimodal information.
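One simple way to combine the modalities, purely as an illustrative late-fusion baseline and not the paper's method, is to mix a caption-based similarity score with an image-based one using a single weight (the scores and the `alpha` value below are invented for the example):

```python
# Hypothetical late-fusion baseline for multimodal similarity:
# combine a textual and a visual similarity score with a mixing weight.
def fused_similarity(text_sim, image_sim, alpha=0.7):
    """Weighted combination of caption similarity and image similarity.

    alpha controls how much the textual score dominates; alpha=1.0
    reduces to a text-only system, alpha=0.0 to an image-only one.
    """
    return alpha * text_sim + (1 - alpha) * image_sim

# Toy scores: captions look similar (0.8), images less so (0.4).
print(fused_similarity(0.8, 0.4))
```

In practice the two input scores would come from separate text and image similarity models, with `alpha` tuned on held-out data.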
The use of semantics in supervised text classification can improve its effectiveness, especially in specific domains. Most state-of-the-art works use concepts as an alternative to words, transforming the classical bag of words (BOW) into a bag of concepts (BOC). This transformation is carried out through a conceptualization task. Furthermore, the resulting BOC can be enriched with related concepts drawn from semantic resources, which may further enhance classification effectiveness. This paper focuses on two strategies for the semantic enrichment of conceptualized text representations: the first is based on a semantic kernel method, while the second is based on an enriching-vectors method. These two semantic enrichment strategies are evaluated through experiments in the medical domain, using Rocchio as the supervised classification method, the UMLS ontology, and the Ohsumed corpus.
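The BOW-to-BOC conversion and the enriching-vectors idea can be sketched as follows. This is a minimal illustration with a toy word-to-concept mapping and toy relatedness weights standing in for a real resource such as UMLS; all identifiers and weights are invented:

```python
# Hypothetical sketch of BOW -> BOC conversion and semantic enrichment.
# WORD_TO_CONCEPT and RELATED are toy stand-ins for a real ontology.
from collections import Counter

# Conceptualization dictionary: surface word -> concept ID.
WORD_TO_CONCEPT = {
    "heart": "C_HEART",
    "cardiac": "C_HEART",        # synonyms collapse to one concept
    "attack": "C_ATTACK",
    "infarction": "C_INFARCTION",
}

# Relatedness table: concept -> list of (related concept, weight).
RELATED = {
    "C_HEART": [("C_CARDIOVASCULAR_SYS", 0.5)],
    "C_INFARCTION": [("C_HEART", 0.7)],
}

def bag_of_concepts(tokens):
    """Conceptualization step: map tokens to concept counts (BOW -> BOC)."""
    return Counter(WORD_TO_CONCEPT[t] for t in tokens if t in WORD_TO_CONCEPT)

def enrich(boc):
    """Enriching-vectors step: add weighted counts for related concepts."""
    enriched = Counter(boc)
    for concept, count in boc.items():
        for related, weight in RELATED.get(concept, []):
            enriched[related] += count * weight
    return enriched

boc = bag_of_concepts(["heart", "attack", "cardiac"])
print(enrich(boc))
```

A real system would replace the dictionaries with lookups against the ontology, and the enriched vectors would then feed a classifier such as Rocchio.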
This paper proposes an effective approach to modeling the emotional space of words in order to infer their Sense Sentiment Similarity (SSS). SSS reflects the distance between words with respect to their senses and underlying sentiments. We propose a probabilistic approach built on a hidden emotional model in which the basic human emotions are treated as hidden variables. This model predicts a vector of emotions for each sense of a word, from which the sense sentiment similarity is then inferred. The effectiveness of the proposed approach is investigated in two Natural Language Processing tasks: Indirect yes/no Question Answer Pairs Inference and Sentiment Orientation Prediction.
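Once each sense has a predicted emotion vector, a natural way to score sense sentiment similarity is to compare the two vectors directly. The sketch below uses cosine similarity over invented emotion vectors (the sense labels, emotion inventory, and all values are illustrative, not the paper's model output):

```python
# Illustrative sketch: compare per-sense emotion vectors to get a
# sense sentiment similarity score. All values below are invented.
import math

# Toy emotion vectors over (joy, sadness, anger, fear) for word senses.
EMOTIONS = {
    "bright/shining": [0.70, 0.10, 0.10, 0.10],
    "bright/clever":  [0.60, 0.10, 0.10, 0.20],
    "dark/gloomy":    [0.05, 0.60, 0.15, 0.20],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sense_sentiment_similarity(sense1, sense2):
    """SSS as similarity between the senses' emotion vectors."""
    return cosine(EMOTIONS[sense1], EMOTIONS[sense2])

print(sense_sentiment_similarity("bright/shining", "bright/clever"))
print(sense_sentiment_similarity("bright/shining", "dark/gloomy"))
```

Two positive senses of "bright" land close together, while the gloomy sense of "dark" scores much lower, which is the behavior SSS is meant to capture.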
In this article we study and propose several novel solutions to the task of measuring semantic similarity between two short texts. The proposed solutions are based on the probabilistic method of Latent Dirichlet Allocation (LDA) and on the algebraic method of Latent Semantic Analysis (LSA). Both LDA and LSA are fully automated methods for discovering latent topics or concepts from large collections of documents. We propose a novel word-to-word similarity measure based on LDA, as well as several text-to-text similarity measures, and compare them with similar, known measures based on LSA. Experiments and results are presented on two data sets: the Microsoft Research Paraphrase corpus and the User Language Paraphrase corpus. We find the novel LDA-based word-to-word similarity measure to be highly promising.
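One common family of LDA-based word-to-word measures represents each word by its distribution over topics and compares those distributions. The sketch below assumes the topic-word matrix has already been estimated; the two-topic matrix, the vocabulary, and the choice of Hellinger-based similarity are illustrative stand-ins, not the article's specific measure:

```python
# Minimal sketch of an LDA-style word-to-word similarity, assuming the
# topic-word distributions are already estimated. The matrix below is a
# toy stand-in for a trained model, not real LDA output.
import numpy as np

VOCAB = ["car", "engine", "dog", "cat"]
# phi[k, v] = P(word v | topic k) for K = 2 toy topics (rows sum to 1).
phi = np.array([
    [0.45, 0.45, 0.05, 0.05],   # "vehicles" topic
    [0.05, 0.05, 0.45, 0.45],   # "animals" topic
])
topic_prior = np.array([0.5, 0.5])  # P(topic), assumed uniform here

def word_topic_dist(word):
    """P(topic | word) via Bayes' rule from the topic-word matrix."""
    v = VOCAB.index(word)
    unnorm = phi[:, v] * topic_prior
    return unnorm / unnorm.sum()

def word_similarity(w1, w2):
    """Similarity as 1 - Hellinger distance between topic distributions."""
    p, q = word_topic_dist(w1), word_topic_dist(w2)
    hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    return 1.0 - hellinger

print(word_similarity("car", "engine"))   # same dominant topic -> high
print(word_similarity("car", "dog"))      # different topics -> low
```

Text-to-text measures are typically built on top of such a word-level measure, e.g. by greedily aligning words across the two texts and averaging their pairwise similarities.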