topic vector


RankSum: An unsupervised extractive text summarization based on rank fusion

Joshi, A., Fidalgo, E., Alegre, E., Alaiz-Rodriguez, R.

arXiv.org Artificial Intelligence

In this paper, we propose RankSum, an approach for extractive text summarization of single documents based on the rank fusion of four multi-dimensional sentence features extracted for each sentence: topic information, semantic content, significant keywords, and position. RankSum obtains the sentence saliency rankings corresponding to each feature in an unsupervised way, followed by a weighted fusion of the four scores to rank the sentences according to their significance. The scores are generated in a completely unsupervised way; a labeled document set is required only to learn the fusion weights. Since we found that the fusion weights generalize to other datasets, we consider RankSum an unsupervised approach. To determine topic rank, we employ probabilistic topic models, whereas semantic information is captured using sentence embeddings. To derive rankings from sentence embeddings, we utilize Siamese networks to produce abstractive sentence representations and then formulate a novel strategy to arrange them in their order of importance. A graph-based strategy is applied to find the significant keywords and the related sentence rankings in the document. We also formulate a sentence novelty measure based on bigrams, trigrams, and sentence embeddings to eliminate redundant sentences from the summary. The ranks computed for each feature are finally fused to obtain the final score for each sentence in the document. We evaluate our approach on the publicly available summarization datasets CNN/DailyMail and DUC 2002. Experimental results show that our approach outperforms existing state-of-the-art summarization methods.
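
The fusion step itself is simple enough to sketch. Below is a minimal, illustrative take on weighted rank fusion in Python; the four feature extractors are not reproduced, the scores are made up, and none of the names come from the authors' code.

import numpy as np

def fuse_rankings(feature_scores, weights):
    # feature_scores: (n_features, n_sentences) saliency scores,
    # one row per feature (topic, semantic, keywords, position).
    # weights: (n_features,) learned fusion weights.
    # Convert each feature's scores into ranks (0 = most salient) ...
    ranks = np.argsort(np.argsort(-feature_scores, axis=1), axis=1)
    # ... then fuse by weighted sum and order sentences by fused rank.
    fused = weights @ ranks
    return np.argsort(fused)  # sentence indices, most salient first

scores = np.array([[0.9, 0.2, 0.5],   # topic
                   [0.7, 0.4, 0.6],   # semantic
                   [0.8, 0.1, 0.3],   # keywords
                   [1.0, 0.5, 0.2]])  # position
weights = np.array([0.3, 0.3, 0.2, 0.2])
print(fuse_rankings(scores, weights))  # -> [0 2 1]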


How to perform topic modeling with Top2Vec

#artificialintelligence

Topic modeling is a problem in natural language processing that has many real-world applications. Being able to discover topics within large sections of text helps us understand text data in greater detail. For many years, Latent Dirichlet Allocation (LDA) has been the most commonly used algorithm for topic modeling. The algorithm was first introduced in 2003 and treats topics as probability distributions for the occurrence of different words. If you want to see an example of LDA in action, you should check out my article below where I performed LDA on a fake news classification dataset.
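
For readers who want to try it, a minimal usage sketch with the top2vec Python package follows; the calls mirror the package's commonly documented API, but treat exact signatures and defaults as assumptions rather than a reference, and load_my_corpus is a hypothetical helper.

from top2vec import Top2Vec

# docs: a list of raw strings; a real run needs a corpus large enough
# for the clustering step to find dense regions (thousands of documents).
docs = load_my_corpus()  # hypothetical helper returning a list of strings

model = Top2Vec(documents=docs, speed="learn", workers=4)
print(model.get_num_topics())                      # discovered automatically
topic_words, word_scores, topic_nums = model.get_topics()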


Top2Vec: Distributed Representations of Topics

Angelov, Dimo

arXiv.org Machine Learning

Topic modeling is used for discovering latent semantic structure, usually referred to as topics, in a large collection of documents. The most widely used methods are Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis. Despite their popularity, they have several weaknesses. In order to achieve optimal results, they often require the number of topics to be known, custom stop-word lists, stemming, and lemmatization. Additionally, these methods rely on bag-of-words representations of documents, which ignore the ordering and semantics of words. Distributed representations of documents and words have gained popularity due to their ability to capture the semantics of words and documents. We present top2vec, which leverages joint document and word semantic embedding to find topic vectors. This model does not require stop-word lists, stemming, or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors, with the distance between them representing semantic similarity. Our experiments demonstrate that top2vec finds topics that are significantly more informative and representative of the corpus they are trained on than those of probabilistic generative models.
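
The pipeline the abstract describes compresses to a few steps. In the sketch below, randomly generated vectors stand in for a trained joint embedding, and scikit-learn's k-means stands in for the HDBSCAN clustering that top2vec actually uses; it illustrates the idea of a topic vector as a cluster centroid, not the library's implementation.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_vecs = rng.standard_normal((500, 50))    # stand-in jointly embedded documents
word_vecs = rng.standard_normal((2000, 50))  # stand-in jointly embedded words

# Topic vectors: centroids of dense document clusters.
k = 10  # top2vec infers this automatically; fixed here because k-means needs it
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(doc_vecs)
topic_vecs = np.stack([doc_vecs[labels == t].mean(axis=0) for t in range(k)])

# Topic words: the word vectors nearest each topic vector (cosine similarity).
def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

sims = unit(topic_vecs) @ unit(word_vecs).T        # (k, n_words)
top_word_ids = np.argsort(-sims, axis=1)[:, :10]   # ten nearest words per topic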


Replicated Softmax: an Undirected Topic Model

Hinton, Geoffrey E., Salakhutdinov, Russ R.

Neural Information Processing Systems

We show how to model documents as bags of words using a family of two-layer, undirected graphical models. Each member of the family has the same number of binary hidden units but a different number of softmax visible units. All of the softmax units in all of the models in the family share the same weights to the binary hidden units. We describe efficient inference and learning procedures for such a family. Each member of the family models the probability distribution of documents of a specific length as a product of topic-specific distributions rather than as a mixture, and this gives much better generalization than Latent Dirichlet Allocation for modeling the log probabilities of held-out documents.
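
The model boils down to a restricted Boltzmann machine whose visible layer is a single softmax over the vocabulary, replicated once per word, with the hidden bias scaled by the document length. A compact NumPy sketch of contrastive-divergence (CD-1) training follows; hyperparameters and names are illustrative, not the authors' code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ReplicatedSoftmaxRBM:
    def __init__(self, n_vocab, n_hidden, lr=0.01, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_vocab, n_hidden))
        self.b_vis = np.zeros(n_vocab)   # word biases
        self.b_hid = np.zeros(n_hidden)  # hidden (topic) biases
        self.lr = lr

    def hidden_probs(self, v):
        # v: (n_docs, n_vocab) word-count rows; the document length D
        # scales the hidden bias -- the "replication" in the model.
        D = v.sum(axis=1, keepdims=True)
        return sigmoid(v @ self.W + D * self.b_hid)

    def sample_visible(self, h, D):
        # One softmax over the vocabulary, sampled D times per document.
        logits = h @ self.W.T + self.b_vis
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return np.stack([self.rng.multinomial(int(d), pi)
                         for d, pi in zip(D.ravel(), p)]).astype(float)

    def cd1_step(self, v0):
        # One contrastive-divergence step on a batch of count vectors.
        D = v0.sum(axis=1, keepdims=True)
        ph0 = self.hidden_probs(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
        v1 = self.sample_visible(h0, D)
        ph1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - v1.T @ ph1) / n
        self.b_vis += self.lr * (v0 - v1).mean(axis=0)
        self.b_hid += self.lr * (ph0 - ph1).mean(axis=0)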


Inductive Document Network Embedding with Topic-Word Attention

Brochier, Robin, Guille, Adrien, Velcin, Julien

arXiv.org Machine Learning

Document network embedding aims at learning representations for a structured text corpus, i.e., when documents are linked to each other. Recent algorithms extend network embedding approaches by incorporating the text content associated with the nodes in their formulations. In most cases, it is hard to interpret the learned representations. Moreover, little importance is given to generalization to new documents that are not observed within the network. In this paper, we propose an interpretable and inductive document network embedding method. We introduce a novel mechanism, the Topic-Word Attention (TWA), that generates document representations based on the interplay between word and topic representations. We train these word and topic vectors through our general model, Inductive Document Network Embedding (IDNE), by leveraging the connections in the document network. Quantitative evaluations show that our approach achieves state-of-the-art performance on various networks, and we qualitatively show that our model produces meaningful and interpretable representations of the words, topics, and documents.
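
One plausible reading of a topic-word attention mechanism, sketched in NumPy: each topic attends over a document's word vectors, and the per-topic summaries are pooled into a document vector. This is an illustration of the general idea, not the exact IDNE formulation.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def twa_document_vector(word_vecs, topic_vecs):
    # word_vecs: (n_words, d) vectors of one document's words.
    # topic_vecs: (n_topics, d) trainable topic vectors.
    scores = topic_vecs @ word_vecs.T     # (n_topics, n_words) affinities
    attn = softmax(scores, axis=1)        # each topic attends over the words
    per_topic = attn @ word_vecs          # (n_topics, d) topic summaries
    return per_topic.mean(axis=0)         # pooled document representation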


Topic-aware chatbot using Recurrent Neural Networks and Nonnegative Matrix Factorization

Guo, Yuchen, Hanoian, Nicholas, Lin, Zhexiao, Liskij, Nicholas, Lyu, Hanbaek, Needell, Deanna, Qu, Jiahao, Sojico, Henry, Wang, Yuliang, Xiong, Zhe, Zou, Zhenhong

arXiv.org Machine Learning

After learning topic vectors from an auxiliary text corpus via NMF, the decoder is trained so that it is more likely to sample response words from the most correlated topic vectors. One of the main advantages of our architecture is that the user can easily switch the NMF-learned topic vectors so that the chatbot obtains the desired topic-awareness. We demonstrate our model by training on a single conversational dataset which is then augmented with topic matrices learned from different auxiliary datasets. We show that our topic-aware chatbot not only outperforms its non-topic counterpart, but also that each topic-aware model qualitatively and contextually gives the most relevant answer depending on the topic of the question. Another area where deep learning algorithms have been successfully applied is sequence learning, which aims at understanding the structure of sequential data such as language, musical notes, and videos. One example of an application of deep learning in language modeling is conversational chatbots. A chatbot is a program that conducts a conversation with a user by simulating one side of it. Chatbots receive inputs from a user one message, or question, at a time, and then form a response that is sent back to the user. One of the most widely used machine learning techniques for sequence learning is the Recurrent Neural Network (RNN).
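
The topic-vector half of the recipe is easy to reproduce with scikit-learn. The sketch below learns nonnegative topic vectors from a toy auxiliary corpus; the corpus and component count are made up, and the RNN decoder is not shown.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

aux_corpus = [
    "the team won the match in extra time",
    "the striker scored a late goal",
    "the senate passed the budget bill",
    "parliament debated the new tax law",
]

X = TfidfVectorizer().fit_transform(aux_corpus)   # (n_docs, n_vocab)
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
doc_topics = nmf.fit_transform(X)                 # (n_docs, n_topics)
topic_vectors = nmf.components_                   # (n_topics, n_vocab)
# Each row of topic_vectors is a nonnegative weighting over the vocabulary;
# swapping in a different auxiliary corpus swaps the topic matrix, and with
# it the chatbot's topic-awareness, as the abstract describes.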


Convergence Rates of Latent Topic Models Under Relaxed Identifiability Conditions

Wang, Yining

arXiv.org Machine Learning

In this paper, we study the frequentist convergence rate of the Latent Dirichlet Allocation (Blei et al., 2003) topic model. We show that the maximum likelihood estimator converges to one of the finitely many equivalent parameters in the Wasserstein distance at a rate of n^{-1/4}, without assuming separability or non-degeneracy of the underlying topics and/or the existence of more than three words per document, thus generalizing the previous works of Anandkumar et al. (2012, 2014) from an information-theoretic perspective. We also show that the n^{-1/4} convergence rate is optimal in the worst case.
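
In symbols, the headline claim can be written as follows; this is a paraphrase of the abstract, not the paper's exact theorem statement, where \hat{\theta}_n denotes the MLE from n documents and \Theta^* the finite set of parameters equivalent to the truth:

\[
  \min_{\theta^* \in \Theta^*} W\bigl(\hat{\theta}_n, \theta^*\bigr) = O_P\bigl(n^{-1/4}\bigr),
\]

with W the Wasserstein distance, and a matching worst-case lower bound showing that the n^{-1/4} rate cannot be improved uniformly over the model class.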


Introducing our Hybrid lda2vec Algorithm

#artificialintelligence

Similar to word2vec's skip-gram negative-sampling (SGNS) algorithm, we'll start by trying to discriminate pairs of (context j, word i) that appear in the corpus from those randomly sampled from a 'negative' pool of words and contexts.
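
For concreteness, the SGNS objective being described looks like this in NumPy; in lda2vec the context vector is itself the sum of a word vector and a document vector built from topic vectors, but here it is just a generic vector.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(context, word, negatives):
    # context, word: (d,) vectors for an observed (context, word) pair.
    # negatives: (k, d) vectors for k randomly sampled negative words.
    pos = -np.log(sigmoid(context @ word))               # pull the true pair together
    neg = -np.log(sigmoid(-negatives @ context)).sum()   # push sampled negatives apart
    return pos + neg  # lower is better; minimized during training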


Relating Romanized Comments to News Articles by Inferring Multi-Glyphic Topical Correspondence

Tholpadi, Goutham (Indian Institute of Science, Bangalore) | Das, Mrinal Kanti (Indian Institute of Science, Bangalore) | Bansal, Trapit (Indian Institute of Science, Bangalore) | Bhattacharyya, Chiranjib (Indian Institute of Science, Bangalore)

AAAI Conferences

Commenting is a popular facility provided by news sites. Analyzing such user-generated content has recently attracted research interest. However, in multilingual societies such as India, this analysis is hard for several reasons: (1) There are more than 20 official languages, but linguistic resources are available mainly for Hindi. It is observed that people frequently use romanized text, as it is easy and quick to type on an English keyboard, resulting in multi-glyphic comments, where the texts are in the same language but in different scripts. Such romanized texts remain almost unexplored in machine learning. (2) In many cases, comments are made on a specific part of the article rather than the topic of the entire article. Off-the-shelf methods such as correspondence LDA are insufficient to model such relationships between articles and comments. In this paper, we extend the notion of correspondence to model multi-lingual, multi-script, and inter-lingual topics in a unified probabilistic model called the Multi-glyphic Correspondence Topic Model (MCTM). Using several metrics, we verify our approach and show that it improves over the state-of-the-art.