DOLDA - a regularized supervised topic model for high-dimensional multi-class regression

Magnusson, Måns, Jonsson, Leif, Villani, Mattias

arXiv.org Machine Learning 

During the last decades, more and more textual data has become available, creating a growing need to statistically analyze large amounts of text. The hugely popular Latent Dirichlet Allocation (LDA) model introduced by Blei et al. (2003) is a generative probabilistic model in which each document is summarized by a set of latent semantic themes, often called topics; formally, a topic is a probability distribution over the vocabulary. An estimated LDA model is therefore a compressed latent representation of the documents. LDA is a mixed membership model: each document is a mixture of topics, and each word (token) in a document belongs to a single topic. The basic LDA model is unsupervised, i.e. the topics are learned solely from the words in the documents without access to document labels. In many situations there is other information we would like to incorporate when modeling a corpus of documents. A common example is labeled documents, such as ratings of movies together with movie descriptions, illness categories in medical journals, or the location of an identified bug together with its bug report. In these situations, one can use a so-called supervised topic model to find the semantic structure in the documents that is related to the class of interest. One of the first approaches to supervised topic models was proposed by McAuliffe and Blei (2008).
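The two LDA concepts described above, topics as probability distributions over the vocabulary and documents as mixtures of topics, can be illustrated with a minimal sketch. This is not the authors' DOLDA implementation; it uses scikit-learn's standard unsupervised LDA on a tiny toy corpus (the documents and topic count are illustrative assumptions):

```python
# Minimal unsupervised LDA sketch (not the DOLDA model from the paper).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: two loose themes (movies, bug reports), chosen for illustration.
docs = [
    "movie film actor scene plot",
    "bug crash stack trace error",
    "movie review rating film critic",
    "error report bug software crash",
]

# LDA operates on bag-of-words counts per document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)  # each row: a document's mixture over topics

# Normalizing components_ gives phi: each row is a topic, i.e. a
# probability distribution over the vocabulary.
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```

Here `theta` is the compressed latent representation of the documents that the abstract refers to; a supervised topic model additionally ties this representation to a document label.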
