AITopics

1204.6703

Country:

Asia (1.00)
North America > United States > California (0.28)

Genre: Research Report (0.50)

Industry:

Leisure & Entertainment > Sports (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Law (0.92)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.84)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)

arXiv.org Machine LearningJan-15-2013

A Nested HDP for Hierarchical Topic Models

Paisley, John, Wang, Chong, Blei, David, Jordan, Michael I.

We develop a nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. The nHDP is a generalization of the nested Chinese restaurant process (nCRP) that allows each word to follow its own path to a topic node according to a document-specific distribution on a shared tree. This alleviates the rigid, single-path formulation of the nCRP, allowing a document to more easily express thematic borrowings as a random effect. We demonstrate our algorithm on 1.8 million documents from The New York Times.

artificial intelligence, natural language, ncrp, (13 more...)

1301.357

Country:

Europe (1.00)
Asia > Middle East > Jordan (0.17)
Asia > Middle East > Lebanon (0.15)

Genre: Research Report (0.40)

Industry:

Government > Military (0.97)
Banking & Finance (0.96)
Leisure & Entertainment > Sports > Football (0.30)

Technology: Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.41)

Anandkumar, Anima, Foster, Dean P., Hsu, Daniel J., Kakade, Sham M., Liu, Yi-kai

A Spectral Algorithm for Latent Dirichlet Allocation

Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by \emph{multiple} latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (\emph{i.e.}, third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on $k \times k$ matrices, where $k$ is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space.

artificial intelligence, machine learning, natural language, (18 more...)

Country:

North America > United States > Pennsylvania (0.28)
North America > United States > California (0.28)

Industry:

Education (0.48)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.95)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)

Krafft, Peter, Moore, Juston, Desmarais, Bruce, Wallach, Hanna M.

Topic-Partitioned Multinetwork Embeddings

We introduce a new Bayesian admixture model intended for exploratory analysis ofcommunication networks--specifically, the discovery and visualization of topic-specific subnetworks in email data sets. Our model produces principled visualizations ofemail networks, i.e., visualizations that have precise mathematical interpretations in terms of our model and its relationship to the observed data. We validate our modeling assumptions by demonstrating that our model achieves better link prediction performance than three state-of-the-art network models and exhibits topic coherence comparable to that of latent Dirichlet allocation. We showcase our model's ability to discover and visualize topic-specific communication patternsusing a new email data set: the New Hanover County email network. We provide an extensive analysis of these communication patterns, leading us to recommend our model for any exploratory analysis of email networks or other similarly-structured communication data. Finally, we advocate for principled visualization asa primary objective in the development of new network models.

artificial intelligence, email network, natural language, (18 more...)

Country:

North America > United States > Massachusetts (0.28)
North America > United States > Virginia > Hanover County (0.25)
North America > United States > North Carolina > New Hanover County (0.25)

Industry:

Information Technology (0.68)
Government (0.47)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.35)

Monte Carlo Methods for Maximum Margin Supervised Topic Models

Jiang, Qixia, Zhu, Jun, Sun, Maosong, Xing, Eric P.

An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihood-based supervised topic models, of which posterior inference can be carried out using the Bayes' rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limited the choice of inference algorithms to be based on variational approximation with strict mean field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.

constraint, machine learning, natural language, (18 more...)

Country:

North America > United States (0.46)
Asia (0.28)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.89)

Zhang, Xianxing, Carin, Lawrence

Joint Modeling of a Matrix with Associated Text via Latent Binary Features

A new methodology is developed for joint analysis of a matrix and accompanying documents, with the documents associated with the matrix rows/columns. The documents are modeled with a focused topic model, inferring interpretable latent binary features for each document. A new matrix decomposition is developed, with latent binary features associated with the rows/columns, and with imposition of a low-rank constraint. The matrix decomposition and topic model are coupled by sharing the latent binary feature vectors associated with each. The model is applied to roll-call data, with the associated documents defined by the legislation. Advantages of the proposed model are demonstrated for prediction of votes on a new piece of legislation, based only on the observed text of legislation. The coupling of the text and legislation is also shown to yield insight into the properties of the matrix decomposition for roll-call data.

artificial intelligence, machine learning, natural language, (14 more...)

Country: North America > United States (0.93)

Genre: Research Report > New Finding (0.46)

Industry:

Law > Statutes (1.00)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.46)

Fukumasu, Kosuke, Eguchi, Koji, Xing, Eric P.

Symmetric Correspondence Topic Models for Multilingual Text Analysis

Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. We experimented with two multilingual comparable datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models.

artificial intelligence, natural language, pivot language, (17 more...)

Country:

Europe (1.00)
North America > United States > Pennsylvania (0.46)
Asia > Japan > Honshū > Kansai (0.14)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)

arXiv.org Artificial IntelligenceDec-28-2012

Discovering Basic Emotion Sets via Semantic Clustering on a Twitter Corpus

Bann, Eugene Yuta

A plethora of words are used to describe the spectrum of human emotions, but how many emotions are there really, and how do they interact? Over the past few decades, several theories of emotion have been proposed, each based around the existence of a set of 'basic emotions', and each supported by an extensive variety of research including studies in facial expression, ethology, neurology and physiology. Here we present research based on a theory that people transmit their understanding of emotions through the language they use surrounding emotion keywords. Using a labelled corpus of over 21,000 tweets, six of the basic emotion sets proposed in existing literature were analysed using Latent Semantic Clustering (LSC), evaluating the distinctiveness of the semantic meaning attached to the emotional label. We hypothesise that the more distinct the language is used to express a certain emotion, then the more distinct the perception (including proprioception) of that emotion is, and thus more 'basic'. This allows us to select the dimensions best representing the entire spectrum of emotion. We find that Ekman's set, arguably the most frequently used for classifying emotions, is in fact the most semantically distinct overall. Next, taking all analysed (that is, previously proposed) emotion terms into account, we determine the optimal semantically irreducible basic emotion set using an iterative LSC algorithm. Our newly-derived set (Accepting, Ashamed, Contempt, Interested, Joyful, Pleased, Sleepy, Stressed) generates a 6.1% increase in distinctiveness over Ekman's set (Angry, Disgusted, Joyful, Sad, Scared). We also demonstrate how using LSC data can help visualise emotions. We introduce the concept of an Emotion Profile and briefly analyse compound emotions both visually and mathematically.

data mining, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

1212.6527

Country:

Europe (0.92)
Asia (0.92)
North America > United States > California (0.45)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Information Technology > Services (0.92)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(5 more...)

arXiv.org Machine LearningDec-19-2012

A Practical Algorithm for Topic Modeling with Provable Guarantees

Arora, Sanjeev, Ge, Rong, Halpern, Yoni, Mimno, David, Moitra, Ankur, Sontag, David, Wu, Yichen, Zhu, Michael

Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model inference have been based on a maximum likelihood objective. Efficient algorithms exist that approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for topic model inference that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.

artificial intelligence, machine learning, natural language, (16 more...)

1212.4777

Country:

North America > United States (1.00)
Asia (1.00)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Sports (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Government > Regional Government > Asia Government (1.00)
Energy (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.86)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.76)

arXiv.org Machine LearningDec-12-2012

Advances in Boosting (Invited Talk)

Schapire, Robert E.

Boosting is a general method of generating many simple classification rules and combining them into a single, highly accurate rule. In this talk, I will review the AdaBoost boosting algorithm and some of its underlying theory, and then look at how this theory has helped us to face some of the challenges of applying AdaBoost in two domains: In the first of these, we used boosting for predicting and modeling the uncertainty of prices in complicated, interacting auctions. The second application was to the classification of caller utterances in a telephone spoken-dialogue system where we faced two challenges: the need to incorporate prior knowledge to compensate for initially insufficient data; and a later need to filter the large stream of unlabeled examples being collected to select the ones whose labels are likely to be most informative.

adaboost, machine learning, natural language, (17 more...)

1301.0599

Country: Asia (0.14)

Genre: Research Report (0.86)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.70)