AITopics | hlda

hlda

Scalable Inference for Nested Chinese Restaurant Process Topic Models

Chen, Jianfei, Zhu, Jun, Lu, Jie, Liu, Shixia

arXiv.org Machine Learning, Feb-22-2017

Nested Chinese Restaurant Process (nCRP) topic models are powerful nonparametric Bayesian methods to extract a topic hierarchy from a given text corpus, where the hierarchical structure is automatically determined by the data. Hierarchical Latent Dirichlet Allocation (hLDA) is a popular instance of nCRP topic models. However, hLDA has only been evaluated at small scale, because the existing collapsed Gibbs sampling and instantiated weight variational inference algorithms either are not scalable or sacrifice inference quality with mean-field assumptions. Moreover, an efficient distributed implementation of the data structures, such as dynamically growing count matrices and trees, is challenging. In this paper, we propose a novel partially collapsed Gibbs sampling (PCGS) algorithm, which combines the advantages of collapsed and instantiated weight algorithms to achieve good scalability as well as high model quality. An initialization strategy is presented to further improve the model quality. Finally, we propose an efficient distributed implementation of PCGS through vectorization, pre-processing, and a careful design of the concurrent data structures and communication strategy. Empirical studies show that our algorithm is 111 times more efficient than the previous open-source implementation for hLDA, with comparable or even better model quality. Our distributed implementation can extract 1,722 topics from a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than the previous largest corpus, with 50 machines in 7 hours.
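The scale reported in this abstract can be put into per-document and per-machine terms with a little arithmetic. The sketch below uses only the rounded figures quoted above and assumes, purely for illustration, that the 7 hours are wall-clock time and that the load is spread evenly over the 50 machines; the variable names are not from the paper.

```python
# Rounded figures quoted in the abstract.
tokens = 28_000_000_000      # 28 billion tokens
documents = 131_000_000      # 131 million documents
machines = 50
hours = 7

# Average document length implied by the corpus statistics.
tokens_per_document = tokens / documents

# Assumption: even load across machines and 7 hours of wall-clock time.
machine_seconds = machines * hours * 3600
tokens_per_machine_second = tokens / machine_seconds

print(f"{tokens_per_document:.1f} tokens per document on average")
print(f"{tokens_per_machine_second:.0f} tokens per machine per second")
```

Under these assumptions the corpus averages roughly 214 tokens per document, and each machine covers on the order of 22,000 tokens per second over the whole run.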

artificial intelligence, machine learning, natural language, (19 more...)

arXiv:1702.07083

Country:

Europe (1.00)
Asia > India (0.28)
Asia > Japan (0.28)
(2 more...)

Genre: Research Report (1.00)

Industry: Consumer Products & Services > Restaurants (0.71)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.89)

A Unified Model for Unsupervised Opinion Spamming Detection Incorporating Text Generality

Xu, Yinqing (The Chinese University of Hong Kong) | Shi, Bei (The Chinese University of Hong Kong) | Tian, Wentao (The Chinese University of Hong Kong) | Lam, Wai (The Chinese University of Hong Kong)

AAAI Conferences, Jul-15-2015

Many existing methods on review spam detection considering text content merely utilize simple text features such as content similarity. We explore a novel idea of exploiting text generality for improving spam detection. Besides, apart from the task of review spam detection, although there have also been some works on identifying the review spammers (users) and the manipulated offerings (items), no previous works have attempted to solve these three tasks in a unified model. We have proposed a unified probabilistic graphical model to detect the suspicious review spams, the review spammers and the manipulated offerings in an unsupervised manner.

Unlike other forms of spamming, it is difficult to collect a large amount of gold-standard labels for reviews by means of manual effort. Thus, most of these methods [Mukherjee et al., 2013; Li et al., 2013a; Sun et al., 2013] just rely on the ad-hoc or pseudo fake or non-fake labels for model training, such as the labels annotated by the Amazon anonymous online workers [Ott et al., 2011; Li et al., 2014]. On the other hand, some unsupervised methods have been proposed to detect the individual review spammer [Mukherjee et al., 2013; Lim et al., 2010; Wang et al., 2011] and review spammer groups [Mukherjee et al., 2012]. In addition, time series pattern [Xie et al., 2012], rating distribution [Feng et al., 2012], reviewer graph [Wang et al., 2011], and reviewing burstiness [Fei et al., 2013] have also been applied to identify the review spams in an unsupervised manner.

abnormal feature, review spam, spamicity, (16 more...)

Twenty-Fourth International Joint Conference on Artificial Intelligence

Country:

Asia > China > Hong Kong (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York (0.04)
(4 more...)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Security & Privacy > Spam Filtering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Automated Non-Content Word List Generation Using hLDA

Krug, Wayne (Language Computer Corporation) | Tomlinson, Marc T. (Language Computer Corporation)

AAAI Conferences, May-19-2013

In this paper, we present a language-independent method for the automatic, unsupervised extraction of non-content words from a corpus of documents. This method permits the creation of word lists that may be used in place of traditional function word lists in various natural language processing tasks. As an example we generated lists of words from a corpus of English, Chinese, and Russian posts extracted from Wikipedia articles and Wikipedia Wikitalk discussion pages. We applied these lists to the task of authorship attribution on this corpus to compare the effectiveness of lists of words extracted with this method to expert-created function word lists and frequent word lists (a common alternative to function word lists). hLDA lists perform comparably to frequent word lists. The trials also show that corpus-derived lists tend to perform better than more generic lists, and both sets of generated lists significantly outperformed the expert lists. Additionally, we evaluated the performance of an English expert list on machine translations of our Chinese and Russian documents, showing that our method also outperforms this alternative.
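The frequent word lists used as a baseline in this abstract are simple to build. The following is a minimal sketch of that baseline only, not of the hLDA-based method; the function name `frequent_word_list` and the regex tokenizer are illustrative assumptions, and `\w+` tokenization would not segment Chinese text, which needs a proper word segmenter.

```python
import re
from collections import Counter


def frequent_word_list(documents, size):
    """Return the `size` most frequent word types in a corpus.

    Assumption: lowercased regex tokens stand in for real tokenization.
    """
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"\w+", doc.lower()))
    # most_common sorts by descending frequency.
    return [word for word, _ in counts.most_common(size)]


corpus = ["the cat sat on the mat", "the dog and the cat"]
print(frequent_word_list(corpus, 2))
```

A list built this way is derived from the corpus itself, which is the property the abstract contrasts with generic, expert-created function word lists.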

automated non-content word list generation, hlda

The Twenty-Sixth International FLAIRS Conference

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.60)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.53)

Transfer Topic Modeling with Ease and Scalability

Kang, Jeon-Hyung, Ma, Jun, Liu, Yan

arXiv.org Machine Learning, Jan-26-2013

The increasing volume of short texts generated on social media sites, such as Twitter or Facebook, creates a great demand for effective and efficient topic modeling approaches. While latent Dirichlet allocation (LDA) can be applied, it is not optimal due to its weakness in handling short texts with fast-changing topics and scalability concerns. In this paper, we propose a transfer learning approach that utilizes abundant labeled documents from other domains (such as Yahoo! News or Wikipedia) to improve topic modeling, with better model fitting and result interpretation. Specifically, we develop the Transfer Hierarchical LDA (thLDA) model, which incorporates the label information from other domains via informative priors. In addition, we develop a parallel implementation of our model for large-scale applications. We demonstrate the effectiveness of our thLDA model on both a microblogging dataset and standard text collections, including the AP and RCV1 datasets.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv:1301.5686

Country:

Asia (0.68)
North America > United States > California > Los Angeles County > Los Angeles (0.28)

Genre: Research Report (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (0.68)
Government > Military (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.47)

The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies

Blei, David M., Griffiths, Thomas L., Jordan, Michael I.

arXiv.org Machine Learning, Aug-27-2009

We present the nested Chinese restaurant process (nCRP), a stochastic process which assigns probability distributions to infinitely-deep, infinitely-branching trees. We show how this stochastic process can be used as a prior distribution in a Bayesian nonparametric model of document collections. Specifically, we present an application to information retrieval in which documents are modeled as paths down a random tree, and the preferential attachment dynamics of the nCRP lead to clustering of documents according to sharing of topics at multiple levels of abstraction. Given a corpus of documents, a posterior inference algorithm finds an approximation to a posterior distribution over trees, topics and allocations of words to levels of the tree. We demonstrate this algorithm on collections of scientific abstracts from several journals. This model exemplifies a recent trend in statistical machine learning: the use of Bayesian nonparametric methods to infer distributions on flexible data structures.
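The idea of documents as paths down a random tree with preferential attachment can be illustrated by drawing paths from the nCRP prior. This is a minimal sketch of the prior only, with no words, topics, or posterior inference; it assumes a tree truncated to a fixed depth and a single concentration parameter `gamma`, and the function and variable names are illustrative rather than taken from the paper.

```python
import random
from collections import defaultdict


def sample_ncrp_path(child_counts, depth, gamma, rng):
    """Draw one root-to-leaf path from an nCRP truncated at `depth` levels.

    child_counts maps a node (the tuple of child indices leading to it) to a
    list holding how many earlier documents chose each of its children.
    """
    path = ()
    for _ in range(depth):
        counts = child_counts[path]
        total = sum(counts) + gamma
        # Existing child k has probability counts[k] / total;
        # a brand-new child has probability gamma / total.
        r = rng.random() * total
        choice = len(counts)  # default: open a new child
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                choice = k
                break
        if choice == len(counts):
            counts.append(0)
        counts[choice] += 1
        path = path + (choice,)
    return path


def sample_tree(num_docs, depth=3, gamma=1.0, seed=0):
    """Assign each of `num_docs` documents a path; return paths and counts."""
    rng = random.Random(seed)
    child_counts = defaultdict(list)
    paths = [sample_ncrp_path(child_counts, depth, gamma, rng)
             for _ in range(num_docs)]
    return paths, child_counts


paths, counts = sample_tree(num_docs=20)
print(len(set(paths)), "distinct paths among", len(paths), "documents")
```

Because popular children are chosen in proportion to how many documents already passed through them, documents tend to share the upper levels of their paths, which is the clustering behavior the abstract describes. Smaller values of `gamma` give bushier sharing and fewer branches.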

artificial intelligence, machine learning, natural language, (18 more...)

arXiv:0710.0845

Country: North America > United States > California (0.46)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Consumer Products & Services > Restaurants (0.86)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)
