Embedded Topic Models Enhanced by Wikification

Shibuya, Takashi, Utsuro, Takehito

arXiv.org Artificial Intelligence

Topic modeling analyzes a collection of documents to learn meaningful patterns of words. However, previous topic models consider only the spelling of words and do not account for homographs. In this study, we incorporate Wikipedia knowledge into a neural topic model to make it aware of named entities. We evaluate our method on two datasets: 1) news articles from the New York Times and 2) the AIDA-CoNLL dataset. Our experiments show that our method improves the generalizability of neural topic models. Moreover, we analyze the frequent terms in each topic and the temporal dependencies between topics to demonstrate that our entity-aware topic models capture the time-series development of topics well.
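As a rough illustration of the wikification step this abstract describes (not the authors' code), entity mentions can be replaced with canonical Wikipedia titles before topic modeling, so that homographs such as "Apple" the company and "apple" the fruit become distinct tokens. The mention-to-entity mapping below is invented for illustration.

```python
# Sketch: replace surface mentions with linked Wikipedia entity IDs
# ("wikification") so a downstream topic model sees disambiguated tokens.

def wikify(tokens, mention_to_entity):
    """Map each token to its linked Wikipedia entity when one is known."""
    return [mention_to_entity.get(tok, tok) for tok in tokens]

# Hypothetical entity links for illustration only.
mention_to_entity = {
    "Apple": "Apple_Inc.",
    "apple": "Apple_(fruit)",
}

doc = ["Apple", "released", "a", "new", "phone"]
print(wikify(doc, mention_to_entity))
```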


An Algorithm for Generating Gap-Fill Multiple Choice Questions of an Expert System

Sirithumgul, Pornpat, Prasertsilp, Pimpaka, Olfman, Lorne

arXiv.org Artificial Intelligence

This research proposes an artificial intelligence algorithm combining an ontology-based design, text mining, and natural language processing to automatically generate gap-fill multiple choice questions (MCQs). A simulation demonstrated the algorithm by generating gap-fill MCQs about software testing. The results revealed that, from 103 online documents as input, the algorithm automatically produced more than 16,000 valid gap-fill MCQs covering a variety of topics in the software testing domain. Finally, in the discussion section of this paper we suggest how the proposed algorithm could be applied to populate a question pool for a knowledge expert system.
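The core of gap-fill MCQ generation can be sketched in a few lines: blank out a key term in a source sentence and mix it with distractor options. This is a generic sketch, not the paper's ontology-driven pipeline; the example sentence and distractors are assumptions.

```python
import random

def make_gap_fill(sentence, key_term, distractors):
    """Blank out key_term in the sentence and shuffle the answer options."""
    stem = sentence.replace(key_term, "_____")
    options = [key_term] + list(distractors)
    random.shuffle(options)
    return {"stem": stem, "options": options, "answer": key_term}

q = make_gap_fill(
    "Regression testing re-runs existing tests after a code change.",
    "Regression",
    ["Unit", "Integration", "Acceptance"],  # assumed distractors
)
print(q["stem"])
```

In the paper's setting, the key term and distractors would instead come from the ontology and text-mining components, which is what keeps the generated questions valid at scale.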


Predefined Sparseness in Recurrent Sequence Models

Demeester, Thomas, Deleu, Johannes, Godin, Fréderic, Develder, Chris

arXiv.org Artificial Intelligence

Inducing sparseness while training neural networks has been shown to yield models with a lower memory footprint but similar effectiveness to dense models. However, sparseness is typically induced starting from a dense model, so this advantage does not hold during training. We propose techniques to enforce sparseness upfront in recurrent sequence models for NLP applications, so that training also benefits. First, in language modeling, we show how to increase hidden state sizes in recurrent layers without increasing the number of parameters, leading to more expressive models. Second, for sequence labeling, we show that word embeddings with predefined sparseness achieve performance similar to dense embeddings, at a fraction of the number of trainable parameters.
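A minimal sketch of the predefined-sparseness idea for embeddings (not the authors' implementation): each word is assigned a fixed subset of k active dimensions up front, so it stores only k trainable values instead of d, and the dense vector is reconstructed on demand. The block-cyclic assignment pattern below is an assumption for illustration.

```python
# Predefined sparseness sketch: k trainable values per word instead of d.

def sparse_embedding_slots(vocab_size, dim, k):
    """Assign each word k fixed dimensions in a block-cyclic pattern."""
    slots = {}
    for w in range(vocab_size):
        start = (w * k) % dim
        slots[w] = [(start + i) % dim for i in range(k)]
    return slots

def embed(word_id, values, slots, dim):
    """Expand a word's k trainable values into a dense dim-sized vector."""
    vec = [0.0] * dim
    for pos, v in zip(slots[word_id], values):
        vec[pos] = v
    return vec

slots = sparse_embedding_slots(vocab_size=4, dim=8, k=2)
vec = embed(1, [0.5, -0.3], slots, dim=8)
print(sum(1 for x in vec if x != 0.0))  # 2 nonzero entries out of 8
```

Because the sparsity pattern is fixed before training, only the k values per word are ever updated, which is where the parameter savings during training come from.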


DreamNLP: Novel NLP System for Clinical Report Metadata Extraction using Count Sketch Data Streaming Algorithm: Preliminary Results

Choi, Sanghyun, Ivkin, Nikita, Braverman, Vladimir, Jacobs, Michael A.

arXiv.org Machine Learning

Extracting information from electronic health records (EHRs) is a challenging task, since it requires prior knowledge of the reports and a natural language processing (NLP) algorithm. With the growing number of EHR implementations, such knowledge is increasingly difficult to obtain efficiently. We address this challenge by proposing a novel methodology for analyzing large sets of EHRs using a modified Count Sketch data streaming algorithm, termed DreamNLP. Using DreamNLP, we generate a dictionary of frequently occurring terms, or heavy hitters, in the EHRs using far less computational memory than the conventional counting approaches other NLP programs use. We demonstrate the extraction of the most important breast diagnosis features from the EHRs of a set of patients who underwent breast imaging. Based on this analysis, extracting these terms would be useful for defining features for downstream tasks such as machine learning for precision medicine.
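For readers unfamiliar with the underlying data structure, here is a textbook Count Sketch (not the DreamNLP modification): term counts are folded into a small table via hashed rows with random signs, and a frequency is estimated as the median across rows. Memory stays fixed at depth × width regardless of vocabulary size, which is what makes heavy-hitter extraction over large EHR corpora cheap. The example terms are invented.

```python
import hashlib

class CountSketch:
    """Minimal Count Sketch: estimate term frequencies in fixed memory."""

    def __init__(self, depth=5, width=256):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        # Derive a (column, sign) pair per row from a deterministic hash.
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{item}".encode()).digest()
            col = int.from_bytes(h[:4], "big") % self.width
            sign = 1 if h[4] % 2 == 0 else -1
            yield row, col, sign

    def add(self, item):
        for row, col, sign in self._hashes(item):
            self.table[row][col] += sign

    def estimate(self, item):
        vals = sorted(sign * self.table[row][col]
                      for row, col, sign in self._hashes(item))
        return vals[len(vals) // 2]  # median across rows

sketch = CountSketch()
for term in ["mass"] * 50 + ["benign"] * 3:
    sketch.add(term)
print(sketch.estimate("mass"))  # close to the true count of 50
```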


Clouds, clouds, and more clouds

#artificialintelligence

There are at least eleven kinds of clouds: cirrus, cirrocumulus, cirrostratus, altocumulus, altostratus, cumulonimbus, cumulus, nimbostratus, stratocumulus, small Cu, and stratus. But this article is not about those kinds of clouds. Of course there are other kinds of clouds, like iCloud, Google Cloud, Azure Cloud, Amazon Cloud, and the list goes on. But this article is not about those clouds either. This article is about text analytics.


Training a CNN with the same data but different labels • /r/MachineLearning

#artificialintelligence

I apologize for the ambiguous title; it was difficult to condense my question into one sentence. I have a large dataset of paintings and corresponding class labels generated from their medium. I'm not interested in the output class, but rather in the 9216-dimensional feature vector from the Pool5 layer of the network (I'm using AlexNet). When I generate the class labels from the metadata associated with each painting, I use the least frequent term, as it tends to be more telling. For example, a painting's medium metadata may be "Oil and Chalk on Paper"; currently, the least frequent term would be applied as the target label, in this case "Oil".
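The labeling heuristic described in the post can be sketched as follows: count medium terms across all paintings, then label each painting with its globally least frequent term. The stopword set and example media strings are assumptions for illustration, not the poster's actual data.

```python
from collections import Counter

STOP = {"and", "on", "of"}  # assumed stopwords to filter out

def label_by_rarest_term(media):
    """Label each medium string with its least globally frequent term."""
    term_lists = [[t for t in m.lower().split() if t not in STOP]
                  for m in media]
    counts = Counter(t for terms in term_lists for t in terms)
    return [min(terms, key=lambda t: counts[t]) for terms in term_lists]

media = ["Oil and Chalk on Paper", "Oil on Canvas", "Chalk on Paper"]
print(label_by_rarest_term(media))
```

Note that ties are broken by term order within each string, so a painting whose terms are all equally frequent simply keeps its first term as the label.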

