text processing


The One Minute AI #13 - Bing Spell Check

#artificialintelligence

Welcome to a new series of short articles I am presenting about Artificial Intelligence, specifically the Azure AI stack. The objective is that you will learn about an Azure-based AI service in no more than one minute, and thus quickly get familiar with the entire stack over a short period of time. These are going to be short, easily digestible articles, so let's get started! What is Bing Spell Check? Bing Spell Check is Microsoft's third-generation web-based spell checker that doesn't rely on dictionaries. Instead, it uses machine learning and statistical machine translation to dynamically train a highly contextual algorithm, allowing you to perform spell checks and contextual grammar checks on text. You can also include capabilities such as slang and informal language recognition, and homophone correction.
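As a rough sketch, the service is called over REST; the assembly below follows the public v7 API's endpoint, header, and parameter names (the subscription key is a placeholder, and the request is built but not sent):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical assembly of a Bing Spell Check v7 request.
# "mode=proof" enables the contextual grammar/casing checks in addition
# to plain spell checking; "mode=spell" is the lighter-weight option.
endpoint = "https://api.bing.microsoft.com/v7.0/spellcheck"
params = {"mkt": "en-US", "mode": "proof"}
url = f"{endpoint}?{urlencode(params)}"

# The text to check goes in a form-encoded POST body.
body = urlencode({"text": "Bill Gatas was the famoust founder of Microsoft"}).encode()
headers = {
    "Ocp-Apim-Subscription-Key": "<your-subscription-key>",  # placeholder
    "Content-Type": "application/x-www-form-urlencoded",
}
req = Request(url, data=body, headers=headers, method="POST")
print(req.full_url)
```

The JSON response lists flagged tokens with suggested replacements and confidence scores.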


Exploratory Data Analysis for Natural Language Processing: A Complete Guide to Python Tools

#artificialintelligence

Exploratory data analysis is one of the most important parts of any machine learning workflow, and Natural Language Processing is no different. But which tools should you choose to explore and visualize text data efficiently? In this article, we will discuss and implement nearly all the major techniques that you can use to understand your text data, giving you a complete(ish) tour of the Python tools that get the job done. We will use a million news headlines dataset from Kaggle. Now, we can take a look at the data. The dataset contains only two columns: the published date and the news headline. For simplicity, I will be exploring the first 10,000 rows of this dataset.
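The first EDA steps on such a dataset are usually simple corpus statistics. A minimal sketch, using a handful of stand-in headlines rather than the actual Kaggle file:

```python
from collections import Counter

# Stand-in headlines; in the article these come from the Kaggle
# news-headlines dataset loaded with pandas.
headlines = [
    "aba decides against community broadcasting licence",
    "act fire witnesses must be aware of defamation",
    "air nz staff in aust strike for pay rise",
    "air nz strike to affect australian travellers",
]

# Headline length distribution (in words) and most frequent tokens:
# the usual starting points before n-grams, word clouds, and topic models.
lengths = [len(h.split()) for h in headlines]
tokens = Counter(tok for h in headlines for tok in h.split())

print("avg length:", sum(lengths) / len(lengths))
print("top tokens:", tokens.most_common(3))
```

From here the article layers on stopword filtering, n-gram counts, and visualization libraries.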


Python most popular programming language in India

#artificialintelligence

New Delhi: When it comes to programming languages in India, Python is most popular among students for its role in Artificial Intelligence (AI) applications, data science, Machine Learning (ML) and data analytics, US-based online education company Coursera has said. Python dominated the top 10 list with courses like 'Programming for Everybody', 'Python Data Structures', 'Python for Data Science and AI' and more. Python is also easy to get started with, offers a lot of flexibility and is versatile. "Its open-source nature makes it easy to learn. A large number of libraries for tasks like web development, text processing and calculations add to its appeal," the report said.


Multi-View Learning of Word Embeddings via CCA

Neural Information Processing Systems

Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-of-the-art performance on named entity recognition (NER) and chunking problems.
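The core primitive is canonical correlation analysis between two views of the data. A numpy-only CCA sketch on toy two-view data (this illustrates CCA itself, not the paper's fast spectral algorithm over large unlabeled corpora):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, k = 200, 8, 8, 3

# Two views driven by a shared low-dimensional latent signal,
# loosely standing in for a word's left- and right-context views.
Z = rng.normal(size=(n, k))
X = Z @ rng.normal(size=(k, p)) + 0.1 * rng.normal(size=(n, p))
Y = Z @ rng.normal(size=(k, q)) + 0.1 * rng.normal(size=(n, q))

X = X - X.mean(0)
Y = Y - Y.mean(0)
reg = 1e-6  # small ridge for numerical stability
Cxx = X.T @ X / n + reg * np.eye(p)
Cyy = Y.T @ Y / n + reg * np.eye(q)
Cxy = X.T @ Y / n

# Whiten each view with its Cholesky factor; the singular values of the
# whitened cross-covariance are the canonical correlations.
Lx = np.linalg.cholesky(Cxx)
Ly = np.linalg.cholesky(Cyy)
M = np.linalg.solve(Lx, np.linalg.solve(Ly, Cxy.T).T)
U, s, Vt = np.linalg.svd(M)
print("canonical correlations:", np.round(s[:4], 3))
```

The first k correlations come out near 1 (the shared signal), after which they drop to sampling noise.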


Skip-Thought Vectors

Neural Information Processing Systems

We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets.
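The vocabulary-expansion step can be sketched as a linear regression: fit a map W from a large pretrained word space (e.g. word2vec) into the encoder's embedding space using the words both models share, then project unseen words through W. Toy random vectors stand in for real embeddings, with the map made exactly linear so the fit is recoverable:

```python
import numpy as np

rng = np.random.default_rng(1)
d_w2v, d_rnn, n_shared, n_unseen = 10, 6, 50, 5

# Pretend ground-truth map and a shared vocabulary seen by both models.
W_true = rng.normal(size=(d_w2v, d_rnn))
V_w2v_shared = rng.normal(size=(n_shared, d_w2v))
V_rnn_shared = V_w2v_shared @ W_true

# Least-squares fit of the map on the shared vocabulary.
W, *_ = np.linalg.lstsq(V_w2v_shared, V_rnn_shared, rcond=None)

# Any word with a word2vec vector now gets an encoder-space embedding,
# even if the encoder never saw it during training.
V_rnn_unseen = rng.normal(size=(n_unseen, d_w2v)) @ W
print(V_rnn_unseen.shape)
```

This is how a 20k-word training vocabulary is stretched to roughly a million words at encoding time.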


Spatial Latent Dirichlet Allocation

Neural Information Processing Systems

In recent years, the language model Latent Dirichlet Allocation (LDA), which clusters co-occurring words into topics, has been widely applied in the computer vision field. However, many of these applications have difficulty with modeling the spatial and temporal structure among visual words, since LDA assumes that a document is a "bag-of-words". It is also critical to properly design "words" and "documents" when using a language model to solve vision problems. In this paper, we propose a topic model, Spatial Latent Dirichlet Allocation (SLDA), which better encodes the spatial structure among visual words that is essential for solving many vision problems. The spatial information is not encoded in the values of the visual words but in the design of documents.
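The "document design" idea can be illustrated with a toy helper: rather than treating a whole image as one document, visual words are grouped into local documents by their image coordinates. A fixed grid is used here for simplicity; SLDA itself assigns words to region documents probabilistically:

```python
# Group visual words into grid-cell "documents" by position.
# (Illustrative only: function name, cell size, and the hard grid
# assignment are choices made for this sketch, not the SLDA model.)
def grid_documents(visual_words, cell=50):
    """visual_words: iterable of (word_id, x, y) -> {cell: [word_ids]}."""
    docs = {}
    for word_id, x, y in visual_words:
        key = (x // cell, y // cell)            # which grid cell the word falls in
        docs.setdefault(key, []).append(word_id)
    return docs

words = [(3, 10, 12), (7, 30, 40), (3, 120, 15), (9, 130, 60)]
print(grid_documents(words))
```

Co-occurrence is then counted within these local documents, so nearby visual words, not just words from the same image, end up sharing topics.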


Hierarchically Supervised Latent Dirichlet Allocation

Neural Information Processing Systems

We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bag-of-word data are also of interest. We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not.
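The hierarchical label structure being exploited is simple to state: a document carrying a label implicitly carries every ancestor of that label (a specific diagnosis code implies its parent categories). A minimal sketch with hypothetical label names:

```python
# Toy is-a hierarchy, child -> parent (labels are illustrative, not
# actual diagnosis codes or product categories from the paper's data).
parent = {"heart_failure": "heart_disease", "heart_disease": "circulatory"}

def expand_labels(labels, parent):
    """Close a label set under the hierarchy by adding all ancestors."""
    expanded = set(labels)
    for lab in labels:
        while lab in parent:          # walk up to the root
            lab = parent[lab]
            expanded.add(lab)
    return expanded

print(expand_labels({"heart_failure"}, parent))
```

Training on this expanded label set is what lets the model share statistical strength across levels of the hierarchy.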


Spectral Hashing

Neural Information Processing Systems

Semantic hashing seeks compact binary codes of datapoints so that the Hamming distance between codewords correlates with semantic similarity. Hinton et al. used a clever implementation of autoencoders to find such codes. In this paper, we show that the problem of finding the best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP-hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on the convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint.
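The spectral relaxation can be sketched in a few lines of numpy: build a graph Laplacian over the data, take its smallest non-trivial eigenvectors, and threshold them at zero to get binary codes (the paper's out-of-sample extension via Laplace-Beltrami eigenfunctions is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D clusters: the first code bit should split them.
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2))])

# Gaussian affinity matrix and normalized graph Laplacian.
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
W = np.exp(-d2 / 2.0)
D = W.sum(1)
L = np.eye(len(X)) - W / np.sqrt(np.outer(D, D))

# Threshold the smallest non-trivial eigenvectors at zero to get 3 code bits
# (the constant eigenvector at eigenvalue ~0 carries no information).
vals, vecs = np.linalg.eigh(L)
codes = (vecs[:, 1:4] > 0).astype(int)
print(codes.shape)
```

On this toy data the first bit acts like a graph cut: all points in one cluster get one value and all points in the other cluster get the opposite.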


Word Features for Latent Dirichlet Allocation

Neural Information Processing Systems

We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model.