Inductive Learning
Theory-guided Data Science: A New Paradigm for Scientific Discovery from Data
Karpatne, Anuj, Atluri, Gowtham, Faghmous, James, Steinbach, Michael, Banerjee, Arindam, Ganguly, Auroop, Shekhar, Shashi, Samatova, Nagiza, Kumar, Vipin
Data science models, although successful in a number of commercial domains, have had limited applicability in scientific problems involving complex physical phenomena. Theory-guided data science (TGDS) is an emerging paradigm that aims to leverage the wealth of scientific knowledge for improving the effectiveness of data science models in enabling scientific discovery. The overarching vision of TGDS is to introduce scientific consistency as an essential component for learning generalizable models. Further, by producing scientifically interpretable models, TGDS aims to advance our scientific understanding by discovering novel domain insights. Indeed, the paradigm of TGDS has started to gain prominence in a number of scientific disciplines such as turbulence modeling, material discovery, quantum chemistry, bio-medical science, bio-marker discovery, climate science, and hydrology. In this paper, we formally conceptualize the paradigm of TGDS and present a taxonomy of research themes in TGDS. We describe several approaches for integrating domain knowledge in different research themes using illustrative examples from different disciplines. We also highlight some of the promising avenues of novel research for realizing the full potential of theory-guided data science.
A random matrix analysis and improvement of semi-supervised learning for large dimensional data
This article provides an original understanding of the behavior of a class of graph-oriented semi-supervised learning algorithms in the limit of large and numerous data. It is demonstrated that the intuition at the root of these methods collapses in this limit and that, as a result, most of them become inconsistent. Corrective measures and a new data-driven parametrization scheme are proposed along with a theoretical analysis of the asymptotic performances of the resulting approach. A surprisingly close behavior between theoretical performances on Gaussian mixture models and on real datasets is also illustrated throughout the article, thereby suggesting the importance of the proposed analysis for dealing with practical data. As a result, significant performance gains are observed on practical data classification using the proposed parametrization.
Schema Independent Relational Learning
Picado, Jose, Termehchy, Arash, Fern, Alan, Ataei, Parisa
Learning novel concepts and relations from relational databases is an important problem with many applications in database systems and machine learning. Relational learning algorithms learn the definition of a new relation in terms of existing relations in the database. Nevertheless, the same data set may be represented under different schemas for various reasons, such as efficiency, data quality, and usability. Unfortunately, the output of current relational learning algorithms tends to vary quite substantially over the choice of schema, both in terms of learning accuracy and efficiency. This variation complicates their off-the-shelf application. In this paper, we introduce and formalize the property of schema independence of relational learning algorithms, and study both the theoretical and empirical dependence of existing algorithms on the common class of (de)composition schema transformations. We study both sample-based learning algorithms, which learn from sets of labeled examples, and query-based algorithms, which learn by asking queries to an oracle. We prove that current relational learning algorithms are generally not schema independent. For query-based learning algorithms we show that the (de)composition transformations influence their query complexity. We propose Castor, a sample-based relational learning algorithm that achieves schema independence by leveraging data dependencies. We support the theoretical results with an empirical study that demonstrates the schema dependence/independence of several algorithms on existing benchmark and real-world datasets under (de)compositions.
Parallelized Tensor Train Learning of Polynomial Classifiers
Chen, Zhongming, Batselier, Kim, Suykens, Johan A. K., Wong, Ngai
Pattern classification is the machine learning task of identifying to which category a new observation belongs, on the basis of a training set of observations whose category membership is known. This type of machine learning algorithm that uses a known training dataset to make predictions is called supervised learning, which has been extensively studied and has wide applications in the fields of bioinformatics [1], computer-aided diagnosis (CAD) [2], machine vision [3], speech recognition [4], handwriting recognition [5], spam detection and many others [6], [7], [8]. Usually, different kinds of learning methods use different models to generalize from training examples to novel test examples. As pointed out in [9], [10], one of the important invariants in these applications is the local structure: variables that are spatially or temporally nearby are highly correlated. Local correlations benefit extracting local features because configurations of neighboring variables can be classified into a small number of categories (e.g.
This AI nightmare bot spits out a new horror story every hour
Deep inside an MIT laboratory, an artificially intelligent bot is composing ghastly tales of nightmarish creatures and strange shrieks in the night. MIT researchers named their bot Shelley (after Frankenstein author Mary Shelley). They endowed her with an artificial mind, called a neural network, an advanced form of machine learning in which a computer learns a task by relying on training examples. In Shelley's case, MIT researchers fed her silicon brain 140,000 horror stories published by writers on Reddit's "No Sleep" forum. Endowed with this massive story bank of fright, Shelley is a program that churns out its own unique tales of the undead and soon-to-be dead.
tmulc18/Distributed-TensorFlow-Guide
This guide is a collection of distributed training examples (that can act as boilerplate code) and a tutorial of basic distributed TensorFlow. Many of the examples focus on implementing well-known distributed training schemes, such as those available in Distributed Keras which were discussed in the author's blog post. Almost all the examples can be run on a single machine with a CPU, and all the examples only use data-parallelism (i.e. The motivation for this guide stems from the current state of distributed deep learning. Deep learning papers typically demonstrate successful new architectures on some benchmark, but rarely show how these models can be trained with 1000x the data, which is usually the requirement in industry.
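To make the data-parallelism mentioned above concrete, here is a minimal sketch of one synchronous scheme in plain Python. It is not taken from the guide and does not use TensorFlow. The one-parameter model, the function names `shard_gradient` and `data_parallel_step`, and the toy data are all assumptions made for illustration.

```python
def shard_gradient(w, shard):
    """Gradient of the mean squared error of the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, learning_rate):
    """One synchronous update: each worker computes a gradient on its own shard
    (sequentially here; in a real system this would run in parallel), the
    gradients are averaged, and the shared parameter is updated once."""
    grads = [shard_gradient(w, shard) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return w - learning_rate * avg_grad


# Hypothetical toy data generated from y = 3 * x, split into two equal shards.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, learning_rate=0.01)
```

Because the shards have equal size, the averaged gradient equals the gradient over the full batch, so this loop follows the same trajectory as single-machine gradient descent. Asynchronous schemes relax that equivalence.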
Image Recognition for Fashion with Machine Learning
Can a computer automatically detect pictures of shirts, pants, dresses, and sneakers? It turns out that accurately classifying images of fashion items is surprisingly straightforward to do, given quality training data to start from. Supervised learning, in particular for classification, is a popular topic amongst artificial intelligence and machine learning enthusiasts. It's common for developers to utilize a well-known, easy-to-process dataset for their first attempts at using supervised learning. The MNIST dataset is an example of such a source, providing thousands of examples of handwritten digits that can be used for supervised learning with your machine learning algorithms. I've previously written about classifying handwritten digits with the MNIST dataset, achieving accuracies of 99% on the training set and 97% on the test set. Datasets such as these are a convenient way to hone your skills in machine learning model development with a tried and trusted data source. It's important to keep in mind that a good dataset has several features in common.
Softmax Q-Distribution Estimation for Structured Prediction: A Theoretical Interpretation for RAML
Ma, Xuezhe, Yin, Pengcheng, Liu, Jingzhou, Neubig, Graham, Hovy, Eduard
Reward augmented maximum likelihood (RAML), a simple and effective learning framework to directly optimize towards the reward function in structured prediction tasks, has led to a number of impressive empirical successes. RAML incorporates task-specific reward by performing maximum-likelihood updates on candidate outputs sampled according to an exponentiated payoff distribution, which gives higher probabilities to candidates that are close to the reference output. While RAML is notable for its simplicity, efficiency, and its impressive empirical successes, the theoretical properties of RAML, especially the behavior of the exponentiated payoff distribution, have not been examined thoroughly. In this work, we introduce softmax Q-distribution estimation, a novel theoretical interpretation of RAML, which reveals the relation between RAML and Bayesian decision theory. The softmax Q-distribution can be regarded as a smooth approximation of the Bayes decision boundary, and the Bayes decision rule is achieved by decoding with this Q-distribution. We further show that RAML is equivalent to approximately estimating the softmax Q-distribution, with the temperature $\tau$ controlling approximation error. We perform two experiments, one on synthetic data of multi-class classification and one on real data of image captioning, to demonstrate the relationship between RAML and the proposed softmax Q-distribution estimation method, verifying our theoretical analysis. Additional experiments on three structured prediction tasks with rewards defined on sequential (named entity recognition), tree-based (dependency parsing) and irregular (machine translation) structures show notable improvements over maximum likelihood baselines.
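The exponentiated payoff distribution described above can be sketched as a softmax over rewards scaled by the temperature, q(y | y*; tau) proportional to exp(r(y, y*) / tau). The snippet below is an illustrative sketch, not the authors' implementation. The candidate set, the negative Hamming reward, and the function names are assumptions chosen to keep the example small.

```python
import math
import random


def exponentiated_payoff(candidates, reference, reward, tau):
    """Return probabilities proportional to exp(reward(y, reference) / tau)."""
    scores = [reward(y, reference) / tau for y in candidates]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    return [wt / total for wt in weights]


def negative_hamming(y, reference):
    """Toy reward (an assumption): minus the number of mismatched positions."""
    return -sum(a != b for a, b in zip(y, reference))


reference = "cat"
candidates = ["cat", "cut", "dog"]  # hypothetical candidate outputs

probs = exponentiated_payoff(candidates, reference, negative_hamming, tau=1.0)
sharp_probs = exponentiated_payoff(candidates, reference, negative_hamming, tau=0.1)

# RAML-style training would apply maximum-likelihood updates to samples like these.
rng = random.Random(0)
samples = rng.choices(candidates, weights=probs, k=5)
```

Candidates closer to the reference receive higher probability, and lowering `tau` concentrates the mass on the reference itself, which is the sense in which the temperature trades smoothness against fidelity to the reward.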
AI Model Fundamentally Cracks CAPTCHAs, Scientists Say
A representation of the letter A, which can be used to crack CAPTCHAs. Scientists say they have developed a computer model that fundamentally breaks through a key test used to tell a human from a bot. You've probably passed this test hundreds of times. Text-based CAPTCHAs, a rough acronym for Completely Automated Public Turing Test To Tell Computers and Humans Apart, are groups of jumbled characters along with squiggly lines and other background noise.
Top 10 Machine Learning Algorithms for Beginners
The study of ML algorithms has gained immense traction since the Harvard Business Review article that termed 'Data Scientist' the 'Sexiest job of the 21st century'. So, for those starting out in the field of ML, we decided to do a reboot of our immensely popular Gold blog The 10 Algorithms Machine Learning Engineers need to know - albeit this post is targeted towards beginners. ML algorithms are those that can learn from data and improve from experience, without human intervention. Learning tasks may include learning the function that maps the input to the output, learning the hidden structure in unlabeled data, or 'instance-based learning', where a class label is produced for a new instance by comparing the new instance (row) to instances from the training data, which were stored in memory. 'Instance-based learning' does not create an abstraction from specific instances. Supervised learning can be explained as follows: use labeled training data to learn the mapping function from the input variables (X) to the output variable (Y).
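As a rough illustration of instance-based learning as described above, the sketch below labels a new row by copying the label of the closest stored training row (a 1-nearest-neighbor rule). The function name, the Euclidean distance, and the toy rows are assumptions for illustration and are not taken from the post.

```python
import math


def nearest_neighbor_predict(train_rows, train_labels, new_row):
    """Return the label of the stored training row closest to new_row.

    No model is fitted: the training data is simply kept in memory and
    compared against each new instance at prediction time.
    """
    best_index = min(
        range(len(train_rows)),
        key=lambda i: math.dist(train_rows[i], new_row),
    )
    return train_labels[best_index]


# Hypothetical labeled training data: input variables (X) and output labels (Y).
train_rows = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
train_labels = ["a", "a", "b", "b"]

prediction_near_a = nearest_neighbor_predict(train_rows, train_labels, (1.1, 0.9))
prediction_near_b = nearest_neighbor_predict(train_rows, train_labels, (5.2, 5.1))
```

Here the first query falls next to the rows labeled "a" and the second next to those labeled "b", so each prediction comes straight from stored instances rather than from a learned abstraction.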