 Uncertainty


Key Algorithms and Statistical Models for Aspiring Data Scientists

@machinelearnbot

As a data scientist who has been in the profession for several years now, I am often approached for career advice or guidance in course selection related to machine learning by students and career switchers on LinkedIn and Quora. Some questions revolve around educational paths and program selection, but many questions focus on what sort of algorithms or models are common in data science today. With a glut of algorithms from which to choose, it's hard to know where to start. Courses may include algorithms that aren't typically used in industry today, and courses may exclude very useful methods that aren't trending at the moment. Software-based programs may exclude important statistical concepts, and mathematically-based programs may skip over some of the key topics in algorithm design. I've put together a short guide for aspiring data scientists, particularly focused on statistical models and machine learning models (supervised and unsupervised); many of these topics are covered in textbooks, graduate-level statistics courses, data science bootcamps, and other training resources (some of which are included in the reference section of the article).


A Channel-based Exact Inference Algorithm for Bayesian Networks

arXiv.org Artificial Intelligence

URL: http://www.cs.ru.nl/B.Jacobs This paper describes a new algorithm for exact Bayesian inference that is based on a recently proposed compositional semantics of Bayesian networks in terms of channels. The paper concentrates on the ideas behind this algorithm, involving a linearisation ('stretching') of the Bayesian network, followed by a combination of forward state transformation and backward predicate transformation, while evidence is accumulated along the way. The performance of a prototype implementation of the algorithm in Python is briefly compared to a standard implementation (pgmpy): first results show competitive performance.
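
The two transformations named in the abstract can be illustrated on a single conditional probability table. The following is only a minimal sketch, not the paper's algorithm: it assumes a channel is a plain dictionary mapping each input to a distribution over outputs, and the function names (`push_forward`, `pull_back`, `validity`, `condition`) are hypothetical names chosen for this illustration.

```python
from typing import Dict, Hashable

State = Dict[Hashable, float]      # distribution: outcome -> probability
Predicate = Dict[Hashable, float]  # evidence: outcome -> value in [0, 1]
Channel = Dict[Hashable, State]    # conditional table: input -> distribution over outputs


def push_forward(channel: Channel, state: State) -> State:
    """Forward state transformation: marginal over outputs given a prior on inputs."""
    out: State = {}
    for x, px in state.items():
        for y, pyx in channel[x].items():
            out[y] = out.get(y, 0.0) + px * pyx
    return out


def pull_back(channel: Channel, predicate: Predicate) -> Predicate:
    """Backward predicate transformation: turn evidence on outputs into evidence on inputs."""
    return {
        x: sum(pyx * predicate.get(y, 0.0) for y, pyx in dist.items())
        for x, dist in channel.items()
    }


def validity(state: State, predicate: Predicate) -> float:
    """Expected value of a predicate in a state."""
    return sum(p * predicate.get(x, 0.0) for x, p in state.items())


def condition(state: State, predicate: Predicate) -> State:
    """Bayesian update of a state with a predicate (normalised pointwise product)."""
    v = validity(state, predicate)
    if v == 0.0:
        raise ValueError("evidence has probability zero in this state")
    return {x: p * predicate.get(x, 0.0) / v for x, p in state.items()}


if __name__ == "__main__":
    # Toy network (made up for illustration): weather -> grass.
    prior = {"rain": 0.2, "dry": 0.8}
    grass = {
        "rain": {"wet": 0.9, "not_wet": 0.1},
        "dry": {"wet": 0.1, "not_wet": 0.9},
    }
    wet = {"wet": 1.0, "not_wet": 0.0}
    # Posterior over the weather after observing wet grass.
    print(condition(prior, pull_back(grass, wet)))
```

In this toy setting, evaluating the evidence after pushing the prior forward gives the same number as evaluating the pulled-back evidence against the prior, which is what makes it possible to mix forward and backward steps along a chain of tables.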


SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

Journal of Artificial Intelligence Research

The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to the simplicity of its design, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data. It is also featured in a number of different software packages -- from open source to commercial. In this paper, marking the fifteen year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE, its applications, and also identify the next set of challenges to extend SMOTE for Big Data problems.
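
The abstract does not restate how SMOTE works, so here is a minimal sketch of its core idea: each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours. The function names, the default `k=5`, and the `seed` parameter are assumptions made for this illustration; it is not the API of any particular software package.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def nearest_neighbours(minority: List[Point], index: int, k: int) -> List[int]:
    """Indices of the k minority points closest to minority[index] (excluding itself)."""
    target = minority[index]
    others = [i for i in range(len(minority)) if i != index]
    others.sort(key=lambda i: math.dist(target, minority[i]))
    return others[:k]


def smote(minority: List[Point], n_synthetic: int, k: int = 5, seed: int = 0) -> List[List[float]]:
    """Generate n_synthetic points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))                          # pick a minority sample
        j = rng.choice(nearest_neighbours(minority, i, k))        # pick one of its neighbours
        gap = rng.random()                                        # position along the segment
        synthetic.append([a + gap * (b - a) for a, b in zip(minority[i], minority[j])])
    return synthetic


if __name__ == "__main__":
    minority_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote(minority_class, n_synthetic=3, k=2):
        print(point)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class rather than duplicating existing rows.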


The Statistical Model for Ticker, an Adaptive Single-Switch Text-Entry Method for Visually Impaired Users

arXiv.org Artificial Intelligence

Abstract--This paper presents the statistical model for Ticker [1], a novel probabilistic stereophonic single-switch text entry method for visually-impaired users with motor disabilities who rely on single-switch scanning systems to communicate. All terminology and notation are defined in [1]. In Figure 1(a) a typical composite audio sequence that can be presented to the user is shown, where the composite sequence consists of two repetitions of the alphabet. In Ticker, the user selects one letter at a time when listening to such a sequence. In the shown example, the user can click twice per letter. The second repetition occurs in a different order than the first, which allows one to infer the intentional letter selection more accurately. The system does not explicitly make any selection after a click is received; instead the system accumulates evidence. After one or more clicks are received, the system internally updates the posterior word probabilities. It will then proceed to play the composite sequence again for the next letter. When the posterior probability of any word in a predefined dictionary is above a certain threshold, that word is selected.
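
The evidence-accumulation loop described above can be sketched as a Bayesian update over a dictionary of words. This is a simplified illustration, not Ticker's actual model: it assumes the click-timing model has already been reduced to a per-letter likelihood for the current letter position, and the function names, the `floor` value, and the 0.9 threshold are hypothetical choices.

```python
from typing import Dict, Optional


def update_word_posterior(
    word_probs: Dict[str, float],
    letter_likelihoods: Dict[str, float],
    position: int,
    floor: float = 1e-6,
) -> Dict[str, float]:
    """Multiply each word's probability by the likelihood of its letter at `position`.

    `letter_likelihoods` maps a letter to the (assumed, precomputed) likelihood of the
    observed clicks given that the user intended that letter. Letters missing from the
    map, and words shorter than `position + 1`, receive the small `floor` likelihood.
    """
    unnormalised = {}
    for word, prob in word_probs.items():
        letter = word[position] if position < len(word) else None
        unnormalised[word] = prob * letter_likelihoods.get(letter, floor)
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("all words have zero probability")
    return {word: value / total for word, value in unnormalised.items()}


def select_word(word_probs: Dict[str, float], threshold: float = 0.9) -> Optional[str]:
    """Return the most probable word if its posterior exceeds the threshold, else None."""
    best = max(word_probs, key=word_probs.get)
    return best if word_probs[best] > threshold else None


if __name__ == "__main__":
    # Toy dictionary with a uniform prior; the likelihood values are made up.
    posterior = {"cat": 1 / 3, "car": 1 / 3, "dog": 1 / 3}
    observations = [
        {"c": 0.8, "d": 0.1},
        {"a": 0.9, "o": 0.05},
        {"t": 0.95, "r": 0.03, "g": 0.02},
    ]
    for position, likelihoods in enumerate(observations):
        posterior = update_word_posterior(posterior, likelihoods, position)
        print(position, select_word(posterior))
```

No letter is ever committed to in this loop; a noisy click only shifts probability mass between words, so later letters can still correct an earlier ambiguous selection.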


16 Free Machine Learning Books

#artificialintelligence

The following is a list of free books on Machine Learning. A Brief Introduction To Neural Networks provides a comprehensive overview of the subject of neural networks and is divided into 4 parts: Part I: From Biology to Formalization -- Motivation, Philosophy, History and Realization of Neural Models; Part II: Supervised learning Network Paradigms; Part III: Unsupervised learning Network Paradigms; and Part IV: Excursi, Appendices and Registers. A Course In Machine Learning is designed to provide a gentle and pedagogically organized introduction to the field and provide a view of machine learning that focuses on ideas and models, not on math. The audience of this book is anyone who knows differential calculus and discrete math, and can program reasonably well. An undergraduate in their fourth or fifth semester should be fully capable of understanding this material. However, it should also be suitable for first year graduate students, perhaps at a slightly faster pace.


10 machine learning algorithms Every Data Scientist should know in 2018

#artificialintelligence

A data scientist is a person hired to analyze and interpret complex digital data, such as the usage statistics of a website, in order to help an enterprise with its decision-making. An analytical model is a mathematical model designed to carry out a particular task or to estimate the probability of a particular event; that is, the solution to the equations used to describe changes in a system can be expressed as a mathematical analytic function. In layman's terms, an analytical model is simply a mathematical representation of a business problem. A simple equation such as y = a + bx may be termed a model, with a set of predefined input data and a desired output. Scalable and efficient analytical modeling is critical for enabling businesses to apply these techniques to ever-larger data sets while reducing the time taken to carry out the analyses. Accordingly, models are created that implement key algorithms to determine the solution to the business problem at hand.


Bayesian Metabolic Flux Analysis reveals intracellular flux couplings

arXiv.org Machine Learning

Markus Heinonen 1, 2, Maria Osmala 1, Henrik Mannerström 1, Janne Wallenius 3, Samuel Kaski 1, 2, Juho Rousu 1, 2 and Harri Lähdesmäki 1. 1 Department of Computer Science, Aalto University, Espoo, 02150, Finland. 2 Helsinki Institute for Information Technology, Finland. 3 Institute for Molecular Medicine Finland, Helsinki, Finland.

Abstract. Motivation: Metabolic flux balance analyses are a standard tool in analysing metabolic reaction rates compatible with measurements, steady-state and the metabolic reaction network stoichiometry. Flux analysis methods commonly place unrealistic assumptions on fluxes due to the convenience of formulating the problem as a linear programming model, and most methods ignore the notable uncertainty in flux estimates. Results: We introduce a novel paradigm of Bayesian metabolic flux analysis that models the reactions of the whole genome-scale cellular system in probabilistic terms, and can infer the full flux vector distribution of genome-scale metabolic systems based on exchange and intracellular (e.g. The Bayesian model couples all fluxes jointly together in a simple truncated multivariate posterior distribution, which reveals informative flux couplings. Our model is a plug-in replacement for conventional metabolic balance methods, such as flux balance analysis (FBA). Our experiments indicate that we can characterise the genome-scale flux covariances, reveal flux couplings, and determine more intracellular unobserved fluxes in C. acetobutylicum from 13C data than flux variability analysis. Contact: markus.o.heinonen@aalto.fi

1 Introduction. Metabolic modelling considers networks of up to thousands of chemical reactions that transform metabolite molecules within cellular organisms (Palsson, 2015). The key problem of metabolism is the estimation of the reaction rates, or fluxes, of the system of highly interdependent intracellular reactions from measurements of a few exchange fluxes that transfer nutrients or products between the external medium and the cell. The dominant approach to flux estimation is the celebrated Flux Balance Analysis (FBA) framework, which finds reaction rates that maximise a prespecified cellular growth function (Feist and Palsson, 2010) while assuming the cell is in a steady state, where concentrations of intracellular metabolites do not change (Almaas et al., 2004). The FBA problem can be cast as a convenient and computationally efficient linear programming problem of solving a system of linear steady-state constraints while maximising a linear growth target (Orth et al., 2010), and where flux measurements can be encoded as constraints on the fluxes (Carreira et al., 2014).
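
For orientation, the linear programme described in this passage is usually written in the following form. The notation is an assumption introduced here, not taken from the excerpt: S is the stoichiometric matrix (metabolites by reactions), v the flux vector, c the weights of the linear growth target, and v^lb, v^ub the lower and upper flux bounds.

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{subject to} \quad & S v = 0, \\
& v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}}.
\end{aligned}
```

The equality S v = 0 is the steady-state condition: for every intracellular metabolite, production and consumption balance. Under this notation, one way to encode a measured flux for reaction i is to tighten its bounds to an interval around the measured value.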


Classifying Antimicrobial and Multifunctional Peptides with Bayesian Network Models

arXiv.org Machine Learning

Bayesian network models are finding success in characterizing enzyme-catalyzed reactions and slow conformational changes, in predicting enzyme inhibition, and in genomics. In this work, we apply them to statistical modeling of peptides by simultaneously identifying amino acid sequence motifs and using a motif-based model to clarify the role motifs may play in antimicrobial activity. We construct models of increasing sophistication, demonstrating how chemical knowledge of a peptide system may be embedded without requiring new derivation of model fitting equations after changing model structure. These models are used to construct classifiers with good performance (94% accuracy, Matthews correlation coefficient of 0.87) at predicting antimicrobial activity in peptides, while at the same time being built of interpretable parameters. We demonstrate use of these models to identify peptides that are potentially both antimicrobial and antifouling, and show that the background distribution of amino acids could play a greater role in activity than sequence motifs do. This provides an advancement in the type of peptide activity modeling that can be done and the ease with which models can be constructed.


Deep Probabilistic Programming Languages: A Qualitative Study

arXiv.org Artificial Intelligence

Deep probabilistic programming languages try to combine the advantages of deep learning with those of probabilistic programming languages. If successful, this would be a big step forward in machine learning and programming languages. Unfortunately, as of now, this new crop of languages is hard to use and understand.


Intrusions in Marked Renewal Processes

arXiv.org Artificial Intelligence

We present a probabilistic model of an intrusion in a marked renewal process. Given a process and a sequence of events, an intrusion is a subsequence of events that is not produced by the process. Applications of the model include, for example, online payment fraud, in which a fraudster takes over a user's account and performs payments on the user's behalf, and unexpected equipment failures due to unintended use. We adopt a Bayesian approach to infer the probability of an intrusion in a sequence of events, a MAP subsequence of events constituting the intrusion, and the marginal probability that each event in a sequence belongs to the intrusion. We evaluate the model for intrusion detection on synthetic data, as well as on anonymized data from an online payment system.