Mislabel Detection of Finnish Publication Ranks
Akusok, Anton, Saarela, Mirka, Kärkkäinen, Tommi, Björk, Kaj-Mikael, Lendasse, Amaury
Finland, in the spirit of Norway and Denmark, introduced a ranking system for academic publication channels (scientific journals, conference series, book publishers, etc.) called Jufo ("Julkaisufoorumi" in Finnish, "Publication Forum" in English) in 2010, together with the renewed university legislation. The rank of a publication channel, ranging from 0 (non-peer-reviewed) to 3 (most distinguished academic publication forums), is decided by a specially nominated panel for a particular scientific discipline. These panels decide the ranks in regular meetings, based on their academic expertise. Because the ranks are directly linked to the funding allocated to the universities, there has been, and still is, much discussion about the fairness and objectivity of the ranks. A versatile analysis of the 2015 Jufo rankings was done in [10]. There, by using association rule mining, decision trees, and confusion matrices with respect to the Norwegian and Danish ranks, it was shown that most of the expert-based ranks could be predicted and explained with machine learning methods. Moreover, it was found that the publication channels for which the Finnish expert-based rank is higher than the estimated one are characterized by higher publication activity or a recent upgrade of the rank. Hence, the outcomes of the system, the publication ranks, need to be assessed and evaluated regularly and rigorously.
Temporal Normalizing Flows
Analyzing and interpreting time-dependent stochastic data requires accurate and robust density estimation. In this paper we extend the concept of normalizing flows to so-called temporal Normalizing Flows (tNFs) to estimate time-dependent distributions, leveraging the full spatio-temporal information present in the dataset. Our approach is unsupervised, does not require an a priori characteristic scale, and can accurately estimate multi-scale distributions with vastly different length scales. We illustrate tNFs on sparse datasets of Brownian and chemotactic walkers, showing that the inclusion of temporal information enhances density estimation. Finally, we speculate on how tNFs can be applied to fit and discover the continuous PDE underlying a stochastic process.
Overcoming Long-term Catastrophic Forgetting through Adversarial Neural Pruning and Synaptic Consolidation
Tang, Jian Peng Bo, Jiang, Hao, Li, Zhuo, Lei, Yinjie, Lin, Tao, Li, Haifeng
Enabling a neural network to learn multiple tasks sequentially is of great significance for expanding the applicability of neural networks in realistic application scenarios. However, as the task sequence grows, the model quickly forgets previously learned skills; we refer to this loss of memory over long sequences as long-term catastrophic forgetting. There are two main reasons for long-term forgetting: first, as the number of tasks increases, the intersection of the low-error parameter subspaces satisfying these tasks becomes smaller and smaller, or even vanishes; second, error accumulates in the process of protecting the knowledge of previous tasks. In this paper, we propose an adversarial mechanism in which neural pruning and synaptic consolidation are used to overcome long-term catastrophic forgetting. This mechanism distills task-related knowledge into a small number of parameters and retains the old knowledge by consolidating a small number of parameters, while sparing most parameters to learn follow-up tasks; this not only avoids forgetting but also allows a large number of tasks to be learned. Specifically, neural pruning iteratively relaxes the parameter conditions of the current task to expand the common parameter subspace of the tasks. The modified synaptic consolidation strategy comprises two components: a novel measurement that takes network structure information into account to calculate parameter importance, and an element-wise parameter updating strategy designed to prevent significant parameters from being overwritten in subsequent learning. We verified the method on image classification, and the results show that the proposed ANPSC approach outperforms state-of-the-art methods. A hyperparameter sensitivity test further demonstrates the robustness of the proposed approach.
Per-sample Prediction Intervals for Extreme Learning Machines
Akusok, Anton, Miche, Yoan, Björk, Kaj-Mikael, Lendasse, Amaury
Prediction intervals in supervised Machine Learning bound the region where the true outputs of new samples may fall. They are necessary for separating reliable predictions of a trained model from near-random guesses, for minimizing the rate of False Positives, and for other problem-specific tasks in applied Machine Learning. Many real problems have heteroscedastic stochastic outputs, which explains the need for input-dependent prediction intervals. This paper proposes to estimate the input-dependent prediction intervals with a separate Extreme Learning Machine model, using the variance of its predictions as a correction term that accounts for the model uncertainty. The variance is estimated from the model's linear output layer with a weighted Jackknife method. The methodology is very fast, robust to heteroscedastic outputs, and handles both extremely large datasets and insufficient amounts of training data.
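The idea of a second model that predicts input-dependent interval width can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration: the toy data, the tanh hidden layer, the ridge term, the choice to regress squared residuals, and the Gaussian factor 1.96. The weighted Jackknife correction for model uncertainty described in the abstract is not implemented here.

```python
import numpy as np

def elm_fit(X, y, n_hidden, rng, ridge=1e-3):
    """Fit an ELM: random fixed hidden layer, ridge-regularized linear output layer."""
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights, never trained
    b = rng.normal(size=n_hidden)                 # random biases, never trained
    H = np.tanh(X @ W + b)
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

rng = np.random.default_rng(0)

# Toy heteroscedastic data (assumed): the noise level grows with |x|.
X = rng.uniform(-1.0, 1.0, size=(2000, 1))
noise_std = 0.1 + 0.5 * np.abs(X[:, 0])
y = np.sin(3.0 * X[:, 0]) + noise_std * rng.normal(size=2000)

# First ELM predicts the output; a separate ELM predicts the squared residual.
mean_model = elm_fit(X, y, 30, rng)
squared_residuals = (y - elm_predict(mean_model, X)) ** 2
var_model = elm_fit(X, squared_residuals, 30, rng)

def predict_interval(X_new, z=1.96):
    """Return (lower, upper) bounds; z=1.96 assumes roughly Gaussian noise."""
    mu = elm_predict(mean_model, X_new)
    var = np.clip(elm_predict(var_model, X_new), 1e-12, None)  # keep variance positive
    half_width = z * np.sqrt(var)
    return mu - half_width, mu + half_width

lower_c, upper_c = predict_interval(np.array([[0.0]]))
lower_e, upper_e = predict_interval(np.array([[0.9]]))
width_center = float((upper_c - lower_c)[0])
width_edge = float((upper_e - lower_e)[0])
print(width_center, width_edge)
```

On this toy problem the interval near x = 0.9 comes out wider than the one near x = 0, reflecting the larger noise there.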
Extreme Learning Tree
Akusok, Anton, Eirola, Emil, Björk, Kaj-Mikael, Lendasse, Amaury
Anton Akusok 1, Emil Eirola 1, Kaj-Mikael Björk 2, Amaury Lendasse 3, 4
1 Arcada University of Applied Sciences, Helsinki, Finland
2 Risklab at Arcada UAS, Helsinki, Finland
3 Department of Mechanical and Industrial Engineering, The University of Iowa, Iowa City, USA
4 The Iowa Informatics Initiative, The University of Iowa, Iowa City, USA
Abstract
The paper proposes a new variant of a decision tree, called an Extreme Learning Tree. It consists of an extremely random tree with nonlinear data transformation, and a linear observer that provides predictions based on the leaf index where the data samples fall. The proposed method outperforms linear models on a benchmark dataset, and may be a building block for a future variant of Random Forest.
1 Introduction
Randomized methods are a recent trend in practical machine learning [1]. They enable the high performance of complex nonlinear methods without the high computational cost of their optimization. The most prominent current examples are randomized neural networks, in both feed-forward [2] and recurrent [3] forms. For the latter, the randomized approach provided an efficient training method for the first time, and enabled achieving state-of-the-art performance in multiple areas [4].
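The structure described in the abstract (a random tree plus a linear observer on the leaf index) can be sketched as follows. This is a hypothetical reading, not the authors' implementation: the random tanh projection standing in for the "nonlinear data transformation", the uniform random thresholds, the depth limit, and the one-hot leaf encoding are all assumptions, and the toy data is invented.

```python
import numpy as np

def build_tree(Z, idx, depth, rng, counter):
    """Grow a tree with a random feature and a random threshold at every node."""
    def make_leaf():
        counter[0] += 1
        return {"leaf": counter[0] - 1}

    if depth == 0 or len(idx) < 2:
        return make_leaf()
    f = int(rng.integers(Z.shape[1]))              # random split feature
    vmin, vmax = Z[idx, f].min(), Z[idx, f].max()
    if vmin == vmax:                               # nothing left to split on
        return make_leaf()
    t = rng.uniform(vmin, vmax)                    # random split threshold
    go_left = Z[idx, f] <= t
    return {"f": f, "t": t,
            "left": build_tree(Z, idx[go_left], depth - 1, rng, counter),
            "right": build_tree(Z, idx[~go_left], depth - 1, rng, counter)}

def leaf_index(node, z):
    """Follow the splits until a leaf is reached and return its index."""
    while "leaf" not in node:
        node = node["left"] if z[node["f"]] <= node["t"] else node["right"]
    return node["leaf"]

rng = np.random.default_rng(0)

# Toy regression data (assumed).
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

# Assumed nonlinear transformation: a fixed random projection followed by tanh.
W = rng.normal(size=(2, 8))
Z = np.tanh(X @ W)

counter = [0]
tree = build_tree(Z, np.arange(len(Z)), 5, rng, counter)
n_leaves = counter[0]

# Linear observer: least squares on the one-hot encoded leaf index.
leaves = np.array([leaf_index(tree, z) for z in Z])
H = np.eye(n_leaves)[leaves]
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

mse_tree = float(np.mean((H @ beta - y) ** 2))
mse_const = float(np.var(y))                       # error of predicting the mean
print(n_leaves, mse_tree, mse_const)
```

With only one-hot leaf features, the least-squares observer reduces to the mean target per leaf, so the comparison here is against a constant predictor on the training data, not against the linear models mentioned in the abstract.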
A Bayesian Approach to Modelling Longitudinal Data in Electronic Health Records
Bellot, Alexis, van der Schaar, Mihaela
Analyzing electronic health records (EHR) poses significant challenges because often few samples are available describing a patient's health and, when available, their information content is highly diverse. The problem we consider is how to integrate sparsely sampled longitudinal data, missing measurements that are informative of the underlying health status, and fixed demographic information to produce estimated survival distributions that are updated throughout a patient's follow-up. We propose a nonparametric probabilistic model that generates survival trajectories from an ensemble of Bayesian trees, which learns variable interactions over time without specifying the longitudinal process beforehand. We show performance improvements on Primary Biliary Cirrhosis patient data.
Spiking Networks for Improved Cognitive Abilities of Edge Computing Devices
Akusok, Anton, Björk, Kaj-Mikael, Leal, Leonardo Espinosa, Miche, Yoan, Hu, Renjie, Lendasse, Amaury
A sudden realization came to our minds while preparing this white paper: mobile phones are the first type of device to receive dedicated math accelerators at a pervasive scale. Such things never gained wide adoption before: the Intel 8087 co-processor [11], Intel Xeon Phi [2, 5] and Google TPU (Tensor Processing Unit) [6] stayed niche devices that few people use and even fewer develop for. But over the last two years, major mobile phone companies have included dedicated co-processors [4], needed for computational photography enhancement or facial recognition, that are also suitable for general machine learning. Currently the dominant analytical approach stores data and runs computations in the Cloud [12]. However, Cloud-based methods are a poor fit for a range of important practical applications, including augmented reality, real-time data analysis, real-time user interaction, and the processing of sensitive data that incurs high risks for a company if leaked, stolen or intercepted in transfer. The price of deployed analytical methods is increased by the need for a permanently working internet connection for users, and by cloud hardware rent for service providers.
A Maximum Entropy approach to Massive Graph Spectra
Granziol, Diego, Ru, Robin, Zohren, Stefan, Dong, Xiaowen, Osborne, Michael, Roberts, Stephen
Machine Learning Research Group and Oxford-Man Institute for Quantitative Finance, Department of Engineering Science, University of Oxford
Abstract
Graph spectral techniques for measuring graph similarity, or for learning the cluster number, require kernel smoothing. The kernel function and bandwidth are typically chosen in an ad-hoc manner and heavily affect the resulting output. We prove that kernel smoothing biases the moments of the spectral density. We propose an information-theoretically optimal approach to learn a smooth graph spectral density, which fully respects the moment information. Our method's computational cost is linear in the number of edges, and hence it can be applied to large networks with millions of nodes. We apply our method to the problems of graph similarity and cluster number learning, where we outperform comparable iterative spectral approaches on synthetic and real graphs.
Keywords: Networks, Information Theory, Maximum Entropy, Graph Spectral Theory, Random matrix theory, iterative methods, kernel smoothing
1. Introduction: networks, their graph spectra and importance
Many systems of interest can be naturally characterised by complex networks; examples include social networks (Mislove et al., 2007b; Flake et al., 2000; Leskovec et al., 2007), biological networks (Palla et al., 2005) and technological networks.
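The abstract's claim that kernel smoothing biases the spectral moments can be checked numerically in a simple case. The sketch below assumes a Gaussian kernel, a small random graph, and the adjacency spectrum, none of which are taken from the paper. It relies only on the standard fact that convolving a density with a zero-mean Gaussian of bandwidth sigma leaves the first moment unchanged and adds sigma squared to the second moment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 0.2, 0.3          # graph size, edge probability, kernel bandwidth (assumed)

# Random undirected graph without self-loops, and its adjacency spectrum.
upper = np.triu(rng.random((n, n)) < p, k=1)
A = (upper | upper.T).astype(float)
eigs = np.linalg.eigvalsh(A)

# Gaussian-kernel-smoothed spectral density evaluated on a fine grid.
grid = np.linspace(eigs.min() - 8 * sigma, eigs.max() + 8 * sigma, 20001)
dx = grid[1] - grid[0]
kernels = np.exp(-0.5 * ((grid[:, None] - eigs[None, :]) / sigma) ** 2)
density = kernels.sum(axis=1) / (n * sigma * np.sqrt(2.0 * np.pi))

def moment(k):
    """k-th moment of the smoothed density, by a Riemann sum on the grid."""
    return float(np.sum(grid ** k * density) * dx)

m1_raw = float(eigs.mean())
m2_raw = float(np.mean(eigs ** 2))
m1_smooth, m2_smooth = moment(1), moment(2)

print(m1_raw, m1_smooth)                      # first moment is preserved
print(m2_raw + sigma ** 2, m2_smooth)         # second moment is inflated by sigma**2
```

A density estimate that is constrained to match the raw moments, as the maximum entropy approach in the abstract aims to do, avoids this bandwidth-dependent inflation.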
Reducing Selection Bias in Counterfactual Reasoning for Individual Treatment Effects Estimation
Zhang, Zichen, Lan, Qingfeng, Ding, Lei, Wang, Yue, Hassanpour, Negar, Greiner, Russell
Counterfactual reasoning is an important paradigm applicable in many fields, such as healthcare, economics, and education. In this work, we propose a novel method to address the issue of selection bias. We learn two groups of latent random variables, where one group corresponds to variables that only cause selection bias, and the other group is relevant for outcome prediction. They are learned by an auto-encoder where an additional regularized loss based on the Pearson Correlation Coefficient (PCC) encourages the de-correlation between the two groups of random variables. This allows for explicitly alleviating selection bias by only keeping the latent variables that are relevant for estimating individual treatment effects. Experimental results on a synthetic toy dataset and a benchmark dataset show that our algorithm is able to achieve state-of-the-art performance and improve the result of its counterpart that does not explicitly model the selection bias.
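A PCC-based de-correlation term of the kind described above could look like the following sketch. The exact form of the paper's loss is not given in the abstract, so the choice here (the mean squared Pearson correlation over all cross-group pairs of latent dimensions) is an assumption, as are the function name `pcc_penalty` and the synthetic data.

```python
import numpy as np

def pcc_penalty(A, B, eps=1e-8):
    """Mean squared Pearson correlation between columns of A and columns of B.

    A has shape (n_samples, d_a) and B has shape (n_samples, d_b); each column
    is one latent variable. The result lies in [0, 1], and 0 means no linear
    correlation between the two groups.
    """
    A_c = A - A.mean(axis=0)
    B_c = B - B.mean(axis=0)
    A_n = A_c / (np.linalg.norm(A_c, axis=0) + eps)
    B_n = B_c / (np.linalg.norm(B_c, axis=0) + eps)
    corr = A_n.T @ B_n                 # (d_a, d_b) matrix of Pearson correlations
    return float(np.mean(corr ** 2))

rng = np.random.default_rng(0)
group_a = rng.normal(size=(1000, 3))

# Independent second group: the penalty should be close to zero.
group_b_independent = rng.normal(size=(1000, 2))
penalty_independent = pcc_penalty(group_a, group_b_independent)

# Second group that copies two dimensions of the first: the penalty is large.
group_b_correlated = group_a[:, :2] + 0.1 * rng.normal(size=(1000, 2))
penalty_correlated = pcc_penalty(group_a, group_b_correlated)

print(penalty_independent, penalty_correlated)
```

In training, a term like this would be added to the auto-encoder's loss and computed with a differentiable framework; NumPy is used here only to show the quantity being penalized.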
TransMatch: A Transfer-Learning Scheme for Semi-Supervised Few-Shot Learning
Yu, Zhongjie, Chen, Lin, Cheng, Zhongwei, Luo, Jiebo
The successful application of deep learning to many visual recognition tasks relies heavily on the availability of a large amount of labeled data, which is usually expensive to obtain. The few-shot learning problem has attracted increasing attention from researchers for building a robust model upon only a few labeled samples. Most existing works tackle this problem under the meta-learning framework by mimicking the few-shot learning task with an episodic training strategy. In this paper, we propose a new transfer-learning framework for semi-supervised few-shot learning to fully utilize the auxiliary information from labeled base-class data and unlabeled novel-class data. The framework consists of three components: 1) pre-training a feature extractor on base-class data; 2) using the feature extractor to initialize the classifier weights for the novel classes; and 3) further updating the model with a semi-supervised learning method. Under the proposed framework, we develop a novel method for semi-supervised few-shot learning called TransMatch by instantiating the three components with Imprinting and MixMatch. Extensive experiments on two popular benchmark datasets for few-shot learning, CUB-200-2011 and miniImageNet, demonstrate that our proposed method can effectively utilize the auxiliary information from labeled base-class data and unlabeled novel-class data to significantly improve the accuracy of few-shot learning tasks.
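Component 2 of the framework, initializing novel-class classifier weights from extracted features, can be sketched with weight imprinting as it is commonly described: each class weight is the L2-normalized mean of the normalized support features, used with a cosine-similarity classifier. The feature extractor is replaced below by synthetic feature vectors, which is an assumption, and the helper names are hypothetical. The MixMatch stage is not shown.

```python
import numpy as np

def l2_normalize(V, eps=1e-12):
    """Scale each row of V to unit Euclidean length."""
    return V / (np.linalg.norm(V, axis=-1, keepdims=True) + eps)

def imprint_weights(features, labels, n_classes):
    """One weight vector per class: normalized mean of normalized features."""
    F = l2_normalize(features)
    W = np.stack([F[labels == c].mean(axis=0) for c in range(n_classes)])
    return l2_normalize(W)

def classify(W, features):
    """Predict the class with the highest cosine similarity to its weight vector."""
    return np.argmax(l2_normalize(features) @ W.T, axis=1)

rng = np.random.default_rng(0)
n_classes, dim, shots = 5, 16, 5

# Stand-in for features produced by a pre-trained extractor (assumed):
# each novel class is a cluster around a random center.
centers = rng.normal(size=(n_classes, dim))
support_labels = np.repeat(np.arange(n_classes), shots)
support = centers[support_labels] + 0.1 * rng.normal(size=(n_classes * shots, dim))

W_novel = imprint_weights(support, support_labels, n_classes)

query_labels = np.arange(n_classes)
query = centers + 0.1 * rng.normal(size=(n_classes, dim))
pred = classify(W_novel, query)
print(pred)
```

The imprinted weights only provide a starting point; in the framework above they would then be updated further by the semi-supervised stage using unlabeled novel-class data.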