Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks
The curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables grows exponentially. In this paper we propose a new architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow at most as the square of the number of variables, using a multi-layer neural network to represent the joint distribution of the variables as a product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned using dependency tests between the variables. Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.

1 Introduction

The curse of dimensionality hits particularly hard on models of high-dimensional discrete data because there are many more possible combinations of the values of the variables than can possibly be observed in any data set, even the large data sets now common in data-mining applications. In this paper we deal in particular with multivariate discrete data, where one tries to build a model of the distribution of the data. Such a model can be used, for example, to detect anomalous cases in data-mining applications, or to model the class-conditional distribution of some observed variables in order to build a classifier.

A simple multinomial maximum-likelihood model would give zero probability to all of the combinations not encountered in the training set, i.e., it would most likely give zero probability to most out-of-sample test cases. Smoothing the model by assigning the same non-zero probability to all the unobserved cases would not be satisfactory either, because it would not provide much generalization from the training set. Such smoothing can be obtained by using a multivariate multinomial model whose parameters θ are estimated by the maximum a-posteriori (MAP) principle, i.e., the parameters that have the greatest probability given the training data D, under a diffuse prior P(θ) (e.g., a Dirichlet prior).
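To make this baseline concrete, here is a minimal sketch (not from the paper) of MAP estimation for a multinomial over joint configurations under a symmetric Dirichlet prior, which reduces to add-α count smoothing; the function name, example data, and the choice α = 1 are illustrative assumptions.

```python
from collections import Counter

def map_multinomial(train, num_configs, alpha=1.0):
    """MAP estimate of a multinomial over joint configurations.

    Under a symmetric Dirichlet(alpha + 1) prior, the MAP estimate
    reduces to add-alpha smoothing:
        p(x) = (count(x) + alpha) / (N + alpha * K),
    where N is the training-set size and K the number of possible
    configurations.
    """
    counts = Counter(train)
    denom = len(train) + alpha * num_configs
    return lambda x: (counts.get(x, 0) + alpha) / denom

# Three binary variables -> 2**3 = 8 possible configurations.
train = [(0, 0, 1), (0, 0, 1), (1, 0, 1), (0, 1, 0)]
p = map_multinomial(train, num_configs=8)
print(p((0, 0, 1)))  # seen twice: (2 + 1) / (4 + 8) = 0.25
print(p((1, 1, 1)))  # unseen:     (0 + 1) / (4 + 8) ~ 0.083
```

Every configuration receives non-zero probability, but the smoothed mass is spread uniformly over all unseen configurations, which is exactly the lack of generalization criticized above.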
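The remedy proposed in this paper, by contrast, is to decompose the joint distribution as p(x_1, ..., x_n) = ∏_i p(x_i | x_1, ..., x_{i-1}) and compute all the conditionals with a single neural network whose hidden units are shared. The sketch below illustrates that decomposition for binary variables; the masking scheme, layer sizes, and initialization are illustrative choices rather than the paper's exact parameterization, and training code is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class AutoregressiveSketch:
    """Joint distribution of n binary variables written as a product of
    conditionals p(x) = prod_i p(x_i = 1 | x_1 .. x_{i-1}), with every
    conditional computed from one shared hidden layer (tied parameters).
    The parameter count is O(n * n_hidden), polynomial rather than
    exponential in n.
    """

    def __init__(self, n_vars, n_hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.n = n_vars
        self.W = 0.1 * rng.standard_normal((n_hidden, n_vars))  # shared input-to-hidden weights
        self.b = np.zeros(n_hidden)
        self.V = 0.1 * rng.standard_normal((n_vars, n_hidden))  # per-variable output weights
        self.c = np.zeros(n_vars)

    def log_prob(self, x):
        """Sum of log p(x_i | x_<i); masking enforces the variable ordering."""
        x = np.asarray(x, dtype=float)
        logp = 0.0
        for i in range(self.n):
            masked = np.where(np.arange(self.n) < i, x, 0.0)  # expose only x_1 .. x_{i-1}
            h = sigmoid(self.W @ masked + self.b)              # hidden units shared by all conditionals
            p_i = sigmoid(self.V[i] @ h + self.c[i])           # p(x_i = 1 | x_<i)
            logp += np.log(p_i) if x[i] == 1 else np.log(1.0 - p_i)
        return logp

model = AutoregressiveSketch(n_vars=5)
print(model.log_prob([1, 0, 1, 1, 0]))  # log-likelihood of one configuration
```

Because each factor is a valid Bernoulli conditional, the product is a properly normalized distribution over all 2^n configurations by construction, with no exponential table of parameters.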
Neural Information Processing Systems
Dec-31-2000