
Collaborating Authors

 Ibraheem, Abdulrahman Oladipupo


Regularizing cross entropy loss via minimum entropy and K-L divergence

arXiv.org Artificial Intelligence

I introduce two novel loss functions for classification in deep learning. The two loss functions extend standard cross-entropy loss by regularizing it with minimum entropy and Kullback-Leibler (K-L) divergence terms. The first of the two novel loss functions is termed mixed entropy loss (MIX-ENT for short), while the second is termed minimum entropy regularized cross-entropy loss (MIN-ENT for short). The MIX-ENT function introduces a regularizer that can be shown to be equivalent to the sum of a minimum entropy term and a K-L divergence term. However, the K-L divergence term here differs from that in the standard cross-entropy loss function, in that it swaps the roles of the target probability and the hypothesis probability. The MIN-ENT function simply adds a minimum entropy regularizer to the standard cross-entropy loss function. In both MIX-ENT and MIN-ENT, the minimum entropy regularizer minimizes the entropy of the hypothesis probability distribution output by the neural network. Experiments on the EMNIST-Letters dataset show that my implementation of MIX-ENT and MIN-ENT lifts the VGG model from its previous 3rd position on the paperswithcode leaderboard to the 2nd position, outperforming the Spinal-VGG model in the process. Specifically, using standard cross-entropy, VGG achieves 95.86% and Spinal-VGG achieves 95.88% classification accuracy, whereas, using VGG (without Spinal-VGG), my MIN-ENT achieves 95.933% and my MIX-ENT achieves 95.927% accuracy. The pre-trained models for both MIX-ENT and MIN-ENT are at https://github.com/rahmanoladi/minimum entropy project.
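For concreteness, the following PyTorch-style sketch shows one plausible reading of the two losses as described in the abstract. The regularization weight alpha, the epsilon constants, and the smoothing of the one-hot target (used to keep the reversed K-L term finite) are my assumptions, not details taken from the paper; the linked repository holds the author's actual models.

    # Minimal sketch of MIN-ENT- and MIX-ENT-style losses (assumptions: alpha,
    # eps, and target smoothing are illustrative choices, not values from the paper).
    import torch
    import torch.nn.functional as F

    def min_ent_loss(logits, targets, alpha=0.1):
        """Standard cross-entropy plus a minimum-entropy regularizer on the
        hypothesis (softmax) distribution, as MIN-ENT is described."""
        ce = F.cross_entropy(logits, targets)
        p = F.softmax(logits, dim=1)
        entropy = -(p * torch.log(p + 1e-12)).sum(dim=1).mean()
        return ce + alpha * entropy

    def mix_ent_loss(logits, targets, alpha=0.1, eps=1e-6):
        """Cross-entropy plus a regularizer equal to a minimum-entropy term
        plus a reversed K-L term, KL(p || q), where p is the hypothesis
        distribution and q the (smoothed) target distribution."""
        ce = F.cross_entropy(logits, targets)
        p = F.softmax(logits, dim=1)
        num_classes = logits.size(1)
        # Smooth the one-hot target so KL(p || q) stays finite; the paper may
        # handle the hard target differently.
        q = F.one_hot(targets, num_classes).float()
        q = q * (1.0 - eps) + eps / num_classes
        entropy = -(p * torch.log(p + 1e-12)).sum(dim=1).mean()
        rev_kl = (p * (torch.log(p + 1e-12) - torch.log(q))).sum(dim=1).mean()
        return ce + alpha * (entropy + rev_kl)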


SENNS: Sparse Extraction Neural NetworkS for Feature Extraction

arXiv.org Machine Learning

By drawing on ideas from optimization theory, artificial neural networks (ANNs), graph embeddings and sparse representations, I develop a novel technique, termed SENNS (Sparse Extraction Neural NetworkS), aimed at addressing the feature extraction problem. The proposed method uses (preferably deep) ANNs to project input attribute vectors into an output space in which pairwise distances are maximized for vectors belonging to different classes but minimized for vectors belonging to the same class, while simultaneously enforcing sparsity on the ANN outputs. The vectors that result from the projection can then be used as features in any classifier of choice. Mathematically, I formulate the proposed method as the minimization of an objective function which can be interpreted, in the ANN output space, as a negative factor of the sum of squared pairwise distances between output vectors belonging to different classes, added to a positive factor of the sum of squared pairwise distances between output vectors belonging to the same class, plus sparsity and weight-decay terms. To derive an algorithm for minimizing the objective function via gradient descent, I use the multivariate version of the chain rule to obtain the partial derivatives of the function with respect to the ANN weights and biases, and find that each of the required partial derivatives can be expressed as a sum of six terms. Four of those six terms can be computed using the standard backpropagation algorithm; the fifth can be computed via a slight modification of standard backpropagation; and the sixth can be computed via simple arithmetic. Finally, I propose experiments on the ARABASE Arabic corpora of digits and letters, the CMU PIE database of faces, the MNIST digits database, and other standard machine learning databases.
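As a rough illustration of the objective just described, the sketch below combines the between-class and within-class squared pairwise distance terms with L1 sparsity on the outputs and L2 weight decay. The trade-off coefficients, the function name, and the decision to sum over ordered pairs are my assumptions, not values or conventions from the paper.

    # Illustrative SENNS-style objective (coefficients are assumed, not from the paper).
    import torch

    def senns_objective(outputs, labels, params, c_between=1.0, c_within=1.0,
                        lam_sparse=1e-3, lam_decay=1e-4):
        """Within-class squared pairwise distances enter with a positive factor,
        between-class squared pairwise distances with a negative factor, plus
        L1 sparsity on the ANN outputs and L2 weight decay on the parameters."""
        d2 = torch.cdist(outputs, outputs).pow(2)           # all squared pairwise distances
        same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-class indicator matrix
        within = d2[same].sum()                             # includes the zero diagonal; harmless
        between = d2[~same].sum()
        sparsity = outputs.abs().sum()                      # L1 sparsity term
        decay = sum(w.pow(2).sum() for w in params)         # weight-decay term
        return (c_within * within - c_between * between
                + lam_sparse * sparsity + lam_decay * decay)

In a sketch like this, minimizing the objective with a standard autograd optimizer sidesteps the hand-derived six-term partial derivatives that the abstract analyses.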


Correlation of Data Reconstruction Error and Shrinkages in Pair-wise Distances under Principal Component Analysis (PCA)

arXiv.org Machine Learning

In this ongoing work, I explore certain theoretical and empirical implications of data transformations under PCA. In particular, I state and prove three theorems about PCA, which I paraphrase as follows: 1) PCA without discarding eigenvector rows is injective, but loses this injectivity when eigenvector rows are discarded; 2) PCA without discarding eigenvector rows preserves pair-wise distances, but tends to cause pair-wise distances to shrink when eigenvector rows are discarded; 3) for any pair of points, the shrinkage in pair-wise distance is bounded above by an L1-norm reconstruction error associated with the points. Clearly, 3) suggests that there might exist some correlation between shrinkages in pair-wise distances and the mean square reconstruction error, which is defined as the sum of the eigenvalues associated with the discarded eigenvectors. I therefore performed numerical experiments to obtain the correlation between the sum of those eigenvalues and the shrinkages in pair-wise distances. In addition, I performed experiments to check, respectively, the effect of the sum of those eigenvalues and the effect of the shrinkages on classification accuracies under the PCA map. So far, I have obtained the following results on some publicly available data from the UCI Machine Learning Repository: 1) there seems to be a strong correlation between the sum of the eigenvalues associated with discarded eigenvectors and the shrinkages in pair-wise distances; 2) neither the sum of those eigenvalues nor the shrinkages in pair-wise distances shows any strong correlation with classification accuracies.
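The following NumPy/scikit-learn sketch reproduces the qualitative behaviour the theorems describe on synthetic data: full PCA preserves pair-wise distances, truncated PCA shrinks them, and the mean shrinkage grows with the sum of the discarded eigenvalues. The synthetic data and the sweep over retained components are illustrative assumptions, not the UCI setup of the abstract.

    # Numerical check of the distance-preservation and shrinkage claims
    # (synthetic data standing in for the UCI datasets).
    import numpy as np
    from scipy.spatial.distance import pdist
    from sklearn.decomposition import PCA

    X = np.random.default_rng(0).normal(size=(200, 10))
    d_orig = pdist(X)

    full = PCA(n_components=10).fit(X)
    # No eigenvector rows discarded: pair-wise distances are preserved (up to rounding).
    print("max distance change (full PCA):",
          np.abs(d_orig - pdist(full.transform(X))).max())

    # Discarding eigenvector rows shrinks pair-wise distances; the mean shrinkage
    # tracks the sum of the eigenvalues of the discarded components.
    for k in range(1, 10):
        d_k = pdist(PCA(n_components=k).fit_transform(X))
        shrinkage = (d_orig - d_k).mean()   # non-negative, since truncation is a projection
        discarded = full.explained_variance_[k:].sum()
        print(f"k={k}: mean shrinkage={shrinkage:.4f}, "
              f"discarded eigenvalue sum={discarded:.4f}")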