Clustering is one of the most fundamental tasks in data analysis and machine learning. It is central to many data-driven applications that aim to separate the data into groups with similar patterns. Moreover, clustering is a complex procedure that is affected significantly by the choice of the data representation method. Recent research has demonstrated encouraging clustering results by learning effectively these representations. In most of these works a deep auto-encoder is initially pre-trained to minimize a reconstruction loss, and then jointly optimized with clustering centroids in order to improve the clustering objective. Those works focus mainly on the clustering phase of the procedure, while not utilizing the potential benefit out of the initial phase. In this paper we propose to optimize an auto-encoder with respect to a discriminative pairwise loss function during the auto-encoder pre-training phase. We demonstrate the high accuracy obtained by the proposed method as well as its rapid convergence (e.g. reaching above 92% accuracy on MNIST during the pre-training phase, in less than 50 epochs), even with small networks.