Kaggle is a great resource for datasets, as you'll see new ones pop up as competitions begin. Common Crawl is a popular source for text data, and was used to train the higher-dimensional GloVe word vectors. Speaking of which, the GloVe pre-trained word vectors are available to download on that site, which may be useful to you! ImageNet is a common source of images, and the classification task setup by that dataset is used as a benchmark for model quality. Finally, you could always create your own dataset! Really, being a machine learning researcher or practitioner involves a ton more data collection, cleaning, manipulating, and general handling than it may appear at first glance.
There is a growing need for efficient and scalable semantic web queries that handle inference. There is also a growing interest in representing uncertainty in semantic web knowledge bases. In this paper, we present a bit vector schema specifically designed for RDF (Resource Description Framework) datasets. We propose a system for materializing and storing inferred knowledge using this schema. We show experimental results that demonstrate that our solution drastically improves the performance of inference queries. We also propose a solution for materializing uncertain information and probabilities using multiple bit vectors and thresholds.
The subcellular location of a protein can provide valuable information about its function. With the rapid increase of sequenced genomic data, the need for an automated and accurate tool to predict subcellular localization becomes increasingly important. Many efforts have been made to predict protein subcellular localization. This paper aims to merge the artificial neural networks and bioinformatics to predict the location of protein in yeast genome. We introduce a new subcellular prediction method based on a backpropagation neural network. The results show that the prediction within an error limit of 5 to 10 percentage can be achieved with the system.
In real-world recommender systems, some users are easily influenced by new products and whereas others are unwilling to change their minds. So the preference varying speeds for users are different. Based on this observation, we propose a dynamic nonlinear matrix factorization model for collaborative filtering, aimed to improve the rating prediction performance as well as track the preference varying speeds for different users. We assume that user-preference changes smoothly over time, and the preference varying speeds for users are different. These two assumptions are incorporated into the proposed model as prior knowledge on user feature vectors, which can be learned efficiently by MAP estimation. The experimental results show that our method not only achieves state-of-the-art performance in the rating prediction task, but also provides an effective way to track user-preference varying speed.
In conventional prediction tasks, a machine learning algorithm outputs a single best model that globally optimizes its objective function, which typically is accuracy. Therefore, users cannot access the other models explicitly. In contrast to this, multiple model enumeration attracts increasing interests in non-standard machine learning applications where other criteria, e.g., interpretability or fairness, than accuracy are main concern and a user may want to access more than one non-optimal, but suitable models. In this paper, we propose a K-best model enumeration algorithm for Support Vector Machines (SVM) that given a dataset S and an integer K>0, enumerates the K-best models on S with distinct support vectors in the descending order of the objective function values in the dual SVM problem. Based on analysis of the lattice structure of support vectors, our algorithm efficiently finds the next best model with small latency. This is useful in supporting users's interactive examination of their requirements on enumerated models. By experiments on real datasets, we evaluated the efficiency and usefulness of our algorithm.