Support Vector Machines
Determining Song Similarity via Machine Learning Techniques and Tagging Information
Cunha, Renato L. F., Caldeira, Evandro, Fujii, Luciana
The task of determining item similarity is a crucial one in a recommender system. This constitutes the base upon which the recommender system will work to determine which items are more likely to be enjoyed by a user, resulting in more user engagement. In this paper we tackle the problem of determining song similarity based solely on song metadata (such as the performer and song title) and on tags contributed by users. We evaluate our approach under a series of different machine learning algorithms. We conclude that tf-idf achieves better results than Word2Vec for mapping the dataset to feature vectors. We also conclude that k-NN models perform better than SVMs and Linear Regression for this problem.
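The tf-idf plus nearest-neighbor idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the tag lists, the function names (`tfidf_vectors`, `cosine`, `nearest_songs`), and the choice of cosine similarity are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} tf-idf dict."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_songs(query_index, vectors, k=1):
    """Return the indices of the k songs most similar to the query song."""
    scores = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]


# Hypothetical user-contributed tags, one list per song.
songs = [
    ["rock", "guitar", "live"],
    ["rock", "guitar", "classic"],
    ["jazz", "piano", "live"],
    ["jazz", "saxophone", "smooth"],
]
vectors = tfidf_vectors(songs)
print(nearest_songs(0, vectors, k=2))
```

Rare tags receive a larger idf weight, so two songs that share an uncommon tag end up closer than two songs that share only a ubiquitous one.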
Field of Groves: An Energy-Efficient Random Forest
Takhirov, Zafar, Wang, Joseph, Louis, Marcia S., Saligrama, Venkatesh, Joshi, Ajay
Machine Learning (ML) algorithms, like Convolutional Neural Networks (CNN), Support Vector Machines (SVM), etc., have become widespread and can achieve high statistical performance. However, their accuracy decreases significantly in the energy-constrained mobile and embedded systems space, where all computations need to be completed under a tight energy budget. In this work, we present a field of groves (FoG) implementation of random forests (RF) that achieves an accuracy comparable to CNNs and SVMs under tight energy budgets. Evaluation of the FoG shows that at comparable accuracy it consumes ~1.48x, ~24x, ~2.5x, and ~34.7x lower energy per classification compared to conventional RF, SVM_RBF, MLP, and CNN, respectively. FoG is ~6.5x less energy efficient than SVM_LR, but achieves 18% higher accuracy on average across all considered datasets.
A Comparative Study for Predicting Heart Diseases Using Data Mining Classification Methods
Zriqat, Israa Ahmed, Altamimi, Ahmad Mousa, Azzeh, Mohammad
Improving the precision of heart disease detection has been investigated by many researchers in the literature. This effort is driven by overwhelming health care expenditures and erroneous diagnoses. As a result, various methodologies have been proposed to analyze the disease factors, aiming to decrease physicians' practice variation and reduce medical costs and errors. In this paper, our main motivation is to develop an effective intelligent medical decision support system based on data mining techniques. In this context, five data mining classification algorithms (Naïve Bayes, Decision Tree, Discriminant, Random Forest, and Support Vector Machine) have been applied to large datasets to assess and analyze the risk factors statistically related to heart diseases and to compare the performance of the implemented classifiers. To underscore the practical viability of our approach, the selected classifiers have been implemented using the MATLAB tool with two datasets. Results of the conducted experiments showed that all classification algorithms are predictive and can give relatively correct answers. However, the decision tree outperforms the other classifiers with an accuracy rate of 99.0%, followed by Random Forest. This is because both rely on a relatively similar mechanism, but Random Forest builds an ensemble of decision trees. Although ensemble learning has been shown to produce superior results, in our case the decision tree outperformed its ensemble version.
L$^3$-SVMs: Landmarks-based Linear Local Support Vectors Machines
Zantedeschi, Valentina, Emonet, Rémi, Sebban, Marc
Support Vector Machines (SVMs) [7] are among the most famous and commonly used Machine Learning techniques for classification. This popularity is due to their robustness, simplicity, efficiency (even in nonlinear scenarios by means of the kernel trick) as well as their theoretical foundations via generalization guarantees. Despite those nice properties, SVMs may face some drawbacks: Kernel SVMs are known to be expensive in terms of time complexity and memory usage when the number of training examples is large, both at training and at testing time. For training, the full Gram matrix needs to be evaluated (i.e., all pairwise training sample similarities must be computed and stored), and then inverted. For testing, the time complexity depends on the number of support vectors, which typically grows linearly with the number of training instances [21].
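The cost described above can be made concrete with a small sketch. It assumes an RBF kernel with a hypothetical `gamma` value; the helper names (`rbf_kernel`, `gram_matrix`, `decision_value`) and the toy coefficients are illustrative and not taken from the paper. The Gram matrix stores n × n kernel values, and each prediction needs one kernel evaluation per support vector.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - y||^2); gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(samples, kernel=rbf_kernel):
    """Evaluate all pairwise similarities: O(n^2) time and memory."""
    n = len(samples)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = kernel(samples[i], samples[j])
            gram[i][j] = value
            gram[j][i] = value  # the kernel is symmetric
    return gram


def decision_value(x, support_vectors, dual_coefs, bias, kernel=rbf_kernel):
    """Kernel SVM score: one kernel call per support vector."""
    return sum(
        coef * kernel(sv, x) for sv, coef in zip(support_vectors, dual_coefs)
    ) + bias


samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
gram = gram_matrix(samples)

# Hypothetical trained model with a single support vector.
score = decision_value([0.0, 0.0], [[0.0, 0.0]], [2.0], bias=-1.0)
```

Doubling the training set quadruples the number of Gram entries, which is the scaling problem that motivates local and landmark-based approximations.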
Geometric Insights into Support Vector Machine Behavior using the KKT Conditions
Carmichael, Iain, Marron, J. S.
The Support Vector Machine (SVM) is a powerful and widely used classification algorithm. Its performance is well known to be impacted by a tuning parameter which is frequently selected by cross-validation. This paper uses the Karush-Kuhn-Tucker conditions to provide rigorous mathematical proof for new insights into the behavior of SVM in the large and small tuning parameter regimes. These insights provide perhaps unexpected relationships between SVM and naive Bayes and maximal data piling directions. We explore how characteristics of the training data affect the behavior of SVM in many cases including: balanced vs. unbalanced classes, low vs. high dimension, separable vs. non-separable data. These results present a simple explanation of SVM's behavior as a function of the tuning parameter. We also elaborate on the geometry of complete data piling directions in high dimensional space. The results proved in this paper suggest important implications for tuning SVM with cross-validation.
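For orientation, the standard soft-margin SVM and its Karush-Kuhn-Tucker conditions are written out below. This is the textbook formulation, with $C$ playing the role of the tuning parameter; the notation ($w$, $b$, $\xi_i$, $\alpha_i$, $\mu_i$) is assumed here and may differ from the paper's.

```latex
% Primal problem for training pairs (x_i, y_i), with y_i in {-1, +1}
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.} \quad y_i \,(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 .

% KKT conditions, with multipliers \alpha_i \ge 0 (margin) and \mu_i \ge 0 (slack)
w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad
\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad
\alpha_i + \mu_i = C,

\alpha_i \,\bigl[\, y_i (w^\top x_i + b) - 1 + \xi_i \,\bigr] = 0, \qquad
\mu_i \, \xi_i = 0 .
```

The tuning parameter enters through the box constraint $0 \le \alpha_i \le C$ implied by $\alpha_i + \mu_i = C$: it caps how much any single training point can contribute to $w$.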
Book: Neural Networks and Statistical Learning
Providing a broad but in-depth introduction to neural networks and machine learning in a statistical framework, this book provides a single, comprehensive resource for study and further research. All the major popular neural network models and statistical learning approaches are covered with examples and exercises in every chapter to develop a practical working understanding of the content. Each of the twenty-five chapters includes state-of-the-art descriptions and important research results on the respective topics. The broad coverage includes the multilayer perceptron, the Hopfield network, associative memory models, clustering models and algorithms, the radial basis function network, recurrent neural networks, principal component analysis, nonnegative matrix factorization, independent component analysis, discriminant analysis, support vector machines, kernel methods, reinforcement learning, probabilistic and Bayesian networks, data fusion and ensemble learning, fuzzy sets and logic, neurofuzzy models, hardware implementations, and some machine learning topics. Applications to biometric/bioinformatics and data mining are also included.
Filtering Tweets for Social Unrest
Mishler, Alan, Wonus, Kevin, Chambers, Wendy, Bloodgood, Michael
There has been substantial interest in building technologies that can use social media postings to help forecast civil unrest [1]-[3]. The Arab Spring of 2011 compellingly illustrates how social media can both reflect and influence political (in)stability [4]. Since social media data is generated on such a large and rapid scale, computational tools are potentially extremely useful in helping to render meaning from that data. While previous work has focused on forecasting specific near-term unrest events [2], in this current paper we are interested in filtering social media content for postings that are relevant to social unrest, with the idea that downstream systems or human experts would use this filtered content for further analysis. In particular, we experiment with filtering tweets written in Arabic for relevance to social unrest.
Deploying nEmesis: Preventing Foodborne Illness by Data Mining Social Media
Sadilek, Adam (University of Rochester) | Kautz, Henry (University of Rochester) | DiPrete, Lauren (Southern Nevada Health District) | Labus, Brian (Southern Nevada Health District, Las Vegas, Nevada) | Portman, Eric (University of Rochester) | Teitel, Jack (University of Rochester) | Silenzio, Vincent (University of Nevada Las Vegas)
Foodborne illness afflicts 48 million people annually in the U.S. alone. Over 128,000 are hospitalized and 3,000 die from the infection. While preventable with proper food safety practices, the traditional restaurant inspection process has limited impact given the predictability and low frequency of inspections, and the dynamic nature of the kitchen environment. Despite this reality, the inspection process has remained largely unchanged for decades. CDC has even identified food safety as one of seven “winnable battles”; however, progress to date has been limited. In this work, we demonstrate significant improvements in food safety by marrying AI and the standard inspection process. We apply machine learning to Twitter data, develop a system that automatically detects venues likely to pose a public health hazard, and demonstrate its efficacy in the Las Vegas metropolitan area in a double-blind experiment conducted over three months in collaboration with Nevada’s health department. By contrast, previous research in this domain has been limited to indirect correlative validation using only aggregate statistics. We show that the adaptive inspection process is 64 percent more effective at identifying problematic venues than the current state of the art. If fully deployed, our approach could prevent over 9,000 cases of foodborne illness and 557 hospitalizations annually in Las Vegas alone. Additionally, adaptive inspections result in unexpected benefits, including the identification of venues lacking permits, contagious kitchen staff, and fewer customer complaints filed with the Las Vegas health department.
The Importance of Location in Real Estate, Weather, and Machine Learning
Real estate experts like to say that the three most important features of a property are: location, location, location! Likewise, weather events are highly location-dependent. We will see below how a similar perspective is also applicable to machine learning algorithms. In real estate, the buyer is first and foremost concerned about location for at least 3 reasons: (a) the desirability of the surrounding neighborhood; (b) the proximity to schools, businesses, services, etc.; and (c) the value of properties in that area. Similarly, meteorologists tell us that all weather is local.
Fairness Constraints: Mechanisms for Fair Classification
Zafar, Muhammad Bilal, Valera, Isabel, Rodriguez, Manuel Gomez, Gummadi, Krishna P.
Algorithmic decision making systems are ubiquitous across a wide variety of online as well as offline services. These systems rely on complex learning methods and vast amounts of data to optimize the service functionality, satisfaction of the end user and profitability. However, there is a growing concern that these automated decisions can lead, even in the absence of intent, to a lack of fairness, i.e., their outcomes can disproportionately hurt (or, benefit) particular groups of people sharing one or more sensitive attributes (e.g., race, sex). In this paper, we introduce a flexible mechanism to design fair classifiers by leveraging a novel intuitive measure of decision boundary (un)fairness. We instantiate this mechanism with two well-known classifiers, logistic regression and support vector machines, and show on real-world data that our mechanism allows for a fine-grained control on the degree of fairness, often at a small cost in terms of accuracy. A Python implementation of our mechanism is available at fate-computing.mpi-sws.org
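The abstract does not spell out the decision boundary (un)fairness measure. The sketch below assumes one common way to express such a measure: the empirical covariance between a sensitive attribute and each point's linear decision score (its unnormalized signed distance to the boundary). The function names, the threshold, and the toy data are hypothetical, and this is not the implementation distributed by the authors.

```python
def decision_scores(features, weights, bias=0.0):
    """Linear scores w.x + b, proportional to signed distance to the boundary."""
    return [
        sum(w * x for w, x in zip(weights, row)) + bias for row in features
    ]


def boundary_covariance(sensitive, features, weights, bias=0.0):
    """Empirical covariance between the sensitive attribute and the scores."""
    scores = decision_scores(features, weights, bias)
    mean_z = sum(sensitive) / len(sensitive)
    return sum(
        (z - mean_z) * s for z, s in zip(sensitive, scores)
    ) / len(sensitive)


def within_fairness_threshold(sensitive, features, weights, threshold, bias=0.0):
    """Check the (assumed) constraint |covariance| <= threshold."""
    return abs(boundary_covariance(sensitive, features, weights, bias)) <= threshold


# Toy data: two features per person and a binary sensitive attribute.
features = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
sensitive = [1, 1, 0, 0]

biased = boundary_covariance(sensitive, features, [1.0, 1.0])
neutral = boundary_covariance(sensitive, features, [1.0, -1.0])
print(biased, neutral)
```

In a constrained training setup, a check like `within_fairness_threshold` would become a constraint on the weights during optimization; tightening the threshold trades accuracy for a boundary that correlates less with the sensitive attribute.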