 Support Vector Machines


Email Spam Filtering: An Implementation with Python and Scikit-learn

@machinelearnbot

Text mining (deriving information from text) is a wide field that has gained popularity with the huge volume of text data being generated. Machine learning models have automated a number of applications, such as sentiment analysis, document classification, topic classification, text summarization, and machine translation. Spam filtering is a beginner's example of a document classification task: it involves classifying an email as spam or non-spam. The Spam box in your Gmail account is the best example of this. So let's get started on building a spam filter on a publicly available mail corpus.
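As a rough illustration of the classification step, here is a minimal sketch of a linear SVM trained on word counts. The article itself uses scikit-learn; this stand-in uses only the standard library, with a Pegasos-style stochastic sub-gradient update on the hinge loss. The toy emails, the function names, and the hyperparameters are all assumptions made for the example, not taken from the article.

```python
import random
from collections import Counter

# Hypothetical toy corpus (label +1 = spam, -1 = non-spam).
EMAILS = [
    ("win free money now", 1),
    ("claim your free prize now", 1),
    ("free offer cheap meds", 1),
    ("project meeting at noon", -1),
    ("meeting notes and report attached", -1),
    ("team lunch after the meeting", -1),
]


def word_counts(text):
    """Bag-of-words features: lowercase, split on whitespace, count."""
    return Counter(text.lower().split())


def train_linear_svm(examples, epochs=50, lam=0.01, seed=0):
    """Bias-free linear SVM fitted by stochastic sub-gradient descent."""
    rng = random.Random(seed)
    data = [(word_counts(text), label) for text, label in examples]
    weights = Counter()
    step = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for counts, label in data:
            step += 1
            eta = 1.0 / (lam * step)  # decaying learning rate
            margin = label * sum(weights[w] * c for w, c in counts.items())
            # Shrink every weight (gradient of the L2 penalty).
            for w in weights:
                weights[w] *= 1.0 - eta * lam
            # Hinge-loss sub-gradient: update only on a margin violation.
            if margin < 1.0:
                for w, c in counts.items():
                    weights[w] += eta * label * c
    return weights


def classify(weights, text):
    """Label a message by the sign of its linear score."""
    score = sum(weights[w] * c for w, c in word_counts(text).items())
    return "spam" if score > 0 else "non-spam"


weights = train_linear_svm(EMAILS)
print(classify(weights, "free prize money"))
print(classify(weights, "meeting report team"))
```

On a real corpus one would normally strip punctuation and stop words, cap the dictionary size, and hold out test mail to measure accuracy.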


Machine Learning Algorithms: A Concise Technical Overview

#artificialintelligence

Whether you are a newcomer to machine learning, a newbie to specific algorithms or concepts, or a seasoned ML vet looking for a once-over of an algorithm you haven't seen or used in a while, these short and to-the-point tutorials may provide the assistance you are looking for. Each of these posts concisely covers a single, specific machine learning concept. Support Vector Machines remain a popular and time-tested classification algorithm. This post provides a high-level concise technical overview of their functionality. A wide array of clustering techniques are in use today.


Supervised Word Sense Disambiguation for Venetan: A Proof-of-Concept Experiment

AAAI Conferences

Word Sense Disambiguation (WSD) is a classification task that consists of determining which of the senses of an ambiguous word is activated in a specific context. Research in this field has primarily concentrated on investigating English and a few other well-resourced languages. Recently, studies done on a corpus of Old English (Wunderlich 2015) showed that, even with limited resources, it is still possible to approach the problem of WSD. In this paper, a WSD system has been developed for the Low Resource Language (LRL) Venetan, which has recently received some attention from the Natural Language Processing (NLP) community. Our main contributions are twofold: first, we select and annotate a corpus for Venetan, considering two words (one abstract and one concrete term) and using two levels of annotation (fine- and coarse-grained), reporting on annotator agreement. Second, we report results of proof-of-concept experiments of supervised WSD performed with Support Vector Machines on this corpus. To our knowledge, our work is the first time that WSD for a European Dialect like Venetan has been studied.


Score Fusion Based Authorship Attribution of Ancient Arabic Texts

AAAI Conferences

In this paper, we investigate the authorship of several short historical texts written by ten ancient Arabic travelers. This Arabic dataset, which was collected by the authors in 2011 and is called the AAAT (Authorship Attribution of Ancient Arabic Texts) corpus, is considered a reference dataset in Arabic. Several authorship attribution experiments are conducted using different features, namely characters, character n-grams, and lexical features such as words, word n-grams, and rare words. Different classifiers are also employed: statistical distances, Multi-Layer Perceptron (MLP), Support Vector Machines (SVM), and Linear Regression (LR). In this investigation, a new fusion technique, called Score Based Fusion (SBF), is proposed to enhance the overall performance of the classifiers. Results show good attribution performance, with an optimal score between 80% and 90% correct authorship attribution. The proposed fusion technique raised this score to 100%. Moreover, this comparative survey has revealed interesting results concerning the Arabic language, particularly for short texts.
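The abstract does not spell out how Score Based Fusion combines the classifiers, so the following is only a generic sketch of score-level fusion, not the authors' rule. It assumes each classifier emits one score per candidate author (higher means more likely), rescales each classifier's scores to [0, 1] with min-max normalization, and averages them. The author labels and score values are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one classifier's scores to [0, 1] so scales are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {author: 0.5 for author in scores}
    return {author: (s - lo) / (hi - lo) for author, s in scores.items()}


def fuse_scores(per_classifier_scores):
    """Average the normalized scores of all classifiers for each author."""
    fused = {}
    for scores in per_classifier_scores:
        for author, s in min_max_normalize(scores).items():
            fused[author] = fused.get(author, 0.0) + s
    n = len(per_classifier_scores)
    return {author: total / n for author, total in fused.items()}


# Hypothetical scores for one disputed text and three candidate authors.
svm_scores = {"author_1": 0.2, "author_2": 0.9, "author_3": -0.5}
mlp_scores = {"author_1": 0.50, "author_2": 0.30, "author_3": 0.20}
# Distances are negated so that higher still means "closer".
distance_scores = {"author_1": -3.0, "author_2": -2.0, "author_3": -6.0}

fused = fuse_scores([svm_scores, mlp_scores, distance_scores])
best_author = max(fused, key=fused.get)
print(best_author, fused)
```

In this made-up case the MLP alone would pick author_1, while the fused score follows the two classifiers that agree on author_2.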


A Text Mining Approach for Anomaly Detection in Application Layer DDoS Attacks

AAAI Conferences

Distributed Denial of Service (DDoS) attacks are a major threat to Internet security, with their use continuing to grow. Attackers are finding more sophisticated methods to attack servers. A lot of defense mechanisms have been proposed for DDoS attacks at IP and TCP layers. Those methods will not work well for application layer DDoS attacks that utilize legitimate application layer requests to overwhelm a webserver. These attacks look legitimate in both packets and protocol characteristics, which makes them harder to detect. In this paper, we propose an anomaly detection method to detect application layer DDoS attacks. We take a text mining approach to extract features which represent a user's HTTP request sequence using bigrams. We apply the one class Support Vector Machine (SVM) algorithm on the extracted features from normal users' HTTP request sequences. The one class SVM labels any newly seen instance that deviates from the normal, trained model as an application layer DDoS instance. We apply our experimental analysis on real web server logs collected from a student resource website. Three different variants of HTTP GET flood attacks are implemented on our server, generated via penetration testing. Our results show that the proposed method is able to detect application layer DDoS attacks with very good performance results.
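The bigram feature extraction described above might look roughly like the sketch below. It covers only the feature step: each session (an ordered list of requested paths) becomes a vector of relative bigram frequencies over a vocabulary built from normal traffic. The request paths, the extra "unseen bigram" component, and the function names are assumptions for illustration; the paper's exact feature layout is not given in the abstract.

```python
from collections import Counter


def request_bigrams(requests):
    """Consecutive pairs of requests in one user's session."""
    return list(zip(requests, requests[1:]))


def build_vocabulary(sessions):
    """All distinct bigrams seen in normal sessions, in a fixed order."""
    return sorted({b for s in sessions for b in request_bigrams(s)})


def bigram_vector(requests, vocabulary):
    """Relative frequency of each known bigram, plus the share of
    bigrams that never occurred in normal traffic (last component)."""
    counts = Counter(request_bigrams(requests))
    total = sum(counts.values()) or 1  # avoid dividing by zero
    known = [counts[b] / total for b in vocabulary]
    unseen = sum(c for b, c in counts.items() if b not in vocabulary)
    return known + [unseen / total]


# Hypothetical sessions from a student resource site.
normal_sessions = [
    ["/", "/login", "/courses", "/courses/101"],
    ["/", "/courses", "/courses/101", "/logout"],
]
vocabulary = build_vocabulary(normal_sessions)
normal_vectors = [bigram_vector(s, vocabulary) for s in normal_sessions]

# A GET flood that hammers a single page looks very different.
flood_vector = bigram_vector(["/search"] * 5, vocabulary)
print(flood_vector)
```

Vectors like `normal_vectors` would then be used to fit a one-class SVM (for instance scikit-learn's `OneClassSVM`), and a new session such as `flood_vector` would be flagged if the fitted model scores it as an outlier.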


E-learning courses on Advanced Analytics, Credit Risk Modeling, and Fraud Analytics

@machinelearnbot

The E-learning course starts by refreshing the basic concepts of the analytics process model: data preprocessing, analytics and post processing. We then discuss decision trees and ensemble methods (bagging, boosting, random forests), neural networks, support vector machines (SVMs), Bayesian networks, survival analysis, social networks, monitoring and backtesting analytical models. Throughout the course, we extensively refer to our industry and research experience. The E-learning course consists of more than 20 hours of movies, each 5 minutes on average. Quizzes are included to facilitate the understanding of the material.


Two Class Support Vector Machine

#artificialintelligence

Two-Class Support Vector Machine is used to create a model that is based on the Support Vector Machine algorithm. The classifier that this module initializes is useful for predicting between two possible outcomes that depend on continuous or categorical predictor variables. This model is a supervised learning method and therefore requires a dataset that includes a label column. You can train the model by providing the model and the tagged dataset as input to Train Model or Tune Model Hyperparameters. The trained model can then be used to predict values for new input examples. Support Vector Machines (SVMs) are supervised learning models that analyze data and recognize patterns.


Iteratively-Reweighted Least-Squares Fitting of Support Vector Machines: A Majorization--Minimization Algorithm Approach

arXiv.org Machine Learning

Support vector machines (SVMs) are an important tool in modern data analysis. Traditionally, support vector machines have been fitted via quadratic programming, either using purpose-built or off-the-shelf algorithms. We present an alternative approach to SVM fitting via the majorization--minimization (MM) paradigm. Algorithms that are derived via MM algorithm constructions can be shown to monotonically decrease their objectives at each iteration, as well as be globally convergent to stationary points. We demonstrate the construction of iteratively-reweighted least-squares (IRLS) algorithms, via the MM paradigm, for SVM risk minimization problems involving the hinge, least-square, squared-hinge, and logistic losses, and 1-norm, 2-norm, and elastic net penalizations. Successful implementations of our algorithms are presented via some numerical examples.
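To make the hinge-loss case concrete, here is a small sketch of one standard MM construction; whether the paper uses exactly this majorizer is an assumption. Write max(0, v) = (|v| + v)/2 and bound |v| by v²/(2c) + c/2 with c = |v_k| at the current iterate. The surrogate is then quadratic in w, so each iteration reduces to a weighted least-squares solve. The code uses only the standard library; the toy data, the penalty `lam * ||w||^2` (applied to the intercept as well), and the `eps` floor on the weights are choices made for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def objective(w, z, lam):
    """Hinge risk plus squared 2-norm penalty."""
    return sum(max(0.0, 1.0 - dot(zi, w)) for zi in z) + lam * dot(w, w)


def irls_hinge_svm(x, y, lam=0.1, iters=30, eps=1e-6):
    # Fold the label into each row and append a constant 1 for the intercept.
    z = [[yi * v for v in xi + [1.0]] for xi, yi in zip(x, y)]
    p = len(z[0])
    w = [0.0] * p
    history = [objective(w, z, lam)]
    for _ in range(iters):
        # MM weights: c_i = |1 - y_i x_i.w| at the current iterate.
        c = [max(abs(1.0 - dot(zi, w)), eps) for zi in z]
        # Normal equations of the quadratic surrogate:
        # (sum_i z_i z_i^T / c_i + 4 lam I) w = sum_i z_i (1 + 1 / c_i)
        a = [[sum(zi[j] * zi[k] / ci for zi, ci in zip(z, c))
              + (4.0 * lam if j == k else 0.0)
              for k in range(p)] for j in range(p)]
        b = [sum(zi[j] * (1.0 + 1.0 / ci) for zi, ci in zip(z, c))
             for j in range(p)]
        w = solve(a, b)
        history.append(objective(w, z, lam))
    return w, history


# Hypothetical, linearly separable toy data.
X = [[2.0, 2.0], [3.0, 1.0], [2.5, 3.0], [-2.0, -1.0], [-1.0, -2.5], [-3.0, -2.0]]
Y = [1, 1, 1, -1, -1, -1]
w, history = irls_hinge_svm(X, Y)
print(w, history[0], history[-1])
```

The recorded `history` should be non-increasing, up to the small slack introduced by `eps`, which is the monotone-descent property the abstract attributes to MM algorithms.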


How Fast Will You Get a Response? Predicting Interval Time for Reciprocal Link Creation

AAAI Conferences

In recent years, reciprocal link prediction has received some attention from data mining and social network analysis researchers, who solved this problem as a binary classification task. However, it is also important to predict the interval time for the creation of a reciprocal link. This is a challenging problem for two reasons. First, effective features are lacking: well-known link prediction features are designed for undirected networks and for the binary classification task, so they do not work well for interval time prediction. Second, the presence of censored data instances makes traditional supervised regression methods unsuitable for solving this problem. In this paper, we propose a solution for the reciprocal link interval time prediction task. We map this problem into a survival analysis framework and show through extensive experiments on real-world datasets that survival analysis methods perform better than traditional regression, neural-network-based models, and support vector regression (SVR).
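To show why censoring matters here, the sketch below implements the Kaplan-Meier estimator, a basic survival analysis tool. The abstract does not say which survival methods the authors use, so this is purely an illustration. A censored instance is a link that had not been reciprocated by the time observation stopped; it still counts as "at risk" up to that time instead of being dropped or treated as an event. The durations are made up.

```python
def kaplan_meier(durations, observed):
    """Return [(time, S(time))] at each time where an event occurred.

    observed[i] is 1 if the reciprocal link was created at durations[i],
    and 0 if the pair was censored (still unreciprocated) at that time.
    """
    records = sorted(zip(durations, observed))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        events = removed = 0
        # Gather every record that shares this time stamp.
        while i < len(records) and records[i][0] == t:
            events += records[i][1]
            removed += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        # Censored and reciprocated pairs both leave the risk set.
        at_risk -= removed
    return curve


# Hypothetical waiting times in days; 0 marks a censored pair.
durations = [2, 3, 3, 5, 8, 8]
observed = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 4))
```

Each value S(t) estimates the probability that a link is still unreciprocated after t days. A plain regression on the four observed durations alone would ignore the two censored pairs and tend to underestimate waiting times.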


Mutual Kernel Matrix Completion

arXiv.org Machine Learning

With the huge influx of various data nowadays, extracting knowledge from them has become an interesting but tedious task among data scientists, particularly when the data come in heterogeneous form and have missing information. Many data completion techniques have been introduced, especially since the advent of kernel methods. However, among the many data completion techniques available in the literature, the problem of mutually completing several incomplete kernel matrices has not yet received much attention. In this paper, we present a new method, called the Mutual Kernel Matrix Completion (MKMC) algorithm, that tackles this problem of mutually inferring the missing entries of multiple kernel matrices by combining the notions of data fusion and kernel matrix completion, applied to biological data sets to be used for a classification task. We first introduce an objective function that is minimized using the EM algorithm, which in turn yields an estimate of the missing entries of the kernel matrices involved. The completed kernel matrices are then combined to produce a model matrix that can be used to further improve the obtained estimates. An interesting result of our study is that the E-step and the M-step are given in closed form, which makes our algorithm efficient in terms of time and memory. After completion, the (completed) kernel matrices are used to train an SVM classifier to test how well the relationships among the entries are preserved. Our empirical results show that the proposed algorithm outperformed the traditional completion techniques in preserving the relationships among the data points and in accurately recovering the missing kernel matrix entries. Thus far, MKMC offers a promising solution to the problem of mutual estimation of a number of relevant incomplete kernel matrices.