Accuracy
Cyberbullying Identification Using Participant-Vocabulary Consistency
With the rise of social media, people can now form relationships and communities easily regardless of location, race, ethnicity, or gender. However, the power of social media simultaneously enables harmful online behavior such as harassment and bullying. Cyberbullying is a serious social problem, making it an important topic in social network analysis. Machine learning methods can potentially help provide better understanding of this phenomenon, but they must address several key challenges: the rapidly changing vocabulary involved in cyberbullying, the role of social network structure, and the scale of the data. In this study, we propose a model that simultaneously discovers instigators and victims of bullying as well as new bullying vocabulary by starting with a corpus of social interactions and a seed dictionary of bullying indicators. We formulate an objective function based on participant-vocabulary consistency. We evaluate this approach on Twitter and Ask.fm data sets and show that the proposed method can detect new bullying vocabulary as well as victims and bullies.
Discriminating sample groups with multi-way data
Lyu, Tianmeng, Lock, Eric F., Eberly, Lynn E.
High-dimensional linear classifiers, such as the support vector machine (SVM) and distance weighted discrimination (DWD), are commonly used in biomedical research to distinguish groups of subjects based on a large number of features. However, their use is limited to applications where a single vector of features is measured for each subject. In practice, data are often multi-way, or measured over multiple dimensions. For example, metabolite abundance may be measured over multiple regions or tissues, or gene expression may be measured over multiple time points, for the same subjects. We propose a framework for linear classification of high-dimensional multi-way data, in which coefficients can be factorized into weights that are specific to each dimension. More generally, the coefficients for each measurement in a multi-way dataset are assumed to have low-rank structure. This framework extends existing classification techniques, and we have implemented multi-way versions of SVM and DWD. We describe informative simulation results, and apply multi-way DWD to data from two very different clinical research studies. The first study uses metabolite magnetic resonance spectroscopy data over multiple brain regions to compare patients with and without spinocerebellar ataxia; the second uses publicly available gene expression time-course data to compare treatment responses for patients with multiple sclerosis. Our method improves performance and simplifies interpretation over naive applications of full-rank linear classification to multi-way data. An R package is available at https://github.com/lockEF/MultiwayClassification .
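As a rough illustration of the factorized coefficients described in this abstract, the sketch below scores one subject whose features form a matrix (for example, regions by metabolites) using rank-1 coefficients B[i][j] = w_rows[i] * w_cols[j]. It is written in Python rather than with the authors' R package, and the function name, weights, and data are made-up assumptions. How the weights would be fitted (the multi-way SVM or DWD step) is not shown.

```python
def multiway_score(X, w_rows, w_cols, intercept=0.0):
    """Linear score for one subject with matrix-valued features X.

    Assumes rank-1 coefficients B[i][j] = w_rows[i] * w_cols[j], so the
    score is intercept + sum over i, j of B[i][j] * X[i][j].
    """
    total = intercept
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            total += w_rows[i] * w_cols[j] * x
    return total


# Hypothetical subject: 2 brain regions x 3 metabolites (values are invented).
X = [[1.0, 2.0, 0.5],
     [0.0, 1.0, 3.0]]
w_regions = [1.0, -1.0]            # one weight per region (assumed)
w_metabolites = [0.5, 1.0, -2.0]   # one weight per metabolite (assumed)

score = multiway_score(X, w_regions, w_metabolites)
label = 1 if score > 0 else -1     # the sign of the score gives the predicted group
```

With 2 regions and 3 metabolites, the rank-1 form needs 2 + 3 weights instead of 2 x 3 free coefficients, which is the source of the simpler interpretation the abstract mentions.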
The Rise of Social Bots July 2016 Communications of the ACM
Bots (short for software robots) have been around since the early days of computers. One compelling example of bots is chatbots, algorithms designed to hold a conversation with a human, as envisioned by Alan Turing in the 1950s [33]. The dream of designing a computer algorithm that passes the Turing test has driven artificial intelligence research for decades, as witnessed by initiatives like the Loebner Prize, awarding progress in natural language processing. Many things have changed since the early days of AI, when bots like Joseph Weizenbaum's ELIZA [39], mimicking a Rogerian psychotherapist, were developed as demonstrations or for delight. Today, social media ecosystems populated by hundreds of millions of individuals present real incentives, including economic and political ones, to design algorithms that exhibit human-like behavior. Such ecosystems also raise the bar of the challenge, as they introduce new dimensions to emulate in addition to content, including the social network, temporal activity, diffusion patterns, and sentiment expression. A social bot is a computer algorithm that automatically produces content and interacts with humans on social media, trying to emulate and possibly alter their behavior. Social bots have inhabited social media platforms for the past few years [7, 24].
A gentle introduction to Naïve Bayes classification using R
Now that we have a model, we can do some predicting. We do this by feeding our test data into our model and comparing the predicted party affiliations with the known ones. The latter is done via the wonderfully named confusion matrix – a table in which true and predicted values for each of the predicted classes are displayed in a matrix format.
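The tutorial itself works in R; as a language-neutral illustration of what a confusion matrix holds, here is a minimal Python sketch. The function name and the tiny set of party labels are invented for the example and are not taken from the tutorial's data.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix) where matrix[i][j] counts cases whose true
    class is labels[i] and whose predicted class is labels[j]."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Invented example: known versus predicted party affiliations.
actual = ["democrat", "republican", "democrat", "republican", "democrat"]
predicted = ["democrat", "republican", "republican", "republican", "democrat"]

labels, matrix = confusion_matrix(actual, predicted)

# Correct predictions sit on the diagonal, so accuracy is diagonal / total.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
```

Rows correspond to the known classes and columns to the predicted ones; off-diagonal cells show which classes the model confuses with each other.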
Large-Scale Kernel Methods for Independence Testing
Zhang, Qinyi, Filippi, Sarah, Gretton, Arthur, Sejdinovic, Dino
Representations of probability measures in reproducing kernel Hilbert spaces provide a flexible framework for fully nonparametric hypothesis tests of independence, which can capture any type of departure from independence, including nonlinear associations and multivariate interactions. However, these approaches come with an at least quadratic computational cost in the number of observations, which can be prohibitive in many applications. Arguably, it is exactly in such large-scale datasets that capturing any type of dependence is of interest, so striking a favourable tradeoff between computational efficiency and test performance for kernel independence tests would have a direct impact on their applicability in practice. In this contribution, we provide an extensive study of the use of large-scale kernel approximations in the context of independence testing, contrasting block-based, Nystrom and random Fourier feature approaches. Through a variety of synthetic data experiments, it is demonstrated that our novel large scale methods give comparable performance with existing methods whilst using significantly less computation time and memory.
NYC Data Science Academy
They are currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from January 11th to April 1st, 2016. This post is based on their fourth class project, Machine Learning (due in the 8th week of the program). The Higgs Boson Challenge, hosted by Kaggle, asked the data science community to use machine learning to accurately predict whether a particle was a Higgs boson or not; more specifically, whether a detected signal was a 'tau tau decay of a Higgs boson' or just 'background'. The datasets provided were the training and test sets, with 250,000 and 550,000 observations, respectively. The training set contained all the same features as the test set, with two additional columns, 'Label' and 'Weight', that provided the correct classifications to help train our models.
An effective approach for classification of advanced malware with high accuracy
Sharma, Ashu, Sahay, Sanjay K.
Combating malware is very important for software/systems security, but protecting software/systems against advanced malware, viz. metamorphic malware, is a challenging task, as it changes its structure/code after each infection. Therefore, in this paper we present a novel approach to detect advanced malware with high accuracy by grouping the executables and analyzing the occurrence of opcodes (features). These groups are based on our earlier finding [1] that the difference between the sizes of any two malware samples generated by popular advanced malware kits, viz. PS-MPC, G2 and NGVCK, is within 5 KB. On the basis of the promising features obtained, we studied the performance of thirteen classifiers using N-fold cross-validation, available in the machine learning tool WEKA. Among these thirteen classifiers, we studied the top five (Random forest, LMT, NBT, J48 and FT) in depth and obtained more than 96.28% accuracy for the detection of unknown malware, which is better than the maximum detection accuracy (95.9%) reported by Santos et al. (2013). Among these top five classifiers, our approach obtained a detection accuracy of 97.95% with Random forest.
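For readers unfamiliar with N-fold cross-validation, the following Python sketch shows how a dataset's indices can be split into folds so that each sample is used for testing exactly once. It is a generic illustration, not WEKA's implementation (which, among other things, can shuffle and stratify the data), and the function name is an assumption made for this example.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each fold, without shuffling.

    When n_samples is not divisible by n_folds, the first folds each get
    one extra sample.
    """
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Hypothetical usage: 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
```

A classifier would be trained on each `train` list and evaluated on the matching `test` list, with the reported accuracy being the average over the folds.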
Grouping the executables to detect malware with high accuracy
Sahay, Sanjay K., Sharma, Ashu
Metamorphic malware variants with the same malicious behavior (family) can obfuscate themselves to look different from each other. This variation in structure leads to a huge signature database for traditional signature-matching techniques. For effective and efficient detection of malware in large numbers of executables, we need to partition these files into groups that can identify their respective families. In addition, the grouping criteria should be chosen in such a way that they can also be applied to unknown files encountered on computers for classification. This paper discusses the study of malware and benign executables in groups to detect unknown malware with high accuracy. We studied the sizes of malware generated by three popular second-generation malware (metamorphic malware) creator kits, viz. G2, PS-MPC and NGVCK, and observed that the size variation between any two malware samples generated by the same kit is small. Hence, we grouped the executables on the basis of malware sizes using the Optimal k-Means Clustering algorithm and used the resulting groups to select promising features for training classifiers (Random forest, J48, LMT, FT and NBT) to detect variants of malware or unknown malware. We find that detecting malware on the basis of their respective file sizes gives accuracy up to 99.11% from the classifiers.
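The size-based grouping step can be sketched with a plain one-dimensional k-means (Lloyd's algorithm) in Python. This is a simplified stand-in for the Optimal k-Means Clustering algorithm named in the abstract, not the authors' implementation. The number of groups is fixed by hand, and the file sizes are invented.

```python
def assign(values, centers):
    """Put each value into the group of its nearest center."""
    groups = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda c: abs(v - centers[c]))
        groups[nearest].append(v)
    return groups


def kmeans_1d(values, k, max_iterations=100):
    """Cluster numbers into k groups; returns (centers, groups)."""
    ordered = sorted(values)
    # Deterministic start: k evenly spaced picks from the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(max_iterations):
        groups = assign(values, centers)
        # Move each center to the mean of its group (keep it if the group is empty).
        updated = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return centers, assign(values, centers)


# Invented executable sizes in KB, forming three obvious size bands.
sizes_kb = [10, 11, 12, 50, 52, 51, 100, 103]
centers, groups = kmeans_1d(sizes_kb, k=3)
```

In the approach the abstract describes, features would then be selected and a classifier trained separately within each size group. An unknown file would be routed to the group whose center is closest to its size.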
AI Boosts Cancer Screens to Nearly 100 Percent Accuracy
Diagnosing cancer is about to get more accurate, with the help of artificial intelligence. Pathologists have diagnosed diseases in more or less the same way for the past 100 years, by laboring over a microscope reviewing biopsy samples on little glass slides. Working almost robotically, they sift through millions of normal cells to identify just a few diseased ones. The task is tedious and prone to human error. But now, scientists and engineers have created a technique that uses artificial intelligence (AI) and can differentiate cancer cells from normal cells almost as well as a top-notch pathologist.
Entry Point Data – Using Python's Sci-packages to Prepare Data for Machine Learning Tasks and other
In this short tutorial I want to provide a brief overview of some of my favorite Python tools for common procedures that serve as entry points for general pattern classification and machine learning tasks, and for various other data analyses. In this section I want to recommend a way of installing the required Python packages if you have not done so yet; otherwise, you can skip this part. Although they can be installed step by step "manually", I highly recommend that you take a look at the Anaconda Python distribution for scientific computing. Anaconda is distributed by Continuum Analytics, but it is completely free and includes more than 195 packages for science and data analysis as of today.