 Performance Analysis


AppsPred: Predicting Context-Aware Smartphone Apps using Random Forest Learning

arXiv.org Machine Learning

Due to the popularity of context-awareness in the Internet of Things (IoT) and the recent advanced features of the most popular IoT device, i.e., the smartphone, modeling and predicting personalized usage behavior based on relevant contexts can be highly useful in assisting users to carry out daily routines and activities. Usage patterns of different categories of smartphone apps, such as social networking, communication, entertainment, or daily life services, usually vary greatly between individuals. People use these apps differently in different contexts, such as temporal context, spatial context, individual mood and preference, work status, Internet connectivity such as Wi-Fi status, or device-related status such as phone profile, battery level, etc. Thus, we consider individuals' app usage as a multi-class context-aware problem for personalized modeling and prediction. Random Forest learning is one of the most popular machine learning techniques for building a multi-class prediction model. Therefore, in this paper, we present an effective context-aware smartphone apps prediction model, named "AppsPred", using a random forest machine learning technique that takes into account the optimal number of trees based on such multi-dimensional contexts to build the resultant forest. The effectiveness of this model is examined by conducting experiments on smartphone apps usage datasets collected from individual users. The experimental results show that our AppsPred significantly outperforms other popular machine learning classification approaches such as ZeroR, Naive Bayes, Decision Tree, Support Vector Machines, and Logistic Regression when predicting smartphone apps in various context-aware test cases.


Locally Optimized Random Forests

arXiv.org Machine Learning

Standard supervised learning procedures are validated against a test set that is assumed to have come from the same distribution as the training data. However, in many problems, the test data may have come from a different distribution. We consider the case of having many labeled observations from one distribution, $P_1$, and making predictions at unlabeled points that come from $P_2$. We combine the high predictive accuracy of random forests (Breiman, 2001) with an importance sampling scheme, where the splits and predictions of the base-trees are done in a weighted manner, which we call Locally Optimized Random Forests. These weights correspond to a non-parametric estimate of the likelihood ratio between the training and test distributions. To estimate these ratios with an unlabeled test set, we make the covariate shift assumption, where the differences in distribution are only a function of the training distributions (Shimodaira, 2000). This methodology is motivated by the problem of forecasting power outages during hurricanes. The extreme nature of the most devastating hurricanes means that typical validation setups will overly favor less extreme storms. Our method provides a data-driven means of adapting a machine learning method to deal with extreme events.
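
As a rough illustration of the importance-weighting idea in the abstract above, the Python sketch below reweights training observations by the density ratio between a test distribution $P_2$ and a training distribution $P_1$. It is not the authors' Locally Optimized Random Forests procedure: the two Gaussian densities, the linear response, and every function name are assumptions made for the example, and the ratio is computed from known densities rather than estimated non-parametrically.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution (assumed form of P_1 and P_2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weights(xs, train_pdf, test_pdf):
    """Return w(x) = p_test(x) / p_train(x), rescaled to average 1."""
    raw = [test_pdf(x) / train_pdf(x) for x in xs]
    mean_w = sum(raw) / len(raw)
    return [w / mean_w for w in raw]


def weighted_mean(values, weights):
    """Weighted average, as a weighted leaf prediction might compute it."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


rng = random.Random(0)

# Hypothetical setup: training covariates from P_1 = N(0, 1),
# test covariates from P_2 = N(1, 1), response y = 2x + 1 (noise-free).
train_x = [rng.gauss(0.0, 1.0) for _ in range(20000)]
train_y = [2.0 * x + 1.0 for x in train_x]

weights = importance_weights(
    train_x,
    train_pdf=lambda x: gaussian_pdf(x, 0.0, 1.0),
    test_pdf=lambda x: gaussian_pdf(x, 1.0, 1.0),
)

unweighted = sum(train_y) / len(train_y)      # estimates the mean of y under P_1
reweighted = weighted_mean(train_y, weights)  # estimates the mean of y under P_2

print(f"unweighted mean: {unweighted:.3f}")
print(f"reweighted mean: {reweighted:.3f}")
```

Under these assumptions the unweighted mean tracks the training distribution, while the reweighted mean moves toward the test distribution. A weighted tree would apply the same kind of weights when choosing splits and when averaging within leaves.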


SynGAN: Towards Generating Synthetic Network Attacks using GANs

arXiv.org Machine Learning

The rapid digital transformation without security considerations has resulted in the rise of global-scale cyberattacks. The first line of defense against these attacks is Network Intrusion Detection Systems (NIDS). Once deployed, however, these systems work as black boxes with a high rate of false positives and no measurable effectiveness. There is a need to continuously test and improve these systems by emulating real-world network attack mutations. We present SynGAN, a framework that generates adversarial network attacks using Generative Adversarial Networks (GANs). SynGAN generates malicious packet flow mutations using real attack traffic, which can improve NIDS attack detection rates. As a first step, we compare two public datasets, NSL-KDD and CICIDS2017, for generating synthetic Distributed Denial of Service (DDoS) network attacks. We evaluate the attack quality (real vs. synthetic) using a gradient boosting classifier.


Machine learning algorithms to infer trait matching and predict species interactions in ecological networks

arXiv.org Machine Learning

Ecologists have long suspected that species are more likely to interact if their traits match in a particular way. For example, a pollination interaction may be particularly likely if the proportions of a bee's tongue match flower shape in a beneficial way. Empirical evidence for trait matching, however, varies significantly in strength among different types of ecological networks. Here, we show that ambiguity among empirical trait matching studies may have arisen at least in part from using overly simple statistical models. Using simulated and real data, we contrast conventional regression models with Machine Learning (ML) models (Random Forest, Boosted Regression Trees, Deep Neural Networks, Convolutional Neural Networks, Support Vector Machines, naive Bayes, and k-Nearest-Neighbor), testing their ability to predict species interactions based on traits, and infer trait combinations causally responsible for species interactions. We find that the best ML models can successfully predict species interactions in plant-pollinator networks (up to 0.93 AUC) and outperform conventional regression models. Our results also demonstrate that ML models can better identify the causally responsible trait matching combinations than GLMs. In two case studies, the best ML models could successfully predict species interactions in a global plant-pollinator database and infer ecologically plausible trait matching rules for a plant-hummingbird network from Costa Rica, without any prior assumptions about the system. We conclude that flexible ML models offer many advantages over traditional regression models for understanding interaction networks. We anticipate that these results extrapolate to other network types, such as trophic or competitive networks. More generally, our results highlight the potential of ML and artificial intelligence for inference beyond standard tasks such as pattern recognition.


How to Predict Hotel Cancellations with Support Vector Machines and ARIMA

#artificialintelligence

Hotel cancellations can cause issues for many businesses in the industry. Not only is there the lost revenue as a result of the customer canceling, but this can also cause difficulty in coordinating bookings and adjusting revenue management practices. Data analytics can help to overcome this issue by identifying the customers who are most likely to cancel, allowing a hotel chain to adjust its marketing strategy accordingly. To investigate how machine learning can aid in this task, the ExtraTreesClassifier, logistic regression, and support vector machine models were employed in Python to determine whether cancellations can be accurately predicted with these models. For this example, both hotels are based in Portugal.



Automatic Language Identification in Texts: A Survey

Journal of Artificial Intelligence Research

Language identification ("LI") is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships between methods clearer. We discuss evaluation methods, applications of LI, as well as off-the-shelf LI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.


E-MIIM: An Ensemble Learning based Context-Aware Mobile Telephony Model for Intelligent Interruption Management

arXiv.org Machine Learning

Nowadays, mobile telephony interruptions in our daily life activities are common because of the inappropriate ringing notifications of incoming phone calls in different contexts. Such interruptions may affect the attention not only of mobile phone owners but also of the people around them. The decision tree is the most popular machine learning classification technique used in the existing context-aware mobile intelligent interruption management (MIIM) model to overcome such issues. However, a context-aware model based on a single decision tree may suffer from overfitting, which decreases the prediction accuracy of the inferred model. Therefore, in this paper, we propose an ensemble machine learning based context-aware mobile telephony model for the purpose of intelligent interruption management by taking into account multi-dimensional contexts, and name it "E-MIIM". The experimental results on individuals' real-life mobile telephony datasets show that our E-MIIM model is more effective and outperforms the existing MIIM model for predicting and managing individuals' mobile telephony interruptions based on their relevant contextual information.
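
To make the contrast between a single tree and an ensemble concrete, here is a minimal Python sketch of bootstrap aggregation (bagging) with majority voting over one-split decision stumps. It is not the E-MIIM model: the two context features (location, time of day), the "ring"/"silent" labels, the toy data, and all function names are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def train_stump(rows, labels):
    """Fit a one-split tree: pick the (feature == value) test with fewest errors."""
    best = None
    for f in range(len(rows[0])):
        for v in sorted(set(r[f] for r in rows)):
            left = [l for r, l in zip(rows, labels) if r[f] == v]
            right = [l for r, l in zip(rows, labels) if r[f] != v]
            left_label = Counter(left).most_common(1)[0][0]
            # If no row fails the test, reuse the left label for the other branch.
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(l != left_label for l in left)
            errors += sum(l != right_label for l in right)
            if best is None or errors < best[0]:
                best = (errors, f, v, left_label, right_label)
    _, f, v, left_label, right_label = best
    return lambda row: left_label if row[f] == v else right_label


def train_bagged_ensemble(rows, labels, n_trees, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(train_stump([rows[i] for i in idx],
                                    [labels[i] for i in idx]))
    return ensemble


def predict(ensemble, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(row) for model in ensemble)
    return votes.most_common(1)[0][0]


# Hypothetical call log: (location, time of day) -> desired phone behavior.
rows = [("meeting", "day"), ("meeting", "night"),
        ("meeting", "day"), ("meeting", "night"),
        ("home", "day"), ("home", "night"),
        ("home", "day"), ("home", "night")]
labels = ["silent"] * 4 + ["ring"] * 4

ensemble = train_bagged_ensemble(rows, labels, n_trees=25)
print(predict(ensemble, ("meeting", "day")))
print(predict(ensemble, ("home", "night")))
```

Because each stump sees a different resample of the data, an unlucky sample affects only that stump's vote. This is the usual motivation for preferring an ensemble over a single tree when overfitting is a concern.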


Predicting the Long-Term Outcomes of Biologics in Psoriasis Patients Using Machine Learning

arXiv.org Machine Learning

Background. Real-world data show that approximately 50% of psoriasis patients treated with a biologic agent will discontinue the drug because of loss of efficacy. History of previous therapy with another biologic, female sex and obesity were identified as predictors of drug discontinuations, but their individual predictive value is low. Objectives. To determine whether machine learning algorithms can produce models that can accurately predict outcomes of biologic therapy in psoriasis at the individual patient level. Results. All tested machine learning algorithms could accurately predict the risk of drug discontinuation and its cause (e.g. lack of efficacy vs adverse event). The learned generalized linear model achieved diagnostic accuracy of 82%, requiring under 2 seconds per patient using the psoriasis patients dataset. Input optimization analysis established a profile of a patient who has the best chances of long-term treatment success: biologic-naive patient under 49 years, early-onset plaque psoriasis without psoriatic arthritis, weight < 100 kg, and moderate-to-severe psoriasis activity (DLQI $\geq$ 16; PASI $\geq$ 10). Moreover, a different generalized linear model is used to predict the length of treatment for each patient with a mean absolute error (MAE) of 4.5 months. However, the Pearson correlation coefficient of 0.935 indicates a strong linear dependence between the actual and predicted treatment lengths. Conclusions. Machine learning algorithms predict the risk of drug discontinuation and treatment duration with accuracy exceeding 80%, based on a small set of predictive variables. This approach can be used as a decision-making tool, for communicating expected outcomes to the patient, and for developing evidence-based guidelines.
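
The abstract above reports two different measures for the treatment-length model, a mean absolute error and a Pearson correlation coefficient. The Python sketch below shows how each is computed and why a several-month MAE can coexist with a high correlation. The treatment lengths are made-up values for illustration, not data from the study, and the function names are assumptions.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical treatment lengths in months (not taken from the paper).
actual = [6.0, 12.0, 24.0, 36.0, 48.0, 60.0]
predicted = [10.0, 8.0, 29.0, 31.0, 53.0, 55.0]

mae = mean_absolute_error(actual, predicted)
r = pearson_r(actual, predicted)
print(f"MAE: {mae:.2f} months, Pearson r: {r:.3f}")
```

MAE is expressed in the units of the outcome (months), whereas Pearson's r only measures how closely the predictions follow a straight-line relationship with the actual values, so predictions that are each off by a few months can still be strongly correlated with the truth.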


Is Data Privacy Real? Don't Bet on It - Knowledge@Wharton

#artificialintelligence

In 2009, Netflix was sued for releasing movie ratings data from half a million subscribers who were identified only by unique ID numbers. The video streaming service divulged this "anonymized" information to the public as part of its Netflix Prize contest, in which participants were asked to use the data to develop a better content recommendation algorithm. But researchers from the University of Texas showed that as few as six movie ratings could be used to identify users. A closet lesbian sued Netflix, saying her anonymity was compromised. The lawsuit was settled in 2010. The Netflix case reveals a problem about which the public is just starting to learn, but that data analysts and computer scientists have known for years. In anonymized datasets where distinguishing characteristics of a person such as name and address have been deleted, even a handful of seemingly innocuous details can lead to identification.