 Performance Analysis


Optimal Personalized Filtering Against Spear-Phishing Attacks

AAAI Conferences

To penetrate sensitive computer networks, attackers can use spear phishing to sidestep technical security mechanisms by exploiting the privileges of careless users. In order to maximize their success probability, attackers have to target the users that constitute the weakest links of the system. The optimal selection of these target users takes into account both the damage that can be caused by a user and the probability of a malicious e-mail being delivered to and opened by a user. Since attackers select their targets in a strategic way, the optimal mitigation of these attacks requires the defender to also personalize the e-mail filters by taking into account the users' properties. In this paper, we assume that a learned classifier is given and propose strategic per-user filtering thresholds for mitigating spear-phishing attacks. We formulate the problem of filtering targeted and non-targeted malicious e-mails as a Stackelberg security game. We characterize the optimal filtering strategies and show how to compute them in practice. Finally, we evaluate our results using two real-world datasets and demonstrate that the proposed thresholds lead to lower losses than non-strategic thresholds.


Mining User Intents in Twitter: A Semi-Supervised Approach to Inferring Intent Categories for Tweets

AAAI Conferences

In this paper, we study the problem of identifying tweets that express intent and classifying them into intent categories. For example, the tweet “I wanna buy a new car” indicates the user’s intent to buy a car. Identifying such intent tweets has great commercial value, among other uses. In particular, it is important to distinguish different types of intent tweets. We propose to classify intent tweets into six categories, namely Food & Drink, Travel, Career & Education, Goods & Services, Event & Activities, and Trifle. We propose a semi-supervised learning approach to categorizing intent tweets into the six categories. We construct a test collection by using a bootstrap method. Our experimental results show that our approach is effective in inferring intent categories for tweets.


Kernel Density Estimation for Text-Based Geolocation

AAAI Conferences

Text-based geolocation classifiers often operate with a grid-based view of the world. Predicting a document's location of origin from its text content on a geodesic grid is computationally attractive, since many standard methods for supervised document classification carry over unchanged to geolocation in the form of predicting the most probable grid cell for a document. However, the grid-based approach suffers from sparse data problems if one wants to improve classification accuracy by moving to smaller cell sizes. In this paper we investigate an enhancement of common methods for determining the geographic point of origin of a text document by kernel density estimation. For geolocation of tweets we obtain improvements upon non-kernel methods on datasets of U.S. and global Twitter content.
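As a rough illustration of the idea in the abstract above, the following sketch smooths a handful of observed locations with a two-dimensional Gaussian kernel instead of counting them in grid cells. It is not the authors' method: the function name `gaussian_kde_2d`, the toy coordinates, the bandwidth, and the treatment of latitude and longitude as flat planar coordinates are all assumptions made for the example.

```python
import math

def gaussian_kde_2d(points, bandwidth):
    """Return a density function built from (lat, lon) points.

    Assumption: coordinates are treated as planar, which ignores the
    curvature of the Earth and is only reasonable for a toy example.
    """
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))

    def density(lat, lon):
        total = 0.0
        for p_lat, p_lon in points:
            sq_dist = (lat - p_lat) ** 2 + (lon - p_lon) ** 2
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return norm * total

    return density

# Hypothetical locations at which some word was observed in training data.
observed = [(40.7, -74.0), (40.8, -73.9), (40.6, -74.1), (34.0, -118.2)]
density = gaussian_kde_2d(observed, bandwidth=1.0)

# The density is highest near the dense cluster and decays smoothly elsewhere,
# so nearby locations share evidence without fixed cell boundaries.
near_cluster = density(40.7, -74.0)
near_single = density(34.0, -118.2)
far_away = density(0.0, 0.0)
```

Because every training point contributes to every query location, shrinking the effective resolution does not leave empty cells the way a finer grid would; the bandwidth plays the role that cell size plays in the grid-based view.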


On the Bayes-optimality of F-measure maximizers

arXiv.org Machine Learning

The F-measure, which was originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure is a statistically and computationally challenging problem, since no closed-form solution exists. Adopting a decision-theoretic perspective, this article provides a formal and experimental analysis of different approaches for maximizing the F-measure. We start with a Bayes-risk analysis of related loss functions, such as Hamming loss and subset zero-one loss, showing that optimizing such losses as a surrogate of the F-measure leads to a high worst-case regret. Subsequently, we perform a similar type of analysis for F-measure maximizing algorithms, showing that such algorithms are approximate, while relying on additional assumptions regarding the statistical distribution of the binary response variables. Furthermore, we present a new algorithm which is not only computationally efficient but also Bayes-optimal, regardless of the underlying distribution. To this end, the algorithm requires only a quadratic (with respect to the number of binary responses) number of parameters of the joint distribution. We illustrate the practical performance of all analyzed methods by means of experiments with multi-label classification problems.
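The gap between a Hamming-loss minimizer and an F-measure maximizer can be seen on a tiny example. The sketch below is a brute-force illustration, not the algorithm proposed in the paper: it assumes three independent binary labels with made-up marginals of 0.4 each, adopts the convention that the F-measure equals 1 when both the truth and the prediction are all zeros, and enumerates every outcome, which is only feasible for a handful of labels.

```python
from itertools import product

def f1(y, h):
    """F-measure of prediction h against truth y (binary tuples)."""
    tp = sum(a and b for a, b in zip(y, h))
    denom = sum(y) + sum(h)
    # Convention assumed here: empty truth and empty prediction score 1.
    return 1.0 if denom == 0 else 2.0 * tp / denom

def expected_f1(h, marginals):
    """Expected F-measure of h, assuming independent labels."""
    total = 0.0
    for y in product((0, 1), repeat=len(marginals)):
        prob = 1.0
        for yi, p in zip(y, marginals):
            prob *= p if yi else 1.0 - p
        total += prob * f1(y, h)
    return total

# Hypothetical marginal probabilities that each label is relevant.
marginals = [0.4, 0.4, 0.4]

# Hamming loss is minimized by thresholding each marginal at 0.5.
hamming_choice = tuple(int(p > 0.5) for p in marginals)

# The expected-F-measure maximizer, found by exhaustive search.
best_choice = max(
    product((0, 1), repeat=len(marginals)),
    key=lambda h: expected_f1(h, marginals),
)
```

With these marginals the Hamming-optimal prediction is all zeros, with an expected F-measure of 0.216, whereas predicting all three labels reaches 0.5104. The exhaustive search is exponential in the number of labels, which is why efficient exact methods are of interest.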


The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification

arXiv.org Machine Learning

We present the Bayesian Case Model (BCM), a general framework for Bayesian case-based reasoning (CBR) and prototype classification and clustering. BCM brings the intuitive power of CBR to a Bayesian generative framework. The BCM learns prototypes, the "quintessential" observations that best represent clusters in a dataset, by performing joint inference on cluster labels, prototypes and important features. Simultaneously, BCM pursues sparsity by learning subspaces, the sets of features that play important roles in the characterization of the prototypes. The prototype and subspace representation provides quantitative benefits in interpretability while preserving classification accuracy. Human subject experiments verify statistically significant improvements to participants' understanding when using explanations produced by BCM, compared to those given by prior art.


A Noise Scaled Semi Parametric Gaussian Process Model for Real Time Water Network Leak Detection in the Presence of Heteroscedasticity

AAAI Conferences

The timely detection of leaks in water distribution systems is critical to the sustainable provision of clean water to consumers. Increasingly, water companies are deploying remote sensors to measure water flow in real time in order to detect such leaks. However, in practice, for typical District Metering Zones (DMZ), financial constraints limit the number of deployable real-time flow sensors/meters to one or two, thus constraining leak detection to be based on the aggregated flow being monitored at these points. Such aggregated flow data typically exhibits input signal dependence, whereby both noise and leaks are dependent on the flow being measured. This limited monitoring and input signal dependence make conventional approaches based on simple thresholds unreliable for real-time leak detection. To address this, we propose a Gaussian process (GP) model with an additive diagonal noise covariance that is able to handle the input-dependent noise observed in this setting. A parameterised mean step change function is used to detect leaks and to estimate their size. Using prior water distribution systems (WDS) knowledge, we dynamically bound and discretize the detection parameters of the step change mean function, reducing and pruning the parameter search space considerably. We evaluate the proposed noise scaled GP (NSGP) against both the latest research work on GP-based fault detection methods and the current state-of-the-art and applied leak detection approaches in water distribution systems. We show that our proposed method outperforms other approaches on real water network data with synthetically generated time-varying leaks, with a detection accuracy of 99%, almost zero false positive detections and the lowest root mean squared error in leak magnitude estimation (0.065 l/s).


Discovering Hotspots and Coldspots of Species Richness in eBird Data

AAAI Conferences

Quantifying biodiversity is an important task related to ecological research. One way to measure biodiversity is through species richness, which measures the number of unique species found in an area. Recently, citizen science biodiversity datasets such as eBird allow the calculation of species richness over an unprecedented spatial and temporal extent. However, several confounding factors associated with the unstructured observation process, such as observer effort, affect the number of species reported by citizen scientists. In this work, we develop an algorithm for discovering hotspots and coldspots of species richness using eBird data while accounting for these confounding factors.
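Species richness as defined above, the number of unique species found in an area, can be sketched in a few lines. The example computes raw richness only and does not correct for observer effort or any other confounding factor, so it is not the algorithm developed in the paper; the cell identifiers, species names, and the function name `species_richness` are made up for illustration.

```python
from collections import defaultdict

def species_richness(checklists):
    """Count unique species per area from (cell_id, species_list) records."""
    seen = defaultdict(set)
    for cell_id, species in checklists:
        seen[cell_id].update(species)
    return {cell_id: len(found) for cell_id, found in seen.items()}

# Hypothetical checklists: each is one observer visit to one grid cell.
checklists = [
    ("cell_a", ["robin", "sparrow"]),
    ("cell_a", ["robin", "heron", "wren"]),
    ("cell_b", ["sparrow"]),
]

richness = species_richness(checklists)
```

Here `cell_a` has two checklists and `cell_b` only one, so part of the difference in raw richness may simply reflect how much observation effort each cell received, which is the kind of confounding the abstract refers to.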


Novel Metaknowledge-based Processing Technique for Multimedia Big Data clustering challenges

arXiv.org Artificial Intelligence

Past research has challenged us with the task of showing relational patterns between text-based data and then clustering for predictive analysis using Golay Code technique. We focus on a novel approach to extract metaknowledge in multimedia datasets. Our collaboration has been an on-going task of studying the relational patterns between datapoints based on metafeatures extracted from metaknowledge in multimedia datasets. Those selected are significant to suit the mining technique we applied, Golay Code algorithm. In this research paper we summarize findings in optimization of metaknowledge representation for 23-bit representation of structured and unstructured multimedia data in order to


What Predicts Media Coverage of Health Science Articles?

AAAI Conferences

An important aspect of health science is communicating research findings to the public. The media is a critical instrument in disseminating research. Yet the process by which a scientific article becomes “newsworthy” is not well understood. In this study, we use large-scale text analysis to characterize the content features of articles that are predictive of newsworthiness. We experiment with two novel corpora: (i) 28,910 articles from a diverse range of biomedical and health journals, of which 1,343 were covered by the news agency Reuters, and (ii) 10,760 articles from the JAMA journals, of which 846 were given press releases by the journal editors. We show that media coverage can be predicted reasonably well: logistic regression achieves mean AUCs of 0.783 and 0.882 on the Reuters and JAMA datasets, respectively. We present and discuss interesting findings concerning the most predictive content features.
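For readers unfamiliar with the AUC figures quoted above, the metric can be read as the probability that a randomly chosen positive example (here, a covered article) receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition; the scores and labels are invented and have no connection to the Reuters or JAMA data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    positives = [s for s, lab in zip(scores, labels) if lab == 1]
    negatives = [s for s, lab in zip(scores, labels) if lab == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical classifier scores and true labels (1 = covered by the media).
example_scores = [0.9, 0.8, 0.3, 0.1]
example_labels = [1, 0, 1, 0]
example_auc = auc(example_scores, example_labels)
```

In this toy case three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75. The pairwise loop is quadratic in the number of examples; rank-based formulations are the usual choice for large datasets.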


Privacy-Utility Trade-Off for Time-Series with Application to Smart-Meter Data

AAAI Conferences

We consider the online setting where a user would like to continuously release a time-series of data that is correlated with his private data to a service provider, in the hope of deriving some utility. Due to correlations, the continual observation of the released time-series puts the user at risk of inference of his private data by an adversary. To protect the user from inference attacks on his private data, the time-series is randomized prior to its release according to a probabilistic privacy mapping. The privacy mapping should be designed in a way that balances privacy and utility requirements over time. Our contributions are threefold. First, we formalize the framework for the design of utility-aware privacy mappings for time-series data, under both online and batch models. We provide a sequential scheme that allows the design of online privacy mappings at scale, accounting for privacy risk from the history of released data and from future releases to come. Second, we prove the equivalence of the optimal mappings under the batch and the online models, in the case where the time-series samples are independent across time. We further show that there exists a gap between optimal batch and online privacy mappings when certain conditions are not satisfied. Finally, we evaluate the performance of the framework over synthetic and real-world time-series data. In particular, we show that smart-meter data can be randomized for privacy purposes to prevent disaggregation of per-device energy consumption, while preserving the utility.