Accuracy
On the Bayes-optimality of F-measure maximizers
Waegeman, Willem, Dembczynski, Krzysztof, Jachnik, Arkadiusz, Cheng, Weiwei, Hullermeier, Eyke
The F-measure, which was originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure is a statistically and computationally challenging problem, since no closed-form solution exists. Adopting a decision-theoretic perspective, this article provides a formal and experimental analysis of different approaches for maximizing the F-measure. We start with a Bayes-risk analysis of related loss functions, such as Hamming loss and subset zero-one loss, showing that optimizing such losses as a surrogate of the F-measure leads to a high worst-case regret. Subsequently, we perform a similar type of analysis for F-measure maximizing algorithms, showing that such algorithms are approximate, while relying on additional assumptions regarding the statistical distribution of the binary response variables. Furthermore, we present a new algorithm which is not only computationally efficient but also Bayes-optimal, regardless of the underlying distribution. To this end, the algorithm requires only a quadratic (with respect to the number of binary responses) number of parameters of the joint distribution. We illustrate the practical performance of all analyzed methods by means of experiments with multi-label classification problems.
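To make the gap between Hamming-loss and F-measure optimization concrete, the following is a minimal sketch, not the algorithm from the paper. It enumerates all 2^m predictions by brute force, which is only feasible for tiny m. The toy joint distribution, the function names, and the convention that F = 1 when both vectors are all zeros are assumptions made for illustration.

```python
from itertools import product

def f_measure(y, h):
    """F1 between two binary tuples; assumed to be 1.0 when both are all zeros."""
    denom = sum(y) + sum(h)
    if denom == 0:
        return 1.0
    return 2.0 * sum(a * b for a, b in zip(y, h)) / denom

def expected_f(h, joint):
    """Expected F-measure of prediction h under a joint distribution {y: p(y)}."""
    return sum(p * f_measure(y, h) for y, p in joint.items())

def brute_force_f_maximizer(joint):
    """Exhaustive search over all 2^m predictions (exponential; illustration only)."""
    m = len(next(iter(joint)))
    return max(product((0, 1), repeat=m), key=lambda h: expected_f(h, joint))

def marginal_mode(joint):
    """Hamming-loss minimizer: threshold each marginal probability at 0.5."""
    m = len(next(iter(joint)))
    marginals = [sum(p for y, p in joint.items() if y[i]) for i in range(m)]
    return tuple(int(q > 0.5) for q in marginals)

# Hypothetical toy distribution: exactly one of three labels is relevant.
joint = {(1, 0, 0): 0.4, (0, 1, 0): 0.3, (0, 0, 1): 0.3}

h_hamming = marginal_mode(joint)        # every marginal is below 0.5
h_f = brute_force_f_maximizer(joint)    # maximizes expected F directly
```

On this toy distribution the Hamming-optimal prediction is the all-zeros vector, whose expected F-measure is 0, whereas the exhaustive search selects the all-ones vector with expected F-measure 0.5.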
The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification
Kim, Been, Rudin, Cynthia, Shah, Julie
We present the Bayesian Case Model (BCM), a general framework for Bayesian case-based reasoning (CBR) and prototype classification and clustering. BCM brings the intuitive power of CBR to a Bayesian generative framework. The BCM learns prototypes, the "quintessential" observations that best represent clusters in a dataset, by performing joint inference on cluster labels, prototypes and important features. Simultaneously, BCM pursues sparsity by learning subspaces, the sets of features that play important roles in the characterization of the prototypes. The prototype and subspace representation provides quantitative benefits in interpretability while preserving classification accuracy. Human subject experiments verify statistically significant improvements to participants' understanding when using explanations produced by BCM, compared to those given by prior art.
A Noise Scaled Semi Parametric Gaussian Process Model for Real Time Water Network Leak Detection in the Presence of Heteroscedasticity
Malik, Obaid (University of Southampton) | Ghosh, Siddhartha (University of Southampton) | Rogers, Alex (University of Southampton)
The timely detection of leaks in water distribution systems is critical to the sustainable provision of clean water to consumers. Increasingly, water companies are deploying remote sensors to measure water flow in real time in order to detect such leaks. However, in practice, for typical District Metering Zones (DMZ), financial constraints limit the number of deployable real-time flow sensors/meters to one or two, thus constraining leak detection to be based on the aggregated flow being monitored at these points. Such aggregated flow data typically exhibits input signal dependence, whereby both noise and leaks are dependent on the flow being measured. This limited monitoring and input signal dependence make conventional approaches based on simple thresholds unreliable for real-time leak detection. To address this, we propose a Gaussian process (GP) model with an additive diagonal noise covariance that is able to handle the input-dependent noise observed in this setting. A parameterised mean step change function is used to detect leaks and to estimate their size. Using prior water distribution systems (WDS) knowledge, we dynamically bound and discretize the detection parameters of the step change mean function, reducing and pruning the parameter search space considerably. We evaluate the proposed noise scaled GP (NSGP) against both the latest research work on GP-based fault detection methods and the current state-of-the-art and applied leak detection approaches in water distribution systems. We show that our proposed method outperforms other approaches on real water network data with synthetically generated time-varying leaks, with a detection accuracy of 99%, almost zero false positive detections and the lowest root mean squared error in leak magnitude estimation (0.065 l/s).
Discovering Hotspots and Coldspots of Species Richness in eBird Data
Moore, Travis (Oregon State University) | Wong, Weng-Keen (Oregon State University)
Quantifying biodiversity is an important task related to ecological research. One way to measure biodiversity is through species richness, which measures the number of unique species found in an area. Recently, citizen science biodiversity datasets such as eBird allow the calculation of species richness over an unprecedented spatial and temporal extent. However, several confounding factors associated with the unstructured observation process, such as observer effort, affect the number of species reported by citizen scientists. In this work, we develop an algorithm for discovering hotspots and coldspots of species richness using eBird data while accounting for these confounding factors.
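As a point of reference for the definition above, the following minimal sketch computes naive species richness (the number of unique species reported per site). It does not correct for observer effort or any other confounding factor, so it is not the algorithm the abstract describes. The record layout and site and species names are hypothetical.

```python
from collections import defaultdict

def species_richness(observations):
    """Map each site to its number of unique reported species.

    observations: iterable of (site, species) pairs (assumed record layout).
    """
    seen = defaultdict(set)
    for site, species in observations:
        seen[site].add(species)
    return {site: len(names) for site, names in seen.items()}

# Hypothetical checklist records; repeated sightings count only once.
records = [
    ("site_a", "american robin"),
    ("site_a", "american robin"),
    ("site_a", "barn owl"),
    ("site_b", "barn owl"),
]
richness = species_richness(records)
```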
What Predicts Media Coverage of Health Science Articles?
Wallace, Byron C. (University of Texas at Austin) | Paul, Michael J. (Johns Hopkins University) | Elhadad, Noémie (Columbia University)
An important aspect of health science is communicating research findings to the public. The media is a critical instrument in disseminating research. Yet the process by which a scientific article becomes “newsworthy” is not well understood. In this study, we use large-scale text analysis to characterize the content features of articles that are predictive of newsworthiness. We experiment with two novel corpora: (i) 28,910 articles from a diverse range of biomedical and health journals, of which 1,343 were covered by the news agency Reuters, and (ii) 10,760 articles from the JAMA journals, of which 846 were given press releases by the journal editors. We show that media coverage can be predicted reasonably well: logistic regression achieves mean AUCs of 0.783 and 0.882 on the Reuters and JAMA datasets, respectively. We present and discuss interesting findings concerning the most predictive content features.
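For readers unfamiliar with the AUC figures quoted above, the following minimal sketch computes AUC as the fraction of (positive, negative) pairs that a classifier's scores rank correctly, counting ties as half. It illustrates the metric only; the labels and scores are made up and unrelated to the Reuters or JAMA corpora.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (O(n_pos * n_neg))."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A pair counts 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = covered by the media) and classifier scores.
example_auc = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

In this toy example three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75.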
Privacy-Utility Trade-Off for Time-Series with Application to Smart-Meter Data
Erdogdu, Murat A. (Stanford University) | Fawaz, Nadia (Technicolor) | Montanari, Andrea (Stanford University)
We consider the online setting where a user would like to continuously release a time-series of data that is correlated with his private data to a service provider, in the hope of deriving some utility. Due to correlations, the continual observation of the released time-series puts the user at risk of inference of his private data by an adversary. To protect the user from inference attacks on his private data, the time-series is randomized prior to its release according to a probabilistic privacy mapping. The privacy mapping should be designed in a way that balances privacy and utility requirements over time. Our contributions are threefold. First, we formalize the framework for the design of utility-aware privacy mappings for time-series data, under both online and batch models. We provide a sequential scheme for designing online privacy mappings at scale that account for privacy risk from both the history of released data and future releases. Second, we prove the equivalence of the optimal mappings under the batch and the online models in the case where the time-series samples are independent across time. We further show that there exists a gap between optimal batch and online privacy mappings when certain conditions are not satisfied. Finally, we evaluate the performance of the framework over synthetic and real-world time-series data. In particular, we show that smart-meter data can be randomized for privacy purposes to prevent disaggregation of per-device energy consumption, while preserving the utility.
Interactive Multi-Consumer Power Cooperatives with Learning and Axiomatic Cost and Risk Disaggregation
Ehsanfar, Abbas (Stevens Institute of Technology) | Heydari, Babak (Stevens Institute of Technology)
This paper introduces a novel autonomous interactive learning cooperative (ILCP) that receives the expected value and variance of load from consumers and participates in the electricity market on their behalf. Using an axiomatic approach, the share of each consumer's payment as well as its weight in calculating the modification of total day-ahead load are formulated. This scheme applies double-seasonal exponential smoothing, a recent load forecasting technique, and a classifier (Gaussian Naïve Bayes) for real-time to day-ahead price direction forecasting. In addition, the ILCP employs interactive cooperative algorithms for both the trading cooperative and the consumer side. The ILCP scheme is investigated and its performance is compared to those of non-cooperative real-time pricing (RTP), LCP (non-interactive learning cooperative) and CP (non-interactive non-learning cooperative). The developed system was implemented using PJM (the world's largest wholesale electricity market) real-time and day-ahead data for 2013 and half of 2014; real load profiles were selected from a set of 579 residential and commercial consumers, and weather data were applied to forecasting electricity price direction. We demonstrate the advantages of ILCP in lowering the average electricity cost and reducing unit price variations.
Novel Metaknowledge-based Processing Technique for Multimedia Big Data clustering challenges
Bari, Nima, Vichr, Roman, Kowsari, Kamran, Berkovich, Simon Y.
Past research has challenged us with the task of showing relational patterns between text-based data and then clustering for predictive analysis using the Golay Code technique. We focus on a novel approach to extract metaknowledge in multimedia datasets. Our collaboration has been an ongoing task of studying the relational patterns between datapoints based on metafeatures extracted from metaknowledge in multimedia datasets. Those selected are significant to suit the mining technique we applied, the Golay Code algorithm. In this research paper we summarize findings in optimization of metaknowledge representation for 23-bit representation of structured and unstructured multimedia data in order to
Scalable Variational Inference in Log-supermodular Models
Djolonga, Josip, Krause, Andreas
We consider the problem of approximate Bayesian inference in log-supermodular models. These models encompass regular pairwise MRFs with binary variables, but also allow capturing high-order interactions, which are intractable for existing approximate inference techniques such as belief propagation, mean field, and variants. We show that a recently proposed variational approach to inference in log-supermodular models, L-FIELD, reduces to the widely studied minimum norm problem for submodular minimization. This insight allows us to leverage powerful existing tools, and hence to solve the variational problem orders of magnitude more efficiently than previously possible. We then provide another natural interpretation of L-FIELD, demonstrating that it exactly minimizes a specific type of Rényi divergence measure. This insight sheds light on the nature of the variational approximations produced by L-FIELD. Furthermore, we show how to perform parallel inference as message passing in a suitable factor graph at a linear convergence rate, without having to sum over all the configurations of the factor. Finally, we apply our approach to a challenging image segmentation task. Our experiments confirm the scalability of our approach, the high quality of the marginals, and the benefit of incorporating higher-order potentials.
Classification approach based on association rules mining for unbalanced data
Ndour, Cheikh, Diop, Aliou, Dossou-Gbété, Simplice
This paper deals with the binary classification task when the target class has the lower probability of occurrence. In such a situation, it is not possible to build a powerful classifier by using standard methods such as logistic regression, classification trees, discriminant analysis, etc. To overcome this shortcoming of these methods, which yield classifiers with low sensitivity, we tackle the classification problem through an approach based on association rule learning. This approach has the advantage of allowing the identification of patterns that are well correlated with the target class. Association rule learning is a well-known method in the area of data mining. It is used when dealing with large databases for the unsupervised discovery of local patterns that express hidden relationships between input variables. By considering association rules from a supervised learning point of view, a relevant set of weak classifiers is obtained, from which one derives a classifier that performs well.
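The two standard quantities used to judge a class association rule of the form "antecedent implies target class" are its support and its confidence. The following minimal sketch computes both; it is not the classifier construction from the paper. The row layout, the function name `rule_stats`, and the toy data are assumptions made for illustration.

```python
def rule_stats(rows, antecedent, target):
    """Return (support, confidence) of the rule: antecedent -> target.

    rows: list of (set_of_items, label) pairs (assumed layout).
    support    = fraction of all rows that contain the antecedent and have the target label
    confidence = fraction of rows containing the antecedent that have the target label
    """
    n = len(rows)
    covered = [label for items, label in rows if antecedent <= items]
    hits = sum(1 for label in covered if label == target)
    support = hits / n if n else 0.0
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence

# Hypothetical data: label 1 is the rare target class.
rows = [
    ({"a", "b"}, 1),
    ({"a"}, 0),
    ({"a", "b"}, 1),
    ({"c"}, 0),
]
support, confidence = rule_stats(rows, {"a", "b"}, target=1)
```

Here the rule {a, b} -> 1 covers two of the four rows, both with the target label, so its support is 0.5 and its confidence is 1.0. A rule with high confidence for the rare class can act as one of the weak classifiers mentioned above.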