 Accuracy


Multimodal sensor data fusion for in-situ classification of animal behavior using accelerometry and GNSS data

arXiv.org Artificial Intelligence

In this paper, we examine the use of data from multiple sensing modes, i.e., accelerometry and global navigation satellite system (GNSS), for classifying animal behavior. We extract three new features from the GNSS data, namely, distance from water point, median speed, and median estimated horizontal position error. We combine the information available from the accelerometry and GNSS data via two approaches. The first approach is based on concatenating the features extracted from the data of both sensors and feeding the concatenated feature vector into a multi-layer perceptron (MLP) classifier. The second approach is based on fusing the posterior probabilities predicted by two MLP classifiers. The input to each classifier is the features extracted from the data of one sensing mode. We evaluate the performance of the developed multimodal animal behavior classification algorithms using two real-world datasets collected via smart cattle collar tags and ear tags. The leave-one-animal-out cross-validation results show that both approaches improve the classification performance appreciably compared with using data of only one sensing mode. This is more notable for the infrequent but important behaviors of walking and drinking. The algorithms developed based on both approaches require little computational and memory resources and are hence suitable for implementation on the embedded systems of our collar tags and ear tags. However, the multimodal animal behavior classification algorithm based on posterior probability fusion is preferable to the one based on feature concatenation as it delivers better classification accuracy, has less computational and memory complexity, is more robust to sensor data failure, and enjoys better modularity.
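The abstract above does not state which rule is used to fuse the two classifiers' posterior probabilities. As a minimal sketch, assuming an element-wise product followed by renormalisation (one common choice, not necessarily the authors'), the fusion step could look like this. The function name `fuse_posteriors` and the example class order are hypothetical.

```python
def fuse_posteriors(p_accel, p_gnss):
    """Fuse two per-class posterior vectors (assumed rule: product, then renormalise)."""
    if len(p_accel) != len(p_gnss):
        raise ValueError("both classifiers must score the same set of classes")
    # Element-wise product of the two posteriors.
    joint = [a * b for a, b in zip(p_accel, p_gnss)]
    total = sum(joint)
    if total == 0.0:
        # The classifiers fully disagree; fall back to a plain average.
        return [(a + b) / 2.0 for a, b in zip(p_accel, p_gnss)]
    return [j / total for j in joint]


def predict_class(p_accel, p_gnss):
    """Return the index of the class with the highest fused posterior."""
    fused = fuse_posteriors(p_accel, p_gnss)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical class order: [grazing, walking, drinking]
accel_posterior = [0.5, 0.3, 0.2]
gnss_posterior = [0.2, 0.2, 0.6]
print(predict_class(accel_posterior, gnss_posterior))
```

Because each classifier only sees its own sensing mode, a sketch like this can fall back to a single classifier's posterior when the other sensor fails, which is in line with the robustness and modularity the abstract attributes to the fusion approach.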


Rhino: Deep Causal Temporal Relationship Learning With History-dependent Noise

arXiv.org Artificial Intelligence

Discovering causal relationships between different variables from time series data has been a long-standing challenge for many domains such as climate science, finance, and healthcare. Given the complexity of real-world relationships and the nature of observations in discrete time, causal discovery methods need to consider non-linear relations between variables, instantaneous effects and history-dependent noise (the change of noise distribution due to past actions). However, previous works do not offer a solution addressing all these problems together. In this paper, we propose a novel causal relationship learning framework for time-series data, called Rhino, which combines vector auto-regression, deep learning and variational inference to model non-linear relationships with instantaneous effects while allowing the noise distribution to be modulated by historical observations. Theoretically, we prove the structural identifiability of Rhino. Our empirical results from extensive synthetic experiments and two real-world benchmarks demonstrate better discovery performance compared to relevant baselines, with ablation studies revealing its robustness under model misspecification.


Extending F1 metric, probabilistic approach

arXiv.org Artificial Intelligence

This article explores an extension of the well-known F1 score used for assessing the performance of binary classifiers. We propose a new metric using a probabilistic interpretation of precision, recall, specificity, and negative predictive value. We describe its properties and compare it to common metrics. Then we demonstrate its behavior in edge cases of the confusion matrix. Finally, the properties of the metric are tested on a binary classifier trained on a real dataset.
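For reference, the four quantities named above and the standard F1 score can be computed from a binary confusion matrix as sketched below. This shows only the standard definitions, not the extended metric the article proposes, whose formula is not given in the abstract. The function name `binary_metrics` and the convention of returning 0.0 for an undefined ratio are assumptions made for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics for a binary classifier."""

    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)    # P(actual positive | predicted positive)
    recall = ratio(tp, tp + fn)       # P(predicted positive | actual positive)
    specificity = ratio(tn, tn + fp)  # P(predicted negative | actual negative)
    npv = ratio(tn, tn + fn)          # P(actual negative | predicted negative)
    # F1 is the harmonic mean of precision and recall; it ignores true negatives.
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "npv": npv,
        "f1": f1,
    }


print(binary_metrics(tp=8, fp=2, fn=2, tn=88))
# Edge case: the classifier never predicts the positive class.
print(binary_metrics(tp=0, fp=0, fn=5, tn=95))
```

The second call illustrates one kind of confusion-matrix edge case: F1 drops to zero while specificity and negative predictive value remain high, because F1 does not use the true-negative count.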


Conformal Off-Policy Prediction in Contextual Bandits

arXiv.org Artificial Intelligence

Most off-policy evaluation methods for contextual bandits have focused on the expected outcome of a policy, which is estimated via methods that at best provide only asymptotic guarantees. However, in many applications, the expectation may not be the best measure of performance as it does not capture the variability of the outcome. In addition, particularly in safety-critical settings, stronger guarantees than asymptotic correctness may be required. To address these limitations, we consider a novel application of conformal prediction to contextual bandits. Given data collected under a behavioral policy, we propose conformal off-policy prediction (COPP), which can output reliable predictive intervals for the outcome under a new target policy. We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup, and empirically demonstrate the utility of COPP compared with existing methods on synthetic and real-world data.


ProVe: A Pipeline for Automated Provenance Verification of Knowledge Graphs against Textual Sources

arXiv.org Artificial Intelligence

A Knowledge Graph (KG) is a type of knowledge base that stores information in the form of semantic triples formed by a subject, a predicate, and an object. KGs represent both real and abstract entities internally as labelled and uniquely identifiable entities, such as The Moon or Happiness, and can amass information from a multitude of domains and sources by connecting such entities amongst themselves or to literals through relationships, coded via uniquely identified predicates. KGs serve as sources of both human and machine-readable semantically structured data for various crucial applications in the modern web landscape, such as Wikipedia infoboxes, search engine results, voice-activated assistants, and information gathering projects [30]. Developed and maintained by ontology experts, data curators, and even anonymous volunteers, KGs have massively grown in size and adoption in the last decade, mainly as secondary sources of information. This means not storing new information, but taking it from authoritative and reliable sources which are explicitly referenced. As such, KGs depend on well-documented and verifiable provenance to ensure they are regarded as trustworthy and usable [56]. Processes to assess and assure the quality of information provenance are thus crucial to KGs, especially measuring and maintaining verifiability, i.e. the degree to which consumers of KG triples can attest these are truly supported by their sources [56]. However, such processes are currently performed mostly manually, which does not scale with size. Manually ensuring high verifiability on vital KGs such as Wikidata and DBpedia is prohibitive due to their sheer size.


Similarity between Units of Natural Language: The Transition from Coarse to Fine Estimation

arXiv.org Artificial Intelligence

Capturing the similarities between human language units is crucial for explaining how humans associate different objects, and therefore its computation has received extensive attention, research, and applications. With the ever-increasing amount of information around us, calculating similarity becomes increasingly complex. In many cases, such as legal or medical affairs, measuring similarity requires extra care and precision, as small acts within a language unit can have significant real-world effects. My research goal in this thesis is to develop regression models that account for similarities between language units in a more refined way. Computation of similarity has come a long way, but approaches to debugging the measures are often based on continually fitting human judgment values. To this end, my goal is to develop an algorithm that precisely catches loopholes in a similarity calculation. Furthermore, most methods have vague definitions of the similarities they compute and are often difficult to interpret. The proposed framework addresses both shortcomings. It constantly improves the model by catching different loopholes. In addition, every refinement of the model provides a reasonable explanation. The regression model introduced in this thesis is called progressively refined similarity computation, which combines attack testing with adversarial training. The similarity regression model of this thesis achieves state-of-the-art performance in handling edge cases.


On the pragmatism of using binary classifiers over data intensive neural network classifiers for detection of COVID-19 from voice

arXiv.org Artificial Intelligence

Lately, there has been a global effort by multiple research groups to detect COVID-19 from voice. Different researchers use different kinds of information from the voice signal to achieve this. Various types of phonated sounds and the sound of cough and breath have all been used with varying degree of success in automated voice based COVID-19 detection apps. In this paper, we show that detecting COVID-19 from voice does not require custom made nonstandard features or complicated neural network classifiers rather it can be successfully done with just standard features and simple binary ...

In a self-assessment study, COVID patients reported difficulty producing certain voiced sounds and noticed changes in their voice [8]. Consequently, a number of research groups around the world have initiated efforts on attempting to diagnose potential Covid infections from recordings of vocalizations [9, 5]. While most groups have focused on cough sounds [10, 11, 12] as they are a frequent symptom of Covid-19, several groups have also considered other vocalizations, such as breathing sounds [10, 13], extended vowels [14, 15, 16], and counts. Yet other teams have analyzed free-form speech such as those obtainable from YouTube recordings [17].


LaundroGraph: Self-Supervised Graph Representation Learning for Anti-Money Laundering

arXiv.org Artificial Intelligence

Anti-money laundering (AML) regulations mandate financial institutions to deploy AML systems based on a set of rules that, when triggered, form the basis of a suspicious alert to be assessed by human analysts. Reviewing these cases is a cumbersome and complex task that requires analysts to navigate a large network of financial interactions to validate suspicious movements. Furthermore, these systems have very high false positive rates (estimated to be over 95%). The scarcity of labels hinders the use of alternative systems based on supervised learning, reducing their applicability in real-world applications. In this work we present LaundroGraph, a novel self-supervised graph representation learning approach to encode banking customers and financial transactions into meaningful representations. These representations are used to provide insights to assist the AML reviewing process, such as identifying anomalous movements for a given customer. LaundroGraph represents the underlying network of financial interactions as a customer-transaction bipartite graph and trains a graph neural network on a fully self-supervised link prediction task. We empirically demonstrate that our approach outperforms other strong baselines on self-supervised link prediction using a real-world dataset, improving the best non-graph baseline by 12 p.p. of AUC. The goal is to increase the efficiency of the reviewing process by supplying these AI-powered insights to the analysts upon review. To the best of our knowledge, this is the first fully self-supervised system within the context of AML detection.


An Intelligent Decision Support Ensemble Voting Model for Coronary Artery Disease Prediction in Smart Healthcare Monitoring Environments

arXiv.org Artificial Intelligence

Coronary artery disease (CAD) is one of the most common cardiac diseases worldwide and causes disability and economic burden. It is the world's leading and most serious cause of mortality, with approximately 80% of deaths reported in low- and middle-income countries. The preferred and most precise diagnostic tool for CAD is angiography, but it is invasive, expensive, and technically demanding. However, the research community is increasingly interested in the computer-aided diagnosis of CAD via the utilization of machine learning (ML) methods. The purpose of this work is to present an e-diagnosis tool based on ML algorithms that can be used in a smart healthcare monitoring system. We applied the most accurate machine learning methods that have shown superior results in the literature, such as RandomForest, XGBoost, MLP, J48, AdaBoost, NaiveBayes, LogitBoost, and KNN, to different medical datasets. Every single classifier can be efficient on a different dataset. Thus, an ensemble model using majority voting was designed to take advantage of the well-performing single classifiers. Ensemble learning aims to combine the forecasts of multiple individual classifiers to achieve higher performance than individual classifiers in terms of precision, specificity, sensitivity, and accuracy. Furthermore, we have benchmarked our proposed model against the most efficient and well-known ensemble models, such as Bagging and Stacking methods, based on the cross-validation technique. The experimental results confirm that the ensemble majority voting approach based on the top 3 classifiers, MultilayerPerceptron, RandomForest, and AdaBoost, achieves the highest accuracy of 88.12% and outperforms all other classifiers. This study demonstrates that the majority voting ensemble approach proposed above is the most accurate machine learning classification approach for the prediction and detection of coronary artery disease.
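The majority voting step described above can be sketched in a few lines. This is a generic hard-voting illustration under stated assumptions, not the authors' implementation: the function names `majority_vote` and `ensemble_predict`, the label strings, and the tie-breaking rule (the earliest classifier's label wins) are all hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the earliest classifier's label."""
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label
    raise ValueError("labels must not be empty")


def ensemble_predict(per_classifier_predictions):
    """Combine per-classifier prediction lists into one prediction per sample."""
    # zip(*...) regroups the predictions so each tuple holds all votes for one sample.
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]


# Hypothetical outputs of three base classifiers on four patients.
mlp_preds = ["CAD", "healthy", "CAD", "healthy"]
forest_preds = ["CAD", "CAD", "healthy", "healthy"]
adaboost_preds = ["healthy", "CAD", "CAD", "healthy"]

print(ensemble_predict([mlp_preds, forest_preds, adaboost_preds]))
```

With an odd number of classifiers and two classes, as in the top-3 ensemble the abstract describes, ties cannot occur, so the tie-breaking rule only matters for other configurations.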


Exploring the Whole Rashomon Set of Sparse Decision Trees

arXiv.org Artificial Intelligence

In any given machine learning problem, there might be many models that explain the data almost equally well. However, most learning algorithms return only one of these models, leaving practitioners with no practical way to explore alternative models that might have desirable properties beyond what could be expressed by a loss function. The Rashomon set is the set of all these almost-optimal models. Rashomon sets can be large in size and complicated in structure, particularly for highly nonlinear function classes that allow complex interaction terms, such as decision trees. We provide the first technique for completely enumerating the Rashomon set for sparse decision trees; in fact, our work provides the first complete enumeration of any Rashomon set for a non-trivial problem with a highly nonlinear discrete function class. This allows the user an unprecedented level of control over model choice among all models that are approximately equally good. We represent the Rashomon set in a specialized data structure that supports efficient querying and sampling. We show three applications of the Rashomon set: 1) it can be used to study variable importance for the set of almost-optimal trees (as opposed to a single tree), 2) the Rashomon set for accuracy enables enumeration of the Rashomon sets for balanced accuracy and F1-score, and 3) the Rashomon set for a full dataset can be used to produce Rashomon sets constructed with only subsets of the dataset. Thus, we are able to examine Rashomon sets across problems with a new lens, enabling users to choose models rather than be at the mercy of an algorithm that produces only a single model.