Goto

Collaborating Authors

 Performance Analysis


Leveraging Multi-level Dependency of Relational Sequences for Social Spammer Detection

arXiv.org Machine Learning

Much recent research has shed light on the development of the relation-dependent but content-independent framework for social spammer detection. This is largely because the relation among users is difficult to be altered when spammers attempt to conceal their malicious intents. Our study investigates the spammer detection problem in the context of multi-relation social networks, and makes an attempt to fully exploit the sequences of heterogeneous relations for enhancing the detection accuracy. Specifically, we present the Multi-level Dependency Model (MDM). The MDM is able to exploit user's long-term dependency hidden in their relational sequences along with short-term dependency. Moreover, MDM fully considers short-term relational sequences from the perspectives of individual-level and union-level, due to the fact that the type of short-term sequences is multi-folds. Experimental results on a real-world multi-relational social network demonstrate the effectiveness of our proposed MDM on multi-relational social spammer detection.


A Vertical Federated Learning Method for Interpretable Scorecard and Its Application in Credit Scoring

arXiv.org Machine Learning

With the success of big data and artificial intelligence in many fields, the applications of big data driven models are expected in financial risk management especially credit scoring and rating. Under the premise of data privacy protection, we propose a projected gradient-based method in the vertical federated learning framework for the traditional scorecard, which is based on logistic regression with bounded constraints, namely FL-LRBC. The latter enables multiple agencies to jointly train an optimized scorecard model in a single training session. It leads to the formation of the model with positive coefficients, while the time-consuming parameter-tuning process can be avoided. Moreover, the performance in terms of both AUC and the Kolmogorov-Smirnov (KS) statistics is significantly improved due to data enrichment using FL-LRBC. At present, FL-LRBC has already been applied to credit business in a China nation-wide financial holdings group.


Fairness Constraints in Semi-supervised Learning

arXiv.org Machine Learning

Fairness in machine learning has received considerable attention. However, most studies on fair learning focus on either supervised learning or unsupervised learning. Very few consider semi-supervised settings. Yet, in reality, most machine learning tasks rely on large datasets that contain both labeled and unlabeled data. One of key issues with fair learning is the balance between fairness and accuracy. Previous studies arguing that increasing the size of the training set can have a better trade-off. We believe that increasing the training set with unlabeled data may achieve the similar result. Hence, we develop a framework for fair semi-supervised learning, which is formulated as an optimization problem. This includes classifier loss to optimize accuracy, label propagation loss to optimize unlabled data prediction, and fairness constraints over labeled and unlabeled data to optimize the fairness level. The framework is conducted in logistic regression and support vector machines under the fairness metrics of disparate impact and disparate mistreatment. We theoretically analyze the source of discrimination in semi-supervised learning via bias, variance and noise decomposition. Extensive experiments show that our method is able to achieve fair semi-supervised learning, and reach a better trade-off between accuracy and fairness than fair supervised learning.


Why Working With AUC is More Powerful Than One Might Think

#artificialintelligence

I am finishing up a work-note that has some really neat implications as to why working with AUC is more powerful than one might think. I think I am far enough along to share the consequences here. This started as some, now reappraised, thoughts on the fallacy of thinking knowing the AUC (area under the curve) means you know the shape of the ROC plot (receiver operating characteristic plot]. I now think for many practical applications the AUC number carries a lot more information about the ROC shape than one might expect. For the experienced practitioner the following ROC shape is very familiar.


That looks interesting! Personalizing Communication and Segmentation with Random Forest Node Embeddings

arXiv.org Artificial Intelligence

Communicating effectively with customers is a challenge for many marketers, but especially in a context that is both pivotal to individual long-term financial well-being and difficult to understand: pensions. Around the world, participants are reluctant to consider their pension in advance, it leads to a lack of preparation of their pension retirement [1], [2]. In order to engage participants to obtain information on their expected pension benefits, personalizing the pension providers' email communication is a first and crucial step. We describe a machine learning approach to model email newsletters to fit participants' interests. The data for the modeling and analysis is collected from newsletters sent by a large Dutch pension provider of the Netherlands and is divided into two parts. The first part comprises 2,228,000 customers whereas the second part comprises the data of a pilot study, which took place in July 2018 with 465,711 participants. In both cases, our algorithm extracts features from continuous and categorical data using random forests, and then calculates node embeddings of the decision boundaries of the random forest. We illustrate the algorithm's effectiveness for the classification task, and how it can be used to perform data mining tasks. In order to confirm that the result is valid for more than one data set, we also illustrate the properties of our algorithm in benchmark data sets concerning churning. In the data sets considered, the proposed modeling demonstrates competitive performance with respect to other state of the art approaches based on random forests, achieving the best Area Under the Curve (AUC) in the pension data set (0.948). For the descriptive part, the algorithm can identify customer segmentations that can be used by marketing departments to better target their communication towards their customers.


Random boosting and random^2 forests -- A random tree depth injection approach

arXiv.org Machine Learning

The induction of additional randomness in parallel and sequential ensemble methods has proven to be worthwhile in many aspects. In this manuscript, we propose and examine a novel random tree depth injection approach suitable for sequential and parallel tree-based approaches including Boosting and Random Forests. The resulting methods are called \emph{Random Boost} and \emph{Random$^2$ Forest}. Both approaches serve as valuable extensions to the existing literature on the gradient boosting framework and random forests. A Monte Carlo simulation, in which tree-shaped data sets with different numbers of final partitions are built, suggests that there are several scenarios where \emph{Random Boost} and \emph{Random$^2$ Forest} can improve the prediction performance of conventional hierarchical boosting and random forest approaches. The new algorithms appear to be especially successful in cases where there are merely a few high-order interactions in the generated data. In addition, our simulations suggest that our random tree depth injection approach can improve computation time by up to 40%, while at the same time the performance losses in terms of prediction accuracy turn out to be minor or even negligible in most cases.


Is there a role for statistics in artificial intelligence?

arXiv.org Artificial Intelligence

The research on and application of artificial intelligence (AI) has triggered a comprehensive scientific, economic, social and political discussion. Here we argue that statistics, as an interdisciplinary scientific field, plays a substantial role both for the theoretical and practical understanding of AI and for its future development. Statistics might even be considered a core element of AI. With its specialist knowledge of data evaluation, starting with the precise formulation of the research question and passing through a study design stage on to analysis and interpretation of the results, statistics is a natural partner for other disciplines in teaching, research and practice. This paper aims at contributing to the current discussion by highlighting the relevance of statistical methodology in the context of AI development. In particular, we discuss contributions of statistics to the field of artificial intelligence concerning methodological development, planning and design of studies, assessment of data quality and data collection, differentiation of causality and associations and assessment of uncertainty in results. Moreover, the paper also deals with the equally necessary and meaningful extension of curricula in schools and universities.


MAGNETO: Fingerprinting USB Flash Drives via Unintentional Magnetic Emissions

arXiv.org Artificial Intelligence

Universal Serial Bus (USB) Flash Drives are nowadays one of the most convenient and diffused means to transfer files, especially when no Internet connection is available. However, USB flash drives are also one of the most common attack vectors used to gain unauthorized access to host devices. For instance, it is possible to replace a USB drive so that when the USB key is connected, it would install passwords stealing tools, root-kit software, and other disrupting malware. In such a way, an attacker can steal sensitive information via the USB-connected devices, as well as inject any kind of malicious software into the host. To thwart the above-cited raising threats, we propose MAGNETO, an efficient, non-interactive, and privacy-preserving framework to verify the authenticity of a USB flash drive, rooted in the analysis of its unintentional magnetic emissions. We show that the magnetic emissions radiated during boot operations on a specific host are unique for each device, and sufficient to uniquely fingerprint both the brand and the model of the USB flash drive, or the specific USB device, depending on the used equipment. Our investigation on 59 different USB flash drives---belonging to 17 brands, including the top brands purchased on Amazon in mid-2019---, reveals a minimum classification accuracy of 98.2% in the identification of both brand and model, accompanied by a negligible time and computational overhead. MAGNETO can also identify the specific USB Flash drive, with a minimum classification accuracy of 91.2%. Overall, MAGNETO proves that unintentional magnetic emissions can be considered as a viable and reliable means to fingerprint read-only USB flash drives. Finally, future research directions in this domain are also discussed.


Explainable Artificial Intelligence for Process Mining: A General Overview and Application of a Novel Local Explanation Approach for Predictive Process Monitoring

arXiv.org Artificial Intelligence

The contemporary process-aware information systems possess the capabilities to record the activities generated during the process execution. To leverage these process specific fine-granular data, process mining has recently emerged as a promising research discipline. As an important branch of process mining, predictive business process management, pursues the objective to generate forward-looking, predictive insights to shape business processes. In this study, we propose a conceptual framework sought to establish and promote understanding of decision-making environment, underlying business processes and nature of the user characteristics for developing explainable business process prediction solutions. Consequently, with regard to the theoretical and practical implications of the framework, this study proposes a novel local post-hoc explanation approach for a deep learning classifier that is expected to facilitate the domain experts in justifying the model decisions. In contrary to alternative popular perturbation-based local explanation approaches, this study defines the local regions from the validation dataset by using the intermediate latent space representations learned by the deep neural networks. To validate the applicability of the proposed explanation method, the real-life process log data delivered by the Volvo IT Belgium's incident management system are used. The adopted deep learning classifier achieves a good performance with the Area Under the ROC Curve of 0.94. The generated local explanations are also visualized and presented with relevant evaluation measures that are expected to increase the users' trust in the black-box-model.


Machine Learning Against Cancer: Accurate Diagnosis of Cancer by Machine Learning Classification of the Whole Genome Sequencing Data

arXiv.org Machine Learning

Machine learning can precisely identify different cancer tumors at any stage by classifying cancerous and healthy samples based on their genomic profile. We have developed novel methods of MLAC (Machine Learning Against Cancer) achieving perfect results with perfect precision, sensitivity, and specificity. We have used the whole genome sequencing data acquired by next-generation RNA sequencing techniques in The Cancer Genome Atlas and Genotype-Tissue Expression projects for cancerous and healthy tissues respectively. Moreover, we have shown that unsupervised machine learning clustering has great potential to be used for cancer diagnosis. Indeed, a creative way to work with data and general algorithms has resulted in perfect classification i.e. all precision, sensitivity, and specificity are equal to 1 for most of the different tumor types even with a modest amount of data, and the same method works well on a series of cancers and results in great clustering of cancerous and healthy samples too. Our system can be used in practice because once the classifier is trained, it can be used to classify any new sample of new potential patients. One advantage of our work is that the aforementioned perfect precision and recall are obtained on samples of all stages including very early stages of cancer; therefore, it is a promising tool for diagnosis of cancers in early stages. Another advantage of our novel model is that it works with normalized values of RNA sequencing data, hence people's private sensitive medical data will remain hidden, protected, and safe. This type of analysis will be widespread and economical in the future and people can even learn to receive their RNA sequencing data and do their own preliminary cancer studies themselves which have the potential to help the healthcare systems. It is a great step forward toward good health that is the main base of sustainable societies.