Diagnosis
SketchBoost: Fast Gradient Boosted Decision Tree for Multioutput Problems
Iosipoi, Leonid, Vakhrushev, Anton
Gradient Boosted Decision Tree (GBDT) is a widely-used machine learning algorithm that has been shown to achieve state-of-the-art results on many standard data science problems. We are interested in its application to multioutput problems when the output is highly multidimensional. Although there are highly effective GBDT implementations, their scalability to such problems is still unsatisfactory. In this paper, we propose novel methods aiming to accelerate the training process of GBDT in the multioutput scenario. The idea behind these methods lies in the approximate computation of a scoring function used to find the best split of decision trees. These methods are implemented in SketchBoost, which itself is integrated into our easily customizable Python-based GPU implementation of GBDT called Py-Boost. Our numerical study demonstrates that SketchBoost speeds up the training process of GBDT by up to over 40 times while achieving comparable or even better performance.
Machine Learning Methods for Anomaly Detection in Nuclear Power Plant Power Transformers
Katser, Iurii, Raspopov, Dmitriy, Kozitsin, Vyacheslav, Mezhov, Maxim
Power transformers are an important component of a nuclear power plant (NPP). Currently, the NPP operates a lot of power transformers with extended service life, which exceeds the designated 25 years. Due to the extension of the service life, the task of monitoring the technical condition of power transformers becomes urgent. An important method for monitoring power transformers is Chromatographic Analysis of Dissolved Gas. It is based on the principle of controlling the concentration of gases dissolved in transformer oil. The appearance of almost any type of defect in equipment is accompanied by the formation of gases that dissolve in oil, and specific types of defects generate their gases in different quantities. At present, at NPPs, the monitoring systems for transformer equipment use predefined control limits for the concentration of dissolved gases in the oil. This study describes the stages of developing an algorithm to detect defects and faults in transformers automatically using machine learning and data analysis methods. Among machine learning models, we trained Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, Neural Networks. The best of them were then combined into an ensemble (StackingClassifier) showing F1-score of 0.974 on a test sample. To develop mathematical models, we used data on the state of transformers, containing time series with values of gas concentrations (H2, CO, C2H4, C2H2). The datasets were labeled and contained four operating modes: normal mode, partial discharge, low energy discharge, low-temperature overheating.
Decision Trees (the upside-down trees)
Taking SpongeBob SquarePants' mood as an example, based on historical data, there are two factors that affect whether he is happy or upset. So, if the whole mood table of Mr. SquarePants is visualized into a scatter plot, it will look as the following: Decision trees are used in classification problems to find a line(s) that separates the data as perfectly as possible. The separation process is done by measuring the homogeneity or similarity of the data. Attempting to separate data, a vertical line is drawn to separate happy SpongeBob from upset SpongeBob, taking into consideration one feature only -which is the number of jellyfish that were hunted. A vertical line at the number 10 on the x-axis can work as a separator, so if the number of jellyfish hunted is less than 10 SpongeBob is upset otherwise, he is happy.
Mind Your Bias: A Critical Review of Bias Detection Methods for Contextual Language Models
The awareness and mitigation of biases are of fundamental importance for the fair and transparent use of contextual language models, yet they crucially depend on the accurate detection of biases as a precursor. Consequently, numerous bias detection methods have been proposed, which vary in their approach, the considered type of bias, and the data used for evaluation. However, while most detection methods are derived from the word embedding association test for static word embeddings, the reported results are heterogeneous, inconsistent, and ultimately inconclusive. To address this issue, we conduct a rigorous analysis and comparison of bias detection methods for contextual language models. Our results show that minor design and implementation decisions (or errors) have a substantial and often significant impact on the derived bias scores. Overall, we find the state of the field to be both worse than previously acknowledged due to systematic and propagated errors in implementations, yet better than anticipated since divergent results in the literature homogenize after accounting for implementation errors. Based on our findings, we conclude with a discussion of paths towards more robust and consistent bias detection methods.
VicOne launches Secured RDS Service on MIH Open EV Platform
The rapid development of the global electric vehicle market has ushered in a new era of Software Defined Vehicle (SDV). The accelerated innovative development has also made electric vehicles the new target for hackers, which accentuate the importance of the cyber security in the electric vehicle industry. VicOne, an automotive cybersecurity solution provider, established by cloud security leader Trend Micro, unveiled the industry's first-of-kind in-vehicle software safety remote diagnosis service on the Open EV Platform at MIH Demo Day, an annual technical event organized by MIH Consortium on the 8th November, 2023. VicOne Secured RDS (Remote Diagnostic Service) is the first to extend the service of vehicle abnormality detection to software diagnosis. Detecting abnormalities and real-time warnings remotely can help car manufacturers and system suppliers to establish a stronger cyber security defense and provide comprehensive network security protection to electric vehicles.
Framework Construction of an Adversarial Federated Transfer Learning Classifier
Yi, Hang, Bie, Tongxuan, Yan, Tongjiang
As the Internet grows in popularity, more and more classification jobs, such as IoT, finance industry and healthcare field, rely on mobile edge computing to advance machine learning. In the medical industry, however, good diagnostic accuracy necessitates the combination of large amounts of labeled data to train the model, which is difficult and expensive to collect and risks jeopardizing patients' privacy. In this paper, we offer a novel medical diagnostic framework that employs a federated learning platform to ensure patient data privacy by transferring classification algorithms acquired in a labeled domain to a domain with sparse or missing labeled data. Rather than using a generative adversarial network, our framework uses a discriminative model to build multiple classification loss functions with the goal of improving diagnostic accuracy. It also avoids the difficulty of collecting large amounts of labeled data or the high cost of generating large amount of sample data. Experiments on real-world image datasets demonstrates that the suggested adversarial federated transfer learning method is promising for real-world medical diagnosis applications that use image classification.
A Random Forest and Current Fault Texture Feature-Based Method for Current Sensor Fault Diagnosis in Three-Phase PWM VSR
Kou, Lei, Gong, Xiao-dong, Zheng, Yi, Ni, Xiu-hui, Li, Yang, Yuan, Quan-de, Dong, Ya-nan
Three-phase PWM voltage-source rectifier (VSR) systems have been widely used in various energy conversion systems, where current sensors are the key component for state monitoring and system control. The current sensor faults may bring hidden danger or damage to the whole system; therefore, this paper proposed a random forest (RF) and current fault texture feature-based method for current sensor fault diagnosis in three-phase PWM VSR systems. First, the three-phase alternating currents (ACs) of the three-phase PWM VSR are collected to extract the current fault texture features, and no additional hardware sensors are needed to avoid causing additional unstable factors. Then, the current fault texture features are adopted to train the random forest current sensor fault detection and diagnosis (CSFDD) classifier, which is a data-driven CSFDD classifier. Finally, the effectiveness of the proposed method is verified by simulation experiments. The result shows that the current sensor faults can be detected and located successfully and that it can effectively provide fault locations for maintenance personnel to keep the stable operation of the whole system.
Towards federated multivariate statistical process control (FedMSPC)
Duy, Du Nguyen, Gabauer, David, Nikzad-Langerodi, Ramin
The ongoing transition from a linear (produce-use-dispose) to a circular economy poses significant challenges to current state-of-the-art information and communication technologies. In particular, the derivation of integrated, high-level views on material, process, and product streams from (real-time) data produced along value chains is challenging for several reasons. Most importantly, sufficiently rich data is often available yet not shared across company borders because of privacy concerns which make it impossible to build integrated process models that capture the interrelations between input materials, process parameters, and key performance indicators along value chains. In the current contribution, we propose a privacy-preserving, federated multivariate statistical process control (FedMSPC) framework based on Federated Principal Component Analysis (PCA) and Secure Multiparty Computation to foster the incentive for closer collaboration of stakeholders along value chains. We tested our approach on two industrial benchmark data sets - SECOM and ST-AWFD. Our empirical results demonstrate the superior fault detection capability of the proposed approach compared to standard, single-party (multiway) PCA. Furthermore, we showcase the possibility of our framework to provide privacy-preserving fault diagnosis to each data holder in the value chain to underpin the benefits of secure data sharing and federated process modeling.
Data-driven design of fault diagnosis for three-phase PWM rectifier using random forests technique with transient synthetic features
Kou, Lei, Liu, Chuang, Cai, Guo-wei, Zhou, Jia-ning, Yuan, Quan-de
A three-phase pulse-width modulation (PWM) rectifier can usually maintain operation when open-circuit faults occur in insulated-gate bipolar transistors (IGBTs), which will lead the system to be unstable and unsafe. Aiming at this problem, based on random forests with transient synthetic features, a data-driven online fault diagnosis method is proposed to locate the open-circuit faults of IGBTs timely and effectively in this study. Firstly, by analysing the open-circuit fault features of IGBTs in the three-phase PWM rectifier, it is found that the occurrence of the fault features is related to the fault location and time, and the fault features do not always appear immediately with the occurrence of the fault. Secondly, different data-driven fault diagnosis methods are compared and evaluated, the performance of random forests algorithm is better than that of support vector machine or artificial neural networks. Meanwhile, the accuracy of fault diagnosis classifier trained by transient synthetic features is higher than that trained by original features. Also, the random forests fault diagnosis classifier trained by multiplicative features is the best with fault diagnosis accuracy can reach 98.32%. Finally, the online fault diagnosis experiments are carried out and the results demonstrate the effectiveness of the proposed method, which can accurately locate the open-circuit faults in IGBTs while ensuring system safety.
Optimization of Oblivious Decision Tree Ensembles Evaluation for CPU
Mironov, Alexey, Khuziev, Ilnur
CatBoost is a popular machine learning library. CatBoost models are based on oblivious decision trees, making training and evaluation rapid. CatBoost has many applications, and some require low latency and high throughput evaluation. This paper investigates the possibilities for improving CatBoost's performance in single-core CPU computations. We explore the new features provided by the AVX instruction sets to optimize evaluation. We increase performance by 20-40% using AVX2 instructions without quality impact. We also introduce a new trade-off between speed and quality. Using float16 for leaf values and AVX-512 instructions, we achieve 50-70% speed-up.