Performance Analysis
Comparing Shape-Constrained Regression Algorithms for Data Validation
Bachinger, Florian, Kronberger, Gabriel
Industrial and scientific applications handle large volumes of data that render manual validation by humans infeasible. Therefore, we require automated data validation approaches that are able to consider the prior knowledge of domain experts to produce dependable, trustworthy assessments of data quality. Prior knowledge is often available as rules that describe interactions of inputs with regard to the target, e.g., the target must be monotonically decreasing and convex over increasing input values. Domain experts are able to validate multiple such interactions at a glance. However, existing rule-based data validation approaches are unable to consider these constraints. In this work, we compare different shape-constrained regression algorithms for the purpose of data validation based on their classification accuracy and runtime performance.
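To make the example rule concrete, the following minimal sketch checks whether sampled data are monotonically decreasing and convex over increasing input values, using finite differences. It is not one of the shape-constrained regression algorithms compared in the paper; the function name `satisfies_shape` and the tolerance `tol` are hypothetical, and the check assumes noise-free samples with distinct input values.

```python
def satisfies_shape(xs, ys, tol=1e-9):
    """Return True if the samples are decreasing and convex in x.

    Illustrative assumption: inputs are distinct and noise-free.
    """
    pairs = sorted(zip(xs, ys))  # order samples by increasing input value
    # Slope of each segment between neighbouring samples.
    slopes = [
        (y1 - y0) / (x1 - x0)
        for (x0, y0), (x1, y1) in zip(pairs, pairs[1:])
    ]
    # Monotonically decreasing: no segment slope is positive.
    decreasing = all(s <= tol for s in slopes)
    # Convex: segment slopes never decrease from left to right.
    convex = all(s1 - s0 >= -tol for s0, s1 in zip(slopes, slopes[1:]))
    return decreasing and convex


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    print(satisfies_shape(xs, [1.0 / x for x in xs]))  # y = 1/x on x > 0
    print(satisfies_shape(xs, [4.0, 3.0, 1.0, 0.0]))   # decreasing, not convex
```

On real, noisy data such a pointwise check would reject almost every dataset, which is one reason to fit a constrained model and inspect how well it explains the data instead.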
BuFF: Burst Feature Finder for Light-Constrained 3D Reconstruction
Ravendran, Ahalya, Bryson, Mitch, Dansereau, Donald G.
Robots operating at night using conventional vision cameras face significant challenges in reconstruction due to noise-limited images. Previous work has demonstrated that burst-imaging techniques can be used to partially overcome this issue. In this paper, we develop a novel feature detector that operates directly on image bursts and enhances vision-based reconstruction under extremely low-light conditions. Our approach finds keypoints with well-defined scale and apparent motion within each burst by jointly searching in a multi-scale and multi-motion space. Because we describe these features at a stage where the images have higher signal-to-noise ratio, the detected features are more accurate than the state-of-the-art on conventional noisy images and burst-merged images and exhibit high precision, recall, and matching performance. We show improved feature performance and camera pose estimates and demonstrate improved structure-from-motion performance using our feature detector in challenging light-constrained scenes. Our feature finder provides a significant step towards robots operating in low-light scenarios and applications including night-time operations.
S-Rocket: Selective Random Convolution Kernels for Time Series Classification
Salehinejad, Hojjat, Wang, Yang, Yu, Yuanhao, Jin, Tang, Valaee, Shahrokh
Random convolution kernel transform (Rocket) is a fast, efficient, and novel approach for time series feature extraction using a large number of independent randomly initialized 1-D convolution kernels of different configurations. The output of the convolution operation on each time series is represented by a partial positive value (PPV). A concatenation of PPVs from all kernels is the input feature vector to a Ridge regression classifier. Unlike typical deep learning models, the kernels are not trained and there is no weighted/trainable connection between kernels or concatenated features and the classifier. Since these kernels are generated randomly, a portion of them may not contribute positively to the performance of the model. Hence, selecting the most important kernels and pruning the redundant and less important ones is necessary to reduce computational complexity and accelerate inference of Rocket for applications on edge devices. Selection of these kernels is a combinatorial optimization problem. In this paper, we propose a scheme for selecting these kernels while maintaining the classification performance. First, the original model is pre-trained at full capacity. Then, a population of binary candidate state vectors is initialized, where each element of a vector represents the active/inactive status of a kernel. A population-based optimization algorithm evolves the population in order to find the best state vector, which minimizes the number of active kernels while maximizing the accuracy of the classifier. This activation function is a linear combination of the total number of active kernels and the classification accuracy of the pre-trained classifier with the active kernels. Finally, the kernels selected in the best state vector are used to train the Ridge regression classifier.
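The two ingredients described above, a PPV feature per kernel and a score that trades accuracy against the number of active kernels, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the PPV is assumed here to be the fraction of convolution outputs that are positive, as in the original Rocket formulation, and the function names `ppv` and `selection_score`, the weight `alpha`, and the normalisation by the total kernel count are all hypothetical choices.

```python
def ppv(series, weights, bias=0.0, dilation=1):
    """Fraction of positive outputs of a 1-D dilated convolution (no padding).

    Assumes the series is longer than the dilated kernel span.
    """
    span = (len(weights) - 1) * dilation
    outputs = [
        sum(w * series[i + j * dilation] for j, w in enumerate(weights)) + bias
        for i in range(len(series) - span)
    ]
    return sum(1 for o in outputs if o > 0) / len(outputs)


def selection_score(state, accuracy, alpha=0.5):
    """Hypothetical linear combination of accuracy and active-kernel count.

    state: binary list, 1 = kernel active, 0 = kernel pruned.
    accuracy: accuracy of the pre-trained classifier using only active kernels.
    Higher is better: accuracy is rewarded, active kernels are penalised.
    """
    active_fraction = sum(state) / len(state)
    return alpha * accuracy - (1.0 - alpha) * active_fraction


if __name__ == "__main__":
    series = [1.0, -1.0, 1.0, -1.0]
    print(ppv(series, weights=[1.0]))           # half of the outputs are positive
    print(selection_score([1, 0, 0, 0], 0.9))   # one of four kernels kept
```

A population-based optimizer would evaluate `selection_score` for every candidate state vector in each generation and keep the best-scoring one.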
Closing the Gender Wage Gap: Adversarial Fairness in Job Recommendation
Rus, Clara, Luppes, Jeffrey, Oosterhuis, Harrie, Schoenmacker, Gido H.
The goal of this work is to help mitigate the already existing gender wage gap by supplying unbiased job recommendations based on resumes from job seekers. We employ a generative adversarial network to remove gender bias from word2vec representations of 12M job vacancy texts and 900k resumes. Our results show that representations created from recruitment texts contain algorithmic bias and that this bias results in real-world consequences for recommendation systems. Without controlling for bias, women are recommended jobs with significantly lower salary in our data. With adversarially fair representations, this wage gap disappears, meaning that our debiased job recommendations reduce wage discrimination. We conclude that adversarial debiasing of word representations can increase real-world fairness of systems and thus may be part of the solution for creating fairness-aware recommendation systems.
A Tent Lévy Flying Sparrow Search Algorithm for Feature Selection: A COVID-19 Case Study
Yang, Qinwen, Gao, Yuelin, Song, Yanjie
The "Curse of Dimensionality" induced by the rapid development of information science might have a negative impact when dealing with big datasets. In this paper, we propose a variant of the sparrow search algorithm (SSA), called Tent Lévy flying sparrow search algorithm (TFSSA), and use it to select the best subset of features in the packing pattern for classification purposes. SSA is a recently proposed algorithm that has not been systematically applied to feature selection problems. After verification on the CEC2020 benchmark functions, TFSSA is used to select the best feature combination to maximize classification accuracy and minimize the number of selected features. The proposed TFSSA is compared with nine algorithms in the literature. Nine evaluation metrics are used to properly evaluate and compare the performance of these algorithms on twenty-one datasets from the UCI repository. Furthermore, the approach is applied to the coronavirus disease (COVID-19) dataset, yielding the best average classification accuracy, 93.47%, and an average number of selected features of 2.1. Experimental results confirm the advantages of the proposed algorithm in improving classification accuracy and reducing the number of selected features compared to other wrapper-based algorithms.
Explainable Misinformation Detection Across Multiple Social Media Platforms
Joshi, Gargi, Srivastava, Ananya, Yagnik, Bhargav, Hasan, Mohammed, Saiyed, Zainuddin, Gabralla, Lubna A, Abraham, Ajith, Walambe, Rahee, Kotecha, Ketan
In this work, the integration of two machine learning approaches, namely domain adaptation and explainable AI, is proposed to address these two issues of generalized detection and explainability. First, the Domain Adversarial Neural Network (DANN) is used to develop a generalized misinformation detector across multiple social media platforms. DANN is employed to generate the classification results for test domains with relevant but unseen data. The DANN-based model, a traditional black-box model, cannot justify its outcome, i.e., the labels for the target domain. Hence, a Local Interpretable Model-Agnostic Explanations (LIME) explainable AI model is applied to explain the outcome of the DANN model. To demonstrate these two approaches and their integration for effective explainable generalized detection, COVID-19 misinformation is considered as a case study. We experimented with two datasets, namely CoAID and MiSoVac, and compared results with and without DANN implementation. DANN significantly improves the F1 classification score and increases accuracy and AUC. The results show that the proposed framework performs well under domain shift and can learn domain-invariant features while explaining the target labels with LIME, enabling trustworthy information processing and extraction to combat misinformation effectively.
Cross Project Software Vulnerability Detection via Domain Adaptation and Max-Margin Principle
Nguyen, Van, Le, Trung, Tantithamthavorn, Chakkrit, Grundy, John, Nguyen, Hung, Phung, Dinh
Software vulnerabilities (SVs) have become a common, serious and crucial concern due to the ubiquity of computer software. Many machine learning-based approaches have been proposed to solve the software vulnerability detection (SVD) problem. However, there are still two open and significant issues for SVD in terms of i) learning automatic representations to improve the predictive performance of SVD, and ii) tackling the scarcity of labeled vulnerabilities datasets that conventionally need laborious labeling effort by experts. In this paper, we propose a novel end-to-end approach to tackle these two crucial issues. We first exploit the automatic representation learning with deep domain adaptation for software vulnerability detection. We then propose a novel cross-domain kernel classifier leveraging the max-margin principle to significantly improve the transfer learning process of software vulnerabilities from labeled projects into unlabeled ones. The experimental results on real-world software datasets show the superiority of our proposed method over state-of-the-art baselines. In short, our method obtains a higher performance on F1-measure, the most important measure in SVD, from 1.83% to 6.25% compared to the second highest method in the used datasets. Our released source code samples are publicly available at https://github.com/vannguyennd/dam2p
Overview of the SV-Ident 2022 Shared Task on Survey Variable Identification in Social Science Publications
Tsereteli, Tornike, Kartal, Yavuz Selim, Ponzetto, Simone Paolo, Zielinski, Andrea, Eckert, Kai, Mayr, Philipp
In this paper, we provide an overview of the SV-Ident shared task as part of the 3rd Workshop on Scholarly Document Processing (SDP) at COLING 2022. In the shared task, participants were provided with a sentence and a vocabulary of variables, and asked to identify which variables, if any, are mentioned in individual sentences from scholarly documents in full text. Two teams made a total of 9 submissions to the shared task leaderboard. While none of the teams improved on the baseline systems, we still draw insights from their submissions. Furthermore, we provide a detailed evaluation. Data and baselines for our shared task are freely available at https://github.com/vadis-project/sv-ident
Analyzing Machine Learning Models for Credit Scoring with Explainable AI and Optimizing Investment Decisions
This paper examines two distinct yet related questions about explainable AI (XAI) practices. Machine learning (ML) is increasingly important in financial services, such as pre-approval, credit underwriting, investments, and various front-end and back-end activities. Machine learning can automatically detect non-linearities and interactions in training data, facilitating faster and more accurate credit decisions. However, machine learning models are opaque and hard to explain, and explainability is a critical element for establishing a reliable technology. The study compares various machine learning models, including single classifiers (logistic regression, decision trees, LDA, QDA), heterogeneous ensembles (AdaBoost, Random Forest), and sequential neural networks. The results indicate that ensemble classifiers and neural networks outperform the single classifiers. In addition, two advanced post-hoc model-agnostic explainability techniques, LIME and SHAP, are used to assess ML-based credit scoring models on the open-access datasets offered by the US-based P2P lending platform Lending Club. For this study, we also use machine learning algorithms to develop new investment models and explore portfolio strategies that can maximize profitability while minimizing risk.
Machine Learning Class Numbers of Real Quadratic Fields
Amir, Malik, He, Yang-Hui, Lee, Kyu-Hwan, Oliver, Thomas, Sultanow, Eldar
We implement and interpret various supervised learning experiments involving real quadratic fields with class numbers 1, 2 and 3. We quantify the relative difficulties in separating class numbers of matching/different parity from a data-scientific perspective, apply the methodology of feature analysis and principal component analysis, and use symbolic classification to develop machine-learned formulas for class numbers 1, 2 and 3 that apply to our dataset.