Accuracy
Real-Time Sensor Anomaly Detection and Recovery in Connected Automated Vehicle Sensors
Wang, Yiyang, Masoud, Neda, Khojandi, Anahita
In this paper we propose a novel observer-based method to improve the safety and security of connected and automated vehicle (CAV) transportation. The proposed method combines model-based signal filtering and anomaly detection methods. Specifically, we use adaptive extended Kalman filter (AEKF) to smooth sensor readings of a CAV based on a nonlinear car-following model. Using the car-following model the subject vehicle (i.e., the following vehicle) utilizes the leading vehicle's information to detect sensor anomalies by employing previously-trained One Class Support Vector Machine (OCSVM) models. This approach allows the AEKF to estimate the state of a vehicle not only based on the vehicle's location and speed, but also by taking into account the state of the surrounding traffic. A communication time delay factor is considered in the car-following model to make it more suitable for real-world applications. Our experiments show that compared with the AEKF with a traditional $\chi^2$-detector, our proposed method achieves a better anomaly detection performance. We also demonstrate that a larger time delay factor has a negative impact on the overall detection performance.
Online Debiasing for Adaptively Collected High-dimensional Data
Deshpande, Yash, Javanmard, Adel, Mehrabi, Mohammad
Adaptive collection of data is increasingly commonplace in many applications. From the point of view of statistical inference however, adaptive collection induces memory and correlation in the samples, and poses significant challenge. We consider the high-dimensional linear regression, where the samples are collected adaptively and the sample size $n$ can be smaller than $p$, the number of covariates. In this setting, there are two distinct sources of bias: the first due to regularization imposed for estimation, e.g. using the LASSO, and the second due to adaptivity in collecting the samples. We propose \emph{`online debiasing'}, a general procedure for estimators such as the LASSO, which addresses both sources of bias. In two concrete contexts $(i)$ batched data collection and $(ii)$ high-dimensional time series analysis, we demonstrate that online debiasing optimally debiases the LASSO estimate when the underlying parameter $\theta_0$ has sparsity of order $o(\sqrt{n}/\log p)$. In this regime, the debiased estimator can be used to compute $p$-values and confidence intervals of optimal size.
Auditing and Achieving Intersectional Fairness in Classification Problems
Morina, Giulio, Oliinyk, Viktoriia, Waton, Julian, Marusic, Ines, Georgatzis, Konstantinos
Machine learning algorithms are extensively used to make increasingly more consequential decisions, so that achieving optimal predictive performance can no longer be the only focus. This paper explores intersectional fairness, that is fairness when intersections of multiple sensitive attributes -- such as race, age, nationality, etc. -- are considered. Previous research has mainly been focusing on fairness with respect to a single sensitive attribute, with intersectional fairness being comparatively less studied despite its critical importance for modern machine learning applications. We introduce intersectional fairness metrics by extending prior work, and provide different methodologies to audit discrimination in a given dataset or model outputs. Secondly, we develop novel post-processing techniques to mitigate any detected bias in a classification model. Our proposed methodology does not rely on any assumptions regarding the underlying model and aims at guaranteeing fairness while preserving good predictive performance. Finally, we give guidance on a practical implementation, showing how the proposed methods perform on a real-world dataset.
Smaller Is Better: Lightweight Face Detection For Smartphones
Although mobile devices were not designed to run compute-heavy AI models, in recent years AI-powered features like face detection, eye tracking, and voice recognition have all been added to smartphones. Much of the compute for such services is done on the cloud, but ideally these applications would be light enough to run directly on devices without an Internet connection. In this spirit of "smaller is better," Shanghai-based developer "Linzai" (GitHub user name @Linzaer) recently shared a new lightweight model that enables real-time face detection for smartphones. The project has garnered a whopping 3.3k Stars and over 600 forks on GitHub. Facial recognition technology is widely applied in security monitoring, surveillance, human-computer interaction, entertainment, etc. Detecting human faces in digital images is the first step in facial recognition, and an ideal face detection model can be evaluated by how quickly and accurately it performs.
Detecting random filenames using (un)supervised machine learning
Combining both n-grams and random forest models to detect malicious activity. An essential part of Managed Detection and Response at Fox-IT is the Security Operations Center. This is our frontline for detecting and analyzing possible threats. Our Security Operations Center brings together the best in human and machine analysis and we continually strive to improve both. For instance, we develop machine learning techniques for detecting malicious content such as DGA domains or unusual SMB traffic.
Regularized Adversarial Sampling and Deep Time-aware Attention for Click-Through Rate Prediction
Wang, Yikai, Zhang, Liang, Dai, Quanyu, Sun, Fuchun, Zhang, Bo, He, Yang, Yan, Weipeng, Bao, Yongjun
Improving the performance of click-through rate (CTR) prediction remains one of the core tasks in online advertising systems. With the rise of deep learning, CTR prediction models with deep networks remarkably enhance model capacities. In deep CTR models, exploiting users' historical data is essential for learning users' behaviors and interests. As existing CTR prediction works neglect the importance of the temporal signals when embed users' historical clicking records, we propose a time-aware attention model which explicitly uses absolute temporal signals for expressing the users' periodic behaviors and relative temporal signals for expressing the temporal relation between items. Besides, we propose a regularized adversarial sampling strategy for negative sampling which eases the classification imbalance of CTR data and can make use of the strong guidance provided by the observed negative CTR samples. The adversarial sampling strategy significantly improves the training efficiency, and can be co-trained with the time-aware attention model seamlessly. Experiments are conducted on real-world CTR datasets from both in-station and out-station advertising places.
A Study of Data Pre-processing Techniques for Imbalanced Biomedical Data Classification
Liu, Shigang, Zhang, Jun, Xiang, Yang, Zhou, Wanlei, Xiang, Dongxi
Biomedical data are widely accepted in developing prediction models for identifying a specific tumor, drug discovery and classification of human cancers. However, previous studies usually focused on different classifiers, and overlook the class imbalance problem in real-world biomedical datasets. There are a lack of studies on evaluation of data pre-processing techniques, such as resampling and feature selection, on imbalanced biomedical data learning. The relationship between data pre-processing techniques and the data distributions has never been analysed in previous studies. This article mainly focuses on reviewing and evaluating some popular and recently developed resampling and feature selection methods for class imbalance learning. We analyse the effectiveness of each technique from data distribution perspective. Extensive experiments have been done based on five classifiers, four performance measures, eight learning techniques across twenty real-world datasets. Experimental results show that: (1) resampling and feature selection techniques exhibit better performance using support vector machine (SVM) classifier. However, resampling and Feature Selection techniques perform poorly when using C4.5 decision tree and Linear discriminant analysis classifiers; (2) for datasets with different distributions, techniques such as Random undersampling and Feature Selection perform better than other data pre-processing methods with T Location-Scale distribution when using SVM and KNN (K-nearest neighbours) classifiers. Random oversampling outperforms other methods on Negative Binomial distribution using Random Forest classifier with lower level of imbalance ratio; (3) Feature Selection outperforms other data pre-processing methods in most cases, thus, Feature Selection with SVM classifier is the best choice for imbalanced biomedical data learning.
Fair Predictors under Distribution Shift
Singh, Harvineet, Singh, Rina, Mhasawade, Vishwali, Chunara, Rumi
Recent work on fair machine learning adds to a growing set of algorithmic safeguards required for deployment in high societal impact areas. A fundamental concern with model deployment is to guarantee stable performance under changes in data distribution. Extensive work in domain adaptation addresses this concern, albeit with the notion of stability limited to that of predictive performance. We provide conditions under which a stable model both in terms of prediction and fairness performance can be trained. Building on the problem setup of causal domain adaptation, we select a subset of features for training predictors with fairness constraints such that risk with respect to an unseen target data distribution is minimized. Advantages of the approach are demonstrated on synthetic datasets and on the task of diagnosing acute kidney injury in a real-world dataset under an instance of measurement policy shift and selection bias.
Improving Cross-Lingual Transfer Learning by Filtering Training Data : Alexa Blogs
This type of cross-lingual transfer learning can make it easier to bootstrap a model in a language for which training data is scarce, by taking advantage of more abundant data in a source language. But sometimes the data in the source language is so abundant that using all of it to train a transfer model would be impractically time consuming. Moreover, linguistic differences between source and target languages mean that pruning the training data in the source language, so that its statistical patterns better match those of the target language, can actually improve the performance of the transferred model. In a paper we're presenting at this year's Conference on Empirical Methods in Natural Language Processing, we describe experiments with a new data selection technique that let us halve the amount of training data required in the source language, while actually improving a transfer model's performance in a target language. For evaluation purposes, we used two techniques to cut the source-language data set in half: one was our data selection technique, and the other was random sampling.