Performance Analysis
Audits as Evidence: Experiments, Ensembles, and Enforcement
Kline, Patrick, Walters, Christopher
We develop tools for utilizing correspondence experiments to detect illegal discrimination by individual employers. Employers violate US employment law if their propensity to contact applicants depends on protected characteristics such as race or sex. We establish identification of higher moments of the causal effects of protected characteristics on callback rates as a function of the number of fictitious applications sent to each job ad. These moments are used to bound the fraction of jobs that illegally discriminate. Applying our results to three experimental datasets, we find evidence of significant employer heterogeneity in discriminatory behavior, with the standard deviation of gaps in job-specific callback probabilities across protected groups averaging roughly twice the mean gap. In a recent experiment manipulating racially distinctive names, we estimate that at least 85% of jobs that contact both of two white applications and neither of two black applications are engaged in illegal discrimination. To assess the tradeoff between type I and II errors presented by these patterns, we consider the performance of a series of decision rules for investigating suspicious callback behavior under a simple two-type model that rationalizes the experimental data. Though, in our preferred specification, only 17% of employers are estimated to discriminate on the basis of race, we find that an experiment sending 10 applications to each job would enable accurate detection of 7-10% of discriminators while falsely accusing fewer than 0.2% of non-discriminators. A minimax decision rule acknowledging partial identification of the joint distribution of callback rates yields higher error rates but more investigations than our baseline two-type model. Our results suggest illegal labor market discrimination can be reliably monitored with relatively small modifications to existing audit designs.
Data augmentation with Symbolic-to-Real Image Translation GANs for Traffic Sign Recognition
Soufi, Nour, Valdenegro-Toro, Matias
Traffic sign recognition is an important component of many advanced driving assistance systems, and it is required for full autonomous driving. Computational performance is usually the bottleneck in using large scale neural networks for this purpose. SqueezeNet is a good candidate for efficient image classification of traffic signs, but in our experiments it does not reach high accuracy, and we believe this is due to lack of data, requiring data augmentation. Generative adversarial networks can learn the high dimensional distribution of empirical data, allowing the generation of new data points. In this paper we apply pix2pix GANs architecture to generate new traffic sign images and evaluate the use of these images in data augmentation. We were motivated to use pix2pix to translate symbolic sign images to real ones due to the mode collapse in Conditional GANs. Through our experiments we found that data augmentation using GAN can increase classification accuracy for circular traffic signs from 92.1% to 94.0%, and for triangular traffic signs from 93.8% to 95.3%, producing an overall improvement of 2%. However some traditional augmentation techniques can outperform GAN data augmentation, for example contrast variation in circular traffic signs (95.5%) and displacement on triangular traffic signs (96.7 %). Our negative results shows that while GANs can be naively used for data augmentation, they are not always the best choice, depending on the problem and variability in the data.
An Adaptive Approach for Anomaly Detector Selection and Fine-Tuning in Time Series
Ye, Hui, Ma, Xiaopeng, Pan, Qingfeng, Fang, Huaqiang, Xiang, Hang, Shao, Tongzhen
The anomaly detection of time series is a hotspot of time series data mining. The own characteristics of different anomaly detectors determine the abnormal data that they are good at. There is no detector can be optimizing in all types of anomalies. Moreover, it still has difficulties in industrial production due to problems such as a single detector can't be optimized at different time windows of the same time series. This paper proposes an adaptive model based on time series characteristics and selecting appropriate detector and run-time parameters for anomaly detection, which is called ATSDLN(Adaptive Time Series Detector Learning Network). We take the time series as the input of the model, and learn the time series representation through FCN. In order to realize the adaptive selection of detectors and run-time parameters according to the input time series, the outputs of FCN are the inputs of two sub-networks: the detector selection network and the run-time parameters selection network. In addition, the way that the variable layer width design of the parameter selection sub-network and the introduction of transfer learning make the model be with more expandability. Through experiments, it is found that ATSDLN can select appropriate anomaly detector and run-time parameters, and have strong expandability, which can quickly transfer. We investigate the performance of ATSDLN in public data sets, our methods outperform other methods in most cases with higher effect and better adaptation. We also show experimental results on public data sets to demonstrate how model structure and transfer learning affect the effectiveness.
Network Based Pricing for 3D Printing Services in Two-Sided Manufacturing-as-a-Service Marketplace
This paper presents approaches to determine a network based pricing for 3D printing services in the context of a two-sided manufacturing-as-a-service marketplace. The intent is to provide cost analytics to enable service bureaus to better compete in the market by moving away from setting ad-hoc and subjective prices. A data mining approach with machine learning methods is used to estimate a price range based on the profile characteristics of 3D printing service suppliers. The model considers factors such as supplier experience, supplier capabilities, customer reviews and ratings from past orders, and scale of operations among others to estimate a price range for suppliers' services. Data was gathered from existing marketplace websites, which was then used to train and test the model. The model demonstrates an accuracy of 65% for US based suppliers and 59% for Europe based suppliers to classify a supplier's 3D Printer listing in one of the seven price categories. The improvement over baseline accuracy of 25% demonstrates that machine learning based methods are promising for network based pricing in manufacturing marketplaces. Conventional methodologies for pricing services through activity based costing are inefficient in strategically pricing 3D printing service offering in a connected marketplace. As opposed to arbitrarily determining prices, this work proposes an approach to determine prices through data mining methods to estimate competitive prices. Such tools can be built into online marketplaces to help independent service bureaus to determine service price rates.
Improving Outbreak Detection with Stacking of Statistical Surveillance Methods
Kulessa, Moritz, Mencรญa, Eneldo Loza, Fรผrnkranz, Johannes
Epidemiologists use a variety of statistical algorithms for the early detection of outbreaks. The practical usefulness of such methods highly depends on the trade-off between the detection rate of outbreaks and the chances of raising a false alarm. Recent research has shown that the use of machine learning for the fusion of multiple statistical algorithms improves outbreak detection. Instead of relying only on the binary output (alarm or no alarm) of the statistical algorithms, we propose to make use of their p-values for training a fusion classifier. In addition, we also show that adding additional features and adapting the labeling of an epidemic period may further improve performance. For comparison and evaluation, a new measure is introduced which captures the performance of an outbreak detection method with respect to a low rate of false alarms more precisely than previous works. Our results on synthetic data show that it is challenging to improve the performance with a trainable fusion method based on machine learning. In particular, the use of a fusion classifier that is only based on binary outputs of the statistical surveillance methods can make the overall performance worse than directly using the underlying algorithms. However, the use of p-values and additional information for the learning is promising, enabling to identify more valuable patterns to detect outbreaks.
A hybrid neural network model based on improved PSO and SA for bankruptcy prediction
Azayite, Fatima Zahra, Achchab, Said
Predicting firm's failure is one of the most interesting subjects for investors and decision makers. In this paper, a bankruptcy prediction model is proposed based on Artificial Neural networks (ANN). Taking into consideration that the choice of v ariables to discriminate between bankrupt and non - bankrupt firms influences significantly the model's accuracy and considering the problem of local minima, we propose a hybrid ANN based on variables selection techniques. Moreover, we evolve the convergence of Particle Swarm Optimization (PSO) by proposing a training algorithm based on an improved PSO and Simulated Annealing. A comparative performance study is reported, and the proposed hybrid model shows a high performance and convergence in the context of missing data.
Bootstrapping Ternary Relation Extractors
Binary relation extraction methods have been widely studied in recent years. However, few methods have been developed for higher n-ary relation extraction. One limiting factor is the effort required to generate training data. For binary relations, one only has to provide a few dozen pairs of entities per relation, as training data. For ternary relations (n=3), each training instance is a triplet of entities, placing a greater cognitive load on people. For example, many people know that Google acquired Youtube but not the dollar amount or the date of the acquisition and many people know that Hillary Clinton is married to Bill Clinton by not the location or date of their wedding. This makes higher n-nary training data generation a time consuming exercise in searching the Web. We present a resource for training ternary relation extractors. This was generated using a minimally supervised yet effective approach. We present statistics on the size and the quality of the dataset.
Electroencephalography based Classification of Long-term Stress using Psychological Labeling
Saeed, Sanay Muhammad Umar, Anwar, Syed Muhammad, Khalid, Humaira, Majid, Muhammad, Bagci, Ulas
Stress research is a rapidly emerging area in thefield of electroencephalography (EEG) based signal processing.The use of EEG as an objective measure for cost effective andpersonalized stress management becomes important in particularsituations such as the non-availability of mental health facilities.In this study, long-term stress is classified using baseline EEGsignal recordings. The labelling for the stress and control groupsis performed using two methods (i) the perceived stress scalescore and (ii) expert evaluation. The frequency domain featuresare extracted from five-channel EEG recordings in addition tothe frontal and temporal alpha and beta asymmetries. The alphaasymmetry is computed from four channels and used as a feature.Feature selection is also performed using a t-test to identifystatistically significant features for both stress and control groups.We found that support vector machine is best suited to classifylong-term human stress when used with alpha asymmetry asa feature. It is observed that expert evaluation based labellingmethod has improved the classification accuracy up to 85.20%.Based on these results, it is concluded that alpha asymmetry maybe used as a potential bio-marker for stress classification, when labels are assigned using expert evaluation.
FAHT: An Adaptive Fairness-aware Decision Tree Classifier
Zhang, Wenbin, Ntoutsi, Eirini
Automated data-driven decision-making systems are ubiquitous across a wide spread of online as well as offline services. These systems, depend on sophisticated learning algorithms and available data, to optimize the service function for decision support assistance. However, there is a growing concern about the accountability and fairness of the employed models by the fact that often the available historic data is intrinsically discriminatory, i.e., the proportion of members sharing one or more sensitive attributes is higher than the proportion in the population as a whole when receiving positive classification, which leads to a lack of fairness in decision support system. A number of fairness-aware learning methods have been proposed to handle this concern. However, these methods tackle fairness as a static problem and do not take the evolution of the underlying stream population into consideration. In this paper, we introduce a learning mechanism to design a fair classifier for online stream based decision-making. Our learning model, FAHT (Fairness-Aware Hoeffding Tree), is an extension of the well-known Hoeffding Tree algorithm for decision tree induction over streams, that also accounts for fairness. Our experiments show that our algorithm is able to deal with discrimination in streaming environments, while maintaining a moderate predictive performance over the stream.