Accuracy
Linearized Optimal Transport for Collider Events
Cai, Tianji, Cheng, Junyi, Craig, Katy, Craig, Nathaniel
We introduce an efficient framework for computing the distance between collider events using the tools of Linearized Optimal Transport (LOT). This preserves many of the advantages of the recently-introduced Energy Mover's Distance, which quantifies the "work" required to rearrange one event into another, while significantly reducing the computational cost. It also furnishes a Euclidean embedding amenable to simple machine learning algorithms and visualization techniques, which we demonstrate in a variety of jet tagging examples. The LOT approximation lowers the threshold for diverse applications of the theory of optimal transport to collider physics.
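To illustrate how a linearized embedding turns transport distances into Euclidean ones, here is a minimal sketch restricted to a toy one-dimensional case with equal-weight points. This is an assumption made for brevity: the abstract concerns collider events, which are not one-dimensional, and it does not describe the implementation. The function names `lot_embed_1d` and `euclidean` are hypothetical.

```python
import math

def lot_embed_1d(points, reference):
    """Embed equal-weight 1D points relative to an equal-weight reference.

    Toy assumption: both sets have the same number of points and every
    point carries weight 1/n. In 1D the optimal map then sends the i-th
    smallest reference point to the i-th smallest target point.
    """
    if len(points) != len(reference):
        raise ValueError("points and reference must have the same size")
    n = len(reference)
    ref_sorted = sorted(reference)
    pts_sorted = sorted(points)
    # Displacement of each reference point, scaled by sqrt(weight) so that
    # plain Euclidean distance matches the weighted L2 norm.
    return [(p - r) / math.sqrt(n) for p, r in zip(pts_sorted, ref_sorted)]

def euclidean(u, v):
    """Ordinary Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

if __name__ == "__main__":
    reference = [0.0, 0.0, 0.0]
    event_a = lot_embed_1d([2.0, 0.0, 1.0], reference)
    event_b = lot_embed_1d([3.0, 1.0, 2.0], reference)
    print(euclidean(event_a, event_b))
```

Once each event has been embedded against the shared reference, any pairwise comparison is a vector operation, which is what makes the embedding convenient for simple machine learning algorithms.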
Rethinking Default Values: a Low Cost and Efficient Strategy to Define Hyperparameters
Mantovani, Rafael Gomes, Rossi, André Luis Debiaso, Alcobaça, Edesio, Gertrudes, Jadson Castro, Junior, Sylvio Barbon, de Carvalho, André Carlos Ponce de Leon Ferreira
Machine Learning (ML) algorithms have been successfully employed by a vast range of practitioners with different backgrounds. One reason for the popularity of ML is its capability to consistently deliver accurate results, which can be further boosted by adjusting hyperparameters (HP). However, some practitioners have limited knowledge of the algorithms and do not take advantage of suitable HP settings. In general, HP values are defined by trial and error, tuning, or by using default values. Trial and error is very subjective, time-consuming, and dependent on the user's experience. Tuning techniques search for HP values able to maximize the predictive performance of induced models for a given dataset, but with the drawback of a high computational cost and target specificity. To avoid tuning costs, practitioners use default values suggested by the algorithm developer or by tools implementing the algorithm. Although default values usually result in models with acceptable predictive performance, different implementations of the same algorithm can suggest distinct default values. To maintain a balance between tuning and using default values, we propose a strategy to generate new optimized default values. Our approach is grounded on a small set of optimized values able to obtain predictive performance values better than default settings provided by popular tools. The HP candidates are estimated through a pool of promising values tuned from a small and informative set of datasets. After performing a large experiment and a careful analysis of the results, we concluded that our approach delivers better default values. Besides, it leads to competitive solutions when compared with the use of tuned values, being easier to use and having a lower cost. Based on our results, we also extracted simple rules to guide practitioners in deciding whether to use our new methodology or a tuning approach.
A Formally Robust Time Series Distance Metric
Toller, Maximilian, Geiger, Bernhard C., Kern, Roman
Distance-based classification is among the most competitive classification methods for time series data. The most critical component of distance-based classification is the selected distance function. Past research has proposed various distance metrics and measures dedicated to particular aspects of real-world time series data, yet there is an important aspect that has not been considered so far: robustness against arbitrary data contamination. In this work, we propose a novel distance metric that is robust against arbitrarily "bad" contamination and has a worst-case computational complexity of $\mathcal{O}(n\log n)$. We formally argue why our proposed metric is robust, and demonstrate in an empirical evaluation that the metric yields competitive classification accuracy when applied in k-Nearest Neighbor time series classification.
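As a minimal sketch of the classification setting described here, the following shows k-Nearest Neighbor classification of time series with a pluggable distance function. The Euclidean distance is only a stand-in: the abstract does not define the proposed robust metric, so it is not reproduced. The names `euclidean` and `knn_classify` are illustrative.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Placeholder distance between equal-length series (NOT the paper's metric)."""
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, train_series, train_labels, k=1, distance=euclidean):
    """Return the majority label among the k training series closest to query."""
    ranked = sorted(
        zip(train_series, train_labels),
        key=lambda pair: distance(query, pair[0]),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    series = [[0, 0, 0], [1, 1, 1], [5, 5, 5], [6, 6, 6]]
    labels = ["low", "low", "high", "high"]
    print(knn_classify([0.5, 0.4, 0.6], series, labels, k=3))
```

Because the distance is passed in as an argument, a different metric can be swapped in without changing the classifier.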
Transferring Complementary Operating Conditions for Anomaly Detection
In complex industrial systems, the number of possible fault types is uncountable, making it impossible to train supervised models covering them all. Instead, anomaly detectors are trained on healthy operating condition data and raise an alarm when the data deviate from the healthy conditions, indicating the possible occurrence of faults. Data-driven anomaly detection performance relies on a representative collection of samples of the normal (healthy) class distribution. This means that the samples used to train the model should be sufficient in number and distributed so as to empirically determine the full healthy distribution. But for industrial systems in gradually varying environments or subject to changing usage, acquiring such a comprehensive set of samples would require a long collection period and delay the point at which the anomaly detector could be trained and operational. In this paper, we propose a framework for the transfer of complementary operating conditions between different units, to train more robust anomaly detectors. The domain shift due to different units' specificities needs to be accounted for. This problem is an extension of Unsupervised Domain Adaptation to the one-class classification task. We solve the problem with adversarial deep learning and replace the traditional classification loss, unavailable in one-class problems, with a new loss inspired by a dimensionality reduction tool. This loss enforces the conservation of the inherent variability of each dataset while the adversarial architecture ensures the alignment of the distributions, hence correcting the domain shift. We demonstrate the benefit of this approach using three open source datasets.
Addestramento con Dataset Sbilanciati
This document compares several methods for balancing a dataset before training a model. The training dataset consists of short and medium-length sentences, such as simple phrases or extracts from conversations held on web channels. The models are trained with the structures provided by the Apache Spark framework. They may subsequently be useful for implementing a solution capable of classifying sentences in a distributed environment, as described in "New frontier of textual classification: Big data and distributed calculation" by Massimiliano Morrelli et al.
Understanding Brain Dynamics for Color Perception using Wearable EEG headband
Chaudhary, Mahima, Mukhopadhyay, Sumona, Litoiu, Marin, Sergio, Lauren E, Adams, Meaghan S
The perception of color is an important cognitive feature of the human brain. The variety of colors that impinge upon the human eye can trigger changes in brain activity which can be captured using electroencephalography (EEG). In this work, we have designed a multiclass classification model to detect the primary colors from the features of raw EEG signals. In contrast to previous research, our method employs spectral power features, statistical features as well as correlation features from the signal band power obtained from continuous Morlet wavelet transform instead of raw EEG, for the classification task. We have applied dimensionality reduction techniques such as Forward Feature Selection and Stacked Autoencoders to reduce the dimension of the data, thereby increasing the model's efficiency. Our proposed methodology using Forward Selection and Random Forest Classifier gave the best overall accuracy of 80.6% for intra-subject classification. Our approach shows promise in developing techniques for cognitive tasks using color cues, such as controlling Internet of Things (IoT) devices by looking at primary colors for individuals with restricted motor abilities.
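Forward feature selection, mentioned above, can be sketched as a greedy loop that adds one feature at a time while a score keeps improving. This is a generic illustration, not the authors' pipeline: the function `forward_select`, the scoring callback, and the feature names in the usage example are all hypothetical, and in practice the score would come from something like cross-validated classifier accuracy.

```python
def forward_select(feature_names, score_fn, max_features=None):
    """Greedy forward selection.

    score_fn takes a list of feature names and returns a number
    (higher is better). Selection stops when no remaining feature
    improves the score or when max_features is reached.
    """
    selected = []
    remaining = list(feature_names)
    best_score = float("-inf")
    limit = len(remaining) if max_features is None else max_features
    while remaining and len(selected) < limit:
        # Score every candidate subset formed by adding one more feature.
        candidates = [(score_fn(selected + [f]), f) for f in remaining]
        score, feature = max(candidates, key=lambda item: item[0])
        if score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

if __name__ == "__main__":
    # Toy score: fixed gain per feature minus a small penalty per feature used.
    gains = {"alpha_power": 0.5, "beta_power": 0.3, "noise": 0.0}
    toy_score = lambda feats: sum(gains[f] for f in feats) - 0.05 * len(feats)
    print(forward_select(list(gains), toy_score))
```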
Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study
Bahri, Dara, Tay, Yi, Zheng, Che, Metzler, Donald, Brunk, Cliff, Tomkins, Andrew
Large generative language models such as GPT-2 are well-known for their ability to generate text as well as their utility in supervised downstream tasks via fine-tuning. Our work is twofold: firstly we demonstrate via human evaluation that classifiers trained to discriminate between human and machine-generated text emerge as unsupervised predictors of "page quality", able to detect low quality content without any training. This enables fast bootstrapping of quality indicators in a low-resource setting. Secondly, curious to understand the prevalence and nature of low quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
Automated Detection of Cortical Lesions in Multiple Sclerosis Patients with 7T MRI
La Rosa, Francesco, Beck, Erin S, Abdulkadir, Ahmed, Thiran, Jean-Philippe, Reich, Daniel S, Sati, Pascal, Cuadra, Meritxell Bach
The automated detection of cortical lesions (CLs) in patients with multiple sclerosis (MS) is a challenging task that, despite its clinical relevance, has received very little attention. Accurate detection of the small and scarce lesions requires specialized sequences and high or ultra-high field MRI. For supervised training based on multimodal structural MRI at 7T, two experts generated ground truth segmentation masks of 60 patients with 2014 CLs. We implemented a simplified 3D U-Net with three resolution levels (3D U-Net-). By increasing the complexity of the task (adding brain tissue segmentation), while randomly dropping input channels during training, we improved the performance compared to the baseline. Considering a minimum lesion size of 0.75 µL, we achieved a lesion-wise cortical lesion detection rate of 67% and a false positive rate of 42%. However, 393 (24%) of the lesions reported as false positives were post-hoc confirmed as potential or definite lesions by an expert. This indicates the potential of the proposed method to support experts in the tedious process of CL manual segmentation.
Binarised Regression with Instance-Varying Costs: Evaluation using Impact Curves
Many evaluation methods exist, each for a particular prediction task, and there are a number of prediction tasks commonly performed including classification and regression. In binarised regression, binary decisions are generated from a learned regression model (or real-valued dependent variable), which is useful when the division between instances that should be predicted positive or negative depends on the utility. For example, in mining, the boundary between a valuable rock and a waste rock depends on the market price of various metals, which varies with time. This paper proposes impact curves to evaluate binarised regression with instance-varying costs, where some instances are much worse to be classified as positive (or negative) than other instances; e.g., it is much worse to throw away a high-grade gold rock than a medium-grade copper-ore rock, even if the mine wishes to keep both because both are profitable. We show how to construct an impact curve for a variety of domains, including examples from healthcare, mining, and entertainment. Impact curves optimize binary decisions across all utilities of the chosen utility function, identify the conditions where one model may be favoured over another, and quantitatively assess improvement between competing models.
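The basic ingredients of binarised regression with instance-varying costs can be sketched as follows: real-valued predictions are thresholded into binary decisions, and each wrong decision is charged that instance's own cost. This is an illustrative assumption about the setup, not the paper's impact-curve construction, which the abstract does not specify. The function names and the cost lists are hypothetical.

```python
def binarise(values, threshold):
    """Turn real-valued scores into binary decisions (True = positive)."""
    return [v >= threshold for v in values]

def total_cost(predicted, actual, threshold, fp_costs, fn_costs):
    """Sum per-instance misclassification costs at one threshold.

    predicted, actual: real-valued model outputs and true values.
    fp_costs[i]: cost of wrongly calling instance i positive.
    fn_costs[i]: cost of wrongly calling instance i negative.
    """
    decisions = binarise(predicted, threshold)
    truths = binarise(actual, threshold)
    cost = 0.0
    for decided, truth, fp, fn in zip(decisions, truths, fp_costs, fn_costs):
        if decided and not truth:
            cost += fp
        elif truth and not decided:
            cost += fn
    return cost

def cost_by_threshold(predicted, actual, thresholds, fp_costs, fn_costs):
    """Evaluate the total cost at each threshold (e.g. each market price)."""
    return [
        (t, total_cost(predicted, actual, t, fp_costs, fn_costs))
        for t in thresholds
    ]

if __name__ == "__main__":
    predicted = [0.2, 0.6, 0.9]
    actual = [0.3, 0.5, 1.0]
    print(cost_by_threshold(predicted, actual, [0.25, 0.55, 0.95],
                            fp_costs=[1, 2, 3], fn_costs=[4, 5, 6]))
```

Sweeping the threshold mirrors the mining example: as the boundary between valuable and waste rock moves, both the decisions and the true labels change, and so does the cost incurred.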