Accuracy
Ivy: Instrumental Variable Synthesis for Causal Inference
Kuang, Zhaobin, Sala, Frederic, Sohoni, Nimit, Wu, Sen, Córdova-Palomera, Aldo, Dunnmon, Jared, Priest, James, Ré, Christopher
A popular way to estimate the causal effect of a variable x on y from observational data is to use an instrumental variable (IV): a third variable z that affects y only through x. The more strongly z is associated with x, the more reliable the estimate is, but such strong IVs are difficult to find. Instead, practitioners combine more commonly available IV candidates---which are not necessarily strong, or even valid, IVs---into a single "summary" that is plugged into causal effect estimators in place of an IV. In genetic epidemiology, such approaches are known as allele scores. Allele scores require strong assumptions---independence and validity of all IV candidates---for the resulting estimate to be reliable. To relax these assumptions, we propose Ivy, a new method to combine IV candidates that can handle correlated and invalid IV candidates in a robust manner. Theoretically, we characterize this robustness, its limits, and its impact on the resulting causal estimates. Empirically, Ivy can correctly identify the directionality of known relationships and is robust against false discovery (median effect size <= 0.025) on three real-world datasets with no causal effects, while allele scores return more biased estimates (median effect size >= 0.118).
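To make the allele-score baseline concrete, the following is a minimal sketch on simulated data. It is not Ivy: it sums the IV candidates into an unweighted score and plugs that score into the standard Wald ratio estimator. The simulation settings (ten independent, valid binary candidates, a per-candidate effect of 0.3 on x, a true effect of 0.5, one Gaussian confounder) and all function names are assumptions made for illustration.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def simulate(n=20000, n_candidates=10, true_effect=0.5, seed=0):
    """Simulate confounded data with independent, valid binary IV candidates."""
    rng = random.Random(seed)
    scores, xs, ys = [], [], []
    for _ in range(n):
        z = [rng.randint(0, 1) for _ in range(n_candidates)]
        u = rng.gauss(0, 1)          # unobserved confounder of x and y
        score = sum(z)               # unweighted allele score
        x = 0.3 * score + u + rng.gauss(0, 1)
        y = true_effect * x + u + rng.gauss(0, 1)
        scores.append(score)
        xs.append(x)
        ys.append(y)
    return scores, xs, ys


def wald_estimate(summary, x, y):
    """Wald ratio: cov(summary, y) / cov(summary, x)."""
    return covariance(summary, y) / covariance(summary, x)


def naive_estimate(x, y):
    """Regression slope of y on x, which ignores the confounder."""
    return covariance(x, y) / covariance(x, x)


scores, xs, ys = simulate()
iv_estimate = wald_estimate(scores, xs, ys)
ols_estimate = naive_estimate(xs, ys)
print(f"allele-score IV estimate: {iv_estimate:.3f}")
print(f"naive regression estimate: {ols_estimate:.3f}")
```

In this idealized setting the IV estimate should land near the simulated effect while the naive slope is pulled upward by the confounder. The sketch satisfies exactly the independence and validity assumptions that the abstract says allele scores require, so it says nothing about how either method behaves when those assumptions fail.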
Training Data Set Assessment for Decision-Making in a Multiagent Landmine Detection Platform
Florez-Lozano, Johana, Caraffini, Fabio, Parra, Carlos, Gongora, Mario
Real-world problems such as landmine detection require multiple sources of information to reduce the uncertainty of decision-making. A novel approach to these problems uses distributed systems; the one presented in this work is based on hardware and software multi-agent systems. To achieve a high rate of landmine detection, we evaluate how the performance of a trained system depends on the distribution of samples between training and validation sets. Additionally, a general explanation of the data set is provided, presenting the samples gathered by a cooperative multi-agent system developed for detecting improvised explosive devices. The results show that input samples affect the performance of the output decisions, and a decision-making system can be less sensitive to sensor noise with intelligent systems obtained from a diverse and suitably organised training set.
A Modified Bayesian Optimization based Hyper-Parameter Tuning Approach for Extreme Gradient Boosting
The literature has already established that proper hyperparameter optimization greatly affects the performance of a machine learning algorithm. Hyperparameters can be tuned by manual search, but that is time-consuming. Common automated approaches are Grid search, Random search, and Bayesian optimization using Hyperopt. In this paper, we propose a new approach to hyperparameter optimization, Randomized-Hyperopt, and tune the hyperparameters of XGBoost (the Extreme Gradient Boosting algorithm) on ten datasets by applying Random search, Randomized-Hyperopt, Hyperopt, and Grid search. The performance of these four techniques was compared by taking both prediction accuracy and execution time into consideration. We find that Randomized-Hyperopt performs better than the other three conventional methods for hyperparameter optimization of XGBoost.
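As a rough illustration of two of the baselines named above, the sketch below contrasts Grid search and Random search on a toy objective. It does not implement Randomized-Hyperopt, which the abstract does not specify, and it does not call XGBoost or Hyperopt. The function `toy_validation_error`, the parameter ranges, and the trial budget are hypothetical stand-ins for a real cross-validated model error.

```python
import itertools
import random


def toy_validation_error(learning_rate, max_depth):
    """Hypothetical stand-in for a cross-validated model error (lower is better)."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 6) ** 2


def grid_search(objective, grid):
    """Evaluate every combination in the grid; return (best_params, best_score, n_evals)."""
    names = list(grid)
    best_params, best_score, n_evals = None, float("inf"), 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        n_evals += 1
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evals


def random_search(objective, samplers, n_trials, seed=0):
    """Draw n_trials random configurations; return (best_params, best_score, n_evals)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in samplers.items()}
        score = objective(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_trials


grid = {"learning_rate": [0.01, 0.05, 0.1, 0.3], "max_depth": [3, 6, 9]}
samplers = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-2, -0.5),  # log-uniform draw
    "max_depth": lambda rng: rng.randint(3, 9),
}

grid_best = grid_search(toy_validation_error, grid)
random_best = random_search(toy_validation_error, samplers, n_trials=12)
print("grid search:", grid_best)
print("random search:", random_best)
```

Both searches use the same budget of 12 evaluations here, so the comparison isolates how the configurations are chosen. A comparison like the one in the abstract would also record wall-clock time per method.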
Visual Spoofing in content based spam detection
Sokolov, Mark, Olufowobi, Kehinde, Herndon, Nic
"Subject: Please send money Body: I am so distraught. I thought i could reach out to you to help me out. I came down to United Kingdom for a short vacation unfortunately i was mugged at the park of the hotel i stayed, all cash, credit card and cell phone was stolen from me but luckily for me i still have my passport with me. I've been to the embassy and to the police here but they're not helping issues at all and, my flight leaves in few hours time from now but. I am having problems settling the hotel bills and the hotel manager won't let me leave until i settle my hotel bills. I'm freaked out at the moment." As expected, this email, which definitely seems to be spam, ends up in the junk email folder. However, in this paper we show that visual spoofing achieved by substituting some confusables (characters that look similar) into the above email text will enable the same email to bypass the spam filter. We also propose ways to address this loophole.
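The substitution idea can be sketched in a few lines. The example below is an illustration under stated assumptions, not the filter or the countermeasure studied in the paper: `CONFUSABLES` is a small hand-picked map from Latin letters to look-alike Cyrillic letters, `keyword_filter` is a hypothetical content filter that matches fixed phrases, and `normalize` is one possible mitigation that maps the look-alikes back before filtering.

```python
# Hand-picked Latin -> Cyrillic look-alikes (an illustrative subset only).
CONFUSABLES = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}
REVERSE = {spoofed: latin for latin, spoofed in CONFUSABLES.items()}


def spoof(text):
    """Replace Latin letters with visually similar Cyrillic ones."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


def normalize(text):
    """Map the known look-alikes back to Latin before filtering."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


def keyword_filter(text, keywords=("send money", "hotel bills")):
    """Hypothetical content filter: flag text containing any keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


original = "Please send money"
spoofed = spoof(original)

print(keyword_filter(original))            # flagged
print(keyword_filter(spoofed))             # looks the same, but is not flagged
print(keyword_filter(normalize(spoofed)))  # flagged again after normalization
```

A reverse map like this only covers the substitutions it lists, and it would corrupt legitimate Cyrillic text, so a practical defense would need a fuller confusables table and some awareness of the script in use.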
False negative coronavirus tests could be due to how healthcare workers are collecting samples
The US has tested more than 1.2 million Americans for coronavirus, but some have received negative results despite being infected. The coronavirus is a disease that forms in the lungs, but it sometimes sits in a cavity between the nose and throat where a swab is unable to reach. Although RT-polymerase chain reaction (rRT-PCR) detection is the 'gold standard' for testing, it can produce a false negative if the sample is not taken properly. Experts also believe that because hospitals and drive-thru testing sites are being flooded by people, healthcare workers are rushing to tend to as many individuals as possible and are not collecting the samples properly.
Latent regularization for feature selection using kernel methods in tumor classification
Palazzo, Martin, Yankilevich, Patricio, Beauseroy, Pierre
The transcriptomics of cancer tumors are characterized by tens of thousands of gene expression features. Given a gene expression profile, patient prognosis or tumor stage can be assessed by machine learning techniques such as supervised classification. Feature selection is a useful approach for selecting the key genes that help to classify tumors. In this work we propose a feature selection method based on Multiple Kernel Learning that results in a reduced subset of genes and a custom kernel that improves the classification performance when used in support vector classification. During the feature selection process, this method performs a novel latent regularization: it relaxes the supervised target problem by introducing unsupervised structure obtained from the latent space learned by a nonlinear dimensionality reduction model. The resulting improvement in generalization capacity is assessed by the tumor classification performance on new unseen test samples when the classifier is trained with the features selected by the proposed method, in comparison with other supervised feature selection approaches.
Multiclass Classification via Class-Weighted Nearest Neighbors
Khim, Justin, Xu, Ziyu, Singh, Shashank
Classification is a fundamental problem in statistics and machine learning that arises in many scientific and engineering problems. Scientific applications include identifying plant and animal species from body measurements, determining cancer types based on gene expression, and satellite image processing (Fisher, 1936, 1938; Khan et al., 2001; Lee et al., 2004); in modern engineering contexts, credit card fraud detection, handwritten digit recognition, word sense disambiguation, and object detection in images are all examples of classification tasks. These applications have brought two new challenges: multiclass classification with a potentially large number of classes and imbalanced data. For example, in online retailing, websites have hundreds of thousands or millions of products, and they may like to categorize these products within a preexisting taxonomy based on product descriptions (Lin et al., 2018). While the number of classes alone makes the problem difficult, an added difficulty with text data is that it is usually highly imbalanced, meaning that a few classes may constitute a large fraction of the data while many classes have only a few examples. In fact, Feldman (2019) notes that if the data follows the classical Zipf distribution for text data (Zipf, 1936), i.e., the class probabilities satisfy a power-law distribution, then up to 35% of seen examples may appear only once in the training data. Additionally, natural image data also seems to have the problems of many classes and imbalanced data (Salakhutdinov et al., 2011; Zhu et al., 2014). Focusing on the problem of imbalanced data, researchers have found that a few heuristics help "do better," and the most principled and studied of these is weighting. There are a number of forms of weighting; we consider the most basic, in which misclassifying an example of a given class incurs a loss weight specific to that class, and refer to this method as class-weighting.
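One simple way to picture class-weighting in a nearest-neighbor classifier is to scale each neighbor's vote by the weight of its class. The sketch below does that on a tiny one-dimensional example. It is an illustrative reading of the idea, not necessarily the estimator analyzed in the paper; the function name, the toy data, and the weight values are assumptions.

```python
from collections import defaultdict


def weighted_knn_predict(train_points, train_labels, query, k, class_weights):
    """Predict the class with the largest weighted vote among the k nearest neighbors."""
    # Rank training indices by squared Euclidean distance to the query.
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_points[i], query)),
    )
    votes = defaultdict(float)
    for i in by_distance[:k]:
        label = train_labels[i]
        # Each neighbor votes with its class weight (1.0 if no weight is given).
        votes[label] += class_weights.get(label, 1.0)
    return max(votes, key=votes.get)


# Imbalanced toy data: four "common" points and two "rare" points.
train_points = [(0.0,), (0.1,), (0.2,), (0.3,), (0.9,), (1.0,)]
train_labels = ["common", "common", "common", "common", "rare", "rare"]
query = (0.6,)

unweighted = weighted_knn_predict(train_points, train_labels, query, k=5, class_weights={})
weighted = weighted_knn_predict(
    train_points, train_labels, query, k=5, class_weights={"rare": 2.0}
)
print(unweighted)  # the majority class wins 3 votes to 2
print(weighted)    # the rare class wins 4.0 to 3.0 once its votes are doubled
```

Raising the weight of a rare class trades errors on the common class for fewer errors on the rare one, which is the trade-off that a class-weighted loss makes explicit.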
Diagnosing COVID-19 from X-Ray and Images using Deep Learning Algorithms Learn Neural Networks
Throughout history, epidemics and chronic diseases have claimed the lives of many people and caused major crises that have taken a long time to overcome. The 2019 novel coronavirus (COVID-19) pandemic appeared in Wuhan, China in December 2019 and has become a serious public health problem worldwide. It is an acute resolved disease, but it can also be deadly, with a 2% case fatality rate. The early and automatic diagnosis of COVID-19 may be beneficial for timely referral of the patient to quarantine and for monitoring the spread of the disease. With some tests requiring significant time (days) to produce results and a projected false positive rate of up to 30%, other timely approaches to diagnosis are worthy of investigation.
Model-Agnostic Characterization of Fairness Trade-offs
Kim, Joon Sik, Chen, Jiahao, Talwalkar, Ameet
There exist several inherent trade-offs in designing a fair model, such as those between the model's predictive performance and fairness, or even among different notions of fairness. In practice, exploring these trade-offs requires significant human and computational resources. We propose a diagnostic that enables practitioners to explore these trade-offs without training a single model. Our work hinges on the observation that many widely-used fairness definitions can be expressed via the fairness-confusion tensor, an object obtained by splitting the traditional confusion matrix according to protected data attributes. Optimizing accuracy and fairness objectives directly over the elements in this tensor yields a data-dependent yet model-agnostic way of understanding several types of trade-offs. We further leverage this tensor-based perspective to generalize existing theoretical impossibility results to a wider range of fairness definitions. Finally, we demonstrate the usefulness of the proposed diagnostic on synthetic and real datasets.
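To show what splitting the confusion matrix by a protected attribute looks like, the sketch below builds per-group confusion counts from toy labels and derives accuracy and a demographic parity gap from those counts alone. It only computes the tensor and two metrics; the paper's diagnostic, which optimizes over the tensor's elements, is not reproduced. The function names, the dictionary layout, and the toy data are assumptions.

```python
def fairness_confusion_tensor(y_true, y_pred, group):
    """Count tp/fp/fn/tn separately for each value of the protected attribute."""
    tensor = {}
    for truth, pred, g in zip(y_true, y_pred, group):
        cell = tensor.setdefault(g, {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
        # "t"/"f": prediction correct or not; "p"/"n": predicted positive or negative.
        key = ("t" if truth == pred else "f") + ("p" if pred == 1 else "n")
        cell[key] += 1
    return tensor


def accuracy(tensor):
    """Overall accuracy computed from the tensor entries."""
    correct = sum(cell["tp"] + cell["tn"] for cell in tensor.values())
    total = sum(sum(cell.values()) for cell in tensor.values())
    return correct / total


def demographic_parity_gap(tensor, group_a, group_b):
    """Absolute difference in the rate of positive predictions between two groups."""
    def positive_rate(cell):
        return (cell["tp"] + cell["fp"]) / sum(cell.values())
    return abs(positive_rate(tensor[group_a]) - positive_rate(tensor[group_b]))


# Toy labels, predictions, and protected attribute (illustrative only).
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

tensor = fairness_confusion_tensor(y_true, y_pred, group)
overall_accuracy = accuracy(tensor)
parity_gap = demographic_parity_gap(tensor, "a", "b")
print(tensor)
print(overall_accuracy, parity_gap)
```

Because both metrics are functions of the tensor entries alone, they can be studied without reference to any particular model, which is the observation the abstract builds on.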