Performance Analysis
A Modified Bayesian Optimization based Hyper-Parameter Tuning Approach for Extreme Gradient Boosting
The literature has established that proper hyperparameter optimization greatly affects the performance of a machine learning algorithm. Manual search is one way to perform hyperparameter optimization, but it is time consuming. Common automated approaches include grid search, random search, and Bayesian optimization using Hyperopt. In this paper, we propose a new approach to hyperparameter optimization, Randomized-Hyperopt, and tune the hyperparameters of XGBoost (the Extreme Gradient Boosting algorithm) on ten datasets by applying random search, Randomized-Hyperopt, Hyperopt, and grid search. The performance of these four techniques was compared in terms of both prediction accuracy and execution time. We find that Randomized-Hyperopt performs better than the other three conventional methods for hyperparameter optimization of XGBoost.
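To make the difference between two of the baselines concrete, the following minimal sketch contrasts grid search with random search. It is not the Randomized-Hyperopt method from the abstract and it trains no XGBoost model: the search space `SPACE` and the objective `toy_score` are hypothetical stand-ins for a real parameter grid and a cross-validated accuracy.

```python
import itertools
import random

# Hypothetical search space. The names mirror common XGBoost parameters,
# but no XGBoost model is trained in this sketch.
SPACE = {
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}


def toy_score(params):
    """Stand-in for cross-validated accuracy (higher is better)."""
    return -(
        (params["max_depth"] - 5) ** 2 / 10
        + (params["learning_rate"] - 0.1) ** 2 * 10
        + (params["subsample"] - 0.8) ** 2
    )


def grid_search(space, score):
    """Evaluate every combination in the grid and return the best one."""
    names = list(space)
    candidates = [
        dict(zip(names, combo)) for combo in itertools.product(*space.values())
    ]
    return max(candidates, key=score), len(candidates)


def random_search(space, score, n_trials, seed=0):
    """Evaluate n_trials configurations sampled uniformly from the grid."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_trials)
    ]
    return max(candidates, key=score), len(candidates)


grid_best, grid_evals = grid_search(SPACE, toy_score)
rand_best, rand_evals = random_search(SPACE, toy_score, n_trials=10)
print(grid_evals, grid_best)
print(rand_evals, rand_best)
```

Grid search pays for all 48 evaluations and is guaranteed to find the best point on the grid, while random search spends a fixed budget of 10 and may or may not reach it. That is the accuracy versus execution time trade-off the comparison in the abstract measures.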
Visual Spoofing in content based spam detection
Sokolov, Mark, Olufowobi, Kehinde, Herndon, Nic
"Subject: Please send money Body: I am so distraught. I thought i could reach out to you to help me out. I came down to United Kingdom for a short vacation unfortunately i was mugged at the park of the hotel i stayed, all cash, credit card and cell phone was stolen from me but luckily for me i still have my passport with me. I've been to the embassy and to the police here but they're not helping issues at all and, my flight leaves in few hours time from now but. I am having problems settling the hotel bills and the hotel manager won't let me leave until i settle my hotel bills. I'm freaked out at the moment." As expected, this email, which definitely seems to be spam, ends up in the junk email folder. However, in this paper we show that visual spoofing achieved by substituting some confusables (characters that look similar) into the above email text will enable the same email to bypass the spam filter. We also propose ways to address this loophole.
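The effect of confusables can be illustrated with a small sketch. Everything here is an assumption made for illustration: the five-entry `CONFUSABLES` table, the keyword list, and the keyword-matching "filter" are toys, whereas real spam filters are statistical and the paper's own experiments and countermeasures are not reproduced.

```python
# Hypothetical, tiny confusables table: Latin letter -> Cyrillic look-alike.
CONFUSABLES = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}
REVERSE = {spoofed: latin for latin, spoofed in CONFUSABLES.items()}

SPAM_KEYWORDS = ("send money", "hotel bills")  # toy keyword list


def spoof(text):
    """Replace Latin letters with visually similar Cyrillic ones."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


def normalize(text):
    """Map known look-alikes back to Latin before filtering."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


def looks_like_spam(text):
    """Toy content-based filter: flag text containing any spam keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SPAM_KEYWORDS)


original = "Subject: Please send money"
spoofed = spoof(original)

print(looks_like_spam(original))            # keyword match on the plain text
print(looks_like_spam(spoofed))             # look-alikes defeat the match
print(looks_like_spam(normalize(spoofed)))  # normalization restores it
```

The spoofed string renders almost identically to the original for a human reader but no longer contains the byte sequences the filter looks for. Mapping look-alikes back to a canonical form before classification is one possible mitigation under these assumptions.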
Multimodal Categorization of Crisis Events in Social Media
Abavisani, Mahdi, Wu, Liwei, Hu, Shengli, Tetreault, Joel, Jaimes, Alejandro
Recent developments in image classification and natural language processing, coupled with the rapid growth in social media usage, have enabled fundamental advances in detecting breaking events around the world in real time. Emergency response is one such area that stands to gain from these advances. By processing billions of texts and images a minute, events can be automatically detected to enable emergency response workers to better assess rapidly evolving situations and deploy resources accordingly. To date, most event detection techniques in this area have focused on image-only or text-only approaches, limiting detection performance and impacting the quality of information delivered to crisis response teams. In this paper, we present a new multimodal fusion method that leverages both images and texts as input. In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities on a sample-by-sample basis. In addition, we employ a multimodal graph-based approach that stochastically transitions between embeddings of different multimodal pairs during training, both to better regularize the learning process and to deal with limited training data by constructing new matched pairs from different samples. We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
False negative coronavirus tests could be due to how healthcare workers are collecting samples
The US has tested more than 1.2 million Americans for coronavirus, but some have received negative results despite being infected. The coronavirus is a disease that forms in the lungs, but it sometimes sits in a cavity between the nose and throat where a swab is unable to reach. Although RT-polymerase chain reaction (rRT-PCR) detection is the 'gold standard' for testing, it can produce a false negative if the sample is not taken properly. Experts also believe that, because hospitals and drive-thru testing sites are being flooded by people, healthcare workers are rushing to tend to as many individuals as possible and are not collecting the samples properly.
Latent regularization for feature selection using kernel methods in tumor classification
Palazzo, Martin, Yankilevich, Patricio, Beauseroy, Pierre
The transcriptomics of cancer tumors are characterized by tens of thousands of gene expression features. Given a gene expression profile, patient prognosis or tumor stage can be assessed with machine learning techniques such as supervised classification. Feature selection is a useful approach for selecting the key genes that help classify tumors. In this work we propose a feature selection method based on Multiple Kernel Learning that yields a reduced subset of genes and a custom kernel that improves classification performance when used in support vector classification. During the feature selection process, this method performs a novel latent regularization: it relaxes the supervised target problem by introducing unsupervised structure obtained from the latent space learned by a nonlinear dimensionality reduction model. The resulting improvement in generalization capacity is assessed by tumor classification performance on new, unseen test samples when the classifier is trained with the features selected by the proposed method, in comparison with other supervised feature selection approaches.
Multiclass Classification via Class-Weighted Nearest Neighbors
Khim, Justin, Xu, Ziyu, Singh, Shashank
Classification is a fundamental problem in statistics and machine learning that arises in many scientific and engineering problems. Scientific applications include identifying plant and animal species from body measurements, determining cancer types based on gene expression, and satellite image processing (Fisher, 1936, 1938; Khan et al., 2001; Lee et al., 2004); in modern engineering contexts, credit card fraud detection, handwritten digit recognition, word sense disambiguation, and object detection in images are all examples of classification tasks. These applications have brought two new challenges: multiclass classification with a potentially large number of classes and imbalanced data. For example, in online retailing, websites have hundreds of thousands or millions of products, and they may want to categorize these products within a preexisting taxonomy based on product descriptions (Lin et al., 2018). While the number of classes alone makes the problem difficult, an added difficulty with text data is that it is usually highly imbalanced, meaning that a few classes may constitute a large fraction of the data while many classes have only a few examples. In fact, Feldman (2019) notes that if the data follows the classical Zipf distribution for text data (Zipf, 1936), i.e., the class probabilities satisfy a power-law distribution, then up to 35% of seen examples may appear only once in the training data. Additionally, natural image data also seems to have the problems of many classes and imbalanced data (Salakhutdinov et al., 2011; Zhu et al., 2014). Focusing on the problem of imbalanced data, researchers have found that a few heuristics help "do better," and the most principled and studied of these is weighting. There are a number of forms of weighting; we consider the most basic, in which misclassifying an example of a given class incurs a loss equal to a weight specific to that class, and refer to this method as class-weighting.
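One simple way to combine class-weighting with nearest neighbors is sketched below. It is an illustrative assumption rather than the estimator analyzed in the paper: the function `weighted_knn_predict`, the toy data, and the weights are all hypothetical, and the rule used is to multiply each class's count among the k nearest neighbors by that class's weight before taking the maximum.

```python
from collections import Counter


def weighted_knn_predict(train, query, k, class_weights):
    """Predict a label for query from train, a list of (features, label) pairs.

    Each class's vote count among the k nearest neighbors is multiplied by
    its weight, so a higher weight makes missing that class more costly.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    neighbors = sorted(train, key=lambda pair: sq_dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return max(votes, key=lambda label: class_weights.get(label, 1.0) * votes[label])


# Toy imbalanced data: four "common" points and two "rare" points.
train = [
    ((0.0, 0.0), "common"), ((0.1, 0.0), "common"), ((0.0, 0.1), "common"),
    ((0.2, 0.2), "common"), ((0.3, 0.3), "rare"), ((0.35, 0.3), "rare"),
]
query = (0.25, 0.25)

unweighted = weighted_knn_predict(train, query, k=5, class_weights={})
weighted = weighted_knn_predict(
    train, query, k=5, class_weights={"common": 1.0, "rare": 2.0}
)
print(unweighted, weighted)
```

With k = 5 the neighborhood contains three "common" and two "rare" points, so the unweighted vote picks the majority class, while doubling the weight of the minority class flips the prediction.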
Diagnosing COVID-19 from X-Ray and Images using Deep Learning Algorithms Learn Neural Networks
Throughout history, epidemics and chronic diseases have claimed the lives of many people and caused major crises that have taken a long time to overcome. The 2019 novel coronavirus (COVID-19) pandemic appeared in Wuhan, China in December 2019 and has become a serious public health problem worldwide. It is an acute disease that resolves in most cases, but it can also be deadly, with a 2% case fatality rate. The early and automatic diagnosis of COVID-19 may be beneficial for timely referral of the patient to quarantine and for monitoring the spread of the disease. With some tests requiring significant time (days) to produce results and a projected false positive rate of up to 30%, other timely approaches to diagnosis are worthy of investigation.
Model-Agnostic Characterization of Fairness Trade-offs
Kim, Joon Sik, Chen, Jiahao, Talwalkar, Ameet
There exist several inherent trade-offs in designing a fair model, such as those between the model's predictive performance and fairness, or even among different notions of fairness. In practice, exploring these trade-offs requires significant human and computational resources. We propose a diagnostic that enables practitioners to explore these trade-offs without training a single model. Our work hinges on the observation that many widely-used fairness definitions can be expressed via the fairness-confusion tensor, an object obtained by splitting the traditional confusion matrix according to protected data attributes. Optimizing accuracy and fairness objectives directly over the elements in this tensor yields a data-dependent yet model-agnostic way of understanding several types of trade-offs. We further leverage this tensor-based perspective to generalize existing theoretical impossibility results to a wider range of fairness definitions. Finally, we demonstrate the usefulness of the proposed diagnostic on synthetic and real datasets.
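The idea of splitting the confusion matrix by a protected attribute can be sketched as follows. This only builds the tensor from given predictions and reads two quantities off it; it does not reproduce the optimization over tensor elements that the abstract describes. The function names, the toy labels, and the choice of demographic parity as the fairness notion are assumptions for illustration.

```python
from collections import defaultdict


def fairness_confusion_tensor(y_true, y_pred, group):
    """Return {group: {"TP", "FP", "FN", "TN" counts}} for binary labels."""
    tensor = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0, "TN": 0})
    for t, p, g in zip(y_true, y_pred, group):
        # First letter: prediction correct (T) or not (F).
        # Second letter: predicted positive (P) or negative (N).
        cell = ("T" if t == p else "F") + ("P" if p == 1 else "N")
        tensor[g][cell] += 1
    return dict(tensor)


def accuracy(tensor):
    """Overall accuracy, summed over all groups."""
    correct = sum(c["TP"] + c["TN"] for c in tensor.values())
    total = sum(sum(c.values()) for c in tensor.values())
    return correct / total


def positive_rate(cells):
    """Fraction of one group's examples that were predicted positive."""
    return (cells["TP"] + cells["FP"]) / sum(cells.values())


def demographic_parity_gap(tensor, group_a, group_b):
    """Absolute difference in positive prediction rates between two groups."""
    return abs(positive_rate(tensor[group_a]) - positive_rate(tensor[group_b]))


y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

tensor = fairness_confusion_tensor(y_true, y_pred, group)
print(tensor)
print(accuracy(tensor), demographic_parity_gap(tensor, "a", "b"))
```

Both the accuracy and the fairness gap are functions of the same eight counts, which is what makes it possible to reason about their trade-off over the tensor itself rather than over a particular trained model.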
Imbalanced Data Learning by Minority Class Augmentation using Capsule Adversarial Networks
Shamsolmoali, Pourya, Zareapoor, Masoumeh, Shen, Linlin, Sadka, Abdul Hamid, Yang, Jie
The fact that image datasets are often imbalanced poses an intense challenge for deep learning techniques. In this paper, we propose a method to restore the balance in imbalanced image datasets by coalescing two concurrent methods, generative adversarial networks (GANs) and capsule networks. In our model, the generative and discriminative networks play a novel competitive game, in which the generator generates samples towards specific classes from a multivariate probability distribution. The discriminator of our model is designed so that, while recognizing real and fake samples, it is also required to assign classes to the inputs. Since GAN approaches require fully observed data during training, when the training samples are imbalanced they might generate similar samples, leading to overfitting. This problem is addressed by providing all the available information from both class components jointly in the adversarial training. Our method improves learning from imbalanced data by incorporating the majority distribution structure in the generation of new minority samples. Furthermore, the generator is trained with a feature matching loss function to improve training convergence. In addition, the method prevents the generation of outliers and does not affect the majority class space. The evaluations show the effectiveness of our proposed methodology; in particular, the coalescing of capsule-GAN is effective at recognizing highly overlapping classes with much fewer parameters compared with the convolutional-GAN.