 Accuracy


Phishing URL Detection: A Network-based Approach Robust to Evasion

arXiv.org Artificial Intelligence

Many cyberattacks start with the dissemination of phishing URLs. When a victim clicks one of these URLs, their private information is leaked to the attacker. Several machine learning methods have been proposed to detect phishing URLs. However, detecting phishing URLs with evasion, i.e., phishing URLs that pretend to be benign by manipulating patterns, remains under-explored. In many cases, the attacker i) reuses prepared phishing web pages because building a completely new set is costly, ii) prefers hosting companies that do not require private information and are cheaper than others, iii) prefers shared hosting for cost efficiency, and iv) sometimes uses benign domains, IP addresses, and URL string patterns to evade existing detection methods. Inspired by these behavioral characteristics, we present a network-based inference method to accurately detect phishing URLs camouflaged with legitimate patterns, i.e., a method that is robust to evasion. In the network approach, a phishing URL will still be identified as phishy even after evasion unless a majority of its neighbors in the network are evaded at the same time. Our method consistently shows better detection performance than state-of-the-art methods across various experimental tests, e.g., an F-1 of 0.89 for our method vs. 0.84 for the best feature-based method.
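The neighbor-majority intuition in this abstract can be illustrated with a small sketch. This is not the paper's inference method: it is a simplified majority vote over labeled neighbors, and the feature strings, URLs, and function names (`build_graph`, `majority_vote`) are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def build_graph(url_features):
    """Link two URLs whenever they share a feature value (e.g. an IP or host)."""
    by_feature = defaultdict(set)
    for url, feats in url_features.items():
        for feat in feats:
            by_feature[feat].add(url)
    neighbors = defaultdict(set)
    for urls in by_feature.values():
        for url in urls:
            neighbors[url] |= urls - {url}
    return neighbors

def majority_vote(url, neighbors, known_labels):
    """True if strictly more than half of the labeled neighbors are phishing."""
    labeled = [known_labels[n] for n in neighbors[url] if n in known_labels]
    if not labeled:
        return False  # no evidence either way; assumed benign in this sketch
    return sum(labeled) > len(labeled) / 2

# Hypothetical data: the suspect shares an IP and a host with known phishing
# URLs, and has added one benign IP as an evasion attempt.
url_features = {
    "a.example/login": {"ip:203.0.113.5", "host:cheaphost"},
    "b.example/verify": {"ip:203.0.113.5"},
    "c.example/pay": {"host:cheaphost"},
    "d.example/home": {"ip:198.51.100.7"},
    "suspect.example/x": {"ip:203.0.113.5", "host:cheaphost", "ip:198.51.100.7"},
}
known_labels = {
    "a.example/login": True,
    "b.example/verify": True,
    "c.example/pay": True,
    "d.example/home": False,
}

graph = build_graph(url_features)
print(majority_vote("suspect.example/x", graph, known_labels))
```

In this toy graph, a single benign-looking feature does not flip the vote, because three of the four labeled neighbors remain phishing.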


Real or Not? Disaster Tweets classification with RoBERTa

#artificialintelligence

This article was published as a part of the Data Science Blogathon. Today we live in a world of active social networking where every kind of information is shared among users worldwide. This is greatly facilitated by the ubiquity of smartphones and other handheld communication devices. Popular platforms include Facebook, WhatsApp, and LinkedIn; Twitter, however, is a microblogging site used worldwide for open information exchange. On Twitter, various types of information are exchanged in the form of short messages, including reports of mishaps or accidents happening around the world.


ML interview preparation-- popular topics

#artificialintelligence

Hello, happy to see you here. Today let's dive into some popular topics that are often discussed in machine learning interviews. Without further ado, here is the next cheat sheet. But what exactly do we do when we already have data? We will cover each task separately and in detail in upcoming articles.


Exploiting Fairness to Enhance Sensitive Attributes Reconstruction

arXiv.org Artificial Intelligence

In recent years, a growing body of work has emerged on how to learn machine learning models under fairness constraints, often expressed with respect to some sensitive attributes. In this work, we consider the setting in which an adversary has black-box access to a target model and show that information about this model's fairness can be exploited by the adversary to enhance his reconstruction of the sensitive attributes of the training data. More precisely, we propose a generic reconstruction correction method, which takes as input an initial guess made by the adversary and corrects it to comply with some user-defined constraints (such as the fairness information) while minimizing the changes in the adversary's guess. The proposed method is agnostic to the type of target model, the fairness-aware learning method as well as the auxiliary knowledge of the adversary. To assess the applicability of our approach, we have conducted a thorough experimental evaluation on two state-of-the-art fair learning methods, using four different fairness metrics with a wide range of tolerances and with three datasets of diverse sizes and sensitive attributes. The experimental results demonstrate the effectiveness of the proposed approach to improve the reconstruction of the sensitive attributes of the training set.


Semi-WTC: A Practical Semi-supervised Framework for Attack Categorization through Weight-Task Consistency

arXiv.org Artificial Intelligence

Supervised learning has been widely used for attack categorization, but it requires high-quality data and labels. However, the data is often imbalanced and it is difficult to obtain sufficient annotations. Moreover, supervised models are subject to real-world deployment issues, such as defending against unseen artificial attacks. To tackle these challenges, we propose a semi-supervised fine-grained attack categorization framework consisting of an encoder and a two-branch structure; the framework can be generalized to different supervised models. A multilayer perceptron with residual connections is used as the encoder to extract features and reduce complexity. The Recurrent Prototype Module (RPM) is proposed to train the encoder effectively in a semi-supervised manner. To alleviate the data imbalance problem, we introduce Weight-Task Consistency (WTC) into the iterative process of RPM by assigning larger weights to classes with fewer samples in the loss function. In addition, to cope with new attacks in real-world deployment, we propose an Active Adaption Resampling (AAR) method, which can better discover the distribution of unseen sample data and adapt the parameters of the encoder. Experimental results show that our model outperforms state-of-the-art semi-supervised attack detection methods with a 3% improvement in classification accuracy and a 90% reduction in training time.
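The idea of giving rarer classes larger weights in the loss can be sketched as follows. This is a generic inverse-frequency weighting, not the WTC formulation from the paper (which the abstract does not specify); the class names and function names are hypothetical.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class); probs is a list of class->prob dicts."""
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm

# Hypothetical imbalanced attack labels: 8 "dos" samples, 2 "probe" samples.
labels = ["dos"] * 8 + ["probe"] * 2
weights = inverse_frequency_weights(labels)
print(weights)  # the rarer class "probe" receives the larger weight

# A model that is confident on "dos" but poor on "probe" is penalized more
# under the weighted loss than under uniform weights.
probs = [{"dos": 0.9, "probe": 0.1}] * 10
uniform = {"dos": 1.0, "probe": 1.0}
print(weighted_cross_entropy(probs, labels, weights))
print(weighted_cross_entropy(probs, labels, uniform))
```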


Label Structure Preserving Contrastive Embedding for Multi-Label Learning with Missing Labels

arXiv.org Artificial Intelligence

Contrastive learning (CL) has shown impressive advances in image representation learning, in both supervised multi-class classification and unsupervised learning. However, these CL methods cannot be directly adapted to multi-label image classification because it is difficult to define the positive and negative instances to contrast with a given anchor image in the multi-label scenario, let alone the missing-label one. Borrowing the commonly used definition from contrastive multi-class learning would therefore incur many false negative instances, which is unfavorable for learning. In this paper, by introducing a label correction mechanism to identify missing labels, we first generate positives and negatives for the individual semantic labels of an anchor image, and then define a unique contrastive loss for multi-label image classification with missing labels (CLML). The loss accurately brings images close to their true positive images and false negative images, and pushes them far away from their true negative images. Unlike existing multi-label CL losses, CLML also preserves low-rank global and local label dependencies in the latent representation space, where such dependencies have been shown to help in dealing with missing labels. To the best of our knowledge, this is the first general multi-label CL loss for the missing-label scenario, and it can thus be seamlessly paired with the losses of any existing multi-label learning method via a single hyperparameter. The proposed strategy has been shown to improve the classification performance of the Resnet101 model by margins of 1.2%, 1.6%, and 1.3% on three standard datasets, MSCOCO, VOC, and NUS-WIDE, respectively. Code is available at https://github.com/chuangua/ContrastiveLossMLML.


Are Attribute Inference Attacks Just Imputation?

arXiv.org Artificial Intelligence

Models can expose sensitive information about their training data. In an attribute inference attack, an adversary has partial knowledge of some training records and access to a model trained on those records, and infers the unknown values of a sensitive feature of those records. We study a fine-grained variant of attribute inference we call sensitive value inference, where the adversary's goal is to identify with high confidence some records from a candidate set where the unknown attribute has a particular sensitive value. We explicitly compare attribute inference with data imputation that captures the training distribution statistics, under various assumptions about the training data available to the adversary. Our main conclusions are: (1) previous attribute inference methods do not reveal more about the training data from the model than can be inferred by an adversary without access to the trained model, but with the same knowledge of the underlying distribution as needed to train the attribute inference attack; (2) black-box attribute inference attacks rarely learn anything that cannot be learned without the model; but (3) white-box attacks, which we introduce and evaluate in the paper, can reliably identify some records with the sensitive value attribute that would not be predicted without having access to the model. Furthermore, we show that proposed defenses such as differentially private training and removing vulnerable records from training do not mitigate this privacy risk. The code for our experiments is available at https://github.com/bargavj/EvaluatingDPML.


Improving debris flow evacuation alerts in Taiwan using machine learning

arXiv.org Artificial Intelligence

Taiwan has the highest susceptibility to and fatalities from debris flows worldwide. The existing debris flow warning system in Taiwan, which uses a time-weighted measure of rainfall, leads to alerts when the measure exceeds a predefined threshold. However, this system generates many false alarms and misses a substantial fraction of the actual debris flows. Towards improving this system, we implemented five machine learning models that input historical rainfall data and predict whether a debris flow will occur within a selected time. We found that a random forest model performed the best among the five models and outperformed the existing system in Taiwan. Furthermore, we identified the rainfall trajectories strongly related to debris flow occurrences and explored trade-offs between the risks of missing debris flows versus frequent false alerts. These results suggest the potential for machine learning models trained on hourly rainfall data alone to save lives while reducing false alerts.
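A threshold alert on a time-weighted rainfall measure, as described for the existing warning system, might look like the sketch below. The abstract does not give the actual weighting formula, so the exponential decay, the `decay` value, and the threshold used here are assumptions for illustration only.

```python
def time_weighted_rainfall(hourly_mm, decay=0.9):
    """Exponentially decayed rainfall sum; hourly_mm is ordered oldest to newest.

    The decay form is an assumption, not the formula used in Taiwan's system.
    """
    total = 0.0
    for mm in hourly_mm:
        total = total * decay + mm  # older rain counts less with every hour
    return total

def should_alert(hourly_mm, threshold_mm, decay=0.9):
    """Raise an alert when the weighted measure exceeds the threshold."""
    return time_weighted_rainfall(hourly_mm, decay) > threshold_mm

# Same total rainfall, different timing: recent rain weighs more.
old_storm = [10.0, 0.0, 0.0]
recent_storm = [0.0, 0.0, 10.0]
print(time_weighted_rainfall(old_storm, decay=0.5))     # 2.5
print(time_weighted_rainfall(recent_storm, decay=0.5))  # 10.0
print(should_alert(recent_storm, threshold_mm=5.0, decay=0.5))
```

A fixed rule of this kind depends on a single scalar summary of the rainfall history, whereas the models in the abstract take the historical rainfall data itself as input.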


Identifying Transients in the Dark Energy Survey using Convolutional Neural Networks

arXiv.org Machine Learning

The ability to discover new transient candidates via image differencing without direct human intervention is an important task in observational astronomy. For these kinds of image classification problems, machine learning techniques such as Convolutional Neural Networks (CNNs) have shown remarkable success. In this work, we present the results of an automated transient candidate identification on images with CNNs for an extant dataset from the Dark Energy Survey Supernova program (DES-SN), whose main focus was on using Type Ia supernovae for cosmology. By performing an architecture search of CNNs, we identify networks that efficiently select non-artifacts (e.g. [...]). The CNNs also help us identify a subset of mislabeled images. After relabeling the images in this subset, the resulting classification with CNNs is significantly better than previous results, lowering the false positive rate by 27% at a fixed missed detection rate of 0.05.

INTRODUCTION

A major aspect of observational astronomy is the "survey", which involves the wholesale mapping of various regions of the sky to create catalogs that are subsequently mined for scientifically important astronomical objects. We refer to a transient candidate as the detection on a single image of a new or varying source with respect to a previously taken reference image, regardless of its astrophysical nature, since at this stage its classification is unknown and will remain so until further data is taken (spectroscopy and/or additional photometry). Some examples of such transient candidates are solar system objects, supernovae, active galactic nuclei, variable stars, and neutron star mergers. Since some of these events are quite rare and fade rapidly, it is often important to trigger follow-up observations immediately to glean their underlying nature and discover new physics. Hence, identifying transient candidates in images quickly and efficiently is very important so as not to waste precious, and expensive, follow-up resources. For many years this process was conducted by manual inspection of images by humans.
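The metric quoted in the abstract, the false positive rate at a fixed missed detection rate, can be computed from classifier scores as in the sketch below. The scores are made up, and the function name and threshold rule are assumptions, not the paper's evaluation code.

```python
import math

def fpr_at_fixed_mdr(pos_scores, neg_scores, max_mdr=0.05):
    """Pick the highest threshold that misses at most max_mdr of the positives.

    A candidate is accepted when score >= threshold. Returns the tuple
    (threshold, false_positive_rate, missed_detection_rate).
    """
    pos = sorted(pos_scores)
    # Small epsilon guards against float rounding just below an integer.
    allowed_misses = math.floor(max_mdr * len(pos) + 1e-9)
    threshold = pos[allowed_misses]  # positives strictly below it are missed
    missed = sum(1 for s in pos if s < threshold)
    false_pos = sum(1 for s in neg_scores if s >= threshold)
    return threshold, false_pos / len(neg_scores), missed / len(pos)

# Hypothetical scores: 20 real transients (positives) and 4 artifacts (negatives).
pos_scores = [i / 20 for i in range(1, 21)]
neg_scores = [0.02, 0.04, 0.08, 0.5]
threshold, fpr, mdr = fpr_at_fixed_mdr(pos_scores, neg_scores, max_mdr=0.05)
print(threshold, fpr, mdr)
```

Fixing the missed detection rate first and then reading off the false positive rate makes two classifiers comparable at the same level of completeness.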


Overview of Machine Learning

#artificialintelligence

In layman's terms, machine learning allows computers to learn automatically from data in order to acquire knowledge. As a discipline, machine learning usually refers to a class of problems and the methods for solving them: how to find regularities in observed data and use the learned regularities to make predictions about unknown or unobservable data. In early engineering practice, machine learning was often called pattern recognition, but pattern recognition is more oriented towards specific application tasks, such as optical character recognition, speech recognition, and face recognition. What characterizes these tasks is that they are easy for humans to perform, yet we do not know how we perform them, so it is difficult to design a computer program by hand to do the same. A feasible approach is to design an algorithm that lets the computer learn rules from labeled samples and use them to complete various recognition tasks. As machine learning technology is applied more and more widely, the concept of machine learning is gradually replacing pattern recognition as the general term for this class of problems and their solutions. Take handwritten digit recognition as an example: we want the computer to recognize handwritten digits automatically. Handwritten digit recognition is a classic machine learning task that is simple for humans but very difficult for computers. It is hard for us to summarize the handwriting characteristics of each digit, or the rules for telling digits apart, so designing a recognition algorithm by hand is almost impossible. Many real-life problems, such as object recognition and speech recognition, resemble handwritten digit recognition. For this kind of problem, we do not know how to design a computer program to solve it. Even if it can be solved with heuristic rules, the process is extremely complicated.
Therefore, people began to try another approach: let the computer see a large number of samples, learn from them, and then use what it has learned to identify new samples. To recognize handwritten digits, we first manually annotate a large number of handwritten digit images (that is, each image is manually marked with the digit it shows). These images are used as training data, a model is then generated automatically by a learning algorithm, and we rely on that model to recognize new handwritten digits. This way of learning from data is called machine learning. We now use an everyday example to introduce some basic concepts in machine learning: samples, features, labels, models, learning algorithms, etc. Suppose we want to buy mangoes in the market but have no experience in selecting them. How can we acquire this knowledge through learning? First, we randomly select some mangoes from the market and list the characteristics of each mango.
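To make the terms samples, features, labels, model, and learning algorithm concrete, here is a minimal sketch using the mango example. The two numeric features, their values, and the labels are invented for illustration, and a 1-nearest-neighbor rule stands in for the learning algorithm; the article does not prescribe any of these.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbor_predict(training_samples, query_features):
    """Return the label of the training sample closest to the query."""
    _, label = min(
        training_samples,
        key=lambda sample: squared_distance(sample[0], query_features),
    )
    return label

# Each sample is (features, label). The hypothetical features are
# (color score, softness), both scaled to the range 0..1.
training_samples = [
    ((0.9, 0.8), "sweet"),
    ((0.8, 0.6), "sweet"),
    ((0.2, 0.3), "not sweet"),
    ((0.3, 0.7), "not sweet"),
]

# Here the "model" is simply the stored training set plus the distance rule.
print(nearest_neighbor_predict(training_samples, (0.85, 0.7)))
print(nearest_neighbor_predict(training_samples, (0.25, 0.4)))
```

The labeled mangoes play the same role as the annotated digit images: they are the training data from which predictions about new, unlabeled samples are made.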