Inductive Learning
Beyond the Hype: A Real-World Evaluation of the Impact and Cost of Machine Learning-Based Malware Detection
Bridges, Robert A., Oesch, Sean, Verma, Miki E., Iannacone, Michael D., Huffer, Kelly M. T., Jewell, Brian, Nichols, Jeff A., Weber, Brian, Beaver, Justin M., Smith, Jared M., Scofield, Daniel, Miles, Craig, Plummer, Thomas, Daniell, Mark, Tall, Anne M.
Attackers use malicious software, known as malware, to steal sensitive data, damage network infrastructure, and hold information for ransom. One of the top priorities for computer security tools is to detect malware and prevent or minimize its impact on both corporate and personal networks. Traditionally, signature-based methods have been used to detect files previously identified as malicious with near perfect precision, but potentially miss newer malware samples. With the advent of self-modifying malware and the rapid increase in novel threats, signature-based methods are insufficient on their own. By generalizing patterns of known benign/malicious training examples, machine learning (ML) exhibits the capability to quickly and accurately classify novel file samples in many research studies [19]. Moreover, ML-based malware research has made the transition from the subject of myriad research efforts to a current mainstay of commercial-off-the-shelf (COTS) malware detectors. Yet, few practical evaluations of COTS ML-based technologies have been conducted. Turning from the academic literature to market reports from commercial companies can provide (for a fee) useful information, specifically, end-user feedback, itemization of all technologies in the antivirus/endpoint detection and response marketplace [17], and even statistics showing the efficacy of the detectors on malware tests [4, 40].
DeeperDive: The Unreasonable Effectiveness of Weak Supervision in Document Understanding. A Case Study in Collaboration with UiPath Inc
Elwany, Emad, Hegel, Allison, Shah, Marina, Roof, Brendan, Peaslee, Genevieve, Rivet, Quentin
Weak supervision has been applied to various Natural Language Understanding tasks in recent years. Due to technical challenges with scaling weak supervision to work on long-form documents, spanning up to hundreds of pages, applications in the document understanding space have been limited. At Lexion, we built a weak supervision-based system tailored for long-form (10-200 pages long) PDF documents. We use this platform for building dozens of language understanding models and have applied it successfully to various domains, from commercial agreements to corporate formation documents. In this paper, we demonstrate the effectiveness of supervised learning with weak supervision in a situation with limited time, workforce, and training data. We built 8 high quality machine learning models in the span of one week, with the help of a small team of just 3 annotators working with a dataset of under 300 documents. We share some details about our overall architecture, how we utilize weak supervision, and what results we are able to achieve. We also include the dataset for researchers who would like to experiment with alternate approaches or refine ours. Furthermore, we shed some light on the additional complexities that arise when working with poorly scanned long-form documents in PDF format, and some of the techniques that help us achieve state-of-the-art performance on such data.
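As a rough illustration of what weak supervision means in practice, the sketch below combines a few cheap heuristics ("labeling functions") by majority vote to produce a noisy training label for a contract clause. This is not Lexion's system: the labeling functions, the keywords, the governing-law task, and the `ABSTAIN` convention are all hypothetical, chosen only to show the general pattern.

```python
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote

# Hypothetical labeling functions: cheap heuristics that emit a noisy label.
# Label 1 = "governing-law clause", label 0 = "some other clause".
def lf_mentions_governed(text):
    return 1 if "governed by" in text.lower() else ABSTAIN

def lf_mentions_laws_of(text):
    return 1 if "laws of" in text.lower() else ABSTAIN

def lf_mentions_payment(text):
    return 0 if "payment" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_governed, lf_mentions_laws_of, lf_mentions_payment]

def weak_label(text):
    """Majority vote over non-abstaining labeling functions.

    Returns None when every function abstains or when the vote is tied,
    so that such examples can be left out of the training set.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return None
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the top two labels
    return counts[0][0]

if __name__ == "__main__":
    print(weak_label("This Agreement is governed by the laws of Delaware."))
    print(weak_label("Payment is due within 30 days."))
```

A plain majority vote is the simplest way to aggregate the votes. Weak supervision frameworks typically replace it with a model that estimates how reliable each labeling function is.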
Learning New Skills after Deployment: Improving open-domain internet-driven dialogue with human feedback
Xu, Jing, Ung, Megan, Komeili, Mojtaba, Arora, Kushal, Boureau, Y-Lan, Weston, Jason
Frozen models trained to mimic static datasets can never improve their performance. Models that can employ internet-retrieval for up-to-date information and obtain feedback from humans during deployment provide the promise of both adapting to new information, and improving their performance. In this work we study how to improve internet-driven conversational skills in such a learning framework. We collect deployment data, which we make publicly available, of human interactions, and collect various types of human feedback -- including binary quality measurements, free-form text feedback, and fine-grained reasons for failure. We then study various algorithms for improving from such feedback, including standard supervised learning, rejection sampling, model-guiding and reward-based learning, in order to make recommendations on which type of feedback and algorithms work best. We find the recently introduced Director model (Arora et al., '22) shows significant improvements over other existing approaches.
Learning Representations with Contrastive Self-Supervised Learning for Histopathology Applications
Stacke, Karin, Unger, Jonas, Lundström, Claes, Eilertsen, Gabriel
Unsupervised learning has made substantial progress over the last few years, especially by means of contrastive self-supervised learning. The dominating dataset for benchmarking self-supervised learning has been ImageNet, for which recent methods are approaching the performance achieved by fully supervised training. The ImageNet dataset is however largely object-centric, and it is not clear yet what potential those methods have on widely different datasets and tasks that are not object-centric, such as in digital pathology. While self-supervised learning has started to be explored within this area with encouraging results, there is reason to look closer at how this setting differs from natural images and ImageNet. In this paper we make an in-depth analysis of contrastive learning for histopathology, pin-pointing how the contrastive objective will behave differently due to the characteristics of histopathology data. We bring forward a number of considerations, such as view generation for the contrastive objective and hyper-parameter tuning. In a large battery of experiments, we analyze how the downstream performance in tissue classification will be affected by these considerations. The results point to how contrastive learning can reduce the annotation effort within digital pathology, but that the specific dataset characteristics need to be considered. To take full advantage of the contrastive learning objective, different calibrations of view generation and hyper-parameters are required. Our results pave the way for realizing the full potential of self-supervised learning for histopathology applications.
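The abstract does not say which contrastive loss is analyzed. As an assumption, the sketch below uses the NT-Xent (InfoNCE) loss popularized by SimCLR, a common form of the contrastive objective. Each embedding is pulled toward the embedding of the other view of the same image and pushed away from every other embedding in the batch. The function name and the temperature value are illustrative.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss for two batches of view embeddings of shape (N, d).

    Row i of z1 and row i of z2 are two views of the same image (a positive
    pair); all other rows in the combined batch act as negatives.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # an embedding is never its own negative

    # Index of the positive partner for each of the 2N rows.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

if __name__ == "__main__":
    views = np.eye(4)
    print(nt_xent_loss(views, views))                       # matching views
    print(nt_xent_loss(views, np.roll(views, 1, axis=0)))   # mismatched views
```

Because every other sample in the batch is treated as a negative, the loss depends on how different two randomly drawn images really are. This is one reason why view generation and hyper-parameters such as the temperature may need recalibrating for data that is less varied than ImageNet.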
Self-paced learning to improve text row detection in historical documents with missing labels
Gaman, Mihaela, Ghadamiyan, Lida, Ionescu, Radu Tudor, Popescu, Marius
An important preliminary step of optical character recognition systems is the detection of text rows. To address this task in the context of historical data with missing labels, we propose a self-paced learning algorithm capable of improving the row detection performance. We conjecture that pages with more ground-truth bounding boxes are less likely to have missing annotations. Based on this hypothesis, we sort the training examples in descending order with respect to the number of ground-truth bounding boxes, and organize them into k batches. Using our self-paced learning method, we train a row detector over k iterations, progressively adding batches with fewer ground-truth annotations. At each iteration, we combine the ground-truth bounding boxes with pseudo-bounding boxes (bounding boxes predicted by the model itself) using non-maximum suppression, and we include the resulting annotations at the next training iteration. We demonstrate that our self-paced learning strategy brings significant performance gains on two data sets of historical documents, improving the average precision of YOLOv4 by more than 12% on one data set and 39% on the other.
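The two data-handling steps described above (ordering pages into k batches, and merging ground-truth boxes with pseudo-boxes through non-maximum suppression) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The page dictionary layout, the function names, the IoU threshold of 0.5, and the choice to give ground-truth boxes a score of 1.0 so that they win over overlapping predictions are all assumptions. The detector training loop itself is omitted.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(scored_boxes, iou_threshold=0.5):
    """Greedy non-maximum suppression over (box, score) pairs."""
    kept = []
    for box, score in sorted(scored_boxes, key=lambda bs: bs[1], reverse=True):
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept

def merge_annotations(gt_boxes, pseudo_boxes, iou_threshold=0.5):
    """Combine ground-truth boxes with scored pseudo-boxes using NMS.

    Assumption: ground-truth boxes get score 1.0 and are listed first, so a
    prediction that overlaps a ground-truth box is the one suppressed.
    """
    scored = [(box, 1.0) for box in gt_boxes] + list(pseudo_boxes)
    return [box for box, _ in nms(scored, iou_threshold)]

def make_batches(pages, k):
    """Sort pages by number of ground-truth boxes (descending), split into k batches."""
    ordered = sorted(pages, key=lambda page: len(page["boxes"]), reverse=True)
    base, extra = divmod(len(ordered), k)
    batches, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        batches.append(ordered[start:end])
        start = end
    return batches
```

In a full pipeline, iteration i would train on batches 1 through i, run the detector on the pages of the next batch, and call `merge_annotations` on each page before adding it to the training set.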
RDA: Reciprocal Distribution Alignment for Robust Semi-supervised Learning
Duan, Yue, Qi, Lei, Wang, Lei, Zhou, Luping, Shi, Yinghuan
In this work, we propose Reciprocal Distribution Alignment (RDA), a hyperparameter-free framework for semi-supervised learning (SSL) that does not depend on a confidence threshold and works with both matched (the conventional setting) and mismatched class distributions. Distribution mismatch is an often overlooked but more general SSL scenario in which the labeled and the unlabeled data do not share the same class distribution. This mismatch may prevent the model from exploiting the labeled data reliably and can drastically degrade the performance of SSL methods, a degradation that traditional distribution alignment cannot rescue. In RDA, we enforce a reciprocal alignment on the distributions of the predictions from two classifiers predicting pseudo-labels and complementary labels on the unlabeled data. These two distributions, carrying complementary information, can be used to regularize each other without any prior on the class distribution. Moreover, we theoretically show that RDA maximizes the input-output mutual information. Our approach achieves promising performance in SSL under a variety of scenarios of mismatched distributions, as well as in the conventional matched SSL setting.
The Supervised Learning Workshop: A New, Interactive Approach to Understanding Supervised Learning Algorithms, 2nd Edition (ISBN 9781800209046)
Bateman, Blaine, Jha, Ashish Ranjan, Johnston, Benjamin, Mathur, Ishita
He graduated with Special Honors in chemical engineering and later earned a certificate in quality management. He has produced syndicated research on silicon photonics and writes for the trade press and web communities. He has served Fortune 1000 and FTSE 250 companies in a variety of projects, including global market and product strategy and, most recently, deep analytics and forecasting. Following ten years in government research and management (Deputy Director, National Measurement Laboratory (US DoC NIST) and Chief, Chemical Engineering Division of NIST), Mr. Bateman worked at several start-ups in electronics and antennas, resulting in hundreds of products and several patents. Mr. Bateman led efforts to bring design and manufacturing of telematics and in-building antennas to China and Malaysia, was key in creating an Automotive Connectivity Unit in Laird, and led technical diligence for multiple acquisitions and the creation of an Infrastructure Antenna Unit.
Why Accuracy Is Not A Good Metric For Imbalanced Data
Originally published on Towards AI. Classification, in machine learning, is a supervised learning task in which data points are assigned to one of several classes.
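A small worked example shows why accuracy misleads on imbalanced data. The numbers below are made up for illustration: 5 positive cases out of 100, and a trivial classifier that always predicts the majority (negative) class. The helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the classifier found."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0

# Illustrative data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

The classifier scores 95% accuracy while missing every positive case, which is why class-sensitive metrics such as recall, precision, or F1 are usually reported alongside accuracy when classes are imbalanced.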
Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech
Cho, Jaejin, Villalba, Jesús, Moro-Velazquez, Laureano, Dehak, Najim
In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representation, e.g., wav2vec, can be used as utterance-level representation with pooling, but the models are usually large. There are also SSL techniques to learn utterance-level representation. One of the most successful is a contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (anchor). However, without labels, this does not ensure that all the negative samples belong to classes different from the anchor class. This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings. We adapted DIstillation with NO labels (DINO) from computer vision to speech. Unlike contrastive methods, DINO does not require negative sampling. We compared DINO to x-vector trained in a supervised manner. When transferred to downstream tasks (speaker verification, speech emotion recognition (SER), and Alzheimer's disease detection), DINO outperformed x-vector. We studied the influence of several aspects during transfer learning, such as dividing the fine-tuning process into steps, chunk lengths, and augmentation. During fine-tuning, tuning the last affine layers first and then the whole network surpassed fine-tuning all at once. Using shorter chunk lengths, although they generate more diverse inputs, did not necessarily improve performance, implying that speech segments of at least a certain length are required for better performance in each application. Augmentation was helpful in SER.
Comparison and Analysis of New Curriculum Criteria for End-to-End ASR
Karakasidis, Georgios, Grósz, Tamás, Kurimo, Mikko
It is common knowledge that the quantity and quality of the training data play a significant role in the creation of a good machine learning model. In this paper, we take it one step further and demonstrate that the way the training examples are arranged is also of crucial importance. Curriculum Learning is built on the observation that organized and structured assimilation of knowledge can enable faster training and better comprehension. When humans learn to speak, they first try to utter basic phones and then gradually move towards more complex structures such as words and sentences. We employ this methodology in the context of Automatic Speech Recognition. We hypothesize that end-to-end models can achieve better performance when provided with an organized training set consisting of examples that exhibit an increasing level of difficulty (i.e. a curriculum). To impose structure on the training set and to define the notion of an easy example, we explored multiple scoring functions that either use feedback from an external neural network or incorporate feedback from the model itself. Empirical results show that with different curricula we can balance the training times and the network's performance.
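The core mechanics of a curriculum can be sketched in a few lines: a scoring function assigns each training example a difficulty, the examples are sorted from easy to hard, and training starts on the easiest part before the rest is added. This is an illustration only. The abstract does not specify its scoring functions in detail, so the transcript-length score used here is a hypothetical stand-in, and the linear pacing schedule and the `start_fraction` parameter are assumptions.

```python
import math

def curriculum_order(examples, score_fn):
    """Sort examples from easiest to hardest according to score_fn."""
    return sorted(examples, key=score_fn)

def available_examples(ordered, epoch, total_epochs, start_fraction=0.25):
    """Linear pacing: expose the easiest start_fraction of the data at epoch 0
    and the full training set by the final epoch."""
    if total_epochs <= 1:
        return ordered
    progress = epoch / (total_epochs - 1)
    fraction = start_fraction + (1.0 - start_fraction) * progress
    count = max(1, math.ceil(fraction * len(ordered)))
    return ordered[:count]

# Hypothetical difficulty score: longer transcripts are assumed to be harder.
def transcript_length(utterance):
    return len(utterance["text"].split())

if __name__ == "__main__":
    utterances = [
        {"text": "good morning everyone"},
        {"text": "hi"},
        {"text": "a much longer and harder sentence to recognize"},
        {"text": "how are you"},
    ]
    ordered = curriculum_order(utterances, transcript_length)
    for epoch in range(4):
        subset = available_examples(ordered, epoch, total_epochs=4)
        print(epoch, [u["text"] for u in subset])
```

Swapping `transcript_length` for a score derived from an external network or from the model's own loss changes the curriculum without touching the ordering or pacing code.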