Collaborating Authors

 Yegneswaran, Vinod


Augmenting Rule-based DNS Censorship Detection at Scale with Machine Learning

arXiv.org Artificial Intelligence

The proliferation of global censorship has led to the development of a plethora of measurement platforms to monitor and expose it. Censorship of the domain name system (DNS) is a key mechanism used across different countries. It is currently detected by applying heuristics to samples of DNS queries and responses (probes) for specific destinations. These heuristics, however, are platform-specific and have proven brittle when censors change their blocking behavior, necessitating a more reliable automated process for detecting censorship. In this paper, we explore how machine learning (ML) models can (1) help streamline the detection process, (2) improve the potential of using large-scale datasets for censorship detection, and (3) discover new censorship instances and blocking signatures missed by existing heuristic methods. Our study shows that supervised models, trained using expert-derived labels on instances of known anomalies and possible censorship, can learn the detection heuristics employed by different measurement platforms. More crucially, we find that unsupervised models, trained solely on uncensored instances, can identify new instances and variations of censorship missed by existing heuristics. Moreover, both methods demonstrate the capability to uncover a substantial number of new DNS blocking signatures, i.e., injected fake IP addresses overlooked by existing heuristics. These results are underpinned by an important methodological finding: comparing the outputs of models trained using the same probes but with labels arising from independent processes allows us to more reliably detect cases of censorship in the absence of ground-truth labels of censorship.
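As a rough illustration of the unsupervised direction described above, the sketch below fits a toy detector on uncensored probes only and flags probes whose features deviate sharply. The feature names, values, and threshold are invented for illustration; this is not the paper's model.

```python
import statistics

# Hypothetical numeric features per DNS probe (illustrative only):
# (response size in bytes, TTL, answer count).
benign = [
    (120, 300, 1), (118, 290, 1), (125, 310, 1),
    (119, 305, 1), (122, 295, 1), (121, 300, 1),
]

def fit(samples):
    """Learn per-feature mean and stdev from uncensored probes only."""
    cols = list(zip(*samples))
    return [(statistics.mean(c), statistics.stdev(c)) for c in cols]

def is_anomalous(probe, model, z=3.0):
    """Flag a probe if any feature deviates by more than z stdevs."""
    return any(abs(x - m) > z * max(s, 1e-9)
               for x, (m, s) in zip(probe, model))

model = fit(benign)
print(is_anomalous((121, 298, 1), model))  # close to the benign profile
print(is_anomalous((40, 5, 1), model))     # e.g. a tiny injected response
```

The key property mirrored here is that no censored examples are needed at training time, so the detector can surface blocking behaviors that no existing heuristic encodes.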


Data Masking with Privacy Guarantees

arXiv.org Machine Learning

We study the problem of data release with privacy, where data is made available with privacy guarantees while keeping its usability as high as possible; this is important in health care and other domains with sensitive data. In particular, we propose a method of masking the private data with a privacy guarantee while ensuring that a classifier trained on the masked data is similar to the classifier trained on the original data, so as to maintain usability. We analyze the theoretical risks of the proposed method and of the traditional input perturbation method. Results show that the proposed method achieves lower risk than input perturbation, especially as the number of training samples grows large. We illustrate the effectiveness of the proposed data masking method for privacy-sensitive learning on 12 benchmark datasets.
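For context, the traditional input perturbation baseline that the abstract compares against can be sketched with the generic Laplace mechanism from differential privacy. This is not the paper's proposed masking method, and the parameter names are illustrative.

```python
import random

def perturb(record, epsilon, sensitivity=1.0, seed=0):
    """Input perturbation: add Laplace(sensitivity/epsilon) noise per feature.

    Laplace noise is sampled as the difference of two exponentials with the
    same rate; smaller epsilon means stronger privacy and more noise.
    """
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [x + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
            for x in record]

masked = perturb([5.0, 10.0, 15.0], epsilon=1.0)
```

Because every feature is noised independently of the downstream task, a classifier trained on such output can degrade badly, which is the usability gap the masking method above is designed to close.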


Trusted Neural Networks for Safety-Constrained Autonomous Control

arXiv.org Artificial Intelligence

We propose Trusted Neural Network (TNN) models, which are deep neural network models that satisfy safety constraints critical to the application domain. We investigate different mechanisms for incorporating rule-based knowledge in the form of first-order logic constraints into a TNN model, where rules that encode safety are accompanied by weights indicating their relative importance. This framework allows the TNN model to learn from knowledge available in the form of data as well as logical rules. We propose multiple approaches for solving this problem: (a) a multi-headed model structure that allows a trade-off between satisfying logical constraints and fitting training data in a unified training framework, and (b) creating a constrained optimization problem and solving it in dual formulation by posing a new constrained loss function and using a proximal gradient descent algorithm. We demonstrate the efficacy of our TNN framework through experiments using the open-source TORCS 3D simulator for self-driving cars. Experiments using our first approach of a multi-headed TNN model, on a dataset generated by a customized version of TORCS, show that (1) adding safety constraints to a neural network model results in increased performance and safety, and (2) the improvement increases with the importance of the safety constraints. Experiments were also performed using the second approach, the proximal algorithm for constrained optimization; they demonstrate that (1) the overall TNN model satisfies the constraints even when the training data violates some of the constraints, and (2) the proximal gradient descent algorithm on the constrained objective converges faster than its unconstrained counterpart.
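The idea of weighting a safety rule against the data-fitting loss can be sketched in miniature: fit a single scalar (a hypothetical "target speed") to data that violates a speed-limit rule, with a weighted penalty pulling the solution back toward the constraint. All names and numbers are invented for illustration; the actual TNN formulation uses first-order logic constraints over a deep network.

```python
SPEED_LIMIT = 50.0
data = [60.0, 62.0, 58.0]   # training targets that all violate the rule
rule_weight = 10.0          # relative importance of the safety rule

def loss_grad(w):
    # Gradient of mean squared error plus a weighted one-sided
    # quadratic penalty rule_weight * max(0, w - SPEED_LIMIT)^2.
    g = sum(2 * (w - y) for y in data) / len(data)
    if w > SPEED_LIMIT:
        g += rule_weight * 2 * (w - SPEED_LIMIT)
    return g

w = 0.0
for _ in range(2000):
    w -= 0.01 * loss_grad(w)

# The learned w settles between the data mean (60) and the limit (50),
# pulled closer to the limit as rule_weight grows.
print(round(w, 3))
```

This mirrors property (1) reported above: even though every training example violates the rule, the weighted penalty keeps the learned behavior near the constraint.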


ATOL: A Framework for Automated Analysis and Categorization of the Darkweb Ecosystem

AAAI Conferences

We present a framework for automated analysis and categorization of .onion websites in the darkweb to facilitate analyst situational awareness of new content that emerges from this dynamic landscape. Over the last two years, our team has developed a large-scale darkweb crawling infrastructure called OnionCrawler that acquires new onion domains on a daily basis, and crawls and indexes millions of pages from these new and previously known .onion sites. It stores this data into a research repository designed to help better understand Tor’s hidden service ecosystem. The analysis component of our framework is called Automated Tool for Onion Labeling (ATOL), which introduces a two-stage thematic labeling strategy: (1) it learns descriptive and discriminative keywords for different categories, and (2) it uses these terms to map onion site content to a set of thematic labels. We also present empirical results of ATOL and our ongoing experimentation with it, as we have gained experience applying it to the entirety of our darkweb repository, now over 70 million indexed pages. We find that ATOL can perform site-level thematic label assignment more accurately than keyword-based schemes developed by domain experts: we expand the analyst-provided keywords using an automatic keyword discovery algorithm, and gain 12% in accuracy by using a machine learning classification model. We also show how ATOL can discover categories on previously unlabeled onions and discuss applications of ATOL in supporting various analyses and investigations of the darkweb.
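The keyword-to-label mapping in stage (2) can be sketched as a simple scoring scheme: count how many category keywords appear on a page and assign the best-scoring label. The categories and keywords below are invented placeholders, not ATOL's learned terms, which come from its automatic keyword discovery stage.

```python
# Illustrative category lexicons (placeholders, not ATOL's learned keywords).
CATEGORY_KEYWORDS = {
    "marketplace": {"vendor", "escrow", "listing"},
    "forum": {"thread", "reply", "moderator"},
}

def label_page(text):
    """Assign the category with the most keyword hits, or 'unlabeled'."""
    tokens = set(text.lower().split())
    scores = {cat: len(tokens & kws)
              for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unlabeled"

print(label_page("Trusted vendor with escrow and a new listing"))
```

ATOL's reported gain comes from replacing exactly this kind of hand-built lexicon match with discovered keywords and a learned classifier.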