AITopics

2003.06428

Country:

North America > United States > Oregon > Washington County > Hillsboro (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > District of Columbia > Washington (0.04)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Machine Learning, Mar-13-2020

Automating Botnet Detection with Graph Neural Networks

Zhou, Jiawei, Xu, Zhiying, Rush, Alexander M., Yu, Minlan

Botnets are now a major source of many network attacks, such as DDoS attacks and spam. However, most traditional detection methods rely heavily on heuristically designed multi-stage detection criteria. In this paper, we consider the neural network design challenges of using modern deep learning techniques to learn policies for botnet detection automatically. To generate training data, we synthesize botnet connections with different underlying communication patterns overlaid on large-scale real networks as datasets. To capture the important hierarchical structure of centralized botnets and the fast-mixing structure of decentralized botnets, we tailor graph neural networks (GNNs) to detect the properties of these structures. Experimental results show that GNNs are better able to capture botnet structure than previous non-learning methods when trained with appropriate data, and that deeper GNNs are crucial for learning difficult botnet topologies. We believe our data and studies can be useful for both the network security and graph learning communities.

botnet, graph, topology, (12 more...)

2003.06344

Country:

North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)

Carvalho, Flavio, Guedes, Gustavo Paiva

TF-IDFC-RF: A Novel Supervised Term Weighting Scheme

arXiv.org Machine Learning, Mar-12-2020

Sentiment Analysis is a branch of Affective Computing usually considered a binary classification task. In this line of reasoning, Sentiment Analysis can be applied in several contexts to classify the attitude expressed in text samples, for example, movie reviews and sarcasm, among others. A common approach to representing text samples is to use the Vector Space Model to compute numerical feature vectors consisting of the weights of terms. The most popular term weighting scheme is TF-IDF (Term Frequency - Inverse Document Frequency). It is an Unsupervised Weighting Scheme (UWS) since it does not consider the class information in the weighting of terms. Apart from that, there are Supervised Weighting Schemes (SWS), which consider the class information in the term weighting calculation. Several SWS have been recently proposed, demonstrating better results than TF-IDF. In this scenario, this work presents a comparative study of different term weighting schemes and proposes a novel supervised term weighting scheme named TF-IDFC-RF (Term Frequency - Inverse Document Frequency in Class - Relevance Frequency). The effectiveness of TF-IDFC-RF is validated with SVM (Support Vector Machine) and NB (Naive Bayes) classifiers on four commonly used Sentiment Analysis datasets. TF-IDFC-RF outperforms all other weighting schemes and achieves F1 results of more than 99.9% on all datasets with the SVM classifier.
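As a point of reference for the unsupervised baseline mentioned above, the following is a minimal sketch of plain TF-IDF, assuming raw term counts for TF and an unsmoothed `log(N / df)` for IDF. It does not implement TF-IDFC-RF, whose formula is not given in this abstract; the function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = raw term count * log(N / document frequency).
    Class labels are never used, which is what makes the scheme unsupervised.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus of tokenized review snippets.
docs = [
    "great movie great cast".split(),
    "boring movie".split(),
    "great fun".split(),
]
weights = tf_idf(docs)
```

A supervised scheme would differ by letting the per-class distribution of each term influence its weight, which this sketch deliberately leaves out.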

dataset, term weighting scheme, weighting scheme, (12 more...)

2003.07193

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
South America > Argentina > Patagonia > Río Negro Province > Viedma (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry:

Media > Film (0.35)
Leisure & Entertainment (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.86)

Sivaguru, Raaghavi, Peck, Jonathan, Olumofin, Femi, Nascimento, Anderson, De Cock, Martine

Inline Detection of DGA Domains Using Side Information

arXiv.org Machine Learning, Mar-12-2020

Malware applications typically use a command and control (C&C) server to manage bots to perform malicious activities. Domain Generation Algorithms (DGAs) are popular methods for generating pseudo-random domain names that can be used to establish communication between an infected bot and the C&C server. In recent years, machine learning based systems have been widely used to detect DGAs. There are several well-known state-of-the-art classifiers in the literature that can detect DGA domain names in real-time applications with high predictive performance. However, these DGA classifiers are highly vulnerable to adversarial attacks in which adversaries purposely craft domain names to evade DGA detection classifiers. In our work, we focus on hardening DGA classifiers against adversarial attacks. To this end, we train and evaluate state-of-the-art deep learning and random forest (RF) classifiers for DGA detection using side information that is harder for adversaries to manipulate than the domain name itself. Additionally, the side information features are selected such that they are easily obtainable in practice to perform inline DGA detection. The performance and robustness of these models are assessed by exposing them to one day of real-traffic data as well as domains generated by adversarial attack algorithms. We found that the DGA classifiers that rely on both the domain name and side information have high performance and are more robust against adversaries.

classifier, detection, domain name, (13 more...)

2003.05703

Country:

North America > United States > Washington > Pierce County > Tacoma (0.04)
Europe > Belgium > Flanders (0.04)
Asia > Middle East > Iran > East Azerbaijan Province > Tabriz (0.04)

Genre:

Research Report (0.50)
Overview (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Zhang, Haoran, Lu, Amy X., Abdalla, Mohamed, McDermott, Matthew, Ghassemi, Marzyeh

Hurtful Words: Quantifying Biases in Clinical Contextual Word Embeddings

arXiv.org Machine Learning, Mar-11-2020

In this work, we examine the extent to which embeddings may encode marginalized populations differently, and how this may lead to a perpetuation of biases and worsened performance on clinical tasks. We pretrain deep embedding models (BERT) on medical notes from the MIMIC-III hospital dataset, and quantify potential disparities using two approaches. First, we identify dangerous latent relationships that are captured by the contextual word embeddings using a fill-in-the-blank method with text from real clinical notes and a log probability bias score quantification. Second, we evaluate performance gaps across different definitions of fairness on over 50 downstream clinical prediction tasks that include detection of acute and chronic conditions. We find that classifiers trained from BERT representations exhibit statistically significant differences in performance, often favoring the majority group with regard to gender, language, ethnicity, and insurance status. Finally, we explore shortcomings of using adversarial debiasing to obfuscate subgroup information in contextual word embeddings, and recommend best practices for such deep embedding models in clinical settings.

arxiv, bert model, significant difference, (16 more...)

2003.11515

Country:

North America > Canada > Ontario > Toronto (0.30)
North America > United States > North Carolina (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre:

Research Report > New Finding (0.46)
Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Health Care Providers & Services (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Data Science > Data Mining (0.67)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)

#artificialintelligence, Mar-10-2020, 04:29:03 GMT

Classical Statistics and Statistical Learning in Imaging Neuroscience

Single subject prediction of brain disorders in neuroimaging: promises and pitfalls.

algorithm, hypothesis, inference, (16 more...)

#artificialintelligence

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
(12 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Instructional Material (1.00)
Overview (0.67)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(4 more...)

Multivariate Functional Regression via Nested Reduced-Rank Regularization

Liu, Xiaokang, Ma, Shujie, Chen, Kun

We propose a nested reduced-rank regression (NRRR) approach for fitting regression models with multivariate functional responses and predictors, to achieve tailored dimension reduction and facilitate interpretation/visualization of the resulting functional model. Our approach is based on a two-level low-rank structure imposed on the functional regression surfaces. A global low-rank structure identifies a small set of latent principal functional responses and predictors that drives the underlying regression association. A local low-rank structure then controls the complexity and smoothness of the association between the principal functional responses and predictors. Through a basis expansion approach, the functional problem boils down to an interesting integrated matrix approximation task, where the blocks or submatrices of an integrated low-rank matrix share some common row space and/or column space. An iterative algorithm with a convergence guarantee is developed. We establish the consistency of NRRR and also show through non-asymptotic analysis that it can achieve at least a comparable error rate to that of the reduced-rank regression. Simulation studies demonstrate the effectiveness of NRRR. We apply NRRR to an electricity demand problem, relating the trajectories of daily electricity consumption to those of daily temperatures.

estimation, matrix, predictor, (14 more...)

2003.04786

Country:

Oceania > Australia > South Australia (0.04)
North America > United States > New York (0.04)
North America > United States > Connecticut (0.04)
North America > United States > California > Riverside County > Riverside (0.04)

Genre: Research Report (0.82)

Industry: Energy > Power Industry (0.69)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.66)

Shalit, Nadav, Fire, Michael, Elia, Eran Ben

Imputing Missing Boarding Stations With Machine Learning Methods

With the increase in population densities and environmental awareness, public transport has become an important aspect of urban life. Consequently, large quantities of transportation data are generated, and mining data from smart card use has become a standardized method to understand the travel habits of passengers. Public transport datasets, however, often lack data integrity; boarding stop information may be missing due to either imperfect acquisition processes or inadequate reporting. As a result, large quantities of observations and even complete sections of cities might be absent from the smart card database. We have developed a supervised machine learning method to impute missing boarding stops based on ordinal classification. In addition, we present a new metric, Pareto Accuracy, to evaluate algorithms where classes have an ordinal nature. Results are based on a case study in the Israeli city of Beer Sheva for one month of data. We show that our proposed method significantly outperforms current imputation methods and can improve the accuracy and usefulness of large-scale transportation data.

card data, dataset, smart card data, (15 more...)

2003.05285

Country:

Asia > Middle East > Israel > Southern District > Beer-Sheva (0.25)
North America > United States (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Asia > China (0.04)

Genre:

Overview (0.93)
Research Report > New Finding (0.46)

Industry:

Transportation > Infrastructure & Services (1.00)
Information Technology > Security & Privacy (0.95)
Transportation > Ground > Road (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Xue, Songkai, Yurochkin, Mikhail, Sun, Yuekai

Auditing ML Models for Individual Bias and Unfairness

We consider the task of auditing ML models for individual bias/unfairness. We formalize the task in an optimization problem and develop a suite of inferential tools for the optimal value. Our tools permit us to obtain asymptotic confidence intervals and hypothesis tests that cover the target/control the Type I error rate exactly. To demonstrate the utility of our tools, we use them to reveal the gender and racial biases in Northpointe's COMPAS recidivism prediction instrument.

auditor, fairness, ml model, (12 more...)

2003.05048

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Michigan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Industry: Law (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Miron, Marius, Tolan, Songül, Gómez, Emilia, Castillo, Carlos

Addressing multiple metrics of group fairness in data-driven decision making

The Fairness, Accountability, and Transparency in Machine Learning (FAT-ML) literature proposes a varied set of group fairness metrics to measure discrimination against socio-demographic groups that are characterized by a protected feature, such as gender or race. A given system can be deemed either fair or unfair depending on the choice of the metric. Several metrics have been proposed, some of them incompatible with each other. We present here a framework to navigate the tensions between various group-wise metrics and to study fairness in data-driven decision making without the constraint of choosing a single metric. We do so empirically, by observing that several of these metrics cluster together in two or three main clusters for the same groups and machine learning methods. In addition, we propose a robust way to visualize multidimensional fairness in two dimensions through a Principal Component Analysis (PCA) of the group fairness metrics. Experimental results on multiple datasets show that the PCA decomposition explains the variance between the metrics with one to three components.
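To illustrate how the verdict can depend on the metric chosen, here is a minimal sketch that computes two widely used group fairness metrics, statistical parity difference and equal opportunity difference, on the same predictions. The metric selection, the function name `group_fairness`, and the toy data are assumptions for illustration; the abstract does not list which metrics the paper uses, and the PCA step is not shown.

```python
def rate(values):
    """Mean of a list of 0/1 values; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def group_fairness(y_true, y_pred, group):
    """Compare group 1 against group 0 on two group fairness metrics."""
    pos_rate = {}
    tpr = {}
    for g in (0, 1):
        # Predictions for every member of group g.
        preds = [p for p, a in zip(y_pred, group) if a == g]
        # Predictions for members of group g whose true label is positive.
        preds_on_pos = [p for t, p, a in zip(y_true, y_pred, group)
                        if a == g and t == 1]
        pos_rate[g] = rate(preds)
        tpr[g] = rate(preds_on_pos)
    return {
        "statistical_parity_difference": pos_rate[1] - pos_rate[0],
        "equal_opportunity_difference": tpr[1] - tpr[0],
    }

# Hypothetical labels, predictions, and binary protected attribute.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
metrics = group_fairness(y_true, y_pred, group)
```

On this toy data both groups receive positive predictions at the same rate, so the statistical parity difference is zero, while the true-positive rates differ (2/3 versus 1), so the equal opportunity difference is not.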

fairness, group fairness, metric, (15 more...)

2003.04794

Country:

Europe > Spain > Catalonia (0.04)
North America > United States > Florida > Broward County (0.04)
Asia > Taiwan (0.04)
North America > United States > California (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry:

Law (1.00)
Banking & Finance > Credit (0.93)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)