Accuracy
Machine Learning Approach on Multiclass Classification of Internet Firewall Log Files
Rahman, Md Habibur, Islam, Taminul, Rana, Md Masum, Tasnim, Rehnuma, Mona, Tanzina Rahman, Sakib, Md. Mamun
Firewalls are critical components in securing communication networks by screening all incoming (and occasionally exiting) data packets. Filtering is carried out by comparing incoming data packets to a set of rules designed to prevent malicious code from entering the network. To regulate the flow of data packets entering and leaving a network, an Internet firewall keeps a track of all activity. While the primary function of log files is to aid in troubleshooting and diagnostics, the information they contain is also very relevant to system audits and forensics. Firewalls primary function is to prevent malicious data packets from being sent. In order to better defend against cyberattacks and understand when and how malicious actions are influencing the internet, it is necessary to examine log files. As a result, the firewall decides whether to 'allow,' 'deny,' 'drop,' or 'reset-both' the incoming and outgoing packets. In this research, we apply various categorization algorithms to make sense of data logged by a firewall device. Harmonic mean F1 score, recall, and sensitivity measurement data with a 99% accuracy score in the random forest technique are used to compare the classifier's performance. To be sure, the proposed characteristics did significantly contribute to enhancing the firewall classification rate, as seen by the high accuracy rates generated by the other methods.
Noisy Positive-Unlabeled Learning with Self-Training for Speculative Knowledge Graph Reasoning
Wang, Ruijie, Li, Baoyu, Lu, Yichen, Sun, Dachun, Li, Jinning, Yan, Yuchen, Liu, Shengzhong, Tong, Hanghang, Abdelzaher, Tarek F.
This paper studies speculative reasoning task on real-world knowledge graphs (KG) that contain both \textit{false negative issue} (i.e., potential true facts being excluded) and \textit{false positive issue} (i.e., unreliable or outdated facts being included). State-of-the-art methods fall short in the speculative reasoning ability, as they assume the correctness of a fact is solely determined by its presence in KG, making them vulnerable to false negative/positive issues. The new reasoning task is formulated as a noisy Positive-Unlabeled learning problem. We propose a variational framework, namely nPUGraph, that jointly estimates the correctness of both collected and uncollected facts (which we call \textit{label posterior}) and updates model parameters during training. The label posterior estimation facilitates speculative reasoning from two perspectives. First, it improves the robustness of a label posterior-aware graph encoder against false positive links. Second, it identifies missing facts to provide high-quality grounds of reasoning. They are unified in a simple yet effective self-training procedure. Empirically, extensive experiments on three benchmark KG and one Twitter dataset with various degrees of false negative/positive cases demonstrate the effectiveness of nPUGraph.
Learning Unnormalized Statistical Models via Compositional Optimization
Jiang, Wei, Qin, Jiayu, Wu, Lingyu, Chen, Changyou, Yang, Tianbao, Zhang, Lijun
Learning unnormalized statistical models (e.g., energy-based models) is computationally challenging due to the complexity of handling the partition function. To eschew this complexity, noise-contrastive estimation~(NCE) has been proposed by formulating the objective as the logistic loss of the real data and the artificial noise. However, as found in previous works, NCE may perform poorly in many tasks due to its flat loss landscape and slow convergence. In this paper, we study it a direct approach for optimizing the negative log-likelihood of unnormalized models from the perspective of compositional optimization. To tackle the partition function, a noise distribution is introduced such that the log partition function can be written as a compositional function whose inner function can be estimated with stochastic samples. Hence, the objective can be optimized by stochastic compositional optimization algorithms. Despite being a simple method, we demonstrate that it is more favorable than NCE by (1) establishing a fast convergence rate and quantifying its dependence on the noise distribution through the variance of stochastic estimators; (2) developing better results for one-dimensional Gaussian mean estimation by showing our objective has a much favorable loss landscape and hence our method enjoys faster convergence; (3) demonstrating better performance on multiple applications, including density estimation, out-of-distribution detection, and real image generation.
Accurate Measures of Vaccination and Concerns of Vaccine Holdouts from Web Search Logs
Chang, Serina, Fourney, Adam, Horvitz, Eric
To design effective vaccine policies, policymakers need detailed data about who has been vaccinated, who is holding out, and why. However, existing data in the US are insufficient: reported vaccination rates are often delayed or missing, and surveys of vaccine hesitancy are limited by high-level questions and self-report biases. Here, we show how large-scale search engine logs and machine learning can be leveraged to fill these gaps and provide novel insights about vaccine intentions and behaviors. First, we develop a vaccine intent classifier that can accurately detect when a user is seeking the COVID-19 vaccine on search. Our classifier demonstrates strong agreement with CDC vaccination rates, with correlations above 0.86, and estimates vaccine intent rates to the level of ZIP codes in real time, allowing us to pinpoint more granular trends in vaccine seeking across regions, demographics, and time. To investigate vaccine hesitancy, we use our classifier to identify two groups, vaccine early adopters and vaccine holdouts. We find that holdouts, compared to early adopters matched on covariates, are 69% more likely to click on untrusted news sites. Furthermore, we organize 25,000 vaccine-related URLs into a hierarchical ontology of vaccine concerns, and we find that holdouts are far more concerned about vaccine requirements, vaccine development and approval, and vaccine myths, and even within holdouts, concerns vary significantly across demographic groups. Finally, we explore the temporal dynamics of vaccine concerns and vaccine seeking, and find that key indicators emerge when individuals convert from holding out to preparing to accept the vaccine.
Towards Fair and Explainable AI using a Human-Centered AI Approach
The rise of machine learning (ML) is accompanied by several high-profile cases that have stressed the need for fairness, accountability, explainability and trust in ML systems. The existing literature has largely focused on fully automated ML approaches that try to optimize for some performance metric. However, human-centric measures like fairness, trust, explainability, etc. are subjective in nature, context-dependent, and might not correlate with conventional performance metrics. To deal with these challenges, we explore a human-centered AI approach that empowers people by providing more transparency and human control. In this dissertation, we present 5 research projects that aim to enhance explainability and fairness in classification systems and word embeddings. The first project explores the utility/downsides of introducing local model explanations as interfaces for machine teachers (crowd workers). Our study found that adding explanations supports trust calibration for the resulting ML model and enables rich forms of teaching feedback. The second project presents D-BIAS, a causality-based human-in-the-loop visual tool for identifying and mitigating social biases in tabular datasets. Apart from fairness, we found that our tool also enhances trust and accountability. The third project presents WordBias, a visual interactive tool that helps audit pre-trained static word embeddings for biases against groups, such as females, or subgroups, such as Black Muslim females. The fourth project presents DramatVis Personae, a visual analytics tool that helps identify social biases in creative writing. Finally, the last project presents an empirical study aimed at understanding the cumulative impact of multiple fairness-enhancing interventions at different stages of the ML pipeline on fairness, utility and different population groups. We conclude by discussing some of the future directions.
Izindaba-Tindzaba: Machine learning news categorisation for Long and Short Text for isiZulu and Siswati
Madodonga, Andani, Marivate, Vukosi, Adendorff, Matthew
Local/Native South African languages are classified as low-resource languages. As such, it is essential to build the resources for these languages so that they can benefit from advances in the field of natural language processing. In this work, the focus was to create annotated news datasets for the isiZulu and Siswati native languages based on news topic classification tasks and present the findings from these baseline classification models. Due to the shortage of data for these native South African languages, the datasets that were created were augmented and oversampled to increase data size and overcome class classification imbalance. In total, four different classification models were used namely Logistic regression, Naive bayes, XGBoost and LSTM. These models were trained on three different word embeddings namely Bag-Of-Words, TFIDF and Word2vec. The results of this study showed that XGBoost, Logistic Regression and LSTM, trained from Word2vec performed better than the other combinations.
No Free Lunch: The Hazards of Over-Expressive Representations in Anomaly Detection
Reiss, Tal, Cohen, Niv, Hoshen, Yedid
Anomaly detection methods, powered by deep learning, have recently been making significant progress, mostly due to improved representations. It is tempting to hypothesize that anomaly detection can improve indefinitely by increasing the scale of our networks, making their representations more expressive. In this paper, we provide theoretical and empirical evidence to the contrary. In fact, we empirically show cases where very expressive representations fail to detect even simple anomalies when evaluated beyond the well-studied object-centric datasets. To investigate this phenomenon, we begin by introducing a novel theoretical toy model for anomaly detection performance. The model uncovers a fundamental trade-off between representation sufficiency and over-expressivity. It provides evidence for a no-free-lunch theorem in anomaly detection stating that increasing representation expressivity will eventually result in performance degradation. Instead, guidance must be provided to focus the representation on the attributes relevant to the anomalies of interest. We conduct an extensive empirical investigation demonstrating that state-of-the-art representations often suffer from over-expressivity, failing to detect many types of anomalies. Our investigation demonstrates how this over-expressivity impairs image anomaly detection in practical settings. We conclude with future directions for mitigating this issue.
Explainable AI and Machine Learning Towards Human Gait Deterioration Analysis
Gait analysis, an expanding research area, employs non invasive sensors and machine learning techniques for a range of applicatio ns. In this study, we concentrate on gait analysis for detecting cognitive decline in Parkinson's disease (PD) and under dual task conditions. Using convolutional neural networks (CNNs) and explainable machine learning, we objectively analyze gait data and associate findings with clinically relevant biomarkers. This is accomplished by connecting machine learning outputs to decisions based on human visual observations or derived quantitative gait parameters, which are tested and routinely implemented in curr ent healthcare practice. Our analysis of gait deterioration due to cognitive decline in PD enables robust results using the proposed methods for assessing PD severity from ground reaction force (GRF) data. We achieved classification accuracies of 98% F1 sc ores for each PhysioNet.org dataset and 95.5% F1 scores for the combined PhysioNet dataset. By linking clinically observable features to the model outputs, we demonstrate the impact of PD severity on gait. Furthermore, we explore the significance of cognit ive load in healthy gait analysis, resulting in robust classification accuracies of 100% F1 scores for subject identity verification. We also identify weaker features crucial for model predictions using Layer Wise Relevance Propagation. A notable finding o f this study reveals that cognitive deterioration's effect on gait influences body balance and foot landing/lifting dynamics in both classification cases: cognitive load in healthy gait and cognitive decline in PD gait.
On building machine learning pipelines for Android malware detection: a procedural survey of practices, challenges and opportunities
Koushki, Masoud Mehrabi, AbuAlhaol, Ibrahim, Raju, Anandharaju Durai, Zhou, Yang, Giagone, Ronnie Salvador, Shengqiang, Huang
As the smartphone market leader, Android has been a prominent target for malware attacks. The number of malicious applications (apps) identified for it has increased continually over the past decade, creating an immense challenge for all parties involved. For market holders and researchers, in particular, the large number of samples has made manual malware detection unfeasible, leading to an influx of research that investigate Machine Learning (ML) approaches to automate this process. However, while some of the proposed approaches achieve high performance, rapidly evolving Android malware has made them unable to maintain their accuracy over time. This has created a need in the community to conduct further research, and build more flexible ML pipelines. Doing so, however, is currently hindered by a lack of systematic overview of the existing literature, to learn from and improve upon the existing solutions. Existing survey papers often focus only on parts of the ML process (e.g., data collection or model deployment), while omitting other important stages, such as model evaluation and explanation. In this paper, we address this problem with a review of 42 highly-cited papers, spanning a decade of research (from 2011 to 2021). We introduce a novel procedural taxonomy of the published literature, covering how they have used ML algorithms, what features they have engineered, which dimensionality reduction techniques they have employed, what datasets they have employed for training, and what their evaluation and explanation strategies are. Drawing from this taxonomy, we also identify gaps in knowledge and provide ideas for improvement and future work.
Imbalanced Multi-label Classification for Business-related Text with Moderately Large Label Spaces
Arslan, Muhammad, Cruz, Christophe
In this study, we compared the performance of four different methods for multi-label text classification using a specific imbalanced business dataset. The four methods we evaluated were fine-tuned BERT, Binary Relevance, Classifier Chains, and Label Powerset. The results show that fine-tuned BERT outperforms the other three methods by a significant margin, achieving high values of accuracy, F1-Score, Precision, and Recall. Binary Relevance also performs well on this dataset, while Classifier Chains and Label Powerset demonstrate relatively poor performance. These findings highlight the effectiveness of fine-tuned BERT for multi-label text classification tasks, and suggest that it may be a useful tool for businesses seeking to analyze complex and multifaceted texts.