 Accuracy


Shell Language Processing: Machine Learning for Security Intrusion Detection with Linux auditd

#artificialintelligence

Coverage of Machine Learning (ML) algorithms -- in industry, tutorials, and courses -- is heavily biased towards building the ML models themselves. From our point of view, however, the data preprocessing step (i.e., transforming textual system logging to numerical arrays that capture valuable insights from the data) poses the greatest psychological and epistemic gap for security engineers and analysts trying to weaponize their data. There are enormous log collection hubs that lack the qualitative analytics needed to derive visibility from the acquired data. Terabytes of logs are often collected to perform only basic analytics (e.g., basic signature-based rules) and are intended to be used in an ad hoc, reactive manner -- if an investigation is needed. Many valuable inferences can be acquired by defining manual heuristics on top of this data.
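The preprocessing step described above, turning textual log entries into numerical arrays, can be illustrated with a minimal sketch. It assumes the shell command line has already been extracted from each log record and uses a plain bag-of-tokens count encoding. The function names (`tokenize`, `build_vocabulary`, `vectorize`) and the operator-splitting regular expression are hypothetical choices for illustration, not the authors' implementation.

```python
import re
from collections import Counter

# Assumption: shell operators (|, &, ;, <, >) are kept as separate tokens,
# everything else is split on whitespace. Real shell parsing is more involved.
TOKEN_PATTERN = re.compile(r"[|&;<>]+|[^\s|&;<>]+")


def tokenize(command):
    """Split one shell command line into a list of tokens."""
    return TOKEN_PATTERN.findall(command)


def build_vocabulary(commands):
    """Assign each distinct token an index, in order of first appearance."""
    vocab = {}
    for command in commands:
        for token in tokenize(command):
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(command, vocab):
    """Encode a command as token counts; unknown tokens are ignored."""
    vector = [0] * len(vocab)
    for token, count in Counter(tokenize(command)).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


if __name__ == "__main__":
    history = ["cat /etc/passwd | grep root", "ls -la /tmp"]
    vocab = build_vocabulary(history)
    print(vectorize("cat /tmp | cat", vocab))
```

Count vectors of this kind can be fed to any standard classifier or anomaly detector; richer encodings (TF-IDF, hashing) follow the same tokenize-then-encode pattern.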


Multi-level Adversarial Spatio-temporal Learning for Footstep Pressure based FoG Detection

arXiv.org Artificial Intelligence

Freezing of gait (FoG) is one of the most common symptoms of Parkinson's disease, which is a neurodegenerative disorder of the central nervous system impacting millions of people around the world. To address the pressing need to improve the quality of treatment for FoG, devising a computer-aided detection and quantification tool for FoG has become increasingly important. As a non-invasive technique for collecting motion patterns, the footstep pressure sequences obtained from pressure sensitive gait mats provide a great opportunity for evaluating FoG in the clinic and potentially in the home environment. In this study, FoG detection is formulated as a sequential modelling task and a novel deep learning architecture, namely Adversarial Spatio-temporal Network (ASTN), is proposed to learn FoG patterns across multiple levels. A novel adversarial training scheme is introduced with a multi-level subject discriminator to obtain subject-independent FoG representations, which helps to reduce the over-fitting risk due to the high inter-subject variance. As a result, robust FoG detection can be achieved for unseen subjects. The proposed scheme also sheds light on improving subject-level clinical studies from other scenarios as it can be integrated with many existing deep architectures. To the best of our knowledge, this is one of the first studies of footstep pressure-based FoG detection and the approach of utilizing ASTN is the first deep neural network architecture in pursuit of subject-independent representations. Experimental results on 393 trials collected from 21 subjects demonstrate encouraging performance of the proposed ASTN for FoG detection with an AUC of 0.85.


Human Treelike Tubular Structure Segmentation: A Comprehensive Review and Future Perspectives

arXiv.org Artificial Intelligence

Various structures in human physiology follow a treelike morphology, which often expresses complexity at very fine scales. Examples of such structures are intrathoracic airways, retinal blood vessels, and hepatic blood vessels. Large collections of 2D and 3D images in which the spatial arrangement can be observed have been made available by medical imaging modalities such as magnetic resonance imaging (MRI), computed tomography (CT), optical coherence tomography (OCT), and ultrasound. Segmentation of these structures in medical imaging is of great importance since the analysis of the structure provides insights into disease diagnosis, treatment planning, and prognosis. Manual labelling of extensive data by radiologists is often time-consuming and error-prone. As a result, automated or semi-automated computational models have become a popular research field of medical imaging in the past two decades, and many have been developed to date. In this survey, we aim to provide a comprehensive review of currently publicly available datasets, segmentation algorithms, and evaluation metrics. In addition, current challenges and future research directions are discussed.


AIR-JPMC@SMM4H'22: Classifying Self-Reported Intimate Partner Violence in Tweets with Multiple BERT-based Models

arXiv.org Artificial Intelligence

This paper presents our submission for the SMM4H 2022 Shared Task on the classification of self-reported intimate partner violence on Twitter (in English). The goal of this task was to accurately determine whether the contents of a given tweet demonstrated someone reporting their own experience with intimate partner violence. The submitted system is an ensemble of five RoBERTa models, each weighted by its F1-score on the validation dataset. This system performed 13% better than the baseline and was the best performing system overall for this shared task.
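An F1-weighted ensemble of this kind can be sketched as follows. The abstract does not say what exactly is combined, so this sketch assumes each model outputs a positive-class probability per tweet and that the ensemble score is the F1-weighted average of those probabilities, thresholded at 0.5. The function name `weighted_ensemble`, the threshold, and the example numbers are all hypothetical.

```python
def weighted_ensemble(probabilities, f1_scores, threshold=0.5):
    """Combine per-model probabilities using validation F1-scores as weights.

    probabilities: one list per model, each holding the positive-class
                   probability for every tweet (same order across models).
    f1_scores:     one validation F1-score per model, used as its weight.
    Returns (combined_scores, predicted_labels).
    """
    if len(probabilities) != len(f1_scores):
        raise ValueError("need exactly one F1-score per model")
    total_weight = sum(f1_scores)
    n_items = len(probabilities[0])

    combined = []
    for i in range(n_items):
        # Weighted average of the models' probabilities for item i.
        weighted_sum = sum(w * p[i] for w, p in zip(f1_scores, probabilities))
        combined.append(weighted_sum / total_weight)

    labels = [1 if score >= threshold else 0 for score in combined]
    return combined, labels


if __name__ == "__main__":
    # Made-up numbers for two models and two tweets, for illustration only.
    scores, labels = weighted_ensemble([[0.9, 0.2], [0.6, 0.4]], [0.8, 0.4])
    print(scores, labels)
```

Weighting by validation F1 lets stronger models dominate the vote without discarding the weaker ones; weighted voting on hard labels would be an equally plausible reading of the abstract.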


Review of Time Series Forecasting Methods and Their Applications to Particle Accelerators

arXiv.org Artificial Intelligence

Particle accelerators are complex facilities that produce large amounts of structured data and have clear optimization goals as well as precisely defined control requirements. As such they are naturally amenable to data-driven research methodologies. The data from sensors and monitors inside the accelerator form multivariate time series. With fast pre-emptive approaches being highly preferred in accelerator control and diagnostics, the application of data-driven time series forecasting methods is particularly promising. This review formulates the time series forecasting problem and summarizes existing models with applications in various scientific areas. Several current and future attempts in the field of particle accelerators are introduced. The application of time series forecasting to particle accelerators has shown encouraging results and the promise for broader use, and existing problems such as data consistency and compatibility have started to be addressed.


Improving the Performance of Robust Control through Event-Triggered Learning

arXiv.org Artificial Intelligence

Robust controllers ensure stability in feedback loops designed under uncertainty but at the cost of performance. Model uncertainty in time-invariant systems can be reduced by recently proposed learning-based methods, which improve the performance of robust controllers using data. However, in practice, many systems also exhibit uncertainty in the form of changes over time, e.g., due to weight shifts or wear and tear, leading to decreased performance or instability of the learning-based controller. We propose an event-triggered learning algorithm that decides when to learn in the face of uncertainty in the LQR problem with rare or slow changes. Our key idea is to switch between robust and learned controllers. For learning, we first approximate the optimal length of the learning phase via Monte-Carlo estimations using a probabilistic model. We then design a statistical test for uncertain systems based on the moment-generating function of the LQR cost. The test detects changes in the system under control and triggers re-learning when control performance deteriorates due to system changes. We demonstrate improved performance over a robust controller baseline in a numerical example.


Protein language models trained on multiple sequence alignments learn phylogenetic relationships

arXiv.org Artificial Intelligence

The explosion of available biological sequence data has led to multiple computational approaches aiming to infer three-dimensional structure, biological function, fitness, and evolutionary history of proteins from sequence data [1, 2]. Recently, self-supervised deep learning models based on natural language processing methods, especially attention [3] and transformers [4], have been trained on large ensembles of protein sequences by means of the masked language modeling objective of filling in masked amino acids in a sequence, given the surrounding ones [5-10]. These models, which capture long-range dependencies, learn rich representations of protein sequences, and can be employed for multiple tasks. In particular, they can predict structural contacts from single sequences in an unsupervised way [7], presumably by transferring knowledge from their large training set [11]. Neural network architectures based on attention are also employed in the Evoformer blocks in AlphaFold [12], as well as in RoseTTAFold [13] and RGN2 [14], and they contributed to the recent breakthrough in the supervised prediction of protein structure. Protein sequences can be classified into families of homologous proteins that descend from an ancestral protein and share a similar structure and function. Analyzing multiple sequence alignments (MSAs) of homologous proteins thus provides substantial information about functional and structural constraints [1]. The statistics of MSA columns, representing amino-acid sites, make it possible to identify functional residues that are conserved during evolution, and correlations of amino-acid usage between columns contain key information about functional sectors and structural contacts [15-18]. Indeed, through the course of evolution, contacting amino acids need to maintain their physico-chemical complementarity, which leads to correlated amino-acid usages at these sites: this is known as coevolution.
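The two kinds of MSA column statistics mentioned above can be illustrated with a small sketch: per-column Shannon entropy as a simple conservation measure, and mutual information between two columns as a simple measure of correlated amino-acid usage. This is a textbook-style illustration using raw empirical frequencies, with no pseudocounts, sequence reweighting, or phylogenetic correction. It is not the method of the paper, and the toy alignments are made up.

```python
import math
from collections import Counter


def column(msa, j):
    """Return the symbols found at column j of an alignment (list of strings)."""
    return [sequence[j] for sequence in msa]


def entropy(symbols):
    """Shannon entropy in bits of the empirical distribution of `symbols`."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(msa, i, j):
    """Mutual information in bits between columns i and j: H(i) + H(j) - H(i, j)."""
    a, b = column(msa, i), column(msa, j)
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


if __name__ == "__main__":
    # Toy alignment: column 0 is fully conserved, columns 1 and 2 vary independently.
    msa = ["ACD", "ACE", "AGD", "AGE"]
    print([entropy(column(msa, j)) for j in range(3)])
    print(mutual_information(msa, 1, 2))
```

Low entropy flags a conserved site, while high mutual information between two sites is the raw signal that coevolution-based contact predictors start from before correcting for phylogenetic and sampling biases.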


Common human diseases prediction using machine learning based on survey data

arXiv.org Artificial Intelligence

The moment has arrived to move away from disease as the primary emphasis of medical treatment. Multiple impressive techniques have been developed to detect diseases. At present, COVID-19, the common flu, migraine, lung disease, heart disease, kidney disease, diabetes, stomach disease, gastric disease, bone disease, and autism are very common. In this analysis, we examine disease symptoms and predict diseases based on those symptoms. We studied a range of symptoms and surveyed people in order to complete the task. Several classification algorithms have been employed to train the model. Furthermore, performance evaluation metrics are used to measure the model's performance. Finally, we found that the part classifier surpasses the others.


Towards Trustworthy AI-Empowered Real-Time Bidding for Online Advertisement Auctioning

arXiv.org Artificial Intelligence

Artificial intelligence-empowered Real-Time Bidding (AIRTB) is regarded as one of the most enabling technologies for online advertising. It has attracted significant research attention from diverse fields such as pattern recognition, game theory and mechanism design. Despite its remarkable development and deployment, the AIRTB system can sometimes harm the interests of its participants (e.g., depleting the advertisers' budget with various kinds of fraud). As such, building trustworthy AIRTB auctioning systems has emerged as an important direction of research in this field in recent years. Due to the highly interdisciplinary nature of this field and the lack of a comprehensive survey, it is a challenge for researchers to enter this field and contribute towards building trustworthy AIRTB technologies. This paper bridges this important gap in trustworthy AIRTB literature. We start by analysing the key concerns of various AIRTB stakeholders and identify three main dimensions of trust building in AIRTB, namely security, robustness and fairness. For each of these dimensions, we propose a unique taxonomy of the state of the art, trace the root causes of possible breakdown of trust, and discuss the necessity of the given dimension. This is followed by a comprehensive review of existing strategies for fulfilling the requirements of each trust dimension. In addition, we discuss the promising future directions of research essential to building trustworthy AIRTB systems to benefit the field of online advertising.


Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

arXiv.org Artificial Intelligence

Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from Big Data led to the term Big Data Mining. Shifting the scope of data from small-size, structured, and stable data to huge volume, unstructured, and quickly changing data brings many data management challenges. Different tools cope with these challenges in their own way due to their architectural limitations. There are numerous parameters to take into consideration when choosing the right data management framework based on the task at hand. In this paper, we present a comprehensive benchmark for two widely used Big Data analytics tools, namely Apache Spark and Hadoop MapReduce, on a common data mining task, i.e., classification. We employ several evaluation metrics to compare the performance of the benchmarked frameworks, such as execution time, accuracy, and scalability. These metrics are specialized to measure the performance for the classification task. To the best of our knowledge, there is no previous study in the literature that employs all these metrics while taking into consideration task-specific concerns. We show that Spark is 5 times faster than MapReduce on training the model. Nevertheless, the performance of Spark degrades when the input workload gets larger. Scaling the environment by additional clusters significantly improves the performance of Spark. However, a similar enhancement is not observed in Hadoop. The machine learning utilities of MapReduce tend to achieve better accuracy scores than those of Spark, by around 3%, even on small data sets.