The ubiquitous availability of computing devices and the widespread use of the internet have generated a large amount of data continuously. Therefore, the amount of available information on any given topic is far beyond humans' processing capacity to properly process, causing what is known as information overload. To efficiently cope with large amounts of information and generate content with significant value to users, we require identifying, merging and summarising information. Data summaries can help gather related information and collect it into a shorter format that enables answering complicated questions, gaining new insight and discovering conceptual boundaries. This thesis focuses on three main challenges to alleviate information overload using novel summarisation techniques. It further intends to facilitate the analysis of documents to support personalised information extraction. This thesis separates the research issues into four areas, covering (i) feature engineering in document summarisation, (ii) traditional static and inflexible summaries, (iii) traditional generic summarisation approaches, and (iv) the need for reference summaries. We propose novel approaches to tackle these challenges, by: i)enabling automatic intelligent feature engineering, ii) enabling flexible and interactive summarisation, iii) utilising intelligent and personalised summarisation approaches. The experimental results prove the efficiency of the proposed approaches compared to other state-of-the-art models. We further propose solutions to the information overload problem in different domains through summarisation, covering network traffic data, health data and business process data.
Static analysis remains one of the most popular approaches for detecting and correcting poor or vulnerable program code. It involves the examination of code listings, test results, or other documentation to identify errors, violations of development standards, or other problems, with the ultimate goal of fixing these errors so that systems and software are as secure as possible. There exists a plethora of static analysis tools, which makes it challenging for businesses and programmers to select a tool to analyze their program code. It is imperative to find ways to improve code analysis so that it can be employed by cyber defenders to mitigate security risks. In this research, we propose a method that employs the use of virtual assistants to work with programmers to ensure that software are as safe as possible in order to protect safety-critical systems from data breaches and other attacks. The pro- posed method employs a recommender system that uses various metrics to help programmers select the most appropriate code analysis tool for their project and guides them through the analysis process. The system further tracks the user's behavior regarding the adoption of the recommended practices.
Connected vehicles (CVs), because of the external connectivity with other CVs and connected infrastructure, are vulnerable to cyberattacks that can instantly compromise the safety of the vehicle itself and other connected vehicles and roadway infrastructure. One such cyberattack is the false information attack, where an external attacker injects inaccurate information into the connected vehicles and eventually can cause catastrophic consequences by compromising safety-critical applications like the forward collision warning. The occurrence and target of such attack events can be very dynamic, making real-time and near-real-time detection challenging. Change point models, can be used for real-time anomaly detection caused by the false information attack. In this paper, we have evaluated three change point-based statistical models; Expectation Maximization, Cumulative Summation, and Bayesian Online Change Point Algorithms for cyberattack detection in the CV data. Also, data-driven artificial intelligence (AI) models, which can be used to detect known and unknown underlying patterns in the dataset, have the potential of detecting a real-time anomaly in the CV data. We have used six AI models to detect false information attacks and compared the performance for detecting the attacks with our developed change point models. Our study shows that change points models performed better in real-time false information attack detection compared to the performance of the AI models. Change point models having the advantage of no training requirements can be a feasible and computationally efficient alternative to AI models for false information attack detection in connected vehicles.
Software Vulnerabilities (SVs) are increasing in complexity and scale, posing great security risks to many software systems. Given the limited resources in practice, SV assessment and prioritization help practitioners devise optimal SV mitigation plans based on various SV characteristics. The surge in SV data sources and data-driven techniques such as Machine Learning and Deep Learning have taken SV assessment and prioritization to the next level. Our survey provides a taxonomy of the past research efforts and highlights the best practices for data-driven SV assessment and prioritization. We also discuss the current limitations and propose potential solutions to address such issues.
Graph neural networks (GNNs) have attracted increasing interests. With broad deployments of GNNs in real-world applications, there is an urgent need for understanding the robustness of GNNs under adversarial attacks, especially in realistic setups. In this work, we study the problem of attacking GNNs in a restricted and realistic setup, by perturbing the features of a small set of nodes, with no access to model parameters and model predictions. Our formal analysis draws a connection between this type of attacks and an influence maximization problem on the graph. This connection not only enhances our understanding on the problem of adversarial attack on GNNs, but also allows us to propose a group of effective and practical attack strategies. Our experiments verify that the proposed attack strategies significantly degrade the performance of three popular GNN models and outperform baseline adversarial attack strategies.
Following procedural texts written in natural languages is challenging. We must read the whole text to identify the relevant information or identify the instruction flows to complete a task, which is prone to failures. If such texts are structured, we can readily visualize instruction-flows, reason or infer a particular step, or even build automated systems to help novice agents achieve a goal. However, this structure recovery task is a challenge because of such texts' diverse nature. This paper proposes to identify relevant information from such texts and generate information flows between sentences. We built a large annotated procedural text dataset (CTFW) in the cybersecurity domain (3154 documents). This dataset contains valuable instructions regarding software vulnerability analysis experiences. We performed extensive experiments on CTFW with our LM-GNN model variants in multiple settings. To show the generalizability of both this task and our method, we also experimented with procedural texts from two other domains (Maintenance Manual and Cooking), which are substantially different from cybersecurity. Our experiments show that Graph Convolution Network with BERT sentence embeddings outperforms BERT in all three domains
Privacy protection has recently attracted the attention of both academics and industries. Society protects individual data privacy through complex legal frameworks. This has become a topic of interest with the increasing applications of data science and artificial intelligence that have created a higher demand to the ubiquitous application of the data. The privacy protection of the broad Data-InformationKnowledge-Wisdom (DIKW) landscape, the next generation of information organization, has not been in the limelight. Next, we will explore DIKW architecture through the applications of popular swarm intelligence and differential privacy. As differential privacy proved to be an effective data privacy approach, we will look at it from a DIKW domain perspective. Swarm Intelligence could effectively optimize and reduce the number of items in DIKW used in differential privacy, this way accelerating both the effectiveness and the efficiency of differential privacy for crossing multiple modals of conceptual DIKW. The proposed approach is proved through the application of personalized data that is based on the open-sourse IRIS dataset. This experiment demonstrates the efficiency of Swarm Intelligence in reducing computing complexity.
Buluc, Aydin, Kolda, Tamara G., Wild, Stefan M., Anitescu, Mihai, DeGennaro, Anthony, Jakeman, John, Kamath, Chandrika, Ramakrishnan, null, Kannan, null, Lopes, Miles E., Martinsson, Per-Gunnar, Myers, Kary, Nelson, Jelani, Restrepo, Juan M., Seshadhri, C., Vrabie, Draguna, Wohlberg, Brendt, Wright, Stephen J., Yang, Chao, Zwart, Peter
Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and scalability. This report summarizes the outcomes of that workshop, "Randomized Algorithms for Scientific Computing (RASC)," held virtually across four days in December 2020 and January 2021.
Real-time traffic prediction models play a pivotal role in smart mobility systems and have been widely used in route guidance, emerging mobility services, and advanced traffic management systems. With the availability of massive traffic data, neural network-based deep learning methods, especially the graph convolutional networks (GCN) have demonstrated outstanding performance in mining spatio-temporal information and achieving high prediction accuracy. Recent studies reveal the vulnerability of GCN under adversarial attacks, while there is a lack of studies to understand the vulnerability issues of the GCN-based traffic prediction models. Given this, this paper proposes a new task -- diffusion attack, to study the robustness of GCN-based traffic prediction models. The diffusion attack aims to select and attack a small set of nodes to degrade the performance of the entire prediction model. To conduct the diffusion attack, we propose a novel attack algorithm, which consists of two major components: 1) approximating the gradient of the black-box prediction model with Simultaneous Perturbation Stochastic Approximation (SPSA); 2) adapting the knapsack greedy algorithm to select the attack nodes. The proposed algorithm is examined with three GCN-based traffic prediction models: St-Gcn, T-Gcn, and A3t-Gcn on two cities. The proposed algorithm demonstrates high efficiency in the adversarial attack tasks under various scenarios, and it can still generate adversarial samples under the drop regularization such as DropOut, DropNode, and DropEdge. The research outcomes could help to improve the robustness of the GCN-based traffic prediction models and better protect the smart mobility systems. Our code is available at https://github.com/LYZ98/Adversarial-Diffusion-Attacks-on-Graph-based-Traffic-Prediction-Models
Cyber Physical Systems (CPS) are characterized by their ability to integrate the physical and information or cyber worlds. Their deployment in critical infrastructure have demonstrated a potential to transform the world. However, harnessing this potential is limited by their critical nature and the far reaching effects of cyber attacks on human, infrastructure and the environment. An attraction for cyber concerns in CPS rises from the process of sending information from sensors to actuators over the wireless communication medium, thereby widening the attack surface. Traditionally, CPS security has been investigated from the perspective of preventing intruders from gaining access to the system using cryptography and other access control techniques. Most research work have therefore focused on the detection of attacks in CPS. However, in a world of increasing adversaries, it is becoming more difficult to totally prevent CPS from adversarial attacks, hence the need to focus on making CPS resilient. Resilient CPS are designed to withstand disruptions and remain functional despite the operation of adversaries. One of the dominant methodologies explored for building resilient CPS is dependent on machine learning (ML) algorithms. However, rising from recent research in adversarial ML, we posit that ML algorithms for securing CPS must themselves be resilient. This paper is therefore aimed at comprehensively surveying the interactions between resilient CPS using ML and resilient ML when applied in CPS. The paper concludes with a number of research trends and promising future research directions. Furthermore, with this paper, readers can have a thorough understanding of recent advances on ML-based security and securing ML for CPS and countermeasures, as well as research trends in this active research area.