
Supplementary material for CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence

Neural Information Processing Systems

The dataset consists of 5 TSV files, each corresponding to a different task. The "Prompt" column is used to pose questions to the LLM, and most files also include a "GT" column containing the ground-truth answer. The dataset includes URLs indicating the sources from which the data was collected. A permanent DOI identifier is associated with the dataset (AI4Sec, 2024).
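As a rough illustration of the format described above, the TSV files can be read with standard tooling. The file name below is hypothetical; the "Prompt" and "GT" column names follow the description:

```python
import pandas as pd

# Hypothetical file name; the benchmark ships 5 task-specific TSV files.
df = pd.read_csv("cti-task.tsv", sep="\t")

# Each row pairs a "Prompt" (the question posed to the LLM) with a
# "GT" (ground-truth) answer where available.
for _, row in df.head(3).iterrows():
    print(row["Prompt"])
    if "GT" in df.columns:
        print("Expected answer:", row["GT"])
```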


CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence, Dipkamal Bhusal, Rochester Institute of Technology, Rochester, NY, USA

Neural Information Processing Systems

Cyber threat intelligence (CTI) is crucial in today's cybersecurity landscape, providing essential insights to understand and mitigate ever-evolving cyber threats. Large Language Models (LLMs) have recently shown potential in this domain, but concerns about their reliability, accuracy, and hallucinations persist. While existing benchmarks provide general evaluations of LLMs, no benchmark addresses the practical and applied aspects of CTI-specific tasks. To bridge this gap, we introduce CTIBench, a benchmark designed to assess LLMs' performance in CTI applications. CTIBench includes multiple datasets focused on evaluating knowledge acquired by LLMs in the cyber-threat landscape. Our evaluation of several state-of-the-art models on these tasks provides insights into their strengths and weaknesses in CTI contexts, contributing to a better understanding of LLM capabilities in CTI.


Appendices

Neural Information Processing Systems

The supplementary material is organized as follows. We first discuss additional related work and provide experiment details in Appendix A and Appendix B, respectively. In Appendix C, we provide additional experiments to further validate the extreme nature of Simplicity Bias (SB). Then, in Appendix D, we provide additional information about the experiment setup used to show that extreme SB can hurt generalization. We evaluate the extent to which ensemble methods and adversarial training mitigate SB in Appendix E. Finally, we provide the proof of Theorem 1 in Appendix F.

In this section, we provide a more thorough discussion of relevant work related to margin-based generalization bounds, adversarial attacks and robustness, and out-of-distribution (OOD) examples. Margin-based generalization bounds: Building on the classical work of [3], recent works try to obtain tighter generalization bounds for neural networks in terms of the normalized margin [4, 50, 18, 22]. Here, the margin is defined as the difference between the probability of the true label and the largest probability among the incorrect labels.
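As a concrete reading of this definition (the function and variable names below are illustrative, not from the paper), the margin of a classifier's predicted probabilities on an example with a known true label can be computed as:

```python
import numpy as np

def margin(probs: np.ndarray, true_label: int) -> float:
    """Margin = p(true label) - max over incorrect labels of p(label)."""
    incorrect = np.delete(probs, true_label)
    return float(probs[true_label] - incorrect.max())

# Example: a 3-class prediction with true label 0.
print(margin(np.array([0.7, 0.2, 0.1]), 0))  # 0.7 - 0.2 = 0.5
```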


The Pitfalls of Simplicity Bias in Neural Networks

Neural Information Processing Systems

Several works have proposed Simplicity Bias (SB)--the tendency of standard training procedures such as Stochastic Gradient Descent (SGD) to find simple models--to justify why neural networks generalize well [1, 49, 74]. However, the precise notion of simplicity remains vague. Furthermore, previous settings [67, 24] that use SB to justify why neural networks generalize well do not simultaneously capture the non-robustness of neural networks--a widely observed phenomenon in practice [71, 36]. We attempt to reconcile SB and the superior standard generalization of neural networks with the non-robustness observed in practice by introducing piecewise-linear and image-based datasets, which (a) incorporate a precise notion of simplicity, (b) comprise multiple predictive features with varying levels of simplicity, and (c) capture the non-robustness of neural networks trained on real data. Through theoretical analysis and targeted experiments on these datasets, we make four observations: (i) SB of SGD and variants can be extreme: neural networks can exclusively rely on the simplest feature and remain invariant to all predictive complex features.
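To make the idea of features with varying levels of simplicity concrete, a minimal synthetic setup (my own illustrative construction, not the exact datasets from the paper) could pair a coordinate that is linearly separable with a more complex, slab-structured coordinate that is equally predictive of the label:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)  # binary labels

# Simple feature: (almost) linearly separable by its sign.
x_simple = np.where(y == 1, 1.0, -1.0) + 0.1 * rng.standard_normal(n)

# Complex feature: a slab-like rule -- class 1 sits in two outer slabs,
# class 0 in the middle slab, so no single threshold separates the classes.
x_complex = np.where(y == 1,
                     rng.choice([-2.0, 2.0], size=n),
                     rng.uniform(-0.5, 0.5, size=n))

X = np.stack([x_simple, x_complex], axis=1)
# Both coordinates predict y, but a network with an extreme simplicity bias
# may rely exclusively on x_simple and remain invariant to x_complex.
```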


Documentation for The Noisy Ostracods Dataset, He Wang

Neural Information Processing Systems

The Noisy Ostracods dataset is a real-world taxonomy dataset characterized by various types of noise. It was created out of the need for a clean taxonomy dataset and the challenges we encountered during the cleaning process in our real use case. Our goal was to provide a benchmark for evaluating the performance of robust machine learning methods and label correction algorithms from a practical perspective. The imbalanced and fine-grained nature of the dataset introduces additional challenges to these methods. This document was made by adapting the most relevant questions from Datasheets for Datasets [1] according to the properties of our dataset.


Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods

Neural Information Processing Systems

We present the Noisy Ostracods dataset, a noisy dataset for genus and species classification of crustacean ostracods with specialists' annotations. Of the 71,466 specimens collected, 5.58% are estimated to be noisy (possibly problematic) at the genus level. The dataset was created to address a real-world challenge: creating a clean, fine-grained taxonomy dataset. The Noisy Ostracods dataset has diverse noise from multiple sources. First, the noise is open-set, including new classes discovered during curation that were not part of the original annotation.


Exploiting the Surrogate Gap in Online Multiclass Classification

Neural Information Processing Systems

We present a new proof technique that exploits the gap between the zero-one loss and surrogate losses rather than exploiting properties such as exp-concavity or mixability, which are traditionally used to prove logarithmic or constant regret bounds.
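A minimal sketch of what the "gap" refers to, using the hinge loss as one common surrogate (the choice of hinge loss here is an illustrative assumption, not necessarily the surrogate studied in the paper):

```python
def zero_one_loss(margin: float) -> float:
    """1 if the prediction is wrong (non-positive margin), else 0."""
    return float(margin <= 0)

def hinge_loss(margin: float) -> float:
    """A convex surrogate that upper-bounds the zero-one loss."""
    return max(0.0, 1.0 - margin)

# The surrogate gap is the amount by which the surrogate exceeds the
# zero-one loss; it is strictly positive for small positive margins.
for m in [-0.5, 0.2, 1.5]:
    print(m, hinge_loss(m) - zero_one_loss(m))
```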



PolarMix: A General Data Augmentation Technique for LiDAR Point Clouds, Dayan Guan

Neural Information Processing Systems

LiDAR point clouds, which are usually scanned continuously by rotating LiDAR sensors, capture precise geometry of the surrounding environment and are crucial to many autonomous detection and navigation tasks. Though many 3D deep architectures have been developed, efficient collection and annotation of large amounts of point clouds remain one major challenge in the analytics and understanding of point cloud data. This paper presents PolarMix, a point cloud augmentation technique that is simple and generic but can mitigate the data constraint effectively across different perception tasks and scenarios. PolarMix enriches point cloud distributions and preserves point cloud fidelity via two cross-scan augmentation strategies that cut, edit, and mix point clouds along the scanning direction. The first is scene-level swapping, which exchanges point cloud sectors of two LiDAR scans that are cut along the azimuth axis. The second is instance-level rotation and paste, which crops point instances from one LiDAR scan, rotates them by multiple angles (to create multiple copies), and pastes the rotated instances into other scans. Extensive experiments show that PolarMix achieves superior performance consistently across different perception tasks and scenarios. In addition, it works as a plug-and-play module for various 3D deep architectures and also performs well for unsupervised domain adaptation.
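A rough sketch of the scene-level swapping idea, under my own simplifying assumptions (scans as N x 4 arrays of x, y, z, intensity; a single azimuth sector given by two angles); this is not the authors' implementation:

```python
import numpy as np

def swap_azimuth_sector(scan_a: np.ndarray, scan_b: np.ndarray,
                        start: float, end: float):
    """Exchange the points whose azimuth falls in [start, end) between two scans.

    Each scan is assumed to be an (N, 4) array of x, y, z, intensity.
    """
    def in_sector(scan):
        azimuth = np.arctan2(scan[:, 1], scan[:, 0])  # angle in the x-y plane
        return (azimuth >= start) & (azimuth < end)

    mask_a, mask_b = in_sector(scan_a), in_sector(scan_b)
    new_a = np.concatenate([scan_a[~mask_a], scan_b[mask_b]], axis=0)
    new_b = np.concatenate([scan_b[~mask_b], scan_a[mask_a]], axis=0)
    return new_a, new_b
```

For semantic segmentation, the per-point labels would need to be sliced and exchanged with the same masks so that labels stay aligned with the mixed points.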


Towards the Dynamics of a DNN Learning Symbolic Interactions

Neural Information Processing Systems

This study proves the two-phase dynamics of a deep neural network (DNN) learning interactions. Despite long-standing skepticism about the faithfulness of post-hoc explanations of DNNs, a series of theorems has been proven in recent years [27] to show that, for a given input sample, a small set of interactions between input variables can be considered as primitive inference patterns that faithfully represent a DNN's detailed inference logic on that sample. In particular, Zhang et al. [41] have observed that various DNNs all learn interactions of different complexities in two distinct phases, and this two-phase dynamics explains well how a DNN changes from under-fitting to over-fitting. Therefore, in this study, we mathematically prove the two-phase dynamics of interactions, providing a theoretical mechanism for how the generalization power of a DNN changes during training. Experiments show that our theory accurately predicts the real dynamics of interactions on different DNNs trained for various tasks.
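For readers unfamiliar with the notion of an interaction, a common formalization in this line of work is the Harsanyi dividend of a subset S of input variables, I(S) = sum over T subset of S of (-1)^(|S|-|T|) v(T), where v(T) is the model output when only the variables in T are present. The masking convention and the small brute-force sketch below are my own illustrative assumptions, not the paper's exact definition:

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of the tuple s, including the empty set and s itself."""
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def harsanyi_interaction(v, S):
    """I(S) = sum over T subset of S of (-1)^(|S|-|T|) * v(T).

    v maps a frozenset of 'present' variables to a scalar model output.
    """
    return sum((-1) ** (len(S) - len(T)) * v(frozenset(T))
               for T in subsets(tuple(S)))

# Toy model on variables {0, 1}: the output is 1 only when both variables
# are present, so the pairwise interaction is 1 and singletons are 0.
v = lambda present: float({0, 1} <= present)
print(harsanyi_interaction(v, {0, 1}))  # 1.0
print(harsanyi_interaction(v, {0}))     # 0.0
```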