Goto

Collaborating Authors

 Overview


DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

arXiv.org Artificial Intelligence

Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systematically assess current model capabilities in discovery tasks and provide a useful resource for improving them. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from published papers to approximate the real-world challenges faced by researchers, where each task is defined by a dataset, its metadata, and a discovery goal in natural language. We additionally provide 903 synthetic tasks to conduct controlled evaluations across task complexity. Furthermore, our structured formalism of data-driven discovery enables a facet-based evaluation that provides useful insights into different failure modes. We evaluate several popular LLM-based reasoning frameworks using both open and closed LLMs as baselines on DiscoveryBench and find that even the best system scores only 25%. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.


Core Knowledge Learning Framework for Graph Adaptation and Scalability Learning

arXiv.org Artificial Intelligence

Graph classification is a pivotal challenge in machine learning, especially within the realm of graph-based data, given its importance in numerous real-world applications such as social network analysis, recommendation systems, and bioinformatics. Despite its significance, graph classification faces several hurdles, including adapting to diverse prediction tasks, training across multiple target domains, and handling small-sample prediction scenarios. Current methods often tackle these challenges individually, leading to fragmented solutions that lack a holistic approach to the overarching problem. In this paper, we propose an algorithm aimed at addressing the aforementioned challenges. By incorporating insights from various types of tasks, our method aims to enhance adaptability, scalability, and generalizability in graph classification. Motivated by the recognition that the underlying subgraph plays a crucial role in GNN prediction, while the remainder is task-irrelevant, we introduce the Core Knowledge Learning (\method{}) framework for graph adaptation and scalability learning. \method{} comprises several key modules, including the core subgraph knowledge submodule, graph domain adaptation module, and few-shot learning module for downstream tasks. Each module is tailored to tackle specific challenges in graph classification, such as domain shift, label inconsistencies, and data scarcity. By learning the core subgraph of the entire graph, we focus on the most pertinent features for task relevance. Consequently, our method offers benefits such as improved model performance, increased domain adaptability, and enhanced robustness to domain variations. Experimental results demonstrate significant performance enhancements achieved by our method compared to state-of-the-art approaches.


NLPGuard: A Framework for Mitigating the Use of Protected Attributes by NLP Classifiers

arXiv.org Artificial Intelligence

AI regulations are expected to prohibit machine learning models from using sensitive attributes during training. However, the latest Natural Language Processing (NLP) classifiers, which rely on deep learning, operate as black-box systems, complicating the detection and remediation of such misuse. Traditional bias mitigation methods in NLP aim for comparable performance across different groups based on attributes like gender or race but fail to address the underlying issue of reliance on protected attributes. To partly fix that, we introduce NLPGuard, a framework for mitigating the reliance on protected attributes in NLP classifiers. NLPGuard takes an unlabeled dataset, an existing NLP classifier, and its training data as input, producing a modified training dataset that significantly reduces dependence on protected attributes without compromising accuracy. NLPGuard is applied to three classification tasks: identifying toxic language, sentiment analysis, and occupation classification. Our evaluation shows that current NLP classifiers heavily depend on protected attributes, with up to $23\%$ of the most predictive words associated with these attributes. However, NLPGuard effectively reduces this reliance by up to $79\%$, while slightly improving accuracy.


POLygraph: Polish Fake News Dataset

arXiv.org Artificial Intelligence

This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish. The dataset, created by an interdisciplinary team, is composed of two parts: the "fake-or-not" dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the "fake-they-say" dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them. Unlike existing datasets, POLygraph encompasses a variety of approaches from source literature, providing a comprehensive resource for fake news detection. The data was collected through manual annotation by expert and non-expert annotators. The project also developed a software tool that uses advanced machine learning techniques to analyze the data and determine content authenticity. The tool and dataset are expected to benefit various entities, from public sector institutions to publishers and fact-checking organizations. Further dataset exploration will foster fake news detection and potentially stimulate the implementation of similar models in other languages. The paper focuses on the creation and composition of the dataset, so it does not include a detailed evaluation of the software tool for content authenticity analysis, which is planned at a later stage of the project.


On the Abuse and Detection of Polyglot Files

arXiv.org Artificial Intelligence

A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures, as well as file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding $30$ polyglot samples and $15$ attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose cyber attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild -- the first of its kind -- we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of $0.999$ with an F1 score of $99.20$% for polyglot detection and $99.47$% for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized $100$% of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.


Deep Dive into MRI: Exploring Deep Learning Applications in 0.55T and 7T MRI

arXiv.org Artificial Intelligence

The development of magnetic resonance imaging (MRI) for medical imaging has provided a leap forward in diagnosis, providing a safe, non-invasive alternative to techniques involving ionising radiation exposure for diagnostic purposes. It was described by Block and Purcel in 1946, and it was not until 1980 that the first clinical application of MRI became available. Since that time the MRI has gone through many advances and has altered the way diagnosing procedures are performed. Due to its ability to improve constantly, MRI has become a commonly used practice among several specialisations in medicine. Particularly starting 0.55T and 7T MRI technologies have pointed out enhanced preservation of image detail and advanced tissue characterisation. This review examines the integration of deep learning (DL) techniques into these MRI modalities, disseminating and exploring the study applications. It highlights how DL contributes to 0.55T and 7T MRI data, showcasing the potential of DL in improving and refining these technologies. The review ends with a brief overview of how MRI technology will evolve in the coming years.


Towards Unsupervised Question Answering System with Multi-level Summarization for Legal Text

arXiv.org Artificial Intelligence

This paper summarizes Team SCaLAR's work on SemEval-2024 Task 5: Legal Argument Reasoning in Civil Procedure. To address this Binary Classification task, which was daunting due to the complexity of the Legal Texts involved, we propose a simple yet novel similarity and distance-based unsupervised approach to generate labels. Further, we explore the Multi-level fusion of Legal-Bert embeddings using ensemble features, including CNN, GRU, and LSTM. To address the lengthy nature of Legal explanation in the dataset, we introduce T5-based segment-wise summarization, which successfully retained crucial information, enhancing the model's performance. Our unsupervised system witnessed a 20-point increase in macro F1-score on the development set and a 10-point increase on the test set, which is promising given its uncomplicated architecture.


Individual brain parcellation: Review of methods, validations and applications

arXiv.org Artificial Intelligence

Individual brains vary greatly in morphology, connectivity and organization. The applicability of group-level parcellations is limited by the rapid development of precision medicine today because they do not take into account the variation of parcels at the individual level. Accurate mapping of brain functional regions at the individual level is pivotal for a comprehensive understanding of the variations in brain function and behaviors, early and precise identification of brain abnormalities, as well as personalized treatments for neuropsychiatric disorders. With the development of neuroimaging and machine learning techniques, studies on individual brain parcellation are booming. In this paper, we offer an overview of recent advances in the methodologies of individual brain parcellation, including optimization- and learning-based methods. Comprehensive evaluation metrics to validate individual brain mapping have been introduced. We also review the studies of how individual brain mapping promotes neuroscience research and clinical medicine. Finally, we summarize the major challenges and important future directions of individualized brain parcellation. Collectively, we intend to offer a thorough overview of individual brain parcellation methods, validations, and applications, along with highlighting the current challenges that call for an urgent demand for integrated platforms that integrate datasets, methods, and validations.


Parallel Computing Architectures for Robotic Applications: A Comprehensive Review

arXiv.org Artificial Intelligence

With the growing complexity and capability of contemporary robotic systems, the necessity of sophisticated computing solutions to efficiently handle tasks such as real-time processing, sensor integration, decision-making, and control algorithms is also increasing. Conventional serial computing frequently fails to meet these requirements, underscoring the necessity for high-performance computing alternatives. Parallel computing, the utilization of several processing elements simultaneously to solve computational problems, offers a possible answer. Various parallel computing designs, such as multi-core CPUs, GPUs, FPGAs, and distributed systems, provide substantial enhancements in processing capacity and efficiency. By utilizing these architectures, robotic systems can attain improved performance in functionalities such as real-time image processing, sensor fusion, and path planning. The transformative potential of parallel computing architectures in advancing robotic technology has been underscored, real-life case studies of these architectures in the robotics field have been discussed, and comparisons are presented. Challenges pertaining to these architectures have been explored, and possible solutions have been mentioned for further research and enhancement of the robotic applications.


Engineering Conversational Search Systems: A Review of Applications, Architectures, and Functional Components

arXiv.org Artificial Intelligence

Conversational search systems enable information retrieval via natural language interactions, with the goal of maximizing users' information gain over multiple dialogue turns. The increasing prevalence of conversational interfaces adopting this search paradigm challenges traditional information retrieval approaches, stressing the importance of better understanding the engineering process of developing these systems. We undertook a systematic literature review to investigate the links between theoretical studies and technical implementations of conversational search systems. Our review identifies real-world application scenarios, system architectures, and functional components. We consolidate our results by presenting a layered architecture framework and explaining the core functions of conversational search systems. Furthermore, we reflect on our findings in light of the rapid progress in large language models, discussing their capabilities, limitations, and directions for future research.