Overview
Data-centric Artificial Intelligence: A Survey
Zha, Daochen, Bhat, Zaid Pervaiz, Lai, Kwei-Herng, Yang, Fan, Jiang, Zhimeng, Zhong, Shaochen, Hu, Xia
Artificial Intelligence (AI) is making a profound impact in almost every domain. A vital enabler of its great success is the availability of abundant and high-quality data for building machine learning models. Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI. The attention of researchers and practitioners has gradually shifted from advancing model design to enhancing the quality and quantity of the data. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and the representative methods. We also organize the existing literature from automation and collaboration perspectives, discuss the challenges, and tabulate the benchmarks for various tasks. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle. We hope it can help the readers efficiently grasp a broad picture of this field, and equip them with the techniques and further research ideas to systematically engineer data for building AI systems. A companion list of data-centric AI resources will be regularly updated on https://github.com/daochenzha/data-centric-AI
A Comprehensive Survey of Continual Learning: Theory, Method and Application
Wang, Liyuan, Zhang, Xingxing, Su, Hang, Zhu, Jun
To cope with real-world dynamics, an intelligent agent needs to incrementally acquire, update, accumulate, and exploit knowledge throughout its lifetime. This ability, known as continual learning, provides a foundation for AI systems to develop themselves adaptively. In a general sense, continual learning is explicitly limited by catastrophic forgetting, where learning a new task usually results in a dramatic performance degradation of the old tasks. Beyond this, increasingly numerous advances have emerged in recent years that largely extend the understanding and application of continual learning. The growing and widespread interest in this direction demonstrates its realistic significance as well as complexity. In this work, we present a comprehensive survey of continual learning, seeking to bridge the basic settings, theoretical foundations, representative methods, and practical applications. Based on existing theoretical and empirical results, we summarize the general objectives of continual learning as ensuring a proper stability-plasticity trade-off and an adequate intra/inter-task generalizability in the context of resource efficiency. Then we provide a state-of-the-art and elaborated taxonomy, extensively analyzing how representative strategies address continual learning, and how they are adapted to particular challenges in various applications. Through an in-depth discussion of promising directions, we believe that such a holistic perspective can greatly facilitate subsequent exploration in this field and beyond.
Learnersourcing in the Age of AI: Student, Educator and Machine Partnerships for Content Creation
Khosravi, Hassan, Denny, Paul, Moore, Steven, Stamper, John
Our increasingly connected world is empowering learners and enabling exciting new pedagogies. In particular, educational tools that facilitate collaboration between students can help to foster a wide range of social and domainspecific skills (Jeong, Hmelo-Silver and Jo, 2019). The literature on computer supported collaborative learning documents a diverse range of pedagogies that have been applied for decades in many subject domains and educational levels (Lehtinen, Hakkarainen, Lipponen, Rahikainen and Muukkonen, 1999; Roberts, 2005; Kaliisa, Rienties, Mรธrch and Kluge, 2022). One recent approach, derived from foundational work on contributing student pedagogies (Collis and Moonen, 2002; Hamer, Sheard, Purchase and Luxton-Reilly, 2012), involves students creating and sharing learning resources with one another. Such activities have gained popularity in recent years and are associated with two broad types of benefits. Firstly, creating learning content is a cognitively demanding task that requires students to engage deeply with course concepts and exhibit behaviours at the highest level of Bloom's taxonomy of educational objectives (Hilton, Goldwater, Hancock, Clemson, Huang and Denyer, 2022). Secondly, leveraging the creative power of many students can result in the rapid and cost-effective creation of large repositories of learning resources that can, in turn, be used for practice and to support personalized learning experiences (Singh, Brooks, Lin and Li, 2021). Learnersourcing is a commonly used term to describe the practice of having students work collaboratively to generate shared learning resources (Kim, 2015). It is related to the more general task of crowdsourcing, in which tasks are outsourced to a pool of participants, often drawn from large and undefined populations, each of whom makes a small contribution to some product.
A Comprehensive Review of State-of-The-Art Methods for Java Code Generation from Natural Language Text
Espejel, Jessica Lรณpez, Alassan, Mahaman Sanoussi Yahaya, Chouham, El Mehdi, Dahhane, Walid, Ettifouri, El Hassane
Java Code Generation consists in generating automatically Java code from a Natural Language Text. This NLP task helps in increasing programmers' productivity by providing them with immediate solutions to the simplest and most repetitive tasks. Code generation is a challenging task because of the hard syntactic rules and the necessity of a deep understanding of the semantic aspect of the programming language. Many works tried to tackle this task using either RNN-based, or Transformer-based models. The latter achieved remarkable advancement in the domain and they can be divided into three groups: (1) encoder-only models, (2) decoder-only models, and (3) encoder-decoder models. In this paper, we provide a comprehensive review of the evolution and progress of deep learning models in Java code generation task. We focus on the most important methods and present their merits and limitations, as well as the objective functions used by the community. In addition, we provide a detailed description of datasets and evaluation metrics used in the literature. Finally, we discuss results of different models on CONCODE dataset, then propose some future directions.
Near-optimal Conservative Exploration in Reinforcement Learning under Episode-wise Constraints
Li, Donghao, Huang, Ruiquan, Shen, Cong, Yang, Jing
This paper investigates conservative exploration in reinforcement learning where the performance of the learning agent is guaranteed to be above a certain threshold throughout the learning process. It focuses on the tabular episodic Markov Decision Process (MDP) setting that has finite states and actions. With the knowledge of an existing safe baseline policy, an algorithm termed as StepMix is proposed to balance the exploitation and exploration while ensuring that the conservative constraint is never violated in each episode with high probability. StepMix features a unique design of a mixture policy that adaptively and smoothly interpolates between the baseline policy and the optimistic policy. Theoretical analysis shows that StepMix achieves near-optimal regret order as in the constraint-free setting, indicating that obeying the stringent episode-wise conservative constraint does not compromise the learning performance. Besides, a randomization-based EpsMix algorithm is also proposed and shown to achieve the same performance as StepMix. The algorithm design and theoretical analysis are further extended to the setting where the baseline policy is not given a priori but must be learned from an offline dataset, and it is proved that similar conservative guarantee and regret can be achieved if the offline dataset is sufficiently large. Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of the proposed conservative exploration strategies.
Robust Twin Parametric Margin Support Vector Machine for Multiclass Classification
De Leone, Renato, Maggioni, Francesca, Spinelli, Andrea
In this paper we present a Twin Parametric-Margin Support Vector Machine (TPMSVM) model to tackle the problem of multiclass classification. In the spirit of one-versus-all paradigm, for each class we construct a classifier by solving a TPMSVM-type model. Once all classifiers have been determined, they are combined into an aggregate decision function. We consider the cases of both linear and nonlinear kernel-induced classifiers. In addition, we robustify the proposed approach through robust optimization techniques. Indeed, in real-world applications observations are subject to measurement errors and noise, affecting the quality of the solutions. Consequently, data uncertainties need to be included within the model in order to prevent low accuracies in the classification process. Preliminary computational experiments on real-world datasets show the good performance of the proposed approach.
The Age of Synthetic Realities: Challenges and Opportunities
Cardenuto, Joรฃo Phillipe, Yang, Jing, Padilha, Rafael, Wan, Renjie, Moreira, Daniel, Li, Haoliang, Wang, Shiqi, Andalรณ, Fernanda, Marcel, Sรฉbastien, Rocha, Anderson
Synthetic realities are digital creations or augmentations that are contextually generated through the use of Artificial Intelligence (AI) methods, leveraging extensive amounts of data to construct new narratives or realities, regardless of the intent to deceive. In this paper, we delve into the concept of synthetic realities and their implications for Digital Forensics and society at large within the rapidly advancing field of AI. We highlight the crucial need for the development of forensic techniques capable of identifying harmful synthetic creations and distinguishing them from reality. This is especially important in scenarios involving the creation and dissemination of fake news, disinformation, and misinformation. Our focus extends to various forms of media, such as images, videos, audio, and text, as we examine how synthetic realities are crafted and explore approaches to detecting these malicious creations. Additionally, we shed light on the key research challenges that lie ahead in this area. This study is of paramount importance due to the rapid progress of AI generative techniques and their impact on the fundamental principles of Forensic Science.
A Survey on Cross-Architectural IoT Malware Threat Hunting
Raju, Anandharaju Durai, Abualhaol, Ibrahim, Giagone, Ronnie Salvador, Zhou, Yang, Huang, Shengqiang
In recent years, the increase in non-Windows malware threats had turned the focus of the cybersecurity community. Research works on hunting Windows PE-based malwares are maturing, whereas the developments on Linux malware threat hunting are relatively scarce. With the advent of the Internet of Things (IoT) era, smart devices that are getting integrated into human life have become a hackers highway for their malicious activities. The IoT devices employ various Unix-based architectures that follow ELF (Executable and Linkable Format) as their standard binary file specification. This study aims at providing a comprehensive survey on the latest developments in cross-architectural IoT malware detection and classification approaches. Aided by a modern taxonomy, we discuss the feature representations, feature extraction techniques, and machine learning models employed in the surveyed works. We further provide more insights on the practical challenges involved in cross-architectural IoT malware threat hunting and discuss various avenues to instill potential future research.
PLPCA: Persistent Laplacian Enhanced-PCA for Microarray Data Analysis
Cottrell, Sean, Wang, Rui, Wei, Guowei
Over the years, Principal Component Analysis (PCA) has served as the baseline approach for dimensionality reduction in gene expression data analysis. It primary objective is to identify a subset of disease-causing genes from a vast pool of thousands of genes. However, PCA possesses inherent limitations that hinder its interpretability, introduce classification ambiguity, and fail to capture complex geometric structures in the data. Although these limitations have been partially addressed in the literature by incorporating various regularizers such as graph Laplacian regularization, existing improved PCA methods still face challenges related to multiscale analysis and capturing higher-order interactions in the data. To address these challenges, we propose a novel approach called Persistent Laplacian-enhanced Principal Component Analysis (PLPCA). PLPCA amalgamates the advantages of earlier regularized PCA methods with persistent spectral graph theory, specifically persistent Laplacians derived from algebraic topology. In contrast to graph Laplacians, persistent Laplacians enable multiscale analysis through filtration and incorporate higher-order simplicial complexes to capture higher-order interactions in the data. We evaluate and validate the performance of PLPCA using benchmark microarray datasets that involve normal tissue samples and four different cancer tissues. Our extensive studies demonstrate that PLPCA outperforms all other state-of-the-art models for classification tasks after dimensionality reduction.
FinGPT: Open-Source Financial Large Language Models
Yang, Hongyang, Liu, Xiao-Yang, Wang, Christina Dan
Large language models (LLMs) have shown the potential of revolutionizing natural language processing tasks in diverse domains, sparking great interest in finance. Accessing high-quality financial data is the first challenge for financial LLMs (FinLLMs). While proprietary models like BloombergGPT have taken advantage of their unique data accumulation, such privileged access calls for an open-source alternative to democratize Internet-scale financial data. In this paper, we present an open-source large language model, FinGPT, for the finance sector. Unlike proprietary models, FinGPT takes a data-centric approach, providing researchers and practitioners with accessible and transparent resources to develop their FinLLMs. We highlight the importance of an automatic data curation pipeline and the lightweight low-rank adaptation technique in building FinGPT. Furthermore, we showcase several potential applications as stepping stones for users, such as robo-advising, algorithmic trading, and low-code development. Through collaborative efforts within the open-source AI4Finance community, FinGPT aims to stimulate innovation, democratize FinLLMs, and unlock new opportunities in open finance. Two associated code repos are \url{https://github.com/AI4Finance-Foundation/FinGPT} and \url{https://github.com/AI4Finance-Foundation/FinNLP}