Goto

Collaborating Authors

 Overview


Linguistic Interpretability of Transformer-based Language Models: a systematic review

arXiv.org Artificial Intelligence

Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, a type of 'black box' systems. There is, however, a line of research -- 'interpretability' -- aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers -- an area we call 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either not focus on linguistic knowledge in these models or present some limitations -- e.g. only studying English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models' internal representations.


Diffusion Models for Robotic Manipulation: A Survey

arXiv.org Machine Learning

Diffusion generative models have demonstrated remarkable success in visual domains such as image and video generation. They have also recently emerged as a promising approach in robotics, especially in robot manipulations. Diffusion models leverage a probabilistic framework, and they stand out with their ability to model multi-modal distributions and their robustness to high-dimensional input and output spaces. This survey provides a comprehensive review of state-of-the-art diffusion models in robotic manipulation, including grasp learning, trajectory planning, and data augmentation. Diffusion models for scene and image augmentation lie at the intersection of robotics and computer vision for vision-based tasks to enhance generalizability and data scarcity. This paper also presents the two main frameworks of diffusion models and their integration with imitation learning and reinforcement learning. In addition, it discusses the common architectures and benchmarks and points out the challenges and advantages of current state-of-the-art diffusion-based methods.


The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models

arXiv.org Artificial Intelligence

Practically all large language models have been pre-trained on data that is subject to global uncertainty related to copyright infringement and breach of contract. This creates potential risk for users and developers due to this uncertain legal status. The KL3M Data Project directly confronts this critical issue by introducing the largest comprehensive training data pipeline that minimizes risks related to copyright or breach of contract. The foundation of this project is a corpus of over 132 million documents and trillions of tokens spanning 16 different sources that have been verified to meet the strict copyright and licensing protocol detailed herein. We are releasing the entire pipeline, including 1) the source code to acquire and process these documents, 2) the original document formats with associated provenance and metadata, 3) extracted content in a standardized format, 4) pre-tokenized representations of the documents, and 5) various mid- and post-train resources such as question-answer, summarization, conversion, drafting, classification, prediction, and conversational data. All of these resources are freely available to the public on S3, Hugging Face, and GitHub under CC-BY terms. We are committed to continuing this project in furtherance of a more ethical, legal, and sustainable approach to the development and use of AI models.


Data over dialogue: Why artificial intelligence is unlikely to humanise medicine

arXiv.org Artificial Intelligence

Recently, a growing number of experts in artificial intelligence (AI) and medicine have be-gun to suggest that the use of AI systems, particularly machine learning (ML) systems, is likely to humanise the practice of medicine by substantially improving the quality of clinician-patient relationships. In this thesis, however, I argue that medical ML systems are more likely to negatively impact these relationships than to improve them. In particular, I argue that the use of medical ML systems is likely to comprise the quality of trust, care, empathy, understanding, and communication between clinicians and patients.


DeepGreen: Effective LLM-Driven Green-washing Monitoring System Designed for Empirical Testing -- Evidence from China

arXiv.org Artificial Intelligence

D EEPG REEN: E FFECTIVE LLM-D RIVEN G REEN-WASHING M ONITORING S YSTEM D ESIGNED FOR E MPIRICAL T ESTING --E VIDENCE FROM C HINA Congluo Xu Business School Sichuan University Chengdu, 610065 Y u Miao School of Economics Sichuan University Chengdu, 610065 Yiling Xiao Business School Sichuan University Chengdu, 610065 Chengmengjia Lin Business School Sichuan University Chengdu, 610065 April 11, 2025 A BSTRACT This paper proposes DeepGreen, an Large Language Model Driven (LLM-Driven) system for detecting corporate green-washing behaviour. Utilizing dual-layer LLM analysis, DeepGreen preliminar-ily identifies potential green keywords in financial statements and then assesses their implementation degree via iterative semantic analysis of LLM. A core variable GreenImplement is derived from the ratio from the two layers' output. We extract 204 financial statements of 68 companies from A-share market over three years, comprising 89,893 words, and analyse them through DeepGreen. Our analysis, supported by violin plots and K-means clustering, reveals insights and validates the variable against the Huazheng ESG rating. It offers a novel perspective for regulatory agencies and investors, serving as a proactive monitoring tool that complements traditional methods.Empirical tests show that green implementation can significantly boost the asset return rate of companies, but there is heterogeneity in scale. Small and medium-sized companies have limited contribution to asset return via green implementation, so there is a stronger motivation for green-washing. K eywords Green-washing Monitoring Large Language Models Financial Statement Analysis Unstructured Data Analysis 1 Introduction Amid intensifying global focus on sustainable development and environmental protection, the phenomenon of corporate "green-washing" has emerged as a contentious issue. "Green-washing" typically refers to those companies exaggerating or misrepresenting their environmental protection efforts in promotional materials, while their actual practices fail to meet sustainable development standards [1]. However, a more elusive challenge lies in "general green-washing", which involves subtler tactics that distort perceptions by repeatedly invoking terms such as "carbon peak" or "green development" without substantive evidence [2]. The elusiveness of general green-washing stems from its exploitation of human psychology and information processing mechanisms.


Quantum Machine Learning: Unveiling Trends, Impacts through Bibliometric Analysis

arXiv.org Artificial Intelligence

Quantum Machine Learning (QML) is the intersection of two revolutionary fields: quantum computing and machine learning. It promises to unlock unparalleled capabilities in data analysis, model building, and problem-solving by harnessing the unique properties of quantum mechanics. This research endeavors to conduct a comprehensive bibliometric analysis of scientific information pertaining to QML covering the period from 2000 to 2023. An extensive dataset comprising 9493 scholarly works is meticulously examined to unveil notable trends, impact factors, and funding patterns within the domain. Additionally, the study employs bibliometric mapping techniques to visually illustrate the network relationships among key countries, institutions, authors, patent citations and significant keywords in QML research. The analysis reveals a consistent growth in publications over the examined period. The findings highlight the United States and China as prominent contributors, exhibiting substantial publication and citation metrics. Notably, the study concludes that QML, as a research subject, is currently in a formative stage, characterized by robust scholarly activity and ongoing development.


ms-Mamba: Multi-scale Mamba for Time-Series Forecasting

arXiv.org Artificial Intelligence

The problem of Time-series Forecasting is generally addressed by recurrent, Transformer-based and the recently proposed Mamba-based architectures. However, existing architectures generally process their input at a single temporal scale, which may be sub-optimal for many tasks where information changes over multiple time scales. In this paper, we introduce a novel architecture called Multi-scale Mamba (ms-Mamba) to address this gap. ms-Mamba incorporates multiple temporal scales by using multiple Mamba blocks with different sampling rates ($Δ$s). Our experiments on many benchmarks demonstrate that ms-Mamba outperforms state-of-the-art approaches, including the recently proposed Transformer-based and Mamba-based models.


Generative Artificial Intelligence for Internet of Things Computing: A Systematic Survey

arXiv.org Artificial Intelligence

The integration of Generative Artificial Intelligence (GenAI) within the Internet of Things (IoT) is garnering considerable interest. This growing attention stems from the continuous evolution and widespread adoption they are both having individually, enough to spontaneously reshape numerous sectors, including Healthcare, Manufacturing, and Smart Cities. Hence, their increasing popularity has catalyzed further extensive research for understanding the potential of the duo GenAI-IoT, how they interplay, and to which extent their synergy can innovate the state-of-the-art in their individual scenarios. However, despite the increasing prominence of GenAI for IoT Computing, much of the existing research remains focused on specific, narrowly scoped applications. This fragmented approach highlights the need for a more comprehensive analysis of the potential, challenges, and implications of GenAI integration within the broader IoT ecosystem. This survey exactly aims to address this gap by providing a holistic overview of the opportunities, issues, and considerations arising from the convergence of these mainstream paradigms. Our contribution is realized through a systematic literature review following the PRISMA methodology. A comparison framework is presented, and well-defined research questions are outlined to comprehensively explore the past, present, and future directions of GenAI integration with IoT Computing, offering valuable insights for both experts and newcomers.


ConceptFormer: Towards Efficient Use of Knowledge-Graph Embeddings in Large Language Models

arXiv.org Artificial Intelligence

Retrieval Augmented Generation (RAG) has enjoyed increased attention in the recent past and recent advancements in Large Language Models (LLMs) have highlighted the importance of integrating world knowledge into these systems. Current RAG methodologies often modify the internal architecture of pre-trained language models (PLMs) or rely on textifying knowledge graphs (KGs), which is inefficient in terms of token usage. This paper introduces ConceptFormer, a new approach to augment LLMs with structured knowledge from KGs, such as Wikidata, without altering their internal structure or relying on textual input of KGs. ConceptFormer operates in the LLM embedding vector space, creating and injecting \emph{concept vectors} that encapsulate the information of the KG nodes directly. Trained in conjunction with a frozen LLM, ConceptFormer generates a comprehensive lookup table that maps KG nodes to their respective concept vectors. The approach aims to enhance the factual recall capabilities of LLMs by enabling them to process these concept vectors natively, thus enriching them with structured world knowledge in an efficient and scalable manner. Our experiments demonstrate that the addition of concept vectors to GPT-2 0.1B substantially increases its factual recall ability (Hit@10) by up to 272\% when tested on sentences from Wikipedia and up to 348\% on synthetically generated sentences. Even injecting only a single concept vector into the prompt increases factual recall ability (Hit@10) by up to 213\% on Wikipedia sentences, significantly outperforming RAG with graph textification while consuming 130x fewer input tokens.


Transformer-Based Temporal Information Extraction and Application: A Review

arXiv.org Artificial Intelligence

Temporal information extraction (IE) aims to extract structured temporal information from unstructured text, thereby uncovering the implicit timelines within. This technique is applied across domains such as healthcare, newswire, and intelligence analysis, aiding models in these areas to perform temporal reasoning and enabling human users to grasp the temporal structure of text. Transformer-based pre-trained language models have produced revolutionary advancements in natural language processing, demonstrating exceptional performance across a multitude of tasks. Despite the achievements garnered by Transformer-based approaches in temporal IE, there is a lack of comprehensive reviews on these endeavors. In this paper, we aim to bridge this gap by systematically summarizing and analyzing the body of work on temporal IE using Transformers while highlighting potential future research directions.