 Large Language Model


Language Versatilists vs. Specialists: An Empirical Revisiting on Multilingual Transfer Ability

arXiv.org Artificial Intelligence

Multilingual transfer ability, which reflects how well models fine-tuned on one source language can be applied to other languages, has been well studied in multilingual pre-trained models (e.g., BLOOM). However, such ability has not been investigated for English-centric models (e.g., LLaMA). To fill this gap, we study the following research questions. First, does multilingual transfer ability exist in English-centric models, and how does it compare with multilingual pretrained models? Second, does it appear only when English is the source language for the English-centric model? Third, how does it vary across tasks? We take multilingual reasoning ability as our focus and conduct extensive experiments across four types of reasoning tasks. We find that the multilingual pretrained model does not always outperform an English-centric model. Furthermore, English appears to be a less suitable source language, and the choice of source language becomes less important when the English-centric model scales up. In addition, different types of tasks exhibit different multilingual transfer abilities. These findings demonstrate that English-centric models not only possess multilingual transfer ability but may even surpass the transferability of multilingual pretrained models if well-trained. By showing the strengths and weaknesses, the experiments also provide valuable insights into enhancing multilingual reasoning abilities for English-centric models.


Applying Standards to Advance Upstream & Downstream Ethics in Large Language Models

arXiv.org Artificial Intelligence

This paper explores how AI-owners can develop safeguards for AI-generated content by drawing from established codes of conduct and ethical standards in other content-creation industries. It delves into the current state of ethical awareness on Large Language Models (LLMs). By dissecting the mechanism of content generation by LLMs, four key areas (upstream/downstream and at user prompt/answer), where safeguards could be effectively applied, are identified. A comparative analysis of these four areas follows and includes an evaluation of the existing ethical safeguards in terms of cost, effectiveness, and alignment with established industry practices. The paper's key argument is that existing IT-related ethical codes, while adequate for traditional IT engineering, are inadequate for the challenges posed by LLM-based content generation. Drawing from established practices within journalism, we propose potential standards for businesses involved in distributing and selling LLM-generated content. Finally, potential conflicts of interest between dataset curation at upstream and ethical benchmarking downstream are highlighted to underscore the need for a broader evaluation beyond mere output. This study prompts a nuanced conversation around ethical implications in this rapidly evolving field of content generation.


Understanding Programs by Exploiting (Fuzzing) Test Cases

arXiv.org Artificial Intelligence

Semantic understanding of programs has attracted great attention in the community. Inspired by recent successes of large language models (LLMs) in natural language understanding, tremendous progress has been made by treating programming language as another sort of natural language and training LLMs on corpora of program code. However, programs are essentially different from texts, in the sense that they are normally heavily structured and syntax-strict. In particular, programs and their basic units (i.e., functions and subroutines) are designed to demonstrate a variety of behaviors and/or provide possible outputs, given different inputs. The relationship between inputs and possible outputs/behaviors represents the functions/subroutines and profiles the program as a whole. Therefore, we propose to incorporate such a relationship into learning, to achieve a deeper semantic understanding of programs. To obtain inputs that are representative enough to trigger the execution of most of the code, we resort to fuzz testing and propose fuzz tuning to boost the performance of program understanding and code representation learning, given a pre-trained LLM. The effectiveness of the proposed method is verified on two program understanding tasks, code clone detection and code classification, and it outperforms current state-of-the-art methods by large margins. Code is available at https://github.com/rabbitjy/FuzzTuning.


Data-centric Artificial Intelligence: A Survey

arXiv.org Artificial Intelligence

Artificial Intelligence (AI) is making a profound impact in almost every domain. A vital enabler of its great success is the availability of abundant and high-quality data for building machine learning models. Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI. The attention of researchers and practitioners has gradually shifted from advancing model design to enhancing the quality and quantity of the data. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and the representative methods. We also organize the existing literature from automation and collaboration perspectives, discuss the challenges, and tabulate the benchmarks for various tasks. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle. We hope it can help the readers efficiently grasp a broad picture of this field, and equip them with the techniques and further research ideas to systematically engineer data for building AI systems. A companion list of data-centric AI resources will be regularly updated on https://github.com/daochenzha/data-centric-AI


OpenAI CEO Calls for Collaboration With China to Counter AI Risks

WSJ.com: WSJD - Technology



OpenAI's CEO calls on China to help shape AI safety guidelines

The Japan Times

China should play a key role in shaping the artificial intelligence guardrails needed to ensure the safety of transformative new systems, OpenAI Inc.'s Chief Executive Officer Sam Altman said. "With the emergence of the increasingly powerful AI systems, the stakes for global cooperation have never been higher," Altman, whose company kick-started an AI frenzy in China with last year's launch of ChatGPT, told a Beijing conference via video link on Saturday. In both China and Silicon Valley, talent and investments are flowing into AI, a strategic area that will help define the deepening tech rivalry between the world's two largest economies. Advances in the emerging technology have also highlighted tensions in how governments are seeking to regulate the sector, one that Chinese leader Xi Jinping has said requires greater state oversight to mitigate national security risks.


A Comprehensive Survey of Continual Learning: Theory, Method and Application

arXiv.org Artificial Intelligence

To cope with real-world dynamics, an intelligent agent needs to incrementally acquire, update, accumulate, and exploit knowledge throughout its lifetime. This ability, known as continual learning, provides a foundation for AI systems to develop themselves adaptively. In a general sense, continual learning is explicitly limited by catastrophic forgetting, where learning a new task usually results in a dramatic performance degradation of the old tasks. Beyond this, increasingly numerous advances have emerged in recent years that largely extend the understanding and application of continual learning. The growing and widespread interest in this direction demonstrates its realistic significance as well as complexity. In this work, we present a comprehensive survey of continual learning, seeking to bridge the basic settings, theoretical foundations, representative methods, and practical applications. Based on existing theoretical and empirical results, we summarize the general objectives of continual learning as ensuring a proper stability-plasticity trade-off and an adequate intra/inter-task generalizability in the context of resource efficiency. Then we provide a state-of-the-art and elaborated taxonomy, extensively analyzing how representative strategies address continual learning, and how they are adapted to particular challenges in various applications. Through an in-depth discussion of promising directions, we believe that such a holistic perspective can greatly facilitate subsequent exploration in this field and beyond.


Learnersourcing in the Age of AI: Student, Educator and Machine Partnerships for Content Creation

arXiv.org Artificial Intelligence

Our increasingly connected world is empowering learners and enabling exciting new pedagogies. In particular, educational tools that facilitate collaboration between students can help to foster a wide range of social and domain-specific skills (Jeong, Hmelo-Silver and Jo, 2019). The literature on computer supported collaborative learning documents a diverse range of pedagogies that have been applied for decades in many subject domains and educational levels (Lehtinen, Hakkarainen, Lipponen, Rahikainen and Muukkonen, 1999; Roberts, 2005; Kaliisa, Rienties, Mørch and Kluge, 2022). One recent approach, derived from foundational work on contributing student pedagogies (Collis and Moonen, 2002; Hamer, Sheard, Purchase and Luxton-Reilly, 2012), involves students creating and sharing learning resources with one another. Such activities have gained popularity in recent years and are associated with two broad types of benefits. Firstly, creating learning content is a cognitively demanding task that requires students to engage deeply with course concepts and exhibit behaviours at the highest level of Bloom's taxonomy of educational objectives (Hilton, Goldwater, Hancock, Clemson, Huang and Denyer, 2022). Secondly, leveraging the creative power of many students can result in the rapid and cost-effective creation of large repositories of learning resources that can, in turn, be used for practice and to support personalized learning experiences (Singh, Brooks, Lin and Li, 2021). Learnersourcing is a commonly used term to describe the practice of having students work collaboratively to generate shared learning resources (Kim, 2015). It is related to the more general task of crowdsourcing, in which tasks are outsourced to a pool of participants, often drawn from large and undefined populations, each of whom makes a small contribution to some product.


Medical Data Augmentation via ChatGPT: A Case Study on Medication Identification and Medication Event Classification

arXiv.org Artificial Intelligence

The identification of key factors such as medications, diseases, and relationships within electronic health records and clinical notes has a wide range of applications in the clinical field. In the N2C2 2022 competitions, various tasks were presented to promote the identification of key factors in electronic health records (EHRs) using the Contextualized Medication Event Dataset (CMED). Pretrained large language models (LLMs) demonstrated exceptional performance in these tasks. This study aims to explore the utilization of LLMs, specifically ChatGPT, for data augmentation to overcome the limited availability of annotated data for identifying the key ...

To encourage advancements in data analytics on EHRs, the N2C2 2022 competitions have invited teams to participate in various tasks aimed at identifying key factors such as medications, diseases, and relationships within the Contextualized Medication Event Dataset (CMED) [10]. Over the past few years, there has been a significant breakthrough in natural language processing (NLP) tasks with the introduction of pretrained large language models (LLMs) such as BERT. These LLMs are transformer-based architectures that undergo unsupervised training on extensive text data to comprehend the intricate features and patterns of human language.


Universal Language Modelling agent

arXiv.org Artificial Intelligence

Large Language Models are designed to understand complex human language. Yet understanding animal language has long intrigued researchers striving to bridge the communication gap between humans and other species. This research paper introduces a novel approach that draws inspiration from the linguistic concepts found in the Quran, a revealed Holy Arabic scripture dating back 1400 years. By exploring the linguistic structure of the Quran, specifically the components of ism, fil, and harf, we aim to unlock the underlying intentions and meanings embedded within animal conversations using audio data. To unravel the intricate complexities of animal language, we employ word embedding techniques to analyze each distinct frequency component. This methodology enables the identification of potential correlations and the extraction of meaningful insights from the data. Furthermore, we leverage a bioacoustics model to generate audio, which serves as a valuable resource for training natural language processing (NLP) techniques. This paper aims to find the intention behind animal language rather than to translate it word by word.