Generative AI
AI Content Self-Detection for Transformer-based Large Language Models
Caiado, Antônio Junior Alves, Hahsler, Michael
The usage of generative artificial intelligence (AI) tools based on large language models, including ChatGPT, Bard, and Claude, for text generation has many exciting applications with the potential for phenomenal productivity gains. One issue is authorship attribution when using AI tools. This is especially important in an academic setting where the inappropriate use of generative AI tools may hinder student learning or stifle research by creating a large amount of automatically generated derivative work. Existing plagiarism detection systems can trace the source of submitted text but are not yet equipped with methods to accurately detect AI-generated text. This paper introduces the idea of direct origin detection and evaluates whether generative AI systems can recognize their output and distinguish it from human-written texts. We argue why current transformer-based models may be able to self-detect their own generated text and perform a small empirical study using zero-shot learning to investigate if that is the case. Results reveal varying capabilities of AI systems to identify their generated text. Google's Bard model exhibits the largest capability of self-detection with an accuracy of 94%, followed by OpenAI's ChatGPT with 83%. On the other hand, Anthropic's Claude model does not appear to be able to self-detect.
DrugAssist: A Large Language Model for Molecule Optimization
Ye, Geyan, Cai, Xibao, Lai, Houtim, Wang, Xing, Huang, Junhong, Wang, Longyue, Liu, Wei, Zeng, Xiangxiang
Recently, the impressive performance of large language models (LLMs) on a wide range of tasks has attracted an increasing number of attempts to apply LLMs in drug discovery. However, molecule optimization, a critical task in the drug discovery pipeline, is currently an area that has seen little involvement from LLMs. Most existing approaches focus solely on capturing the underlying patterns in chemical structures provided by the data, without taking advantage of expert feedback. These non-interactive approaches overlook the fact that the drug discovery process requires the integration of expert experience and iterative refinement. To address this gap, we propose DrugAssist, an interactive molecule optimization model which performs optimization through human-machine dialogue by leveraging LLMs' strong interactivity and generalizability. DrugAssist has achieved leading results in both single and multiple property optimization, simultaneously showcasing immense potential in transferability and iterative optimization. In addition, we publicly release a large instruction-based dataset called "MolOpt-Instructions" for fine-tuning language models on molecule optimization tasks. Figure 1: The illustration of our proposed DrugAssist model framework, which focuses on optimizing molecules through human-machine dialogue. Recently, generative artificial intelligence has made remarkable strides in the field of natural language processing (NLP), particularly with the advent of Large Language Models (LLMs) such as GPT (Generative Pre-trained Transformer) (Radford et al., 2019). These models have demonstrated impressive capabilities in a wide range of tasks, extending far beyond everyday communication and question-answering scenarios.
Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math
Wang, Zengzhi, Xia, Rui, Liu, Pengfei
High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope our MathPile can help to enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MathPile with the scripts used for processing, to facilitate future developments in this field.
Identifying and Mitigating the Security Risks of Generative AI
Barrett, Clark, Boyd, Brad, Burzstein, Elie, Carlini, Nicholas, Chen, Brad, Choi, Jihye, Chowdhury, Amrita Roy, Christodorescu, Mihai, Datta, Anupam, Feizi, Soheil, Fisher, Kathleen, Hashimoto, Tatsunori, Hendrycks, Dan, Jha, Somesh, Kang, Daniel, Kerschbaum, Florian, Mitchell, Eric, Mitchell, John, Ramzan, Zulfikar, Shams, Khawaja, Song, Dawn, Taly, Ankur, Yang, Diyi
Every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such as large language models (LLMs) and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code-completion, and text-to-image generation and editing). However, GenAI can be used just as well by attackers to generate new attacks and increase the velocity and efficacy of existing attacks. This paper reports the findings of a workshop held at Google (co-organized by Stanford University and the University of Wisconsin-Madison) on the dual-use dilemma posed by GenAI. This paper is not meant to be comprehensive, but is rather an attempt to synthesize some of the interesting findings from the workshop. We discuss short-term and long-term goals for the community on this topic. We hope this paper provides both a launching point for a discussion on this important topic as well as interesting problems that the research community can work to address.
The New York Times is Suing OpenAI and Microsoft for Copyright Infringement
While previous lawsuits claiming intellectual property violations by AI companies have come from artists and writers, the Times is the first American news organization to sue the companies, alleging that OpenAI and Microsoft used millions of its articles to train digital chatbots that now compete with the publication. While the case does not specify the revenue the Times has lost to new robot rivals, the suit argues that the tech companies' unauthorized use of the newspaper's images and written work deprives it of income from "subscriptions, licensing, advertising, and affiliates." The complaint asks that the AI companies be held accountable for "billions of dollars in statutory and actual" damages, citing several examples where the program lifted excerpts from the paper's stories verbatim. "Defendants have refused to recognize this protection."
The New York Times is suing OpenAI and Microsoft for copyright infringement
The New York Times is suing OpenAI and Microsoft for using published news articles to train their artificial intelligence chatbots without an agreement that compensates it for its intellectual property. The NYT did not specify how much it seeks in payout from the companies but said that "this action seeks to hold them responsible for the billions of dollars in statutory and actual damages." The NYT claims that OpenAI and Microsoft, the makers of ChatGPT and Copilot, "seek to free-ride on The Times's massive investment in its journalism" without having any licensing agreements. In one part of the complaint, the NYT highlights that its domain (www.nytimes.com) ... It alleges that more than 66 million records, ranging from breaking news articles to op-eds, published across the NYT websites and other affiliated brands were used to train the AI models.
New York Times sues OpenAI, Microsoft for infringing copyrighted works
The Times said OpenAI and Microsoft are advancing their technology through the "unlawful use of The Times's work to create artificial intelligence products that compete with it" and "threatens The Times's ability to provide that service". Through their AI chatbots, the companies "seek to free-ride on The Times's massive investment in its journalism by using it to build substitutive products without permission or payment", the lawsuit said. The Times, one of the most respected news organisations in the United States, is seeking damages as well as an order that the companies stop using its content – and destroy data already harvested. While no sum is specifically requested, the Times alleged that the infringement could have cost "billions of dollars in statutory and actual damages". With the suit, The New York Times chose a more confrontational approach to the sudden rise of AI chatbots, in contrast to other media groups, such as Germany's Axel Springer or The Associated Press, which have struck content deals with OpenAI. Microsoft, the world's second biggest company by market capitalisation, is a major investor in OpenAI and swiftly implemented the powers of AI in its own products after the release of ChatGPT last year.
New York Times sues OpenAI, Microsoft for using articles to train AI
The "large language models" (LLMs) behind AI tools such as ChatGPT work by ingesting huge amounts of text scraped from the internet, learning the connections between words and concepts, and then developing the ability to predict what word to say next in a sentence, allowing them to mimic human speech and writing. OpenAI, Microsoft and Google have refused to reveal what goes into their newest models, but previous LLMs have been shown to include large amounts of content from news organizations and catalogues of books.
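The "predict what word to say next" step described above can be illustrated with a deliberately tiny sketch. This is a word-pair (bigram) counter, not how ChatGPT or any production LLM is built; the function names and the toy corpus are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy corpus (an assumption for illustration; real models ingest vastly more text).
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_model(corpus)
print(predict_next(model, "the"))
```

Real LLMs replace these raw counts with a neural network trained over far longer contexts, but the training signal is the same in spirit: given the text so far, guess the next token.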