 Generative AI


Canada's privacy watchdog opens investigation into OpenAI, ChatGPT over complaint

FOX News

Canada's privacy watchdog has opened an investigation into OpenAI, the California-based company behind the explosive artificial intelligence chatbot, ChatGPT. Privacy Commissioner Philippe Dufresne said Tuesday his office was investigating OpenAI after receiving complaints alleging "the collection, use and disclosure of personal information without consent." "A.I. technology and its effects on privacy is a priority for my Office," Dufresne said in a statement.


Automated Reading Passage Generation with OpenAI's Large Language Model

arXiv.org Artificial Intelligence

The widespread usage of computer-based assessments and individualized learning platforms has resulted in an increased demand for the rapid production of high-quality items. Automated item generation (AIG), the process of using item models to generate new items with the help of computer technology, was proposed to reduce reliance on human subject experts at each step of the process. AIG has been used in test development for some time. Still, the use of machine learning algorithms has introduced the potential to greatly improve the efficiency and effectiveness of the process. The approach presented in this paper utilizes OpenAI's latest transformer-based language model, GPT-3, to generate reading passages. Existing reading passages were used in carefully engineered prompts to ensure the AI-generated text has similar content and structure to a fourth-grade reading passage. For each prompt, we generated multiple passages; the final passage was selected according to the agreement of its Lexile score with that of the original passage. In the final round, the selected passage went through a simple revision by a human editor to ensure the text was free of any grammatical and factual errors. All AI-generated passages, along with the original passages, were evaluated by human judges according to their coherence, appropriateness to fourth graders, and readability.
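The selection step described above (keep the generated passage whose Lexile score agrees best with the original) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each candidate already carries a readability score obtained elsewhere, reads "agreement" as the smallest absolute difference, and the function name `select_closest` is hypothetical.

```python
def select_closest(candidates, target_score):
    """Return the (text, score) pair whose score is nearest to target_score.

    candidates: list of (passage_text, score) tuples. The scores are assumed
    to be computed beforehand by an external readability tool.
    """
    if not candidates:
        raise ValueError("need at least one candidate passage")
    # Smallest absolute gap to the original passage's score wins.
    return min(candidates, key=lambda c: abs(c[1] - target_score))


# Illustrative values only, not data from the paper.
generated = [("passage A", 700), ("passage B", 820), ("passage C", 905)]
best = select_closest(generated, target_score=800)
```

The chosen passage would then go to the human editor for the revision step mentioned in the abstract.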


Coincidental Generation

arXiv.org Artificial Intelligence

Generative A.I. models have emerged as versatile tools across diverse industries, with applications in privacy-preserving data sharing, computational art, personalization of products and services, and immersive entertainment. Here, we introduce a new privacy concern in the adoption and use of generative A.I. models: that of coincidental generation, where a generative model's output is similar enough to an existing entity, beyond those represented in the dataset used to train the model, to be mistaken for it. Consider, for example, synthetic portrait generators, which are today deployed in commercial applications such as virtual modeling agencies and synthetic stock photography. Due to the low intrinsic dimensionality of human face perception, every synthetically generated face will coincidentally resemble an actual person. Such examples of coincidental generation all but guarantee the misappropriation of likeness and expose organizations that use generative A.I. to legal and regulatory risk.


Deep Generative Modeling on Limited Data with Regularization by Nontransferable Pre-trained Models

arXiv.org Artificial Intelligence

Deep generative models (DGMs) are data-eager because learning a complex model on limited data suffers from a large variance and easily overfits. Inspired by the classical perspective of the bias-variance tradeoff, we propose regularized deep generative model (Reg-DGM), which leverages a nontransferable pre-trained model to reduce the variance of generative modeling with limited data. Formally, Reg-DGM optimizes a weighted sum of a certain divergence and the expectation of an energy function, where the divergence is between the data and the model distributions, and the energy function is defined by the pre-trained model w.r.t. the model distribution. We analyze a simple yet representative Gaussian-fitting case to demonstrate how the weighting hyperparameter trades off the bias and the variance. Theoretically, we characterize the existence and the uniqueness of the global minimum of Reg-DGM in a non-parametric setting and prove its convergence with neural networks trained by gradient-based methods. Empirically, with various pre-trained feature extractors and a data-dependent energy function, Reg-DGM consistently improves the generation performance of strong DGMs with limited data and achieves results competitive with the state-of-the-art methods. Such models are often data-eager (Li et al., 2021; Wang et al., 2018) due to the presence of complex function classes. Recent work (Karras et al., 2020a) found that the classical variants of generative adversarial networks (GANs) (Goodfellow et al., 2014; Karras et al., 2020b) produce poor samples with limited data, a problem shared by other DGMs in principle. Thus, improving the sample efficiency is a common challenge for DGMs. The root cause of the problem is that learning a model in a complex class on limited data suffers from a large variance and easily overfits the training data (Mohri et al., 2018).
Although not pointed out in the literature to our knowledge, prior work can be understood as reducing the variance of the estimate implicitly (Mohri et al., 2018). In Sec. 2, we formulate the objective function of Reg-DGM as the sum of a certain divergence and a regularization term weighted by a hyperparameter.
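As a reading aid, the objective described in words above can be written in one line. The notation here is an assumption rather than the paper's own: $p_d$ is the data distribution, $p_\theta$ the model distribution, $\mathcal{D}$ the unspecified divergence, $f$ the energy function defined by the pre-trained model, and $\lambda \ge 0$ the weighting hyperparameter.

$$
\min_{\theta} \; \mathcal{D}\left(p_d \,\|\, p_\theta\right) \; + \; \lambda \, \mathbb{E}_{x \sim p_\theta}\left[ f(x) \right]
$$

Under this reading, $\lambda = 0$ recovers plain divergence minimization, and larger $\lambda$ pulls the model toward low-energy samples, which is the bias-variance tradeoff the abstract refers to.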


Does Synthetic Data Generation of LLMs Help Clinical Text Mining?

arXiv.org Artificial Intelligence

Recent advancements in large language models (LLMs) have led to the development of highly potent models like OpenAI's ChatGPT. These models have exhibited exceptional performance in a variety of tasks, such as question answering, essay composition, and code generation. However, their effectiveness in the healthcare sector remains uncertain. In this study, we seek to investigate the potential of ChatGPT to aid in clinical text mining by examining its ability to extract structured information from unstructured healthcare texts, with a focus on biological named entity recognition and relation extraction. Our preliminary results indicate that employing ChatGPT directly for these tasks resulted in poor performance and raised privacy concerns associated with uploading patients' information to the ChatGPT API. To overcome these limitations, we propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data with labels utilizing ChatGPT and fine-tuning a local model for the downstream task. Our method has resulted in significant improvements in the performance of downstream tasks, improving the F1-score from 23.37% to 63.99% for the named entity recognition task and from 75.86% to 83.59% for the relation extraction task. Furthermore, generating data using ChatGPT can significantly reduce the time and effort required for data collection and labeling, as well as mitigate data privacy concerns. In summary, the proposed framework presents a promising solution to enhance the applicability of LLMs to clinical text mining.


EKILA: Synthetic Media Provenance and Attribution for Generative Art

arXiv.org Artificial Intelligence

We present EKILA, a decentralized framework that enables creatives to receive recognition and reward for their contributions to generative AI (GenAI). EKILA proposes a robust visual attribution technique and combines this with an emerging content provenance standard (C2PA) to address the problem of synthetic image provenance -- determining the generative model and training data responsible for an AI-generated image. Furthermore, EKILA extends the non-fungible token (NFT) ecosystem to introduce a tokenized representation for rights, enabling a triangular relationship between the asset's Ownership, Rights, and Attribution (ORA). Leveraging the ORA relationship enables creators to express agency over training consent and, through our attribution model, to receive apportioned credit, including royalty payments for the use of their assets in GenAI.


How ChatGPT is changing the job hiring process, from the HR department to coders

#artificialintelligence

The recent launch of Google's Bard brought another tech giant into the generative artificial intelligence space, alongside Microsoft's Bing chat and OpenAI's ChatGPT. But how many business leaders are currently using AI tech in day-to-day operations or plan to? Based on new research, a lot. Half of the companies ResumeBuilder surveyed in February said they are using ChatGPT; 30% said they plan to do so. The data included 1,000 responses from ResumeBuilder's network of business leaders.


Generative AI Tools Use Custom Data to Power More Business Functions - WSJ

#artificialintelligence

Business software makers in financial management, design and other areas are rolling out generative artificial intelligence tools that pack troves of industry-specific data into customized applications, aiming for an edge in an already crowded market. By leveraging data gathered from specific business functions--in some cases stockpiled from decades of commercial use--software firms can offer AI tools fine-tuned for distinct applications, industry analysts said. They can also keep underlying algorithms free of extraneous data scraped online from unknown sources, which can produce unreliable results, they said.


It doesn't take much to make machine-learning algorithms go awry

#artificialintelligence

The algorithms that underlie modern artificial-intelligence (AI) systems need lots of data on which to train. Much of that data comes from the open web which, unfortunately, makes the AIs susceptible to a type of cyber-attack known as "data poisoning". This means modifying or adding extraneous information to a training data set so that an algorithm learns harmful or undesirable behaviours. Like a real poison, poisoned data could go unnoticed until after the damage has been done.


Image deduplication using OpenAI's CLIP and Community Detection

#artificialintelligence

A short guide on how to use image embeddings from OpenAI's CLIP and clustering techniques in order to group near-duplicate images together. CLIP is trained by trying to align image-text embedding pairs, or "learning visual representations from natural language supervision". You can use its text or image embeddings to accomplish a lot of different tasks, such as zero-shot image classification! Its embeddings are pretty powerful. For this task, we're going to use the AirBnB Duplicate Image Dataset, available on Kaggle.
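The grouping idea can be sketched without any model at all, assuming the image embeddings have already been computed. The sketch below is not the guide's code: it uses small hand-made vectors in place of real CLIP embeddings, swaps the community detection step for plain connected components over a thresholded cosine-similarity graph, and the threshold of 0.9 and the name `group_near_duplicates` are illustrative assumptions.

```python
import numpy as np


def group_near_duplicates(embeddings, threshold=0.9):
    """Group row indices whose cosine similarity is >= threshold.

    Uses connected components (union-find) as a simple stand-in for a
    community detection algorithm. Returns only groups with 2+ members.
    """
    emb = np.asarray(embeddings, dtype=float)
    # Normalize rows so the dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    n = len(emb)
    parent = list(range(n))

    def find(i):
        # Walk up to the root, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri  # merge the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]


# Toy vectors standing in for image embeddings (hypothetical data).
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [-1.0, 0.0]]
duplicate_groups = group_near_duplicates(toy)
```

Connected components can chain loosely related images together through intermediate neighbors, so a stricter threshold or a true community detection method may give tighter groups on real data.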