Generative AI
Introducing v0.5 of the AI Safety Benchmark from MLCommons
Vidgen, Bertie, Agrawal, Adarsh, Ahmed, Ahmed M., Akinwande, Victor, Al-Nuaimi, Namir, Alfaraj, Najla, Alhajjar, Elie, Aroyo, Lora, Bavalatti, Trupti, Bartolo, Max, Blili-Hamelin, Borhane, Bollacker, Kurt, Bomassani, Rishi, Boston, Marisa Ferrara, Campos, Siméon, Chakra, Kal, Chen, Canyu, Coleman, Cody, Coudert, Zacharie Delpierre, Derczynski, Leon, Dutta, Debojyoti, Eisenberg, Ian, Ezick, James, Frase, Heather, Fuller, Brian, Gandikota, Ram, Gangavarapu, Agasthya, Gangavarapu, Ananya, Gealy, James, Ghosh, Rajat, Goel, James, Gohar, Usman, Goswami, Sujata, Hale, Scott A., Hutiri, Wiebke, Imperial, Joseph Marvin, Jandial, Surgan, Judd, Nick, Juefei-Xu, Felix, Khomh, Foutse, Kailkhura, Bhavya, Kirk, Hannah Rose, Klyman, Kevin, Knotz, Chris, Kuchnik, Michael, Kumar, Shachi H., Kumar, Srijan, Lengerich, Chris, Li, Bo, Liao, Zeyi, Long, Eileen Peters, Lu, Victor, Luger, Sarah, Mai, Yifan, Mammen, Priyanka Mary, Manyeki, Kelvin, McGregor, Sean, Mehta, Virendra, Mohammed, Shafee, Moss, Emanuel, Nachman, Lama, Naganna, Dinesh Jinenhally, Nikanjam, Amin, Nushi, Besmira, Oala, Luis, Orr, Iftach, Parrish, Alicia, Patlak, Cigdem, Pietri, William, Poursabzi-Sangdeh, Forough, Presani, Eleonora, Puletti, Fabrizio, Röttger, Paul, Sahay, Saurav, Santos, Tim, Scherrer, Nino, Sebag, Alice Schoenauer, Schramowski, Patrick, Shahbazi, Abolfazl, Sharma, Vin, Shen, Xudong, Sistla, Vamsi, Tang, Leonard, Testuggine, Davide, Thangarasa, Vithursan, Watkins, Elizabeth Anne, Weiss, Rebecca, Welty, Chris, Wilbers, Tyler, Williams, Adina, Wu, Carole-Jean, Yadav, Poonam, Yang, Xianjun, Zeng, Yi, Zhang, Wenhui, Zhdanov, Fedor, Zhu, Jiacheng, Liang, Percy, Mattson, Peter, Vanschoren, Joaquin
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
Compositional Text-to-Image Generation with Dense Blob Representations
Nie, Weili, Liu, Sifei, Mardani, Morteza, Liu, Chao, Eckart, Benjamin, Vahdat, Arash
Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives - denoted as dense blob representations - that contain fine-grained details of the scene while being modular, human-interpretable, and easy-to-construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. Particularly, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks.
Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI
Lee, Gyeong-Geon, Zhai, Xiaoming
Educational scholars have analyzed various image data acquired from teaching and learning situations, such as photos that show classroom dynamics, students' drawings with regard to the learning content, textbook illustrations, etc. Unquestionably, most qualitative analysis and explanation of image data has been conducted by human researchers, without machine-based automation. This was partly because most image-processing artificial intelligence models were neither accessible to general educational scholars nor explainable, owing to their complex deep neural network architectures. However, the recent development of Visual Question Answering (VQA) techniques is producing usable visual language models, which receive from the user a question about a given image and return an answer, both in natural language. In particular, GPT-4V, released by OpenAI, has opened up state-of-the-art visual language model services so that VQA can be used for a variety of purposes. However, VQA and GPT-4V have not yet been widely applied to educational studies. In this position paper, we suggest that GPT-4V contributes to realizing VQA for education. By 'realizing' VQA, we denote two meanings: (1) GPT-4V realizes the utilization of VQA techniques by any educational scholar without technical or accessibility barriers, and (2) GPT-4V makes educational scholars realize the usefulness of VQA to educational research. Given these, this paper aims to introduce VQA for educational studies so that it provides a milestone for educational research methodology. In this paper, Chapter II reviews the development of VQA techniques, which culminates in the release of GPT-4V. Chapter III reviews the use of image analysis in educational studies. Chapter IV demonstrates how GPT-4V can be used for each research usage reviewed in Chapter III, with operating prompts provided. Finally, Chapter V discusses the future implications.
Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis
Petrov, Nikolay B, Serapio-García, Gregory, Rentfrow, Jason
The humanlike responses of large language models (LLMs) have prompted social scientists to investigate whether LLMs can be used to simulate human participants in experiments, opinion polls and surveys. Of central interest in this line of research has been mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires. The conflicting findings of this research are unsurprising given that mapping out underlying, or latent, traits from LLMs' text responses to questionnaires is no easy task. To address this, we use psychometrics, the science of psychological measurement. In this study, we prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs. We used two kinds of persona descriptions: either generic (four or five random person descriptions) or specific (mostly demographics of actual humans from a large-scale human dataset). We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties, similar to human norms, but the data from both LLMs, when using specific demographic profiles, show poor psychometric properties. We conclude that, currently, when LLMs are asked to simulate silicon personas, their responses are poor signals of potentially underlying latent traits. Thus, our work casts doubt on LLMs' ability to simulate individual-level human behaviour across multiple-choice question answering tasks.
Enhancing Decision-Making in Optimization through LLM-Assisted Inference: A Neural Networks Perspective
Singh, Gaurav, Bali, Kavitesh Kumar
This paper explores the seamless integration of Generative AI (GenAI) and Evolutionary Algorithms (EAs) within the domain of large-scale multi-objective optimization. Focusing on the transformative role of Large Language Models (LLMs), our study investigates the potential of LLM-Assisted Inference to automate and enhance decision-making processes. Specifically, we highlight its effectiveness in illuminating key decision variables in evolutionarily optimized solutions while articulating contextual trade-offs. Tailored to address the challenges inherent in inferring complex multi-objective optimization solutions at scale, our approach emphasizes the adaptive nature of LLMs, allowing them to provide nuanced explanations and align their language with diverse stakeholder expertise levels and domain preferences. Empirical studies underscore the practical applicability and impact of LLM-Assisted Inference in real-world decision-making scenarios.
Artificial intelligence not always helpful for reducing doctor burnout, studies suggest
The use of generative AI may not be helpful in reducing burnout in health care, new research suggests. Previous research indicated that increased time spent using electronic health record (EHR) systems and handling administrative responsibilities has been a burden on doctors. So some people had heralded artificial intelligence as a potential solution -- yet recent investigations by U.S. health systems found that large language models (LLMs) did not simplify clinicians' day-to-day responsibilities.
Making deepfake images is increasingly easy – controlling their use is proving all but impossible
"Very creepy," was April's first thought when she saw her face on a generative AI website. April is one half of the Maddison twins. She and her sister Amelia make content for OnlyFans, Instagram and other platforms, but they also existed as a custom generative AI model – made without their consent. "It was really weird to see our faces, but not really our faces," she says. Deepfakes – the creation of realistic but false imagery, video and audio using artificial intelligence – is on the political agenda after the federal government announced last week it would introduce legislation to ban the creation and sharing of deepfake pornography as part of measures to combat violence against women.
Stack Overflow Users Are Revolting Against an OpenAI Deal
On Monday, Stack Overflow and OpenAI announced a new API partnership that will integrate Stack Overflow's technical content with OpenAI's ChatGPT AI assistant. The deal has sparked controversy among Stack Overflow's user community, with many expressing anger and protest over the use of their contributed content to support and train AI models. "I'm just going to delete/deface my answers one by one," wrote one user on sister site Stack Exchange. "I don't care if this is against your silly policies, because as this announcement shows, your policies can change at a whim without prior consultation of your stakeholders." Stack Overflow is a popular question-and-answer site for software developers that allows users to ask and answer technical questions related to coding.
Microsoft Deploys Generative AI for US Spies
Law enforcement in the United States, United Kingdom, and Australia this week named a Russian national as the person behind LockBitSupp, the pseudonym of the leader of the LockBit ransomware gang that the US says is responsible for extracting $500 million from its victims. Dmitry Yuryevich Khoroshev has been sanctioned and charged with 26 criminal counts in the US, which combined could result in a prison sentence of 185 years. That is, if he's ever arrested and successfully prosecuted--an extremely rare event for suspects who live in Russia. Elsewhere in the world of cybercrime, WIRED's Andy Greenberg interviewed a representative of Cyber Army of Russia, a group of hackers who have targeted water utilities in the US and Europe and are said to have ties to the notorious Russian military hacking unit known as Sandworm. The responses from Cyber Army of Russia were littered with pro-Kremlin talking points--and some curious admissions.
Japan team uses Fugaku supercomputer to develop language model for AI
A team of researchers from the Tokyo Institute of Technology, Fujitsu and others have announced the development of a large language model that can serve as a foundation for generative artificial intelligence, using the Japanese supercomputer Fugaku. Trained extensively on data in Japanese, which account for 60% of the total training data, the Fugaku-LLM model revealed Friday is expected to lead to research on generative AI tailored to domestic needs. In May 2023, the researchers -- also including those from Tohoku University, Nagoya University, the government-backed research institute Riken, CyberAgent and Kotoba Technologies -- launched the project employing the supercomputer jointly developed by Fujitsu and Riken.