Generative AI
Will AI inspire a new M&M? How artificial intelligence is reshaping Mars
Could AI text-to-image generators like DALL-E inspire new designs for iconic candies like M&Ms or Skittles? As a candy-packed Halloween approaches, it seemed like an obvious question to ask the head of AI and machine learning at Mars Inc. -- which over the past century has overseen a slew of popular confectionery brands from M&Ms to Milky Way and Snickers; grown into a CPG behemoth that includes brands such as Dove, Pedigree and Whiskas; and now claims to care for half the world's pets through nutrition, health and services businesses including Banfield Pet Hospitals and Anicura. While Shubham Mehrish, global vice president of digital strategy at Mars Inc., wouldn't say whether an AI-designed M&M was on the horizon, he did sound bullish on DALL-E and other AI art tools for idea generation at Mars. "The DALL-E team has been stingy in giving access, but we have a few of our AI scientists already playing with it," he said.
Introducing Whisper
We've trained and are open-sourcing a neural net called Whisper that approaches human-level robustness and accuracy on English speech recognition. Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing. The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer.
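As a rough illustration of what the open-sourced inference code makes possible, here is a minimal sketch that assumes the `openai-whisper` Python package and its `load_model` and `transcribe` calls. The helper name `transcribe_file`, its parameters, and the `"base"` model size are choices made for this example and do not come from the announcement.

```python
def transcribe_file(path, model_name="base", translate=False, load_model=None):
    """Return the text Whisper produces for the audio file at `path`.

    `load_model` is a hypothetical injection point for this sketch: by
    default it resolves to whisper.load_model, but any callable returning
    an object with a compatible `transcribe` method will do.
    """
    if load_model is None:
        # Imported lazily; assumes `pip install openai-whisper` and ffmpeg.
        import whisper
        load_model = whisper.load_model

    model = load_model(model_name)

    # "transcribe" keeps the spoken language; "translate" outputs English.
    task = "translate" if translate else "transcribe"
    result = model.transcribe(path, task=task)

    # The result is a dict; the full transcript is under the "text" key.
    return result["text"].strip()
```

Calling `transcribe_file("audio.mp3")` would transcribe a local file (the file name is a placeholder), and passing `translate=True` requests an English translation instead of a same-language transcript.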
A Case Report On The "A.I. Locked-In Problem": social concerns with modern NLP
Modern NLP models are becoming better conversational agents than their predecessors. Recurrent Neural Networks (RNNs) and especially Long Short-Term Memory (LSTM) features allow the agent to better store and use information about semantic content, a trend that has become even more pronounced with Transformer models. Large Language Models (LLMs) such as GPT-3 by OpenAI are known to be able to construct and follow a narrative, which enables the system to adopt personas on the go, adapt them and play along in conversational stories. However, practical experimentation with GPT-3 shows that there is a recurring problem with these modern NLP systems, namely that they can "get stuck" in the narrative so that further conversations, prompt executions or commands become futile. This is here referred to as the "Locked-In Problem" and is exemplified with an experimental case report, followed by the practical and social concerns that accompany this problem.
Implementing and Experimenting with Diffusion Models for Text-to-Image Generation
Taking advantage of the many recent advances in deep learning, text-to-image generative models have attracted the attention of the general public. Two of these models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description of an image. Based on a novel approach to image generation called diffusion models, text-to-image models enable the production of many different types of high-resolution images, where human imagination is the only limit. However, these models require exceptionally large amounts of computational resources to train, as well as the handling of huge datasets collected from the internet. In addition, neither the codebase nor the models have been released. This prevents the AI community from experimenting with these cutting-edge models and makes reproducing their results complicated, if not impossible. In this thesis, we aim to contribute by first reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model. Our model is closely based on DALL-E 2, with several slight modifications to tackle the high computational cost involved. We thus have the opportunity to experiment in order to understand what these models are capable of, especially in a low-resource regime. In particular, we provide additional and deeper analyses than those performed by the authors of DALL-E 2, including ablation studies. Moreover, diffusion models use so-called guidance methods to help the generation process. We introduce a new guidance method that can be used in conjunction with other guidance methods to improve image quality. Finally, the images generated by our model are of reasonably good quality, without having to sustain the significant training costs of state-of-the-art text-to-image models.
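The abstract does not say which guidance methods are involved, and the new method it proposes is not shown here. For orientation only, the sketch below illustrates classifier-free guidance, one widely used guidance method in text-to-image diffusion models. The function name, the use of plain Python lists for noise predictions, and the example values are assumptions made for this illustration.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine two noise predictions from the same denoising step.

    eps_uncond: noise predicted without the text prompt
    eps_cond:   noise predicted with the text prompt
    scale:      guidance weight; 0 gives the unconditional prediction,
                1 gives the conditional one, and values above 1 push the
                sample further toward the prompt.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-element "noise vectors" standing in for full image tensors.
unconditional = [0.0, 1.0]
conditional = [1.0, 3.0]
guided = classifier_free_guidance(unconditional, conditional, scale=2.0)
```

In a real sampler this combination would be applied to the model's output at every denoising step. Because it is a simple linear blend of predictions, additional guidance terms can in principle be added alongside it.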
A Solution to DALL·E 2's AI Bias Problem
Take a look at voice applications. When applying a mindful AI approach, and leveraging the power of a global talent pool, developers can account for linguistic elements such as different dialects and accents in the data sets. Many of the people we rely on to crowdsource at Pactera EDGE are not full-time employees, but they develop expertise by working regularly on our projects. We use modules and tests to identify and reward those with the strongest capabilities at translation and those who produce the best outcomes for our clients. Establishing a human-centered design framework from the beginning is critical.
OpenAI open-sources Whisper, a multilingual speech recognition system
Speech recognition remains a challenging problem in AI and machine learning. In a step toward solving it, OpenAI today open-sourced Whisper, an automatic speech recognition system that the company claims enables "robust" transcription in multiple languages as well as translation from those languages into English. Countless organizations have developed highly capable speech recognition systems, which sit at the core of software and services from tech giants like Google, Amazon and Meta. But what makes Whisper different, according to OpenAI, is that it was trained on 680,000 hours of multilingual and "multitask" data collected from the web, which led to improved recognition of unique accents, background noise and technical jargon. "The primary intended users of [the Whisper] models are AI researchers studying robustness, generalization, capabilities, biases and constraints of the current model. However, Whisper is also potentially quite useful as an automatic speech recognition solution for developers, especially for English speech recognition," OpenAI wrote in the GitHub repo for Whisper, from which several versions of the system can be downloaded.
Dall-E 2 users to be allowed to upload faces for first time
Users of the image generating artificial intelligence Dall-E 2 will be allowed to upload faces to the system for the first time, creators OpenAI have said, as competition in the sector heats up. The feature marks the latest relaxation of the company's rules around how its tool, which can generate high-quality images from a text prompt, can be used. When it first launched in a public beta, OpenAI banned users from generating any images with a realistic face. Later, those rules were relaxed to allow the generation of realistic faces, but not those of specific individuals. Now, users will be able to upload photos that depict real people – with consent – and use OpenAI's tools to generate new variations on the pictures.
ACT-1: How Adept Is Building the Future of AI with Action Transformers
One of AI's most ambitious goals is to build systems that can do everything a human can. GPT-3 can write and Stable Diffusion can paint, but neither can interact with the world directly. AI companies have been trying to create intelligent agents this way for 10 years. This seems to be changing now. One of my latest articles covers Google's PaLM-SayCan (PSC), a robot powered by PaLM, the best large language model to date. PSC's language module can interpret human requests expressed in natural language and transform them into high-level tasks that can be further broken down into elemental actions.
Generative AI: A Creative New World
Humans are good at analyzing things. Machines can analyze a set of data and find patterns in it for a multitude of use cases, whether it's fraud or spam detection, forecasting the ETA of your delivery or predicting which TikTok video to show you next. They are getting smarter at these tasks. This is called "Analytical AI," or traditional AI. But humans are not only good at analyzing things--we are also good at creating.
Holz, founder of AI art service Midjourney, on future images
In 2008, David Holz co-founded a hardware peripheral firm called Leap Motion. He ran it until last year when he left to create Midjourney. Midjourney in its present form is a social network for creating AI-generated art from a text prompt – type a word or phrase at the input prompt and you'll receive an interesting or perhaps wonderful image on screen after about a minute of computation. It's similar in some respects to OpenAI's DALL-E 2. Both are the result of large AI models trained on vast numbers of images. But Midjourney has its own distinctive style, as can be seen from this Twitter thread.
[Image: Midjourney image of the sky and clouds, using the text prompt "All this useless beauty."]