An open-source training framework to advance multimodal AI


Figure: Modeling physical reality by assembling multiple modalities. The image shows a pair of oranges seen through several modalities, with each slice showing a different way one might perceive and understand the scene. From left to right: surface normals (color encodes surface orientation), depth (distance to the camera; red is near, blue is far), RGB (the original image), segmentation (distinct objects and image regions), and edges (object or texture boundaries).

Large Language Models such as OpenAI's ChatGPT have already transformed the way many of us go about our daily tasks. These generative artificial intelligence chatbots are trained on language: hundreds of terabytes of text 'scraped' from across the Internet, with billions of parameters. Looking ahead, many believe the 'engines' that drive generative artificial intelligence will be multimodal models, trained not just on text but also able to process various other modalities of information, including images, video, sound, and modalities from other domains such as biological or atmospheric data. Yet, until recently, training a single model to handle a wide range of modalities (inputs) and tasks (outputs) faced significant challenges.
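To make the idea of aligned modalities concrete, here is a minimal sketch of how one multimodal sample of a scene might be represented in code. The class name, field names, and shapes are illustrative assumptions for this article, not the framework's actual data format.

```python
# Hypothetical sketch: one scene described by several pixel-aligned
# modalities, as in the figure above. Names and shapes are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class MultimodalSample:
    rgb: np.ndarray           # (H, W, 3) color image, values in [0, 1]
    depth: np.ndarray         # (H, W) distance to the camera
    normals: np.ndarray       # (H, W, 3) unit surface-orientation vectors
    segmentation: np.ndarray  # (H, W) integer object/region IDs
    edges: np.ndarray         # (H, W) boundary strength in [0, 1]

    def __post_init__(self):
        # All modalities describe the same scene, so they must share
        # the same spatial resolution to stay pixel-aligned.
        h, w = self.rgb.shape[:2]
        for name in ("depth", "normals", "segmentation", "edges"):
            arr = getattr(self, name)
            assert arr.shape[:2] == (h, w), f"{name} is misaligned"

# Example: a dummy 64x64 scene carrying all five modalities.
h, w = 64, 64
sample = MultimodalSample(
    rgb=np.random.rand(h, w, 3),
    depth=np.random.rand(h, w) * 10.0,
    normals=np.full((h, w, 3), [0.0, 0.0, 1.0]),
    segmentation=np.zeros((h, w), dtype=np.int64),
    edges=np.zeros((h, w)),
)
```

The point of the sketch is simply that each modality is a different, spatially aligned view of the same underlying scene, which is what lets a single model relate them to one another.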