

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

Wang, Xiyao, Chen, Jiuhai, Wang, Zhaoyang, Zhou, Yuhang, Zhou, Yiyang, Yao, Huaxiu, Zhou, Tianyi, Goldstein, Tom, Bhatia, Parminder, Huang, Furong, Xiao, Cao

arXiv.org Artificial Intelligence

Large vision-language models (LVLMs) have achieved impressive results in various visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there is still significant room for improvement in the alignment between visual and language modalities. Previous methods to enhance this alignment typically require external models or data, heavily depending on their capabilities and quality, which inevitably sets an upper bound on performance. In this paper, we propose SIMA, a framework that enhances visual and language modality alignment through self-improvement, eliminating the need for external models or data. SIMA leverages prompts from existing vision instruction tuning datasets to self-generate responses and employs an in-context self-critic mechanism to select response pairs for preference tuning. The key innovation is the introduction of three vision metrics during the in-context self-critic process, which can guide the LVLM in selecting responses that enhance image comprehension. Through experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA not only improves model performance across all benchmarks but also achieves superior modality alignment, outperforming previous approaches.
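The self-improvement loop described in the abstract can be sketched at a high level: the model samples several candidate responses for a prompt, a self-critic scores them, and the best- and worst-scored responses form a preference pair. This is an illustrative sketch only; the model and critic below are stand-in stubs, and the three "vision metrics" are hypothetical placeholders, not the paper's actual scoring criteria.

```python
import random

def generate_candidates(model, prompt, n=4):
    """Sample n candidate responses from the (stub) LVLM."""
    return [model(prompt, seed=i) for i in range(n)]

def self_critic_score(response):
    """Stand-in for the in-context self-critic: average three
    illustrative vision metrics into a single score."""
    return sum(response["metrics"]) / 3.0

def build_preference_pair(model, prompt):
    """Select the best/worst responses as a (chosen, rejected) pair
    for downstream preference tuning."""
    candidates = generate_candidates(model, prompt)
    ranked = sorted(candidates, key=self_critic_score, reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

def stub_model(prompt, seed=0):
    """Stub LVLM: returns a response with three made-up metric scores."""
    rng = random.Random(seed)
    return {"text": f"response-{seed}", "metrics": [rng.random() for _ in range(3)]}

pair = build_preference_pair(stub_model, "Describe the image.")
```

The resulting `pair` dictionaries would then feed a preference-tuning objective; since everything stays within the model's own generations, no external model or dataset is needed, which is the framework's central claim.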


Goats, Google and games: The future impact of a tech giant's push to train AI to play video games

FOX News

Google has developed an artificial intelligence system that can play video games like a human, take orders from players, and could eventually have real-world implications. "This work isn't about achieving high game scores," the SIMA research team wrote in a Google DeepMind post earlier this month. "Learning to play even one video game is a technical feat for an AI system, but learning to follow instructions in a variety of game settings could unlock more helpful AI agents for any environment." SIMA, which stands for Scalable Instructable Multiworld Agent, isn't like a typical computer player that's built into a specific game. Rather, the AI agent plays alongside and learns like a human -- through image recognition and natural-language commands -- and plays with keyboard and mouse outputs.


Forget Chatbots. AI Agents Are the Future

WIRED

This week a startup called Cognition AI caused a bit of a stir by releasing a demo showing an artificial intelligence program called Devin performing work usually done by well-paid software engineers. Chatbots like ChatGPT and Gemini can generate code, but Devin went further, planning how to solve a problem, writing the code, and then testing and implementing it. When asked to test how Meta's open source language model Llama 2 performed when accessed via different companies hosting it, Devin generated a step-by-step plan for the project, generated code needed to access the APIs and run benchmarking tests, and created a website summarizing the results. It's always hard to judge staged demos, but Cognition has shown Devin handling a wide range of impressive tasks. It wowed investors and engineers on X, receiving plenty of endorsements, and even inspired a few memes--including some predicting Devin will soon be responsible for a wave of tech industry layoffs.


Google AI learns to play open-world video games by watching them

New Scientist

A Google DeepMind artificial intelligence model can play different open-world video games including No Man's Sky like a human, by watching video from a screen, which could be a step towards generally intelligent AIs that operate in the corporeal world. Playing video games has long been a way to test the progress of AI systems, such as Google DeepMind's AI mastery of chess or Go, but these games have obvious ways to win or lose, making it relatively straightforward to train an AI to succeed at them. Open-world games with extraneous information that can be ignored and more abstract objectives, such as Minecraft, are harder for AI systems to crack. Because the array of choices available in the games makes them a little more like normal life, they are thought to be an important stepping stone towards training AI agents that could do jobs in the real world, such as control robots, and artificial general intelligence. Now, researchers at Google DeepMind have developed an AI they call a Scalable Instructable Multiworld Agent, or SIMA, which can play nine different video games and virtual environments it hasn't seen before using just the video feed from the game.


Google DeepMind's new AI can follow commands inside 3D games it hasn't seen before

Engadget

Google DeepMind has unveiled new research highlighting an AI agent that's able to carry out a swath of tasks in 3D games it hasn't seen before. The team has long been experimenting with AI models that can win in the likes of Go and chess, and even learn games without being told their rules. Now, for the first time, according to DeepMind, an AI agent has shown it's able to understand a wide range of gaming worlds and carry out tasks within them based on natural-language instructions. The researchers teamed up with studios and publishers such as Hello Games (No Man's Sky), Tuxedo Labs (Teardown) and Coffee Stain (Valheim and Goat Simulator 3) to train the Scalable Instructable Multiworld Agent (SIMA) on nine games. The team also used four research environments, including one built in Unity in which agents are instructed to form sculptures using building blocks.


An AI that can play Goat Simulator is a step toward more useful machines

MIT Technology Review

In training AI systems, games are a good proxy for real-world tasks. "A general game-playing agent could, in principle, learn a lot more about how to navigate our world than anything in a single environment ever could," says Michael Bernstein, an associate professor of computer science at Stanford University, who was not part of the research. "One could imagine one day rather than having superhuman agents which you play against, we could have agents like SIMA playing alongside you in games with you and with your friends," says Tim Harley, a research engineer at Google DeepMind who was part of the team that developed the agent. The team trained SIMA on lots of examples of humans playing video games, both individually and collaboratively, alongside keyboard and mouse input and annotations of what the players did in the game, says Frederic Besse, a research engineer at Google DeepMind. Then they used an AI technique called imitation learning to teach the agent to play games as humans would.
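The training recipe described above -- recorded human play paired with the actions the players took, then imitation learning -- is essentially behavioral cloning. The toy sketch below illustrates the idea under heavy simplification: observations are small feature vectors rather than video frames, actions are strings rather than keyboard/mouse outputs, and the "policy" is a nearest-neighbor lookup, not DeepMind's actual architecture.

```python
def distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class NearestNeighborPolicy:
    """Imitation policy: reproduce the action of the closest
    demonstrated observation."""

    def __init__(self):
        self.demos = []  # list of (observation, action) pairs

    def fit(self, demonstrations):
        self.demos = list(demonstrations)

    def act(self, observation):
        # Imitate whichever demonstrated observation is closest.
        _, action = min(self.demos, key=lambda d: distance(d[0], observation))
        return action

# Toy demonstrations: observation features -> the human's action
demos = [
    ((0.0, 0.0), "move_forward"),
    ((1.0, 0.0), "turn_right"),
    ((0.0, 1.0), "jump"),
]
policy = NearestNeighborPolicy()
policy.fit(demos)
```

Given a new observation near one of the demonstrations, the policy returns the corresponding human action -- the same "learn to act as humans did" principle, at toy scale.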


Google DeepMind's Latest AI Agent Learned to Play 'Goat Simulator 3'

WIRED

Goat Simulator 3 is a surreal video game in which players take domesticated ungulates on a series of implausible adventures, sometimes involving jetpacks. That might seem an unlikely venue for the next big leap in artificial intelligence, but Google DeepMind today revealed an AI program capable of learning how to complete tasks in a number of games, including Goat Simulator 3. Most impressively, when the program encounters a game for the first time, it can reliably perform tasks by adapting what it learned from playing other games. The program is called SIMA, for Scalable Instructable Multiworld Agent, and it builds upon recent AI advances that have seen large language models produce remarkably capable chatbots like ChatGPT. "SIMA is greater than the sum of its parts," says Frederic Besse, a research engineer at Google DeepMind who was involved with the project. "It is able to take advantage of the shared concepts in the game, to learn better skills and to learn to be better at carrying out instructions."


Classification and Generation of real-world data with an Associative Memory Model

Simas, Rodrigo, Sa-Couto, Luis, Wichert, Andreas

arXiv.org Artificial Intelligence

Drawing from memory the face of a friend you have not seen in years is a difficult task. However, if you happen to cross paths, you would easily recognize each other. Biological memory is equipped with an impressive compression algorithm that can store the essentials and then infer the details to match perception. The Willshaw Memory is a simple abstract model for cortical computations which implements mechanisms of biological memories. Using our recently proposed sparse coding prescription for visual patterns, this model can store and retrieve an impressive amount of real-world data in a fault-tolerant manner. In this paper, we extend the capabilities of the basic Associative Memory Model by using a Multiple-Modality framework. In this setting, the memory stores several modalities (e.g., visual or textual) of each pattern simultaneously. After training, the memory can be used to infer missing modalities when just a subset is perceived. Using a simple encoder-memory-decoder architecture and a newly proposed iterative retrieval algorithm for the Willshaw Model, we perform experiments on the MNIST dataset. By storing both the images and labels as modalities, a single Memory can be used not only to retrieve and complete patterns but also to classify and generate new ones. We further discuss how this model could be used for other learning tasks, thus serving as a biologically-inspired framework for learning.
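The core mechanism referenced here, a Willshaw-style binary associative memory, can be sketched in a few lines: sparse binary patterns are stored by OR-ing their outer products into a weight matrix, and retrieval fires the output units whose overlap with the cue meets a threshold. This is a minimal classical sketch, not the paper's multi-modality or iterative-retrieval extension; the pattern sizes and thresholding rule below are illustrative simplifications.

```python
class WillshawMemory:
    """Binary hetero-associative memory with clipped Hebbian storage."""

    def __init__(self, n_in, n_out):
        self.W = [[0] * n_out for _ in range(n_in)]

    def store(self, x, y):
        """Associate binary input pattern x with binary output y:
        set W[i][j] = 1 wherever x[i] and y[j] are both active."""
        for i, xi in enumerate(x):
            if xi:
                for j, yj in enumerate(y):
                    if yj:
                        self.W[i][j] = 1

    def retrieve(self, x):
        """Fire each output unit whose summed input from the active
        cue units reaches the cue's activity level."""
        theta = sum(x)  # threshold = number of active input units
        n_out = len(self.W[0])
        sums = [sum(self.W[i][j] for i, xi in enumerate(x) if xi)
                for j in range(n_out)]
        return [1 if s >= theta else 0 for s in sums]

mem = WillshawMemory(6, 4)
# Storing two modalities of a pattern, e.g. an image code and its label:
mem.store([1, 1, 0, 0, 0, 0], [1, 0, 0, 0])
mem.store([0, 0, 1, 1, 0, 0], [0, 1, 0, 0])
```

Presenting one stored modality (the image code) retrieves the other (the label), which is the "infer missing modalities from a perceived subset" behavior the abstract describes, here in hetero-associative miniature.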


Enabling Edge Machine Learning Applications with SiMa.ai

#artificialintelligence

Industrial IoT systems with the intelligence to sort goods on the production line based on their size and quality. Autonomous vehicles that passengers can summon for rides. Drones that survey crops to optimize water consumption and yield. Machine learning (ML) at the embedded edge is blossoming and new applications are certain to emerge as the underlying ML technologies become easier to implement. SiMa.ai is one of the companies at the forefront of ushering in an age of effortless ML for the embedded edge.


Purpose-built ML SoC for edge processing Smart2.0

#artificialintelligence

Designed to enable quick and effortless ML experiences for the embedded edge, the software-centric MLSoC Platform addresses any computer vision application and is offered as delivering a 10x better performance/watt solution – operating at the most efficient frames per second/watt. The platform's push-button software experience, says the company, allows users to effortlessly scale machine learning in minutes for robotics, smart vision, government, autonomous vehicles, drones, and healthcare applications. "When we started SiMa.ai 3.5 years ago," says Krishna Rangasayee, CEO and Founder, SiMa.ai, "we set out to deliver a disruptive 10x performance improvement over alternatives and provide a scalable industry-leading ML experience solving computer vision applications. Today we are delighting customers by delivering on that promise and exceeding their expectations. We are excited to take our very first purpose-built software-centric MLSoC to volume production."