Artificial intelligence system learns concepts shared across video, audio, and text

#artificialintelligence

Humans observe the world through a combination of different modalities, like vision, hearing, and our understanding of language. Machines, on the other hand, interpret the world through data that algorithms can process. So, when a machine "sees" a photo, it must encode that photo into data it can use to perform a task like image classification. This process becomes more complicated when inputs come in multiple formats, like videos, audio clips, and images. "The main challenge here is, how can a machine align those different modalities? As humans, this is easy for us. We see a car and then hear the sound of a car driving by, and we know these are the same thing. But for machine learning, it is not that straightforward," says Alexander Liu, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and first author of a paper tackling this problem.
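To make the alignment idea concrete, here is a minimal sketch, assuming a CLIP-style contrastive setup in PyTorch: each modality is projected into a shared embedding space, and paired inputs (e.g., an image and the matching audio clip) are pulled together while mismatched pairs are pushed apart. The feature dimensions, projection layers, and loss below are illustrative placeholders, not the CSAIL team's actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbedding(nn.Module):
    """Project two modalities (e.g., image and audio features) into one shared space."""
    def __init__(self, img_dim=2048, audio_dim=1024, shared_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)

    def forward(self, img_feats, audio_feats):
        # Unit-normalize so that cosine similarity becomes a simple dot product.
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_aud = F.normalize(self.audio_proj(audio_feats), dim=-1)
        return z_img, z_aud

def contrastive_loss(z_img, z_aud, temperature=0.07):
    # Paired image/audio examples should land close together in the shared space;
    # mismatched pairs in the batch act as negatives and are pushed apart.
    logits = z_img @ z_aud.t() / temperature
    targets = torch.arange(z_img.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

In this setup, "a car" seen in an image and "a car driving by" heard in audio end up near each other in the shared space, which is one common way of operationalizing the alignment the researchers describe.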


AI May Be Catching up With Human Reasoning

#artificialintelligence

A new technique that measures the reasoning power of artificial intelligence (AI) shows that machines are catching up to humans in their abilities to think, experts say. Researchers at MIT and IBM Research have created a method that enables a user to rank examples of a machine-learning model's behavior by how closely its reasoning aligns with a person's. Their technique, called Shared Interest, incorporates metrics that compare how well a model's thinking matches people's. "Today, AI is capable of reaching (and, in some cases, exceeding) human performance in specific tasks, including image recognition and language understanding," Pieter Buteneers, director of engineering in machine learning and AI at the communications company Sinch, told Lifewire in an email interview. "With natural language processing (NLP), AI systems can interpret, write and speak languages as well as humans, and the AI can even adjust its dialect and tone to align with its human peers."
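One way to picture what a Shared Interest-style comparison might compute is an overlap score between the image regions a saliency method highlights and the regions a human annotates as relevant. The short Python sketch below uses intersection-over-union with a hypothetical binarization threshold; the paper's actual metrics and implementation may differ.

import numpy as np

def shared_interest_iou(saliency_map, human_mask, threshold=0.5):
    """Overlap between model-salient pixels and human-annotated pixels (IoU)."""
    model_mask = saliency_map >= threshold            # binarize the saliency map
    intersection = np.logical_and(model_mask, human_mask).sum()
    union = np.logical_or(model_mask, human_mask).sum()
    return intersection / union if union > 0 else 0.0

A score near 1 would mean the model is attending to roughly the same evidence a person would point to; a score near 0 would flag examples where the model's "reasoning" diverges from human expectations.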


Learning from videos to understand the world

#artificialintelligence

Today, we're announcing a project called Learning from Videos, designed to automatically learn audio, textual, and visual representations from the data in publicly available videos uploaded to Facebook. By learning from videos spanning nearly every country and hundreds of languages, this project will not just help us continuously improve our core AI systems for applications like content recommendation and policy enforcement -- it will enable entirely new experiences. This is also part of our broader efforts toward building machines that learn like humans do -- from any example, not just ones that experts have labeled. The first application is now live in Instagram Reels' recommendation system. Continuously learning from the world around us is one of the hallmarks of human intelligence.


Toward a machine learning model that can reason about everyday actions

#artificialintelligence

The ability to reason abstractly about events as they unfold is a defining feature of human intelligence. We know instinctively that crying and writing are means of communicating, and that a panda falling from a tree and a plane landing are variations on descending. Organizing the world into abstract categories does not come easily to computers, but in recent years researchers have inched closer by training machine learning models on words and images infused with structural information about the world, and how objects, animals, and actions relate. In a new study at the European Conference on Computer Vision this month, researchers unveiled a hybrid language-vision model that can compare and contrast a set of dynamic events captured on video to tease out the high-level concepts connecting them. Their model did as well as or better than humans at two types of visual reasoning tasks--picking the video that conceptually best completes the set, and picking the video that doesn't fit.
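As a rough illustration of the odd-one-out task, the sketch below assumes each video has already been encoded into a vector by some language-vision model, and simply flags the clip whose average cosine similarity to the rest of the set is lowest. The embeddings and this selection rule are hypothetical stand-ins for the researchers' actual model.

import numpy as np

def odd_one_out(video_embeddings):
    """Return the index of the video least similar, on average, to the rest of the set."""
    v = np.asarray(video_embeddings, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)   # unit-normalize each embedding
    sims = v @ v.T                                     # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)                        # ignore self-similarity
    avg_sim = sims.sum(axis=1) / (len(v) - 1)          # mean similarity to the others
    return int(np.argmin(avg_sim))

Under this framing, a set of clips that all depict "descending" would cluster together, and a clip of, say, someone writing a letter would score the lowest average similarity and be picked as the one that doesn't fit.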

