The group affect or emotion in an image of people can be inferred by extracting features about both the people in the picture and the overall makeup of the scene. The state of the art on this problem combines facial features, scene extraction, and even audio tonality. This paper combines three additional modalities, namely human pose, text-based tagging, and CNN-extracted features/predictions. To the best of our knowledge, this is the first time all of these modalities have been extracted using deep neural networks. We evaluate the performance of our approach against baselines and identify insights throughout this paper.
Zitnick, C. Lawrence (Facebook AI Research) | Agrawal, Aishwarya (Virginia Tech) | Antol, Stanislaw (Virginia Tech) | Mitchell, Margaret (Microsoft Research) | Batra, Dhruv (Virginia Tech) | Parikh, Devi (Virginia Tech)
As machines have become more intelligent, there has been a renewed interest in methods for measuring their intelligence. A common approach is to propose tasks on which humans excel but machines find difficult. However, an ideal task should also be easy to evaluate and not be easily gameable. We begin with a case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence. An alternative and more promising task is Visual Question Answering, which tests a machine's ability to reason about language and vision. We describe a dataset of unprecedented size created for this task, containing over 760,000 human-generated questions about images. Using around 10 million human-generated answers, machines can be easily evaluated.
DeepMind's artificial intelligence programme AlphaZero is now showing signs of human-like intuition and creativity, in what developers have hailed as a 'turning point' in history. The computer system amazed the world last year when it mastered the game of chess from scratch within just four hours, despite not being programmed how to win. But now, after a year of testing and analysis by chess grandmasters, the machine has developed a new style of play unlike anything ever seen before, suggesting the programme is now improvising like a human. Unlike the world's best chess machine, Stockfish, which calculates millions of possible outcomes as it plays, AlphaZero learns from its past successes and failures, making its moves based on a 'nebulous sense that it is all going to work out in the long run', according to experts at DeepMind. When AlphaZero was pitted against Stockfish in 1,000 games, it lost just six, won convincingly 155 times, and drew the remaining bouts.
Many think we'll see human-level artificial intelligence in the next 10 years. Industry continues to tout smarter tech, like personalized assistants or self-driving cars. And in computer science, new and powerful tools embolden researchers to assert that we are nearing the goal in the quest for human-level artificial intelligence. Despite the hype, and despite real progress, we are far from machines that think like you and me. Last year Google unveiled Duplex, a Pixel smartphone assistant that can call and make reservations for you.