AI researchers are interested in building intelligent machines that can interact with them as they interact with each other. Science fiction writers have given us these goals in the form of HAL in 2001: A Space Odyssey and Commander Data in Star Trek: The Next Generation. However, at present, our computers are deaf, dumb, and blind, almost unaware of the environment they are in and of the user who interacts with them. In this article, I present the current state of the art in machines that can see people, recognize them, determine their gaze, understand their facial expressions and hand gestures, and interpret their activities. I believe that building machines with such perceptual abilities will take us one step closer to building HAL and Commander Data.
Using software to parse the world's visual content is as big a revolution in computing as mobile was 10 years ago, and will provide a major edge for developers and businesses building new products. While these types of algorithms have been around in various forms since the 1960s, recent advances in machine learning, together with leaps forward in data storage, computing capabilities, and cheap high-quality input devices, have driven major improvements in how well our software can interpret this kind of content. Computer vision is the broad parent name for any computation involving visual content – that means images, videos, icons, and anything else with pixels involved. A classical application of computer vision is handwriting recognition for digitizing handwritten content (we'll explore more use cases below). Any other application that involves understanding pixels through software can safely be labeled as computer vision.
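At its simplest, "understanding pixels through software" just means treating an image as an array of numbers and computing something about it. The sketch below uses a small hypothetical 4×4 grayscale image (invented for illustration) and performs the most basic vision operation there is: thresholding pixels into dark and bright regions.

```python
import numpy as np

# Hypothetical 4x4 grayscale image: each entry is a pixel intensity (0-255).
image = np.array([
    [ 10,  12, 200, 210],
    [  9,  15, 205, 199],
    [ 11,  13, 198, 202],
    [  8,  14, 201, 207],
], dtype=np.uint8)

# The simplest possible "vision": split pixels into dark vs. bright regions.
mask = image > 128          # boolean array marking bright pixels
bright_pixels = int(mask.sum())
print(bright_pixels)        # count of bright pixels in the image
```

Real computer vision systems build far richer descriptions than a brightness mask, but they all start from this same representation: pixels as numbers.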
Before a classification algorithm can do its magic, we need to train it by showing it thousands of cat and non-cat images. A general principle in machine learning is to treat feature vectors as points in a high-dimensional space. The algorithm then tries to find planes or surfaces (decision boundaries) that partition this space so that all examples from a particular class fall on one side of the plane or surface. One powerful way to build such a predictive model is a neural network: a system of hardware and software, loosely inspired by the brain, that estimates functions depending on a huge number of unknown inputs.
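The idea of finding a separating plane in feature space can be sketched with a classic perceptron. The toy data below is invented for illustration: two well-separated clusters of 2-D feature vectors stand in for "cat" and "non-cat" examples, and the learning rule nudges a plane w·x + b = 0 toward any misclassified point.

```python
import numpy as np

# Hypothetical toy data: 2-D feature vectors for two classes
# ("cat" = 1, "non-cat" = 0), drawn around two separated centers.
rng = np.random.default_rng(0)
cats = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
non_cats = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))

X = np.vstack([cats, non_cats])
y = np.array([1] * 50 + [0] * 50)

# Perceptron: learn a separating plane w.x + b = 0.
w = np.zeros(2)
b = 0.0
for _ in range(20):                          # a few passes over the data
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        w += (yi - pred) * xi                # push the plane toward mistakes
        b += (yi - pred)

accuracy = np.mean([(1 if xi @ w + b > 0 else 0) == yi
                    for xi, yi in zip(X, y)])
print(accuracy)
```

Because the two clusters are linearly separable, a single plane suffices here; for real image classes such as cats vs. non-cats, no single plane works, which is why neural networks stack many such decision surfaces with nonlinearities in between.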
An introduction to the field of computer vision and image recognition, and how deep learning is fueling the fire of this hot topic. Computer vision is an interdisciplinary field that focuses on how machines can emulate the way the human brain and eyes work together to visually process the world. Research on computer vision can be traced back to the 1960s. The 1970s saw the foundations of today's computer vision algorithms being laid: the shift from basic digital image processing to understanding the 3D structure of scenes, edge extraction, and line labelling. Over the years, computer vision has developed many applications; 3D imaging, facial recognition, autonomous driving, drone technology and medical diagnostics to name a few.
Machine learning and computer vision methods have recently received a lot of attention, in particular when it comes to data analytics. The success of deep neural networks that help cars drive autonomously and let smartphones recognize speech and translate text attests to the value of machine learning methods for tackling complex real-world problems. A further prominent example is Google's AlphaGo AI, which defeated the world champion Lee Sedol at Go. This is remarkable in particular since Go had previously been considered one of the most complex games due to its enormous number of game states. As the amount of data collected by cameras and scientific instruments continues to rise, automated analysis methods will become ever more important.