In recent years, deep learning has revolutionized computer vision, speech recognition, and natural language processing. Despite breakthroughs in all three fields, a common barrier to training neural networks on real-world problems remains the amount of labeled training data required. In some domains, such as video understanding, gathering real-world data can be prohibitively expensive and time-consuming in the absence of innovative solutions.
Hou, Jingyi (Beijing Institute of Technology) | Wu, Xinxiao (Beijing Institute of Technology) | Chen, Jin (Beijing Institute of Technology) | Luo, Jiebo (University of Rochester) | Jia, Yunde (Beijing Institute of Technology)
Current deep learning methods for action recognition rely heavily on large-scale labeled video datasets. Manually annotating video datasets is laborious and may introduce unexpected bias into the complex deep models trained to learn video representations. In this paper, we propose an unsupervised deep learning method that employs unlabeled local spatial-temporal volumes extracted from action videos to learn mid-level video representations for action recognition. Specifically, our method simultaneously discovers mid-level semantic concepts by discriminative clustering and optimizes local spatial-temporal features with two relatively small and simple deep neural networks. The clustering generates semantic visual concepts that guide the training of the deep networks, and the networks in turn guarantee the robustness of the semantic concepts. Experiments on the HMDB51 and UCF101 datasets demonstrate the superiority of the proposed method, even over several supervised learning methods.
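The alternating scheme described above — clustering assigns pseudo-labels to unlabeled features, a network is trained against those labels, and the network's outputs then refine the concepts — can be sketched minimally. This is not the paper's actual architecture: the toy features, the NumPy k-means, and the single softmax layer standing in for the two deep networks are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=10):
    # plain k-means as a stand-in for the paper's discriminative clustering
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def train_softmax(X, y, k, lr=0.5, steps=200):
    # a single softmax layer stands in for the small deep networks;
    # it is trained on the cluster assignments as pseudo-labels
    W = np.zeros((X.shape[1], k))
    Y = np.eye(k)[y]
    for _ in range(steps):
        logits = X @ W
        p = np.exp(logits - logits.max(1, keepdims=True))
        p /= p.sum(1, keepdims=True)
        W -= lr * X.T @ (p - Y) / len(X)
    return W

# toy "local spatial-temporal features" (hypothetical stand-in data)
X = rng.normal(size=(200, 16))
k = 4                                   # number of mid-level concepts
labels = kmeans(X, k)                   # step 1: discover concepts
for _ in range(3):                      # alternate optimization
    W = train_softmax(X, labels, k)     # concepts supervise the network...
    labels = (X @ W).argmax(1)          # ...network scores refine the concepts
```

The loop captures the mutual reinforcement the abstract describes: each side of the alternation provides the supervision signal for the other, so no manual labels are needed.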
Learning-based Adaptive Bit Rate (ABR) methods, which aim to learn strong strategies without any presumptions, have become a research hotspot for adaptive streaming. However, they still suffer from several issues, namely low sample efficiency and a lack of awareness of video quality information. In this paper, we propose Comyco, a video quality-aware ABR approach that substantially improves on learning-based methods by tackling these issues. Comyco trains its policy by imitating expert trajectories given by an instant solver, which not only avoids redundant exploration but also makes better use of the collected samples. Meanwhile, Comyco attempts to pick the chunk with higher perceptual video quality rather than higher video bitrate. To achieve this, we construct Comyco's neural network architecture, video datasets, and QoE metrics around video quality features. Using trace-driven and real-world experiments, we demonstrate significant improvements in Comyco's sample efficiency over prior work: a 1700x reduction in the number of samples required and a 16x reduction in training time. Moreover, results show that Comyco outperforms previously proposed methods, improving average QoE by 7.5%-16.79%. In particular, Comyco surpasses the state-of-the-art approach Pensieve by 7.37% in average video quality under the same rebuffering time.
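The core training idea — an instant solver acts as the expert, and the policy is trained by behavior cloning on the expert's chunk selections under a quality-aware QoE objective — can be sketched as follows. Everything here is a simplified assumption, not Comyco's actual design: the bitrate ladder, the one-step greedy "solver", the QoE weights, and the linear softmax policy are all hypothetical stand-ins for the paper's neural network and QoE metric.

```python
import numpy as np

rng = np.random.default_rng(1)
BITRATES = np.array([300, 750, 1200, 1850, 2850, 4300])  # kbps ladder (hypothetical)

def qoe(quality, rebuffer, smooth_diff):
    # simplified quality-aware QoE: reward perceptual quality,
    # penalize rebuffering stalls and quality switches (weights assumed)
    return quality - 4.0 * rebuffer - 0.5 * smooth_diff

def expert_action(state):
    # toy "instant solver": greedily pick the action maximizing one-step QoE
    throughput, buffer, last_q = state
    best, best_v = 0, -1e9
    for a, br in enumerate(BITRATES):
        quality = np.log(br / BITRATES[0])        # proxy for perceptual quality
        download = br / max(throughput, 1e-6)     # time to fetch a 1-second chunk
        rebuffer = max(0.0, download - buffer)
        v = qoe(quality, rebuffer, abs(quality - last_q))
        if v > best_v:
            best, best_v = a, v
    return best

# collect expert demonstrations over random (throughput, buffer, last-quality) states
states = np.column_stack([rng.uniform(500, 5000, 500),   # throughput, kbps
                          rng.uniform(0, 10, 500),       # buffer level, s
                          rng.uniform(0, 2.7, 500)])     # last chunk quality
labels = np.array([expert_action(s) for s in states])

# behavior cloning: fit a softmax policy with cross-entropy on expert labels
Xn = (states - states.mean(0)) / states.std(0)
W = np.zeros((3, len(BITRATES)))
Y = np.eye(len(BITRATES))[labels]
for _ in range(500):
    p = np.exp(Xn @ W)
    p /= p.sum(1, keepdims=True)
    W -= 0.5 * Xn.T @ (p - Y) / len(Xn)
acc = ((Xn @ W).argmax(1) == labels).mean()
```

Because every state comes paired with an expert label, each collected sample contributes a dense supervised signal, which is why imitation needs far fewer samples than reward-driven exploration.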
Video viewability is a top priority for video publishers, who are under pressure to verify that their audience is actually watching advertisers' content. In a previous post, How Deep Learning Video Sequence Drives Profits, we demonstrated why image sequences draw consumer attention. Advanced technologies such as deep learning are increasing video viewability by identifying and learning which images make people stick with content. This content intelligence is the foundation for advancing video machine learning and improving overall video performance. In this post, we will explore some challenges in viewability and how deep learning is boosting video watch rates.
Researchers at Google's DeepMind unit have developed an artificial intelligence (AI) system that teaches itself to recognize a range of visual and audio concepts by watching short video clips. The project brings the field one step closer to the goal of creating AI that can teach itself by watching and listening to the world around it. Instead of relying on human-labeled datasets, the new algorithm learns to recognize images and sounds by matching what it sees with what it hears.