Media
SoundNet: Learning Sound Representations from Unlabeled Video
Aytar, Yusuf, Vondrick, Carl, Torralba, Antonio
We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge from well-established visual recognition models into the sound modality using unlabeled video as a bridge. Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
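The student-teacher idea above can be illustrated with a small sketch. It assumes the student (sound) network is trained to match the teacher (vision) network's output distribution by minimizing a KL divergence; the logits and the three categories below are made up for illustration and do not come from the paper.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student): zero when the student matches the teacher."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )

# Hypothetical outputs for one video clip over three made-up visual categories.
teacher_probs = softmax([2.0, 0.5, -1.0])  # vision network applied to the frames
student_probs = softmax([0.1, 0.2, 0.0])   # sound network applied to the audio track

# The training signal: how far the student's prediction is from the teacher's.
loss = kl_divergence(teacher_probs, student_probs)
```

No ground-truth label appears anywhere in this loss: the teacher's prediction on the video frames is the only supervision the sound network receives.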
Optimal spectral transportation with application to music transcription
Flamary, Rémi, Févotte, Cédric, Courty, Nicolas, Emiya, Valentin
Many spectral unmixing methods rely on the non-negative decomposition of spectral data onto a dictionary of spectral templates. In particular, state-of-the-art music transcription systems decompose the spectrogram of the input signal onto a dictionary of representative note spectra. The typical measures of fit used to quantify the adequacy of the decomposition compare the data and template entries frequency-wise. As such, small displacements of energy from one frequency bin to another, as well as variations of timbre, can disproportionately harm the fit. We address these issues by means of optimal transportation and propose a new measure of fit that treats the frequency distributions of energy holistically as opposed to frequency-wise. Building on the harmonic nature of sound, the new measure is invariant to shifts of energy to harmonically-related frequencies, as well as to small and local displacements of energy. Equipped with this new measure of fit, the dictionary of note templates can be considerably simplified to a set of Dirac vectors located at the target fundamental frequencies (musical pitch values). This in turn gives ground to a very fast and simple decomposition algorithm that achieves state-of-the-art performance on real musical data.
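A toy sketch of why Dirac templates make the decomposition simple. It assumes an unregularized transport problem, so each frequency bin can send all of its energy to the pitch that is cheapest to reach. The cost function (squared distance to the nearest harmonic of the pitch) and the function names are illustrative assumptions, not the authors' exact formulation.

```python
def harmonic_cost(freq, pitch, max_harmonic=5):
    """Squared distance from freq to the nearest harmonic q * pitch (assumed cost)."""
    return min((freq - q * pitch) ** 2 for q in range(1, max_harmonic + 1))

def transcribe_frame(freqs, energies, pitches, max_harmonic=5):
    """Send each bin's energy to its cheapest pitch; return one activation per pitch."""
    activations = [0.0] * len(pitches)
    for freq, energy in zip(freqs, energies):
        costs = [harmonic_cost(freq, p, max_harmonic) for p in pitches]
        best = costs.index(min(costs))  # index of the cheapest target pitch
        activations[best] += energy
    return activations

# Toy frame: slightly detuned energy near 220 Hz, its second harmonic, and 330 Hz.
freqs = [221.0, 439.0, 331.0]
energies = [0.5, 0.3, 0.2]
pitches = [220.0, 330.0]  # candidate fundamental frequencies

activations = transcribe_frame(freqs, energies, pitches)
```

In this toy frame the detuned bins at 221 Hz and 439 Hz are both credited to the 220 Hz pitch, which is the kind of invariance to small displacements and harmonic shifts the abstract describes. A frequency-wise comparison would instead penalize the 1 Hz offsets.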
Exponential Family Embeddings
Rudolph, Maja, Ruiz, Francisco, Mandt, Stephan, Blei, David
Word embeddings are a powerful approach to capturing semantic similarity among terms in a vocabulary. In this paper, we develop exponential family embeddings, which extend the idea of word embeddings to other types of high-dimensional data. As examples, we studied several types of data: neural data with real-valued observations, count data from a market basket analysis, and ratings data from a movie recommendation system. The main idea is that each observation is modeled conditioned on a set of latent embeddings and other observations, called the context, where the way the context is defined depends on the problem. In language the context is the surrounding words; in neuroscience the context is close-by neurons; in market basket data the context is other items in the shopping cart. Each instance of an embedding defines the context, the exponential family of conditional distributions, and how the embedding vectors are shared across data. We infer the embeddings with stochastic gradient descent, with an algorithm that connects closely to generalized linear models. On all three of our applications (neural activity of zebrafish, users' shopping behavior, and movie ratings) we found that exponential family embedding models are more effective than other dimension reduction methods. They better reconstruct held-out data and find interesting qualitative structure.
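A minimal sketch of the conditional-modeling idea, under stated assumptions: each observation has an embedding vector (called `rho_i` here), each context item has a context vector (`alpha_j`), and the parameter of the conditional distribution is a function of their inner product. The names, the identity link for Gaussian data, and the exponential link for count data are illustrative choices, not details taken from the abstract.

```python
import math

def context_score(rho_i, context_vectors, context_values):
    """Inner product of rho_i with the value-weighted sum of the context vectors."""
    combined = [0.0] * len(rho_i)
    for alpha_j, x_j in zip(context_vectors, context_values):
        for d in range(len(rho_i)):
            combined[d] += alpha_j[d] * x_j
    return sum(r * c for r, c in zip(rho_i, combined))

def gaussian_mean(score):
    """Real-valued data (e.g. neural activity): identity link, assumed."""
    return score

def poisson_rate(score):
    """Count data (e.g. items in a basket): exponential link keeps the rate positive."""
    return math.exp(score)

# Hypothetical two-dimensional embeddings and a context of two other observations.
rho_i = [1.0, 0.5]
context_vectors = [[0.2, 0.0], [0.0, 0.4]]
context_values = [2.0, 1.0]

score = context_score(rho_i, context_vectors, context_values)
mean = gaussian_mean(score)
rate = poisson_rate(score)
```

Changing the data type only changes the link and the conditional distribution; the embedding and context vectors play the same role throughout, which is what connects the approach to generalized linear models.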
Collaborative Recurrent Autoencoder: Recommend while Learning to Fill in the Blanks
Wang, Hao, Shi, Xingjian, Yeung, Dit-Yan
Hybrid methods that utilize both content and rating information are commonly used in many recommender systems. However, most of them use either handcrafted features or the bag-of-words representation as a surrogate for the content information, and neither is effective or natural enough. To address this problem, we develop a collaborative recurrent autoencoder (CRAE) which is a denoising recurrent autoencoder (DRAE) that models the generation of content sequences in the collaborative filtering (CF) setting. The model generalizes recent advances in recurrent deep learning from i.i.d. input to non-i.i.d. (CF-based) input and provides a new denoising scheme along with a novel learnable pooling scheme for the recurrent autoencoder. To do this, we first develop a hierarchical Bayesian model for the DRAE and then generalize it to the CF setting. The synergy between denoising and CF enables CRAE to make accurate recommendations while learning to fill in the blanks in sequences. Experiments on real-world datasets from different domains (CiteULike and Netflix) show that, by jointly modeling the order-aware generation of sequences for the content information and performing CF for the ratings, CRAE is able to significantly outperform the state of the art on both the recommendation task based on ratings and the sequence generation task based on content information.
Could a machine predict box office flops?
ScriptBook is an algorithm that claims it can predict whether a movie is going to be a hit or a miss at the box office. Inspired by the dismal performance of "Gigli" starring Ben Affleck and Jennifer Lopez at the turn of the century, ScriptBook's CEO Nadira Azermai and her team spent over a year creating the machine-learning platform that claims to be able to predict the relative success of a film using its script. With the data from 4,000 scripts and 10,000 movies, ScriptBook uses 220 parameters to automatically read and score scripts, and then produce an estimated financial performance calculation. If ScriptBook is proved to work, even partially, the impact on the movie industry could be substantial. According to Forbes, losses from flops in 2016 come out at around $100.9 million.
Artificial intelligence takes on machine reading, Christmas carols and eye disease - Weekend Reading: Dec. 30 edition - The Official Microsoft Blog
Artificial intelligence (AI) made incredible strides in 2016, and the growth appears set to accelerate as we enter the New Year. A team of Microsoft researchers has released a dataset of 100,000 questions and answers that other AI researchers can use, for free, in their quest to create systems that can read and answer questions as well as a human. The MS MARCO dataset is based on anonymized real-world data from Bing and Cortana queries and is part of an attempt to spur the breakthroughs in machine reading that are already happening in image and speech recognition. The move is also aimed at facilitating advances toward "artificial general intelligence," or machines that can think like humans and can read and understand a document as well as a person. Meanwhile, AI helped a musician in Norway sing a new tune for the holidays this year: a Christmas carol that was created by Microsoft's AI technology.
How to produce sounds in Python, R, Java, C, Perl, Javascript or even Linux?
I want to create music generated by mathematical algorithms, or even turn big data files into sound files, just like NASA turned electromagnetic signals from space into music. Producing artificially generated music is a popular subject; see for instance Composing Music With Recurrent Neural Networks, or Using Machine Learning to Generate Music. My question is how to access my laptop's speaker from a script written in Python or Perl. I used to do it long ago in C, using the sound command available in the Borland package. Today I tried various system calls from within Perl, or directly from the command line, to no avail.
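One portable route, sketched here with only the Python standard library, is to generate the samples mathematically and write them to a WAV file, then play that file with whatever audio player the system provides. The function names, the 440 Hz tone, and the 0.25 second duration are illustrative choices. This sketch does not drive the speaker directly, since playback is platform-specific.

```python
import io
import math
import struct
import wave

def sine_wave_frames(freq_hz, duration_s, sample_rate=44100, amplitude=0.5):
    """Return 16-bit mono PCM bytes for a sine tone."""
    n_samples = int(duration_s * sample_rate)
    peak = int(amplitude * 32767)  # scale to the signed 16-bit range
    samples = (
        int(peak * math.sin(2.0 * math.pi * freq_hz * n / sample_rate))
        for n in range(n_samples)
    )
    return b"".join(struct.pack("<h", s) for s in samples)

def write_wav(target, frames, sample_rate=44100):
    """Write mono 16-bit PCM frames to a file path or a binary file object."""
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16 bit
        wav.setframerate(sample_rate)
        wav.writeframes(frames)

# Generate a quarter second of A4 and write it to an in-memory buffer.
frames = sine_wave_frames(440.0, 0.25)
buffer = io.BytesIO()
write_wav(buffer, frames)
wav_bytes = buffer.getvalue()
```

To produce a file on disk instead, pass a path such as `write_wav("tone.wav", frames)` and open the result in any audio player. The same two functions work for data sonification: replace the sine formula with any mapping from data values to samples in the range -32767 to 32767.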
Google Pixel Tips: Review How To Make The Most Out Of Your New Pixel XL Phone
Google's new Pixel and Pixel XL phones come with some pretty awesome features. These include Google's new smart Assistant, unlimited cloud storage for photos and videos, and one of the highest-rated smartphone cameras. Here's a guide on how to make the most out of your new Google Pixel phone. Google Assistant is one of the best features of the new Pixel phones. You can use it on Google's new messaging app Allo and on the Google Home speaker.
These Were The Best Machine Learning Breakthroughs Of 2016
What were the main advances in machine learning/artificial intelligence in 2016? Everyone now seems to be doing machine learning, and if they are not, they are thinking of buying a startup to claim they do. Now, to be fair, there are reasons for much of that "hype". Can you believe that it has been only a year since Google announced they were open sourcing TensorFlow? TF is already a very active project that is being used for anything ranging from drug discovery to generating music.
Star Wars Technology: How Artificial Limbs Work [Infographic]
In the Star Wars movies, doctors and medical droids are able to replace limbs with artificial prosthetics that are as good as or even better than the original limb. That is a good thing, because in a place that is packed with lightsabers it's not surprising that a lot of people lose their limbs. To me it seems like picking up a saber the wrong way can result in the loss of a hand or leg. The most notable lost limbs belong to the Skywalkers. Luke's hand was cut off by Vader at the end of The Empire Strikes Back and is probably the most famous.