Do you hear the birds chirping outside your window? There are more than 10,000 bird species in the world, and they can be found in nearly every environment, from untouched rainforests to suburbs and cities. Birds play an essential role in nature. Because they sit high in the food chain, they integrate changes occurring at lower trophic levels. As such, birds are excellent indicators of deteriorating habitat quality and environmental pollution.
The fictional character of Dr. Dolittle has captured the imagination of millions of children with his ability to talk to animals – and now the idea of using technology to listen to and better understand animals is capturing the imagination of AI experts around the world. For example, AI language-analysis technology is being used to decode the sounds of bottlenose dolphins and compile a dictionary of dolphin language. This work is taking place on a global scale, across a vast variety of species. Researchers are using technology to gather data that could address some of the biggest environmental challenges of our time, and some of them are supported by grants from Microsoft's AI for Earth program. Here is a snapshot of some of the projects underway – and what they hope to achieve.
One of the most immediately striking features of Bernie Krause is his glasses. They're big--not soda-bottle thick, but unusually large--and they draw attention to his eyes. That is ironic, since Krause's life has been devoted to what he hears, but also fitting, since it was the weakness of his eyes that compelled him to engage with sound: first with music, and later with the music of nature. Nearsighted and astigmatic, Krause has spent most of the last half-century recording biological symphonies to which most of us are deaf. Even more than Krause sees, he listens.
Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, classification, and quantification of animal vocalizations as individual acoustic events. Yet, variability in ambient noise, both over time and across sensors, hinders the reliability of current automated systems for sound event detection (SED), such as convolutional neural networks (CNNs) in the time-frequency domain. In this article, we develop, benchmark, and combine several machine listening techniques to improve the generalizability of SED models across heterogeneous acoustic environments. As a case study, we consider the problem of detecting avian flight calls from a ten-hour recording of nocturnal bird migration, recorded by a network of six ARUs in the presence of heterogeneous background noise. Starting from a CNN yielding state-of-the-art accuracy on this task, we introduce two noise adaptation techniques, respectively integrating short-term (60-millisecond) and long-term (30-minute) context. First, we apply per-channel energy normalization (PCEN) in the time-frequency domain, which performs short-term automatic gain control on every subband in the mel-frequency spectrogram. Second, we replace the last dense layer in the network with a context-adaptive neural network (CA-NN) layer, i.e., an affine layer whose weights are dynamically adapted at prediction time by an auxiliary network taking long-term summary statistics of spectrotemporal features as input. We show that both techniques are helpful and complementary. [...] We release a pre-trained version of our best performing system under the name of BirdVoxDetect, a ready-to-use detector of avian flight calls in field recordings.
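The per-subband gain control that PCEN performs can be sketched in plain NumPy. The following is a minimal illustration, not the implementation used in BirdVoxDetect: the constants `s`, `alpha`, `delta`, and `r` are common defaults from the PCEN literature, and `S` is assumed to be a non-negative mel spectrogram array of shape (bands, frames).

```python
import numpy as np

def pcen(S, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a mel spectrogram S (bands x frames).

    A first-order IIR filter tracks a smoothed energy M for each subband;
    dividing S by M**alpha acts as short-term automatic gain control, and
    the (. + delta)**r - delta**r stage compresses the dynamic range.
    """
    M = np.empty_like(S, dtype=float)
    M[:, 0] = S[:, 0]
    for t in range(1, S.shape[1]):
        # Exponential moving average along time, per subband.
        M[:, t] = (1.0 - s) * M[:, t - 1] + s * S[:, t]
    return (S / (eps + M) ** alpha + delta) ** r - delta ** r
```

Because a slowly varying background raises `S` and `M` together, its contribution largely cancels in the ratio, which is what makes the representation more robust to sensor-to-sensor differences in ambient noise.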
Bird sounds possess a distinctive spectral structure that may exhibit small shifts in spectrum depending on the bird species and environmental conditions. In this paper, we propose using convolutional recurrent neural networks for the task of automated bird audio detection in real-life environments. In the proposed method, convolutional layers extract high-dimensional, locally frequency-shift-invariant features, while recurrent layers capture longer-term dependencies between the features extracted from short time frames. The method achieves an 88.5% Area Under ROC Curve (AUC) score on the unseen evaluation data and took second place in the Bird Audio Detection challenge.
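The AUC metric reported above has a simple interpretation: it is the probability that a randomly chosen positive clip receives a higher detector score than a randomly chosen negative clip. A minimal NumPy sketch of that rank-statistic view follows; the function name and the example scores are illustrative, not taken from the challenge data.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-statistic identity:
    AUC = P(score of a random positive > score of a random negative),
    counting ties as one half."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]     # scores of clips that do contain a bird
    neg = scores[~labels]    # scores of clips that do not
    # Compare every positive score against every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A perfect detector scores 1.0 and chance performance scores 0.5, which is why a threshold-free metric like this suits a challenge where entrants tune their own operating points.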