Traditional techniques for monitoring wildlife populations are limited in both time and space. As an alternative, tools for processing and recognizing acoustic signals can be used to extract information about the current state of the environment quickly and accurately. A number of studies have addressed the automatic classification of species from their vocalizations. In many of them, however, the segmentation applied in the preprocessing stage either requires human effort or is described in too little detail to be reproduced, and may therefore be infeasible under real conditions. This paper focuses on the extraction of local information units, called instances, from audio recordings. Instances are extracted by segmenting spectrograms with image processing techniques, and the threshold that the segmentation requires is estimated with Otsu's method. The multiple instance classification (MIC) approach is then used to recognize the sound units. Experiments were carried out on a public data set. The proposed unsupervised segmentation method has a practical advantage over the supervised method it is compared against, which requires training on manually segmented spectrograms. Results show no significant difference between the proposed method and its baseline. The proposed approach therefore makes it feasible to design an automatic recognition system that requires only labeled audio recordings as training information.
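To make the instance-extraction idea concrete, the following is a minimal, dependency-free sketch rather than the method used in the paper. It assumes that the spectrogram is already available as a 2-D grid of energies (rows are frequency bins, columns are time frames), that Otsu's method is applied to a histogram of all cell energies, and that each instance is a 4-connected region of cells above the threshold. The abstract does not specify the image processing steps, so the connected-component step and the names `otsu_threshold`, `extract_instances`, and `toy_spectrogram` are illustrative assumptions.

```python
def otsu_threshold(values, bins=64):
    """Otsu's method: pick the histogram cut that maximizes between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo  # flat input: no meaningful split exists
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    sum_all = sum(c * h for c, h in zip(centers, hist))
    best_var, best_t = -1.0, lo
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]                    # cell count in the lower class
        sum0 += centers[i] * hist[i]
        w1 = total - w0                  # cell count in the upper class
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, lo + (i + 1) * width
    return best_t


def extract_instances(spec):
    """Binarize a spectrogram with Otsu's threshold and return
    (threshold, boxes); each box is (f_min, f_max, t_min, t_max)
    for one 4-connected region of above-threshold cells."""
    threshold = otsu_threshold([v for row in spec for v in row])
    n_f, n_t = len(spec), len(spec[0])
    mask = [[spec[f][t] > threshold for t in range(n_t)] for f in range(n_f)]
    seen = [[False] * n_t for _ in range(n_f)]
    boxes = []
    for f in range(n_f):
        for t in range(n_t):
            if not mask[f][t] or seen[f][t]:
                continue
            # Flood fill one region, tracking its bounding box.
            stack, seen[f][t] = [(f, t)], True
            f_min = f_max = f
            t_min = t_max = t
            while stack:
                a, b = stack.pop()
                f_min, f_max = min(f_min, a), max(f_max, a)
                t_min, t_max = min(t_min, b), max(t_max, b)
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    x, y = a + da, b + db
                    if 0 <= x < n_f and 0 <= y < n_t and mask[x][y] and not seen[x][y]:
                        seen[x][y] = True
                        stack.append((x, y))
            boxes.append((f_min, f_max, t_min, t_max))
    return threshold, boxes


def toy_spectrogram():
    """Synthetic 6x10 grid: quiet background with two loud blobs (made-up data)."""
    spec = [[1.0] * 10 for _ in range(6)]
    for f in (1, 2):
        for t in (1, 2, 3):
            spec[f][t] = 10.0
    for f in (3, 4):
        for t in (6, 7, 8):
            spec[f][t] = 10.0
    return spec
```

Calling `extract_instances(toy_spectrogram())` yields one bounding box per blob. In a setting like the one described here, each box would then be converted into a feature vector, and the set of vectors from one recording would form the bag handed to the multiple instance classifier. That feature step is outside the scope of this sketch.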