Introduction
Bioacoustics, the scientific exploration of animal sounds, provides an invaluable lens into understanding animal behavior and serves as a crucial tool for biodiversity monitoring (Marler & Slabbekoorn, 2004; Laiolo, 2010; Marques et al., 2012; Brown & Riede, 2017). For years, computational methods, including signal processing, data mining, and machine learning, have been instrumental in advancing bioacoustics (Towsey et al., 2012; Ganchev, 2017). Among these, deep learning (DL) has recently emerged as a transformative force across numerous computational disciplines. Originally spurred by artificial intelligence (AI) aspirations and refined for image and text processing, DL innovations have permeated diverse fields (LeCun, Bengio & Hinton, 2015; Goodfellow, Bengio & Courville, 2016). This revolution extends to audio domains, significantly impacting automatic speech recognition and music informatics (Abeßer, 2020; Manilow, Seetharman & Salamon, 2020).
Now, computational bioacoustics is also harnessing the power of DL to tackle and automate complex challenges that were once deemed insurmountable. This advancement is both driven and necessitated by the exponential growth of data in the 21st century. The widespread availability and affordability of digital recording devices, data storage, and sharing have made large-scale bioacoustic monitoring, including continuous audio capture, a reality (Ranft, 2004; Roe et al., 2021; Webster & Budney, 2017; Roch et al., 2017). The resulting data deluge underscores a critical bottleneck: the scarcity of trained human analysts. This highlights the growing importance of automated methods, particularly machine learning, to streamline and accelerate the bioacoustic workflow.
While the deep learning revolution in bioacoustics is undeniable, it is still in its early stages. Reviews and textbooks as recent as 2017 gave limited attention to DL as a central tool, even when focusing on machine learning for bioacoustics (Ganchev, 2017; Stowell, 2018). Mercado & Sturdy (2017) reviewed the applications of artificial neural networks (NNs) in bioacoustics research. However, this review predates the deep learning era of neural networks, which, while sharing some fundamental aspects, exhibits significant conceptual and practical differences.
Many bioacousticians are now actively engaging with deep learning, resulting in a surge of innovative research adapting DL to the specific needs of bioacoustic analysis. However, the field is still developing, and comprehensive reference materials are scarce. This article serves as a guide, offering a roadmap through the emerging landscape of deep learning for computational bioacoustics. We aim to provide an overview of the current state-of-the-art, clarify core concepts, and identify crucial knowledge gaps and underexplored areas that warrant future research. Following a description of our survey methodology, we will summarize the current state of the field, outlining best practices and common tasks. Subsequently, we will present a roadmap for computational bioacoustics, informed by our survey and thematic analysis, drawing from broader advancements in deep learning and key bioacoustic challenges.
Survey Methodology
Deep learning is a dynamic and rapidly evolving field. Although deep learning techniques have been applied to audio tasks like speech and music for over a decade, their application in wildlife bioacoustics is a more recent and less mature development. Significant advancements in acoustic deep learning include Hershey et al. (2017), which demonstrated the maturity of audio recognition using convolutional neural networks (CNNs) by introducing the widely-used AudioSet dataset and VGGish NN architecture. Convolutional-recurrent neural network (CRNN) methods also emerged as powerful tools (Çakır et al., 2017). The organizers of the BirdCLEF data challenge declared “the arrival of deep learning” in 2016 (Goëau et al., 2016). Therefore, we focused our keyword-based literature searches on papers published from 2016 onwards, utilizing both Google Scholar and Web of Science. The search query employed was:
(bioacoust* OR ecoacoust* OR vocali* OR “animal calls” OR “passive acoustic monitoring” OR “soundscape”) AND (“deep learning” OR “convolutional neural network” OR “recurrent neural network”) AND (animal OR bird* OR cetacean* OR insect* OR mammal*)
This search in Google Scholar yielded 989 entries. Many were excluded for reasons such as being off-topic, duplicates, reviews, abstracts only, non-English, or inaccessible. Preprints from arXiv, bioRxiv, and other sources were encountered; while not entirely excluded, preference was given to peer-reviewed publications. The same query in Web of Science resulted in 56 entries. After merging and removing duplicates, we obtained a collection of 162 relevant articles. This subfield is experiencing rapid growth; the number of selected articles increased from 5 in 2016 to 64 in 2021. The bibliography accompanying this article lists all these selected publications and additional articles included for context during the review process.
State of the Art and Recent Developments
We begin by presenting a standard DL recipe derived from the literature, followed by an overview of the taxonomic scope of research. We then examine key themes in bioacoustic DL that have gained significant traction and are approaching maturity. To avoid redundancy, some less established or unresolved topics, even if mentioned in existing literature, will be discussed in the subsequent ‘roadmap’ section.
The Standard Recipe for Bioacoustic Deep Learning
Deep learning’s versatility allows it to be applied to diverse tasks, ranging from classification and regression to signal enhancement and even data synthesis. However, classification, the assignment of one or more ‘labels’ from a predefined list to data items (e.g., species, individuals, or call types), remains the primary DL ‘workhorse’. This task underpins many DL breakthroughs, and its power is leveraged in addressing various other tasks, even image generation (Goodfellow et al., 2014). Classification is indeed the most prevalent application of DL in computational bioacoustics.
A typical ‘recipe’ for bioacoustic classification using deep learning, widely reflected in recent literature, is outlined below. Some terminology may be unfamiliar but will be elaborated upon in later sections:
- Employ a well-established CNN architecture (ResNet, VGGish, Inception, MobileNet), potentially pre-trained on AudioSet. (These are readily available within popular DL Python frameworks like PyTorch, Keras, and TensorFlow).
- Input spectrogram data, typically segmented into fixed-size audio clips (e.g., 1 s or 10 s) to accommodate ‘batch’ processing within GPU memory. Spectrograms can be standard (linear-frequency), mel spectrograms, or log-frequency spectrograms. Spectrogram “pixels” represent magnitudes, often log-transformed or normalized using per-channel energy normalization (PCEN) before use. There is no definitive consensus on the ‘best’ spectrogram format; the optimal choice is often empirically determined based on the frequency bands and dynamic ranges relevant to the specific task.
- Define a list of labels for prediction, which might include species, individuals, call types, or other relevant categories. This could be a binary (yes/no) classification for presence/absence detection. For species identification, DL can handle hundreds of categories. More complex outputs, such as transcriptions of multiple sound events, are also possible and will be discussed later.
- Implement data augmentation to enhance the diversity of small bioacoustic training datasets (e.g., noise mixing, time shifting, mixup).
- While standard CNNs are common, CRNNs are also popular, incorporating a recurrent layer (LSTM or GRU) after convolutional layers. This can be achieved by building a network from scratch or adding a recurrent layer to a pre-existing architecture.
- Train the network using standard deep learning best practices (e.g., Adam optimizer, dropout, early stopping, hyperparameter tuning) (Goodfellow, Bengio & Courville, 2016).
- Adhere to good practice by using separate datasets for training, validation (for monitoring training progress and hyperparameter selection), and final testing/evaluation. Testing on data representing novel conditions is especially valuable for assessing true system generalizability (Stowell et al., 2019b). However, it remains common practice to sample training, validation, and testing data from the same source pool.
- Evaluate performance using standard metrics such as accuracy, precision, recall, F-score, and area under the curve (AUC or AUROC). Given that bioacoustic datasets are often “unbalanced,” with uneven class representation, metrics like macro-averaging are commonly used to account for this, calculating performance per class and averaging to give equal weight to each (Mesaros, Heittola & Virtanen, 2016).
This standard recipe provides a robust starting point for many bioacoustic classification tasks, including those involving noisy outdoor soundscapes. (However, heavy rain and wind remain challenging across all analysis methods, including DL). It can be implemented using readily available Python libraries: PyTorch/TensorFlow/Keras, librosa (or similar) for audio processing, and data augmentation tools like SpecAugment, audiomentations, or kapre. Spectrogram creation and augmentation are audio-specific, but CNN architectures and training are standard across DL for images, audio, video, etc., allowing bioacoustics to benefit from broader advancements in the field.
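To make this recipe concrete, the following minimal Python sketch fine-tunes an off-the-shelf pretrained CNN (here an ImageNet-pretrained ResNet-18 via torchvision, standing in for an AudioSet-pretrained model) on fixed-size single-channel spectrogram clips. The label count, learning rate, and layer substitutions are illustrative assumptions rather than settings drawn from any surveyed study.

```python
# Minimal sketch of the "standard recipe": fine-tune a pretrained CNN on
# fixed-length spectrogram clips. All hyperparameters are placeholders.
import torch
import torch.nn as nn
import torchvision

NUM_CLASSES = 10  # e.g. the number of species in the label list

# 1. Off-the-shelf architecture with pretrained weights.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
# Adapt to single-channel spectrogram input and a new classification head.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # standard Adam optimizer
criterion = nn.CrossEntropyLoss()

def train_step(spectrograms, labels):
    """One training step on a batch of (B, 1, n_mels, n_frames) spectrograms."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(spectrograms), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```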
Data augmentation is particularly beneficial for small and unbalanced datasets, common in bioacoustics. Common augmentation methods (time-shifting, sound mixing, noise mixing) are generally safe, as they are unlikely to alter the semantic content of the audio. However, other modifications, such as time warping and frequency shifting, could affect subtle cues crucial for distinguishing individuals or call types. Therefore, the choice of augmentation methods should be carefully considered and tailored to the specific audio and animal sounds being analyzed.
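As an illustration of such augmentations, the sketch below implements time shifting, noise mixing, and mixup with NumPy alone so their behaviour is explicit; the shift fraction, SNR, and mixup parameter are illustrative, and libraries such as audiomentations or SpecAugment provide more configurable equivalents.

```python
# Minimal sketch of three "safe" waveform augmentations: time shifting,
# noise mixing, and mixup. Parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def time_shift(x, max_fraction=0.2):
    """Circularly shift the clip by up to max_fraction of its length."""
    limit = int(len(x) * max_fraction)
    return np.roll(x, rng.integers(-limit, limit))

def add_noise(x, snr_db=20.0):
    """Mix in Gaussian noise at a chosen signal-to-noise ratio."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + rng.normal(scale=np.sqrt(noise_power), size=x.shape)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two clips and their one-hot labels (mixup)."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```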
The standard recipe does have limitations. The use of mel frequency scale, AudioSet pretraining, and magnitude-based spectrograms may prioritize aspects of audio readily perceptible to humans, potentially overlooking subtle details important for high-resolution discriminations or animal perception (Morfi, Lachlan & Stowell, 2021a). The small convolutional filters in common CNN architectures may also be less suited for broad-band sound events.
While some research focuses on optimizing spectrogram generation parameters or exploring alternative representations like wavelets, these task-specific improvements rarely generalize broadly enough to overturn the widespread use of spectrograms. Networks using raw waveforms as input hold promise for overcoming these limitations but require larger training datasets. Pre-trained raw-waveform networks are a promising area for future development.
Taxonomic Coverage
Deep learning has been applied to analyze the vocalizations of a wide range of species and taxa, including:
- Birds: Extensively studied, benefiting from large datasets and challenges like BirdCLEF.
- Cetaceans: Significant research due to the need for marine mammal monitoring and conservation.
- Insects: Growing interest, particularly in ecoacoustics and biodiversity assessments.
- Bats: Research focused on echolocation calls and species identification.
- Primates: Studies on vocal communication and behavior.
- Rodents: Analysis of ultrasonic vocalizations, particularly in laboratory settings.
- Amphibians: Species identification and population monitoring through call analysis.
- Fish: Emerging field, utilizing underwater acoustics for monitoring and classification.
- Livestock: Applications in animal welfare and behavior monitoring through vocalization analysis.
- Ecosystems/Soundscapes: Ecoacoustic approaches analyzing entire soundscapes for biodiversity assessment.
Many studies analyze multiple taxa, as DL facilitates multi-species recognition and benefits from large and diverse datasets. Some research sidesteps specific taxa by focusing on ecosystem or soundscape level analysis (“ecoacoustic” approaches) (Sethi et al., 2020; Heath et al., 2021; Fairbrass et al., 2019).
The emphasis on certain taxa is influenced by factors such as biodiversity and conservation priorities (birds, bats, insects), comparative linguistics and behavior (songbirds, cetaceans, primates, rodents), and the complexity and analyzability of their vocalizations (Marler & Slabbekoorn, 2004). Practical considerations, like the ease of recording terrestrial and diurnal species, also play a role. Progress in bird sound classification has been accelerated by standardized datasets and challenges like BirdCLEF (Goëau et al., 2014; Joly et al., 2019). This dataset- and challenge-driven approach mirrors progress patterns in many machine learning applications. However, research effort allocation does not always align with the diversity or importance of taxa, a point we will revisit.
Having outlined a standard recipe and the taxonomic scope, we now review key themes that have received detailed attention in the literature on DL for computational bioacoustics.
Neural Network Architectures
The “architecture” of a neural network (NN) defines the arrangement of nodes and their connections, often organized in sequential processing layers (Goodfellow, Bengio & Courville, 2016). Early applications of NNs to animal sound utilized basic “multi-layer perceptron” (MLP) architectures (Koops, Van Balen & Wiering, 2015; Houégnigan et al., 2017; Hassan, Ramli & Jaafar, 2017; Mercado & Sturdy, 2017), using manually-designed summary features (e.g., syllable duration, peak frequency) as input. However, CNN and recurrent neural network (RNN) architectures have since significantly outperformed MLPs. CNNs and RNNs can leverage the sequential/grid structure in raw or lightly-preprocessed data, allowing time series or time-frequency spectrogram data as input (Goodfellow, Bengio & Courville, 2016). This shift, which eliminates the manual feature extraction step and maintains a higher-dimensional input, allows for richer information representation. Neural networks are highly non-linear and can utilize subtle variations in this “raw” data. CNNs and RNNs incorporate assumptions about data structure, leading to efficient training. For example, CNN classifiers are inherently invariant to time-shifts in input data, a reasonable assumption for sound stimuli, reducing the number of trainable parameters compared to MLPs and simplifying training.
One of the earliest CNN applications in bioacoustics classified 10 anuran species (Colonna et al., 2016). In the same year, CNNs taking spectrograms as input were used by 3 of 6 teams in the 2016 BirdCLEF challenge, including the top-performing team (Goëau et al., 2016). One system even reused AlexNet, a 2012 CNN designed for images. Shortly after, Salamon et al. (2017a) and Knight et al. (2017) confirmed CNNs’ superiority over previous “shallow” machine learning paradigms in bioacoustics.
CNNs are now dominant; at least 83 surveyed articles utilized them, sometimes in combination with other modules. Many studies empirically compare NN architectures and configurations, such as the number of CNN layers (Wang et al., 2021; Li et al., 2021; Zualkernan et al., 2020). Oikarinen et al. (2019) explored a dual task of call type and caller ID inference from marmoset monkey pairs, evaluating output layer types for this scenario.
While many articles used custom CNN architectures, there’s a strong trend towards using or evaluating off-the-shelf CNN architectures (Lasseck, 2018; Zhong et al., 2020a; Guyot et al., 2021; Dias, Ponti & Minghim, 2021; Li et al., 2021; Kiskin et al., 2021; Bravo Sanchez et al., 2021; Gupta et al., 2021). These influential CNNs, widely used in DL, are readily available in DL frameworks (Table 1) and can be downloaded pre-trained on standard datasets. The choice of CNN architecture is rarely based on first principles, aside from general advice that network complexity should scale with task complexity (Kaplan et al., 2020). Recent architectures like ResNet and DenseNet incorporate modifications for training very deep networks, while others (MobileNet, EfficientNet, Xception) prioritize efficiency (Canziani, Paszke & Culurciello, 2016).
Table 1: Off-the-shelf CNN architectures usage in bioacoustics deep learning research papers. This table shows the frequency of different pre-built CNN architectures used in the surveyed literature for bioacoustics tasks, highlighting popular choices like ResNet and VGG.
Convolutional layers in CNNs typically use non-linear filters with small “receptive fields,” enabling them to utilize local dependencies within spectrogram data. However, sound scenes and vocalizations often exhibit dependencies across both short and long timescales. This temporal aspect motivates the use of recurrent neural networks (RNNs), with LSTM and GRU being popular implementations (Hochreiter & Schmidhuber, 1997). RNNs can propagate information forward and/or backward in time during inference. Consequently, RNNs have been explored for processing sound, including animal sounds (Xian et al., 2016; Wang et al., 2021; Madhusudhana et al., 2021; Islam & Valles, 2020; Garcia et al., 2020; Ibrahim et al., 2018). RNNs alone may not always achieve strong performance. However, CRNNs, which add recurrent layers after the convolutional layers, have shown strong performance in audio tasks, with the RNN layers performing temporal integration of information preprocessed by the earlier layers (Çakır et al., 2017). CRNNs have been applied in bioacoustics with positive results (Himawan et al., 2018; Morfi & Stowell, 2018; Gupta et al., 2021; Xie et al., 2020; Tzirakis et al., 2020; Li et al., 2019). However, CRNNs can be computationally intensive, and their added benefit is not always guaranteed.
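A minimal CRNN sketch in PyTorch is shown below: a small convolutional front-end over mel-spectrogram input, a bidirectional GRU for temporal integration, and a clip-level classifier. All layer sizes are illustrative and not taken from any of the cited CRNN studies.

```python
# Minimal CRNN sketch: CNN front-end, then a GRU over time, then a classifier.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=10, rnn_hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                      # pool frequency, keep time resolution
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), rnn_hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * rnn_hidden, n_classes)

    def forward(self, x):                              # x: (B, 1, n_mels, n_frames)
        h = self.conv(x)                               # (B, C, n_mels//4, n_frames)
        h = h.permute(0, 3, 1, 2).flatten(2)           # (B, n_frames, C * n_mels//4)
        h, _ = self.gru(h)                             # temporal integration
        return self.head(h.mean(dim=1))                # average over time -> clip logits

logits = CRNN()(torch.randn(8, 1, 64, 200))            # e.g. 8 clips, 64 mel bands, 200 frames
```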
In 2016, WaveNet, an influential audio synthesis method, demonstrated that long temporal sequences could be modeled using CNN layers with a ‘dilated’ structure, enabling context from hundreds of timesteps (van den Oord et al., 2016). This inspired replacing recurrent layers with 1-D temporal convolutions, known as temporal CNNs (TCNs or TCNNs) (Bai, Kolter & Koltun, 2018). Whether applied to spectrograms or waveforms, these are 1-D (time only) convolutions, distinct from the 2-D (time-frequency) convolutions more commonly used. TCNs can be faster to train than RNNs with comparable or superior results. TCNs have been increasingly used in bioacoustics since 2021 (Steinfath et al., 2021; Fujimori et al., 2021; Roch et al., 2021; Xie et al., 2021b; Gupta et al., 2021; Gillings & Scott, 2021; Bhatia, 2021). Gupta et al. (2021) compared CRNNs against CNN+TCN and standard CNN architectures (ResNet, VGG), finding CRNN to be the strongest in their evaluation.
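The sketch below illustrates the TCN idea under the same caveat: stacked 1-D convolutions with exponentially increasing dilation give later layers a context of hundreds of timesteps without recurrence; channel counts and depth are placeholder values.

```python
# Minimal TCN sketch: stacked dilated 1-D convolutions over waveform-like input.
import torch
import torch.nn as nn

class TCN(nn.Module):
    def __init__(self, in_channels=1, channels=32, n_layers=6, kernel_size=3):
        super().__init__()
        layers = []
        for i in range(n_layers):
            dilation = 2 ** i                          # 1, 2, 4, 8, ... (growing context)
            layers += [
                nn.Conv1d(in_channels if i == 0 else channels, channels,
                          kernel_size, dilation=dilation,
                          padding=dilation * (kernel_size - 1) // 2),
                nn.ReLU(),
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                              # x: (B, in_channels, n_samples)
        return self.net(x)

features = TCN()(torch.randn(4, 1, 16000))             # 4 one-second clips at 16 kHz
```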
Innovations in NN architectures continue to be explored. Vesperini et al. (2018) applied capsule networks for bird detection. Gupta et al. (2021) used Legendre memory units in birdsong species classification. “Attention” mechanisms, popular in wider DL, particularly text processing (Chorowski et al., 2015), allow networks to dynamically weight and combine inputs, moving beyond the fixed context assumptions of CNNs and RNNs. Ren et al. (2018) applied attention to spectrograms, and Morfi, Lachlan & Stowell (2021a) used it for bird vocalizations. “Transformer” layers (Vaswani et al., 2017), utilizing attention as the core building block, are gaining prominence in DL. While not yet widely explored in bioacoustics, their success in other domains suggests increasing future use, with recent studies showing promising results (Elliott et al., 2021; Wolters et al., 2021).
Many studies empirically compare NN architectures from a chosen set, often evaluating hyperparameters. Exhaustive searching is infeasible, and a priori network selection is challenging. Brown, Montgomery & Garg (2021) propose automating workflow construction, including NN architecture selection, to address this problem.
Acoustic Features: Spectrograms, Waveforms, and More
Magnitude spectrograms are the dominant input data representation in surveyed studies. Spectrograms transform raw audio into 2D grids representing energy distribution across time and frequency. Pre-DL, spectrograms were sources for feature extraction (peak frequencies, event durations, etc.). Using spectrograms directly allows DL systems to exploit diverse information and leverage advancements in image DL due to format similarity to images.
Spectrogram creation involves choices like window length (time-frequency resolution trade-off) and window function shape (Jones & Baraniuk, 1995). Careful selection of these parameters can offer minor benefits for DL systems (Heuer et al., 2019; Knight et al., 2020). A more debated choice is between linear frequency axis spectrograms and (pseudo-)logarithmically-spaced axes like mel spectrograms (Xie et al., 2019; Zualkernan et al., 2020) or constant-Q transform (CQT) (Himawan et al., 2018). Mel spectrograms, based on the mel scale approximating human auditory selectivity, might seem unusual for non-human data. Their use is likely due to convenience and the fact that pitch shifts of harmonic signals correspond to linear shifts on a logarithmic frequency axis, aligning well with the shift-invariant feature detection of CNNs. Zualkernan et al. (2020) found mel spectrograms useful even for bat signals, with frequency range adjustments. The literature lacks consensus, with studies favoring mel (Xie et al., 2019; Zualkernan et al., 2020), logarithmic (Himawan et al., 2018; Smith & Kristensen, 2017), or linear scales (Bergler et al., 2019b). No single representation consistently outperforms others across all tasks and taxa. Some studies utilize multiple spectrogram representations, “stacking” them as multi-channel input (like RGB image channels) (Thomas et al., 2019; Xie et al., 2021c). This redundancy allows NNs to flexibly aggregate information and gain minor advantages.
ML practitioners must consider data normalization and preprocessing. Standard practice involves transforming input data to zero mean and unit variance, and applying noise reduction (e.g., median filtering) to spectrograms. Spectral magnitudes can have highly variable dynamic ranges, noise levels, and event densities. Lostanlen et al. (2019a, 2019b) advocate for per-channel energy normalization (PCEN), a simple adaptive normalization algorithm, supported by theoretical and empirical evidence. PCEN has been adopted in recent works, improving deep bioacoustic event detector performance (Allen et al., 2021; Morfi et al., 2021b).
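For illustration, the following librosa sketch computes both a log-scaled mel spectrogram and a PCEN-normalized mel spectrogram from the same recording; the file path, STFT settings, and mel-band count are placeholders to be adapted to the frequency range of the target signals.

```python
# Minimal sketch contrasting two common front-ends: log-mel vs PCEN-mel.
import librosa
import numpy as np

y, sr = librosa.load("recording.wav", sr=None)          # keep the native sample rate

mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                     n_fft=1024, hop_length=256, n_mels=64)

log_mel = librosa.power_to_db(mel, ref=np.max)           # log-magnitude mel spectrogram
pcen_mel = librosa.pcen(mel * (2 ** 31), sr=sr,           # per-channel energy normalization
                        hop_length=256)                   # (scaling follows the librosa docs)
```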
Mel-frequency cepstral coefficients (MFCCs), used extensively in previous acoustic analysis eras to compress spectral information, have been occasionally used in bioacoustic DL (Colonna et al., 2016; Kojima et al., 2018; Jung et al., 2021). However, their shift-invariance along the MFCC coefficient axis makes them less suitable for CNNs. DL evaluations generally show MFCCs underperforming compared to less-preprocessed representations like mel spectrograms (Zualkernan et al., 2020; Elliott et al., 2021).
Other time-frequency representations explored as DL inputs include wavelets (Smith & Kristensen, 2017; Kiskin et al., 2020) or sinusoidal pitch tracking algorithm traces (Jancovic & Köküer, 2019). These are often motivated by signal characteristics, like chirplets matching whale sounds (Glotin, Ricard & Balestriero, 2017).
However, raw waveforms are the main alternative to spectrograms, facilitated by WaveNet and TCN architectures. Raw waveform-based DL often requires larger training datasets but removes manual preprocessing (spectrogram transformation), allowing DL systems to extract information optimally. Recent studies use TCN architectures (1-dimensional CNNs) on raw waveform input (Ibrahim et al., 2018; Li et al., 2019; Fujimori et al., 2021; Roch et al., 2021; Xie et al., 2021b). Ibrahim et al. (2018) compare RNNs and TCNs on waveforms for fish classification. Li et al. (2019) use TCNs with a final recurrent layer on bird sound waveforms. Steinfath et al. (2021) offer spectrogram or waveform input for their CNN segmentation method. Bhatia (2021) explores bird sound synthesis using WaveNet and other methods. Transformers can also be applied directly to waveform data (Elliott et al., 2021).
Recent work proposes trainable representations between raw waveforms and spectrograms (Balestriero et al., 2018; Ravanelli & Bengio, 2018; Zeghidour et al., 2021). These trainable filterbanks optimize filter parameters alongside other NN layers. Balestriero et al. (2018) introduced a trainable filterbank with promising bird audio detection results. Bravo Sanchez et al. (2021) used SincNet, achieving competitive birdsong classification results with fast training. Zeghidour et al. (2021) applied SincNet and introduced LEAF, a learnable audio front-end with trainable PCEN and filterbank layers, showing strong bird audio detection performance.
In summary, spectrograms are often suitable for bioacoustic DL, often with (pseudo-)logarithmic frequency axes like mel or CQT spectrograms. PCEN preprocessing is frequently beneficial. Raw waveform and adaptive front-end methods are likely to become more prominent, especially if integrated into standard NN architectures effective across bioacoustic tasks.
Classification, Detection, Clustering
Classification and detection are by far the most common tasks in the literature, serving as fundamental building blocks in many workflows and comprehensively addressed by current DL state-of-the-art.
“Classification” in this review, aligns with ML usage, referring to the prediction of categorical labels like species or call type. It is widely investigated in bioacoustic DL, primarily for species classification, often within a taxon family (e.g., BirdCLEF challenge) (Joly et al., 2021). Other classification tasks include individual animal identification (Oikarinen et al., 2019; Ntalampiras & Potamitis, 2021), call type classification (Bergler et al., 2019a; Waddell, Rasmussen & Širović, 2021), sex and strain classification (Ivanenko et al., 2020), and behavioral state classification (Wang et al., 2021; Jung et al., 2021). Soundscape classification includes biophony, geophony, and anthropophony categories (Fairbrass et al., 2019; Mishachandar & Vairamuthu, 2021).
“Detection” tasks are defined in three common ways in the surveyed literature, illustrated in Fig. 1.
Figure 1: Common sound detection implementation approaches in bioacoustics. This diagram illustrates three distinct methods for sound detection: (A) clip-level classification, (B) sound event detection (SED) with time boundaries, and (C) object detection with time-frequency bounding boxes, showcasing different levels of detail in detection outputs.
For all three detection task settings, CNNs demonstrate strong performance, outperforming other ML techniques (Marchal, Fabianek & Aubry, 2021; Knight et al., 2017; Prince et al., 2019). The output layers and loss functions vary slightly based on the data format in each task setting (Mesaros et al., 2019). (Other settings include pixel-wise segmentation of spectral shapes (Narasimhan, Fern & Raich, 2017)).
Detection is crucial in marine contexts with large-scale surveys and sparse sound events (Frazao, Padovese & Kirsebom, 2020). Numerous studies apply DL to cetacean sound detection underwater (Jiang et al., 2019; Bergler et al., 2019b; Best et al., 2020; Shiu et al., 2020; Zhong et al., 2020a; Ibrahim et al., 2021; Vickers et al., 2021b; Zhong et al., 2021; Roch et al., 2021; Vickers et al., 2021a; Allen et al., 2021; Madhusudhana et al., 2021).
A common bioacoustic workflow is “detect then classify” (Waddell, Rasmussen & Širović, 2021; LeBien et al., 2020; Schröter et al., 2019; Jiang et al., 2019; Koumura & Okanoya, 2016; Zhong et al., 2021; Padovese et al., 2021; Frazao, Padovese & Kirsebom, 2020; Garcia et al., 2020; Marchal, Fabianek & Aubry, 2021; Coffey, Marx & Neumaier, 2019). For sparse sounds, detection can filter out ‘negative’ clips, reducing data load and potentially simplifying classifier training. Combined detection and classification is also feasible, with SED and image object detection methods integrating both tasks into one NN architecture (Kong, Xu & Plumbley, 2017; Narasimhan, Fern & Raich, 2017; Shrestha et al., 2021; Venkatesh, Moffat & Miranda, 2021).
Unsupervised learning methods, like clustering, are used when labels are unavailable. DL can drive clustering indirectly, often using autoencoders (algorithms trained to compress and decode data) and then applying standard clustering algorithms to the autoencoder-transformed representation (Coffey, Marx & Neumaier, 2019; Ozanich et al., 2021).
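A minimal sketch of this autoencoder-then-cluster strategy is given below: a small fully-connected autoencoder is trained to reconstruct flattened spectrogram clips, and k-means is then applied to the bottleneck representation. The dimensions, training loop, and cluster count are illustrative stand-ins, not a reproduction of the cited methods.

```python
# Minimal sketch: train an autoencoder, then cluster its bottleneck embeddings.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class Autoencoder(nn.Module):
    def __init__(self, n_inputs=64 * 100, n_latent=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_inputs, 256), nn.ReLU(),
                                     nn.Linear(256, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 256), nn.ReLU(),
                                     nn.Linear(256, n_inputs))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = torch.randn(500, 64 * 100)                 # stand-in for flattened spectrogram clips

for _ in range(50):                               # reconstruction training
    recon, _ = model(data)
    loss = nn.functional.mse_loss(recon, data)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    _, embeddings = model(data)                   # bottleneck representation
clusters = KMeans(n_clusters=8, n_init=10).fit_predict(embeddings.numpy())
```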
Signal Processing Using Deep Learning
DL applications in computational bioacoustics extend beyond classification, detection, and clustering, encompassing signal processing, manipulation, and generation tasks. These less-studied tasks relate to signal processing and modification.
Denoising and source separation are preprocessing steps to enhance sound quality before analysis, especially in low signal-to-noise ratio (SNR) conditions (Xie, Colonna & Zhang, 2021a). However, preprocessing isn’t always necessary or desirable as it can remove information, and DL recognition can often perform well despite noise. While lightweight signal processing algorithms are typical preprocessing steps, CNN-based DL is increasingly used for signal enhancement and source separation in audio fields (Manilow, Seetharman & Salamon, 2020). This often operates on spectrograms, mapping input spectrograms to enhanced output spectrograms. DL methods for this include denoising autoencoders and u-nets, specialized CNN architectures for domain-to-domain mapping (Jansson et al., 2017). DL denoising has shown good performance as a preprocessing step for recognition in underwater (Vickers et al., 2021b; Yang et al., 2021) and bird sound (Sinha & Rajan, 2018).
Privacy in bioacoustic analysis is not a primary concern, but regulations like GDPR are raising awareness as acoustic monitoring becomes more widespread (Le Cornu, Mitchell & Cooper, 2021). Detecting speech in recordings to delete clips is one strategy (Janetzky et al., 2021). Another approach uses denoising or source separation to remove speech as “noise”. Cohen-Hadria et al. (2019) used this for urban sound monitoring, blurring speech and mixing it back for anonymization, potentially useful for human-animal interaction studies.
Data compression is relevant for deployed monitoring projects. Heath et al. (2021) found that audio compression codecs like MP3 have minimal impact on DL analysis, consistent with pre-DL findings. They also found CNN AudioSet embeddings effective as compressed “fingerprints.” Bjorck et al. (2019) used DL to optimize codecs, creating compressed representations of elephant sounds decodable back to audio, unlike fingerprints.
Synthesis of animal sounds has occasional attention and could be useful for playback stimuli. Bhatia (2021) studied birdsong synthesis using DL methods, including WaveNet and generative adversarial network (GAN) approaches.
Small Data: Data Augmentation, Pre-training, Embeddings
DL’s success is partly due to large labeled datasets. However, bioacoustics often faces a lack of large labeled datasets due to species/call rarity or expert annotation requirements. This is true for fine categorical distinctions and large-scale monitoring. Strategies to address this include data mining and ecoacoustic methods; here we focus on DL techniques for small data scenarios.
Data augmentation artificially increases dataset size by applying minor, irrelevant modifications to data samples, primarily for training sets. For audio, this includes time shifting, low-amplitude noise addition, audio mixing (‘mixup’), and spectrogram warping (Lasseck, 2018). Modifications should preserve data label meaning. Frequency shifts may be unsuitable for some animal vocalizations. Data augmentation was used early in DL’s bioacoustic application (Goëau et al., 2016) and is now widespread. Various studies examine augmentation combinations for terrestrial and underwater sound (Lasseck, 2018; Li et al., 2021; Padovese et al., 2021). Standard data augmentation is essential for most bioacoustic DL training. Software packages like SpecAugment, kapre, and audiomentations for Python facilitate audio data augmentation. Beyond standard practice, augmentation can estimate confounding factor impacts in datasets (Stowell et al., 2019b).
Pretraining is another common technique: initializing NN training with weights from a network previously trained on a related task, rather than random initialization. “Transfer learning” leverages shared aspects between tasks, like spectrogram time-frequency correlations. Pretraining is valuable when large annotated datasets are available for the pretraining task. Early work used image dataset pretraining (e.g., ImageNet), improving performance despite differences between images and spectrograms (Lasseck, 2018). ImageNet pretraining is still sometimes used (Disabato et al., 2021; Fonseca et al., 2021), but many now use Google’s AudioSet (Hershey et al., 2017; Çoban et al., 2020; Kahl et al., 2021) or VGG-Sound (Chen et al., 2020a) (Bain et al., 2021). Pretrained networks are readily available in toolkits. Bioacoustics-specific datasets (e.g., BirdCLEF) are rarely used for pretraining, possibly due to lower diversity compared to AudioSet/VGG-Sound or convenience. Ntalampiras (2018) explored music genre dataset transfer learning. Morgan & Braasch (2021) reported no pretraining benefit, possibly due to a large dataset (150 h annotated). Pretraining from simulated sound data is another alternative (Glotin, Ricard & Balestriero, 2017; Yang et al., 2021; Li et al., 2020).
Embeddings and metric learning are related to pretraining. Instead of direct label prediction, NNs are trained to convert acoustic features into vector coordinates (“embeddings”) useful for classification. Embeddings are created by removing the final classification layers (“head”) from a pretrained network. The “body” output is an intermediate representation, often a useful high-dimensional feature representation. AudioSet embeddings have been explored in bioacoustics and ecoacoustics for diverse tasks (Sethi et al., 2021; Sethi et al., 2020; Çoban et al., 2020; Heath et al., 2021).
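In practice, such embeddings can be obtained simply by removing the classification head of a pretrained network, as in the sketch below, where an ImageNet-pretrained ResNet-18 stands in for an AudioSet-style model.

```python
# Minimal sketch of embedding extraction: keep the network "body", drop the head.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Identity()                       # remove the classification head
model.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)        # stand-in for spectrogram "images"
    embeddings = model(batch)                   # (4, 512) feature vectors
```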
Autoencoders can also create embeddings by learning to encode and decode data, using the encoder’s representation as the embedding (Ozanich et al., 2021; Rowe et al., 2021). This can be unsupervised, followed by clustering (Ozanich et al., 2021).
Siamese and triplet networks are another embedding strategy. These use standard CNNs but different loss functions, training based on the distances between vector coordinates of pairs/triplets of items. Siamese networks train pairwise, aiming for close embeddings for similar items and distant embeddings for dissimilar items. Triplet networks use an “anchor,” a positive instance to be close, and a negative instance to be distant. Siamese/triplet networks can train effectively with small or unbalanced datasets, reported in terrestrial and underwater projects (Thakur et al., 2019; Nanni et al., 2020; Clementino & Colonna, 2020; Acconcjaioco & Ntalampiras, 2021; Zhong et al., 2021).
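The sketch below shows the core of triplet training: an embedding network is optimized so that an anchor clip lies closer to a positive (same-class) example than to a negative (different-class) one. The tiny fully-connected encoder and margin value are illustrative, and in practice any CNN body can take its place.

```python
# Minimal triplet-loss training sketch with a toy encoder.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 100, 128), nn.ReLU(),
                        nn.Linear(128, 32))       # 32-dimensional embeddings
criterion = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

anchor = torch.randn(16, 1, 64, 100)               # stand-ins for spectrogram triplets
positive = torch.randn(16, 1, 64, 100)
negative = torch.randn(16, 1, 64, 100)

loss = criterion(encoder(anchor), encoder(positive), encoder(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```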
Other data scarcity strategies in bioacoustics include:
- Meta-learning: Learning to learn from limited data.
- Synthetic data generation: Creating artificial training data.
- Semi-supervised learning: Utilizing unlabeled data to improve learning.
- Weakly supervised learning: Training with noisy or incomplete labels.
These methods are less studied; many bioacoustics researchers use off-the-shelf pretrained embeddings. However, these techniques are useful for training with limited datasets and could be used in creating high-quality embeddings in future work.
Generalization and Domain Shift
Generalization to new data is a widespread concern, especially with small datasets. “Domain shift,” performance degradation due to changes in input data attributes (background soundscape, species sub-population, event frequency, microphone type), is a specific concern (Morgan & Braasch, 2021).
Evaluating DL systems on test sets differing from training data (location, SNR, season) is increasingly common (Shiu et al., 2020; Vickers et al., 2021b; Çoban et al., 2020; Allen et al., 2021; Khalighifar et al., 2021). This helps avoid overestimating real-world generalizability.
Domain adaptation methods can automatically adjust NN parameters to account for domain shift (Adavanne et al., 2017; Best et al., 2020). Including contextual correlates as NN input is another adaptation strategy (Lostanlen et al., 2019b, Roch et al., 2021). Fine-tuning (limited retraining) or active learning (interactive feedback) can be used with limited human input about the new domain (Çoban et al., 2020; Allen et al., 2021; Ryazanov et al., 2021). The Bird Audio Detection challenge (Stowell et al., 2019b) aimed to stimulate cross-condition generalizable methods, but leading submissions relied on transfer learning and data augmentation rather than explicit domain adaptation.
Open-Set and Novelty
Standard DL recognition is limited to a fixed set of labels. In wildlife recordings, encountering un-trained species or individuals is common and should be identified. “Open set” recognition, detecting new sound types beyond known classes, relates to novelty detection, detecting any novel data occurrence. Cramer et al. (2020) suggest hierarchical classification; a sound can be classified to a higher-level taxon even if the lower-level class is novel. Ntalampiras & Potamitis (2021) apply novelty detection using CNN autoencoders, assuming novel sounds will have high reconstruction error.
Embeddings offer a route to open-set classification. Good embeddings semantically represent new data, allowing novel classes to cluster well. Thakur et al. (2019) and Acconcjaioco & Ntalampiras (2021) advocate for this using triplet and Siamese learning, respectively. Novelty and open-set issues are ongoing concerns, but general-purpose embeddings offer a partial solution.
Context and Auxiliary Information
DL implementations typically operate on short audio/spectrogram segments (e.g., 1-10 s), even RNNs. However, animal vocalizations and recognition can depend on context beyond short windows, such as prior soundscape activity or date/time, location, weather.
Lostanlen et al. (2019b) added a “context-adaptive neural network” layer, dynamically adapting weights based on long-term spectrotemporal statistics. Roch et al. (2021) input acoustic context via local SNR estimates. Madhusudhana et al. (2021) used a CNN (DenseNet) followed by an RNN for postprocessing to incorporate longer-term temporal context. This CNN-RNN is not a CRNN but separate stages, allowing the RNN to operate on longer timescales.
Animal taxonomy is another contextual information form. Hierarchical classification is used in bioacoustics; Cramer et al. (2020) encoded taxonomic relationships in CNN training. Nolasco & Stowell (2022) proposed a different method, evaluating across taxa and individual identity hierarchies.
Perception
Most DL bioacoustics work uses DL as a practical tool. However, some research uses DL to model animal acoustic perception. DL can model non-linear phenomena, potentially replicating natural hearing subtleties. Such models could be studied or used as animal judgment proxies. Morfi, Lachlan & Stowell (2021a) used triplet loss to train a CNN to mimic bird decisions in forced-choice experiments. Simon et al. (2021) trained a CNN on bat echolocation reflections to classify bat-pollinated flowers. Francl & McDermott (2020) found that a DL trained to localize sounds in reverberant environments exhibited human-like acoustic perception phenomena.
On-device Deep Learning
Several studies focus on running bioacoustic DL on small hardware for affordable field monitoring. While many projects process data later, on-device DL allows live readouts, rapid responses (Mac Aodha et al., 2018), and potential savings in power and data transmission. Filtering uninformative audio on-device can extend deployments and reduce bandwidth.
Raspberry Pi is a popular small Linux device used for acoustic monitoring (Jolles, 2021). NVIDIA Jetson Nano and Google Coral are similar devices. Zualkernan et al. (2021) evaluated these for on-device bat detection.
More constrained devices like AudioMoth (Hill et al., 2018) offer lower power consumption. Prince et al. (2019) implemented a depthwise-separable CNN on AudioMoth, outperforming a hidden Markov model detector, although not running in real time. Frameworks like ARM CMSIS and TensorFlow Lite facilitate low-level implementations (Disabato et al., 2021; Zualkernan et al., 2021). Off-the-shelf architectures like MobileNet and SqueezeNet are designed for efficiency (Vidaña-Vila et al., 2020). However, bioacoustic studies often customize CNN designs further to reduce footprint.
Small-footprint devices offer DL with reduced power, bandwidth, and storage needs. Lostanlen et al. (2021b) argue energy efficiency is insufficient and propose batteryless acoustic sensing due to resource constraints. Integrating batteryless sensing with DL remains a future challenge.
Workflows and Other Practicalities
As DL use increases, the focus shifts to workflow integration and practicalities. Authors offer advice for ecologists using DL (Knight et al., 2017; Rumelt, Basto & Roncal, 2021; Maegawa et al., 2021). Others investigate CNN integration into workflows including data acquisition, selection, and labeling (LeBien et al., 2020; Morgan & Braasch, 2021; Ruff et al., 2021). Brown, Montgomery & Garg (2021) automate workflow design, finding that searched workflows outperform literature-based workflows.
User interfaces (UIs) are crucial for algorithm accessibility. While DL researchers often provide Python scripts, GUIs are needed for broader user adoption. Various authors provide GUIs and study efficient interaction (Jiang et al., 2019; Coffey, Marx & Neumaier, 2019; Steinfath et al., 2021; Ruff et al., 2021).
A Roadmap for Bioacoustic Deep Learning
We now outline a roadmap focusing on unresolved topics and future research directions in deep learning for computational bioacoustics, identified through our literature survey.
Some key principles guide this roadmap. Firstly, AI augments, not replaces, expertise. DL agents, while sophisticated, are imperfect and possess different types of knowledge. A bird classifier based on AudioSet embeddings and a raw waveform system trained from scratch have distinct expertises. DL systems become expert peers, consulted and debated with. Active learning, deserving more attention, reinforces this role by allowing DL agents to learn from feedback. Future work will integrate expert knowledge, crowdsourcing, and DL (Kitzes & Schricker, 2019). Secondly, open science is vital. Open data, NN architectures, pretrained weights, and source code sharing are crucial for bioacoustic DL progress. While data sharing is increasing, it remains incomplete in bioacoustics (Baker & Vincent, 2019). Open data and standardized metadata are essential to move beyond single-dataset limitations.
Maturing Topics: Architectures and Features
Core bioacoustic DL topics that are frequently discussed but maturing, and thus of lower urgency, include:
Spectrograms/mel spectrograms are widely used inputs. While species-customized spectrograms may be considered, they are unlikely to yield significant improvements over mel spectrograms for most tasks. Preprocessing like noise reduction and PCEN remains useful. Raw waveform and adaptive front-end methods (SincNet, LEAF) are of interest for tasks requiring fine-grained distinctions.
Off-the-shelf deep embeddings are likely to become even more prevalent features. AudioSet and VGG-Sound are common pretraining datasets, but bioacoustics-specific embeddings may be useful in niches like ultrasound.
CNNs are dominant, with TCNs gaining traction for simplicity and efficiency. However, “attention”-based NN architectures (transformers/perceivers) are challenging CNN dominance in NLP and other domains. CNNs are well-suited for waveform and spectrogram data and are likely to remain part of sound NNs, potentially combined with transformer layers (Baevski et al., 2020). Perceivers are promising for processing variable-length spectrograms into per-event representations (Wolters et al., 2021).
RNNs, while embodying sequential data modeling, are a special case of computation with memory. Transformers and attention NNs offer flexible approaches to referencing past timesteps. The future of DL-with-memory is likely to evolve, with computational bioacoustics integrating short- and long-term memory with contextual data.
Learning Without Large Datasets
Open data availability will benefit bioacoustics but not eliminate small data challenges in project-specific recognition tasks, high-resolution discriminations, and tasks unsuitable for transfer learning (Morfi, Lachlan & Stowell, 2021a). Pre-training, embeddings, multi-task learning, and data augmentation are effective for improved generalization with limited data.
Few-shot learning is a relevant area, reflecting common bioacoustic needs. Active learning (AL) is of high importance, moving beyond fixed training sets to iterative human-machine interaction. AL efficiently utilizes human labeling effort (Qian et al., 2017) and is effective for large datasets and domain shift. While used in bioacoustic DL (Steinfath et al., 2021; Allen et al., 2021), AL is underexplored due to evaluation complexity. Future work may combine few-shot learning with AL processes.
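A minimal sketch of one AL round using uncertainty sampling is given below; train_model, predict_proba, and ask_human_labels are hypothetical stand-ins for project-specific training, inference, and annotation steps.

```python
# Minimal active-learning round: query the clips the model is least sure about.
# train_model, predict_proba and ask_human_labels are hypothetical placeholders.
import numpy as np

def active_learning_round(labelled, unlabelled, n_queries=20):
    model = train_model(labelled)                        # fit on the current labelled pool
    probs = predict_proba(model, unlabelled)             # (n_clips, n_classes)
    uncertainty = 1.0 - probs.max(axis=1)                # least-confident-first ranking
    query_idx = np.argsort(-uncertainty)[:n_queries]
    new_labels = ask_human_labels(unlabelled, query_idx) # human-in-the-loop step
    labelled.extend(new_labels)
    chosen = set(int(i) for i in query_idx)
    remaining = [c for i, c in enumerate(unlabelled) if i not in chosen]
    return model, labelled, remaining
```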
Simulated datasets offer another approach to reduce data dependence (“sim2real”). Simulations can generate diverse training data, controlling biases, but realism can be a limitation. Simulated datasets have been used to train marine sound detectors, as these signals can be modeled using chirp/impulse/sinusoidal synthesis (Glotin, Ricard & Balestriero, 2017; Yang et al., 2021; Li et al., 2020). Simulation is also relevant for spatial sound scenes (Gao et al., 2020; Simon et al., 2021). Soundscape simulation has been used for urban and domestic sound analysis (Salamon et al., 2017b; Turpault et al., 2021), suggesting wider bioacoustic DL application potential.
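As a simple illustration of this simulation approach, the sketch below generates frequency-sweep calls with SciPy, mixes them into background noise at random times and a chosen SNR, and keeps the known onsets as labels; all parameter ranges are illustrative rather than tied to any particular species or study.

```python
# Minimal "sim2real" sketch: synthesize chirp-like calls in noise with known onsets.
import numpy as np
from scipy.signal import chirp

rng = np.random.default_rng(0)
sr = 16000

def synth_clip(duration=5.0, snr_db=10.0):
    n = int(duration * sr)
    noise = rng.normal(scale=0.05, size=n)              # background "soundscape"
    t_call = np.linspace(0, 0.3, int(0.3 * sr))
    call = chirp(t_call, f0=2000, f1=6000, t1=0.3)       # a 0.3 s upward frequency sweep
    call *= 10 ** (snr_db / 20) * np.std(noise) / np.std(call)
    onset = rng.integers(0, n - len(call))
    noise[onset:onset + len(call)] += call
    return noise, onset / sr                              # waveform and ground-truth onset (s)
```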
Equal Representation
DL systems can be “black boxes” and prone to biases, replicating biases in training data (Koenecke et al., 2020). Reliable bioacoustic DL requires equal representation in sensitivity and error rates (Hardt et al., 2016).
Baker & Vincent (2019) highlight taxonomic bias in bioacoustics research, which is unrepresentative of audible animal diversity, biomass, or conservation importance. This bias extends to DL in bioacoustics. Baker & Vincent advocate for more insect sound research, insects being broadly understudied (Montgomery et al., 2020), while Linke et al. (2018) make a similar case for freshwater species.
Equal taxonomic and geographic representation should be assessed in datasets. Joint efforts are needed to create diverse open datasets covering various locations and regions. Addressing the lack of data deposition in bioacoustics research (Baker & Vincent, 2019) is crucial for improved data coverage.
Equal representation should also be considered in embeddings. Biases in datasets, NN architectures, and training can affect embedding representational capacities.
Targeted methods for rare species (Znidersic et al., 2020; Wood et al., 2021) remain important. Rare species data scarcity necessitates further study. Synthetic examples (Beery et al., 2020) and evaluation frameworks for rare sound event detection (Baumann et al., 2020) are relevant strategies.
Interfaces and Visualization
Bridging the gap between DL algorithms and zoologists/conservationists requires user interfaces (UIs). While Python scripts are good for reproducibility, broader user access requires more diverse UI options, including R, Python, desktop apps, smartphone apps, and websites. No consensus exists on optimal UI types, but integration with audio editing/annotation tools is desirable. Future algorithms will likely be available as installable packages or web APIs accessed through various interfaces. More UI work is needed, including efficient human-computer interaction research and visualization tools leveraging large-scale DL processing (Kholghi et al., 2018; Znidersic et al., 2020; Phillips, Towsey & Roe, 2018).
Active learning (AL) particularly benefits from UI development due to its iterative human-computer interaction. Bioacoustic AL UI designs should enhance sound data interaction (temporal regions, spectrograms, overlapping sounds).
Animal-computer interaction, like robotic animal agents (Simon et al., 2019; Slonina et al., 2021), offers new ethological insights and may use DL for sophisticated vocal interaction.
DL task formulation needs to adapt for interactive situations like AL, moving beyond fixed datasets. Data-driven challenge formats and DL techniques like reinforcement learning (Teşileanu, Ölveczky & Balasubramanian, 2017) warrant further consideration.
Under-Explored Machine Learning Tasks
Under-explored but important ML tasks include:
Individual ID
Automatic individual animal recognition is valuable for behavior studies and monitoring (Ptacek et al., 2016; Vignal, Mathevon & Mottin, 2008; Searby, Jouventin & Aubin, 2004; Linhart et al., 2019; Adi, Johnson & Osiejuk, 2010; Fox, 2008; Beecher, 1989). DL application to acoustic surveying for individual ID is rare but will likely increase (Ntalampiras & Potamitis, 2021; Nolasco & Stowell, 2022). DL’s capacity for nonlinear pattern discrimination and generalization is valuable for fine-scale inter-individual distinctions.
General-purpose individual animal discrimination is a useful DL development focus, requiring systems capable of fine acoustic distinctions. Cross-species and multi-task learning can bridge bioacoustic considerations across taxa (Nolasco & Stowell, 2022). Open-set handling is crucial for individual recognition in the wild. Increased open data sharing with individual ID labels is needed.
Sound Event Detection and Object Detection
Detailed sound event “transcripts” are valuable for ethological and biodiversity analyses, requiring higher-resolution bioacoustic DL.
SED approaches include music transcription/speaker diarisation-like methods (Mesaros et al., 2021; Morfi & Stowell, 2018; Morfi et al., 2021b) and object detection methods using time-frequency bounding boxes (Venkatesh, Moffat & Miranda, 2021; Shrestha et al., 2021; Zsebök et al., 2019; Coffey, Marx & Neumaier, 2019). Both approaches have advantages. Future work should unify these, applicable whether or not frequency bounds are in sound event data.
Spatial Acoustics
Spatial sound source arrangement is informative for individual attribution and counting (Jain & Balakrishnan, 2011). Spatial location is analyzed using multi-microphone arrays (stereo, ambisonic) in terms of direction-of-arrival (DoA) and range.
Standard spatial analysis uses signal processing algorithms, even with ML classification (Kojima et al., 2018). However, DL is increasingly used for spatial tasks (Houégnigan et al., 2017; Van Komen et al., 2020; Yip et al., 2019; Hammer et al., 2021; Adavanne et al., 2018; Shimada et al., 2021). Current DL spatial work is mostly indoor sound. Outdoor sound propagation differences may pose challenges (Traer & McDermott, 2016).
Spatial inference can be combined with SED (SELD). DL tasks like distance estimation and SELD (parallel to SED) could broadly benefit bioacoustics, with local spatial information used more widely in analysis.
Broader geographic-scale population distribution estimation using DL in statistical ecology is outside this review’s scope.
Useful Integration of Outputs
DL workflow integration is crucial. Calibration of automatic inference outputs is important. Kitzes & Schricker (2019) highlight the need for calibrated probabilities rather than binary decisions for reliable abundance estimation. DL outputs are not always well-calibrated probabilities, exhibiting under- or over-confidence (Niculescu-Mizil & Caruana, 2005). Measuring and postprocessing miscalibration is necessary. Evaluating systematic biases, such as higher sensitivity to well-represented sounds (Lostanlen et al., 2018), and reduced performance in dense soundscapes (Joly et al., 2019) is vital.
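As an illustration of such postprocessing, the sketch below checks the calibration of a binary detector with a reliability curve and fits a single temperature parameter on held-out data to soften over-confident logits; the arrays are random placeholders for real validation-set outputs, and temperature scaling is only one of several calibration methods.

```python
# Minimal calibration sketch: reliability curve plus temperature scaling.
import numpy as np
from sklearn.calibration import calibration_curve
from scipy.optimize import minimize_scalar

y_true = np.random.randint(0, 2, size=1000)            # held-out labels (placeholder)
logits = np.random.randn(1000) * 3                      # detector logits (placeholder)

def nll(temperature):
    """Negative log-likelihood of the labels under temperature-scaled probabilities."""
    p = 1 / (1 + np.exp(-logits / temperature))
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

temperature = minimize_scalar(nll, bounds=(0.1, 10.0), method="bounded").x
calibrated = 1 / (1 + np.exp(-logits / temperature))

frac_pos, mean_pred = calibration_curve(y_true, calibrated, n_bins=10)  # reliability curve
```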
Spatial detection reliability is one facet of calibration. Distance-detection probability relationships, varying by species and habitat (Johnston et al., 2014), need to be evaluated for automatic detectors. Reproducibility allows consistent distance and calibration curves for DL algorithms and devices. Evaluating DL performance degradation over distance is being explored (Maegawa et al., 2021; Lostanlen et al., 2021a).
Binary “occupancy” observations from detection are less informative than abundance estimates. DL can bridge this gap by counting calls/song bouts, though these don’t directly reflect animal numbers without calling-rate information (Stevenson et al., 2015). DL “language models” of vocal sequences and spatial information can help. Direct inference of animal abundance, skipping call detection, is another route. Counting and density estimation using DL have been explored for images (Arteta, Lempitsky & Zisserman, 2016) and is being explored for audio (Dias, Ponti & Minghim, 2021; Sethi et al., 2021).
Standardized data exchange is essential for DL tool integration. Biodiversity Information Standards (TDWG) provide data exchange format guidance (https://www.tdwg.org/). Standards need development to represent probabilistic outputs and algorithm attribution. W3C Provenance Ontology (https://www.w3.org/TR/prov-o/) can represent provenance, but its use is not widespread. Machine-readable provenance representation would simplify data merging and algorithmic/training-data effect analysis.
These integration topics are crucial for linking bioacoustic monitoring, data, policy, and interventions, enabling bioacoustic DL to address biodiversity crises. IPBES calls for enhanced environmental monitoring (IPBES, 2019). Integration work is vital for computational bioacoustics to fully address global challenges.
Behaviour and Multi-Agent Interactions
Ethology benefits from automatic vocalization detection for intra- and inter-species interactions. SED/SELD/object-detection will increasingly transcribe sound scenes. Previous ethology works used correlation, Markov models, and network analysis, but general-purpose data-driven vocal sequencing models are challenging (Kershenbaum et al., 2014; Stowell, Gill & Clayton, 2016a). DL can model multi-agent sound scenes, with neural point process models offering new tools (Xiao et al., 2019; Chen, Amos & Nickel, 2020b).
Behavior modeling is analogous to “language models” in automatic speech recognition (ASR) (O’Shaughnessy, 2003). A grand challenge in bioacoustic DL is constructing DL “language models” for flexible, open-set, agent-based vocal sequences, integrated with SED/SELD/object-detection. SED/SELD/object-detection paradigms also need improvement, such as transcribing overlapping sound events within categories (Stowell & Clayton, 2015). Analogies with natural sound scene parsing (Chait, 2020) may guide useful approaches.
Low Impact
Large-scale computational work necessitates considering wider impacts: carbon footprint and resource usage (Lostanlen et al., 2021b). Training DL and practical application have power impacts (Henderson et al., 2020). Pretrained networks, embeddings, and efficient NN architectures (ResNet, MobileNet, EfficientNet) can reduce power consumption.
On-device DL offers resource tradeoffs. Small devices can be efficient and use small-footprint NNs. On-device DL can reduce storage/communication overheads. Batch data analysis may be more efficient than on-device processing in some cases (Dekkers et al., 2022). Networking options also affect resource efficiency.
Developing low-impact bioacoustic DL paradigms is important for rapid-response ecosystem monitoring and nature-based climate/biodiversity solutions.
Conclusions
DL has revolutionized automatic systems in computational bioacoustics. Bioacoustics will continue to benefit from wider DL advancements, including methods from image recognition, speech, and audio. Data availability, hardware, processing power, and biodiversity accounting demands will drive further progress. However, simply adopting techniques from other fields is insufficient. This roadmap identifies bioacoustics-specific research topics arising from unique data characteristics and challenges.