Introduction to bioacoustics

Package overview

The bioacoustics package contains all the necessary R functions to read audio recordings of various formats (e.g., WAV, WAC, MP3, Zero-Crossing), filter noisy files, display audio signals, and automatically detect and extract acoustic features for further analysis, such as species identification based on the classification of animal vocalisations. This package does not provide R functions to perform classification tasks; other packages such as randomForest, extraTrees, mclust, or keras can be used alongside bioacoustics for these tasks. Complementary functions for audio processing in R are also available in the tuneR, monitoR, warbleR, and seewave packages.

This package was originally built for bat bioacoustics, and thus the default arguments of all the available functions are set for bats. Users will have to set the arguments according to the group of animal vocalisations they wish to study¹. The bioacoustics package can be used to display and extract acoustic features from recordings of birds, dolphins, frogs, insects, and other terrestrial or marine animals, as long as their vocalisations can be recorded.

This package is structured around three main components, each containing R functions accessible to users; the internal processes behind these functions are also described below for the sake of clarity and ease of comprehension. The functions of this package are designed to address three main needs:

  • The need to read, write and manipulate acoustic recordings:
    • mp3_to_wav
    • read_audio (read_mp3, read_wac, read_wav)
    • read_zc
    • resample
    • write_zc
  • The need to display what’s inside acoustic recordings, whether to plot them or simply extract metadata:
    • fspec
    • metadata
    • plot_zc
    • spectro
  • The need to analyse audio recordings in batch, in search of specific vocalisations, and to extract acoustic features:
    • blob_detection
    • threshold_detection

Audio recordings

Read

The read_audio function loads mono and stereo audio recordings in MP3, WAV, or WAC (a proprietary format from Wildlife Acoustics) format into R. read_audio is essentially a helper that calls format-specific functions (read_mp3, read_wac, read_wav) to decode the various audio formats.
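
As a minimal sketch, a recording can be loaded as follows. The path below assumes the example file myotis.wav ships in the package’s extdata folder; adjust it to your own recordings as needed.

library(bioacoustics)

# Path to the example recording assumed to ship with the package
wav_file <- system.file("extdata", "myotis.wav", package = "bioacoustics")

# read_audio() dispatches to read_wav() based on the file extension
wave <- read_audio(wav_file)
wave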

Metadata

The metadata function automatically extracts metadata (if present) embedded in audio recordings that were previously loaded into R with the read_audio function. It can also extract metadata from objects generated with the threshold_detection or blob_detection functions. The metadata function also extracts GUANO metadata. GUANO stands for the “Grand Unified Acoustic Notation Ontology” and is meant to be a universal, extensible, open metadata format for bat (and non-bat) acoustic recordings. The R code used to extract GUANO metadata was originally provided by David Riggs under the MIT licence.
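
Assuming the wave object loaded above, the embedded metadata (including GUANO, when present) can be inspected as follows:

# Extract any metadata embedded in the loaded recording
md <- metadata(wave)
str(md)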

Display

The fspec function returns a matrix describing the power spectrum by frequency of a time wave, using the Fast Fourier Transform (FFT). Values are expressed in decibels (dB).

The spectro function generates a frequency spectrum representation of a time wave using the Fast Fourier Transform (FFT).
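
A minimal sketch of both display functions, reusing the wave object loaded above:

# Plot the spectrogram of the recording (default arguments are tuned for bats)
spectro(wave)

# Compute the corresponding power spectrum matrix (values in dB)
ps <- fspec(wave)
dim(ps)  # one dimension spans the FFT frequency bins, the other the time windows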

Manipulate

The mp3_to_wav function converts MP3 to WAV files.

The resample function up- or down-samples a given audio recording to a new sampling rate.
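
A sketch of both manipulation functions; the mp3_to_wav file path is hypothetical and the `to` argument name for the target sampling rate of resample is an assumption, so check the package documentation before use.

# Convert an MP3 recording to WAV (hypothetical file path)
# mp3_to_wav("recording.mp3")

# Down-sample the recording loaded above ('to' is an assumed argument name)
wave_ds <- resample(wave, to = 250000)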

Analyse and extract acoustic features

Threshold detection

The threshold_detection function is a modified version of the Bat Bioacoustics software developed by Christopher Scott (2012). It combines several algorithms for detecting, filtering and segmenting audio events, and extracting audio features.

Audio event detection

The proposed audio event detection function is a modified spectral sum function for peak-picking, with background noise reduction and echo suppression (Scott, 2012).

A recording typically contains broadband continuous background noise, and discrete pulses of acoustic energy expressed in dB. Any discrete portion of a recorded signal above the level of background noise in a recording is defined here as an ‘audio event’. Under this definition, an audio event may include animal vocalisations, call echoes (e.g., for bats), and other noise sources either biotic (e.g., stridulating insects) or abiotic (flowing water, rain, wind, and wind-induced vegetation noise).

Audio events can be detected in a recording through their energy content, by defining a threshold rule that selects discrete portions of the recording with energy (dB) above a fixed threshold. The spectral energy content of a recording is typically revealed by applying a Short-Time Fourier Transform (STFT), a technique that slides an analysis window of fixed size through the recording. The output from the windowed FFT analysis is a discrete set of values from which the locations of audio events can be identified through simple energy thresholding and peak-picking.

The two conventional methods of bioacoustic signal detection are the spectral peak and the spectral sum functions. For a signal x at time n, X[n] is defined as its STFT, where $|X_k[n]|$ is the spectral magnitude of the $k$-th FFT bin at $n$. The spectral sum function calculates the sum of the STFT magnitudes (the total energy over the entire spectrum) at each consecutive window through the recording to create the following detection function:

$$D_{sum}[n] = \sum_{k} |X_k[n]|$$

This conventional signal detection function depends on both the gain and the background noise levels in the recording. This complicates the setting of a consistent detection threshold value, as recordings may be at different levels, thus requiring different threshold values to detect the exact same audio events.

To resolve this problem, the detection function should be normalised for each recording by subtracting the median value over all analysis windows from each detection function data point (Skowronski & Fenton, 2009). The median value is an estimate of the noise floor of the recording, and this process of median offsetting allows the use of a fixed threshold parameter that is independent of the recording level. However, this normalisation requires that the entire recording be acquired prior to processing, ruling out real-time analysis.

This is why the threshold_detection function estimates the noise floor using only past values of the recording (i.e., fixed windows of previous analysis frames). Prior knowledge is therefore not required for the normalisation of the detection function, and real-time analysis remains possible. In addition, by estimating the noise floor locally, the function can dynamically react to changes in the magnitude of background noise within the recording. The size of the noise estimation window can be set using the NWS argument. We recommend a time window of 100 ms for short recordings (typically < 1 min long) containing bat echolocation calls. However, we have found that a time window of 5000 ms is ideal for long recordings (typically > 60 min long) containing bird vocalisations.

The noise floor is estimated and subtracted independently for each spectral bin of the FFT spectrum. Environmental noise and microphone self-noise are rarely white (i.e., with equal power at all frequencies), and are typically weighted more heavily at the low end of the frequency spectrum. Frequency-specific noise subtraction can therefore attenuate the noisier low-frequency regions of the spectrum more heavily than the quieter, higher-frequency parts. This process increases the sensitivity of the audio event detection in the higher-frequency regions of the spectrum, as signals in those regions consequently have a higher Signal-to-Noise Ratio (SNR).
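
The bin-wise noise floor estimation and subtraction can be sketched outside the package as follows, for an STFT magnitude matrix with one row per FFT bin (illustrative R code, not the package internals):

# stft_db: matrix of STFT magnitudes in dB, rows = FFT bins, columns = windows
subtract_noise_floor <- function(stft_db) {
  noise_floor <- apply(stft_db, 1, median)  # one noise estimate per FFT bin
  sweep(stft_db, 1, noise_floor)            # subtract the floor row-wise
}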

A threshold function then selects candidate audio event locations from the normalised outputs of the detection function. This threshold function works by first marking the location at which the detection function crosses the trigger threshold level. The value of this trigger threshold (an SNR, expressed in dB) can be set using the threshold argument available in the threshold_detection function.

If the detection function does not fall below the threshold level after a certain audio event duration, the noise estimation is resumed from this location. This duration threshold can be set using the duration_thr argument. The duration_thr length is usually set at the maximum audio event duration of the targeted group of species. For bats in eastern Canada this parameter is set by default at 80 ms; for calls of Bicknell’s Thrush (Catharus bicknelli), it is recommended to set it at 440 ms (see Table 1).

Candidate audio events are subsequently filtered using the following rule: if the duration of a detected audio event is less than x milliseconds (ms), it is removed. This minimum duration filter is set with the min_dur argument available in the threshold_detection function, and helps remove spurious detections caused by transient noise. For bats in eastern Canada this parameter is set by default at 1.5 ms.

Temporal masking is also employed to reduce the influence of echoes on the detection function: an exponential decay curve is applied to the output of the detection function, which acts as an adaptive threshold. Echoes falling below this threshold do not contribute to the detection function, as they are masked by the louder preceding audio event. The exponential decay curve is defined as:

$$F[n] = \max\left(D[n],\ a \cdot F[n-1] + (1 - a) \cdot D[n]\right)$$

where F[n] is the threshold function, D[n] is the detection function, and a is the exponential decay gain. This echo suppression function aims to reduce the false alarms caused by echoes exceeding the energy threshold that triggers the detection. Exponential Decay Gain (EDG) values > 0.8 work well in practice, and the default is 0.996. The exponential decay gain can be set using the EDG argument in the threshold_detection function.
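
The recursion above translates directly into R (an illustrative sketch, not the package internals):

# D: detection function values, one per analysis window; a: exponential decay gain
echo_mask <- function(D, a = 0.996) {
  F_thr <- numeric(length(D))
  F_thr[1] <- D[1]
  for (n in seq_along(D)[-1]) {
    F_thr[n] <- max(D[n], a * F_thr[n - 1] + (1 - a) * D[n])
  }
  F_thr
}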

The detection function D[n] is generated by first summing all spectral magnitudes in frequency bands that are greater than both their local median values and the temporal masking threshold. This value is considered to be the audio event content, as defined above. A noise estimate is then taken as the sum of all local median values. Finally, the detection function is expressed as an SNR in dB:

$$SNR_{dB} = 10 \log_{10}\left(\frac{signal}{noise}\right)$$

The function uses by default an FFT window size of 256 points (set with the FFT_size argument), with a Blackman-Harris 7-term window to reduce spectral leakage (Harris, 1978). Larger FFT windows produce finer frequency resolution (i.e., an increased number of FFT bins, each with a narrower frequency span), but increase computation time and reduce temporal resolution due to Gabor’s uncertainty principle (Gabor, 1946). The overlap between two consecutive FFT windows is set by default at 87.5% (set with the FFT_overlap argument). The bioacoustics package relies on the FFTW library for efficient Fourier transforms.
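
As a worked example of this trade-off, using the 500 kHz sampling rate of the bundled myotis recording and the default settings:

sample_rate <- 500000  # Hz
FFT_size    <- 256
FFT_overlap <- 0.875

sample_rate / FFT_size                              # ~1953 Hz spanned by each FFT bin
FFT_size * (1 - FFT_overlap)                        # 32 samples advanced between windows
1000 * FFT_size * (1 - FFT_overlap) / sample_rate   # 0.064 ms time step between windows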

The above-mentioned functions have been tested and their performance evaluated on bat echolocation calls in Scott (2012). Tests on a real-world dataset of field recordings confirmed that the modified spectral sum function with background noise reduction and echo suppression outperformed the two conventional approaches (i.e., the spectral sum and spectral peak functions) in terms of accuracy. The noise subtraction function performed well with all signal types across a broad range of thresholds, making it simpler to apply in practice where the recording content is not known a priori. As normalisation is applied in real time through local background noise subtraction, the length of the recording to be analysed has no effect on function efficiency (Scott, 2012).

Extraction with filtering and smoothing

High Pass (HPF) and Low Pass Filters (LPF) can be employed to reduce the amount of unwanted noise in the recording, or to track particular audio events within a narrower frequency bandwidth than the recording sampling rate allows. Frequencies below the HPF cutoff and above the LPF cutoff are greatly attenuated. These frequency filters can be set using the HPF and LPF arguments in the threshold_detection function. Note that although these filters are described in this section, they are applied by the threshold_detection function just after the conversion of the recording into the time/frequency domain using the Fast Fourier Transform (FFT).

After the recording has been filtered and the candidate audio events have been located, a second series of functions is employed to filter, extract, and smooth each audio event in the time/frequency domain. The audio event extraction function searches the FFT windows outwards from each audio event centroid (i.e., the location of the audio event’s peak energy content), moving right-to-left to find the start of the audio event and left-to-right to find its end.

The audio event extraction function relies on three different thresholds to either continue or stop its search for the next FFT window: the energy content (expressed in dB), the SNR (dB), and the angle of the next FFT window. These thresholds can be set using the start_thr, end_thr, SNR_thr, and angle_thr arguments, respectively. The extraction stops as soon as the next FFT window falls below any of these thresholds. Note that the start and end thresholds can be set independently, because audio events may have a louder part before (start) or after (end) their peak of maximum energy (dB). A function then checks, along the temporal axis of the recording, whether an audio event has been extracted twice and, if so, keeps only the longest one.

The extracted audio event sequence is smoothed with a Kalman filtering function, whose parameters can be set using the KPE (Kalman Process Error) and KME (Kalman Measurement Error) arguments. Another series of filtering thresholds is applied to the Kalman-filtered outputs, such as minimum duration (min_dur), maximum duration (max_dur), minimum time between two audio events (min_TBE), and maximum time between two audio events (max_TBE). Acoustic features are then extracted from the filtered frequency (Hz) and energy (dB) outputs.
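
Putting these arguments together, a detection run with the eastern Canadian bat settings from Table 1 might look as follows. This is a sketch: the first argument is assumed to accept a file path (a Wave object loaded with read_audio also works), and the output slot names should be checked against your version of the package.

td <- threshold_detection(
  wav_file,               # path defined above
  threshold    = 14,      # trigger threshold (SNR, dB)
  min_dur      = 1.5,     # ms
  max_dur      = 80,      # ms
  min_TBE      = 30,      # ms
  max_TBE      = 1000,    # ms
  EDG          = 0.996,   # exponential decay gain
  LPF          = 250000,  # Hz
  HPF          = 16000,   # Hz
  duration_thr = 80,      # ms
  NWS          = 100,     # ms
  settings     = TRUE     # keep the settings as metadata
)
td$data$event_data        # extracted acoustic features, one row per audio event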

Acoustic features and metadata

Feature Unit Description
starting_time sec Location of the audio event in the recording
duration ms Duration of the audio event
freq_max_amp Hz Frequency of the maximum energy of the audio event
freq_max Hz Highest frequency of the audio event
freq_min Hz Lowest frequency of the audio event
bandwidth Hz Difference between the highest (freq_max) and lowest (freq_min) frequencies
freq_start Hz Frequency at the start of the audio event
freq_center Hz Frequency at the middle of the audio event
freq_end Hz Frequency at the end of the audio event
freq_knee Hz Frequency at which the slope is the steepest (knee)
freq_c Hz Frequency at which the slope is the flattest (characteristic frequency)
freq_bw_knee_fc Hz Frequency bandwidth between the knee and the characteristic frequency
bin_max_energy Hz Frequency at the maximum of energy where the slope is the flattest
pc_freq_max_amp % Location of the frequency with the maximum of energy
pc_freq_min % Location of the minimum frequency
pc_fmax % Location of the maximum frequency
pc_knee % Location of the frequency at which the slope is the steepest
temp_bw_knee_fc % Temporal bandwidth between the knee and the characteristic frequency
slope Hz/ms Raw slope estimate (frequency bandwidth against duration)
kalman_slope Hz/ms Smoothed slope estimate after Kalman filtering
curve_pos_start Hz/ms Slope estimate at the beginning of the audio event
curve_pos_end Hz/ms Slope estimate at the end of the audio event
curve_neg Hz/ms Slope negative entropy
mid_offset dB Mid-offset
snr dB Signal-to-noise ratio
harmonic_distortion dB Level of harmonic distortion
smoothness Time/frequency regularity

The metadata embedded in the WAV file header can be extracted by setting the metadata argument to TRUE. Metadata can then be extracted from the object produced by threshold_detection using the metadata function.

The detection and extraction settings used with the threshold_detection and blob_detection functions can be saved as metadata by setting the settings argument to TRUE.

Blob detection

The blob_detection function is a modified version of the Bat Classify software developed by Christopher Scott (2014). It combines several image processing, filtering, and image feature extraction techniques.

Audio event detection and extraction

As in the threshold detection section above, any discrete portion of a recorded signal above the level of background noise is defined as an ‘audio event’, which may include animal vocalisations, call echoes (e.g., for bats), and other noise sources, either biotic (e.g., stridulating insects) or abiotic (flowing water, rain, wind, and wind-induced vegetation noise).

The spectral energy content of a recording is typically revealed by applying a Short-Time Fourier Transform (STFT). This technique slides an analysis window of fixed size through the recording. The output from the windowed FFT analysis is a discrete set of values from which the locations of the audio events can be identified through simple thresholding after background noise subtraction (Scott, 2012).

The blob_detection function uses by default an FFT window size of 256 points (set with the FFT_size argument), with a Blackman-Harris 4-term window to reduce spectral leakage (Harris, 1978). Larger FFT windows produce finer frequency resolution (an increased number of FFT bins, each with a narrower frequency span), but increase computation time and reduce temporal resolution due to Gabor’s uncertainty principle (Gabor, 1946). The overlap between two consecutive FFT windows is set by default at 87.5% (set with the FFT_overlap argument). The bioacoustics package relies on the FFTW library for efficient Fourier transforms.

FFT values are used to build the spectrogram representation of the audio event for image processing. A Gaussian blur is first applied to smooth the spectrogram; the level of blur can be changed using the blur argument. This reduces the level of Gaussian noise in the spectrogram. The background noise is then subtracted. The background subtraction uses a local average of the energy spectrum intensity to estimate the amount of noise to subtract from each spectrogram; this estimate is then multiplied by the value of the bg_substract argument. By estimating the noise floor locally, the function can dynamically react to local variations in noise floor intensity. A contrast boost function is finally applied to the normalised spectrogram values to increase the definition of the audio event contour against the background noise. The level of contrast boost can be set with the contrast_boost argument.

A ‘blob detection’ algorithm is applied to the processed spectrogram to detect Regions Of Interest (ROIs). This function relies on the linear-time connected component labelling algorithm described in Chang et al. (2004), as adapted by Andrew Brampton (2011). The resulting function simultaneously labels the connected FFT values (or ‘blobs’) and their contours in the spectrogram. Both the external and, where present, internal contours of each blob are detected and labelled. Labelling is done in a single pass over the spectrogram, while contour points are revisited more than once, up to four times (Chang et al., 2004). Moreover, the detection function extracts a blob as a sequential order of contour points, i.e. a ‘segment’, which is useful in the case of animal vocalisations.

The blob_detection function then discards any extracted segment whose area is < n pixels. This can be set using the min_area argument, which is set by default at 40 pixels to best extract bat echolocation calls. This filtering parameter may be set differently to extract vocalisations from other groups of animals. The segment is finally log-compressed, to convert magnitude into dB, before feature extraction.

A series of filtering functions is also applied to the segments, such as minimum duration (min_dur), maximum duration (max_dur), minimum time between two segments (min_TBE), and maximum time between two segments (max_TBE), to reduce the amount of unwanted noise or to track specific vocalisations within a narrower temporal window. In addition to these temporal filters, High Pass (HPF) and Low Pass frequency filters (LPF) can be set using the HPF and LPF arguments in the blob_detection function. Frequencies below the HPF cutoff and above the LPF cutoff are greatly attenuated. Acoustic features are then extracted from these filtered segments.
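
A minimal blob_detection call with the arguments described above might look like this; the numeric values for blur and contrast_boost are illustrative placeholders, not recommendations.

bd <- blob_detection(
  wav_file,                 # path defined above
  min_dur        = 1.5,     # ms
  max_dur        = 80,      # ms
  min_area       = 40,      # pixels
  blur           = 2,       # illustrative blur level
  contrast_boost = 20,      # illustrative contrast boost level
  HPF            = 16000,   # Hz
  LPF            = 250000   # Hz
)
bd$data$event_data          # extracted acoustic features, one row per audio event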

Acoustic features and metadata

The spectral and temporal moments are extracted from the audio event contour points (the ‘segment’), and a gradient histogram is built from the sequence of frequencies (Hz) and energies (expressed in dB) stored in each pixel value, for each extracted audio event. These sequences are available to users in addition to the other acoustic features in the function’s output.

Feature Unit Description
starting_time sec Location of the audio event in the recording
duration ms Duration of the audio event
area pixels Estimated area of the audio event (in pixels)
freq_centroid Hz Frequency at the centroid of the extracted audio event
freq_bandwith Hz Difference between the highest and lowest frequencies of the audio event
freq_skew Hz Skewness of the frequency distribution of the audio event
freq_kurtosis Hz Kurtosis of the frequency distribution of the audio event
q Hz Centroid frequency divided by the frequency bandwidth of the audio event
freq_gini_impurity Hz Degree of smoothness of the frequency distribution of the audio event
quant_2.5 Hz 2.5th percentile of the frequency distribution of the audio event
quant_25 Hz 25th percentile of the frequency distribution of the audio event
quant_50 Hz 50th percentile of the frequency distribution of the audio event
quant_75 Hz 75th percentile of the frequency distribution of the audio event
quant_97.5 Hz 97.5th percentile of the frequency distribution of the audio event
freq_bw_95_ci Hz Frequency bandwidth between the 97.5th and the 2.5th percentiles
freq_bw_75_ci Hz Frequency bandwidth between the 75th and the 25th percentiles
temp_centroid ms Time at the centroid of the extracted audio event
temp_bandwith ms Time difference between the beginning and the end of the audio event
temp_skew ms Skewness of the time distribution of the audio event
temp_kurtosis ms Kurtosis of the time distribution of the audio event
temp_gini_impurity ms Degree of smoothness of the temporal distribution of the audio event
grad_centroid Gradient at the centroid of the extracted audio event
grad_bandwith Gradient difference between the beginning and the end of the audio event
grad_skew Skewness of the gradient distribution of the audio event
grad_kurtosis Kurtosis of the gradient distribution of the audio event
grad_gini_impurity Degree of smoothness of the gradient distribution of the audio event

Zero-Crossing

Read / Write

The read_zc function reads into R a Zero-Crossing file generated by various bat recorders or by the Kaleidoscope software (Wildlife Acoustics, Inc.). This function is a modified version of Peter Wilson’s Anabat Tools (2013) and of the C source code provided by Chris Corben.

The write_zc function writes or re-writes a Zero-Crossing file loaded with the read_zc function.

Metadata

The metadata function automatically extracts the file header from a Zero-Crossing file previously loaded with the read_zc function.

Plot

The plot_zc function can be used to plot the content of a Zero-Crossing file previously loaded with the read_zc function. This function is a modified version of Peter Wilson’s Anabat Tools (2013).

Datasets

Myotis

The myotis dataset is a 10-second, 16-bit, mono WAV file with an original sampling rate of 500 kHz (time-expanded by 10). It contains 11 echolocation calls of bats from the Myotis genus. The recording was made in the United Kingdom with a D500X bat detector from Pettersson Elektronik AB.
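
Assuming the dataset is exported as a data object, it can be loaded and displayed as follows:

data(myotis, package = "bioacoustics")
spectro(myotis)  # display the 11 Myotis echolocation calls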

zc

The zc dataset is a Zero-Crossing file of 16384 dots containing a sequence of 24 echolocation calls of a hoary bat (Lasiurus cinereus). This ZC recording was made in Gatineau Park, Quebec, eastern Canada, during the summer of 2017 with a Walkabout bat detector from Titley Scientific.
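
Similarly, assuming the dataset is exported as a data object:

data(zc, package = "bioacoustics")
plot_zc(zc)   # display the Zero-Crossing sequence
metadata(zc)  # extract the file header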

References

Chang, F., Chen, C.-J., & Lu, C.-J. (2004). A linear-time component-labeling algorithm using contour tracing technique. Computer Vision and Image Understanding, 93(2), 206-220.

Gabor, D. (1946). Theory of communication. Part 1: The analysis of information. Journal of the Institution of Electrical Engineers - Part III: Radio and Communication Engineering, 93(26), 429-441.

Harris, F. J. (1978). On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE, 66(1), 51-83.

Scott, C. D. (2012). Automated techniques for bat echolocation call analysis. PhD thesis. The University of Leeds Institute of Integrative and Comparative Biology, University of Leeds, Leeds, UK. 166 pages.

Skowronski, M. D., & Fenton, M. B. (2009). Detecting bat calls: An analysis of automated methods. Acta Chiropterologica, 11(1), 191-203.

See also

# Package tutorial
vignette("tutorial", package = "bioacoustics")

Table 1. Example of detection, filtering and extraction settings used with the threshold_detection function for eastern Canadian bats and for calls of Bicknell’s Thrush (Catharus bicknelli).

Parameter Eastern Canadian bats Bicknell’s Thrush calls
threshold 14 12
time_exp 1 1
min_dur 1.5 140
max_dur 80 440
min_TBE 30 300
max_TBE 1000 5000
EDG 0.996 0.996
LPF 250000 8000
HPF 16000 2000
FFT_size 256 256
FFT_overlap 0.875 0.875
start_thr 40 25
end_thr 20 30
SNR_thr 10 10
angle_thr 40 45
duration_thr 80 440
NWS 100 1000
KPE 1e-05 1e-05
KME 1e-05 1e-04

  1. See Table 1 for examples of extraction arguments for different species groups.