The bioacoustics package contains all the necessary R functions to read audio recordings of various formats (e.g., WAV, WAC, MP3, Zero-Crossing), filter noisy files, display audio signals, and automatically detect and extract acoustic features for further analysis, such as species identification based on the classification of animal vocalisations. This package does not provide R functions to perform classification tasks. Other packages such as randomForest, extraTrees, mclust, or keras can be used alongside the bioacoustics package to perform these tasks. Complementary functions for audio processing in R are also available in the tuneR, monitoR, warbleR, and seewave packages.
This package was originally built for bat bioacoustics, and thus the default arguments for all the available functions are set for bats. Users will have to set the arguments according to the group of animal vocalisations they wish to study1. The bioacoustics package can be used to display and extract acoustic features from recordings of birds, dolphins, frogs, insects, and other terrestrial or marine animals, as long as their vocalisations can be recorded.
This package is structured around three main components that contain the R functions accessible to users; the internal processes involved are also described below for clarity and ease of comprehension. The functions of this package are designed to address three main needs:

- Audio import and conversion: mp3_to_wav, read_audio (read_mp3, read_wac, read_wav), read_zc, resample, write_zc
- Display and metadata extraction: fspec, metadata, plot_zc, spectro
- Audio event detection and feature extraction: blob_detection, threshold_detection
The read_audio function loads mono and stereo audio recordings in MP3, WAV, or WAC (a proprietary format from Wildlife Acoustics) formats into R. read_audio is essentially a helper that calls format-specific functions (read_mp3, read_wac, read_wav) to decode the various audio recording formats.
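As an illustration, the sketch below loads the example WAV recording bundled with the package; the extdata path is an assumption based on common package layouts, so adjust it to any recording at hand:

```r
library(bioacoustics)

# Path to the example recording shipped with the package (assumed location)
wav_file <- system.file("extdata", "myotis.wav", package = "bioacoustics")

# read_audio dispatches to read_wav here; read_mp3 and read_wac handle
# the other supported formats
myotis <- read_audio(wav_file)
```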
The metadata function automatically extracts metadata (if present) embedded in audio recordings that were previously loaded into R with the read_audio function. It can also extract metadata from objects generated with the threshold_detection or blob_detection functions. The metadata function also extracts GUANO metadata. GUANO stands for the “Grand Unified Acoustic Notation Ontology” and is meant to be a universal, extensible, open metadata format for bat (and non-bat) acoustic recordings. The R code necessary to extract GUANO metadata was originally provided by David Riggs under the MIT licence.
The fspec function returns a matrix describing the power spectrum by frequency of a time wave using the Fast Fourier Transform (FFT). Values are expressed in decibels (dB).
The spectro function generates a frequency spectrum representation of a time wave using the Fast Fourier Transform (FFT).
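Both functions operate on a previously loaded time wave; a minimal sketch (the example file path is an assumption):

```r
library(bioacoustics)

wave <- read_audio(system.file("extdata", "myotis.wav", package = "bioacoustics"))

# Power spectrum matrix (dB) by frequency bin and time window
spec <- fspec(wave)
dim(spec)

# Spectrogram display of the same time wave
spectro(wave)
```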
The mp3_to_wav function converts MP3 files to WAV files.
The resample function up- or down-samples the sample rate of a given audio recording.
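For example (the file names are placeholders, and the target-rate argument name passed to resample is an assumption to verify against its help page):

```r
library(bioacoustics)

# Convert an MP3 recording to a WAV file on disk (placeholder file name)
mp3_to_wav("recording.mp3")

# Down-sample a loaded recording to a lower sample rate
# (the 'to' argument name is an assumption; see ?resample)
wave    <- read_audio("recording.wav")
wave_ds <- resample(wave, to = 192000)
```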
The threshold_detection function is a modified version of the Bat Bioacoustics software developed by Christopher Scott (2012). It combines several algorithms for detecting, filtering, and segmenting audio events, and for extracting audio features.
The proposed audio event detection function is a modified spectral sum function for peak-picking, with background noise reduction and echo suppression (Scott, 2012).
A recording typically contains broadband continuous background noise, and discrete pulses of acoustic energy expressed in dB. Any discrete portion of a recorded signal above the level of background noise in a recording is defined here as an ‘audio event’. Under this definition, an audio event may include animal vocalisations, call echoes (e.g., for bats), and other noise sources either biotic (e.g., stridulating insects) or abiotic (flowing water, rain, wind, and wind-induced vegetation noise).
Audio events can be detected in a recording by their energy content, by defining a threshold rule that selects discrete portions of the recording containing energy (dB) above a fixed threshold. The spectral energy content of a recording is typically revealed by applying a Short-Time Fourier Transform (STFT), a technique that slides an analysis window of fixed size through the recording. The output from the windowed FFT analysis is a discrete set of values from which the locations of audio events can be identified through simple energy thresholding and peak-picking.
The two conventional methods of bioacoustic signal detection are the spectral peak and the spectral sum functions. For a signal x at time n, X[n] is defined as its STFT, where |Xk[n]| is the spectral magnitude of the kth FFT bin at time n. The spectral sum function calculates the sum of the STFT magnitudes (the total energy over the entire spectrum) at each consecutive window through the recording to create the following detection function:
$$D_{sum}[n] = \sum_{k} |X_k[n]|$$
This conventional signal detection function depends on both the gain level and the background noise level of the recording. This complicates the setting of a consistent detection threshold value, as recordings may be made at different levels, thus requiring different threshold settings to detect the exact same audio events.
To resolve this problem, the detection function should be normalised for each recording by subtracting the median value over all analysis windows from each detection function data point (Skowronski & Fenton, 2009). The median value is an estimate of the noise floor of the recording, and this process of median offsetting allows the use of a fixed threshold parameter that is independent of the recording level. However, this normalisation requires that the entire recording be acquired prior to processing, ruling out real-time analysis.
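The spectral sum detection function and its median-offset normalisation can be sketched in a few lines of base R; the STFT magnitudes below are simulated rather than computed from a real recording:

```r
set.seed(1)

# Simulated STFT magnitudes: 129 frequency bins x 500 analysis windows,
# with a burst of signal energy in windows 200 to 220
stft_mag <- matrix(abs(rnorm(129 * 500, mean = 1)), nrow = 129)
stft_mag[, 200:220] <- stft_mag[, 200:220] + 5

# Spectral sum detection function: total magnitude per analysis window
d_sum <- colSums(stft_mag)

# Median offsetting: subtracting the median over all windows makes a
# fixed threshold independent of the recording level
d_norm <- d_sum - median(d_sum)

# Fixed-threshold peak picking on the normalised detection function
which(d_norm > 100)
```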
This is why the threshold_detection function estimates the noise floor using only past values of the recording (i.e., fixed windows of previous analysis frames). Prior knowledge is therefore not required for the normalisation of the detection function, and real-time analysis remains possible. In addition, by estimating the noise floor locally, the function can dynamically react to changes in the magnitude of background noise within the recording. The size of the noise estimation window can be set using the NWS argument. We recommend a time window of 100 ms for short recordings (typically < 1 min long) containing bat echolocation calls. However, we have found that a time window of 5000 ms is ideal for long recordings (typically > 60 min long) containing bird vocalisations.
The noise floor is estimated and subtracted independently for each spectral bin of the FFT spectrum. Environmental noise and microphone self-noise are rarely white in nature (i.e., of equal power at all frequencies), and are typically weighted more heavily at the low end of the frequency spectrum. Frequency-specific noise subtraction can attenuate the noisier low-frequency regions of the spectrum more heavily than the higher-frequency, lower-noise parts. This process increases the sensitivity of the audio event detection in higher-frequency regions of the spectrum, as signals in those regions consequently have a higher Signal to Noise Ratio (SNR).
A threshold function then selects candidate audio event locations from the normalised outputs of the detection function. This threshold function works by first marking the location at which the detection function crosses the trigger threshold level. The value (in dB above SNR) of this trigger threshold can be set using the threshold argument of the threshold_detection function. If the detection function does not fall below the threshold level (in dB above SNR) after a certain audio event duration, the noise estimation is resumed from this location. This duration threshold can be set using the duration_thr argument. The duration_thr length is usually set at the maximum audio event duration of the targeted group of species. For bats in eastern Canada, this parameter is set by default at 80 ms. For calls of Bicknell’s Thrush (Catharus bicknelli), it is recommended to set it at 440 ms (see Table 1).
Candidate audio events are subsequently filtered using the following rule: if the duration of the detected audio event is less than x milliseconds (ms), it is removed. This minimum duration filter is set with the min_dur argument of the threshold_detection function. This duration threshold helps remove spurious detections caused by transient noise. For bats in eastern Canada, this parameter is set by default at 1.5 ms.
A temporal masking is also employed to reduce the influence of echoes on the detection function: an exponential decay curve is applied to the output of the detection function, which acts as an adaptive threshold. Echoes falling below the threshold do not contribute to the detection function as they are masked by the louder preceding audio event. The exponential decay curve is defined as:
$$F[n] = \max\left(D[n],\ a \cdot F[n-1] + (1 - a) \cdot D[n]\right)$$
where F[n] is the threshold function, D[n] is the detection function, and a is the exponential decay gain. This echo suppression function aims to reduce the false alarms caused by echoes exceeding the energy threshold that triggers the detection. Exponential Decay Gain (EDG) values > 0.8 work well in practice, and the default is set at 0.996. The exponential decay gain can be set using the EDG argument of the threshold_detection function.
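The recursion can be illustrated with a toy detection function in base R (the values are made up: a loud audio event followed by a quieter echo):

```r
# Toy detection function: a loud event (30, 28, 10) then a quieter echo (12, 10)
D <- c(rep(0, 10), 30, 28, 10, rep(0, 5), 12, 10, rep(0, 10))

a <- 0.996  # exponential decay gain (EDG), the package default

# F[n] = max(D[n], a * F[n-1] + (1 - a) * D[n])
F_thr <- numeric(length(D))
F_thr[1] <- D[1]
for (n in 2:length(D)) {
  F_thr[n] <- max(D[n], a * F_thr[n - 1] + (1 - a) * D[n])
}

# The echo at positions 19-20 stays below the slowly decaying threshold
# left by the louder preceding event, so it is masked
round(F_thr, 2)
```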
The detection function D[n] is generated by first summing all spectral magnitudes in frequency bands that are greater than both their local median values and the temporal masking threshold. This value is considered to be the audio event content as defined above. A noise estimate is then taken as the sum of all local median values. Finally, the detection function is expressed as the SNR in dB:
$$SNR = 10 \log_{10}\left(\frac{signal}{noise}\right)$$
The function uses by default an FFT window size of 256 points (which can be set using the FFT_size argument), with a Blackman-Harris 7-term window to reduce spectral leakage (Harris, 1978). Larger FFT windows produce finer frequency resolution (i.e., an increased number of FFT bins, each with a narrower frequency span), but increase computation time and reduce temporal resolution due to Gabor’s uncertainty principle (Gabor, 1946). The overlap between two consecutive FFT windows is set by default at 87.5 % (and can be changed using the FFT_overlap argument). The bioacoustics package relies on the FFTW library for efficient Fourier transforms.
The above-mentioned functions have been tested and their performance evaluated on bat echolocation calls in Scott (2012). Tests on a real-world dataset of field recordings confirmed that the modified spectral sum function with background noise reduction and echo suppression outperformed the two conventional approaches (i.e., the spectral sum and spectral peak functions) in terms of accuracy. The noise subtraction function performed well with all signal types across a broad range of thresholds, making it simpler to apply in practice where the recording content is not known a priori. As normalisation is applied in real time through local background noise subtraction, the length of the recording to be analysed has no effect on function efficiency (Scott, 2012).
High Pass (HPF) and Low Pass (LPF) filters can be employed to reduce the amount of unwanted noise in the recording, or to track particular audio events within a narrower frequency bandwidth than allowed by the recording sampling rate. Frequencies below the HPF cutoff and above the LPF cutoff are greatly attenuated. These frequency filters can be set using the HPF and LPF arguments of the threshold_detection function. Note that these filters are described in this section, but are applied by the threshold_detection function just after the conversion of the recording into the time / frequency domain using the Fast Fourier Transform (FFT).
After the recording has been filtered and the candidate audio events have been located, a second series of functions is employed to filter, extract, and smooth the audio events in the time / frequency domain. Starting from each audio event centroid (i.e., the location of the audio event at the peak of its maximum energy content), the audio event extraction function searches the FFT windows from right to left for the start of the audio event, and from left to right for its end.
The audio event extraction function relies on three different thresholds to either pursue or stop its search for the next FFT window: the energy content (expressed in dB), the SNR (dB), and the angle of the next FFT window. These thresholds can be set using the start_thr, end_thr, SNR_thr, and angle_thr arguments, respectively. The extraction stops as soon as the next FFT window falls under the minimum value of any of these thresholds. Note that the start and end thresholds can be set independently, because audio events may have a louder part before (start) or after (end) their peak of maximum energy (dB). A function then checks, along the temporal axis of the recording, whether an audio event has been extracted twice and, if so, keeps only the longest audio event.
The extracted audio event sequence is smoothed with a Kalman filtering function, whose parameters can be set using the KPE (Kalman Process Error) and KME (Kalman Measurement Error) arguments. Another series of filtering thresholds is applied to the Kalman-filtered outputs: minimum duration (min_dur), maximum duration (max_dur), minimum time between two audio events (min_TBE), and maximum time between two audio events (max_TBE). Acoustic features are then extracted from the filtered frequency (Hz) and energy (dB) outputs.
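Putting these arguments together, a typical call for bat recordings might look like the sketch below; the example file path and the event_data output slot are assumptions to check against the package documentation, while the argument values are the documented defaults for eastern Canadian bats:

```r
library(bioacoustics)

wave <- read_audio(system.file("extdata", "myotis.wav", package = "bioacoustics"))

# Detection and extraction with the bat-oriented defaults made explicit
td <- threshold_detection(
  wave,
  threshold = 14,    # trigger threshold, in dB above SNR
  min_dur   = 1.5,   # discard audio events shorter than 1.5 ms
  max_dur   = 80,    # discard audio events longer than 80 ms
  EDG       = 0.996, # exponential decay gain for echo suppression
  NWS       = 100,   # noise estimation window size, in ms
  KPE       = 1e-05, # Kalman Process Error
  KME       = 1e-05  # Kalman Measurement Error
)

# One row of acoustic features per detected audio event
head(td$data$event_data)
```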
Feature | Unit | Description |
---|---|---|
starting_time | sec | Location of the audio event in the recording |
duration | ms | Duration of the audio event |
freq_max_amp | Hz | Frequency of the maximum energy of the audio event |
freq_max | Hz | Highest frequency of the audio event |
freq_min | Hz | Lowest frequency of the audio event |
bandwidth | Hz | Difference between the highest (freq_max) and lowest (freq_min) frequencies |
freq_start | Hz | Frequency at the start of the audio event |
freq_center | Hz | Frequency at the midpoint of the audio event |
freq_end | Hz | Frequency at the end of the audio event |
freq_knee | Hz | Frequency at which the slope is the steepest (knee) |
freq_c | Hz | Frequency at which the slope is the flattest (characteristic frequency) |
freq_bw_knee_fc | Hz | Frequency bandwidth between the knee and the characteristic frequency |
bin_max_energy | Hz | Frequency at the maximum of energy where the slope is the flattest |
pc_freq_max_amp | % | Location of the frequency with the maximum of energy |
pc_freq_min | % | Location of the minimum frequency |
pc_fmax | % | Location of the maximum frequency |
pc_knee | % | Location of the frequency at which the slope is the steepest |
temp_bw_knee_fc | % | Temporal bandwidth between the knee and the characteristic frequency |
slope | Hz / ms | Raw slope estimate (frequency bandwidth against duration) |
kalman_slope | Hz / ms | Smoothed slope estimate after Kalman filtering |
curve_pos_start | Hz / ms | Slope estimate at the beginning of the audio event |
curve_pos_end | Hz / ms | Slope estimate at the end of the audio event |
curve_neg | Hz / ms | Slope negative entropy |
mid_offset | dB | Mid-offset |
snr | dB | Signal to noise ratio |
harmonic_distortion | dB | Level of harmonic distortion |
smoothness | | Time / frequency regularity |
The metadata embedded in the WAV file header can be extracted by setting the metadata argument to TRUE. Metadata can then be extracted from the object produced by threshold_detection using the metadata function. The detection and extraction settings used with the threshold_detection and blob_detection functions can be saved as metadata by setting the settings argument to TRUE.
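In practice, the two arguments combine as follows (the example file path is assumed, as above):

```r
library(bioacoustics)

wave <- read_audio(system.file("extdata", "myotis.wav", package = "bioacoustics"))

# Keep the WAV header metadata and the detection settings in the output
td <- threshold_detection(wave, metadata = TRUE, settings = TRUE)

# Retrieve both afterwards with the metadata function
metadata(td)
```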
The blob_detection function is a modified version of the Bat Classify software developed by Christopher Scott (2014). It combines several image processing, filtering, and image feature extraction techniques.
A recording typically contains broadband continuous background noise, and discrete pulses of acoustic energy expressed in dB. Any discrete portion of a recorded signal above the level of background noise in a recording is defined here as an ‘audio event’. Under this definition, an audio event may include animal vocalisations, call echoes (e.g., for bats), and other noise sources, either biotic (e.g., stridulating insects) or abiotic (flowing water, rain, wind, and wind-induced vegetation noise).
The spectral energy content of a recording is typically revealed by applying a Short-Time Fourier Transform (STFT). This technique slides an analysis window of fixed size through the recording. The output from the windowed FFT analysis is a discrete set of values from which the locations of the audio events can be identified through simple thresholding after background noise subtraction (Scott, 2012).
The blob_detection function uses by default an FFT window size of 256 points (which can be changed using the FFT_size argument), with a Blackman-Harris 4-term window to reduce spectral leakage (Harris, 1978). Larger FFT windows produce finer frequency resolution (an increased number of FFT bins, each with a narrower frequency span), but increase computation time and reduce temporal resolution due to Gabor’s uncertainty principle (Gabor, 1946). The overlap between two consecutive FFT windows is set by default at 87.5 % (and can be changed using the FFT_overlap argument). The bioacoustics package relies on the FFTW library for efficient Fourier transforms.
FFT values are used to display the spectrogram representation of the audio event for image processing. A Gaussian blur is applied to smooth the spectrogram, and the level of blur can be changed using the blur argument. This function reduces the level of Gaussian noise in the spectrogram. The background noise is then subtracted. The background subtraction uses a local average of the energy spectrum intensity to estimate the amount of noise to subtract from each spectrogram. This estimate is then multiplied by the value of the bg_substract argument. By estimating the noise floor locally, the function can dynamically react to local variations in the noise floor intensity. A contrast boost function is finally applied to the normalised spectrogram values to increase the definition of the audio event contour against the background noise. The level of contrast boost can be set with the contrast_boost argument.
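The three image pre-processing arguments can be sketched as below; the values shown are illustrative rather than the documented defaults (check ?blob_detection), and the example file path is assumed:

```r
library(bioacoustics)

wave <- read_audio(system.file("extdata", "myotis.wav", package = "bioacoustics"))

bd <- blob_detection(
  wave,
  blur           = 2,  # level of Gaussian blur applied to the spectrogram
  bg_substract   = 20, # multiplier of the local background noise estimate
  contrast_boost = 20  # contour definition against the background noise
)
```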
A ‘blob detection’ algorithm is applied on the processed spectrogram to detect a Region Of Interest (ROI). This function relies on the linear-time connected component labelling algorithm described in Chang et al. (2004), and adapted from Andrew Brampton (2011). The resulting function simultaneously labels the connected FFT values (or ‘blob’) and their contours in the spectrogram. Both external, and possibly internal contours of each blob are detected and labeled. Labeling is done in a single pass over the spectrogram, while contour points are revisited more than once and up to four times (Chang et al., 2004). Moreover, the detection function extracts a blob in a sequential order of contour points, i.e. a ‘segment’, which is useful in the case of animal vocalisations.
The blob_detection function then discards any extracted segment whose area is smaller than n pixels. This can be set using the min_area argument, and is set by default at 40 pixels to best extract bat echolocation calls. The values of these filtering parameters may be set differently to extract vocalisations from other groups of animals. The segment is finally log-compressed, to convert magnitude into dB, before feature extraction.
A series of filtering functions is also applied to the segments, such as minimum duration (min_dur), maximum duration (max_dur), minimum time between two segments (min_TBE), and maximum time between two segments (max_TBE), to reduce the amount of unwanted noise or to track specific vocalisations within a narrower temporal window. In addition to these temporal filters, High Pass (HPF) and Low Pass (LPF) frequency filters can be set using the HPF and LPF arguments of the blob_detection function. Frequencies below the HPF cutoff and above the LPF cutoff are greatly attenuated. Acoustic features are then extracted from these filtered segments.
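These temporal and frequency filters might be combined as follows for bat echolocation calls (illustrative values; the example file path is assumed):

```r
library(bioacoustics)

wave <- read_audio(system.file("extdata", "myotis.wav", package = "bioacoustics"))

bd <- blob_detection(
  wave,
  min_dur  = 1.5,    # ms, minimum segment duration
  max_dur  = 80,     # ms, maximum segment duration
  HPF      = 16000,  # Hz, attenuate frequencies below 16 kHz
  LPF      = 250000, # Hz, attenuate frequencies above 250 kHz
  min_area = 40      # pixels, minimum segment area
)
```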
The spectral and temporal moments are extracted from the audio event contour points (the ‘segment’), and a gradient histogram is built from the sequence of frequencies (Hz) and energy values (expressed in dB) stored in each pixel of each extracted audio event. These sequences are available to users in addition to the other acoustic features in the function’s output.
Feature | Unit | Description |
---|---|---|
starting_time | sec | Location of the audio event in the recording |
duration | ms | Duration of the audio event |
area | pixels | Estimated area of the audio event (in pixels) |
freq_centroid | Hz | Frequency at the centroid of the extracted audio event |
freq_bandwith | Hz | Difference between the highest (freq_max) and lowest (freq_min) frequencies |
freq_skew | Hz | Skewness of the frequency distribution of the audio event |
freq_kurtosis | Hz | Kurtosis of the frequency distribution of the audio event |
q | Hz | Centroid frequency divided by the frequency bandwidth of the audio event |
freq_gini_impurity | Hz | Degree of smoothness of the frequency distribution of the audio event |
quant_2.5 | Hz | 2.5th percentile of the frequency distribution of the audio event |
quant_25 | Hz | 25th percentile of the frequency distribution of the audio event |
quant_50 | Hz | 50th percentile of the frequency distribution of the audio event |
quant_75 | Hz | 75th percentile of the frequency distribution of the audio event |
quant_97.5 | Hz | 97.5th percentile of the frequency distribution of the audio event |
freq_bw_95_ci | Hz | Frequency bandwidth between the 97.5th and the 2.5th percentiles |
freq_bw_75_ci | Hz | Frequency bandwidth between the 75th and the 25th percentiles |
temp_centroid | ms | Time at the centroid of the extracted audio event |
temp_bandwith | ms | Time difference between the beginning and the end of the audio event |
temp_skew | ms | Skewness of the time distribution of the audio event |
temp_kurtosis | ms | Kurtosis of the time distribution of the audio event |
temp_gini_impurity | ms | Degree of smoothness of the temporal distribution of the audio event |
grad_centroid | | Gradient at the centroid of the extracted audio event |
grad_bandwith | | Gradient difference between the beginning and the end of the audio event |
grad_skew | | Skewness of the gradient distribution of the audio event |
grad_kurtosis | | Kurtosis of the gradient distribution of the audio event |
grad_gini_impurity | | Degree of smoothness of the gradient distribution of the audio event |
The read_zc function can read into R a Zero-Crossing file generated by various bat recorders and by the Kaleidoscope software (Wildlife Acoustics, Inc). This function is a modified version of Peter Wilson’s Anabat Tools (2013) and of the C source code provided by Chris Corben.
The write_zc function writes or re-writes a Zero-Crossing file previously loaded with the read_zc function.
The metadata function automatically extracts the file header from a Zero-Crossing file previously loaded with the read_zc function.
The plot_zc function can be used to plot the content of a Zero-Crossing file previously loaded with the read_zc function. This function is a modified version of Peter Wilson’s Anabat Tools (2013).
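A minimal Zero-Crossing workflow might look as follows; the file names are placeholders, and the write_zc argument name (filename) is an assumption to check against the help page:

```r
library(bioacoustics)

# Load a Zero-Crossing file (placeholder path)
zc <- read_zc("recording.zc")

# Extract the file header stored as metadata
metadata(zc)

# Display the zero-crossing dots
plot_zc(zc)

# Write the object back to a Zero-Crossing file
# (the 'filename' argument name is an assumption; see ?write_zc)
write_zc(zc, filename = "recording_copy.zc")
```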
The myotis dataset is a 10-second, 16-bit, mono WAV file with an original sampling rate of 500 kHz (time expanded by 10). It contains 11 echolocation calls of bats from the Myotis genus. The recording was made in the United Kingdom with a D500X bat detector from Pettersson Elektronik AB.
The zc dataset is a Zero-Crossing file of 16384 dots containing a sequence of 24 echolocation calls of a hoary bat (Lasiurus cinereus). This ZC recording was made in Gatineau Park, Quebec, eastern Canada, during the summer of 2017 with a Walkabout bat detector from Titley Scientific.
Chang, F., Chen, C.-J., & Lu, C.-J. (2004). A linear-time component-labeling algorithm using contour tracing technique. Computer Vision and Image Understanding, 93(2), 206-220.
Gabor, D. (1946). Theory of communication. Part 1: The analysis of information. Journal of the Institution of Electrical Engineers - Part III: Radio and Communication Engineering, 93(26), 429-441.
Harris, F. J. (1978). On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE, 66(1), 51-83.
Scott, C. D. (2012). Automated techniques for bat echolocation call analysis. PhD thesis. The University of Leeds Institute of Integrative and Comparative Biology, University of Leeds, Leeds, UK. 166 pages.
Skowronski, M. D., & Fenton, M. B. (2009). Detecting bat calls: An analysis of automated methods. Acta Chiropterologica, 11(1), 191-203.
Table 1. Examples of detection, filtering, and extraction settings for the threshold_detection function, for eastern Canadian bats and for calls of Bicknell’s Thrush (Catharus bicknelli).
Parameters | Eastern Canadian bats | Bicknell’s Thrush calls |
---|---|---|
threshold | 14 | 12 |
time_exp | 1 | 1 |
min_dur | 1.5 | 140 |
max_dur | 80 | 440 |
min_TBE | 30 | 300 |
max_TBE | 1000 | 5000 |
EDG | 0.996 | 0.996 |
LPF | 250000 | 8000 |
HPF | 16000 | 2000 |
FFT_size | 256 | 256 |
FFT_overlap | 0.875 | 0.875 |
start_thr | 40 | 25 |
end_thr | 20 | 30 |
SNR_thr | 10 | 10 |
angle_thr | 40 | 45 |
duration_thr | 80 | 440 |
NWS | 100 | 1000 |
KPE | 1e-05 | 1e-05 |
KME | 1e-05 | 1e-04 |
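For reference, the Bicknell’s Thrush column of Table 1 translates into the following threshold_detection call (the recording path is a placeholder):

```r
library(bioacoustics)

wave <- read_audio("bicknells_thrush_recording.wav")  # placeholder path

td <- threshold_detection(
  wave,
  threshold    = 12,
  time_exp     = 1,
  min_dur      = 140,
  max_dur      = 440,
  min_TBE      = 300,
  max_TBE      = 5000,
  EDG          = 0.996,
  LPF          = 8000,
  HPF          = 2000,
  FFT_size     = 256,
  FFT_overlap  = 0.875,
  start_thr    = 25,
  end_thr      = 30,
  SNR_thr      = 10,
  angle_thr    = 45,
  duration_thr = 440,
  NWS          = 1000,
  KPE          = 1e-05,
  KME          = 1e-04
)
```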
See Table 1 for examples of extraction arguments for different species groups↩︎