Make sure you have the latest version of the packages:
install.packages("bioacoustics")
# The bioacoustics package may also be installed from GitHub using devtools as follows:
install.packages("devtools")
devtools::install_github("wavx/bioacoustics") # For the latest, unstable version
install.packages("warbleR")
install.packages("randomForest")
We can use the query_xc() function from warbleR to download bird vocalizations from Xeno-Canto (https://www.xeno-canto.org/).
We are going to choose calls from Catharus bicknelli, and songs from Passerella iliaca and Setophaga magnolia, recorded in the United States and Canada.
We will keep only high-quality recordings ("A", or "A"/"B" for Passerella iliaca), pick up only the first nine of each species, and merge all the metadata into a single data frame "df".
This data frame will be used to download the MP3 files into your working directory directly from Xeno-Canto with the query_xc() function:
df1 <- query_xc(qword = 'Catharus bicknelli type:call cnt:"United States"', download = FALSE)
df1 <- df1[df1$Vocalization_type == "call", ]
df1 <- df1[df1$Quality == "A", ]
df1 <- df1[1:9, ]
df2 <- query_xc(qword = 'Setophaga magnolia type:song cnt:"Canada"', download = FALSE)
df2 <- df2[df2$Quality == "A", ]
df2 <- df2[1:9, ]
df3 <- query_xc(qword = 'Passerella iliaca type:song cnt:"Canada"', download = FALSE)
df3 <- df3[df3$Vocalization_type == "song", ]
df3 <- df3[df3$Quality %in% c("A", "B"), ]
df3 <- df3[1:9, ]
df <- rbind(df1, df2, df3)
rm(df1, df2, df3)
# Visualize your data frame
View(df)
# We will work in the R temp directory
wd <- tempdir()
# Create a data directory if it does not exist
data_dir <- file.path(wd, "data")
if(!dir.exists(data_dir))
dir.create(data_dir)
# Download the MP3 files into your data directory
query_xc(X = df, download = TRUE, path = data_dir)
Now that we have recordings stored in the data directory, we can read one of them to look at its structure and content. Let's use the read_audio() function in bioacoustics to manually read a recording of Catharus bicknelli:
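For example, reading the Catharus bicknelli recording used throughout the rest of this tutorial (the file name corresponds to the Xeno-Canto recording ID of the downloaded MP3):
# Read one of the downloaded MP3 files into a Wave object
CATBIC <- read_audio(file.path(data_dir, "Catharus-bicknelli-54864.mp3"))
CATBIC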
We can see that the MP3 file has been converted into a Wave object with 7551360 samples, a duration of 157.32 seconds, a sampling rate of 48000 Hz, a bit depth of 16 bits, and a single channel (mono; stereo recordings have two channels).
Remember that you just have to divide the number of samples by the sampling rate to retrieve the duration (in seconds) of an audio file.
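For example, with the Wave object we just read:
# 7551360 samples / 48000 Hz = 157.32 s
length(slot(CATBIC, "left")) / slot(CATBIC, "samp.rate")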
GUANO stands for the "Grand Unified Acoustic Notation Ontology" and is meant to be a universal, extensible, open metadata format for bat (ultrasonic) and non-bat (audible, infrasonic) acoustic recordings. The GUANO format is now embedded directly in the WAV files generated by most Pettersson, Wildlife Acoustics, and Titley Scientific acoustic recorders. The GUANO metadata embedded in a WAV file can be extracted with the metadata() function.
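A minimal sketch, assuming you have a GUANO-tagged WAV file from one of these recorders (the Xeno-Canto MP3s downloaded above do not carry GUANO metadata; check ?metadata for the exact arguments):
# Hypothetical GUANO-tagged WAV file (placeholder path)
bat_rec <- read_audio("path/to/guano_tagged_recording.wav")
metadata(bat_rec)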
Now that we have explored an audio file, we will use the Fast Fourier Transform (FFT) to compute a frequency-time representation of the recording, called a spectrogram. A spectrogram is the representation of the spectrum of frequencies in a recording as it varies through time. This representation, although not optimal, is still commonly used to detect animal vocalizations and to extract acoustic features useful for classification with the purpose of animal identification.
There are several options to display animal vocalizations in audio files with R. You can use either the spectro() or the fspec() function to generate spectrograms with bioacoustics. fspec() generates only the matrix of the spectrogram, and thus has to be used with the image() function to display the spectrogram. It is also possible to use the spectro() function in seewave.
Next, we will search manually and display an audio event (here, a bird vocalization) from a recording of Catharus bicknelli. To display a Region Of Interest (ROI) of the recording, we will use temporal and frequency filters. Let's start with a temporal slice from 1 to 10 seconds and an FFT size of 512 samples.
# Set plot margins to 0
par(mar = c(0, 0, 0, 0), oma = c(0, 0, 0, 0))
# Display with spectro()
ticks <- c(from = 1000, to = 20000, by = 1000) # frequency tick marks from 1 to
# 20 kHz, and steps at each 1 kHz
temp_slice <- c(from = 1, to = 10) # in seconds
spectro(CATBIC, tlim = temp_slice, FFT_size = 512, ticks_y = ticks)
Let's display spectrograms with various time / frequency limits (with the tlim= and flim= arguments). You can also play with other arguments of the spectro() and fspec() functions, such as the percentage of overlap between two FFTs (FFT_overlap=) and various FFT resolutions (FFT_size=). Note that the arguments are briefly explained in the documentation of each function:
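The help pages can be opened directly from the R console:
# Documentation of the spectrogram functions in bioacoustics
?bioacoustics::spectro
?bioacoustics::fspec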
First, let's shorten the temporal axis from 2 to 3.5 seconds to work on a shorter time window and compare the outputs of the spectro() and fspec() functions. Note that spectrograms can also be generated automatically when using the detection functions of bioacoustics. We will explore that in detail in section 4.1.
# Set plot margins to 0
par(mar = c(0, 0, 0, 0), oma = c(0, 0, 0, 0))
# Display the spectrogram with spectro()
ticks <- c(from = 1000, to = 20000, by = 1000) # frequency tick marks from 1 to
# 20 kHz, 1 kHz steps
temp_slice <- c(from = 2, to = 3.5) # in seconds
spectro(CATBIC, tlim = temp_slice, FFT_size = 512, ticks_y = ticks)
# fspec() gives you the spectrogram matrix with energy values (dB)
spec_mx <- fspec(CATBIC, tlim = temp_slice, FFT_size = 512, rotate = TRUE)
# You can display the spectrogram with image()
image(spec_mx, xaxt = "n", yaxt = "n")
The tick marks on the (frequency) y-axis were defined in the spectro() function from 1 to 20 kHz with an interval of 1 kHz. The FFT size was 512 samples, with an overlap between two FFT windows set by default at 0.875. Now try these settings: FFT size = 256, 1024 and 2048; FFT overlap = 0.3, 0.6, 0.9…
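For example, one of these combinations (any of the listed values can be substituted):
# Finer frequency resolution (1024-sample FFT) with 60% overlap between FFT windows
spectro(CATBIC, tlim = temp_slice, FFT_size = 1024, FFT_overlap = 0.6, ticks_y = ticks)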
Another interesting thing to do with the fspec() output is to implement your own set of filters. Let's try to reduce the background noise of a spectrogram with a narrower time and frequency bandwidth:
temp_slice <- c(from = 2.5, to = 3.5)
freq_slice <- c(from = 1500, to = 20000)
spec_o <- fspec(CATBIC, tlim = temp_slice, flim = freq_slice, FFT_size = 512, rotate = TRUE)
## min and max (range) dB intensity
range(spec_o) # -120 (min) to 0 dB (max)
# Note that the dynamic range of your recorders depends on the bit depth:
# 16-bit recorders offer only about 96 dB of dynamic range, and sound pressure
# above full scale is clipped at 0 dB.
## Let's try a filter by mean + sd intensity
spec_f <- fspec(CATBIC, tlim = temp_slice, flim = freq_slice, FFT_size = 512, rotate = TRUE)
spec_f[spec_f < mean(spec_f) + sd(spec_f)] <- -120
# Works well with high intensity audio events, but leads to
# false negatives (missed events) otherwise.
par(mar = c(0, 0, 0, 0), oma = c(0, 0, 0, 0))
image(spec_o, xaxt="n", yaxt="n")
image(spec_f, xaxt="n", yaxt="n")
The functions used to detect and extract audio events in a recording also rely on "generic" filters based on frequency and duration, along with other "specific" filters. Let's take a quick look at the generic filters available in the threshold_detection() and blob_detection() functions:
- Frequency filters: the HPF= and LPF= arguments in the threshold_detection() and blob_detection() functions.
- Duration filters: the min_dur=, max_dur=, and TBE= (time between events) arguments in both the threshold_detection() and blob_detection() functions.
Let's start with the threshold_detection() function on a recording containing calls from Catharus bicknelli. This function is an amplitude threshold detector that picks up audio events above a Signal to Noise Ratio (SNR) threshold. It combines several algorithms for detection, filtering and audio feature extraction. We will play with the arguments of this function to understand their role in the detection and extraction of audio events (here, calls of Catharus bicknelli).
# Set each argument according to the targeted audio events
TD <- threshold_detection(
CATBIC, # Either a path to an audio file (see ?read_audio), or a Wave object
threshold = 12, # 12 dB SNR sensitivity for the detection algorithm
time_exp = 1, # Time expansion factor of 1 (real-time recording); only change for time-expanded bat recordings
min_dur = 140, # Minimum duration threshold of 140 milliseconds (ms)
max_dur = 440, # Maximum duration threshold of 440 ms
min_TBE = 10, # Minimum time window between two audio events of 10 milliseconds
max_TBE = 5000, # Maximum time window between two audio events, here 5 seconds
EDG = 0.996, # Temporal masking with Exponential Decay Gain from 0 to 1
LPF = 10000, # Low-Pass Filter of 10 kHz
HPF = 1000, # High-Pass Filter of 1 kHz
FFT_size = 256, # Size of the Fast Fourier Transform (FFT) window
FFT_overlap = 0.875, # Percentage of overlap between two FFT windows
start_thr = 25, # 25 dB threshold at the start of the audio event
end_thr = 30, # 30 dB threshold at the end of the audio event
SNR_thr = 10, # 10 dB SNR threshold at which the extraction of the audio event stops
angle_thr = 45, # 45° of angle at which the extraction of the audio event stops
duration_thr = 440, # Noise estimation is resumed after 440 ms
NWS = 1000, # Time window length of 1 s used for background noise estimation
KPE = 1e-05, # Process Error parameter of the Kalman filter (for smoothing)
KME = 1e-04, # Measurement Error parameter of the Kalman filter (for smoothing)
settings = FALSE, # Whether to save the above settings in a list with the results
acoustic_feat = TRUE, # Extract the acoustic and signal quality features
metadata = FALSE, # Whether to return the metadata embedded in the Wave file
spectro_dir = file.path(tempdir(), "Spectros"), # Directory where to save the spectrograms
time_scale = 1, # Time resolution of 1 ms for spectrogram display
ticks = TRUE # Tick marks and their intervals are drawn on the y-axis (frequencies)
)
# Get the number of extracted audio events
nrow(TD$data$event_data)
Leave the HTML page with the 57 spectrograms open (each spectrogram representing an extracted audio event). These settings will be our benchmark for the number of audio events that can be extracted with the threshold_detection() function. In the following exercise, you will try to reach or beat this number by exploring different combinations of parameters for each argument of the function.
# Let's try various settings, starting with 1024 FFT size instead of 256.
TD <- threshold_detection(
CATBIC, threshold = 12, time_exp = 1, min_dur = 140, max_dur = 440,
min_TBE = 10, max_TBE = 5000, EDG = 0.996, LPF = 10000, HPF = 1000,
FFT_size = 1024, FFT_overlap = 0.875, start_thr = 25, end_thr = 30,
SNR_thr = 10, angle_thr = 45, duration_thr = 440, NWS = 1000,
KPE = 1e-05, KME = 1e-04, settings = FALSE, acoustic_feat = TRUE,
metadata = FALSE, spectro_dir = file.path(tempdir(), "Spectros"), time_scale = 1,
ticks = c(1000, 10000, 1000) # Tick marks from 1 to 10 kHz with 1 kHz interval
)
# Take a look at the spectrograms and compare them with the previous extraction.
nrow(TD$data$event_data) # Only three audio events!
We will now play with various detection thresholds: the end_thr, SNR_thr, angle_thr, KPE and KME parameters. Try to reach 66 extracted audio events with a contour (i.e., the Kalman curve) that best matches each audio event (answer below). The FFT size will be set at 256 samples.
CATBIC <- read_audio(file.path(data_dir, "Catharus-bicknelli-54864.mp3"))
TD <- threshold_detection(
CATBIC, threshold = 12, time_exp = 1, min_dur = 140, max_dur = 440, min_TBE = 10,
max_TBE = Inf, EDG = 0.996, LPF = 10000, HPF = 1000, FFT_size = 256, FFT_overlap = 0.875,
start_thr = 22, end_thr = 30, SNR_thr = 10, angle_thr = 125, duration_thr = 440, NWS = 1000,
KPE = 1e-05, KME = 1e-05, settings = FALSE, acoustic_feat = TRUE, metadata = FALSE
)
Let's take a look at the extracted audio features. Note that all the features are described and explained in the package vignette (vignette("bioacoustics")).
# Acoustic features are stored in a data frame called event_data,
# stored by order of detection.
View(TD$data$event_data) # Contains the filename and the time of detection in the
# recording, and 26 extracted features.
The location (in number of samples) of the audio event in the recording is saved in a list.
# Start and end of the 5th extracted audio event (in samples)
c(TD$data$event_start[[5]], TD$data$event_end[[5]])
# Remember you just have to divide by the sample rate to retrieve the time (s)
c(TD$data$event_start[[5]], TD$data$event_end[[5]]) / slot(CATBIC, "samp.rate")
The amplitude (dB) and frequency (Hz) tracks (or bins) are also saved in a list. These can be used to build your own acoustic features.
par(mar = c(1,1, 1, 1), oma = c(1, 1, 1, 1))
# Amplitude track of the 5th audio event
plot(TD$data$amp_track[[5]], type = "l")
# Frequency track of the 5th audio event
plot(TD$data$freq_track[[5]], type = "l")
The whole energy and frequency content can also be used to classify audio events, instead of acoustic features that may result in a loss of information. We will get there soon, but first, let's discover another detection function, here applied to echolocation calls of bats.
The blob_detection() function will be used on a recording containing 10 bat echolocation calls from the Myotis genus. This function combines several image processing, filtering and image feature extraction algorithms. A blur and a contrast boost are applied after mean background subtraction to increase the SNR of the audio events. The blob detection algorithm is then applied on the processed spectrogram to detect the ROIs (i.e., each preprocessed audio event). The blob detector simultaneously labels the connected FFT values and their contours in the spectrogram. Labelling is done in a single pass over the spectrogram, while contour points are revisited more than once and up to four times (see Chang et al., 2004). We will play with the arguments of this function to extract the bat echolocation calls.
# Use the bat recording stored in the package
data(myotis)
# Set each argument according to the targeted audio events
BD <- blob_detection(
myotis, # Either a path to an audio file (see ?read_audio), or a Wave object
time_exp = 10, # Time expansion factor of 10 for time expanded recordings.
min_dur = 1.5, # Minimum duration threshold of 1.5 milliseconds (ms)
max_dur = 80, # Maximum duration threshold of 80 ms
min_area = 40, # minimum number of 40 pixels in the blob
min_TBE = 20, # Minimum time window between two audio events of 20 milliseconds
EDG = 0.996, # Temporal masking with Exponential Decay Gain from 0 to 1
LPF = slot(myotis, "samp.rate") * 10 / 2, # Low-Pass Filter at the Nyquist frequency
HPF = 16000, # High-Pass Filter of 16 kHz
FFT_size = 256, # Size of the Fast Fourier Transform (FFT) window
FFT_overlap = 0.875, # Percentage of overlap between two FFT windows
blur = 2, # Gaussian smoothing function with a factor of 2 for blurring the spectrogram
bg_substract = 20, # Foreground extraction with a mean filter applied on the spectrogram
contrast_boost = 20, # Edge contrast enhancement filter of the spectrogram contour
settings = FALSE, # Whether to save the above settings in a list with the results
acoustic_feat = TRUE, # Extract the acoustic and signal quality features
metadata = FALSE, # Whether to return the metadata embedded in the Wave file
spectro_dir = file.path(tempdir(), "Spectros"), # HTML page with spectrograms by order of detection
time_scale = 0.1, # Time resolution of 0.1 ms for spectrogram display
ticks = TRUE # Tick marks and their intervals are drawn on the y-axis (frequencies)
)
# Get the number of extracted audio events
nrow(BD$data$event_data)
Do not close the HTML page. Now set the FFT size to 512 and play with the blur, contrast boost and background subtraction parameters to retrieve the 10 extracted echolocation calls.
# Let's try various settings, starting with 512 FFT size instead of 256.
BD <- blob_detection(
myotis, time_exp = 10, FFT_size = 512, settings = FALSE, acoustic_feat = TRUE,
metadata = FALSE, spectro_dir = file.path(tempdir(), "Spectros"), time_scale = 0.1, ticks = TRUE
)
# Take a look at the spectrograms and compare them with the previous extraction.
nrow(BD$data$event_data) # Only 6 audio events!
Let’s take a look at the extracted audio features. All the features are described and explained in the package vignette.
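As with the threshold detector, they are stored in a data frame, by order of detection:
# Acoustic features extracted by blob_detection()
View(BD$data$event_data)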
This data frame is, for now, the only set of acoustic features available with the blob_detection() function. However, it combines well with fspec() for image analysis.
Now that we have played with both detection functions, on bird and bat vocalizations, let's go back to birds to explore batch analysis (i.e., with several recordings) and audio event classification.
In this section, we will learn how to analyze several recordings at the same time and train a simple classifier (on a training set) that will be used to classify new data (i.e., the test set).
We will work with 27 recordings of Catharus bicknelli (n = 9), Passerella iliaca (n = 9), and Setophaga magnolia (n = 9). We will split the extracted audio events into a 60 % training set (called "Train") and a 40 % test set (called "Test").
Our target audio events are the calls of Catharus bicknelli. We will use the threshold detector previously configured for this species (see section 4.1.1).
# Get the filepath for each MP3 file
files <- dir(data_dir, recursive = TRUE, full.names = TRUE, pattern = "[.]mp3$")
# Detect and extract audio events
TDs <- setNames(
lapply(
files,
threshold_detection,
threshold = 12, min_dur = 140, max_dur = 440, min_TBE = 50, max_TBE = Inf,
LPF = 8000, HPF = 1500, FFT_size = 256, start_thr = 30, end_thr = 20,
SNR_thr = 10, angle_thr = 125, duration_thr = 400, spectro_dir = NULL,
NWS = 2000, KPE = 0.00001, time_scale = 2, EDG = 0.996
),
basename(tools::file_path_sans_ext(files))
)
# Keep only files with data in it
TDs <- TDs[sapply(TDs, function(x) length(x$data)) > 0]
# Keep the extracted feature and merge in a single data frame for further analysis
Event_data <- do.call("rbind", c(lapply(TDs, function(x) x$data$event_data), list(stringsAsFactors = FALSE)))
nrow(Event_data) # 355 audio events extracted
# Compute the number of extracted CATBIC calls
sum(startsWith(Event_data$filename, "Cat"))
# Add a "Class" column: "CATBIC" vs. other species of birds "OTHERS"
classes <- as.factor(ifelse(startsWith(Event_data$filename, "Cat"), "CATBIC", "OTHERS"))
Event_data <- cbind(data.frame(Class = classes), Event_data)
# Get rid of the filename and time in the recording
Event_data$filename <- Event_data$starting_time <- NULL
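Before splitting the data, you can quickly check the class balance, for example:
# Number of extracted audio events per class
table(Event_data$Class)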
We now have the necessary dataset to train a classifier: we will train a Random Forest on the training set and validate the results on the test set.
# Split the data in 60% Training / 40% Test sets
train <- sample(1:nrow(Event_data), round(nrow(Event_data) * .6))
Train <- Event_data[train,]
test <- setdiff(1:nrow(Event_data), train)
Test <- Event_data[test,]
# Train a random forest classifier
set.seed(666)
rf <- randomForest(Class ~ duration + freq_max_amp + freq_max + freq_min +
bandwidth + freq_start + freq_center + freq_end +
freq_knee + fc + freq_bw_knee_fc + bin_max_amp +
pc_freq_max_amp + pc_freq_max + pc_freq_min +
pc_knee + temp_bw_knee_fc + slope + kalman_slope +
curve_neg + curve_pos_start + curve_pos_end +
mid_offset + smoothness + snr + hd,
data = Train, importance = FALSE, proximity = FALSE,
replace = TRUE, ntree = 4000, mtry = 4)
# Look at the confusion matrix of the training set
rf$confusion # looks good, but...
# Let's make predictions with our classifier on a test set
table(Test[,1], predict(rf, Test[,-1], type = "response")) # not bad!
# To look at the predictions
head(predict(rf, Test[,-1], type = "prob"))
We are now able to use this simple, yet robust, classifier to detect new calls of our target species.
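As a sketch of how this would look on a new recording (the file path below is a placeholder; reuse the detector settings tuned above for your own target species):
# Hypothetical new recording (placeholder path)
new_rec <- read_audio(file.path(data_dir, "new_recording.mp3"))
# Extract candidate audio events with the same detector settings as in the batch analysis
new_td <- threshold_detection(
new_rec, threshold = 12, min_dur = 140, max_dur = 440, min_TBE = 50, max_TBE = Inf,
LPF = 8000, HPF = 1500, FFT_size = 256, start_thr = 30, end_thr = 20,
SNR_thr = 10, angle_thr = 125, duration_thr = 400, NWS = 2000, KPE = 0.00001, EDG = 0.996
)
new_events <- new_td$data$event_data
new_events$filename <- new_events$starting_time <- NULL
# Predict the class of each extracted audio event with the trained random forest
predict(rf, new_events, type = "response")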
We will use Keras in R, which requires installing several Python packages. Guidelines to install Keras properly in R are available in the Keras for R documentation.
Let's now explore a ConvNet approach available in Keras. We will follow the approach of Hatami et al. (2017) to analyze time series as images with 2D ConvNets. The difference is that we will only pool at the last layer (global average pooling) before the output activation, and add batch normalization with dropout at each layer.
# Run if keras is installed on your machine
library(keras)
# Build the training set
Y_train <- to_categorical(as.integer(Train[,1]) - 1) # One hot encoding
# X as matrix
X_train <- as.matrix(Train[,-1])
# Build the test set
Y_test <- to_categorical(as.integer(Test[,1]) - 1)
X_test <- as.matrix(Test[,-1])
# Build the sequential model
mod0 <- keras_model_sequential()
mod0 %>%
# Reshape each feature vector into a 1 x 1 "image" with ncol(X_train) channels (conv input: rows, cols, channels)
layer_reshape(input_shape=ncol(X_train),target_shape=c(1,1,ncol(X_train))) %>%
# First conv 2d layer with 128 filters, kernel size of 8 x 8 and stride of 1 x 1
layer_conv_2d(128, c(8,8), c(1,1), padding='same') %>%
layer_batch_normalization() %>%
layer_activation("relu") %>%
layer_dropout(0.2) %>%
# Second conv 2d layer with 256 filters, kernel size of 5 x 5 and stride of 1 x 1
layer_conv_2d(256, c(5,5), c(1,1), padding='same') %>%
layer_batch_normalization() %>%
layer_activation("relu") %>%
layer_dropout(0.2) %>%
# Third conv 2d layer with 128 filters, kernel size of 3 x 3 and stride of 1 x 1
layer_conv_2d(128, c(3,3), c(1,1), padding='same') %>%
layer_batch_normalization() %>%
layer_activation("relu") %>%
layer_dropout(0.2) %>%
# Global average pooling layer
layer_global_average_pooling_2d() %>%
# Activation output layer with 2 classes
layer_dense(units = ncol(Y_train), activation='softmax')
# Model compile
mod0 %>% compile(loss = 'categorical_crossentropy',
optimizer = "adam",
metrics = "categorical_accuracy")
# Add a callback to reduce the learning rate when reaching the plateau
reduce_lr <- callback_reduce_lr_on_plateau(monitor = 'loss', factor = 0.5,
patience = 50, min_lr = 0.0001)
# Start learning
mod0 %>% fit(X_train, Y_train, batch_size = 32, epochs = 50,
validation_data = list(X_test, Y_test),
verbose = 1, callbacks = reduce_lr)
# Score on the test set
score <- mod0 %>% evaluate(X_test, Y_test, batch_size = 32)
score
Let’s work a bit with the output to build a confusion matrix and use the predict function on the test set.
# Look at predictions and build a confusion matrix
Pred <- as.factor(predict_classes(mod0, X_test, batch_size = 32, verbose = 1))
table(Y_test[,2], Pred)
# To look at the prediction values
Prob <- round(predict_proba(mod0, X_test, batch_size = 32, verbose = 1), 2)
We obtained a val_loss < 0.2 and a val_categorical_accuracy > 0.94, which is acceptable, but not better than the simpler RF approach we used in section 3.2. Using only 26 acoustic features as model inputs, instead of the whole spectrogram content (energy and frequency distribution, and harmonics), probably reduced the performance of the CNN model.
This tutorial is now complete. Comments and feedback are welcome:
Francois: [email protected]
Jean: [email protected]
www.wavx.ca