Feature Selection for Audio Classification
Feature selection reduces the dimensionality of data by selecting a subset of measured features to create a model. Performing feature selection enables you to train smaller models quickly without sacrificing accuracy. For some tasks, properly selected features used with simple thresholding can provide adequate results, especially in situations where model size and complexity must be minimized.
In this example, you walk through a standard machine learning pipeline to develop an audio classification system. The pipeline has been abstracted so that you can apply the same steps to either speaker recognition or word recognition tasks.
Dataset Management and Labeling
Download the Free Spoken Digit Dataset (FSDD) [1]. FSDD consists of short audio files with spoken digits (0-9). The data is sampled at 8 kHz.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"FSDD");
Create an audioDatastore to manage the audio dataset.
ads = audioDatastore(dataset,IncludeSubfolders=true);
Choose a task and set the audioDatastore labels accordingly.
task = "word recognition"; [~,filenames] = fileparts(ads.Files); switch task case "speaker recognition" ads.Labels = extractBetween(filenames,"_","_"); case "word recognition" ads.Labels = extractBefore(filenames,"_"); end
Split data into train and test sets. Use 80% for training and 20% for testing.
[adsTrain,adsTest] = splitEachLabel(ads,0.8);
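As a quick check that the split is stratified, you can count the files per label in each set with the countEachLabel object function:

countEachLabel(adsTrain)
countEachLabel(adsTest)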
Listen to a sample from the training set. Plot the waveform and display the associated label.
[x,xinfo] = read(adsTrain);
sound(x,xinfo.SampleRate)

t = (0:numel(x)-1)/xinfo.SampleRate;
figure
plot(t,x)
title("Label: " + xinfo.Label)
grid on
axis tight
ylabel("Amplitude")
xlabel("Time (s)")
Feature Extraction Pipeline
Audio signals can broadly be categorized as stationary or non-stationary. Stationary signals have spectra that do not change over time, like pure tones. Non-stationary signals have spectra that change over time, like speech. To make machine learning-based tasks tractable, non-stationary signals can be approximated as stationary when analyzed at appropriately small time scales. Generally, speech signals are considered stationary at time scales of around 30 ms. Therefore, speech can be characterized by extracting features from 30 ms analysis windows over time.
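To make the time scales concrete, here is a back-of-the-envelope sketch (assuming the 8 kHz sample rate of this dataset and that trailing partial windows are dropped) of how a 30 ms window with 20 ms overlap translates to samples and window counts:

fs = 8000;                              % FSDD sample rate in Hz
winLength = round(0.03*fs);             % 30 ms window -> 240 samples
hopLength = winLength - round(0.02*fs); % 10 ms hop between window starts -> 80 samples
numWin = floor((numel(x) - winLength)/hopLength) + 1 % analysis windows in signal x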
Use the helper function helperVisualizeBuffer to visualize the analysis windows of an audio file. Specify a 30 ms analysis window with 20 ms overlap between adjacent windows. The overlap duration must be less than the window duration. The Analysis Windows of Signal plot shows the individual analysis windows from which features are extracted.
windowDuration = 0.03;
overlapDuration = 0.02;
helperVisualizeBuffer(x,xinfo.SampleRate,WindowDuration=windowDuration,OverlapDuration=overlapDuration);
Create an audioFeatureExtractor to extract features from 30 ms windows with 20 ms overlap between windows.
afe = audioFeatureExtractor(SampleRate=xinfo.SampleRate, ...
    Window=hann(round(windowDuration*xinfo.SampleRate),"periodic"), ...
    OverlapLength=round(overlapDuration*xinfo.SampleRate))
afe = 
  audioFeatureExtractor with properties:

   Properties
                     Window: [240×1 double]
              OverlapLength: 160
                 SampleRate: 8000
                  FFTLength: []
    SpectralDescriptorInput: 'linearSpectrum'
        FeatureVectorLength: 0

   Enabled Features
     none

   Disabled Features
     linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
     mfccDeltaDelta, gtcc, gtccDelta, gtccDeltaDelta, spectralCentroid, spectralCrest
     spectralDecrease, spectralEntropy, spectralFlatness, spectralFlux, spectralKurtosis, spectralRolloffPoint
     spectralSkewness, spectralSlope, spectralSpread, pitch, harmonicRatio, zerocrossrate
     shortTimeEnergy

   To extract a feature, set the corresponding property to true.
   For example, obj.mfcc = true, adds mfcc to the list of enabled features.
Configure the audioFeatureExtractor to extract all features.
in = info(afe,"all");
featureSwitches = fields(in);
cellfun(@(x)afe.set(x,true),featureSwitches)
afe
afe = 
  audioFeatureExtractor with properties:

   Properties
                     Window: [240×1 double]
              OverlapLength: 160
                 SampleRate: 8000
                  FFTLength: []
    SpectralDescriptorInput: 'linearSpectrum'
        FeatureVectorLength: 306

   Enabled Features
     linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
     mfccDeltaDelta, gtcc, gtccDelta, gtccDeltaDelta, spectralCentroid, spectralCrest
     spectralDecrease, spectralEntropy, spectralFlatness, spectralFlux, spectralKurtosis, spectralRolloffPoint
     spectralSkewness, spectralSlope, spectralSpread, pitch, harmonicRatio, zerocrossrate
     shortTimeEnergy

   Disabled Features
     none

   To extract a feature, set the corresponding property to true.
   For example, obj.mfcc = true, adds mfcc to the list of enabled features.
You can use the extract object function of audioFeatureExtractor to extract all the enabled features from an audio signal. The features are concatenated into a matrix with analysis windows along the rows and features along the columns.
featureMatrix = extract(afe,x);
[numWindows,numFeatures] = size(featureMatrix)
numWindows = 62
numFeatures = 306
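The window count follows from the signal length, window length, and hop length. As a sanity check (a sketch, assuming the extractor drops trailing partial windows):

hopLength = numel(afe.Window) - afe.OverlapLength;  % 240 - 160 = 80 samples
floor((numel(x) - numel(afe.Window))/hopLength) + 1 % should match numWindows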
You can use info to get a mapping between the columns of the output matrix and the feature names. The term "features" is overloaded in the literature: features can refer to feature groups, such as "linearSpectrum", "mfcc", or "spectralCentroid", or to individual feature elements, such as the first element of the linear spectrum or the third element of the MFCC. The output map returned by info is a struct where each field corresponds to a feature group and each value indicates which columns of the feature matrix that feature group occupies.
outputMap = info(afe)
outputMap = struct with fields:
linearSpectrum: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 … ]
melSpectrum: [122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153]
barkSpectrum: [154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185]
erbSpectrum: [186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213]
mfcc: [214 215 216 217 218 219 220 221 222 223 224 225 226]
mfccDelta: [227 228 229 230 231 232 233 234 235 236 237 238 239]
mfccDeltaDelta: [240 241 242 243 244 245 246 247 248 249 250 251 252]
gtcc: [253 254 255 256 257 258 259 260 261 262 263 264 265]
gtccDelta: [266 267 268 269 270 271 272 273 274 275 276 277 278]
gtccDeltaDelta: [279 280 281 282 283 284 285 286 287 288 289 290 291]
spectralCentroid: 292
spectralCrest: 293
spectralDecrease: 294
spectralEntropy: 295
spectralFlatness: 296
spectralFlux: 297
spectralKurtosis: 298
spectralRolloffPoint: 299
spectralSkewness: 300
spectralSlope: 301
spectralSpread: 302
pitch: 303
harmonicRatio: 304
zerocrossrate: 305
shortTimeEnergy: 306
This figure is intended to help you interpret the feature matrix returned from extract.
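For example, you can use the output map to slice a single feature group out of the feature matrix by name (a short sketch using the variables above):

mfccMatrix = featureMatrix(:,outputMap.mfcc);           % 62-by-13 MFCC block
centroid = featureMatrix(:,outputMap.spectralCentroid); % 62-by-1 spectral centroid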
Use extract to extract features from all files in the audio datastore. If you have Parallel Computing Toolbox™, spread the computation across multiple workers.
The output is a (Number of files)-by-1 cell array. Each element of the cell array is a (Number of hops)-by-(Number of features) matrix, where the number of hops depends on the length of the audio file.
features = extract(afe,adsTrain,UseParallel=canUseParallelPool);
Feature/Label Correspondence
Once you have extracted features from approximately stationary windows in time, the next question is whether to feed the window-level features to your machine learning model or to combine the features into file-level representations. The choice of window-level or file-level features depends on your application and requirements. For file-level features, you will generally create summary statistics of the window-level features to collapse the time dimension. Common summary statistics include the mean and standard deviation. This example uses window-level features.
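For reference, a minimal sketch of the file-level alternative collapses each file's window-level matrix into a single row of per-feature means and standard deviations (this example does not take that path):

% One row of summary statistics per file.
fileLevel = cellfun(@(f)[mean(f,1),std(f,0,1)],features,UniformOutput=false);
fileLevel = cat(1,fileLevel{:}); % (Number of files)-by-(2*Number of features)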
To train a machine learning model on window-level features, replicate the file-level labels so that they are in one-to-one correspondence with the features.
N = cellfun(@(x)size(x,1),features);
T = repelem(adsTrain.Labels,N);
Concatenate the features into a single matrix for consumption by machine-learning tools.
X = cat(1,features{:});
Feature Selection
Statistics and Machine Learning Toolbox™ provides several tools to aid in feature selection. The best feature selector depends on your intended model. Use fscmrmr (Statistics and Machine Learning Toolbox) to rank features for classification using the minimum-redundancy/maximum-relevance (MRMR) algorithm. MRMR is a sequential algorithm that finds an optimal set of features that are mutually and maximally dissimilar and can represent the response variable effectively.
rng("default") % for reproducibility [featureSelectionIdx,featureSelectionScores] = fscmrmr(X,T);
The fscmrmr function considers each column of the input feature matrix as a unique feature. Plot the scores of each scalar feature in the matrix returned by audioFeatureExtractor.
figure
bar(featureSelectionScores)
ylabel("Feature Score")
xlabel("Feature Matrix Column")
The audioFeatureExtractor extracts feature groups with varying numbers of elements. For example, the MFCC feature group has 13 elements by default, while the spectral centroid always consists of a single element. The output map returned by calling info on audioFeatureExtractor is a struct whose fields are the feature groups and whose values are the columns that each feature group occupies in the matrix output by extract. Use the output map and the supporting function uniqueFeatureName to create a unique name for each scalar feature, then plot the scores of the top 25 performing features.
featurenames = uniqueFeatureName(outputMap);
featurenamesSorted = featurenames(featureSelectionIdx);

figure
bar(reordercats(categorical(featurenames),featurenamesSorted),featureSelectionScores)
xlim([featurenamesSorted(1),featurenamesSorted(25)])
Depending on your application, you can approximate grouped feature selection by averaging the scores of feature groups. Using grouped features (for example, all MFCC) may help you deploy more efficient feature extraction. In this example, you use the top-performing feature scalars, regardless of which feature group they belong to.
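As a sketch of that grouped approximation, average the MRMR scores over the columns each feature group occupies, then rank the groups:

groupNames = fields(outputMap);
groupScore = structfun(@(cols)mean(featureSelectionScores(cols)),outputMap);
[~,order] = sort(groupScore,"descend");
groupNames(order) % feature groups ranked by average score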
Select the top-scoring features. The number you select depends on the model you are training and the final constraints of your application.
numFeatures = 30;
selectedFeatureIndex = featureSelectionIdx(1:numFeatures);
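As a quick check, you can inspect which scalar features survived the cut by indexing the unique feature names with the selected indices:

selectedFeatureNames = featurenames(selectedFeatureIndex)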
Train Model
To train a KNN model using your selected features, use fitcknn (Statistics and Machine Learning Toolbox). If you are unsure which machine learning model to use, try fitcauto (Statistics and Machine Learning Toolbox) to automatically select a classification model with optimized hyperparameters, or try the Classification Learner app (Statistics and Machine Learning Toolbox).
Mdl = fitcknn(X(:,selectedFeatureIndex),T,Standardize=true);
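Before turning to the held-out test set, you can estimate the window-level generalization error with k-fold cross-validation (a quick sketch; cross-validating on all windows can be slow for large datasets):

cvMdl = crossval(Mdl,KFold=5); % 5-fold cross-validated KNN model
windowLevelLoss = kfoldLoss(cvMdl)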
Evaluate Model
Spot-check the model's performance.
Read a sample from the test set. Listen to the sample and then plot its waveform and display the ground-truth label.
[x,xInfo] = read(adsTest);
sound(x,xInfo.SampleRate)

t = (0:numel(x)-1)/xInfo.SampleRate;
figure
plot(t,x)
title("Label: " + xInfo.Label)
grid on
axis tight
ylabel("Amplitude")
xlabel("Time (s)")
Extract features from analysis windows.
yPerWindow = extract(afe,x);
Predict the label for each window and compare with the ground-truth label.
t = predict(Mdl,yPerWindow(:,selectedFeatureIndex));
trueLabel = categorical(xInfo.Label)
trueLabel = categorical
0
predictionsPerWindow = categorical(t')
predictionsPerWindow = 1×39 categorical
0 0 3 3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 0 0 0 0 0 0
Create a file-level prediction by taking the mode of window-level predictions.
prediction = mode(predictionsPerWindow)
prediction = categorical
0
Analyze the whole-word performance over the entire test set.
Tfile = categorical(adsTest.Labels);
featuresTest = extract(afe,adsTest,UseParallel=canUseParallelPool);
Y = cellfun(@(x)mode(categorical(predict(Mdl,x(:,selectedFeatureIndex)))),featuresTest,UniformOutput=false);
Y = cat(1,Y{:});

figure
confusionchart(Tfile,Y,Title="Accuracy = " + 100*mean(Tfile==Y) + " (%)")
You can apply a similar pattern to select an optimal window, window length, window overlap, DFT length, and input to the spectral descriptors.
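A sketch of that idea sweeps candidate window durations while holding the overlap-to-window ratio fixed, rebuilding the extractor for each candidate; the loop body is elided here because it mirrors the extract/select/train/evaluate steps above:

for windowDuration = [0.02 0.03 0.04 0.05]
    overlapDuration = (2/3)*windowDuration; % keep the 2:3 overlap-to-window ratio
    afe = audioFeatureExtractor(SampleRate=8000, ...
        Window=hann(round(windowDuration*8000),"periodic"), ...
        OverlapLength=round(overlapDuration*8000));
    % ... enable features, extract, select, train, and evaluate as above ...
end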
Supporting Functions
function c = uniqueFeatureName(afeInfo)
%UNIQUEFEATURENAME Create unique feature names
% c = uniqueFeatureName(featureInfo) creates a unique set of feature names
% for each element of each feature described in the afeInfo struct. The
% afeInfo struct is returned by the info object function of
% audioFeatureExtractor.
a = repelem(fields(afeInfo),structfun(@numel,afeInfo));
b = matlab.lang.makeUniqueStrings(a);
d = find(endsWith(b,"_1"));
c = strrep(b,"_","");
c(d-1) = strcat(c(d-1),"0");
end
References
[1] Jakobovski. “Jakobovski/Free-Spoken-Digit-Dataset.” GitHub, May 30, 2019. https://github.com/Jakobovski/free-spoken-digit-dataset.