speech2text

Transcribe speech signal to text

Since R2022b

    Description

    transcript = speech2text(audioIn,fs) transcribes speech in the input audio signal to text using a pretrained wav2vec 2.0 model.

    Using wav2vec 2.0 requires Deep Learning Toolbox™ and installing the pretrained model.

    transcript = speech2text(audioIn,fs,Name=Value) specifies options using one or more name-value arguments. For example, speech2text(x,fs,Language="es") transcribes a signal containing Spanish-language speech.

    [transcript,rawOutput] = speech2text(___) also returns the unprocessed server output from the third-party speech service.

    Examples

    Type speechClient("wav2vec2.0") into the command line. If the required model files are not installed, then the function throws an error and provides a link to download them. Click the link, and unzip the file to a location on the MATLAB path.

    Alternatively, execute the following commands to download and unzip the wav2vec model files to your temporary directory.

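    % Download the model archive to the temporary directory.
    downloadFolder = fullfile(tempdir,"wav2vecDownload");
    loc = websave(downloadFolder,"https://ssd.mathworks.com/supportfiles/audio/asr-wav2vec2-librispeech.zip");
    % Unzip the archive and add the extracted folder to the MATLAB path.
    modelsLocation = tempdir;
    unzip(loc,modelsLocation)
    addpath(fullfile(modelsLocation,"asr-wav2vec2-librispeech"))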

    Check that the installation is successful by typing speechClient("wav2vec2.0") into the command line. If the files are installed, then the function returns a Wav2VecSpeechClient object.

    speechClient("wav2vec2.0")
    ans = 
      Wav2VecSpeechClient with properties:
    
        Segmentation: 'word'
          TimeStamps: 0
            Language: 'english'
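
    Because speechClient throws an error when the model files are missing, the installation check can also be done programmatically. The following is a minimal sketch, not an official workflow: the variable names modelInstalled and client are illustrative, and the audio file is the one used in the transcription example on this page.

```matlab
% Try to create the default wav2vec 2.0 client. Per the text above,
% speechClient errors if the pretrained model files are not installed.
try
    client = speechClient("wav2vec2.0");
    modelInstalled = true;
catch err
    modelInstalled = false;
    % The error message is expected to point to the download link.
    disp(err.message)
end

% Transcribe only when the client was created successfully.
if modelInstalled
    [y,fs] = audioread("speech_dft.wav");
    transcript = speech2text(y,fs,Client=client)
end
```

    Guarding the call this way lets a script continue (or exit cleanly) on machines where the model has not been downloaded yet.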
    
    

    Read in an audio file containing speech and listen to it.

    [y,fs] = audioread("speech_dft.wav");
    sound(y,fs)

    Use speech2text to transcribe the audio signal using the wav2vec 2.0 pretrained network. This requires installing the pretrained network. If the network is not installed, the function provides a link with instructions to download and install the pretrained model.

    transcript = speech2text(y,fs)
    transcript = 
    "the discreet forier transform of a real valued signal is conjugate symmetric"
    

    Create a speechClient object that uses the Emformer pretrained model.

    emformerSpeechClient = speechClient("emformer");

    Create a dsp.AudioFileReader object to read in an audio file. In a streaming loop, read in frames of the audio file and transcribe the speech using speech2text with the Emformer speechClient. The Emformer speechClient object maintains an internal state to perform the streaming speech-to-text transcription.

    afr = dsp.AudioFileReader("Counting-16-44p1-mono-15secs.wav");
    txtTotal = "";
    while ~isDone(afr)
        x = afr();
        txt = speech2text(x,afr.SampleRate,Client=emformerSpeechClient);
        txtTotal = txtTotal + txt;
    end
    
    txtTotal
    txtTotal = 
    "one two three four five six seven eight nine"
    

    Read in an audio file containing speech in the Spanish language and listen to it.

    [x,fs] = audioread("spanish.wav");
    sound(x,fs)

    Use speech2text with Language set to "spanish" to transcribe the speech.

    transcript = speech2text(x,fs,Language="spanish")
    transcript = 
    "la inductancia mutua de los circuitos depende exclusivamente de la geometría de los mismos."
    

    Create a speechClient object that uses a Whisper pretrained model. Set Task to "translate" to translate other languages into English when performing speech-to-text with this object.

    whisperSpeechClient = speechClient("whisper",Task="translate");

    Read in a speech signal containing Polish and listen to it.

    [x,fs] = audioread("polish.wav");
    sound(x,fs)

    Call speech2text on the signal with Client set to the Whisper client object to simultaneously translate and transcribe the speech.

    translatedTranscript = speech2text(x,fs,Client=whisperSpeechClient)
    translatedTranscript = 
    "Good day, I am Polish."
    

    Input Arguments

    audioIn: Audio input signal, specified as a column vector (single channel).

    Data Types: single | double

    fs: Sample rate in Hz, specified as a positive scalar.

    Data Types: single | double
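
    Audio files are often stereo, while audioIn must be a single-channel column vector. The following sketch shows one way to prepare such a signal. It uses a synthetic stand-in signal rather than a real recording, and averaging the channels is just one possible downmix, not something speech2text prescribes.

```matlab
% Stand-in stereo signal: 1 second of low-level noise at 16 kHz.
fs = 16000;
stereo = 0.1*randn(fs,2,"single");

% Average the two channels to obtain a single channel.
mono = mean(stereo,2);

% (:) guarantees a column vector even if the input was a row.
audioIn = mono(:);

% Confirm the shape matches what the audioIn argument expects.
assert(iscolumn(audioIn))
```

    The resulting audioIn and fs can then be passed to speech2text(audioIn,fs).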

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: speech2text(x,fs,Language="es")

    Client: Client object, specified as an object returned by speechClient. The object is an interface to a pretrained model or to a third-party speech service. By default, speech2text uses a wav2vec 2.0 client object.

    Using speech2text with wav2vec 2.0 requires Deep Learning Toolbox and installing the pretrained wav2vec 2.0 model. If the model is not installed, calling speechClient with "wav2vec2.0" provides a link to download and install the model.

    Using the Emformer or Whisper models requires Deep Learning Toolbox and Audio Toolbox™ Interface for SpeechBrain and Torchaudio Libraries. If this support package is not installed, calling speechClient with "emformer" provides a link to the Add-On Explorer, where you can download and install the support package.

    To use any of the third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

    Example: speechClient("wav2vec2.0")

    Language: Language spoken in the input signal, specified as "english", "spanish", "italian", "french", or "german". You can also specify the corresponding ISO language codes ("en", "es", "it", "fr", "de").

    This argument applies only when using the default Client. If you specify Client, set the Language property on the client object.

    Data Types: char | string
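
    The language names and ISO codes listed above pair up one to one. The following sketch validates a user-supplied value against that list before it is passed to speech2text. The variable names (names, codes, requested, languageName) are hypothetical and not part of the speech2text interface.

```matlab
% Supported values, in the order given in the Language description.
names = ["english","spanish","italian","french","german"];
codes = ["en","es","it","fr","de"];

requested = "es";   % hypothetical user input

% Accept either the full name or the ISO code.
idx = find(requested == names | requested == codes, 1);
if isempty(idx)
    error("Unsupported language: %s",requested)
end

% Normalize to the full language name.
languageName = names(idx);
```

    With a signal x and sample rate fs loaded, the normalized value can be used as speech2text(x,fs,Language=languageName).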

    Output Arguments

    transcript: Speech transcript of the input audio signal, returned as a table with a column containing the transcript and another column containing the associated confidence metrics. If the Segmentation property of Client is "none", speech2text returns the transcript as a string.

    The returned table can have additional columns depending on the speechClient properties and server options.

    Data Types: table | string
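
    Because the output can be either a table or a string, downstream code may need to handle both. The following sketch collapses either form into one string. It uses a hand-built stand-in table instead of real speech2text output, and the column names Transcript and Confidence are assumptions made for illustration, so check the variable names of the table you actually receive.

```matlab
% Stand-in for a speech2text result. The column names are assumed.
transcript = table(["one";"two";"three"],[0.98;0.95;0.97], ...
    VariableNames=["Transcript","Confidence"]);

if istable(transcript)
    % Join the per-segment text down the rows with single spaces.
    fullText = join(transcript.Transcript," ",1);
else
    % Already a string when Segmentation is "none".
    fullText = string(transcript);
end

disp(fullText)
```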

    rawOutput: Unprocessed server output, returned as a matlab.net.http.ResponseMessage object containing the HTTP response from the third-party speech service. If the third-party speech service is Amazon®, speech2text returns the server output as a structure.

    This output argument does not apply if Client interfaces with a pretrained model.

    References

    [1] Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” 2020. https://doi.org/10.48550/ARXIV.2006.11477.

    Version History

    Introduced in R2022b