Converting Speech to Text with Project Oxford

Speech recognition is a problem that computer scientists have worked on for decades. Project Oxford applies machine learning to this problem to recognize spoken words and infer their probable meaning from context.

Project Oxford exposes a REST web service so that you can add speech recognition to your application.

Before you can use the Speech API, you must register at Project Oxford and retrieve your Speech API key.

Figure 1: Speech API Key

The easiest way to use this API in a .NET application is to use the SpeechRecognition library. A NuGet package makes it easy to add this library to your application. In Visual Studio 2015, create a new WPF application (File | New | Project | Windows | WPF Application). Then, right-click the project in the Solution Explorer and select Manage NuGet Packages. Search for and add the "Microsoft.ProjectOxford.SpeechRecognition" package. Select the "x64" or "x86" version that corresponds with your version of Windows.

Figure 2: NuGet dialog

Now, you can start using the library to call the Speech API.

Add the following using statement to the top of a class file:

using Microsoft.ProjectOxford.SpeechRecognition;

Within the class, declare a private field of type MicrophoneRecognitionClient:

MicrophoneRecognitionClient _microphoneRecognitionClient;

To begin listening to speech, instantiate the MicrophoneRecognitionClient object by calling the SpeechRecognitionServiceFactory.CreateMicrophoneClient method, passing in the Speech Recognition Mode, the language to listen for, and your Speech subscription key.

The Speech Recognition Mode is an enum that can be either ShortPhrase or LongDictation. These are optimized for shorter or longer voice messages, respectively. Below is an example of creating a new MicrophoneRecognitionClient instance:

var speechRecognitionMode = SpeechRecognitionMode.ShortPhrase;
string language = "en-us";
string subscriptionKey = ConfigurationManager.AppSettings["SpeechKey"].ToString();

_microphoneRecognitionClient
    = SpeechRecognitionServiceFactory.CreateMicrophoneClient(
        speechRecognitionMode,
        language,
        subscriptionKey);
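The snippet above reads the subscription key from the application's configuration file through ConfigurationManager.AppSettings. A minimal App.config sketch is shown here, assuming the setting is named "SpeechKey" as in the code above; the value is only a placeholder to replace with your own key.

```xml
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <!-- Placeholder value (an assumption for illustration): replace with your own Speech API key -->
    <add key="SpeechKey" value="YOUR_SPEECH_API_KEY" />
  </appSettings>
</configuration>
```

ConfigurationManager lives in the System.Configuration namespace, so the project needs a reference to the System.Configuration assembly and a matching using statement. Keeping the key in configuration rather than in source code also makes it easier to avoid committing it to a public repository.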

Now that you have a MicrophoneRecognitionClient object, wire up its OnPartialResponseReceived and OnResponseReceived events so that your code is notified as the API turns speech into text. Then call StartMicAndRecognition to start listening to the microphone.

_microphoneRecognitionClient.OnPartialResponseReceived += OnPartialResponseReceivedHandler;
_microphoneRecognitionClient.OnResponseReceived += OnMicShortPhraseResponseReceivedHandler;
_microphoneRecognitionClient.StartMicAndRecognition();

The MicrophoneRecognitionClient object calls the web service frequently, often after every word, to interpret the words it has heard so far. Each time it makes this call, its OnPartialResponseReceived event fires.

The signature of OnPartialResponseReceivedHandler is:

void OnPartialResponseReceivedHandler(object sender, PartialSpeechResponseEventArgs e)

and you can retrieve Oxford's text interpretation of the spoken words from e.PartialResult. Oxford may revise its interpretation of words spoken at the beginning of a sentence when it receives more of the sentence to provide some context.

After a significant pause, the MicrophoneRecognitionClient object will decide that the user has finished speaking. At this point, it fires the OnResponseReceived event, giving you a chance to clean up. The EndMicAndRecognition method of the MicrophoneRecognitionClient stops listening and severs the connection to the web service.

Here is some code that may be appropriate in the OnResponseReceived event handler:

_microphoneRecognitionClient.EndMicAndRecognition();
_microphoneRecognitionClient.Dispose();
_microphoneRecognitionClient = null;

I have created a sample WPF app with a single window containing the following XAML:

<StackPanel Name="MainStackPanel" Orientation="Vertical" VerticalAlignment="Top">
    <Button Name="RecordButton" Width="250" Height="100"
            FontSize="32" VerticalAlignment="Top"
            Click="RecordButton_Click">
        Start!
    </Button>
    <TextBox Name="OutputTextbox" VerticalAlignment="Top" Width="600"
             TextWrapping="Wrap" FontSize="18"></TextBox>
</StackPanel>

The code-behind for this window is listed below. It includes some visual cues that the app is listening and displays the latest text returned from the Speech API.

using System;
using System.Configuration;
using System.Threading;
using System.Windows;
using System.Windows.Media;
using Microsoft.ProjectOxford.SpeechRecognition;

namespace SpeechToTextDemo
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        AutoResetEvent _FinalResponseEvent;
        MicrophoneRecognitionClient _microphoneRecognitionClient;

        public MainWindow()
        {
            InitializeComponent();
            RecordButton.Content = "Start\nRecording";
            _FinalResponseEvent = new AutoResetEvent(false);
            OutputTextbox.Background = Brushes.White;
            OutputTextbox.Foreground = Brushes.Black;
        }

        private void RecordButton_Click(object sender, RoutedEventArgs e)
        {
            // Visual cues that the app is now listening
            RecordButton.Content = "Listening...";
            RecordButton.IsEnabled = false;
            OutputTextbox.Background = Brushes.Green;
            OutputTextbox.Foreground = Brushes.White;
            ConvertSpeechToText();
        }

        /// <summary>
        /// Start listening.
        /// </summary>
        private void ConvertSpeechToText()
        {
            var speechRecognitionMode = SpeechRecognitionMode.ShortPhrase;
            string language = "en-us";
            string subscriptionKey = ConfigurationManager.AppSettings["SpeechKey"].ToString();

            _microphoneRecognitionClient
                = SpeechRecognitionServiceFactory.CreateMicrophoneClient(
                    speechRecognitionMode,
                    language,
                    subscriptionKey);

            _microphoneRecognitionClient.OnPartialResponseReceived += OnPartialResponseReceivedHandler;
            _microphoneRecognitionClient.OnResponseReceived += OnMicShortPhraseResponseReceivedHandler;
            _microphoneRecognitionClient.StartMicAndRecognition();
        }

        void OnPartialResponseReceivedHandler(object sender, PartialSpeechResponseEventArgs e)
        {
            string result = e.PartialResult;

            // Update the text box on the UI thread with the latest interpretation
            Dispatcher.Invoke(() =>
            {
                OutputTextbox.Text = result + "\n";
            });
        }

        /// <summary>
        /// Speaker has finished speaking. Sever connection to server, stop listening, and clean up
        /// </summary>
        /// <param name="sender">The object that raised the event.</param>
        /// <param name="e">Event data for the final response.</param>
        void OnMicShortPhraseResponseReceivedHandler(object sender, SpeechResponseEventArgs e)
        {
            Dispatcher.Invoke((Action)(() =>
            {
                _FinalResponseEvent.Set();

                // Stop listening and release the client
                _microphoneRecognitionClient.EndMicAndRecognition();
                _microphoneRecognitionClient.Dispose();
                _microphoneRecognitionClient = null;

                // Restore the UI to its idle state
                RecordButton.Content = "Start\nRecording";
                RecordButton.IsEnabled = true;
                OutputTextbox.Background = Brushes.White;
                OutputTextbox.Foreground = Brushes.Black;
            }));
        }
    }
}

You can download this project from my GitHub repository.

In this article, you learned how to use the Project Oxford Speech Recognition .NET library to take advantage of the Oxford Speech API and add speech-to-text capabilities to your application.