Book Image

Learning Microsoft Cognitive Services - Second Edition

By : Leif Larsen
Book Image

Learning Microsoft Cognitive Services - Second Edition

By: Leif Larsen

Overview of this book

Microsoft has revamped its Project Oxford to launch the all new Cognitive Services platform-a set of 30 APIs to add speech, vision, language, and knowledge capabilities to apps. This book will introduce you to 24 of the APIs released as part of Cognitive Services platform and show you how to leverage their capabilities. More importantly, you'll see how the power of these APIs can be combined to build real-world apps that have cognitive capabilities. The book is split into three sections: computer vision, speech recognition and language processing, and knowledge and search. You will be taken through the vision APIs at first as this is very visual, and not too complex. The next part revolves around speech and language, which are somewhat connected. The last part is about adding real-world intelligence to apps by connecting them to Knowledge and Search APIs. By the end of this book, you will be in a position to understand what Microsoft Cognitive Service can offer and how to use the different APIs.
Table of Contents (19 chapters)
Title Page
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface

Getting feedback on detected faces


Now that we have seen what else Microsoft Cognitive Services can offer, we are going to add an API to our face detection application. Through this part, we will add the Bing Speech API to make the application say the number of faces out loud.

This feature of the API is not provided in the NuGet package, and as such we are going to use the REST API.

To reach our end goal, we are going to add two new classes, TextToSpeak and Authentication. The first class will be in charge of generating correct headers and making the calls to our service endpoint. The latter class will be in charge of generating an authentication token. This will be tied together in our ViewModel, where we will make the application speak back to us.

We need to get our hands on an API key first. Head over to the Microsoft Azure Portal. Create a new service for Bing Speech.

To be able to call the Bing Speech API, we need to have an authorization token. Go back to Visual Studio and create a new file called Authentication.cs. Place this in the Model folder.

We need to add two new references to the project. Find System.Runtime.Serialization and System.Web packages in the Assembly tab in the Add References window and add them.

In our Authentication class, define four private variables and one public property as follows:

    private string _requestDetails; 
    private string _token; 
    private Timer _tokenRenewer; 
 
    private const int TokenRefreshInterval = 9; 
 
    public string Token { get { return _token; } } 

The constructor should accept one string parameter, clientSecret. The clientSecret parameter is the API key you signed up for.

In the constructor, assign the _clientSecret variable as follows:

    _clientSecret = clientSecret; 

Create a new function called Initialize as follows:

    public async Task Initialize()
    {
        _token = GetToken(); 
 
        _tokenRenewer = new Timer(new TimerCallback(OnTokenExpiredCallback), this, 
        TimeSpan.FromMinutes(TokenRefreshInterval), 
        TimeSpan.FromMilliseconds(-1));
    } 

We then fetch the access token, in a method we will create shortly.

Finally, we create our timer class, which will call the callback function in 9 minutes. The callback function will need to fetch the access token again and assign it to the _token variable. It also needs to assure that we run the timer again in 9 minutes.

Next we need to create the GetToken method. This method should return a Task<string>object, and it should be declared as private and marked as async.

In the method, we start by creating an HttpClient object, pointing to an endpoint that will generate our token. We specify the root endpoint and add the token issue path as follows:

    using(var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add ("Opc-Apim-Subscription-Key", _clientSecret);
        UriBuilder uriBuilder = new UriBuilder (https://api.cognitive.microsoft.com/sts/v1.0”);
        uriBuilder.Path = “/issueToken”;

We then go on to make a POST call to generate a token as follows:

var result = await client.PostAsync(uriBuilder.Uri.AbsoluteUri, null);

When the request has been sent, we expect there to be a response. We want to read this response and return the response string:

return await result.Content.ReadAsStringAsync();

Add a new file called TextToSpeak.cs if you have not already done so. Put this file in the Model folder.

Beneath the newly created class (but inside the namespace), we want to add two event argument classes. These will be used to handle audio events, which we will see later.

The AudioEventArgs class simply takes a generic stream, and you can imagine it being used to send the audio stream to our application:

    public class AudioEventArgs : EventArgs 
    { 
        public AudioEventArgs(Stream eventData) 
        { 
            EventData = eventData; 
        } 
 
        public StreamEventData { get; private set; }  
    } 

The next class allows us to send an event with a specific error message:

    public class AudioErrorEventArgs : EventArgs 
    { 
        public AudioErrorEventArgs(string message) 
        { 
            ErrorMessage = message; 
        } 
 
        public string ErrorMessage { get; private set; } 
    } 

We move on to start on the TextToSpeak class, where we start off by declaring some events and class members as follows:

    public class TextToSpeak 
    { 
        public event EventHandler<AudioEventArgs>OnAudioAvailable; 
        public event EventHandler<AudioErrorEventArgs>OnError; 
 
        private string _gender; 
        private string _voiceName; 
        private string _outputFormat; 
        private string _authorizationToken; 
        private AccessTokenInfo _token;  
 
        private List<KeyValuePair<string, string>> _headers = new  List<KeyValuePair<string, string>>(); 

The first two lines in the class are events using the event argument classes we created earlier. These events will be triggered if a call to the API finishes, and returns some audio, or if anything fails. The next few lines are string variables, which we will use as input parameters. We have one line to contain our access token information. The last line creates a new list, which we will use to hold our request headers.

We add two constant strings to our class as follows:

private const string RequestUri =  "https://speech.platform.bing.com/synthesize"; 

private const string SsmlTemplate = 
    "<speak version='1.0'xml:lang='en-US'>
        <voice xml:lang='en-US'xml:gender='{0}' 
        name='{1}'>{2}
        </voice>
    </speak>";

The first string contains the request URI. That is the REST API endpoint we need to call to execute our request. Next we have a string defining our Speech Synthesis Markup Language (SSML) template. This is where we will specify what the speech service should say, and a bit on how it should say it.

Next we create our constructor as follows:

        public TextToSpeak() 
        { 
            _gender = "Female"; 
            _outputFormat = "riff-16khz-16bit-mono-pcm"; 
            _voiceName = "Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)"; 
        } 

Here we are just initializing some of our variables declared earlier. As you may see, we are defining the voice to be female and we define it to use a specific voice. In terms of gender, naturally it can be either female or male. In terms of voice name, it can be one of a long list of options. We will look more into the details of that list when we go through this API in a later chapter.

The last line specifies the output format of the audio. This will define the format and codec in use by the resulting audio stream. Again, this can be a number of varieties, which we will look into in a later chapter.

Following the constructor, there are three public methods we will create. These will generate an authentication token, generate some HTTP headers, and finally execute our call to the API. Before we create these, you should add two helper methods to be able to raise our events. Call them the RaiseOnAudioAvailable and RaiseOnError methods. They should accept AudioEventArgs and AudioErrorEventArgs as parameters.

Next, add a new method called the GenerateHeaders method as follows:

        public void GenerateHeaders() 
        { 
            _headers.Add(new KeyValuePair<string, string>("Content-Type", "application/ssml+xml")); 
            _headers.Add(new KeyValuePair<string, string>("X-Microsoft-OutputFormat", _outputFormat)); 
            _headers.Add(new KeyValuePair<string, string>("Authorization", _authorizationToken)); 
            _headers.Add(new KeyValuePair<string, string>("X-Search-AppId", Guid.NewGuid().ToString("N"))); 
            _headers.Add(new KeyValuePair<string, string>("X-Search-ClientID", Guid.NewGuid().ToString("N"))); 
            _headers.Add(new KeyValuePair<string, string>("User-Agent", "Chapter1")); 
        } 

Here we add the HTTP headers to our previously created list. These headers are required for the service to respond, and if any is missing, it will yield an HTTP/400 response. What we are using as headers is something we will cover in more detail later. For now, just make sure they are present.

Following this, we want to add a new method called GenerateAuthenticationToken as follows:

        public bool GenerateAuthenticationToken(string clientSecret) 
        { 
            Authentication auth = new Authentication(clientSecret); 

This method accepts one string parameter, the client secret (your API key). First we create a new object of the Authentication class, which we looked at earlier, as follows:

        try 
        { 
            _token = auth.Token; 
 
            if (_token != null) 
            { 
                _authorizationToken = $"Bearer {_token}"; 
 
                return true; 
            } 
            else 
            { 
                RaiseOnError(new AudioErrorEventArgs("Failed to generate authentication token.")); 
                return false; 
            } 
        } 

We use the authentication object to retrieve an access token. This token is used in our authorization token string, which, as we saw earlier, is being passed on in our headers. If the application for some reason fails to generate the access token, we trigger an error event.

Finish this method by adding the associated catch clause. If any exceptions occur, we want to raise a new error event.

The last method that we need to create in this class is going to be called the SpeakAsync method. This method will be actually performing the request to the Speech API:

        public Task SpeakAsync(string textToSpeak, CancellationTokencancellationToken) 
        { 
            varcookieContainer = new CookieContainer(); 
            var handler = new HttpClientHandler() { 
                CookieContainer = cookieContainer 
            }; 
            var client = new HttpClient(handler);  

The method takes two parameters. One is the string, that will be the text we want to get spoken. The next is a cancellation token. This can be used to propagate that the given operation should be cancelled.

When entering the method, we create three objects which we will use to execute the request. These are classes from the .NET library, and we will not be going through them in any more detail.

We generated some headers earlier and we need to add these to our HTTP client. We do this by adding the headers in the preceding foreach loop, basically looping through the entire list:

            foreach(var header in _headers) 
            { 
                client.DefaultRequestHeaders.TryAddWithoutValidation (header.Key, header.Value); 
            } 

Next we create an HTTP Request Message, specifying that we will send data through the POST method and specifying the request URI. We also specify the content using the SSML template we created earlier and adding the correct parameters (gender, voice name, and the text we want to be spoken):

            var request = new HttpRequestMessage(HttpMethod.Post, RequestUri) 
            { 
                Content = new StringContent(string.Format(SsmlTemplate, _gender, _voiceName, textToSpeak)) 
            }; 

We use the HTTP client to send the HTTP request asynchronously as follows:

            var httpTask = client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead, cancellationToken); 

The following code is a continuation of the asynchronous send call we made previously. This will run asynchronously as well and check the status of the response. If the response is successful, it will read the response message as a stream and trigger the audio event. If everything succeeded, then that stream should contain our text in spoken words:

    var saveTask = httpTask.ContinueWith(async (responseMessage, token) => 
    { 
        try 
        { 
            if (responseMessage.IsCompleted && 
                responseMessage.Result != null &&   
                responseMessage.Result.IsSuccessStatusCode) { 
                var httpStream = await responseMessage. Result.Content.ReadAsStreamAsync().ConfigureAwait(false); 
                RaiseOnAudioAvailable(new AudioEventArgs (httpStream)); 
            } else { 
                RaiseOnError(new AudioErrorEventArgs($"Service returned {responseMessage.Result.StatusCode}")); 
            } 
        } 
        catch(Exception e) 
        { 
            RaiseOnError(new AudioErrorEventArgs (e.GetBaseException().Message)); 
        } 
    } 

If the response indicates anything other than success, we will raise the error event.

We also want to add a catch clause as well as a finally clause to this. Raise an error if an exception is caught, and dispose of all objects used in the finally clause.

The final code we need is to specify that the continuation task is attached to the parent task. Also, we need to add the cancellation token to this task as well. Go on to add the following code to finish off the method:

    }, TaskContinuationOptions.AttachedToParent, cancellationToken); 
    return saveTask; 
}  

With that in place, we are now able to utilize this class in our application, and we are going to do that now. Open the MainViewModel.cs file and declare a new class variable as follows:

        private TextToSpeak _textToSpeak; 

Add the following code in the constructor to initialize the newly added object. We also need to call a function to generate the authentication token as follows:

            _textToSpeak = new TextToSpeak(); 
            _textToSpeak.OnAudioAvailable +=  _textToSpeak_OnAudioAvailable; 
            _textToSpeak.OnError += _textToSpeak_OnError; 
 
            GenerateToken();

After we have created the object, we hook up the two events to event handlers. Then we generate an authentication token by creating a function GenerateToken with the following content:

public async void GenerateToken()
{
    if (await _textToSpeak.GenerateAuthenticationToken("BING_SPEECH_API_KEY_HERE"))
        _textToSpeak.GenerateHeaders(); 
}

Then we generate an authentication token, specifying the API key for the Bing Speech API. If that call succeeds, we generate the HTTP headers required.

We need to add the event handlers, so create the method called _textToSpeak_OnError first as follows:

            private void _textToSpeak_OnError(object sender, AudioErrorEventArgs e) 
            { 
                StatusText = $"Status: Audio service failed -  {e.ErrorMessage}"; 
            } 

It should be rather simple, just to output the error message to the user, in the status text field.

Next, we need to create a _textToSpeak_OnAudioAvailable method as follows:

        private void _textToSpeak_OnAudioAvailable(object sender, AudioEventArgs e) 
        { 
            SoundPlayer player = new SoundPlayer(e.EventData); 
            player.Play(); 
            e.EventData.Dispose(); 
        } 

Here we utilize the SoundPlayer class from the .NET framework. This allows us to add the stream data directly and simply play the message.

The last part we need for everything to work is to make the call to the SpeakAsync method. We can make that by adding the following at the end of our DetectFace method:

    await _textToSpeak.SpeakAsync(textToSpeak, CancellationToken.None); 

With that in place, you should now be able to compile and run the application. By loading a photo and clicking on Detect face, you should be able to get the number of faces spoken back to you. Just remember to have audio on!