From its earliest days, Microsoft Cognitive Services has had the ability to convert pictures of text into text, a process known as Optical Character Recognition (OCR). I wrote about using this service here and here.

Recently, Microsoft released a new service to perform OCR. Unlike the previous service, which requires only a single web service call, this service requires two calls: one to pass an image and start the text recognition process; and another to check the status of that process and retrieve the transcribed text.

To get started, you will need to create a Computer Vision key, as described here.

Creating this service gives you a URI endpoint to call as a web service, and an API key, which must be passed in the header of web service calls.

Recognize Text

The first call is to the Recognize Text API. To call this API, send an HTTP POST to the following URL:

https://lllll.api.cognitive.microsoft.com/vision/v2.0/recognizeText?mode=mmmmm

where:

lllll is the location selected when you created the Computer Vision Cognitive Service in Azure; and

mmmmm is "Printed" if the image contains printed text, as from a computer or typewriter; or "Handwritten" if the image contains a picture of handwritten text.

The header of an HTTP request can include name-value pairs. In this request, include the following name-value pairs:

Name: Ocp-Apim-Subscription-Key
Value: The Computer Vision API key (from the Cognitive Service created above)

Name: Content-Type
Value: "application/json", if you plan to pass a URL pointing to an image on the public web; or "application/octet-stream", if you are passing the actual image in the request body. Details about the request body are described below.

You must pass the image or the URL of the image in the request body. What you pass must be consistent with the "Content-Type" value passed in the header.

If you set the Content-Type header value to "application/json", pass the following JSON in the request body:

{"url":"http://xxxx.com/xxx.xxx"}  

where http://xxxx.com/xxx.xxx is the URL of the image you want to analyze. This image must be accessible to Cognitive Services (e.g., it cannot be behind a firewall or password-protected).

If you set the Content-Type header value to "application/octet-stream", pass the binary image in the request body.
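The request described above can be sketched in Python. This is a minimal sketch of the "application/json" case; the LOCATION and API_KEY values are placeholders you would replace with your own region and key, and the helper function name is my own, not part of the API.

```python
import json

# Placeholder values; substitute the region and key from your own
# Computer Vision Cognitive Service.
LOCATION = "westus"
API_KEY = "your-computer-vision-api-key"

def build_recognize_text_request(image_url, mode="Printed"):
    """Assemble the URL, headers, and body for the Recognize Text POST.

    Covers the 'application/json' case, where the image is referenced
    by a public URL rather than uploaded as binary data. For the
    'application/octet-stream' case, you would instead set that
    Content-Type and pass the raw image bytes as the body.
    """
    url = (f"https://{LOCATION}.api.cognitive.microsoft.com"
           f"/vision/v2.0/recognizeText?mode={mode}")
    headers = {
        "Ocp-Apim-Subscription-Key": API_KEY,
        "Content-Type": "application/json",
    }
    body = json.dumps({"url": image_url})
    return url, headers, body

url, headers, body = build_recognize_text_request(
    "http://example.com/sample.png", mode="Printed")
```

The three returned pieces can then be handed to any HTTP client (urllib, requests, etc.) to perform the actual POST.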

You will receive an HTTP response to your POST. A response code of "202" ("Accepted") indicates that the POST was successful and that the service is analyzing the image. An "Accepted" response will include an "Operation-Location" header. The value of this header is a URL that you can use to query whether the service has finished analyzing the image. The URL will look like the following:

https://lllll.api.cognitive.microsoft.com/vision/v2.0/textOperations/gggggggg-gggg-gggg-gggg-gggggggggggg

where:

lllll is the location selected when you created the Computer Vision Cognitive Service in Azure; and

gggggggg-gggg-gggg-gggg-gggggggggggg is a GUID that uniquely identifies the analysis job.
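Handling that response can be sketched as follows. The function name, the example headers dictionary, and the GUID are all fabricated for illustration; only the "202" status code and the "Operation-Location" header come from the service's actual behavior.

```python
def get_operation_location(status_code, headers):
    """Return the Operation-Location URL from a Recognize Text response.

    status_code and headers stand in for the HTTP response received
    from the POST described above.
    """
    if status_code != 202:
        raise RuntimeError(f"Recognize Text call failed: HTTP {status_code}")
    return headers["Operation-Location"]

# A made-up example response header; the GUID here is fabricated.
location = get_operation_location(202, {
    "Operation-Location": ("https://westus.api.cognitive.microsoft.com"
                           "/vision/v2.0/textOperations/"
                           "12345678-1234-1234-1234-123456789abc"),
})

# The final path segment is the GUID that identifies the analysis job.
operation_id = location.rsplit("/", 1)[-1]
```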

Get Recognize Text Operation Result

After you call the Recognize Text service, you can call the Get Recognize Text Operation Result service to determine if the OCR operation is complete.

To call this service, send an HTTP GET request to the "Operation-Location" URL returned in the response to the Recognize Text call above.

In the header, send the following name-value pair:

Name: Ocp-Apim-Subscription-Key
Value: The Computer Vision API key (from the Cognitive Service created above)

This is the same value as in the previous request.

An HTTP GET request has no body, so there is nothing to send there.

If the request is successful, you will receive an HTTP "200" ("OK") response code. A successful response does not, by itself, mean that the image has been analyzed. To know whether it has, you will need to look at the JSON object returned in the body of the response.

At the root of this JSON object is a property named "status". If the value of this property is "Succeeded", this indicates that the analysis is complete, and the text of the image will also be included in the same JSON object.

Other possible statuses are "NotStarted", "Running" and "Failed".
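The polling logic implied by these statuses can be sketched as below. The fetch_status parameter is an injected callable (my own device, not part of the API) that returns the parsed JSON body of a GET to the Operation-Location URL, so the control flow can be shown without a live service.

```python
import time

def poll_operation_result(fetch_status, interval=1.0, max_attempts=10):
    """Poll the Get Recognize Text Operation Result service until done.

    fetch_status is any callable that performs the GET described above
    and returns the response body as a parsed JSON dict.
    """
    for _ in range(max_attempts):
        result = fetch_status()
        status = result.get("status")
        if status == "Succeeded":
            # The transcribed text is in this same JSON object.
            return result
        if status == "Failed":
            raise RuntimeError("Text recognition failed")
        # "NotStarted" or "Running": wait, then ask again.
        time.sleep(interval)
    raise TimeoutError("Operation did not complete in time")
```

In a real application you would wrap the HTTP GET (with the Ocp-Apim-Subscription-Key header) in the callable you pass in, and choose a polling interval appropriate to your workload.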

A successful status will include the recognized text in the JSON document.

At the root of the JSON (the same level as "status") is an object named "recognitionResult". This object contains a child object named "lines".

The "lines" object contains an array of anonymous objects, each of which contains a "boundingBox" object, a "text" object, and a "words" object. Each object in this array represents a line of text.

The "boundingBox" object contains an array of exactly 8 integers, representing the x,y coordinates of the corners of an invisible rectangle around the line.

The "text" object contains a string with the full text of the line.

The "words" object contains an array of anonymous objects, each of which contains a "boundingBox" object and a "text" object. Each object in this array represents a single word in this line.

The "boundingBox" object contains an array of exactly 8 integers, representing the x,y coordinates of the corners of an invisible rectangle around the word.

The "text" object contains a string with the word.

Below is a sample of a partial result:

{
  "status": "Succeeded",
  "recognitionResult": {
    "lines": [
      {
        "boundingBox": [
          202,
          618,
          2047,
          643,
          2046,
          840,
          200,
          813
        ],
        "text": "The walrus and the carpenter",
        "words": [
          {
            "boundingBox": [
              204,
              627,
              481,
              628,
              481,
              830,
              204,
              829
            ],
            "text": "The"
          },
          {
            "boundingBox": [
              519,
              628,
              1057,
              630,
              1057,
              832,
              518,
              830
            ],
            "text": "walrus"
          },
          ...etc...
  

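Pulling the transcription out of a result shaped like the sample above is a matter of walking "recognitionResult" down to "lines" and joining the "text" of each line. This sketch uses a small hand-made response in that shape (the real service returns more lines, words, and coordinates):

```python
import json

# A small, hand-made response shaped like the sample above.
SAMPLE = json.loads("""
{
  "status": "Succeeded",
  "recognitionResult": {
    "lines": [
      {"boundingBox": [202, 618, 2047, 643, 2046, 840, 200, 813],
       "text": "The walrus and the carpenter",
       "words": [
         {"boundingBox": [204, 627, 481, 628, 481, 830, 204, 829],
          "text": "The"},
         {"boundingBox": [519, 628, 1057, 630, 1057, 832, 518, 830],
          "text": "walrus"}
       ]}
    ]
  }
}
""")

def extract_text(result):
    """Join the 'text' of each recognized line into one string."""
    lines = result["recognitionResult"]["lines"]
    return "\n".join(line["text"] for line in lines)

print(extract_text(SAMPLE))  # The walrus and the carpenter
```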
In this article, I showed the details of the Recognize Text and Get Recognize Text Operation Result APIs. In a future article, I will show how to call this service from code within your application.