AI Awesomeness: 2020 Update! Microsoft Cognitive Services Speaker Recognition API

A while back I showed how to use the Microsoft Speaker Recognition APIs in the simplest way I could think of; using a web page to record audio and call the various APIs to set up and test Speaker Verification and Speaker Identification.

Honestly, the hardest part of this by far was getting the audio recorded in the correct format for the APIs to accept! I hacked the wonderful recorderjs from Matt Diamond to set the correct bitrate, channels, etc, and eventually got there through trial and error (and squinting at the minified source code of the Microsoft demo page)!

In the run up to //Build this year, there have been a lot of changes in the Microsoft AI space.

One of these changes managed to break my existing Speaker Recognition API applications (it’s still in Preview, so don’t be surprised!) by moving Speaker Recognition under the Speech Service project, slightly changing the APIs and their endpoints, and adding new functionality (exciting!)

In this article I’ll show the same web page implementation, but use the updated 2020 Speaker Recognition APIs, and introduce the new Verification API that doesn’t rely on a predefined list of passphrases.

All of the code is over on GitHub, with the demo page at https://rposbo.github.io/speaker-recognition-api/

Prerequisites

You need to set up Cognitive Services over in your Azure account, and retrieve the key and endpoint to use; my endpoint, for example, is: https://rposbo-demo.cognitiveservices.azure.com/.

The steps to get these values are in the previous articles.

Throughout this article the code samples refer to variables called key and baseApi, which map to the key and endpoint values mentioned above.
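For reference, a minimal sketch of that setup might look like this – the values are placeholders for your own key and endpoint, and note that baseApi has no trailing slash since the endpoint constants in the samples below add one:

// placeholder values - substitute your own Cognitive Services key and endpoint
const key = "<your-cognitive-services-key>";
const baseApi = "https://<your-resource-name>.cognitiveservices.azure.com";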

Got all that? Great, let’s go!

Speaker Verification

The verification API is intended to be used as a type of login or authentication; just by speaking a key phrase to your Cognitive Services-enhanced app you can, with a degree of confidence, verify that the user is who they (literally) say they are.

Each user is required to “enrol” by repeating a known phrase three times, clearly enough for the API to build a high-confidence voiceprint – that is, the data representation of an audio recording.

To set up Verification we need to call three endpoints:

  1. Create Profile (create a placeholder with an ID)
  2. Enrol Profile (associate some audio with that ID)
  3. Verify Profile (compare some audio to profile and see if it matches)

The big change here is new support for verification phrases that aren’t restricted to the previously allowed list of passphrases: if you continue to use the predefined list then you’re using the text-dependent configuration, and if you use your own audio then you’re using the text-independent configuration. This is a very cool change that merges identification and verification capabilities.

Text Dependent Verification

I’ll start by showing the updated version of the text-dependent verification, which uses a list of predefined passphrases; if you say something that’s not on this list then it won’t work.

For text-dependent verification, the speaker’s voice is enrolled by saying a passphrase from a set of predefined phrases. Voice features are extracted from the audio recording to form a unique voice signature, while the chosen passphrase is also recognised. Together, the voice signature and the passphrase are used to verify the speaker. docs

Verification Phrases

A reminder of how to get the allowed phrases:

const phrasesEndpoint = 
	`${baseApi}/speaker/verification/v2.0/text-dependent/phrases/en-US`;

// Get the supported verification phrases
var request = new XMLHttpRequest();
request.open("GET", phrasesEndpoint, true);
request.setRequestHeader('Ocp-Apim-Subscription-Key', key);
request.onload = function(){ console.log(request.responseText); };
request.send();

If you execute that then you’ll find an array of phrases you can use to configure verification:

[
	{"passPhrase":"i am going to make him an offer he cannot refuse"},
	{"passPhrase":"houston we have had a problem"},
	{"passPhrase":"my voice is my passport verify me"},
	{"passPhrase":"apple juice tastes funny after toothpaste"},
	{"passPhrase":"you can get in without your password"},
	{"passPhrase":"you can activate security system now"},
	{"passPhrase":"my voice is stronger than passwords"},
	{"passPhrase":"my password is not your business"},
	{"passPhrase":"my name is unknown to you"},
	{"passPhrase":"be yourself everyone else is already taken"}
]

This is your reminder to go and watch the fantastic 90s hacker movie Sneakers to see where one of these came from!

Profile Creation

Now let’s create a placeholder profile:

const createVerificationProfileEndpoint = 
	`${baseApi}/speaker/verification/v2.0/text-dependent/profiles`;

var request = new XMLHttpRequest();
request.open("POST", createVerificationProfileEndpoint, true);
request.setRequestHeader('Content-Type','application/json');
request.setRequestHeader('Ocp-Apim-Subscription-Key', key);
request.onload = function () {
	console.log(request.responseText);
	var json = JSON.parse(request.responseText);
	
	// previously json.verificationProfileId
	var profileId = json.profileId;
};
request.send(JSON.stringify({'locale' :'en-us'}));

All this does is set up a blank profile with which we’ll associate the processed audio; the response looks like this:

{
  "remainingEnrollmentsCount": 3,
  "locale": "en-us",
  "createdDateTime": "2020-06-14T13:20:47.069Z",
  "enrollmentStatus": "Enrolling",
  "modelVersion": null,
  "profileId": "7f3fe300-ef62-43f7-8d1b-918bcb6a9c8b",
  "lastUpdatedDateTime": null,
  "enrollmentsCount": 0,
  "enrollmentsLength": 0,
  "enrollmentSpeechLength": 0
}

Notice remainingEnrollmentsCount is 3 and all the other enrollment counts are zero – next we need to take that profileId and associate some audio with it until remainingEnrollmentsCount is zero.

Profile Enrolment

Assuming we have the profile ID in a variable called profileId, the code below will submit some audio that you’ve already recorded and saved in the blob variable to the enrolment endpoint.

Not sure how to record audio in the right format? Check out the customised version of recorderjs over in the demo repo
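Before the enrolment call itself, here’s a rough sketch of what the recording side could look like with a recorderjs-style recorder; the constructor options and format handling here are assumptions, so check the customised recorder in the demo repo for the real configuration:

// rough sketch only: capture microphone audio and export a WAV blob.
// The config options are assumptions - the demo repo's customised
// recorderjs handles the exact format the API expects (16 kHz, mono, 16-bit PCM)
let recorder;
let blob;

navigator.mediaDevices.getUserMedia({ audio: true }).then(function (stream) {
	const audioContext = new AudioContext();
	const source = audioContext.createMediaStreamSource(stream);
	recorder = new Recorder(source, { numChannels: 1 });
	recorder.record();
});

// later, once the user has finished speaking:
function stopRecording() {
	recorder.stop();
	recorder.exportWAV(function (wavBlob) {
		blob = wavBlob; // this is the blob sent to the enrolment endpoint below
	});
}

With a blob recorded in the right format, the enrolment call looks like this: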

const enrollVerificationProfileEndpoint = 
	`${baseApi}/speaker/verification/v2.0/text-dependent/profiles/${profileId}/enrollments`;

var request = new XMLHttpRequest();
request.open("POST", enrollVerificationProfileEndpoint, true);
request.setRequestHeader('Ocp-Apim-Subscription-Key', key);
request.onload = function () {
	console.log('enrolling');
	var json = JSON.parse(request.responseText);
	console.log(json);

	// we need 3 successfully enrolled chunks of audio per profile id
	if (json.remainingEnrollmentsCount == 0) {
		console.log("Verification should be enabled!")
	}
};

// "blob" is the audio encoded in the necessary format
request.send(blob);

The initial response will hopefully be something like:

{
  "remainingEnrollmentsCount": 2,
  "passPhrase": "my voice is my passport verify me",
  "profileId": "7f3fe300-ef62-43f7-8d1b-918bcb6a9c8b",
  "enrollmentStatus": "Enrolling",
  "enrollmentsCount": 1,
  "enrollmentsLength": 3.76,
  "enrollmentsSpeechLength": 3.6,
  "audioLength": 3.76,
  "audioSpeechLength": 3.6
}

Notice that enrollmentsCount is 1 and remainingEnrollmentsCount is 2; that means it succeeded once, and we still need to do the same thing twice more – the same voice, with the same passphrase.

There are various reasons this can fail; the reason for failure is usually in the responseText – for example, if you say something that’s not in the list of passphrases:

{
  "error": {
    "code": "InvalidRequest",
    "message": "Invalid passphrase."
  }
}

or perhaps if it can’t isolate the voice within the audio:

{
  "error": {
    "code": "InvalidRequest",
    "message": "Audio is too noisy."
  }
}

Once you’ve successfully executed the enrolment (yes, I use one “l” in “enrolment”, Microsoft uses two – I’m English!) then you’ll get a response like this:

{
  "remainingEnrollmentsCount": 0,
  "passPhrase": "my voice is my passport verify me",
  "profileId": "7f3fe300-ef62-43f7-8d1b-918bcb6a9c8b",
  "enrollmentStatus": "Enrolled",
  "enrollmentsCount": 3,
  "enrollmentsLength": 11.62,
  "enrollmentsSpeechLength": 10.23,
  "audioLength": 3.93,
  "audioSpeechLength": 3.31
}

You have now associated a voiceprint with the profile – yay! Lastly, let’s try and verify you.

Profile Verification

The process here is to submit audio to the API with the profile ID that you’re attempting to verify:

const verifyProfileEndpoint = 
	`${baseApi}/speaker/verification/v2.0/text-dependent/profiles/${profileId}/verify`;
	
var request = new XMLHttpRequest();
request.open("POST", verifyProfileEndpoint, true);
// we're sending audio here, not JSON, so no JSON content-type header is needed
request.setRequestHeader('Ocp-Apim-Subscription-Key', key);

request.onload = function () {
	console.log('verifying profile');
	// Was it a match?
	console.log(JSON.parse(request.responseText));
};
request.send(blob);

The response will look something like this for a success:

{
  "recognitionResult": "Accept",
  "score": 0.6933718323707581
}  

And like this for a rejection (wrong person or wrong passphrase):

{
  "recognitionResult": "Reject",
  "score": 0
}

Differences

The changes between this version of the API and the previous one are mainly cosmetic; changing verificationProfileId to profileId in the JSON response, for example. The biggest change is that you no longer need to poll a different endpoint for the processing status – either processing completes quickly, or the request simply takes a few seconds before returning a response. Take this into account when building your app, and don’t kill the connection after a second or two!
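For example, XMLHttpRequest lets you set an explicit timeout; the 30 seconds below is an arbitrary figure of my own choosing rather than an official recommendation, but it illustrates giving the enrolment call room to finish:

var request = new XMLHttpRequest();
request.open("POST", enrollVerificationProfileEndpoint, true);
request.setRequestHeader('Ocp-Apim-Subscription-Key', key);

// give the API plenty of time to process the audio before giving up
request.timeout = 30000; // milliseconds; a generous, arbitrary value
request.ontimeout = function () {
	console.log('Enrolment request timed out - try again or increase the timeout');
};

request.onload = function () {
	console.log(JSON.parse(request.responseText));
};

request.send(blob);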

Text Independent Verification

Now let’s investigate the new verification API type – text-independent:

Text-independent verification has no restrictions on what the speaker says during enrollment or in the audio sample to be verified, as it only extracts voice features to score similarity. Text-independent verification means speakers can speak in everyday language in the enrollment and verification phrases. docs

So it basically works the same as the previous speaker verification, but can use any audio for the speaker – no restriction to a list of predefined passphrases. I don’t know if this affects the accuracy of the match, or if the speaker recognition capabilities have just significantly improved over the past couple of years.

Either way, to set up text-independent Verification we need to call the same type of APIs but on different endpoints:

  1. Create Profile (create a placeholder with an ID)
  2. Enrol Profile (associate some audio with that ID)
  3. Verify Profile (compare some audio to profile and see if it matches)

Profile Creation

This is the same as before, but changing the URL path to contain text-independent instead of text-dependent:

// notice the subtly different URL path
const createTextIndependentVerificationProfileEndpoint = 
	`${baseApi}/speaker/verification/v2.0/text-independent/profiles`;

// everything else is the same:
var request = new XMLHttpRequest();
request.open("POST", createTextIndependentVerificationProfileEndpoint, true);
request.setRequestHeader('Content-Type','application/json');
request.setRequestHeader('Ocp-Apim-Subscription-Key', key);
request.onload = function () {
	console.log(request.responseText);
	var json = JSON.parse(request.responseText);
	
	var profileId = json.profileId;
};

request.send(JSON.stringify({'locale' :'en-us'}));

The response from this will look like:

{
  "remainingEnrollmentsSpeechLength": 20,
  "locale": "en-us",
  "createdDateTime": "2020-06-14T17:25:09.015Z",
  "enrollmentStatus": "Enrolling",
  "modelVersion": null,
  "profileId": "db2aa5fd-9e55-446e-a1d3-76d8110db58b",
  "lastUpdatedDateTime": null,
  "enrollmentsCount": 0,
  "enrollmentsLength": 0,
  "enrollmentSpeechLength": 0
}

Since we’re not going to repeat the same phrase multiple times, the response has remainingEnrollmentsSpeechLength instead of remainingEnrollmentsCount.

Profile Enrolment

Again, this is the same sort of API call except for the path difference. You can set the ignoreMinLength param to true in order to allow for much shorter audio samples; of course this will impact accuracy. If you leave it as the default (false) then you’ll need to keep submitting more audio to the same endpoint until remainingEnrollmentsSpeechLength is 0.

const enrolTextIndependentVerificationProfileEndpoint = 
	`${baseApi}/speaker/verification/v2.0/text-independent/profiles/${profileId}/enrollments?ignoreMinLength=false`;

var request = new XMLHttpRequest();
request.open("POST", enrolTextIndependentVerificationProfileEndpoint, true);
request.setRequestHeader('Ocp-Apim-Subscription-Key', key);
request.onload = function () {
	console.log('enrolling');
	console.log(request.responseText);

	var json = JSON.parse(request.responseText);

	// if remainingEnrollmentsSpeechLength  > 0 then you need to 
	// execute this again for the same profile Id until
	// remainingEnrollmentsSpeechLength is 0
	if (json.remainingEnrollmentsSpeechLength == 0) 
	{
		console.log("Verification should be enabled!")
	}
};
 
request.send(blob);

The response will look something like this:

{
  "remainingEnrollmentsSpeechLength": 15.05,
  "profileId": "db2aa5fd-9e55-446e-a1d3-76d8110db58b",
  "enrollmentStatus": "Enrolling",
  "enrollmentsCount": 1,
  "enrollmentsLength": 5.64,
  "enrollmentsSpeechLength": 4.95,
  "audioLength": 5.64,
  "audioSpeechLength": 4.95
}

As mentioned above, if the audio sample you submitted was less than 20 seconds (ignoring pauses or gaps) then you’ll need to keep submitting more audio until remainingEnrollmentsSpeechLength is 0.
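If you want to automate that, a sketch like this could work – recordAudioBlob here is a hypothetical stand-in for however you capture a new audio sample (e.g. via the demo’s recorder), not part of the API:

// sketch: keep enrolling new audio samples until the profile needs no more speech.
// "recordAudioBlob" is a hypothetical function returning a Promise of a WAV blob
function enrolUntilDone(profileId) {
	const endpoint = 
		`${baseApi}/speaker/verification/v2.0/text-independent/profiles/${profileId}/enrollments`;

	recordAudioBlob().then(function (blob) {
		var request = new XMLHttpRequest();
		request.open("POST", endpoint, true);
		request.setRequestHeader('Ocp-Apim-Subscription-Key', key);
		request.onload = function () {
			var json = JSON.parse(request.responseText);
			console.log(`Remaining speech needed: ${json.remainingEnrollmentsSpeechLength}s`);

			if (json.remainingEnrollmentsSpeechLength > 0) {
				// not enough speech yet - record and submit another sample
				enrolUntilDone(profileId);
			} else {
				console.log("Verification should be enabled!");
			}
		};
		request.send(blob);
	});
}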

Profile Verification

Once the profile has fully enrolled, you can try out the verification; this time you don’t need to repeat the same thing you said previously; just say anything and submit the sample against the profile ID you want to verify.


const verifyTextIndependentProfileEndpoint = 
	`${baseApi}/speaker/verification/v2.0/text-independent/profiles/${profileId}/verify`;

var request = new XMLHttpRequest();
request.open("POST", verifyTextIndependentProfileEndpoint, true);

// again, we're sending audio rather than JSON, so no JSON content-type header is needed
request.setRequestHeader('Ocp-Apim-Subscription-Key', key);

request.onload = function () {
	console.log('verifying profile');

	// Was it a match?
	console.log(JSON.parse(request.responseText));
};

request.send(blob);

Your response should look something like this (the score will be lower if you decided to use ignoreMinLength when enrolling the profile’s audio):

{
  "recognitionResult": "Accept",
  "score": 0.8042700886726379
}

Speaker Identification

This functionality allows you to register several audio profiles for different people. Once done, you can submit an audio sample which will be matched to the profiles, allowing you to identify the speaker in the sample. Very cool stuff. The functionality hasn’t really changed from the previous version, except that the endpoint you poll for status is not returned as a location header from the enroll call anymore; it’s just a clearly defined endpoint per profile.

The steps are similar to before, but the response structure is slightly different:

  1. Create Profile (create a placeholder with an ID)
  2. Enrol Profile (associate some audio with that ID)
  3. Poll for Enrolment status (we’re sending longer, unstructured, speech so it can take time to process)
  4. Identify Profile (compare some audio to a list of profile IDs and see if it matches any of them)

Identification Profile Creation

As with Verification, this initially creates a placeholder profile against which we need to associate audio samples:

const createIdentificationProfileEndpoint = 
	`${baseApi}/speaker/identification/v2.0/text-independent/profiles`;

var request = new XMLHttpRequest();
request.open("POST", createIdentificationProfileEndpoint, true);
request.setRequestHeader('Content-Type','application/json');
request.setRequestHeader('Ocp-Apim-Subscription-Key', key);

request.onload = function () {
	console.log('creating profile');

	var json = JSON.parse(request.responseText);
	console.log(json);

	var profileId = json.profileId;
};

request.send(JSON.stringify({ 'locale' :'en-us'}));

This should create that placeholder for you and give you a response like:

{
  "remainingEnrollmentsSpeechLength": 20,
  "locale": "en-us",
  "createdDateTime": "2020-06-14T20:17:58.456Z",
  "enrollmentStatus": "Enrolling",
  "modelVersion": null,
  "profileId": "19dd988e-b1c1-44d3-b9ff-13d5c1a67799",
  "lastUpdatedDateTime": null,
  "enrollmentsCount": 0,
  "enrollmentsLength": 0,
  "enrollmentSpeechLength": 0
}

Similar to the text-independent Verification method, we have to fill up the remainingEnrollmentsSpeechLength with audio. To do that we need to enroll.

Identification Profile Enrolment

We need to submit the audio along with the profileId that it belongs to until remainingEnrollmentsSpeechLength is 0 (or we just pass ignoreMinLength=true and let the accuracy be lower):

const enrollIdentificationProfileEndpoint =
	`${baseApi}/speaker/identification/v2.0/text-independent/profiles/${profileId}/enrollments`;

var request = new XMLHttpRequest();
request.open("POST", enrollIdentificationProfileEndpoint, true);
request.setRequestHeader('Ocp-Apim-Subscription-Key', key);
request.onload = function () {
	console.log('enrolling');	
	var json = JSON.parse(request.responseText);
	
	if (request.status==200 || request.status==201) {
		console.log(json);
		// now we need to poll for enrollment status
	} else {
		console.log(`Failed to submit for enrollment: got a ${request.status} response code.`);
		console.log(`${json.error.code}: ${json.error.message}`);
	}
};  
request.send(blob);

The response will look a bit like this, assuming you’ve submitted enough audio (or set ignoreMinLength to true):

{
  "remainingEnrollmentsSpeechLength": 0,
  "profileId": "19dd988e-b1c1-44d3-b9ff-13d5c1a67799",
  "enrollmentStatus": "Enrolled",
  "enrollmentsCount": 1,
  "enrollmentsLength": 14.76,
  "enrollmentsSpeechLength": 13.45,
  "audioLength": 14.76,
  "audioSpeechLength": 13.45
}

If remainingEnrollmentsSpeechLength is greater than 0 then just submit more audio until it reaches 0. If the enrollmentStatus is Enrolled, then processing is already complete; if it’s Training then you need to check back in a second or so.

Here’s how to do that:

Poll for Identification Profile Enrolment Status

Previously the enroll step returned a location header for you to check for the enrolment status; now you can just hit the “Get Profile” endpoint for that profileId:

const enrollIdentificationProfileStatusEndpoint =
	`${baseApi}/speaker/identification/v2.0/text-independent/profiles/${profileId}`;

var enrolledInterval;

// hit the endpoint every second
enrolledInterval = setInterval(function()
{
	var request = new XMLHttpRequest();
	request.open("GET", enrollIdentificationProfileStatusEndpoint, true);
	request.setRequestHeader('Ocp-Apim-Subscription-Key', key);
	request.onload = function()
	{
		console.log('getting status');
		var json = JSON.parse(request.responseText);
		console.log(json);

		if (json.enrollmentStatus == 'Enrolled')
		{
			// Woohoo! The audio was enrolled successfully! 
			// stop polling
			clearInterval(enrolledInterval);
			console.log('enrollment complete!');
		}
		else 
		{
			// keep polling
			console.log('Not done yet..');
		}
	};

	request.send();
}, 1000);

Once you get an "enrollmentStatus": "Enrolled" response, then you’re good to go.

Identify Speaker Profile

Add a few different profiles and hold on to the returned profileIds. We will now submit an audio sample along with a list of profileIds to compare it against, and will hopefully receive the correctly matching profileId for that audio – fingers crossed!

var Ids = [...]; // array of the profile ids
const identifyProfileEndpoint = 
	(Ids) => 
	`${baseApi}/speaker/identification/v2.0/text-independent/profiles/identifySingleSpeaker?profileIds=${Ids}&ignoreMinLength=true`;

var request = new XMLHttpRequest();
// convert the array to a comma-delimited list for the API endpoint
request.open("POST", identifyProfileEndpoint(Ids.join()), true);
request.setRequestHeader('Ocp-Apim-Subscription-Key', key);
request.onload = function () {
	console.log('identifying profile');
	var json = JSON.parse(request.responseText);

	// hopefully the match is in this response
	console.log(json);
};

request.send(blob);

Once you’ve submitted the audio sample, you should get a quick response like this:

{
  "identifiedProfile": {
    "profileId": "19dd988e-b1c1-44d3-b9ff-13d5c1a67799",
    "score": 0.82385486
  },
  "profilesRanking": [
    {
      "profileId": "19dd988e-b1c1-44d3-b9ff-13d5c1a67799",
      "score": 0.82385486
    }
    ... // any other profiles you added
  ]
}

There are the usual possible errors for the audio: too short, too quiet, not the right format etc.

Summary

You’ve now got working examples for the June 2020 version of the Microsoft Cognitive Services Speaker Recognition API, including the changes to the previous version and the new text-independent Verification API.

Remember to head over to the documentation for more up-to-date info, and the working example to play with this; you just need your Cognitive Services key and endpoint from your Azure portal.

Let me know how you get on!

References

https://docs.microsoft.com/en-gb/azure/cognitive-services/speech-service/speaker-recognition-overview

https://docs.microsoft.com/en-us/rest/api/speakerrecognition/

https://rposbo.github.io/speaker-recognition-api

https://github.com/rposbo/speaker-recognition-api
