IVR ASR Performance Benchmark

Measuring The New Generation of Digital Contact Center Platforms

What We Did

Bespoken and DefinedCrowd ran this benchmark to measure the automatic speech recognition (ASR) performance of three new players in the Digital Contact Center space:

Amazon Connect

Google Dialogflow

Twilio Voice

We used datasets in US English and Spanish to see how well they performed against challenging, real-world data. Learn more about how we ran the tests here.

What It Means

The results speak for themselves - these are tough scenarios, and they demonstrate that Speech Recognition, despite what you may read, is far from a solved problem. It's critical to thoroughly evaluate vendors before making a decision - DefinedCrowd and Bespoken can help.
For example, here is an audio sample where someone is speaking in the background:

Here is a case where the user has a poor quality connection:

Someone might say these tests are almost unfair. But what is unfair is not taking into account that your users are not calling from an ivory tower. They are calling from their cars, from their kitchens and family rooms, on their speakerphones, with their kids around. And they still need help and service.

If there is just ONE thing you are going to take from this report, it is that speech recognition is not a STATIC, ONE-SIZE-FITS-ALL space. Though the state of the art of Speech Recognition has advanced rapidly, it still takes careful testing, training and tuning to deliver maximum performance.

Word Error Rate By Platform (English)

Here we see significant variation between platforms. And the Word Error Rates are high - these are tough, real-world scenarios, the type customers face every day: pauses and hesitations, background noise, spotty connections, poor enunciation, etc.
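
For readers less familiar with the metric: Word Error Rate (WER) is computed by aligning the platform's transcript against a human reference transcript and counting substitutions, deletions, and insertions, divided by the number of words in the reference. Here is a minimal sketch of that calculation (the sample sentences are our own illustrations, not taken from the benchmark data):

```python
# Minimal Word Error Rate (WER) sketch: edit distance over word tokens.
# WER = (substitutions + deletions + insertions) / words in the reference.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution plus one insertion against a four-word reference -> WER = 0.5
print(wer("open my checking account", "open my check in account"))
```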

Word Error Rate By Platform (Spanish)

Interestingly, for Spanish we see very different vendor performance - it's not exactly flipped, but the second-place and third-place vendors have swapped positions. This is not surprising, as there are significant variations within platforms in their support for specific languages and locales. The conclusion? Just because a platform does well with one language and locale, one CANNOT assume it will excel at others. Testing, with real data, is the key to finding the right solution.

And overall, it is worth noting that performance in Spanish is generally better. That is because the audio in this dataset is generally "cleaner" - better connections with less background noise.

Word Error Rate By Domain

Word Error Rate By Domain (English)

Word Error Rate By Domain (Spanish)

Similar to the significant variation we see by language, performance also varies based on the industry and domain we are working within. This holds true across both English and Spanish.

ASR systems will bias towards the domains they know best and have been trained on most intensively. So if your provider has not been trained and tuned for your specific domain, your performance will be negatively impacted - and vice versa.

For example, in the banking domain, the word "checking" is used routinely. In hospitality, users will commonly say "check-in". Just on the pure sound of the words, they are easily confused - and that is where the bias of the specific platform will creep in. These issues are eminently solvable, but they do require attention and care.
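
As one illustration of what that "attention and care" can look like: most platforms expose some form of phrase biasing or custom vocabulary. Below is a minimal sketch using the speech_contexts feature of Google's Cloud Speech-to-Text Python client; the phrases, boost value, and file name are hypothetical examples, and the other vendors offer analogous (though differently shaped) mechanisms:

```python
# Hypothetical sketch: biasing recognition toward banking vocabulary with
# phrase hints (Google Cloud Speech-to-Text's speech_contexts feature).
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,  # typical telephony audio
    language_code="en-US",
    speech_contexts=[
        speech.SpeechContext(
            # Hypothetical domain phrases to weight more heavily
            phrases=["checking account", "routing number", "wire transfer"],
            boost=10.0,
        )
    ],
)

# "caller_utterance.wav" is a hypothetical file name
with open("caller_utterance.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```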

Word Error Rate By Background Noise (English)

As mentioned at the outset - these are tough, real-world scenarios. Background noise really challenges speech recognition systems, and we can see here how dramatic its impact is.

For example, click here to listen to a clip where the caller appears to have the TV on in the background. It's difficult, but also commonplace.
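
Conditions like this are also reproducible in your own test suites: take a clean recording and mix a noise track into it at a controlled signal-to-noise ratio (SNR). A minimal sketch, assuming mono WAV files at the same sample rate - the file names and SNR target are hypothetical:

```python
# Hypothetical sketch: create "TV in the background" test audio by mixing a
# noise track into a clean utterance at a target signal-to-noise ratio.
import numpy as np
import soundfile as sf

clean, rate = sf.read("clean_utterance.wav")  # hypothetical file names
noise, _ = sf.read("tv_background.wav")
noise = np.resize(noise, clean.shape)         # loop/trim noise to match length

def mix_at_snr(signal, noise, snr_db):
    """Scale the noise so the mix lands at the requested SNR in dB."""
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

# 5 dB SNR is quite noisy - roughly "TV clearly audible behind the caller".
# (Very low SNRs may clip; a production harness would normalize the mix.)
sf.write("noisy_utterance.wav", mix_at_snr(clean, noise, snr_db=5), rate)
```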

Word Error Rate By Ethnicity (English)

Also unsurprising - race and ethnicity have a big impact. Comparing white and non-white speakers, we see non-white speakers being understood significantly worse - almost dramatically so.

A couple additional notes to keep in mind with this chart:
  • The data presented here is only for the English dataset, as we do not have ethnicity classifications for the Spanish-language dataset.
  • We roll up between white and non-white speakers, as well as across vendors, because our goal is not to fuel incendiary headlines.
Instead, we want to point out that these systems are imperfect. That they perform better for particular audiences is CRITICAL to bear in mind when working with them.

Word Error Rate By Age Group (English)

We see minimal impact by age in the English dataset - this is an area that represents significant progress for ASR/NLP platforms. This is good news and shows how these systems have evolved over time to better serve real-world scenarios.

Word Error Rate By Age Group (Spanish)

But even though age is not a factor for the English dataset, it is for our Spanish dataset. For older users, the platforms do not perform as well. Depending on the demographics of your users, this could be a very significant finding.

Word Error Rate By Gender

Word Error Rate By Gender (English)

Word Error Rate By Gender (Spanish)

The impact by gender is minimal across both languages - this is great to see! Historically, this has been a challenge for ASR/IVR systems - and at least for these platforms and datasets, it appears to have been overcome.

What Is Next?

Follow along as we expand on this benchmark to cover:

  • Tuning and training the ASR behavior using Bespoken's software - we typically see huge improvements using simple techniques
  • Expanding the domains and languages that we include in our testing
  • Testing additional platforms such as IBM Watson, Genesys, Azure Communication Services, and more

We welcome your input too - if you have ideas on what we should include in the benchmark, or how we can improve it, drop us a line.

Who Are We?

Testing, Training, Tuning and Monitoring

For Conversational AI - Voice and Chat

We work with companies like Mercedes-Benz, Roku, Spotify and others to ensure their users have delightful voice experiences.

We test, train and monitor their systems to rapidly improve performance. See our recent case study for more details on the profound impact our tools and team can have.

Then contact us for a free baseline assessment of YOUR AI-based system and a roadmap to improve it.