OVERVIEW

Hey (Alexa | Google | Siri), who knows the most?
Measuring How Well The Major Voice Assistants Answer Complex Questions

What We Did

We ran this benchmark using the following devices: Amazon Echo Show 5, Apple iPad Mini, Google Nest Home Hub.

We used the dev dataset from ComQA - a set of complex questions derived from the crowd-sourced WikiAnswers website.

Tests were run using our Bespoken Test Robots - if you can talk to it, we can test it. Read more about our test protocol here, and see the detailed results by question here.
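For readers curious how a benchmark like this can be scored, here is a minimal sketch in Python. The answer-normalization rule, the record shapes, and the sample data are illustrative assumptions, not our actual pipeline:

```python
# Illustrative scoring sketch (not our production pipeline).
# Each record pairs an assistant's spoken reply with the set of
# accepted answers for that question from the dataset.

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'Paris.' matches 'paris'."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def is_correct(reply: str, accepted_answers: list[str]) -> bool:
    """A reply counts as correct if any accepted answer appears in it."""
    reply_norm = normalize(reply)
    return any(normalize(ans) in reply_norm for ans in accepted_answers)

def success_rate(results: list[dict]) -> float:
    """Fraction of questions answered correctly."""
    correct = sum(is_correct(r["reply"], r["answers"]) for r in results)
    return correct / len(results)

# Hypothetical sample results for one platform:
sample = [
    {"reply": "The capital of France is Paris.", "answers": ["Paris"]},
    {"reply": "Sorry, I don't know that one.", "answers": ["1969"]},
]
print(success_rate(sample))  # 0.5
```

Real scoring also has to handle paraphrased answers and spoken-number formats, which simple substring matching does not capture.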

What It Means

All the assistants performed well considering the complexity of the benchmark. These are difficult questions that are meant to challenge the limits of the assistants, so their success rate is remarkable. Google's ability to consistently tackle complex queries is particularly impressive.

At the same time, there are clearly areas where significant work remains. We look forward to seeing how these platforms learn and improve over time.

Success By Platform

The number of questions answered correctly by each platform.

Success By Question Complexity

The percentage of simple versus complex questions answered correctly by each platform.
Complex questions involve comparisons, composition, and/or temporal reasoning.

Success By Question Type

The percentage of questions answered correctly for each type of question.
We classified questions as zero or more of the categories listed below. Click on any of the question types for more in-depth information.
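Because a question can carry several category labels at once, the per-type success rates are computed over overlapping subsets: a question counts toward every category it belongs to. A small sketch of that multi-label tally, with made-up labels and results:

```python
from collections import defaultdict

def success_by_type(results: list[dict]) -> dict[str, float]:
    """Per-category success rate when each question can have zero or
    more labels; a question contributes to every label it carries."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        for label in r["labels"]:
            totals[label] += 1
            correct[label] += r["correct"]  # bool counts as 0 or 1
    return {label: correct[label] / totals[label] for label in totals}

# Hypothetical labeled results:
sample = [
    {"labels": ["comparison"], "correct": True},
    {"labels": ["comparison", "temporal"], "correct": False},
    {"labels": ["temporal"], "correct": True},
]
print(success_by_type(sample))  # {'comparison': 0.5, 'temporal': 0.5}
```

Note that with overlapping categories the per-type percentages do not sum to the overall success rate, which is why we report them separately.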

Success By Question Topic

The percentage of questions answered correctly for each topic.
Click here for a complete breakdown across all topics.

What Is Next?

We plan to produce these benchmarks on a routine basis, both refreshing these results and conducting additional studies. Here is what we currently have planned:

  • ASR Benchmark - including Google Speech-to-Text, Twilio Autopilot, Houndify, and others
  • Personal Assistant Benchmark - questions related to managing calendar appointments, email, etc.
  • IVR Benchmark - performance of various IVR platforms

Have your own ideas for a benchmark? Ways that we can improve this? Drop us a line.

Who Are You?

Thanks for asking! We are Bespoken, the world leader in automated testing, training and monitoring for voice. We work with companies like Mercedes-Benz, Roku, Spotify and others to ensure their users have delightful voice experiences.

Want to assess the performance of your own voice assistant for Alexa, Google, or IVR? Our tools do more than measure performance - they provide specific guidance on how to improve it. Interested? Contact us - contact@bespoken.io

We put this together with our partner, Project Voice. Look for more original research from us soon!