LLM Benchmark

Hey ChatGPT, are you the smartest?
ChatGPT Generally Outperforms Alexa And Google Assistant, But All Virtual Agents Struggle With False Information

We ran this benchmark using the following services: Amazon Alexa, OpenAI ChatGPT, and Google Assistant. We leveraged the ComQA dataset - a set of complex questions derived from the crowd-sourced WikiAnswers website.

Highlights of our report:

ChatGPT outperformed Google and Alexa on answering general knowledge questions.
All the agents struggled to decline to answer questions that have no answer (i.e., nonsensical and/or trick questions).
This is just the start - we plan to expand these benchmarks for other platforms - email us (contact@bespoken.io) with ideas or for customized testing on your own in-house bot/LLM

Read on below for the details!

Success By Platform

The number of questions answered correctly by each platform.

We ran this test initially in 2020 with Alexa, Google and Siri. Click here to view those results.

This time around, all the AIs performed even better, especially considering the complexity of the benchmark. These are difficult questions that are meant to challenge the limits of the assistants, so their success rate is remarkable.

And of course, ChatGPT, which has made a huge splash with the public, also had a big impact on these results. It outperformed both Google and Alexa in terms of overall knowledge.

Success By Question Type

Platforms compared: Amazon Alexa, Google Assistant, OpenAI ChatGPT

The percentage of questions answered correctly for each type of question. Click on any of the question types for more in-depth information.

We would highlight the fairly lackluster performance of all the agents on "No Answer" questions - that is, when a question is not answerable, does the agent correctly decline to answer?

An example of such a question is "What day did the first man land on Mars?" - these are essentially "trick questions", ones that anecdotally have been shown to cause problems for these assistants.

Internally, we have dubbed this the Humility Index - how effective are these engines at knowing when they do NOT know? As you can see, this is a struggle for the current generation of AIs. What's more, this is a critical consideration for any business thinking about adopting this technology - how to measure it AND how to mitigate it.
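For readers who want to try measuring this on their own bot, here is a minimal sketch of one way a Humility Index could be computed: the share of unanswerable questions that the agent correctly declines. This is an illustration, not the grading method used in this report. The Result type, the DECLINE_MARKERS phrase list, and the sample responses are all hypothetical and invented for the example, not data from the benchmark.

```python
from dataclasses import dataclass

# Hypothetical phrases that count as "declining to answer".
# The report does not publish its grading rules; this list is an assumption.
DECLINE_MARKERS = ("i don't know", "i'm not sure", "has not happened", "no one has")


@dataclass
class Result:
    question: str
    answerable: bool  # False for "No Answer" (trick) questions
    response: str     # what the agent said


def declined(response: str) -> bool:
    """Return True if the response contains any of the decline phrases."""
    text = response.lower()
    return any(marker in text for marker in DECLINE_MARKERS)


def humility_index(results: list[Result]) -> float:
    """Fraction of unanswerable questions that the agent declined to answer."""
    unanswerable = [r for r in results if not r.answerable]
    if not unanswerable:
        raise ValueError("no unanswerable questions in the result set")
    return sum(declined(r.response) for r in unanswerable) / len(unanswerable)


# Invented sample responses, for illustration only.
results = [
    Result("What day did the first man land on Mars?", False, "No one has landed on Mars yet."),
    Result("What day did the first man land on Mars?", False, "July 20, 1969."),
    Result("Who wrote Hamlet?", True, "William Shakespeare."),
]

print(humility_index(results))
```

Simple phrase matching like this is fragile. In practice, deciding whether a response truly declines to answer may need human review or a stricter grading rule.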

Success By Question Complexity

The percentage of simple versus complex questions answered correctly by each platform.
Complex questions involve comparisons, composition, and/or temporal reasoning.

Success By Question Topics

The percentage of questions answered correctly for each topic.

How Can This Help Me?

We know that many of our customers are out there right now formulating their strategies for Large Language Models and figuring out how they can best take advantage of this promising new technology.

At Bespoken, we have believed since our inception that testing is critical to working with AI.

And large language models have only made this more true - as they make development easier, they push more of the burden to QA:

The LLM Feedback Loop

Bespoken tests models from LLM providers and vendors for accuracy and functionality, and monitors them continuously for safety. This virtuous, constant cycle is essential for AI.

This process starts during initial development of the model - we work with the vendor to help ensure it is performing properly.
During user acceptance testing, separate datasets are again used to assess the model's performance.
Once live, our monitoring tools keep an eye on it on an ongoing basis to ensure it continues to work optimally.

Who Are You?

Thanks for asking! We are Bespoken, the world leader in automated testing, training and monitoring for conversational AI. We work with companies like Disney, BNP Paribas, Ford Automotive and others to ensure their users have delightful voice experiences.

Want to assess the performance of your own bot for IVR, chat or virtual agent? Ensure that it is operating safely? Just fill out the fields below. We will get in touch shortly:

Learn More - talk to the experts at Bespoken
See how Accuracy Testing can make your LLM safer and smarter
