This is the second part of the trilogy focusing on performance evaluation infrastructure for conversational AI agents.

Quick Recap

In case you have not read the first part of this trilogy: it focused on the high-level questions of what metrics are needed to ensure end-user performance for an AI agent, and why.

Here is a summary:

In this article, we will focus on how to obtain these metrics in a manner that is easy to use, accurate and actionable.

The Oracle

We wish there were an “oracle” that could provide highly accurate end-user performance metrics. Guess what? There is! We as humans are extremely good at this. However, it comes at a cost of efficiency: the number of data points is small, but each one is high-confidence. We will discuss some ways to mitigate this gap.

There should be infrastructure in place to collect and analyze these data points, which we will refer to as the “ground truth”.
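As a minimal sketch of what such a data point might look like (the field names below are illustrative assumptions, not a prescribed schema), each human judgment can be stored as a structured record so the infrastructure can aggregate the small but high-confidence set:

```python
from dataclasses import dataclass

@dataclass
class GroundTruthDataPoint:
    """One human-judged interaction with the agent (illustrative schema)."""
    conversation_id: str   # which conversation was evaluated
    query: str             # what the user asked
    agent_response: str    # what the agent replied
    rater_score: int       # human judgment, e.g. 1 (bad) to 5 (great)
    latency_ms: float      # end-to-end latency observed during the evaluation

def mean_rater_score(points: list[GroundTruthDataPoint]) -> float:
    """Aggregate the small but high-confidence set of human judgments."""
    return sum(p.rater_score for p in points) / len(points)
```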

The Next Best Thing

While ground truth provides a highly confident set of metrics, it cannot replicate all the conditions that might occur for a real user in production. Production metrics may carry lower confidence, but they are needed to uncover unanticipated issues.

Hence, there is a need to invest in logging signals that back production metrics. There are a few things to be cautious of with production logging.

Privacy and security implications must be kept in mind

This is largely self-explanatory: all production logs should be aggregated and anonymized to protect against privacy and security issues.
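A minimal sketch of what this could look like, assuming a salted-hash approach for identifiers and aggregate-only reporting (your actual privacy and security requirements will dictate the real mechanism):

```python
import hashlib

SALT = "rotate-me-regularly"  # assumption: a server-side salt, rotated periodically

def anonymize_user_id(user_id: str) -> str:
    """Replace the raw user id with a salted hash so logs cannot be traced back."""
    return hashlib.sha256((SALT + user_id).encode()).hexdigest()

def aggregate_latencies(latencies_ms: list[float]) -> dict:
    """Report only aggregates (count, median, max), never raw per-user rows."""
    if not latencies_ms:
        return {"count": 0}
    ordered = sorted(latencies_ms)
    return {
        "count": len(ordered),
        "median_ms": ordered[len(ordered) // 2],
        "max_ms": ordered[-1],
    }
```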

Log only what is needed

Success metrics need to be decided beforehand, and the corresponding logs identified from them. It is tempting to instead add a lot of logging and then determine which metrics can be derived from it. This ends up polluting the metrics and making them less objective: we only measure whatever the available logs happen to allow.
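A minimal sketch of this metrics-first approach, with hypothetical metric and event names, is to declare the success metrics up front and derive the allowed logging surface from them:

```python
# Assumed example: success metrics are declared first, and each one lists the
# minimal events it needs. Anything not referenced here does not get logged.
SUCCESS_METRICS = {
    "task_completion_rate": ["task_started", "task_completed"],
    "response_latency": ["user_query_received", "agent_response_sent"],
}

# The logging surface is derived from the metrics, not the other way around.
ALLOWED_EVENTS = {event for events in SUCCESS_METRICS.values() for event in events}

def should_log(event_name: str) -> bool:
    """Only events that back a pre-declared success metric are logged."""
    return event_name in ALLOWED_EVENTS
```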

The other drawback of logging too much is that it can affect the behavior the user experiences, causing performance degradation.

Keep logging simple

It is important to note that while metrics should drive what logs are needed, logs should not try to derive metrics. In other words, logging should just capture what event happened and when; the derivation should happen on the metric-processing side, not during the user interaction. This keeps logging and metric derivation orthogonal - single responsibility :)
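Here is a minimal sketch of that separation, with hypothetical event names: the production side records only the event and its timestamp, and the metric pipeline derives values such as latency later, offline:

```python
import json
import time

def log_event(event_name: str, conversation_id: str) -> None:
    """Production side: record only what happened and when. No derived values."""
    record = {"event": event_name, "conversation_id": conversation_id, "ts": time.time()}
    print(json.dumps(record))  # stand-in for the real log sink

# --- Later, in the offline metric pipeline (not in the user interaction path) ---

def derive_latency_ms(events: list[dict]) -> float:
    """Metric side: derive latency from the raw events after the fact."""
    start = next(e["ts"] for e in events if e["event"] == "user_query_received")
    end = next(e["ts"] for e in events if e["event"] == "agent_response_sent")
    return (end - start) * 1000.0
```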

Keeping logging simple also reduces unnecessary complexity in production code, which can otherwise result in unexpected errors and performance degradation.

Putting everything together

The two types of metric collection processes we discussed here are complementary to each other.

As a result, both of these should be used together to create a highly reliable and fast conversational agent. For any new change (user facing or not), both metric systems should be used to verify that there is no degradation in end-user performance.
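As a rough sketch (the thresholds and metric names here are assumptions for illustration, not prescribed values), a release check for a new change could require both systems to agree that nothing has degraded:

```python
# Illustrative thresholds; real values depend on your product's tolerances.
MAX_LATENCY_REGRESSION_MS = 50.0
MIN_QUALITY_SCORE = 4.0

def change_is_safe(ground_truth_score: float,
                   baseline_latency_ms: float,
                   candidate_latency_ms: float) -> bool:
    """A change ships only if both metric systems agree there is no degradation:
    human-judged (ground truth) quality stays high, and production-style latency
    does not regress beyond the allowed budget."""
    quality_ok = ground_truth_score >= MIN_QUALITY_SCORE
    latency_ok = (candidate_latency_ms - baseline_latency_ms) <= MAX_LATENCY_REGRESSION_MS
    return quality_ok and latency_ok
```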

The metric collection processes should aim to improve each other:

High Level Takeaways

What’s Next?

As our conversational AI agent scales across different devices - phones, tablets, cars, earphones - it becomes vital that the performance evaluation infrastructure can also scale seamlessly. We will discuss scaling performance metrics infrastructure in the final part of the trilogy - stay tuned!