Observability is about the ability to troubleshoot unknown issues that might happen in your application. If you are not familiar with the concept, I recommend watching How to Build Observable Distributed Systems and The Present and Future of Serverless Observability from QCon 2018.

In this article, I’m going to explain how some of the most prominent Serverless observability tools¹ have performed against my test scenarios, along with an overview of each tool’s pros and cons. I have tested those tools against my Node.js-based Serverless app, which is deployed on AWS and uses Proxy Integration. You can find the code on my GitHub.

P.S.: Since I published this post, some vendors have improved upon their negative points / test results. I’m planning to write a revised blog post to test their claims and reflect their changes. Until then, please read their comments at the end of this article to learn about their claimed improvements.

Table of Contents

Test Scenarios
AWS X-Ray
Dashbird
Thundra
IOPipe
Workaround
Conclusion
Footnotes

Test Scenarios

I have tested all those tools against three scenarios by performing load testing with Gatling:

In scenarios 1 and 2, the user receives a 502 Bad Gateway and the vague response “internal server error”. That’s why a proper observability tool is needed to troubleshoot these cases, especially in a big distributed application.

If a tool passed a test, it means it was able to detect the problem and show it to the admin, enabling faster troubleshooting. If a tool failed to meet the aforementioned criteria, its result is marked as failed. However, some tools partly passed a test; in those cases I’ve explained their behaviour.

AWS X-Ray

Results

Pic1: X-Ray doesn’t clarify what the error type is. The UI is also contradictory: hovering over the clock icon shows “no faults or error”.

Pros

Cons

Pic2: Buggy traces are shown with a 200 response, which is confusing.

Dashbird

Results

Pic3: Dashbird shows a buggy trace as successful; the user has to investigate all traces and dig through them thoroughly to find out whether there is an exception.

Pros

Pic4: Example log of a buggy trace; Dashbird shows logs from the whole function execution time.

Cons

Acknowledgment:

Thanks to Taavi Rehemägi, co-founder of Dashbird, for extending my trial period and enabling me to investigate their SaaS.

Thundra

Results

Pros

I talked with its product manager about support for Node.js apps. Apparently, Thundra is focused on Java applications, and its Node.js-related features are far behind. I haven’t had time to investigate its Java features, but if you are running a Java-based Serverless app, I recommend taking a look at Thundra.

Cons

IOPipe

Results

Pros

Cons

Workaround

To detect errors, you can act proactively and use monitoring, either instead of or in combination with observability tools. CloudWatch is the natural choice: Lambda has a built-in agent that sends logs to CloudWatch, and using it doesn’t add extra latency to your functions, unless you publish Custom Metrics.

To do this in an optimised way, you can create a Metric Filter. For example, for error scenario 1, your Metric Filter can have a Filter Pattern such as “Task timed out after”. The Metric Filter then searches through your log events, and whenever it finds a match, it increments the value of the corresponding CloudWatch metric. Subsequently, you can set a CloudWatch alarm on that metric and publish notifications, e.g. via SNS. Also, to get the full potential out of CloudWatch, you can use Structured Logging with JSON. A sketch of this setup is shown below.
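Here is a minimal sketch of that setup using the AWS SDK for Node.js (v2), to match the Node.js demo app. The log group name, metric namespace, alarm name, and SNS topic ARN are hypothetical placeholders, not values from my actual app.

```javascript
const AWS = require('aws-sdk');

const logs = new AWS.CloudWatchLogs();
const cloudwatch = new AWS.CloudWatch();

async function setUpTimeoutAlarm() {
  // Increment a custom metric whenever a log event matches the
  // "Task timed out after" message that Lambda emits on timeouts.
  await logs.putMetricFilter({
    logGroupName: '/aws/lambda/my-function', // hypothetical log group
    filterName: 'lambda-timeouts',
    filterPattern: '"Task timed out after"',
    metricTransformations: [{
      metricNamespace: 'MyApp', // hypothetical namespace
      metricName: 'LambdaTimeouts',
      metricValue: '1',
    }],
  }).promise();

  // Alarm as soon as a single timeout occurs within a minute,
  // and publish the notification via a (hypothetical) SNS topic.
  await cloudwatch.putMetricAlarm({
    AlarmName: 'lambda-timeout-alarm',
    Namespace: 'MyApp',
    MetricName: 'LambdaTimeouts',
    Statistic: 'Sum',
    Period: 60,
    EvaluationPeriods: 1,
    Threshold: 1,
    ComparisonOperator: 'GreaterThanOrEqualToThreshold',
    AlarmActions: ['arn:aws:sns:us-east-1:123456789012:alerts'],
  }).promise();
}

setUpTimeoutAlarm().catch(console.error);
```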

Conclusion

I haven’t had time to investigate all serverless observability tools. However, based on my investigation of the most prominent ones, all of these tools are immature or incomplete at some level and need to improve. There is no single solution that you can use to thoroughly observe your distributed app.

Surprisingly, some issues, like error 2, weren’t addressed by any of the solutions (not even in CloudWatch logs). But that error can happen through simple negligence: I encountered it when my friend was wondering why the end user just got an error and asked me to debug his app. Everything looked OK, with no error or exception, but the end user was still getting an error. After debugging for around an hour, the only thing that came to my mind was the output format. And I was right: he forgot to JSON.stringify() the body property of his function’s output, and AWS Proxy Integration was failing silently. It was a simple application, but imagine this happening in a big and complex distributed app. How are you supposed to find it? A sketch of the bug is shown below.
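To illustrate, here is a minimal sketch of the bug (the handler names are invented for this example). With Proxy Integration, API Gateway expects the body property of the returned object to be a string; handing it a plain object makes the integration fail with a 502 “internal server error”, even though the function itself completes without throwing:

```javascript
// Buggy handler: body is an object, so API Gateway rejects the
// response as a malformed proxy integration output.
exports.buggyHandler = async () => ({
  statusCode: 200,
  body: { message: 'hello' }, // not a string; fails silently
});

// Fixed handler: serialise the body before returning it.
exports.fixedHandler = async () => ({
  statusCode: 200,
  body: JSON.stringify({ message: 'hello' }),
});
```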

To achieve observability, you need to use different solutions in tandem, and also enlist the help of deep monitoring through Structured Logging. Pierre Vincent addressed this in his QCon presentation How to Build Observable Distributed Systems. A minimal structured-logging example is shown below.
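For instance, a structured-logging helper in Node.js could look like the sketch below; the field names are my own choice rather than a standard. Emitting one JSON object per log line lets you query the events with CloudWatch JSON Filter Patterns such as { $.level = "error" }.

```javascript
// Emit one JSON object per log line so CloudWatch can parse the fields.
function log(level, message, context = {}) {
  console.log(JSON.stringify({
    level,
    message,
    timestamp: new Date().toISOString(),
    ...context, // any extra structured fields, e.g. request IDs
  }));
}

// Usage inside a Lambda handler:
log('error', 'payment provider unreachable', { orderId: '42', retryable: true });
```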

My 3 test scenarios are just examples. What other issues and errors do you think should be a priority to observe? Do you know of any other tool that excels over the ones mentioned above? What’s your opinion on the current status of serverless observability IN PRACTICE?

Footnotes

  1. This is just my opinion.