Last week I saw this post, which reminded me a lot of our experience with App Engine a few years ago. I shared that in a comment that got some attention, so I think it's worthwhile to describe what we went through and how we got (almost) out of the sand trap that is Google App Engine.

But first, some background: our startup uses a fairly complex set of servers that have been hosted almost everywhere over the years, from Azure to AWS, Linode, DigitalOcean and many others. The reason for this complexity relates to our core product, which requires domain-specific servers.

To simplify our initial launch and to scale properly, we chose App Engine. As we're Java guys, this made a lot of sense. The main reason to pick PaaS over IaaS is simplicity: we saw it as a shortcut that would let us focus on our mobile platform instead of managing servers.

The Honeymoon

For the first couple of years things worked fine. We had some issues, to be sure: e.g. a bug in the Eclipse plugin (we used Eclipse because we started way back, when that was the recommended approach) caused a bad deployment; it picked up the IDE's JDK and didn't specify the class version. Our servers were down for hours and the logs were completely cryptic.

We still liked App Engine after that, and we even gave a talk at JavaOne discussing how helpful it had been in scaling our business rapidly.

To prevent issues like that downtime from recurring, we decided to pay for Gold support (an extra $400 per month). A Google rep wrote to me trying to arrange a call, but nothing really happened due to scheduling conflicts, and I never actually spoke to a Google rep in relation to the Gold support. I did meet a couple of Google reps at their local offices before upgrading to Gold and discussed App Engine with them, but those were mostly abstract line-of-business talks.

The “Disaster”

In March 2015 our monthly spend on "datastore read ops" suddenly jumped from about $70 to four digits. Being a busy startup, we didn't notice until the bill arrived, by which point we were already into the April billing cycle!

The thing is, we didn't change anything, or at least we didn't notice any change, as we were pretty busy at the time. To this day I have no idea what went wrong, but I'm getting ahead of myself…

The "App Engine Datastore Read Ops" billing line item is pretty opaque; it effectively means we were reading from the datastore too often. Google recommends using memcache for frequently accessed data, which we did, but "somewhere" memcache wasn't doing its job. The problem is that finding that "somewhere" is a needle in a haystack!
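For context, the pattern Google recommends is a read-through cache: check memcache first and only fall back to the datastore on a miss. A minimal sketch of what that looks like with the App Engine Java APIs we were using (the class name, method name and TTL here are my own hypothetical illustration, not our actual code):

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

// Hypothetical read-through cache wrapper; only cache misses trigger billed datastore read ops.
public class CachedReads {
    private static final int TTL_SECONDS = 600; // assumed TTL, tune per entity kind

    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    private final MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();

    public Entity getEntity(Key key) throws EntityNotFoundException {
        String cacheKey = KeyFactory.keyToString(key);
        Entity cached = (Entity) memcache.get(cacheKey);
        if (cached != null) {
            return cached; // served from memcache, no datastore read op billed
        }
        Entity entity = datastore.get(key); // this is where the billed read op happens
        memcache.put(cacheKey, entity, Expiration.byDeltaSeconds(TTL_SECONDS));
        return entity;
    }
}
```

The catch is that a single code path that bypasses a wrapper like this (or an entity that never hits the cache) is invisible in the bill, which is exactly the needle we were hunting for.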

Had we logged every datastore access, we would have produced huge, unreadable logs to wade through. This isn't something you can debug in App Engine. With a regular database we could have placed triggers on tables to at least see which table was causing the problem, but here we didn't have that level of reporting. Google had some monitors you could install, but the reports they produced didn't help at all. What we really wanted was a simple per-kind read counter, as sketched below.
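The sketch below shows roughly the kind of instrumentation we wished we had: tally reads per entity kind and emit one aggregated log line instead of logging every access. This is a hypothetical helper of my own, not an App Engine feature; counts are per JVM instance, which is still enough to spot the relative hot spot.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

// Hypothetical helper, not an App Engine API: tally datastore reads per entity kind.
public class ReadCounter {
    private static final Logger LOG = Logger.getLogger(ReadCounter.class.getName());
    private static final ConcurrentHashMap<String, AtomicLong> COUNTS =
            new ConcurrentHashMap<String, AtomicLong>();

    // Call next to every datastore get/query, passing the entity kind being read.
    public static void record(String kind) {
        AtomicLong counter = COUNTS.get(kind);
        if (counter == null) {
            COUNTS.putIfAbsent(kind, new AtomicLong());
            counter = COUNTS.get(kind);
        }
        counter.incrementAndGet();
    }

    // Dump one line per kind (e.g. from a cron handler) instead of one log entry per read.
    public static void dump() {
        for (Map.Entry<String, AtomicLong> e : COUNTS.entrySet()) {
            LOG.info("datastore reads: kind=" + e.getKey() + " count=" + e.getValue().get());
        }
    }
}
```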

Naturally we called on Gold support, and even sent them our full project source code. They concluded (after reviewing their logs) that the problem was on our side…

I don’t dispute that. What I do dispute is charging for something you can’t possibly control or monitor!

This was a disaster: we were facing a bill so large it nearly wiped out our revenue. Being bootstrapped and in the early stages of monetization, this had the potential to send us into bankruptcy. Google suggested placing charge limits on the account, which effectively means taking the site down; that's an insane suggestion for a service whose whole purpose is "scale".

The Solution

The problem is that Google only updated billing once a day (or every 12 hours; I don't recall exactly, since I'm recounting details from 2015), so there was literally no way to know whether a fix worked until the next day's billing data came in.

We just cached every possible thing and removed everything that wasn't essential, over and over. Since there was no way to debug this, it involved a lot of guesswork and finger crossing, hoping we weren't making things worse.

The billing eventually went down, but to this day we have no way of knowing which of our changes fixed it.

Why is this Google's Fault?

Clearly something in our code broke, right?

I will take some of the blame, mostly for picking App Engine, but not for this issue.

The reason companies and individuals pick App Engine is to reach a "Google scale" business relatively easily. That's why we picked it: we wanted to avoid a lot of the complexities that come with deploying and managing individual servers.

The main fault on Google's side is opaque billing. When I get a phone bill, I get itemized details explaining the charges. That matters: if my toddler grabbed the phone and started dialing random numbers, it's my fault, but the phone company will at least point me at the problematic numbers.

Google did no such thing. They list one opaque item among the line items, with no information about the actual source and no way to debug it (at least not back in 2015). Gold support didn't help either, so even if a way existed, the fact that their paid support tier couldn't find it is a crucial point.

Migration Away — Alternatives are WAY Better

We still have pieces of code on App Engine, since migrating the database away is really hard and isn't our chief priority. Having said that, migrating away from App Engine gave us HUGE unforeseen advantages:

When I wrote the original comment, someone asked about AppScale. We looked into it in 2015 and hit some issues that I don't recall exactly. I'm sure those have been resolved by now, but back then we couldn't get it to work for us.

Lessons Learned

One of my early jobs was building flight simulators, and we worked with a lot of military pilots who instilled in me the "debrief" ritual: when you do something, you honestly and methodically examine your failures and figure out how to avoid them in the future:

Our annual spend on infrastructure has stayed relatively steady despite business growth. In fact, in some respects we spend less on servers than we did when App Engine was running correctly, and scalability hasn't suffered.
