Last week I saw this post, which reminded me a lot of our experience with App Engine a few years ago. I shared that in a comment that got some attention, so I think it's worthwhile to describe what we went through and how we got (almost) out of the sand trap that is Google App Engine.

But first, some background: our startup uses a fairly complex set of servers that have been hosted almost everywhere over the years, from Azure to AWS, Linode, DigitalOcean and many others. The reason for this complexity relates to our core product, which requires domain-specific servers.

To simplify our initial launch and to scale properly, we chose App Engine. As we're Java guys, this made a lot of sense. The main reason to pick PaaS over IaaS is simplicity: we saw it as a shortcut that would let us focus on our mobile platform instead of managing servers.

The Honeymoon

For the first couple of years things worked fine. We had some issues, to be sure: e.g. a bug in the Eclipse plugin (we used Eclipse because we started way back, when that was the recommended approach) caused a bad deployment; it picked up the IDE's JDK and didn't specify the class version. Our servers were down for hours and the logs were completely cryptic.

We still liked App Engine after that, and we even gave a talk at JavaOne discussing how helpful it had been in scaling our business rapidly.

To prevent issues like that downtime from recurring, we decided to pay for Gold support (an extra $400 per month). A Google rep wrote to me trying to arrange a call, but nothing really happened due to scheduling conflicts, and I never actually spoke to a Google rep in relation to the Gold support. I did meet a couple of Google reps at their local offices before upgrading to Gold and discussed App Engine with them, but those were mostly abstract line-of-business talks.

The “Disaster”

In March 2015 our monthly spend on "datastore read ops" suddenly jumped from about $70 to four digits. Being a busy startup, we didn't notice until the bill arrived, by which point we were already into the April billing cycle!

The thing is, we didn't change anything, or at least we didn't notice any change, as we were pretty busy at the time. To this day I have no idea what went wrong, but I'm getting ahead of myself…

The "App Engine Datastore Read Ops" billing line item is pretty opaque; it effectively means we were reading from the datastore too often. Google recommends using memcache for frequently accessed data, which we did, but "somewhere" memcache wasn't doing its job. The problem is that finding that "somewhere" is a needle in a haystack!
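For context, the pattern Google recommends is a read-through cache: check memcache first and only fall back to the datastore on a miss. A minimal sketch of what that looks like with the App Engine Java APIs we were using (the class name, method name and TTL here are my own hypothetical illustration, not our actual code):

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

// Hypothetical read-through cache wrapper; only cache misses trigger billed datastore read ops.
public class CachedReads {
    private static final int TTL_SECONDS = 600; // assumed TTL, tune per entity kind

    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    private final MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();

    public Entity getEntity(Key key) throws EntityNotFoundException {
        String cacheKey = KeyFactory.keyToString(key);
        Entity cached = (Entity) memcache.get(cacheKey);
        if (cached != null) {
            return cached; // served from memcache, no datastore read op billed
        }
        Entity entity = datastore.get(key); // this is where the billed read op happens
        memcache.put(cacheKey, entity, Expiration.byDeltaSeconds(TTL_SECONDS));
        return entity;
    }
}
```

The catch is that a single code path that bypasses a wrapper like this (or an entity that never hits the cache) is invisible in the bill, which is exactly the needle we were hunting for.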

Had we logged every datastore access, we would have produced huge, unreadable logs to wade through. This isn't something you can debug in App Engine. With a regular database we could have placed triggers on tables to at least see which table was causing the problem, but here we didn't have that level of reporting. Google had some monitors you could install, but the reports they produced didn't help at all. What we really wanted was a simple per-kind read counter, as sketched below.
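The sketch below shows roughly the kind of instrumentation we wished we had: tally reads per entity kind and emit one aggregated log line instead of logging every access. This is a hypothetical helper of my own, not an App Engine feature; counts are per JVM instance, which is still enough to spot the relative hot spot.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

// Hypothetical helper, not an App Engine API: tally datastore reads per entity kind.
public class ReadCounter {
    private static final Logger LOG = Logger.getLogger(ReadCounter.class.getName());
    private static final ConcurrentHashMap<String, AtomicLong> COUNTS =
            new ConcurrentHashMap<String, AtomicLong>();

    // Call next to every datastore get/query, passing the entity kind being read.
    public static void record(String kind) {
        AtomicLong counter = COUNTS.get(kind);
        if (counter == null) {
            COUNTS.putIfAbsent(kind, new AtomicLong());
            counter = COUNTS.get(kind);
        }
        counter.incrementAndGet();
    }

    // Dump one line per kind (e.g. from a cron handler) instead of one log entry per read.
    public static void dump() {
        for (Map.Entry<String, AtomicLong> e : COUNTS.entrySet()) {
            LOG.info("datastore reads: kind=" + e.getKey() + " count=" + e.getValue().get());
        }
    }
}
```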

Naturally we called on Gold support, and even sent them our full project source code. They concluded (after reviewing their logs) that the problem was on our side…

I don’t dispute that. What I do dispute is charging for something you can’t possibly control or monitor!

This was a disaster: we were facing a bill so large it nearly wiped out our revenue. Being bootstrapped and in the early stages of monetization, this had the potential to send us into bankruptcy. Google suggested placing charge limits on the account, which effectively means taking the site down; that's an insane suggestion for a service whose whole purpose is "scale".

The Solution

The problem is that Google only updated billing once a day (or every 12 hours; I don't recall exactly, since I'm recounting details from 2015), so there was literally no way to know whether a fix worked until the next day's billing data came in.

We just cached every possible thing and removed everything that wasn't essential, over and over. Since there was no way to debug this, it involved a lot of guesswork and finger crossing, hoping we weren't making things worse.

The billing eventually went down, but to this day we have no way of knowing which of our changes fixed it.

Why is this Google's Fault?

Clearly something in our code broke, right?

I will take some of the blame, mostly for picking App Engine, but not for this issue.

The reason companies and individuals pick App Engine is to reach a "Google scale" business relatively easily. That's why we picked it: we wanted to avoid a lot of the complexities that come with deploying and managing individual servers.

The main fault on Google's side is opaque billing. When I get a phone bill, I get itemized details explaining the charges. That matters: if my toddler grabbed the phone and started dialing random numbers, it's my fault, but the phone company will at least point me at the problematic numbers.

Google did no such thing. They list one opaque item among the line items, with no information about the actual source and no way to debug it (at least not back in 2015). Gold support didn't help either, so even if a way existed, the fact that their paid support tier couldn't find it is a crucial point.

Migration Away — Alternatives are WAY Better

We still have pieces of code on App Engine, since migrating the database away is really hard and isn't our chief priority. Having said that, migrating away from App Engine gave us HUGE unforeseen advantages:

When I wrote the original comment, someone asked about AppScale. We looked into it in 2015 and hit some issues that I don't recall exactly. I'm sure those have been resolved by now, but back then we couldn't get it to work for us.

Lessons Learned

One of my early jobs was building flight simulators, and we worked with a lot of military pilots who instilled in me the "debrief" ritual: when you do something, you honestly and methodically examine your failures and figure out how to avoid them in the future:

Our annual spend on infrastructure has stayed relatively steady despite business growth. In fact, in some respects we spend less on servers than we did when App Engine was running correctly, and scalability hasn't suffered.
