TL;DR: At one of the orgs we were consulting with, we found that 70% of their GET API calls were repeats, but the responses could not be cached because the underlying data kept changing. Finding a way to cache these APIs allowed us to bring our database down from m5.4xlarge to m5.2xlarge and saved us ~30% of our monthly cloud spend on AWS database & egress costs.
Most application APIs are considered non-cacheable either because:
- the underlying data changes frequently
- they use GraphQL, which makes caching difficult
- they’re REST APIs that fetch using POST because they have their search parameters within the body
We were in a similar situation at our client's org, a B2B application. We used Postgres on RDS, with many tables that had to be joined to produce an API response. This became both slow and expensive as our customer base & data stored increased.
Looking for low-hanging fruit, we stumbled on an interesting stat: ~70% of our API calls were repeats. So these could be cached!
Not so easy. With our application having a high number of writes (20 updates per second), we either had to use a very low TTL, making the API cache effectively useless, or risk showing stale data to our users.
Here’s the risk-gain ratio on Cloudflare & CloudFront, with a TTL of 1s (the minimum possible):
At 20 updates per second, we risked showing 19% stale data to our users. This was a deal breaker.
Time to ditch API caching, clearly a no-go... unless we’re missing something?
On digging deeper into our data, we found that the majority of our writes updated only a few columns in our database. This meant that, at least theoretically, only those values needed to be re-computed and plugged into our cache, with the rest of the API’s response staying identical.
Messy POC v0.1
This wasn’t possible on Cloudflare, CloudFront or Fastly. So we decided to build a POC of a mutable cache that received a database update stream and used it to update the cache in place.
The results were a little better: not good enough yet, but promising, since there was much less stale data.
What was most interesting though was that because our mutable cache layer was directly connected to our database updates, there was now a path towards having a higher TTL for our cache without risking stale data being shown to our customers.
Messy POC v0.2
After multiple iterations and finding a better name (SuperAPI), we now have zero stale data, a much better cache HIT ratio and very low latency API (~23ms on average).
As for our client's org, after moving to a smaller database instance & reducing the number of compute instances, they’re happily chugging along with 30% lower costs on AWS.
If you’re interested in how SuperAPI works in more detail: How does SuperAPI work?