Umbraco Status page


Welcome to the Umbraco status page. On this page, you can see the current operational status as well as plans for scheduled maintenance and automatic upgrades for all our cloud offerings: Umbraco Cloud and Heartcore.

Subscribe to updates above to get the latest status sent straight to your inbox.

If you’re experiencing issues with your cloud project that do not seem to relate to the current operational status, please go to Our Umbraco and search for the issue, or reach out to Umbraco Support in the portal chat.

Degraded performance of GraphQL on Umbraco Heartcore
Incident Report for Umbraco Cloud
Postmortem

In December 2021 and the beginning of January 2022, we experienced a number of issues related to the GraphQL API in Umbraco Heartcore. This post-mortem covers a summary of the different issues and what we have done to fix them.

The Heartcore GraphQL issues occurred on the following dates, as communicated on https://status.umbraco.io/:

  • December 2, 2021
  • December 9, 2021
  • December 14, 2021
  • December 23, 2021
  • January 4, 2022

For these issues, we have identified three main issue types:

  • Issues related to cache updates
  • Outage due to connection exhaustion
  • Performance issues

Below, you’ll find a description of each type, including how we solved them. Finally, we’d like to invite you to a webinar, “Heartcore Status and Future”, where we will tell you more about our plans for making Umbraco Heartcore more stable and beneficial.

Issues related to cache updates

What happened: 

When performing queries against the Heartcore GraphQL API, the content is cached upon first retrieval. The first request goes back to the platform and performs the query in our datastore, and as the result is returned to the caller, it is also cached at the edge (Cloudflare).

The next time the same query is run, the result is returned from the cache.
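
To make the caching behaviour concrete, here is a minimal TypeScript sketch of a client querying the Heartcore GraphQL API. The endpoint and header names follow the public Heartcore documentation, while the project alias, API key, and content type in the query are made up for illustration; this is a sketch, not a reference client.

    // Minimal sketch of querying the Heartcore GraphQL API (placeholder
    // project alias, API key, and content type).
    async function queryHeartcore<T>(query: string, variables: Record<string, unknown> = {}): Promise<T> {
      const response = await fetch("https://graphql.umbraco.io", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Umb-Project-Alias": "my-project-alias", // hypothetical project alias
          "Api-Key": "my-api-key",                 // only needed for protected content
        },
        body: JSON.stringify({ query, variables }),
      });
      const { data } = await response.json();
      return data as T;
    }

    async function demo(): Promise<void> {
      const query = "{ allProduct { items { name } } }"; // hypothetical content type
      await queryHeartcore(query); // first call: resolved by the platform and cached at the edge
      await queryHeartcore(query); // identical call: served from the edge cache
    }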

When new content is published, media is saved, or metadata (doc types, data types) is updated, the cache is invalidated. All of this centers around an ETag, which is used as a cache key in combination with the URL, project alias, query, etc.

We initially stored this ETag in the database, which meant that all requests had to go back to the datastore in order to determine whether the cache should be invalidated. This was less than ideal for a number of reasons, so it was changed to be stored in-memory in the GraphQL server.
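
As a rough illustration of that mechanism (the hashing scheme and the exact fields are assumptions for the example, not Umbraco's actual implementation), a cache key could be derived like this:

    import { createHash } from "node:crypto";

    // Illustration only: combine the URL, project alias, query, and current
    // ETag into one cache key. Bumping the ETag on publish means the same
    // query maps to a new key, so stale cached results are never looked up again.
    function cacheKey(url: string, projectAlias: string, query: string, etag: string): string {
      return createHash("sha256")
        .update(`${url}|${projectAlias}|${query}|${etag}`)
        .digest("hex");
    }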

What we did:
As the datastore configuration was changed, the approach of storing the ETag in-memory proved insufficient, so we decided to move it out of the server and closer to the edge cache itself.

Over the past month, we went through three iterations of updating our cache refresh strategy to make it more efficient and reliable. At the same time, we reduced the overhead of checking the validity of cached queries.

At the time of writing, we are also looking at how long it takes to refresh the edge cache, as this is an area we want to improve further.
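
The sketch below shows the general idea of keeping the ETag next to the edge rather than in the datastore, using a Cloudflare Workers KV namespace as an example store. The names and shape are hypothetical; it is meant only to illustrate why cache validity checks no longer need to reach the datastore.

    // Requires the types from @cloudflare/workers-types for KVNamespace.
    interface Env {
      ETAGS: KVNamespace;
    }

    // Called by the platform when content, media, or metadata changes:
    // bumping the ETag is the whole invalidation step.
    export async function bumpEtag(env: Env, projectAlias: string): Promise<void> {
      await env.ETAGS.put(`etag:${projectAlias}`, Date.now().toString());
    }

    // Called per request at the edge: deciding whether a cached result is
    // still valid is now a fast KV read rather than a datastore query.
    export async function currentCacheKey(env: Env, projectAlias: string, query: string): Promise<string> {
      const etag = (await env.ETAGS.get(`etag:${projectAlias}`)) ?? "0";
      return `${projectAlias}:${etag}:${query}`;
    }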

Outage due to connection exhaustion

What happened:
The Content Delivery Platform is one of the central parts of Umbraco Heartcore and is where the Content Delivery, Preview, and GraphQL APIs live. At the heart of this is a central datastore, which holds draft and published content as well as the GraphQL schema.

The life-cycle of the datastore involves ingestion of updated content/media/metadata and serving queries for the various APIs. As with most datastores, there is a limit to the number of connections that can be open at once.

Above, we mentioned how each request would go back to the Content Delivery Platform to validate the cache key and whether the cached result was still valid. This caused a large number of requests to hit the database, often at the same time as updates were inserted, which resulted in connection exhaustion.
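
A hypothetical sketch of that failure mode, written with node-postgres and a made-up table, looks like this: every incoming request opens its own connection just to validate the cache, so N concurrent requests hold N connections.

    import { Client } from "pg";

    // Anti-pattern sketch (table and environment variable names are made up):
    // one new datastore connection per request, used only to check whether the
    // cached result is still valid.
    async function isCacheStillValid(projectAlias: string, cachedEtag: string): Promise<boolean> {
      const client = new Client({ connectionString: process.env.DATASTORE_URL });
      await client.connect(); // a new connection for every incoming request
      try {
        const result = await client.query(
          "SELECT etag FROM project_state WHERE alias = $1",
          [projectAlias]
        );
        return result.rows[0]?.etag === cachedEtag;
      } finally {
        await client.end();
      }
    }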

What we did: 

Initially, we solved the connection exhaustion issue by scaling the datastore to handle more connections. Scaling alone did not solve the problem, however, which became apparent as the load on the GraphQL API increased while more customers were onboarded.

Reworking our caching strategy was another initiative aimed at improving the platform and enabling it to handle more load.
Additionally, we now better control the ingestion of data into the datastore and use connection pooling around the reads performed against it.
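
As a minimal sketch of what pooled reads look like (sizes and names are illustrative, not Umbraco's actual configuration), a shared, bounded pool caps how many datastore connections reads can consume regardless of request volume:

    import { Pool } from "pg";

    // A single shared pool for reads with a hard ceiling on connections.
    const readPool = new Pool({
      connectionString: process.env.DATASTORE_READ_URL,
      max: 20,                        // hard ceiling on concurrent connections
      idleTimeoutMillis: 30_000,      // return idle connections to the server
      connectionTimeoutMillis: 2_000, // fail fast instead of queueing forever
    });

    export async function runRead(sql: string, params: unknown[] = []) {
      // pool.query checks a connection out, runs the statement, and releases it.
      const result = await readPool.query(sql, params);
      return result.rows;
    }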

The combination of scaling, better handling of connections, and surrounding initiatives has made the platform as a whole more stable, reliable, and performant.

Performance issues

What happened: 

In addition to the above-mentioned issues, we had some performance issues with the GraphQL API, which center on two areas:

  • Performance of the datastore
  • Ingestion of updated data - when this is slow, it’s easily perceived as the GraphQL API being slow to update, because published changes are not immediately available

What we did:

Performance of the datastore was largely related to the initial cache strategy, as described in the previous sections. Removing a large number of lookups freed the datastore to focus on its main work: processing queries.

Additionally, we separated the reads and the writes so they are not performed against the same instance. This had a significant impact on performance and also left more connections available.
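
A small sketch of such a read/write split, with placeholder connection strings and pool sizes, might look like this:

    import { Pool } from "pg";

    // GraphQL reads go to a replica while ingestion writes go to the primary,
    // so heavy read traffic no longer competes with writes for connections.
    const writePool = new Pool({ connectionString: process.env.PRIMARY_URL, max: 10 });
    const readPool = new Pool({ connectionString: process.env.REPLICA_URL, max: 30 });

    export const ingest = (sql: string, params: unknown[] = []) => writePool.query(sql, params);
    export const read = (sql: string, params: unknown[] = []) => readPool.query(sql, params);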

For the ingestion of data, we initially did some locking in the datastore as metadata was updated, in part to ensure the “correctness” of the ETag in the database. As part of reworking our caching strategy, we moved the ETag out of the datastore and at the same time removed a lot of locks that were no longer necessary. We also revisited parts of the schema along with the locking, which means updates can now be ingested faster.
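
A hypothetical before/after sketch of the ingestion path (table names and the locking scheme are invented for the example) shows why dropping the coarse lock helps:

    import { Pool } from "pg";

    // Before: the schema update and the ETag update had to happen under a
    // coarse lock to keep the ETag in the database consistent.
    async function ingestWithLock(pool: Pool, alias: string, schema: string): Promise<void> {
      const client = await pool.connect();
      try {
        await client.query("BEGIN");
        await client.query("LOCK TABLE project_schema IN EXCLUSIVE MODE");
        await client.query("UPDATE project_schema SET schema = $2 WHERE alias = $1", [alias, schema]);
        await client.query("UPDATE project_state SET etag = now() WHERE alias = $1", [alias]);
        await client.query("COMMIT");
      } catch (err) {
        await client.query("ROLLBACK");
        throw err;
      } finally {
        client.release();
      }
    }

    // After: with the ETag held outside the datastore, the update no longer
    // needs the coarse lock, and invalidation becomes a cheap edge-store write.
    async function ingest(pool: Pool, alias: string, schema: string, bumpEtag: (alias: string) => Promise<void>): Promise<void> {
      await pool.query("UPDATE project_schema SET schema = $2 WHERE alias = $1", [alias, schema]);
      await bumpEtag(alias);
    }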

We continue to look for optimizations that enable faster inserts and queries in the datastore. Performance is front and center when it comes to our Content Delivery Platform, so it will be an ongoing focus to make tweaks and improve performance.

Join the Heartcore Status and Future webinar

In relation to the above, we’d like to invite you to a dedicated Heartcore webinar where CTO Filip Bech-Larsen, CEO Kim Sneum Madsen, and Heartcore team lead Morten Christensen will talk about the issues mentioned above and focus on what the future holds for Heartcore in order to make the product even more stable and attractive.

The webinar is free and will take place on Feb 7, 2022 at 3PM UTC. Sign up now: https://umbraco.zoom.us/webinar/register/WN_GI_fCIzqSkOhW1Zl3paz-A

Posted Jan 19, 2022 - 13:43 CET

Resolved
This incident has been resolved.
Posted Jan 04, 2022 - 20:24 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 04, 2022 - 18:47 CET
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 04, 2022 - 18:12 CET
This incident affected: Umbraco Heartcore (Umbraco Heartcore - API, Umbraco Heartcore Services).