Scaling Email Infrastructure for Medium Digest
How we increased our email capacity by 3.8x

Introduction
The Medium digest is a daily email containing a personalized list of stories that’s scheduled to arrive between 8 a.m. and 9 a.m. in the user’s local time. Timely delivery is important to us because we want our readers to start their days with the digest. Kind of like your morning newspaper!
We send millions of digests every day, and digest generation creates one of the heaviest loads on our resources. Making sure they continue to arrive on time as our reader base grows has been challenging. If there’s a bottleneck somewhere, digests are normally impacted first.
Our email infrastructure, built on top of Amazon’s Simple Queue Service (SQS), did a commendable job handling the growing volume, but we finally hit its limit toward the end of 2019.
In the following months, we worked on a series of improvements to our email infrastructure that enabled us to send 282% more digests. This post describes our journey from discovering the limit and iterating on improvements to finally bringing life back into our email infrastructure.
📭 Digest generation overview
Before we dive into the issues we encountered, let’s take a look at the lifecycle of a digest first.
In order for digests to arrive on time at 8 a.m. in the user’s local time, we trigger digest generation for users in a specific time zone six hours before the scheduled send time. For example, we start generating digests for Pacific time at 2 a.m. Pacific time.
We check which time zone to generate digests for every 10 minutes. If you’re wondering why we don’t check every hour, which is presumably when we go from one time zone to another, there are two reasons. First, not all time zones are one hour apart! Second and more importantly, we randomly divide users in each time zone into six 10-minute buckets. This way, each bucket contains fewer users, and we lose fewer digests if the generation for that bucket fails to start for whatever reason.
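To make the bucketing concrete, here is a minimal sketch of one way to assign users to 10-minute buckets; the hashing scheme is illustrative, not our exact implementation:

import hashlib

BUCKETS = 6  # one bucket per 10-minute slot in the hour

def bucket_for_user(user_id: str) -> int:
    # Hash the user id so the assignment is stable from one day to the next.
    return int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % BUCKETS

def is_users_turn(user_id: str, minute_of_hour: int) -> bool:
    # The 10-minute check picks up whichever bucket matches the current slot.
    return bucket_for_user(user_id) == minute_of_hour // 10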
With that said, let’s look at the different events involved in generating and sending a digest:
- Shard events: Every 10 minutes, a Jenkins job triggers the upstream shard event, which is then fanned out into 256 downstream shard events (sketched after this list). Each downstream shard event iterates through a shard of the user email table for the current time zone bucket and emits a digest generation event for each user in the shard.
- Generation events: Generation events do the heavy lifting of looking up the information needed for a digest.
- Send events: Once an email is generated, it’s passed to Sendgrid, the email delivery service we use, via a send event. Sendgrid takes care of actually delivering the email to the reader’s inbox.
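To make the fan-out concrete, here is a minimal sketch of what the upstream handler could look like; the queue URL and event shape are placeholders, not our exact implementation:

import json

import boto3

sqs = boto3.client("sqs")
NUM_SHARDS = 256
SHARD_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/digest-shard-events"

def handle_upstream_shard_event(time_zone: str, bucket: int) -> None:
    # Fan one upstream shard event out into 256 downstream shard events,
    # one per shard of the user email table.
    for shard_id in range(NUM_SHARDS):
        sqs.send_message(
            QueueUrl=SHARD_QUEUE_URL,
            MessageBody=json.dumps({
                "type": "digest.shard",
                "timeZone": time_zone,
                "bucket": bucket,
                "shard": shard_id,
            }),
        )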

All digest events were processed by a single queue dedicated to digests, so that digest events wouldn’t block other types of email events from being processed.
🛑 Digests reached processing limit
Around September to October 2019, we saw that the digest delivery time was getting delayed, and it was getting significantly worse every day.
We dug into it and found that the queue responsible for handling digest events was approaching its processing limit.
Normally, we’d have some downtime throughout the day (during time zones with fewer Medium users) when the queue processed below its maximum rate of 500 events per second. As digest send volume grew, that downtime started to disappear:

In other words, the number of events the queue had to process each day was approaching the maximum it could process at its top rate: at 500 events per second, that ceiling works out to roughly 43 million events per day.
If we went beyond that limit (which we did by the end of October 2019), the queue would remain backed up forever, because each day we’d be adding more events than we could process.
The most obvious solution was to bump up the processing rate, but due to digest’s scale, that would require us to significantly increase resources for underlying services, which is costly. We wanted to see if we could optimize our email infrastructure before shelling out the big bucks.
🛠️ Separating send events onto a separate queue
As described earlier, all digest events were processed by the same queue. The problem with this approach is that when the queue gets backed up, a generated digest can’t get passed on to Sendgrid right away because the send event gets added all the way to the back of the queue. (Although SQS doesn’t guarantee FIFO order, events still get processed roughly in the order they are put in.)
The first thing we did was to put send events onto their own queue. Since send events are fast compared to most other email events, this new queue is unlikely to get backed up, and a generated digest can be handed off to Sendgrid without further delay.
With this change, we got a decent amount of breathing room, as around 20% of the events on the original queue were send events.
We were off the hook for now, but this was a Band-Aid rather than a solution, as we were still close to our limit. In fact, with an influx of users in the first half of 2020, we reached the limit again by June 2020.
️️🛠 Reduce the number of events processed
An important best practice for email deliverability is to “sunset” inactive recipients, which means to only send emails to users who have been active with your products in the last few months. Following this rule, we only send digests to users who have either opened an email or visited Medium in the last 45 days.
Previously, we evaluated whether a user had been sunsetted inside the generation event itself. This meant we were processing one generation event for every user who had ever signed up for a Medium account.
However, most of the generation events we process don’t actually translate to send events. In fact, we only send emails to about 1/4 of all the users we try to generate digests for.
We check whether we should send each digest at the very beginning of the generation event, so the generation events that don’t become send events (presumably because those users have been sunsetted) are effectively no-ops.
Even though these no-op events are relatively fast, they can still cause the queue to get backed up, because we rate-limit each queue by the number of events per second. For example, even if 100% of the events in a batch are no-ops and we finish processing them in the first 100ms, we still wait out the full second before picking up the next batch of events.
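Our real worker is more involved, but a stripped-down sketch (receive_batch and handle here are placeholder callables) shows why no-ops still eat into the per-second budget:

import time
from typing import Callable

MAX_EVENTS_PER_SECOND = 500  # the queue’s rate limit at the time

def rate_limited_consume(receive_batch: Callable[[], list], handle: Callable[[dict], None]) -> None:
    # Simplified rate-limited consumer: each one-second window admits at most
    # MAX_EVENTS_PER_SECOND events, whether or not they turn out to be no-ops.
    while True:
        window_start = time.monotonic()
        processed = 0
        while processed < MAX_EVENTS_PER_SECOND:
            batch = receive_batch()
            if not batch:
                break
            for event in batch:
                handle(event)  # a no-op for sunsetted users, but it still counts
            processed += len(batch)
        # Sleep out the rest of the window: a batch of pure no-ops that finished
        # in the first 100ms still occupies the full second.
        time.sleep(max(0.0, 1.0 - (time.monotonic() - window_start)))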
Seeing this, the first thing we decided to do was to process events for active users only. We added a lastActiveAt field to our user email table and updated it every time a user becomes active. Then, when we query the table to decide who to generate a digest for, we only query for users who were active in the last 45 days according to the lastActiveAt field.
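A minimal sketch of the new filter, assuming rows from the user email table carry userId and lastActiveAt fields (in practice we push this filter down into the query itself):

from datetime import datetime, timedelta, timezone

SUNSET_WINDOW = timedelta(days=45)

def active_user_ids(user_email_rows, now=None):
    # Only users active in the last 45 days get a generation event at all.
    now = now or datetime.now(timezone.utc)
    cutoff = now - SUNSET_WINDOW
    for row in user_email_rows:
        last_active_at = row.get("lastActiveAt")
        if last_active_at is not None and last_active_at >= cutoff:
            yield row["userId"]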
This doesn’t sound too hard! We put the necessary changes behind a flag and started ramping up the percentage of events processed using the new strategy.
⚠️ Increased errors from bursty traffic
As we ramped to 50%, we started seeing a significant increase in errors from Rex, the recommendations service we rely on for personalizing stories in the digest.
Upon closer inspection, we realized that the request pattern from our offline-processing service to Rex had become more bursty, with short peaks every 10 minutes, which caused Rex to thrash as it continually scaled up and back down within each interval.
This is because all the no-op events for sunsetted users effectively served as a buffer and smoothed out the request pattern. Without these no-op events, every generation event resulted in an actual request to Rex, and Rex was doing poorly under this new pattern.
🛠 Spread generation events across each 10-min interval
We were able to spread out the generation events so that the request pattern to Rex is more even, reducing thrashing. The difference in the traffic pattern to Rex is shown below:


We accomplished this with two changes…
Randomized delays
SQS allows each event to be delayed up to 15 minutes. In order to evenly spread out generation events, we went with a naive approach of:
Function ShardEventHandler:
    For each user in shard:
        Emit generation event with a randomized delay
        within the next ten minutes
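In SQS terms, the randomized delay is just the per-message DelaySeconds parameter. A rough sketch using boto3 (the queue URL and event shape are placeholders, not our exact implementation):

import json
import random

import boto3

sqs = boto3.client("sqs")
GENERATION_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/digest-generation"

def emit_generation_event(user_id: str) -> None:
    # Delay each generation event by a random amount within the next ten
    # minutes so requests to Rex arrive evenly instead of in one burst.
    sqs.send_message(
        QueueUrl=GENERATION_QUEUE_URL,
        MessageBody=json.dumps({"type": "digest.generate", "userId": user_id}),
        DelaySeconds=random.randint(0, 600),  # SQS allows delays up to 900 seconds
    )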
Process shard events promptly by putting them on a separate queue
We deployed the randomized delays, but the request pattern was still uneven. This was because the approach assumed we were processing each batch of shard events every 10 minutes on the dot, when in fact they often got processed in bunches when the queue was backed up:

To resolve this, we separated the shard events onto a dedicated queue so they can be processed promptly.
The difference in how new events are added onto the queue is shown below:


🛠 Improve Rex scaling
Even though the traffic was now smoother, we were still seeing some Rex errors during times when we went from processing almost no generation events to processing at maximum rate (since we have consecutive time zones where one has very few users and a neighboring one has a lot of users).
This was because, without the no-op events acting as a buffer, we were still sending Rex far more requests per second than before. In fact, we were effectively tripling the number of requests to Rex, and Rex simply wasn’t able to scale up fast enough.
To alleviate this, we lowered the maximum processing rate from 500 to 250 events per second, since with far fewer events to process we no longer need the full 500. In addition, we gave Rex more resources and tuned its configuration so it autoscales better with the new request pattern.
🎉 Email infrastructure is healthy again
With all these changes, we were finally able to fully ramp up generating digests for active users only. We ended up reducing the number of generation events by 70%.
Even with this change, we noticed around 22.5% of all generation events are still no-op (that is, they don’t become send events). This is likely because email settings were turned off for those users, or the accounts may have been deleted or suspended.
In the end, comparing the number of digests we were sending when we hit the limit to the projected maximum we can send under the improved infrastructure (assuming 22.5% of all generation events continue to be no-ops), we can now send around 282% more digests than before. We don’t expect to hit the limit again for at least another few years.
🧐 Takeaways
Separation of concerns with queues
When working with queues, it’s generally a good idea to separate different types of events onto different queues so they can be managed separately. On the other hand, it’s also good to avoid overengineering and only split events out onto separate queues when a concrete need arises.
Predictable and even traffic patterns
Making traffic patterns as even as possible makes autoscaling the underlying services easier. We also found that more predictability about when new events come in (in our case, gained by moving shard events onto a separate queue) helps us make better decisions. It’s hard to adjust the traffic pattern when you don’t even know when new events will arrive.
Scaling services can be tricky
With microservices, you can’t assume services you rely on are ready when you are. Making sure they are configured properly to handle your load can be challenging.
The difficulty is compounded if your team doesn’t own those services. It can be hard to diagnose and fix issues in services you don’t own, as they can look wildly different from the services you’re familiar with.
Our team doesn’t own Rex, and our attempts to figure out how to scale Rex ourselves did not go well. We simply weren’t familiar enough with it.
In the end, we decided to ask the Recommendations team to review our plan and communicated with them closely throughout our ramp. With enough context, they were able to help us figure out how to scale Rex.
We learned that as much as we didn’t want to bother other people, working directly with the teams that own those services is the best way to make sure you don’t break them.
🌅 Future Considerations
Optimize for the remaining 22.5%
As noted above, around 22.5% of the generation events are still effectively no-ops. We can further reduce the number of events we need to process by recording email settings and user status (suspension, deletion) in the User Email table.
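If we recorded those fields, the active-user filter from earlier could be extended along these lines (a sketch only; digestEnabled and accountStatus are assumed field names, not our current schema):

from datetime import datetime, timedelta

SUNSET_WINDOW = timedelta(days=45)

def should_emit_generation_event(row: dict, now: datetime) -> bool:
    # Skip users who would be no-ops anyway: inactive, digest turned off,
    # or suspended/deleted accounts.
    last_active_at = row.get("lastActiveAt")
    return (
        last_active_at is not None
        and last_active_at >= now - SUNSET_WINDOW
        and row.get("digestEnabled", False)
        and row.get("accountStatus") == "active"
    )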
Increase processing rate
If we do reach the limit again, increasing processing rate back up to 500 (or even higher) is still an option. We will need to make sure Rex and its underlying services can handle the higher processing rate.
Improve digest to Rex connection
Part of the difficulty in bumping up the processing rate is that, even with events spread out within each 10-minute chunk, digest to Rex traffic can still jump from processing very little to processing a lot as we move across time zones, forcing Rex to scale up significantly in a short amount of time.
We can make the digest to Rex connection more stable either by increasing the processing rate slowly (for example, instead of going from 10 to 500 in one minute, go from 10 → 20 → 40 → 80 → … → 500 over a few minutes) or by allowing digest requests to be retried safely.
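As a sketch of the first option, a ramp schedule could simply double the rate each minute until it hits the target (the start, target, and doubling factor are illustrative, not a committed plan):

def ramp_schedule(start: int = 10, target: int = 500) -> list[int]:
    # Double the processing rate each step until we reach the target,
    # instead of jumping straight from ~10 to 500 events per second.
    rates = [start]
    while rates[-1] < target:
        rates.append(min(rates[-1] * 2, target))
    return rates

# e.g. ramp_schedule() -> [10, 20, 40, 80, 160, 320, 500]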