How to delete old data from DynamoDB without spending thousands
Like many companies, you probably have a few databases lying around that are starting to cost a lot in storage. You are paying more every week, and the longer you wait, the more money is spent for nothing. It's time for a cleanup.
The basic approach is to write a script that will scan your DB and delete the items that aren’t useful anymore (e.g., really old data).
Problem: that can be extremely expensive with DynamoDB, so expensive that it could take years to pay off…
Solution: One way to tackle this is to migrate all the data you need to a new table. From there, you can delete your old data altogether (which is free). This way, you pay only to migrate the data you need, which is usually a small percentage of your total table.
In this post I’ll do a small overview of how we did that at Medium and the tools we used to estimate the costs of the different scenarios.

First things first, clean up the incoming data
What happened to your data exactly? Why did this table grow out of proportion? Storage costs of “real” data such as user / user interactions or user / content interactions are usually reasonable.
In our case, we’ve been storing things that weren’t meant to be stored forever:
- feature logs: that’s data on the user and post state at the moment we recommend it to a user — Although this data is useful to retrain our algorithm, we don’t need to store this forever.
- impressions, i.e. posts we presented in a feed to a user — This is useful to make sure we don't show a user the same recommendations over and over again, but we don't need to store it for longer than a few months. One of the mistakes we made with this presentation data (meant to expire at some point) is that we stored it alongside persistent data in a UserPostRelations table. Items with different lifecycles are hard to maintain in the same table; we should have used two separate tables.
- emails sent — We store the email contents we send to users (e.g., the Medium Daily Digest). We use those to make sure we don't send the same stories several times to the same user, but again, we don't need to store that forever.
The first thing to do is to make sure that new incoming items will be removed after some time. Luckily, that's really easy to set up with DynamoDB: all you need to do is configure which attribute DynamoDB should use as the expiration date (TTL), and then set that attribute to the proper expiration timestamp whenever you insert or update items. DynamoDB automatically deletes expired items and doesn't charge you for those deletions.
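To make this concrete, here is a minimal sketch using Python and boto3 (the table name, key schema, and expiresAt attribute are illustrative, not our actual schema):

```python
import time

import boto3

dynamodb = boto3.client("dynamodb")

# Tell DynamoDB which attribute holds the expiration timestamp (epoch seconds).
dynamodb.update_time_to_live(
    TableName="UserPostRelations",  # hypothetical table name
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expiresAt"},
)

# From now on, set that attribute when writing items, e.g. expire after 90 days.
# DynamoDB deletes the item after that date, at no charge.
table = boto3.resource("dynamodb").Table("UserPostRelations")
table.put_item(
    Item={
        "userId": "user-123",
        "postId": "post-456",
        "expiresAt": int(time.time()) + 90 * 24 * 3600,
    }
)
```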
Now that the incoming data is handled, we can look into ways to clean up the existing items in the table.
Get rid of your bad items: Cost estimations
We have three possible paths forward here:
- do nothing and let the table grow indefinitely (along with your costs!)
- delete the “bad” items
- migrate the “good” items to a new table and then delete the old table
To understand these options more fully, let's do some cost estimations using one of our Dynamo tables at Medium as an example. This table is pretty massive, with 105 billion items and an average item size of 100 bytes. Since we're concerned about short-term spend as well as long-term, let's estimate total expenses after 6 months and after 3 years.
For this table, 93% of the items are not useful anymore and can be deleted. We were able to estimate that because this table is exported to Snowflake for analytics purposes.
Scenario 1: Do nothing:
In this scenario, your storage costs should remain more or less the same over time. Thanks to the automatic item expiration described above, new items are now “clean” and will expire after some time, but old items without an expiration date will stay forever.
To view the storage costs, we use the AWS Cost Explorer tool. You can get a view by month and group costs by "Usage Type." We added tags to our Dynamo tables from the AWS console; this way we can filter by table in Cost Explorer.

As you can see, we're currently spending $7.2k every month in storage costs ("TimedStorage-ByteHrs", which is regular storage, as well as "Point In Time Recovery" costs). In this scenario, after 6 months, we will have spent $43k. After 3 years, the amount spent jumps to $259k.
Scenario 2: Delete “bad” items:
This is the brute force method. You need to scan your database and delete each "bad" item, but be aware: Dynamo will charge you for each read and write in this case. To avoid scanning the whole database, you can either figure out the right global secondary index to set up on your table, or you can use your analytics data (at Medium, most of our Dynamo tables are exported to Snowflake).
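In its simplest form, the cleanup script looks something like the sketch below (Python and boto3, with hypothetical key and attribute names, and a made-up "older than ~6 months" rule for what counts as a bad item; a real script would also need rate limiting and error handling):

```python
import time

import boto3

TABLE_NAME = "UserPostRelations"  # hypothetical table name
CUTOFF = int(time.time()) - 180 * 24 * 3600  # items older than ~6 months are "bad"

table = boto3.resource("dynamodb").Table(TABLE_NAME)

scan_kwargs = {
    # Only return the attributes we need; the scan is billed on the data read
    # either way, but smaller responses keep the script fast.
    "ProjectionExpression": "userId, postId, createdAt",
}

deleted = 0
while True:
    page = table.scan(**scan_kwargs)
    # batch_writer groups deletes into BatchWriteItem calls of up to 25 items.
    with table.batch_writer() as batch:
        for item in page["Items"]:
            if item["createdAt"] < CUTOFF:
                batch.delete_item(Key={"userId": item["userId"], "postId": item["postId"]})
                deleted += 1
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

print(f"Deleted {deleted} items")
```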
To estimate the costs, you can use the AWS Pricing Calculator: https://calculator.aws/#/addService
For this table, I found that it would cost around $2k to scan the table and $19k to delete the bad items. That's assuming all the best optimisations: the tables should be provisioned beforehand and the traffic should be stable throughout the operation.
In this scenario, we pay $21k up front — but then, since we deleted 93% of our table, we'll be paying much less in storage after the cleanup. If we assume the good and bad items have the same average size, our new storage cost is roughly 7% of what we had previously. Accounting for everything, we would have spent $23k after 6 months. After 3 years, we will have spent $39k.
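As a sanity check, you can reproduce those totals (and the ones from scenario 1) from the monthly figures alone; rounding aside, this back-of-the-envelope Python lands within about $1k of the numbers above:

```python
# Back-of-the-envelope comparison using the figures from this post (all in $k).
monthly_storage = 7.2                            # current storage + PITR cost per month
cleanup_cost = 2 + 19                            # scan + delete the "bad" 93% of items
monthly_after_cleanup = monthly_storage * 0.07   # only 7% of the items remain

def do_nothing(months):
    return monthly_storage * months

def delete_bad_items(months):
    return cleanup_cost + monthly_after_cleanup * months

for months in (6, 36):
    print(f"{months} months: do nothing ≈ ${do_nothing(months):.0f}k, "
          f"delete bad items ≈ ${delete_bad_items(months):.0f}k")
# 6 months: do nothing ≈ $43k, delete bad items ≈ $24k
# 36 months: do nothing ≈ $259k, delete bad items ≈ $39k
```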
Scenario 3: Migrate only the “good” items:
This option is the most complex but is also the cheapest. In our use-case, we still have some incoming data and we can’t allow our services to be down at any time. This means we need to do a “hot” migration.
We do this in 6 steps:
- create a new table (let's call them `Table` and `TableV2`)
- change the logic so that all insertions and updates now update both `Table` and `TableV2`. We're now in a double-write situation.
- backfill all the good items from `Table` to `TableV2`. For that you'll need to perform a scan on `Table` and an insertion in `TableV2` for each good item (a sketch of this backfill follows the list)
- you can then switch all read traffic to `TableV2`
- stop writing updates to `Table`
- delete `Table`
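Step 3 is the only part that touches every item, so here is a rough sketch of what the backfill could look like, again with boto3 and the same hypothetical key and attribute names as above; a production version would want parallel scans (the Segment/TotalSegments scan parameters) and retries:

```python
import time

import boto3

dynamodb = boto3.resource("dynamodb")
old_table = dynamodb.Table("Table")    # names taken from the steps above
new_table = dynamodb.Table("TableV2")

CUTOFF = int(time.time()) - 180 * 24 * 3600  # same hypothetical "good item" rule as before

scan_kwargs = {}
while True:
    page = old_table.scan(**scan_kwargs)
    # batch_writer groups puts into BatchWriteItem calls; we only pay write
    # costs for the ~7% of items we actually keep.
    with new_table.batch_writer() as batch:
        for item in page["Items"]:
            if item["createdAt"] >= CUTOFF:  # keep only the "good" items
                batch.put_item(Item=item)
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```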
That option is really cheap: in our case we'd pay $2k for scanning `Table` and then $1.4k to insert the "good" items into `TableV2`.
After that, we continue to pay storage costs, but only for the good items (just like in Scenario 2).
Conclusion:

For this particular table, the "delete bad items" option is viable. It costs a bit more in hosting but requires less engineering time and is less risky. Again, that really depends on your use case. We've had use cases where a table was used all across our codebase, with high traffic, and any downtime would severely degrade our product; that makes a migration much harder and riskier. In other use cases, the table is used only by an isolated admin tool and accidental downtime is more acceptable, which makes migrating to a new table much easier.
We've had other examples where the "delete bad items" option was not viable, as we would only have seen a return on investment after two years. There, the only realistic option was to perform a more difficult migration to a new table.
You should try to reproduce these estimations with your own use case before you start any cleanup work. For the migrations we did, the cost estimations we made beforehand turned out to be accurate, and the savings matched what we had predicted.