Incident postmortem 2021-08-03

2021-08-03 19:13

Incident summary

On 2021-08-03 at 2:35pm PST, one of our engineers ran a faulty database migration that caused discrepancies between the indexed data and the source of truth in the database. The incident affected 39 users. We backfilled the data and resolved the issue at 4:46pm PST.

Impact

The 39 affected users may have lost data added between 2021-08-03 2:35pm PST and 4:46pm PST. We backfilled the data but cannot guarantee 100% recovery.
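As an illustration of what a backfill of this kind involves, the sketch below compares source-of-truth records with indexed records and copies over anything missing or stale. It is not the script we ran: the function names and the dict-based stand-ins for the database and the index are assumptions made for the example.

```python
def find_index_discrepancies(db_rows, index_rows):
    """Compare source-of-truth rows with indexed rows (both dicts of id -> value).

    Hypothetical helper: real code would page through the database and the
    search index instead of holding everything in memory.
    """
    missing = sorted(k for k in db_rows if k not in index_rows)
    stale = sorted(
        k for k in db_rows if k in index_rows and index_rows[k] != db_rows[k]
    )
    orphaned = sorted(k for k in index_rows if k not in db_rows)
    return {"missing": missing, "stale": stale, "orphaned": orphaned}


def backfill_index(db_rows, index_rows):
    """Re-index missing and stale rows from the source of truth.

    Orphaned index entries are only reported, not deleted, so a person can
    review them before anything is removed.
    """
    report = find_index_discrepancies(db_rows, index_rows)
    for key in report["missing"] + report["stale"]:
        index_rows[key] = db_rows[key]
    return report


if __name__ == "__main__":
    db = {1: "a", 2: "b", 3: "c"}
    index = {1: "a", 2: "x", 4: "d"}
    print(backfill_index(db, index))
```

A pass like this can only restore what actually reached the database, which is why full recovery cannot be guaranteed.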

Root cause

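The root cause was a new database migration that reorganized the file structure in preparation for the Dropbox integration. The migration ran while the site was still serving traffic, and we underestimated the number of users visiting the service during the deployment, so data was being inserted while the migration was in progress.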

Lessons learned

In similar situations in the future, we will:

  1. Be more cautious with database migrations, keeping in mind that data may be inserted while a migration is running.
  2. Put the site into maintenance mode when we need to stop all traffic and avoid race conditions.
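A minimal sketch of the maintenance-mode idea, assuming a single-process service: writes are rejected with a 503 while a migration is running, and the gate waits for in-flight writes to finish before the migration starts. The class and method names (MaintenanceGate, handle_write) are hypothetical and not part of our codebase. With several app servers, the flag would need to live in shared storage instead.

```python
import threading
from contextlib import contextmanager


class MaintenanceGate:
    """Hypothetical in-process gate that blocks writes during a migration."""

    def __init__(self):
        self._lock = threading.Lock()
        self._enabled = False

    @contextmanager
    def maintenance(self):
        # Taking the lock waits for any write currently inside handle_write.
        with self._lock:
            self._enabled = True
        try:
            yield
        finally:
            # Always reopen the site, even if the migration raises.
            with self._lock:
                self._enabled = False

    def handle_write(self, store, key, value):
        # The check and the write happen under one lock, so a write cannot
        # slip in between "maintenance enabled" and "migration started".
        with self._lock:
            if self._enabled:
                return 503  # Service Unavailable
            store[key] = value
            return 200


if __name__ == "__main__":
    gate = MaintenanceGate()
    store = {}
    print(gate.handle_write(store, "before", 1))
    with gate.maintenance():
        # The migration would run here; writes are refused meanwhile.
        print(gate.handle_write(store, "during", 2))
    print(gate.handle_write(store, "after", 3))
```

Refusing writes outright trades a short window of unavailability for the guarantee that no insertion races with the migration.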