An old postmortem

A few years ago, I was the hands-on CTO of a pre-seed startup.

I had set up a CI/CD pipeline that ran Terraform to update our AWS infrastructure and built and published the frontend and backend artifacts, all in 17 minutes.

We had two environments: dev for development, serving as a test reference, and prod for (the first few) users and demos.

At ~1pm, I started renaming/moving resources locally, including the AWS Cognito pool, a managed user-management service.

At ~2.30pm, one of the frontend developers pushed a "quick commit", updating the color of two buttons, directly to the master branch.

At 2.44pm, terraform apply ran and wiped out the pool: comparing the shared state, which I had already started to modify, against the original definition, Terraform dropped the pool I had renamed locally and tried to create a new one.
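
For readers less familiar with Terraform, this is what makes a rename destructive: Terraform compares the addresses in the configuration against its state, so a renamed resource looks like one resource to destroy and a new one to create. Here is a minimal sketch of the safe way to do it, with illustrative names rather than our actual configuration:

    # Before the refactoring, the pool was declared as:
    #   resource "aws_cognito_user_pool" "users" { ... }
    #
    # After the rename, only the new address exists in the configuration:
    resource "aws_cognito_user_pool" "main" {
      name = "app-users" # hypothetical name
    }

    # Without the block below, Terraform plans "destroy users, create main".
    # A moved block (Terraform >= 1.1) or `terraform state mv` records the
    # rename so the existing pool and its users are kept.
    moved {
      from = aws_cognito_user_pool.users
      to   = aws_cognito_user_pool.main
    }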

At 2.47pm, I had an unusual phone call from my COO:

  • Hello, I cannot use dev anymore, is that normal?
  • Let me check... nope, it seems we have an issue, the CI/CD broke the infrastructure
  • Can you restore it?
  • Sure, but I still need to finish my task, it's structural, I should be done by 5pm
  • No way! We have a demo with a potential investor at 3.30pm
  • You are supposed to use prod for that
  • We have been using dev for months, and it was okay until today
  • We do not guarantee stability on dev, it's a sandbox for the development team. I can upgrade prod in 20 minutes, it is only a few days behind, missing a number of bug fixes and features
  • That is not acceptable, we do not want to publish features to new users
  • Okay, I'll handle it, but it will take longer
  • Hurry up! I'm bringing everyone into the call

He brought everyone into the call: "Gautier is working on the infrastructure, which was broken by the CI/CD; it should work for the investor demo in 30 minutes".

It's 2.55pm, there are six of us in the call, everyone else staying quiet.

I had to stage my changes (we were using git), check out master, and run terraform apply to get back to a similar infrastructure state; I also had to manually drop a few resources left over from the failed CI/CD run.

Once that was done, I had to take some values from the terraform apply output and hard-code them in the frontend/backend so they could use the new AWS Cognito pool.
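
For context, those values are identifiers Terraform only knows after an apply, typically exposed as outputs. A minimal sketch, with illustrative names rather than our actual configuration:

    resource "aws_cognito_user_pool" "main" {
      name = "app-users" # hypothetical name
    }

    resource "aws_cognito_user_pool_client" "web" {
      name         = "web-client" # hypothetical name
      user_pool_id = aws_cognito_user_pool.main.id
    }

    # The frontend/backend need these two values to talk to Cognito;
    # at the time, I copied them from the apply output by hand.
    output "cognito_user_pool_id" {
      value = aws_cognito_user_pool.main.id
    }

    output "cognito_user_pool_client_id" {
      value = aws_cognito_user_pool_client.web.id
    }

Reading these outputs and injecting them as settings/environment variables is roughly what the "hard-coded identifiers" mitigation further down replaced the copy-paste with.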

I created a commit and pushed it to the repository at 3.08pm.

At this point, I broke the silence: "I have made and pushed the fix, it should be ready in 20 minutes".

Then, one of my coworkers started to argue:

  • You know you could run it on your powerful computer, it would be faster
  • It's not that simple, I would have to run each step locally
  • I do not see why, there are not that many
  • I would have to isolate my environment, fetch all the keys, and copy and paste every step, there is a greater chance of failure
  • I would do it

And then I waited, waited, and waited, until I realized that one of the frontend developers had taken over the GitHub Actions runner, pushing three jobs between 2.58pm and 3.06pm.

I had to cancel them; my CI/CD pipeline started at 3.11pm and completed at 3.28pm.

Then I stated that everything should be ready; my COO created a new account, set everything up for his demo, and jumped into the call at 3.35pm without a goodbye.

Let's do a quick recap:

  • Good
    • The recovery time was decent given the root cause
    • Everything ran smoothly, no additional problems
    • The CI/CD pipeline was robust
  • Bad
    • Lack of leadership on my part in restricting dev environment access to the development team
    • No communication from me while I was fixing the problem (it could have prevented my coworker from pushing commits and further delaying the recovery, though he could have guessed that he should avoid operations involving the CI/CD or the infrastructure)
    • No way to speed up the CI/CD pipeline during emergencies
    • Putting everyone in a call while only 1-2 people were actually working

I requested a full-company postmortem, but it was declined by the CEO and the COO.

I only ran it within the development team.

We identified the following root causes and mitigations:

  • dev environment usage by people outside the development team
    • Using only prod environment => rejected by the CEO & COO
    • Adding a staging environment => rejected by the CEO & COO
  • Concurrent usage of Terraform
    • Adding an approval step in the CI/CD => rejected by me, code reviews are already used for approvals, and I did not find a way to limit it to "regular" changes
    • Using the Terraform state lock => it was already set up, but I forgot to lock when I started working (sketched after this list)
  • Unprotected Terraform resource definitions
    • Enabling destroy protection on critical resources => done (see the sketch after this list)
  • Hard-coded identifiers
    • Injecting them through settings/environment variables => done
  • Lack of communication
    • Explaining the plan and actions => never done, it is a waste of time to explain both the goal and the actions to the people performing them
    • Having someone from the development team handle communication
  • Direct pushes to master
    • We already had policies against it, but it was a "special" CEO request
  • No shared knowledge
    • Onboarding people on the CI/CD => never done, not enough time
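
For the two Terraform-level mitigations referenced above, here is a minimal sketch of what they look like; the bucket, table, and resource names are illustrative, not our actual configuration:

    terraform {
      # State locking: with an S3 backend, a DynamoDB table makes concurrent
      # plans/applies wait for (or fail on) the lock instead of racing each other.
      backend "s3" {
        bucket         = "example-terraform-state" # hypothetical bucket
        key            = "dev/terraform.tfstate"
        region         = "eu-west-1"
        dynamodb_table = "example-terraform-locks" # hypothetical lock table
      }
    }

    resource "aws_cognito_user_pool" "main" {
      name = "app-users" # hypothetical name

      # Destroy protection: any plan that would delete the pool now errors out
      # instead of silently dropping every user.
      lifecycle {
        prevent_destroy = true
      }
    }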

Any one of these mitigations would have been enough to prevent the outage.

Additionally, the CEO and COO came with their own requests:

  • Do not break the dev environment => dismissed, we cannot offer this guarantee
  • Ask before releasing on dev => dismissed, it would have slowed us down
  • Plan releases => dismissed, not only is the SDLC unpredictable, but they also have unplanned demos, or do not notify us about them

Fortunately, it has not happened a second time; the mitigations were enough.