An old postmortem

A few years ago, I was the hands-on CTO of a pre-seed startup.

I had set up a CI/CD pipeline that ran Terraform to update our AWS infrastructure and built and published the frontend and backend artifacts, all in 17 minutes.

We had two environments: dev for development, serving as a test reference, and prod for (the first few) users and demos.

At ~1pm, I started renaming/moving resources locally, including the AWS Cognito pool, a managed user-management service.

At ~2.30pm, one of the frontend developers pushed a "quick commit", updating the color of two buttons, directly to the master branch.

At 2.44pm, terraform apply ran and wiped out the pool: comparing the shared state, which I had already started to modify, against the original definition, Terraform dropped the pool I had renamed locally and tried to create a new one.
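
For readers less familiar with Terraform, this is what makes a rename destructive: Terraform compares the addresses in the configuration against its state, so a renamed resource looks like one resource to destroy and a new one to create. Here is a minimal sketch of the safe way to do it, with illustrative names rather than our actual configuration:

    # Before the refactoring, the pool was declared as:
    #   resource "aws_cognito_user_pool" "users" { ... }
    #
    # After the rename, only the new address exists in the configuration:
    resource "aws_cognito_user_pool" "main" {
      name = "app-users" # hypothetical name
    }

    # Without the block below, Terraform plans "destroy users, create main".
    # A moved block (Terraform >= 1.1) or `terraform state mv` records the
    # rename so the existing pool and its users are kept.
    moved {
      from = aws_cognito_user_pool.users
      to   = aws_cognito_user_pool.main
    }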

At 2.47pm, I had an unusual phone call from my COO:

  • Hello, I cannot use dev anymore, is that normal?
  • Let me check... nope, it seems we have an issue, the CI/CD broke the infrastructure
  • Can you restore it?
  • Sure, but I still need to finish my task, it's structural, I should be done by 5pm
  • No way! We have a demo with a potential investor at 3.30pm
  • You are supposed to use prod for that
  • We have been using dev for months, and it was okay until today
  • We do not guarantee stability on dev, it's a sandbox for the development team. I can upgrade prod in 20 minutes, it is only a few days behind, missing a number of bug fixes and features
  • That is not acceptable, we do not want to publish features to new users
  • Okay, I'll handle it, but it will take longer
  • Hurry up! I'm bringing everyone into the call

He brought everyone into the call: "Gautier is working on the infrastructure, which was broken by the CI/CD; it should work for the investor demo in 30 minutes".

It's 2.55pm, there are six of us in the call, everyone else staying quiet.

I had to stage my changes (we were using git), check out master, and run terraform apply to get back to a similar infrastructure state; I also had to manually drop a few resources left over from the failed CI/CD run.

Once that was done, I had to take some values from the terraform apply output and hard-code them in the frontend/backend so they could use the new AWS Cognito pool.
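
For context, those values are identifiers Terraform only knows after an apply, typically exposed as outputs. A minimal sketch, with illustrative names rather than our actual configuration:

    resource "aws_cognito_user_pool" "main" {
      name = "app-users" # hypothetical name
    }

    resource "aws_cognito_user_pool_client" "web" {
      name         = "web-client" # hypothetical name
      user_pool_id = aws_cognito_user_pool.main.id
    }

    # The frontend/backend need these two values to talk to Cognito;
    # at the time, I copied them from the apply output by hand.
    output "cognito_user_pool_id" {
      value = aws_cognito_user_pool.main.id
    }

    output "cognito_user_pool_client_id" {
      value = aws_cognito_user_pool_client.web.id
    }

Reading these outputs and injecting them as settings/environment variables is roughly what the "hard-coded identifiers" mitigation further down replaced the copy-paste with.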

I created a commit and pushed it to the repository at 3.08pm.

At this point, I broke the silence: "I have made and pushed the fix, it should be ready in 20 minutes".

Then, one of my coworkers started to argue:

  • You know you could run it on your powerful computer, it would be faster
  • It's not that simple, I would have to run each step locally
  • I do not see why, there are not that many
  • I would have to isolate my environment, fetch all the keys, and copy and paste every step, there is a greater chance of failure
  • I would do it

And then I waited, waited, and waited, until I realized that one of the frontend developers had taken over the GitHub Actions runner, pushing three jobs between 2.58pm and 3.06pm.

I had to cancel them; my CI/CD pipeline started at 3.11pm and completed at 3.28pm.

Then I stated that everything should be ready; my COO created a new account, set everything up for his demo, and jumped into the call at 3.35pm without a goodbye.

Let's do a quick recap:

  • Good
    • The recovery time was decent given the root cause
    • Everything ran smoothly, no additional problems
    • The CI/CD pipeline was robust
  • Bad
    • Lack of leadership on my part in restricting dev environment access to the development team
    • No communication from me while I was fixing the problem (it could have prevented my coworker from pushing commits and further delaying the recovery, though he could have guessed that he should avoid operations involving the CI/CD or the infrastructure)
    • No way to speed up the CI/CD pipeline during emergencies
    • Putting everyone in a call while only 1-2 people were actually working

I requested a full-company postmortem, but it was declined by the CEO and the COO.

I only ran it within the development team.

We identified the following root causes and mitigations:

  • dev environment usage by people outside the development team
    • Using only prod environment => rejected by the CEO & COO
    • Adding a staging environment => rejected by the CEO & COO
  • Concurrent usage of Terraform
    • Adding an approval step in the CI/CD => rejected by me, code reviews are already used for approvals, and I did not find a way to limit it to "regular" changes
    • Using the Terraform state lock => it was already set up, but I forgot to lock when I started working (sketched after this list)
  • Unprotected Terraform resource definitions
    • Enabling destroy protection on critical resources => done (see the sketch after this list)
  • Hard-coded identifiers
    • Injecting them through settings/environment variables => done
  • Lack of communication
    • Explaining the plan and actions => never done, it is a waste of time to explain both the goal and the actions to the people performing them
    • Having someone from the development team handle communication
  • Direct pushes to master
    • We already had policies against it, but it was a "special" CEO request
  • No shared knowledge
    • Onboarding people on the CI/CD => never done, not enough time
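
For the two Terraform-level mitigations referenced above, here is a minimal sketch of what they look like; the bucket, table, and resource names are illustrative, not our actual configuration:

    terraform {
      # State locking: with an S3 backend, a DynamoDB table makes concurrent
      # plans/applies wait for (or fail on) the lock instead of racing each other.
      backend "s3" {
        bucket         = "example-terraform-state" # hypothetical bucket
        key            = "dev/terraform.tfstate"
        region         = "eu-west-1"
        dynamodb_table = "example-terraform-locks" # hypothetical lock table
      }
    }

    resource "aws_cognito_user_pool" "main" {
      name = "app-users" # hypothetical name

      # Destroy protection: any plan that would delete the pool now errors out
      # instead of silently dropping every user.
      lifecycle {
        prevent_destroy = true
      }
    }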

Any one of these mitigations would have been enough to prevent the outage.

Additionally, the CEO and COO came with their own requests:

  • Do not break the dev environment => dismissed, we cannot offer this guarantee
  • Ask before releasing on dev => dismissed, it would have slowed us down
  • Plan releases => dismissed, not only is the SDLC unpredictable, but they also have unplanned demos, or do not notify us about them

Fortunately, it has not happened a second time; the mitigations were enough.