An old postmortem
A few years ago, I was the hands-on CTO of a pre-seed startup.
I had set up a CI/CD pipeline that used Terraform to update our AWS infrastructure and built and published the frontend and backend artifacts, all in 17 minutes.
We had two environments: dev for development, as a test reference, and prod for (the first few) users and demos.
At ~1pm, I started to rename/move resources locally, including the AWS Cognito pool, which is AWS's user management solution.
At ~2.30pm, one of the frontend developers pushed a "quick commit", changing the color of two buttons, directly to the master branch.
At 2.44pm, terraform apply was run, wiping out the pool: comparing its state (the shared one I had started to modify) against the original definition on master, Terraform dropped the pool I had renamed locally and tried to create a new one.
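To make the failure mode concrete, here is a minimal, hypothetical sketch (resource and attribute names are illustrative, not our actual configuration). Terraform identifies a resource by its address in the configuration; when the shared state and the configuration being applied disagree on that address, the plan becomes a destroy plus a create instead of a no-op:

```hcl
# What master still declared at 2.44pm: the pool at its original address.
resource "aws_cognito_user_pool" "main" {
  name = "app-users"
}

# What the shared state already contained after my local rename: the same
# pool, now tracked under a new address such as aws_cognito_user_pool.users.
# Applying master against that state therefore plans roughly:
#   - destroy aws_cognito_user_pool.users  (in state, absent from master)
#   + create  aws_cognito_user_pool.main   (in master, absent from state)
# and every account stored in the pool disappears with the destroyed resource.
# (On Terraform 1.1+, committing a `moved` block along with the rename
# declares the move explicitly and avoids the destroy/create pair.)
```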
At 2.47pm, I had an unusual phone call from my COO:
- Hello, I cannot use dev anymore, is that normal?
- Let me check... nope, it seems we have an issue: the CI/CD broke the infrastructure
- Can you restore it?
- Sure, but I still need to finish my task first, it's structural; I should be done by 5pm
- No way! We have a demo with a potential investor at 3.30pm
- You are supposed to use prod for that
- We have been using dev for months, and it was okay until today
- We do not guarantee stability on dev, it's a sandbox for the development team. I can upgrade prod in 20 minutes, it is only a few days behind, missing a few bug fixes and features
- That is not acceptable, we do not want to push those features to users
- Okay, I'll handle it, but it will take longer
- Hurry up! I am bringing everyone into the call
He brought everyone into the call: "Gautier is working on the infrastructure broken by the CI/CD, it should be working for the investor demo in 30 minutes".
By 2.55pm, there were six of us in the call, everyone else staying quiet.
I had to stash my changes (we were using git), check out master, and run terraform apply to get back to a matching infrastructure state; I also had to manually drop a few leftover resources from the failed CI/CD run.
Once done, I had to grab a few values from the terraform apply output and hard-code them in the frontend/backend so they could reach the new AWS Cognito pool.
I created a commit and pushed it to the repository at 3.08pm.
At this point, I broke the silence: "I have made and pushed the fix, it should be ready in 20 minutes".
Then one of my coworkers chimed in:
- You know you can run it on your powerful computer, it'll be faster
- It's not that simple, I would have to run each step locally
- I do not see why, there are not that many
- I would have to isolate my environment, fetch all the keys, and copy and paste every step; it is more likely to fail
- I would do it
And then I waited, waited, and waited, until I realized that one of the frontend developers had taken the GitHub Actions runner, pushing three jobs between 2.58pm and 3.06pm.
I had to cancel them; my CI/CD pipeline started at 3.11pm and completed at 3.28pm.
Then I announced that everything should be ready; my COO created a new account, set everything up for his demo, and jumped into the call at 3.35pm without a goodbye.
Let's do a quick recap:
- Good
- The recovery time was decent given the root cause
- Everything ran smoothly, no additional problems
- The CI/CD pipeline was robust
- Bad
- Lack of leadership on my part in limiting dev environment access to the development team
- No communication from me during the problem-solving (it could have prevented my coworker from pushing commits and further delaying the recovery, though he could have guessed he should avoid operations involving the CI/CD or the infrastructure)
- No way to speed up the CI/CD pipeline during emergencies
- Pulling everyone into a call while only 1-2 people were actually working
I requested a full-company postmortem, but it was declined by the CEO and the COO.
I only ran it within the development team.
We came up with the following root causes and mitigations:
- dev environment usage by people outside the development team
- Using only the prod environment => rejected by the CEO & COO
- Adding a staging environment => rejected by the CEO & COO
- Concurrent usage of Terraform
- Adding an approval step in the CI/CD => rejected by me; code reviews are already used for approvals, and I did not find a way to limit it to "regular" changes
- Using the Terraform lock => it was already set up, but I forgot to take the lock when I started working
- Unprotected Terraform resource definitions
- Enable resource drop protection => done (see the sketch after this list)
- Hard-coded identifiers
- Injecting them through settings/environment variables => done (also shown in the sketch below)
- Lack of communication
- Explaining the plan and the actions => never done; it is a waste of time when both the goal and the actions are already clear to the people performing them
- Having someone from the development team handle the communication
- Pushing directly to master
- We already had policies against it, but it was a "special" CEO request
- No shared knowledge
- Onboard people on the CI/CD => never done, not enough time
Any one of these mitigations would have been enough to prevent the outage.
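For the two mitigations marked done, here is a minimal sketch of what they look like in Terraform, assuming illustrative names and a single web client rather than our actual configuration:

```hcl
# Drop protection: any plan that would destroy the user pool now fails
# instead of silently wiping every account.
resource "aws_cognito_user_pool" "users" {
  name = "app-users"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_cognito_user_pool_client" "web" {
  name         = "web-client"
  user_pool_id = aws_cognito_user_pool.users.id
}

# The previously hard-coded identifiers, exposed as outputs; the CI/CD reads
# them (e.g. terraform output -raw cognito_user_pool_id) and injects them into
# the frontend/backend as settings/environment variables at build time.
output "cognito_user_pool_id" {
  value = aws_cognito_user_pool.users.id
}

output "cognito_user_pool_client_id" {
  value = aws_cognito_user_pool_client.web.id
}
```

The outputs also remove the copy-paste step from the recovery: the pipeline reads them instead of someone editing frontend/backend code by hand.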
Additionally, the CEO and COO came with their own requests:
- Do not break the dev environment => dismissed, we cannot offer this guarantee
- Ask before releasing on dev => dismissed, it would have slowed us down
- Plan releases => dismissed; not only is the SDLC not predictable, but they also hold unplanned demos, or do not notify us about them
Fortunately, it has not happened a second time; the mitigations were enough.