Architecture Heuristic YNIA: Queues

Gautier DI FOLCO October 18, 2023 [Software engineering] #architecture #design #heuristics

There is a famous acronym I use over and over again when it comes to design decision: YAGNI.

It's a constant reminder to keep things as simple as possible, however we tend to forget that we should over simplify.

That's one of the reason I tend to introduce early on components which seems to be more complex than they need to be.

I call it YNIA (You'll Need It Anyway).

Today: queues.

They are many cases when you add accidental complexity by technically coupling things which are not coupled in terms of business:

fetching distant (on unmanaged services) some information
performing long-running operations
acting on distant services
running not business-related logic inside or around business logic

There are two basic ways to deal with it: synchronously or through some kind of local-asynchronous code (e.g. callbacks, threads).

These are not solutions:

no traceability
not fault-tolerant
not restart tolerant
hard to debug
hard to evenly scale
no observability

To give you an idea, last time I have encountered such implementation, we were regularly loosing jobs each time our pods (on Kubernetes) were restarted (on failure or on update). Which was hard to debug as we had to correlate failure of not tracked side-processes with random restarts.

Adding a queue gives a way to temporally decouple these computations, which allow to incrementally add:

a way to monitor each job (current status, history)
add restart/fault-tolerance (add error detection and mitigation, such as retry)
handle back-pressure
add observability (throughput, latency, traceability with correlation identifiers)