Part 2: How randomness works in software projects
So, hello again. This is the second part to the theory of constraints stories I have to tell. This time I am going to talk about randomness, and how it affects the best laid plans. Why software projects are routinely late, why engineers get burned out, and why certain project management practices fail every other organisation despite everyone’s best efforts.
Again, a trigger warning. Considering how ubiquitous SCRUM/Kanban processes are in software engineering, there is a very good chance that I’m going to say things that might make you feel uncomfortable about whatever it is you’re preaching at the moment in your current organisation. Just do a breathing exercise and remember that the fact that it feels weird in your tummy is a sign that you still care. And that is a good thing. You don’t have to agree with me, listening is good enough.
And here is the link to the first article if you haven’t read it yet:
Part 1: Software company in accordance with the theory of constraints
Welcome to randomness
Randomness is a really weird thing. Our monkey brains are seriously ill-equipped to deal with the challenges it throws at us. I remember when I was a uni student, I had this amazing mathematics professor who taught probability and statistics. And I remember a room full of very cocky theoretical physics students who, despite all their swollen brains, couldn’t give good answers to any of the seemingly simple questions he asked.
It was a bewildering, and also humbling, experience. And there is a very important thing I learned that year. It was the realisation that everything we know, even the so-called exact sciences, is the result of statistical analysis. Physics is, largely, really just systems thinking applied to the very messy reality of the outside world.
The other weird thing that I’ve learned by now is that software engineering is not that much different. Despite us dealing with boolean logic all day long, software development is also mostly driven by randomness, and you are faced with the same options: either give up and let it roll on fire, or try to understand the system behind the seemingly random events. Try to organise randomness into rules and principles, and then apply that to project management.
Unsurprisingly, we are not alone in this struggle for better work predictability. Essentially, every industry that has a concept of a “project” obeys the same underlying rules. And so, people have done a lot of thinking on the subject. The theory of constraints is one of the more fundamental attempts to understand how work flows through projects and how complexity and unpredictability affect our plans.
So let’s see what we can learn from those muggles.
What probability looks like
I guess everyone knows what a normal distribution looks like, right? Everyone has seen the bell curve of probabilities at some point.
The important thing to understand about probability distributions is that the entire area under the bell curve represents 100% of the probability of an event. The less randomness there is, the narrower the bell curve will be. And the other way around: the more variability and deviation there is to an event, the wider and flatter the curve will be.
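To make the narrow versus wide idea a bit more tangible, here is a minimal sketch using Python's standard library. The mean of 10 and the sigma values are made up purely for illustration, and `bell_curve_summary` is a hypothetical helper, not something from a real project. Since the total area is always 100%, a wider curve has to have a lower peak and less probability close to the mean.

```python
from statistics import NormalDist


def bell_curve_summary(sigma, mean=10.0, window=1.0):
    """Return the peak height and the probability of landing within
    +/- window of the mean, for a normal distribution."""
    dist = NormalDist(mu=mean, sigma=sigma)  # illustrative parameters only
    peak_height = dist.pdf(mean)
    within_window = dist.cdf(mean + window) - dist.cdf(mean - window)
    return peak_height, within_window


if __name__ == "__main__":
    # Larger sigma = more randomness = lower peak, wider curve.
    for sigma in (0.5, 1.0, 3.0):
        peak, within = bell_curve_summary(sigma)
        print(f"sigma={sigma}: peak={peak:.3f}, P(within 1 of mean)={within:.3f}")
```

As sigma grows, both numbers drop: the same 100% of probability is smeared over a wider range of outcomes.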
When it comes to software development, it is useful to think about this in terms of complexity. The more complexity there is in a task, the more chances that something will go wrong, and hence the bigger the deviations from the original plan and the wider the graph will be. And the opposite is true as well. The simpler the task is, the narrower the range of probable completion times will be.
In essence, this is where the non-linear (Fibonacci-style) points estimation system used with SCRUM comes from. The more complex a task is, the more effort it requires and the higher the probability of it going sideways. And so the thinking goes that by giving the randomness plenty of padding, you can curb it. It doesn’t really give you any better idea of when something will be done, but your burn down charts will look amazing when you send them upstairs.
Probabilities in time estimations
When it comes to time estimations, we tend to just roll by inertia and think that they obey the straight-up normal distribution as well. And I can’t blame anyone for that. It is a beautiful idea that comes quite naturally. Unfortunately, reality works in a slightly different way.
What makes time estimations special is that there are no negative or even zero time estimations. Time always goes forward and, as a result, time estimations belong to a family of distributions called asymmetric (skewed) distributions.
The problem here is that the average and the most probable outcome don’t represent the same thing anymore. A whole lot of probability is shifted towards the right end of the graph.
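Here is a rough sketch of that gap between "most probable" and "average". It assumes task duration follows a log-normal distribution, which is my own choice of a convenient skewed distribution that cannot go below zero, not something measured from real projects. The `mu` and `sigma` values and the `skewed_estimate` helper are hypothetical.

```python
import math
from statistics import NormalDist


def skewed_estimate(mu, sigma):
    """Summarise a log-normal duration T, where ln(T) ~ Normal(mu, sigma).

    Returns (mode, median, mean, probability of finishing after the mode).
    """
    mode = math.exp(mu - sigma ** 2)      # the single most probable duration
    median = math.exp(mu)                 # the 50/50 point
    mean = math.exp(mu + sigma ** 2 / 2)  # the long-run average
    # ln(T) is normal, so P(T > mode) = 1 - Phi((ln(mode) - mu) / sigma)
    p_after_mode = 1 - NormalDist().cdf((math.log(mode) - mu) / sigma)
    return mode, median, mean, p_after_mode


if __name__ == "__main__":
    # Hypothetical task: median of 3 days, moderate uncertainty.
    mode, median, mean, p_late = skewed_estimate(mu=math.log(3), sigma=0.6)
    print(f"most probable: {mode:.2f} days")
    print(f"median:        {median:.2f} days")
    print(f"average:       {mean:.2f} days")
    print(f"P(later than the most probable value): {p_late:.0%}")
```

Under this assumption the most probable duration is always the smallest of the three numbers, so an estimate based on "what usually happens" is beaten by reality more often than not.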
What this means is that any task you do as part of a software project is more likely to be completed late than early. Moreover, because the width of the curve essentially represents the complexity of the task at hand, the probability distribution for more complex tasks (which are the bulk of the actual work in software engineering) will have a very shallow peak and a really long tail. Which means that a task is about equally likely to be completed anywhere within a drastically wide range of timelines.
Let the idea sit with you for a second there. Yes, you heard this right. Yes, the work is going to be done eventually if you try really hard. But, during planning, you can have fairly equal probabilities of the task being done within 3, 4 or even 6 days. For example, you could end up with, say, 80%, 83% and 90% probabilities respectively, which in the larger scheme of things are roughly the same thing.
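If you want to play with numbers like these yourself, here is a small sketch. It again assumes a log-normal completion time, and the `MU` and `SIGMA` values are hand-picked by me so that the results land in the same ballpark as the example above. They are not fitted to any real data, and `probability_done_within` is a hypothetical helper.

```python
import math
from statistics import NormalDist

# Hypothetical parameters of ln(duration in days), chosen for illustration only.
MU, SIGMA = -0.23, 1.58


def probability_done_within(days, mu=MU, sigma=SIGMA):
    """P(T <= days) for a log-normal completion time T."""
    return NormalDist().cdf((math.log(days) - mu) / sigma)


if __name__ == "__main__":
    for days in (3, 4, 6):
        print(f"done within {days} days: {probability_done_within(days):.0%}")
```

With a long enough tail, doubling the deadline buys only a few extra percentage points of confidence, which is exactly why picking one number feels so arbitrary.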
You know from your own experience that it is true. The math tells you it is true. And yet, when an engineer says that they are unsure when something will be done, you feel uncomfortable with the ambiguity and demand a sufficiently probable number. One number.
The truth is that for complex tasks, any close enough number will suffice. It will be just as wrong as any other. The most likely scenario is that an engineer will either overestimate and look inefficient or underestimate and look like they don’t know what they are doing. They can only hit the real number by sheer luck, because the range of outcomes is just too wide.
Randomness in coupled systems
There is another aspect that has a significant effect on project timelines. It is how randomness works in coupled systems. To demonstrate it in action, let’s have a bit of a thought experiment.
Let’s say you have a person on a team who gets 3 to 5 tasks done per day; yes, with a normal distribution, that is. They take available work and they process it, left to right. If you have an abundance of work, and you watch this person working 10 days in a row, it would not be completely unfair to estimate that they will average out at 4 tasks per day. So far so good?
Now let’s pretend that we have 5 such people in a chain, handing work over to each other left to right. And we run this system for 10 days. Now, because each person handles, on average, 4 tasks per day, the natural expectation is that the throughput of this system will be, in the long run, 4 tasks per day as well.
The problem is that in coupled systems each person is not independent anymore. The performance of each person depends on the performance of the previous one, and that one depends on the performance of the one before them, and so on. Let’s say the first person has processed only 3 tasks. The second person can be really pumped up to do 4 or 5 tasks that day, but because only 3 of them are ready to go, they can only finish 3 that day. I hope this makes sense.
The nature of coupled systems is that their performance is lower than the average performance of their components. Each impaired step becomes the baseline for the next step, and then the next one, and the next. As a result, work moves through the system in lumps of sorts, or waves if you want, never achieving the perfect flow required for the averages to work.
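The thought experiment is easy to run as a small simulation. This is a minimal sketch with a few assumptions of my own: each person's daily capacity is drawn uniformly from 3, 4 or 5 (a simple stand-in for the distribution described above), work is handed over within the same day, and anything a person cannot get to stays queued in front of them. The function names are hypothetical.

```python
import random


def run_chain(workers=5, days=10, rng=None):
    """Return the tasks finished per day by the last person in the chain."""
    rng = rng or random.Random()
    waiting = [0] * workers  # waiting[i]: tasks queued in front of person i
    finished = 0
    for _ in range(days):
        handed_over = 0
        for i in range(workers):
            capacity = rng.randint(3, 5)  # assumption: uniform 3..5, mean 4
            if i == 0:
                done = capacity  # the first person never runs out of work
            else:
                waiting[i] += handed_over
                done = min(capacity, waiting[i])  # can't do work that isn't there
                waiting[i] -= done
            handed_over = done
        finished += handed_over  # output of the last person in the chain
    return finished / days


def average_throughput(trials=2000, seed=1, **kwargs):
    """Average the per-day throughput over many simulated runs."""
    rng = random.Random(seed)
    return sum(run_chain(rng=rng, **kwargs) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("one person: ", average_throughput(workers=1))
    print("five people:", average_throughput(workers=5))
```

A single person averages out at about 4 tasks per day, while the chain of five lands below that over the same 10 days. How big the gap is depends on modelling choices such as the length of the run and whether queues are allowed to build up, so treat the exact numbers as an illustration rather than a measurement.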
As you can guess, software projects are a type of coupled system. There are always steps that need to be completed in a certain order. And there are usually limited resources, so people cannot do everything at once; even if work is available, they have to move from one task to another. Which means that if you simply tally up the estimations of a project, like most contemporary project management processes do, you’re most likely going to end up being upset about your work.
The summary
If there is anything I want you to take out of this article, it is these two things:
- Statistically speaking, work always takes longer than estimated, and the more complex the task, the more probable that is.
- Software projects are coupled systems, and work always takes longer than the sum of task estimations.
The reason why I did not lump these two into a single “everything is always late” conclusion is that there is a system underneath the randomness; there are moving parts underneath the surface. We need to make friends with those.
Look at it this way. If you ignore the system underneath and the only thing you see is that work always takes longer than expected, you corner yourself with basically two options:
- You can pad the crap out of your estimations, and then congratulate yourself in quarterly meetings for timely deliveries. But that makes for a very inefficient process, which very few companies can afford these days.
- You can try to hold people accountable for their honest estimations, and then watch them burn out. Which is pretty much guaranteed, because math is not on their side there.
If you take a few steps back from this picture, you can clearly see why SCRUM is so popular in corporate/government environments, and why it reliably fails in startups. If all you care about is better estimations, you’re setting yourself up for a situation where you can win only occasionally. You have significantly more chances to be wrong than to be right. Someone is guaranteed to walk out of this situation frustrated: the engineers, the project manager, or the CEO.
To be clear, this is not an argument against estimations. Estimations are necessary. What I’m trying to impress on you is that the ubiquitous practice in this industry of taking estimations at face value is actually the reason why projects are always late. We need to step back from this obsession with better estimations and try to understand how the system actually works and, in particular, how work flows through an organisation.
What is the better way?
Well, that is the question, right? And that will be the topic of the next and last article in this mini series. I’ll tell you the story of how we went from projects that were routinely 200–300% over to projects that are 100% on time; well, actually 2/3 of our projects in recent months were shipped ahead of time. Same company, same technology, same people. All because we stopped obsessing about task estimations and started using something the theory of constraints calls buffer management.
The real jedi mind trick is to stop trying to figure out how much building something will cost, and start thinking about what actually makes sense to build given the resources you have. Estimations are an illusion that takes you away from your actual goals, because what’s built in the end is usually not what you estimated. No plan survives the action.
PS:
Do you want me to come over and talk about this at your organisation, meetup, or boutique gathering? Let me know, I love giving talks and sharing knowledge! You know where to find me.