The Design Mindset (2)
In a comment from last week’s post on the design mindset, which focuses on asking what through observation, Alan asked why I don’t focus on business drivers, or intent, first. This is a great question. Let me give you three answers before we actually move on to asking why.
Why can yuor barin raed tihs? Because your mind has a natural ability to recognize patterns and “unscramble” them. In reality, what you’re doing is seeing something that looks similar to what you’ve seen before, inferring that’s what is meant now, and putting the two together in a way you can understand. It’s pattern recognition at its finest, and you’re already a master at this, even if you think you’re not. This is an important skill for assessing the world and reacting in (near) real time; if we didn’t have this skill, we wouldn’t be able to tolerate the information inflow we actually receive on a daily basis.
The danger is, of course, that you’re going to see a pattern you think you recognize and skip to the next thing to look at without realizing that you’ve mismatched the pattern. These pattern mismatches can be dangerous in the real world—like the time I bumped against an engine part that was so hot it felt cool, leaving me with a permanent scar on my leg. So the point of “observe first” is to deal with reality as it is on the ground, rather than seeing the pattern, inferring the intent, and moving on to the “next thing.”
Once you’ve observed, it’s time to try to understand why. When you’re asking why, you don’t ever want to stop with the obvious answer. Instead, you want to be like the pesky eight-year-old who’s discovered that “why” is the ultimate question for driving parents nuts.
“Why is this aggregation configured here?”
“Because we needed to break up the failure domain.”
“Why did you need to break up the failure domain just here?”
“Because we thought it was too big.”
“Why did you think the failure domain was too big?”
“Because we had a convergence problem once around that area.”
“Why did you care about the speed at which the network converges?”
“Because we have this application, you see… And if you don’t stop asking why, I’m going to slap you silly!”
Why is a multilevel question; ultimately you want to get back to the actual business driver for any particular item of configuration. In the end, if you can’t connect a configuration to a business driver (and don’t settle for, “it’s a best common practice,” by the way), then you need to set that bit of configuration or reality aside in a special pool to be considered later. Using this process, you’re likely to find a lot of stuff that might not need to be there. By making the connections, you might be able to find another way to look at the problem that will help you radically simplify the design.
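One way to picture the bookkeeping described above is as a small triage over configuration items. The following is only a minimal sketch: the `ConfigItem` type, the `triage` function, and the example entries are hypothetical names invented for illustration, and the only rule it encodes is the one from the paragraph above (an item with no business driver at the end of its why chain goes into a pool to be considered later).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ConfigItem:
    """A piece of configuration plus the answers collected by asking why (hypothetical type)."""
    name: str
    why_chain: List[str] = field(default_factory=list)
    # Stays None when the why chain never reaches a real business driver.
    business_driver: Optional[str] = None


def triage(items: List[ConfigItem]) -> Tuple[List[ConfigItem], List[ConfigItem]]:
    """Split items into those tied to a business driver and those set aside for later."""
    justified = [item for item in items if item.business_driver]
    review_pool = [item for item in items if not item.business_driver]
    return justified, review_pool


# Illustrative data only: the first chain paraphrases the dialogue above,
# the second item is made up to show the "best common practice" case.
items = [
    ConfigItem(
        name="aggregation point",
        why_chain=[
            "break up the failure domain",
            "the failure domain was too big",
            "there was a convergence problem in that area",
        ],
        business_driver="an application that depends on convergence speed",
    ),
    ConfigItem(
        name="some other knob",
        why_chain=["it's a best common practice"],
    ),
]

justified, review_pool = triage(items)
print([item.name for item in review_pool])
```

The point of keeping the whole why chain, rather than only the final answer, is that the intermediate answers are where another way of looking at the problem tends to show up.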
What’s often hiding behind the why that can’t be connected to a specific business driver is either “because we could” or “because we know that technology.” The time that I worked through converting a network from OSPF to IS-IS because several folks on the networking staff were studying for the CCIE comes to mind…
The complexity model can, as always, help guide your why questions, specifically by focusing on optimization, as this is where you’re most likely going to match network design with actual business requirements. Within the complexity model, of course, you’re going to be trading off optimization against state and surface, so the process is going to look something like this most of the time:
Business drivers often lead to primarily optimization requirements, to which the designer can respond either by increasing the amount or speed of state in the network, or by adding overlays and other systems, which in turn increases the surfaces in the network. At some point, someone cries “uncle!” and says, “it’s time to reduce complexity here, because this network is eating our OPEX!” This is where really understanding why starts to prove useful, because it allows you to start seeing where optimization can be realistically traded off against simplicity by rethinking the relationship between optimization, state, and surface.
We’ll consider this more deeply when we get to the decision phase in the next post.
Very eye-opening point on the recognition of patterns and how this can influence our next steps in a very negative way (it does make me wonder how far this is “brainwashing through vendor methods with pre-defined designs and best-practices”…).
Love the why’s (I’m a big fan of the 5 whys method) and how this in the end should map back to business drivers (and not to what I call IT-for-IT (“because we can” or because “these nerd-knobs are there”)).
One thing I’m still pondering (which I guess will be covered in the decision phase) is what happens if we can map the why back to a business driver, but due to technology changes we can now “solve” the business issue another way. How do we then break through the “we have always done it this way” or “to do this will cost too much time/effort for minimal benefits”?
To close, I’d like to thank you for sharing your insights and knowledge on these matters and helping me (and others) learn from your experiences.
This is a two-edged sword, honestly. Of course, you’ve just given me a really good idea for another blog post in this neighborhood. 🙂 The problem is that you’re trying to trade off incurring technical debt by “doing what we’ve always done,” versus taking the time to do it the right way now, so you can continue to grow in the future. The problem is you can’t measure the ossification or brittleness of a network in any meaningful way, so you end up thinking things are just fine until it breaks. And then it breaks for real, and you’re working nights and weekends to get the business back on its feet.
Two useful constructions here might be the mean time to repair (MTTR) and the mean time between mistakes. For MTTR, find some downtime in which to break the network in an intentional way, and see how long it takes for someone else to find the problem and repair it; make your network a CCIE lab, in other words. If the repair takes longer than the downtime window, you have a complexity problem that needs to be solved. If you can quantify that downtime due to MTTR in business costs, $x/minute, then you have something solid to argue from when making the case for simplification.
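The MTTR argument above comes down to simple arithmetic, sketched here under stated assumptions. The function names, the drill timings, the 45-minute window, and the 500 per minute figure are all invented for illustration; real numbers would have to come from your own break-and-fix drills and from the business.

```python
from typing import List


def mean_time_to_repair(repair_minutes: List[float]) -> float:
    """Average time, in minutes, taken to find and fix an intentionally injected fault."""
    return sum(repair_minutes) / len(repair_minutes)


def exceeds_window(mttr_minutes: float, window_minutes: float) -> bool:
    """True when repairs take longer than the downtime window, the warning sign described above."""
    return mttr_minutes > window_minutes


def downtime_cost(mttr_minutes: float, cost_per_minute: float) -> float:
    """Business cost of one outage of average length, at $x per minute."""
    return mttr_minutes * cost_per_minute


# Hypothetical drill results: minutes to repair in three break-and-fix exercises.
drills = [42.0, 75.0, 63.0]
mttr = mean_time_to_repair(drills)

window = 45.0           # assumed length of the downtime window, in minutes
cost_per_minute = 500.0  # assumed business cost of downtime, $x/minute

print(f"MTTR: {mttr} minutes")
print(f"Longer than the window: {exceeds_window(mttr, window)}")
print(f"Cost per average outage: ${downtime_cost(mttr, cost_per_minute)}")
```

With these made-up inputs, the average repair runs past the window, and the cost figure is the number you would carry into a conversation about simplification.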
The mean time between mistakes is harder to quantify, but it’s the same sort of problem. Configuring 1k OSPF routers with the same configuration is pretty easy. Maintaining it, however, is a problem.
Feel free to ship me an email if you want to discuss further…