This is a random-thought post, not quite technical but in my opinion very important. The idea formed after some of the topics and discussions at last week’s NL VMUG. The main goal of this post is to start a discussion, so why don’t you post a comment with your opinion? Here it goes…
Murphy, hardware failures and engineers tripping over cables in the data center: we tech gals and guys all know them and have probably experienced them. Disaster happens every day. But what about a state-of-the-art application that ticks all the boxes for functional and technical requirements, but that users are not able to use, because they lack knowledge in this field or because they have no clue why the business created this thingy (why and how this application or data is supposed to help the information flow of business processes)? Failure is a constant and needs to be handled accordingly, and from all angles.
Techies are used to looking at the environment from the bottom up. We design complete infrastructures with failure in mind, and we have the technology and knowledge to execute disaster avoidance or disaster recovery perfectly (forget the theoretical RTO/RPO of 0’s here). We can do this at a lower cost (CAPEX) than ever before, and with more benefits (OPEX and minimized downtime for business processes) than before. But then we should ask ourselves this: what about failing applications, or data that is generated but does not reach the business processes that need it (the people who operate or use these processes)?
Designs need to tackle this problem by starting from the complete business view and connecting strategy, technical possibility and users!
And how will we do this then?
Well, first of all, the business needs full knowledge of its required processes and information flows, the ones that support the services behind the business strategy or process the data going in and out of them. Very important. And to be honest, only a few companies have figured out this part. Most experience difficulties. And they give up. Commitment from the business and the people in the business is of utmost importance. Be a strategic partner (to the management). Start by asking why certain choices are made, and explain the why a little more often than just the how, what and when!
Describe why and how information and data are collected, organized and distributed (in a fail-safe and secure way) and which information systems are used. Describe the applications (and their ROI, services, processes and buses), how the information is presented and how it flows back into the business (via people or automated systems). How does your solution let the business grow and flourish? Keep clear of too much technical detail: present your story in such a way that the manager understands the added value and knows which team members (future users) to delegate to project meetings.
Next up: IT, or ICT here in the Netherlands, Information and Communication Technology. I really like the Communication part for this post; businesses must do that a little more often. Start looking at the business from different points of view, and make sure you understand the functional parts and what is required to operate them. Internal communication is essential to prevent people from working on their own without a common goal or reason. Know the ins and outs, and describe why and how the desired result is achieved. Connect the different business layers. For this, a large part of business IT departments need to refocus their 1984 vision on the now and the future. IT is not about infrastructure alone; it is a working part within the business, a facilitator, a placeholder (for lack of other words in my current vocabulary). IT needs to be about aligning the business services with applications and data, the tools and services that support and provide for the business. That is why IT is there in the first place; the business is not there (connected or not) for IT. IT’s business. Start listening, and start writing at a global level first (what does the business mean by working from everywhere and every place?). Then map possibilities to logical components (think from the information: why is it there, where does it come from and where does it go; and think from the apps and the users). Only when you have defined the logical components can you add the physical components (insert the providers, vendors and hardware building blocks).
Sounds familiar? There are frameworks out there to use. Use your Google-Fu: Enterprise Architecture. Is this for enterprise size organizations only? No, any size company must know the why and why and why. And do something about it. And a simplified version will work for SMB size companies. Below is an example of a simplified model and what layers of attention this architectural framework brings to your organization.
And…in addition to this, start using the following as a basis to include in your designs:
The best way to avoid failure is to fail constantly
Not my own, but from Netflix. This cannot be closer to the truth. No disaster recovery plan testing in iterations of half a year or a year. Do it constantly, and see whether your environment and business are up to the task of keeping applications unaffected when something goes down. Sure, there will be an impact, for example services not running at 100% warp speed, but users who can still do things with the services is better than nothing at all. And knowing how your service operates with a failure is the important part here. Now you can do something about not reaching full speed, for example scale out so that a service failure is absorbed without degraded service speed. Or you learn which of your services can actually go down for a certain time frame without influencing business services. This is valuable feedback that needs to go back to the business. Is going down acceptable for the business, or should we try to handle this part so that it does not go down at all? Just don’t apply this at the infrastructure level only; include the data, application and information layers as well.
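To make the scale-out remark a bit more concrete, here is a minimal sketch of the arithmetic behind it. Everything in it is an assumption for illustration: the function names, the idea that every instance has the same capacity, and the numbers in the usage lines. It only shows how many instances you would need to lose some of them without dropping below full speed, and what speed is left when you did not plan for that.

```python
import math


def instances_needed(peak_load: float, per_instance_capacity: float,
                     tolerated_failures: int = 1) -> int:
    """Instances required to serve peak_load at full speed while
    tolerated_failures instances are down (hypothetical, equal-sized instances)."""
    if peak_load < 0 or per_instance_capacity <= 0 or tolerated_failures < 0:
        raise ValueError("load and failures must be >= 0, capacity must be > 0")
    return math.ceil(peak_load / per_instance_capacity) + tolerated_failures


def remaining_speed(instances: int, failed: int, peak_load: float,
                    per_instance_capacity: float) -> float:
    """Fraction of peak_load that can still be served (1.0 means full speed)."""
    healthy = max(instances - failed, 0)
    if peak_load == 0:
        return 1.0
    return min(1.0, healthy * per_instance_capacity / peak_load)


# Illustrative numbers only: 900 requests/s at peak, 300 requests/s per instance.
needed = instances_needed(900, 300, tolerated_failures=1)
print("instances needed to survive one failure at full speed:", needed)
print("speed with 3 instances and one down:", remaining_speed(3, 1, 900, 300))
print("speed with 4 instances and one down:", remaining_speed(4, 1, 900, 300))
```

The outcome of such a calculation is exactly the kind of feedback that should go back to the business: is degraded speed acceptable for this service, or is the extra instance worth it?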
Big words here: trust and commitment. Trust the environment in place, and test whether it succeeds in providing the services needed even when hell freezes over (or when some other unexpected thing happens). Trust that your environment can handle failure. Trust that the people can do something with or about the failures.
Commitment means the organization does not give up when it hits a brick wall over and over, but keeps going until you are all satisfied. And trust that your people can fail too. Let them become familiar with the procedures, and let a broader range of people handle each procedure (not just the current user names mapped to the processes; with roles defined and mapped to services, multiple people can operate and analyze the information). Just like the technology you test, your people are not operating 24x7x365: they like to go on leave and sometimes they get ill.
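The idea of mapping roles to procedures instead of names can be checked with very little code. The sketch below is hypothetical from top to bottom: the procedure names, role names and people are made up, and the threshold of two people is just an assumption. It lists the procedures that too few available people can handle once someone is on leave or ill.

```python
def understaffed_procedures(procedure_roles, role_members,
                            unavailable=frozenset(), minimum_people=2):
    """Return {procedure: [available people]} for every procedure that fewer
    than minimum_people available people can handle via their roles."""
    result = {}
    for procedure, roles in procedure_roles.items():
        people = set()
        for role in roles:
            # A role nobody holds simply contributes no people.
            people.update(role_members.get(role, ()))
        available = people - set(unavailable)
        if len(available) < minimum_people:
            result[procedure] = sorted(available)
    return result


# Made-up example data: which roles may run a procedure, and who holds each role.
procedure_roles = {
    "restore-backup": ["storage-admin"],
    "failover-db": ["dba", "ops-on-call"],
}
role_members = {
    "storage-admin": ["anna"],
    "dba": ["bas", "chris"],
    "ops-on-call": ["chris", "dana"],
}

print(understaffed_procedures(procedure_roles, role_members))
print(understaffed_procedures(procedure_roles, role_members,
                              unavailable={"bas", "chris"}))
```

Running it with different people marked as unavailable is a small failure test for the organization itself, in the same spirit as the technical tests.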
Back to Netflix. To generate their failures, Netflix uses Chaos Monkey. That name brings another monkey to mind, Monkey Lives: http://www.folklore.org/StoryView.py?project=Macintosh&story=Monkey_Lives.txt. Not sure where the idea came from, but such a service and name cannot be just a coincidence (if you believe coincidence exists in the first place). But that is not what this paragraph is about.
The Chaos Monkey’s job is to automatically and randomly kill instances and services within the Netflix infrastructure architecture. When working with Chaos Monkey you will quickly learn that everything happens for a reason. And you will have to do something about it. Pretty awesome. And the engineers even shared Chaos Monkey on GitHub: https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey. It should not stop at the battle plan of randomly killing services; fill the environment with random events that put services into some not-okay state (as opposed to a dead service) and see how the environment reacts to this.
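To get a feeling for the principle, here is a toy simulation. To be clear: this is not Netflix’s Chaos Monkey and it does not use its code or API. It is a minimal in-memory sketch in which all names (`Instance`, `unleash_monkey`, `service_status`) and probabilities are my own assumptions. It randomly kills instances, and also puts some into a not-okay (degraded) state, then reports how the service as a whole is doing.

```python
import random
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"  # still running, but in a not-okay state
    DEAD = "dead"


@dataclass
class Instance:
    name: str
    state: State = State.HEALTHY


def unleash_monkey(instances, rng, kill_probability=0.2, degrade_probability=0.2):
    """Randomly kill or degrade instances in place; return the list of events."""
    events = []
    for inst in instances:
        if inst.state is State.DEAD:
            continue  # nothing left to break
        roll = rng.random()  # uniform in [0.0, 1.0)
        if roll < kill_probability:
            inst.state = State.DEAD
            events.append((inst.name, State.DEAD))
        elif roll < kill_probability + degrade_probability:
            inst.state = State.DEGRADED
            events.append((inst.name, State.DEGRADED))
    return events


def service_status(instances, minimum_healthy):
    """'ok' with enough healthy instances, 'degraded' while anything is alive, else 'down'."""
    healthy = sum(1 for inst in instances if inst.state is State.HEALTHY)
    if healthy >= minimum_healthy:
        return "ok"
    if any(inst.state is not State.DEAD for inst in instances):
        return "degraded"
    return "down"


# One chaos round on a made-up fleet; the seed only makes the run repeatable.
fleet = [Instance(f"web-{n}") for n in range(5)]
for name, new_state in unleash_monkey(fleet, random.Random(42)):
    print(f"{name} is now {new_state.value}")
print("service status:", service_status(fleet, minimum_healthy=3))
```

In a real environment the interesting part is of course not the monkey but the reaction: does the service stay "ok", and if not, does anyone notice and know what to do?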