In the first part of writing about vCenter Operations Manager, or vCOPS for short, I covered the architecture and installation of vCOPS. You can read that part at https://www.pascalswereld.nl/post/84944390488/installing-vcenter-operations-manager.
As I also visit customers that have vCOPS in place but don't know how to use it, I want to follow up with an article on how to interpret the presented metrics, how they are calculated, and when to act. Future posts will cover tuning and customization, but first things first.
What do we see when opening the standard interface of vCOPS?
We see a World overview. World is the vCOPS view that aggregates all vCenter Servers, datacenters, clusters, hosts and other child objects. Per object type you can narrow the view down through that object's tree. The Health, Risk and Efficiency badges can be green at the World level, while further down, cluster-specific stress or efficiency levels can bring them down for that zoomed-in object. Each object type has its own specific metrics involved. Always drill down the tree to see what specifically is happening there.
As said in the previous part, for Health and Efficiency high is good and low is bad. For Risk it is the other way around: low is good, high is bad. This opening screen isn't that bad (if we forget all the alerts and reclaimable waste for now).
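To make that badge direction stick, here is a minimal Python sketch of the idea. The thresholds are made-up example values for illustration only; the real score-to-color ranges are set by the active vCOPS policy and differ per badge and object type.

```python
# Hypothetical thresholds for illustration; the real ranges come from
# the active vCOPS policy and are not fixed at 25/50/75.

def badge_color(score, higher_is_better):
    """Map a 0-100 badge score to a traffic-light color."""
    # For Risk, a low score is good, so invert before comparing.
    effective = score if higher_is_better else 100 - score
    if effective >= 75:
        return "green"
    elif effective >= 50:
        return "yellow"
    elif effective >= 25:
        return "orange"
    return "red"

# Health and Efficiency: high is good.
print(badge_color(96, higher_is_better=True))   # green
# Risk: low is good, so a score of 96 is bad.
print(badge_color(96, higher_is_better=False))  # red
```

The same score of 96 means "healthy" on one badge and "act now" on another, which is exactly the trap this section warns about.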
When I go down an object layer to one of the vCenters, I get this:
My Health went down a few points, nothing serious. Risk went up a few points. Efficiency went from a green to a yellow state. There is reclaimable waste here, and memory density is also not ideal (out of view, so you will have to trust me). Something to investigate further.
And when we go to a cluster in that datacenter, we get this view.
Risk is up to 96, there is definitely something to do here in this cluster!
We didn't see this right away in the World dashboard, but we could have come here by investigating the Environment, Operations, Planning, Alerts and other tabs from the World view. Don't blindly trust your initial information: those numbers are calculated over the whole environment, and you will miss important information from the child objects. There is always something lurking in the environment.
Where does vCOPS get its metrics?
What vCOPS actually does is take thousands of metrics from vCenter Server and categorize them into three actionable, higher-level badges: Health, Risk and Efficiency. These are critical pieces of information that help any admin without having to go through all those vCenter metrics.
All of the metrics collected from the vCenter Server database are moved into the embedded database running inside the vCOPS Analytics VM. Because vCOPS can pull in the historical data already available in the vCenter database, it can start providing useful performance and capacity information the first day it is installed (can, as this depends on your specific statistics settings). There is no need to wait for important business cycles, again depending on whether your vCenter is already configured to take account of those specific workloads.
Certain metrics are identified as more important than others. These more important metrics can indicate severe problems in the virtual infrastructure. These special groups of metrics are the KPIs, or key performance indicators.
How badges, alerts, forecasts and so on are made up is controlled via policies. The default policy is applied when no custom policy is set. For tuning the environment, custom policies are required.
You can also customize groups of metrics. A created metric group might, for example, track the average free disk space for all MSSQL server data disks in your organization's infrastructure.
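As an illustration of what such a metric group boils down to, here is a hedged Python sketch. The VM names and numbers are invented for this example; in vCOPS you would define this through a super metric on a custom group rather than in code.

```python
# Hypothetical sample data: free disk space in GB per MSSQL data disk.
# In vCOPS these values would come from the collected guest/datastore metrics.
mssql_data_disks = {
    "sql01:D:": 120.5,
    "sql02:D:": 40.0,
    "sql03:D:": 87.3,
}

def average_free_space(disks):
    """Average free disk space (GB) over a group of disks."""
    return sum(disks.values()) / len(disks)

print(round(average_free_space(mssql_data_disks), 1))  # 82.6
```

The group average hides that sql02 is much tighter than the others, which is why you still drill down into the tree after looking at a rolled-up number.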
When and how to act?
vCOPS is big and will tell you a lot about your environment, so take the time to get used to the way vCOPS presents its information. Try to get the why out of the presented data. vCOPS is complex, and stays complex until you understand the variety of badges and how and why they are presented in the different layers and object types (each with their own metrics). It becomes even more complex when you want to customize. Make sure you take the time to recognize the important badges and use them to their (and your) advantage.
See some stress or oversized VMs and go full me(n)tal jacket, trying to solve all those events by changing the VMs and blindly fixing issues? Well, your application or server administrators won't be pleased, and perhaps some customers neither, as services might go down (where applications lack redundant roles/services). And maybe that workload's half-year cycle resource needs weren't captured yet because those statistics are not in your data, whoops. Removing resources, like solving issues in general, needs some planning. Hot add is okay for most recent OS releases, but hot remove is still far from supported; that means downtime, and downtime means planning. Right-size the VMs, but just cutting all resources down to the minimum registered in monitoring is probably not the way; you will need to test and monitor the workload for capacity planning. Maybe the workload needs more in a few cycles. Talk to the owners of the workload; they know what is needed and when (or at least they should). Think, communicate and plan your actions.
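To make the right-sizing point concrete, here is a hedged sketch of sizing on the observed peak plus headroom rather than the bare monitored minimum. The 20% headroom and the sample numbers are assumptions for illustration, not a vCOPS calculation.

```python
import math

def rightsize_vcpu(observed_peak_vcpu, headroom=0.20):
    """Recommend a vCPU count: observed peak usage plus a headroom
    buffer, rounded up to whole vCPUs. The 20% default headroom is an
    assumed example value; tune it per workload, together with the
    workload owner."""
    return math.ceil(observed_peak_vcpu * (1 + headroom))

# A VM configured with 8 vCPUs that peaked at 3.2 vCPUs worth of usage:
print(rightsize_vcpu(3.2))  # 4, not the bare observed peak
```

Even this only covers the cycles you have monitored; a quarterly batch run that hasn't happened yet is exactly why you talk to the owners before shrinking anything.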
Customize or extend with other monitoring? Sure, that is possible and recommended as well. But take it one step at a time: first know what is going on before creating a bigger monster that burns down your brain with information overload.