In this blog series about vRealize Log Insight (hereafter abbreviated as LI), I will try to show you why LI is an excellent tool for troubleshooting vSphere (and other) environments, and why you should start using it instead of sticking to the old, clunky methods of troubleshooting and reading logs.
This first part is about proactively looking at the number of events per source to identify and pinpoint errors and problems.
The first step is really simple: Log into LI, go to the Dashboards screen, select the VMware – vSphere content pack, the General – Overview screen, set a fairly long time span (like 24 hours) and look at the top right chart named All vSphere events by hostname. The clicks are all marked in red in the screenshot below:
The number of events per hostname (which corresponds to the total number of log lines) should be distributed somewhat according to the number of VMs per host, since each VM will inevitably generate some general “background noise” during day-to-day operations.
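To make the "events should roughly follow VM count" idea concrete, here is a minimal Python sketch that divides each host's event count by its number of VMs and flags hosts that sit far above the median. This is not something LI does for you, and the host names, counts and the threshold factor are all made-up assumptions for illustration; in practice you would read the numbers off the chart and your inventory.

```python
from statistics import median

def events_per_vm(event_counts, vm_counts):
    """Return events per VM for each host; hosts without VMs are divided by 1."""
    return {
        host: count / max(vm_counts.get(host, 0), 1)
        for host, count in event_counts.items()
    }

def flag_outliers(ratios, factor=3.0):
    """Return hosts whose events-per-VM ratio exceeds factor times the median."""
    baseline = median(ratios.values())
    return sorted(h for h, r in ratios.items() if r > factor * baseline)

# Hypothetical numbers, not taken from the screenshots.
event_counts = {"esx01": 400_000, "esx02": 380_000, "esx03": 150_000, "esx04": 900_000}
vm_counts = {"esx01": 40, "esx02": 38, "esx03": 15, "esx04": 12}

ratios = events_per_vm(event_counts, vm_counts)
noisy_hosts = flag_outliers(ratios)
print(noisy_hosts)
```

With these made-up numbers only esx04 is flagged: it has the most events but few VMs, so its per-VM rate is far above the others.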
In the screenshot above, the first five hosts have significantly more VMs than the rest. There might also be a difference between ESXi 5.5 and 6.0, since the latter doesn’t have the ‘verbose’ log level by default for both hostd and vpxa.
Sometimes you get a much clearer result, such as in the screenshot below:
Here we can very clearly see that the leftmost host has some kind of serious issue, or at least something going on worth looking into.
Going back to the first screenshot, let’s try to figure out why the first host has more events than the rest. We do this by simply clicking on the blue bar of the host in question. This will take us to the Interactive Analysis screen, filtered by that hostname plus the other filters already in place for the graph we just clicked.
The first thing we will change in this view is to set it to display the Time series instead of just a total number of events. This is done by clicking on grouped by hostname and selecting Time series and then clicking the blue Apply button.
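This will give us a nice graph which shows us that there is no particular time in the last 24 hours where there have been more or fewer events. This is important to know during our troubleshooting, since a flat rate points to something that is going on all the time rather than a single incident.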
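What the Time series view tells us can be sketched in a few lines of Python: bucket the event timestamps per hour and check whether any hour stands out from the average. The data, the function names and the "twice the mean" threshold below are assumptions for illustration only, not how LI draws its chart.

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_counts(timestamps):
    """Count events per hour, keyed by the start of each hour."""
    return Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps)

def spike_hours(counts, factor=2.0):
    """Return the hours whose count exceeds factor times the mean hourly count."""
    mean = sum(counts.values()) / len(counts)
    return sorted(hour for hour, n in counts.items() if n > factor * mean)

# Hypothetical data: 10 events in every hour of one day, evenly spread.
start = datetime(2015, 6, 1)
flat = [start + timedelta(hours=h, minutes=6 * i) for h in range(24) for i in range(10)]
flat_spikes = spike_hours(hourly_counts(flat))

# The same data plus a burst of 100 extra events shortly after 14:00.
burst = flat + [start + timedelta(hours=14, seconds=s) for s in range(100)]
burst_spikes = spike_hours(hourly_counts(burst))

print(flat_spikes)   # empty: no hour stands out, like the graph in this post
print(burst_spikes)  # only the 14:00 bucket
```

A flat result like the first one means we are looking for something constant; a result like the second one would instead tell us which hour to zoom in on.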
Next, we’ll try to figure out if there is any particular event that stands out and makes the number of total events go up. Click the Event Types tab in the middle of the screen:
This tab groups similar events together and shows us the total number of events in the left column. We can see that the top event type has occurred 148 thousand times in the selected time span (still 24 hours). The ones that follow have occurred around 90 thousand times each.
We can also note that all of these are verbose, which means they are the types of events that usually aren’t crucial for regular troubleshooting, but will be important to have if we send support bundles to VMware Global Support Services (GSS). You might be tempted to lower the log levels to info, but that is not recommended, since that will increase the risk of not getting future problems diagnosed properly. Also see Steve Flanders’ blog post on this matter.
Let’s filter out the verbose logs by adding a constraint for them to the current filter:
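The idea of grouping similar events can be illustrated with a deliberately naive Python sketch: mask the parts of each message that vary (numbers, hex values) and count how many messages share the resulting pattern. This is only an approximation of the concept; it is not LI's actual grouping algorithm, and the sample messages and function names are invented.

```python
import re
from collections import Counter

# Variable parts of a message and the placeholder that replaces them.
# Hex values are masked before plain numbers so "0x1f" becomes one token.
_MASKS = [
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<hex>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]

def event_signature(message):
    """Replace variable fields so that similar messages collapse to one pattern."""
    for pattern, token in _MASKS:
        message = pattern.sub(token, message)
    return message

def top_event_types(messages, n=3):
    """Return the n most common signatures with their counts."""
    return Counter(event_signature(m) for m in messages).most_common(n)

# Hypothetical log messages.
messages = [
    "Task 101 completed in 12 ms",
    "Task 102 completed in 9 ms",
    "Opened handle 0x1f for vm 7",
    "Task 103 completed in 15 ms",
]

event_types = top_event_types(messages)
for signature, count in event_types:
    print(count, signature)
```

The three "Task ... completed" lines collapse into one event type with a count of 3, which is the same kind of ranking the Event Types tab gives us.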
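Conceptually, this filter is just an exclusion on a keyword. A minimal Python sketch of the same idea is shown here; the sample lines are invented and only loosely modelled on ESXi log lines, and the function name is hypothetical.

```python
def drop_verbose(lines, keyword="verbose"):
    """Keep only the lines that do not contain the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [line for line in lines if needle not in line.lower()]

# Hypothetical log lines.
lines = [
    "[12345 verbose 'Default'] Periodic refresh finished",
    "[12345 info 'Default'] Task created",
    "[12346 warning 'Default'] Datastore latency is high",
    "[12347 verbose 'Default'] Cache lookup",
]

remaining = drop_verbose(lines)
print(len(remaining))
```

Note that this only hides the verbose lines from the analysis. They are still collected, which is what we want in case GSS asks for them later.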
Now we can see the top four event types, and we can decide if we want to drill down further into any of them. This will be shown in the next blog post of this series. Stay tuned!