Occasionally, Maximo
became unavailable for a short period of 5-10 minutes. Alarms were raised, the IT
help desk was called, and the issue got escalated to the Maximo specialist
(you). You logged into the server, checked the log file, and found a Java
Out-of-Memory (OOM) issue. Not a big deal: the server usually restarted itself
and became available again soon after. You reported back to the business and
closed the issue. Does that scenario sound familiar to you?
If such an issue has
only occurred on your system once, it was probably treated as a simple problem.
But since you had to search for a solution on the web and ended up here,
reading this article, it has probably occurred more than once, and the business
now requires it to be treated as a critical incident. As the Maximo specialist,
you’ll need to dig deeper to report the root cause of the issue and provide a
fix to prevent it from occurring again. Analyzing low-level Java issues is not
an easy task, and this post describes my process for dealing with them.
By default, when an OOM
issue occurs, WebSphere produces a set of dump files in the $WAS_HOME/profiles/<ProfileName>/ folder. These files can
include:
- Javacore.{timestamp}.txt: contains high-level details of the JVM at the time it crashed, and is the first place to look in a general JVM crash scenario. However, when I already know it is an OOM issue, I generally ignore this file.
- Heapdump.{timestamp}.phd: the dump of the JVM’s heap memory. For an OOM issue, this contains the key data we can analyse to get further details.
- Core.{timestamp}.dmp: a native memory dump. I get these because I work with Maximo running on Windows most of the time; other operating systems (e.g. Linux) may produce a different file. I often ignore this file and delete it from the server as soon as I find there’s no need for it. However, in certain scenarios we can get some information out of it to help our analysis, as demonstrated in one of the cases later in this article.
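These dumps come from the IBM J9 JVM’s default dump agents, which can be inspected and tuned via the -Xdump option in the server’s Generic JVM arguments. The snippet below is only a rough sketch (the D:\dumps folder is a made-up example; check the -Xdump syntax against your own JVM version before using it):

    -Xdump:what
    -Xdump:heap:events=systhrow,filter=java/lang/OutOfMemoryError,range=1..4,file=D:\dumps\heapdump.%Y%m%d.%H%M%S.%pid.%seq.phd

The first argument simply prints the registered dump agents at JVM startup; the second asks for a heap dump on the first few OutOfMemoryError events and writes it to a dedicated folder instead of the profile root.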
Generally, for an OOM issue, if it
is a one-off instance, we’ll want to identify (if possible) what consumed all of the
JVM’s memory. If it is a recurring issue, it is likely a memory leak, in
which case we’ll need to identify the leak suspects. To analyse the heap dump (PHD
file), there are many heap analysis tools available; I use the Heap Analyzer
provided with the IBM Support Assistant Workbench.
To read Windows dump files (DMP files), I use the WinDbg tool,
available on Windows 10 as part of Microsoft’s Debugging Tools for Windows. Below are some examples of crashes I had to
troubleshoot, which hopefully give you some general ideas on how to
deal with such problems:
Case 1: A core dump occurred on the Integration JVM
of an otherwise stable system. The issue was escalated to me from 2nd
level support. Using Heap Analyzer, I could see Maximo was trying to load 1.6 GB
of data into memory, which equated to 68% of the allocated heap size for this JVM.
There was also a java.lang.StackOverflowError object that consumed 20% of the heap
space. This obviously looked weird, but I couldn’t figure out what the
problem was. So I reported this back to the support engineer, together with some
information I could find in SystemOut.log: immediately before the crash
occurred, the system status looked good (memory consumption was low), and there
was a high level of activity by a specific user. The support engineer
picked up the phone to talk with the guy, and found the issue was due to him
trying to load some bad data via MXLoader. The solution included some further
training on data loading for this user, and some tightening of Maximo integration/performance
settings (sketched below).
Figure 1: Heap Analyzer shows a huge 1.6 GB SqlServer TDSPacket object
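As an illustration of the kind of tightening that helps here, Maximo’s standard fetch-limit system properties cap how many rows a single MboSet can pull into memory. The values below are purely illustrative, not the ones used for this client; tune them to your own data volumes:

    mxe.db.fetchResultLogLimit=5000
    mxe.db.fetchStopLimitEnabled=1
    mxe.db.fetchStopLimit=50000

The first property logs a stack trace whenever a single MboSet fetches more rows than the limit, which makes it much easier to trace a runaway query back to a user or an integration; the other two make Maximo stop the fetch with an error instead of loading an unbounded result set into the heap.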
Case 2: Several core dumps occurred within a short
period. The client was not aware of any unavailability as the system was load
balanced. Nevertheless, alarms were sent to the support team and it was treated as a
critical incident. When the heap dump was opened in Heap Analyzer, it showed a
single string buffer and its char[] object consuming 40% of the JVM’s heap space.
Figure 2: Heap Analyzer shows a char[] object consuming 1.6 GB of heap space
In this instance, since it was a single string object, I opened
the core dump file in WinDbg and viewed the content of the
string by running the “du” command on the memory address of the char[] object (Figure
3). From the value shown, it looked like a ton of error messages related to
DbConnectionWatchDog had been appended to this string buffer. It was me who, a few days
earlier, had switched on DbConnWatchDog on this system to troubleshoot some
DB-related issues. In this case, Maximo’s out-of-the-box DbConnWatchDog was
itself faulty and caused the problem, so I had to switch it off.
Figure 3: WinDbg shows the content of a memory address
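For reference, the WinDbg side of this is only a couple of commands. The address below is made up; in practice you take it from the char[] object reported by Heap Analyzer, open the Core.{timestamp}.dmp file via File > Open Crash Dump, and run:

    du 000000012a3b4c50
    du 000000012a3b4c50 L200

The first command dumps the memory at that address as Unicode characters (which is how Java stores char[] data); the second form extends the output length (the count after L is in hex), since du only prints a short snippet by default.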
Case 3: A system consistently threw OOM errors and
core dumped on its two UI JVMs every 2-3 weeks. Heap Analyzer almost always
showed a leak suspect with links to WebClientSession objects. The
log file also showed an oddly high number of WebClientSessions created versus the
number of logged-in users. We knew that this client had a group of
users who always open multiple browser tabs to use many Maximo screens at the
same time, but that alone should not create such a disproportionately high number of
WebClientSessions. At that point, we didn’t know what caused it.
Throughout the troubleshooting, we
kept a channel open with the IBM support team to seek additional help on the
issue. At their suggestion, various log settings were switched on to
troubleshoot the issue. The UI logging confirmed that a WebClientSession was always
created when a user logged in, but never disposed of. In other words, the
total WebClientSession count only ever increased, and after a period of use, it
would consume all of the JVM’s heap space and cause the OOM crash. Some frantic, random
searching led me to an article by Chon Neth, author of the famous MaximoTimes blog, mentioning that a
memory-to-memory replication setting in WebSphere could cause similar
behaviour. I quickly checked and confirmed this setting was enabled on this
system. Memory-to-memory replication is a high-availability feature available
in WebSphere, but it is not supported by Maximo. So we turned this setting
off, and the problem disappeared.
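If you want to check your own environment for this, the replication domains are visible in the WebSphere admin console (and referenced from each server’s Session management > Distributed environment settings). The wsadmin (Jython) snippet below is only a sketch that lists any configured replication domains; treat it as illustrative rather than a definitive check:

    # List memory-to-memory (DRS) replication domains configured in the cell.
    domains = AdminConfig.list('DataReplicationDomain')
    if domains:
        for d in domains.splitlines():
            print 'Replication domain found: ' + d
    else:
        print 'No memory-to-memory replication domains configured'

If anything is listed and a Maximo server’s session manager references it, that is the setting to review, since Maximo does not support it.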