When writing a piece of software, we are in total control of the quality of the product. With integration, many elements are not under our control. The network and firewalls are usually managed by IT. With external systems, we often don't know how they work and, many times, are not given access to them. Yet any change to these elements can cause our interfaces to fail.
For synchronous interfaces, the user receives instant feedback after each action (e.g. a Maximo-GIS integration), so we don't usually need to set up alarms. Asynchronous interfaces, on the other hand, usually run in the background and give no instant feedback, so when a failure occurs, it tends to go unnoticed. In many cases, we only find out about a failure after it has caused some major damage.
A good interface must provide adequate mechanisms to handle failures, and in the case of async integration, proper alarms and reports should be set up so that failures are captured and handled proactively by IT and application administrators.
On the one hand, it is bad to have no monitoring. On the other hand, it is even worse to have so many alarms that people completely ignore everything, including the critical issues. This is common in larger organisations. Many readers of this blog probably won't be surprised when they open the Message Reprocessing app of the Maximo system they manage and find thousands of unprocessed errors in there. It is likely that those issues have accumulated and gone unaddressed for years.
It is hard to create a perfect design from day one and build an interface that works smoothly after the first release. There are many different kinds of problems an external system can throw at us, and it is not easy to envision all possible failure modes. As such, we should expect and plan for an intensive monitoring and stabilizing period of a few days to one or two weeks after the first release.
As a rule of thumb, an interface should always be monitored and should raise alarms when a failure occurs. It should also provide a mechanism to resubmit/reprocess a failed message. More importantly, there shouldn't be more than a few alarms raised per day on average from each interface, no matter how critical and high-volume the integration is. Any more than that and it becomes too noisy, and people start ignoring the alarms. If an interface raises more than a few alarms a day, there must be some recurring patterns, and each of them should be treated as a systemic issue: the software should be rebuilt or updated to handle these recurring issues.
This is easier said than done, and to me every interface is a continuous process of learning and improvement. Below are some examples of interfaces I built or dealt with recently. I hope you find them entertaining to read.
Case #1: Integration of toll point equipment and maintenance with Maximo
An infrastructure construction company built and is now operating a freeway in Sydney. Maximo is used to manage maintenance activities, mainly on civil infrastructure. The toll point equipment and traffic monitoring system were provided by an external vendor (Kapsch). Device status and maintenance work from this system are exported daily as CSV files and sent to Maximo via SFTP. On the Maximo side, the CSV files are imported by a few automation scripts triggered by a cron task.
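As a rough illustration of that setup, a cron-task automation script for one of these imports might look something like the sketch below (written in Jython, which Maximo automation scripts use). The file path, the CSV columns and the choice of the ASSET object are assumptions for illustration, not the actual interface.

```python
# Minimal sketch of a cron-task automation script importing a device-status CSV.
# The drop-folder path, the CSV column names and the target ASSET object are
# assumptions; a real script would also handle updates, validation and bad rows.
import csv
from psdi.server import MXServer

CSV_PATH = "/shared/kapsch/device_status.csv"  # assumed SFTP drop location

mxServer = MXServer.getMXServer()
userInfo = mxServer.getSystemUserInfo()
assetSet = mxServer.getMboSet("ASSET", userInfo)

csvFile = open(CSV_PATH, "rb")
try:
    for row in csv.DictReader(csvFile):
        asset = assetSet.add()                      # real code would add-or-update
        asset.setValue("ASSETNUM", row["ASSETNUM"])
        asset.setValue("SITEID", row["SITEID"])
        asset.setValue("DESCRIPTION", row["DESCRIPTION"])
    assetSet.save()
finally:
    csvFile.close()
    assetSet.close()
```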
The main goal of the interface is to maintain a consolidated database of all assets and maintenance activities in Maximo. It is a non-critical integration because even if it stopped working for a day or two, it wouldn't cause a business disruption. However, occasionally Kapsch would stop exporting CSV files for various reasons, and the problem would only be discovered after a while, such as when the end-of-month report is produced or when someone tries to look up the status of a work order that was never created by the interface. Since we don't have any access or visibility into the traffic monitoring system managed by Kapsch, we had to build the monitoring and alarms in Maximo.
The difficulty is that when the interface on Kapsch's side fails, it doesn't send Maximo anything: there is no import, and thus no error or fault on the Maximo side to trigger an alarm. The solution we came up with is a custom logging table in which each import run is recorded as an entry with some basic statistics, including import start time, end time, total records processed, and the number of records that failed. The statistics are displayed on the Start Center.
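As an illustration, writing one entry to such a logging table at the end of an import run could look roughly like the sketch below; the object name ZZIMPORTLOG and its columns are assumed for the example, not the actual table design.

```python
# Sketch of recording one import run in a custom logging table. The object name
# ZZIMPORTLOG and its columns are assumptions.
from psdi.server import MXServer
from java.util import Date

def log_import_run(interfaceName, startTime, totalCount, failedCount):
    mxServer = MXServer.getMXServer()
    userInfo = mxServer.getSystemUserInfo()
    logSet = mxServer.getMboSet("ZZIMPORTLOG", userInfo)
    try:
        entry = logSet.add()
        entry.setValue("INTERFACE", interfaceName)
        entry.setValue("IMPORTSTART", startTime)    # java.util.Date captured before the import loop
        entry.setValue("IMPORTEND", Date())         # now
        entry.setValue("RECORDSTOTAL", totalCount)
        entry.setValue("RECORDSFAILED", failedCount)
        logSet.save()
    finally:
        logSet.close()

# Example usage at the end of an import run:
# log_import_run("KAPSCH_DEVICE_STATUS", importStart, totalRows, failedRows)
```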
As for alarms, since this integration is non-critical, an escalation monitors whether there has been a new import within the last 24 hours; if not, Maximo sends an email to me and the people involved. There are actually a few different interfaces in this integration, such as the device list and preventive maintenance work coming from TrafficCom, and corrective work on faults coming from JIRA. Thus, when a source system stopped running for planned or unplanned reasons, I would sometimes receive multiple emails several days in a row, which was too much. So I tweaked it further to send only one email on the first day one or more interfaces stop working, and a reminder email a week later if the issue has not been rectified. After the initial fine-tuning period, the support teams on the Kapsch and Maximo sides were added to the recipient list, and after almost two years the integration has been running satisfactorily. There have been a few occasions when files were not received on the Maximo side, and the support people involved were always informed and able to take corrective action in a timely manner, before the end users noticed.
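In our case this check is just a standard Maximo escalation with an email action, but to illustrate the logic, the same 24-hour watchdog could be written as a cron-task automation script along these lines. The table and column names, SMTP host and addresses are all assumptions, and the real setup also suppresses repeat emails as described above.

```python
# Illustrative 24-hour watchdog: if no import has been logged for an interface
# in the last 24 hours, send an alert email. The table and column names
# (ZZIMPORTLOG, IMPORTEND), the SMTP host and the addresses are assumptions.
import smtplib
from datetime import datetime, timedelta
from psdi.server import MXServer

SMTP_HOST = "mailhost.example.com"      # assumed
ALERT_FROM = "maximo@example.com"       # assumed
ALERT_TO = ["support@example.com"]      # assumed

def send_alert(interfaceName, detail):
    body = "Subject: Interface %s has stopped\n\n%s" % (interfaceName, detail)
    server = smtplib.SMTP(SMTP_HOST)
    server.sendmail(ALERT_FROM, ALERT_TO, body)
    server.quit()

def check_interface(interfaceName):
    mxServer = MXServer.getMXServer()
    userInfo = mxServer.getSystemUserInfo()
    logSet = mxServer.getMboSet("ZZIMPORTLOG", userInfo)
    try:
        logSet.setWhere("interface = '%s'" % interfaceName)
        logSet.setOrderBy("importend desc")
        logSet.reset()
        latest = logSet.getMbo(0)
        if latest is None:
            send_alert(interfaceName, "no import has ever been logged")
            return
        lastRun = datetime.fromtimestamp(latest.getDate("IMPORTEND").getTime() / 1000.0)
        if datetime.now() - lastRun > timedelta(hours=24):
            send_alert(interfaceName, "last successful import was at %s" % lastRun)
    finally:
        logSet.close()

check_interface("KAPSCH_DEVICE_STATUS")
```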
Case #2: Integration of CRM and Maximo
A water utility in Queensland uses Maximo to manage infrastructure assets and to track and dispatch work to field crews. When a customer calls to request a new connection or report a problem, the details are entered into a CRM system by the company's call centre. The request is then sent to Maximo as a new SR and turned into work orders. When the work order is scheduled and a crew has been dispatched, these status updates are sent back to CRM. If the customer calls at any time to check on the status of the request, the call centre should be able to provide an answer by looking up the details of the ticket in CRM alone. Certain types of problems, such as major leaks or water quality issues, have high priority, and some have SLAs with response times measured in minutes. As such, this integration is highly critical.
WebMethods is used as the middleware for this integration, and as part of sending a new SR from CRM to Maximo, the service address also needs to be cross-checked against ArcGIS for verification and standardization. As you can see, there are multiple points of failure in this integration.
This integration was built several years ago, and some alarms had been set up in CRM at a few points with a high risk of failure, such as when a Service Order is created but not picked up by WebMethods, or picked up but not sent to Maximo. Despite this, the interface had issues every few weeks and needed to be rebuilt. In addition to the existing alarms coming from CRM, several new alarm points were added in Maximo and WebMethods:
- When WM couldn't talk to CRM to retrieve new Service Orders
- When WM couldn't send status updates back to CRM
- When WM couldn't talk to Maximo
- When Maximo couldn't publish messages to WM
These alarms apply to individual messages coming in and out of Maximo and CRM, and any failure results in an email sent to the developer and the support team.
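To give an idea of what such an alarm email can contain (the interface name, a detailed error message and, as described below, the failing XML payload itself), here is a rough, generic Python sketch; it is not the actual WebMethods implementation, and the host and addresses are assumptions.

```python
# Generic sketch of an alarm email that carries the error detail and the failing
# XML payload as an attachment. Not the actual WebMethods implementation;
# the SMTP host and addresses are assumptions.
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

SMTP_HOST = "mailhost.example.com"  # assumed

def send_failure_alarm(interface, direction, error, payloadXml):
    msg = MIMEMultipart()
    msg["Subject"] = "[%s] %s message failed" % (interface, direction)
    msg["From"] = "integration@example.com"   # assumed
    msg["To"] = "support@example.com"         # assumed

    # The body says exactly what failed and why, so the support person can often
    # diagnose the issue without logging in to any environment.
    msg.attach(MIMEText("Interface: %s\nDirection: %s\nError: %s\n"
                        % (interface, direction, error), "plain"))

    # Attach the message that failed so it can be inspected and resubmitted.
    attachment = MIMEText(payloadXml, "xml")
    attachment.add_header("Content-Disposition", "attachment",
                          filename="failed_message.xml")
    msg.attach(attachment)

    server = smtplib.SMTP(SMTP_HOST)
    server.sendmail(msg["From"], [msg["To"]], msg.as_string())
    server.quit()
```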
In the first few days after this new interface was released to Production, the team received a few hundred alarms each day. My capacity to troubleshoot was about a dozen of those alarms a day, so instead of trying to solve them one by one, we tried to identify all the recurring patterns of issues and address them by modifying the interface design or business process, or by fixing bad data. A great deal of time was also spent on improving the alarms themselves: for each type of issue, a detailed error message and, in many cases, the content of the XML message itself is attached to the alarm email. A "fix patch" was released to Production about two weeks after the first release, and after that the integration produced only a few alarms per month. In most cases, the support person can immediately tell the cause of the problem just by looking at the email, before even logging in to the client's environment. After almost a year, every possible failure point we envisioned, no matter how unlikely, has failed and raised an alarm at least once, and the support team has always been on top of it. I'm glad we put all of that monitoring in place from the start; as a result, I haven't heard of any issue that was not fixed before the end users became aware of it.
Case #3: Interface with medium criticality/frequency
Of the two examples above, one is low frequency/low criticality; the other is high frequency and highly critical. Most interfaces are somewhere in the middle. Interfaces that are highly critical but don't run frequently or don't need a short response time can also be put into this category. In such cases, we might not need to send individual alarms in real time. Although I like to think of myself as pretty good at troubleshooting issues, I don't think I can handle more than a few per day. As such, my rule of thumb is that if I receive more than a few alarms per day, it is too much. As developers, if we don't think we can handle more than a few alarms a day, we shouldn't do that to the support team either (giving them alarms all day long). For the utility company mentioned above, when WebMethods was first deployed, the WM developer configured a twice-daily report that lists all failed transactions from the last 12 hours. Thus, for most interfaces, we don't need to set up any specific alarms. If there are a few failures, they show up in the report and are looked at by technical support at noon or at the end of the day. This appears to work really well, even for some very critical interfaces such as bank transfer orders or invoice payments.
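That report is produced in WebMethods, but the idea translates easily: a scheduled job summarises the failures from the last 12 hours, grouped by interface, and the list is emailed to support. Below is a rough sketch against an assumed failure-log table; all names are illustrative only.

```python
# Illustrative twice-daily failure summary: count failures per interface over the
# last 12 hours from an assumed failure-log object (ZZINTFAILURE with columns
# INTERFACE and CREATEDATE). All names are illustrative only.
from datetime import datetime, timedelta
from psdi.server import MXServer

def failure_summary(hours=12):
    mxServer = MXServer.getMXServer()
    userInfo = mxServer.getSystemUserInfo()
    failSet = mxServer.getMboSet("ZZINTFAILURE", userInfo)
    counts = {}
    try:
        cutoff = datetime.now() - timedelta(hours=hours)
        # Date-literal syntax is database-dependent; shown simplified here.
        failSet.setWhere("createdate >= '%s'" % cutoff.strftime("%Y-%m-%d %H:%M:%S"))
        failSet.reset()
        failure = failSet.moveFirst()
        while failure is not None:
            name = failure.getString("INTERFACE")
            counts[name] = counts.get(name, 0) + 1
            failure = failSet.moveNext()
    finally:
        failSet.close()
    lines = ["%s: %d failure(s)" % (k, v) for k, v in sorted(counts.items())]
    return "\n".join(lines) or "No failures in the last %d hours" % hours

# The resulting text can then be emailed on a schedule, as in the earlier watchdog sketch.
print failure_summary()
```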
Case #4: Recurring failures resulting in too many alarms
For the integrations mentioned in cases #1 and #2, the key to getting them to work satisfactorily was spending some time after the first release monitoring the interfaces and fine-tuning both the interface itself and the alarms. It is important to have alarms raised when failures occur, but it is also important to ensure there aren't too many of them. Not only will people ignore alarms if they receive too many, it also becomes hard to tell the critical issues apart from the noisy, less important ones. In my experience, dealing with those noisy alarms is usually quite easy. Most of the time, the alarms come from a few recurring failures that have been ignored. People looking at them for the first time can easily be overwhelmed by the sheer number of issues and feel reluctant to deal with them. In many cases, I simply work through each alarm/failure one by one and carefully document the error message or symptom, and the solution, for each problem in an Excel spreadsheet. Usually, after I have gone through a few issues, they all come back to a few recurring patterns that can easily be dealt with. Below is an example:
A water utility uses an external asset register system, and the asset data is synchronized to Maximo in near real time. The interface produced almost 1 GB of SystemOut.log each day, rendering the log effectively useless. I looked at the errors and documented them one by one. After about two hours, it was clear that 80% of the errors came from missing locations that had not been synchronized: when the integration tried to create a new asset under one of these locations, it wrote a bunch of errors to SystemOut.log. I did a quick scan, wrote down all of the missing locations, and quickly added them to Maximo using MXLoader. After that, the volume of errors was greatly reduced. By occasionally checking the log files over the following few days, I was able to list all 30+ missing locations and eliminate all of the related errors. The remaining errors found in the log files were easily handled separately; some of them were quite critical and were only brought to the business's attention at that point.
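The "quick scan" itself is easy to script. Below is a rough sketch of what I mean, assuming the error lines contain a recognisable phrase followed by the location code; the exact message text and log path here are made up.

```python
# Rough sketch of scanning SystemOut.log for "missing location" errors and
# counting the distinct locations involved. The log path and the error-message
# pattern are assumptions; adjust the regex to the actual text in the log.
import re
from collections import Counter

LOG_PATH = "/path/to/SystemOut.log"                              # assumed
PATTERN = re.compile(r"Location (\S+) is not a valid location")  # assumed message text

counts = Counter()
logFile = open(LOG_PATH)
for line in logFile:
    match = PATTERN.search(line)
    if match:
        counts[match.group(1)] += 1
logFile.close()

# Print the missing locations, most frequent first; the list can then be pasted
# into an MXLoader sheet to create the locations in Maximo.
for location, count in counts.most_common():
    print "%s\t%d" % (location, count)
```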