Network Services Manager Develops Superhero-like Abilities
Mark Newman boasts greater power and speed and can quickly and accurately identify issues others can’t
A few months ago, Mark Newman, Network Services Manager with Open Solutions Canada, noticed some strange things happening in the office. While he continued to oversee the handling of more than 75 million transactions per year via various real-time EFT channels on a 24x7 basis, there was a remarkable difference in how this occurred. He was no longer buried under lots of data or having to pull others off projects to provide him with detailed information. He could now see things others couldn’t and managed to fix problems before anyone was even aware there was a dilemma. It was as if he had super speed and enhanced senses and could do the work of many all by himself.
Some colleagues wondered if he’d been subject to the same fate as a young Peter Parker, who was bitten by a radioactive spider during a science demonstration, thereby gaining super abilities and becoming best known as Spider-Man.
Coincidentally, a few weeks earlier he had installed the beta version of INETCO Insight™ 4.5.
So, is it a spider bite or INETCO Insight? For the inside story, read the following Q&A with Mark Newman.
Historically, how did you pull together all your transaction information in a single place?
The short answer is that we didn’t. We couldn’t. Each interface is unique, and logs data in its own files and format, on its own server. Several interfaces might share one gateway and server, thus providing common time-stamps, but between gateways and servers the time-stamps could drift, making correlation of events more complex.
If for some reason a compiled event log was required, that would be a manual process undertaken by a Programmer.
What were the main challenges in this approach?
This solution was slow, complex, and time-consuming. Due to the complexity of the logs, a Programmer was required to perform the investigation. Their initial step would be to determine the scope of the outage:
- what service or services were impacted
- what client or clients were impacted
- when did the outage start
- when did the outage end
- are transactions now being successfully processed
From there, they could investigate and determine the root-cause and long-term resolution.
Unfortunately, with our old tool suite we couldn’t even provide the Programmers this basic information with any accuracy; we’d have to consult our EFT partners and have them advise us as to the actual time frames and status. The Programmer would then know where to look in the logs to determine what happened.
A single EFT interface outage would take a Programmer several hours to investigate; if the outage involved multiple EFT interfaces, the time frame would increase accordingly. If the outage was the result of an on-going issue, the service could remain down for the duration of the investigation. We don’t have a group of staff members standing by to perform investigations – these are people who have to be taken off other work to perform the analysis.
This solution worked in an after-the-fact manner, but was slow and cumbersome.
What are you now able to do?
We now monitor all of our EFT channels with one tool, regardless of banking system, database, or EFT service. We have one common look-and-feel for all services, significantly reducing the learning curve and increasing the information available to front-line staff. All log files are saved in one location, with a common timestamp.
By grouping the information by both EFT service and also by Institution, we get an almost instantaneous determination as to the scope of an outage. Is it:
- one EFT service for one Institution
- all EFT services for one Institution
- one EFT service for all Institutions, or
- all EFT services for all Institutions
This immediately provides a starting point to determine root cause and restore service.
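The four-way scope check above can be sketched in a few lines of Python. This is only an illustration of the grouping idea, not how INETCO Insight does it internally: the function name `classify_scope`, the service names and the institution names are all hypothetical, and the fifth "partial" result is an added assumption for cases the list does not cover.

```python
from typing import Iterable, Set, Tuple


def classify_scope(
    failing: Iterable[Tuple[str, str]],
    all_services: Set[str],
    all_institutions: Set[str],
) -> str:
    """Classify an outage from the (service, institution) pairs that are failing."""
    failing_pairs = set(failing)
    if not failing_pairs:
        return "no outage"

    services = {service for service, _ in failing_pairs}
    institutions = {institution for _, institution in failing_pairs}
    every_pair = {(s, i) for s in all_services for i in all_institutions}

    # Check the broadest scope first, then narrow down.
    if failing_pairs == every_pair:
        return "all EFT services for all Institutions"
    if len(institutions) == 1 and services == all_services:
        return "all EFT services for one Institution"
    if len(services) == 1 and institutions == all_institutions:
        return "one EFT service for all Institutions"
    if len(services) == 1 and len(institutions) == 1:
        return "one EFT service for one Institution"
    # Assumption: anything else is reported as a mixed case for manual review.
    return "partial outage across several services or Institutions"
```

Feeding it the pairs that are currently failing, for example `classify_scope([("POS", "CU-X")], {"POS", "ATM"}, {"CU-X", "CU-Y"})`, yields the narrowest label, which is the starting point for root-cause analysis.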
We can also baseline our performance using a common tool for all services. How quickly are the servers responding to EFT requests during the day, during the overnight batch processing cycles, during month-end, during the Christmas busy season?
We’ve started to model our usage patterns, and from there can determine the impact of a scheduled outage. When is the quietest time of the week? How many transactions do we process over the lunch period vs. before or after lunch? Following an outage, how many Store-and-Forward records can the system process per minute?
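As a rough illustration of the "quietest time of the week" question, the sketch below counts transactions per (weekday, hour) slot and picks the slot with the fewest. It assumes the transaction timestamps have already been exported as Python `datetime` objects; the function names are hypothetical and are not part of Insight.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple

Slot = Tuple[int, int]  # (weekday, hour), Monday is weekday 0


def weekly_profile(timestamps: Iterable[datetime]) -> Counter:
    """Count transactions in each (weekday, hour) slot."""
    return Counter((ts.weekday(), ts.hour) for ts in timestamps)


def quietest_slot(profile: Counter) -> Slot:
    """Return the slot with the fewest transactions, including empty slots."""
    all_slots = [(day, hour) for day in range(7) for hour in range(24)]
    # Counter returns 0 for slots that never appeared in the data.
    return min(all_slots, key=lambda slot: profile[slot])
```

With several weeks of data in the profile, the result suggests when a scheduled outage would affect the fewest transactions.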
We’ve also started to monitor our traffic patterns to look for anomalies. Are we seeing SAF files at any time other than following an outage? Why? Are we dropping “keep-alive” messages? Why?
We’re also able to quickly generate reports and real-time alerts on security events that have occurred. If we get a request to monitor all Host Response Codes of xyz we can create an Alert condition and have it operational in less than 5 minutes, with e-mail being sent to whomever when the event occurs, or the Event being sent to a Syslog server for further manipulation.
So, tell us, have you ever been bitten by a radioactive spider?
No; a mosquito or two but never a spider!
Well, then, it must be because of the enhancements in INETCO Insight 4.5. How have the improvements impacted you?
The biggest change for me in Insight 4.5 is to the Alerting module. With the addition of the “Entities” feature, tied into the existing Threshold feature, this has become an amazingly powerful tool. I’ve created 56 Alerts, but by using the Entities feature I can monitor more than 200 unique conditions.
For example, one rule can be configured that:
- sets an Alert when the host response time for more than 4 consecutive Point of Sale transactions for credit union “X” exceeds 1 second
- and it clears the Alert when that condition is no longer true
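To make the set-and-clear logic of this first rule concrete, here is a minimal Python sketch. It is not Insight's rule syntax, which the article does not show; the class name and parameters are hypothetical, and the thresholds simply mirror the example (more than 4 consecutive transactions, each slower than 1 second).

```python
class ConsecutiveSlowAlert:
    """Alert while more than `max_consecutive` responses in a row are slow."""

    def __init__(self, limit_seconds: float = 1.0, max_consecutive: int = 4):
        self.limit_seconds = limit_seconds
        self.max_consecutive = max_consecutive
        self.streak = 0      # current run of slow responses
        self.active = False  # True while the Alert is set

    def observe(self, response_seconds: float) -> bool:
        """Record one host response time and return the Alert state."""
        if response_seconds > self.limit_seconds:
            self.streak += 1
        else:
            self.streak = 0  # one fast response breaks the run
        # Set on the fifth slow response in a row; clear as soon as the run ends.
        self.active = self.streak > self.max_consecutive
        return self.active
```

One instance would be kept per institution and service, which is roughly what the "Entities" feature appears to save having to configure by hand.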
A second rule that I’ve introduced:
- sets an Alert when the host returns any message other than “Approved” for more than 75% of internet banking transactions processed over a 5-minute period for credit union “X”
- this same rule also e-mails the Operations staff when this condition occurs
- this same rule also clears the Alert when the condition is no longer true
- this same rule also e-mails the Operations staff when this condition is cleared
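The second rule combines a sliding time window with a notification on both transitions. A minimal Python sketch of that behaviour follows. The class and its parameters are hypothetical, the `notify` callback stands in for the e-mail to Operations staff, and timestamps are assumed to be seconds supplied in increasing order.

```python
from collections import deque
from typing import Callable, Deque, Tuple


class FailureRateAlert:
    """Alert while more than `threshold` of recent responses are not "Approved"."""

    def __init__(
        self,
        notify: Callable[[str], None],
        threshold: float = 0.75,
        window_seconds: float = 300.0,
    ):
        self.notify = notify
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events: Deque[Tuple[float, bool]] = deque()  # (timestamp, approved)
        self.active = False

    def observe(self, timestamp: float, response: str) -> bool:
        """Record one host response and return the Alert state."""
        self.events.append((timestamp, response == "Approved"))
        # Drop events that have aged out of the 5-minute window.
        cutoff = timestamp - self.window_seconds
        while self.events[0][0] <= cutoff:
            self.events.popleft()
        failed = sum(1 for _, approved in self.events if not approved)
        now_active = failed / len(self.events) > self.threshold
        # Notify only when the state changes, once on set and once on clear.
        if now_active != self.active:
            self.notify("set" if now_active else "cleared")
            self.active = now_active
        return self.active
```

As written, a single declined transaction in an otherwise empty window counts as a 100% failure rate, so a production rule would probably also want a minimum sample size before alerting.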
A last example:
- sets an Alert when there have been no messages received from one of our ATM switch vendors for credit union “X” for more than 30 minutes
- Again, this same rule also clears the Alert when messages are being received
- This rule also is active only between 08:00 in the morning and 08:00 in the evening for the timezone in which the credit union is located
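The third rule adds a time-of-day condition evaluated in the credit union's own timezone. The sketch below shows one way to express that check in Python. The function name and parameters are hypothetical, and both timestamps are assumed to be timezone-aware UTC values.

```python
from datetime import datetime, time, timedelta, tzinfo


def silence_alert_active(
    last_message_utc: datetime,
    now_utc: datetime,
    local_tz: tzinfo,
    max_silence: timedelta = timedelta(minutes=30),
    start: time = time(8, 0),
    end: time = time(20, 0),
) -> bool:
    """Return True if the switch has been silent too long during active hours."""
    # Convert to the institution's local time before checking the active window.
    local_now = now_utc.astimezone(local_tz).time()
    if not (start <= local_now < end):
        return False  # outside 08:00 to 20:00 local time the rule is inactive
    return now_utc - last_message_utc > max_silence
```

Any `tzinfo` works for `local_tz`, including `zoneinfo.ZoneInfo("America/Vancouver")` on Python 3.9 or later. Note that this sketch measures silence from the last message regardless of when it arrived, so a quiet night would trigger the Alert at 08:00 local time.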
Other items that I’m alerting on include:
- Slow host response time by EFT service and Institution as % of all transactions
- Slow host response time by EFT service and Institution as consecutive txn’s
- Failed transactions by EFT service and Institution as % of all transactions
- Failed transactions by EFT service and Institution as consecutive txn’s
- Lack of transaction by EFT service and Institution during the day
- Lack of transaction by EFT service and Institution during the night (different thresholds as we expect fewer transactions)
- When we’re not receiving Keep-Alive messages
- When we’re receiving SAF files
INETCO has also introduced an “Alerts Status” dashboard to help keep track of Alerts that have occurred. This provides a colour-keyed one-stop view of all Alerts, whether or not the Alert condition has been triggered, and if so, when it was last triggered.
From the “Alerts Status” screen you can flip back to the scrolling Alerts log, and you can now drill down and see the underlying events that triggered the Alert.
A second area of enhancement that I’ve come to appreciate more during my Beta testing is the History area. This now displays a full week’s worth of history, making it much more useful in terms of looking back to see what happened over the past several days. It also displays the transactions that occurred during that time so if you see an anomaly you can drill down from that screen to see what transactions occurred.
A third enhancement - now you can make changes without stopping and starting the Processor – just save and apply. Really handy if you keep changing your mind as to how you want your Alerts to work!
What's the business value of Insight to Open Solutions Canada?
In my department we’ve had an on-going project to reduce and consolidate our tools suite as much as practical. Our philosophy has been that, first and foremost, our tools have to provide the department with value; after that they need to be reliable, easy to access, straight-forward to use, and where possible, allow some degree of integration into other tools that are already in use. If these requirements are met, we can train staff more quickly, provide more powerful tools, and reduce the impact of reduced staffing.
Insight has proven to be of value to us. Even yesterday, as I was preparing for this, Insight alarmed over a hung application and gave us an immediate view into the scope and impact of the event, and then into its recovery.
The product has been reliable. Even going through these beta releases with INETCO, the tool has been stable and reporting correctly; Alerts have been sent when expected, and cleared when expected. Being a passive out-of-band tool has allowed us to work on it during prime time with no fear of impacting our production environment, and has reduced the demand for exhaustive QA and SysOp involvement.
The tool has been easy to access and straight-forward to use. Being a browser-based solution, all of our workstations already have the software needed to access Insight (subject to appropriate Firewall and user ID credentials). We can also access the tool remotely across our private network when providing support after-hours.
Integration into our other tools is somewhat of a work in progress. We’ve already fully integrated it into our Syslog server, but that’s easy to do. We’re working to integrate it with WhatsUp Gold, but in the meantime the “Alerts Status” page meets many of our needs. And Insight is pretty user-friendly in its native state…
With this release of Insight we have the information, power and granularity to become proactive: understanding our environment far better than we did in the past, meeting our clients’ needs faster than we did before, identifying issues more quickly and accurately than previously, and doing it all with the same or fewer staff members.