Telstra explains why Triple Zero transfers failed


Telstra has partly attributed its March 1 Triple Zero outage to software that unexpectedly failed while medical support devices were logging into its network.



During the 90-minute outage, calls had to be manually transferred to emergency services, with 148 transfers failing and one Victorian man dying of a cardiac arrest.

In a post-incident report, CEO Vicki Brady also revealed why Telstra’s backup processes failed during the outage: the carrier had stored the wrong alternative number for eight emergency services.

Brady explained that the alternative numbers are stored “in a secondary database” and used for manual call transfers.

The incorrect numbers “prevented our team from making the manual transfer of the call to the respective emergency services operator.”
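One way stale fallback numbers can be caught before they are needed is a routine comparison of the secondary store against an authoritative contact list. The report does not say how Telstra maintains either list, so the sketch below is illustrative only: the function name, the service names and the phone numbers are all hypothetical.

```python
def find_mismatched_contacts(authoritative, fallback):
    """Return services whose fallback number is missing or differs
    from the authoritative record.

    Both arguments are plain dicts mapping a service name to a phone
    number. This layout is an assumption made for the example.
    """
    mismatches = {}
    for service, expected in authoritative.items():
        stored = fallback.get(service)  # None if the entry is missing
        if stored != expected:
            mismatches[service] = (expected, stored)
    return mismatches


# Hypothetical data: one fallback entry has drifted out of date.
authoritative = {"Service A": "0355550101", "Service B": "0355550102"}
fallback = {"Service A": "0355550101", "Service B": "0355550199"}

mismatches = find_mismatched_contacts(authoritative, fallback)
for service, (expected, stored) in mismatches.items():
    print(f"{service}: expected {expected}, fallback store has {stored}")
```

A check like this only helps if the authoritative list is itself kept current, which is why test calls to the stored numbers are often used alongside it.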

The technical cause was an unexpected database outage that triggered an existing but previously unknown software fault.

This emerged at 3.30am, when there was “a high volume of registration requests” from medical alert devices.

This traffic alone wasn’t enough to cause a problem, Brady explained, but it coincided with “other system activity that resulted in connections to the database reaching the maximum limit.”

When that happened, it “triggered an existing but previously undetected software fault” that prevented the calling line identification (CLI) system from recovering.
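Telstra has not published details of the fault, so the following is not a reconstruction of it. It is a minimal sketch of the general pattern at issue: a service with a hard cap on database connections should reject excess requests while the cap is reached, then resume normally once connections are released. The class and exception names are hypothetical, and a plain object stands in for a real database connection.

```python
import threading
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection slot frees up within the timeout."""


class BoundedPool:
    """Caps the number of concurrent connections (illustrative only)."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)

    @contextmanager
    def connection(self, timeout=0.1):
        # Fail fast instead of waiting forever when the limit is reached.
        if not self._slots.acquire(timeout=timeout):
            raise PoolExhausted("connection limit reached")
        try:
            yield object()  # stand-in for a real database connection
        finally:
            # Always hand the slot back, even if the caller raised,
            # so the pool can recover after a burst.
            self._slots.release()


pool = BoundedPool(max_connections=2)
rejected = False
recovered = False

# Hold both slots, then try for a third: the request is rejected.
with pool.connection(), pool.connection():
    try:
        with pool.connection(timeout=0.01):
            pass
    except PoolExhausted:
        rejected = True

# Once the burst is over, new requests succeed again.
with pool.connection():
    recovered = True

print(rejected, recovered)
```

The `finally` clause is the part that matters for recovery: if a slot is not returned on every code path, the pool can stay at its limit after the traffic that filled it has gone.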

Finally, there was a communication breakdown when Telstra decided to contact emergency services via email for the failed call transfers.

The Telstra team was given an updated email address for Triple Zero Victoria but entered it incorrectly into its system, and the error took 13 minutes to correct.

Brady was apologetic about the contact number and email errors: “Ensuring we have the right contact numbers for emergency services operators is basic and something we should have gotten right,” she said.

“Relying on email as a fallback in this situation is far from ideal, and introduced a delay that is entirely unacceptable. The team introduced it as a last resort when our manual transfer backup failed.”

The wash up

Since its prior update, Brady said, Telstra has identified and reproduced the issue that caused CLI to fail; it is now testing a fix for the software fault.

It expects this to be deployed by April.

It has also worked with organisations that manage medical alert devices, so that a registration request is only sent when a device needs to make an emergency call.
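The effect of that change on registration traffic can be shown with a small model. This is a sketch under stated assumptions, not a description of how any real device or network behaves: the `AlertDevice` class, its methods and the fleet size are hypothetical.

```python
class AlertDevice:
    """Toy model of a device that registers only when it must call."""

    def __init__(self):
        self.registered = False
        self.registration_requests = 0

    def _register(self):
        # In this model, one registration request goes to the network.
        self.registration_requests += 1
        self.registered = True

    def place_emergency_call(self):
        # Register on demand, immediately before the call is placed.
        if not self.registered:
            self._register()
        return "call placed"


# A hypothetical fleet of 1,000 idle devices sends no registrations.
fleet = [AlertDevice() for _ in range(1000)]
idle_requests = sum(d.registration_requests for d in fleet)

# Only the device that needs to make a call registers.
result = fleet[0].place_emergency_call()
total_requests = sum(d.registration_requests for d in fleet)

print(idle_requests, total_requests)
```

In this model the load on the network scales with the number of emergency calls rather than with the number of devices deployed.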

The carrier has also reviewed both its “end to end approach for Triple Zero” and its “monitoring and alarming” for the service, “to ensure we can identify and respond to any issues as quickly as possible.”


