Telstra’s Exchange blog is always a busy place, with plenty of interesting articles and viewpoints, but of late, there have been the inevitable apologies and explanations following the now infamous Telstra mobile network outages.
Telstra Chief Operations Officer Kate McKenzie addressed the outages at the CommsDay Summit in Sydney and has reprinted her entire speech on the blog. It is definitely worth reading in full for those in the telecommunications game.
Some of the most important points are as follows:
“At Telstra, we take our responsibility to connect people very seriously. Our mobile network supports more than 16.9 million customers and carries around 70 million voice calls every single day – more than all other Australian mobile networks combined. We know that our customers rely on us and that’s why we were so disappointed by what has happened.
“Many of you are our Telstra Wholesale customers, and we are committed to keeping you informed about what happened, why it happened, and what we’re doing to fix it. As Chief Operations Officer, it’s my personal commitment to ensure that is done to your satisfaction.
“Our initial review has confirmed the incidents were not related, although two of the disruptions were due to delays in processing the registration of mobile devices.
“Telstra’s network is highly duplicated and designed for the reliability you have come to expect. Whilst at no time did it suffer a system-wide failure, each of these events did impact varying numbers of our customers and we are working to ensure this does not happen again. It is important to note that our network is now stable and operating as it should.
“On the morning of the 9th technical staff were investigating a fault with one of the signalling nodes used to manage 3G and 4G wireless data sessions and voice calls in the mobile network. With evidence of increasing degradation on the health of the node and potential service risk, the issue was escalated and a decision was taken to isolate the node from the network – a standard operational procedure in such an event.
“That happened at around 12:30pm, but unfortunately due to processes not being followed properly the subsequent node restart initiated incorrectly. This meant that around 15% of all mobile devices connected through this node needed to re-register when establishing a new voice call or data session.
“The mass re-registration of these mobile devices then overloaded the other mobile signalling nodes, impacting approximately 15% of our customers directly and some more at times during the event where they were unable to establish new voice calls or data sessions.
“As soon as we identified what had occurred we worked to address the fault and take action to bring customers back online as quickly as possible. In doing so we prioritised voice services over data services. Most impacted devices were able to establish new data sessions by 1pm with some residual customers still impacted. The network was stabilised and all services restored at around 2:30pm.”
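For readers who want a feel for why a mass re-registration can swamp the remaining signalling nodes, here is a minimal sketch. It is a toy queueing model, not Telstra's system: the function name and every figure in it (burst size, routine load, node capacity) are invented for illustration.

```python
def backlog_over_time(burst, normal_rate, capacity, ticks):
    """Return the registration backlog at the end of each tick.

    burst       -- devices forced to re-register at the start (hypothetical)
    normal_rate -- routine registrations arriving every tick (hypothetical)
    capacity    -- registrations the remaining nodes can process per tick
    """
    backlog = burst
    history = []
    for _ in range(ticks):
        # New arrivals join the queue; the nodes clear what they can.
        backlog = max(0, backlog + normal_rate - capacity)
        history.append(backlog)
    return history


# Assumed figures: the nodes have only 20 registrations per tick of headroom.
history = backlog_over_time(burst=1_000, normal_rate=80, capacity=100, ticks=60)
ticks_to_clear = history.index(0) + 1
print(f"backlog after first tick: {history[0]}, cleared after {ticks_to_clear} ticks")
```

The point of the sketch is that recovery time depends on spare capacity rather than total capacity: nodes that comfortably handle routine load can still take a long time to absorb a one-off burst.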
“I’ll now turn to the service interruption that occurred on 17 March.
“Just before 6pm some customers nationally were sporadically unable to make 2G, 3G and 4G voice calls or establish a mobile data session.
“Calls between Telstra mobiles were failing intermittently with voice call volumes dropping by approximately 50%. Calls to fixed line telephony services were not impacted if the mobile was connected to the network. SMS services were largely unaffected, although some delivery delays occurred. Data services were affected where customers may have been unable to establish a data session, or existing data sessions may have been disconnected.
“Service restoration commenced from 7pm through limiting the volume of 4G signalling for devices reconnecting with the network, and configuration changes were made in the mobile network to speed up recovery. These changes reinstated network stability.
“Although the user experience was similar, the issue is different to the disruption that occurred in early February.
“The problem was caused when a significant number of customers – initially international roaming customers, and then domestic customers as well – were unexpectedly disconnected from the network. When they all attempted to reconnect at the same time, which happens automatically, we saw a period of overload in the database used to register devices.
“The ability of mobile networks to deal with these mass re-registration events is not unique. Our industry experts have already told us that this is a global challenge faced by many in the industry.
“Finally, in relation to the service interruption of 22 March.
“Some Telstra mobile, IP Telephony (TIPT) and NBN voice customers may have been unable to make or receive calls intermittently between 11:30am and 12:50pm, primarily in Victoria and Tasmania. This incident affected around 3% of our customers and services were restored by around 5:30pm.
“What are we doing to stop this happening again?”
McKenzie’s speech continues below:
“We are committed to getting to the bottom of these incidents and are taking all necessary steps to minimise the risk of it happening again. Our customers expect nothing less.
“We are well into a thorough review of the network. I am leading this review and it involves our own specialist teams as well as external experts from around the world. We have already progressed short- and medium-term actions to improve resilience and robustness in the mobile network.
“Changes have been implemented to increase the capacity and path diversity of critical signalling channels and a temporary layer of traffic management protection has been added to minimise the impact of events like the ones we saw on 9 February and 17 March. Within a few days we expect to augment capacity in a key platform (Home Location Register – Front End aka HLR-FE) that manages our customers’ subscription data.
“In conjunction with our global partners Ericsson, Cisco and Juniper we have assembled a team of internal and external engineering experts to do an end-to-end review of our network. While this work is underway, Telstra Operations has a heightened awareness plan including Executive-level review of any changes planned for the mobile and core IP networks.
“Our network is resilient and we are determined to get the best advice from around the world to help ensure that it stays that way. Our focus is on ensuring our network is the best available and rebuilding our customers’ trust by meeting their expectations every day.”
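The speech mentions limiting the volume of signalling from reconnecting devices and adding a layer of traffic management protection, without saying how either is implemented. As a rough sketch of the general idea only, the snippet below admits a fixed number of reconnection attempts per time step and holds the rest in a queue. The `throttle` function, the limit and the arrival figures are all assumptions for illustration and do not describe Telstra's actual mechanism.

```python
def throttle(arrivals_per_tick, limit):
    """Admit at most `limit` reconnection attempts per tick; queue the rest.

    Returns the number admitted in each tick and how many are still waiting.
    """
    waiting = 0
    admitted_log = []
    for arrivals in arrivals_per_tick:
        waiting += arrivals
        admitted = min(waiting, limit)  # never pass more than the limit downstream
        waiting -= admitted
        admitted_log.append(admitted)
    return admitted_log, waiting


# Assumed scenario: 500 devices try to reconnect at once, then no new arrivals.
admitted, still_waiting = throttle([500, 0, 0, 0, 0, 0], limit=100)
print(admitted, still_waiting)
```

The trade-off is that some devices wait longer to reconnect, but the registration database behind the throttle sees a steady, bounded load instead of a spike.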