Optus customers woke up this morning to find they were unable to get their social media fix, and they weren’t happy. Around 4am AEDT, customers started to report an inability to access both mobile and home internet services.
Optus advised it was investigating the issue, with reports emerging around midday of some services coming back online.
Around 12.30pm, Optus chief executive Kelly Bayer Rosmarin told radio 2GB the path to restoration had been found, nearly nine hours after the blackout began.
The outage, one of the largest in Australia’s history, set alarm bells ringing across the country. Because a number of smaller mobile providers resell the Optus network, including Aussie Broadband, Amaysim, CatchConnect, Coles Mobile, Dodo, Moose Mobile and more, the impact was felt far and wide.
As the morning progressed, the impact grew. Health and emergency services were unable to communicate, trains in Melbourne were brought to a halt and small businesses across the nation were unable to use Optus EFTPOS.
Fortunately, Optus users could still use roaming to call 000 if they were within the coverage of other telecommunication service providers.
What is a ‘deep network’ problem?
Earlier today Minister for Communications Michelle Rowland described the incident as a “deep network” problem.
Telecommunications networks include three components: the core, transit and access networks. You can think of the core network as the systems that allow customers’ devices to connect to and access phone and internet services.
The transit network connects the core to the access networks using optical fibre cables. The access networks include the local infrastructure found in suburbs – including the mobile phone towers.
Core network outages can occur when equipment or cables fail, when there is a software fault, or when a cyberattack occurs.
The most common cause of a software fault is a patch or update that has an unintended outcome, such as causing one or more of the core network systems to fail.
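To see why a core fault is so much more disruptive than a fault further out, the three-layer description above can be sketched as a toy model. This is an illustration only, not a representation of how Optus's network is built: the `Network` class, the link names and the tower names are all hypothetical, and real networks have redundancy that this sketch ignores.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Network:
    """Hypothetical, heavily simplified three-layer network."""

    core_up: bool = True
    # Transit layer: link name -> whether the fibre link is healthy.
    transit_up: dict[str, bool] = field(default_factory=dict)
    # Access layer: tower name -> the transit link that serves it.
    access_sites: dict[str, str] = field(default_factory=dict)

    def site_in_service(self, site: str) -> bool:
        # A tower can only serve customers if its transit link
        # and the shared core are both working.
        link = self.access_sites[site]
        return self.core_up and self.transit_up[link]

    def sites_in_service(self) -> list[str]:
        return sorted(s for s in self.access_sites if self.site_in_service(s))


if __name__ == "__main__":
    net = Network(
        transit_up={"fibre-east": True, "fibre-west": True},
        access_sites={
            "tower-a": "fibre-east",
            "tower-b": "fibre-east",
            "tower-c": "fibre-west",
        },
    )

    # A transit fault only affects the towers behind that link.
    net.transit_up["fibre-west"] = False
    print(net.sites_in_service())  # ['tower-a', 'tower-b']

    # A core fault affects every tower, even though each one is still powered on.
    net.transit_up["fibre-west"] = True
    net.core_up = False
    print(net.sites_in_service())  # []
```

In this sketch a transit fault is local, while a core fault takes every tower out of service at once, which matches the pattern of a sudden, nationwide outage with the towers themselves still working.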
What could have caused this?
Although Optus hasn’t given any indication as to the exact cause of the outage, Bayer Rosmarin said it was unlikely a cyberattack was the cause:
There is no indication that it is anything to do with spyware at this stage.
At the same time, experts have noted that mobile towers are working, and there seems to be no damage to the underlying fibre optic network. This means we can probably rule out an issue in the transit or access networks.
The scale and speed with which the impact hit (and the somewhat specific timing) indicates the culprit was likely a problem in the core network.
It’s very possible a software or system update was responsible. Such updates or changes often happen out of business hours to have minimal impact. They typically involve a short period of downtime – a “scheduled outage” – which goes unnoticed by customers.
It could be, as some reports have speculated, that the Optus outage was an unintended consequence of a planned system change, such as a scheduled update. When these processes go wrong, they can go spectacularly wrong!
As for how such a fault may happen, it is likely due to human error (especially since 4am is a time you might expect engineers to be carrying out patch work). However, it could also be a result of other factors, such as a hardware fault that then causes a software failure.
Another possibility is a fault in an accounting or user management system, such as no longer being able to attribute costs or verify users’ identities properly. Issues in back-end billing and management systems can generate a cascade of failures throughout the rest of a network. In such cases, a simple bug in the system can impact everyone connected to the network.
How will this be fixed?
Optus engineers will be actively investigating the cause of the outage. You might be imagining someone scurrying around with wires in their hands trying to find the one that isn’t plugged in – but in reality this will be a lengthy process that involves examining various systems and software configurations to find the culprit.
For Optus, the hard work will continue after the fix is in place to ensure it doesn’t happen again. And perhaps an even more difficult challenge will be convincing the public this was an isolated incident – one that has once again highlighted how vulnerable our massively connected systems are to (even single) points of failure.
Speaking on 3AW Afternoons, Bayer Rosmarin said:
We are looking at what we can do to say thank you to our customers for their patience.
Optus is likely to pay compensation to customers. For residential customers this may be in the form of a reduced bill.
For business customers, the compensation would be linked with their service-level agreements. In other words, the specific penalties for Optus will be based on individual agreements it has made with various parties using or sharing its services.
Beyond this, it’s highly likely today’s events have dealt a massive blow to Optus’s reputation – especially when considered alongside last year’s Optus data breach.