The major incident caused by the failure of the UK’s National Air Traffic Services (Nats) in August 2023 may be a very rare occurrence, but a final report into the system failure has recommended 34 changes.
The report, prepared for the UK Civil Aviation Authority (CAA) by the Independent Review Panel, looked at what could be done better to limit the effects of the failure that occurred because an incorrectly formatted flight plan was submitted to the system.
In the event of a failure of a primary system, the backup system is designed to seamlessly take over processing. The authors of the Nats major incident investigation final report noted that in this instance, the primary system had not failed, but had acted as programmed. It placed itself into maintenance mode to make sure irreconcilable – and therefore, potentially unsafe – information was not sent to an air traffic controller.
However, the backup system applied the same logic to the flight plan, with the same result. It subsequently raised its own critical exception, writing a log entry into the system log, and placed itself into maintenance mode.
The failure of Nats occurred because both the primary and backup Flight Plan Reception Suite Automated (FPRSA-R) subsystems were in maintenance mode to protect the safety of the air traffic control operations. This meant flight plans could no longer be automatically processed, and manual intervention was now required.
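The sequence described above, in which a backup running the same logic as the primary fails in the same way on the same input, can be illustrated with a minimal Python sketch. It is not the FPRSA-R implementation: the `Processor` class, the `reconcilable` flag and the `submit` function are hypothetical names chosen for illustration, and the validation check is a stand-in for whatever logic the real system applies.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    ACTIVE = "active"
    MAINTENANCE = "maintenance"


class CriticalException(Exception):
    """Raised when a flight plan cannot be reconciled safely."""


@dataclass
class Processor:
    name: str
    mode: Mode = Mode.ACTIVE
    log: list = field(default_factory=list)

    def process(self, plan: dict) -> str:
        # Hypothetical check standing in for the real validation logic.
        if not plan.get("reconcilable", True):
            raise CriticalException(
                f"{self.name}: cannot reconcile plan {plan['id']}"
            )
        return f"{self.name} processed {plan['id']}"

    def handle(self, plan: dict):
        try:
            return self.process(plan)
        except CriticalException as exc:
            # Fail safe: record the exception and stop processing
            # rather than pass on potentially unsafe data.
            self.log.append(str(exc))
            self.mode = Mode.MAINTENANCE
            return None


def submit(plan: dict, primary: Processor, backup: Processor, manual_queue: list) -> str:
    # Try the primary first, then the backup. Both run identical logic,
    # so an input that stops one will also stop the other.
    for unit in (primary, backup):
        if unit.mode is Mode.ACTIVE:
            result = unit.handle(plan)
            if result is not None:
                return result
    manual_queue.append(plan)
    return "manual intervention required"
```

In this sketch, a single unreconcilable plan puts both units into maintenance mode, and every later plan, valid or not, falls through to the manual queue. Redundancy of this kind guards against one unit breaking down, but not against an input that both units reject for the same reason.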
The report recommended that Nats should review the current command structure, and its supporting technology and processes. The review should analyse whether the current model is likely to lead to the best outcomes in the majority of incidents, or whether it can be optimised further with the addition of alternative options.
The report’s authors recommended that this review should include, as a minimum, options for alternative models and examples of other effective command structures, including the use of a single incident manager model. They noted that such options should include guidance about when each is most appropriate. They also suggested a review of training requirements to maximise operational oversight capabilities during incidents, and of the system and process requirements needed to support the selected structures, including decision-making, escalation and creation of a common operating picture.
When Nats went offline, a subset of unprocessed data remained in the system but was outside the established pause queue. This required further escalation to identify the root cause of the issue.
The report recommended that air traffic control documentation should be reviewed to ensure that the system complexity and behaviour can be better understood by engineers and users who are not dedicated to the system. There should also be a high-level joint Technical Services and Operations review of key critical systems. The report recommended that this review should confirm that the operational documentation for each system reviewed has sufficient description and clarity to allow the system to be operated safely and resiliently in unexpected circumstances.
While escalation procedures were followed, the authors of the report pointed out that earlier contact with the supplier would most likely have expedited the resolution of the event.
They recommended that Nats should update the escalation process to provide guidance on the time limit, or other key criteria, that should trigger a request for supplier support. “Nats should create a single controlled document detailing the supplier contracts and associated contacts, who provide 24-hour support,” the report stated. “These details should be accessible by anyone in Nats likely to be required to support an incident response. As a minimum, these should include Levels 1 through 3 of engineering support.”
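As a rough illustration of what time-based and criteria-based escalation guidance could look like when written down as a rule, here is a minimal Python sketch. The 60-minute limit, the `IncidentState` fields and the function name are all assumptions made for the example; the report does not specify thresholds or criteria.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical limit: the report does not state a value.
MAX_TIME_WITHOUT_ROOT_CAUSE = timedelta(minutes=60)


@dataclass
class IncidentState:
    elapsed: timedelta
    root_cause_identified: bool
    automatic_processing_available: bool


def supplier_escalation_reasons(
    state: IncidentState,
    limit: timedelta = MAX_TIME_WITHOUT_ROOT_CAUSE,
) -> list:
    """Return the reasons to contact the supplier; an empty list means not yet."""
    reasons = []
    if not state.root_cause_identified:
        if state.elapsed >= limit:
            reasons.append("time limit reached without a root cause")
        if not state.automatic_processing_available:
            reasons.append("automatic processing lost with no root cause")
    return reasons
```

Returning the list of reasons, rather than a bare yes or no, means the person making the call can see which criterion was met and record it in the incident log.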
In one of its minor recommendations, the report noted that the complexity of the system architecture, which is regularly changed and upgraded, makes it impossible to maintain an up-to-date overall system map of Nats. The report’s authors recommended assessing the feasibility of using new technology, or a model-based engineering process, to rapidly provide the required system schematic information to teams during the early stages of an incident.
They also said that the technical services director should review the current operational documentation in support of implementing new technology, or a model-based engineering process that supports rapid mapping. “This must ensure that there is sufficient and accurate detail for the various levels of engineering support to see the high-level, key interfacing systems and methods by which they connect,” they wrote.
The key aim of this review should be to assist in the identification of problems that might be upstream or downstream of the specific system where a fault first occurs.