Did ‘Terraform Destroy’ Cause the UniSuper Google Cloud Outage?


In early May, the internet was rocked by news that Google had supposedly deleted the cloud account of a pension fund managing roughly $125 billion. Users of the Australia-based UniSuper pension fund’s systems were suddenly unable to access their accounts for around a week. More than 600,000 pension fund members were affected.

Unsurprisingly, many assumed it was a cyberattack. Several high-profile breaches, such as the Maersk ransomware incident, have involved major data losses that resulted in operational disruptions. It eventually became clear, though, that the problem was not an attack but a previously undiscovered flaw, of the kind that threat actors could potentially exploit. It was a weakness Google was unaware of and did not expect to be possible.

Google fixed the problem around mid-May and posted an explanation of what happened. However, one take on the incident merits some scrutiny: the possible role of the IaC management tool Terraform. A New Zealand-based senior software developer shared theories based on his experiences with Google Cloud’s professional services team, pointing to the possible unintended effects of Terraform commands.

Google’s Explanation

In a blog post on May 25, Google detailed how the incident actually happened. The company clarified that the incident affected only one customer, UniSuper, in a single cloud region. Specifically, the problem was limited to one of the customer’s multiple Google Cloud VMware Engine (GCVE) private clouds. The event, Google said, did not impact any other Google Cloud services, customer accounts, projects, or data backups.

After its internal investigation, Google concluded that the incident happened because of a misconfiguration. The company traced the error to the initial deployment of a GCVE private cloud for the customer, which Google operators carried out using an internal tool. One input parameter was misconfigured, which had the unintended consequence of capping the customer’s GCVE private cloud to a fixed term.

Google maintains that their operators, the people responsible for managing and deploying Google Cloud services, acted in line with the company’s internal control protocols. The UniSuper incident was the first problem of its kind they encountered, suggesting that they did not expect that an input parameter left blank could result in the deletion of a private cloud. 

Google explained that the blank parameter prompted the system to assign a then-unknown default term. The investigation revealed that this default term was one year, which means that the GCVE private cloud was unwittingly set to terminate a year after it was created. No notifications were sent to the customer because the deletion was not brought about by a customer request. It was triggered as a consequence of a parameter left blank by Google operators.
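The failure mode described above can be illustrated with a minimal Python sketch. It is not Google's internal tool, whose code is not public. The names `provision_lenient`, `provision_strict`, `PrivateCloud`, and `DEFAULT_TERM_DAYS` are hypothetical, and the only detail taken from Google's account is the one-year default. The first function silently turns a blank term into an expiry date, while the second refuses to proceed until the caller states the intended lifetime.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumption for illustration: a blank term falls back to one year,
# mirroring the default described in Google's explanation.
DEFAULT_TERM_DAYS = 365


@dataclass
class PrivateCloud:
    """Hypothetical record of a provisioned private cloud."""
    name: str
    created: date
    expires: Optional[date]  # None means the cloud has no fixed term


def provision_lenient(name: str, created: date,
                      term_days: Optional[int] = None) -> PrivateCloud:
    """Risky pattern: a blank term silently becomes a fixed one-year term."""
    if term_days is None:
        term_days = DEFAULT_TERM_DAYS
    return PrivateCloud(name, created, created + timedelta(days=term_days))


def provision_strict(name: str, created: date,
                     term_days: Optional[int] = None,
                     no_expiry: bool = False) -> PrivateCloud:
    """Safer pattern: the caller must state the lifetime explicitly."""
    if no_expiry and term_days is not None:
        raise ValueError("pass either term_days or no_expiry=True, not both")
    if no_expiry:
        return PrivateCloud(name, created, None)
    if term_days is None:
        raise ValueError("term_days is blank; set a term or pass no_expiry=True")
    return PrivateCloud(name, created, created + timedelta(days=term_days))
```

Calling `provision_lenient("demo", date(2023, 1, 1))` returns a cloud that quietly expires a year later, whereas the same call to `provision_strict` raises an error at deployment time, when the mistake is still cheap to correct.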

The blog post by Google implicitly cleared UniSuper of any fault, saying that it was a Google Cloud issue through and through. A joint statement was released by UniSuper and Google, characterizing the incident as an isolated “one-of-a-kind occurrence” that was not supposed to have taken place. 

Was ‘Terraform Destroy’ Truly the Culprit?

As some researchers have suggested, the internal tool used by Google’s operators may have been Terraform. Commonly used for infrastructure-as-code (IaC) management, Terraform supports a command called ‘destroy,’ which is crucial for infrastructure management. By default, Terraform destroy removes every resource managed by the current configuration, although DevOps managers can narrow it to specific resources with the -target option.

Using this command requires caution, as it can result in the irreversible removal of an infrastructure component. An accidental execution of the command over unintended resources can easily lead to an outage.

As mentioned in Google’s blog post, the unintended deletion happened because of a blank parameter inadvertently introduced. In this sense, the deletion was akin to the detonation of a long-running time bomb set a year prior (the one-year system-assigned expiration of the private cloud). With these details from Google, it seems highly unlikely that the high-profile mishap was caused by an imprudent use of the Terraform destroy command after all.

If a destroy command had been involved, the situation would have warranted a very different type of explanation. Instead of the fault entirely falling on Google’s operators, the problem would have originated from UniSuper’s own cloud provisioning managers. In this scenario, UniSuper would have applied a Terraform configuration file containing an instruction to remove a private cloud via the destroy command, with Google operators immediately approving it. 

Cybersecurity Concerns

Despite the indications that it likely wasn’t careless use of the destroy command that caused the UniSuper outage, it is still worth discussing how important it is to be mindful of Terraform destroy. Threat actors can take advantage of it as they exploit bugs to delete resources and disrupt operations. 

There are three possible scenarios where the destroy command can be indirectly triggered, and all of them involve bugs. 

In the first scenario, failure to address bugs or issues in Terraform configuration files can wreak havoc during the Plan and Apply phases. These configuration file bugs may cause the unintended marking of resources for deletion. For example, poorly thought-out conditional statements or corrupted configuration files may inappropriately target certain resources for removal.

In the second scenario, organizations may be using external tools that interact with Terraform. These can include cloud provider APIs and provisioning scripts, which may have bugs that cause them to delete resources when they should not. There are cases where Terraform may call these scripts, usually during configuration changes. If the resulting changes are applied, the consequences can be serious.

Lastly, if organizations use third-party providers to interact with cloud services and platforms, there is always the possibility that a bug in, or misuse of, one of these providers causes Terraform to misinterpret the state of a resource during planning or during the apply phase, and to delete it as a result.

To prevent bugs and other security issues from causing the unintended destruction of resources, it is important to regularly test configurations before applying them. IaC code reviews should also become a routine activity. Moreover, it is important to vet the quality of the external tools being used and to keep them up to date with the latest bug fixes and security patches. Finally, the principle of least privilege should be enforced, and regular data backups should always be readily available to expedite restoration efforts.
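One way to test a configuration before applying it is to inspect the saved plan for deletions. The Python sketch below assumes the plan was exported with `terraform plan -out=tfplan` followed by `terraform show -json tfplan`, and reads the `resource_changes` entries of that JSON output. The function names, the sample plan, and the resource addresses are hypothetical and only serve as illustration.

```python
import json


def planned_deletions(plan_json: str) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    plan = json.loads(plan_json)
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        # A replacement appears as ["delete", "create"] or ["create", "delete"].
        if "delete" in change.get("change", {}).get("actions", [])
    ]


def check_plan(plan_json: str, protected_prefixes: tuple[str, ...]) -> None:
    """Raise if the plan deletes anything under a protected address prefix."""
    blocked = [
        address
        for address in planned_deletions(plan_json)
        if address.startswith(protected_prefixes)
    ]
    if blocked:
        raise RuntimeError(
            "plan would delete protected resources: " + ", ".join(blocked)
        )


# Hypothetical plan excerpt; real plans contain many more fields.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "example_network.main",
         "change": {"actions": ["no-op"]}},
        {"address": "example_private_cloud.prod",
         "change": {"actions": ["delete"]}},
        {"address": "example_bucket.logs",
         "change": {"actions": ["delete", "create"]}},
    ]
})
```

Run as a step in a CI pipeline, a call such as `check_plan(SAMPLE_PLAN, ("example_private_cloud.",))` would stop the apply until a reviewer confirms that the deletion is intended. A check like this complements code review and backups rather than replacing them.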

In Summary

To recap, the Terraform destroy command ultimately didn’t cause the UniSuper Google Cloud outage. The incident happened because of a blank parameter that was left unnoticed and unaddressed. Google’s team did not anticipate that the tool they were using would autonomously assign values that could lead them to trouble one year later.

There is still much to discover, learn, and understand about modern IT, particularly when it comes to cloud configuration and management. For security teams, collaborating with DevOps teams while armed with a thorough understanding of Terraform commands is important for maximizing workflow efficiency, uptime, and security.

 
