Insufficient disk space on database servers was the cause of Toyota car production grinding to a halt for 36 hours at 14 plants in Japan late in August.
But just how does such a situation arise? What’s pretty likely here is a failure in database capacity planning and to provide for sufficient storage.
That’s an activity that might often go under the radar of the C-suite, but if things go wrong, the result will be keenly felt. Any estimates of what lost production at 14 Toyota plants for a day and half is worth must be pretty hefty.
So, what happened? Toyota put out a press release that stated a cyber attack was not to blame. Instead, it said: “The system malfunction was caused by the unavailability of some multiple servers that process parts orders,” which is a critical part of the lean/just-in-time production system for which Toyota is famous. “[R]egular maintenance work was performed on August 27, the day before the malfunction occurred. During the maintenance procedure, data that had accumulated in the database was deleted and organised, and an error occurred due to insufficient disk space, causing the system to stop.”
The press release went on to say: “Since these servers were running on the same system, a similar failure occurred in the backup function.”
Key here is what looks like a failure in database capacity planning. That would be the likely issue if data in a database was “deleted and organised” (presumably meaning “reorganised”) because deletion of data does not result in a smaller database file and may even see it grow.
In addition, the basic risk mitigation measure of separating production and backup storage also seems to have been overlooked, but we’ll leave that aside for the time being.
Database sizing – prediction of database storage needs – and capacity planning, which is the practice of ongoing provision of the required storage capacity, are core skills for database administrators.
Capacity planning is more complex than the layperson might imagine because the size of database files does not just grow and shrink according to need without further impacts.
To start with, a (relational) database is not just a database. “A database” does consist of a database file in which allocated rows and columns and multiple and connected tables are defined. But, there are also temporary files, such as SQL Server’s TempDB files. There are also indexes to allow for rapid access to particular frequently used rows, log files that record all database activity, and backups created by the database and associated applications.
Also, different suppliers’ database systems have different implementations of these basic elements and can occupy different volumes of storage capacity for the same nominal database size.
A 20,000 node (host, server, etc.), 20GB database file size in SQL Server, for example, requires 46GB when log files, TempDB files and backups are taken into account. The equivalent database file size in Oracle and PostgreSQL would take up 90GB and 26GB respectively.
The issue with deletes in a database – which Toyota identified as a cause in its problems – is that a database delete does not reduce file size. If a row is removed, its allocated space is not removed. It’s just marked as unused. Database size is set when it is originally configured and periodically reset through its lifetime, but the key thing is that space is allocated and it stays that way, primarily to make sure storage needs are not suddenly overwhelmed by huge bouts of shrinkage and growth. It’s pretty much an in-stone commandment of the DBA world that databases never shrink.
Log and temp files
Then there are the log and temp files, which can suddenly balloon in volume with a spike in transactions, such as during the deletes and reorganisation Toyota alluded to and during maintenance in general. We have no way of knowing if that’s what actually happened to Toyota, but it’s a theoretical possibility.
Further to this are the potential for replication and snapshots to add to storage capacity requirements. Meanwhile, increased disk utilisation can result in longer access times and RAID configurations can add to storage volume growth.
A further implication of Toyota’s (limited) explanation points to some other possible basic failings. These include that temp files may not have been on separate drives, which is a recommended best practice, that backups were on the same drives as production data (not recommended), and even that production system database storage was on direct-attached server storage. We don’t know if the latter was the case – it may have been on perfectly adequate server or hyper-converged infrastructure, in theory – but high-performance shared SAN storage is very well-suited to critical database deployments.
Overall, we’ve had to speculate a fair bit here. Toyota’s communication was quite candid, but there’s lots we don’t know. The bottom line is that the company failed somehow to plan properly for growth in database size for some critical systems, and that’s a lesson lots of organisations can take on board.