Recently we noticed a sudden increase in the completion time of the SQL backups on two different servers. Other SQL servers were not experiencing the same issue. It caught our eye that both affected systems are used for the same application.
- The first one is used for acceptance and test and contains approximately 450 GB of data. The backup job used to complete in 4 hours; now it needed 10 hours and still failed on 8 databases.
- The second one is used for production and contains approximately 1500 GB of data. The backup job used to need 15 hours; now it needed 27 hours and still reported 11 failed database backups.
- Both SQL instances were running on dedicated (non-shared) Microsoft Windows clusters;
- There was no sudden increase in the amount of data on the machines;
- Neither cluster had been failed over;
- No configuration changes had been made to the SQL server, nor to the front-end application (indexing, clean-up tasks, batches, etc.);
- No patches had been installed on Windows, SQL Server or CommVault;
- Everything is still on 1 Gbit network-wise;
- We are backing up to a shared Virtual Tape Library (VTL).
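To put those durations in perspective, here is a rough back-of-the-envelope calculation of the average backup throughput before and after the slowdown, compared with the theoretical ceiling of a 1 Gbit link. It is only a sketch: the sizes are the approximate figures quoted above, 1 GB is taken as 1000 MB, and the "after" numbers are upper bounds because those jobs did not back up every database.

```python
# Approximate figures from this post: (data size in GB, job duration in hours).
JOBS = {
    "acceptance/test, before": (450, 4),
    "acceptance/test, after": (450, 10),
    "production, before": (1500, 15),
    "production, after": (1500, 27),
}

# Theoretical ceiling of a 1 Gbit link, ignoring protocol overhead.
GBIT_LINK_MB_PER_S = 1000 / 8  # 125 MB/s


def throughput_mb_per_s(size_gb: float, hours: float) -> float:
    """Average throughput in MB/s, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / (hours * 3600)


for name, (size_gb, hours) in JOBS.items():
    rate = throughput_mb_per_s(size_gb, hours)
    share = rate / GBIT_LINK_MB_PER_S
    print(f"{name}: {rate:.1f} MB/s ({share:.0%} of a 1 Gbit link)")
```

Even the healthy jobs averaged only a fraction of what the link could carry in theory, which suggests (but does not prove) that the network itself was not the limiting factor.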
Currently we are running on CommVault Simpana 9 with Service Pack 14 installed.
Our hunch: the slowdown is caused by an application-level task that was rescheduled and now puts extra load on the server during the backup window. However, we have not been able to prove this yet, so further investigation is required.
Both servers failed with the “error code: [30:325] – Description: Error encountered during Backup. Error “.
When investigating the local event log of the Windows systems, we noticed a number of errors stating: “BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device ‘dd9aec90-2c37-4156-abf4-50e95753124a_3’. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.)”
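If you have many of these events to go through, a small script can help tally them. The following is a minimal sketch, assuming the event messages have been exported to plain text and use straight single quotes around the device name; the function name `parse_flush_failure` and the regular expression are our own illustration, not part of any CommVault or SQL Server tooling.

```python
import re
from collections import Counter

# Matches the "Flush failure" message quoted above and captures the
# backup device name and the operating system error code.
FLUSH_FAILURE = re.compile(
    r"Flush failure on backup device '(?P<device>[^']+)'\.\s*"
    r"Operating system error (?P<code>\d+)"
)


def parse_flush_failure(message: str):
    """Return (device, os_error_code), or None if the message does not match."""
    match = FLUSH_FAILURE.search(message)
    if match is None:
        return None
    return match.group("device"), int(match.group("code"))


sample = (
    "BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup "
    "device 'dd9aec90-2c37-4156-abf4-50e95753124a_3'. Operating system error "
    "995(The I/O operation has been aborted because of either a thread exit "
    "or an application request.)"
)

# Count failures per (device, error code) across a list of exported messages.
messages = [sample, "some unrelated event", sample]
failures = Counter(
    parsed for parsed in map(parse_flush_failure, messages) if parsed is not None
)
for (device, code), count in failures.items():
    print(f"{device}: OS error {code} seen {count} time(s)")
```

Grouping by device and error code makes it easier to see whether every failure is the same error 995 abort or whether other errors are mixed in.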
Eventually we changed the VDI timeout value on the SQL Server instance within the CommVault console. By default it is set to 300 seconds (5 minutes). We raised it to 7200 seconds (2 hours) and left the number of retries at its default.
After altering the VDI timeout value, the backups ran as they should again, completing within the approved time range and with no failed databases.
Hope this helps!