Backup of a 48TB SQL Server Database Causes Log File Truncation Failure

Time: Column:Databases views:222

In this article, we explore an issue encountered in a SQL Server 2019 environment, where a 48TB database backup process led to a failure in truncating the log file. The inability to truncate the log caused the file to reach its size limit, preventing subsequent transactions from being written, which rendered the database unusable.

SQL Server Version: SQL Server 2019

Background

In a database environment where a single database is 48TB in size and uses the Simple recovery model, the log file size is limited to 600GB. During a full backup, which takes more than 12 hours, the log file could not be truncated and reached its maximum size. This resulted in the failure of subsequent transactions, causing the entire database to become unavailable.


Problem Symptoms

All Virtual Log Files (VLFs) within the LDF log file were in an active state, preventing normal log truncation. As the log file reached the 600GB limit, subsequent transaction writes failed, halting database operations.

Backup of a 48TB SQL Server Database Causes Log File Truncation Failure


Troubleshooting Approach

The general troubleshooting steps include:

  1. Large transactions preventing truncation: Uncommitted large transactions could block the truncation of the log file.

  2. Special environments: Scenarios such as Replication, Mirroring, Availability Groups (AG), Change Data Capture (CDC), or standby environment anomalies could prevent log truncation on the primary database.

  3. Transaction log backup delay: In a Full recovery model, not backing up the transaction log promptly could cause log file growth.

  4. Database recovery time settings: Modifying the database recovery time settings could delay checkpoints, prolonging log file truncation.


Troubleshooting Steps

We followed these steps to identify the issue:

  1. Check the recovery model: The database was in the Simple recovery model, ruling out transaction log backup issues (point 3).

  2. Examine the environment: The database was in standalone mode, excluding scenarios like Replication, Mirroring, or AG (point 2).

  3. Inspect long-running transactions: Using DBCC OPENTRAN, no long-running transactions were found, eliminating the large transaction issue (point 1).

  4. Verify recovery settings: Both TARGET_RECOVERY_TIME and recovery interval were set to default values, ruling out recovery time issues (point 4).

  5. Check for blocking: No blocking issues were found.

  6. Inspect SQL Agent jobs: No jobs other than the full backup were running.

  7. Review write operations: Discussions with developers revealed that this database handles daily batch data writes, and the log write size exceeded 400GB.


Root Cause Analysis

To identify the root cause, we need to understand how a full database backup works. A full database backup consists of two phases:

  1. First phase: At the start of the backup, the current LSN (Log Sequence Number) is recorded, and a snapshot-style backup is created.

  2. Second phase: After the snapshot, the latest LSN is recorded, and the transaction logs between the two LSNs are written into the backup.

Since the database was large, the backup process exceeded 12 hours, remaining in the first phase and unable to transition to the second phase. During this period, the log file couldn't be truncated, and the transactions within the 12-hour window exceeded 600GB, causing the LDF log file to reach its limit.

Backup of a 48TB SQL Server Database Causes Log File Truncation Failure


Optimization Recommendations

The final solution was to enable SQL Server 2019’s Accelerated Database Recovery (ADR) feature, which significantly reduced log file growth during the full backup process, resolving the issue.

Steps to Enable ADR

Ensure that you are using SQL Server 2019 or SQL Server 2022, as ADR was introduced starting from SQL Server 2019.

-- Enable ADR for the database
ALTER DATABASE [YourDatabaseName]
SET ACCELERATED_DATABASE_RECOVERY = ON;
GO

-- Check ADR status
SELECT name, is_accelerated_database_recovery_on
FROM sys.databases
WHERE name = 'YourDatabaseName';

-- Disable ADR
ALTER DATABASE [YourDatabaseName]
SET ACCELERATED_DATABASE_RECOVERY = OFF;
GO

Here are several optimization suggestions:

  1. Multi-file backup: Instead of a single file backup, use multiple backup files to improve backup efficiency and reduce backup time. (Previously, we discussed the split backup feature of SQL Server.)

  2. Upgrade SQL Server version: Upgrade to SQL Server 2019 or SQL Server 2022 and enable ADR (Accelerated Database Recovery) to ensure timely log truncation through SLOG.

  3. Business partitioning: Although less feasible, splitting business operations could reduce the transaction volume in a single database.


Conclusion

In SQL Server 2019, a 48TB database encountered issues with log file truncation due to a prolonged backup process, which caused the log file to reach its limit and prevent transaction writes. This article outlines multiple solutions, including split backups and enabling the ADR feature, with the final solution being the successful resolution of the issue by enabling ADR. It is crucial to always use the latest database versions, such as SQL Server 2019 or SQL Server 2022, to leverage the newest features and avoid many potential issues.