Introduction
SQL Server Always On Availability Groups are designed to provide a high-availability and disaster recovery solution that keeps your databases running smoothly even in the face of hardware failures and data corruption. One of the key features supporting this capability is Automatic Page Repair. This feature works behind the scenes to automatically fix corrupted pages in a database, thereby minimizing downtime and preserving data integrity. In this post, we will explore the mechanisms, benefits, and limitations of Automatic Page Repair, highlighting its role in enhancing SQL Server reliability.
How Automatic Page Repair Works
Automatic Page Repair is an automated process that SQL Server uses to resolve page corruption issues without requiring manual intervention from a database administrator. By automating the repair process, SQL Server ensures that databases can continue operating with minimal disruption. Let’s look at how this process unfolds.
Detection of Corruption
The first step in Automatic Page Repair is the detection of a corrupted page. When SQL Server encounters a page that cannot be read correctly due to corruption, it takes several actions:
- Log the Corruption: SQL Server records the incident in the suspect_pages table in the msdb system database. The row captures the database ID, file ID, page ID, and event type, which help track the nature and scope of the corruption.
- Request Repair: Once the corruption is logged, the primary replica broadcasts a request for a copy of the page to all secondary replicas in the availability group. The request specifies the page ID and the log sequence number (LSN) currently at the end of the flushed log, so that the replica returns a sufficiently recent version of the page.
- Handle Responses: The primary replica takes the page from the first secondary replica that responds.
- Mark as Restore Pending: During the repair attempt, the corrupted page is marked as restore pending. Any attempt to access the page fails with error 829, which indicates that the page is temporarily unavailable while the repair is underway.
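The suspect_pages row written in the first step can be inspected directly. The sketch below is illustrative only: it keeps a query as a plain string (so it needs no database driver) and decodes the event_type column using the codes documented for msdb.dbo.suspect_pages. The SuspectPage class and its method names are hypothetical helpers, not part of SQL Server.

```python
from dataclasses import dataclass

# A query an administrator might run against msdb; kept as a string here.
SUSPECT_PAGES_QUERY = """
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;
"""

# event_type codes as documented for msdb.dbo.suspect_pages.
EVENT_TYPES = {
    1: "823 error, or 824 error other than bad checksum or torn page",
    2: "bad checksum",
    3: "torn page",
    4: "restored",
    5: "repaired by DBCC",
    7: "deallocated by DBCC",
}


@dataclass(frozen=True)
class SuspectPage:
    """Hypothetical holder for one suspect_pages row."""
    database_id: int
    file_id: int
    page_id: int
    event_type: int
    error_count: int

    @property
    def needs_attention(self) -> bool:
        # Types 1-3 record an unresolved error; 4, 5 and 7 record a resolution.
        return self.event_type in (1, 2, 3)

    def describe(self) -> str:
        label = EVENT_TYPES.get(self.event_type, "unknown event type")
        return f"db {self.database_id}, page {self.file_id}:{self.page_id}: {label}"
```

For example, SuspectPage(5, 1, 153, 2, 1).describe() labels the row as a bad checksum on page 1:153 of database 5, and needs_attention stays true until the row's event_type moves to a resolved value.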
Page Repair Process
The repair process involves collaboration between the primary and secondary replicas. Here’s how it works:
- Log Processing by Secondary Replica: The secondary replica waits until it has redone the transaction log up to the LSN specified in the request, so that its copy of the page is current.
- Access and Send Page: The secondary replica then tries to read the page from its copy of the database and sends it to the primary replica. If the page cannot be accessed, the secondary returns an error and the repair attempt fails. This ensures that only valid, readable pages are used for repair.
- Update Repair Status: If the repair succeeds, SQL Server marks the page in the suspect_pages table as “restored” (event_type 4). It then tries to resolve any transactions that were deferred because of the corruption.
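To make the handshake concrete, here is a toy in-memory model of the exchange between the primary and a secondary. It is a sketch under stated assumptions, not how SQL Server is implemented: the Primary and Secondary classes, their fields, and RestorePendingError are all hypothetical, the first element of the responders list stands in for "the first replica to respond", and deferred transactions are not modelled. The value 4 is the documented suspect_pages event_type for a restored page.

```python
from dataclasses import dataclass, field


class RestorePendingError(Exception):
    """Stands in for SQL Server error 829 (page is restore pending)."""


@dataclass
class Secondary:
    name: str
    redone_lsn: int                              # how far this replica has redone the log
    pages: dict = field(default_factory=dict)    # page_id -> page bytes

    def serve(self, page_id, lsn):
        # Catch up to the requested LSN before reading the page.
        self.redone_lsn = max(self.redone_lsn, lsn)
        if page_id not in self.pages:
            raise IOError(f"{self.name} cannot read page {page_id}")
        return self.pages[page_id]


@dataclass
class Primary:
    flushed_lsn: int
    pages: dict = field(default_factory=dict)
    restore_pending: set = field(default_factory=set)
    suspect_pages: dict = field(default_factory=dict)   # page_id -> event_type

    def read(self, page_id):
        if page_id in self.restore_pending:
            raise RestorePendingError(f"error 829: page {page_id} is restore pending")
        return self.pages[page_id]

    def repair(self, page_id, event_type, responders):
        # Log the corruption and block access to the page.
        self.suspect_pages[page_id] = event_type
        self.restore_pending.add(page_id)
        if not responders:
            return False
        try:
            # The request carries the page ID and the end-of-flushed-log LSN.
            fresh = responders[0].serve(page_id, self.flushed_lsn)
        except IOError:
            # The source does not say what happens to the mark after a
            # failed attempt; this sketch simply leaves the page blocked.
            return False
        self.pages[page_id] = fresh
        self.suspect_pages[page_id] = 4          # restored
        self.restore_pending.discard(page_id)
        return True
```

With a secondary holding a clean copy of page 7, primary.repair(7, 2, [secondary]) replaces the page, records event_type 4, and leaves the secondary's redone_lsn at the primary's flushed LSN. If the secondary cannot read the page, repair returns False and reads of that page keep raising RestorePendingError.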
Monitoring and Validation
To monitor the Automatic Page Repair process, database administrators can use the Dynamic Management View (DMV) sys.dm_hadr_auto_page_repair. This DMV provides vital information that helps track the progress and outcome of the repair process:
- page_id: The ID of the page within its file, indicating which page was affected.
- error_type: The type of error that triggered the repair attempt, such as a hardware (823) error, a bad checksum, or a torn page.
- page_status: The current status of the page-repair request, ranging from queued, through request sent, to successfully repaired or irreparable.
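The numeric codes in error_type and page_status are easier to read once decoded. The following sketch keeps a sample query as a string and maps the codes to labels as I understand them from the documentation for sys.dm_hadr_auto_page_repair; check them against your SQL Server version. The describe and irreparable helpers are hypothetical names introduced for illustration.

```python
# A query an administrator might run; kept as a string so no driver is needed.
AUTO_PAGE_REPAIR_QUERY = """
SELECT database_id, file_id, page_id, error_type, page_status, modification_time
FROM sys.dm_hadr_auto_page_repair
ORDER BY modification_time DESC;
"""

# error_type codes (assumed from the DMV documentation).
ERROR_TYPES = {
    -1: "hardware error (823)",
    1: "824 error other than bad checksum or torn page",
    2: "bad checksum",
    3: "torn page",
}

# page_status codes (assumed from the DMV documentation).
PAGE_STATUSES = {
    2: "queued for request from partner",
    3: "request sent to partner",
    4: "queued for automatic page repair",
    5: "repair succeeded",
    6: "irreparable",
}


def describe(row):
    """row: (database_id, file_id, page_id, error_type, page_status)."""
    database_id, file_id, page_id, error_type, page_status = row
    error = ERROR_TYPES.get(error_type, f"unknown error_type {error_type}")
    status = PAGE_STATUSES.get(page_status, f"unknown page_status {page_status}")
    return f"db {database_id} page {file_id}:{page_id}: {error}, {status}"


def irreparable(rows):
    """Return the rows whose repair attempt ended in status 6."""
    return [row for row in rows if row[4] == 6]
```

Rows that irreparable returns are the ones worth following up manually, for instance with a page restore from backup.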
Error Types Handled
Automatic Page Repair is capable of addressing several types of errors, which include:
- Error 823: Raised when the operating system reports a cyclic redundancy check (CRC) failure on the data, which points to a hardware or I/O error.
- Error 824: Raised for logical consistency errors that SQL Server itself detects on a page, such as a bad page checksum or a torn page.
- Error 829: Raised when a page is marked as restore pending, signaling that repair efforts are in progress.
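When scanning error output for these numbers, a small classifier can separate errors that may trigger an automatic repair from the one that signals a repair in progress. This is a minimal sketch: it assumes messages contain the "Msg N, Level N, State N" header that SQL Server tools print, and the classify function and its return labels are hypothetical.

```python
import re

REPAIR_CANDIDATES = {823, 824}   # errors that can start an automatic repair attempt
RESTORE_PENDING = 829            # page is blocked while a repair is underway

# Matches the header of an error message, e.g. "Msg 824, Level 24, State 2".
MSG_RE = re.compile(r"\bMsg (\d+), Level (\d+), State (\d+)")


def classify(message: str) -> str:
    match = MSG_RE.search(message)
    if match is None:
        return "unrecognized"
    number = int(match.group(1))
    if number in REPAIR_CANDIDATES:
        return "repair candidate"
    if number == RESTORE_PENDING:
        return "repair in progress"
    return "not handled"
```

A line shaped like "Msg 824, Level 24, State 2, Line 1" is classified as a repair candidate, while any other error number falls through to "not handled".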
Limitations
While Automatic Page Repair is a powerful feature, it is not without limitations:
- Certain critical pages cannot be repaired through this feature. These include:
  - The file header page (page 0)
  - The database boot page (page 9)
  - Global Allocation Map (GAM) pages
  - Shared Global Allocation Map (SGAM) pages
  - Page Free Space (PFS) pages
- The feature is only effective when the Availability Group is in a synchronous state, meaning that the primary and secondary replicas must be fully synchronized for repairs to be successful.
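Whether a given page falls into one of the excluded categories can be estimated from its page ID alone. The sketch below assumes the standard data file layout: the file header at page 0, the boot page at page 9 of the primary file (file_id 1), a PFS page at page 1 and then every 8,088 pages, a GAM page at page 2 and then every 511,232 pages, and an SGAM page immediately after each GAM page. The function names are hypothetical, and the result is a rough pre-check, not a guarantee that a repair will succeed.

```python
PFS_INTERVAL = 8_088       # pages tracked by one PFS page
GAM_INTERVAL = 511_232     # pages tracked by one GAM or SGAM page


def page_kind(file_id: int, page_id: int) -> str:
    """Classify a page by its position in the data file (assumed layout)."""
    if page_id == 0:
        return "file header"
    if page_id == 9 and file_id == 1:
        return "boot page"
    if page_id == 1 or page_id % PFS_INTERVAL == 0:
        return "PFS"
    if page_id == 2 or page_id % GAM_INTERVAL == 0:
        return "GAM"
    if page_id == 3 or (page_id > GAM_INTERVAL and page_id % GAM_INTERVAL == 1):
        return "SGAM"
    return "other"


def is_auto_repair_candidate(file_id: int, page_id: int) -> bool:
    # Only pages outside the excluded categories can be repaired automatically.
    return page_kind(file_id, page_id) == "other"
```

For example, page 1:153 is an ordinary page and therefore a candidate, whereas 1:8088 is a PFS page and 1:511232 is a GAM page, so neither can be repaired this way.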
Benefits
Automatic Page Repair provides several benefits that enhance the reliability of SQL Server databases:
- Data Integrity: By automatically repairing corrupted pages, this feature helps maintain the integrity of the database, ensuring that data remains consistent and reliable.
- High Availability: Automatic repairs reduce the need for manual intervention and minimize downtime caused by page corruption, thereby enhancing the overall availability of the database.
Process of Automatic Page Repair in Secondary Replicas
In some cases, page corruption can occur in a secondary replica. Here’s how the process works:
- Enter Suspended State: The affected database on the secondary replica enters the SUSPENDED state to prevent further corruption or inconsistencies.
- Record Corruption: The corrupt page information is recorded in the suspect_pages table of the secondary replica, allowing administrators to track the issue.
- Request Page Copy: A request is sent to the primary replica for a copy of the page, leveraging the primary’s role as the source of truth.
- Primary Access: If the primary replica can access the page, it sends the page to the secondary replica for repair.
- Repair and Synchronize: The secondary replica repairs the page, and the Availability Group returns to a synchronized state, resuming normal operations.
Conclusion
Automatic Page Repair is an invaluable feature for maintaining database integrity and availability in SQL Server Always On Availability Groups. By understanding how it works, its limitations, and benefits, database administrators can leverage this feature to ensure their systems remain robust and resilient against data corruption. This feature not only enhances SQL Server reliability but also reduces the administrative burden of managing database integrity and availability.