Did you know Always On Availability Groups Have Flow Control? What does this mean?
Availability groups are designed with flow control gates on the primary replica to avoid excessive resource consumption, such as network and memory resources, on all availability replicas. These flow control gates do not affect the synchronization health state of the availability replicas, but they can affect the overall performance of your availability databases, including RPO.
After the logs have been captured on the primary replica, they are subject to two levels of flow controls. Once the message threshold of either gate is reached, log messages are no longer sent to a specific replica or for a specific database. Messages can be sent once acknowledgment messages are received for the sent messages to bring the number of sent messages below the threshold.
In addition to the flow control gates, there is another factor that can prevent the log messages from being sent. The synchronization of replicas ensures that the messages are sent and applied in the order of the log sequence numbers (LSN). Before a log message is sent, its LSN is checked against the lowest acknowledged LSN number to make sure its less than one of thresholds (depending on the message type). If the gap between the two LSN numbers is larger than the threshold, the messages are not sent. Once the gap is below the threshold again, the sending of messages is resumed.
Purpose of Flow Control:
- Avoiding Excessive Resource Usage: Flow Control helps prevent excessive resource consumption on both the primary and secondary replicas. These resources include network bandwidth and memory.
- Throttling Message Transmission: When an AG is in “Flow Control” mode, the transmission of log block messages from the primary to the secondary is paused until the flow control mode is reset.
Flow Control operates at two levels:
- Availability Group Replica/Transport: The threshold for flow control at this level is 8192 messages.
- Availability Group Replica Database: Each database within the AG has its own flow control gate, allowing a maximum of 112*16 = 1792 messages per database. This limit contributes to the overall 8192 total limit at the transport or replica level.
How Flow Control Works:
- Message Sequence Numbers: Every message sent over the wire has a sequence number, which increases monotonically. It also includes an acknowledgment number, representing the sequence number of the last processed message at the other end of the connection.
- Primary Replica Perspective:
- When sending a message, the primary checks the unacknowledged messages (the delta between the sequence number of the sent message and the last acknowledged message).
- If this delta exceeds a predefined threshold, the replica or database enters flow control mode, halting further messages until the mode is reset.
- Secondary Replica Perspective:
- When log caches are flushed or memory pressure is detected, the secondary sends a SECONDARY_FLOW_CONTROL message to the primary.
- The primary sets a bit for the affected database, indicating flow control mode.
- During flow control, the primary skips sending messages to that database.
- The primary periodically checks for changes in flow control state (every 1000ms).
How to Monitor
You can monitor for flow control by using Performance Monitor, capture on the primary replica. To monitor database flow control, add counters SQLServer:Database Replica and choose Database Flow Control Delay and Database Flow Controls/sec counters. In the Instance box select the availability group database you wish to check for database flow control. To detect and monitor availability replica flow control, add counters SQLServer:Availability Replica and choose Flow Control Time (ms/sec) and Flow Control/sec counters.
There are 2 Extended Events which will give us the relevant information when we are in Flow control mode:
The action is basically a “set=0” or “cleared=1” bit.
Consider activating Trace Flag 12310 to raise the flow control limits on the number of messages each Availability Group gate can handle. increased limit is also available to the following versions of SQL Server, starting with: SQL Server 2019 CU9, SQL Server 2017 CU18, and SQL Server 2016 SP1 CU16.These enhanced limits are introduced as the default settings in SQL Server 2022 (16.x), rendering the trace flag obsolete in this version. For further details, Monitor performance for availability groups – SQL Server Always On | Microsoft Learn
SQL 2022 new limits:
Conclusion
SQL Server’s Always On Availability Group feature includes a sophisticated flow control mechanism that preserves system performance and stability by managing the distribution of log messages across availability replicas. This intelligent system ensures that resource consumption is kept in check without compromising the synchronization health of the replicas. Monitoring these controls can be achieved through Performance Monitor and Extended Events, providing valuable insights for database administrators. With the latest updates in SQL Server 2022, Microsoft has further refined these controls, making them more efficient and thus eliminating the need for previous workarounds like Trace Flag 12310 in most standard scenarios. These improvements reflect Microsoft’s commitment to enhancing the resilience and performance of SQL Server environments.