Introduction
When managing SQL Server performance, it’s important to understand the relationship—and the distinction—between worker threads and CPU utilization. These are two separate metrics that can behave independently. For instance, you might experience high CPU utilization while still having available worker threads, or you could have all worker threads occupied even when CPU utilization is low.
To clarify this concept, let’s use a metaphor. Imagine a restaurant with 20 tables. Suppose the restaurant has only two waiters, one cook, and a single stove. Even if only half of the tables are occupied—meaning the restaurant is operating at 50% of its seating capacity—all the staff and the stove might be fully engaged. Despite having available tables, the restaurant cannot serve more customers efficiently because the workers are maxed out.
This scenario is analogous to how SQL Server operates. The CPU utilization might not be at its peak, but if the tasks exceed the available worker threads, performance can suffer. Simply adding more CPU power doesn’t necessarily enable SQL Server to handle more work; it’s about having enough worker threads to manage tasks efficiently.
The Role of Worker Threads
In SQL Server, worker threads are essential for processing tasks and queries. Adding more CPUs can increase the number of worker threads, similar to hiring more staff in our restaurant metaphor. The default number of worker threads is determined based on the CPU architecture (32-bit or 64-bit) and the number of CPUs, as documented in SQL Server’s guidelines.
However, when it comes to scaling, you have two main options: manually increasing the maximum number of worker threads or adding more CPUs.
Manually increasing the maximum worker threads is akin to trying to fit ten people into an SUV designed for seven. While you might squeeze everyone in, it won’t be comfortable, and it might even be unsafe. In the context of SQL Server, increasing worker threads beyond optimal levels can lead to performance degradation due to excessive context switching and resource contention.
On the other hand, adding more CPUs is like upgrading from an SUV to a van that can comfortably accommodate ten people. This approach provides SQL Server with more capacity to handle tasks efficiently without overloading the system.
Determining the Number of Databases per Availability Group
A common question that arises is, “How many databases should I have per Availability Group (AG)?” The answer depends on several factors, including your specific environment and workload.
In practice, you might encounter error logs in SQL Server such as:
The thread pool for Always On Availability Groups was unable to start a new worker thread because there are not enough available worker threads. This may degrade Always On Availability Groups performance. Use the "max worker threads" configuration option to increase the number of allowable threads.
This error indicates that SQL Server doesn’t have enough worker threads available for Always On Availability Groups, which can lead to performance degradation. For example, you might see logs like:
RECOVERY (): Max redo workers limit reached at instance level Max[102], Current[100]
This message shows that the maximum number of redo worker threads has been reached, causing issues with the performance of Availability Groups. Each database within an AG can use up to 16 redo threads, but there’s an instance-level limit of 100 redo threads for all AGs combined. When this limit is reached, SQL Server starts reusing threads among databases, potentially leading to performance bottlenecks.
If you have 100 databases in an AG, all the redo operations will compete for those 100 threads. In a light workload with small databases, this might not pose a problem. However, if the workload spikes or the databases are large, SQL Server may not have enough threads to keep up, resulting in delays and increased CPU pressure.
Why SQL Server Limits Worker Threads
You might wonder why SQL Server imposes a limit on worker threads, especially if there are more CPUs available. The reason is to prevent excessive context switching and resource contention, which can degrade performance. Managing too many threads simultaneously can introduce overhead that outweighs the benefits of parallel processing.
Recommendations for Databases per Availability Group
So, what’s a practical guideline for the number of databases to include per AG? While there’s no one-size-fits-all answer, a reasonable recommendation is to limit the number to around 90 databases per AG. This leaves some headroom in the worker thread pool to handle unexpected spikes in workload without hitting instance-level limits.
Of course, you should adjust this recommendation based on your specific environment. If your databases are small and the workload is light, you might be able to include more databases per AG. Conversely, if you have large databases with heavy transaction volumes, you might need to reduce the number of databases per AG to ensure optimal performance.
Why SQL Server Sometimes Recommends Adding More CPUs
Knowing that the AG limit is 100 redo threads, you might question why SQL Server sometimes suggests adding more CPUs even when this limit remains unchanged. The reason is that adding CPUs increases the overall capacity for worker threads at the system level, which can help manage workloads across multiple AGs.
For example, if you have multiple AGs, each requiring worker threads for redo operations, more CPUs can distribute the load more effectively. However, it’s important to note that adding CPUs doesn’t automatically increase the instance-level limit of 100 redo threads for AGs. You may need to optimize thread usage or carefully adjust the “max worker threads” setting to align with the increased CPU capacity.
This limitation can become more noticeable during failovers, especially in environments running SQL Server 2019 or later. During a failover, the system needs to catch up on all pending log blocks and apply them from the transaction log (LDF) to the data file (MDF). This process is CPU-intensive and requires sufficient worker threads. If the worker thread pool per AG is only a fraction of what’s needed, SQL Server has to queue tasks, leading to delays and higher CPU utilization.
Conclusion
Understanding the interplay between worker threads and CPU utilization is essential for optimizing SQL Server performance, particularly when using Always On Availability Groups. It’s not just about adding more CPU power; it’s about ensuring you have enough worker threads to manage your workload efficiently without exceeding optimal limits.
By carefully planning the number of databases per AG and appropriately scaling your CPU and worker thread resources, you can prevent performance bottlenecks, reduce failover times, and maintain high availability in your SQL Server environment.