CPU Utilization query reporting > 100% for certain instances #491

chadbaldwin · 2024-08-20T20:02:06Z

Here from SQL Server Community Slack chat:
https://app.slack.com/client/T1LTZ0BQV/C1MS1RA4B

Note: I am not currently a SQLWATCH user, but while doing research I found the query being used by SQLWATCH. Leaving this issue here by request of @marcingminski

I came across a bug regarding this commonly used query:

sqlwatch/SqlWatch.Monitor/Project.SqlWatch.Database/dbo/Procedures/usp_sqlwatch_logger_performance.sql

Lines 62 to 79 in 1b3543c

    
           select  
        
           		--original PR https://github.com/marcingminski/sqlwatch/commit/b8a8a5bbaf134dcd6afb4d5b9fef13e052a5c164 
        
           		--by https://github.com/marcingminski/sqlwatch/commits?author=sporri 
        
           		@date_snapshot_current 
        
           	,	percent_processor_time=convert(real,ProcessUtilization) 
        
           	,	percent_idle_time=convert(real,SystemIdle) 
        
           FROM (  
        
           		SELECT SystemIdle=record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int'),  
        
           			ProcessUtilization=record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') 
        
           		FROM (  
        
           			SELECT TOP 1 CONVERT(xml, record) AS [record]  
        
           			FROM sys.dm_os_ring_buffers WITH (NOLOCK) 
        
           			WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR' collate database_default 
        
           			AND record LIKE N'%<SystemHealth>%' collate database_default 
        
           			ORDER BY [timestamp] DESC 
        
           			) AS x  
        
           		) AS y 
        
           OPTION (keep plan);

(Also used in dbadash and other tools/scripts).

It seems there is an issue when running this query against non-hypervisor instances. The ProcessUtilization value does not appear to be 100-based and I have not yet determined what it actually is based on. Just working on theory at this point. The value starts at 0, like normal, but as SQL load increases, the value quickly surpasses where you would expect it to be.

For example, I have an instance with the following stats:

cpu_count                      : 224
hyperthread_ratio              : 56
max_workers_count              : 7552
scheduler_count                : 224
scheduler_total_count          : 315
affinity_type_desc             : AUTO
virtual_machine_type_desc      : NONE
softnuma_configuration_desc    : ON
process_physical_affinity      : {
                                   {0,ffffffffffffff}
                                   {1,ffffffffffffff}
                                   {2,ffffffffffffff}
                                   {3,ffffffffffffff}
                                 }
socket_count                   : 4
cores_per_socket               : 28
numa_node_count                : 16
container_type_desc            : NONE

When I run the linked query above, I get:

percent_processor_time    : 180
percent_idle_time         : 55

My theory is that it might be multiplied by the number of sections in process_physical_affinity, but that is a complete guess for now.

One solution I have considered is to just make the query "less bad". For example, if virtual_machine_type_desc is NONE then do not use ProcessUtilization, instead use 100-SystemIdle, as SystemIdle does seem to reliably be 100-based. Unfortunately this lumps OS load into there, but that seems better than having nothing or something completely wrong.

The text was updated successfully, but these errors were encountered:

chadbaldwin · 2024-08-20T20:06:49Z

UPDATE: I guess I've got blinders on and I have been looking at this problem for too long. I don't think there is any need to look at process_physical_affinity. As it seems when virtual_machine_type_desc is NONE then the number of sections in process_physical_affinity is equal to the socket count.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CPU Utilization query reporting > 100% for certain instances #491

CPU Utilization query reporting > 100% for certain instances #491

chadbaldwin commented Aug 20, 2024

chadbaldwin commented Aug 20, 2024

CPU Utilization query reporting > 100% for certain instances #491

CPU Utilization query reporting > 100% for certain instances #491

Comments

chadbaldwin commented Aug 20, 2024

chadbaldwin commented Aug 20, 2024