What is the difference between %CPU usage and load average, and when should it be a concern?

I’ve searched through existing answers here but couldn’t find one that covers this scenario; if you know of one, kindly point me to it.

I am including the actual numbers here for ease of comprehension.

I have a 96-core bare-metal Linux server with 256 GB of RAM that is dedicated to running an in-house distributed, event-based, asynchronous network service that acts as a caching server. The daemon runs with 32 worker threads. Apart from the main work of fetching and caching, it also performs a few related tasks in a couple of separate threads, such as polling the other members’ health checks and writing metrics to a Unix socket. The worker thread count isn’t raised further because doing so would increase cache lock contention. There is not much disk activity from this server: metrics are written in batches, and if the write to the Unix socket fails, the daemon simply discards the batch and frees the memory (see the sketch below).
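For clarity, here is a rough Go sketch of the metrics path I described: batch the metrics, attempt a best-effort write to the Unix socket, and drop the batch if the write fails. The socket path, batch size, and language are illustrative assumptions on my part; the actual daemon is in-house and not shown here.

package main

import (
	"net"
	"strings"
	"time"
)

const (
	metricsSocket = "/run/cacher/metrics.sock" // hypothetical path
	batchSize     = 100                        // hypothetical batch size
)

// metricsWriter batches incoming metric lines and flushes them to the Unix
// socket when the batch is full or on a timer. A failed write is ignored and
// the batch is dropped, so the hot path never blocks on metrics.
func metricsWriter(metrics <-chan string) {
	batch := make([]string, 0, batchSize)
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	flush := func() {
		if len(batch) == 0 {
			return
		}
		// Best effort: if the socket is unavailable, discard the batch.
		if conn, err := net.DialTimeout("unix", metricsSocket, time.Second); err == nil {
			conn.Write([]byte(strings.Join(batch, "\n") + "\n"))
			conn.Close()
		}
		batch = batch[:0] // reuse the slice either way
	}

	for {
		select {
		case m, ok := <-metrics:
			if !ok {
				flush()
				return
			}
			batch = append(batch, m)
			if len(batch) >= batchSize {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}

func main() {
	metrics := make(chan string)
	go metricsWriter(metrics)
	metrics <- "cache_hits 1024"
	metrics <- "cache_misses 37"
	close(metrics)
	time.Sleep(100 * time.Millisecond) // give the writer a moment to flush
}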

This instance is part of a 9-node cluster, and the stats of this node are representative of the rest of the instances in the cluster.

With the recent surge in inbound traffic, the %CPU usage of the process has gone up considerably, but the load average is still less than 1.

Please find the stats below.

:~$ nice top
top - 19:51:55 up 95 days,  7:27,  1 user,  load average: 0.33, 0.28, 0.32
PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
587486 cacher   20   0  107.4g  93.0g  76912 S  17.2  37.0   5038:13 cacher

The %CPU sometimes goes up to 80%, but even then the load average stays considerably lower and doesn’t go beyond 1.5. This happens mostly when there is a cache miss and the cacher has to fetch the item from an upstream, so it is mostly network activity. As far as I understand, the compute-heavy operation this service performs at runtime is computing the hash of the item to be cached in order to place it in the appropriate distributed bucket. There are no systemd resource limits set on this service, and the unit is also tuned to exempt this process from the kernel’s OOM killer, although memory usage is nowhere near the limit. The systemd sockets the service binds to have been tuned with larger TX and RX buffers.
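To clarify what I mean by the hashing step, here is a minimal, illustrative sketch of that kind of key-to-bucket placement. The FNV-1a hash and the fixed bucket count are my assumptions for the example; the actual service may use a different hash function or a consistent-hashing ring.

package main

import (
	"fmt"
	"hash/fnv"
)

const numBuckets = 9 // e.g. one bucket per cluster node (assumption)

// bucketFor maps a cache key to one of the distributed buckets.
func bucketFor(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % numBuckets
}

func main() {
	for _, key := range []string{"user:42", "session:abc", "asset:logo.png"} {
		fmt.Printf("%-16s -> bucket %d\n", key, bucketFor(key))
	}
}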

  • Why is the load average less than 1 on a 96-core server when the %CPU of the service, which runs 32 worker threads, consistently fluctuates between 20% and 80%?
  • On a 96-core server, what is considered a safe %CPU value for a single process? Does it depend on how many threads the process uses? If the number of worker threads is bumped, is a higher %CPU usage theoretically acceptable?

Thank you.