Multiprocess Java app locks up routinely

TL;DR – Why does our Java app in an ECS Docker container hang when launching 8 child processes, with the smoking guns being a hung cat /proc/<pid>/cmdline command or the presence of jspawnhelper processes, and why did this issue suddenly arise?

Details…

I develop and maintain a Java app that implements a service. This app is deployed to AWS ECS, and we autoscale it such that from 1 to 12 copies are running at once. Each copy maintains a thread pool of 8 worker threads, and each thread picks up scheduled jobs. A job consists of executing and directing a headless web browser, a separate process that we launch by calling Runtime.getRuntime().exec() from our Java app. The child process is then directed and monitored over a socket connection established between the Java app and the subprocess. The subprocess exits at the end of the job, and the thread picks up a new job and launches a new child process.
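To make that flow concrete, here is a minimal sketch of one worker iteration; the browser binary, its arguments, and the host/port are hypothetical placeholders rather than our real configuration:

    import java.io.IOException;
    import java.net.Socket;

    // Sketch of a single worker iteration: launch the headless browser,
    // drive it over a control socket, and wait for it to exit before the
    // thread loops back to pick up the next job.
    public class WorkerIterationSketch {
        void runOneJob(String browserBinary, String host, int controlPort)
                throws IOException, InterruptedException {
            // Launch the headless browser as a child process
            // (binary path and flag are placeholders).
            Process browser = Runtime.getRuntime()
                    .exec(new String[] { browserBinary, "--headless" });

            // Connect the control socket used to direct and monitor the browser.
            try (Socket control = new Socket(host, controlPort)) {
                // Drive the job over the socket here.
            }

            // The subprocess exits at the end of the job.
            browser.waitFor();
        }
    }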

This architecture has existed and worked well for a number of years. Only recently, we started to experience a situation where the processing threads lock up and stop processing jobs. This happens quite regularly, taking anywhere from a few hours to a few days to occur with any particular instance of our app. We have worked backwards in time, deploying earlier versions of our app and its deployment definition, but have been unable to assign blame to any change we’ve made that could have initiated the problem.

We are struggling to figure out why this problem is happening or how to mitigate it. With this question, we are asking whether anyone has ideas about how to diagnose or resolve the issue. What we see in the wedged app, and what we have tried so far, are described below.

Once our app has become wedged, we get a view of what is going on either by attaching IntelliJ IDEA to the app's main process or by running jstack against it (via ssh). Both methods provide the same information (a sketch of capturing the same stacks from inside the app follows the list below). What we find is that each of the worker threads is almost always stuck in one of two places:

  1. In the “Runtime.getRuntime().exec()” call that is attempting to launch the child process for the job.

  2. In the "new Socket()" call that creates the socket used to communicate with the child process.
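For reference, the same stacks can be pulled from inside the app without attaching anything. The sketch below assumes, hypothetically, that our worker threads share a recognizable name prefix:

    import java.util.Map;

    // Dump the stacks of the worker threads programmatically, as an
    // alternative to attaching IntelliJ IDEA or running jstack.
    public class WorkerStackDump {
        static void dumpWorkerStacks(String workerNamePrefix) {
            for (Map.Entry<Thread, StackTraceElement[]> entry
                    : Thread.getAllStackTraces().entrySet()) {
                Thread t = entry.getKey();
                if (t.getName().startsWith(workerNamePrefix)) {
                    System.out.println(t.getName() + " [" + t.getState() + "]");
                    for (StackTraceElement frame : entry.getValue()) {
                        System.out.println("    at " + frame);
                    }
                }
            }
        }
    }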

We can ssh into the container hosting the app. We have looked around and have yet to find a reason for the problem. We have checked for a resource-exhaustion condition: there is plenty of free memory, plenty of file handles, no (or few) zombie socket connections, and plenty of CPU. We may or may not have reached the point where the assigned process ids have wrapped around at their maximum value of 32768 (the kernel's default pid_max).
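For what it is worth, here is a sketch of how such numbers could also be logged from inside the JVM itself; the cast below assumes a HotSpot-style JDK such as Corretto, which exposes the com.sun.management interface:

    import java.lang.management.ManagementFactory;
    import com.sun.management.UnixOperatingSystemMXBean;

    // Snapshot of heap and file-descriptor usage as seen by this JVM.
    public class ResourceSnapshot {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            System.out.println("heap used = " + (rt.totalMemory() - rt.freeMemory()));
            System.out.println("heap max  = " + rt.maxMemory());

            UnixOperatingSystemMXBean os =
                    (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
            System.out.println("open fds  = " + os.getOpenFileDescriptorCount());
            System.out.println("max fds   = " + os.getMaxFileDescriptorCount());
        }
    }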

When we run a "ps -e" in one of these instances, the command will often lock up. When this happens, a simpler "ps" will still complete, suggesting that the hang is caused by ps attempting to gather some of the extra information that the fuller command displays. Sure enough, if we compare the output of the two commands, there is a one to one correspondence between output lines up to the point where "ps -e" stops. If we take the process id of the first process that was output by only the second command (which will be the process on which "ps -e" got stuck) and run this command:

cat /proc/<pid>/cmdline

The command locks up. This is the most precise smoking gun we have been able to find. We have googled this condition and found a number of articles that discuss it, but doing so has provided neither a fix nor any real explanation of why it occurs. The most concrete suggestion is that we update our kernel version, which is something we would prefer not to have to do. We are running the most recent version of CentOS 8 off of Docker Hub; neither this OS version nor its accompanying kernel version is mentioned in any of the articles we have found.

Killing the Java app and restarting it in the same container immediately leads to the problem occurring again. So something outside of the app’s process is clearly out of whack. We’re guessing that we’ve exhausted some resource, but which one?

There’s a second condition that we have seen when viewing one of our wedged apps. Only in some cases, when we do a “ps”, we see 8 of the following processes running:

/usr/lib/jvm/java-17-amazon-corretto/lib/jspawnhelper

If we kill these processes, 8 more jobs get picked up, but the system locks up again once those 8 jobs have been processed. We assume that one of these processes is involved in each launch of one of our subprocesses. We don't find any instances of this process when looking at a non-wedged container, so it appears that these processes are normally very short lived. Googling for problems with this process has not provided any info that led to a fix for our problem.
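From what we can tell, jspawnhelper appears to be the helper binary the JDK uses on Linux when spawning child processes (it ships under the JDK's lib directory), which would fit our assumption that one of them accompanies each subprocess launch. Below is a hedged sketch of how we could watch for lingering instances from inside the JVM; note that ProcessHandle reads /proc under the hood, so on a wedged container this may hit the same hang as the cat command above:

    import java.time.Duration;
    import java.time.Instant;

    // List any jspawnhelper processes visible to this JVM, with their age.
    // The path fragment matches the Corretto 17 location shown above.
    public class JspawnhelperCheck {
        public static void main(String[] args) {
            ProcessHandle.allProcesses()
                    .filter(ph -> ph.info().command().orElse("").endsWith("/jspawnhelper"))
                    .forEach(ph -> {
                        Duration age = ph.info().startInstant()
                                .map(start -> Duration.between(start, Instant.now()))
                                .orElse(Duration.ZERO);
                        System.out.println("pid=" + ph.pid()
                                + " alive for " + age.toSeconds() + "s");
                    });
        }
    }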

What causes this unhealthy environment that we are seeing? How can we get eyes on the ultimate explanation of the problem? Is there some resource we’ve run out of, and if so, how can we see this?