I came across some weird behavior when using the wait command for running parallel jobs in a bash script. For the sake of simplicity I have reduced the problem to the following bash script:
#!/bin/bash
test_func() {
    echo "$(date +%M:%S:%N): start $1"
    sleep $1
    echo "$(date +%M:%S:%N): end $1"
}
i=0
for j in {5..9}; do
    test_func $j &
    ((i++))
    sleep 3
done
echo "$(date +%M:%S:%N): No new processes, waiting for all to finish"
while [ $(pgrep -c -P$$) -ge 1 ]; do
    echo "$(date +%M:%S:%N): $(pgrep -P$$ -d' ')"
    wait -n $(pgrep -P$$ -d' ')
    echo "$(date +%M:%S:%N): next $i"
    ((i++))
done
The above script spawns 5 parallel runs of the test_func function, started 3 seconds apart, each of which sleeps for j seconds. I’ve added timestamps to each output line to show the timings. The output of running this script is as follows:
03:53:854843895: start 5
03:56:855729952: start 6
03:58:856136029: end 5
03:59:856388725: start 7
04:02:857016376: end 6
04:02:857508665: start 8
04:05:857895265: start 9
04:06:857738397: end 7
04:08:858666941: No new processes, waiting for all to finish
04:08:864528182: 3837265 3837297
04:08:875479745: next 5
04:08:881049792: 3837265 3837297
04:08:892058494: next 6
04:08:899310728: 3837265 3837297
04:08:910466324: next 7
04:08:916130505: 3837265 3837297
04:10:858746305: end 8
04:10:859380011: next 8
04:10:864975972: 3837297
04:14:859172632: end 9
04:14:859818377: next 9
As can be seen from the output above, the script spawns all 5 processes, of which 3 end before the end of the for loop (due to the sleep 3). At this point there are 2 processes still running, which the pgrep command correctly reports with IDs 3837265 and 3837297. However, the wait command in the while loop then returns immediately (< 0.1 seconds) for the next three calls, without any other process finishing (as shown by the pgrep output), despite being given the process IDs to wait for.
As far as I can tell (and from some experimentation), the wait command returns immediately once for each of the test_func calls that finished before it was first called (three in this case), and only then actually waits. What I don’t understand is why this is the case, especially since I supply the process IDs to wait for.
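To isolate that hypothesis, here is a reduced sketch (not the original script). It assumes that three short background jobs have all exited before the first wait -n, while one longer job is still running. The variable names long_pid, calls and rc are only for illustration, and the fractional sleep values assume GNU sleep. How many times wait -n returns before the long job ends may depend on the bash version, so no expected count is given.

```shell
#!/bin/bash
# Three short jobs that will already have exited before the first wait -n.
for n in 1 2 3; do
    sleep 0.1 &
done

# One longer job that is still running when the wait loop starts.
sleep 2 &
long_pid=$!

sleep 0.5   # give the short jobs time to finish

calls=0
# kill -0 only checks whether the PID still exists; it sends no signal.
while kill -0 "$long_pid" 2>/dev/null; do
    wait -n
    rc=$?
    ((calls++))
    echo "$(date +%M:%S:%N): wait -n call $calls returned with status $rc"
done
echo "wait -n was called $calls time(s) before the long job was gone"
```

If the hypothesis holds, the first few calls should print within milliseconds of each other and only the last one should block for about 1.5 seconds.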
I’m using Ubuntu 20.04.6 and GNU bash, version 5.0.17(1) for context.
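For comparison, a minimal sketch of the variant without -n: it records each PID from $! at launch instead of querying pgrep, then calls plain wait on one PID at a time. The helper job and the arrays pids and statuses are hypothetical names, and the exit codes 3, 0 and 7 are arbitrary values chosen so that each job is identifiable. As far as I know, bash remembers the exit status of a background child that has already exited, so wait with an explicit PID should return that status rather than block or fail.

```shell
#!/bin/bash
# Hypothetical helper: sleeps for $1 seconds, then finishes with status $2.
job() {
    sleep "$1"
    return "$2"
}

# Launch the jobs and record each PID as it is created.
pids=()
job 0.1 3 & pids+=("$!")
job 0.2 0 & pids+=("$!")
job 1   7 & pids+=("$!")

sleep 0.5   # the first two jobs have already exited by now

# Wait for each recorded PID in turn and collect its exit status.
statuses=()
for pid in "${pids[@]}"; do
    wait "$pid"
    statuses+=("$?")
done
echo "exit statuses: ${statuses[*]}"
```

This gives up the "whichever finishes first" semantics of wait -n, but each call is tied to exactly one known child.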