SSH waits for background jobs to exit only in non-pseudo-terminal mode

Ugh, this one caused me days of head-scratching.

$ ssh somehost
somehost$ sleep 15 &
somehost$ exit

If you run that command sequence, you’ll find that the SSH session exits immediately when you type exit – and it kills the background job (sleep 15) as it goes. I find that intuitive, if only because that’s obviously how it’s always worked.

But now run it this way:

$ ssh somehost 'sleep 15 &'

Now it makes the SSH connection, waits fifteen seconds, and only then exits.

This subtle difference matters a lot if you’re e.g. writing an automation system that sometimes spawns background jobs on remote machines via SSH, since it can lead to SSH commands hanging (potentially forever, depending on the nature of the background jobs).

And it’s very hard to debug, because no matter what debugging aids you put into your SSH session’s remote command – e.g. an explicit exit 0 at the end, set -xv, or echo 'Okay, going away now!' as the last command – it’ll still just hang instead of completing.

Turns out this is because when run without a remote command, SSH implicitly allocates a pseudo-terminal. And when the session ends, the pseudo-terminal is torn down, which delivers a SIGHUP to the processes attached to it – killing all subprocesses, including un-detached background jobs.

When you run SSH with a remote command, SSH does not create a pseudo-terminal. I’m guessing technically what it’s doing is waiting for the remote side to close the connection, which the remote side (sshd) won’t do until all child processes have exited (or at least closed their stdin, stdout, and stderr?).
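You can reproduce that waiting-for-writers behaviour locally, without SSH at all: a reader of a pipe only sees end-of-file once every process holding the write end has closed it. Here’s a sketch using a command substitution as the reader – a local analogue of the principle, not what sshd literally does:

```shell
#!/bin/bash
# Case 1: the background sleep inherits the subshell's stdout (the pipe),
# so the command substitution doesn't see EOF until sleep exits (~2s).
start=$(date +%s)
result=$(sleep 2 & echo "done")
elapsed_inherited=$(( $(date +%s) - start ))
echo "inherited stdout: ${result} after ${elapsed_inherited}s"

# Case 2: redirect the background job's stdout/stderr away from the pipe,
# so the only writer is echo, and the reader sees EOF immediately.
start=$(date +%s)
result=$(sleep 2 >/dev/null 2>&1 & echo "done")
elapsed_redirected=$(( $(date +%s) - start ))
echo "redirected stdout: ${result} after ${elapsed_redirected}s"
```

The second case is also the classic fix on the sshd side: if the remote background job redirects (or closes) its stdout and stderr, sshd has nothing left to wait for.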

One workaround is to use the -t argument to SSH, which forces creation of a pseudo-terminal. The downside is that remote programs may then assume an interactive session, so they may prompt for input in cases where they otherwise wouldn’t. That can cause misalignment between what you write to the remote command’s stdin and what it’s expecting input for, or a hang (if you don’t write anything but keep the remote side’s stdin open), or the remote command aborting because it detects end-of-file on stdin.

Another option is to manually kill all background jobs before exiting the main command, with e.g.:

jobs -p | xargs -r kill
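As a quick local sanity check of this approach (with throwaway sleep jobs standing in for real background work):

```shell
#!/bin/bash
start=$(date +%s)

# Two stand-in background jobs that would otherwise keep running for 30s.
sleep 30 &
sleep 30 &

# jobs -p prints one PID per background job of this shell; kill them all.
pids=$(jobs -p)
echo "${pids}" | xargs -r kill

# Reap the killed jobs; without the kill above, this wait would block ~30s.
wait 2>/dev/null
elapsed=$(( $(date +%s) - start ))
echo "exited after ${elapsed}s"
```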

…or just kill all subprocesses:

pkill -P $$
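A similar local sketch for pkill. Here the match is restricted to processes named sleep, so it doesn’t catch any other children this shell happens to have; plain pkill -P $$, as above, targets them all:

```shell
#!/bin/bash
start=$(date +%s)

# Stand-in background job.
sleep 30 &

# Kill children of this shell ($$) whose name matches "sleep".
pkill -P $$ sleep

# With the child killed, this wait returns immediately instead of in ~30s.
wait 2>/dev/null
elapsed=$(( $(date +%s) - start ))
echo "done after ${elapsed}s"
```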

Or, if you can precisely control where you create background jobs (and are confident they don’t spawn background jobs recursively), you can do:

some_background_command &
BACKGROUND_PID=$!

# Rest of script.

kill "${BACKGROUND_PID}" 2>/dev/null
# *Now* we can exit without hanging.

Just beware that this more precise approach is more fragile: subprocesses could get spawned in ways you didn’t anticipate (if not now, then maybe in future versions of your code or production environment), and sometimes the PID in $! is effectively invalid because it belongs to a transient wrapper process rather than the long-lived subprocess(es).
