tasks failing to stop

Post by **keyframe** » Sun Jan 19, 2020 4:51 pm

I'm hoping for some clues on how to debug a problem I've recently started having.

Since I upgraded to 2.3.1 (from 2.2.3) and moved from CentOS 7 to CentOS 8, I've run into a situation where deleting a job via the web interface appears to delete the job, but the process keeps running on the afrender client.

Has anyone seen this? Any thoughts on how to debug?

G

Post by **timurhai** » Mon Jan 20, 2020 9:42 am

Hi.
Try looking at the output of the afserver and afrender processes.

Post by **keyframe** » Mon Jan 20, 2020 3:44 pm

Where are those logged?

Post by **timurhai** » Mon Jan 20, 2020 4:25 pm

If you installed the Linux packages, they are logged by systemd.

Sometimes it is easier to catch an error by launching afserver and afrender manually in a terminal.
That way you can watch the logs in real time.

Post by **keyframe** » Mon Jan 20, 2020 5:03 pm

Heya Timur,

Searching through journalctl entries on client:

Jan 20 12:00:45 tws12 _afrender.sh[3788]: INFO Finished PID=12369: Exit Code=15 Status=0 (stopped)
Jan 20 12:00:45 tws12 _afrender.sh[3788]: INFO Task terminated/killed by signal: 'Terminated'

on the server:

Jan 20 12:00:19 tsr02 _afserver.sh[2373]: Mon 20 Jan 12:00.19: Job registered: "JOBSNAMEHERE"[6]: gene@tws12[1] - 33292 bytes.
Jan 20 12:00:45 tsr02 _afserver.sh[2373]: Mon 20 Jan 12:00.45: Deleting a job: "JOBNAMEHERE"[6]: gene@tws12[1] - 33601 bytes.
Jan 20 12:00:45 tsr02 _afserver.sh[2373]: ERROR Mon 20 Jan 12:00.45: AFCommon::writeFile: /var/tmp/afanasy/jobs/0/6.JOBNAMEHERE/data.json.tmp
Jan 20 12:00:45 tsr02 _afserver.sh[2373]: No such file or directory

However, the process is very much alive, and busy generating frames as it was before.

G

Post by **keyframe** » Mon Jan 20, 2020 5:07 pm

Here's something interesting.

The PID of the task running on tws12 is 12377, NOT 12369 like the afrender log seems to suggest.

G

Post by **timurhai** » Mon Jan 20, 2020 5:23 pm

The issue is not "jobs failing to delete" but "child tasks failing to stop".
You can test this not only by deleting a job but with any stop, skip, or retry; the result should be the same.

afrender runs a command, and that command can spawn several child processes.
To be able to terminate/kill all of them, it creates a new session with setsid() just before the main task process starts (before it can spawn any child process):
https://github.com/CGRU/cgru/blob/maste ... ss.cpp#L35
https://www.google.com/search?q=man+setsid
Later, to send a signal to all of the processes, afrender uses killpg(getpgid()):
https://github.com/CGRU/cgru/blob/maste ... s.cpp#L622

This works in most common cases.
You can try running and stopping your task some other way (manually in a terminal) to find out how it should be stopped properly.
You can also look through the command's flags; maybe there is an option that controls whether it creates a new process group (session).
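
The setsid()/killpg() pattern described above can be tried in isolation. The following is only a minimal Python sketch of the same idea, not afrender's actual C++ code; the `sh -c "sleep 60 & wait"` command is a stand-in chosen for illustration, and it assumes a POSIX system.

```python
import os
import signal
import subprocess
import time

# Start a shell that spawns a background child of its own.
# start_new_session=True makes Python call setsid() in the child before
# exec, so the shell becomes the leader of a new session and process group.
proc = subprocess.Popen(
    ["sh", "-c", "sleep 60 & wait"],  # stand-in for a render command
    start_new_session=True,
)
time.sleep(0.5)  # give the shell a moment to start its child

pgid = os.getpgid(proc.pid)
same_group_as_parent = (pgid == os.getpgid(0))

# A single killpg() reaches every process that is still in that group.
os.killpg(pgid, signal.SIGTERM)
returncode = proc.wait(timeout=10)

print("group leader is the task itself:", pgid == proc.pid)
print("shares our process group:", same_group_as_parent)
print("return code:", returncode)
```

The catch is the phrase "still in that group": a descendant that moves itself into a different process group or session is no longer covered by that killpg() call.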

Post by **keyframe** » Mon Jan 20, 2020 6:43 pm

Thanks for the insight. I'll debug further.

I wonder whether this is related to me switching users as part of the task.

I've added su - <submitting_user_name> -c "<command to execute>" to the command so that the resulting frames are owned by the person who submitted the render, rather than by the user that runs the afrender daemon.

What's really puzzling me, though, is that it only does this some of the time. Most of the time, the child processes stop as expected.

G

Post by **keyframe** » Mon Jan 20, 2020 9:10 pm

The process tree looks like this. I've also noticed that su behaves a little differently between CentOS 7 and 8 regarding permissions, so perhaps there's more going on there that I'm unaware of.

systemd(1)─┬─ModemManager(1097)─┬─{ModemManager}(1149)
           │                    └─{ModemManager}(1165)
           ├─NetworkManager(1123)─┬─{NetworkManager}(1161)
           │                      └─{NetworkManager}(1169)
           ├─accounts-daemon(1452)─┬─{accounts-daemon}(1454)
           │                       └─{accounts-daemon}(1457)
           ├─afrender(1968)───su(21906)───bash(21907)───hython-bin(21914)─┬─{hython-bin}(21924)
           │                                                              ├─{hython-bin}(21926)
           │                                                              ├─{hython-bin}(21927)
           │                                                              ├─{hython-bin}(21928)
           │                                                              ├─{hython-bin}(21929)
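
One way to check whether the processes in a tree like this still share a process group and session is to compare their IDs. This is a small Python sketch under the assumption that it runs on the render client while the task is alive; `describe` and `same_group` are hypothetical helper names, not part of CGRU.

```python
import os

def describe(pid):
    """Return (pid, process group ID, session ID) for a running process."""
    return pid, os.getpgid(pid), os.getsid(pid)

def same_group(pid_a, pid_b):
    """True if both processes are in the same process group."""
    return os.getpgid(pid_a) == os.getpgid(pid_b)

if __name__ == "__main__":
    # Example on the current process; on the render client you would pass
    # the PIDs from the tree instead (e.g. the su and hython-bin PIDs).
    print(describe(os.getpid()))
```

If `same_group()` returns False for the su process and the hython-bin process, a signal sent to the group of the first would not be expected to reach the second.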

Post by **timurhai** » Tue Jan 21, 2020 12:21 pm

I think that "su -" creates a new session and process group, so afrender's method of stopping all child processes does not work.
Since you wrote a command wrapper using "su -", maybe there is a way for that wrapper to stop all the children too.
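
One possible approach for such a wrapper is to follow parent/child links rather than process groups, since those links survive a change of session. The sketch below is hypothetical and is not part of CGRU: it assumes Linux with /proc mounted, the function names are made up for illustration, and the caller needs permission to signal processes owned by the su target user.

```python
import os
import signal

def read_ppid(pid):
    """Return the parent PID from /proc/<pid>/stat, or None if it is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        return None
    # The command name is in parentheses and may contain spaces, so split
    # after the last ')'. The fields that follow are: state, ppid, ...
    fields = stat.rsplit(")", 1)[1].split()
    return int(fields[1])

def descendants(root_pid):
    """Return all PIDs below root_pid in breadth-first (level) order."""
    children = {}
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            ppid = read_ppid(int(entry))
            if ppid is not None:
                children.setdefault(ppid, []).append(int(entry))
    found, queue = [], [root_pid]
    while queue:
        for child in children.get(queue.pop(0), []):
            found.append(child)
            queue.append(child)
    return found

def terminate_tree(root_pid, sig=signal.SIGTERM):
    """Signal everything below root_pid (deepest first), then root_pid."""
    for pid in list(reversed(descendants(root_pid))) + [root_pid]:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the process already exited
```

A wrapper could install a SIGTERM handler that calls `terminate_tree()` on the su process it started. Note that the PID list is collected before any signal is sent, because children of a killed parent are re-parented and would no longer be found under the original root.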
