tasks failing to stop
I'm hoping for some clues on how to debug a problem I've recently started having.
Since I upgraded to 2.3.1 (from 2.2.3) and to centos8 from centos7, I've run into a situation where deleting a job via the web interface appears to delete the job, but the process keeps running on the afrender client.
Has anyone seen this? Any thoughts on how to debug?
G
--
Rocky Linux 8.5, cgru 3.2.1
Re: tasks failing to stop
Hi.
Try looking at the output of the afserver and afrender processes.
Timur Hairulin
CGRU 3.3.1, Ubuntu 20.04, 22.04, MS Windows 10 (clients only).
Re: jobs failing to delete
If you installed the Linux packages, the services run under systemd.
Sometimes, to catch an error, it is easier to launch afserver and afrender manually in a terminal.
That way you can watch the logs in real time.
Timur Hairulin
CGRU 3.3.1, Ubuntu 20.04, 22.04, MS Windows 10 (clients only).
Re: tasks failing to stop
Heya Timur,
Searching through the journalctl entries on the client:
Jan 20 12:00:45 tws12 _afrender.sh[3788]: INFO Finished PID=12369: Exit Code=15 Status=0 (stopped)
Jan 20 12:00:45 tws12 _afrender.sh[3788]: INFO Task terminated/killed by signal: 'Terminated'
On the server:
Jan 20 12:00:19 tsr02 _afserver.sh[2373]: Mon 20 Jan 12:00.19: Job registered: "JOBSNAMEHERE"[6]: gene@tws12[1] - 33292 bytes.
Jan 20 12:00:45 tsr02 _afserver.sh[2373]: Mon 20 Jan 12:00.45: Deleting a job: "JOBNAMEHERE"[6]: gene@tws12[1] - 33601 bytes.
Jan 20 12:00:45 tsr02 _afserver.sh[2373]: ERROR Mon 20 Jan 12:00.45: AFCommon::writeFile: /var/tmp/afanasy/jobs/0/6.JOBNAMEHERE/data.json.tmp
Jan 20 12:00:45 tsr02 _afserver.sh[2373]: No such file or directory
However, the process is very much alive, and busy generating frames as it was before.
G
--
Rocky Linux 8.5, cgru 3.2.1
Re: tasks failing to stop
Here's something interesting.
The PID of the task running on tws12 is 12377, NOT 12369 like the afrender log seems to suggest.
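One way to look into a mismatch like this is to compare the process group and session of the two PIDs. Below is a minimal sketch, assuming a Linux client with Python 3 available; describe and same_group are hypothetical helper names and are not part of CGRU.

```python
import os
import sys


def describe(pid):
    """Return the process group ID and session ID of a running process."""
    return {"pid": pid, "pgid": os.getpgid(pid), "sid": os.getsid(pid)}


def same_group(pid_a, pid_b):
    """True if both processes are in the same process group."""
    return os.getpgid(pid_a) == os.getpgid(pid_b)


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 pginfo.py 12369 12377
    for arg in sys.argv[1:]:
        try:
            print(describe(int(arg)))
        except ProcessLookupError:
            # The process has already exited, so there is nothing to inspect.
            print({"pid": int(arg), "state": "not running"})
```

If the surviving process reports a pgid different from the PID that afrender logged, a signal sent to that PID's process group would not reach it.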
G
--
Rocky Linux 8.5, cgru 3.2.1
Re: tasks failing to stop
The issue is not "jobs failing to delete", but "child tasks failing to stop".
You can test it not just on job deletion but on any stop, skip, or retry; the result should be the same.
Afrender runs a command, and that command can spawn several child processes.
So, to terminate/kill all of them, it creates a new session just before the main task process starts (before it can spawn any child processes) with setsid():
https://github.com/CGRU/cgru/blob/maste ... ss.cpp#L35
https://www.google.com/search?q=man+setsid
Later, to send a signal to all of the processes, afrender uses killpg(getpgid()):
https://github.com/CGRU/cgru/blob/maste ... s.cpp#L622
This works in most common cases.
You can try to run and stop your task some other way (manually in a terminal) to find out how it should be stopped properly.
You can also look through the command's flags; maybe there is an option that controls whether it creates a new process group (session).
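As a rough illustration of that mechanism (not CGRU's actual code, which is C++), here is a minimal Python sketch. It starts a command in its own session and later signals the whole process group. The function name run_then_stop and the sh command are assumptions made for the example.

```python
import os
import signal
import subprocess
import time


def run_then_stop(command, delay=0.5):
    """Start command in its own session, then signal its whole process group."""
    task = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        start_new_session=True,  # the child calls setsid() before exec
    )
    time.sleep(delay)  # give the command time to spawn its children
    pgid = os.getpgid(task.pid)  # equals task.pid: the task leads its own group
    os.killpg(pgid, signal.SIGTERM)  # delivered to every process in that group
    task.stdout.read()  # returns at EOF, once no process holds the pipe open
    task.wait()
    return task.pid, pgid, task.returncode


if __name__ == "__main__":
    # Hypothetical task: a shell that starts a background child and waits for it.
    pid, pgid, code = run_then_stop(["sh", "-c", "sleep 60 & wait"])
    print("pid", pid, "pgid", pgid, "return code", code)
```

This only reaches processes that stay in the task's process group. A child that moves itself into a different group or session is not signalled.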
Timur Hairulin
CGRU 3.3.1, Ubuntu 20.04, 22.04, MS Windows 10 (clients only).
Re: tasks failing to stop
Thanks for the insight. I'll debug further.
I wonder whether this is related to me switching users as part of the task.
I've added a su - <submitting_user_name> -c "<command to execute>" to the command so that the resulting frames are owned by the person who submitted the render, rather than the user that runs the afrender daemon.
What's really puzzling me, though, is that it only happens some of the time. Most of the time, the child processes stop as expected.
G
--
Rocky Linux 8.5, cgru 3.2.1
Re: tasks failing to stop
The process tree looks like this, and I've noticed that su behaves a little differently between centos 7 and 8 regarding permissions -- perhaps there's more going on there that I'm unaware of.
systemd(1)─┬─ModemManager(1097)─┬─{ModemManager}(1149)
           │                    └─{ModemManager}(1165)
           ├─NetworkManager(1123)─┬─{NetworkManager}(1161)
           │                      └─{NetworkManager}(1169)
           ├─accounts-daemon(1452)─┬─{accounts-daemon}(1454)
           │                       └─{accounts-daemon}(1457)
           ├─afrender(1968)───su(21906)───bash(21907)───hython-bin(21914)─┬─{hython-bin}(21924)
           │                                                              ├─{hython-bin}(21926)
           │                                                              ├─{hython-bin}(21927)
           │                                                              ├─{hython-bin}(21928)
           │                                                              ├─{hython-bin}(21929)
--
Rocky Linux 8.5, cgru 3.2.1
Re: tasks failing to stop
I think that "su -" creates a new session and process group, so afrender's method of stopping all child tasks does not work.
If you wrote a command wrapper using "su -", maybe there is a way to stop all the children from that wrapper too.
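To illustrate why a new session would break the group signal, here is a minimal Python sketch. The inner Popen(..., start_new_session=True) is only a stand-in for what su is suspected of doing here; the WRAPPER script and the function name escapes_group_kill are made up for the example.

```python
import os
import signal
import subprocess
import sys

# Stand-in for the wrapper: it starts "sleep" in a NEW session,
# reports the PID of that process, and then waits for it.
WRAPPER = (
    "import subprocess\n"
    "p = subprocess.Popen(['sleep', '60'], start_new_session=True)\n"
    "print(p.pid, flush=True)\n"
    "p.wait()\n"
)


def escapes_group_kill():
    """Return True if the inner process survives a signal sent to the task's group."""
    task = subprocess.Popen(
        [sys.executable, "-c", WRAPPER],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,  # the task leads its own session and group
    )
    inner_pid = int(task.stdout.readline())
    os.killpg(os.getpgid(task.pid), signal.SIGTERM)  # stops the wrapper only
    task.wait()
    task.stdout.close()
    try:
        os.kill(inner_pid, 0)  # signal 0 only checks that the process exists
        survived = True
    except ProcessLookupError:
        survived = False
    if survived:
        # The inner process leads its own group, so its pgid equals its PID.
        os.killpg(inner_pid, signal.SIGTERM)
    return survived


if __name__ == "__main__":
    print("inner process survived:", escapes_group_kill())
```

A wrapper that records the inner PID like this could, in principle, also forward the stop signal to that second group, which is one possible reading of the wrapper idea above.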
Timur Hairulin
CGRU 3.3.1, Ubuntu 20.04, 22.04, MS Windows 10 (clients only).