tasks failing to stop

keyframe
Posts: 62
Joined: Sat Jan 21, 2017 9:43 pm
Location: Toronto

Post by keyframe »

I'm hoping for some clues on how to debug a problem I've recently started having.

Since I upgraded to 2.3.1 (from 2.2.3) and from CentOS 7 to CentOS 8, I've run into a situation where deleting a job via the web interface appears to delete the job, but the process keeps running on the afrender client.

Has anyone seen this? Any thoughts on how to debug?

G
--
Rocky Linux 8.5, cgru 3.2.1
timurhai
Site Admin
Posts: 916
Joined: Sun Jan 15, 2017 8:40 pm
Location: Russia, Korolev

Re: tasks failing to stop

Post by timurhai »

Hi.
Try looking at the output of the afserver and afrender processes.
Timur Hairulin
CGRU 3.3.1, Ubuntu 20.04, 22.04, MS Windows 10 (clients only).
keyframe

Re: tasks failing to stop

Post by keyframe »

Where are those logged?
timurhai

Re: jobs failing to delete

Post by timurhai »

If you installed the Linux packages, the services run under systemd, so their output goes to the systemd journal (journalctl).

Sometimes it is easier to catch an error by launching afserver and afrender manually in a terminal.
That way you can watch the logs in real time.
keyframe

Re: tasks failing to stop

Post by keyframe »

Heya Timur,

Searching through journalctl entries on client:

Jan 20 12:00:45 tws12 _afrender.sh[3788]: INFO Finished PID=12369: Exit Code=15 Status=0 (stopped)
Jan 20 12:00:45 tws12 _afrender.sh[3788]: INFO Task terminated/killed by signal: 'Terminated'

on the server:

Jan 20 12:00:19 tsr02 _afserver.sh[2373]: Mon 20 Jan 12:00.19: Job registered: "JOBSNAMEHERE"[6]: gene@tws12[1] - 33292 bytes.
Jan 20 12:00:45 tsr02 _afserver.sh[2373]: Mon 20 Jan 12:00.45: Deleting a job: "JOBNAMEHERE"[6]: gene@tws12[1] - 33601 bytes.
Jan 20 12:00:45 tsr02 _afserver.sh[2373]: ERROR Mon 20 Jan 12:00.45: AFCommon::writeFile: /var/tmp/afanasy/jobs/0/6.JOBNAMEHERE/data.json.tmp
Jan 20 12:00:45 tsr02 _afserver.sh[2373]: No such file or directory

However, the process is very much alive, and busy generating frames as it was before.

G
keyframe

Re: tasks failing to stop

Post by keyframe »

Here's something interesting.

The PID of the task running on tws12 is 12377, NOT 12369 like the afrender log seems to suggest.

G
timurhai

Re: tasks failing to stop

Post by timurhai »

The issue is not "jobs failing to delete" but "child processes failing to stop".
You can test it not only with job deletion but with any stop, skip, or retry; the result should be the same.

Afrender runs a command, and that command can spawn several child processes.
To be able to terminate/kill all of them, it starts a new session with setsid() just before the main task process starts (before it can spawn any child process):
https://github.com/CGRU/cgru/blob/maste ... ss.cpp#L35
https://www.google.com/search?q=man+setsid
Later, to send a signal to all of those processes, afrender uses killpg(getpgid()):
https://github.com/CGRU/cgru/blob/maste ... s.cpp#L622

This works in most common cases.
You can try to run and stop your task some other way (manually in a terminal) to find out how it should be stopped properly.
You can also look through the command's flags; maybe there is an option that controls whether it creates a new process group (session).
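
The setsid()/killpg() pattern described above can be sketched in Python. This is only an illustration of the technique, not afrender's actual C++ code; the helper names start_in_new_session and stop_group are made up for the example, and the "sh"/"sleep" commands are just stand-ins for a render task.

```python
import os
import signal
import subprocess
import time


def start_in_new_session(argv):
    """Start argv as the leader of a new session (and process group)."""
    # start_new_session=True makes the child call setsid() before exec,
    # so every process it spawns later inherits the same process group.
    return subprocess.Popen(argv, start_new_session=True)


def stop_group(proc, sig=signal.SIGTERM):
    """Send sig to every process in proc's process group."""
    os.killpg(os.getpgid(proc.pid), sig)


if __name__ == "__main__":
    # The shell starts a background child; both share one process group.
    task = start_in_new_session(["sh", "-c", "sleep 30 & wait"])
    time.sleep(0.5)  # give the shell time to spawn its child
    stop_group(task)
    print("exit status:", task.wait())
```

Note that a descendant which calls setsid() or setpgid() itself leaves that group, and the killpg() call no longer reaches it.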
keyframe

Re: tasks failing to stop

Post by keyframe »

Thanks for the insight. I'll debug further.

I wonder whether this is related to me switching users as part of the task.

I've added su - <submitting_user_name> -c "<command to execute>" to the command so that the resulting frames are owned by the person who submitted the render, rather than by the user that runs the afrender daemon.

What's really puzzling me, though, is that it only does this some of the time. Most of the time, the child processes stop as expected.

G
keyframe

Re: tasks failing to stop

Post by keyframe »

The process tree looks like this. I've also noticed that su behaves a little differently between CentOS 7 and CentOS 8 regarding permissions, so perhaps there's more going on there that I'm unaware of.

systemd(1)─┬─ModemManager(1097)─┬─{ModemManager}(1149)
           │                    └─{ModemManager}(1165)
           ├─NetworkManager(1123)─┬─{NetworkManager}(1161)
           │                      └─{NetworkManager}(1169)
           ├─accounts-daemon(1452)─┬─{accounts-daemon}(1454)
           │                       └─{accounts-daemon}(1457)
           ├─afrender(1968)───su(21906)───bash(21907)───hython-bin(21914)─┬─{hython-bin}(21924)
           │                                                              ├─{hython-bin}(21926)
           │                                                              ├─{hython-bin}(21927)
           │                                                              ├─{hython-bin}(21928)
           │                                                              ├─{hython-bin}(21929)
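
For the PIDs in a tree like this, it may help to compare process group and session IDs, since a signal sent to one process group does not reach processes in another. A minimal Python sketch, assuming it is run on the render client while the task is alive (the helper name describe is hypothetical):

```python
import os
import sys


def describe(pid):
    """Return (pid, process group ID, session ID) for a running process."""
    return pid, os.getpgid(pid), os.getsid(pid)


if __name__ == "__main__":
    # Pass the PIDs to inspect as arguments, e.g. those of su, bash and hython-bin.
    for arg in sys.argv[1:]:
        try:
            print("pid=%d pgid=%d sid=%d" % describe(int(arg)))
        except ProcessLookupError:
            print("pid=%s: no such process" % arg)
```

If the rendering process reports a different pgid than the direct child of afrender, that would be consistent with the signal missing it.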
timurhai

Re: tasks failing to stop

Post by timurhai »

I think that after su - a new session and process group are created, so afrender's method of stopping all child processes does not work.
If you wrote a command wrapper using "su -", maybe there is a way to stop all children from that wrapper too.
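
One way such a wrapper could reach processes that left the group is to follow the parent/child links in /proc instead of relying on the process group. The sketch below is an assumption about how that might look, not something CGRU provides: the function names are made up, it is Linux-only, and it only works while the parent links are still intact (a process whose parent has already exited is re-parented and will not be found).

```python
import os
import signal
import subprocess
import sys
import time


def read_ppid(pid):
    """Return the parent PID of pid from /proc, or None if it vanished."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            data = f.read()
        # The command name may contain spaces or ')', so split on the last ')'.
        return int(data.rsplit(")", 1)[1].split()[1])
    except (OSError, IndexError, ValueError):
        return None


def descendants(root_pid):
    """Return all PIDs below root_pid, whatever their session or group."""
    children = {}
    for name in os.listdir("/proc"):
        if name.isdigit():
            ppid = read_ppid(int(name))
            if ppid is not None:
                children.setdefault(ppid, []).append(int(name))
    found, stack = [], [root_pid]
    while stack:
        for pid in children.get(stack.pop(), []):
            found.append(pid)
            stack.append(pid)
    return found


def stop_tree(root_pid, sig=signal.SIGTERM):
    """Signal every descendant of root_pid, then root_pid itself."""
    targets = descendants(root_pid) + [root_pid]
    for pid in targets:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # already gone
    return targets


def demo():
    """Spawn a helper whose child starts its own session, then stop both."""
    helper_code = (
        "import subprocess, time\n"
        "subprocess.Popen(['sleep', '30'], start_new_session=True)\n"
        "time.sleep(30)\n"
    )
    helper = subprocess.Popen([sys.executable, "-c", helper_code])
    deadline = time.time() + 10
    kids = []
    while not kids and time.time() < deadline:
        time.sleep(0.05)
        kids = descendants(helper.pid)
    time.sleep(0.5)  # let the grandchild finish setsid() and exec
    escaped = [p for p in kids if os.getsid(p) != os.getsid(helper.pid)]
    stop_tree(helper.pid)
    return len(kids), len(escaped), helper.wait()


if __name__ == "__main__":
    print(demo())
```

In the demo, the "sleep" grandchild sits in its own session, so a killpg() on the helper's group would miss it, while the tree walk still finds it.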