all render tasks getting reseted

cg-cnu
Posts: 3
Joined: Fri Feb 10, 2017 5:00 am

all render tasks getting reseted

Post by cg-cnu »

Hi guys,

We have had a weird issue over the last few weeks. We run renders overnight and they go well. :) But once in a while, when we come in in the morning and start checking from the monitors (web/afwatch), the network freezes for a minute. Once it is back, all the renders that have been running for hours are restarted on all the nodes in one go. :shock:

Here are the afanasy logs from the server...

Code: Select all


Fri 10 Feb 08:27.54: Monitor zombie: coord@pc-14[17] e'firefox' 192.168.1.79
Fri 10 Feb 08:27.54: Monitor zombie: coord@pc-21[24] e'firefox' 192.168.1.79
Fri 10 Feb 08:27.54: Monitor zombie: <user>@<FQDN> e'2.1.0' 192.168.1.76
Fri 10 Feb 08:27.54: Monitor zombie: coord@pc[2] e'firefox' 192.168.1.57
Fri 10 Feb 08:28.14: Render: <ip> - ZOMBIETIME
Fri 10 Feb 08:28.14: Render Offline:  off                    <FQDN>@render[31] 192.168.1.96
Fri 10 Feb 08:28.14: Render: <ip> - ZOMBIETIME
Fri 10 Feb 08:28.14: Render Offline:  off                    <FQDN>@render[30] 192.168.1.98
Fri 10 Feb 08:28.14: Render: <ip> - ZOMBIETIME
Fri 10 Feb 08:28.14: Render Offline:  off                    <FQDN>@render[8] 192.168.1.93
.......
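
For anyone trying to pin down when this happens, a minimal sketch that counts 'Monitor zombie' and 'Render Offline' lines per timestamp might help line the bursts up with the morning freeze. It assumes the log format shown above; the function name `zombie_bursts` and the log path in the usage comment are hypothetical, not part of afanasy.

```shell
# Reads afanasy server log lines on stdin and prints
# "<HH:MM.SS> <kind> <count>" for every timestamp that has
# 'Monitor zombie' or 'Render Offline' entries.
zombie_bursts() {
    awk '
        {
            kind = ""
            if ($0 ~ /Monitor zombie:/) kind = "monitor"
            else if ($0 ~ /Render Offline:/) kind = "render"
            # Timestamps in the log look like 08:27.54
            if (kind != "" && match($0, /[0-9][0-9]:[0-9][0-9]\.[0-9][0-9]/)) {
                n[substr($0, RSTART, RLENGTH) " " kind]++
            }
        }
        END { for (k in n) print k, n[k] }
    ' | sort
}

# Usage (hypothetical path, adjust to wherever your server log is written):
#   zombie_bursts < /path/to/afserver.log
```

Many clients going zombie within the same few seconds would point at the server being unreachable as a whole rather than at individual machines.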
timurhai
Site Admin
Posts: 911
Joined: Sun Jan 15, 2017 8:40 pm
Location: Russia, Korolev

Re: all render tasks getting reseted

Post by timurhai »

Hi.

Sorry, too little info, I can't say anything yet. Try to 'catch' the bug.
Timur Hairulin
CGRU 3.3.1, Ubuntu 20.04, 22.04, MS Windows 10 (clients only).
cg-cnu
Posts: 3
Joined: Fri Feb 10, 2017 5:00 am

Re: all render tasks getting reseted

Post by cg-cnu »

Hi Timur,
Is there anywhere specific I should look? Any info that would help you understand the issue?

I am also trying to understand the logs...
A few questions for you:
Why are the monitors showing as 'zombie monitor'?
What does "ZOMBIETIME" mean?

Thanks!
timurhai
Site Admin
Posts: 911
Joined: Sun Jan 15, 2017 8:40 pm
Location: Russia, Korolev

Re: all render tasks getting reseted

Post by timurhai »

The term 'zombie' is taken from Linux processes.
( A node class can't delete itself, but it can decide that it is no longer needed: it sets its zombie flag to true, and then the container deletes zombie nodes. )

Renders and monitors become zombies when they have not connected to the server for some time.
For monitors it is 40 seconds by default:
https://github.com/CGRU/cgru/blob/maste ... .json#L174
So your clients, which should ask the server for new events every second, can't reach the server for a long time.
Maybe the server's connection queue is full?
How many clients does your afserver serve?
How many open sockets does it have?
What state are the open sockets in?
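
The timeout rule described above boils down to a simple comparison. This is only an illustration of the idea, not afanasy's actual code: `is_zombie` and its arguments are made-up names, and 40 is the monitor default mentioned above.

```shell
# is_zombie LAST_SEEN NOW [TIMEOUT]
# Succeeds (exit status 0) when a client has been silent for longer than
# TIMEOUT seconds. All values are plain seconds, e.g. from `date +%s`.
is_zombie() {
    last_seen=$1
    now=$2
    timeout=${3:-40}   # 40 s: the monitor default mentioned in the post
    [ $(( now - last_seen )) -gt "$timeout" ]
}

# Example: a monitor last heard from 45 seconds ago
if is_zombie 100 145; then
    echo "zombie"
else
    echo "alive"
fi
```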

There are lots of commands and utilities for socket monitoring; here are some:

Code: Select all

netstat -nat | grep 51000 | wc -l
netstat -nat | egrep ':51000.*:.*TIME_WAIT' | wc -l
ss -tan state time-wait | wc -l
ss -tan 'sport = :51000' | awk '{print $(NF)" "$(NF-1)}' | sed 's/:[^ ]*//g' | sort | uniq -c
( These are separate example commands; do not type all the lines together. )
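
If it helps, the same kind of check can be wrapped so that it breaks connections down by TCP state instead of just counting lines. A sketch, assuming `ss` from iproute2 and its usual `ss -tan` column order (State, Recv-Q, Send-Q, Local, Peer); `count_states` is a hypothetical helper name and 51000 is the port from the commands above.

```shell
# Reads `ss -tan` output on stdin and prints "<count> <state>" for every
# socket whose local port equals the first argument (default 51000).
count_states() {
    port="${1:-51000}"
    awk -v port="$port" '
        # Skip the header line; column 4 is "local address:port"
        NR > 1 && $4 ~ (":" port "$") { n[$1]++ }
        END { for (s in n) print n[s], s }
    ' | sort -k2
}

# Usage on the server:
#   ss -tan | count_states 51000
```

Unlike piping straight into `wc -l`, this skips the header line, so the numbers are not off by one.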

PS
Even jobs and users become zombies just before deletion, but that is not visible.
cg-cnu
Posts: 3
Joined: Fri Feb 10, 2017 5:00 am

Re: all render tasks getting reseted

Post by cg-cnu »

Here are the results of the commands, run just now on the server.

Code: Select all

netstat -nat | grep 51000 | wc -l
297
netstat -nat | egrep ':51000.*:.*TIME_WAIT' | wc -l
297
ss -tan state time-wait | wc -l
309
ss -tan 'sport = :51000' | awk '{print $(NF)" "$(NF-1)}' | sed 's/:[^ ]*//g' | sort | uniq -c

     32 192.168.1.53 192.168.1.63
     54 192.168.1.57 192.168.1.63
     98 192.168.1.79 192.168.1.63
     33 192.168.1.82 192.168.1.63
     53 192.168.1.90 192.168.1.63
      1 Address Peer

The issue happens only in the morning, after a full night of renders.
I will check the stats in the morning and update you.

Maybe the server's connection queue is full?
I will monitor the socket status and let you know.

How many clients does your afserver serve?
We have around 50.

Thanks for the help! I will get back with more information.
timurhai
Site Admin
Posts: 911
Joined: Sun Jan 15, 2017 8:40 pm
Location: Russia, Korolev

Re: all render tasks getting reseted

Post by timurhai »

That is not many clients.
Not many open sockets either.
It should work.
You can also raise the open file descriptor limit in the system. The default for most distributions is 1024, which is not much for a server; set it to at least 10240.
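
A quick way to see where a machine stands is to compare the current soft limit with the suggested value. A minimal sketch: `check_nofile` is a hypothetical helper, 10240 is the figure suggested above, and `ulimit -n` reports the limit of the current shell only, which may differ from the limit the afserver process is actually running with.

```shell
# check_nofile [WANTED] [CURRENT]
# Prints "ok: N" when the open-descriptor limit is at least WANTED,
# otherwise "low: N < WANTED". CURRENT defaults to this shell's soft limit.
check_nofile() {
    want="${1:-10240}"
    cur="${2:-$(ulimit -n)}"
    if [ "$cur" = "unlimited" ] || [ "$cur" -ge "$want" ]; then
        echo "ok: $cur"
    else
        echo "low: $cur < $want"
    fi
}

# Usage:
#   check_nofile          # compare this shell's limit with 10240
#   ulimit -n 10240       # raise it for the current shell (up to the hard limit)
```

A permanent change is usually made with a `nofile` entry in /etc/security/limits.conf or, for a service started by systemd, with `LimitNOFILE=` in its unit file; which one applies depends on how the server is started.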