Af-Server crashing / hardware performance increase
Posted: Thu Aug 04, 2022 2:50 pm
Hello,
We built our whole pipeline in the Afanasy environment. We are super happy with it and it performs very well. But from time to time the server crashes and needs to be restarted manually.
We have about 700 nodes and 5,000 jobs that Afanasy handles (99% of the time successfully). We are still on version 2.3.0 and are looking forward to updating to v3.3.0 soon.
We try to monitor the crashes but can't confirm that they are related to increased service usage. Mostly it crashes during the afternoon or evening (which, I admit, sounds very much like a load issue).
Watching our server in htop, it mostly shows about 27 processes, with one using 30-45% CPU. The others use about 3-10% CPU each. Because of this, our server's hardware is primarily focused on single-core performance.
When the crashes happen, all 27 processes use 80-95% CPU and all cores max out at 99%. Then we have to restart the afserver.
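To pin down when the load spikes, the htop observation could be logged automatically. Below is a minimal sketch, not anything shipped with Afanasy: it assumes a Linux server with /proc, that the 27 htop rows are threads of one afserver process (htop lists threads as rows by default, so this is a guess), and that the PID is looked up by hand. The function names and the 80% threshold are made up for illustration; the threshold is simply the lower end of the range seen during crashes.

```python
import os
import time
from datetime import datetime


def read_thread_ticks(pid):
    """Return {thread id: user+system CPU time in clock ticks} for one process."""
    ticks = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/stat") as f:
                data = f.read()
        except FileNotFoundError:
            continue  # thread exited between listdir() and open()
        # The command name may contain spaces, so split after the closing ")".
        fields = data.rsplit(")", 1)[1].split()
        # fields[0] is the state (stat field 3); utime and stime are fields 14 and 15.
        ticks[int(tid)] = int(fields[11]) + int(fields[12])
    return ticks


def cpu_percent(before, after, interval, hz):
    """CPU usage per thread over the interval, in percent of one core."""
    return {
        tid: 100.0 * (after[tid] - before[tid]) / (hz * interval)
        for tid in after
        if tid in before
    }


def all_saturated(usage, threshold=80.0):
    """True if there is at least one thread and every thread is at or above threshold."""
    return bool(usage) and all(p >= threshold for p in usage.values())


def monitor(pid, interval=5.0, threshold=80.0):
    """Print one timestamped line per interval; mark samples where all threads are busy."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    before = read_thread_ticks(pid)
    while True:
        time.sleep(interval)
        after = read_thread_ticks(pid)
        usage = cpu_percent(before, after, interval, hz)
        flag = "SATURATED" if all_saturated(usage, threshold) else "ok"
        peak = max(usage.values(), default=0.0)
        print(f"{datetime.now().isoformat()} threads={len(usage)} peak={peak:.1f}% {flag}")
        before = after
```

Calling `monitor(<afserver pid>)` and redirecting the output to a file would give a timeline that can be compared with job submissions and render activity around each crash.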
Does anyone have experience with this kind of issue? How did you fix it?
What is the right hardware for running the af server at this scale? How important is multi-core performance compared to single-core performance?
Regarding the versions: did hardware usage or efficiency improve, meaning that updating is our chance to fix this?
Thanks a lot, I would be very grateful for any suggestions or ideas.
Eberrippe