Automatic creation of restart point on HPC system - did something go wrong?

Hello,

I was running a transient structural simulation on an HPC system. After a couple of days of running normally, with about 20% of the simulation complete, the solve stopped. Neither the solver output report nor the .err file contains any error messages, and I am struggling to determine why it stopped. In the Workbench outline (see image below), there is a symbol next to 'Solution' that relates to solution restarts (https://ansyshelp.ansys.com/account/secured?returnurl=/Views/Secured/corp/v201/en/wb_sim/ds_solution_restarts.html%23ds_restarts_managing).

What would cause the simulation to stop running and automatically create a solution restart? Did something go wrong? Can I just continue the solution from this restart point?


Thank you!

Answers

  • mrife PHL Forum Coordinator

    @RGDiscovery Hi. How was the model launched on the HPC system? Is RSM configured to submit to the cluster? If so, do you still have the RSM log, and does it offer any clues?

    If you submitted the job manually, was it a direct submission, i.e. did you issue the MAPDL command to start the batch solve? Or did you submit the job via a job scheduler? If a job scheduler was used, are there any scheduler logs to check?

    Review the solve output file - some 'clue' may not be an error or warning message.

    If a job scheduler was used, either by direct submission or from Mechanical via RSM, does the job scheduler queue have a time limit?
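One way to test the time-limit theory is to compare the job's elapsed wall time with the queue limit. The sketch below assumes both values are available as strings in the `[D-]HH:MM:SS` style that schedulers such as Slurm commonly report; the function names, the 2-day limit, and the 120-second tolerance are illustrative assumptions, not values from this thread.

```python
def parse_walltime(text: str) -> int:
    """Convert '[D-]HH:MM:SS' or 'MM:SS' into a number of seconds."""
    days = 0
    if "-" in text:
        # Optional leading day count, e.g. '2-00:00:00'
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    if len(parts) == 3:
        hours, minutes, seconds = parts
    elif len(parts) == 2:
        hours = 0
        minutes, seconds = parts
    else:
        raise ValueError(f"unrecognised wall-time format: {text!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def hit_time_limit(elapsed: str, limit: str, tolerance_s: int = 120) -> bool:
    """True if the job ran to within tolerance_s seconds of the queue limit."""
    return parse_walltime(elapsed) >= parse_walltime(limit) - tolerance_s


# Hypothetical values: a job that ended 12 s past a 2-day queue limit.
print(hit_time_limit("2-00:00:12", "2-00:00:00"))
```

If the elapsed time lands almost exactly on the limit, a scheduler timeout is a likely explanation for a solve that stops without an error message.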

    Is the HPC system running Linux? If so, ask the cluster admin about any cgroup rules in place, or any other OS rules on hardware usage. A cgroup rule on RAM usage (e.g. don't let a compute node use more than 95% of RAM) can lead the OS to kill a process that is using a lot of RAM, so the OS may have killed the job. That may or may not be evident in the solve output file, the job scheduler log, or the RSM log.
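If the cluster uses cgroup v2, the kernel keeps per-cgroup counters in a `memory.events` file, including an `oom_kill` count. The sketch below parses the text of such a file. Where the job's cgroup lives is site-specific, and the cgroup is usually removed once the job ends, so this is mainly something an admin could check while a job is running; the sample text is made up for illustration.

```python
def parse_memory_events(text: str) -> dict:
    """Parse cgroup v2 memory.events content ('key value' per line)."""
    events = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            events[fields[0]] = int(fields[1])
    return events


def was_oom_killed(events: dict) -> bool:
    """True if the kernel OOM-killed at least one process in this cgroup."""
    return events.get("oom_kill", 0) > 0


# Made-up sample; in practice the text would be read from the job's
# cgroup directory (path is an assumption, ask the cluster admin).
sample = "low 0\nhigh 0\nmax 37\noom 1\noom_kill 1\n"
print(was_oom_killed(parse_memory_events(sample)))
```

A non-zero `oom_kill` count would support the idea that the OS, rather than the solver, ended the run.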

    You can restart the solution from time 0.11 at least. But you may want to find out about any time and/or hardware usage rules first, so you don't run into this again in a few days.

    Mike
