Fluids

An error has occurred in cfx5solve: The ANSYS CFX solver exited with return code 9.

    • matthe18
      Subscriber

      I receive this error when my case reaches the solution stage. I suspect memory allocation is the root cause, but no other information is given, so I cannot be certain. The only hint I have found so far is an older post suggesting an increase to the catalogue memory allocation.

      My recent case submissions have had the following memory modification flags and all returned code 9:

      -size 2.5 -single -large

      -size 2.9 -single -large

      -size-cat 10.0x -size-nr 2.0x -size-ni 2.0x -single -large

      -size-cat 10.0x -size-nr 2.5x -size-ni 2.5x -single -large

      -size-cat 10.0x -size-nr 3.0x -size-ni 3.0x -single -large

      Any advice or experience with this return code would be very helpful to me. Thank you for considering my issue.

    • Surya Deb
      Ansys Employee
      Hello,
      Can you check the out file to see if there are any other details printed out?
      Can you embed an image of the out file with relevant errors/warnings?
      Regards SD

    • matthe18
      Subscriber
      Hi SD, thank you for your response. I have attached images of the out files from a few of my most recent attempted runs. The memory allocation modifications for each run, hopefully in the same order as the images, were:
      run15: -size 2.9 -single -large
      run18: -size-cat 10.0x -size-nr 2.5x -size-ni 2.5x -single -large
      run19: -size-cat 20.0x -size-nr 2.5x -size-ni 2.5x -single -large
      run20: -size-cat 30.0x -size-nr 2.5x -size-ni 2.5x -single -large
      I have certainly been throwing a lot at the wall to see what sticks. I am noticing that the -size modifier has allowed the solution to reach the coefficient loop iteration phase on a few occasions (run15 is posted as an example).
    • matthe18
      Subscriber
      I just searched for this message (image attached), which is returned via the console, and a cursory look seems to indicate that return code 9 could be an OS issue?

    • Surya Deb
      Ansys Employee
      Hello,
      Since this happens immediately after the calculation starts, I would also suspect the initial conditions and/or setup issues.
      Could you double-check your initial conditions and the setup? Also, could you test this by running on a different core count?
      Regards SD
    • matthe18
      Subscriber
      Hi SD, thank you for the suggestion. I already had a job with a different core count in the queue and was waiting on the results (a 25% increase in requested nodes/allocatable memory). This job has also exited with return code 9. I also reached out to someone in the research computing department, who mentioned that signal 9 is most commonly associated with an out-of-memory watchdog. In fact, when digging into it a bit more, they were able to tell me that run20 from above had actually returned a different exit code, associated with an out-of-memory state within our job scheduler, that was not reported to either the CFX out file or the console* out file.
      Run20, according to the CFX out file, had a maximum node memory usage of 75%, with an overall total of 70% across all nodes. When querying the job information, the scheduler reports a maximum node memory usage of 242.78G (of 256G available per node), which is closer to 95% of the node memory. I was told this maximum is the last valid (within boundaries) value returned before the job exceeded the maximum memory available. Research computing suspects that CFX is somehow requesting more memory than allocated, although the out file does tell me that memory usage may be less than the reported allocation.
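
The gap between the two memory figures above can be checked with a few lines of arithmetic. This is only a rough sketch: the function name `memory_fraction` is made up for illustration, and it assumes the 75% figure from the CFX out file is measured against the same 256G per node that the scheduler uses, which the thread does not confirm.

```python
def memory_fraction(used_gb: float, available_gb: float) -> float:
    """Return used memory as a fraction of the memory available on a node."""
    if available_gb <= 0:
        raise ValueError("available_gb must be positive")
    return used_gb / available_gb


NODE_MEMORY_GB = 256.0       # per-node memory quoted in the post
SCHEDULER_PEAK_GB = 242.78   # peak node usage reported by the scheduler
CFX_REPORTED_FRACTION = 0.75 # peak node usage reported in the CFX out file

scheduler_fraction = memory_fraction(SCHEDULER_PEAK_GB, NODE_MEMORY_GB)
print(f"Scheduler-reported peak: {scheduler_fraction:.1%}")  # 94.8%

# Assumption: the CFX percentage refers to the same 256G node.
cfx_peak_gb = CFX_REPORTED_FRACTION * NODE_MEMORY_GB
unaccounted_gb = SCHEDULER_PEAK_GB - cfx_peak_gb
print(f"Memory not reflected in the CFX figure: {unaccounted_gb:.2f}G")  # 50.78G
```

Under that assumption, roughly 50G per node is in use that the CFX out file does not report, which would be consistent with the out-of-memory explanation.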
      As far as my setup goes, I believe it should be valid. I ran the same model as steady-state to initialize this transient model without similar issues. I will see if there are any glaring errors in the setup.
      Given everything above, my plan moving forward is to try to cap CFX memory at a limit that stays within the node allocation, unless another reason for the exit errors becomes apparent.
      Any further advice or recommendations are, of course, welcome.
      Thank you again matthe18

      *Edit: Looking at the console output again, I missed a line informing me "slurmstepd: error: Detected 939 oom-kill event(s) in step 3979731.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler."
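
Since the out-of-memory notice was easy to miss in the console output, a small script can scan for it after each run. This is a minimal sketch, not a scheduler tool: the regular expression is written against the single `slurmstepd` line quoted above, so other Slurm versions may word the message differently, and `find_oom_kills` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Pattern based only on the slurmstepd line quoted in the post.
OOM_PATTERN = re.compile(
    r"slurmstepd: error: Detected (\d+) oom-kill event\(s\) in step (\S+) cgroup"
)


def find_oom_kills(console_text: str) -> List[Tuple[str, int]]:
    """Return (step id, number of oom-kill events) for each matching line."""
    return [
        (match.group(2), int(match.group(1)))
        for match in OOM_PATTERN.finditer(console_text)
    ]


sample = (
    "slurmstepd: error: Detected 939 oom-kill event(s) in step "
    "3979731.batch cgroup. Some of your processes may have been killed "
    "by the cgroup out-of-memory handler."
)
for step, count in find_oom_kills(sample):
    print(f"step {step}: {count} oom-kill event(s)")
```

In practice the text would be read from the job's console output file rather than from an inline string.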