Fluids

Inconsistent MPI Issues

    • heisenmech
      Subscriber
      Hi all,
      I've been experiencing MPI problems on clusters, leading to failure of simulations. It is odd because it is very inconsistent: it sometimes runs for days and then fails, and sometimes it fails almost instantly. The same cases were tested on a different cluster and ran fine.
      I usually have a structured mesh with 30+ million nodes for LES. I was curious if anyone else is also experiencing such parallelisation problems with Fluent (v20.1).
      Please see below for the error.
      Best,
      Oguzha
      fluent_mpi.20.1.0: Rank 0:84: MPI_Bcast: 863: IBV connection to 96 (pid 20275) on channel 0 is broken. ibv_poll_cq(): bad status 12
      fluent_mpi.20.1.0: Rank 0:84: MPI_Bcast: self cnode1033 peer cnode1034 (rank: 96)
      fluent_mpi.20.1.0: Rank 0:84: MPI_Bcast: error message: transport retry exceeded error
      fluent_mpi.20.1.0: Rank 0:84: MPI_Bcast: Internal MPI error
      srun: forcing job termination
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      slurmstepd: error: *** STEP 2498796.0 ON cnode1005 CANCELLED AT 2021-03-06T06:56:26 ***
      srun: error: cnode1101: tasks 147,156,158: Killed
      srun: Terminating job step 2498796.0
      srun: error: cnode1101: task 146: Killed
      srun: error: cnode1005: tasks 2,5-6,13: Killed
      srun: error: cnode1100: task 140: Killed
      srun: error: cnode1034: task 102: Killed
      srun: error: cnode1101: tasks 150,157: Killed
      srun: error: cnode1005: tasks 3,7,9-11,14: Killed
      srun: error: cnode1006: tasks 17-19,22,24,27,31: Killed
      srun: error: cnode1100: tasks 133,135,141-143: Killed
      srun: error: cnode1024: task 45: Killed
      srun: error: cnode1032: tasks 65-66,74,77: Killed
      srun: error: cnode1033: task 84: Exited with exit code 16
      srun: error: cnode1033: tasks 87,89,92,94-95: Killed
      srun: error: cnode1101: task 152: Killed
      srun: error: cnode1025: tasks 48,51,59-60: Killed
      srun: error: cnode1005: tasks 1,4: Killed
      srun: error: cnode1034: tasks 98,108,110: Killed
      srun: error: cnode1060: task 116: Killed
      srun: error: cnode1032: tasks 67-68,73,75: Killed
      srun: error: cnode1033: tasks 85,91: Killed
      srun: error: cnode1101: task 155: Killed
      srun: error: cnode1025: tasks 52,61: Killed
      srun: error: cnode1034: task 101: Killed
      srun: error: cnode1100: tasks 131,134,136,138: Killed
      srun: error: cnode1032: task 79: Killed
      The fluent process could not be started.
      real	1476m53.643s
      user	5m24.318s
      sys	2m54.340s
    • Rob
      Ansys Employee
      If it's random, check on the system side. You're looking for RAM leaks (I'm not aware of any issues) and random acts of IT. Is the head node also on the cluster?
    • heisenmech
      Subscriber
      We've been testing different solvers as well, and there are no issues with them at all. The IT people wanted me to check with ANSYS whether it's some sort of bug with the parallelisation. Yes, the head node is on the cluster.
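For reference, `bad status 12` from `ibv_poll_cq()` is `IBV_WC_RETRY_EXC_ERR` in the libibverbs `ibv_wc_status` enum: the InfiniBand reliable-connection transport gave up after exhausting its retries, which usually points at fabric or node health rather than the solver itself. A possible first-pass check to run on the nodes named in the log (`cnode1033`/`cnode1034` here) is sketched below; `ibstat`, `dmesg`, and `free` are standard Linux/OFED tools, and the function name `check_ib_node` is made up for this sketch.

```shell
# Hypothetical per-node health check for a "transport retry exceeded" failure.
# Each tool is guarded so the script degrades gracefully where one is missing.
check_ib_node() {
    # InfiniBand port state: ports carrying MPI traffic should report
    # "State: Active" and "Physical state: LinkUp".
    if command -v ibstat >/dev/null 2>&1; then
        ibstat | grep -E 'State|Physical state|Rate'
    else
        echo "ibstat: not installed on this node"
    fi

    # Recent RDMA/Mellanox driver messages (link flaps, completion errors).
    dmesg 2>/dev/null | grep -iE 'mlx|infiniband' | tail -n 20

    # Memory headroom: a leak on one node can kill ranks there and break the
    # broadcast on every other node, matching the random failure pattern.
    if command -v free >/dev/null 2>&1; then
        free -h | head -n 2
    fi
}

check_ib_node
```

Running this on each node in the failed job's node list would help narrow down whether the broken link is specific to one host pair or a wider fabric problem.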