Fluids

Odd MPI issue when running large simulation

    • JaySmall
      Subscriber

      I've been running scramjet simulations for my research for several months on a coarse mesh, usually without any issues. Recently, I created a finer mesh with improvements in key areas, which increased the total cell count from 2 million to approximately 9.8 million. On the finer mesh, the simulation runs for 100-200 iterations before crashing with errors such as:


      Node 28: Process 14072: Received signal SIGSEGV.

      999999: mpt_accept: error: accept failed: No error (repeated multiple times)
      . . .
      999999: mpt_accept: error: accept failed: Abort(138030991) on node 31 (rank 31 in comm 0): Fatal error in PMPI_Waitall: Other MPI error, error stack:
      PMPI_Waitall(346)..............: MPI_Waitall(count=6, req_array=000000387A7FDE00, status_array=000000387A7FDF00) failed
      MPIR_Waitall(174)..............:
      MPIR_Waitall_impl(55)..........:
      MPIDI_Progress_test(185).......:
      MPIDI_OFI_handle_cq_error(1042): OFI poll failed (netmod\\ofi\\ofi_events.c:1042:MPIDI_OFI_handle_cq_error:Unknown error)
      Abort(70922127) on node 21 (rank 21 in comm 0): Fatal error in PMPI_Waitall: Other MPI error, error stack:
      PMPI_Waitall(346)..............: MPI_Waitall(count=16, req_array=000000E02DBFE170, status_array=000000E02DBFE270) failed
      MPIR_Waitall(174)..............:
      MPIR_Waitall_impl(55)..........:
      MPIDI_Progress_test(185).......:
      MPIDI_OFI_handle_cq_error(1042): OFI poll failed (netmod\\ofi\\ofi_events.c:1042:MPIDI_OFI_handle_cq_error:Unknown error)
      The fl process could not be started.

       

      I've talked to Ansys personnel, who asked me to confirm that Intel MPI was being used; I confirmed this from the .trn file.

      I'm running the simulations in parallel on a machine with the following hardware:


      AMD Threadripper Pro 3975WX, 32 cores
      128 GB DDR4 memory


      Does anyone have experience with this type of error, or suggestions for how to fix it? Thank you in advance for your help.
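
When triaging output like the log quoted in this post, it can help to tally which processes received a signal and which ranks aborted in which MPI call. The following is a minimal sketch, not an Ansys or Intel MPI tool: the function name `summarize_mpi_log` and the regular expressions are assumptions based only on the line formats visible in the excerpt, and other MPI versions may format these lines differently.

```python
import re
from collections import Counter

# Assumed line formats, taken from the log excerpt in this thread:
#   Node 28: Process 14072: Received signal SIGSEGV.
#   Abort(138030991) on node 31 (rank 31 in comm 0): Fatal error in PMPI_Waitall: ...
SIGNAL_RE = re.compile(r"Node (\d+): Process (\d+): Received signal (\w+)")
ABORT_RE = re.compile(
    r"Abort\((\d+)\) on node (\d+) \(rank (\d+) in comm (\d+)\): Fatal error in (\w+)"
)


def summarize_mpi_log(text):
    """Collect signal and abort events from a crash log (hypothetical helper)."""
    signals = []
    aborts = []
    for line in text.splitlines():
        sig = SIGNAL_RE.search(line)
        if sig:
            signals.append(
                {"node": int(sig.group(1)), "pid": int(sig.group(2)), "signal": sig.group(3)}
            )
        ab = ABORT_RE.search(line)
        if ab:
            aborts.append(
                {"code": int(ab.group(1)), "rank": int(ab.group(3)), "call": ab.group(5)}
            )
    # Count how many aborts occurred in each MPI call (e.g. PMPI_Waitall).
    calls = Counter(a["call"] for a in aborts)
    return {"signals": signals, "aborts": aborts, "calls": calls}


# Lines copied from the excerpt above, shortened to the parts the patterns use.
SAMPLE_LOG = """Node 28: Process 14072: Received signal SIGSEGV.
999999: mpt_accept: error: accept failed: Abort(138030991) on node 31 (rank 31 in comm 0): Fatal error in PMPI_Waitall: Other MPI error, error stack:
Abort(70922127) on node 21 (rank 21 in comm 0): Fatal error in PMPI_Waitall: Other MPI error, error stack:
"""

if __name__ == "__main__":
    summary = summarize_mpi_log(SAMPLE_LOG)
    print("signals:", summary["signals"])
    print("aborted ranks:", [a["rank"] for a in summary["aborts"]])
    print("aborts per MPI call:", dict(summary["calls"]))
```

In the excerpt, the SIGSEGV on node 28 appears before the `PMPI_Waitall` aborts on ranks 31 and 21, so the aborts may be a consequence of the first process crashing rather than independent failures. The log alone does not confirm this.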

    • Nikhil Narale
      Ansys Employee

      Hello, 

       

      Is this a case-specific issue? Could you try a different case (with a different geometry), create a mesh with a cell count close to 9.8 M, and check whether the same error comes up?

       

      Nikhil
