mcuma
Subscriber

Here's the promised further information on the crashes. This time I used a model that I know converges (the "Crankshaft" example, which I got from Rescale but which I figure is one of your examples).

I get the following error when I run on a node with AMD Zen1 CPUs:

OMP: Error #100: Fatal system error detected.
OMP: System error #22: Invalid argument
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
libifcoremt.so.5   00007F14872F1555  for__signal_handl     Unknown  Unknown
libpthread-2.28.s  00007F144EDD9CE0  Unknown               Unknown  Unknown
libc-2.28.so       00007F144C5C0A4F  gsignal               Unknown  Unknown
libc-2.28.so       00007F144C593DB5  abort                 Unknown  Unknown
libiomp5.so        00007F1484DA4B23  Unknown               Unknown  Unknown
libiomp5.so        00007F1484D8FD17  Unknown               Unknown  Unknown
libiomp5.so        00007F1484D310A8  Unknown               Unknown  Unknown
libiomp5.so        00007F1484DE5E57  Unknown               Unknown  Unknown
libiomp5.so        00007F1484D2962D  Unknown               Unknown  Unknown
libiomp5.so        00007F1484D1F119  Unknown               Unknown  Unknown
libiomp5.so        00007F1484D1E68B  Unknown               Unknown  Unknown
libiomp5.so        00007F1484DA3B1F  Unknown               Unknown  Unknown
libiomp5.so        00007F1484D8698E  omp_get_num_procs     Unknown  Unknown
libansOpenMP.so    00007F146B886CEC  ppinit_               Unknown  Unknown
libansys.so        00007F147287D7EC  smpstart_             Unknown  Unknown
ansys.e            00000000004113F0  Unknown               Unknown  Unknown
ansys.e            000000000040EE28  MAIN__                Unknown  Unknown
ansys.e            000000000040ED22  main                  Unknown  Unknown
libc-2.28.so       00007F144C5ACCA3  __libc_start_main     Unknown  Unknown
ansys.e            000000000040EC39  Unknown               Unknown  Unknown
/uufs/chpc.utah.edu/sys/installdir/ansys/22.2/v222/ansys/bin/ansysdis222: line 77: 2638867 Aborted                 (core dumped) /uufs/chpc.utah.edu/sys/installdir/ansys/22.2/v222/ansys/bin/linx64/ansys.e -b nolist -s noread -i "dummy.dat" -o "solve.out" -dis -p ansys

This looks like an issue with the Intel OpenMP library, but since it only occurs with the distributed solver and not the shared-memory one, I suspect it may be due to some interaction with the Intel MPI that drives the distributed run. We have had quite a few issues with Intel MPI on Rocky 8: some versions work and some don't, and in the versions that work we need to set FI_FABRICS=verbs explicitly.
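The trace above aborts inside omp_get_num_procs, called from ppinit_, so the OpenMP runtime fails while it is still working out which processors it has. One thing that could be compared between an AMD node and an Intel node is what a process launched the same way actually sees. Below is a minimal Python sketch for that. It assumes the failure might be related to CPU affinity or to runtime environment settings, which is only a guess and is not confirmed by anything in the trace. The function names cpu_view and relevant_env are made up for illustration and are not part of any Ansys or Intel tool.

```python
import os


def cpu_view():
    """Return (logical CPU count, sorted list of CPUs this process may run on)."""
    total = os.cpu_count()
    # os.sched_getaffinity exists only on Linux; report None elsewhere.
    if hasattr(os, "sched_getaffinity"):
        allowed = sorted(os.sched_getaffinity(0))
    else:
        allowed = None
    return total, allowed


def relevant_env(prefixes=("KMP_", "OMP_", "I_MPI_", "FI_")):
    """Collect environment variables that commonly steer OpenMP, Intel MPI and libfabric.

    The prefix list is an assumption about what is worth inspecting,
    not a complete list of everything that affects the run.
    """
    return {k: v for k, v in sorted(os.environ.items()) if k.startswith(prefixes)}


if __name__ == "__main__":
    total, allowed = cpu_view()
    print(f"logical CPUs reported: {total}")
    if allowed is None:
        print("affinity mask: not available on this platform")
    else:
        print(f"CPUs in affinity mask: {len(allowed)} -> {allowed}")
    for name, value in relevant_env().items():
        print(f"{name}={value}")
```

Running this in the same batch job and environment as the solver on both node types would show whether the affinity mask or the environment differs between them. It does not fix anything by itself.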

I am wondering if there's a way to tell the Mechanical IDE to use OpenMPI instead of Intel MPI. I know there's a command-line option for that, but I can't find whether the IDE can set it.

The same example runs fine on nodes with Intel CPUs.

So, I think we are more or less good. I'll instruct the user to stay on the Intel nodes and to fix their model to improve convergence, and I'll reiterate my encouragement to support Rocky Linux in future Ansys releases, on both Intel and AMD CPUs.